
Mark L. Berenson David M. Levine Kathryn A. Szabat Martin O’Brien Nicola Jayne Judith Watson - Basic Business Statistics Concepts and applications (Australasian and Pacific edition)-Pearson Aust

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e
5TH EDITION
Basic Business Statistics
Concepts and applications
Berenson Levine Szabat
O’Brien Jayne Watson
Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019
Pearson Australia
707 Collins Street
Melbourne VIC 3008
www.pearson.com.au
Authorised adaptation from the United States edition entitled Basic Business Statistics, 13th edition, ISBN 0321870026 by Berenson,
Mark L., Levine, David M., Szabat, Kathryn A., published by Pearson Education, Inc., Copyright © 2015.
Fifth adaptation edition published by Pearson Australia Group Pty Ltd, Copyright © 2019
The Copyright Act 1968 of Australia allows a maximum of one chapter or 10% of this book, whichever is the greater, to be copied by
any educational institution for its educational purposes provided that that educational institution (or the body that administers it) has given a
remuneration notice to Copyright Agency Limited (CAL) under the Act. For details of the CAL licence for educational institutions contact:
Copyright Agency Limited, telephone: (02) 9394 7600, email: info@copyright.com.au
All rights reserved. Except under the conditions described in the Copyright Act 1968 of Australia and subsequent amendments, no part of
this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without the prior permission of the copyright owner.
Portfolio Manager: Rebecca Pedley
Development Editor: Anna Carter
Project Managers: Anubhuti Harsh and Keely Smith
Production Manager: Julie Ganner
Product Manager: Sachin Dua
Content Developer: Victoria Kerr
Rights and Permissions Team Leader: Lisa Woodland
Lead Editor/Copy Editor: Julie Ganner
Proofreader: Katy McDevitt
Indexer: Garry Cousins
Cover and internal design by Natalie Bowra
Cover photograph © kireewong foto/Shutterstock
Typeset by iEnergizer Aptara®, Ltd
Printed in Malaysia
ISBN 9781488617249
1 2 3 4 5 23 22 21 20 19
Pearson Australia Group Pty Ltd ABN 40 004 245 943
brief contents
Preface
Acknowledgements
How to use this book
About the authors
PART 1 PRESENTING AND DESCRIBING INFORMATION
1 Defining and collecting data
2 Organising and visualising data
3 Numerical descriptive measures
PART 2 MEASURING UNCERTAINTY
4 Basic probability
5 Some important discrete probability distributions
6 The normal distribution and other continuous distributions
7 Sampling distributions
PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION
8 Confidence interval estimation
9 Fundamentals of hypothesis testing: One-sample tests
10 Hypothesis testing: Two-sample tests
11 Analysis of variance
PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS
12 Simple linear regression
13 Introduction to multiple regression
14 Time-series forecasting and index numbers
15 Chi-square tests
ONLINE CHAPTERS
PART 5 FURTHER TOPICS IN STATS
16 Multiple regression model building
17 Decision making
18 Statistical applications in quality management
19 Further non-parametric tests
20 Business analytics
21 Data analysis: The big picture
Appendices A to F
Glossary
Index
detailed contents
Preface
Acknowledgements
How to use this book
About the authors
PART 1 PRESENTING AND DESCRIBING INFORMATION
1 Defining and collecting data
1.1 Basic concepts of data and statistics
1.2 Types of variables
1.3 Collecting data
1.4 Types of survey sampling methods
1.5 Evaluating survey worthiness
1.6 The growth of statistics and information technology
Summary
Key terms
References
Chapter review problems
Continuing cases
Chapter 1 Excel Guide
2 Organising and visualising data
2.1 Organising and visualising categorical data
2.2 Organising numerical data
2.3 Summarising and visualising numerical data
2.4 Organising and visualising two categorical variables
2.5 Visualising two numerical variables
2.6 Business analytics applications – descriptive analytics
2.7 Misusing graphs and ethical issues
Summary
Key terms
References
Chapter review problems
Continuing cases
Chapter 2 Excel Guide
3 Numerical descriptive measures
3.1 Measures of central tendency, variation and shape
3.2 Numerical descriptive measures for a population
3.3 Calculating numerical descriptive measures from a frequency distribution
3.4 Five-number summary and box-and-whisker plots
3.5 Covariance and the coefficient of correlation
3.6 Pitfalls in numerical descriptive measures and ethical issues
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 3 Excel Guide
End of Part 1 problems
PART 2 MEASURING UNCERTAINTY
4 Basic probability
4.1 Basic probability concepts
4.2 Conditional probability
4.3 Bayes’ theorem
4.4 Counting rules
4.5 Ethical issues and probability
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 4 Excel Guide
5 Some important discrete probability distributions
5.1 Probability distribution for a discrete random variable
5.2 Covariance and its application in finance
5.3 Binomial distribution
5.4 Poisson distribution
5.5 Hypergeometric distribution
Summary
Key formulas
Key terms
Chapter review problems
Chapter 5 Excel Guide
6 The normal distribution and other continuous distributions
6.1 Continuous probability distributions
6.2 The normal distribution
6.3 Evaluating normality
6.4 The uniform distribution
6.5 The exponential distribution
6.6 The normal approximation to the binomial distribution
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 6 Excel Guide
7 Sampling distributions
7.1 Sampling distributions
7.2 Sampling distribution of the mean
7.3 Sampling distribution of the proportion
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 7 Excel Guide
End of Part 2 problems
PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION
8 Confidence interval estimation
8.1 Confidence interval estimation for the mean (σ known)
8.2 Confidence interval estimation for the mean (σ unknown)
8.3 Confidence interval estimation for the proportion
8.4 Determining sample size
8.5 Applications of confidence interval estimation in auditing
8.6 More on confidence interval estimation and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 8 Excel Guide
9 Fundamentals of hypothesis testing: One-sample tests
9.1 Hypothesis-testing methodology
9.2 Z test of hypothesis for the mean (σ known)
9.3 One-tail tests
9.4 t test of hypothesis for the mean (σ unknown)
9.5 Z test of hypothesis for the proportion
9.6 The power of a test
9.7 Potential hypothesis-testing pitfalls and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 9 Excel Guide
10 Hypothesis testing: Two-sample tests
10.1 Comparing the means of two independent populations
10.2 Comparing the means of two related populations
10.3 F test for the difference between two variances
10.4 Comparing two population proportions
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 10 Excel Guide
11 Analysis of variance
11.1 The completely randomised design: One-way analysis of variance
11.2 The randomised block design
11.3 The factorial design: Two-way analysis of variance
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 11 Excel Guide
End of Part 3 problems
PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS
12 Simple linear regression
12.1 Types of regression models
12.2 Determining the simple linear regression equation
12.3 Measures of variation
12.4 Assumptions
12.5 Residual analysis
12.6 Measuring autocorrelation: The Durbin–Watson statistic
12.7 Inferences about the slope and correlation coefficient
12.8 Estimation of mean values and prediction of individual values
12.9 Pitfalls in regression and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 12 Excel Guide
13 Introduction to multiple regression
13.1 Developing the multiple regression model
13.2 R², adjusted R² and the overall F test
13.3 Residual analysis for the multiple regression model
13.4 Inferences concerning the population regression coefficients
13.5 Testing portions of the multiple regression model
13.6 Using dummy variables and interaction terms in regression models
13.7 Collinearity
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 13 Excel Guide
14 Time-series forecasting and index numbers
14.1 The importance of business forecasting
14.2 Component factors of the classical multiplicative time-series model
14.3 Smoothing the annual time series
14.4 Least-squares trend fitting and forecasting
14.5 The Holt–Winters method for trend fitting and forecasting
14.6 Autoregressive modelling for trend fitting and forecasting
14.7 Choosing an appropriate forecasting model
14.8 Time-series forecasting of seasonal data
14.9 Index numbers
14.10 Pitfalls in time-series forecasting
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 14 Excel Guide
15 Chi-square tests
15.1 Chi-square test for the difference between two proportions (independent samples)
15.2 Chi-square test for differences between more than two proportions
15.3 Chi-square test of independence
15.4 Chi-square goodness-of-fit tests
15.5 Chi-square test for a variance or standard deviation
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 15 Excel Guide
End of Part 4 problems
PART 5 (ONLINE) FURTHER TOPICS IN STATS
16 Multiple regression model building
16.1 Quadratic regression model
16.2 Using transformations in regression models
16.3 Influence analysis
16.4 Model building
16.5 Pitfalls in multiple regression and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 16 Excel Guide
17 Decision making
17.1 Payoff tables and decision trees
17.2 Criteria for decision making
17.3 Decision making with sample information
17.4 Utility
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 17 Excel Guide
18 Statistical applications in quality management
18.1 Total quality management
18.2 Six Sigma management
18.3 The theory of control charts
18.4 Control chart for the proportion – The p chart
18.5 The red bead experiment – Understanding process variability
18.6 Control chart for an area of opportunity – The c chart
18.7 Control charts for the range and the mean
18.8 Process capability
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 18 Excel Guide
19 Further non-parametric tests
19.1 McNemar test for the difference between two proportions (related samples)
19.2 Wilcoxon rank sum test – Non-parametric analysis for two independent populations
19.3 Wilcoxon signed ranks test – Non-parametric analysis for two related populations
19.4 Kruskal–Wallis rank test – Non-parametric analysis for the one-way ANOVA
19.5 Friedman rank test – Non-parametric analysis for the randomised block design
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 19 Excel Guide
20 Business analytics
20.1 Predictive analytics
20.2 Classification and regression trees
20.3 Neural networks
20.4 Cluster analysis
20.5 Multidimensional scaling
Key formulas
Key terms
References
Chapter review problems
Chapter 20 Software Guide
21 Data analysis: The big picture
21.1 Analysing numerical variables
21.2 Analysing categorical variables
21.3 Predictive analytics
Chapter review problems
End of Part 5 problems
Appendices A to F
Glossary
Index
preface
This fifth Australasian and Pacific edition of Basic Business Statistics: Concepts and Applications
continues to build on the strengths of the fourth edition, and extends the outstanding teaching
foundation of the previous American editions, authored by Berenson, Levine and Szabat.
The teaching philosophy of this text is based upon the principles of the American book, but
each chapter has once again been carefully revised to include practical examples and a language and style that is more applicable to Australasian and Pacific readers.
In preparation for this edition we again asked lecturers from around the country to comment on
the format and content of the fourth edition and, based on those comments, the authors have
worked to create a text that is more accessible – but no less authoritative – for students.
Part 5 contains additional chapters: Chapter 16 on multiple regression and model building,
Chapter 17 on decision making, Chapter 18 on statistical applications in quality and productivity management, Chapter 19 on further non-parametric tests and two brand new chapters:
Chapter 20 on business analytics and Chapter 21 on data analysis. Chapter 21 will be especially useful to students who wish to understand how the concepts and techniques studied in
this book all fit together. The Part 5 chapters can be found within the MyLab and student download page via our catalogue.
Chapter 21 (including Figure 21.1, which provides a summary of the contents of this book
arranged by data-analysis task) is designed to provide guidance in choosing appropriate statistical techniques for data-analysis questions arising in business or elsewhere. Figure 21.1, and
Chapter 21, should be referred to when working through the earlier chapters of this book. This
should enable students to see connections between topics; that is, the big picture.
The new edition has continued with a ‘real-world’ focus, to take students beyond the pure
theory. Some chapters have a completely new opening scenario, focusing on a person or company, which serves to introduce key concepts covered in the chapter. The scenario is interwoven throughout the chapter to reinforce the concepts to the student. Multiple in-chapter
examples have been updated to highlight real Australasian and Pacific data.
The Real people, real stats feature that opens each of the text’s five parts is composed
of a personal interview highlighting how real people in real business situations apply the principles of statistics to their jobs. The interviewees are:
Part 1: David McCourt, BDO
Part 2: Ellouise Roberts, Deloitte Access Economics
Part 3: Rod Battye, Tourism Research Australia
Part 4: Gautam Gangopadhyay, Endeavour Energy
Part 5: Deborah O’Mara, The University of Sydney
Judith Watson
Nicola Jayne
Martin O’Brien
acknowledgements
When developing the new edition of Basic Business Statistics, we were mindful of retaining the
strengths of the current edition, but also of the need to build on those strengths, to enhance the
text and to ensure wider reader appeal and useability.
We are indebted to the following academics who contributed to the new edition.
Technical Editor
We would like to thank Martin Firth at UWA for carrying out a detailed technical edit of the text.
Reviewers
Ms Gerrie Roberts Monash University
Dr Sonika Singh University of Technology Sydney
Dr Erick Li University of Sydney
Dr Amir Arjomandi University of Wollongong
Mr Jason Hay Queensland University of Technology
Mr Martin J Firth University of Western Australia
Dr Scott Salzman Deakin University
Ms Charanjit Kaur Monash University
Dr Jill Wright Monash University
The enormous task of writing a book of this scope was possible only with the expert assistance
of all these friends and colleagues and that of the editorial and production staff at Pearson
Australia. We gratefully acknowledge their invaluable contributions at every stage of this project, collectively and, now, individually. We thank the following people at Pearson Australia:
Rebecca Pedley, Portfolio Manager; Anna Carter, Development Editor; Julie Ganner, Production
Manager and Copy Editor; and Lisa Woodland, Rights & Permissions Team Leader.
how to use this book
Real people, real stats interviews open each part. These introduce real people
working in real business environments, using statistics to tackle real business
challenges.
PART 1 Presenting and describing information
Real People, Real Stats
David McCourt BDO
Learning objectives introduce you to the key
concepts to be covered in each chapter, and are
signposted in the margins where they are covered
within the chapter.
Which company are you currently working for and what are some of your responsibilities?
I work at BDO, Chartered Accountants and Advisors, in the corporate finance team. My primary
responsibilities include the preparation of financial models and valuation reports.
List five words that best describe your personality.
Affable, level-headed, perceptive, analytical, assured (according to my colleagues).
What are some things that motivate you?
Success, working with a team, client satisfaction.
When did you first become interested in statistics?
I never really understood statistics at school and it was a minor part of my university degree. However,
statistics play a significant role in many of our valuations, including discounted cash flow valuations
and share option valuations.
Complete the following sentence. A world without statistics …
… is not worth thinking about.
LET’S TALK STATS
What do you enjoy most about working in statistics?
We use data services and statistical tools that have been created by third parties. I can use, and talk
reasonably knowledgeably about, statistical data without being an expert.
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the types of data used in business
2 identify how statistics is used in business
3 recognise the sources of data used in business
4 distinguish between different survey sampling methods
5 evaluate the quality of surveys
Chapter-opening scenarios show how statistics are used in everyday life. The scenarios
introduce the concepts to be covered, showing the relevance of using particular statistical
techniques. The problem is woven throughout each chapter, showing the connection
between statistics and their use in business, as well as keeping you motivated.
CHAPTER 1 Defining and collecting data
THE HONG KONG AIRPORT SURVEY
Not so long ago, business students were unfamiliar with the word data and had little experience
handling data. Today, every time you visit a search engine website or ‘ask’ your mobile device
a question, you are handling data. And if you ‘check in’ to a location or indicate that you ‘like’
something, you are creating data as well.
You accept as almost true the premises of stories in which characters collect ‘a lot of data’
to uncover conspiracies, foretell disasters or catch a criminal.
You hear concerns about how the government or business might be able to ‘spy’ on you in
some way or how large social media companies ‘mine’ your personal data for profit.
You hear the word data everywhere and may even have a ‘data plan’ for your smartphone.
You know, in a general way, that data are facts about the world and that most data seem to be,
ultimately, a set of numbers – that 34% of students recently polled prefer using a certain Internet browser, or that 50% of citizens believe the country is headed in the right direction, or that
unemployment is down 3%, or that your best friend’s social media account has 835 friends and
202 recent posts.
You cannot escape from data in this digital world. What, then, should you do? You could
try to ignore data and conduct business by relying on hunches or your ‘gut instincts’. However,
if you want to use only gut instincts, then you probably shouldn’t be reading this book or taking
business courses in the first place.
You could note that there is so much data in the world – or just in your own little part of the
world – that you couldn’t possibly get a handle on it.
You could accept other people’s data summaries and their conclusions without first reviewing the data yourself. That, of course, would expose you to fraudulent practices.
Or you could do things the proper way and realise the benefits of learning the methods of
statistics, the subject of this book. You can learn, though, the procedures and methods that will
help you make better decisions based on solid evidence. When you begin focusing on the procedures and methods involved in collecting, presenting and summarising a set of data, or forming conclusions about those data, you have discovered statistics.
In the Hong Kong Airport survey scenario it is important that research team members
focus on the information that is needed by many different stakeholders when planning for
future business and tourist visitors. If the research team fails to collect important information,
or misrepresents the opinions of current visitors, stakeholders may make poor decisions about
advertising, pricing, facilities and other factors relevant to attracting visitors and hosting them
in Hong Kong. Failure to offer suitable facilities and experiences could affect the profitability
of businesses in Hong Kong. In deciding how to collect the facts that are needed, it will help if
you know something about the basic concepts of statistics.
You are departing Hong Kong International Airport on the next leg of your trip and have
cleared Immigration. You are approached by a researcher holding a tablet computer
who asks if you can answer a few questions. The first question determines if you are a
visitor to Hong Kong or a resident. After establishing that you are a visitor the questions go on
to determine the purpose of your visit, the name of your hotel, the activities you have undertaken
and much additional information about your visit.
This information is useful for a tourism authority that has the task of marketing Hong Kong as a
travel destination and monitoring the quality of visitors’ experiences in the city. It may also
inform the authority’s government and commercial stakeholders, who provide transport, accommodation, and food and shopping for visitors, and be used for forward planning.
© Jungyeol & Mina/age fotostock
Data sets and Excel workbooks that accompany
the text can be downloaded and used to answer
the appropriate questions.
2.1 ORGANISING AND VISUALISING CATEGORICAL DATA
What type of chart should you use? The selection of a chart depends on your intention. If a
comparison of categories is most important, use a bar chart. If observing the portion of the
whole that lies in a particular category is most important, use a pie chart. There should be no
more than eight categories or slices in a pie chart. If there are more than eight, merge the
smaller categories into a category called ‘other’.
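The eight-slice guideline above can be sketched in a few lines of Python. This is a minimal illustration, not the book's Excel procedure: the function name `merge_small_categories`, its parameters and the choice to keep the largest categories are all assumptions made for the example. The percentages are the ones shown on the Figure 2.3 sample page.

```python
def merge_small_categories(shares, max_slices=8, other_label="Other"):
    """Return a copy of `shares` with at most `max_slices` entries.

    The largest (max_slices - 1) categories are kept and the remainder
    is summed into `other_label`. Assumes `other_label` is not already
    one of the category names (hypothetical helper, for illustration).
    """
    if len(shares) <= max_slices:
        return dict(shares)
    # Rank categories from largest to smallest share.
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[: max_slices - 1])
    kept[other_label] = sum(value for _, value in ranked[max_slices - 1:])
    return kept


# Percentages from the Figure 2.3 sample page (seven categories).
reasons = {
    "Comfortable environment": 8,
    "Variety/range of products": 10,
    "Competitive prices": 20,
    "Quality products": 18,
    "Products well displayed": 3,
    "Convenience": 28,
    "Customer service": 13,
}

# Seven categories fit within the default limit of eight, so nothing changes.
unchanged = merge_small_categories(reasons)

# A stricter limit of five slices, used here only to show the merge.
merged = merge_small_categories(reasons, max_slices=5)
```

With the default limit the seven categories pass through untouched. The stricter five-slice call keeps the four largest reasons and folds the three smallest into a single 'Other' slice.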
Figure 2.3 Microsoft Excel pie chart of the reasons for grocery shopping online
[Pie chart – reasons for grocery shopping online: Comfortable environment 8%; Variety/range of products 10%; Competitive prices 20%; Quality products 18%; Products well displayed 3%; Convenience 28%; Customer service 13%]
Real world, business examples are included throughout the chapter.
These are designed to show the multiple applications of statistics, while
helping you to learn the statistics techniques.
Emphasis on data output and interpretation
The authors believe that the use of computer software is an integral part
of learning statistics. Our focus emphasises analysing data by
interpreting the output from Microsoft Excel while reducing emphasis on
doing calculations. Excel 2016 changes to statistical functions are
reflected in the operations shown in this edition.
In the coverage of hypothesis testing in Chapters 9 to 11, extensive
computer output is included so that the focus can be placed on the
p-value approach. In our coverage of simple linear regression in
Chapter 12, we assume that a software program will be used and our
focus is on interpretation of the output, not on hand calculations.
Summaries are provided at the end of each chapter, to help you review
the key content.
Key terms are signposted in the margins when they are first introduced,
and are referenced to page numbers at the end of each chapter, helping
you to revise key terms and concepts for the chapter.
End-of-section problems are divided into Learning the basics and
Applying the concepts.
EXAMPLE 2.3 PIE CHART FOR FAMILY TYPE
Use the summary tables given for family type in < DEMOGRAPHIC_INFORMATION > to construct
and interpret pie charts for the capital city and the council area.
Figure 2.4 Microsoft Excel pie chart for family type
[Two pie charts, council area and capital city, each with slices: Couple with children; Couple no children; One parent; Other]
End-of-part problems challenge the student to make decisions about
the appropriate technique to apply, to carry out that technique and to
interpret the data meaningfully.*
Australasian and Pacific data sets are used for the problems in each
chapter. These files are contained on the Pearson website.
Ethical issues sections are integrated into many chapters, raising
issues for ethical consideration.
674 CHAPTER 16 MULTIPLE REGRESSION MODEL BUILDING
End of PART 1 PRoblEMs 139
16 Assess your progress
End of Part 1 problems
A.1
Summary
In this chapter, various multiple regression topics were considered
(see Figure 16.15) including quadratic regression models, interactions,
transformations square root and log transformations. A number of
criteria were presented to examine the influence of each individual
observation on the results. In addition, the best subsets and stepwise
regression approaches to model building were detailed.
You have learned how suburban ratings can be used to derive
a measure of income distribution. You also learned how a director of
operations at a television station could build a multiple regression
model as an aid to reducing labour expenses.
Enjoy shopping for clothing
Yes
No
Total
Key formulas
The quadratic regression model
Transformed exponential model
Yi = β0 + β1X1i + β2 X 21i + εi (16.1)
ln Yi = ln( eβ0+β1 X 1i +β2 X 2i εi )
= ln( e
Quadratic regression equation
t i = ei
X1i + εi (16.3)
Original multiplicative model
Yi =
b0 X 1ib1 X 2ib2
n – k –1
SSE (1 – hi ) – ei2
(16.8)
Di =
Transformed multiplicative model
log Yi = log(β0 X 1βi 1 X 2βi2 εi )
(16.5)
β
= log β0 + log( X 1i 1 ) + log( X 2i2 ) + log εi
ei2
hi
k MSE (1 – hi ) 2
54
11
12
13
33
(16.9)
The Cp statistic
Cp =
= log β0 + β1 log X 1i + β2 log X 2i + log εi
(1 – Rk2 )(n – T )
1–
RT2
– [ n – 2( k + 1)] (16.10)
Yi = e
εi
(16.6)
Key terms
best-subsets approach
Cook’s Di statistic
Cp statistic
cross-validation
M16_BERE7249_05_SE_C16.indd 674
665
662
667
672
data mining
hat matrix diagonal elements hi
logarithmic transformation
parsimony
665
661
658
663
Gender
Female
224
36
260
a. Construct contingency tables based on total percentages,
row percentages and column percentages.
b. Construct a side-by-side bar chart of enjoy shopping for
clothing based on gender.
c. What conclusions do you draw from these analyses?
One of the major measures of the quality of service provided by
any organisation is the speed with which the organisation
responds to customer complaints. A large family-owned
department store selling furniture and flooring, including
carpet, has undergone major expansion in the past few
years. In particular, the flooring department has expanded
from two installation crews to an installation supervisor,
a measurer and 15 installation crews. During a recent
year the company got 50 complaints about carpet installation.
The following data represent the number of days between
receipt of the complaint and resolution of the complaint.
quadratic regression model
square-root transformation
stepwise regression
Studentised deleted residual
5
19
4
10
68
35
126
165
5
137
110
32
27
31
110
29
4
27
29
28
52
152
61
29
30
2
35
26
22
123
94
25
36
81
31
1
26
74
26
14
20
651
657
663
661
The annual crediting rates (after tax and fees) on several
managed superannuation investment funds between 2013 and
2017 are:
Superannuation fund
Conservative
Balanced
Growth
High growth
Total
360
140
500
27
5
13
23
a. Construct frequency and percentage distributions.
b. Construct histogram and percentage polygons.
c. Construct a cumulative percentage distribution and plot the
corresponding ogive.
d. Calculate the mean, median, first quartile and third
quartile.
e. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
f. Construct a box-and-whisker plot. Are the data skewed? If
so, how?
g. On the basis of the results of (a) to (f), if you had to report
to the manager on how long a customer should expect to
wait to have a complaint resolved, what would you say?
Explain.
Original exponential model
β0+β1 X 1i +β2 X 2i
Male    136    104    240
A.3
A.4
Historical crediting rate for year ending
30 June, %
2016
2015
2014
2013
8.7
9.0
11.3
12.3
5.2
10.7
14.1
15.9
3.8
11.3
15.6
18.7
3.1
12.3
17.4
20.5
2017
5.5
9.5
11.8
13.7
a. For each fund, calculate the geometric rate of return for
three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric
rates of return for the funds?
A supplier of ‘Natural Australian’ spring water states that the
magnesium content is 1.6 mg/L. To check this, the quality
control department takes a random sample of 96 bottles
during a day’s production and obtains the magnesium content.
< SPRING_WATER1 >
< FURNITURE >
Cook’s Di statistic
εi (16.4)
β
) + ln εi
Studentised deleted residual
Regression model with a square-root transformation
Yi = b0 + b1
β0+β1 X 1i +β2 X 2i
A.2
(16.7)
= β0 + β1X 1i + β 2 X 2i + ln εi
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{1i}^2$ (16.2)
A sample of 500 shoppers was selected in a large
metropolitan area to obtain consumer behaviour
information. Among the questions asked was, ‘Do you enjoy
shopping for clothing?’ The results are summarised in the
following cross-classification table.
A.5
A.6
a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the
corresponding ogive.
d. Calculate the mean, median, mode, first quartile and third
quartile.
e. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
f. Construct and interpret a box-and-whisker plot.
g. What conclusions can you reach concerning the magnesium
content of this day’s production?
The National Australia Bank (NAB) produces regular reports
titled NAB Online Retail Sales Index <www.business.nab.
com.au>. Download the latest in-depth report.
a. Give an example of a categorical variable found in the
report.
b. Give an example of a numerical variable found in the
report.
c. Is the variable you selected in (b) discrete or
continuous?
The data in the file < WEBSTATS > represent the number
of times during August and September that a sample
of 50 students accessed the website of a statistics
unit they were enrolled in.
a. Construct ordered arrays for August and September.
b. Construct stem-and-leaf displays for August and
September.
c. Construct frequency, percentage and cumulative
distributions for August and September.
*The solutions are calculated using the (raw) Excel output. If you use the rounded figures presented in the text to reproduce
these answers there may be minor differences.
Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e
MyLab Statistics
a guided tour for students and educators
Study Plan
A study plan is generated from
each student’s results on a
pre-test. Students can clearly
see which topics they have
mastered and, more
importantly, which they need
to work on.
Unlimited Practice
Each MyLab Statistics comes
with preloaded assignments,
including select end-of-chapter questions, all of which
are automatically graded.
Many study plan and
educator-assigned exercises
contain algorithmically
generated values to ensure
students get as much practice
as they need.
As students work through
study plan or homework
exercises, instant feedback
and tutorial resources guide
them towards understanding.
Learning Resources
To further reinforce
understanding, study plan and
homework problems link to
the following learning
resources:
• eText linked to sections for
all study plan questions
• Help Me Solve This, which
walks students through the
problem with step-by-step
help and feedback without
giving away the answer
• StatCrunch.
StatTalk Videos
Fun-loving statistician Andrew
Vickers takes to the streets of
Brooklyn, New York to
demonstrate important
statistical concepts through
interesting stories and real-life
events. This series of videos
and corresponding auto-graded questions will help
students to understand
statistics.
EDUCATOR RESOURCES
A suite of resources is provided to assist with delivery of the text, as well as to support teaching
and learning.
Solutions Manual
The Solutions Manual provides educators with detailed, accuracy-verified solutions to all the
in-chapter and end-of-chapter problems in the book.
Test Bank
The Test Bank provides a wealth of accuracy-verified testing material. Updated for the new edition, each chapter offers a wide variety of true/false and multiple-choice questions, arranged
by learning objective and tagged by AACSB standards. Questions can be integrated into Blackboard, Canvas or Moodle Learning Management Systems.
PowerPoint lecture slides
A comprehensive set of PowerPoint slides can be used by educators for class presentations or
by students for lecture preview or review. They include key figures and tables, as well as a
summary of key concepts and examples from the text.
Digital image PowerPoint slides
All the diagrams and tables from the text are available for lecturer use.
about the authors
Judith Watson
Judith Watson teaches in the Business School at UNSW Australia. She has extensive
experience in lecturing and administering undergraduate and postgraduate Quantitative Methods courses.
Judith’s keen interest in student support led her to establish the Peer Assisted Support
Scheme (PASS) in 1996 and she has coordinated this program for many years. She
served as her faculty’s academic adviser from 2001 to 2004. Judith has been the
recipient of a number of awards for teaching. She received the inaugural Australian
School of Business Outstanding Teaching Innovations Award in 2008 and the 2012 Bill
Birkett Award for Teaching Excellence. She also won the UNSW Vice Chancellor’s
Award for Teaching Excellence in 2012 and a Citation of Outstanding Contributions to
Student Learning from the Australian Government’s Office for Learning and Teaching in
2013. Judith is interested in using online learning technology to engage students and
has created a number of adaptive e-learning tutorials for mathematics and statistics
and cartoon-style videos to explain statistical concepts.
Dr Nicola Jayne
Nicola Jayne is a lecturer in the Southern Cross Business School at the Lismore campus of Southern Cross University. She has been teaching quantitative units since being
appointed to the university in 1993 after several years at Massey University in New
Zealand. Nicola has lectured extensively in Business and Financial Mathematics, Discrete Mathematics and Statistics, both undergraduate and postgraduate, as well as
various Pure Mathematics units.
Nicola’s academic qualifications from Massey University include a Bachelor of Science
(majors in Mathematics and Statistics), a Bachelor of Science with Honours (first class)
and a Doctor of Philosophy, both in Mathematics. Nicola also has a Graduate Certificate in Higher Education (Learning & Teaching) from Southern Cross University. She
was the recipient of a Vice Chancellor’s Citation for an Outstanding Contribution to
Student Learning in 2011.
Dr Martin O’Brien
Dr Martin O’Brien is a senior lecturer in economics, Director of the Centre for Human
and Social Capital Research, and Director of the MBA program in the Sydney Business
School, University of Wollongong. Martin earned his Bachelor of Commerce (first-class honours) and PhD in Economics at the University of Newcastle. His PhD and
subsequent published research is in the general area of labour economics, and specifically the exploration of older workers’ labour force participation in Australia in the
context of an ageing society. Martin has been an expert witness for a number of Fair
Work Commission cases, providing statistical analyses of the effects of penalty
rates, workforce casualisation and family and domestic violence leave.
Martin has taught a wide range of quantitative subjects at university level, including
business statistics, business analytics, quantitative analysis for decision making, econometrics, financial modelling and business research methods. He also has a keen interest in learning analytics and the development and analysis of new teaching technologies.
about the originating authors
Mark L. Berenson is Professor of Management and Information Systems at Montclair State
University (Montclair, New Jersey) and also Professor Emeritus of Statistics and Computer
Information Systems at Bernard M. Baruch College (City University of New York). He currently
teaches graduate and undergraduate courses in statistics and in operations management in
the School of Business and an undergraduate course in international justice and human rights
that he co-developed in the College of Humanities and Social Sciences.
Berenson received a BA in economic statistics, an MBA in business statistics from City College
of New York and a PhD in business from the City University of New York. His research has been
published in Decision Sciences Journal of Innovative Education, Review of Business Research,
The American Statistician, Communications in Statistics, Psychometrika, Educational and Psychological Measurement, Journal of Management Sciences and Applied Cybernetics, Research
Quarterly, Stats Magazine, The New York Statistician, Journal of Health Administration Education, Journal of Behavioral Medicine and Journal of Surgical Oncology. His invited articles have
appeared in The Encyclopedia of Measurement & Statistics and Encyclopedia of Statistical
Sciences. He is co-author of 11 statistics texts published by Prentice Hall, including Statistics
for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications and
Business Statistics: A First Course.
Over the years, Berenson has received several awards for teaching and for innovative contributions to statistics education. In 2005, he was the first recipient of the Catherine A. Becker Service for Educational Excellence Award at Montclair State University and, in 2012, he was the
recipient of the Khubani/Telebrands Faculty Research Fellowship in the School of Business.
David M. Levine is Professor Emeritus of Statistics and Computer Information Systems at
Baruch College (City University of New York). He received BBA and MBA degrees in statistics
from City College of New York and a PhD from New York University in industrial engineering and
operations research. He is nationally recognised as a leading innovator in statistics education
and is the co-author of 14 books, including such best-selling statistics textbooks as Statistics
for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications,
Business Statistics: A First Course and Applied Statistics for Engineers and Scientists Using
Microsoft Excel and Minitab.
He also is the co-author of Even You Can Learn Statistics: A Guide for Everyone Who Has Ever
Been Afraid of Statistics (currently in its second edition), Six Sigma for Green Belts and Champions and Design for Six Sigma for Green Belts and Champions, and the author of Statistics for
Six Sigma Green Belts, all published by FT Press, a Pearson imprint, and Quality Management,
third edition, published by McGraw-Hill/Irwin. He is also the author of Video Review of Statistics
and Video Review of Probability, both published by Video Aided Instruction, and the statistics
module of the MBA primer published by Cengage Learning. He has published articles in various
journals, including Psychometrika, The American Statistician, Communications in Statistics,
Decision Sciences Journal of Innovative Education, Multivariate Behavioral Research, Journal
of Systems Management, Quality Progress and The American Anthropologist, and he has given
numerous talks at the Decision Sciences Institute (DSI), American Statistical Association (ASA)
and Making Statistics More Effective in Schools and Business (MSMESB) conferences. Levine
has also received several awards for outstanding teaching and curriculum development from
Baruch College.
Kathryn A. Szabat is Associate Professor and Chair of Business Systems and Analytics at
LaSalle University. She teaches undergraduate and graduate courses in business statistics and
operations management.
Szabat’s research has been published in International Journal of Applied Decision Sciences,
Accounting Education, Journal of Applied Business and Economics, Journal of Healthcare Management and Journal of Management Studies. Scholarly chapters have appeared in Managing
Adaptability, Intervention, and People in Enterprise Information Systems; Managing, Trade,
Economies and International Business; Encyclopedia of Statistics in Behavioral Science; and
Statistical Methods in Longitudinal Research.
Szabat has provided statistical advice to numerous business, non-business and academic
communities. Her more recent involvement has been in the areas of education, medicine and
non-profit capacity building.
Szabat received a BS in mathematics from State University of New York at Albany and MS and
PhD degrees in statistics, with a cognate in operations research, from the Wharton School of
the University of Pennsylvania.
PART 1
Presenting and describing information
Real People, Real Stats
David McCourt BDO
Which company are you currently working for and what are some of your responsibilities?
I work at BDO, Chartered Accountants and Advisors, in the corporate finance team. My primary
responsibilities include the preparation of financial models and valuation reports.
List five words that best describe your personality.
Affable, level-headed, perceptive, analytical, assured (according to my colleagues).
What are some things that motivate you?
Success, working with a team, client satisfaction.
When did you first become interested in statistics?
I never really understood statistics at school and it was a minor part of my university degree. However,
statistics play a significant role in many of our valuations, including discounted cash flow valuations
and share option valuations.
Complete the following sentence. A world without statistics …
… is not worth thinking about.
LET’S TALK STATS
What do you enjoy most about working in statistics?
We use data services and statistical tools that have been created by third parties. I can use, and talk
reasonably knowledgeably about, statistical data without being an expert.
a quick q&a
Describe your first statistics-related job or work experience.
Was this a positive or a negative experience?
The first time I can recall using statistics was for a share option
valuation. We had to determine the share price volatility based
on historical share price data. There are about half a dozen
methods that can be used, all with various advantages and
disadvantages. I did and still find this analysis interesting.
What do you feel is the most common misconception about
your work held by students who are studying statistics?
Please explain.
Statistics provides information to support our analysis and
decisions. However, the information is never perfect, and
subjectivity and commercial common sense play a large part in
our work.
Do you need to be good at maths to understand and use
statistics successfully?
I think you need to have a logical and well-structured approach
to problems. These skills would probably make you good at both
maths and statistics.
Is there a high demand for statisticians in your industry (or in
other industries)? Please explain.
The finance industry is heavily reliant on statistics. I expect there
is high demand for statisticians from the various data providers,
and in a number of specialist areas (e.g. insurance).
Does data collection play an important role in the decisions
you make for your business/work? Please explain.
Accurate data collection is essential to our valuation projects.
Although our work involves a degree of commercial acumen, it is
essential that the data supports and justifies these decisions. We
also aggregate data for internal business use to measure staff
productivity, business performance and forecasting budgets.
Describe a project that you have worked on recently that might
have involved data collection. Please be specific.
We recently valued an infrastructure asset using the discounted
cash flow model. The model requires two essential inputs: the
forecast of future cash flows of the asset, and the discount rate
that reflects the riskiness of those cash flows. To arrive at an
appropriate discount rate we generally analyse comparable
companies for an indication of the level of risk that should be
attributed to the asset to be valued. In this exercise there are
several instances of data collection. We collect five-year
historical stock data for numerous comparable companies as an
initial indication of risk. We then collect data on key financial
indicators to assess the degree of comparability between the
stock and the asset to be valued. To determine the risk-free rate
and the market-risk premium, 10-year government bond rate data
is collected.
How are these data usually summarised? What are some
positives and negatives of these summary techniques?
We generally organise the collected data into Microsoft Excel
workbooks. The main advantage of using this software is the
ease of data analysis. Some powerful data analysis tools include
data tables, What-If Analysis, Solver, charting and common
statistical functions. Some shortcomings we have encountered
using Excel are that data sometimes need to be rearranged
depending on the analysis, [there can be] problems with
inconsistent or missing data, and output can sometimes be
incomplete. These factors increase the likelihood of errors in
data analysis; however, for the purposes of corporate finance,
Excel is generally sufficient as a means of summarising and
analysing the data collected.
In your experience, what is the most commonly referred to
measure of central tendency? What benefits does this measure
offer over others?
In valuations, we generally prefer to use the median as a
measure of central tendency rather than mean or mode. We find
that the mean has one main disadvantage: it is particularly
susceptible to outliers. When looking at comparable companies
there are often outliers caused by one-off business issues that
are irrelevant for the purposes of comparing our business. We
very rarely use mode given that it only really coincides with the
central tendency of data where the distribution is centre-heavy
and there are generally few recurring figures in the data set.
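The point about outliers can be sketched with a few lines of Python. The figures below are hypothetical valuation multiples invented for illustration; they do not come from the interview. Python's standard `statistics` module supplies `mean` and `median`.

```python
from statistics import mean, median

# Hypothetical valuation multiples for six comparable companies (illustrative only).
multiples = [7.8, 8.1, 8.4, 8.6, 9.0, 9.3]

# The same set plus one company distorted by a one-off business issue.
with_outlier = multiples + [35.0]

# Compare how each measure of central tendency reacts to the outlier.
print("Without outlier:", round(mean(multiples), 2), median(multiples))
print("With outlier:   ", round(mean(with_outlier), 2), median(with_outlier))
```

In this made-up example the single outlier pulls the mean from roughly 8.5 to roughly 12.3, while the median barely moves, which is the behaviour described above.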
Why is it important to be aware of the spread/variation of data
points in a sample? What are the consequences of not knowing
this type of information about your sample?
Without an understanding of the spread and variation of a data set
there is no context to the measure of central tendency applied. A
measure of central tendency summarises the data into a single
value while the spread and variation of data gives an indication of
how reliable an average or median summary of collected data is.
For example, if the spread of values in the data set is relatively
large it suggests the mean is not as representative, and a
smoothing of data is required, when compared to a data set with a
smaller range. Adopting a mean without reference to the spread
can taint our analysis and result in a lack of validity in the
decisions that are based on the data.
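A small sketch can make this concrete. The two data sets below are hypothetical and were chosen so that they share the same mean but differ in spread; `stdev` from Python's `statistics` module gives the sample standard deviation.

```python
from statistics import mean, stdev

# Two hypothetical data sets with the same mean but very different spread.
tight = [9, 10, 10, 11, 10]
wide = [2, 6, 10, 14, 18]

for name, values in (("tight", tight), ("wide", wide)):
    value_range = max(values) - min(values)  # largest minus smallest value
    print(f"{name}: mean={mean(values)}, range={value_range}, "
          f"std dev={stdev(values):.2f}")
```

Both sets have a mean of 10, yet the mean is a far less representative summary of the second set, whose values lie much further from it.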
CHAPTER 1
Defining and collecting data
THE HONG KONG AIRPORT SURVEY
You are departing Hong Kong International Airport on the next leg of your trip and have
cleared Immigration. You are approached by a researcher holding a tablet computer
who asks if you can answer a few questions. The first question determines if you are a
visitor to Hong Kong or a resident. After establishing that you are a visitor the questions go on
to determine the purpose of your visit, the name of your hotel, the activities you have undertaken
and much additional information about your visit.
This information is useful for a tourism authority that has the task of marketing Hong Kong as a
travel destination and monitoring the quality of visitors’ experiences in the city. It may also
inform the authority’s government and commercial stakeholders, who provide transport, accommodation, and food and shopping for visitors, and be used for forward planning.
© Jungyeol & Mina/age fotostock
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the types of data used in business
2 identify how statistics is used in business
3 recognise the sources of data used in business
4 distinguish between different survey sampling methods
5 evaluate the quality of surveys
Not so long ago, business students were unfamiliar with the word data and had little experience
handling data. Today, every time you visit a search engine website or ‘ask’ your mobile device
a question, you are handling data. And if you ‘check in’ to a location or indicate that you ‘like’
something, you are creating data as well.
You accept as almost true the premises of stories in which characters collect ‘a lot of data’
to uncover conspiracies, foretell disasters or catch a criminal.
You hear concerns about how the government or business might be able to ‘spy’ on you in
some way or how large social media companies ‘mine’ your personal data for profit.
You hear the word data everywhere and may even have a ‘data plan’ for your smartphone.
You know, in a general way, that data are facts about the world and that most data seem to be,
ultimately, a set of numbers – that 34% of students recently polled prefer using a certain Internet browser, or that 50% of citizens believe the country is headed in the right direction, or that
unemployment is down 3%, or that your best friend’s social media account has 835 friends and
202 recent posts.
You cannot escape from data in this digital world. What, then, should you do? You could
try to ignore data and conduct business by relying on hunches or your ‘gut instincts’. However,
if you want to use only gut instincts, then you probably shouldn’t be reading this book or taking
business courses in the first place.
You could note that there is so much data in the world – or just in your own little part of the
world – that you couldn’t possibly get a handle on it.
You could accept other people’s data summaries and their conclusions without first reviewing the data yourself. That, of course, would expose you to fraudulent practices.
Or you could do things the proper way and realise the benefits of learning the methods of
statistics, the subject of this book. You can learn, though, the procedures and methods that will
help you make better decisions based on solid evidence. When you begin focusing on the procedures and methods involved in collecting, presenting and summarising a set of data, or forming conclusions about those data, you have discovered statistics.
In the Hong Kong Airport survey scenario it is important that research team members
focus on the information that is needed by many different stakeholders when planning for
future business and tourist visitors. If the research team fails to collect important information,
or misrepresents the opinions of current visitors, stakeholders may make poor decisions about
advertising, pricing, facilities and other factors relevant to attracting visitors and hosting them
in Hong Kong. Failure to offer suitable facilities and experiences could affect the profitability
of businesses in Hong Kong. In deciding how to collect the facts that are needed, it will help if
you know something about the basic concepts of statistics.
1.1 BASIC CONCEPTS OF DATA AND STATISTICS
The Meaning of ‘Data’
What do we mean by the word data? Its common use is somewhat different from its use in
statistics. It could be described in a general way as meaning ‘facts about the world’. However,
statisticians distinguish between the traits or properties that relate to people or things and the
actual values that these take.
variables
Characteristics or attributes that
can be expected to differ from one
individual to another.
data
The observed values of variables.
VARIABLES
Variables are characteristics of items or individuals.
DATA
Data are the observed values of variables.
For a group of people, we could examine the traits of age, country of birth or weight. For
a group of cars, we could note the colour, current value or kilometres driven. These characteristics are called variables.
Data are the values associated with these traits or properties. As an example, in Table 1.1
we find a set of data collected from six people which represents observations on three different
variables.
Table 1.1
Variable               Data
Age in years           24, 18, 53, 16, 22, 31
Country of birth       Australia, China, Australia, Malaysia, India, Australia
Weight in kilograms    50.2, 74.6, 96.3, 45.2, 56.1, 87.3
operational definition
Defines how a variable is to be measured.
In this book, the word data is always plural to remind you that data are a collection or set
of values. While we could say that a single value, such as ‘Australia’ is a datum, the terms data
point, observation, response or single data value are more typically encountered.
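As a minimal sketch, the distinction between variables and data in Table 1.1 can be mirrored in a small Python structure, where each key is a variable and each list holds its observed values. The names `table_1_1` and `observations` are assumptions made for this illustration.

```python
# Each key is a variable; each list holds the data (observed values) from Table 1.1.
table_1_1 = {
    "Age in years": [24, 18, 53, 16, 22, 31],
    "Country of birth": ["Australia", "China", "Australia",
                         "Malaysia", "India", "Australia"],
    "Weight in kilograms": [50.2, 74.6, 96.3, 45.2, 56.1, 87.3],
}

# One record per person: pair every variable with that person's value.
observations = [dict(zip(table_1_1, values))
                for values in zip(*table_1_1.values())]

for person in observations:
    print(person)
```

Each printed record groups the three data points collected from one person, while each list in `table_1_1` holds all the data for one variable.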
All variables should have an operational definition – a universally accepted meaning that is
clear to all associated with an analysis. Without operational definitions, confusion can occur.
An example of a situation where operational definitions are needed is for the process of data
gathering by the Australian Bureau of Statistics (ABS). The ABS needs to collect information
about the country of birth of a person and also the countries in which their father and mother
were born. While this might seem straightforward, definitional problems arise in the case of
people who were adopted or have step- or foster parents or other guardians. So the operational
definition used is:
• ‘Country of birth of person’, which is the country identified as being the one in which the
person was born
• ‘Country of birth of father’, which is the country in which the person’s birth father was
born, and
• ‘Country of birth of mother’, which is the country in which the person’s birth mother was born
(Australian Bureau of Statistics, Country of Birth Standard, Cat. No. 1200.0.55.004, 2016).
The Meaning of ‘Statistics’
statistics
A branch of mathematics
concerned with the collection and
analysis of data.
Statistics is the branch of mathematics that examines ways to process and analyse data. It
provides procedures to collect and transform data in ways that are useful to business decision
makers.
Statistics allows you to determine whether your data represent information that could be
used in making better decisions. Therefore, it helps you determine whether differences in the
numbers are meaningful in a significant way or are due to chance. To illustrate, consider the
following reports:
• In ‘News use across social media platforms 2016’, the Pew Research Center reported in
May 2016 that 67% of the adult US population had a Facebook account and 66% of
its users got news from the site (<http://assets.pewresearch.org/wpcontent/uploads/sites/13/2016/05/PJ_2016.05.26_social-media-and-news_FINAL-1.pdf>,
accessed 12 June 2017).
• In a blog titled ‘The top 10 benefits of newspaper advertising’, the 360 Degree Marketing
Group says that a study showed newspaper advertising was considered a more trusted
paid medium for information (58%) compared with television (54%), radio (49%) or
online (27%) (<www.360degreemarketing.com.au/Blog/bid/407663/The-Top-10Benefits-of-Newspaper-Advertising>, accessed 12 June 2017).
Without statistics, you cannot determine whether the ‘numbers’ in these stories represent
useful information. Without statistics, you cannot validate claims such as the statement that
advertising in newspapers or on television is more trusted than online advertising. And without
statistics, you cannot see patterns that large amounts of data sometimes reveal.
Statistics is a way of thinking that can help you make better decisions. It helps you solve
problems that involve decisions based on data that have been collected. You may have had
some statistics instruction in the past. If you ever created a chart to summarise data or calculated values such as averages to summarise data, you have used statistics. But there’s even
more to statistics than these commonly taught techniques, as the detailed table of contents
shows.
Statistics is undergoing important changes today. There are new ways of visualising data
that did not exist, were not practicable or were not widely known until recently. And, increasingly, statistics today is being used to ‘listen’ to what the data might be telling you rather than
just being a way to use data to prove something you want to say.
If you associate statistics with doing a lot of mathematical calculations, you will quickly
learn that business statistics uses software to perform the calculations for you (and, generally,
the software calculates with more precision and efficiency than you could do manually). But
while you do not need to be a good manual calculator to apply statistics, because statistics is a
way of thinking, you do need to follow a framework or plan to minimise possible errors of
thinking and analysis.
One such framework consists of the following tasks to help apply statistics to business
decision making:
1. Define the data that you want to study in order to solve a problem or meet an objective.
2. Collect the data from appropriate sources.
3. Organise the data collected by developing tables.
4. Visualise the data collected by developing charts.
5. Analyse the data collected to reach conclusions and present those results.
Typically, you do the tasks in the order listed. You must always do the first two tasks to have
meaningful outcomes, but, in practice, the order of the other three can change or appear inseparable. Certain ways of visualising data will help you to organise your data while performing
preliminary analysis as well. In any case, when you apply statistics to decision making, you
should be able to identify all five tasks, and you should verify that you have done the first two
tasks before the other three.
Using this framework helps you to apply statistics to these four broad categories of business activities:
1. Summarise and visualise business data.
2. Reach conclusions from those data.
3. Make reliable forecasts about business activities.
4. Improve business processes.
descriptive statistics
The field that focuses on
summarising or characterising a set
of data.
inferential statistics
Uses information from a sample to
draw conclusions about a
population.
Throughout this book, and especially in the scenarios that begin the chapters, you will discover specific examples of how we can apply statistics to business situations.
Statistics is itself divided into two branches, both of which are applicable to managing a
business. Descriptive statistics focuses on collecting, summarising and presenting a set of data.
Inferential statistics uses sample data to draw conclusions about a population.
Descriptive statistics has its roots in the record-keeping needs of large political and social
organisations. Refining the methods of descriptive statistics is an ongoing task for government
statistical agencies such as the Australian Bureau of Statistics and Statistics New Zealand as
they prepare for each Census. In Australia, a Census is scheduled to be carried out every five
years (e.g. 2011 and 2016) to count the entire population and to collect data about education,
occupation, languages spoken and many other characteristics of the citizens. A large amount of
planning and training is necessary to ensure that the data collected represent an accurate record
of the population’s characteristics at the Census date. However, despite the best planning, such
an immense data collection task can be affected by external factors. The Australian Census held
in 2016 was badly affected by a computer shutdown on Census night, 9 August. It was blamed
on the need to protect the system from denial of service cyber attacks and added approximately
$30 million to the cost of the Census (<www.abc.net.au/news/2016-10-25/turning-router-off-and-on-could-have-prevented-census-outage/7963916>, accessed 13 July 2017).
The foundation of inferential statistics is based on the mathematics of probability theory.
Inferential methods use sample data to calculate statistics that provide estimates of the characteristics of the entire population.
Today, applications of statistical methods can be found in different areas of business.
Accounting uses statistical methods to select samples for auditing purposes and to understand
the cost drivers in cost accounting. Finance uses statistical methods to choose between alternative portfolio investments and to track trends in financial measures over time. Management uses
statistical methods to improve the quality of the products manufactured or the services delivered by an organisation. Marketing uses statistical methods to estimate the proportion of customers who prefer one product over another and to draw conclusions about what advertising
strategy might be most useful in increasing sales of a product.
Other Important Definitions
Now that the terms variables, data and statistics have been defined, you need to understand the
meaning of the terms population, sample and parameter.
population
A collection of all members of a
group being investigated.
sample
The portion of the population
selected for analysis.
parameter
A numerical measure of some
population characteristic.
statistic
A numerical measure that describes
a characteristic of a sample.
POPULATION
A population consists of all the members of a group about which you want to draw a conclusion.
SAMPLE
A sample is the portion of the population selected for analysis.
PARAMETER
A parameter is a numerical measure that describes a characteristic of a population.
STATISTIC
A statistic is a numerical measure that describes a characteristic of a sample.
Examples of populations are all the full-time students at a university, all the registered
voters in New Zealand and all the people who were customers of the local shopping centre
last weekend. The term population is not limited to groups of people. We could refer to a
population of all motor vehicles registered in Victoria. Two factors need to be specified when
defining a population:
1. the entity (e.g. people or motor vehicles)
2. the boundary (e.g. registered to vote in New Zealand or registered in Victoria for
road use).
Samples could be selected from each of the populations mentioned above. Examples
include 10 full-time students selected for a focus group; 500 registered voters in New Zealand
who were contacted by telephone for a political poll; 30 customers at the shopping centre who
were asked to complete a market research survey; and all the vehicles registered in Victoria that
are more than 10 years old. In each case, the people or the vehicles in the sample represent a
portion, or subset, of the people or vehicles comprising the population.
The average amount spent by all the customers at the local shopping centre last weekend is
an example of a parameter. Information from all the shoppers in the entire population is needed
to calculate this parameter.
The average amount spent by the 30 customers completing the market research survey is an
example of a statistic. Information from a sample of only 30 of the shopping centre’s customers
is used in calculating the statistic.
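The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the spending amounts are simulated, and the population size of 2,000 customers, the seed and the variable names are assumptions made for the example, not figures from the shopping centre scenario.

```python
import random
import statistics

# Hypothetical population: amounts (in dollars) spent by every customer
# at the shopping centre last weekend. These values are simulated.
rng = random.Random(42)
population = [round(rng.uniform(5, 300), 2) for _ in range(2000)]

# Parameter: a numerical measure calculated from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measure calculated from a sample of 30 customers.
sample = rng.sample(population, 30)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two values will usually be close but not identical, which is why a statistic is treated as an estimate of the parameter rather than as the parameter itself.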
1.2 TYPES OF VARIABLES
As illustrated in Figure 1.1, there are two types of variables – categorical and numerical, sometimes referred to as qualitative and quantitative variables respectively.
The Hong Kong airport survey
Travellers in the departure lounge of the busy Hong Kong International Airport are asked to complete a
survey with questions about various aspects of their visit to the city and future travel plans. The
interviewer first asks if the traveller is a resident or a visitor. If the traveller is a visitor, the survey
proceeds. The survey includes these questions:
■ How many visits have you made to Hong Kong prior to this one?
■ How long is it since your visit here?
■ How satisfied were you with your accommodation?
   Very satisfied ■   Satisfied ■   Undecided ■   Dissatisfied ■   Very dissatisfied ■
■ How many times during this visit did you travel by ferry?
■ Shopping in Hong Kong stores gives good value for money
   Almost always ■   Sometimes ■   Very infrequently ■   Never ■
■ Was the purpose of your visit business?   Yes ■   No ■
■ Are you likely to return to Hong Kong in the next 12 months?   Yes ■   No ■
You have been asked to review the survey. What type of data does the survey seek to collect?
What type of information can be generated from the data of the completed survey? How can the
research company’s clients use this information when planning for future visitors? What other questions
would you suggest for the survey?
LEARNING OBJECTIVE 1
Identify the types of data used in business
Figure 1.1 Types of variables
VARIABLE TYPE            QUESTION TYPES                                               RESPONSES
Categorical              Do you currently own any shares?                             Yes / No
Numerical: Discrete      How many messages did you send on social media last week?    Number
Numerical: Continuous    How tall are you?                                            Centimetres
categorical variables
Take values that fall into one or more categories.
numerical variables
Take numbers as their observed responses.
discrete variables
Can only take a finite or countable number of values.
continuous variables
Can take any value between specified limits.
Categorical variables yield categorical responses, such as yes or no or male or female answers.
An example is the response to the question ‘Do you currently own any shares?’ because it is
limited to a simple yes or no answer. Another example is the response to the question in the
Hong Kong Airport survey (presented earlier in this section), ‘Are you likely to return to Hong Kong in the
next 12 months?’ Categorical variables can also yield more than one possible response; for
example, ‘On which days of the week are you most likely to use public transport?’
Numerical variables yield numerical responses, such as your height in centimetres. Other
examples are ‘How many times during this visit did you travel by ferry?’ (from the Hong Kong
Airport survey) or the response to the question, ‘How many messages did you send on social
media last week?’
There are two types of numerical variables: discrete and continuous. Discrete variables
produce numerical responses that arise from a counting process. ‘The number of social media
messages sent’ is an example of a discrete numerical variable because the response is one of a
finite number of integers. You send zero, one, two, …, 50 and so on messages.
Continuous variables produce numerical responses that arise from a measuring process.
Your height is an example of a continuous numerical variable because the response takes on
any value within a continuum or interval, depending on the precision of the measuring instrument. For example, your height may be 158 cm, 158.3 cm or 158.2945 cm, depending on the
precision of the available instruments.
No two people are exactly the same height, and the more precise the measuring device used,
the greater the likelihood of detecting differences in their heights. However, most measuring
devices are not sophisticated enough to detect small differences. Hence, tied observations are
often found in experimental or survey data even though the variable is truly continuous and,
theoretically, all values of a continuous variable are different.
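The effect of instrument precision on tied observations can be illustrated with a short Python sketch. The five heights below are invented for the example, and the `measure` function is a hypothetical stand-in for an instrument that reports to a fixed number of decimal places.

```python
from collections import Counter

# Hypothetical "true" heights in centimetres; all values are distinct.
true_heights = [158.2945, 158.3012, 158.2871, 171.0466, 171.0539]

def measure(height_cm, decimals):
    """Simulate an instrument that reports heights to a fixed precision."""
    return round(height_cm, decimals)

for decimals in (0, 1, 3):
    readings = [measure(h, decimals) for h in true_heights]
    # Count the readings that share their value with at least one other reading.
    ties = sum(count for count in Counter(readings).values() if count > 1)
    print(f"{decimals} decimal place(s): {readings} -> {ties} tied readings")
```

With a coarse instrument most readings are tied; as the precision increases the ties disappear, even though the underlying heights never changed.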
Levels of Measurement and Types of Measurement Scales
Data are also described in terms of their level of measurement. There are four widely recognised levels of measurement: nominal, ordinal, interval and ratio scales.
nominal scale
A classification of categorical data
that implies no ranking.
Nominal and ordinal scales
Data from a categorical variable are measured on a nominal scale or on an ordinal scale. A
nominal scale (Figure 1.2) classifies data into various distinct categories in which no ranking is
implied. In the Hong Kong Airport survey, the answer to the question ‘Are you likely to return to
Hong Kong in the next 12 months?’ is an example of a nominally scaled variable, as are your
favourite soft drink, your political party affiliation and your gender. Nominal scaling is the weakest form of measurement because you cannot specify any ranking across the various categories.
Figure 1.2 Examples of nominal scaling
CATEGORICAL VARIABLE           CATEGORIES
Personal computer ownership    Yes, No
Type of fuel used              Unleaded, Premium Unleaded, Diesel, LPG
Internet connection            Cable, Wireless
An ordinal scale classifies data into distinct categories in which ranking is implied. In the
Hong Kong Airport survey, the answers to the question ‘Shopping in Hong Kong stores gives
good value for money’ represent an ordinal scaled variable because the responses ‘almost
always, sometimes, very infrequently and never’ are ranked in order of frequency. Figure 1.3
lists other examples of ordinal scaled variables.
ordinal scale
Scale of measurement where values are assigned by ranking.
Figure 1.3 Examples of ordinal scaling
CATEGORICAL VARIABLE      ORDERED CATEGORIES
Product satisfaction      Very unsatisfied, Fairly unsatisfied, Neutral, Fairly satisfied, Very satisfied
Clothing size             S, M, L, XL
Type of Olympic medal     Gold, Silver, Bronze
Education level           Primary, Secondary, Tertiary
Ordinal scaling is a stronger form of measurement than nominal scaling because an
observed value classified into one category possesses more or less of a property than does an
observed value classified into another category. However, ordinal scaling is still a relatively
weak form of measurement because the scale does not account for the amount of the differences between the categories. The ordering implies only which category is ‘greater’, ‘better’ or
‘more preferred’ – not by how much.
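A small Python sketch may help show what an ordinal scale does and does not permit. The response categories come from the Hong Kong Airport survey; the five responses, the `FREQUENCY_SCALE` list and the `rank` helper are hypothetical names and data introduced for this illustration.

```python
# Ordered categories for the survey statement
# "Shopping in Hong Kong stores gives good value for money".
FREQUENCY_SCALE = ["Never", "Very infrequently", "Sometimes", "Almost always"]

def rank(response):
    """Return the position of a response on the ordinal scale (0 = lowest)."""
    return FREQUENCY_SCALE.index(response)

# Hypothetical responses from five travellers.
responses = ["Sometimes", "Never", "Almost always", "Sometimes", "Very infrequently"]

# Ranking is meaningful, so sorting and finding the middle response are valid.
ordered = sorted(responses, key=rank)
median_response = ordered[len(ordered) // 2]
print(ordered)
print("Median response:", median_response)

# The gaps between ranks are NOT meaningful: nothing says the step from
# "Never" to "Very infrequently" equals the step from "Sometimes" to
# "Almost always", so averaging the rank numbers would be misleading.
```

Sorting and taking a median rely only on order, which is all an ordinal scale guarantees.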
Interval and ratio scales
Data from a numerical variable are measured on an interval or ratio scale. An interval scale
(Figure 1.4) is an ordered scale in which the difference between measurements is a meaningful
quantity but does not involve a true zero point. For example, sports shoes for adults are often
sold in Australia marked with sizes based on the US or UK system. Neither system has a true
zero size. The size below an adult size 1 is a child’s size 13. However, in each system the intervals between sizes are equal.
NUMERICAL VARIABLE                        LEVEL OF MEASUREMENT
Shoe size (UK or US)                      Interval
Height (in centimetres)                   Ratio
Weight (in kilograms)                     Ratio
Salary (in US dollars or Japanese yen)    Ratio
A ratio scale is an ordered scale in which the difference between the measurements involves
a true zero point, as in length, weight, age or salary measurements, and the ratio of two values
is meaningful. In the Hong Kong Airport survey, the number of times a visitor travelled by ferry
is an example of a ratio scaled variable, as six trips is three times as many as two trips. As
another example, a carton that weighs 40 kg is twice as heavy as one that weighs 20 kg.
Data measured on an interval scale or on a ratio scale constitute the highest levels of measurement. They are stronger forms of measurement than an ordinal scale, because you can determine not only which observed value is the largest but also by how much. Interval and ratio
scales may apply for either discrete or continuous data.
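The difference between interval and ratio scales can be shown with simple arithmetic. The carton weights and ferry trips are the examples used in the text; the two shoe sizes are assumed values chosen only to make the point.

```python
# Ratio scale: a true zero exists, so ratios are meaningful.
carton_a_kg, carton_b_kg = 40, 20
print(carton_a_kg / carton_b_kg)    # how many times heavier carton A is

ferry_trips_a, ferry_trips_b = 6, 2
print(ferry_trips_a / ferry_trips_b)

# Interval scale: differences are meaningful, ratios are not.
# Hypothetical adult shoe sizes on one sizing system.
size_a, size_b = 10, 5
print(size_a - size_b)              # five sizes apart: a valid comparison
# size_a / size_b equals 2.0, but a size 10 shoe is not "twice as large"
# as a size 5 shoe, because size 0 does not mean "no length".
```

The division runs without error in both cases; it is the level of measurement, not the software, that determines whether the result means anything.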
interval scale
A ranking of numerical data where
differences are meaningful but
there is no true zero point.
Figure 1.4
Examples of interval and
ratio scales
ratio scale
A ranking where the differences
between measurements involve a
true zero point.
Think about this: Telephone polling
Companies such as Newspoll regularly undertake market research and political polling conducted by
phone interviews. A phone poll conducted by Newspoll in Sydney in November 2014 asked questions
about a number of topics. Some were demographic questions about the number of people who lived in
the household and the age, income, occupation and marital status of the participant. What would be the
purpose of asking such questions?
The other questions could be divided into three sections. The first section related to voting intentions for
the next state election and the level of satisfaction with the premier and the opposition leader. The
second section asked the participant’s opinion on the renewal of the federal government’s ban on super
trawlers. The third section asked a number of questions about domestic and international air travel
undertaken in the past year. These questions covered areas such as the purpose of travel, the airlines
used and level of satisfaction.
Who would use the data collected in this poll? If you were designing a similar poll, how would you
construct questions to collect data for the variables referred to above?
More recently, the political and business functions of Newspoll have been separated. To see how results of
the latest political polls are published in The Australian, go to <www.theaustralian.com.au/national-affairs/newspoll>. To see some public opinion poll reports, go to <www.omnipoll.com.au>.
Problems for Section 1.2
LEARNING THE BASICS
1.1 Three different types of drinks are sold at a fast-food restaurant – soft drinks, fruit juices and coffee.
a. Explain why the type of drinks sold is an example of a categorical variable.
b. Explain why the type of drinks sold is an example of a nominally scaled variable.
1.2 Coffee is sold in three sizes in takeaway cardboard cups – small, medium and large. Explain why the size of the coffee cup
is an example of an ordinal scaled variable.
1.3 Suppose that you measure the time it takes to download an MP3 file from the Internet.
a. Explain why the download time is a numerical variable.
b. Explain why the download time is a ratio scaled variable.
APPLYING THE CONCEPTS
1.4 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
a. Number of mobile phones per household
b. Length (in minutes) of the longest mobile call made per month
c. Whether all mobile phones in the household use the same telecommunications provider
d. Whether there is a landline telephone in the household
1.5 The following information is collected from students as they leave the campus bookshop during the first week of classes:
a. Amount of time spent shopping in the bookshop
b. Number of textbooks purchased
c. Name of degree
d. Gender
Classify each of these variables as categorical or numerical. If the variable is numerical, determine whether the variable is discrete
or continuous. In addition, determine the level of measurement.
1.6 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
a. Name of Internet provider
b. Amount of time spent surfing the Internet per week
c. Number of emails received per week
d. Number of online purchases made per month
1.7 Suppose the following information is collected from Andrew and Fiona Chen on their application for a home loan mortgage at
Metro Home Loans:
a. Monthly expenses: $2,056
b. Number of dependants being supported by applicant(s): 2
c. Annual family salary income: $105,000
d. Marital status: Married
Classify each of the responses by type of data and level of measurement.
1.8 One of the variables most often included in surveys is income. Sometimes the question is phrased, ‘What is your income (in
thousands of dollars)?’ In other surveys, the respondent is asked to ‘Place an X in the circle corresponding to your income
group’ and given a number of ranges to choose from.
a. In the first format, explain why income might be considered either discrete or continuous.
b. Which of these two formats would you prefer to use if you were conducting a survey? Why?
c. Which of these two formats would probably bring you a greater rate of response? Why?
1.9 The director of research at the e-business section of a major department store wants to conduct a survey throughout
Australia to determine the amount of time working women spend shopping online for clothing in a typical month.
a. Describe the population and the sample of interest, and indicate the type of data the director might wish to collect.
b. Develop a first draft of the questionnaire needed in (a) by writing a series of three categorical questions and three
numerical questions that you feel would be appropriate for this survey.
1.10 A university researcher designs an experiment to see how
generous participants will be in giving to charity. Discuss the
types of variables the experiment might give compared with a
survey of the same subjects about donations to charity.
1.11 Before a company undertakes an online marketing campaign it
needs to consider information about its own current sales and
the sales made by its competitors. What categorical data might
it use?
1.3 COLLECTING DATA
In the Hong Kong Airport scenario, identifying the data that need to be collected is an important step in the process of marketing the city and operational planning. Some of the data will
come from consumers through market research. It is important that the correct inferences are
drawn from the research and that appropriate statistical methods assist planners and designers
to make the right decisions.
Managing a business effectively requires collecting the appropriate data. In most cases,
the data are measurements acquired from items in a sample. The samples are chosen from
populations in such a manner that the sample is as representative of the population as possible.
The most common technique to ensure proper representation is to use a random sample. (See
section 1.4 for a detailed discussion of sampling techniques.)
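As a brief preview, drawing a simple random sample can be sketched with Python's standard `random` module. The sampling frame of 5,000 customer ID numbers, the sample size of 100 and the seed are all assumptions made for this illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for 5,000 customers.
population_ids = list(range(1, 5001))

# Seeded generator so the illustration is reproducible.
rng = random.Random(2019)

# Simple random sample of 100 customers, drawn without replacement.
sample_ids = rng.sample(population_ids, k=100)

print(len(sample_ids), "customers selected")
print(sorted(sample_ids)[:5], "...")
```

Because `sample` draws without replacement, no customer can be selected twice, and every customer in the frame has the same chance of being chosen.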
Many different types of circumstances require the collection of data:
• A marketing research analyst needs to assess the effectiveness of a new television
advertisement.
• A pharmaceutical manufacturer needs to determine whether a new drug is more effective
than those currently in use.
• An operations manager wants to monitor a manufacturing process to find out whether the
quality of output being produced is conforming to company standards.
• An auditor wants to review the financial transactions of a company to determine whether
or not the company is in compliance with generally accepted accounting principles.
• A potential investor wants to determine which firms within which industries are likely to
have accelerated growth in a period of economic recovery.
LEARNING OBJECTIVE
2
Identify how statistics is
used in business
Identifying Sources of Data
Identifying the most appropriate source of data is a critical aspect of statistical analysis. If biases,
ambiguities or other types of errors flaw the data being collected, even the most sophisticated
statistical methods will not produce accurate information. Five important sources of data are:
• data distributed by an organisation or an individual
• a designed experiment
• a survey
• an observational study
• data collected by ongoing business activities.
primary sources
Provide information collected by the data analyser.
secondary sources
Provide data collected by another person or organisation.
LEARNING OBJECTIVE 3
Recognise the sources of data used in business
Data sources are classified as either primary sources or secondary sources. When the data collector is the one using the data for analysis, the source is primary. When another organisation or
individual has collected the data that are used for analysis by an organisation or individual, the
source is secondary.
focus group
A group of people who are asked about attitudes and opinions for qualitative research.
Organisations and individuals that collect and publish data typically use this information as
a primary source and then let others use the data as a secondary source. For example, the
Australian federal government collects and distributes data in this way for both public and private purposes. The Australian Bureau of Statistics oversees a variety of ongoing data collection
in areas such as population, the labour force, energy, the environment and health care, and
publishes statistical reports. The Reserve Bank of Australia collects and publishes data on
exchange rates, interest rates, and ATM and credit card transactions.
Market research firms and trade associations also distribute data pertaining to specific
industries or markets. Investment services such as Morningstar provide financial data on a company-by-company basis. Syndicated services such as Nielsen provide clients with data enabling
the comparison of client products with those of their competitors. Daily newspapers in print
and online formats are filled with numerical information about share prices, weather conditions
and sports statistics.
As listed above, conducting an experiment is another important data-collection source. For
example, to test the effectiveness of laundry detergent, an experimenter determines which
brands in the study are more effective in cleaning soiled clothes by actually washing dirty laundry instead of asking customers which brand they believe to be more effective. Proper experimental designs are usually the subject matter of more advanced texts, because they often
involve sophisticated statistical procedures. However, some fundamental experimental design
concepts are considered in Chapter 11.
Conducting a survey is a third important data source. Here, the people being surveyed are
asked questions about their beliefs, attitudes, behaviours and other characteristics. Responses
are then edited, coded and tabulated for analysis.
Conducting an observational study is the fourth important data source. In such a study, a
researcher observes the behaviour directly, usually in its natural setting. Observational studies
take many forms in business. One example is the focus group, a market research tool that is used
to elicit unstructured responses to open-ended questions. In a focus group, a moderator leads
the discussion and all the participants respond to the questions asked. Other, more structured
types of studies involve group dynamics and consensus building and use various
organisational-behaviour tools such as brainstorming, the Delphi technique and the nominal-group method. Observational study techniques are also used in situations in which enhancing
teamwork or improving the quality of products and services are management goals.
Data collected through ongoing business activities are a fifth data source. Such data can be
collected from operational and transactional systems that exist in both physical ‘bricks-and-mortar’ and online settings but can also be gathered from secondary sources such as third-party social
media networks and online apps and website services that collect tracking and usage data. For
example, a bank might analyse a decade’s worth of financial transaction data to identify patterns
of fraud, and a marketer might use tracking data to determine the effectiveness of a website.
‘Big Data’
big data
Large data sets characterised by
their volume, velocity and variety.
Relatively recent advances in information technology allow businesses to collect, process, and
analyse very large volumes of data. Because the operational definition of ‘very large’ can be partially dependent on the context of a business – what might be ‘very large’ for a sole proprietorship
might be commonplace and small for a multinational corporation – many use the term big data.
Big data is more of a fuzzy concept than a term with a precise operational definition, but it
implies data that are being collected in huge volumes and at very fast rates (typically in real
time) and data that arrive in a variety of forms, both organised and unorganised. These attributes of ‘volume, velocity, and variety’, first identified in 2001 (see reference 1), make big data
different from any of the data sets used in this book.
Big data increases the use of business analytics because the sheer size of these very large
data sets makes preliminary exploration of the data using older techniques impracticable. This
effect is explored in Chapter 20.
Big data tends to draw on a mix of primary and secondary sources. For example, a retailer
interested in increasing sales might mine Facebook and Twitter accounts to identify sentiment
about certain products or to pinpoint top influencers and then match those data to its own data
collected during customer transactions.
Data Formatting
The data you collect may be formatted in more than one way. For example, suppose that you
wanted to collect electronic financial data about a sample of companies. The data you seek to
collect could be formatted in any number of ways, including:
• tables of data
• contents of standard forms
• a continuous data stream
• messages delivered from social media websites and networks.
These examples illustrate that data can exist in either a structured or an unstructured form.
Structured data are data that follow some organising principle or plan, typically a repeating pattern. For example, a simple ASX share price search record is structured because each entry
would have the name of a company, the last sale, change in price, bid price, volume traded, and
so on. Due to their inherent organisation, tables and forms are also structured. In a table, each
row contains a set of values for the same columns (i.e. variables), and in a set of forms, each
form contains the same set of entries. For example, once we identify that the second column of
a table or the second entry on a form contains the family name of an individual, then we know
that all entries in the second column of the table or all of the second entries in all copies of the
form contain the family name of an individual.
In contrast, unstructured data follows no repeating pattern. For example, if five different
people sent you an email message concerning the share trades of a specific company, that data
could be anywhere in the message. You could not reliably count on the name of the company
being the first words of each message (as in the ASX search), and the pricing, volume and percentage of change data could appear in any order. Earlier in this section, big data was defined,
in part, as data that arrive in a variety of forms, both organised and unorganised. You can restate
that definition as ‘big data exists as both structured and unstructured data’.
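The contrast between structured and unstructured data can be sketched in Python. The company names, prices and messages below are invented for the example, and the regular expression is only one simple, assumed way of searching free text for a dollar amount.

```python
import csv
import io
import re

# Structured: every row has the same columns, so a field is found by name.
# The companies and prices below are invented for illustration.
table_text = """company,last_sale,change,volume
Alpha Ltd,12.50,0.10,150000
Beta Ltd,3.20,-0.05,98000
"""
rows = list(csv.DictReader(io.StringIO(table_text)))
last_sales = [float(row["last_sale"]) for row in rows]
print(last_sales)

# Unstructured: the same kind of information can sit anywhere in a message,
# so it has to be searched for rather than looked up by position.
messages = [
    "Heard Alpha Ltd closed at $12.50 today, up a little.",
    "Volume was heavy. $3.20 was the last price I saw for Beta Ltd.",
]
price_pattern = re.compile(r"\$(\d+\.\d{2})")
found = [float(m.group(1)) for msg in messages for m in price_pattern.finditer(msg)]
print(found)
```

The table can be read reliably because its organising principle is known in advance; the search over the messages works only as long as every writer happens to format prices the way the pattern expects.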
The ability to handle unstructured data represents an advance in information technology.
Chapter 20 discusses business analytics methods that can analyse structured data as well as
unstructured data or semi-structured data. (Think of an application form that contains structured form-fills but also contains an unstructured free-response portion.)
With the exception of some of the methods discussed in Chapter 20, the methods taught
and the software techniques used in this book involve structured data. Your beginning point
will always be tabular data, and for many problems and examples you can begin with that
data in the form of a Microsoft Excel worksheet that you can download and use (see companion website).
Electronic formats and encoding need to be considered. Data can exist in more than one
electronic format. This affects data formatting, as some electronic formats are more immediately usable than others. For example, which data would you like to use: data in an electronic
worksheet file or data in a scanned image file that contains one of the worksheet illustrations in
this book? Unless you like to do extra work, you would choose the first format because the
second would require you to employ a translation process – perhaps a character-scanning program that can recognise numbers in an image.
Data can also be encoded in more than one way, as you may have learned in an information
systems course. Different encodings can affect the precision of values for numerical variables,
and that can make some data not fully compatible with other data you have collected.
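One way to see how encoding can affect numerical precision is to store the same value in two common binary formats. The sketch below uses Python's standard `struct` module; the value 0.1 is an arbitrary choice for the illustration.

```python
import struct

value = 0.1  # stored by Python as a 64-bit (double-precision) float

# Re-encode the value as a 32-bit float and read it back.
as_32_bit = struct.unpack("f", struct.pack("f", value))[0]

print(repr(value))       # the 64-bit value
print(repr(as_32_bit))   # the value after a round trip through 32 bits
print(value == as_32_bit)
```

The two encodings of the same number do not compare as equal, which is the kind of incompatibility that can arise when data collected from different systems are combined.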
structured data
Data that follow an organised
pattern.
unstructured data
Data that have no repeated pattern.
electronic formats
Data in a form that can be read by
a computer.
encoding
Representing data by numbers or
symbols to convert the data into a
usable form.
Data Cleaning
No matter how you choose to collect data, you may find irregularities in the values you collect,
such as undefined or impossible values. For a categorical variable, an undefined value would be
a value that does not represent one of the categories defined for the variable. For a numerical
variable, an impossible value would be a value that falls outside a defined range of possible
values for the variable. For a numerical variable without a defined range of possible values, you
might also find outliers, values that seem excessively different from most of the rest of the values. Such values may or may not be errors, but they demand a second review.
outliers
Values that appear to be excessively large or small compared with most values observed.
missing values
Refers to when no data value is stored for one or more variables in an observation.
Missing values are another type of irregularity. They are values that could not be collected (and are therefore not available for analysis). For example, you would record a non-response to a survey question as a missing value. Some
computer programs let you represent missing values explicitly, and such values are then properly excluded from analysis. The more limited
Excel has no special value that represents a missing value. When using Excel, you must find
and then exclude missing values manually.
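The checks described above can be sketched in a few lines of Python. This is only an illustration: the records, the field names `status` and `age`, the list of valid categories and the age range are all assumptions made for the example, and `None` stands in for a missing value.

```python
# Hypothetical survey records; None marks a missing value (a non-response).
records = [
    {"status": "full-time", "age": 34},
    {"status": "part-time", "age": 212},   # impossible value: outside the defined range
    {"status": "fulltime?", "age": 51},    # undefined category
    {"status": "full-time", "age": None},  # missing value
]

VALID_STATUS = {"full-time", "part-time"}  # assumed category definitions
AGE_RANGE = (0, 120)                       # assumed range of possible values


def find_irregularities(rows):
    """Return (row index, field, problem) for each irregular value found."""
    problems = []
    for i, row in enumerate(rows):
        if row["status"] is None:
            problems.append((i, "status", "missing"))
        elif row["status"] not in VALID_STATUS:
            problems.append((i, "status", "undefined category"))
        if row["age"] is None:
            problems.append((i, "age", "missing"))
        elif not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            problems.append((i, "age", "impossible value"))
    return problems


issues = find_irregularities(records)

# Exclude missing and impossible ages before computing the mean age.
bad_age_rows = {i for i, field, _ in issues if field == "age"}
clean_ages = [r["age"] for i, r in enumerate(records) if i not in bad_age_rows]
mean_age = sum(clean_ages) / len(clean_ages)
print(issues)
print(mean_age)
```

Flagged values still need a second review by a person. The sketch only finds them and keeps them out of the calculation.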
When you spot an irregularity, you may have to ‘clean’ the data you have collected. A
full discussion of data cleaning is beyond the scope of this book. (See reference 2 for more
information.)
Recoding Variables
recoded variable
A variable that has been assigned
new values that replace the original
ones.
mutually exclusive
Two events that cannot occur
simultaneously.
collectively exhaustive
Set of events such that one of the
events must occur.
After you have collected data, you may discover that you need to reconsider the categories that
you have defined for a categorical variable, or that you need to transform a numerical variable
into a categorical variable by assigning the individual numeric data values to one of several
groups. In either case, you can define a recoded variable that supplements or replaces the original variable in your analysis. For example, when defining households by their location, the
suburb or town recorded might be replaced by a new variable of the postcode.
When recoding variables, be sure that the category definitions cause each data value to be
placed in one and only one category, a property known as being mutually exclusive. Also ensure
that the set of categories you create for the new, recoded variable includes all the data values
being recoded, a property known as being collectively exhaustive. If you are recoding a categorical variable, you can preserve one or more of the original categories, as long as your recoded
values are both mutually exclusive and collectively exhaustive.
When recoding numerical variables, pay particular attention to the operational definitions
of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories ‘Under 12’, ‘12–20’, ‘21–34’,
‘35–59’ and ‘60 and over’ are self-defining for age, the categories ‘Child’, ‘Youth’, ‘Young
adult’, ‘Middle aged’ and ‘Senior’ need their own operational definitions.
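As a minimal sketch, the self-defining age ranges above can be written as a recoding rule in Python. The function name `recode_age` and the assumption that ages are recorded in whole years are illustrative choices, not part of the text. Because each age is matched to exactly one range and every non-negative age falls in some range, the categories are mutually exclusive and collectively exhaustive.

```python
# Age groups from the text: (lower bound in whole years, label), in ascending order.
AGE_GROUPS = [
    (0, "Under 12"),
    (12, "12–20"),
    (21, "21–34"),
    (35, "35–59"),
    (60, "60 and over"),
]


def recode_age(age):
    """Return the single age group that contains a whole-year age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    label = None
    for lower, name in AGE_GROUPS:
        if age >= lower:
            label = name  # keep the last group whose lower bound is reached
    return label


ages = [3, 12, 20, 21, 47, 60, 95]
recoded = [recode_age(a) for a in ages]
print(list(zip(ages, recoded)))
```

Categories such as ‘Child’ or ‘Senior’ could be recoded the same way, but only after you have written down the age range each label stands for.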
Problems for Section 1.3
APPLYING THE CONCEPTS
1.12 The Data and Story Library (DASL) is an online library of data
files and stories that illustrate the use of basic statistical
methods. Visit <http://lib.stat.cmu.edu/DASL>, click Power
search, and explore a datafile of interest to you. Which of the
five sources of data best describes the sources of the datafile
you selected?
1.13 Visit the website of Ipsos Australia at <www.ipsos.com.au>.
Read about a recent poll or news story. What type of data
source is this based on?
1.14 Visit the website of the Pew Research Center at <www.pewresearch.org>.
Read one of today’s top stories. What type of
data source is the story based on?
1.15 Transportation engineers and planners want to address the
dynamic properties of travel behaviour by describing in detail
the driving characteristics of drivers over the course of a month.
What type of data collection source do you think the
transportation engineers and planners should use?
1.16 Visit the homepage of the Statistics Portal ‘Statista’ at <www.statista.com>.
Go to Statistics>Popular Statistics, then choose
one item to examine. What type of data source is the
information presented here based on?
1.4 TYPES OF SURVEY SAMPLING METHODS
LEARNING OBJECTIVE 4
Distinguish between different survey sampling methods
In Section 1.1 a sample was defined as the portion of the population that has been selected for
analysis. You collect your data from either a population or a sample depending on whether all
items or people about whom you wish to reach conclusions are included. Rather than taking a
complete census of the whole population, statistical sampling procedures focus on collecting a
small representative group of the larger population. The sample results are then used to estimate characteristics of the entire population. The three main reasons for drawing a sample are:
1. A sample is less time-consuming than a census.
2. A sample is less costly to administer than a census.
3. A sample is less cumbersome and more practical to administer than a census.
The sampling process begins by defining the frame. The frame is a listing of items that
make up the population. Frames are data sources such as population lists, directories or maps.
Samples are drawn from these frames. Inaccurate or biased results can occur if the frame
excludes certain groups of the population. Using different frames to generate data can lead to
opposite conclusions.
Once you select a frame, you draw a sample from the frame. As illustrated in Figure 1.5,
there are two kinds of samples: the non-probability sample and the probability sample.
frame
A list of the items in the population
of interest.
Figure 1.5
Types of samples
Non-probability samples: judgment sample, quota sample, chunk sample, convenience sample
Probability samples: simple random sample, systematic sample, stratified sample, cluster sample
In a non-probability sample, you select the items or individuals without knowing their probabilities of selection. Thus, the theory that has been developed for probability sampling cannot be
applied to non-probability samples. A common type of non-probability sampling is convenience
sampling. In convenience sampling, items are selected based only on the fact that they are easy,
inexpensive or convenient to sample. In some cases, participants are self-selected. For example,
many companies conduct surveys by giving visitors to their website the opportunity to complete
survey forms and submit them electronically. The response to these surveys can provide large
amounts of data quickly, but the sample consists of self-selected web users. For many studies,
only a non-probability sample such as a judgment sample is available. In a judgment sample, you
get the opinions of preselected experts in the subject matter as to who should be included in the
survey. Some other common procedures of non-probability sampling are quota sampling and
chunk sampling. These are discussed in detail in specialised books on sampling methods (see
references 3 and 4).
Non-probability samples can have certain advantages such as convenience, speed and
lower cost. However, their lack of accuracy due to selection bias and their poorer capacity to
provide generalised results more than offset these advantages. Therefore, you should restrict
the use of non-probability sampling methods to situations in which you want to get rough
approximations at low cost to satisfy your curiosity about a particular subject, or to small-scale
studies that precede more rigorous investigations.
non-probability sample
One where selection is not based
on known probabilities.
convenience sampling
Selection using a method that is
easy or inexpensive.
judgment sample
Gives the opinions of preselected
experts.
probability sample
One where selection is based on
known probabilities.
In a probability sample, you select the items based on known probabilities. Whenever
possible, you should use probability sampling methods. The samples based on these methods allow you to make unbiased inferences about the population of interest. In practice, it
is often difficult or impossible to take a probability sample. However, you should work
towards achieving a probability sample and acknowledge any potential biases that might
exist. The four types of probability samples most commonly used are simple random, systematic, stratified and cluster. These sampling methods vary in their cost, accuracy and
complexity.
Simple Random Sample
simple random sample
One where each item in the frame
has an equal chance of being
selected.
sampling with replacement
An item in the frame can be
selected more than once.
sampling without replacement
Each item in the frame can be
selected only once.
table of random numbers
Shows a list of numbers generated
in a random sequence.
In a simple random sample, every item from a frame has the same chance of selection as every
other item. In addition, every sample of a fixed size has the same chance of selection as every
other sample of that size. Simple random sampling is the most elementary random sampling
technique. It forms the basis for the other random sampling techniques.
With simple random sampling, you use n to represent the sample size and N to represent
the frame size. You number every item in the frame from 1 to N. The chance that you will select
any particular member of the frame on the first draw is 1/N.
You select samples with replacement or without replacement. Sampling with replacement
means that after you select an item you return it to the frame, where it has the same probability
of being selected again. Imagine you have a barrel which contains the shopping dockets of N
shoppers at a major retail centre who are entering a competition. First assume that each shopper
can have only one entry but can win more than one prize. The barrel is rolled, opened and the
entry of Jason O’Brien is selected. His docket is replaced, the barrel is rolled again and a second docket is chosen. Jason’s docket has the same probability of being selected again, 1/N. You
repeat this process until you have selected the desired sample size n. However, it is usually
more desirable to have a sample of different items than to permit a repetition of measurements
on the same item.
Sampling without replacement means that once you select an item it cannot be selected
again. The chance that you will select any particular item in the frame, say the shopping docket
of Jason O’Brien, on the first draw is 1/N. The chance that you will select any shopping docket
not previously selected on the second draw is now 1 out of N – 1. This process continues until
you have selected the desired sample of size n.
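The difference between the two schemes can be sketched with Python's standard `random` module. The frame size of 800, the sample size of 40 and the fixed seed are assumptions chosen only to make the illustration repeatable.

```python
import random

rng = random.Random(42)        # fixed seed so the sketch is repeatable
N, n = 800, 40                 # assumed frame size and sample size
frame = list(range(1, N + 1))  # items numbered 1 to N

# With replacement: each draw is made from the full frame,
# so the same item can be selected more than once.
with_replacement = rng.choices(frame, k=n)

# Without replacement: once selected, an item cannot be selected again.
without_replacement = rng.sample(frame, k=n)

print(len(set(with_replacement)), "distinct items when sampling with replacement")
print(len(set(without_replacement)), "distinct items when sampling without replacement")
```

Sampling without replacement always yields n distinct items, which is usually what you want when each selected item is to be measured once.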
Regardless of whether you have sampled with or without replacement, barrel draw
methods have a major drawback for sample selection. In a crowded barrel, it is difficult to
mix the entries thoroughly and ensure that the sample is selected randomly. As barrel draw
methods are not very useful, you need to use less cumbersome and more scientific methods
of selection.
One such method uses a table of random numbers (see Table E.1 in Appendix E of this
book) for selecting the sample. A table of random numbers consists of a series of digits listed
in a randomly generated sequence (see reference 5). Because the numeric system uses 10 digits
(0, 1, 2, …, 9), the chance that you will randomly generate any particular digit is equal to
the probability of generating any other digit. This probability is 1 out of 10. Hence, if a
sequence of 800 digits is generated, you would expect about 80 of them to be the digit 0, 80 to
be the digit 1, and so on. In fact, those who use tables of random numbers usually test the
generated digits for randomness prior to using them. Table E.1 has met all such criteria for
randomness. Because every digit or sequence of digits in the table is random, the table can be
read either horizontally or vertically. The margins of the table designate row numbers and
column numbers. The digits themselves are grouped into sequences of five in order to make
reading the table easier.
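As a rough illustration of the "about 80 of each digit" expectation, the following Python sketch generates 800 pseudo-random digits and tallies them. It uses the language's built-in generator rather than Table E.1, and the seed is an arbitrary assumption, so the counts are merely an example of the kind of variation you might see.

```python
import random
from collections import Counter

rng = random.Random(2019)  # arbitrary seed, chosen only for repeatability
digits = [rng.randrange(10) for _ in range(800)]  # 800 digits, each from 0 to 9

counts = Counter(digits)
for d in range(10):
    # Each digit has probability 1/10, so the expected count is 800 / 10 = 80.
    print(d, counts[d])
```

The observed counts will typically scatter around 80 rather than equal it exactly. Formal tests of randomness quantify whether such scatter is plausible.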
To use such a table instead of a barrel for selecting the sample, you first need to assign code
numbers to the individual members of the frame. Then you get the random sample by reading
the table of random numbers and selecting those individuals from the frame whose assigned
code numbers match the digits found in the table. Example 1.1 demonstrates the process of
sample selection.
EXAMPLE 1.1
SELECTING A SIMPLE RANDOM SAMPLE USING A TABLE OF RANDOM NUMBERS
A company wants to select a sample of 32 full-time workers from a population of 800
full-time employees in order to collect information on expenditures concerning a
company-sponsored dental plan. How do you select a simple random sample?
SOLUTION
The company can contact all employees by email but assumes that not everyone will
respond to the survey, so you need to distribute more than 32 surveys to get the desired
32 responses. Assuming that 8 out of 10 full-time workers will respond to such a survey
(i.e. a response rate of 80%), you decide to email 40 surveys.
The frame consists of a listing of the names and email addresses of all N = 800
full-time employees taken from the company personnel files. Thus, the frame is
an accurate and complete listing of the population. To select the random sample
of 40 employees from this frame, you use a table of random numbers, as shown in
Table 1.2. Because the population size (800) is a three-digit number, each
assigned code number must also be three digits so that every full-time worker has an
equal chance of selection. You give a code of 001 to the first full-time employee in
the population listing, a code of 002 to the second full-time employee in the population listing, and so on, until a code of 800 is given to the Nth full-time worker in the
listing. Because N = 800 is the largest possible coded value, you discard all three-digit code sequences that fall outside the range 001 to N (i.e. 801 to 999 and 000).
To select the simple random sample, you choose an arbitrary starting point from the
table of random numbers. One method you can use is to close your eyes and strike the table
of random numbers with a pencil. Suppose you use this procedure and select row 06,
column 05, of Table 1.2 (which is extracted from Table E.1) as the starting point. Although
you can go in any direction, in this example you will read the table from left to right in
sequences of three digits without skipping.
The individual with code number 003 is the first full-time employee in the sample (row
06 and columns 05–07) and the second individual has code number 364 (row 06 and columns
08–10). The next sequence read is 884. Because the highest code for any
employee is 800, you discard this number. Individuals with code numbers 720, 433, 463,
363, 109, 592, 470 and 705 are selected third to tenth, respectively (the out-of-range sequences 919 and 941 are also discarded along the way).
You continue the selection process until you get the needed sample size of 40 full-time
employees. During the selection process, if any three-digit coded sequence is repeated, you
include the employee corresponding to that coded sequence again as part of the sample, if
sampling with replacement. You discard the repeating coded sequence if sampling without
replacement.
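The selection procedure of Example 1.1 can be sketched in Python. The two strings below are rows 06 and 07 of Table 1.2 typed in by hand. The function `select_codes` and its parameters are hypothetical names introduced for this illustration.

```python
# Rows 06 and 07 of Table 1.2, with the spaces between five-digit groups removed.
ROW_06 = "97340 03364 88472 04334 63919 36394 11095 92470".replace(" ", "")
ROW_07 = "70543 29776 10087 10072 55980 64688 68239 20461".replace(" ", "")


def select_codes(digits, start_col, frame_size, sample_size, replacement=False):
    """Read code numbers left to right, discarding unusable sequences."""
    width = len(str(frame_size))  # 800 has three digits, so read three at a time
    selected = []
    pos = start_col - 1           # table columns are numbered from 1
    while len(selected) < sample_size and pos + width <= len(digits):
        code = int(digits[pos:pos + width])
        pos += width
        if not 1 <= code <= frame_size:
            continue              # discard 000 and anything above N
        if not replacement and code in selected:
            continue              # discard repeats when sampling without replacement
        selected.append(code)
    return selected


first_ten = select_codes(ROW_06 + ROW_07, start_col=5, frame_size=800, sample_size=10)
print(first_ten)
```

Starting at row 06, column 05, the sketch reproduces the ten code numbers listed in the solution. Reading further rows in the same way would continue the selection until 40 employees have been chosen.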
Table 1.2
Using a table of random numbers
Source: Data from the Rand Corporation, from A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL: The Free Press, 1955) (displayed in Table E.1 in Appendix E of this book).
Begin selection at row 06, column 5.

       Column
       00000  00001  11111  11112  22222  22223  33333  33334
Row    12345  67890  12345  67890  12345  67890  12345  67890
01     49280  88924  35779  00283  81163  07275  89863  02348
02     61870  41657  07468  08612  98083  97349  20775  45091
03     43898  65923  25078  86129  78496  97653  91550  08078
04     62993  93912  30454  84598  56095  20664  12872  64647
05     33850  58555  51438  85507  71865  79488  76783  31708
06     97340  03364  88472  04334  63919  36394  11095  92470
07     70543  29776  10087  10072  55980  64688  68239  20461
08     89382  93809  00796  95945  34101  81277  66090  88872
09     37818  72142  67140  50785  22380  16703  53362  44940
10     60430  22834  14130  96593  23298  56203  92671  15925
11     82975  66158  84731  19436  55790  69229  28661  13675
12     39087  71938  40355  54324  08401  26299  49420  59208
13     55700  24586  93247  32596  11865  63397  44251  43189
14     14756  23997  78643  75912  83832  32768  18928  57070
15     32166  53251  70654  92827  63491  04233  33825  69662
16     23236  73751  31888  81718  06546  83246  47651  04877
17     45794  26926  15130  82455  78305  55058  52551  47182
18     09893  20505  14225  68514  46427  56788  96297  78822
19     54382  74598  91499  14523  68479  27686  46162  83554
20     94750  89923  37089  20048  80336  94598  26940  36858
21     70297  34135  53140  33340  42050  82341  44104  82949
22     85157  47954  32979  26575  57600  40881  12250  73742
23     11100  02340  12860  74697  96644  89439  28707  25815
24     36871  50775  30592  57143  17381  68856  25853  35041
25     23913  48357  63308  16090  51690  54607  72407  55538
Systematic Sample
systematic sample
A method that involves selecting the
first element randomly then
choosing every kth element
thereafter.
In a systematic sample, you partition the N items in the frame into n groups of k items where:
$$k = \frac{N}{n}$$
You round k to the nearest integer. To select a systematic sample, you choose the first item to be
selected at random from the first k items in the frame. Then you select the remaining n – 1 items
by taking every kth item thereafter from the entire frame.
If the frame consists of a listing of prenumbered cheques, sales receipts or invoices, a systematic sample is faster and easier to take than a simple random sample. A systematic sample is
also a convenient mechanism for collecting data from telephone directories, class rosters and
consecutive items coming off an assembly line.
To take a systematic sample of n = 40 from the population of N = 800 employees, you
partition the frame of 800 into 40 groups, each of which contains 20 employees. You then select
a random number from the first 20 individuals, and include every 20th individual after the first
selection in the sample. For example, if the first number you select is 008, your subsequent
selections are 028, 048, 068, 088, 108, … , 768 and 788.
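The selection rule can be sketched in Python as follows. The function name `systematic_sample` and its parameters are illustrative assumptions. The worked numbers come from the employee example above.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return n item numbers: a random start in 1..k, then every kth item."""
    k = round(N / n)  # k = N / n, rounded to an integer
    if start is None:
        start = rng.randint(1, k)  # random starting point among the first k items
    return [start + i * k for i in range(n)]


# Employee example: N = 800 and n = 40 give k = 20; suppose the random start is 8.
sample = systematic_sample(800, 40, start=8)
print(sample[:5], "...", sample[-2:])
```

When N is not an exact multiple of n, rounding k can push the last selections past the end of the frame, so a practical version would need a rule for that case. Note also that Python's `round` rounds exact halves to the nearest even integer.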
Although they are simpler to use, simple random sampling and systematic sampling are
generally less efficient than other, more sophisticated probability sampling methods. Even
greater possibilities for selection bias and lack of representation of the population characteristics
occur from systematic samples than from simple random samples. If there is a pattern in the
frame, you could have severe selection biases. To overcome the potential problem of disproportionate representation of specific groups in a sample, you can use either stratified sampling
methods or cluster sampling methods.
Stratified Sample
In a stratified sample, you first subdivide the N items in the frame into separate subpopulations,
or strata. A stratum is defined by some common characteristic. You select a simple random
sample from each stratum, in proportion to the size of that stratum, and combine the results from the separate simple
random samples. This method is more efficient than either simple random sampling or systematic sampling because you are assured of the representation of items across the entire population. The homogeneity of items within each stratum provides greater precision in the estimates
of underlying population parameters.
stratified sample
Items randomly selected from each
of several populations or strata.
strata
Subpopulations composed of items
with similar characteristics in a
stratified sampling design.
EXAMPLE 1.2
SELECTING A STRATIFIED SAMPLE
A company wants to select a sample of 32 full-time workers from a population of 800 full-time employees in order to estimate expenditures from a company-sponsored dental plan.
Of the full-time employees, 25% are managerial and 75% are non-managerial workers. How
do you select the stratified sample so that the sample will represent the correct proportion of
managerial workers?
SOLUTION
If you assume an 80% response rate, you need to distribute 40 surveys to get the desired
32 responses. The frame consists of a listing of the names and company email addresses of
all N = 800 full-time employees included in the company personnel files. Since 25% of the
full-time employees are managerial, you first separate the population frame into two strata:
a subpopulation listing of all 200 managerial-level personnel and a separate subpopulation
listing of all 600 full-time non-managerial workers. Since the first stratum consists of a
listing of 200 managers, you assign three-digit code numbers from 001 to 200. Since
the second stratum contains a listing of 600 non-managerial-level workers, you assign
three-digit code numbers from 001 to 600.
To collect a stratified sample proportional to the sizes of the strata, you select 25% of
the overall sample from the first stratum and 75% of the overall sample from the second
stratum. You take two separate simple random samples, each of which is based on a distinct
random starting point from a table of random numbers (Table E.1). In the first sample you
select 10 managers from the listing of 200 in the first stratum, and in the second sample you
select 30 non-managerial workers from the listing of 600 in the second stratum. You then
combine the results to reflect the composition of the entire company.
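The proportional allocation in Example 1.2 can be sketched in Python. The dictionary layout, the stratum names and the fixed seed are assumptions for illustration. Here a pseudo-random generator stands in for the table of random numbers.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Each stratum is a list of code numbers, as assigned in Example 1.2.
strata = {
    "managerial": list(range(1, 201)),      # codes 001 to 200
    "non-managerial": list(range(1, 601)),  # codes 001 to 600
}
total = sum(len(items) for items in strata.values())  # N = 800
n = 40                                                # surveys to distribute

sample = {}
for name, items in strata.items():
    share = n * len(items) // total  # proportional share: 10 and 30 here
    sample[name] = rng.sample(items, share)  # simple random sample within the stratum

print({name: len(chosen) for name, chosen in sample.items()})
```

The integer division happens to be exact for these stratum sizes. With other sizes you would need a rounding rule so that the shares still add up to n.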
Cluster Sample
In a cluster sample, you divide the N items in the frame into several clusters so that each cluster
is representative of the entire population. You then take a random sample of clusters and study
all items in each selected cluster. Clusters are naturally occurring designations, such as postcode areas, electorates, city blocks, households or sales territories.
Cluster sampling is often more cost-effective than simple random sampling, particularly if
the population is spread over a wide geographical region. However, cluster sampling often
requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling. A detailed discussion of systematic sampling, stratified sampling
and cluster sampling procedures can be found in references 3, 4 and 6.
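A minimal Python sketch of cluster sampling follows. The frame of 20 postcode areas with 50 households each is entirely hypothetical, as are the identifiers and the choice of four clusters.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 20 postcode areas (clusters), each listing 50 household IDs.
clusters = {
    f"postcode-{p}": [f"{p}-{h:02d}" for h in range(1, 51)]
    for p in range(3000, 3020)
}

# Step 1: take a random sample of clusters.
chosen = rng.sample(sorted(clusters), k=4)

# Step 2: study every item in each selected cluster.
sample = [household for name in chosen for household in clusters[name]]

print(chosen)
print(len(sample), "households in the sample")
```

Only the clusters are selected at random. Within a chosen cluster every item is included, which is what distinguishes this design from a stratified sample.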
cluster sample
The frame is divided into
representative groups (or clusters),
then all items in randomly selected
clusters are chosen.
cluster
A naturally occurring grouping, such
as a geographical area.
Problems for Section 1.4
LEARNING THE BASICS
1.17 For a population containing N = 902 individuals, what code
number would you assign for:
a. the first person on the list?
b. the fortieth person on the list?
c. the last person on the list?
1.18 For a population of N = 902, verify that, by starting in row 05 of
the table of random numbers (Table E.1), you need only six rows
to select a sample of n = 60 without replacement.
1.19 Given a population of N = 93, starting in row 29 of the table of
random numbers (Table E.1) and reading across the row, select
a sample of n = 15:
a. without replacement
b. with replacement
APPLYING THE CONCEPTS
1.20 For a study that consists of personal interviews with
participants (rather than mail or phone surveys), explain why a
simple random sample might be less practical than some other
methods.
1.21 You want to select a random sample of n = 1 from a population
of three items (called A, B and C). The rule for selecting the
sample is: flip a coin; if it is heads, pick item A; if it is tails, flip
the coin again; this time, if it is heads, choose B; if it is tails,
choose C. Explain why this is a random sample but not a simple
random sample.
1.22 A population has four members (call them A, B, C and D). You
would like to draw a random sample of n = 2, which you
decide to do in the following way: flip a coin; if it is heads, the
sample will be items A and B; if it is tails, the sample will be
items C and D. Although this is a random sample, it is not a
simple random sample. Explain why. (If you did problem 1.21,
compare the procedure described there with the procedure
described in this problem.)
1.23 The town planning department of a Sydney council with a
population of N = 40,000 registered voters is asked by the
mayor to conduct a survey to measure community attitudes to
urban consolidation. The table following contains a breakdown of
the 40,000 registered voters by gender and ward of residence.

           Ward of residence
Gender     North     South     East      West      Total
Female     7,000     5,200     5,000     4,800     22,000
Male       5,600     4,600     4,000     3,800     18,000
Total      12,600    9,800     9,000     8,600     40,000

The planning department intends to take a probability sample of
n = 2,000 voters and project the results from the sample to the
entire population of voters.
a. If the frame available from the council files is an alphabetical
listing of the names of all N = 40,000 registered voters,
what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample
in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the council’s files is a listing of the
names and addresses of all N = 40,000 registered voters,
compiled from eight separate alphabetical lists based on the
gender and address breakdowns shown in the ward-of-residence table, what type of sample should you take? Discuss.
e. At present East Ward has many high-rise apartments, West
Ward and South Ward have single dwellings only and North
Ward has a mixture of low- and medium-density housing.
What would be the danger in randomly choosing 40 street
names and systematically sampling 50 of the residents of
those streets?
1.24 Suppose that 5,000 sales invoices are separated into four
strata. Stratum 1 contains 50 electrical invoices, stratum 2
contains 500 paint invoices, stratum 3 contains 1,000 plumbing
supplies invoices and stratum 4 contains 3,450 hardware
invoices. A sample of 500 sales invoices is needed.
a. What type of sampling method should you use? Why?
b. Explain how you would carry out the sampling according to
the method stated in (a).
c. Why is the sampling in (a) not simple random sampling?
1.5 EVALUATING SURVEY WORTHINESS
LEARNING OBJECTIVE 5
Evaluate the quality of surveys
Nearly every day you read or hear about survey or opinion poll results in newspapers, on the
Internet or on radio or television. To identify surveys that lack objectivity or credibility, you
must critically evaluate what you read and hear by examining the worthiness of the survey.
First, you must evaluate the purpose of the survey, why it was conducted and for whom it was
conducted. An opinion poll or survey conducted to satisfy curiosity is mainly for entertainment.
Its result is an end in itself rather than a means to an end. You should be sceptical of such a
survey because the result should not be put to further use.
The second step in evaluating the worthiness of a survey is for you to determine whether
it was based on a probability or a non-probability sample (as discussed in Section 1.4). You
need to remember that the only way to make correct statistical inferences from a sample to a
population is through the use of a probability sample. Surveys that use non-probability sampling methods are subject to serious, perhaps unintentional, bias that may render the results
meaningless.
Survey Errors
Even when surveys use random probability sampling methods, they are subject to potential
errors. Four types of survey errors are:
• coverage error
• non-response error
• sampling error
• measurement error.
Good survey research design attempts to reduce or minimise these various survey errors, often
at considerable cost.
Coverage error
The key to proper sample selection is an adequate frame. Remember, a frame is an up-to-date
list of all the items from which you will select the sample. Coverage error occurs if certain
groups of items are excluded from this frame so that they have no chance of being selected in
the sample. Coverage error results in a selection bias. If the frame is inadequate because certain
groups of items in the population were not properly included, any random probability sample
selected will provide an estimate of the characteristics of the frame, not the actual population.
Computer-based surveys are useful for certain studies where the subjects all have Internet
access. Coverage error could result if the unemployed, the elderly or indigenous communities
are not selected in the frame due to their lack of Internet or email access.
Non-response error
Not everyone is willing to respond to a survey. In fact, research has shown that individuals in
the upper and lower socioeconomic classes tend to respond less frequently to surveys than people
in the middle class. Non-response error arises from the failure to collect data on all items in
the sample and results in a non-response bias. Because you cannot generally assume that people
who do not respond to surveys are similar to those who do, you need to follow up on the non-responses after a specified period of time. You should make several attempts to persuade these
individuals to complete the survey. The follow-up responses are then compared with the initial
responses in order to make valid inferences from the survey (references 3, 4 and 6).
The mode of response you use affects the rate of response. The personal interview and the
telephone interview usually produce a higher response rate than a mail survey – but at a higher
cost.
Sampling error
There are three main reasons for selecting a sample rather than taking a complete census. It is
more expedient, less costly and more efficient. However, chance dictates which individuals
or items will or will not be included in the sample. Sampling error reflects the heterogeneity, or
‘chance differences’, from sample to sample, based on the probability of certain individuals or
items being selected in particular samples.
When you read about the results of surveys or polls in newspapers or magazines, there is
often a statement regarding margin of error or precision; for example, ‘the results of this poll
are expected to be within ±4 percentage points of the actual value’. This margin of error is the
sampling error. You can reduce sampling error by taking larger sample sizes, although this also
increases the cost of conducting the survey.
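The effect of sample size on sampling error can be illustrated with a small simulation in Python. Everything here is an assumption made for the sketch: a population in which half of all people agree with some statement, sample sizes of 100 and 1,600, and 200 repeated samples of each size.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the sketch is repeatable
TRUE_P = 0.5            # assumed proportion of the population that agrees


def sample_proportion(n):
    """Proportion agreeing in one simple random sample of size n."""
    return sum(rng.random() < TRUE_P for _ in range(n)) / n


def spread(n, repeats=200):
    """Standard deviation of the sample proportion across repeated samples."""
    results = [sample_proportion(n) for _ in range(repeats)]
    return statistics.pstdev(results)


spread_small = spread(100)
spread_large = spread(1600)
print(f"n = 100:  spread of sample results = {spread_small:.3f}")
print(f"n = 1600: spread of sample results = {spread_large:.3f}")
```

The sample-to-sample variation is noticeably smaller for the larger samples, which is the chance difference that a quoted margin of error is meant to describe. The price is that each survey requires sixteen times as many responses.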
coverage error
Occurs when all items in a frame do
not have an equal chance of being
selected. This causes selection
bias.
non-response error
Occurs due to the failure to collect
information on all items chosen for
the sample; this causes non-response bias.
sampling error
The difference in results for
different samples of the same size.
Think about this: The problem of online survey rigging
As the use of online methods for collecting information grows more prevalent, we need to be aware that
individuals will not all act honestly, especially when they have something to gain. There are many
methods being used to contravene the rules of online competitions, such as paying companies to vote,
setting up multiple email addresses or Facebook accounts, and using methods to mask the true IP
address of the computer being used. Even if a small incentive is offered for completing a survey, similar
problems can arise.
At an Australian university, students were recently asked to complete a survey about a peer-assisted learning program and were offered the chance to win movie tickets as an incentive to give
feedback. The survey was carried out anonymously in order to elicit frank responses, but on
completion students were automatically sent to a second site where they could register their
student ID in order to enter a draw to win movie tickets. One student registered 105 times in order
to increase the chance of winning the movie tickets. It is not clear how many times this person
completed the survey itself.
How could this type of behaviour potentially affect survey results? What could you do to minimise this
type of survey error if you were designing an online survey?
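One simple safeguard for the prize draw described above is to count how often each student ID has been registered. The Python sketch below is one possible check, not a complete answer to the questions posed; the IDs and the function name `repeated_entries` are invented for illustration.

```python
from collections import Counter

def repeated_entries(student_ids, limit=1):
    """Return IDs that appear more than `limit` times, with their counts."""
    counts = Counter(student_ids)
    return {sid: c for sid, c in counts.items() if c > limit}

# Hypothetical draw registrations; the IDs are invented for illustration.
registrations = ["s1001", "s1002", "s1003", "s1002", "s1002", "s1004"]

print(repeated_entries(registrations))

# Keep one entry per student so that repeat registrations gain nothing.
unique_entrants = sorted(set(registrations))
print(unique_entrants)
```

Note that this check works only where an identifier is recorded. Because the survey itself was anonymous, repeated survey completions could not be detected in this way.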
Measurement error
In the practice of good survey research, you design a questionnaire with the intention of gathering meaningful information. But you have a dilemma here – getting meaningful measurements
is easier said than done. Consider the following proverb:
A man with one watch always knows what time it is.
A man with two watches always searches to identify the correct one.
A man with ten watches is always reminded of the difficulty in measuring time.
measurement error
The difference between survey
results and the true value of what is
being measured.
Unfortunately, the process of getting a measurement is often governed by what is convenient,
not what is needed. The measurements are often only a proxy for the ones you really desire.
Much attention has been given to measurement error that occurs because of a weakness in
question wording (reference 6). A question should be clear, not ambiguous. And, to avoid
leading questions, you need to present them in a neutral manner.
There are three sources of measurement error: ambiguous wording of questions, the halo
effect and respondent error. The Australian Bureau of Statistics is very conscious of minimising
error caused by questionnaire design and survey operations. For the National Health Survey in
2010–11 it used Computer Assisted Interview techniques to collect information. It states:
the CAI instrument allows:
• data to be captured electronically at the point of interview, which obviates the cost,
logistical, timing and quality issues associated with transport, storage and security of
paper forms, and transcription/data entry of information from forms into electronic format
• the ability to use complex sequencing to define specific populations for questions,
and ensure word substitutes used in the questions were appropriate to each
respondent’s characteristics and prior responses
• the ability, through data validation (edits), to check responses entered against
previous responses, reduce data entry errors by interviewers, and enable seemingly
inconsistent responses to be clarified with respondents at the time of interview. The
audit trail recorded in the instrument also provides valuable information about the
operation of particular questions, and associated data quality issues. (Australian
Bureau of Statistics, Australian Health Survey: Users’ Guide, 2011–2013, electronic
publication, Cat. No. 4363.0.55.001, 2013)
The halo effect occurs when the respondent feels obligated to please the interviewer.
Proper interviewer training can minimise the halo effect. Respondent error occurs as a result
of overzealous or underzealous effort by the respondent. You can minimise this error in two
ways: (1) by carefully scrutinising the data and calling back those individuals whose responses
seem unusual, and (2) by establishing a program of random call-backs to determine the reliability of the responses.
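A program of random call-backs needs a way to choose which respondents to contact again. The Python sketch below draws a simple random selection without replacement; the respondent IDs, the number of call-backs and the function name `choose_callbacks` are assumptions made for illustration.

```python
import random

def choose_callbacks(respondent_ids, k, seed=None):
    """Select k respondents at random, without replacement, for call-backs."""
    rng = random.Random(seed)
    return rng.sample(list(respondent_ids), k)

# Hypothetical respondent IDs 1 to 200. The seed is set only so that
# the same selection can be reproduced when the example is rerun.
respondents = range(1, 201)
callbacks = choose_callbacks(respondents, k=10, seed=42)
print(sorted(callbacks))
```

Because every respondent has the same chance of being called back, the reliability found in the call-back group can reasonably be taken to reflect the reliability of the responses overall.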
Other sources of error besides measurement error can result from clerical or recording
errors. See references 7, 8 and 9 for a more detailed discussion of measurement error and the
difficulties of avoiding it.
Ethical Issues
Ethical considerations arise with respect to the four types of potential errors that can occur
when designing surveys that use probability samples: coverage error, non-response error,
sampling error and measurement error. Coverage error can result in selection bias and
becomes an ethical issue if particular groups or individuals are purposely excluded from the
frame so that the survey results are skewed, indicating a position more favourable to the survey’s sponsor. Non-response error can lead to non-response bias and becomes an ethical
issue if the sponsor knowingly designs the survey in such a manner that particular groups or
individuals are less likely to respond. Sampling error becomes an ethical issue if the findings
are purposely presented without reference to sample size and margin of error, so that the
sponsor can promote a viewpoint that might otherwise be truly insignificant. Measurement
error becomes an ethical issue in one of three ways: (1) a survey sponsor chooses leading
questions that guide the responses in a particular direction; (2) an interviewer, through mannerisms
and tone, purposely creates a halo effect or otherwise guides the responses in a particular direction; (3) a respondent, having a disdain for the survey process, wilfully provides
false information.
Ethical issues also arise when the results of non-probability samples are used to form
conclusions about the entire population. When you use a non-probability sampling method,
you need to explain the sampling procedures and state that the results cannot be generalised
beyond the sample.
Problems for Section 1.5
APPLYING THE CONCEPTS
1.25 ‘A survey indicates that the vast majority of university students
own their own personal computer.’ What information would you
want to know before you accepted the results of this survey?
1.26 A simple random sample of n = 300 full-time employees is
selected from a company list containing the names of all
N = 5,000 full-time employees in order to evaluate job
satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible non-response error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
1.27 According to a recent cyber security report, ‘millennials remain
the most common victims of cybercrime, with 40 percent
having experienced cybercrime in the past year’. Reasons
given for this include slack online security habits and password
sharing (2016 Norton Cyber Security Insights Report,
<www.symantec.com/content/dam/symantec/docs/
reports/2016-norton-cyber-security-insights-report.pdf>,
accessed 16 June 2017). What information would you want to
know before you accepted the results of the survey?
1.28 Kiribati is a small, poor Pacific nation under threat from global
warming. According to the CIA World Factbook, Kiribati
comprises a group of 33 coral atolls in the Pacific Ocean
straddling the equator, with elevations varying from 0 to 81
metres above sea level. The low level of some of the islands
makes them sensitive to changes in sea level (Central
Intelligence Agency, The World Factbook, <www.cia.gov/library/
publications/the-world-factbook/geos/kr.html> accessed
16 June 2017). Suppose that an environmental economist has
seen results from a survey which claims that 30% of inhabitants
of Kiribati are already affected by roads having been
permanently cut by rising seawater. What information would she
want to know before accepting the results of the survey?
1.29 Reality TV shows have incorporated surveys of audience opinion
into their formats. In Australia several shows have allowed the
audience to vote on whether contestants should remain on the
show or be excluded. Consider a show where voting is by SMS,
premium rate phone call, Facebook or another online site, and
viewers are limited to 10 votes using each method. Compare
this type of survey with a random poll of viewers without
replacement conducted by phone for the TV show.
a. How might the results differ?
b. What are the costs and benefits for the owners of the show
for each voting method?
1.30 The online restaurant search site Dimmi <www.dimmi.com.au>
encourages diners to rate restaurants they have been to by
giving them reward points which can be accumulated until a
meal discount is available. A restaurant at The Rocks in Sydney
has been rated as follows: Recommended 8.7; Food 8.5;
Service 8.7; Value for money 7.8; Atmosphere 8.4. What
differences could arise from this type of survey compared with
ratings derived from a random sample of diners?
1.6 THE GROWTH OF STATISTICS AND INFORMATION TECHNOLOGY
statistical packages
Computer programs designed to
perform statistical analysis.
During the past century, statistics has played an important role in spurring the use of information technology and, in turn, such technology has spurred the wider use of statistics. At the
beginning of the twentieth century, the expanding data-handling requirements associated with
the United States Federal Census led directly to the development of tabulating machines that
were the forerunners of today’s business computer systems. Statisticians such as Pearson,
Fisher, Gosset, Neyman, Wald and Tukey established the techniques of modern inferential
statistics as an alternative to analysing large sets of population data that had become increasingly costly, time-consuming and cumbersome to collect. The development of early computer
systems permitted others to develop computer programs to ease the calculation and data-processing burdens imposed by those techniques. Over time, greater use of statistical methods by
business decision makers and advances in computer capacity have led to the development of
even more sophisticated statistical methods.
Today, when you hear of retailers investing in a ‘customer-relationship management
system’, or CRM, or a packaged goods producer engaging in ‘data mining’ to uncover consumer preferences, you should realise that statistical techniques form the foundations of
such cutting-edge applications of information technology. As global information storage
increases dramatically, businesses are rapidly coming to terms with how to analyse big
data – data sets so large and varied that conventional software cannot readily handle them.
(Think of the huge volume of data produced each day by people using Visa, Facebook,
eBay and Twitter.) Even though cutting-edge applications might require custom programming, for many years businesses have had access to statistical packages such as Minitab,
SPSS/PASW Statistics, SAS and Stata – standardised sets of programs that help managers
use a wide range of statistical techniques by automating the data processing and calculations these techniques require.
The leasing and training costs associated with statistical packages have led many to consider using some of the graphical and statistical functions of Microsoft Excel. However, you
need to be aware that many statisticians have concerns about the accuracy and completeness of
the statistical results produced by early versions of Excel. Invalid results could be produced,
especially when the data sets were very large or had unusual statistical properties (see reference 10). Microsoft Excel 2010 and subsequent versions made some significant improvements
in statistical functions (see references 11 and 12) but it would still be wise to be careful about
the data and the analysis you are undertaking.
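To see how data with unusual properties can defeat a calculation that looks correct, consider the sample variance. The Python sketch below compares a one-pass 'shortcut' formula with a two-pass calculation on values that have a small spread around a very large number. It illustrates a general floating-point pitfall and is not presented as a reproduction of how any version of Excel computes its results; the data and function names are invented for illustration.

```python
from statistics import variance

def one_pass_variance(values):
    """Sample variance via the 'calculator' shortcut formula.

    Uses (sum of squares - (sum)^2 / n) / (n - 1), which subtracts two
    huge, nearly equal numbers and can lose all significant digits.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return (total_sq - total * total / n) / (n - 1)

def two_pass_variance(values):
    """Sample variance computed from deviations about the mean."""
    n = len(values)
    centre = sum(values) / n
    return sum((v - centre) ** 2 for v in values) / (n - 1)

# Hypothetical data: a small spread around a very large value.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("one-pass:", one_pass_variance(data))
print("two-pass:", two_pass_variance(data))
print("statistics.variance:", variance(data))
```

The deviations from the mean are −6, −3, 3 and 6, so the true sample variance is 30. The two-pass calculation and the standard library agree with this, while the one-pass formula returns a value far from 30 and can even come out negative. Checking a result against a second method is one practical way to be careful about the analysis you are undertaking.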
Assess your progress
Summary
In this chapter you have studied data collection and the various types
of data used in business. In the Hong Kong International Airport
scenario you were asked to review the visitor survey which will be
used to provide information to the tourism authority planning staff
(see page 9). Three of the questions shown will produce numerical
data and four will produce categorical data. The responses to the first
question (number of previous visits to Hong Kong) are discrete, and
the responses to the second question (length of time since last visit)
are continuous. After the data have been collected, they must be
organised and prepared in order to make various analyses. You have
also learned about commonly used sampling methods and ways to
prepare data for analysis such as encoding, cleaning and recoding.
The next two chapters develop tables and charts and a variety of
descriptive numerical measures that are useful for data analysis.
Key terms
big data 14
categorical variables 10
cluster 21
cluster sample 21
collectively exhaustive 16
continuous variables 10
convenience sampling 17
coverage error 23
data 6
descriptive statistics 8
discrete variables 10
electronic formats 15
encoding 15
focus group 14
frame 17
inferential statistics 8
interval scale 11
judgment sample 17
measurement error 24
missing values 16
mutually exclusive 16
nominal scale 10
non-probability sample 17
non-response error 23
numerical variables 10
operational definition 6
ordinal scale 11
outliers 16
parameter 8
population 8
primary sources 13
probability sample 18
ratio scale 11
recoded variable 16
sample 8
sampling error 23
sampling with replacement 18
sampling without replacement 18
secondary sources 13
simple random sample 18
statistic 8
statistical packages 26
statistics 6
strata 21
stratified sample 21
structured data 15
systematic sample 20
table of random numbers 18
unstructured data 15
variables 6
References
1. Laney, D., 3D Data Management: Controlling Data Volume, Velocity, and
Variety (Stamford, CT: META Group. February 6, 2001).
2. Osbourne, J. Best Practices in Data Cleaning: A Complete Guide to
Everything You Need to Do Before and After Collecting Your Data
(Thousand Oaks, CA: Sage Publications, 2013).
3. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).
4. Lohr, S. L., Sampling Design and Analysis, 2nd edn (Boston, MA: Brooks/
Cole Cengage Learning, 2010).
5. Rand Corporation, A Million Random Digits with 100,000 Normal
Deviates (Glencoe, IL: The Free Press, 1955).
6. Groves R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer &
R. Tourangeau, Survey Methodology, 2nd edn (New York: John Wiley, 2009).
7. Sudman, S., N. M. Bradburn & N. Schwarz. Thinking About Answers:
The Application of Cognitive Processes to Survey Methodology (San
Francisco, CA: Jossey-Bass, 1996).
8. Biemer, P. B., R. M. Graves, L. E. Lyberg, A. Mathiowetz & S. Sudman,
Measurement Errors in Survey (New York: Wiley Interscience, 2004).
9. Fowler, F. J., Improving Survey Questions: Design and Evaluation,
Applied Special Research Methods Series, Vol. 38 (Thousand Oaks,
CA: Sage Publications, 1995).
10. McCullough, B. D. & B. Wilson, ‘On the accuracy of statistical procedures
in Microsoft Excel 97’, Computational Statistics and Data Analysis,
31 (1999): 27–37.
11. Microsoft Corporation at <http://office.microsoft.com/en-au/excel-help/
what-s-new-changes-made-to-excel-functions-HA010355760.aspx>,
accessed June 2017.
12. Microsoft Corporation at <http://office.microsoft.com/en-001/excelhelp/new-functions-in-excel-2013-HA103980604.aspx>, accessed
June 2017.
Chapter review problems
CHECKING YOUR UNDERSTANDING
1.31 What is the difference between a sample and a population?
1.32 What is the difference between a statistic and a parameter?
1.33 What is the difference between descriptive and inferential statistics?
1.34 What is the difference between a categorical and a numerical
variable?
1.35 What is the difference between a discrete and a continuous
variable?
1.36 What is an operational definition and why is it so important?
1.37 What are the four types of measurement scales?
1.38 What are some potential problems with using ‘barrel draw’
methods to select a simple random sample?
1.39 What is the difference between sampling with replacement and
sampling without replacement?
1.40 What is the difference between a simple random sample and a
systematic sample?
1.41 What is the difference between a simple random sample and a
stratified sample?
1.42 What is the difference between a stratified sample and a
cluster sample?
APPLYING THE CONCEPTS
1.43 The Australasian Data and Story Library OZDASL
<www.statsci.org/data> is an online library of data files and
stories that illustrate the use of basic statistical methods. The
stories are classified by method and by topic. Go to this site
and click on ‘First Course in Statistics’. Pick a story and
summarise how statistics were used in the story.
1.44 Make a list of six ways you have used or encountered statistics
in the past week. Think about what you read or heard in a
news report or saw on a commercial website. Also think
whether you made a bet or participated in a survey.
1.45 The Australian Bureau of Statistics <www.abs.gov.au> site
contains survey information on people, business, geography
and other topics. Go to the site and find the latest version of
Labour Force, Australia (Cat. No. 6202.0).
a. Briefly describe the Labour Force survey.
b. Give an example of a categorical variable found in this survey.
c. Give an example of a numerical variable found in this survey.
d. Is the variable you selected in (c) discrete or continuous?
1.46 The Australian Bureau of Statistics website allows users to
access a large amount of Census data online. Go to
<www.abs.gov.au/census> and in the Data by Products
section click on the latest Census year, enter a location and
search for QuickStats.
a. Give an example of a categorical variable found in this
summary of survey results.
b. Give an example of a numerical variable found in this
summary of survey results.
c. Is the variable you selected in (b) discrete or continuous?
1.47 Detailed information on airport and airline on-time performance
can be found at <www.flightstats.com>. Explore the departures
performance data for different airports and regions.
a. Which of the five types of data sources listed in Section 1.3
do you think were used here?
b. Name a categorical variable for which observations were
collected.
c. Name a numerical variable for which observations were
collected.
d. What type of recoding has been used here and why?
1.48 Late in 2016 the National Roads and Motorists’ Association
(NRMA), a major Australian motoring organisation, released
results of a survey that sought to check members’ attitudes to
traffic congestion and a motorway extension (see <www.
mynrma.com.au/about/media/local-support-for-SouthConnexstrengthens-nrma-survey.htm>).
a. Describe the population(s) for this survey.
b. Describe the sample(s) for this survey.
c. Can you identify potential difficulties in comparing these
results with results from a similar 2005 survey?
1.49 A manufacturer of flavoured milk is planning to survey households
in Tasmania to determine the purchasing habits of consumers.
Among the questions to be included are those that relate to:
1. where flavoured milk is primarily purchased
2. what flavour of milk is purchased most often
3. how many people living in the household drink flavoured
milk
4. the total number of millilitres of flavoured milk drunk in
the past week by members of the household
a. Describe the population.
b. For each of the four items listed, indicate whether the
variable is categorical or numerical. If numerical, is it
discrete or continuous?
c. Develop five categorical questions for the survey.
d. Develop five numerical questions for the survey.
1.50 A new bus network is proposed for a north-eastern Sydney region.
A survey is sent out to residents asking questions which relate to:
1. the resident’s age
2. frequency of bus use
3. usual ticket type purchased
4. main purpose of using the bus
a. Describe the population.
b. Indicate whether each of the questions above is categorical
or numerical.
c. Develop two more numerical questions and state whether
the variables are discrete or continuous.
d. Develop two more categorical questions.
1.51 Political polling has traditionally used telephone interviews.
Researchers at a polling organisation argue that Internet
polling is less expensive and faster, and offers higher response
rates than telephone surveys. Critics are concerned about the
scientific reliability of this approach. Even amid this strong
criticism, Internet polling is becoming more and more common.
What concerns, if any, do you have about Internet polling?
1.52 Statistics New Zealand mentions a number of possible sources
of non-sampling error in economic surveys in A Guide to Good
Survey Design, 3rd edition, which can be downloaded from
<www.stats.govt.nz>.
a. Which of the four types of survey error from Section 1.5 are
identified on this site as a non-sampling error?
b. Discuss which errors would be more difficult to eliminate.
1.53 Researchers at a university wish to conduct a survey of past
students to ascertain how frequently they are using statistical
techniques in the workforce. The researchers have permission
from the ethics committee to use the last recorded email and
postal addresses to contact ex-students, but these may be out
of date, particularly as many students have returned to homes
overseas without updating their records. The emails and
letters are sent out simultaneously. The response to the
survey is low.
a. What type of errors or biases should the researchers be
especially concerned with?
b. What step(s) should the researchers take to try to overcome
the problems noted in (a)?
c. What could have been done differently to improve the
survey’s worthiness?
1.54 According to a survey conducted by the Australian Interactive
Media Industry Association, 77% of mobile phone users
surveyed pay by a monthly phone bill compared to 21% who
are on pre-paid plans. The percentage of respondents that have
data included in their payment plans is 84% (M. M. Mackay,
Australian Mobile Phone Lifestyle Index, 9th edn, October 2013,
<www.aimia.com.au/ampli>, accessed 24 January 2014).
a. What other information would you want to know before you
accepted the results of this survey?
b. Suppose that you wished to conduct a similar survey for the
geographic region you live in. Describe the population for
your survey.
c. Explain how you could minimise the chance of a coverage
error in this type of survey.
d. Explain how you could minimise the chance of a
non-response error in this type of survey.
e. Explain how you could minimise the chance of a sampling
error in this type of survey.
f. Explain how you could minimise the chance of a
measurement error in this type of survey.
Continuing cases
Tasman University
Tasman University’s Tasman Business School (TBU) regularly surveys business students on a number of issues. In
particular, students within the school are asked to complete a student survey when they receive their grades each
semester. The results of Bachelor of Business (BBus) students who responded to the latest undergraduate (UG)
survey are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >.
a For each question asked in the survey, determine whether the variable is categorical or numerical. If you
determine that the variable is numerical, identify whether it is discrete or continuous.
b A separate survey has been carried out for Master of Business Administration (MBA) students. Results
for these postgraduate (PG) students are in the file < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Repeat
the analysis you carried out in (a) for the postgraduate survey results.
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Identify data sources and discuss the type of sampling that was most likely used to collect these data.
b Suggest any additional variables that could be collected in order to explain property prices, and
determine if they are numerical or categorical, discrete or continuous.
Chapter 1 Excel Guide
EG1.1 GETTING STARTED WITH MICROSOFT EXCEL
Microsoft Excel is the electronic worksheet program of
Microsoft Office. Although not a specialised statistical
program, Excel contains basic statistical functions, and the
Excel 2016 PC and Mac versions include Data Analysis
Toolpak procedures that you can use to perform selected
advanced statistical methods. To use the Data Analysis
Toolpak you must select it as an Excel add-in. You can also
install the PHStat add-in (available for separate purchase
or with some textbooks) to extend and enhance the Data
Analysis Toolpak that Microsoft Excel contains. (You do
not need to use PHStat in order to use Microsoft Excel with
this text, although using PHStat will simplify using Excel
for statistical analysis.)
In Microsoft Excel, you create or open and save files
that are called workbooks. Workbooks are collections of
worksheets and related items, such as charts, that contain
the original data as well as the calculations and results
associated with one or more analyses. Because of its widespread distribution, Microsoft Excel is a convenient program to use, but some statisticians have expressed concern
about its lack of fully reliable and accurate results for
some statistical procedures. Although Microsoft has
recently improved many statistical functions, especially
from Excel 2010 onwards, you should be somewhat cautious about using Microsoft Excel to perform analyses on
data other than the data used in this text. (If you plan to
install PHStat, make sure you first read Appendix F and
any PHStat read-me file.)
You can use Excel to learn and apply the statistical
methods discussed in this book and as an aid in solving
end-of-section and end-of-chapter problems. For many topics, you may choose to use the ‘Excel How-to’ instructions.
These instructions use pre-constructed worksheets as models or templates for a statistical solution. You learn how to
adapt these worksheets to construct your own solutions.
Many of these sections feature a specific Excel Guide
workbook that contains worksheets that are identical to the
worksheets that PHStat creates. Because both of these
methods create the same results and the same worksheets,
you can use a combination of them as you read through this
book.
The ‘Excel How-to’ instructions and the Excel Guide
workbooks work best with the latest versions of Microsoft
Excel, including Excel 2016 and Excel 2013 (Microsoft
Windows), Excel 2016 for Mac, and Office 365. (Excel
Guides also contain instructions for using the Analysis
ToolPak add-in that is included with most of the latest
Microsoft Excel versions.) (Microsoft Excel 2016, Microsoft Corporation, 2015)
You will want to master the basic skills listed in
Table EG1.1 before you begin using Microsoft Excel to
understand statistical concepts and solve problems. If
you plan to use the ‘Excel How-to’ instructions, you will
also need to master the skills listed in the lower part of
the table. While you do not necessarily need these skills
if you plan to use PHStat, knowing them will be useful if
you expect to customise the Excel worksheets that
PHStat creates or expect to be using Excel beyond the
course that uses this book.
The list of skills in Table EG1.1 begins with the more
basic skills and progresses towards slightly more advanced
skills that you will need to use less frequently.
Table EG1.2 presents the typographic conventions
that the Excel Guides in this book use to present computer
operations.
Table EG1.1
Basic skills for using Microsoft Excel
Excel skill: Specifics
Excel data entry
• Organising worksheet data in columns
• Entering numerical and categorical data
File operations
• Open
• Save
• Print
Worksheet operations
• Create
• Copy and paste
Formula skills
• Concept of a formula
• Cell references
• Absolute and relative cell references
• How to enter a formula
• How to enter an array formula
Workbook presentation
• How to apply format changes that affect the display of worksheet
cell contents
Chart formatting correction
• How to correct the formatting of charts that Excel improperly creates
Discrete histogram creation
• How to create a properly formatted histogram for a discrete probability
distribution
Table EG1.2
Excel typographic conventions
Operation: Examples: Notes
Keyboard keys
• Enter
• Ctrl
• Shift
Notes: Names of keys are always the object of the verb press, as in ‘press Enter’.
Keystroke combinations
• Ctrl+C
• Ctrl+Shift+Enter
• Command+Enter
Notes: Keyboarding actions that require you to press more than one key at the same time.
Ctrl+C means press C while holding down Ctrl.
Ctrl+Shift+Enter means press Enter while holding down both Ctrl and Shift.
Click or select operations
• Click OK
• Select the first 2-D Bar gallery item
Notes: Mouse pointer actions that require you to single click an onscreen object.
This book uses the verb select when the object is either a worksheet cell or an
item in a gallery, menu, list or Ribbon tab.
Menu or ribbon selection
• File ➔ New
• Layout ➔ Legend ➔ None
Notes: A sequence of Ribbon or menu selections.
File ➔ New means first select the File tab and then select New from the list that
appears.
Placeholder object
• variable 1 cell range
• bins cell range
Notes: An italicised bold-faced phrase is a placeholder for an object reference. In making
entries, you enter the reference (e.g. A1:A10) and not the placeholder.
EG1.2 OPENING AND SAVING WORKBOOKS
Once you open the Excel program a new workbook will be
displayed where you can begin entering data in rows and
columns. Figure EG1.1 shows a newly opened workbook in
Excel 2016. It contains the elements that are common with
most Microsoft Windows programs.
If you wish to use a workbook created previously you
will need to use the following commands.
If you are using Microsoft Excel 2016, select File ➔
Open.
In the Backstage view you will be given a choice of
selecting from Recent Workbooks, OneDrive or the
Computer. You can browse, select the file to be opened
and then click on the OK button. If you cannot find your
file, you may need to do one or more of the following:
• Use the scroll bars or the slider, if present, to scroll
through the entire list of files.
• Select the correct folder from the drop-down list at the
left-hand side of the dialog box.
• To search every file in the folder, leave All Files
showing at the bottom of the dialog box. If you
want a specific type of file such as text files, use the
arrow to open a drop-down menu and then select
Text Files.
To save a workbook in Excel 2016, select File ➔ Save As and, in the Backstage view, choose the location. In the dialog box enter (or
edit) the name of the file in the File name box and click on
the OK button. If applicable, you can also do the following:
• Change to another folder by selecting that folder from
the Save in drop-down list.
• Change the Save as type value to something other than
the default choice, Microsoft Excel Workbook. Text
(Tab delimited) or CSV (Comma delimited) are two
file types sometimes used to share Excel data with
other programs.
After saving your work, you should consider saving
your file a second time, using a different name, to create a
backup copy of your work. Read-only files cannot be saved
to their original folders unless the name is changed.
EG1.3 ENTERING DATA
Figure EG1.1
The Excel 2016 window (labelled elements: Quick access toolbar, Title bar, Ribbon, Tabs with the Home tab selected, Group, Launcher button, Formula bar, Column labels, Row labels, Workspace area with opened workbook, Scroll bars, Sheet tab, and the Minimise, Resize and Close buttons)
The main worksheet area is composed of rows and columns
that you use for data entry. You enter data into the rows and
columns of a worksheet. By convention, and the style used
in this book, when you enter data for a set of variables you
enter the name of each variable into the cells of the first
row, beginning with column A. Then you enter the data for
the variable in the subsequent rows to create a DATA worksheet similar to the one shown in Figure EG1.2, which contains data from an auction sale. Note that the formula used
in the active cell F6 can be seen on the formula bar.
To enter data in a specific cell, either use the cursor
keys to move the cell pointer to the cell or use your mouse
to select the cell directly. As you type, what you type
appears in the formula bar. Complete your data entry by
pressing Tab or Enter or by clicking the checkmark button
in the formula bar.
When you enter data, never skip any rows in a column
and, as a general rule, avoid skipping any columns. Also try
to avoid using numbers as row 1 variable headings; if you
cannot avoid their use, precede such headings with apostrophes. Pay attention to any special instructions that occur
throughout the book for the order of the entry of your data.
For some statistical methods, entering your data in an order
that Excel does not expect will lead to incorrect results.
To refer to a specific entry, or cell, you use a
Sheetname!ColumnRow notation. For example, Data!A2
refers to the cell in column A and row 2 in the Data worksheet. To refer to a specific group or range of cells, you use a
Sheetname!Upperleftcell:Lowerrightcell notation. For example, Data!A2:B11 refers to the 20 cells that are in rows 2 to 11
in columns A and B of the Data worksheet. An absolute
address for the cell A6 is shown as $A$6. Even if a formula
using this address is copied to another row or column it will
still refer to this cell. However, if the formula is written with
the relative address A6, copying the formula will change the
reference cell. Both absolute and relative addresses may be
necessary in one sheet depending on the operations intended.
Also note that $A6 freezes the column but not the row and
A$6 freezes the row but allows the column to change.
Figure EG1.2
An example of a DATA worksheet
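The range notation described above can be checked with a short sketch. The Python below is illustrative only and is not part of Excel or of this book's workbooks; the function names `column_index` and `cells_in_range` are hypothetical. It assumes a reference of the form Sheetname!Upperleftcell:Lowerrightcell, with the sheet name and any $ signs optional, and counts the cells the reference covers.

```python
import re


def column_index(letters: str) -> int:
    """Convert a column label such as 'A' or 'AB' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def cells_in_range(reference: str) -> int:
    """Count the cells covered by a reference such as 'Data!A2:B11'."""
    # Drop the optional 'Sheetname!' prefix and any $ signs (absolute markers).
    cell_part = reference.split("!")[-1].replace("$", "")
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_part)
    if match is None:
        raise ValueError(f"not a cell range: {reference!r}")
    col1, row1, col2, row2 = match.groups()
    n_cols = abs(column_index(col2) - column_index(col1)) + 1
    n_rows = abs(int(row2) - int(row1)) + 1
    return n_rows * n_cols


# Data!A2:B11 spans 2 columns (A and B) and 10 rows (2 to 11).
print(cells_in_range("Data!A2:B11"))
```

For Data!A2:B11 the count is 2 columns by 10 rows, which agrees with the 20 cells stated in the text.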
Each Microsoft Excel worksheet has its own name.
Automatically, Microsoft Excel names worksheets in the
form of Sheet1, Sheet2 and so on. You should rename your
worksheets, giving them more self-descriptive names, by
double-clicking on the sheet tabs that appear at the bottom
of each sheet, typing a new name and pressing the Enter key.
EG1.4 USING FORMULAS IN EXCEL WORKSHEETS
Formulas are worksheet cell entries that perform a calculation or some other task. You enter formulas by typing the
equals sign symbol (=) followed by some combination of
mathematical or other data-processing operations.
For simple formulas, you use the symbols +, -, *, /
and ^ for the operations addition, subtraction, multiplication, division and exponentiation (a number raised to a
power), respectively. For example, the formula
=Data!B2 + Data!B3 + Data!B4 + Data!B5
adds the contents of the cells B2, B3, B4 and B5 of the Data
worksheet and displays the sum as the value in the cell
containing the formula. You can also use Microsoft Excel
functions in formulas to simplify formulas. To find lists of
the functions that can be selected in Excel, click on the fx
Function Wizard symbol on the Formula bar. For example,
the formula =SUM(Data!B2:B5), using the Excel SUM()
function, is a shorter equivalent to the formula above.
You can also use cell or cell range references that do
not contain the Sheetname! part, such as B2 or B2:B5. Such
references always refer to the worksheet in which the formula has been entered.
Formulas allow you to create generalised solutions and
give Excel its distinctive ability to recalculate results automatically when you change the values of the supporting
data. Typically, when you use a worksheet, you see only the
results of any formulas entered, not the formulas themselves. However, for your reference, many illustrations of
Microsoft Excel worksheets in this text also show the
underlying formulas adjacent to the results they produce.
When using Excel 2016, select Formulas ➔ Formula
Auditing ➔ Show Formulas to see onscreen the formulas
themselves and not their results. To restore the original
view, click on Show Formulas again.
EG1.5 CREATING CHARTS
The method of creating charts can vary according to the
version of Excel you are using. Both of the following methods are
available in Excel 2016.
• Method 1 A feature in Excel 2016 allows you to create
charts easily using the Quick Analysis tool. Simply
highlight an area of the spreadsheet containing some
data you wish to graph by clicking on the top left-hand
cell, then dragging the mouse. The range may contain
labels. Click on the small box that appears in the bottom right-hand corner to open Quick Analysis. Select
Charts, then, by hovering the mouse over the different
chart types, you can see previews of recommended
charts for the selected data. You can also choose More,
which will open a dialog box with a more extensive
range of options. Once a chart is selected there are several ways you can modify it by clicking on the icons
that appear on its right-hand side. These are Chart
Elements (+), Chart Styles (paintbrush) and Chart
Filters (filter). You will also now see that multiple
design options are shown on the ribbon and that options
to change colours or chart type are shown there. By
right-clicking on the background area of the chart you
can also activate a drop-down menu. If you choose
Format Chart Area a menu will open on the right-hand side of the spreadsheet that allows you to change
the format of the chart and text in many ways. If instead
you choose Move Chart you can choose a new location on another sheet. To reposition the chart on the
existing sheet, simply click on it and drag. To resize it,
drag using one of the circles on its border.
• Method 2 Highlight the area of the spreadsheet with
your data as described above. If you wish to select areas
that are not adjacent, hold down the Ctrl key while
selecting. The area selected must be rectangular. Click
on the Insert tab, then from the Charts area click on the
Recommended Charts and select a particular format
from the drop-down gallery. Alternatively, you can select
a chart type from the icons shown. Once the chart is
­created it can be formatted or enhanced by clicking on it
and following the instructions given for Method 1.
Figure EG1.3 shows an example of a chart created in
Excel 2016 with the Format Axis panel open.
EG1.6 PRINTING WORKBOOKS
Before printing you may select a print area if you do not
want the whole sheet printed. To print Excel 2016 worksheets, select File ➔ Print. A print preview is automatically created, as can be seen in Figure EG1.4. Various print
settings are available in the drop-down list boxes. Clicking
on Page Setup will give access to more choices such as
changing from Portrait to Landscape orientation, as
would suit the worksheet shown. When you are satisfied
with the settings and look of the preview, click on the Print
button.
Note that if you want only a part of the worksheet to be
printed, it is easier to set this using the Page Layout tab, then
Page Setup ➔ Print Area.
Figure EG1.3
An example of a chart created in Excel 2016 with the Format Axis panel open (labelled elements: Chart area, Plot area, Chart title, Legend, Vertical axis title, Horizontal axis title)
Page Setup allows you to customise printing to change
the print orientation, add gridlines and so on before printing. Once you are satisfied with the results, click on the
Print button in the print preview window, then OK in the
Print dialog box.
The Print Backstage view (see Figure EG1.4) contains settings to select the printer to be used, what parts of the workbook
to print (the active worksheet is the default) and the number of
copies to produce (1 is the default). If you need to change these
settings, change them before clicking on the OK button.
Figure EG1.4
The Excel 2016 Backstage view with Print and Page Setup selected
After printing, you should verify the contents of your printout. Most printing failures will trigger the display of an error
message that you can use to work out the source of the failure.
EG1.7 HOW USING EXCEL FOR MAC DIFFERS
Excel 2016 for Mac comes with the Analysis ToolPak
add-in, but earlier versions did not. If you don’t have a
current version, it is possible to download software made
by third-party companies to perform some of the same statistical analysis tasks. The free program StatPlus®:mac LE,
for instance, will allow you to run a regression, calculate
descriptive statistics and run analysis of variance tests. Further capability is available in the Pro edition at a cost.
In Excel 2016 for Mac you can open a new workbook
when the program opens by using New ➔ Blank
Workbook ➔ Create.
The easiest way to save a new workbook is to click on
the Save icon on the quick access toolbar. A Save As dialog
box will allow you to choose a file name, a location for the
file and the file format. You can also choose File ➔ Save to
begin this process.
To create a chart in Excel 2016 for Mac, use Method 2
described in section EG1.5. With the chart selected click on
the Chart Design tab. You will find that extra tabs such as
Add Chart Element, Quick Layout and Switch Row/
Column open on the ribbon to allow more formatting.
To print a worksheet or selection, use File ➔ Print, then
in the Printer list select the printer you wish to use. By
default all active worksheets are printed; to change this,
select Show Details, choose the preferred option from the
drop-down menu and finally select Print.
EG1.8 DEFINING DATA
Establishing the Variable Type
Microsoft Excel infers the variable type from the data you
enter into a column. If Excel discovers a column that contains numbers, for example, it treats the column as a numerical variable. If Excel discovers a column that contains
words or alphanumeric entries, it treats the column as a
non-numerical (categorical) variable.
This imperfect method works most of the time, especially if you make sure that the categories for your categorical variables are words or phrases such as ‘yes’ and ‘no’.
However, because you cannot explicitly define the variable
type, Excel can mistakenly offer or allow you to do nonsensical
things such as using a statistical method that is designed
for numerical variables on categorical variables. If you must
use coded values such as 1, 2 or 3, enter them preceded by
an apostrophe, as Excel treats all values that begin with an
apostrophe as non-numerical data. (You can check whether
a cell entry includes a leading apostrophe by selecting a cell
and viewing the contents of the cell in the formula bar.)
EG1.9 COLLECTING DATA
Recoding Variables
Key technique
To recode a categorical variable, you first copy the original variable’s column of data and then use the find-and-replace function on the copied data. To recode a
numerical variable, or a categorical variable with only
two values, enter a formula that returns a recoded value
in a new column.
Example
Imagine that we have collected data at an airport using a
survey such as shown on page 9. The Recode workbook
shows how the original variables of ‘Accommodation satisfaction’ and ‘Business visit’ have been recoded.
Excel how-to
Two recoded variables were created by first opening the
Airport Survey worksheet in the Recode workbook and
then following these steps:
1. Right-click column B (right-click over the shaded
‘B’ at the top of column B) and click Copy in the
shortcut menu.
2. Right-click column C and click the first choice in
the Paste Options gallery.
3. Enter Accommodation code in cell C1.
4. Select column C. With column C selected, click
Home ➔ Find & Select ➔ Replace.
In the Replace tab of the Find and Replace dialog box:
5. Enter Very satisfied as Find what, 1 as Replace
with, and then click Replace All.
6. Click OK to close the dialog box that reports the
results of the replacement command.
7. Still in the Find and Replace dialog box, enter Very
dissatisfied as Find what (replacing Very satisfied), and 5 as Replace with, then click Replace All.
8. Click OK to close the dialog box that reports the
results of the replacement command.
9. Continue to replace the words Dissatisfied, Satisfied
and Undecided with the numbers 4, 2 and 3 respectively using this method. (This creates the recoded
variable Accommodation code in column C.)
10. Enter Business visit code in cell H1.
11. Enter the formula =IF(F2 = "No", 0, 1) in cell H2.
12. Copy this formula down the column to the last row
that contains Visitor data (row 31). (This creates
the recoded variable Business visit code in column H.)
The Recode workbook uses the IF function to recode the two categories as numbers.
Numerical variables can also be recoded into multiple categories by using a more advanced technique using the VLOOKUP function.
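For readers who want to see the recoding logic outside Excel, the steps above can be sketched in Python. This is an illustration only, not the method used in the Recode workbook: the names `ACCOMMODATION_CODES`, `recode_accommodation` and `recode_business_visit` are hypothetical, and the sample responses are invented for the demonstration. The codes 1 to 5 are the ones given in steps 5 to 9, and the 0/1 rule follows the IF formula in step 11.

```python
# Codes taken from steps 5 to 9 of the Excel how-to above.
ACCOMMODATION_CODES = {
    "Very satisfied": 1,
    "Satisfied": 2,
    "Undecided": 3,
    "Dissatisfied": 4,
    "Very dissatisfied": 5,
}


def recode_accommodation(responses):
    """Replace each whole response with its numeric code."""
    return [ACCOMMODATION_CODES[response] for response in responses]


def recode_business_visit(responses):
    """Mirror =IF(F2 = "No", 0, 1): 'No' becomes 0, any other entry becomes 1.

    The comparison ignores case here, on the understanding that Excel's
    text comparison with = is also case-insensitive.
    """
    return [0 if response.strip().lower() == "no" else 1 for response in responses]


# Hypothetical responses, invented for this illustration only.
sample_accommodation = ["Very satisfied", "Dissatisfied", "Undecided", "Satisfied"]
sample_business = ["Yes", "No", "No"]

print(recode_accommodation(sample_accommodation))
print(recode_business_visit(sample_business))
```

Because the dictionary matches whole responses, the order of the entries does not matter. With find-and-replace the order can matter if matching is partial: replacing Satisfied before Dissatisfied could alter the longer phrases, which may be why the steps above deal with the longer phrases first.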
EG1.10 TYPES OF SAMPLING METHODS
Simple Random Sample
Key technique
Use the RANDBETWEEN(smallest integer, largest integer) function to generate a random integer that can then be
used to select an item from a frame.
Example
Create a simple random sample with replacement of size 40
from a population of 800 items.
Excel how-to
Enter a formula that uses this function and then copy the
formula down a column for as many rows as is necessary.
For example, to create a simple random sample with
replacement of size 40 from a population of 800 items,
open a new worksheet. Enter Sample in cell A1 and
enter the formula =RANDBETWEEN(1, 800) in cell A2.
Then copy the formula down the column to cell A41.
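The same idea can be sketched in Python for comparison. This is not an Excel feature: the function name `sample_with_replacement` and the optional `seed` argument are assumptions made for the illustration. Like RANDBETWEEN, `random.randint` includes both end points, so each draw is an integer from 1 to 800 and the same item can be drawn more than once.

```python
import random


def sample_with_replacement(population_size, sample_size, seed=None):
    """Draw item numbers from 1..population_size, allowing repeats."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return [rng.randint(1, population_size) for _ in range(sample_size)]


# 40 draws from a frame of 800 items, as in the example above.
sample = sample_with_replacement(population_size=800, sample_size=40, seed=1)
print(sample)
```

Each number in the list identifies the position of an item in the frame.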
Excel contains no functions to select a random sample
without replacement. Such samples are most easily created
using an add-in such as PHStat or the Analysis ToolPak, as
described in the following paragraphs.
Analysis ToolPak
Use Sampling to create a random sample with replacement.
For the example, assume you have a worksheet that
contains the population of 800 items in column A and that
contains a column heading in cell A1. Select Data ➔ Data
Analysis. In the Data Analysis dialog box, select Sampling
from the Analysis Tools list and then click OK. In the procedure’s dialog box:
1. Enter A1:A801 as the Input Range and check
Labels.
2. Click Random and enter 40 as the Number of
Samples.
3. Click New Worksheet Ply and then click OK.
Example
Create a simple random sample without replacement of size
40 from a population of 800 items.
PHStat
Use Random Sample Generation.
For the example, select PHStat ➔ Sampling ➔ Random
Sample Generation. In the procedure’s dialog box:
1. Enter 40 as the Sample Size.
2. Click Generate list of random numbers and
enter 800 as the Population Size.
3. Enter a Title and click OK.
Unlike most other PHStat results worksheets, the worksheet created contains no formulas.
Excel how-to
Use the COMPUTE worksheet of the Random workbook
as a template. The worksheet already contains 40 copies of
the formula =RANDBETWEEN(1, 800) in column B.
Because the RANDBETWEEN function samples with
replacement as discussed at the start of this section, you
may need to add additional copies of the formula in new
column B rows until you have 40 unique values.
If your intended sample size is large, you may find it
difficult to spot duplicates. See the ADVANCED worksheet in the Random workbook for more information
about an advanced technique that uses formulas to detect
duplicate values.
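As a rough illustration of the two ideas in this section (spotting duplicates and sampling without replacement), here is a minimal Python sketch. It is not the technique used in the ADVANCED worksheet, whose formulas are not shown here; the function names `find_duplicates` and `sample_without_replacement` are hypothetical.

```python
import random
from collections import Counter


def find_duplicates(values):
    """Return, in ascending order, the values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


def sample_without_replacement(population_size, sample_size, seed=None):
    """Draw distinct item numbers from 1..population_size."""
    rng = random.Random(seed)
    # random.sample never returns the same population member twice.
    return rng.sample(range(1, population_size + 1), sample_size)


print(find_duplicates([3, 5, 3, 9, 5, 1]))
unique_sample = sample_without_replacement(800, 40, seed=1)
print(len(set(unique_sample)))
```

The duplicate check plays the role of scanning column B by eye, while `random.sample` avoids the problem altogether by never selecting the same item twice.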
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 2
Organising and visualising data
FESTIVAL EXPENDITURE
A council is investigating the contribution to the local economy of visitors to an annual
three-day music festival. Kai, a researcher employed by the council, has collected data
from a random sample of non-local festival attendees aged 18 years and over. This data
includes total amount spent, excluding festival tickets, in the region during the festival and
whether the festival attendee has travelled from within the state (intrastate), from another state
(interstate) or from another country (international) to attend the festival.
The data is stored in the < FESTIVAL > file.
Kai is interested in answering the following questions:
■ What is the typical amount spent during the festival by intrastate, interstate and international
visitors?
■ How does the amount spent vary between visitors and between intrastate, interstate and
international visitors?
■ Is there a difference in the amount spent between intrastate, interstate and international
visitors?
© Zoonar/Thomas Willer/age fotostock
LEARNING
OBJECTIVES
After studying this chapter you should be able to:
1 describe the distribution of a single categorical variable using tables and charts
2 describe the distribution of a single numerical variable using tables and graphs
3 describe the relationship between two categorical variables using contingency tables
4 describe the relationship between two numerical variables using scatter diagrams and
time-series plots
5 develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for
descriptive analytics
6 correctly present data in graphs
Kai needs to organise the data into usable forms. One way of doing this is to use tables or
charts to organise and visualise the data. This chapter helps you to select and construct appropriate tables and charts. We can also use numerical measures to determine certain characteristics
of the data, such as their centre and spread. These numerical descriptive measures are covered
in the next chapter.
From Chapter 1 we know that data can be either categorical or numerical.
LEARNING OBJECTIVE
1
Describe the distribution
of a single categorical
variable using tables
and charts
2.1 ORGANISING AND VISUALISING CATEGORICAL DATA
The expenditure data in the < FESTIVAL > file are examples of raw data – that is, data presented
just as they were collected. Raw data give very little information, but by using summary tables
and charts we can condense and present the data in a meaningful way. For categorical data, you
first divide the data into categories and then present the frequency or percentage in each category in a table or chart.
Organising Categorical Data: Summary Table
summary table
Summarises categorical or
numerical data; gives the frequency,
proportion or percentage of data
values in each category or class.
A summary table gives the frequency, proportion or percentage of the data in each category,
which allows you to see differences between the categories. A summary table lists the categories
in one column and the frequency, percentage or proportion in a separate column or columns.
Table 2.1 illustrates a summary table based on a recent survey that asked why people
shopped for groceries online. From this table, stored in < ONLINE SHOPPING >, the most common
reason for grocery shopping online was convenience, followed by competitive prices and
quality products. Very few respondents shopped for groceries online because of a comfortable
environment or well-displayed products.
Table 2.1
Reasons for grocery shopping online
Reason                       Percentage
Comfortable environment               8
Competitive prices                   20
Convenience                          28
Customer service                     13
Products well displayed               3
Quality products                     18
Variety/range of products            10
Total                               100
EXAMPLE 2.1
SUMMARY TABLES FOR LOCATION AND TYPE OF PROPERTIES
In other research Kai is exploring the property market in the council area. Data from 100 recent
property sales is stored in < PROPERTY >. These properties are classified according to location,
either in town or rural, and also by type, either a house or a unit. Construct summary tables for
the properties categorised by location and type.
SOLUTION
Table 2.2A
A frequency and percentage summary table for the location of 100 recent property sales
Location    Number (frequency) of properties    Percentage of properties
Rural                                     34                        34.0
Town                                      66                        66.0
Total                                    100                       100.0
From Table 2.2A we can see that there are approximately twice as many urban properties
sold as rural properties.
Table 2.2B
A frequency and percentage summary table for type of 100 recent property sales
Type     Number of properties    Percentage of properties
House                      82                        82.0
Unit                       18                        18.0
Total                     100                       100.0
From Table 2.2B we can see that there are relatively few units sold.
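To make the construction of a summary table concrete, the following Python sketch counts categories and converts the counts to percentages. It is an illustration under stated assumptions, not the book's Excel procedure: the function name `summary_table` is hypothetical, and the list of locations is rebuilt from the counts in Table 2.2A as a stand-in for the individual records in the < PROPERTY > file, which are not reproduced here.

```python
from collections import Counter


def summary_table(values):
    """Return (category, frequency, percentage) rows in alphabetical order."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (category, frequency, round(100 * frequency / total, 1))
        for category, frequency in sorted(counts.items())
    ]


# Stand-in data: 34 rural and 66 town sales, matching Table 2.2A.
locations = ["Rural"] * 34 + ["Town"] * 66

for category, frequency, percentage in summary_table(locations):
    print(f"{category:<8}{frequency:>6}{percentage:>8.1f}")
```

The printed rows reproduce the frequency and percentage columns of Table 2.2A; the same function would apply to the property type column behind Table 2.2B.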
Visualising Categorical Data: Bar Charts
Each category in a bar chart is represented by a bar, the length of which indicates the proportion,
frequency or percentage of values falling into that category. Figure 2.1 displays a bar chart of the
reasons for grocery shopping online, presented in Table 2.1. Bar charts allow you to compare percentages, frequencies or proportions in the different categories. In Figure 2.1 the most common
reason for shopping online is convenience, followed by competitive prices. Very few respondents
shopped for groceries online because of a comfortable environment or well-displayed products.
bar chart
Graphical representation of a
summary table for categorical data;
the length of each bar represents the
proportion, frequency or percentage
of data values in a category.
Figure 2.1
Microsoft Excel bar chart of the reasons for grocery shopping online (chart title: Bar chart – reasons for grocery shopping online; one horizontal bar per category in Table 2.1; vertical axis: Category; horizontal axis: %, from 0 to 30)
EXAMPLE 2.2
BAR CHART FOR FAMILY TYPE
The council is also interested in demographic differences between the council area and the
capital city. Demographic information has been collected and is stored in < DEMOGRAPHIC_INFORMATION >.
Use the summary tables for family type to construct and interpret bar charts for the council area and the capital city.
SOLUTION
Figure 2.2
Microsoft Excel bar chart for family type (two panels: Bar chart – council area and Bar chart – capital city; categories: Couple with children, Couple no children, One parent, Other; horizontal axis: %)
We can see that, in both areas, the majority of families are couples with or without children,
with a significant number of one-parent families. However, the capital city has approximately
10% more couples without children and 5% fewer one-parent families.
Pie Charts
pie chart
Graphical representation of a
summary table for categorical data,
with each category represented by
a slice of a circle of which the area
represents the proportion or
percentage share of the category
relative to the total of all categories.
A pie chart is a circle, used to represent the total, which is divided into slices, each representing a
category. The area of each slice represents the proportion or the percentage share of the corresponding category. In Table 2.1, for example, 28% of the respondents said that convenience was the main
reason for grocery shopping online. Thus, in constructing the pie chart, the 360° that makes up a
circle is multiplied by 0.28, resulting in a slice of the pie that takes up 100.8° of the 360° of the circle
(Figure 2.3). A pie chart allows you to see the portion of the entire pie that falls into each category.
In Figure 2.3, convenience takes 28% of the pie and products well displayed takes only 3%.
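The slice calculation described above can be repeated for every category with a few lines of Python. This is a sketch for illustration only; the names `reasons` and `slice_angles` are hypothetical, and the percentages are those listed in Table 2.1.

```python
# Percentages from Table 2.1 (reasons for grocery shopping online).
reasons = {
    "Comfortable environment": 8,
    "Competitive prices": 20,
    "Convenience": 28,
    "Customer service": 13,
    "Products well displayed": 3,
    "Quality products": 18,
    "Variety/range of products": 10,
}


def slice_angles(percentages):
    """Convert each category's share of the total into degrees of a circle."""
    total = sum(percentages.values())
    return {name: 360 * value / total for name, value in percentages.items()}


angles = slice_angles(reasons)
for name, angle in angles.items():
    print(f"{name:<28}{angle:6.1f} degrees")
```

Dividing by the total rather than by 100 means the same function works for frequencies as well as percentages. For convenience it gives 360 × 0.28 = 100.8 degrees, matching the figure in the text, and the angles add up to 360 degrees.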
What type of chart should you use? The selection of a chart depends on your intention. If a
comparison of categories is most important, use a bar chart. If observing the portion of the
whole that lies in a particular category is most important, use a pie chart. There should be no
more than eight categories or slices in a pie chart. If there are more than eight, merge the
smaller categories into a category called ‘other’.
Figure 2.3
Microsoft Excel pie chart of the reasons for grocery shopping online (chart title: Pie chart – reasons for grocery shopping online; slices: Convenience 28%, Competitive prices 20%, Quality products 18%, Customer service 13%, Variety/range of products 10%, Comfortable environment 8%, Products well displayed 3%)
EXAMPLE 2.3
PIE CHART FOR FAMILY TYPE
Use the summary tables given for family type in < DEMOGRAPHIC_INFORMATION > to construct
and interpret pie charts for the capital city and the council area.
Figure 2.4
Microsoft Excel pie chart for family type (two panels: Pie chart – council area and Pie chart – capital city; slices: Couple with children, Couple no children, One parent, Other)
We can see that, in both areas, most families are couples with or without children,
with a significant number of one-parent families. However, the capital city has a higher
proportion of couples without children.
Problems for Section 2.1
LEARNING THE BASICS
2.1 A categorical variable has three categories with the following
frequency of occurrence:
Category    Frequency
A                  13
B                  28
C                   9
a. Calculate the percentage of values in each category.
b. Construct a bar chart.
c. Construct a pie chart.
2.2 A categorical variable has four categories with the following
percentages of occurrence:
Category    Percentage
A                   12
B                   29
C                   35
D                   24
a. Construct a bar chart.
b. Construct a pie chart.
2.3 (Pie chart with five slices labelled ABC, Channel 7, Channel 9, Channel 10 and SBS)
The pie chart above was constructed from the results of a
survey of 2,000 viewers to determine which TV channels they
watch for news. By measuring the angle of each one using a
protractor, or estimating by eye, calculate the percentage of
viewers watching:
a. ABC
b. Channel 7
c. Channel 9
d. Channel 10
e. SBS
APPLYING THE CONCEPTS
You can solve problems 2.4 to 2.7 manually or by using Microsoft Excel.
2.4 The following table gives the top 10 websites ranked by
estimated number of unique monthly visitors in March 2017.
Website      Unique monthly visitors (millions)
Google                                    1,600
Facebook                                  1,100
YouTube                                   1,100
Yahoo!                                      750
Amazon                                      500
Wikipedia                                   475
Twitter                                     290
Bing                                        285
eBay                                        285
MSN                                         280
Data obtained from eBusMBA Guide, Top 15 Most Popular Websites
March 2017, at <www.ebizmba.com/articles/most-popular-websites>
accessed 13 March 2017
a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these data?
c. What conclusions can you reach concerning the number of
unique visitors?
2.5 Pat, the owner of Pat’s Cars, asked 200 customers their colour
preference when purchasing a new car. The following summary
table gives the results.
Colour    Frequency
White            56
Blue             31
Red              29
Brown            17
Grey             19
Silver           15
Green            15
Black            13
Other             5
a. Construct bar and pie charts.
b. What colours of cars should Pat have on show?
2.6 The following table gives the labour force status of the Australian
civilian population aged 15 years and over in January 2017.
6202.0 – Labour Force, Australia, Jan 2017
Labour force status (aged 15 years & over)     Total (‘000)
Employed full-time                                  8,066.3
Employed part-time                                  3,762.6
Unemployed looking for full-time work                 561.4
Unemployed not looking for full-time work             213.7
Not in labour force                                 7,096.2
Civilian population aged 15 years and over         19,700.2
Data obtained from Australian Bureau of Statistics, Labour Force, Australia,
January 2017, Cat. No. 6202.0 <www.abs.gov.au/ausstats/abs@.nsf/mf/6202.0>
accessed 15 March 2017
a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these
data?
c. What conclusions can you draw about participation rate –
that is, the percentage of the population in the labour
force?
2.7 Use the summary table for country of birth in
< DEMOGRAPHIC_INFORMATION > to construct pie and bar
charts.
2.2 ORGANISING NUMERICAL DATA
LEARNING OBJECTIVE
2
Describe the distribution of a single numerical variable using tables and graphs
When you have a large amount of raw numerical data, a useful first step is to present the data as
either an ordered array or a stem-and-leaf display.
Suppose you undertake a study to compare the cost of a main meal at similar restaurants in
a city and in the suburbs. Table 2.3 gives the raw data for 50 city restaurants and 50 suburban
restaurants; these data are stored in < RESTAURANT >. From the raw data it is difficult to draw any
conclusions about the price of city and suburban restaurant meals.
Table 2.3
Price per main meal at 50 city restaurants and 50 suburban restaurants
City
50 38 43 56 51 36 25 33 41 44
34 39 49 37 40 50 50 35 22 45
44 38 14 44 51 27 44 39 50 35
31 34 48 48 30 42 26 35 32 63
36 38 53 23 39 45 37 31 39 53
Suburban
37 37 29 38 37 38 39 29 36 38
44 27 24 34 44 23 30 32 25 29
43 31 26 34 23 41 32 30 28 33
26 51 26 48 39 55 24 38 31 30
51 30 27 38 26 28 33 38 32 25
Ordered Arrays
A more meaningful display is obtained by sorting the raw data in order of magnitude – that is,
from smallest to largest. This is called an ordered array. Table 2.4 presents the data in Table 2.3
as ordered arrays. From Table 2.4 you can see that the price of a main meal at city restaurants
is between $14 and $63, and the price of a main meal at suburban restaurants is between $23
and $55.
ordered array
Numerical data sorted by order of
magnitude.
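As a minimal sketch of the idea, the following Python snippet sorts the city prices from Table 2.3 into an ordered array and reads off the smallest and largest values. The function name `ordered_array` is an illustrative choice, not something defined in this text.

```python
# City restaurant main-meal prices ($) from Table 2.3
city = [50, 38, 43, 56, 51, 36, 25, 33, 41, 44,
        34, 39, 49, 37, 40, 50, 50, 35, 22, 45,
        44, 38, 14, 44, 51, 27, 44, 39, 50, 35,
        31, 34, 48, 48, 30, 42, 26, 35, 32, 63,
        36, 38, 53, 23, 39, 45, 37, 31, 39, 53]


def ordered_array(values):
    """Return a new list with the values sorted from smallest to largest."""
    return sorted(values)


city_ordered = ordered_array(city)
# The first and last entries of an ordered array are the minimum and maximum
print(city_ordered[0], city_ordered[-1])
```

The same call applied to the suburban prices would give the second ordered array.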
Stem-and-Leaf Displays
A stem-and-leaf display is a quick and easy way to visually display numerical data. The data are
divided into groups (called stems) such that the values within each group (the leaves) branch
out to the right on each row. The resulting display allows you to see how the data are distributed
and also where they are concentrated.
stem-and-leaf display
Graphical representation of
numerical data; partitions each
data value into a stem portion and
a leaf portion.
Table 2.4 Ordered array of price per main meal at 50 city restaurants and 50 suburban restaurants
City
14 22 23 25 26 27 30 31 31 32
33 34 34 35 35 35 36 36 37 37
38 38 38 39 39 39 39 40 41 42
43 44 44 44 44 45 45 48 48 49
50 50 50 50 51 51 53 53 56 63
Suburban
23 23 24 24 25 25 26 26 26 26
27 27 28 28 29 29 29 30 30 30
30 31 31 32 32 32 33 33 34 34
36 37 37 37 38 38 38 38 38 38
39 39 41 43 44 44 48 51 51 55
To see how a stem-and-leaf display is constructed, suppose that 20 students spend the following amounts at a coffee cart between lectures: < COFFEE >
$6.35 $4.75 $4.30 $5.40 $4.85 $6.60 $5.55 $4.90 $6.85 $7.50
$8.45 $6.05 $9.90 $5.75 $6.80 $4.30 $5.45 $7.20 $7.80 $10.65
To construct a stem-and-leaf display for these data, use the $ amount as the stem and round the
cents to the nearest 10 cents for the leaves. Now list the stem values ($) in order of size to the left
of a vertical divider (|) and then record the leaves (10 cents) for each stem in rows to the right. The
‘unordered’ stem-and-leaf display for the amount spent at the coffee cart by the 20 students is:
stem unit: $    leaf unit: 10 cents
 4 | 83993
 5 | 4685
 6 | 46918
 7 | 528
 8 | 5
 9 | 9
10 | 7
The first value of $6.35 is rounded to 6.4. Its stem (row) is 6 and its leaf is 4. The second value
of $4.75 is rounded to 4.8. Its stem (row) is 4 and its leaf is 8. Then, ordering each leaf, we
obtain the following ordered stem-and-leaf display for the amount spent at the coffee cart by
the 20 students:
stem unit: $    leaf unit: 10 cents
 4 | 33899
 5 | 4568
 6 | 14689
 7 | 258
 8 | 5
 9 | 9
10 | 7
EXAMPLE 2.4
STEM-AND-LEAF DISPLAY FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL >
Construct and interpret a stem-and-leaf display for these data.
SOLUTION
Figure 2.5 PhStat stem-and-leaf display for festival expenditure by interstate visitors
Festival expenditure by interstate visitors
Stem unit: $100    Leaf unit: $10
 2 | 278
 3 | 1235999
 4 | 02335567889
 5 | 1255689
 6 | 00033346689
 7 | 3567789
 8 | 067
 9 | 114
10 | 4
From Figure 2.5 Kai can conclude that during the festival:
• interstate visitors spend between $220 and $1,040
• most interstate visitors spend between $300 and $800
• interstate visitors rarely spend less than $300 or more than $800.
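The construction described earlier for the coffee cart data can be sketched in Python. This is only an illustration under some assumptions: the amounts are entered in cents so that rounding to the nearest 10 cents is exact, halves are rounded up (as when $6.35 becomes 6.4), and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Amounts spent at the coffee cart by the 20 students, in cents
coffee_cents = [635, 475, 430, 540, 485, 660, 555, 490, 685, 750,
                845, 605, 990, 575, 680, 430, 545, 720, 780, 1065]


def stem_and_leaf(cents):
    """Map each dollar stem to its sorted list of 10-cent leaves."""
    display = defaultdict(list)
    for c in cents:
        tenths = (c + 5) // 10            # nearest 10 cents, halves rounded up
        stem, leaf = divmod(tenths, 10)   # stem = whole dollars, leaf = 10-cent units
        display[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(display.items())}


for stem, leaves in stem_and_leaf(coffee_cents).items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Note that this sketch lists only stems that have at least one leaf; a stem with no data values would need to be added as an empty row by hand.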
Problems for Section 2.2
LEARNING THE BASICS
2.8 Form an ordered array given the following data from a sample
of n = 7 mid-semester exam scores in accounting:
68 94 63 75 71 88 64
2.9 Form a stem-and-leaf display given the following data from
a sample of n = 7 mid-semester exam scores in finance:
80 54 69 98 93 53 74
2.10 Form an ordered array given the following stem-and-leaf
display from a sample of n = 7 mid-semester exam scores
in information systems:
stem unit: 10
5
6
7
8
9
leaf unit: 1
0
446
19
2
APPLYING THE CONCEPTS
2.11 Data were collected on the monthly expenses submitted by
35 employees in a firm’s sales team. The data are summarised
in the following stem-and-leaf display:
stem unit: $100    leaf unit: $10
1 | 12489
2 | 0013999999
3 | 01124445899
4 | 11556
5 | 0156
a. Place the data into an ordered array.
b. Which of the two displays provides the most information?
Discuss.
c. In what range are most monthly expense claims?
d. Is there a concentration of expense claims near the centre
of the distribution?
2.12 The following data represent the late payment fee in dollars for
a sample of 22 accounts. < LATE_PAYMENT >
20 45 40 20 40 38 38 45 35 45 35
15 45 35 50 40 45 35 40 45 35 40
a. Display the data as an ordered array.
b. Construct a stem-and-leaf display for the data.
c. Which of the two displays provides the most information?
Discuss.
d. Around what value, if any, are the late payment fees
concentrated? Explain.
2.13 The following data represent ATM fees for withdrawals
made above the free monthly allowance for a sample of
26 transaction accounts. < ATM_FEE >
0.65 0.50 0.70 1.30 2.50 0.50 2.00 1.00 2.00 1.25 1.50 2.00 0.30
2.00 0.65 2.00 0.50 0.65 0.50 0.65 1.60 0.70 1.00 1.50 1.65 0.50
a. Display the data as an ordered array.
b. Construct a stem-and-leaf display for the data.
c. Which of the two displays provides the most information?
Discuss.
d. Around what value, if any, are the withdrawal fees
concentrated? Explain.
2.14 Low-fat foods are not necessarily low calorie, as many are high
in sugar. The following data give calories per 250 ml cup of a
random sample of brands of fresh cow’s milk for sale in
Australia. < FRESH_MILK >
Full cream milk
155 188 160 155 160 163 170 185 135 160 165 160 163
Low- or reduced-fat milk
120 133 133 125 118 113 140 110 128 115
No-fat or skim milk
133 90 90 98 88 85 115 108 88 90 90 98
Data obtained from Calorie King Australia <www.calorieking.com.au> accessed
22 December 2013
For each category of milk:
a. Display the data in ordered arrays.
b. Construct stem-and-leaf displays for the data.
c. Which arrangement provides more information? Discuss.
d. Compare the items in terms of calories. What conclusions
can you make?
2.3 SUMMARISING AND VISUALISING NUMERICAL DATA
LEARNING OBJECTIVE 2
Describe the distribution of a single numerical variable using tables and graphs
Ordered arrays and stem-and-leaf displays are of limited use when we have very large quantities of data or the data are highly variable. In these cases we use tables and graphs to condense
and present the data visually. These include frequency, relative frequency, percentage and
cumulative distributions, as well as histograms and polygons.
Summarising Numerical Data: Frequency Distributions
A frequency distribution allows you to condense a set of data.
frequency distribution
Summary table for numerical data;
gives the frequency of data values
in each class.
class width
Distance between upper and lower
boundaries of a class.
range
Distance measure of variation;
difference between maximum and
minimum data values.
A frequency distribution is a summary table in which the data are arranged into numerically ordered classes or intervals.
To construct a frequency distribution, first select an appropriate number of classes and a
suitable class width. The classes should be exhaustive and mutually exclusive, so that any one
data value belongs to one and only one class. The number of classes chosen depends on the
amount of data – a small number of classes for small amounts of data and a larger number of
classes for larger amounts of data. In general, a frequency distribution should have at least five
classes but no more than 15. If there are too few classes we lose too much information and if
there are too many classes the data are not condensed enough.
Each class should be of equal width. To determine the required (approximate) width of the
classes, divide the range (the highest value – the lowest value) of the data by the required number of classes.
DETERMINING AN APPROXIMATE WIDTH OF A CLASS
\[ \text{Class width} = \frac{\text{range}}{\text{number of classes}} \qquad (2.1) \]
The city restaurant data consist of a sample of 50 restaurants; with this sample size 10 is an
appropriate number of classes. From the ordered array in Table 2.4, the range of the data is
$63 – $14 = $49. Using Equation 2.1, the approximate class width is:
\[ \text{Class width} = \frac{49}{10} = 4.9 \]
Choose a class width that simplifies the reading and interpretation of the distribution and resultant graphs. Therefore, instead of using a class width of $4.90, choose a width of $5.00.
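The arithmetic of Equation 2.1 for the city restaurant data can be sketched as follows. The helper `convenient_width`, which rounds the width up to the next multiple of 5, is a hypothetical convenience for this example only; in practice the choice of a convenient width is a judgement call.

```python
import math


def approximate_class_width(lowest, highest, number_of_classes):
    """Equation 2.1: the range divided by the number of classes."""
    return (highest - lowest) / number_of_classes


def convenient_width(width, step=5):
    """Round a width up to the next multiple of `step` (illustrative choice)."""
    return math.ceil(width / step) * step


raw_width = approximate_class_width(14, 63, 10)   # range is 63 - 14 = 49
chosen_width = convenient_width(raw_width)
print(raw_width, chosen_width)
```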
Construct the frequency distribution table by first establishing clearly defined class boundaries
so that each data value belongs in one and only one class; that is, the classes must be mutually
exclusive and exhaustive. Whenever possible, choose class boundaries that simplify the reading and interpretation of the resultant tables or graphs. For the city restaurant data the price ranges from $14 to
$63, so appropriate classes could be (1) from $10 to less than $15, (2) from $15 to less than $20,
and so on, until we have included the highest data value, in this case $63. The last (11th) class
ranges from $60 to less than $65. The centre of each class, called the class mid-point, is halfway
between the lower boundary and the upper boundary of the class. Thus, the class mid-point for the
first class, from $10 to under $15, is $12.50, that is, (10 + 15)/2; the class mid-point for the second class,
from $15 to under $20, is $17.50, and so on. Table 2.5 gives a frequency distribution of the cost
per meal for the 50 city and the 50 suburban restaurants.
A frequency distribution allows you to draw conclusions about the major characteristics of
the data. For example, Table 2.5 shows that the price of main meals at city restaurants is
concentrated between $30 and $55, whereas the price of main meals at suburban restaurants is clustered between $25 and $40.
For small data sets, one set of class boundaries may provide a different picture from another
set. For example, for the restaurant price data, using a class width of 4.0 instead of 5.0 (as was
used in Table 2.5) may cause shifts in the way in which the values are distributed between
the classes.
You can also get shifts in data concentration when you choose different lower and upper
class boundaries. Fortunately, as the sample size increases, alterations in the selection of class
boundaries affect the concentration of data less and less.
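The steps above (equal-width classes that are mutually exclusive and exhaustive, each with a mid-point) can be sketched in Python for the city restaurant prices. The function name and its parameters are assumptions made for this illustration.

```python
# Ordered array of city restaurant main-meal prices ($) from Table 2.4
city = [14, 22, 23, 25, 26, 27, 30, 31, 31, 32,
        33, 34, 34, 35, 35, 35, 36, 36, 37, 37,
        38, 38, 38, 39, 39, 39, 39, 40, 41, 42,
        43, 44, 44, 44, 44, 45, 45, 48, 48, 49,
        50, 50, 50, 50, 51, 51, 53, 53, 56, 63]


def frequency_distribution(values, start, width, n_classes):
    """Count the values in equal-width classes 'lower but less than upper'."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the classes")  # classes must be exhaustive
        counts[index] += 1
    classes = [(start + i * width, start + (i + 1) * width) for i in range(n_classes)]
    return list(zip(classes, counts))


table = frequency_distribution(city, start=10, width=5, n_classes=11)
for (lower, upper), count in table:
    midpoint = (lower + upper) / 2   # class mid-point
    print(f"${lower} but less than ${upper} (mid-point ${midpoint:.2f}): {count}")
```

Changing `start` or `width` in the call shows how different class boundaries can shift the apparent concentration of a small data set.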
Table 2.5 Frequency distribution of the price per main meal for 50 city restaurants and 50 suburban restaurants
Price of main meal ($)     City frequency   Suburban frequency
$10 but less than $15             1                 0
$15 but less than $20             0                 0
$20 but less than $25             2                 4
$25 but less than $30             3                13
$30 but less than $35             7                13
$35 but less than $40            14                12
$40 but less than $45             8                 4
$45 but less than $50             5                 1
$50 but less than $55             8                 2
$55 but less than $60             1                 1
$60 but less than $65             1                 0
Total                            50                50
class boundaries
Upper and lower values used to
define classes for numerical data.
class mid-point
Centre of a class; representative
value of class.
EXAMPLE 2.5
FREQUENCY DISTRIBUTION FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL >
Construct and interpret a frequency distribution for these data.
SOLUTION
As we have data from 52 interstate visitors, with expenditure during the festival ranging
from approximately $220 to $1,040 (see Figure 2.5), we can choose a class width of $200
with the first class starting at $200.
Table 2.6 Frequency distribution of festival expenditure – interstate visitors
Interstate visitors
Festival expenditure      Frequency
$200 to < $400               11
$400 to < $600               17
$600 to < $800               18
$800 to < $1,000              5
$1,000 to < $1,200            1
Total                        52
From Table 2.6 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.
Relative Frequency and Percentage Distributions
relative frequency distribution
Summary table for numerical data
which gives the proportion of data
values in each class.
percentage distribution
Summary table for numerical data
which gives the percentage of data
values in each class.
Instead of the frequency of the data in each class, knowing the proportion or the percentage of
the data that fall into each class is often more useful. To do this, we use either a relative frequency or a percentage distribution. Also, when comparing two or more samples with different
sample sizes, a relative frequency or percentage distribution should be used.
A relative frequency distribution is obtained by dividing the frequency in each class by the
total number of values. From this a percentage distribution can be obtained by multiplying each
relative frequency by 100%. Thus, the relative frequency of a main meal at city restaurants with
a price between $30 and $35 is 0.14 (7 ÷ 50), and the corresponding percentage is 14%.
Table 2.7 presents the relative frequency and percentage distributions of the price of main meals
at city and suburban restaurants.
From Table 2.7 you can conclude that meals cost more at city restaurants than at suburban
restaurants – 16% of main meals at city restaurants cost between $40 and $45 compared with
8% at suburban restaurants; 16% of main meals at city restaurants cost between $50 and $55
compared with 4% at suburban restaurants; while only 6% of main meals at city restaurants
cost between $25 and $30 compared with 26% at suburban restaurants.
Table 2.7 Relative frequency and percentage distributions of the price of main meals at city and suburban restaurants
                           City                            Suburban
Price of main meal ($)     Relative frequency  Percentage  Relative frequency  Percentage
$10 but less than $15            0.02              2.0           0.00              0.0
$15 but less than $20            0.00              0.0           0.00              0.0
$20 but less than $25            0.04              4.0           0.08              8.0
$25 but less than $30            0.06              6.0           0.26             26.0
$30 but less than $35            0.14             14.0           0.26             26.0
$35 but less than $40            0.28             28.0           0.24             24.0
$40 but less than $45            0.16             16.0           0.08              8.0
$45 but less than $50            0.10             10.0           0.02              2.0
$50 but less than $55            0.16             16.0           0.04              4.0
$55 but less than $60            0.02              2.0           0.02              2.0
$60 but less than $65            0.02              2.0           0.00              0.0
Total                            1.00            100.0           1.00            100.0
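The calculation behind Table 2.7 can be sketched in a few lines of Python, starting from the class frequencies in Table 2.5. The function names are illustrative only.

```python
# Class frequencies from Table 2.5, classes $10-<$15 up to $60-<$65
city_freq = [1, 0, 2, 3, 7, 14, 8, 5, 8, 1, 1]
suburban_freq = [0, 0, 4, 13, 13, 12, 4, 1, 2, 1, 0]


def relative_frequencies(frequencies):
    """Divide each class frequency by the total number of values."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


def percentages(frequencies):
    """Express each class frequency as a percentage of the total."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]


# Relative frequency and percentage for the $30 but less than $35 class (city)
print(relative_frequencies(city_freq)[4], percentages(city_freq)[4])
```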
EXAMPLE 2.6
RELATIVE FREQUENCY DISTRIBUTION AND PERCENTAGE DISTRIBUTION
FESTIVAL EXPENDITURE – INTERSTATE AND INTRASTATE VISITORS
Kai is interested in the amount spent during the festival by festival attendees; in particular if
there is any difference in expenditure between interstate and intrastate visitors. < FESTIVAL >
Construct and interpret frequency and percentage distributions to compare the festival
expenditure of interstate and intrastate visitors.
SOLUTION
Table 2.8 Relative frequency and percentage distributions of festival expenditure – intrastate and interstate
                        Visitors
Festival expenditure    Interstate   Intrastate   Interstate   Intrastate
                        proportion   proportion   percentage   percentage
$0 to < $200              0.000        0.019         0.00         1.92
$200 to < $400            0.212        0.442        21.15        44.23
$400 to < $600            0.327        0.250        32.69        25.00
$600 to < $800            0.346        0.135        34.62        13.46
$800 to < $1,000          0.096        0.115         9.62        11.54
$1,000 to < $1,200        0.019        0.039         1.92         3.85
Total                     1.000        1.000       100.00       100.00
From Table 2.8 Kai can conclude that interstate visitors generally spend more during the
festival than intrastate visitors. However, there is more variation in festival expenditure
among intrastate visitors.
Cumulative Distributions
A cumulative percentage distribution gives the percentage of values that are less than a certain
value. For example, you may want to know what percentage of the city restaurant main meals
cost less than $20, less than $50, and so on. A percentage distribution is used to form the corresponding cumulative percentage distribution. From Table 2.7, 0% of main meals at city restaurants cost less than $10, 2% cost less than $15, 2% also cost less than $20 (since none of the
meals cost between $15 and $20), 6% (2% + 4%) cost less than $25, and so on, until all 100%
of the meals cost less than $65.
Table 2.9 summarises the cumulative percentages for the price of main meals at city and suburban restaurants. The cumulative distribution clearly shows that the cost of main meals is lower
in suburban restaurants than in city restaurants – 34% of main meals at suburban restaurants cost
less than $30 compared with 12% at city restaurants; 60% of main meals at suburban restaurants
cost less than $35 compared with 26% at city restaurants; 84% of main meals at suburban restaurants cost less than $40 compared with 54% at city restaurants.
cumulative percentage
distribution
Summary table for numerical data;
gives the cumulative frequency of
each successive class.
Table 2.9 Cumulative percentage distributions of the price of city and suburban restaurant main meals
Price ($)   City percentage of restaurants   Suburban percentage of restaurants
            less than indicated value        less than indicated value
$10                    0                                0
$15                    2                                0
$20                    2                                0
$25                    6                                8
$30                   12                               34
$35                   26                               60
$40                   54                               84
$45                   70                               92
$50                   80                               94
$55                   96                               98
$60                   98                              100
$65                  100                              100
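A cumulative percentage distribution is a running total of the percentage distribution. The following sketch reproduces the columns of Table 2.9 from the percentages in Table 2.7; the function name is an assumption for this example.

```python
from itertools import accumulate

# Percentages from Table 2.7 for classes $10-<$15 up to $60-<$65
city_pct = [2, 0, 4, 6, 14, 28, 16, 10, 16, 2, 2]
suburban_pct = [0, 0, 8, 26, 26, 24, 8, 2, 4, 2, 0]
boundaries = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def cumulative_percentages(percentages):
    """Percentage of values below each class boundary, starting from 0%."""
    return [0] + list(accumulate(percentages))


city_cum = cumulative_percentages(city_pct)
suburban_cum = cumulative_percentages(suburban_pct)
for boundary, c, s in zip(boundaries, city_cum, suburban_cum):
    print(f"less than ${boundary}: city {c}%, suburban {s}%")
```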
EXAMPLE 2.7
CUMULATIVE PERCENTAGE DISTRIBUTION FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by festival attendees; in particular if
there is any difference in expenditure between interstate and intrastate visitors. < FESTIVAL >
Construct and interpret cumulative distributions to compare festival expenditure of
interstate and intrastate visitors.
SOLUTION
Table 2.10 Cumulative percentage distribution of festival expenditure – intrastate and interstate
                        Visitors
Festival expenditure    Interstate percentage   Intrastate percentage
$0 to < $200                    0.00                    1.92
$200 to < $400                 21.15                   46.15
$400 to < $600                 53.85                   71.15
$600 to < $800                 88.46                   84.61
$800 to < $1,000               98.08                   96.15
$1,000 to < $1,200            100.00                  100.00
From Table 2.10 Kai can conclude that 71% of intrastate visitors spend less than $600 during
the festival while only 54% of interstate visitors spend less than $600. This indicates that,
generally, intrastate visitors spend less during the festival than interstate visitors.
Histograms
histogram
Graphical representation of a
frequency, relative frequency or
percentage distribution; the area of
each rectangle represents the class
frequency, relative frequency or
percentage.
A grouped frequency, relative frequency or percentage distribution can be graphically represented by a histogram. The horizontal axis is divided into intervals corresponding to the classes.
Rectangles are constructed above these intervals, the heights of which measure the frequency,
relative frequency or percentage of data values in the class.
Figure 2.6 displays an Excel frequency histogram for the price of main meals at city restaurants. The histogram indicates that the price of main meals at city restaurants is concentrated
between approximately $30 and $55. Very few meals cost less than $25 or more than $55.
Instead of using class boundaries you can label and identify classes by their mid-point.
Figure 2.6 Excel histogram of the price of main meals at city restaurants
[Histogram titled ‘Histogram price of main meals at city restaurants’; vertical axis: Frequency (0 to 16); horizontal axis: Price – city, labelled by class mid-points]
EXAMPLE 2.8
HISTOGRAM FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL >
Construct and interpret a histogram for the data.
SOLUTION
Figure 2.7 Histogram of festival expenditure – interstate visitors
[Histogram titled ‘Festival expenditure – interstate visitors’; vertical axis: Frequency (0 to 20); horizontal axis: Festival expenditure, $ (0 to 1,400)]
From Figure 2.7 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.
Polygons
When comparing two or more sets of data we can construct polygons on the same set of axes,
allowing for easy interpretation.
PERCENTAGE POLYGON
A percentage polygon is constructed by plotting the percentage for each class above the
respective class mid-point and then joining the plotted points by straight lines. The graph is
extended at each end to classes with a frequency of zero so that the polygon starts and
finishes on the horizontal axis.
percentage polygon
Graphical representation of a
percentage distribution.
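The points that make up a percentage polygon can be sketched as follows, using the city percentages from Table 2.7. The function `polygon_points` is a hypothetical helper; it adds one zero-percentage class at each end so that the polygon starts and finishes on the horizontal axis.

```python
def polygon_points(first_lower, width, percentages):
    """Return (class mid-point, percentage) pairs, padded with a zero class at each end."""
    padded = [0] + list(percentages) + [0]
    first_mid = first_lower - width / 2   # mid-point of the empty class before the first class
    return [(first_mid + i * width, p) for i, p in enumerate(padded)]


# City percentages for classes $10-<$15 up to $60-<$65 (Table 2.7)
city_points = polygon_points(10, 5, [2, 0, 4, 6, 14, 28, 16, 10, 16, 2, 2])
for midpoint, pct in city_points:
    print(midpoint, pct)
```

Joining consecutive points with straight lines gives the polygon; plotting a second list of points for the suburban restaurants on the same axes allows the comparison described in this section.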
Figure 2.8 displays percentage polygons for the price of main meals in city and suburban
restaurants. The polygon for suburban restaurants is concentrated to the left (corresponding to
lower price) of the polygon for city restaurants. The highest percentages of price for suburban
restaurants are for class mid-points of $27.50 and $32.50, while the highest percentages of
price for city restaurants are for a class mid-point of $37.50.
The polygons in Figure 2.8 have plotted points whose values on the horizontal axis represent the class mid-points. For example, for class mid-point $22.50, the plotted point for suburban restaurants (the higher one) represents the fact that 8% of these restaurants have main meal
prices between $20 and $25, while the plotted point for city restaurants (the lower one) indicates that only 4% of these restaurants have main meal prices between $20 and $25.
When constructing polygons or histograms, the vertical axis should show the true zero or
‘origin’ so as not to distort the character of the data. The horizontal axis does not need to specify the zero point for the variable of interest, although the range of the variable should constitute the major portion of the axis.
Figure 2.8 Percentage polygons for the price of main meals in city and suburban restaurants
[Line chart titled ‘Percentage polygon’; series: City, Suburban; vertical axis: % (0 to 30); horizontal axis: Price of main meal ($), class mid-points 7.5 to 62.5]
EXAMPLE 2.9
PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is
a difference between interstate and intrastate visitors. < FESTIVAL >
Construct and interpret percentage polygons to compare the festival expenditure of
interstate and intrastate visitors.
SOLUTION
Figure 2.9 Percentage polygons – festival expenditure
[Line chart titled ‘Festival expenditure’; series: Interstate visitors, Intrastate visitors; vertical axis: % (0 to 50); horizontal axis: $ (100 to 1,300)]
From Figure 2.9 Kai can conclude that intrastate visitors generally spend less during the
festival than interstate visitors.
Cumulative Percentage Polygons (Ogives)
cumulative percentage polygon
(ogive)
Graphical representation of a
cumulative frequency distribution.
A cumulative percentage polygon, or ogive, displays the variable of interest along the horizontal
axis and the cumulative percentages (percentiles) on the vertical axis. A percentile is defined as
‘the value below which a given percentage of observations in a data set fall’.
Figure 2.10 shows the cumulative percentage polygons of the price of main meals at city
and suburban restaurants. Most of the curve for city restaurants is located to the right of the
curve for suburban restaurants. This indicates that city restaurants have fewer main meals that
cost below a particular value. For example, 12% of city restaurant main meals cost less than
$30 compared with 34% of suburban restaurant main meals.
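Reading a value off an ogive amounts to linear interpolation between the plotted points. The sketch below does this for the cumulative percentages in Table 2.9; the function name is hypothetical, and between class boundaries the result is only an approximation that assumes values are spread evenly within each class.

```python
def ogive_value(x, boundaries, cumulative):
    """Cumulative percentage at x, read off the ogive by linear interpolation."""
    if x <= boundaries[0]:
        return cumulative[0]
    if x >= boundaries[-1]:
        return cumulative[-1]
    points = list(zip(boundaries, cumulative))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


boundaries = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
city_cum = [0, 2, 2, 6, 12, 26, 54, 70, 80, 96, 98, 100]
suburban_cum = [0, 0, 0, 8, 34, 60, 84, 92, 94, 98, 100, 100]

# Percentage of main meals costing less than $30 at city and suburban restaurants
print(ogive_value(30, boundaries, city_cum), ogive_value(30, boundaries, suburban_cum))
```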
Figure 2.10 Cumulative percentage polygons of the cost of main meals at city and suburban restaurants
[Line chart titled ‘Cumulative percentage polygon’; series: City, Suburban; vertical axis: % (0 to 100); horizontal axis: Price of main meal ($), 10 to 65]
EXAMPLE 2.10
CUMULATIVE PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is
a difference in expenditure between interstate and intrastate visitors. < FESTIVAL >
Construct and interpret cumulative percentage polygons to compare festival expenditure
of interstate and intrastate visitors.
SOLUTION
Figure 2.11 Cumulative percentage polygons for festival expenditure
[Line chart titled ‘Festival expenditure’; series: Interstate visitors, Intrastate visitors; vertical axis: % (0 to 100); horizontal axis: $ (0 to 1,200)]
In Figure 2.11, we see that the curve for expenditure by intrastate visitors is to the left of
that by interstate visitors. This indicates that generally intrastate visitors spend less during
the festival than interstate visitors.
Problems for Section 2.3
LEARNING THE BASICS
2.15 The values for a data set vary from 11.6 to 97.8.
a. If these values are grouped into nine classes, indicate
appropriate class boundaries.
b. What class width did you choose?
c. What are the corresponding class mid-points?
2.16 The cumulative percentage polygon below shows the amount
spent (in dollars) by 200 customers at a local supermarket.
[Ogive titled ‘Ogive – amount spent at local supermarket’; vertical axis: % (0 to 100); horizontal axis: Amount spent ($), 0 to 200]
a. Approximately what percentage of customers spent less
than $100?
b. Approximately how many customers spent at least $60?
c. Approximately how much did the top 10% of customers
spend?
d. Approximately how much did the bottom 10% of customers
spend?
APPLYING THE CONCEPTS
You can solve problems 2.17 to 2.19 manually or by using Microsoft Excel.
2.17 The following data represent the electricity cost (in dollars)
during the month of July for a random sample of 50
two-bedroom apartments in a New Zealand city. < ELECTRICITY >
Electricity charge ($)
 96 171 202 178 147 102 153 197 127  82
157 185  90 116 172 111 148 213 130 165
141 149 206 175 123 128 144 168 109 167
 95 163 150 154 130 143 187 166 139 149
108 119 183 151 114 135 191 137 129 158
a. Construct a frequency distribution and a percentage
distribution with upper class boundaries of <$100, <$120,
and so on.
b. Plot the corresponding histogram and percentage polygon.
c. Construct the corresponding cumulative percentage
distribution and plot the corresponding ogive (cumulative
percentage polygon).
d. Around what amount does the monthly electricity cost seem
to be concentrated?
2.18 To investigate the variation in fuel prices in New South Wales on
a day in March 2017, a random sample of 45 petrol stations,
each in a different location, was selected. The price per litre of
both unleaded petrol and diesel is recorded in < FUEL_2017 >.
Using the New South Wales data:
a. Construct frequency, percentage and cumulative
distributions for the price of petrol and diesel.
b. As separate graphs, plot frequency histograms for the price
of petrol and diesel.
c. On the same set of axes plot percentage polygons for the
price of petrol and diesel.
d. On the same set of axes plot cumulative percentage
polygons for the price of petrol and diesel.
e. What can you conclude about the variation in the fuel
prices in New South Wales at the time the data were
collected?
2.19 The ordered arrays in the table below give the life (in hours of
usage) of samples of forty 15-watt CFL (compact fluorescent
lamp) energy-saving light bulbs produced by two
manufacturers, A and B. < BULBS >
Manufacturer A
5,544 5,814 6,190 6,307 6,342 6,423 6,429 6,485 6,612 6,667
6,832 6,868 6,879 6,930 6,941 7,007 7,037 7,043 7,059 7,136
7,497 7,645 7,654 7,773 7,816 7,838 7,924 7,999 8,038 8,067
8,091 8,119 8,392 8,416 8,416 8,514 8,532 8,542 8,544 8,731
Manufacturer B
6,701 6,837 6,961 7,118 7,133 7,142 7,156 7,344 7,493 7,569
7,607 7,612 7,651 7,721 7,754 7,767 7,806 7,839 7,888 7,983
8,298 8,344 8,535 8,666 8,792 8,800 8,856 8,861 8,993 9,001
9,036 9,096 9,262 9,385 9,460 9,471 9,521 9,540 9,693 9,744
a. Construct a frequency distribution and percentage
distribution for each manufacturer.
b. Plot the corresponding frequency histograms on
separate graphs and the percentage polygons on the
same graph.
c. Form the cumulative percentage distributions and plot the
ogives on one graph.
d. Which manufacturer has bulbs with a longer life –
manufacturer A or manufacturer B? Explain.
2.4 ORGANISING AND VISUALISING TWO CATEGORICAL VARIABLES
We often wish to study patterns that may exist between two or more categorical variables – for
example, education level and gender.
Organising Two Categorical Variables: Contingency Tables
A contingency (or cross-classification) table presents the data for two categorical variables. The
rows contain the categories of one variable and the columns the categories of the other variable. The intersections of each row and column category, called the cells, contain the joint
responses – that is, the data that are in the row category and also in the column category.
Depending on the type of contingency table constructed, the cells may contain the frequency,
the percentage of the overall total, the percentage of the row total or the percentage of the
column total in both categories.
Suppose that for the 100 recent property sales in the council area introduced in Example 2.1, < PROPERTY >, Kai wishes to explore whether there is a pattern or relationship between
the size of a house or unit (defined by the number of bedrooms) and its location (either in town
or rural).
To construct a contingency table, classify or sort the data into one of the r × c possible
cells in the table, where r is the number of row categories and c is the number of column categories. Note that the cells must be mutually exclusive and exhaustive so that each data value
belongs in one and only one cell. In the contingency table in Table 2.11, we have two row categories, rural or town, and five column categories, from one to more than four bedrooms, so we
are sorting the data into 10 (2 × 5) possible cells. That is, each cell is a combination of number
of bedrooms and location. For example, for properties with more than four bedrooms, in the
sample there are five town properties but only one rural property.
LEARNING OBJECTIVE 3
Describe the relationship between two categorical variables using contingency tables
contingency table (or cross-classification table) – descriptive statistics
Summary table for two categorical variables; each cell represents data that satisfy the given values of both variables.

Table 2.11 Frequency contingency table for number of bedrooms and location

                          Bedrooms
Location       1      2      3      4     >4   Total
Rural          2      5     16     10      1      34
Town           4     14     29     14      5      66
Total          6     19     45     24      6     100
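The sorting step that produces a frequency table such as Table 2.11 can be sketched in Python. This is a minimal illustration only: the eight records below are hypothetical (they are not taken from the PROPERTY data file) and the function name `contingency_table` is an assumption made for this sketch.

```python
from collections import Counter

def contingency_table(records, row_cats, col_cats):
    """Sort (row, column) records into an r x c table of counts, with totals."""
    # Each record must fall in exactly one cell (mutually exclusive, exhaustive).
    for row, col in records:
        if row not in row_cats or col not in col_cats:
            raise ValueError(f"record {(row, col)!r} fits no cell")
    counts = Counter(records)
    table = {r: {c: counts[(r, c)] for c in col_cats} for r in row_cats}
    for r in row_cats:
        table[r]["Total"] = sum(table[r][c] for c in col_cats)
    table["Total"] = {c: sum(table[r][c] for r in row_cats)
                      for c in list(col_cats) + ["Total"]}
    return table

# Hypothetical records: (location, bedroom category) for eight properties
sample = [("Rural", "3"), ("Town", "2"), ("Town", "3"), ("Rural", "4"),
          ("Town", "3"), ("Town", ">4"), ("Rural", "3"), ("Town", "1")]
rows = ["Rural", "Town"]
cols = ["1", "2", "3", "4", ">4"]

table = contingency_table(sample, rows, cols)
for name in rows + ["Total"]:
    print(name, table[name])
```

The check at the top of the function mirrors the requirement that every data value belongs in one and only one cell: a record that matches no row or column category raises an error instead of being dropped silently.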
For further exploration of possible patterns or relationships between number of bedrooms
and location in the council area, Kai can construct contingency tables based on percentages.
To do this, he will convert the cell frequencies into percentages based on one of the following
three totals:
1. The overall total (i.e. the 100 properties)
2. The row totals (i.e. 34 rural and 66 urban properties)
3. The column totals (i.e. number of one-bedroom, two-bedroom, up to more than
four-bedroom properties).
Tables 2.12, 2.13 and 2.14 summarise these percentages.
Table 2.12 Percentage contingency table for number of bedrooms and location based on overall total

                          Bedrooms %
Location       1      2      3      4     >4   Total %
Rural        2.0    5.0   16.0   10.0    1.0      34.0
Town         4.0   14.0   29.0   14.0    5.0      66.0
Total        6.0   19.0   45.0   24.0    6.0     100.0
Table 2.13 Contingency table for number of bedrooms and location based on row total reported as a percentage

                          Bedrooms %
Location       1      2      3      4     >4   Total %
Rural        5.9   14.7   47.1   29.4    2.9     100.0
Town         6.1   21.2   43.9   21.2    7.6     100.0
Total        6.0   19.0   45.0   24.0    6.0     100.0

Table 2.14 Contingency table for number of bedrooms and location based on column total reported as a percentage

                          Bedrooms %
Location       1      2      3      4     >4   Total %
Rural       33.3   26.3   35.6   41.7   16.7      34.0
Town        66.7   73.7   64.4   58.3   83.3      66.0
Total      100.0  100.0  100.0  100.0  100.0     100.0
Table 2.12 shows that 45% of the properties have three bedrooms and that 29% of the
properties are located in town and have three bedrooms. Table 2.13 shows that 47.1% of rural
properties have three bedrooms while only 43.9% of properties located in town have three
bedrooms. Table 2.14 shows that 64.4% of three-bedroom properties are located in town while
35.6% are rural.
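The three percentage conversions can be checked with a short Python sketch. It starts from the cell frequencies in Table 2.11; the variable and function names are assumptions made for this illustration and each percentage is rounded to one decimal place.

```python
# Cell frequencies from Table 2.11 (rows: location; columns: number of bedrooms)
bedrooms = ["1", "2", "3", "4", ">4"]
freq = {
    "Rural": [2, 5, 16, 10, 1],
    "Town": [4, 14, 29, 14, 5],
}

def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)

grand_total = sum(sum(row) for row in freq.values())        # 100 properties
row_totals = {loc: sum(row) for loc, row in freq.items()}   # 34 rural, 66 town
col_totals = [sum(col) for col in zip(*freq.values())]      # 6, 19, 45, 24, 6

# Percentages based on the overall total, the row totals and the column totals
overall_pct = {loc: [pct(f, grand_total) for f in row] for loc, row in freq.items()}
row_pct = {loc: [pct(f, row_totals[loc]) for f in row] for loc, row in freq.items()}
col_pct = {loc: [pct(f, t) for f, t in zip(row, col_totals)]
           for loc, row in freq.items()}

three = bedrooms.index("3")
print(overall_pct["Town"][three])   # share of all properties: town, three bedrooms
print(row_pct["Rural"][three])      # share of rural properties with three bedrooms
print(col_pct["Town"][three])       # share of three-bedroom properties in town
```

The same frequency (29 town properties with three bedrooms) gives 29.0%, 43.9% or 64.4% depending on which total is used as the base, which is why the base should always be stated when a percentage is reported.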
Visualising Two Categorical Variables: Side-by-Side Bar Charts
side-by-side bar chart
Graphical representation of a cross-classification table.
A useful way to display the results of contingency table data is by constructing a side-by-side
bar chart. Figure 2.12, using the data from Table 2.11, is a Microsoft Excel side-by-side bar
chart that compares the number of bedrooms based on the location of the property.
Figure 2.12 Microsoft Excel side-by-side bar chart for number of bedrooms and location
(Chart title: Side-by-side chart for number of bedrooms and location; vertical axis: number of bedrooms, 1 to >4; horizontal axis: number of properties, 0 to 30; bars: Town and Rural)
EXAMPLE 2.11
SIDE-BY-SIDE CHARTS FOR PRICE OF RURAL AND URBAN PROPERTIES
For the 100 recent property sales, construct and interpret side-by-side charts to investigate if
there is a difference between rural and urban property prices. < PROPERTY >
SOLUTION
First, construct a column percentage contingency table for price and location.
Table 2.15 Contingency table for price and location based on percentage of column total

                             Frequency       Column percentage
Asking price ($)           Rural    Town       Rural     Town
300,000 to < 400,000           8      17        23.5     25.8
400,000 to < 500,000           9      32        26.5     48.5
500,000 to < 600,000          12      10        35.3     15.1
600,000 to < 700,000           4       6        11.8      9.1
700,000 to < 800,000           0       0         0.0      0.0
800,000 to < 900,000           1       1         2.9      1.5
Total                         34      66       100.0    100.0
From Table 2.15 we can construct a side-by-side bar chart for location and price.
Figure 2.13 Side-by-side bar chart for location and price
(Chart title: Side-by-side chart for location and asking price; vertical axis: asking price, $300,000 to < $900,000; horizontal axis: %, 0 to 50; bars: Town and Rural)
Figure 2.13 shows that a higher proportion of rural than urban properties have asking prices of
$500,000 or more (50% compared with about 26%), and that approximately 50% of the urban
properties have asking prices between $400,000 and $500,000.
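The column percentages in Table 2.15, and a rough text version of the side-by-side chart, can be sketched in Python from the frequencies in the table. The function names and the `#` bar symbol are assumptions made for this illustration; it is not the Excel procedure used to produce Figure 2.13. Note that rounding each cell independently gives 15.2 for town properties in the $500,000 to < $600,000 band, where Table 2.15 reports 15.1.

```python
price_bands = [
    "300,000 to < 400,000", "400,000 to < 500,000", "500,000 to < 600,000",
    "600,000 to < 700,000", "700,000 to < 800,000", "800,000 to < 900,000",
]
# Frequencies from Table 2.15
freq = {
    "Rural": [8, 9, 12, 4, 0, 1],
    "Town": [17, 32, 10, 6, 0, 1],
}

def column_percentages(counts):
    """Each count as a percentage of its column (location) total."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]

def text_bar(value, symbol="#"):
    """One symbol per percentage point, rounded to a whole number of symbols."""
    return symbol * round(value)

pct = {loc: column_percentages(counts) for loc, counts in freq.items()}

# One pair of bars (Town, Rural) per price band, highest band first as in the figure
for band, town, rural in reversed(list(zip(price_bands, pct["Town"], pct["Rural"]))):
    print(f"${band}")
    print(f"  Town  {text_bar(town)} {town:.1f}%")
    print(f"  Rural {text_bar(rural)} {rural:.1f}%")
```

Plotting percentages rather than frequencies matters here because there are almost twice as many town properties as rural ones; bars based on raw counts would mostly reflect that difference in sample size.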
Problems for Section 2.4
LEARNING THE BASICS
2.20 The following data represent the responses to two questions asked
in a survey of 40 undergraduate students majoring in business:
What is your gender? (M = Male; F = Female; O = Other)
What is your major? (A = Accounting; I = Information Systems;
M = Marketing)
Gender   Major      Gender   Major
M        A          M        I
M        I          M        I
M        I          M        A
F        M          M        A
M        A          F        M
F        I          M        M
F        A          F        I
M        A          F        A
F        I          M        A
M        I          M        A
F        A          F        I
M        A          M        I
M        A          M        A
M        M          M        A
M        I          M        A
F        M          F        A
F        A          M        I
M        A          F        I
F        A          M        A
F        I          M        I
a. Represent the data in a contingency table where the rows
represent the gender categories and the columns the
academic-major categories.
b. Construct cross-classification tables based on percentages
of all 40 student responses, on row percentages and on
column percentages.
c. Using the results from (a), construct a side-by-side bar chart
of gender based on student major.
2.21 Given the following cross-classification table, construct a side-by-side bar chart comparing A and B for each of the three column categories on the vertical axis.

          1      2      3   Total
A        20     40     40     100
B        80     80     40     200
APPLYING THE CONCEPTS
2.22 The Living in Australia Study gives information on the
study mode (full or part time) of students studying for a
post-school qualification, as well as their employment
status.
Percentage of students enrolled in post-school education

                        Studying     Studying
Employment status       full-time    part-time    All students
Employed full-time            6.4         37.7            44.1
Employed part-time           18.1         12.2            30.3
Not employed                 17.4          8.2            25.6
All students                 41.9         58.1           100.0
Data obtained from the Household, Income and Labour Dynamics in Australia
(HILDA) Survey, 2001–2005 (also known as the Living in Australia Study), The
University of Melbourne 1994–2011
a. Construct cross-classification tables based on row
percentages and column percentages.
b. Construct a side-by-side bar chart for employment status
and study mode.
c. What conclusions do you draw from these analyses?
2.23 The following table classifies road fatalities in Australia from
2012 to 2016 (inclusive) by age and gender. < ROAD_FATALITIES_2012_2016 >

                           Gender
Age             Male    Female    Unknown    Total
< 10              89        74          3      166
10 to < 20       402       182          0      584
20 to < 30       990       310          0    1,300
30 to < 40       686       185          0      871
40 to < 50       693       182          0      875
50 to < 60       534       179          0      713
60 to < 70       412       212          0      624
70 to < 80       319       162          0      481
80 to < 90       243       178          0      421
90 or more        56        47          0      103
Unknown            3         1          0        4
Total          4,427     1,712          3    6,142

Data obtained from the Australian Road Deaths Database <www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 18 March 2017
Ignore the unknown categories.
a. Investigate the relationship between age and gender by
constructing a side-by-side bar chart to highlight the pattern
of male and female road fatalities.
b. Discuss the pattern of male and female road fatalities for
2012 to 2016.
2.24 The following data for people aged 15 years and older,
classified by highest level of educational attainment and gender,
were obtained for a certain Australian state:
Highest level of                           Males      Females      Total
educational attainment                    ('000)      ('000)      ('000)
Below Year 10                              238.1       253.9       492.0
Year 10 or equivalent                      326.7       394.4       721.1
Year 11 or equivalent                      102.0        89.4       191.4
Year 12 or equivalent                      492.2       506.8       999.0
Post-secondary below bachelor degree       840.8       687.6     1,528.4
Bachelor degree or higher                  749.8       856.5     1,606.3
Total                                    2,749.6     2,788.6     5,538.2
Data obtained from Australian Bureau of Statistics, Education and Work, Australia,
May 2016, 62270DO001_201605 <www.abs.gov.au> accessed March 2017.
© Commonwealth of Australia
a. Construct a cross-classification table based on column
percentages.
b. Construct a side-by-side bar chart to highlight the
information in (a).
c. Discuss any apparent pattern in male and female education
levels in this Australian state.
2.25 The table below contains the sales of new passenger cars in
New Zealand for February 2016 and 2017. < NZ_CAR_SALES_16_17 >

                        Sales of new cars
Make              February 2017    February 2016
Audi                        176              137
BMW                         160              193
Citroen                      15               10
Dodge                        23               44
Ford                        611              604
Holden                      654              645
Honda                       373              292
Hyundai                     606              470
Jaguar                       26               37
Jeep                         56              100
Kia                         513              407
Land Rover                   93               64
Lexus                        62               59
Maserati                     29                6
Mazda                       755              719
Mercedes Benz               245              164
Mini                         45               44
Mitsubishi                  547              413
Nissan                      346              484
Peugeot                      48               55
Porsche                      22               25
Renault                      30               11
Skoda                       104              102
SsangYong                    93               95
Subaru                      305              208
Suzuki                      624              362
Tesla                        21                1
Toyota                      990              915
Volkswagen                  355              309
Volvo                        48               53
Other                        75              163
Total                     8,050            7,191
Data obtained from Motor Industry Association of New Zealand <www.mia.org.nz>
accessed March 2017, reproduced with permission. © Motor Industry
Association of New Zealand
a. Construct a side-by-side bar chart for the makes of cars.
b. Discuss the changes in the sale of new cars in February
2017 compared with February 2016.
2.5 VISUALISING TWO NUMERICAL VARIABLES
LEARNING OBJECTIVE 4
Describe the relationship between two numerical variables using scatter diagrams and time-series plots
Scatter Diagrams
When analysing a single numerical variable (univariate data), such as the price of a restaurant
meal or festival expenditure, you can use a histogram, polygon or cumulative percentage polygon, as introduced in Section 2.3. When examining the relationship between two numerical
variables (bivariate data), you can use a scatter diagram (or scatter plot) to obtain a picture of a possible
relationship. Plot one variable, the independent variable, on the horizontal (or x) axis and the
other variable, the dependent variable, on the vertical (or y) axis. For example, a marketing analyst could study the effectiveness of advertising by comparing weekly sales volumes and weekly
advertising expenditures. Or a human resources director interested in the salary structure of the
company could compare the employees’ years of experience with their current salaries.
For the data from 100 recent property sales in the council area introduced in Example 2.1,
and stored in < PROPERTY >, a scatter diagram can be used to explore the relationship between number of bedrooms (independent variable) and asking price (dependent variable). For each property, plot the number of bedrooms on the horizontal axis and the corresponding asking price on
the vertical axis. Figure 2.14 gives an Excel scatter diagram for these data.
scatter diagram
Graphical representation of the relationship between two numerical variables; plotted points represent
the given values of the independent variable and corresponding dependent variable.
Figure 2.14 Microsoft Excel scatter diagram for number of bedrooms and asking price
(Chart title: Scatter diagram – 100 recent property sales; horizontal axis: number of bedrooms, 0 to 8; vertical axis: asking price, $0 to $900,000)
As expected, there is a weak increasing (positive) linear relationship, with more bedrooms
associated with higher asking prices.
Other pairs of variables may have a decreasing (negative) relationship, in which one variable decreases as the other increases; for example, the value of a second-hand car decreases as its age increases.
Scatter diagrams are revisited in Chapter 3 when the coefficient of correlation and the covariance are studied, and in Chapter 12 when regression analysis is introduced.
Time-series Plots
A time-series plot is used to study patterns in the value of a variable over time. A time-series
plot displays the time period on the horizontal axis and the variable of interest on the vertical
axis. Figure 2.15 is a time-series plot of the monthly exchange rate of the Australian dollar
against the United States dollar from January 2010 to February 2017. < EXCHANGE_RATE_2010_2017 >
time-series plot
Graphical representation of the value of a numerical variable over time.
Figure 2.15 Microsoft Excel time-series plot of exchange rates: Australian dollar against US dollar 2010 to 2017
(Vertical axis: exchange rate, US$ per AUS$, 0.0 to 1.2; horizontal axis: end of month, Jan 10 to Feb 17)
Source: Data based on Reserve Bank of Australia, Statistics, Exchange Rates <www.rba.gov.au> accessed March 2017.
During 2010 and the first six months of 2011, rates rose steadily from US$0.90 to US$1.10.
They remained between US$1.00 and US$1.10 until 2013, steadily decreased to US$0.80 in
September 2015, and then remained between US$0.80 and US$0.90 until February 2017.
think about this
Rare events
When rare events happen, we often react to them more strongly than to common events with similar
outcomes. Charts and graphs can give us a picture of the situation, helping to put the risk of these rare
events in perspective.
For example, in Australia when there is a shark attack, even if not fatal, there are often calls to protect
beach users from attack, including controlling shark numbers by culling. However, shark attacks are
rare: there are usually between 10 and 15 attacks annually in Australia, of which one or two are fatal, as
shown in the table below. < SHARKS_AND_DROWNINGS >
Shark attacks, Australia

Year    Total attacks    Fatal attacks    Non-fatal
2001                9                0            9
2002                9                2            7
2003                8                1            7
2004               13                2           11
2005               10                2            8
2006                7                1            6
2007               12                0           12
2008               10                1            9
2009               22                0           22
2010               14                1           13
2011               13                4            9
2012               14                2           12
2013               10                2            8
2014               11                2            9
2015               18                1           17
2016               15                2           13
Data obtained from the International Shark Attack File <www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm> accessed May 2014 and
March 2017, © Florida Museum of Natural History, University of Florida
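The summary claims made about this table can be checked with a few lines of Python. This is a sketch only: the figures are those in the table above, while the variable names are assumptions made for this illustration.

```python
from statistics import mean

years = list(range(2001, 2017))
total_attacks = [9, 9, 8, 13, 10, 7, 12, 10, 22, 14, 13, 14, 10, 11, 18, 15]
fatal_attacks = [0, 2, 1, 2, 2, 1, 0, 1, 0, 1, 4, 2, 2, 2, 1, 2]

# Non-fatal attacks are the difference between total and fatal attacks
non_fatal = [t - f for t, f in zip(total_attacks, fatal_attacks)]

mean_total = mean(total_attacks)   # average attacks per year
mean_fatal = mean(fatal_attacks)   # average fatal attacks per year

# Number of years in which the count falls inside the quoted ranges
years_10_to_15 = sum(1 for t in total_attacks if 10 <= t <= 15)
years_1_or_2_fatal = sum(1 for f in fatal_attacks if f in (1, 2))

print(f"Mean attacks per year: {mean_total:.1f}")
print(f"Mean fatal attacks per year: {mean_fatal:.1f}")
print(f"Years with 10 to 15 attacks: {years_10_to_15} of {len(years)}")
print(f"Years with one or two fatal attacks: {years_1_or_2_fatal} of {len(years)}")
```

Averages like these put a single year's count in context, which is the point of the discussion: one unusually high year such as 2009 stands out, but it does not by itself indicate a change in the underlying risk.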
If we compare these mainly non-fatal shark attacks with the number of people drowning annually at
Australian beaches in the same period (see the bar chart below), it is clear that the risk of drowning
while at the beach is far higher than that of being attacked by a shark.
Bar chart: Australia – shark attacks and beach drownings, 2001 to 2016 (vertical axis: 0 to 70; bars: shark attacks and beach drownings)
Drowning data obtained from Royal Life Saving, National Drowning Reports 2001 to 2016 <www.royallifesaving.com.au/facts-and-figures/research-and-reports/drowning-reports> accessed March 2017; shark attack data obtained from International Shark Attack
File, Florida Museum of Natural History, University of Florida <www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm>
A time-series plot of the same data, shown below, indicates that there is no apparent increase in either
the number of shark attacks or the number of drownings at Australian beaches.
Time-series plot: Australia – shark attacks and beach drownings, 2001 to 2016 (vertical axis: 0 to 70; lines: beach drownings and shark attacks)
Problems for Section 2.5
LEARNING THE BASICS
2.26 Below is a set of data from a sample of n = 11 items:
X (horizontal axis)    7    5    8    3    6   10   12    4    9   15   18
Y (vertical axis)     21   15   24    9   18   30   36   12   27   45   54
a. Plot the scatter diagram.
b. Is there a relationship between X and Y? Explain.
2.27 Below is a series of real annual sales (in millions of constant 2010
dollars) for a department over an 11-year period (2007 to 2017):
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Sales 13.0 17.0 19.0 20.0 20.5 20.5 20.5 20.0 19.0 17.0 13.0
a. Construct a time-series plot.
b. Does there appear to be any change in real annual sales
over time? Explain.
APPLYING THE CONCEPTS
You can solve problems 2.28 to 2.32 manually or by using Microsoft Excel.
2.28 For the city and suburban restaurants introduced in Section 2.2,
an independent reviewer rated each restaurant on food quality,
décor and service. Each was given a score out of 30 and then
the three scores were added to give an overall rating out of 90.
< RESTAURANT >
a. Construct a scatter diagram with overall rating on the
horizontal axis and price on the vertical axis.
b. Does there appear to be a relationship between overall rating
and price? If so, is the relationship positive or negative?
2.29 The data in < USED_CARS > were obtained from several used-car yards for 4-cylinder, 4-door sedans.
a. Construct a scatter diagram, with price on the vertical axis, to
investigate the relationship between the age of a car and its price.
b. Construct a scatter diagram, with price on the vertical axis,
to investigate the relationship between the kilometres
travelled by a car and its price.
c. What conclusions can you reach about the relationship
between the age or the kilometres travelled and the price of
a used car? Are these the relationships you expected?
2.30 To investigate the variation in fuel prices in New South Wales on
a given day, a random sample of 45 towns and suburbs was
selected. The average price per litre of both unleaded petrol and
diesel is recorded in < FUEL_MARCH_2017 >.
Using the New South Wales data:
a. Construct a scatter diagram to investigate the relationship
between petrol and diesel prices.
b. What conclusions can you reach about the relationship
between petrol and diesel prices?
2.31 The data file < UNEMPLOYMENT_RATE_2007_2017 > gives the
monthly Australian unemployment rate (seasonally adjusted)
from March 2007 to February 2017.
a. Construct a time-series plot for the unemployment rate.
b. Does there appear to be any pattern?
2.32 A general measure of inflation is the annual increase in the
consumer price index (CPI). The table below gives the annual
increase in the CPI in Australia and New Zealand.
< INFLATION_2011_2016 >

Year to    Australia rate %    NZ rate %
Mar 11                  3.3          4.5
Jun 11                  3.5          5.3
Sep 11                  3.4          4.6
Dec 11                  3.0          1.8
Mar 12                  1.6          1.6
Jun 12                  1.2          1.0
Sep 12                  2.0          0.8
Dec 12                  2.2          0.9
Mar 13                  2.5          0.9
Jun 13                  2.4          0.7
Sep 13                  2.2          1.4
Dec 13                  2.7          1.6
Mar 14                  2.9          1.5
Jun 14                  3.0          1.6
Sep 14                  2.3          1.0
Dec 14                  1.7          0.8
Mar 15                  1.3          0.3
Jun 15                  1.5          0.4
Sep 15                  1.5          0.4
Dec 15                  1.7          0.1
Mar 16                  1.3          0.4
Jun 16                  1.0          0.4
Sep 16                  1.3          0.4
Dec 16                  1.5          1.3

Data obtained from Reserve Bank of Australia <www.rba.gov.au> and Reserve
Bank of New Zealand <www.rbnz.govt.nz> accessed March 2017
a. Investigate the relationship between the inflation rates for
the two countries by constructing time-series plots on the
same set of axes.
b. What conclusions can you make about the inflation rates of
the two countries?
2.6 BUSINESS ANALYTICS APPLICATIONS – DESCRIPTIVE ANALYTICS
LEARNING OBJECTIVE 5
Develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for descriptive analytics
As business people gain the ability to retrieve and process larger amounts of data in smaller
amounts of time, sometimes approaching near real time, some have asked: At what point does
the need for using samples to expedite analysis disappear? Might there not be a day when business decision makers could just analyse all the data continuously as it flows into the business in
near real time?
While, in most cases, continuous data analysis is not yet a reality, these questions
taken together have created the demand for methods known collectively as
business analytics. Analytics represents an evolution of pre-existing statistical methods
combined with advances in information systems and techniques from management science. Analytics is naturally interdisciplinary, and this nature underscores how important
statistics is as part of your business education.
Descriptive analytics, predictive analytics and prescriptive analytics form the three
broad categories of analytic methods. Descriptive analytics explores business activities that
have occurred or are occurring now. Predictive analytics identifies what is likely to occur in
the (near) future and finds relationships in data that may not be readily apparent using
descriptive analytics. Prescriptive analytics investigates what should occur and prescribes the
best course of action for the future.
We may use a number of organising and visualisation tools to aid our descriptive analytics.
Giving decision makers the ability to combine, collect, organise and visualise data that could be
used for day-to-day, if not minute-by-minute, business monitoring in the present, rather than
business activity in the past, is one of the main goals of descriptive analytics.
Being able to do real-time monitoring can be useful for a business that handles a perishable
inventory. Perishable inventory is inventory that will disappear after a particular event takes
place, such as an airplane taking off for its destination or the end of a concert. Empty seats on
the airplane or at the concert cannot be sold later. Perishable inventory also occurs with less
tangible inventory, such as spaces reserved for advertisements on a commercial web page—
such spaces cannot be sold after the page has been viewed. In the past, the problem of perishable inventory was handled by models that predicted consumer behaviour based on historical
patterns. A concert promoter sets prices based on the best estimation of ticket-buying behaviour. Today, by constantly monitoring sales, the promoter can use a dynamic pricing model in
which the price of tickets could fluctuate in near real time based on whether sales are exceeding
or failing to meet predicted demand.
Real-time monitoring can also be useful for a business that manages flows of people or
objects that can be adjusted in near real time, especially when there is more than one flow and
the flows are interrelated. For example, overseers of a large sports stadium could benefit from
monitoring the flows of cars in parking facilities, as well as the flow of fans into the stadium,
and redirect stadium personnel to assist at points of congestion.
The managers of WaldoLands, the theme park that licenses the characters from the Waldowood
stories, seek to stabilise and grow their business. During the most recent tourist season,
their park was plagued by a number of major ride breakdowns, long lines at popular attractions and key food service areas, and a general inability to respond to the park’s day-to-day
operating status. Last year’s problems led to numerous unfavourable reviews in key social
media travel websites, and the managers are concerned that possible patrons may decide to
visit competing parks run by Universal Parks & Resorts and Six Flags Entertainment. For
this year, the managers have added the LineJumper service that allows patrons to ‘jump’ to
the head of a line, and are offering the premium-priced No-Stress-Express experience that
offers special guided tours and behind the scenes access. The managers also hope the new
multimillion-dollar Rabbit Creek Racers and a greatly expanded MirrorGate Experience,
based on a popular sci-fi franchise, will boost attendance, even though they fret about the
technical complexity of these rides.
In the WaldoLands scenario, managers could monitor flows of patrons through the ticket
booths and into the theme park while also keeping an eye on the length of waiting lines and the
use of the LineJumper service. This would allow the managers to adjust ride lengths or dispatch
live performers to entertain patrons in line, and to try to redirect patrons to areas of the park that
are currently under capacity.
business analytics
Skills, technologies and practices
for continuous iterative exploration
and investigation of past business
performance to gain insight and
drive business planning.
descriptive analytics
A form of business analytics that
explores business activities that
have occurred or are occurring in
the present moment.
predictive analytics
A form of business analytics that
identifies what is likely to occur in
the (near) future and finds
relationships in data that may not
be readily apparent using
descriptive analytics.
prescriptive analytics
A form of business analytics that
investigates what should occur and
prescribes the best course of action
for the future.
Dashboards
Over several decades, people talked about developing executive information systems that would
put information at the ‘fingertips’ of decision makers. Many of these efforts have spurred the
development of dashboards that use descriptive analytics methods to present up-to-the-minute
operational status about a business.
dashboard
Descriptive analytics methods to
present up-to-the-minute
operational status about a business.
An analytics dashboard provides this information in a visual form that is intended to be
easy to comprehend and review. Dashboards can contain the summary tables and charts discussed earlier in this chapter, as well as newer or more novel forms of information presentation
that can summarise big data as well as smaller sets of data. The dashboard in Figure 2.16 displays key WaldoLands operational statistics that are updated on a near-real-time basis. Clicking
one of the categories would lead to other displays that contain additional information about
theme park operations.
Figure 2.16
A WaldoLands dashboard
Source: The contents,
descriptions, and characters
of WaldoLands and
Waldowood are copyright
© 2018, 2014, 2011
Waldowood Productions, and
used with permission.
sparklines
A descriptive analytics method that summarises time-series data as small, compact graphs designed to
appear as part of a table.
Sparklines are one of the descriptive analytic methods that dashboards can contain.
Sparklines summarise time-series data as small, compact graphs designed to appear as part of a
table (or a written passage). In Figure 2.17, sparklines display the wait times for WaldoLands
attractions at half-hour intervals for the current day, helping to provide context for the current
wait times that are indicated by the dot markers. For example, the sparkline for the Rabbit
Springs Racers ride shows that the current wait time is one of the longest wait times for the day.
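The idea of a sparkline can be illustrated with a short Python sketch that draws a series as a single row of Unicode block characters. This is not how the dashboard in Figure 2.17 was produced; the function name `sparkline` and the wait times below are hypothetical and are used only to show the principle.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, from lowest to highest

def sparkline(values):
    """Return one block character per value, scaled between the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no variation to show; draw every point at the lowest height
        return BLOCKS[0] * len(values)
    span = hi - lo
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in values)

# Hypothetical wait times (minutes) recorded at half-hour intervals for one ride
wait_times = [10, 15, 25, 40, 55, 50, 45, 60]

# The sparkline sits next to the current value, as it would in a table cell
print(f"Hypothetical ride  {sparkline(wait_times)}  {wait_times[-1]} min")
```

Because each series is scaled to its own minimum and maximum, a sparkline shows the shape of the series and where the current value sits within the day, not absolute levels; the current value is therefore printed alongside it.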
Figure 2.17
WaldoLands wait times
table with sparklines
Source: The contents,
descriptions, and characters
of WaldoLands and
Waldowood are copyright
© 2018, 2014, 2011
Waldowood Productions, and
used with permission.
gauges
A visual display of data inspired by
the speedometer in a car.
bullet graph
A horizontal bar chart inspired by a
thermometer.
Analogous to automotive dashboards, analytic dashboards can provide warnings when predefined conditions are met or exceeded. Figure 2.18 contains a set of gauges and a bullet graph
that both display the wait-line status for WaldoLands attractions. These displays combine a single
numerical measure (wait time) with one of five categorical values that rates the wait time subjectively, from excellent (less than 25 minutes) to poor (more than 85 minutes). While gauges have
been a popular choice in business, most information design specialists prefer bullet graphs because
those graphs foster the direct comparison of each measurement (wait time in Figure 2.18). Gauges
can also consume a lot of visual space in a dashboard. For example, in Figure 2.18, note the
amount of space the gauges consume to show the status of the six most popular rides. The corresponding bullet graph can display the status of 14 rides and present the wait times in a way that
facilitates comparisons. For these reasons, some consider gauges little more than examples of
chartjunk (see reference 1), even as many decision makers request them due to their visual appeal.1
chartjunk
Unnecessary information and detail that reduces the clarity of a graph.
chartjunk
Unnecessary information and detail
that reduces the clarity of a graph.
Figure 2.18 Gauges and bullet graph of wait times for WaldoLands attractions
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used
with permission.
Dashboards may also contain treemaps that help users to visualise two variables, one of
which must be categorical. Treemaps are especially useful when categories can be grouped to
form a multilevel hierarchy or tree. Figure 2.19 displays a pair of treemaps that visualise the
number of social media comments made today about WaldoLands attractions (the size of each
rectangle). The left treemap shows each ride grouped by the ‘land’ of WaldoLands (StrausLand,
the BWLand or FamilyLand) where the attraction is found. The right treemap shows the data
for the six most popular WaldoLands attractions, illustrating that treemaps can be used with
non-hierarchical information as well.
treemaps
A descriptive analytics method that helps visualise two variables, one of which must be categorical.
Figure 2.19 Treemaps of number and favourability of social media comments about WaldoLands attractions
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used
with permission.
1
This tension between what decision makers might find visually appealing and what statisticians and information
specialists have found most useful reflects the relative newness of these descriptive methods. Over time, this tension
may ease and an acceptable standard for representing such information may emerge.
When combined with the Figure 2.18 gauges or bullet graph, the treemap on the right in
Figure 2.19 would allow managers to preliminarily conclude that the negativity of comments
seems to be tied to current wait lines and that rides with the shortest wait lines may generate
the fewest social media comments. These relationships could then be further investigated
and, if the former one was confirmed, managers could, in the future, respond to excessive
wait lines by shortening the ride length to handle more customers, sending live performers to
entertain those waiting in line or instructing park staff to divert incoming park patrons to
other rides.
Note that gauges, bullet graphs and treemaps use colour to represent the value of a
second variable, thereby increasing the data density of the displays – one of the principles
of good information design (see reference 2). However, when using these displays, particularly bullet graphs and treemaps, avoid using colour spectrums that run from red to green,
the two colours most subject to confusion due to colour vision deficiencies. (This is less of
a problem with gauges, as colours subject to confusion will have unique positions on the
gauge dial.)
Data Discovery
data discovery
Methods used to take a closer look
at historical or status data, to
quickly review data for unusual
values or outliers, or to construct
visualisations for management
presentations.
drill-down
The revealing of the data that
underlie a higher-level summary.
Data discovery methods allow decision makers to interactively organise or visualise data and
perform preliminary analyses. These methods can be used to take a closer look at historical
or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations. In these ways, data discovery realises the earlier promise of executive information systems to give decision makers the tools of data exploration
and presentation.
In its simplest version, data discovery involves drill-down, the revealing of the data
that underlie a higher-level summary. For example, clicking the merchandise entry in the
Figure 2.16 WaldoLands dashboard would reveal more detailed information such as the
table of sales by ‘lands’ shown in the left table in Figure 2.20. In turn, this summary can
be drilled down to reveal sales by each store in the theme park (see table on the right in
Figure 2.20). At this level of detail, sales at Peri’s Playtime are noticeably lower than
at the other stores, perhaps suggesting that this store be closed, relocated, or have its merchandise mix reconsidered.
Figure 2.20
WaldoLands merchandise
sales summarised on two
different levels
Source: The contents,
descriptions, and characters
of WaldoLands and
Waldowood are copyright
© 2018, 2014, 2011
Waldowood Productions, and
used with permission.
Another level of drill-down (not shown) would reveal the sales of each item or SKU (stock-keeping unit) sold in each store. By reorganising that list by item, WaldoLands managers could
discover which items are selling the best and may be subject to stockouts.
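The drill-down idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales amounts, the store names other than Peri's Playtime, and the assignment of stores to lands are all hypothetical and do not come from Figure 2.20.

```python
from collections import defaultdict

# Hypothetical sales records: (land, store, sales in dollars).
# All amounts, 'Store B' to 'Store D', and the land each store sits in are invented.
SALES = [
    ("FamilyLand", "Peri's Playtime", 1200),
    ("FamilyLand", "Store B", 8400),
    ("StrausLand", "Store C", 9100),
    ("StrausLand", "Store D", 7600),
]


def summarise(records, key_index):
    """Total the sales amount for each value found at position key_index."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


def drill_down(records, land):
    """Reveal the store-level sales that underlie one land's total."""
    return summarise([r for r in records if r[0] == land], key_index=1)


by_land = summarise(SALES, key_index=0)            # higher-level summary
family_stores = drill_down(SALES, "FamilyLand")    # data underlying one entry
print(by_land)
print(family_stores)
```

The store-level figures always add back to the land-level total, which is what makes a drill-down a view of the same data at a finer level rather than a separate analysis.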
Problems for Section 2.6
2.33 The Edmunds.com NHTSA Complaints Activity Report is the
result of the examination of the frequency, trends and
composition of consumer vehicle complaint submissions at the
car manufacturer, brand and category levels (data obtained
from <www.edmunds.com/car-news/nhtsa-complaints-report.html>).
The table below, stored in < AUTOMAKER1 >, contains
complaints received by six car manufacturers for January 2013.
When the number of complaints is less than 300, the complaint
rating is considered to be low; when the number of complaints
is between 300 and 500, the complaint rating is considered to
be medium; and when the number of complaints is more than
500, the complaint rating is considered to be high.
Car manufacturer             Number of complaints
American Honda               169
Chrysler LLC                 439
Ford Motor Company           440
General Motors               551
Nissan Motors Corporation    467
Toyota Motor Sales           332
a. Construct a gauge for each car manufacturer.
b. Construct a bullet graph for the car manufacturers.
c. Which display is more effective at comparing the number of
complaints for each car manufacturer?
2.34 There is a very large number of mutual funds from which an
investor can choose. Each mutual fund has its own mix of
different types of investments. The file < BEST_FUNDS1 >
contains the one-year return percentage and the three-year
annualised return percentage for the 10 best short-term bond
and long-term bond funds according to the U.S. News & World
Report score. (data obtained from <www.money.usnews.com/
mutual-funds/rankings>).
a. Construct bullet graphs of the one-year returns and the three-year returns. For the purposes of comparison, consider a
return below 5% as low-performing, a return between 5 and
10% as medium-performing and a return above 10% as high-performing.
b. Why would you not want to construct a gauge for each bond
fund?
c. What conclusions can you reach about the one-year and
three-year return percentages for the short-term bond and
long-term bond funds?
2.35 A financial analyst was interested in comparing the price-to-book ratio (P/B) of pharmaceutical companies. The analyst
collected P/B ratios for 71 pharmaceutical companies (Industry
Group SIC 3 code: 283) and stored them as part of the file
< BUSINESS_VALUATION >.
a. Visually evaluate the P/B ratios by constructing a bullet
graph. For the purposes of comparison, consider a P/B ratio
that is 2 or less as excellent, a P/B ratio that is between
2 and 5 as acceptable, and a P/B ratio that is above 5 as
unacceptable.
b. Why would using gauges be a poor choice for this analysis?
c. Are the three groupings of P/B ratios helpful in analysing the
data? What constitutes an acceptable P/B ratio varies by
industry and is partially based on subjective analysis. For the
purposes of information presentation, would you redefine or
subdivide the current acceptable category?
2.36 The file < BB_COST_2012 > contains the total cost (in $) for
four tickets, two beers, four soft drinks, four hot dogs, two
game programs, two baseball caps and parking for one
vehicle at each of the 30 Major League Baseball (MLB) parks
during the 2012 season. (data obtained from <http://
fancostexperience.com>).
a. Visually evaluate the total cost at each MLB park by
constructing a bullet graph. For the purposes of comparison,
consider a total cost (in dollars) less than $180 as
inexpensive, between $180 and $240 as typical, and more
than $240 as expensive.
b. Which display best visualises the distribution of costs - the
bullet graph or a stem-and-leaf display? Why?
c. Name something that the bullet graph reveals about the data
that the stem-and-leaf display does not. How could that be
used as the basis for future analysis of total costs at MLB
parks?
2.37 Referring to the movie attendance data between 2002 and
2012 (stored in < MOVIE_ATTENDANCE2 >):
a. Construct a sparkline graph for movie attendance between
2002 and 2012.
b. What conclusions can you reach about movie attendance
between 2002 and 2012?
c. When would using a sparkline graph be the better choice to
visualise these data? When would using the time-series plot
be the better choice?
d. Might you ever use both a sparkline graph and a time-series plot in the same analysis report? Explain your
reasoning.
2.38 The file < STOCK_INDICES > contains the data that represent
the total rate of return (as a percentage) for the Dow Jones
Industrial Average (DJIA), the Standard & Poor’s 500 (S&P500)
and the technology-heavy NASDAQ Composite (NASDAQ) from
2006 through 2012. (data obtained from <https://finance.yahoo.com>
accessed 29 March 2013).
a. Construct sparklines for the annual rate of return for the
DJIA, S&P500 and NASDAQ from 2006 to 2012.
b. What conclusions can you reach concerning the annual rates
of return of the three market indices?
2.39 From 2006 to 2012, the value of precious metals fluctuated
dramatically. The file < METAL_INDICES > contains the total
rate of return (as a percentage) for platinum, gold and silver
from 2006 through 2012. (data obtained from <https://finance.yahoo.com>
accessed 29 March 2013).
a. Construct sparklines for the annual rate of return for
platinum, gold and silver from 2006 to 2012.
b. What conclusions can you reach concerning the rates of
return of the three precious metals?
c. Compare the results of (b) to those of Problem 2.38(b).
2.40 Drive-through service time is an important quality attribute for
fast-food chains. The data in < SERVICE_TIME > are the mean
service times for Burger King, Chick-Fil-A, McDonald’s and Wendy’s
in 12 recent years. (data obtained from <bit.ly/qhvP3Zb>).
a. Construct sparklines of the mean service times for Burger
King, Chick-Fil-A, McDonald’s and Wendy’s in 12 recent years.
b. What conclusions can you reach concerning the mean
service times for Burger King, Chick-Fil-A, McDonald’s and
Wendy’s in 12 recent years?
2.41 Sales of cars in the United States fluctuate from month to month
and year to year. The data in the file < AUTO_SALES > represent
the sales for various manufacturers in July 2013 and
the change from July 2012 sales in percentages. (data
obtained from <www.nytimes.com/interactive/2013/08/01/
business/How-the-Auto-Industry-Fared-in-July.htm>).
a. Construct a treemap of the sales of cars and the change in
sales from July 2012.
b. What conclusions can you reach concerning the sales of
cars and the change in sales from July 2012?
2.42 The value of a National Basketball Association (NBA) franchise
has increased dramatically over the past few years. The value of
a franchise varies based on the size of the city in which the
team is located, the amount of revenue it receives and the
success of the team. The file < NBA_VALUES > contains the
value of each team and the change in value in the past year.
(data obtained from <www.forbes.com/nba-valuations>).
a. Construct a treemap that visualises the values of the NBA
teams (size) and the one-year changes in value (colour).
b. What conclusions can you reach concerning the value of
NBA teams and the one-year change in value?
2.43 The annual ranking of the FT Global 500 2013 provides a snapshot
of the world’s largest companies. The companies are ranked by
market capitalisation—the greater the sharemarket value of a
company, the higher the ranking. The market capitalisations (in
billions of dollars) and the 52-week change in market capitalisations
(in percentages) for companies in the Automobile & Parts, Financial
Services, Health Care Equipment & Services and ­Software &
Computer Services sectors are stored in < FT_GLOBAL500 > (data
obtained from <www.ft.com/intl/indepth/ft500>).
a. Construct a treemap that presents each company’s market
capitalisation (size) and the 52-week change in market
capitalisation (colour) grouped by sector and country.
b. Which sector seems to have the best gains in the market
capitalisations of its companies? Which sectors seem to
have the worst gains (or greatest losses)?
c. Construct a treemap that presents each company’s market
capitalisation (size) and the 52-week change in market
capitalisation (colour) grouped by country.
d. What comparison can be more easily made with the
treemap constructed in (c) than with the treemap
constructed in (a)?
2.44 Your task as a member of the International Strategic
Management Team at your company is to investigate the
potential for entry into a foreign market. As part of your initial
investigation, you must provide an assessment of the
economies of countries in the Americas and the Asia and Pacific
regions. The file < DOING_BUSINESS > contains the 2012 GDPs
per capita for these countries as well as the number of Internet
users in 2011 (per 100 people) and the number of mobile phone
subscriptions in 2011 (per 100 people). (data obtained from
<https://data.worldbank.org>).
a. Construct a treemap of the GDPs per capita (size) and their
number of Internet users in 2011 (per 100 people) (colour)
for each country grouped by region.
b. Construct a treemap of the GDPs per capita (size) and their
number of mobile phone subscriptions in 2011 (per 100
people) (colour) for each country grouped by region.
c. What patterns to these data do the two treemaps suggest?
Are the patterns in the two treemaps similar or different?
Explain.
2.45 Using the sample of retirement funds stored in
< RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and risk.
b. Drill down to examine the large-cap growth funds with high
risk. How many funds are there? What conclusions can you
reach about these funds?
2.46 Using the sample of retirement funds stored in
< RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and rating.
b. Drill down to examine the large-cap growth funds with a
rating of three. How many funds are there? What
conclusions can you reach about these funds?
2.47 Using the sample of retirement funds stored in
< RETIREMENT_FUNDS >:
a. Construct a table that tallies market cap, risk and rating.
b. Drill down to examine the large-cap funds that are high risk
with a rating of three. How many funds are there? What
conclusions can you reach about these funds?
2.48 Using the sample of retirement funds stored in
< RETIREMENT_FUNDS >:
a. Construct a table that tallies type, risk and rating.
b. Drill down to examine the growth funds that are high risk
with a rating of three. How many funds are there? What
conclusions can you reach about these funds?
2.49 Using the sample of retirement funds stored in
< RETIREMENT_FUNDS >:
a. What are the attributes of the fund with the highest five-year
return?
b. What five-year returns are associated with small market cap
funds that have a rating of five stars?
c. Which fund(s) in the sample have the lowest five-year return?
d. What is the type and market cap of the five-star fund with
the highest five-year return?
2.7 MISUSING GRAPHS AND ETHICAL ISSUES
LEARNING OBJECTIVE 6
Correctly present data in graphs
Good graphical displays should present the data in a clear and understandable way. Unfortunately, many graphs in newspapers and magazines, as well as graphs constructed using Microsoft Excel, are incorrect, misleading or unnecessarily complicated. To illustrate the misuse of
graphs, Figure 2.21 was constructed using data obtained from Wine Australia contained in data
file < WINE_PROD_2014_15 >. In the figure, the contents of the wine bottle representing 606 million litres for 1995/96 appear to be approximately three times the contents of the icon representing 346 million litres for 1990/91. This is because a magnification factor of 1.75 (606/346
≈ 1.75) has been applied to both height and width, so the volume has increased by 1.75² ≈ 3.
One principle of good graphs is that, when using three-dimensional icons, frequency/quantity
must be proportional to volume.
Figure 2.21
Misleading display of Australian wine production
[Wine-bottle pictograph: Australian beverage wine production (million litres) for 1990/91, 1995/96, 2000/01, 2005/06, 2010/11 and 2014/15]
Source: Data obtained from ‘Australian Gross Wine Production – pdf format’, Wine Australia Corporation
<www.wineaustralia.com/australia> accessed December 2013.
Also, the time difference between the wine bottles is not constant. There are five years between
the first five icons and four years between the last two. Good graphs should be properly scaled
along each axis. Finally, the year labels are ambiguous. It is not clear whether the 346 million litres
represent the total production for the two years 1990 and 1991, the average production for those
two years, or the wine production for the 1990/91 financial year. Good graphs should be clearly
labelled. Although the wine bottle presentation may catch the eye, the data would have been better
presented in a summary table or as a time-series plot using all the data available.
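The magnification arithmetic behind Figure 2.21 can be checked with a short calculation. The sketch below uses only the two production figures quoted in the text (346 and 606 million litres); the variable names are illustrative, and the 'fair' scale factors are one way of making icon size proportional to quantity, not a method prescribed by the text.

```python
import math

small, large = 346, 606            # million litres, 1990/91 and 1995/96
ratio = large / small              # true ratio of the two quantities

# Figure 2.21 applies the ratio to BOTH height and width,
# so the drawn icon grows by the square of the ratio.
apparent = ratio ** 2

# Scale per dimension that would keep the icon proportional to the quantity:
fair_scale_2d = math.sqrt(ratio)   # if the reader judges area
fair_scale_3d = ratio ** (1 / 3)   # if the reader judges volume (all three dimensions scaled)

print(round(ratio, 2), round(apparent, 2))
print(round(fair_scale_2d, 2), round(fair_scale_3d, 2))
```

Scaling each dimension by the square root (or cube root) of the ratio is noticeably less dramatic than scaling by the ratio itself, which is why pictographs so often overstate growth.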
It is often the improper use of the vertical and horizontal axes that leads to distortions in
presenting data. Figure 2.22, representing New Zealand alcohol consumption, was constructed
using data from OECD (2011 and 2014), contained in data file < ALCOHOL_CONSUMPTION >.
The graph in Figure 2.22 is clearly labelled, the horizontal/time axis is correctly spaced and the
height and volume are proportional. However, the cylinder representing 9.1 litres for 2004 is more
than twice the height/volume of the cylinder representing 8.9 litres for 2003. This is because there
is no zero point on the vertical axis. The vertical axis on a good graph should usually begin at zero.
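The effect of a missing zero point can be illustrated numerically. In this sketch the two consumption values (8.9 and 9.1 litres) are taken from the text, but the axis starting value of 8.8 is an assumption made purely for illustration; the text does not state where the vertical axis of Figure 2.22 begins.

```python
def apparent_ratio(value_a, value_b, baseline):
    """Ratio of drawn heights (b to a) when the vertical axis starts at `baseline`."""
    return (value_b - baseline) / (value_a - baseline)


v2003, v2004 = 8.9, 9.1       # litres per capita, from Figure 2.22
ASSUMED_BASELINE = 8.8        # hypothetical axis start; not given in the text

truncated = apparent_ratio(v2003, v2004, ASSUMED_BASELINE)
from_zero = apparent_ratio(v2003, v2004, 0.0)

print(round(truncated, 2))    # height ratio on the truncated axis
print(round(from_zero, 3))    # height ratio when the axis begins at zero
```

With a zero baseline the two cylinders differ in height by only about 2%, whereas on the assumed truncated axis the later one is drawn several times taller.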
Other eye-catching displays seen in magazines and newspapers often include information
that is not necessary, blurring the effect.
Some guidelines for presenting good graphs are as follows:
• The graph should not distort the data. In particular, frequency/quantity should be
proportional to area and/or volume.
• The graph should not contain chartjunk.
• Any two-dimensional graph should contain a scale for each axis.
• The scale on the vertical axis should begin at zero.
• Graphs should be properly scaled along each axis.
• All axes should be properly labelled.
• The graph should contain a title.
• The simplest possible graph should be used for a given set of data.
Figure 2.22
Misleading display of New Zealand alcohol consumption
[Cylinder chart: alcohol consumption in litres per capita (15+) by year, 2003 to 2013]
Source: Data from OECD (2011 and 2014), ‘Alcohol consumption’, Health: Key Tables from OECD, No. 24.
doi: 10.1787/alcoholcons-table-2014-1-en and 10.1787/alcoholcons-table-2011-1-en, accessed March 2017.
Often these guidelines are unknowingly violated by individuals unaware of how to construct
appropriate graphs. Some applications, including Excel, tempt you to create ‘pretty’ charts
that may be fancy in their designs but represent unwise choices. For example, making a simple
pie chart fancier by adding exploded 3D slices is unwise as this can complicate a viewer’s
interpretation of the data. Uncommon chart choices such as doughnut, radar and surface
charts may look visually striking, but in most cases they obscure the data.
Ethical Concerns
Inappropriate graphs raise ethical concerns, especially when they, deliberately or not, present a
false impression of the data.
To illustrate this, take the example of mobile speed cameras that were reintroduced in New
South Wales on 19 July 2010. Suppose the following graphs were produced by groups for and
against this, using data in the file < NSW_ROAD_FATALITIES 2009_2017 > obtained from the
­Australian Road Deaths Database.
Figure 2.23A gives the impression that the number of road fatalities in New South Wales
has increased after the reintroduction of mobile speed cameras, while Figure 2.23B gives the
opposite impression. However, a time-series plot for 2009 to 2017 (Figure 2.23C) shows that
there may be a slight decrease in fatalities since the introduction of mobile cameras, although
the number of fatalities per month is very variable.
Figure 2.23A
NSW road fatalities 2010
[Time-series plot: NSW number of road fatalities 2010, Jul 10 to Oct 10, vertical axis from 20 to 40, annotated ‘Mobile cameras introduced’]
Figure 2.23B
NSW road fatalities 2010
[Time-series plot: NSW number of road fatalities 2010, Apr 10 to Aug 10, vertical axis from 20 to 45, annotated ‘Mobile cameras introduced’]
Figure 2.23C
NSW road fatalities 2009 to 2017
[Time-series plot: NSW number of road fatalities 2009 to 2017, Jan 09 to Feb 17, vertical axis from 0 to 60, annotated ‘Mobile speed cameras introduced’]
Source: Data in Figures 2.23A–C obtained from Australian Road Deaths Database,
<www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx>, accessed 8 April 2017.
Problems for Section 2.7
APPLYING THE CONCEPTS
2.50 (Student project) Bring to class a chart from a newspaper or
magazine that you believe to be a poor representation of a
numerical variable. Be prepared to discuss why you think this.
Do you believe that the intent of the chart is purposely to mislead
the reader?
2.51 (Student project) Bring to class a chart from a newspaper or
magazine that you believe to be a poor representation of a
categorical variable. Be prepared to discuss why you think this.
Do you believe that the intent of the chart is purposely to mislead
the reader?
2.52 (Student project) Bring to class a chart from a newspaper
or magazine that you believe contains too many unnecessary
adornments (i.e. chartjunk) that may cloud the message
given by the data. Be prepared to discuss why you think this.
2.53 The following graph shows a relationship between number of
pirates and global average temperature between 1820 and
2000. Comment on the influence of pirates on global warming.
[Scatter plot: Global average temperature vs number of pirates. Vertical axis: global average temperature, °C; horizontal axis: number of pirates (approximate); points labelled by year from 1820 to 2000]
Source: Church of the Flying Spaghetti Monster, <www.venganza.org/images/PiratesVsTemp.png> accessed 28 December 2014.
Used by permission of Bobby Henderson
2.54 Using the data < WINE_PROD_2014_15 > and < ALCOHOL_
CONSUMPTION >, redraw Figures 2.21 and 2.22, following the
guidelines for good graphs given in Section 2.7.
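2.55 The following three time-series plots show Perth’s monthly
average petrol prices from January 2006 to February 2017:
[Three time-series plots, (a), (b) and (c), each titled ‘Perth petrol price’, showing average price (cents/litre) from Jan 06 to Feb 17. The vertical axis runs from 0 to 160 in (a), from 95 to 165 in (b) and from 0 to 300 in (c).]
Which graph do you think best represents the data and why?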
2.56 An article in the New York Times (D. Rosato, ‘Worried about the
numbers? How about the charts?’, New York Times, 15 September
2002, Business 7) reported on research done on annual reports of
corporations by Professor Deanna Oxender Burgess of Florida Gulf
Coast University. Professor Burgess found that even slight distortions
in a chart changed readers’ perception of the information. The article
displayed sales information from the annual report of Zale
Corporation and showed how results were exaggerated.
Go online or to the library and study the most recent annual
report of a local corporation. Find at least one chart in the report
that you think needs improvement and develop an improved
chart. Explain why you believe the improved chart is better than
the one from the annual report.
2.57 Figures 2.1 and 2.3 show a bar chart and a pie chart,
respectively, for the online grocery shopping data.
a. Create an exploded pie chart, a doughnut chart, a cone chart
or a pyramid chart for the online shopping data.
b. Which graphs do you prefer? Explain.
Data for problem 2.55 obtained from Australian Automobile Association <www.aaa.asn.au> accessed April 2017
Assess your progress
Summary
Table 2.16 summarises the tables and charts discussed in this chapter.
These tables and charts enabled us to draw conclusions about online
grocery shopping, the cost of restaurant meals in a city and its suburbs,
and festival expenditure in the scenario at the beginning of the chapter.
Now that you have studied tables (which show how data are
distributed) and charts (which provide a visual display of how data
are distributed), a variety of numerical descriptive measures will be
introduced in Chapter 3 for further analysis and interpretation of data.

Table 2.16
Roadmap for selecting tables and charts

Type of analysis: Tabulating, organising and graphically presenting the values of a variable
  Numerical data: Ordered array, stem-and-leaf display, frequency distribution, relative frequency
    distribution, percentage distribution, cumulative percentage distribution, histogram, polygon,
    cumulative percentage polygon (Sections 2.2 and 2.3)
  Categorical data: Summary table, bar chart, pie chart (Section 2.1)

Type of analysis: Organising and graphically presenting the relationship between two variables
  Numerical data: Scatter diagram, time-series plot (Section 2.5); sparklines, gauges, bullet
    graph, treemap, drill-down (Section 2.6)
  Categorical data: Contingency table, side-by-side bar chart (Section 2.4); treemap,
    drill-down (Section 2.6)
Key terms
bar chart
bullet graph
business analytics
chartjunk
class boundaries
class mid-point
class width
contingency (cross-classification) table – descriptive statistics
cumulative percentage distribution
cumulative percentage polygon (ogive)
dashboard
data discovery
descriptive analytics
drill-down
frequency distribution
gauges
histogram
ordered array
percentage distribution
percentage polygon
pie chart
predictive analytics
prescriptive analytics
range
relative frequency distribution
scatter diagram
side-by-side bar chart
sparklines
stem-and-leaf display
summary table
time-series plot
treemaps
References
1. Few, S. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring, 2nd edn (Burlingame, CA: Analytics Press, 2013).
2. Tufte, E. Beautiful Evidence (Cheshire, CT: Graphics Press, 2006).
Chapter review problems
CHECKING YOUR UNDERSTANDING
2.58 How do histograms and polygons differ with respect to their
construction and use?
2.59 When or why would you construct a summary table?
2.60 What are the advantages and/or disadvantages of a bar chart
or a pie chart?
2.61 Compare and contrast the bar chart for categorical data and
the histogram for numerical data.
2.62 What is the difference between a time-series plot and a scatter
diagram?
2.63 What are the three percentage breakdowns that can help
you interpret the results found in a cross-classification table?
APPLYING THE CONCEPTS
You can solve problems 2.64 to 2.76 manually or using Microsoft Excel.
2.64 One thousand Australians were asked which websites they had
visited in the previous week. The results were:

Type of sites                    Number
Auction                          122
Banking                          245
Classifieds                      213
Dating                           41
Email                            552
Gaming                           132
News                             335
Online music site                186
Search engine                    743
Shopping                         381
Social network                   649
Sport                            236
TV                               201
User generated or upload site    472
Weather                          398

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about the type of website most visited?
2.65 Another poll asked Australians how they spent their time
online, with the following result.

Email and communications    19.3%
Multimedia sites            13.1%
Online shopping             5.4%
Reading content             19.9%
Searches                    20.7%
Social networking           21.6%
Total                       100.0%

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about Internet usage? Are these
conclusions different from those in problem 2.64? If so,
what could the reasons be?
2.66 The following table classifies road fatalities in Australia for
2012 to 2016 by crash type:

                    Year
Crash type          2012     2013     2014     2015     2016
Multiple vehicle    573      479      503      511      556
Pedestrian          171      158      154      162      171
Single vehicle      556      550      493      532      573
Total               1,300    1,187    1,150    1,205    1,300

Data obtained from the Australian Road Deaths Database at
<www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 9 April 2017
a. Illustrate these data by constructing appropriate tables and
graphs.
b. What can you say about the pattern of road fatalities in
these five years?
2.67 Residents in the seaside town hosting a three-day music festival
are concerned that the influx of tourists for this and other
events causes an increase in traffic and other offences. As the
council area has one of the highest drink-driving rates in the
state, Kai is investigating whether tourists can be blamed for
this high rate.
The following table classifies the previous year’s 993
drink-driving offences by the home address of the offender:

Home address                          Number of drink-driving offences
Local – in council area
  Seaside town                        151
  Not seaside town                    462
Not local – not in council area
  Intrastate (within state)           130
  Interstate (another state)          228
  International (outside Australia)   22

a. Construct bar and pie charts.
b. What conclusions can Kai draw about the prevalence of
drink driving?
c. The headline of an article in the local paper discussing
these data was ‘Tourists can’t be blamed for number of
drink-drivers’. Do you agree with this? Justify your
answer.
Chapter review problems
The reasons why Queensland households installed a rainwater
tank is given in the table below.
2.71
Data obtained from Australian Bureau of Statistics, Environmental Issues: Water
Use and Conservation, Mar 2013, Cat. No. 4602.0.55.003 <www.abs.gov.au>
accessed 4 November 2013
2.73
Australian housing interest rates
8.0
7.5
7.0
% 6.5
6.0
5.5
2.74
Mar 17
Jun 16
Nov 16
Oct 15
Feb 16
Jun 15
Oct 14
Feb 15
Jun 14
Oct 13
Feb 14
Jan 13
5.0
May 13
2.70
2.72
Sep 11
2.69
a. Illustrate these data with an appropriate table or graph.
b. What can you conclude about the reasons for installing a
rainwater tank, and are there differences between Brisbane
and non-Brisbane Queensland households?
The data in < FRESH_MILK > contains the fat and sugar
content in grams (g) per 250 ml cup of a random sample of
brands of fresh cow’s milk for sale in Australia.
a. Use the combined data to construct graphs to explore the
relationship between the variables.
b. What conclusions can you reach about the relationship
between the fat, sugar and calorie content of fresh
milk?
On the same day in March 2017, the researcher in problem
2.30 also obtained the prices per litre of unleaded petrol and
diesel from a random sample of 45 towns and suburbs in
Queensland. This set of data is in the data file < FUEL_
MARCH_2017 > with the New South Wales data.
a. Using appropriate tables and graphs, investigate the
distribution of unleaded petrol and diesel prices in
Queensland on this day in March 2017. What can you
conclude about the variation in fuel prices in Queensland
when the data were collected?
b. Using an appropriate graph, investigate the relationship
between petrol and diesel prices in Queensland. What
conclusions can you draw about this relationship?
c. Using appropriate tables and graphs, investigate the
distribution of unleaded petrol and diesel prices in New
South Wales on this day in March 2017. What can you
conclude about the variation in fuel prices in New South
Wales when the data were collected?
d. Using an appropriate graph, investigate the relationship
between petrol prices in New South Wales and Queensland.
What conclusions can you draw?
e. Using an appropriate graph, investigate the relationship
between diesel prices in New South Wales and Queensland.
What conclusions can you draw?
[Table for the rainwater tank problem: reasons for installing a rainwater tank, households in thousands]
Reason                                   Brisbane   Rest of Queensland
To save water                              142.10        59.00
To save on water costs                      55.60        41.70
Water restrictions on mains water           55.20        31.90
Not connected to mains water                 5.40        73.70
Concerns about quality of mains water        5.40        28.10
Water tank rebates                          43.00         7.90
Other                                       48.50        47.20
Total households (thousands)               216.50       206.30
f. The data in < FUEL_MARCH_2017 > was obtained in March
2017. Go to Motor Mouth at <www.motormouth.com.au>,
NRMA at <www.mynrma.com.au>, RACQ at <www.racq.com.au>,
or elsewhere, to collect recent price data. Then
use appropriate graphs and tables to investigate any
changes in petrol and/or diesel prices in New South Wales
and/or Queensland.
2.71 Data from 100 recent property sales from a council area are
stored in < PROPERTY >.
For the asking price data:
a. Construct and interpret a stem-and-leaf display.
b. Construct frequency, percentage and cumulative distributions.
c. Construct a frequency histogram, a percentage polygon and
an ogive.
d. What conclusions can you make about the distribution of
asking prices?
e. Construct and interpret a scatter diagram for asking and
selling price.
For the type and bedroom data:
f. Construct cross-classification tables based on total, row
and column percentages.
g. Construct side-by-side charts to investigate the relationship
between number of bedrooms and type.
h. What conclusions can you make about the relationship
between type and number of bedrooms?
2.72 The data in data file < INTEREST_2017 > give the bank interest
rate for standard housing loans in New Zealand and Australia
from January 2000 to March 2017.
Construct and interpret time-series plots, on the same set
of axes, for New Zealand and Australian interest rates from
January 2000.
2.73 Using the Australian data from problem 2.72, a PR spokesperson
for an Australian political party constructed the following graph
to illustrate that the party’s influence has lowered interest rates.
Do you think this is an ethical graph? Discuss.
2.74 The data in data file < GRADES > contain sample student
marks and grades from a population of students enrolled in a
statistics unit.
a. Construct an appropriate graph to investigate the
distribution of grades. What conclusions can you draw?
b. Construct an appropriate graph to investigate the
distribution of total marks. What conclusions can you
draw?
c. Construct an appropriate graph to investigate the
relationship between a student’s semester mark and their
exam mark. What conclusions can you draw?
2.75 (Class project) Ask each student in the class to respond to the
question ‘Which soft drink do you prefer?’ and display the
results in a summary table.
a. Convert the data to percentages and construct a bar or pie
chart.
b. Analyse the findings.
2.76 (Class project) Classify each student in the class on the
basis of gender (male, female), study mode (full-time or
part-time) and current employment status (full-time,
part-time).
a. Construct contingency tables to explore the data.
b. What would you conclude from this study?
c. What other variables would you want to know about
employment in order to enhance your findings?
d. Compare your results with those from the Living in Australia
Study in problem 2.22.
2.77 The file < DOMESTIC_BEER2 > contains the number of calories
per 355 mL and number of carbohydrates (in grams) per 355 mL
for a sample of 15 of the best-selling domestic beers in the
United States (data obtained from <www.beer100.com/beercalories.htm>).
a. Visually evaluate the number of calories per 355 mL for
each beer by constructing a bullet graph. For the purposes
of comparison, consider calories below 100 as low,
between 100 and 160 as medium, and above 160 as high.
b. Visually evaluate the number of carbohydrates (in grams)
per 355 mL for each beer by constructing a bullet graph.
For the purposes of comparison, consider carbohydrates
below 10 grams as low, between 10 and 14 grams as
medium, and above 14 grams as high.
c. What preliminary conclusions can you reach about the
number of calories and amount of carbohydrate in the
beers?
d. Why would constructing sets of gauges for the calories and
carbohydrates be a less effective means of visualising these
data?
2.78 The file < CURRENCY2 > contains the value of the Canadian
dollar, British pound and Euro for one US dollar from 2002 to
2012.
a. Construct sparklines for the value of the US dollar in terms
of the Canadian dollar, British pound and Euro.
b. What conclusions can you reach about the value of the US
dollar in terms of the Canadian dollar, British pound and
Euro from 2002 to 2012?
Continuing cases
Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues.
In particular, students within the school are asked to complete a student survey when they receive their grades
each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students
who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in
< TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.
Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman
University Undergraduate MBA Student Survey.
a For a selection of questions asked in the BBus student survey, construct appropriate tables and charts.
b For a selection of questions asked in the MBA student survey, construct appropriate tables and charts.
c Construct appropriate tables and charts to explore the relationship between selected pairs of questions
within a survey or between surveys.
d Write a report summarising your conclusions.
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. The data are stored in < REAL_ESTATE >.
a For regional city 1, state A:
i For a selection of variables, construct appropriate tables and charts.
ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
b For coastal city 1, state A:
i For a selection of variables, construct appropriate tables and charts.
ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
c Construct appropriate tables and charts to explore the relationship between the same variable in coastal
city 1, state A, and regional city 1, state A.
d Write a report summarising your conclusions.
e Repeat (a) to (d) for another pair of non-capital cities or towns in state A and/or state B.
Chapter 2 Excel Guide
EG2.1 ORGANISING AND VISUALISING
CATEGORICAL DATA
ORGANISING CATEGORICAL DATA
Figure EG2.1 One-Way
Tables & Charts dialog box
The Summary Table
Key technique
Use the PivotTable feature to create a summary table for
untallied data.
Example
Create a frequency and percentage summary table similar
to Table 2.2B on page 39.
PHStat
Use One-Way Tables & Charts.
For the example, open the Property file. Select PHStat
➔ Descriptive Statistics ➔ One-Way Tables & Charts. In
the procedure’s dialog box (shown in Figure EG2.1):
1. Click Raw Categorical Data (because the worksheet contains untallied data).
2. Enter or highlight G2:G102 as the Raw Data Cell
Range and check First cell contains label.
3. Enter a Title, check Percentage Column, and
click OK.
PHStat creates a PivotTable summary table on a new
worksheet.
In-depth Excel (untallied data)
Use the Summary_Table workbook as a model.
For the example, open the Property file and select
Insert ➔ PivotTable. In the Create PivotTable dialog box
(shown in Figure EG2.2):
1. Click Select a table or range and enter or highlight G2:G102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.
Figure EG2.2 Create PivotTable dialog box
In the Excel 2016 PivotTable Fields task pane (shown in
Figure EG2.3) or in the similar PivotTable Field List task
pane in earlier Excels:
3. Tick Type in Choose fields to add to report to
add it to ROWS (or Row Labels) box.
4. Drag Type in Choose fields to add to report and
drop it in the Σ Values box. This second label
changes to Count of Type to indicate that a count,
or tally, of the type categories will be displayed in
the PivotTable.
Figure EG2.3 Microsoft Excel PivotTable Fields task pane
In the PivotTable being created:
5. Enter Type in cell A3 to replace the heading Row
Labels.
6. Right-click cell A3 and then click PivotTable
Options in the shortcut menu that appears.
In the PivotTable Options dialog box (shown in Figure
EG2.4):
7. Click the Layout & Format tab.
8. Check For empty cells show and enter 0 as its
value. Leave all other settings unchanged.
9. Click OK to complete the PivotTable.
Figure EG2.4 PivotTable Options dialog box
To add a column for the percentage frequency:
10. Enter Percentage in cell C3. Enter the formula
=B4/B$6 in cell C4 and copy it down to row 6.
11. Select cell range C4:C6, right-click, and select
Format Cells in the shortcut menu.
12. In the Number tab of the Format Cells dialog box,
select Percentage as the Category, and the number of decimal places you wish to show, and click
OK.
13. Adjust the worksheet formatting, if appropriate,
and enter a title in cell A1.
In the PivotTable, type categories appear in alphabetical order. To change the order:
14. Click the Unit label in cell A5 to highlight cell A5.
Move the mouse pointer to the top edge of the cell
until the mouse pointer changes to a four-way
arrow.
15. Drag the Unit label and drop the label over cell
A4. The type categories now appear in the order
Unit then House in the summary table.
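The tally that the PivotTable performs can also be sketched outside Excel. The Python sketch below is illustrative only: the `types` list is made-up data standing in for the Type column (it is not the Property file), and `summary_table` is a hypothetical helper name, not part of Excel or PHStat.

```python
from collections import Counter

def summary_table(values):
    """Return (category, frequency, percentage) rows in alphabetical order."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, counts[cat], 100 * counts[cat] / total) for cat in sorted(counts)]

# Made-up untallied data standing in for the Type column (an assumption).
types = ["House", "Unit", "House", "House", "Unit", "House", "Unit", "House"]

for category, frequency, percentage in summary_table(types):
    print(f"{category:<6} {frequency:>3} {percentage:6.1f}%")
```

As in the PivotTable, categories come out in alphabetical order. Reordering them is a separate step.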
In-depth Excel (tallied data)
Use the SUMMARY_SIMPLE worksheet of the
Summary_Table workbook as a model for creating a
summary table.
VISUALISING CATEGORICAL VARIABLES
The Bar Chart and the Pie Chart
Many of the In-depth Excel instructions in the rest of this
Excel Guide refer to the labelled Charts group illustration
shown in Figure EG2.5.
Figure EG2.5 Microsoft Excel Charts group
Key technique
Use the Excel bar or pie chart feature. If the variable to be
visualised is untallied, first construct a summary table (see
the instructions in Section EG2.1 ‘Organising Categorical
Data: The Summary Table’).
Example
Construct a bar or pie chart from a summary table similar to
Table 2.2B on page 39.
PHStat
Use One-Way Tables & Charts.
For the example, use the PHStat instructions in Section
EG2.1 ‘Organising Categorical Data: The Summary Table’,
but in step 3, check either Bar Chart or Pie Chart (or
both) in addition to entering a Title, checking Percentage
Column, and clicking OK.
In-depth Excel
Use the Summary_Table workbook as a model.
For the example, open to the OneWayTable worksheet
of the Summary_Table workbook. (The PivotTable in
this worksheet was constructed using the instructions in
Section EG2.1 ‘Organising Categorical Data: The Summary
Table’.) To construct a bar chart:
1. Select cell range A4:B5. (Begin your selection at cell
B5 and not at cell A4, as you would normally do.)
2. In Excel 2016, select Insert, then the Column
icon in the Charts group (#1 in the Charts group
illustration in Figure EG2.5), and then select the
first 2-D Bar gallery item (Clustered Bar). In other
Excels, select Insert ➔ Bar Icon and then select
the first 2-D Bar gallery item (Clustered Bar).
3. Right-click the Count of Type button in the chart
and click Hide All Field Buttons on Chart.
4. Select Design ➔ Add Chart Element ➔ Axis
Titles ➔ Primary Horizontal.
(Earlier Excels) Select Layout ➔ Axis Titles ➔
Primary Horizontal Axis Title ➔ Title Below
Axis. Select the words “Axis Title” in the chart and
enter the title Frequency.
5. If required, move to chart sheet (right-click on
Chart ➔ Move Chart). Adjust chart formatting if
required.
Although not the case with the example, sometimes
the horizontal axis scale of a bar chart will not begin at 0.
If this occurs, right-click the horizontal (value) axis in the
bar chart and click Format Axis in the shortcut menu. In
the Format Axis task pane, click Axis Options. In the
Axis Options, enter 0 in the Minimum box and then close
the pane. In earlier Excels, you set this value in the Format Axis dialog box. Click Axis Options in the left pane,
and in the Axis Options right pane, click the first Fixed
option button (for Minimum), enter 0 in its box, and then
click Close.
To construct a pie chart, replace steps 2 and 4 with
these steps:
2. Select Insert, then the Pie icon (#3 in the Charts
group illustration in Figure EG2.5), and then select
the first 2-D Pie gallery item (Pie). In earlier
Excels, select Insert ➔ Pie and then select the
first 2-D Pie gallery item (Pie).
4. Select Design ➔ Add Chart Element ➔ Data
Labels ➔ More Data Label Options. In the Format Data Labels task pane, click Label Options.
In the Label Options, check Category Name and
Percentage, clear the other Label Contains check
boxes, and click Outside End. (To see the Label
Options, you may have to first click the chart
(fourth) icon near the top of the task pane.) Then
close the task pane.
(Earlier Excels) Select Layout ➔ Data Labels ➔
More Data Label Options. In the Format Data
Labels dialog box, click Label Options in the left
pane. In the Label Options right pane, check Category Name and Percentage and clear the other
Label Contains check boxes. Click Outside End
and then click Close.
EG2.2 ORGANISING NUMERICAL DATA
Stacked and Unstacked Data
PHStat
Use Stack Data or Unstack Data.
For example, to unstack the Asking Price variable by
the Type variable in the property data given in Example 2.1,
open the Property file. Select Data Preparation ➔
Unstack Data. In that procedure’s dialog box, enter or
highlight G2:G102 (the Type variable cell range) as the
Grouping Variable Cell Range and enter or highlight
A2:A102 (the Asking Price variable cell range) as the
Stacked Data Cell Range. Check First cells in both
ranges contain label and click OK. The unstacked data
appear on a new worksheet.
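The idea of unstacking can be sketched in Python: one column of values is split into separate lists according to a grouping column. This is an illustration, not the PHStat procedure. The helper name `unstack` and the five sample rows are assumptions made for the example.

```python
from collections import defaultdict

def unstack(groups, values):
    """Split one stacked column of values into one list per group label."""
    columns = defaultdict(list)
    for group, value in zip(groups, values):
        columns[group].append(value)
    return dict(columns)

# Made-up stacked data: a grouping column and a value column of equal length.
property_type = ["House", "Unit", "House", "Unit", "House"]
asking_price = [520, 310, 610, 295, 480]

unstacked = unstack(property_type, asking_price)
print(unstacked)
```

Each key of the result corresponds to one column of the unstacked worksheet.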
The Ordered Array
In-depth Excel
To create an ordered array, first select the numerical variable
to be sorted. Then select Home ➔ Sort & Filter (in the Editing group) and in the drop-down menu click Sort Smallest
to Largest. (You will see Sort A to Z as the first drop-down
choice if you did not select a cell range of numerical data.)
The Stem-and-Leaf Display
Key technique
Enter leaves as a string of digits.
Example
Construct a stem-and-leaf display for festival expenditure
by interstate visitors, similar to Figure 2.5 on page 45.
PHStat
Use the Stem-and-Leaf Display.
For the example, open the Festival file. Select PHStat
➔ Descriptive Statistics ➔ Stem-and-Leaf Display. In
the procedure’s dialog box (shown in Figure EG2.6):
1. Enter or highlight A2:A54 as the Variable Cell
Range and check First cell contains label.
2. Click Set stem unit as and enter 100 in its box.
3. Enter a Title and click OK.
Figure EG2.6
Stem-and-Leaf Display
dialog box
When creating other displays, use the Set stem unit
as option sparingly and only if Autocalculate stem unit
creates a display that has too few or too many stems. (Any
stem unit you specify must be a power of 10.)
In-depth Excel
Use the Stem_and_Leaf workbook as a model.
Manually construct the stems and leaves on a new
worksheet to create a stem-and-leaf display. Adjust the column width of the column that holds the leaves as necessary.
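The manual construction can be sketched in Python. This is a minimal sketch under stated assumptions: the helper name `stem_and_leaf` and the expenditure values are made up, the values are non-negative, the stem unit is a power of 10 of at least 10, and each leaf is the first digit after the stem (later digits are truncated, not rounded).

```python
def stem_and_leaf(values, stem_unit=10):
    """Return {stem: leaves}, with the leaves of each stem stored as a string of digits."""
    leaf_unit = stem_unit // 10
    leaves = {}
    for value in sorted(values):
        stem, remainder = divmod(int(value), stem_unit)
        leaves[stem] = leaves.get(stem, "") + str(remainder // leaf_unit)
    # Include stems with no leaves so that gaps in the data stay visible.
    return {stem: leaves.get(stem, "") for stem in range(min(leaves), max(leaves) + 1)}

# Made-up expenditure values (an assumption, not the Festival file).
spend = [125, 340, 152, 310, 588, 199, 345]

for stem, leaf_string in stem_and_leaf(spend, stem_unit=100).items():
    print(f"{stem:>2} | {leaf_string}")
```

Storing the leaves as a string of digits mirrors the key technique above: each stem row holds one text entry.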
EG2.3 SUMMARISING AND VISUALISING
NUMERICAL DATA
SUMMARISING NUMERICAL DATA
The Frequency Distribution
Key technique
Establish bins and then use the FREQUENCY(untallied
data cell range, bins cell range) array function to tally data.
Example
Create frequency, percentage and cumulative percentage distributions for the restaurant meal cost data as in Tables 2.5,
2.7 and 2.9 in Section 2.3.
To construct a frequency distribution using Excel or
PHStat, you must first define your classes by a bin range.
Defining Classes Using Bins
Open the worksheet containing the data you want to summarise in classes. Decide on your classes and, in a separate
column, enter the upper boundary or maximum value
called the Bin Value for each class. This gives the Bin Cell
Range.
If the data are discrete, the bin range should contain the
highest value in each class. If the data are continuous but
recorded to a set number of decimal places, the values in
the bin range should be just less than the minimum value in
the next class. In this case, record the value in the bin range
to one or two more significant figures than the data.
For example, for the restaurant data in < RESTAURANT >
in Section 2.3, the following classes were required (see
Table 2.5): $10 to less than $15, $15 to less than $20 and so
on. As the first class is $10 to less than $15, $15 belongs in
the second class and the bin value for the first class is just
less than this, 14.99 or 14.999. Therefore, the Bin Cell
Range would be 14.999, 19.999, 24.999 and so on.
Class           Bin values   Class mid-points
$10 to < $15    14.999       $12.50
$15 to < $20    19.999       $17.50
$20 to < $25    24.999       $22.50
:               :            :
$60 to < $65    64.999       $62.50
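The tallying rule behind the bin values can be sketched in Python. This is an illustrative sketch, not the Excel implementation: the function name `frequency` and the sample meal costs are assumptions made for the example. Each bin value is treated as an inclusive upper boundary, and one extra count collects any values above the last bin.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin, where each bin value is an inclusive upper boundary.

    `bins` must be sorted in ascending order. The result has len(bins) + 1
    counts; the final count holds values above the last bin value.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # bisect_left gives the index of the first bin value >= value.
        counts[bisect_left(bins, value)] += 1
    return counts

# Made-up meal costs (an assumption, not the Restaurant file).
meal_costs = [12.5, 15, 14.99, 19.5, 20, 23, 31]
bin_values = [14.999, 19.999, 24.999]

print(frequency(meal_costs, bin_values))
```

Because the first bin value is 14.999 and not 15, a meal costing exactly $15 is counted in the second class, which is the behaviour the class definitions require.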
PHStat (untallied data)
Use Frequency Distribution. (Use Histogram & Polygons, discussed later in Section EG2.3, if you plan to construct a histogram or polygon in addition to a frequency
distribution.) For the example, open the Restaurant file.
The Data worksheet contains the meal cost data in stacked
format in column G. Enter an appropriate bin cell range
(see above) in column H (say, H1:H12). Select PHStat ➔
Descriptive Statistics ➔ Frequency Distribution. In the
procedure’s dialog box (shown in Figure EG2.7):
1. Enter or highlight G1:G101 as the Variable Cell
Range, enter or highlight H1:H12 as the Bins Cell
Range, and check First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or
highlight A1:A101 as the Grouping Variable Cell
Range. (The cell range A1:A101 contains the
Location variable.)
3. Enter a Title and click OK.
Figure EG2.7
Frequency Distribution
dialog box
Click Single Group Variable in step 2 if constructing
a distribution from a single group of untallied data. Click
Multiple Groups - Unstacked in step 2 if the Variable
Cell Range contains two or more columns of unstacked,
untallied data.
Frequency distributions for the two groups appear
on separate worksheets. To display the information for
the two groups on one worksheet, select the cell range
B3:D14 on one of the worksheets. Right-click that
range and click Copy in the shortcut menu. Open to the
other worksheet. In that other worksheet, right-click
cell E3 and click Paste Special in the shortcut menu. In
the Paste Special dialog box, click Values and numbers
format and click OK. Adjust the worksheet title as
necessary.
In-depth Excel (untallied data)
Use the Distributions workbook as a model.
For the example, use the Unstacked worksheet of
the Restaurant file. This worksheet contains the meal
cost data unstacked in columns A and B. Enter an appropriate bin range (see above) in column D (say, D1:D12).
Then:
1. Right-click the Unstacked sheet tab and click
Insert in the shortcut menu.
2. In the General tab of the Insert dialog box, click
Worksheet and then click OK.
In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3 and Frequency in cell B3.
4. Copy the bin number list in the cell range D2:D12
of the Unstacked worksheet and paste this list
into cell A4 of the new worksheet.
5. Select the cell range B4:B14 that will hold the
array formula.
6. Type (but do not press the Enter or Tab key) the
formula =FREQUENCY(UNSTACKED!$A$1:$A$51, $A$4:$A$14). Then, while holding down
the Ctrl and Shift keys, press the Enter key to
enter the array formula into the cell range B4:B14.
7. Adjust the worksheet formatting as necessary.
Note that in step 6, you enter the cell range as
UNSTACKED!$A$1:$A$51 and not as $A$1:$A$51
because the untallied data are located on another (the
Unstacked) worksheet.
Steps 1 to 7 construct a frequency distribution for the
meal costs at city restaurants. To construct a frequency distribution for the meal costs at suburban restaurants, repeat
steps 1 to 7 but in step 6 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14)
as the array formula.
To display the distributions for the two groups on one
worksheet, select the cell range B3:B14 on one of the
worksheets. Right-click that range and click Copy in the
shortcut menu. Open to the other worksheet. In that other
worksheet, right-click cell C3 and click Paste Special in
the shortcut menu. In the Paste Special dialog box, click
Values and numbers format and click OK. Adjust the
worksheet title as necessary.
Analysis ToolPak (untallied data)
Use Histogram.
For the example, use the Unstacked worksheet of the
Restaurant file. This worksheet contains the meal cost
data unstacked in columns A and B. Enter an appropriate
bin range (see above) in column D (say, D1:D12). Then:
1. Select Data ➔ Data Analysis. In the Data Analysis
dialog box, select Histogram from the Analysis
Tools list and then click OK.
In the Histogram dialog box (shown in Figure EG2.8):
2. Enter or highlight A1:A51 as the Input Range and
enter or highlight D1:D12 as the Bin Range. (If
you leave Bin Range blank, the procedure creates
a set of bins that will not be as well formed as the
ones you can specify.)
3. Check Labels and click New Worksheet Ply.
4. Click OK to create the frequency distribution on a
new worksheet.
Figure EG2.8
Histogram dialog
box
In the new worksheet:
5. Select row 1. Right-click this row and click Insert
in the shortcut menu. Repeat. (This creates two
blank rows at the top of the worksheet.)
6. Enter a title in cell A1.
The ToolPak creates a frequency distribution that contains
an improper bin labelled More. Correct this error by using
these general instructions:
7. Manually add the frequency count of the More
row to the frequency count of the preceding row.
(For the example, the More row contains a zero for
the frequency, so the frequency of the preceding
row does not change.)
8. Select the worksheet row (for this example, row 15)
that contains the More row.
9. Right-click that row and click Delete in the shortcut menu.
Steps 1 to 9 construct a frequency distribution for the
meal costs at city restaurants. To construct a frequency distribution for the meal costs at suburban restaurants, repeat
these nine steps but in step 2 enter or highlight B1:B51 as
the Input Range.
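The correction in steps 7 to 9 amounts to folding the More row into the row before it. A minimal Python sketch follows. The helper name `fold_more_row` and the sample rows are assumptions made for illustration, and the sketch works on a list of (label, frequency) pairs, not on a worksheet.

```python
def fold_more_row(rows):
    """Add the frequency of a trailing 'More' row to the preceding row and drop it."""
    if len(rows) < 2 or rows[-1][0] != "More":
        return list(rows)
    last_label, last_count = rows[-2]
    return list(rows[:-2]) + [(last_label, last_count + rows[-1][1])]

# Made-up output in the shape of a bin column and a frequency column.
toolpak_rows = [("14.999", 2), ("19.999", 5), ("24.999", 3), ("More", 1)]

print(fold_more_row(toolpak_rows))
```

When the More row holds a zero, as in the example in the text, the preceding frequency is unchanged and only the row is removed.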
The Relative Frequency, Percentage and Cumulative
Distributions
Key technique
Add columns that contain formulas for the relative frequency or percentage and cumulative percentage to a previously constructed frequency distribution.
Example
Create a distribution that includes the relative frequency or
percentage as well as the cumulative percentage, as in
Tables 2.7 (relative frequency and percentage) and 2.9
(cumulative percentage) in Section 2.3 for the restaurant
meal cost data.
PHStat (untallied data)
Use Frequency Distribution.
For the example, use the PHStat instructions in ‘Summarising Numerical Data: The Frequency Distribution’ to
construct a frequency distribution. Note that the frequency
distribution constructed by PHStat also includes columns
for the percentages and cumulative percentages. To change
the column of percentages to a column of relative frequencies, reformat that column. For the example, open to the
new worksheet that contains the city restaurant frequency
distribution and:
1. Select the cell range C4:C14, right-click, and
select Format Cells from the shortcut menu.
2. In the Number tab of the Format Cells dialog box,
select Number as the Category and click OK.
Then repeat these two steps for the new worksheet that contains the suburban restaurant frequency distribution.
In-depth Excel (untallied data)
Use the Distributions workbook as a model.
For the example, first construct a frequency distribution created using the In-depth Excel instructions in ‘Summarising Numerical Data: The Frequency Distribution’.
Open to the new worksheet that contains the frequency distribution for the city restaurants and:
1. Enter Percentage in cell C3 and Cumulative
Pctage in cell D3.
2. Enter =B4/SUM($B$4:$B$14) in cell C4 and
copy this formula down to row 14.
3. Enter =C4 in cell D4.
4. Enter =C5 + D4 in cell D5 and copy this formula
down to row 14.
5. Select the cell range C4:D14, right-click, and click
Format Cells in the shortcut menu.
6. In the Number tab of the Format Cells dialog box,
click Percentage in the Category list and click OK.
Then open to the worksheet that contains the frequency distribution for the suburban restaurants and repeat steps 1 to 6.
If you want column C to display relative frequencies
instead of percentages, enter Rel. Frequencies in cell C3.
Select the cell range C4:C14, right-click, and click Format
Cells in the shortcut menu. In the Number tab of the Format Cells dialog box, click Number in the Category list
and click OK.
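The percentage and cumulative percentage columns follow the same arithmetic as the worksheet formulas: each frequency divided by the total, then a running sum. The Python sketch below is illustrative; the function name `percentage_distribution` and the sample frequencies are assumptions.

```python
from itertools import accumulate

def percentage_distribution(frequencies):
    """Return (percentages, cumulative percentages) for a list of class frequencies."""
    total = sum(frequencies)
    percentages = [100 * f / total for f in frequencies]
    # Running sum of the percentages, like adding each row to the cell above.
    cumulative = list(accumulate(percentages))
    return percentages, cumulative

# Made-up class frequencies (an assumption, not the restaurant data).
class_frequencies = [5, 10, 20, 15]

percentages, cumulative = percentage_distribution(class_frequencies)
print(percentages)
print(cumulative)
```

Dividing each percentage by 100 gives the relative frequencies, which is the same change that reformatting the column from Percentage to Number makes in the worksheet.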
Analysis ToolPak
Use Histogram and then modify the worksheet created.
For the example, first construct the frequency distributions using the Analysis ToolPak instructions in ‘The Frequency Distribution’. Then use the In-depth Excel
instructions to modify those distributions.
VISUALISING NUMERICAL DATA
The Histogram
Key technique
Construct a histogram.
Example
Construct histograms for price of main meals in city restaurants, similar to Figure 2.6 on page 50.
PHStat
Use Histogram & Polygons.
Figure EG2.9
Histogram &
Polygons
dialog box
PHStat Defining Classes Bins and Mid-points
If constructing a frequency polygon or histogram using
PHStat, include a class of zero frequency at the beginning
of the bin range.
For example, for the restaurant data in < RESTAURANT > in
Section 2.3, the first class of non-zero frequency is $10 to less
than $15 with bin value 14.999, so the class $5 to less than
$10 must be included before this. Therefore, the Bin Cell
Range would be 9.999, 14.999, 19.999, 24.999 and so on.
PHStat also requires a Mid-point Cell Range. Since
PHStat associates the first mid-point given with the second
bin value or second class, the Mid-point Cell Range must
have one fewer cells/values than the Bin Cell Range.
Price   Bin values   Class mid-points
50      9.999        $12.50
38      14.999       $17.50
43      19.999       $22.50
56      24.999       $27.50
51      29.999       $32.50
36      34.999       $37.50
25      39.999       $42.50
33      44.999       $47.50
41      49.999       $52.50
44      54.999       $57.50
34      59.999       $62.50
39      64.999
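The bin values and class mid-points in this layout follow a simple pattern, sketched here in Python. The function name `bins_and_midpoints` and its parameters are assumptions made for illustration. The sketch assumes equal-width classes and data recorded to at most two decimal places, so that subtracting 0.001 places each bin value just below the next class.

```python
def bins_and_midpoints(first_lower, width, n_classes):
    """Return (bin values, class mid-points) for equal-width classes.

    The first bin value closes an empty class below the first real class,
    so there is one more bin value than there are mid-points.
    """
    bins = [round(first_lower + width * i - 0.001, 3) for i in range(n_classes + 1)]
    midpoints = [first_lower + width * i + width / 2 for i in range(n_classes)]
    return bins, midpoints

# Classes $10 to < $15 up to $60 to < $65: eleven classes of width 5.
bins, midpoints = bins_and_midpoints(first_lower=10, width=5, n_classes=11)

print(bins)
print(midpoints)
```

The extra leading bin value is why the Mid-point Cell Range has one fewer value than the Bin Cell Range.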
For the example, open to the Data worksheet of the
Restaurant file. Select PHStat ➔ Descriptive Statistics
➔ Histogram & Polygons. Enter an appropriate bin range,
see above, in column H (say, H1:H13) and Midpoint Range
in column I (say, I1:I12). Then in the procedure’s dialog
box (shown in Figure EG2.9):
1. Enter or highlight G1:G101 as the Variable Cell
Range, H1:H13 as the Bins Cell Range and
I1:I12 as the Midpoints Cell Range, and check
First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or
highlight A1:A101 as the Grouping Variable
Cell Range. (In the Data worksheet of the Restaurant file, the prices of meals in city and suburban
restaurants are stacked, or placed in a single
column. The column A values allow PHStat to
separate the city restaurant prices from the suburban restaurant prices.)
3. Enter a Title, check Histogram, and click OK.
PHStat inserts two new worksheets, each of which contains
a frequency distribution and a histogram.
Since you cannot define an explicit lower boundary for
the first bin, the first bin can never have a mid-point. Therefore, the Midpoints Cell Range you enter must have one
fewer cell than the Bins Cell Range. PHStat associates the
first mid-point with the second bin and uses -- as the label
for the first bin.
When you include a class of zero frequency before the
first class of non-zero frequency, as in this example, the histogram bar labelled -- will always be a zero bar.
In-depth Excel
Use the Histogram workbook as a model.
For the example, first construct frequency distributions
for city and suburban meal prices. Open the Unstacked
worksheet in the Restaurant file. This worksheet contains
the meal cost data unstacked in columns A and B. Enter
appropriate bin cell and mid-point cell ranges, including
titles, in columns D and E (say, D1:D12 and E1:E12).
Then:
1. Right-click the Unstacked sheet tab and click
Insert in the shortcut menu.
2. In the General tab of the Insert dialog box, click
Worksheet and then click OK.
In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3, Frequency
in cell B3, and Midpoints in cell C3.
4. Copy the bin values in the cell range D2:D12 of
the Unstacked worksheet and paste this list into
cell A4 of the new worksheet.
5. Copy the mid-points in the cell range E2:E12 of
the Unstacked worksheet and paste this list into
cell C4 of the new worksheet.
6. Select the cell range B4:B14 that will hold the
array formula.
7. Type (but do not press the Enter or Tab key) the
formula =FREQUENCY(UNSTACKED!$A$2:$A$51, $A$4:$A$14). Then, while holding down
the Ctrl and Shift keys, press the Enter key to
enter the array formula into the cell range B4:B14.
8. Adjust the worksheet formatting as necessary.
Steps 1 to 8 construct a frequency distribution for city restaurant main meal prices. To construct a frequency distribution for main meal prices for suburban restaurants, repeat
steps 1 to 8 but in step 7 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14)
as the array formula.
Having constructed the two frequency distributions,
continue constructing the two histograms. Open to the
worksheet that contains the frequency distribution for city
restaurant prices and:
1. Select the cell range B3:B14 (the cell range of the
frequencies).
2. Select Insert, then the Column icon in the Charts
group (#1 in the Charts group illustration in
Figure EG2.5), and then select the first 2-D Column gallery item (Clustered Column). In earlier
Excels, select Insert ➔ Column and select the first
2-D Column gallery item (Clustered Column).
3. Right-click the chart and click Select Data in the
shortcut menu.
In the Select Data Source dialog box:
4. Click Edit under the Horizontal (Categories)
Axis Labels heading.
5. In the Axis Labels dialog box, drag the mouse to
select the cell range C4:C14 (containing the midpoints) to enter that cell range. Do not type this
cell range in the Axis label range box as you would
otherwise do. Click OK in this dialog box and then
click OK (in the Select Data Source dialog box).
In the chart:
6. Right-click inside a bar and click Format Data
Series in the shortcut menu.
7. In the Format Data Series task pane, click Series
Options. In the Series Options, click Series
Options, enter 0 in the Gap Width box, and then
close the task pane. (To see the Series Options, you
may have to first click the chart [third] icon near
the top of the task pane.)
(Earlier Excels) In the Format Data Series dialog
box, click Series Options in the left pane, and in
the Series Options right pane, change the Gap
Width slider to No Gap. Click Close.
8. Move chart to a chart sheet (right-click on Chart ➔
Move Chart). Adjust chart formatting if required.
Analysis ToolPak
Use Histogram.
For the example, open the Unstacked worksheet in
the Restaurant file. Enter appropriate bin cell and midpoint cell ranges, including titles, in columns D and E (say,
D1:D12 and E1:E12) and:
1. Select Data ➔ Data Analysis. In the Data Analysis
dialog box, select Histogram from the Analysis
Tools list and then click OK.
In the Histogram dialog box:
2. Enter or highlight A1:A51 as the Input Range and
enter or highlight D1:D12 as the Bin Range.
3. Check Labels, click New Worksheet Ply, and check
Chart Output.
4. Click OK to create the frequency distribution and
histogram on a new worksheet.
In the new worksheet:
5. Follow steps 5 to 9 of the Analysis ToolPak instructions in ‘Summarising Numerical Data: The Frequency Distribution’ above.
These steps construct a frequency distribution and histogram for city restaurant main meal prices. To construct a
frequency distribution and histogram for suburban restaurant main meal prices repeat the nine steps, but in step 2
enter or highlight B1:B51 as the Input Range. You will
need to correct several formatting errors that Excel makes
to the histograms it constructs. For each histogram:
1. Right-click inside a bar and click Format Data
Series in the shortcut menu.
2. In the Format Data Series task pane, click Series
Options. In the Series Options, click Series
Options, enter 0 in the Gap Width box, and then
close the task pane. (To see the Series Options, you
may have to first click the chart [third] icon near
the top of the task pane.)
(Earlier Excels) In the Format Data Series dialog
box, click Series Options in the left pane, and in
the Series Options right pane, change the Gap
Width slider to No Gap. Click Close.
Histogram bars are labelled by bin numbers. To change the
labelling to mid-points, open to each of the new worksheets
and:
3. Enter Midpoints in cell C3. Copy the mid-point
cell range E2:E12 of the Unstacked worksheet
and paste this list into cell C4 of the new worksheet.
4. Right-click the histogram and click Select Data.
5. In the Select Data Source dialog box, click Edit
under the Horizontal (Categories) Axis Labels
heading.
6. In the Axis Labels dialog box, drag the mouse to
select the cell range C4:C14 to enter that cell
range. Do not type this cell range in the Axis label
range box as you would otherwise do. Click OK in
this dialog box and then click OK (in the Select
Data Source dialog box).
7. Move the chart to a chart sheet (right-click on
Chart ➔ Move Chart). Adjust chart formatting if
required.
The Percentage Polygon and the Cumulative Percentage
Polygon (Ogive)
Key technique
Construct percentage and cumulative percentage polygons.
Example
Construct percentage and cumulative percentage polygons for main meal prices at city and suburban restaurants, similar to Figure 2.8 on page 52 and Figure 2.10 on
page 53.
PHStat
Use Histogram & Polygons.
For the example, use the PHStat instructions for
creating a histogram on page 83 but in step 3 of those
instructions, also check Percentage Polygon and
Cumulative Percentage Polygon (Ogive) before
clicking OK.
In-depth Excel
Use the Polygons_workbook as a model.
For the example, open the Unstacked worksheet in
the Restaurant file. Then follow steps 1 to 8 to construct a
histogram for city restaurant meal prices. However, include
a class of zero frequency at either end of your bin cell
range. (Say, in cells D1:D14, including title; also add the corresponding class mid-points in cells E1:E14.) Repeat steps 1 to 8
but in step 7 type the array formula =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$16) to construct a frequency distribution for suburban restaurant main meal
prices. Open to the worksheet that contains the city restaurant meal price frequency distribution and:
1. Select column C. Right-click and click Insert in
the shortcut menu. Right-click and click Insert in
the shortcut menu a second time. (The worksheet
contains new, blank columns C and D and the midpoints column is now column E.)
2. Enter Percentage in cell C3 and Cumulative
Pctage. in cell D3.
3. Enter =B4/SUM($B$4:$B$16) in cell C4 and
copy this formula down to row 16.
4. Enter =C4 in cell D4.
5. Enter =C5+D4 in cell D5 and copy this formula
down to row 16.
6. Select the cell range C4:D16, right-click, and click
Format Cells in the shortcut menu.
7. In the Number tab of the Format Cells dialog
box, click Percentage in the Category list and
click OK.
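The percentage and cumulative percentage columns built in steps 3 to 5 can be sketched in Python as a check on the worksheet arithmetic. This is a minimal sketch under stated assumptions: the function name percentage_columns and the class frequencies are hypothetical, and the cumulative column is computed as a running total of counts divided by the overall total, which gives the same values as the worksheet's running sum of percentages.

```python
from itertools import accumulate


def percentage_columns(frequencies):
    """Return (percentages, cumulative percentages) as proportions."""
    total = sum(frequencies)
    percentages = [f / total for f in frequencies]
    # Running total of the counts, divided once by the overall total
    cumulative = [c / total for c in accumulate(frequencies)]
    return percentages, cumulative


# Hypothetical class frequencies, with zero-frequency classes at both ends
freqs = [0, 2, 3, 5, 0]
pct, cum = percentage_columns(freqs)
for f, p, c in zip(freqs, pct, cum):
    print(f"{f:>3} {p:>7.1%} {c:>7.1%}")
```

The values are proportions between 0 and 1, matching what the worksheet cells hold before the Percentage number format is applied.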
Open to the worksheet that contains the suburban restaurant main meal price frequency distribution and repeat
steps 1 to 7. To construct the percentage polygons, open to
the worksheet that contains the city restaurant price frequency distribution and:
1. Select cell range C4:C16.
2. Select Insert, then select the Line icon in the
Charts group (#2 in the Charts group illustration
in Figure EG2.5), and then select the fourth 2-D
Line gallery item (Line with Markers). In earlier
Excels, select Insert ➔ Line and select the fourth
2-D Line gallery item (Line with Markers).
3. Right-click the chart and click Select Data in the
shortcut menu.
In the Select Data Source dialog box:
4. Click Edit under the Legend Entries (Series)
heading. In the Edit Series dialog box, enter the
formula ="City Restaurants" as the Series name
and click OK.
5. Click Edit under the Horizontal (Categories) Axis
Labels heading. In the Axis Labels dialog box, drag
the mouse to select the cell range E4:E16 to enter
that cell range. Do not type this cell range in the
Axis label range box as you would otherwise do.
6. Click OK in this dialog box and then click OK (in
the Select Data Source dialog box).
Back in the chart:
7. Move chart to a chart sheet (right-click on Chart
➔ Move Chart). Adjust chart formatting if
required.
In the new chart sheet:
8. Right-click the chart and click Select Data in the
shortcut menu.
9. In the Select Data Source dialog box, click Add.
In the Edit Series dialog box:
10. Enter the formula ="Suburban Restaurants" as
the Series name and press Tab.
11. With the current value in Series values highlighted, click the worksheet tab for the worksheet
that contains the suburban restaurant meal price
frequency distribution.
12. Drag the mouse to select the cell range C4:C16 to
enter that cell range as the Series values. Do not
type this cell range in the Series values box as you
would otherwise do.
13. Click OK. Back in the Select Data Source dialog
box, click OK.
To construct the cumulative percentage polygons, open
to the worksheet that contains the city restaurant price of
main meal frequency distribution and repeat steps 1 to 13
but replace steps 1, 5 and 12 with the following:
1. Select the cell range D4:D16.
5. Click Edit under the Horizontal (Categories)
Axis Labels heading. In the Axis Labels dialog
box, drag the mouse to select the cell range
A4:A16 to enter that cell range.
12. Drag the mouse to select the cell range D4:D16 to
enter that cell range as the Series values.
If the Y axis of the cumulative percentage polygon
extends past 100%, right-click the axis and click Format
Axis in the shortcut menu. In the Format Axis task pane,
click Axis Options. In the Axis Options, enter 0 in the
Minimum box and 1 in the Maximum box and then close
the pane. In earlier Excels, you set this value in the Format Axis dialog box. Click Axis Options in the left pane,
and in the Axis Options right pane, click the first Fixed
option button (for Minimum), enter 0 in its box, and then
click Close.
EG2.4 ORGANISING AND VISUALISING TWO
CATEGORICAL VARIABLES
ORGANISING TWO CATEGORICAL VARIABLES
The Contingency Table
Key technique
Use the PivotTable feature to create a contingency table for
untallied data.
Example
Construct a contingency table for location and number of
bedrooms similar to Table 2.11 on page 55.
PHStat (untallied data)
Use Two-Way Tables & Charts.
For the example, open the Property file. Select
PHStat ➔ Descriptive Statistics ➔ Two-Way Tables
& Charts. In the procedure’s dialog box (shown in
Figure EG2.10):
1. Enter or highlight F2:F102 as the Row Variable
Cell Range.
2. Enter or highlight C2:C102 as the Column Variable Cell Range.
3. Check First cell in each range contains label.
4. Enter a Title and click OK.
Figure EG2.10 Two-Way
Tables & Charts dialog box
In-depth Excel (untallied data)
Use the Contingency_Table workbook as a model.
For the example, open the Property file. Select
Insert ➔ PivotTable. In the Create PivotTable dialog
box:
1. Click Select a table or range and enter or highlight C2:F102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.
In the PivotTable Fields (called the PivotTable Field List in
some Excel versions) task pane:
3. Tick Location in Choose fields to add to report
to add it to the ROWS (or Row Labels) box.
4. Tick Bedrooms in Choose fields to add to report
and drag it to the COLUMNS (or Column
Labels) box.
5. Drag Location in Choose fields to add to report
and drop it in the Σ Values box. (Location changes
to Count of Location.)
In the PivotTable being created:
6. Select cell A3 and enter a space character to clear
the label Count of Location.
7. Enter Location in cell A4 to replace the heading
Row Labels.
8. Enter Bedroom in cell B3 to replace the heading
Column Labels.
9. Right-click over the PivotTable and then click
PivotTable Options in the shortcut menu that
appears.
In the PivotTable Options dialog box:
10. Click the Layout & Format tab.
11. Check For empty cells show and enter 0 as its
value. Leave all other settings unchanged.
12. Click the Totals & Filters tab.
13. Check Show grand totals for columns and Show
grand totals for rows.
14. Click OK to complete the table.
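The tallying that the PivotTable performs for untallied data can be sketched in Python. This is an illustrative sketch only: the function name contingency_table and the location and bedroom records are hypothetical and are not taken from the Property file. Combinations that never occur are shown as 0, mirroring the For empty cells show setting in step 11, and row and column totals correspond to the grand totals in step 13.

```python
from collections import Counter


def contingency_table(rows, cols):
    """Cross-tabulate two equal-length lists of category labels."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Missing combinations count as 0 (a Counter returns 0 for absent keys)
    table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}
    row_totals = {r: sum(table[r].values()) for r in row_labels}
    col_totals = {c: sum(table[r][c] for r in row_labels) for c in col_labels}
    return table, row_totals, col_totals


# Hypothetical records (illustration only)
location = ["A", "B", "A", "A", "B"]
bedrooms = [2, 3, 3, 2, 3]
table, row_totals, col_totals = contingency_table(location, bedrooms)
print(table)
print(row_totals, col_totals)
```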
In-depth Excel (tallied data)
Use the CONTINGENCY_SIMPLE worksheet of the
Contingency_Table workbook as a model for creating
a contingency table.
VISUALISING TWO CATEGORICAL VARIABLES
The Side-By-Side Chart
Key technique
Use an Excel bar chart that is based on a contingency table.
Example
Construct a side-by-side chart that displays location and
number of bedrooms, similar to Figure 2.12 on page 56.
PHStat
Use Two-Way Tables & Charts.
For the example, use the Section EG2.4 ‘The Contingency Table’ PHStat instructions, but in step 4, check Side-by-Side Bar Chart in addition to entering a Title and clicking
OK.
In-depth Excel
Use the Contingency_Table workbook as a model.
For the example, open to the TwoWayTable worksheet of the Contingency_Table workbook and:
1. Select cell A3 (or any other cell inside the PivotTable).
2. Select Insert ➔ Column in Excel 2016, or Bar in
earlier Excel versions, and select the first 2-D Bar
gallery item (Clustered Bar).
3. Right-click the Count of Location button in the
chart and click Hide All Field Buttons on Chart.
4. Move the chart to a chart sheet (right-click on
Chart ➔ Move Chart). Adjust formatting if
required.
When creating a chart from a contingency table that is
not a PivotTable, select the cell range of the contingency
table, including row and column headings, but excluding
the total row and total column, as step 1.
If you need to switch the row and column variables in a
side-by-side chart, right-click the chart and then click Select
Data in the shortcut menu. In the Select Data Source dialog
box, click Switch Row/Column and then click OK. (In
Excel 2007, if the chart is based on a PivotTable, the Switch
Row/Column button will be disabled. In that case, you need
to change the PivotTable to change the chart.)
EG2.5 VISUALISING TWO NUMERICAL VARIABLES
The Scatter Diagram
Key technique
Use the Excel scatter chart.
Example
Construct a scatter diagram of number of bedrooms and
asking price, similar to Figure 2.14 on page 59.
PHStat
Use Scatter Plot.
For the example, open the Property file. Select
PHStat ➔ Descriptive Statistics ➔ Scatter Plot. In the
procedure’s dialog box (shown in Figure EG2.11):
1. Enter or highlight A2:A102 as the Y Variable Cell
Range.
2. Enter or highlight C2:C102 as the X Variable Cell
Range.
3. Check First cells in each range contains label.
4. Enter a Title and click OK.
Figure EG2.11 Scatter Plot dialog box
To add a superimposed line like the one shown in
Figure 2.14, click the chart and use step 3 of the In-depth
Excel instructions.
In-depth Excel
Use the Scatter_Diagram workbook as a model.
For the example, open the Property file. The two variables ‘Number of bedrooms’ and ‘Asking price’ have been
copied to columns I and J.
1. Select the cell range I2:J102.
2. Select Insert, then the Scatter (X,Y) icon in
the Charts group (#4 in the illustration in Figure EG2.5), and then select the first Scatter
gallery item (Scatter). In earlier Excels, select
Insert ➔ Scatter and select the first Scatter
gallery item (Scatter with only Markers).
3. Select Design ➔ Add Chart Element ➔ Trendline ➔ Linear. In earlier Excels, select Layout ➔
Trendline ➔ Linear Trendline.
4. Move chart to a chart sheet (right-click on
Chart ➔ Move Chart). Adjust chart formatting
if required.
When constructing Excel scatter diagrams with other
variables, make sure that the X or horizontal variable column precedes (is to the left of) the Y or vertical variable
column. (If the worksheet is arranged Y then X, cut and
paste so that the Y variable column appears to the right of
the X variable column.)
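The superimposed line added by the Linear trendline option is a least-squares line through the plotted points. A minimal Python sketch of that calculation follows, assuming X is the horizontal variable and Y the vertical variable as described above. The function name linear_trendline and the bedroom and price pairs are hypothetical and are not taken from the Property file.

```python
def linear_trendline(x, y):
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (bedrooms, asking price in $000) pairs, illustration only
bedrooms = [1, 2, 3, 4]
price = [300, 500, 700, 900]
slope, intercept = linear_trendline(bedrooms, price)
print(slope, intercept)
```

Swapping the two arguments would fit X on Y instead, which is the same mix-up the column-order warning above is meant to prevent.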
The Time-Series Plot
Key technique
Use the Excel scatter chart.
Example
Construct a time-series plot of Australian dollar exchange
rate against US dollar from 2010 to 2017, similar to
Figure 2.15 on page 60.
In-depth Excel
Use the Time-Series workbook as a model.
For the example, open the Exchange_Rate_2010_2017
file and:
1. Select the cell range A9:B95.
2. Select Insert, then select the Scatter (X, Y) icon
in the Charts group (#4 in the illustration in Figure EG2.5), and then select the fourth or fifth
Scatter gallery item (Scatter with Straight Lines
with or without Markers). In earlier Excels,
select Insert ➔ Scatter and select the fourth or
fifth Scatter gallery item (Scatter with Straight
Lines with or without Markers).
3. Move chart to a chart sheet (right-click on Chart
➔ Move Chart). Adjust chart formatting if
required.
When constructing time-series charts with other variables, make sure that the X or time variable column precedes (is to the left of) the Y or vertical variable column.
(If the worksheet is arranged Y then X, cut and paste so
that the Y variable column appears to the right of the X
variable column.)
EG2.6 DESCRIPTIVE ANALYTICS
Sparklines
In-depth Excel
Use Sparklines.
For example, to create the Figure 2.17 sparklines display, open to the DATA worksheet of the WL_WaitHistory
workbook. In this worksheet, ride names are in column A and
the historical wait times data by half-hours are in columns C
through W. Select cell range C3:W16 and:
1. Select Insert ➔ Sparklines (select line as the
sparkline type).
2. In the Insert Sparklines dialog box, enter B3:B16
as the Location Range and click OK.
3. Select Axis and then Vertical. Choose Same for
all Sparklines for both Maximum and Minimum.
Gauges
In-depth Excel
To construct a gauge we must create both a doughnut chart
for the coloured zones and a pie chart for the pointer. To
create the gauges equivalent to the one shown in Figure
2.18 on page 65, open to the TopSixDATA worksheet of
the WL_WaitData workbook and:
1. Select the cell range E3:E7.
2. Select Insert ➔ Pie Chart and select Doughnut.
3. Right-click on the doughnut, select Format Data
Series and type ‘271’ into Angle of first slice (see
Figure EG2.12) and close the box.
Figure EG2.12 Format Data Series dialogue box
4. Right-click on the largest doughnut slice and select
Format Data Point, select Fill ➔ No Fill.
Figure EG2.13 Format Data Point dialogue box
5. Right-click on the doughnut and choose Select
Data. Click the + button and add ‘pointer’ as the
name and ‘G3:G5’ as the Y values.
Figure EG2.14 Select Data Source dialogue box
6. Right-click at the new second doughnut, click on
Change Chart Type and choose Pie Chart.
7. Right-click on the pie chart, select Format Data
Series and check the Secondary Axis option. Type
‘270’ into the Angle of first slice.
Figure EG2.15 Format Data Series dialog box
8. Right-click on the largest slice of pie and select
Format Data Point. Select Fill ➔ No Fill. Repeat
for the next largest slice.
9. Insert ➔ Text Boxes to add the appropriate labels
and change the gauge colours to suit.
Bullet Graph
In-depth Excel
Use the BulletGraph worksheet of the GaugeBullet
workbook as a model for simulating a bullet graph.
To construct a simulated bullet graph in Excel, you create a bar chart of the variable being graphed with a transparent background and overlay this chart on a bar chart that
displays the coloured zones. For example, to construct a
chart similar to the bullet graph shown in Figure 2.18 on
page 65, open to the waitDATA worksheet of the WL_
WaitData workbook and:
1. Select cell range B1:C15.
2. Select Insert, then the bar chart icon, and select
the Clustered Bar.
3. In the newly constructed bar chart, turn off the
gridlines.
4. Right-click in the white space to the right of the
chart title and click Format Chart Area in the
shortcut menu.
5. In the Fill part of the Format Chart Area pane click
No fill. The background of the chart becomes
transparent.
Next, construct the bar chart that will serve as the coloured
zones for the bullet graph.
6. In the cell range D2:D6, enter the values 25, 20,
20, 20 and 15, to define the five zones of the
Figure 2.18 bullet graph. Then select this edited
cell range D2:D6.
7. Select Insert, then the bar chart icon, and select
the Stacked Bar.
8. In the newly constructed bar chart, turn off the
gridlines.
9. Right-click in the white space to the right of the
chart title and click Select Data in the shortcut
menu.
10. In the Select Data Source dialog box, click Switch
Row/Column and then click OK. A chart of five
simple bars becomes a chart of one stacked bar
with five parts.
11. Right-click the one stacked bar and click Format
Data Series in the shortcut menu. In the Series
Options part of the Format Data Series pane,
change Gap Width to 0%.
12. Change the colouring of the stacked bars. Select
Design ➔ Change Colors and in the gallery click
one of the colour spectrums. Be sure to choose a
set of colours that does not include the colour used
for the bars in the bar chart you constructed using
steps 1 to 5.
13. Right-click the horizontal chart axis and click
Format Axis in the shortcut menu.
14. In the Axis Options of the Format Axis pane, enter
100 as the Maximum. In Excel 2010, first click
Fixed in the Maximum line, then enter 100, and
then click Close.
15. Adjust the size of the chart, as necessary, by clicking a corner of the bar chart frame and then dragging that corner to resize the chart.
16. Right-click the chart border and select Send to
Back ➔ Send to Back in the shortcut menu.
17. Drag the bar chart with the transparent background
over the stacked bar chart and adjust so that the
zeroes on the horizontal axis of both charts coincide. Then adjust the width of that bar chart so that
all other horizontal axis numbers that the two
charts share coincide.
For other problems, you need to identify the maximum
value and enter the proper set of values in a new column in
order to correct the stacked bar chart that serves to display
the zones for the bullet graph.
Treemap
1. Highlight cells A1:C15 and select Insert ➔ Chart
➔ Other Charts ➔ Hierarchical Treemap.
More detailed instructions for treemaps and data discovery
are contained in the Software Guide in Chapter 20 (online).
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 3
Numerical descriptive measures
FESTIVAL EXPENDITURE
Returning to the festival expenditure scenario introduced in Chapter 2, as well as
presenting the expenditure data graphically, Kai wishes to summarise and analyse the
data further. In particular, for each non-local visitor type (intrastate, interstate and
international) numerical measures of the centre and variation of total expenditure in the region
during the festival are required. This analysis will help to answer the following questions:
■ What is the ‘average’ amount spent during the festival? How does this differ between visitor types?
■ How varied is the amount spent during the festival? How does this differ between visitor types?
© Ton Koene/age fotostock
LEARNING
OBJECTIVES
After studying this chapter you should be able to:
1 calculate and interpret numerical descriptive measures of central tendency, variation and
shape for numerical data
2 calculate and interpret descriptive summary measures for a population
3 construct and interpret a box-and-whisker plot
4 calculate and interpret the covariance and the coefficient of correlation for bivariate data
variation
The spread, scattering or dispersion
of data values.
In Chapter 2 we saw how tables and graphs can be used to organise, visualise, summarise and
describe data. In this chapter we discuss various numerical measures that can be used to summarise and describe numerical data. These numerical measures not only can be used to summarise a particular sample or population but will also enable the sample or population to be
compared with others. Furthermore, these numerical measures, unlike graphs and tables, are
precise, objectively determined and easy to manipulate, interpret and compare. They allow for
a careful analysis of data which is especially important when using sample data to make inferences about an entire population. For example, Kai may be interested in whether interstate
visitors spend more during the festival than do intrastate visitors. Also of interest would be how
expenditure by international visitors to the festival compares to that of non-local visitors from
within Australia.
This chapter introduces some of the statistics that measure:
• central tendency, the extent to which the data values are grouped around a central value
• variation, the spread, scattering or dispersion of data values
• shape, the pattern of the distribution of data values from the lowest value to the highest
value.
shape
The pattern of the distribution of
data values.
Covariance and the coefficient of correlation, which measure the strength of the association
between two numerical variables, are also introduced.
central tendency
The extent to which data values are
grouped around a central value.
LEARNING OBJECTIVE
1
Calculate and interpret
numerical descriptive
measures of central
tendency, variation and
shape for numerical data
arithmetic mean (mean)
Measure of central tendency;
sum of all values divided by the
number of values (usually called
the mean); called the arithmetic
mean to distinguish it from the
geometric mean.
3.1 MEASURES OF CENTRAL TENDENCY, VARIATION AND SHAPE
We can describe a data set by describing its central tendency, variation and shape.
Measures of Central Tendency
Many data sets have a distinct central tendency, with the data values grouped or clustered
around a central point. Everyday expressions such as ‘the average value’, ‘the middle value’ or
‘the most popular or frequent value’ refer to measures of central tendency. The three most
important measures of central tendency – mean, median and mode – are introduced in this section. These measures are precise, objectively determined and easy to manipulate, interpret and
compare. As we see in the following sections, each has its advantages and disadvantages.
Mean
The arithmetic mean (typically referred to as the mean) is the most common measure of central
tendency. The mean uses all the data values and can be calculated exactly. It can be thought of
as a ‘balance point’ in a set of data (like the fulcrum on a seesaw). The mean is calculated by
adding all the values of a variable in a data set and then dividing the sum by the number of
variable values in the data set.
The symbol $\bar{X}$, called X bar, is used to represent the mean of a sample. For a sample containing n values, the equation for the mean of a sample is written as:
$$\bar{X} = \frac{\text{sum of the sample values}}{\text{number of sample values}}$$
Using the series $X_1, X_2, \ldots, X_n$ to represent the set of n values and n to represent the number of
values, the equation becomes:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
By using summation notation (discussed in Appendix B), we can replace the numerator
$X_1 + X_2 + \cdots + X_n$ by $\sum_{i=1}^{n} X_i$, which means sum all the $X_i$ values from the first X value, $X_1$, to the
last X value, $X_n$, to obtain Equation 3.1.
SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$
where $\bar{X}$ = sample mean
n = number of values or sample size
$X_i$ = ith value of the variable X
$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$ = sum of all $X_i$ values in the sample
sample mean
Mean calculated from sample data.
As all the data values play an equal role in the calculation of the mean, the mean will be
affected by any extreme (high or low) value. When there are extreme values, you should take
care when using the mean as a measure of central tendency.
The mean gives a ‘typical’ or central value for a data set. For example, if you knew the
typical time it takes you to get ready in the morning, you might be able to plan your morning
better and minimise any excessive lateness (or earliness). Suppose you define the time to get
ready as the time in minutes (rounded to the nearest minute) from when you get out of bed to
when you leave. You collect the times (shown below) for 10 consecutive working days; this
data is stored in < TIMES >.
Day:             1   2   3   4   5   6   7   8   9   10
Time (minutes):  39  29  43  52  39  44  40  31  44  35
The mean time to get ready is 39.6 minutes, calculated using Equation 3.1:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{39 + 29 + \cdots + 35}{10} = \frac{396}{10} = 39.6$$
Even though no one day in the sample actually had the value 39.6 minutes, allotting about
40 minutes to get ready would be a good rule for planning your morning – but only because the
10 days did not contain any extreme values.
To illustrate how the mean can be greatly affected by any value that is very different from
the others, imagine that on day 4, a set of unusual circumstances delayed you getting ready by
50 minutes so that the time for that day was 102 minutes. This extreme value would cause the
mean to rise to 44.6 minutes:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6$$
The one extreme value has increased the mean by more than 10% from 39.6 to 44.6 minutes. In
contrast to the original mean, which was in the ‘middle’ (greater than 5 of the times to get ready
and less than the other 5), the new mean is greater than 9 of the 10 times to get
ready. The extreme value of 102 has caused the mean to increase and thus become a poor measure of central tendency.
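The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch of Equation 3.1 using the ten times to get ready from the text. The function name sample_mean is chosen here for illustration.

```python
def sample_mean(values):
    """Equation 3.1: sum of the values divided by the number of values."""
    return sum(values) / len(values)


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # minutes, days 1 to 10
print(sample_mean(times))

# Day 4 delayed by 50 minutes: 52 becomes 102
delayed = times.copy()
delayed[3] = 102
print(sample_mean(delayed))
```

Changing a single value moves the result because every value enters the sum with equal weight.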
A statistical calculator can be used to calculate the mean (and other numerical measures
introduced in this chapter), while for large data sets, as we see later in this section, Excel can be
used. Even though it is not usually necessary to use Equation 3.1 to calculate the mean, it is
important that you understand the process of how the mean is determined.
EXAMPLE 3.1
MEAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. The following data give the dollar amount
spent by a random sample of 12 international visitors. < FESTIVAL >
1,119  615  971  553  343  502  928  1,005  993  408  725  763
Calculate and interpret the mean amount spent by international visitors.
SOLUTION
Calculate the sum of X:
$$\sum_{i=1}^{12} X_i = 1{,}119 + 615 + \cdots + 763 = 8{,}925$$
then using Equation 3.1, we obtain:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{8{,}925}{12} = 743.75$$
Therefore, international visitors on average spent $743.75 in the region during the festival.
median
Measure of central tendency;
middle value in an array.
Median
The median is the value that partitions or splits an ordered set of data into two equal parts. As
the median is not affected by extreme values, it may be a better measure of central tendency
when there are extreme values.
The median is the middle value in a set of data that has been ordered from lowest to
highest value.
To calculate the median for a set of data, first order the values from smallest to largest.
Then use Equation 3.2 to calculate the rank of the value that is the median.
MEDIAN
50% of the values are equal to or smaller than the median and 50% of the values are
equal to or larger than the median.
$$\text{Median} = \frac{n+1}{2}\ \text{ranked value} \qquad (3.2)$$
Calculate the median value by these two rules:
• Rule 1 If there is an odd number of values in the data set, the median is the middle-ranked
value.
• Rule 2 If there is an even number of values in the data set, then the median is the mean of
the two middle-ranked values.
To calculate the median for the sample of the 10 times to get ready, first order the times:
Ordered values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9   10
Median = 39.5
Rank of the median is (n + 1)/2 = (10 + 1)/2 = 5.5. So, using rule 2, the median is the mean
of the fifth- and sixth-ranked values, (39 + 40)/2 = 39.5. Therefore, for half of the days the
time to get ready is less than or equal to 39.5 minutes and for half of the days the time to get
ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is
very close to the mean time to get ready of 39.6 minutes.
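The two rules can be sketched in Python as follows. This is an illustrative sketch rather than a prescribed method: the function name median_by_rank is chosen here, and the data are the ten times to get ready from the text.

```python
def median_by_rank(values):
    """Median via the (n + 1)/2 ranked value (Equation 3.2)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Rule 1: odd n, rank (n + 1)/2 is a whole number
        return ordered[middle]
    # Rule 2: even n, average the two middle-ranked values
    return (ordered[middle - 1] + ordered[middle]) / 2


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(median_by_rank(times))
```

Python list positions start at 0, so the value ranked k in the ordered data sits at position k - 1. That is why the fifth- and sixth-ranked values are read from positions 4 and 5.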
EXAMPLE 3.2
CALCULATING THE MEDIAN FOR AN ODD SAMPLE SIZE
For a certain café, the numbers of customers during a selected seven-day week were 100, 75,
92, 85, 70, 80 and 71. Calculate the median number of customers for this week.
SOLUTION
Ordered values:   70   71   75   80   85   92   100
Ranks:             1    2    3    4    5    6     7
Median = 80
Rank of the median is (n + 1)/2 = (7 + 1)/2 = 4. So, using rule 1, the median is the fourth-ranked
value. The median number of customers is 80. Therefore, 50% of days have 80 or fewer
customers and 50% have 80 or more customers.
EXAMPLE 3.3
CALCULATING THE MEDIAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL >
Calculate and interpret the median amount spent by international visitors.
SOLUTION
First order the data:
343   408   502   553   615   725   763   928   971   993   1,005   1,119
The median lies between the sixth- and seventh-ranked values, 725 and 763.
Rank of the median is (n + 1)/2 = (12 + 1)/2 = 6.5. Using rule 2, the median is the mean of the
sixth- and seventh-ranked values, (725 + 763)/2 = 744.
Therefore, 50% of international visitors in the sample spent less than $744 during the
festival and 50% spent more than $744.
mode
Measure of central tendency; most
frequent value.
Mode
The mode is the value in a data set that appears most frequently. Like the median and unlike the
mean, extreme values do not affect the mode. You should use the mode only for descriptive
purposes as it is more variable from sample to sample than either the mean or the median. Often
there is no mode or there are several modes in a set of data. For example, consider the data for
the times to get ready shown below:
29   31   35   39   39   40   43   44   44   52
There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice.
Because it has two modes, this data set is considered to be bimodal.
EXAMPLE 3.4
CALCULATING THE MODE
A company’s information systems manager keeps track of the number of unplanned outages
that occur in a month. Calculate the mode for the following data, which represent the number
of unplanned outages during the past 14 months:
1   3   0   3   26   2   7   4   0   2   3   3   6   3
SOLUTION
The ordered array for these data is:
0   0   1   2   2   3   3   3   3   3   4   6   7   26
Because 3 appears five times, more than any other value, the mode is 3. Thus, the systems
manager can say that the most common occurrence is three unplanned outages a month. For
this data set, the median is also equal to 3 while the mean is equal to 4.5. As the mean is
affected by the extreme value of 26 unplanned outages, the median and the mode are better
measures of central tendency than the mean for this data set.
A set of data will have no mode if none of the values is ‘most typical’ – that is, if no data
value occurs more than once. Example 3.5 presents a data set with no mode.
EXAMPLE 3.5
DATA WITH NO MODE
For the café of Example 3.2, calculate the mode for the number of customers for the
seven days.
SOLUTION
The ordered array for these data is:
70   71   75   80   85   92   100
As none of the days have the same number of customers there is no mode.
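The three situations described above (several modes, one mode, no mode) can be checked with a short Python sketch. The helper name `modes` is hypothetical, and returning an empty list for 'no mode' is a convention assumed here, chosen to match the definition that a data set has no mode when no value occurs more than once.

```python
from collections import Counter


def modes(values):
    """Return all modes as a sorted list; empty if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                    # no value occurs more than once
        return []
    return sorted(v for v, c in counts.items() if c == highest)


times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
outages = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]
customers = [70, 71, 75, 80, 85, 92, 100]

print(modes(times))        # [39, 44]  (bimodal)
print(modes(outages))      # [3]
print(modes(customers))    # []        (no mode)
```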
quartiles
Measures of relative standing,
partition a data set into quarters.
Quartiles
We have seen that the median partitions a set of data into two equal parts. We can extend this
idea by partitioning a set of data into as many equal parts as we wish. Quartiles divide a set of
data into quarters – that is, four equal parts. The first, or lower, quartile, Q1, divides the lower 25%
of the values from the other 75%, which are larger. The second quartile, Q2, is the median – 50%
of the values are below the median and 50% above. The third, or upper, quartile, Q3, has 75% of
the values below it and 25% above. Equations 3.3 and 3.4 define the first and third quartiles.
Q1, the median and Q3 are also the 25th, 50th and 75th percentiles, respectively. Equations
3.2, 3.3 and 3.4 can be expressed generally in terms of finding percentiles: (p × 100)th percentile = p × (n + 1) ranked value. p is between 0 and 1, with, for example, the median (Q2) corresponding to a p value of 0.5.
first (lower) quartile
Value that 25% of data values are
smaller than, or equal to.
second quartile
The median value that 50% of data
values are smaller than, or equal to.
third (upper) quartile
Value that 75% of data values are
smaller than, or equal to.
FIRST, OR LOWER, QUARTILE, Q1
25% of the values are smaller than, or equal to, Q1, the first quartile, and 75% are larger than,
or equal to, the first quartile, Q1.
\[ Q_1 = \frac{n + 1}{4}\ \text{ranked value} \] (3.3)
THIRD, OR UPPER, QUARTILE, Q3
75% of the values are smaller than, or equal to, the third quartile, Q3, and 25% are larger
than, or equal to, the third quartile, Q3.
\[ Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} \] (3.4)
Use the following rules to calculate the quartiles:
• Rule 1 If the result is an integer, then the quartile is equal to the ranked value. For
example, if the sample size is n = 7, the first quartile, Q1, is equal to the (7 + 1)/4 = 2,
second-ranked value.
• Rule 2 If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the
mean of the corresponding ranked values. For example, if the sample size is n = 9, the
first quartile, Q1, is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the
second- and the third-ranked values.
• Rule 3 If the result is neither an integer nor a fractional half, round the result to the
nearest integer and select that ranked value. For example, if the sample size is n = 10, the
first quartile, Q1, is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use
the third-ranked value.
To illustrate the calculation of the quartiles for the times to get ready, rank the data from
smallest to largest:
Ranked values:   29   31   35   39   39   40   43   44   44   52
Ranks:            1    2    3    4    5    6    7    8    9   10
The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for
quartiles, round up to the third-ranked value as it is the closest integer. The third-ranked value
for the data for the times to get ready is 35 minutes. Interpret the first quartile of 35 to mean that
on 25% of the days the time to get ready is less than or equal to 35 minutes, and on 75% of the
days the time to get ready is greater than or equal to 35 minutes.
The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule
for quartiles, round down to the eighth-ranked value as it is the closest integer. The eighth-ranked value for the data for the times to get ready is 44 minutes. Interpret this to mean that on
75% of the days the time to get ready is less than or equal to 44 minutes, and on 25% of the
days the time to get ready is greater than or equal to 44 minutes.
Be aware that several methods exist for calculating quartiles. Other textbooks and Excel may
use different rules, which can result in slightly different values for the upper and lower quartiles.
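The three quartile rules used in this section can be sketched in Python as follows. This is a minimal illustration under the rules stated above; the names `ranked_value` and `quartiles` are hypothetical, and the sketch assumes at least two data values. Other software may use different interpolation rules and return slightly different quartiles.

```python
import math


def ranked_value(ordered, rank):
    """Value at a 1-based rank, applying the three quartile rules."""
    if rank.is_integer():                   # Rule 1: integer rank
        return ordered[int(rank) - 1]
    if (rank * 2).is_integer():             # Rule 2: fractional half
        low = ordered[int(rank) - 1]
        high = ordered[int(rank)]
        return (low + high) / 2
    nearest = math.floor(rank + 0.5)        # Rule 3: round to nearest integer
    return ordered[nearest - 1]


def quartiles(values):
    """Return (Q1, Q3) using Equations 3.3 and 3.4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("this sketch assumes at least two values")
    q1 = ranked_value(ordered, (n + 1) / 4)
    q3 = ranked_value(ordered, 3 * (n + 1) / 4)
    return q1, q3


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(quartiles(times))     # (35, 44)
```

For the times to get ready, the ranks are 2.75 and 8.25, so rule 3 selects the third- and eighth-ranked values, as in the text.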
EXAMPLE 3.6
CALCULATING THE QUARTILES FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
Kai is interested in the distribution of the amount spent by international visitors in the
region during the festival. < FESTIVAL >
Calculate and interpret the quartiles for the amount spent by international visitors.
SOLUTION
First order the data:
343   408   502   553   615   725   763   928   971   993   1,005   1,119
Q1 is the third-ranked value (502), the median lies between the sixth- and seventh-ranked
values (725 and 763), and Q3 is the 10th-ranked value (993).
Rank of the first quartile is (n + 1)/4 = (12 + 1)/4 = 3.25. Using rule 3, the first quartile is the
third-ranked value, Q1 = 502.
Therefore, 25% of international visitors in the sample spent $502 or less during the
festival and 75% spent $502 or more.
Rank of the third quartile is 3(n + 1)/4 = 3(12 + 1)/4 = 9.75. Using rule 3, the third quartile
is the 10th-ranked value, Q3 = 993.
Therefore, 75% of international visitors in the sample spent $993 or less during the
festival and 25% spent $993 or more.
Geometric Mean
geometric mean
Average rate of change of a
variable.
The geometric mean and the geometric mean rate of return are used to measure the status on an
investment over time or the average percentage change in a variable. The geometric mean,
defined by Equation 3.5, measures the average rate of change of a variable over n periods.
GEOMETRIC MEAN
The geometric mean is the nth root of the product of n values.
\[ \bar{X}_G = (X_1 \times X_2 \times \dots \times X_n)^{1/n} \] (3.5)
Using the geometric mean, we can measure the average return on an investment over time. This
is given by the geometric mean rate of return, defined by Equation 3.6.
GEOMETRIC MEAN RATE OF RETURN
\[ \bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \dots \times (1 + R_n)]^{1/n} - 1 \] (3.6)
where \(R_i\) = the rate of return in time period i as a decimal
To illustrate the use of these measures, consider an investment of $100,000 that declined to a
value of $50,000 at the end of year 1 and then rebounded back to its original $100,000 value at
the end of year 2. The rate of return for this investment for the two-year period is 0, because the
starting and ending value of the investment are the same. However, the arithmetic mean of the
annual rates of return of this investment is:
\[ \bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\% \]
since the rate of return for year 1 is:
\[ R_1 = \frac{50{,}000 - 100{,}000}{100{,}000} = -0.50 \text{ or } -50\% \]
and the rate of return for year 2 is:
\[ R_2 = \frac{100{,}000 - 50{,}000}{50{,}000} = 1.00 \text{ or } 100\% \]
Using Equation 3.6, the geometric mean rate of return for the two years is:
\[ \begin{aligned}
\bar{R}_G &= [(1 + R_1) \times (1 + R_2)]^{1/2} - 1 \\
&= \{[1 + (-0.50)] \times [1 + (1.0)]\}^{1/2} - 1 \\
&= (0.50 \times 2.0)^{1/2} - 1 \\
&= 1^{1/2} - 1 \\
&= 0
\end{aligned} \]
Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value
of the investment for the two-year period than does the arithmetic mean.
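The contrast between the arithmetic mean and the geometric mean rate of return can be reproduced with a short Python sketch. The function name `geometric_mean_rate_of_return` is hypothetical, and the sketch assumes every rate is greater than −1 (an investment cannot lose more than 100% of its value), so that the product of the (1 + R) terms stays positive.

```python
import math


def geometric_mean_rate_of_return(rates):
    """Equation 3.6; rates are decimals, e.g. -0.50 for -50%."""
    if not rates:
        raise ValueError("at least one rate of return is required")
    if any(r <= -1 for r in rates):
        raise ValueError("this sketch assumes every rate is greater than -1")
    growth = math.prod(1 + r for r in rates)    # product of the (1 + R_i) terms
    return growth ** (1 / len(rates)) - 1


two_year = [-0.50, 1.00]                        # year 1: -50%, year 2: +100%
print(sum(two_year) / len(two_year))            # arithmetic mean: 0.25
print(geometric_mean_rate_of_return(two_year))  # 0.0
```

The arithmetic mean suggests a 25% average gain, while the geometric mean rate of return is zero, matching the unchanged value of the investment.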
EXAMPLE 3.7
CALCULATING THE GEOMETRIC MEAN RATE OF RETURN
The annual percentage change in a New Zealand share market index, the NZX-50, for 2012
to 2016 was:
Year             2012   2013   2014   2015   2016
Annual change     24%    16%    18%    14%    10%
Data obtained from Yahoo 7 Finance <http://au.finance.yahoo.com> accessed April 2017
Calculate the geometric rate of return for these five years.
SOLUTION
Using Equation 3.6, the geometric mean rate of return in the NZX 50 Index for the five
years is:
\[ \begin{aligned}
\bar{R}_G &= [(1 + R_{2012}) \times (1 + R_{2013}) \times (1 + R_{2014}) \times (1 + R_{2015}) \times (1 + R_{2016})]^{1/5} - 1 \\
&= [(1 + 0.24) \times (1 + 0.16) \times (1 + 0.18) \times (1 + 0.14) \times (1 + 0.10)]^{1/5} - 1 \\
&= (1.24 \times 1.16 \times 1.18 \times 1.14 \times 1.10)^{1/5} - 1 \\
&= 1.16308\ldots - 1 \\
&= 0.1630\ldots
\end{aligned} \]
The geometric rate of return of the NZX 50 Index for the five years is approximately 16.3%
annually.
Measures of Variation
Variation measures the spread or dispersion of values in a data set. One simple measure of
variation is the range: the difference between the highest and lowest value. More commonly
used in statistics are the standard deviation and variance, two measures also introduced in
this section.
Range
The range is the simplest numerical descriptive measure of variation in a set of data.
spread (dispersion)
The amount of scattering of data
values.
range
Distance measure of variation;
difference between maximum and
minimum data values.
RANGE
The range is equal to the largest value minus the smallest value.
\[ \text{Range} = X_{\text{largest}} - X_{\text{smallest}} \] (3.7)
To determine the range of the times to get ready, first rank the data from smallest to largest:
29   31   35   39   39   40   43   44   44   52
Then, using Equation 3.7, the range is 52 − 29 = 23 minutes. The range of 23 minutes indicates that the largest difference between any two days in the time to get ready is 23 minutes.
EXAMPLE 3.8
CALCULATING THE RANGE FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL >
Calculate and interpret the range for amount spent by international visitors.
SOLUTION
From an ordered array of the data, the minimum amount an international visitor spent was
$343 and the maximum was $1,119. Using Equation 3.7 the range is
Xlargest − Xsmallest = 1,119 − 343 = 776
Therefore, the difference between the maximum and minimum amounts spent by international visitors during the festival was $776.
The range measures the total spread of the data. Although the range is a simple measure
of total variation, it is based only on the two extreme values and ignores all the other values.
Thus, it does not take into account how the data are distributed between the smallest and largest values; it does not indicate whether the values are evenly distributed throughout the data
set, clustered near the middle or clustered near one or both ends. Like the mean, the range is
distorted by very high or very low values, so care is needed when using the range as a measure
of variation.
interquartile range
Distance measure of variation;
difference between third and first
quartile; range of middle 50%
of data.
Interquartile Range
The interquartile range is the difference between the third and first quartiles in a set of data.
INTERQUARTILE RANGE
The interquartile range is the difference between the third quartile and the first quartile.
\[ \text{Interquartile range} = Q_3 - Q_1 \] (3.8)
The interquartile range is a more meaningful measure of variation than the range
because it ignores extreme values by finding the range of the middle 50% of the ordered
array of data values. In the times to get ready we found that Q1 = 35 and Q3 = 44. Hence,
using Equation 3.8:
Interquartile range = 44 − 35 = 9 minutes
Therefore, the interquartile range in the time to get ready is 9 minutes.
EXAMPLE 3.9
CALCULATING THE INTERQUARTILE RANGE FOR FESTIVAL EXPENDITURE –
INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by
international visitors during the festival. < FESTIVAL >
Calculate and interpret the interquartile range for the amount spent by international
visitors.
SOLUTION
From Example 3.6, the first quartile, Q1, is 502 and the third quartile, Q3, is 993. Using
Equation 3.8 the interquartile range is:
Q3 − Q1 = 993 − 502 = 491
Therefore, the range of the middle 50% of the amounts spent by international visitors in
the sample is $491.
When calculating the interquartile range the highest and lowest 25% of the data values are
discarded. Therefore, the interquartile range is not affected by extreme values. Summary
measures such as the median, Q1, Q3 and the interquartile range, which are not influenced by
extreme values, are called resistant measures.
Variance and Standard Deviation
Although the range and interquartile range are measures of variation, they do not take into consideration how the values are distributed or clustered between the extremes. Two commonly
used and related measures of variation that take into account how all the values in the data set
are distributed are the variance and the standard deviation. These statistics measure the average
scatter around the mean – how larger values fluctuate above it and how smaller values are
distributed below it.
These measures are based on the difference between each data value and the mean, called
the deviation of the data value from the mean. The notation \(X_i - \bar{X}\) is used to denote the deviation of a data value \(X_i\) from the mean \(\bar{X}\).
A measure of variation around the mean could be to take the deviation of each value from
the mean, and then sum the deviations. However, as the mean is the centre of balance in a set of
data, for every data set the deviations from the mean would sum to zero – that is,
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0 \]
This can be overcome by squaring the deviations from the mean before summing. In
statistics, this quantity is called a sum of squares (or SS). So the sum of squares for X is
\[ SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
This sum of squares is then divided by the number of values minus 1 (for
sample data) to get the sample variance (S²). The square root of the sample variance is the sample standard deviation (S).
resistant measures
Summary measures not influenced
by extreme values.
variance
Measure of variation based on
squared deviations from the mean;
directly related to the standard
deviation.
standard deviation
Measure of variation based on
squared deviations from the mean;
directly related to the variance.
sum of squares (SS)
Sum of the squared deviations.
Because the sum of squares is a sum of squared differences that will always be non-negative,
neither the variance nor the standard deviation can ever be negative. For a data set, the variance
and standard deviation will usually be positive, and will only be zero if there is no variation – that
is, all the values are equal.
For a sample containing n values, X1, X2, …, Xn, the sample variance (given by the symbol
S²) is:
\[ S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1} \]
Equation 3.9a expresses the equation using summation notation.
sample variance
Variance calculated from sample
data.
SAMPLE VARIANCE – DEFINITION FORMULA
The sample variance is the sum of the squared deviations from the sample mean divided
by the sample size minus one.
\[ S^2 = \frac{SSX}{n - 1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \] (3.9a)
where \(\bar{X}\) = sample mean
n = number of values or sample size
\(X_i\) = ith value of the variable X
\(SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2\) = sum of the squared deviations from the mean (sum of
squares)
sample standard deviation
Standard deviation calculated from
sample data.
SAMPLE STANDARD DEVIATION – DEFINITION FORMULA
The sample standard deviation is the square root of the sample variance.
\[ S = \sqrt{S^2} \] (3.10)
If the denominator was n instead of n − 1, Equation 3.9a would calculate the average of the
squared deviations from the mean. However, n − 1 is used because of certain desirable mathematical properties of the statistic S2 that make it appropriate for statistical inference (discussed
in Chapter 7).
The sample standard deviation, defined by Equation 3.10, is the more useful measure of
variation because, unlike the sample variance, which is a squared quantity, the standard deviation is a value that is expressed in the same units of measurement as the original sample data.
The standard deviation is a measure of how a set of data is clustered or distributed around its
mean. For most data sets the majority of the data values lie within one standard deviation of the
mean – that is, within \((\bar{X} - S, \bar{X} + S)\) – and we will see later in this chapter that for all data sets
at least 75% of the data values lie within two standard deviations of the mean – that is, within
\((\bar{X} - 2S, \bar{X} + 2S)\). Therefore, a knowledge of the mean and the standard deviation helps to
define where the majority of the data values are clustered.
Table 3.1 illustrates the steps for calculating the variance and standard deviation for the
data on the times to get ready with mean \(\bar{X} = 39.6\), calculated earlier. The second column of
Table 3.1 calculates the deviation of each time from the mean (step 1). The third column of
Table 3.1 calculates the square of each deviation from the mean (step 2). The sum of the squared
deviations (step 3) is shown at the bottom of Table 3.1. This total is then divided by 10 − 1 = 9
to calculate the variance (step 4).
We can also calculate the variance by substituting values for the terms in Equation 3.9a:
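\[ \begin{aligned}
S^2 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \\
&= \frac{(39 - 39.6)^2 + (29 - 39.6)^2 + \dots + (35 - 39.6)^2}{10 - 1} \\
&= \frac{412.4}{9} \\
&= 45.822\ldots
\end{aligned} \]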
Table 3.1
Calculating the variance of the times to get ready (mean = 39.6)

Time (X)    Step 1: deviation from mean    Step 2: squared deviation
   39                 −0.60                          0.36
   29                −10.60                        112.36
   43                  3.40                         11.56
   52                 12.40                        153.76
   39                 −0.60                          0.36
   44                  4.40                         19.36
   40                  0.40                          0.16
   31                 −8.60                         73.96
   44                  4.40                         19.36
   35                 −4.60                         21.16

Step 3: Sum                       SSX = 412.40
Step 4: Divide by (n − 1)         S² = 45.822...
The variance is in squared units (in squared minutes for these data) so, to calculate the
standard deviation, which is in the original units (minutes for these data), take the square root
of the variance. Using Equation 3.10, the sample standard deviation S is:
\[ S = \sqrt{S^2} = \sqrt{45.82\ldots} = 6.769\ldots \]
This indicates that most of the times to get ready in this sample are clustered within 6.77 minutes of the mean of 39.6 minutes (i.e. clustered between \(\bar{X} - S = 32.83\) and \(\bar{X} + S = 46.37\)).
Seven of the 10 times to get ready lie within this interval.
To check that the mean is correct, use the second column of Table 3.1 to calculate the sum
of the deviations from the mean. For any set of data, this sum will be zero – that is:
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0 \quad \text{for all sets of data} \]
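The four steps of Table 3.1, and the check that the deviations sum to zero, can be followed in Python. This is a minimal sketch using the times to get ready; the variable names are illustrative only, and the values in the comments are rounded.

```python
import math

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]    # minutes
n = len(times)
mean = sum(times) / n                                # 39.6

deviations = [x - mean for x in times]               # step 1: X_i - mean
squared = [d ** 2 for d in deviations]               # step 2: squared deviations
ssx = sum(squared)                                   # step 3: about 412.4
variance = ssx / (n - 1)                             # step 4: about 45.82
std_dev = math.sqrt(variance)                        # about 6.77

print(round(ssx, 2), round(variance, 3), round(std_dev, 3))

# The deviations sum to zero, apart from floating-point rounding error.
print(abs(sum(deviations)) < 1e-9)

# Count the values within one standard deviation of the mean.
within_one_sd = [x for x in times if mean - std_dev <= x <= mean + std_dev]
print(len(within_one_sd))                            # 7
```

The standard library functions `statistics.variance` and `statistics.stdev` also divide by n − 1 and can be used to cross-check these results.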
It is tedious to use Equation 3.9a to calculate sample variance, especially for large samples or
when the mean and/or data values are not integers. Instead, we can use algebra to obtain alternative calculation formulas.
SAMPLE VARIANCE – CALCULATION FORMULA
The sample variance is the sum of the squared deviations from the mean divided by the
sample size minus 1.
\[ S^2 = \frac{SSX}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1} \] (3.9b)
where \(\bar{X}\) = sample mean
n = number of values or sample size
\(X_i\) = ith value of the variable X
\(\sum_{i=1}^{n} X_i^2 = X_1^2 + X_2^2 + \dots + X_n^2\) = sum of the squared \(X_i\) values in the sample
Use either calculation formula. The sum of the squared times to get ready is:
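\[ \sum_{i=1}^{10} X_i^2 = 39^2 + 29^2 + \dots + 35^2 = 16{,}094 \]
Then, calculate the variance by substituting values for the terms in the calculation form of
Equation 3.9b:
\[ \begin{aligned}
S^2 &= \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \\
&= \frac{16{,}094 - 10 \times 39.6^2}{10 - 1} \\
&= \frac{412.4}{9} \\
&= 45.822\ldots
\end{aligned} \]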
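Both forms of the calculation formula can be compared with the definition formula in a short Python sketch. The variable names below are illustrative only; the data are the times to get ready.

```python
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]    # minutes
n = len(times)
sum_x = sum(times)                                   # 396
sum_x_sq = sum(x ** 2 for x in times)                # 16,094
mean = sum_x / n                                     # 39.6

# Definition formula (Equation 3.9a)
var_definition = sum((x - mean) ** 2 for x in times) / (n - 1)

# Calculation formula (Equation 3.9b), first and second forms
var_calc_mean = (sum_x_sq - n * mean ** 2) / (n - 1)
var_calc_sums = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

print(round(var_definition, 3), round(var_calc_mean, 3), round(var_calc_sums, 3))
```

All three agree to within floating-point rounding. One caution when using software: the calculation formula subtracts two large, nearly equal numbers, so for data whose mean is very large relative to its spread it can lose precision, and the definition formula is usually the safer choice in code.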
A statistical calculator can be used to calculate the standard deviation (and some other
numerical measures introduced in this chapter) and, as covered later in this section, Excel
can be used for large data sets. Even though it is not usually necessary to use Equations
3.9a or 3.9b to calculate variance and Equation 3.10 to calculate standard deviation, it is
important that you understand the process of how the variance and standard deviation are
obtained.
EXAMPLE 3.10
CALCULATING THE VARIANCE AND STANDARD DEVIATION FOR FESTIVAL
EXPENDITURE – INTERNATIONAL VISITORS
Kai is interested in the distribution of the amount spent by international visitors during the
festival. < FESTIVAL >
Calculate and interpret the variance and standard deviation for amount spent by international visitors.
SOLUTION
Calculate the sum of X squared:
\[ \sum_{i=1}^{12} X_i^2 = 1{,}119^2 + 615^2 + \dots + 763^2 = 7{,}380{,}205 \]
then from Example 3.1, \(\bar{X} = 743.75\) and using Equation 3.9b, we obtain:
\[ SSX = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 7{,}380{,}205 - 12 \times 743.75^2 = 742{,}236.25 \]
\[ S^2 = \frac{SSX}{n - 1} = \frac{742{,}236.25}{11} = 67{,}476.022\ldots \]
Therefore, the variance for the amount spent by international visitors during the festival is
approximately 67,476.022 dollars squared.
Now using Equation 3.10 the sample standard deviation, S, is:
\[ S = \sqrt{S^2} = \sqrt{67{,}476.022\ldots} = 259.761\ldots \]
Therefore, the standard deviation for the amount spent during the festival by international
visitors is approximately $259.76.
This indicates that we expect the majority of international visitors in the sample to have spent
within $260 (plus or minus) of the mean expenditure of $743.75 during the festival.
The following summarises the characteristics of the range, interquartile range, variance
and standard deviation:
• The more spread out, or dispersed, the data, the larger the range, interquartile range,
variance and standard deviation.
• The more concentrated, or homogeneous, the data, the smaller the range, interquartile
range, variance and standard deviation.
• If the values are all the same (so that there is no variation in the data), the range,
interquartile range, variance and standard deviation will all equal zero.
• None of the measures of variation (the range, interquartile range, standard deviation and
variance) can ever be negative.
Coefficient of Variation
Unlike the previous measures of variation presented, the coefficient of variation is a relative
measure of variation that is expressed as a percentage rather than in terms of the units of the
particular data. The coefficient of variation, denoted by the symbol CV, measures the scatter in
the data relative to the mean.
coefficient of variation
Relative measure of variation;
the standard deviation divided by
the mean.
COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean,
multiplied by 100%.
\[ CV = \left(\frac{S}{\bar{X}}\right) 100\% \] (3.11)
where S = sample standard deviation
\(\bar{X}\) = sample mean
For the sample of 10 times to get ready, since \(\bar{X} = 39.6\) and S = 6.769…, the coefficient of
variation is:
\[ CV = \left(\frac{S}{\bar{X}}\right) 100\% = \frac{6.769\ldots}{39.6} \times 100\% = 17.09\ldots\% \]
For the times to get ready, the standard deviation is 17.1% of the size of the mean.
You will find the coefficient of variation useful when comparing two or more sets of data
that have different units of measurement, as Example 3.11 illustrates, or when the scale of the
data sets is substantially different.
EXAMPLE 3.11
COMPARING TWO COEFFICIENTS OF VARIATION WHEN TWO VARIABLES HAVE
DIFFERENT UNITS OF MEASUREMENT
The operations manager of a package delivery service is deciding whether to purchase a
new fleet of trucks. When packages are stored in the trucks in preparation for delivery,
two major constraints need to be considered – the weight (in kilograms) and the volume
(in cubic metres) of each item.
The operations manager samples 200 packages and finds that the mean weight is
12.0 kilograms, with a standard deviation of 1.8 kilograms; the mean volume is 0.25 cubic
metres, with a standard deviation of 0.06 cubic metres. How can the operations manager
compare the variation of the weight and the volume?
SOLUTION
Because the measurement units differ for the weight and volume constraints, the operations
manager should compare the relative variability in the two types of measurements.
For weight, the coefficient of variation is:
\[ CV_W = \left(\frac{S}{\bar{X}}\right) 100\% = \frac{1.8}{12} \times 100\% = 15\% \]
For volume, the coefficient of variation is:
\[ CV_V = \left(\frac{S}{\bar{X}}\right) 100\% = \frac{0.06}{0.25} \times 100\% = 24\% \]
Thus, relative to the mean, the package volume is more variable than the package weight
because it has a higher coefficient of variation.
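The comparison in Example 3.11 can be reproduced with a small Python sketch. The function name `coefficient_of_variation` is hypothetical; it returns the result as a percentage and treats a zero mean as an error, because Equation 3.11 is undefined in that case.

```python
def coefficient_of_variation(std_dev, mean):
    """Equation 3.11: standard deviation relative to the mean, in percent."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


cv_weight = coefficient_of_variation(1.8, 12.0)     # kilograms, about 15%
cv_volume = coefficient_of_variation(0.06, 0.25)    # cubic metres, about 24%

print(round(cv_weight, 1), round(cv_volume, 1))
print("volume" if cv_volume > cv_weight else "weight", "is relatively more variable")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though weight and volume are measured on different scales.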
Z Scores
Z scores
Measures of relative standing;
number of standard deviations that
given data values are from the mean.
extreme value (outlier)
Value located far from the mean;
will have a large Z score, positive or
negative.
Z scores are measures of relative standing that take into consideration both the mean and the
standard deviation. A Z score represents the distance between a given observation and the mean
expressed in standard deviations. An extreme value or outlier, a value located far away from the
mean, will have a large Z score, either positive or negative. Therefore, Z scores are useful in
identifying extreme values or outliers.
Z SCORE
\[ Z = \frac{X - \bar{X}}{S} \] (3.12)
For the data for the times to get ready in the morning, the mean is 39.6 minutes and the
standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes. Use
formula 3.12 to calculate the Z score for day 1:
\[ Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09 \]
Therefore, the first day’s time to get ready of 39 minutes is just 0.09 of a standard deviation
below the mean – that is, just slightly quicker than the mean time to get ready.
Table 3.2 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which
the time to get ready was 52 minutes. The lowest Z score was −1.57 for day 2, on which the
time to get ready was 29 minutes. As a general rule, a value is said to be an outlier if its Z score
is less than −3.0 or greater than +3.0 – that is, the value is more than three standard deviations
below or above the mean. As none of the times to get ready meets the outlier criterion, we can
say there are no outliers in these data.
Table 3.2
Z scores for the 10 times to get ready

Time (X)    Z score
   39        −0.09
   29        −1.57
   43         0.50
   52         1.83
   39        −0.09
   44         0.65
   40         0.06
   31        −1.27
   44         0.65
   35        −0.68

Mean                   39.6
Standard deviation     6.77
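The Z scores in Table 3.2 and the outlier rule of thumb (a Z score below −3.0 or above +3.0) can be checked with a short Python sketch. It uses the standard library's `statistics.mean` and `statistics.stdev` (the sample standard deviation); the variable names are illustrative only.

```python
import statistics

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]    # minutes
mean = statistics.mean(times)                        # 39.6
std_dev = statistics.stdev(times)                    # sample SD, about 6.77

# Equation 3.12 applied to every value
z_scores = [(x - mean) / std_dev for x in times]
print([round(z, 2) for z in z_scores])

# Rule of thumb from the text: outlier if Z < -3.0 or Z > +3.0
outliers = [x for x, z in zip(times, z_scores) if abs(z) > 3.0]
print(outliers)                                      # []
```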
EXAMPLE 3.12
CALCULATING THE Z SCORES FOR REAL ESTATE PRICES
A couple seeking a ‘green-change’ sell their inner city unit for $520,000 and plan to purchase a house in a rural town for the same price. Given that the mean unit price in the inner
city is $845,000, with a standard deviation of $220,000, and the mean house price in the
rural town is $280,000 with a standard deviation of $120,000, use Z scores to determine the
price of each property relative to its region.
SOLUTION
The Z score for the inner city unit is:
\[ Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 845{,}000}{220{,}000} = -1.477\ldots \]
so the price of the unit sold is approximately 1.5 standard deviations below the mean price. That
is, the couple have sold their unit for a relatively low price compared with mean inner city prices.
If the couple purchase a house for $520,000, then its Z score is:
\[ Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 280{,}000}{120{,}000} = 2 \]
The price of this property is approximately two standard deviations above the mean price.
That is, the couple plan to purchase a house for a relatively high price compared with property prices in the region.
Shape
As well as the centre and the variation of numerical data we also need a description of the shape
of the distribution which represents a pattern of all the values from the lowest to highest. Many
data sets are approximately mound- or bell-shaped; other data sets may be skewed, with the
majority of data values clustered in the upper or lower end of the distribution.
A distribution is symmetrical if the lower and upper halves of the graph are mirror images
of each other. Panel B of Figure 3.1 illustrates a symmetrical distribution. If the distribution is
not symmetrical, it may be skewed.
A distribution is skewed to the right, or positively skewed, if there is a long tail to the right,
indicating that there are relatively few large data values and more smaller values – that is, most
of the values are concentrated in the lower portion of the distribution. Panel C of Figure 3.1
illustrates a positively skewed distribution. As relatively few people have extremely high
incomes, we would expect the distribution of annual income to be positively skewed.
A distribution is skewed to the left, or negatively skewed, if there is a long tail to the left,
indicating that there are relatively few small data values and more larger values, and so most of
the values are concentrated in the upper portion of the distribution. Panel A in Figure 3.1 illustrates a negatively skewed distribution. As relatively few people die at an early age, we would
expect the distribution of age at death of Australian residents to be negatively skewed.
symmetrical
Distribution of data values above
and below the mean are identical.
skewed
Non-symmetrical distribution; data
values are clustered either in the
lower or the upper portion of the
distribution.
Figure 3.1 A comparison of three data sets differing in shape
Panel A: Negative, or left skewed. Panel B: Symmetrical. Panel C: Positive, or right skewed.
The relative positions of the mean and median provide some information about the shape
of a distribution. In many, but not all, negative or left-skewed distributions the few extremely
small values pull the mean downwards so that the mean is less than the median. In many, but
again not all, positive or right-skewed distributions the few extremely large values pull the
mean upwards so that the mean is greater than the median. If the distribution is symmetrical,
the high and low values balance each other and the mean equals the median.
Therefore, for most continuous unimodal (one peak) distributions, we can say that:
• mean < median, the distribution is likely to be negative or left skewed
• mean = median, the distribution is symmetrical or has zero skewness
• mean > median, the distribution is likely to be positive or right skewed.
These rules often do not apply for discrete distributions, as illustrated in Example 3.13.
EXAMPLE 3.13
DISTRIBUTION OF NUMBER OF ADULTS IN HOUSEHOLD
From a random survey of 40 households the following data were obtained in response to the
question ‘How many adults (people over 18) are there in the household?’ < HOUSEHOLD >
4 4 2 2 2 1 1 3 2 2
1 1 1 1 3 2 3 2 2 3
1 2 1 1 1 2 2 5 1 3
1 2 1 2 1 1 3 2 1 1
Present these data graphically and calculate the mean and median.
SOLUTION
A column chart of the data is given in Figure 3.2.
Figure 3.2 Column chart for number of adults in household (horizontal axis: number of adults, 1 to 5; vertical axis: frequency, 0 to 20)
As most households have either one or two adults, the data are concentrated in the
lower portion of the graph with a tail to the right. Therefore, the distribution of the number
of adults in these households is positively or right skewed.
To calculate the mean, first calculate the sum of X:
$$\sum_{i=1}^{40} X_i = 4 + 4 + \ldots + 1 = 76$$
Then, as n = 40, using Equation 3.1 we obtain:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{76}{40} = 1.9$$
The rank of the median is:
$$\frac{n + 1}{2} = \frac{40 + 1}{2} = 20.5$$
The median is the mean of the 20th- and 21st-ranked values. From the ordered array of the
data:
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 2 2 2
2 2 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 4 4 5
the 20th- and 21st-ranked values are 2, so:
Median = 2
So the mean number of adults per household is 1.9, while the median number of adults is 2.
In this case, mean < median even though the number of adults per household is skewed to
the right.
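The result of Example 3.13 can be reproduced with Python's standard `statistics` module. This is a sketch only: the frequency table below was tallied from the ordered array above, and the variable names are illustrative.

```python
from statistics import mean, median

# Frequencies tallied from the ordered array in Example 3.13
frequencies = {1: 17, 2: 14, 3: 6, 4: 2, 5: 1}

# Expand the frequency table back into the 40 individual responses
adults = [value for value, count in frequencies.items() for _ in range(count)]

sample_mean = mean(adults)
sample_median = median(adults)

print(f"n = {len(adults)}, mean = {sample_mean}, median = {sample_median}")
print("mean < median:", sample_mean < sample_median)
```

The long right tail (households with four or five adults) is too light to pull the mean above 2, while the 20th- and 21st-ranked values both equal 2, so the mean falls below the median despite the right skew.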
Microsoft Excel Descriptive Statistics Output
The Microsoft Excel Data Analysis Toolpak generates the mean, median, mode, standard
deviation, variance, range, minimum, maximum and count (sample size) on a single worksheet, all of which have been discussed in this section. In addition, Excel calculates the standard error, along with statistics for kurtosis and skewness. The standard error is the standard
deviation divided by the square root of the sample size and is discussed in Chapter 7. Skewness measures the lack of symmetry in the data and is based on a statistic that is a function of
the cubed differences around the mean. A skewness value of zero indicates a symmetrical
distribution. Positive and negative values indicate positive or negative skewness. Kurtosis
measures the relative concentration of values in the centre of the distribution compared with
the tails, and is based on the differences around the mean raised to the fourth power. This
measure is not discussed in this text.
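The three additional statistics can be sketched in Python as follows. The text does not give the formulas, so the skewness and kurtosis functions below are an assumption: they use the sample-adjusted forms generally documented for Excel's SKEW and KURT worksheet functions. The function names are illustrative, and the data are the household responses from Example 3.13.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(data):
    """Sample standard deviation divided by the square root of the sample size."""
    return stdev(data) / sqrt(len(data))

def sample_skewness(data):
    """Adjusted skewness based on cubed standardised differences (assumed Excel SKEW form)."""
    n = len(data)
    centre, spread = mean(data), stdev(data)
    cubed = sum(((x - centre) / spread) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(data):
    """Adjusted excess kurtosis based on fourth powers (assumed Excel KURT form)."""
    n = len(data)
    centre, spread = mean(data), stdev(data)
    fourth = sum(((x - centre) / spread) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Number of adults in each of the 40 households (Example 3.13)
adults = [1] * 17 + [2] * 14 + [3] * 6 + [4] * 2 + [5]

print("standard error:", standard_error(adults))
print("skewness:", sample_skewness(adults))
print("kurtosis:", sample_excess_kurtosis(adults))
```

For the household data the skewness statistic is positive, which agrees with the right-skewed column chart even though the mean is below the median.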
For data on festival expenditure by international visitors, the Excel descriptive statistics
output, shown in Figure 3.3, gives many of the sample statistics calculated in the examples in
this section.
Festival spending – international visitors

Mean                   743.75
Standard error         74.9867
Median                 744
Mode                   #N/A
Standard deviation     259.761
Sample variance        67476
Kurtosis               -1.41411
Skewness               -0.13236
Range                  776
Minimum                343
Maximum                1119
Sum                    8925
Count                  12

Figure 3.3 Microsoft Excel summary statistics for festival expenditure
Median or mean?
think
about this
While the mean is the most common measure of central tendency, there are times when the median is
the appropriate measure to use.
A common measure of relative poverty is living in a household that has less than 50% of median
household income. In Poverty in Australia 2016, the Australian Council of Social Service
(www.acoss.org.au) reveals that a single adult with a disposable income of less than $426 per week or a
couple with two children with a disposable income of less than $895 per week were living in
relative poverty in 2014.
Why is median household income used to define relative poverty, not mean household income? Two
possible reasons are:
■ Since household income is likely to be skewed to the right, mean household income is likely to be
considerably higher than the median household income. Therefore, defining the poverty line as 50%
of mean household income would lead to a greater proportion of the population being defined as
living in relative poverty.
■ Furthermore, defining the poverty line as 50% of mean household income would mean that any
measures to alleviate poverty would be unlikely to change the proportion of households in relative
poverty, since any increase in the disposable household income of those in relative poverty would increase
mean household income and hence raise the poverty line.
However, using median household income to define relative poverty makes it possible to reduce,
possibly to zero, the proportion of households in relative poverty. This is because increasing the
disposable income of those living in relative poverty, through employment, benefits, tax rebates or other
means, so that household income is above 50% of median income, need not change the median
household income.
Exploring Descriptive Statistics
visual
explorations
Open the VE_Descriptive_Statistics workbook to explore the effects of changing data values on
measures of central tendency, variation and shape. Change the data values in the cell range
A2:A11 and then observe the changes to the statistics shown in the chart.
Click View the Suggested Activity Page to view a specific change you could make to the
data values in column A. Click View the More About Descriptive Statistics Page to
view summary definitions of the descriptive statistics shown in the chart.
Problems for Section 3.1
LEARNING THE BASICS
3.1 The data below are a sample of n = 5:
7 4 9 8 2
a. Calculate the mean, median and mode.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
3.2 The data below are a sample of n = 6:
7 4 9 7 3 12
a. Calculate the mean, median and mode.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
c. Calculate the Z scores. Are there any outliers?
3.3 The data below are a sample of n = 7:
12 7 4 9 0 7 3
a. Calculate the mean, median and mode.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
c. Calculate the Z scores. Are there any outliers?
3.4 The data below are a sample of n = 5:
7 -5 -8 7 9
a. Calculate the mean, median and mode.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
3.5 Suppose that the rate of return for a particular share during the
past two years was 10% and 30%. Calculate the geometric
mean rate of return.
(Note: A rate of return of 10% is recorded as 0.10 and a rate of
return of 30% is recorded as 0.30.)
APPLYING THE CONCEPTS
Problems 3.6 to 3.18 can be solved manually or by using Microsoft Excel.
3.6 The operations manager of a plant that manufactures tyres
wants to compare the actual inner diameter of two grades of
tyres, each of which is expected to be 575 millimetres. A
sample of five tyres of each grade is selected and the results,
representing the inner diameters of the tyres, ranked from
smallest to largest, are as follows:
Grade X: 568 570 575 578 584
Grade Y: 573 574 575 577 578
a. For each of the two grades of tyres, calculate the mean,
median and standard deviation.
b. Which grade of tyre is providing on average better quality?
Explain.
c. What would be the effect on your answers in (a) and (b)
if the last value for grade Y was 588 instead of 578?
Explain.
3.7 Low-fat foods are not necessarily low calorie, as many low-fat
foods are high in sugar. The calories per 250 ml cup of a random
sample of brands of fresh cow’s milk for sale in Australia were
given in problem 2.14 and stored in < FRESH_MILK >.
Using the calorie data for each milk category:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
c. Based on the results of (a) and (b), what conclusions can you
reach about the differences in calories between these types
of milk?
3.8 The sales per day, in dollars, at a certain store are: < SALES >
1,520 2,620 3,360 3,550 1,350 2,545 1,430 2,400
3,580 2,390 1,525 2,400 1,420 1,550 2,390 1,560
1,680 2,330
a. Calculate the mean, median, mode, first quartile and third
quartile.
b. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
c. What conclusions can you reach about daily sales at this
store?
3.9 The supervisor of a tourist information desk at a local airport is
interested in how long it takes an employee to serve a
customer. For a sample of 12 customers, she measures the
amount of time taken to serve each one. These times, measured
in minutes, are reported below: < TOURIST >
2.4 1.5 3.9 0.6 2.7 3.1
2.8 0.9 1.4 2.6 1.4 6.1
a. Calculate the mean, median, mode, first quartile and third
quartile.
b. Calculate the variance, standard deviation, range,
interquartile range, coefficient of variation and Z scores.
c. Are there any outliers, and are the data skewed?
d. Based on the results of (a) to (c), what conclusions can you
reach about the time taken to serve a customer?
3.10 The ordered arrays in the table below give the life (in hours of
usage) of samples of forty 15-watt CFL (compact fluorescent
lamp) energy-saving light bulbs produced by two
manufacturers, A and B. < BULBS >
Manufacturer A
5,544 5,814 6,190 6,307 6,342 6,423 6,429 6,485 6,612 6,667
6,832 6,868 6,879 6,930 6,941 7,007 7,037 7,043 7,059 7,136
7,497 7,645 7,654 7,773 7,816 7,838 7,924 7,999 8,038 8,067
8,091 8,119 8,392 8,416 8,416 8,514 8,532 8,542 8,544 8,731
Manufacturer B
6,701 6,837 6,961 7,118 7,133 7,142 7,156 7,344 7,493 7,569
7,607 7,612 7,651 7,721 7,754 7,767 7,806 7,839 7,888 7,983
8,298 8,344 8,535 8,666 8,792 8,800 8,856 8,861 8,993 9,001
9,036 9,096 9,262 9,385 9,460 9,471 9,521 9,540 9,693 9,744
For each manufacturer:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
c. What conclusions can you reach concerning the life of each
manufacturer’s bulbs?
3.11 The prices (in dollars) of 14 models of camera at a camera
specialty store were as follows. < CAMERA >
340 370 450 400 450 310 280
340 220 430 340 270 290 380
a. Calculate the mean, median, first quartile and third
quartile.
b. Calculate the variance, standard deviation, range,
interquartile range, coefficient of variation and Z scores. Are
there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) to (c), what conclusions can you
reach about the price of cameras at the camera specialty
store?
3.12 The following data refer to the number of kilometres
that a sample of 50 people drive to work each day.
< TRAVEL_WORK >
23 19 12 15 26 34 26 26 27 15
25 8 5 32 27 31 35 16 10 24
32 36 7 38 25 4 24 35 9 18
17 22 46 24 44 19 27 34 12 23
30 47 38 27 27 42 29 27 45 29
a. Calculate the mean, median and mode.
b. Calculate the range, variance and standard deviation.
c. Interpret the summary measures calculated in (a)
and (b).
3.13 A manufacturer of torch batteries took a sample of 13 batteries
from a day’s production and used them continuously until they
were drained. The numbers of hours they were used until failure
were: < BATTERIES >
342 426 317 545 264 451
1,049 631 512 266 492 562 298
a. Calculate the mean, median and mode. Looking at the
distribution of times to failure, which measures of central
tendency do you think are most appropriate and which least
appropriate to use for these data? Why?
b. Calculate the range, variance and standard deviation.
c. What would you advise if the manufacturer wanted to say
in advertisements that these batteries ‘should last 400
hours’? (Note: There is no right answer to this question;
the point is to consider how to make such a statement
precise.)
d. Suppose that the first value was 1,342 instead of 342.
Repeat (a) to (c), using this value. Comment on the
difference in the results.
3.14 A bank branch located in a commercial district of a city has
developed an improved process for serving customers during
the noon to 1 pm lunch period. The waiting time in minutes
(defined as the time the customer enters the line to when
they reach the teller window) of all customers during this
hour is recorded over a period of one week. A random
sample of 15 customers is selected, and the results are as
follows: < BANK1 >
4.21 5.55 3.02 5.13 4.77 2.34 3.54
3.20 4.50 6.10 0.38 5.12 6.46 6.19 3.79
a. Calculate the mean, median, first quartile and third
quartile.
b. Calculate the variance, standard deviation, range,
interquartile range, coefficient of variation and Z scores. Are
there any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during the
lunch hour, she asks the branch manager how long
she can expect to wait. The branch manager replies,
‘Almost certainly less than five minutes’. On the basis
of the results of (a) and (b), evaluate the accuracy of
this statement.
3.15 Suppose that another branch, located in a residential
area, is also concerned about the noon to 1 pm lunch
hour. The waiting time in minutes (defined as the time the
customer enters the line to the time they reach the teller
window) of all customers during this hour is recorded
over a period of one week. A random sample of
15 customers is selected, and the results are as
follows: < BANK2 >
9.66 5.90 8.02 5.79 8.73 3.82 8.01
8.35 10.49 6.68 5.64 4.08 6.17 9.91 5.47
a. Calculate the mean, median, first quartile and third
quartile.
b. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation. Are there
any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during
the lunch hour, he asks the branch manager how long
he can expect to wait. The branch manager replies,
‘Almost certainly less than five minutes’. On the basis
of the results of (a) and (b), evaluate the accuracy of
this statement.
3.16 Data from 100 recent property sales from a council area are
stored in < PROPERTY >. For the asking price data, calculate
and interpret:
a. the mean and median (refer to graphs in problem 2.71)
b. the quartiles
c. the range and interquartile range
d. the variance and standard deviation.
3.17 The five years 2012 to 2016 saw volatility in the value of shares.
The data in the following table give the annual percentage change
in the share market index for Hong Kong, the Hang Seng, and for
Australia, the S&P/ASX 200, for 2012 to 2016.
Year     Hang Seng     ASX 200
2012     22.9%         14.6%
2013     2.9%          15.1%
2014     1.3%          1.1%
2015     -7.2%         -2.1%
2016     0.4%          7.0%
Source: Data obtained from Yahoo 7 Finance <http://au.finance.yahoo.com>
accessed April 2017
a. For each index calculate the geometric rate of return for the
five years.
b. What conclusions can you reach concerning the geometric
rates of return of the two indices?
3.18 The annual returns (before tax and fees) on several managed
superannuation investment funds are:
Fund
Conservative balanced
Balanced
High growth
Sustainable balanced
2017
Historical crediting rate for
year ending 30 June %
2016
2015
2014
5.3
9.2
16.6
7.5
6.1
0.0
10.2
11.0
13.9
11.6
13.9
18.9
11.7
15.9
20.7
12.4
0.0
15.0
15.7
15.9
a. For each fund, calculate the geometric rate of return for three
years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric
rates of return for the funds?
3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION
LEARNING OBJECTIVE
Section 3.1 introduces several statistics that describe the properties of central tendency, variation and shape for a sample. If we have population data there are similar numerical descriptive
measures, called population parameters, of central tendency, variation and shape. This section
introduces three population parameters: population mean, population variance and population
standard deviation.
To illustrate these population parameters we use the data in Table 3.3, which classifies
road fatalities in Australia for 2016 by month and gender. Because the table gives the total,
male and female monthly road fatalities for all of Australia in 2016, these are population data.
                         Gender
Month         Unknown    Male    Female    Total
January       0          27      80        107
February      0          30      72        102
March         0          23      87        110
April         0          33      81        114
May           0          29      76        105
June          0          23      74        97
July          0          26      91        117
August        0          38      74        112
September     0          24      68        92
October       0          29      89        118
November      0          28      78        106
December      1          27      88        116
Total         1          337     958       1,296
Population Mean
The population mean, defined by Equation 3.13, is represented by the symbol μ, the Greek
lower-case letter mu.
2
Calculate and interpret
descriptive summary
measures for a population
Table 3.3 Road fatalities in Australia 2016
Source: Data obtained from the Australian Road Deaths Database
<www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 4 May 2017.
population mean
Mean calculated from population
data.
POPULATION MEAN
The population mean is the sum of the values in the population divided by the population
size, N.
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{3.13}$$
where μ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N} X_i$ = sum of all $X_i$ values in the population
To calculate the mean number of total road fatalities per month for 2016 from the data given in Table 3.3,
use Equation 3.13:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{107 + 102 + \ldots + 116}{12} = \frac{1{,}296}{12} = 108$$
Thus, the mean number of road fatalities per month for 2016 was 108.
Population Variance and Standard Deviation
population variance
Variance calculated from population
data.
population standard deviation
Standard deviation calculated from
population data.
The population variance and the population standard deviation measure variation in a population. Like the related sample statistic, the population standard deviation is the square root of
the population variance. The population variance is represented by the symbol σ², the Greek
lower-case letter sigma squared, and the population standard deviation by the symbol σ.
These parameters are defined by Equations 3.14a and 3.15. The denominator in Equation
3.14a is N (population size) and not n − 1 as used in the equation for the sample variance
(see Equation 3.9a).
POPULATION VARIANCE – DEFINITION FORMULA
The population variance is the sum of the squared deviations from the population mean
divided by the population size N.
$$\sigma^2 = \frac{SSX}{N} = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{3.14a}$$
where μ = population mean
$X_i$ = ith value of the variable X
$SSX = \sum_{i=1}^{N} (X_i - \mu)^2$ = sum of the squared deviations from the mean (sum of
squares)
POPULATION STANDARD DEVIATION
The population standard deviation is the square root of the population variance.
$$\sigma = \sqrt{\sigma^2} \tag{3.15}$$
As we did for sample variance and standard deviation, we can use algebra to obtain alternative calculation formulas.
POPULATION VARIANCE – CALCULATION FORMULA
The population variance is the sum of the squared deviations from the population mean
divided by the population size N.
$$\sigma^2 = \frac{SSX}{N} = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \tag{3.14b}$$
where μ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N} X_i^2 = X_1^2 + X_2^2 + \ldots + X_N^2$ = sum of the squared $X_i$ values in the population
Use either calculation formula.
Using the data in Table 3.3 to calculate the population variance and standard deviation for
the 2016 monthly total road fatalities, first calculate:
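$$\sum_{i=1}^{N} X_i^2 = 107^2 + 102^2 + \ldots + 116^2 = 140{,}696$$
then use Equations 3.14b and 3.15 to obtain:
$$\sigma^2 = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{140{,}696 - 12 \times 108^2}{12} = 60.666\ldots$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{60.666\ldots} = 7.788\ldots$$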
Thus, the variance of monthly total fatalities for 2016 is approximately 60.7 and the standard
deviation is approximately 7.8 fatalities per month. So, the typical 2016 monthly fatality rate
differs from the mean of 108 by plus or minus 7.8.
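The calculation can be verified with a short Python sketch that applies Equations 3.13, 3.14b and 3.15 to the monthly totals of Table 3.3. The variable names are illustrative; `pvariance` from the standard `statistics` module is used only as a cross-check, since it also divides by N rather than n − 1.

```python
from math import sqrt
from statistics import pvariance

# Total road fatalities per month, January to December 2016 (Table 3.3)
monthly_totals = [107, 102, 110, 114, 105, 97, 117, 112, 92, 118, 106, 116]

N = len(monthly_totals)
mu = sum(monthly_totals) / N                          # Equation 3.13
sum_of_squares = sum(x ** 2 for x in monthly_totals)  # sum of the squared values
variance = (sum_of_squares - N * mu ** 2) / N         # Equation 3.14b
sigma = sqrt(variance)                                # Equation 3.15

print(f"mean = {mu}, sum of squares = {sum_of_squares}")
print(f"variance = {variance:.3f}, standard deviation = {sigma:.3f}")
print(f"library cross-check: {pvariance(monthly_totals):.3f}")
```

Replacing the divisor N with n − 1 in the variance line would give the sample variance instead, which is appropriate only when the twelve values are a sample rather than the whole population.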
The Empirical Rule
In many data sets a large portion of the values tend to cluster near the median. In right-skewed
data sets, this clustering occurs in the left or lower part of the distribution. In left-skewed data
sets, the values tend to cluster in the right or upper part of the distribution. In symmetrical data
sets, where the median and mean are similar, the values often cluster around the median and
mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions, both population and sample.
The empirical rule states that for bell-shaped distributions:
• Approximately 68% of the values are within a distance of ±1 standard deviation from the
mean. That is, approximately 68% of the data values have Z scores between −1 and 1.
• Approximately 95% of the values are within a distance of ±2 standard deviations from
the mean. That is, approximately 95% of the data values have Z scores between −2 and 2.
• Approximately 99.7% of the values are within a distance of ±3 standard deviations from
the mean. That is, approximately 99.7% of the data values have Z scores between −3 and 3.
bell-shaped
Symmetric, unimodal, mound-shaped distribution.
empirical rule
Gives the distribution of data values
in terms of standard deviations from
the mean for bell-shaped
distributions.
The empirical rule helps to identify outliers when analysing a set of numerical data. The
empirical rule implies that, for bell-shaped distributions, only about 1 in 20 values will be
beyond two standard deviations from the mean. As a general rule, you can consider values not
found in the interval μ ± 2σ (or X̄ ± 2S) as potential outliers. The rule also implies that only
about 3 in 1,000 will be beyond three standard deviations from the mean. Therefore, values not
found in the interval μ ± 3σ (or X̄ ± 3S) are almost always considered outliers. For heavily
skewed or non-bell-shaped data sets the Chebyshev rule, introduced next, should be used
instead of the empirical rule.
EXAMPLE 3.14
USING THE EMPIRICAL RULE
A population of 600-mL bottles of soft drink is known to have a mean fill-weight of 603 mL
and a standard deviation of 1 mL. The population is also known to be bell-shaped.
Describe the distribution of fill-weights. Is it very likely that a bottle will contain less
than 600 mL of soft drink?
SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Using the empirical rule, approximately 68% of the bottles will contain between 602 mL and
604 mL, approximately 95% will contain between 601 mL and 605 mL, and approximately
99.7% will contain between 600 mL and 606 mL. Therefore, it is highly unlikely that a bottle
will contain less than 600 mL of soft drink. Specifically, because of the assumed symmetry,
we would expect only 0.15% of bottles to have a volume of soft drink less than 600 mL (and
thus 0.15% above 606 mL).
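The intervals in Example 3.14 can be tabulated with a small Python sketch. The dictionary `EMPIRICAL_COVERAGE` and the function name are illustrative; the coverage figures are the approximate percentages quoted in the empirical rule, not exact values.

```python
# Approximate share of values within k standard deviations (empirical rule)
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}

def empirical_intervals(mean, std_dev):
    """Return {k: (lower, upper, approximate share)} for k = 1, 2, 3."""
    return {
        k: (mean - k * std_dev, mean + k * std_dev, share)
        for k, share in EMPIRICAL_COVERAGE.items()
    }

# Example 3.14: mean fill of 603 mL, standard deviation of 1 mL, bell-shaped
intervals = empirical_intervals(603, 1)
for k, (lower, upper, share) in intervals.items():
    print(f"within {k} sd: {lower} mL to {upper} mL, about {share:.1%}")

# By symmetry, half of the values outside the 3 sd interval lie below 600 mL
share_below_600 = (1 - EMPIRICAL_COVERAGE[3]) / 2
print(f"share below 600 mL: about {share_below_600:.2%}")
```

The symmetry step in the last calculation is valid only because the population is stated to be bell-shaped.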
The Chebyshev Rule
Chebyshev rule
Gives lower bounds of the
distribution of data values in terms
of standard deviations from the
mean for any distribution.
The Chebyshev rule states that, for all data sets, population or sample, the percentage of values
within k standard deviations of the mean must be at least:
$$\left[1 - \left(\frac{1}{k}\right)^2\right] \times 100\%$$
You can use this rule for any value of k greater than 1. Consider k = 2. The Chebyshev rule
states that at least [1 − (1/2)²] × 100% = 75% of the values must be within ±2 standard deviations of the mean.
The Chebyshev rule is very general and applies to any distribution. The rule gives the
percentage of values that must at least be within a given distance from the mean. However, if
the data set is approximately bell-shaped, the empirical rule will more accurately reflect the
greater concentration of data close to the mean. Table 3.4 compares the Chebyshev and
empirical rules.
Table 3.4 How data vary around the mean

                      % of values found in intervals around the mean
Interval              Chebyshev (any distribution)    Empirical rule (bell-shaped distribution)
(μ − σ, μ + σ)        At least 0%                     Approximately 68%
(μ − 2σ, μ + 2σ)      At least 75%                    Approximately 95%
(μ − 3σ, μ + 3σ)      At least 88.89%                 Approximately 99.7%
USING THE CHEBYSHEV RULE
As in Example 3.14, a population of 600-mL bottles of soft drink is known to have a mean
fill-weight of 603 mL and a standard deviation of 1 mL. However, the shape of the population is unknown and you cannot assume that it is bell-shaped. Describe the distribution of
fill-weights. Is it very likely that a bottle will contain less than 600 mL of soft drink?
EXAMPLE 3.15
SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Because the distribution may not be bell-shaped, the empirical rule should not be used. Using
the Chebyshev rule, you cannot say anything about the percentage of bottles containing between
602 mL and 604 mL. You can state that at least 75% of the bottles will contain between 601 mL
and 605 mL, and at least 88.89% will contain between 600 mL and 606 mL. Therefore, it is possible that up to 11.11% of bottles contain less than 600 mL of soft drink (or more than 606 mL).
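The Chebyshev bounds used in this example can be computed with a short Python sketch. The function name `chebyshev_lower_bound` is illustrative, and the function rejects k ≤ 1 because the rule is stated only for k greater than 1.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean, for k > 1."""
    if k <= 1:
        raise ValueError("the Chebyshev rule is stated for k greater than 1")
    return 1 - (1 / k) ** 2

# Example 3.15: mean fill of 603 mL, standard deviation of 1 mL, shape unknown
mean_fill, std_dev = 603, 1
for k in (2, 3):
    lower, upper = mean_fill - k * std_dev, mean_fill + k * std_dev
    bound = chebyshev_lower_bound(k)
    print(f"at least {bound:.2%} of bottles contain between {lower} mL and {upper} mL")

# Largest share that could lie outside the 3 sd interval (600 mL to 606 mL)
max_share_outside = 1 - chebyshev_lower_bound(3)
print(f"at most {max_share_outside:.2%} outside 600 mL to 606 mL")
```

Unlike the empirical rule, the bound says nothing about how the outside share is split between the two tails, so all of it could, in principle, fall below 600 mL.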
These two rules apply to both population and sample data. For sample data, use the sample
mean X̄ and sample standard deviation S in place of the population parameters μ and σ.
Problems for Section 3.2
LEARNING THE BASICS
3.19 The data below are for a population with N = 10:
7 5 11 8 3 6 2 1 9 8
a. Calculate the population mean.
b. Calculate the population standard deviation.
3.20 The data below are for a population with N = 10:
7 5 6 6 6 4 8 6 9 3
a. Calculate the population mean.
b. Calculate the population standard deviation.
APPLYING THE CONCEPTS
3.21 Analyse the road fatality data for 2016 given in
< MONTHLY_FATALITY _2016 > for each gender by:
a. Calculating the mean, variance and standard deviation.
b. Finding the proportion of months that have fatalities within
one and two standard deviations of the mean.
c. Comparing your findings with what would be expected on
the basis of the empirical rule.
3.22 Naturally Soap is a small business, based in a coastal town, that
makes and sells natural, luxurious, handmade soap bars in a
variety of scents. Presently the soap is sold at local markets:
Wednesday evening in the coastal town where the business is
located, and a scheduled Sunday morning market in a roster of
local villages. During the last six months, Naturally Soap has
also been available via the Internet.
Naturally Soap is interested in analysing the quantity sold
weekly at each market and Internet sales.
While Naturally Soap has complete sales and price data for
both markets for the previous year, due to a computer ‘problem’
there is only a sample of weekly sales and price data for the
Internet sales. The data is stored in the < NATURALLY_SOAP >
file.
a. For the Sunday morning market:
i. Calculate the mean, variance and standard deviation of
the weekly sales for the year.
ii. What conclusions can you make about the weekly sales
for this market?
iii. Use the empirical rule or the Chebyshev rule, whichever
is appropriate, to further explain the variation in the
weekly sales.
iv. Using the results in (iii), are there any outliers? Explain.
b. Repeat (a) for the Wednesday evening market.
3.23 The ages, to the nearest year, of all employees at a certain fast-food outlet are:
19 19 45 20 21 21 18 20 23 17
a. Calculate the mean, variance and standard deviation.
b. Calculate the Z scores.
c. Based on the results of (a) and (b), what conclusions can you
reach about employee ages at this fast-food outlet?
3.24 The file < HOURS > gives the hours worked during a recent
week by all 30 employees of a local bakery.
For this week:
a. Calculate and interpret the mean hours worked.
b. Calculate the variance and standard deviation of the hours
worked. Interpret the standard deviation.
c. Use the empirical rule or the Chebyshev rule, whichever is
appropriate, to explain further the variation in the hours
worked.
d. Using the results in (c), are there any outliers? Explain.
LEARNING OBJECTIVE
1
Calculate and interpret
numerical descriptive
measures of central
tendency, variation and
shape for numerical data
3.3 CALCULATING NUMERICAL DESCRIPTIVE MEASURES FROM A
FREQUENCY DISTRIBUTION
When you have a frequency distribution and the raw data are not available, you can calculate
approximations to the mean and the standard deviation by assuming that all values within each
class are located at the class mid-point.
APPROXIMATING THE SAMPLE MEAN, VARIANCE AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \tag{3.16}$$

where $\bar{X}$ = sample mean
n = number of values or sample size
c = number of classes in the frequency distribution
$m_j$ = mid-point of the jth class
$f_j$ = number of values in the jth class

$$S = \sqrt{S^2} \tag{3.17}$$

where

$$S^2 = \frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$$
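The step between the first two forms of $S^2$ is not spelled out in the box. A short derivation, offered here as a sketch that uses only $\sum_{j=1}^{c} f_j = n$ and Equation 3.16 in the form $\sum_{j=1}^{c} m_j f_j = n\bar{X}$:

$$
\begin{aligned}
\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j
&= \sum_{j=1}^{c} f_j m_j^2 - 2\bar{X}\sum_{j=1}^{c} m_j f_j + \bar{X}^2 \sum_{j=1}^{c} f_j \\
&= \sum_{j=1}^{c} f_j m_j^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
&= \sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2
\end{aligned}
$$

Replacing $\bar{X}$ with $\sum_{j=1}^{c} m_j f_j / n$ in the last line gives $n\bar{X}^2 = \left(\sum_{j=1}^{c} m_j f_j\right)^2 / n$, which is the third form.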
Example 3.16 illustrates the calculation of a sample mean and the standard deviation from
a frequency distribution.
EXAMPLE 3.16
APPROXIMATING THE MEAN AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION
Use the frequency distribution for real estate prices given in Table 3.5 to calculate the approximate sample mean and standard deviation. Compare these approximations with the mean and standard deviation calculated from the raw (ungrouped) data in < PROPERTY >; see problem 3.16.

Table 3.5 Frequency distribution: real estate asking prices

| Asking price ($) | Frequency |
|---|---|
| 300,000 to < 350,000 | 8 |
| 350,000 to < 400,000 | 17 |
| 400,000 to < 450,000 | 21 |
| 450,000 to < 500,000 | 20 |
| 500,000 to < 550,000 | 16 |
| 550,000 to < 600,000 | 6 |
| 600,000 to < 650,000 | 7 |
| 650,000 to < 700,000 | 3 |
| 700,000 to < 750,000 | 0 |
| 750,000 to < 800,000 | 0 |
| 800,000 to < 850,000 | 2 |
| Total | 100 |
SOLUTION
The calculations of the approximate mean and standard deviation of the real estate prices
are summarised in Table 3.6 where, to avoid extremely large numbers, the mid-point of each
class has been recorded in thousands of dollars.
Table 3.6 Calculations needed to calculate approximations of the mean and standard deviation of the real estate prices

| Asking price ($) | Mid-point in $000s | Frequency | fj mj | fj mj² |
|---|---|---|---|---|
| 300,000 to < 350,000 | 325 | 8 | 2,600 | 845,000 |
| 350,000 to < 400,000 | 375 | 17 | 6,375 | 2,390,625 |
| 400,000 to < 450,000 | 425 | 21 | 8,925 | 3,793,125 |
| 450,000 to < 500,000 | 475 | 20 | 9,500 | 4,512,500 |
| 500,000 to < 550,000 | 525 | 16 | 8,400 | 4,410,000 |
| 550,000 to < 600,000 | 575 | 6 | 3,450 | 1,983,750 |
| 600,000 to < 650,000 | 625 | 7 | 4,375 | 2,734,375 |
| 650,000 to < 700,000 | 675 | 3 | 2,025 | 1,366,875 |
| 700,000 to < 750,000 | 725 | 0 | 0 | 0 |
| 750,000 to < 800,000 | 775 | 0 | 0 | 0 |
| 800,000 to < 850,000 | 825 | 2 | 1,650 | 1,361,250 |
| Total | | 100 | 47,300 | 23,397,500 |
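Using Equations 3.16 and 3.17:

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{47{,}300}{100} = 473$$

and

$$S = \sqrt{\frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1}} = \sqrt{\frac{23{,}397{,}500 - 100 \times 473^2}{99}} = \sqrt{10{,}349.4949\ldots} = 101.732\ldots$$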
Therefore, the mean and standard deviation are approximately $473,000 and $101,700.
These values compare with the actual mean, $472,440, and the standard deviation,
$102,395, calculated from the raw (ungrouped) data; see solutions to problem 3.16.
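The arithmetic in Example 3.16 can be checked with a few lines of Python. The sketch below is illustrative only: the helper name `grouped_mean_sd` is hypothetical, and the mid-points (in $000s) and frequencies are copied from Table 3.6.

```python
from math import sqrt

# Class mid-points (in $000s) and class frequencies, copied from Table 3.6.
midpoints = [325, 375, 425, 475, 525, 575, 625, 675, 725, 775, 825]
frequencies = [8, 17, 21, 20, 16, 6, 7, 3, 0, 0, 2]


def grouped_mean_sd(midpoints, frequencies):
    """Approximate the sample mean and standard deviation from a
    frequency distribution (Equations 3.16 and 3.17)."""
    n = sum(frequencies)
    sum_fm = sum(f * m for m, f in zip(midpoints, frequencies))
    sum_fm2 = sum(f * m ** 2 for m, f in zip(midpoints, frequencies))
    mean = sum_fm / n
    variance = (sum_fm2 - n * mean ** 2) / (n - 1)
    return mean, sqrt(variance)


mean, sd = grouped_mean_sd(midpoints, frequencies)
print(f"approximate mean = {mean:.1f} ($000s)")
print(f"approximate standard deviation = {sd:.3f} ($000s)")
```

Because every value in a class is treated as sitting at the class mid-point, the results are approximations, which is why they differ slightly from the raw-data values quoted above.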
Problems for Section 3.3
LEARNING THE BASICS
3.25 Given the following frequency distribution for n = 100:

| Class intervals | Frequency |
|---|---|
| 0–under 10 | 10 |
| 10–under 20 | 20 |
| 20–under 30 | 40 |
| 30–under 40 | 20 |
| 40–under 50 | 10 |
| Total | 100 |

Approximate:
a. the mean
b. the standard deviation.
3.26 Given the following frequency distribution for n = 100:

| Class intervals | Frequency |
|---|---|
| 0–under 10 | 40 |
| 10–under 20 | 25 |
| 20–under 30 | 15 |
| 30–under 40 | 15 |
| 40–under 50 | 5 |
| Total | 100 |

Approximate:
a. the mean
b. the standard deviation.
APPLYING THE CONCEPTS
3.27 A company wished to study its accounts receivable for two
successive months. An independent sample of 50 accounts was
selected for each month. The results are in the table below.
a. For each month, approximate the:
i. mean
ii. standard deviation
b. On the basis of your answers in (a), do you think the mean
and the standard deviation of the accounts receivable have
changed substantially from March to April? Explain.
Frequency Distributions for Accounts Receivable

| Amount | March frequency | April frequency |
|---|---|---|
| $0 to under $2,000 | 6 | 10 |
| $2,000 to under $4,000 | 13 | 14 |
| $4,000 to under $6,000 | 17 | 13 |
| $6,000 to under $8,000 | 10 | 10 |
| $8,000 to under $10,000 | 4 | 0 |
| $10,000 to under $12,000 | 0 | 3 |
| Total | 50 | 50 |
LEARNING OBJECTIVE 3
Construct and interpret a box-and-whisker plot
3.4 FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS
Section 3.1 introduces sample statistics to measure the centre, variation and shape of numerical
data. Another way of describing numerical data is to use the five-number summary, which is
illustrated graphically by a box-and-whisker plot.
Five-Number Summary
five-number summary
Numerical data summarised by
quartiles.
The five-number summary consists of the five statistics:
Xsmallest Q1 Median Q3 Xlargest
The five-number summary characterises a sample (or population) reasonably well and is useful
for exploratory data analysis. In particular, it provides a way to determine the shape of the distribution. Table 3.7 explains how the relationships between the ‘five numbers’ allow you to
recognise the shape of a data set.
Table 3.7 Relationships between the five-number summary and the type of distribution

| Comparison | Left skewed | Symmetrical | Right skewed |
|---|---|---|---|
| Distance from Xsmallest to the median versus the distance from the median to Xlargest. | The distance from Xsmallest to the median is greater than the distance from the median to Xlargest. | Both distances are the same. | The distance from Xsmallest to the median is less than the distance from the median to Xlargest. |
| Distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest. | The distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest. | Both distances are the same. | The distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest. |
| Distance from Q1 to the median versus the distance from the median to Q3. | The distance from Q1 to the median is greater than the distance from the median to Q3. | Both distances are the same. | The distance from Q1 to the median is less than the distance from the median to Q3. |
The sample of 10 times to get ready (Section 3.1) ranges from 29 minutes to 52 minutes.
The median is 39.5, the first quartile is 35 and the third quartile is 44. Therefore, the five-number summary is:
29 35 39.5 44 52
The distance from Xsmallest to the median (39.5 − 29 = 10.5) is slightly less than the distance
from the median to Xlargest (52 − 39.5 = 12.5). The distance from Xsmallest to Q1 (35 − 29 = 6) is
slightly less than the distance from Q3 to Xlargest (52 − 44 = 8). Therefore, the times to get ready
are slightly right skewed.
EXAMPLE 3.17
CALCULATING THE FIVE-NUMBER SUMMARY FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL >
Calculate the five-number summary.
SOLUTION
From Examples 3.3, 3.6 and 3.8 the five-number summary is:
343 502 744 993 1,119
The distance from the median to Xsmallest ($401) is more than the distance from Xlargest to the
median ($375). Furthermore, the distance from Xsmallest to Q1 ($159) is more than the distance
from Q3 to Xlargest ($126). Therefore, the amount spent by international visitors during the
festival has a slight left-skewed distribution.
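The comparisons in Table 3.7 are mechanical enough to sketch in Python. The function names below are hypothetical, and testing for exactly equal distances is an assumption made for simplicity; with real data the distances are rarely identical, so judgement is still needed. The two five-number summaries are the ones given in the text.

```python
def compare_distances(lower_distance, upper_distance):
    """Label one row of Table 3.7 from a pair of distances."""
    if lower_distance > upper_distance:
        return "left skewed"
    if lower_distance < upper_distance:
        return "right skewed"
    return "symmetrical"


def shape_indications(x_smallest, q1, median, q3, x_largest):
    """Apply the three comparisons of Table 3.7, in table order."""
    return [
        compare_distances(median - x_smallest, x_largest - median),
        compare_distances(q1 - x_smallest, x_largest - q3),
        compare_distances(median - q1, q3 - median),
    ]


# Five-number summaries quoted in the text.
times_to_get_ready = shape_indications(29, 35, 39.5, 44, 52)
festival_spend = shape_indications(343, 502, 744, 993, 1119)

print(times_to_get_ready)
print(festival_spend)
```

The three comparisons need not agree with one another, which is one reason the text describes these distributions as only slightly skewed.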
Box-and-Whisker Plots
A box-and-whisker plot, alternatively called a boxplot, provides a graphical representation of
the data based on the five-number summary. It shows the range, interquartile range and quartiles. Figure 3.4 illustrates the box-and-whisker plot for the times to get ready. The vertical line
drawn within the box represents the median. The vertical line at the left side of the box represents Q1 and the vertical line at the right side of the box represents Q3. Thus, the box contains
the middle 50% of an ordered array of data values, 25% between the median and each quartile.
The lower 25% of the data values is represented by a line (i.e. a whisker) connecting the left
side of the box to the location of the smallest value, Xsmallest. Similarly, the upper 25% of the
data values is represented by a whisker connecting the right side of the box to Xlargest.
box-and-whisker plot
Graphical representation of the
five-number summary.
Figure 3.4 Box-and-whisker plot of the time to get ready
[Figure: horizontal box-and-whisker plot with Xsmallest, Q1, Median, Q3 and Xlargest marked; axis: Time (minutes), 20 to 55]
The box-and-whisker plot of the times to get ready in Figure 3.4 confirms a very slight
right skewness since the right whisker is slightly longer than the left whisker.
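The lengths that a box-and-whisker plot displays follow directly from the five-number summary. As a minimal sketch (the function name `box_geometry` and the dictionary keys are hypothetical), the times-to-get-ready summary 29, 35, 39.5, 44, 52 gives:

```python
def box_geometry(x_smallest, q1, median, q3, x_largest):
    """Lengths shown by a box-and-whisker plot, from a five-number summary."""
    return {
        "range": x_largest - x_smallest,
        "interquartile_range": q3 - q1,   # width of the box
        "left_whisker": q1 - x_smallest,
        "right_whisker": x_largest - q3,
    }


geometry = box_geometry(29, 35, 39.5, 44, 52)
for name, length in geometry.items():
    print(f"{name}: {length} minutes")
```

The right whisker (8 minutes) is slightly longer than the left whisker (6 minutes), in line with the reading of Figure 3.4.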
EXAMPLE 3.18
CONSTRUCTING A BOX-AND-WHISKER PLOT FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL >
Construct and interpret the box-and-whisker plot shown in Figure 3.5.
SOLUTION
Figure 3.5 Box-and-whisker plot, festival expenditure – international visitors
[Figure: horizontal box-and-whisker plot of festival expenditure for international visitors; axis: $, 300 to 1,200]
The left-hand whisker is slightly longer than the right-hand whisker and the left-hand and
right-hand rectangles are approximately the same. Therefore, the amount spent by international visitors during the festival has a very slight left or negative skew.
Figure 3.6 demonstrates the relationship between a box-and-whisker plot and the corresponding polygon for four different types of distributions. (Note: The area under each
polygon is split into quartiles corresponding to the five-number summary for the box-and-whisker plot.)
Panels A and D of Figure 3.6 are symmetrical. In these distributions, the length of the left
whisker is equal to the length of the right whisker, and the median line divides the box in half.
Panel B of Figure 3.6 is left skewed. For this left-skewed distribution, the skewness indicates that there is a heavy clustering of values at the high end of the scale (i.e. the right-hand
side); 75% of all values are found between the left edge of the box (Q1) and the end of the right
whisker (Xlargest). Therefore, the long left whisker contains the smallest 25% of the values.
Panel C of Figure 3.6 is right skewed. The concentration of values is on the low end of
the scale (i.e. the left side of the box-and-whisker plot). Here, 75% of all data values are
found between the beginning of the left whisker (Xsmallest) and the right edge of the box (Q3),
and the remaining 25% of the values are dispersed along the long right whisker at the upper
end of the scale.
Figure 3.6 Box-and-whisker plots and corresponding polygons for four distributions
[Figure: Panel A, Bell-shaped distribution; Panel B, Left-skewed distribution; Panel C, Right-skewed distribution; Panel D, Rectangular/uniform distribution]
Problems for Section 3.4
LEARNING THE BASICS
3.28 The data below are a sample of n = 5:
7 4 9 8 2
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.
3.29 The data below are a sample of n = 6:
7 4 9 7 3 12
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.
3.30 The data below are a sample of n = 7:
12 7 4 9 0 7 3
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.
3.31 The data below are a sample of n = 5:
7 −5 −8 7 9
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.
APPLYING THE CONCEPTS
Problems 3.32 to 3.35 can be solved manually or by using Microsoft Excel or PHStat.
3.32 For the life of 15-watt CFL light bulbs data in problem 3.10: < BULBS >
a. List the five-number summary for each manufacturer.
b. Construct the box-and-whisker plot and describe the shape of the distribution for each manufacturer.
3.33 For the daily sales data in problem 3.8: < SALES >
a. List the five-number summary.
b. Construct the box-and-whisker plot and discuss the daily
sales distribution for the store.
3.34 Many fast-food chains offer salads and low-fat options on their
menu as an alternative to their traditional rolls and burgers.
Data for a sample of these alternative and traditional menu
items are stored in < HEALTHY_FASTFOOD >. For each product
category, use the fat in grams per serve data:
a. List the five-number summary.
b. Construct the box-and-whisker plot.
c. What similarities and differences are there in the
distributions for the product categories?
3.35 Use the data in problems 3.14 and 3.15, representing the waiting
times of random samples of customers at two bank branches
during the noon to 1 pm lunch period. < BANK1 > < BANK2 >
For each bank:
a. List the five-number summary of the waiting time at the two
bank branches.
b. Construct the box-and-whisker plot and describe the shape
of the distribution of the two bank branches.
c. What similarities and differences are there in the distribution
of the waiting time at the two bank branches?
3.5 COVARIANCE AND THE COEFFICIENT OF CORRELATION
LEARNING OBJECTIVE 4
Calculate and interpret the covariance and the coefficient of correlation for bivariate data
In Section 2.5, scatter diagrams are used to examine the relationship between two numerical variables (bivariate data). In this section, the covariance and the coefficient of correlation are introduced to measure the strength of the linear relationship between two numerical variables.
Covariance
The covariance is a measure of the strength and direction of the linear relationship between
two numerical variables (X and Y). A positive value indicates a positive linear relationship
between the two variables and a negative value indicates a negative relationship. A value
of zero indicates that there is no linear relationship between the variables. A relationship
that is linear can be graphed by a straight line, sloping upwards if positive and downwards
if negative.
Equation 3.18a defines the sample covariance.
covariance
Measure of the strength of the
linear relationship between two
numerical variables.
sample covariance
Covariance calculated from sample
data.
THE SAMPLE COVARIANCE – DEFINITION FORMULA

$$\operatorname{cov}(X, Y) = \frac{\text{SSXY}}{n-1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \tag{3.18a}$$

where $\bar{X}$ = sample mean of variable X
$\bar{Y}$ = sample mean of variable Y
n = number of data points $(X_i, Y_i)$
$X_i$ = ith value of the independent variable X
$Y_i$ = ith value of the dependent variable Y, which corresponds to $X_i$
$\text{SSXY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$ = sum of the squares for X and Y
As for the sample variance and standard deviation, we can use algebra to obtain alternative
calculation formulas.
THE SAMPLE COVARIANCE – CALCULATION FORMULA

$$\operatorname{cov}(X, Y) = \frac{\text{SSXY}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1} \tag{3.18b}$$

where $\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \ldots + X_n Y_n$ = sum of the product of $X_i Y_i$ values
Use either calculation formula.
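The algebra that links the definition formula to the calculation formulas is short. As a sketch, using only $\sum_{i=1}^{n} X_i = n\bar{X}$ and $\sum_{i=1}^{n} Y_i = n\bar{Y}$:

$$
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
&= \sum_{i=1}^{n} X_i Y_i - \bar{Y}\sum_{i=1}^{n} X_i - \bar{X}\sum_{i=1}^{n} Y_i + n\bar{X}\bar{Y} \\
&= \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y} \\
&= \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}
\end{aligned}
$$

Writing $n\bar{X}\bar{Y}$ as $\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i / n$ gives the second calculation form.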
EXAMPLE 3.19
CALCULATING THE SAMPLE COVARIANCE FOR DISCRETIONARY INCOME AND EXPENDITURE
The council in the opening scenario is also interested in the discretionary, or disposable, income and corresponding expenditure of residents within the region. To explore this Kai obtains the following data on discretionary weekly income and expenditure from 10 randomly selected residents of the region.
Calculate the sample covariance for discretionary weekly income and expenditure.

| Discretionary income $ | 400 | 815 | 550 | 400 | 250 | 300 | 375 | 380 | 425 | 600 |
|---|---|---|---|---|---|---|---|---|---|---|
| Discretionary expenditure $ | 350 | 650 | 525 | 370 | 250 | 295 | 330 | 350 | 415 | 460 |
SOLUTION
Kai expects that discretionary expenditure is related to discretionary income, so defines
Discretionary Income $ as the independent variable (X) and Discretionary Expenditure $
as the dependent variable (Y).
Calculate:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{4{,}495}{10} = 449.50$$

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{3{,}995}{10} = 399.50$$

$$\sum_{i=1}^{10} X_i Y_i = (400 \times 350) + (815 \times 650) + \ldots + (600 \times 460) = 1{,}966{,}625$$

Then, using Equation 3.18b, we obtain:

$$\text{SSXY} = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = 1{,}966{,}625 - (10 \times 449.50 \times 399.50) = 170{,}872.50$$

$$\operatorname{cov}(X, Y) = \frac{\text{SSXY}}{n-1} = \frac{170{,}872.50}{9} = 18{,}985.833\ldots$$
As the covariance is positive, Kai can conclude that there is a positive linear relationship
between discretionary income and expenditure.
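Example 3.19 can be reproduced in Python, which also shows that the definition formula (3.18a) and the calculation formula (3.18b) agree. This is a sketch only; the two function names are hypothetical, and the data are the ten income and expenditure pairs given in the example.

```python
# Discretionary weekly income (X) and expenditure (Y) from Example 3.19.
income = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]


def covariance_definition(x, y):
    """Sample covariance from the definition formula (Equation 3.18a)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / (n - 1)


def covariance_calculation(x, y):
    """Sample covariance from the calculation formula (Equation 3.18b)."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    ss_xy = sum_xy - sum(x) * sum(y) / n
    return ss_xy / (n - 1)


cov_a = covariance_definition(income, expenditure)
cov_b = covariance_calculation(income, expenditure)
print(f"definition formula:  {cov_a:.3f}")
print(f"calculation formula: {cov_b:.3f}")
```

Both functions should print a value of about 18,985.833, matching the hand calculation.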
As covariance can have any value, it is difficult to use it as a measure of the relative strength
of a linear relationship. A better and related measure of the relative strength of a linear relationship is the coefficient of correlation.
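One way to see why the size of a covariance is hard to interpret is that it carries the units of X multiplied by the units of Y. The sketch below re-expresses the Example 3.19 data in cents instead of dollars; the function name and the choice of cents are assumptions made purely for illustration.

```python
# Discretionary weekly income and expenditure in dollars (Example 3.19).
income = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]


def sample_covariance(x, y):
    """Sample covariance, definition formula (Equation 3.18a)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


cov_dollars = sample_covariance(income, expenditure)

# The same data expressed in cents: every value is multiplied by 100.
cov_cents = sample_covariance([100 * v for v in income],
                              [100 * v for v in expenditure])

print(f"covariance in dollars: {cov_dollars:,.2f}")
print(f"covariance in cents:   {cov_cents:,.2f}")
print(f"ratio: {cov_cents / cov_dollars:,.0f}")
```

The relationship between the two variables has not changed, yet the covariance is 10,000 times larger, so its magnitude alone says little about strength.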
Coefficient of Correlation
The coefficient of correlation measures the relative strength of a linear relationship between two
numerical variables. The values of the coefficient of correlation range from −1 for a perfect
negative linear correlation to +1 for a perfect positive linear correlation. Perfect means that, if
the points are plotted in a scatter diagram, all the points will lie in a straight line. When dealing
with population data for two numerical variables, the Greek letter ρ (rho) is used as the symbol
for the coefficient of correlation. Figure 3.7 illustrates three different types of association
between two variables.
Figure 3.7 Types of association between variables
[Figure: three scatter diagrams of Y against X. Panel A: Perfect negative correlation (r = –1). Panel B: No correlation (r = 0). Panel C: Perfect positive correlation (r = +1).]
coefficient of correlation (or correlation coefficient)
Measure of the relative strength of the linear relationship between two numerical variables.
Panel A of Figure 3.7 illustrates a perfect negative linear relationship between X and Y,
where the coefficient of correlation ρ equals −1. Panel B shows a situation in which there is no
relationship between X and Y. In this case, the coefficient of correlation ρ equals 0. Panel C
illustrates a perfect positive linear relationship where ρ equals +1.
With sample data, the sample coefficient of correlation r can be calculated. Figure 3.8
(page 127) gives the scatter diagrams with their respective sample coefficients of correlation r
for six data sets, each of which contains 100 values of X and Y.
sample coefficient of
correlation
Coefficient of correlation
calculated from sample data.
In panel A of Figure 3.8 the coefficient of correlation r is −0.9. You can see that small values of X tend to be paired with large values of Y. Likewise, large values of X tend to be paired
with small values of Y. As the data are not all in a straight line, the linear relationship between
X and Y is strong but not perfect. The data in panel B have a coefficient of correlation equal to
–0.6, and the small values of X tend to be paired with large values of Y. However, as the data
points are more scattered in panel B, the linear relationship between X and Y in panel B is not
as strong as that in panel A. Thus, the coefficient of correlation in panel B, while still negative
(indicating a negative relationship), is closer to 0 than the correlation coefficient in panel A. In
panel C the negative linear relationship between X and Y is very weak, r = −0.3, and there is
only a slight tendency for the small values of X to be paired with the larger values of Y. Panels
D to F depict data sets that have positive coefficients of correlation, hence positive linear relationships, where small values of X tend to be paired with small values of Y, and the large values
of X tend to be paired with large values of Y.
In this discussion of Figure 3.8, the relationships are deliberately described as tendencies
and not as cause-and-effect. This wording is intentional. Correlation alone cannot prove that
there is a causal effect – that is, that the change in the value of one variable caused the change
in the other variable. A strong correlation can be produced simply by chance, by the effect of a
third variable not considered in the calculation of the coefficient of correlation, or by a cause-and-effect relationship. You would need to perform additional analysis to determine which of
these three situations actually produced the correlation. Therefore, you can say that causation
implies correlation, but correlation alone does not imply causation.
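The third-variable case can be illustrated with simulated data. In the sketch below, which is an assumption-laden illustration and not data from the text, a hypothetical variable z drives both x and y; neither x nor y affects the other, yet their sample coefficient of correlation is high. The noise level of 0.3 and the sample size of 1,000 are arbitrary choices.

```python
import random
from math import sqrt


def correlation(x, y):
    """Sample coefficient of correlation from the sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / (sqrt(ss_x) * sqrt(ss_y))


random.seed(1)  # fixed seed so the sketch is repeatable

# z is the unobserved third variable; x and y each depend on z plus noise.
z = [random.gauss(0, 1) for _ in range(1000)]
x = [zi + random.gauss(0, 0.3) for zi in z]
y = [zi + random.gauss(0, 0.3) for zi in z]

r = correlation(x, y)
print(f"r between x and y = {r:.2f}")
```

An analyst who saw only x and y could not tell from r alone that the association comes entirely from z.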
Equation 3.19 defines the sample coefficient of correlation r and Example 3.20 illustrates
its use.
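THE SAMPLE COEFFICIENT OF CORRELATION
The sample coefficient of correlation is sample covariance divided by the sample standard deviations of X and Y:

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$

where $S_X$, $S_Y$ are the sample standard deviations for variables X and Y, defined by Equation 3.10. As

$$\operatorname{cov}(X, Y) = \frac{\text{SSXY}}{n-1}, \quad S_X = \sqrt{\frac{\text{SSX}}{n-1}} \quad \text{and} \quad S_Y = \sqrt{\frac{\text{SSY}}{n-1}}$$

the sample correlation coefficient can also be defined as:

$$r = \frac{\text{SSXY}}{\sqrt{\text{SSX}}\sqrt{\text{SSY}}} \tag{3.19}$$

where the formulas for the respective sum of squares are:

$$\text{SSXY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$$

$$\text{SSX} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$

$$\text{SSY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$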
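The step from the covariance form of r to Equation 3.19 is a cancellation of the $n-1$ terms. As a brief sketch:

$$
r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}
= \frac{\dfrac{\text{SSXY}}{n-1}}{\sqrt{\dfrac{\text{SSX}}{n-1}}\sqrt{\dfrac{\text{SSY}}{n-1}}}
= \frac{\dfrac{\text{SSXY}}{n-1}}{\dfrac{\sqrt{\text{SSX}}\sqrt{\text{SSY}}}{n-1}}
= \frac{\text{SSXY}}{\sqrt{\text{SSX}}\sqrt{\text{SSY}}}
$$

So r depends on the data only through the three sums of squares, and not separately on the sample size.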
Figure 3.8 Six scatter diagrams and their sample coefficients of correlation, r
[Figure: Panel A, r = –0.9; Panel B, r = –0.6; Panel C, r = –0.3; Panel D, r = 0.3; Panel E, r = 0.6; Panel F, r = 0.9]
EXAMPLE 3.20
CALCULATING THE SAMPLE CORRELATION COEFFICIENT FOR DISCRETIONARY INCOME AND EXPENDITURE
Kai is exploring the relationship between discretionary, or disposable, income and the corresponding expenditure of residents within the region. From the data in Example 3.19, calculate and interpret the sample correlation coefficient.
SOLUTION
From Example 3.19:
X = 449.50, Y = 399.50 and SSXY = 170,872.5
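Calculate:

$$\sum_{i=1}^{10} X_i^2 = 400^2 + 815^2 + \ldots + 600^2 = 2{,}264{,}875$$

$$\sum_{i=1}^{10} Y_i^2 = 350^2 + 650^2 + \ldots + 460^2 = 1{,}722{,}275$$

$$\text{SSX} = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 2{,}264{,}875 - 10 \times 449.5^2 = 244{,}372.5$$

$$\text{SSY} = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = 1{,}722{,}275 - 10 \times 399.5^2 = 126{,}272.5$$

Therefore, using Equation 3.19:

$$r = \frac{\text{SSXY}}{\sqrt{\text{SSX}}\sqrt{\text{SSY}}} = \frac{170{,}872.5}{\sqrt{244{,}372.5}\sqrt{126{,}272.5}} = 0.9727\ldots$$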
As r = 0.97 is very close to 1, Kai can conclude that there is a very strong positive linear
relationship between discretionary income and expenditure. As it is known that there is a
relationship between a person’s income and their expenditure, Kai can conclude that, if a
resident’s discretionary income increases, their expenditure is also highly likely to increase.
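The calculation in Example 3.20 can be checked with a short Python sketch. The helper name `sums_of_squares` is hypothetical; the data are the ten pairs from Example 3.19, and the sums of squares use the calculation forms listed with Equation 3.19.

```python
from math import sqrt

# Discretionary weekly income (X) and expenditure (Y) from Example 3.19.
income = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]


def sums_of_squares(x, y):
    """Return SSXY, SSX and SSY using the calculation formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_y = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return ss_xy, ss_x, ss_y


ss_xy, ss_x, ss_y = sums_of_squares(income, expenditure)
r = ss_xy / (sqrt(ss_x) * sqrt(ss_y))  # Equation 3.19

print(f"SSXY = {ss_xy:,.1f}, SSX = {ss_x:,.1f}, SSY = {ss_y:,.1f}")
print(f"r = {r:.4f}")
```

Unlike the covariance, r has no units, so the value of about 0.97 can be read directly against the scale from −1 to +1.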
In summary, the coefficient of correlation is a measure of the strength of the linear relationship, or association, between two numerical variables. The closer the coefficient of correlation
is to +1 or −1, the stronger the linear relationship. When the coefficient of correlation is near
0, there is little or no linear relationship between the two numerical variables. The sign of the
coefficient of correlation indicates whether the data are positively correlated (i.e. the larger
values of X tend to be paired with the larger values of Y) or negatively correlated (i.e. the larger
values of X tend to be paired with the smaller values of Y). The existence of a strong correlation
does not imply a causation effect. It only indicates the tendencies present in the data.
Problems for Section 3.5
LEARNING THE BASICS
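3.36 The data are from a sample of n = 11 items:

| X | 7 | 5 | 8 | 3 | 6 | 10 | 12 | 4 | 9 | 15 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Y | 21 | 15 | 24 | 9 | 18 | 30 | 36 | 12 | 27 | 45 | 54 |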
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. How strong is the relationship between X and Y? Explain.
APPLYING THE CONCEPTS
Problems 3.37 to 3.40 can be solved manually or by using Microsoft Excel.
3.37 You are interested in the relationship between the number of people
in a sales team and the sales generated, in a certain industry.
| Number of staff | 26 | 18 | 15 | 28 | 19 | 23 | 27 | 23 | 17 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sales | 45 | 38 | 35 | 77 | 33 | 44 | 54 | 55 | 32 | 47 |
These data show gross sales, measured in millions of dollars,
and the number of people on a sales team.
a. Calculate the covariance and coefficient of correlation.
b. What conclusions can you reach about the relationship
between the number of people in a sales team and the sales
generated?
3.38 Use the data in problem 2.18 to investigate the relationship
between petrol and diesel prices in New South Wales and in
Queensland. < FUEL_2017 >
a. Calculate the covariance and coefficient of correlation for
diesel and petrol prices in New South Wales.
b. Calculate the covariance and coefficient of correlation for
diesel and petrol prices in Queensland.
c. What conclusions can you reach about the relationship
between petrol and diesel prices in New South Wales and in
Queensland?
3.39 A local council is interested in the relationship between the size of
local restaurants, measured as number of seats, and their annual
water usage, in kilolitres. From a random sample of 10 local
restaurants the following information was obtained. < WATER2 >
| Number of seats X | Annual water usage Y (kilolitres) |
|---|---|
| 60 | 880 |
| 45 | 550 |
| 54 | 720 |
| 68 | 725 |
| 70 | 932 |
| 55 | 922 |
| 67 | 950 |
| 45 | 560 |
| 64 | 726 |
| 42 | 405 |
a. Construct a scatter diagram for the data and comment
on any apparent relationship between restaurant size and
annual water usage.
b. Calculate the sample covariance and coefficient of
correlation. Are these values what you expected from the
scatter diagram?
c. What conclusions can you reach about the relationship
between restaurant size and annual water usage?
3.40 The data file < MILK > gives nutrition content (number of calories
and total fat, in grams) per 250 mL of a random sample of 20
fresh milks available in Australia.
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. Which do you think is more valuable in expressing the
relationship between calories and fat content – the
covariance or the coefficient of correlation? Explain.
d. What conclusions can you reach about the relationship
between calories and fat content?
3.6 PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND
ETHICAL ISSUES
This chapter introduces sample statistics and population parameters that describe the centre,
variation and shape of a distribution of a single numerical variable and also the association
between two numerical variables. The next step is analysis and interpretation of the calculated
statistics. While your analysis is objective, your interpretation is subjective. Be careful to
avoid errors that may arise either in the objectivity of your analysis or in the subjectivity of
your interpretation.
Analysis of expenditure data in the opening scenario is objective and reveals several impartial findings. Objectivity in data analysis means reporting the most appropriate descriptive summary measures for a given data set. Now that you have read this chapter and become familiar
with various descriptive summary measures and their strengths and weaknesses, how should
you proceed with an objective analysis? For example, from Figure 2.9 the amount spent during
the festival by intrastate visitors is positively skewed, so shouldn’t both the median and the
mean be reported? Also, doesn’t the standard deviation and/or interquartile range provide more
information about the variation of amount spent than the range?
On the other hand, data interpretation is subjective. Different people form different conclusions when interpreting analytical findings. Everyone sees the world from different perspectives. Thus, because data interpretation is subjective, you must attempt to present your findings
in a fair, neutral and transparent manner.
Ethical Issues
Ethical issues are vitally important to all data analysis. As a daily consumer of information, you need to question what you read in newspapers and magazines, what you hear on
the radio or television, and what you see online. Over time, much scepticism has been
expressed about the purpose, the focus and the objectivity of published studies. Perhaps no
comment on this topic is more telling than a quip often attributed to the famous nineteenth-century British statesman Benjamin Disraeli: ‘There are three kinds of lies: lies, damned
lies, and statistics’.
Ethical considerations arise when you are deciding what results to include in a report. You
should document both good and bad results. In addition, when making oral presentations and
compiling written reports, you need to give results in a fair, objective and neutral manner.
Unethical behaviour occurs when you wilfully choose an inappropriate summary measure (e.g.
the mean for a very skewed set of data) to distort the facts in order to support a particular position. In addition, unethical behaviour occurs when you selectively fail to report pertinent findings because it would be detrimental to the support of a particular position.
To illustrate this selective use of statistics, in 2009 an Australian newspaper, under the
heading ‘Nation of gamblers’, stated:
Australian and New Zealand gamblers are the worst in the world, betting more money
online than those of any other country…
According to the report from which these statistics were drawn (R. T. Wood and R. J. Williams, ‘Internet gambling: Prevalence, patterns, problems, and policy options’, 5 January 2009), the mean net monthly gambling expenditure of the 19 Australian and New Zealand Internet gamblers in the sample (from more than 12,000 from 105 countries) was US$300.32, the second highest in the survey. However, the report gave the median net monthly gambling expenditure of this group as US$9.00, the lowest.
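A gap of this kind between a mean and a median is typical of very skewed data. The sketch below uses a small hypothetical data set, invented purely for illustration and not taken from the cited report, in which one very large value pulls the mean far above the median.

```python
from statistics import mean, median

# Hypothetical monthly amounts (in dollars), invented for illustration only;
# these are NOT values from the Wood and Williams report.
amounts = [0, 2, 5, 9, 9, 12, 2000]

sample_mean = mean(amounts)
sample_median = median(amounts)

print(f"mean   = {sample_mean}")
print(f"median = {sample_median}")
```

Reporting only the mean of such a sample would suggest typical spending of about 291, although six of the seven values are 12 or less, which is why both measures should be reported for skewed data.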
3
Assess your progress
Summary
This chapter introduced numerical descriptive measures. This, and
Chapter 2, covered descriptive statistics – how data are presented
in tables and charts and then summarised, described, analysed
and interpreted. When dealing with the opening scenario data, we
were able to present useful information through the use of
histograms and other graphical methods. Then characteristics of
the expenditure data such as central tendency, variability and
shape were explored, using numerical descriptive measures
including the mean, median, quartiles, range and standard
deviation. The covariance and coefficient of correlation were
introduced to describe the relationship between two numerical
variables. In the next chapter, the basic principles of probability are
introduced to bridge the gap between descriptive statistics and
inferential statistics.
Key formulas
Sample mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

Median
$$\text{Median} = \frac{n+1}{2}\ \text{ranked value} \tag{3.2}$$

First quartile Q1
$$Q_1 = \frac{n+1}{4}\ \text{ranked value} \tag{3.3}$$

Third quartile Q3
$$Q_3 = \frac{3(n+1)}{4}\ \text{ranked value} \tag{3.4}$$

Geometric mean
$$\bar{X}_G = (X_1 \times X_2 \times \dots \times X_n)^{1/n} \tag{3.5}$$

Geometric mean rate of return
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \dots \times (1 + R_n)]^{1/n} - 1 \tag{3.6}$$

Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \tag{3.7}$$

Interquartile range
$$\text{Interquartile range} = Q_3 - Q_1 \tag{3.8}$$

Sample variance (definition)
$$S^2 = \frac{SSX}{n-1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \tag{3.9a}$$

Sample variance (calculation)
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1} \tag{3.9b}$$

Sample standard deviation
$$S = \sqrt{S^2} \tag{3.10}$$

Coefficient of variation
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \tag{3.11}$$

Z score
$$Z = \frac{X - \bar{X}}{S} \tag{3.12}$$

Population mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{3.13}$$

Population variance (definition)
$$\sigma^2 = \frac{SSX}{N} = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{3.14a}$$

Population variance (calculation)
$$\sigma^2 = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \tag{3.14b}$$

Population standard deviation
$$\sigma = \sqrt{\sigma^2} \tag{3.15}$$

Approximating the mean from a frequency distribution
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \tag{3.16}$$

Approximating the standard deviation from a frequency distribution
$$S = \sqrt{S^2} \tag{3.17}$$
where
$$S^2 = \frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$$

Sample covariance (definition)
$$\text{cov}(X, Y) = \frac{SSXY}{n-1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \tag{3.18a}$$

Sample covariance (calculation)
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1} \tag{3.18b}$$

Sample coefficient of correlation
$$r = \frac{SSXY}{\sqrt{SSX \cdot SSY}} \tag{3.19}$$
where
$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$$
$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$
$$SSY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$
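The definition and calculation forms of the sample variance should always give the same answer. The short Python sketch below checks this for a small data set. Python is used here only for illustration (the text itself works in Excel), and the five data values and the function names are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def sample_variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_calculation(xs):
    """Sum of squares minus (sum squared over n), divided by n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [4, 8, 6, 5, 7]  # hypothetical values, not taken from the text

s2_def = sample_variance_definition(data)
s2_calc = sample_variance_calculation(data)
s = math.sqrt(s2_def)  # sample standard deviation

print(s2_def, s2_calc, s)
```

For these values the mean is 6 and the squared deviations sum to 10, so both forms give 10/4 = 2.5.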
Key terms
arithmetic mean (mean) 92
bell-shaped 115
box-and-whisker plot 121
central tendency 92
Chebyshev rule 116
coefficient of correlation 125
coefficient of variation 105
covariance 123
empirical rule 115
extreme value (outlier) 106
first (lower) quartile 97
five-number summary 120
geometric mean 98
interquartile range 100
median 94
mode 96
population mean 113
population standard deviation 114
population variance 114
quartiles 96
range 99
resistant measures 101
sample coefficient of correlation 126
sample covariance 123
sample mean 93
sample standard deviation 102
sample variance 102
second quartile 97
shape 92
skewed 107
spread (dispersion) 99
standard deviation 101
sum of squares (SS) 101
symmetrical 107
third (upper) quartile 97
variance 101
variation 92
Z scores 106
Chapter review problems
CHECKING YOUR UNDERSTANDING
3.41 What is meant by a property of central tendency?
3.42 What are the differences between the mean, median and
mode, and what are the advantages and disadvantages
of each?
3.43 How do you interpret the first quartile, median and third
quartile?
3.44 What is meant by the property of variation?
3.45 What does the Z score measure?
3.46 What are the differences between the various measures of
variation such as the range, interquartile range, variance,
standard deviation and coefficient of variation, and what are
the advantages and disadvantages of each?
3.47 How do the empirical rule and the Chebyshev rule differ?
APPLYING THE CONCEPTS
You can solve problems 3.48 to 3.56 manually or by using Microsoft Excel.
3.48 A quality characteristic of interest for a tea-bag-filling process
is the weight of the tea in the individual bags. If the bags are
underfilled, two problems arise. First, customers may not be
able to brew the tea as strong as they wish. Second, the
company may be in violation of the truth-in-labelling laws. For
this product, the label weight on the package indicates that, on
average, there are 5.5 grams of tea in a bag. If the average
amount of tea in a bag exceeds the label weight, the company
is giving away product.
Getting an exact amount of tea into a bag is problematic
because of variation in the temperature and humidity inside the
factory, differences in the density of the tea, and the extremely
fast filling operation of the machine (approximately 170 bags a
minute). The table below provides the weight in grams of a
sample of 50 tea-bags produced within an hour by a single
machine. < TEABAGS >
5.65 5.57 5.47 5.77 5.61 5.44 5.40 5.40 5.57 5.45
5.42 5.53 5.47 5.42 5.44 5.40 5.54 5.61 5.58 5.25
5.53 5.55 5.53 5.58 5.56 5.34 5.62 5.32 5.50 5.63
5.54 5.56 5.67 5.32 5.50 5.45 5.46 5.29 5.50 5.57
5.52 5.44 5.49 5.53 5.67 5.41 5.51 5.55 5.58 5.36
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
c. Interpret the measures of central tendency and variation
within the context of this problem. Why should the company
producing the tea-bags be concerned about the central
tendency and variation?
d. Construct a box-and-whisker plot. Are the data skewed? If
so, how?
e. Is the company meeting the requirement set forth on the
label that, on average, there are 5.5 grams of tea in a bag?
If you were in charge of this process, what changes, if any,
would you try to make concerning the distribution of
weights in the individual bags?
3.49 Use the data in problems 2.30 and 2.70 to investigate the
distribution of petrol and diesel prices in New South Wales and
Queensland. < FUEL_MARCH_2017 >
a. Calculate the mean, median, first quartile and third quartile
of New South Wales and Queensland petrol and diesel
prices. What conclusions can you draw?
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation of New South Wales
and Queensland petrol and diesel prices. What conclusions
can you draw?
c. Construct box-and-whisker plots for the data. Are the data
skewed? What conclusions can you draw?
d. Calculate the covariance and coefficient of correlation for
diesel and petrol prices in New South Wales and
Queensland. What conclusions can you reach about the
relationship between petrol and diesel prices in New South
Wales and Queensland?
3.50 The data file < GRADES > contains a sample of student
marks and grades from a population of students enrolled in
a statistics unit.
a. Calculate the mean, median, range and standard deviation
for total marks. Interpret these measures of central
tendency and variability.
b. List the five-number summary for total marks.
c. For total marks, construct and interpret a box-and-whisker
plot.
d. Ignoring students who did not attempt the final exam,
calculate the covariance and coefficient of correlation for
semester and exam marks.
e. What conclusions can you reach about the relationship
between a student’s semester and exam marks?
3.51 The file < AGE > contains the ages and gender of the Australian
population at 30 June 2013 and 2016.
a. Calculate the approximate mean age and the approximate
standard deviation of age for males and females at 30 June
2013 and 2016.
b. What conclusions can you draw about male and female
ages in 2013 and 2016?
3.52 In many manufacturing processes the term ‘work-in-process’
(WIP) is used. In a book-manufacturing plant the WIP
represents the time it takes for sheets from a press to be
folded, gathered, sewn, tipped on end sheets and bound. The
following data represent samples of 20 books at each of two
production plants and the processing time (operationally
defined as the time in days from when the books came off
the press to when they were packed in cartons) for these
jobs. < WIP >
Plant A
5.62 5.29 16.25 10.92 11.46 21.62 8.45 8.58 5.41 11.42
11.62 7.29 7.50 7.96 4.42 10.50 7.58 9.29 7.54 8.92
Plant B
9.54 11.46 16.62 12.62 25.75 15.41 14.29 13.13 13.71 10.04
5.75 12.46 9.17 13.21 6.00 2.33 14.25 5.37 6.25 9.71
For each of the two plants:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If
so, how?
d. On the basis of the results of (a) to (c), are there any
differences between the two plants? Explain.
3.53 Water_Wise is analysing water usage for a block of one-bedroom flats. They collect data on daily water consumption in
kilolitres (kl) for 133 consecutive days. < WATER > Explore the
daily water usage in this block of flats by:
a. plotting the data graphically
b. calculating the summary statistics
c. commenting on the graphs and the summary statistics.
3.54 In this problem you are asked to select an appropriate value for
the standard deviation, based on your knowledge of how these
variables vary.
a. From a sample of 30 petrol stations, the mean price of E10
petrol is $1.56 per litre. Which of the following is a
reasonable value for the corresponding standard deviation
of prices: $0.03, $3.00 or $30.00?
b. The mean starting salary of a sample of 50 recent
graduates is $65,200. Which of the following is a
reasonable value for the standard deviation of starting
salaries: $5, $50 or $5,000?
c. The mean weight of a sample of 100 male university
students is 70 kg. Which of the following is a reasonable
value for the standard deviation of weights: 0.5 kg, 10 kg or
50 kg?
3.55 The following table gives the annual increase in the Consumer
Price Index (CPI), a measure of inflation in Australia and New
Zealand.
            CPI % annual change
Year to     Australia   New Zealand
Dec 2012    2.2         0.9
Dec 2013    2.7         1.6
Dec 2014    1.7         0.8
Dec 2015    1.7         0.1
Dec 2016    1.5         1.3
Data obtained from Reserve Bank of Australia <www.rba.gov.au> and Reserve
Bank of New Zealand <www.rbnz.govt.nz> accessed Jun 2017
For each country:
a. Calculate the geometric mean inflation rate from 2012 to
2016.
b. What conclusions can you draw about the inflation rate in
New Zealand and Australia?
3.56 Naturally Soap (see problem 3.22) is interested in exploring the
relationship between the price and the quantity sold at each
market. < NATURALLY_SOAP >
For the Sunday morning and Wednesday evening markets,
calculate and interpret the coefficient of correlation between
weekly quantity sold and price.
3.57 You are planning to study for your statistics examination with a
group of classmates, one of whom you particularly want to
impress. This individual has volunteered to use Microsoft Excel
to get the needed summary information, tables and charts for a
data set containing several numerical and categorical variables
assigned by your lecturer for study purposes. This person
comes over to you with the printout and exclaims, ‘I’ve got it
all – the means, the medians, the standard deviations, the
box-and-whisker plots, the pie charts – for all our variables. The
problem is, some of the output looks weird – like the box-and-whisker plots for gender and for major, and the pie charts for
grade point average and for height. Also, I can’t understand
why Professor Krehbiel said we can’t get the descriptive
statistics for some of the variables – I got it for everything! See,
the mean for height is 1.78, the grade point average is 2.76,
the mean for gender is 1.50 and the mean for major is 4.33.’
What is your reply?
REPORT WRITING EXERCISE
3.58 The data in the file < BEER > give the alcohol and calorie
content of a sample of 95 beers, together with country of
origin and type.
Your task is to write a report based on a complete
descriptive evaluation of each of the numerical variables –
calories and alcohol content – regardless of type of product or
origin. Then perform a similar evaluation comparing each of
these numerical variables based on type of product – regular,
light or non-alcoholic beers. In addition, perform a similar
evaluation comparing and contrasting each of these numerical
variables based on the origins of the beers – those of a
selected country or continent versus those from elsewhere.
Appended to your report should be all appropriate tables,
charts and numerical descriptive measures.
Continuing cases
Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues.
In particular, students within the school are asked to complete a student survey when they receive their grades
each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA)
students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in
< TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.
Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman
University Postgraduate MBA Student Survey.
a For a selection of numerical variables in the BBus student survey, calculate appropriate descriptive
statistics.
b For a selection of numerical variables in the MBA student survey, calculate appropriate descriptive
statistics.
c Write a report summarising your conclusions.
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. These data are stored in < REAL_ESTATE >.
a For a selection of numerical variables for regional city 1 state A, calculate appropriate descriptive
statistics.
b For a selection of numerical variables for coastal city 1 state A, calculate appropriate descriptive
statistics.
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.
Chapter 3 Excel Guide
EG3.1 MEASURES OF CENTRAL TENDENCY,
VARIATION AND SHAPE
CENTRAL TENDENCY
The Mean, Median and Mode
Key technique
Use the AVERAGE(variable cell range),
MEDIAN(variable cell range), and MODE(variable cell
range) functions to calculate these measures.
Example
Calculate the mean, median and mode for the sample of
getting-ready times introduced in Section 3.1.
PHStat
Use Descriptive Summary.
For the example, open the Times file. Select PHStat ➔
Descriptive Statistics ➔ Descriptive Summary. In the
Descriptive Summary dialog box (shown in Figure EG3.1):
Figure EG3.1
Descriptive Summary
dialog box
1. Enter or highlight cells A1:A11 as the Raw Data
Cell Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title and click OK.
PHStat inserts a new worksheet that contains various measures of central tendency, variation, and shape discussed in
Section 3.1. This worksheet is similar to the CompleteStatistics worksheet of the Descriptive workbook.
In-depth Excel
Use the CentralTendency worksheet of the Descriptive
workbook as a model.
For the example, open the Times file and insert a new
worksheet (right-click on tab ➔ Insert ➔ Worksheet) and:
1. Enter a title in cell A1.
2. Enter Get-Ready Times in cell B3, Mean in cell
A4, Median in cell A5, and Mode in cell A6.
3. Enter the formula =AVERAGE(DATA!A:A) in
cell B4, the formula =MEDIAN(DATA!A:A) in
cell B5, and the formula =MODE(DATA!A:A)
in cell B6.
For these functions, the variable cell range includes the
name of the DATA worksheet because the data being summarised appears on the separate DATA worksheet. If you
suspect that there may be more than one mode, highlight
several cells, say B7:G7, enter =TRANSPOSE(MODE.MULTI(DATA!A:A))
and then press Ctrl+Shift+Enter. See the
Central_Tendency workbook, which gives the two modes
for the times to get ready.
To calculate the mean, median and mode for another
set of data, paste the data into column A of the DATA
worksheet, overwriting the existing getting-ready times.
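As a cross-check outside Excel, the same three measures can be sketched in Python with the standard statistics module. This is an illustration only, not part of the Excel workbooks: the list of times below is hypothetical, and multimode plays the role that MODE.MULTI plays in Excel by returning every most-frequent value.

```python
from statistics import mean, median, multimode

# Hypothetical getting-ready times in minutes (illustrative values only)
times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]

sample_mean = mean(times)      # counterpart of =AVERAGE(...)
sample_median = median(times)  # counterpart of =MEDIAN(...)
modes = multimode(times)       # counterpart of =MODE.MULTI(...); lists all modes

print(sample_mean, sample_median, modes)
```

With these values there are two modes, 39 and 44, each occurring twice, which is the situation the MODE.MULTI instructions are designed for.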
Analysis ToolPak
Use Descriptive Statistics.
For the example, open the Times file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Descriptive Statistics from the Analysis Tools list and
then click OK.
In the Descriptive Statistics dialog box (shown in Figure
EG3.2):
3. Enter or highlight cells A1:A11 as the Input
Range. Click Columns and check Labels in first
row.
4. Click New Worksheet Ply and check Summary
statistics, Kth Largest, and Kth Smallest.
5. Click OK.
Figure EG3.2
Descriptive Statistics
dialog box
The ToolPak inserts a new worksheet that contains various measures of central tendency, variation, and shape discussed in Section 3.1.
Quartiles
Key technique
Use the MEDIAN, COUNT, SMALL, INT, FLOOR, and
CEILING functions in combination with the IF decision-making function to calculate the quartiles. To apply the
rules for calculating quartiles on page 97, avoid using any
of the Excel quartile functions to calculate the first and
third quartiles.
Example
Calculate the quartiles for the sample of getting-ready
times introduced in Section 3.1.
PHStat
Use Boxplot (discussed on page 137).
In-depth Excel
Use the COMPUTE worksheet of the Quartiles workbook as a model.
For the example, the COMPUTE worksheet already
calculates the quartiles for the getting-ready times. To calculate the quartiles for another set of data, paste the data
into column A of the DATA worksheet, overwriting the
existing getting-ready times.
Open to the COMPUTE_FORMULAS worksheet to
examine the formulas.
The COMPARE worksheet compares the quartiles
obtained using Section 3.1 rules for quartiles and the Excel
quartile functions: QUARTILE(array, quart), QUARTILE.
INC(array, quart) and QUARTILE.EXC(array, quart).
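The Python sketch below shows why different quartile methods can disagree. It is an illustration under stated assumptions, not the workbook's formulas. It assumes the Section 3.1 rules are: use the ranked value directly when the position (n + 1)/4 or 3(n + 1)/4 is a whole number, average the two adjacent values when the position ends in .5, and otherwise round the position to the nearest whole number. The function names and data are hypothetical. The two quantiles calls use linear interpolation and appear to correspond to the exclusive and inclusive styles of Excel quartile function.

```python
from statistics import quantiles

def ranked_value(sorted_xs, position):
    """Value at a 1-based ranked position under the assumed rounding rules."""
    whole = int(position)
    frac = position - whole
    if frac == 0:        # whole-number position: take that ranked value
        return sorted_xs[whole - 1]
    if frac == 0.5:      # halfway position: average the two adjacent values
        return (sorted_xs[whole - 1] + sorted_xs[whole]) / 2
    nearest = whole + 1 if frac > 0.5 else whole  # otherwise round to nearest
    return sorted_xs[nearest - 1]

def textbook_quartiles(xs):
    """Return (Q1, median, Q3) using ranked positions based on n + 1."""
    s = sorted(xs)
    n = len(s)
    if n < 2:
        raise ValueError("need at least two values")
    return (ranked_value(s, (n + 1) / 4),
            ranked_value(s, (n + 1) / 2),
            ranked_value(s, 3 * (n + 1) / 4))

times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]  # hypothetical values

q1, q2, q3 = textbook_quartiles(times)
print(q1, q2, q3)
print(quantiles(times, n=4, method="exclusive"))  # interpolates at (n + 1)p
print(quantiles(times, n=4, method="inclusive"))  # interpolates at (n - 1)p + 1
```

For n = 10 the first-quartile position is 2.75, which the assumed rules round to the third ranked value, whereas the interpolating methods return a value between the second and third ranked values.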
The Geometric Mean
Key technique
Use the GEOMEAN((1 + (R1)), (1 + (R2)), . . . (1 + (Rn)))
- 1 function to calculate the geometric mean rate of return.
Example
Calculate the geometric mean rate of return in the NZX-50
Index for the five years as shown in Example 3.7 on page 99.
In-depth Excel
Enter the formula =GEOMEAN(1+0.24,1+0.16,1+0.18,1+0.14,1+0.10)-1
in any cell.
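The same calculation can be sketched in Python as a check on the Excel result. The five rates of return are those that appear in the GEOMEAN formula above (0.24, 0.16, 0.18, 0.14 and 0.10); the variable names are illustrative only.

```python
from statistics import geometric_mean

# Annual rates of return as they appear in the GEOMEAN formula
returns = [0.24, 0.16, 0.18, 0.14, 0.10]

# Convert each rate to a growth factor (1 + R), take the geometric mean,
# then subtract 1 to get back to a rate of return (Equation 3.6)
growth_factors = [1 + r for r in returns]
geometric_rate = geometric_mean(growth_factors) - 1

print(f"{geometric_rate:.4f}")
```

Compounding the result over five years reproduces the product of the five growth factors, which is a quick way to confirm the answer.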
VARIATION AND SHAPE
The Range
Key technique
Use the MIN(variable cell range) and MAX(variable cell
range) functions to help calculate the range.
Example
Calculate the range for the sample of getting-ready times
introduced in Section 3.1.
PHStat
Use Descriptive Summary as discussed earlier.
In-depth Excel
Use the Range worksheet of the Descriptive workbook as
a model.
For the example, open the worksheet implemented for
the example in the In-depth Excel ‘The Mean, Median, and
Mode’ instructions.
Enter Minimum in cell A7, Maximum in cell A8, and
Range in cell A9. Enter the formula =MIN(DATA!A:A)
in cell B7, the formula =MAX(DATA!A:A) in cell B8,
and the formula =B8 - B7 in cell B9.
Analysis ToolPak
Use Descriptive Statistics as discussed earlier.
The Interquartile Range
Key technique
Use a formula to subtract the first quartile from the third
quartile.
Example
Calculate the interquartile range for the sample of getting-ready times introduced in Section 3.1.
In-depth Excel
Use the COMPUTE worksheet of the Quartiles workbook (introduced earlier) as a model.
For the example, the interquartile range is already calculated in cell B19 using the formula =B18 - B16.
The Variance, Standard Deviation, Coefficient of Variation
and Z Scores
Key technique
Use the VAR.S(variable cell range) and STDEV.S(variable
cell range) functions to calculate the sample variance and
the sample standard deviation, respectively. Use the AVERAGE and STDEV.S functions for the coefficient of variation. Use the STANDARDIZE(value, mean, standard
deviation) function to calculate Z scores.
Example
Calculate the variance, standard deviation, coefficient of
variation, and Z scores for the sample of getting-ready
times introduced in Section 3.1.
PHStat
Use Descriptive Summary as discussed earlier.
In-depth Excel
Use the Variation and ZScores worksheets of the Descriptive workbook as models.
For the example, open to the worksheet implemented
for the earlier examples. Enter Variance in cell A10,
Standard Deviation in cell A11 and Coeff. of Variation in
cell A12. Enter the formula =VAR.S(DATA!A:A) in cell
B10, the formula =STDEV.S(DATA!A:A) in cell B11, and
the formula =B11/AVERAGE(DATA!A:A) in cell B12. If
you previously entered the formula for the mean in cell B4
using the In-depth Excel instructions for the mean, enter the
simpler formula =B11/B4 in cell B12. Right-click cell B12
and click Format Cells in the shortcut menu. In the Number
tab of the Format Cells dialog box, click Percentage in the
Category list, enter 2 as the Decimal places, and click OK.
To calculate the Z scores, copy the DATA worksheet. In
the new, copied worksheet, enter Z Score in cell B1. Enter the
formula =STANDARDIZE(A2, AVERAGE(A:A),
STDEV.S(A:A)) in cell B2 and copy the formula down to row
11. Then format cells B2 to B11 to the required number of
decimal places. If you use an Excel version older than Excel
2010, use VAR and STDEV instead of VAR.S and STDEV.S.
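For readers who want to verify the worksheet results independently, the following Python sketch calculates the same four measures. It is an illustration only: the data values are hypothetical, and variance and stdev from the standard statistics module divide by n - 1, as VAR.S and STDEV.S do.

```python
from statistics import mean, stdev, variance

times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]  # hypothetical values

x_bar = mean(times)
s2 = variance(times)   # sample variance, divisor n - 1 (like VAR.S)
s = stdev(times)       # sample standard deviation (like STDEV.S)
cv = s / x_bar * 100   # coefficient of variation, as a percentage

# Z score for each value: (X - mean) / standard deviation,
# the same calculation STANDARDIZE performs one cell at a time
z_scores = [(x - x_bar) / s for x in times]

print(round(s2, 4), round(s, 4), round(cv, 2))
print([round(z, 2) for z in z_scores])
```

The Z scores always sum to zero, because the deviations from the mean cancel out.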
Analysis ToolPak
Use Descriptive Statistics as discussed earlier. This procedure does not calculate Z scores.
Shape: Skewness and Kurtosis
Key technique
Use the SKEW(variable cell range) and the KURT(variable
cell range) functions to calculate these measures.
Example
Calculate the skewness and kurtosis for the sample of getting-ready times introduced in Section 3.1.
PHStat
Use Descriptive Summary as discussed earlier.
In-depth Excel
Use the Shape worksheet of the Descriptive workbook as
a model.
For the example, open to the worksheet implemented for
the earlier examples. Enter Skewness in cell A13 and Kurtosis in cell A14. Enter the formula =SKEW(DATA!A:A)
in cell B13 and the formula =KURT(DATA!A:A) in cell
B14. Then format cells B13 and B14 to four decimal places.
Analysis ToolPak
Use Descriptive Statistics as discussed earlier.
EG3.2 NUMERICAL DESCRIPTIVE MEASURES FOR
A POPULATION
The Population Mean, Population Variance and Population
Standard Deviation
Key technique
Use AVERAGE(variable cell range), VAR.P(variable cell
range), and STDEV.P(variable cell range) to calculate
these measures.
Example
Calculate the population mean, population variance and
population standard deviation for the road fatality population data of Table 3.3 on page 113.
In-depth Excel
Use the Parameters workbook as a model. For the example,
the COMPUTE worksheet of the Parameters workbook
already calculates the three population parameters for the
road fatality data. For other problems, paste your unsummarised data into column B of the DATA worksheet, overwriting the road fatality data. If you use an Excel version
older than Excel 2010, use the COMPUTE_OLDER
worksheet.
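The only difference between the population and sample calculations is the divisor: N for a population, n - 1 for a sample. The Python sketch below makes this visible with a small hypothetical data set (not the road fatality data). pvariance and pstdev from the standard statistics module divide by N, as VAR.P and STDEV.P do.

```python
from statistics import mean, pstdev, pvariance, variance

values = [4, 8, 6, 5, 7]  # hypothetical data treated as a whole population

mu = mean(values)
sigma2 = pvariance(values)  # population variance, divisor N (like VAR.P)
sigma = pstdev(values)      # population standard deviation (like STDEV.P)

# For comparison only: the sample variance of the same values, divisor n - 1
s2 = variance(values)

print(mu, sigma2, sigma, s2)
```

Here the squared deviations sum to 10, so the population variance is 10/5 = 2 while the sample variance is 10/4 = 2.5.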
The Empirical Rule and the Chebyshev Rule
Use the COMPUTE worksheet of the VE_Variability
workbook to explore the effects of changing the mean and
standard deviation on the ranges associated with ±1 standard deviation, ±2 standard deviations, and ±3 standard
deviations from the mean. Change the mean in cell B4 and
the standard deviation in cell B5 and then note the updated
results in rows 9 to 11.
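A similar exploration can be sketched in Python. This is a hypothetical illustration, not a copy of the VE_Variability workbook: the mean of 50 and standard deviation of 5 are made-up inputs. It uses the usual statements of the two rules: approximately 68%, 95% and 99.7% of values within 1, 2 and 3 standard deviations for bell-shaped data (empirical rule), and at least (1 - 1/k^2) x 100% within k standard deviations for any data set when k > 1 (Chebyshev rule).

```python
def interval(mean, sd, k):
    """Range covering k standard deviations either side of the mean."""
    return (mean - k * sd, mean + k * sd)

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, for any distribution."""
    if k <= 1:
        raise ValueError("the Chebyshev rule applies only for k > 1")
    return 1 - 1 / k ** 2

# Approximate proportions for bell-shaped data (empirical rule)
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

mean, sd = 50, 5  # hypothetical inputs; change these to see the ranges move

for k in (1, 2, 3):
    low, high = interval(mean, sd, k)
    chebyshev = chebyshev_lower_bound(k) if k > 1 else None
    print(k, low, high, EMPIRICAL[k], chebyshev)
```

The Chebyshev bounds (75% for k = 2, about 88.9% for k = 3) are lower than the empirical-rule figures because they must hold for every distribution, not only bell-shaped ones.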
EG3.3 FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS
Key technique
Plot a series of line segments on the same chart to construct
a boxplot. (Excel chart types do not include boxplots.)
Example
Calculate the five-number summary and construct the boxplots for festival expenditure by international visitors in
Figure 3.5.
PHStat
Use Boxplot.
For the example, open the Festival file. Select PHStat
➔ Descriptive Statistics ➔ Boxplot. In the Boxplot dialog
box (shown in Figure EG3.3):
1. Enter or highlight C2:C14 as the Raw Data Cell
Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title, check Five-Number Summary, and
click OK.
The boxplot appears on its own chart sheet, separate from
the worksheet that contains the five-number summary.
In-depth Excel
Use the worksheets of the Boxplot workbook as templates.
For the example, use the PLOT_DATA worksheet,
which already shows the five-number summary and boxplot for festival expenditure by international visitors.
Figure EG3.3
Boxplot dialog box
For other problems, use the PLOT_SUMMARY
worksheet as the template if the five-number summary has
already been determined; otherwise, paste your unsummarised data into column A of the DATA worksheet and use
the PLOT_DATA worksheet as was done for the example.
The worksheets creatively misuse Excel line-charting
features to construct a boxplot.
EG3.4 THE COVARIANCE AND THE COEFFICIENT OF
CORRELATION
The Covariance
Key technique
Use the COVARIANCE.S(variable 1 cell range, variable
2 cell range) function to calculate this measure.
Example
Calculate the sample covariance for the discretionary income
and expenditure data in Example 3.19.
In-depth Excel
Use the Covariance workbook as a model.
For the example, the discretionary income and expenditure data have already been placed in columns A and B of
the DATA worksheet and the COMPUTE worksheet displays the calculated covariance in cell B9. For other problems, paste the data for two variables into columns A and B
of the DATA worksheet, overwriting the discretionary
income and expenditure data.
If you use an Excel version older than Excel 2010, use
the COMPUTE_OLDER worksheet that calculates the
covariance without using the COVARIANCE.S function
that was introduced in Excel 2010.
The Coefficient of Correlation
Key technique
Use the CORREL(variable 1 cell range, variable 2 cell
range) function to calculate this measure.
Example
Calculate the coefficient of correlation for the discretionary
income and expenditure data in Example 3.19.
In-depth Excel
Use the Correlation workbook as a model.
For the example, the discretionary income and expenditure data have already been placed in columns A and B of
the DATA worksheet and the COMPUTE worksheet displays the coefficient of correlation in cell B14. For other
problems, paste the data for two variables into columns A
and B of the DATA worksheet, overwriting the discretionary
income and expenditure data.
The COMPUTE worksheet uses the COVARIANCE.S
function to calculate the covariance (see the previous section) and also the DEVSQ, COUNT, and SUMPRODUCT
functions. Open the COMPUTE_FORMULAS worksheet
to examine the use of all these functions.
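The link between the covariance and the coefficient of correlation can be seen by calculating both from the same sums of squares. The Python sketch below does this by hand rather than with a library function. It is an illustration only: the paired values and function names are hypothetical and are not the discretionary income and expenditure data.

```python
import math

def sums_of_squares(xs, ys):
    """Return (SSX, SSY, SSXY) for paired data of equal length."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ssx, ssy, ssxy

def sample_covariance(xs, ys):
    """SSXY divided by n - 1, the quantity COVARIANCE.S reports."""
    _, _, ssxy = sums_of_squares(xs, ys)
    return ssxy / (len(xs) - 1)

def correlation(xs, ys):
    """SSXY divided by the square root of SSX times SSY."""
    ssx, ssy, ssxy = sums_of_squares(xs, ys)
    return ssxy / math.sqrt(ssx * ssy)

x = [1, 2, 3, 4, 5]  # hypothetical paired observations
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r = correlation(x, y)
print(cov_xy, round(r, 4))
```

For these values SSX = 10, SSY = 6 and SSXY = 6, so the covariance is 6/4 = 1.5 and r is 6 divided by the square root of 60, about 0.77. Unlike the covariance, r has no units and always lies between -1 and +1.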
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
End of Part 1 problems
A.1 A sample of 500 shoppers was selected in a large
metropolitan area to obtain consumer behaviour
information. Among the questions asked was, ‘Do you enjoy
shopping for clothing?’ The results are summarised in the
following cross-classification table.
Enjoy shopping      Gender
for clothing        Male    Female    Total
Yes                 136     224       360
No                  104     36        140
Total               240     260       500
a. Construct contingency tables based on total percentages,
row percentages and column percentages.
b. Construct a side-by-side bar chart of enjoy shopping for
clothing based on gender.
c. What conclusions do you draw from these analyses?
A.2 One of the major measures of the quality of service provided by
any organisation is the speed with which the organisation
responds to customer complaints. A large family-owned
department store selling furniture and flooring, including
carpet, has undergone major expansion in the past few
years. In particular, the flooring department has expanded
from two installation crews to an installation supervisor,
a measurer and 15 installation crews. During a recent
year the company got 50 complaints about carpet installation.
The following data represent the number of days between
receipt of the complaint and resolution of the complaint.
< FURNITURE >
54 5 35 137 31 27 152 2 123 81 74 27
11 19 126 110 110 29 61 35 94 31 26 5
12 4 165 32 29 28 29 26 25 1 14 13
13 10 5 27 4 52 30 22 36 26 20 23
33 68
a. Construct frequency and percentage distributions.
b. Construct histogram and percentage polygons.
c. Construct a cumulative percentage distribution and plot the
corresponding ogive.
d. Calculate the mean, median, first quartile and third
quartile.
e. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
f. Construct a box-and-whisker plot. Are the data skewed? If
so, how?
g. On the basis of the results of (a) to (f), if you had to report
to the manager on how long a customer should expect to
wait to have a complaint resolved, what would you say?
Explain.
A.3 The annual crediting rates (after tax and fees) on several
managed superannuation investment funds between 2013 and
2017 are:
                       Historical crediting rate for year ending 30 June, %
Superannuation fund    2017    2016    2015    2014    2013
Conservative           5.5     8.7     9.0     11.3    12.3
Balanced               9.5     5.2     10.7    14.1    15.9
Growth                 11.8    3.8     11.3    15.6    18.7
High growth            13.7    3.1     12.3    17.4    20.5
a. For each fund, calculate the geometric rate of return for
three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric
rates of return for the funds?
A.4 A supplier of ‘Natural Australian’ spring water states that the
magnesium content is 1.6 mg/L. To check this, the quality
control department takes a random sample of 96 bottles
during a day’s production and obtains the magnesium content.
< SPRING_WATER1 >
a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the
corresponding ogive.
d. Calculate the mean, median, mode, first quartile and third
quartile.
e. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
f. Construct and interpret a box-and-whisker plot.
g. What conclusions can you reach concerning the magnesium
content of this day’s production?
A.5 The National Australia Bank (NAB) produces regular reports
titled NAB Online Retail Sales Index <www.business.nab.com.au>.
Download the latest in-depth report.
a. Give an example of a categorical variable found in the
report.
b. Give an example of a numerical variable found in the
report.
c. Is the variable you selected in (b) discrete or
continuous?
A.6 The data in the file < WEBSTATS > represent the number
of times during August and September that a sample
of 50 students accessed the website of a statistics
unit they were enrolled in.
a. Construct ordered arrays for August and September.
b. Construct stem-and-leaf displays for August and
September.
c. Construct frequency, percentage and cumulative
distributions for August and September.
d. Plot frequency histograms as separate graphs; plot
percentage polygons on the same graph.
e. Plot cumulative percentage polygons on the same graph.
f. Calculate the mean, median, mode, first quartile and third
quartile.
g. Calculate the variance, standard deviation, range,
interquartile range and coefficient of variation.
h. Based on the results of (a) to (g), what conclusions can you
reach about the number of times a student accesses the
website each month?
A.7 In problem A.6 sample statistics were calculated from data
representing the number of times, during August and
September, a sample of 50 students accessed the website of a
statistics unit they were enrolled in. < WEBSTATS >
For each month (August and September):
a. List the five-number summary.
b. Construct the box-and-whisker plot.
c. Discuss the distribution of the number of times a student
accesses the website each month.
A.8 The data stored in data file < WEBSTATS > classify the
number of times, during August and September, that a
sample of 50 students accessed a statistics unit website by
day and time.
a. Construct appropriate tables and/or charts to investigate
the day of the week and the time that students access the
website.
b. What conclusions can you draw about the pattern of web
access for the two months?
c. When would you post an announcement, so that the
maximum number of students would read it?
A.9 The data in the file < NZ_CAR_SALES_16_17 > are of sales of
new cars in New Zealand for February 2016 and 2017 (data
obtained from Motor Industry Association of New Zealand
<www.mia.org.nz> accessed 27 March 2017). For each year,
ignoring the other category:
a. Calculate the mean, variance and standard deviation for the
population of the 20 top-selling makes of car.
b. What proportion of the makes have sales within ±1, ±2
and ±3 standard deviations of the mean?
c. Compare and contrast your findings with what would be
expected based on the empirical rule or on the Chebyshev
rule.
A.10 The data below represent the distribution of the ages of
employees in two different divisions of a publishing
company.
Age of employees (years)    Division A frequency    Division B frequency
20–under 30                           8                      15
30–under 40                          17                      32
40–under 50                          11                      20
50–under 60                           8                       4
60–under 70                           2                       0
For each of the two divisions (A and B), approximate the
a. mean.
b. standard deviation.
c. On the basis of the results of (a) and (b), do you think there
are differences in the age distribution between the two
divisions? Explain.
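One way to approach parts (a) and (b) of A.10 is to treat every employee in a class as if their age sat at the class midpoint and then apply the usual frequency-weighted formulas. The Python sketch below does this under two assumptions that are not stated in the problem: the midpoints 25, 35, ..., 65 stand in for the classes, and each division is treated as a sample (divisor n - 1). The function name is illustrative only.

```python
from math import sqrt

def grouped_mean_sd(midpoints, frequencies):
    """Approximate the mean and sample standard deviation of grouped data."""
    n = sum(frequencies)
    # Frequency-weighted mean of the class midpoints
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    # Frequency-weighted squared deviations, divided by n - 1 (sample assumption)
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)
    return mean, sqrt(variance)

# Class midpoints for 20-under 30, ..., 60-under 70 (assumed representative ages)
midpoints = [25, 35, 45, 55, 65]
division_a = [8, 17, 11, 8, 2]
division_b = [15, 32, 20, 4, 0]

mean_a, sd_a = grouped_mean_sd(midpoints, division_a)
mean_b, sd_b = grouped_mean_sd(midpoints, division_b)
print(f"Division A: mean {mean_a:.1f} years, SD {sd_a:.1f} years")
print(f"Division B: mean {mean_b:.1f} years, SD {sd_b:.1f} years")
```

If the divisions are regarded as complete populations, the divisor n - 1 would be replaced by n.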
A.11 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether it is discrete or continuous. In addition,
determine the level of measurement.
a. Amount of money spent on clothing in the last month
b. Favourite department store
c. Most likely time period during which shopping for clothing
takes place (weekday, weeknight, weekend)
d. Number of pairs of jeans owned
A.12 The file < CURRENCY > contains the monthly closing exchange
rates for the New Zealand dollar (NZD), the Japanese yen (JPY),
the United States dollar (USD) and the Chinese renminbi (CNY)
from January 2010 to May 2017, where each currency is
expressed in units per Australian dollar (data obtained from
Reserve Bank of Australia <www.rba.gov.au> accessed 1 June
2017).
a. Construct time-series plots for the monthly closing values of
each currency.
b. Explain any patterns present in the plots.
c. Construct separate scatter plots of the value of pairs of
these currencies.
d. Calculate the correlation coefficient for pairs of
currencies.
e. What conclusions can you reach concerning the value of
these currencies in terms of the Australian dollar?
f. Obtain current exchange rates from Reserve Bank of
Australia or elsewhere for either these currencies or
alternative currencies. Then repeat parts (a) to (e).
A.13 The table below classifies the academic staff of a
small regional university by gender and level.
< ACADEMIC_STAFF >
                                          Gender
Level                 Average salary   Female   Male   Total
Professor             $172,500           13      21      34
Associate professor   $147,600           16      24      40
Senior lecturer       $128,500           37      52      89
Lecturer              $108,200           74      58     132
Associate lecturer    $86,500            23      13      36
Total                                   163     168     331
a. Illustrate these data by constructing appropriate tables and
graphs.
b. What can you conclude about gender and level for
academic staff at this university?
c. Estimate the mean and standard deviation of academic
salaries.
d. Estimate the annual expenditure on academic salaries for
this university.
e. Estimate the mean and standard deviation of male and
female academic salaries.
f. Comment on the difference in male and female academic
salaries at this university.
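For parts (c) and (d) of A.13, one possible approach is to weight each level's average salary by the number of staff at that level. The Python sketch below assumes, for illustration only, that every staff member earns exactly the average salary for their level, so the estimated standard deviation captures only the variation between levels. The variable names and the n - 1 divisor are assumptions.

```python
from math import sqrt

# (level, average salary in dollars, total staff) taken from the A.13 table
levels = [
    ("Professor", 172_500, 34),
    ("Associate professor", 147_600, 40),
    ("Senior lecturer", 128_500, 89),
    ("Lecturer", 108_200, 132),
    ("Associate lecturer", 86_500, 36),
]

n_staff = sum(count for _, _, count in levels)
# Estimated annual salary expenditure: staff numbers times level averages
total_salaries = sum(salary * count for _, salary, count in levels)
mean_salary = total_salaries / n_staff
# Between-level variation only; within-level spread is unknown
variance = sum(count * (salary - mean_salary) ** 2 for _, salary, count in levels) / (n_staff - 1)
sd_salary = sqrt(variance)

print(f"Estimated total: ${total_salaries:,.0f}")
print(f"Estimated mean: ${mean_salary:,.0f}, SD: ${sd_salary:,.0f}")
```

The same calculation can be repeated with the female and male counts in place of the totals for part (e).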
A.14 To test the effectiveness of mail X-ray screening in identifying
potential illegal or threatening items a mail centre X-rays a
random sample of 500 packages and then independently
searches each package. The results of this test are given
below.
                       X-ray items identified
Search items found     Yes      No     Total
Yes                     36      12        48
No                      14     438       452
Total                   50     450       500
a. Illustrate these data by constructing appropriate tables and
graphs.
b. Do you feel that X-ray screening is effective in identifying
items of interest?
A.15 The following data represent the amount of soft drink filled in a
sample of 50 consecutive 2-litre bottles. The results are listed
horizontally in the order filled. < DRINK >
2.109  2.086  2.066  2.075  2.065  2.057  2.052  2.044
2.036  2.038  2.031  2.029  2.025  2.029  2.023  2.020
2.015  2.014  2.013  2.014  2.012  2.012  2.012  2.010
2.005  2.003  1.999  1.996  1.997  1.992  1.994  1.986
1.984  1.981  1.973  1.975  1.971  1.969  1.966  1.967
1.963  1.957  1.951  1.951  1.947  1.941  1.941  1.938
1.908  1.894
a. Construct a frequency distribution and a percentage
distribution.
b. Plot a histogram and a percentage polygon.
c. Form a cumulative percentage distribution and plot the
corresponding cumulative percentage polygon.
d. On the basis of the results of (a) to (c), does the amount of
soft drink in the bottles concentrate around specific values?
e. Construct a time-series plot with the amount of soft drink
on the vertical axis and the bottles’ numbers (from 1 to 50)
on the horizontal axis.
f. What pattern, if any, is present in the data?
g. If you had to make a prediction of the amount of soft drink
in the next bottle, what would you predict?
h. Based on the results of (e) to (g), explain why it is important
to construct a time-series plot and not just a histogram, as
was done in part (b).
A.16 Comment on the following graph, which appeared in the
Northern Star in August 2008.
[Graph: ‘How does your lender compare?’ Comparison of standard variable home loan rates on a loan of $150,000 over 25 years, rates current at 23 May 2008. BCU 9.15% pa (9.18% pa comparison rate); Newcastle Permanent 9.41% pa; Commonwealth Bank 9.44% pa; National Australia Bank 9.46% pa; ANZ 9.47% pa; Suncorp 9.47% pa; St George 9.47% pa; Westpac 9.47% pa.]
Data obtained from InfoChoice <www.infochoice.com.au>
A.17 The following table gives the results on food groups never
eaten from a national study of 10,000 men and 10,000 women
aged at least 50. < FOOD >
Foods never eaten               Men      Women     Total
Cheese                          236        219       455
Cream                           623        917     1,540
Dairy products                  131        196       327
Eggs                            175        279       454
Fish                            123        266       389
Seafood                         166        268       434
Any meat                        111        353       464
Chicken/Poultry                 126        368       494
Pork/Ham                        234        495       729
Red meat                        159        247       406
Sugar                         1,095        897     1,992
Wheat products                  187        380       567
Eat all foods                 7,299      7,878    15,177
Total number of respondents  10,000     10,000    20,000
a. For men and women, separately and combined, construct
percentage summary tables and bar charts for the data.
b. What conclusions can you draw about the diet of the
participants in the study?
c. Why would a pie chart not be appropriate for these data?
A.18 The data in < PROBLEMS > are random samples of the time (in
minutes) taken to resolve 40 problems reported by students
and 40 problems reported by staff to the Technology Services
(TS) Service Desk at Tasman University.
For each sample:
a. Construct appropriate tables and/or charts to investigate
the time it takes the TS Service Desk to resolve problems.
b. Calculate the mean, median and quartiles.
c. Calculate the range, interquartile range, variance, standard
deviation and coefficient of variation.
d. Construct a box-and-whisker plot. Are the data skewed? If
so, how?
e. On the basis of the results of (a) to (d), are there any
differences between the time to resolve TS problems for
staff and for students? Explain.
A.19 If two students receive a mark of 90 on the same examination,
what arguments could be used to show that the underlying
variable – test score – is continuous?
A.20 The call centre supervisor of the IT helpdesk of a large
university is monitoring the performance of the technical
support staff. The data in the file < HELP_DESK > give the
number of calls resolved during a random sample of 20 eight-hour shifts by five support staff.
a. For each staff member, construct frequency, percentage
and cumulative distributions.
b. For each staff member, construct a histogram.
c. On the same graph, construct percentage polygons for all
staff members.
d. On the same graph, construct ogives for all staff members.
e. For each staff member, calculate the mean, median, mode,
first quartile and third quartile.
f. For each staff member, calculate the variance, standard
deviation, range, interquartile range and coefficient of
variation.
g. On the same graph, construct and interpret a box-and-whisker plot for each staff member.
h. What conclusions can you reach concerning the number of
resolved calls?
A.21 The file < AGE > contains the ages and gender of the Australian
population at 30 June 2013 and 2016.
a. Construct percentage and cumulative percentage
distributions for the age of males, females and the entire
Australian population in 2013 and 2016.
b. Construct and interpret appropriate graphs to investigate
the age distribution of males and females, separately and
combined, and how it is changing.
c. Calculate the approximate mean age and approximate
standard deviation of age for the entire Australian
population.
A.22 One operation of a mill is to cut pieces of steel into parts that
will later be used as the frame for front seats in a car. The steel
is cut with a diamond saw and the resulting parts must be
within ±0.125 mm of the length specified by the car
manufacturer.
The data in < STEEL > come from a sample of 100 steel
parts. The measurement reported is the difference in
millimetres between the actual length of the steel part, as
measured by a laser measurement device, and the specified
length of the steel part. For example, the data value –0.05
represents a steel part that is 0.05 mm shorter than the
specified length.
a. Construct a frequency distribution and a percentage
distribution.
b. Plot the corresponding histogram and percentage
polygon.
c. Plot the corresponding cumulative percentage
polygon.
d. Is the steel mill doing a good job in meeting the
requirements set by the car manufacturer? Explain.
A.23 A large confectionery chain, Sweets-4-U, is interested in analysing
the quantity sold weekly over the previous year, including
associated cost data, of two of its popular products, ‘Forgive’
and ‘Rejoice’.
These products, both wrapped chocolates sold by weight,
differ only in the message attached to each chocolate. Forgive
chocolates contain messages ‘Sorry’, ‘Forgive Me’, ‘Trust Me’
and similar, while the messages attached to Rejoice
chocolates are ‘Celebrate’, ‘Have Fun’, ‘I Love You’ and
similar. < SWEETS_4_U >
For Forgive chocolates quantity sold data, construct and
interpret:
a. a stem-and-leaf display
b. frequency, percentage and cumulative distributions
c. a frequency histogram, percentage polygon and ogive
d. a scatter diagram of quantity sold and total cost.
For each product:
e. Calculate the mean, variance and standard deviation of the
weekly quantity sold for the year.
f. What conclusions can you make about the weekly quantity
sold for each product?
g. Use the empirical rule or the Chebyshev rule, whichever is
appropriate, to explain further the variation in the weekly
quantity sold.
h. Using the results in (g), are there any outliers? Explain.
i. Calculate and interpret the coefficient of correlation between
weekly quantity sold and the associated costs. Also calculate
and interpret the coefficient of correlation between the
weekly quantity sold of Rejoice and Forgive chocolates.
j. Construct time-series plots to investigate any pattern
in weekly sales over the year. What conclusions can
you make about the pattern of weekly sales for the
products?
A.24 Several hundred laboratory tests are performed at a large
hospital each day. The rate at which these tests are done
improperly (and therefore need to be redone) seems steady,
at about 4%. In an effort to get to the root cause of these
nonconformances (tests that need to be redone), the director of
the lab decided to keep records over a period of one week. The
laboratory tests were subdivided by the shift of workers who
performed them. The results are shown below.
                        Shift
Lab tests performed     Day    Evening    Total
Nonconforming            16         24       40
Conforming              654        306      960
Total                   670        330    1,000
a. Construct cross-classification tables based on total
percentages, row percentages and column
percentages.
b. Which type of percentage – row, column or total – do you
think is most informative for these data? Explain.
c. What conclusions concerning the pattern of nonconforming
laboratory tests can the laboratory director reach?
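The three kinds of percentage asked for in A.24 (a) differ only in the total used as the divisor: the grand total, the row total or the column total. The Python sketch below computes all three from the shift table. The dictionary layout and the function name are illustrative assumptions rather than anything prescribed by the problem.

```python
# Counts from the A.24 table: rows are test outcomes, columns are shifts
counts = {
    "Nonconforming": {"Day": 16, "Evening": 24},
    "Conforming": {"Day": 654, "Evening": 306},
}

def percentage_tables(counts):
    """Return total, row and column percentage tables for a two-way table."""
    rows = list(counts)
    cols = list(next(iter(counts.values())))
    grand_total = sum(counts[r][c] for r in rows for c in cols)
    row_totals = {r: sum(counts[r][c] for c in cols) for r in rows}
    col_totals = {c: sum(counts[r][c] for r in rows) for c in cols}
    total_pct = {r: {c: 100 * counts[r][c] / grand_total for c in cols} for r in rows}
    row_pct = {r: {c: 100 * counts[r][c] / row_totals[r] for c in cols} for r in rows}
    col_pct = {r: {c: 100 * counts[r][c] / col_totals[c] for c in cols} for r in rows}
    return total_pct, row_pct, col_pct

total_pct, row_pct, col_pct = percentage_tables(counts)
for shift in ("Day", "Evening"):
    # Column percentages give the nonconformance rate within each shift
    print(f"{shift}: {col_pct['Nonconforming'][shift]:.2f}% of tests nonconforming")
```

Because the shifts performed different numbers of tests, the column percentages are the ones that put the two shifts on a comparable footing.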
A.25 An economist exploring the relationship between interest rates
and inflation has collected interest and CPI data from the
reserve banks of New Zealand and Australia for 2000 to
March 2017 (data obtained from Reserve Bank of Australia
and Reserve Bank of New Zealand <www.rba.gov.au> and
<www.rbnz.govt.nz> accessed 1 June 2017).
< INTEREST_&_CPI_2017 >
For each country, use appropriate graphs and statistics to
investigate the relationship between interest and inflation
rates. What conclusions can you make?
A.26 < GDP > gives the annual percentage change in real gross
domestic product (GDP) per quarter since 2000 for New
Zealand (NZ), Australia, the United States of America (USA),
Japan and the United Kingdom (UK) (data obtained from
Reserve Bank of New Zealand <www.rbnz.govt.nz> accessed
1 June 2017).
a. Investigate the relationship between the annual percentage
changes in GDP for these five countries by constructing
time-series plots on the same set of axes.
b. What conclusions can you make about the changes in GDP
for these five countries?
A.27 Alex and Tyler have been monitoring their electricity use since
installing solar power almost a year ago, with the data stored
in < SOLAR_POWER > .
Explore Alex and Tyler’s power usage over this period by:
a. plotting the data graphically
b. calculating summary statistics
c. commenting on the graphs and summary statistics
A.28 The results of the 2017 Adobe Mobile Maturity Survey reveal
insights into the change to smartphones as primary online
access devices, and indicate the need for companies to focus
on creating engaging and personalised digital experiences for
their customers. How are companies addressing the mobile
experience? The survey found 40% of marketing decision
makers were prioritising mobile apps and only 24% were
prioritising mobile websites. However, the situation differed for
IT decision makers, of whom 26% were prioritising mobile
apps and 30% were prioritising mobile websites.
The research is based on an online survey with a sample
of 304 US executives, marketers, IT staff and analysts who had
experience with mobile marketing and who worked for or were
agents for organisations with 500+ employees. Of these, 254
were identified as marketing respondents and 50 as IT
respondents (data obtained from <www.adobe.com>).
a. Describe the populations of interest.
b. Describe the samples that were collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
A.29 A radio station survey of listeners found that 32% of the 1,356
drivers who responded admitted to talking on a hand-held
mobile phone while driving, and 23% admitted to reading or
sending SMS messages while driving. What information would
you want to know before you accepted the results of the
survey?
A.30 Pre-numbered sales invoices are kept in a sales journal. The
invoices are numbered from 0001 to 5000.
a. Beginning in row 16, column 1, and proceeding horizontally
in Table E.1, select a simple random sample of 50 invoice
numbers.
b. Select a systematic sample of 50 invoice numbers. Use the
random numbers in row 20, columns 5–7, as the starting
point for your selection.
c. Are the invoices selected in (a) the same as those selected
in (b)? Why or why not?
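The systematic selection in A.30 (b) can be sketched in Python as follows. With 5000 invoices and a sample of 50, the sampling interval is 5000/50 = 100, so a starting invoice between 1 and 100 is chosen and every 100th invoice after it is taken. The problem draws the starting point from Table E.1; the starting value 17 and the function name below are hypothetical stand-ins used only for illustration.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th item after a starting point between 1 and k."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        # Pseudo-random start in place of a printed random number table
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError(f"start must lie between 1 and {k}")
    return [start + i * k for i in range(sample_size)]

# Hypothetical starting point of 17; Table E.1 would supply the actual value
invoices = systematic_sample(5000, 50, start=17)
print([f"{number:04d}" for number in invoices[:5]])
```

Once the starting point is fixed, every other selected invoice is determined, which is the main contrast with the simple random sample in part (a).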
PART 2
Measuring uncertainty
Real People, Real Stats
Ellouise Roberts DELOITTE ACCESS ECONOMICS
Which company are you currently working for and what are some of your responsibilities?
I currently work for Deloitte Access Economics where I’m in the macroeconomic policy and
forecasting team located in Canberra. One of my main responsibilities is working with our
demographic forecasting model, where we project the future Australian population and some of its
characteristics – such as where people will live, how many people will be in the labour force and the
industries they might work in. These population forecasts are a key driver of our macroeconomic
model, which is used to assist a variety of clients in determining the impacts of potential economic and
policy changes on their business, industry or region.
Before joining Deloitte Access Economics, I worked at the Australian Bureau of Statistics in a range of
roles related to social research, demography and the Census. This included calculating life tables,
analysing fertility rates and investigating the type of transport people use to get from home to work.
List five words that best describe your personality.
A statistical textbook debutante!
(Practical, adaptable, instinctive, determined and enquiring.)
What are some things that motivate you?
In my working life I’m motivated by the role that statistics can play in solving problems. For example,
by undertaking statistical analysis to test the effectiveness of a particular policy in delivering intended
outcomes, we can provide a basis of evidence to assist in deciding whether or not to continue funding
existing programs, or to develop alternatives.
a quick q&a
Many of the projects that I have been involved with will also play
an important role in the future direction of Australia – whether
these are in the area of higher education, infrastructure or the
implementation of environmental controls. In these instances,
the use of statistics can provide evidence and insights which
cannot be acquired through other means such as consultations
or literature reviews.
When did you first become interested in statistics?
I began to appreciate the value that statistics could offer while at
school where I learnt about the work of John Graunt, who
analysed the vital statistics of London’s citizens during the
seventeenth century. During one of the many outbreaks of the
bubonic plague in London, Graunt became interested in the Bills
of Mortality – records of deaths from the plague – and through
the use of statistics was able to draw conclusions about how the
disease spread. Many ideas in use today – such as the
application of life tables in the insurance industry, national
censuses and medical statistics – utilise the principles and
foundations of Graunt’s work.
Statistics are also applicable to such a wide variety of industries
and occupations that it is hard to imagine a subject where they
could not offer additional insight and understanding. For
example, a farmer can collect a record of daily rainfall, but in
isolation those daily numbers do not offer any particularly
interesting findings. However, with the introduction of even the
most basic statistical techniques, such as the calculation of
monthly averages or the pattern of rainfall events, insights begin
to emerge. However, it is when they are combined with other
observations – such as pest or disease outbreaks, or cropping
metrics, or even worker productivity – that we begin to gain an
understanding of the relationships between inputs and outputs
(or dependent and independent variables) and appreciate the
real value that statistics can offer.
Describe your first statistics-related job or work experience.
Was this a positive or a negative experience?
My first statistics-related job, as a university student, involved
standing on the side of a road counting the types of vehicles that
went past. A seemingly simple job in itself; however, after the
counts were completed we would analyse the data to develop
traffic-flow diagrams to assist with the planning of future road
infrastructure, such as traffic lights. This was my first real
experience of collecting data and then transforming information –
a count of cars – into something meaningful and tangible to
everyday life. It also emphasised the importance of accurate and
suitable data collection techniques, and the role that sampling
plays in obtaining information. For example, although standing by
the side of the road counting cars for 24 hours was possible, it
would not be particularly cost effective (or exciting), and the use
of statistical techniques can help us build a comprehensive
picture using only a snapshot of data. Although a relatively
simple example, this experience helped to demonstrate the role
of statistics in society and encouraged me to continue working in
this area.
Complete the following sentence. A world without statistics …
… would be a world where we wouldn’t be able to celebrate
World Statistics Day.
LET’S TALK STATS
What do you enjoy most about working in statistics?
For me, it is not just the generation of the statistics and data that
I enjoy (although that in itself can be very interesting), but rather
the interpretation of these figures through the identification of
patterns, trends and relationships. As part of working with
statistics, you are also often involved in looking at the bigger
picture, drawing in knowledge from a range of other disciplines,
such as history, economics, demographics and sociology, to
make sense of statistical findings.
With such a wide applicability, working with statistics can offer a
range of opportunities across a wide variety of industries and
occupations. In many cases, the techniques and concepts used
are the same, but the subject matter can differ significantly,
which helps to keep work interesting.
What do you feel is the most common misconception about your
work held by students who are studying statistics? Please explain.
One of my misconceptions when studying statistics was that you
are either a ‘numbers’ or a ‘words’ person. However, in the
workplace no matter how good you may be at undertaking
complex statistical analysis, or building complicated models, you
also need to be able to communicate your findings with a variety
of audiences with varying degrees of understanding and interest.
Therefore, it is critical that, in addition to understanding the
mathematical techniques, you also develop your ability to
interpret your findings and convey them in a language that your
audience will understand – no matter who they may be.
Do you need to be good at maths to understand and use
statistics successfully?
To some degree, I think you do need to have a certain level of
understanding of maths and an appreciation for the role that
statistics can play. However, this doesn’t necessarily mean that
you need to memorise countless formulas or mathematical
proofs. Rather, you need the ability to be able to understand the
concepts and their application. It is also important to remember
that in some instances, the interpretation of the statistics is the
key output or outcome and can be more important than the
numbers themselves.
More broadly, studying statistics in a purely theoretical sense is
useful, but the real value is being able to apply these techniques
and calculations to real-world data, whether this be in the area
of finance, oceanography or biomedical science. However, even
more than that, to successfully work as a statistician – or with
statistics in any capacity – I think you need to have an enquiring
mind and to want to know things and understand why things are
as they are (or may be in the future).
Is there a high demand for statisticians in your industry (or in
other industries)? Please explain.
Studying statistics provides a solid foundation for a wide range
of roles within the workplace – including ones that may not be
immediately obvious, such as building early monitoring systems
for tsunamis or in the monitoring of disease outbreaks. Within my
role, both the public and private sectors are becoming
increasingly aware of and interested in what the future
demographic profile of Australia will look like, and the
implications that this will have.
In such a dynamic environment, I expect the opportunities for
people with an understanding of statistics will only increase as
more and more aspects of society, nature and the economy are
investigated, evaluated and analysed. Ten years ago nobody
imagined that there would be professional roles for people
undertaking statistical analysis of social media, online social
networks and online human behaviour – let alone the
prominence that these applications would play in society.
MEASURING UNCERTAINTY
What are the most practical consequences in your work that
would result from failing to report uncertainty?
In much of the work that I do – and particularly in population
forecasting – the element of uncertainty is fairly explicit. No one
knows for certain how big the population is going to be decades
into the future, particularly when you consider the assumptions
that need to be made about future fertility (including for females
who have not yet been born themselves), mortality (where
numerous medical breakthroughs every year continue to extend
our lifespan) and migration (where government policies play a
key role). However, by observing past trends, patterns and
behaviours we can build a picture of what the demographic and
economic future may look like under certain conditions.
More broadly, statistics don’t always necessarily give you a
definitive number or answer as such. Instead, they are often
predictions or assessments of information, making it critical to
explain the role of uncertainty in the conclusions that you make.
Given that our work can influence public and social policy within
Australia, the failure to report uncertainty can have considerable
consequences by falsely informing our clients’ decisions.
When might a discrete probability distribution be useful for your
work? Can you provide a specific question for which it has
helped to provide an answer?
In our type of work we are often concerned with the distribution
of certain events, such as the success or failure of students
completing a particular year in their apprenticeship training. In
this example, we were interested in understanding the
probability of success in relation to a range of different
characteristics, such as age, sex and industry, as well as any
government assistance that they had received. Based on a
sample of records, we investigated the probability distribution
based on individual characteristics, which assisted us in
identifying how these factors might contribute to the relative
likelihood of success or failure in relation to the overall sample.
This type of work assisted us to provide a range of information to
our client. Firstly, it helped to establish whether the assistance
being made available was targeted at the desired group (i.e.
those least likely to complete a particular year of the
apprenticeship) and whether the government program was
having a positive influence on completion rates.
When might a continuous probability distribution be useful for
your work? Can you provide a specific question for which it has
helped to provide an answer?
Continuous probability distributions provide the foundation for
much of our multiple linear regression analysis. Using the
example from above, in this instance we were also interested in
estimating the overall probability of apprenticeship success.
While the methods used were themselves conceptually
advanced, they were built around the basic assumptions of
continuous probability distributions.
Is it difficult to liken collected data to a common distribution?
What features of the data are used to do so?
One thing that you quickly learn in any analysis of ‘real-world’ data
is that although some data may be easily likened to a common
distribution – like exam results, which often follow a bell-shaped
curve similar to a normal distribution – any data collected is likely
to present its own unique set of challenges. Taking the Census, for
example, despite the extensive effort put into the design of the
form, the collection procedures, the processing of answers and
the data analysis there are still a wide range of errors (respondent
error, processing error, partial or non-response, and undercount)
that need to be considered while interpreting the statistics. In
many other cases, the data you will be analysing may be collected
for a different purpose (such as registrations of births, deaths and
marriages for administrative purposes), and incorrect, incomplete
and duplicate entries can be a significant issue.
CHAPTER 4
Basic probability
REPEAT FESTIVAL ATTENDANCE
To increase visitor numbers during the year and repeat attendance at the three-day musical
festival presented in the Chapter 2 and 3 scenarios, non-local festival attendees are given a
book of discount vouchers for subsequent visits to the region and/or the annual music
festival. These vouchers include seven nights for the price of five at selected backpackers’ hostels
and motels, and two meals for the price of one at selected restaurants.
Gaia Adventure Tours, which runs tours and activities in the region, offers a voucher giving two for
the price of one on selected tours and activities.
Jo is analysing the use of these vouchers by a sample of 500 non-local festival attendees from five
years ago. Some of the questions Jo hopes to answer for these attendees are:
■ Are those who have been to a subsequent music festival more likely to have also used an
accommodation discount voucher than those who have not been a repeat attendee?
■ What proportion of past festival attendees attend the music festival again?
■ What proportion of repeat festival attendees use a discount meal voucher?
■ What proportion of repeat festival attendees use the two-for-one Gaia Adventure Tours voucher?
■ Is the proportion of repeat festival attendees who use the two-for-one Gaia Adventure Tours
voucher the same as those who use a discount meal voucher?
Answers to these questions and others can help Jo develop future sales and marketing strategies
to encourage repeat visits to the region and/or music festival by festival attendees.
LEARNING
OBJECTIVES
After studying this chapter you should be able to:
1 recognise basic probability concepts
2 calculate probabilities of simple, marginal and joint events
3 calculate conditional probabilities and determine whether events are independent or not
4 revise probabilities using Bayes’ theorem
5 use counting rules to calculate the number of possible outcomes
Probability is the link between descriptive statistics and inferential statistics. This chapter introduces several types of probability and discusses how to revise probabilities in light of new
information. These topics are the foundation for the probability distribution, the concept of
mathematical expectation and the binomial, hypergeometric and Poisson distributions (topics
covered in Chapter 5).
LEARNING OBJECTIVE
1
Recognise basic
probability concepts
probability
The likelihood of an event
occurring.
impossible event
An event that cannot occur.
certain event
An event that will occur.
a priori classical probability
Objective probability, obtained from
prior knowledge of the process.
4.1 BASIC PROBABILITY CONCEPTS
What is probability? A probability is a numerical value that represents the chance, likelihood or
possibility that a particular event will occur. Examples of events are the price of a share increasing, a rainy day, a defective item or the outcome 5 when you roll a die. A probability is given
either as a proportion or fraction whose value lies between 0 and 1, inclusive. An event that has
no chance of occurring (i.e. an impossible event) has a probability of 0. An event that is sure to
occur (i.e. a certain event) has a probability of 1. There are three approaches to assigning a
probability to an event:
• a priori classical probability
• empirical classical probability
• subjective probability.
In a priori classical probability, the probability of an event is based on prior knowledge of
the process involved. In the simplest case, each outcome is equally likely and the chance of
occurrence of the event is given by Equation 4.1.
PROBABILITY OF OCCURRENCE
$$\text{Probability of occurrence} = \frac{X}{T} \tag{4.1}$$
where X = number of ways in which the event occurs
T = total number of possible outcomes
Consider a standard deck of cards with 26 red cards and 26 black cards. The probability of
selecting a black card (an event), using Equation 4.1, is 26/52 = 0.5 since there are X = 26
black cards and a total of T = 52 cards. What does this probability mean? As you cannot say for
certain what colour the next card selected will be, it does not mean that, if each card is replaced
after it is drawn, one out of the next two cards selected will be black. However, you can say
that, in the long run, if cards are continually selected and replaced, the proportion of black cards
selected will approach 0.5.
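The long-run interpretation can be checked with a short simulation. The Python sketch below is illustrative only: the function names `probability_of_occurrence` and `simulate_black_proportion` are assumptions made for this example, and the deck is reduced to its 26 black and 26 red cards.

```python
import random
from fractions import Fraction


def probability_of_occurrence(x, t):
    """Equation 4.1: X ways the event occurs out of T equally likely outcomes."""
    return Fraction(x, t)


def simulate_black_proportion(draws, seed=1):
    """Draw cards with replacement and return the proportion that are black."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    deck = ["black"] * 26 + ["red"] * 26
    black = sum(1 for _ in range(draws) if rng.choice(deck) == "black")
    return black / draws


p_black = probability_of_occurrence(26, 52)
print(p_black)  # 1/2

# The simulated proportion varies for small samples and settles near 0.5
# as the number of draws grows.
for n in (10, 1_000, 100_000):
    print(n, simulate_black_proportion(n))
```

With only a few draws the proportion can be far from 0.5, which is why the probability says nothing certain about the next two cards.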
EXAMPLE 4.1
FINDING A PRIORI PROBABILITIES
A standard die has six faces. Each face carries one, two, three, four, five or six dots. If you
roll a die, what is the probability you will get a face with five dots?
SOLUTION
Each face is equally likely to occur. Since there are six faces, the probability of getting a
face with five dots is 1/6.
The above examples use the a priori classical approach to assigning a probability because
the number of ways the event occurs and the total number of possible outcomes are known
from the composition of the deck of cards or the faces of the die. In addition to the cards and die
examples discussed, games of chance such as Lotto and Roulette are based on known probabilities and, as such, are examples of a priori classical probability.
In the empirical classical approach to assigning a probability, the outcomes are based on
observed data, not on prior knowledge of a process. Examples of this type of probability are the
proportion of repeat festival attendees in the chapter scenario, the proportion of registered voters who prefer a certain political candidate or the proportion of students who have a part-time
job. For example, if you take a survey of students and 60% state that they have a part-time job,
then there is a 0.6 probability that an individual student has a part-time job.
The third approach to assigning a probability, subjective probability, differs from the other
two approaches because a subjective probability differs from person to person. For example,
the development team for a new product may assign a probability of 0.6 to the chance of success for the product while the managing director of the company is less optimistic and assigns
a probability of 0.3. The assignment of subjective probabilities to various outcomes is usually
based on a combination of an individual’s prior knowledge, personal opinion and analysis of a
particular situation. Subjective probability is useful in making decisions in situations in which
you cannot use a priori classical probability or empirical classical probability.
empirical classical probability
Objective probability, obtained from
the relative frequency of occurrence
of an event.
subjective probability
Probability that reflects an
individual’s belief that an event
occurs.
Events and Sample Spaces
We need the following definitions to understand probabilities.
A random experiment is a precisely described scenario that leads to an outcome that
cannot be predicted with certainty.
For example, the scenario could be ‘roll a die and record how many dots on the upper face’,
or ‘toss a coin twice and record whether heads (H) or tails (T) occurs on each toss’.
An event is specified by one or more outcomes of a random experiment. The event is
said to have occurred if one of the outcomes specified has occurred.
random experiment
A precisely described scenario that
leads to an outcome that cannot be
predicted with certainty.
event
One or more outcomes of a random
experiment.
For example, when rolling a die, the event of an even number consists of three outcomes:
2, 4 and 6.
A simple event is an event specified by a single outcome of a random experiment.
simple event
A single outcome of a random
experiment.
The collection of all simple events is called the sample space.
sample space
Collection of all simple events of a
random experiment.
For example, in the experiment of rolling the die, the sample space consists of the six
simple events: 1, 2, 3, 4, 5 and 6. In the experiment of tossing a coin twice, the sample space
consists of the four simple events: HH, HT, TH and TT.
joint event
An event described by two or more
characteristics.
A joint event is an event described by two or more characteristics.
A joint event can be a simple event. For example, in the experiment of tossing a coin twice,
the simple event HH has the two characteristics H on first toss and H on second toss.
complement
All simple outcomes not in an
event.
The complement of event A (written A′) includes all simple events that are not included in
the event A.
When tossing a coin, the complement of a head is a tail, since it is the only simple event
that is not a head. When rolling a die, the complement of ‘five’ is ‘not five’ – that is, a 1, 2, 3,
4 or 6 – and, when rolling a die, the complement of the event ‘an even number’ is ‘an odd
number’ – that is, 1, 3 or 5.
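These definitions map naturally onto sets. The following Python sketch is one possible illustration, with the variable names (`sample_space`, `head_first`, `die`, `even`) chosen for this example rather than taken from the text.

```python
from itertools import product

# Sample space for tossing a coin twice: every ordered pair of H and T.
sample_space = {"".join(tosses) for tosses in product("HT", repeat=2)}

# An event is a set of outcomes; here, 'head on the first toss'.
head_first = {outcome for outcome in sample_space if outcome[0] == "H"}

# The complement holds every simple event that is not in the event.
complement = sample_space - head_first

print(sorted(sample_space))  # ['HH', 'HT', 'TH', 'TT']
print(sorted(head_first))    # ['HH', 'HT']
print(sorted(complement))    # ['TH', 'TT']

# Die example from the text: the complement of 'an even number' is 'an odd number'.
die = set(range(1, 7))
even = {2, 4, 6}
print(sorted(die - even))    # [1, 3, 5]
```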
EXAMPLE 4.2
EVENTS AND SAMPLE SPACES
Table 4.1 gives information on repeat attendance at festivals and the use of discount accommodation vouchers by the sample of 500 festival attendees.

Table 4.1 Accommodation voucher use and repeat festival attendance

                              Accommodation voucher used
Repeat festival attendance    Yes     No      Total
Yes                           210     70      280
No                            110     110     220
Total                         320     180     500

What is the sample space? Give examples of simple events and joint events.
SOLUTION
The sample space consists of discount accommodation voucher use and repeat festival
attendance of the sample of 500 festival attendees. Examples of simple events are ‘Repeat
festival attendance’ and ‘Accommodation voucher used’. The complement of the event
‘Accommodation voucher used’ is ‘Accommodation voucher not used’. The event ‘Repeat
festival attendance and accommodation voucher used’ is a joint event because festival attendees have attended a subsequent music festival and used the discount accommodation voucher.
Contingency Tables and Venn Diagrams
contingency (or cross-classification) table –
probability
Represents a sample space for joint
events classified by two
characteristics; each cell represents
the joint event satisfying given
values of both characteristics.
Venn diagram
Graphical representation of a
sample space; joint events shown
as ‘unions’ and ‘intersections’ of
circles representing simple events.
There are several ways to present a sample space. Table 4.1 uses a contingency table, also called
a cross-classification table (see Section 2.4), to represent a sample space. The values in the cells
of the table are obtained by classifying the sample of 500 festival attendees by whether they
have attended a subsequent music festival and/or used the discount accommodation voucher.
For example, 210 festival attendees have used the discount accommodation voucher and
attended a subsequent music festival.
A Venn diagram is another way to present a sample space. It graphically represents the
various events as unions and intersections of circles. Figure 4.1 presents a typical Venn diagram
for a two-variable situation, with each variable having only two events (A and A′, B and B′).
The circle on the left represents all simple events that are part of A and the circle on the right
represents all simple events that are part of B. The area contained within circle A and circle B
(centre area) is the intersection of A and B (written as A ∩ B), since it contains all outcomes that
are in event A and also in event B. The total area of the two circles is the union of A and B (written as A ∪ B) and contains all outcomes in event A and/or in event B. The area in the diagram
outside A ∪ B contains outcomes that are neither in event A nor in event B.
To construct a Venn diagram the events A and B must be defined. You can define either
event as A or B, or use different letters, as long as you are consistent in evaluating the various
events. For the repeat festival attendance scenario you can define the events as follows:
A = repeat festival attendance
A′ = no repeat festival attendance
B = accommodation voucher used
B′ = accommodation voucher not used
In drawing the Venn diagram (see Figure 4.2), you must determine the value of the intersection
of A and B in order to divide the sample space into its parts. A ∩ B consists of all 210 festival
attendees who have attended a subsequent music festival and used the discount accommodation
voucher. The remainder of event A (Repeat festival attendance) consists of the 70 repeat festival
attendees who did not use the discount accommodation voucher. The remainder of event B (Accommodation voucher used) consists of the 110 festival attendees who have used the discount accommodation voucher but not attended another music festival. The remaining 110 festival attendees
have neither attended a later music festival nor used the discount accommodation voucher.
Figure 4.1 Venn diagram for events A and B
Note: A = A ∩ B + A ∩ B′ and B = A ∩ B + A′ ∩ B
Figure 4.2 Venn diagram for repeat festival attendance scenario (A ∩ B = 210, A ∩ B′ = 70, A′ ∩ B = 110, A′ ∩ B′ = 110, A ∪ B = 390)
Note: A = A ∩ B + A ∩ B′ and B = A ∩ B + A′ ∩ B
Marginal Probability
Now some of the questions posed in the repeat festival attendance scenario can be answered.
Since the results are based on data collected (see Table 4.1), the empirical classical approach to
assigning probabilities can be used.
Marginal probability refers to the probability P(A) of an occurrence of an event A described
by a single characteristic. An example of a marginal probability in the repeat festival attendance scenario is the probability of a festival attendee attending a later music festival. Using
Equation 4.1:
$$P(\text{repeat festival attendance}) = \frac{\text{number of repeat festival attendees}}{\text{total number of attendees}} = \frac{280}{500} = 0.56$$
LEARNING OBJECTIVE
2
Calculate probabilities
of simple, marginal and
joint events
marginal probability
Probability of an event described by
a single characteristic.
Thus, there is a 0.56 (or 56%) likelihood that a festival attendee will attend a subsequent music
festival.
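As a rough illustration of reading probabilities from the margins, the Python sketch below rebuilds the row and column totals of Table 4.1 from its four cells. The dictionary layout and variable names are assumptions made for this example.

```python
from fractions import Fraction

# Counts from Table 4.1, keyed by (repeat attendance, accommodation voucher used).
table_4_1 = {
    ("yes", "yes"): 210,
    ("yes", "no"): 70,
    ("no", "yes"): 110,
    ("no", "no"): 110,
}

grand_total = sum(table_4_1.values())  # 500 festival attendees

row_totals = {}     # margin for repeat festival attendance
column_totals = {}  # margin for accommodation voucher use
for (repeat, voucher), n in table_4_1.items():
    row_totals[repeat] = row_totals.get(repeat, 0) + n
    column_totals[voucher] = column_totals.get(voucher, 0) + n

# Marginal probability of repeat attendance: row margin over grand total.
p_repeat = Fraction(row_totals["yes"], grand_total)

print(row_totals)       # {'yes': 280, 'no': 220}
print(column_totals)    # {'yes': 320, 'no': 180}
print(float(p_repeat))  # 0.56
```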
The name marginal probability derives from the fact that the total number of occurrences
of event A (in this case, repeat festival attendance) is obtained from the margin of the contingency table (see Table 4.1). Example 4.3 illustrates another application of marginal probability.
EXAMPLE 4.3
CALCULATING THE PROBABILITY THAT A REPEAT FESTIVAL ATTENDEE USES
THE GAIA ADVENTURE TOURS DISCOUNT VOUCHER
In the repeat festival attendance scenario, festival attendees were given a book of discount
vouchers, including two-for-one vouchers for meals and selected activities and tours by
Gaia Adventure Tours. Table 4.2 gives the use of these two-for-one vouchers by the 280
repeat festival attendees.
Table 4.2 Use of two-for-one vouchers by repeat festival attendees

Repeat festival attendance
                      Gaia Adventure Tours voucher used
Meal voucher used     Yes     No      Total
Yes                   126     84      210
No                    42      28      70
Total                 168     112     280

Find the probability that a repeat festival attendee uses the Gaia Adventure Tours voucher.
SOLUTION
Using Equation 4.1:
$$P(\text{Gaia Adventure Tours}) = \frac{\text{number repeat festival attendees using Gaia Adventure Tours voucher}}{\text{total number of repeat festival attendees}} = \frac{168}{280} = 0.6$$
Therefore, 60% of repeat festival attendees use the Gaia Adventure Tours two-for-one voucher.
Joint Probability
joint probability
Probability of an occurrence
described by two or more
characteristics.
Joint probability refers to the probability of an occurrence described by two or more characteristics.
An example of joint probability is the probability that you will get a head on the first toss
of a coin and a head on the second toss of a coin.
Referring to Table 4.1, the festival attendees who have attended a subsequent music
festival and used the discount accommodation voucher are represented by the 210 festival
attendees in the single cell ‘Yes – Repeat festival attendance and Yes – Accommodation
voucher used’. Because this group consists of 210 festival attendees, the probability of
picking a festival attendee who has attended a later music festival and used the discount
accommodation voucher is:
$$P(\text{repeat festival attendance and accommodation voucher used}) = \frac{\text{number repeat festival attendees and accommodation voucher used}}{\text{total number of festival attendees}} = \frac{210}{500} = 0.42$$
Example 4.4 also demonstrates how to determine joint probability.
EXAMPLE 4.4
DETERMINING THE JOINT PROBABILITY OF A REPEAT FESTIVAL
ATTENDEE USING TWO-FOR-ONE MEAL AND GAIA ADVENTURE
TOURS VOUCHERS
In Table 4.2, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a randomly
selected repeat festival attendee uses both the meal and Gaia Adventure Tours two-for-one
vouchers.
SOLUTION
Using Equation 4.1:
$$P(\text{Gaia Adventure Tours and meal voucher used}) = \frac{\text{number repeat festival attendees using Gaia Adventure Tours and meal vouchers}}{\text{total number repeat festival attendees}} = \frac{126}{280} = 0.45$$
Therefore, there is a 45% chance that a randomly selected repeat festival attendee uses both
the meal and Gaia Adventure Tours two-for-one vouchers.
The marginal probability of an event is the sum of joint probabilities. For example, if B
consists of two events, B1 and B2, then P(A), the probability of event A, consists of the joint
probability of event A occurring with event B1 plus the joint probability of event A occurring
with event B2. Equation 4.2 can be used to calculate marginal probabilities.
MARGINAL PROBABILITY
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)
(4.2)
where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.
Mutually exclusive events and collectively exhaustive events are defined as follows.
Two events are mutually exclusive if the two events cannot occur simultaneously.
mutually exclusive
Two events that cannot occur
simultaneously.
Heads and tails in a coin toss are mutually exclusive events. When tossing a coin you cannot get both a head and a tail on the same toss.
A set of events is collectively exhaustive if one of the events must occur.
collectively exhaustive
Set of events such that one of the
events must occur.
Heads and tails in a coin toss are collectively exhaustive events. One of them must occur. If
heads does not occur, tails must occur. If tails does not occur, heads must occur.
In summary, the outcomes of a coin toss are both collectively exhaustive and mutually exclusive. The outcome must be either heads or tails, P(Heads or Tails) = 1, so the outcomes are
collectively exhaustive. When heads occurs, tails cannot occur, P(Heads and Tails) = 0, so the
outcomes are also mutually exclusive.
Equation 4.2 can be used to calculate the marginal probability of a festival attendee attending a later music festival:
$$\begin{aligned}
P(\text{repeat festival attendance}) &= P(\text{repeat festival attendance and accommodation voucher used}) \\
&\quad + P(\text{repeat festival attendance and accommodation voucher not used}) \\
&= \frac{210}{500} + \frac{70}{500} = \frac{280}{500} = 0.56
\end{aligned}$$
Alternatively, Equation 4.1 can be used to calculate P(repeat festival attendance).
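Equation 4.2 applies equally to the two-for-one voucher data. The Python sketch below uses the counts in Table 4.2 and is offered as an illustration only: the helper `marginal` and the tuple-keyed dictionary are assumptions made for this example.

```python
from fractions import Fraction

# Counts from Table 4.2, keyed by (meal voucher used, Gaia voucher used).
table_4_2 = {
    ("yes", "yes"): 126,
    ("yes", "no"): 84,
    ("no", "yes"): 42,
    ("no", "no"): 28,
}
total = sum(table_4_2.values())  # 280 repeat festival attendees

# Joint probability of each cell.
joint = {cell: Fraction(n, total) for cell, n in table_4_2.items()}


def marginal(joint, position, level):
    """Equation 4.2: sum the joint probabilities of every cell in which the
    characteristic at `position` takes the value `level`."""
    return sum(p for cell, p in joint.items() if cell[position] == level)


p_gaia = marginal(joint, 1, "yes")  # P(G and M) + P(G and M')
p_meal = marginal(joint, 0, "yes")  # P(M and G) + P(M and G')

print(float(p_gaia))  # 0.6
print(float(p_meal))  # 0.75

# The four cells are mutually exclusive and collectively exhaustive,
# so their joint probabilities add to 1.
print(sum(joint.values()) == 1)  # True
```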
General Addition Rule
The probability of event ‘A or B’ can be calculated by the general addition rule. This rule considers the occurrence of either event A or event B or both A and B. The event ‘Repeat festival
attendance or accommodation voucher used’ includes all festival attendees who have attended
a subsequent music festival and all festival attendees who have used the discount accommodation voucher. Table 4.1 can be used to calculate the probability that a festival attendee either
attended a later music festival or used the accommodation discount voucher by examining
each cell of the contingency table (Table 4.1) to determine whether it is part of this event.
From Table 4.1, the cell ‘Repeat festival attendance and accommodation voucher not used’ is
part of the event, because it includes repeat festival attendees. The cell ‘No repeat festival
attendance and accommodation voucher used’ is included because it contains festival attendees using the discount accommodation voucher. Finally, the cell ‘Repeat festival attendance
and accommodation voucher used’ has both characteristics of interest. Therefore, the probability of a festival attendee either attending a later music festival or using the accommodation
discount voucher is:
$$\begin{aligned}
P(\text{repeat festival attendance or accommodation voucher used}) &= P(\text{repeat festival attendance and accommodation voucher used}) \\
&\quad + P(\text{no repeat festival attendance and accommodation voucher used}) \\
&\quad + P(\text{repeat festival attendance and accommodation voucher not used}) \\
&= \frac{210}{500} + \frac{110}{500} + \frac{70}{500} = \frac{390}{500} = 0.78
\end{aligned}$$
general addition rule
Used to calculate the probability of
the joint event A or B.
Instead of using a contingency table, the general addition rule defined in Equation 4.3 can be
used to calculate the probability of the event A or B, P(A or B).
GENERAL ADDITION RULE
The probability of A or B is equal to the probability of A plus the probability of B minus
the probability of A and B.
P(A or B) = P(A) + P(B) − P(A and B)
(4.3)
Applying this equation to the previous example produces the following:
$$\begin{aligned}
P(\text{repeat festival attendance or accommodation voucher used}) &= P(\text{repeat festival attendance}) + P(\text{accommodation voucher used}) \\
&\quad - P(\text{repeat festival attendance and accommodation voucher used}) \\
&= \frac{280}{500} + \frac{320}{500} - \frac{210}{500} = \frac{390}{500} = 0.78
\end{aligned}$$
The general addition rule adds the probability of A and the probability of B, and then subtracts the joint event of A and B from this total because the joint event has been included in both
the probability of A and the probability of B. Referring to Table 4.1, if the outcomes of the event
‘Repeat festival attendance’ are added to those of the event ‘Accommodation voucher used’, the
joint event ‘Repeat festival attendance and accommodation voucher used’ has been included in
each of these simple events. Therefore, because this joint event has been counted twice, it needs
to be subtracted once. Example 4.5 illustrates another application of the general addition rule.
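The double-counting argument can be verified numerically. The Python sketch below compares Equation 4.3 with a direct cell-by-cell count from Table 4.1. The variable names are assumptions made for this example.

```python
from fractions import Fraction

# Counts from Table 4.1, keyed by (repeat attendance, accommodation voucher used).
counts = {
    ("yes", "yes"): 210,
    ("yes", "no"): 70,
    ("no", "yes"): 110,
    ("no", "no"): 110,
}
total = sum(counts.values())  # 500

p_a = Fraction(210 + 70, total)    # P(A): repeat festival attendance
p_b = Fraction(210 + 110, total)   # P(B): accommodation voucher used
p_a_and_b = Fraction(210, total)   # P(A and B)

# Equation 4.3: subtract the joint probability that was counted twice.
p_a_or_b = p_a + p_b - p_a_and_b

# Direct count of every cell that belongs to 'A or B'.
cells_in_event = sum(
    n for (repeat, voucher), n in counts.items()
    if repeat == "yes" or voucher == "yes"
)
direct = Fraction(cells_in_event, total)

print(float(p_a_or_b))     # 0.78
print(p_a_or_b == direct)  # True
```

Adding P(A) and P(B) without the subtraction would give 600/500, which exceeds 1 and so cannot be a probability.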
EXAMPLE 4.5
APPLYING THE GENERAL ADDITION RULE FOR REPEAT FESTIVAL ATTENDEES
USING TWO-FOR-ONE MEAL OR GAIA ADVENTURE TOURS VOUCHERS
In Example 4.3, festival attendees were given a book of discount vouchers, including
two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a
randomly selected repeat festival attendee uses a two-for-one meal or Gaia Adventure
Tours voucher.
SOLUTION
Using Equation 4.3:
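$$\begin{aligned}
P(\text{Gaia Adventure Tours or meal voucher used}) &= P(\text{Gaia Adventure Tours voucher used}) + P(\text{meal voucher used}) \\
&\quad - P(\text{Gaia Adventure Tours and meal voucher used}) \\
&= \frac{168}{280} + \frac{210}{280} - \frac{126}{280} = \frac{252}{280} = 0.9
\end{aligned}$$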
Therefore, there is a 90% chance that a repeat festival attendee uses a two-for-one
meal or Gaia Adventure Tours voucher.
Problems for Section 4.1
LEARNING THE BASICS
4.1 Two coins are tossed.
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of a head on the first toss?
4.2 An urn contains 12 red balls and 8 white balls. One ball is to be
selected from the urn.
a. Give an example of a simple event.
b. What is the complement of a red ball?
4.3 Given the following contingency table:

        B     B′
A       10    20
A′      20    40

what is the probability of
a. event A?
b. event A′?
c. event A and B?
d. event A or B?
4.4 Given the following contingency table:

        B     B′
A       10    30
A′      25    35

what is the probability of
a. event A′?
b. event A and B?
c. event A′ and B′?
d. event A′ or B′?
APPLYING THE CONCEPTS
4.5 For each of the following, indicate whether the type of probability
involved is an example of a priori classical probability, empirical
classical probability or subjective probability.
a. The next toss of a fair coin will be heads.
b. Italy will win soccer’s World Cup the next time the
competition is held.
c. The sum of the faces of two dice will be 7.
d. The train taking a commuter to work will be more than
10 minutes late.
4.6 For each of the following, state whether the events are mutually
exclusive and/or collectively exhaustive. If they are not mutually
exclusive and/or collectively exhaustive, either reword the
categories to make them mutually exclusive and collectively
exhaustive or explain why this would not be useful.
a. An exit poll in an Australian federal election asked
voters if they had voted for the Labor or the Coalition
candidate.
b. Respondents were classified by type of car they drive:
Australian, American, European, Japanese or none.
c. People were asked, ‘Do you currently live in (i) an apartment
or (ii) a house?’
d. A product was classified as defective or not defective.
4.7 The probability of each of the following events is zero. For each,
state why.
a. A day is Christmas and Easter.
b. A product is defective and not defective.
c. A car is a Ford and a Toyota.
4.8 A researcher has completed a survey of 10,000 viewers in
a regional city to determine which TV network they watch
most weekdays during the 6 pm to 7 pm time slot. The
results are:

Network          Number
ABC              1,290
Seven            2,850
Nine             2,060
Ten              1,695
SBS              430
Other or none    1,675

A surveyed viewer is chosen at random. Find the probability
that during the 6 pm to 7 pm time slot the viewer:
a. watches ABC
b. watches ABC or SBS
c. watches neither ABC nor SBS
d. watches one of Channels 7, 9 or 10
e. does not watch one of Channels 7, 9 or 10
4.9 A sample of 500 consumers is selected in a large metropolitan area
to study consumer behaviour. Among the questions asked was ‘Do
you enjoy shopping for clothing (Yes or No)?’ Of 240 males, 136
answered yes. Of 260 females, 224 answered yes. Construct a
contingency table or a Venn diagram to evaluate the probabilities.
What is the probability that a surveyed consumer chosen at
random:
a. enjoys shopping for clothing?
b. is a female and enjoys shopping for clothing?
c. is a female or enjoys shopping for clothing?
d. is a male or a female?
LEARNING OBJECTIVE
3
Calculate conditional
probabilities and
determine whether events
are independent or not
conditional probability
Probability of an event, given
information on the occurrence of a
second event.
4.2 CONDITIONAL PROBABILITY
Calculating Conditional Probabilities
We can often make use of extra information about the events under consideration when calculating probabilities. In this section, we consider the case where the probability of an event
occurring depends on the occurrence of some other event. Suppose, for instance, that we are
interested in determining the probability that a person selected at random earns more than
$100,000 a year. If we know that the person has a degree, it might be reasonable to expect this
to affect the probability.
Conditional probability refers to the probability of event A, given information about the
occurrence of another event, B.
CONDITIONAL PROBABILITY
The probability of A given B, written P(A | B), is equal to the probability of A and B
divided by the probability of B.
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \tag{4.4a}$$
The probability of B given A is equal to the probability of A and B divided by the
probability of A.
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \tag{4.4b}$$
where P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B
Referring to the repeat festival attendance scenario, suppose we know that a festival attendee
has used the discount accommodation voucher. What is the probability that they have also
attended a later music festival – that is, P(repeat festival attendance | accommodation voucher
used)? As we know that the festival attendee has used the discount accommodation voucher, the
sample space does not consist of all 500 festival attendees in the sample. It consists only of the
festival attendees who have used the discount accommodation voucher. Of the 320 festival
attendees who have used the discount accommodation voucher, 210 are repeat festival attendees.
Therefore (see Table 4.1 or Figure 4.2), the probability that a festival attendee attends a subsequent music festival given that they have used the discount accommodation voucher is:
$$P(\text{repeat festival attendance} \mid \text{accommodation voucher used}) = \frac{\text{number repeat festival attendees and accommodation vouchers used}}{\text{number of accommodation vouchers used}} = \frac{210}{320} = 0.65625$$
Equation 4.4a can be used to calculate the above result:
If we define the events:
A = repeat festival attendance
B = accommodation voucher used
then:
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{210/500}{320/500} = \frac{210}{320} = 0.65625$$
Therefore, if a festival attendee has used the discount accommodation voucher there is a 65.625%
probability that they have also attended a subsequent music festival.
Compare this conditional probability with the marginal probability of a festival attendee
attending a later music festival, which is 280/500 = 0.56, or 56%. These results indicate that
festival attendees who use the discount accommodation voucher are more likely to also attend
a subsequent music festival.
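The comparison between the conditional and marginal probabilities can be reproduced in a few lines. The Python sketch below is illustrative: the function name `conditional` is an assumption made for this example, and the guard against P(B) = 0 is added here because Equation 4.4a is undefined in that case.

```python
from fractions import Fraction


def conditional(p_a_and_b, p_b):
    """Equation 4.4a: P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b


# Probabilities taken from Table 4.1.
p_repeat_and_voucher = Fraction(210, 500)  # P(A and B)
p_voucher = Fraction(320, 500)             # P(B)
p_repeat = Fraction(280, 500)              # P(A)

p_repeat_given_voucher = conditional(p_repeat_and_voucher, p_voucher)

print(float(p_repeat_given_voucher))      # 0.65625
print(float(p_repeat))                    # 0.56
print(p_repeat_given_voucher > p_repeat)  # True
```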
Example 4.6 further illustrates conditional probability.
EXAMPLE 4.6
FINDING A CONDITIONAL PROBABILITY CONCERNING REPEAT FESTIVAL
ATTENDEES’ USE OF TWO-FOR-ONE VOUCHERS
Table 4.2 is a contingency table for whether repeat festival attendees use two-for-one meal
and/or Gaia Adventure Tours vouchers. Find the probability that a randomly selected repeat
festival attendee who used the two-for-one meal voucher also used the Gaia Adventure
Tours voucher.
SOLUTION
We know that the repeat festival attendee has used the two-for-one meal voucher, so the
sample space is reduced to the 210 attendees who have used their meal voucher. Of these
210 attendees, 126 have used their Gaia Adventure Tours voucher. Therefore, the probability
that the Gaia Adventure Tours voucher is used, given that the meal voucher was used, is:
$$P(\text{Gaia Adventure Tours voucher used} \mid \text{meal voucher used}) = \frac{\text{number repeat attendees use meal and Gaia Adventure Tours vouchers}}{\text{number repeat attendees use meal voucher}} = \frac{126}{210} = 0.6$$
If we define events:
M = meal voucher used
G = Gaia Adventure Tours voucher used
then Equation 4.4a may be used:
$$P(G \mid M) = \frac{P(G \text{ and } M)}{P(M)} = \frac{126/280}{210/280} = \frac{126}{210} = 0.6$$
Therefore, given that a repeat festival attendee has used the two-for-one meal voucher, there
is a 60% chance that the Gaia Adventure Tours two-for-one voucher is also used.
Decision Trees
decision tree
Graphical representation of simple
and joint probabilities as vertices of
a tree. Also known as a tree
diagram.
In Table 4.1, a sample of 500 festival attendees were classified according to whether they have
attended a later music festival or used the discount accommodation voucher. A decision tree (or
tree diagram) is an alternative to a contingency table or a Venn diagram. Figure 4.3 represents
the decision tree for this example.
In Figure 4.3, beginning at the left with the sample of 500 festival attendees, there are two
‘branches’ corresponding to whether or not a subsequent music festival was attended. Each
branch has two sub-branches, corresponding to whether the festival attendee used the discount
accommodation voucher. The probabilities at the end of the initial branches represent the marginal probabilities of A (Repeat festival attendance) and A′. The probabilities at the end of each
of the four sub-branches represent the joint probability for each combination of events A and B
(Accommodation voucher used). The conditional probability is calculated by dividing the joint
probability by the appropriate marginal probability.
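One way to mirror the tree in code is a nested dictionary, with the initial branches as outer keys and the sub-branches as inner keys. The Python sketch below is an illustration under that assumption. The function name `branch_probabilities` and the labels `A`, `A'`, `B`, `B'` are chosen for this example.

```python
from fractions import Fraction

# Figure 4.3 as nested dictionaries of counts: outer keys are the initial
# branches (A, A'), inner keys are the sub-branches (B, B').
tree = {
    "A": {"B": 210, "B'": 70},
    "A'": {"B": 110, "B'": 110},
}


def branch_probabilities(tree):
    """Return {(first, second): (joint, conditional)} for every sub-branch."""
    total = sum(sum(subs.values()) for subs in tree.values())
    result = {}
    for first, subs in tree.items():
        marginal = Fraction(sum(subs.values()), total)  # end of initial branch
        for second, count in subs.items():
            joint = Fraction(count, total)              # end of sub-branch
            result[(first, second)] = (joint, joint / marginal)
    return result


probabilities = branch_probabilities(tree)
for (first, second), (joint, cond) in probabilities.items():
    print(f"P({first} and {second}) = {float(joint)}, "
          f"P({second} | {first}) = {float(cond)}")
```

Within each initial branch the conditional probabilities of the sub-branches add to 1, which gives a quick check on the arithmetic.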
For example, to calculate the conditional probability that a festival attendee uses the accommodation discount voucher given that they have attended a later music festival, divide P(repeat
festival attendance and accommodation voucher used) by P(repeat festival attendance). From
Figure 4.3:
P(accommodation voucher used | repeat festival attendance) = (210/500) / (280/500) = 210/280 = 0.75
Example 4.7 illustrates how to construct a decision tree.
Figure 4.3 Decision tree for repeat festival attendance scenario
[Tree diagram starting from the sample of 500 festival attendees]
Repeat festival attendance: P(A) = 280/500
    Accommodation voucher used: P(A and B) = 210/500
    Accommodation voucher not used: P(A and B′) = 70/500
No repeat festival attendance: P(A′) = 220/500
    Accommodation voucher used: P(A′ and B) = 110/500
    Accommodation voucher not used: P(A′ and B′) = 110/500
EXAMPLE 4.7
FORMING A DECISION TREE FOR REPEAT FESTIVAL ATTENDEES – TWO-FOR-ONE VOUCHER USE
Using the cross-classified data in Table 4.2, construct a decision tree and use it to find the
probability a randomly selected repeat festival attendee who used the two-for-one meal
voucher also used the Gaia Adventure Tours two-for-one voucher.
SOLUTION
The decision tree for ‘two-for-one voucher use’ is displayed in Figure 4.4. Using Equation
4.4a and the following definitions:
M = meal voucher used
G = Gaia Adventure Tours voucher used
P(G | M) = P(G and M) / P(M) = (126/280) / (210/280) = 126/210 = 0.6
Figure 4.4 Decision tree for ‘two-for-one voucher use’
[Tree diagram starting from the set of repeat festival attendees]
Meal voucher used: P(M) = 210/280
    Gaia Adventure Tours voucher used: P(M and G) = 126/280
    Gaia Adventure Tours voucher not used: P(M and G′) = 84/280
Meal voucher not used: P(M′) = 70/280
    Gaia Adventure Tours voucher used: P(M′ and G) = 42/280
    Gaia Adventure Tours voucher not used: P(M′ and G′) = 28/280
Statistical Independence
In the repeat festival attendance scenario, the conditional probability that a selected festival
attendee attended a later music festival, given that they used the discount accommodation
voucher, is 210/320 = 0.65625. The probability that a randomly selected festival attendee
attended a later music festival is 280/500 = 0.56. Because these two probabilities differ, the
prior knowledge that a festival attendee used the discount accommodation voucher affects the
probability that they attended another music festival. In other words, the outcome of one event
is dependent on the outcome of a second event.
When the outcome of one event does not affect the probability of occurrence of another
event, the events are said to be statistically independent. Statistical independence can be determined by using Equation 4.5.
statistical independence
The occurrence of an event does
not affect the occurrence of a
second event.
STATISTICAL INDEPENDENCE
Two events, A and B, are statistically independent if and only if
P(A | B) = P(A) (also P(B | A) = P(B))
(4.5)
Example 4.8 demonstrates the use of Equation 4.5.
EXAMPLE 4.8
DETERMINING STATISTICAL INDEPENDENCE
Using the cross-classified data in Table 4.2, determine whether, for repeat festival attendees,
use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are statistically independent events.
SOLUTION
From Examples 4.6 and 4.7:
P(Gaia Adventure Tours voucher used | meal voucher used) = 0.6
which from Example 4.3 is equal to:
P(Gaia Adventure Tours voucher used) = 0.6
Thus, use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are
statistically independent events. Occurrence of one event does not affect the probability of
the other event.
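The check in Example 4.8 can be reproduced with a few lines of Python. This is only a sketch: the four counts are the ones shown in Figure 4.4, and the variable names are illustrative rather than anything defined in the text.

```python
from fractions import Fraction

# Counts of repeat festival attendees taken from Figure 4.4 (n = 280).
# The dictionary keys are illustrative labels, not notation from the text.
counts = {
    ("M", "G"): 126,    # meal voucher used, Gaia voucher used
    ("M", "G'"): 84,    # meal voucher used, Gaia voucher not used
    ("M'", "G"): 42,    # meal voucher not used, Gaia voucher used
    ("M'", "G'"): 28,   # neither voucher used
}
total = sum(counts.values())

# Marginal and joint probabilities as exact fractions.
p_m = Fraction(counts[("M", "G")] + counts[("M", "G'")], total)
p_g = Fraction(counts[("M", "G")] + counts[("M'", "G")], total)
p_g_and_m = Fraction(counts[("M", "G")], total)

# Equation 4.4a: conditional probability of G given M.
p_g_given_m = p_g_and_m / p_m

# Equation 4.5: independent if and only if P(G | M) = P(G).
independent = p_g_given_m == p_g
print(float(p_g_given_m), float(p_g), independent)
```

Exact fractions are used so that the equality test in Equation 4.5 is not disturbed by floating-point rounding.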
Multiplication Rules
general multiplication rule
Used to calculate the probability of
the joint event A and B.
By manipulating the formula for conditional probability, you can determine the joint probability P(A and B) from the conditional probability of an event. The general multiplication rule is
derived using Equations 4.4a and 4.4b and solving for the joint probability P(A and B).
GENERAL MULTIPLICATION RULE
The probability of A and B is equal to the probability of A given B times the probability
of B or the probability of B given A times the probability of A.
P(A and B) = P(A | B)P(B) = P(B | A)P(A)
(4.6)
Example 4.9 demonstrates the use of the general multiplication rule.
EXAMPLE 4.9
USING THE MULTIPLICATION RULE
Of the 500 festival attendees in the repeat festival attendance scenario (Table 4.1), 280 have
attended a subsequent music festival. Suppose two festival attendees are randomly selected.
Find the probability that both festival attendees have since attended a later music festival.
SOLUTION
We can use the multiplication rule. Define events:
F1 = repeat festival attendance first attendee
F2 = repeat festival attendance second attendee
then, using Equation 4.6:
P(F1 and F2) = P(F2 | F1)P(F1)
The probability that the first attendee has subsequently attended another music festival is
280/500. However, the probability that the second attendee has attended a later music festival
depends on the result of the first selection. If the first attendee is not returned to the sample
after any repeat festival attendance is determined (sampling without replacement), then the
number of attendees remaining will be 499. If the first festival attendee attends a later music
festival, the probability that the second also attends a later music festival is 279/499,
because 279 attendees who have subsequently attended a later music festival remain in the
sample. Therefore:
P(F1 and F2) = P(F2 | F1)P(F1) = (279/499) × (280/500) = 0.3131...
The probability that both festival attendees have since attended a later music festival is
approximately 0.313.
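As a rough illustration of Example 4.9, the sketch below applies Equation 4.6 to two draws, with and without replacement. The function name prob_both and its parameters are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def prob_both(successes: int, population: int, replace: bool = False) -> Fraction:
    """P(F1 and F2) = P(F2 | F1) * P(F1) for two random selections.

    Hypothetical helper for illustration; not a function from the text.
    """
    p_first = Fraction(successes, population)
    if replace:
        # With replacement the second draw sees the same population.
        p_second_given_first = p_first
    else:
        # Without replacement one 'success' and one member are gone.
        p_second_given_first = Fraction(successes - 1, population - 1)
    return p_second_given_first * p_first

without_replacement = prob_both(280, 500)
with_replacement = prob_both(280, 500, replace=True)
print(round(float(without_replacement), 4))  # 0.3131
print(round(float(with_replacement), 4))     # 0.3136
```

With a sample as large as 500 the two answers differ only in the fourth decimal place, which is why sampling with replacement is sometimes used as an approximation.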
If A and B are independent events, then P(A | B) = P(A), so we can substitute P(A) for P(A | B)
(or P(B) for P(B | A)) in Equation 4.6 to obtain the multiplication rule for independent events.
MULTIPLICATION RULE FOR INDEPENDENT EVENTS
If A and B are statistically independent, the probability of A and B is equal to the probability of A times the probability of B.
P(A and B) = P(A)P(B)
(4.7)
multiplication rule for
independent events
Used to calculate the probability of
the joint event A and B when A and
B are independent.
If this rule holds for two events, A and B, then A and B are statistically independent. Thus,
there are two ways to determine statistical independence:
1. Events A and B are statistically independent if and only if P(A | B) = P(A) (or
P(B | A) = P(B)).
2. Events A and B are statistically independent if and only if P(A and B) = P(A)P(B).
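The second test can be sketched in Python using the repeat festival attendance counts shown in Figure 4.3. The variable names are illustrative assumptions; only the four counts come from the text.

```python
from fractions import Fraction

# Counts from Figure 4.3 (sample of 500 festival attendees).
a_and_b = 210          # repeat attendance, accommodation voucher used
a_and_not_b = 70       # repeat attendance, voucher not used
not_a_and_b = 110      # no repeat attendance, voucher used
not_a_and_not_b = 110  # no repeat attendance, voucher not used
n = a_and_b + a_and_not_b + not_a_and_b + not_a_and_not_b

p_a = Fraction(a_and_b + a_and_not_b, n)   # P(A) = 280/500
p_b = Fraction(a_and_b + not_a_and_b, n)   # P(B) = 320/500
p_a_and_b = Fraction(a_and_b, n)           # P(A and B) = 210/500

# Equation 4.7 holds only when A and B are independent.
product = p_a * p_b
independent = p_a_and_b == product
print(float(p_a_and_b), float(product), independent)  # 0.42 0.3584 False
```

Because 0.42 differs from 0.3584, the product rule fails, which agrees with the earlier finding that repeat attendance and accommodation voucher use are dependent.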
Marginal Probability Using the General Multiplication Rule
In Section 4.1 marginal probability was defined using Equation 4.2, which can be rewritten
using the general multiplication rule. If:
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)
then, using the general multiplication rule, Equation 4.8 defines the marginal probability.
MARGINAL PROBABILITY USING THE GENERAL MULTIPLICATION RULE
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)
(4.8)
where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.
To illustrate this equation, refer to Table 4.1. Using Equation 4.8, the probability of a festival attendee attending a subsequent music festival is:
P(A) = P(A | B)P(B) + P(A | B′)P(B′)
where P(A) = probability of ‘repeat festival attendance’
P(B) = probability of ‘accommodation voucher used’
P(B′) = probability of ‘accommodation voucher not used’
P(A) = (210/320) × (320/500) + (70/180) × (180/500) = 210/500 + 70/500 = 280/500 = 0.56
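Equation 4.8 can be sketched as a weighted sum in Python. The helper name marginal_probability is hypothetical; the inputs are the conditional probabilities and event probabilities used in the calculation above.

```python
from fractions import Fraction

def marginal_probability(conditionals, event_probs):
    """Equation 4.8: P(A) = sum over i of P(A | Bi) * P(Bi).

    Hypothetical helper; the Bi must be mutually exclusive and
    collectively exhaustive, so their probabilities must sum to 1.
    """
    if sum(event_probs) != 1:
        raise ValueError("event probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditionals, event_probs))

# P(A | B) = 210/320 and P(A | B') = 70/180, with P(B) = 320/500, P(B') = 180/500.
p_a = marginal_probability(
    [Fraction(210, 320), Fraction(70, 180)],
    [Fraction(320, 500), Fraction(180, 500)],
)
print(p_a, float(p_a))  # 14/25 0.56
```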
Problems for Section 4.2
LEARNING THE BASICS
4.10 Given the following contingency table:
        B     B′
A      10     20
A′     20     40
a. what is the probability of:
i. A | B?
ii. A | B′?
iii. A′| B′?
b. Are events A and B statistically independent?
4.11 Given the following contingency table:
        B     B′
A      10     30
A′     25     35
a. what is the probability of:
i. A | B?
ii. A′| B′?
iii. A | B′?
b. Are events A and B statistically independent?
4.12 If P(A and B) = 0.4 and P(B) = 0.8, find P(A | B).
4.13 If P(A) = 0.7, P(B) = 0.6, and A and B are statistically
independent, find P(A and B).
4.14 If P(A) = 0.3, P(B) = 0.4, and P(A and B) = 0.2, are A and B
statistically independent?
APPLYING THE CONCEPTS
4.15 The following table gives the labour force status of the Australian
civilian population aged 15 years and over in May 2017:
6202.0–Labour Force, Australia, May 2017
Labour force status (aged 15 years and over)       Male       Female     Total ('000)
Employed full-time                                 5,296.0    3,001.2     8,297.2
Employed part-time                                 1,230.5    2,678.2     3,908.7
Unemployed and looking for full-time work            277.6      205.4       483.0
Unemployed and not looking for full-time work         90.3      130.1       220.4
Not in labour force                                2,859.4    4,060.5     6,919.9
Total civilian population aged 15 years and over   9,753.8   10,075.4    19,829.2
Data obtained from Australian Bureau of Statistics, Labour Force, Australia, May
2017, Cat. No. 6202.0 <www.abs.gov.au/ausstats/abs@.nsf/mf/6202.0> accessed
28 June 2017
a. What is the probability that a randomly selected person is
female?
b. What is the probability that a randomly selected male is not
employed?
c. Suppose you know that a person is employed full-time. What
is the probability that they are female?
d. Are the two events ‘employed full-time’ and ‘female’
statistically independent? Explain.
e. What is the probability that a randomly selected person is a
male in full-time employment?
f. The unemployment rate is defined as the percentage of the
labour force that is unemployed and either looking for full-time
work or not looking for full-time work. What is the
unemployment rate for males, females and overall?
g. The participation rate is defined as the percentage of the
civilian population in the labour force, either employed or
unemployed. What is the participation rate for males,
females and overall?
4.16 Households in a certain town were surveyed to determine
whether they would subscribe to a new Pay TV channel. The
households were classified according to ‘high’, ‘medium’ and
‘low’ income levels. The results of the survey are summarised
in the table below.
                      Income level
                      High     Medium    Low
Will subscribe        3,200    1,920       480
Will not subscribe      800    7,080     2,520
a. What is the probability that:
i. a household will subscribe?
ii. a household is high income?
iii. a household will subscribe and is high income?
iv. a high-income household will subscribe?
v. a household that subscribes is high income?
b. Is income level statistically independent of whether a
household subscribes or not? Explain.
4.17 At a certain university, 25% of students are in the business
faculty. Of the students in the business faculty, 66% are
males. However, only 52% of all students at the university
are male.
a. What is the probability that a student selected at random in
the university is a male in the business faculty?
b. What is the probability that a student selected at random in
the university is male or is in the business faculty?
c. What percentage of males are in the business faculty?
4.18 A sample of 500 consumers was selected in a large
metropolitan area to study consumer behaviour with the
following results:
Enjoys shopping           Gender
for clothing        Male    Female    Total
Yes                  136       224      360
No                   104        36      140
Total                240       260      500
a. What is the probability that a randomly chosen female
consumer does not enjoy shopping for clothing?
b. Suppose the chosen consumer enjoys shopping for clothing.
What is the probability that the individual is male?
c. Are enjoying shopping for clothing and the gender of the
individual statistically independent? Explain.
4.19 A study was done to determine the efficacy of three different
headache tablets – A, B and C. One thousand study participants
used all three tablets (at different times) over the period of the
study with the following results:
750 reported relief from tablet A
675 reported relief from tablet B
631 reported relief from tablet C
504 reported relief from both tablets A and B
453 reported relief from both tablets A and C
350 reported relief from both tablets B and C
236 reported relief from all three tablets
a. If a study participant is selected at random, what is the
probability that they
i. reported relief from tablet A?
ii. reported relief from tablet B?
iii. reported relief from tablet A and tablet B?
iv. reported relief from tablet A or tablet B?
v. did not report relief from tablet C?
b. What is the probability that, if a participant reported relief
from tablet A, they also reported relief from tablet B?
c. What is the probability that, if a participant reported relief
from tablet B, they also reported relief from tablet A?
d. Are the events ‘report relief from tablet A’ and ‘report relief
from tablet B’ statistically independent? Explain.
4.20 In 59 of the 88 years from 1929 to 2016, the S&P 500 (Standard
and Poor’s 500 Index, one of the indices of the New York Stock
Exchange that is widely used as a benchmark for the
performance of US equity mutual funds) finished higher after
the first five days of trading. In 41 of those 59 years the S&P
500 finished higher for the year.
Is a good first week a good omen for the upcoming year?
The following table gives the first-week and annual
performance over this 88-year period:
                 S&P 500’s annual performance
First week       Higher     Not higher
Higher               41             18
Not higher           14             15
a. If a year is selected at random, what is the probability that
the S&P finished higher for the year?
b. Given that the S&P 500 finished higher after the first five
days of trading, what is the probability that it finished higher
for the year?
c. Are the two events, first-week performance and annual
performance, statistically independent? Explain.
d. In 2017 the S&P 500 was up 0.8% after the first five days. Look
up the 2017 annual performance of the S&P 500 at <https://
finance.yahoo.com> or elsewhere. Comment on the results.
e. Repeat part (d) for last year.
4.21 A standard deck of cards is being used to play a game. There
are four suits (hearts, diamonds, clubs and spades), each having
13 faces (ace, 2 to 10, jack, queen and king), making a total of
52 cards. This complete deck is thoroughly shuffled, and you
will receive the first two cards from the deck without
replacement.
a. What is the probability that both cards are queens?
b. What is the probability that the first card is a 10 and the
second card is a 5 or 6?
c. If you were sampling with replacement, what would be the
answer in (a)?
d. In the game of blackjack, the picture cards (jack, queen,
king) count as 10 points and the ace counts as either 1 or
11 points. All other cards are counted at their face value.
Blackjack is achieved if your two cards total 21 points. What
is the probability of getting blackjack in this problem?
4.3 BAYES’ THEOREM
LEARNING OBJECTIVE 4
Revise probabilities using Bayes’ theorem
Bayes’ theorem is used to revise previously calculated probabilities (called prior probabilities)
when there is new information. Developed by the Rev. Thomas Bayes in the eighteenth century,
Bayes’ theorem is an extension of conditional probability.
The conditional probability of B given A is given by Equation 4.4b combined with Equation 4.6:
P(B | A) = P(A and B) / P(A) = P(A | B)P(B) / P(A)
Bayes’ theorem is derived from this by substituting Equation 4.8 for P(A) in the above
equation.
Bayes’ theorem
Revises previously calculated
probabilities when new information
becomes available.
BAYES’ THEOREM
P(Bi | A) = P(A | Bi)P(Bi) / [P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)]
(4.9)
where Bi is the ith event out of k mutually exclusive and collectively exhaustive events.
The following situation illustrates when Bayes’ theorem can be used. Suppose the Consumer
Electronics Company is considering marketing a new model of television. In the past, 40% of the
televisions introduced by the company have been successful and 60% have been unsuccessful.
Before introducing a new model of television to the marketplace, the marketing research department always conducts an extensive study and releases a report, either favourable or unfavourable.
In the past, 80% of the successful televisions had received a favourable market research report and
30% of the unsuccessful televisions had received a favourable report. For the new model of television under consideration, the marketing research department has issued a favourable report. What
is the probability that the television will be successful, given this favourable report?
To use Equation 4.9 to calculate the required probability P(S | F), first define events:
S = successful television F = favourable report
S′ = unsuccessful television F′ = unfavourable report
then:
P(S) = 0.40
P(F | S) = 0.80
P(S') = 0.60
P(F | S') = 0.30
Therefore, using Equation 4.9:
P(S | F) = P(F | S)P(S) / [P(F | S)P(S) + P(F | S′)P(S′)]
= (0.80)(0.40) / [(0.80)(0.40) + (0.30)(0.60)]
= 0.32 / (0.32 + 0.18)
= 0.32 / 0.50
= 0.64
The probability of a successful television, given that a favourable report was received, is 0.64.
Thus, the probability of an unsuccessful television, given that a favourable report was received,
is 1 − 0.64 = 0.36. Table 4.3 summarises the calculation of the probabilities and Figure 4.5
presents the decision tree.
The denominator in Bayes’ theorem represents P(F), the probability of a favourable report.
This shows the connection between Equations 4.4a and 4.4b with Equation 4.9, reflecting that
Bayes’ theorem is a special case of conditional probability.
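The television calculation can be sketched as a small Python function that follows Equation 4.9 term by term. The name revise_probabilities is hypothetical; the inputs are the prior and conditional probabilities given in the example.

```python
from fractions import Fraction

def revise_probabilities(priors, conditionals):
    """Apply Equation 4.9 to every event Bi at once.

    priors[i] is P(Bi); conditionals[i] is P(A | Bi).
    Returns (revised probabilities P(Bi | A), P(A)).
    Hypothetical helper written for this illustration.
    """
    joints = [c * p for c, p in zip(conditionals, priors)]  # P(A and Bi)
    p_a = sum(joints)                                       # denominator of 4.9
    return [j / p_a for j in joints], p_a

# Events: S (successful) and S' (unsuccessful); A is 'favourable report' (F).
priors = [Fraction(40, 100), Fraction(60, 100)]
conditionals = [Fraction(80, 100), Fraction(30, 100)]
revised, p_f = revise_probabilities(priors, conditionals)

print(float(p_f))                   # 0.5
print([float(r) for r in revised])  # [0.64, 0.36]
```

Returning the denominator alongside the revised probabilities makes it easy to see that it is P(F), the overall probability of a favourable report.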
Table 4.3 Bayes’ theorem calculations for the television-marketing example
                                   Prior         Conditional    Joint probability
                                   probability   probability    P(F and Si) =       Revised probability
Event Si                           P(Si)         P(F | Si)      P(F | Si)P(Si)      P(Si | F)
S = successful television set      0.40          0.80           0.32                0.32/0.50 = 0.64 = P(S | F)
S′ = unsuccessful television set   0.60          0.30           0.18                0.18/0.50 = 0.36 = P(S′ | F)
                                                                0.50 = P(F)
Figure 4.5 Decision tree for marketing a new television set
P(S) = 0.40
    P(S and F) = P(F | S)P(S) = (0.80)(0.40) = 0.32
    P(S and F′) = P(F′ | S)P(S) = (0.20)(0.40) = 0.08
P(S′) = 0.60
    P(S′ and F) = P(F | S′)P(S′) = (0.30)(0.60) = 0.18
    P(S′ and F′) = P(F′ | S′)P(S′) = (0.70)(0.60) = 0.42
Example 4.10 applies Bayes’ theorem to a medical diagnosis problem.
EXAMPLE 4.10
USING BAYES’ THEOREM IN A MEDICAL DIAGNOSIS PROBLEM
The probability that a person has a certain disease is 0.03. Medical diagnostic tests are available to determine whether a person has the disease. If the disease is present, the probability
that the medical diagnostic test will give a positive result (indicating that the disease is present) is 0.90. If the disease is not present, the probability of a positive test result (indicating
that the disease is present when it is not, called a false positive) is 0.02. Suppose that the
medical diagnostic test has given a positive result. What is the probability that the disease is
present, given the positive test result? What is the probability of a positive test result?
SOLUTION
Define events:
D = has disease
D′ = does not have disease
T = test is positive
T′ = test is negative
We are given:
P(D) = 0.03
P(D′) = 0.97
P(T | D) = 0.90
P(T | D′) = 0.02
Using Equation 4.9 to calculate P(D | T) – that is, the probability that the disease is present,
given the positive test result – we obtain:
P(D | T) = P(T | D)P(D) / [P(T | D)P(D) + P(T | D′)P(D′)]
= (0.90)(0.03) / [(0.90)(0.03) + (0.02)(0.97)]
= 0.0270 / (0.0270 + 0.0194)
= 0.0270 / 0.0464
= 0.5818…
The probability that the disease is present, given a positive result has occurred (indicating
that the disease is present), is 0.582. This means that if a person returns a positive test result,
there is only a 58% chance they have the disease. Table 4.4 summarises the calculation of
the probabilities and Figure 4.6 presents the decision tree.
Table 4.4 Bayes’ theorem calculations for the medical diagnosis problem
                              Prior         Conditional    Joint
                              probability   probability    probability       Revised probability
Event Di                      P(Di)         P(T | Di)      P(T | Di)P(Di)    P(Di | T)
D = has disease               0.03          0.90           0.0270            0.0270/0.0464 = 0.582 = P(D | T)
D′ = does not have disease    0.97          0.02           0.0194            0.0194/0.0464 = 0.418 = P(D′ | T)
                                                           0.0464
The denominator in Bayes’ theorem represents P(T), the probability of a positive test
result, which in this case is 0.0464, or 4.64%.
Figure 4.6 Decision tree for the medical diagnosis problem
P(D) = 0.03
    P(D and T) = P(T | D)P(D) = (0.90)(0.03) = 0.0270
    P(D and T′) = P(T′ | D)P(D) = (0.10)(0.03) = 0.0030
P(D′) = 0.97
    P(D′ and T) = P(T | D′)P(D′) = (0.02)(0.97) = 0.0194
    P(D′ and T′) = P(T′ | D′)P(D′) = (0.98)(0.97) = 0.9506
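The four joint probabilities on the branches of the medical diagnosis tree can be recomputed in Python and then used to revise the probabilities for either test result. This is a sketch only; the variable names are illustrative, and the negative-test calculation goes one step beyond what Example 4.10 works out.

```python
from fractions import Fraction

# Given probabilities from Example 4.10, held as exact fractions.
p_d = Fraction(3, 100)               # P(D)
p_t_given_d = Fraction(90, 100)      # P(T | D)
p_t_given_not_d = Fraction(2, 100)   # P(T | D'), the false positive rate

# Joint probabilities at the ends of the four branches.
p_d_and_t = p_t_given_d * p_d                          # 0.0270
p_d_and_not_t = (1 - p_t_given_d) * p_d                # 0.0030
p_not_d_and_t = p_t_given_not_d * (1 - p_d)            # 0.0194
p_not_d_and_not_t = (1 - p_t_given_not_d) * (1 - p_d)  # 0.9506

# Marginal probabilities of each test result.
p_t = p_d_and_t + p_not_d_and_t              # P(T)  = 0.0464
p_not_t = p_d_and_not_t + p_not_d_and_not_t  # P(T') = 0.9536

# Revised probabilities: divide each joint by the matching marginal.
p_d_given_t = p_d_and_t / p_t
p_not_d_given_not_t = p_not_d_and_not_t / p_not_t

print(round(float(p_d_given_t), 4))          # 0.5819
print(round(float(p_not_d_given_not_t), 4))  # 0.9969
```

The contrast between the two results shows why a positive result from this test is far less conclusive than a negative one when the disease is rare.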
think about this
Divine providence and spam
Would you ever guess that the essays Divine Benevolence: Or, An Attempt to Prove that the Principal
End of the Divine Providence and Government is the Happiness of His Creatures and An Essay
Towards Solving a Problem in the Doctrine of Chances were written by the same person? Probably
not, and in doing so you illustrate a modern-day application of Bayesian statistics: spam, or junk
mail, filters.
In not guessing correctly, you probably looked at the words in the titles of the essays and concluded that
they were talking about two different things. An implicit rule you used was that word frequencies vary
by subject matter. A statistics essay would very likely contain the word statistics as well as words such
as chance, problem and solving. An eighteenth-century essay about theology and religion would be
more likely to contain the uppercase forms of Divine and Providence.
Likewise, there are words that you would guess to be very unlikely to appear in either book, such as
technical terms from finance, and words that are most likely to appear in both – common words such
as a, and and the. That words would either be likely or unlikely suggests an application of probability
theory. Of course, likely and unlikely are fuzzy concepts, and we might occasionally misclassify an
essay if we kept things too simple, such as relying solely on the occurrence of the words Divine and
Providence.
For example, a profile of the late Harris Milstead, better known as Divine, the star of Hairspray and other
films, visiting Providence (Rhode Island), would most certainly not be an essay about theology. But if we
widened the number of words we examined and found such words as movie or the name John Waters
(Divine’s director in many films), we probably would quickly realise the essay had something to do with
twentieth-century cinema and little to do with theology and religion.
We can use a similar process to try to classify a new email message in your inbox as either spam or a
legitimate message (called ‘ham’ in this context). We would first need to add to your email program a
‘spam filter’ that has the ability to track word frequencies associated with spam and ham messages as
you identify them on a day-to-day basis. This would allow the filter constantly to update the prior
probabilities necessary to use Bayes’ theorem. With these probabilities, the filter can ask, ‘What is the
probability that an email is spam, given the presence of a certain word?’
Applying the terms of Equation 4.9, such a Bayesian spam filter would multiply the probability of finding
the word in a spam email, P (A | B ), by the probability that the email is spam, P (B ), and then divide by
the probability of finding the word in an email, the denominator in Equation 4.9. Bayesian spam filters
also use shortcuts by focusing on a small set of words that have a high probability of being found in a
spam message and on a small set of other words that have a low probability of being found in a spam
message.
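The single-word calculation described above can be sketched in Python. All three input probabilities below are invented purely for illustration; a real filter would estimate them from the word frequencies it has tracked, and the function name is hypothetical.

```python
from fractions import Fraction

def prob_spam_given_word(p_word_given_spam, p_spam, p_word_given_ham):
    """Equation 4.9 with B = 'email is spam' and A = 'word is present'.

    Hypothetical helper; 'ham' is the complement of 'spam'.
    """
    p_ham = 1 - p_spam
    numerator = p_word_given_spam * p_spam                  # P(A | B)P(B)
    denominator = numerator + p_word_given_ham * p_ham      # P(A)
    return numerator / denominator

# Assumed values, not taken from the text:
# 40% of mail is spam; the word appears in 25% of spam and 1% of ham.
posterior = prob_spam_given_word(
    p_word_given_spam=Fraction(1, 4),
    p_spam=Fraction(2, 5),
    p_word_given_ham=Fraction(1, 100),
)
print(posterior, round(float(posterior), 4))  # 50/53 0.9434
```

Under these assumed figures, seeing the word raises the probability of spam from 0.40 to about 0.94, which is the kind of revision the filter relies on.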
As spammers (people who send junk email) learned of such new filters, they tried to outfox them.
Having learned that Bayesian filters might be assigning a high P (A | B ) value to words commonly found
in spam, such as Viagra, spammers thought they could fool the filter by misspelling the word as
Vi@gr@ or V1agra. What they overlooked was that the misspelled variants were even more likely to
be found in a spam message than the original word. Thus, the misspelled variants made the job of
spotting spam easier for the Bayesian filters.
Other spammers tried to fool the filters by adding ‘good’ words, words that would have a low
probability of being found in a spam message, or ‘rare’ words, words not frequently encountered in
any message. But these spammers overlooked the fact that the conditional probabilities are
constantly updated and that words once considered ‘good’ would soon be discarded from the good
list by the filter as their P (A | B ) value increased. Likewise, as ‘rare’ words grew more common in
spam and yet stayed rare in ham, such words acted like the misspelled variants that others had
tried earlier.
Even then, and perhaps after reading about Bayesian statistics, spammers thought that they could
‘break’ Bayesian filters by inserting random words in their messages. Those random words would affect
the filter by causing it to see many words whose P (A | B ) value would be low. The Bayesian filter would
begin to label many spam messages as ham and end up being of no practical use. Spammers again
overlooked that conditional probabilities are constantly updated.
Other spammers decided to eliminate all or most of the words in their messages and replace them with
graphics so that Bayesian filters would have very few words with which to form conditional probabilities.
However, this approach failed too, as Bayesian filters were rewritten to consider things other than words
in a message. After all, Bayes’ theorem concerns events, and ‘graphics present with no text’ is as valid
an event as ‘some word, X, present in a message’. Other future tricks will ultimately fail for the same
reason. (By the way, spam filters use non-Bayesian techniques as well, which make spammers’ lives
even more difficult.)
Bayesian spam filters are an example of the unexpected way that applications of statistics can
show up in your daily life. You will discover more examples as you read the rest of this book.
Incidentally, the author of the two essays mentioned earlier was Thomas Bayes, who is a lot more
famous for the second essay than the first, a failed attempt to use mathematics and logic to prove
the existence of God.
Problems for Section 4.3
LEARNING THE BASICS
4.22 If P(B) = 0.05, P(A | B) = 0.80 and P(A | B′) = 0.40,
find P(B | A).
4.23 If P(B) = 0.30, P(A | B) = 0.60 and P(A | B′) = 0.50,
find P(B | A).
APPLYING THE CONCEPTS
4.24 In Example 4.10 on page 165, suppose that the probability
that the test will return a false positive (that is, the medical
diagnostic test gives a positive result when the disease is
not present) is reduced from 0.02 to 0.01. Given this
information:
a. If the medical diagnostic test has given a positive result
(indicating the disease is present), what is the probability
that the disease is present?
b. If the medical diagnostic test has given a positive result,
what is the probability that the disease is not present?
c. If the medical diagnostic test has given a negative result
(indicating that the disease is not present), what is the
probability that the disease is not present?
d. If the medical diagnostic test has given a negative result,
what is the probability that the disease is present?
4.25 An advertising executive is studying the television viewing
habits of married men and women during prime-time hours. On
the basis of past viewing records, the executive has determined
that, during prime time, husbands are watching television 60%
of the time. When the husband is watching television, 40% of
the time the wife is also watching. When the husband is not
watching television, 30% of the time the wife is watching
television. Find the probability that
a. if the wife is watching television, the husband is also
watching television.
b. the wife is watching television in prime time.
4.26 The editor of a textbook-publishing company is trying to decide
whether to publish a proposed business statistics textbook.
Information on previous textbooks published indicate that 10%
are huge successes, 20% are modest successes, 40% break
even and 30% are failures. However, before a publishing
decision is made, the book will be reviewed. In the past, 99% of
the huge successes received favourable reviews, 70% of the
moderate successes received favourable reviews, 40% of the
break-even books received favourable reviews and 20% of the
failures received favourable reviews.
a. If the proposed text receives a favourable review, how
should the editor revise the probabilities of the various
outcomes to take this information into account? (Hint: Derive
the conditional probabilities for each outcome given a
favourable review has been received.)
b. What proportion of textbooks receive favourable reviews?
4.27 From past records of personal loans the Check$mart Bank found
that 10% of borrowers default on their loan – that is, they fail to
pay. It also found that, of those who default, 32% are unemployed
while, of those who do not default, only 2% are unemployed.
a. What percentage of unemployed borrowers default?
b. What proportion of borrowers are unemployed?
c. What proportion of borrowers who are not unemployed do
not default?
4.4 COUNTING RULES
LEARNING OBJECTIVE 5
Use counting rules to calculate the number of possible outcomes
In Equation 4.1 the probability of occurrence of an outcome was defined as the number of ways
the outcome occurs divided by the total number of possible outcomes. In many instances, there
is a large number of possible outcomes and it is difficult to determine the exact number. In these
circumstances, rules for counting the number of possible outcomes have been developed. Five
different counting rules are introduced in this section.
COUNTING RULE 1
If any one of k different mutually exclusive and collectively exhaustive events can occur
on each of n trials, the number of possible outcomes is equal to
k^n
(4.10)
EXAMPLE 4.11
COUNTING RULE 1
Suppose you toss a coin five times. What is the number of different possible outcomes (the
sequences of heads and tails)?
SOLUTION
If you toss a coin (with two sides) five times, using Equation 4.10 the number of possible
outcomes is 2^5 = 2 × 2 × 2 × 2 × 2 = 32.
EXAMPLE 4.12
ROLLING A DIE TWICE
Suppose you roll a die twice. How many different possible outcomes can occur?
SOLUTION
If a die (having six sides) is rolled twice, using Equation 4.10 the number of different
possible outcomes is $6^2 = 36$.
The second counting rule is a more general version of the first, and allows for the number
of possible events to differ from trial to trial.
COUNTING RULE 2
If there are $k_1$ events on the first trial, $k_2$ events on the second trial, … , and $k_n$ events on
the nth trial, then the number of possible outcomes is
$k_1 \times k_2 \times \dots \times k_n$ (4.11)
EXAMPLE 4.13
COUNTING RULE 2
At one stage, standard New South Wales vehicle number plates consisted of three letters
followed by three digits. How many possible number plates are there of this form?
SOLUTION
Using Equation 4.11, if a number plate consists of three letters (A to Z) followed by three
numbers (0 to 9), the total number of number plates of this form is:
26 × 26 × 26 × 10 × 10 × 10 = $26^3 \times 10^3$ = 17,576,000.
EXAMPLE 4.14
DETERMINING THE NUMBER OF DIFFERENT DINNERS
A restaurant menu has a fixed-price dinner consisting of an entrée, a main, a beverage and a
dessert. There is a choice of ten entrées, five mains, three beverages and six desserts. Determine the total number of possible dinners.
SOLUTION
Using Equation 4.11, the total number of possible dinners is 10 × 5 × 3 × 6 = 900.
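As a rough illustration of counting rule 2 outside Excel, the Python sketch below multiplies the number of events available on each trial. The function name count_outcomes_rule2 is a hypothetical helper, and math.prod requires Python 3.8 or later.

```python
from math import prod


def count_outcomes_rule2(events_per_trial):
    """Counting rule 2: multiply the number of events available on each trial."""
    return prod(events_per_trial)


# Example 4.13: three letters (26 choices each) followed by three digits (10 each).
number_plates = count_outcomes_rule2([26, 26, 26, 10, 10, 10])

# Example 4.14: ten entrées, five mains, three beverages and six desserts.
dinners = count_outcomes_rule2([10, 5, 3, 6])

print(number_plates, dinners)  # 17576000 900
```

When every trial has the same number of events, this reduces to counting rule 1.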
The third counting rule involves the calculation of the number of ways that a set of items
can be arranged in order.
COUNTING RULE 3
The number of ways that n items can be arranged in order is
$n! = n \times (n - 1) \times \dots \times 2 \times 1$ (4.12)
where n! is called n factorial and 0! is defined as 1.
EXAMPLE 4.15
COUNTING RULE 3
If a set of six textbooks is to be placed on a shelf, in how many ways can the six books be
arranged?
SOLUTION
Any of the six books could occupy the first position on the shelf. Once the first position is
filled, there are five books to choose from in filling the second. Continue this assignment
procedure until all the positions are occupied. The number of ways that the six books can
be arranged is 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720.
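The result of Example 4.15 can be confirmed with a few lines of Python. This is a sketch for illustration only; the single-letter book labels are hypothetical stand-ins for the six textbooks. It compares the factorial formula with a direct listing of every ordering.

```python
from itertools import permutations
from math import factorial

books = ["A", "B", "C", "D", "E", "F"]  # hypothetical labels for the six textbooks

# Counting rule 3: n items can be arranged in n! ways.
formula_count = factorial(len(books))

# Cross-check by generating every possible ordering of the six books.
listed_count = sum(1 for _ in permutations(books))

print(formula_count, listed_count)  # 720 720
```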
permutation
Ordered selection of items.
In many instances we need to know the number of ways in which a subset of the entire group
of items can be arranged in order. Each possible ordered arrangement is called a permutation.
COUNTING RULE 4 – PERMUTATIONS
The number of ways of arranging X objects selected from n objects in order is
${}_nP_X = \frac{n!}{(n - X)!}$ (4.13)
EXAMPLE 4.16
COUNTING RULE 4
Modifying Example 4.15, if there are six textbooks but room for only four books on the
shelf, in how many ways can these books be arranged on the shelf?
SOLUTION
Using Equation 4.13, the number of ordered arrangements of four books selected from six
books is equal to:
${}_6P_4 = \frac{6!}{(6 - 4)!} = \frac{6!}{2!} = 360$
Alternatively, any of the six books could occupy the first position. Once the first position
is filled, there are five books to choose from in filling the second. Continue this assignment
procedure until four books are placed on the shelf. Therefore, the number of ordered
arrangements of four books selected from six is:
6 × 5 × 4 × 3 = 360
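Both routes to the answer in Example 4.16 can be reproduced in Python. The sketch below is illustrative only: it applies Equation 4.13 with factorials and compares the result with math.perm, a standard-library function available from Python 3.8.

```python
from math import factorial, perm

n, x = 6, 4  # six textbooks, room for four on the shelf

# Equation 4.13: n! / (n - X)!
by_formula = factorial(n) // factorial(n - x)

# The standard library calculates the same quantity directly.
by_library = perm(n, x)

print(by_formula, by_library)  # 360 360
```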
combination
Unordered selection of items.
In other situations we are not interested in the order of the outcomes, but only in the number of ways that X items can be selected from n items, irrespective of order. Each unordered
selection is called a combination.
COUNTING RULE 5 – COMBINATIONS
The number of ways of selecting X objects from n objects, irrespective of order, is
equal to:
${}_nC_X = \frac{n!}{X!(n - X)!}$ (4.14)
Comparing Equations 4.13 and 4.14, it can be seen that they differ only in the inclusion of
a term X! in the denominator of Equation 4.14. When permutations are used, all the arrangements of the chosen X objects are distinguishable and each is counted separately. With combinations, the X! possible arrangements of the chosen X objects are irrelevant and count as a single selection, so the number of combinations equals the number of permutations divided by X!.
EXAMPLE 4.17
COUNTING RULE 5
Modifying Example 4.16, in how many ways can you choose four books to place on the
shelf?
SOLUTION
Using Equation 4.14, the number of combinations of four books selected from six books is
equal to:
${}_6C_4 = \frac{6!}{4!(6 - 4)!} = \frac{6!}{4!\,2!} = 15$
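The link between permutations and combinations can be seen numerically. The Python sketch below is an illustration only: it calculates the answer to Example 4.17 three ways, using math.comb (Python 3.8 or later), dividing the permutation count by X!, and listing the unordered selections. The books are represented by the hypothetical labels 0 to 5.

```python
from itertools import combinations
from math import comb, factorial, perm

n, x = 6, 4  # choose four books from six, ignoring order

by_library = comb(n, x)                         # Equation 4.14
from_permutations = perm(n, x) // factorial(x)  # Equation 4.13 divided by X!
listed = len(list(combinations(range(n), x)))   # direct listing of selections

print(by_library, from_permutations, listed)  # 15 15 15
```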
Problems for Section 4.4
APPLYING THE CONCEPTS
4.28 If there are 10 multiple-choice questions in an exam, each with
three possible answers:
a. How many different answer sequences are there?
b. If you answer the questions randomly, what is the probability
that you get all 10 correct?
4.29 A lock on a bank vault consists of three dials, each with 30
positions. To open the vault, each of the three dials must be in
the correct position.
a. How many different possible dial combinations are there for
this lock?
b. What is the probability that, if you randomly select a position
on each dial, you will be able to open the bank vault?
c. Explain why ‘dial combinations’ are not mathematical
combinations expressed by Equation 4.14.
4.30 A particular brand of women’s jeans is available in seven
different sizes, three different colours and three different styles.
How many different jeans does the store manager need to order
to have one pair of each type?
4.31 Greenway Gardens has a $10 salad box consisting of lettuce,
tomatoes, cucumber, sprouts, capsicum, avocado and a bottle
of Greenway’s special salad dressing. Suppose that at
present there is a choice of eight types of lettuce, four types
of tomatoes, three types of cucumbers, three types of sprouts
and no choice for capsicum, avocado and dressing. How
many different salad boxes are there?
4.32 If each letter is used once, how many different arrangements
are there of:
a. Grafton?
b. Otaki?
c. Darwin?
d. Gore?
4.33 Currently, new standard New South Wales vehicle number
plates consist of two letters followed by two digits followed by
two letters. How many possible number plates are there of this
form?
4.34 Each employee of a large firm has an ID number consisting of
their initials (either two or three) followed by two digits. What
is the maximum number of unique ID numbers generated by
this system?
4.35 A trifecta consists of picking the correct finishing order of the
first three horses in a race. Suppose 12 horses are entered in
a race.
a. How many trifecta outcomes are there for this race?
b. If you choose three horses randomly, what is the probability
that you win the trifecta?
4.36 Nine passengers are on a waiting list for an overbooked
flight. Due to cancellations, four seats are available. How
many ways are there, regardless of order, to allocate the
four seats?
4.37 A daily lottery is conducted in which two winning numbers are
selected out of 100 numbers.
a. How many different combinations of winning numbers are
possible?
b. Suppose that you have an entry in this lottery – what is your
probability of winning?
4.38 A reading list for a unit contains 20 articles. How many ways
are there to choose three articles from this list?
4.5 ETHICAL ISSUES AND PROBABILITY
Ethical issues can arise when any statements relating to probability are presented to the public,
particularly when these statements are part of an advertising campaign for a product or service.
Unfortunately, many people are not comfortable with numerical concepts and tend to misinterpret the meaning of the probability. In some instances, the misinterpretation is not intentional
but, in other cases, advertisements may unethically try to mislead potential customers.
A commercial for a Lotto game that said ‘We won’t stop until we have made everyone a
millionaire’ would be a deceptive and possibly unethical application of probability. When purchasing a Lotto ticket, the customer selects a set of numbers (such as 6) from a larger list of
numbers (such as 45). Although virtually all participants know that they are unlikely to win a
first-division prize (select all six of the winning numbers drawn), they also have very little idea
of how small the probability is (1 in 8,145,060 if selecting 6 from 45). Given the fact that Lotto
makes millions of dollars, it is unlikely to stop running, so the statement made is true. However,
it may also be misleading as, in a lifetime, no one can be certain of becoming a millionaire by
winning Lotto.
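The first-division figure quoted above follows from counting rule 5. The Python sketch below is an illustration only and assumes a single entry of six numbers chosen from 45, with the order of the numbers irrelevant; the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb

numbers_in_draw = 45
numbers_chosen = 6

# Number of unordered selections of 6 numbers from 45 (Equation 4.14).
possible_selections = comb(numbers_in_draw, numbers_chosen)

# Probability that one entry matches all six winning numbers.
p_first_division = Fraction(1, possible_selections)

print(f"{possible_selections:,}")  # 8,145,060
print(float(p_first_division))
```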
A statement in an investment newsletter promising a 90% probability of a 20% annual
return on an investment is another example of a potentially unethical application of probability.
To make the claim in the newsletter an ethical one, the author needs to (a) explain the basis on
which this probability estimate rests, (b) provide the probability statement in another format,
such as 9 chances in 10, and (c) explain what happens to the investment in the 10% of cases in
which a 20% return is not achieved (e.g. Is the entire investment lost?).
Other ethical issues arise when probabilities are calculated from non-representative samples. An example of this was during the Australian 2007 federal election campaign where a
leaflet from the Christian Democratic Party included the following:
Daily Telegraph
Tele’s Voteline published on 31 March 2007
Fred Nile’s Christian Democrats are calling for an
immediate moratorium on Islamic immigration.
Do you agree?
YES 99%
As well as being overtly discriminatory, there are several problems with this probability.
• The population sampled consists of readers of the Daily Telegraph, who may not be
representative of the Australian electorate.
• The sample is self-selected; readers have to ring the voteline at a cost of 55 cents a call.
Therefore, only those who feel strongly about an issue, for or against, are likely to vote.
• The sample size is not given in the leaflet. Therefore, we do not know if the probability is
based on only a few votes or a large number of votes. According to the Daily Telegraph, the
sample size was 972 (960 Yes and 12 No).
• There is no mechanism to stop an individual voting more than once. The worst-case
scenario is that this probability is based on the votes of two individuals, one voting Yes
960 times, and the other No 12 times.
Problems for Section 4.5
APPLYING THE CONCEPTS
4.39 Write an advertisement for:
a. Lotto that ethically describes the probability of winning
b. the investment newsletter that ethically states the
probability of a 20% return
4.40 Find an example online or in print of an unethical or misleading
use of probability.
Assess your progress
Summary
This chapter developed concepts concerning basic probability,
conditional probability, Bayes’ theorem and counting rules. In the next
chapter, important discrete probability distributions such as the
binomial, hypergeometric and Poisson distributions will be considered.
Key formulas
Probability of occurrence
Probability of occurrence = $\frac{X}{T}$ (4.1)
Marginal probability
$P(A) = P(A \text{ and } B_1) + P(A \text{ and } B_2) + \dots + P(A \text{ and } B_k)$ (4.2)
General addition rule
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$ (4.3)
Conditional probability
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$ (4.4a)
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$ (4.4b)
Statistical independence
$P(A \mid B) = P(A)$ (and $P(B \mid A) = P(B)$) (4.5)
General multiplication rule
$P(A \text{ and } B) = P(A \mid B)P(B) = P(B \mid A)P(A)$ (4.6)
Multiplication rule for independent events
$P(A \text{ and } B) = P(A)P(B)$ (4.7)
Marginal probability using the general multiplication rule
$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k)$ (4.8)
Bayes’ theorem
$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k)}$ (4.9)
Counting rule 1
$k^n$ (4.10)
Counting rule 2
$k_1 \times k_2 \times \dots \times k_n$ (4.11)
Factorials
$n! = n \times (n - 1) \times \dots \times 2 \times 1$ (4.12)
Permutations
${}_nP_X = \frac{n!}{(n - X)!}$ (4.13)
Combinations
${}_nC_X = \frac{n!}{X!(n - X)!}$ (4.14)
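Equations 4.8 and 4.9 work together: the denominator of Bayes’ theorem is the marginal probability of A. The Python sketch below is a minimal illustration of that relationship; the function name bayes_posteriors and the prior and conditional probabilities are hypothetical values chosen for the example and are not taken from any problem in this chapter.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(B_i | A) for each i, given P(B_i) and P(A | B_i).

    The events B_1, ..., B_k are assumed to be mutually exclusive and
    collectively exhaustive, so the priors should sum to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A | B_i)P(B_i)
    marginal = sum(joint)                                  # Equation 4.8
    return [j / marginal for j in joint]                   # Equation 4.9


# Hypothetical inputs: two events with priors 0.4 and 0.6,
# and conditional probabilities P(A | B_1) = 0.8, P(A | B_2) = 0.3.
priors = [0.4, 0.6]
likelihoods = [0.8, 0.3]
posteriors = bayes_posteriors(priors, likelihoods)

print([round(p, 2) for p in posteriors])  # [0.64, 0.36]
```

The revised probabilities always sum to 1, which is a quick check on any Bayes’ theorem calculation.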
Key terms
a priori classical probability 148
Bayes’ theorem 163
certain event 148
collectively exhaustive 153
combination 170
complement 150
conditional probability 156
contingency (cross-classification) table – probability 150
decision tree 158
empirical classical probability 149
event 149
general addition rule 154
general multiplication rule 160
impossible event 148
joint event 150
joint probability 152
marginal probability 151
multiplication rule for independent events 161
mutually exclusive 153
permutation 170
probability 148
random experiment 149
sample space 149
simple event 149
statistical independence 159
subjective probability 149
Venn diagram 150
Chapter review problems
CHECKING YOUR UNDERSTANDING
4.41 What are the differences between a priori classical probability,
empirical classical probability and subjective probability?
4.42 What is the difference between a simple event and a joint event?
4.43 How can you use the addition rule to find the probability of
occurrence of event A or B?
4.44 What is the difference between mutually exclusive events and
collectively exhaustive events?
4.45 How does conditional probability relate to the concept of
statistical independence?
4.46 How does the multiplication rule differ for events that are and
are not independent?
4.47 How can you use Bayes’ theorem to revise probabilities in light
of new information?
4.48 What is the difference between a permutation and a
combination?
APPLYING THE CONCEPTS
4.49 The breakdown by home address of the previous year’s 993
drink-driving offences in Problem 2.67 is:
Home address                           Number of drink-driving offences
Local – in council area
  Seaside town                         151
  Not seaside town                     462
Not local – not in council area
  Intrastate (within state)            130
  Interstate (another state)           228
  International (outside Australia)    22
If a drink-driver offender is selected at random, what is
the probability that:
a. the offender is local?
b. the offender is from another state?
c. a non-local offender is from another state?
d. a local offender is from outside the seaside town?
e. the offender is from outside the state?
4.50 In a school of 200 students 95% are vaccinated against a
certain disease. During a recent outbreak of this disease
20 students, including 11 vaccinated students, developed the
disease.
a. Find the probability that a student
i. who has the disease has been vaccinated
ii. who has been vaccinated catches the disease
iii. who is unvaccinated catches the disease
b. A parent states that vaccination is ineffective as more than
50% of those who developed the disease had been
vaccinated. Comment on this.
4.51 The Melbourne Cup, held on the first Tuesday in November, has
24 horses entered in it.
a. What is the probability of winning a prize in an office sweep
(where horses are randomly allocated), if prizes are given
for first, second and third places?
b. In a trifecta three horses are selected to finish first, second
and third in the correct order. How many possible trifectas
are there in the Melbourne Cup?
c. How many combinations of the winning three horses are
not trifectas – that is, the selected horses finish first,
second and third but not in the correct order?
d. Suppose that you have a sweep ticket (where horses are
randomly allocated) for the trifecta. What is your
probability of winning the major prize (the trifecta) or a
consolation prize (you have the three winning horses but
in the wrong order)?
4.52 In March 2013, 26.8% of New South Wales dwellings suitable
for a rainwater tank had one installed. Of the dwellings
with a rainwater tank, 53.1% had the rainwater tank plumbed
into the dwelling (Australian Bureau of Statistics,
Environmental Issues: Water Use and Conservation,
Mar 2013, Cat. No. 4602.0.55.003 <www.abs.gov.au>
accessed 4 November 2013).
a. Complete the following contingency table for this problem:
                     Plumbed into dwelling    Not plumbed into dwelling    Total
Rainwater tank
No rainwater tank
Total                                                                      0.0000
b. From part (a) or otherwise, answer the following, to four
decimal places:
i. What proportion of suitable New South Wales dwellings
have a rainwater tank that is not plumbed into the dwelling?
ii. What percentage of New South Wales dwellings that
have a rainwater tank do not have the tank plumbed into
the dwelling?
iii. What proportion of New South Wales dwellings that are
suitable for a rainwater tank do not have one?
c. There are an estimated 2,268,800 dwellings in New South
Wales that are suitable for a rainwater tank. Estimate the
number of dwellings with a rainwater tank plumbed into the
dwelling.
4.53 When calculating premiums on life insurance products
insurance companies often use life tables that enable the
probability of a person dying in any age interval to be
calculated.
The following data obtained from New Zealand Abridged
Period Life Table: 2014–2016 gives the number out of
100,000 New Zealand-born females and males who are still
alive during each five-year period of life between age 20 and
60 (inclusive).
                     Number alive at exact age
Exact age (years)    Out of 100,000 females born    Out of 100,000 males born
20                   99,288                         99,031
25                   99,128                         98,685
30                   98,949                         98,312
35                   98,726                         97,899
40                   98,427                         97,381
45                   97,934                         96,649
50                   97,157                         95,548
55                   95,933                         93,853
60                   94,162                         91,352
Data obtained from <www.stats.govt.nz> accessed June 2017. © Statistics
New Zealand, and licensed by Statistics New Zealand for re-use under the Creative
Commons Attribution 3.0 New Zealand licence
a. What is the probability that a New Zealand-born female will
reach the age of 30?
b. What is the probability that a New Zealand-born female will
reach the age of 45?
c. What is the probability that a 20-year-old New Zealand-born
female will reach the age of 30?
d. What is the probability that a 20-year-old New Zealand-born
female will reach the age of 40?
e. A 30-year-old New Zealand-born female has purchased a
term life policy that will pay her estate a million dollars if
she dies within five years. What is the probability that the
insurance company will pay her estate this amount?
f. Repeat (a) to (e) for New Zealand-born males.
4.54 In a certain region, during a recent outbreak of a preventable
disease 0.1% of primary school children caught the disease; of
these 30% were vaccinated against it. Furthermore, of those
who did not catch the disease 80% were vaccinated.
a. What percentage of vaccinated children caught the disease?
b. What percentage of unvaccinated children caught the disease?
c. What percentage of primary school children in the region
are vaccinated against this disease?
4.55 In an online test, 10 multiple-choice questions are randomly
selected from a test bank of 100 questions.
a. If the order in which the questions appear is immaterial,
how many different tests can be generated?
b. If the order in which the questions appear is important, how
many different tests can be generated?
4.56 The employees of a company were surveyed and asked their
educational background and marital status. Of the 600
employees, 400 had university degrees, 100 were single and
60 were single university graduates.
a. Construct a contingency table for this problem.
b. Find the probability that a randomly selected employee of
the company is single or has a university degree.
c. What percentage of single employees have university degrees?
d. Are marital status and educational background statistically
independent? Explain.
4.57 A researcher has completed a survey of 10,000 New Zealand
viewers to determine which channel they watch on a weekday
during the 6.30 pm to 7.30 pm time-slot, with the following results:
Channel              Number
TV One               3,160
TV2                  1,940
TV3                  2,190
Prime                860
Maori Television     650
Other or none        1,200
A surveyed viewer is chosen at random. Find the probability
that during the 6.30 pm to 7.30 pm time-slot the viewer:
a. watches TV One
b. watches TV2 or TV3
c. watches Prime
d. does not watch TV One, TV2 or TV3
4.58
The following table classifies residents of a regional area of
New South Wales by gender and age.
Age groups           Males     Females   Persons
0–4 years            410       369       779
5–14 years           952       861       1,813
15–19 years          478       501       979
20–24 years          594       559       1,153
25–34 years          859       885       1,744
35–44 years          886       974       1,860
45–54 years          1,026     1,105     2,131
55–64 years          1,097     1,033     2,130
65–74 years          677       703       1,380
75–84 years          333       492       825
85 years and over    154       327       481
Total                7,466     7,809     15,275
Data obtained from Australian Bureau of Statistics, Census of Population and Housing:
General Community Profile, Australia, 2016 <www.abs.gov.au> accessed June 2017
a. If a resident is chosen at random, what is the probability
that the resident:
i. is male?
ii. is a female aged at least 65 years?
iii. is a child under 15 years?
b. What proportion of children, defined as under 15 years, are
male?
c. Are the events ‘Child under 15’ and ‘Male’ statistically
independent? Justify your answer.
d. What is the probability that a female chosen at random is at
least 65 years?
e. Access the Community Profiles for the 2016 Census at
<www.abs.gov.au> for a selected location in Australia and
repeat parts (a) to (d).
4.59 The following table classifies residents of a regional area of
Queensland by gender, age and hours of unpaid domestic work
in the week before the 2016 Census.
                     Did unpaid domestic work                                            Did no unpaid
                     Less than 5 hours   5–14 hours   15–29 hours   30 hours or more    domestic work    Total
Males
15–19 years          681                 97           7             0                   602              1,387
20–24 years          562                 191          32            17                  524              1,326
25–34 years          1,176               768          134           54                  712              2,844
35–44 years          1,119               1,101        285           95                  537              3,137
45–54 years          1,045               1,068        264           106                 663              3,146
55–64 years          878                 1,055        259           121                 802              3,115
65–74 years          488                 738          317           146                 780              2,469
75–84 years          190                 301          163           116                 496              1,266
85 years and over    67                  77           48            39                  273              504
Total males          6,206               5,396        1,509         694                 5,389            19,194
Females
15–19 years          688                 123          13            6                   480              1,310
20–24 years          539                 365          70            49                  290              1,313
25–34 years          845                 1,070        405           450                 342              3,112
35–44 years          520                 1,127        753           725                 275              3,400
45–54 years          620                 1,369        722           453                 356              3,520
55–64 years          529                 1,339        655           416                 527              3,466
65–74 years          238                 669          595           461                 639              2,602
75–84 years          160                 279          240           234                 537              1,450
85 years and over    107                 118          71            46                  483              825
Total females        4,246               6,459        3,524         2,840               3,929            20,998
Data obtained from Australian Bureau of Statistics, Census of Population and Housing: General Community Profile, Australia, 2016 <www.abs.gov.au> accessed June 2017
a. If a resident is chosen at random, what is the probability
that the resident:
i. did unpaid domestic work?
ii. did no unpaid domestic work and is female?
iii. did unpaid domestic work and is male?
iv. did at least 15 hours’ unpaid domestic work and is
male?
v. did no unpaid domestic work and is male?
b. What proportion of male residents did unpaid domestic
work?
c. What percentage of female residents did unpaid domestic
work?
d. From parts (a) and (b), are the events ‘Male’ and ‘Did unpaid
domestic work’ statistically independent? Justify your
answer.
e. What proportion of men do:
i. at least 15 hours of unpaid domestic work?
ii. less than five hours of unpaid domestic work (including
no unpaid domestic work)?
f. What proportion of women do:
i. at least 15 hours of unpaid domestic work?
ii. less than five hours of unpaid domestic work (including
no unpaid domestic work)?
g. From parts (e) and (f), can you conclude that men do less
unpaid domestic work than women?
h. What proportion of male residents aged at least 65 did no
unpaid domestic work?
i. What percentage of female residents aged at least 65 did
no unpaid domestic work?
j. What proportion of male residents aged under 35 did
unpaid domestic work?
k. What percentage of female residents aged under 35 did
unpaid domestic work?
l. What conclusions can you draw from parts (h) to (k)?
m. Access the Community Profiles for the 2016 Census at
<www.abs.gov.au> for a selected location in Australia and
repeat parts (a) to (l).
4.60 In a town, 45% of all households have a pet, 35% have
children, and 40% of all households with children have a pet.
Using these definitions:
P = event household has a pet
C = event household has children
a. Complete the following contingency table.
         P      P′     Total
C
C′
Total
b. From part (a) or otherwise, answer the following:
i. What is the probability that a randomly selected
household has neither pets nor children?
ii. What proportion of households with children do not have
a pet?
iii. Find and interpret P(C | P ).
Continuing cases
Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues.
In particular, students within the school are asked to complete a student survey when they receive their grades
each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students
who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in
< TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.
Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman
University Postgraduate MBA Student Survey.
a For pairs of variables in the BBus student survey, calculate contingency tables and then calculate
conditional and marginal probabilities.
b For pairs of variables in the MBA student survey, calculate contingency tables and then calculate
conditional and marginal probabilities.
c Write a report summarising your conclusions.
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. The data are stored in < REAL_ESTATE >.
a From the contingency tables constructed for selected variables in Chapter 2 for regional city 1 state A,
calculate selected conditional and marginal probabilities.
b From the contingency tables constructed for selected variables in Chapter 2 for coastal city 1 state A,
calculate selected conditional and marginal probabilities.
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.
Chapter 4 Excel Guide
EG4.1 BASIC PROBABILITY CONCEPTS
Simple and Joint Probability and the General Addition Rule
Key technique
Use Excel arithmetic formulas.
Example
Calculate simple and joint probabilities for the Table 4.1
data on discount accommodation voucher use and repeat
festival attendance.
PHStat
Use Simple & Joint Probabilities.
For the example, select PHStat ➔ Probability & Prob.
Distributions ➔ Simple & Joint Probabilities. In the new
template, similar to the worksheet shown below, fill in the
Sample Space area with the data.
In-depth Excel
Use the COMPUTE worksheet of the Probabilities
workbook as a template.
The worksheet (shown in Figure EG4.1) already contains the Table 4.1 discount accommodation voucher use
and repeat festival attendance data. For other problems,
change the sample space table entries in the cell ranges
C3:D4 and A5:D6.
Figure EG4.1
COMPUTE worksheet of the Probabilities workbook
The COMPUTE_FORMULAS worksheet gives the
formulas to calculate the probabilities.
EG4.2 CONDITIONAL PROBABILITY
There is no PHStat command for conditional probability.
In-depth Excel
Use the COMPUTE worksheet of the Probabilities
workbook as a template. In this worksheet (shown in Figure EG4.1) the conditional probabilities are calculated in
rows 28 to 35. The worksheet in Figure EG4.1 already contains the Table 4.1 data. For other problems, change the
sample space table entries in the cell ranges C3:D4 and
A5:D6.
EG4.3 BAYES’ THEOREM
Key technique
Use Excel arithmetic formulas.
Example
Apply Bayes’ theorem to the television marketing example
in Section 4.3.
In-depth Excel
Use the COMPUTE worksheet of the Bayes workbook
as a template.
The worksheet (shown in Figure EG4.2) already contains the probabilities for the Section 4.3 example. For other
problems, change those probabilities in the cell range B5:C6.
Figure EG4.2
COMPUTE worksheet of the Bayes workbook
The COMPUTE_FORMULAS worksheet gives the
formulas to calculate the probabilities, which are also
shown as an inset to the worksheet in Figure EG4.2.
EG4.4 COUNTING RULES
Counting Rule 1
In-depth Excel
Use the POWER(k, n) worksheet function in a cell formula to calculate the number of outcomes given k events
and n trials. For example, the formula =POWER(6, 2) calculates the answer for Example 4.12 on page 169.
Counting Rule 2
In-depth Excel
Use a formula that takes the product of successive
POWER(k, n) functions to solve problems related to counting rule 2. For example, the formula =POWER(26, 3) *
POWER(10, 3) calculates the answer for Example 4.13
New South Wales vehicle number plates on page 169.
Counting Rule 3
In-depth Excel
Use the FACT(n) worksheet function in a cell formula to
calculate how many ways n items can be arranged. For
example, the formula =FACT(6) calculates 6!, the answer
to Example 4.15 on page 170.
Counting Rule 4
In-depth Excel
Use the PERMUT(n, x) worksheet function in a cell formula to calculate the number of ways of arranging in order
x objects selected from n objects. For example, the formula
=PERMUT(6, 4) calculates the answer for Example 4.16
on page 170.
Counting Rule 5
In-depth Excel
Use the COMBIN(n, x) worksheet function in a cell formula to calculate the number of ways of selecting x
objects from n objects, irrespective of order. For example,
the formula =COMBIN(6, 4) calculates the answer for
Example 4.17 on page 171.
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 5
Some important discrete probability distributions
GAIA ADVENTURE TOURS
Tours and activities for Gaia Adventure Tours (see Chapter 4) are booked online. Potential
customers can submit an online enquiry, which Gaia Adventure Tours advertises will be answered
within 45 minutes between 7 am and 11 pm by a knowledgeable local adventure tour consultant.
Yang, who is in charge of Gaia Adventure Tours’ online enquiry and booking procedures, is investigating several key performance indicators (KPIs); in particular:
■ the proportion of online enquiries converted to bookings
■ the number of online enquiries received in 1 hour
■ the proportion of online enquiries submitted between 7 am and 11 pm answered within
45 minutes.
Recent data collected by Yang show that:
■ 10% of online enquiries are converted to bookings
■ on average, Gaia Adventure Tours receives 30 online enquiries an hour between 7 am and 11 pm
■ with the current levels of staffing for enquiries:
– when 24 or more online enquiries are received in 30 minutes, queries start to queue and
may not be answered within the stated 45 minutes
– when fewer than five enquiries are received in 20 minutes, enquiry staff have significant idle time.
Yang would like to determine the probability of a given number of online enquiries being converted
to confirmed bookings in a sample of a specific size. In addition, to help determine optimal enquiry
staffing levels, Yang would like to calculate the probability of receiving 24 or more online enquiries
in any 30 minutes or fewer than five online enquiries in any 20 minutes.
Answers to these questions and others can help Gaia Adventure Tours to develop future sales,
marketing and staffing strategies.
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 recognise and apply the properties of a probability distribution
2 calculate the expected value and variance of a probability distribution
3 calculate average return and measure risk associated with various investment proposals
4 identify situations that can be modelled by a binomial distribution and calculate binomial probabilities
5 identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities
6 identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities
To help answer the given probability questions Yang can use a model, or small-scale representation, that approximates the online enquiry process, allowing inferences to be made about the
processes. Although model building is a difficult task for some endeavours, in this case Yang can
use probability distributions, which are mathematical models suitable for solving these types of
probability questions.
This chapter introduces probability distributions and explains how to apply the binomial,
Poisson and hypergeometric distributions to business and other problems.
5.1 PROBABILITY DISTRIBUTION FOR A DISCRETE RANDOM VARIABLE
A numerical variable (see Chapter 1) is a variable that yields numerical responses such as the
number of magazines you subscribe to or your height in centimetres. Numerical variables are
classified as either continuous or discrete. Continuous numerical variables have outcomes that
arise from a measuring process, for example your height or weight. Discrete numerical variables have outcomes that arise from a counting process, such as the number of magazines you
subscribe to or the number of phone calls received in an hour. This chapter introduces probability distributions that represent discrete numerical variables; continuous probability distributions are discussed in Chapter 6.
A probability distribution for a discrete random variable is a mutually exclusive list of all
possible numerical outcomes of the random variable with the probability of occurrence
associated with each outcome.
LEARNING OBJECTIVE 1
Recognise and apply the properties of a probability distribution
probability distribution for a discrete random variable
Values of a discrete random variable with the corresponding probability of occurrence.
For a probability distribution for a discrete random variable:
1. all probabilities must be between 0 and 1 inclusive; that is, 0 ≤ P(X) ≤ 1
2. the sum of the probabilities must equal 1; that is, ∑ P(X) = 1.
As an example, Table 5.1 gives the distribution of the number of home mortgages approved
per week by the loans manager at a local branch of Check$mart Bank. From this we can see
that the loans manager approves no more than six home mortgages per week as the list in Table 5.1
is collectively exhaustive. Furthermore, as one of the outcomes must happen – that is, between
none and six mortgages approved – the probabilities must sum to 1. Figure 5.1 is a graphical
representation of Table 5.1.
Table 5.1
Probability distribution of the number of home mortgages approved per week
Home mortgages approved per week    Probability
0                                   0.10
1                                   0.10
2                                   0.20
3                                   0.30
4                                   0.15
5                                   0.10
6                                   0.05
Figure 5.1
Probability distribution of the number of home mortgages approved per week
[Figure 5.1: bar chart of P(X) against X, the number of home mortgages approved per week]
LEARNING OBJECTIVE 2
Calculate the expected value and variance of a probability distribution
expected value of a discrete random variable
Measure of central tendency; the mean of a discrete random variable.
Expected Value of a Discrete Random Variable
In Chapter 3 we used the sample mean and variance to describe the centre and variation of a
sample. In the same way, we can use the mean and variance of a random variable to describe the
centre and variation of a probability distribution.
The mean μ of a probability distribution is the expected value of its random variable. To
calculate the expected value of a discrete random variable, multiply each outcome X by its
corresponding probability P(X) and then sum these products.
EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE
\[ \mu = E(X) = \sum_{i=1}^{N} X_i P(X_i) \tag{5.1} \]
where
Xi = the ith outcome of the discrete random variable X
P(Xi) = probability of occurrence of the ith outcome of X
Using Equation 5.1 the mean, or expected value, for the probability distribution of the
number of home mortgages approved per week is:
\[
\begin{aligned}
\mu = E(X) &= \sum_{i=1}^{N} X_i P(X_i) \\
&= (0 \times 0.1) + (1 \times 0.1) + (2 \times 0.2) + (3 \times 0.3) + (4 \times 0.15) + (5 \times 0.1) + (6 \times 0.05) \\
&= 0 + 0.1 + 0.4 + 0.9 + 0.6 + 0.5 + 0.3 \\
&= 2.8
\end{aligned}
\]
The actual number of mortgages approved in a given week must be an integer value, so 2.8
mortgages are never approved in one week. However, on average, or in the long run, 2.8 are
approved per week.
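The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using the values from Table 5.1; the variable names are illustrative. It first checks the two properties of a discrete probability distribution and then applies Equation 5.1.

```python
import math

# Table 5.1: home mortgages approved per week and their probabilities
outcomes = [0, 1, 2, 3, 4, 5, 6]
probabilities = [0.10, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05]

# Properties of a discrete probability distribution:
# every probability lies between 0 and 1, and together they sum to 1
assert all(0 <= p <= 1 for p in probabilities)
assert math.isclose(sum(probabilities), 1.0)

# Equation 5.1: multiply each outcome by its probability and sum the products
mean = sum(x * p for x, p in zip(outcomes, probabilities))
print(round(mean, 2))
```

Rounding is used only for display, because floating-point sums can differ from 2.8 in the last decimal places.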
Variance and Standard Deviation of a Discrete Random Variable
The variance of a discrete probability distribution is calculated by multiplying each squared
deviation from the mean [Xi − E(X)]² by its corresponding probability P(Xi) and then summing
the resulting products. Equations 5.2a and 5.3 define, respectively, the variance of a discrete
random variable and the standard deviation of a discrete random variable.
VARIANCE OF A DISCRETE RANDOM VARIABLE – DEFINITION FORMULA
\[ \sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2 P(X_i) \tag{5.2a} \]
where
Xi = the ith outcome of the discrete random variable X
P(Xi) = probability of occurrence of the ith outcome of X
variance of a discrete random variable
Measure of variation, based on squared deviations from the mean; directly related to the standard deviation.
standard deviation of a discrete random variable
Measure of variation, based on squared deviations from the mean; directly related to the variance.
As for the sample variance, we can use algebra to obtain an alternative calculation formula.
VARIANCE OF A DISCRETE RANDOM VARIABLE – CALCULATION FORMULA
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - [E(X)]^2 \tag{5.2b} \]
where
\[ \sum_{i=1}^{N} X_i^2 P(X_i) = X_1^2 P(X_1) + X_2^2 P(X_2) + \dots + X_N^2 P(X_N) \]
STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
The standard deviation of a discrete random variable is the square root of the variance
\[ \sigma = \sqrt{\sigma^2} \tag{5.3} \]
Using Equations 5.2b and 5.3, the variance and standard deviation for the probability distribution of the number of mortgages approved per week are:
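\[
\begin{aligned}
\sigma^2 &= \sum_{i=1}^{N} X_i^2 P(X_i) - [E(X)]^2 \\
&= [(0^2 \times 0.1) + (1^2 \times 0.1) + (2^2 \times 0.2) + (3^2 \times 0.3) + (4^2 \times 0.15) + (5^2 \times 0.1) + (6^2 \times 0.05)] - 2.8^2 \\
&= [(0 \times 0.1) + (1 \times 0.1) + (4 \times 0.2) + (9 \times 0.3) + (16 \times 0.15) + (25 \times 0.1) + (36 \times 0.05)] - 7.84 \\
&= (0 + 0.1 + 0.8 + 2.7 + 2.4 + 2.5 + 1.8) - 7.84 \\
&= 10.3 - 7.84 \\
&= 2.46
\end{aligned}
\]
\[ \sigma = \sqrt{\sigma^2} = \sqrt{2.46} = 1.568\ldots \]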
Alternatively, a table format can be used to calculate the mean and variance. In Table 5.2, the
mean number of home mortgages approved per week is calculated. Then, using Equation 5.2b:
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - [E(X)]^2 = 10.3 - (2.8)^2 = 2.46 \]
Table 5.2
Calculating the mean and variance of the number of home mortgages approved per week
Home mortgages approved per week
Xi       P(Xi)    XiP(Xi)            Xi²P(Xi)
0        0.10     0.0                0.0
1        0.10     0.1                0.1
2        0.20     0.4                0.8
3        0.30     0.9                2.7
4        0.15     0.6                2.4
5        0.10     0.5                2.5
6        0.05     0.3                1.8
Total    1.00     μ = E(X) = 2.8     10.3
The expected value is often used to measure the amount we can expect to gain or lose by undertaking a particular investment, while the standard deviation is used to measure the risk involved.
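As a check on the arithmetic in this section, the following sketch computes the variance of the Table 5.1 distribution with both the definition formula (Equation 5.2a) and the calculation formula (Equation 5.2b), and then takes the square root (Equation 5.3). The variable names are illustrative only.

```python
import math

# Table 5.1: home mortgages approved per week and their probabilities
outcomes = [0, 1, 2, 3, 4, 5, 6]
probabilities = [0.10, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05]

# Equation 5.1: expected value
mean = sum(x * p for x, p in zip(outcomes, probabilities))

# Equation 5.2a: probability-weighted squared deviations from the mean
var_definition = sum((x - mean) ** 2 * p for x, p in zip(outcomes, probabilities))

# Equation 5.2b: probability-weighted squares minus the squared mean
var_calculation = sum(x ** 2 * p for x, p in zip(outcomes, probabilities)) - mean ** 2

# Equation 5.3: standard deviation is the square root of the variance
std_dev = math.sqrt(var_calculation)

print(round(var_definition, 2), round(var_calculation, 2), round(std_dev, 3))
```

Both formulas should agree up to floating-point rounding, which is why the results are rounded before printing.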
Problems for Section 5.1
LEARNING THE BASICS
5.1 Given the following probability distributions:
Distribution A        Distribution B
X     P(X)            X     P(X)
0     0.50            0     0.05
1     0.20            1     0.10
2     0.15            2     0.15
3     0.10            3     0.20
4     0.05            4     0.50
a. Calculate the expected value for each distribution.
b. Calculate the standard deviation for each distribution.
c. Compare and contrast the results of distributions A and B.
5.2 Is each of the following a valid probability distribution? Justify your answers:
Distribution A Distribution B Distribution C Distribution D
X
P(X)
X
P(X)
X
P(X)
X
P(X)
0.2
0
0.1
0.250 0.500
0
0.2
-1
1
0.9
1
0.2
0.500 0.250
1
0.1
2
2
0.3
1.000 0.250
2
0.4
-0.1
3
0.3
3
0.5
APPLYING THE CONCEPTS
5.3 Using the company records for the past 500 working days, the manager of Konig Motors has summarised the number of cars sold per day in the following table:
Number of cars sold per day    Frequency of occurrence
0                              40
1                              100
2                              142
3                              66
4                              36
5                              30
6                              26
7                              20
8                              16
9                              14
10                             8
11                             2
Total                          500
a. Form the probability distribution for the number of cars sold per day.
b. Calculate the mean or expected number of cars sold per day.
c. Calculate the standard deviation.
5.4 The manager of a large computer network has developed the following probability distribution of the number of interruptions per day:
Interruptions (X)    P(X)
0                    0.32
1                    0.35
2                    0.18
3                    0.08
4                    0.04
5                    0.02
6                    0.01
a. Calculate the mean or expected number of interruptions per day.
b. Calculate the standard deviation.
5.5 In the casino version of the traditional Australian game of two-up, a spinner stands in a ring and tosses two coins into the air. The coins may land showing two heads, two tails or one tail and one head (odds). Players can bet on either heads or tails at odds of one to one. Therefore, if a player bets $1 on heads, the player will win $1 if the coins land on heads but lose $1 if the coins land on tails. Alternatively, if a player bets $1 on tails, the player will win $1 if the coins land on tails but lose $1 if the coins land on heads. If the coins land on odds, all bets are frozen and the spinner tosses again until either heads or tails comes up. If five odds are tossed in a row all players lose.
a. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on heads.
b. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on tails.
c. What is the expected long-run profit (or loss) to the player?
5.2 COVARIANCE AND ITS APPLICATION IN FINANCE
LEARNING OBJECTIVE 3
Calculate average return and measure risk associated with various investment proposals
In Section 5.1 the expected value, variance and standard deviation of a discrete random variable
are discussed. In this section the covariance between two discrete random variables is introduced and then applied to portfolio management, a topic of interest to financial analysts.
Covariance
Covariance, σXY, is a measure of the strength of the relationship between two random variables,
X and Y. A positive covariance indicates a positive relationship, while a negative covariance
indicates a negative relationship. If the two variables are independent then their covariance is
zero. Equation 5.4a defines the covariance between two discrete random variables.
covariance
Measure of the strength of the
linear relationship between two
numerical variables.
COVARIANCE – DEFINITION FORMULA
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} [X_i - E(X)][Y_j - E(Y)] P(X_i \text{ and } Y_j) \tag{5.4a} \]
where Xi is the ith outcome of the discrete random variable X, and Yj is the jth outcome
of the discrete random variable Y.
As for the sample covariance, we can use algebra to obtain an alternative calculation formula.
COVARIANCE – CALCULATION FORMULA
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} X_i Y_j P(X_i \text{ and } Y_j) - E(X)E(Y) \tag{5.4b} \]
To illustrate covariance, suppose that we are deciding between two alternative investments
for the coming year. The first investment is a mutual fund that consists of shares that are
expected to do well when economic conditions are strong. The second investment is a mutual
fund that is expected to perform best when economic conditions are weak. Our estimate of the
returns for each investment (per $1,000 investment) under three economic conditions, each
with a given probability of occurrence, is summarised in Table 5.3.
Table 5.3
Estimated returns for each investment under three economic conditions
P(XiYi)                                       Investment
[= P(Xi) = P(Yi)]    Economic condition       Xi        Yi        XiYi       XiP(Xi)        YiP(Yi)       XiYiP(XiYi)
0.2                  Recession                −$100     +$200     −20,000    −20            40            −4,000
0.5                  Stable economy           +100      +50       5,000      50             25            2,500
0.3                  Expanding economy        +250      −100      −25,000    75             −30           −7,500
1.0                                                                          E(X) = 105     E(Y) = 35     −9,000
The expected value and standard deviation for each investment is calculated as follows. Let
X = strong-economy fund, and Y = weak-economy fund:
\[ E(X) = \mu_X = (-100)(0.2) + (100)(0.5) + (250)(0.3) = \$105 \]
\[ E(Y) = \mu_Y = (+200)(0.2) + (50)(0.5) + (-100)(0.3) = \$35 \]
\[ \sigma_X^2 = [((-100)^2 \times 0.2) + (100^2 \times 0.5) + (250^2 \times 0.3)] - 105^2 = 25{,}750 - 11{,}025 = 14{,}725 \]
\[ \sigma_X = \sqrt{\sigma_X^2} = \sqrt{14{,}725} = 121.346\ldots \approx \$121.35 \]
\[ \sigma_Y^2 = [(200^2 \times 0.2) + (50^2 \times 0.5) + ((-100)^2 \times 0.3)] - 35^2 = 12{,}250 - 1{,}225 = 11{,}025 \]
\[ \sigma_Y = \sqrt{\sigma_Y^2} = \sqrt{11{,}025} = \$105.00 \]
In the calculation of the covariance, the only non-zero probabilities are:
P(X = −$100 and Y = $200) = 0.2
P(X = $100 and Y = $50) = 0.5
P(X = $250 and Y = −$100) = 0.3
We therefore have:
\[
\begin{aligned}
\sigma_{XY} &= [(-100 \times 200 \times 0.2) + (100 \times 50 \times 0.5) + (250 \times (-100) \times 0.3)] - (105 \times 35) \\
&= -9{,}000 - 3{,}675 = -12{,}675
\end{aligned}
\]
expected value of the sum of
two random variables
Measure of central tendency; mean
of the sum of two random variables.
variance of the sum of two
random variables
Measure of variation; directly
related to the standard deviation.
standard deviation of the sum
of two random variables
Measure of variation; directly
related to the variance.
Thus, the strong-economy fund has a higher expected value (i.e. larger expected return) than
the weak-economy fund but has a higher standard deviation (i.e. more risk). The covariance of
-12,675 between the two investments indicates a negative relationship in which the two investments are varying in the opposite direction. Therefore, when the return on one investment is
high, the return on the other is typically low.
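The figures for the two funds can be verified with a short script. The sketch below applies the calculation formulas (Equations 5.2b and 5.4b) to the three scenarios of Table 5.3; the list layout and variable names are assumptions made for this illustration.

```python
import math

# Table 5.3: (probability, return on X, return on Y) per $1,000 invested
scenarios = [
    (0.2, -100, 200),   # recession
    (0.5, 100, 50),     # stable economy
    (0.3, 250, -100),   # expanding economy
]

# Expected returns (Equation 5.1)
mean_x = sum(p * x for p, x, _ in scenarios)
mean_y = sum(p * y for p, _, y in scenarios)

# Variances (Equation 5.2b) and standard deviations (Equation 5.3)
var_x = sum(p * x ** 2 for p, x, _ in scenarios) - mean_x ** 2
var_y = sum(p * y ** 2 for p, _, y in scenarios) - mean_y ** 2
std_x = math.sqrt(var_x)
std_y = math.sqrt(var_y)

# Covariance (Equation 5.4b); only the three listed (X, Y) pairs have non-zero probability
cov_xy = sum(p * x * y for p, x, y in scenarios) - mean_x * mean_y

print(round(mean_x, 2), round(mean_y, 2))
print(round(var_x, 2), round(var_y, 2), round(cov_xy, 2))
print(round(std_x, 2), round(std_y, 2))
```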
Expected Value, Variance and Standard Deviation
of the Sum of Two Random Variables
Equation 5.4a defined the covariance between two discrete random variables, X and Y. Now, the
expected value of the sum of two random variables, variance of the sum of two random variables
and standard deviation of the sum of two random variables are defined.
EXPECTED VALUE OF THE SUM OF TWO RANDOM VARIABLES
The expected value of the sum of two random variables is equal to the sum of the
expected values.
\[ E(X + Y) = E(X) + E(Y) \tag{5.5} \]
Alternatively:
\[ \mu_{X+Y} = \mu_X + \mu_Y \]
VARIANCE OF THE SUM OF TWO RANDOM VARIABLES
The variance of the sum of two random variables is equal to the sum of the variances
plus twice the covariance.
\[ \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY} \tag{5.6} \]
STANDARD DEVIATION OF THE SUM OF TWO RANDOM VARIABLES
The standard deviation is the square root of the variance.
\[ \sigma_{X+Y} = \sqrt{\sigma_{X+Y}^2} \tag{5.7} \]
To illustrate the expected value, variance and standard deviation of the sum of two random
variables, consider the two investments previously discussed. Using Equations 5.5, 5.6 and 5.7:
\[ \mu_{X+Y} = E(X + Y) = E(X) + E(Y) = 105 + 35 = \$140 \]
\[ \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY} = (14{,}725 + 11{,}025) + 2 \times (-12{,}675) = 400 \]
\[ \sigma_{X+Y} = \sqrt{400} = \$20 \]
The expected return of the sum of the strong-economy fund and the weak-economy fund is
$140 with a standard deviation of $20. The standard deviation of the sum of the two investments is much less than the standard deviation of either single investment because there is a
large negative covariance between the investments.
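Equations 5.5 to 5.7 translate directly into code. The sketch below plugs in the means, variances and covariance already calculated for the two funds; the variable names are illustrative.

```python
import math

# Values calculated earlier for the strong-economy fund (X) and weak-economy fund (Y)
mean_x, mean_y = 105, 35
var_x, var_y = 14_725, 11_025
cov_xy = -12_675

mean_sum = mean_x + mean_y              # Equation 5.5
var_sum = var_x + var_y + 2 * cov_xy    # Equation 5.6
std_sum = math.sqrt(var_sum)            # Equation 5.7

print(mean_sum, var_sum, std_sum)
```

Changing cov_xy to 0 in this sketch shows how much of the risk reduction comes from the negative covariance rather than from the individual variances.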
Portfolio Expected Return and Portfolio Risk
The concepts of covariance, expected return and standard deviation of the sum of two random
variables can be applied to the study of investment portfolios where investors combine assets
into portfolios to reduce their risk. The objective is to maximise the return while minimising the
risk. For such portfolios, rather than studying the sum of two random variables, each investment is weighted by the proportion of assets assigned to that investment. Equations 5.8 and 5.9
define portfolio expected return and portfolio risk.
PORTFOLIO EXPECTED RETURN
The portfolio expected return for a two-asset investment is equal to the weight assigned
to asset X multiplied by the expected return of asset X plus the weight assigned to asset
Y multiplied by the expected return of asset Y:
\[ E(P) = wE(X) + (1 - w)E(Y) \tag{5.8} \]
where E(P) = portfolio expected return
w = portion of the portfolio assigned to asset X, 0 ⩽ w ⩽ 1
1 – w = portion of the portfolio assigned to asset Y
portfolio
A combined investment in two or more assets.
portfolio expected return
Measure of central tendency; mean return on investment.
portfolio risk
Measure of the variation of investment returns.
PORTFOLIO RISK
\[ \sigma_p = \sqrt{w^2\sigma_X^2 + (1 - w)^2\sigma_Y^2 + 2w(1 - w)\sigma_{XY}} \tag{5.9} \]
In the previous example, the expected return and risk of two different investments were
calculated, a strong-economy fund and a weak-economy fund. The covariance of the two investments was also calculated. Now, suppose that we wish to form a portfolio of these two investments that consists of an equal investment in each of these two funds. To calculate the portfolio
expected return and the portfolio risk, use Equations 5.8 and 5.9, with w = 0.5, to obtain:
\[ E(P) = wE(X) + (1 - w)E(Y) = (0.5 \times 105) + (0.5 \times 35) = \$70 \]
\[
\begin{aligned}
\sigma_p &= \sqrt{(0.5)^2(14{,}725) + (1 - 0.5)^2(11{,}025) + 2(0.5)(1 - 0.5)(-12{,}675)} \\
&= \sqrt{100} \\
&= \$10
\end{aligned}
\]
Thus, the portfolio has an expected return of $70 for each $1,000 invested (a return of 7%) and
has a portfolio risk of $10. The portfolio risk here is small because there is a large negative
covariance between the two investments. The fact that each investment performs best under different circumstances has reduced the overall risk of the portfolio.
It is possible to use calculus to determine the minimum portfolio risk – which may occasionally be zero – but that is outside the scope of this textbook.
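Equations 5.8 and 5.9 can be wrapped in a small helper so that different weights are easy to try. The function name portfolio and its parameter names are hypothetical, chosen for this sketch; the inputs are the values calculated above for the two funds.

```python
import math

def portfolio(w, mean_x, mean_y, var_x, var_y, cov_xy):
    """Return (expected return, risk) of a two-asset portfolio with weight w on asset X."""
    expected_return = w * mean_x + (1 - w) * mean_y                               # Equation 5.8
    variance = w ** 2 * var_x + (1 - w) ** 2 * var_y + 2 * w * (1 - w) * cov_xy
    return expected_return, math.sqrt(variance)                                   # Equation 5.9

# Equal investment in the strong-economy fund (X) and the weak-economy fund (Y)
expected_return, risk = portfolio(0.5, 105, 35, 14_725, 11_025, -12_675)
print(expected_return, risk)
```

Calling the helper with other values of w between 0 and 1 gives the return and risk of other mixes of the two funds.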
Problems for Section 5.2
LEARNING THE BASICS
5.6 Given the following probability distributions for variables X and Y:
P(XiYi)    X      Y
0.4        100    200
0.6        200    100
Calculate:
a. E(X) and E(Y)
b. σX and σY
c. σXY
d. E(X + Y)
5.7 Given the following probability distributions for variables X and Y:
P(XiYi)    X       Y
0.2        −100    50
0.4        50      30
0.3        200     20
0.1        300     20
Calculate:
a. E(X) and E(Y)
b. σX and σY
c. σXY
d. E(X + Y)
5.8 Two investments, X and Y, have the following characteristics:
E(X) = $50, E(Y) = $100, σ²X = 9,000, σ²Y = 15,000 and σXY = 7,500
If the weight assigned to investment X of portfolio assets is 0.4, calculate:
a. the portfolio expected return
b. the portfolio risk
APPLYING THE CONCEPTS
5.9
The process of being served at a bank consists of two
independent parts – the time waiting in line and the time it
takes to be served by the teller. Suppose, at a branch of
Check$mart, that the time waiting in line has an expected value
of 4 minutes with a standard deviation of 1.2 minutes and the
time it takes to be served by the teller has an expected value of
5.5 minutes with a standard deviation of 1.5 minutes. Calculate:
a. the expected value of the total time it takes to be served
b. the standard deviation of the total time it takes to be served
5.10 For the investment example given in Table 5.3:
a. Calculate the portfolio expected return and the portfolio risk if:
i. 30% is invested in the strong-economy fund and 70% in
the weak-economy fund
ii. 70% is invested in the strong-economy fund and 30% in
the weak-economy fund
b. Which of the three investment strategies (30%, 50% or 70%
in the strong-economy fund) would you recommend? Why?
5.11 You are developing a strategy for investing in two different
shares. The anticipated annual return for a $1,000 investment
in each share has the following probability distribution:
               Returns
Probability    Share X    Share Y
0.1            −$100      $50
0.3            0          150
0.3            80         −20
0.3            150        −100
a. Calculate:
i. the expected return for share X and for share Y
ii. the standard deviation for share X and for share Y
iii. the covariance of share X and share Y
b. Would you invest in share X or share Y? Explain.
5.12 Suppose that in problem 5.11 you wanted to create a portfolio
that consists of share X and share Y.
a. Calculate the portfolio expected return and portfolio risk for
each of the following percentages invested in share X:
i. 30%
ii. 50%
iii. 70%
b. On the basis of the results of your calculations in part (a),
which portfolio would you recommend? Explain.
5.13 You are trying to set up a portfolio that consists of a corporate
bond fund and a common share fund. The following information
about the annual return (per $1,000) of each of these investments
under different economic conditions is available, together with the
probability that each of these economic conditions will occur.
Probability    State of the economy    Corporate bond fund    Common share fund
0.10           Recession               −$30                   −$150
0.15           Stagnation              50                     −20
0.35           Slow growth             90                     120
0.30           Moderate growth         100                    160
0.10           High growth             110                    250
a. Calculate:
i. the expected return for the corporate bond fund and for
the common share fund
ii. the standard deviation for the corporate bond fund and
for the common share fund
iii. the covariance of the corporate bond fund and the
common share fund
b. Would you invest in the corporate bond fund or the common
share fund? Explain.
5.14 Suppose that in problem 5.13 you wanted to create a
portfolio that consists of a corporate bond fund and a
common share fund.
a. Calculate the portfolio expected return and portfolio risk for
each of the following percentages invested in a corporate
bond fund:
i. 30%
ii. 50%
iii. 70%
b. On the basis of the results of your calculations in (a), which
portfolio would you recommend? Explain.
5.3 BINOMIAL DISTRIBUTION
The next three sections use mathematical models to solve business and other problems.
A mathematical model is a mathematical expression representing a variable of interest.
When a mathematical model of a discrete probability distribution is available, you can easily
calculate the exact probability of occurrence of any particular outcome of the random variable.
The binomial distribution is one of the most important and widely used discrete probability
distributions. The binomial distribution arises when the discrete random variable is the number of
successes in a sample of n observations. The binomial distribution has four essential properties:
1. The sample consists of a fixed number of observations, n.
2. Each observation is classified into one of two mutually exclusive and collectively exhaustive categories, usually called success and failure.
3. The probability of an observation being classified as a success, p, is constant from observation to observation. Thus, the probability of an observation being classified as a failure,
1 – p, is also constant for all observations.
4. The outcome (i.e. success or failure) of any observation is independent of the outcome of
any other observation. To ensure independence, the observations can be randomly
selected either from an infinite population without replacement or from a finite population
with replacement.
LEARNING OBJECTIVE 4
Identify situations that can be modelled by a binomial distribution and calculate binomial probabilities
mathematical model
The mathematical representation of
a random variable.
binomial distribution
Discrete probability distribution,
where the random variable is the
number of successes in a sample
of n observations from either an
infinite population or sampling with
replacement.
The proportion of online enquiries that are converted to bookings is of interest in the Gaia
Adventure Tours scenario, so Yang could define an online enquiry converted to a booking as a
success and an online enquiry that is not converted to a booking as a failure. Yang would then
be interested in the number of successes; that is, the number of online enquiries converted to
bookings in a random sample of n online enquiries. Note: In a binomial distribution, ‘success’
is usually defined as the outcome we are interested in – in this case an online enquiry converted
to a booking.
This is a binomial situation because:
• a fixed number of online enquiries, n, is chosen
• each online enquiry is either converted to a booking – a success – or not converted – a failure
• 10% of online enquiries are converted to bookings, so the probability of a randomly
chosen online enquiry being converted to a booking is p = 0.1 and that of a randomly
chosen online enquiry not being converted to a booking is 1 – p = 0.9
• online enquiries are randomly selected; so the outcome, converted or not converted, of
any enquiry is independent of the outcome of any other enquiry.
If Yang takes a random sample of four online enquiries, the binomial random variable defined
as:
X = number of online enquiries converted to bookings
has a range from 0 to 4, as none, one, two, three or all four enquiries may be converted to
bookings. In general, a binomial random variable has a range from 0 to n.
Suppose that Yang observes the following result in a sample of four enquiries:
First enquiry    Second enquiry    Third enquiry    Fourth enquiry
Converted        Converted         Not converted    Converted
What is the probability of having three successes (converted enquiries) in a sample of four
enquiries in this particular sequence? Because the historical probability of enquiries converted
to bookings is 0.10, the probability that each enquiry occurs in the sequence is:
First enquiry    Second enquiry    Third enquiry    Fourth enquiry
p = 0.1          p = 0.1           1 – p = 0.9      p = 0.1
Each outcome is independent of the others because the enquiries are randomly selected. Therefore, the probability of having this particular sequence is:
\[ p \times p \times (1 - p) \times p = p^3(1 - p) = (0.1)^3(0.9)^1 = 0.0009 \]
This result indicates only the probability of three online enquiries converted to bookings (successes) out of a sample of four online enquiries in a specific sequence.
The number of ways of selecting X objects from n objects irrespective of sequence is given
by the counting rule for combinations introduced in Chapter 4 as Equation 4.14 and as
Equation 5.10 below, introducing a different notation.
COMBINATIONS
The number of combinations of selecting X objects from n objects is given by
\[ \binom{n}{X} = {}_nC_X = \frac{n!}{X!(n - X)!} \tag{5.10} \]
where n factorial is defined by n! = n × (n – 1) × … × 2 × 1 and by definition, 0! = 1.
Using Equation 5.10, we see that there are:
\[ {}_4C_3 = \frac{4!}{3!(4 - 3)!} = 4 \]
sequences of three converted enquiries and one enquiry not converted. The four possible
sequences are:
Sequence 1    Converted        Converted        Converted        Not converted
Sequence 2    Converted        Converted        Not converted    Converted
Sequence 3    Converted        Not converted    Converted        Converted
Sequence 4    Not converted    Converted        Converted        Converted
and the probability of each is:
\[ p^3(1 - p) = (0.1)^3(0.9)^1 = 0.0009 \]
Therefore, the probability of three converted enquiries out of four is equal to:
number of sequences × probability of sequence = 4 × 0.0009 = 0.0036
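The intuitive derivation can be reproduced by brute force: list every possible sequence of four enquiries, keep those with exactly three conversions, and multiply the count by the probability of one such sequence. The sketch below does this with itertools.product; True stands for a converted enquiry, and the variable names are illustrative.

```python
from itertools import product

p = 0.1   # probability that an enquiry is converted to a booking
n = 4     # sample size

# All 2**n sequences of converted (True) / not converted (False),
# filtered to those with exactly three converted enquiries
sequences = [s for s in product([True, False], repeat=n) if sum(s) == 3]

# Each such sequence has probability p^3 (1 - p)^1 because enquiries are independent
prob_each = p ** 3 * (1 - p) ** 1
prob_three = len(sequences) * prob_each

print(len(sequences), round(prob_each, 4), round(prob_three, 4))
```

Enumeration is only practical for small n, since the number of sequences doubles with each extra observation.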
We can make similar, intuitive derivations for the other possible outcomes of the random
variable – zero, one, two and four converted enquiries. However, as n, the sample size, gets
larger, the calculations involved in using this approach become time-consuming. Instead, a
mathematical model provides a formula to calculate any binomial probability. Equation 5.11 is
the mathematical model that represents the binomial probability distribution and is used to calculate the probability of X successes for any given values of n and p.
BINOMIAL PROBABILITY DISTRIBUTION
\[ P(X) = \frac{n!}{X!(n - X)!} p^X (1 - p)^{n - X} \tag{5.11} \]
where P(X) = probability of X successes given n and p
n = number of observations
p = probability of success
1 – p = probability of failure
X = number of successes (X = 0, 1, 2, …, n)
Equation 5.11 restates what we had intuitively derived. The binomial random variable X
can have any integer value from 0 to n. In Equation 5.11 the product:
\[ p^X (1 - p)^{n - X} \]
indicates the probability of exactly X successes out of n observations in a particular sequence.
The term:
\[ \frac{n!}{X!(n - X)!} \]
indicates how many combinations of the X successes out of n observations are possible.
Hence, given the number of observations n and the probability of success p, the probability
of X successes is:
\[
\begin{aligned}
P(X) &= \text{number of sequences} \times \text{probability of sequence} \\
&= \frac{n!}{X!(n - X)!} p^X (1 - p)^{n - X}
\end{aligned}
\]
Example 5.1 illustrates the use of Equation 5.11.
EXAMPLE 5.1
DETERMINING P(X = 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are
three converted enquiries in a sample of four?
SOLUTION
Using Equation 5.11, the probability of three converted enquiries from a sample of four is:
\[ P(X = 3) = \frac{4!}{3!(4 - 3)!} (0.1)^3 (1 - 0.1)^{4 - 3} = 4 \times 0.001 \times 0.9 = 0.0036 \]
Examples 5.2 and 5.3 give the calculations for other values of X.
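Equation 5.11 is short enough to code directly. The sketch below defines a helper, hypothetically named binomial_probability, using math.comb for the number of combinations (Python 3.8 or later is assumed), and reproduces Example 5.1.

```python
import math

def binomial_probability(x, n, p):
    """Equation 5.11: probability of exactly x successes in n observations."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Example 5.1: three converted enquiries in a sample of four, with p = 0.1
prob_three = binomial_probability(3, 4, 0.1)
print(round(prob_three, 4))
```

The same helper can be called with any other X from 0 to n.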
EXAMPLE 5.2
DETERMINING P(X ⩾ 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are at
least three converted enquiries in a sample of four?
SOLUTION
In Example 5.1 we found that the probability of exactly three converted enquiries from a
sample of four is 0.0036. To calculate the probability of at least three converted enquiries,
we need to add the probability of three converted enquiries to the probability of four converted enquiries. The probability of four converted enquiries is:
$$P(X = 4) = \frac{4!}{4!(4 - 4)!}(0.1)^{4}(1 - 0.1)^{4 - 4} = 1 \times 0.0001 \times 1 = 0.0001$$
Thus, the probability of at least three converted enquiries is:
P(X ⩾ 3) = P(X = 3) + P(X = 4) = 0.0036 + 0.0001 = 0.0037
There is a 0.37% chance that there will be at least three converted enquiries in a sample of four.
EXAMPLE 5.3
DETERMINING P(X < 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are
fewer than three converted enquiries in a sample of four?
SOLUTION
The probability that there are fewer than three converted enquiries is:
P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
Use Equation 5.11 to calculate each of these probabilities:
$$P(X = 0) = \frac{4!}{0!(4 - 0)!}(0.1)^{0}(1 - 0.1)^{4 - 0} = 0.6561$$
$$P(X = 1) = \frac{4!}{1!(4 - 1)!}(0.1)^{1}(1 - 0.1)^{4 - 1} = 0.2916$$
$$P(X = 2) = \frac{4!}{2!(4 - 2)!}(0.1)^{2}(1 - 0.1)^{4 - 2} = 0.0486$$
Therefore, P(X < 3) = 0.6561 + 0.2916 + 0.0486 = 0.9963
Alternatively, P(X < 3) can also be calculated from its complement, P(X ⩾ 3), since:
P(X < 3) = 1 − P(X ⩾ 3) = 1 − 0.0037 = 0.9963
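The calculations in Examples 5.2 and 5.3 can be checked with a short Python sketch. It is an illustration only: the helper names `binomial_pmf` and `binomial_cdf` are assumptions rather than anything defined in the text, and the cumulative probability is obtained simply by summing Equation 5.11.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) from Equation 5.11."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x, n, p):
    """P(X <= x): sum of Equation 5.11 over 0, 1, ..., x."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


n, p = 4, 0.1
p_fewer_than_3 = binomial_cdf(2, n, p)  # P(X < 3) = P(X <= 2)
p_at_least_3 = 1 - p_fewer_than_3       # complement rule: P(X >= 3)
print(round(p_fewer_than_3, 4), round(p_at_least_3, 4))  # prints 0.9963 0.0037
```

Because X takes only integer values, P(X < 3) and P(X ⩽ 2) are the same probability, which is why the sketch sums up to X = 2.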
Calculations such as those in Example 5.3 can become tedious, especially as n gets large.
To avoid computational drudgery, many binomial probabilities can be found directly from
Table E.6 (Appendix E), a portion of which is reproduced in Table 5.4. Table E.6 provides
binomial probabilities for X = 0, 1, 2, … , n for selected combinations of n and p. For example,
to find the probability of exactly two successes in a sample of four when the probability of success is 0.1, first find n = 4 and then read off the required probability at the intersection of the
row X = 2 and the column p = 0.10. Thus:
P(X = 2) = 0.0486
Table 5.4 Finding a binomial probability for n = 4, X = 2 and p = 0.1 (extracted from Table E.6)

| n | X | p = 0.01 | p = 0.02 | … | p = 0.10 |
|---|---|----------|----------|---|----------|
| 4 | 0 | 0.9606 | 0.9224 | … | 0.6561 |
|   | 1 | 0.0388 | 0.0753 | … | 0.2916 |
|   | 2 | 0.0006 | 0.0023 | … | 0.0486 |
|   | 3 | 0.0000 | 0.0000 | … | 0.0036 |
|   | 4 | 0.0000 | 0.0000 | … | 0.0001 |
The binomial probabilities given in Table E.6 can also be calculated using Microsoft Excel.
Figure 5.2 presents a Microsoft Excel worksheet for calculating binomial probabilities, using
the Excel 2010 and later inbuilt binomial function BINOM.DIST(number_s,trials,
probability_s,cumulative). For earlier versions of Excel the corresponding binomial function is
BINOMDIST(number_s,trials,probability_s,cumulative).
Figure 5.2 Microsoft Excel worksheet for number of online enquiries converted to bookings example

| Row | A | B | Formula in column B |
|-----|---|---|---------------------|
| 1 | Binomial probabilities | | |
| 3 | Data | | |
| 4 | Sample size | 4 | |
| 5 | Probability of an event of interest | 0.1 | |
| 7 | Statistics | | |
| 8 | Mean | 0.4 | =B4 * B5 |
| 9 | Variance | 0.36 | =B8 * (1-B5) |
| 10 | Standard deviation | 0.6 | =SQRT(B9) |
| 12 | Binomial probabilities table | | |
| 13 | X | P(X) | |
| 14 | 0 | 0.6561 | =BINOM.DIST(A14, $B$4, $B$5, FALSE) |
| 15 | 1 | 0.2916 | =BINOM.DIST(A15, $B$4, $B$5, FALSE) |
| 16 | 2 | 0.0486 | =BINOM.DIST(A16, $B$4, $B$5, FALSE) |
| 17 | 3 | 0.0036 | =BINOM.DIST(A17, $B$4, $B$5, FALSE) |
| 18 | 4 | 0.0001 | =BINOM.DIST(A18, $B$4, $B$5, FALSE) |
The shape of a binomial probability distribution depends on the values of n and p. When
p = 0.5, the binomial distribution is symmetrical, regardless of how large or small the value of
n. When p ≠ 0.5, the distribution is skewed, to the right if p < 0.5 and to the left if p > 0.5. The
closer p is to 0.5 and/or the larger the number of observations n, the less skewed the distribution. For example, the distribution of the number of converted online enquiries is highly skewed
to the right because p = 0.1 and n = 4 (see Figure 5.3).
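One way to make the statement about shape concrete is to compute a skewness coefficient. The sketch below is an illustration that goes slightly beyond the text: it assumes the standard result that the skewness of a binomial distribution is (1 − 2p)/√(np(1 − p)), and the function name `binomial_skewness` is hypothetical. A positive value indicates right skew, a negative value left skew and zero a symmetrical distribution.

```python
from math import sqrt


def binomial_skewness(n: int, p: float) -> float:
    """Skewness coefficient of a binomial distribution (standard result, not from the text)."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))


# Same n with different p, then a larger n with the original p
for n, p in [(4, 0.1), (4, 0.5), (4, 0.9), (100, 0.1)]:
    print(n, p, round(binomial_skewness(n, p), 3))
```

For n = 4 the coefficient changes sign as p moves from 0.1 to 0.9, and for p = 0.1 it shrinks towards zero as n increases from 4 to 100, in line with the description above.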
Figure 5.3 Microsoft Excel graph of the binomial probability distribution with n = 4 and p = 0.1 (vertical axis: P(X), from 0 to 0.7; horizontal axis: number of successes, from 0 to 4)
Substituting the binomial probability equation (5.11) in the expected value equation (5.1)
and using algebra to simplify, it can be shown that the mean of the binomial distribution is
equal to the product of n and p, as shown in Equation 5.12. Therefore, use Equation 5.12 to
calculate the mean of a binomial distribution, instead of Equation 5.1.
THE MEAN OF THE BINOMIAL DISTRIBUTION
The mean μ of the binomial distribution is equal to the sample size n multiplied by the
probability of success p.
$$\mu = E(X) = np \qquad (5.12)$$
Therefore, on average, Yang can theoretically expect E(X) = 4 × 0.1 = 0.4 converted
enquiries in a sample of four.
Similarly, by substituting the binomial probability equation (5.11) in the variance equation
(5.2a or 5.2b) and using algebra to simplify, it can be shown that the standard deviation of the
binomial distribution is given by Equation 5.13.
THE STANDARD DEVIATION OF THE BINOMIAL DISTRIBUTION
$$\sigma = \sqrt{\sigma^{2}} = \sqrt{np(1 - p)} \qquad (5.13)$$
Therefore, using Equation 5.13, the standard deviation of the number of converted enquiries is:
$$\sigma = \sqrt{4(0.1)(0.9)} = 0.60$$
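The shortcut formulas can be checked numerically against the general definitions of the mean and variance of a discrete random variable. The following Python sketch is illustrative only; the variable names are assumptions, and it simply weights each value of X by its binomial probability for n = 4 and p = 0.1.

```python
from math import comb, sqrt

n, p = 4, 0.1
# Binomial probabilities P(X = 0), ..., P(X = n) from Equation 5.11
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# General definitions: probability-weighted sums over all values of X
mean_by_definition = sum(x * px for x, px in enumerate(pmf))
variance_by_definition = sum((x - mean_by_definition) ** 2 * px for x, px in enumerate(pmf))

# Shortcut formulas (Equations 5.12 and 5.13)
mean_shortcut = n * p
sd_shortcut = sqrt(n * p * (1 - p))

print(round(mean_by_definition, 4), round(mean_shortcut, 4))
print(round(sqrt(variance_by_definition), 4), round(sd_shortcut, 4))
```

Both routes should agree up to floating-point rounding, which is why Equations 5.12 and 5.13 can replace the longer summations for any binomial distribution.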
EXAMPLE 5.4
CALCULATING BINOMIAL PROBABILITIES
Accuracy (measured as the percentage of orders consisting of a main item, side item and
drink that are filled correctly) in taking orders at the drive-through window is an important
feature for fast-food chains. Suppose in a recent month that records show that the percentage
of correct orders of this type filled at a Hungry Jack’s franchise was 88%. Suppose three
friends go to the drive-through window at this Hungry Jack’s franchise and each places an
order of the type just mentioned.
• What is the probability that:
– all three orders will be filled correctly?
– none of the three will be filled correctly?
– at least two of the three will be filled correctly?
• What is the average and standard deviation of the number of orders filled correctly?
SOLUTION
There are three orders and the probability of any order being accurate is 0.88. Therefore:
X = number of orders filled correctly = 0, 1, 2, 3
is a binomial random variable with n = 3, p = 0.88.
Using Equations 5.11, 5.12 and 5.13:
$$P(X = 3) = \frac{3!}{3!(3 - 3)!}(0.88)^{3}(1 - 0.88)^{3 - 3} = 1 \times 0.68147\ldots \times 1 = 0.68147\ldots$$
$$P(X = 0) = \frac{3!}{0!(3 - 0)!}(0.88)^{0}(1 - 0.88)^{3 - 0} = 1 \times 1 \times 0.00172\ldots = 0.00172\ldots$$
$$P(X = 2) = \frac{3!}{2!(3 - 2)!}(0.88)^{2}(1 - 0.88)^{3 - 2} = 3 \times 0.7744 \times 0.12 = 0.27878\ldots$$
$$P(X \geq 2) = P(X = 2) + P(X = 3) = 0.27878\ldots + 0.68147\ldots = 0.96025\ldots$$
$$\mu = E(X) = 3 \times 0.88 = 2.64$$
$$\sigma = \sqrt{np(1 - p)} = \sqrt{3 \times 0.88 \times 0.12} = 0.5628\ldots$$
The probability that all three orders are filled correctly is 0.6815. The probability that none
of the orders is filled correctly is 0.0017. The probability that at least two orders are filled
correctly is 0.9603. The mean number of accurate orders filled in a sample of three orders is
2.64 and the standard deviation is 0.563.
This section introduced the binomial distribution and applied it to business and other problems. The binomial distribution plays an important role when it is used in statistical inference
problems involving the estimation or testing of hypotheses about proportions (discussed in
Chapters 8 and 9).
Problems for Section 5.3
Problems 5.15 to 5.24 can be solved manually or by using Microsoft Excel.
Some, but not all, can also be solved using Table E.6.
LEARNING THE BASICS
5.15 If X is a binomial random variable, determine the following:
a. For n = 4 and p = 0.12, what is P(X = 0)?
b. For n = 10 and p = 0.40, what is P(X = 9)?
c. For n = 10 and p = 0.50, what is P(X = 8)?
d. For n = 6 and p = 0.83, what is P(X = 5)?
5.16 If X is a binomial random variable with n = 5 and p = 0.40,
what is the probability that:
a. X = 4?
b. X ⩽ 3?
c. X < 2?
d. X > 1?
5.17 Determine the mean and standard deviation of the
random variable X in each of the following binomial
distributions:
a. n = 4 and p = 0.10
b. n = 4 and p = 0.40
c. n = 5 and p = 0.80
d. n = 3 and p = 0.50
APPLYING THE CONCEPTS
5.18 The increase or decrease in the price of a share between the
beginning and the end of a trading day is assumed to be an
equally likely random event. What is the probability that a share
will show an increase in its closing price on five consecutive days?
5.19 Research has shown that only 60% of consumers read every
word, including the fine print, of a service contract. Assume that
the number of consumers who read every word of a contract
can be modelled using the binomial distribution. A group of five
consumers has just signed a 12-month contract with an ISP
(Internet service provider).
a. What is the probability that:
i. all five will have read every word of their contract?
ii. at least three will have read every word of their contract?
iii. less than two will have read every word of their contract?
b. What would your answers be in (a) if the probability is 0.80
that a consumer reads every word of a service contract?
5.20 A student taking a multiple-choice test consisting of five
questions, each with four options, selects the answers
randomly. What is the probability that the student will get:
a. five questions correct?
b. at least four questions correct?
c. no questions correct?
d. no more than two questions correct?
5.21 In Example 5.4 three friends went to a Hungry Jack’s franchise.
Instead, suppose that they go to a McDonald’s franchise, which
last month filled 90% of orders correctly.
a. What is the probability that:
i. all three orders will be filled correctly?
ii. none of the three will be filled correctly?
iii. at least two of the three will be filled correctly?
b. What is the mean and standard deviation of the number of
orders filled correctly?
5.22 In a certain weekday television show, the winning contestant
has to choose randomly from 20 boxes, one of which contains a
major prize of $100,000.
a. What is the probability that, during a week:
i. no contestant wins the major prize?
ii. exactly one contestant wins the major prize?
iii. no more than two contestants win the major prize?
iv. at least three contestants win the major prize?
b. Calculate the expected number and standard deviation of
winners in a week.
c. How much should the producers budget for major prizes per
week?
5.23 When a customer places an order with Rudy’s On-Line Office
Supplies, a computerised accounting information system (AIS)
automatically checks to see whether the customer has
exceeded their credit limit. Past records indicate that the
probability of customers exceeding their credit limit is 0.05.
Suppose that, in a given half hour, 20 customers place orders.
Assume that the number of customers that the AIS detects as
having exceeded their credit limit is distributed as a binomial
random variable.
a. What are the mean and standard deviation of the number of
customers exceeding their credit limits?
b. What is the probability that no customer will exceed their
limit?
c. What is the probability that one customer will exceed their
limit?
d. What is the probability that two or more customers will
exceed their limits?
5.24 A new drug is found to be effective on 90% of the patients
tested.
a. Is the 90% effective rate best classified as a priori classical
probability, empirical classical probability or subjective
probability?
b. If the drug is administered to 20 randomly chosen patients
at a large hospital, find the probability that it is effective for:
i. fewer than five patients
ii. 10 or more patients
iii. all 20 patients
5.4 POISSON DISTRIBUTION
LEARNING OBJECTIVE 5
Identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities
Poisson distribution
Discrete probability distribution, where the random variable is the number of events in a given interval.
Many studies are based on the number of times a random event occurs in an interval of time or
space. Examples are the number of surface defects on a new refrigerator, the number of network failures in a month or the number of fleas on the body of a dog.
The Poisson distribution can be used to calculate probabilities when counting the number of
times a particular event occurs in an interval of time or space if:
1. the probability an event occurs in any interval is the same for all intervals of the same size
2. the number of occurrences of the event in one non-overlapping interval is independent of
the number in any other interval
3. the probability of two or more occurrences of the event in an interval approaches zero as
the interval becomes smaller.
If these properties hold, then the average or expected number of occurrences over any interval
is proportional to the size of the interval.
Consider the number of online enquiries received by Gaia Adventure Tours. Suppose that
Yang is interested in the number of online enquiries received during a 20-minute interval. Does
this situation match the properties of the Poisson distribution given above? First, define the
random variable as:
X = number of online enquiries received during a 20-minute interval
If enquiries are received randomly, then it is reasonable to assume that the probability an
enquiry is received during a 20-minute interval is the same as the probability for all other
20-minute intervals. Yang can also assume that the receipt of an enquiry during a 20-minute interval has no effect on (i.e. is statistically independent of) the receipt of any other enquiry during
any 20-minute interval. Finally, the probability that two or more enquiries will be received in a
given time period approaches zero as the time interval becomes smaller. For example, the probability is virtually zero that two enquiries will be received in a time interval of 0.001 of a second. Thus, Yang can use the Poisson distribution to determine probabilities involving the
number of online enquiries received in a 20-minute interval.
The Poisson distribution has one parameter, λ (Greek lower-case letter lambda), which is
the mean or expected number of events per interval. The variance of a Poisson distribution is
also equal to λ, hence the standard deviation is equal to √λ. The number of events, X, of the
Poisson random variable ranges from 0 to infinity.
Equation 5.14, the mathematical formula for the Poisson distribution, gives the probability
of X events in an interval, given that λ events are expected.
POISSON PROBABILITY DISTRIBUTION
$$P(X) = \frac{e^{-\lambda} \lambda^{X}}{X!} \qquad (5.14)$$
where P(X) = the probability of X events in a given interval
λ = expected number of events in the given interval
e = 2.71828 … is the base of natural logarithms
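Equation 5.14 can be written as a short function. The sketch below is a hedged illustration in Python rather than anything prescribed by the text: the name `poisson_pmf` and the parameter name `lam` are assumptions. It also checks numerically that the mean and variance both equal λ, truncating the infinite range of X at 60 terms, which is an approximation that is adequate for λ = 10.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when lam events are expected (Equation 5.14)."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam**x / factorial(x)


lam = 10
# X has no upper limit, so the sums are truncated at X = 59 (an assumption:
# the omitted probabilities are negligible when lam = 10).
probs = [poisson_pmf(x, lam) for x in range(60)]
mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))
print(round(mean, 6), round(variance, 6))
```

For a much larger λ the truncation point would need to be raised so that the omitted tail stays negligible.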
To illustrate the use of the Poisson distribution, calculate the probability that in a given
20 minutes exactly five online enquiries will be received, and the probability that less than five
online enquiries will be received. On average, Gaia Adventure Tours receives 30 online enquiries
an hour, so the average or expected number of enquiries in 20 minutes is:
$$\lambda = \frac{30}{60} \times 20 = 10$$
Using Equation 5.14 with λ = 10, the probability that in a given 20 minutes exactly five
online enquiries will be received is:
$$P(X = 5) = \frac{e^{-10} 10^{5}}{5!} = \frac{4.53999\ldots}{120} = 0.03783\ldots$$
and the probability that in any given 20 minutes less than five online enquiries will be received
is:
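$$P(X < 5) = \frac{e^{-10}(10)^{0}}{0!} + \frac{e^{-10}(10)^{1}}{1!} + \frac{e^{-10}(10)^{2}}{2!} + \frac{e^{-10}(10)^{3}}{3!} + \frac{e^{-10}(10)^{4}}{4!}$$
$$= 0.00004\ldots + 0.00045\ldots + 0.00226\ldots + 0.00756\ldots + 0.01891\ldots$$
$$= 0.029252\ldots$$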
Thus, there is a 3% likelihood that less than five online enquiries will be received in 20 minutes,
leading to enquiry staff having significant idle time.
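The worked calculation above can be reproduced in a few lines. This is a minimal Python sketch under the same assumptions as the text (30 enquiries an hour, a 20-minute interval); the names `rate_per_hour`, `interval_minutes` and `poisson_pmf` are hypothetical and chosen only for readability.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) from Equation 5.14."""
    return exp(-lam) * lam**x / factorial(x)


rate_per_hour = 30      # average number of online enquiries per hour
interval_minutes = 20   # length of the interval of interest
lam = rate_per_hour / 60 * interval_minutes  # expected enquiries in the interval: 10

p_exactly_5 = poisson_pmf(5, lam)
p_fewer_than_5 = sum(poisson_pmf(x, lam) for x in range(5))  # X = 0, 1, 2, 3, 4
print(round(p_exactly_5, 4), round(p_fewer_than_5, 4))  # prints 0.0378 0.0293
```

Scaling the hourly rate to the interval first is what makes the mean proportional to the interval length; changing `interval_minutes` to 30, for example, gives λ = 15.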
To avoid the computational drudgery involved in these calculations, many Poisson
probabilities can be found directly from Table E.7 (Appendix E), a portion of which is
reproduced in Table 5.5. Table E.7 provides probabilities for the Poisson random variable
for X = 0, 1, 2, . . . for selected values of the parameter λ. The probability that exactly five
online enquiries will be received in a given 20-minute interval when the mean number of
enquiries received in 20 minutes is 10 is given by the intersection of the row X = 5 and column λ = 10. Therefore, from Table 5.5, P(X = 5) = 0.0378.
Table 5.5 Calculating a Poisson probability for λ = 10 (extracted from Table E.7 in Appendix E of this book)

| X | λ = 9.1 | λ = 9.2 | … | λ = 10 |
|---|---------|---------|---|--------|
| 0 | 0.0001 | 0.0001 | … | 0.0000 |
| 1 | 0.0010 | 0.0009 | … | 0.0005 |
| 2 | 0.0046 | 0.0043 | … | 0.0023 |
| 3 | 0.0140 | 0.0131 | … | 0.0076 |
| 4 | 0.0319 | 0.0302 | … | 0.0189 |
| 5 | 0.0581 | 0.0555 | … | 0.0378 |
| 6 | 0.0881 | 0.0851 | … | 0.0631 |
| 7 | 0.1145 | 0.1118 | … | 0.0901 |
You can also calculate the Poisson probabilities given in Table E.7 using Microsoft Excel.
Figure 5.4 presents a Microsoft Excel worksheet for the Poisson distribution, with λ = 10, using
the Excel 2010 and later inbuilt Poisson function POISSON.DIST(x,mean,cumulative). For
earlier versions of Excel the corresponding Poisson function is POISSON(x,mean,cumulative).
Figure 5.4 Microsoft Excel worksheet for ‘number of online enquiries in 20 minutes’ example

Poisson probabilities
Data
Mean/Expected number of events of interest: 10 (referenced as $E$4 in the formulas below)

Poisson probabilities table

| Row | X | P(X) | Formula |
|-----|---|------|---------|
| 8 | 0 | 0.0000 | =POISSON.DIST(A8, $E$4, FALSE) |
| 9 | 1 | 0.0005 | =POISSON.DIST(A9, $E$4, FALSE) |
| 10 | 2 | 0.0023 | =POISSON.DIST(A10, $E$4, FALSE) |
| 11 | 3 | 0.0076 | =POISSON.DIST(A11, $E$4, FALSE) |
| 12 | 4 | 0.0189 | =POISSON.DIST(A12, $E$4, FALSE) |
| 13 | 5 | 0.0378 | =POISSON.DIST(A13, $E$4, FALSE) |
| 14 | 6 | 0.0631 | =POISSON.DIST(A14, $E$4, FALSE) |
| 15 | 7 | 0.0901 | =POISSON.DIST(A15, $E$4, FALSE) |
| 16 | 8 | 0.1126 | =POISSON.DIST(A16, $E$4, FALSE) |
| 17 | 9 | 0.1251 | =POISSON.DIST(A17, $E$4, FALSE) |
| 18 | 10 | 0.1251 | =POISSON.DIST(A18, $E$4, FALSE) |
| 19 | 11 | 0.1137 | =POISSON.DIST(A19, $E$4, FALSE) |
| 20 | 12 | 0.0948 | =POISSON.DIST(A20, $E$4, FALSE) |
| 21 | 13 | 0.0729 | =POISSON.DIST(A21, $E$4, FALSE) |
| 22 | 14 | 0.0521 | =POISSON.DIST(A22, $E$4, FALSE) |
| 23 | 15 | 0.0347 | =POISSON.DIST(A23, $E$4, FALSE) |
| 24 | 16 | 0.0217 | =POISSON.DIST(A24, $E$4, FALSE) |
| 25 | 17 | 0.0128 | =POISSON.DIST(A25, $E$4, FALSE) |
| 26 | 18 | 0.0071 | =POISSON.DIST(A26, $E$4, FALSE) |
| 27 | 19 | 0.0037 | =POISSON.DIST(A27, $E$4, FALSE) |
| 28 | 20 | 0.0019 | =POISSON.DIST(A28, $E$4, FALSE) |
EXAMPLE 5.5
CALCULATING POISSON PROBABILITIES
The number of faults per month that arise in the gearboxes of a bus fleet is known to follow
a Poisson distribution with a mean of 2.5 faults per month. What is the probability that in a
given month no faults are found? At least one fault is found?
SOLUTION
Using Equation 5.14 with λ = 2.5 (or using Table E.7 or Microsoft Excel), the probabilities
that in a given month no faults are found and at least one fault is found are:
$$P(X = 0) = \frac{e^{-2.5}(2.5)^{0}}{0!} = \frac{0.08208\ldots \times 1}{1} = 0.08208\ldots$$
$$P(X \geq 1) = 1 - P(X = 0) = 1 - 0.08208\ldots = 0.91791\ldots$$
The probability that there will be no faults in a given month is 0.0821. The probability that
there will be at least one fault is 0.9179, which is the complement of there being no faults
in a given month.
EXAMPLE 5.6
CALCULATING POISSON PROBABILITIES
For the Gaia Adventure Tours scenario, what is the probability that 24 or more online
enquiries are received in 30 minutes?
SOLUTION
Let X = number of online enquiries received in 30 minutes, then X is Poisson with
$$\lambda = \frac{30}{60} \times 30 = 15$$
Using Microsoft Excel we can obtain Table 5.6, which gives Poisson probabilities for
λ = 15.
Table 5.6 Poisson probabilities for λ = 15

Enquiries received in 30 minutes
Expected number of enquiries: 15

| X | P(X) | X | P(X) | X | P(X) |
|---|------|---|------|---|------|
| 0 | 0.0000 | 8 | 0.0194 | 16 | 0.0960 |
| 1 | 0.0000 | 9 | 0.0324 | 17 | 0.0847 |
| 2 | 0.0000 | 10 | 0.0486 | 18 | 0.0706 |
| 3 | 0.0002 | 11 | 0.0663 | 19 | 0.0557 |
| 4 | 0.0006 | 12 | 0.0829 | 20 | 0.0418 |
| 5 | 0.0019 | 13 | 0.0956 | 21 | 0.0299 |
| 6 | 0.0048 | 14 | 0.1024 | 22 | 0.0204 |
| 7 | 0.0104 | 15 | 0.1024 | 23 | 0.0133 |
|   |        |    |        | Total | 0.9805 |
From Table 5.6:
P(X ⩾ 24) = 1 − P(X < 24) = 1 − P(X ⩽ 23) = 1 − 0.9805 = 0.0195
Therefore, in approximately 2% of 30-minute intervals 24 or more online enquiries are
expected to be received, hence increasing the likelihood that enquiries start to queue and
may not be answered within the stated 45 minutes.
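Upper-tail probabilities such as those in Examples 5.5 and 5.6 rely on the complement rule, because a Poisson random variable has no upper limit to sum up to. The following Python sketch illustrates the idea; the function names `poisson_pmf` and `poisson_upper_tail` are assumptions made for this illustration.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) from Equation 5.14."""
    return exp(-lam) * lam**x / factorial(x)


def poisson_upper_tail(x, lam):
    """P(X >= x) = 1 - P(X <= x - 1), using the complement rule."""
    return 1 - sum(poisson_pmf(k, lam) for k in range(x))


# Example 5.5: at least one gearbox fault in a month, lam = 2.5
print(round(poisson_upper_tail(1, 2.5), 4))
# Example 5.6: 24 or more enquiries in 30 minutes, lam = 15
print(round(poisson_upper_tail(24, 15), 4))
```

The results should agree with the worked answers above to about four decimal places; small differences can arise because Table 5.6 reports rounded probabilities.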
Problems for Section 5.4
LEARNING THE BASICS
5.25 Assume a Poisson distribution.
a. If λ = 2.5, find P(X = 2).
b. If λ = 8, find P(X = 8).
c. If λ = 0.5, find P(X = 1).
d. If λ = 3.7, find P(X = 0).
5.26 Assume a Poisson distribution.
a. If λ = 2, find P(X ⩾ 2).
b. If λ = 8, find P(X ⩾ 3).
c. If λ = 0.5, find P(X ⩽ 1).
d. If λ = 4, find P(X ⩾ 1).
e. If λ = 5, find P(X ⩽ 3).
5.27 Assume a Poisson distribution with λ = 5. Find the probability
that:
a. X = 1
b. X < 1
c. X > 1
d. X ⩽ 1
APPLYING THE CONCEPTS
Problems 5.28 to 5.32 can be solved manually or by using Microsoft Excel.
Some, but not all, can also be solved using Table E.7.
5.28 The quality control manager of Marilyn’s Bakery is inspecting a
batch of chocolate-chip biscuits that has just been baked. If the
production process is in control, the mean number of chip parts
per biscuit is 6.0. What is the probability that, in any particular
biscuit being inspected:
a. fewer than five chip parts will be found?
b. exactly five chip parts will be found?
c. five or more chip parts will be found?
d. either four or five chip parts will be found?
5.29 Refer to problem 5.28. How many biscuits in a batch of 100
should the manager expect to discard if company policy requires
that all chocolate-chip biscuits sold must have at least four
chocolate-chip parts?
5.30 The number of floods in a certain region is approximately
Poisson distributed with an average of three floods every
10 years.
a. Find the probability that a family living in the area for one
year will experience:
i. exactly one flood
ii. at least one flood
b. Find the probability that a student who moves to the area for
three years will experience
i. exactly one flood
ii. at least one flood
5.31 Based on past experience, it is assumed that the number of
flaws per metre in rolls of grade 2 paper follow a Poisson
distribution with a mean of one flaw per 5 metres of paper.
What is the probability that in a:
a. 1-metre roll there will be at least two flaws?
b. 10-metre roll there will be at least one flaw?
c. 50-metre roll there will be between five and 15 (inclusive)
flaws?
5.32 A toll-free phone number is available from 9 am to 9 pm for
customers to register a complaint about a product purchased
from a large company. Past history indicates that an average of
0.4 calls are received per minute.
a. What properties must be true about the situation described
above in order to use the Poisson distribution to calculate
probabilities concerning the number of phone calls received
in a 1-minute period?
b. Assuming that this situation matches the properties you
discuss in (a), what is the probability that, during a 1-minute
period:
i. zero phone calls will be received?
ii. three or more phone calls will be received?
c. What is the maximum number of phone calls that will be
received in a 1-minute period 99.99% of the time?
5.5 HYPERGEOMETRIC DISTRIBUTION
LEARNING OBJECTIVE 6
Identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities
hypergeometric distribution
Discrete probability distribution where the random variable is the number of successes in a sample of n observations from a finite population without replacement.
The binomial distribution and the hypergeometric distribution are both concerned with the number of successes in a sample of n observations. However, they differ in the way in which the
sample is selected. For the binomial distribution, as the probability of success p must be constant for all observations and the outcome of any particular observation must be independent of
any other, the random sample is either selected with replacement from a finite population or
without replacement from an infinite population. For the hypergeometric distribution, the random sample is selected without replacement from a finite population. Thus, the outcome of one
observation is dependent on the outcomes of previous observations.
Consider a population of size N. Let A represent the total number of successes in the population. The hypergeometric distribution is then used to find the probability of X successes in a
sample of size n selected without replacement. Equation 5.15, the mathematical formula for the
hypergeometric distribution, gives the probability of X successes, given n, N and A.
HYPERGEOMETRIC DISTRIBUTION
$$P(X) = \frac{\binom{A}{X}\binom{N - A}{n - X}}{\binom{N}{n}} \qquad (5.15)$$
where P(X) = the probability of X successes, given n, N and A
n = sample size
N = population size
A = number of successes in the population
N – A = number of failures in the population
X = number of successes in the sample
n – X = number of failures in the sample
$\binom{A}{X} = {}_{A}C_{X}$ (see Equation 5.10)
The number of successes in the sample, represented by X, cannot be greater than the number of successes in the population, A, or the sample size, n. Thus, the range of the hypergeometric random variable is limited to the minimum of the sample size or the number of successes in
the population.
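Equation 5.15 and the range restriction described above can be sketched in Python as follows. This is an illustration only: the function name `hypergeometric_pmf` is an assumption, and the small population used in the demonstration (N = 5, A = 2, n = 2) is hypothetical and not taken from the text.

```python
from math import comb


def hypergeometric_pmf(x: int, n: int, N: int, A: int) -> float:
    """P(X = x) from Equation 5.15 for a sample of n from N items, A of them successes."""
    # X cannot exceed the sample size or the number of successes in the population,
    # and the n - x failures in the sample cannot exceed the N - A failures available.
    if x < max(0, n - (N - A)) or x > min(n, A):
        return 0.0
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)


# Hypothetical population: N = 5 items, A = 2 successes, sample of n = 2
for x in range(3):
    print(x, hypergeometric_pmf(x, n=2, N=5, A=2))
```

With these hypothetical values the three probabilities are 3/10, 6/10 and 1/10, which sum to 1 as a probability distribution must.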
Equation 5.16 defines the mean of the hypergeometric distribution.
THE MEAN OF THE HYPERGEOMETRIC DISTRIBUTION
$$\mu = E(X) = \frac{nA}{N} \qquad (5.16)$$
Equation 5.17 defines the standard deviation of the hypergeometric distribution.
THE STANDARD DEVIATION OF THE HYPERGEOMETRIC DISTRIBUTION
$$\sigma = \sqrt{\frac{nA(N - A)}{N^{2}}} \cdot \sqrt{\frac{N - n}{N - 1}} \qquad (5.17)$$
In Equation 5.17, the expression $\sqrt{\frac{N - n}{N - 1}}$ is a finite population correction factor that results
from sampling without replacement from a finite population.
finite population correction factor
Factor required when sampling from a finite population without replacement.
To illustrate the hypergeometric distribution, suppose that we wish to form a team of eight
executives from different departments within a company. Suppose the company has a total of
30 executives, and 10 of these are from the finance department. If members of the team are to
be selected at random, what is the probability that the team will contain two executives from the
finance department? Here, the population of N = 30 executives within the company is finite. In
addition, A = 10 are from the finance department and a team of n = 8 executives is to be
selected.
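Using Equation 5.15:
$$P(X = 2) = \frac{\binom{10}{2}\binom{20}{6}}{\binom{30}{8}} = \frac{\dfrac{10!}{2!\,8!} \times \dfrac{20!}{6!\,14!}}{\dfrac{30!}{8!\,22!}} = 0.2980\ldots$$
Using Equations 5.16 and 5.17:
$$\mu = E(X) = \frac{8 \times 10}{30} = 2.666\ldots$$
and
$$\sigma = \sqrt{\frac{8 \times 10 \times (30 - 10)}{30^{2}}} \cdot \sqrt{\frac{30 - 8}{30 - 1}} = 1.1613\ldots$$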
Thus, the probability that the team will contain two members from the finance department is
0.298, or 29.8%.
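The team-formation results can be verified with a brief Python sketch. It is illustrative rather than definitive: the function name `hypergeometric_pmf` is an assumption, and the values of N, A and n are those of the example above.

```python
from math import comb, sqrt

N, A, n = 30, 10, 8  # executives in the company, finance executives, team size


def hypergeometric_pmf(x):
    """P(X = x) from Equation 5.15 for the team-formation example."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)


p_two = hypergeometric_pmf(2)
mean = n * A / N                                                  # Equation 5.16
sd = sqrt(n * A * (N - A) / N**2) * sqrt((N - n) / (N - 1))       # Equation 5.17
print(round(p_two, 4), round(mean, 4), round(sd, 4))  # prints 0.298 2.6667 1.1613
```

The second square root in the standard deviation is the finite population correction factor; without it the result would be that of sampling with replacement.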
Such calculations can become tedious, especially as N gets larger. However, Microsoft
Excel can be used to calculate hypergeometric probabilities. Figure 5.5, using the Excel 2010
and later inbuilt hypergeometric function HYPGEOM.DIST(sample_s,number_sample,
population_s,number_population,cumulative), presents a Microsoft Excel worksheet for the
team-formation example. Note that the number of executives from the finance department
(i.e. the number of successes in the sample) can be equal to 0, 1, 2, … 8.
Figure 5.5 Microsoft Excel worksheet for the team-formation example

| Row | A | B | Formula in column B |
|-----|---|---|---------------------|
| 1 | Hypergeometric probabilities | | |
| 3 | Data | | |
| 4 | Sample size | 8 | |
| 5 | No. of events of interest in population | 10 | |
| 6 | Population size | 30 | |
| 8 | Hypergeometric probabilities table | | |
| 9 | X | P(X) | |
| 10 | 0 | 0.0215 | =HYPGEOM.DIST(A10, $B$4, $B$5, $B$6, FALSE) |
| 11 | 1 | 0.1324 | =HYPGEOM.DIST(A11, $B$4, $B$5, $B$6, FALSE) |
| 12 | 2 | 0.2980 | =HYPGEOM.DIST(A12, $B$4, $B$5, $B$6, FALSE) |
| 13 | 3 | 0.3179 | =HYPGEOM.DIST(A13, $B$4, $B$5, $B$6, FALSE) |
| 14 | 4 | 0.1738 | =HYPGEOM.DIST(A14, $B$4, $B$5, $B$6, FALSE) |
| 15 | 5 | 0.0491 | =HYPGEOM.DIST(A15, $B$4, $B$5, $B$6, FALSE) |
| 16 | 6 | 0.0068 | =HYPGEOM.DIST(A16, $B$4, $B$5, $B$6, FALSE) |
| 17 | 7 | 0.0004 | =HYPGEOM.DIST(A17, $B$4, $B$5, $B$6, FALSE) |
| 18 | 8 | 0.0000 | =HYPGEOM.DIST(A18, $B$4, $B$5, $B$6, FALSE) |
For earlier versions of Excel the corresponding hypergeometric function is HYPGEOMDIST
(sample_s,number_sample,population_s,number_pop).
Problems for Section 5.5
LEARNING THE BASICS
5.33 Determine the following:
a. If n = 4, N = 10 and A = 5, find P(X = 3).
b. If n = 4, N = 6 and A = 3, find P(X = 1).
c. If n = 5, N = 12 and A = 3, find P(X = 0).
d. If n = 3, N = 10 and A = 3, find P(X = 3).
5.34 Referring to problem 5.33, calculate the mean and the standard
deviation for the hypergeometric distributions described in (a) to (d).
APPLYING THE CONCEPTS
Problems 5.35 to 5.39 can be solved manually or by using Microsoft Excel.
5.35 An auditor for the Australian Taxation Office is selecting a
sample of six tax returns from a batch of 100 for an audit. If two
or more of these returns contain errors, the entire batch of 100
tax returns will be audited.
a. What is the probability that the entire batch will be audited if
the true number of returns with errors in the batch is:
i. 25?
ii. 30?
iii. 5?
iv. 10?
b. Discuss the differences in your results depending on the true
number of returns in the batch with error.
5.36 The dean of a business faculty wishes to form an executive
committee of five from among the 40 tenured faculty members.
The selection is to be random, and there are eight tenured
faculty members in accounting.
a. What is the probability that the committee will contain:
i. none of them?
ii. at least one of them?
iii. not more than one of them?
b. What is your answer to part (i) above if the committee
consists of seven members?
5.37 In a shipment of 15 hard disks, five are defective. If four of the
disks are inspected,
a. What is the probability that:
i. exactly one is defective?
ii. at least one is defective?
iii. no more than two are defective?
b. What is the mean number of defective hard disks that you
would expect to find in the sample of four hard disks?
5.38 In each game of OZ Lotto seven numbers are selected, from
1 to 45. Seven winning numbers are chosen at random plus
two supplementary numbers. An extension of the
hypergeometric distribution to calculate probabilities of
selecting combinations of winning and supplementary
numbers is:
$$P(X, Y) = \frac{\binom{A}{X}\binom{S}{Y}\binom{N - A - S}{n - X - Y}}{\binom{N}{n}}$$
where P(X,Y) is the probability of selecting X winning numbers
and Y supplementary numbers, and S is the number of
supplementary numbers.
a. To win Division 1, the seven winning numbers must be
selected. In any game, what is the probability of winning
Division 1?
b. To win Division 2, six winning numbers plus either of the two
supplementary numbers must be selected. In any game,
what is the probability of winning Division 2?
c. To win Division 3, six winning numbers must be
selected. In any game, what is the probability of winning
Division 3?
d. To win Division 4, five winning numbers plus either of the
two supplementary numbers must be selected. In any
game, what is the probability of winning Division 4?
e. To win Division 5, five winning numbers must be
selected. In any game, what is the probability of winning
Division 5?
f. To win Division 6, four winning numbers must be
selected. In any game, what is the probability of winning
Division 6?
g. To win Division 7, three winning numbers plus either of the
two supplementary numbers must be selected. In any game,
what is the probability of winning Division 7?
h. What is the probability of selecting none of the winning or
supplementary numbers?
5.39 In a certain game of Lotto six numbers are selected from
1 to 45. Six winning numbers are chosen at random plus
two supplementary numbers. Use the formula in problem
5.38 or Equation 5.15 to calculate the following
probabilities.
a. To win Division 1, the six winning numbers must be
selected. In any game, what is the probability of winning
Division 1?
b. To win Division 2, five winning numbers plus either of the
two supplementary numbers must be selected. In any game,
what is the probability of winning Division 2?
c. To win Division 3, five winning numbers must be selected. In
any game, what is the probability of winning Division 3?
d. To win Division 4, four winning numbers must be
selected. In any game, what is the probability of winning
Division 4?
e. To win Division 5, three winning numbers plus either of the
two supplementary numbers must be selected. In any game,
what is the probability of winning Division 5?
f. What is the probability of selecting none of the winning or
supplementary numbers?
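The extended hypergeometric formula in problem 5.38 can also be evaluated outside Excel. The following is a minimal Python sketch, not part of the text's Excel workflow. The function name lotto_probability and the toy game (choose 3 numbers from 10, with 3 winning and 2 supplementary numbers) are assumptions made only for illustration, so the exercise answers are left to the reader.

```python
from math import comb

def lotto_probability(x, y, n, N, A, S):
    """P(X = x, Y = y): x winning and y supplementary numbers among n selected.

    N = numbers in the game, A = winning numbers, S = supplementary numbers.
    """
    others = n - x - y  # selected numbers that are neither winning nor supplementary
    if x < 0 or y < 0 or others < 0:
        return 0.0
    # math.comb(k, r) returns 0 when r > k, so impossible outcomes get probability 0
    return comb(A, x) * comb(S, y) * comb(N - A - S, others) / comb(N, n)

# Hypothetical toy game (not from the text): n = 3, N = 10, A = 3, S = 2
toy = dict(n=3, N=10, A=3, S=2)
p_all_winning = lotto_probability(3, 0, **toy)

# The probabilities over every possible (x, y) pair should add to 1
total = sum(lotto_probability(x, y, **toy) for x in range(4) for y in range(3))
```

For a real game, replace the toy values with the n, N, A and S given in the problem and call the function with the required X and Y.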
Assess your progress
Summary
This chapter introduced mathematical expectation, covariance and
the development and application of the binomial, Poisson and
hypergeometric distributions. In the Gaia Adventure Tours scenario,
we saw how to calculate probabilities from the binomial and
Poisson distributions concerning the number of online enquiries
converted to bookings in a sample of n enquiries and the number of
online enquiries received in a given time interval. In the next chapter,
important continuous distributions are introduced, in particular the
normal distribution.
To help decide which discrete probability distribution to use for
a particular situation, we need to ask the following questions:
• Is there a fixed number of observations n, each of which is
classified as success or failure, or are we counting the number
of times an event happens in an interval of time or space? If
there is a fixed number of observations n, each of which is
classified as success or failure, we can use the binomial or
hypergeometric distribution, if the properties of the distribution are satisfied. If we are counting the number of events in
an interval, we can use the Poisson distribution only if all its
properties are satisfied.
• In deciding whether to use the binomial or hypergeometric
distribution, is the probability of success constant for all
observations? If yes, we may be able to use the binomial
distribution. If no, we may be able to use the hypergeometric
distribution.
Key formulas
Expected value μ of a discrete random variable
\[ \mu = E(X) = \sum_{i=1}^{N} X_i P(X_i) \tag{5.1} \]
Variance of a discrete random variable
\[ \sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2 P(X_i) \tag{5.2a, definition} \]
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - E(X)^2 \tag{5.2b, calculation} \]
Standard deviation of a discrete random variable
\[ \sigma = \sqrt{\sigma^2} \tag{5.3} \]
Covariance
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} [X_i - E(X)][Y_j - E(Y)] P(X_i \text{ and } Y_j) \tag{5.4a, definition} \]
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} X_i Y_j P(X_i \text{ and } Y_j) - E(X)E(Y) \tag{5.4b, calculation} \]
Expected value of the sum of two random variables
\[ E(X + Y) = E(X) + E(Y) \tag{5.5} \]
Variance of the sum of two random variables
\[ \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\sigma_{XY} \tag{5.6} \]
Standard deviation of the sum of two random variables
\[ \sigma_{X+Y} = \sqrt{\sigma^2_{X+Y}} \tag{5.7} \]
Portfolio expected return
\[ E(P) = wE(X) + (1 - w)E(Y) \tag{5.8} \]
Portfolio risk
\[ \sigma_p = \sqrt{w^2\sigma^2_X + (1 - w)^2\sigma^2_Y + 2w(1 - w)\sigma_{XY}} \tag{5.9} \]
Combinations
\[ \binom{n}{X} = {}_nC_X = \frac{n!}{X!(n - X)!} \tag{5.10} \]
Binomial distribution
\[ P(X) = \frac{n!}{X!(n - X)!} p^X (1 - p)^{n - X} \tag{5.11} \]
The mean of the binomial distribution
\[ \mu = E(X) = np \tag{5.12} \]
The standard deviation of the binomial distribution
\[ \sigma = \sqrt{\sigma^2} = \sqrt{np(1 - p)} \tag{5.13} \]
Poisson distribution
\[ P(X) = \frac{e^{-\lambda}\lambda^X}{X!} \tag{5.14} \]
Hypergeometric distribution
\[ P(X) = \frac{\binom{A}{X}\binom{N-A}{n-X}}{\binom{N}{n}} \tag{5.15} \]
The mean of the hypergeometric distribution
\[ \mu = E(X) = \frac{nA}{N} \tag{5.16} \]
The standard deviation of the hypergeometric distribution
\[ \sigma = \sqrt{\frac{nA(N - A)}{N^2}} \cdot \sqrt{\frac{N - n}{N - 1}} \tag{5.17} \]
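Equations 5.1 to 5.9 can be checked numerically. The sketch below is a Python illustration only (the text itself works in Excel), and the function names are our own. It uses the P, X and Y values of the three-outcome investment example shown in the Portfolio worksheet of the Excel Guide, with a weight of 0.5 assigned to X, and it assumes that each outcome pairs one X value with one Y value, so the double sum in Equation 5.4a reduces to a single sum over outcomes.

```python
from math import sqrt

def expected_value(values, probs):
    # Equation 5.1: sum of X_i * P(X_i)
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    # Equation 5.2a: sum of [X_i - E(X)]^2 * P(X_i)
    mu = expected_value(values, probs)
    return sum((v - mu) ** 2 * p for v, p in zip(values, probs))

def covariance(xs, ys, probs):
    # Equation 5.4a, assuming outcome i pairs X_i with Y_i
    mx, my = expected_value(xs, probs), expected_value(ys, probs)
    return sum((x - mx) * (y - my) * p for x, y, p in zip(xs, ys, probs))

def portfolio(xs, ys, probs, w):
    # Equations 5.8 and 5.9: expected return and risk for weight w on X
    exp_return = w * expected_value(xs, probs) + (1 - w) * expected_value(ys, probs)
    var_p = (w ** 2 * variance(xs, probs)
             + (1 - w) ** 2 * variance(ys, probs)
             + 2 * w * (1 - w) * covariance(xs, ys, probs))
    return exp_return, sqrt(var_p)

probs = [0.2, 0.5, 0.3]
xs = [-100, 100, 250]
ys = [200, 50, -100]

mean_x, mean_y = expected_value(xs, probs), expected_value(ys, probs)
var_x, var_y = variance(xs, probs), variance(ys, probs)
cov_xy = covariance(xs, ys, probs)
exp_return, risk = portfolio(xs, ys, probs, w=0.5)
```

With these inputs the results agree with the worksheet values: E(X) = 105, E(Y) = 35, variances of 14,725 and 11,025, covariance of –12,675, portfolio expected return of 70 and portfolio risk of 10.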
Key terms
binomial distribution 189
covariance 185
expected value of a discrete random variable 182
expected value of the sum of two random variables 186
finite population correction factor 201
hypergeometric distribution 200
mathematical model 189
Poisson distribution 196
portfolio 187
portfolio expected return 187
portfolio risk 187
probability distribution for a discrete random variable 181
standard deviation of a discrete random variable 183
standard deviation of the sum of two random variables 186
variance of a discrete random variable 183
variance of the sum of two random variables 186
Chapter review problems
CHECKING YOUR UNDERSTANDING
5.40 What is the meaning of the expected value of a probability
distribution?
5.41 What are the four properties of a binomial distribution?
5.42 What are the three properties of a Poisson distribution?
5.43 When is the hypergeometric distribution used instead of the
binomial distribution?
APPLYING THE CONCEPTS
Problems 5.44 to 5.53 can be solved manually or by using Microsoft Excel.
Some, but not all, can also be solved using Tables E.6 and E.7.
5.44 From September 1984 to July 2017 the ASX All Ordinaries
Index has opened higher than the previous month for 233 of
the 395 months – that is, approximately 59.0% of months
(Data from YAHOO!7FINANCE <http://au.finance.yahoo.com>
accessed July 2017).
a. Assuming a binomial distribution, estimate the probability
that the ASX All Ordinaries Index will open higher than the
previous month:
i. for one month
ii. for two months in a row
iii. in four of the next five months
iv. in none of the next five years
b. For the situation in (a) above, what assumption of the
binomial distribution might not be valid?
5.45 At a recent election, 12% of the voters in a certain electorate
gave their first preference to the Greens candidate. If 10 people
on the electoral roll for that electorate were randomly selected,
find the probability that:
a. exactly four gave their first preference to the Greens
candidate
b. at most four gave their first preference to the Greens
candidate
c. a majority gave their first preference to the Greens
candidate
5.46 When calculating premiums on life insurance products,
insurance companies often use life tables that enable the
probability of a person dying in any age interval to be
calculated.
The following data obtained from the ‘New Zealand Abridged
Period Life Table: 2014-16’ gives the number out of 100,000 New
Zealand-born males and females who are still alive during each
five-year period of life between age 20 and 60 (inclusive).
Number alive at exact age
Exact age (years)   Out of 100,000 females born   Out of 100,000 males born
20                  99,288                        99,031
25                  99,128                        98,685
30                  98,949                        98,312
35                  98,726                        97,899
40                  98,427                        97,381
45                  97,934                        96,649
50                  97,157                        95,548
55                  95,933                        93,853
60                  94,162                        91,352
Data obtained from <www.stats.govt.nz> accessed June 2017. © Statistics
New Zealand and licensed by Statistics New Zealand for re-use under the
Creative Commons Attribution 3.0 New Zealand licence
Suppose a New Zealand-born female on her 35th birthday
purchases a one million dollar, five-year term life policy
from an insurance company. That is, the insurance company
must pay her estate $1 million if she dies within the next
five years.
a. Determine the insurance company’s expected payout on
this policy.
b. What would be the minimum you would expect the
insurance company to charge her for this policy?
c. What would the expected payout be if the same policy
were taken out by a New Zealand-born female on her
40th birthday?
d. Repeat parts (a) to (c) for a New Zealand-born male.
5.47 The emergency facility at a small country hospital has been in
operation for 60 weeks and has been used 120 times. The
weekly pattern of demand for this facility has a Poisson
distribution. Find the:
a. mean demand per week
b. probability the emergency facility is not used in a given week
c. probability the emergency facility is used at least twice in a week
d. probability the room is used at least once in a given two-week period
5.48 Check$mart’s records show that 58% of its customers pay only
the minimum repayment on their credit card each month.
a. If a random sample of 20 credit-card holders is selected,
what is the probability that:
i. none pays the minimum amount?
ii. no more than five pay the minimum amount?
iii. more than 10 pay the minimum amount?
b. What assumptions did you have to make to answer each
part of (a) above?
5.49 In 2016, the New Zealand general marriage rate was 10.95
marriages and civil unions per 1,000 population 16 years and
over who are not married or in a civil union. The corresponding
divorce rate was 8.7 per 1,000 existing marriages and civil
unions (data obtained from Marriage, Civil Unions and Divorces:
Year ended December 2016, Statistics New Zealand
<www.stats.govt.nz> accessed July 2017).
a. Suppose 60 unmarried women were randomly selected on
1 January 2016.
i. Find the probability that at least three married, including
civil unions, during 2016.
ii. Find the probability that at most two married, including
civil unions, during 2016.
iii. What is the mean and standard deviation of the number
who married during 2016?
b. Suppose 60 married couples were randomly selected on 1
January 2016.
i. Find the probability that none divorced during 2016.
ii. Find the probability that at most two divorced during 2016.
iii. What is the mean and standard deviation of the number
of divorces during 2016?
5.50 A customer service manager of Check$mart bank is monitoring
one of its phone banking call centres servicing a rural region.
Suppose that on average the call centre receives 180 calls
an hour during its operating hours of 8 am to 6 pm.
a. Can the Poisson distribution be used to model the number
of calls received in one minute? Explain.
b. Assuming the number of calls received in a given interval is
Poisson, calculate the probability that:
i. in a given minute exactly two calls will be received
ii. more than two calls will be received in a minute
iii. the number of calls received in 5 minutes is at least 20
iv. the number of calls received in 5 minutes is less than 10
At current staffing levels calls start to queue, increasing
the time it takes to answer a call, when the number of calls
received in 5 minutes is 20 or more. However, when there are
fewer than 10 calls in 5 minutes, more than one Customer
Service Officer is usually available, increasing unproductive
staff time.
c. What conclusions can you draw from problem (b) parts (iii)
and (iv) above?
5.51 Suppose the average number of students who log on to a
university’s computer system is 4.45 in each 5-minute interval.
a. What is the probability that six students will log on in the
next minute?
b. What is the probability that fewer than six students will log
on during the next two minutes?
5.52 A study of various news home pages reports that the mean
number of bad links per home page is 0.4 and the mean
number of spelling errors per home page is 0.16. Use the
Poisson distribution to find the probability that a randomly
selected home page will contain:
a. no bad links
b. five or more bad links
c. no spelling errors
d. 10 or more spelling errors
5.53 In an online test, 10 multiple-choice questions are randomly
selected from a test bank of 100 questions.
Supposing that each student has two attempts at the online
test, what is the probability that in the second test a student
attempts there are:
a. no questions from the first test?
b. at least one question from the first test?
c. exactly five questions from the first test?
d. 10 questions from the first test?
5.54 The following table gives the grade distribution at a certain
university.
Fail   Pass   Credit   Distinction   High Distinction
15%    40%    25%      15%           5%
Supposing that a result is selected randomly, what is the
probability that:
a. the result is a passing grade (Pass or above)?
b. the result is a Credit or above?
c. If a random sample of 15 results is selected, what is the
probability of:
i. exactly three Fails?
ii. more than five Fails?
iii. all being Pass or above?
iv. none being Credit or above?
v. exactly five being Credits or above?
vi. at least one Distinction or High Distinction?
d. Based on the random sample of 15 results, what is the
expected number, variance and standard deviation of the
number of:
i. Fail grades?
ii. grades Pass or above?
e. Comment on the relationship between (i) and (ii) in part (d)
above.
A grade point of 7 is assigned to each High Distinction, 6
to each Distinction, 5 to each Credit, 4 to each Pass and 0 to
each Fail.
f. What is the average, variance and standard deviation of
grade points for the university?
5.55 You are trying to develop a strategy for investing in two different
shares. The anticipated annual return for a $1,000 investment
in each share has the following probability distribution:
              Returns
Probability   Share A   Share B
0.25          $240      –$100
0.50          $150      $150
0.25          –$100     $240
a. Calculate:
i. the expected returns for share A and for share B
ii. the variances and standard deviations for share A and
for share B
iii. the covariance of share A and share B
b. Suppose you want to create a portfolio that consists of
share A and share B. Calculate the portfolio expected return
and risk if the proportion invested in share A is:
i. 0.40
ii. 0.50
iii. 0.60
c. On the basis of the results in (b), which portfolio would you
recommend? Explain.
5.56 The breakdown by home address of the previous year’s 993
drink-driving offences in Problem 2.67 is:
Home address                            Number of drink-driving offences
Local – in council area
  Seaside town                          151
  Not seaside town                      462
Not local – not in council area
  Intrastate (within state)             130
  Interstate (another state)            228
  International (outside Australia)     22
Suppose that Kai randomly selects 20 of the offenders to
interview in depth. What is the probability that:
a. all 20 will be local?
b. 15 will be local?
c. five will be from interstate?
d. at least 10 will not be local?
5.57 Past data indicate that 6% of all students enrolled in a first-year
statistics unit at Tasman University obtain a High
Distinction (HD). Assume that students are allocated randomly
to a tutorial group.
a. What is the probability that in a tutorial group of 30 students:
i. none receive an HD?
ii. at most, two students obtain an HD?
iii. more than four students obtain an HD?
b. What is the mean and standard deviation of the number of
HDs obtained in a tutorial group?
5.58 In a regional city, on average 2.6 traffic accidents are reported
an hour from 7 am to 7 pm.
On a given day, what is the probability that:
a. four accidents are reported from 9 am to 9.30 am?
b. five accidents are reported from 2 pm to 4 pm?
c. three or four accidents are reported from 2 pm to 3 pm?
d. at least one accident is reported from 4 pm to 5.30 pm?
5.59 A hand of five cards is dealt from a shuffled standard pack of
52 cards.
Find the probability that:
a. all the cards are red
b. exactly two of the cards are red
c. at least one card is red
d. the hand contains four kings
e. the hand has at least one king
f. all the cards are hearts
g. the cards are all the same suit
5.60 Pat’s Used Cars sells on average 3.6 used cars in a normal
trading day. Assume the number of used cars sold follows a
Poisson distribution.
Determine:
a. the probability that five used cars are sold in a day
b. the probability that no more than two used cars are sold in a day
c. the expected number and standard deviation of used cars
sold in a 10-day period
5.61 The Ashland MultiComm Services (AMS) marketing
department wants to increase subscriptions for a combined
telephone, pay TV and Internet bundle called 3-For-All. AMS
marketing has been conducting an aggressive direct-marketing campaign that includes postal and electronic
mailings and telephone solicitations. Feedback from these
efforts indicates that including premium channels of the
customer’s choice in this bundle is a very important factor for
both current and prospective subscribers. After several
brainstorming sessions, the marketing department has
decided to add premium channels as a no-cost benefit of
subscribing to 3-For-All.
The research director, Mona Fields, is planning to conduct
a survey among prospective customers to determine how
many premium channels need to be added to 3-For-All in
order to generate increased subscriptions. Based on past
campaigns and on industry-wide data, she estimates the
following:
Number of free premium channels   Probability of subscriptions
0                                 0.020
1                                 0.040
2                                 0.060
3                                 0.070
4                                 0.080
5                                 0.085
a. If a sample of 50 prospective customers is selected and no
free premium channels are included in the 3-For-All bundle,
given the above probability estimates, what is the
probability that:
i. fewer than three customers will subscribe to 3-For-All?
ii. at most one customer will subscribe to 3-For-All?
iii. more than four customers will subscribe to 3-For-All?
iv. Suppose that in the survey of 50 prospective customers,
five customers subscribe to 3-For-All. What does this tell
you about the estimate of the proportion of customers
who would subscribe to 3-For-All if no free premium
channels are included?
b. Instead of offering no free premium channels, as in part (a),
suppose that two free premium channels of the customer’s
choice are included in the 3-For-All bundle. Given the above
probability estimates, what is the probability that:
i. fewer than three customers will subscribe to 3-For-All?
ii. at most one customer will subscribe to 3-For-All?
iii. more than four customers will subscribe to 3-For-All?
c. Compare the results of (b) to those of (a).
d. Suppose that in a survey of 50 prospective customers where
two free premium channels of the customer’s choice are
included in the 3-For-All offer, five customers subscribe.
What does this tell you about the estimate of the proportion
of customers who would subscribe to 3-For-All if two free
premium channels are included?
e. What do the above results tell you about the effect of
offering free premium channels of the customer’s choice on
the likelihood of obtaining subscriptions to 3-For-All?
Chapter 5 Excel Guide
EG5.1 THE PROBABILITY DISTRIBUTION
FOR A DISCRETE VARIABLE
Key technique
Use the SUMPRODUCT(cell range 1, cell range 2) function to calculate the expected value and variance.
Example
Calculate the expected value, variance, and standard deviation for the number of home mortgages approved per week
data given in Table 5.1 on page 182.
In-depth Excel
Use the Discrete_Variable workbook as a model.
For the example, open to the DATA worksheet of the
Discrete_Variable workbook. The worksheet already contains the entries needed to calculate the expected value,
variance, and standard deviation (shown in the COMPUTE
worksheet) for the example.
For other problems, enter the probability distribution data into columns A and B of the DATA worksheet,
overwriting the existing entries. If required, extend columns C and D, first selecting cell range C7:D7 and then
copying the cell range down as many rows as necessary.
If the probability distribution has fewer than six outcomes, select the rows that contain the extra, unwanted
outcomes, right-click, and then click Delete in the shortcut menu.
EG5.2 COVARIANCE OF A PROBABILITY DISTRIBUTION
AND ITS APPLICATION IN FINANCE
Key technique
Use the SQRT and SUMPRODUCT functions to calculate
the portfolio analysis statistics.
Example
Perform the portfolio analysis for the Section 5.2
investment example.
PHStat
Use Covariance and Portfolio Analysis.
For the example, select PHStat ➔ Decision-Making ➔
Covariance and Portfolio Analysis. In the Covariance and
Portfolio Management dialog box (shown in Figure EG5.1):
1. Enter 3 as the Number of Outcomes.
2. Enter a Title, check Portfolio Management Analysis and click OK.
Figure EG5.1 Covariance and Portfolio Management dialog box
In the new worksheet (shown in Figure EG5.2):
1. Enter the probabilities and outcomes in the table
that begins in cell B3.
2. Enter 0.5 as the Weight assigned to X.
Covariance analysis
Probabilities & outcomes:
P      X       Y
0.2    –100    200
0.5    100     50
0.3    250     –100
Weight assigned to X: 0.5
Statistics                   Value        Formula
E(X)                         105          =SUMPRODUCT(B4:B6,C4:C6)
E(Y)                         35           =SUMPRODUCT(B4:B6,D4:D6)
Variance(X)                  14725        =SUMPRODUCT(B4:B6,G13:$G$15)
Standard deviation(X)        121.346611   =SQRT(B13)
Variance(Y)                  11025        =SUMPRODUCT(B4:B6,H13:$H$15)
Standard deviation(Y)        105          =SQRT(B15)
Covariance(XY)               –12675       =SUMPRODUCT(B4:B6,I13:$I$15)
Variance(X+Y)                400          =B13+B15+2*B17
Standard deviation(X+Y)      20           =SQRT(B18)
Portfolio management
Weight assigned to X         0.5          =B8
Weight assigned to Y         0.5          =1-B22
Portfolio expected return    70           =B22*B11+B23*B12
Portfolio risk               10           =SQRT(B22^2*B13+B23^2*B15+2*B22*B23*B17)
Figure EG5.2 COMPUTE worksheet of Portfolio workbook
In-depth Excel
Use the COMPUTE worksheet of the Portfolio workbook as a template.
The worksheet (shown in Figure EG5.2) already contains the data for the example. Overwrite the P, X and Y
values and the weight assigned to X when you enter data for
other problems. If a problem has more or fewer than three
outcomes, first select row 5, right-click, and click Insert
(or Delete) in the shortcut menu to insert (or delete) rows
one at a time. If you insert rows, select the cell range F4:J4
and copy the contents of this range down through the new
table rows.
The worksheet also contains a Calculations Area that
contains various intermediate calculations. Open the
COMPUTE_FORMULAS worksheet to examine all the
formulas used in this area.
EG5.3 BINOMIAL DISTRIBUTION
Key technique
Use the BINOM.DIST(number of events of interest, sample
size, probability of an event of interest, FALSE) function.
Example
Calculate the binomial probabilities for n = 4 and p = 0.1,
given in Figure 5.2 for the ‘number of online enquiries converted to bookings’ problem.
PHStat
Use Binomial.
For the example, select PHStat ➔ Probability &
Prob. Distributions ➔ Binomial. In the Binomial Probability Distribution dialog box (shown in Figure EG5.3):
1. Enter 4 as the Sample Size.
2. Enter 0.1 as the Prob. of an Event of Interest.
3. Enter 0 as the Outcomes From value and enter 4
as the (Outcomes) To value.
4. Enter a Title, check Histogram, and click OK.
Figure EG5.3 Binomial Probability Distribution dialog box
Check Cumulative Probabilities before clicking OK in
step 4 to have the procedure include columns for P(≤X),
P(<X), P(>X), and P(≥X) in the binomial probabilities
table.
In-depth Excel
Use the Binomial workbook as a template and model.
For the example, open to the COMPUTE worksheet
of the Binomial workbook, shown in Figure 5.2 on
page 193. The worksheet already contains the entries
needed for the example. For other problems, change the
sample size in cell B4 and the probability of an event of
interest in cell B5. If necessary, extend the binomial probabilities table by selecting cell range A18:B18 and then copying the cell range down as many rows as necessary.
Use the CUMULATIVE worksheet if you require
cumulative probabilities. Use the CUMULATIVE_OLDER
worksheet if using a version of Excel before Excel 2010.
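For readers working outside Excel, the binomial probabilities for the example (n = 4, p = 0.1) can be reproduced from Equation 5.11. The following is a minimal Python sketch. The function names binom_pmf and binom_cdf are our own and are not Excel or PHStat functions; they play roughly the role of BINOM.DIST with the last argument set to FALSE and TRUE respectively.

```python
from math import comb

def binom_pmf(x, n, p):
    # Equation 5.11: P(X = x) for sample size n and probability p of an event of interest
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    # Cumulative probability P(X <= x), built by summing Equation 5.11
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 4, 0.1
# One row per outcome: (X, P(X), P(<=X))
table = [(x, binom_pmf(x, n, p), binom_cdf(x, n, p)) for x in range(n + 1)]

for x, prob, cum in table:
    print(f"X = {x}: P(X) = {prob:.4f}, P(<=X) = {cum:.4f}")
```

The first row gives P(X = 0) = 0.9⁴ = 0.6561, and the cumulative column reaches 1 at X = 4.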
EG5.4 POISSON DISTRIBUTION
Key technique
Use the POISSON.DIST(number of events of interest, the
average or expected number of events of interest, FALSE)
function.
Example
Calculate the Poisson probabilities for the ‘number of online
enquiries received in 20 minutes’ problem with λ = 10, as
in Figure 5.4 on page 198.
PHStat
Use Poisson.
For the example, select PHStat ➔ Probability &
Prob. Distributions ➔ Poisson. In the Poisson Probability
Distribution dialog box (shown in Figure EG5.4):
1. Enter 10 as the Mean/Expected No. of Events of
Interest.
2. Enter a Title and click OK.
Figure EG5.4 Poisson Probability Distribution dialog box
Check Cumulative Probabilities before clicking OK in
step 2 to have the procedure include columns for P(≤X),
P(<X), P(>X), and P(≥X) in the Poisson probabilities
table. Check Histogram to construct a histogram of the
Poisson probability distribution.
In-depth Excel
Use the Poisson workbook as a template.
For the example, open to the COMPUTE worksheet
of the Poisson workbook, shown in Figure 5.4 on page
198. The worksheet already contains the entries for the
example. For other problems, change the mean or expected
number of events of interest in cell E4. If necessary, extend
the Poisson probabilities table by selecting cell range
A28:B28 and then copying the cell range down as many
rows as necessary.
Use the CUMULATIVE worksheet if you require
cumulative probabilities. Use the CUMULATIVE_OLDER
worksheet if using a version of Excel before Excel 2010.
EG5.5 HYPERGEOMETRIC DISTRIBUTION
Key technique
Use the HYPGEOM.DIST(X, sample size, number of
events of interest in the population, population size,
FALSE) function.
Example
Calculate the hypergeometric probabilities for the team formation problem in Figure 5.5 on page 202.
PHStat
Use Hypergeometric.
For the example, select PHStat ➔ Probability &
Prob. Distributions ➔ Hypergeometric. In this procedure’s dialog box (shown in Figure EG5.5):
1. Enter 8 as the Sample Size.
2. Enter 10 as the No. of Events of Interest in Pop.
3. Enter 30 as the Population Size.
4. Enter a Title and click OK.
Figure EG5.5 Hypergeometric Probability Distribution dialog box
Check Histogram to produce a histogram of the probability distribution.
In-depth Excel
Use the Hypergeometric workbook as a template.
For the example, open to the COMPUTE worksheet
of the Hypergeometric workbook, shown in Figure 5.5 on
page 202. The worksheet already contains the entries for
the example. For other problems, change the sample size in
cell B4, the number of events of interest in the population
in cell B5, and the population size in cell B6. If necessary,
extend the hypergeometric probabilities table by selecting
cell range A18:B18 and then copying the cell range down
as many rows as necessary.
Use the CUMULATIVE worksheet if you require
cumulative probabilities. Use the CUMULATIVE_OLDER
worksheet if using a version of Excel before Excel 2010.
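The Poisson and hypergeometric probabilities in these two examples can also be calculated directly from Equations 5.14 and 5.15. The following Python sketch is an illustration only. The function names are our own (they are not Excel or PHStat functions), and the inputs are the values entered in the PHStat steps above: λ = 10 for the Poisson example, and a sample size of 8, 10 events of interest in the population and a population size of 30 for the hypergeometric example.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    # Equation 5.14: P(X = x) when the expected number of events is lam
    return exp(-lam) * lam ** x / factorial(x)

def hypergeom_pmf(x, n, A, N):
    # Equation 5.15: x events of interest in a sample of n, drawn without
    # replacement from a population of N that contains A events of interest
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

# Poisson example: lambda = 10
poisson_table = [(x, poisson_pmf(x, 10)) for x in range(31)]

# Hypergeometric example: n = 8, A = 10, N = 30
n, A, N = 8, 10, 30
hyper_table = [(x, hypergeom_pmf(x, n, A, N)) for x in range(n + 1)]

# The mean computed from the table should match Equation 5.16, nA/N
hyper_mean = sum(x * prob for x, prob in hyper_table)
```

Summing either table is a quick check on the inputs: the hypergeometric probabilities add to 1, and the Poisson probabilities approach 1 as more rows are included.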
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 6
The normal distribution and other continuous distributions
TASMAN UNIVERSITY ORIENTATION
As part of orientation activities, new students at Tasman University (TU) are encouraged to
complete a ‘Welcome to Tasman University (TU)’ online program.
To assess the success – or otherwise – of this program, data have been collected on the
time a new student spends working through it.
The data suggest that the time students spend on the first module in the program, ‘Introduction to
TU’, is normally distributed with a mean of 7 minutes and a standard deviation of 2 minutes. From the data,
the time students spend on another module in the program, ‘Support at TU’, is also normal, but
with a mean of 4 minutes and a standard deviation of 1 minute.
How can the orientation organisers use these data to answer questions about the time students
spend on the ‘Introduction to TU’ and ‘Support at TU’ modules?
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 calculate probabilities from the normal distribution
2 determine whether a set of data is approximately normally distributed
3 calculate probabilities from the uniform distribution
4 calculate probabilities from the exponential distribution
5 use the normal distribution to approximate probabilities from the binomial distribution
In the Gaia Adventure Tours scenario in Chapter 5, Yang wanted to solve problems about the
number of occurrences of an outcome in a given sample size or the number of events in a
specified interval. A different task is faced in the Tasman University Orientation scenario, one
that involves a continuous measurement since the time students spend on the ‘Introduction to
TU’ module can be any positive value, not just an integer value. How, then, can the orientation
organisers answer questions about continuous numerical variables such as:
• What proportion of students spend more than 9 minutes on the ‘Introduction to TU’
module?
• 10% of students spend less than how long on the module?
• What is the probability that a randomly chosen student accessing the module spends less
than 3.5 minutes on it?
As in Chapter 5, we use probability distributions as models. This chapter introduces the
characteristics of a continuous probability distribution and then uses the normal, uniform and
exponential distributions to solve business and other problems.
6.1 CONTINUOUS PROBABILITY DISTRIBUTIONS
Chapter 5 discussed discrete random variables and probability distributions. In this chapter we
look at continuous random variables and probability distributions. Continuous random variables
arise from a measuring process where the response can take on any value within a continuum or
interval; for example time, temperature, weight, height, revenue or cost.
A continuous probability density function, represented by f(x), is the mathematical expression that defines the distribution of the values for a continuous random variable. Figure 6.1
graphically displays the three continuous probability density functions discussed in this
chapter. Panel A depicts a normal distribution. The normal distribution is symmetrical and
bell shaped, implying that most values tend to cluster around the mean, which, due to its
symmetry, is equal to the median. Although the values in a normal distribution can range
from negative infinity to positive infinity, the shape of the distribution makes it very unlikely
that extremely large or extremely small values will occur. Panel B depicts a uniform distribution, where a value is equally likely to occur anywhere in
the range between the smallest value a and the largest value b. Sometimes referred to as the
rectangular distribution, the uniform distribution is symmetrical and therefore the mean
equals the median. An exponential distribution is illustrated in panel C. This distribution is
skewed to the right, with the mean larger than the median. The range for an exponential
distribution is zero to positive infinity but its shape makes the occurrence of extremely large
values unlikely.
continuous probability density function: Mathematical expression that defines the distribution of the values for a continuous random variable.
Figure 6.1 Three continuous distributions. Panel A: normal distribution. Panel B: uniform distribution (between a and b). Panel C: exponential distribution. Horizontal axis in each panel: values of X.
Note that a continuous probability density function gives the graph of the probability distribution, not the probability, as is the case with a discrete probability function. Probabilities
involving continuous random variables are calculated as areas under the curve given by the
probability density function and between specified values of the random variable.
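As a rough illustration of probability as area, the following Python sketch approximates the area under a density curve with a midpoint sum. The uniform density on the interval from 2 to 6 and the helper names are assumptions chosen for illustration; they do not come from the text.

```python
def uniform_density(x, a=2.0, b=6.0):
    """Height of a uniform density on [a, b]; a and b are hypothetical values."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def area_under(density, lower, upper, steps=10_000):
    """Approximate the area under `density` between lower and upper (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(density(lower + (i + 0.5) * width) for i in range(steps)) * width


# The height of the curve at X = 3 is a density, not a probability.
height_at_3 = uniform_density(3.0)

# The probability that X falls between 3 and 5 is the area under the curve there.
prob_3_to_5 = area_under(uniform_density, 3.0, 5.0)

# The area under the whole curve is 1.
total_area = area_under(uniform_density, 2.0, 6.0)
```

Here the height of the curve is 0.25 everywhere between 2 and 6, while the probability of landing between 3 and 5 is the area 0.25 × 2 = 0.5.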
LEARNING OBJECTIVE 1: Calculate probabilities from the normal distribution
normal distribution: Continuous probability distribution represented by a bell-shaped curve.
6.2 THE NORMAL DISTRIBUTION
The normal distribution (sometimes referred to as the Gaussian distribution) is the most common continuous probability distribution used in statistics. The normal distribution is vitally
important in statistics for three main reasons:
1. Numerous continuous variables common in business, and elsewhere, have distributions
that are normal or approximately normal.
2. The normal distribution can be used to approximate various discrete probability distributions.
3. The normal distribution provides the basis for classical statistical inference because of its
relationship to the Central Limit Theorem (discussed in Section 7.2).
The normal distribution is represented by the classic bell shape depicted in panel A of
Figure 6.1. In the normal distribution, we can calculate the probability that values of the random
variable occur within a range or interval. However, the probability of a particular or individual value of a continuous random variable, such as a normal random variable, is zero. This
property distinguishes continuous variables, which are measured, from discrete variables,
which are counted. As an example, time (in seconds) is measured and not counted. Therefore,
we can determine the probability that the load time for a website is between 1 and 5 seconds or
between 2 and 4 seconds or between 2.99 and 3.01 seconds. However, the probability that the
load time is exactly 3 seconds is effectively zero.
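The shrinking-interval idea can be sketched in Python with the standard library's `statistics.NormalDist` (Python 3.8 or later). The text gives no parameters for the load time, so the mean of 3 seconds and standard deviation of 1 second below are purely hypothetical, as is the helper name `prob_between`.

```python
from statistics import NormalDist

# Hypothetical load-time model: mean 3 seconds, standard deviation 1 second.
load_time = NormalDist(mu=3.0, sigma=1.0)


def prob_between(dist, lower, upper):
    """P(lower < X < upper) as a difference of cumulative areas."""
    return dist.cdf(upper) - dist.cdf(lower)


p_1_to_5 = prob_between(load_time, 1.0, 5.0)
p_2_to_4 = prob_between(load_time, 2.0, 4.0)
p_near_3 = prob_between(load_time, 2.99, 3.01)
p_exactly_3 = prob_between(load_time, 3.0, 3.0)  # zero-width interval, so zero area
```

Under these assumed parameters the probability falls as the interval narrows and reaches zero when the interval collapses to the single value 3.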
The normal distribution has several important theoretical properties:
• It is bell-shaped (and thus symmetrical) in its appearance.
• Its mean and median are equal.
• Its middle 50% of data is within approximately 4/3 standard deviations. This means that the
interquartile range is contained within an interval of two-thirds of a standard deviation
below the mean to two-thirds of a standard deviation above the mean – that is, the middle
50% of data have Z scores (introduced in Section 3.1) between −2/3 and 2/3.
• Its associated random variable has an infinite range (−∞ < X < ∞).
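The two-thirds approximation in the third property can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standardised normal: mean 0, standard deviation 1

q1 = z.inv_cdf(0.25)   # first quartile, about -0.67
q3 = z.inv_cdf(0.75)   # third quartile, about +0.67
iqr = q3 - q1          # about 1.35 standard deviations, close to 4/3

# Area between Z = -2/3 and Z = +2/3, close to (but slightly under) 0.50.
middle_area = z.cdf(2 / 3) - z.cdf(-2 / 3)
```

The quartiles sit at roughly ±0.67 rather than exactly ±2/3, which is why the property is stated as an approximation.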
In practice, many variables have distributions that closely resemble the theoretical properties of the normal distribution. The data in Table 6.1 represent the thickness (in millimetres) of
10,000 brass washers manufactured by a large company. The continuous variable of interest,
thickness, can be approximated by the normal distribution. The measurements of the thickness
of the 10,000 brass washers cluster in the interval 0.485 to 0.495 mm and are distributed
symmetrically around that interval, forming a bell-shaped pattern. As illustrated in Table 6.1,
the non-overlapping (mutually exclusive) classes contain all possible values (are collectively
exhaustive) and so the relative frequencies sum to 1.
Table 6.1 Thickness of 10,000 brass washers

Thickness (mm)          Frequency   Relative frequency
Under 0.425                     0               0
0.425 to under 0.435           48               0.0048
0.435 to under 0.445          122               0.0122
0.445 to under 0.455          325               0.0325
0.455 to under 0.465          695               0.0695
0.465 to under 0.475        1,198               0.1198
0.475 to under 0.485        1,664               0.1664
0.485 to under 0.495        1,896               0.1896
0.495 to under 0.505        1,664               0.1664
0.505 to under 0.515        1,198               0.1198
0.515 to under 0.525          695               0.0695
0.525 to under 0.535          325               0.0325
0.535 to under 0.545          122               0.0122
0.545 to under 0.555           48               0.0048
0.555 or above                  0               0
Total                      10,000               1.0000
Figure 6.2 depicts the relative frequency histogram and polygon for the distribution of the
thickness of 10,000 brass washers. For these data, the first three theoretical properties of the
normal distribution are approximately satisfied; however, the fourth does not hold. The random
variable of interest, thickness, cannot possibly be zero or below, and a washer cannot be so
thick that it becomes unusable. From Table 6.1, only 48 out of every 10,000 brass washers are
expected to have a thickness of between 0.545 and 0.555 mm and none above 0.555 mm,
whereas an equal number is expected to have a thickness between 0.425 and 0.435 mm and
none below 0.425 mm. Thus, the chance of randomly getting a washer thinner than 0.435 mm
or thicker than 0.545 mm is 0.0048 + 0.0048 = 0.0096 – or less than 1 in 100.
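The arithmetic behind Table 6.1 can be reproduced with a short Python sketch. The frequencies are taken from the table; the variable names are illustrative.

```python
# Class frequencies from Table 6.1, from 'Under 0.425' to '0.555 or above'.
frequencies = [0, 48, 122, 325, 695, 1198, 1664, 1896,
               1664, 1198, 695, 325, 122, 48, 0]

total = sum(frequencies)                             # 10,000 washers
relative = [count / total for count in frequencies]  # relative frequencies

# The frequencies read the same from either end, so the pattern is symmetrical.
is_symmetric = frequencies == frequencies[::-1]

# Chance of a washer thinner than 0.435 mm or thicker than 0.545 mm:
# the two lowest classes plus the two highest classes.
tail_chance = sum(relative[:2]) + sum(relative[-2:])
```

The relative frequencies sum to 1 and the combined tail chance comes to 0.0096, matching the figures in the text.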
Figure 6.2 Relative frequency histogram and polygon of the thickness of 10,000 brass washers. Horizontal axis: thickness (mm), 0.42 to 0.56. Vertical axis: relative frequency, 0.00 to 0.20.
For the normal distribution, the normal probability density function is given by Equation 6.1.
normal probability density function: Mathematical expression that defines the normal distribution.
THE NORMAL PROBABILITY DENSITY FUNCTION
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(X-\mu)/\sigma]^{2}} \tag{6.1}$$
where e = 2.71828… is the base of natural logarithms
π = 3.14159…
μ is the mean
σ is the standard deviation
X is any value of the normal random variable, where (−∞ < X < ∞)
Because e and π are mathematical constants, the probability density function depends on
the two parameters of the normal distribution: the mean μ and the standard deviation σ. Each
combination of μ and σ generates a different normal distribution. Figure 6.3 illustrates three
different normal distributions. Distributions A and B have the same mean (μ) but have different
standard deviations. Distributions A and C have the same standard deviation (σ) but have different means. Distributions B and C depict two normal probability density functions that differ
with respect to both μ and σ.
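Equation 6.1 translates directly into code. The sketch below is one possible Python rendering; the three parameter pairs are hypothetical stand-ins for curves A, B and C, since Figure 6.3 gives no numerical values.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Equation 6.1: height of the normal curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical (mean, standard deviation) pairs.
curve_a = (0.0, 1.0)
curve_b = (0.0, 2.0)   # same mean as A, larger standard deviation
curve_c = (3.0, 1.0)   # same standard deviation as A, different mean

peak_a = normal_pdf(curve_a[0], *curve_a)
peak_b = normal_pdf(curve_b[0], *curve_b)  # larger sigma gives a lower, wider curve
peak_c = normal_pdf(curve_c[0], *curve_c)  # shifting the mean moves the curve, not its height

# Cross-check one value against the standard library's implementation.
library_value = NormalDist(mu=3.0, sigma=1.0).pdf(4.0)
```

Changing μ slides the curve along the X axis, while changing σ stretches or squeezes it; the total area under each curve stays at 1.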
Figure 6.3 Three normal distributions (curves labelled A, B and C)
transformation formula: Z score formula used to convert any normal random variable to the standardised normal random variable.
standardised normal random variable: Normal random variable with a mean of 0 and a standard deviation of 1.
Normal probabilities are calculated as areas under the curve given by Equation 6.1; this
requires integral calculus, and the integral has no simple closed-form solution. Fortunately, all normal probabilities can be
calculated from normal probability tables. However, as there is a different normal probability
distribution for each combination of μ and σ, the first step in finding a normal probability is to
use the transformation formula, given in Equation 6.2, to convert any normal random variable X
to a standardised normal random variable Z.
TRANSFORMATION FORMULA
The Z value is equal to the difference between X and the mean μ, divided by the standard
deviation σ.
$$Z = \frac{X - \mu}{\sigma} \tag{6.2}$$
Equation 6.2 is a restatement of the Z score equation (3.12), introduced in Chapter 3. Thus,
Equation 6.2 represents the distance between a given value of the random variable X and the
mean expressed in standard deviations.
Although the original normal random variable X had mean μ and standard deviation σ, the
standard normal random variable Z has mean μ = 0 and standard deviation σ = 1. By substituting μ = 0 and σ = 1 in Equation 6.1, the probability density function of the standardised normal
variable Z is given in Equation 6.3.
THE STANDARDISED NORMAL PROBABILITY DENSITY FUNCTION
$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)Z^{2}} \tag{6.3}$$
Any normal probability distribution can be converted to the standardised probability
distribution. Then normal probabilities can be determined from Table E.2, the cumulative
standardised normal distribution.
To see how the transformation formula is applied and the results used to find probabilities
from Table E.2, recall from the Tasman University Orientation scenario at the beginning of the
chapter that data indicate that the time students spend on the ‘Introduction to TU’ module is
normal, with mean μ = 7 minutes and standard deviation σ = 2 minutes. From Figure 6.4, it
can be seen that every value of the random variable X, time, has a corresponding standardised
Z value calculated by the transformation formula (Equation 6.2). Therefore, a time of 9 minutes
is equivalent to Z = 1; that is, 9 minutes is one standard deviation above the mean since:
$$Z = \frac{9 - 7}{2} = +1$$
cumulative standardised normal distribution: Represents the cumulative area under the standard normal curve less than a given value.
Figure 6.4 Transformation of scales: time on the ‘Introduction to TU’ module. The X scale in minutes (μ = 7, σ = 2) runs 1, 3, 5, 7, 9, 11, 13 (from μ − 3σ to μ + 3σ), matching −3, −2, −1, 0, +1, +2, +3 on the Z scale (μ = 0, σ = 1).
A time of 1 minute is equivalent to Z = −3; that is, 1 minute is three standard deviations below
the mean since:
$$Z = \frac{1 - 7}{2} = -3$$
Thus, the standard deviation is the unit of measurement. In other words, a time of 9 minutes
is 2 minutes (i.e. one standard deviation) higher, or longer, than the mean time of 7 minutes.
Similarly, if a student spends 1 minute on the module it is 6 minutes (i.e. three standard
deviations) lower, or shorter, than the mean time.
To illustrate further the transformation formula, the time students spend on the ‘Support at
TU’ module is also normal with a mean of 4 minutes and a standard deviation of 1 minute. This
distribution is illustrated in Figure 6.5. For ‘Support at TU’, a time of 5 minutes is one standard
deviation above the mean time since:
$$Z = \frac{5 - 4}{1} = +1$$
A time of 1 minute is three standard deviations below the mean time since:
$$Z = \frac{1 - 4}{1} = -3$$
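The four conversions above can be sketched as a small Python function. The name `z_score` is illustrative; the means and standard deviations are those given for the two modules.

```python
def z_score(x, mu, sigma):
    """Equation 6.2: distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma


# 'Introduction to TU': mean 7 minutes, standard deviation 2 minutes.
intro_9 = z_score(9, mu=7, sigma=2)    # +1.0
intro_1 = z_score(1, mu=7, sigma=2)    # -3.0

# 'Support at TU': mean 4 minutes, standard deviation 1 minute.
support_5 = z_score(5, mu=4, sigma=1)  # +1.0
support_1 = z_score(1, mu=4, sigma=1)  # -3.0
```

Times from different modules become directly comparable once they are expressed on the common Z scale.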
The two bell-shaped curves in Figures 6.4 and 6.5 represent the probability density functions of the time (in minutes) students spend on the two modules. Since the times represent the
entire population, the area under the entire curve, representing probability, must be 1.
Figure 6.5 A different transformation of scales: time on the ‘Support at TU’ module. The X scale in minutes (μ = 4, σ = 1) runs 1, 2, 3, 4, 5, 6, 7, matching −3, −2, −1, 0, +1, +2, +3 on the Z scale (μ = 0, σ = 1).
The steps to find the probability that the time a student spends in the ‘Introduction to TU’
module in the Tasman University Orientation scenario is less than 9 minutes are as follows:
1. Use Equation 6.2 to transform X = 9 to the corresponding Z value:
$$Z = \frac{9 - 7}{2} = 1$$
2. Use Table E.2 to find the cumulative area under the standard normal curve less than (i.e. to the
left of) Z = 1.00. To read the probability or area under the curve less than Z = 1.00, scan down
the Z column in Table E.2 to the Z value of interest to one decimal place, the Z row for 1.0.
Read across this row until it intersects the column that contains the second decimal place of
the Z value, the column representing .00. Therefore, from the body of the table, the probability
for P(Z < 1.00) is given by the intersection of the row Z = 1.0 and the column Z = .00, as
shown in Table 6.2, which is extracted from Table E.2. This probability is 0.8413 – that is,
P(Z < 1.00) = 0.8413. As illustrated in Figure 6.6, there is an 84.13% likelihood that a
student will spend less than 9 minutes on the ‘Introduction to TU’ online module.

Table 6.2 Finding a cumulative area under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7518  .7549
0.7   .7580  .7612  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
Figure 6.6 Determining the area less than Z from a cumulative standardised normal distribution: time on the ‘Introduction to TU’ module. The area to the left of X = 9 minutes (Z = +1.00) is 0.8413.
For ‘Support at TU’, as Z = (5 − 4)/1 = 1 (see Figure 6.7), the probability of a time less
than 5 minutes is also 0.8413.
Figure 6.7 shows that, regardless of the value of the mean μ and standard deviation σ of a
normal random variable X, Equation 6.2 can be used to transform the distribution to the standard normal distribution Z.
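Where software is available, the table lookup can be replaced by a cumulative distribution function. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as a stand-in for Table E.2; the variable names are illustrative.

```python
from statistics import NormalDist

intro = NormalDist(mu=7, sigma=2)    # 'Introduction to TU' times, in minutes
support = NormalDist(mu=4, sigma=1)  # 'Support at TU' times, in minutes
standard = NormalDist()              # standardised normal Z

p_intro = intro.cdf(9)      # P(X < 9) for 'Introduction to TU'
p_support = support.cdf(5)  # P(X < 5) for 'Support at TU'
p_z = standard.cdf(1.0)     # P(Z < 1.00)
```

All three values agree, at about 0.8413, because each X value lies one standard deviation above its own mean.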
Figure 6.7 A transformation of scales for corresponding cumulative portions under two normal curves: the ‘Support at TU’ and ‘Introduction to TU’ X scales shown against a common Z scale from −3 to +3.
In the following examples, which answer questions relating to the time students spend on the
‘Introduction to TU’ module, when necessary the normal curve is sketched and the required probability/area shaded before using Table E.2 with Equation 6.2 to calculate the required probability.
EXAMPLE 6.1 FINDING P(X > 9)
What is the probability that a student will spend at least 9 minutes on ‘Introduction to TU’?
SOLUTION
The probability that a student spends less than 9 minutes is 0.8413 (see Figure 6.6). Thus,
the probability that a student will spend at least 9 minutes is the complement of less than
9 minutes, so:
P(X ⩾ 9) = 1 − P(X < 9) = 1 − 0.8413 = 0.1587
Therefore, approximately 15.9% of students spend at least 9 minutes on ‘Introduction to
TU’.
Figure 6.8 illustrates this result.
Figure 6.8 Finding P(X > 9): time on the ‘Introduction to TU’ module. The area to the left of X = 9 minutes (Z = +1.00) is 0.8413; the area to the right is 0.1587.
EXAMPLE 6.2 FINDING P(7 < X < 9)
What is the probability that a student spends between 7 and 9 minutes on ‘Introduction to TU’?
SOLUTION
From Figure 6.6, P(X < 9) = 0.8413. Now determine the probability that the time will be at
most 7 minutes and subtract this from the probability that the time is less than 9 minutes.
That is:
P(7 < X < 9) = P(X < 9) − P(X ⩽ 7)
This is shown in Figure 6.9.
Figure 6.9 Finding P(7 < X < 9): time on the ‘Introduction to TU’ module. The area below X = 7 (Z = 0) is 0.5000, the area between X = 7 and X = 9 (Z = +1.00) is 0.3413 and the area above X = 9 is 0.1587.
From Equation 6.2 and Table E.2:
$$P(X \le 7) = P\left(Z \le \frac{7 - 7}{2}\right) = P(Z \le 0.00) = 0.5000$$
Therefore:
P(7 < X < 9) = P(X < 9) − P(X ⩽ 7) = 0.8413 − 0.5000 = 0.3413
EXAMPLE 6.3 FINDING P(X ≤ 7 OR X > 9)
What is the probability that the time a student spends on ‘Introduction to TU’ is at most
7 minutes or at least 9 minutes?
SOLUTION
From Figure 6.9, the probability that a student spends between 7 and 9 minutes is 0.3413.
Therefore, the probability that the time spent on ‘Introduction to TU’ is at most 7 minutes or
at least 9 minutes is its complement, so:
P(X ⩽ 7 or X ⩾ 9) = 1 − P(7 < X < 9) = 1 − 0.3413 = 0.6587
Alternatively, calculate separately the probability of a time of 7 minutes or less and the
probability of a time of 9 minutes or more and then add these two probabilities together to
obtain the desired result (see Figure 6.10). Because the mean and median are the same for a
normal distribution, 50% of students spend 7 minutes or less. From Example 6.1, the probability of a student spending at least 9 minutes is 0.1587. Hence, the probability that the
time on ‘Introduction to TU’ is at most 7 minutes or at least 9 minutes is:
P(X ⩽ 7 or X ⩾ 9) = P(X ⩽ 7) + P(X ⩾ 9) = 0.5000 + 0.1587 = 0.6587
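The complement and subtraction steps of Examples 6.1 to 6.3 can be sketched in Python as follows, with `statistics.NormalDist` standing in for Table E.2. The variable names are illustrative.

```python
from statistics import NormalDist

intro = NormalDist(mu=7, sigma=2)  # 'Introduction to TU' times, in minutes

p_less_9 = intro.cdf(9)     # cumulative area below 9 minutes
p_at_most_7 = intro.cdf(7)  # cumulative area below 7 minutes

# Example 6.1: complement of the cumulative area.
p_at_least_9 = 1 - p_less_9

# Example 6.2: difference of two cumulative areas.
p_between_7_and_9 = p_less_9 - p_at_most_7

# Example 6.3, calculated both ways.
p_outside_by_complement = 1 - p_between_7_and_9
p_outside_by_sum = p_at_most_7 + p_at_least_9
```

Rounded to four decimal places these give 0.1587, 0.3413 and 0.6587, and the two routes in Example 6.3 agree.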
Figure 6.10 Finding P(X ≤ 7 or X > 9): time on the ‘Introduction to TU’ module. The area below X = 7 (Z = 0) is 0.5000, the area between X = 7 and X = 9 is 0.3413 and the area above X = 9 (Z = +1.00) is 0.1587.
EXAMPLE 6.4 FINDING P(5 < X < 9)
What is the probability that a student spends between 5 and 9 minutes on ‘Introduction to TU’?
SOLUTION
The required area/probability is the area under the curve between X = 5 and X = 9 (see Figure 6.11). As Table E.2 gives probabilities less than a particular value of interest, we calculate the
probabilities P(X < 9) and P(X ≤ 5) and then obtain the desired probability/area by subtraction:
$$\begin{aligned}
P(5 < X < 9) &= P\left(\frac{5 - 7}{2} < Z < \frac{9 - 7}{2}\right) \\
&= P(-1 < Z < 1) \\
&= P(Z < 1) - P(Z \le -1) \\
&= 0.8413 - 0.1587 \\
&= 0.6826
\end{aligned}$$
Figure 6.11 Finding P(5 < X ≤ 9): the area between X = 5 and X = 9 (Z = −1.00 to +1.00) is 0.6826, with an area of 0.1587 in each tail.
The result of Example 6.4 is important and allows us to generalise the findings. For any
normal distribution there is a 0.6826 probability that a randomly selected item will fall within
±1 standard deviation of the mean. From Figure 6.12, slightly more than 95% of the items will
fall within ±2 standard deviations of the mean. Thus, 95.44% of students will spend between 3
and 11 minutes on ‘Introduction to TU’. From Figure 6.13, 99.73% of the items will fall within
±3 standard deviations of the mean. Thus, 99.73% of students will spend between 1 and
13 minutes on ‘Introduction to TU’. Therefore, it is unlikely (0.00135, or 135 in 100,000) that
a student will spend less than a minute on ‘Introduction to TU’. Similarly, it is unlikely (0.135%)
that a student will spend more than 13 minutes on ‘Introduction to TU’. For this reason, 6σ (i.e.
three standard deviations below the mean to three standard deviations above the mean) is often
used as a practical approximation of the range for normally distributed data.
Figure 6.12 Finding P(3 < X < 11): the area between X = 3 and X = 11 (Z = −2.00 to +2.00) is 0.9544, with an area of 0.0228 in each tail.
Figure 6.13 Finding P(1 < X < 13): the area between X = 1 and X = 13 (Z = −3.00 to +3.00) is 0.9973, with an area of 0.00135 in each tail.
Therefore, for any normal distribution:
• approximately 68.26% of the items will fall within ±1 standard deviation of the mean
• approximately 95.44% of the items will fall within ±2 standard deviations of the mean
• approximately 99.73% of the items will fall within ±3 standard deviations of the mean.
The above result is the justification for the empirical rule introduced in Chapter 3. The closer a
data set is to a normal distribution, the more accurate the empirical rule is.
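The three percentages can be checked with a short Python sketch. The helper name `within_k_sd` is illustrative. Because the sketch does not round intermediate values to four decimal places as Table E.2 does, its results can differ from the table-based figures in the fourth decimal place (for example, about 0.6827 rather than 0.6826).

```python
from statistics import NormalDist


def within_k_sd(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# 'Introduction to TU' times: mean 7 minutes, standard deviation 2 minutes.
coverage = {k: within_k_sd(k, mu=7, sigma=2) for k in (1, 2, 3)}

# The same proportions hold for the standardised normal distribution.
standard_coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
```

The results do not depend on the particular mean and standard deviation, which is why the rule applies to any normal distribution.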
EXAMPLE 6.5 FINDING P(X < 3.5)
What is the probability that a student will spend less than 3.5 minutes on ‘Introduction to TU’?
SOLUTION
The required probability/area is the shaded lower left-tail region of Figure 6.14.
Figure 6.14 Finding P(X < 3.5): the area to the left of X = 3.5 (Z = −1.75) is 0.0401.
To determine the area under the curve below 3.5 minutes, first calculate the required Z value:
$$P(X < 3.5) = P\left(Z < \frac{3.5 - 7}{2}\right) = P(Z < -1.75)$$
Then look up the Z value of −1.75 in Table E.2 by matching the appropriate Z row (−1.7)
with the appropriate Z column (.05) as shown in Table 6.3 (which is extracted from Table E.2).
The resulting probability or area under the curve to the left of Z = −1.75, that is, more than 1.75 standard deviations below
the mean, is 0.0401. Therefore:
P(X < 3.5) = 0.0401
That is, orientation organisers can expect approximately 4% of students to spend less than
3.5 minutes on ‘Introduction to TU’.
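The row-and-column reading of the table can be mimicked in Python. This sketch regenerates one row of a cumulative standardised normal table with `statistics.NormalDist`; the helper `table_row` is hypothetical and, for simplicity, does not handle a separate −0.0 row.

```python
from statistics import NormalDist

STANDARD = NormalDist()  # standardised normal Z


def table_row(row_label):
    """Ten cumulative areas for one Z row (columns .00 to .09), rounded to 4 places."""
    sign = -1 if row_label < 0 else 1
    return [round(STANDARD.cdf(row_label + sign * col / 100), 4) for col in range(10)]


row = table_row(-1.7)
p_less_3_5 = row[5]  # row -1.7, column .05, i.e. Z = -1.75

# Direct calculation on the X scale, without standardising first.
p_direct = NormalDist(mu=7, sigma=2).cdf(3.5)
```

Both routes give 0.0401 to four decimal places, in line with the worked solution.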
Table 6.3 Finding a cumulative area under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z       .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
−1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
−1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
Examples 6.1 to 6.5 used the cumulative standard normal table to find an area under the
normal curve that corresponded to a specific X value. There are circumstances when we want to
do the opposite. Examples 6.6 and 6.7, still referring to the time students spend on ‘Introduction to TU’ in the Tasman University Orientation scenario, illustrate how to find the X value that
corresponds to a specific area.
EXAMPLE 6.6 FINDING THE X VALUE FOR A CUMULATIVE PROBABILITY OF 0.10
What is the longest time spent on the ‘Introduction to TU’ module by the 10% of
students who use it the least?
SOLUTION
Because 10% of students spend less than X minutes on ‘Introduction to TU’, the area under
the normal curve less than the corresponding Z value is 0.1000. Use the body of Table E.2 to
search for the area/probability of 0.1000. The closest result is 0.1003, as shown in Table 6.4
(which is extracted from Table E.2).
Table 6.4 Finding a Z value corresponding to a particular cumulative area (0.10) under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z       .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
−1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
−1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
−1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
−1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
Working from this area/probability to the margins of the table, the Z value corresponding to the particular Z row (−1.2) and Z column (.08) is −1.28 (see Figure 6.15). That is,
the 10th-percentile time, being the amount of time spent on the ‘Introduction to TU’ module
by the 10% of students who use it the least, is 1.28 standard deviations below the mean.
Figure 6.15 Finding Z to determine X: the area to the left of X (Z = −1.28) is 0.1000 and the area to the right is 0.9000; the mean is X = 7 (Z = 0).
Then rearrange the transformation formula (Equation 6.2) to determine the corresponding X
value as follows. Since:
$$Z = \frac{X - \mu}{\sigma}$$
then:
$$X = \mu + Z\sigma$$
Substituting μ = 7, σ = 2 and Z = −1.28 in the rearranged transformation formula, we obtain:
X = 7 + (−1.28) × 2 = 7 − 2.56 = 4.44 minutes
Thus, 10% of students spend less than 4.44 minutes on ‘Introduction to TU’.
Equation 6.4 is used to find the X value that corresponds to a Z value.
FINDING AN X VALUE ASSOCIATED WITH KNOWN PROBABILITY
The X value is equal to the mean μ plus the product of the Z value and the standard
deviation σ.
$$X = \mu + Z\sigma \tag{6.4}$$
To find a particular value associated with a known probability, follow these steps:
1. Sketch the normal curve, and then place the values for the means on the respective X and
Z scales.
2. Find the cumulative area less than X.
3. Shade the area of interest.
4. Using Table E.2, determine the Z value corresponding to the area under the normal curve
less than X.
5. Use Equation 6.4 to solve for X:
$$X = \mu + Z\sigma$$
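Steps 4 and 5 can be sketched in Python, with the inverse cumulative function of `statistics.NormalDist` playing the role of the reverse lookup in Table E.2. The function name `x_for_cumulative_area` is illustrative; the inputs are those of Example 6.6.

```python
from statistics import NormalDist


def x_for_cumulative_area(area, mu, sigma):
    """Return (X, Z) such that the area under the normal curve below X equals `area`."""
    z = NormalDist().inv_cdf(area)  # step 4: Z value for the cumulative area
    return mu + z * sigma, z        # step 5: Equation 6.4


x_10, z_10 = x_for_cumulative_area(0.10, mu=7, sigma=2)

# Table-style calculation, with Z rounded to two decimal places as in Table E.2.
x_table = 7 + round(z_10, 2) * 2
```

The unrounded Z value is about −1.2816, so `x_10` is about 4.437 minutes, while the table-style calculation with Z = −1.28 gives 4.44 minutes.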
EXAMPLE 6.7 FINDING THE X VALUES THAT INCLUDE THE TIMES THAT 95% OF STUDENTS SPEND ON ‘INTRODUCTION TO TU’
What are the lower and upper values of X, located symmetrically around the mean, which
include the middle 95% of times that students spend on ‘Introduction to TU’?
SOLUTION
First find the lower value of X (called XL). Then find the upper value of X (called XU). Since
95% of the values are between XL and XU, and XL and XU are an equal distance from the
mean, 2.5% of the values are below XL (see Figure 6.16).
Figure 6.16 Finding Z to determine XL: the area to the left of XL (Z = −1.96) is 0.0250 and the area to the right is 0.9750; the mean is X = 7 (Z = 0).
Although XL is not known, the corresponding Z value can be found because the area
under the normal curve less than this Z is 0.0250. Using the body of Table E.2 (see Table
6.5), search for the probability 0.0250.
Table 6.5 Finding a Z value corresponding to a cumulative area of 0.025 under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z       .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
−2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
−1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
−1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
Working from the body of the table to the margins, the Z value that corresponds to the
particular Z row (−1.9) and Z column (.06) is −1.96.
Then use Equation 6.4 to find the corresponding X value:
X = μ + Zσ = 7 + (−1.96) × 2 = 7 − 3.92 = 3.08 minutes
Use a similar process to find XU. Since only 2.5% of times are longer than XU minutes,
97.5% of times are less than XU minutes. From the symmetry of the normal distribution, the
desired Z value, as shown in Figure 6.17, is +1.96. Alternatively, extract this Z value from
Table E.2 (see Table 6.6). Note that 0.975 is the area under the normal curve less than the Z
value of +1.96.
Figure 6.17 Finding Z to determine XU: the area to the left of XU (Z = +1.96) is 0.9750 and the area to the right is 0.0250; the mean is X = 7 (Z = 0).
Then use Equation 6.4 to find the corresponding X value:
X = μ + Zσ = 7 + 1.96 × 2 = 7 + 3.92 = 10.92 minutes
Therefore, 95% of students spend between 3.08 and 10.92 minutes on ‘Introduction
to TU’.
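The symmetric-interval calculation of Example 6.7 generalises to any central percentage. A minimal Python sketch follows; the function name `central_interval` is an assumption made for illustration.

```python
from statistics import NormalDist


def central_interval(coverage, mu, sigma):
    """Lower and upper X values that enclose the middle `coverage` share of a normal distribution."""
    tail = (1 - coverage) / 2             # area left in each tail
    z_lower = NormalDist().inv_cdf(tail)  # about -1.96 when coverage is 0.95
    z_upper = -z_lower                    # by symmetry of the normal curve
    return mu + z_lower * sigma, mu + z_upper * sigma


x_lower, x_upper = central_interval(0.95, mu=7, sigma=2)

# Check: the area between the two limits should be the requested coverage.
intro = NormalDist(mu=7, sigma=2)
area_between = intro.cdf(x_upper) - intro.cdf(x_lower)
```

Rounded to two decimal places, the limits are 3.08 and 10.92 minutes, as in the worked solution.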
Table 6.6 Finding a Z value corresponding to a cumulative area of 0.975 under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z       .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
⋮        ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
+1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
+1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
+2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
You can also use Microsoft Excel to calculate normal probabilities. Figure 6.18 illustrates
a Microsoft Excel worksheet for Examples 6.5 and 6.6, using the Excel inbuilt functions
STANDARDIZE(x,mean,standard_dev), NORM.DIST(x,mean,standard_dev,cumulative),
NORM.S.INV(probability) and NORM.INV(probability,mean,standard_dev). For Excel 2007
and earlier, the corresponding functions are STANDARDIZE(x,mean,standard_dev),
NORMDIST(x,mean,standard_dev,cumulative), NORMSINV(probability) and NORMINV
(probability,mean,standard_dev).
VISUAL EXPLORATIONS
Exploring Descriptive Statistics
Open the VE_Normal_Distribution add-in workbook to explore the normal distribution.
To explore the effects of changing the mean and standard deviation on the area under a normal
distribution curve, select Add-ins ➔ Normal Distribution. The add-in displays a normal curve for the
Tasman University Orientation scenario and a floating control panel (at top right). Use the control panel
spinner buttons to change the values for the mean, standard deviation and X value, and then note the
effects of these changes on the probability of X < value and the corresponding shaded area under the
curve. To see the normal curve labelled with Z values, click Z Values. Click the Reset button to reset the
control panel values. Click Finish to finish exploring.
Figure 6.18 Microsoft Excel worksheet for calculating normal probabilities

Row  Column A                           Column B   Formula in column B
1    Normal probabilities
3    Common data
4    Mean                               7
5    Standard deviation                 2
7    Probability for X <=
8    X value                            3.5
9    Z value                            −1.75      =STANDARDIZE(B8, B4, B5)
10   P(X<=3.5)                          0.0401     =NORM.DIST(B8, B4, B5, TRUE)
29   Find X and Z given a cum. pctage
30   Cumulative percentage              10.00%
31   Z value                            −1.2816    =NORM.S.INV(B30)
32   X value                            4.4369     =NORM.INV(B30, B4, B5)
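For readers working outside Excel, the four worksheet functions can be approximated in Python. The wrappers below are hypothetical helpers, named after the Excel functions for readability only; they rely on `statistics.NormalDist` from the standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def standardize(x, mean, standard_dev):
    """Counterpart of STANDARDIZE: the Z value for x."""
    return (x - mean) / standard_dev


def norm_dist(x, mean, standard_dev, cumulative=True):
    """Counterpart of NORM.DIST: cumulative area below x, or the density if cumulative is False."""
    dist = NormalDist(mean, standard_dev)
    return dist.cdf(x) if cumulative else dist.pdf(x)


def norm_s_inv(probability):
    """Counterpart of NORM.S.INV: Z value with the given cumulative area below it."""
    return NormalDist().inv_cdf(probability)


def norm_inv(probability, mean, standard_dev):
    """Counterpart of NORM.INV: X value with the given cumulative area below it."""
    return NormalDist(mean, standard_dev).inv_cdf(probability)


# Values corresponding to the worksheet in Figure 6.18.
z_value = standardize(3.5, 7, 2)
p_value = norm_dist(3.5, 7, 2, True)
z_for_10_percent = norm_s_inv(0.10)
x_for_10_percent = norm_inv(0.10, 7, 2)
```

Rounded to four decimal places, these reproduce the worksheet values −1.75, 0.0401, −1.2816 and 4.4369.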
THINK ABOUT THIS
What is normal?
Ironically, the statistician who popularised the use of ‘normal’ to describe the distribution
discussed in Section 6.2 was someone who saw the distribution as anything but the everyday,
anticipated occurrence that the adjective normal usually suggests.
Starting with an 1894 paper, Karl Pearson argued that measurements of phenomena do not
naturally, or ‘normally’, conform to the classic bell shape. While this principle underlies statistics
today, Pearson’s point of view was radical to contemporaries who saw the world as standardised
and normal. Pearson changed minds by showing that some populations are naturally skewed
(coining that term in passing), and he helped put to rest the notion that the normal distribution
underlies all phenomena.
Misunderstandings about the normal distribution have occurred both in business and in the public
sector throughout the years. These misunderstandings have caused a number of business blunders
and have sparked several public policy debates, including on the causes of the collapse of large
financial institutions in 2008. According to one theory, the investment banking industry’s application
of the normal distribution to assess risk may have contributed to the global collapse. (See ‘A finer
formula for assessing risks’, New York Times, 11 May 2010, p. B2.) Using the normal distribution led
these banks to overestimate the probability of having stable market conditions and underestimate
the chance of unusually large market losses. According to this theory, other distributions that have
less area in the middle of their curves and, therefore, more in the ‘tails’ that represent unusual
market outcomes, may have led to less serious losses.
As you study this chapter, make sure you understand the assumptions that must hold for the
proper use of the normal distribution, assumptions that were not explicitly verified by the
investment bankers. And, most importantly, always remember that the name normal distribution
does not mean to suggest normal in the everyday (dare we say ‘normal’?!) sense of the word.
Problems for Section 6.2
LEARNING THE BASICS
Make sure that you sketch the normal curve and shade the required area/
probability.
6.1 Given the standard normal distribution (with a mean of 0 and a
standard deviation of 1, as in Table E.2), what is the probability that:
a. Z is less than 1.57?
b. Z is greater than 1.84?
c. Z is between 1.57 and 1.84?
d. Z is less than 1.57 or greater than 1.84?
6.2 a. Given the standard normal distribution, what is the
probability that:
i. Z is between −1.57 and 1.84?
ii. Z is less than −1.57 or greater than 1.84?
b. What is the value of Z if only 2.5% of all possible Z values
are larger?
c. Between which two values of Z (symmetrically distributed
around the mean) will 68.26% of all possible Z values be
contained?
6.3 Given the standard normal distribution, what is the probability that:
a. Z is less than 1.08?
b. Z is greater than −0.21?
c. Z is less than −0.21 or greater than the mean?
d. Z is less than −0.21 or greater than 1.08?
6.4 a. Given the standard normal distribution, determine the
following probabilities:
i. P(Z > 1.08)
ii. P(Z < −0.21)
iii. P(−1.96 < Z < −0.21)
b. What is the value of Z if only 15.87% of all possible Z values
are larger?
6.5 Given a normal distribution with μ = 100 and σ = 10, what is
the probability that:
a. X > 75?
b. X < 70?
c. X < 80 or X > 110?
d. 80% of the values are between which two X values
(symmetrically distributed around the mean)?
6.6 Given a normal distribution with μ = 50 and σ = 4, what is the
probability that:
a. X > 43?
b. X < 42?
c. 5% of the values are less than which X value?
d. 60% of the values are between which two X values
(symmetrically distributed around the mean)?
APPLYING THE CONCEPTS
6.7 The records of Check$mart Bank show that the average credit
card balance of its customers is $3,325 with a standard
deviation of $1,500. Assume that the distribution of these credit
card balances is approximately normal.
a. Find the probability that an account balance is less than $2,500.
b. Find the probability that an account balance is more than
$5,000.
c. What proportion of account balances are between $3,000
and $4,000?
d. 99% of account balances are less than which amount?
6.8 Toby’s Trucking Company determined that, on an annual basis, the
distance travelled per truck is normally distributed with a mean of
100,000 kilometres and a standard deviation of 20,000 kilometres.
a. What proportion of trucks can be expected to travel between
80,000 and 120,000 kilometres in the year?
b. What percentage of trucks can be expected to travel either
below 60,000 or above 140,000 kilometres in the year?
c. How many kilometres will be travelled by at least 80% of the
trucks?
d. What are your answers to (a) to (c) if the standard deviation
is 10,000 km?
6.9 The breaking strength of plastic bags used for packaging
produce is normally distributed with a mean of 35 kPa
(kilopascals) and a standard deviation of 10 kPa.
a. What proportion of the bags have a breaking strength of:
i. less than 20 kPa?
ii. at least 30 kPa?
iii. between 25 and 45 kPa?
b. Between which two values symmetrically distributed around
the mean will 95% of the breaking strengths fall?
6.10 A set of final examination marks in an introductory statistics
unit is normally distributed with a mean of 73 and a standard
deviation of 8.
a. What is the probability of getting a mark of 91 or less?
b. What is the probability that a student obtains a mark
between 65 and 89?
c. If the lecturer gives Distinction and High Distinction grades
to the top 15% of students, what mark does a student need
to get a distinction?
d. If the lecturer gives a High Distinction to the top 5% of
students, are you better off with a mark of 80 on this exam
or a mark of 68 on a different exam where the mean is 62
and the standard deviation is 3? Show your answer
statistically and explain.
6.11 A statistical analysis of 1,000 long-distance telephone calls
made from the headquarters of the Bricks and Clicks Computer
Corporation indicates that the length of these calls is normally distributed
with μ = 240 seconds and σ = 40 seconds.
a. What is the probability that a call lasted less than 180 seconds?
b. What is the probability that a particular call lasted between
180 and 300 seconds?
c. What is the probability that a call lasted between 110 and
180 seconds?
d. What is the length of a particular call if only 1% of all calls
are shorter?
6.12 The number of shares traded daily is referred to as the volume
of trade. During 2008 the average volume traded daily for the
ASX All Ordinaries was 992 million with a standard deviation of
252 million (Sydney Morning Herald, <http://business.smh.com.au>,
October 2008 and January 2009).
Assume that the number of All Ordinaries shares traded
daily on the ASX is a normal random variable with a mean of
992 million and a standard deviation of 252 million.
a. For a randomly selected day, what is the probability that the
volume of trading is:
i. below 500 million?
ii. between 750 and 1,000 million?
iii. below 1,500 million?
iv. above 1,200 million?
b. On 18 September 2008 the All Ordinaries volume of trade
was 2,125 million. What is the probability that the volume of
trading for the All Ordinaries on a randomly selected day is
at least 2,125 million? What conclusions can you draw from
this probability?
6.13 Many manufacturing problems involve the accurate
matching of machine parts, such as shafts, that fit into a
valve hole. A particular design requires a shaft with a
diameter of 22.000 mm, but shafts with diameters between
21.900 mm and 22.010 mm are acceptable. Suppose that
the manufacturing process yields shafts with diameters
normally distributed with a mean of 22.002 mm and a
standard deviation of 0.005 mm.
a. For this process, what is:
i. the proportion of shafts with a diameter between
21.900 mm and 22.000 mm?
ii. the probability a shaft is acceptable?
iii. the diameter that will be exceeded by only 2% of shafts?
b. What would be your answers in (a) if the standard deviation
of the shaft diameters was 0.004 mm?
6.3 EVALUATING NORMALITY
As discussed in Section 6.2, many continuous variables used in business and elsewhere closely
resemble a normal distribution. However, other variables cannot be approximated by the normal distribution. This section presents two approaches for evaluating whether a set of data can
be approximated by the normal distribution:
1. Compare the data set’s characteristics with the properties of the normal distribution.
2. Construct a normal probability plot.
LEARNING OBJECTIVE 2: Determine whether a set of data is approximately normally distributed
Evaluating the Properties
As mentioned in Section 6.2, the normal distribution has several important theoretical properties:
• It is symmetrical, thus the mean and median are equal.
• It is bell shaped, thus the empirical rule applies.
• The interquartile range equals approximately 4/3 (about 1.33) standard deviations.
• The range is infinite (but in practice is approximately six times the standard deviation).
In practice, some continuous variables may have characteristics that approximate these theoretical properties. However, many continuous variables are neither normal nor approximately
normal. For such variables, the descriptive characteristics of the data do not match well with the
properties of a normal distribution. One approach to checking for normality is to compare the
actual data characteristics with the corresponding properties from an underlying normal distribution, as follows:
• Construct charts and observe their appearance. For small or moderate-sized data sets,
construct a stem-and-leaf display or a box-and-whisker plot. For large data sets, construct
the frequency distribution and plot the histogram or polygon.
• Calculate descriptive numerical measures and compare the characteristics of the data with
the theoretical properties of the normal distribution. Compare the mean and median. Is the
interquartile range approximately 1.33 times the standard deviation? Is the range
approximately six times the standard deviation?
• Evaluate how the values in the data are distributed. Determine whether approximately
two-thirds of the values lie within ±1 standard deviation of the mean. Determine whether
approximately four-fifths of the values lie within ±1.28 standard deviations of the mean.
Determine whether approximately 19 out of every 20 values lie within ±2 standard
deviations of the mean. Example 6.8 illustrates these steps.
EXAMPLE 6.8
EVALUATING NORMALITY
Innovative Kitchens design, build and install custom-made kitchens. Toni, a customer manager, is interested in the length of the initial call from potential customers, in particular what
percentage of calls last more than 4 minutes.
As Toni wishes to use the normal probability distribution to calculate the required probability, the assumption that the random variable
X = length of call
is normal needs to be checked. The table below gives the length in seconds of 20 randomly
chosen calls. < CALL_LENGTH >
165  153  253  263  187  137  209  179  200  181
 97  170   43  295  181  121  117  210  191  248
Do these data show the properties of the normal distribution?
SOLUTION
Figure 6.19 displays descriptive statistics for these data and Figure 6.20 presents a box-and-whisker plot.
Figure 6.19 Microsoft Excel and PHStat descriptive statistics for length of call

Descriptive summary          Length of call
Mean                         180
Median                       181
Mode                         181
Minimum                      43
Maximum                      295
Range                        252
Variance                     3599.5789
Standard deviation           59.9965
Coefficient of variation     33.33%
Skewness                     –0.2288
Kurtosis                     0.3939
Count                        20
Standard error               13.4156

Figure 6.20 Box-and-whisker plot for length of calls (horizontal axis: Length of call (seconds), 40 to 300)

Five-number summary
Minimum                      43
First quartile               137
Median                       181
Third quartile               210
Maximum                      295

From these figures, we can make the following statements:
1. The mean of 180 is slightly less than the median and mode of 181.
2. The box-and-whisker plot appears slightly left skewed.
3. The interquartile range of Q3 − Q1 = 210 − 137 = 73 is 73/60 = 1.216… standard deviations.
4. The range of 252 is equal to 252/60 = 4.2 standard deviations.
5. 13/20 = 65% of the calls are within ±1 standard deviation of the mean (180 ± 60 → 120 to 240).
6. 19/20 = 95% of the calls are within ±2 standard deviations of the mean (180 ± 120 → 60 to 300).
Based on these statements and the criteria given above, it can be concluded that the length
of calls is approximately normal. However, statements 1 and 2 indicate that the calls are
slightly left skewed.
Thus, Toni can use the normal probability distribution, with μ = 180 seconds and
σ = 60 seconds, to calculate the percentage of calls lasting more than 4 minutes.
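The checks in Example 6.8 can also be reproduced programmatically. The following is a minimal sketch, assuming Python 3.8 or later and its standard-library statistics module; the helper share_within is a hypothetical name introduced only for this illustration. Quartiles are left out because software packages use differing quartile conventions that may not match the values in Figure 6.20.

```python
from statistics import NormalDist, mean, median, stdev

# Call lengths in seconds from Example 6.8 (order is irrelevant here)
calls = [165, 153, 253, 263, 187, 137, 209, 179, 200, 181,
         97, 170, 43, 295, 181, 121, 117, 210, 191, 248]

m, s = mean(calls), stdev(calls)   # sample mean and sample standard deviation

def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return sum(abs(x - m) <= k * s for x in calls) / len(calls)

print("mean:", m, "median:", median(calls))
print("range / sd:", round((max(calls) - min(calls)) / s, 2))   # about 6 if normal
for k, expected in [(1, "2/3"), (1.28, "4/5"), (2, "19/20")]:
    print(f"within ±{k} sd: {share_within(k):.0%} (normal: about {expected})")

# Toni's question, assuming a normal model with mu = 180 and sigma = 60
p_over_4_min = 1 - NormalDist(mu=180, sigma=60).cdf(240)
print("P(X > 240 s):", round(p_over_4_min, 4))
```

The last line applies the normal model that the example adopts, so its result is only as reliable as the normality assumption itself.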
Constructing a Normal Probability Plot
A normal probability plot is a graphical approach for evaluating whether data are normally distributed. One common approach is called the quantile–quantile plot. In this method, each ordered
value is transformed to a Z score and plotted along with the ordered data values of the variable.
For example, if you have a sample of n = 19, the Z value for the smallest value corresponds to
a cumulative area of 1/(n + 1) = 1/(19 + 1) = 1/20 = 0.05. The Z value for a cumulative area of 0.05
(from Table E.2) is −1.65. Table 6.7 illustrates the entire set of Z values for a sample of n = 19.
normal probability plot: Graphical approach used to evaluate if data are normal.
quantile–quantile plot: A normal probability plot.

Table 6.7 Ordered values and corresponding Z values for a sample of n = 19

Ordered value   Z value      Ordered value   Z value
 1              −1.65        11               0.13
 2              −1.28        12               0.25
 3              −1.04        13               0.39
 4              −0.84        14               0.52
 5              −0.67        15               0.67
 6              −0.52        16               0.84
 7              −0.39        17               1.04
 8              −0.25        18               1.28
 9              −0.13        19               1.65
10               0.00
The Z values are plotted on the horizontal axis and the corresponding values of the variable
are plotted on the vertical axis. If the data are normally distributed, the points will be approximately in a straight line.
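The Z values for any sample size can be generated rather than read from a table. This sketch assumes Python 3.8 or later and uses the cumulative area i/(n + 1) described above; the function name plot_z_values is hypothetical. Because it computes exact quantiles, a value may differ from a table lookup in the second decimal place (for example, roughly −1.645 for the smallest value when n = 19, which the table reports as −1.65).

```python
from statistics import NormalDist

def plot_z_values(n):
    """Z value for each ordered value i = 1..n, using cumulative area i / (n + 1)."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

z = plot_z_values(19)
for i, value in enumerate(z, start=1):
    print(i, round(value, 2))
```

Pairing these Z values with the sorted data gives the points of the normal probability plot.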
Figure 6.21 illustrates the typical shape of normal probability plots for a left-skewed distribution (panel A), a normal distribution (panel B) and a right-skewed distribution (panel C). If
the data are left skewed, the curve will rise more rapidly at first, and then level off. If the data
are right skewed, the curve will rise more slowly at first, and then rise at a faster rate for higher
values of the variable being plotted.
Figure 6.22 illustrates a PHStat normal probability plot for the length-of-call data in
Example 6.8, and shows that it is approximately a straight line. Thus, it can be concluded that
the distribution of the data on length of calls is approximately normal.
Figure 6.21 Normal probability plots for a left-skewed distribution (panel A), a normal distribution (panel B) and a right-skewed distribution (panel C)
Figure 6.22 PHStat normal probability plot for length of call (vertical axis: Length of call, 0 to 350; horizontal axis: Z value, –2 to 2)
Problems for Section 6.3
LEARNING THE BASICS
6.14 When evaluating normality, show that, for a sample of n = 39,
the smallest and largest Z values are −1.96 and +1.96, and
the middle (i.e. 20th) Z value is 0.00.
6.15 For a sample of n = 6, list the six Z values.
APPLYING THE CONCEPTS
You can solve problems 6.16 to 6.20 manually or by using Microsoft Excel. We
recommend that you use Microsoft Excel.
6.16 The full daily rates in Australian dollars for a random sample of
19 Australian and New Zealand hotels from a certain chain are
as follows: < HOTEL_RATE >
Location             Full rate A$
Auckland             280
Barossa Valley       200
Brisbane             441
Canberra             290
Darwin               662
Hamilton             255
Melbourne            358
Melbourne            308
Melbourne            279
Palmerston North     259
Perth                232
Queenstown           312
Rotorua              309
Snowy Mountains      615
Sunshine Coast       534
Sydney               360
Sydney               573
Sydney               320
Wellington           335
Decide whether or not the data appear to be approximately
normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.17 A problem with a telephone line that prevents a customer from
receiving or making calls is disconcerting to both the customer
and the telephone company. The following data represent two
samples of 20 problems reported to two different offices of a
telephone company. The time to clear these problems from the
customers’ lines is recorded in minutes. < PHONE >
Central Office I Time to clear problems (minutes)
1.48 1.75 0.78 2.85 0.52 1.60 4.15 3.97 1.48 3.10
1.02 0.53 0.93 1.60 0.80 1.05 6.32 3.93 5.45 0.97
Central Office II Time to clear problems (minutes)
7.55 3.75 0.10 1.10 0.60 0.52 3.30 2.10 0.58 4.02
3.75 0.65 1.92 0.60 1.53 4.23 0.08 1.48 1.65 0.72
For each of the two central office locations, decide whether
the data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.18 Many manufacturing processes use the term work-in-progress
(often abbreviated to WIP). In a book-manufacturing plant, the
WIP represents the time it takes for sheets from a press to be
folded, gathered, sewn, tipped on end sheets and bound. The
following data represent samples of 20 books at each of two
production plants and the processing time (operationally defined
as the time in days from when the books came off the press to
when they were packed in cartons) for these jobs: < WIP >
Plant A
 5.62 11.62  5.29  7.29 16.25  7.50 10.92  7.96 11.46  4.42
21.62 10.50  8.45  7.58  8.58  9.29  5.41  7.54 11.42  8.92
Plant B
 9.54  5.75 11.46 12.46 16.62  9.17 12.62 13.21 25.75  6.00
15.41  2.33 14.29 14.25 13.13  5.37 13.71  6.25 10.04  9.71
For each of the two plants, decide whether or not the data
appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.19 The data file < GRADES > contains a sample of student marks
and grades from a population of students enrolled in a statistics
unit. Decide whether or not the ‘Total Mark’ data appear to be
approximately normal by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.20 For the data from problem 6.19, < GRADES >, decide whether or
not the ‘Exam Mark’ data appear to be approximately normally
distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.4 THE UNIFORM DISTRIBUTION
LEARNING OBJECTIVE 3: Calculate probabilities from the uniform distribution
In the uniform distribution, a value has the same probability of occurrence anywhere in the range
between the smallest value a and the largest value b. Because of its shape, the uniform distribution is sometimes called the rectangular distribution (see panel B of Figure 6.1). Equation 6.5
defines the uniform probability density function.
uniform (rectangular) distribution: Continuous probability distribution; the values of the random variable have the same probability; also called the ‘rectangular distribution’.
THE UNIFORM PROBABILITY DENSITY FUNCTION
f(X) = 1/(b − a) if a ⩽ X ⩽ b and 0 elsewhere (6.5)
where a = the minimum value of X
b = the maximum value of X
Equations 6.6 and 6.7 define the mean and variance of the uniform distribution.
THE MEAN AND VARIANCE OF THE UNIFORM DISTRIBUTION
μ = (a + b)/2 (6.6)
σ² = (b − a)²/12 (6.7)
Figure 6.23 illustrates the uniform distribution with a = 0 and b = 1. The total area of the
rectangle is equal to base × height = 1 × 1 = 1, thus satisfying the requirement that the area
under any probability density function equals 1. In such a distribution, what is the probability of
getting a value between 0.1 and 0.3? The area between 0.1 and 0.3, depicted in Figure 6.24, is
equal to the base (0.3 − 0.1 = 0.2) multiplied by the height (1.0). Therefore:
P(0.1 < X < 0.3) = base × height = 0.2 × 1 = 0.2
Figure 6.23 Probability density function for a uniform distribution with a = 0 and b = 1 (vertical axis: f(x), height 1.0; horizontal axis: x, 0.0 to 1.0)
Figure 6.24 Finding P(0.1 < X < 0.3) for a uniform distribution with a = 0 and b = 1 (vertical axis: f(x), height 1.0; horizontal axis: x, 0.0 to 1.0)
From Equations 6.6 and 6.7, the mean and standard deviation of the uniform distribution
for a = 0 and b = 1 are:
μ = (a + b)/2 = (0 + 1)/2 = 0.5
σ² = (b − a)²/12 = (1 − 0)²/12 = 1/12 = 0.0833…
σ = √0.0833… = 0.2886…
Thus, the mean is 0.5 and the standard deviation is 0.2887.
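The base × height reasoning and Equations 6.5 to 6.7 translate directly into code. The sketch below is an illustration in Python; the function names are hypothetical and chosen only for this example.

```python
import math

def uniform_probability(low, high, a, b):
    """P(low < X < high) for X uniform on [a, b]: base times height."""
    low, high = max(low, a), min(high, b)   # clip the interval to [a, b]
    base = max(high - low, 0.0)
    height = 1 / (b - a)                    # Equation 6.5
    return base * height

def uniform_mean(a, b):
    return (a + b) / 2                      # Equation 6.6

def uniform_std_dev(a, b):
    return math.sqrt((b - a) ** 2 / 12)     # square root of Equation 6.7

a, b = 0, 1
print(round(uniform_probability(0.1, 0.3, a, b), 4))
print(uniform_mean(a, b), round(uniform_std_dev(a, b), 4))
```

Clipping the interval to [a, b] means that any part of the requested range lying outside the distribution contributes zero probability.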
Problems for Section 6.4
LEARNING THE BASICS
6.21 Suppose you sample one value from a uniform distribution with
a = 0 and b = 10.
a. What is the probability of getting a value:
i. between 5 and 7?
ii. between 2 and 3?
b. What is the mean?
c. What is the standard deviation?
APPLYING THE CONCEPTS
6.22 The time between arrivals of customers at a bank between
noon and 1 pm has a uniform distribution over an interval from
0 to 120 seconds.
a. What is the probability that the time between the arrival of
two customers will be:
i. less than 20 seconds?
ii. between 10 and 30 seconds?
iii. more than 35 seconds?
b. What is the mean and standard deviation of the time
between arrivals?
6.23 The time of failure for a continuous operation monitoring device
of air quality has a uniform distribution over a 24-hour day.
a. If a failure occurs on a day when it is daylight between
5.55 am and 7.38 pm, what is the probability that the failure
will occur during daylight hours?
b. If the device is in secondary mode from 10 pm to 5 am,
what is the probability that a failure occurs during secondary
mode?
c. If the device has a self-checking computer chip that
determines whether the device is operational every hour on
the hour, what is the probability that a failure will be
detected within 10 minutes of its occurrence?
d. If the device has a self-checking computer chip that
determines whether the device is operational every hour
on the hour, what is the probability that it will take at least
40 minutes to detect that a failure has occurred?
6.24 In an apartment building the waiting time for a lift is found to be
uniformly distributed between 0 and 3 minutes.
a. What is the probability of waiting:
i. no more than a minute?
ii. between 1 and 2 minutes?
iii. more than 2 minutes?
b. What is the mean and standard deviation of waiting time?
6.5 THE EXPONENTIAL DISTRIBUTION
LEARNING OBJECTIVE 4: Calculate probabilities from the exponential distribution
The exponential distribution is a continuous distribution that is right skewed and ranges from
zero to positive infinity (see panel C of Figure 6.1). The exponential distribution is widely used
in waiting line (or queuing) theory to model the length of time between random and independent events or the time to the first occurrence of an event. For example, the exponential random
variable can be used to model the:
• time between arrivals of customers at a bank’s ATM or a fast-food restaurant
• time between patients entering a hospital emergency room
• time between hits on a website
• time between outages to an Internet banking system
• time to failure of a certain item or component.
exponential distribution: Continuous probability distribution, used to model the interval between Poisson events.
The exponential and Poisson distributions are closely related. The Poisson distribution is
used to count the number of times an event occurs in some interval, while the exponential distribution is used to measure the interval between Poisson events or until the first event.
The exponential distribution is defined by a single parameter, λ (lambda), the expected
number of events per interval; note that this is the mean of the corresponding Poisson distribution. Equation 6.8 can be used to calculate exponential probabilities.
PROBABILITY THAT AN EXPONENTIAL RANDOM VARIABLE IS LESS THAN A
If X is an exponential random variable, 0 ⩽ X < ∞, then
P(X < A) = 1 − e^(−λA) (6.8)
where λ = expected number of events in interval
e = 2.71828… is the base of natural logarithms
A is a given value of the exponential random variable X
From Equation 6.8, using the complement rule, we obtain:
P(X ⩾ A) = 1 − P(X < A) = 1 − (1 − e^(−λA)) = e^(−λA)
THE MEAN, VARIANCE AND STANDARD DEVIATION OF THE EXPONENTIAL DISTRIBUTION
μ = σ = 1/λ (6.9)
σ² = 1/λ² (6.10)
where λ = expected number of events in interval.
For example, if the expected number of events in a minute is λ = 4, then the mean time between
events is μ = 1/4 = 0.25 minutes or 15 seconds.
To illustrate the exponential distribution, suppose that customers arrive at an ATM randomly and independently at the rate of 20 per hour. If a customer has just arrived, what is the
probability that the next customer will arrive within 6 minutes (i.e. 0.1 hour)? For this example,
X = time in hours until next customer is exponential with λ = 20 per hour. Using Equation 6.8
and A = 6 minutes = 0.1 hour:
P(X < 0.1) = 1 − e^(−20×0.1) = 1 − e^(−2) = 1 − 0.13533… = 0.86466…
Thus, the probability that a customer will arrive within 6 minutes is 0.8647.
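The ATM calculation can be checked with a few lines of Python. This is a minimal sketch of Equation 6.8 and the complement rule; the function name exponential_cdf is hypothetical, and λ and A must be expressed in the same time unit (hours here).

```python
import math

def exponential_cdf(A, lam):
    """P(X < A) = 1 - e^(-lam * A), Equation 6.8; lam is events per interval."""
    return 1 - math.exp(-lam * A)

lam = 20                                    # ATM arrivals per hour
within_6_min = exponential_cdf(0.1, lam)    # 6 minutes = 0.1 hour
after_6_min = 1 - within_6_min              # complement rule: e^(-lam * A)
mean_gap_minutes = 60 / lam                 # mean of 1/lam hours, in minutes

print(round(within_6_min, 4), round(after_6_min, 4), mean_gap_minutes)
```

The same function handles other intervals by changing A, for example A = 0.05 for 3 minutes.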
You can also use Microsoft Excel to calculate this probability. Figure 6.25 shows a
Microsoft Excel worksheet, using the Excel inbuilt exponential function
EXPON.DIST(x,lambda,cumulative). For Excel 2007 and earlier the corresponding exponential function is EXPONDIST(x,lambda,cumulative).
Figure 6.25 Microsoft Excel worksheet for finding exponential probabilities

     A                          B
1    Exponential probability
3    Data
4    λ                          20
5    X value                    0.1
7    Results
8    P(<=X)                     0.8647     =EXPON.DIST(B5, B4, TRUE)
9    P(>X)                      0.1353     =1-B8

EXAMPLE 6.9
CALCULATING EXPONENTIAL PROBABILITIES
In the ATM example, what is the probability that the next customer will arrive within 3 minutes (i.e. 0.05 hour)?
SOLUTION
Using Equation 6.8:
P(X < 0.05) = 1 − e^(−20×0.05) = 1 − e^(−1) = 1 − 0.3678… = 0.63212…
Thus, the probability that a customer will arrive within 3 minutes is 0.6321.
EXAMPLE 6.10
CALCULATING EXPONENTIAL PROBABILITIES
Past data show that two serious workplace accidents resulting in employees taking time off
work occur annually at Innovative Kitchens. A serious workplace accident has just occurred.
What is the probability that there will not be another serious workplace accident in the next
year and the probability that there will be at least one serious workplace accident in the next
six months?
SOLUTION
X = time in years until next serious workplace accident is exponential with λ = 2 per year.
Using Equation 6.8 and the complement rule:
P(X > 1) = e^(−2×1) = e^(−2) = 0.13533…
Thus, the probability that there will not be another serious workplace accident in the next
year is 0.1353.
Using Equation 6.8:
P(X ⩽ 0.5) = 1 − e^(−2×0.5) = 1 − e^(−1) = 1 − 0.36787… = 0.6321…
Thus, the probability that there will be at least one serious workplace accident in the next
six months is 0.6321.
Think about this: Memoryless distribution
Suppose customers arrive at an average rate of one per minute.
If no customer has arrived in the last minute, what is the probability that no customer will arrive in the
next 2 minutes?
To answer this, let X = time until next customer arrives in minutes. Then X is exponential with λ = 1.
We want the probability that we will wait at least another 2 minutes for the next customer, given that we
have waited 1 minute already; that is:
P(X > 2 + 1 | X > 1) = P(X > 3)/P(X > 1) = e^(−3×1)/e^(−1×1) = 0.1353…
Now suppose that a customer has just arrived. The probability that no customer will arrive in the next
two minutes is:
P(X > 2) = e^(−2×1) = e^(−2) = 0.13533…
What do you notice?
The probability has not changed. This illustrates the memoryless property of the exponential distribution.
It means that it does not matter how long you have waited for a customer. If a customer has not arrived
at time T, the distribution of the waiting time from time T until the next customer arrives is the same as
when a customer has just arrived.
In general, it can be shown that if X is exponential, then X is a memoryless random variable and:
P(X > A + T | X > T) = P(X > A) for A, T ⩾ 0
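The memoryless property is easy to confirm numerically. The sketch below, with the hypothetical helper survival, compares the conditional and unconditional probabilities from the customer-arrival illustration; it is a numerical check for one choice of A and T, not a proof.

```python
import math

def survival(A, lam):
    """P(X > A) = e^(-lam * A) for an exponential random variable."""
    return math.exp(-lam * A)

lam = 1          # one customer per minute
A, T = 2, 1      # wait 2 more minutes, given 1 minute has already passed

conditional = survival(A + T, lam) / survival(T, lam)   # P(X > A + T | X > T)
unconditional = survival(A, lam)                        # P(X > A)

print(round(conditional, 4), round(unconditional, 4))
```

Changing T to any other non-negative value leaves the conditional probability unchanged, apart from floating-point rounding.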
Problems for Section 6.5
LEARNING THE BASICS
6.25 Given an exponential distribution with λ = 10, what is the
probability that X is:
a. less than 0.1?
b. greater than 0.1?
c. between 0.1 and 0.2?
d. less than 0.1 or greater than 0.2?
6.26 Given an exponential distribution with λ = 30, what is the
probability that X is:
a. less than 0.1?
b. greater than 0.1?
c. between 0.1 and 0.2?
d. less than 0.1 or greater than 0.2?
6.27 Given an exponential distribution with λ = 20, what is the
probability that X is:
a. less than 4?
b. greater than 0.4?
c. between 0.4 and 0.5?
d. less than 0.4 or greater than 0.5?
APPLYING THE CONCEPTS
6.28 Vehicles arrive, randomly and independently, at a toll booth
located at the entrance to a bridge at the rate of 240 per
hour between 1 am and 2 am. Suppose a vehicle has just
arrived.
a. What is the probability that the next vehicle arrives within
the next minute?
b. What is the probability that no vehicle arrives in the next
30 seconds?
c. What is the mean time between arrivals at the toll booth?
d. What are your answers to (a) to (c) if the rate of arrival of
vehicles is 300 per hour?
e. What are your answers to (a) to (c) if the rate of arrival of
vehicles is 210 per hour?
6.29 Customers arrive at the drive-through window of a fast-food
restaurant at an average of two per minute during the lunch
hour.
a. What is the probability that the next customer will arrive
within 1 minute?
b. What is the probability that the next customer will arrive
within 5 minutes?
c. During the dinner time period, the average arrival rate is one
per minute. What are your answers to (a) and (b) for this
period?
6.30 The time between unplanned shutdowns of a power plant has
an exponential distribution with a mean of 20 days. Find the
probability that the time between two unplanned shutdowns is:
a. less than 14 days
b. more than 21 days
c. less than 7 days
6.31 Golfers arrive at the starter’s booth of a public golf course at an
average of eight per hour during the Monday-to-Thursday midweek period.
a. If a golfer has just arrived:
i. what is the probability that the next golfer arrives within
15 minutes (0.25 hour)?
ii. what is the probability that the next golfer arrives within
3 minutes (0.05 hour)?
b. The average arrival rate on Fridays is 15 per hour. What are
your answers to (a) on Fridays?
6.32 The number of floods in a certain region is approximately
Poisson distributed with an average of three floods every
10 years. A flood has just occurred.
a. What is the probability that:
i. a flood occurs in the next year?
ii. there isn’t a flood in the next two years?
iii. a flood occurs in the next month?
iv. at least one flood occurs in the next six months?
b. What is the average time between floods?
6.6 THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
LEARNING OBJECTIVE 5: Use the normal distribution to approximate probabilities from the binomial distribution
In earlier sections of this chapter, the normal probability distribution was introduced. In this
section we use the normal distribution to approximate the binomial distribution. When, as in
this case, a continuous distribution is used to approximate a discrete probability distribution, a
continuity correction factor is required.
Need for a Continuity Correction
There are two major reasons why a continuity correction is needed when using a continuous
random variable to approximate a discrete random variable. First, discrete random variables
such as binomial random variables can take on only specified (integer) values, while continuous random variables such as normal random variables can take on any values within a continuum or interval. When using the normal distribution to approximate the binomial
distribution, more accurate approximations of the probabilities are obtained when a continuity
correction is used.
Second, with a continuous distribution such as the normal distribution, the probability of
getting a specific value of a random variable is zero. However, when a continuous distribution
is used to approximate a discrete distribution, a continuity correction is used to obtain the
approximate probability of a specific value of the discrete distribution.
Consider an experiment in which we toss a fair coin 10 times. Suppose we want to calculate the probability of getting exactly four heads. Whereas a discrete random variable can have
only a specified value (such as 4), a continuous random variable used to approximate it could
take on any values within an interval around that specified value, as demonstrated on the scale
below:
[Scale for X, marked at 2.5, 3, 3.5, 4, 4.5, 5 and 5.5 and continuing in both directions]
The continuity correction requires adding or subtracting 0.5 from the value or values of the
discrete random variable X as required. To use the normal distribution to approximate the probability of getting exactly four heads, X = 4, we need to find the area under the normal curve
from X = 3.5 to X = 4.5, the lower and upper boundaries of 4. To determine the approximate
probability of getting at least four heads, we find the area under the normal curve greater than
or equal to 3.5, X ≥ 3.5, since 3.5 is the lower boundary of 4. Similarly, to determine the
approximate probability of getting at most four heads, we find the area under the normal curve
equal to or less than 4.5, X ≤ 4.5, since 4.5 is the upper boundary of 4.
When using the normal distribution to approximate discrete probability distributions,
semantics are important. To determine the approximate probability of getting fewer than four
heads, we find the area under the normal curve less than or equal to 3.5, X ≤ 3.5. To determine
the approximate probability of getting more than four heads, we find the area under the normal curve greater than or equal to 4.5, X ≥ 4.5. To determine the approximate probability of
getting four to seven heads (inclusive), we find the area under the normal curve from 3.5 to
7.5, 3.5 ≤ X ≤ 7.5.
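The boundary rules above can be checked numerically. The following Python sketch is an illustration only: the helper names `binom_pmf` and `normal_cdf` are ours and do not come from the text. It uses the coin-tossing experiment (n = 10, p = 0.5) and compares each exact binomial probability with the corresponding continuity-corrected normal area.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 10, 0.5
mu = n * p                          # mean of the binomial distribution
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of the binomial

# Exactly four heads: area from 3.5 to 4.5
exact_eq4 = binom_pmf(4, n, p)
approx_eq4 = normal_cdf(4.5, mu, sigma) - normal_cdf(3.5, mu, sigma)

# At least four heads: area at or above 3.5
exact_ge4 = sum(binom_pmf(k, n, p) for k in range(4, n + 1))
approx_ge4 = 1 - normal_cdf(3.5, mu, sigma)

# Four to seven heads (inclusive): area from 3.5 to 7.5
exact_4to7 = sum(binom_pmf(k, n, p) for k in range(4, 8))
approx_4to7 = normal_cdf(7.5, mu, sigma) - normal_cdf(3.5, mu, sigma)

for label, exact, approx in [
    ("exactly 4", exact_eq4, approx_eq4),
    ("at least 4", exact_ge4, approx_ge4),
    ("4 to 7", exact_4to7, approx_4to7),
]:
    print(f"{label:>10}: exact = {exact:.4f}, normal approx = {approx:.4f}")
```

In each case the two values should differ only in the third decimal place, even though n is small.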
Approximating the Binomial Distribution
In Section 5.3 we saw that the binomial distribution is symmetrical (as is the normal distribution)
whenever p = 0.5. When p ≠ 0.5, the binomial distribution is not symmetrical. However, the
closer p is to 0.5 and/or the larger the sample size n, the more symmetrical the distribution is.
On the other hand, the larger the sample size the more tedious it is to calculate the exact
probabilities of success using Equation 5.11. Fortunately, whenever the sample size is large, we
can use the normal distribution to approximate the exact binomial probabilities.
As a general rule, the normal distribution can be used to approximate the binomial distribution whenever np and n(1 − p) are both at least 5. From Section 5.3, the mean and standard
deviation of the binomial distribution are:
$$\mu = np$$
$$\sigma = \sqrt{np(1 - p)}$$
By substituting these into the transformation formula (Equation 6.2), we obtain:
$$Z = \frac{X - \mu}{\sigma} = \frac{X - np}{\sqrt{np(1 - p)}}$$
where for large enough n the random variable Z is approximately normal.
Hence, Equation 6.11 is used to find approximate probabilities corresponding to the values
of the discrete binomial random variable, X.
NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}} \qquad (6.11)$$
where $\mu = np$, mean of the binomial distribution
$\sigma = \sqrt{np(1 - p)}$, standard deviation of the binomial distribution
$X_a$ = adjusted number of successes for the discrete random variable X, such that
$X_a = X \pm 0.5$ as appropriate
EXAMPLE 6.11
USING THE NORMAL DISTRIBUTION TO APPROXIMATE THE BINOMIAL DISTRIBUTION
A random sample of n = 1,600 tyres is selected from an ongoing production process in
which 8% of all tyres produced are defective. What is the probability that 150 or fewer tyres
will be defective?
SOLUTION
Since both np = 1,600 × 0.08 = 128 and n(1 − p) = 1,600 × 0.92 = 1,472 are greater
than 5, the normal distribution can be used to approximate the binomial. Here, $X_a$, the
adjusted number of successes, is 150.5 and:
$$Z \approx \frac{X_a - np}{\sqrt{np(1 - p)}} = \frac{150.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{22.5}{10.8517\ldots} \approx 2.07$$
Then, using Table E.2, the area under the curve to the left of Z = 2.07 is 0.9808 (see
Figure 6.26). Therefore, the probability of 150 or fewer tyres being defective is approximately 0.98. This agrees to two decimal places with the exact binomial probability of 0.9790.
Figure 6.26 Approximating the binomial distribution [normal curve centred at μ = 128 on the X scale (0 on the Z scale), with an area of 0.9808 to the left of X = 150.5 (Z = +2.07)]
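The calculation in Example 6.11 can be reproduced with a short Python sketch. This is an illustration under our own assumptions rather than the book's method: `normal_cdf` stands in for Table E.2, and `binom_pmf` (our helper name) computes the exact binomial probability on the log scale so that the large combinations do not overflow.

```python
import math

def normal_cdf(z):
    """Cumulative standardised normal probability, P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binom_pmf(k, n, p):
    """Exact binomial probability, computed via logarithms."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, p = 1600, 0.08

# General rule: np and n(1 - p) must both be at least 5
rule_ok = n * p >= 5 and n * (1 - p) >= 5

mu = n * p
sigma = math.sqrt(n * p * (1 - p))

x_adj = 150 + 0.5            # "150 or fewer" uses the upper boundary of 150
z = (x_adj - mu) / sigma     # Equation 6.11
approx = normal_cdf(z)

# Exact probability for comparison: sum P(X = k) for k = 0, ..., 150
exact = sum(binom_pmf(k, n, p) for k in range(151))

print(f"rule satisfied: {rule_ok}")
print(f"Z = {z:.2f}, normal approximation = {approx:.4f}, exact = {exact:.4f}")
```

Because the sketch does not round Z to two decimal places before finding the area, its approximate probability can differ from the table-based value in the fourth decimal place.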
Calculating a Probability Approximation for
an Individual Value
Suppose that we want to approximate the probability of getting exactly 150 defective tyres. The
correction for continuity defines the integer value of interest to range from one-half unit below
it to one-half unit above it. Therefore, we define the probability of getting exactly 150 defective
tyres as the area under the normal curve between 149.5 and 150.5. Using Equation 6.11, the
corresponding Z values are:
$$Z = \frac{149.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{21.5}{10.85} = 1.98$$
and:
$$Z = \frac{150.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{22.5}{10.85} = 2.07$$
Therefore, using Table E.2, we obtain:
P(exactly 150 tyres defective) ≈ P(149.5 ⩽ X ⩽ 150.5)
≈ P(1.98 ⩽ Z ⩽ 2.07)
= 0.9808 − 0.9761
= 0.0047
Thus, the approximate probability of getting 150 defective tyres is 0.0047. Compare this with
the exact binomial probability which, to four decimal places, is 0.0048.
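As a hedged check of the individual-value calculation, the Python sketch below finds the area between 149.5 and 150.5 and compares it with the exact binomial probability of exactly 150 defective tyres. The helper names are ours, and the Z values are not rounded, so the result may differ slightly from the table-based figure.

```python
import math

def normal_cdf(z):
    """Cumulative standardised normal probability, P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, x = 1600, 0.08, 150
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Continuity correction: the integer 150 becomes the interval 149.5 to 150.5
z_low = (x - 0.5 - mu) / sigma
z_high = (x + 0.5 - mu) / sigma
approx = normal_cdf(z_high) - normal_cdf(z_low)

# Exact binomial probability of exactly 150 successes (log scale for stability)
log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
           + x * math.log(p) + (n - x) * math.log(1 - p))
exact = math.exp(log_pmf)

print(f"Z from {z_low:.2f} to {z_high:.2f}")
print(f"normal approximation = {approx:.4f}, exact = {exact:.4f}")
```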
Problems for Section 6.6
LEARNING THE BASICS
6.33 For n = 100 and p = 0.2, use the normal distribution to
approximate the probability that:
a. X = 25
b. X > 25
c. X ≤ 25
d. X < 25
6.34 For n = 100 and p = 0.4, use the normal distribution to
approximate the probability that:
a. X = 40
b. X > 40
c. X ≤ 40
d. X < 40
APPLYING THE CONCEPTS
6.35 Consider an experiment in which a fair coin is tossed 10 times.
a. Use Equation 5.11, Table E.6 or Microsoft Excel to determine
the probability of getting:
i. four heads
ii. at least four heads
iii. four to seven heads
b. Use the normal approximation to the binomial distribution to
approximate the probabilities in (a).
6.36 For overseas flights, an airline has three different choices on its
dessert menu: ice cream, apple pie and chocolate cake. Based
on past experience, the airline feels that each dessert is equally
likely to be chosen. If a random sample of 90 passengers is
selected, what is the approximate probability that:
a. at least 20 will choose ice cream for dessert?
b. exactly 20 will choose ice cream for dessert?
c. less than 20 will choose ice cream for dessert?
6
Assess your progress
Summary
In this chapter we used the normal distribution for the Tasman
University Orientation scenario to study the time students spend on
the ‘Introduction to TU’ module. We also used the exponential
distribution to model the time between serious workplace accidents.
In addition, we studied the uniform distribution, the normal
probability plot and the normal approximation to the binomial
distribution. In the next chapter, the normal distribution is used in
developing the subject of statistical inference.
Key formulas
The normal probability density function
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(X - \mu)/\sigma]^2} \qquad (6.1)$$
Finding a Z value
$$Z = \frac{X - \mu}{\sigma} \qquad (6.2)$$
The standardised normal probability density function
$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)Z^2} \qquad (6.3)$$
Finding an X value
$$X = \mu + Z\sigma \qquad (6.4)$$
The uniform distribution probability density function
$$f(X) = \frac{1}{b - a} \text{ if } a \le X \le b \text{ and } 0 \text{ elsewhere} \qquad (6.5)$$
Mean of the uniform distribution
$$\mu = \frac{a + b}{2} \qquad (6.6)$$
Variance of the uniform distribution
$$\sigma^2 = \frac{(b - a)^2}{12} \qquad (6.7)$$
Calculating exponential probabilities
$$P(X < A) = 1 - e^{-\lambda A} \qquad (6.8)$$
Mean and standard deviation of exponential distribution
$$\mu = \sigma = \frac{1}{\lambda} \qquad (6.9)$$
Variance of exponential distribution
$$\sigma^2 = \frac{1}{\lambda^2} \qquad (6.10)$$
Normal approximation to the binomial distribution
$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}} \qquad (6.11)$$
Key terms
continuous probability density function 213
cumulative standardised normal distribution 217
exponential distribution 235
normal distribution 214
normal probability density function 216
normal probability plot 231
quantile–quantile plot 231
standardised normal random variable 216
transformation formula 216
uniform (rectangular) distribution 233
Chapter review problems
CHECKING YOUR UNDERSTANDING
6.37 How do you find the area between two values under the normal
curve?
6.38 How do you find the X value that corresponds to a given
percentile of the normal distribution?
6.39 What are some of the properties of a normal distribution?
6.40 How can you use the normal probability plot to evaluate
whether a set of data is normally distributed?
6.41 Why is a continuity correction needed when approximating a
binomial probability with the normal distribution?
6.42 When can you use the normal distribution to approximate the
binomial distribution?
APPLYING THE CONCEPTS
6.43 Based on past experience, it is assumed that the number of
flaws per metre in rolls of grade 2 paper follows a Poisson
distribution with a mean of one flaw per 5 metres of paper.
A flaw has just been found.
a. What is the probability that:
i. there is not another flaw in the remaining 10 metres of
the roll?
ii. a flaw will be found in the next metre of the roll?
iii. at least one flaw will be found in the next 5 metres?
b. What is the mean distance between flaws?
6.44 Aircraft arrive at a regional airport at a rate of 30 per hour.
a. If the interarrival time follows an exponential distribution:
i. What is the probability that air traffic control will have a
break of at least 2 minutes between arrivals?
ii. What is the probability that there is less than 30 seconds
between arrivals?
iii. What is the expected time between arrivals?
b. If the interarrival time follows a uniform distribution
between 0 and 4 minutes:
i. What is the probability that air traffic control will have a
break of at least 2 minutes between arrivals?
ii. What is the probability that there is less than 30 seconds
between arrivals?
iii. What is the expected time between arrivals?
c. If the interarrival time follows a normal distribution with
mean 2 minutes and standard deviation 0.6 minutes:
i. What is the probability that air traffic control will have a
break of at least 2 minutes between arrivals?
ii. What is the probability that there is less than 30 seconds
between arrivals?
6.45 An orange juice producer buys all his oranges from a large
orange grove. The amount of juice squeezed from each orange
is approximately normally distributed with a mean of 135 mL
and a standard deviation of 12 mL.
a. What is the probability that a randomly selected orange will
contain between 135 mL and 140 mL of juice?
b. What is the probability that a randomly selected orange will
contain between 140 mL and 155 mL of juice?
c. 77% of the oranges will contain at least how many
millilitres of juice?
d. 80% of the oranges are between which two values of
juice (in millilitres) symmetrically distributed around the
population mean?
6.46 The hotels from the chain in problem 6.16 frequently offer
discounted ‘hot deal’ rates online. The table below gives the
‘hot deal’ rates available recently on a selected Sunday, in
Australian dollars. < HOTEL_RATE >
Location            Hot deals rate A$
Auckland            140
Barossa Valley      174
Brisbane            129
Canberra            230
Darwin              114
Hamilton            154
Melbourne           152
Melbourne           189
Melbourne           149
Palmerston North    80
Perth               150
Queenstown          95
Rotorua             122
Snowy Mountains     288
Sunshine Coast      170
Sydney              239
Sydney              189
Sydney              160
Wellington          105
Decide whether or not the data appear to be approximately
normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot
6.47 Geoscientists estimate that, on average, a given region has a
major earthquake every 250 years.
Assuming that the time between major earthquakes in this
region is exponentially distributed, what is the probability that a
major earthquake:
a. will not occur between 2020 and 2030?
b. will occur between 2020 and 2070?
c. will not occur between 2020 and 2200?
6.48 An examination consists of 40 multiple-choice questions, with
each question having four options. Suppose you randomly
select the answer to each question – that is, you guess. What
is the probability of obtaining at least 50% in the examination?
6.49 According to Burton G. Malkiel, the daily changes in the closing
price of shares follow a random walk – that is, these daily
events are independent of each other and move upwards or
downwards in a random manner – and can be approximated
by a normal distribution.
a. To test this theory, use the daily changes in the All Ordinaries
for 2016–17 financial year in < ALL_ORDS_2016_17 > to:
i. construct a stem-and-leaf display, histogram, polygon
and/or box-and-whisker plot
ii. evaluate the actual versus theoretical properties
iii. construct a normal probability plot
b. Discuss the results of (a). Are the daily changes in closing
prices approximately normal?
6.50 From past data, Safe-As-Houses Real Estate concludes that the
age of houses in the suburb of NewAcres is uniformly
distributed between 20 and 40 years.
What is the probability that the age of a randomly chosen
house in NewAcres is:
a. more than 30 years?
b. between 25 and 35 years?
c. less than 35 years?
6.51 The time customers are on hold when ringing the IT help line
for a certain ISP provider is normally distributed with a mean of
20 minutes and a standard deviation of 10 minutes.
a. What proportion of customers are on hold for more than
40 minutes?
b. What is the probability that a customer is on hold for less
than 30 minutes?
c. What percentage of calls are answered within 10 minutes?
6.52 A study by the ISP provider in problem 6.51 has shown that the
length of time on hold before a customer hangs up follows an
approximate exponential distribution, with an average time of
15 minutes on hold before a customer hangs up.
a. What percentage of customers will hang up during the first
20 minutes on hold?
b. What is the probability that a customer will hang up during
the first 10 minutes on hold?
c. What proportion of customers do not hang up when on hold
for 40 minutes?
6.53 From the Household Expenditure Statistics: Year Ended
30 June 2016 (Statistics New Zealand, <www.stats.govt.nz>),
the average weekly household expenditure in New Zealand
was $1,300.
Assuming that weekly household expenditure is
approximately normal with a standard deviation of $350:
a. Find the probability that a household’s weekly expenditure is
i. less than $500
ii. more than $1,750
b. What proportion of household expenditures are between
$1,250 and $1,500?
c. 99% of households have weekly expenditures of less than
which amount?
d. 95% of households have weekly expenditures of more than
which amount?
6.54 Water_Wise (see problem 3.53) is analysing water usage for
a block of one-bedroom flats. It collects data on total daily
water consumption in kilolitres (kL) for 133 consecutive
days. < WATER >.
a. Decide whether total daily water usage in this block of flats
is approximately normal by:
i. evaluating the actual versus theoretical properties
ii. constructing a normal probability plot
b. From part (a), assume that total daily water usage of the
flats is normally distributed with a mean of 1.27 kL and
standard deviation of 0.33 kL.
i. On what percentage of days is total water usage less
than 1.0 kL?
ii. On what proportion of days is total water usage between
0.8 kL and 1.4 kL?
iii. What is the probability that tomorrow total water usage
will exceed 2.0 kL?
6.55 Suppose there is a free bus, with no timetable, which circles
the city centre every 20 minutes. You arrive at a bus stop
unaware of when the bus last arrived at this stop. What is the
probability that you will wait for the bus:
a. less than 5 minutes?
b. between 10 and 15 minutes?
c. more than 12 minutes?
6.56 The lifespan of a certain car battery is normally distributed with
a mean of 5 years and a standard deviation of 9 months.
a. What is the probability that a battery lasts more than 7 years?
b. What proportion of batteries fail within the warranty period
of 3 years?
c. What warranty period, in months, should be set if only 1%
of batteries fail within the warranty period?
Continuing cases
Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues.
In particular, students within the school are asked to complete a student survey when they receive their grades
each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students
who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in
< TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.
Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman
University Postgraduate MBA Student Survey.
a For a selection of numerical variables in the BBus student survey, decide whether the variable is
approximately normally distributed by:
i comparing data characteristics to theoretical properties
ii constructing a normal probability plot
b For a selection of numerical variables in the MBA student survey, decide whether the variable is
approximately normally distributed by:
i comparing data characteristics to theoretical properties
ii constructing a normal probability plot
c Write a report summarising your conclusions.
d Assume that the weighted average mark (WAM) of BBus students is normal with a mean of 63.9 and a
standard deviation of 12.8.
i What percentage of BBus students have a WAM of at least 65, a Credit average?
ii What percentage of BBus students have a WAM of at least 75, a Distinction average?
iii What proportion of BBus students have a WAM of at least 85, a High Distinction average?
iv What proportion of BBus students have a WAM of less than 50?
v What is the probability that a BBus student chosen at random has a WAM between 50 and 70?
vi Below what WAM do the lowest 10% of BBus students achieve?
vii What WAM is achieved by the top 5% of BBus students?
e Assume that the MBA weighted average mark (WAM) of MBA students is normal with a mean of 73.8
and a standard deviation of 8.6.
i What percentage of MBA students have a WAM of at least 65, a Credit average?
ii What percentage of MBA students have a WAM of at least 75, a Distinction average?
iii What proportion of MBA students have a WAM of at least 85, a High Distinction average?
iv What proportion of MBA students have a WAM of less than 50?
v What is the probability that an MBA student chosen at random has a WAM between 50 and 70?
vi Below what WAM do the lowest 10% of MBA students achieve?
vii What WAM is achieved by the top 5% of MBA students?
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. The data are stored in < REAL_ESTATE >.
a For a selection of numerical variables for regional city 1 state A, decide whether the variable is
approximately normally distributed by:
i comparing data characteristics to theoretical properties
ii constructing a normal probability plot
b For a selection of numerical variables for coastal city 1 state A, decide whether the variable is
approximately normally distributed by:
i comparing data characteristics to theoretical properties
ii constructing a normal probability plot
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.
Chapter 6 Excel Guide
EG6.1 CONTINUOUS PROBABILITY DISTRIBUTIONS
There are no Excel Guide instructions for this section.
EG6.2 THE NORMAL DISTRIBUTION
Key technique
Use the NORM.DIST(X value, mean, standard deviation, True) function to calculate normal probabilities and
use the NORM.S.INV(percentage) function and the
STANDARDIZE function (see Section EG3.1) to calculate
the Z value.
Example
Calculate the normal probabilities for Examples 6.1, 6.4
and 6.5 and the X and Z values for Examples 6.6 and 6.7.
PHStat
Use Normal.
For the example, select PHStat ➔ Probability &
Prob. Distributions ➔ Normal. In this procedure’s dialog
box (shown in Figure EG6.1):
1. Enter 7 as the Mean and 2 as the Standard
Deviation.
2. Check Probability for: X <= and enter 3.5 in its
box.
3. Check Probability for: X > and enter 9 in its box.
4. Check Probability for range and enter 5 in the
first box and 9 in the second box.
5. Check X for Cumulative Percentage and enter 10
in its box.
6. Check X Values for Percentage and enter 95 in its
box.
7. Enter a Title and click OK.
Figure EG6.1
Normal Probability
Distribution dialog
box
In-depth Excel
Use the COMPUTE worksheet of the Normal workbook
as a template.
The worksheet already contains the data for solving the
problems in Examples 6.1 and 6.4 to 6.7. For other problems, change the values for the Mean, Standard Deviation, X Value, From X Value, To X Value, Cumulative
Percentage and/or Percentage.
If you use an Excel version older than Excel 2010, use
the COMPUTE_OLDER worksheet.
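For readers working outside Excel, the same normal calculations can be sketched in Python with the standard library's `statistics.NormalDist`. This is an illustrative equivalent and not part of the Excel Guide; it reuses the values from the PHStat steps above (mean 7, standard deviation 2), and the variable names are ours.

```python
from statistics import NormalDist

dist = NormalDist(mu=7, sigma=2)

# Cumulative probability, comparable to NORM.DIST(3.5, 7, 2, TRUE)
p_le = dist.cdf(3.5)

# Upper-tail probability P(X > 9)
p_gt = 1 - dist.cdf(9)

# Probability for a range, P(5 < X < 9)
p_range = dist.cdf(9) - dist.cdf(5)

# X value for a cumulative percentage of 10%
x_10 = dist.inv_cdf(0.10)

# Z value for a cumulative percentage, comparable to NORM.S.INV(0.10)
z_10 = NormalDist().inv_cdf(0.10)

# Standardising an X value, comparable to STANDARDIZE(3.5, 7, 2)
z_std = (3.5 - dist.mean) / dist.stdev

print(f"P(X <= 3.5) = {p_le:.4f}")
print(f"P(X > 9) = {p_gt:.4f}")
print(f"P(5 < X < 9) = {p_range:.4f}")
print(f"X for cumulative 10% = {x_10:.4f} (Z = {z_10:.4f})")
print(f"Z for X = 3.5: {z_std:.2f}")
```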
EG6.3 EVALUATING NORMALITY
Comparing Data Characteristics to Theoretical Properties
Use the Sections EG2.3, EG3.1 and EG3.4 instructions to
compare data characteristics to theoretical properties.
Constructing the Normal Probability Plot
Key technique
Use an Excel Scatter (X, Y) chart with Z values calculated
using the NORM.S.INV function.
Example
Construct the normal probability plot for the call length
data, as in Figure 6.22.
PHStat
Use Normal Probability Plot.
For the example, open the Call_Length file. Select
PHStat ➔ Probability & Prob. Distributions ➔ Normal
Probability Plot. In the Normal Probability Plot dialog
box (shown in Figure EG6.2):
1. Enter or highlight A1:A21 as the Variable Cell
Range.
2. Check First cell contains label.
3. Enter a Title and click OK.
Figure EG6.2
Normal Probability Plot
dialog box
In addition to the chart sheet containing the normal
probability plot, the procedure creates a plot data worksheet
identical to the PlotData worksheet discussed in the In-depth Excel instructions.
In-depth Excel
Use the worksheets of the NPP workbook as templates.
The NormalPlot chart sheet displays a normal probability plot using the rank, the proportion, the Z value and
the variable found in the PLOT_DATA worksheet. The
PLOT_DATA worksheet already contains the call length
data for the example. To construct a plot for a different variable, paste the sorted values for that variable in column D
of the PLOT_DATA worksheet.
If you have fewer than 20 values, delete rows from the
bottom up. If you have more than 20 values, select row 21,
right-click, click Insert ➔ Rows in the shortcut menu, copy
down the formulas in A20:C20 to the new rows and then
paste the sorted values for the variable in column D. To
create your own normal probability plot for the call length,
open to the PLOT_DATA worksheet and select the cell
range C1:D21. Then select Insert ➔ Scatter and select the
first Scatter gallery item (that shows only points and is
labeled with Scatter or Scatter with only Markers). Relocate the chart to a chart sheet, turn off the chart legend and
gridlines, add axis titles and modify the chart title.
If you use an Excel version older than Excel 2010, use
the PLOT_OLDER worksheet and the NormalPlot_OLDER chart sheet.
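The rank, proportion and Z value columns described above can also be sketched in Python. Two things here are assumptions rather than facts from the guide: the proportion is taken as rank/(n + 1), a common plotting position, and the `sample` values are hypothetical, since the call length data are not reproduced in this section.

```python
from statistics import NormalDist

def normal_plot_points(values):
    """Return (Z value, data value) pairs for a normal probability plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # standardised normal: mean 0, sd 1
    points = []
    for rank, value in enumerate(ordered, start=1):
        proportion = rank / (n + 1)          # assumed plotting position
        z = std_normal.inv_cdf(proportion)   # comparable to NORM.S.INV
        points.append((z, value))
    return points

# Hypothetical data, for illustration only
sample = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4]
points = normal_plot_points(sample)

for z, value in points:
    print(f"Z = {z:+.3f}, value = {value}")
```

Plotting the Z values on the horizontal axis against the data values on the vertical axis gives the scatter chart; points lying close to a straight line suggest approximately normal data.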
EG6.4 THE UNIFORM DISTRIBUTION
There are no Excel Guide instructions for this section.
EG6.5 THE EXPONENTIAL DISTRIBUTION
Key technique
Use the EXPON.DIST(X value, lambda, True) function, where lambda is the mean number of arrivals per unit.
Example
Calculate the exponential probability for the bank ATM
customer arrival example in Section 6.5.
PHStat
Use Exponential.
For the example, select PHStat ➔ Probability &
Prob. Distributions ➔ Exponential. In the procedure’s
dialog box (shown in Figure EG6.3):
1. Enter 20 as the Mean per unit (Lambda) and 0.1
as the X Value.
2. Enter a Title and click OK.
Figure EG6.3 Exponential Probability Distribution dialog box
In-depth Excel
Use the COMPUTE worksheet of the Exponential workbook as a template.
The worksheet already contains the data for the example. For other problems, change Lambda and X Value in
cells B4 and B5. If you use an Excel version older than
Excel 2010, use the COMPUTE_OLDER worksheet.
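The cumulative exponential probability of Equation 6.8 is simple enough to compute directly. The Python sketch below is an illustrative equivalent of the worksheet, using the values from the PHStat steps above (Lambda = 20, X Value = 0.1); the function name `expon_cdf` is ours.

```python
import math

def expon_cdf(x, lam):
    """P(arrival time <= x) for an exponential distribution with rate lam."""
    return 1 - math.exp(-lam * x)

lam = 20      # mean number of arrivals per unit (lambda)
x = 0.1       # X value, in the same time unit as lambda

p = expon_cdf(x, lam)   # Equation 6.8: 1 - e^(-lambda * x)
mean_time = 1 / lam     # Equation 6.9: mean time between arrivals

print(f"P(arrival time <= {x}) = {p:.4f}")
print(f"mean time between arrivals = {mean_time}")
```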
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 7
Sampling distributions
PACKAGING TEA TREE SHAMPOO
For centuries, Indigenous Australian peoples used the leaves of the tea tree, Melaleuca
alternifolia, for healing purposes. Now tea tree oil is being used in a variety of products
for its beneficial antiseptic and antifungal properties. Zoffira Pty Ltd is a small company
that manufactures a number of tea tree oil products, including Zoffira T Shampoo. The shampoo
is packaged in 500 mL clear pump-pack bottles via a conveyor belt process. You are in charge
of monitoring that bottles are being filled correctly.
Bottles are supposed to contain a mean of 500 mL of shampoo, as indicated on the package label.
Because of the speed of the process, the volume of the contents varies from bottle to bottle, causing some bottles to be underfilled and some overfilled. If the process is not working properly, the
mean volume in the bottles could vary too much from the label volume of 500 mL to be acceptable.
As weighing every single bottle is too time-consuming, costly and inefficient, you must take a
sample of bottles and make a decision regarding the probability that the packaging process is
working properly. Each time you select a sample of bottles and check the individual contents, you
calculate a sample mean $\bar{X}$. You need to determine the probability that such an $\bar{X}$ could have been
randomly drawn from a population whose population mean is 500 mL. Based on this assessment,
you will have to decide whether to maintain, alter or shut down the process.
© Nolan777|Dreamstime.com
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 interpret the concept of the sampling distribution
2 calculate probabilities related to the sample mean
3 recognise the importance of the Central Limit Theorem
4 calculate probabilities related to the sample proportion
In this chapter you need to make a decision about the shampoo-packaging process based on a
sample of shampoo bottles. You will learn about sampling distributions and how to use them to
solve business problems. As in the previous chapter, the normal distribution is used to calculate
probabilities.
7.1 SAMPLING DISTRIBUTIONS
In many applications, you want to make statistical inferences – that is, to use statistics calculated from samples to estimate the values of population parameters. In this chapter you will
learn more about the sample mean, a statistic used to estimate a population mean (a parameter).
You will also learn about the sample proportion, a statistic used to estimate the population proportion (a parameter). Your main concern when making a statistical inference is drawing conclusions about a population, not about a sample. For example, a political pollster is interested in
the sample results only as a way of estimating the actual proportion of the votes that each candidate will receive from the population of voters. Likewise, as an operations manager for Zoffira
Pty Ltd, you are interested only in using the sample mean calculated from a sample of shampoo
bottles for estimating the mean volume contained in a population of bottles.
In practice, you select a single random sample of a predetermined size from the population.
The items included in the sample are determined through the use of a random number generator, such as a table of random numbers (see Section 1.4 and Table E.1), or by using Microsoft
Excel (see page 36).
Hypothetically, to use the sample statistic to estimate the population parameter, you should
examine every possible sample that could occur. A sampling distribution is the distribution of the
results if you actually selected all possible samples.
LEARNING OBJECTIVE 1
Interpret the concept of the sampling distribution
sampling distribution
The probability distribution of a given sample statistic with repeated sampling of the population.
7.2 SAMPLING DISTRIBUTION OF THE MEAN
In Chapter 3, several measures of central tendency are discussed. Undoubtedly, the mean is the
most widely used measure of central tendency. The sample mean is often used to estimate the
population mean. The sampling distribution of the mean is the distribution of all possible sample
means if you select all possible samples of a certain size.
The Unbiased Property of the Sample Mean
sampling distribution of the mean
The distribution of all possible sample means from samples of a given size for a given population.
The sample mean is unbiased because the mean of all possible sample means (of a given sample
size n), $\mu_{\bar{X}}$, is equal to the population mean μ. A simple example concerns a population of four
candidates attempting a driver knowledge test of 45 questions in order to get a driver’s licence.
Table 7.1 presents the number of errors.
unbiased
If the average of all possible sample means equals the population mean then the sample mean is unbiased.
Table 7.1
Number of errors made by each of four driver’s knowledge test candidates

Candidate    Number of errors
Vicky        X1 = 3
Yvana        X2 = 2
Xing         X3 = 1
Zac          X4 = 4
This population distribution is shown in Figure 7.1.
Figure 7.1
Number of errors made by a population of four driver’s knowledge test candidates (frequency plotted against number of errors)
When you have the data from a population, you calculate the mean using Equation 7.1.
POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size N.
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{7.1}$$
You calculate the population standard deviation σ using Equation 7.2.
POPULATION STANDARD DEVIATION
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \tag{7.2}$$
Thus, for the data of Table 7.1:
$$\mu = \frac{3 + 2 + 1 + 4}{4} = 2.5$$
and:
$$\sigma = \sqrt{\frac{(3 - 2.5)^2 + (2 - 2.5)^2 + (1 - 2.5)^2 + (4 - 2.5)^2}{4}} = 1.12 \text{ errors}$$
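As a quick check of Equations 7.1 and 7.2, the following minimal Python sketch recomputes the population mean and standard deviation for the Table 7.1 data. Python and the variable names are our own choice for illustration (the text itself works with Excel). Note that `statistics.pstdev` divides by N, which matches Equation 7.2.

```python
import statistics

# Number of errors for the four candidates in Table 7.1
errors = [3, 2, 1, 4]

# Population mean (Equation 7.1): sum of the values divided by N
mu = statistics.mean(errors)

# Population standard deviation (Equation 7.2): pstdev divides by N, not N - 1
sigma = statistics.pstdev(errors)

print(f"mu = {mu}, sigma = {sigma:.2f}")
```

Using `statistics.stdev` instead would divide by N - 1, which is appropriate for a sample but not for a complete population such as this one.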
If you select samples of two candidates with replacement from this population, there are 16
possible samples (N^n = 4^2 = 16). Table 7.2 lists the 16 possible sample outcomes. If you average all 16 of these sample means, the mean of these values, μ_X̄, is equal to 2.5, which is also the
mean of the population μ.
Table 7.2
All 16 samples of n = 2 test candidates from a population of N = 4 candidates when sampling with replacement

Sample   Candidates      Sample outcomes   Sample mean
1        Vicky, Vicky    3, 3              X̄1 = 3
2        Vicky, Yvana    3, 2              X̄2 = 2.5
3        Vicky, Xing     3, 1              X̄3 = 2
4        Vicky, Zac      3, 4              X̄4 = 3.5
5        Yvana, Vicky    2, 3              X̄5 = 2.5
6        Yvana, Yvana    2, 2              X̄6 = 2
7        Yvana, Xing     2, 1              X̄7 = 1.5
8        Yvana, Zac      2, 4              X̄8 = 3
9        Xing, Vicky     1, 3              X̄9 = 2
10       Xing, Yvana     1, 2              X̄10 = 1.5
11       Xing, Xing      1, 1              X̄11 = 1
12       Xing, Zac       1, 4              X̄12 = 2.5
13       Zac, Vicky      4, 3              X̄13 = 3.5
14       Zac, Yvana      4, 2              X̄14 = 3
15       Zac, Xing       4, 1              X̄15 = 2.5
16       Zac, Zac        4, 4              X̄16 = 4
                                           ΣX̄ = 40

$$\mu_{\bar{X}} = \frac{40}{16} = 2.5$$
Since the mean of the 16 sample means is equal to the population mean, the sample mean
is an unbiased estimator of the population mean. Therefore, although you do not know how
close the sample mean of any particular sample selected comes to the population mean, you are
at least assured that the mean of all the possible sample means that could have been selected is
equal to the population mean.
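The enumeration in Table 7.2 can be reproduced with a short Python sketch. This is only an illustration under our own naming (the dictionary `errors` and the other variable names are not from the text); it lists every ordered sample of two candidates drawn with replacement and confirms that the mean of the 16 sample means equals the population mean.

```python
from itertools import product
from statistics import mean

# Population from Table 7.1: candidate -> number of errors
errors = {"Vicky": 3, "Yvana": 2, "Xing": 1, "Zac": 4}
values = list(errors.values())

# Every ordered sample of n = 2 drawn with replacement: N^n = 4^2 = 16
samples = list(product(values, repeat=2))
sample_means = [mean(sample) for sample in samples]

mu = mean(values)              # population mean
mu_xbar = mean(sample_means)   # mean of all possible sample means

print(len(samples), sum(sample_means), mu, mu_xbar)
```

Because `product` keeps the order of selection, (3, 2) and (2, 3) count as different samples, exactly as samples 2 and 5 do in Table 7.2.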
Standard Error of the Mean
Figure 7.2 illustrates the variation in the sample mean when selecting all 16 possible samples.
In this small example, although the sample mean varies from sample to sample depending on
which candidates are selected, the sample mean does not vary as much as the individual values
in the population. That the sample means are less variable than the individual values in the
population follows directly from the fact that each sample mean averages together all the values
in the sample. A population consists of individual outcomes that can take on a wide range of
values from extremely small to extremely large. However, if a sample contains an extreme
value, although this value will have an effect on the sample mean, the effect is reduced because
the value is averaged with all the other values in the sample. As the sample size increases, the
effect of a single extreme value becomes smaller because it is averaged with more values.
Figure 7.2
Sampling distribution of the mean based on all possible samples containing two candidates (frequency plotted against number of errors)
standard error of the mean
Reflects how much the sample mean varies from its average value in repeated experiments.
The value of the standard deviation of all possible sample means, called the standard error
of the mean, expresses how the sample mean varies from sample to sample. Equation 7.3 defines
the standard error of the mean when sampling with replacement or without replacement (see
page 18) from large or infinite populations.
STANDARD ERROR OF THE MEAN
The standard error of the mean σ_X̄ is equal to the standard deviation in the population σ
divided by the square root of the sample size n.
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \tag{7.3}$$
Therefore, as the sample size increases, the standard error of the mean decreases by a factor equal to the square root of the sample size.
You can also use Equation 7.3 as an approximation to the standard error of the mean when
the sample is selected without replacement, if the sample contains less than 5% of the entire
population. Example 7.1 calculates the standard error of the mean for such a situation.
EXAMPLE 7.1
CALCULATING THE STANDARD ERROR OF THE MEAN
Return to the shampoo-packaging process described in the scenario on page 248. If you randomly select a sample of 25 bottles without replacement from the thousands of bottles filled
during a shift, the sample contains far less than 5% of the population. Given that the standard
deviation of the shampoo-packaging process is 15 mL, calculate the standard error of the mean.
SOLUTION
Using Equation 7.3 with n = 25 and σ = 15, the standard error of the mean is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3 \text{ mL}$$
The variation in the sample means for samples of n = 25 is much less than the variation in
individual bottles of shampoo (i.e. σ_X̄ = 3 while σ = 15).
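Equation 7.3 is simple enough to wrap in a small helper. The sketch below is illustrative only: the function name `standard_error` is our own, and Python stands in for the Excel tools the text uses. It reproduces Example 7.1 and then cross-checks the formula against the four-candidate population of Table 7.1, where the standard deviation of all 16 sample means should equal σ divided by the square root of 2.

```python
import math
from itertools import product
from statistics import mean, pstdev


def standard_error(sigma, n):
    """Standard error of the mean, sigma / sqrt(n) (Equation 7.3)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


# Example 7.1: sigma = 15 mL, n = 25 bottles
se_shampoo = standard_error(15, 25)

# Cross-check on the Table 7.1 population, sampling with replacement, n = 2
errors = [3, 2, 1, 4]
sample_means = [mean(pair) for pair in product(errors, repeat=2)]
se_enumerated = pstdev(sample_means)             # spread of all 16 sample means
se_formula = standard_error(pstdev(errors), 2)   # sigma / sqrt(2)

print(se_shampoo)
print(round(se_enumerated, 4), round(se_formula, 4))
```

The two values on the last line agree because the 16 samples are drawn with replacement, which is the setting in which Equation 7.3 holds exactly.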
Sampling from Normally Distributed Populations
Now that the concept of a sampling distribution has been introduced and the standard error of
the mean has been defined, what distribution will the sample mean follow? If you are sampling
from a population that is normally distributed with mean μ and standard deviation σ, regardless
of the sample size n, the sampling distribution of the mean is normally distributed with mean
μ_X̄ = μ and standard error of the mean σ_X̄.
In the simplest case, if you take samples of size n = 1, each possible sample mean is a single
value from the population because:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1}{1} = X_1$$
Therefore, if the population is normally distributed with mean μ and standard deviation σ, the
sampling distribution of X̄ for samples of n = 1 must also follow the normal distribution with
mean μ_X̄ = μ and standard error of the mean σ_X̄ = σ/√1 = σ. In addition, as the sample size
increases, the sampling distribution of the mean still follows a normal distribution with mean
μ_X̄ = μ, but the standard error of the mean decreases, so that a larger proportion of sample
means are closer to the population mean. Figure 7.3 illustrates this reduction in variability, in
which 500 samples of sizes 1, 2, 4, 8, 16 and 32 were randomly selected from a normally
distributed population. From the polygons in Figure 7.3, you can see that, although the sampling
distribution of the mean is approximately¹ normal for each sample size, the sample means are
distributed more tightly around the population mean as the sample size is increased.
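The experiment behind Figure 7.3 can be imitated with a short simulation. The sketch below is not the procedure used to draw the figure; it assumes a standard normal population (mean 0, standard deviation 1), an arbitrary random seed and our own variable names. For each sample size it draws 500 samples and compares the observed standard deviation of the sample means with σ/√n.

```python
import random
import statistics

random.seed(7)  # arbitrary seed so that the run is repeatable

MU, SIGMA = 0.0, 1.0   # assumed standard normal population
NUM_SAMPLES = 500      # same number of samples as in Figure 7.3

spread = {}
for n in (1, 2, 4, 8, 16, 32):
    # Draw 500 samples of size n and record each sample mean
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(NUM_SAMPLES)
    ]
    spread[n] = statistics.stdev(means)
    # Observed spread of the sample means next to the theoretical sigma / sqrt(n)
    print(n, round(spread[n], 3), round(SIGMA / n ** 0.5, 3))
```

The observed and theoretical values will not match exactly, because only 500 of the infinitely many possible samples are drawn, but the observed spread should shrink steadily as n grows.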
To examine the concept of the sampling distribution of the mean further, consider the
shampoo-packaging scenario again. The packaging equipment that is filling 500-mL bottles of
shampoo is set so that the amount of shampoo in a bottle is normally distributed with a mean of
500 mL. From past experience, the population standard deviation for this filling process is 15 mL.
If you randomly select a sample of 25 bottles from the many thousands that are filled in a
day and the mean volume is calculated for this sample, what type of result could you expect?
For example, do you think that the sample mean could be 500 mL? 300 mL? 510 mL?
Figure 7.3
Sampling distribution of the mean from 500 samples of sizes n = 1, 2, 4, 8, 16 and 32 selected from a normal population (one polygon per sample size, plotted against Z and centred on 0)
¹ Remember that ‘only’ 500 samples out of an infinite number of samples have been selected, so that the sampling
distributions shown are only approximations of the true distributions.
The sample acts as a miniature representation of the population, so if the values in the
population are normally distributed, the values in the sample should be approximately normally
distributed. Thus, if the population mean is 500 mL, the sample mean has a good chance of
being close to 500 mL.
How can you determine the probability that the sample of 25 bottles will have a mean
below 497 mL? From the normal distribution (Section 6.2) you know that you can find the area
below any value X by converting to standardised Z units:
$$Z = \frac{X - \mu}{\sigma}$$
In the examples in Section 6.2 we saw how any single value X differs from the mean. Now, in
the shampoo-packaging example, the value involved is a sample mean X̄ and we wish to determine the likelihood that a sample mean is below 497. Thus, by substituting X̄ for X, μ_X̄ for μ
and σ_X̄ for σ, the appropriate Z value is defined in Equation 7.4.
FINDING Z FOR THE SAMPLING DISTRIBUTION OF THE MEAN
The Z value is equal to the difference between the sample mean X̄ and the population
mean μ, divided by the standard error of the mean σ_X̄.
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}} \tag{7.4}$$
LEARNING OBJECTIVE 2
Calculate probabilities related to the sample mean
To find the area below 497 mL, from Equation 7.4:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{497 - 500}{\dfrac{15}{\sqrt{25}}} = \frac{-3}{3} = -1.00$$
The area corresponding to Z = -1.00 in Table E.2 is 0.1587. Therefore, 15.87% of all the possible samples of size 25 have a sample mean below 497 mL. This is not the same as saying that
a certain percentage of individual bottles will have less than 497 mL of shampoo. We calculate
that percentage as follows:
$$Z = \frac{X - \mu}{\sigma} = \frac{497 - 500}{15} = \frac{-3}{15} = -0.20$$
The area corresponding to Z = - 0.20 in Table E.2 is 0.4207. Therefore, 42.07% of the individual bottles are expected to contain less than 497 mL. Comparing these results, we see that
many more individual bottles than sample means are below 497 mL. This result is explained by
the fact that each sample consists of 25 different values, some small and some large. The averaging process dilutes the importance of any individual value, particularly when the sample size
is large. Thus, the chance that the sample mean of 25 bottles is far away from the population
mean is less than the chance that a single bottle is far away.
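The two probabilities just compared can be recomputed without Table E.2. The following sketch is an illustration in Python rather than the text's own method; it uses `NormalDist` from the standard library, and the variable names are ours. It places the distribution of individual bottles next to the sampling distribution of the mean for n = 25.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA, N = 500, 15, 25   # shampoo-filling process and sample size

bottle = NormalDist(MU, SIGMA)                 # individual bottles
sample_mean = NormalDist(MU, SIGMA / sqrt(N))  # means of samples of 25 bottles

p_mean_below = sample_mean.cdf(497)   # P(sample mean < 497), Z = -1.00
p_bottle_below = bottle.cdf(497)      # P(single bottle < 497), Z = -0.20

print(f"{p_mean_below:.4f} {p_bottle_below:.4f}")
```

The only difference between the two calculations is the standard deviation supplied to `NormalDist`: σ for a single bottle, and the standard error σ/√n for the sample mean.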
Examples 7.2 and 7.3 show how these results are affected by using a different sample size.
EXAMPLE 7.2
THE EFFECT OF SAMPLE SIZE n ON THE CALCULATION OF σ_X̄
How is the standard error of the mean affected by increasing the sample size from 25 to
100 bottles?
SOLUTION
If n = 100 bottles, then using Equation 7.3:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = \frac{15}{10} = 1.5$$
The fourfold increase in the sample size from 25 to 100 reduces the standard error of the
mean by half – from 3 mL to 1.5 mL. This demonstrates that taking a larger sample results
in less variability in the sample means from sample to sample.
EXAMPLE 7.3
THE EFFECT OF SAMPLE SIZE n ON THE CLUSTERING OF MEANS IN THE SAMPLING DISTRIBUTION
In the shampoo-packaging example, if you select a sample of 100 bottles, what is the probability that the sample mean is below 497 mL?
SOLUTION
Using Equation 7.4:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{497 - 500}{\dfrac{15}{\sqrt{100}}} = \frac{-3}{1.5} = -2.00$$
From Table E.2, the area less than Z = -2.00 is 0.0228. Therefore, 2.28% of the samples of
100 have means below 497 mL, as compared with 15.87% for samples of 25.
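Examples 7.2 and 7.3 differ from the earlier calculation only in the sample size, so they can be checked together. The helper below is a minimal sketch with a hypothetical name, `prob_mean_below`; it applies Equation 7.4 and then reads the area from the standard normal distribution instead of Table E.2.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_below(x_bar, mu, sigma, n):
    """Return (Z, P(sample mean < x_bar)) using Equation 7.4."""
    standard_error = sigma / sqrt(n)    # Equation 7.3
    z = (x_bar - mu) / standard_error   # Equation 7.4
    return z, NormalDist().cdf(z)


results = {}
for n in (25, 100):
    results[n] = prob_mean_below(497, 500, 15, n)
    z, p = results[n]
    print(f"n = {n}: Z = {z:.2f}, probability = {p:.4f}")
```

Quadrupling the sample size halves the standard error, which doubles the size of Z and moves 497 mL much further into the lower tail.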
Sometimes, you need to find the interval that contains a fixed proportion of the sample
means. You need to determine a distance below and above the population mean containing a
specific area of the normal curve. From Equation 7.4:
$$Z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}}$$
Solving for X̄ results in Equation 7.5.
FINDING X̄ FOR THE SAMPLING DISTRIBUTION OF THE MEAN
$$\bar{X} = \mu + Z\frac{\sigma}{\sqrt{n}} \tag{7.5}$$
Example 7.4 illustrates the use of Equation 7.5.
EXAMPLE 7.4
DETERMINING THE INTERVAL THAT INCLUDES A FIXED PROPORTION OF THE SAMPLE MEANS
In the shampoo-packaging example, find an interval around the population mean that will
include 95% of the sample means based on samples of 25 bottles.
SOLUTION
If 95% of the sample means are in the interval, then 5% are outside the interval. Divide the
5% into two equal parts of 2.5%. The value of Z in Table E.2 corresponding to an area of
0.0250 in the lower tail of the normal curve is -1.96, and the value of Z corresponding to a
cumulative area of 0.975 (i.e. 0.025 in the upper tail of the normal curve) is +1.96. The lower
value of X̄ (called X̄_L) and the upper value of X̄ (called X̄_U) are found by using Equation 7.5:
$$\bar{X}_L = 500 + (-1.96)\frac{15}{\sqrt{25}} = 500 - 5.88 = 494.12$$
$$\bar{X}_U = 500 + (1.96)\frac{15}{\sqrt{25}} = 500 + 5.88 = 505.88$$
Therefore, 95% of all sample means are between 494.12 and 505.88 mL for samples of
25 bottles.
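The interval in Example 7.4 can also be found by letting software look up the Z value. The sketch below assumes Python's `NormalDist.inv_cdf` in place of Table E.2; the function name `central_interval` is our own. It applies Equation 7.5 with the lower and upper Z values that leave equal areas in the two tails.

```python
from math import sqrt
from statistics import NormalDist


def central_interval(mu, sigma, n, proportion):
    """Interval around mu that contains the given proportion of sample means."""
    tail_area = (1 - proportion) / 2           # e.g. 0.025 in each tail for 95%
    z = NormalDist().inv_cdf(1 - tail_area)    # about 1.96 for 95%
    half_width = z * sigma / sqrt(n)           # Z times the standard error
    return mu - half_width, mu + half_width    # Equation 7.5 with -Z and +Z


lower, upper = central_interval(500, 15, 25, 0.95)
print(f"{lower:.2f} to {upper:.2f} mL")
```

Changing the `proportion` argument gives the intervals asked for in several of the problems at the end of this section, for example 0.80 or 0.60.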
Sampling from Non-normally Distributed Populations –
The Central Limit Theorem
So far in this section we have discussed the sampling distribution of the mean for a normally
distributed population. However, in many instances, either you know that the population is not
normally distributed or it is unrealistic to assume a normal distribution. An important theorem
in statistics, the Central Limit Theorem, deals with this situation.
LEARNING OBJECTIVE
3
Recognise the importance
of the Central Limit
Theorem
Central Limit Theorem
If the sample size is large enough,
the distribution of sample means
will be approximately normal even if
the samples came from a
population that was not normal.
THE CENTRAL LIMIT THEOREM
The Central Limit Theorem states that, as the sample size (i.e. the number of values in
each sample) gets large enough, the sampling distribution of the mean is approximately
normally distributed. This is true regardless of the shape of the distribution of the individual values in the population.
What sample size is large enough? A great deal of statistical research has gone into this
issue. As a general rule, statisticians have found that for many population distributions, when
the sample size is at least 30, the sampling distribution of the mean is approximately normal.
However, you can apply the Central Limit Theorem for even smaller sample sizes if the population distribution is approximately bell-shaped. In the uncommon case where the distribution is
extremely skewed or has more than one mode, you may need sample sizes larger than 30 to
ensure normality.
Figure 7.4 illustrates the application of the Central Limit Theorem to different populations.
The sampling distributions from three different continuous distributions (normal, uniform and
exponential) for varying sample sizes (n = 2, 5, 30) are displayed.
Panel A of Figure 7.4 shows the sampling distribution of the mean selected from a normal population. As mentioned earlier, when the population is normally distributed the sampling distribution of the mean is normally distributed for any sample size. (You can measure
the variability using the standard error of the mean, Equation 7.3.) Because of the unbiasedness property, the mean of any sampling distribution is always equal to the mean of the
population.
Panel B of Figure 7.4 depicts the sampling distribution from a population with a uniform
(or rectangular) distribution (see Section 6.4). When samples of size n = 2 are selected, there is
a peaking or central limiting effect already working. For n = 5, the sampling distribution is
bell-shaped and approximately normal. When n = 30, the sampling distribution looks very
similar to a normal distribution. In general, the larger the sample size the more closely the sampling distribution will follow a normal distribution. As in all cases, the mean of each sampling
distribution is equal to the mean of the population, and the variability decreases as the sample
size increases.
Figure 7.4
Sampling distribution of the mean for different populations for samples of n = 2, 5 and 30
(Panel A: normal population; Panel B: uniform population; Panel C: exponential population. Each panel shows the population distribution of X followed by the sampling distribution of X̄ for n = 2, n = 5 and n = 30.)
Panel C of Figure 7.4 presents an exponential distribution (see Section 6.5). This population
is heavily skewed to the right. When n = 2, the sampling distribution is still highly skewed to the
right but less so than the distribution of the population. For n = 5, the sampling distribution is
more symmetrical with only a slight skew to the right. When n = 30, the sampling distribution
looks approximately normal. Again, the mean of each sampling distribution is equal to the mean
of the population, and the variability decreases as the sample size increases.
Using the results from these well-known statistical distributions (normal, uniform and
exponential), you can make the following conclusions regarding the Central Limit Theorem:
• For most population distributions, regardless of shape, the sampling distribution of the
mean is approximately normally distributed if samples of at least 30 are selected.
• If the population distribution is fairly symmetrical, the sampling distribution of the mean
is approximately normal for samples as small as 5.
• If the population is normally distributed, the sampling distribution of the mean is
normally distributed regardless of the sample size.
The Central Limit Theorem is of crucial importance in using statistical inference to draw
conclusions about a population. It allows you to make inferences about the population mean
without having to know the specific shape of the population distribution.
You can explore how the Central Limit Theorem works yourself using Excel to generate
samples through a Random Number Generator (see the Chapter 7 Excel Guide at the end of
this chapter).
PHStat also has an easy-to-use Sampling Distributions Simulation.
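The same exploration can be sketched outside Excel. The Python simulation below is an illustration under our own assumptions, not the Excel Guide procedure: it uses a uniform population on 0 to 1 and an exponential population with mean 1, draws 5,000 samples for each of n = 2, 5 and 30, and uses a simple skewness measure (the mean cubed standardised deviation) to indicate how far the sample means are from a symmetric bell shape.

```python
import random
import statistics

random.seed(2019)  # arbitrary seed so that the run is repeatable


def skewness(values):
    """Mean cubed standardised deviation; 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values, mu=m)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


# Assumed populations: (function that draws one value, population mean)
POPULATIONS = {
    "uniform": (lambda: random.uniform(0.0, 1.0), 0.5),
    "exponential": (lambda: random.expovariate(1.0), 1.0),
}
NUM_SAMPLES = 5000

results = {}
for name, (draw, population_mean) in POPULATIONS.items():
    for n in (2, 5, 30):
        # Record the mean of each of the 5,000 samples of size n
        means = [
            statistics.fmean([draw() for _ in range(n)])
            for _ in range(NUM_SAMPLES)
        ]
        results[(name, n)] = (statistics.fmean(means), skewness(means))
        mean_of_means, skew = results[(name, n)]
        print(f"{name:12s} n = {n:2d}  mean of sample means = {mean_of_means:.3f}"
              f"  (population mean {population_mean})  skewness = {skew:.2f}")
```

For the exponential population the skewness of the sample means should fall towards zero as n increases, while for the symmetric uniform population it should stay close to zero throughout, in line with Panels B and C of Figure 7.4.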
Problems for Section 7.2
LEARNING THE BASICS
7.1 Given a normal distribution with μ = 100 and σ = 10, if you
select a sample of n = 25:
a. What is the probability that X̄ is:
i. less than 95?
ii. between 95 and 97.5?
iii. above 102.2?
b. There is a 65% chance that X̄ is above what value?
7.2 Given a normal distribution with μ = 50 and σ = 5, if you select
a sample of n = 100:
a. What is the probability that X̄ is:
i. less than 47?
ii. between 47 and 49.5?
iii. above 51.1?
b. There is a 35% chance that X̄ is above what value?
APPLYING THE CONCEPTS
7.3 For each of the following three populations, indicate what the
sampling distribution of the mean for samples of 25 would
consist of.
a. Travel expense vouchers for a university in an academic year
b. Absentee records (days absent per year) in 2010 for
employees of a large construction company
c. Yearly sales (in litres) of E10 fuel at service stations located
in a particular state
7.4 The following data represent the number of days absent per
year in a population of six employees of a small company:
1 3 6 7 9 10
a. Assuming that you sample without replacement, select all
possible samples of n = 2 and construct the sampling
distribution of the mean. Calculate the mean of all the
sample means and also calculate the population mean. Are
they equal? What is this property called?
b. Repeat (a) for all possible samples of n = 3.
c. Compare the shape of the sampling distribution of the mean
in (a) and (b). Which sampling distribution has less
variability? Why?
d. Assuming that you sample with replacement, repeat (a), (b)
and (c) and compare the results. Which sampling distributions
have the least variability, those in (a) or (b)? Why?
7.5 The number of passengers passing through a large South East
Asian airport is normally distributed with a mean of 110,000
persons per day and a standard deviation of 20,200 persons. If
you select a random sample of 16 days:
a. What is the sampling distribution of the mean?
b. What is the probability that the sample mean is less than
98,000 passengers per day?
c. What is the probability that the sample mean is between
102,000 and 104,500 passengers per day?
d. The probability is 60% that the sample mean will be
between which two values symmetrically distributed around
the population mean?
7.6 Realestate.com.au reports that the median price of houses
in the Newcastle suburb of Merewether that were sold in the
13 months to March 2017 was $1,150,000 (<www.realestate.
com.au/neighbourhoods/merewether-2291-nsw?cid=srp>
accessed 3 April 2017). Suppose that the mean price of
houses in Merewether sold during that period was
$1,236,450 and the standard deviation was $150,000.
a. If you take samples of n = 2, describe the shape of the
sampling distribution of X̄.
b. If you take samples of n = 100, describe the shape of the
sampling distribution of X̄.
c. If you take a random sample of n = 100, what is the probability
that the sample mean will be less than $1,235,000?
7.7 Travel time on a bus between two suburban stops is normally
distributed with μ = 8 minutes and σ = 2 minutes.
a. If you select a random sample of 25 trips, what is the
probability that the sample mean is between 6.9 and
8.2 minutes?
b. If you select a random sample of 25 trips, what is the
probability that the sample mean is between 7.5 and
8 minutes?
c. If you select a random sample of 100 trips, what is the
probability that the sample mean is between 6.9 and
8.2 minutes?
d. Explain the difference between the results of (a) and (c).
7.8 It is often important to monitor traffic on a website as
organisations need to make online interactions with their clients
faster and easier. For example, businesses applying for an
Australian Business Number (ABN) online at <www.abr.gov.au>
are asked to have a variety of information about their entity
ready before they begin the online process. Assume that ABN
online-application times are normally distributed with a mean
time of 40 minutes and a standard deviation of 5 minutes. If a
random sample of 50 applications is taken:
a. What is the probability that the sample mean application
time is less than 38 minutes?
b. What is the probability that the sample mean is between
39 and 41 minutes?
c. The probability is 80% that the sample mean is between
what two values symmetrically distributed around the
population mean?
d. The probability is 90% that the sample mean is less than
what value?
7.9 A company is having a new corporate website developed. In the
final testing phase the download time to open the new home
page is recorded for a large number of computers in office and
home settings. The mean download time for the site is
3.61 seconds. Suppose that the download times for the site are
normally distributed with a standard deviation of 0.5 seconds. If
you select a random sample of 30 download times:
a. What is the probability that the sample mean download time
is less than 3.75 seconds?
b. What is the probability that the sample mean is between
3.70 and 3.90 seconds?
c. The probability is 80% that the sample mean is between
which two values symmetrically distributed around the
population mean?
d. The probability is 90% that the sample mean is less than
what value?
7.3 SAMPLING DISTRIBUTION OF THE PROPORTION
Consider a categorical variable that has only two categories, such as the customer prefers your
brand or the customer prefers the competitor’s brand. Of interest is the proportion of items
belonging to one of the categories; for example the proportion of customers who prefer your
brand. The population proportion, represented by π, is the proportion of items in the entire
population with the characteristic of interest. The sample proportion, represented by p, is the
proportion of items in the sample with the characteristic of interest. The sample proportion, a
statistic, is used to estimate the population proportion, a parameter. To calculate the sample
proportion, you assign the two possible outcomes scores of 1 or 0 to represent the presence or
absence of the characteristic. You then sum all the 1 and 0 scores and divide by n, the sample
size. For example, if, in a sample of five customers, three preferred your brand and two did
not, you have three ones and two zeroes. Summing the three ones and two zeroes and dividing
by the sample size of 5 gives you a sample proportion of 0.60 who preferred your brand.
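The scoring scheme just described translates directly into code. In this small Python sketch the list `preferences` is our own encoding of the five customers in the example (1 for a customer who prefers your brand, 0 otherwise).

```python
# Five customers: 1 = prefers your brand, 0 = does not
preferences = [1, 1, 1, 0, 0]

# Sample proportion: the sum of the 1 and 0 scores divided by the sample size n
p = sum(preferences) / len(preferences)

print(p)
```

Because the scores are only 1s and 0s, the sample proportion is simply the sample mean of the scores, which is why results for the sample mean carry over to the sample proportion.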
THE SAMPLE PROPORTION
$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \tag{7.6}$$
The sample proportion p takes on values between 0 and 1. If all individuals possess the
characteristic, you assign each a score of 1 and p is equal to 1. If half the individuals possess
the characteristic, you assign half a score of 1, and assign the other half a score of 0, and p is
equal to 0.5. If none of the individuals possess the characteristic, you assign each a score of 0
and p is equal to 0.
standard error of the proportion
The standard deviation of the sample proportion for repeated samples.
While the sample mean X̄ is an unbiased estimator of the population mean μ, the statistic
p is an unbiased estimator of the population proportion π. By analogy to the sampling distribution of the mean, the standard error of the proportion, σ_p, is given in Equation 7.7.
STANDARD ERROR OF THE PROPORTION
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} \tag{7.7}$$
sampling distribution of the proportion
The distribution of all possible sample proportions from samples of a certain size.
If you select all possible samples of a certain size, the distribution of all possible sample
proportions is referred to as the sampling distribution of the proportion. When sampling with
replacement from a finite population, the sampling distribution of the proportion follows the
binomial distribution, as discussed in Section 5.3. However, you can use the normal distribution
to approximate the binomial distribution when nπ and n(1 - π) are each greater than 5 (see
Section 6.6). In most cases in which inferences are made about the proportion, the sample size is
substantial enough to meet the conditions for using the normal approximation (see reference 1).
Therefore, in many instances, you can use the normal distribution to estimate the sampling
distribution of the proportion. Substituting p for X̄, π for μ and √(π(1 – π)/n) for σ/√n in
Equation 7.4 results in Equation 7.8.
DIFFERENCE BETWEEN THE SAMPLE PROPORTION AND THE POPULATION PROPORTION IN STANDARDISED NORMAL UNITS
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \tag{7.8}$$
LEARNING OBJECTIVE 4
Calculate probabilities related to the sample proportion
To illustrate the sampling distribution of the proportion, suppose that the manager of a railway’s WiFi services determines that 40% of all passengers have multiple WiFi-enabled devices
available on board their train. You select a random sample of 200 passengers and count those with
multiple WiFi-enabled devices. The probability that the sample proportion of passengers with multiple devices is less than 0.30 is calculated as follows.
Because nπ = 200(0.40) = 80 > 5 and n(1 - π) = 200(0.60) = 120 > 5, the sample size is
large enough to assume that the sampling distribution of the proportion is approximately normally distributed. Using Equation 7.8:
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{0.30 - 0.40}{\sqrt{\dfrac{(0.40)(0.60)}{200}}} = \frac{-0.10}{\sqrt{\dfrac{0.24}{200}}} = \frac{-0.10}{0.0346} = -2.89$$
Using Table E.2, the area under the normal curve less than Z = -2.89 is 0.0019. Therefore, the
probability that the sample proportion is less than 0.30 is 0.0019 – a highly unlikely event. This
means that if the true proportion of successes in the population is 0.40, less than one-fifth of 1%
of the samples of n = 200 are expected to have sample proportions of less than 0.30.
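The WiFi calculation can be checked with a few lines of Python. This is a sketch only: the function name `prob_proportion_below` is hypothetical, and `NormalDist` from the standard library replaces the Table E.2 lookup. The function first confirms the condition for the normal approximation (nπ and n(1 - π) each greater than 5) and then applies Equation 7.8.

```python
from math import sqrt
from statistics import NormalDist


def prob_proportion_below(p, pi, n):
    """Return (Z, P(sample proportion < p)) using the normal approximation."""
    # Condition from the text for approximating the binomial by the normal
    if n * pi <= 5 or n * (1 - pi) <= 5:
        raise ValueError("normal approximation is not appropriate for this n and pi")
    standard_error = sqrt(pi * (1 - pi) / n)   # Equation 7.7
    z = (p - pi) / standard_error              # Equation 7.8
    return z, NormalDist().cdf(z)


z, probability = prob_proportion_below(0.30, 0.40, 200)
print(f"Z = {z:.2f}, probability = {probability:.4f}")
```

The function carries the unrounded Z value into the probability calculation, so the result can differ slightly in later decimal places from a table lookup at Z = -2.89.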
Problems for Section 7.3
LEARNING THE BASICS
7.10 In a random sample of 64 people, 48 are classified as
‘successful’. If the population proportion is 0.70:
a. Determine the sample proportion p of ‘successful’ people.
b. Determine the standard error of the proportion.
7.11 A random sample of 50 households was selected for a
telephone survey. The key question asked was, ‘Has any
member of your household travelled by plane in the past
month?’ Of the 50 respondents, 16 said yes and 34 said no. If
the population proportion is 0.40:
a. Determine the sample proportion p of households with
members who have travelled by plane in the past month.
b. Determine the standard error of the proportion.
7.12 The following data represent the responses (Y for yes and N for
no) from a sample of 40 university students to the question, ‘Do
you currently own any shares in listed companies?’:
N N Y N N Y N Y N Y N N Y N Y Y N N N Y
N Y N N N N Y N N Y Y N N N Y N N Y N N
If the population proportion is 0.30:
a. Determine the sample proportion p of university students
who own shares in listed companies.
b. Determine the standard error of the proportion.
APPLYING THE CONCEPTS
7.13 A political polling organisation is conducting an analysis of
sample results in order to make predictions on election night.
Assuming a two-candidate election, if a specific candidate
receives at least 55% of the vote in the sample, then that
candidate will be forecast as the winner of the election. If you
select a random sample of 100 voters:
a. What is the probability that a candidate will be forecast as
the winner when:
i. the true percentage of her vote is 50.1%?
ii. the true percentage of her vote is 60%?
iii. the true percentage of her vote is 49% (and she will
actually lose the election)?
b. If the sample size is increased to 400, what are your
answers to (a)? Discuss.
7.14 You plan to conduct a marketing experiment in which students
are to taste one of two different brands of soft drink. Their
task is to identify correctly the brand they tasted. You select a
random sample of 200 students and assume they have no
ability to distinguish between the two brands. (Hint: If an
individual has no ability to distinguish between the two soft
drinks, then each brand is equally likely to be selected.)
a. What is the probability that the sample will have between
50% and 60% of the identifications correct?
b. The probability is 90% that the sample percentage is
contained within which symmetrical limits of the population
percentage?
c. What is the probability that the sample percentage of correct
identifications is greater than 65%?
d. Which is more likely to occur – more than 60% correct
identifications in the sample of 200 or more than 55%
correct identifications in a sample of 1,000? Explain.
7.15 Over the past few years there has been increased
monitoring of the representation of women on corporate
boards. The Australian Institute of Company Directors
reports in its March–May 2016 Report that 23.6% of ASX
200 board members were female (<www.companydirectors.com.au/~/media/resources/director-resource-centre/governance-and-director-issues/board-diversity/boarddiversity-pdf/05385-2-coms-gender-diversity-quarterlyreport-june16-a4_web.ashx> accessed 25 April 2017).
Suppose that the true percentage of women on ASX
200 boards is now 24.6% and that a random sample of
220 board members is chosen.
a. What is the probability that in the sample less than 24% of
board members will be women?
b. What is the probability that in the sample between 24.2%
and 25.0% of board members will be women?
c. What is the probability that in the sample between 24.5%
and 24.7% of board members will be women?
d. If a sample of 100 is taken, how does this change your
answers to (a), (b) and (c)?
7.16 People with permanent visas accounted for 19.5% of the net
overseas migration to Australia during 2015. The relative shares
of the different visa categories were: Family visas, 6.9%;
Skilled, 9.0%; and Special Eligibility and Humanitarian, 2.5%
(Australian Bureau of Statistics, Migration, Australia, 2015–16,
Cat. No. 3412.0, March 2017). Suppose a government
department is conducting a follow-up study and randomly
selects 260 people who migrated in 2015.
a. What is the probability that more than 9.1% of the people in
the sample are skilled migrants?
b. What is the probability that less than 2.8% are holders of
Special Eligibility or Humanitarian permanent visas?
c. If a random sample of size 500 is taken, how does this
change your answers to (a) and (b)?
7.17 As technology continues to change rapidly there has been a
worldwide trend towards the use of smaller and more
mobile devices and away from PCs. Analysts at Gartner
predicted that in 2019 only 8% of devices shipped
worldwide would be traditional PCs (desktops or notebooks)
(<www.consumerit.eu/index.php?option=com_content&view=article&id=3363:gartner-spending-on-the-devicesup-shipments-flat&catid=20&Itemid=100017> accessed
27 April 2017). Assume this prediction holds and you
randomly select a sample of 100 people who purchase a
device shipped in 2019.
a. What is the probability that between 7.5% and 8.2%
purchase a traditional PC?
b. The probability is 90% that the sample percentage will be
contained within which symmetrical limits of the population
percentage?
c. The probability is 95% that the sample percentage will be
contained within which symmetrical limits of the population
percentage?
7.18 According to an Australian Government report, retail trade is the
second largest employing industry in Australia with more than
1.267 million workers, or 11% of working Australians
(Department of Employment, Australian Jobs 2016 <https://docs.employment.gov.au/system/files/doc/other/australianjobs2016_0.pdf> accessed 28 April 2017). This
report shows that the percentage of those employed in retail
trade in November 2015 who were working part-time was 49%.
Assuming this percentage is still current:
a. If you select a random sample of 400 Australian retail trade
workers, what is the probability that the sample has
between 45% and 50% who are employed part-time?
b. If a current sample of 400 Australian retail trade workers has
50.2% who are employed part-time, what can you infer
about the population estimate of 49%? Explain.
c. If a current sample of 100 Australian retail trade workers has
50.2% who are employed part-time, what can you infer
about the population estimate of 49%? Explain.
d. Explain the difference between the results in (b) and (c).
7.19 The Australian Tax Office carries out a range of verification
checks and audits for the goods and services tax (GST)
including Business Activity Statement integrity audits. Assume
that currently no additional tax is collected for 25% of such
audits. Suppose that you select a random sample of 100 audits.
What is the probability that the sample will have:
a. between 24% and 26% of audits that collect no additional tax?
b. between 20% and 30% of audits that collect no additional tax?
c. more than 30% of audits that collect no additional tax?
7.20 The 11th Annual Statistical Report of the HILDA Survey relates
to the 2016 phase of a large longitudinal study of Australian
residents. It found that 19.9% of households surveyed had
HECS/HELP debts and 35.7% had debts on their home (R.
Wilkins (ed), The Household, Income and Labour Dynamics in
Australia Survey: Selected Findings from Waves 1 to 14,
Melbourne Institute of Applied Economic and Social Research,
University of Melbourne, 2016 <http://melbourneinstitute.unimelb.edu.au/__data/assets/pdf_file/0007/2155507/hildastatreport-2016.pdf> accessed 28 April 2017).
Assume the same percentages found in the survey apply
right now for all Australian households. In a sample of 600 of
these households, what is the probability that:
a. more than 18% of households have HECS/HELP debts?
b. fewer than 33.5% of households have debts on their home?
Assess your progress
Summary
In this chapter we looked at the sampling distribution of the
sample mean, the Central Limit Theorem and the sampling
distribution of the sample proportion. You learned that the sample
mean is an unbiased estimator of the population mean and the
sample proportion is an unbiased estimator of the population
proportion. By observing the mean volume in a sample of
shampoo bottles filled by Zoffira Pty Ltd, you were able to draw
conclusions about the mean volume in the population of shampoo
bottles.
In the next three chapters, techniques commonly used for
statistical inference, confidence intervals and tests of hypotheses
are discussed.
Key formulas
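Population mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (7.1)$$
Population standard deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (7.2)$$
Standard error of the mean
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (7.3)$$
Finding Z for the sampling distribution of the mean
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \qquad (7.4)$$
Finding X̄ for the sampling distribution of the mean
$$\bar{X} = \mu + Z \frac{\sigma}{\sqrt{n}} \qquad (7.5)$$
Sample proportion
$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \qquad (7.6)$$
Standard error of the sample proportion
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} \qquad (7.7)$$
Finding Z for the sampling distribution of the proportion
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} \qquad (7.8)$$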
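As a minimal sketch of how Equations (7.3) to (7.8) can be applied, the following Python helpers use only the standard library. The function names and the numbers in the usage lines are illustrative assumptions and are not taken from the chapter.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_mean(sigma: float, n: int) -> float:
    """Equation (7.3): standard error of the mean, sigma / sqrt(n)."""
    return sigma / sqrt(n)


def z_for_mean(x_bar: float, mu: float, sigma: float, n: int) -> float:
    """Equation (7.4): Z value for a sample mean."""
    return (x_bar - mu) / standard_error_mean(sigma, n)


def x_bar_for_z(z: float, mu: float, sigma: float, n: int) -> float:
    """Equation (7.5): sample mean that corresponds to a given Z value."""
    return mu + z * standard_error_mean(sigma, n)


def standard_error_proportion(pi: float, n: int) -> float:
    """Equation (7.7): standard error of the sample proportion."""
    return sqrt(pi * (1 - pi) / n)


def z_for_proportion(p: float, pi: float, n: int) -> float:
    """Equation (7.8): Z value for a sample proportion."""
    return (p - pi) / standard_error_proportion(pi, n)


# Hypothetical numbers for illustration only: mu = 100, sigma = 15, n = 25.
z = z_for_mean(103, 100, 15, 25)      # (103 - 100) / (15 / 5) = 1.0
prob_below = NormalDist().cdf(z)      # P(sample mean < 103), assuming normality
print(f"Z = {z:.2f}, P(X-bar < 103) = {prob_below:.4f}")
```

The cumulative standardised normal probability is taken from `statistics.NormalDist` rather than from a printed table, so small rounding differences from table-based answers are to be expected.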
Key terms
Central Limit Theorem
sampling distribution
sampling distribution of the mean
sampling distribution of the proportion
standard error of the mean
standard error of the proportion
unbiased
References
1. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).
Chapter review problems
CHECKING YOUR UNDERSTANDING
7.21 Why is the sample mean an unbiased estimator of the
population mean?
7.22 Why does the standard error of the mean decrease as the
sample size n increases?
7.23 Why does the sampling distribution of the mean follow a
normal distribution for a large enough sample size even though
the population may not be normally distributed?
7.24 What is the difference between a probability distribution and a
sampling distribution?
7.25 Under what circumstances does the sampling distribution of
the proportion approximately follow the normal distribution?
APPLYING THE CONCEPTS
7.26 A particular type of ballpoint pen uses minute ball bearings that
are targeted to have a diameter of 0.5 mm. The lower and
upper specification limits under which the ball bearing can
operate are 0.49 mm (lower) and 0.51 mm (upper). Past
experience has indicated that the actual diameter of the ball
bearings is approximately normally distributed with a mean of
0.503 mm and a standard deviation of 0.004 mm. If you select
a random sample of 25 ball bearings:
a. What is the probability that the sample mean is:
i. between the target and the population mean of 0.503?
ii. between the lower specification limit and the target?
iii. above the upper specification limit?
iv. below the lower specification limit?
b. The probability is 93.32% that the sample mean diameter
will be above what value?
7.27 The fill amount of milk in plastic containers is normally
distributed with a mean of 2.0 litres and a standard deviation
of 0.05 litres. If you select a random sample of 25 containers:
a. What is the probability that the sample mean will be:
i. between 1.99 and 2.0 litres?
ii. below 1.98 litres?
iii. above 2.01 litres?
b. The probability is 99% that the sample mean will contain at
least how much milk?
c. The probability is 99% that the sample mean will contain an
amount that is between which two values (symmetrically
distributed around the mean)?
7.28 The ABS has reported that in 2015, 26.78% of the 16.8 million
employees in Australia worked part-time in their main job
(Australian Bureau of Statistics, Characteristics of Employment,
Australia, August 2015, Cat. No. 6333.0, 2016). Suppose that you
select a random sample of 250 employees from around Australia.
a. What is the probability that more than 26.2% of those
sampled work part-time in their main job?
b. What is the probability that the proportion of part-time
employees is between 0.27 and 0.29?
c. The probability is 77% that the sample proportion of part-time
employees will be above what value?
7.29 A new online advertisement for an Extra Dry beer has been
designed for a target audience of Australian males aged 18 to 30.
The advertisers hope that 24% of the target audience will find the
ad ‘very entertaining’. Suppose that a sample of 400 male
television viewers in the target age group is shown the advertisement.
What is the probability that the sample will have between:
a. 18% and 22% who find it ‘very entertaining’?
b. 16% and 24% who find it ‘very entertaining’?
c. 14% and 26% who find it ‘very entertaining’?
d. 12% and 28% who find it ‘very entertaining’?
7.30 Assume that, for the first quarter of 2017, the weekly rental
costs of three-bedroom dwellings in a coastal town in Western
Australia are normally distributed with a mean of $260 and a
standard deviation of $30. If you select a random sample of
10 dwellings from this population, what is the probability that
the sample will have a mean rental cost:
a. less than $270?
b. between $265 and $275?
c. greater than $282?
7.31 APRA, the Australian Prudential Regulation Authority, monitors
the return rates of large superannuation funds in Australia. Its
publication Statistics: Quarterly superannuation performance,
December 2016 showed an annual rate of return of 6.8% for
the year <www.apra.gov.au/Super/Publications/Documents/2016QSP201612.pdf>. Imagine that a researcher
with access to the APRA data finds that the average rate of
return for the largest superannuation funds in the last year has
been 7.5% with a standard deviation of 0.7%, and that rates of
return were normally distributed.
a. If the researcher selects an individual fund at random from
this population, what is the probability that the fund had a
return of:
i. less than 8.2%?
ii. between 6.9% and 7.8%?
iii. greater than 7.9%?
b. If a random sample of 10 funds is selected from this
population, what is the probability that the sample mean
lies in the ranges given in (a)?
7.32 Assume that the returns for shares on the Chinese share
market were distributed as a normal random variable, with a
mean of 1.54 and a standard deviation of 10. If you select an
individual share from this population, what is the probability
that it would have a return:
a. less than 0 (i.e. a loss)?
b. between –10 and –20?
c. greater than –5?
If you selected a random sample of four shares from this
population, what is the probability that the sample would have
a mean return:
d. less than 0 (a loss)?
e. between –10 and –20?
f. greater than –5?
g. Compare your results in parts (d) to (f) to those in (a) to (c).
7.33 (Class project) The table of random numbers is an example of
a uniform distribution because each digit is equally likely to
occur. Starting in the row corresponding to the day of the
month on which you were born, use the table of random
numbers (Table E.1) to take one digit at a time.
Select five different samples of n = 2, n = 5 and n = 10.
Calculate the sample mean of each sample. Develop a
frequency distribution of the sample means for the results of
the entire class based on samples of sizes n = 2, n = 5 and
n = 10. What can be said about the shape of the sampling
distribution for each of these sample sizes?
7.34 (Class project) Toss a coin 10 times and record the number
of heads. If each student performs this experiment five
times, a frequency distribution of the number of heads
can be developed from the results of the entire class.
Does this distribution seem to approximate the normal
distribution?
7.35 (Class project) The table of random numbers can simulate the
selection of different-coloured balls from a bowl as follows:
1. Start in the row corresponding to the day of the month on
which you were born.
2. Select one-digit numbers.
3. If a random digit between 0 and 6, inclusive, is selected,
consider the ball white; if a random digit is a 7, 8 or 9,
consider the ball red.
Select samples of n = 10, n = 25 and n = 50 digits. In
each sample, count the number of white balls and calculate
the proportion of white balls in the sample. If each student in
the class selects five different samples for each sample size,
a frequency distribution of the proportion of white balls (for
each sample size) can be developed from the results of the
entire class. What conclusions can you reach about the
sampling distribution of the proportion as the sample size is
increased?
7.36 (Class project) Suppose that step 3 of problem 7.35 uses the
following rule: ‘If a random digit between 0 and 8, inclusive, is
selected, consider the ball to be white; if a random digit of 9 is
selected, consider the ball to be red’. Compare and contrast
the results in this problem and in problem 7.35.
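The class projects in problems 7.35 and 7.36 can also be explored by simulation. The sketch below is only a stand-in for the table of random numbers: it assumes Python's pseudo-random digits in place of Table E.1, and the function name, the seed and the choice of 500 simulated samples per sample size are arbitrary illustrative choices.

```python
import random
from statistics import mean, pstdev

random.seed(35)  # arbitrary seed so the run is repeatable


def white_proportion(n, max_white_digit=6):
    """Draw n random digits (0-9); digits 0..max_white_digit count as white."""
    digits = [random.randint(0, 9) for _ in range(n)]
    return sum(d <= max_white_digit for d in digits) / n


for n in (10, 25, 50):
    # 500 simulated samples per size, standing in for the pooled class results
    props = [white_proportion(n) for _ in range(500)]
    print(f"n = {n:2d}: mean proportion = {mean(props):.3f}, "
          f"std dev = {pstdev(props):.3f}")
```

Calling `white_proportion(n, max_white_digit=8)` applies the rule from problem 7.36 instead, so the two sets of simulated proportions can be compared.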
Continuing cases
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a
large national real estate company, has collected samples of recent residential sales from a sample of non-capital
cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Find the mean price for the sample of 125 properties sold in regional city 1 of state A. What is the
probability of finding a sample mean at least this large if the population mean and standard deviation
of prices for this city are $300,000 and $100,000 respectively?
b Now find the mean price for the sample of 125 properties sold in the coastal city of state B. What is the
probability that the sample mean is less than or equal to this value if the population mean and standard
deviation for this city are $595,000 and $287,000 respectively?
c Discuss why your answers to (a) and (b) are not the same as finding comparable probabilities for
individual properties sold in each city.
Chapter 7 Excel Guide
EG7.1 SAMPLING DISTRIBUTION OF THE MEAN
Key technique
Use an add-in procedure to create a simulated sampling distribution.
Example
Create a simulated sampling distribution that consists of
100 samples of n = 30 from a uniformly distributed population.
Analysis ToolPak
Use Random Number Generation.
For the example, select Data ➔ Data Analysis. In the
Data Analysis dialog box, select Random Number Generation from the Analysis Tools list and then click OK. In
the procedure’s dialog box (shown in Figure EG7.1):
1. Enter 100 as the Number of Variables.
2. Enter 30 as the Number of Random Numbers.
3. Select Uniform from the Distribution dropdown list.
4. Keep the Parameters values as they are.
5. Click New Worksheet Ply and then click OK.
Figure EG7.1 shows the entries for generating 100 samples
of n = 30 from a uniformly distributed population.
Figure EG7.1 Data Analysis Random Number Generation dialog box
If you are using PHStat with either Excel for Mac 2016
or Excel 2016, see Appendix D.1 (Sampling Distribution
of the Mean) to produce an enhanced version of this worksheet.
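If you are working outside Excel, the same simulation can be sketched in Python. This is an illustrative analogue of the Random Number Generation procedure, not a reproduction of it: it assumes a uniform population on the interval 0 to 1 (taken here to match leaving the Parameters unchanged, which is an assumption), and the seed and variable names are arbitrary.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed so the run is repeatable

NUM_SAMPLES = 100   # plays the role of "Number of Variables"
SAMPLE_SIZE = 30    # plays the role of "Number of Random Numbers"

# Each inner list is one sample of n = 30 draws from a uniform(0, 1) population.
samples = [[random.random() for _ in range(SAMPLE_SIZE)]
           for _ in range(NUM_SAMPLES)]

# One mean per sample gives a simulated sampling distribution of the mean.
sample_means = [mean(s) for s in samples]

print(f"mean of the {NUM_SAMPLES} sample means: {mean(sample_means):.3f}")
```

The list `sample_means` corresponds to what you would obtain in the worksheet by averaging each of the 100 columns.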
EG7.2 CENTRAL LIMIT THEOREM
By using the above method to generate 50 samples of size
3, then size 10 and size 40 from a uniform distribution, you
should be able to observe how the Central Limit Theorem
works. On PHStat simply click on Histogram to see the
shape of the sampling distribution. If you are using Excel’s
Random Number Generator a bit more work is required.
For each set of samples use the =AVERAGE function to
calculate the mean of the first sample, then drag or copy
this to find the means of the remaining 49 samples. Next,
create frequency distributions of the sample means using
the methods described in the Chapter 2 Excel Guide.
Last, compare the three frequency tables. You should
see that they resemble a normal distribution more closely as
the sample size increases.
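The comparison described above can also be sketched in Python, as a hedged alternative to the worksheet approach. The sketch assumes a uniform population on 0 to 1, and the function name and seed are illustrative; it reports the spread of the sample means rather than drawing histograms.

```python
import random
from statistics import mean, pstdev

random.seed(7)  # arbitrary seed so the run is repeatable


def simulate_sample_means(sample_size, num_samples=50):
    """Return the means of num_samples samples drawn from uniform(0, 1)."""
    return [mean([random.random() for _ in range(sample_size)])
            for _ in range(num_samples)]


for n in (3, 10, 40):
    means = simulate_sample_means(n)
    # The spread of the sample means should shrink roughly like 1 / sqrt(n).
    print(f"n = {n:2d}: mean = {mean(means):.3f}, std dev = {pstdev(means):.3f}")
```

With only 50 samples per size the figures vary from run to run, but the standard deviation of the sample means should generally fall as n increases, in line with Equation (7.3).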
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
End of Part 2 problems
B.1 A soft-drink bottling company maintains records of the number
of unacceptable bottles of soft drink coming from the filling
and capping machines. Based on past data, the probability that
a bottle came from machine I and was unacceptable is 0.01,
and the probability that a bottle came from machine II and was
unacceptable is 0.025. Half the bottles are filled on machine I
and the other half are filled on machine II. If a filled bottle of
soft drink is selected at random:
a. What is the probability that it is unacceptable?
b. What is the probability that it was filled on machine I or is
acceptable?
c. Suppose you know that the bottle was filled on machine I.
What is the probability that it is unacceptable?
d. Suppose you know that the bottle is unacceptable. What is
the probability that it was filled on machine I?
e. Explain the difference in the answers to (c) and (d). (Hint:
Construct a 2 × 2 contingency table or a Venn diagram to
evaluate the probabilities.)
B.2 The fill amount of soft-drink bottles is normally distributed with
a mean of 2.0 litres (the listed content) and a standard deviation
of 0.05 litre. Bottles that contain less than 95% of the listed net
content (1.90 litres, in this case) make the manufacturer subject
to penalties. Bottles that have a net content above 2.10 litres
may cause excess spillage upon opening.
a. What proportion of the bottles will contain:
i. between 1.90 and 2.0 litres?
ii. between 1.90 and 2.10 litres?
iii. less than 1.90 litres or more than 2.10 litres?
b. 99% of the bottles contain at least how much soft drink?
c. 99% of the bottles contain an amount that is between which
two values (symmetrically distributed) around the mean?
B.3 In an effort to reduce the number of bottles that contain less
than 1.90 litres, the bottler in problem B.2 sets the filling
machine so that the mean is 2.02 litres. Under these
circumstances, what are your answers to (a) to (c)?
B.4 a. If a coin is tossed seven times, how many different
outcomes are possible?
b. If a die is rolled seven times, how many different outcomes
are possible?
c. Discuss the differences in your answers to (a) and (b).
B.5 The time between arrivals of cars at Sheng’s carwash is
exponential with an average of 6 minutes between arrivals.
What is the probability that the time between successive
arrivals will be
a. less than 2 minutes?
b. more than 10 minutes?
c. between 4 and 6 minutes?
B.6 The following data represent the electricity cost in dollars
during the month of July for a random sample of 50 two-bedroom
apartments in a New Zealand city: < ELECTRICITY >
96 171 202 178 147 102 153 197 127 82
157 185 90 116 172 111 148 213 130 165
141 149 206 175 123 128 144 168 109 167
95 163 150 154 130 143 187 166 139 149
108 119 183 151 114 135 191 137 129 158
a. Decide whether the electricity cost for July is approximately
normal by:
i. evaluating the actual versus theoretical properties
ii. constructing a normal probability plot
From part (a), assume that electricity cost for July is
normally distributed with a mean of $147 and standard
deviation of $31.70.
b. A two-bedroom apartment is selected at random. What is
the probability that electricity cost for July is:
i. less than $120?
ii. between $100 and $160?
iii. more than $225?
c. For 10% of two-bedroom apartments, the electricity cost for
July is above what amount?
d. The cost of electricity for the middle 95% of two-bedroom
apartments is between which two amounts?
B.7 An electrical retail store has found that 55% of its customers
use a credit card to pay for their purchases.
a. If 15 customers who make a purchase are randomly
selected, what is the probability that:
i. none use a credit card?
ii. exactly five use a credit card?
iii. more than two use a credit card?
b. What are the mean and the standard deviation of the
probability distribution?
B.8 It has been observed that 92% of train commuters travelling
during the 8.00 am to 9.00 am period use a mobile phone
during their trip for various activities.
a. In a train carriage with 42 passengers during this period,
what is the probability that fewer than 38 passengers use
their mobile phone during their commute?
b. If the carriage has 50 passengers, what is the probability
that between 43 and 47 passengers use their mobile phone?
B.9 From a consignment of 64 large garden pots in individual
crates being shipped from Vietnam to a local importer, 16 have
imperfections such as cracks or are broken.
a. If eight crates are shipped to a particular garden nursery,
what is the probability that:
i. all eight will have defective pots?
ii. none will have a defective pot?
iii. at least one will have a defective pot?
b. What would be your answers to (a) if eight crates have
defective pots?
B.10 East Park Realty, a small real estate company located in
country areas of South Australia, specialises primarily in
residential listings. It is interested in determining the probability
of one of its listings being sold within a certain number of days.
An analysis of company sales of 800 houses in the previous
year produces the following data.
                          Days listed until sold
Initial asking price    30 and under    31–90    Over 90    Total
Under $200,000                    50       40         10      100
$200,000–$299,999                 40      140         70      250
$300,000–$399,999                 30      270        100      400
$400,000 or more                  10       30         10       50
Total                            130      480        190      800
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of ‘asking price under $200,000’?
d. Why is ‘asking price under $200,000 and being listed more
than 90 days until sold’ a joint event?
e. Given that a house had an asking price of less than
$200,000, what is the probability that it took more than
90 days to sell?
f. Given that a house took more than 90 days to sell, what is
the probability that its asking price was less than
$200,000?
g. Explain the difference in the results in (e) and (f).
h. Are the two events – asking price less than $200,000, and
taking more than 90 days to sell – statistically independent?
i. If a house is selected at random, what is the probability that
i. it is listed more than 90 days before being sold?
ii. its initial asking price is at least $400,000?
iii. its initial asking price is at least $400,000 and it is listed
more than 90 days before being sold?
iv. its initial asking price is more than $400,000 or it is
listed more than 90 days before being sold?
j. Explain the difference in the results in parts (i) to (iv) above.
B.11 You are trying to develop a strategy for investing in two different
shares. The anticipated annual return for a $1,000 investment
in each share has the following probability distribution:
                     Returns
Probability    Share X    Share Y
0.1               -$50      -$100
0.3                 20         50
0.4                100        130
0.2                150        200
a. Calculate the:
i. expected return for share X and for share Y
ii. standard deviation for share X and for share Y
iii. covariance of share X and share Y
b. Would you invest in share X or share Y? Explain.
B.12 Suppose that in problem B.11 you wanted to create a portfolio
that consists of share X and share Y.
a. Calculate the portfolio expected return and portfolio risk for
each of the following percentages invested in share X:
i. 30%
ii. 50%
iii. 70%
b. On the basis of the results in (a), which portfolio would you
recommend? Explain.
B.13 At an ocean-side nuclear power plant, seawater is used as part
of the cooling system. This system raises the temperature of
the water that is discharged back into the ocean. The amount
that the water temperature is raised has a uniform distribution
over the interval from 10°C to 25°C.
a. What is the probability that the temperature will increase
less than 20°C?
b. What is the probability that the temperature will increase
between 20°C and 22°C?
c. A temperature increase of more than 18°C is considered
potentially dangerous to the environment. What is the
probability that, at any point of time, the temperature
increase is potentially dangerous?
d. What is the mean and standard deviation of the temperature
increase?
B.14 A survey of 1,500 students at a large university gave the
following data on their study mode (full- or part-time) as well
as their employment status.
Employment status     Studying full-time    Studying part-time    All students
Employed full-time                    94                   558             652
Employed part-time                   292                   190             482
Not employed                         278                    88             366
All students                         664                   836           1,500
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of ‘employed full-time’?
d. Why is ‘employed full-time and studying full-time’ a joint event?
e. If a student is selected at random, what is the probability that:
i. they are employed?
ii. they are studying part-time and are employed?
iii. they are studying part-time or are employed?
f. Explain the difference between the results in part (e) above.
B.15 Telephone calls arrive at the information desk of a large
computer software company at the rate of 15 per hour.
a. What is the probability that the next call will arrive within
3 minutes (0.05 hour)?
b. What is the probability that the next call will arrive within
15 minutes (0.25 hour)?
c. Suppose the company has just introduced an updated
version of one of its software programs, and telephone calls
are now arriving at the rate of 25 per hour. Given this
information, redo (a) and (b).
B.16 On a tourism Twitter site, where photos of scenic views and native
animals are regularly shared, the long-term average number of
‘Likes’ obtained per photo posted is 600.5, with a standard
deviation of 76. A sample of 52 photos is selected at random.
a. What is the probability that the average number of ‘Likes’
for the sample is at least 630?
b. What is the probability that the average number of ‘Likes’ is
less than 575?
B.17 In each game of OZ Lotto seven numbers are selected from 1
to 45. To win the first-division prize, the seven winning
numbers must have been selected. On any game, what is the
probability of winning the first division?
B.18 To test the effectiveness of mail X-ray screening in identifying
potential illegal or threatening items, a mail centre X-rays a
random sample of 500 packages and then independently
searches each package. The results of this test are given
below.
                       X-ray items identified
Search items found     Yes      No     Total
Yes                     36      12        48
No                      14     438       452
Total                   50     450       500
a. What percentage of items does the X-ray identify as
potentially illegal or threatening?
b. What proportion of items identified by X-ray as potentially
illegal or threatening are found to be such when searched?
c. An item is found during the search to be illegal or
threatening. What is the probability that the X-ray identified
it as potentially illegal or threatening?
d. What percentage of items are not found to be illegal or
threatening during the search and not identified as illegal or
threatening by X-ray?
B.19 Of the packages searched at the mail centre in problem B.18,
9.6% are found to contain illegal or threatening items. Suppose
10 packages are independently and randomly selected to be
searched.
a. What is the probability that:
i. exactly two contain illegal or threatening items?
ii. none contain illegal or threatening items?
iii. at least one contains illegal or threatening items?
iv. more than half contain illegal or threatening items?
b. What is the expected number and standard deviation of the
number of packages with illegal or threatening items?
B.20 The table below classifies the academic staff of a small
regional university by gender and level of appointment.
                             Gender
Level                  Female    Male    Total
Professor                  13      21       34
Associate professor        16      24       40
Senior lecturer            37      52       89
Lecturer                   74      58      132
Associate lecturer         23      13       36
Total                     163     168      331
a. Calculate the following probabilities:
i. A randomly selected academic staff member is female.
ii. A randomly selected male academic staff member is a
senior lecturer or above.
iii. A randomly selected academic staff member is a female
associate lecturer.
iv. A randomly selected professor is female.
v. A randomly selected academic staff member is an
associate professor.
b. Are level of appointment and gender statistically
independent? Explain.
B.21 Suppose the executive of the university in problem B.20
randomly select five senior (senior lecturer and above)
academic staff members for a committee. Calculate the
following probabilities:
a. The selected members of the committee are all male senior
lecturers.
b. There are no professors on the committee.
c. At least half the committee is female.
d. There is exactly one professor on the committee.
e. There are three associate professors on the committee.
B.22 An on-the-job injury occurs once every 10 days on average at
a car manufacturer. What is the probability that the next
on-the-job injury will occur within:
a. 10 days?
b. 5 days?
c. 1 day?
B.23 In a recent opinion poll a sample of 1,200 adults (at least
20 years old) was surveyed. Of these adults, 768 were married,
684 were female and there were 459 married females.
Construct a contingency table or a Venn diagram and
evaluate the probability that a surveyed adult selected at
random:
a. is male
b. is single
c. is a married male
d. is a single female
B.24 The following table contains the probability distribution for the
number of traffic accidents per day in a small city.
Number of
accidents daily (X)
P(X)
0
1
2
3
4
5
0.10
0.20
0.45
0.15
0.05
0.05
a. Calculate the mean or expected number of accidents per
day.
b. Calculate the standard deviation.
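A short sketch of how the mean and standard deviation of a discrete probability distribution such as this one might be computed in Python. The variable names are illustrative; the probabilities are those given in the problem.

```python
import math

# Probability distribution of daily accidents, as given in the problem.
distribution = {0: 0.10, 1: 0.20, 2: 0.45, 3: 0.15, 4: 0.05, 5: 0.05}

# A valid probability distribution must sum to 1 (allowing for rounding).
assert math.isclose(sum(distribution.values()), 1.0)

# Expected value: each outcome weighted by its probability.
mean = sum(x * p for x, p in distribution.items())

# Variance: probability-weighted squared deviations from the mean.
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.4f}")
```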
B.25 On average 108 customers per hour join a queue at any one of
the checkout counters of a grocery store. Suppose that the
number of customers joining a queue at the checkout counters
follows an approximate Poisson distribution.
a. What is the probability that in the next minute:
i. exactly four customers join a queue?
ii. at least one customer joins a queue?
b. What is the probability that in the next 5 minutes:
i. exactly 10 customers join a queue?
ii. at least 10 customers join a queue?
B.26 A computer help desk has two technicians, A with advanced
training, who is able to solve 95% of problems, and B with less
training, who is only able to solve 85% of problems. Each
technician randomly receives 50% of problems.
a. What percentage of solved problems are solved by
technician A?
b. What percentage of problems are solved?
B.27 A particular weekly Bingo session consists of 20 games. In
each game, there are two points where a player can win (a line
and a house).
Assume on a typical week that there are 100 players, each
player is equally likely to win and winning is independent.
a. Calculate the probability that a player has a win (line and/or
house) on a game. Ignore the possibility of multiple winners
at any stage of a game.
b. Calculate the probability that a player wins at least once
during the evening.
On a typical week Biff went to Bingo with four friends, each
of whom won at least once but she did not.
c. Calculate the probability that in a group of five players
exactly four will win at least once.
d. Calculate the probability that Biff does not have a win but
her four friends do.
The Bingo session costs a player $8, with each line won
paying $10 and each house $20.
e. Construct the probability distribution for the amount a
player wins in a game.
f. What is the expected amount a player wins in a game?
g. What are the variance and standard deviation of the amount
a player wins in a game?
h. What is a player’s expected profit (or loss) from the Bingo
session?
B.28 Based on past experience, 40% of all customers at Miller’s
Service Station pay for their purchases with a credit card. If a
random sample of 200 customers is selected, what is the
approximate probability that:
a. at least 75 pay with a credit card?
b. not more than 70 pay with a credit card?
c. between 70 and 75 customers, inclusive, pay with a credit
card?
B.29 At the local golf course golfers lose golf balls at a rate of 3.8
per 18-hole round. Assume that the number of golf balls lost in
an 18-hole round is distributed as a Poisson random variable.
a. What assumptions need to be made so that the number of
golf balls lost in an 18-hole round is distributed as a
Poisson random variable?
b. Given the assumptions made in (a), what is the probability
that in an 18-hole round:
i. at least one ball will be lost?
ii. less than three balls will be lost?
iii. more than five balls will be lost?
B.30 The Tasmanian Visitor Survey presents data in an analyser
database on a number of aspects of tourism, including
attractions visited by tourists aged 14 or over. The most visited
attractions by 1,283,618 tourists in the October 2016 to
September 2017 period were the Saturday Salamanca Market
(443,600/34.6%), MONA – the Museum of Old and New Art
(352,222/27.4%) and Mt Wellington (328,752/25.6%) (data
obtained from <www.tvsanalyser.com.au>).
a. If a survey of 300 people aged 14 or over who toured
Tasmania during the period in question is taken, what is the
probability that at least 30% visited MONA?
b. What is the probability in this survey that between 31%
and 36% of tourists visited the Saturday Salamanca
Market?
c. What is the probability in this survey that fewer than 23% of
tourists visited Mt Wellington?
B.31 A box of nine golf gloves contains two left-handed gloves and
seven right-handed gloves.
a. If two gloves are randomly selected from the box without
replacement, what is the probability that both gloves will be
right-handed?
b. If two gloves are randomly selected from the box without
replacement, what is the probability that one right-handed
glove and one left-handed glove will be selected?
c. If three gloves are selected with replacement, what is the
probability that all three will be left-handed?
d. If you were sampling with replacement, what would be the
answers to (a) and (b)?
B.32 Based on past experience, the owner of a stall at the local annual
show states that 60% of visitors to the stall will purchase a
showbag. On a certain day, the stall has 100 visitors.
a. Is the 60% figure best classified as a priori classical probability,
empirical classical probability or subjective probability?
b. Find the expected number and standard deviation of sales,
assuming that number of sales is binomial.
c. If the showbags cost $12 each, find the expected revenue
from the sales.
d. What assumptions are necessary in (b)?
B.33 The cost of a phone call passed on to a ‘live’ operator is
approximately 10 times that of a call answered by an
automated customer-service system. However, as more and
more companies have implemented automated systems,
customer annoyance with these systems has grown. Many
customers are quick to leave the automated system when given
an option such as ‘Press zero to talk to a customer-service
representative’. Research has shown that approximately 40%
of all callers to automated customer-service systems will
automatically opt to go to a live operator when given the chance.
a. If 10 independent callers contact an automated customer-service
system, what is the probability that:
i. none of the callers will automatically opt to talk to a live
operator?
ii. exactly one will automatically opt to talk to a live
operator?
iii. two or fewer will automatically opt to talk to a live
operator?
iv. all 10 will automatically opt to talk to a live operator?
b. If all 10 automatically opt to talk to a live operator, do you
think that the 40% figure applies to this particular system?
Explain.
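A minimal sketch of how the binomial probabilities in part (a) might be checked in Python, assuming n = 10 independent callers and a probability of 0.40 that each opts for a live operator. The helper names `binomial_pmf` and `binomial_cdf` are illustrative only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k), found by summing the individual probabilities."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


n, p = 10, 0.40  # values from the problem statement

p_none = binomial_pmf(0, n, p)
p_exactly_one = binomial_pmf(1, n, p)
p_two_or_fewer = binomial_cdf(2, n, p)
p_all_ten = binomial_pmf(10, n, p)

print(f"P(X = 0)  = {p_none:.4f}")
print(f"P(X = 1)  = {p_exactly_one:.4f}")
print(f"P(X <= 2) = {p_two_or_fewer:.4f}")
print(f"P(X = 10) = {p_all_ten:.6f}")
```

A very small value for P(X = 10) is the kind of evidence part (b) asks you to weigh against the 40% figure.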
B.34 One theory concerning the Standard & Poor’s (S&P) 500 Index
of US stocks is that if it increases during the first five trading
days of the year, it is likely to increase during the entire year.
From 1929 to 2016, early gains during the first five days
predicted full-year gains approximately 69.5% (41 out of 59) of
the time. Assuming that this indicator is a random event with
no predictive value, you would expect that the indicator would
be correct 50% of the time.
a. What is the probability of the S&P 500 Index increasing in
41 or more of 59 years with an early gain if the true
probability of an increase in the S&P 500 Index is:
i. 0.50?
ii. 0.70?
iii. 0.90?
b. Based on the results in (a), what do you think is the
probability that the S&P 500 Index will increase if there is an
early gain in the first five trading days of the year? Explain.
B.35 A research institute has interviewed a total of 1,764 employers.
Fifty-three per cent of the 264 employers from the
telecommunications industry expected to have a net increase
in employment in their company during the next quarter. Only
43% of employers interviewed from other industries expected
a net increase during the same period.
a. If an employer from this survey pool is selected at random
and expects that there will be a net increase in employment
in his company during the next quarter, what is the
probability that his company is in the telecommunications
industry?
b. What is the chance that an employer, selected at random, is
neither from the telecommunications industry nor expects
an increase?
B.36 A quinella consists of picking the horses that will place first
and second in a race irrespective of order. Suppose eight
horses are entered in a race.
a. How many quinella combinations are there for this race?
b. If you choose two horses randomly, what is the probability
that you win the quinella?
B.37 Suppose that a quality control department has established that
0.1% of items produced are defective.
a. If 25 items are randomly selected, find the probability that:
i. exactly two items are defective
ii. at most one item is defective
iii. at least two items are defective
b. What is the expected number and standard deviation of
defective items?
B.38 Assume that the number of network errors experienced in a
day on a local area network (LAN) is distributed as a Poisson
random variable. The mean number of network errors
experienced in a day is 2.4. What is the probability that, in any
given day:
a. zero network errors will occur?
b. exactly one network error will occur?
c. two or more network errors will occur?
d. fewer than three network errors will occur?
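A brief sketch of how the four Poisson probabilities above might be checked in Python, using the mean of 2.4 errors per day from the problem. The helper name `poisson_pmf` is illustrative only.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


mean_errors = 2.4  # mean network errors per day, from the problem

p_zero = poisson_pmf(0, mean_errors)
p_one = poisson_pmf(1, mean_errors)
# "Two or more" is the complement of "zero or one".
p_two_or_more = 1 - p_zero - p_one
# "Fewer than three" means zero, one or two errors.
p_fewer_than_three = sum(poisson_pmf(k, mean_errors) for k in range(3))

print(f"P(X = 0)  = {p_zero:.4f}")
print(f"P(X = 1)  = {p_one:.4f}")
print(f"P(X >= 2) = {p_two_or_more:.4f}")
print(f"P(X < 3)  = {p_fewer_than_three:.4f}")
```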
B.39 Greenway Gardens currently has six plots available to plant
tomatoes, eggplant, capsicum, cucumbers, beans and lettuce.
Each vegetable will be planted in one and only one plot. How
many ways are there to position these vegetables in the gardens?
B.40 Olive Construction Company is determining whether it should
submit a bid for a new shopping centre. In the past, Olive’s
main competitor, Base Construction Company, has submitted
bids 70% of the time. If Base Construction does not bid on a
job, the probability that Olive Construction will get the job is
0.50. If Base Construction bids on a job, the probability that
Olive Construction will get the job is 0.25.
a. If Olive Construction gets the job, what is the probability
that Base Construction did not bid?
b. What is the probability that Olive Construction will get the job?
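A minimal sketch of how Bayes' theorem might be applied to this problem in Python. The variable names are illustrative; the probabilities are those stated in the problem.

```python
# Prior probabilities for the competitor's behaviour.
p_base_bids = 0.70
p_base_no_bid = 1 - p_base_bids

# Conditional probabilities that Olive Construction gets the job.
p_win_given_bid = 0.25
p_win_given_no_bid = 0.50

# Total probability that Olive gets the job (part b).
p_win = p_win_given_bid * p_base_bids + p_win_given_no_bid * p_base_no_bid

# Bayes' theorem: P(Base did not bid | Olive gets the job) (part a).
p_no_bid_given_win = p_win_given_no_bid * p_base_no_bid / p_win

print(f"P(Olive gets the job) = {p_win:.3f}")
print(f"P(Base did not bid | Olive gets the job) = {p_no_bid_given_win:.4f}")
```

The total probability in part (b) is the denominator needed for part (a), which is why it is computed first here.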
B.41 An airline maintains statistics for mishandled bags per 1,000
passengers. Suppose that last year this airline had 7.03
mishandled bags per 1,000 passengers. What is the probability
that the next 1,000 passengers on this airline will have:
a. no mishandled bags?
b. at least one mishandled bag?
c. at least two mishandled bags?
B.42 A small factory processes and bottles fruit juice. Two types of
defect can occur – an incorrect fill amount (over or under the
stated amount on the label) and an incorrect seal.
From production data it is known that 0.5% of two-litre
bottles filled have an incorrect fill amount and 0.1% are
incorrectly sealed, with 0.002% having both defects – an
incorrect fill amount and an incorrect seal.
a. What proportion of two-litre bottles produced have at least
one type of defect?
b. What proportion of two-litre bottles produced have no defects?
c. A two-litre bottle has an incorrect fill amount. What is the
probability that it also is incorrectly sealed?
d. Twenty filled two-litre bottles are randomly chosen.
Determine the probability that:
i. only one bottle has an incorrect fill amount
ii. at least one bottle has an incorrect fill amount
iii. at most, two bottles have an incorrect fill amount
e. In a random sample of 100 filled two-litre bottles, find the
expected number of bottles which are incorrectly sealed.
B.43 The amount of time a bank teller spends with each customer
has a population mean μ = 3.10 minutes and standard
deviation σ = 0.40 minute.
a. If you select a random sample of 16 customers:
i. what is the probability that the mean time spent per
customer is at least 3 minutes?
ii. there is an 85% chance that the sample mean is below
how many minutes?
b. What assumption must you make in order to solve both
parts of (a)?
c. If you select a random sample of 64 customers, there is an
85% chance that the sample mean is below how many
minutes?
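A sketch of how the sampling distribution of the mean might be used for this problem in Python. It assumes the sample mean is approximately normal, which is the assumption part (b) asks about, and it uses `statistics.NormalDist` from the standard library (Python 3.8 or later). The helper name `sample_mean_dist` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 3.10, 0.40  # population mean and standard deviation (minutes)


def sample_mean_dist(n):
    """Assumed normal sampling distribution of the mean for sample size n."""
    standard_error = sigma / sqrt(n)
    return NormalDist(mu=mu, sigma=standard_error)


dist_16 = sample_mean_dist(16)
p_at_least_3 = 1 - dist_16.cdf(3.0)  # P(sample mean >= 3 minutes)
cutoff_16 = dist_16.inv_cdf(0.85)    # 85th percentile, n = 16
cutoff_64 = sample_mean_dist(64).inv_cdf(0.85)  # 85th percentile, n = 64

print(f"P(mean >= 3.0), n = 16: {p_at_least_3:.4f}")
print(f"85% of sample means fall below {cutoff_16:.4f} minutes (n = 16)")
print(f"85% of sample means fall below {cutoff_64:.4f} minutes (n = 64)")
```

Quadrupling the sample size halves the standard error, so the 85% cutoff moves closer to the population mean.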
B.44 A manager of a seafood restaurant is interested in both the
time it takes a customer to be seated (the waiting time) and
the length of time between a customer being seated and
leaving the restaurant (the service time). Over a month, a
random sample of 100 customers (only one per party/table)
was selected and waiting and serving times, in minutes, are
recorded in the file < RESTAURANT_TIMES >.
a. Construct a histogram for waiting times. Are waiting times
approximately normal, exponential or uniform? Is this what
you expected?
b. Construct a histogram of serving times. Are serving times
approximately normal, exponential or uniform? Is this what
you expected?
c. Calculate the mean and standard deviation of waiting and
serving times.
d. Use the results of (a) and (c) to calculate the approximate
probability that a customer will wait less than 5 minutes to
be seated.
e. Use the results of (a) and (c) to calculate the approximate
probability that a customer will wait more than 10 minutes
to be seated.
f. Use the results of (b) and (c) to calculate the approximate
probability that the serving time for a customer will be less
than 1 hour.
g. Use the results of (b) and (c) to calculate the approximate
probability that the serving time for a customer will be more
than 90 minutes.
B.45 Data from the Bureau of Infrastructure, Transport and Regional
Economics (BITRE) (<https://bitre.gov.au>) shows that in
Australia during 2015, the number of motorcyclist deaths was
6.47 per 100 million vehicle kilometres travelled (VKT), while
for car occupants it was 0.35 per 100 million VKT.
A local council estimates that within the council boundaries
there are annually 300 million VKT for cars and 5 million VKT
for motorcycles.
Assume that the fatality rates have not changed and that the
Poisson distribution can be used to model the number of deaths.
a. For motorcyclists in the local council area, calculate the
probability that in the next 12 months:
i. there are no deaths
ii. there is at least one death
iii. there is exactly one death
iv. there are no more than two deaths
b. For car occupants in the local council area, calculate the
probability that in the next 12 months:
i. there are no deaths
ii. there is at least one death
iii. there is exactly one death
iv. there are no more than two deaths
B.46 In 2015, 16.4% of Australians aged 45 to 54 years reported a
disability compared to 8.2% aged 15 to 24 years (data
obtained from Australian Bureau of Statistics, Disability,
Ageing and Carers, Australia: Summary of Findings, 2015,
Cat. No. 4430.0 <www.abs.gov.au>).
Suppose 15 Australians in each age group are randomly
selected.
a. For each age group of those selected, calculate the
probability that:
i. none reports a disability
ii. at least one reports a disability
iii. exactly five report a disability
iv. a majority report a disability
b. Repeat (i) to (iv) for the 90 years and over age group, of
whom 85.4% report a disability.
B.47 A telemarketing firm phones households at random. Data show
that 80% of such calls are answered.
a. If 100 households are called each evening, approximate the
probability that:
i. more than 50% of the calls are answered
ii. between 70 and 90 (inclusive) calls are answered
iii. fewer than 75 calls are answered
b. Use Excel to calculate the exact probabilities for part (a).
B.48 Check$mart encourages its customers to use Internet banking.
Therefore the bank is concerned with the download time (the
number of seconds that passes from first linking to the website
until the home page is fully displayed) of its home page. Both
the design of a home page and the load on the bank’s web
server affect the download time.
Past data indicate that download times are approximately
normal with a mean of 0.9 seconds and a standard deviation of
0.3 seconds.
What is the probability that a download time is:
a. less than 1 second?
b. more than 0.5 seconds?
c. between 0.5 and 1.5 seconds?
d. more than 2 seconds?
e. less than 0.6 seconds?
f. between 1.0 seconds and 1.5 seconds?
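A minimal sketch of how parts (a) to (d) might be checked in Python with `statistics.NormalDist` (Python 3.8 or later), using the mean of 0.9 seconds and standard deviation of 0.3 seconds from the problem. The variable names are illustrative; the remaining parts follow the same pattern.

```python
from statistics import NormalDist

# Download time model from the problem: approximately normal.
download_time = NormalDist(mu=0.9, sigma=0.3)

p_less_than_1 = download_time.cdf(1.0)
p_more_than_half = 1 - download_time.cdf(0.5)
# An interval probability is the difference of two cumulative values.
p_between = download_time.cdf(1.5) - download_time.cdf(0.5)
p_more_than_2 = 1 - download_time.cdf(2.0)

print(f"P(X < 1.0)       = {p_less_than_1:.4f}")
print(f"P(X > 0.5)       = {p_more_than_half:.4f}")
print(f"P(0.5 < X < 1.5) = {p_between:.4f}")
print(f"P(X > 2.0)       = {p_more_than_2:.6f}")
```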
B.49 Past records show that on average there are four unplanned
outages a year to Check$mart’s Internet banking system and
that these unplanned outages occur randomly and are
independent of each other.
An unplanned outage has just occurred.
a. What is the probability that there will:
i. not be an unplanned outage in the next month?
ii. not be an unplanned outage in the next three months?
iii. be at least one unplanned outage in the next six months?
b. What is the mean time between unplanned outages?
c. What is the probability that there will:
i. be exactly three unplanned outages in the next year?
ii. be more than six unplanned outages in the next six
months?
iii. be fewer than two unplanned outages in the next month?
B.50 In the Household Expenditure Statistics: Year Ended 30 June
2016 (Statistics New Zealand, <www.stats.govt.nz>, licensed
by Statistics New Zealand for re-use under the Creative
Commons Attribution 3.0 New Zealand licence), 64% of the
New Zealand households reported that their income was
enough or more than enough to meet their everyday needs.
However, of the 20% of households with an annual income
of less than $35,700, 48% reported that their income was
enough or more than enough for their everyday needs, while of
the 20% of households with an annual income of at least
$136,600, 87% reported that their income was enough or
more than enough to meet their everyday needs.
a. What proportion of households reporting that their income
is not enough for their everyday needs have an annual
income of at least $136,600?
b. What proportion of households who report their income is
enough for their everyday needs have incomes of less than
$35,700?
c. What is the probability that a household has an annual
income of less than $35,700 and reports that this is enough
for their everyday needs?
d. What proportion of households with an annual income of at
least $35,700 report that their income is enough?
e. What proportion of households with an annual income of
less than $136,600 report that their income is not enough?
f. What is the probability that a household has an annual
income of at least $136,600 and reports that this is not
enough for their everyday needs?
B.51 In problem 6.12 on page 229, it was assumed that the number
of All Ordinaries shares traded daily on the Australian Securities
Exchange (ASX) is a normal random variable.
a. To test this assumption use the All Ordinaries daily
volume of trade for the 2016–17 financial year
< ALL_ORDS_2016_17 > to:
i. construct a stem-and-leaf display, histogram, polygon
and/or box-and-whisker plot
ii. evaluate the actual versus theoretical properties
iii. construct a normal probability plot
b. Discuss the results in (a). Is the number of All Ordinaries
shares traded daily approximately normal?
B.52 According to Burton G. Malkiel, the daily changes in the closing
price of shares follow a random walk – that is, these daily
events are independent of each other and move upwards or
downwards in a random manner – and can be approximated
by a normal distribution.
To test this theory, use either a newspaper or the Internet to
select three companies traded on the ASX or other stock
exchange, and then do the following:
1. Obtain the daily closing share price of each company for six
consecutive weeks (so that you have 30 values per
company).
2. Obtain the daily changes in the closing share price of each
company for six consecutive weeks (so that you have 30
values per company).
a. For each of your six data sets, decide whether the data
are approximately normally distributed by:
i. examining the stem-and-leaf display, histogram or
polygon and the box-and-whisker plot
ii. evaluating the actual versus theoretical properties
iii. constructing a normal probability plot
b. Discuss the results in (a). What can you now say about
your three shares with respect to daily closing prices
and daily changes in closing prices?
c. Which, if any, of the data sets are approximately
normally distributed?
Note: The random-walk theory pertains to the daily changes in
the closing share price, not the daily closing share price.
B.53 A motoring organisation has conducted a survey of owners of
new cars manufactured in 2017. It has listed the average
number of problems per car as 1.27 for brand H. Let the
random variable X be equal to the number of problems with a
newly purchased brand H.
a. What assumptions must be made in order for X to be
distributed as a Poisson random variable? Are these
assumptions reasonable?
b. Making the assumptions as in (a), if you purchased a 2017
brand H, what is the probability that the new car will have:
i. zero problems?
ii. two or fewer problems?
c. Give an operational definition for ‘problem’. Why is the
operational definition important in interpreting the results of
the survey?
B.54 Assume that in 2018 the manufacturers of brand H improve
their performance, with owners of 2018 brand H reporting 1.04
problems per car.
a. If you purchased a 2018 brand H, what is the probability
that the new car will have:
i. zero problems?
ii. two or fewer problems?
b. Compare your answers in part (a) with those for 2017 brand
H in problem B.53 part (b).
B.55 Jay has had three incidents in the past 10 years where an
insurance excess needed to be paid. These were a collision
with a kangaroo, hail damage and a collision from behind while
stationary. In this last instance the excess was refunded as the
other driver was at fault. Furthermore, Jay estimates that he
drives 300 days a year.
Jay recently booked a rental car online for a 27-day holiday
in New Zealand. During the booking process he was offered a
policy at a price of $18.40 per day, to reduce the insurance
excess of $2,000 to $0. However, Jay chose not to accept this
offer.
a. Estimate the probability, per day of driving, that Jay will
have to pay an insurance excess, even if it is refunded later
because the other driver is at fault.
b. Assume that the number of days during the holiday that
require insurance excess to be paid can be modelled by the
binomial distribution.
i. Calculate the probability that for the 27-day holiday
there are no days requiring insurance excess to be paid.
ii. Calculate the probability that for the 27-day holiday
there is exactly one day requiring insurance excess to be
paid.
iii. Calculate the probability that for the 27-day holiday
there are exactly two days requiring insurance excess to
be paid.
iv. Calculate the probability that for the 27-day holiday
there are at least three days requiring insurance excess
to be paid.
v. Calculate the expected payout on this policy.
c. Assume that the number of instances during the holiday in
which insurance excess is required to be paid can be
modelled by the Poisson distribution.
i. Calculate the probability that for the 27-day holiday
there are no instances in which insurance excess is
required to be paid.
ii. Calculate the probability that for the 27-day holiday
there is exactly one instance in which insurance excess
is required to be paid.
iii. Calculate the probability that for the 27-day holiday
there are exactly two instances in which insurance
excess is required to be paid.
iv. Calculate the probability that for the 27-day holiday
there are at least three instances in which insurance
excess is required to be paid.
v. Calculate the expected payout on this policy.
d. Did Jay make the correct decision?
e. Calculate the probability per day of driving that Jay will
have to pay an insurance excess for the policy to break
even.
B.56 Sam and Jo recently lost their house to fire. Although they were
insured, the insurance company has offered them 30% less
than the rebuild amount for which they were insured.
The amount for which they were insured was the amount
specified by the insurance company and is consistent with the
rebuild amount given by the insurance company’s online
calculator. Therefore, Sam and Jo are not accepting the
insurance company’s statement that they were over-insured.
Do Sam and Jo have a case to ask for a higher amount to
rebuild their house?
a. The online calculator states that ‘in approximately 80% of
cases the building estimate delivers an accuracy of +/–
10%’. Assuming that the difference between the estimated
rebuild cost given by the calculator and the actual rebuild
cost is normal with a mean of zero, estimate the standard
deviation.
b. Using the results of part (a), calculate the probability of an
actual rebuild cost of at most 30% less than the estimated
rebuild cost given by the online calculator.
c. Comment on the insurance company’s claim that Sam and
Jo were over-insured. Do you consider that they are
justified in asking for a higher rebuild amount?
B.57 Australia is known as a nation of sports lovers but cultural
events and venues are not all well supported. A survey by the
Australian Bureau of Statistics found that in 2013–14 the
attendance rates for Australians aged 15 years and over at the
following selected cultural events and venues were as follows:
cinemas 66.3%, zoological parks and aquariums 33.9%,
botanic gardens 37.2% and libraries 34.0%. It also found that
only 14.8% of Australians had attended an opera or musical in
the previous 12 months (Australian Bureau of Statistics,
Attendance at Selected Cultural Events and Venues, Australia,
2013–14, Cat. No. 4114.0).
a. If the percentages reported by the ABS are used in decimal
form as probabilities, are they best classified as a priori
classical probabilities, empirical classical probabilities or
subjective probabilities?
b. Suppose that 10 Australians aged 15 years and over are
randomly sampled. Consider the random variable defined by
the number of people that have attended a musical or opera
in the past year. What assumptions must be made so that
this random variable is distributed as a binomial random
variable?
c. Assuming that the number of people who have attended a
musical or opera in the past year is a binomial random
variable, what are the mean and standard deviation of the
distribution in (b)?
B.58 Refer to problem B.57. Calculate the probability that, of the
10 people sampled, the number who have attended a musical
or opera in the past year is:
a. exactly none
b. all 10
c. more than half
d. eight or more
B.59 Refer to problem B.57.
a. For cinemas, using the given probability of attendance of
0.663, calculate the probability that, of the 10 people sampled,
the number who have attended a cinema in the past year is:
i. exactly none
ii. all 10
iii. more than half
iv. eight or more
b. Compare the results in (a) with those of problem B.58 (a) to (d).
B.60 The manager of a seafood restaurant was interested in
studying ordering patterns of patrons for the Friday-to-Sunday
weekend time period. Records were maintained that indicated
the demand for dessert during the same period. The manager
decided to study two other variables together with whether a
dessert was ordered: the gender of the individual and whether
a shellfish entrée was ordered. The results are as follows:
                     Gender
Dessert ordered    Male   Female   Total
Yes                  82       32     114
No                  278      208     486
Total               360      240     600
                     Shellfish entrée
Dessert ordered     Yes       No   Total
Yes                  52       62     114
No                  106      380     486
Total               158      442     600
a. A waiter approaches a table to take an order. What is the
probability that the first customer to order at the table:
i. orders a dessert?
ii. orders a dessert or a shellfish entrée?
iii. is a female and does not order a dessert?
iv. is a female or does not order a dessert?
b. Suppose the first person that the waiter asks for a dessert
order is a female. What is the probability that she does not
order dessert?
c. Are gender and ordering dessert statistically independent?
d. Is ordering a shellfish entrée statistically independent of
whether the person orders dessert?
B.61 The council for a regional city constructed a levee to protect
the central business district and surrounding suburbs from
flooding in up to a 1-in-10-year flood. This levee was finished
12 years ago, and has just been breached for the first time,
having held during three previous floods.
Assume that the number of floods that breach the levee can
be modelled by a Poisson distribution.
a. What is the probability that the levee is:
i. not breached in 12 years?
ii. breached in the next 5 years?
iii. not breached in the next 20 years?
iv. breached again within 2 years?
v. not breached within 10 years?
b. Suppose the council decides to increase the height of the
levee, so that the new levee will protect the central
business district and surrounding suburbs from flooding in
up to a 1-in-20-year flood. What is the probability that the
new levee is:
i. not breached in 12 years?
ii. breached within 5 years of completion?
iii. not breached in 20 years?
iv. breached within 2 years of completion?
v. not breached within 10 years of completion?
PART 3
Drawing conclusions about populations based only on sample information
Real People, Real Stats
Rod Battye TOURISM RESEARCH AUSTRALIA
Which company are you currently working for and what are some of your responsibilities?
Tourism Research Australia (TRA), which is currently a business unit within the Australian Trade
Commission (AUSTRADE). My main responsibilities are to manage:
• the International (IVS) and National (NVS) Visitor Surveys
• the service-level agreement with funding partners
• TRA’s interactive websites
• TRA software and databases
• data requests and statistics in general
• staff and individual and team development.
List five words that best describe your personality.
Patient, precise, conscientious, creative, relaxed.
What are some things that motivate you?
Working in a team, building relationships, creating new ways to communicate messages, doing new
things, getting it right and being relevant. Promoting a happy work environment.
When did you first become interested in statistics?
Started when I was writing programs to extract data in an area that handled statistics. I was always
good at maths and it came naturally from there. There are many disciplines in statistics that apply to
programming as well.
a quick q&a
Complete the following sentence. A world without statistics …
… is uninformed and lacking the information required for good
planning and decision making.
LET’S TALK STATS
What do you enjoy most about working in statistics?
Not all statistics are enjoyable to work with; the subject matter is
very important. I work with tourism-related information, which
covers both domestic and international topics. I am especially
keen on the international topics. I enjoy working with people
across the many facets of survey work I do, from collection in the
field to publication and reporting. Tourism has a lot of positive
and forward-looking values; it cuts across nations, genders, age,
technology and much more (variety), which makes it interesting.
Describe your first statistics-related job or work experience.
Was this a positive or a negative experience?
As mentioned earlier, I was always good at maths and statistics
and got involved when I was writing software programs to
extract data on migration and other topics. This experience
increased my interest in information, wanting to know more, put
the pieces together and tell a story.
What do you feel is the most common misconception about your
work held by students who are studying statistics? Please explain.
The biggest misconceptions are that it’s easy and a lot of people
don’t realise there is a need to do a bit of an apprenticeship.
There are many different streams of statistics and a vast array of
survey methodologies. It takes some time to gain the knowledge
and experience to be competent at your job.
Do you need to be good at maths to understand and use
statistics successfully?
Overall the answer is yes! I’ve seen some horror stories when
people have ended in the wrong role and they are not numbers
savvy. Having said that, the direction we are moving in with the
way we report statistics in a simpler way, using more visual and
interactive formats/technologies, is removing some of the mystery.
Is there a high demand for statisticians in your industry (or in
other industries)? Please explain.
There is a demand particularly for younger people. There seems
to have been a drop off in younger people coming through. There
is a tendency for people to focus on policy or marketing and other
avenues, as the trip to the top is considered to happen more
quickly. What we really need more of is statistics and research in
the one package. What I mean by that is someone who can reveal
the numbers, interpret and write the story/convey the message.
DRAWING CONCLUSIONS ABOUT POPULATIONS BASED
ONLY ON SAMPLE INFORMATION
What are some variables for which data have been collected in
your field?
There are so many to list in terms of international visitors to Australia:
where they went; what activities they undertook; their satisfaction
levels with cost, food, language services, accommodation etc.; their
likelihood to recommend Australia as a holiday destination;
expenditure; places and attractions; tours; demographics; where
they come from; and why they are here, just to name a few.
Why is sampling an important part of your work? What are some
common sampling techniques that you employed in the past?
Sampling techniques are vital to what I do as they are a cost-effective means of obtaining good results for a fraction of the
cost of conducting a census. The surveys I work on are ongoing
measurement surveys vital to government and the broader
tourism industry.
We mostly use stratified random sampling techniques. We have
excellent data that we use for our sampling and benchmarking
processes.
Our domestic survey uses computer-assisted telephone
interviewing (CATI) via random digit dialing of household phone
numbers, using the last birthday method of selection. Samples
are stratified using telephone prefixes and the estimated resident
population of Australia by capital city/rest of state.
The international survey uses computer-assisted personal
interviewing (CAPI) at international departure lounges in airports
throughout Australia. Immigration data using flight details,
airport, country of residence and gender are used to stratify
samples. Interviews are chosen at random.
All TRA surveys use screening questions to check for in-scope/
targeted respondents.
What are some statistical methods you have used that have
assisted in solving a problem?
In our surveys we have a small number of records that end up
with high weights. These weights can have a detrimental effect
on results at lower levels. We use a ‘trimming’ technique to
reduce these influences by trimming the weights to five
standard errors from the mean. The weights are then
redistributed using a raking method.
Have you ever conducted a hypothesis test where the outcome
was not what you expected? If so, what did you do?
Yes, we conduct these types of tests on a regular basis when we
consider adding new topics to the surveys. We conduct testing
before implementation and have some expectation as to what the
result would be. Occasionally we get an outcome well removed
from what was expected. In this case we consult other data/
information and source industry players. We then review and
update certain details, re-test, then implement. We need to be
sure what we are doing does not influence results (skew them).
What are some challenges that you have faced in using
statistics to provide information about a population of interest?
How have you overcome these challenges?
We have had difficulty reporting travel by domestic visitors due to
the growth in mobile-phone-only households (under coverage of
the population); our CATI collection has been conducted via
random selection of landline telephone numbers. This has long
been the accepted way of surveying the community in a cost-effective manner. However, because the growth of mobile-only
households was taken up at disproportionate rates across the
age groups, there had been a shift in the characteristic of
travellers and an under-representation of the younger age groups.
We conducted an extensive review of methodologies and the
result was that CATI was still the best method of collection for
our survey (a large tracking survey with complex definitions). In
recent years we had looked into phoning mobiles, but this was
too expensive due to the large number of invalid numbers (SIM
cards that were no longer in use, SIM cards in shops, etc.).
Advancements in technology that reduced the cost issue (being
able to ping and identify invalid mobile numbers) had recently
appeared. With this we decided to push ahead with the
introduction of a dual-frame overlap survey. This type of
collection is cutting-edge, and nowhere in the world is there a
survey of this size (120,000 sample) that measures visitation via a
dual-frame survey.
Whereas before the survey sampled and weighted to the
estimated resident population of Australia, we now have three
distinct populations: mobile only, landline only, and mobile and
landline.
The new approach is only in its early stages but all is looking
very good; the weights are now distributed as they should be and
we have successfully implemented all other facets of the
sampling, collection, processing and weighting.
CHAPTER 8
Confidence interval estimation
AUDITING SALES INVOICES AT CALLISTEMON
CAMPING SUPPLIES
Callistemon Camping Supplies Pty Ltd has several outlets that sell outdoor clothing,
backpacks, tents and other camping equipment. As the company’s accountant, you are
responsible for the accuracy of the integrated inventory management and sales information
system. You could review the contents of every record to check the accuracy of this system, but
such a detailed review would be time-consuming and costly. A better approach is to use statistical
inference techniques to draw conclusions about the population of all records from a relatively
small sample collected during an audit. At the end of each month, you can select a sample of the
sales invoices to determine the following:
■ the mean dollar amount listed on the sales invoices for the month
■ the total dollar amount listed on the sales invoices for the month
■ any differences between the dollar amounts on the sales invoices and the amounts entered into the sales information system
■ the frequency of occurrence of various types of errors that violate the internal control policy of the distribution sites. These errors include making a shipment when there is no authorised delivery docket, failure to include the correct account number and shipment of the incorrect part.
How accurate are the results from the samples and how do you use this information? Are the
sample sizes large enough to give you the information you need?
© Chris Howes/Wild Places Photography/Alamy Stock Photo
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 construct and interpret confidence interval estimates for the mean
2 construct and interpret confidence interval estimates for the proportion
3 determine the sample size necessary to develop a confidence interval for the mean or proportion
4 recognise how to use confidence interval estimates in auditing
point estimate
A single value calculated from a
sample which is used to estimate
an unknown population parameter.
confidence interval estimate
A range of numbers constructed
about the point estimate.
Statistical inference is the process of using sample results to draw conclusions about the characteristics of a population. Inferential statistics enables you to estimate unknown population
characteristics such as a population mean or a population proportion. Two types of estimates
are used to estimate population parameters: point estimates and interval estimates. A point
estimate is the value of a single sample statistic. A confidence interval estimate is a range of
numbers, called an interval, constructed around the point estimate. The process used to construct
confidence intervals tells us that the population parameter is located somewhere within the interval in a known percentage of the intervals that could be constructed from different samples.
Suppose that you would like to estimate the mean number of hours of paid work undertaken per week during term time by students in your university. The mean hours of paid work
for all the students is an unknown population mean, denoted by μ. You select a sample of
students and find that the sample mean is 14.8. The sample mean, X̄, is a point estimate of the
population mean μ. How accurate is 14.8? To answer this question you must construct a confidence interval estimate.
In this chapter you will learn how to construct and interpret confidence interval estimates.
Recall that the sample mean, X̄, is a point estimate of the population mean μ. However, the
sample mean will vary from sample to sample because it depends on the items selected in the
sample. By taking into account the known variability from sample to sample (see Section 7.2
on the sampling distribution of the mean), you will learn how to develop the interval estimate
for the population mean. The interval constructed will have a specified confidence of correctly
estimating the value of the population parameter μ. In other words, there is a specified confidence that μ is somewhere in the range of numbers defined by the interval.
Suppose that after studying this chapter you find that a 95% confidence interval for the
mean number of hours students at your university are employed in paid work per week is
(14.75 ⩽ μ ⩽ 14.85). You can interpret this interval estimate by stating that you are 95% confident that the mean number of hours per week of paid work undertaken by students at your
university is between 14.75 and 14.85. However, there is still a possibility that the mean number of hours is below 14.75 or above 14.85.
After learning about the confidence interval for the mean, we look at how to develop an
interval estimate for the population proportion. Then we consider how large a sample to select
when constructing confidence intervals, and how to perform several important estimation procedures that accountants use when performing audits.
8.1 CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ KNOWN)
In Section 7.2 we used the Central Limit Theorem and knowledge of the population distribution
to determine the percentage of sample means that fall within certain distances of the population
mean. For instance, in the shampoo-bottling example used throughout Chapter 7, 95% of all
sample means are between 494.12 and 505.88 mL. This statement is based on deductive reasoning.
However, inductive reasoning is what we need here.
We need inductive reasoning because, in statistical inference, you use the results of a
single sample to draw conclusions about the population, not vice versa. Suppose that in the
shampoo-bottling example you wish to estimate the unknown population mean using the
information from only a sample. Thus, rather than take μ ± (1.96)(σ/√n) to find the upper
and lower limits around μ, as in Section 7.2, you substitute the sample mean, X̄, for the
unknown μ and use X̄ ± (1.96)(σ/√n) as an interval to estimate the unknown μ. Although in
practice you select a single sample of size n and calculate the mean X̄, in order to understand
the full meaning of the interval estimate you need to examine a hypothetical set of all possible samples of n values.
deductive reasoning
Reasoning that starts with a hypothesis and examines possibilities to move to a specific conclusion.
inductive reasoning
Reasoning that uses specific observations to make a general conclusion.
Figure 8.1 shows the actual population distribution of shampoo bottle contents at the top with
a mean value of 500 and five confidence intervals for the population mean based on five different
sample means. Suppose that a sample of n = 25 bottles has a mean of 496.2 mL. The interval developed to estimate μ is 496.2 ± (1.96)(15)/√25 or 496.2 ± 5.88. The interval estimate of μ is:
490.32 ⩽ μ ⩽ 502.08
Because the population mean μ (equal to 500) is included within the interval, this sample has
led to a correct statement about μ (see Figure 8.1).
Figure 8.1 Confidence interval estimates for five different samples of n = 25 taken from a population where μ = 500 and σ = 15
X̄1 = 496.2: 490.32 to 502.08
X̄2 = 501.6: 495.72 to 507.48
X̄3 = 493.0: 487.12 to 498.88
X̄4 = 494.12: 488.24 to 500.00
X̄5 = 505.88: 500.00 to 511.76
To continue this hypothetical example, suppose that for a different sample of n = 25 bottles
the mean is 501.6. The interval developed from this sample is:
501.6 ± (1.96)(15)/√25
or 501.6 ± 5.88. The estimate is:
495.72 ⩽ μ ⩽ 507.48
Because the population mean μ (equal to 500) is also included within this interval, this statement about μ is correct.
Now, before you begin to think that correct statements about μ are always made by
developing a confidence interval estimate, suppose a third hypothetical sample of n = 25
bottles is selected and the sample mean is equal to 493 mL. The interval developed here is
493 ± (1.96)(15)/√25 or 493 ± 5.88. In this case, the interval estimate of μ is:
487.12 ⩽ μ ⩽ 498.88
This estimate is not a correct statement, because the population mean μ is not included in the
interval developed from this sample (see Figure 8.1). Thus, for some samples the interval estimate
of μ is correct but for others it is incorrect. In practice, only one sample is selected and, because
the population mean is unknown, you cannot determine whether the interval estimate is correct.
To resolve this dilemma of sometimes having an interval that provides a correct estimate
and sometimes having an interval that provides an incorrect estimate, you need to determine the
proportion of samples producing intervals that result in correct statements about the population
mean μ. To do this, consider two other hypothetical samples: the case in which X̄ = 494.12 mL
and the case in which X̄ = 505.88 mL. If X̄ = 494.12, the interval is 494.12 ± (1.96)(15)/√25
or 494.12 ± 5.88. This leads to the following interval:
488.24 ⩽ μ ⩽ 500.00
Because the population mean of 500 is at the upper limit of the interval, the statement is correct
(see Figure 8.1).
When X̄ = 505.88, the interval is 505.88 ± (1.96)(15)/√25 or 505.88 ± 5.88. The interval
for the sample mean is:
500.00 ⩽ μ ⩽ 511.76
In this case, because the population mean of 500 is included at the lower limit of the interval,
the statement is correct.
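The five hypothetical intervals discussed above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `interval_for` is an assumption made for this example, and the limits are rounded to two decimal places so that the boundary cases (a limit of exactly 500) compare cleanly.

```python
import math

MU = 500.0      # population mean from the shampoo-bottling example
SIGMA = 15.0    # population standard deviation
N = 25          # sample size
Z = 1.96        # critical value for 95% confidence

def interval_for(sample_mean):
    """Return the (lower, upper) limits of the 95% interval, rounded to 2 decimals."""
    half_width = Z * SIGMA / math.sqrt(N)   # 1.96 * 15 / 5 = 5.88
    return round(sample_mean - half_width, 2), round(sample_mean + half_width, 2)

sample_means = [496.2, 501.6, 493.0, 494.12, 505.88]
results = {}
for x_bar in sample_means:
    lower, upper = interval_for(x_bar)
    contains_mu = lower <= MU <= upper
    results[x_bar] = (lower, upper, contains_mu)
    print(f"mean {x_bar:7.2f}: {lower:.2f} to {upper:.2f}  contains 500? {contains_mu}")
```

Only the sample with a mean of 493.0 produces an interval that misses the population mean; the samples at 494.12 and 505.88 sit exactly on the boundary.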
Figure 8.1 shows that when the sample mean falls anywhere between 494.12 and
505.88 mL, the population mean is included somewhere within the interval. In Section 7.2 we
found that 95% of the sample means fall between 494.12 and 505.88 mL. Therefore, 95% of all
samples of n = 25 bottles have sample means that produce intervals containing the population mean. The interval from 494.12 to 505.88 is referred to as a 95% confidence interval.
Because, in practice, you select only one sample and μ is unknown, you never know for sure
whether the specific interval includes the population mean or not. However, if you take all possible
samples of n and calculate their sample means, 95% of the intervals will include the population
mean and only 5% of them will not. In other words, there is 95% confidence that the population
mean is somewhere in the interval. Thus, we can interpret the confidence interval above as follows:
I am 95% confident that the mean amount of shampoo in the population of bottles is
somewhere between 494.12 and 505.88 mL.
LEARNING OBJECTIVE 1
Construct and interpret confidence intervals for the mean
level of confidence
Represents the percentage of intervals, based on all samples of a certain size, which would contain the population parameter.
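Taking all possible samples is not feasible, but a simulation can approximate the idea. The sketch below assumes that bottle contents are normally distributed with μ = 500 and σ = 15 (an assumption made for this illustration), draws many samples of n = 25, and counts how often the interval X̄ ± 1.96σ/√n contains the population mean. The function name `coverage_rate`, the number of trials and the seed are arbitrary choices.

```python
import math
import random
import statistics

def coverage_rate(mu=500.0, sigma=15.0, n=25, z=1.96, trials=10_000, seed=1):
    """Share of simulated samples whose interval (x_bar ± z·sigma/√n) contains mu."""
    rng = random.Random(seed)               # fixed seed so the run is repeatable
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Proportion of intervals containing the population mean: {rate:.3f}")
```

The proportion printed should be close to 0.95, although it will not equal 0.95 exactly because only a finite number of samples is simulated.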
In some situations, you might want a higher degree of confidence (such as 99%) of including the population mean within the interval. In other cases, you might accept less confidence
(such as 90%) of correctly estimating the population mean.
In general, the level of confidence is symbolised by (1 − α) × 100%, where α is the area in
the tails of the distribution that is outside the confidence interval. The area in the upper tail of
the distribution is α/2, and the area in the lower tail of the distribution is α/2. We can use Equation 8.1 to construct a (1 − α) × 100% confidence interval estimate of the mean with σ known.
CONFIDENCE INTERVAL FOR A MEAN (σ KNOWN)
\[ \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \]
or
\[ \bar{X} - Z\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + Z\frac{\sigma}{\sqrt{n}} \qquad (8.1) \]
where Z = the value corresponding to a cumulative area of 1 - α/2 from the standardised normal distribution – that is, an upper-tail probability of α/2.
The value of Z needed for constructing a confidence interval is called the critical value for
the distribution. For a 95% confidence interval the value of α is 0.05. The critical Z value corresponding to a cumulative area of 0.9750 is 1.96 because there is 0.025 in the upper tail of the
distribution and the cumulative area less than Z = 1.96 is 0.975.
There is a different critical value for each level of confidence 1 - α. A level of confidence
of 95% leads to a Z value of 1.96 (see Figure 8.2). For a 99% level of confidence, α is 0.01. The
Z value is approximately 2.58 because the upper-tail area is 0.005 and the cumulative area less
than Z = 2.58 is 0.995 (see Figure 8.3).
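If statistical tables are not to hand, the critical Z values can be checked with Python's standard library. This is a small sketch in which the helper name `z_critical` is an assumption for this example; `NormalDist().inv_cdf` returns the Z value for a given cumulative area.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Z value that leaves (1 - confidence)/2 in the upper tail of the standardised normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)   # cumulative area of 1 - alpha/2

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_critical(level):.4f}")
```

Rounded to two decimal places, the values for 95% and 99% confidence agree with the 1.96 and 2.58 used in this section.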
Figure 8.2 Normal curve for determining the Z value needed for 95% confidence (area of 0.025 in each tail beyond Z = ±1.96; area of 0.475 on each side of the mean)
Figure 8.3 Normal curve for determining the Z value needed for 99% confidence (area of 0.005 in each tail beyond Z = ±2.58; area of 0.495 on each side of the mean)
critical value
The value in a distribution that cuts off the required probability in the tail for a given confidence level.
Now that various levels of confidence have been considered, why not make the confidence
level as close to 100% as possible? Before doing so, you need to realise that any increase in the
level of confidence is achieved only by widening (and making less precise) the confidence
interval. You would have more confidence that the population mean is within a broader range of
values. However, this might make the interpretation of the confidence interval less useful. The
trade-off between the width of the confidence interval and the level of confidence is discussed
in greater depth in the context of determining the sample size in Section 8.4.
Example 8.1 illustrates the application of the confidence interval estimate.
EXAMPLE 8.1
ESTIMATING THE MEAN SALMON WEIGHT WITH 95% CONFIDENCE
Atlantic Salmon farming is an important industry in Tasmania. Fish are grown to market size in a
series of large, circular, netted enclosures in areas such as the Huon River, Port Esperance, the
D’Entrecasteaux Channel and around the Tasman Peninsula. When salmon are harvested to send to
market they need to weigh 3.5–4 kg, so the farmer is aiming to have an average weight of 3.75 kg.
We will assume that all salmon are placed in their final growing enclosure at the same time and
spend 12 months there, and that the standard deviation of their weights after that time is 380 g. A
farmer wishes to check whether the average weight of salmon in the enclosure falls in the required
range. He weighs a sample of 50 salmon being sent to market and finds their average weight is
3,607 g. Construct a 95% confidence interval estimate for the population mean salmon weight.
SOLUTION
Using Equation 8.1 with Z = 1.96 for 95% confidence:
\[ \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} = 3607 \pm (1.96)\frac{380}{\sqrt{50}} = 3607 \pm 105.33 \]
3501.67 ⩽ μ ⩽ 3712.33
Thus, with 95% confidence you can conclude that the mean weight of salmon in the enclosure
is between 3,501.67 g and 3,712.33 g. This would indicate that the average weight of fish in
the enclosure is below the average of 3,750 g desirable for market-ready fish. We would
expect that many fish in the enclosure still need to grow larger before being harvested.
To see the effect of using a 99% confidence interval, examine Example 8.2.
EXAMPLE 8.2
ESTIMATING THE MEAN SALMON WEIGHT WITH 99% CONFIDENCE
Construct a 99% confidence interval for the population mean salmon weight.
SOLUTION
Using Equation 8.1 with Z = 2.58 for 99% confidence:
\[ \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} = 3607 \pm (2.58)\frac{380}{\sqrt{50}} = 3607 \pm 138.65 \]
3468.35 ⩽ μ ⩽ 3745.65
The interval still does not contain the desired mean weight of 3.75 kg, so the fish will need
to grow larger.
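The calculations in Examples 8.1 and 8.2 can be checked with a short Python sketch of Equation 8.1. The function name `z_interval` is an assumption for this illustration, and the Z values are the rounded table values used in the examples (1.96 and 2.58) rather than more precise figures.

```python
import math

def z_interval(x_bar, sigma, n, z):
    """Confidence interval for the mean with sigma known (Equation 8.1)."""
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Salmon data from Examples 8.1 and 8.2 (weights in grams)
for label, z in (("95%", 1.96), ("99%", 2.58)):
    lower, upper = z_interval(3607, 380, 50, z)
    print(f"{label}: {lower:.2f} to {upper:.2f}  (width {upper - lower:.2f})")
```

Comparing the two widths shows the trade-off described earlier: the 99% interval is wider, and therefore less precise, than the 95% interval built from the same sample.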
Problems for Section 8.1
LEARNING THE BASICS
8.1 If X̄ = 85, σ = 8 and n = 64, construct a 95% confidence interval estimate of the population mean μ.
8.2 If X̄ = 125, σ = 24 and n = 36, construct a 99% confidence interval estimate of the population mean μ.
8.3 A market researcher states that she has 95% confidence that the mean monthly sales of a product are between $170,000 and $200,000. Explain the meaning of this statement.
8.4 Why is it not possible in Example 8.1 to have 100% confidence? Explain.
8.5 From the results of Example 8.1 regarding salmon farming, is it true that 95% of the sample means will fall between 3,501.67 g and 3,712.33 g? Explain.
8.6 Is it true in Example 8.1 that you do not know for sure whether the population mean is between 3,501.67 g and 3,712.33 g? Explain.
APPLYING THE CONCEPTS
8.7 The manager of a paint supply store wants to estimate the
actual amount of paint contained in 4-litre cans purchased from
a nationally known manufacturer. It is known from the
manufacturer’s specifications that the standard deviation of the
amount of paint is equal to 0.08 litres. A random sample of
50 cans is selected, and the sample mean amount of paint per
4-litre can is 3.98 litres.
a. Construct a 99% confidence interval estimate of the
population mean amount of paint included in a 4-litre can.
b. On the basis of your results, do you think that the manager
has a right to complain to the manufacturer? Why?
c. Must you assume that the population amount of paint per
can is normally distributed here? Explain.
d. Construct a 95% confidence interval estimate. How does
this change your answer to (b)?
8.8 The quality control manager at a light globe factory needs to
estimate the mean life of a large shipment of energy-saving
light-emitting diode (LED) light globes. The standard deviation is
3,000 hours. A random sample of 64 light globes indicates a
sample mean life of 34,000 hours.
a. Construct a 95% confidence interval estimate of the
population mean life of light globes in this shipment.
b. Do you think that the manufacturer has the right to state that
the light globes last an average of 35,000 hours? Explain.
c. Must you assume that the population of light globe life is
normally distributed? Explain.
d. Suppose that the standard deviation changes to
6,000 hours. What are your answers in (a) and (b)?
8.9 The inspection division of a state department that regulates trade
measurement wants to estimate the actual amount of soft drink in
2-litre bottles at the local bottling plant of a large, nationally
known soft-drink company. The bottling plant has informed the
inspection division that the population standard deviation for 2-litre
bottles is 0.05 litres. A random sample of 100 2-litre bottles at this
bottling plant indicates a sample mean of 1.99 litres.
a. Construct a 95% confidence interval estimate of the
population mean amount of soft drink in each bottle.
b. Must you assume that the population of soft-drink fill is
normally distributed? Explain.
c. Explain why a value of 2.02 litres for a single bottle is not
unusual, even though it is outside the confidence interval
you calculated.
d. Suppose that the sample mean had been 1.97 litres. What is
your answer to (a)?
8.2 CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ UNKNOWN)
Just as the mean of the population μ is usually unknown, you rarely know the actual standard
deviation of the population, σ. Therefore, you need to develop a confidence interval estimate of
μ using only the sample statistics X̄ and S.
Student’s t Distribution
At the beginning of the twentieth century a statistician for Guinness Breweries in Ireland (see
reference 1), William S. Gosset, wanted to make inferences about the mean when σ was
unknown. Because Guinness employees were not permitted to publish research work under
their own names, Gosset adopted the pseudonym ‘Student’. The distribution that he developed
is known as Student’s t distribution.
If the random variable X is normally distributed, then the following statistic has a t distribution with n − 1 degrees of freedom:
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
Student’s t distribution
A continuous probability distribution whose shape depends on the number of degrees of freedom.
degrees of freedom
Relate to the number of values in the calculation of a statistic that are free to vary.
This expression has the same form as the Z statistic in Equation 7.4 on page 254, except that S is
used to estimate the unknown σ. The concept of degrees of freedom is discussed further on page 286.
Properties of the t Distribution
In appearance, the t distribution is very similar to the standardised normal distribution. Both
distributions are bell shaped. However, the t distribution has more area in the tails and less in
the centre than the standardised normal distribution (see Figure 8.4). Because the value of σ is
unknown and S is used to estimate it, the values of t are more variable than those for Z.
The degrees of freedom n - 1 are directly related to the sample size n. As the sample size
and degrees of freedom increase, S becomes a better estimate of σ and the t distribution gradually approaches the standardised normal distribution until the two are virtually identical. With a
sample size of about 120 or more, S estimates σ precisely enough that there is little difference
between the t and Z distributions. For this reason, most statisticians use Z instead of t when the
sample size is greater than 120.
Figure 8.4 Standardised normal distribution and t distribution for 5 degrees of freedom
As stated earlier, the t distribution assumes that the random variable X is normally distributed.
In practice, however, as long as the sample size is large enough and the population is not very
skewed, you can use the t distribution to estimate the population mean when σ is unknown. When
dealing with a small sample size and a skewed population distribution, the validity of the confidence interval is a concern. To assess the assumption of normality, you can evaluate the shape of
the sample data by using a histogram, stem-and-leaf display, box-and-whisker plot or normal
probability plot.
You find the critical values of t for the appropriate degrees of freedom from the table of the
t distribution (Table E.3). The columns of the table represent the area in the upper tail of the t
distribution. Each row represents the particular t value for each specific degree of freedom. For
example, with 99 degrees of freedom, if you want 95% confidence you find the appropriate
value of t as shown in Table 8.1. The 95% confidence level means that 2.5% of the values (an
area of 0.025) are in each tail of the distribution. Looking in the column for an upper-tail area
of 0.025 and in the row corresponding to 99 degrees of freedom gives you a critical value for t
of 1.9842. Because t is a symmetrical distribution with a mean of 0, if the upper-tail value is
+1.9842, the value for the lower-tail area (lower 0.025) is -1.9842. A t value of -1.9842 means
that the probability that t is less than -1.9842 is 0.025, or 2.5% (see Figure 8.5).
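The table lookup can be cross-checked numerically. The sketch below is one possible approach using only Python's standard library, not the method used to build Table E.3: it integrates the t density with Simpson's rule and searches for the critical value by bisection. The function names, the number of integration steps and the search range of 0 to 50 are all assumptions made for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def area_from_zero(t, df, steps=2000):
    """Area under the t density between 0 and t (Simpson's rule, even number of steps)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def t_critical(df, upper_tail_area):
    """t value with the given upper-tail area, located by bisection."""
    target = 0.5 - upper_tail_area   # area between 0 and the critical value
    low, high = 0.0, 50.0            # assumed search range; the values used here lie well inside it
    for _ in range(60):
        mid = (low + high) / 2
        if area_from_zero(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

for df in (5, 99, 120):
    print(f"df = {df:3d}: t = {t_critical(df, 0.025):.4f}")
```

For 99 degrees of freedom the result should agree with the tabulated 1.9842 to about four decimal places, and the value for 120 degrees of freedom is already close to the Z value of 1.96.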
The Concept of Degrees of Freedom
In Chapter 3 we saw that the numerator of the sample variance S² (see Equation 3.9a) requires
the calculation of:
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
Table 8.1 Determining the critical value from the t table for an area of 0.025 in each tail with 99 degrees of freedom (extracted from Table E.3 in Appendix E of this book)

Degrees of    Upper-tail areas
freedom       .25      .10      .05      .025      .01       .005
1             1.0000   3.0777   6.3138   12.7062   31.8207   63.6574
2             0.8165   1.8856   2.9200    4.3027    6.9646    9.9248
3             0.7649   1.6377   2.3534    3.1824    4.5407    5.8409
4             0.7407   1.5332   2.1318    2.7764    3.7469    4.6041
5             0.7267   1.4759   2.0150    2.5706    3.3649    4.0322
.             .        .        .         .         .         .
.             .        .        .         .         .         .
.             .        .        .         .         .         .
96            0.6771   1.2904   1.6609    1.9850    2.3658    2.6280
97            0.6770   1.2903   1.6607    1.9847    2.3654    2.6275
98            0.6770   1.2902   1.6606    1.9845    2.3650    2.6269
99            0.6770   1.2902   1.6604    1.9842    2.3646    2.6264
100           0.6770   1.2901   1.6602    1.9840    2.3642    2.6259
Figure 8.5 t distribution with 99 degrees of freedom (area of 0.025 in each tail beyond t = ±1.9842; central area 1 − α = 0.95)
In order to calculate S², you first need to know X̄. Therefore, only n − 1 of the sample values are free to vary. This means that you have n − 1 degrees of freedom. For example, suppose
a sample of five values has a mean of 20. How many values do you need to know before you
can determine the remainder of the values? The fact that n = 5 and X̄ = 20 also tells you that:
\[ \sum_{i=1}^{n} X_i = 100 \]
because:
\[ \frac{\sum_{i=1}^{n} X_i}{n} = \bar{X} \]
Thus, when you know four of the values, the fifth one will not be free to vary because the sum
must add to 100. For example, if four of the values are 18, 24, 19 and 16, the fifth value must be
23 so that the sum equals 100.
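The same reasoning can be written as a tiny Python sketch. The function name `remaining_value` is an assumption for this illustration; it simply rearranges the definition of the mean to find the one value that is no longer free to vary.

```python
def remaining_value(known_values, n, mean):
    """Value forced on the last observation once n - 1 values and the mean are fixed."""
    if len(known_values) != n - 1:
        raise ValueError("exactly n - 1 values must be supplied")
    # The n values must sum to n * mean, so the last one is whatever is left over.
    return n * mean - sum(known_values)

fifth = remaining_value([18, 24, 19, 16], n=5, mean=20)
print(fifth)  # 23
```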
The Confidence Interval Statement
Equation 8.2 defines the (1 - α) * 100% confidence interval estimate for the mean with σ
unknown.
LEARNING OBJECTIVE
Construct and interpret
confidence intervals for
the mean
CONFIDENCE INTERVAL FOR THE MEAN (σ UNKNOWN)
$$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}}$$
or
$$\bar{X} - t_{n-1} \frac{S}{\sqrt{n}} \leq \mu \leq \bar{X} + t_{n-1} \frac{S}{\sqrt{n}} \qquad (8.2)$$
where $t_{n-1}$ is the critical value of the t distribution with n - 1 degrees of freedom for an
area of α/2 in the upper tail.
To illustrate the application of the confidence interval estimate for the mean when the
standard deviation σ is unknown, return to the Callistemon Camping Supplies scenario presented on page 279. You select a sample of 100 sales invoices from the population of sales
invoices during the month and the sample mean of the 100 sales invoices is $230.27, with a
sample standard deviation of $52.62. For 95% confidence, the critical value from the t distribution
(as shown in Table 8.1) is 1.9842. Using Equation 8.2:
$$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}} = 230.27 \pm (1.9842) \frac{52.62}{\sqrt{100}} = 230.27 \pm 10.44$$
$$\$219.83 \leq \mu \leq \$240.71$$
A Microsoft Excel worksheet for these data is presented in Figure 8.6 (overleaf).
Thus, with 95% confidence, you conclude that the mean amount of all the sales invoices
is between $219.83 and $240.71. The 95% confidence level indicates that if you selected all
possible samples of 100 (something that is never done in practice), 95% of the intervals developed would include the population mean somewhere within the interval. The validity of this
confidence interval estimate depends on the assumption of normality for the distribution of the
amount of the sales invoices. With a sample of 100, the normality assumption is not overly restrictive and the use of the t distribution is probably appropriate. Example 8.3 further illustrates how to
construct the confidence interval for a mean when the population standard deviation is unknown.

Figure 8.6 Microsoft Excel 2016 worksheet to calculate a confidence interval estimate for the mean sales invoice amount for Callistemon Camping Supplies

Row  A                                             B           Formula in column B
1    Estimate for the mean sales invoice amount
3    Data
4    Sample standard deviation                     52.62
5    Sample mean                                   230.27
6    Sample size                                   100
7    Confidence level                              95%
9    Intermediate calculations
10   Standard error of the mean                    5.262       =B4/SQRT(B6)
11   Degrees of freedom                            99          =B6 – 1
12   t value                                       1.984217    =T.INV.2T(1 – B7,B11)
13   Interval half width                           10.44095    =B12 * B10
15   Confidence interval
16   Interval lower limit                          219.8291    =B5 – B13
17   Interval upper limit                          240.7109    =B5 + B13
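The invoice calculation can also be checked outside a worksheet. The Python sketch below follows the steps of Equation 8.2 using only the summary statistics quoted for the Callistemon Camping Supplies sample. The function name `t_interval` is an assumption made for illustration, and the critical value 1.9842 is read from Table 8.1 rather than computed.

```python
import math

def t_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for the mean when sigma is unknown (Equation 8.2).

    t_critical must be the t value for n - 1 degrees of freedom and an
    upper-tail area of alpha/2; here it is looked up in a t table by hand.
    """
    standard_error = sample_sd / math.sqrt(n)
    half_width = t_critical * standard_error
    return sample_mean - half_width, sample_mean + half_width

lower, upper = t_interval(230.27, 52.62, 100, t_critical=1.9842)
print(f"{lower:.2f} <= mu <= {upper:.2f}")  # 219.83 <= mu <= 240.71
```

Passing the critical value in as an argument keeps the sketch free of external libraries; a statistics package could supply it instead of the table.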
EXAMPLE 8.3
ESTIMATING THE MEAN HEIGHT OF FEMALE ATHLETES AGED 18–25
A manufacturer of women’s tracksuits needs to estimate the average height of female
athletes in the 18–25 age group. The measurements of a sample of 30 women are taken
and their heights recorded in millimetres. Table 8.2 lists these values. < HEIGHTS >
Construct a 95% confidence interval estimate for the population mean height of female
athletes in this age group.

Table 8.2 Heights (in millimetres) of female athletes aged 18–25
1,870  1,728  1,656  1,610  1,634  1,784
1,522  1,696  1,592  1,662  1,866  1,764
1,734  1,662  1,734  1,774  1,550  1,756
1,762  1,866  1,820  1,744  1,788  1,688
1,810  1,752  1,680  1,810  1,652  1,736
SOLUTION
Figure 8.7 shows that the sample mean is X̄ = 1,723.4 mm and the sample standard deviation is S = 89.55 mm. Using Equation 8.2 to construct the confidence interval, you need to
determine the critical value from the t table for an area of 0.025 in each tail with 29 degrees
of freedom. Table E.3 shows that t29 = 2.0452. Thus, using X̄ = 1,723.4, S = 89.55, n = 30
and t29 = 2.0452:
$$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}} = 1{,}723.4 \pm (2.0452) \frac{89.55}{\sqrt{30}} = 1{,}723.4 \pm 33.44$$
$$1{,}689.96 \leq \mu \leq 1{,}756.84$$
Figure 8.7 PHStat confidence interval estimate for the mean height (in millimetres) of female athletes aged 18–25

Row  A                                B
1    One sample t : height
3    Data
4    Sample standard deviation        89.55083319
5    Sample mean                      1723.4
6    Sample size                      30
7    Confidence level                 95%
9    Intermediate calculations
10   Standard error of the mean       16.34967046
11   Degrees of freedom               29
12   t value                          2.045229642
13   Interval half width              33.43883066
15   Confidence interval
16   Interval lower limit             1689.96
17   Interval upper limit             1756.84
You conclude with 95% confidence that the mean height of 18–25-year-old female
athletes is between 1,689.96 and 1,756.84 mm. The validity of this confidence interval estimate
depends on the assumption that the heights in the population are normally distributed. Remember,
however, that you can slightly relax this assumption for large sample sizes. Thus, with a sample of
30, you can use the t distribution even if the distribution of heights is slightly skewed. From the
normal probability plot displayed in Figure 8.8, or the boxplot displayed in Figure 8.9, the heights
appear only slightly skewed. Thus the t distribution is appropriate for these data.
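The summary statistics and the interval in this example can be reproduced directly from the 30 heights in Table 8.2. The Python sketch below uses only the standard library; the variable names are illustrative, and the critical value 2.0452 is taken from Table E.3 as quoted in the solution rather than computed.

```python
import math
import statistics

heights_mm = [
    1870, 1728, 1656, 1610, 1634, 1784, 1522, 1696, 1592, 1662,
    1866, 1764, 1734, 1662, 1734, 1774, 1550, 1756, 1762, 1866,
    1820, 1744, 1788, 1688, 1810, 1752, 1680, 1810, 1652, 1736,
]

n = len(heights_mm)
sample_mean = statistics.mean(heights_mm)   # X-bar
sample_sd = statistics.stdev(heights_mm)    # S, computed with n - 1 in the denominator
t_29 = 2.0452                               # from Table E.3, 29 degrees of freedom

half_width = t_29 * sample_sd / math.sqrt(n)
lower, upper = sample_mean - half_width, sample_mean + half_width
print(f"mean = {sample_mean:.1f} mm, S = {sample_sd:.2f} mm")
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

The printed values should agree with Figure 8.7 to the precision shown there, apart from small differences caused by rounding the critical value to four decimal places.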
Figure 8.8 PHStat normal probability plot for the height (in millimetres) of female athletes aged 18–25 (vertical axis: Height; horizontal axis: Z value)
Figure 8.9 PHStat boxplot for the height (in millimetres) of female athletes aged 18–25 (axis: Height)
The validity of this confidence interval estimate depends on the assumption that the
processing time is normally distributed. What would happen if there was a small sample and
the boxplot and the normal probability plot indicated that the distribution was right-skewed?
In this case you would have some concern about the validity of the confidence interval in
estimating the population mean. The concern is that a 95% confidence interval based on a
small sample from a skewed distribution will contain the population mean less than 95% of
the time in repeated sampling. In the case of small sample sizes and skewed distributions,
you might consider the sample median as an estimate of central tendency and construct a
confidence interval for the population median (see reference 2).
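The concern about small samples from skewed populations can be explored with a simulation. The sketch below is an illustration under stated assumptions, not a result from the text: it assumes an exponential population with mean 1 as the right-skewed distribution, samples of size n = 10, and the table value t = 2.2622 for 9 degrees of freedom. The function name `coverage_rate` and the seed are arbitrary choices.

```python
import math
import random

def coverage_rate(n=10, t_critical=2.2622, population_mean=1.0,
                  repetitions=20_000, seed=1):
    """Share of nominal 95% t intervals that contain the true mean when
    samples of size n are drawn from a right-skewed (exponential) population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0 / population_mean) for _ in range(n)]
        x_bar = sum(sample) / n
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        half_width = t_critical * s / math.sqrt(n)
        if x_bar - half_width <= population_mean <= x_bar + half_width:
            hits += 1
    return hits / repetitions

rate = coverage_rate()
print(f"Observed coverage: {rate:.3f} (nominal level 0.95)")
```

Under these assumptions the observed coverage typically falls below the nominal 95%, which is the behaviour the paragraph above warns about. Increasing n in the call brings the rate closer to 0.95.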
Problems for Section 8.2
LEARNING THE BASICS
8.10 Determine the critical value of t in each of the following
circumstances:
a. 1 - α = 0.95, n = 10
b. 1 - α = 0.99, n = 10
c. 1 - α = 0.95, n = 32
d. 1 - α = 0.95, n = 65
e. 1 - α = 0.90, n = 16
8.11 If X̄ = 75, S = 24, n = 36, and assuming that the population is
normally distributed, construct a 95% confidence interval
estimate of the population mean μ.
8.12 If X̄ = 50, S = 15, n = 16, and assuming that the population is
normally distributed, construct a 99% confidence interval
estimate of the population mean μ.
8.13 Construct a 95% confidence interval estimate for the population
mean, based on each of the following sets of data, assuming
that the population is normally distributed:
Set 1: 1, 1, 1, 1, 8, 8, 8, 8
Set 2: 1, 2, 3, 4, 5, 6, 7, 8
Explain why these data sets have different confidence intervals
even though they have the same mean and range.
8.14 Construct a 95% confidence interval for the population mean, based
on the numbers 1, 2, 3, 4, 5, 6 and 20. Change the number 20 to 7
and recalculate the confidence interval. Using these results, describe
the effect of an outlier (i.e. extreme value) on the confidence interval.
APPLYING THE CONCEPTS
You can solve problems 8.15 to 8.21 with or without Microsoft Excel.
8.15 A stationery store wants to estimate the mean retail value of
greeting cards that it has in its inventory. A random sample of
20 greeting cards indicates a mean value of $4.95 and a
standard deviation of $0.82.
a. Assuming a normal distribution, construct a 95% confidence
interval estimate of the mean value of all greeting cards in
the store’s inventory.
b. How are the results in (a) useful in assisting the store owner
to estimate the total value of his inventory?
8.16 Water resources in many parts of Australia are being closely
watched and restrictions or water-wise rules have been
imposed on activities such as garden watering. Suppose that
Sydney Water monitors water usage in a suburb and finds that
for one summer the average household usage is 408 litres per
day. A year later it examines records of a sample of 50
households and finds that there is a daily mean usage of 380
litres with a standard deviation of 25 litres.
a. Construct a 95% confidence interval for the population
mean daily water usage in the second summer. Assume the
population usage is normally distributed.
b. Interpret the interval constructed in (a).
c. Do you think water usage has changed in the second
summer? Explain.
8.17 The energy consumption of refrigerators sold in Australia and
New Zealand is checked and appliances are given a star rating
to guide consumers who are about to make purchases. The
consumption in kilowatts per annum is also displayed for each
model on the website <www.energyrating.gov.au>. Suppose a
consumer organisation wants to estimate the actual electricity
usage of a model of refrigerator that has an advertised energy
usage of 355 kW per annum. It tests a random sample of
n = 18 fridges and finds a sample mean usage of 367 and a
sample standard deviation of 30.
a. Assuming that the energy usage in the population is
normally distributed, construct a 95% confidence interval
estimate of the population mean energy usage for this
model of refrigerator.
b. Do you think that the consumer organisation should accuse
the manufacturer of producing fridges that do not meet the
advertised energy consumption? Explain.
c. Explain why an observed energy usage of 350 kW
for a particular refrigerator is not unusual, even
though it is outside the confidence interval developed
in (a).
8.18 The data below represent the annual account fees for
cheques made by a bank for a sample of 23 clients with
cheque accounts who do not undertake Internet banking.
< BANK_COST1 >
26 29 20 20 21 22 25 25 18 25 15 18
20 25 25 22 30 30 30 15 20 29 20
a. Construct a 95% confidence interval for the population
mean annual cheque fee.
b. What assumption must you make about the population
distribution in (a)?
c. Interpret the interval constructed in (a).
8.19 One of the major measures of the quality of service provided by
any organisation is the speed with which it responds to
customer complaints. A large family-held department store
selling furniture and flooring, including carpeting, has
undergone a major expansion in the past several years. In
particular, the flooring department has expanded from two
installation crews to an installation supervisor, a measurer and
15 installation crews. Last year there were 50 complaints about
carpet installation. The data below represent the number of
days between the receipt of the complaint and the resolution of
the complaint. < FURNITURE >
54 5 35 137 31 27 152 2 123 81
74 11 19 126 110 110 29 61 35 94
31 26 12 29 26 25 4 165 13 10
5 32 27 29 28 4 52 30 22 1
14 36 26 20 27 5 13 23 33 68
a. Construct a 95% confidence interval estimate of the mean
number of days between receipt of the complaint and
resolution of the complaint.
b. What assumption must you make about the population
distribution in (a)?
c. Do you think that the assumption made in (b) is seriously
violated? Explain.
d. What effect might your conclusion in (c) have on the validity
of the results in (a)?
8.20 The approval process for a life insurance policy requires a
review of the application and the applicant’s medical history,
possible requests for additional medical information and
medical examinations, and a policy compilation stage where the
policy pages are generated then delivered. The ability to deliver
approved policies to customers in a timely manner is critical to
the profitability of this service to the insurance company.
Over a period of one month, a random sample of 27 approved
policies was selected and the total processing time in days
recorded. < INSURANCE >
73 19 16 64 28 28 31 90 60 56 31 56 22 18
45 48 17 17 17 91 92 63 50 51 69 16 17
a. Construct a 95% confidence interval estimate of the mean
processing time.
b. What assumption must you make about the population
distribution in (a)?
c. Do you think that the assumption made in (b) is seriously
violated? Use a plot and explain.
8.21 The data below represent the daily rate in Australian dollars for
a double room or studio booking on the following Monday night
at a sample of hotels, motels and motor lodges in 20 New
Zealand cities and towns in July 2017. < MOTEL_2017 >

City/Town            Room cost
Lake Taupo           138
Hamilton             147
Whitianga            152
Waitomo              118
Auckland             179
Whangarei            137
Paihia               113
Russell              129
Wellington           141
Kerikeri             136
Tauranga             128
Havelock North       156
New Plymouth         149
Thames               121
Hastings             103
Palmerston North     137
Napier               135
Wanganui             114
Gisborne             122
Rotorua              132

Data obtained from <http://compare.jasons.co.nz> accessed 4 July 2017
a. Construct a 95% confidence interval for the population
mean lowest room cost.
b. Construct a 99% confidence interval for the population
mean lowest room cost.
c. What assumption do you need to make about the population
of interest to construct the intervals in (a) and (b)?
d. Given the data presented, do you think the assumption
needed in (a) and (b) is valid? Use a plot and explain.
8.3 CONFIDENCE INTERVAL ESTIMATION FOR THE PROPORTION
This section extends the concept of the confidence interval to categorical data. Here you are
concerned with estimating the proportion of items in a population with a certain characteristic
of interest. The unknown population proportion is represented by the Greek letter π (pronounced pi). The point estimate for π is the sample proportion, p = X/n, where n is the sample
size and X is the number of items in the sample with the characteristic of interest. Equation 8.3
defines the confidence interval estimate for the population proportion.
CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION
$$p \pm Z \sqrt{\frac{p(1 - p)}{n}}$$
or
$$p - Z \sqrt{\frac{p(1 - p)}{n}} \leq \pi \leq p + Z \sqrt{\frac{p(1 - p)}{n}} \qquad (8.3)$$
where p = sample proportion = X/n = (number of items with the characteristic)/(sample size)
π = population proportion
Z = critical value from the standardised normal distribution
n = sample size
assuming both np and n(1 - p) are greater than 5
LEARNING OBJECTIVE 2
Construct and interpret confidence intervals for the proportion
You can use the confidence interval estimate of the proportion defined in Equation 8.3 to
estimate the proportion of sales invoices that contain errors (see the opening scenario on
page 279). Suppose that in a sample of 100 sales invoices, 10 contain errors. Thus, for these
data, p = X/n = 10/100 = 0.10, so np = 10 > 5 and n(1 - p) = 90 > 5. Using Equation 8.3 and
Z = 1.96 for 95% confidence:
$$p \pm Z \sqrt{\frac{p(1 - p)}{n}} = 0.10 \pm (1.96) \sqrt{\frac{(0.10)(0.90)}{100}} = 0.10 \pm (1.96)(0.03) = 0.10 \pm 0.0588$$
$$0.0412 \leq \pi \leq 0.1588$$
Therefore, you have 95% confidence that between 4.12% and 15.88% of all the sales invoices
contain errors. Figure 8.10 shows a Microsoft Excel worksheet for these data. Note that in early
versions of Excel, the formula used in cell B10 would be = NORMSINV((1+B6)/2).
Example 8.4 illustrates another application of a confidence interval estimate for the proportion.
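As a cross-check on the invoice calculation, Equation 8.3 can be sketched in Python. The function name `proportion_interval` is an assumption made for illustration; the sample-size check uses the counts X and n - X, which is equivalent to requiring that np and n(1 - p) both exceed 5.

```python
import math

def proportion_interval(successes, n, z):
    """Confidence interval for a population proportion (Equation 8.3)."""
    # The normal approximation assumes X and n - X are both greater than 5.
    if successes <= 5 or n - successes <= 5:
        raise ValueError("sample too small for the normal approximation")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lower, upper = proportion_interval(successes=10, n=100, z=1.96)
print(f"{lower:.4f} <= pi <= {upper:.4f}")  # 0.0412 <= pi <= 0.1588
```

Raising an error when the counts are too small is a design choice for this sketch; it simply prevents the approximation from being applied where Equation 8.3 is not recommended.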
Figure 8.10 Microsoft Excel 2016 worksheet to form a confidence interval estimate for the proportion of sales invoices that contain errors

Row  A                                      B         Formula in column B
1    Proportion of in-error sales invoices
3    Data
4    Sample size                            100
5    Number of success                      10
6    Confidence level                       95%
8    Intermediate calculations
9    Sample proportion                      0.1       =B5/B4
10   Z value                                1.96      =NORM.S.INV((1 + B6)/2)
11   Standard error of the proportion       0.03      =SQRT(B9 * (1 – B9)/B4)
12   Interval half width                    0.0588    =(B10 * B11)
14   Confidence interval
15   Interval lower limit                   0.0412    =B9 – B12
16   Interval upper limit                   0.1588    =B9 + B12
EXAMPLE 8.4
ESTIMATING THE PROPORTION OF TYPOGRAPHICAL ERRORS IN ONLINE NEWSPAPERS
With the latest technology available to check written text, mistakes in newspapers are
becoming less common. However, humans still make mistakes. A large media corporation
wants to estimate the proportion of online newspaper articles written by a variety of journalists that have typographical errors. A random sample of 200 articles is selected from all the
newspapers posted online during a single month. For this sample of 200, 7 contain some
type of typographical error. Construct and interpret a 90% confidence interval for the
proportion of articles posted online during the month that have a typographical error.
SOLUTION
Using Equation 8.3:
$$p = \frac{7}{200} = 0.035$$
so np = 200 × 0.035 = 7 > 5
n(1 – p) = 200 × 0.965 = 193 > 5
and with a 90% level of confidence Z = 1.645
$$p \pm Z \sqrt{\frac{p(1 - p)}{n}} = 0.035 \pm (1.645) \sqrt{\frac{(0.035)(0.965)}{200}} = 0.035 \pm (1.645)(0.0130) = 0.035 \pm 0.0214$$
$$0.0136 \leq \pi \leq 0.0564$$
You can conclude with 90% confidence that between 1.36% and 5.64% of the newspaper
articles posted online in that month have a typographical error.
Equation 8.3 contains a Z statistic since you can use the normal distribution to approximate
the binomial distribution when the sample size is sufficiently large. In Example 8.4, the confidence interval using Z provides an excellent approximation for the population proportion since
both X and n - X are greater than 5. However, if you do not have a sufficiently large sample
size, you should use the binomial distribution rather than Equation 8.3 (see references 3, 4
and 5). The exact confidence intervals for various sample sizes and proportions of successes
have been tabulated by Fisher and Yates (reference 4).
Problems for Section 8.3
LEARNING THE BASICS
8.22 If n = 200 and X = 50, construct a 95% confidence interval
estimate of the population proportion.
8.23 If n = 400 and X = 25, construct a 99% confidence interval
estimate of the population proportion.
APPLYING THE CONCEPTS
8.24 A telco wants to estimate the proportion of mobile phone
customers who would purchase a phone plan with unlimited
standard calls and SMS and 2GB of data if it were made
available at a substantially reduced cost. A random sample of
500 customers is selected. The results indicate that 190 of the
customers would purchase the plan at a reduced cost.
a. Construct a 99% confidence interval estimate of the
population proportion of customers who would purchase the
unlimited 2GB plan.
b. How would the manager in charge of promotional programs
for mobile customers use the results in (a)?
8.25 A survey of 500 highly educated women who left careers for
family reasons found that 66% postponed their return to work
due to difficulty in making suitable childcare arrangements.
a. Construct a 95% confidence interval for the population
proportion of highly educated women who have postponed
their return to work due to difficulty in making suitable
childcare arrangements.
b. Interpret the interval in (a).
8.26 A survey of 293 inhabitants of Tropical North Queensland in 2013
found that 45% considered increased property values were a
negative impact of tourism in the region (Tropical North
Queensland Social Indicators 2013 <https://cdn-teq.queensland.
com/~/media/d0af5b7686754e2591d7e3fad2cdb673.
ashx?vs=1&d=20140515T080145> accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of all
residents in the region who believe increased property
values are a negative impact of tourism.
b. Construct a 90% confidence interval for the proportion of all
residents in the region who believe increased property
values are a negative impact of tourism.
c. Which interval is wider? Explain why this is true.
8.27 The number of older consumers in Australia is growing and
they are becoming an important economic force. According to
the Australian Bureau of Statistics, the proportion of the
population aged 65 years and over increased from 14% in
2011 to 16% in 2016. (Australian Bureau of Statistics,
Reflecting Australia- Stories from the Census, 2016, Cat. No.
2071.0, 2017). The proportion is projected to grow higher in
coming years. Many older consumers feel overwhelmed when
confronted with the task of selecting investments, banking
services, health insurance or phone service providers. Suppose
a telephone survey of 1,900 older consumers found that 27%
said they felt confused when making financial decisions.
a. Construct a 95% confidence interval for the population
proportion of older consumers who feel confused when
making financial decisions.
b. Interpret the interval in (a).
8.28 The Australian Telecommunications Industry Ombudsman
2016 Annual Report states that 34.1% of new complaints in
2015–16 related to faults (<http://annualreport2016.tio.com.
au/#Service_type_in_complaints> accessed 5 July 2017).
Imagine that you take a survey of 1,000 Australian users
and find that 36% of this sample report that they
have had telecommunication service faults in the past
three months.
a. Construct a 95% confidence interval for the population
proportion of users who have experienced service faults in
the past three months.
b. Does your interval indicate that there is a difference from
the percentage reported by the Ombudsman? Give reasons
why a difference may occur.
8.29 The Australian Psychological Society conducted an online
survey in 2016 of 1,000 adults and 518 adolescents. It found
that 69% of adolescents reported consuming food from fast
food restaurants at least once a week. (Psychology Week
2016, APS Compass for Life Wellbeing Survey <www.
psychology.org.au/Assets/Files/16APS-PW-Survey-Web.pdf>
accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of all
Australian adolescents who consume food from fast food
restaurants at least once per week.
b. How would your result change if it was a 99% interval?
8.30 Suppose that, in a survey of 600 employers, 126 indicate that
they have used a recruitment service within the past two
months to find new staff.
a. Construct a 95% confidence interval for the population
proportion of employers who have used a recruitment
service within the past two months to find new staff.
b. Construct a 99% confidence interval for the population
proportion of employers who have used a recruitment
service within the past two months to find new staff.
c. Interpret the intervals in (a) and (b).
d. Discuss the effect on the confidence interval estimate when
you change the level of confidence.
8.4 DETERMINING SAMPLE SIZE
LEARNING OBJECTIVE
3
Determine the sample
size necessary to develop
a confidence interval for
the mean
In each example of confidence interval estimation, you selected the sample size without regard
to the width of the resulting confidence interval. In the business world, determining the proper
sample size is a complicated procedure, subject to the constraints of budget, time and the
amount of acceptable sampling error. If, in the Callistemon Camping Supplies scenario, you
want to estimate the mean dollar amount of the sales invoices or the proportion of sales invoices
that contain errors, you must determine in advance how large a sampling error to allow in estimating each of the parameters. You must also determine in advance the level of confidence to
use in estimating the population parameter.
Sample Size Determination for the Mean
To develop a formula for determining the appropriate sample size needed when constructing a
confidence interval estimate of the mean, recall Equation 8.1 on page 282:
$$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$$
The amount added to or subtracted from X̄ is equal to half the width of the interval. This quantity represents the amount of imprecision in the estimate that results from sampling error.
The sampling error e (in this context, some statisticians refer to e as the ‘margin of error’) is
defined as:
$$e = Z \frac{\sigma}{\sqrt{n}}$$
sampling error: The difference in results for different samples of the same size.
Solving for n gives the sample size needed to construct the appropriate confidence interval estimate for the mean. ‘Appropriate’ means that the resulting interval will have an acceptable
amount of sampling error.
SAMPLE SIZE DETERMINATION FOR THE MEAN
The sample size n is equal to the product of the Z value squared and the variance σ²,
divided by the sampling error e squared.
$$n = \frac{Z^2 \sigma^2}{e^2} \qquad (8.4)$$
To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the
standardised normal distribution¹
2. the acceptable sampling error e
3. the standard deviation σ.
In some business-to-business relationships requiring estimation of important parameters,
legal contracts specify acceptable levels of sampling error and the confidence level required.
For companies in the food or drug sectors, government regulations often specify sampling
errors and confidence levels. In general, however, it is usually not easy to specify the two factors needed to determine the sample size. How can you determine the level of confidence and
sampling error? Typically, these questions are answered only by the subject matter expert (i.e.
the individual most familiar with the variables under study). Although 95% is the most common
confidence level used, if more confidence is desired then 99% might be more appropriate; if
less confidence is deemed acceptable, then 90% might be used. For the sampling error, you
should think not of how much sampling error you would like to have (you really do not want
any error), but of how much you can tolerate when drawing conclusions from the data.
In addition to specifying the confidence level and the sampling error, you need an estimate of
the standard deviation. Unfortunately, you rarely know the population standard deviation, σ. In some
instances, you can estimate the standard deviation from past data. In other situations, you can make
an educated guess by taking into account the range and distribution of the variable. For example, if
you assume a normal distribution, the range is approximately equal to 6σ (i.e. ±3σ around the
mean) so that you estimate σ as the range divided by 6. If you cannot estimate σ in this way, you can
conduct a small-scale study and estimate the standard deviation from the resulting data.
To explore how to determine the sample size needed for estimating the population mean,
consider again the audit at Callistemon Camping Supplies. In Section 8.2, we selected a sample
of 100 sales invoices and developed a 95% confidence interval estimate of the population mean
sales invoice amount. How was this sample size determined? Should we have selected a different sample size?
Suppose that, after consultation with company officials, we determine that a sampling error
of no more than ±$10 is desired, together with 95% confidence. Past data indicate that the
standard deviation of the sales amount is approximately $50. Thus, e = $10, σ = $50 and
Z = 1.96 (for 95% confidence). Using Equation 8.4:
$$n = \frac{Z^2 \sigma^2}{e^2} = \frac{(1.96)^2 (50)^2}{(10)^2} = 96.04$$
Because the general rule is to oversatisfy slightly the criteria by rounding the sample size up to
the next whole integer, you should select a sample of size 97. Thus, the sample of size 100 used
on page 287 is close to what is necessary to satisfy the needs of the company based on the
estimated standard deviation, desired confidence level and sampling error. Because the calculated sample standard deviation is slightly higher than expected, $52.62 compared with $50.00,
the confidence interval is slightly wider than desired. Figure 8.11 illustrates the Microsoft Excel
worksheet to determine the sample size. For early versions of Excel use the formula
=NORMSINV((1+B6)/2) in cell B9.
¹ You use Z instead of t because to determine the critical value of t you need to know the sample size, but you do not
know it yet. For most studies, the sample size needed is large enough that the standardised normal distribution is a
good approximation of the t distribution.
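The sample size calculation for the invoice audit can be sketched in Python as follows. The function name `sample_size_for_mean` is an assumption made for illustration. The Z value is obtained from the standard library's `statistics.NormalDist` rather than rounded to 1.96, which parallels the worksheet's use of NORM.S.INV.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(confidence, sigma, e):
    """Sample size from Equation 8.4, rounded up to the next whole number."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided critical value
    return math.ceil((z * sigma / e) ** 2)

print(sample_size_for_mean(0.95, sigma=50, e=10))  # 97
```

Rounding up with `math.ceil` implements the rule of slightly oversatisfying the criteria, so the sampling error does not exceed e when the assumed σ is correct.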
Figure 8.11 Microsoft Excel 2016 worksheet for determining sample size for estimating the mean sales invoice amount for Callistemon Camping Supplies Pty Ltd

Row  A                                    B          Formula in column B
1    For the mean sales invoice amount
3    Data
4    Population standard deviation        50
5    Sampling error                       10
6    Confidence level                     95%
8    Intermediate calculations
9    Z value                              1.9600     =NORM.S.INV((1 + B6)/2)
10   Calculated sample size               96.0365    =((B9 * B4)/B5)^2
12   Result
13   Sample size needed                   97         =ROUNDUP(B10,0)
Example 8.5 illustrates another application of determining the sample size needed to
develop a confidence interval estimate for the mean.
EXAMPLE 8.5
DETERMINING THE SAMPLE SIZE FOR THE MEAN
Returning to Example 8.3, suppose you want to estimate the population mean height for females
who wear size 12 to within ±15 mm with 95% confidence. On the basis of a study taken the
previous year, you believe that the standard deviation is 100 mm. Find the sample size needed.
SOLUTION
Using Equation 8.4 on page 295 and e = 15, σ = 100 and Z = 1.96 for 95% confidence:
$$n = \frac{Z^2 \sigma^2}{e^2} = \frac{(1.96)^2 (100)^2}{(15)^2} = 170.74$$
Therefore, you should select a sample size of 171 women, because the general rule for
determining sample size is always to round up to the next integer value in order to oversatisfy slightly the criteria desired.
An actual sampling error slightly larger than 15 will result if the sample standard deviation calculated in this sample of 171 is greater than 100, and it will be slightly smaller if the
sample standard deviation is less than 100.
Sample Size Determination for the Proportion
So far, we have seen how to determine the sample size needed for estimating the population
mean. Now suppose that you want to determine the sample size necessary for estimating the
proportion of sales invoices at Callistemon Camping Supplies that contain errors.
To determine the sample size needed to estimate a population proportion (π), you use a
method similar to that for a population mean. Recall that in developing the sample size for a
confidence interval for the mean, the sampling error is defined by:
$$e = Z \frac{\sigma}{\sqrt{n}}$$
LEARNING OBJECTIVE 3
Determine the sample size necessary to develop a confidence interval for the proportion
When estimating a proportion, you replace σ with $\sqrt{\pi(1 - \pi)}$. Thus, the sampling error is:
$$e = Z \sqrt{\frac{\pi(1 - \pi)}{n}}$$
Solving for n, you have the sample size necessary to develop a confidence interval estimate for
a proportion.
SAMPLE SIZE DETERMINATION FOR THE PROPORTION
The sample size n is equal to the product of Z value squared, the population proportion π
and 1 minus the population proportion π, divided by the sampling error e squared.
$$n = \frac{Z^2 \pi(1 - \pi)}{e^2} \qquad (8.5)$$
To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the
standardised normal distribution
2. the acceptable sampling error e
3. the population proportion π.
In practice, selecting these quantities requires some planning. Once you determine the
desired level of confidence, you can find the appropriate Z value from the standardised normal
distribution. The sampling error e indicates the amount of error that you are willing to tolerate
in estimating the population proportion. The third quantity, π, is actually the population parameter that you want to estimate! How do you state a value for the very thing that you are taking a
sample in order to determine?
There are two alternatives. In many situations, you may have past information or relevant
experiences that provide an educated estimate of π. If you do not, you can try to provide a value
for π that would never underestimate the sample size needed. Referring to Equation 8.5, you
can see that the quantity π(1 - π) appears in the numerator. Thus, you need to determine the
value of π that will make the quantity π(1 - π) as large as possible. When π = 0.5, the product
π(1 - π) achieves its maximum result. To show this, here are several values of π together with
the accompanying products of π(1 - π):
When π = 0.9, π(1 - π) = (0.9)(0.1) = 0.09
When π = 0.7, π(1 - π) = (0.7)(0.3) = 0.21
When π = 0.5, π(1 - π) = (0.5)(0.5) = 0.25
When π = 0.3, π(1 - π) = (0.3)(0.7) = 0.21
When π = 0.1, π(1 - π) = (0.1)(0.9) = 0.09
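The pattern in these values can also be confirmed algebraically. The following short derivation is an added sketch, not part of the original text. It completes the square to show that no value of π can give a product larger than 0.25:

```latex
\pi(1-\pi) = \pi - \pi^{2} = 0.25 - (\pi - 0.5)^{2} \le 0.25,
\qquad \text{with equality only when } \pi = 0.5 .
```

Because the squared term (π - 0.5)² is never negative, the product is largest when that term is zero, that is, when π = 0.5.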
Therefore, when you have no prior knowledge or estimate of the population proportion π,
use π = 0.5 for determining the sample size. This produces the largest possible sample size and
results in the highest possible cost of sampling. Using π = 0.5 may overestimate the sample
size needed because you use the actual sample proportion in developing the confidence interval. You will get a confidence interval narrower than originally intended if the actual sample
proportion is different from 0.5. The increased precision comes at the cost of spending more
time and money for an increased sample size.
Returning to the Callistemon Camping Supplies scenario, suppose that the auditing procedures require you to have 95% confidence in estimating the population proportion of sales
invoices with errors to within ±0.07. The results from past months indicate that the largest
proportion has been no more than 0.15. Thus, using Equation 8.5 and e = 0.07, π = 0.15 and
Z = 1.96 for 95% confidence:
$$n = \frac{Z^2\pi(1-\pi)}{e^2} = \frac{(1.96)^2(0.15)(0.85)}{(0.07)^2} = 99.96$$
Because the general rule is to round up the sample size to the next whole integer to slightly
oversatisfy the criteria, a sample size of 100 is needed. Thus, the sample size needed to satisfy
the requirements of the company based on the estimated proportion, desired confidence level
and sampling error is equal to the sample size taken on page 292. The actual confidence interval
is narrower than required since the sample proportion is 0.10, while 0.15 was used for π in
Equation 8.5. Figure 8.12 shows a Microsoft Excel 2016 worksheet. Change the formula in cell
B9 to =NORMSINV((1+B6)/2) for early versions of Excel.
Example 8.6 provides a second application of determining the sample size for estimating
the population proportion.
Figure 8.12 Microsoft Excel 2016 worksheet for determining sample size for estimating the proportion of sales invoices with errors for Callistemon Camping Supplies Pty Ltd

Row  A                                              B        Formula in column B
1    For the proportion of in-error sales invoices
2
3    Data
4    Estimate of true proportion                    0.15
5    Sampling error                                 0.07
6    Confidence level                               95%
7
8    Intermediate calculations
9    Z value                                        1.9600   =NORM.S.INV((1 + B6)/2)
10   Calculated sample size                         99.9563  =(B9^2 * B4 * (1 - B4))/B5^2
11
12   Result
13   Sample size needed                             100      =ROUNDUP(B10,0)

EXAMPLE 8.6
DETERMINING THE SAMPLE SIZE FOR THE POPULATION PROPORTION
You want to have 90% confidence of estimating the proportion of office workers who
respond to email within an hour to within ±0.05. Because you have not previously
undertaken such a study, there is no information available from past data. Determine the
sample size needed.
SOLUTION
Because no information is available from past data, assume π = 0.50. Using Equation 8.5
and e = 0.05, π = 0.50 and Z = 1.645 for 90% confidence:
$$n = \frac{(1.645)^2(0.50)(0.50)}{(0.05)^2} = 270.6$$
Therefore, you need a sample of 271 office workers to estimate the population proportion to
within ±0.05 with 90% confidence.
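Equation 8.5 can be sketched in Python in the same way as the Excel worksheet in Figure 8.12. This is an illustrative sketch only: the function name `sample_size_for_proportion` and its default of π = 0.5 (used when no prior estimate is available) are assumptions made for this example, and the Z value is computed rather than rounded, so the unrounded result for Example 8.6 is marginally below the 270.6 shown above.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(e, confidence, pi=0.5):
    """Sketch of Equation 8.5: n = Z^2 * pi * (1 - pi) / e^2, rounded up.

    e          -- acceptable sampling error
    confidence -- confidence level as a fraction, e.g. 0.95
    pi         -- estimate of the population proportion; 0.5 is the
                  most conservative choice when nothing is known
    """
    # Two-sided critical value from the standardised normal distribution.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    n_raw = (z ** 2) * pi * (1 - pi) / (e ** 2)
    return n_raw, math.ceil(n_raw)


# Callistemon Camping Supplies: pi = 0.15, e = 0.07, 95% confidence.
raw_invoices, n_invoices = sample_size_for_proportion(0.07, 0.95, pi=0.15)
# Example 8.6: no prior information, e = 0.05, 90% confidence.
raw_email, n_email = sample_size_for_proportion(0.05, 0.90)

print(f"invoices: {raw_invoices:.4f} -> {n_invoices}")
print(f"email study: {raw_email:.2f} -> {n_email}")
```

Leaving `pi` at its default reproduces the conservative approach described earlier, since π = 0.5 gives the largest possible sample size.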
Problems for Section 8.4
LEARNING THE BASICS
8.31 If you want to be 95% confident of estimating the population
mean to within a sampling error of ±5 and the standard
deviation is assumed to be 15, what sample size is required?
8.32 If you want to be 99% confident of estimating the population
mean to within a sampling error of ±20 and the standard
deviation is assumed to be 100, what sample size is required?
8.33 If you want to be 99% confident of estimating the population
proportion to within a sampling error of ±0.04, what sample
size is needed?
8.34 If you want to be 95% confident of estimating the population
proportion to within a sampling error of ±0.02 and there is
historical evidence that the population proportion is approximately
0.40, what sample size is needed?
APPLYING THE CONCEPTS
8.35 A survey is planned to determine the mean annual family
medical expenses of employees of a large company which
subsidises the health insurance of its staff. The management of
the company wishes to be 95% confident that the sample mean
is correct to within ±$50 of the population mean annual family
medical expenses. A previous study indicates that the standard
deviation is approximately $400.
a. How large a sample size is necessary?
b. If management wants to be correct to within ±$25, what
sample size is necessary?
8.36 If the manager of a paint supply store wants to estimate the
mean amount of paint in a 4-litre can to within ±0.015 litres
with 95% confidence and also assumes that the standard
deviation is 0.075 litres, what sample size is needed?
8.37 If a quality control manager wants to estimate the mean life of a
new type of LED light globe to within 1,000 hours with 95%
confidence and also assumes that the population standard
deviation is 5,000 hours, what sample size is needed?
8.38 The inspection division of a state department which regulates
trade measurement wants to estimate the mean amount of
soft-drink fill in 2-litre bottles to within ±0.01 litres with 95%
confidence. If it assumes that the standard deviation is 0.05
litres, what sample size is needed?
8.39 A consumer group wants to estimate the mean electric bill for
the month of July for single family homes in a large city. Based
on studies conducted in other cities, the standard deviation is
assumed to be $60. The group wants to estimate the mean bill
for July to within ±$15 with 99% confidence.
a. What sample size is needed?
b. If 95% confidence is desired, what sample size is necessary?
8.40 An advertising agency that serves a major radio station wants to
estimate the mean amount of time that the station’s audience
spends listening to the radio daily. From past studies, the
standard deviation is estimated as 45 minutes.
a. What sample size is needed if the agency wants to be 90%
confident of being correct to within ±5 minutes?
b. If 99% confidence is desired, what sample size is necessary?
8.41 Suppose that an energy company wants to estimate its mean
waiting time for natural gas installation to within ±5 days with
95% confidence. The company does not have access to
previous data, but suspects that the standard deviation is
approximately 20 days. What sample size is needed?
8.42 At a large South East Asian airport flights are classified as being
‘on time’ if they land less than 15 minutes after the scheduled
time. A study of airlines using the airport finds that one of the
airlines that services Australia has a record of 17% of flights
arriving late. Suppose you were asked to perform a follow-up
study for this airline in order to update the estimated proportion
of late arrivals. What sample size would you use to estimate the
population proportion to within a sampling error of:
a. ±0.06 with 95% confidence?
b. ±0.04 with 95% confidence?
c. ±0.02 with 95% confidence?
8.43 The Nielsen company regularly conducts research into
consumer purchases. Nielsen Homescan data for the
52 weeks ended 28 January 2017 showed that 34.5% of
Australian homes had purchased Asian vegetables in that
period. Households of 1–2 persons accounted for 47% of the
volume in Asian vegetable sales. (Nielsen Insights <http://
www.nielsen.com/au/en/insights/news/2017/green-eatersasian-vegetables-on-therise-in-australia.html> accessed
5 July 2017).
Consider a follow-up study focusing on the latest calendar year.
a. What sample size is needed to estimate the population
proportion of Australian households that have purchased
Asian vegetables to within ±0.02 with 95% confidence?
b. What sample size is needed to estimate the population
proportion of the volume of Asian vegetables that are
purchased by 1–2 person households to within ±0.02 with
95% confidence?
c. Compare the results of (a) and (b). Explain why these results
differ.
d. If you were to design a data collection method for a
follow-up study, would you use one sample and collect data
to answer both questions, or would you select two separate
samples? Explain the rationale behind your decision.
8.44 Suppose that a survey of the audience at a Sydney Symphony
Orchestra (SSO) concert has found that 48 out of 350 members
of the audience who participated in the survey are visitors to
Sydney.
a. Construct a 95% confidence interval for the population
proportion of audience members at SSO concerts who are
visitors to Sydney.
b. Interpret the interval constructed in (a).
c. To conduct a follow-up study that would provide 95%
confidence that the point estimate is correct to within ±0.03
of the population proportion, how large a sample size is
required?
d. To conduct a follow-up study that would provide 99%
confidence that the point estimate is correct to within ±0.03
of the population proportion, how large a sample size is
required?
8.45 A study conducted by the Australian Securities Exchange found
that 36% of 4,009 Australian adults surveyed in late 2014 held
shares, either directly or indirectly through unlisted managed
funds (Australian Securities Exchange, 2014 Australian Share
Ownership Study, <www.asx.com.au/documents/resources/
australian-share-ownership-study-2014.pdf> accessed 5 July
2017).
a. Construct a 95% confidence interval for the proportion of
Australian adults who held shares in late 2014.
b. Interpret the interval constructed in (a).
c. To conduct a follow-up study to estimate the population
proportion of adults who currently hold shares to within
±0.01 with 95% confidence, how many adults would you
interview?
8.5 APPLICATIONS OF CONFIDENCE INTERVAL ESTIMATION IN AUDITING
This chapter has focused on estimating either the population mean or the population proportion. Auditing is one area in business that makes widespread use of statistical sampling for the
purposes of estimation.
AUDITING
Auditing is the collection and evaluation of evidence about information relating to an economic entity such as a sole business proprietor, a partnership, a corporation or a government agency in order to determine and report on how well the information corresponds to
established criteria.
auditing
A process of checking the accuracy of financial records.
Six advantages of statistical sampling in auditing are:
1. Results are objective and defensible. Because the sample size is based on demonstrable
statistical principles, the audit is defensible before one’s superiors and in a court of law.
2. Statistical sampling provides an objective way of estimating the sample size in advance.
3. Statistical sampling provides an estimate of the sampling error.
4. Statistical sampling is often more accurate for drawing conclusions about large
populations. Examining large populations is time-consuming and therefore often subject
to more non-sampling error than a statistical sample.
5. Statistical sampling allows auditors to combine, and then evaluate collectively, samples
collected by different individuals.
6. Statistical sampling allows auditors to generalise their findings to the population with a
known sampling error.
Estimating the Population Total Amount
total amount
The sum of all values.
In auditing applications we are often more interested in developing estimates of the population total
amount than the population mean. Equation 8.6 shows how to estimate a population total amount.
LEARNING OBJECTIVE 4
Recognise how to use confidence interval estimates in auditing
ESTIMATING THE POPULATION TOTAL
The point estimate for the population total is equal to the population size N times the
sample mean.
$$\text{Total} = N\bar{X} \qquad (8.6)$$
Equation 8.7 defines the confidence interval estimate for the population total. The term √((N - n)/(N - 1)) is
included where sampling is from a finite population.
CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL
$$N\bar{X} \pm N(t_{n-1})\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \qquad (8.7)$$
To demonstrate the application of the confidence interval estimate for the population total
amount, we return to the Callistemon Camping Supplies scenario. One of the auditing tasks is
to estimate the total dollar amount of all sales invoices for the month. If there are 5,000 invoices
for that month and X̄ = $110.27, then, using Equation 8.6:
$$N\bar{X} = (5{,}000)(\$110.27) = \$551{,}350$$
If n = 100 and S = $28.95, then, using Equation 8.7 with t99 = 1.9842 for 95% confidence:
$$\begin{aligned}
N\bar{X} \pm N(t_{n-1})\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} &= 551{,}350 \pm (5{,}000)(1.9842)\frac{28.95}{\sqrt{100}}\sqrt{\frac{5{,}000-100}{5{,}000-1}} \\
&= 551{,}350 \pm 28{,}721.295(0.99005) \\
&= 551{,}350 \pm 28{,}436
\end{aligned}$$
$522,914 < population total < $579,786
Therefore, with 95% confidence, you estimate that the total amount of sales invoices is between
$522,914 and $579,786.
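The same interval can be reproduced with a short Python sketch of Equations 8.6 and 8.7. This is an added illustration, not the authors' method: the function name `total_confidence_interval` is hypothetical, and the critical value of t is passed in as an argument (here the 1.9842 quoted above) because the Python standard library does not provide the t distribution. Small differences from the hand calculation in the last digit arise because the text rounds the correction factor to 0.99005.

```python
import math


def total_confidence_interval(N, n, x_bar, s, t_crit):
    """Sketch of Equations 8.6 and 8.7 for a population total.

    N      -- population size
    n      -- sample size
    x_bar  -- sample mean
    s      -- sample standard deviation
    t_crit -- critical value of t with n - 1 degrees of freedom
              (looked up separately, e.g. from a t table)
    """
    point_estimate = N * x_bar                      # Equation 8.6
    correction = math.sqrt((N - n) / (N - 1))       # finite population correction
    margin = N * t_crit * (s / math.sqrt(n)) * correction
    return point_estimate - margin, point_estimate, point_estimate + margin


lower, point, upper = total_confidence_interval(
    N=5000, n=100, x_bar=110.27, s=28.95, t_crit=1.9842
)
print(f"point estimate = {point:,.2f}")
print(f"{lower:,.2f} < population total < {upper:,.2f}")
```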
Example 8.7 further illustrates the population total.
EXAMPLE 8.7
DEVELOPING A CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
An auditor is faced with a population of 1,000 vouchers and wants to estimate the total
value of the population of vouchers. A sample of 50 vouchers is selected with the following
results:
Mean voucher amount (X̄) = $1,076.39
Standard deviation (S) = $273.62
Construct a 95% confidence interval estimate of the total amount for the population of
vouchers.
SOLUTION
Using Equation 8.6, the point estimate of the population total is:
$$N\bar{X} = (1{,}000)(1{,}076.39) = \$1{,}076{,}390$$
From Equation 8.7, a 95% confidence interval estimate of the population total amount is:
$$\begin{aligned}
(1{,}000)(1{,}076.39) \pm (1{,}000)(2.0096)\frac{273.62}{\sqrt{50}}\sqrt{\frac{1{,}000-50}{1{,}000-1}} &= 1{,}076{,}390 \pm 77{,}762.902(0.97517) \\
&= 1{,}076{,}390 \pm 75{,}832
\end{aligned}$$
$1,000,558 < population total < $1,152,222
Therefore, with 95% confidence, you estimate that the total amount of the vouchers is
between $1,000,558 and $1,152,222.
Difference Estimation
difference estimation
A method of estimating the level of
discrepancy between book and
audit values for a population.
Auditors use difference estimation when they believe that errors exist in a set of items and they
want to estimate the magnitude of the errors based only on a sample. The following steps are
used in difference estimation:
1. Determine the sample size required.
2. Calculate the differences between the values reached during the audit and the original
values recorded. The difference in value i, denoted Di, is equal to 0 if the auditor finds
that the original value is correct, is a positive value when the audited value is larger than
the original value, and is negative when the audited value is smaller than the original
value.
3. Calculate the mean difference in the sample (D̄) by dividing the total difference by the
sample size, as shown in Equation 8.8.
MEAN DIFFERENCE
$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} \qquad (8.8)$$
where Di = audited value – original value
4. Calculate the standard deviation of the differences (SD), as shown in Equation 8.9.
Remember that any item that is not in error has a difference value of 0.
STANDARD DEVIATION OF THE DIFFERENCE
$$S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n-1}} \qquad (8.9)$$
5. Use Equation 8.10 to construct a confidence interval estimate of the total difference in the
population.
CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE
$$N\bar{D} \pm N(t_{n-1})\frac{S_D}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \qquad (8.10)$$
The auditing procedures for Callistemon Camping Supplies require a 95% confidence
interval estimate of the difference between the actual dollar amounts on the sales invoice and
the amounts entered into the integrated inventory and sales information system. Suppose that,
in a sample of 100 sales invoices, you have 12 invoices in which the actual amount on the sales
invoice and the amount entered into the integrated inventory and sales information system are
different. These 12 differences < PARTS_INV > are:
$9.03 $7.47 $17.32 $8.30 $5.21 $10.80 $6.22 $5.63 $4.97 $7.43 $2.99 $4.63
The other 88 invoices are not in error. Their differences are each 0. Thus:
$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = \frac{90}{100} = 0.90$$
and:
$$S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n-1}} = \sqrt{\frac{(9.03-0.9)^2 + (7.47-0.9)^2 + \dots + (0-0.9)^2}{100-1}}$$
(In the numerator, there are 100 differences. The last 88 are all (0 − 0.9)².)
$$S_D = 2.7518$$
Using Equation 8.10, construct the confidence interval estimate for the total difference in the
population of 5,000 sales invoices as follows:
$$(5{,}000)(0.90) \pm (5{,}000)(1.9842)\frac{2.7518}{\sqrt{100}}\sqrt{\frac{5{,}000-100}{5{,}000-1}} = 4{,}500 \pm 2{,}702.89$$
$1,797.11 < total difference < $7,202.89
Thus, the auditor estimates with 95% confidence that the total difference between the sales
invoices, as determined during the audit, and the amount originally entered into the accounting
system is between $1,797.11 and $7,202.89.
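The difference estimation steps for this example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `total_difference_interval` is hypothetical, the 88 invoices without errors are represented explicitly as differences of 0, and the critical value 1.9842 is supplied by hand because the Python standard library has no t distribution.

```python
import math
from statistics import mean, stdev


def total_difference_interval(differences, N, t_crit):
    """Sketch of Equations 8.8 to 8.10 (difference estimation).

    differences -- one difference per sampled item, including the zeros
                   for items that are not in error
    N           -- population size
    t_crit      -- critical value of t with n - 1 degrees of freedom
    """
    n = len(differences)
    d_bar = mean(differences)          # Equation 8.8
    s_d = stdev(differences)           # Equation 8.9 (divides by n - 1)
    correction = math.sqrt((N - n) / (N - 1))
    margin = N * t_crit * (s_d / math.sqrt(n)) * correction
    return d_bar, s_d, N * d_bar - margin, N * d_bar + margin


# The 12 non-zero differences from the sample of 100 sales invoices.
errors = [9.03, 7.47, 17.32, 8.30, 5.21, 10.80,
          6.22, 5.63, 4.97, 7.43, 2.99, 4.63]
# The other 88 invoices are not in error, so their differences are 0.
differences = errors + [0.0] * (100 - len(errors))

d_bar, s_d, lower, upper = total_difference_interval(differences, N=5000, t_crit=1.9842)
print(f"mean difference = {d_bar:.2f}, S_D = {s_d:.4f}")
print(f"{lower:,.2f} < total difference < {upper:,.2f}")
```

Padding the list with zeros matters: leaving out the items that are not in error would overstate both the mean difference and its standard deviation.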
In the previous example, all 12 differences are positive because the actual amount on the
sales invoice is more than the amount entered into the accounting system. In some circumstances you could have negative errors. Example 8.8 illustrates such an occurrence.
EXAMPLE 8.8
DIFFERENCE ESTIMATION
Returning to Example 8.7, suppose that 14 vouchers contain errors in the sample of 50
vouchers. The values < DIFF_TEST > of the 14 errors are as follows, in which two differences
are negative:
$75.41   $38.97   $108.54   –$37.18   $62.75   $118.32   –$88.84
$127.74  $55.42   $39.03    $29.41    $47.99   $28.73    $84.05
Construct a 95% confidence interval estimate of the total difference in the population of
1,000 vouchers.
SOLUTION
For these data:
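$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = \frac{690.34}{50} = 13.8068$$
and:
$$\begin{aligned}
S_D &= \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n-1}} \\
&= \sqrt{\frac{(75.41-13.8068)^2 + (38.97-13.8068)^2 + \dots + (0-13.8068)^2}{50-1}} \\
&= 37.427
\end{aligned}$$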
Using Equation 8.10, construct the confidence interval estimate for the total difference in the
population:
$$(1{,}000)(13.8068) \pm (1{,}000)(2.0096)\frac{37.427}{\sqrt{50}}\sqrt{\frac{1{,}000-50}{1{,}000-1}} = 13{,}806.8 \pm 10{,}372.63$$
$3,434.17 < total difference < $24,179.43
Therefore, with 95% confidence you estimate that the total difference in the population of
vouchers is between $3,434.17 and $24,179.43.
One-Sided Confidence Interval Estimation of the Rate of Non-Compliance with Internal Controls
LEARNING OBJECTIVE 4
Recognise how to use confidence interval estimates in auditing
one-sided confidence interval
Gives only an upper or lower bound to the value of the population parameter.
Organisations use internal control mechanisms to ensure that individuals act in accordance with
company guidelines. For example, Callistemon Camping Supplies requires that an authorised
delivery docket is completed before goods are removed from the warehouse. During the monthly
audit of the company, the auditing team is charged with the task of estimating the proportion of
times goods were removed without proper authorisation. This is referred to as the rate of non-compliance with the internal control. To estimate the rate of non-compliance, auditors take a
random sample of sales invoices and determine how often merchandise was shipped without an
authorised delivery docket. The auditors then compare their results with a previously established
tolerable exception rate, which is the maximum allowable proportion of items in the population
not in compliance. When estimating the rate of non-compliance, it is commonplace to use a
one-sided confidence interval. That is, the auditors estimate an upper bound on the rate of non-compliance. Equation 8.11 defines a one-sided confidence interval for a proportion.
ONE-SIDED CONFIDENCE INTERVAL FOR A PROPORTION
$$\text{Upper bound} = p + Z\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}} \qquad (8.11)$$
where Z = the value corresponding to a cumulative area of (1 - α) from the standardised
normal distribution – that is, a right-hand tail probability of α.
If the tolerable exception rate is higher than the upper bound, then the auditor concludes
that the company is in compliance with the internal control. If the upper bound is higher than
the tolerable exception rate, the auditor concludes that the control non-compliance rate is too
high. The auditor may then request a larger sample.
Suppose that, in the monthly audit, you select 400 of the sales invoices from a population
of 10,000 invoices. In the sample of 400 sales invoices, 20 are in violation of the internal control. If the tolerable exception rate for this internal control is 6%, what should you conclude?
Use a 95% level of confidence.
The one-sided confidence interval is calculated using p = 20/400 = 0.05 and Z = 1.645.
Using Equation 8.11:
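$$\begin{aligned}
\text{Upper bound} &= p + Z\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}} \\
&= 0.05 + 1.645\sqrt{\frac{0.05(1-0.05)}{400}}\sqrt{\frac{10{,}000-400}{10{,}000-1}} \\
&= 0.05 + 1.645(0.0109)(0.98) = 0.05 + 0.0176 = 0.0676
\end{aligned}$$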
Thus, you have 95% confidence that the rate of non-compliance is less than 6.76%. Because the
tolerable exception rate is 6%, the rate of non-compliance may be too high for this internal
control. In other words, it is possible that the non-compliance rate for the population is higher
than the rate deemed tolerable. Therefore, you should request a larger sample.
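The upper bound and the comparison with the tolerable exception rate can be sketched in Python. This is an added illustration rather than part of the original text: the function name `upper_bound_noncompliance` is hypothetical, and Z is taken from `statistics.NormalDist` as the value with cumulative area equal to the confidence level, which is approximately the 1.645 used above.

```python
import math
from statistics import NormalDist


def upper_bound_noncompliance(violations, n, N, confidence):
    """Sketch of Equation 8.11: one-sided upper bound for a proportion.

    violations -- number of sampled items not in compliance
    n          -- sample size
    N          -- population size
    confidence -- confidence level as a fraction, e.g. 0.95
    """
    p = violations / n
    # One-sided critical value: cumulative area of (1 - alpha).
    z = NormalDist().inv_cdf(confidence)
    standard_error = math.sqrt(p * (1 - p) / n)
    correction = math.sqrt((N - n) / (N - 1))
    return p + z * standard_error * correction


tolerable_exception_rate = 0.06
bound = upper_bound_noncompliance(violations=20, n=400, N=10_000, confidence=0.95)

print(f"upper bound = {bound:.4f}")
if bound <= tolerable_exception_rate:
    print("Upper bound is within the tolerable exception rate.")
else:
    print("Upper bound exceeds the tolerable exception rate; consider a larger sample.")
```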
In many cases, the auditor is able to conclude that the rate of non-compliance with the
company’s internal controls is acceptable. Example 8.9 illustrates such an occurrence.
EXAMPLE 8.9
ESTIMATING THE RATE OF NON-COMPLIANCE
A large electronics firm makes one million direct debit payments a year. An internal control
policy requires that each payment is made only after an invoice has been authorised by an
accounts payable supervisor. The company’s tolerable exception rate for this control is 4%.
If control deviations are found in 8 of the 400 invoices sampled, what should the auditor do?
Use a 95% level of confidence.
SOLUTION
The auditor constructs a 95% one-sided confidence interval for the proportion of invoices in
non-compliance and compares this with the tolerable exception rate. Using Equation 8.11,
p = 8/400 = 0.02 and Z = 1.645 for 95% confidence:
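$$\begin{aligned}
\text{Upper bound} &= p + Z\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}} \\
&= 0.02 + 1.645\sqrt{\frac{0.02(1-0.02)}{400}}\sqrt{\frac{1{,}000{,}000-400}{1{,}000{,}000-1}} \\
&= 0.02 + 1.645(0.007)(0.9998) = 0.02 + 0.0115 = 0.0315
\end{aligned}$$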
The auditor concludes with 95% confidence that the rate of non-compliance is less than 3.15%.
Since this is less than the tolerable exception rate, the auditor concludes that the internal control
compliance is adequate. In other words, the auditor is more than 95% confident that the rate of
non-compliance is less than 4%.
Problems for Section 8.5
LEARNING THE BASICS
8.46 A sample of 25 is selected from a population of 500 items. The
sample mean is 25.7 and the sample standard deviation is 7.8.
Construct a 99% confidence interval estimate of the population
total.
8.47 Suppose that a sample of 200 is selected from a population of
10,000 items. Ten items are found to have errors of the
following amounts:
13.76 42.87 34.65 11.09 14.54
22.87 25.52 9.81 10.03 15.49
Construct a 95% confidence interval estimate of the total
difference in the population. < ITEM_ERR >
8.48 If p = 0.04, n = 300 and N = 5,000, calculate the upper bound
for a one-sided confidence interval estimate of the population
proportion, π, using a level of confidence of:
a. 90%
b. 95%
c. 99%
APPLYING THE CONCEPTS
8.49 A stationery store wants to estimate the total retail value of the
300 greeting cards it has in its inventory. Construct a 95%
confidence interval estimate of the population total value of all
greeting cards that are in the inventory if a random sample of
20 greeting cards indicates an average value of $5.45 and a
standard deviation of $0.82.
8.50 The personnel department of a large corporation employing
3,000 workers wants to estimate the family dental expenses of
its employees to determine the feasibility of providing a dental
insurance plan. A random sample of 10 employees reveals the
following family dental expenses (in dollars) for the preceding
year: < DENTAL >
1,110 362 2,320 1,930 3,210 208 1,730 825 616 1,179
Construct a 90% confidence interval estimate of the total family
dental expenses for all employees in the preceding year.
8.51 A branch of a chain of large electronics stores is conducting an
end-of-month inventory of the merchandise in stock. There are
1,546 items in inventory at the time. A sample of 50 items is
randomly selected and an audit conducted, with the following
results:
Value of merchandise
X̄ = $252.28
S = $93.67
Construct a 95% confidence interval estimate of the total value
of the merchandise in the inventory at the end of the month.
8.52 When a trade discount is allowed by wholesalers for particular
types of early payments by customers, the Goods and Services
Tax (GST) payable to the Australian Tax Office needs to be
adjusted. A sample of 150 items selected from a population of
4,000 invoices at the end of a period of time revealed that in 13
cases staff failed to adjust the GST amount correctly. The
amounts (in dollars) of the 13 amounts by which GST was
overcharged are: < DISCOUNT >
6.45 15.32 97.36 230.63 104.18 84.92 132.76
66.12 26.55 129.43 88.32 47.81 89.01
Construct a 99% confidence interval estimate of the population
total amount of GST overcharged.
8.53 Econe Pty Ltd is a small company that manufactures women’s
dresses for sale to specialty stores. There are 1,200 inventory
items, and the historical cost is recorded on a first in, first out
(FIFO) basis. In the past, approximately 15% of the inventory
items were incorrectly priced. However, any misstatements
were usually not significant. A sample of 120 items was
selected and the historical cost of each item compared with the
audited value. The results indicated that 15 items differed in
their historical cost and audited value. These differences were
as follows: < FIFO >

Sample number   Historical cost ($)   Audited value ($)
5               261                   240
9               87                    105
17              201                   276
18              121                   110
28              315                   298
35              411                   356
43              249                   211
51              216                   305
60              21                    210
73              140                   152
86              129                   112
95              340                   216
96              341                   402
107             135                   97
119             228                   220

Construct a 95% confidence interval estimate of the total
population difference in the historical cost and audited
value.
8.54 The Snowy Ski Centre Pty Ltd conducts an annual audit of its
financial records. An internal control policy for the company
is that a cheque can be issued only after the accounts
payable manager initials the invoice. The tolerable exception
rate for this internal control is 0.04. During an audit, a
sample of 300 invoices is examined from a population of
10,000 invoices and 11 invoices are found to violate the
internal control.
a. Calculate the upper bound for a 95% one-sided confidence
interval estimate for the rate of non-compliance.
b. Based on (a), what should the auditor conclude?
8.6 MORE ON CONFIDENCE INTERVAL ESTIMATION AND ETHICAL ISSUES
You should be aware that when sampling is done without replacement from a finite population,
an adjustment to the standard error of the mean or standard error of the proportion is required.
This has been included in equations 8.7, 8.10 and 8.11, where standard errors have been multiplied by the correction factor square root of (N - n)/(N - 1). The correction factor is used in
confidence intervals for the population mean and proportion when the sample size, n, is large in
relation to the population size, N (i.e. more than 5%).
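To see how the correction factor behaves, the following Python sketch evaluates it for the sample and population sizes used in the worked examples of Section 8.5. It is an added illustration, and the function name `finite_population_correction` is hypothetical. The factor moves closer to 1 as the sample becomes a smaller fraction of the population.

```python
import math


def finite_population_correction(N, n):
    """Correction factor sqrt((N - n) / (N - 1)) for sampling without replacement."""
    return math.sqrt((N - n) / (N - 1))


# (population size, sample size) pairs taken from the examples in Section 8.5.
cases = [(5_000, 100), (1_000, 50), (10_000, 400), (1_000_000, 400)]

for N, n in cases:
    factor = finite_population_correction(N, n)
    print(f"N = {N:>9,}  n = {n:>3}  n/N = {n / N:6.2%}  correction factor = {factor:.5f}")
```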
Ethical issues relating to the selection of samples and the inferences that accompany them
can arise in several ways. The major ethical issue relates to whether or not confidence interval
estimates are provided together with the sample statistics. Providing a sample statistic without
also including the confidence interval limits (typically set at 95%), the sample size used and an
interpretation of the meaning of the confidence interval in terms that a layperson can understand
raises ethical issues. Failure to include a confidence interval estimate
might mislead the user of the results into thinking that the point estimate is all that is needed to
predict the population characteristic with certainty. Thus, it is important that you indicate the
interval estimate in a prominent place in any written communication, together with a simple
explanation of the meaning of the confidence interval. In addition, you should highlight the size
of the sample.
Ethical issues concerning estimation most commonly occur in the publication of the results
of political polls. Often the results of the polls are highlighted in a prominent part of the newspaper, while the sampling error involved and the methodology used are printed on the page
where the article is continued, frequently in the middle of the newspaper in print editions or
with a separate link in online ones. To ensure an ethical presentation of statistical results, the
confidence levels, sample size and confidence limits should be made available for all surveys
and other statistical studies.
think about this
Reporting poll results
Let’s imagine that a newspaper reports the following table in both its print and online editions.

State premier’s performance   July–Sept   Oct–Dec    Jan–Mar    Mar–Jun    July–Sept
                              2016 (%)    2016 (%)   2017 (%)   2017 (%)   2017 (%)
Satisfied                     52          50         48         42         33
Dissatisfied                  33          33         41         46         57
Uncommitted                   15          17         11         12         10

In the print edition it shows this extra information immediately below the table. In the online edition
readers need to click on a link to see it.
Question: Are you satisfied or dissatisfied with the way the current state premier is performing?
This poll was carried out by a phone interview of the state’s voters, with the number in each poll
being a constant percentage of the estimated number of voters. The latest survey interviewed
1,560 voters.
Do you think the variation in display methods between the print and online editions will alter the way
readers interpret the poll results? What other information is necessary for you to be able to evaluate the
poll results effectively?
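One piece of information a reader might want is the sampling error attached to each percentage. The sketch below estimates a 95% margin for the latest "Satisfied" figure (33% of 1,560 voters) using the confidence interval for the proportion. It assumes a simple random sample, which the poll description does not confirm, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def poll_margin(p_percent: float, n: int, confidence: float = 0.95) -> float:
    """Approximate sampling error, in percentage points, for a poll percentage.

    Assumes simple random sampling (an assumption, not stated by the poll).
    """
    p = p_percent / 100
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 100 * z * math.sqrt(p * (1 - p) / n)


margin = poll_margin(33, 1560)
print(f"Satisfied: 33% +/- {margin:.1f} percentage points")
```

Under these assumptions the margin is roughly 2.3 percentage points. The sample sizes of the earlier polls are not reported, so the same calculation cannot be repeated for them without further information.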
Assess your progress
Summary
This chapter has discussed confidence intervals for estimating the
characteristics of a population, and explained how to determine the
necessary sample size. We showed how an accountant of
Callistemon Camping Supplies can use the sample data from an
audit to estimate important population parameters such as the
mean dollar amount on invoices and the proportion of shipments
that are made without proper authorisation.
To determine which equation to use for a particular situation,
you need to ask several questions:
• Are you developing a confidence interval or are you determining sample size?
• Do you have a numerical variable or do you have a categorical
variable?
• If you have a numerical variable, do you know the population
standard deviation? If you do, use the normal distribution. If you
do not, use the t distribution.
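The questions above can be sketched as a small lookup that returns the matching equation number from the Key formulas list (8.1 to 8.5). The function name and the string labels used for its arguments are hypothetical.

```python
def choose_formula(task: str, variable: str, sigma_known: bool = False) -> str:
    """Return the chapter equation number that fits the situation.

    task:      "interval" or "sample_size"
    variable:  "numerical" or "categorical"
    sigma_known applies only to intervals for a numerical variable.
    """
    if variable not in ("numerical", "categorical"):
        raise ValueError("variable must be 'numerical' or 'categorical'")
    if task == "interval":
        if variable == "categorical":
            return "8.3"  # confidence interval for the proportion
        # numerical variable: normal distribution if sigma known, else t
        return "8.1" if sigma_known else "8.2"
    if task == "sample_size":
        return "8.5" if variable == "categorical" else "8.4"
    raise ValueError("task must be 'interval' or 'sample_size'")


print(choose_formula("interval", "numerical", sigma_known=False))
print(choose_formula("sample_size", "categorical"))
```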
The next three chapters develop a hypothesis-testing approach
that makes decisions about population parameters.
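Key formulas
Confidence interval for the mean (σ known)
$\bar{X} \pm Z \dfrac{\sigma}{\sqrt{n}}$
or
$\bar{X} - Z \dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z \dfrac{\sigma}{\sqrt{n}}$ (8.1)
Confidence interval for the mean (σ unknown)
$\bar{X} \pm t_{n-1} \dfrac{S}{\sqrt{n}}$
or
$\bar{X} - t_{n-1} \dfrac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1} \dfrac{S}{\sqrt{n}}$ (8.2)
Confidence interval estimate for the proportion
$p \pm Z \sqrt{\dfrac{p(1 - p)}{n}}$
or
$p - Z \sqrt{\dfrac{p(1 - p)}{n}} \le \pi \le p + Z \sqrt{\dfrac{p(1 - p)}{n}}$ (8.3)
Sample size determination for the mean
$n = \dfrac{Z^2 \sigma^2}{e^2}$ (8.4)
Sample size determination for the proportion
$n = \dfrac{Z^2 \pi (1 - \pi)}{e^2}$ (8.5)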
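The key formulas translate directly into code. The following is a minimal Python sketch using only the standard library; all function names are hypothetical. Two assumptions are worth noting: the Z value is computed to full precision, so results may differ slightly from hand calculations that round Z to 1.96, and the t critical value is passed in by the caller (for example, from a t table) because the standard library has no inverse t distribution.

```python
import math
from statistics import NormalDist


def z_value(confidence: float) -> float:
    """Two-sided critical value Z for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def ci_mean_sigma_known(xbar, sigma, n, confidence=0.95):
    """Equation 8.1: X-bar +/- Z * sigma / sqrt(n)."""
    e = z_value(confidence) * sigma / math.sqrt(n)
    return xbar - e, xbar + e


def ci_mean_sigma_unknown(xbar, s, n, t_crit):
    """Equation 8.2: X-bar +/- t * S / sqrt(n); t_crit has n - 1 df."""
    e = t_crit * s / math.sqrt(n)
    return xbar - e, xbar + e


def ci_proportion(successes, n, confidence=0.95):
    """Equation 8.3: p +/- Z * sqrt(p(1 - p) / n)."""
    p = successes / n
    e = z_value(confidence) * math.sqrt(p * (1 - p) / n)
    return p - e, p + e


def sample_size_mean(sigma, e, confidence=0.95):
    """Equation 8.4, rounded up to the next whole observation."""
    return math.ceil(z_value(confidence) ** 2 * sigma ** 2 / e ** 2)


def sample_size_proportion(pi, e, confidence=0.95):
    """Equation 8.5, rounded up; use pi = 0.5 if no estimate is available."""
    return math.ceil(z_value(confidence) ** 2 * pi * (1 - pi) / e ** 2)


# Hypothetical inputs, for illustration only
print(ci_mean_sigma_known(xbar=100, sigma=15, n=36))
print(ci_mean_sigma_unknown(xbar=50, s=10, n=25, t_crit=2.0639))
print(ci_proportion(successes=30, n=100))
print(sample_size_mean(sigma=10, e=2))
print(sample_size_proportion(pi=0.5, e=0.05))  # 385
```

Sample sizes are rounded up rather than to the nearest integer so that the stated sampling error is not exceeded.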
Key terms
auditing
confidence interval estimate
critical value
deductive reasoning
degrees of freedom
difference estimation
inductive reasoning
level of confidence
one-sided confidence interval
point estimate
sampling error
Student’s t distribution
total amount
References
1. Statprob: The Encyclopedia Sponsored by Statistics and Probability Societies, at <http://statprob.com/encyclopedia/williamsealygosset.html> accessed April 2014.
2. Daniel, W. W., Applied Nonparametric Statistics, 2nd edn (Boston, MA: PWS Kent, 1990).
3. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).
4. Fisher, R. A. & F. Yates, Statistical Tables for Biological, Agricultural and Medical Research, 5th edn (Edinburgh: Oliver & Boyd, 1957).
5. Snedecor, G. W. & W. G. Cochran, Statistical Methods, 8th edn (Ames, IA: Iowa State University Press, 1989).
Chapter review problems
CHECKING YOUR UNDERSTANDING
8.55 Why is it that you can never really have 100% confidence of
correctly estimating the population characteristic of interest?
8.56 When do you use the t distribution to develop the confidence
interval estimate for the mean?
8.57 Why is it true that, for a given sample size n, an increase in
confidence is achieved by widening (and making less precise)
the confidence interval?
8.58 Under what circumstances do you use a one-sided confidence
interval instead of a two-sided confidence interval?
8.59 When would you want to estimate the population total instead
of the population mean?
8.60 How does difference estimation differ from estimating the mean?
APPLYING THE CONCEPTS
You can solve problems 8.61 to 8.75 with or without a computer. You should use
Microsoft Excel or another program to solve problems 8.76 to 8.80.
8.61 A trade union of medical workers conducted a survey through
its website about preferred working hours. Hospital workers
visiting the website were given the opportunity to fill out an
on-screen survey form. A total of 665 workers responded to a
question that asked whether they would prefer a five-day
working week with eight-hour shifts, or seven 12-hour shifts
per fortnight. Twelve-hour shifts were the preference for 412 of
the respondents.
a. Define the population from which this sample was drawn.
b. Is this a random sample from this population?
c. Is this a statistically valid study?
d. Describe how you would design a statistically valid study to
investigate the proportion of hospital workers who would
prefer 12-hour shifts rather than a five-day working week.
Use the information above to determine the sample size
needed to estimate this population proportion to within
±0.02 with 95% confidence.
8.62 In 2014–15 the Australian Bureau of Statistics conducted a
multipurpose household survey and had responses from 13,686
individuals in private dwellings on their use of information
technology. Assume there were 477 15–17 year old respondents
and 2,256 45–54 year old respondents who used the Internet to
purchase goods or services online. For 45–54 year olds, 49.2% of
online purchases were on travel, accommodation or related
services. By comparison, for 15–17 year olds, 60% of purchases
were of music, movies, electronic games or books (Australian
Bureau of Statistics, Household Use of Information Technology,
Australia, 2014–15, Cat. No. 8146.0, 2016).
a. Construct a 95% confidence interval for the population
proportion of all Australian Internet purchasers aged 45–54
who bought travel, accommodation or related services online
in 2014–15.
b. Construct a 95% confidence interval for the population
proportion of all Australian Internet purchasers aged 15–17
who bought music, movies, electronic games or books
online in 2014–15.
c. Construct a 99% confidence interval for the population
proportion of all Australian Internet purchasers aged 15–17
who bought music, movies, electronic games or books
online in 2014–15.
8.63 The KPMG 2016 report, Global Profiles of the Fraudster, gives
details of a survey of investigations between March 2013 and
August 2015 relating to frauds committed by 750 people
worldwide. Where fraudsters were working in collaboration
with others, the most common means of detection were tip-offs and complaints (31%), but fraudsters acting alone were
most often detected by management review (25%). (Global
Profiles of the Fraudster: Technology and Weak Controls
<https://assets.kpmg.com/content/dam/kpmg/pdf/2016/06/profiles-of-the-fraudster-au.pdf> accessed 6 July 2017).
Suppose the percentages above are based on 210 single-person frauds and 240 frauds where there was collusion.
a. Find a 95% confidence interval for the proportion of all
single person fraud incidents that are detected by
management review.
b. Find a 95% confidence interval for the proportion of all
fraud collusion incidents that are detected due to tip-offs
and complaints.
8.64 The Legal Services Council conducted a consumer survey in
2017 which asked respondents about different attitudes and
experiences relating to legal costs. One question asked: ‘How
well did you understand what the costs were likely to be?’
(Legal Services Council Consumer Survey 2017 <www.legalservicescouncil.org.au/Documents/consultation/LSC_Consumer_Survey_Report.pdf> accessed 6 July 2017). There
were 1402 replies to this question. The percentages of those
who replied to each category were:
• Understood well: 21%
• Understood adequately: 33%
• Understood a little: 34%
• I did not understand: 12%
Construct 95% confidence interval estimates for each of
these categories. What conclusions can you reach about
consumers’ understanding of legal costs from these results?
8.65 A study by the Australian Bureau of Statistics looked at health
and activity habits of various groups of Australians. It found
that only 43.3% of males aged 35–44 had participated in
sufficient physical activity in the last week for health
purposes. It also found that males of all ages spent an
average of 12.9 hours in the last week sitting to watch
television or videos (Australian Bureau of Statistics, Australian
Health Survey: Physical Activity, 2011–12, Cat. No.
4364.0.55.004, 2013).
Assume you have two samples with 500 males aged 35–44
and 2000 males of all ages. Assume the values given above
apply and S = 2.8 hours sitting time.
a. Construct a 95% confidence interval estimate for the mean
time males sit per week to watch television or videos.
b. Construct a 95% confidence interval estimate for the
population proportion of 35–44-year-old males who
participate in sufficient activity per week for health.
If you want to take another survey in future, answer the
following questions:
c. What sample size is required to be 95% confident of estimating
the population mean to within ±2 hours assuming that the
population standard deviation is equal to 3 hours?
d. What sample size is needed to be 95% confident of being
within ±0.035 of the population proportion of 35–44 year
old males who participate in sufficient activity if no previous
estimate is available?
8.66 A researcher for a state government agriculture department
wants to study various characteristics of medium-sized farms
in the state. A random sample of 70 farms of between 100 and
600 hectares reveals the following:
• average area X̄ = 350 hectares, standard deviation
S = 70 hectares
• 21 farms are engaged primarily in beef cattle production
a. Construct a 99% confidence interval estimate of the
population mean area of medium-sized farms.
b. Construct a 95% confidence interval estimate of the
population proportion of medium-sized farms which are
primarily beef cattle producers.
8.67 The personnel manager of a large corporation wishes to study
absenteeism among clerical workers at the corporation’s
central office during the year. A random sample of 25 clerical
workers reveals the following:
• absenteeism: X̄ = 9.7 days, S = 4.0 days
• 12 clerical workers were absent for more than 10 days
a. Construct a 95% confidence interval estimate of the mean
number of absences for clerical workers last year.
b. Construct a 95% confidence interval estimate of the
population proportion of clerical workers absent for more
than 10 days last year.
If the personnel manager also wishes to take a survey in a
branch office, answer these questions:
c. What sample size is needed to have 95% confidence in
estimating the population mean to within ±1.5 days if the
population standard deviation is 4.5 days?
d. What sample size is needed to have 90% confidence in
estimating the population proportion to within ±0.075 if no
previous estimate is available?
e. Based on (c) and (d), what sample size is needed if a single
survey is being conducted?
8.68 The market research manager for Dalton’s department store
wants to study women’s spending on cosmetics. A survey is
designed to estimate the proportion of women who purchase
their cosmetics primarily from Dalton’s department store, and
the mean yearly amount that women spend on cosmetics. A
previous survey found that the standard deviation of the amount
women spend on cosmetics in a year is approximately $64.70.
a. What sample size is needed to have 99% confidence of
estimating the population mean to within ±$5?
b. What sample size is needed to have 90% confidence of
estimating the population proportion to within ±0.045?
c. Based on the results in (a) and (b), how many of the store’s
female customers should be sampled? Explain.
8.69 A survey of Internet shopping for goods looked at how much
shoppers spent on online purchases of clothing, footwear and
accessories in the past year. The results from a sample of 270
customers are as follows:
• amount spent: X̄ = $528.90, S = $113.90
• 108 customers stated that they made the majority of
purchases at overseas sites
a. Construct a 95% confidence interval estimate of the
population mean amount spent on Internet purchases of
clothing, footwear and accessories in the past year.
b. Construct a 90% confidence interval estimate of the
population proportion of customers who have made the
majority of purchases on overseas sites.
Assume that you wish to run a similar survey for the coming year.
c. What sample size is needed to have 95% confidence of
estimating the population mean amount spent on online
purchases of clothing, footwear and accessories to within
±$1.20 if the standard deviation is assumed to be $10?
d. What sample size is needed to have 90% confidence of
estimating the population proportion that will make the
majority of purchases on overseas sites to within ±0.04?
e. Based on your answers to (c) and (d), how large a sample
should be taken?
8.70 The branch manager of an outlet (store 1) of a nationwide chain of
pet supply stores wants to study the characteristics of her
customers. In particular, she decides to focus on two variables: the
amount of money spent by customers and whether the customers
own only one dog, only one cat, or more than one dog and/or cat.
The results from a sample of 70 customers are shown below:
• amount of money spent: X̄ = $21.34, S = $9.22
• 37 customers own only a dog
• 26 customers own only a cat
• 7 customers own more than one dog and/or cat
a. Construct a 95% confidence interval estimate of the
population mean amount spent in the pet supply store.
b. Construct a 90% confidence interval estimate of the
population proportion of customers who own only a cat.
The branch manager of another outlet (store 2) wishes to
conduct a similar survey in his store. The manager does not have
any access to the information generated by the manager of store 1.
c. What sample size is needed to have 95% confidence of
estimating the population mean amount spent in his store
to within ±$1.50 if the standard deviation is $10?
d. What sample size is needed to have 90% confidence of
estimating the population proportion of customers who own
only a cat to within ±0.045?
e. Based on your answers to (c) and (d), how large a sample
should the manager take?
8.71 The owner of a restaurant that serves continental food wants
to study the characteristics of his customers. He decides to
focus on two variables: the amount of money spent per diner
on food and whether diners order dessert. The results from a
sample of 60 diners are as follows:
• amount spent: X̄ = $47.20, S = $8.60
• number of diners who purchased dessert: 18
a. Construct a 95% confidence interval estimate of the
population mean amount spent per diner on food.
b. Construct a 90% confidence interval estimate of the
population proportion of diners who purchase dessert.
The owner of a competing restaurant wants to conduct a
similar survey in her restaurant. This owner does not have
access to the information generated by the owner of the first
restaurant.
c. What sample size is needed to have 95% confidence of
estimating the population mean amount spent by each diner
on food in her restaurant to within ±$1.50, assuming the
standard deviation is $9?
d. What sample size is needed to have 90% confidence of
estimating the population proportion of diners who
purchase dessert to within ±0.04?
e. Based on your answers to (c) and (d), how large a sample
should the owner take?
8.72 The manufacturer of Tuffstuff concrete pavers claims its
products have a breaking strength of 5 kN. A representative of
a building advisory organisation is interested in assessing this
claim and sends a number of pavers to be tested in a
laboratory. The representative wants to know with 95%
confidence, within ±0.05, what proportion of pavers perform
the job as claimed by the manufacturer.
a. How many pavers does the laboratory need to test? What
assumption should be made about the population proportion?
The laboratory tests 50 pavers, and 42 have the breaking
strength claimed.
b. Construct a 95% confidence interval estimate for the
population proportion that have the breaking strength claimed.
c. How can the representative use the results of (b) to advise
the public about the product?
8.73 An auditor needs to estimate the percentage of times a
company fails to follow an internal control procedure. A sample
of 50 from a population of 1,000 items is selected, and in 7
instances the internal control procedure was not followed.
a. Construct a 90% one-sided confidence interval estimate of
the population proportion of items in which the internal
control procedure was not followed.
b. If the tolerable exception rate is 0.15, what should the
auditor conclude?
8.74 An auditor for a government agency needs to evaluate payments
that were made by Medicare for consultations in doctors’
surgeries in a particular postcode area during June. A total of
25,056 visits occurred during June in this area. The auditor
wants to estimate the total amount paid by Medicare to within
± $10 with 95% confidence. On the basis of past experience,
she believes that the standard deviation is approximately $60.
a. What sample size should she select?
Using the sample size selected in (a), an audit is conducted.
It is discovered that for 12 of the surgery consultations an
incorrect amount of reimbursement was provided.
Amount of reimbursement
X̄ = $98.70
S = $44.55
For the 12 surgery consultations for which incorrect
reimbursement was provided, the differences between the
amount reimbursed and the amount that the auditor
determined should have been reimbursed were: < MEDICARE >
$17 $25 $14 -$10 $20 $40 $35 $30 $28 $22 $15 $5
b. Construct a 90% confidence interval estimate of the
population proportion of reimbursements that contain errors.
c. Construct a 95% confidence interval estimate of the
population mean reimbursement per surgery consultation.
d. Construct a 95% confidence interval estimate of the
population total amount of reimbursements for this
postcode area for consultations in June.
e. Construct a 95% confidence interval estimate of the total
difference between the amount reimbursed and the amount
that should have been reimbursed.
8.75 A large computer store is conducting an end-of-month inventory of
the tablet computers in stock. An auditor for the store wants to
estimate the mean value of the tablets in stock at that time. He
wants to have 99% confidence that his estimate of the mean value
is correct to within ±$23. On the basis of past experience, he
estimates that the standard deviation of the value of a tablet is $45.
a. What sample size should he select?
b. Using the sample size selected in (a), an audit is conducted
with the following results:
X̄ = $575
S = $72.20
Construct a 99% confidence interval estimate of the total value
of the tablets in stock at the end of the month if there were
258 tablets listed in the inventory.
8.76 A quality characteristic of interest for a tea-bag-filling process
is the weight of the tea in the individual bags. In this example,
the label weight on the package indicates that the mean
amount of tea in a bag is 5.5 g. If the bags are underfilled, two
problems arise. First, customers may not be able to brew the
tea to be as strong as they wish. Second, the company may be
in violation of the law because of misleading labelling. On the
other hand, if the mean amount of tea in a bag exceeds the
label weight, the company is giving away product. Getting an
exact amount of tea in a bag is problematic because of
variation in the temperature and humidity inside the factory,
differences in the density of the tea, and the extremely fast
filling operation of the machine (approximately 170 bags a
minute). The following table provides the weight in grams of a
sample of 50 tea-bags produced in one hour by a single
machine. < TEABAGS >
Weight of tea-bags in grams
5.65 5.44 5.42 5.40 5.53 5.34 5.54 5.45 5.52 5.41
5.57 5.40 5.53 5.54 5.55 5.62 5.56 5.46 5.44 5.51
5.47 5.40 5.47 5.61 5.53 5.32 5.67 5.29 5.49 5.55
5.77 5.57 5.42 5.58 5.58 5.50 5.32 5.50 5.53 5.58
5.61 5.45 5.44 5.25 5.56 5.63 5.50 5.57 5.67 5.36
a. Construct a 99% confidence interval estimate of the
population mean weight of the tea-bags.
b. Is the company meeting the requirement set forth on the
label that the mean amount of tea in a bag is 5.5 g?
8.77 A manufacturing company produces steel housings for electrical
equipment. The main component of the housing is a steel
trough made out of a 2-mm steel coil. It is produced using a
250-tonne progressive punch press with a wipe-down operation
that puts two 90-degree forms in the flat steel to make the
trough. The distance from one side of the form to the other is
critical because of weatherproofing in outdoor applications. The
data from a sample of 49 troughs follow: < TROUGH >
Width of trough (in mm)
203.12 204.22 204.98 204.29 204.10 204.27 203.43
204.76 204.47 204.58 204.05 204.20 203.17 203.82
204.36 204.62 203.23 204.98 203.83 204.84 204.13
204.60 204.20 204.09 203.48 204.03 204.89 204.44
203.96 204.10 204.14 204.14 204.29 204.47 203.51
204.19 204.81 204.60 204.05 203.73 203.85 204.15
204.12 204.39 204.81 204.65 204.79 204.20 204.11
a. Construct a 95% confidence interval estimate of the mean
width of the troughs.
b. Interpret the interval developed in (a).
8.78 A busy landscaping supplies company sells wood chips for
garden mulch. The mulch is sold by the cubic metre and
delivered to households in a small truck. Each truckload is
expected to be 4 cubic metres. The company decides to
conduct an audit of actual load volumes by smoothing and
measuring samples of loads for a two-week period. The data
file < MULCH > contains the volume (in cubic metres) from a
sample of 368 truckloads of cypress pine mulch and from a
sample of 330 truckloads of cedar wood chips.
a. For the cypress pine wood chips, construct a 95%
confidence interval estimate of the mean volume.
b. For the cedar wood chips, construct a 95% confidence
interval estimate of the mean volume.
c. Evaluate whether the assumption needed for (a) and (b) has
been seriously violated.
d. Based on the results of (a) and (b), what conclusions can
you reach concerning the mean volume of the cypress pine
and cedar wood chips?
8.79 The manufacturer of ‘Bondi’ and ‘Vincentia’ terracotta roof
shingles provides its customers with a 50-year warranty on the
product. To determine whether a shingle will last as long as the
warranty period, accelerated life testing is conducted at the
manufacturing plant. Accelerated life testing exposes the
shingle to the stresses it would be subject to in a lifetime of
normal use via a laboratory experiment that takes only a few
hours to conduct. In this test, a shingle is repeatedly scraped
with an abrasive and the particles that are removed are
weighed (in grams). Shingles that experience small amounts of
particle loss are expected to last longer in normal use than
shingles that experience large amounts of particle loss. In this
situation, a shingle should experience no more than 0.8 g of
particle loss if it is expected to last the length of the warranty
period. The data file < PARTICLE > contains a sample of
170 measurements made on the company’s ‘Bondi’ shingles,
and 140 measurements made on ‘Vincentia’ shingles.
a. For the ‘Bondi’ shingles, construct a 95% confidence
interval estimate of the mean particle loss.
b. For the ‘Vincentia’ shingles, construct a 95% confidence
interval estimate of the mean particle loss.
c. Evaluate whether the assumption needed for (a) and (b) has
been seriously violated.
d. Based on the results of (a) and (b), what conclusions can
you reach concerning the mean particle loss of the ‘Bondi’
and ‘Vincentia’ shingles?
8.80 Diners have rated 14 North Island and 14 South Island New
Zealand restaurants on the basis of food, presentation, service
and toilets using an online review system with ratings from 1
to 10. The data file < REST_NZ > contains the ratings for each
of these categories.
For each island separately:
a. Construct 95% confidence interval estimates for the mean
food rating, mean presentation rating, mean service rating
and mean toilet rating.
b. What conclusions can you reach about the North and South
Island restaurants from the results in (a)?
REPORT WRITING EXERCISE
8.81 Referring to the results in problem 8.77 concerning the width of
a steel trough, write a report that summarises your conclusions.
Continuing cases
Tasman University
The Business School at Tasman University (TBU) has decided to gather data about its undergraduate students. It
has created and distributed a survey of 14 questions and receives responses from 62 undergraduates (stored in
< TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).
a For each variable included in the survey, construct a 95% confidence interval estimate for the
population characteristic and write a report summarising your conclusions.
Shortly afterwards, TBU decides to undertake a similar survey for graduate students. It creates and distributes a
survey of 14 questions and receives responses from 44 graduate students (stored in
< TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >).
b For each variable included in the survey, construct a 95% confidence interval estimate for the
population characteristic and write a report summarising your conclusions.
As Safe as Houses
While working at Safe-As-Houses Real Estate, you are told the company wishes to explore variations in the average
prices of properties in towns and cities. Using data in the file < REAL_ESTATE >, find a 95% confidence interval for the
mean property price in each town or city in both states. Write a report that details your findings. Have you found
any evidence of differences between average prices in these towns and cities?
Chapter 8 Excel Guide
EG8.1 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ KNOWN)
Open the CIE_Sigma_Known workbook. This workbook already contains the entries for Example 8.1 on page 283 and uses the NORM.S.INV and CONFIDENCE.NORM functions (see Appendix D.2 for more information). To adapt this worksheet to other problems, change the population standard deviation, sample mean, sample size and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.2 (Confidence Interval Estimate for the Mean, sigma known) if you want PHStat to produce a worksheet for you.

EG8.2 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ UNKNOWN)
Open the CIE_Sigma_Unknown workbook, shown in Figure 8.6 on page 288. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.3 for more information).
To adapt this workbook to other problems, change the sample statistics and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.3 (Confidence Interval Estimate for the Mean, sigma unknown) if you want PHStat to produce a worksheet for you.

EG8.3 CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION
Open the CIE_Proportion workbook, shown in Figure 8.10 on page 292. The workbook uses the NORM.S.INV function to determine the Z value (see Appendix D.4 for more information).
To adapt this workbook to other problems, change the sample size, number of successes and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.4 (Confidence Interval Estimate for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.4 SAMPLE SIZE DETERMINATION FOR THE MEAN
Open the Sample_Size_Mean workbook, shown in Figure 8.11 on page 296. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.5 for more information).
To adapt this workbook to other problems, change the population standard deviation, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.5 (Sample Size Determination for the Mean) if you want PHStat to produce a worksheet for you.

EG8.5 SAMPLE SIZE DETERMINATION FOR THE PROPORTION
Open the Sample_Size_Proportion workbook, shown in Figure 8.12 on page 298. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.6 for more information).
To adapt this workbook to other problems, change the estimate of true proportion, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.6 (Sample Size Determination for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.6 CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
Open the CIE_Total workbook. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.7 for more information).
To adapt this workbook to other problems, change the population size, sample mean, sample size, sample standard deviation and confidence level values in the tinted cells in rows 4 to 8.
OR See Appendix D.7 (Estimate for the Population Total) if you want PHStat to produce a worksheet for you.

EG8.7 CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE
Open the CIE_Total_Difference workbook. This two-worksheet file already contains the entries for the Callistemon Camping Supplies example used in Section 8.5. To adapt this workbook to other problems, first change the population size, sample size and confidence level values in the tinted cells in rows 4 to 6. Then select the Data worksheet and enter differences data in column A, replacing the data already there for the Section 8.5 problem. Finally, adjust the column B formulas, copying the formulas down to additional cells if you have more than 12 differences, or deleting the unneeded formulas if you have fewer than 12 differences.
OR See Appendix D.8 (Estimate for the Total Difference) if you want PHStat to produce a worksheet for you.
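For readers working outside Excel, the following Python sketch gives rough standard-library counterparts of three worksheet functions named in this guide: NORM.S.INV, CONFIDENCE.NORM and ROUNDUP. The Python function names are hypothetical, and `roundup` here covers only the case used for sample sizes (positive values rounded up to a whole number). T.INV.2T has no standard-library counterpart; with SciPy installed, `scipy.stats.t.ppf(1 - alpha / 2, df)` returns the same two-tailed critical value.

```python
import math
from statistics import NormalDist


def norm_s_inv(probability: float) -> float:
    """Counterpart of NORM.S.INV: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(probability)


def confidence_norm(alpha: float, standard_dev: float, size: int) -> float:
    """Counterpart of CONFIDENCE.NORM: half-width of the interval for the mean."""
    return norm_s_inv(1 - alpha / 2) * standard_dev / math.sqrt(size)


def roundup(value: float) -> int:
    """Counterpart of ROUNDUP(value, 0) for positive values only."""
    return math.ceil(value)


# Hypothetical inputs, for illustration only
print(norm_s_inv(0.975))
print(confidence_norm(alpha=0.05, standard_dev=2.5, size=50))
print(roundup(96.04))  # 97
```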
Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
CHAPTER 9
Fundamentals of hypothesis testing: One-sample tests
PATRICIO’S PASTA CO.
You have recently been appointed to oversee quality control at Patricio’s Pasta Co.,
which produces and packages a range of dried pasta in traditional Italian shapes. It is
made from Australian durum wheat semolina, sourced from grain grown in the Narrabri
region of New South Wales.
The pasta is sold in 500-gram packets, and part of your job is to ensure that packets are being
filled correctly and that the weight of the contents is as shown on the packet. You select and weigh
a random sample of 25 filled spiral pasta packets in order to calculate a sample mean and investigate how close the weights are to the company’s specifications of a mean of 500 grams. You
must make a decision and conclude whether (or not) the mean fill weight in the entire process is
equal to 500 grams, in order to know whether the fill process needs adjustment. How could you
rationally make this decision?
© Tim Hill/Alamy Stock Photo
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the basic principles of hypothesis testing
2 explain the assumptions of each hypothesis-testing procedure, how to evaluate them and the
consequences if they are seriously violated
3 use hypothesis testing to test a mean or proportion
4 recognise the pitfalls involved in hypothesis testing
5 identify the ethical issues involved in hypothesis testing
Unlike Chapter 7, in which the problem facing the operations manager was to determine whether
the sample mean was consistent with a known population mean, this chapter’s opening scenario
asks how the sample mean can validate the claim that the population mean is 500 grams. To
validate the claim, you must first state the claim unambiguously. For example, the population
mean is 500 grams. In the inferential method known as hypothesis testing you consider the
evidence – the sample statistic – to see whether the evidence better supports the statement,
called the null hypothesis, or the mutually exclusive alternative which, in this case, states that
the population mean is not 500 grams.
In this chapter the focus is on hypothesis testing, another aspect of statistical inference that,
like confidence interval estimation, is based on sample information. A step-by-step methodology is developed that enables you to make inferences about a population parameter by analysing differences between the results observed (the sample statistic) and the results you expect to
get if some underlying hypothesis is actually true. For example, is the mean weight of the retail
spiral pasta packets in the sample taken at Patricio’s Pasta consistent with what you would
expect if the mean of the entire population of retail packets is 500 grams? Or can you infer that
the population mean is not equal to 500 grams because the sample mean is significantly different from 500 grams?
9.1 HYPOTHESIS-TESTING METHODOLOGY
The Null and Alternative Hypotheses
hypothesis testing
A method of statistical inference
used to make tests about the value
of population parameters.
null hypothesis (H0)
A statement about the value of one
or more population parameters
which we test and aim to disprove.
Hypothesis testing typically begins with some theory, claim or assertion about a particular
parameter of a population. For example, your initial hypothesis about the pasta company example is that the process is working properly, meaning that the mean weight is 500 grams, and no
corrective action is needed.
The hypothesis that the population parameter is equal to the company specification is
referred to as the null hypothesis. A null hypothesis is always one of status quo, and is identified
by the symbol H0. Here, the null hypothesis is that the filling process is working properly and
therefore the mean weight is the 500-gram specification. This is stated as:
H0: μ = 500
Even though information is available only from the sample, the null hypothesis is written in
terms of the population. Remember, your focus is on the population of all retail spiral pasta
packets. The sample statistic is used to make inferences about the entire filling process. One
inference may be that the results observed from the sample data indicate that the null hypothesis is false. If the null hypothesis is considered false, something else must be true. Whenever a
null hypothesis is specified, an alternative hypothesis is also specified, one that must be true if
the null hypothesis is false. The alternative hypothesis, H1, is the opposite of the null hypothesis,
H0. This is stated in the pasta example as:
H1: μ ≠ 500
The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis. The
null hypothesis is rejected when there is sufficient evidence from the sample information that the
null hypothesis is false. In the pasta example, if the weights of the sampled packets are sufficiently
above or below the expected 500-gram mean specified by the company, you reject the null hypothesis in favour of the alternative hypothesis that the mean fill is different from 500 grams. You stop
production and take whatever action is necessary to correct the problem. If the null hypothesis is
not rejected, then you should continue to believe in the status quo, that the process is working correctly and that no corrective action is necessary. Note that this does not mean you have proved that
the process is working correctly. Rather, you have failed to prove that it is working incorrectly
and, therefore, you continue your (unproven) belief in the null hypothesis.
In the hypothesis-testing methodology, the null hypothesis is rejected when the sample
evidence suggests that it is far more likely that the alternative hypothesis is true. However, failure to reject the null hypothesis is not proof that it is true. You can never prove that the null
hypothesis is correct because the decision is based only on the sample information, not on the
entire population. Therefore, if you fail to reject the null hypothesis, you can only conclude that
there is insufficient evidence to warrant its rejection. The following key points summarise the
null and alternative hypotheses:
• The null hypothesis, H0, represents the status quo or the current belief in a situation.
• The alternative hypothesis, H1, is the opposite of the null hypothesis and represents a
research claim or specific inference you would like to prove.
• If you reject the null hypothesis, you have statistical proof that the alternative hypothesis
is correct.
• If you do not reject the null hypothesis, you have failed to prove the alternative
hypothesis. Failure to prove the alternative hypothesis, however, does not mean that you
have proved the null hypothesis.
• The null hypothesis, H0, always refers to a specified/hypothesised value of the population
parameter (such as μ), not a sample statistic (such as X̄).
• The statement of the null hypothesis always contains an equals sign regarding the
specified value of the population parameter (e.g. H0: μ = 500 or H0: μ ≥ 400).
• The statement of the alternative hypothesis never contains an equals sign regarding the
specified value of the population parameter (e.g. H1: μ > 500 or H1: μ < 400).
alternative hypothesis (H1)
A statement that we aim to prove about one or more population parameters; the opposite of the null hypothesis.
LEARNING OBJECTIVE 1
Identify the basic principles of hypothesis testing
EXAMPLE 9.1
THE NULL AND ALTERNATIVE HYPOTHESES
You are the manager of an Internet provider’s call centre for customer support. You want to
determine whether the time taken to call back customers who elected to leave the phone
queue has changed in the past month from its previous population mean value of 4.5 minutes. State the null and alternative hypotheses.
SOLUTION
The null hypothesis is that the population mean has not changed from its previous value of
4.5 minutes. This is stated as:
H0: μ = 4.5
The alternative hypothesis is the opposite of the null hypothesis. Since the null hypothesis is
that the population mean is 4.5 minutes, the alternative hypothesis is that the population
mean is not 4.5 minutes. This is stated as:
H1: μ ≠ 4.5
Determining the Test Statistic
test statistic
A value derived from sample data
that is used to determine whether
the null hypothesis should be
rejected or not.
region of rejection
The range of values of the test
statistic where the null hypothesis is
rejected; it is also called the ‘critical
region’.
region of non-rejection
The range of values of the test
statistic where the null hypothesis
cannot be rejected.
The logic behind the hypothesis-testing methodology is to determine how likely it is that the
null hypothesis is true by considering the information gathered in a sample. In the Patricio’s
Pasta scenario, the null hypothesis is that the mean weight of spiral pasta packets in the entire
filling process is 500 grams (i.e. the population parameter specified by the company). You
select a sample of packets from the filling process, weigh each packet and calculate the sample
mean. This statistic is an estimate of the corresponding parameter (the population mean μ).
Even if the null hypothesis is in fact true, the statistic (the sample mean X̄) is likely to differ
from the value of the parameter (the population mean μ) because of variation due to sampling.
However, you expect the sample statistic to be close to the population parameter if the null
hypothesis is true. If the sample statistic is close to the population parameter, you have insufficient evidence to reject the null hypothesis. For example, if the sample mean is 499.9, you
would conclude that the population mean has not changed (i.e. μ = 500), because a sample
mean of 499.9 is very close to the hypothesised value of 500. Intuitively, you think that it is
likely that you could get a sample mean of 499.9 from a population whose mean is 500.
On the other hand, if there is a large difference between the value of the statistic and the
hypothesised value of the population parameter, you will conclude that the null hypothesis is
false. For example, if the sample mean is 420, you would conclude that the population mean is
not 500 (i.e. μ ≠ 500), because the sample mean is very far from the hypothesised value of 500.
In such a case you conclude that it is very unlikely to get a sample mean of 420 if the population
mean is really 500. Therefore, it is more logical to conclude that the population mean is not
equal to 500 and reject the null hypothesis.
Unfortunately, the decision-making process is not always so clear-cut. Determining what is
‘very close’ and what is ‘very different’ is arbitrary and without clear definitions. Hypothesis-testing methodology provides clear definitions for evaluating differences. It also enables you to
quantify the decision-making process by calculating the probability of getting a given sample
result if the null hypothesis is true. You calculate this probability by determining the sampling
distribution for the sample statistic of interest (e.g. the sample mean) and then calculating the
particular test statistic based on the given sample result. Because the sampling distribution for
the test statistic often follows a well-known statistical distribution, such as the standardised
normal distribution or t distribution, you can use these distributions to help determine whether
the null hypothesis is true.
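The calculation described above can be sketched in a few lines of Python. The snippet standardises the two sample means discussed earlier (499.9 and 420) against the hypothesised mean of 500 grams for a sample of 25 packets, and then finds the probability of a result at least that extreme if the null hypothesis is true. The population standard deviation of 15 grams is an assumption made purely for illustration, as is the use of the standardised normal distribution; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, hypothesised_mean, sigma, n):
    """Standardise a sample mean against the hypothesised population mean."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error


def two_tail_probability(z):
    """Probability of a Z value at least this far from 0 (either tail) if H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


SIGMA = 15.0  # ASSUMED population standard deviation in grams (illustrative only)
N = 25        # sample size from the Patricio's Pasta scenario

z_close = z_statistic(499.9, 500, SIGMA, N)
z_far = z_statistic(420, 500, SIGMA, N)

print(f"Sample mean 499.9: Z = {z_close:.3f}, probability = {two_tail_probability(z_close):.4f}")
print(f"Sample mean 420.0: Z = {z_far:.3f}, probability = {two_tail_probability(z_far):.4f}")
```

Under these assumptions the sample mean of 499.9 lies only a small fraction of a standard error from 500, so such a result is entirely plausible if the null hypothesis is true, whereas a sample mean of 420 lies more than 26 standard errors away and is practically impossible.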
Regions of Rejection and Non-Rejection
The sampling distribution of the test statistic is divided into two regions, a region of rejection
(sometimes called the critical region) and a region of non-rejection (see Figure 9.1).
If the test statistic falls into the region of non-rejection, you do not reject the null hypothesis.
In the Patricio’s Pasta scenario, you see that there is insufficient evidence that the population
mean fill is different from 500 grams. If the test statistic falls into the rejection region, you reject
the null hypothesis. In this case, you will see that the population mean is not 500 grams.
Figure 9.1 Regions of rejection and non-rejection in hypothesis testing
[Figure: sampling distribution of X̄ centred on μ, with a region of non-rejection between two critical values and a region of rejection in each tail]
The region of rejection consists of the values of the test statistic that are unlikely to occur if
the null hypothesis is true. These values are more likely to occur if the null hypothesis is false.
Therefore, if a value of the test statistic falls into this rejection region, you reject the null
hypothesis because that value is unlikely if the null hypothesis is true.
To make a decision concerning the null hypothesis, you first determine the critical value of the
test statistic. The critical value divides the non-rejection region from the rejection region. Determining this critical value depends on the size of the rejection region. The size of the rejection
region is directly related to the risks involved in using only sample evidence to make decisions
about a population parameter.
critical value
The value in a distribution that cuts
off the required probability in the tail
for a given confidence level.
Risks in Decision Making Using Hypothesis Testing
When using a sample statistic to make decisions about a population parameter, there is a risk
that you will reach an incorrect conclusion. You can make two different types of errors when
applying hypothesis-testing methodology: a Type I error and a Type II error.
A Type I error occurs if you reject the null hypothesis, H0, when in fact it is true and
should not be rejected. The probability of a Type I error occurring is α.
A Type II error occurs if you do not reject the null hypothesis, H0, when in fact it is false
and should be rejected. The probability of a Type II error occurring is β.
In the Patricio’s Pasta scenario, you make a Type I error if you conclude that the population
mean weight is not 500 when in fact it is 500. You make a Type II error if you conclude that the
population mean weight is 500 when in fact it is not 500.
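To see the two types of error in action, the following simulation is a rough sketch rather than a definitive procedure. It repeatedly draws a sample mean for 25 packets and records how often the null hypothesis of a 500-gram mean is rejected. Several values are assumptions chosen only for illustration: normally distributed weights, a population standard deviation of 15 grams, a rejection rule of |Z| > 1.96, and an alternative true mean of 505 grams. The function name is hypothetical.

```python
import random
from math import sqrt


def rejection_rate(true_mean, hypothesised_mean=500.0, sigma=15.0, n=25,
                   critical_z=1.96, trials=20_000, seed=1):
    """Share of simulated samples whose Z statistic falls in the rejection region."""
    rng = random.Random(seed)
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # The mean of n normal weights is itself normal with spread sigma / sqrt(n),
        # so the sample mean is drawn directly from its sampling distribution.
        sample_mean = rng.gauss(true_mean, standard_error)
        z = (sample_mean - hypothesised_mean) / standard_error
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials


# H0 is true (mean really is 500): every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=500.0)

# H0 is false (ASSUMED true mean of 505): every non-rejection is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=505.0)

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

With this rejection rule the estimated Type I error rate should settle near 0.05. The Type II error rate depends on how far the true mean is from 500: the closer the true mean is to the hypothesised value, the harder the difference is to detect.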
The Level of Significance (α)
The probability of committing a Type I error, denoted by α (the lower-case Greek letter alpha),
is referred to as the level of significance of the statistical test. Traditionally, you control the Type
I error by deciding on the risk level, α, that you are willing to have in rejecting the null hypothesis when it is true. Because you specify the level of significance before the hypothesis test is
performed, the risk of committing a Type I error, α, is directly under your control. Traditionally,
you select levels of 0.01, 0.05 or 0.10. The choice of a particular risk level for making a Type I
error depends on the cost of making such an error. After you specify the value for α, you know
the size of the rejection region because α is the probability of rejection under the null hypothesis. From this fact, you can then determine the critical value or values that divide the rejection
and non-rejection regions.
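As an illustrative sketch of how the level of significance fixes the critical values, the Python snippet below finds the cut-off points for the three traditional levels. It assumes a two-tail test whose test statistic follows the standardised normal distribution, so that the rejection region is split equally between the two tails; the function names are hypothetical.

```python
from statistics import NormalDist


def critical_values(alpha):
    """Lower and upper critical Z values for a two-tail test at level alpha."""
    # Each tail of the rejection region holds probability alpha / 2.
    upper = NormalDist().inv_cdf(1 - alpha / 2)
    return -upper, upper


def decision(z, alpha):
    """Apply the decision rule to a calculated Z test statistic."""
    lower, upper = critical_values(alpha)
    return "reject H0" if z < lower or z > upper else "do not reject H0"


for alpha in (0.01, 0.05, 0.10):
    lower, upper = critical_values(alpha)
    print(f"alpha = {alpha:.2f}: reject H0 if Z < {lower:.3f} or Z > {upper:.3f}")
```

A smaller level of significance pushes the critical values further into the tails, so stronger sample evidence is needed before the null hypothesis is rejected.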
Type I error
The rejection of a null hypothesis
that is true and should not be
rejected.
Type II error
The non-rejection of a null
hypothesis that is false and should
be rejected.
level of significance (𝛂)
The probability of rejecting a null
hypothesis which is in fact true.
The Confidence Coefficient
The complement of the probability of a Type I error (1 − α) is called the confidence coefficient.
When multiplied by 100%, the confidence coefficient yields the confidence level that was studied when constructing confidence intervals (see Section 8.1).
The confidence coefficient, 1 − α, is the probability that you will not reject the null
hypothesis, H0, when it is true and should not be rejected. The confidence level of a
hypothesis test is (1 − α) × 100%.
In terms of hypothesis-testing methodology, the confidence coefficient represents the
probability of concluding that the value of the parameter as specified i