Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019 – 9781488617249 – Berenson/Basic Business Statistics 5e

5TH EDITION
Basic Business Statistics
Concepts and applications
Berenson Levine Szabat O’Brien Jayne Watson

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019

Pearson Australia
707 Collins Street
Melbourne VIC 3008
www.pearson.com.au

Authorised adaptation from the United States edition entitled Basic Business Statistics, 13th edition, ISBN 0321870026 by Berenson, Mark L., Levine, David M., Szabat, Kathryn A., published by Pearson Education, Inc., Copyright © 2015.

Fifth adaptation edition published by Pearson Australia Group Pty Ltd, Copyright © 2019

The Copyright Act 1968 of Australia allows a maximum of one chapter or 10% of this book, whichever is the greater, to be copied by any educational institution for its educational purposes provided that that educational institution (or the body that administers it) has given a remuneration notice to Copyright Agency Limited (CAL) under the Act. For details of the CAL licence for educational institutions contact: Copyright Agency Limited, telephone: (02) 9394 7600, email: info@copyright.com.au

All rights reserved. Except under the conditions described in the Copyright Act 1968 of Australia and subsequent amendments, no part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.
Portfolio Manager: Rebecca Pedley
Development Editor: Anna Carter
Project Managers: Anubhuti Harsh and Keely Smith
Production Manager: Julie Ganner
Product Manager: Sachin Dua
Content Developer: Victoria Kerr
Rights and Permissions Team Leader: Lisa Woodland
Lead Editor/Copy Editor: Julie Ganner
Proofreader: Katy McDevitt
Indexer: Garry Cousins
Cover and internal design by Natalie Bowra
Cover photograph © kireewong foto/Shutterstock
Typeset by iEnergizer Aptara®, Ltd
Printed in Malaysia
ISBN 9781488617249
1 2 3 4 5 23 22 21 20 19
Pearson Australia Group Pty Ltd ABN 40 004 245 943

brief contents

Preface
Acknowledgements
How to use this book
About the authors

PART 1 PRESENTING AND DESCRIBING INFORMATION
1 Defining and collecting data
2 Organising and visualising data
3 Numerical descriptive measures

PART 2 MEASURING UNCERTAINTY
4 Basic probability
5 Some important discrete probability distributions
6 The normal distribution and other continuous distributions
7 Sampling distributions

PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION
8 Confidence interval estimation
9 Fundamentals of hypothesis testing: One-sample tests
10 Hypothesis testing: Two-sample tests
11 Analysis of variance

PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS
12 Simple linear regression
13 Introduction to multiple regression
14 Time-series forecasting and index numbers
15 Chi-square tests

ONLINE CHAPTERS
PART 5 FURTHER TOPICS IN STATS
16 Multiple regression model building
17 Decision making
18 Statistical applications in quality management
19 Further non-parametric tests
20 Business analytics
21 Data analysis: The big picture

Appendices A to F
Glossary
Index

detailed contents

Preface
Acknowledgements
How to use this book
About the authors

PART 1 PRESENTING AND DESCRIBING INFORMATION

1 Defining and collecting data
1.1 Basic concepts of data and statistics
1.2 Types of variables
1.3 Collecting data
1.4 Types of survey sampling methods
1.5 Evaluating survey worthiness
1.6 The growth of statistics and information technology
Summary, Key terms, References, Chapter review problems, Continuing cases, Chapter 1 Excel Guide

2 Organising and visualising data
2.1 Organising and visualising categorical data
2.2 Organising numerical data
2.3 Summarising and visualising numerical data
2.4 Organising and visualising two categorical variables
2.5 Visualising two numerical variables
2.6 Business analytics applications – descriptive analytics
2.7 Misusing graphs and ethical issues
Summary, Key terms, References, Chapter review problems, Continuing cases, Chapter 2 Excel Guide

3 Numerical descriptive measures
3.1 Measures of central tendency, variation and shape
3.2 Numerical descriptive measures for a population
3.3 Calculating numerical descriptive measures from a frequency distribution
3.4 Five-number summary and box-and-whisker plots
3.5 Covariance and the coefficient of correlation
3.6 Pitfalls in numerical descriptive measures and ethical issues
Summary, Key formulas, Key terms, Chapter review problems, Continuing cases, Chapter 3 Excel Guide

End of Part 1 problems

PART 2 MEASURING UNCERTAINTY

4 Basic probability
4.1 Basic probability concepts
4.2 Conditional probability
4.3 Bayes’ theorem
4.4 Counting rules
4.5 Ethical issues and probability
Summary, Key formulas, Key terms, Chapter review problems, Continuing cases, Chapter 4 Excel Guide

5 Some important discrete probability distributions
5.1 Probability distribution for a discrete random variable
5.2 Covariance and its application in finance
5.3 Binomial distribution
5.4 Poisson distribution
5.5 Hypergeometric distribution
Summary, Key formulas, Key terms, Chapter review problems, Chapter 5 Excel Guide

6 The normal distribution and other continuous distributions
6.1 Continuous probability distributions
6.2 The normal distribution
6.3 Evaluating normality
6.4 The uniform distribution
6.5 The exponential distribution
6.6 The normal approximation to the binomial distribution
Summary, Key formulas, Key terms, Chapter review problems, Continuing cases, Chapter 6 Excel Guide

7 Sampling distributions
7.1 Sampling distributions
7.2 Sampling distribution of the mean
7.3 Sampling distribution of the proportion
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 7 Excel Guide

End of Part 2 problems

PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION

8 Confidence interval estimation
8.1 Confidence interval estimation for the mean (σ known)
8.2 Confidence interval estimation for the mean (σ unknown)
8.3 Confidence interval estimation for the proportion
8.4 Determining sample size
8.5 Applications of confidence interval estimation in auditing
8.6 More on confidence interval estimation and ethical issues
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 8 Excel Guide

9 Fundamentals of hypothesis testing: One-sample tests
9.1 Hypothesis-testing methodology
9.2 Z test of hypothesis for the mean (σ known)
9.3 One-tail tests
9.4 t test of hypothesis for the mean (σ unknown)
9.5 Z test of hypothesis for the proportion
9.6 The power of a test
9.7 Potential hypothesis-testing pitfalls and ethical issues
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 9 Excel Guide

10 Hypothesis testing: Two-sample tests
10.1 Comparing the means of two independent populations
10.2 Comparing the means of two related populations
10.3 F test for the difference between two variances
10.4 Comparing two population proportions
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 10 Excel Guide

11 Analysis of variance
11.1 The completely randomised design: One-way analysis of variance
11.2 The randomised block design
11.3 The factorial design: Two-way analysis of variance
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 11 Excel Guide

End of Part 3 problems

PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS

12 Simple linear regression
12.1 Types of regression models
12.2 Determining the simple linear regression equation
12.3 Measures of variation
12.4 Assumptions
12.5 Residual analysis
12.6 Measuring autocorrelation: The Durbin–Watson statistic
12.7 Inferences about the slope and correlation coefficient
12.8 Estimation of mean values and prediction of individual values
12.9 Pitfalls in regression and ethical issues
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 12 Excel Guide

13 Introduction to multiple regression
13.1 Developing the multiple regression model
13.2 R², adjusted R² and the overall F test
13.3 Residual analysis for the multiple regression model
13.4 Inferences concerning the population regression coefficients
13.5 Testing portions of the multiple regression model
13.6 Using dummy variables and interaction terms in regression models
13.7 Collinearity
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 13 Excel Guide

14 Time-series forecasting and index numbers
14.1 The importance of business forecasting
14.2 Component factors of the classical multiplicative time-series model
14.3 Smoothing the annual time series
14.4 Least-squares trend fitting and forecasting
14.5 The Holt–Winters method for trend fitting and forecasting
14.6 Autoregressive modelling for trend fitting and forecasting
14.7 Choosing an appropriate forecasting model
14.8 Time-series forecasting of seasonal data
14.9 Index numbers
14.10 Pitfalls in time-series forecasting
Summary, Key formulas, Key terms, References, Chapter review problems, Chapter 14 Excel Guide

15 Chi-square tests
15.1 Chi-square test for the difference between two proportions (independent samples)
15.2 Chi-square test for differences between more than two proportions
15.3 Chi-square test of independence
15.4 Chi-square goodness-of-fit tests
15.5 Chi-square test for a variance or standard deviation
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 15 Excel Guide

End of Part 4 problems

PART 5 (ONLINE) FURTHER TOPICS IN STATS

16 Multiple regression model building
16.1 Quadratic regression model
16.2 Using transformations in regression models
16.3 Influence analysis
16.4 Model building
16.5 Pitfalls in multiple regression and ethical issues
Summary, Key formulas, Key terms, References, Chapter review problems, Continuing cases, Chapter 16 Excel Guide

17 Decision making
17.1 Payoff tables and decision trees
17.2 Criteria for decision making
17.3 Decision making with sample information
17.4 Utility
Summary, Key formulas, Key terms, References, Chapter review problems, Chapter 17 Excel Guide

18 Statistical applications in quality management
18.1 Total quality management
18.2 Six Sigma management
18.3 The theory of control charts
18.4 Control chart for the proportion – The p chart
18.5 The red bead experiment – Understanding process variability
18.6 Control chart for an area of opportunity – The c chart
18.7 Control charts for the range and the mean
18.8 Process capability
Summary, Key formulas, Key terms, References, Chapter review problems, Chapter 18 Excel Guide

19 Further non-parametric tests
19.1 McNemar test for the difference between two proportions (related samples)
19.2 Wilcoxon rank sum test – Non-parametric analysis for two independent populations
19.3 Wilcoxon signed ranks test – Non-parametric analysis for two related populations
19.4 Kruskal–Wallis rank test – Non-parametric analysis for the one-way anova
19.5 Friedman rank test – Non-parametric analysis for the randomised block design
Summary, Key formulas, Key terms, Chapter review problems, Continuing cases, Chapter 19 Excel Guide

20 Business analytics
20.1 Predictive analytics
20.2 Classification and regression trees
20.3 Neural networks
20.4 Cluster analysis
20.5 Multidimensional scaling
Key formulas, Key terms, References, Chapter review problems, Chapter 20 Software Guide

21 Data analysis: The big picture
21.1 Analysing numerical variables
21.2 Analysing categorical variables
21.3 Predictive analytics
Chapter review problems

End of Part 5 problems

Appendices A to F
Glossary
Index

preface

This fifth Australasian and Pacific edition of Basic Business Statistics: Concepts and Applications continues to build on the strengths of the fourth edition, and extends the outstanding teaching foundation of the previous American editions, authored by Berenson, Levine and Szabat. The teaching philosophy of this text is based upon the principles of the American book, but each chapter has once again been carefully revised to include practical examples and a language and style that is more applicable to Australasian and Pacific readers.

In preparation for this edition we again asked lecturers from around the country to comment on the format and content of the fourth edition and, based on those comments, the authors have worked to create a text that is more accessible – but no less authoritative – for students.

Part 5 contains additional chapters: Chapter 16 on multiple regression and model building, Chapter 17 on decision making, Chapter 18 on statistical applications in quality and productivity management, Chapter 19 on further non-parametric tests and two brand new chapters: Chapter 20 on business analytics and Chapter 21 on data analysis. Chapter 21 will be especially useful to students who wish to understand how the concepts and techniques studied in this book all fit together. The Part 5 chapters can be found within the MyLab and student download page via our catalogue.

Chapter 21 (including Figure 21.1, which provides a summary of the contents of this book arranged by data-analysis task) is designed to provide guidance in choosing appropriate statistical techniques for data-analysis questions arising in business or elsewhere.
Figure 21.1, and Chapter 21, should be referred to when working through the earlier chapters of this book. This should enable students to see connections between topics; that is, the big picture.

The new edition has continued with a ‘real-world’ focus, to take students beyond the pure theory. Some chapters have a completely new opening scenario, focusing on a person or company, which serves to introduce key concepts covered in the chapter. The scenario is interwoven throughout the chapter to reinforce the concepts to the student. Multiple in-chapter examples have been updated to highlight real Australasian and Pacific data.

The Real people, real stats feature that opens each of the text’s five parts is composed of a personal interview highlighting how real people in real business situations apply the principles of statistics to their jobs. The interviewees are:

Part 1  David McCourt, BDO
Part 2  Ellouise Roberts, Deloitte Access Economics
Part 3  Rod Battye, Tourism Research Australia
Part 4  Gautam Gangopadhyay, Endeavour Energy
Part 5  Deborah O’Mara, The University of Sydney

Judith Watson
Nicola Jayne
Martin O’Brien

acknowledgements

When developing the new edition of Basic Business Statistics, we were mindful of retaining the strengths of the current edition, but also of the need to build on those strengths, to enhance the text and to ensure wider reader appeal and useability. We are indebted to the following academics who contributed to the new edition.

Technical Editor
We would like to thank Martin Firth at UWA for carrying out a detailed technical edit of the text.
Reviewers
Ms Gerrie Roberts, Monash University
Dr Sonika Singh, University of Technology Sydney
Dr Erick Li, University of Sydney
Dr Amir Arjomandi, University of Wollongong
Mr Jason Hay, Queensland University of Technology
Mr Martin J Firth, University of Western Australia
Dr Scott Salzman, Deakin University
Ms Charanjit Kaur, Monash University
Dr Jill Wright, Monash University

The enormous task of writing a book of this scope was possible only with the expert assistance of all these friends and colleagues and that of the editorial and production staff at Pearson Australia. We gratefully acknowledge their invaluable contributions at every stage of this project, collectively and, now, individually.

We thank the following people at Pearson Australia: Rebecca Pedley, Portfolio Manager; Anna Carter, Development Editor; Julie Ganner, Production Manager and Copy Editor; and Lisa Woodland, Rights & Permissions Team Leader.

how to use this book

Real people, real stats interviews open each part. These introduce real people working in real business environments, using statistics to tackle real business challenges.

Learning objectives introduce you to the key concepts to be covered in each chapter, and are signposted in the margins where they are covered within the chapter.

PART 1 Presenting and describing information
Real People, Real Stats
David McCourt, BDO

Which company are you currently working for and what are some of your responsibilities?
I work at BDO, Chartered Accountants and Advisors, in the corporate finance team. My primary responsibilities include the preparation of financial models and valuation reports.

List five words that best describe your personality.
Affable, level-headed, perceptive, analytical, assured (according to my colleagues).

What are some things that motivate you?
Success, working with a team, client satisfaction.

When did you first become interested in statistics?
I never really understood statistics at school and it was a minor part of my university degree. However, statistics play a significant role in many of our valuations, including discounted cash flow valuations and share option valuations.

Complete the following sentence. A world without statistics …
… is not worth thinking about.

LET’S TALK STATS
What do you enjoy most about working in statistics?
We use data services and statistical tools that have been created by third parties. I can use, and talk reasonably knowledgeably about, statistical data without being an expert.

CHAPTER 1 Defining and collecting data

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the types of data used in business
2 identify how statistics is used in business
3 recognise the sources of data used in business
4 distinguish between different survey sampling methods
5 evaluate the quality of surveys

Chapter-opening scenarios show how statistics are used in everyday life. The scenarios introduce the concepts to be covered, showing the relevance of using particular statistical techniques. The problem is woven throughout each chapter, showing the connection between statistics and their use in business, as well as keeping you motivated.

THE HONG KONG AIRPORT SURVEY

Not so long ago, business students were unfamiliar with the word data and had little experience handling data. Today, every time you visit a search engine website or ‘ask’ your mobile device a question, you are handling data. And if you ‘check in’ to a location or indicate that you ‘like’ something, you are creating data as well. You accept as almost true the premises of stories in which characters collect ‘a lot of data’ to uncover conspiracies, foretell disasters or catch a criminal.
You hear concerns about how the government or business might be able to ‘spy’ on you in some way or how large social media companies ‘mine’ your personal data for profit. You hear the word data everywhere and may even have a ‘data plan’ for your smartphone. You know, in a general way, that data are facts about the world and that most data seem to be, ultimately, a set of numbers – that 34% of students recently polled prefer using a certain Internet browser, or that 50% of citizens believe the country is headed in the right direction, or that unemployment is down 3%, or that your best friend’s social media account has 835 friends and 202 recent posts.

You cannot escape from data in this digital world. What, then, should you do? You could try to ignore data and conduct business by relying on hunches or your ‘gut instincts’. However, if you want to use only gut instincts, then you probably shouldn’t be reading this book or taking business courses in the first place. You could note that there is so much data in the world – or just in your own little part of the world – that you couldn’t possibly get a handle on it. You could accept other people’s data summaries and their conclusions without first reviewing the data yourself. That, of course, would expose you to fraudulent practices. Or you could do things the proper way and realise the benefits of learning the methods of statistics, the subject of this book.

You can learn, though, the procedures and methods that will help you make better decisions based on solid evidence. When you begin focusing on the procedures and methods involved in collecting, presenting and summarising a set of data, or forming conclusions about those data, you have discovered statistics.

In the Hong Kong Airport survey scenario it is important that research team members focus on the information that is needed by many different stakeholders when planning for future business and tourist visitors.
If the research team fails to collect important information, or misrepresents the opinions of current visitors, stakeholders may make poor decisions about advertising, pricing, facilities and other factors relevant to attracting visitors and hosting them in Hong Kong. Failure to offer suitable facilities and experiences could affect the profitability of businesses in Hong Kong. In deciding how to collect the facts that are needed, it will help if you know something about the basic concepts of statistics.

You are departing Hong Kong International Airport on the next leg of your trip and have cleared Immigration. You are approached by a researcher holding a tablet computer who asks if you can answer a few questions. The first question determines if you are a visitor to Hong Kong or a resident. After establishing that you are a visitor the questions go on to determine the purpose of your visit, the name of your hotel, the activities you have undertaken and much additional information about your visit.

This information is useful for a tourism authority that has the task of marketing Hong Kong as a travel destination and monitoring the quality of visitors’ experiences in the city. It may also inform the authority’s government and commercial stakeholders, who provide transport, accommodation, and food and shopping for visitors, and be used for forward planning.

© Jungyeol & Mina/age fotostock

Data sets and Excel workbooks that accompany the text can be downloaded and used to answer the appropriate questions.

2.1 ORGANISING AND VISUALISING CATEGORICAL DATA

What type of chart should you use? The selection of a chart depends on your intention.
If a comparison of categories is most important, use a bar chart. If observing the portion of the whole that lies in a particular category is most important, use a pie chart. There should be no more than eight categories or slices in a pie chart. If there are more than eight, merge the smaller categories into a category called ‘other’.

Figure 2.3 Microsoft Excel pie chart of the reasons for grocery shopping online

Reason                       Percentage
Comfortable environment       8%
Variety/range of products    10%
Competitive prices           20%
Quality products             18%
Products well displayed       3%
Convenience                  28%
Customer service             13%

EXAMPLE 2.3 PIE CHART FOR FAMILY TYPE
Use the summary tables given for family type in < DEMOGRAPHIC_INFORMATION > to construct and interpret pie charts for the capital city and the council area.

Figure 2.4 Microsoft Excel pie chart for family type (two pie charts, one for the council area and one for the capital city, each with the slices Couple with children, Couple no children, One parent and Other)

Real world, business examples are included throughout the chapter. These are designed to show the multiple applications of statistics, while helping you to learn the statistics techniques.

Emphasis on data output and interpretation
The authors believe that the use of computer software is an integral part of learning statistics. Our focus emphasises analysing data by interpreting the output from Microsoft Excel while reducing emphasis on doing calculations. Excel 2016 changes to statistical functions are reflected in the operations shown in this edition. In the coverage of hypothesis testing in Chapters 9 to 11, extensive computer output is included so that the focus can be placed on the p-value approach. In our coverage of simple linear regression in Chapter 12, we assume that a software program will be used and our focus is on interpretation of the output, not on hand calculations.

Summaries are provided at the end of each chapter, to help you review the key content.

Key terms are signposted in the margins when they are first introduced, and are referenced to page numbers at the end of each chapter, helping you to revise key terms and concepts for the chapter.

End-of-section problems are divided into Learning the basics and Applying the concepts.

Australasian and Pacific data sets are used for the problems in each chapter. These files are contained on the Pearson website.

Ethical issues sections are integrated into many chapters, raising issues for ethical consideration.

End-of-part problems challenge the student to make decisions about the appropriate technique to apply, to carry out that technique and to interpret the data meaningfully.*

CHAPTER 16 MULTIPLE REGRESSION MODEL BUILDING

16 Assess your progress

Summary
In this chapter, various multiple regression topics were considered (see Figure 16.15) including quadratic regression models, interactions, and square-root and log transformations. A number of criteria were presented to examine the influence of each individual observation on the results. In addition, the best subsets and stepwise regression approaches to model building were detailed. You have learned how suburban ratings can be used to derive a measure of income distribution. You also learned how a director of operations at a television station could build a multiple regression model as an aid to reducing labour expenses.

Key formulas

The quadratic regression model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i \qquad (16.1)$$

Quadratic regression equation
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{1i}^2 \qquad (16.2)$$

Regression model with a square-root transformation
$$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \varepsilon_i \qquad (16.3)$$

Original multiplicative model
$$Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i \qquad (16.4)$$

Transformed multiplicative model
$$\log Y_i = \log(\beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i) = \log \beta_0 + \log(X_{1i}^{\beta_1}) + \log(X_{2i}^{\beta_2}) + \log \varepsilon_i = \log \beta_0 + \beta_1 \log X_{1i} + \beta_2 \log X_{2i} + \log \varepsilon_i \qquad (16.5)$$

Original exponential model
$$Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i \qquad (16.6)$$

Transformed exponential model
$$\ln Y_i = \ln(e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i) = \ln(e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}}) + \ln \varepsilon_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i \qquad (16.7)$$

Studentised deleted residual
$$t_i = e_i \sqrt{\frac{n - k - 1}{SSE(1 - h_i) - e_i^2}} \qquad (16.8)$$

Cook’s Di statistic
$$D_i = \frac{e_i^2 h_i}{k \, MSE \, (1 - h_i)^2} \qquad (16.9)$$

The Cp statistic
$$C_p = \frac{(1 - R_k^2)(n - T)}{1 - R_T^2} - [n - 2(k + 1)] \qquad (16.10)$$

Key terms
best-subsets approach 665
Cook’s Di statistic 662
Cp statistic 667
cross-validation 672
data mining 665
hat matrix diagonal elements hi 661
logarithmic transformation 658
parsimony 663
quadratic regression model 651
square-root transformation 657
stepwise regression 663
Studentised deleted residual 661

End of Part 1 problems

A.1 A sample of 500 shoppers was selected in a large metropolitan area to obtain consumer behaviour information. Among the questions asked was, ‘Do you enjoy shopping for clothing?’ The results are summarised in the following cross-classification table.

                               Gender
Enjoy shopping for clothing    Male    Female    Total
Yes                             136       224      360
No                              104        36      140
Total                           240       260      500

a. Construct contingency tables based on total percentages, row percentages and column percentages.
b. Construct a side-by-side bar chart of enjoy shopping for clothing based on gender.
c. What conclusions do you draw from these analyses?

A.2 One of the major measures of the quality of service provided by any organisation is the speed with which the organisation responds to customer complaints. A large family-owned department store selling furniture and flooring, including carpet, has undergone major expansion in the past few years. In particular, the flooring department has expanded from two installation crews to an installation supervisor, a measurer and 15 installation crews. During a recent year the company got 50 complaints about carpet installation. The following data represent the number of days between receipt of the complaint and resolution of the complaint. < FURNITURE >

54   5  35 137  31  27 152   2 123  81  74  27
11  19 126 110 110  29  61  35  94  31  26   5
12   4 165  32  29  28  29  26  25   1  14  13
13  10   5  27   4  52  30  22  36  26  20  23
33  68

a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, first quartile and third quartile.
e. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
f. Construct a box-and-whisker plot. Are the data skewed? If so, how?
g. On the basis of the results of (a) to (f), if you had to report to the manager on how long a customer should expect to wait to have a complaint resolved, what would you say? Explain.

A.3 The annual crediting rates (after tax and fees) on several managed superannuation investment funds between 2013 and 2017 are:

                       Historical crediting rate for year ending 30 June, %
Superannuation fund    2017    2016    2015    2014    2013
Conservative            5.5     8.7     9.0    11.3    12.3
Balanced                9.5     5.2    10.7    14.1    15.9
Growth                 11.8     3.8    11.3    15.6    18.7
High growth            13.7     3.1    12.3    17.4    20.5

a. For each fund, calculate the geometric rate of return for three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric rates of return for the funds?

A.4 A supplier of ‘Natural Australian’ spring water states that the magnesium content is 1.6 mg/L. To check this, the quality control department takes a random sample of 96 bottles during a day’s production and obtains the magnesium content. < SPRING_WATER1 >
a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, mode, first quartile and third quartile.
e. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
f. Construct and interpret a box-and-whisker plot.
g. What conclusions can you reach concerning the magnesium content of this day’s production?

A.5 The National Australia Bank (NAB) produces regular reports titled NAB Online Retail Sales Index <www.business.nab.com.au>. Download the latest in-depth report.
a. Give an example of a categorical variable found in the report.
b. Give an example of a numerical variable found in the report.
c. Is the variable you selected in (b) discrete or continuous?

A.6 The data in the file < WEBSTATS > represent the number of times during August and September that a sample of 50 students accessed the website of a statistics unit they were enrolled in.
a. Construct ordered arrays for August and September.
b. Construct stem-and-leaf displays for August and September.
c. Construct frequency, percentage and cumulative distributions for August and September.

*The solutions are calculated using the (raw) Excel output.
If you use the rounded figures presented in the text to reproduce these answers there may be minor differences.

MyLab Statistics: a guided tour for students and educators

Study Plan
A study plan is generated from each student’s results on a pre-test. Students can clearly see which topics they have mastered and, more importantly, which they need to work on.

Unlimited Practice
Each MyLab Statistics comes with preloaded assignments, including select end-of-chapter questions, all of which are automatically graded. Many study plan and educator-assigned exercises contain algorithmically generated values to ensure students get as much practice as they need. As students work through study plan or homework exercises, instant feedback and tutorial resources guide them towards understanding.

Learning Resources
To further reinforce understanding, study plan and homework problems link to the following learning resources:
• eText linked to sections for all study plan questions
• Help Me Solve This, which walks students through the problem with step-by-step help and feedback without giving away the answer
• StatCrunch.

StatTalk Videos
Fun-loving statistician Andrew Vickers takes to the streets of Brooklyn, New York to demonstrate important statistical concepts through interesting stories and real-life events. This series of videos and corresponding autograded questions will help students to understand statistics.
EDUCATOR RESOURCES
A suite of resources is provided to assist with delivery of the text, as well as to support teaching and learning.

Solutions Manual
The Solutions Manual provides educators with detailed, accuracy-verified solutions to all the in-chapter and end-of-chapter problems in the book.

Test Bank
The Test Bank provides a wealth of accuracy-verified testing material. Updated for the new edition, each chapter offers a wide variety of true/false and multiple-choice questions, arranged by learning objective and tagged by AACSB standards. Questions can be integrated into Blackboard, Canvas or Moodle Learning Management Systems.

PowerPoint lecture slides
A comprehensive set of PowerPoint slides can be used by educators for class presentations or by students for lecture preview or review. They include key figures and tables, as well as a summary of key concepts and examples from the text.

Digital image PowerPoint slides
All the diagrams and tables from the text are available for lecturer use.

about the authors

Judith Watson
Judith Watson teaches in the Business School at UNSW Australia. She has extensive experience in lecturing and administering undergraduate and postgraduate Quantitative Methods courses. Judith’s keen interest in student support led her to establish the Peer Assisted Support Scheme (PASS) in 1996 and she has coordinated this program for many years. She served as her faculty’s academic adviser from 2001 to 2004. Judith has been the recipient of a number of awards for teaching. She received the inaugural Australian School of Business Outstanding Teaching Innovations Award in 2008 and the 2012 Bill Birkett Award for Teaching Excellence.
She also won the UNSW Vice Chancellor’s Award for Teaching Excellence in 2012 and a Citation of Outstanding Contributions to Student Learning from the Australian Government’s Office for Learning and Teaching in 2013. Judith is interested in using online learning technology to engage students and has created a number of adaptive e-learning tutorials for mathematics and statistics and cartoon-style videos to explain statistical concepts.

Dr Nicola Jayne
Nicola Jayne is a lecturer in the Southern Cross Business School at the Lismore campus of Southern Cross University. She has been teaching quantitative units since being appointed to the university in 1993 after several years at Massey University in New Zealand. Nicola has lectured extensively in Business and Financial Mathematics, Discrete Mathematics and Statistics, both undergraduate and postgraduate, as well as various Pure Mathematics units. Nicola’s academic qualifications from Massey University include a Bachelor of Science (majors in Mathematics and Statistics), a Bachelor of Science with Honours (first class) and a Doctor of Philosophy, both in Mathematics. Nicola also has a Graduate Certificate in Higher Education (Learning & Teaching) from Southern Cross University. She was the recipient of a Vice Chancellor’s Citation for an Outstanding Contribution to Student Learning in 2011.

Dr Martin O’Brien
Dr Martin O’Brien is a senior lecturer in economics, Director of the Centre for Human and Social Capital Research, and Director of the MBA program in the Sydney Business School, University of Wollongong. Martin earned his Bachelor of Commerce (first-class honours) and PhD in Economics at the University of Newcastle. His PhD and subsequent published research are in the general area of labour economics, and specifically the exploration of older workers’ labour force participation in Australia in the context of an ageing society.
Martin has been an expert witness for a number of Fair Work Commission cases, providing statistical analyses of the effects of penalty rates, workforce casualisation and family and domestic violence leave. Martin has taught a wide range of quantitative subjects at university level, including business statistics, business analytics, quantitative analysis for decision making, econometrics, financial modelling and business research methods. He also has a keen interest in learning analytics and the development and analysis of new teaching technologies.

about the originating authors

Mark L. Berenson is Professor of Management and Information Systems at Montclair State University (Montclair, New Jersey) and also Professor Emeritus of Statistics and Computer Information Systems at Bernard M. Baruch College (City University of New York). He currently teaches graduate and undergraduate courses in statistics and in operations management in the School of Business and an undergraduate course in international justice and human rights that he co-developed in the College of Humanities and Social Sciences. Berenson received a BA in economic statistics, an MBA in business statistics from City College of New York and a PhD in business from the City University of New York. His research has been published in Decision Sciences Journal of Innovative Education, Review of Business Research, The American Statistician, Communications in Statistics, Psychometrika, Educational and Psychological Measurement, Journal of Management Sciences and Applied Cybernetics, Research Quarterly, Stats Magazine, The New York Statistician, Journal of Health Administration Education, Journal of Behavioral Medicine and Journal of Surgical Oncology.
His invited articles have appeared in The Encyclopedia of Measurement & Statistics and Encyclopedia of Statistical Sciences. He is co-author of 11 statistics texts published by Prentice Hall, including Statistics for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications and Business Statistics: A First Course. Over the years, Berenson has received several awards for teaching and for innovative contributions to statistics education. In 2005, he was the first recipient of the Catherine A. Becker Service for Educational Excellence Award at Montclair State University and, in 2012, he was the recipient of the Khubani/Telebrands Faculty Research Fellowship in the School of Business. David M. Levine is Professor Emeritus of Statistics and Computer Information Systems at Baruch College (City University of New York). He received BBA and MBA degrees in statistics from City College of New York and a PhD from New York University in industrial engineering and operations research. He is nationally recognised as a leading innovator in statistics education and is the co-author of 14 books, including such best-selling statistics textbooks as Statistics for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications, Business Statistics: A First Course and Applied Statistics for Engineers and Scientists Using Microsoft Excel and Minitab. He also is the co-author of Even You Can Learn Statistics: A Guide for Everyone Who Has Ever Been Afraid of Statistics (currently in its second edition), Six Sigma for Green Belts and Champions and Design for Six Sigma for Green Belts and Champions, and the author of Statistics for Six Sigma Green Belts, all published by FT Press, a Pearson imprint, and Quality Management, third edition, published by McGraw-Hill/Irwin. 
He is also the author of Video Review of Statistics and Video Review of Probability, both published by Video Aided Instruction, and the statistics module of the MBA primer published by Cengage Learning. He has published articles in various journals, including Psychometrika, The American Statistician, Communications in Statistics, Decision Sciences Journal of Innovative Education, Multivariate Behavioral Research, Journal of Systems Management, Quality Progress and The American Anthropologist, and he has given numerous talks at the Decision Sciences Institute (DSI), American Statistical Association (ASA) and Making Statistics More Effective in Schools and Business (MSMESB) conferences. Levine has also received several awards for outstanding teaching and curriculum development from Baruch College.

Kathryn A. Szabat is Associate Professor and Chair of Business Systems and Analytics at LaSalle University. She teaches undergraduate and graduate courses in business statistics and operations management. Szabat’s research has been published in International Journal of Applied Decision Sciences, Accounting Education, Journal of Applied Business and Economics, Journal of Healthcare Management and Journal of Management Studies. Scholarly chapters have appeared in Managing Adaptability, Intervention, and People in Enterprise Information Systems; Managing, Trade, Economies and International Business; Encyclopedia of Statistics in Behavioral Science; and Statistical Methods in Longitudinal Research. Szabat has provided statistical advice to numerous business, non-business and academic communities. Her more recent involvement has been in the areas of education, medicine and non-profit capacity building.
Szabat received a BS in mathematics from State University of New York at Albany and MS and PhD degrees in statistics, with a cognate in operations research, from the Wharton School of the University of Pennsylvania.

PART 1
Presenting and describing information

Real People, Real Stats
David McCourt
BDO

a quick q&a

Which company are you currently working for and what are some of your responsibilities?
I work at BDO, Chartered Accountants and Advisors, in the corporate finance team. My primary responsibilities include the preparation of financial models and valuation reports.

List five words that best describe your personality.
Affable, level-headed, perceptive, analytical, assured (according to my colleagues).

What are some things that motivate you?
Success, working with a team, client satisfaction.

When did you first become interested in statistics?
I never really understood statistics at school and it was a minor part of my university degree. However, statistics play a significant role in many of our valuations, including discounted cash flow valuations and share option valuations.

Complete the following sentence. A world without statistics …
… is not worth thinking about.

LET’S TALK STATS

What do you enjoy most about working in statistics?
We use data services and statistical tools that have been created by third parties. I can use, and talk reasonably knowledgeably about, statistical data without being an expert.

Describe your first statistics-related job or work experience. Was this a positive or a negative experience?
The first time I can recall using statistics was for a share option valuation. We had to determine the share price volatility based on historical share price data.
There are about half a dozen methods that can be used, all with various advantages and disadvantages. I did and still find this analysis interesting.

What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain.
Statistics provides information to support our analysis and decisions. However, the information is never perfect, and subjectivity and commercial common sense play a large part in our work.

Do you need to be good at maths to understand and use statistics successfully?
I think you need to have a logical and well-structured approach to problems. These skills would probably make you good at both maths and statistics.

Is there a high demand for statisticians in your industry (or in other industries)? Please explain.
The finance industry is heavily reliant on statistics. I expect there is high demand for statisticians from the various data providers, and in a number of specialist areas (e.g. insurance).

PRESENTING AND DESCRIBING INFORMATION

Does data collection play an important role in the decisions you make for your business/work? Please explain.
Accurate data collection is essential to our valuation projects. Although our work involves a degree of commercial acumen, it is essential that the data supports and justifies these decisions. We also aggregate data for internal business use to measure staff productivity, business performance and forecasting budgets.

Describe a project that you have worked on recently that might have involved data collection. Please be specific.
We recently valued an infrastructure asset using the discounted cash flow model. The model requires two essential inputs: the forecast of future cash flows of the asset, and the discount rate that reflects the riskiness of those cash flows. To arrive at an appropriate discount rate we generally analyse comparable companies for an indication of the level of risk that should be attributed to the asset to be valued.
In this exercise there are several instances of data collection. We collect five-year historical stock data for numerous comparable companies as an initial indication of risk. We then collect data on key financial indicators to assess the degree of comparability between the stock and the asset to be valued. To determine the risk-free rate and the market-risk premium, 10-year government bond rate data is collected.

How are these data usually summarised? What are some positives and negatives of these summary techniques?
We generally organise the collected data into Microsoft Excel workbooks. The main advantage of using this software is the ease of data analysis. Some powerful data analysis tools include data tables, What-If Analysis, Solver, charting and common statistical functions. Some shortcomings we have encountered using Excel are that data sometimes need to be rearranged depending on the analysis, [there can be] problems with inconsistent or missing data, and output can sometimes be incomplete. These factors increase the likelihood of errors in data analysis; however, for the purposes of corporate finance, Excel is generally sufficient as a means of summarising and analysing the data collected.

In your experience, what is the most commonly referred to measure of central tendency? What benefits does this measure offer over others?
In valuations, we generally prefer to use the median as a measure of central tendency rather than mean or mode. We find that the mean has one main disadvantage: it is particularly susceptible to outliers. When looking at comparable companies there are often outliers caused by one-off business issues that are irrelevant for the purposes of comparing our business. We very rarely use mode given that it only really coincides with the central tendency of data where the distribution is centre-heavy and there are generally few recurring figures in the data set.
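The point about outliers can be illustrated with a small sketch. The figures below are hypothetical valuation multiples invented purely for illustration (they do not come from the interview), and the helper name `summarise` is likewise an assumption. The sketch compares the mean and the median of a set of comparable companies before and after one extreme value is added.

```python
from statistics import mean, median

# Hypothetical multiples for five comparable companies (illustrative values only).
multiples = [7.8, 8.1, 8.4, 8.6, 9.0]

# The same set with one extreme value added, e.g. caused by a one-off business issue.
with_outlier = multiples + [31.5]


def summarise(values):
    """Return the (mean, median) pair for a list of numbers."""
    return mean(values), median(values)


base_mean, base_median = summarise(multiples)
out_mean, out_median = summarise(with_outlier)

print(f"Without outlier: mean = {base_mean:.2f}, median = {base_median:.2f}")
print(f"With outlier:    mean = {out_mean:.2f}, median = {out_median:.2f}")
```

In this made-up data set the single extreme value pulls the mean up by several units, while the median barely moves, which is the behaviour described in the answer above.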
Why is it important to be aware of the spread/variation of data points in a sample? What are the consequences of not knowing this type of information about your sample?
Without an understanding of the spread and variation of a data set there is no context to the measure of central tendency applied. A measure of central tendency summarises the data into a single value while the spread and variation of data gives an indication of how reliable an average or median summary of collected data is. For example, if the spread of values in the data set is relatively large it suggests the mean is not as representative, and a smoothing of data is required, when compared to a data set with a smaller range. Adopting a mean without reference to the spread can taint our analysis and results in a lack of validity to our decisions that are based on the data.

CHAPTER 1
Defining and collecting data

THE HONG KONG AIRPORT SURVEY
You are departing Hong Kong International Airport on the next leg of your trip and have cleared Immigration. You are approached by a researcher holding a tablet computer who asks if you can answer a few questions. The first question determines if you are a visitor to Hong Kong or a resident. After establishing that you are a visitor the questions go on to determine the purpose of your visit, the name of your hotel, the activities you have undertaken and much additional information about your visit. This information is useful for a tourism authority that has the task of marketing Hong Kong as a travel destination and monitoring the quality of visitors’ experiences in the city. It may also inform the authority’s government and commercial stakeholders, who provide transport, accommodation, and food and shopping for visitors, and be used for forward planning.
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the types of data used in business
2 identify how statistics is used in business
3 recognise the sources of data used in business
4 distinguish between different survey sampling methods
5 evaluate the quality of surveys

Not so long ago, business students were unfamiliar with the word data and had little experience handling data. Today, every time you visit a search engine website or ‘ask’ your mobile device a question, you are handling data. And if you ‘check in’ to a location or indicate that you ‘like’ something, you are creating data as well. You accept as almost true the premises of stories in which characters collect ‘a lot of data’ to uncover conspiracies, foretell disasters or catch a criminal. You hear concerns about how the government or business might be able to ‘spy’ on you in some way or how large social media companies ‘mine’ your personal data for profit. You hear the word data everywhere and may even have a ‘data plan’ for your smartphone. You know, in a general way, that data are facts about the world and that most data seem to be, ultimately, a set of numbers – that 34% of students recently polled prefer using a certain Internet browser, or that 50% of citizens believe the country is headed in the right direction, or that unemployment is down 3%, or that your best friend’s social media account has 835 friends and 202 recent posts.

You cannot escape from data in this digital world. What, then, should you do? You could try to ignore data and conduct business by relying on hunches or your ‘gut instincts’. However, if you want to use only gut instincts, then you probably shouldn’t be reading this book or taking business courses in the first place.
You could note that there is so much data in the world – or just in your own little part of the world – that you couldn’t possibly get a handle on it. You could accept other people’s data summaries and their conclusions without first reviewing the data yourself. That, of course, would expose you to fraudulent practices. Or you could do things the proper way and realise the benefits of learning the methods of statistics, the subject of this book. These procedures and methods will help you make better decisions based on solid evidence. When you begin focusing on the procedures and methods involved in collecting, presenting and summarising a set of data, or forming conclusions about those data, you have discovered statistics.

In the Hong Kong Airport survey scenario it is important that research team members focus on the information that is needed by many different stakeholders when planning for future business and tourist visitors. If the research team fails to collect important information, or misrepresents the opinions of current visitors, stakeholders may make poor decisions about advertising, pricing, facilities and other factors relevant to attracting visitors and hosting them in Hong Kong. Failure to offer suitable facilities and experiences could affect the profitability of businesses in Hong Kong. In deciding how to collect the facts that are needed, it will help if you know something about the basic concepts of statistics.

1.1 BASIC CONCEPTS OF DATA AND STATISTICS

The Meaning of ‘Data’
What do we mean by the word data? Its common use is somewhat different from its use in statistics. It could be described in a general way as meaning ‘facts about the world’.
However, statisticians distinguish between the traits or properties that relate to people or things and the actual values that these take.

variables: Characteristics or attributes that can be expected to differ from one individual to another.
data: The observed values of variables.

VARIABLES
Variables are characteristics of items or individuals.

DATA
Data are the observed values of variables.

For a group of people, we could examine the traits of age, country of birth or weight. For a group of cars, we could note the colour, current value or kilometres driven. These characteristics are called variables. Data are the values associated with these traits or properties. As an example, in Table 1.1 we find a set of data collected from six people which represents observations on three different variables.

Table 1.1
Variable               Data
Age in years           24, 18, 53, 16, 22, 31
Country of birth       Australia, China, Australia, Malaysia, India, Australia
Weight in kilograms    50.2, 74.6, 96.3, 45.2, 56.1, 87.3

In this book, the word data is always plural to remind you that data are a collection or set of values. While we could say that a single value, such as ‘Australia’, is a datum, the terms data point, observation, response or single data value are more typically encountered.

operational definition: Defines how a variable is to be measured.

All variables should have an operational definition – a universally accepted meaning that is clear to all associated with an analysis. Without operational definitions, confusion can occur. An example of a situation where operational definitions are needed is for the process of data gathering by the Australian Bureau of Statistics (ABS). The ABS needs to collect information about the country of birth of a person and also the countries in which their father and mother were born. While this might seem straightforward, definitional problems arise in the case of people who were adopted or have step- or foster parents or other guardians.
So the operational definition used is:
• ‘Country of birth of person’, which is the country identified as being the one in which the person was born
• ‘Country of birth of father’, which is the country in which the person’s birth father was born, and
• ‘Country of birth of mother’, which is the country in which the person’s birth mother was born (Australian Bureau of Statistics, Country of Birth Standard, Cat. No. 1200.0.55.004, 2016).

The Meaning of ‘Statistics’

statistics: A branch of mathematics concerned with the collection and analysis of data.

Statistics is the branch of mathematics that examines ways to process and analyse data. It provides procedures to collect and transform data in ways that are useful to business decision makers. Statistics allows you to determine whether your data represent information that could be used in making better decisions. Therefore, it helps you determine whether differences in the numbers are meaningful in a significant way or are due to chance. To illustrate, consider the following reports:
• In ‘News use across social media platforms 2016’ the Pew Research Center reported in May 2016 that 67% of the adult US population had a Facebook account and 66% of users get news from the site (<http://assets.pewresearch.org/wp-content/uploads/sites/13/2016/05/PJ_2016.05.26_social-media-and-news_FINAL-1.pdf>, accessed 12 June 2017).
• In a blog titled ‘The top 10 benefits of newspaper advertising’, the 360 Degree Marketing Group says that a study showed newspaper advertising was considered a more trusted paid medium for information (58%) compared with television (54%), radio (49%) or online (27%) (<www.360degreemarketing.com.au/Blog/bid/407663/The-Top-10-Benefits-of-Newspaper-Advertising>, accessed 12 June 2017).
Without statistics, you cannot determine whether the ‘numbers’ in these stories represent useful information. Without statistics, you cannot validate claims such as the statement that advertising in newspapers or on television is more trusted than online advertising. And without statistics, you cannot see patterns that large amounts of data sometimes reveal.

Statistics is a way of thinking that can help you make better decisions. It helps you solve problems that involve decisions based on data that have been collected. You may have had some statistics instruction in the past. If you ever created a chart to summarise data or calculated values such as averages to summarise data, you have used statistics. But there’s even more to statistics than these commonly taught techniques, as the detailed table of contents shows.

Statistics is undergoing important changes today. There are new ways of visualising data that did not exist, were not practicable or were not widely known until recently. And, increasingly, statistics today is being used to ‘listen’ to what the data might be telling you rather than just being a way to use data to prove something you want to say.

If you associate statistics with doing a lot of mathematical calculations, you will quickly learn that business statistics uses software to perform the calculations for you (and, generally, the software calculates with more precision and efficiency than you could do manually). But while you do not need to be a good manual calculator to apply statistics, because statistics is a way of thinking, you do need to follow a framework or plan to minimise possible errors of thinking and analysis. One such framework consists of the following tasks to help apply statistics to business decision making:
1. Define the data that you want to study in order to solve a problem or meet an objective.
2. Collect the data from appropriate sources.
3. Organise the data collected by developing tables.
4. Visualise the data collected by developing charts.
5. Analyse the data collected to reach conclusions and present those results.

Typically, you do the tasks in the order listed. You must always do the first two tasks to have meaningful outcomes, but, in practice, the order of the other three can change or appear inseparable. Certain ways of visualising data will help you to organise your data while performing preliminary analysis as well. In any case, when you apply statistics to decision making, you should be able to identify all five tasks, and you should verify that you have done the first two tasks before the other three.

Using this framework helps you to apply statistics to these four broad categories of business activities:
1. Summarise and visualise business data.
2. Reach conclusions from those data.
3. Make reliable forecasts about business activities.
4. Improve business processes.

descriptive statistics: The field that focuses on summarising or characterising a set of data.
inferential statistics: Uses information from a sample to draw conclusions about a population.

Throughout this book, and especially in the scenarios that begin the chapters, you will discover specific examples of how we can apply statistics to business situations.

Statistics is itself divided into two branches, both of which are applicable to managing a business. Descriptive statistics focuses on collecting, summarising and presenting a set of data. Inferential statistics uses sample data to draw conclusions about a population.

Descriptive statistics has its roots in the record-keeping needs of large political and social organisations.
Refining the methods of descriptive statistics is an ongoing task for government statistical agencies such as the Australian Bureau of Statistics and Statistics New Zealand as they prepare for each Census. In Australia, a Census is scheduled to be carried out every five years (e.g. 2011 and 2016) to count the entire population and to collect data about education, occupation, languages spoken and many other characteristics of the citizens. A large amount of planning and training is necessary to ensure that the data collected represent an accurate record of the population’s characteristics at the Census date. However, despite the best planning, such an immense data collection task can be affected by external factors. The Australian Census held in 2016 was badly affected by a computer shutdown on Census night, 9 August. The shutdown was blamed on the need to protect the system from denial-of-service cyber attacks, and it added approximately $30 million to the cost of the Census (<www.abc.net.au/news/2016-10-25/turning-router-off-and-on-could-have-prevented-census-outage/7963916>, accessed 13 July 2017).

Inferential statistics is founded on the mathematics of probability theory. Inferential methods use sample data to calculate statistics that provide estimates of the characteristics of the entire population.

Today, applications of statistical methods can be found in different areas of business. Accounting uses statistical methods to select samples for auditing purposes and to understand the cost drivers in cost accounting. Finance uses statistical methods to choose between alternative portfolio investments and to track trends in financial measures over time. Management uses statistical methods to improve the quality of the products manufactured or the services delivered by an organisation.
Marketing uses statistical methods to estimate the proportion of customers who prefer one product over another and to draw conclusions about what advertising strategy might be most useful in increasing sales of a product.

Other Important Definitions

Now that the terms variables, data and statistics have been defined, you need to understand the meaning of the terms population, sample and parameter.

population: A collection of all members of a group being investigated.
sample: The portion of the population selected for analysis.
parameter: A numerical measure of some population characteristic.
statistic: A numerical measure that describes a characteristic of a sample.

POPULATION
A population consists of all the members of a group about which you want to draw a conclusion.

SAMPLE
A sample is the portion of the population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a characteristic of a sample.

Examples of populations are all the full-time students at a university, all the registered voters in New Zealand and all the people who were customers of the local shopping centre last weekend. The term population is not limited to groups of people. We could refer to a population of all motor vehicles registered in Victoria. Two factors need to be specified when defining a population:

1. the entity (e.g. people or motor vehicles)
2. the boundary (e.g. registered to vote in New Zealand or registered in Victoria for road use).

Samples could be selected from each of the populations mentioned above.
Examples include 10 full-time students selected for a focus group; 500 registered voters in New Zealand who were contacted by telephone for a political poll; 30 customers at the shopping centre who were asked to complete a market research survey; and all the vehicles registered in Victoria that are more than 10 years old. In each case, the people or the vehicles in the sample represent a portion, or subset, of the people or vehicles comprising the population.

The average amount spent by all the customers at the local shopping centre last weekend is an example of a parameter. Information from all the shoppers in the entire population is needed to calculate this parameter. The average amount spent by the 30 customers completing the market research survey is an example of a statistic. Information from a sample of only 30 of the shopping centre’s customers is used in calculating the statistic.

1.2 TYPES OF VARIABLES

As illustrated in Figure 1.1, there are two types of variables – categorical and numerical, sometimes referred to as qualitative and quantitative variables respectively.

The Hong Kong airport survey

Travellers in the departure lounge of the busy Hong Kong International Airport are asked to complete a survey with questions about various aspects of their visit to the city and future travel plans. The interviewer first asks if the traveller is a resident or a visitor. If the traveller is a visitor, the survey proceeds. The survey includes these questions:

■ How many visits have you made to Hong Kong prior to this one?
■ How long is it since your visit here?
■ How satisfied were you with your accommodation?
    Very satisfied ■   Satisfied ■   Undecided ■   Dissatisfied ■   Very dissatisfied ■
■ How many times during this visit did you travel by ferry?
■ Shopping in Hong Kong stores gives good value for money
    Almost always ■   Sometimes ■   Very infrequently ■   Never ■
■ Was the purpose of your visit business?
    Yes ■   No ■
■ Are you likely to return to Hong Kong in the next 12 months?
    Yes ■   No ■

You have been asked to review the survey. What type of data does the survey seek to collect? What type of information can be generated from the data of the completed survey? How can the research company’s clients use this information when planning for future visitors? What other questions would you suggest for the survey?

Figure 1.1 Types of variables

VARIABLE TYPE            QUESTION TYPES                                               RESPONSES
Categorical              Do you currently own any shares?                             Yes, No
Numerical: Discrete      How many messages did you send on social media last week?    Number
Numerical: Continuous    How tall are you?                                            Centimetres

LEARNING OBJECTIVE 1: Identify the types of data used in business

categorical variables: Take values that fall into one or more categories.
numerical variables: Take numbers as their observed responses.
discrete variables: Can only take a finite or countable number of values.
continuous variables: Can take any value between specified limits.

Categorical variables yield categorical responses, such as yes or no or male or female answers. An example is the response to the question ‘Do you currently own any shares?’ because it is limited to a simple yes or no answer. Another example is the response to the question in the Hong Kong Airport survey (presented above), ‘Are you likely to return to Hong Kong in the next 12 months?’ Categorical variables can also yield more than one possible response; for example, ‘On which days of the week are you most likely to use public transport?’

Numerical variables yield numerical responses, such as your height in centimetres.
Other examples are ‘How many times during this visit did you travel by ferry?’ (from the Hong Kong Airport survey) or the response to the question, ‘How many messages did you send on social media last week?’

There are two types of numerical variables: discrete and continuous. Discrete variables produce numerical responses that arise from a counting process. ‘The number of social media messages sent’ is an example of a discrete numerical variable because the response is one of a finite number of integers. You send zero, one, two, …, 50 and so on messages. Continuous variables produce numerical responses that arise from a measuring process. Your height is an example of a continuous numerical variable because the response takes on any value within a continuum or interval, depending on the precision of the measuring instrument. For example, your height may be 158 cm, 158.3 cm or 158.2945 cm, depending on the precision of the available instruments.

No two people are exactly the same height, and the more precise the measuring device used, the greater the likelihood of detecting differences in their heights. However, most measuring devices are not sophisticated enough to detect small differences. Hence, tied observations are often found in experimental or survey data even though the variable is truly continuous and, theoretically, all values of a continuous variable are different.

Levels of Measurement and Types of Measurement Scales

Data are also described in terms of their level of measurement. There are four widely recognised levels of measurement: nominal, ordinal, interval and ratio scales.

Nominal and ordinal scales

nominal scale: A classification of categorical data that implies no ranking.

Data from a categorical variable are measured on a nominal scale or on an ordinal scale. A nominal scale (Figure 1.2) classifies data into various distinct categories in which no ranking is implied.
In the Hong Kong Airport survey, the answer to the question ‘Are you likely to return to Hong Kong in the next 12 months?’ is an example of a nominally scaled variable, as are your favourite soft drink, your political party affiliation and your gender. Nominal scaling is the weakest form of measurement because you cannot specify any ranking across the various categories.

Figure 1.2 Examples of nominal scaling

CATEGORICAL VARIABLE           CATEGORIES
Personal computer ownership    Yes, No
Type of fuel used              Unleaded, Premium Unleaded, Diesel, LPG
Internet connection            Cable, Wireless

ordinal scale: Scale of measurement where values are assigned by ranking.

An ordinal scale classifies data into distinct categories in which ranking is implied. In the Hong Kong Airport survey, the answers to the question ‘Shopping in Hong Kong stores gives good value for money’ represent an ordinal scaled variable because the responses ‘almost always, sometimes, very infrequently and never’ are ranked in order of frequency. Figure 1.3 lists other examples of ordinal scaled variables.

Figure 1.3 Examples of ordinal scaling

CATEGORICAL VARIABLE     ORDERED CATEGORIES
Product satisfaction     Very unsatisfied, Fairly unsatisfied, Neutral, Fairly satisfied, Very satisfied
Clothing size            S, M, L, XL
Type of Olympic medal    Gold, Silver, Bronze
Education level          Primary, Secondary, Tertiary

Ordinal scaling is a stronger form of measurement than nominal scaling because an observed value classified into one category possesses more or less of a property than does an observed value classified into another category. However, ordinal scaling is still a relatively weak form of measurement because the scale does not account for the amount of the differences between the categories. The ordering implies only which category is ‘greater’, ‘better’ or ‘more preferred’ – not by how much.
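The difference between nominal and ordinal scaling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method used in this book, which relies on Excel: the category lists are taken from Figures 1.2 and 1.3, while the names `FUEL_TYPES`, `SATISFACTION_ORDER` and `rank` are hypothetical.

```python
# Nominal scale: categories with no implied ranking (fuel types, Figure 1.2).
# Only equality or membership checks are meaningful.
FUEL_TYPES = {"Unleaded", "Premium Unleaded", "Diesel", "LPG"}

# Ordinal scale: categories with an implied ranking (product satisfaction,
# Figure 1.3), listed here from lowest to highest.
SATISFACTION_ORDER = [
    "Very unsatisfied",
    "Fairly unsatisfied",
    "Neutral",
    "Fairly satisfied",
    "Very satisfied",
]


def rank(response):
    """Return the position of an ordinal response (hypothetical helper)."""
    return SATISFACTION_ORDER.index(response)


responses = ["Neutral", "Very satisfied", "Fairly unsatisfied"]

# Ranking is meaningful for ordinal data, so the responses can be sorted.
sorted_responses = sorted(responses, key=rank)

# The rank numbers carry order only: they say which response is higher,
# not by how much, so differences between ranks are not meaningful.
is_higher = rank("Very satisfied") > rank("Neutral")

print("Diesel" in FUEL_TYPES)
print(sorted_responses)
print(is_higher)
```

Sorting works for the satisfaction responses because an order was supplied. No such order exists for the fuel types, so any sorted listing of them would be arbitrary.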
Interval and ratio scales

interval scale: A ranking of numerical data where differences are meaningful but there is no true zero point.
ratio scale: A ranking where the differences between measurements involve a true zero point.

Data from a numerical variable are measured on an interval or ratio scale. An interval scale (Figure 1.4) is an ordered scale in which the difference between measurements is a meaningful quantity but does not involve a true zero point. For example, sports shoes for adults are often sold in Australia marked with sizes based on the US or UK system. Neither system has a true zero size. The size below an adult size 1 is a child’s size 13. However, in each system the intervals between sizes are equal.

Figure 1.4 Examples of interval and ratio scales

NUMERICAL VARIABLE                         LEVEL OF MEASUREMENT
Shoe size (UK or US)                       Interval
Height (in centimetres)                    Ratio
Weight (in kilograms)                      Ratio
Salary (in US dollars or Japanese yen)     Ratio

A ratio scale is an ordered scale in which the difference between the measurements involves a true zero point, as in length, weight, age or salary measurements, and the ratio of two values is meaningful. In the Hong Kong Airport survey, the number of times a visitor travelled by ferry is an example of a ratio scaled variable, as six trips is three times as many as two trips. As another example, a carton that weighs 40 kg is twice as heavy as one that weighs 20 kg.

Data measured on an interval scale or on a ratio scale constitute the highest levels of measurement. They are stronger forms of measurement than an ordinal scale, because you can determine not only which observed value is the largest but also by how much. Interval and ratio scales may apply for either discrete or continuous data.
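A short Python sketch can show why ratios are meaningful on a ratio scale but not on an interval scale. The sizing rule below is a hypothetical linear formula invented for this illustration; it is not the actual UK or US shoe-sizing standard, and the constants `SIZES_PER_CM` and `SIZE_OFFSET` are assumptions.

```python
import math

# Hypothetical linear sizing rule, for illustration only: it is NOT the
# actual UK or US sizing standard. Size 0 does not mean "no length".
SIZES_PER_CM = 1.5
SIZE_OFFSET = -30.0


def length_from_size(size):
    """Foot length in cm implied by a shoe size under the hypothetical rule."""
    return (size - SIZE_OFFSET) / SIZES_PER_CM


# Interval property: equal differences in size mean equal differences in length.
gap_low = length_from_size(7) - length_from_size(6)
gap_high = length_from_size(12) - length_from_size(11)

# Ratios of sizes are not meaningful: size 12 is not "twice" size 6.
size_ratio = 12 / 6
length_ratio = length_from_size(12) / length_from_size(6)

# Ratio scale: weight has a true zero, so 40 kg is twice as heavy as 20 kg.
weight_ratio = 40 / 20

print(math.isclose(gap_low, gap_high))
print(size_ratio, round(length_ratio, 3))
print(weight_ratio)
```

Under this assumed rule the size numbers double while the underlying length grows by only about a sixth, which is why a statement such as ‘twice the size’ has no physical meaning on an interval scale.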
Think about this: Telephone polling

Companies such as Newspoll regularly undertake market research and political polling conducted by phone interviews. A phone poll conducted by Newspoll in Sydney in November 2014 asked questions about a number of topics. Some were demographic questions about the number of people who lived in the household and the age, income, occupation and marital status of the participant. What would be the purpose of asking such questions?

The other questions could be divided into three sections. The first section related to voting intentions for the next state election and the level of satisfaction with the premier and the opposition leader. The second section asked the participant’s opinion on the renewal of the federal government’s ban on super trawlers. The third section asked a number of questions about domestic and international air travel undertaken in the past year. These questions covered areas such as the purpose of travel, the airlines used and level of satisfaction. Who would use the data collected in this poll? If you were designing a similar poll, how would you construct questions to collect data for the variables referred to above?

More recently, political and business functions of Newspoll have been separated. To see how results of the latest political polls are published in the Australian, go to <www.theaustralian.com.au/nationalaffairs/newspoll>. To see some public opinion poll reports, go to <www.omnipoll.com.au>.

Problems for Section 1.2

LEARNING THE BASICS

1.1 Three different types of drinks are sold at a fast-food restaurant – soft drinks, fruit juices and coffee.
    a. Explain why the type of drinks sold is an example of a categorical variable.
    b. Explain why the type of drinks sold is an example of a nominally scaled variable.

1.2 Coffee is sold in three sizes in takeaway cardboard cups – small, medium and large. Explain why the size of the coffee cup is an example of an ordinal scaled variable.

1.3 Suppose that you measure the time it takes to download an MP3 file from the Internet.
    a. Explain why the download time is a numerical variable.
    b. Explain why the download time is a ratio scaled variable.

APPLYING THE CONCEPTS

1.4 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
    a. Number of mobile phones per household
    b. Length (in minutes) of the longest mobile call made per month
    c. Whether all mobile phones in the household use the same telecommunications provider
    d. Whether there is a landline telephone in the household

1.5 The following information is collected from students as they leave the campus bookshop during the first week of classes:
    a. Amount of time spent shopping in the bookshop
    b. Number of textbooks purchased
    c. Name of degree
    d. Gender
    Classify each of these variables as categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.

1.6 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
    a. Name of Internet provider
    b. Amount of time spent surfing the Internet per week
    c. Number of emails received per week
    d. Number of online purchases made per month

1.7 Suppose the following information is collected from Andrew and Fiona Chen on their application for a home loan mortgage at Metro Home Loans:
    a. Monthly expenses: $2,056
    b. Number of dependants being supported by applicant(s): 2
    c. Annual family salary income: $105,000
    d. Marital status: Married
    Classify each of the responses by type of data and level of measurement.

1.8 One of the variables most often included in surveys is income. Sometimes the question is phrased, ‘What is your income (in thousands of dollars)?’ In other surveys, the respondent is asked to ‘Place an X in the circle corresponding to your income group’ and given a number of ranges to choose from.
    a. In the first format, explain why income might be considered either discrete or continuous.
    b. Which of these two formats would you prefer to use if you were conducting a survey? Why?
    c. Which of these two formats would probably bring you a greater rate of response? Why?

1.9 The director of research at the e-business section of a major department store wants to conduct a survey throughout Australia to determine the amount of time working women spend shopping online for clothing in a typical month.
    a. Describe the population and the sample of interest, and indicate the type of data the director might wish to collect.
    b. Develop a first draft of the questionnaire needed in (a) by writing a series of three categorical questions and three numerical questions that you feel would be appropriate for this survey.

1.10 A university researcher designs an experiment to see how generous participants will be in giving to charity. Discuss the types of variables the experiment might give compared with a survey of the same subjects about donations to charity.

1.11 Before a company undertakes an online marketing campaign it needs to consider information about its own current sales and the sales made by its competitors. What categorical data might it use?
1.3 COLLECTING DATA

In the Hong Kong Airport scenario, identifying the data that need to be collected is an important step in the process of marketing the city and operational planning. Some of the data will come from consumers through market research. It is important that the correct inferences are drawn from the research and that appropriate statistical methods assist planners and designers to make the right decisions.

Managing a business effectively requires collecting the appropriate data. In most cases, the data are measurements acquired from items in a sample. The samples are chosen from populations in such a manner that the sample is as representative of the population as possible. The most common technique to ensure proper representation is to use a random sample. (See Section 1.4 for a detailed discussion of sampling techniques.)

Many different types of circumstances require the collection of data:

• A marketing research analyst needs to assess the effectiveness of a new television advertisement.
• A pharmaceutical manufacturer needs to determine whether a new drug is more effective than those currently in use.
• An operations manager wants to monitor a manufacturing process to find out whether the quality of output being produced is conforming to company standards.
• An auditor wants to review the financial transactions of a company to determine whether or not the company is in compliance with generally accepted accounting principles.
• A potential investor wants to determine which firms within which industries are likely to have accelerated growth in a period of economic recovery.

LEARNING OBJECTIVE 2: Identify how statistics is used in business

Identifying Sources of Data

Identifying the most appropriate source of data is a critical aspect of statistical analysis. If biases, ambiguities or other types of errors flaw the data being collected, even the most sophisticated statistical methods will not produce accurate information.
Five important sources of data are:

• data distributed by an organisation or an individual
• a designed experiment
• a survey
• an observational study
• data collected by ongoing business activities.

LEARNING OBJECTIVE 3: Recognise the sources of data used in business

primary sources: Provide information collected by the data analyser.
secondary sources: Provide data collected by another person or organisation.
focus group: A group of people who are asked about attitudes and opinions for qualitative research.

Data sources are classified as either primary sources or secondary sources. When the data collector is the one using the data for analysis, the source is primary. When another organisation or individual has collected the data that are used for analysis by an organisation or individual, the source is secondary.

Organisations and individuals that collect and publish data typically use this information as a primary source and then let others use the data as a secondary source. For example, the Australian federal government collects and distributes data in this way for both public and private purposes. The Australian Bureau of Statistics oversees a variety of ongoing data collection in areas such as population, the labour force, energy, and the environment and health care, and publishes statistical reports. The Reserve Bank of Australia collects and publishes data on exchange rates, interest rates and ATM and credit card transactions.

Market research firms and trade associations also distribute data pertaining to specific industries or markets. Investment services such as Morningstar provide financial data on a company-by-company basis. Syndicated services such as Nielsen provide clients with data enabling the comparison of client products with those of their competitors.
Daily newspapers in print and online formats are filled with numerical information about share prices, weather conditions and sports statistics.

As listed above, conducting an experiment is another important data-collection source. For example, to test the effectiveness of laundry detergent, an experimenter determines which brands in the study are more effective in cleaning soiled clothes by actually washing dirty laundry instead of asking customers which brand they believe to be more effective. Proper experimental designs are usually the subject matter of more advanced texts, because they often involve sophisticated statistical procedures. However, some fundamental experimental design concepts are considered in Chapter 11.

Conducting a survey is a third important data source. Here, the people being surveyed are asked questions about their beliefs, attitudes, behaviours and other characteristics. Responses are then edited, coded and tabulated for analysis.

Conducting an observational study is the fourth important data source. In such a study, a researcher observes the behaviour directly, usually in its natural setting. Observational studies take many forms in business. One example is the focus group, a market research tool that is used to elicit unstructured responses to open-ended questions. In a focus group, a moderator leads the discussion and all the participants respond to the questions asked. Other, more structured types of studies involve group dynamics and consensus building and use various organisational-behaviour tools such as brainstorming, the Delphi technique and the nominal-group method. Observational study techniques are also used in situations in which enhancing teamwork or improving the quality of products and services are management goals.

Data collected through ongoing business activities are a fifth data source.
Such data can be collected from operational and transactional systems that exist in both physical ‘bricks-and-mortar’ and online settings but can also be gathered from secondary sources such as third-party social media networks and online apps and website services that collect tracking and usage data. For example, a bank might analyse a decade’s worth of financial transaction data to identify patterns of fraud, and a marketer might use tracking data to determine the effectiveness of a website.

‘Big Data’

big data: Large data sets characterised by their volume, velocity and variety.

Relatively recent advances in information technology allow businesses to collect, process, and analyse very large volumes of data. Because the operational definition of ‘very large’ can be partially dependent on the context of a business – what might be ‘very large’ for a sole proprietorship might be commonplace and small for a multinational corporation – many use the term big data.

Big data is more of a fuzzy concept than a term with a precise operational definition, but it implies data that are being collected in huge volumes and at very fast rates (typically in real time) and data that arrive in a variety of forms, both organised and unorganised. These attributes of ‘volume, velocity, and variety’, first identified in 2001 (see reference 1), make big data different from any of the data sets used in this book.

Big data increases the use of business analytics because the sheer size of these very large data sets makes preliminary exploration of the data using older techniques impracticable. This effect is explored in Chapter 20.

Big data tends to draw on a mix of primary and secondary sources.
For example, a retailer interested in increasing sales might mine Facebook and Twitter accounts to identify sentiment about certain products or to pinpoint top influencers and then match those data to its own data collected during customer transactions.

Data Formatting

The data you collect may be formatted in more than one way. For example, suppose that you wanted to collect electronic financial data about a sample of companies. The data you seek to collect could be formatted in any number of ways, including:

• tables of data
• contents of standard forms
• a continuous data stream
• messages delivered from social media websites and networks.

These examples illustrate that data can exist in either a structured or an unstructured form. Structured data are data that follow some organising principle or plan, typically a repeating pattern. For example, a simple ASX share price search record is structured because each entry would have the name of a company, the last sale, change in price, bid price, volume traded, and so on. Due to their inherent organisation, tables and forms are also structured. In a table, each row contains a set of values for the same columns (i.e. variables), and in a set of forms, each form contains the same set of entries. For example, once we identify that the second column of a table or the second entry on a form contains the family name of an individual, then we know that all entries in the second column of the table or all of the second entries in all copies of the form contain the family name of an individual.

In contrast, unstructured data follow no repeating pattern. For example, if five different people sent you an email message concerning the share trades of a specific company, those data could be anywhere in the message. You could not reliably count on the name of the company being the first words of each message (as in the ASX search), and the pricing, volume and percentage of change data could appear in any order.
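The contrast between structured and unstructured data can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the company codes `AAA` and `BBB`, the prices and volumes, and the wording of the two messages are invented, and the column names are hypothetical rather than an actual ASX record layout.

```python
import csv
import io

# Structured data: every record has the same fields in the same order.
# Company codes and figures are invented for illustration.
structured_text = """company,last_sale,change,volume
AAA,10.50,0.10,12000
BBB,3.20,-0.05,80000
"""

rows = list(csv.DictReader(io.StringIO(structured_text)))

# Because the pattern repeats, one rule retrieves the same variable
# from every record.
last_sales = [float(row["last_sale"]) for row in rows]

# Unstructured data: two hypothetical messages carrying the same facts
# in different positions and wording.
messages = [
    "AAA closed at 10.50 today, up 0.10 on volume of 12000.",
    "Volume was 12000 for AAA; the last sale I saw was 10.50.",
]

# A positional rule such as "the company is the first word" is unreliable.
first_words = [message.split()[0] for message in messages]

print(last_sales)
print(first_words)
```

The same positional rule that works on every row of the table returns a company code for the first message and an unrelated word for the second, which is the practical difficulty that unstructured data present.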
Earlier in this section, big data was defined, in part, as data that arrive in a variety of forms, both organised and unorganised. You can restate that definition as ‘big data exists as both structured and unstructured data’.

The ability to handle unstructured data represents an advance in information technology. Chapter 20 discusses business analytics methods that can analyse structured data as well as unstructured data or semi-structured data. (Think of an application form that contains structured form-fills but also contains an unstructured free-response portion.)

With the exception of some of the methods discussed in Chapter 20, the methods taught and the software techniques used in this book involve structured data. Your beginning point will always be tabular data, and for many problems and examples you can begin with that data in the form of a Microsoft Excel worksheet that you can download and use (see companion website).

Electronic formats and encoding need to be considered. Data can exist in more than one electronic format. This affects data formatting, as some electronic formats are more immediately usable than others. For example, which data would you like to use: data in an electronic worksheet file or data in a scanned image file that contains one of the worksheet illustrations in this book? Unless you like to do extra work, you would choose the first format because the second would require you to employ a translation process – perhaps a character-scanning program that can recognise numbers in an image.

Data can also be encoded in more than one way, as you may have learned in an information systems course. Different encodings can affect the precision of values for numerical variables, and that can make some data not fully compatible with other data you have collected.

structured data: Data that follow an organised pattern.
unstructured data: Data that have no repeated pattern.
electronic formats: Data in a form that can be read by a computer.
encoding: Representing data by numbers or symbols to convert the data into a usable form.

Data Cleaning

outliers: Values that appear to be excessively large or small compared with most values observed.
missing values: Refers to when no data value is stored for one or more variables in an observation.

No matter how you choose to collect data, you may find irregularities in the values you collect, such as undefined or impossible values. For a categorical variable, an undefined value would be a value that does not represent one of the categories defined for the variable. For a numerical variable, an impossible value would be a value that falls outside a defined range of possible values for the variable. For a numerical variable without a defined range of possible values, you might also find outliers, values that seem excessively different from most of the rest of the values. Such values may or may not be errors, but they demand a second review.

Missing values are another type of irregularity. They are values that were not able to be collected (and therefore are not available for analysis). For example, you would record a nonresponse to a survey question as a missing value. You can represent missing values in some computer programs and such values will be properly excluded from analysis. The more limited Excel has no special values that represent a missing value. When using Excel, you must find and then exclude missing values manually.

When you spot an irregularity, you may have to ‘clean’ the data you have collected. A full discussion of data cleaning is beyond the scope of this book. (See reference 2 for more information.)

Recoding Variables

recoded variable: A variable that has been assigned new values that replace the original ones.
mutually exclusive Two events that cannot occur simultaneously.
collectively exhaustive Set of events such that one of the events must occur.

After you have collected data, you may discover that you need to reconsider the categories that you have defined for a categorical variable, or that you need to transform a numerical variable into a categorical variable by assigning the individual numeric data values to one of several groups. In either case, you can define a recoded variable that supplements or replaces the original variable in your analysis. For example, when defining households by their location, the suburb or town recorded might be replaced by a new variable of the postcode.

When recoding variables, be sure that the category definitions cause each data value to be placed in one and only one category, a property known as being mutually exclusive. Also ensure that the set of categories you create for the new, recoded variable includes all the data values being recoded, a property known as being collectively exhaustive. If you are recoding a categorical variable, you can preserve one or more of the original categories, as long as your recoded values are both mutually exclusive and collectively exhaustive.

When recoding numerical variables, pay particular attention to the operational definitions of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories ‘Under 12’, ‘12–20’, ‘21–34’, ‘35–59’ and ‘60 and over’ are self-defining for age, the categories ‘Child’, ‘Youth’, ‘Young adult’, ‘Middle aged’ and ‘Senior’ need their own operational definitions.

Problems for Section 1.3

APPLYING THE CONCEPTS

1.12 The Data and Story Library (DASL) is an online library of data files and stories that illustrate the use of basic statistical methods. Visit <http://.lib.stat.cmu.edu/DASL>, click Power search, and explore a datafile of interest to you.
Which of the five sources of data best describes the sources of the datafile you selected?
1.13 Visit the website of Ipsos Australia at <www.ipsos.com.au>. Read about a recent poll or news story. What type of data source is this based on?
1.14 Visit the website of the Pew Research Center at <www.pewresearch.org>. Read one of today’s top stories. What type of data source is the story based on?
1.15 Transportation engineers and planners want to address the dynamic properties of travel behaviour by describing in detail the driving characteristics of drivers over the course of a month. What type of data collection source do you think the transportation engineers and planners should use?
1.16 Visit the homepage of the Statistics Portal ‘Statista’ at <www.statista.com>. Go to Statistics>Popular Statistics, then choose one item to examine. What type of data source is the information presented here based on?

1.4 TYPES OF SURVEY SAMPLING METHODS

LEARNING OBJECTIVE 4 Distinguish between different survey sampling methods

In Section 1.1 a sample was defined as the portion of the population that has been selected for analysis. You collect your data from either a population or a sample depending on whether all items or people about whom you wish to reach conclusions are included. Rather than taking a complete census of the whole population, statistical sampling procedures focus on collecting a small representative group of the larger population. The sample results are used to estimate characteristics of the entire population. The three main reasons for drawing a sample are:
1. A sample is less time-consuming than a census.
2. A sample is less costly to administer than a census.
3. A sample is less cumbersome and more practical to administer than a census.

The sampling process begins by defining the frame. The frame is a listing of items that make up the population. Frames are data sources such as population lists, directories or maps. Samples are drawn from these frames. Inaccurate or biased results can occur if the frame excludes certain groups of the population. Using different frames to generate data can lead to opposite conclusions.

frame A list of the items in the population of interest.

Once you select a frame, you draw a sample from the frame. As illustrated in Figure 1.5, there are two kinds of samples: the non-probability sample and the probability sample.

Figure 1.5 Types of samples
Non-probability samples: judgment sample, quota sample, chunk sample, convenience sample
Probability samples: simple random sample, systematic sample, stratified sample, cluster sample

In a non-probability sample, you select the items or individuals without knowing their probabilities of selection. Thus, the theory that has been developed for probability sampling cannot be applied to non-probability samples. A common type of non-probability sampling is convenience sampling. In convenience sampling, items are selected based only on the fact that they are easy, inexpensive or convenient to sample. In some cases, participants are self-selected. For example, many companies conduct surveys by giving visitors to their website the opportunity to complete survey forms and submit them electronically. The response to these surveys can provide large amounts of data quickly, but the sample consists of self-selected web users. For many studies, only a non-probability sample such as a judgment sample is available. In a judgment sample, you get the opinions of preselected experts in the subject matter as to who should be included in the survey. Some other common procedures of non-probability sampling are quota sampling and chunk sampling.
These are discussed in detail in specialised books on sampling methods (see references 3 and 4).

Non-probability samples can have certain advantages such as convenience, speed and lower cost. However, their lack of accuracy due to selection bias and their poorer capacity to provide generalised results more than offset these advantages. Therefore, you should restrict the use of non-probability sampling methods to situations in which you want to get rough approximations at low cost to satisfy your curiosity about a particular subject, or to small-scale studies that precede more rigorous investigations.

non-probability sample One where selection is not based on known probabilities.
convenience sampling Selection using a method that is easy or inexpensive.
judgment sample Gives the opinions of preselected experts.

In a probability sample, you select the items based on known probabilities. Whenever possible, you should use probability sampling methods. The samples based on these methods allow you to make unbiased inferences about the population of interest. In practice, it is often difficult or impossible to take a probability sample. However, you should work towards achieving a probability sample and acknowledge any potential biases that might exist.

probability sample One where selection is based on known probabilities.

The four types of probability samples most commonly used are simple random, systematic, stratified and cluster. These sampling methods vary in their cost, accuracy and complexity.

Simple Random Sample

simple random sample One where each item in the frame has an equal chance of being selected.
sampling with replacement An item in the frame can be selected more than once.
sampling without replacement Each item in the frame can be selected only once.
table of random numbers Shows a list of numbers generated in a random sequence. In a simple random sample, every item from a frame has the same chance of selection as every other item. In addition, every sample of a fixed size has the same chance of selection as every other sample of that size. Simple random sampling is the most elementary random sampling technique. It forms the basis for the other random sampling techniques. With simple random sampling, you use n to represent the sample size and N to represent the frame size. You number every item in the frame from 1 to N. The chance that you will select any particular member of the frame on the first draw is 1/N. You select samples with replacement or without replacement. Sampling with replacement means that after you select an item you return it to the frame, where it has the same probability of being selected again. Imagine you have a barrel which contains the shopping dockets of N shoppers at a major retail centre who are entering a competition. First assume that each shopper can have only one entry but can win more than one prize. The barrel is rolled, opened and the entry of Jason O’Brien is selected. His docket is replaced, the barrel is rolled again and a second docket is chosen. Jason’s docket has the same probability of being selected again, 1/N. You repeat this process until you have selected the desired sample size n. However, it is usually more desirable to have a sample of different items than to permit a repetition of measurements on the same item. Sampling without replacement means that once you select an item it cannot be selected again. The chance that you will select any particular item in the frame, say the shopping docket of Jason O’Brien on the first draw is 1/N. The chance that you will select any shopping docket not previously selected on the second draw is now 1 out of N – 1. This process continues until you have selected the desired sample of size n. 
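The difference between the two selection schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame size N = 800, the sample size n = 40, the seed and the function names are chosen here for the example, and the standard library's pseudo-random generator stands in for the barrel.

```python
import random

def draw_with_replacement(frame, n, rng):
    """Every draw is made from the full frame, so an item can be selected more than once."""
    return rng.choices(frame, k=n)

def draw_without_replacement(frame, n, rng):
    """Once an item is selected it cannot be selected again."""
    return rng.sample(frame, n)

N = 800                        # frame size (assumed for illustration)
n = 40                         # sample size (assumed for illustration)
frame = list(range(1, N + 1))  # items numbered 1 to N
rng = random.Random(2019)      # arbitrary seed so the run is repeatable

with_repl = draw_with_replacement(frame, n, rng)
without_repl = draw_without_replacement(frame, n, rng)

# A with-replacement sample may contain repeats; a without-replacement sample never does.
print("distinct items, with replacement:   ", len(set(with_repl)))
print("distinct items, without replacement:", len(set(without_repl)))
```

With replacement, the count of distinct items can fall below n because the same item may be drawn again; without replacement, it always equals n.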
Regardless of whether you have sampled with or without replacement, barrel draw methods have a major drawback for sample selection. In a crowded barrel, it is difficult to mix the entries thoroughly and ensure that the sample is selected randomly. As barrel draw methods are not very useful, you need to use less cumbersome and more scientific methods of selection.

One such method uses a table of random numbers (see Table E.1 in Appendix E of this book) for selecting the sample. A table of random numbers consists of a series of digits listed in a randomly generated sequence (see reference 5). Because the numeric system uses 10 digits (0, 1, 2, …, 9), the chance that you will randomly generate any particular digit is equal to the probability of generating any other digit. This probability is 1 out of 10. Hence, if a sequence of 800 digits is generated, you would expect about 80 of them to be the digit 0, 80 to be the digit 1, and so on. In fact, those who use tables of random numbers usually test the generated digits for randomness prior to using them. Table E.1 has met all such criteria for randomness. Because every digit or sequence of digits in the table is random, the table can be read either horizontally or vertically. The margins of the table designate row numbers and column numbers. The digits themselves are grouped into sequences of five in order to make reading the table easier.

To use such a table instead of a barrel for selecting the sample, you first need to assign code numbers to the individual members of the frame. Then you get the random sample by reading the table of random numbers and selecting those individuals from the frame whose assigned code numbers match the digits found in the table. Example 1.1 demonstrates the process of sample selection.
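The table-reading procedure just described can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name read_codes and its parameters are hypothetical, and the digit string is assumed to be row 06 of Table 1.2 read from column 05 onwards, followed by the first five digits of row 07.

```python
def read_codes(digits, code_width, frame_size, sample_size, with_replacement=False):
    """Read consecutive code numbers from a string of random digits.

    Codes outside 1..frame_size are discarded. Repeated codes are discarded
    when sampling without replacement and kept when sampling with replacement.
    """
    selected = []
    for i in range(0, len(digits) - code_width + 1, code_width):
        code = int(digits[i:i + code_width])
        if code < 1 or code > frame_size:
            continue                      # e.g. 000 or 801 to 999 when N = 800
        if not with_replacement and code in selected:
            continue                      # repeat, not allowed without replacement
        selected.append(code)
        if len(selected) == sample_size:
            break
    return selected

# Assumed input: Table 1.2, row 06 from column 05, then the start of row 07.
row_06_from_col_05 = "003364884720433463919363941109592470"
row_07_start = "70543"
digits = row_06_from_col_05 + row_07_start

first_ten = read_codes(digits, code_width=3, frame_size=800, sample_size=10)
print(first_ten)
```

In this run the three-digit sequences 884, 919 and 941 are skipped because they exceed the frame size of 800, and the remaining codes are kept in the order in which they are read.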
EXAMPLE 1.1 SELECTING A SIMPLE RANDOM SAMPLE USING A TABLE OF RANDOM NUMBERS

A company wants to select a sample of 32 full-time workers from a population of 800 full-time employees in order to collect information on expenditures concerning a company-sponsored dental plan. How do you select a simple random sample?

SOLUTION

The company can contact all employees by email but assumes that not everyone will respond to the survey, so you need to distribute more than 32 surveys to get the desired 32 responses. Assuming that 8 out of 10 full-time workers will respond to such a survey (i.e. a response rate of 80%), you decide to email 40 surveys, since 32/0.8 = 40.

The frame consists of a listing of the names and email addresses of all N = 800 full-time employees taken from the company personnel files. Thus, the frame is an accurate and complete listing of the population. To select the random sample of 40 employees from this frame, you use a table of random numbers, as shown in Table 1.2 on page 20. Because the population size (800) is a three-digit number, each assigned code number must also be three digits so that every full-time worker has an equal chance of selection. You give a code of 001 to the first full-time employee in the population listing, a code of 002 to the second full-time employee in the population listing, and so on, until a code of 800 is given to the Nth full-time worker in the listing. Because N = 800 is the largest possible coded value, you discard all three-digit code sequences outside the range 001 to N (i.e. 801 to 999 and 000).

To select the simple random sample, you choose an arbitrary starting point from the table of random numbers. One method you can use is to close your eyes and strike the table of random numbers with a pencil. Suppose you use this procedure and select row 06, column 05, of Table 1.2 (which is extracted from Table E.1) as the starting point.
Although you can go in any direction, in this example you will read the table from left to right in sequences of three digits without skipping. The individual with code number 003 is the first full-time employee in the sample (row 06 and columns 05–07), the second individual has code number 364 (row 06 and columns 08–10) and the third sequence read is 884. Because the highest code for any employee is 800, you discard this number. Individuals with code numbers 720, 433, 463, 363, 109, 592, 470 and 705 are selected third to tenth, respectively (the sequences 919 and 941, which also exceed 800, are discarded along the way). You continue the selection process until you get the needed sample size of 40 full-time employees. During the selection process, if any three-digit coded sequence is repeated, you include the employee corresponding to that coded sequence again as part of the sample, if sampling with replacement. You discard the repeating coded sequence if sampling without replacement.

Table 1.2 Using a table of random numbers
Source: Data from the Rand Corporation, from A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL: The Free Press, 1955) (displayed in Table E.1 in Appendix E of this book).
Begin selection (row 06, column 5)

       Column
       00000  00001  11111  11112  22222  22223  33333  33334
Row    12345  67890  12345  67890  12345  67890  12345  67890
01     49280  88924  35779  00283  81163  07275  89863  02348
02     61870  41657  07468  08612  98083  97349  20775  45091
03     43898  65923  25078  86129  78496  97653  91550  08078
04     62993  93912  30454  84598  56095  20664  12872  64647
05     33850  58555  51438  85507  71865  79488  76783  31708
06     97340  03364  88472  04334  63919  36394  11095  92470
07     70543  29776  10087  10072  55980  64688  68239  20461
08     89382  93809  00796  95945  34101  81277  66090  88872
09     37818  72142  67140  50785  22380  16703  53362  44940
10     60430  22834  14130  96593  23298  56203  92671  15925
11     82975  66158  84731  19436  55790  69229  28661  13675
12     39087  71938  40355  54324  08401  26299  49420  59208
13     55700  24586  93247  32596  11865  63397  44251  43189
14     14756  23997  78643  75912  83832  32768  18928  57070
15     32166  53251  70654  92827  63491  04233  33825  69662
16     23236  73751  31888  81718  06546  83246  47651  04877
17     45794  26926  15130  82455  78305  55058  52551  47182
18     09893  20505  14225  68514  46427  56788  96297  78822
19     54382  74598  91499  14523  68479  27686  46162  83554
20     94750  89923  37089  20048  80336  94598  26940  36858
21     70297  34135  53140  33340  42050  82341  44104  82949
22     85157  47954  32979  26575  57600  40881  12250  73742
23     11100  02340  12860  74697  96644  89439  28707  25815
24     36871  50775  30592  57143  17381  68856  25853  35041
25     23913  48357  63308  16090  51690  54607  72407  55538

Systematic Sample

systematic sample A method that involves selecting the first element randomly then choosing every kth element thereafter.

In a systematic sample, you partition the N items in the frame into n groups of k items where:

k = N/n

You round k to the nearest integer. To select a systematic sample, you choose the first item to be selected at random from the first k items in the frame. Then you select the remaining n – 1 items by taking every kth item thereafter from the entire frame.
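The systematic selection rule above can be sketched in Python as follows. This is an illustrative sketch only: the function name systematic_sample and its parameters are hypothetical, items are assumed to be numbered 1 to N, and the values N = 800 and n = 40 are borrowed from the employee frame used in Example 1.1.

```python
import random

def systematic_sample(N, n, start=None, rng=None):
    """Return the item numbers (1..N) chosen by a systematic sample of size n."""
    # Interval k = N/n rounded to the nearest integer. Python's round() sends
    # exact halves to the nearest even integer; that does not arise here.
    k = max(1, round(N / n))
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)        # random start among the first k items
    if not 1 <= start <= k:
        raise ValueError("start must lie within the first k items")
    # Take every kth item after the start, stopping once n items are selected.
    return list(range(start, N + 1, k))[:n]

# Fixed start of 8 so that the result is repeatable.
picked = systematic_sample(N=800, n=40, start=8)
print(picked[:5], "...", picked[-2:])
```

Leaving out the start argument lets the function choose the starting item at random from the first k items, as the rule requires; the fixed start is used here only so that the output can be checked by hand.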
If the frame consists of a listing of prenumbered cheques, sales receipts or invoices, a systematic sample is faster and easier to take than a simple random sample. A systematic sample is also a convenient mechanism for collecting data from telephone directories, class rosters and consecutive items coming off an assembly line.

To take a systematic sample of n = 40 from the population of N = 800 employees, you partition the frame of 800 into 40 groups, each of which contains 20 employees. You then select a random number from the first 20 individuals, and include every 20th individual after the first selection in the sample. For example, if the first number you select is 008, your subsequent selections are 028, 048, 068, 088, 108, … , 768 and 788.

Although they are simpler to use, simple random sampling and systematic sampling are generally less efficient than other, more sophisticated probability sampling methods. Even greater possibilities for selection bias and lack of representation of the population characteristics occur from systematic samples than from simple random samples. If there is a pattern in the frame, you could have severe selection biases. To overcome the potential problem of disproportionate representation of specific groups in a sample, you can use either stratified sampling methods or cluster sampling methods.

Stratified Sample

In a stratified sample, you first subdivide the N items in the frame into separate subpopulations, or strata. A stratum is defined by some common characteristic. You select a simple random sample from each stratum, in proportion to the size of the strata, and combine the results from the separate simple random samples.
This method is more efficient than either simple random sampling or systematic sampling because you are assured of the representation of items across the entire population. The homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters.

stratified sample Items randomly selected from each of several populations or strata.
strata Subpopulations composed of items with similar characteristics in a stratified sampling design.

EXAMPLE 1.2 SELECTING A STRATIFIED SAMPLE

A company wants to select a sample of 32 full-time workers from a population of 800 full-time employees in order to estimate expenditures from a company-sponsored dental plan. Of the full-time employees, 25% are managerial and 75% are non-managerial workers. How do you select the stratified sample so that the sample will represent the correct proportion of managerial workers?

SOLUTION

If you assume an 80% response rate, you need to distribute 40 surveys to get the desired 32 responses. The frame consists of a listing of the names and company email addresses of all N = 800 full-time employees included in the company personnel files. Since 25% of the full-time employees are managerial, you first separate the population frame into two strata: a subpopulation listing of all 200 managerial-level personnel and a separate subpopulation listing of all 600 full-time non-managerial workers. Since the first stratum consists of a listing of 200 managers, you assign three-digit code numbers from 001 to 200. Since the second stratum contains a listing of 600 non-managerial-level workers, you assign three-digit code numbers from 001 to 600.

To collect a stratified sample proportional to the sizes of the strata, you select 25% of the overall sample from the first stratum and 75% of the overall sample from the second stratum.
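The proportional allocation just described can be sketched in Python. This is a hedged illustration, not the book's procedure: the function names, the dictionary layout and the seed are assumptions made for the example, and a pseudo-random generator replaces the table of random numbers. The stratum sizes (200 and 600) and the overall sample of 40 come from the example.

```python
import random

def proportional_allocation(strata_sizes, total_sample):
    """Split the overall sample across strata in proportion to stratum size.

    Rounding can make the parts miss the total by one or two in general;
    with the sizes used here the split is exact.
    """
    N = sum(strata_sizes.values())
    return {name: round(total_sample * size / N)
            for name, size in strata_sizes.items()}

def stratified_sample(strata_sizes, total_sample, rng):
    """Draw a separate simple random sample (without replacement) in each stratum."""
    allocation = proportional_allocation(strata_sizes, total_sample)
    return {name: sorted(rng.sample(range(1, strata_sizes[name] + 1), k))
            for name, k in allocation.items()}

strata = {"managerial": 200, "non-managerial": 600}   # sizes from the example
allocation = proportional_allocation(strata, total_sample=40)
print(allocation)

rng = random.Random(12)          # arbitrary seed so the run is repeatable
sample = stratified_sample(strata, total_sample=40, rng=rng)
print({name: len(codes) for name, codes in sample.items()})
```

Each stratum keeps its own code numbers (001 to 200 and 001 to 600), so the two draws are independent of each other before the results are combined.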
You take two separate simple random samples, each of which is based on a distinct random starting point from a table of random numbers (Table E.1). In the first sample you select 10 managers from the listing of 200 in the first stratum, and in the second sample you select 30 non-managerial workers from the listing of 600 in the second stratum. You then combine the results to reflect the composition of the entire company.

Cluster Sample

In a cluster sample, you divide the N items in the frame into several clusters so that each cluster is representative of the entire population. You then take a random sample of clusters and study all items in each selected cluster. Clusters are naturally occurring designations, such as postcode areas, electorates, city blocks, households or sales territories.

Cluster sampling is often more cost-effective than simple random sampling, particularly if the population is spread over a wide geographical region. However, cluster sampling often requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling. A detailed discussion of systematic sampling, stratified sampling and cluster sampling procedures can be found in references 3, 4 and 6.

cluster sample The frame is divided into representative groups (or clusters), then all items in randomly selected clusters are chosen.
cluster A naturally occurring grouping, such as a geographical area.

Problems for Section 1.4

LEARNING THE BASICS

1.17 For a population containing N = 902 individuals, what code number would you assign for:
a. the first person on the list?
b. the fortieth person on the list?
c. the last person on the list?
1.18 For a population of N = 902, verify that, by starting in row 05 of the table of random numbers (Table E.1), you need only six rows to select a sample of n = 60 without replacement.
1.19 Given a population of N = 93, starting in row 29 of the table of random numbers (Table E.1) and reading across the row, select a sample of n = 15:
a. without replacement
b. with replacement

APPLYING THE CONCEPTS

1.20 For a study that consists of personal interviews with participants (rather than mail or phone surveys), explain why a simple random sample might be less practical than some other methods.
1.21 You want to select a random sample of n = 1 from a population of three items (called A, B and C). The rule for selecting the sample is: flip a coin; if it is heads, pick item A; if it is tails, flip the coin again; this time, if it is heads, choose B; if it is tails, choose C. Explain why this is a random sample but not a simple random sample.
1.22 A population has four members (call them A, B, C and D). You would like to draw a random sample of n = 2, which you decide to do in the following way: flip a coin; if it is heads, the sample will be items A and B; if it is tails, the sample will be items C and D. Although this is a random sample, it is not a simple random sample. Explain why. (If you did problem 1.21, compare the procedure described there with the procedure described in this problem.)
1.23 The town planning department of a Sydney council with a population of N = 40,000 registered voters is asked by the mayor to conduct a survey to measure community attitudes to urban consolidation. The table following contains a breakdown of the 40,000 registered voters by gender and ward of residence.

          Ward of residence
Gender    North    South    East     West     Total
Female     7,000    5,200    5,000    4,800   22,000
Male       5,600    4,600    4,000    3,800   18,000
Total     12,600    9,800    9,000    8,600   40,000

The planning department intends to take a probability sample of n = 2,000 voters and project the results from the sample to the entire population of voters.
a. If the frame available from the council files is an alphabetical listing of the names of all N = 40,000 registered voters, what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the council’s files is a listing of the names and addresses of all N = 40,000 registered voters, compiled from eight separate alphabetical lists based on the gender and address breakdowns shown in the ward-of-residence table, what type of sample should you take? Discuss.
e. At present East Ward has many high-rise apartments, West Ward and South Ward have single dwellings only and North Ward has a mixture of low- and medium-density housing. What would be the danger in randomly choosing 40 street names and systematically sampling 50 of the residents of those streets?
1.24 Suppose that 5,000 sales invoices are separated into four strata. Stratum 1 contains 50 electrical invoices, stratum 2 contains 500 paint invoices, stratum 3 contains 1,000 plumbing supplies invoices and stratum 4 contains 3,450 hardware invoices. A sample of 500 sales invoices is needed.
a. What type of sampling method should you use? Why?
b. Explain how you would carry out the sampling according to the method stated in (a).
c. Why is the sampling in (a) not simple random sampling?

1.5 EVALUATING SURVEY WORTHINESS

LEARNING OBJECTIVE 5 Evaluate the quality of surveys

Nearly every day you read or hear about survey or opinion poll results in newspapers, on the Internet or on radio or television.
To identify surveys that lack objectivity or credibility, you must critically evaluate what you read and hear by examining the worthiness of the survey. First, you must evaluate the purpose of the survey, why it was conducted and for whom it was conducted. An opinion poll or survey conducted to satisfy curiosity is mainly for entertainment. Its result is an end in itself rather than a means to an end. You should be sceptical of such a survey because the result should not be put to further use.

The second step in evaluating the worthiness of a survey is for you to determine whether it was based on a probability or a non-probability sample (as discussed in Section 1.4). You need to remember that the only way to make correct statistical inferences from a sample to a population is through the use of a probability sample. Surveys that use non-probability sampling methods are subject to serious, perhaps unintentional, bias that may render the results meaningless.

Survey Errors

Even when surveys use random probability sampling methods, they are subject to potential errors. Four types of survey errors are:
• coverage error
• non-response error
• sampling error
• measurement error.
Good survey research design attempts to reduce or minimise these various survey errors, often at considerable cost.

Coverage error

The key to proper sample selection is an adequate frame. Remember, a frame is an up-to-date list of all the items from which you will select the sample. Coverage error occurs if certain groups of items are excluded from this frame so that they have no chance of being selected in the sample. Coverage error results in a selection bias.
If the frame is inadequate because certain groups of items in the population were not properly included, any random probability sample selected will provide an estimate of the characteristics of the frame, not the actual population. Computer-based surveys are useful for certain studies where the subjects all have Internet access. Coverage error could result if the unemployed, the elderly or indigenous communities are not selected in the frame due to their lack of Internet or email access.

Non-response error

Not everyone is willing to respond to a survey. In fact, research has shown that individuals in the upper and lower socioeconomic classes tend to respond less frequently to surveys than people in the middle class. Non-response error arises from the failure to collect data on all items in the sample and results in a non-response bias. Because you cannot generally assume that people who do not respond to surveys are similar to those who do, you need to follow up on the non-responses after a specified period of time. You should make several attempts to persuade these individuals to complete the survey. The follow-up responses are then compared with the initial responses in order to make valid inferences from the survey (references 3, 4 and 6). The mode of response you use affects the rate of response. The personal interview and the telephone interview usually produce a higher response rate than a mail survey – but at a higher cost.

Sampling error

There are three main reasons for selecting a sample rather than taking a complete census. It is more expedient, less costly and more efficient. However, chance dictates which individuals or items will or will not be included in the sample. Sampling error reflects the heterogeneity, or ‘chance differences’, from sample to sample, based on the probability of certain individuals or items being selected in particular samples.
When you read about the results of surveys or polls in newspapers or magazines, there is often a statement regarding margin of error or precision; for example, ‘the results of this poll are expected to be within ±4 percentage points of the actual value’. This margin of error is the sampling error. You can reduce sampling error by taking larger sample sizes, although this also increases the cost of conducting the survey.

coverage error Occurs when all items in a frame do not have an equal chance of being selected. This causes selection bias.
non-response error Occurs due to the failure to collect information on all items chosen for the sample; this causes non-response bias.
sampling error The difference in results for different samples of the same size.

THINK ABOUT THIS The problem of online survey rigging

As the use of online methods for collecting information grows more prevalent we need to be aware that individuals will not all act honestly, especially when they have something to gain. There are many methods being used to contravene the rules of online competitions, such as paying companies to vote, setting up multiple email addresses or Facebook accounts, and using methods to mask the true IP address of the computer being used.

Even if a small incentive is offered for completing a survey, similar problems can arise. At an Australian university, students were recently asked to complete a survey about a peer-assisted learning program and were offered the chance to win movie tickets as an incentive to give feedback. The survey was carried out anonymously in order to elicit frank responses, but on completion students were automatically sent to a second site where they could register their student ID in order to enter a draw to win movie tickets.
One student registered 105 times in order to increase the chance of winning the movie tickets. It is not clear how many times this person completed the survey itself.

How could this type of behaviour potentially affect survey results? What could you do to minimise this type of survey error if you were designing an online survey?

Measurement error

In the practice of good survey research, you design a questionnaire with the intention of gathering meaningful information. But you have a dilemma here – getting meaningful measurements is easier said than done. Consider the following proverb:

A man with one watch always knows what time it is.
A man with two watches always searches to identify the correct one.
A man with ten watches is always reminded of the difficulty in measuring time.

Unfortunately, the process of getting a measurement is often governed by what is convenient, not what is needed. The measurements are often only a proxy for the ones you really desire. Much attention has been given to measurement error that occurs because of a weakness in question wording (reference 6). A question should be clear, not ambiguous. And, to avoid leading questions, you need to present them in a neutral manner.

measurement error The difference between survey results and the true value of what is being measured.

There are three sources of measurement error: ambiguous wording of questions, the halo effect and respondent error. The Australian Bureau of Statistics is very conscious of minimising error caused by questionnaire design and survey operations. For the National Health Survey in 2010–11 it used Computer Assisted Interview techniques to collect information.
It states:

the CAI instrument allows:
• data to be captured electronically at the point of interview, which obviates the cost, logistical, timing and quality issues associated with transport, storage and security of paper forms, and transcription/data entry of information from forms into electronic format
• the ability to use complex sequencing to define specific populations for questions, and ensure word substitutes used in the questions were appropriate to each respondent’s characteristics and prior responses
• the ability, through data validation (edits), to check responses entered against previous responses, reduce data entry errors by interviewers, and enable seemingly inconsistent responses to be clarified with respondents at the time of interview.
The audit trail recorded in the instrument also provides valuable information about the operation of particular questions, and associated data quality issues.
(Australian Bureau of Statistics, Australian Health Survey: Users’ Guide, 2011–2013, electronic publication, Cat. No. 4363.0.55.001, 2013)

The halo effect occurs when the respondent feels obligated to please the interviewer. Proper interviewer training can minimise the halo effect.

Respondent error occurs as a result of overzealous or underzealous effort by the respondent. You can minimise this error in two ways: (1) by carefully scrutinising the data and calling back those individuals whose responses seem unusual, and (2) by establishing a program of random call-backs to determine the reliability of the responses.

Other sources of error besides measurement error can result from clerical or recording errors. See references 7, 8 and 9 for a more detailed discussion of measurement error and the difficulties of avoiding it.
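The two safeguards against respondent error described above, calling back respondents whose answers seem unusual and running a program of random call-backs, can be sketched in a few lines of Python. This is an illustrative sketch only and not a procedure from this text: the respondent records, the field name hours_online, the plausible range of 0 to 168 hours and the call-back fraction are all hypothetical assumptions.

```python
import random

# Hypothetical survey records: respondent id and reported weekly hours online.
responses = [
    {"id": 1, "hours_online": 12},
    {"id": 2, "hours_online": 190},  # implausible: a week has only 168 hours
    {"id": 3, "hours_online": 25},
    {"id": 4, "hours_online": -3},   # implausible: negative time
    {"id": 5, "hours_online": 40},
    {"id": 6, "hours_online": 8},
]


def flag_unusual(records, low=0, high=168):
    """Return the ids whose answer falls outside an assumed plausible range."""
    return [r["id"] for r in records if not low <= r["hours_online"] <= high]


def random_callbacks(records, fraction, seed=None):
    """Draw a simple random sample (without replacement) of ids to call back."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(records)))
    return sorted(rng.sample([r["id"] for r in records], k))


unusual_ids = flag_unusual(responses)
callback_ids = random_callbacks(responses, fraction=0.5, seed=1)
print("Call back because the response seems unusual:", unusual_ids)
print("Random reliability call-backs:", callback_ids)
```

The first list targets responses that fail a simple validation check; the second is chosen by chance alone, so it can reveal unreliable answers that look perfectly ordinary.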
Ethical Issues

Ethical considerations arise with respect to the four types of potential errors that can occur when designing surveys that use probability samples: coverage error, non-response error, sampling error and measurement error. Coverage error can result in selection bias and becomes an ethical issue if particular groups or individuals are purposely excluded from the frame so that the survey results are skewed, indicating a position more favourable to the survey’s sponsor. Non-response error can lead to non-response bias and becomes an ethical issue if the sponsor knowingly designs the survey in such a manner that particular groups or individuals are less likely to respond. Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error, so that the sponsor can promote a viewpoint that might otherwise be truly insignificant. Measurement error becomes an ethical issue in one of three ways: (1) a survey sponsor chooses leading questions that guide the responses in a particular direction; (2) an interviewer, through mannerisms and tone, purposely creates a halo effect or otherwise guides the responses in a particular direction; (3) a respondent, having a disdain for the survey process, wilfully provides false information.

Ethical issues also arise when the results of non-probability samples are used to form conclusions about the entire population. When you use a non-probability sampling method, you need to explain the sampling procedures and state that the results cannot be generalised beyond the sample.

Problems for Section 1.5

APPLYING THE CONCEPTS

1.25 ‘A survey indicates that the vast majority of university students own their own personal computer.’ What information would you want to know before you accepted the results of this survey?
1.26 A simple random sample of n = 300 full-time employees is selected from a company list containing the names of all N = 5,000 full-time employees in order to evaluate job satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible non-response error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
1.27 According to a recent cyber security report, ‘millennials remain the most common victims of cybercrime, with 40 percent having experienced cybercrime in the past year’. Reasons given for this include slack online security habits and password sharing (2016 Norton Cyber Security Insights Report, <www.symantec.com/content/dam/symantec/docs/reports/2016-norton-cyber-security-insights-report.pdf>, accessed 16 June 2017). What information would you want to know before you accepted the results of the survey?
1.28 Kiribati is a small, poor Pacific nation under threat from global warming. According to the CIA World Factbook, Kiribati comprises a group of 33 coral atolls in the Pacific Ocean straddling the equator, with elevations varying from 0 to 81 metres above sea level. The low level of some of the islands makes them sensitive to changes in sea level (Central Intelligence Agency, The World Factbook, <www.cia.gov/library/publications/the-world-factbook/geos/kr.html>, accessed 16 June 2017). Suppose that an environmental economist has seen results from a survey which claims that 30% of inhabitants of Kiribati are already affected by roads having been permanently cut by rising seawater. What information would she want to know before accepting the results of the survey?
1.29 Reality TV shows have incorporated surveys of audience opinion into their formats.
In Australia several shows have allowed the audience to vote on whether contestants should remain on the show or be excluded. Consider a show where voting is by SMS, premium rate phone call, Facebook or another online site, and viewers are limited to 10 votes using each method. Compare this type of survey with a random poll of viewers without replacement conducted by phone for the TV show.
a. How might the results differ?
b. What are the costs and benefits for the owners of the show for each voting method?
1.30 The online restaurant search site Dimmi <www.dimmi.com.au> encourages diners to rate restaurants they have been to by giving them reward points which can be accumulated until a meal discount is available. A restaurant at The Rocks in Sydney has been rated as follows: Recommended 8.7; Food 8.5; Service 8.7; Value for money 7.8; Atmosphere 8.4. What differences could arise from this type of survey compared with ratings derived from a random sample of diners?

1.6 THE GROWTH OF STATISTICS AND INFORMATION TECHNOLOGY

statistical packages: Computer programs designed to perform statistical analysis.

During the past century, statistics has played an important role in spurring the use of information technology and, in turn, such technology has spurred the wider use of statistics. At the beginning of the twentieth century, the expanding data-handling requirements associated with the United States Federal Census led directly to the development of tabulating machines that were the forerunners of today’s business computer systems. Statisticians such as Pearson, Fisher, Gosset, Neyman, Wald and Tukey established the techniques of modern inferential statistics as an alternative to analysing large sets of population data that had become increasingly costly, time-consuming and cumbersome to collect. The development of early computer systems permitted others to develop computer programs to ease the calculation and data-processing burdens imposed by those techniques.
Over time, greater use of statistical methods by business decision makers and advances in computer capacity have led to the development of even more sophisticated statistical methods. Today, when you hear of retailers investing in a ‘customer-relationship management system’, or CRM, or a packaged goods producer engaging in ‘data mining’ to uncover consumer preferences, you should realise that statistical techniques form the foundations of such cutting-edge applications of information technology. As global information storage increases dramatically, businesses are rapidly coming to terms with how to analyse big data – data sets so large and varied that conventional software cannot readily handle them. (Think of the huge volume of data produced each day by people using Visa, Facebook, eBay and Twitter.) Even though cutting-edge applications might require custom programming, for many years businesses have had access to statistical packages such as Minitab, SPSS/PASW Statistics, SAS and Stata – standardised sets of programs that help managers use a wide range of statistical techniques by automating the data processing and calculations these techniques require. The leasing and training costs associated with statistical packages have led many to consider using some of the graphical and statistical functions of Microsoft Excel. However, you need to be aware that many statisticians have concerns about the accuracy and completeness of the statistical results produced by early versions of Excel. Invalid results could be produced, especially when the data sets were very large or had unusual statistical properties (see reference 10). Microsoft Excel 2010 and subsequent versions made some significant improvements in statistical functions (see references 11 and 12) but it would still be wise to be careful about the data and the analysis you are undertaking. 
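To see why data sets with unusual properties can produce invalid results in any software, compare two algebraically equivalent ways of computing a sample variance. The Python sketch below is an illustration only and is not taken from this text; the function names and the four data values are hypothetical, and it does not claim to reproduce the internal workings of any version of Excel. It contrasts a one-pass ‘calculator’ formula, which subtracts two very large and nearly equal totals, with the two-pass definition based on deviations from the mean.

```python
def variance_one_pass(data):
    """Sample variance from running totals: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(data):
    """Sample variance from squared deviations about the mean."""
    n = len(data)
    total = 0.0
    for x in data:
        total += x
    mean = total / n
    squared_dev = 0.0
    for x in data:
        squared_dev += (x - mean) * (x - mean)
    return squared_dev / (n - 1)


# Hypothetical data: a very large mean with a very small spread.
# The deviations from the mean are -6, -3, 3 and 6, so the true variance is 90 / 3 = 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

one_pass = variance_one_pass(data)
two_pass = variance_two_pass(data)
print("One-pass formula:", one_pass)
print("Two-pass formula:", two_pass)
```

With standard double-precision arithmetic the two-pass result matches the true value of 30, while the one-pass figure is badly wrong (it can even come out negative, which is impossible for a variance). This is one reason to check results on data with known answers before trusting any package.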
Assess your progress

Summary

In this chapter you have studied data collection and the various types of data used in business. In the Hong Kong International Airport scenario you were asked to review the visitor survey which will be used to provide information to the tourism authority planning staff (see page 9). Three of the questions shown will produce numerical data and four will produce categorical data. The responses to the first question (number of previous visits to Hong Kong) are discrete, and the responses to the second question (length of time since last visit) are continuous. After the data have been collected, they must be organised and prepared in order to make various analyses. You have also learned about commonly used sampling methods and ways to prepare data for analysis such as encoding, cleaning and recoding. The next two chapters develop tables and charts and a variety of descriptive numerical measures that are useful for data analysis.
Key terms

big data; categorical variables; cluster; cluster sample; collectively exhaustive; continuous variables; convenience sampling; coverage error; data; descriptive statistics; discrete variables; electronic formats; encoding; focus group; frame; inferential statistics; interval scale; judgment sample; measurement error; missing values; mutually exclusive; nominal scale; non-probability sample; non-response error; numerical variables; operational definition; ordinal scale; outliers; parameter; population; primary sources; probability sample; ratio scale; recoded variable; sample; sampling error; sampling with replacement; sampling without replacement; secondary sources; simple random sample; statistic; statistical packages; statistics; strata; stratified sample; structured data; systematic sample; table of random numbers; unstructured data; variables

References

1. Laney, D., 3D Data Management: Controlling Data Volume, Velocity, and Variety (Stamford, CT: META Group, February 6, 2001).
2. Osborne, J., Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data (Thousand Oaks, CA: Sage Publications, 2013).
3. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).
4. Lohr, S. L., Sampling Design and Analysis, 2nd edn (Boston, MA: Brooks/Cole Cengage Learning, 2010).
5. Rand Corporation, A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL: The Free Press, 1955).
6. Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer & R. Tourangeau, Survey Methodology, 2nd edn (New York: John Wiley, 2009).
7. Sudman, S., N. M. Bradburn & N. Schwarz, Thinking About Answers: The Application of Cognitive Processes to Survey Methodology (San Francisco, CA: Jossey-Bass, 1996).
8. Biemer, P. P., R. M. Groves, L. E. Lyberg, N. A. Mathiowetz & S. Sudman, Measurement Errors in Surveys (New York: Wiley Interscience, 2004).
9. Fowler, F. J., Improving Survey Questions: Design and Evaluation, Applied Social Research Methods Series, Vol. 38 (Thousand Oaks, CA: Sage Publications, 1995).
10. McCullough, B. D. & B. Wilson, ‘On the accuracy of statistical procedures in Microsoft Excel 97’, Computational Statistics and Data Analysis, 31 (1999): 27–37.
11. Microsoft Corporation at <http://office.microsoft.com/en-au/excel-help/what-s-new-changes-made-to-excel-functions-HA010355760.aspx>, accessed June 2017.
12. Microsoft Corporation at <http://office.microsoft.com/en-001/excel-help/new-functions-in-excel-2013-HA103980604.aspx>, accessed June 2017.

Chapter review problems

CHECKING YOUR UNDERSTANDING

1.31 What is the difference between a sample and a population?
1.32 What is the difference between a statistic and a parameter?
1.33 What is the difference between descriptive and inferential statistics?
1.34 What is the difference between a categorical and a numerical variable?
1.35 What is the difference between a discrete and a continuous variable?
1.36 What is an operational definition and why is it so important?
1.37 What are the four types of measurement scales?
1.38 What are some potential problems with using ‘barrel draw’ methods to select a simple random sample?
1.39 What is the difference between sampling with replacement and sampling without replacement?
1.40 What is the difference between a simple random sample and a systematic sample?
1.41 What is the difference between a simple random sample and a stratified sample?
1.42 What is the difference between a stratified sample and a cluster sample?
APPLYING THE CONCEPTS

1.43 The Australasian Data and Story Library OZDASL <www.statsci.org/data> is an online library of data files and stories that illustrate the use of basic statistical methods. The stories are classified by method and by topic. Go to this site and click on ‘First Course in Statistics’. Pick a story and summarise how statistics were used in the story.
1.44 Make a list of six ways you have used or encountered statistics in the past week. Think about what you read or heard in a news report or saw on a commercial website. Also think whether you made a bet or participated in a survey.
1.45 The Australian Bureau of Statistics <www.abs.gov.au> site contains survey information on people, business, geography and other topics. Go to the site and find the latest version of Labour Force, Australia (Cat. No. 6202.0).
a. Briefly describe the Labour Force survey.
b. Give an example of a categorical variable found in this survey.
c. Give an example of a numerical variable found in this survey.
d. Is the variable you selected in (c) discrete or continuous?
1.46 The Australian Bureau of Statistics website allows users to access a large amount of Census data online. Go to <www.abs.gov.au/census> and in the Data by Products section click on the latest Census year, enter a location and search for QuickStats.
a. Give an example of a categorical variable found in this summary of survey results.
b. Give an example of a numerical variable found in this summary of survey results.
c. Is the variable you selected in (b) discrete or continuous?
1.47 Detailed information on airport and airline on-time performance can be found at <www.flightstats.com>. Explore the departures performance data for different airports and regions.
a. Which of the five types of data sources listed in Section 1.3 do you think were used here?
b. Name a categorical variable for which observations were collected.
c. Name a numerical variable for which observations were collected.
d. What type of recoding has been used here and why?
1.48 Late in 2016 the National Roads and Motorists’ Association (NRMA), a major Australian motoring organisation, released results of a survey that sought to check members’ attitudes to traffic congestion and a motorway extension (see <www.mynrma.com.au/about/media/local-support-for-SouthConnex-strengthens-nrma-survey.htm>).
a. Describe the population(s) for this survey.
b. Describe the sample(s) for this survey.
c. Can you identify potential difficulties in comparing these results with results from a similar 2005 survey?
1.49 A manufacturer of flavoured milk is planning to survey households in Tasmania to determine the purchasing habits of consumers. Among the questions to be included are those that relate to:
1. where flavoured milk is primarily purchased
2. what flavour of milk is purchased most often
3. how many people living in the household drink flavoured milk
4. the total number of millilitres of flavoured milk drunk in the past week by members of the household
a. Describe the population.
b. For each of the four items listed, indicate whether the variable is categorical or numerical. If numerical, is it discrete or continuous?
c. Develop five categorical questions for the survey.
d. Develop five numerical questions for the survey.
1.50 A new bus network is proposed for a north-eastern Sydney region. A survey is sent out to residents asking questions which relate to:
1. the resident’s age
2. frequency of bus use
3. usual ticket type purchased
4. main purpose of using the bus
a. Describe the population.
b. Indicate whether each of the questions above is categorical or numerical.
c. Develop two more numerical questions and state whether the variables are discrete or continuous.
d. Develop two more categorical questions.
1.51 Political polling has traditionally used telephone interviews. Researchers at a polling organisation argue that Internet polling is less expensive and faster, and offers higher response rates than telephone surveys. Critics are concerned about the scientific reliability of this approach. Even amid this strong criticism, Internet polling is becoming more and more common. What concerns, if any, do you have about Internet polling?
1.52 Statistics New Zealand mentions a number of possible sources of non-sampling error in economic surveys in A Guide to Good Survey Design, 3rd edition, which can be downloaded from <www.stats.govt.nz>.
a. Which of the four types of survey error from Section 1.5 are identified on this site as a non-sampling error?
b. Discuss which errors would be more difficult to eliminate.
1.53 Researchers at a university wish to conduct a survey of past students to ascertain how frequently they are using statistical techniques in the workforce. The researchers have permission from the ethics committee to use the last recorded email and postal addresses to contact ex-students, but these may be out of date, particularly as many students have returned to homes overseas without updating their records. The emails and letters are sent out simultaneously. The response to the survey is low.
a. What type of errors or biases should the researchers be especially concerned with?
b. What step(s) should the researchers take to try to overcome the problems noted in (a)?
c. What could have been done differently to improve the survey’s worthiness?
1.54 According to a survey conducted by the Australian Interactive Media Industry Association, 77% of mobile phone users surveyed pay by a monthly phone bill compared to 21% who are on pre-paid plans. The percentage of respondents that have data included in their payment plans is 84% (M. M. Mackay, Australian Mobile Phone Lifestyle Index, 9th edn, October 2013, <www.aimia.com.au/ampli>, accessed 24 January 2014).
a. What other information would you want to know before you accepted the results of this survey?
b. Suppose that you wished to conduct a similar survey for the geographic region you live in. Describe the population for your survey.
c. Explain how you could minimise the chance of a coverage error in this type of survey.
d. Explain how you could minimise the chance of a non-response error in this type of survey.
e. Explain how you could minimise the chance of a sampling error in this type of survey.
f. Explain how you could minimise the chance of a measurement error in this type of survey.

Continuing cases

Tasman University

Tasman University’s Tasman Business School (TBU) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) students who responded to the latest undergraduate (UG) survey are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >.
a For each question asked in the survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify whether it is discrete or continuous.
b A separate survey has been carried out for Master of Business Administration (MBA) students. Results for these postgraduate (PG) students are in the file < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Repeat the analysis you carried out in (a) for the postgraduate survey results.

As Safe as Houses

To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Identify data sources and discuss the type of sampling that was most likely used to collect these data.
b Suggest any additional variables that could be collected in order to explain property prices, and determine if they are numerical or categorical, discrete or continuous.

Chapter 1 Excel Guide

EG1.1 GETTING STARTED WITH MICROSOFT EXCEL

Microsoft Excel is the electronic worksheet program of Microsoft Office. Although not a specialised statistical program, Excel contains basic statistical functions, and the Excel 2016 PC and Mac versions include Data Analysis Toolpak procedures that you can use to perform selected advanced statistical methods. To use the Data Analysis Toolpak you must select it as an Excel add-in. You can also install the PHStat add-in (available for separate purchase or with some textbooks) to extend and enhance the Data Analysis Toolpak that Microsoft Excel contains. (You do not need to use PHStat in order to use Microsoft Excel with this text, although using PHStat will simplify using Excel for statistical analysis.)

In Microsoft Excel, you create or open and save files that are called workbooks. Workbooks are collections of worksheets and related items, such as charts, that contain the original data as well as the calculations and results associated with one or more analyses.

Because of its widespread distribution, Microsoft Excel is a convenient program to use, but some statisticians have expressed concern about its lack of fully reliable and accurate results for some statistical procedures. Although Microsoft has recently improved many statistical functions, especially from Excel 2010 onwards, you should be somewhat cautious about using Microsoft Excel to perform analyses on data other than the data used in this text. (If you plan to install PHStat, make sure you first read Appendix F and any PHStat read-me file.)
You can use Excel to learn and apply the statistical methods discussed in this book and as an aid in solving end-of-section and end-of-chapter problems. For many topics, you may choose to use the ‘Excel How-to’ instructions. These instructions use pre-constructed worksheets as models or templates for a statistical solution. You learn how to adapt these worksheets to construct your own solutions. Many of these sections feature a specific Excel Guide workbook that contains worksheets that are identical to the worksheets that PHStat creates. Because both of these methods create the same results and the same worksheets, you can use a combination of them as you read through this book.

The ‘Excel How-to’ instructions and the Excel Guide workbooks work best with the latest versions of Microsoft Excel, including Excel 2016 and Excel 2013 (Microsoft Windows), Excel 2016 for Mac, and Office 365. (Excel Guides also contain instructions for using the Analysis ToolPak add-in that is included with most of the latest Microsoft Excel versions.) (Microsoft Excel 2016, Microsoft Corporation, 2015)

You will want to master the basic skills listed in Table EG1.1 before you begin using Microsoft Excel to understand statistical concepts and solve problems.
If you plan to use the ‘Excel How-to’ instructions, you will also need to master the skills listed in the lower part of the table. While you do not necessarily need these skills if you plan to use PHStat, knowing them will be useful if you expect to customise the Excel worksheets that PHStat creates or expect to be using Excel beyond the course that uses this book. The list of skills in Table EG1.1 begins with the more basic skills and progresses towards slightly more advanced skills that you will need to use less frequently. Table EG1.2 presents the typographic conventions that the Excel Guides in this book use to present computer operations.

Table EG1.1 Basic skills for using Microsoft Excel
Excel skill | Specifics
Excel data entry | Organising worksheet data in columns; Entering numerical and categorical data
File operations | Open; Save; Print
Worksheet operations | Create; Copy and paste
Formula skills | Concept of a formula; Cell references; Absolute and relative cell references; How to enter a formula; How to enter an array formula
Workbook presentation | How to apply format changes that affect the display of worksheet cell contents
Chart formatting correction | How to correct the formatting of charts that Excel improperly creates
Discrete histogram creation | How to create a properly formatted histogram for a discrete probability distribution

Table EG1.2 Excel typographic conventions
Operation | Examples | Notes
Keyboard keys | Enter; Ctrl; Shift | Names of keys are always the object of the verb press, as in ‘press Enter’.
Keystroke combinations | Ctrl+C; Ctrl+Shift+Enter; Command+Enter | Keyboarding actions that require you to press more than one key at the same time. Ctrl+C means press C while holding down Ctrl. Ctrl+Shift+Enter means press Enter while holding down both Ctrl and Shift.
Click or select operations | Click OK; Select the first 2-D Bar gallery item | Mouse pointer actions that require you to single click an onscreen object. This book uses the verb select when the object is either a worksheet cell or an item in a gallery, menu, list or Ribbon tab.
Menu or ribbon selection | File ➔ New; Layout ➔ Legend ➔ None | A sequence of Ribbon or menu selections. File ➔ New means first select the File tab and then select New from the list that appears.
Placeholder object | variable 1 cell range; bins cell range | An italicised bold-faced phrase is a placeholder for an object reference. In making entries, you enter the reference (e.g. A1:A10) and not the placeholder.

EG1.2 OPENING AND SAVING WORKBOOKS

Once you open the Excel program a new workbook will be displayed where you can begin entering data in rows and columns. Figure EG1.1 shows a newly opened workbook in Excel 2016. It contains the elements that are common with most Microsoft Windows programs.

If you wish to use a workbook created previously you will need to use the following commands. If you are using Microsoft Excel 2016, select File ➔ Open. In the Backstage view you will be given a choice of selecting from Recent Workbooks, OneDrive or the Computer. You can browse, select the file to be opened and then click on the OK button. If you cannot find your file, you may need to do one or more of the following:
• Use the scroll bars or the slider, if present, to scroll through the entire list of files.
• Select the correct folder from the drop-down list at the left-hand side of the dialog box.
• To search every file in the folder, leave All Files showing at the bottom of the dialog box. If you want a specific type of file such as text files, use the arrow to open a drop-down menu and then select Text Files.

To save a workbook in Excel 2016, select File ➔ Save As, and in the Backstage view choose the location.
In the dialog box enter (or edit) the name of the file in the File name box and click on the OK button. If applicable, you can also do the following:
• Change to another folder by selecting that folder from the Save in drop-down list.
• Change the Save as type value to something other than the default choice, Microsoft Excel Workbook. Text (Tab delimited) or CSV (Comma delimited) are two file types sometimes used to share Excel data with other programs.
After saving your work, you should consider saving your file a second time, using a different name, to create a backup copy of your work. Read-only files cannot be saved to their original folders unless the name is changed.

[Figure EG1.1 The Excel 2016 window. Labelled elements: Quick access toolbar; Ribbon; Formula bar; Group; Launcher button; Title bar; Column labels; Minimise, Resize and Close buttons; Tabs (Home tab selected); Row labels; Workspace area with opened workbook; Scroll bars; Sheet tab]

EG1.3 ENTERING DATA

The main worksheet area is composed of rows and columns that you use for data entry. By convention, and the style used in this book, when you enter data for a set of variables you enter the name of each variable into the cells of the first row, beginning with column A. Then you enter the data for each variable in the subsequent rows to create a DATA worksheet similar to the one shown in Figure EG1.2, which contains data from an auction sale. Note that the formula used in the active cell F6 can be seen on the formula bar. To enter data in a specific cell, either use the cursor keys to move the cell pointer to the cell or use your mouse to select the cell directly. As you type, what you type appears in the formula bar.
Complete your data entry by pressing Tab or Enter or by clicking the checkmark button in the formula bar. When you enter data, never skip any rows in a column and, as a general rule, avoid skipping any columns. Also try to avoid using numbers as row 1 variable headings; if you cannot avoid their use, precede such headings with apostrophes. Pay attention to any special instructions that occur throughout the book for the order of the entry of your data. For some statistical methods, entering your data in an order that Excel does not expect will lead to incorrect results.

[Figure EG1.2 An example of a DATA worksheet]

To refer to a specific entry, or cell, you use a Sheetname!ColumnRow notation. For example, Data!A2 refers to the cell in column A and row 2 in the Data worksheet. To refer to a specific group or range of cells, you use a Sheetname!Upperleftcell:Lowerrightcell notation. For example, Data!A2:B11 refers to the 20 cells that are in rows 2 to 11 in columns A and B of the Data worksheet.

An absolute address for the cell A6 is shown as $A$6. Even if a formula using this address is copied to another row or column it will still refer to this cell. However, if the formula is written with the relative address A6, copying the formula to another cell will change the reference cell. Both absolute and relative addresses may be necessary in one sheet depending on the operations intended. Also note that $A6 freezes the column but not the row and A$6 freezes the row but allows the column to change.

Each Microsoft Excel worksheet has its own name. By default, Microsoft Excel names worksheets Sheet1, Sheet2 and so on. You should rename your worksheets, giving them more self-descriptive names, by double-clicking on the sheet tabs that appear at the bottom of each sheet, typing a new name and pressing the Enter key.
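The range and addressing rules above can be checked with a short Python sketch. This is a simplified model written for illustration, not part of Excel or of this text: the function names are hypothetical, and the code handles only plain references such as Data!A2:B11, A6, $A6, A$6 and $A$6. It counts the cells in a range and mimics how a single-cell reference changes when the formula containing it is copied a given number of rows down and columns across.

```python
import re


def col_to_number(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    n = 0
    for ch in col.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def number_to_col(n):
    """Convert a 1-based column number back to a label such as 'A' or 'AB'."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def cells_in_range(ref):
    """Count the cells in a reference such as 'Data!A2:B11'."""
    cell_range = ref.split("!")[-1]  # drop the optional sheet name
    m = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", cell_range)
    n_cols = col_to_number(m.group(3)) - col_to_number(m.group(1)) + 1
    n_rows = int(m.group(4)) - int(m.group(2)) + 1
    return n_cols * n_rows


def shift_reference(ref, rows=0, cols=0):
    """Mimic how a single-cell reference changes when its formula is copied."""
    m = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    col_fixed, col, row_fixed, row = m.group(1), m.group(2), m.group(3), int(m.group(4))
    if not col_fixed:  # no $ before the column letter: the column is relative
        col = number_to_col(col_to_number(col) + cols)
    if not row_fixed:  # no $ before the row number: the row is relative
        row += rows
    return f"{col_fixed}{col}{row_fixed}{row}"


print(cells_in_range("Data!A2:B11"))  # 2 columns x 10 rows = 20 cells
for ref in ["A6", "$A$6", "$A6", "A$6"]:
    # Copy the formula one row down and one column to the right.
    print(ref, "becomes", shift_reference(ref, rows=1, cols=1))
```

Copied one row down and one column to the right, A6 becomes B7, $A$6 stays $A$6, $A6 becomes $A7 and A$6 becomes B$6, which matches the description of absolute and relative addresses above.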
EG1.4 USING FORMULAS IN EXCEL WORKSHEETS

Formulas are worksheet cell entries that perform a calculation or some other task. You enter formulas by typing the equals sign symbol (=) followed by some combination of mathematical or other data-processing operations. For simple formulas, you use the symbols +, -, *, / and ^ for the operations addition, subtraction, multiplication, division and exponentiation (a number raised to a power), respectively. For example, the formula =Data!B2 + Data!B3 + Data!B4 + Data!B5 adds the contents of the cells B2, B3, B4 and B5 of the Data worksheet and displays the sum as the value in the cell containing the formula.

You can also use Microsoft Excel functions to simplify formulas. To find lists of the functions that can be selected in Excel, click on the fx Function Wizard symbol on the Formula bar. For example, the formula =SUM(Data!B2:B5), using the Excel SUM() function, is a shorter equivalent to the formula above. You can also use cell or cell range references that do not contain the Sheetname! part, such as B2 or B2:B5. Such references always refer to the worksheet in which the formula has been entered.

Formulas allow you to create generalised solutions and give Excel its distinctive ability to recalculate results automatically when you change the values of the supporting data. Typically, when you use a worksheet, you see only the results of any formulas entered, not the formulas themselves. However, for your reference, many illustrations of Microsoft Excel worksheets in this text also show the underlying formulas adjacent to the results they produce. When using Excel 2016, select Formulas ➔ Formula Auditing ➔ Show Formulas to see onscreen the formulas themselves and not their results. To restore the original view, click on Show Formulas again.

EG1.5 CREATING CHARTS

The method of creating charts can vary according to the version of Excel you are using. Both of the following methods are available in Excel 2016.
• Method 1 A feature in Excel 2016 allows you to create charts easily using the Quick Analysis tool. Simply highlight an area of the spreadsheet containing some data you wish to graph by clicking on the top left-hand cell, then dragging the mouse. The range may contain labels. Click on the small box that appears in the bottom right-hand corner to open Quick Analysis. Select Charts, then, by hovering the mouse over the different chart types, you can see previews of recommended charts for the selected data. You can also choose More, which will open a dialog box with a more extensive range of options. Once a chart is selected there are several ways you can modify it by clicking on the icons that appear on its right-hand side. These are Chart Elements (+), Chart Styles (paintbrush) and Chart Filters (filter). You will also now see that multiple design options are shown on the ribbon and that options to change colours or chart type are shown there. By right-clicking on the background area of the chart you can also activate a drop-down menu. If you choose Format Chart Area a menu will open on the right-hand side of the spreadsheet that allows you to change the format of the chart and text in many ways. If instead you choose Move Chart you can choose a new location on another sheet. To reposition the chart on the existing sheet, simply click on it and drag. To resize it, drag using one of the circles on its border.
• Method 2 Highlight the area of the spreadsheet with your data as described above. If you wish to select areas that are not adjacent, hold down the Ctrl key while selecting. The area selected must be rectangular. Click on the Insert tab, then from the Charts area click on Recommended Charts and select a particular format from the drop-down gallery. Alternatively, you can select a chart type from the icons shown. Once the chart is created it can be formatted or enhanced by clicking on it and following the instructions given for Method 1.
Figure EG1.3 shows an example of a chart created in Excel 2016 with the Format Axis panel open.

[Figure EG1.3 An example of a chart created in Excel 2016 with the Format Axis panel open. Labelled elements: Plot area; Chart title; Vertical axis title; Chart area; Legend; Horizontal axis title]

EG1.6 PRINTING WORKBOOKS

Before printing you may select a print area if you do not want the whole sheet printed. To print Excel 2016 worksheets, select File ➔ Print. A print preview is automatically created, as can be seen in Figure EG1.4. Various print settings are available in the drop-down list boxes. Clicking on Page Setup will give access to more choices such as changing from Portrait to Landscape orientation, as would suit the worksheet shown. When you are satisfied with the settings and look of the preview, click on the Print button. Note that if you want only a part of the worksheet to be printed it is easier to set this using the Page Layout tab, then Page Setup ➔ Print Area.

Page Setup allows you to customise printing to change the print orientation, add gridlines and so on before printing. Once you are satisfied with the results, click on the Print button in the print preview window, then OK in the Print dialog box.

The Print Backstage view (see Figure EG1.4) contains settings to select the printer to be used, what parts of the workbook to print (the active worksheet is the default) and the number of copies to produce (1 is the default). If you need to change these settings, change them before clicking on the OK button.

[Figure EG1.4 The Excel 2016 Backstage view with Print and Page Setup selected]

After printing, you should verify the contents of your printout.
Most printing failures will trigger the display of an error message that you can use to work out the source of the failure.

EG1.7 HOW USING EXCEL FOR MAC DIFFERS

Excel 2016 for Mac comes with the Analysis ToolPak add-in, but earlier versions did not. If you don’t have a current version, it is possible to download software made by third-party companies to perform some of the same statistical analysis tasks. The free program StatPlus®:mac LE, for instance, will allow you to run a regression, calculate descriptive statistics and run analysis of variance tests. Further capability is available in the Pro edition at a cost.

In Excel 2016 for Mac you can open a new workbook when the program opens by using New ➔ Blank Workbook ➔ Create. The easiest way to save a new workbook is to click on the Save icon on the quick access toolbar. A Save As dialog box will allow you to choose a file name, a location for the file and the file format. You can also choose File ➔ Save to begin this process.

To create a chart in Excel 2016 for Mac, use Method 2 described in section EG1.5. With the chart selected click on the Chart Design tab. You will find that extra options such as Add Chart Element, Quick Layout and Switch Row/Column open on the ribbon to allow more formatting.

To print a worksheet or selection use File ➔ Print then, under Printer, select the printer you wish to use. The default is that all active worksheets will be printed; to modify that, select Show Details. Then choose the option preferred from the drop-down menu, and finally select Print.

EG1.8 DEFINING DATA

Establishing the Variable Type

Microsoft Excel infers the variable type from the data you enter into a column. If Excel discovers a column that contains numbers, for example, it treats the column as a numerical variable. If Excel discovers a column that contains words or alphanumeric entries, it treats the column as a non-numerical (categorical) variable.
This imperfect method works most of the time, especially if you make sure that the categories for your categorical variables are words or phrases such as ‘yes’ and ‘no’. However, because you cannot explicitly define the variable type, Excel can mistakenly offer or allow you to do nonsensical things such as using a statistical method that is designed for numerical variables on categorical variables. If you must use coded values such as 1, 2 or 3, enter them preceded by an apostrophe, as Excel treats all values that begin with an apostrophe as non-numerical data. (You can check whether a cell entry includes a leading apostrophe by selecting a cell and viewing the contents of the cell in the formula bar.)

EG1.9 COLLECTING DATA

Recoding Variables

Key technique To recode a categorical variable, you first copy the original variable’s column of data and then use the find-and-replace function on the copied data. To recode a numerical variable, or a categorical variable with only two values, enter a formula that returns a recoded value in a new column.

Example Imagine that we have collected data at an airport using a survey such as shown on page 9. The Recode workbook shows how the original variables of ‘Accommodation satisfaction’ and ‘Business visit’ have been recoded.

Excel how-to Two recoded variables were created by first opening the Airport Survey worksheet in the Recode workbook and then following these steps:
1. Right-click column B (right-click over the shaded ‘B’ at the top of column B) and click Copy in the shortcut menu.
2. Right-click column C and click the first choice in the Paste Options gallery.
3. Enter Accommodation code in cell C1.
4. Select column C. With column C selected, click Home ➔ Find & Select ➔ Replace. In the Replace tab of the Find and Replace dialog box:
5. Enter Very satisfied as Find what, 1 as Replace with, and then click Replace All.
6. Click OK to close the dialog box that reports the results of the replacement command.
7.
Still in the Find and Replace dialog box, enter Very dissatisfied as Find what (replacing Very satisfied), and 5 as Replace with, then click Replace All.
8. Click OK to close the dialog box that reports the results of the replacement command.
9. Continue to replace the words Dissatisfied, Satisfied and Undecided with the numbers 4, 2 and 3 respectively using this method. (This creates the recoded variable Accommodation code in column C.)
10. Enter Business visit code in cell H1.
11. Enter the formula =IF(F2 = "No", 0, 1) in cell H2.
12. Copy this formula down the column to the last row that contains Visitor data (row 31). (This creates the recoded variable Business visit code in column H.)

The Recode workbook uses the IF function to recode the two categories as numbers. Numerical variables can also be recoded into multiple categories by using a more advanced technique using the VLOOKUP function.

EG1.10 TYPES OF SAMPLING METHODS

Simple Random Sample

Key technique Use the RANDBETWEEN(smallest integer, largest integer) function to generate a random integer that can then be used to select an item from a frame.

Example Create a simple random sample with replacement of size 40 from a population of 800 items.

Excel how-to Enter a formula that uses this function and then copy the formula down a column for as many rows as is necessary. For example, to create a simple random sample with replacement of size 40 from a population of 800 items, open a new worksheet. Enter Sample in cell A1 and enter the formula =RANDBETWEEN(1, 800) in cell A2. Then copy the formula down the column to cell A41.

Excel contains no functions to select a random sample without replacement. Such samples are most easily created using an add-in such as PHStat or the Analysis ToolPak, as described in the following paragraphs.
Analysis ToolPak Use Sampling to create a random sample with replacement. For the example, assume you have a worksheet that contains the population of 800 items in column A and that contains a column heading in cell A1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Sampling from the Analysis Tools list and then click OK. In the procedure’s dialog box:
1. Enter A1:A801 as the Input Range and check Labels.
2. Click Random and enter 40 as the Number of Samples.
3. Click New Worksheet Ply and then click OK.

Example Create a simple random sample without replacement of size 40 from a population of 800 items.

PHStat Use Random Sample Generation. For the example, select PHStat ➔ Sampling ➔ Random Sample Generation. In the procedure’s dialog box:
1. Enter 40 as the Sample Size.
2. Click Generate list of random numbers and enter 800 as the Population Size.
3. Enter a Title and click OK.
Unlike most other PHStat results worksheets, the worksheet created contains no formulas.

Excel how-to Use the COMPUTE worksheet of the Random workbook as a template. The worksheet already contains 40 copies of the formula =RANDBETWEEN(1, 800) in column B. Because the RANDBETWEEN function samples with replacement as discussed at the start of this section, you may need to add additional copies of the formula in new column B rows until you have 40 unique values. If your intended sample size is large, you may find it difficult to spot duplicates. See the ADVANCED worksheet in the Random workbook for more information about an advanced technique that uses formulas to detect duplicate values.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
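The two sampling ideas in section EG1.10 can also be sketched in Python, as an illustration only. This is not how Excel, PHStat or the Analysis ToolPak work internally. The function names are hypothetical, random.Random.randint plays the role of RANDBETWEEN (both ends inclusive), and the seed 2019 is an arbitrary choice so that the run can be repeated.

```python
import random


def sample_with_replacement(population_size, n, rng):
    """Draw n item numbers from 1..population_size; repeats are allowed."""
    return [rng.randint(1, population_size) for _ in range(n)]


def sample_without_replacement(population_size, n, rng):
    """Keep drawing, discarding duplicates, until n unique item numbers remain.

    This mirrors the how-to of adding extra copies of the formula
    until the required number of unique values has been reached.
    """
    if n > population_size:
        raise ValueError("sample size cannot exceed population size")
    chosen = []
    seen = set()
    while len(chosen) < n:
        item = rng.randint(1, population_size)
        if item not in seen:  # skip values that were already selected
            seen.add(item)
            chosen.append(item)
    return chosen


rng = random.Random(2019)  # arbitrary seed, assumed for repeatability
with_repl = sample_with_replacement(800, 40, rng)
without_repl = sample_without_replacement(800, 40, rng)
print(len(with_repl), len(set(without_repl)))
```

Using a set to record the values already drawn makes duplicates easy to detect even when the sample size is large, which is the difficulty the text notes for a manual check.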
CHAPTER 2 Organising and visualising data

FESTIVAL EXPENDITURE

A council is investigating the contribution to the local economy of visitors to an annual three-day music festival. Kai, a researcher employed by the council, has collected data from a random sample of non-local festival attendees aged 18 years and over. This data includes total amount spent, excluding festival tickets, in the region during the festival and whether the festival attendee has travelled from within the state (intrastate), from another state (interstate) or from another country (international) to attend the festival. The data is stored in the < FESTIVAL > file. Kai is interested in answering the following questions:
■ What is the typical amount spent during the festival by intrastate, interstate and international visitors?
■ How does the amount spent vary between visitors and between intrastate, interstate and international visitors?
■ Is there a difference in the amount spent between intrastate, interstate and international visitors?
LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 describe the distribution of a single categorical variable using tables and charts
2 describe the distribution of a single numerical variable using tables and graphs
3 describe the relationship between two categorical variables using contingency tables
4 describe the relationship between two numerical variables using scatter diagrams and time-series plots
5 develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for descriptive analytics
6 correctly present data in graphs

Kai needs to organise the data into usable forms. One way of doing this is to use tables or charts to organise and visualise the data. This chapter helps you to select and construct appropriate tables and charts. We can also use numerical measures to determine certain characteristics of the data, such as their centre and spread. These numerical descriptive measures are covered in the next chapter. From Chapter 1 we know that data can be either categorical or numerical.

2.1 ORGANISING AND VISUALISING CATEGORICAL DATA

LEARNING OBJECTIVE 1 Describe the distribution of a single categorical variable using tables and charts

The expenditure data in the < FESTIVAL > file are examples of raw data – that is, data presented just as they were collected. Raw data give very little information, but by using summary tables and charts we can condense and present the data in a meaningful way. For categorical data, you first divide the data into categories and then present the frequency or percentage in each category in a table or chart.
Organising Categorical Data: Summary Table

A summary table gives the frequency, proportion or percentage of the data in each category, which allows you to see differences between the categories. A summary table lists the categories in one column and the frequency, percentage or proportion in a separate column or columns. Table 2.1 illustrates a summary table based on a recent survey that asked why people shopped for groceries online. From this table, stored in < ONLINE SHOPPING >, the most common reason for grocery shopping online was convenience, followed by competitive prices and quality products. Very few respondents shopped for groceries online because of a comfortable environment or well-displayed products.

summary table: Summarises categorical or numerical data; gives the frequency, proportion or percentage of data values in each category or class.

Table 2.1 Reasons for grocery shopping online

Reason                        Percentage
Comfortable environment                8
Competitive prices                    20
Convenience                           28
Customer service                      13
Products well displayed                3
Quality products                      18
Variety/range of products             10
Total                                100

SUMMARY TABLES FOR LOCATION AND TYPE OF PROPERTIES

In other research Kai is exploring the property market in the council area. Data from 100 recent property sales is stored in < PROPERTY >. These properties are classified according to location, either in town or rural, and also by type, either a house or a unit. Construct summary tables for the properties categorised by location and type.
EXAMPLE 2.1 SOLUTION

Table 2.2A A frequency and percentage summary table for the location of 100 recent property sales

Location    Number (frequency) of properties    Percentage of properties
Rural                                     34                        34.0
Town                                      66                        66.0
Total                                    100                       100.0

From Table 2.2A we can see that approximately twice as many town properties as rural properties were sold.

Table 2.2B A frequency and percentage summary table for type of 100 recent property sales

Type        Number of properties    Percentage of properties
House                         82                        82.0
Unit                          18                        18.0
Total                        100                       100.0

From Table 2.2B we can see that relatively few units were sold.

Visualising Categorical Data: Bar Charts

Each category in a bar chart is represented by a bar, the length of which indicates the proportion, frequency or percentage of values falling into that category. Figure 2.1 displays a bar chart of the reasons for grocery shopping online, presented in Table 2.1.

Bar charts allow you to compare percentages, frequencies or proportions in the different categories. In Figure 2.1 the most common reason for shopping online is convenience, followed by competitive prices. Very few respondents shopped for groceries online because of a comfortable environment or well-displayed products.

bar chart: Graphical representation of a summary table for categorical data; the length of each bar represents the proportion, frequency or percentage of data values in a category.
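The frequency and percentage summary tables of Example 2.1 can be reproduced with a short Python sketch. This is an illustration only: the < PROPERTY > file is not available here, so the list of locations below is rebuilt from the totals in Table 2.2A (34 rural, 66 town), and the function name summary_table is hypothetical.

```python
from collections import Counter


def summary_table(values):
    """Return {category: (frequency, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (frequency, round(100 * frequency / total, 1))
        for category, frequency in sorted(counts.items())
    }


# Stand-in data rebuilt from the Table 2.2A totals, not the original file.
locations = ["Rural"] * 34 + ["Town"] * 66

for category, (frequency, percentage) in summary_table(locations).items():
    print(category, frequency, percentage)
```

Because the percentages are computed from the counts, the same function can be applied to the property type variable, or to any other categorical column.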
[Figure 2.1 Microsoft Excel bar chart of the reasons for grocery shopping online. Chart title: Bar chart – reasons for grocery shopping online. Vertical axis: Category (Comfortable environment, Competitive prices, Convenience, Customer service, Products well displayed, Quality products, Variety/range of products). Horizontal axis: %, 0 to 30]

EXAMPLE 2.2 BAR CHART FOR FAMILY TYPE

The council is also interested in demographic differences between the council area and the capital city. Demographic information has been collected and is stored in < DEMOGRAPHIC_INFORMATION >. Use the summary tables for family type to construct and interpret bar charts for the council area and the capital city.

SOLUTION

[Figure 2.2 Microsoft Excel bar chart for family type. Panels: Bar chart – council area; Bar chart – capital city. Categories: Couple with children, Couple no children, One parent, Other. Horizontal axis: %]

We can see that, in both areas, the majority of families are couples with or without children, with a significant number of one-parent families. However, the capital city has approximately 10% more couples without children and 5% fewer one-parent families.

Pie Charts

pie chart: Graphical representation of a summary table for categorical data, with each category represented by a slice of a circle of which the area represents the proportion or percentage share of the category relative to the total of all categories.

A pie chart is a circle, used to represent the total, which is divided into slices, each representing a category. The area of each slice represents the proportion or the percentage share of the corresponding category. In Table 2.1, for example, 28% of the respondents said that convenience was the main reason for grocery shopping online.
Thus, in constructing the pie chart, the 360° that makes up a circle is multiplied by 0.28, resulting in a slice of the pie that takes up 100.8° of the 360° of the circle (Figure 2.3). A pie chart allows you to see the portion of the entire pie that falls into each category. In Figure 2.3, convenience takes 28% of the pie and products well displayed takes only 3%.

What type of chart should you use? The selection of a chart depends on your intention. If a comparison of categories is most important, use a bar chart. If observing the portion of the whole that lies in a particular category is most important, use a pie chart. There should be no more than eight categories or slices in a pie chart. If there are more than eight, merge the smaller categories into a category called ‘other’.

[Figure 2.3 Microsoft Excel pie chart of the reasons for grocery shopping online. Chart title: Pie chart – reasons for grocery shopping online. Slices: Variety/range of products 10%; Quality products 18%; Comfortable environment 8%; Competitive prices 20%; Products well displayed 3%; Customer service 13%; Convenience 28%]

PIE CHART FOR FAMILY TYPE

Use the summary tables given for family type in < DEMOGRAPHIC_INFORMATION > to construct and interpret pie charts for the capital city and the council area.
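The slice-angle arithmetic described for the grocery shopping pie chart (360° multiplied by 0.28 gives 100.8°) extends to every category in Table 2.1. The Python sketch below is an illustration only; the function name slice_angles is hypothetical and the percentages are copied from Table 2.1.

```python
def slice_angles(percentages):
    """Convert each category's share of the total into degrees of a 360° circle."""
    total = sum(percentages.values())
    return {category: 360 * share / total for category, share in percentages.items()}


# Percentages from Table 2.1 (reasons for grocery shopping online).
reasons = {
    "Comfortable environment": 8,
    "Competitive prices": 20,
    "Convenience": 28,
    "Customer service": 13,
    "Products well displayed": 3,
    "Quality products": 18,
    "Variety/range of products": 10,
}

angles = slice_angles(reasons)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Dividing by the total rather than by 100 means the same function also works when the inputs are frequencies instead of percentages.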
EXAMPLE 2.3

[Figure 2.4 Microsoft Excel pie chart for family type. Panels: Pie chart – council area; Pie chart – capital city. Slices: Couple with children, Couple no children, One parent, Other]

We can see that, in both areas, most families are couples with or without children, with a significant number of one-parent families. However, the capital city has a higher proportion of couples without children.

Problems for Section 2.1

LEARNING THE BASICS

2.1 A categorical variable has three categories with the following frequency of occurrence:

Category    Frequency
A                  13
B                  28
C                   9

a. Calculate the percentage of values in each category.
b. Construct a bar chart.
c. Construct a pie chart.

2.2 A categorical variable has four categories with the following percentages of occurrence:

Category    Percentage
A                   12
B                   29
C                   35
D                   24

a. Construct a bar chart.
b. Construct a pie chart.

2.3 [Pie chart with slices labelled SBS, ABC, Channel 10, Channel 7 and Channel 9]
The pie chart above was constructed from the results of a survey of 2,000 viewers to determine which TV channels they watch for news. By measuring the angle of each one using a protractor, or estimating by eye, calculate the percentage of viewers watching:
a. ABC
b. Channel 7
c. Channel 9
d. Channel 10
e. SBS

APPLYING THE CONCEPTS

You can solve problems 2.4 to 2.7 manually or by using Microsoft Excel.

2.4 The following table gives the top 10 websites ranked by estimated number of unique monthly visitors in March 2017.

Website      Unique monthly visitors (millions)
Google                                    1,600
Facebook                                  1,100
YouTube                                   1,100
Yahoo!                                      750
Amazon                                      500
Wikipedia                                   475
Twitter                                     290
Bing                                        285
eBay                                        285
MSN                                         280

Data obtained from eBizMBA Guide, Top 15 Most Popular Websites March 2017, at <www.ebizmba.com/articles/most-popular-websites> accessed 13 March 2017

a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these data?
c. What conclusions can you reach concerning the number of unique visitors?

2.5 Pat, the owner of Pat’s Cars, asked 200 customers their colour preference when purchasing a new car. The following summary table gives the results.

Colour    Frequency
White            56
Blue             31
Red              29
Brown            17
Grey             19
Silver           15
Green            15
Black            13
Other             5

a. Construct bar and pie charts.
b. What colours of cars should Pat have on show?

2.6 The following table gives the labour force status of the Australian civilian population aged 15 years and over in January 2017.

6202.0 – Labour Force, Australia, Jan 2017

Labour force status (aged 15 years & over)        Total (‘000)
Employed full-time                                     8,066.3
Employed part-time                                     3,762.6
Unemployed looking for full-time work                    561.4
Unemployed not looking for full-time work                213.7
Not in labour force                                    7,096.2
Civilian population aged 15 years and over            19,700.2

Data obtained from Australian Bureau of Statistics, Labour Force, Australia, January 2017, Cat. No. 6202.0 <www.abs.gov.au/ausstats/abs@.nsf/mf/6202.0> accessed 15 March 2017

a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these data?
c. What conclusions can you draw about participation rate – that is, the percentage of the population in the labour force?

2.7 Use the summary table for country of birth in < DEMOGRAPHIC_INFORMATION > to construct pie and bar charts.

2.2 ORGANISING NUMERICAL DATA

When you have a large amount of raw numerical data, a useful first step is to present the data as either an ordered array or a stem-and-leaf display. Suppose you undertake a study to compare the cost of a main meal at similar restaurants in a city and in the suburbs.
Table 2.3 gives the raw data for 50 city restaurants and 50 suburban restaurants; these data are stored in < RESTAURANT >. From the raw data it is difficult to draw any conclusions about the price of city and suburban restaurant meals.

LEARNING OBJECTIVE 2 Describe the distribution of a single numerical variable using tables and graphs

Table 2.3 Price per main meal at 50 city restaurants and 50 suburban restaurants

City
50 38 43 56 51 36 25 33 41 44
34 39 49 37 40 50 50 35 22 45
44 38 14 44 51 27 44 39 50 35
31 34 48 48 30 42 26 35 32 63
36 38 53 23 39 45 37 31 39 53

Suburban
37 37 29 38 37 38 39 29 36 38
44 27 24 34 44 23 30 32 25 29
43 31 26 34 23 41 32 30 28 33
26 51 26 48 39 55 24 38 31 30
51 30 27 38 26 28 33 38 32 25

Ordered Arrays

A more meaningful display is obtained by sorting the raw data in order of magnitude – that is, from smallest to largest. This is called an ordered array. Table 2.4 presents the data in Table 2.3 as ordered arrays. From Table 2.4 you can see that the price of a main meal at city restaurants is between $14 and $63, and the price of a main meal at suburban restaurants is between $23 and $55.

ordered array: Numerical data sorted by order of magnitude.

Stem-and-Leaf Displays

A stem-and-leaf display is a quick and easy way to visually display numerical data. The data are divided into groups (called stems) such that the values within each group (the leaves) branch out to the right on each row. The resulting display allows you to see how the data are distributed and also where they are concentrated.

stem-and-leaf display: Graphical representation of numerical data; partitions each data value into a stem portion and a leaf portion.
Table 2.4 Ordered array of price per main meal at 50 city restaurants and 50 suburban restaurants

City
14 22 23 25 26 27 30 31 31 32
33 34 34 35 35 35 36 36 37 37
38 38 38 39 39 39 39 40 41 42
43 44 44 44 44 45 45 48 48 49
50 50 50 50 51 51 53 53 56 63

Suburban
23 23 24 24 25 25 26 26 26 26
27 27 28 28 29 29 29 30 30 30
30 31 31 32 32 32 33 33 34 34
36 37 37 37 38 38 38 38 38 38
39 39 41 43 44 44 48 51 51 55

To see how a stem-and-leaf display is constructed, suppose that 20 students spend the following amounts at a coffee cart between lectures: < COFFEE >

$6.35 $4.75 $4.30 $5.40 $4.85 $6.60 $5.55 $4.90 $6.85 $7.50
$8.45 $6.05 $9.90 $5.75 $6.80 $4.30 $5.45 $7.20 $7.80 $10.65

To construct a stem-and-leaf display for these data, use the $ amount as the stem and round the cents to the nearest 10 cents for the leaves. Now list the stem values ($) in order of size to the left of a vertical divider (|) and then record the leaves (10 cents) for each stem in rows to the right. The ‘unordered’ stem-and-leaf display for the amount spent at the coffee cart by the 20 students is:

stem unit: $    leaf unit: 10 cents
 4 | 83993
 5 | 4685
 6 | 46918
 7 | 528
 8 | 5
 9 | 9
10 | 7

The first value of $6.35 is rounded to 6.4. Its stem (row) is 6 and its leaf is 4. The second value of $4.75 is rounded to 4.8. Its stem (row) is 4 and its leaf is 8. Then, ordering each leaf, we obtain the following ordered stem-and-leaf display for the amount spent at the coffee cart by the 20 students:

stem unit: $    leaf unit: 10 cents
 4 | 33899
 5 | 4568
 6 | 14689
 7 | 258
 8 | 5
 9 | 9
10 | 7

EXAMPLE 2.4 STEM-AND-LEAF DISPLAY FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS

Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL > Construct and interpret a stem-and-leaf display for these data.
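Before turning to the solution, the construction just described for the coffee cart data can be sketched in Python. This is an illustration only, with a hypothetical function name. It assumes the amounts are stored as whole cents so that rounding to the nearest 10 cents can be done with integer arithmetic, with halves rounded up as in the text ($6.35 becomes 6.4).

```python
from collections import defaultdict


def stem_and_leaf(amounts_in_cents):
    """Return {stem: leaves} with stem unit $1 and leaf unit 10 cents.

    Each amount is rounded to the nearest 10 cents (halves rounded up),
    then split into a dollar stem and a single-digit leaf.
    """
    rows = defaultdict(list)
    for cents in amounts_in_cents:
        tenths = (cents + 5) // 10       # e.g. 635 cents -> 64 tenths of a dollar
        stem, leaf = divmod(tenths, 10)  # 64 -> stem 6, leaf 4
        rows[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in sorted(rows.items())
    }


# Amounts spent at the coffee cart by the 20 students, in cents.
coffee = [635, 475, 430, 540, 485, 660, 555, 490, 685, 750,
          845, 605, 990, 575, 680, 430, 545, 720, 780, 1065]

display = stem_and_leaf(coffee)
for stem, leaves in display.items():
    print(f"{stem:>2} | {leaves}")
```

Sorting the leaves within each stem gives the ordered display; dropping the sorted() call on the leaves would give the unordered version instead.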
SOLUTION

Figure 2.5 PHStat stem-and-leaf display for festival expenditure by interstate visitors

Festival expenditure by interstate visitors
Stem unit: $100    Leaf unit: $10
 2 | 278
 3 | 1235999
 4 | 02335567889
 5 | 1255689
 6 | 00033346689
 7 | 3567789
 8 | 067
 9 | 114
10 | 4

From Figure 2.5 Kai can conclude that during the festival:
• interstate visitors spend between $220 and $1,040
• most interstate visitors spend between $300 and $800
• interstate visitors rarely spend less than $300 or more than $800.

Problems for Section 2.2

LEARNING THE BASICS

2.8 Form an ordered array given the following data from a sample of n = 7 mid-semester exam scores in accounting:
68 94 63 75 71 88 64

2.9 Form a stem-and-leaf display given the following data from a sample of n = 7 mid-semester exam scores in finance:
80 54 69 98 93 53 74

2.10 Form an ordered array given the following stem-and-leaf display from a sample of n = 7 mid-semester exam scores in information systems:
stem unit: 10    leaf unit: 1
stems: 5 6 7 8 9
leaves: 0 446 19 2

APPLYING THE CONCEPTS

2.11 Data were collected on the monthly expenses submitted by 35 employees in a firm’s sales team. The data are summarised in the following stem-and-leaf display:
stem unit: $100    leaf unit: $10
1 | 12489
2 | 0013999999
3 | 01124445899
4 | 11556
5 | 0156
a. Place the data into an ordered array.
b. Which of the two displays provides the most information? Discuss.
c. In what range are most monthly expense claims?
d. Is there a concentration of expense claims near the centre of the distribution?

2.12 The following data represent the late payment fee in dollars for a sample of 22 accounts. < LATE_PAYMENT >
20 45 40 20 40 38 38 45 35 45 35
15 45 35 50 40 45 35 40 45 35 40
a. Display the data as an ordered array.
b. Construct a stem-and-leaf display for the data.
c. Which of the two displays provides the most information? Discuss.
d. Around what value, if any, are the late payment fees concentrated? Explain.

2.13 The following data represent ATM fees for withdrawals made above the free monthly allowance for a sample of 26 transaction accounts. < ATM_FEE >
0.65 0.50 0.70 1.30 2.50 0.50 2.00 1.00 2.00 1.25 1.50 2.00 0.30
2.00 0.65 2.00 0.50 0.65 0.50 0.65 1.60 0.70 1.00 1.50 1.65 0.50
a. Display the data as an ordered array.
b. Construct a stem-and-leaf display for the data.
c. Which of the two displays provides the most information? Discuss.
d. Around what value, if any, are the withdrawal fees concentrated? Explain.

2.14 Low-fat foods are not necessarily low calorie, as many are high in sugar. The following data give calories per 250 ml cup of a random sample of brands of fresh cow’s milk for sale in Australia. < FRESH_MILK >
Full cream milk: 155 188 160 155 160 163 170 185 135 160 165 160 163
Low- or reduced-fat milk: 120 133 133 125 118 113 140 110 128 115
No-fat or skim milk: 133 90 90 98 88 85 115 108 88 90 90 98
Data obtained from Calorie King Australia <www.calorieking.com.au> accessed 22 December 2013
For each category of milk:
a. Display the data in ordered arrays.
b. Construct stem-and-leaf displays for the data.
c. Which arrangement provides more information? Discuss.
d. Compare the items in terms of calories. What conclusions can you make?

2.3 SUMMARISING AND VISUALISING NUMERICAL DATA

LEARNING OBJECTIVE 2 Describe the distribution of a single numerical variable using tables and graphs

Ordered arrays and stem-and-leaf displays are of limited use when we have very large quantities of data or the data are highly variable. In these cases we use tables and graphs to condense and present the data visually.
These tables and graphs include frequency, relative frequency and cumulative distributions, histograms and polygons.

Summarising Numerical Data: Frequency Distributions

A frequency distribution allows you to condense a set of data.

frequency distribution: Summary table for numerical data; gives the frequency of data values in each class.
class width: Distance between upper and lower boundaries of a class.
range: Distance measure of variation; difference between maximum and minimum data values.

A frequency distribution is a summary table in which the data are arranged into numerically ordered classes or intervals. To construct a frequency distribution, first select an appropriate number of classes and a suitable class width. The classes should be exhaustive and mutually exclusive, so that any one data value belongs to one and only one class.

The number of classes chosen depends on the amount of data – a small number of classes for small amounts of data and a larger number of classes for larger amounts of data. In general, a frequency distribution should have at least five classes but no more than 15. If there are too few classes we lose too much information and if there are too many classes the data are not condensed enough.

Each class should be of equal width. To determine the required (approximate) width of the classes, divide the range (the highest value – the lowest value) of the data by the required number of classes.

DETERMINING AN APPROXIMATE WIDTH OF A CLASS
Class width = range / number of classes    (2.1)

The city restaurant data consist of a sample of 50 restaurants; with this sample size 10 is an appropriate number of classes. From the ordered array in Table 2.4, the range of the data is $63 – $14 = $49.
Using Equation 2.1, the approximate class width is:

Class width = 49 / 10 = 4.9

Choose a class width that simplifies the reading and interpretation of the distribution and resultant graphs. Therefore, instead of using a class width of $4.90, choose a width of $5.00.

Construct the frequency distribution table by first establishing clearly defined class boundaries so that each data value belongs in one and only one class. The classes must be mutually exclusive and exhaustive. Whenever possible, choose class boundaries that simplify the reading and interpretation of the resultant tables or graphs. For the city restaurant data the price ranges from $14 to $63, so appropriate classes could be (1) from $10 to less than $15, (2) from $15 to less than $20, and so on, until we have included the highest data value, in this case $63. The last and 11th class ranges from $60 to less than $65.

The centre of each class, called the class mid-point, is halfway between the lower boundary and the upper boundary of the class. Thus, the class mid-point for the first class, from $10 to under $15, is $12.50 ((10 + 15)/2); the class mid-point for the second class, from $15 to under $20, is $17.50, and so on.

Table 2.5 gives a frequency distribution of the cost per meal for the 50 city and the 50 suburban restaurants. A frequency distribution allows you to draw conclusions about the major characteristics of the data. For example, Table 2.5 shows that the price of main meals at city restaurants is concentrated between $30 and $55 compared with the price of main meals at suburban restaurants, which are clustered between $25 and $40.

For small data sets, one set of class boundaries may provide a different picture from another set. For example, for the restaurant price data, using a class width of 4.0 instead of 5.0 (as was used in Table 2.5) may cause shifts in the way in which the values are distributed between the classes.
You can also get shifts in data concentration when you choose different lower and upper class boundaries. Fortunately, as the sample size increases, alterations in the selection of class boundaries affect the concentration of data less and less.

class boundaries: Upper and lower values used to define classes for numerical data.
class mid-point: Centre of a class; representative value of class.

Table 2.5 Frequency distribution of the price per main meal for 50 city restaurants and 50 suburban restaurants

| Price of main meal ($) | City frequency | Suburban frequency |
|---|---|---|
| $10 but less than $15 | 1 | 0 |
| $15 but less than $20 | 0 | 0 |
| $20 but less than $25 | 2 | 4 |
| $25 but less than $30 | 3 | 13 |
| $30 but less than $35 | 7 | 13 |
| $35 but less than $40 | 14 | 12 |
| $40 but less than $45 | 8 | 4 |
| $45 but less than $50 | 5 | 1 |
| $50 but less than $55 | 8 | 2 |
| $55 but less than $60 | 1 | 1 |
| $60 but less than $65 | 1 | 0 |
| Total | 50 | 50 |

EXAMPLE 2.5
FREQUENCY DISTRIBUTION FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL > Construct and interpret a frequency distribution for these data.

SOLUTION
As we have data from 52 interstate visitors, with expenditure during the festival ranging from approximately $220 to $1,040 (see Figure 2.5), we can choose a class width of $200 with the first class starting at $200.

Table 2.6 Frequency distribution of festival expenditure – interstate visitors

| Festival expenditure | Frequency (interstate visitors) |
|---|---|
| $200 to < $400 | 11 |
| $400 to < $600 | 17 |
| $600 to < $800 | 18 |
| $800 to < $1,000 | 5 |
| $1,000 to < $1,200 | 1 |
| Total | 52 |

From Table 2.6 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.
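The construction of a frequency distribution can also be sketched in a few lines of Python rather than Excel. This is only an illustration under stated assumptions: the 50 city prices are transcribed from the ordered array in Table 2.4, and the function name `frequency_distribution` is hypothetical. The sketch applies Equation 2.1 and then counts the values in each class of the form "$10 but less than $15", which should reproduce the city column of Table 2.5.

```python
# City restaurant main-meal prices ($), transcribed from the ordered array in Table 2.4.
city_prices = [
    14, 22, 23, 25, 26, 27, 30, 31, 31, 32,
    33, 34, 34, 35, 35, 35, 36, 36, 37, 37,
    38, 38, 38, 39, 39, 39, 39, 40, 41, 42,
    43, 44, 44, 44, 44, 45, 45, 48, 48, 49,
    50, 50, 50, 50, 51, 51, 53, 53, 56, 63,
]


def frequency_distribution(values, lower, width, n_classes):
    """Count values in classes [lower + i*width, lower + (i+1)*width).

    Hypothetical helper: each class includes its lower boundary and
    excludes its upper boundary, so the classes are mutually exclusive.
    """
    counts = [0] * n_classes
    for value in values:
        index = int((value - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} falls outside the chosen classes")
        counts[index] += 1
    return counts


data_range = max(city_prices) - min(city_prices)  # 63 - 14 = 49
approx_width = data_range / 10                    # Equation 2.1 with 10 classes

# The text rounds the width up to $5 and starts the first class at $10,
# which gives 11 classes ending at "less than $65".
city_freq = frequency_distribution(city_prices, lower=10, width=5, n_classes=11)

for i, count in enumerate(city_freq):
    low = 10 + 5 * i
    print(f"${low} but less than ${low + 5}: {count}")
```

Changing `width` or `lower` in the call is a quick way to see how different class boundaries shift the apparent concentration of a small data set.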
Relative Frequency and Percentage Distributions

relative frequency distribution: Summary table for numerical data which gives the proportion of data values in each class.
percentage distribution: Summary table for numerical data which gives the percentage of data values in each class.

Instead of the frequency of the data in each class, knowing the proportion or the percentage of the data that fall into each class is often more useful. To do this, we use either a relative frequency or a percentage distribution. Also, when comparing two or more samples with different sample sizes, a relative frequency or percentage distribution should be used.

A relative frequency distribution is obtained by dividing the frequency in each class by the total number of values. From this a percentage distribution can be obtained by multiplying each relative frequency by 100%. Thus, the relative frequency of a main meal at city restaurants with a price between $30 and $35 is 0.14 (7 ÷ 50), and the corresponding percentage is 14%. Table 2.7 presents the relative frequency and percentage distributions of the price of main meals at city and suburban restaurants.

From Table 2.7 you can conclude that meals cost more at city restaurants than at suburban restaurants – 16% of main meals at city restaurants cost between $40 and $45 compared with 8% at suburban restaurants; 16% of main meals at city restaurants cost between $50 and $55 compared with 4% at suburban restaurants; while only 6% of main meals at city restaurants cost between $25 and $30 compared with 26% at suburban restaurants.

Table 2.7 Relative frequency and percentage distributions of the price of main meals at city and suburban restaurants
| Price of main meal ($) | City relative frequency | City percentage | Suburban relative frequency | Suburban percentage |
|---|---|---|---|---|
| $10 but less than $15 | 0.02 | 2.0 | 0.00 | 0.0 |
| $15 but less than $20 | 0.00 | 0.0 | 0.00 | 0.0 |
| $20 but less than $25 | 0.04 | 4.0 | 0.08 | 8.0 |
| $25 but less than $30 | 0.06 | 6.0 | 0.26 | 26.0 |
| $30 but less than $35 | 0.14 | 14.0 | 0.26 | 26.0 |
| $35 but less than $40 | 0.28 | 28.0 | 0.24 | 24.0 |
| $40 but less than $45 | 0.16 | 16.0 | 0.08 | 8.0 |
| $45 but less than $50 | 0.10 | 10.0 | 0.02 | 2.0 |
| $50 but less than $55 | 0.16 | 16.0 | 0.04 | 4.0 |
| $55 but less than $60 | 0.02 | 2.0 | 0.02 | 2.0 |
| $60 but less than $65 | 0.02 | 2.0 | 0.00 | 0.0 |
| Total | 1.00 | 100.0 | 1.00 | 100.0 |

EXAMPLE 2.6
RELATIVE FREQUENCY DISTRIBUTION AND PERCENTAGE DISTRIBUTION FOR FESTIVAL EXPENDITURE – INTERSTATE AND INTRASTATE VISITORS
Kai is interested in the amount spent during the festival by festival attendees; in particular if there is any difference in expenditure between interstate and intrastate visitors. < FESTIVAL > Construct and interpret relative frequency and percentage distributions to compare the festival expenditure of interstate and intrastate visitors.

SOLUTION
Table 2.8 Relative frequency and percentage distributions of festival expenditure – intrastate and interstate

| Festival expenditure | Interstate proportion | Intrastate proportion | Interstate percentage | Intrastate percentage |
|---|---|---|---|---|
| $0 to < $200 | 0.000 | 0.019 | 0.00 | 1.92 |
| $200 to < $400 | 0.212 | 0.442 | 21.15 | 44.23 |
| $400 to < $600 | 0.327 | 0.250 | 32.69 | 25.00 |
| $600 to < $800 | 0.346 | 0.135 | 34.62 | 13.46 |
| $800 to < $1,000 | 0.096 | 0.115 | 9.62 | 11.54 |
| $1,000 to < $1,200 | 0.019 | 0.039 | 1.92 | 3.85 |
| Total | 1.000 | 1.000 | 100.00 | 100.00 |

From Table 2.8 Kai can conclude that interstate visitors generally spend more during the festival than intrastate visitors. However, there is more variation in festival expenditure among intrastate visitors.
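As a rough check on the arithmetic behind relative frequency and percentage distributions, the following Python sketch converts the class frequencies of Table 2.5 into the proportions and percentages of Table 2.7. It is an illustration only; the function names `relative_frequencies` and `percentages` are hypothetical, and Python stands in for the spreadsheet the text uses.

```python
# Class frequencies from Table 2.5, classes $10-<$15 up to $60-<$65.
city_freq = [1, 0, 2, 3, 7, 14, 8, 5, 8, 1, 1]
suburban_freq = [0, 0, 4, 13, 13, 12, 4, 1, 2, 1, 0]


def relative_frequencies(freqs):
    """Divide each class frequency by the total number of values."""
    total = sum(freqs)
    return [f / total for f in freqs]


def percentages(freqs):
    """Express each class frequency as a percentage of the total."""
    total = sum(freqs)
    return [100 * f / total for f in freqs]


city_rel = relative_frequencies(city_freq)
city_pct = percentages(city_freq)
suburban_rel = relative_frequencies(suburban_freq)
suburban_pct = percentages(suburban_freq)

for i in range(len(city_freq)):
    low = 10 + 5 * i
    print(
        f"${low} but less than ${low + 5}: "
        f"city {city_rel[i]:.2f} ({city_pct[i]:.1f}%), "
        f"suburban {suburban_rel[i]:.2f} ({suburban_pct[i]:.1f}%)"
    )
```

Because both functions divide by the sample's own total, the same code can compare samples of different sizes, which is the situation in Example 2.6.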
Cumulative Distributions

cumulative percentage distribution: Summary table for numerical data; gives the cumulative frequency of each successive class.

A cumulative percentage distribution gives the percentage of values that are less than a certain value. For example, you may want to know what percentage of the city restaurant main meals cost less than $20, less than $50, and so on. A percentage distribution is used to form the corresponding cumulative percentage distribution. From Table 2.7, 0% of main meals at city restaurants cost less than $10, 2% cost less than $15, 2% also cost less than $20 (since none of the meals cost between $15 and $20), 6% (2% + 4%) cost less than $25, and so on, until all 100% of the meals cost less than $65. Table 2.9 summarises the cumulative percentages for the price of main meals at city and suburban restaurants.

Table 2.9 Cumulative percentage distributions of the price of city and suburban restaurant main meals

| Price ($) | City percentage of restaurants less than indicated value | Suburban percentage of restaurants less than indicated value |
|---|---|---|
| $10 | 0 | 0 |
| $15 | 2 | 0 |
| $20 | 2 | 0 |
| $25 | 6 | 8 |
| $30 | 12 | 34 |
| $35 | 26 | 60 |
| $40 | 54 | 84 |
| $45 | 70 | 92 |
| $50 | 80 | 94 |
| $55 | 96 | 98 |
| $60 | 98 | 100 |
| $65 | 100 | 100 |

The cumulative distribution clearly shows that the cost of main meals is lower in suburban restaurants than in city restaurants – 34% of main meals at suburban restaurants cost less than $30 compared with 12% at city restaurants; 60% of main meals at suburban restaurants cost less than $35 compared with 26% at city restaurants; 84% of main meals at suburban restaurants cost less than $40 compared with 54% at city restaurants.

EXAMPLE 2.7
CUMULATIVE PERCENTAGE DISTRIBUTION FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by festival attendees; in particular if there is any difference in expenditure between interstate and intrastate visitors. < FESTIVAL > Construct and interpret cumulative distributions to compare festival expenditure of interstate and intrastate visitors.

SOLUTION
Table 2.10 Cumulative percentage distribution of festival expenditure – intrastate and interstate

| Festival expenditure | Interstate visitors percentage | Intrastate visitors percentage |
|---|---|---|
| $0 to < $200 | 0.00 | 1.92 |
| $200 to < $400 | 21.15 | 46.15 |
| $400 to < $600 | 53.85 | 71.15 |
| $600 to < $800 | 88.46 | 84.61 |
| $800 to < $1,000 | 98.08 | 96.15 |
| $1,000 to < $1,200 | 100.00 | 100.00 |

From Table 2.10 Kai can conclude that 71% of intrastate visitors spend less than $600 during the festival while only 54% of interstate visitors spend less than $600. This indicates that, generally, intrastate visitors spend less during the festival than interstate visitors.

Histograms

histogram: Graphical representation of a frequency, relative frequency or percentage distribution; the area of each rectangle represents the class frequency, relative frequency or percentage.

A grouped frequency, relative frequency or percentage distribution can be graphically represented by a histogram. The horizontal axis is divided into intervals corresponding to the classes. Rectangles are constructed above these intervals, the heights of which measure the frequency, relative frequency or percentage of data values in the class.

Figure 2.6 displays an Excel frequency histogram for the price of main meals at city restaurants. The histogram indicates that the price of main meals at city restaurants is concentrated between approximately $30 and $55. Very few meals cost less than $25 or more than $55. Instead of using class boundaries you can label and identify classes by their mid-point.
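To make the idea of labelling classes by their mid-points concrete, here is a small Python sketch that prints a plain-text histogram of the city restaurant frequencies from Table 2.5. It is a stand-in for a charting tool, not a reproduction of the Excel chart; the function name `text_histogram` and the row layout are assumptions made for illustration.

```python
# City class frequencies from Table 2.5; classes start at $10 with width $5.
city_freq = [1, 0, 2, 3, 7, 14, 8, 5, 8, 1, 1]
lower, width = 10, 5

# Class mid-point = halfway between the lower and upper class boundaries.
mid_points = [lower + width * i + width / 2 for i in range(len(city_freq))]


def text_histogram(mid_points, freqs):
    """Return one row per class: the mid-point label and a bar of asterisks.

    The bar length equals the class frequency, so bar lengths are
    directly comparable because all classes have equal width.
    """
    rows = []
    for mid, freq in zip(mid_points, freqs):
        rows.append(f"${mid:6.2f} | {'*' * freq} ({freq})")
    return rows


chart = text_histogram(mid_points, city_freq)
print("\n".join(chart))
```

The longest bar sits at the $37.50 mid-point, matching the concentration of prices described in the text.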
Figure 2.6 Excel histogram of the price of main meals at city restaurants
[Histogram: price of main meals at city restaurants. Vertical axis: Frequency (0 to 16). Horizontal axis: Price – city, labelled by class mid-points.]

EXAMPLE 2.8
HISTOGRAM FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL > Construct and interpret a histogram for the data.

SOLUTION
Figure 2.7 Histogram of festival expenditure – interstate visitors
[Histogram: festival expenditure – interstate visitors. Vertical axis: Frequency (0 to 20). Horizontal axis: Festival expenditure, $ (0 to 1,400).]

From Figure 2.7 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.

Polygons

When comparing two or more sets of data we can construct polygons on the same set of axes, allowing for easy interpretation.

PERCENTAGE POLYGON
A percentage polygon is constructed by plotting the percentage for each class above the respective class mid-point and then joining the mid-points by straight lines. The graph is extended at each end to classes with a frequency of zero so that the polygon starts and finishes on the horizontal axis.

percentage polygon: Graphical representation of a percentage distribution.

Figure 2.8 displays percentage polygons for the price of main meals in city and suburban restaurants. The polygon for suburban restaurants is concentrated to the left (corresponding to lower price) of the polygon for city restaurants.
The highest percentages of price for suburban restaurants are for class mid-points of $27.50 and $32.50, while the highest percentages of price for city restaurants are for a class mid-point of $37.50.

The polygons in Figure 2.8 have plotted points whose values on the horizontal axis represent the class mid-points. For example, for class mid-point $22.50, the plotted point for suburban restaurants (the higher one) represents the fact that 8% of these restaurants have main meal prices between $20 and $25, while the plotted point for city restaurants (the lower one) indicates that only 4% of these restaurants have main meal prices between $20 and $25.

When constructing polygons or histograms, the vertical axis should show the true zero or ‘origin’ so as not to distort the character of the data. The horizontal axis does not need to specify the zero point for the variable of interest, although the range of the variable should constitute the major portion of the axis.

Figure 2.8 Percentage polygons for the price of main meals in city and suburban restaurants
[Percentage polygon, series City and Suburban. Vertical axis: % (0 to 30). Horizontal axis: Price of main meal ($), class mid-points 7.5 to 62.5.]

EXAMPLE 2.9
PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is a difference between interstate and intrastate visitors. < FESTIVAL > Construct and interpret percentage polygons to compare the festival expenditure of interstate and intrastate visitors.
SOLUTION
Figure 2.9 Percentage polygons – festival expenditure
[Percentage polygons, series Interstate visitors and Intrastate visitors. Vertical axis: % (0 to 50). Horizontal axis: $ (class mid-points 100 to 1,300).]

From Figure 2.9 Kai can conclude that intrastate visitors generally spend less during the festival than interstate visitors.

Cumulative Percentage Polygons (Ogives)

cumulative percentage polygon (ogive): Graphical representation of a cumulative frequency distribution.

A cumulative percentage polygon, or ogive, displays the variable of interest along the horizontal axis and the cumulative percentages (percentiles) on the vertical axis. A percentile is defined as ‘the value below which a given percentage of observations in a data set fall’.

Figure 2.10 shows the cumulative percentage polygons of the price of main meals at city and suburban restaurants. Most of the curve for city restaurants is located to the right of the curve for suburban restaurants. This indicates that city restaurants have fewer main meals that cost below a particular value. For example, 12% of city restaurant main meals cost less than $30 compared with 34% of suburban restaurant main meals.

Figure 2.10 Cumulative percentage polygons of the cost of main meals at city and suburban restaurants
[Cumulative percentage polygon, series City and Suburban. Vertical axis: % (0 to 100). Horizontal axis: Price of main ($), 10 to 65.]

EXAMPLE 2.10
CUMULATIVE PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is a difference in expenditure between interstate and intrastate visitors. < FESTIVAL > Construct and interpret cumulative percentage polygons to compare festival expenditure of interstate and intrastate visitors.

SOLUTION
Figure 2.11 Cumulative percentage polygons for festival expenditure
[Cumulative percentage polygons, series Interstate visitors and Intrastate visitors. Vertical axis: % (0 to 100). Horizontal axis: $ (0 to 1,200).]

In Figure 2.11, we see that the curve for expenditure by intrastate visitors is to the left of that by interstate visitors. This indicates that generally intrastate visitors spend less during the festival than interstate visitors.

Problems for Section 2.3

LEARNING THE BASICS
2.15 The values for a data set vary from 11.6 to 97.8.
a. If these values are grouped into nine classes, indicate appropriate class boundaries.
b. What class width did you choose?
c. What are the corresponding class mid-points?
2.16 The cumulative percentage polygon below shows the amount spent (in dollars) by 200 customers at a local supermarket.
[Ogive – amount spent at local supermarket. Vertical axis: % (0 to 100). Horizontal axis: Amount spent ($), 0 to 200.]
a. Approximately what percentage of customers spent less than $100?
b. Approximately how many customers spent at least $60?
c. Approximately how much did the top 10% of customers spend?
d. Approximately how much did the bottom 10% of customers spend?

APPLYING THE CONCEPTS
You can solve problems 2.17 to 2.19 manually or by using Microsoft Excel.
2.17 The following data represent the electricity cost (in dollars) during the month of July for a random sample of 50 two-bedroom apartments in a New Zealand city. < ELECTRICITY >
Electricity charge ($)
96 171 202 157 185 90 141 149 206 95
163 150 108 119 183 82 165 167 149 158
147 172 123 130 114 102 111 128 143 135
153 148 144 187 191 197 213 168 166 137
127 130 109 139 129 178 116 175 154 151
a. Construct a frequency distribution and a percentage distribution with upper class boundaries of <$100, <$120, and so on.
b. Plot the corresponding histogram and percentage polygon.
c. Construct the corresponding cumulative percentage distribution and plot the corresponding ogive (cumulative percentage polygon).
d. Around what amount does the monthly electricity cost seem to be concentrated?
2.18 To investigate the variation in fuel prices in New South Wales on a day in March 2017, a random sample of 45 petrol stations, each in a different location, was selected. The price per litre of both unleaded petrol and diesel is recorded in < FUEL_2017 >. Using the New South Wales data:
a. Construct frequency, percentage and cumulative distributions for the price of petrol and diesel.
b. As separate graphs, plot frequency histograms for the price of petrol and diesel.
c. On the same set of axes plot percentage polygons for the price of petrol and diesel.
d. On the same set of axes plot cumulative percentage polygons for the price of petrol and diesel.
e. What can you conclude about the variation in the fuel prices in New South Wales at the time the data were collected?
2.19 The ordered arrays in the table below give the life (in hours of usage) of samples of forty 15-watt CFL (compact fluorescent lamp) energy-saving light bulbs produced by two manufacturers, A and B. < BULBS >
Manufacturer A
5,544 5,814 6,190 6,307 6,342 6,423 6,429 6,485 6,612 6,667
6,832 6,868 6,879 6,930 6,941 7,007 7,037 7,043 7,059 7,136
7,497 7,645 7,654 7,773 7,816 7,838 7,924 7,999 8,038 8,067
8,091 8,119 8,392 8,416 8,416 8,514 8,532 8,542 8,544 8,731
Manufacturer B
6,701 6,837 6,961 7,118 7,133 7,142 7,156 7,344 7,493 7,569
7,607 7,612 7,651 7,721 7,754 7,767 7,806 7,839 7,888 7,983
8,298 8,344 8,535 8,666 8,792 8,800 8,856 8,861 8,993 9,001
9,036 9,096 9,262 9,385 9,460 9,471 9,521 9,540 9,693 9,744
a. Construct a frequency distribution and percentage distribution for each manufacturer.
b. Plot the corresponding frequency histograms on separate graphs and the percentage polygons on the same graph.
c. Form the cumulative percentage distributions and plot the ogives on one graph.
d. Which manufacturer has bulbs with a longer life – manufacturer A or manufacturer B? Explain.

2.4 ORGANISING AND VISUALISING TWO CATEGORICAL VARIABLES

We often wish to study patterns that may exist between two or more categorical variables – for example, education level and gender.

Organising Two Categorical Variables: Contingency Tables

A contingency (or cross-classification) table presents the data for two categorical variables. The rows contain the categories of one variable and the columns the categories of the other variable. The intersections of each row and column category, called the cells, contain the joint responses – that is, the data that are in the row category and also in the column category. Depending on the type of contingency table constructed, the cells may contain the frequency, the percentage of the overall total, the percentage of the row total or the percentage of the column total in both categories.
LEARNING OBJECTIVE 3 Describe the relationship between two categorical variables using contingency tables

contingency (or cross-classification) table – descriptive statistics: Summary table for two categorical variables; each cell represents data that satisfy the given values of both variables.

Suppose that for the 100 recent property sales in the council area introduced in Example 2.1, < PROPERTY >, Kai wishes to explore whether there is a pattern or relationship between the size of a house or unit (defined by the number of bedrooms) and its location (either in town or rural).

To construct a contingency table, classify or sort the data into one of the r × c possible cells in the table, where r is the number of row categories and c is the number of column categories. Note that the cells must be mutually exclusive and exhaustive so that each data value belongs in one and only one cell.

In the contingency table in Table 2.11, we have two row categories, rural or town, and five column categories, from one to more than four bedrooms, so we are sorting the data into 10 (2 × 5) possible cells. That is, each cell is a combination of number of bedrooms and location. For example, for properties with more than four bedrooms, in the sample there are five town properties but only one rural property.

Table 2.11 Frequency contingency table for number of bedrooms and location

| Location | 1 bedroom | 2 bedrooms | 3 bedrooms | 4 bedrooms | >4 bedrooms | Total |
|---|---|---|---|---|---|---|
| Rural | 2 | 5 | 16 | 10 | 1 | 34 |
| Town | 4 | 14 | 29 | 14 | 5 | 66 |
| Total | 6 | 19 | 45 | 24 | 6 | 100 |

For further exploration of possible patterns or relationships between number of bedrooms and location in the council area, Kai can construct contingency tables based on percentages. To do this, he will convert the cell frequencies into percentages based on one of the following three totals:
1. The overall total (i.e. the 100 properties)
2. The row totals (i.e. 34 rural and 66 urban properties)
3. The column totals (i.e. number of one-bedroom, two-bedroom, up to more than four-bedroom properties).
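The three conversions listed above can be sketched in Python as follows. This is an illustrative calculation only, using the cell frequencies of Table 2.11; the variable names (`counts`, `overall_pct`, `row_pct`, `col_pct`) are assumptions made for this sketch, and percentages are rounded to one decimal place.

```python
# Cell frequencies from Table 2.11: rows are locations, columns are bedroom categories.
bedrooms = ["1", "2", "3", "4", ">4"]
counts = {
    "Rural": [2, 5, 16, 10, 1],
    "Town": [4, 14, 29, 14, 5],
}

overall_total = sum(sum(row) for row in counts.values())      # 100 properties
row_totals = {loc: sum(row) for loc, row in counts.items()}   # 34 rural, 66 town
col_totals = [sum(col) for col in zip(*counts.values())]      # totals per bedroom category

# 1. Percentage of the overall total.
overall_pct = {
    loc: [round(100 * n / overall_total, 1) for n in row]
    for loc, row in counts.items()
}
# 2. Percentage of each row (location) total.
row_pct = {
    loc: [round(100 * n / row_totals[loc], 1) for n in row]
    for loc, row in counts.items()
}
# 3. Percentage of each column (bedroom category) total.
col_pct = {
    loc: [round(100 * n / total, 1) for n, total in zip(row, col_totals)]
    for loc, row in counts.items()
}

for loc in counts:
    print(loc, "overall %:", overall_pct[loc])
    print(loc, "row %:    ", row_pct[loc])
    print(loc, "column %: ", col_pct[loc])
```

Each dictionary holds one percentage per bedroom category, in the order given by `bedrooms`, so the three-bedroom figures are at index 2.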
Tables 2.12, 2.13 and 2.14 summarise these percentages.

Table 2.12 Percentage contingency table for number of bedrooms and location based on overall total

| Location | 1 bedroom % | 2 bedrooms % | 3 bedrooms % | 4 bedrooms % | >4 bedrooms % | Total % |
|---|---|---|---|---|---|---|
| Rural | 2.0 | 5.0 | 16.0 | 10.0 | 1.0 | 34.0 |
| Town | 4.0 | 14.0 | 29.0 | 14.0 | 5.0 | 66.0 |
| Total | 6.0 | 19.0 | 45.0 | 24.0 | 6.0 | 100.0 |

Table 2.13 Contingency table for number of bedrooms and location based on row total reported as a percentage

| Location | 1 bedroom % | 2 bedrooms % | 3 bedrooms % | 4 bedrooms % | >4 bedrooms % | Total % |
|---|---|---|---|---|---|---|
| Rural | 5.9 | 14.7 | 47.1 | 29.4 | 2.9 | 100.0 |
| Town | 6.1 | 21.2 | 43.9 | 21.2 | 7.6 | 100.0 |
| Total | 6.0 | 19.0 | 45.0 | 24.0 | 6.0 | 100.0 |

Table 2.14 Contingency table for number of bedrooms and location based on column total reported as a percentage

| Location | 1 bedroom % | 2 bedrooms % | 3 bedrooms % | 4 bedrooms % | >4 bedrooms % | Total % |
|---|---|---|---|---|---|---|
| Rural | 33.3 | 26.3 | 35.6 | 41.7 | 16.7 | 34.0 |
| Town | 66.7 | 73.7 | 64.4 | 58.3 | 83.3 | 66.0 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |

Table 2.12 shows that 45% of the properties have three bedrooms and that 29% of the properties are located in town and have three bedrooms. Table 2.13 shows that 47.1% of rural properties have three bedrooms while only 43.9% of properties located in town have three bedrooms. Table 2.14 shows that 64.4% of three-bedroom properties are located in town while 35.6% are rural.

Visualising Two Categorical Variables: Side-by-Side Bar Charts

side-by-side bar chart: Graphical representation of a cross-classification table.

A useful way to display the results of contingency table data is by constructing a side-by-side bar chart. Figure 2.12, using the data from Table 2.11, is a Microsoft Excel side-by-side bar chart that compares the number of bedrooms based on the location of the property.
Figure 2.12 Microsoft Excel side-by-side bar chart for number of bedrooms and location
[Side-by-side chart for number of bedrooms and location, series Town and Rural. Vertical axis: Number of bedrooms (1, 2, 3, 4, >4). Horizontal axis: Number of properties (0 to 30).]

EXAMPLE 2.11
SIDE-BY-SIDE CHARTS FOR PRICE OF RURAL AND URBAN PROPERTIES
For the 100 recent property sales, construct and interpret side-by-side charts to investigate if there is a difference between rural and urban property prices. < PROPERTY >

SOLUTION
First, construct a column percentage contingency table for price and location.

Table 2.15 Contingency table for price and location based on percentage of column total

| Asking price ($) | Rural frequency | Town frequency | Rural column percentage | Town column percentage |
|---|---|---|---|---|
| 300,000 to < 400,000 | 8 | 17 | 23.5 | 25.8 |
| 400,000 to < 500,000 | 9 | 32 | 26.5 | 48.5 |
| 500,000 to < 600,000 | 12 | 10 | 35.3 | 15.1 |
| 600,000 to < 700,000 | 4 | 6 | 11.8 | 9.1 |
| 700,000 to < 800,000 | 0 | 0 | 0.0 | 0.0 |
| 800,000 to < 900,000 | 1 | 1 | 2.9 | 1.5 |
| Total | 34 | 66 | 100.0 | 100.0 |

From Table 2.15 we can construct a side-by-side bar chart for location and price.

Figure 2.13 Side-by-side bar chart for location and price
[Side-by-side chart for location and asking price, series Town and Rural. Vertical axis: asking price classes from $300,000 to < $400,000 up to $800,000 to < $900,000. Horizontal axis: % (0 to 50).]

Figure 2.13 shows that a higher proportion of rural properties have prices above $500,000, and that approximately 50% of the urban properties have prices between $400,000 and $500,000.

Problems for Section 2.4

LEARNING THE BASICS
2.20 The following data represent the responses to two questions asked in a survey of 40 undergraduate students majoring in business:
What is your gender? (M = Male; F = Female; O = Other)
What is your major?
(A = Accounting; I = Information Systems; M = Marketing) Gender Major Gender Major M A M I M I M I M I M A F M M A M A F M F I M M F A F I M A F A F I M A M I M A F A F I M A M I M A M A M M M A M I M A F M F A F A M I M A F I F A M A F I M I a. Represent the data in a contingency table where the rows represent the gender categories and the columns the academic-major categories. b. Construct cross-classification tables based on percentages of all 40 student responses, on row percentages and on column percentages. c. Using the results from (a), construct a side-by-side bar chart of gender based on student major. 2.21 Given the following cross-classification table, construct a sideby-side bar chart comparing A and B for each of the threecolumn categories on the vertical axis. A B 1 20 80 2 40 80 3 40 40 Total 100 200 APPLYING THE CONCEPTS 2.22 The Living in Australia Study gives information on the study mode (full or part time) of students studying for a post-school qualification, as well as their employment status. Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 58 CHAPTER 2 ORGANISING AND VISUALISING DATA Percentage of students enrolled in post-school education Studying Studying Employment status full-time part-time All students Employed full-time 6.4 37.7 44.1 Employed part-time 18.1 12.2 30.3 Not employed 17.4 8.2 25.6 All students 41.9 58.1 100.0 Data obtained from the Household, Income and Labour Dynamics in Australia (HILDA) Survey, 2001–2005 (also known as the Living in Australia Study), The University of Melbourne 1994–2011 a. Construct cross-classification tables based on row percentages and column percentages. b. Construct a side-by-side bar chart for employment status and study mode. c. What conclusions do you draw from these analyses? 2.23 The following table classifies road fatalities in Australia from 2012 to 2016 (inclusive) by age and gender. 
< ROAD_FATALITIES_2012_2016 >

                          Gender
Age              Male   Female   Unknown    Total
< 10               89       74         3      166
10 to < 20        402      182         0      584
20 to < 30        990      310         0    1,300
30 to < 40        686      185         0      871
40 to < 50        693      182         0      875
50 to < 60        534      179         0      713
60 to < 70        412      212         0      624
70 to < 80        319      162         0      481
80 to < 90        243      178         0      421
90 or more         56       47         0      103
Unknown             3        1         0        4
Total           4,427    1,712         3    6,142

Data obtained from the Australian Road Deaths Database <www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 18 March 2017

Ignore the unknown categories.
a. Investigate the relationship between age and gender by constructing a side-by-side bar chart to highlight the pattern of male and female road fatalities.
b. Discuss the pattern of male and female road fatalities for 2012 to 2016.

2.24 The following data for people aged 15 years and older, classified by highest level of educational attainment and gender, were obtained for a certain Australian state:

Highest level of educational attainment   Males (’000)   Females (’000)   Total (’000)
Below Year 10                                    238.1            253.9          492.0
Year 10 or equivalent                            326.7            394.4          721.1
Year 11 or equivalent                            102.0             89.4          191.4
Year 12 or equivalent                            492.2            506.8          999.0
Post-secondary below bachelor degree             840.8            687.6        1,528.4
Bachelor degree or higher                        749.8            856.5        1,606.3
Total                                          2,749.6          2,788.6        5,538.2

Data obtained from Australian Bureau of Statistics, Education and Work, Australia, May 2016, 62270DO001_201605 <www.abs.gov.au> accessed March 2017. © Commonwealth of Australia

a. Construct a cross-classification table based on column percentages.
b. Construct a side-by-side bar chart to highlight the information in (a).
c. Discuss any apparent pattern in male and female education levels in this Australian state.

2.25 The table below contains the sales of new passenger cars in New Zealand for February 2016 and 2017.
< NZ_CAR_SALES_16_17 >

Sales of new cars

Make             February 2017   February 2016
Audi                       176             137
BMW                        160             193
Citroen                     15              10
Dodge                       23              44
Ford                       611             604
Holden                     654             645
Honda                      373             292
Hyundai                    606             470
Jaguar                      26              37
Jeep                        56             100
Kia                        513             407
Land Rover                  93              64
Lexus                       62              59
Maserati                    29               6
Mazda                      755             719
Mercedes Benz              245             164
Mini                        45              44
Mitsubishi                 547             413
Nissan                     346             484
Peugeot                     48              55
Porsche                     22              25
Renault                     30              11
Skoda                      104             102
SsangYong                   93              95
Subaru                     305             208
Suzuki                     624             362
Tesla                       21               1
Toyota                     990             915
Volkswagen                 355             309
Volvo                       48              53
Other                       75             163
Total                    8,050           7,191

Data obtained from Motor Industry Association of New Zealand <www.mia.org.nz> accessed March 2017, reproduced with permission. © Motor Industry Association of New Zealand

a. Construct a side-by-side bar chart for the makes of cars.
b. Discuss the changes in the sale of new cars in February 2017 compared with February 2016.

2.5 VISUALISING TWO NUMERICAL VARIABLES

LEARNING OBJECTIVE 4
Describe the relationship between two numerical variables using scatter diagrams and time-series plots

Scatter Diagrams

When analysing a single numerical variable (univariate data), such as the price of a restaurant meal or festival expenditure, you can use a histogram, polygon or cumulative percentage polygon, as introduced in Section 2.3. When examining the relationship between two numerical variables (bivariate data) we can use a scatter diagram or plot to obtain a picture of a possible relationship. Plot one variable, the independent variable, on the horizontal (or x) axis and the other variable, the dependent variable, on the vertical (or y) axis.

scatter diagram
Graphical representation of the relationship between two numerical variables; plotted points represent the given values of the independent variable and corresponding dependent variable.

For example, a marketing analyst could study the effectiveness of advertising by comparing weekly sales volumes and weekly advertising expenditures. Or, a human resources director interested in the salary structure of the company could compare the employees’ years of experience with their current salaries.

For the data from 100 recent property sales in the council area introduced in Example 2.1, and stored in < PROPERTY >, a scatter plot can be used to explore the relationship between number of bedrooms (independent variable) and asking price (dependent variable). For each property, plot the number of bedrooms on the horizontal axis and the corresponding asking price on the vertical axis. Figure 2.14 gives an Excel scatter diagram for this data.

Figure 2.14 Microsoft Excel scatter diagram for number of bedrooms and asking price
[Chart: Scatter diagram – 100 recent property sales. Horizontal axis: number of bedrooms (0 to 8); vertical axis: asking price ($0 to $900,000).]

As expected, there is a weak increasing (positive) linear relationship, with more bedrooms associated with higher asking prices. Other pairs of variables may have a decreasing (negative) relationship in which one variable increases as the other decreases; for example, the age of a second-hand car and its value. Scatter diagrams are revisited in Chapter 3 when the coefficient of correlation and the covariance are studied, and in Chapter 12 when regression analysis is introduced.

Time-series Plots

A time-series plot is used to study patterns in the value of a variable over time. A time-series plot displays the time period on the horizontal axis and the variable of interest on the vertical axis. Figure 2.15 is a time-series plot of the monthly exchange rate of the Australian dollar against the United States dollar from January 2010 to February 2017. < EXCHANGE_RATE_2010_2017 >

time-series plot
Graphical representation of the value of a numerical variable over time.

Figure 2.15 Microsoft Excel time-series plot of exchange rates: Australian dollar against US dollar 2010 to 2017
[Chart: Horizontal axis: end of month (Jan 10 to Feb 17); vertical axis: exchange rate, US$ per AUS$ (0.0 to 1.2); series: AUS$:US$.]
Source: Data based on Reserve Bank of Australia, Statistics, Exchange Rates <www.rba.gov.au> accessed March 2017.

During 2010 and the first six months of 2011, rates rose steadily from US$0.90 to US$1.10. They remained between US$1.00 and US$1.10 until 2013, steadily decreased to US$0.80 in September 2015, and then remained between US$0.80 and US$0.90 until February 2017.

think about this
Rare events

When rare events happen, we often react to them more strongly than to common events with similar outcomes. Charts and graphs can give us a picture of the situation, helping to put the risk of these rare events in perspective. For example, in Australia when there is a shark attack, even if not fatal, there are often calls to protect beach users from attack, including controlling shark numbers by culling. However, shark attacks are rare: there are usually between 10 and 15 attacks annually in Australia, of which one or two are fatal, as shown in the table below.
< SHARKS_AND_DROWNINGS >

Shark attacks, Australia

Year    Total attacks   Fatal attacks   Non-fatal
2001                9               0           9
2002                9               2           7
2003                8               1           7
2004               13               2          11
2005               10               2           8
2006                7               1           6
2007               12               0          12
2008               10               1           9
2009               22               0          22
2010               14               1          13
2011               13               4           9
2012               14               2          12
2013               10               2           8
2014               11               2           9
2015               18               1          17
2016               15               2          13

Data obtained from the International Shark Attack File <www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm> accessed May 2014 and March 2017, © Florida Museum of Natural History, University of Florida

If we compare these mainly non-fatal shark attacks with the number of people drowning annually at Australian beaches in the same period (see the bar chart below), it is clear that the risk of drowning while at the beach is far higher than that of being attacked by a shark.

[Chart: Bar chart, Australia – shark attacks and beach drownings, 2001 to 2016. Vertical axis: 0 to 70; series: Shark attacks, Beach drownings.]
Drowning data obtained from Royal Life Saving, National Drowning Reports 2001 to 2016 <www.royallifesaving.com.au/facts-and-figures/research-and-reports/drowning-reports> accessed March 2017; shark attack data obtained from International Shark Attack File, Florida Museum of Natural History, University of Florida <www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm>

A time-series plot of the same data, shown below, indicates that there is no apparent increase in either the number of shark attacks or the number of drownings at Australian beaches.
[Chart: Time-series plot, Australia – shark attacks and beach drownings, 2001 to 2016. Vertical axis: 0 to 70; series: Beach drownings, Shark attacks.]

Problems for Section 2.5

LEARNING THE BASICS

2.26 Below is a set of data from a sample of n = 11 items:

X (horizontal axis)    7   5   8   3   6  10  12   4   9  15  18
Y (vertical axis)     21  15  24   9  18  30  36  12  27  45  54

a. Plot the scatter diagram.
b. Is there a relationship between X and Y? Explain.

2.27 Below is a series of real annual sales (in millions of constant 2010 dollars) for a department over an 11-year period (2007 to 2017):

Year    2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
Sales   13.0  17.0  19.0  20.0  20.5  20.5  20.5  20.0  19.0  17.0  13.0

a. Construct a time-series plot.
b. Does there appear to be any change in real annual sales over time? Explain.

APPLYING THE CONCEPTS

You can solve problems 2.28 to 2.32 manually or by using Microsoft Excel.

2.28 For the city and suburban restaurants introduced in Section 2.2, an independent reviewer rated each restaurant on food quality, décor and service. Each was given a score out of 30 and then the three scores were added to give an overall rating out of 90. < RESTAURANT >
a. Construct a scatter diagram with overall rating on the horizontal axis and price on the vertical axis.
b. Does there appear to be a relationship between overall rating and price? If so, is the relationship positive or negative?

2.29 The data in < USED_CARS > were obtained from several used-car yards for 4-cylinder, 4-door sedans.
a. Construct a scatter diagram, with price on the vertical axis, to investigate the relationship between the age of a car and its price.
b. Construct a scatter diagram, with price on the vertical axis, to investigate the relationship between the kilometres travelled by a car and its price.
c. What conclusions can you reach about the relationship between the age or the kilometres travelled and the price of a used car? Are these the relationships you expected?

2.30 To investigate the variation in fuel prices in New South Wales on a given day, a random sample of 45 towns and suburbs was selected. The average price per litre of both unleaded petrol and diesel is recorded in < FUEL_MARCH_2017 >. Using the New South Wales data:
a. Construct a scatter diagram to investigate the relationship between petrol and diesel prices.
b. What conclusions can you reach about the relationship between petrol and diesel prices?

2.31 The data file < UNEMPLOYMENT_RATE_2007_2017 > gives the monthly Australian unemployment rate (seasonally adjusted) from March 2007 to February 2017.
a. Construct a time-series plot for the unemployment rate.
b. Does there appear to be any pattern?

2.32 A general measure of inflation is the annual increase in the consumer price index (CPI). The table below gives the annual increase in the CPI in Australia and New Zealand. < INFLATION_2011_2016 >

Year to    Australia rate %   NZ rate %
Mar 11                  3.3         4.5
Jun 11                  3.5         5.3
Sep 11                  3.4         4.6
Dec 11                  3.0         1.8
Mar 12                  1.6         1.6
Jun 12                  1.2         1.0
Sep 12                  2.0         0.8
Dec 12                  2.2         0.9
Mar 13                  2.5         0.9
Jun 13                  2.4         0.7
Sep 13                  2.2         1.4
Dec 13                  2.7         1.6
Mar 14                  2.9         1.5
Jun 14                  3.0         1.6
Sep 14                  2.3         1.0
Dec 14                  1.7         0.8
Mar 15                  1.3         0.3
Jun 15                  1.5         0.4
Sep 15                  1.5         0.4
Dec 15                  1.7         0.1
Mar 16                  1.3         0.4
Jun 16                  1.0         0.4
Sep 16                  1.3         0.4
Dec 16                  1.5         1.3

Data obtained from Reserve Bank of Australia <www.rba.gov.au> and Reserve Bank of New Zealand <www.rbnz.govt.nz> accessed March 2017

a. Investigate the relationship between the inflation rates for the two countries by constructing time-series plots on the same set of axes.
b. What conclusions can you make about the inflation rates of the two countries?

2.6 BUSINESS ANALYTICS APPLICATIONS – DESCRIPTIVE ANALYTICS

LEARNING OBJECTIVE 5
Develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for descriptive analytics

As business people gain the ability to retrieve and process larger amounts of data in smaller amounts of time, sometimes approaching near real time, some have asked: At what point does the need for using samples to expedite analysis disappear? Might there not be a day when business decision makers could just analyse all the data continuously as it flows into the business in near real time? While, in most cases, continuous data analysis is not yet a reality, these questions taken together have created the demand for methods known collectively as business analytics. Analytics represents an evolution of pre-existing statistical methods combined with advances in information systems and techniques from management science. Analytics is naturally interdisciplinary, and this nature underscores how important statistics is as part of your business education.

Descriptive analytics, predictive analytics and prescriptive analytics form the three broad categories of analytic methods. Descriptive analytics explores business activities that have occurred or are occurring now. Predictive analytics identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics. Prescriptive analytics investigates what should occur and prescribes the best course of action for the future. We may use a number of organising and visualisation tools to aid our descriptive analytics.
Giving decision makers the ability to combine, collect, organise and visualise data that could be used for day-to-day, if not minute-by-minute, business monitoring in the present, rather than business activity in the past, is one of the main goals of descriptive analytics.

Being able to do real-time monitoring can be useful for a business that handles a perishable inventory. Perishable inventory is inventory that will disappear after a particular event takes place, such as an airplane taking off for its destination or the end of a concert. Empty seats on the airplane or at the concert cannot be sold later. Perishable inventory also occurs with less tangible inventory, such as spaces reserved for advertisements on a commercial web page – such spaces cannot be sold after the page has been viewed.

In the past, the problem of perishable inventory was handled by models that predicted consumer behaviour based on historical patterns. A concert promoter sets prices based on the best estimation of ticket-buying behaviour. Today, by constantly monitoring sales, the promoter can use a dynamic pricing model in which the price of tickets could fluctuate in near real time based on whether sales are exceeding or failing to meet predicted demand.

Real-time monitoring can also be useful for a business that manages flows of people or objects that can be adjusted in near real time, especially when there is more than one flow and the flows are interrelated. For example, overseers of a large sports stadium could benefit from monitoring the flows of cars in parking facilities, as well as the flow of fans into the stadium, and redirect stadium personnel to assist at points of congestion.

The managers of WaldoLands, the theme park that licenses the characters from the Waldowood stories, seek to stabilise and grow their business.
During the most recent tourist season, their park was plagued by a number of major ride breakdowns, long lines at popular attractions and key food service areas, and a general inability to respond to the park’s day-to-day operating status. Last year’s problems led to numerous unfavourable reviews in key social media travel websites, and the managers are concerned that possible patrons may decide to visit competing parks run by Universal Parks & Resorts and Six Flags Entertainment. For this year, the managers have added the LineJumper service that allows patrons to ‘jump’ to the head of a line, and are offering the premium-priced No-Stress-Express experience that offers special guided tours and behind-the-scenes access. The managers also hope the new multimillion-dollar Rabbit Creek Racers and a greatly expanded MirrorGate Experience, based on a popular sci-fi franchise, will boost attendance, even though they fret about the technical complexity of these rides.

In the WaldoLands scenario, managers could monitor flows of patrons through the ticket booths and into the theme park while also keeping an eye on the length of waiting lines and the use of the LineJumper service. This would allow the managers to adjust ride lengths or dispatch live performers to entertain patrons in line, and to try to redirect patrons to areas of the park that are currently under capacity.

business analytics
Skills, technologies and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.

descriptive analytics
A form of business analytics that explores business activities that have occurred or are occurring in the present moment.

predictive analytics
A form of business analytics that identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics.
prescriptive analytics
A form of business analytics that investigates what should occur and prescribes the best course of action for the future.

Dashboards

Over several decades, people talked about developing executive information systems that would put information at the ‘fingertips’ of decision makers. Many of these efforts have spurred the development of dashboards that use descriptive analytics methods to present up-to-the-minute operational status about a business.

dashboard
Descriptive analytics methods to present up-to-the-minute operational status about a business.

An analytics dashboard provides this information in a visual form that is intended to be easy to comprehend and review. Dashboards can contain the summary tables and charts discussed earlier in this chapter, as well as newer or more novel forms of information presentation that can summarise big data as well as smaller sets of data. The dashboard in Figure 2.16 displays key WaldoLands operational statistics that are updated on a near-real-time basis. Clicking one of the categories would lead to other displays that contain additional information about theme park operations.

Figure 2.16 A WaldoLands dashboard
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

sparklines
A descriptive analytics method that summarises time-series data as small, compact graphs designed to appear as part of a table.

Sparklines are one of the descriptive analytic methods that dashboards can contain. Sparklines summarise time-series data as small, compact graphs designed to appear as part of a table (or a written passage).
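To make the idea of a sparkline concrete, the following is a minimal sketch in Python rather than Excel, written under stated assumptions: the ride names and half-hourly wait times are invented for illustration and are not taken from the WaldoLands figures, and the function name `sparkline` is hypothetical. Each value is scaled to one of eight block characters so that a whole series fits on one table row.

```python
# Minimal text sparkline: each value is mapped to one of eight block characters.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Return a one-line text sparkline for a sequence of numbers."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are equal: draw a flat line at mid height.
        return BLOCKS[3] * len(values)
    # Scale each value to an index from 0 (lowest) to 7 (highest).
    return "".join(
        BLOCKS[int((v - lo) / span * (len(BLOCKS) - 1))] for v in values
    )

# Hypothetical half-hourly wait times in minutes (illustrative data only).
wait_times = {
    "Ride A": [10, 15, 25, 40, 55, 60, 45, 50, 70],
    "Ride B": [30, 30, 25, 20, 20, 15, 10, 10, 5],
}

for ride, times in wait_times.items():
    # The last value is the current wait time, printed beside the graph
    # to give the same context as a dot marker on a graphical sparkline.
    print(f"{ride:<8} {sparkline(times)}  now: {times[-1]} min")
```

Printing the current value next to each sparkline is one way to show where the latest reading sits relative to the rest of the day; a graphical sparkline would use a marker instead.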
In Figure 2.17, sparklines display the wait times for WaldoLands attractions at half-hour intervals for the current day, helping to provide context for the current wait times that are indicated by the dot markers. For example, the sparkline for the Rabbit Springs Racers ride shows that the current wait time is one of the longest wait times for the day.

Figure 2.17 WaldoLands wait times table with sparklines
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

gauges
A visual display of data inspired by the speedometer in a car.

bullet graph
A horizontal bar chart inspired by a thermometer.

Analogous to automotive dashboards, analytic dashboards can provide warnings when predefined conditions are met or exceeded. Figure 2.18 contains a set of gauges and a bullet graph that both display the wait-line status for WaldoLands attractions. These displays combine a single numerical measure (wait time) with one of five categorical values that rates the wait time subjectively, from excellent (less than 25 minutes) to poor (more than 85 minutes). While gauges have been a popular choice in business, most information design specialists prefer bullet graphs because those graphs foster the direct comparison of each measurement (wait time in Figure 2.18). Gauges can also consume a lot of visual space in a dashboard. For example, in Figure 2.18, note the amount of space the gauges consume to show the status of the six most popular rides. The corresponding bullet graph can display the status of 14 rides and present the wait times in a way that facilitates comparisons.
For these reasons, some consider gauges little more than examples of chartjunk (see reference 1), even as many decision makers request them due to their visual appeal.¹

¹ This tension between what decision makers might find visually appealing and what statisticians and information specialists have found most useful reflects the relative newness of these descriptive methods. Over time, this tension may ease and an acceptable standard for representing such information may emerge.

chartjunk
Unnecessary information and detail that reduces the clarity of a graph.

Figure 2.18 Gauges and bullet graph of wait times for WaldoLands attractions
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

Dashboards may also contain treemaps that help users to visualise two variables, one of which must be categorical. Treemaps are especially useful when categories can be grouped to form a multilevel hierarchy or tree. Figure 2.19 displays a pair of treemaps that visualise the number of social media comments made today about WaldoLands attractions (the size of each rectangle). The left treemap shows each ride grouped by the ‘land’ of WaldoLands (StrausLand, the BWLand or FamilyLand) where the attraction is found. The right treemap shows the data for the six most popular WaldoLands attractions, illustrating that treemaps can be used with non-hierarchical information as well.

treemaps
A descriptive analytics method that helps visualise two variables, one of which must be categorical.

Figure 2.19 Treemaps of number and favourability of social media comments about WaldoLands attractions
[Treemaps: rectangles grouped by land (StrausLand, The BWLand, FamilyLand); legible attraction labels include Kirby’s SplashDown, Soarin’ Stegosaurs, Stressed Out Wild Mouse, Rabbit Springs Racers, Mt Waldo Alpine Sleds, A.B.’s Hall of Mirrors, Mini RR Ride, MirrorGate Experience and Lande’s Musical Chairs.]
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

When combined with the Figure 2.18 gauges or bullet graph, the treemap on the right in Figure 2.19 would allow managers to preliminarily conclude that the negativity of comments seems to be tied to current wait lines and that rides with the shortest wait lines may generate the fewest social media comments. These relationships could then be further investigated and, if the former one was confirmed, managers could, in the future, respond to excessive wait lines by shortening the ride length to handle more customers, sending live performers to entertain those waiting in line or instructing park staff to divert incoming park patrons to other rides.

Note that gauges, bullet graphs and treemaps use colour to represent the value of a second variable, thereby increasing the data density of the displays – one of the principles of good information design (see reference 2). However, when using these displays, particularly bullet graphs and treemaps, avoid using colour spectrums that run from red to green, the two colours most subject to confusion due to colour vision deficiencies.
(This is less of a problem with gauges, as colours subject to confusion will have unique positions on the gauge dial.)

Data Discovery

data discovery
Methods used to take a closer look at historical or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations.

drill-down
The revealing of the data that underlie a higher-level summary.

Data discovery methods allow decision makers to interactively organise or visualise data and perform preliminary analyses. These methods can be used to take a closer look at historical or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations. In these ways, data discovery realises the earlier promise of executive information systems to give decision makers the tools of data exploration and presentation.

In its simplest version, data discovery involves drill-down, the revealing of the data that underlies a higher-level summary. For example, clicking the merchandise entry in the Figure 2.16 WaldoLands dashboard would reveal more detailed information such as the table of sales by ‘lands’ shown in the left table in Figure 2.20. In turn, this summary can be drilled down to reveal sales by each store in the theme park (see table on the right in Figure 2.20). At this level of detail, sales at Peri’s Playtime are significantly lower than the other stores, perhaps suggesting that this store be closed, relocated, or have its merchandise mix reconsidered.

Figure 2.20 WaldoLands merchandise sales summarised on two different levels
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

Another level of drill-down (not shown) would reveal the sales of each item or SKU (stock-keeping unit) sold in each store.
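The drill-down levels described above can be sketched as repeated summaries of the same item-level records, each grouped at a finer level. This is a minimal Python illustration under stated assumptions: the store names, items and sales amounts are invented (only the land names come from the text), and the function name `summarise` is hypothetical.

```python
from collections import defaultdict

# Hypothetical item-level records: (land, store, item, sales in dollars).
# Store names, items and amounts are invented for illustration only.
sales = [
    ("StrausLand", "Store A", "T-shirt", 1200),
    ("StrausLand", "Store A", "Mug", 300),
    ("StrausLand", "Store B", "T-shirt", 800),
    ("FamilyLand", "Store C", "Plush toy", 950),
    ("FamilyLand", "Store C", "Mug", 150),
]

def summarise(records, level):
    """Total sales grouped by the first `level` fields of each record.

    level=1 groups by land, level=2 by land and store,
    level=3 by land, store and item.
    """
    totals = defaultdict(int)
    for record in records:
        key = record[:level]       # grouping key, e.g. ("StrausLand", "Store A")
        totals[key] += record[-1]  # the last field is the sales amount
    return dict(totals)

by_land = summarise(sales, 1)   # highest-level summary
by_store = summarise(sales, 2)  # first drill-down: sales by store
by_item = summarise(sales, 3)   # second drill-down: sales by item in each store

for key, total in by_store.items():
    print(" / ".join(key), total)
```

Each drill-down level adds up to the level above it, so the store totals within a land sum to that land's total; checking this is a simple way to confirm that a detailed view and its summary agree.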
By reorganising that list by item, WaldoLands managers could discover which items are selling the best and may be subject to stockouts. Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 2.6 BUSINESS ANALYTICS APPLICATIONS – DESCRIPTIVE ANALYTICS 67 Problems for Section 2.6 2.33 The Edmunds.com NHTSA Complaints Activity Report is the result of the examination of the frequency, trends and composition of consumer vehicle complaint submissions at the car manufacturer, brand and category levels (data obtained from <www.edmunds.com/car-news/nhtsa-complaints-report. html>). The table below stored in < AUTOMAKER1 >, contains complaints received by six car manufacturers for January 2013. When the number of complaints is less than 300, the complaint rating is considered to be low; when the number of complaints is between 300 and 500, the complaint rating is considered to be medium; and when the number of complaints is more than 500, the complaint rating is considered to be high. Car manufacturer American Honda Chrysler LLC Ford Motor Company General Motors Nissan Motors Corporation Toyota Motor Sales 2.36 Number of complaints 169 439 440 551 467 332 a. Construct a gauge for each car manufacturer. b. Construct a bullet graph for the car manufacturers. c. Which display is more effective at comparing the number of complaints for each car manufacturer? 2.34 There is a very large number of mutual funds from which an investor can choose. Each mutual fund has its own mix of different types of investments. The file < BEST_FUNDS1 > contains the one-year return percentage and the three-year annualised return percentage for the 10 best short-term bond and long-term bond funds according to the U.S. News & World Report score. (data obtained from <www.money.usnews.com/ mutual-funds/rankings>). a. Construct bullet graphs of the one-year returns and the threeyear returns. 
For the purposes of comparison, consider a return below 5% as low-performing, a return between 5 and 10% as medium-performing and a return above 10% as highperforming. b. Why would you not want to construct a gauge for each bond fund? c. What conclusions can you reach about the one-year and three-year return percentages for the short-term bond and long-term bond funds? 2.35 A financial analyst was interested in comparing the price-tobook ratio (P/B) of pharmaceutical companies. The analyst collected P/B ratios for 71 pharmaceutical companies (Industry Group SIC 3 code: 283) and stored them as part of the file < BUSINESS_VALUATION >. a. Visually evaluate the P/B ratios by constructing a bullet graph. For the purposes of comparison, consider a P/B ratio 2.37 2.38 2.39 that is 2 or less as excellent, a P/B ratio that is between 2 and 5 as acceptable, and a P/B ratio that is above 5 as unacceptable. b. Why would using gauges be a poor choice for this analysis? c. Are the three groupings of P/B ratios helpful in analysing the data? What constitutes an acceptable P/B ratio varies by industry and is partially based on subjective analysis. For the purposes of information presentation, would you redefine or subdivide the current acceptable category? The file < BB_COST_2012 > contains the total cost (in $) for four tickets, two beers, four soft drinks, four hot dogs, two game programs, two baseball caps and parking for one vehicle at each of the 30 Major League Baseball (MLB) parks during the 2012 season. (data obtained from <http:// fancostexperience.com>). a. Visually evaluate the total cost at each MLB park by constructing a bullet graph. For the purposes of comparison, consider a total cost (in dollars) less than $180 as inexpensive, between $180 and $240 as typical, and more than $240 as expensive. b. Which display best visualises the distribution of costs - the bullet graph or a stem-and-leaf display? Why? c. 
Name something that the bullet graph reveals about the data that the stem-and-leaf display does not. How could that be used as the basis for future analysis of total costs at MLB parks? Referring to the movie attendance data between 2002 and 2012 (stored in < MOVIE_ATTENDANCE2 >): a. Construct a sparkline graph for movie attendance between 2002 and 2012. b. What conclusions can you reach about movie attendance between 2002 and 2012? c. When would using a sparkline graph be the better choice to visualise these data? When would using the time-series plot be the better choice? d. Might you ever use both a sparkline graph and a timeseries plot in the same analysis report? Explain your reasoning. The file < STOCK_INDICES > contains the data that represent the total rate of return (as a percentage) for the Dow Jones Industrial Average (DJIA), the Standard & Poor’s 500 (S&P500) and the technology-heavy NASDAQ Composite (NASDAQ) from 2006 through 2012. (data obtained from <https://finance.yahoo. com> accessed 29 March 2013). a. Construct sparklines for the annual rate of return for the DJIA, S&P500 and NASDAQ from 2006 to 2012. b. What conclusions can you reach concerning the annual rates of return of the three market indices? From 2006 to 2012, the value of precious metals fluctuated dramatically. The file < METAL_INDICES > contains the total rate of return (as a percentage) for platinum, gold and silver Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 68 2.40 2.41 2.42 2.43 CHAPTER 2 ORGANISING AND VISUALISING DATA from 2006 through 2012. (data obtained from <https://finance. yahoo.com> accessed 29 March 2013). a. Construct sparklines for the annual rate of return for platinum, gold and silver from 2006 to 2012. b. What conclusions can you reach concerning the rates of return of the three precious metals? c. Compare the results of (b) to those of Problem 2.38(b). 
2.40 Drive-through service time is an important quality attribute for fast-food chains. The data in < SERVICE_TIME > are the mean service times for Burger King, Chick-Fil-A, McDonald’s and Wendy’s in 12 recent years. (data obtained from <bit.ly/qhvP3Zb>).
a. Construct sparklines of the mean service times for Burger King, Chick-Fil-A, McDonald’s and Wendy’s in 12 recent years.
b. What conclusions can you reach concerning the mean service times for Burger King, Chick-Fil-A, McDonald’s and Wendy’s in 12 recent years?

2.41 Sales of cars in the United States fluctuate from month to month and year to year. The data in the file < AUTO_SALES > represent the sales for various manufacturers in July 2013 and the change from July 2012 sales in percentages. (data obtained from <www.nytimes.com/interactive/2013/08/01/business/How-the-Auto-Industry-Fared-in-July.htm>).
a. Construct a treemap of the sales of cars and the change in sales from July 2012.
b. What conclusions can you reach concerning the sales of cars and the change in sales from July 2012?

2.42 The value of a National Basketball Association (NBA) franchise has increased dramatically over the past few years. The value of a franchise varies based on the size of the city in which the team is located, the amount of revenue it receives and the success of the team. The file < NBA_VALUES > contains the value of each team and the change in value in the past year. (data obtained from <www.forbes.com/nba-valuations>).
a. Construct a treemap that visualises the values of the NBA teams (size) and the one-year changes in value (colour).
b. What conclusions can you reach concerning the value of NBA teams and the one-year change in value?

2.43 The annual ranking of the FT Global 500 2013 provides a snapshot of the world’s largest companies. The companies are ranked by market capitalisation: the greater the sharemarket value of a company, the higher the ranking. The market capitalisations (in billions of dollars) and the 52-week change in market capitalisations (in percentages) for companies in the Automobile & Parts, Financial Services, Health Care Equipment & Services and Software & Computer Services sectors are stored in < FT_GLOBAL500 > (data obtained from <www.ft.com/intl/indepth/ft500>).
a. Construct a treemap that presents each company’s market capitalisation (size) and the 52-week change in market capitalisation (colour) grouped by sector and country.
b. Which sector seems to have the best gains in the market capitalisations of its companies? Which sectors seem to have the worst gains (or greatest losses)?
c. Construct a treemap that presents each company’s market capitalisation (size) and the 52-week change in market capitalisation (colour) grouped by country.
d. What comparison can be more easily made with the treemap constructed in (c) than with the treemap constructed in (a)?

2.44 Your task as a member of the International Strategic Management Team at your company is to investigate the potential for entry into a foreign market. As part of your initial investigation, you must provide an assessment of the economies of countries in the Americas and the Asia and Pacific regions. The file < DOING_BUSINESS > contains the 2012 GDPs per capita for these countries as well as the number of Internet users in 2011 (per 100 people) and the number of mobile phone subscriptions in 2011 (per 100 people). (data obtained from <https://data.worldbank.org>).
a. Construct a treemap of the GDPs per capita (size) and their number of Internet users in 2011 (per 100 people) (colour) for each country grouped by region.
b. Construct a treemap of the GDPs per capita (size) and their number of mobile phone subscriptions in 2011 (per 100 people) (colour) for each country grouped by region.
c. What patterns in these data do the two treemaps suggest? Are the patterns in the two treemaps similar or different?
Explain.

2.45 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and risk.
b. Drill down to examine the large-cap growth funds with high risk. How many funds are there? What conclusions can you reach about these funds?

2.46 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and rating.
b. Drill down to examine the large-cap growth funds with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.47 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies market cap, risk and rating.
b. Drill down to examine the large-cap funds that are high risk with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.48 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, risk and rating.
b. Drill down to examine the growth funds that are high risk with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.49 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. What are the attributes of the fund with the highest five-year return?
b. What five-year returns are associated with small market cap funds that have a rating of five stars?
c. Which fund(s) in the sample have the lowest five-year return?
d. What is the type and market cap of the five-star fund with the highest five-year return?

2.7 MISUSING GRAPHS AND ETHICAL ISSUES

LEARNING OBJECTIVE 6
Correctly present data in graphs

Good graphical displays should present the data in a clear and understandable way. Unfortunately, many graphs in newspapers and magazines, as well as graphs constructed using Microsoft Excel, are incorrect, misleading or unnecessarily complicated.

To illustrate the misuse of graphs, Figure 2.21 was constructed using data obtained from Wine Australia contained in data file < WINE_PROD_2014_15 >. In the figure, the contents of the wine bottle representing 606 million litres for 1995/96 appear to be approximately three times the contents of the icon representing 346 million litres for 1990/91. This is because a magnification factor of 1.75 (606/346 ≈ 1.75) has been applied to both height and width, so the apparent size (area) of the icon has increased by a factor of 1.75² ≈ 3. One principle of good graphs is that, when using three-dimensional icons, frequency/quantity must be proportional to volume.

[Figure 2.21 Misleading display of Australian wine production. Australian beverage wine production (million litres), drawn as wine-bottle icons for 1990/91, 1995/96, 2000/01, 2005/06, 2010/11 and 2014/15.]
Source: Data obtained from ‘Australian Gross Wine Production – pdf format’, Wine Australia Corporation <www.wineaustralia.com/australia> accessed December 2013.

Also, the time difference between the wine bottles is not constant. There are five years between the first five icons and four years between the last two. Good graphs should be properly scaled along each axis.

Finally, the year labels are ambiguous. It is not clear whether the 346 million litres represent the total production for the two years 1990 and 1991, the average production for those two years, or the wine production for the 1990/91 financial year. Good graphs should be clearly labelled.

Although the wine bottle presentation may catch the eye, the data would have been better presented in a summary table or as a time-series plot using all the data available.

It is often the improper use of the vertical and horizontal axes that leads to distortions in presenting data. Figure 2.22, representing New Zealand alcohol consumption, was constructed using data from OECD (2011 and 2014), contained in data file < ALCOHOL_CONSUMPTION >. The graph in Figure 2.22 is clearly labelled, the horizontal/time axis is correctly spaced and the height and volume are proportional. However, the cylinder representing 9.1 litres for 2004 is more than twice the height/volume of the cylinder representing 8.9 litres for 2003. This is because there is no zero point on the vertical axis. The vertical axis on a good graph should usually begin at zero.

[Figure 2.22 Misleading display of New Zealand alcohol consumption. Vertical axis: Alcohol consumption in litres per capita (15+); horizontal axis: Year, 2003 to 2013.]
Source: Data from OECD (2011 and 2014), ‘Alcohol consumption’, Health: Key Tables from OECD, No. 24. doi: 10.1787/alcoholcons-table-2014-1-en and 10.1787/alcoholcons-table-2011-1-en, accessed March 2017.

Other eye-catching displays seen in magazines and newspapers often include information that is not necessary, which blurs the message of the data.

Some guidelines for presenting good graphs are as follows:
• The graph should not distort the data. In particular, frequency/quantity should be proportional to area and/or volume.
• The graph should not contain chartjunk.
• Any two-dimensional graph should contain a scale for each axis.
• The scale on the vertical axis should begin at zero.
• Graphs should be properly scaled along each axis.
• All axes should be properly labelled.
• The graph should contain a title.
• The simplest possible graph should be used for a given set of data.

Often these guidelines are unknowingly violated by individuals unaware of how to construct appropriate graphs.
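The two distortions described for Figures 2.21 and 2.22 come down to simple arithmetic, which the following minimal Python sketch illustrates. The function names are hypothetical, and the axis starting value of 8.8 is an assumption made only for illustration, since the text does not state where the vertical axis of Figure 2.22 begins.

```python
def icon_size_ratio(old_value, new_value):
    """Apparent size ratio when both height and width are scaled by new/old."""
    linear_scale = new_value / old_value
    return linear_scale ** 2


def fair_linear_scale(old_value, new_value):
    """Scale for height and width that keeps the drawn area proportional to the data."""
    return (new_value / old_value) ** 0.5


def drawn_height_ratio(low_value, high_value, axis_start):
    """Ratio of drawn heights when the vertical axis begins at axis_start."""
    return (high_value - axis_start) / (low_value - axis_start)


# Figure 2.21: 346 million litres (1990/91) versus 606 million litres (1995/96).
print(round(606 / 346, 2))                    # linear scale applied to the icon
print(round(icon_size_ratio(346, 606), 2))    # how much larger the icon looks
print(round(fair_linear_scale(346, 606), 2))  # scale that keeps area proportional

# Figure 2.22: 8.9 litres (2003) versus 9.1 litres (2004).
ASSUMED_AXIS_START = 8.8  # assumption for illustration; not stated in the text
print(round(drawn_height_ratio(8.9, 9.1, ASSUMED_AXIS_START), 2))
print(round(drawn_height_ratio(8.9, 9.1, 0.0), 2))  # axis beginning at zero
```

With the vertical axis beginning at zero, the drawn heights differ by only about 2%, which matches the actual difference between 8.9 and 9.1 litres.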
Some applications, including Excel, tempt you to create ‘pretty’ charts that may be fancy in their designs but represent unwise choices. For example, making a simple pie chart fancier by adding exploded 3D slices is unwise as this can complicate a viewer’s interpretation of the data. Uncommon chart choices such as doughnut, radar and surface charts may look visually striking, but in most cases they obscure the data.

Ethical Concerns

Inappropriate graphs raise ethical concerns, especially when they, deliberately or not, present a false impression of the data. To illustrate this, take the example of mobile speed cameras that were reintroduced in New South Wales on 19 July 2010. Suppose the following graphs were produced by groups for and against this, using data in the file < NSW_ROAD_FATALITIES 2009_2017 > obtained from the Australian Road Deaths Database.

[Figure 2.23A NSW road fatalities 2010. NSW number of road fatalities, Jul 10 to Oct 10, with the introduction of mobile cameras marked.]
[Figure 2.23B NSW road fatalities 2010. NSW number of road fatalities, Apr 10 to Aug 10, with the introduction of mobile cameras marked.]
[Figure 2.23C NSW road fatalities 2009 to 2017. NSW number of road fatalities, Jan 09 to Feb 17, with the introduction of mobile speed cameras marked.]
Source: Data in Figures 2.23A–C obtained from Australian Road Deaths Database, <www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx>, accessed 8 April 2017.

Figure 2.23A gives the impression that the number of road fatalities in New South Wales has increased after the reintroduction of mobile speed cameras, while Figure 2.23B gives the opposite impression. However, a time-series plot for 2009 to 2017 (Figure 2.23C) shows that there may be a slight decrease in fatalities since the introduction of mobile cameras, although the number of fatalities per month is very variable.

Problems for Section 2.7

APPLYING THE CONCEPTS

2.50 (Student project) Bring to class a chart from a newspaper or magazine that you believe to be a poor representation of a numerical variable. Be prepared to discuss why you think this. Do you believe that the intent of the chart is purposely to mislead the reader?

2.51 (Student project) Bring to class a chart from a newspaper or magazine that you believe to be a poor representation of a categorical variable. Be prepared to discuss why you think this. Do you believe that the intent of the chart is purposely to mislead the reader?

2.52 (Student project) Bring to class a chart from a newspaper or magazine that you believe contains too many unnecessary adornments (i.e. chartjunk) that may cloud the message given by the data. Be prepared to discuss why you think this.

2.53 The following graph shows a relationship between number of pirates and global average temperature between 1820 and 2000. Comment on the influence of pirates on global warming.

[Graph: Global average temperature vs number of pirates. Vertical axis: Global average temperature, °C; horizontal axis: Number of pirates (approximate); points labelled by year from 1820 to 2000.]
Source: Church of the Flying Spaghetti Monster, <www.venganza.org/images/PiratesVsTemp.png> accessed 28 December 2014.
Used by permission of Bobby Henderson.

2.54 Using the data < WINE_PROD_2014_15 > and < ALCOHOL_CONSUMPTION >, redraw Figures 2.21 and 2.22, following the guidelines for good graphs given in Section 2.7.

2.55 The following three time-series plots show Perth’s monthly average petrol prices from January 2006 to February 2017:

[Plots a, b and c: Perth petrol price. Vertical axis: Average price (cents/litre); horizontal axis: month, Jan 06 to Feb 17. Each plot is drawn with a different vertical-axis range.]
Data obtained from Australian Automobile Association <www.aaa.asn.au> accessed April 2017

Which graph do you think best represents the data and why?

2.56 An article in the New York Times (D. Rosato, ‘Worried about the numbers? How about the charts?’, New York Times, 15 September 2002, Business 7) reported on research done on annual reports of corporations by Professor Deanna Oxender Burgess of Florida Gulf Coast University. Professor Burgess found that even slight distortions in a chart changed readers’ perception of the information. The article displayed sales information from the annual report of Zale Corporation and showed how results were exaggerated. Go online or to the library and study the most recent annual report of a local corporation. Find at least one chart in the report that you think needs improvement and develop an improved chart. Explain why you believe the improved chart is better than the one from the annual report.

2.57 Figures 2.1 and 2.3 show a bar chart and a pie chart, respectively, for the online grocery shopping data.
a. Create an exploded pie chart, a doughnut chart, a cone chart or a pyramid chart for the online shopping data.
b. Which graphs do you prefer? Explain.

Assess your progress

Summary

Table 2.16 summarises the tables and charts discussed in this chapter. These tables and charts enabled us to draw conclusions about online grocery shopping, the cost of restaurant meals in a city and its suburbs, and festival expenditure in the scenario at the beginning of the chapter.

Now that you have studied tables (which show how data are distributed) and charts (which provide a visual display of how data are distributed), a variety of numerical descriptive measures will be introduced in Chapter 3 for further analysis and interpretation of data.

Table 2.16 Roadmap for selecting tables and charts

Type of analysis: Tabulating, organising and graphically presenting the values of a variable
  Numerical data: Ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2 and 2.3)
  Categorical data: Summary table, bar chart, pie chart (Section 2.1)

Type of analysis: Organising and graphically presenting the relationship between two variables
  Numerical data: Scatter diagram, time-series plot (Section 2.5)
  Categorical data: Contingency table, side-by-side bar chart (Section 2.4)

  Numerical data: Sparklines, gauges, bullet graph, treemap, drill-down (Section 2.6)
  Categorical data: Treemap, drill-down (Section 2.6)

Key terms

bar chart
bullet graph
business analytics
chartjunk
class boundaries
class mid-point
class width
contingency (cross-classification) table
cumulative percentage distribution
cumulative percentage polygon (ogive)
dashboard
data discovery
descriptive analytics
drill-down
frequency distribution
gauges
histogram
ordered array
percentage distribution
percentage polygon
pie chart
predictive analytics
prescriptive analytics
range
relative frequency distribution
scatter diagram
side-by-side bar chart
sparklines
stem-and-leaf display
summary table
time-series plot
treemaps

References

1. Few, S. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring, 2nd edn (Burlingame, CA: Analytics Press, 2013).
2. Tufte, E. Beautiful Evidence (Cheshire, CT: Graphics Press, 2006).

Chapter review problems

CHECKING YOUR UNDERSTANDING

2.58 How do histograms and polygons differ with respect to their construction and use?
2.59 When or why would you construct a summary table?
2.60 What are the advantages and/or disadvantages of a bar chart or a pie chart?
2.61 Compare and contrast the bar chart for categorical data and the histogram for numerical data.
2.62 What is the difference between a time-series plot and a scatter diagram?
2.63 What are the three percentage breakdowns that can help you interpret the results found in a cross-classification table?

APPLYING THE CONCEPTS

You can solve problems 2.64 to 2.76 manually or using Microsoft Excel.

2.64 One thousand Australians were asked which websites they had visited in the previous week. The results were:

Type of site                      Number
Auction                              122
Banking                              245
Classifieds                          213
Dating                                41
Email                                552
Gaming                               132
News                                 335
Online music site                    186
Search engine                        743
Shopping                             381
Social network                       649
Sport                                236
TV                                   201
User generated or upload site        472
Weather                              398

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about the type of website most visited?

2.65 Another poll asked Australians how they spent their time online, with the following result.

Email and communications           19.3%
Multimedia sites                   13.1%
Online shopping                     5.4%
Reading content                    19.9%
Searches                           20.7%
Social networking                  21.6%
Total                             100.0%

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about Internet usage? Are these conclusions different from those in problem 2.64? If so, what could the reasons be?

2.66 The following table classifies road fatalities in Australia for 2012 to 2016 by crash type:

                            Year
Crash type          2012    2013    2014    2015    2016
Multiple vehicle     573     479     503     511     556
Pedestrian           171     158     154     162     171
Single vehicle       556     550     493     532     573
Total              1,300   1,187   1,150   1,205   1,300

Data obtained from the Australian Road Deaths Database at <www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 9 April 2017

a. Illustrate these data by constructing appropriate tables and graphs.
b. What can you say about the pattern of road fatalities in these five years?

2.67 Residents in the seaside town hosting the three-day music festival are concerned that the influx of tourists for this and other events causes an increase in traffic and other offences. As the council area has one of the highest drink-driving rates in the state, Kai is investigating whether tourists can be blamed for this high rate. The following table classifies the previous year’s 993 drink-driving offences by the home address of the offender:

Home address                             Number of drink-driving offences
Local (in council area)
  Seaside town                           151
  Not seaside town                       462
Not local (not in council area)
  Intrastate (within state)              130
  Interstate (another state)             228
  International (outside Australia)       22

a. Construct bar and pie charts.
b. What conclusions can Kai draw about the prevalence of drink driving?
c. The headline of an article in the local paper discussing these data was ‘Tourists can’t be blamed for number of drink-drivers’. Do you agree with this? Justify your answer.

2.68 The reasons why Queensland households installed a rainwater tank are given in the table below.

Reason                                    Brisbane   Rest of Queensland
To save water                               142.10                59.00
To save on water costs                       55.60                41.70
Water restrictions on mains water            55.20                31.90
Not connected to mains water                  5.40                73.70
Concerns about quality of mains water         5.40                28.10
Water tank rebates                           43.00                 7.90
Other                                        48.50                47.20
Total households (thousands)                216.50               206.30

Data obtained from Australian Bureau of Statistics, Environmental Issues: Water Use and Conservation, Mar 2013, Cat. No. 4602.0.55.003 <www.abs.gov.au> accessed 4 November 2013

a. Illustrate these data with an appropriate table or graph.
b. What can you conclude about the reasons for installing a rainwater tank, and are there differences between Brisbane and non-Brisbane Queensland households?

2.69 The data in < FRESH_MILK > contain the fat and sugar content in grams (g) per 250 ml cup of a random sample of brands of fresh cow’s milk for sale in Australia.
a. Use the combined data to construct graphs to explore the relationship between the variables.
b. What conclusions can you reach about the relationship between the fat, sugar and calorie content of fresh milk?

2.70 On the same day in March 2017, the researcher in problem 2.30 also obtained the prices per litre of unleaded petrol and diesel from a random sample of 45 towns and suburbs in Queensland. This set of data is in the data file < FUEL_MARCH_2017 > with the New South Wales data.
a. Using appropriate tables and graphs, investigate the distribution of unleaded petrol and diesel prices in Queensland on this day in March 2017. What can you conclude about the variation in fuel prices in Queensland when the data were collected?
b. Using an appropriate graph, investigate the relationship between petrol and diesel prices in Queensland. What conclusions can you draw about this relationship?
c. Using appropriate tables and graphs, investigate the distribution of unleaded petrol and diesel prices in New South Wales on this day in March 2017. What can you conclude about the variation in fuel prices in New South Wales when the data were collected?
d. Using an appropriate graph, investigate the relationship between petrol prices in New South Wales and Queensland. What conclusions can you draw?
e. Using an appropriate graph, investigate the relationship between diesel prices in New South Wales and Queensland. What conclusions can you draw?
f. The data in < FUEL_MARCH_2017 > were obtained in March 2017. Go to Motor Mouth at <www.motormouth.com.au>, NRMA at <www.mynrma.com.au>, RACQ at <www.racq.com.au>, or elsewhere, to collect recent price data. Then use appropriate graphs and tables to investigate any changes in petrol and/or diesel prices in New South Wales and/or Queensland.

2.71 Data from 100 recent property sales from a council area are stored in < PROPERTY >.
For the asking price data:
a. Construct and interpret a stem-and-leaf display.
b. Construct frequency, percentage and cumulative distributions.
c. Construct a frequency histogram, a percentage polygon and an ogive.
d. What conclusions can you make about the distribution of asking prices?
e. Construct and interpret a scatter diagram for asking and selling price.
For the type and bedroom data:
f. Construct cross-classification tables based on total, row and column percentages.
g. Construct side-by-side charts to investigate the relationship between number of bedrooms and type.
h. What conclusions can you make about the relationship between type and number of bedrooms?

2.72 The data in data file < INTEREST_2017 > give the bank interest rate for standard housing loans in New Zealand and Australia from January 2000 to March 2017. Construct and interpret time-series plots, on the same set of axes, for New Zealand and Australian interest rates from January 2000.

2.73 Using the Australian data from problem 2.72, a PR spokesperson for an Australian political party constructed the following graph to illustrate that the party’s influence has lowered interest rates. Do you think this is an ethical graph? Discuss.

[Graph: Australian housing interest rates. Vertical axis: %, 5.0 to 8.0; horizontal axis: month, Sep 11 to Mar 17.]

2.74 The data in data file < GRADES > contain sample student marks and grades from a population of students enrolled in a statistics unit.
a. Construct an appropriate graph to investigate the distribution of grades. What conclusions can you draw?
b. Construct an appropriate graph to investigate the distribution of total marks. What conclusions can you draw?
c. Construct an appropriate graph to investigate the relationship between a student’s semester mark and their exam mark. What conclusions can you draw?

2.75 (Class project) Ask each student in the class to respond to the question ‘Which soft drink do you prefer?’ and display the results in a summary table.
a. Convert the data to percentages and construct a bar or pie chart.
b. Analyse the findings.

2.76 (Class project) Classify each student in the class on the basis of gender (male, female), study mode (full-time or part-time) and current employment status (full-time, part-time).
a. Construct contingency tables to explore the data.
b. What would you conclude from this study?
c. What other variables would you want to know about employment in order to enhance your findings?
d. Compare your results with those from the Living in Australia Study in problem 2.22.

2.77 The file < DOMESTIC_BEER2 > contains the number of calories per 355 mL and number of carbohydrates (in grams) per 355 mL for a sample of 15 of the best-selling domestic beers in the United States (data obtained from <www.beer100.com/beercalories.htm>).
a. Visually evaluate the number of calories per 355 mL for each beer by constructing a bullet graph. For the purposes of comparison, consider calories below 100 as low, between 100 and 160 as medium, and above 160 as high.
b. Visually evaluate the number of carbohydrates (in grams) per 355 mL for each beer by constructing a bullet graph. For the purposes of comparison, consider carbohydrates below 10 grams as low, between 10 and 14 grams as medium, and above 14 grams as high.
c. What preliminary conclusions can you reach about the number of calories and amount of carbohydrate in the beers?
d. Why would constructing sets of gauges for the calories and carbohydrates be a less effective means of visualising these data?

2.78 The file < CURRENCY2 > contains the value of the Canadian dollar, British pound and Euro for one US dollar from 2002 to 2012.
a. Construct sparklines for the value of the US dollar in terms of the Canadian dollar, British pound and Euro.
b. What conclusions can you reach about the value of the US dollar in terms of the Canadian dollar, British pound and Euro from 2002 to 2012?

Continuing cases

Tasman University

Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Undergraduate MBA Student Survey.
a For a selection of questions asked in the BBus student survey, construct appropriate tables and charts.
b For a selection of questions asked in the MBA student survey, construct appropriate tables and charts.
c Construct appropriate tables and charts to explore the relationship between selected pairs of questions within a survey or between surveys.
d Write a report summarising your conclusions.
As Safe as Houses

To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a For regional city 1, state A:
  i For a selection of variables, construct appropriate tables and charts.
  ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
b For coastal city 1, state A:
  i For a selection of variables, construct appropriate tables and charts.
  ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
c Construct appropriate tables and charts to explore the relationship between the same variable in coastal city 1, state A, and regional city 1, state A.
d Write a report summarising your conclusions.
e Repeat (a) to (d) for another pair of non-capital cities or towns in state A and/or state B.

Chapter 2 Excel Guide

EG2.1 ORGANISING AND VISUALISING CATEGORICAL DATA

ORGANISING CATEGORICAL DATA

The Summary Table

Key technique Use the PivotTable feature to create a summary table for untallied data.

Example Create a frequency and percentage summary table similar to Table 2.2B on page 39.

PHStat Use One-Way Tables & Charts. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ One-Way Tables & Charts. In the procedure’s dialog box (shown in Figure EG2.1):

[Figure EG2.1 One-Way Tables & Charts dialog box]

1. Click Raw Categorical Data (because the worksheet contains untallied data).
2. Enter or highlight G2:G102 as the Raw Data Cell Range and check First cell contains label.
3. Enter a Title, check Percentage Column, and click OK.
PHStat creates a PivotTable summary table on a new worksheet.

In-depth Excel (untallied data) Use the Summary_Table workbook as a model. For the example, open the Property file and select Insert ➔ PivotTable. In the Create PivotTable dialog box (shown in Figure EG2.2):
1. Click Select a table or range and enter or highlight G2:G102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.

Figure EG2.2 Create PivotTable dialog box

In the Excel 2016 PivotTable Fields task pane (shown in Figure EG2.3) or in the similar PivotTable Field List task pane in earlier Excels:
3. Tick Type in Choose fields to add to report to add it to the ROWS (or Row Labels) box.
4. Drag Type in Choose fields to add to report and drop it in the Σ Values box. This second label changes to Count of Type to indicate that a count, or tally, of the type categories will be displayed in the PivotTable.

Figure EG2.3 Microsoft Excel PivotTable Fields task pane

In the PivotTable being created:
5. Enter Type in cell A3 to replace the heading Row Labels.
6. Right-click cell A3 and then click PivotTable Options in the shortcut menu that appears.

In the PivotTable Options dialog box (shown in Figure EG2.4):
7. Click the Layout & Format tab.
8. Check For empty cells show and enter 0 as its value. Leave all other settings unchanged.
9. Click OK to complete the PivotTable.

Figure EG2.4 PivotTable Options dialog box

To add a column for the percentage frequency:
10. Enter Percentage in cell C3. Enter the formula =B4/B$6 in cell C4 and copy it down to row 6.
11. Select cell range C4:C6, right-click, and select Format Cells in the shortcut menu.
12. In the Number tab of the Format Cells dialog box, select Percentage as the Category, and the number of decimal places you wish to show, and click OK.
13. Adjust the worksheet formatting, if appropriate, and enter a title in cell A1.

In the PivotTable, type categories appear in alphabetical order. To change the order:
14. Click the Unit label in cell A5 to highlight cell A5. Move the mouse pointer to the top edge of the cell until the mouse pointer changes to a four-way arrow.
15. Drag the Unit label and drop the label over cell A4. The type categories now appear in the order Unit then House in the summary table.

In-depth Excel (tallied data) Use the SUMMARY_SIMPLE worksheet of the Summary_Table workbook as a model for creating a summary table.

VISUALISING CATEGORICAL VARIABLES

The Bar Chart and the Pie Chart

Many of the In-depth Excel instructions in the rest of this Excel Guide refer to the labelled Charts group illustration shown in Figure EG2.5.

Figure EG2.5 Microsoft Excel Charts group

Key technique Use the Excel bar or pie chart feature. If the variable to be visualised is untallied, first construct a summary table (see the instructions in Section EG2.1 ‘Organising Categorical Data: The Summary Table’).

Example Construct a bar or pie chart from a summary table similar to Table 2.2B on page 39.

PHStat Use One-Way Tables & Charts. For the example, use the PHStat instructions in Section EG2.1 ‘Organising Categorical Data: The Summary Table’, but in step 3, check either Bar Chart or Pie Chart (or both) in addition to entering a Title, checking Percentage Column, and clicking OK.

In-depth Excel Use the Summary_Table workbook as a model. For the example, open to the OneWayTable worksheet of the Summary_Table workbook. (The PivotTable in this worksheet was constructed using the instructions in Section EG2.1 ‘Organising Categorical Data: The Summary Table’.)

To construct a bar chart:
1. Select cell range A4:B5.
(Begin your selection at cell B5 and not at cell A4, as you would normally do.)
2. In Excel 2016, select Insert, then the Column icon in the Charts group (#1 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Bar gallery item (Clustered Bar). In other Excels, select Insert ➔ Bar Icon and then select the first 2-D Bar gallery item (Clustered Bar).
3. Right-click the Count of Type button in the chart and click Hide All Field Buttons on Chart.
4. Select Design ➔ Add Chart Element ➔ Axis Titles ➔ Primary Horizontal. (Earlier Excels) Select Layout ➔ Axis Titles ➔ Primary Horizontal Axis Title ➔ Title Below Axis. Select the words “Axis Title” in the chart and enter the title Frequency.
5. If required, move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

Although not the case with the example, sometimes the horizontal axis scale of a bar chart will not begin at 0. If this occurs, right-click the horizontal (value) axis in the bar chart and click Format Axis in the shortcut menu. In the Format Axis task pane, click Axis Options. In the Axis Options, enter 0 in the Minimum box and then close the pane. In earlier Excels, you set this value in the Format Axis dialog box. Click Axis Options in the left pane, and in the Axis Options right pane, click the first Fixed option button (for Minimum), enter 0 in its box, and then click Close.

To construct a pie chart, replace steps 2 and 4 with these steps:
2. Select Insert, then the Pie icon (#3 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Pie gallery item (Pie). In earlier Excels, select Insert ➔ Pie and then select the first 2-D Pie gallery item (Pie).
4. Select Design ➔ Add Chart Element ➔ Data Labels ➔ More Data Label Options. In the Format Data Labels task pane, click Label Options. In the Label Options, check Category Name and Percentage, clear the other Label Contains check boxes, and click Outside End.
(To see the Label Options, you may have to first click the chart (fourth) icon near the top of the task pane.) Then close the task pane. (Earlier Excels) Select Layout ➔ Data Labels ➔ More Data Label Options. In the Format Data Labels dialog box, click Label Options in the left pane. In the Label Options right pane, check Category Name and Percentage and clear the other Label Contains check boxes. Click Outside End and then click Close.

EG2.2 ORGANISING NUMERICAL DATA

Stacked and Unstacked Data

PHStat Use Stack Data or Unstack Data. For example, to unstack the Asking Price variable by the Type variable in the property data given in Example 2.1, open the Property file. Select Data Preparation ➔ Unstack Data. In that procedure’s dialog box, enter or highlight G2:G102 (the Type variable cell range) as the Grouping Variable Cell Range and enter or highlight A2:A102 (the Asking Price variable cell range) as the Stacked Data Cell Range. Check First cells in both ranges contain label and click OK. The unstacked data appear on a new worksheet.

The Ordered Array

In-depth Excel To create an ordered array, first select the numerical variable to be sorted. Then select Home ➔ Sort & Filter (in the Editing group) and in the drop-down menu click Sort Smallest to Largest. (You will see Sort A to Z as the first drop-down choice if you did not select a cell range of numerical data.)

The Stem-and-Leaf Display

Key technique Enter leaves as a string of digits.

Example Construct a stem-and-leaf display for festival expenditure by interstate visitors, similar to Figure 2.5 on page 45.

PHStat Use the Stem-and-Leaf Display. For the example, open the Festival file. Select PHStat ➔ Descriptive Statistics ➔ Stem-and-Leaf Display. In the procedure’s dialog box (shown in Figure EG2.6):
1. Enter or highlight A2:A54 as the Variable Cell Range and check First cell contains label.
2. Click Set stem unit as and enter 100 in its box.
3. Enter a Title and click OK.

Figure EG2.6 Stem-and-Leaf Display dialog box

When creating other displays, use the Set stem unit as option sparingly and only if Autocalculate stem unit creates a display that has too few or too many stems. (Any stem unit you specify must be a power of 10.)

In-depth Excel Use the Stem_and_Leaf workbook as a model. Manually construct the stems and leaves on a new worksheet to create a stem-and-leaf display. Adjust the column width of the column that holds the leaves as necessary.

EG2.3 SUMMARISING AND VISUALISING NUMERICAL DATA

SUMMARISING NUMERICAL DATA

The Frequency Distribution

Key technique Establish bins and then use the FREQUENCY(untallied data cell range, bins cell range) array function to tally data.

Example Create frequency, percentage and cumulative percentage distributions for the restaurant meal cost data as in Tables 2.5, 2.7 and 2.9 in Section 2.3.

To construct a frequency distribution using Excel or PHStat, you must first define your classes by a bin range.

Defining Classes Using Bins

Open the worksheet containing the data you want to summarise in classes. Decide on your classes and, in a separate column, enter the upper boundary or maximum value, called the Bin Value, for each class. This gives the Bin Cell Range. If the data are discrete, the bin range should contain the highest value in each class. If the data are continuous but recorded to a set number of decimal places, the values in the bin range should be just less than the minimum value in the next class. In this case, record the value in the bin range to one or two more significant figures than the data. For example, for the restaurant data in < RESTAURANT > in Section 2.3, the following classes were required (see Table 2.5): $10 to less than $15, $15 to less than $20 and so on.
As the first class is $10 to less than $15, $15 belongs in the second class and the bin value for the first class is just less than this, 14.99 or 14.999. Therefore, the Bin Cell Range would be 14.999, 19.999, 24.999 and so on.

Class           Bin values   Class mid-points
$10 to < $15    14.999       $12.50
$15 to < $20    19.999       $17.50
$20 to < $25    24.999       $22.50
:               :            :
$60 to < $65    64.999       $62.50

PHStat (untallied data) Use Frequency Distribution. (Use Histogram & Polygons, discussed later in Section EG2.3, if you plan to construct a histogram or polygon in addition to a frequency distribution.) For the example, open the Restaurant file. The Data worksheet contains the meal cost data in stacked format in column G. Enter an appropriate bin cell range (see above) in column H (say, H1:H12). Select PHStat ➔ Descriptive Statistics ➔ Frequency Distribution. In the procedure’s dialog box (shown in Figure EG2.7):
1. Enter or highlight G1:G101 as the Variable Cell Range, enter or highlight H1:H12 as the Bins Cell Range, and check First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or highlight A1:A101 as the Grouping Variable Cell Range. (The cell range A1:A101 contains the Location variable.)
3. Enter a Title and click OK.

Figure EG2.7 Frequency Distribution dialog box

Click Single Group Variable in step 2 if constructing a distribution from a single group of untallied data. Click Multiple Groups - Unstacked in step 2 if the Variable Cell Range contains two or more columns of unstacked, untallied data.

Frequency distributions for the two groups appear on separate worksheets. To display the information for the two groups on one worksheet, select the cell range B3:D14 on one of the worksheets. Right-click that range and click Copy in the shortcut menu. Open to the other worksheet.
In that other worksheet, right-click cell E3 and click Paste Special in the shortcut menu. In the Paste Special dialog box, click Values and number formats and click OK. Adjust the worksheet title as necessary.

In-depth Excel (untallied data) Use the Distributions workbook as a model. For the example, use the Unstacked worksheet of the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter an appropriate bin range (see above) in column D (say, D1:D12). Then:
1. Right-click the Unstacked sheet tab and click Insert in the shortcut menu.
2. In the General tab of the Insert dialog box, click Worksheet and then click OK.

In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3 and Frequency in cell B3.
4. Copy the bin number list in the cell range D2:D12 of the Unstacked worksheet and paste this list into cell A4 of the new worksheet.
5. Select the cell range B4:B14 that will hold the array formula.
6. Type (but do not press the Enter or Tab key) the formula =FREQUENCY(UNSTACKED!$A$1:$A$51, $A$4:$A$14). Then, while holding down the Ctrl and Shift keys, press the Enter key to enter the array formula into the cell range B4:B14.
7. Adjust the worksheet formatting as necessary.

Note that in step 6, you enter the cell range as UNSTACKED!$A$1:$A$51 and not as $A$1:$A$51 because the untallied data are located on another (the Unstacked) worksheet.

Steps 1 to 7 construct a frequency distribution for the meal costs at city restaurants. To construct a frequency distribution for the meal costs at suburban restaurants, repeat steps 1 to 7 but in step 6 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14) as the array formula.

To display the distributions for the two groups on one worksheet, select the cell range B3:B14 on one of the worksheets. Right-click that range and click Copy in the shortcut menu. Open to the other worksheet. In that other worksheet, right-click cell C3 and click Paste Special in the shortcut menu.
In the Paste Special dialog box, click Values and number formats and click OK. Adjust the worksheet title as necessary.

Analysis ToolPak (untallied data) Use Histogram. For the example, use the Unstacked worksheet of the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter an appropriate bin range (see above) in column D (say, D1:D12). Then:
1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Histogram from the Analysis Tools list and then click OK.

In the Histogram dialog box (shown in Figure EG2.8):
2. Enter or highlight A1:A51 as the Input Range and enter or highlight D1:D12 as the Bin Range. (If you leave Bin Range blank, the procedure creates a set of bins that will not be as well formed as the ones you can specify.)
3. Check Labels and click New Worksheet Ply.
4. Click OK to create the frequency distribution on a new worksheet.

Figure EG2.8 Histogram dialog box

In the new worksheet:
5. Select row 1. Right-click this row and click Insert in the shortcut menu. Repeat. (This creates two blank rows at the top of the worksheet.)
6. Enter a title in cell A1.

The ToolPak creates a frequency distribution that contains an improper bin labelled More. Correct this error by using these general instructions:
7. Manually add the frequency count of the More row to the frequency count of the preceding row. (For the example, the More row contains a zero for the frequency, so the frequency of the preceding row does not change.)
8. Select the worksheet row (for this example, row 15) that contains the More row.
9. Right-click that row and click Delete in the shortcut menu.

Steps 1 to 9 construct a frequency distribution for the meal costs at city restaurants.
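For readers who want to check a tally outside Excel, the counting rule used with bin values can be sketched in Python. This is an illustration only, not part of the Excel procedure: the function name tally and the meal costs are hypothetical, and the sketch assumes each class collects the values above the previous bin value up to and including its own bin value, with one extra count for values above the last bin (the ToolPak's More row).

```python
from bisect import bisect_left

def tally(values, bin_values):
    """Count values per class, where class i holds previous_bin < x <= bin_values[i].

    bin_values must be sorted in ascending order. The extra final count
    holds values above the last bin value (the 'More' row).
    """
    counts = [0] * (len(bin_values) + 1)
    for x in values:
        # bisect_left returns the index of the first bin value >= x
        counts[bisect_left(bin_values, x)] += 1
    return counts

# Hypothetical meal costs, not taken from the Restaurant file
meal_costs = [12.50, 14.99, 15.00, 19.00, 22.40, 24.00, 31.75]
bin_values = [14.999, 19.999, 24.999, 29.999]

frequencies = tally(meal_costs, bin_values)
print(frequencies)  # [2, 2, 2, 0, 1]
```

Note how a cost of exactly 15.00 falls in the second class because the first bin value is 14.999, which is the reason for recording bin values just below the next class boundary.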
To construct a frequency distribution for the meal costs at suburban restaurants, repeat these nine steps but in step 2 enter or highlight B1:B51 as the Input Range.

The Relative Frequency, Percentage and Cumulative Distributions

Key technique Add columns that contain formulas for the relative frequency or percentage and cumulative percentage to a previously constructed frequency distribution.

Example Create a distribution that includes the relative frequency or percentage as well as the cumulative percentage, as in Tables 2.7 (relative frequency and percentage) and 2.9 (cumulative percentage) in Section 2.3 for the restaurant meal cost data.

PHStat (untallied data) Use Frequency Distribution. For the example, use the PHStat instructions in ‘Summarising Numerical Data: The Frequency Distribution’ to construct a frequency distribution. Note that the frequency distribution constructed by PHStat also includes columns for the percentages and cumulative percentages. To change the column of percentages to a column of relative frequencies, reformat that column. For the example, open to the new worksheet that contains the city restaurant frequency distribution and:
1. Select the cell range C4:C14, right-click, and select Format Cells from the shortcut menu.
2. In the Number tab of the Format Cells dialog box, select Number as the Category and click OK.

Then repeat these two steps for the new worksheet that contains the suburban restaurant frequency distribution.

In-depth Excel (untallied data) Use the Distributions workbook as a model. For the example, first construct a frequency distribution using the In-depth Excel instructions in ‘Summarising Numerical Data: The Frequency Distribution’. Open to the new worksheet that contains the frequency distribution for the city restaurants and:
1. Enter Percentage in cell C3 and Cumulative Pctage in cell D3.
2. Enter =B4/SUM($B$4:$B$14) in cell C4 and copy this formula down to row 14.
3. Enter =C4 in cell D4.
4. Enter =C5+D4 in cell D5 and copy this formula down to row 14.
5. Select the cell range C4:D14, right-click, and click Format Cells in the shortcut menu.
6. In the Number tab of the Format Cells dialog box, click Percentage in the Category list and click OK.

Then open to the worksheet that contains the frequency distribution for the suburban restaurants and repeat steps 1 to 6.

If you want column C to display relative frequencies instead of percentages, enter Rel. Frequencies in cell C3. Select the cell range C4:C14, right-click, and click Format Cells in the shortcut menu. In the Number tab of the Format Cells dialog box, click Number in the Category list and click OK.

Analysis ToolPak Use Histogram and then modify the worksheet created. For the example, first construct the frequency distributions using the Analysis ToolPak instructions in ‘The Frequency Distribution’. Then use the In-depth Excel instructions to modify those distributions.

VISUALISING NUMERICAL DATA

The Histogram

Key technique Construct a histogram.

Example Construct histograms for price of main meals in city restaurants, similar to Figure 2.6 on page 50.

PHStat Use Histogram & Polygons.

Figure EG2.9 Histogram & Polygons dialog box

PHStat Defining Classes Bins and Mid-points

If constructing a frequency polygon or histogram using PHStat, include a class of zero frequency at the beginning of the bin range. For example, for the restaurant data in < RESTAURANT > in Section 2.3, the first class of non-zero frequency is $10 to less than $15 with bin value 14.999, so the class $5 to less than $10 must be included before this. Therefore, the Bin Cell Range would be 9.999, 14.999, 19.999, 24.999 and so on. PHStat also requires a Mid-point Cell Range.
Since PHStat associates the first mid-point given with the second bin value or second class, the Mid-point Cell Range must have one fewer cell than the Bin Cell Range.

Price   Bin values   Class mid-points
50      9.999        $12.50
38      14.999       $17.50
43      19.999       $22.50
56      24.999       $27.50
51      29.999       $32.50
36      34.999       $37.50
25      39.999       $42.50
33      44.999       $47.50
41      49.999       $52.50
44      54.999       $57.50
34      59.999       $62.50
39      64.999

For the example, open to the Data worksheet of the Restaurant file. Enter an appropriate bin range (see above) in column H (say, H1:H13) and a mid-point range in column I (say, I1:I12). Select PHStat ➔ Descriptive Statistics ➔ Histogram & Polygons. Then in the procedure’s dialog box (shown in Figure EG2.9):
1. Enter or highlight G1:G101 as the Variable Cell Range, H1:H13 as the Bins Cell Range and I1:I12 as the Midpoints Cell Range, and check First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or highlight A1:A101 as the Grouping Variable Cell Range. (In the Data worksheet of the Restaurant file, the prices of meals in city and suburban restaurants are stacked, or placed in a single column. The column A values allow PHStat to separate the city restaurant prices from the suburban restaurant prices.)
3. Enter a Title, check Histogram, and click OK.

PHStat inserts two new worksheets, each of which contains a frequency distribution and a histogram. Since you cannot define an explicit lower boundary for the first bin, the first bin can never have a mid-point. Therefore, the Midpoints Cell Range you enter must have one fewer cell than the Bins Cell Range. PHStat associates the first mid-point with the second bin and uses -- as the label for the first bin. When you include a class of zero frequency before the first class of non-zero frequency, as in this example, the histogram bar labelled -- will always be a zero bar.

In-depth Excel Use the Histogram workbook as a model.
For the example, first construct frequency distributions for city and suburban meal prices. Open the Unstacked worksheet in the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter appropriate bin cell and mid-point cell ranges, including titles, in columns D and E (say, D1:D12 and E1:E12). Then:
1. Right-click the Unstacked sheet tab and click Insert in the shortcut menu.
2. In the General tab of the Insert dialog box, click Worksheet and then click OK.

In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3, Frequency in cell B3, and Midpoints in cell C3.
4. Copy the bin values in the cell range D2:D12 of the Unstacked worksheet and paste this list into cell A4 of the new worksheet.
5. Copy the mid-points in the cell range E2:E12 of the Unstacked worksheet and paste this list into cell C4 of the new worksheet.
6. Select the cell range B4:B14 that will hold the array formula.
7. Type (but do not press the Enter or Tab key) the formula =FREQUENCY(UNSTACKED!$A$2:$A$51, $A$4:$A$14). Then, while holding down the Ctrl and Shift keys, press the Enter key to enter the array formula into the cell range B4:B14.
8. Adjust the worksheet formatting as necessary.

Steps 1 to 8 construct a frequency distribution for city restaurant main meal prices. To construct a frequency distribution for main meal prices for suburban restaurants, repeat steps 1 to 8 but in step 7 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14) as the array formula.

Having constructed the two frequency distributions, continue constructing the two histograms. Open to the worksheet that contains the frequency distribution for city restaurant prices and:
1. Select the cell range B3:B14 (the cell range of the frequencies).
2. Select Insert, then the Column icon in the Charts group (#1 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Column gallery item (Clustered Column). In earlier Excels, select Insert ➔ Column and select the first 2-D Column gallery item (Clustered Column).
3. Right-click the chart and click Select Data in the shortcut menu.

In the Select Data Source dialog box:
4. Click Edit under the Horizontal (Categories) Axis Labels heading.
5. In the Axis Labels dialog box, drag the mouse to select the cell range C4:C14 (containing the mid-points) to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do. Click OK in this dialog box and then click OK (in the Select Data Source dialog box).

In the chart:
6. Right-click inside a bar and click Format Data Series in the shortcut menu.
7. In the Format Data Series task pane, click Series Options, enter 0 in the Gap Width box, and then close the task pane. (To see the Series Options, you may have to first click the chart [third] icon near the top of the task pane.) (Earlier Excels) In the Format Data Series dialog box, click Series Options in the left pane, and in the Series Options right pane, change the Gap Width slider to No Gap. Click Close.
8. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

Analysis ToolPak Use Histogram. For the example, open the Unstacked worksheet in the Restaurant file. Enter appropriate bin cell and mid-point cell ranges, including titles, in columns D and E (say, D1:D12 and E1:E12) and:
1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Histogram from the Analysis Tools list and then click OK.

In the Histogram dialog box:
2. Enter or highlight A1:A51 as the Input Range and enter or highlight D1:D12 as the Bin Range.
3. Check Labels, click New Worksheet Ply, and check Chart Output.
4. Click OK to create the frequency distribution and histogram on a new worksheet.

In the new worksheet:
5. Follow steps 5 to 9 of the Analysis ToolPak instructions in ‘Summarising Numerical Data: The Frequency Distribution’ above.

These steps construct a frequency distribution and histogram for city restaurant main meal prices. To construct a frequency distribution and histogram for suburban restaurant main meal prices repeat the nine steps, but in step 2 enter or highlight B1:B51 as the Input Range.

You will need to correct several formatting errors that Excel makes to the histograms it constructs. For each histogram:
1. Right-click inside a bar and click Format Data Series in the shortcut menu.
2. In the Format Data Series task pane, click Series Options, enter 0 in the Gap Width box, and then close the task pane. (To see the Series Options, you may have to first click the chart [third] icon near the top of the task pane.) (Earlier Excels) In the Format Data Series dialog box, click Series Options in the left pane, and in the Series Options right pane, change the Gap Width slider to No Gap. Click Close.

Histogram bars are labelled by bin numbers. To change the labelling to mid-points, open to each of the new worksheets and:
3. Enter Midpoints in cell C3. Copy the mid-point cell range E2:E12 of the Unstacked worksheet and paste this list into cell C4 of the new worksheet.
4. Right-click the histogram and click Select Data.
5. In the Select Data Source dialog box, click Edit under the Horizontal (Categories) Axis Labels heading.
6. In the Axis Labels dialog box, drag the mouse to select the cell range C4:C14 to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do.
Click OK in this dialog box and then click OK (in the Select Data Source dialog box).
7. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

The Percentage Polygon and the Cumulative Percentage Polygon (Ogive)

Key technique Construct percentage and cumulative percentage polygons.

Example Construct percentage and cumulative percentage polygons for main meal prices at city and suburban restaurants, similar to Figure 2.8 on page 52 and Figure 2.10 on page 53.

PHStat Use Histogram & Polygons. For the example, use the PHStat instructions for creating a histogram on page 83 but in step 3 of those instructions, also check Percentage Polygon and Cumulative Percentage Polygon (Ogive) before clicking OK.

In-depth Excel Use the Polygons workbook as a model. For the example, open the Unstacked worksheet in the Restaurant file. Then follow steps 1 to 8 of the In-depth Excel instructions for the histogram to construct a frequency distribution for city restaurant meal prices. However, include a class of zero frequency at either end of your bin cell range. (Say, in cells D1:D14, including title. Also add the corresponding class mid-points in cells E1:E14.) Repeat steps 1 to 8 but in step 7 type the array formula =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$16) to construct a frequency distribution for suburban restaurant main meal prices.

Open to the worksheet that contains the city restaurant meal price frequency distribution and:
1. Select column C. Right-click and click Insert in the shortcut menu. Right-click and click Insert in the shortcut menu a second time. (The worksheet contains new, blank columns C and D and the mid-points column is now column E.)
2. Enter Percentage in cell C3 and Cumulative Pctage in cell D3.
3. Enter =B4/SUM($B$4:$B$16) in cell C4 and copy this formula down to row 16.
4. Enter =C4 in cell D4.
5. Enter =C5+D4 in cell D5 and copy this formula down to row 16.
6. Select the cell range C4:D16, right-click, and click Format Cells in the shortcut menu.
7. In the Number tab of the Format Cells dialog box, click Percentage in the Category list and click OK.

Open to the worksheet that contains the suburban restaurant main meal price frequency distribution and repeat steps 1 to 7.

To construct the percentage polygons, open to the worksheet that contains the city restaurant price frequency distribution and:
1. Select cell range C4:C16.
2. Select Insert, then select the Line icon in the Charts group (#2 in the Charts group illustration in Figure EG2.5), and then select the fourth 2-D Line gallery item (Line with Markers). In earlier Excels, select Insert ➔ Line and select the fourth 2-D Line gallery item (Line with Markers).
3. Right-click the chart and click Select Data in the shortcut menu.

In the Select Data Source dialog box:
4. Click Edit under the Legend Entries (Series) heading. In the Edit Series dialog box, enter the formula ="City Restaurants" as the Series name and click OK.
5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels dialog box, drag the mouse to select the cell range E4:E16 to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do.
6. Click OK in this dialog box and then click OK (in the Select Data Source dialog box).

Back in the chart:
7. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

In the new chart sheet:
8. Right-click the chart and click Select Data in the shortcut menu.
9. In the Select Data Source dialog box, click Add.

In the Edit Series dialog box:
10. Enter the formula ="Suburban Restaurants" as the Series name and press Tab.
11. With the current value in Series values highlighted, click the worksheet tab for the worksheet that contains the suburban restaurant meal price frequency distribution.
12. Drag the mouse to select the cell range C4:C16 to enter that cell range as the Series values. Do not type this cell range in the Series values box as you would otherwise do.
13. Click OK. Back in the Select Data Source dialog box, click OK.

To construct the cumulative percentage polygons, open to the worksheet that contains the city restaurant price of main meal frequency distribution and repeat steps 1 to 13 but replace steps 1, 5 and 12 with the following:
1. Select the cell range D4:D16.
5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels dialog box, drag the mouse to select the cell range A4:A16 to enter that cell range.
12. Drag the mouse to select the cell range D4:D16 to enter that cell range as the Series values.

If the Y axis of the cumulative percentage polygon extends past 100%, right-click the axis and click Format Axis in the shortcut menu. In the Format Axis task pane, click Axis Options. In the Axis Options, enter 0 in the Minimum box and 1 in the Maximum box and then close the pane. In earlier Excels, you set this value in the Format Axis dialog box. Click Axis Options in the left pane, and in the Axis Options right pane, click the first Fixed option button (for Minimum), enter 0 in its box, and then click Close.

EG2.4 ORGANISING AND VISUALISING TWO CATEGORICAL VARIABLES

ORGANISING TWO CATEGORICAL VARIABLES

The Contingency Table

Key technique Use the PivotTable feature to create a contingency table for untallied data.

Example Construct a contingency table for location and number of bedrooms similar to Table 2.11 on page 55.

PHStat (untallied data) Use Two-Way Tables & Charts. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ Two-Way Tables & Charts.
In the procedure’s dialog box (shown in Figure EG2.10):
1. Enter or highlight F2:F102 as the Row Variable Cell Range.
2. Enter or highlight C2:C102 as the Column Variable Cell Range.
3. Check First cell in each range contains label.
4. Enter a Title and click OK.

Figure EG2.10 Two-Way Tables & Charts dialog box

In-depth Excel (untallied data) Use the Contingency_Table workbook as a model. For the example, open the Property file. Select Insert ➔ PivotTable. In the Create PivotTable dialog box:
1. Click Select a table or range and enter or highlight C2:F102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.

In the PivotTable Fields (called the PivotTable Field List in some Excel versions) task pane:
3. Tick Location in Choose fields to add to report to add it to the ROWS (or Row Labels) box.
4. Tick Bedrooms in Choose fields to add to report and drag it to the COLUMNS (or Column Labels) box.
5. Drag Location in Choose fields to add to report and drop it in the Σ Values box. (Location changes to Count of Location.)

In the PivotTable being created:
6. Select cell A3 and enter a space character to clear the label Count of Location.
7. Enter Location in cell A4 to replace the heading Row Labels.
8. Enter Bedroom in cell B3 to replace the heading Column Labels.
9. Right-click over the PivotTable and then click PivotTable Options in the shortcut menu that appears.

In the PivotTable Options dialog box:
10. Click the Layout & Format tab.
11. Check For empty cells show and enter 0 as its value. Leave all other settings unchanged.
12. Click the Totals & Filters tab.
13. Check Show grand totals for columns and Show grand totals for rows.
14. Click OK to complete the table.
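As a cross-check on what the PivotTable tallies, the same counting can be sketched in Python. This is only an illustration under stated assumptions: the records below are hypothetical (location, bedrooms) pairs, not values from the Property file, and the variable names are invented for the example.

```python
from collections import Counter

# Hypothetical untallied data: one (location, bedrooms) pair per property
records = [
    ("City", 2), ("City", 3), ("Coast", 3),
    ("City", 2), ("Coast", 4), ("Coast", 3),
]

counts = Counter(records)                        # tally each (row, column) pair
rows = sorted({loc for loc, _ in records})       # row variable categories
cols = sorted({beds for _, beds in records})     # column variable categories

# Combinations that never occur show 0, like 'For empty cells show 0'
table = {r: [counts[(r, c)] for c in cols] for r in rows}

# Grand totals for rows and columns
row_totals = {r: sum(table[r]) for r in rows}
col_totals = [sum(table[r][j] for r in rows) for j in range(len(cols))]

print("Bedrooms:", cols)
for r in rows:
    print(r, table[r], row_totals[r])
print("Total", col_totals, sum(col_totals))
```

The grand total should equal the number of records, which is a quick way to confirm that no row was dropped from the selected cell range.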
In-depth Excel (tallied data) Use the CONTINGENCY_SIMPLE worksheet of the Contingency_Table workbook as a model for creating a contingency table.

VISUALISING TWO CATEGORICAL VARIABLES

The Side-By-Side Chart

Key technique Use an Excel bar chart that is based on a contingency table.
Example Construct a side-by-side chart that displays location and number of bedrooms, similar to Figure 2.12 on page 56.

PHStat Use Two-Way Tables & Charts. For the example, use the Section EG2.4 ‘The Contingency Table’ PHStat instructions, but in step 4, check Side-by-Side Bar Chart in addition to entering a Title and clicking OK.

In-depth Excel Use the Contingency_Table workbook as a model. For the example, open to the TwoWayTable worksheet of the Contingency_Table workbook and:
1. Select cell A3 (or any other cell inside the PivotTable).
2. Select Insert ➔ Column in Excel 2016, or Bar in earlier Excel versions, and select the first 2-D Bar gallery item (Clustered Bar).
3. Right-click the Count of Location button in the chart and click Hide All Field Buttons on Chart.
4. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust formatting if required.

When creating a chart from a contingency table that is not a PivotTable, select the cell range of the contingency table, including row and column headings, but excluding the total row and total column, as step 1.

If you need to switch the row and column variables in a side-by-side chart, right-click the chart and then click Select Data in the shortcut menu. In the Select Data Source dialog box, click Switch Row/Column and then click OK. (In Excel 2007, if the chart is based on a PivotTable, the Switch Row/Column button will be disabled. In that case, you need to change the PivotTable to change the chart.)
EG2.5 VISUALISING TWO NUMERICAL VARIABLES

The Scatter Diagram

Key technique Use the Excel scatter chart.
Example Construct a scatter diagram of number of bedrooms and asking price, similar to Figure 2.14 on page 59.

PHStat Use Scatter Plot. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ Scatter Plot. In the procedure’s dialog box (shown in Figure EG2.11):
1. Enter or highlight A2:A102 as the Y Variable Cell Range.
2. Enter or highlight C2:C102 as the X Variable Cell Range.
3. Check First cells in each range contains label.
4. Enter a Title and click OK.

Figure EG2.11 Scatter Plot dialog box

To add a superimposed line like the one shown in Figure 2.14, click the chart and use step 3 of the In-depth Excel instructions.

In-depth Excel Use the Scatter_Diagram workbook as a model. For the example, open the Property file. The two variables ‘Number of bedrooms’ and ‘Asking price’ have been copied to columns I and J.
1. Select the cell range I2:J102.
2. Select Insert, then the Scatter (X,Y) icon in the Charts group (#4 in the illustration in Figure EG2.5), and then select the first Scatter gallery item (Scatter). In earlier Excels, select Insert ➔ Scatter and select the first Scatter gallery item (Scatter with only Markers).
3. Select Design ➔ Add Chart Element ➔ Trendline ➔ Linear. In earlier Excels, select Layout ➔ Trendline ➔ Linear Trendline.
4. Move chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

When constructing Excel scatter diagrams with other variables, make sure that the X or horizontal variable column precedes (is to the left of) the Y or vertical variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)
The Time-Series Plot

Key technique Use the Excel scatter chart.
Example Construct a time-series plot of Australian dollar exchange rate against US dollar from 2010 to 2017, similar to Figure 2.15 on page 60.

In-depth Excel Use the Time-Series workbook as a model. For the example, open the Exchange_Rate_2010_2017 file and:
1. Select the cell range A9:B95.
2. Select Insert, then select the Scatter (X, Y) icon in the Charts group (#4 in the illustration in Figure EG2.5), and then select the fourth or fifth Scatter gallery item (Scatter with Straight Lines with or without Markers). In earlier Excels, select Insert ➔ Scatter and select the fourth or fifth Scatter gallery item (Scatter with Straight Lines with or without Markers).
3. Move chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

When constructing time-series charts with other variables, make sure that the X or time variable column precedes (is to the left of) the Y or vertical variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)

EG2.6 DESCRIPTIVE ANALYTICS

Sparklines

In-depth Excel Use Sparklines. For example, to create the Figure 2.17 sparklines display, open to the DATA worksheet of the WL_WaitHistory workbook. In this worksheet, ride names are in column A and the historical wait times data by half-hours are in Columns C through W. Select cell range C3:W16 and:
1. Select Insert ➔ Sparklines (select line as the sparkline type).
2. In the Insert Sparklines dialog box, enter B3:B16 as the Location Range and click OK.
3. Select Axis and then Vertical. Choose Same for all Sparklines for both Maximum and Minimum.

Gauges

In-depth Excel To construct a gauge we must create both a doughnut chart for the coloured zones and a pie chart for the pointer. To create the gauges equivalent to the one shown in Figure 2.18 on page 65, open to the TopSixDATA worksheet of the WL_WaitData workbook and:
1. Select the cell range E3:E7.
2. Select Insert ➔ Pie Chart and select Doughnut.
3. Right-click on the doughnut, select Format Data Series and type ‘271’ into Angle of first slice (see Figure EG2.12) and close the box.

Figure EG2.12 Format Data Series dialogue box

4. Right-click on the largest doughnut slice and select Format Data Point, select Fill ➔ No Fill.

Figure EG2.13 Format Data Point dialogue box

5. Right-click on the doughnut and choose Select Data. Click the + button and add ‘pointer’ as the name and ‘G3:G5’ as the Y values.

Figure EG2.14 Select Data Source dialogue box

6. Right-click on the new second doughnut, click on Change Chart Type and choose Pie Chart.
7. Right-click on the pie chart, select Format Data Series and check the Secondary Axis option. Type ‘270’ into the Angle of first slice.

Figure EG2.15 Format Data Series dialog box

8. Right-click on the largest slice of pie and select Format Data Point. Select Fill ➔ No Fill. Repeat for the next largest slice.
9. Select Insert ➔ Text Boxes to add the appropriate labels and change the gauge colours to suit.

Bullet Graph

In-depth Excel Use the BulletGraph worksheet of the GaugeBullet workbook as a model for simulating a bullet graph.

To construct a simulated bullet graph in Excel, you create a bar chart of the variable being graphed with a transparent background and overlay this chart on a bar chart that displays the coloured zones. For example, to construct a chart similar to the bullet graph shown in Figure 2.18 on page 65, open to the waitDATA worksheet of the WL_WaitData workbook and:
1. Select cell range B1:C15.
2. Select Insert, then the bar chart icon, and select the Clustered Bar.
3. In the newly constructed bar chart, turn off the gridlines.
4.
Right-click in the white space to the right of the chart title and click Format Chart Area in the shortcut menu.
5. In the Fill part of the Format Chart Area pane click No fill. The background of the chart becomes transparent.

Next, construct the bar chart that will serve as the coloured zones for the bullet graph.
6. In the cell range D2:D6, enter the values 25, 20, 20, 20 and 15, to define the five zones of the Figure 2.18 bullet graph. Then select this edited cell range D2:D6.
7. Select Insert, then the bar chart icon, and select the Stacked Bar.
8. In the newly constructed bar chart, turn off the gridlines.
9. Right-click in the white space to the right of the chart title and click Select Data in the shortcut menu.
10. In the Select Data Source dialog box, click Switch Row/Column and then click OK. A chart of five simple bars becomes a chart of one stacked bar with five parts.
11. Right-click the one stacked bar and click Format Data Series in the shortcut menu. In the Series Options part of the Format Data Series pane, change Gap Width to 0%.
12. Change the colouring of the stacked bars. Select Design ➔ Change Colors and in the gallery click one of the colour spectrums. Be sure to choose a set of colours that does not include the colour used for the bars in the bar chart you constructed using steps 1 to 5.
13. Right-click the horizontal chart axis and click Format Axis in the shortcut menu.
14. In the Axis Options of the Format Axis pane, enter 100 as the Maximum. In Excel 2010, first click Fixed in the Maximum line, then enter 100, and then click Close.
15. Adjust the size of the chart, as necessary, by clicking a corner of the bar chart frame and then dragging that corner to resize the chart.
16.
Right-click the chart border and select Send to Back ➔ Send to Back in the shortcut menu.
17. Drag the bar chart with the transparent background over the stacked bar chart and adjust so that the zeroes on the horizontal axis of both charts coincide. Then adjust the width of that bar chart so that all other horizontal axis numbers that the two charts share coincide.

For other problems, you need to identify the maximum value and enter the proper set of values in a new column in order to correct the stacked bar chart that serves to display the zones for the bullet graph.

Treemap

1. Highlight cells A1:C15 and select Insert ➔ Chart ➔ Other Charts ➔ Hierarchical Treemap.

More detailed instructions for treemaps and data discovery are contained in the Software Guide in Chapter 20 (online).

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

CHAPTER 3
Numerical descriptive measures

FESTIVAL EXPENDITURE

Returning to the festival expenditure scenario introduced in Chapter 2, as well as presenting the expenditure data graphically, Kai wishes to summarise and analyse the data further. In particular, for each non-local visitor type (intrastate, interstate and international) numerical measures of the centre and variation of total expenditure in the region during the festival are required. This analysis will help to answer the following questions:
■ What is the ‘average’ amount spent during the festival? How does this differ between visitor types?
■ How varied is the amount spent during the festival? How does this differ between visitor types?
© Ton Koene/age fotostock

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data
2 calculate and interpret descriptive summary measures for a population
3 construct and interpret a box-and-whisker plot
4 calculate and interpret the covariance and the coefficient of correlation for bivariate data

variation The spread, scattering or dispersion of data values.

In Chapter 2 we saw how tables and graphs can be used to organise, visualise, summarise and describe data. In this chapter we discuss various numerical measures that can be used to summarise and describe numerical data. These numerical measures not only can be used to summarise a particular sample or population but will also enable the sample or population to be compared with others. Furthermore, these numerical measures, unlike graphs and tables, are precise, objectively determined and easy to manipulate, interpret and compare. They allow for a careful analysis of data, which is especially important when using sample data to make inferences about an entire population.

For example, Kai may be interested in whether interstate visitors spend more during the festival than do intrastate visitors. Also of interest would be how expenditure by international visitors to the festival compares to that of non-local visitors from within Australia.

This chapter introduces some of the statistics that measure:
• central tendency, the extent to which the data values are grouped around a central value
• variation, the spread, scattering or dispersion of data values
• shape, the pattern of the distribution of data values from the lowest value to the highest value.

shape The pattern of the distribution of data values.
Covariance and the coefficient of correlation, which measure the strength of the association between two numerical variables, are also introduced.

central tendency The extent to which data values are grouped around a central value.

LEARNING OBJECTIVE 1
Calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data

arithmetic mean (mean) Measure of central tendency; sum of all values divided by the number of values (usually called the mean); called the arithmetic mean to distinguish it from the geometric mean.

3.1 MEASURES OF CENTRAL TENDENCY, VARIATION AND SHAPE

We can describe a data set by describing its central tendency, variation and shape.

Measures of Central Tendency

Many data sets have a distinct central tendency, with the data values grouped or clustered around a central point. Everyday expressions such as ‘the average value’, ‘the middle value’ or ‘the most popular or frequent value’ refer to measures of central tendency. The three most important measures of central tendency – mean, median and mode – are introduced in this section. These measures are precise, objectively determined and easy to manipulate, interpret and compare. As we see in the following sections, each has its advantages and disadvantages.

Mean

The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean uses all the data values and can be calculated exactly. It can be thought of as a ‘balance point’ in a set of data (like the fulcrum on a seesaw). The mean is calculated by adding all the values of a variable in a data set and then dividing the sum by the number of variable values in the data set.

The symbol $\bar{X}$, called X bar, is used to represent the mean of a sample.
For a sample containing n values, the equation for the mean of a sample is written as:

\[ \bar{X} = \frac{\text{sum of the sample values}}{\text{number of sample values}} \]

Using the series $X_1, X_2, \ldots, X_n$ to represent the set of n values and n to represent the number of values, the equation becomes:

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

By using summation notation (discussed in Appendix B), we can replace the numerator $X_1 + X_2 + \cdots + X_n$ by $\sum_{i=1}^{n} X_i$, which means sum all the $X_i$ values from the first X value, $X_1$, to the last X value, $X_n$, to obtain Equation 3.1.

SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1} \]

where
$\bar{X}$ = sample mean
$n$ = number of values or sample size
$X_i$ = ith value of the variable X
$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$ = sum of all $X_i$ values in the sample

sample mean Mean calculated from sample data.

As all the data values play an equal role in the calculation of the mean, the mean will be affected by any extreme (high or low) value. When there are extreme values, you should take care when using the mean as a measure of central tendency.

The mean gives a ‘typical’ or central value for a data set. For example, if you knew the typical time it takes you to get ready in the morning, you might be able to plan your morning better and minimise any excessive lateness (or earliness). Suppose you define the time to get ready as the time in minutes (rounded to the nearest minute) from when you get out of bed to when you leave. You collect the times (shown below) for 10 consecutive working days; this data is stored in < TIMES >.

Day:             1   2   3   4   5   6   7   8   9   10
Time (minutes):  39  29  43  52  39  44  40  31  44  35

The mean time to get ready is 39.6 minutes, calculated using Equation 3.1:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{39 + 29 + \cdots + 35}{10} = \frac{396}{10} = 39.6 \]

Even though no one day in the sample actually had the value 39.6 minutes, allotting about 40 minutes to get ready would be a good rule for planning your morning – but only because the 10 days did not contain any extreme values.
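The calculation in Equation 3.1 can also be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the Excel workflow used in this book; the function name `sample_mean` is an assumption introduced here.

```python
def sample_mean(values):
    """Equation 3.1: the sum of the values divided by the number of values."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    return sum(values) / n


# Times to get ready (minutes) for the 10 consecutive working days in the text.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

total = sum(times)              # numerator of Equation 3.1
mean_time = sample_mean(times)  # sample mean, X bar

print(total, mean_time)  # 396 39.6
```

Writing the sum and the division as separate steps mirrors the numerator and denominator of Equation 3.1.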
To illustrate how the mean can be greatly affected by any value that is very different from the others, imagine that on day 4, a set of unusual circumstances delayed you getting ready by 50 minutes so that the time for that day was 102 minutes. This extreme value would cause the mean to rise to 44.6 minutes:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6 \]

The one extreme value has increased the mean by more than 10%, from 39.6 to 44.6 minutes. In contrast to the original mean, which was in the ‘middle’ (greater than 5 of the times to get ready and less than the other 5), the new mean is greater than 9 of the 10 times to get ready. The extreme value of 102 has caused the mean to increase and thus become a poor measure of central tendency.

A statistical calculator can be used to calculate the mean (and other numerical measures introduced in this chapter), while for large data sets, as we see later in this section, Excel can be used. Even though it is not usually necessary to use Equation 3.1 to calculate the mean, it is important that you understand the process of how the mean is determined.

EXAMPLE 3.1 MEAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. The following data give the dollar amount spent by a random sample of 12 international visitors. < FESTIVAL >

1,119  615  971  553  343  502  928  1,005  993  408  725  763

Calculate and interpret the mean amount spent by international visitors.

SOLUTION
Calculating the sum of the X values, we obtain:

\[ \sum_{i=1}^{12} X_i = 1{,}119 + 615 + \cdots + 763 = 8{,}925 \]

then, using Equation 3.1,

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{8{,}925}{12} = 743.75 \]

Therefore, international visitors on average spent $743.75 in the region during the festival.
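The effect of the single extreme value, and the result of Example 3.1, can be checked with Python's standard `statistics` module. This is a minimal sketch for illustration only; the variable names are assumptions introduced here.

```python
from statistics import mean

# Times to get ready (minutes), then the same data with day 4 delayed by 50 minutes.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
times_with_delay = list(times)
times_with_delay[3] = 102  # day 4: 52 minutes becomes 102 minutes

original_mean = mean(times)
shifted_mean = mean(times_with_delay)
pct_increase = (shifted_mean - original_mean) / original_mean * 100

# How many of the 10 days now fall below the mean?
days_below_shifted_mean = sum(t < shifted_mean for t in times_with_delay)

# Example 3.1: amounts spent by the 12 sampled international visitors.
festival = [1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763]
festival_mean = mean(festival)

print(original_mean, shifted_mean, round(pct_increase, 1))  # 39.6 44.6 12.6
print(days_below_shifted_mean)                              # 9
print(festival_mean)                                        # 743.75
```

One changed value out of ten moves the mean above nine of the ten observations, which is why the mean needs care when extreme values are present.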
median Measure of central tendency; middle value in an array.

Median

The median is the value that partitions or splits an ordered set of data into two equal parts. As the median is not affected by extreme values, it may be a better measure of central tendency when there are extreme values.

The median is the middle value in a set of data that has been ordered from lowest to highest value. To calculate the median for a set of data, first order the values from smallest to largest. Then use Equation 3.2 to calculate the rank of the value that is the median.

MEDIAN
50% of the values are equal to or smaller than the median and 50% of the values are equal to or larger than the median.

\[ \text{Median} = \frac{n + 1}{2} \text{ ranked value} \tag{3.2} \]

Calculate the median value by these two rules:
• Rule 1 If there is an odd number of values in the data set, the median is the middle-ranked value.
• Rule 2 If there is an even number of values in the data set, then the median is the mean of the two middle-ranked values.

To calculate the median for the sample of the 10 times to get ready, first order the times:

Ordered values:  29  31  35  39  39  40  43  44  44  52
Ranks:            1   2   3   4   5   6   7   8   9  10
↑ Median = 39.5 (rank 5.5)

Rank of the median is (n + 1)/2 = (10 + 1)/2 = 5.5. So, using rule 2, the median is the mean of the fifth- and sixth-ranked values, (39 + 40)/2 = 39.5. Therefore, for half of the days the time to get ready is less than or equal to 39.5 minutes and for half of the days the time to get ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is very close to the mean time to get ready of 39.6 minutes.

EXAMPLE 3.2 CALCULATING THE MEDIAN FOR AN ODD SAMPLE SIZE
For a certain café, the numbers of customers during a selected seven-day week were 100, 75, 92, 85, 70, 80 and 71.
Calculate the median number of customers for this week.

SOLUTION

Ordered values:  70  71  75  80  85  92  100
Ranks:            1   2   3   4   5   6    7
↑ Median = 80 (rank 4)

Rank of the median is $\frac{n + 1}{2} = \frac{7 + 1}{2} = 4$. So, using rule 1, the median is the fourth-ranked value. The median number of customers is 80. Therefore, 50% of days have 80 or fewer customers and 50% have 80 or more customers.

EXAMPLE 3.3 CALCULATING THE MEDIAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL >
Calculate and interpret the median amount spent by international visitors.

SOLUTION
First order the data:

343  408  502  553  615  725  763  928  971  993  1,005  1,119
↑ Median (between the sixth- and seventh-ranked values)

Rank of the median is $\frac{n + 1}{2} = \frac{12 + 1}{2} = 6.5$. Using rule 2, the median is the mean of the sixth- and seventh-ranked values, $\frac{725 + 763}{2} = 744$.

Therefore, 50% of international visitors in the sample spent less than $744 during the festival and 50% spent more than $744.

mode Measure of central tendency; most frequent value.

Mode

The mode is the value in a data set that appears most frequently. Like the median and unlike the mean, the mode is not affected by extreme values. You should use the mode only for descriptive purposes as it is more variable from sample to sample than either the mean or the median. Often there is no mode or there are several modes in a set of data. For example, consider the data for the times to get ready shown below:

29  31  35  39  39  40  43  44  44  52

There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice. Because it has two modes, this data set is considered to be bimodal.
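The two median rules and the idea of the mode can be sketched in Python as follows. The function names `median_by_rank` and `modes` are assumptions made for this illustration; the second function returns an empty list when no value occurs more than once, which is how the text treats a data set with no mode.

```python
from collections import Counter


def median_by_rank(values):
    """Median as the (n + 1)/2 ranked value of the ordered data (Equation 3.2)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the data set must contain at least one value")
    middle = n // 2
    if n % 2 == 1:
        # Rule 1: odd n, so rank (n + 1)/2 is a whole number (0-based index n // 2).
        return ordered[middle]
    # Rule 2: even n, so take the mean of the two middle-ranked values.
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(values):
    """All values that occur most often; an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
cafe = [100, 75, 92, 85, 70, 80, 71]
festival = [1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763]

print(median_by_rank(times), modes(times))  # 39.5 [39, 44]
print(median_by_rank(cafe), modes(cafe))    # 80 []
print(median_by_rank(festival))             # 744.0
```

Sorting first and then indexing by rank follows the same order of steps as the worked examples: order the data, find the rank, read off the value.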
EXAMPLE 3.4 CALCULATING THE MODE
A company’s information systems manager keeps track of the number of unplanned outages that occur in a month. Calculate the mode for the following data, which represent the number of unplanned outages during the past 14 months:

1  3  0  3  26  2  7  4  0  2  3  3  6  3

SOLUTION
The ordered array for these data is:

0  0  1  2  2  3  3  3  3  3  4  6  7  26

Because 3 appears five times, more than any other value, the mode is 3. Thus, the systems manager can say that the most common occurrence is three unplanned outages a month. For this data set, the median is also equal to 3 while the mean is equal to 4.5. As the mean is affected by the extreme value of 26 unplanned outages, the median and the mode are better measures of central tendency than the mean for this data set.

A set of data will have no mode if none of the values is ‘most typical’ – that is, if no data value occurs more than once. Example 3.5 presents a data set with no mode.

EXAMPLE 3.5 DATA WITH NO MODE
For the café of Example 3.2, calculate the mode for the number of customers for the seven days.

SOLUTION
The ordered array for these data is:

70  71  75  80  85  92  100

As none of the days have the same number of customers there is no mode.

quartiles Measures of relative standing; partition a data set into quarters.

Quartiles

We have seen that the median partitions a set of data into two equal parts. We can extend this idea by partitioning a set of data into as many equal parts as we wish. Quartiles divide a set of data into quarters – that is, four equal parts. The first, or lower, quartile, Q1, divides the lower 25% of the values from the other 75%, which are larger. The second quartile, Q2, is the median – 50% of the values are below the median and 50% above.
The third, or upper, quartile, Q3, has 75% of the values below it and 25% above. Equations 3.3 and 3.4 define the first and third quartiles.

Q1, the median and Q3 are also the 25th, 50th and 75th percentiles, respectively. Equations 3.2, 3.3 and 3.4 can be expressed generally in terms of finding percentiles: (p × 100)th percentile = p × (n + 1) ranked value. p is between 0 and 1, with, for example, the median (Q2) corresponding to a p value of 0.5.

first (lower) quartile Value that 25% of data values are smaller than, or equal to.
second quartile The median value that 50% of data values are smaller than, or equal to.
third (upper) quartile Value that 75% of data values are smaller than, or equal to.

FIRST, OR LOWER, QUARTILE, Q1
25% of the values are smaller than, or equal to, Q1, the first quartile, and 75% are larger than, or equal to, the first quartile, Q1.

\[ Q_1 = \frac{n + 1}{4} \text{ ranked value} \tag{3.3} \]

THIRD, OR UPPER, QUARTILE, Q3
75% of the values are smaller than, or equal to, the third quartile, Q3, and 25% are larger than, or equal to, the third quartile, Q3.

\[ Q_3 = \frac{3(n + 1)}{4} \text{ ranked value} \tag{3.4} \]

Use the following rules to calculate the quartiles:
• Rule 1 If the result is an integer, then the quartile is equal to the ranked value. For example, if the sample size is n = 7, the first quartile, Q1, is equal to the (7 + 1)/4 = 2, second-ranked value.
• Rule 2 If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the mean of the corresponding ranked values. For example, if the sample size is n = 9, the first quartile, Q1, is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the second- and the third-ranked values.
• Rule 3 If the result is neither an integer nor a fractional half, round the result to the nearest integer and select that ranked value. For example, if the sample size is n = 10, the first quartile, Q1, is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third-ranked value.
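The three quartile rules can be written out as a short Python sketch. The function names `ranked_value` and `quartiles` are assumptions introduced for this illustration, and the nine-value data set is hypothetical, chosen only to exercise Rule 2. Library routines generally interpolate between ranked values rather than round, so they can return slightly different quartiles from the ones produced by these rules.

```python
def ranked_value(ordered, rank):
    """Apply Rules 1 to 3 to a 1-based, possibly fractional, rank."""
    whole = int(rank)
    fraction = rank - whole
    if fraction == 0:
        # Rule 1: the rank is an integer, so use that ranked value.
        return ordered[whole - 1]
    if fraction == 0.5:
        # Rule 2: fractional half, so average the two neighbouring ranked values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule 3: otherwise round the rank to the nearest integer.
    nearest = whole + 1 if fraction > 0.5 else whole
    return ordered[nearest - 1]


def quartiles(values):
    """Return (Q1, Q3) using Equations 3.3 and 3.4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed")
    q1 = ranked_value(ordered, (n + 1) / 4)
    q3 = ranked_value(ordered, 3 * (n + 1) / 4)
    return q1, q3


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # n = 10, Rule 3 applies
cafe = [100, 75, 92, 85, 70, 80, 71]              # n = 7, Rule 1 applies
nine_values = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # hypothetical, Rule 2 applies

print(quartiles(times))        # (35, 44)
print(quartiles(cafe))         # (71, 92)
print(quartiles(nine_values))  # (25.0, 75.0)
```

Because the ranks (n + 1)/4 and 3(n + 1)/4 are always multiples of 0.25, the exact comparisons with 0 and 0.5 are safe in floating-point arithmetic.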
To illustrate the calculation of the quartiles for the times to get ready, rank the data from smallest to largest:

Ranked values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10

The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for quartiles, round up to the third-ranked value as it is the closest integer. The third-ranked value for the data for the times to get ready is 35 minutes. Interpret the first quartile of 35 to mean that on 25% of the days the time to get ready is less than or equal to 35 minutes, and on 75% of the days the time to get ready is greater than or equal to 35 minutes.

The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule for quartiles, round down to the eighth-ranked value as it is the closest integer. The eighth-ranked value for the data for the times to get ready is 44 minutes. Interpret this to mean that on 75% of the days the time to get ready is less than or equal to 44 minutes, and on 25% of the days the time to get ready is greater than or equal to 44 minutes.

Be aware that several methods exist for calculating quartiles. Other textbooks and Excel may use different rules, which can result in slightly different values for the upper and lower quartiles.

EXAMPLE 3.6 CALCULATING THE QUARTILES FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL >
Calculate and interpret the quartiles for the amount spent by international visitors.

SOLUTION
First order the data:

343  408  502  553  615  725  763  928  971  993  1,005  1,119
↑ Q1 = 502 (third-ranked value); Median between 725 and 763; Q3 = 993 (10th-ranked value)

Rank of the first quartile is $\frac{n + 1}{4} = \frac{12 + 1}{4} = 3.25$. Using rule 3, the first quartile is the third-ranked value, Q1 = 502.

Therefore, 25% of international visitors in the sample spent $502 or less during the festival and 75% spent $502 or more.

Rank of the third quartile is $\frac{3(n + 1)}{4} = \frac{3(12 + 1)}{4} = 9.75$. Using rule 3, the third quartile is the 10th-ranked value, Q3 = 993.

Therefore, 75% of international visitors in the sample spent $993 or less during the festival and 25% spent $993 or more.

Geometric Mean

geometric mean Average rate of change of a variable.

The geometric mean and the geometric mean rate of return are used to measure the status of an investment over time or the average percentage change in a variable. The geometric mean, defined by Equation 3.5, measures the average rate of change of a variable over n periods.

GEOMETRIC MEAN
The geometric mean is the nth root of the product of n values.

\[ \bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \tag{3.5} \]

Using the geometric mean, we can measure the average return on an investment over time. This is given by the geometric mean rate of return, defined by Equation 3.6.

GEOMETRIC MEAN RATE OF RETURN

\[ \bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \tag{3.6} \]

where $R_i$ = the rate of return in time period i as a decimal

To illustrate the use of these measures, consider an investment of $100,000 that declined to a value of $50,000 at the end of year 1 and then rebounded back to its original $100,000 value at the end of year 2. The rate of return for this investment for the two-year period is 0, because the starting and ending values of the investment are the same.
However, the arithmetic mean of the annual rates of return of this investment is:

\[ \bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\% \]

since the rate of return for year 1 is:

\[ R_1 = \frac{50{,}000 - 100{,}000}{100{,}000} = -0.50 \text{ or } -50\% \]

and the rate of return for year 2 is:

\[ R_2 = \frac{100{,}000 - 50{,}000}{50{,}000} = 1.00 \text{ or } 100\% \]

Using Equation 3.6, the geometric mean rate of return for the two years is:

\[
\begin{aligned}
\bar{R}_G &= [(1 + R_1) \times (1 + R_2)]^{1/2} - 1 \\
&= \{[1 + (-0.50)] \times [1 + (1.0)]\}^{1/2} - 1 \\
&= (0.50 \times 2.0)^{1/2} - 1 \\
&= 1^{1/2} - 1 \\
&= 0
\end{aligned}
\]

Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value of the investment for the two-year period than does the arithmetic mean.

EXAMPLE 3.7 CALCULATING THE GEOMETRIC MEAN RATE OF RETURN
The annual percentage change in a New Zealand share market index, the NZX-50, for 2012 to 2016 was:

Year:           2012  2013  2014  2015  2016
Annual change:  24%   16%   18%   14%   10%

Data obtained from Yahoo 7 Finance <http://au.finance.yahoo.com> accessed April 2017

Calculate the geometric mean rate of return for these five years.

SOLUTION
Using Equation 3.6, the geometric mean rate of return in the NZX 50 Index for the five years is:

\[
\begin{aligned}
\bar{R}_G &= [(1 + R_{2012}) \times (1 + R_{2013}) \times (1 + R_{2014}) \times (1 + R_{2015}) \times (1 + R_{2016})]^{1/5} - 1 \\
&= [(1 + 0.24) \times (1 + 0.16) \times (1 + 0.18) \times (1 + 0.14) \times (1 + 0.10)]^{1/5} - 1 \\
&= (1.24 \times 1.16 \times 1.18 \times 1.14 \times 1.10)^{1/5} - 1 \\
&= 1.16308\ldots - 1 = 0.1630\ldots
\end{aligned}
\]

The geometric mean rate of return of the NZX 50 Index for the five years is approximately 16.3% annually.

Measures of Variation

Variation measures the spread or dispersion of values in a data set. One simple measure of variation is the range: the difference between the highest and lowest value. More commonly used in statistics are the standard deviation and variance, two measures also introduced in this section.
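Equation 3.6 can be sketched in Python to check both the two-year investment illustration and Example 3.7. The function name `geometric_mean_rate_of_return` is an assumption introduced here, and the sketch assumes every rate is greater than -1 (a loss of less than 100%) so that each growth factor is positive.

```python
import math


def geometric_mean_rate_of_return(rates):
    """Equation 3.6, with each rate given as a decimal (0.10 for 10%)."""
    n = len(rates)
    if n == 0:
        raise ValueError("at least one rate of return is needed")
    if any(r <= -1 for r in rates):
        raise ValueError("each rate must be greater than -1")
    growth = math.prod(1 + r for r in rates)  # product of the (1 + R_i) terms
    return growth ** (1 / n) - 1              # nth root of the product, minus 1


# $100,000 falls to $50,000 in year 1, then returns to $100,000 in year 2.
two_year = [-0.50, 1.00]
arithmetic = sum(two_year) / len(two_year)
geometric = geometric_mean_rate_of_return(two_year)

# Example 3.7: annual changes in the NZX 50 Index, 2012 to 2016.
nzx50 = [0.24, 0.16, 0.18, 0.14, 0.10]
nzx50_rate = geometric_mean_rate_of_return(nzx50)

print(arithmetic, geometric)  # 0.25 0.0
print(round(nzx50_rate, 3))   # 0.163
```

The arithmetic mean of 25% overstates the outcome because the 100% gain in year 2 applies to a base half the size of the one that suffered the 50% loss.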
Range

The range is the simplest numerical descriptive measure of variation in a set of data.

spread (dispersion): The amount of scattering of data values.
range: Distance measure of variation; difference between maximum and minimum data values.

RANGE
The range is equal to the largest value minus the smallest value.

\[ \text{Range} = X_{\text{largest}} - X_{\text{smallest}} \tag{3.7} \]

To determine the range of the times to get ready, first rank the data from smallest to largest:

29 31 35 39 39 40 43 44 44 52

Then, using Equation 3.7, the range is 52 − 29 = 23 minutes. The range of 23 minutes indicates that the largest difference between any two days in the time to get ready is 23 minutes.

EXAMPLE 3.8 CALCULATING THE RANGE FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS

In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL > Calculate and interpret the range for amount spent by international visitors.

SOLUTION
From an ordered array of the data, the minimum amount an international visitor spent was $343 and the maximum was $1,119. Using Equation 3.7, the range is:

\[ X_{\text{largest}} - X_{\text{smallest}} = 1{,}119 - 343 = 776 \]

Therefore, the difference between the maximum and minimum amounts spent by international visitors during the festival was $776.

The range measures the total spread of the data. Although the range is a simple measure of total variation, it is based only on the two extreme values and ignores all the other values. Thus, it does not take into account how the data are distributed between the smallest and largest values; it does not indicate whether the values are evenly distributed throughout the data set, clustered near the middle or clustered near one or both ends.
Like the mean, the range is distorted by very high or very low values, so care is needed when using the range as a measure of variation.

interquartile range: Distance measure of variation; difference between third and first quartile; range of middle 50% of data.

Interquartile Range

The interquartile range is the difference between the third and first quartiles in a set of data.

INTERQUARTILE RANGE
The interquartile range is the difference between the third quartile and the first quartile.

\[ \text{Interquartile range} = Q_3 - Q_1 \tag{3.8} \]

The interquartile range is a more meaningful measure of variation than the range because it ignores extreme values by finding the range of the middle 50% of the ordered array of data values. For the times to get ready we found that Q1 = 35 and Q3 = 44. Hence, using Equation 3.8:

Interquartile range = 44 − 35 = 9 minutes

Therefore, the interquartile range in the time to get ready is 9 minutes.

EXAMPLE 3.9 CALCULATING THE INTERQUARTILE RANGE FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS

In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate and interpret the interquartile range for the amount spent by international visitors.

SOLUTION
From Example 3.6, the first quartile, Q1, is 502 and the third quartile, Q3, is 993. Using Equation 3.8, the interquartile range is:

\[ Q_3 - Q_1 = 993 - 502 = 491 \]

Therefore, the range of the middle 50% of the amounts spent by international visitors in the sample is $491.

When calculating the interquartile range the highest and lowest 25% of the data values are discarded. Therefore, the interquartile range is not affected by extreme values.
Summary measures such as the median, Q1, Q3 and the interquartile range, which are not influenced by extreme values, are called resistant measures.

resistant measures: Summary measures not influenced by extreme values.

Variance and Standard Deviation

Although the range and interquartile range are measures of variation, they do not take into consideration how the values are distributed or clustered between the extremes. Two commonly used and related measures of variation that take into account how all the values in the data set are distributed are the variance and the standard deviation. These statistics measure the average scatter around the mean – how larger values fluctuate above it and how smaller values are distributed below it.

variance: Measure of variation based on squared deviations from the mean; directly related to the standard deviation.
standard deviation: Measure of variation based on squared deviations from the mean; directly related to the variance.

These measures are based on the difference between each data value and the mean, called the deviation of the data value from the mean. The notation \(X_i - \bar{X}\) is used to denote the deviation of a data value \(X_i\) from the mean \(\bar{X}\). A measure of variation around the mean could be to take the deviation of each value from the mean, and then sum the deviations. However, as the mean is the centre of balance in a set of data, for every data set the deviations from the mean would sum to zero – that is, \( \sum_{i=1}^{n} (X_i - \bar{X}) = 0 \).

This can be overcome by squaring the deviations from the mean before summing. In statistics, this quantity is called a sum of squares (or SS). So the sum of squares for X is \( SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 \).

sum of squares (SS): Sum of the squared deviations.

This sum of squares is then divided by the number of values minus 1 (for sample data) to get the sample variance (\(S^2\)). The square root of the sample variance is the sample standard deviation (S).
Because the sum of squares is a sum of squared differences that will always be non-negative, neither the variance nor the standard deviation can ever be negative. For a data set, the variance and standard deviation will usually be positive, and will only be zero if there is no variation – that is, all the values are equal.

For a sample containing n values, \(X_1, X_2, \ldots, X_n\), the sample variance (given by the symbol \(S^2\)) is:

\[ S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1} \]

Equation 3.9a expresses the equation using summation notation.

sample variance: Variance calculated from sample data.

SAMPLE VARIANCE – DEFINITION FORMULA
The sample variance is the sum of the squared deviations from the sample mean divided by the sample size minus one.

\[ S^2 = \frac{SSX}{n - 1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \tag{3.9a} \]

where
\(\bar{X}\) = sample mean
n = number of values or sample size
\(X_i\) = ith value of the variable X
\( SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 \) = sum of the squared deviations from the mean (sum of squares)

sample standard deviation: Standard deviation calculated from sample data.

SAMPLE STANDARD DEVIATION – DEFINITION FORMULA
The sample standard deviation is the square root of the sample variance.

\[ S = \sqrt{S^2} \tag{3.10} \]

If the denominator were n instead of n − 1, Equation 3.9a would calculate the average of the squared deviations from the mean. However, n − 1 is used because of certain desirable mathematical properties of the statistic \(S^2\) that make it appropriate for statistical inference (discussed in Chapter 7). The sample standard deviation, defined by Equation 3.10, is the more useful measure of variation because, unlike the sample variance, which is a squared quantity, the standard deviation is a value that is expressed in the same units of measurement as the original sample data.
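The definition formulas in Equations 3.9a and 3.10 translate directly into a few lines of code. The following is a minimal Python sketch applied to the ten times to get ready used throughout this section; the function names sample_variance and sample_std_dev are illustrative, not taken from any particular package.

```python
import math


def sample_variance(values):
    """Sample variance by the definition formula (Equation 3.9a)."""
    n = len(values)
    mean = sum(values) / n
    # SSX: sum of the squared deviations from the sample mean.
    ss_x = sum((x - mean) ** 2 for x in values)
    # Divide by n - 1 because these are sample data.
    return ss_x / (n - 1)


def sample_std_dev(values):
    """Sample standard deviation (Equation 3.10)."""
    return math.sqrt(sample_variance(values))


# Times to get ready (minutes) for the sample of 10 days.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
s2 = sample_variance(times)
s = sample_std_dev(times)
print(f"S^2 = {s2:.3f} squared minutes, S = {s:.3f} minutes")
# prints: S^2 = 45.822 squared minutes, S = 6.769 minutes
```

The variance is reported in squared minutes and the standard deviation in minutes, which is why the standard deviation is usually the easier of the two to interpret.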
The standard deviation is a measure of how a set of data is clustered or distributed around its mean. For most data sets the majority of the data values lie within one standard deviation of the mean – that is, within \((\bar{X} - S,\ \bar{X} + S)\) – and we will see later in this chapter that for all data sets at least 75% of the data values lie within two standard deviations of the mean – that is, within \((\bar{X} - 2S,\ \bar{X} + 2S)\). Therefore, a knowledge of the mean and the standard deviation helps to define where the majority of the data values are clustered.

Table 3.1 illustrates the steps for calculating the variance and standard deviation for the data on the times to get ready with mean \(\bar{X} = 39.6\), calculated earlier. The second column of Table 3.1 calculates the deviation of each time from the mean (step 1). The third column of Table 3.1 calculates the square of each deviation from the mean (step 2). The sum of the squared deviations (step 3) is shown at the bottom of Table 3.1. This total is then divided by 10 − 1 = 9 to calculate the variance (step 4). We can also calculate the variance by substituting values for the terms in Equation 3.9a:

\[ \begin{aligned} S^2 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \\ &= \frac{(39 - 39.6)^2 + (29 - 39.6)^2 + \cdots + (35 - 39.6)^2}{10 - 1} \\ &= \frac{412.4}{9} \\ &= 45.822\ldots \end{aligned} \]

Table 3.1 Calculating the variance of the times to get ready

X̄ = 39.6

Time (X)    Step 1: (Xi − X̄)    Step 2: (Xi − X̄)²
39              −0.60                  0.36
29             −10.60                112.36
43               3.40                 11.56
52              12.40                153.76
39              −0.60                  0.36
44               4.40                 19.36
40               0.40                  0.16
31              −8.60                 73.96
44               4.40                 19.36
35              −4.60                 21.16
Step 3: Sum                           SSX = 412.40
Step 4: Divide by (n − 1)             S² = 45.822…

The variance is in squared units (in squared minutes for these data) so, to calculate the standard deviation, which is in the original units (minutes for these data), take the square root of the variance.
Using Equation 3.10, the sample standard deviation S is:

\[ S = \sqrt{S^2} = \sqrt{45.82\ldots} = 6.769\ldots \]

This indicates that most of the times to get ready in this sample are clustered within 6.77 minutes of the mean of 39.6 minutes (i.e. clustered between \(\bar{X} - S = 32.83\) and \(\bar{X} + S = 46.37\)). Seven of the 10 times to get ready lie within this interval.

To check that the mean is correct, use the second column of Table 3.1 to calculate the sum of the deviations from the mean. For any set of data, this sum will be zero – that is:

\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0 \quad \text{for all sets of data} \]

It is tedious to use Equation 3.9a to calculate sample variance, especially for large samples or when the mean and/or data values are not integers. Instead, we can use algebra to obtain alternative calculation formulas.

SAMPLE VARIANCE – CALCULATION FORMULA
The sample variance is the sum of the squared deviations from the mean divided by the sample size minus 1.

\[ S^2 = \frac{SSX}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1} \tag{3.9b} \]

where
\(\bar{X}\) = sample mean
n = number of values or sample size
\(X_i\) = ith value of the variable X
\( \sum_{i=1}^{n} X_i^2 = X_1^2 + X_2^2 + \cdots + X_n^2 \) = sum of the squared \(X_i\) values in the sample

Use either calculation formula. The sum of the squared times to get ready is:

\[ \sum_{i=1}^{10} X_i^2 = 39^2 + 29^2 + \cdots + 35^2 = 16{,}094 \]

Then, calculate the variance by substituting values for the terms in the calculation form of Equation 3.9b:

\[ \begin{aligned} S^2 &= \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \\ &= \frac{16{,}094 - 10 \times 39.6^2}{10 - 1} \\ &= \frac{412.4}{9} \\ &= 45.822\ldots \end{aligned} \]

A statistical calculator can be used to calculate the standard deviation (and some other numerical measures introduced in this chapter) and, as covered later in this section, Excel can be used for large data sets.
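As a hedged illustration of the calculation formula, the Python sketch below applies the second form of Equation 3.9b to the ten times to get ready and also confirms that the deviations from the mean sum to zero. The function name sample_variance_shortcut is illustrative only, and the zero-sum check uses a small tolerance because floating-point arithmetic is not exact.

```python
def sample_variance_shortcut(values):
    """Sample variance by the calculation formula (Equation 3.9b)."""
    n = len(values)
    sum_x = sum(values)                    # sum of the X values
    sum_x_sq = sum(x * x for x in values)  # sum of the squared X values
    ss_x = sum_x_sq - sum_x ** 2 / n       # SSX without computing deviations
    return ss_x / (n - 1)


# Times to get ready (minutes) for the sample of 10 days.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

sum_x_sq = sum(x * x for x in times)
variance = sample_variance_shortcut(times)
print(f"sum of squared values = {sum_x_sq}, S^2 = {variance:.3f}")
# prints: sum of squared values = 16094, S^2 = 45.822

# Check on the mean: the deviations from the mean sum to zero.
mean = sum(times) / len(times)
deviation_sum = sum(x - mean for x in times)
print(abs(deviation_sum) < 1e-9)
```

Both forms of the formula give the same variance; the shortcut form only needs the running totals of X and X squared, which is why it suits hand or calculator work.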
Even though it is not usually necessary to use Equations 3.9a or 3.9b to calculate variance and Equation 3.10 to calculate standard deviation, it is important that you understand the process of how the variance and standard deviation are obtained.

EXAMPLE 3.10 CALCULATING THE VARIANCE AND STANDARD DEVIATION FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS

Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate and interpret the variance and standard deviation for amount spent by international visitors.

SOLUTION
Calculate the sum of X squared:

\[ \sum_{i=1}^{12} X_i^2 = 1{,}119^2 + 615^2 + \cdots + 763^2 = 7{,}380{,}205 \]

then from Example 3.1, \(\bar{X} = 743.75\) and using Equation 3.9b, we obtain:

\[ SSX = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 7{,}380{,}205 - 12 \times 743.75^2 = 742{,}236.25 \]

\[ S^2 = \frac{SSX}{n - 1} = \frac{742{,}236.25}{11} = 67{,}476.022\ldots \]

Therefore, the variance for the amount spent by international visitors during the festival is approximately 67,476.02 dollars squared. Now using Equation 3.10 the sample standard deviation, S, is:

\[ S = \sqrt{S^2} = \sqrt{67{,}476.022\ldots} = 259.761\ldots \]

Therefore, the standard deviation for the amount spent during the festival by international visitors is approximately $259.76. This indicates that we expect the majority of international visitors in the sample to have spent within $260 (plus or minus) of the mean expenditure of $743.75 during the festival.

The following summarises the characteristics of the range, interquartile range, variance and standard deviation:
• The more spread out, or dispersed, the data, the larger the range, interquartile range, variance and standard deviation.
• The more concentrated, or homogeneous, the data, the smaller the range, interquartile range, variance and standard deviation.
• If the values are all the same (so that there is no variation in the data), the range, interquartile range, variance and standard deviation will all equal zero.
• None of the measures of variation (the range, interquartile range, standard deviation and variance) can ever be negative.

Coefficient of Variation

Unlike the previous measures of variation presented, the coefficient of variation is a relative measure of variation that is expressed as a percentage rather than in terms of the units of the particular data. The coefficient of variation, denoted by the symbol CV, measures the scatter in the data relative to the mean.

coefficient of variation: Relative measure of variation; the standard deviation divided by the mean.

COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.

\[ CV = \left( \frac{S}{\bar{X}} \right) 100\% \tag{3.11} \]

where
S = sample standard deviation
\(\bar{X}\) = sample mean

For the sample of 10 times to get ready, since \(\bar{X} = 39.6\) and S = 6.769…, the coefficient of variation is:

\[ CV = \left( \frac{S}{\bar{X}} \right) 100\% = \frac{6.769\ldots}{39.6} \times 100\% = 17.09\ldots\% \]

For the times to get ready, the standard deviation is 17.1% of the size of the mean.

You will find the coefficient of variation useful when comparing two or more sets of data that have different units of measurement, as Example 3.11 illustrates, or when the scale of the data sets is substantially different.

EXAMPLE 3.11 COMPARING TWO COEFFICIENTS OF VARIATION WHEN TWO VARIABLES HAVE DIFFERENT UNITS OF MEASUREMENT

The operations manager of a package delivery service is deciding whether to purchase a new fleet of trucks. When packages are stored in the trucks in preparation for delivery, two major constraints need to be considered – the weight (in kilograms) and the volume (in cubic metres) of each item. The operations manager samples 200 packages and finds that the mean weight is 12.0 kilograms, with a standard deviation of 1.8 kilograms; the mean volume is 0.25 cubic metres, with a standard deviation of 0.06 cubic metres. How can the operations manager compare the variation of the weight and the volume?

SOLUTION
Because the measurement units differ for the weight and volume constraints, the operations manager should compare the relative variability in the two types of measurements.

For weight, the coefficient of variation is:

\[ CV_W = \left( \frac{S}{\bar{X}} \right) 100\% = \frac{1.8}{12} \times 100\% = 15\% \]

For volume, the coefficient of variation is:

\[ CV_V = \left( \frac{S}{\bar{X}} \right) 100\% = \frac{0.06}{0.25} \times 100\% = 24\% \]

Thus, relative to the mean, the package volume is more variable than the package weight because it has a higher coefficient of variation.

Z Scores

Z scores: Measures of relative standing; number of standard deviations that given data values are from the mean.
extreme value (outlier): Value located far from the mean; will have a large Z score, positive or negative.

Z scores are measures of relative standing that take into consideration both the mean and the standard deviation. A Z score represents the distance between a given observation and the mean expressed in standard deviations. An extreme value or outlier, a value located far away from the mean, will have a large Z score, either positive or negative. Therefore, Z scores are useful in identifying extreme values or outliers.

Z SCORE

\[ Z = \frac{X - \bar{X}}{S} \tag{3.12} \]

For the data for the times to get ready in the morning, the mean is 39.6 minutes and the standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes.
Use Equation 3.12 to calculate the Z score for day 1:

\[ Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09 \]

Therefore, the first day's time to get ready of 39 minutes is just 0.09 of a standard deviation below the mean – that is, just slightly quicker than the mean time to get ready. Table 3.2 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which the time to get ready was 52 minutes. The lowest Z score is −1.57 for day 2, on which the time to get ready was 29 minutes. As a general rule, a value is said to be an outlier if its Z score is less than −3.0 or greater than +3.0 – that is, the value is more than three standard deviations below or above the mean. As none of the times to get ready meets the outlier criterion, we can say there are no outliers in these data.

Table 3.2 Z scores for the 10 times to get ready

Time (X)              Z score
39                    −0.09
29                    −1.57
43                     0.50
52                     1.83
39                    −0.09
44                     0.65
40                     0.06
31                    −1.27
44                     0.65
35                    −0.68
Mean                  39.6
Standard deviation    6.77

EXAMPLE 3.12 CALCULATING THE Z SCORES FOR REAL ESTATE PRICES

A couple seeking a 'green-change' sell their inner city unit for $520,000 and plan to purchase a house in a rural town for the same price. Given that the mean unit price in the inner city is $845,000, with a standard deviation of $220,000, and the mean house price in the rural town is $280,000 with a standard deviation of $120,000, use Z scores to determine the price of each property relative to its region.

SOLUTION
The Z score for the inner city unit is:

\[ Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 845{,}000}{220{,}000} = -1.477\ldots \]

so the price of the unit sold is approximately 1.5 standard deviations below the mean price. That is, the couple have sold their unit for a relatively low price compared with mean inner city prices.
If the couple purchase a house for $520,000, then its Z score is:

\[ Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 280{,}000}{120{,}000} = 2 \]

The price of this property is two standard deviations above the mean price. That is, the couple plan to purchase a house for a relatively high price compared with property prices in the region.

Shape

As well as the centre and the variation of numerical data, we also need a description of the shape of the distribution, which represents the pattern of all the values from the lowest to the highest. Many data sets are approximately mound- or bell-shaped; other data sets may be skewed, with the majority of data values clustered in the upper or lower end of the distribution.

A distribution is symmetrical if the lower and upper halves of the graph are mirror images of each other. Panel B of Figure 3.1 illustrates a symmetrical distribution. If the distribution is not symmetrical, it may be skewed.

A distribution is skewed to the right, or positively skewed, if there is a long tail to the right, indicating that there are relatively few large data values and more smaller values – that is, most of the values are concentrated in the lower portion of the distribution. Panel C of Figure 3.1 illustrates a positively skewed distribution. As relatively few people have extremely high incomes, we would expect the distribution of annual income to be positively skewed.

A distribution is skewed to the left, or negatively skewed, if there is a long tail to the left, indicating that there are relatively few small data values and more larger values, and so most of the values are concentrated in the upper portion of the distribution. Panel A in Figure 3.1 illustrates a negatively skewed distribution. As relatively few people die at an early age, we would expect the distribution of age at death of Australian residents to be negatively skewed.

symmetrical: Distributions of data values above and below the mean are identical.
skewed: Non-symmetrical distribution; data values are clustered either in the lower or the upper portion of the distribution.

Figure 3.1 A comparison of three data sets differing in shape (Panel A: negative, or left skewed; Panel B: symmetrical; Panel C: positive, or right skewed)

The relative positions of the mean and median provide some information about the shape of a distribution. In many, but not all, negative or left-skewed distributions the few extremely small values pull the mean downwards so that the mean is less than the median. In many, but again not all, positive or right-skewed distributions the few extremely large values pull the mean upwards so that the mean is greater than the median. If the distribution is symmetrical, the high and low values balance each other and the mean equals the median.

Therefore, for most continuous unimodal (one peak) distributions, we can say that:
• if mean < median, the distribution is likely to be negative or left skewed
• if mean = median, the distribution is symmetrical or has zero skewness
• if mean > median, the distribution is likely to be positive or right skewed.

These rules often do not apply for discrete distributions, as illustrated in Example 3.13.

EXAMPLE 3.13 DISTRIBUTION OF NUMBER OF ADULTS IN HOUSEHOLD

From a random survey of 40 households the following data were obtained in response to the question 'How many adults (people over 18) are there in the household?' < HOUSEHOLD >

4 4 2 2 2 1 1 3 2 3
2 1 1 1 1 3 2 3 2 2
2 3 1 2 1 1 1 2 2 1
5 1 3 1 2 1 2 1 1 1

Present these data graphically and calculate mean and median.

SOLUTION
A column chart of the data is given in Figure 3.2.
Figure 3.2 Column chart for number of adults in household (vertical axis: Frequency; horizontal axis: Number of adults)

As most households have either one or two adults, the data are concentrated in the lower portion of the graph with a tail to the right. Therefore, the distribution of the number of adults in these households is positively or right skewed.

To calculate the mean, first calculate the sum of X, \( \sum_{i=1}^{40} X_i = 4 + 4 + \cdots + 1 = 76 \). Then, as n = 40, using Equation 3.1 we obtain:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{76}{40} = 1.9 \]

Rank of the median is:

\[ \frac{n + 1}{2} = \frac{40 + 1}{2} = 20.5 \]

The median is the mean of the 20th- and 21st-ranked values. From the ordered array of the data:

1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 2 2 2
2 2 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 4 4 5

the 20th- and 21st-ranked values are 2, so:

Median = 2

So the mean number of adults per household is 1.9, while the median number of adults is 2. In this case, mean < median even though the number of adults per household is skewed to the right.

Microsoft Excel Descriptive Statistics Output

The Microsoft Excel Data Analysis Toolpak generates the mean, median, mode, standard deviation, variance, range, minimum, maximum and count (sample size) on a single worksheet, all of which have been discussed in this section. In addition, Excel calculates the standard error, along with statistics for kurtosis and skewness. The standard error is the standard deviation divided by the square root of the sample size and is discussed in Chapter 7. Skewness measures the lack of symmetry in the data and is based on a statistic that is a function of the cubed differences around the mean. A skewness value of zero indicates a symmetrical distribution. Positive and negative values indicate positive or negative skewness.
Kurtosis measures the relative concentration of values in the centre of the distribution compared with the tails, and is based on the differences around the mean raised to the fourth power. This measure is not discussed in this text.

For data on festival expenditure by international visitors, the Excel descriptive statistics output, shown in Figure 3.3, gives many of the sample statistics calculated in the examples in this section.

Figure 3.3 Microsoft Excel summary statistics for festival expenditure

       A                       B
 1     Festival spending – international visitors
 2
 3     Mean                    743.75
 4     Standard error          74.9867
 5     Median                  744
 6     Mode                    #N/A
 7     Standard deviation      259.761
 8     Sample variance         67476
 9     Kurtosis                –1.41411
10     Skewness                –0.13236
11     Range                   776
12     Minimum                 343
13     Maximum                 1119
14     Sum                     8925
15     Count                   12

think about this
Median or mean?

While the mean is the most common measure of central tendency, there are times when the median is the appropriate measure to use. A common measure of relative poverty is living in a household that has less than 50% of median household income. In Poverty in Australia 2016, the Australian Council of Social Service (www.acoss.org.au) reveals that a single adult with a disposable income of less than $426 per week or a couple with two children with a disposable income of less than $895 per week were living in relative poverty in 2014.

Why is median household income used to define relative poverty, not mean household income? Two possible reasons are:
■ Since household income is likely to be skewed to the right, mean household income is likely to be considerably higher than the median household income. Therefore, defining the poverty line as 50% of mean household income would lead to a greater proportion of the population being defined as living in relative poverty.
■ Furthermore, defining the poverty line as 50% of mean household income would mean that any measures to alleviate poverty would be unlikely to change the proportion of households in relative poverty, since any increase in disposable household income of those in relative poverty would increase mean household income and hence raise the poverty line. However, using median household income to define relative poverty makes it possible to reduce, possibly to zero, the proportion of households in relative poverty. This is because increasing the disposable income of those living in relative poverty, through employment, benefits, tax rebates or other means, so that household income is above 50% of median income, need not change the median household income.

visual explorations
Exploring Descriptive Statistics

Open the VE_Descriptive_Statistics workbook to explore the effects of changing data values on measures of central tendency, variation and shape. Change the data values in the cell range A2:A11 and then observe the changes to the statistics shown in the chart. Click View the Suggested Activity Page to view a specific change you could make to the data values in column A. Click View the More About Descriptive Statistics Page to view summary definitions of the descriptive statistics shown in the chart.

Problems for Section 3.1 LEARNING THE BASICS 3.1 3.7 The data below are a sample of n = 5: 7 3.2 4 7 2 9 7 3 12 4 9 0 7 -5 -8 7 a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. Suppose that the rate of return for a particular share during the past two years was 10% and 30%. Calculate the geometric mean rate of return.
(Note: A rate of return of 10% is recorded as 0.10 and a rate of return of 30% is recorded as 0.30.) Problems 3.6 to 3.18 can be solved manually or by using Microsoft Excel. 568 The operations manager of a plant that manufactures tyres wants to compare the actual inner diameter of two grades of tyres, each of which is expected to be 575 millimetres. A sample of five tyres of each grade is selected and the results, representing the inner diameters of the tyres, ranked from smallest to largest, are as follows: Grade X 570 575 578 584 573 1,520 2,620 3,360 3,550 1,350 2,545 1,430 2,400 3,580 2,390 1,525 2,400 1,420 1,550 2,390 1,560 1,680 2,330 < SALES > 3.9 9 APPLYING THE CONCEPTS 3.6 3.8 3 a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. The data below are a sample of n = 5: 7 3.5 8 a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Calculate the Z scores. Are there any outliers? The data below are a sample of n = 7: 12 3.4 9 a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Calculate the Z scores. Are there any outliers? The data below are a sample of n = 6: 7 3.3 4 Grade Y 574 575 577 578 a. For each of the two grades of tyres, calculate the mean, median and standard deviation. b. Which grade of tyre is providing on average better quality? Explain. c. What would be the effect on your answers in (a) and (b) if the last value for grade Y was 588 instead of 578? Explain. Low-fat foods are not necessarily low calorie, as many low-fat foods are high in sugar. The calories per 250 ml cup of a random sample of brands of fresh cow’s milk for sale in Australia was given in problem 2.14 and stored in < FRESH_MILK >. Using the calorie data for each milk category: a. 
Calculate the mean, median, first quartile and third quartile. b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation. c. Based on the results of (a) and (b), what conclusions can you reach about the differences in calories between these types of milk? The sales per day, in dollars, at a certain store are: 2.4 a. Calculate the mean, median, mode, first quartile and third quartile. b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation. c. What conclusions can you reach about daily sales at this store? The supervisor of a tourist information desk at a local airport is interested in how long it takes an employee to serve a customer. For a sample of 12 customers, she measures the amount of time taken to serve each one. These times, measured in minutes, are reported below: < TOURIST > 1.5 3.9 0.6 2.7 3.1 2.8 0.9 1.4 2.6 1.4 6.1 a. Calculate the mean, median, mode, first quartile and third quartile. b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores. c. Are there any outliers, and are the data skewed? d. Based on the results of (a) to (c), what conclusions can you reach about the time taken to serve a customer? 3.10 The ordered arrays in the table below give the life (in hours of usage) of samples of forty 15-watt CFL (compact fluorescent lamp) energy-saving light bulbs produced by two manufacturers, A and B. 
< BULBS >

Manufacturer A
5,544 5,814 6,190 6,307 6,342 6,423 6,429 6,485 6,612 6,667
6,832 6,868 6,879 6,930 6,941 7,007 7,037 7,043 7,059 7,136
7,497 7,645 7,654 7,773 7,816 7,838 7,924 7,999 8,038 8,067
8,091 8,119 8,392 8,416 8,416 8,514 8,532 8,542 8,544 8,731

Manufacturer B
6,701 6,837 6,961 7,118 7,133 7,142 7,156 7,344 7,493 7,569
7,607 7,612 7,651 7,721 7,754 7,767 7,806 7,839 7,888 7,983
8,298 8,344 8,535 8,666 8,792 8,800 8,856 8,861 8,993 9,001
9,036 9,096 9,262 9,385 9,460 9,471 9,521 9,540 9,693 9,744

For each manufacturer:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
c. What conclusions can you reach concerning the life of each manufacturer’s bulbs?

3.11 The prices (in dollars) of 14 models of camera at a camera specialty store were as follows. < CAMERA >
340 370 450 400 450 310 280
340 220 430 340 270 290 380
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) to (c), what conclusions can you reach about the price of cameras at the camera specialty store?

3.12 The following data refer to the number of kilometres that a sample of 50 people drive to work each day. < TRAVEL_WORK >
23 19 12 15 26 34 26 26 27 15
25 8 5 32 27 31 35 16 10 24
32 36 7 38 25 4 24 35 9 18
17 22 46 24 44 19 27 34 12 23
30 47 38 27 27 42 29 27 45 29
a. Calculate the mean, median and mode.
b. Calculate the range, variance and standard deviation.
c. Interpret the summary measures calculated in (a) and (b).

3.13 A manufacturer of torch batteries took a sample of 13 batteries from a day’s production and used them continuously until they were drained. The numbers of hours they were used until failure were: < BATTERIES >
342 426 317 545 264 451 1,049 631 512 266 492 562 298
a. Calculate the mean, median and mode. Looking at the distribution of times to failure, which measures of central tendency do you think are most appropriate and which least appropriate to use for these data? Why?
b. Calculate the range, variance and standard deviation.
c. What would you advise if the manufacturer wanted to say in advertisements that these batteries ‘should last 400 hours’? (Note: There is no right answer to this question; the point is to consider how to make such a statement precise.)
d. Suppose that the first value was 1,342 instead of 342. Repeat (a) to (c), using this value. Comment on the difference in the results.

3.14 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1 pm lunch period. The waiting time in minutes (defined as the time from when the customer enters the line to when they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: < BANK1 >
4.21 5.55 3.02 5.13 4.77 2.34 3.54 3.20 4.50 6.10 0.38 5.12 6.46 6.19 3.79
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during the lunch hour, she asks the branch manager how long she can expect to wait. The branch manager replies, ‘Almost certainly less than five minutes’. On the basis of the results of (a) and (b), evaluate the accuracy of this statement.
3.15 Suppose that another branch, located in a residential area, is also concerned about the noon to 1 pm lunch hour. The waiting time in minutes (defined as the time from when the customer enters the line to the time they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: < BANK2 >
9.66 5.90 8.02 5.79 8.73 3.82 8.01 8.35 10.49 6.68 5.64 4.08 6.17 9.91 5.47
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during the lunch hour, he asks the branch manager how long he can expect to wait. The branch manager replies, ‘Almost certainly less than five minutes’. On the basis of the results of (a) and (b), evaluate the accuracy of this statement.

3.16 Data from 100 recent property sales from a council area are stored in < PROPERTY >. For the asking price data, calculate and interpret:
a. the mean and median (refer to graphs in problem 2.71)
b. the quartiles
c. the range and interquartile range
d. the variance and standard deviation.

3.17 The five years 2012 to 2016 saw volatility in the value of shares. The data in the following table give the annual percentage change in the share market index for Hong Kong, the Hang Seng, and for Australia, the S&P/ASX 200, for 2012 to 2016.
Year    Hang Seng    ASX 200
2012    22.9%        14.6%
2013    2.9%         15.1%
2014    1.3%         1.1%
2015    −7.2%        −2.1%
2016    0.4%         7.0%
Source: Data obtained from Yahoo 7 Finance <http://au.finance.yahoo.com> accessed April 2017
a. For each index calculate the geometric rate of return for the five years.
b. What conclusions can you reach concerning the geometric rates of return of the two indices?

3.18 The annual returns (before tax and fees) on several managed superannuation investment funds are:
Historical crediting rate for year ending 30 June (%)
Fund: Conservative balanced; Balanced; High growth; Sustainable balanced
Year: 2017 2016 2015 2014 2013
5.3 9.2 16.6 7.5 6.1 0.0 10.2 11.0 13.9 11.6 13.9 18.9 11.7 15.9 20.7 12.4 0.0 15.0 15.7 15.9
a. For each fund, calculate the geometric rate of return for three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric rates of return for the funds?

3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

LEARNING OBJECTIVE 2 Calculate and interpret descriptive summary measures for a population

Section 3.1 introduces several statistics that describe the properties of central tendency, variation and shape for a sample. If we have population data there are similar numerical descriptive measures, called population parameters, of central tendency, variation and shape. This section introduces three population parameters: population mean, population variance and population standard deviation.

To illustrate these population parameters we use the data in Table 3.3, which classifies road fatalities in Australia for 2016 by month and gender. Because the table gives the total, and the male and female monthly road fatalities for 2016, for all of Australia, this is population data.

Table 3.3 Road fatalities in Australia 2016

             Gender
Month        Unknown   Male   Female   Total
January      0         27     80       107
February     0         30     72       102
March        0         23     87       110
April        0         33     81       114
May          0         29     76       105
June         0         23     74       97
July         0         26     91       117
August       0         38     74       112
September    0         24     68       92
October      0         29     89       118
November     0         28     78       106
December     1         27     88       116
Total        1         337    958      1,296

Source: Data obtained from the Australian Road Deaths Database <www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx> accessed 4 May 2017.

Population Mean

The population mean, defined by Equation 3.13, is represented by the symbol μ, the Greek lower-case letter mu.

population mean: Mean calculated from population data.

POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size, N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

where
μ = population mean
X_i = ith value of the variable X
$\sum_{i=1}^{N} X_i$ = sum of all X_i values in the population

To calculate the mean monthly total road fatality for 2016 from the data given in Table 3.3, use Equation 3.13:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{107 + 102 + \dots + 116}{12} = \frac{1296}{12} = 108$$

Thus, the mean monthly road fatality for 2016 was 108.

Population Variance and Standard Deviation

population variance: Variance calculated from population data.
population standard deviation: Standard deviation calculated from population data.

The population variance and the population standard deviation measure variation in a population. Like the related sample statistic, the population standard deviation is the square root of the population variance. The population variance is represented by the symbol σ², the Greek lower-case letter sigma squared, and the population standard deviation by the symbol σ.
These parameters are defined by Equations 3.14a and 3.15. The denominator in Equation 3.14a is N (population size) and not n − 1 as used in the equation for the sample variance (see Equation 3.9a).

POPULATION VARIANCE – DEFINITION FORMULA
The population variance is the sum of the squared deviations from the population mean divided by the population size N.

$$\sigma^2 = \frac{SSX}{N} = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (3.14a)$$

where
μ = population mean
X_i = ith value of the variable X
$SSX = \sum_{i=1}^{N} (X_i - \mu)^2$ = sum of the squared deviations from the mean (sum of squares)

POPULATION STANDARD DEVIATION
The population standard deviation is the square root of the population variance.

$$\sigma = \sqrt{\sigma^2} \qquad (3.15)$$

As we did for sample variance and standard deviation, we can use algebra to obtain alternative calculation formulas.

POPULATION VARIANCE – CALCULATION FORMULA
The population variance is the sum of the squared deviations from the population mean divided by the population size N.

$$\sigma^2 = \frac{SSX}{N} = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \qquad (3.14b)$$

where
μ = population mean
X_i = ith value of the variable X
$\sum_{i=1}^{N} X_i^2 = X_1^2 + X_2^2 + \dots + X_N^2$ = sum of the squared X_i values in the population

Use either calculation formula. Using the data in Table 3.3 to calculate the population variance and standard deviation for the 2016 monthly total road fatalities, first calculate:

$$\sum_{i=1}^{N} X_i^2 = 107^2 + 102^2 + \dots + 116^2 = 140{,}696$$

then use Equations 3.14b and 3.15 to obtain:

$$\sigma^2 = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{140{,}696 - 12 \times 108^2}{12} = 60.666\ldots$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{60.666\ldots} = 7.788\ldots$$

Thus, the variance of monthly total fatalities for 2016 is approximately 60.7 and the standard deviation is approximately 7.8 fatalities per month.
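The arithmetic above can be checked with a short Python sketch. It applies Equations 3.13, 3.14a, 3.14b and 3.15 to the Total column of Table 3.3. The function and variable names are illustrative assumptions, not part of the text or of any Excel workbook that accompanies it.

```python
import math

# Monthly total road fatalities, Australia 2016 (Total column of Table 3.3)
fatalities = [107, 102, 110, 114, 105, 97, 117, 112, 92, 118, 106, 116]

def population_mean(values):
    """Equation 3.13: sum of the values divided by the population size N."""
    return sum(values) / len(values)

def population_variance_definition(values):
    """Equation 3.14a: sum of squared deviations from mu, divided by N."""
    mu = population_mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def population_variance_calculation(values):
    """Equation 3.14b: (sum of squared values - N * mu^2) divided by N."""
    n = len(values)
    mu = population_mean(values)
    return (sum(x * x for x in values) - n * mu ** 2) / n

mu = population_mean(fatalities)
var_definition = population_variance_definition(fatalities)
var_calculation = population_variance_calculation(fatalities)
sigma = math.sqrt(var_definition)  # Equation 3.15

print(f"mean = {mu:.0f}")
print(f"variance (3.14a) = {var_definition:.3f}")
print(f"variance (3.14b) = {var_calculation:.3f}")
print(f"standard deviation = {sigma:.3f}")
```

Both variance formulas should agree, which is the point of the algebraic identity behind Equation 3.14b. Note that the divisor is N, not n − 1, because the twelve months make up the whole population.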
So, the typical 2016 monthly fatality rate differs from the mean of 108 by plus or minus 7.8.

The Empirical Rule

In many data sets a large portion of the values tend to cluster near the median. In right-skewed data sets, this clustering occurs in the left or lower part of the distribution. In left-skewed data sets, the values tend to cluster in the right or upper part of the distribution. In symmetrical data sets, where the median and mean are similar, the values often cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions, both population and sample. The empirical rule states that for bell-shaped distributions:
• Approximately 68% of the values are within a distance of ±1 standard deviation from the mean. That is, approximately 68% of the data values have Z scores between −1 and 1.
• Approximately 95% of the values are within a distance of ±2 standard deviations from the mean. That is, approximately 95% of the data values have Z scores between −2 and 2.
• Approximately 99.7% of the values are within a distance of ±3 standard deviations from the mean. That is, approximately 99.7% of the data values have Z scores between −3 and 3.

bell-shaped: Symmetric, unimodal, mound-shaped distribution.
empirical rule: Gives the distribution of data values in terms of standard deviations from the mean for bell-shaped distributions.

The empirical rule helps to identify outliers when analysing a set of numerical data. The empirical rule implies that, for bell-shaped distributions, only about 1 in 20 values will be beyond two standard deviations from the mean. As a general rule, you can consider values not found in the interval μ ± 2σ (or X̄ ± 2S) as potential outliers.
The rule also implies that only about 3 in 1,000 will be beyond three standard deviations from the mean. Therefore, values not found in the interval μ ± 3σ (or X̄ ± 3S) are almost always considered outliers. For heavily skewed or non-bell-shaped data sets the Chebyshev rule, introduced next, should be used instead of the empirical rule.

EXAMPLE 3.14 USING THE EMPIRICAL RULE
A population of 600-mL bottles of soft drink is known to have a mean fill-weight of 603 mL and a standard deviation of 1 mL. The population is also known to be bell-shaped. Describe the distribution of fill-weights. Is it very likely that a bottle will contain less than 600 mL of soft drink?

SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Using the empirical rule, approximately 68% of the bottles will contain between 602 mL and 604 mL, approximately 95% will contain between 601 mL and 605 mL, and approximately 99.7% will contain between 600 mL and 606 mL. Therefore, it is highly unlikely that a bottle will contain less than 600 mL of soft drink. Specifically, because of the assumed symmetry, we would expect only 0.15% of bottles to have a volume of soft drink less than 600 mL (and thus 0.15% above 606 mL).

The Chebyshev Rule

Chebyshev rule: Gives lower bounds of the distribution of data values in terms of standard deviations from the mean for any distribution.

The Chebyshev rule states that, for all data sets, population or sample, the percentage of values within k standard deviations of the mean must be at least:

$$\left[1 - \left(\frac{1}{k}\right)^2\right] \times 100\%$$

You can use this rule for any value of k greater than 1. Consider k = 2. The Chebyshev rule states that at least [1 − (1/2)²] × 100% = 75% of the values must be within ±2 standard deviations of the mean. The Chebyshev rule is very general and applies to any distribution. The rule gives the percentage of values that must at least be within a given distance from the mean.
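The bound can be tabulated with a minimal Python sketch, shown here next to the empirical-rule percentages for the bottle population of Example 3.14 (μ = 603 mL, σ = 1 mL). The function names and the `EMPIRICAL_PERCENT` lookup are illustrative assumptions. The sketch also accepts k = 1, for which the formula returns the uninformative bound of 0%.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean.

    Implements [1 - (1/k)^2] * 100%. The bound is informative for k > 1;
    k = 1 gives 0%, and k < 1 is rejected.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1 - (1 / k) ** 2) * 100

def interval(mu, sigma, k):
    """Return the interval (mu - k*sigma, mu + k*sigma)."""
    return (mu - k * sigma, mu + k * sigma)

# Approximate percentages from the empirical rule (bell-shaped data only)
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}

mu, sigma = 603, 1  # fill-weights of the bottles in Example 3.14, in mL
for k in (1, 2, 3):
    low, high = interval(mu, sigma, k)
    print(
        f"k={k}: ({low}, {high}) mL | "
        f"Chebyshev: at least {chebyshev_lower_bound(k):.2f}% | "
        f"empirical rule: about {EMPIRICAL_PERCENT[k]}%"
    )
```

The Chebyshev figures are guaranteed minimums for any distribution, whereas the empirical-rule figures are approximations that hold only when the data are bell-shaped.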
However, if the data set is approximately bell-shaped, the empirical rule will more accurately reflect the greater concentration of data close to the mean. Table 3.4 compares the Chebyshev and empirical rules.

Table 3.4 How data vary around the mean

                      % of values found in intervals around the mean
Interval              Chebyshev (any distribution)    Empirical rule (bell-shaped distribution)
(μ − σ, μ + σ)        At least 0%                     Approximately 68%
(μ − 2σ, μ + 2σ)      At least 75%                    Approximately 95%
(μ − 3σ, μ + 3σ)      At least 88.89%                 Approximately 99.7%

EXAMPLE 3.15 USING THE CHEBYSHEV RULE
As in Example 3.14, a population of 600-mL bottles of soft drink is known to have a mean fill-weight of 603 mL and a standard deviation of 1 mL. However, the shape of the population is unknown and you cannot assume that it is bell-shaped. Describe the distribution of fill-weights. Is it very likely that a bottle will contain less than 600 mL of soft drink?

SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Because the distribution may not be bell-shaped, the empirical rule should not be used. Using the Chebyshev rule, you cannot say anything about the percentage of bottles containing between 602 mL and 604 mL. You can state that at least 75% of the bottles will contain between 601 mL and 605 mL, and at least 88.89% will contain between 600 mL and 606 mL. Therefore, it is possible that up to 11.11% of bottles contain less than 600 mL of soft drink (or more than 606 mL).

These two rules apply to both population and sample data. For sample data, use the sample mean X̄ and sample standard deviation S in place of the population parameters μ and σ.

Problems for Section 3.2

LEARNING THE BASICS

3.19 The data below are for a population with N = 10:
7 5 11 8 3 6 2 1 9 8
a. Calculate the population mean.
b. Calculate the population standard deviation.

3.20 The data below are for a population with N = 10:
7 5 6 6 6 4 8 6 9 3
a. Calculate the population mean.
b. Calculate the population standard deviation.

APPLYING THE CONCEPTS

3.21 Analyse the road fatality data for 2016 given in < MONTHLY_FATALITY_2016 > for each gender by:
a. Calculating the mean, variance and standard deviation.
b. Finding the proportion of months that have fatalities within one and two standard deviations of the mean.
c. Comparing your findings with what would be expected on the basis of the empirical rule.

3.22 Naturally Soap is a small business, based in a coastal town, that makes and sells natural, luxurious, handmade soap bars in a variety of scents. Presently the soap is sold at local markets: Wednesday evening in the coastal town where the business is located, and a scheduled Sunday morning market in a roster of local villages. During the last six months, Naturally Soap has also been available via the Internet. Naturally Soap is interested in analysing the quantity sold weekly at each market and Internet sales. While Naturally Soap has complete sales and price data for both markets for the previous year, due to a computer ‘problem’ there is only a sample of weekly sales and price data for the Internet sales. The data are stored in the < NATURALLY_SOAP > file.
a. For the Sunday morning market:
i. Calculate the mean, variance and standard deviation of the weekly sales for the year.
ii. What conclusions can you make about the weekly sales for this market?
iii. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to further explain the variation in the weekly sales.
iv. Using the results in (iii), are there any outliers? Explain.
b. Repeat (a) for the Wednesday evening market.

3.23 The ages, to the nearest year, of all employees at a certain fast-food outlet are:
19 19 45 20 21 21 18 20 23 17
a. Calculate the mean, variance and standard deviation.
b. Calculate the Z scores.
c. Based on the results of (a) and (b), what conclusions can you reach about employee ages at this fast-food outlet?

3.24 The file < HOURS > gives the hours worked during a recent week by all 30 employees of a local bakery. For this week:
a. Calculate and interpret the mean hours worked.
b. Calculate the variance and standard deviation of the hours worked. Interpret the standard deviation.
c. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to explain further the variation in the hours worked.
d. Using the results in (c), are there any outliers? Explain.

3.3 CALCULATING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION

LEARNING OBJECTIVE 1 Calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data

When you have a frequency distribution and the raw data are not available, you can calculate approximations to the mean and the standard deviation by assuming that all values within each class are located at the class mid-point.

APPROXIMATING THE SAMPLE MEAN, VARIANCE AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

where
X̄ = sample mean
n = number of values or sample size
c = number of classes in the frequency distribution
m_j = mid-point of the jth class
f_j = number of values in the jth class

$$S = \sqrt{S^2} \qquad (3.17)$$

where

$$S^2 = \frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$$

Example 3.16 illustrates the calculation of a sample mean and the standard deviation from a frequency distribution.
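Equations 3.16 and 3.17 can also be sketched in Python. The class mid-points (in thousands of dollars) and frequencies below are those of the real estate asking prices in Example 3.16. The function names are illustrative assumptions, and the result is only an approximation to the raw-data statistics because every value is treated as sitting at its class mid-point.

```python
import math

# Class mid-points in $000s and class frequencies (Tables 3.5 and 3.6)
midpoints = [325, 375, 425, 475, 525, 575, 625, 675, 725, 775, 825]
frequencies = [8, 17, 21, 20, 16, 6, 7, 3, 0, 0, 2]

def grouped_mean(m, f):
    """Equation 3.16: sum of m_j * f_j divided by n, where n = sum of f_j."""
    n = sum(f)
    return sum(mj * fj for mj, fj in zip(m, f)) / n

def grouped_std_dev(m, f):
    """Equation 3.17: square root of sum((m_j - xbar)^2 * f_j) / (n - 1)."""
    n = sum(f)
    xbar = grouped_mean(m, f)
    sum_sq_dev = sum(fj * (mj - xbar) ** 2 for mj, fj in zip(m, f))
    return math.sqrt(sum_sq_dev / (n - 1))

xbar = grouped_mean(midpoints, frequencies)
s = grouped_std_dev(midpoints, frequencies)
print(f"approximate mean = {xbar:.0f} ($000s)")
print(f"approximate standard deviation = {s:.3f} ($000s)")
```

The divisor n − 1 is used because the frequency distribution summarises a sample of 100 sales rather than a population.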
EXAMPLE 3.16 APPROXIMATING THE MEAN AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION
Use the frequency distribution for real estate prices given in Table 3.5 to calculate the approximate sample mean and standard deviation. Compare these approximations with the mean and standard deviation calculated from the raw (ungrouped) data in < PROPERTY >; see problem 3.16.

Table 3.5 Frequency distribution: real estate asking prices

Asking price ($)          Frequency
300,000 to < 350,000      8
350,000 to < 400,000      17
400,000 to < 450,000      21
450,000 to < 500,000      20
500,000 to < 550,000      16
550,000 to < 600,000      6
600,000 to < 650,000      7
650,000 to < 700,000      3
700,000 to < 750,000      0
750,000 to < 800,000      0
800,000 to < 850,000      2
Total                     100

SOLUTION
The calculations of the approximate mean and standard deviation of the real estate prices are summarised in Table 3.6 where, to avoid extremely large numbers, the mid-point of each class has been recorded in thousands of dollars.
Table 3.6 Calculations needed to calculate approximations of the mean and standard deviation of the real estate prices

Asking price ($)          Mid-point in $000s   Frequency   f_j m_j   f_j m_j²
300,000 to < 350,000      325                  8           2,600     845,000
350,000 to < 400,000      375                  17          6,375     2,390,625
400,000 to < 450,000      425                  21          8,925     3,793,125
450,000 to < 500,000      475                  20          9,500     4,512,500
500,000 to < 550,000      525                  16          8,400     4,410,000
550,000 to < 600,000      575                  6           3,450     1,983,750
600,000 to < 650,000      625                  7           4,375     2,734,375
650,000 to < 700,000      675                  3           2,025     1,366,875
700,000 to < 750,000      725                  0           0         0
750,000 to < 800,000      775                  0           0         0
800,000 to < 850,000      825                  2           1,650     1,361,250
Total                                          100         47,300    23,397,500

Using Equations 3.16 and 3.17:

$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{47{,}300}{100} = 473$$

and

$$S = \sqrt{\frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1}} = \sqrt{\frac{23{,}397{,}500 - 100 \times 473^2}{99}} = \sqrt{10{,}349.4949\ldots} = 101.732\ldots$$

Therefore, the mean and standard deviation are approximately $473,000 and $101,700. These values compare with the actual mean, $472,440, and the standard deviation, $102,395, calculated from the raw (ungrouped) data; see solutions to problem 3.16.

Problems for Section 3.3

LEARNING THE BASICS

3.25 Given the following frequency distribution for n = 100:

Class intervals   Frequency
0–under 10        10
10–under 20       20
20–under 30       40
30–under 40       20
40–under 50       10
                  100

Approximate:
a. the mean
b. the standard deviation.

3.26 Given the following frequency distribution for n = 100:

Class intervals   Frequency
0–under 10        40
10–under 20       25
20–under 30       15
30–under 40       15
40–under 50       5
                  100

Approximate:
a. the mean
b. the standard deviation.

APPLYING THE CONCEPTS

3.27 A company wished to study its accounts receivable for two successive months. An independent sample of 50 accounts was selected for each month. The results are in the table below.
a. For each month, approximate the:
i. mean
ii. standard deviation
b. On the basis of your answers in (a), do you think the mean and the standard deviation of the accounts receivable have changed substantially from March to April? Explain.

Frequency Distributions for Accounts Receivable

Amount                       March frequency   April frequency
$0 to under $2,000           6                 10
$2,000 to under $4,000       13                14
$4,000 to under $6,000       17                13
$6,000 to under $8,000       10                10
$8,000 to under $10,000      4                 0
$10,000 to under $12,000     0                 3
Total                        50                50

3.4 FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS

LEARNING OBJECTIVE 3 Construct and interpret a box-and-whisker plot

Section 3.1 introduces sample statistics to measure the centre, variation and shape of numerical data. Another way of describing numerical data is to use the five-number summary, which is illustrated graphically by a box-and-whisker plot.

Five-Number Summary

five-number summary: Numerical data summarised by quartiles.

The five-number summary consists of the five statistics:

Xsmallest   Q1   Median   Q3   Xlargest

The five-number summary characterises a sample (or population) reasonably well and is useful for exploratory data analysis. In particular, it provides a way to determine the shape of the distribution. Table 3.7 explains how the relationships between the ‘five numbers’ allow you to recognise the shape of a data set.

Table 3.7 Relationships between the five-number summary and the type of distribution

Comparison: Distance from Xsmallest to the median versus the distance from the median to Xlargest.
  Left skewed: The distance from Xsmallest to the median is greater than the distance from the median to Xlargest.
  Symmetrical: Both distances are the same.
  Right skewed: The distance from Xsmallest to the median is less than the distance from the median to Xlargest.

Comparison: Distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest.
  Left skewed: The distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest.
  Symmetrical: Both distances are the same.
  Right skewed: The distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest.

Comparison: Distance from Q1 to the median versus the distance from the median to Q3.
  Left skewed: The distance from Q1 to the median is greater than the distance from the median to Q3.
  Symmetrical: Both distances are the same.
  Right skewed: The distance from Q1 to the median is less than the distance from the median to Q3.

The sample of 10 times to get ready (Section 3.1) ranges from 29 minutes to 52 minutes. The median is 39.5, the first quartile is 35 and the third quartile is 44. Therefore, the five-number summary is:

29   35   39.5   44   52

The distance from Xsmallest to the median (39.5 − 29 = 10.5) is slightly less than the distance from the median to Xlargest (52 − 39.5 = 12.5). The distance from Xsmallest to Q1 (35 − 29 = 6) is slightly less than the distance from Q3 to Xlargest (52 − 44 = 8). Therefore, the times to get ready are slightly right skewed.

EXAMPLE 3.17 CALCULATING THE FIVE-NUMBER SUMMARY FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate the five-number summary.

SOLUTION
From Examples 3.3, 3.6 and 3.8 the five-number summary is:

343   502   744   993   1,119

The distance from the median to Xsmallest ($401) is more than the distance from Xlargest to the median ($375). Furthermore, the distance from Xsmallest to Q1 ($159) is more than the distance from Q3 to Xlargest ($126). Therefore, the amount spent by international visitors during the festival has a slight left-skewed distribution.

Box-and-Whisker Plots

A box-and-whisker plot, alternatively called a boxplot, provides a graphical representation of the data based on the five-number summary. It shows the range, interquartile range and quartiles.
Figure 3.4 illustrates the box-and-whisker plot for the times to get ready. The vertical line drawn within the box represents the median. The vertical line at the left side of the box represents Q1 and the vertical line at the right side of the box represents Q3. Thus, the box contains the middle 50% of an ordered array of data values, 25% between the median and each quartile. The lower 25% of the data values is represented by a line (i.e. a whisker) connecting the left side of the box to the location of the smallest value, Xsmallest. Similarly, the upper 25% of the data values is represented by a whisker connecting the right side of the box to Xlargest.

box-and-whisker plot: Graphical representation of the five-number summary.

[Figure 3.4 Box-and-whisker plot of the time to get ready. Horizontal axis: Time (minutes), 20 to 55, with Xsmallest, Q1, Median, Q3 and Xlargest marked.]

The box-and-whisker plot of the times to get ready in Figure 3.4 confirms a very slight right skewness since the right whisker is slightly longer than the left whisker.

EXAMPLE 3.18 CONSTRUCTING A BOX-AND-WHISKER PLOT FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Construct and interpret the box-and-whisker plot shown in Figure 3.5.

SOLUTION
[Figure 3.5 Box-and-whisker plot, festival expenditure – international visitors. Horizontal axis: $, 300 to 1,200.]

The left-hand whisker is slightly longer than the right-hand whisker and the left-hand and right-hand rectangles are approximately the same. Therefore, the amount spent by international visitors during the festival has a very slight left or negative skew.
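The three comparisons of Table 3.7 can be sketched in Python, starting from a five-number summary that has already been calculated. The function names and the optional `tol` parameter (how close two distances must be to count as ‘the same’) are illustrative assumptions, not part of the text.

```python
def compare(left_distance, right_distance, tol=0.0):
    """Label one pair of distances following the logic of Table 3.7."""
    if abs(left_distance - right_distance) <= tol:
        return "symmetrical"
    # A longer left-hand distance points to a left skew, and vice versa.
    return "left skewed" if left_distance > right_distance else "right skewed"

def shape_from_five_numbers(x_smallest, q1, median, q3, x_largest, tol=0.0):
    """Apply the three comparisons of Table 3.7 to a five-number summary."""
    return {
        "Xsmallest to median vs median to Xlargest":
            compare(median - x_smallest, x_largest - median, tol),
        "Xsmallest to Q1 vs Q3 to Xlargest":
            compare(q1 - x_smallest, x_largest - q3, tol),
        "Q1 to median vs median to Q3":
            compare(median - q1, q3 - median, tol),
    }

# Five-number summaries quoted in the text
times = shape_from_five_numbers(29, 35, 39.5, 44, 52)
festival = shape_from_five_numbers(343, 502, 744, 993, 1119)

for name, result in (("Times to get ready", times), ("Festival expenditure", festival)):
    print(name)
    for comparison, label in result.items():
        print(f"  {comparison}: {label}")
```

With `tol=0`, the two halves of the festival box (242 versus 249) are labelled right skewed, although the text treats them as approximately the same. Passing a small tolerance, for example `tol=10`, reproduces that reading. The overall judgement of skewness still rests on weighing the three comparisons together, as the examples do.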
Figure 3.6 demonstrates the relationship between a box-and-whisker plot and the corresponding polygon for four different types of distributions. (Note: The area under each polygon is split into quartiles corresponding to the five-number summary for the box-and-whisker plot.)

Panels A and D of Figure 3.6 are symmetrical. In these distributions, the length of the left whisker is equal to the length of the right whisker, and the median line divides the box in half.

Panel B of Figure 3.6 is left skewed. For this left-skewed distribution, the skewness indicates that there is a heavy clustering of values at the high end of the scale (i.e. the right-hand side); 75% of all values are found between the left edge of the box (Q1) and the end of the right whisker (Xlargest). Therefore, the long left whisker contains the smallest 25% of the values.

Panel C of Figure 3.6 is right skewed. The concentration of values is on the low end of the scale (i.e. the left side of the box-and-whisker plot). Here, 75% of all data values are found between the beginning of the left whisker (Xsmallest) and the right edge of the box (Q3), and the remaining 25% of the values are dispersed along the long right whisker at the upper end of the scale.

[Figure 3.6 Box-and-whisker plots and corresponding polygons for four distributions. Panel A: Bell-shaped distribution. Panel B: Left-skewed distribution. Panel C: Right-skewed distribution. Panel D: Rectangular/uniform distribution.]

Problems for Section 3.4

LEARNING THE BASICS

3.28 The data below are a sample of n = 5:
7 4 9 8 2
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.29 The data below are a sample of n = 6:
7 4 9 7 3 12
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.30 The data below are a sample of n = 7:
12 7 4 9 0 7 3
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.31 The data below are a sample of n = 5:
7 −5 −8 7 9
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

APPLYING THE CONCEPTS

Problems 3.32 to 3.35 can be solved manually or by using Microsoft Excel or PHStat.

3.32 For the life of 15-watt CFL light bulbs data in problem 3.10: < BULBS >
a. List the five-number summary for each manufacturer.
b. Construct the box-and-whisker plot and describe the shape of the distribution for each manufacturer.

3.33 For the daily sales data in problem 3.8: < SALES >
a. List the five-number summary.
b. Construct the box-and-whisker plot and discuss the daily sales distribution for the store.

3.34 Many fast-food chains offer salads and low-fat options on their menu as an alternative to their traditional rolls and burgers. Data for a sample of these alternative and traditional menu items are stored in < HEALTHY_FASTFOOD >. For each product category, use the fat in grams per serve data:
a. List the five-number summary.
b. Construct the box-and-whisker plot.
c. What similarities and differences are there in the distributions for the product categories?

3.35 Use the data in problems 3.14 and 3.15, representing the waiting times of random samples of customers at two bank branches during the noon to 1 pm lunch period. < BANK1 > < BANK2 > For each bank:
a. List the five-number summary of the waiting time at the two bank branches.
b. Construct the box-and-whisker plot and describe the shape of the distribution of the two bank branches.
c. What similarities and differences are there in the distribution of the waiting time at the two bank branches?
3.5 COVARIANCE AND THE COEFFICIENT OF CORRELATION

LEARNING OBJECTIVE 4: Calculate and interpret the covariance and the coefficient of correlation for bivariate data

In Section 2.5, scatter diagrams are used to examine the relationship between two numerical variables (bivariate data). In this section, the covariance and the coefficient of correlation are introduced to measure the strength of the linear relationship between two numerical variables.

Covariance

The covariance is a measure of the strength and direction of the linear relationship between two numerical variables (X and Y). A positive value indicates a positive linear relationship between the two variables and a negative value indicates a negative relationship. A value of zero indicates that there is no linear relationship between the variables. A relationship that is linear can be graphed by a straight line, sloping upwards if positive and downwards if negative. Equation 3.18a defines the sample covariance.

covariance: Measure of the strength of the linear relationship between two numerical variables.

sample covariance: Covariance calculated from sample data.

THE SAMPLE COVARIANCE – DEFINITION FORMULA

$$\operatorname{cov}(X, Y) = \frac{SS_{XY}}{n-1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \qquad (3.18a)$$

where
$\bar{X}$ = sample mean of variable X
$\bar{Y}$ = sample mean of variable Y
$n$ = number of data points $(X_i, Y_i)$
$X_i$ = ith value of the independent variable X
$Y_i$ = ith value of the dependent variable Y, which corresponds to $X_i$
$SS_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ = sum of the squares for X and Y

As for the sample variance and standard deviation, we can use algebra to obtain alternative calculation formulas.
THE SAMPLE COVARIANCE – CALCULATION FORMULA

$$\operatorname{cov}(X, Y) = \frac{SS_{XY}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1} \qquad (3.18b)$$

where $\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \dots + X_n Y_n$ = sum of the product of $X_i Y_i$ values

Use either calculation formula.

EXAMPLE 3.19 CALCULATING THE SAMPLE COVARIANCE FOR DISCRETIONARY INCOME AND EXPENDITURE

The council in the opening scenario is also interested in the discretionary, or disposable, income and corresponding expenditure of residents within the region. To explore this, Kai obtains the following data on discretionary weekly income and expenditure from 10 randomly selected residents of the region. Calculate the sample covariance for discretionary weekly income and expenditure.

Discretionary income $:       400  815  550  400  250  300  375  380  425  600
Discretionary expenditure $:  350  650  525  370  250  295  330  350  415  460

SOLUTION

Kai expects that discretionary expenditure is related to discretionary income, so defines Discretionary Income $ as the independent variable (X) and Discretionary Expenditure $ as the dependent variable (Y).

Calculate:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{4{,}495}{10} = 449.50$$

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{3{,}995}{10} = 399.50$$

$$\sum_{i=1}^{10} X_i Y_i = (400 \times 350) + (815 \times 650) + \dots + (600 \times 460) = 1{,}966{,}625$$

Then, using Equation 3.18b, we obtain:

$$SS_{XY} = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = 1{,}966{,}625 - (10 \times 449.50 \times 399.50) = 170{,}872.50$$

$$\operatorname{cov}(X, Y) = \frac{SS_{XY}}{n-1} = \frac{170{,}872.50}{9} = 18{,}985.833\ldots$$

As the covariance is positive, Kai can conclude that there is a positive linear relationship between discretionary income and expenditure.
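The arithmetic in Example 3.19 can be checked with a short Python sketch. The function names below are introduced here for illustration only, and the data are paired in the order they are listed in the example (an assumption that is consistent with the stated sums of 4,495, 3,995 and 1,966,625). The two functions follow the definition form (Equation 3.18a) and the first calculation form (Equation 3.18b), so they should agree.

```python
# Data from Example 3.19, one (income, expenditure) pair per resident.
income = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]       # X
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]  # Y


def covariance_definition(x, y):
    """Equation 3.18a: sum of (Xi - Xbar)(Yi - Ybar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / (n - 1)


def covariance_calculation(x, y):
    """Equation 3.18b: (sum of XiYi - n * Xbar * Ybar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (sum_xy - n * x_bar * y_bar) / (n - 1)


cov_a = covariance_definition(income, expenditure)
cov_b = covariance_calculation(income, expenditure)
# Both values should be close to the 18,985.833... found in the solution.
print(round(cov_a, 3), round(cov_b, 3))
```

The definition form subtracts the means before multiplying, which tends to be the safer choice numerically when the values are large relative to their spread; the calculation form is usually quicker by hand.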
As covariance can have any value, it is difficult to use it as a measure of the relative strength of a linear relationship. A better and related measure of the relative strength of a linear relationship is the coefficient of correlation.

Coefficient of Correlation

The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables. The values of the coefficient of correlation range from −1 for a perfect negative linear correlation to +1 for a perfect positive linear correlation. Perfect means that, if the points are plotted in a scatter diagram, all the points will lie in a straight line. When dealing with population data for two numerical variables, the Greek letter ρ (rho) is used as the symbol for the coefficient of correlation. Figure 3.7 illustrates three different types of association between two variables.

coefficient of correlation (or correlation coefficient): Measure of the relative strength of the linear relationship between two numerical variables.

[Figure 3.7 Types of association between variables. Panel A: Perfect negative correlation (r = −1); Panel B: No correlation (r = 0); Panel C: Perfect positive correlation (r = +1)]

Panel A of Figure 3.7 illustrates a perfect negative linear relationship between X and Y, where the coefficient of correlation ρ equals −1. Panel B shows a situation in which there is no relationship between X and Y. In this case, the coefficient of correlation ρ equals 0. Panel C illustrates a perfect positive linear relationship where ρ equals +1.

With sample data, the sample coefficient of correlation r can be calculated. Figure 3.8 gives the scatter diagrams with their respective sample coefficients of correlation r for six data sets, each of which contains 100 values of X and Y.
sample coefficient of correlation: Coefficient of correlation calculated from sample data.

In panel A of Figure 3.8 the coefficient of correlation r is −0.9. You can see that small values of X tend to be paired with large values of Y. Likewise, large values of X tend to be paired with small values of Y. As the data are not all in a straight line, the linear relationship between X and Y is strong but not perfect. The data in panel B have a coefficient of correlation equal to −0.6, and the small values of X tend to be paired with large values of Y. However, as the data points are more scattered in panel B, the linear relationship between X and Y in panel B is not as strong as that in panel A. Thus, the coefficient of correlation in panel B, while still negative (indicating a negative relationship), is closer to 0 than the correlation coefficient in panel A. In panel C the negative linear relationship between X and Y is very weak, r = −0.3, and there is only a slight tendency for the small values of X to be paired with the larger values of Y. Panels D to F depict data sets that have positive coefficients of correlation, hence positive linear relationships, where small values of X tend to be paired with small values of Y, and the large values of X tend to be paired with large values of Y.

In this discussion of Figure 3.8, the relationships are deliberately described as tendencies and not as cause-and-effect. This wording is intentional. Correlation alone cannot prove that there is a causal effect – that is, that the change in the value of one variable caused the change in the other variable. A strong correlation can be produced simply by chance, by the effect of a third variable not considered in the calculation of the coefficient of correlation, or by a cause-and-effect relationship.
You would need to perform additional analysis to determine which of these three situations actually produced the correlation. Therefore, you can say that causation implies correlation, but correlation alone does not imply causation.

Equation 3.19 defines the sample coefficient of correlation r and Example 3.20 illustrates its use.

THE SAMPLE COEFFICIENT OF CORRELATION

The sample coefficient of correlation is sample covariance divided by the sample standard deviations of X and Y:

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$

where $S_X$, $S_Y$ are the sample standard deviations for variables X and Y, defined by Equation 3.10. As

$$\operatorname{cov}(X, Y) = \frac{SS_{XY}}{n-1}, \quad S_X = \sqrt{\frac{SS_X}{n-1}} \quad \text{and} \quad S_Y = \sqrt{\frac{SS_Y}{n-1}}$$

the sample correlation coefficient can also be defined as:

$$r = \frac{SS_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} \qquad (3.19)$$

where the formulas for the respective sum of squares are:

$$SS_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$$

$$SS_X = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$

$$SS_Y = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$

[Figure 3.8 Six scatter diagrams and their sample coefficients of correlation, r. Panel A: r = −0.9; Panel B: r = −0.6; Panel C: r = −0.3; Panel D: r = 0.3; Panel E: r = 0.6; Panel F: r = 0.9]

EXAMPLE 3.20 CALCULATING THE SAMPLE CORRELATION COEFFICIENT FOR DISCRETIONARY INCOME AND EXPENDITURE

Kai is exploring the relationship between discretionary, or disposable, income and the corresponding expenditure of residents within the region. From the data in Example 3.19, calculate and interpret the sample correlation coefficient.

SOLUTION

From Example 3.19: $\bar{X} = 449.50$, $\bar{Y} = 399.50$ and $SS_{XY} = 170{,}872.5$

Calculate:

$$\sum_{i=1}^{10} X_i^2 = 400^2 + 815^2 + \dots + 600^2 = 2{,}264{,}875$$

$$\sum_{i=1}^{10} Y_i^2 = 350^2 + 650^2 + \dots + 460^2 = 1{,}722{,}275$$

$$SS_X = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 2{,}264{,}875 - 10 \times 449.5^2 = 244{,}372.5$$

$$SS_Y = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = 1{,}722{,}275 - 10 \times 399.5^2 = 126{,}272.5$$

Therefore, using Equation 3.19:

$$r = \frac{SS_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} = \frac{170{,}872.5}{\sqrt{244{,}372.5}\,\sqrt{126{,}272.5}} = 0.9727\ldots$$

As r = 0.97 is very close to 1, Kai can conclude that there is a very strong positive linear relationship between discretionary income and expenditure. As it is known that there is a relationship between a person’s income and their expenditure, Kai can conclude that, if a resident’s discretionary income increases, their expenditure is also highly likely to increase.

In summary, the coefficient of correlation is a measure of the strength of the linear relationship, or association, between two numerical variables. The closer the coefficient of correlation is to +1 or −1, the stronger the linear relationship. When the coefficient of correlation is near 0, there is little or no linear relationship between the two numerical variables. The sign of the coefficient of correlation indicates whether the data are positively correlated (i.e. the larger values of X tend to be paired with the larger values of Y) or negatively correlated (i.e. the larger values of X tend to be paired with the smaller values of Y). The existence of a strong correlation does not imply a causation effect. It only indicates the tendencies present in the data.

Problems for Section 3.5

LEARNING THE BASICS

3.36 The data are from a sample of n = 11 items:
X:   7   5   8   3   6  10  12   4   9  15  18
Y:  21  15  24   9  18  30  36  12  27  45  54
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. How strong is the relationship between X and Y? Explain.

APPLYING THE CONCEPTS

Problems 3.37 to 3.40 can be solved manually or by using Microsoft Excel.

3.37 You are interested in the relationship between the number of people in a sales team and the sales generated, in a certain industry.
Number of staff:  26  18  15  28  19  23  27  23  17  24
Sales:            45  38  35  77  33  44  54  55  32  47
These data show gross sales, measured in millions of dollars, and the number of people on a sales team.
a. Calculate the covariance and coefficient of correlation.
b. What conclusions can you reach about the relationship between the number of people in a sales team and the sales generated?

3.38 Use the data in problem 2.18 to investigate the relationship between petrol and diesel prices in New South Wales and in Queensland. < FUEL_2017 >
a. Calculate the covariance and coefficient of correlation for diesel and petrol prices in New South Wales.
b. Calculate the covariance and coefficient of correlation for diesel and petrol prices in Queensland.
c. What conclusions can you reach about the relationship between petrol and diesel prices in New South Wales and in Queensland?

3.39 A local council is interested in the relationship between the size of local restaurants, measured as number of seats, and their annual water usage, in kilolitres. From a random sample of 10 local restaurants the following information was obtained. < WATER2 >
Number of seats X:                    60   45   54   68   70   55   67   45   64   42
Annual water usage Y (kilolitres):   880  550  720  725  932  922  950  560  726  405
a. Construct a scatter diagram for the data and comment on any apparent relationship between restaurant size and annual water usage.
b. Calculate the sample covariance and coefficient of correlation.
Are these values what you expected from the scatter diagram?
c. What conclusions can you reach about the relationship between restaurant size and annual water usage?

3.40 The data file < MILK > gives nutrition content (number of calories and total fat, in grams) per 250 mL of a random sample of 20 fresh milks available in Australia.
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. Which do you think is more valuable in expressing the relationship between calories and fat content – the covariance or the coefficient of correlation? Explain.
d. What conclusions can you reach about the relationship between calories and fat content?

3.6 PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND ETHICAL ISSUES

This chapter introduces sample statistics and population parameters that describe the centre, variation and shape of a distribution of a single numerical variable and also the association between two numerical variables. The next step is analysis and interpretation of the calculated statistics. While your analysis is objective, your interpretation is subjective. Be careful to avoid errors that may arise either in the objectivity of your analysis or in the subjectivity of your interpretation.

Analysis of expenditure data in the opening scenario is objective and reveals several impartial findings. Objectivity in data analysis means reporting the most appropriate descriptive summary measures for a given data set. Now that you have read this chapter and become familiar with various descriptive summary measures and their strengths and weaknesses, how should you proceed with an objective analysis? For example, from Figure 2.9 the amount spent during the festival by intrastate visitors is positively skewed, so shouldn’t both the median and the mean be reported? Also, don’t the standard deviation and/or interquartile range provide more information about the variation of amount spent than the range?

On the other hand, data interpretation is subjective.
Different people form different conclusions when interpreting analytical findings. Everyone sees the world from different perspectives. Thus, because data interpretation is subjective, you must attempt to present your findings in a fair, neutral and transparent manner.

Ethical Issues

Ethical issues are vitally important to all data analysis. As a daily consumer of information, you need to question what you read in newspapers and magazines, what you hear on the radio or television, and what you see online. Over time, much scepticism has been expressed about the purpose, the focus and the objectivity of published studies. Perhaps no comment on this topic is more telling than a quip often attributed to the famous nineteenth-century British statesman Benjamin Disraeli: ‘There are three kinds of lies: lies, damned lies, and statistics’.

Ethical considerations arise when you are deciding what results to include in a report. You should document both good and bad results. In addition, when making oral presentations and compiling written reports, you need to give results in a fair, objective and neutral manner.

Unethical behaviour occurs when you wilfully choose an inappropriate summary measure (e.g. the mean for a very skewed set of data) to distort the facts in order to support a particular position. In addition, unethical behaviour occurs when you selectively fail to report pertinent findings because it would be detrimental to the support of a particular position.

To illustrate this selective use of statistics, in 2009 an Australian newspaper, under the heading ‘Nation of gamblers’, stated:

Australian and New Zealand gamblers are the worst in the world, betting more money online than those of any other country…

According to the report from which these statistics were taken (R. T. Wood and R. J. Williams, ‘Internet gambling: Prevalence, patterns, problems, and policy options’, 5 January 2009), the mean net monthly gambling expenditure of the 19 Australian and New Zealand Internet gamblers in the sample (from more than 12,000 from 105 countries) was US$300.32, the second highest in the survey. However, the report gave the median net monthly gambling expenditure of this group as US$9.00 – the lowest.

3 Assess your progress

Summary

This chapter introduced numerical descriptive measures. Together with Chapter 2, it covered descriptive statistics – how data are presented in tables and charts and then summarised, described, analysed and interpreted. When dealing with the opening scenario data, we were able to present useful information through the use of histograms and other graphical methods. Then characteristics of the expenditure data such as central tendency, variability and shape were explored, using numerical descriptive measures including the mean, median, quartiles, range and standard deviation. The covariance and coefficient of correlation were introduced to describe the relationship between two numerical variables.

In the next chapter, the basic principles of probability are introduced to bridge the gap between descriptive statistics and inferential statistics.
Key formulas

Sample mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$

Median
$$\text{Median} = \frac{n+1}{2}\ \text{ranked value} \qquad (3.2)$$

First quartile Q1
$$Q_1 = \frac{n+1}{4}\ \text{ranked value} \qquad (3.3)$$

Third quartile Q3
$$Q_3 = \frac{3(n+1)}{4}\ \text{ranked value} \qquad (3.4)$$

Geometric mean
$$\bar{X}_G = (X_1 \times X_2 \times \dots \times X_n)^{1/n} \qquad (3.5)$$

Geometric mean rate of return
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \dots \times (1 + R_n)]^{1/n} - 1 \qquad (3.6)$$

Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \qquad (3.7)$$

Interquartile range
$$\text{Interquartile range} = Q_3 - Q_1 \qquad (3.8)$$

Sample variance
$$S^2 = \frac{SS_X}{n-1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} \qquad (3.9a)\ \text{(definition)}$$
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1} \qquad (3.9b)\ \text{(calculation)}$$

Sample standard deviation
$$S = \sqrt{S^2} \qquad (3.10)$$

Coefficient of variation
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \qquad (3.11)$$

Z score
$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$

Population mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

Population variance
$$\sigma^2 = \frac{SS_X}{N} = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N} \qquad (3.14a)\ \text{(definition)}$$
$$\sigma^2 = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N} \qquad (3.14b)\ \text{(calculation)}$$

Population standard deviation
$$\sigma = \sqrt{\sigma^2} \qquad (3.15)$$

Approximating the mean from a frequency distribution
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

Approximating the standard deviation from a frequency distribution
$$S^2 = \frac{\sum_{j=1}^{c}(m_j - \bar{X})^2 f_j}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \frac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$$
where
$$S = \sqrt{S^2} \qquad (3.17)$$

Sample covariance
$$\operatorname{cov}(X, Y) = \frac{SS_{XY}}{n-1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \qquad (3.18a)\ \text{(definition)}$$
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1} \qquad (3.18b)\ \text{(calculation)}$$

Sample coefficient of correlation
$$r = \frac{SS_{XY}}{\sqrt{SS_X}\,\sqrt{SS_Y}} \qquad (3.19)$$
where
$$SS_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$$
$$SS_X = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$
$$SS_Y = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$
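As a rough check on the sums of squares and Equation 3.19, the following Python sketch recomputes the figures from Examples 3.19 and 3.20. The function names are introduced here for illustration and are not part of the textbook or of any Excel workbook it refers to; the data are the ten income and expenditure pairs from Example 3.19, paired in the order listed there.

```python
from math import sqrt

# Data from Example 3.19 (X = discretionary income, Y = discretionary expenditure).
income = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]


def sums_of_squares(x, y):
    """Return (SSXY, SSX, SSY) using the 'minus n times the means' forms."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_y = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return ss_xy, ss_x, ss_y


def sample_correlation(x, y):
    """Equation 3.19: r = SSXY / (sqrt(SSX) * sqrt(SSY))."""
    ss_xy, ss_x, ss_y = sums_of_squares(x, y)
    if ss_x == 0 or ss_y == 0:
        # r is undefined when either variable has no variation.
        raise ValueError("each variable must take at least two distinct values")
    return ss_xy / sqrt(ss_x * ss_y)  # same as dividing by sqrt(SSX) * sqrt(SSY)


ss_xy, ss_x, ss_y = sums_of_squares(income, expenditure)
r = sample_correlation(income, expenditure)
# Expect values close to 170,872.5, 244,372.5, 126,272.5 and r of about 0.9727.
print(ss_xy, ss_x, ss_y, round(r, 4))
```

Working from the sums of squares avoids computing the covariance and the two standard deviations separately, because the common factor n − 1 cancels.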
Key terms

arithmetic mean (mean)
bell-shaped
box-and-whisker plot
central tendency
Chebyshev rule
coefficient of correlation
coefficient of variation
covariance
empirical rule
extreme value (outlier)
first (lower) quartile
five-number summary
geometric mean
interquartile range
median
mode
population mean
population standard deviation
population variance
quartiles
range
resistant measures
sample coefficient of correlation
sample covariance
sample mean
sample standard deviation
sample variance
second quartile
shape
skewed
spread (dispersion)
standard deviation
sum of squares (SS)
symmetrical
third (upper) quartile
variance
variation
Z scores

Chapter review problems

CHECKING YOUR UNDERSTANDING

3.41 What is meant by a property of central tendency?
3.42 What are the differences between the mean, median and mode, and what are the advantages and disadvantages of each?
3.43 How do you interpret the first quartile, median and third quartile?
3.44 What is meant by the property of variation?
3.45 What does the Z score measure?
3.46 What are the differences between the various measures of variation such as the range, interquartile range, variance, standard deviation and coefficient of variation, and what are the advantages and disadvantages of each?
3.47 How do the empirical rule and the Chebyshev rule differ?

APPLYING THE CONCEPTS

You can solve problems 3.48 to 3.56 manually or by using Microsoft Excel.

3.48 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags. If the bags are underfilled, two problems arise. First, customers may not be able to brew the tea as strong as they wish. Second, the company may be in violation of the truth-in-labelling laws.
For this product, the label weight on the package indicates that, on average, there are 5.5 grams of tea in a bag. If the average amount of tea in a bag exceeds the label weight, the company is giving away product. Getting an exact amount of tea into a bag is problematic because of variation in the temperature and humidity inside the factory, differences in the density of the tea, and the extremely fast filling operation of the machine (approximately 170 bags a minute). The table below provides the weight in grams of a sample of 50 tea-bags produced within an hour by a single machine. < TEABAGS >

5.65 5.57 5.47 5.77 5.61 5.44 5.40 5.40 5.57 5.45
5.42 5.53 5.47 5.42 5.44 5.40 5.54 5.61 5.58 5.25
5.53 5.55 5.53 5.58 5.56 5.34 5.62 5.32 5.50 5.63
5.54 5.56 5.67 5.32 5.50 5.45 5.46 5.29 5.50 5.57
5.52 5.44 5.49 5.53 5.67 5.41 5.51 5.55 5.58 5.36

a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
c. Interpret the measures of central tendency and variation within the context of this problem. Why should the company producing the tea-bags be concerned about the central tendency and variation?
d. Construct a box-and-whisker plot. Are the data skewed? If so, how?
e. Is the company meeting the requirement set forth on the label that, on average, there are 5.5 grams of tea in a bag? If you were in charge of this process, what changes, if any, would you try to make concerning the distribution of weights in the individual bags?

3.49 Use the data in problems 2.30 and 2.70 to investigate the distribution of petrol and diesel prices in New South Wales and Queensland. < FUEL_MARCH_2017 >
a. Calculate the mean, median, first quartile and third quartile of New South Wales and Queensland petrol and diesel prices. What conclusions can you draw?
b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation of New South Wales and Queensland petrol and diesel prices. What conclusions can you draw?
c. Construct box-and-whisker plots for the data. Are the data skewed? What conclusions can you draw?
d. Calculate the covariance and coefficient of correlation for diesel and petrol prices in New South Wales and Queensland. What conclusions can you reach about the relationship between petrol and diesel prices in New South Wales and Queensland?

3.50 The data file < GRADES > contains a sample of student marks and grades from a population of students enrolled in a statistics unit.
a. Calculate the mean, median, range and standard deviation for total marks. Interpret these measures of central tendency and variability.
b. List the five-number summary for total marks.
c. For total marks, construct and interpret a box-and-whisker plot.
d. Ignoring students who did not attempt the final exam, calculate the covariance and coefficient of correlation for semester and exam marks.
e. What conclusions can you reach about the relationship between a student’s semester and exam marks?

3.51 The file < AGE > contains the ages and gender of the Australian population at 30 June 2013 and 2016.
a. Calculate the approximate mean age and the approximate standard deviation of age for males and females at 30 June 2013 and 2016.
b. What conclusions can you draw about male and female ages in 2013 and 2016?

3.52 In many manufacturing processes the term ‘work-in-process’ (WIP) is used. In a book-manufacturing plant the WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets and bound.
The following data represent samples of 20 books at each of two production plants and the processing time (operationally defined as the time in days from when the books came off the press to when they were packed in cartons) for these jobs. < WIP >

Plant A
5.62 5.29 16.25 10.92 11.46 21.62 11.62 7.29 7.50 7.96
4.42 10.50 8.45 7.58 8.58 9.29 5.41 11.42 7.54 8.92

Plant B
9.54 11.46 16.62 12.62 25.75 15.41 14.29 13.13 13.71 10.04
5.75 12.46 9.17 13.21 6.00 2.33 14.25 5.37 6.25 9.71

For each of the two plants:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. On the basis of the results of (a) to (c), are there any differences between the two plants? Explain.

3.53 Water_Wise is analysing water usage for a block of one-bedroom flats. They collect data on daily water consumption in kilolitres (kl) for 133 consecutive days. < WATER > Explore the daily water usage in this block of flats by:
a. plotting the data graphically
b. calculating the summary statistics
c. commenting on the graphs and the summary statistics.

3.54 In this problem you are asked to select an appropriate value for the standard deviation, based on your knowledge of how these variables vary.
a. From a sample of 30 petrol stations, the mean price of E10 petrol is $1.56 per litre. Which of the following is a reasonable value for the corresponding standard deviation of prices: $0.03, $3.00 or $30.00?
b. The mean starting salary of a sample of 50 recent graduates is $65,200. Which of the following is a reasonable value for the standard deviation of starting salaries: $5, $50 or $5,000?
c. The mean weight of a sample of 100 male university students is 70 kg. Which of the following is a reasonable value for the standard deviation of weights: 0.5 kg, 10 kg or 50 kg?

3.55 The following table gives the annual increase in the Consumer Price Index (CPI), a measure of inflation in Australia and New Zealand.

CPI % annual change
Year to     Australia   New Zealand
Dec 2012    2.2         0.9
Dec 2013    2.7         1.6
Dec 2014    1.7         0.8
Dec 2015    1.7         0.1
Dec 2016    1.5         1.3
Data obtained from Reserve Bank of Australia <www.rba.gov.au> and Reserve Bank of New Zealand <www.rbnz.govt.nz> accessed Jun 2017

For each country:
a. Calculate the geometric mean inflation rate from 2012 to 2016.
b. What conclusions can you draw about the inflation rate in New Zealand and Australia?

3.56 Naturally Soap (see problem 3.22) is interested in exploring the relationship between the price and the quantity sold at each market. < NATURALLY_SOAP > For the Sunday morning and Wednesday evening markets, calculate and interpret the coefficient of correlation between weekly quantity sold and price.

3.57 You are planning to study for your statistics examination with a group of classmates, one of whom you particularly want to impress. This individual has volunteered to use Microsoft Excel to get the needed summary information, tables and charts for a data set containing several numerical and categorical variables assigned by your lecturer for study purposes. This person comes over to you with the printout and exclaims, ‘I’ve got it all – the means, the medians, the standard deviations, the box-and-whisker plots, the pie charts – for all our variables. The problem is, some of the output looks weird – like the box-and-whisker plots for gender and for major, and the pie charts for grade point average and for height. Also, I can’t understand why Professor Krehbiel said we can’t get the descriptive statistics for some of the variables – I got it for everything!
See, the mean for height is 1.78, the grade point average is 2.76, the mean for gender is 1.50 and the mean for major is 4.33.’ What is your reply? REPORT WRITING EXERCISE 3.58 The data in the file < BEER > give the alcohol and calorie content of a sample of 95 beers, together with country of origin and type. Your task is to write a report based on a complete descriptive evaluation of each of the numerical variables – calories and alcohol content – regardless of type of product or origin. Then perform a similar evaluation comparing each of these numerical variables based on type of product – regular, light or non-alcoholic beers. In addition, perform a similar evaluation comparing and contrasting each of these numerical variables based on the origins of the beers – those of a selected country or continent versus those from elsewhere. Appended to your report should be all appropriate tables, charts and numerical descriptive measures. Continuing cases Tasman University Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey. a For a selection of numerical variables in the BBus student survey, calculate appropriate descriptive statistics. b For a selection of numerical variables in the MBA student survey, calculate appropriate descriptive statistics. c Write a report summarising your conclusions. 
As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. These data are stored in < REAL_ESTATE >.
a For a selection of numerical variables for regional city 1 state A, calculate appropriate descriptive statistics.
b For a selection of numerical variables for coastal city 1 state A, calculate appropriate descriptive statistics.
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.

Chapter 3 Excel Guide

EG3.1 MEASURES OF CENTRAL TENDENCY, VARIATION AND SHAPE

CENTRAL TENDENCY

The Mean, Median and Mode
Key technique Use the AVERAGE(variable cell range), MEDIAN(variable cell range), and MODE(variable cell range) functions to calculate these measures.
Example Calculate the mean, median and mode for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary. For the example, open the Times file. Select PHStat ➔ Descriptive Statistics ➔ Descriptive Summary. In the Descriptive Summary dialog box (shown in Figure EG3.1):
1. Enter or highlight cells A1:A11 as the Raw Data Cell Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title and click OK.
Figure EG3.1 Descriptive Summary dialog box
PHStat inserts a new worksheet that contains various measures of central tendency, variation, and shape discussed in Section 3.1. This worksheet is similar to the CompleteStatistics worksheet of the Descriptive workbook.
In-depth Excel Use the CentralTendency worksheet of the Descriptive workbook as a model.
For the example, open the Times file and insert a new worksheet (right-click on tab ➔ Insert ➔ Worksheet) and:
1. Enter a title in cell A1.
2. Enter Get-Ready Times in cell B3, Mean in cell A4, Median in cell A5, and Mode in cell A6.
3. Enter the formula =AVERAGE(DATA!A:A) in cell B4, the formula =MEDIAN(DATA!A:A) in cell B5, and the formula =MODE(DATA!A:A) in cell B6.
For these functions, the variable cell range includes the name of the DATA worksheet because the data being summarised appears on the separate DATA worksheet. If you suspect that there may be more than one mode, highlight several cells, say B7:G7, enter =TRANSPOSE(MODE.MULTI(DATA!A:A)) and then press Ctrl+Shift+Enter. See the Central_Tendency workbook, which gives the two modes for the times to get ready.
To calculate the mean, median and mode for another set of data, paste the data into column A of the DATA worksheet, overwriting the existing getting-ready times.
Analysis ToolPak Use Descriptive Statistics. For the example, open the Times file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Descriptive Statistics from the Analysis Tools list and then click OK.
In the Descriptive Statistics dialog box (shown in Figure EG3.2):
3. Enter or highlight cells A1:A11 as the Input Range. Click Columns and check Labels in first row.
4. Click New Worksheet Ply and check Summary statistics, Kth Largest, and Kth Smallest.
5. Click OK.
Figure EG3.2 Descriptive Statistics dialog box
The ToolPak inserts a new worksheet that contains various measures of central tendency, variation, and shape discussed in Section 3.1.

Quartiles
Key technique Use the MEDIAN, COUNT, SMALL, INT, FLOOR, and CEILING functions in combination with the IF decision-making function to calculate the quartiles.
To apply the rules for calculating quartiles on page 97, avoid using any of the Excel quartile functions to calculate the first and third quartiles.
Example Calculate the quartiles for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Boxplot (discussed on page 137).
In-depth Excel Use the COMPUTE worksheet of the Quartiles workbook as a model. For the example, the COMPUTE worksheet already calculates the quartiles for the getting-ready times. To calculate the quartiles for another set of data, paste the data into column A of the DATA worksheet, overwriting the existing getting-ready times. Open the COMPUTE_FORMULAS worksheet to examine the formulas. The COMPARE worksheet compares the quartiles obtained using Section 3.1 rules for quartiles and the Excel quartile functions: QUARTILE(array, quart), QUARTILE.INC(array, quart) and QUARTILE.EXC(array, quart).

The Geometric Mean
Key technique Use the GEOMEAN((1 + (R1)), (1 + (R2)), ..., (1 + (Rn))) - 1 function to calculate the geometric mean rate of return.
Example Calculate the geometric mean rate of return in the NZX-50 Index for the five years as shown in Example 3.7 on page 99.
In-depth Excel Enter the formula =GEOMEAN(1+0.24,1+0.16,1+0.18,1+0.14,1+0.10)-1 in any cell.

VARIATION AND SHAPE

The Range
Key technique Use the MIN(variable cell range) and MAX(variable cell range) functions to help calculate the range.
Example Calculate the range for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Range worksheet of the Descriptive workbook as a model. For the example, open the worksheet implemented for the example in the In-depth Excel ‘The Mean, Median, and Mode’ instructions. Enter Minimum in cell A7, Maximum in cell A8, and Range in cell A9. Enter the formula =MIN(DATA!A:A) in cell B7, the formula =MAX(DATA!A:A) in cell B8, and the formula =B8 - B7 in cell B9.
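As a cross-check outside Excel, the geometric mean rate of return described under ‘The Geometric Mean’ can be sketched in Python. This is only an illustrative sketch: the function name geometric_mean_rate_of_return is hypothetical, and the five rates are those that appear in the GEOMEAN formula above, entered as decimals.

```python
import math


def geometric_mean_rate_of_return(rates):
    """Geometric mean rate of return for periodic rates given as decimals
    (0.24 means 24%). Hypothetical helper, for illustration only."""
    growth_factors = [1 + r for r in rates]       # (1 + R1), (1 + R2), ...
    product = math.prod(growth_factors)           # compound growth over all periods
    return product ** (1 / len(growth_factors)) - 1


# The five annual rates used in the GEOMEAN formula in this guide.
nzx50_rates = [0.24, 0.16, 0.18, 0.14, 0.10]

geo_rate = geometric_mean_rate_of_return(nzx50_rates)
arith_rate = sum(nzx50_rates) / len(nzx50_rates)

print(f"Geometric mean rate of return:  {geo_rate:.4f}")
print(f"Arithmetic mean rate of return: {arith_rate:.4f}")
```

For these rates the geometric mean comes out at roughly 16.3% per year, slightly below the arithmetic mean of 16.4%, which is the usual relationship when the rates differ from year to year.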
Analysis ToolPak Use Descriptive Statistics as discussed earlier.

The Interquartile Range
Key technique Use a formula to subtract the first quartile from the third quartile.
Example Calculate the interquartile range for the sample of getting-ready times introduced in Section 3.1.
In-depth Excel Use the COMPUTE worksheet of the Quartiles workbook (introduced earlier) as a model. For the example, the interquartile range is already calculated in cell B19 using the formula =B18 - B16.

The Variance, Standard Deviation, Coefficient of Variation and Z Scores
Key technique Use the VAR.S(variable cell range) and STDEV.S(variable cell range) functions to calculate the sample variance and the sample standard deviation, respectively. Use the AVERAGE and STDEV.S functions for the coefficient of variation. Use the STANDARDIZE(value, mean, standard deviation) function to calculate Z scores.
Example Calculate the variance, standard deviation, coefficient of variation, and Z scores for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Variation and ZScores worksheets of the Descriptive workbook as models. For the example, open the worksheet implemented for the earlier examples. Enter Variance in cell A10, Standard Deviation in cell A11 and Coeff. of Variation in cell A12. Enter the formula =VAR.S(DATA!A:A) in cell B10, the formula =STDEV.S(DATA!A:A) in cell B11, and the formula =B11/AVERAGE(DATA!A:A) in cell B12. If you previously entered the formula for the mean in cell B4 using the In-depth Excel instructions for the mean, enter the simpler formula =B11/B4 in cell B12. Right-click cell B12 and click Format Cells in the shortcut menu.
In the Number tab of the Format Cells dialog box, click Percentage in the Category list, enter 2 as the Decimal places, and click OK.
To calculate the Z scores, copy the DATA worksheet. In the new, copied worksheet, enter Z Score in cell B1. Enter the formula =STANDARDIZE(A2, AVERAGE(A:A), STDEV.S(A:A)) in cell B2 and copy the formula down to row 11. Then format cells B2 to B11 to the required number of decimal places.
If you use an Excel version older than Excel 2010, use VAR and STDEV instead of VAR.S and STDEV.S.
Analysis ToolPak Use Descriptive Statistics as discussed earlier. This procedure does not calculate Z scores.

Shape: Skewness and Kurtosis
Key technique Use the SKEW(variable cell range) and the KURT(variable cell range) functions to calculate these measures.
Example Calculate the skewness and kurtosis for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Shape worksheet of the Descriptive workbook as a model. For the example, open the worksheet implemented for the earlier examples. Enter Skewness in cell A13 and Kurtosis in cell A14. Enter the formula =SKEW(DATA!A:A) in cell B13 and the formula =KURT(DATA!A:A) in cell B14. Then format cells B13 and B14 to four decimal places.
Analysis ToolPak Use Descriptive Statistics as discussed earlier.

EG3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

The Population Mean, Population Variance and Population Standard Deviation
Key technique Use AVERAGE(variable cell range), VAR.P(variable cell range), and STDEV.P(variable cell range) to calculate these measures.
Example Calculate the population mean, population variance and population standard deviation for the road fatality population data of Table 3.3 on page 113.
In-depth Excel Use the Parameters workbook as a model. For the example, the COMPUTE worksheet of the Parameters workbook already calculates the three population parameters for the road fatality data.
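The difference between the sample functions (VAR.S, STDEV.S) and the population functions (VAR.P, STDEV.P) used in this guide can be illustrated outside Excel. The Python sketch below uses the standard library statistics module; the five data values are made up for illustration and are not the getting-ready times or the road fatality data.

```python
import statistics

# Hypothetical values, for illustration only (not data from the textbook).
values = [4.0, 7.0, 9.0, 10.0, 15.0]

mean = statistics.mean(values)

# Sample measures divide the sum of squared deviations by n - 1
# (the counterpart of VAR.S and STDEV.S).
sample_var = statistics.variance(values)
sample_sd = statistics.stdev(values)

# Population measures divide by n (the counterpart of VAR.P and STDEV.P).
pop_var = statistics.pvariance(values)
pop_sd = statistics.pstdev(values)

# Coefficient of variation: sample standard deviation relative to the mean.
cv = sample_sd / mean

# Z scores: distance of each value from the mean in sample standard deviations
# (the counterpart of STANDARDIZE with AVERAGE and STDEV.S).
z_scores = [(x - mean) / sample_sd for x in values]

print(f"mean={mean}, sample variance={sample_var}, population variance={pop_var}")
print(f"CV={cv:.2%}")
print("Z scores:", [round(z, 4) for z in z_scores])
```

Because the sample variance divides by n - 1 rather than n, it is always the larger of the two for the same data, and the Z scores always sum to zero.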
For other problems, paste your unsummarised data into column B of the DATA worksheet, overwriting the road fatality data. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.

The Empirical Rule and the Chebyshev Rule
Use the COMPUTE worksheet of the VE_Variability workbook to explore the effects of changing the mean and standard deviation on the ranges associated with ±1 standard deviation, ±2 standard deviations, and ±3 standard deviations from the mean. Change the mean in cell B4 and the standard deviation in cell B5 and then note the updated results in rows 9 to 11.

EG3.3 FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS
Key technique Plot a series of line segments on the same chart to construct a boxplot. (Excel chart types do not include boxplots.)
Example Calculate the five-number summary and construct the boxplots for festival expenditure by international visitors in Figure 3.5.
PHStat Use Boxplot. For the example, open the Festival file. Select PHStat ➔ Descriptive Statistics ➔ Boxplot. In the Boxplot dialog box (shown in Figure EG3.3):
1. Enter or highlight C2:C14 as the Raw Data Cell Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title, check Five-Number Summary, and click OK.
The boxplot appears on its own chart sheet, separate from the worksheet that contains the five-number summary.
In-depth Excel Use the worksheets of the Boxplot workbook as templates. For the example, use the PLOT_DATA worksheet, which already shows the five-number summary and boxplot for festival expenditure by international visitors.
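A five-number summary can also be sketched in Python. The quartile rules on page 97 are not reproduced in this guide, so the sketch below assumes a common textbook convention: Q1 is the (n + 1)/4 ranked value and Q3 is the 3(n + 1)/4 ranked value; a whole-number position selects that ranked value, a position ending in .5 averages the two neighbouring ranked values, and any other position is rounded to the nearest rank. The function names and the data are hypothetical.

```python
import math
import statistics


def ranked_quartile(sorted_values, position):
    """Select a quartile from sorted data given a 1-based ranked position.
    Assumed convention: whole position -> that value; .5 -> average of the
    two neighbours; otherwise round to the nearest rank."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    if fraction == 0.5:
        return (sorted_values[lower - 1] + sorted_values[lower]) / 2
    nearest = lower if fraction < 0.5 else lower + 1
    return sorted_values[nearest - 1]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) under the assumed rules."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    q1 = ranked_quartile(data, (n + 1) / 4)
    q3 = ranked_quartile(data, 3 * (n + 1) / 4)
    return data[0], q1, statistics.median(data), q3, data[-1]


# Hypothetical data, for illustration only.
values = [12, 7, 3, 9, 15, 21, 5, 18, 10, 14]
print(five_number_summary(values))
```

Results from this convention can differ from those of the Excel quartile functions, which interpolate between ranked values; that difference is the reason the guide advises against using those functions when applying the textbook rules.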
For other problems, use the PLOT_SUMMARY worksheet as the template if the five-number summary has already been determined; otherwise, paste your unsummarised data into column A of the DATA worksheet and use the PLOT_DATA worksheet as was done for the example. The worksheets creatively misuse Excel line-charting features to construct a boxplot.
Figure EG3.3 Boxplot dialog box

EG3.4 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION

The Covariance
Key technique Use the COVARIANCE.S(variable 1 cell range, variable 2 cell range) function to calculate this measure.
Example Calculate the sample covariance for discretionary income and expenditure data in Example 3.19.
In-depth Excel For the example, the discretionary income and expenditure data have already been placed in columns A and B of the DATA worksheet and the COMPUTE worksheet displays the calculated covariance in cell B9. For other problems, paste the data for two variables into columns A and B of the DATA worksheet, overwriting the discretionary income and expenditure data. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet that calculates the covariance without using the COVARIANCE.S function that was introduced in Excel 2010.

The Coefficient of Correlation
Key technique Use the CORREL(variable 1 cell range, variable 2 cell range) function to calculate this measure.
Example Calculate the coefficient of correlation for discretionary income and expenditure data in Example 3.19.
In-depth Excel Use the Correlation workbook as a model. For the example, the discretionary income and expenditure data have already been placed in columns A and B of the DATA worksheet and the COMPUTE worksheet displays the coefficient of correlation in cell B14.
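The sample covariance and the coefficient of correlation described in this section can be sketched in Python as a check on the Excel results. The sketch computes the covariance from deviations about the means and then divides it by the product of the two sample standard deviations. The function names and the paired values are hypothetical; they are not the discretionary income and expenditure data of Example 3.19.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of cross-products of deviations over n - 1
    (the counterpart of COVARIANCE.S)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross_products / (n - 1)


def coefficient_of_correlation(x, y):
    """Covariance divided by the product of the sample standard deviations
    (the counterpart of CORREL)."""
    sd_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)


# Hypothetical paired observations, for illustration only.
income = [1, 2, 3, 4, 5]
expenditure = [2, 4, 5, 4, 5]

cov = sample_covariance(income, expenditure)
r = coefficient_of_correlation(income, expenditure)
print(f"covariance = {cov:.4f}, correlation = {r:.4f}")
```

Unlike the covariance, whose size depends on the units of the two variables, the coefficient of correlation always lies between -1 and +1, which makes it easier to interpret.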
For other problems, paste the data for two variables into columns A and B of the DATA worksheet, overwriting the discretionary income and expenditure data. The COMPUTE worksheet uses the COVARIANCE.S function to calculate the covariance (see the previous section) and also the DEVSQ, COUNT, and SUMPRODUCT functions. Open the COMPUTE_FORMULAS worksheet to examine the use of all these functions.
For the covariance In-depth Excel instructions, use the Covariance workbook as a model.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

End of Part 1 problems

A.1 A sample of 500 shoppers was selected in a large metropolitan area to obtain consumer behaviour information. Among the questions asked was, ‘Do you enjoy shopping for clothing?’ The results are summarised in the following cross-classification table.

Enjoy shopping         Gender
for clothing     Male    Female    Total
Yes              136     224       360
No               104     36        140
Total            240     260       500

a. Construct contingency tables based on total percentages, row percentages and column percentages.
b. Construct a side-by-side bar chart of enjoy shopping for clothing based on gender.
c. What conclusions do you draw from these analyses?

A.2 One of the major measures of the quality of service provided by any organisation is the speed with which the organisation responds to customer complaints. A large family-owned department store selling furniture and flooring, including carpet, has undergone major expansion in the past few years. In particular, the flooring department has expanded from two installation crews to an installation supervisor, a measurer and 15 installation crews. During a recent year the company got 50 complaints about carpet installation. The following data represent the number of days between receipt of the complaint and resolution of the complaint. < FURNITURE >

54 11 12 13 33 5 19 4 10 68
35 126 165 5 137 110 32 27 31 110
29 4 27 29 28 52 152 61 29 30
2 35 26 22 123 94 25 36 81 31
1 26 74 26 14 20 27 5 13 23

a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, first quartile and third quartile.
e. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
f. Construct a box-and-whisker plot. Are the data skewed? If so, how?
g. On the basis of the results of (a) to (f), if you had to report to the manager on how long a customer should expect to wait to have a complaint resolved, what would you say? Explain.

A.3 The annual crediting rates (after tax and fees) on several managed superannuation investment funds between 2013 and 2017 are:

                       Historical crediting rate for year ending 30 June, %
Superannuation fund    2017    2016    2015    2014    2013
Conservative           5.5     8.7     9.0     11.3    12.3
Balanced               9.5     5.2     10.7    14.1    15.9
Growth                 11.8    3.8     11.3    15.6    18.7
High growth            13.7    3.1     12.3    17.4    20.5

a. For each fund, calculate the geometric rate of return for three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric rates of return for the funds?

A.4 A supplier of ‘Natural Australian’ spring water states that the magnesium content is 1.6 mg/L. To check this, the quality control department takes a random sample of 96 bottles during a day’s production and obtains the magnesium content. < SPRING_WATER1 >
a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, mode, first quartile and third quartile.
e. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
f. Construct and interpret a box-and-whisker plot.
g. What conclusions can you reach concerning the magnesium content of this day’s production?

A.5 The National Australia Bank (NAB) produces regular reports titled NAB Online Retail Sales Index <www.business.nab.com.au>. Download the latest in-depth report.
a. Give an example of a categorical variable found in the report.
b. Give an example of a numerical variable found in the report.
c. Is the variable you selected in (b) discrete or continuous?

A.6 The data in the file < WEBSTATS > represent the number of times during August and September that a sample of 50 students accessed the website of a statistics unit they were enrolled in.
a. Construct ordered arrays for August and September.
b. Construct stem-and-leaf displays for August and September.
c. Construct frequency, percentage and cumulative distributions for August and September.
d. Plot frequency histograms as separate graphs; plot percentage polygons on the same graph.
e. Plot cumulative percentage polygons on the same graph.
f. Calculate the mean, median, mode, first quartile and third quartile.
g. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
h. Based on the results of (a) to (g), what conclusions can you reach about the number of times a student accesses the website each month?

A.7 In problem A.6 sample statistics were calculated from data representing the number of times, during August and September, a sample of 50 students accessed the website of a statistics unit they were enrolled in. < WEBSTATS > For each month (August and September): a. List the five-number summary. b. Construct the box-and-whisker plot. c.
Discuss the distribution of the number of times a student accesses the website each month.

A.8 The data stored in data file < WEBSTATS > classify the number of times, during August and September, that a sample of 50 students accessed a statistics unit website by day and time.
a. Construct appropriate tables and/or charts to investigate the day of the week and the time that students access the website.
b. What conclusions can you draw about the pattern of web access for the two months?
c. When would you post an announcement, so that the maximum number of students would read it?

A.9 The data in the file < NZ_CAR_SALES_16_17 > are of sales of new cars in New Zealand for February 2016 and 2017 (data obtained from Motor Industry Association of New Zealand <www.mia.org.nz> accessed 27 March 2017). For each year, ignoring the other category:
a. Calculate the mean, variance and standard deviation for the population of the 20 top-selling makes of car.
b. What proportion of the makes have sales within ±1, ±2 and ±3 standard deviations of the mean?
c. Compare and contrast your findings with what would be expected based on the empirical rule or on the Chebyshev rule.

A.10 The data below represent the distribution of the ages of employees in two different divisions of a publishing company.

Age of employees (years)    A Frequency    B Frequency
20–under 30                 8              15
30–under 40                 17             32
40–under 50                 11             20
50–under 60                 8              4
60–under 70                 2              0

For each of the two divisions (A and B), approximate the
a. mean.
b. standard deviation.
c. On the basis of the results of (a) and (b), do you think there are differences in the age distribution between the two divisions? Explain.

A.11 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether it is discrete or continuous. In addition, determine the level of measurement. a. Amount of money spent on clothing in the last month b. Favourite department store c.
Most likely time period during which shopping for clothing takes place (weekday, weeknight, weekend) d. Number of pairs of jeans owned

A.12 The file < CURRENCY > contains the monthly closing exchange rates for the New Zealand dollar (NZD), the Japanese yen (JPY), the United States dollar (USD) and the Chinese renminbi (CNY) from January 2010 to May 2017, where each currency is expressed in units per Australian dollar (data obtained from Reserve Bank of Australia <www.rba.gov.au> accessed 1 June 2017).
a. Construct time-series plots for the monthly closing values of each currency.
b. Explain any patterns present in the plots.
c. Construct separate scatter plots of the value of pairs of these currencies.
d. Calculate the correlation coefficient for pairs of currencies.
e. What conclusions can you reach concerning the value of these currencies in terms of the Australian dollar?
f. Obtain current exchange rates from Reserve Bank of Australia or elsewhere for either these currencies or alternative currencies. Then repeat parts (a) to (e).

A.13 The table below classifies the academic staff of a small regional university by gender and level. < ACADEMIC_STAFF >

                                           Gender
Level                  Average salary    Female    Male    Total
Professor              $172,500          13        21      34
Associate professor    $147,600          16        24      40
Senior lecturer        $128,500          37        52      89
Lecturer               $108,200          74        58      132
Associate lecturer     $86,500           23        13      36
Total                                    163       168     331

a. Illustrate these data by constructing appropriate tables and graphs.
b. What can you conclude about gender and level for academic staff at this university?
c. Estimate the mean and standard deviation of academic salaries.
d. Estimate the annual expenditure on academic salaries for this university.
e. Estimate the mean and standard deviation of male and female academic salaries. f.
Comment on the difference in male and female academic salaries at this university.

A.14 To test the effectiveness of mail X-ray screening in identifying potential illegal or threatening items a mail centre X-rays a random sample of 500 packages and then independently searches each package. The results of this test are given below.

                        X-ray items identified
Search items found      Yes     No      Total
Yes                     36      12      48
No                      14      438     452
Total                   50      450     500

a. Illustrate these data by constructing appropriate tables and graphs.
b. Do you feel that X-ray screening is effective in identifying items of interest?

A.15 The following data represent the amount of soft drink filled in a sample of 50 consecutive 2-litre bottles. The results are listed horizontally in the order filled. < DRINK >

2.109 2.086 2.066 2.075 2.065 2.057 2.052 2.044
2.036 2.038 2.031 2.029 2.025 2.029 2.023 2.020
2.015 2.014 2.013 2.014 2.012 2.012 2.012 2.010
2.005 2.003 1.999 1.996 1.997 1.992 1.994 1.986
1.984 1.981 1.973 1.975 1.971 1.969 1.966 1.967
1.963 1.957 1.951 1.951 1.947 1.941 1.941 1.938
1.908 1.894

a. Construct a frequency distribution and a percentage distribution.
b. Plot a histogram and a percentage polygon.
c. Form a cumulative percentage distribution and plot the corresponding cumulative percentage polygon.
d. On the basis of the results of (a) to (c), does the amount of soft drink in the bottles concentrate around specific values?
e. Construct a time-series plot with the amount of soft drink on the vertical axis and the bottles’ numbers (from 1 to 50) on the horizontal axis.
f. What pattern, if any, is present in the data?
g. If you had to make a prediction of the amount of soft drink in the next bottle, what would you predict?
h. Based on the results of (e) to (g), explain why it is important to construct a time-series plot and not just a histogram, as was done in part (b).

A.16 Comment on the following graph, which appeared in the Northern Star in August 2008.
[Graph: ‘How does your lender compare?’ Comparison of standard variable home loan rates on a loan of $150,000 over 25 years. Rates current at 23 May 2008. Labels shown: BCU 9.15% pa (9.18% pa comparison rate), Newcastle Permanent 9.41% pa, Commonwealth Bank 9.44% pa, National Australia Bank 9.46% pa, ANZ 9.47% pa, Suncorp 9.47% pa, St George 9.47% pa, Westpac 9.47% pa.]
Data obtained from InfoChoice <www.infochoice.com.au>

A.17 The following table gives the results on food groups never eaten from a national study of 10,000 men and 10,000 women aged at least 50. < FOOD >

Foods never eaten              Men       Women     Total
Cheese                         236       219       455
Cream                          623       917       1,540
Dairy products                 131       196       327
Eggs                           175       279       454
Fish                           123       266       389
Seafood                        166       268       434
Any meat                       111       353       464
Chicken/Poultry                126       368       494
Pork/Ham                       234       495       729
Red meat                       159       247       406
Sugar                          1,095     897       1,992
Wheat products                 187       380       567
Eat all foods                  7,299     7,878     15,177
Total number of respondents    10,000    10,000    20,000

a. For men and women, separately and combined, construct percentage summary tables and bar charts for the data.
b. What conclusions can you draw about the diet of the participants in the study?
c. Why would a pie chart not be appropriate for these data?

A.18 The data in < PROBLEMS > are random samples of the time (in minutes) taken to resolve 40 problems reported by students and 40 problems reported by staff to the Technology Services (TS) Service Desk at Tasman University. For each sample:
a. Construct appropriate tables and/or charts to investigate the time it takes the TS Service Desk to resolve problems.
b. Calculate the mean, median and quartiles.
c. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
d. Construct a box-and-whisker plot. Are the data skewed? If so, how?
e. On the basis of the results of (a) to (d), are there any differences between the time to resolve TS problems for staff and for students? Explain.
A.19 If two students receive a mark of 90 on the same examination, what arguments could be used to show that the underlying variable – test score – is continuous?

A.20 The call centre supervisor of the IT helpdesk of a large university is monitoring the performance of the technical support staff. The data in the file < HELP_DESK > give the number of calls resolved during a random sample of 20 eight-hour shifts by five support staff.
a. For each staff member, construct frequency, percentage and cumulative distributions.
b. For each staff member, construct a histogram.
c. On the same graph, construct percentage polygons for all staff members.
d. On the same graph, construct ogives for all staff members.
e. For each staff member, calculate the mean, median, mode, first quartile and third quartile.
f. For each staff member, calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
g. On the same graph, construct and interpret a box-and-whisker plot for each staff member.
h. What conclusions can you reach concerning the number of resolved calls?

A.21 The file < AGE > contains the ages and gender of the Australian population at 30 June 2013 and 2016.
a. Construct percentage and cumulative percentage distributions for the age of males, females and the entire Australian population in 2013 and 2016.
b. Construct and interpret appropriate graphs to investigate the age distribution of males and females, separately and combined, and how it is changing.
c. Calculate the approximate mean age and approximate standard deviation of age for the entire Australian population.

A.22 One operation of a mill is to cut pieces of steel into parts that will later be used as the frame for front seats in a car.
The steel is cut with a diamond saw and the resulting parts must be within ±0.125 mm of the length specified by the car manufacturer. The data in < STEEL > come from a sample of 100 steel parts. The measurement reported is the difference in millimetres between the actual length of the steel part, as measured by a laser measurement device, and the specified length of the steel part. For example, the data value –0.05 represents a steel part that is 0.05 mm shorter than the specified length.
a. Construct a frequency distribution and a percentage distribution.
b. Plot the corresponding histogram and percentage polygon.
c. Plot the corresponding cumulative percentage polygon.
d. Is the steel mill doing a good job in meeting the requirements set by the car manufacturer? Explain.

A.23 A large confectionery chain, Sweets-4-U, is interested in analysing the quantity sold weekly over the previous year, including associated cost data, of two of its popular products, ‘Forgive’ and ‘Rejoice’. These products, both wrapped chocolates sold by weight, differ only in the message attached to each chocolate. Forgive chocolates contain messages ‘Sorry’, ‘Forgive Me’, ‘Trust Me’ and similar, while the messages attached to Rejoice chocolates are ‘Celebrate’, ‘Have Fun’, ‘I Love You’ and similar. < SWEETS_4_U >
For Forgive chocolates quantity sold data, construct and interpret:
a. a stem-and-leaf display
b. frequency, percentage and cumulative distributions
c. a frequency histogram, percentage polygon and ogive
d. a scatter diagram of quantity sold and total cost.
For each product:
e. Calculate the mean, variance and standard deviation of the weekly quantity sold for the year.
f. What conclusions can you make about the weekly quantity sold for each product?
g. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to explain further the variation in the weekly quantity sold.
h. Using the results in (g), are there any outliers? Explain. i.
Calculate and interpret the coefficient of correlation between weekly quantity sold and the associated costs. Also calculate and interpret the coefficient of correlation between the weekly quantity sold of Rejoice and Forgive chocolates.
j. Construct time-series plots to investigate any pattern in weekly sales over the year. What conclusions can you make about the pattern of weekly sales for the products?

A.24 Several hundred laboratory tests are performed at a large hospital each day. The rate at which these tests are done improperly (and therefore need to be redone) seems steady, at about 4%. In an effort to get to the root cause of these nonconformances (tests that need to be redone), the director of the lab decided to keep records over a period of one week. The laboratory tests were subdivided by the shift of workers who performed them. The results are shown below.

Lab tests performed, by shift
Shift       Nonconforming   Conforming   Total
Day         16              654          670
Evening     24              306          330
Total       40              960          1,000

a. Construct cross-classification tables based on total percentages, row percentages and column percentages.
b. Which type of percentage – row, column or total – do you think is most informative for these data? Explain.
c. What conclusions concerning the pattern of nonconforming laboratory tests can the laboratory director reach?

A.25 An economist exploring the relationship between interest rates and inflation has collected interest and CPI data from the reserve banks of New Zealand and Australia for 2000 to March 2017 (data obtained from Reserve Bank of Australia and Reserve Bank of New Zealand <www.rba.gov.au> and <www.rbnz.govt.nz> accessed 1 June 2017). < INTEREST_&C_PI_2017 > For each country, use appropriate graphs and statistics to investigate the relationship between interest and inflation rates.
What conclusions can you make? A.26 < GDP > gives the annual percentage change in real gross domestic product (GDP) per quarter since 2000 for New Zealand (NZ), Australia, the United States of America (USA), Japan and the United Kingdom (UK) (data obtained from Reserve Bank of New Zealand <www.rbnz.govt.nz> accessed 1 June 2017). a. Investigate the relationship between the annual percentage changes in GDP for these five countries by constructing time-series plots on the same set of axes. b. What conclusions can you make about the changes in GDP for these five countries? A.27 Alex and Tyler have been monitoring their electricity use since installing solar power almost a year ago, with the data stored in < SOLAR_POWER > . Explore Alex and Tyler’s power usage over this period by: a. plotting the data graphically b. calculating summary statistics c. commenting on the graphs and summary statistics A.28 The results of the 2017 Adobe Mobile Maturity Survey reveal insights into the change to smartphones as primary online access devices, and indicate the need for companies to focus on creating engaging and personalised digital experiences for their customers. How are companies addressing the mobile experience? The survey found 40% of marketing decision makers were prioritising mobile apps and only 24% were prioritising mobile websites. However, the situation differed for IT decision makers, of whom 26% were prioritising mobile apps and 30% were prioritising mobile websites. The research is based on an online survey with a sample of 304 US executives, marketers, IT staff and analysts who had experience with mobile marketing and who worked for or were agents for organisations with 500+ employees. Of these, 254 were identified as marketing respondents and 50 as IT respondents (data obtained from <www.adobe.com>). a. Describe the populations of interest. b. Describe the samples that were collected. c. Describe a parameter of interest. d. 
Describe the statistic used to estimate the parameter in (c).

A.29 A radio station survey of listeners found that 32% of the 1,356 drivers who responded admitted to talking on a hand-held mobile phone while driving, and 23% admitted to reading or sending SMS messages while driving. What information would you want to know before you accepted the results of the survey?

A.30 Pre-numbered sales invoices are kept in a sales journal. The invoices are numbered from 0001 to 5000.
a. Beginning in row 16, column 1, and proceeding horizontally in Table E.1, select a simple random sample of 50 invoice numbers.
b. Select a systematic sample of 50 invoice numbers. Use the random numbers in row 20, columns 5–7, as the starting point for your selection.
c. Are the invoices selected in (a) the same as those selected in (b)? Why or why not?

PART 2 Measuring uncertainty

Real People, Real Stats
Ellouise Roberts
DELOITTE ACCESS ECONOMICS

Which company are you currently working for and what are some of your responsibilities?
I currently work for Deloitte Access Economics where I’m in the macroeconomic policy and forecasting team located in Canberra. One of my main responsibilities is working with our demographic forecasting model, where we project the future Australian population and some of its characteristics – such as where people will live, how many people will be in the labour force and the industries they might work in. These population forecasts are a key driver of our macroeconomic model, which is used to assist a variety of clients in determining the impacts of potential economic and policy changes on their business, industry or region. Before joining Deloitte Access Economics, I worked at the Australian Bureau of Statistics in a range of roles related to social research, demography and the Census.
This included calculating life tables, analysing fertility rates and investigating the type of transport people use to get from home to work. List five words that best describe your personality. A statistical text book debutante! (Practical, adaptable, instinctive, determined and enquiring.) What are some things that motivate you? In my working life I’m motivated by the role that statistics can play in solving problems. For example, by undertaking statistical analysis to test the effectiveness of a particular policy in delivering intended outcomes, we can provide a basis of evidence to assist in deciding whether or not to continue funding existing programs, or to develop alternatives. Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e a quick q&a Many of the projects that I have been involved with will also play an important role in the future direction of Australia – whether these are in the area of higher education, infrastructure or the implementation of environmental controls. In these instances, the use of statistics can provide evidence and insights which cannot be acquired through other means such as consultations or literature reviews. such as history, economics, demographics and sociology, to make sense of statistical findings. With such a wide applicability, working with statistics can offer a range of opportunities across a wide variety of industries and occupations. In many cases, the techniques and concepts used are the same, but the subject matter can differ significantly, which helps to keep work interesting. When did you first become interested in statistics? I began to appreciate the value that statistics could offer while at school where I learnt about the work of John Graunt, who analysed the vital statistics of London’s citizens during the seventeenth century. 
During one of the many outbreaks of the bubonic plague in London, Graunt became interested in the Bills of Mortality – records of deaths from the plague – and through the use of statistics was able to draw conclusions about how the disease spread. Many ideas in use today – such as the application of life tables in the insurance industry, national censuses and medical statistics – utilise the principles and foundations of Graunt’s work. Statistics are also applicable to such a wide variety of industries and occupations that it is hard to imagine a subject where they could not offer additional insight and understanding. For example, a farmer can collect a record of daily rainfall, but in isolation those daily numbers do not offer any particularly interesting findings. However, with the introduction of even the most basic statistical techniques, such as the calculation of monthly averages or the pattern of rainfall events, insights begin to emerge. However, it is when they are combined with other observations – such as pest or disease outbreaks, or cropping metrics, or even worker productivity – that we begin to gain an understanding of the relationships between inputs and outputs (or dependent and independent variables) and appreciate the real value that statistics can offer. Describe your first statistics-related job or work experience. Was this a positive or a negative experience? My first statistics-related job, as a university student, involved standing on the side of a road counting the types of vehicles that went past. A seemingly simple job in itself; however, after the counts were completed we would analyse the data to develop traffic-flow diagrams to assist with the planning of future road infrastructure, such as traffic lights. This was my first real experience of collecting data and then transforming information – a count of cars – into something meaningful and tangible to everyday life. 
It also emphasised the importance of accurate and suitable data collection techniques, and the role that sampling plays in obtaining information. For example, although standing by the side of the road counting cars for 24 hours was possible, it would not be particularly cost effective (or exciting), and the use of statistical techniques can help us build a comprehensive picture using only a snapshot of data. Although a relatively simple example, this experience helped to demonstrate the role of statistics in society and encouraged me to continue working in this area. Complete the following sentence. A world without statistics … … would be a world where we wouldn’t be able to celebrate World Statistics Day. LET’S TALK STATS What do you enjoy most about working in statistics? For me, it is not just the generation of the statistics and data that I enjoy (although that in itself can be very interesting), but rather the interpretation of these figures through the identification of patterns, trends and relationships. As part of working with statistics, you are also often involved in looking at the bigger picture, drawing in knowledge from a range of other disciplines, What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain. One of my misconceptions when studying statistics was that you are either a ‘numbers’ or a ‘words’ person. However, in the workplace no matter how good you may be at undertaking complex statistical analysis, or building complicated models, you also need to be able to communicate your findings with a variety of audiences with varying degrees of understanding and interest. Therefore, it is critical that, in addition to understanding the mathematical techniques, you also develop your ability to interpret your findings and convey them in a language that your audience will understand – no matter who they may be. 
Do you need to be good at maths to understand and use statistics successfully? To some degree, I think you do need to have a certain level of understanding of maths and an appreciation for the role that statistics can play. However, this doesn’t necessarily mean that you need to memorise countless formulas or mathematical Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e proofs. Rather, you need the ability to be able to understand the concepts and their application. It is also important to remember that in some instances, the interpretation of the statistics is the key output or outcome and can be more important than the numbers themselves. More broadly, studying statistics in a purely theoretical sense is useful, but the real value is being able to apply these techniques and calculations to real-world data, whether this be in the area of finance, oceanography or biomedical science. However, even more than that, to successfully work as a statistician – or with statistics in any capacity – I think you need to have an enquiring mind and to want to know things and understand why things are as they are (or may be in the future). Is there a high demand for statisticians in your industry (or in other industries)? Please explain. Studying statistics provides a solid foundation for a wide range of roles within the workplace – including ones that may not be immediately obvious, such as building early monitoring systems for tsunamis or in the monitoring of disease outbreaks. Within my role, both the public and private sectors are becoming increasingly aware of and interested in what the future demographic profile of Australia will look like, and the implications that this will have. 
In such a dynamic environment, I expect the opportunities for people with an understanding of statistics will only increase as more and more aspects of society, nature and the economy are investigated, evaluated and analysed. Ten years ago nobody imagined that there would be professional roles for people undertaking statistical analysis of social media, online social networks and online human behaviour – let alone the prominence that these applications would have in society.

MEASURING UNCERTAINTY

What are the most practical consequences in your work that would result from failing to report uncertainty?
In much of the work that I do – and particularly in population forecasting – the element of uncertainty is fairly explicit. No one knows for certain how big the population is going to be decades into the future, particularly when you consider the assumptions that need to be made about future fertility (including for females who have not yet been born themselves), mortality (where numerous medical breakthroughs every year continue to extend our lifespan) and migration (where government policies play a key role). However, by observing past trends, patterns and behaviours we can build a picture of what the demographic and economic future may look like under certain conditions. More broadly, statistics don’t always give you a definitive number or answer as such. Instead, they are often predictions or assessments of information, making it critical to explain the role of uncertainty in the conclusions that you make. Given that our work can influence public and social policy within Australia, the failure to report uncertainty can have considerable consequences by falsely informing our client’s decisions.

When might a discrete probability distribution be useful for your work? Can you provide a specific question for which it has helped to provide an answer?
In our type of work we are often concerned with the distribution of certain events, such as the success or failure of students completing a particular year in their apprenticeship training. In this example, we were interested in understanding the probability of success in relation to a range of different characteristics, such as age, sex and industry, as well as any government assistance that they had received. Based on a sample of records, we investigated the probability distribution based on individual characteristics, which assisted us in identifying how these factors might contribute to the relative likelihood of success or failure in relation to the overall sample. This type of work assisted us to provide a range of information to our client. Firstly, it helped to establish whether the assistance being made available was targeted at the desired group (i.e. those least likely to complete a particular year of the apprenticeship) and whether the government program was having a positive influence on completion rates. When might a continuous probability distribution be useful for your work? Can you provide a specific question for which it has helped to provide an answer? Continuous probability distributions provide the foundation for much of our multiple linear regression analysis. Using the example from above, in this instance we were also interested in estimating the overall probability of apprenticeship success. While the methods used were themselves conceptually advanced, they were built around the basic assumptions of continuous probability distributions. Is it difficult to liken collected data to a common distribution? What features of the data are used to do so? 
One thing that you quickly learn in any analysis of ‘real-world’ data is that although some data may be easily likened to a common distribution – like exam results, which often follow a bell-shaped curve similar to a normal distribution – any data collected is likely to present its own unique set of challenges. Taking the Census, for example, despite the extensive effort put into the design of the form, the collection procedures, the processing of answers and the data analysis, there is still a wide range of errors (respondent error, processing error, partial or non-response, and undercount) that need to be considered while interpreting the statistics. In many other cases, the data you will be analysing may be collected for a different purpose (such as registrations of births, deaths and marriages for administrative purposes), and incorrect, incomplete and duplicate entries can be a significant issue.

CHAPTER 4 Basic probability

REPEAT FESTIVAL ATTENDANCE
To increase visitor numbers during the year and repeat attendance at the three-day musical festival presented in the Chapter 2 and 3 scenarios, non-local festival attendees are given a book of discount vouchers for subsequent visits to the region and/or the annual music festival. These vouchers include seven nights for the price of five at selected backpackers’ hostels and motels, and two meals for the price of one at selected restaurants. Gaia Adventure Tours, which runs tours and activities in the region, offers a voucher giving two for the price of one on selected tours and activities. Jo is analysing the use of these vouchers by a sample of 500 non-local festival attendees from five years ago.
Some of the questions Jo hopes to answer for these attendees are:
■ Are those who have been to a subsequent music festival more likely to have also used an accommodation discount voucher than those who have not been a repeat attendee?
■ What proportion of past festival attendees attend the music festival again?
■ What proportion of repeat festival attendees use a discount meal voucher?
■ What proportion of repeat festival attendees use the two-for-one Gaia Adventure Tours voucher?
■ Is the proportion of repeat festival attendees who use the two-for-one Gaia Adventure Tours voucher the same as those who use a discount meal voucher?
Answers to these questions and others can help Jo develop future sales and marketing strategies to encourage repeat visits to the region and/or music festival by festival attendees.

© Africa Studio/Shutterstock

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 recognise basic probability concepts
2 calculate probabilities of simple, marginal and joint events
3 calculate conditional probabilities and determine whether events are independent or not
4 revise probabilities using Bayes’ theorem
5 use counting rules to calculate the number of possible outcomes

Probability is the link between descriptive statistics and inferential statistics. This chapter introduces several types of probability and discusses how to revise probabilities in light of new information. These topics are the foundation for the probability distribution, the concept of mathematical expectation and the binomial, hypergeometric and Poisson distributions (topics covered in Chapter 5).

LEARNING OBJECTIVE 1 Recognise basic probability concepts
probability The likelihood of an event occurring.
impossible event An event that cannot occur.
certain event An event that will occur.
a priori classical probability Objective probability, obtained from prior knowledge of the process.

4.1 BASIC PROBABILITY CONCEPTS
What is probability? A probability is a numerical value that represents the chance, likelihood or possibility that a particular event will occur. Examples of events are the price of a share increasing, a rainy day, a defective item or the outcome 5 when you roll a die. A probability is given either as a proportion or fraction whose value lies between 0 and 1, inclusive. An event that has no chance of occurring (i.e. an impossible event) has a probability of 0. An event that is sure to occur (i.e. a certain event) has a probability of 1.
There are three approaches to assigning a probability to an event:
• a priori classical probability
• empirical classical probability
• subjective probability.
In a priori classical probability, the probability of an event is based on prior knowledge of the process involved. In the simplest case, each outcome is equally likely and the chance of occurrence of the event is given by Equation 4.1.

PROBABILITY OF OCCURRENCE
$$\text{Probability of occurrence} = \frac{X}{T} \tag{4.1}$$
where
X = number of ways in which the event occurs
T = total number of possible outcomes

Consider a standard deck of cards with 26 red cards and 26 black cards. The probability of selecting a black card (an event), using Equation 4.1, is 26/52 = 0.5 since there are X = 26 black cards and a total of T = 52 cards. What does this probability mean? As you cannot say for certain what colour the next card selected will be, it does not mean that, if each card is replaced after it is drawn, one out of the next two cards selected will be black. However, you can say that, in the long run, if cards are continually selected and replaced, the proportion of black cards selected will approach 0.5.
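The long-run interpretation above can be checked with a small simulation. The following is a minimal Python sketch, not part of the original text: the function name `proportion_black`, the fixed seed and the chosen numbers of draws are illustrative assumptions. It draws cards with replacement and reports the observed proportion of black cards, which should settle near the a priori value of 0.5 as the number of draws grows.

```python
import random

def proportion_black(n_draws, seed=0):
    """Draw n_draws cards with replacement; return the share that are black."""
    rng = random.Random(seed)                  # seeded so a run can be repeated
    deck = ["black"] * 26 + ["red"] * 26       # composition of a standard deck
    black = sum(1 for _ in range(n_draws) if rng.choice(deck) == "black")
    return black / n_draws

# Equation 4.1: X = 26 black cards out of T = 52 cards
a_priori = 26 / 52

# Small samples can be far from 0.5; large samples should be close to it
for n in (2, 100, 10_000, 100_000):
    print(n, proportion_black(n))
```

With only two draws the observed proportion can be 0, 0.5 or 1, which is why the probability says nothing definite about "one out of the next two cards".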
EXAMPLE 4.1 FINDING A PRIORI PROBABILITIES
A standard die has six faces. Each face carries one, two, three, four, five or six dots. If you roll a die, what is the probability you will get a face with five dots?
SOLUTION Each face is equally likely to occur. Since there are six faces, the probability of getting a face with five dots is 1/6.

The above examples use the a priori classical approach to assigning a probability because the number of ways the event occurs and the total number of possible outcomes are known from the composition of the deck of cards or the faces of the die. In addition to the cards and die examples discussed, games of chance such as Lotto and Roulette are based on known probabilities and, as such, are examples of a priori classical probability.
In the empirical classical approach to assigning a probability, the outcomes are based on observed data, not on prior knowledge of a process. Examples of this type of probability are the proportion of repeat festival attendees in the chapter scenario, the proportion of registered voters who prefer a certain political candidate or the proportion of students who have a part-time job. For example, if you take a survey of students and 60% state that they have a part-time job, then there is a 0.6 probability that an individual student has a part-time job.
The third approach to assigning a probability, subjective probability, differs from the other two approaches because a subjective probability differs from person to person. For example, the development team for a new product may assign a probability of 0.6 to the chance of success for the product while the managing director of the company is less optimistic and assigns a probability of 0.3.
The assignment of subjective probabilities to various outcomes is usually based on a combination of an individual’s prior knowledge, personal opinion and analysis of a particular situation. Subjective probability is useful in making decisions in situations in which you cannot use a priori classical probability or empirical classical probability.

empirical classical probability Objective probability, obtained from the relative frequency of occurrence of an event.
subjective probability Probability that reflects an individual’s belief that an event occurs.

Events and Sample Spaces
We need the following definitions to understand probabilities.
A random experiment is a precisely described scenario that leads to an outcome that cannot be predicted with certainty. For example, the scenario could be ‘roll a die and record how many dots are on the upper face’, or ‘toss a coin twice and record whether heads (H) or tails (T) occurs on each toss’.
An event is specified by one or more outcomes of a random experiment. The event is said to have occurred if one of the outcomes specified has occurred. For example, when rolling a die, the event of an even number consists of three outcomes: 2, 4 and 6.
A simple event is an event specified by a single outcome of a random experiment.
The collection of all simple events is called the sample space. For example, in the experiment of rolling the die, the sample space consists of the six simple events: 1, 2, 3, 4, 5 and 6. In the experiment of tossing a coin twice, the sample space consists of the four simple events: HH, HT, TH and TT.

random experiment A precisely described scenario that leads to an outcome that cannot be predicted with certainty.
event One or more outcomes of a random experiment.
simple event A single outcome of a random experiment.
sample space Collection of all simple events of a random experiment.
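The two sample spaces just described are small enough to enumerate directly. The sketch below is an illustration only: it models a sample space as a Python set, and the names `die_space`, `coin_space` and `a_priori` are hypothetical. It then applies Equation 4.1 under the assumption that every simple event is equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for 'roll a die and record the dots on the upper face'
die_space = {1, 2, 3, 4, 5, 6}
even = {outcome for outcome in die_space if outcome % 2 == 0}  # event: 2, 4, 6
five = {5}                                                     # a simple event

# Sample space for 'toss a coin twice and record H or T on each toss'
coin_space = {"".join(tosses) for tosses in product("HT", repeat=2)}

def a_priori(event, sample_space):
    """Equation 4.1 with equally likely outcomes: X / T as an exact fraction."""
    return Fraction(len(event), len(sample_space))

print(sorted(coin_space))          # the four simple events of the coin experiment
print(a_priori(five, die_space))   # probability of a face with five dots
print(a_priori(even, die_space))   # probability of an even number
```

Using `Fraction` keeps results such as 1/6 exact rather than as rounded decimals.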
joint event An event described by two or more characteristics.
complement All simple outcomes not in an event.

A joint event is an event described by two or more characteristics. A joint event can be a simple event. For example, in the experiment of tossing a coin twice, the simple event HH has the two characteristics H on first toss and H on second toss.
The complement of event A (written A′) includes all simple events that are not included in the event A. When tossing a coin, the complement of a head is a tail, since it is the only simple event that is not a head. When rolling a die, the complement of ‘five’ is ‘not five’ – that is, a 1, 2, 3, 4 or 6 – and, when rolling a die, the complement of the event ‘an even number’ is ‘an odd number’ – that is, 1, 3 or 5.

EXAMPLE 4.2 EVENTS AND SAMPLE SPACES
Table 4.1 gives information on repeat attendance at festivals and the use of discount accommodation vouchers by the sample of 500 festival attendees.

Table 4.1 Accommodation voucher use and repeat festival attendance
                              Accommodation voucher used
Repeat festival attendance    Yes     No      Total
Yes                           210     70      280
No                            110     110     220
Total                         320     180     500

What is the sample space? Give examples of simple events and joint events.
SOLUTION The sample space consists of discount accommodation voucher use and repeat festival attendance of the sample of 500 festival attendees. Examples of simple events are ‘Repeat festival attendance’ and ‘Accommodation voucher used’. The complement of the event ‘Accommodation voucher used’ is ‘Accommodation voucher not used’. The event ‘Repeat festival attendance and accommodation voucher used’ is a joint event because festival attendees have attended a subsequent music festival and used the discount accommodation voucher.
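Complements can be illustrated with set difference, and the complement count in Example 4.2 with simple arithmetic on the cells of Table 4.1. The following Python sketch is one possible representation, not the text's own method: the dictionary layout and all variable names are assumptions made for illustration.

```python
die_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
five = {5}

# The complement A' holds every simple event of the sample space not in A
not_five = die_space - five     # complement of 'five'
odd = die_space - even          # complement of 'an even number'

# Table 4.1 counts, keyed by (repeat attendance, accommodation voucher used)
table_4_1 = {
    ("yes", "yes"): 210, ("yes", "no"): 70,
    ("no", "yes"): 110,  ("no", "no"): 110,
}

# Attendees in the event 'Accommodation voucher used' and in its complement
voucher_used = sum(n for (_, voucher), n in table_4_1.items() if voucher == "yes")
voucher_not_used = sum(table_4_1.values()) - voucher_used

print(sorted(not_five), sorted(odd))
print(voucher_used, voucher_not_used)
```

Each key of `table_4_1` names a joint event, since it fixes a value for both characteristics at once.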
Contingency Tables and Venn Diagrams

contingency (or cross-classification) table – probability Represents a sample space for joint events classified by two characteristics; each cell represents the joint event satisfying given values of both characteristics.
Venn diagram Graphical representation of a sample space; joint events shown as ‘unions’ and ‘intersections’ of circles representing simple events.

There are several ways to present a sample space. Table 4.1 uses a contingency table, also called a cross-classification table (see Section 2.4), to represent a sample space. The values in the cells of the table are obtained by classifying the sample of 500 festival attendees by whether they have attended a subsequent music festival and/or used the discount accommodation voucher. For example, 210 festival attendees have used the discount accommodation voucher and attended a subsequent music festival.
A Venn diagram is another way to present a sample space. It graphically represents the various events as unions and intersections of circles. Figure 4.1 presents a typical Venn diagram for a two-variable situation, with each variable having only two events (A and A′, B and B′). The circle on the left represents all simple events that are part of A and the circle on the right represents all simple events that are part of B. The area contained within circle A and circle B (centre area) is the intersection of A and B (written as A ∩ B), since it contains all outcomes that are in event A and also in event B. The total area of the two circles is the union of A and B (written as A ∪ B) and contains all outcomes in event A and/or in event B. The area in the diagram outside A ∪ B contains outcomes that are neither in event A nor in event B.
To construct a Venn diagram the events A and B must be defined.
You can define either event as A or B, or use different letters, as long as you are consistent in evaluating the various events. For the repeat festival attendance scenario you can define the events as follows:
A = repeat festival attendance
A′ = no repeat festival attendance
B = accommodation voucher used
B′ = accommodation voucher not used
In drawing the Venn diagram (see Figure 4.2), you must determine the value of the intersection of A and B in order to divide the sample space into its parts. A ∩ B consists of all 210 festival attendees who have attended a subsequent music festival and used the discount accommodation voucher. The remainder of event A (Repeat festival attendance) consists of the 70 repeat festival attendees who did not use the discount accommodation voucher. The remainder of event B (Accommodation voucher used) consists of the 110 festival attendees who have used the discount accommodation voucher but not attended another music festival. The remaining 110 festival attendees have neither attended a later music festival nor used the discount accommodation voucher.

Figure 4.1 Venn diagram for events A and B
Note: A = (A ∩ B) + (A ∩ B′) and B = (A ∩ B) + (A′ ∩ B)

Figure 4.2 Venn diagram for repeat festival attendance scenario (A ∩ B = 210, A ∩ B′ = 70, A′ ∩ B = 110, A′ ∩ B′ = 110, A ∪ B = 390)
Note: A = (A ∩ B) + (A ∩ B′) and B = (A ∩ B) + (A′ ∩ B)

Marginal Probability
Now some of the questions posed in the repeat festival attendance scenario can be answered. Since the results are based on data collected (see Table 4.1), the empirical classical approach to assigning probabilities can be used.
Marginal probability refers to the probability P(A) of an occurrence of an event A, described by a single characteristic. An example of a marginal probability in the repeat festival attendance scenario is the probability of a festival attendee attending a later music festival.
Using Equation 4.1:

$$P(\text{repeat festival attendance}) = \frac{\text{number of repeat festival attendees}}{\text{total number of attendees}} = \frac{280}{500} = 0.56$$

LEARNING OBJECTIVE 2: Calculate probabilities of simple, marginal and joint events

marginal probability: Probability of an event described by a single characteristic.

Thus, there is a 0.56 (or 56%) likelihood that a festival attendee will attend a subsequent music festival. The name marginal probability derives from the fact that the total number of occurrences of event A (in this case, repeat festival attendance) is obtained from the margin of the contingency table (see Table 4.1). Example 4.3 illustrates another application of marginal probability.

EXAMPLE 4.3 CALCULATING THE PROBABILITY THAT A REPEAT FESTIVAL ATTENDEE USES THE GAIA ADVENTURE TOURS DISCOUNT VOUCHER
In the repeat festival attendance scenario, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and selected activities and tours by Gaia Adventure Tours. Table 4.2 gives the use of these two-for-one vouchers by the 280 repeat festival attendees.

Table 4.2 Use of two-for-one vouchers by repeat festival attendees

| Meal voucher used | Gaia Adventure Tours voucher used: Yes | Gaia Adventure Tours voucher used: No | Total |
|---|---|---|---|
| Yes | 126 | 84 | 210 |
| No | 42 | 28 | 70 |
| Total | 168 | 112 | 280 |

Find the probability that a repeat festival attendee uses the Gaia Adventure Tours voucher.

SOLUTION
Using Equation 4.1:

$$P(\text{Gaia Adventure Tours}) = \frac{\text{number of repeat festival attendees using Gaia Adventure Tours voucher}}{\text{total number of repeat festival attendees}} = \frac{168}{280} = 0.6$$

Therefore, 60% of repeat festival attendees use the Gaia Adventure Tours two-for-one voucher.

Joint Probability

joint probability: Probability of an occurrence described by two or more characteristics.
Joint probability refers to the probability of an occurrence described by two or more characteristics. An example of joint probability is the probability that you will get a head on the first toss of a coin and a head on the second toss of a coin.

Referring to Table 4.1, the festival attendees who have attended a subsequent music festival and used the discount accommodation voucher are represented by the 210 festival attendees in the single cell ‘Yes – Repeat festival attendance and Yes – Accommodation voucher used’. Because this group consists of 210 festival attendees, the probability of picking a festival attendee who has attended a later music festival and used the discount accommodation voucher is:

$$P(\text{repeat festival attendance and accommodation voucher used}) = \frac{\text{number of repeat festival attendees who used the accommodation voucher}}{\text{total number of festival attendees}} = \frac{210}{500} = 0.42$$

Example 4.4 also demonstrates how to determine joint probability.

EXAMPLE 4.4 DETERMINING THE JOINT PROBABILITY OF A REPEAT FESTIVAL ATTENDEE USING TWO-FOR-ONE MEAL AND GAIA ADVENTURE TOURS VOUCHERS
In the scenario summarised in Table 4.2, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a randomly selected repeat festival attendee uses both the meal and Gaia Adventure Tours two-for-one vouchers.

SOLUTION
Using Equation 4.1:

$$P(\text{Gaia Adventure Tours and meal voucher used}) = \frac{\text{number of repeat festival attendees using Gaia Adventure Tours and meal vouchers}}{\text{total number of repeat festival attendees}} = \frac{126}{280} = 0.45$$

Therefore, there is a 45% chance that a randomly selected repeat festival attendee uses both the meal and Gaia Adventure Tours two-for-one vouchers.
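The counting in Examples 4.3 and 4.4 can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the original text: the dictionary layout and the variable names (`table_4_2`, `p_gaia`, `p_meal_and_gaia`) are illustrative assumptions, and only the counts come from Table 4.2.

```python
from fractions import Fraction

# Counts from Table 4.2 (the 280 repeat festival attendees).
# Keys are (meal voucher used, Gaia Adventure Tours voucher used).
table_4_2 = {
    (True, True): 126,
    (True, False): 84,
    (False, True): 42,
    (False, False): 28,
}

total = sum(table_4_2.values())  # all repeat festival attendees

# Marginal probability: one characteristic only (Gaia voucher used),
# so add every cell in which that characteristic holds.
gaia_count = sum(n for (meal, gaia), n in table_4_2.items() if gaia)
p_gaia = Fraction(gaia_count, total)

# Joint probability: both characteristics at once (a single cell).
p_meal_and_gaia = Fraction(table_4_2[(True, True)], total)

print(float(p_gaia))           # 0.6, as in Example 4.3
print(float(p_meal_and_gaia))  # 0.45, as in Example 4.4
```

Exact fractions are used here so that the results can be compared without rounding error; converting to `float` is only for display.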
The marginal probability of an event is the sum of joint probabilities. For example, if B consists of two events, B1 and B2, then P(A), the probability of event A, consists of the joint probability of event A occurring with event B1 plus the joint probability of event A occurring with event B2. Equation 4.2 can be used to calculate marginal probabilities.

MARGINAL PROBABILITY

$$P(A) = P(A \text{ and } B_1) + P(A \text{ and } B_2) + \dots + P(A \text{ and } B_k) \tag{4.2}$$

where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.

Mutually exclusive events and collectively exhaustive events are defined as follows.

mutually exclusive: Two events that cannot occur simultaneously.

Two events are mutually exclusive if the two events cannot occur simultaneously. Heads and tails in a coin toss are mutually exclusive events. When tossing a coin you cannot get both a head and a tail on the same toss.

collectively exhaustive: Set of events such that one of the events must occur.

A set of events is collectively exhaustive if one of the events must occur. Heads and tails in a coin toss are collectively exhaustive events. One of them must occur. If heads does not occur, tails must occur. If tails does not occur, heads must occur.

In summary, the outcomes of a coin toss are both collectively exhaustive and mutually exclusive. The outcome must be either heads or tails, P(Heads or Tails) = 1, so the outcomes are collectively exhaustive. When heads occurs, tails cannot occur, P(Heads and Tails) = 0, so the outcomes are also mutually exclusive.

Equation 4.2 can be used to calculate the marginal probability of a festival attendee attending a later music festival:

$$\begin{aligned} P(\text{repeat festival attendance}) &= P(\text{repeat festival attendance and accommodation voucher used}) \\ &\quad + P(\text{repeat festival attendance and accommodation voucher not used}) \\ &= \frac{210}{500} + \frac{70}{500} = \frac{280}{500} = 0.56 \end{aligned}$$

Alternatively, Equation 4.1 can be used to calculate P(repeat festival attendance).
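Equation 4.2 can also be checked numerically. The sketch below is illustrative only: the names `joint_counts_with_a` and `partition_counts` are assumptions introduced here, and the counts are those of Table 4.1 as quoted in the text.

```python
from fractions import Fraction

N = 500  # festival attendees in Table 4.1

# Joint counts for A (repeat festival attendance) with each event in the
# partition {B, B'} (accommodation voucher used / not used).
joint_counts_with_a = {"B": 210, "B'": 70}

# Counts for the partition itself. B and B' cannot both occur
# (mutually exclusive) and together cover every attendee
# (collectively exhaustive), so the counts must add up to N.
partition_counts = {"B": 320, "B'": 180}
assert sum(partition_counts.values()) == N

# Equation 4.2: P(A) = P(A and B) + P(A and B')
p_a = sum(Fraction(n, N) for n in joint_counts_with_a.values())

print(p_a, float(p_a))  # 14/25 0.56
```

The assertion on `partition_counts` is the practical test that the events used in the sum really do form a partition; if they overlapped or left attendees out, the sum would not equal P(A).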
General Addition Rule

general addition rule: Used to calculate the probability of the joint event A or B.

The probability of event ‘A or B’ can be calculated by the general addition rule. This rule considers the occurrence of either event A or event B or both A and B.

The event ‘Repeat festival attendance or accommodation voucher used’ includes all festival attendees who have attended a subsequent music festival and all festival attendees who have used the discount accommodation voucher. Table 4.1 can be used to calculate the probability that a festival attendee either attended a later music festival or used the accommodation discount voucher by examining each cell of the contingency table to determine whether it is part of this event. From Table 4.1, the cell ‘Repeat festival attendance and accommodation voucher not used’ is part of the event, because it includes repeat festival attendees. The cell ‘No repeat festival attendance and accommodation voucher used’ is included because it contains festival attendees using the discount accommodation voucher. Finally, the cell ‘Repeat festival attendance and accommodation voucher used’ has both characteristics of interest. Therefore, the probability of a festival attendee either attending a later music festival or using the accommodation discount voucher is:

$$\begin{aligned} P(\text{repeat festival attendance or accommodation voucher used}) &= P(\text{repeat festival attendance and accommodation voucher used}) \\ &\quad + P(\text{no repeat festival attendance and accommodation voucher used}) \\ &\quad + P(\text{repeat festival attendance and accommodation voucher not used}) \\ &= \frac{210}{500} + \frac{110}{500} + \frac{70}{500} = \frac{390}{500} = 0.78 \end{aligned}$$

Instead of using a contingency table, the general addition rule defined in Equation 4.3 can be used to calculate the probability of the event A or B, P(A or B).
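As a numerical check on the cell-by-cell total, the following sketch adds up the qualifying cells of Table 4.1 and compares the result with P(A) + P(B) − P(A and B), the right-hand side of Equation 4.3. The helper `prob` and the dictionary layout are hypothetical conveniences introduced for this illustration; only the four cell counts come from the text.

```python
from fractions import Fraction

# Table 4.1 counts, keyed by (repeat attendance, accommodation voucher used).
table_4_1 = {
    (True, True): 210,
    (True, False): 70,
    (False, True): 110,
    (False, False): 110,
}
n = sum(table_4_1.values())  # 500 festival attendees

def prob(event):
    """Probability of the cells for which event(repeat, voucher) is true."""
    count = sum(c for (repeat, voucher), c in table_4_1.items()
                if event(repeat, voucher))
    return Fraction(count, n)

# Cell-by-cell: every cell with at least one characteristic of interest.
p_union_cells = prob(lambda repeat, voucher: repeat or voucher)

# Right-hand side of Equation 4.3: P(A) + P(B) - P(A and B).
p_a = prob(lambda repeat, voucher: repeat)
p_b = prob(lambda repeat, voucher: voucher)
p_a_and_b = prob(lambda repeat, voucher: repeat and voucher)
p_union_rule = p_a + p_b - p_a_and_b

assert p_union_cells == p_union_rule
print(float(p_union_cells))  # 0.78
```

Both routes give 390/500 because the joint cell (210) is counted once in the cell-by-cell sum, and counted twice then subtracted once in the formula.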
GENERAL ADDITION RULE
The probability of A or B is equal to the probability of A plus the probability of B minus the probability of A and B.

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \tag{4.3}$$

Applying this equation to the previous example produces the following:

$$\begin{aligned} P(\text{repeat festival attendance or accommodation voucher used}) &= P(\text{repeat festival attendance}) + P(\text{accommodation voucher used}) \\ &\quad - P(\text{repeat festival attendance and accommodation voucher used}) \\ &= \frac{280}{500} + \frac{320}{500} - \frac{210}{500} = \frac{390}{500} = 0.78 \end{aligned}$$

The general addition rule adds the probability of A and the probability of B, and then subtracts the probability of the joint event A and B from this total because the joint event has been included in both the probability of A and the probability of B. Referring to Table 4.1, if the outcomes of the event ‘Repeat festival attendance’ are added to those of the event ‘Accommodation voucher used’, the joint event ‘Repeat festival attendance and accommodation voucher used’ has been included in each of these simple events. Therefore, because this joint event has been counted twice, it needs to be subtracted once. Example 4.5 illustrates another application of the general addition rule.

EXAMPLE 4.5 APPLYING THE GENERAL ADDITION RULE FOR REPEAT FESTIVAL ATTENDEES USING TWO-FOR-ONE MEAL OR GAIA ADVENTURE TOURS VOUCHERS
In Example 4.3, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a randomly selected repeat festival attendee uses a two-for-one meal or Gaia Adventure Tours voucher.

SOLUTION
Using Equation 4.3:

$$\begin{aligned} P(\text{Gaia Adventure Tours or meal voucher used}) &= P(\text{Gaia Adventure Tours voucher used}) + P(\text{meal voucher used}) \\ &\quad - P(\text{Gaia Adventure Tours and meal voucher used}) \\ &= \frac{168}{280} + \frac{210}{280} - \frac{126}{280} = \frac{252}{280} = 0.9 \end{aligned}$$

Therefore, there is a 90% chance that a repeat festival attendee uses a two-for-one meal or Gaia Adventure Tours voucher.

Problems for Section 4.1

LEARNING THE BASICS

4.1 Two coins are tossed.
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of a head on the first toss?

4.2 An urn contains 12 red balls and 8 white balls. One ball is to be selected from the urn.
a. Give an example of a simple event.
b. What is the complement of a red ball?

4.3 Given the following contingency table:

|    | B  | B′ |
|----|----|----|
| A  | 10 | 20 |
| A′ | 20 | 40 |

what is the probability of
a. event A?
b. event A′?
c. event A and B?
d. event A or B?

4.4 Given the following contingency table:

|    | B  | B′ |
|----|----|----|
| A  | 10 | 30 |
| A′ | 25 | 35 |

what is the probability of
a. event A′?
b. event A and B?
c. event A′ and B′?
d. event A′ or B′?

APPLYING THE CONCEPTS

4.5 For each of the following, indicate whether the type of probability involved is an example of a priori classical probability, empirical classical probability or subjective probability.
a. The next toss of a fair coin will be heads.
b. Italy will win soccer’s World Cup the next time the competition is held.
c. The sum of the faces of two dice will be 7.
d. The train taking a commuter to work will be more than 10 minutes late.

4.6 For each of the following, state whether the events are mutually exclusive and/or collectively exhaustive. If they are not mutually exclusive and/or collectively exhaustive, either reword the categories to make them mutually exclusive and collectively exhaustive or explain why this would not be useful.
a. An exit poll in an Australian federal election asked voters if they had voted for the Labor or the Coalition candidate.
b. Respondents were classified by type of car they drive: Australian, American, European, Japanese or none.
c. People were asked, ‘Do you currently live in (i) an apartment or (ii) a house?’
d. A product was classified as defective or not defective.

4.7 The probability of each of the following events is zero. For each, state why.
a. A day is Christmas and Easter.
b. A product is defective and not defective.
c. A car is a Ford and a Toyota.

4.8 A researcher has completed a survey of 10,000 viewers in a regional city to determine which TV network they watch most weekdays during the 6 pm to 7 pm time slot. The results are:

| Network | Number |
|---|---|
| ABC | 1,290 |
| Seven | 2,850 |
| Nine | 2,060 |
| Ten | 1,695 |
| SBS | 430 |
| Other or none | 1,675 |

A surveyed viewer is chosen at random. Find the probability that during the 6 pm to 7 pm time slot the viewer:
a. watches ABC
b. watches ABC or SBS
c. watches neither ABC nor SBS
d. watches one of Channels 7, 9 or 10
e. does not watch one of Channels 7, 9 or 10

4.9 A sample of 500 consumers is selected in a large metropolitan area to study consumer behaviour. Among the questions asked was ‘Do you enjoy shopping for clothing (Yes or No)?’ Of 240 males, 136 answered yes. Of 260 females, 224 answered yes. Construct a contingency table or a Venn diagram to evaluate the probabilities. What is the probability that a surveyed consumer chosen at random:
a. enjoys shopping for clothing?
b. is a female and enjoys shopping for clothing?
c. is a female or enjoys shopping for clothing?
d. is a male or a female?

LEARNING OBJECTIVE 3: Calculate conditional probabilities and determine whether events are independent or not

conditional probability: Probability of an event, given information on the occurrence of a second event.
4.2 CONDITIONAL PROBABILITY

Calculating Conditional Probabilities

We can often make use of extra information about the events under consideration when calculating probabilities. In this section, we consider the case where the probability of an event occurring depends on the occurrence of some other event. Suppose, for instance, that we are interested in determining the probability that a person selected at random earns more than $100,000 a year. If we know that the person has a degree, it might be reasonable to expect this to affect the probability.

Conditional probability refers to the probability of event A, given information about the occurrence of another event, B.

CONDITIONAL PROBABILITY
The probability of A given B, written P(A | B), is equal to the probability of A and B divided by the probability of B.

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \tag{4.4a}$$

The probability of B given A is equal to the probability of A and B divided by the probability of A.

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \tag{4.4b}$$

where
P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B

Referring to the repeat festival attendance scenario, suppose we know that a festival attendee has used the discount accommodation voucher. What is the probability that they have also attended a later music festival, that is, P(repeat festival attendance | accommodation voucher used)? As we know that the festival attendee has used the discount accommodation voucher, the sample space does not consist of all 500 festival attendees in the sample. It consists only of the festival attendees who have used the discount accommodation voucher. Of the 320 festival attendees who have used the discount accommodation voucher, 210 are repeat festival attendees.
Therefore (see Table 4.1 or Figure 4.2), the probability that a festival attendee attends a subsequent music festival given that they have used the discount accommodation voucher is:

$$P(\text{repeat festival attendance} \mid \text{accommodation voucher used}) = \frac{\text{number of repeat festival attendees who used the accommodation voucher}}{\text{number of attendees who used the accommodation voucher}} = \frac{210}{320} = 0.65625$$

Equation 4.4a can be used to calculate the above result. Define the events:

A = repeat festival attendance
B = accommodation voucher used

then:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{210/500}{320/500} = \frac{210}{320} = 0.65625$$

Therefore, if a festival attendee has used the discount accommodation voucher there is a 65.625% probability that they have also attended a subsequent music festival. Compare this conditional probability with the marginal probability of a festival attendee attending a later music festival, which is 280/500 = 0.56, or 56%. These results indicate that festival attendees who use the discount accommodation voucher are more likely to also attend a subsequent music festival. Example 4.6 further illustrates conditional probability.

EXAMPLE 4.6 FINDING A CONDITIONAL PROBABILITY CONCERNING REPEAT FESTIVAL ATTENDEES’ USE OF TWO-FOR-ONE VOUCHERS
Table 4.2 is a contingency table for whether repeat festival attendees use two-for-one meal and/or Gaia Adventure Tours vouchers. Find the probability that a randomly selected repeat festival attendee who used the two-for-one meal voucher also used the Gaia Adventure Tours voucher.

SOLUTION
We know that the repeat festival attendee has used the two-for-one meal voucher, so the sample space is reduced to the 210 attendees who have used their meal voucher. Of these 210 attendees, 126 have used their Gaia Adventure Tours voucher.
Therefore, the probability that the Gaia Adventure Tours voucher is used, given that the meal voucher was used, is:

$$P(\text{Gaia Adventure Tours voucher used} \mid \text{meal voucher used}) = \frac{\text{number of repeat attendees using meal and Gaia Adventure Tours vouchers}}{\text{number of repeat attendees using meal voucher}} = \frac{126}{210} = 0.6$$

If we define events:

M = meal voucher used
G = Gaia Adventure Tours voucher used

then Equation 4.4a may be used:

$$P(G \mid M) = \frac{P(G \text{ and } M)}{P(M)} = \frac{126/280}{210/280} = \frac{126}{210} = 0.6$$

Therefore, given that a repeat festival attendee has used the two-for-one meal voucher, there is a 60% chance that the Gaia Adventure Tours two-for-one voucher is also used.

Decision Trees

decision tree: Graphical representation of simple and joint probabilities as vertices of a tree. Also known as a tree diagram.

In Table 4.1, a sample of 500 festival attendees was classified according to whether they have attended a later music festival or used the discount accommodation voucher. A decision tree (or tree diagram) is an alternative to a contingency table or a Venn diagram. Figure 4.3 represents the decision tree for this example.

In Figure 4.3, beginning at the left with the sample of 500 festival attendees, there are two ‘branches’ corresponding to whether or not a subsequent music festival was attended. Each branch has two sub-branches, corresponding to whether the festival attendee used the discount accommodation voucher. The probabilities at the end of the initial branches represent the marginal probabilities of A (Repeat festival attendance) and A′. The probabilities at the end of each of the four sub-branches represent the joint probability for each combination of events A and B (Accommodation voucher used). The conditional probability is calculated by dividing the joint probability by the appropriate marginal probability.
For example, to calculate the conditional probability that a festival attendee uses the accommodation discount voucher given that they have attended a later music festival, divide P(repeat festival attendance and accommodation voucher used) by P(repeat festival attendance). From Figure 4.3:

$$P(\text{accommodation voucher used} \mid \text{repeat festival attendance}) = \frac{210/500}{280/500} = \frac{210}{280} = 0.75$$

Example 4.7 illustrates how to construct a decision tree.

Figure 4.3 Decision tree for repeat festival attendance scenario. Starting from the sample of 500 festival attendees, the first branches are Repeat festival attendance, P(A) = 280/500, and No repeat festival attendance, P(A′) = 220/500. The sub-branches (Accommodation voucher used / not used) end in P(A and B) = 210/500, P(A and B′) = 70/500, P(A′ and B) = 110/500 and P(A′ and B′) = 110/500.

EXAMPLE 4.7 FORMING A DECISION TREE FOR REPEAT FESTIVAL ATTENDEES – TWO-FOR-ONE VOUCHER USE
Using the cross-classified data in Table 4.2, construct a decision tree and use it to find the probability that a randomly selected repeat festival attendee who used the two-for-one meal voucher also used the Gaia Adventure Tours two-for-one voucher.

SOLUTION
The decision tree for ‘two-for-one voucher use’ is displayed in Figure 4.4.
Using Equation 4.4a and the following definitions:

G = Gaia Adventure Tours voucher used
M = meal voucher used

$$P(G \mid M) = \frac{P(G \text{ and } M)}{P(M)} = \frac{126/280}{210/280} = \frac{126}{210} = 0.6$$

Figure 4.4 Decision tree for ‘two-for-one voucher use’. Starting from the set of repeat festival attendees, the first branches are Meal voucher used, P(M) = 210/280, and Meal voucher not used, P(M′) = 70/280. The sub-branches (Gaia Adventure Tours voucher used / not used) end in P(M and G) = 126/280, P(M and G′) = 84/280, P(M′ and G) = 42/280 and P(M′ and G′) = 28/280.

Statistical Independence

statistical independence: The occurrence of an event does not affect the occurrence of a second event.

In the repeat festival attendance scenario, the conditional probability that a selected festival attendee attended a later music festival, given that they have used the discount accommodation voucher, is 210/320 = 0.65625. The probability that a randomly selected festival attendee attends a later music festival is 280/500 = 0.56. This result shows that the prior knowledge that a festival attendee has used the discount accommodation voucher affected the probability that they attended another music festival. In other words, the outcome of one event is dependent on the outcome of a second event.

When the outcome of one event does not affect the probability of occurrence of another event, the events are said to be statistically independent. Statistical independence can be determined by using Equation 4.5.

STATISTICAL INDEPENDENCE
Two events, A and B, are statistically independent if and only if

$$P(A \mid B) = P(A) \quad (\text{also } P(B \mid A) = P(B)) \tag{4.5}$$

Example 4.8 demonstrates the use of Equation 4.5.
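The comparison of 0.65625 with 0.56 for the repeat festival attendance scenario can be written as a small check of Equation 4.5. The sketch below is an illustration under stated assumptions: the function names `conditional` and `is_independent` are hypothetical, and they work directly from the counts in Table 4.1 rather than from pre-computed probabilities.

```python
from fractions import Fraction

def conditional(n_a_and_b, n_b):
    """P(A | B) from counts: restrict the sample space to B."""
    return Fraction(n_a_and_b, n_b)

def is_independent(n_a_and_b, n_a, n_b, n):
    """Equation 4.5: A and B are independent iff P(A | B) == P(A)."""
    return conditional(n_a_and_b, n_b) == Fraction(n_a, n)

# Table 4.1: A = repeat festival attendance, B = accommodation voucher used.
p_a_given_b = conditional(210, 320)
p_a = Fraction(280, 500)

print(float(p_a_given_b), float(p_a))       # 0.65625 0.56
print(is_independent(210, 280, 320, 500))   # False
```

Because exact fractions are compared, the test is an equality check with no tolerance. With sample data that only approximately satisfy Equation 4.5, a judgement about how close is close enough would be needed, which this sketch does not attempt.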
EXAMPLE 4.8 DETERMINING STATISTICAL INDEPENDENCE
Using the cross-classified data in Table 4.2, determine whether, for repeat festival attendees, use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are statistically independent events.

SOLUTION
From Examples 4.6 and 4.7:

P(Gaia Adventure Tours voucher used | meal voucher used) = 0.6

which from Example 4.3 is equal to:

P(Gaia Adventure Tours voucher used) = 0.6

Thus, use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are statistically independent events. The occurrence of one event does not affect the probability of the other event.

Multiplication Rules

general multiplication rule: Used to calculate the probability of the joint event A and B.

By manipulating the formula for conditional probability, you can determine the joint probability P(A and B) from the conditional probability of an event. The general multiplication rule is derived using Equations 4.4a and 4.4b and solving for the joint probability P(A and B).

GENERAL MULTIPLICATION RULE
The probability of A and B is equal to the probability of A given B times the probability of B or the probability of B given A times the probability of A.

$$P(A \text{ and } B) = P(A \mid B)P(B) = P(B \mid A)P(A) \tag{4.6}$$

Example 4.9 demonstrates the use of the general multiplication rule.

EXAMPLE 4.9 USING THE MULTIPLICATION RULE
Of the 500 festival attendees in the repeat festival attendance scenario (Table 4.1), 280 have attended a subsequent music festival. Suppose two festival attendees are randomly selected. Find the probability that both festival attendees have since attended a later music festival.

SOLUTION
We can use the multiplication rule.
Define events:

F1 = repeat festival attendance by the first attendee
F2 = repeat festival attendance by the second attendee

then, using Equation 4.6:

P(F1 and F2) = P(F2 | F1)P(F1)

The probability that the first attendee has subsequently attended another music festival is 280/500. However, the probability that the second attendee has attended a later music festival depends on the result of the first selection. If the first attendee is not returned to the sample after any repeat festival attendance is determined (sampling without replacement), then the number of attendees remaining will be 499. If the first festival attendee attends a later music festival, the probability that the second also attends a later music festival is 279/499, because 279 attendees who have subsequently attended a later music festival remain in the sample. Therefore:

$$P(F_1 \text{ and } F_2) = P(F_2 \mid F_1)P(F_1) = \frac{279}{499} \times \frac{280}{500} = 0.3131\ldots$$

The probability that both festival attendees have since attended a later music festival is approximately 0.313.

If A and B are independent events, then P(A | B) = P(A), so we can substitute P(A) for P(A | B) (or P(B) for P(B | A)) in Equation 4.6 to obtain the multiplication rule for independent events.

multiplication rule for independent events: Used to calculate the probability of the joint event A and B when A and B are independent.

MULTIPLICATION RULE FOR INDEPENDENT EVENTS
If A and B are statistically independent, the probability of A and B is equal to the probability of A times the probability of B.

$$P(A \text{ and } B) = P(A)P(B) \tag{4.7}$$

If this rule holds for two events, A and B, then A and B are statistically independent. Thus, there are two ways to determine statistical independence:

1. Events A and B are statistically independent if and only if P(A | B) = P(A) (or P(B | A) = P(B)).
2. Events A and B are statistically independent if and only if P(A and B) = P(A)P(B).

Marginal Probability Using the General Multiplication Rule

In Section 4.1 marginal probability was defined using Equation 4.2, which can be rewritten using the general multiplication rule. If:

$$P(A) = P(A \text{ and } B_1) + P(A \text{ and } B_2) + \dots + P(A \text{ and } B_k)$$

then, using the general multiplication rule, Equation 4.8 defines the marginal probability.

MARGINAL PROBABILITY USING THE GENERAL MULTIPLICATION RULE

$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k) \tag{4.8}$$

where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.

To illustrate this equation, refer to Table 4.1. Using Equation 4.8, the probability of a festival attendee attending a subsequent music festival is:

$$P(A) = P(A \mid B)P(B) + P(A \mid B')P(B')$$

where
P(A) = probability of ‘repeat festival attendance’
P(B) = probability of ‘accommodation voucher used’
P(B′) = probability of ‘accommodation voucher not used’

$$P(A) = \frac{210}{320} \times \frac{320}{500} + \frac{70}{180} \times \frac{180}{500} = \frac{210}{500} + \frac{70}{500} = \frac{280}{500} = 0.56$$

Problems for Section 4.2

LEARNING THE BASICS

4.10 Given the following contingency table:

|    | B  | B′ |
|----|----|----|
| A  | 10 | 20 |
| A′ | 20 | 40 |

a. what is the probability of:
i. A | B?
ii. A | B′?
iii. A′ | B′?
b. Are events A and B statistically independent?

4.11 Given the following contingency table:

|    | B  | B′ |
|----|----|----|
| A  | 10 | 30 |
| A′ | 25 | 35 |

a. what is the probability of:
i. A | B?
ii. A′ | B′?
iii. A | B′?
b. Are events A and B statistically independent?

4.12 If P(A and B) = 0.4 and P(B) = 0.8, find P(A | B).

4.13 If P(A) = 0.7, P(B) = 0.6, and A and B are statistically independent, find P(A and B).

4.14 If P(A) = 0.3, P(B) = 0.4, and P(A and B) = 0.2, are A and B statistically independent?
APPLYING THE CONCEPTS

4.15 The following table gives the labour force status of the Australian civilian population aged 15 years and over in May 2017:

6202.0–Labour Force, Australia, May 2017

| Labour force status (aged 15 years and over) | Male ('000) | Female ('000) | Total ('000) |
|---|---|---|---|
| Employed full-time | 5,296.0 | 3,001.2 | 8,297.2 |
| Employed part-time | 1,230.5 | 2,678.2 | 3,908.7 |
| Unemployed and looking for full-time work | 277.6 | 205.4 | 483.0 |
| Unemployed and not looking for full-time work | 90.3 | 130.1 | 220.4 |
| Not in labour force | 2,859.4 | 4,060.5 | 6,919.9 |
| Total civilian population aged 15 years and over | 9,753.8 | 10,075.4 | 19,829.2 |

Data obtained from Australian Bureau of Statistics, Labour Force, Australia, May 2017, Cat. No. 6202.0 <www.abs.gov.au/ausstats/abs@.nsf/mf/6202.0> accessed 28 June 2017

a. What is the probability that a randomly selected person is female?
b. What is the probability that a randomly selected male is not employed?
c. Suppose you know that a person is employed full-time. What is the probability that they are female?
d. Are the two events ‘employed full-time’ and ‘female’ statistically independent? Explain.
e. What is the probability that a randomly selected person is a male in full-time employment?
f. The unemployment rate is defined as the percentage of the labour force that is unemployed and either looking for full-time work or not looking for full-time work. What is the unemployment rate for males, females and overall?
g. The participation rate is defined as the percentage of the civilian population in the labour force, either employed or unemployed. What is the participation rate for males, females and overall?

4.16 Households in a certain town were surveyed to determine whether they would subscribe to a new Pay TV channel. The households were classified according to ‘high’, ‘medium’ and ‘low’ income levels. The results of the survey are summarised in the table below.

| Income level | High | Medium | Low |
|---|---|---|---|
| Will subscribe | 3,200 | 1,920 | 480 |
| Will not subscribe | 800 | 7,080 | 2,520 |

a. What is the probability that:
i. a household will subscribe?
ii. a household is high income?
iii. a household will subscribe and is high income?
iv. a high-income household will subscribe?
v. a household that subscribes is high income?
b. Is income level statistically independent of whether a household subscribes or not? Explain.

4.17 At a certain university, 25% of students are in the business faculty. Of the students in the business faculty, 66% are males. However, only 52% of all students at the university are male.
a. What is the probability that a student selected at random in the university is a male in the business faculty?
b. What is the probability that a student selected at random in the university is male or is in the business faculty?
c. What percentage of males are in the business faculty?

4.18 A sample of 500 consumers was selected in a large metropolitan area to study consumer behaviour with the following results:

| Gender | Enjoys shopping for clothing: Yes | Enjoys shopping for clothing: No | Total |
|---|---|---|---|
| Male | 136 | 104 | 240 |
| Female | 224 | 36 | 260 |
| Total | 360 | 140 | 500 |

a. What is the probability that a randomly chosen female consumer does not enjoy shopping for clothing?
b. Suppose the chosen consumer enjoys shopping for clothing. What is the probability that the individual is male?
c. Are enjoying shopping for clothing and the gender of the individual statistically independent? Explain.

4.19 A study was done to determine the efficacy of three different headache tablets – A, B and C. One thousand study participants used all three tablets (at different times) over the period of the study with the following results:

750 reported relief from tablet A
675 reported relief from tablet B
631 reported relief from tablet C
504 reported relief from both tablets A and B
453 reported relief from both tablets A and C
350 reported relief from both tablets B and C
236 reported relief from all three tablets

a. If a study participant is selected at random, what is the probability that they
i. reported relief from tablet A?
ii. reported relief from tablet B?
iii. reported relief from tablet A and tablet B?
iv. reported relief from tablet A or tablet B?
v. did not report relief from tablet C?
b. What is the probability that, if a participant reported relief from tablet A, they also reported relief from tablet B?
c. What is the probability that, if a participant reported relief from tablet B, they also reported relief from tablet A?
d. Are the events ‘report relief from tablet A’ and ‘report relief from tablet B’ statistically independent? Explain.

4.20 In 59 of the 88 years from 1929 to 2016, the S&P 500 (Standard and Poor’s 500 Index, one of the indices of the New York Stock Exchange that is widely used as a benchmark for the performance of US equity mutual funds) finished higher after the first five days of trading. In 41 of those 59 years the S&P 500 finished higher for the year. Is a good first week a good omen for the upcoming year? The following table gives the first-week and annual performance over this 88-year period:

| First week | S&P 500’s annual performance: Higher | S&P 500’s annual performance: Not higher |
|---|---|---|
| Higher | 41 | 18 |
| Not higher | 14 | 15 |

a. If a year is selected at random, what is the probability that the S&P 500 finished higher for the year?
b. Given that the S&P 500 finished higher after the first five days of trading, what is the probability that it finished higher for the year?
c. Are the two events, first-week performance and annual performance, statistically independent? Explain.
d. In 2017 the S&P 500 was up 0.8% after the first five days. Look up the 2017 annual performance of the S&P 500 at <https://finance.yahoo.com> or elsewhere. Comment on the results.
e. Repeat part (d) for last year.

4.21 A standard deck of cards is being used to play a game.
There are four suits (hearts, diamonds, clubs and spades), each having 13 faces (ace, 2 to 10, jack, queen and king), making a total of 52 cards. This complete deck is thoroughly shuffled, and you will receive the first two cards from the deck without replacement.
a. What is the probability that both cards are queens?
b. What is the probability that the first card is a 10 and the second card is a 5 or 6?
c. If you were sampling with replacement, what would be the answer in (a)?
d. In the game of blackjack, the picture cards (jack, queen, king) count as 10 points and the ace counts as either 1 or 11 points. All other cards are counted at their face value. Blackjack is achieved if your two cards total 21 points. What is the probability of getting blackjack in this problem?

4.3 BAYES’ THEOREM

LEARNING OBJECTIVE 4: Revise probabilities using Bayes’ theorem

Bayes’ theorem is used to revise previously calculated probabilities (called prior probabilities) when there is new information. Developed by the Rev. Thomas Bayes in the eighteenth century, Bayes’ theorem is an extension of conditional probability. The conditional probability of B given A is given by Equation 4.4b combined with Equation 4.6:

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{P(A \mid B)P(B)}{P(A)}$$

Bayes’ theorem is derived from this by substituting Equation 4.8 for P(A) in the above equation.

Bayes’ theorem: Revises previously calculated probabilities when new information becomes available.

BAYES’ THEOREM

$$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k)} \quad (4.9)$$

where B_i is the ith event out of k mutually exclusive and collectively exhaustive events.

The following situation illustrates when Bayes’ theorem can be used. Suppose the Consumer Electronics Company is considering marketing a new model of television.
In the past, 40% of the televisions introduced by the company have been successful and 60% have been unsuccessful. Before introducing a new model of television to the marketplace, the marketing research department always conducts an extensive study and releases a report, either favourable or unfavourable. In the past, 80% of the successful televisions had received a favourable market research report and 30% of the unsuccessful televisions had received a favourable report. For the new model of television under consideration, the marketing research department has issued a favourable report. What is the probability that the television will be successful, given this favourable report?

To use Equation 4.9 to calculate the required probability P(S | F), first define events:

S = successful television        F = favourable report
S′ = unsuccessful television     F′ = unfavourable report

then:

P(S) = 0.40      P(F | S) = 0.80
P(S′) = 0.60     P(F | S′) = 0.30

Therefore, using Equation 4.9:

$$P(S \mid F) = \frac{P(F \mid S)P(S)}{P(F \mid S)P(S) + P(F \mid S')P(S')} = \frac{(0.80)(0.40)}{(0.80)(0.40) + (0.30)(0.60)} = \frac{0.32}{0.32 + 0.18} = \frac{0.32}{0.50} = 0.64$$

The probability of a successful television, given that a favourable report was received, is 0.64. Thus, the probability of an unsuccessful television, given that a favourable report was received, is 1 − 0.64 = 0.36. Table 4.3 summarises the calculation of the probabilities and Figure 4.5 presents the decision tree.

The denominator in Bayes’ theorem represents P(F), the probability of a favourable report. This shows the connection between Equations 4.4a and 4.4b with Equation 4.9, reflecting that Bayes’ theorem is a special case of conditional probability.
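The television calculation can also be sketched in a few lines of Python. This is only an illustrative sketch of Equation 4.9, not part of the text: the function name `bayes_posterior` and its argument names are assumptions introduced here.

```python
def bayes_posterior(priors, likelihoods):
    """Revise prior probabilities with Equation 4.9.

    priors[i]      = P(B_i), for mutually exclusive and collectively
                     exhaustive events B_1, ..., B_k
    likelihoods[i] = P(A | B_i)
    Returns the list of P(B_i | A) and the denominator P(A).
    """
    joints = [p * l for p, l in zip(priors, likelihoods)]  # P(A and B_i)
    evidence = sum(joints)                                  # P(A), as in Equation 4.8
    return [j / evidence for j in joints], evidence


# Television example: B_1 = S (successful), B_2 = S' (unsuccessful), A = F.
posterior, p_favourable = bayes_posterior([0.40, 0.60], [0.80, 0.30])
print(round(p_favourable, 2))            # P(F)
print([round(p, 2) for p in posterior])  # [P(S | F), P(S' | F)]
```

Because the function takes lists, the same sketch should extend to more than two events, provided the priors sum to 1.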
Table 4.3 Bayes’ theorem calculations for the television-marketing example

| Event $S_i$ | Prior probability $P(S_i)$ | Conditional probability $P(F \mid S_i)$ | Joint probability $P(F \text{ and } S_i) = P(F \mid S_i)P(S_i)$ | Revised probability $P(S_i \mid F)$ |
|---|---|---|---|---|
| S = successful television set | 0.40 | 0.80 | 0.32 | 0.32/0.50 = 0.64 = $P(S \mid F)$ |
| S′ = unsuccessful television set | 0.60 | 0.30 | 0.18 | 0.18/0.50 = 0.36 = $P(S' \mid F)$ |
| Total | | | 0.50 = $P(F)$ | |

Figure 4.5 Decision tree for marketing a new television set

P(S) = 0.40
   P(S and F) = P(F | S)P(S) = (0.80)(0.40) = 0.32
   P(S and F′) = P(F′ | S)P(S) = (0.20)(0.40) = 0.08
P(S′) = 0.60
   P(S′ and F) = P(F | S′)P(S′) = (0.30)(0.60) = 0.18
   P(S′ and F′) = P(F′ | S′)P(S′) = (0.70)(0.60) = 0.42

Example 4.10 applies Bayes’ theorem to a medical diagnosis problem.

EXAMPLE 4.10 USING BAYES’ THEOREM IN A MEDICAL DIAGNOSIS PROBLEM

The probability that a person has a certain disease is 0.03. Medical diagnostic tests are available to determine whether a person has the disease. If the disease is present, the probability that the medical diagnostic test will give a positive result (indicating that the disease is present) is 0.90. If the disease is not present, the probability of a positive test result (indicating that the disease is present when it is not, called a false positive) is 0.02. Suppose that the medical diagnostic test has given a positive result. What is the probability that the disease is present, given the positive test result? What is the probability of a positive test result?

SOLUTION

Define events:

D = has disease               T = test is positive
D′ = does not have disease    T′ = test is negative

We are given:

P(D) = 0.03      P(T | D) = 0.90
P(D′) = 0.97     P(T | D′) = 0.02

Using Equation 4.9 to calculate P(D | T) – that is, the probability that the disease is present, given the positive test result – we obtain:

$$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid D')P(D')} = \frac{(0.90)(0.03)}{(0.90)(0.03) + (0.02)(0.97)} = \frac{0.0270}{0.0270 + 0.0194} = \frac{0.0270}{0.0464} = 0.5818\ldots$$

The probability that the disease is present, given a positive result has occurred (indicating that the disease is present), is 0.582. This means that if a person returns a positive test result, there is only a 58% chance they have the disease. Table 4.4 summarises the calculation of the probabilities and Figure 4.6 presents the decision tree.

Table 4.4 Bayes’ theorem calculations for the medical diagnosis problem

| Event $D_i$ | Prior probability $P(D_i)$ | Conditional probability $P(T \mid D_i)$ | Joint probability $P(T \mid D_i)P(D_i)$ | Revised probability $P(D_i \mid T)$ |
|---|---|---|---|---|
| D = has disease | 0.03 | 0.90 | 0.0270 | 0.0270/0.0464 = 0.582 = $P(D \mid T)$ |
| D′ = does not have disease | 0.97 | 0.02 | 0.0194 | 0.0194/0.0464 = 0.418 = $P(D' \mid T)$ |
| Total | | | 0.0464 | |

The denominator in Bayes’ theorem represents P(T), the probability of a positive test result, which in this case is 0.0464, or 4.64%.
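One way to see why a positive result implies only about a 58% chance of disease is to restate the probabilities as counts. The sketch below does this with exact fractions; the cohort of 10,000 people is a hypothetical figure chosen for illustration and does not come from the example.

```python
from fractions import Fraction

population = 10_000                      # hypothetical cohort size (assumption)
p_disease = Fraction(3, 100)             # P(D) = 0.03
p_pos_given_disease = Fraction(90, 100)  # P(T | D) = 0.90
p_pos_given_healthy = Fraction(2, 100)   # P(T | D') = 0.02

diseased = population * p_disease                   # people with the disease
healthy = population - diseased                     # people without it
true_positives = diseased * p_pos_given_disease     # diseased and test positive
false_positives = healthy * p_pos_given_healthy     # healthy but test positive

all_positives = true_positives + false_positives
p_positive = all_positives / population               # P(T)
p_disease_given_pos = true_positives / all_positives  # P(D | T)

print(true_positives, false_positives, all_positives)
print(float(p_positive), float(p_disease_given_pos))
```

Under these assumptions the false positives from the much larger healthy group are almost as numerous as the true positives, which is what holds P(D | T) well below the test's 0.90 detection rate.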
Figure 4.6 Decision tree for the medical diagnosis problem

P(D) = 0.03
   P(D and T) = P(T | D)P(D) = (0.90)(0.03) = 0.0270
   P(D and T′) = P(T′ | D)P(D) = (0.10)(0.03) = 0.0030
P(D′) = 0.97
   P(D′ and T) = P(T | D′)P(D′) = (0.02)(0.97) = 0.0194
   P(D′ and T′) = P(T′ | D′)P(D′) = (0.98)(0.97) = 0.9506

THINK ABOUT THIS
Divine providence and spam

Would you ever guess that the essays Divine Benevolence: Or, An Attempt to Prove that the Principal End of the Divine Providence and Government is the Happiness of His Creatures and An Essay Towards Solving a Problem in the Doctrine of Chances were written by the same person? Probably not, and in doing so you illustrate a modern-day application of Bayesian statistics: spam, or junk mail, filters.

In not guessing correctly, you probably looked at the words in the titles of the essays and concluded that they were talking about two different things. An implicit rule you used was that word frequencies vary by subject matter. A statistics essay would very likely contain the word statistics as well as words such as chance, problem and solving. An eighteenth-century essay about theology and religion would be more likely to contain the uppercase forms of Divine and Providence.

Likewise, there are words that you would guess to be very unlikely to appear in either book, such as technical terms from finance, and words that are most likely to appear in both – common words such as a, and and the. That words would either be likely or unlikely suggests an application of probability theory. Of course, likely and unlikely are fuzzy concepts, and we might occasionally misclassify an essay if we kept things too simple, such as relying solely on the occurrence of the words Divine and Providence.
For example, a profile of the late Harris Milstead, better known as Divine, the star of Hairspray and other films, visiting Providence (Rhode Island), would most certainly not be an essay about theology. But if we widened the number of words we examined and found such words as movie or the name John Waters (Divine’s director in many films), we probably would quickly realise the essay had something to do with twentieth-century cinema and little to do with theology and religion.

We can use a similar process to try to classify a new email message in your inbox as either spam or a legitimate message (called ‘ham’ in this context). We would first need to add to your email program a ‘spam filter’ that has the ability to track word frequencies associated with spam and ham messages as you identify them on a day-to-day basis. This would allow the filter constantly to update the prior probabilities necessary to use Bayes’ theorem. With these probabilities, the filter can ask, ‘What is the probability that an email is spam, given the presence of a certain word?’

Applying the terms of Equation 4.9, such a Bayesian spam filter would multiply the probability of finding the word in a spam email, P(A | B), by the probability that the email is spam, P(B), and then divide by the probability of finding the word in an email, the denominator in Equation 4.9. Bayesian spam filters also use shortcuts by focusing on a small set of words that have a high probability of being found in a spam message and on a small set of other words that have a low probability of being found in a spam message.

As spammers (people who send junk email) learned of such new filters, they tried to outfox them.
Having learned that Bayesian filters might be assigning a high P(A | B) value to words commonly found in spam, such as Viagra, spammers thought they could fool the filter by misspelling the word as Vi@gr@ or V1agra. What they overlooked was that the misspelled variants were even more likely to be found in a spam message than the original word. Thus, the misspelled variants made the job of spotting spam easier for the Bayesian filters.

Other spammers tried to fool the filters by adding ‘good’ words, words that would have a low probability of being found in a spam message, or ‘rare’ words, words not frequently encountered in any message. But these spammers overlooked the fact that the conditional probabilities are constantly updated and that words once considered ‘good’ would soon be discarded from the good list by the filter as their P(A | B) value increased. Likewise, as ‘rare’ words grew more common in spam and yet stayed rare in ham, such words acted like the misspelled variants that others had tried earlier.

Even then, and perhaps after reading about Bayesian statistics, spammers thought that they could ‘break’ Bayesian filters by inserting random words in their messages. Those random words would affect the filter by causing it to see many words whose P(A | B) value would be low. The Bayesian filter would begin to label many spam messages as ham and end up being of no practical use. Spammers again overlooked that conditional probabilities are constantly updated.

Other spammers decided to eliminate all or most of the words in their messages and replace them with graphics so that Bayesian filters would have very few words with which to form conditional probabilities. However, this approach failed too, as Bayesian filters were rewritten to consider things other than words in a message. After all, Bayes’ theorem concerns events, and ‘graphics present with no text’ is as valid an event as ‘some word, X, present in a message’.
Other future tricks will ultimately fail for the same reason. (By the way, spam filters use non-Bayesian techniques as well, which make spammers’ lives even more difficult.)

Bayesian spam filters are an example of the unexpected way that applications of statistics can show up in your daily life. You will discover more examples as you read the rest of this book. Incidentally, the author of the two essays mentioned earlier was Thomas Bayes, who is a lot more famous for the second essay than the first, a failed attempt to use mathematics and logic to prove the existence of God.

Problems for Section 4.3

LEARNING THE BASICS

4.22 If P(B) = 0.05, P(A | B) = 0.80 and P(A | B′) = 0.40, find P(B | A).

4.23 If P(B) = 0.30, P(A | B) = 0.60 and P(A | B′) = 0.50, find P(B | A).

APPLYING THE CONCEPTS

4.24 In Example 4.10 on page 165, suppose that the probability that the test will return a false positive (that is, the medical diagnostic test gives a positive result when the disease is not present) is reduced from 0.02 to 0.01. Given this information:
a. If the medical diagnostic test has given a positive result (indicating the disease is present), what is the probability that the disease is present?
b. If the medical diagnostic test has given a positive result, what is the probability that the disease is not present?
c. If the medical diagnostic test has given a negative result (indicating that the disease is not present), what is the probability that the disease is not present?
d. If the medical diagnostic test has given a negative result, what is the probability that the disease is present?

4.25 An advertising executive is studying the television viewing habits of married men and women during prime-time hours.
On the basis of past viewing records, the executive has determined that, during prime time, husbands are watching television 60% of the time. When the husband is watching television, 40% of the time the wife is also watching. When the husband is not watching television, 30% of the time the wife is watching television. Find the probability that
a. if the wife is watching television, the husband is also watching television.
b. the wife is watching television in prime time.

4.26 The editor of a textbook-publishing company is trying to decide whether to publish a proposed business statistics textbook. Information on previous textbooks published indicates that 10% are huge successes, 20% are modest successes, 40% break even and 30% are failures. However, before a publishing decision is made, the book will be reviewed. In the past, 99% of the huge successes received favourable reviews, 70% of the modest successes received favourable reviews, 40% of the break-even books received favourable reviews and 20% of the failures received favourable reviews.
a. If the proposed text receives a favourable review, how should the editor revise the probabilities of the various outcomes to take this information into account? (Hint: Derive the conditional probabilities for each outcome given a favourable review has been received.)
b. What proportion of textbooks receive favourable reviews?

4.27 From past records of personal loans the Check$mart Bank found that 10% of borrowers default on their loan – that is, they fail to pay. It also found that, of those who default, 32% are unemployed while, of those who do not default, only 2% are unemployed.
a. What percentage of unemployed borrowers default?
b. What proportion of borrowers are unemployed?
c. What proportion of borrowers who are not unemployed do not default?

4.4 COUNTING RULES

LEARNING OBJECTIVE 5: Use counting rules to calculate the number of possible outcomes

In Equation 4.1 the probability of occurrence of an outcome was defined as the number of ways the outcome occurs divided by the total number of possible outcomes. In many instances, there is a large number of possible outcomes and it is difficult to determine the exact number. In these circumstances, rules for counting the number of possible outcomes have been developed. Five different counting rules are introduced in this section.

COUNTING RULE 1
If any one of k different mutually exclusive and collectively exhaustive events can occur on each of n trials, the number of possible outcomes is equal to

$$k^n \quad (4.10)$$

EXAMPLE 4.11 COUNTING RULE 1
Suppose you toss a coin five times. What is the number of different possible outcomes (the sequences of heads and tails)?

SOLUTION
If you toss a coin (with two sides) five times, using Equation 4.10 the number of possible outcomes is $2^5 = 2 \times 2 \times 2 \times 2 \times 2 = 32$.

EXAMPLE 4.12 ROLLING A DIE TWICE
Suppose you roll a die twice. How many different possible outcomes can occur?

SOLUTION
If a die (having six sides) is rolled twice, using Equation 4.10 the number of different possible outcomes is $6^2 = 36$.

The second counting rule is a more general version of the first, and allows for the number of possible events to differ from trial to trial.

COUNTING RULE 2
If there are $k_1$ events on the first trial, $k_2$ events on the second trial, … , and $k_n$ events on the nth trial, then the number of possible outcomes is

$$k_1 \times k_2 \times \dots \times k_n \quad (4.11)$$

COUNTING RULE 2
At one stage, standard New South Wales vehicle number plates consisted of three letters followed by three digits. How many possible number plates are there of this form?
SOLUTION (EXAMPLE 4.13)
Using Equation 4.11, if a number plate consists of three letters (A to Z) followed by three numbers (0 to 9), the total number of number plates of this form is: $26 \times 26 \times 26 \times 10 \times 10 \times 10 = 26^3 \times 10^3 = 17{,}576{,}000$.

EXAMPLE 4.14 DETERMINING THE NUMBER OF DIFFERENT DINNERS
A restaurant menu has a fixed-price dinner consisting of an entrée, a main, a beverage and a dessert. There is a choice of ten entrées, five mains, three beverages and six desserts. Determine the total number of possible dinners.

SOLUTION
Using Equation 4.11, the total number of possible dinners is 10 × 5 × 3 × 6 = 900.

The third counting rule involves the calculation of the number of ways that a set of items can be arranged in order.

COUNTING RULE 3
The number of ways that n items can be arranged in order is

$$n! = n \times (n - 1) \times \dots \times 2 \times 1 \quad (4.12)$$

where n! is called n factorial and 0! is defined as 1.

EXAMPLE 4.15 COUNTING RULE 3
If a set of six textbooks is to be placed on a shelf, in how many ways can the six books be arranged?

SOLUTION
Any of the six books could occupy the first position on the shelf. Once the first position is filled, there are five books to choose from in filling the second. Continue this assignment procedure until all the positions are occupied. The number of ways that the six books can be arranged is 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720.

permutation: Ordered selection of items.

In many instances we need to know the number of ways in which a subset of the entire group of items can be arranged in order. Each possible ordered arrangement is called a permutation.

COUNTING RULE 4 – PERMUTATIONS
The number of ways of arranging X objects selected from n objects in order is

$${}_nP_X = \frac{n!}{(n - X)!} \quad (4.13)$$

EXAMPLE 4.16 COUNTING RULE 4
Modifying Example 4.15, if there are six textbooks but room for only four books on the shelf, in how many ways can these books be arranged on the shelf?

SOLUTION
Using Equation 4.13, the number of ordered arrangements of four books selected from six books is equal to:

$${}_6P_4 = \frac{6!}{(6 - 4)!} = \frac{6!}{2!} = 360$$

Alternatively, any of the six books could occupy the first position. Once the first position is filled, there are five books to choose from in filling the second. Continue this assignment procedure until four books are placed on the shelf. Therefore, the number of ordered arrangements of four books selected from six is:

6 × 5 × 4 × 3 = 360

combination: Unordered selection of items.

In other situations we are not interested in the order of the outcomes, but only in the number of ways that X items can be selected from n items, irrespective of order. Each unordered selection is called a combination.

COUNTING RULE 5 – COMBINATIONS
The number of ways of selecting X objects from n objects, irrespective of order, is equal to:

$${}_nC_X = \frac{n!}{X!(n - X)!} \quad (4.14)$$

Comparing Equations 4.13 and 4.14, it can be seen that they differ only in the inclusion of a term X! in the denominator of Equation 4.14. When permutations are used, all the arrangements of the chosen X objects are distinguishable. With combinations, the X! possible arrangements of the chosen X objects are irrelevant.

EXAMPLE 4.17 COUNTING RULE 5
Modifying Example 4.16, in how many ways can you choose four books to place on the shelf?

SOLUTION
Using Equation 4.14, the number of combinations of four books selected from six books is equal to:

$${}_6C_4 = \frac{6!}{4!(6 - 4)!} = \frac{6!}{4!\,2!} = 15$$

Problems for Section 4.4

APPLYING THE CONCEPTS

4.28 If there are 10 multiple-choice questions in an exam, each with three possible answers:
a.
How many different answer sequences are there?
b. If you answer the questions randomly, what is the probability that you get all 10 correct?

4.29 A lock on a bank vault consists of three dials, each with 30 positions. To open the vault, each of the three dials must be in the correct position.
a. How many different possible dial combinations are there for this lock?
b. What is the probability that, if you randomly select a position on each dial, you will be able to open the bank vault?
c. Explain why ‘dial combinations’ are not mathematical combinations expressed by Equation 4.14.

4.30 A particular brand of women’s jeans is available in seven different sizes, three different colours and three different styles. How many different jeans does the store manager need to order to have one pair of each type?

4.31 Greenway Gardens has a $10 salad box consisting of lettuce, tomatoes, cucumber, sprouts, capsicum, avocado and a bottle of Greenway’s special salad dressing. Suppose that at present there is a choice of eight types of lettuce, four types of tomatoes, three types of cucumbers, three types of sprouts and no choice for capsicum, avocado and dressing. How many different salad boxes are there?

4.32 If each letter is used once, how many different arrangements are there of:
a. Grafton?
b. Otaki?
c. Darwin?
d. Gore?

4.33 Currently, new standard New South Wales vehicle number plates consist of two letters followed by two digits followed by two letters. How many possible number plates are there of this form?

4.34 Each employee of a large firm has an ID number consisting of their initials (either two or three) followed by two digits. What is the maximum number of unique ID numbers generated by this system?

4.35 A trifecta consists of picking the correct finishing order of the first three horses in a race. Suppose 12 horses are entered in a race.
a. How many trifecta outcomes are there for this race?
b.
If you choose three horses randomly, what is the probability that you win the trifecta?

4.36 Nine passengers are on a waiting list for an overbooked flight. Due to cancellations, four seats are available. How many ways are there, regardless of order, to allocate the four seats?

4.37 A daily lottery is conducted in which two winning numbers are selected out of 100 numbers.
a. How many different combinations of winning numbers are possible?
b. Suppose that you have an entry in this lottery – what is your probability of winning?

4.38 A reading list for a unit contains 20 articles. How many ways are there to choose three articles from this list?

4.5 ETHICAL ISSUES AND PROBABILITY

Ethical issues can arise when any statements relating to probability are presented to the public, particularly when these statements are part of an advertising campaign for a product or service. Unfortunately, many people are not comfortable with numerical concepts and tend to misinterpret the meaning of the probability. In some instances, the misinterpretation is not intentional but, in other cases, advertisements may unethically try to mislead potential customers.

A commercial for a Lotto game that said ‘We won’t stop until we have made everyone a millionaire’ would be a deceptive and possibly unethical application of probability. When purchasing a Lotto ticket, the customer selects a set of numbers (such as 6) from a larger list of numbers (such as 45). Although virtually all participants know that they are unlikely to win a first-division prize (select all six of the winning numbers drawn), they also have very little idea of how small the probability is (1 in 8,145,060 if selecting 6 from 45). Given the fact that Lotto makes millions of dollars, it is unlikely to stop running, so the statement made is true.
However, it may also be misleading as, in a lifetime, no one can be certain of becoming a millionaire by winning Lotto.

A statement in an investment newsletter promising a 90% probability of a 20% annual return on an investment is another example of a potentially unethical application of probability. To make the claim in the newsletter an ethical one, the author needs to (a) explain the basis on which this probability estimate rests, (b) provide the probability statement in another format, such as 9 chances in 10, and (c) explain what happens to the investment in the 10% of cases in which a 20% return is not achieved (e.g. Is the entire investment lost?).

Other ethical issues arise when probabilities are calculated from non-representative samples. An example occurred during the 2007 Australian federal election campaign, when a leaflet from the Christian Democratic Party included the following:

Daily Telegraph Tele’s Voteline published on 31 March 2007
Fred Nile’s Christian Democrats are calling for an immediate moratorium on Islamic immigration. Do you agree? YES 99%

As well as being overtly discriminatory, there are several problems with this probability.

• The population sampled consists of readers of the Daily Telegraph, who may not be representative of the Australian electorate.
• The sample is self-selected; readers have to ring the voteline at a cost of 55 cents a call. Therefore, only those who feel strongly about an issue, for or against, are likely to vote.
• Sample size is not given. Therefore, we do not know if the probability is based on only a few votes or a large number of votes. From the Daily Telegraph the sample size was 972, Yes 960 and No 12.
• There is no mechanism to stop an individual voting more than once. The worst-case scenario is that this probability is based on the votes of two individuals, one voting Yes 960 times, and the other No 12 times.

Problems for Section 4.5

APPLYING THE CONCEPTS

4.39 Write an advertisement for:
a.
Lotto that ethically describes the probability of winning
b. the investment newsletter that ethically states the probability of a 20% return

4.40 Find an example online or in print of an unethical or misleading use of probability.

Assess your progress

Summary

This chapter developed concepts concerning basic probability, conditional probability, Bayes’ theorem and counting rules. In the next chapter, important discrete probability distributions such as the binomial, hypergeometric and Poisson distributions will be considered.

Key formulas

Probability of occurrence
$$\text{Probability of occurrence} = \frac{X}{T} \quad (4.1)$$

Marginal probability
$$P(A) = P(A \text{ and } B_1) + P(A \text{ and } B_2) + \dots + P(A \text{ and } B_k) \quad (4.2)$$

General addition rule
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \quad (4.3)$$

Conditional probability
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \quad (4.4a)$$
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \quad (4.4b)$$

Statistical independence
$$P(A \mid B) = P(A) \ (\text{and } P(B \mid A) = P(B)) \quad (4.5)$$

General multiplication rule
$$P(A \text{ and } B) = P(A \mid B)P(B) = P(B \mid A)P(A) \quad (4.6)$$

Multiplication rule for independent events
$$P(A \text{ and } B) = P(A)P(B) \quad (4.7)$$

Marginal probability using the general multiplication rule
$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k) \quad (4.8)$$

Bayes’ theorem
$$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_k)P(B_k)} \quad (4.9)$$

Counting rule 1
$$k^n \quad (4.10)$$

Counting rule 2
$$k_1 \times k_2 \times \dots \times k_n \quad (4.11)$$

Factorials
$$n! = n \times (n - 1) \times \dots \times 2 \times 1 \quad (4.12)$$

Permutations
$${}_nP_X = \frac{n!}{(n - X)!} \quad (4.13)$$

Combinations
$${}_nC_X = \frac{n!}{X!(n - X)!} \quad (4.14)$$
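The counting-rule formulas (4.10) to (4.14) can be checked numerically against brute-force enumeration. The following Python sketch is an illustration only, reusing the numbers from Examples 4.11, 4.13 and 4.15 to 4.17 and the Lotto figure from Section 4.5; the book labels in `books` are hypothetical placeholders.

```python
from itertools import combinations, permutations, product
from math import comb, factorial, perm

# Rule 1 (4.10): k**n outcomes, here five tosses of a two-sided coin.
coin_sequences = list(product("HT", repeat=5))
assert len(coin_sequences) == 2 ** 5

# Rule 2 (4.11): three letters followed by three digits.
plates = 26 ** 3 * 10 ** 3

# Rule 3 (4.12): orderings of six books.
books = "ABCDEF"  # hypothetical labels for the six textbooks
assert len(list(permutations(books))) == factorial(6)

# Rule 4 (4.13): ordered selections of four books from six.
assert len(list(permutations(books, 4))) == perm(6, 4)

# Rule 5 (4.14): unordered selections of four books from six.
assert len(list(combinations(books, 4))) == comb(6, 4)

print(len(coin_sequences), plates, factorial(6), perm(6, 4), comb(6, 4))
print(comb(45, 6))  # Lotto: number of ways to choose 6 numbers from 45
```

Enumeration is only practical for small cases such as these; for anything larger the closed-form functions `math.perm` and `math.comb` (available from Python 3.8) are the sensible route.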
Key terms
a priori classical probability
Bayes’ theorem
certain event
collectively exhaustive
combination
complement
conditional probability
contingency (cross-classification) table – probability
decision tree
empirical classical probability
event
general addition rule
general multiplication rule
impossible event
joint event
joint probability
marginal probability
multiplication rule for independent events
mutually exclusive
permutation
probability
random experiment
sample space
simple event
statistical independence
subjective probability
Venn diagram

Chapter review problems

CHECKING YOUR UNDERSTANDING
4.41 What are the differences between a priori classical probability, empirical classical probability and subjective probability?
4.42 What is the difference between a simple event and a joint event?
4.43 How can you use the addition rule to find the probability of occurrence of event A or B?
4.44 What is the difference between mutually exclusive events and collectively exhaustive events?
4.45 How does conditional probability relate to the concept of statistical independence?
4.46 How does the multiplication rule differ for events that are and are not independent?
4.47 How can you use Bayes’ theorem to revise probabilities in light of new information?
4.48 What is the difference between a permutation and a combination?

APPLYING THE CONCEPTS
4.49 The breakdown by home address of the previous year’s 993 drink-driving offences in Problem 2.67 is:

| Home address | Number of drink-driving offences |
|---|---|
| Local – in council area | |
| Seaside town | 151 |
| Not seaside town | 462 |
| Not local – not in council area | |
| Intrastate (within state) | 130 |
| Interstate (another state) | 228 |
| International (outside Australia) | 22 |

If a drink-driving offender is selected at random, what is the probability that:
a. the offender is local?
b. the offender is from another state?
c. a non-local offender is from another state?
d. a local offender is from outside the seaside town?
e. the offender is from outside the state?

4.50 In a school of 200 students 95% are vaccinated against a certain disease. During a recent outbreak of this disease 20 students, including 11 vaccinated students, developed the disease.
a. Find the probability that a student
i. who has the disease has been vaccinated
ii. who has been vaccinated catches the disease
iii. who is unvaccinated catches the disease
b. A parent states that vaccination is ineffective as more than 50% of those who developed the disease had been vaccinated. Comment on this.

4.51 The Melbourne Cup, held on the first Tuesday in November, has 24 horses entered in it.
a. What is the probability of winning a prize in an office sweep (where horses are randomly allocated), if prizes are given for first, second and third places?
b. In a trifecta three horses are selected to finish first, second and third in the correct order. How many possible trifectas are there in the Melbourne Cup?
c. How many combinations of the winning three horses are not trifectas – that is, the selected horses finish first, second and third but not in the correct order?
d. Suppose that you have a sweep ticket (where horses are randomly allocated) for the trifecta. What is your probability of winning the major prize (the trifecta) or a consolation prize (you have the three winning horses but in the wrong order)?

4.52 In March 2013, 26.8% of New South Wales dwellings suitable for a rainwater tank had one installed. Of the dwellings with a rainwater tank, 53.1% had the rainwater tank plumbed into the dwelling (Australian Bureau of Statistics, Environmental Issues: Water Use and Conservation, Mar 2013, Cat. No. 4602.0.55.003 <www.abs.gov.au> accessed 4 November 2013).
a. Complete the following contingency table for this problem:

| | Plumbed into dwelling | Not plumbed into dwelling | Total |
|---|---|---|---|
| Rainwater tank | | | |
| No rainwater tank | | | |
| Total | | | 0.0000 |

b. From part (a) or otherwise, answer the following, to four decimal places:
i. What proportion of suitable New South Wales dwellings have a rainwater tank that is not plumbed into the dwelling?
ii. What percentage of New South Wales dwellings that have a rainwater tank do not have the tank plumbed into the dwelling?
iii. What proportion of New South Wales dwellings that are suitable for a rainwater tank do not have one?
c. There are an estimated 2,268,800 dwellings in New South Wales that are suitable for a rainwater tank. Estimate the number of dwellings with a rainwater tank plumbed into the dwelling.

4.53 When calculating premiums on life insurance products, insurance companies often use life tables that enable the probability of a person dying in any age interval to be calculated. The following data, obtained from New Zealand Abridged Period Life Table: 2014–2016, give the number out of 100,000 New Zealand-born females and males who are still alive at each exact age, in five-year steps from age 20 to age 60 (inclusive).

Number alive at exact age

| Exact age (years) | Out of 100,000 females born | Out of 100,000 males born |
|---|---|---|
| 20 | 99,288 | 99,031 |
| 25 | 99,128 | 98,685 |
| 30 | 98,949 | 98,312 |
| 35 | 98,726 | 97,899 |
| 40 | 98,427 | 97,381 |
| 45 | 97,934 | 96,649 |
| 50 | 97,157 | 95,548 |
| 55 | 95,933 | 93,853 |
| 60 | 94,162 | 91,352 |

Data obtained from <www.stats.govt.nz> accessed June 2017. © Statistics New Zealand, and licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence

a. What is the probability that a New Zealand-born female will reach the age of 30?
b. What is the probability that a New Zealand-born female will reach the age of 45?
c. What is the probability that a 20-year-old New Zealand-born female will reach the age of 30?
d. What is the probability that a 20-year-old New Zealand-born female will reach the age of 40?
e. A 30-year-old New Zealand-born female has purchased a term life policy that will pay her estate a million dollars if she dies within five years. What is the probability that the insurance company will pay her estate this amount?
f. Repeat (a) to (e) for New Zealand-born males.

4.54 In a certain region, during a recent outbreak of a preventable disease 0.1% of primary school children caught the disease; of these 30% were vaccinated against it. Furthermore, of those who did not catch the disease 80% were vaccinated.
a. What percentage of vaccinated children caught the disease?
b. What percentage of unvaccinated children caught the disease?
c. What percentage of primary school children in the region are vaccinated against this disease?

4.55 In an online test, 10 multiple-choice questions are randomly selected from a test bank of 100 questions.
a. If the order in which the questions appear is immaterial, how many different tests can be generated?
b. If the order in which the questions appear is important, how many different tests can be generated?

4.56 The employees of a company were surveyed and asked their educational background and marital status. Of the 600 employees, 400 had university degrees, 100 were single and 60 were single university graduates.
a. Construct a contingency table for this problem.
b. Find the probability that a randomly selected employee of the company is single or has a university degree.
c. What percentage of single employees have university degrees?
d. Are marital status and educational background statistically independent? Explain.

4.57 A researcher has completed a survey of 10,000 New Zealand viewers to determine which channel they watch on a weekday during the 6.30 pm to 7.30 pm time-slot, with the following results:

| Channel | Number |
|---|---|
| TV One | 3,160 |
| TV2 | 1,940 |
| TV3 | 2,190 |
| Prime | 860 |
| Maori Television | 650 |
| Other or none | 1,200 |

A surveyed viewer is chosen at random. Find the probability that during the 6.30 pm to 7.30 pm time-slot the viewer:
a. watches TV One
b. watches TV2 or TV3
c. watches Prime
d. does not watch TV One, TV2 or TV3

4.58 The following table classifies residents of a regional area of New South Wales by gender and age.

| Age groups | Males | Females | Persons |
|---|---|---|---|
| 0–4 years | 410 | 369 | 779 |
| 5–14 years | 952 | 861 | 1,813 |
| 15–19 years | 478 | 501 | 979 |
| 20–24 years | 594 | 559 | 1,153 |
| 25–34 years | 859 | 885 | 1,744 |
| 35–44 years | 886 | 974 | 1,860 |
| 45–54 years | 1,026 | 1,105 | 2,131 |
| 55–64 years | 1,097 | 1,033 | 2,130 |
| 65–74 years | 677 | 703 | 1,380 |
| 75–84 years | 333 | 492 | 825 |
| 85 years and over | 154 | 327 | 481 |
| Total | 7,466 | 7,809 | 15,275 |

Data obtained from Australian Bureau of Statistics, Census of Population and Housing: General Community Profile, Australia, 2016 <www.abs.gov.au> accessed June 2017

a. If a resident is chosen at random, what is the probability that the resident:
i. is male?
ii. is a female aged at least 65 years?
iii. is a child under 15 years?
b. What proportion of children, defined as under 15 years, are male?
c. Are the events ‘Child under 15’ and ‘Male’ statistically independent? Justify your answer.
d. What is the probability that a female chosen at random is at least 65 years?
e. Access the Community Profiles for the 2016 Census at <www.abs.gov.au> for a selected location in Australia and repeat parts (a) to (d).

4.59 The following table classifies residents of a regional area of Queensland by gender, age and hours of unpaid domestic work in the week before the 2016 Census.
Hours of unpaid domestic work (the first four columns are residents who did unpaid domestic work)

| | Less than 5 hours | 5–14 hours | 15–29 hours | 30 hours or more | Did no unpaid domestic work | Total |
|---|---|---|---|---|---|---|
| Males | | | | | | |
| 15–19 years | 681 | 97 | 7 | 0 | 602 | 1,387 |
| 20–24 years | 562 | 191 | 32 | 17 | 524 | 1,326 |
| 25–34 years | 1,176 | 768 | 134 | 54 | 712 | 2,844 |
| 35–44 years | 1,119 | 1,101 | 285 | 95 | 537 | 3,137 |
| 45–54 years | 1,045 | 1,068 | 264 | 106 | 663 | 3,146 |
| 55–64 years | 878 | 1,055 | 259 | 121 | 802 | 3,115 |
| 65–74 years | 488 | 738 | 317 | 146 | 780 | 2,469 |
| 75–84 years | 190 | 301 | 163 | 116 | 496 | 1,266 |
| 85 years and over | 67 | 77 | 48 | 39 | 273 | 504 |
| Total males | 6,206 | 5,396 | 1,509 | 694 | 5,389 | 19,194 |
| Females | | | | | | |
| 15–19 years | 688 | 123 | 13 | 6 | 480 | 1,310 |
| 20–24 years | 539 | 365 | 70 | 49 | 290 | 1,313 |
| 25–34 years | 845 | 1,070 | 405 | 450 | 342 | 3,112 |
| 35–44 years | 520 | 1,127 | 753 | 725 | 275 | 3,400 |
| 45–54 years | 620 | 1,369 | 722 | 453 | 356 | 3,520 |
| 55–64 years | 529 | 1,339 | 655 | 416 | 527 | 3,466 |
| 65–74 years | 238 | 669 | 595 | 461 | 639 | 2,602 |
| 75–84 years | 160 | 279 | 240 | 234 | 537 | 1,450 |
| 85 years and over | 107 | 118 | 71 | 46 | 483 | 825 |
| Total females | 4,246 | 6,459 | 3,524 | 2,840 | 3,929 | 20,998 |

Data obtained from Australian Bureau of Statistics, Census of Population and Housing: General Community Profile, Australia, 2016 <www.abs.gov.au> accessed June 2017

a. If a resident is chosen at random, what is the probability that the resident:
i. did unpaid domestic work?
ii. did no unpaid domestic work and is female?
iii. did unpaid domestic work and is male?
iv. did at least 15 hours’ unpaid domestic work and is male?
v. did no unpaid domestic work and is male?
b. What proportion of male residents did unpaid domestic work?
c. What percentage of female residents did unpaid domestic work?
d. From parts (a) and (b), are the events ‘Male’ and ‘Did unpaid domestic work’ statistically independent? Justify your answer.
e. What proportion of men do:
i. at least 15 hours of unpaid domestic work?
ii. less than five hours of unpaid domestic work (including no unpaid domestic work)?
f. What proportion of women do:
i. at least 15 hours of unpaid domestic work?
ii. less than five hours of unpaid domestic work (including no unpaid domestic work)?
g. From parts (e) and (f), can you conclude that men do less unpaid domestic work than women?
h. What proportion of male residents aged at least 65 did no unpaid domestic work?
i. What percentage of female residents aged at least 65 did no unpaid domestic work?
j. What proportion of male residents aged under 35 did unpaid domestic work?
k. What percentage of female residents aged under 35 did unpaid domestic work?
l. What conclusions can you draw from parts (h) to (k)?
m. Access the Community Profiles for the 2016 Census at <www.abs.gov.au> for a selected location in Australia and repeat parts (a) to (l).

4.60 In a town, 45% of all households have a pet, 35% have children, and 40% of all households with children have a pet. Using these definitions:
P = event household has a pet
C = event household has children
a. Complete the following contingency table.

| | P | P′ | Total |
|---|---|---|---|
| C | | | |
| C′ | | | |
| Total | | | |

b. From part (a) or otherwise, answer the following:
i. What is the probability that a randomly selected household has neither pets nor children?
ii. What proportion of households with children do not have a pet?
iii. Find and interpret P(C | P).

Continuing cases

Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester.
The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey.
a For pairs of variables in the BBus student survey, calculate contingency tables and then calculate conditional and marginal probabilities.
b For pairs of variables in the MBA student survey, calculate contingency tables and then calculate conditional and marginal probabilities.
c Write a report summarising your conclusions.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a From the contingency tables constructed for selected variables in Chapter 2 for regional city 1 state A, calculate selected conditional and marginal probabilities.
b From the contingency tables constructed for selected variables in Chapter 2 for coastal city 1 state A, calculate selected conditional and marginal probabilities.
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.

Chapter 4 Excel Guide

EG4.1 BASIC PROBABILITY CONCEPTS

Simple and Joint Probability and the General Addition Rule
Key technique Use Excel arithmetic formulas.
Example Calculate simple and joint probabilities for the Table 4.1 data on discount accommodation voucher use and repeat festival attendance.
PHStat Use Simple & Joint Probabilities. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Simple & Joint Probabilities. In the new template, similar to the worksheet shown below, fill in the Sample Space area with the data.
In-depth Excel Use the COMPUTE worksheet of the Probabilities workbook as a template. The worksheet (shown in Figure EG4.1) already contains the Table 4.1 discount accommodation voucher use and repeat festival attendance data. For other problems, change the sample space table entries in the cell ranges C3:D4 and A5:D6.

Figure EG4.1 COMPUTE worksheet of the Probabilities workbook

The COMPUTE_FORMULAS worksheet gives the formulas to calculate the probabilities.

EG4.2 CONDITIONAL PROBABILITY
There is no PHStat command for conditional probability.
In-depth Excel Use the COMPUTE worksheet of the Probabilities workbook as a template. In this worksheet (shown in Figure EG4.1) the conditional probabilities are calculated in rows 28 to 35. The worksheet in Figure EG4.1 already contains the Table 4.1 data. For other problems, change the sample space table entries in the cell ranges C3:D4 and A5:D6.

EG4.3 BAYES’ THEOREM
Key technique Use Excel arithmetic formulas.
Example Apply Bayes’ theorem to the television marketing example in Section 4.3.
In-depth Excel Use the COMPUTE worksheet of the Bayes workbook as a template. The worksheet (shown in Figure EG4.2) already contains the probabilities for the Section 4.3 example. For other problems, change those probabilities in the cell range B5:C6.

Figure EG4.2 COMPUTE worksheet of the Bayes workbook

The COMPUTE_FORMULAS worksheet gives the formulas to calculate the probabilities, which are also shown as an inset to the worksheet in Figure EG4.2.

EG4.4 COUNTING RULES

Counting Rule 1
In-depth Excel Use the POWER(k, n) worksheet function in a cell formula to calculate the number of outcomes given k events and n trials.
For example, the formula =POWER(6, 2) calculates the answer for Example 4.12 on page 169.

Counting Rule 2
In-depth Excel Use a formula that takes the product of successive POWER(k, n) functions to solve problems related to counting rule 2. For example, the formula =POWER(26, 3) * POWER(10, 3) calculates the answer for Example 4.13 New South Wales vehicle number plates on page 169.

Counting Rule 3
In-depth Excel Use the FACT(n) worksheet function in a cell formula to calculate how many ways n items can be arranged. For example, the formula =FACT(6) calculates 6!, the answer to Example 4.15 on page 170.

Counting Rule 4
In-depth Excel Use the PERMUT(n, x) worksheet function in a cell formula to calculate the number of ways of arranging in order x objects selected from n objects. For example, the formula =PERMUT(6, 4) calculates the answer for Example 4.16 on page 170.

Counting Rule 5
In-depth Excel Use the COMBIN(n, x) worksheet function in a cell formula to calculate the number of ways of selecting x objects from n objects, irrespective of order. For example, the formula =COMBIN(6, 4) calculates the answer for Example 4.17 on page 171.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

CHAPTER 5
Some important discrete probability distributions

GAIA ADVENTURE TOURS
Tours and activities for Gaia Adventure Tours (see Chapter 4) are booked online. Potential customers can submit an online enquiry, which Gaia Adventure Tours advertises will be answered within 45 minutes between 7 am and 11 pm by a knowledgeable local adventure tour consultant.
Yang, who is in charge of Gaia Adventure Tours’ online enquiry and booking procedures, is investigating several key performance indicators (KPIs); in particular:
■ the proportion of online enquiries converted to bookings
■ the number of online enquiries received in 1 hour
■ the proportion of online enquiries submitted between 7 am and 11 pm answered within 45 minutes.

Recent data collected by Yang show that:
■ 10% of online enquiries are converted to bookings
■ on average, Gaia Adventure Tours receives 30 online enquiries an hour between 7 am and 11 pm
■ with the current levels of staffing for enquiries:
– when 24 or more online enquiries are received in 30 minutes, enquiries start to queue and may not be answered within the stated 45 minutes
– when fewer than five enquiries are received in 20 minutes, enquiry staff have significant idle time.

Yang would like to determine the probability of a given number of online enquiries being converted to confirmed bookings in a sample of a specific size. In addition, to help determine optimal enquiry staffing levels, Yang would like to calculate the probability of receiving 24 or more online enquiries in any 30 minutes or fewer than five online enquiries in any 20 minutes. Answers to these questions and others can help Gaia Adventure Tours to develop future sales, marketing and staffing strategies.
© Georgejmclittle/Shutterstock/Pearson Education Ltd

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 recognise and apply the properties of a probability distribution
2 calculate the expected value and variance of a probability distribution
3 calculate average return and measure risk associated with various investment proposals
4 identify situations that can be modelled by a binomial distribution and calculate binomial probabilities
5 identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities
6 identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities

To help answer the given probability questions Yang can use a model, or small-scale representation, that approximates the online enquiry process, allowing inferences to be made about the process. Although model building is a difficult task for some endeavours, in this case Yang can use probability distributions, which are mathematical models suitable for solving these types of probability questions. This chapter introduces probability distributions and explains how to apply the binomial, Poisson and hypergeometric distributions to business and other problems.

5.1 PROBABILITY DISTRIBUTION FOR A DISCRETE RANDOM VARIABLE
A numerical variable (see Chapter 1) is a variable that yields numerical responses such as the number of magazines you subscribe to or your height in centimetres. Numerical variables are classified as either continuous or discrete. Continuous numerical variables have outcomes that arise from a measuring process, for example your height or weight.
Discrete numerical variables have outcomes that arise from a counting process, such as the number of magazines you subscribe to or the number of phone calls received in an hour. This chapter introduces probability distributions that represent discrete numerical variables; continuous probability distributions are discussed in Chapter 6.

A probability distribution for a discrete random variable is a mutually exclusive list of all possible numerical outcomes of the random variable with the probability of occurrence associated with each outcome.

LEARNING OBJECTIVE 1
Recognise and apply the properties of a probability distribution

probability distribution for a discrete random variable
Values of a discrete random variable with the corresponding probability of occurrence.

For a probability distribution for a discrete random variable:
1. all probabilities must be between 0 and 1 inclusive; that is, \( 0 \le P(X) \le 1 \)
2. the sum of the probabilities must equal 1; that is, \( \sum P(X) = 1 \).

As an example, Table 5.1 gives the distribution of the number of home mortgages approved per week by the loans manager at a local branch of Check$mart Bank. From this we can see that the loans manager approves no more than six home mortgages per week as the list in Table 5.1 is collectively exhaustive. Furthermore, as one of the outcomes must happen – that is, between none and six mortgages approved – the probabilities must sum to 1. Figure 5.1 is a graphical representation of Table 5.1.
Table 5.1 Probability distribution of the number of home mortgages approved per week

| Home mortgages approved per week | Probability |
|---|---|
| 0 | 0.10 |
| 1 | 0.10 |
| 2 | 0.20 |
| 3 | 0.30 |
| 4 | 0.15 |
| 5 | 0.10 |
| 6 | 0.05 |

Figure 5.1 Probability distribution of the number of home mortgages approved per week (vertical axis: P(X); horizontal axis: home mortgages approved per week, X)

LEARNING OBJECTIVE 2
Calculate the expected value and variance of a probability distribution

expected value of a discrete random variable
Measure of central tendency; the mean of a discrete random variable.

Expected Value of a Discrete Random Variable
In Chapter 3 we used the sample mean and variance to describe the centre and variation of a sample. In the same way, we can use the mean and variance of a random variable to describe the centre and variation of a probability distribution.

The mean μ of a probability distribution is the expected value of its random variable. To calculate the expected value of a discrete random variable multiply each outcome X by its corresponding probability P(X) and then sum these products.

EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE
\[ \mu = E(X) = \sum_{i=1}^{N} X_i P(X_i) \] (5.1)
where
\( X_i \) = the ith outcome of the discrete random variable X
\( P(X_i) \) = probability of occurrence of the ith outcome of X

Using Equation 5.1 the mean, or expected value, for the probability distribution of the number of home mortgages approved per week is:
\[ \mu = E(X) = \sum_{i=1}^{N} X_i P(X_i) \]
\[ = (0 \times 0.1) + (1 \times 0.1) + (2 \times 0.2) + (3 \times 0.3) + (4 \times 0.15) + (5 \times 0.1) + (6 \times 0.05) \]
\[ = 0 + 0.1 + 0.4 + 0.9 + 0.6 + 0.5 + 0.3 = 2.8 \]

The actual number of mortgages approved in a given week must be an integer value, so 2.8 mortgages are never approved in one week. However, on average, or in the long run, 2.8 are approved per week.
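The calculation in Equation 5.1 can be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the original text: the function name `expected_value` and the tolerance used for the sum-to-one check are illustrative assumptions, and the data are the Table 5.1 values.

```python
def expected_value(outcomes, probabilities):
    """Mean of a discrete random variable: sum of X_i * P(X_i) (Equation 5.1)."""
    # Check the two properties of a discrete probability distribution first.
    if any(p < 0 or p > 1 for p in probabilities):
        raise ValueError("each probability must lie between 0 and 1 inclusive")
    if abs(sum(probabilities) - 1.0) > 1e-9:  # tolerance is an assumption
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))


# Table 5.1: home mortgages approved per week and their probabilities.
mortgages = [0, 1, 2, 3, 4, 5, 6]
probabilities = [0.10, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05]

mu = expected_value(mortgages, probabilities)
print(round(mu, 2))  # 2.8
```

Checking the two distribution properties before summing means an invalid table raises an error instead of silently producing a meaningless mean.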
Variance and Standard Deviation of a Discrete Random Variable
The variance of a discrete probability distribution is calculated by multiplying each squared deviation from the mean \( [X_i - E(X)]^2 \) by its corresponding probability \( P(X_i) \) and then summing the resulting products. Equations 5.2a and 5.3 define, respectively, the variance of a discrete random variable and the standard deviation of a discrete random variable.

VARIANCE OF A DISCRETE RANDOM VARIABLE – DEFINITION FORMULA
\[ \sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2 P(X_i) \] (5.2a)
where
\( X_i \) = the ith outcome of the discrete random variable X
\( P(X_i) \) = probability of occurrence of the ith outcome of X

variance of a discrete random variable
Measure of variation, based on squared deviations from the mean; directly related to the standard deviation.

standard deviation of a discrete random variable
Measure of variation, based on squared deviations from the mean; directly related to the variance.

As for the sample variance, we can use algebra to obtain an alternative calculation formula.
VARIANCE OF A DISCRETE RANDOM VARIABLE – CALCULATION FORMULA
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - E(X)^2 \] (5.2b)
where
\[ \sum_{i=1}^{N} X_i^2 P(X_i) = X_1^2 P(X_1) + X_2^2 P(X_2) + \dots + X_N^2 P(X_N) \]

STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
The standard deviation of a discrete random variable is the square root of the variance
\[ \sigma = \sqrt{\sigma^2} \] (5.3)

Using Equations 5.2b and 5.3, the variance and standard deviation for the probability distribution of the number of mortgages approved per week are:
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - E(X)^2 \]
\[ = [(0^2 \times 0.1) + (1^2 \times 0.1) + (2^2 \times 0.2) + (3^2 \times 0.3) + (4^2 \times 0.15) + (5^2 \times 0.1) + (6^2 \times 0.05)] - 2.8^2 \]
\[ = [(0 \times 0.1) + (1 \times 0.1) + (4 \times 0.2) + (9 \times 0.3) + (16 \times 0.15) + (25 \times 0.1) + (36 \times 0.05)] - 7.84 \]
\[ = (0 + 0.1 + 0.8 + 2.7 + 2.4 + 2.5 + 1.8) - 7.84 = 10.3 - 7.84 = 2.46 \]
\[ \sigma = \sqrt{\sigma^2} = \sqrt{2.46} = 1.568\ldots \]

Alternatively, a table format can be used to calculate the mean and variance. In Table 5.2, the mean number of home mortgages approved per week is calculated. Then, using Equation 5.2b:
\[ \sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - E(X)^2 = 10.3 - (2.8)^2 = 2.46 \]

Table 5.2 Calculating the mean and variance of the number of home mortgages approved per week

| Home mortgages approved per week \( X_i \) | \( P(X_i) \) | \( X_i P(X_i) \) | \( X_i^2 P(X_i) \) |
|---|---|---|---|
| 0 | 0.10 | 0.0 | 0.0 |
| 1 | 0.10 | 0.1 | 0.1 |
| 2 | 0.20 | 0.4 | 0.8 |
| 3 | 0.30 | 0.9 | 2.7 |
| 4 | 0.15 | 0.6 | 2.4 |
| 5 | 0.10 | 0.5 | 2.5 |
| 6 | 0.05 | 0.3 | 1.8 |
| | 1.00 | μ = E(X) = 2.8 | 10.3 |

The expected value is often used to measure the amount we can expect to gain or lose by undertaking a particular investment, while the standard deviation is used to measure the risk involved.

Problems for Section 5.1

LEARNING THE BASICS
5.1 Given the following probability distributions:

| Distribution A: X | P(X) | Distribution B: X | P(X) |
|---|---|---|---|
| 0 | 0.50 | 0 | 0.05 |
| 1 | 0.20 | 1 | 0.10 |
| 2 | 0.15 | 2 | 0.15 |
| 3 | 0.10 | 3 | 0.20 |
| 4 | 0.05 | 4 | 0.50 |

a. Calculate the expected value for each distribution.
b. Calculate the standard deviation for each distribution.
c. Compare and contrast the results of distributions A and B.

5.2 Are each of the following a valid probability distribution? Justify your answers:
Distribution A Distribution B Distribution C Distribution D
X P(X) X P(X) X P(X) X P(X)
0.2 0 0.1 0.250 0.500 0 0.2 -1 1 0.9 1 0.2 0.500 0.250 1 0.1 2 2 0.3 1.000 0.250 2 0.4 -0.1 3 0.3 3 0.5

APPLYING THE CONCEPTS
5.3 Using the company records for the past 500 working days, the manager of Konig Motors has summarised the number of cars sold per day in the following table:

| Number of cars sold per day | Frequency of occurrence |
|---|---|
| 0 | 40 |
| 1 | 100 |
| 2 | 142 |
| 3 | 66 |
| 4 | 36 |
| 5 | 30 |
| 6 | 26 |
| 7 | 20 |
| 8 | 16 |
| 9 | 14 |
| 10 | 8 |
| 11 | 2 |
| Total | 500 |

a. Form the probability distribution for the number of cars sold per day.
b. Calculate the mean or expected number of cars sold per day.
c. Calculate the standard deviation.

5.4 The manager of a large computer network has developed the following probability distribution of the number of interruptions per day:

| Interruptions (X) | P(X) |
|---|---|
| 0 | 0.32 |
| 1 | 0.35 |
| 2 | 0.18 |
| 3 | 0.08 |
| 4 | 0.04 |
| 5 | 0.02 |
| 6 | 0.01 |

a. Calculate the mean or expected number of interruptions per day.
b. Calculate the standard deviation.

5.5 In the casino version of the traditional Australian game of two-up, a spinner stands in a ring and tosses two coins into the air. The coins may land showing two heads, two tails or one tail and one head (odds). Players can bet on either heads or tails at odds of one to one. Therefore, if a player bets $1 on heads, the player will win $1 if the coins land on heads but lose $1 if the coins land on tails. Alternatively, if a player bets $1 on tails, the player will win $1 if the coins land on tails but lose $1 if the coins land on heads.
If the coins land on odds, all bets are frozen and the spinner tosses again until either heads or tails comes up. If five odds are tossed in a row all players lose.
a. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on heads.
b. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on tails.
c. What is the expected long-run profit (or loss) to the player?

5.2 COVARIANCE AND ITS APPLICATION IN FINANCE

LEARNING OBJECTIVE 3
Calculate average return and measure risk associated with various investment proposals

In Section 5.1 the expected value, variance and standard deviation of a discrete random variable are discussed. In this section the covariance between two discrete random variables is introduced and then applied to portfolio management, a topic of interest to financial analysts.

Covariance
Covariance, \( \sigma_{XY} \), is a measure of the strength of the relationship between two random variables, X and Y. A positive covariance indicates a positive relationship, while a negative covariance indicates a negative relationship. If the two variables are independent then their covariance is zero. Equation 5.4a defines the covariance between two discrete random variables.

covariance
Measure of the strength of the linear relationship between two numerical variables.

COVARIANCE – DEFINITION FORMULA
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} [X_i - E(X)][Y_j - E(Y)]\, P(X_i \text{ and } Y_j) \] (5.4a)
where \( X_i \) is the ith outcome of the discrete random variable X, and \( Y_j \) is the jth outcome of the discrete random variable Y.

As for the sample covariance, we can use algebra to obtain an alternative calculation formula.
COVARIANCE – CALCULATION FORMULA
\[ \sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} X_i Y_j\, P(X_i \text{ and } Y_j) - E(X)E(Y) \] (5.4b)

To illustrate covariance, suppose that we are deciding between two alternative investments for the coming year. The first investment is a mutual fund that consists of shares that are expected to do well when economic conditions are strong. The second investment is a mutual fund that is expected to perform best when economic conditions are weak. Our estimate of the returns for each investment (per $1,000 investment) under three economic conditions, each with a given probability of occurrence, is summarised in Table 5.3.

Table 5.3 Estimated returns for each investment under three economic conditions

| \( P(X_iY_i) \) [= \( P(X_i) \) = \( P(Y_i) \)] | Economic condition | Investment \( X_i \) | Investment \( Y_i \) | \( X_iY_i \) | \( X_iP(X_i) \) | \( Y_iP(Y_i) \) | \( X_iY_iP(X_iY_i) \) |
|---|---|---|---|---|---|---|---|
| 0.2 | Recession | −$100 | +$200 | −20,000 | −20 | 40 | −4,000 |
| 0.5 | Stable economy | +100 | +50 | 5,000 | 50 | 25 | 2,500 |
| 0.3 | Expanding economy | +250 | −100 | −25,000 | 75 | −30 | −7,500 |
| 1.0 | | | | | E(X) = 105 | E(Y) = 35 | −9,000 |

The expected value and standard deviation for each investment are calculated as follows.
Let X = strong-economy fund, and Y = weak-economy fund:
\[ E(X) = \mu_X = (-100)(0.2) + (100)(0.5) + (250)(0.3) = \$105 \]
\[ E(Y) = \mu_Y = (+200)(0.2) + (50)(0.5) + (-100)(0.3) = \$35 \]
\[ \sigma_X^2 = [((-100)^2 \times 0.2) + (100^2 \times 0.5) + (250^2 \times 0.3)] - 105^2 = 25{,}750 - 11{,}025 = 14{,}725 \]
\[ \sigma_X = \sqrt{\sigma_X^2} = \sqrt{14{,}725} = 121.346\ldots \approx \$121.35 \]
\[ \sigma_Y^2 = [(200^2 \times 0.2) + (50^2 \times 0.5) + ((-100)^2 \times 0.3)] - 35^2 = 12{,}250 - 1{,}225 = 11{,}025 \]
\[ \sigma_Y = \sqrt{\sigma_Y^2} = \sqrt{11{,}025} = \$105.00 \]

In the calculation of the covariance, the only non-zero probabilities are:
P(X = −$100 and Y = $200) = 0.2
P(X = $100 and Y = $50) = 0.5
P(X = $250 and Y = −$100) = 0.3

We therefore have:
\[ \sigma_{XY} = [(-100 \times 200 \times 0.2) + (100 \times 50 \times 0.5) + (250 \times (-100) \times 0.3)] - (105 \times 35) = -9{,}000 - 3{,}675 = -12{,}675 \]

expected value of the sum of two random variables
Measure of central tendency; mean of the sum of two random variables.

variance of the sum of two random variables
Measure of variation; directly related to the standard deviation.

standard deviation of the sum of two random variables
Measure of variation; directly related to the variance.

Thus, the strong-economy fund has a higher expected value (i.e. larger expected return) than the weak-economy fund but has a higher standard deviation (i.e. more risk). The covariance of −12,675 between the two investments indicates a negative relationship in which the two investments are varying in the opposite direction. Therefore, when the return on one investment is high, the return on the other is typically low.
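The expected values, variances and covariance for the two funds can be recomputed from the Table 5.3 data. The following is a minimal Python sketch, not part of the original text: the function name `joint_summary` and the dictionary keys are illustrative, and the code assumes, as in Table 5.3, that each economic condition fixes both returns, so only the three matched (X, Y) pairs have non-zero joint probability.

```python
import math


def joint_summary(x, y, probs):
    """Means, variances and covariance via the calculation formulas
    (Equations 5.1, 5.2b and 5.4b), assuming outcome i of X always
    occurs together with outcome i of Y with probability probs[i]."""
    mean_x = sum(xi * p for xi, p in zip(x, probs))
    mean_y = sum(yi * p for yi, p in zip(y, probs))
    var_x = sum(xi ** 2 * p for xi, p in zip(x, probs)) - mean_x ** 2
    var_y = sum(yi ** 2 * p for yi, p in zip(y, probs)) - mean_y ** 2
    cov_xy = sum(xi * yi * p for xi, yi, p in zip(x, y, probs)) - mean_x * mean_y
    return {"mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy}


# Table 5.3: recession, stable economy, expanding economy.
probs = [0.2, 0.5, 0.3]
strong_fund = [-100, 100, 250]   # X, return per $1,000 invested
weak_fund = [200, 50, -100]      # Y, return per $1,000 invested

summary = joint_summary(strong_fund, weak_fund, probs)
for name, value in summary.items():
    print(name, round(value, 2))
print("sd_x", round(math.sqrt(summary["var_x"]), 2))
print("sd_y", round(math.sqrt(summary["var_y"]), 2))
```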
Expected Value, Variance and Standard Deviation of the Sum of Two Random Variables

Equation 5.4a defined the covariance between two discrete random variables, X and Y. Now, the expected value of the sum of two random variables, variance of the sum of two random variables and standard deviation of the sum of two random variables are defined.

EXPECTED VALUE OF THE SUM OF TWO RANDOM VARIABLES
The expected value of the sum of two random variables is equal to the sum of the expected values.

$$E(X + Y) = E(X) + E(Y)$$

Alternatively:

$$\mu_{X+Y} = \mu_X + \mu_Y \qquad (5.5)$$

VARIANCE OF THE SUM OF TWO RANDOM VARIABLES
The variance of the sum of two random variables is equal to the sum of the variances plus twice the covariance.

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY} \qquad (5.6)$$

STANDARD DEVIATION OF THE SUM OF TWO RANDOM VARIABLES
The standard deviation is the square root of the variance.

$$\sigma_{X+Y} = \sqrt{\sigma_{X+Y}^2} \qquad (5.7)$$

To illustrate the expected value, variance and standard deviation of the sum of two random variables, consider the two investments previously discussed. Using Equations 5.5, 5.6 and 5.7:

$$\mu_{X+Y} = E(X + Y) = E(X) + E(Y) = 105 + 35 = \$140$$
$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY} = (14{,}725 + 11{,}025) + 2 \times (-12{,}675) = 400$$
$$\sigma_{X+Y} = \sqrt{400} = \$20$$

The expected return of the sum of the strong-economy fund and the weak-economy fund is $140 with a standard deviation of $20. The standard deviation of the sum of the two investments is much less than the standard deviation of either single investment because there is a large negative covariance between the investments.
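As a cross-check on Equations 5.5 to 5.7, the same results can be obtained directly from the distribution of X + Y implied by Table 5.3 (sums of 100, 150 and 150 with probabilities 0.2, 0.5 and 0.3). The Python sketch below is illustrative only; the variable names are assumptions, and the input values are those calculated in the worked example.

```python
import math

# Values from the worked example (per $1,000 invested).
mean_x, mean_y = 105.0, 35.0
var_x, var_y = 14_725.0, 11_025.0
cov_xy = -12_675.0

# Route 1: the formulas.
mean_sum = mean_x + mean_y               # Equation 5.5
var_sum = var_x + var_y + 2 * cov_xy     # Equation 5.6
sd_sum = math.sqrt(var_sum)              # Equation 5.7

# Route 2: the distribution of X + Y, built from Table 5.3.
sum_distribution = [(0.2, -100 + 200), (0.5, 100 + 50), (0.3, 250 - 100)]
mean_direct = sum(p * s for p, s in sum_distribution)
var_direct = sum(p * s**2 for p, s in sum_distribution) - mean_direct**2

print(f"formulas: mean = {mean_sum:.2f}, variance = {var_sum:.2f}, sd = {sd_sum:.2f}")
print(f"direct:   mean = {mean_direct:.2f}, variance = {var_direct:.2f}")
```

Both routes should give a mean of about 140 and a variance of about 400, which is why the standard deviation of the sum is only about $20.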
Portfolio Expected Return and Portfolio Risk

The concepts of covariance, expected return and standard deviation of the sum of two random variables can be applied to the study of investment portfolios where investors combine assets into portfolios to reduce their risk. The objective is to maximise the return while minimising the risk. For such portfolios, rather than studying the sum of two random variables, each investment is weighted by the proportion of assets assigned to that investment. Equations 5.8 and 5.9 define portfolio expected return and portfolio risk.

portfolio: A combined investment in two or more assets.
portfolio expected return: Measure of central tendency; mean return on investment.
portfolio risk: Measure of the variation of investment returns.

PORTFOLIO EXPECTED RETURN
The portfolio expected return for a two-asset investment is equal to the weight assigned to asset X multiplied by the expected return of asset X plus the weight assigned to asset Y multiplied by the expected return of asset Y:

$$E(P) = wE(X) + (1 - w)E(Y) \qquad (5.8)$$

where
E(P) = portfolio expected return
w = portion of the portfolio assigned to asset X, 0 ⩽ w ⩽ 1
1 – w = portion of the portfolio assigned to asset Y

PORTFOLIO RISK

$$\sigma_p = \sqrt{w^2\sigma_X^2 + (1 - w)^2\sigma_Y^2 + 2w(1 - w)\sigma_{XY}} \qquad (5.9)$$

In the previous example, the expected return and risk of two different investments were calculated, a strong-economy fund and a weak-economy fund. The covariance of the two investments was also calculated. Now, suppose that we wish to form a portfolio of these two investments that consists of an equal investment in each of these two funds.
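Equations 5.8 and 5.9 translate directly into two small functions. The sketch below is a minimal Python illustration; the function names `portfolio_expected_return` and `portfolio_risk` are hypothetical, and the inputs are the values already calculated for the two funds. It evaluates the equal-weight portfolio (w = 0.5).

```python
import math

def portfolio_expected_return(w, mean_x, mean_y):
    """Equation 5.8: weighted average of the two expected returns."""
    return w * mean_x + (1 - w) * mean_y

def portfolio_risk(w, var_x, var_y, cov_xy):
    """Equation 5.9: standard deviation of the weighted portfolio."""
    variance = w**2 * var_x + (1 - w)**2 * var_y + 2 * w * (1 - w) * cov_xy
    return math.sqrt(variance)

# Inputs from the strong-economy (X) and weak-economy (Y) fund example.
mean_x, mean_y = 105.0, 35.0
var_x, var_y, cov_xy = 14_725.0, 11_025.0, -12_675.0

w = 0.5  # equal investment in each fund
expected = portfolio_expected_return(w, mean_x, mean_y)
risk = portfolio_risk(w, var_x, var_y, cov_xy)
print(f"E(P) = {expected:.2f}, portfolio risk = {risk:.2f}")
```

Because the weight is a parameter, the same functions can be reused for other splits, such as the 30% and 70% allocations asked for in the problems for this section.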
To calculate the portfolio expected return and the portfolio risk, use Equations 5.8 and 5.9, with w = 0.5, to obtain:

$$E(P) = wE(X) + (1 - w)E(Y) = (0.5 \times 105) + (0.5 \times 35) = \$70$$
$$\sigma_p = \sqrt{(0.5)^2(14{,}725) + (1 - 0.5)^2(11{,}025) + 2(0.5)(1 - 0.5)(-12{,}675)} = \sqrt{100} = \$10$$

Thus, the portfolio has an expected return of $70 for each $1,000 invested (a return of 7%) and has a portfolio risk of $10. The portfolio risk here is small because there is a large negative covariance between the two investments. The fact that each investment performs best under different circumstances has reduced the overall risk of the portfolio. It is possible to use calculus to determine the minimum portfolio risk – which may occasionally be zero – but that is outside the scope of this textbook.

Problems for Section 5.2

LEARNING THE BASICS

5.6 Given the following probability distributions for variables X and Y:

| P(XiYi) | X | Y |
|---|---|---|
| 0.4 | 100 | 200 |
| 0.6 | 200 | 100 |

Calculate:
a. E(X) and E(Y)
b. σX and σY
c. σXY
d. E(X + Y)

5.7 Given the following probability distributions for variables X and Y:

| P(XiYi) | X | Y |
|---|---|---|
| 0.2 | -100 | 50 |
| 0.4 | 50 | 30 |
| 0.3 | 200 | 20 |
| 0.1 | 300 | 20 |

Calculate:
a. E(X) and E(Y)
b. σX and σY
c. σXY
d. E(X + Y)

5.8 Two investments, X and Y, have the following characteristics:
E(X) = $50, E(Y) = $100, σ²X = 9,000, σ²Y = 15,000 and σXY = 7,500
If the weight assigned to investment X of portfolio assets is 0.4, calculate:
a. the portfolio expected return
b. the portfolio risk

APPLYING THE CONCEPTS

5.9 The process of being served at a bank consists of two independent parts – the time waiting in line and the time it takes to be served by the teller. Suppose, at a branch of Check$mart, that the time waiting in line has an expected value of 4 minutes with a standard deviation of 1.2 minutes and the time it takes to be served by the teller has an expected value of 5.5 minutes with a standard deviation of 1.5 minutes. Calculate:
a. the expected value of the total time it takes to be served
b.
the standard deviation of the total time it takes to be served

5.10 For the investment example given in Table 5.3:
a. Calculate the portfolio expected return and the portfolio risk if:
i. 30% is invested in the strong-economy fund and 70% in the weak-economy fund
ii. 70% is invested in the strong-economy fund and 30% in the weak-economy fund
b. Which of the three investment strategies (30%, 50% or 70% in the strong-economy fund) would you recommend? Why?

5.11 You are developing a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

| Probability | Return on share X | Return on share Y |
|---|---|---|
| 0.1 | -$100 | $50 |
| 0.3 | 0 | 150 |
| 0.3 | 80 | -20 |
| 0.3 | 150 | -100 |

a. Calculate:
i. the expected return for share X and for share Y
ii. the standard deviation for share X and for share Y
iii. the covariance of share X and share Y
b. Would you invest in share X or share Y? Explain.

5.12 Suppose that in problem 5.11 you wanted to create a portfolio that consists of share X and share Y.
a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in share X:
i. 30%
ii. 50%
iii. 70%
b. On the basis of the results of your calculations in part (a), which portfolio would you recommend? Explain.

5.13 You are trying to set up a portfolio that consists of a corporate bond fund and a common share fund. The following information about the annual return (per $1,000) of each of these investments under different economic conditions is available, together with the probability that each of these economic conditions will occur.
| Probability | State of the economy | Corporate bond fund | Common share fund |
|---|---|---|---|
| 0.10 | Recession | -$30 | -$150 |
| 0.15 | Stagnation | 50 | -20 |
| 0.35 | Slow growth | 90 | 120 |
| 0.30 | Moderate growth | 100 | 160 |
| 0.10 | High growth | 110 | 250 |

a. Calculate:
i. the expected return for the corporate bond fund and for the common share fund
ii. the standard deviation for the corporate bond fund and for the common share fund
iii. the covariance of the corporate bond fund and the common share fund
b. Would you invest in the corporate bond fund or the common share fund? Explain.

5.14 Suppose that in problem 5.13 you wanted to create a portfolio that consists of a corporate bond fund and a common share fund.
a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in a corporate bond fund:
i. 30%
ii. 50%
iii. 70%
b. On the basis of the results of your calculations in (a), which portfolio would you recommend? Explain.

5.3 BINOMIAL DISTRIBUTION

The next three sections use mathematical models to solve business and other problems. A mathematical model is a mathematical expression representing a variable of interest. When a mathematical model of a discrete probability distribution is available, you can easily calculate the exact probability of occurrence of any particular outcome of the random variable.

The binomial distribution is one of the most important and widely used discrete probability distributions. The binomial distribution arises when the discrete random variable is the number of successes in a sample of n observations. The binomial distribution has four essential properties:
1. The sample consists of a fixed number of observations, n.
2. Each observation is classified into one of two mutually exclusive and collectively exhaustive categories, usually called success and failure.
3. The probability of an observation being classified as a success, p, is constant from observation to observation.
Thus, the probability of an observation being classified as a failure, 1 – p, is also constant for all observations.
4. The outcome (i.e. success or failure) of any observation is independent of the outcome of any other observation. To ensure independence, the observations can be randomly selected either from an infinite population without replacement or from a finite population with replacement.

LEARNING OBJECTIVE 4
Identify situations that can be modelled by a binomial distribution and calculate binomial probabilities

mathematical model: The mathematical representation of a random variable.
binomial distribution: Discrete probability distribution, where the random variable is the number of successes in a sample of n observations from either an infinite population or sampling with replacement.

The proportion of online enquiries that are converted to bookings is of interest in the Gaia Adventure Tours scenario, so Yang could define an online enquiry converted to a booking as a success and an online enquiry that is not converted to a booking as a failure. Yang would then be interested in the number of successes; that is, the number of online enquiries converted to bookings in a random sample of n online enquiries.

Note: In a binomial distribution, ‘success’ is usually defined as the outcome we are interested in – in this case an online enquiry converted to a booking.
This is a binomial situation because:
• a fixed number of online enquiries, n, is chosen
• each online enquiry is either converted to a booking – a success – or not converted – a failure
• 10% of online enquiries are converted to bookings, so the probability of a randomly chosen online enquiry being converted to a booking is p = 0.1 and that of a randomly chosen online enquiry not being converted to a booking is 1 – p = 0.9
• online enquiries are randomly selected, so the outcome, converted or not converted, of any enquiry is independent of the outcome of any other enquiry.

If Yang takes a random sample of four online enquiries, the binomial random variable defined as:

X = number of online enquiries converted to bookings

has a range from 0 to 4, as none, one, two, three or all four enquiries may be converted to bookings. In general, a binomial random variable has a range from 0 to n.

Suppose that Yang observes the following result in a sample of four enquiries:

| First enquiry | Second enquiry | Third enquiry | Fourth enquiry |
|---|---|---|---|
| Converted | Converted | Not converted | Converted |

What is the probability of having three successes (converted enquiries) in a sample of four enquiries in this particular sequence? Because the historical probability of enquiries converted to bookings is 0.10, the probability that each enquiry occurs in the sequence is:

| First enquiry | Second enquiry | Third enquiry | Fourth enquiry |
|---|---|---|---|
| p = 0.1 | p = 0.1 | 1 – p = 0.9 | p = 0.1 |

Each outcome is independent of the others because the enquiries are randomly selected. Therefore, the probability of having this particular sequence is:

$$pp(1 - p)p = p^3(1 - p) = (0.1)^3(0.9)^1 = 0.0009$$

This result indicates only the probability of three online enquiries converted to bookings (successes) out of a sample of four online enquiries in a specific sequence.
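Because the enquiries are independent, the probability of one specific ordered sequence is simply the product of the individual probabilities. The Python sketch below illustrates this for the observed sequence; the function name `sequence_probability` and the True/False encoding of converted and not converted are assumptions made for the example.

```python
import math

p = 0.1  # probability that an enquiry is converted (from the scenario)

# True = converted, False = not converted, in the order observed.
observed = [True, True, False, True]

def sequence_probability(outcomes, p_success):
    """Probability of one specific ordered sequence of independent outcomes."""
    return math.prod(p_success if converted else 1 - p_success
                     for converted in outcomes)

print(round(sequence_probability(observed, p), 6))  # 0.0009
```

Any other ordering containing three converted enquiries and one not converted has the same probability, since multiplication does not depend on order.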
The number of ways of selecting X objects from n objects irrespective of sequence is given by the counting rule for combinations introduced in Chapter 4 as Equation 4.14 and as Equation 5.10 below, introducing a different notation.

COMBINATIONS
The number of combinations of selecting X objects from n objects is given by

$$\binom{n}{X} = {}_nC_X = \frac{n!}{X!(n - X)!} \qquad (5.10)$$

where n factorial is defined by n! = n × (n – 1) × … × 2 × 1 and, by definition, 0! = 1.

Using Equation 5.10, we see that there are:

$${}_4C_3 = \frac{4!}{3!(4 - 3)!} = 4$$

sequences of three converted enquiries and one enquiry not converted. The four possible sequences are:

| Sequence | First enquiry | Second enquiry | Third enquiry | Fourth enquiry |
|---|---|---|---|---|
| Sequence 1 | Converted | Converted | Converted | Not converted |
| Sequence 2 | Converted | Converted | Not converted | Converted |
| Sequence 3 | Converted | Not converted | Converted | Converted |
| Sequence 4 | Not converted | Converted | Converted | Converted |

and the probability of each is:

$$p^3(1 - p) = (0.1)^3(0.9)^1 = 0.0009$$

Therefore, the probability of three converted enquiries out of four is equal to:

number of sequences × probability of sequence = 4 × 0.0009 = 0.0036

We can make similar, intuitive derivations for the other possible outcomes of the random variable – zero, one, two and four converted enquiries. However, as n, the sample size, gets larger, the calculations involved in using this approach become time-consuming. Instead, a mathematical model provides a formula to calculate any binomial probability. Equation 5.11 is the mathematical model that represents the binomial probability distribution and is used to calculate the probability of X successes for any given values of n and p.

BINOMIAL PROBABILITY DISTRIBUTION

$$P(X) = \frac{n!}{X!(n - X)!}\, p^X (1 - p)^{n - X} \qquad (5.11)$$

where
P(X) = probability of X successes given n and p
n = number of observations
p = probability of success
1 – p = probability of failure
X = number of successes (X = 0, 1, 2, …, n)

Equation 5.11 restates what we had intuitively derived. The binomial random variable X can have any integer value from 0 to n. In Equation 5.11 the product:

$$p^X (1 - p)^{n - X}$$

indicates the probability of exactly X successes out of n observations in a particular sequence. The term:

$$\frac{n!}{X!(n - X)!}$$

indicates how many combinations of the X successes out of n observations are possible. Hence, given the number of observations n and the probability of success p, the probability of X successes is:

$$P(X) = \text{number of sequences} \times \text{probability of sequence} = \frac{n!}{X!(n - X)!}\, p^X (1 - p)^{n - X}$$

Example 5.1 illustrates the use of Equation 5.11.

EXAMPLE 5.1 DETERMINING P(X = 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are three converted enquiries in a sample of four?

SOLUTION
Using Equation 5.11, the probability of three converted enquiries from a sample of four is:

$$P(X = 3) = \frac{4!}{3!(4 - 3)!}(0.1)^3(1 - 0.1)^{4 - 3} = 4 \times 0.001 \times 0.9 = 0.0036$$

Examples 5.2 and 5.3 give the calculations for other values of X.

EXAMPLE 5.2 DETERMINING P(X ⩾ 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are at least three converted enquiries in a sample of four?

SOLUTION
In Example 5.1 we found that the probability of exactly three converted enquiries from a sample of four is 0.0036. To calculate the probability of at least three converted enquiries, we need to add the probability of three converted enquiries to the probability of four converted enquiries.
The probability of four converted enquiries is:

$$P(X = 4) = \frac{4!}{4!(4 - 4)!}(0.1)^4(1 - 0.1)^{4 - 4} = 1 \times 0.0001 \times 1 = 0.0001$$

Thus, the probability of at least three converted enquiries is:

$$P(X \geq 3) = P(X = 3) + P(X = 4) = 0.0036 + 0.0001 = 0.0037$$

There is a 0.37% chance that there will be at least three converted enquiries in a sample of four.

EXAMPLE 5.3 DETERMINING P(X < 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are fewer than three converted enquiries in a sample of four?

SOLUTION
The probability that there are fewer than three converted enquiries is:

$$P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)$$

Use Equation 5.11 to calculate each of these probabilities:

$$P(X = 0) = \frac{4!}{0!(4 - 0)!}(0.1)^0(1 - 0.1)^{4 - 0} = 0.6561$$
$$P(X = 1) = \frac{4!}{1!(4 - 1)!}(0.1)^1(1 - 0.1)^{4 - 1} = 0.2916$$
$$P(X = 2) = \frac{4!}{2!(4 - 2)!}(0.1)^2(1 - 0.1)^{4 - 2} = 0.0486$$

Therefore, P(X < 3) = 0.6561 + 0.2916 + 0.0486 = 0.9963

Alternatively, P(X < 3) can also be calculated from its complement, P(X ⩾ 3), since:

$$P(X < 3) = 1 - P(X \geq 3) = 1 - 0.0037 = 0.9963$$

Calculations such as those in Example 5.3 can become tedious, especially as n gets large. To avoid computational drudgery, many binomial probabilities can be found directly from Table E.6 (Appendix E), a portion of which is reproduced in Table 5.4. Table E.6 provides binomial probabilities for X = 0, 1, 2, … , n for selected combinations of n and p. For example, to find the probability of exactly two successes in a sample of four when the probability of success is 0.1, first find n = 4 and then read off the required probability at the intersection of the row X = 2 and the column p = 0.10. Thus:

P(X = 2) = 0.0486

Table 5.4 Finding a binomial probability for n = 4, X = 2 and p = 0.1 (extracted from Table E.6)

| n | X | p = 0.01 | p = 0.02 | … | p = 0.10 |
|---|---|---|---|---|---|
| 4 | 0 | 0.9606 | 0.9224 | … | 0.6561 |
| | 1 | 0.0388 | 0.0753 | … | 0.2916 |
| | 2 | 0.0006 | 0.0023 | … | 0.0486 |
| | 3 | 0.0000 | 0.0000 | … | 0.0036 |
| | 4 | 0.0000 | 0.0000 | … | 0.0001 |

The binomial probabilities given in Table E.6 can also be calculated using Microsoft Excel. Figure 5.2 presents a Microsoft Excel worksheet for calculating binomial probabilities, using the Excel 2010 and later inbuilt binomial function BINOM.DIST(number_s,trials,probability_s,cumulative). For earlier versions of Excel the corresponding binomial function is BINOMDIST(number_s,trials,probability_s,cumulative).

Figure 5.2 Microsoft Excel worksheet for number of online enquiries converted to bookings example

| Row | A | B | Formula in column B |
|---|---|---|---|
| 1 | Binomial probabilities | | |
| 3 | Data | | |
| 4 | Sample size | 4 | |
| 5 | Probability of an event of interest | 0.1 | |
| 7 | Statistics | | |
| 8 | Mean | 0.4 | =B4 * B5 |
| 9 | Variance | 0.36 | =B8 * (1-B5) |
| 10 | Standard deviation | 0.6 | =SQRT(B9) |
| 12 | Binomial probabilities table | | |
| 13 | X | P(X) | |
| 14 | 0 | 0.6561 | =BINOM.DIST(A14, $B$4, $B$5, FALSE) |
| 15 | 1 | 0.2916 | =BINOM.DIST(A15, $B$4, $B$5, FALSE) |
| 16 | 2 | 0.0486 | =BINOM.DIST(A16, $B$4, $B$5, FALSE) |
| 17 | 3 | 0.0036 | =BINOM.DIST(A17, $B$4, $B$5, FALSE) |
| 18 | 4 | 0.0001 | =BINOM.DIST(A18, $B$4, $B$5, FALSE) |

The shape of a binomial probability distribution depends on the values of n and p. When p = 0.5, the binomial distribution is symmetrical, regardless of how large or small the value of n. When p ≠ 0.5, the distribution is skewed, to the right if p < 0.5 and to the left if p > 0.5. The closer p is to 0.5 and/or the larger the number of observations n, the less skewed the distribution. For example, the distribution of the number of converted online enquiries is highly skewed to the right because p = 0.1 and n = 4 (see Figure 5.3).
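For readers working outside Excel, Equation 5.11 can be coded in a few lines. The following is a minimal Python sketch that uses only the standard library; the function name `binomial_pmf` is an assumption chosen for the example. It reproduces the p = 0.10 column of Table 5.4 and the results of Examples 5.2 and 5.3.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable (Equation 5.11)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.1  # four enquiries, 10% conversion rate

# Full probability table for X = 0, 1, ..., n.
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}
for x, prob in table.items():
    print(f"{x}  {prob:.4f}")

# 'At least three' and its complement, 'fewer than three'.
at_least_three = table[3] + table[4]
fewer_than_three = 1 - at_least_three
print(f"P(X >= 3) = {at_least_three:.4f}")
print(f"P(X < 3)  = {fewer_than_three:.4f}")
```

Changing `n` and `p` gives the table for any other binomial situation, which also makes it easy to see the skewness described above, for instance by comparing p = 0.1 with p = 0.5.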
Figure 5.3 Microsoft Excel graph of the binomial probability distribution with n = 4 and p = 0.1 (vertical axis: P(X); horizontal axis: Number of successes)

Substituting the binomial probability equation (5.11) in the expected value equation (5.1) and using algebra to simplify, it can be shown that the mean of the binomial distribution is equal to the product of n and p, as shown in Equation 5.12. Therefore, use Equation 5.12 to calculate the mean of a binomial distribution, instead of Equation 5.1.

THE MEAN OF THE BINOMIAL DISTRIBUTION
The mean μ of the binomial distribution is equal to the sample size n multiplied by the probability of success p.

$$\mu = E(X) = np \qquad (5.12)$$

Therefore, on average, Yang can theoretically expect E(X) = 4 × 0.1 = 0.4 converted enquiries in a sample of four.

Similarly, by substituting the binomial probability equation (5.11) in the variance equation (5.2a or 5.2b) and using algebra to simplify, it can be shown that the standard deviation of the binomial distribution is given by Equation 5.13.

THE STANDARD DEVIATION OF THE BINOMIAL DISTRIBUTION

$$\sigma = \sqrt{\sigma^2} = \sqrt{np(1 - p)} \qquad (5.13)$$

Therefore, using Equation 5.13, the standard deviation of the number of converted enquiries is:

$$\sigma = \sqrt{4(0.1)(0.9)} = 0.60$$

EXAMPLE 5.4 CALCULATING BINOMIAL PROBABILITIES
Accuracy (measured as the percentage of orders consisting of a main item, side item and drink that are filled correctly) in taking orders at the drive-through window is an important feature for fast-food chains.
Suppose that records for a recent month show that the percentage of correct orders of this type filled at a Hungry Jack’s franchise was 88%. Suppose three friends go to the drive-through window at this Hungry Jack’s franchise and each places an order of the type just mentioned.
• What is the probability that:
– all three orders will be filled correctly?
– none of the three will be filled correctly?
– at least two of the three will be filled correctly?
• What are the mean and standard deviation of the number of orders filled correctly?

SOLUTION
There are three orders and the probability of any order being accurate is 0.88. Therefore:

X = number of orders filled correctly = 0, 1, 2, 3

is a binomial random variable with n = 3, p = 0.88. Using Equations 5.11, 5.12 and 5.13:

$$P(X = 3) = \frac{3!}{3!(3 - 3)!}(0.88)^3(1 - 0.88)^{3 - 3} = 1 \times 0.68147\ldots \times 1 = 0.68147\ldots$$
$$P(X = 0) = \frac{3!}{0!(3 - 0)!}(0.88)^0(1 - 0.88)^{3 - 0} = 1 \times 1 \times 0.00172\ldots = 0.00172\ldots$$
$$P(X = 2) = \frac{3!}{2!(3 - 2)!}(0.88)^2(1 - 0.88)^{3 - 2} = 3 \times 0.7744 \times 0.12 = 0.27878\ldots$$
$$P(X \geq 2) = P(X = 2) + P(X = 3) = 0.27878\ldots + 0.68147\ldots = 0.96025\ldots$$
$$\mu = E(X) = 3 \times 0.88 = 2.64$$
$$\sigma = \sqrt{np(1 - p)} = \sqrt{3 \times 0.88 \times 0.12} = 0.5628\ldots$$

The probability that all three orders are filled correctly is 0.6815. The probability that none of the orders is filled correctly is 0.0017. The probability that at least two orders are filled correctly is 0.9603. The mean number of accurate orders filled in a sample of three orders is 2.64 and the standard deviation is 0.563.

This section introduced the binomial distribution and applied it to business and other problems. The binomial distribution plays an important role when it is used in statistical inference problems involving the estimation or testing of hypotheses about proportions (discussed in Chapters 8 and 9).
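The results of Example 5.4 can be verified numerically, and the shortcut formulas for the mean and standard deviation (Equations 5.12 and 5.13) can be compared with the values obtained by working directly from the probability distribution. This is a minimal Python sketch under the example's assumptions (n = 3, p = 0.88); the helper name `binomial_pmf` is illustrative.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable (Equation 5.11)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 3, 0.88  # three orders, 88% filled correctly
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

p_all_three = pmf[3]
p_none = pmf[0]
p_at_least_two = pmf[2] + pmf[3]

# Shortcut formulas (Equations 5.12 and 5.13).
mean_formula = n * p
sd_formula = sqrt(n * p * (1 - p))

# The same quantities from the definition of expected value and variance.
mean_direct = sum(x * px for x, px in enumerate(pmf))
var_direct = sum(x**2 * px for x, px in enumerate(pmf)) - mean_direct**2

print(f"P(X = 3) = {p_all_three:.4f}, P(X = 0) = {p_none:.4f}, "
      f"P(X >= 2) = {p_at_least_two:.4f}")
print(f"mean = {mean_formula:.2f} (direct {mean_direct:.2f}), "
      f"sd = {sd_formula:.3f} (direct {sqrt(var_direct):.3f})")
```

The two routes to the mean and standard deviation should agree to within rounding, which is the practical content of Equations 5.12 and 5.13.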
Problems for Section 5.3

Problems 5.15 to 5.24 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Table E.6.

LEARNING THE BASICS

5.15 If X is a binomial random variable, determine the following:
a. For n = 4 and p = 0.12, what is P(X = 0)?
b. For n = 10 and p = 0.40, what is P(X = 9)?
c. For n = 10 and p = 0.50, what is P(X = 8)?
d. For n = 6 and p = 0.83, what is P(X = 5)?

5.16 If X is a binomial random variable with n = 5 and p = 0.40, what is the probability that:
a. X = 4?
b. X ⩽ 3?
c. X < 2?
d. X > 1?

5.17 Determine the mean and standard deviation of the random variable X in each of the following binomial distributions:
a. n = 4 and p = 0.10
b. n = 4 and p = 0.40
c. n = 5 and p = 0.80
d. n = 3 and p = 0.50

APPLYING THE CONCEPTS

5.18 The increase or decrease in the price of a share between the beginning and the end of a trading day is assumed to be an equally likely random event. What is the probability that a share will show an increase in its closing price on five consecutive days?

5.19 Research has shown that only 60% of consumers read every word, including the fine print, of a service contract. Assume that the number of consumers who read every word of a contract can be modelled using the binomial distribution. A group of five consumers has just signed a 12-month contract with an ISP (Internet service provider).
a. What is the probability that:
i. all five will have read every word of their contract?
ii. at least three will have read every word of their contract?
iii. fewer than two will have read every word of their contract?
b. What would your answers be in (a) if the probability is 0.80 that a consumer reads every word of a service contract?
5.20 A student taking a multiple-choice test consisting of five questions, each with four options, selects the answers randomly. What is the probability that the student will get:
a. five questions correct?
b. at least four questions correct?
c. no questions correct?
d. no more than two questions correct?

5.21 In Example 5.4 three friends went to a Hungry Jack’s franchise. Instead, suppose that they go to a McDonald’s franchise, which last month filled 90% of orders correctly.
a. What is the probability that:
i. all three orders will be filled correctly?
ii. none of the three will be filled correctly?
iii. at least two of the three will be filled correctly?
b. What are the mean and standard deviation of the number of orders filled correctly?

5.22 In a certain weekday television show, the winning contestant has to choose randomly from 20 boxes, one of which contains a major prize of $100,000.
a. What is the probability that, during a week:
i. no contestant wins the major prize?
ii. exactly one contestant wins the major prize?
iii. no more than two contestants win the major prize?
iv. at least three contestants win the major prize?
b. Calculate the expected number and standard deviation of winners in a week.
c. How much should the producers budget for major prizes per week?

LEARNING OBJECTIVE 5
Identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities

Poisson distribution: Discrete probability distribution, where the random variable is the number of events in a given interval.

5.23 When a customer places an order with Rudy’s On-Line Office Supplies, a computerised accounting information system (AIS) automatically checks to see whether the customer has exceeded their credit limit. Past records indicate that the probability of customers exceeding their credit limit is 0.05. Suppose that, in a given half hour, 20 customers place orders.
Assume that the number of customers that the AIS detects as having exceeded their credit limit is distributed as a binomial random variable.
a. What are the mean and standard deviation of the number of customers exceeding their credit limits?
b. What is the probability that no customer will exceed their limit?
c. What is the probability that one customer will exceed their limit?
d. What is the probability that two or more customers will exceed their limits?

5.24 A new drug is found to be effective on 90% of the patients tested.
a. Is the 90% effective rate best classified as a priori classical probability, empirical classical probability or subjective probability?
b. If the drug is administered to 20 randomly chosen patients at a large hospital, find the probability that it is effective for:
i. fewer than five patients
ii. 10 or more patients
iii. all 20 patients

5.4 POISSON DISTRIBUTION

Many studies are based on the number of times a random event occurs in an interval of time or space. Examples are the number of surface defects on a new refrigerator, the number of network failures in a month or the number of fleas on the body of a dog. The Poisson distribution can be used to calculate probabilities when counting the number of times a particular event occurs in an interval of time or space if:
1. the probability an event occurs in any interval is the same for all intervals of the same size
2. the number of occurrences of the event in one non-overlapping interval is independent of the number in any other interval
3. the probability of two or more occurrences of the event in an interval approaches zero as the interval becomes smaller.

If these properties hold, then the average or expected number of occurrences over any interval is proportional to the size of the interval.
Consider the number of online enquiries received by Gaia Adventure Tours. Suppose that Yang is interested in the number of online enquiries received during a 20-minute interval. Does this situation match the properties of the Poisson distribution given above? First, define the random variable as:

X = number of online enquiries received during a 20-minute interval

If enquiries are received randomly, then it is reasonable to assume that the probability an enquiry is received during a 20-minute interval is the same as the probability for all other 20-minute intervals. Yang can also assume that the receipt of an enquiry during a 20-minute interval has no effect on (i.e. is statistically independent of) the receipt of any other enquiry during any 20-minute interval. Finally, the probability that two or more enquiries will be received in a given time period approaches zero as the time interval becomes smaller. For example, the probability is virtually zero that two enquiries will be received in a time interval of 0.001 of a second. Thus, Yang can use the Poisson distribution to determine probabilities involving the number of online enquiries received in a 20-minute interval.

The Poisson distribution has one parameter, λ (Greek lower-case letter lambda), which is the mean or expected number of events per interval. The variance of a Poisson distribution is also equal to λ, hence the standard deviation is equal to √λ. The number of events, X, of the Poisson random variable ranges from 0 to infinity.

Equation 5.14, the mathematical formula for the Poisson distribution, gives the probability of X events in an interval, given that λ events are expected.

POISSON PROBABILITY DISTRIBUTION

$$P(X) = \frac{e^{-\lambda}\lambda^X}{X!} \qquad (5.14)$$

where
P(X) = the probability of X events in a given interval
λ = expected number of events in the given interval
e = 2.71828 … is the base of natural logarithms

To illustrate the use of the Poisson distribution, calculate the probability that in a given 20 minutes exactly five online enquiries will be received, and the probability that fewer than five online enquiries will be received. On average, Gaia Adventure Tours receives 30 online enquiries an hour, so the average or expected number of enquiries in 20 minutes is:

$$\lambda = \frac{30}{60} \times 20 = 10$$

Using Equation 5.14 with λ = 10, the probability that in a given 20 minutes exactly five online enquiries will be received is:

$$P(X = 5) = \frac{e^{-10}10^5}{5!} = \frac{4.53999\ldots}{120} = 0.03783\ldots$$

and the probability that in any given 20 minutes fewer than five online enquiries will be received is:

$$P(X < 5) = \frac{e^{-10}(10)^0}{0!} + \frac{e^{-10}(10)^1}{1!} + \frac{e^{-10}(10)^2}{2!} + \frac{e^{-10}(10)^3}{3!} + \frac{e^{-10}(10)^4}{4!}$$
$$= 0.00004\ldots + 0.00045\ldots + 0.00226\ldots + 0.00756\ldots + 0.01891\ldots = 0.029252\ldots$$

Thus, there is a 3% likelihood that fewer than five online enquiries will be received in 20 minutes, leading to enquiry staff having significant idle time.

To avoid the computational drudgery involved in these calculations, many Poisson probabilities can be found directly from Table E.7 (Appendix E), a portion of which is reproduced in Table 5.5. Table E.7 provides probabilities for the Poisson random variable for X = 0, 1, 2, . . . for selected values of the parameter λ. The probability that exactly five online enquiries will be received in a given 20-minute interval when the mean number of enquiries received in 20 minutes is 10 is given by the intersection of the row X = 5 and column λ = 10. Therefore, from Table 5.5, P(X = 5) = 0.0378.
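Equation 5.14 can also be evaluated in code. The sketch below is a minimal Python illustration using only the standard library; the function name `poisson_pmf` and the variable names are assumptions chosen for the example. It first scales the hourly rate to the 20-minute interval and then reproduces the two probabilities calculated above.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam (Equation 5.14)."""
    return exp(-lam) * lam**x / factorial(x)

# Scale the hourly rate to the interval of interest:
# 30 enquiries per hour over 20 minutes gives lambda = 10.
rate_per_hour = 30
interval_minutes = 20
lam = rate_per_hour / 60 * interval_minutes

p_exactly_5 = poisson_pmf(5, lam)
p_fewer_than_5 = sum(poisson_pmf(x, lam) for x in range(5))  # X = 0, 1, 2, 3, 4

print(f"lambda = {lam}")
print(f"P(X = 5) = {p_exactly_5:.4f}")
print(f"P(X < 5) = {p_fewer_than_5:.4f}")
```

The scaling step reflects the property that the expected number of events is proportional to the length of the interval, so the same function can be reused for any interval by changing `interval_minutes`.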
Table 5.5 Calculating a Poisson probability for λ = 10 (extracted from Table E.7 in Appendix E of this book)

| X | λ = 9.1 | λ = 9.2 | … | λ = 10 |
|---|---------|---------|---|--------|
| 0 | 0.0001 | 0.0001 | … | 0.0000 |
| 1 | 0.0010 | 0.0009 | … | 0.0005 |
| 2 | 0.0046 | 0.0043 | … | 0.0023 |
| 3 | 0.0140 | 0.0131 | … | 0.0076 |
| 4 | 0.0319 | 0.0302 | … | 0.0189 |
| 5 | 0.0581 | 0.0555 | … | 0.0378 |
| 6 | 0.0881 | 0.0851 | … | 0.0631 |
| 7 | 0.1145 | 0.1118 | … | 0.0901 |

You can also calculate the Poisson probabilities given in Table E.7 using Microsoft Excel. Figure 5.4 presents a Microsoft Excel worksheet for the Poisson distribution, with λ = 10, using the Excel 2010 and later inbuilt Poisson function POISSON.DIST(x,mean,cumulative). For earlier versions of Excel the corresponding Poisson function is POISSON(x,mean,cumulative).

Figure 5.4 Microsoft Excel worksheet for ‘number of online enquiries in 20 minutes’ example

Cell A1: Poisson probabilities
Cell A3: Data
Row 4: Mean/Expected number of events of interest: 10 (in cell E4)
Cell A6: Poisson probabilities table

| Row | A: X | B: P(X) | Formula in column B |
|-----|------|---------|---------------------|
| 8 | 0 | 0.0000 | =POISSON.DIST(A8, $E$4, FALSE) |
| 9 | 1 | 0.0005 | =POISSON.DIST(A9, $E$4, FALSE) |
| 10 | 2 | 0.0023 | =POISSON.DIST(A10, $E$4, FALSE) |
| 11 | 3 | 0.0076 | =POISSON.DIST(A11, $E$4, FALSE) |
| 12 | 4 | 0.0189 | =POISSON.DIST(A12, $E$4, FALSE) |
| 13 | 5 | 0.0378 | =POISSON.DIST(A13, $E$4, FALSE) |
| 14 | 6 | 0.0631 | =POISSON.DIST(A14, $E$4, FALSE) |
| 15 | 7 | 0.0901 | =POISSON.DIST(A15, $E$4, FALSE) |
| 16 | 8 | 0.1126 | =POISSON.DIST(A16, $E$4, FALSE) |
| 17 | 9 | 0.1251 | =POISSON.DIST(A17, $E$4, FALSE) |
| 18 | 10 | 0.1251 | =POISSON.DIST(A18, $E$4, FALSE) |
| 19 | 11 | 0.1137 | =POISSON.DIST(A19, $E$4, FALSE) |
| 20 | 12 | 0.0948 | =POISSON.DIST(A20, $E$4, FALSE) |
| 21 | 13 | 0.0729 | =POISSON.DIST(A21, $E$4, FALSE) |
| 22 | 14 | 0.0521 | =POISSON.DIST(A22, $E$4, FALSE) |
| 23 | 15 | 0.0347 | =POISSON.DIST(A23, $E$4, FALSE) |
| 24 | 16 | 0.0217 | =POISSON.DIST(A24, $E$4, FALSE) |
| 25 | 17 | 0.0128 | =POISSON.DIST(A25, $E$4, FALSE) |
| 26 | 18 | 0.0071 | =POISSON.DIST(A26, $E$4, FALSE) |
| 27 | 19 | 0.0037 | =POISSON.DIST(A27, $E$4, FALSE) |
| 28 | 20 | 0.0019 | =POISSON.DIST(A28, $E$4, FALSE) |

EXAMPLE 5.5 CALCULATING POISSON PROBABILITIES

The number of faults per month that arise in the gearboxes of a bus fleet is known to follow a Poisson distribution with a mean of 2.5 faults per month. What is the probability that in a given month no faults are found? At least one fault is found?

SOLUTION

Using Equation 5.14 with λ = 2.5 (or using Table E.7 or Microsoft Excel), the probabilities that in a given month no faults are found and at least one fault is found are:

$P(X = 0) = \dfrac{e^{-2.5}(2.5)^{0}}{0!} = \dfrac{0.08208\ldots \times 1}{1} = 0.08208\ldots$

$P(X \geq 1) = 1 - P(X = 0) = 1 - 0.08208\ldots = 0.91791\ldots$

The probability that there will be no faults in a given month is 0.0821. The probability that there will be at least one fault is 0.9179, which is the complement of there being no faults in a given month.

EXAMPLE 5.6 CALCULATING POISSON PROBABILITIES

For the Gaia Adventure Tours scenario, what is the probability that 24 or more online enquiries are received in 30 minutes?

SOLUTION

Let X = number of online enquiries received in 30 minutes, then X is Poisson with

$\lambda = \dfrac{30}{60} \times 30 = 15$

Using Microsoft Excel we can obtain Table 5.6, which gives Poisson probabilities for λ = 15.

Table 5.6 Poisson probabilities for λ = 15

Enquiries received in 30 minutes
Expected number of enquiries: 15

| X | P(X) | X | P(X) | X | P(X) |
|---|------|---|------|---|------|
| 0 | 0.0000 | 8 | 0.0194 | 16 | 0.0960 |
| 1 | 0.0000 | 9 | 0.0324 | 17 | 0.0847 |
| 2 | 0.0000 | 10 | 0.0486 | 18 | 0.0706 |
| 3 | 0.0002 | 11 | 0.0663 | 19 | 0.0557 |
| 4 | 0.0006 | 12 | 0.0829 | 20 | 0.0418 |
| 5 | 0.0019 | 13 | 0.0956 | 21 | 0.0299 |
| 6 | 0.0048 | 14 | 0.1024 | 22 | 0.0204 |
| 7 | 0.0104 | 15 | 0.1024 | 23 | 0.0133 |
| | | | | Total | 0.9805 |

From Table 5.6:

$P(X \geq 24) = 1 - P(X < 24) = 1 - P(X \leq 23) = 1 - 0.9805 = 0.0195$

Therefore, in approximately 2% of 30-minute intervals 24 or more online enquiries are expected to be received, hence increasing the likelihood that enquiries start to queue and may not be answered within the stated 45 minutes.
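Examples 5.5 and 5.6 both rely on the complement rule, which is easy to reproduce in code. The sketch below is an illustration only, using Python's standard library; the helper names `poisson_pmf` and `poisson_cdf` are hypothetical and are not functions referred to in the text.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson random variable with mean lam (Equation 5.14)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def poisson_cdf(x: int, lam: float) -> float:
    """P(X <= x), found by summing Equation 5.14 over 0, 1, ..., x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


# Example 5.5: gearbox faults, mean of 2.5 faults per month
lam_faults = 2.5
p_no_faults = poisson_pmf(0, lam_faults)
p_at_least_one_fault = 1 - p_no_faults  # complement of "no faults"

# Example 5.6: 30 enquiries an hour, scaled to a 30-minute interval
lam_enquiries = 30 / 60 * 30
p_24_or_more = 1 - poisson_cdf(23, lam_enquiries)  # 1 - P(X <= 23)

print(round(p_no_faults, 4))           # 0.0821
print(round(p_at_least_one_fault, 4))  # 0.9179
print(p_24_or_more)                    # close to the 0.0195 read from Table 5.6
```

Because Table 5.6 sums probabilities that have already been rounded to four decimal places, the value computed here may differ from the tabulated answer in the last decimal place.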
Problems for Section 5.4

LEARNING THE BASICS

5.25 Assume a Poisson distribution.
a. If λ = 2.5, find P(X = 2).
b. If λ = 8, find P(X = 8).
c. If λ = 0.5, find P(X = 1).
d. If λ = 3.7, find P(X = 0).

5.26 Assume a Poisson distribution.
a. If λ = 2, find P(X ≥ 2).
b. If λ = 8, find P(X ≥ 3).
c. If λ = 0.5, find P(X ≤ 1).
d. If λ = 4, find P(X ≥ 1).
e. If λ = 5, find P(X ≤ 3).

5.27 Assume a Poisson distribution with λ = 5. Find the probability that:
a. X = 1
b. X < 1
c. X > 1
d. X ≤ 1

APPLYING THE CONCEPTS

Problems 5.28 to 5.32 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Table E.7.

5.28 The quality control manager of Marilyn’s Bakery is inspecting a batch of chocolate-chip biscuits that has just been baked. If the production process is in control, the mean number of chip parts per biscuit is 6.0. What is the probability that, in any particular biscuit being inspected:
a. fewer than five chip parts will be found?
b. exactly five chip parts will be found?
c. five or more chip parts will be found?
d. either four or five chip parts will be found?

5.29 Refer to problem 5.28. How many biscuits in a batch of 100 should the manager expect to discard if company policy requires that all chocolate-chip biscuits sold must have at least four chocolate-chip parts?

5.30 The number of floods in a certain region is approximately Poisson distributed with an average of three floods every 10 years.
a. Find the probability that a family living in the area for one year will experience:
i. exactly one flood
ii. at least one flood
b. Find the probability that a student who moves to the area for three years will experience:
i. exactly one flood
ii. at least one flood

5.31 Based on past experience, it is assumed that the number of flaws per metre in rolls of grade 2 paper follows a Poisson distribution with a mean of one flaw per 5 metres of paper. What is the probability that in a:
a. 1-metre roll there will be at least two flaws?
b. 10-metre roll there will be at least one flaw?
c. 50-metre roll there will be between five and 15 (inclusive) flaws?

5.32 A toll-free phone number is available from 9 am to 9 pm for customers to register a complaint about a product purchased from a large company. Past history indicates that an average of 0.4 calls are received per minute.
a. What properties must be true about the situation described above in order to use the Poisson distribution to calculate probabilities concerning the number of phone calls received in a 1-minute period?
b. Assuming that this situation matches the properties you discuss in (a), what is the probability that, during a 1-minute period:
i. zero phone calls will be received?
ii. three or more phone calls will be received?
c. What is the maximum number of phone calls that will be received in a 1-minute period 99.99% of the time?

5.5 HYPERGEOMETRIC DISTRIBUTION

LEARNING OBJECTIVE 6: Identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities

hypergeometric distribution: Discrete probability distribution where the random variable is the number of successes in a sample of n observations from a finite population without replacement.

The binomial distribution and the hypergeometric distribution are both concerned with the number of successes in a sample of n observations. However, they differ in the way in which the sample is selected. Because the binomial distribution requires the probability of success p to be constant for all observations and the outcome of any particular observation to be independent of any other, the random sample is selected either with replacement from a finite population or without replacement from an infinite population.
For the hypergeometric distribution, the random sample is selected without replacement from a finite population. Thus, the outcome of one observation is dependent on the outcomes of previous observations.

Consider a population of size N. Let A represent the total number of successes in the population. The hypergeometric distribution is then used to find the probability of X successes in a sample of size n selected without replacement. Equation 5.15, the mathematical formula for the hypergeometric distribution, gives the probability of X successes, given n, N and A.

HYPERGEOMETRIC DISTRIBUTION

$P(X) = \dfrac{\dbinom{A}{X}\dbinom{N-A}{n-X}}{\dbinom{N}{n}}$ (5.15)

where
P(X) = the probability of X successes, given n, N and A
n = sample size
N = population size
A = number of successes in the population
N − A = number of failures in the population
X = number of successes in the sample
n − X = number of failures in the sample
$\dbinom{A}{X} = {}_{A}C_{X}$ (see Equation 5.10)

The number of successes in the sample, represented by X, cannot be greater than the number of successes in the population, A, or the sample size, n. Thus, the largest value the hypergeometric random variable can take is the smaller of the sample size and the number of successes in the population.

Equation 5.16 defines the mean of the hypergeometric distribution.

THE MEAN OF THE HYPERGEOMETRIC DISTRIBUTION

$\mu = E(X) = \dfrac{nA}{N}$ (5.16)

Equation 5.17 defines the standard deviation of the hypergeometric distribution.

THE STANDARD DEVIATION OF THE HYPERGEOMETRIC DISTRIBUTION

$\sigma = \sqrt{\dfrac{nA(N-A)}{N^{2}}} \cdot \sqrt{\dfrac{N-n}{N-1}}$ (5.17)

In Equation 5.17, the expression $\sqrt{\dfrac{N-n}{N-1}}$ is a finite population correction factor that results from sampling without replacement from a finite population.
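Equations 5.15 to 5.17 translate directly into code. The following is a minimal Python sketch rather than a definitive implementation; the function names are hypothetical helpers introduced here, `math.comb` (Python 3.8 or later) supplies the combinations of Equation 5.10, and the values N = 30, A = 10 and n = 8 are those of this section's team-formation example.

```python
import math


def hypergeom_pmf(x: int, n: int, A: int, N: int) -> float:
    """P(X = x): x successes in a sample of n drawn without replacement
    from a population of N that contains A successes (Equation 5.15)."""
    return math.comb(A, x) * math.comb(N - A, n - x) / math.comb(N, n)


def hypergeom_mean(n: int, A: int, N: int) -> float:
    """Mean of the hypergeometric distribution (Equation 5.16)."""
    return n * A / N


def hypergeom_sd(n: int, A: int, N: int) -> float:
    """Standard deviation, including the finite population
    correction factor sqrt((N - n) / (N - 1)) (Equation 5.17)."""
    return math.sqrt(n * A * (N - A) / N**2) * math.sqrt((N - n) / (N - 1))


# Population of 30 executives, 10 from finance, team of 8 selected
N, A, n = 30, 10, 8

p_two = hypergeom_pmf(2, n, A, N)
mean = hypergeom_mean(n, A, N)
sd = hypergeom_sd(n, A, N)

print(round(p_two, 4))  # 0.298
print(round(mean, 4))   # 2.6667
print(round(sd, 4))     # 1.1613
```

Keeping the correction factor as a separate square root, as in Equation 5.17, makes it easy to see that it approaches 1 when the sample is small relative to the population.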
To illustrate the hypergeometric distribution, suppose that we wish to form a team of eight executives from different departments within a company. Suppose the company has a total of 30 executives, and 10 of these are from the finance department. If members of the team are to be selected at random, what is the probability that the team will contain two executives from the finance department? Here, the population of N = 30 executives within the company is finite. In addition, A = 10 are from the finance department and a team of n = 8 executives is to be selected.

finite population correction factor: Factor required when sampling from a finite population without replacement.

Using Equation 5.15:

$P(X = 2) = \dfrac{\dbinom{10}{2}\dbinom{20}{6}}{\dbinom{30}{8}} = \dfrac{\dfrac{10!}{2!\,8!} \times \dfrac{20!}{6!\,14!}}{\dfrac{30!}{8!\,22!}} = 0.2980\ldots$

Using Equations 5.16 and 5.17:

$\mu = E(X) = \dfrac{8 \times 10}{30} = 2.666\ldots$

and

$\sigma = \sqrt{\dfrac{8 \times 10 \times (30 - 10)}{30^{2}}} \cdot \sqrt{\dfrac{30 - 8}{30 - 1}} = 1.1613\ldots$

Thus, the probability that the team will contain two members from the finance department is 0.298, or 29.8%.

Such calculations can become tedious, especially as N gets larger. However, Microsoft Excel can be used to calculate hypergeometric probabilities. Figure 5.5, using the Excel 2010 and later inbuilt hypergeometric function HYPGEOM.DIST(sample_s,number_sample,population_s,number_population,cumulative), presents a Microsoft Excel worksheet for the team-formation example. Note that the number of executives from the finance department (i.e. the number of successes in the sample) can be equal to 0, 1, 2, …, 8.

Figure 5.5 Microsoft Excel worksheet for the team-formation example

Cell A1: Hypergeometric probabilities
Cell A3: Data
Row 4: Sample size: 8 (in cell B4)
Row 5: No. of events of interest in population: 10 (in cell B5)
Row 6: Population size: 30 (in cell B6)
Cell A8: Hypergeometric probabilities table

| Row | A: X | B: P(X) | Formula in column B |
|-----|------|---------|---------------------|
| 10 | 0 | 0.0215 | =HYPGEOM.DIST(A10, $B$4, $B$5, $B$6, FALSE) |
| 11 | 1 | 0.1324 | =HYPGEOM.DIST(A11, $B$4, $B$5, $B$6, FALSE) |
| 12 | 2 | 0.2980 | =HYPGEOM.DIST(A12, $B$4, $B$5, $B$6, FALSE) |
| 13 | 3 | 0.3179 | =HYPGEOM.DIST(A13, $B$4, $B$5, $B$6, FALSE) |
| 14 | 4 | 0.1738 | =HYPGEOM.DIST(A14, $B$4, $B$5, $B$6, FALSE) |
| 15 | 5 | 0.0491 | =HYPGEOM.DIST(A15, $B$4, $B$5, $B$6, FALSE) |
| 16 | 6 | 0.0068 | =HYPGEOM.DIST(A16, $B$4, $B$5, $B$6, FALSE) |
| 17 | 7 | 0.0004 | =HYPGEOM.DIST(A17, $B$4, $B$5, $B$6, FALSE) |
| 18 | 8 | 0.0000 | =HYPGEOM.DIST(A18, $B$4, $B$5, $B$6, FALSE) |

For earlier versions of Excel the corresponding hypergeometric function is HYPGEOMDIST(sample_s,number_sample,population_s,number_pop).

Problems for Section 5.5

LEARNING THE BASICS

5.33 Determine the following:
a. If n = 4, N = 10 and A = 5, find P(X = 3).
b. If n = 4, N = 6 and A = 3, find P(X = 1).
c. If n = 5, N = 12 and A = 3, find P(X = 0).
d. If n = 3, N = 10 and A = 3, find P(X = 3).

5.34 Referring to problem 5.33, calculate the mean and the standard deviation for the hypergeometric distributions described in (a) to (d).

APPLYING THE CONCEPTS

Problems 5.35 to 5.39 can be solved manually or by using Microsoft Excel.

5.35 An auditor for the Australian Taxation Office is selecting a sample of six tax returns from a batch of 100 for an audit. If two or more of these returns contain errors, the entire batch of 100 tax returns will be audited.
a. What is the probability that the entire batch will be audited if the true number of returns with errors in the batch is:
i. 25?
ii. 30?
iii. 5?
iv. 10?
b. Discuss the differences in your results depending on the true number of returns in the batch with error.
5.36 The dean of a business faculty wishes to form an executive committee of five from among the 40 tenured faculty members. The selection is to be random, and there are eight tenured faculty members in accounting.
a. What is the probability that the committee will contain:
i. none of them?
ii. at least one of them?
iii. not more than one of them?
b. What is your answer to part (i) above if the committee consists of seven members?

5.37 In a shipment of 15 hard disks, five are defective. If four of the disks are inspected,
a. What is the probability that:
i. exactly one is defective?
ii. at least one is defective?
iii. no more than two are defective?
b. What is the mean number of defective hard disks that you would expect to find in the sample of four hard disks?

5.38 In each game of OZ Lotto seven numbers are selected, from 1 to 45. Seven winning numbers are chosen at random plus two supplementary numbers. An extension of the hypergeometric distribution to calculate probabilities of selecting combinations of winning and supplementary numbers is:

$P(X, Y) = \dfrac{\dbinom{A}{X}\dbinom{S}{Y}\dbinom{N-A-S}{n-X-Y}}{\dbinom{N}{n}}$

where P(X,Y) is the probability of selecting X winning numbers and Y supplementary numbers, and S is the number of supplementary numbers.
a. To win Division 1, the seven winning numbers must be selected. In any game, what is the probability of winning Division 1?
b. To win Division 2, six winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 2?
c. To win Division 3, six winning numbers must be selected. In any game, what is the probability of winning Division 3?
d. To win Division 4, five winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 4?
e. To win Division 5, five winning numbers must be selected. In any game, what is the probability of winning Division 5?
f. To win Division 6, four winning numbers must be selected. In any game, what is the probability of winning Division 6?
g. To win Division 7, three winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 7?
h. What is the probability of selecting none of the winning or supplementary numbers?

5.39 In a certain game of Lotto six numbers are selected from 1 to 45. Six winning numbers are chosen at random plus two supplementary numbers. Use the formula in problem 5.38 or Equation 5.15 to calculate the following probabilities.
a. To win Division 1, the six winning numbers must be selected. In any game, what is the probability of winning Division 1?
b. To win Division 2, five winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 2?
c. To win Division 3, five winning numbers must be selected. In any game, what is the probability of winning Division 3?
d. To win Division 4, four winning numbers must be selected. In any game, what is the probability of winning Division 4?
e. To win Division 5, three winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 5?
f. What is the probability of selecting none of the winning or supplementary numbers?

5 Assess your progress

Summary

This chapter introduced mathematical expectation, covariance and the development and application of the binomial, Poisson and hypergeometric distributions.
In the Gaia Adventure Tours scenario, we saw how to calculate probabilities from the binomial and Poisson distributions concerning the number of online enquiries converted to bookings in a sample of n enquiries and the number of online enquiries received in a given time interval. In the next chapter, important continuous distributions are introduced, in particular the normal distribution.

To help decide which discrete probability distribution to use for a particular situation, we need to ask the following questions:

• Is there a fixed number of observations n, each of which is classified as success or failure, or are we counting the number of times an event happens in an interval of time or space? If there is a fixed number of observations n, each of which is classified as success or failure, we can use the binomial or hypergeometric distribution, if the properties of the distribution are satisfied. If we are counting the number of events in an interval, we can use the Poisson distribution only if all its properties are satisfied.
• In deciding whether to use the binomial or hypergeometric distribution, is the probability of success constant for all observations? If yes, we may be able to use the binomial distribution. If no, we may be able to use the hypergeometric distribution.
Key formulas

Expected value μ of a discrete random variable
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$ (5.1)

Variance of a discrete random variable
$\sigma^{2} = \sum_{i=1}^{N} [X_i - E(X)]^{2} P(X_i)$ (5.2a) (definition)
$\sigma^{2} = \sum_{i=1}^{N} X_i^{2} P(X_i) - E(X)^{2}$ (5.2b) (calculation)

Standard deviation of a discrete random variable
$\sigma = \sqrt{\sigma^{2}}$ (5.3)

Covariance
$\sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} [X_i - E(X)][Y_j - E(Y)] P(X_i \text{ and } Y_j)$ (5.4a) (definition)
$\sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} X_i Y_j P(X_i \text{ and } Y_j) - E(X)E(Y)$ (5.4b) (calculation)

Expected value of the sum of two random variables
$E(X + Y) = E(X) + E(Y)$ (5.5)

Variance of the sum of two random variables
$\sigma^{2}_{X+Y} = \sigma^{2}_{X} + \sigma^{2}_{Y} + 2\sigma_{XY}$ (5.6)

Standard deviation of the sum of two random variables
$\sigma_{X+Y} = \sqrt{\sigma^{2}_{X+Y}}$ (5.7)

Portfolio expected return
$E(P) = wE(X) + (1 - w)E(Y)$ (5.8)

Portfolio risk
$\sigma_{p} = \sqrt{w^{2}\sigma^{2}_{X} + (1 - w)^{2}\sigma^{2}_{Y} + 2w(1 - w)\sigma_{XY}}$ (5.9)

Combinations
$\dbinom{n}{X} = {}_{n}C_{X} = \dfrac{n!}{X!(n - X)!}$ (5.10)

Binomial distribution
$P(X) = \dfrac{n!}{X!(n - X)!}\, p^{X}(1 - p)^{n - X}$ (5.11)

The mean of the binomial distribution
$\mu = E(X) = np$ (5.12)

The standard deviation of the binomial distribution
$\sigma = \sqrt{\sigma^{2}} = \sqrt{np(1 - p)}$ (5.13)

Poisson distribution
$P(X) = \dfrac{e^{-\lambda}\lambda^{X}}{X!}$ (5.14)

Hypergeometric distribution
$P(X) = \dfrac{\dbinom{A}{X}\dbinom{N-A}{n-X}}{\dbinom{N}{n}}$ (5.15)

The mean of the hypergeometric distribution
$\mu = E(X) = \dfrac{nA}{N}$ (5.16)

The standard deviation of the hypergeometric distribution
$\sigma = \sqrt{\dfrac{nA(N-A)}{N^{2}}} \cdot \sqrt{\dfrac{N-n}{N-1}}$ (5.17)

Key terms

binomial distribution
covariance
expected value of a discrete random variable
expected value of the sum of two random variables
finite population correction factor
hypergeometric distribution
mathematical model
Poisson distribution
portfolio
portfolio expected return
portfolio risk
probability distribution for a discrete random variable
standard deviation of a discrete random variable
standard deviation of the sum of two random variables
variance of a discrete random variable
variance of the sum of two random variables

Chapter review problems

CHECKING YOUR UNDERSTANDING

5.40 What is the meaning of the expected value of a probability distribution?

5.41 What are the four properties of a binomial distribution?

5.42 What are the three properties of a Poisson distribution?

5.43 When is the hypergeometric distribution used instead of the binomial distribution?

APPLYING THE CONCEPTS

Problems 5.44 to 5.53 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Tables E.6 and E.7.

5.44 From September 1984 to July 2017 the ASX All Ordinaries Index has opened higher than the previous month for 233 of the 395 months – that is, approximately 59.0% of months (Data from YAHOO!7FINANCE <http://au.finance.yahoo.com> accessed July 2017).
a. Assuming a binomial distribution, estimate the probability that the ASX All Ordinaries Index will open higher than the previous month:
i. for one month
ii. for two months in a row
iii. in four of the next five months
iv. in none of the next five years
b. For the situation in (a) above, what assumption of the binomial distribution might not be valid?

5.45 At a recent election, 12% of the voters in a certain electorate gave their first preference to the Greens candidate. If 10 people on the electoral roll for that electorate were randomly selected, find the probability that:
a. exactly four gave their first preference to the Greens candidate
b. at most four gave their first preference to the Greens candidate
c. a majority gave their first preference to the Greens candidate

5.46 When calculating premiums on life insurance products, insurance companies often use life tables that enable the probability of a person dying in any age interval to be calculated. The following data obtained from the ‘New Zealand Abridged Period Life Table: 2014-16’ give the number out of 100,000 New Zealand-born males and females who are still alive during each five-year period of life between age 20 and 60 (inclusive).

Number alive at exact age

| Exact age (years) | Out of 100,000 females born | Out of 100,000 males born |
|-------------------|-----------------------------|---------------------------|
| 20 | 99,288 | 99,031 |
| 25 | 99,128 | 98,685 |
| 30 | 98,949 | 98,312 |
| 35 | 98,726 | 97,899 |
| 40 | 98,427 | 97,381 |
| 45 | 97,934 | 96,649 |
| 50 | 97,157 | 95,548 |
| 55 | 95,933 | 93,853 |
| 60 | 94,162 | 91,352 |

Data obtained from <www.stats.govt.nz> accessed June 2017. © Statistics New Zealand and licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence

Suppose a New Zealand-born female on her 35th birthday purchases a one million dollar, five-year term life policy from an insurance company. That is, the insurance company must pay her estate $1 million if she dies within the next five years.
a. Determine the insurance company’s expected payout on this policy.
b. What would be the minimum you would expect the insurance company to charge her for this policy?
c. What would the expected payout be if the same policy were taken out by a New Zealand-born female on her 40th birthday?
d. Repeat parts (a) to (c) for a New Zealand-born male.

5.47 The emergency facility at a small country hospital has been in operation for 60 weeks and has been used 120 times. The weekly pattern of demand for this facility has a Poisson distribution. Find the:
a. mean demand per week
b. probability the emergency facility is not used in a given week
c. probability the emergency facility is used at least twice in a week
d. probability the room is used at least once in a given two-week period

5.48 Check$mart’s records show that 58% of its customers pay only the minimum repayment on their credit card each month.
a. If a random sample of 20 credit-card holders is selected, what is the probability that:
i. none pays the minimum amount?
ii. no more than five pay the minimum amount?
iii. more than 10 pay the minimum amount?
b. What assumptions did you have to make to answer each part of (a) above?

5.49 In 2016, the New Zealand general marriage rate was 10.95 marriages and civil unions per 1,000 population 16 years and over who are not married or in a civil union. The corresponding divorce rate was 8.7 per 1,000 existing marriages and civil unions (data obtained from Marriage, Civil Unions and Divorces: Year ended December 2016, Statistics New Zealand <www.stats.govt.nz> accessed July 2017).
a. Suppose 60 unmarried women were randomly selected on 1 January 2016.
i. Find the probability that at least three married, including civil unions, during 2016.
ii. Find the probability that at most two married, including civil unions, during 2016.
iii. What is the mean and standard deviation of the number who married during 2016?
b. Suppose 60 married couples were randomly selected on 1 January 2016.
i. Find the probability that none divorced during 2016.
ii. Find the probability that at most two divorced during 2016.
iii. What is the mean and standard deviation of the number of divorces during 2016?

5.50 A customer service manager of Check$mart bank is monitoring one of its phone banking call centres servicing a rural region. Suppose that on average the call centre receives 180 calls an hour during its operating hours of 8 am to 6 pm.
a. Can the Poisson distribution be used to model the number of calls received in one minute? Explain.
b. Assuming the number of calls received in a given interval is Poisson, calculate the probability that:
i. in a given minute exactly two calls will be received
ii. more than two calls will be received in a minute
iii. the number of calls received in 5 minutes is at least 20
iv. the number of calls received in 5 minutes is less than 10
At current staffing levels calls start to queue, increasing the time it takes to answer a call, when the number of calls received in 5 minutes is 20 or more. However, when there are fewer than 10 calls in 5 minutes, more than one Customer Service Officer is usually available, increasing unproductive staff time.
c. What conclusions can you draw from problem (b) parts (iii) and (iv) above?

5.51 Suppose the average number of students who log on to a university’s computer system is 4.45 in each 5-minute interval.
a. What is the probability that six students will log on in the next minute?
b. What is the probability that fewer than six students will log on during the next two minutes?

5.52 A study of various news home pages reports that the mean number of bad links per home page is 0.4 and the mean number of spelling errors per home page is 0.16. Use the Poisson distribution to find the probability that a randomly selected home page will contain:
a. no bad links
b. five or more bad links
c. no spelling errors
d. 10 or more spelling errors

5.53 In an online test, 10 multiple-choice questions are randomly selected from a test bank of 100 questions. Supposing that each student has two attempts at the online test, what is the probability that in the second test a student attempts there are:
a. no questions from the first test?
b. at least one question from the first test?
c. exactly five questions from the first test?
d. 10 questions from the first test?

5.54 The following table gives the grade distribution at a certain university.

| Fail | Pass | Credit | Distinction | High Distinction |
|------|------|--------|-------------|------------------|
| 15% | 40% | 25% | 15% | 5% |

Supposing that a result is selected randomly, what is the probability that:
a. the result is a passing grade (Pass or above)?
b. the result is a Credit or above?
c. If a random sample of 15 results is selected, what is the probability of:
i. exactly three Fails?
ii. more than five Fails?
iii. all being Pass or above?
iv. none being Credit or above?
v. exactly five being Credits or above?
vi. at least one Distinction or High Distinction?
d. Based on the random sample of 15 results, what is the expected number, variance and standard deviation of the number of:
i. Fail grades?
ii. grades Pass or above?
e. Comment on the relationship between (i) and (ii) in part (d) above.
A grade point of 7 is assigned to each High Distinction, 6 to each Distinction, 5 to each Credit, 4 to each Pass and 0 to each Fail.
f. What is the average, variance and standard deviation of grade points for the university?

5.55 You are trying to develop a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

| Probability | Returns: Share A | Returns: Share B |
|-------------|------------------|------------------|
| 0.25 | $240 | –$100 |
| 0.50 | $150 | $150 |
| 0.25 | –$100 | $240 |

a. Calculate:
i. the expected returns for share A and for share B
ii. the variances and standard deviations for share A and for share B
iii. the covariance of share A and share B
b. Suppose you want to create a portfolio that consists of share A and share B. Calculate the portfolio expected return and risk if the proportion invested in share A is:
i. 0.40
ii. 0.50
iii. 0.60
c. On the basis of the results in (b), which portfolio would you recommend? Explain.

5.56 The breakdown by home address of the previous year’s 993 drink-driving offences in Problem 2.67 is:

| Home address | Number of drink-driving offences |
|--------------|----------------------------------|
| Local – in council area: Seaside town | 151 |
| Local – in council area: Not seaside town | 462 |
| Not local – not in council area: Intrastate (within state) | 130 |
| Not local – not in council area: Interstate (another state) | 228 |
| Not local – not in council area: International (outside Australia) | 22 |

Suppose that Kai randomly selects 20 of the offenders to interview in depth. What is the probability that:
a. all 20 will be local?
b. 15 will be local?
c. five will be from interstate?
d. at least 10 will not be local?

5.57 Past data indicate that 6% of all students enrolled in a first-year statistics unit at Tasman University obtain a High Distinction (HD). Assume that students are allocated randomly to a tutorial group.
a. What is the probability that in a tutorial group of 30 students:
i. none receive an HD?
ii. at most, two students obtain an HD?
iii. more than four students obtain an HD?
b. What is the mean and standard deviation of the number of HDs obtained in a tutorial group?

5.58 In a regional city, on average 2.6 traffic accidents are reported an hour from 7 am to 7 pm. On a given day, what is the probability that:
a. four accidents are reported from 9 am to 9.30 am?
b. five accidents are reported from 2 pm to 4 pm?
c. three or four accidents are reported from 2 pm to 3 pm?
d. at least one accident is reported from 4 pm to 5.30 pm?

5.59 A hand of five cards is dealt from a shuffled standard pack of 52 cards. Find the probability that:
a. all the cards are red
b. exactly two of the cards are red
c. at least one card is red
d. the hand contains four kings
e. the hand has at least one king
f. all the cards are hearts
g. the cards are all the same suit

5.60 Pat’s Used Cars sells on average 3.6 used cars in a normal trading day. Assume the number of used cars sold follows a Poisson distribution. Determine:
a. the probability that five used cars are sold in a day
b. the probability that no more than two used cars are sold in a day
c. the expected number and standard deviation of used cars sold in a 10-day period

5.61 The Ashland MultiComm Services (AMS) marketing department wants to increase subscriptions for a combined telephone, pay TV and Internet bundle called 3-For-All. AMS marketing has been conducting an aggressive direct-marketing campaign that includes postal and electronic mailings and telephone solicitations. Feedback from these efforts indicates that including premium channels of the customer’s choice in this bundle is a very important factor for both current and prospective subscribers. After several brainstorming sessions, the marketing department has decided to add premium channels as a no-cost benefit of subscribing to 3-For-All. The research director, Mona Fields, is planning to conduct a survey among prospective customers to determine how many premium channels need to be added to 3-For-All in order to generate increased subscriptions.
Based on past campaigns and on industry-wide data, she estimates the following:

Number of free premium channels     Probability of subscriptions
0                                   0.020
1                                   0.040
2                                   0.060
3                                   0.070
4                                   0.080
5                                   0.085

a. If a sample of 50 prospective customers is selected and no free premium channels are included in the 3-For-All bundle, given the above probability estimates, what is the probability that:
   i. fewer than three customers will subscribe to 3-For-All?
   ii. at most one customer will subscribe to 3-For-All?
   iii. more than four customers will subscribe to 3-For-All?
   iv. Suppose that in the survey of 50 prospective customers, five customers subscribe to 3-For-All. What does this tell you about the estimate of the proportion of customers who would subscribe to 3-For-All if no free premium channels are included?
b. Instead of offering no free premium channels, as in part (a), suppose that two free premium channels of the customer’s choice are included in the 3-For-All bundle. Given the above probability estimates, what is the probability that:
   i. fewer than three customers will subscribe to 3-For-All?
   ii. at most one customer will subscribe to 3-For-All?
   iii. more than four customers will subscribe to 3-For-All?
c. Compare the results of (b) to those of (a).
d. Suppose that in a survey of 50 prospective customers where two free premium channels of the customer’s choice are included in the 3-For-All offer, five customers subscribe. What does this tell you about the estimate of the proportion of customers who would subscribe to 3-For-All if two free premium channels are included?
e. What do the above results tell you about the effect of offering free premium channels of the customer’s choice on the likelihood of obtaining subscriptions to 3-For-All?

Chapter 5 Excel Guide

EG5.1 THE PROBABILITY DISTRIBUTION FOR A DISCRETE VARIABLE

Key technique Use the SUMPRODUCT(cell range 1, cell range 2) function to calculate the expected value and variance.
Example Calculate the expected value, variance, and standard deviation for the number of home mortgages approved per week data given in Table 5.1 on page 182.

In-depth Excel Use the Discrete_Variable workbook as a model.
For the example, open to the DATA worksheet of the Discrete_Variable workbook. The worksheet already contains the entries needed to calculate the expected value, variance, and standard deviation (shown in the COMPUTE worksheet) for the example.
For other problems, enter the probability distribution data into columns A and B of the DATA worksheet, overwriting the existing entries. If required, extend columns C and D, first selecting cell range C7:D7 and then copying the cell range down as many rows as necessary. If the probability distribution has fewer than six outcomes, select the rows that contain the extra, unwanted outcomes, right-click, and then click Delete in the shortcut menu.

EG5.2 COVARIANCE OF A PROBABILITY DISTRIBUTION AND ITS APPLICATION IN FINANCE

Key technique Use the SQRT and SUMPRODUCT functions to calculate the portfolio analysis statistics.

Example Perform the portfolio analysis for the Section 5.2 investment example.

PHStat Use Covariance and Portfolio Analysis.
For the example, select PHStat ➔ Decision-Making ➔ Covariance and Portfolio Analysis. In the Covariance and Portfolio Management dialog box (shown in Figure EG5.1):
1. Enter 3 as the Number of Outcomes.
2. Enter a Title, check Portfolio Management Analysis and click OK.
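The statistics reported in the Figure EG5.2 worksheet can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the Excel Guide: it assumes the three-outcome P, X and Y values and the 0.5 weight shown in Figure EG5.2, and the helper name `sumproduct` is a hypothetical stand-in for Excel's SUMPRODUCT function.

```python
from math import sqrt

# Values copied from the Figure EG5.2 worksheet (three outcomes).
p = [0.2, 0.5, 0.3]      # probabilities P
x = [-100, 100, 250]     # outcomes for X
y = [200, 50, -100]      # outcomes for Y
w = 0.5                  # weight assigned to X


def sumproduct(a, b):
    """Hypothetical stand-in for Excel's SUMPRODUCT on two equal-length ranges."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Expected values: probability-weighted sums of the outcomes.
e_x = sumproduct(p, x)
e_y = sumproduct(p, y)

# Variances and covariance: probability-weighted squared (or cross) deviations.
var_x = sumproduct(p, [(xi - e_x) ** 2 for xi in x])
var_y = sumproduct(p, [(yi - e_y) ** 2 for yi in y])
cov_xy = sumproduct(p, [(xi - e_x) * (yi - e_y) for xi, yi in zip(x, y)])

# Portfolio expected return and risk for weights w and 1 - w.
portfolio_return = w * e_x + (1 - w) * e_y
portfolio_risk = sqrt(
    w**2 * var_x + (1 - w) ** 2 * var_y + 2 * w * (1 - w) * cov_xy
)

print("E(X), E(Y):", e_x, e_y)
print("Var(X), Var(Y), Cov(X,Y):", var_x, var_y, cov_xy)
print("Portfolio expected return and risk:", portfolio_return, portfolio_risk)
```

Up to floating-point rounding, the results should agree with the worksheet values in Figure EG5.2: E(X) = 105, E(Y) = 35, a covariance of –12675, a portfolio expected return of 70 and a portfolio risk of 10.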
Figure EG5.1 Covariance and Portfolio Management dialog box

In the new worksheet (shown in Figure EG5.2):
1. Enter the probabilities and outcomes in the table that begins in cell B3.
2. Enter 0.5 as the Weight assigned to X.

Figure EG5.2 COMPUTE worksheet of Portfolio workbook

Row  Column A                     B            C       D       Formula in column B
1    Covariance analysis
3    Probabilities & outcomes:    P            X       Y
4                                 0.2          –100    200
5                                 0.5          100     50
6                                 0.3          250     –100
8    Weight assigned to X         0.5
10   Statistics
11   E(X)                         105                          =SUMPRODUCT(B4:B6,C4:C6)
12   E(Y)                         35                           =SUMPRODUCT(B4:B6,D4:D6)
13   Variance(X)                  14725                        =SUMPRODUCT(B4:B6,G13:$G$15)
14   Standard deviation(X)        121.346611                   =SQRT(B13)
15   Variance(Y)                  11025                        =SUMPRODUCT(B4:B6,H13:$H$15)
16   Standard deviation(Y)        105                          =SQRT(B15)
17   Covariance(XY)               –12675                       =SUMPRODUCT(B4:B6,I13:$I$15)
18   Variance(X+Y)                400                          =B13+B15+2*B17
19   Standard deviation(X+Y)      20                           =SQRT(B18)
21   Portfolio management
22   Weight assigned to X         0.5                          =B8
23   Weight assigned to Y         0.5                          =1-B22
24   Portfolio expected return    70                           =B22*B11+B23*B12
25   Portfolio risk               10                           =SQRT(B22^2*B13+B23^2*B15+2*B22*B23*B17)

In-depth Excel Use the COMPUTE worksheet of the Portfolio workbook as a template.
The worksheet (shown in Figure EG5.2) already contains the data for the example. Overwrite the P, X and Y values and the weight assigned to X when you enter data for other problems. If a problem has more or fewer than three outcomes, first select row 5, right-click, and click Insert (or Delete) in the shortcut menu to insert (or delete) rows one at a time. If you insert rows, select the cell range F4:J4 and copy the contents of this range down through the new table rows.
The worksheet also contains a Calculations Area that contains various intermediate calculations. Open the COMPUTE_FORMULAS worksheet to examine all the formulas used in this area.

EG5.3 BINOMIAL DISTRIBUTION

Key technique Use the BINOM.DIST(number of events of interest, sample size, probability of an event of interest, FALSE) function.

Example Calculate the binomial probabilities for n = 4 and p = 0.1, given in Figure 5.2 for the ‘number of online enquiries converted to bookings’ problem.

PHStat Use Binomial.
For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Binomial. In the Binomial Probability Distribution dialog box (shown in Figure EG5.3):
1. Enter 4 as the Sample Size.
2. Enter 0.1 as the Prob. of an Event of Interest.
3. Enter 0 as the Outcomes From value and enter 4 as the (Outcomes) To value.
4. Enter a Title, check Histogram, and click OK.

Figure EG5.3 Binomial Probability Distribution dialog box

Check Cumulative Probabilities before clicking OK in step 4 to have the procedure include columns for P(≤X), P(<X), P(>X) and P(≥X) in the binomial probabilities table.

In-depth Excel Use the Binomial workbook as a template and model.
For the example, open to the COMPUTE worksheet of the Binomial workbook, shown in Figure 5.2 on page 193. The worksheet already contains the entries needed for the example. For other problems, change the sample size in cell B4 and the probability of an event of interest in cell B5. If necessary, extend the binomial probabilities table by selecting cell range A18:B18 and then copying the cell range down as many rows as necessary.
Use the CUMULATIVE worksheet if you require cumulative probabilities. Use the CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.

EG5.4 POISSON DISTRIBUTION

Key technique Use the POISSON.DIST(number of events of interest, the average or expected number of events of interest, FALSE) function.

Example Calculate the Poisson probabilities for the ‘number of online enquiries received in 20 minutes’ problem with λ = 10, as in Figure 5.4 on page 198.

PHStat Use Poisson.
For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Poisson. In the Poisson Probability Distribution dialog box (shown in Figure EG5.4):
1. Enter 10 as the Mean/Expected No. of Events of Interest.
2. Enter a Title and click OK.

Figure EG5.4 Poisson Probability Distribution dialog box

Check Cumulative Probabilities before clicking OK in step 2 to have the procedure include columns for P(≤X), P(<X), P(>X) and P(≥X) in the Poisson probabilities table. Check Histogram to construct a histogram of the Poisson probability distribution.

In-depth Excel Use the Poisson workbook as a template.
For the example, open to the COMPUTE worksheet of the Poisson workbook, shown in Figure 5.4 on page 198. The worksheet already contains the entries for the example. For other problems, change the mean or expected number of events of interest in cell E4. If necessary, extend the Poisson probabilities table by selecting cell range A28:B28 and then copying the cell range down as many rows as necessary.
Use the CUMULATIVE worksheet if you require cumulative probabilities. Use the CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.

EG5.5 HYPERGEOMETRIC DISTRIBUTION

Key technique Use the HYPGEOM.DIST(X, sample size, number of events of interest in the population, population size, FALSE) function.

Example Calculate the hypergeometric probabilities for the team formation problem in Figure 5.5 on page 202.

PHStat Use Hypergeometric.
For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Hypergeometric. In this procedure’s dialog box (shown in Figure EG5.5):
1. Enter 8 as the Sample Size.
2. Enter 10 as the No. of Events of Interest in Pop.
3. Enter 30 as the Population Size.
4. Enter a Title and click OK.

Figure EG5.5 Hypergeometric Probability Distribution dialog box

Check Histogram to produce a histogram of the probability distribution.
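For readers working outside Excel, the three key-technique functions in Sections EG5.3 to EG5.5 can be sketched with Python's standard library. This is an illustration under stated assumptions rather than the workbooks' own method: the function names `binom_pmf`, `poisson_pmf` and `hypergeom_pmf` are hypothetical, and the parameter values are those entered in the three PHStat examples (n = 4 and p = 0.1; a mean of 10; sample size 8, 10 events of interest in a population of 30).

```python
from math import comb, exp, factorial


def binom_pmf(x, n, p):
    """P(X = x) for a binomial variable, as BINOM.DIST(x, n, p, FALSE) would give."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable, as POISSON.DIST(x, lam, FALSE) would give."""
    return exp(-lam) * lam**x / factorial(x)


def hypergeom_pmf(x, n, a, big_n):
    """P(X = x) when n items are drawn without replacement from big_n items,
    a of which are events of interest, as HYPGEOM.DIST(x, n, a, big_n, FALSE)."""
    return comb(a, x) * comb(big_n - a, n - x) / comb(big_n, n)


# Probability tables for the three Excel Guide examples.
binomial_table = [binom_pmf(x, 4, 0.1) for x in range(5)]
poisson_table = [poisson_pmf(x, 10) for x in range(21)]
hypergeom_table = [hypergeom_pmf(x, 8, 10, 30) for x in range(9)]

for x, prob in enumerate(binomial_table):
    print(f"Binomial P(X = {x}) = {prob:.4f}")
```

The binomial and hypergeometric tables cover every possible outcome, so each sums to 1. The Poisson table is truncated at X = 20 and so sums to slightly less than 1.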
In-depth Excel Use the Hypergeometric workbook as a template.
For the example, open to the COMPUTE worksheet of the Hypergeometric workbook, shown in Figure 5.5 on page 202. The worksheet already contains the entries for the example. For other problems, change the sample size in cell B4, the number of events of interest in the population in cell B5, and the population size in cell B6. If necessary, extend the hypergeometric probabilities table by selecting cell range A18:B18 and then copying the cell range down as many rows as necessary.
Use the CUMULATIVE worksheet if you require cumulative probabilities. Use the CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

CHAPTER 6
The normal distribution and other continuous distributions

TASMAN UNIVERSITY ORIENTATION
As part of orientation activities, new students at Tasman University (TU) are encouraged to complete a ‘Welcome to Tasman University (TU)’ online program. To assess the success – or otherwise – of this program, data have been collected on the time a new student spends working through it. The data suggest that the time students spend on the first module in the program, ‘Introduction to TU’, is normally distributed with a mean of 7 minutes and a standard deviation of 2 minutes. From the data, the time students spend on another module in the program, ‘Support at TU’, is also normal, but with a mean of 4 minutes and a standard deviation of 1 minute. How can the orientation organisers use these data to answer questions about the time students spend on the ‘Introduction to TU’ and ‘Support at TU’ modules?
© Solis Images/Shutterstock

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 calculate probabilities from the normal distribution
2 determine whether a set of data is approximately normally distributed
3 calculate probabilities from the uniform distribution
4 calculate probabilities from the exponential distribution
5 use the normal distribution to approximate probabilities from the binomial distribution

In the Gaia Adventure Tours scenario in Chapter 5, Yang wanted to solve problems about the number of occurrences of an outcome in a given sample size or the number of events in a specified interval. A different task is faced in the Tasman University Orientation scenario, one that involves a continuous measurement, since the time students spend on the ‘Introduction to TU’ module can be any positive value, not just an integer value. How, then, can the orientation organisers answer questions about continuous numerical variables such as:
• What proportion of students spend more than 9 minutes on the ‘Introduction to TU’ module?
• 10% of students spend less than how long on the module?
• What is the probability that a randomly chosen student accessing the module spends less than 3.5 minutes on it?

As in Chapter 5, we use probability distributions as models. This chapter introduces the characteristics of a continuous probability distribution and then uses the normal, uniform and exponential distributions to solve business and other problems.

6.1 CONTINUOUS PROBABILITY DISTRIBUTIONS
Chapter 5 discussed discrete random variables and probability distributions. In this chapter we look at continuous random variables and probability distributions.
Continuous random variables arise from a measuring process where the response can take on any value within a continuum or interval; for example, time, temperature, weight, height, revenue or cost. A continuous probability density function, represented by f(x), is the mathematical expression that defines the distribution of the values for a continuous random variable.

Figure 6.1 graphically displays the three continuous probability density functions discussed in this chapter. Panel A depicts a normal distribution. The normal distribution is symmetrical and bell shaped, implying that most values tend to cluster around the mean, which, due to the distribution’s symmetry, is equal to the median. Although the values in a normal distribution can range from negative infinity to positive infinity, the shape of the distribution makes it very unlikely that extremely large or extremely small values will occur.

Panel B depicts a uniform distribution, in which a value is equally likely to occur anywhere in the range between the smallest value a and the largest value b. Sometimes referred to as the rectangular distribution, the uniform distribution is symmetrical and therefore the mean equals the median.

An exponential distribution is illustrated in panel C. This distribution is skewed to the right, with the mean larger than the median. The range for an exponential distribution is zero to positive infinity, but its shape makes the occurrence of extremely large values unlikely.

continuous probability density function Mathematical expression that defines the distribution of the values for a continuous random variable.
Figure 6.1 Three continuous distributions (Panel A: normal distribution; Panel B: uniform distribution between a and b; Panel C: exponential distribution; the horizontal axis of each panel shows values of X)

Note that, unlike a discrete probability function, a continuous probability density function does not give probabilities directly; it gives the graph (curve) of the probability distribution. Probabilities involving continuous random variables are calculated as areas under the curve given by the probability density function and between specified values of the random variable.

LEARNING OBJECTIVE 1 Calculate probabilities from the normal distribution

normal distribution Continuous probability distribution represented by a bell-shaped curve.

6.2 THE NORMAL DISTRIBUTION
The normal distribution (sometimes referred to as the Gaussian distribution) is the most common continuous probability distribution used in statistics. The normal distribution is vitally important in statistics for three main reasons:
1. Numerous continuous variables common in business, and elsewhere, have distributions that are normal or approximately normal.
2. The normal distribution can be used to approximate various discrete probability distributions.
3. The normal distribution provides the basis for classical statistical inference because of its relationship to the Central Limit Theorem (discussed in Section 7.2).

The normal distribution is represented by the classic bell shape depicted in panel A of Figure 6.1. In the normal distribution, we can calculate the probability that values of the random variable occur within a range or interval. However, the probability of a particular or individual value of a continuous random variable, such as a normal random variable, is zero.
This property distinguishes continuous variables, which are measured, from discrete variables, which are counted. As an example, time (in seconds) is measured and not counted. Therefore, we can determine the probability that the load time for a website is between 1 and 5 seconds or between 2 and 4 seconds or between 2.99 and 3.01 seconds. However, the probability that the load time is exactly 3 seconds is effectively zero.

The normal distribution has several important theoretical properties:
• It is bell-shaped (and thus symmetrical) in its appearance.
• Its mean and median are equal.
• Its middle 50% of data spans approximately $\frac{4}{3}$ standard deviations. This means that the interquartile range is contained within an interval of two-thirds of a standard deviation below the mean to two-thirds of a standard deviation above the mean – that is, the middle 50% of data have Z scores (introduced in Section 3.1) between $-\frac{2}{3}$ and $\frac{2}{3}$.
• Its associated random variable has an infinite range ($-\infty < X < \infty$).

In practice, many variables have distributions that closely resemble the theoretical properties of the normal distribution. The data in Table 6.1 represent the thickness (in millimetres) of 10,000 brass washers manufactured by a large company. The continuous variable of interest, thickness, can be approximated by the normal distribution. The measurements of the thickness of the 10,000 brass washers cluster in the interval 0.485 to 0.495 mm and are distributed symmetrically around that interval, forming a bell-shaped pattern. As illustrated in Table 6.1, the non-overlapping (mutually exclusive) classes contain all possible values (are collectively exhaustive) and so the relative frequencies sum to 1.
Table 6.1 Thickness of 10,000 brass washers

Thickness (mm)      Frequency     Relative frequency
Under 0.425         0             0
0.425 < 0.435       48            0.0048
0.435 < 0.445       122           0.0122
0.445 < 0.455       325           0.0325
0.455 < 0.465       695           0.0695
0.465 < 0.475       1,198         0.1198
0.475 < 0.485       1,664         0.1664
0.485 < 0.495       1,896         0.1896
0.495 < 0.505       1,664         0.1664
0.505 < 0.515       1,198         0.1198
0.515 < 0.525       695           0.0695
0.525 < 0.535       325           0.0325
0.535 < 0.545       122           0.0122
0.545 < 0.555       48            0.0048
0.555 or above      0             0
Total               10,000        1.0000

Figure 6.2 depicts the relative frequency histogram and polygon for the distribution of the thickness of 10,000 brass washers. For these data, the first three theoretical properties of the normal distribution are approximately satisfied; however, the fourth does not hold. The random variable of interest, thickness, cannot possibly be zero or below, and a washer cannot be so thick that it becomes unusable. From Table 6.1, only 48 out of every 10,000 brass washers are expected to have a thickness of between 0.545 and 0.555 mm and none above 0.555 mm, whereas an equal number is expected to have a thickness between 0.425 and 0.435 mm and none below 0.425 mm. Thus, the chance of randomly getting a washer thinner than 0.435 mm or thicker than 0.545 mm is 0.0048 + 0.0048 = 0.0096 – or less than 1 in 100.

Figure 6.2 Relative frequency histogram and polygon of the thickness of 10,000 brass washers (horizontal axis: thickness in mm, 0.42 to 0.56; vertical axis: relative frequency, 0.00 to 0.20)

For the normal distribution, the normal probability density function is given by Equation 6.1.

normal probability density function Mathematical expression that defines the normal distribution.
THE NORMAL PROBABILITY DENSITY FUNCTION

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(X-\mu)/\sigma]^2} \qquad (6.1)$$

where
e = 2.71828… is the base of natural logarithms
π = 3.14159…
μ is the mean
σ is the standard deviation
X is any value of the normal random variable, where $-\infty < X < \infty$

Because e and π are mathematical constants, the probability density function depends on the two parameters of the normal distribution: the mean μ and the standard deviation σ. Each combination of μ and σ generates a different normal distribution. Figure 6.3 illustrates three different normal distributions. Distributions A and B have the same mean (μ) but have different standard deviations. Distributions A and C have the same standard deviation (σ) but have different means. Distributions B and C depict two normal probability density functions that differ with respect to both μ and σ.

Figure 6.3 Three normal distributions (curves labelled A, B and C)

transformation formula Z score formula used to convert any normal random variable to the standardised normal random variable.

standardised normal random variable Normal random variable with a mean of 0 and a standard deviation of 1.

Normal probabilities are calculated as areas under the curve given by Equation 6.1; this requires integral calculus, and the integral has no closed-form solution. Fortunately, all normal probabilities can be calculated from normal probability tables. However, as there is a different normal probability distribution for each combination of μ and σ, the first step in finding a normal probability is to use the transformation formula, given in Equation 6.2, to convert any normal random variable X to a standardised normal random variable Z.
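To make ‘area under the curve’ concrete, the density in Equation 6.1 can be integrated numerically. The sketch below is an illustration only, not the method used in this chapter (which relies on tables). It assumes the ‘Introduction to TU’ values μ = 7 and σ = 2, approximates the area to the left of 9 minutes with the trapezoidal rule, and compares it with a reference value from the error function in Python's standard library. The function names are hypothetical.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Equation 6.1: the normal probability density function."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def area_below(x, mu, sigma, steps=100_000):
    """Approximate P(X < x) with the trapezoidal rule.

    The lower limit mu - 10*sigma stands in for minus infinity; the area
    to the left of it is negligible.
    """
    lower = mu - 10 * sigma
    width = (x - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(x, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lower + i * width, mu, sigma)
    return total * width


def normal_cdf(x, mu, sigma):
    """Reference value for P(X < x) via the standard-library error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


approx = area_below(9, 7, 2)   # numerical area under Equation 6.1
exact = normal_cdf(9, 7, 2)    # reference value
print(f"trapezoidal rule: {approx:.6f}, error function: {exact:.6f}")
```

The two values should agree to several decimal places. Tables of the standardised normal distribution avoid repeating this integration for every combination of μ and σ.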
TRANSFORMATION FORMULA
The Z value is equal to the difference between X and the mean μ, divided by the standard deviation σ.

$$Z = \frac{X - \mu}{\sigma} \qquad (6.2)$$

Equation 6.2 is a restatement of the Z score equation (3.12), introduced in Chapter 3. Thus, Equation 6.2 represents the distance between a given value of the random variable X and the mean expressed in standard deviations.

Although the original normal random variable X had mean μ and standard deviation σ, the standard normal random variable Z has mean μ = 0 and standard deviation σ = 1. By substituting μ = 0 and σ = 1 in Equation 6.1, the probability density function of the standardised normal variable Z is given in Equation 6.3.

THE STANDARDISED NORMAL PROBABILITY DENSITY FUNCTION

$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)Z^2} \qquad (6.3)$$

Any normal probability distribution can be converted to the standardised probability distribution. Then normal probabilities can be determined from Table E.2, the cumulative standardised normal distribution.

To see how the transformation formula is applied and the results used to find probabilities from Table E.2, recall from the Tasman University Orientation scenario at the beginning of the chapter that data indicate that the time students spend on the ‘Introduction to TU’ module is normal, with mean μ = 7 minutes and standard deviation σ = 2 minutes. From Figure 6.4, it can be seen that every value of the random variable X, time, has a corresponding standardised Z value calculated by the transformation formula (Equation 6.2).
Therefore, a time of 9 minutes is equivalent to Z = 1; that is, 9 minutes is one standard deviation above the mean since:

$$Z = \frac{9 - 7}{2} = +1$$

Figure 6.4 Transformation of scales (time on ‘Introduction to TU’ module; X scale in minutes, μ = 7, σ = 2: 1, 3, 5, 7, 9, 11, 13, corresponding to μ – 3σ through μ + 3σ; Z scale, μ = 0, σ = 1: –3, –2, –1, 0, +1, +2, +3)

cumulative standardised normal distribution Represents the cumulative area under the standard normal curve less than a given value.

A time of 1 minute is equivalent to Z = –3; that is, 1 minute is three standard deviations below the mean since:

$$Z = \frac{1 - 7}{2} = -3$$

Thus, the standard deviation is the unit of measurement. In other words, a time of 9 minutes is 2 minutes (i.e. one standard deviation) higher, or longer, than the mean time of 7 minutes. Similarly, if a student spends 1 minute on the module it is 6 minutes (i.e. three standard deviations) lower, or shorter, than the mean time.

To illustrate the transformation formula further, the time students spend on the ‘Support at TU’ module is also normal, with a mean of 4 minutes and a standard deviation of 1 minute. This distribution is illustrated in Figure 6.5. For ‘Support at TU’, a time of 5 minutes is one standard deviation above the mean time since:

$$Z = \frac{5 - 4}{1} = +1$$

A time of 1 minute is three standard deviations below the mean time since:

$$Z = \frac{1 - 4}{1} = -3$$

The two bell-shaped curves in Figures 6.4 and 6.5 represent the probability density functions of the time (in minutes) students spend on the two modules. Since the times represent the entire population, the area under the entire curve, representing probability, must be 1.
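The change of scale described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the module means and standard deviations from the orientation scenario. The function name `z_score` is hypothetical.

```python
def z_score(x, mu, sigma):
    """Equation 6.2: distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma


# (module name, mean in minutes, standard deviation in minutes)
modules = [("Introduction to TU", 7, 2), ("Support at TU", 4, 1)]

for name, mu, sigma in modules:
    # Convert each tick mark on the Z scale (-3 to +3) to minutes on the
    # X scale, then confirm the round trip with the transformation formula.
    for z in range(-3, 4):
        minutes = mu + z * sigma
        assert z_score(minutes, mu, sigma) == z
        print(f"{name}: X = {minutes} min  ->  Z = {z:+d}")
```

Both modules map onto the same Z scale, which is why a single table of the standardised normal distribution serves every normal distribution.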
Figure 6.5 A different transformation of scales (time on ‘Support at TU’ module; X scale in minutes, μ = 4, σ = 1: 1, 2, 3, 4, 5, 6, 7; Z scale, μ = 0, σ = 1: –3, –2, –1, 0, +1, +2, +3)

The steps to find the probability that the time a student spends on the ‘Introduction to TU’ module in the Tasman University Orientation scenario is less than 9 minutes are as follows:
1. Use Equation 6.2 to transform X = 9 to the corresponding Z value:

$$Z = \frac{9 - 7}{2} = 1$$

2. Use Table E.2 to find the cumulative area under the standard normal curve less than (i.e. to the left of) Z = 1.00.

To read the probability or area under the curve less than Z = 1.00, scan down the Z column in Table E.2 to the Z value of interest to one decimal place, the Z row for 1.0. Read across this row until it intersects the column that contains the second decimal place of the Z value, the column representing .00. Therefore, from the body of the table, the probability for P(Z < 1.00) is given by the intersection of the row Z = 1.0 and the column Z = .00, as shown in Table 6.2, which is extracted from Table E.2. This probability is 0.8413 – that is, P(Z < 1.00) = 0.8413.

Table 6.2 Finding a cumulative area under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
0.0    .5000   .5040   .5080   .5120   .5160   .5199   .5239   .5279   .5319   .5359
0.1    .5398   .5438   .5478   .5517   .5557   .5596   .5636   .5675   .5714   .5753
0.2    .5793   .5832   .5871   .5910   .5948   .5987   .6026   .6064   .6103   .6141
0.3    .6179   .6217   .6255   .6293   .6331   .6368   .6406   .6443   .6480   .6517
0.4    .6554   .6591   .6628   .6664   .6700   .6736   .6772   .6808   .6844   .6879
0.5    .6915   .6950   .6985   .7019   .7054   .7088   .7123   .7157   .7190   .7224
0.6    .7257   .7291   .7324   .7357   .7389   .7422   .7454   .7486   .7518   .7549
0.7    .7580   .7612   .7642   .7673   .7704   .7734   .7764   .7794   .7823   .7852
0.8    .7881   .7910   .7939   .7967   .7995   .8023   .8051   .8078   .8106   .8133
0.9    .8159   .8186   .8212   .8238   .8264   .8289   .8315   .8340   .8365   .8389
1.0    .8413   .8438   .8461   .8485   .8508   .8531   .8554   .8577   .8599   .8621

As illustrated in Figure 6.6, there is an 84.13% likelihood that a student will spend less than 9 minutes on the ‘Introduction to TU’ online module.

Figure 6.6 Determining the area less than Z from a cumulative standardised normal distribution (time on ‘Introduction to TU’ module; area 0.8413 to the left of X = 9 minutes, Z = +1.00; X scale 1 to 13 minutes, Z scale –3.00 to +3.00)

For ‘Support at TU’, as Z = (5 – 4)/1 = 1 (see Figure 6.7), the probability of a time less than 5 minutes is also 0.8413. Figure 6.7 shows that, regardless of the value of the mean μ and standard deviation σ of a normal random variable X, Equation 6.2 can be used to transform the distribution to the standard normal distribution Z.
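The table lookup can be mirrored in code. The sketch below is a hedged illustration rather than a replacement for Table E.2: it assumes the cumulative area can be computed from the error function in Python's standard library, and the function names are hypothetical. It reproduces the .00 column of Table 6.2 and the two module probabilities discussed above.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def prob_less_than(x, mu, sigma):
    """P(X < x): standardise with Equation 6.2, then take the cumulative area."""
    return standard_normal_cdf((x - mu) / sigma)


# The .00 column of Table 6.2, for Z = 0.0, 0.1, ..., 1.0.
for tenths in range(11):
    z = tenths / 10
    print(f"Z = {z:.1f}  cumulative area = {standard_normal_cdf(z):.4f}")

p_intro = prob_less_than(9, 7, 2)     # 'Introduction to TU', less than 9 minutes
p_support = prob_less_than(5, 4, 1)   # 'Support at TU', less than 5 minutes
print(round(p_intro, 4), round(p_support, 4))  # 0.8413 0.8413
```

Both calls standardise to Z = 1, so they return the same cumulative area, as Figure 6.7 illustrates.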
Figure 6.7 A transformation of scales for corresponding cumulative portions under two normal curves (X scales for ‘Support at TU’ and ‘Introduction to TU’ aligned with a common Z scale from –3 to +3)

In the following examples, which answer questions relating to the time students spend on the ‘Introduction to TU’ module, when necessary the normal curve is sketched and the required probability/area shaded before using Table E.2 with Equation 6.2 to calculate the required probability.

EXAMPLE 6.1 FINDING P(X > 9)
What is the probability that a student will spend at least 9 minutes on ‘Introduction to TU’?

SOLUTION
The probability that a student spends less than 9 minutes is 0.8413 (see Figure 6.6). Thus, the probability that a student will spend at least 9 minutes is the complement of less than 9 minutes, so:

$$P(X \ge 9) = 1 - P(X < 9) = 1 - 0.8413 = 0.1587$$

Therefore, approximately 15.9% of students spend at least 9 minutes on ‘Introduction to TU’. Figure 6.8 illustrates this result.

Figure 6.8 Finding P(X > 9) (time on ‘Introduction to TU’ module; area 0.8413 below 9 minutes and area 0.1587 above; X scale 1 to 13 minutes, Z scale –3.00 to +3.00)

EXAMPLE 6.2 FINDING P(7 < X < 9)
What is the probability that a student spends between 7 and 9 minutes on ‘Introduction to TU’?

SOLUTION
From Figure 6.6, P(X < 9) = 0.8413. Now determine the probability that the time will be at most 7 minutes and subtract this from the probability that the time is less than 9 minutes. That is:

$$P(7 < X < 9) = P(X < 9) - P(X \le 7)$$

This is shown in Figure 6.9.
Figure 6.9 Finding P(7 < X < 9) (time on ‘Introduction to TU’ module; area 0.5000 below 7 minutes, area 0.3413 between 7 and 9 minutes, area 0.1587 above 9 minutes; X scale 1 to 13 minutes, Z scale –3.00 to +3.00)

From Equation 6.2 and Table E.2:

$$P(X \le 7) = P\left(Z \le \frac{7 - 7}{2}\right) = P(Z \le 0.00) = 0.5000$$

Therefore:

$$P(7 < X < 9) = P(X < 9) - P(X \le 7) = 0.8413 - 0.5000 = 0.3413$$

EXAMPLE 6.3 FINDING P(X < 7 OR X > 9)
What is the probability that the time a student spends on ‘Introduction to TU’ is at most 7 minutes or at least 9 minutes?

SOLUTION
From Figure 6.9, the probability that a student spends between 7 and 9 minutes is 0.3413. Therefore, the probability that the time spent on ‘Introduction to TU’ is at most 7 minutes or at least 9 minutes is its complement, so:

$$P(X \le 7 \text{ or } X \ge 9) = 1 - P(7 < X < 9) = 1 - 0.3413 = 0.6587$$

Alternatively, calculate separately the probability of a time of 7 minutes or less and the probability of a time of 9 minutes or more and then add these two probabilities together to obtain the desired result (see Figure 6.10). Because the mean and median are the same for a normal distribution, 50% of students spend 7 minutes or less. From Example 6.1, the probability of a student spending at least 9 minutes is 0.1587. Hence, the probability that the time on ‘Introduction to TU’ is at most 7 minutes or at least 9 minutes is:

$$P(X \le 7 \text{ or } X \ge 9) = P(X \le 7) + P(X \ge 9) = 0.5000 + 0.1587 = 0.6587$$

Figure 6.10 Finding P(X ≤ 7 or X > 9) (time on ‘Introduction to TU’ module; area 0.5000 below 7 minutes, area 0.3413 between 7 and 9 minutes, area 0.1587 above 9 minutes; X scale 1 to 13 minutes, Z scale –3.00 to +3.00)

EXAMPLE 6.4 FINDING P(5 < X < 9)
What is the probability that a student spends between 5 and 9 minutes on ‘Introduction to TU’?

SOLUTION
The required area/probability is the area under the curve between X = 5 and X = 9 (see Figure 6.11).
As Table E.2 gives probabilities less than a particular value of interest, we calculate the probabilities P(X < 9) and P(X ⩽ 5) and then obtain the desired probability/area by subtraction:

P(5 < X < 9) = P((5 − 7)/2 < Z < (9 − 7)/2) = P(−1 < Z < 1) = P(Z < 1) − P(Z ⩽ −1) = 0.8413 − 0.1587 = 0.6826

Figure 6.11 Finding P(5 < X ⩽ 9) (area 0.6826 between X = 5 and X = 9; area 0.1587 in each tail)

The result of Example 6.4 is important and allows us to generalise the findings. For any normal distribution there is a 0.6826 probability that a randomly selected item will fall within ±1 standard deviation of the mean.

From Figure 6.12, slightly more than 95% of the items will fall within ±2 standard deviations of the mean. Thus, 95.44% of students will spend between 3 and 11 minutes on ‘Introduction to TU’.

From Figure 6.13, 99.73% of the items will fall within ±3 standard deviations of the mean. Thus, 99.73% of students will spend between 1 and 13 minutes on ‘Introduction to TU’. Therefore, it is unlikely (0.00135, or 135 in 100,000) that a student will spend less than a minute on ‘Introduction to TU’. Similarly, it is unlikely (0.135%) that a student will spend more than 13 minutes on ‘Introduction to TU’. For this reason, 6σ (i.e. three standard deviations below the mean to three standard deviations above the mean) is often used as a practical approximation of the range for normally distributed data.
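The probabilities in Examples 6.1 to 6.4, and the ±1, ±2 and ±3 standard deviation areas, can also be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and its standard-library class statistics.NormalDist in place of Table E.2; the variable names and the helper area_within are illustrative only and are not part of the text.

```python
from statistics import NormalDist

# Time on 'Introduction to TU': normal with mean 7 minutes and standard deviation 2 minutes.
tu = NormalDist(mu=7, sigma=2)

p_at_least_9 = 1 - tu.cdf(9)                  # Example 6.1: complement of P(X < 9)
p_7_to_9 = tu.cdf(9) - tu.cdf(7)              # Example 6.2: difference of two cumulative areas
p_le_7_or_ge_9 = tu.cdf(7) + (1 - tu.cdf(9))  # Example 6.3: sum of the two outer areas
p_5_to_9 = tu.cdf(9) - tu.cdf(5)              # Example 6.4: within 1 standard deviation of the mean


def area_within(k, dist):
    """Area within k standard deviations either side of the mean (illustrative helper)."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


for label, p in [
    ("P(X >= 9)", p_at_least_9),
    ("P(7 < X < 9)", p_7_to_9),
    ("P(X <= 7 or X >= 9)", p_le_7_or_ge_9),
    ("P(5 < X < 9)", p_5_to_9),
]:
    print(f"{label}: {p:.4f}")

for k in (1, 2, 3):
    print(f"within +/-{k} standard deviations: {area_within(k, tu):.4f}")
```

Because cdf works with unrounded Z values and areas, the last decimal place can differ slightly from the table-based answers (for example, 0.6827 rather than 0.6826 for the area within ±1 standard deviation).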
Figure 6.12 Finding P(3 < X < 11) (area 0.9544 between X = 3 and X = 11; area 0.0228 in each tail)

Figure 6.13 Finding P(1 < X < 13) (area 0.9973 between X = 1 and X = 13; area 0.00135 in each tail)

Therefore, for any normal distribution:
• approximately 68.26% of the items will fall within ±1 standard deviation of the mean
• approximately 95.44% of the items will fall within ±2 standard deviations of the mean
• approximately 99.73% of the items will fall within ±3 standard deviations of the mean.

The above result is the justification for the empirical rule introduced in Chapter 3. The closer a data set is to a normal distribution, the more accurate the empirical rule is.

EXAMPLE 6.5 FINDING P(X < 3.5)
What is the probability that a student will spend less than 3.5 minutes on ‘Introduction to TU’?

SOLUTION The required probability/area is the shaded lower left-tail region of Figure 6.14.

Figure 6.14 Finding P(X < 3.5) (area 0.0401 below X = 3.5, which corresponds to Z = −1.75)

To determine the area under the curve below 3.5 minutes, first calculate the required Z value:

P(X < 3.5) = P(Z < (3.5 − 7)/2) = P(Z < −1.75)

Then look up the Z value of −1.75 in Table E.2 by matching the appropriate Z row (−1.7) with the appropriate Z column (.05), as shown in Table 6.3 (which is extracted from Table E.2). The resulting probability, or area under the curve below Z = −1.75 (that is, more than 1.75 standard deviations below the mean), is 0.0401. Therefore:

P(X < 3.5) = 0.0401

That is, orientation organisers can expect approximately 4% of students to spend less than 3.5 minutes on ‘Introduction to TU’.

Table 6.3 Finding a cumulative area under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
−1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
−1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455

Examples 6.1 to 6.5 used the cumulative standard normal table to find an area under the normal curve that corresponded to a specific X value. There are circumstances when we want to do the opposite. Examples 6.6 and 6.7, still referring to the time students spend on ‘Introduction to TU’ in the Tasman University Orientation scenario, illustrate how to find the X value that corresponds to a specific area.

EXAMPLE 6.6 FINDING THE X VALUE FOR A CUMULATIVE PROBABILITY OF 0.10
What is the maximum amount of time spent on the ‘Introduction to TU’ module by the 10% of students who use it the least?

SOLUTION Because 10% of students spend less than X minutes on ‘Introduction to TU’, the area under the normal curve less than the corresponding Z value is 0.1000. Use the body of Table E.2 to search for the area/probability of 0.1000. The closest result is 0.1003, as shown in Table 6.4 (which is extracted from Table E.2).

Table 6.4 Finding a Z value corresponding to a particular cumulative area (0.10) under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
−1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
−1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
−1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
−1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985

Working from this area/probability to the margins of the table, the Z value corresponding to the particular Z row (−1.2) and Z column (.08) is −1.28 (see Figure 6.15).
That is, the 10th-percentile time, being the maximum amount of time spent on the ‘Introduction to TU’ module by the 10% of students who use it the least, is 1.28 standard deviations below the mean.

Figure 6.15 Finding Z to determine X (area 0.1000 below X, which corresponds to Z = −1.28; area 0.9000 above)

Then rearrange the transformation formula (Equation 6.2) to determine the corresponding X value as follows. Since:

Z = (X − μ)/σ

then:

X = μ + Zσ

Substituting μ = 7, σ = 2 and Z = −1.28 in the rearranged transformation formula, we obtain:

X = 7 + (−1.28) × 2 = 7 − 2.56 = 4.44 minutes

Thus, 10% of students spend less than 4.44 minutes on ‘Introduction to TU’. Equation 6.4 is used to find the X value that corresponds to a Z value.

FINDING AN X VALUE ASSOCIATED WITH KNOWN PROBABILITY
The X value is equal to the mean μ plus the product of the Z value and the standard deviation σ.

X = μ + Zσ (6.4)

To find a particular value associated with a known probability, follow these steps:
1. Sketch the normal curve, and then place the values for the means on the respective X and Z scales.
2. Find the cumulative area less than X.
3. Shade the area of interest.
4. Using Table E.2, determine the Z value corresponding to the area under the normal curve less than X.
5. Use Equation 6.4 to solve for X: X = μ + Zσ

EXAMPLE 6.7 FINDING THE X VALUES THAT INCLUDE THE TIMES THAT 95% OF STUDENTS SPEND ON ‘INTRODUCTION TO TU’
What are the lower and upper values of X, located symmetrically around the mean, which include the middle 95% of times that students spend on ‘Introduction to TU’?

SOLUTION First find the lower value of X (called XL). Then find the upper value of X (called XU).
Since 95% of the values are between XL and XU, and XL and XU are an equal distance from the mean, 2.5% of the values are below XL (see Figure 6.16).

Figure 6.16 Finding Z to determine XL (area 0.0250 below XL, which corresponds to Z = −1.96; area 0.9750 above)

Although XL is not known, the corresponding Z value can be found because the area under the normal curve less than this Z is 0.0250. Using the body of Table E.2 (see Table 6.5), search for the probability 0.0250.

Table 6.5 Finding a Z value corresponding to a cumulative area of 0.025 under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
−2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
−1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
−1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294

Working from the body of the table to the margins, the Z value that corresponds to the particular Z row (−1.9) and Z column (.06) is −1.96. Then use Equation 6.4 to find the corresponding X value:

X = μ + Zσ = 7 + (−1.96) × 2 = 7 − 3.92 = 3.08 minutes

Use a similar process to find XU. Since only 2.5% of times are longer than XU minutes, 97.5% of times are less than XU minutes. From the symmetry of the normal distribution, the desired Z value, as shown in Figure 6.17, is +1.96. Alternatively, extract this Z value from Table E.2 (see Table 6.6). Note that 0.975 is the area under the normal curve less than the Z value of +1.96.

Figure 6.17 Finding Z to determine XU (area 0.9750 below XU, which corresponds to Z = +1.96; area 0.0250 above)

Then use Equation 6.4 to find the corresponding X value:

X = μ + Zσ = 7 + 1.96 × 2 = 7 + 3.92 = 10.92 minutes

Therefore, 95% of students spend between 3.08 and 10.92 minutes on ‘Introduction to TU’.
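The reverse lookups in Examples 6.6 and 6.7 follow the same two moves each time: find the Z value for a cumulative area, then apply Equation 6.4. The sketch below is one way to reproduce them, assuming Python's standard-library statistics.NormalDist (whose inv_cdf method stands in for searching the body of Table E.2); the function name x_from_cumulative_area is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1, as in Table E.2


def x_from_cumulative_area(area, mu, sigma):
    """Return the X value with the given cumulative area below it.

    Step 1: find the Z value for the area (replaces the search in Table E.2).
    Step 2: apply Equation 6.4, X = mu + Z * sigma.
    """
    z = STANDARD_NORMAL.inv_cdf(area)
    return mu + z * sigma


mu, sigma = 7, 2  # 'Introduction to TU' times, in minutes

x_10th = x_from_cumulative_area(0.10, mu, sigma)    # Example 6.6
x_lower = x_from_cumulative_area(0.025, mu, sigma)  # Example 6.7, XL
x_upper = x_from_cumulative_area(0.975, mu, sigma)  # Example 6.7, XU

print(f"10th percentile: {x_10th:.4f} minutes")
print(f"middle 95%: {x_lower:.2f} to {x_upper:.2f} minutes")
```

The 10th-percentile result uses the unrounded Z value (about −1.2816), so it comes out nearer 4.4369 minutes than the 4.44 minutes obtained with Z = −1.28 from the table.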
Table 6.6 Finding a Z value corresponding to a cumulative area of 0.975 under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
+1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
+1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
+2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817

You can also use Microsoft Excel to calculate normal probabilities. Figure 6.18 illustrates a Microsoft Excel worksheet for Examples 6.5 and 6.6, using the Excel inbuilt functions STANDARDIZE(x,mean,standard_dev), NORM.DIST(x,mean,standard_dev,cumulative), NORM.S.INV(probability) and NORM.INV(probability,mean,standard_dev). For Excel 2007 and earlier, the corresponding functions are STANDARDIZE(x,mean,standard_dev), NORMDIST(x,mean,standard_dev,cumulative), NORMSINV(probability) and NORMINV(probability,mean,standard_dev).

VISUAL EXPLORATIONS
Exploring Descriptive Statistics

Open the VE_Normal_Distribution add-in workbook to explore the normal distribution. To explore the effects of changing the mean and standard deviation on the area under a normal distribution curve, select Add-ins ➔ Normal Distribution. The add-in displays a normal curve for the Tasman University Orientation scenario and a floating control panel (at top right). Use the control panel spinner buttons to change the values for the mean, standard deviation and X value, and then note the effects of these changes on the probability of X < value and the corresponding shaded area under the curve. To see the normal curve labelled with Z values, click Z Values. Click the Reset button to reset the control panel values. Click Finish to finish exploring.
Figure 6.18 Microsoft Excel worksheet for calculating normal probabilities

Row  A                                  B         Formula in column B
1    Normal probabilities
3    Common data
4    Mean                               7
5    Standard deviation                 2
7    Probability for X <=
8    X value                            3.5
9    Z value                            −1.75     =STANDARDIZE(B8, B4, B5)
10   P(X<=3.5)                          0.0401    =NORM.DIST(B8, B4, B5, TRUE)
29   Find X and Z given a cum. pctage
30   Cumulative percentage              10.00%
31   Z value                            −1.2816   =NORM.S.INV(B30)
32   X value                            4.4369    =NORM.INV(B30, B4, B5)

THINK ABOUT THIS
What is normal?

Ironically, the statistician who popularised the use of ‘normal’ to describe the distribution discussed in Section 6.2 was someone who saw the distribution as anything but the everyday, anticipated occurrence that the adjective normal usually suggests.

Starting with an 1894 paper, Karl Pearson argued that measurements of phenomena do not naturally, or ‘normally’, conform to the classic bell shape. While this principle underlies statistics today, Pearson’s point of view was radical to contemporaries who saw the world as standardised and normal. Pearson changed minds by showing that some populations are naturally skewed (coining that term in passing), and he helped put to rest the notion that the normal distribution underlies all phenomena.

Misunderstandings about the normal distribution have occurred both in business and in the public sector throughout the years. These misunderstandings have caused a number of business blunders and have sparked several public policy debates, including on the causes of the collapse of large financial institutions in 2008. According to one theory, the investment banking industry’s application of the normal distribution to assess risk may have contributed to the global collapse. (See ‘A finer formula for assessing risks’, New York Times, 11 May 2010, p. B2.)
Using the normal distribution led these banks to overestimate the probability of having stable market conditions and underestimate the chance of unusually large market losses. According to this theory, using other distributions that have less area in the middle of their curves and, therefore, more in the ‘tails’ that represent unusual market outcomes might have led to less serious losses.

As you study this chapter, make sure you understand the assumptions that must hold for the proper use of the normal distribution, assumptions that were not explicitly verified by the investment bankers. And, most importantly, always remember that the name normal distribution does not mean to suggest normal in the everyday (dare we say ‘normal’?!) sense of the word.

Problems for Section 6.2

LEARNING THE BASICS
Make sure that you sketch the normal curve and shade the required area/probability.

6.1 Given the standard normal distribution (with a mean of 0 and a standard deviation of 1, as in Table E.2), what is the probability that:
a. Z is less than 1.57?
b. Z is greater than 1.84?
c. Z is between 1.57 and 1.84?
d. Z is less than 1.57 or greater than 1.84?

6.2 a. Given the standard normal distribution, what is the probability that:
i. Z is between −1.57 and 1.84?
ii. Z is less than −1.57 or greater than 1.84?
b. What is the value of Z if only 2.5% of all possible Z values are larger?
c. Between which two values of Z (symmetrically distributed around the mean) will 68.26% of all possible Z values be contained?

6.3 Given the standard normal distribution, what is the probability that:
a. Z is less than 1.08?
b. Z is greater than −0.21?
c. Z is less than −0.21 or greater than the mean?
d. Z is less than −0.21 or greater than 1.08?

6.4 a. Given the standard normal distribution, determine the following probabilities:
i. P(Z > 1.08)
ii. P(Z < −0.21)
iii. P(−1.96 < Z < −0.21)
b. What is the value of Z if only 15.87% of all possible Z values are larger?

6.5 Given a normal distribution with μ = 100 and σ = 10, what is the probability that:
a. X > 75?
b. X < 70?
c. X < 80 or X > 110?
d. 80% of the values are between which two X values (symmetrically distributed around the mean)?

6.6 Given a normal distribution with μ = 50 and σ = 4, what is the probability that:
a. X > 43?
b. X < 42?
c. 5% of the values are less than which X value?
d. 60% of the values are between which two X values (symmetrically distributed around the mean)?

APPLYING THE CONCEPTS

6.7 The records of Check$mart Bank show that the average credit card balance of its customers is $3,325 with a standard deviation of $1,500. Assume that the distribution of these credit card balances is approximately normal.
a. Find the probability that an account balance is less than $2,500.
b. Find the probability that an account balance is more than $5,000.
c. What proportion of account balances are between $3,000 and $4,000?
d. 99% of account balances are less than which amount?

6.8 Toby’s Trucking Company determined that, on an annual basis, the distance travelled per truck is normally distributed with a mean of 100,000 kilometres and a standard deviation of 20,000 kilometres.
a. What proportion of trucks can be expected to travel between 80,000 and 120,000 kilometres in the year?
b. What percentage of trucks can be expected to travel either below 60,000 or above 140,000 kilometres in the year?
c. How many kilometres will be travelled by at least 80% of the trucks?
d. What are your answers to (a) to (c) if the standard deviation is 10,000 km?

6.9 The breaking strength of plastic bags used for packaging produce is normally distributed with a mean of 35 kPa (kilopascals) and a standard deviation of 10 kPa.
a. What proportion of the bags have a breaking strength of:
i. less than 20 kPa?
ii. at least 30 kPa?
iii. between 25 and 45 kPa?
b. Between which two values symmetrically distributed around the mean will 95% of the breaking strengths fall?

6.10 A set of final examination marks in an introductory statistics unit is normally distributed with a mean of 73 and a standard deviation of 8.
a. What is the probability of getting a mark of 91 or less?
b. What is the probability that a student obtains a mark between 65 and 89?
c. If the lecturer gives Distinction and High Distinction grades to the top 15% of students, what mark does a student need to get a Distinction?
d. If the lecturer gives a High Distinction to the top 5% of students, are you better off with a mark of 80 on this exam or a mark of 68 on a different exam where the mean is 62 and the standard deviation is 3? Show your answer statistically and explain.

6.11 A statistical analysis of 1,000 long-distance telephone calls made from the headquarters of the Bricks and Clicks Computer Corporation indicates that the length of these calls is normal with μ = 240 seconds and σ = 40 seconds.
a. What is the probability that a call lasted less than 180 seconds?
b. What is the probability that a particular call lasted between 180 and 300 seconds?
c. What is the probability that a call lasted between 110 and 180 seconds?
d. What is the length of a particular call if only 1% of all calls are shorter?

6.12 The number of shares traded daily is referred to as the volume of trade. During 2008 the average volume traded daily for the ASX All Ordinaries was 992 million with a standard deviation of 252 million (Sydney Morning Herald, <http://business.smh.com.au>, October 2008 and January 2009).
Assume that the number of All Ordinaries shares traded daily on the ASX is a normal random variable with a mean of 992 million and a standard deviation of 252 million.
a. For a randomly selected day, what is the probability that the volume of trading is:
i. below 500 million?
ii. between 750 and 1,000 million?
iii. below 1,500 million?
iv. above 1,200 million?
b. On 18 September 2008 the All Ordinaries volume of trade was 2,125 million. What is the probability that the volume of trading for the All Ordinaries on a randomly selected day is at least 2,125 million? What conclusions can you draw from this probability?

6.13 Many manufacturing problems involve the accurate matching of machine parts, such as shafts, that fit into a valve hole. A particular design requires a shaft with a diameter of 22.000 mm, but shafts with diameters between 21.900 mm and 22.010 mm are acceptable. Suppose that the manufacturing process yields shafts with diameters normally distributed with a mean of 22.002 mm and a standard deviation of 0.005 mm.
a. For this process, what is:
i. the proportion of shafts with a diameter between 21.900 mm and 22.000 mm?
ii. the probability a shaft is acceptable?
iii. the diameter that will be exceeded by only 2% of shafts?
b. What would be your answers in (a) if the standard deviation of the shaft diameters was 0.004 mm?

6.3 EVALUATING NORMALITY

As discussed in Section 6.2, many continuous variables used in business and elsewhere closely resemble a normal distribution. However, other variables cannot be approximated by the normal distribution. This section presents two approaches for evaluating whether a set of data can be approximated by the normal distribution:
1. Compare the data set’s characteristics with the properties of the normal distribution.
2. Construct a normal probability plot.
LEARNING OBJECTIVE 2 Determine whether a set of data is approximately normally distributed

Evaluating the Properties

As mentioned in Section 6.2, the normal distribution has several important theoretical properties:
• It is symmetrical, thus the mean and median are equal.
• It is bell shaped, thus the empirical rule applies.
• The interquartile range equals approximately 4/3 standard deviations.
• The range is infinite (but in practice is approximately six times the standard deviation).

In practice, some continuous variables may have characteristics that approximate these theoretical properties. However, many continuous variables are neither normal nor approximately normal. For such variables, the descriptive characteristics of the data do not match well with the properties of a normal distribution.

One approach to checking for normality is to compare the actual data characteristics with the corresponding properties from an underlying normal distribution, as follows:
• Construct charts and observe their appearance. For small or moderate-sized data sets, construct a stem-and-leaf display or a box-and-whisker plot. For large data sets, construct the frequency distribution and plot the histogram or polygon.
• Calculate descriptive numerical measures and compare the characteristics of the data with the theoretical properties of the normal distribution. Compare the mean and median. Is the interquartile range approximately 1.33 times the standard deviation? Is the range approximately six times the standard deviation?
• Evaluate how the values in the data are distributed. Determine whether approximately two-thirds of the values lie within ±1 standard deviation of the mean. Determine whether approximately four-fifths of the values lie within ±1.28 standard deviations of the mean. Determine whether approximately 19 out of every 20 values lie within ±2 standard deviations of the mean.

Example 6.8 illustrates these steps.
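The numerical checks in the list above can be gathered into a short routine. The following is a minimal sketch, assuming Python's standard-library statistics module and using the 20 call lengths from Example 6.8 as the data; the function names share_within and normality_summary are hypothetical. Note that statistics.quantiles interpolates between ordered values, so its quartiles (and hence the interquartile range) can differ from quartiles found with other rules, such as those shown in a five-number summary.

```python
from statistics import mean, median, quantiles, stdev

# Length in seconds of the 20 calls in Example 6.8 (data file CALL_LENGTH).
calls = [165, 153, 253, 263, 187, 137, 209, 179, 97, 170,
         43, 295, 181, 121, 117, 210, 200, 191, 181, 248]


def share_within(data, k):
    """Proportion of values within k sample standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)


def normality_summary(data):
    """Compare sample characteristics with the theoretical normal properties."""
    s = stdev(data)
    q1, _, q3 = quantiles(data, n=4)  # interpolated quartiles (an assumption of this sketch)
    return {
        "mean": mean(data),
        "median": median(data),
        "iqr_in_sd": (q3 - q1) / s,                   # normal: about 1.33
        "range_in_sd": (max(data) - min(data)) / s,   # normal: about 6
        "within_1_sd": share_within(data, 1),         # normal: about two-thirds
        "within_1.28_sd": share_within(data, 1.28),   # normal: about four-fifths
        "within_2_sd": share_within(data, 2),         # normal: about 19 in 20
    }


summary = normality_summary(calls)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A routine like this only supports the judgement; the charts in the first step of the list still need to be inspected by eye.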
EXAMPLE 6.8 EVALUATING NORMALITY
Innovative Kitchens design, build and install custom-made kitchens. Toni, a customer manager, is interested in the length of the initial call from potential customers, in particular what percentage of calls last more than 4 minutes. As Toni wishes to use the normal probability distribution to calculate the required probability, the assumption that the random variable X = length of call is normal needs to be checked. The table below gives the length in seconds of 20 randomly chosen calls. < CALL_LENGTH >

165  153  253  263  187  137  209  179   97  170
 43  295  181  121  117  210  200  191  181  248

Do these data show the properties of the normal distribution?

SOLUTION Figure 6.19 displays descriptive statistics for these data and Figure 6.20 presents a box-and-whisker plot.

Figure 6.19 Microsoft Excel and PHStat descriptive statistics for length of call

Descriptive summary: Length of call
Mean                       180
Median                     181
Mode                       181
Minimum                    43
Maximum                    295
Range                      252
Variance                   3599.5789
Standard deviation         59.9965
Coefficient of variation   33.33%
Skewness                   −0.2288
Kurtosis                   0.3939
Count                      20
Standard error             13.4156

Boxplot five-number summary
Minimum          43
First quartile   137
Median           181
Third quartile   210
Maximum          295

Figure 6.20 Box-and-whisker plot for length of calls (horizontal axis: length of call in seconds, 40 to 300)

From these figures, we can make the following statements:
1. The mean of 180 is slightly less than the median and mode of 181.
2. The box-and-whisker plot appears slightly left skewed.
3. The interquartile range of Q3 − Q1 = 210 − 137 = 73 is 73/60 = 1.216… standard deviations.
4. The range of 252 is equal to 252/60 = 4.2… standard deviations.
5. 13/20 = 65% of the calls are within ±1 standard deviation of the mean (180 ± 60 → 120 to 240).
6. 19/20 = 95% of the calls are within ±2 standard deviations of the mean (180 ± 120 → 60 to 300).

Based on these statements and the criteria given above, it can be concluded that the length of calls is approximately normal, although statements 1 and 2 indicate that the calls are slightly left skewed. Thus, Toni can use the normal probability distribution, with μ = 180 seconds and σ = 60 seconds, to calculate the percentage of calls lasting more than 4 minutes.

Constructing a Normal Probability Plot

normal probability plot Graphical approach used to evaluate if data are normal.
quantile–quantile plot A normal probability plot.

A normal probability plot is a graphical approach for evaluating whether data are normally distributed. One common approach is called the quantile–quantile plot. In this method, each ordered value is transformed to a Z score and plotted along with the ordered data values of the variable. For example, if you have a sample of n = 19, the Z value for the smallest value corresponds to a cumulative area of:

1/(n + 1) = 1/(19 + 1) = 1/20 = 0.05

The Z value for a cumulative area of 0.05 (from Table E.2) is −1.65. Table 6.7 illustrates the entire set of Z values for a sample of n = 19.

Table 6.7 Ordered values and corresponding Z values for a sample of n = 19

Ordered value   Z value     Ordered value   Z value
 1              −1.65       11               0.13
 2              −1.28       12               0.25
 3              −1.04       13               0.39
 4              −0.84       14               0.52
 5              −0.67       15               0.67
 6              −0.52       16               0.84
 7              −0.39       17               1.04
 8              −0.25       18               1.28
 9              −0.13       19               1.65
10               0.00

The Z values are plotted on the horizontal axis and the corresponding values of the variable are plotted on the vertical axis. If the data are normally distributed, the points will be approximately in a straight line.
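The Z values for a quantile–quantile plot can be generated for any sample size from the cumulative areas i/(n + 1). The sketch below is one way to do this, assuming Python's standard-library statistics.NormalDist in place of Table E.2; the function names qq_z_values and qq_points are hypothetical, and no plotting library is assumed.

```python
from statistics import NormalDist


def qq_z_values(n):
    """Z values for ordered values 1..n, using cumulative areas i / (n + 1)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def qq_points(data):
    """Pair each Z value (horizontal axis) with the ordered data value (vertical axis)."""
    return list(zip(qq_z_values(len(data)), sorted(data)))


z19 = qq_z_values(19)
for order, z in enumerate(z19, start=1):
    print(f"{order:2d}  {z:+.2f}")
```

The values agree with a table lookup to within rounding: for a cumulative area of 0.05 the unrounded Z value is about −1.645, which lies halfway between the table entries −1.64 and −1.65. The points returned by qq_points can then be passed to any charting tool to judge whether they lie approximately on a straight line.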
Figure 6.21 illustrates the typical shape of normal probability plots for a left-skewed distribution (panel A), a normal distribution (panel B) and a right-skewed distribution (panel C). If the data are left skewed, the curve will rise more rapidly at first, and then level off. If the data are right skewed, the curve will rise more slowly at first, and then rise at a faster rate for higher values of the variable being plotted. Figure 6.22 illustrates a PHStat normal probability plot for the length-of-call data in Example 6.8, and shows that it is approximately a straight line. Thus, it can be concluded that the distribution of the data on length of calls is approximately normal.

Figure 6.21 Normal probability plots for a left-skewed distribution (panel A), a normal distribution (panel B) and a right-skewed distribution (panel C)

Figure 6.22 PHStat normal probability plot for length of call (vertical axis: length of call, 0 to 350; horizontal axis: Z value, −2 to 2)

Problems for Section 6.3

LEARNING THE BASICS

6.14 When evaluating normality, show that, for a sample of n = 39, the smallest and largest Z values are −1.96 and +1.96, and the middle (i.e. 20th) Z value is 0.00.

6.15 For a sample of n = 6, list the six Z values.

APPLYING THE CONCEPTS
You can solve problems 6.16 to 6.20 manually or by using Microsoft Excel. We recommend that you use Microsoft Excel.
6.16 The full daily rates in Australian dollars for a random sample of 19 Australian and New Zealand hotels from a certain chain are as follows: < HOTEL_RATE >

Location            Full rate A$
Auckland            280
Barossa Valley      200
Brisbane            441
Canberra            290
Darwin              662
Hamilton            255
Melbourne           358
Melbourne           308
Melbourne           279
Palmerston North    259
Perth               232
Queenstown          312
Rotorua             309
Snowy Mountains     615
Sunshine Coast      534
Sydney              360
Sydney              573
Sydney              320
Wellington          335

Decide whether or not the data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.17 A problem with a telephone line that prevents a customer from receiving or making calls is disconcerting to both the customer and the telephone company. The following data represent two samples of 20 problems reported to two different offices of a telephone company. The time to clear these problems from the customers’ lines is recorded in minutes. < PHONE >

Central Office I: Time to clear problems (minutes)
1.48  1.75  0.78  2.85  0.52  1.60  4.15  1.02  0.53  0.93
1.60  0.80  1.05  6.32  3.97  3.93  1.48  5.45  3.10  0.97

Central Office II: Time to clear problems (minutes)
7.55  3.75  0.10  1.10  0.60  0.52  3.30  3.75  0.65  1.92
0.60  1.53  4.23  0.08  2.10  1.48  0.58  1.65  4.02  0.72

For each of the two central office locations, decide whether the data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.18 Many manufacturing processes use the term work-in-progress (often abbreviated to WIP). In a book-manufacturing plant, the WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets and bound.
The following data represent samples of 20 books at each of two production plants and the processing time (operationally defined as the time in days from when the books came off the press to when they were packed in cartons) for these jobs: < WIP >

Plant A
5.62  11.62  5.29  21.62  10.50  8.45  7.29  7.58  16.25  7.50  8.58  9.29
Plant B
9.54  5.75  11.46  12.46  16.62  15.41  2.33  14.29  14.25  13.13  11.46  4.42
11.42  8.92  9.17  12.62  13.21  25.75  6.00  5.37  13.71  6.25  10.04  9.71
10.92  7.96  5.41  7.54

For each of the two plants, decide whether or not the data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.19 The data file < GRADES > contains a sample of student marks and grades from a population of students enrolled in a statistics unit. Decide whether or not the ‘Total Mark’ data appear to be approximately normal by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.20 For the data from problem 6.19, < GRADES >, decide whether or not the ‘Exam Mark’ data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.4 THE UNIFORM DISTRIBUTION

LEARNING OBJECTIVE 3 Calculate probabilities from the uniform distribution

uniform (rectangular) distribution Continuous probability distribution; the values of the random variable have the same probability; also called the ‘rectangular distribution’.

In the uniform distribution, a value has the same probability of occurrence anywhere in the range between the smallest value a and the largest value b. Because of its shape, the uniform distribution is sometimes called the rectangular distribution (see panel B of Figure 6.1). Equation 6.5 defines the uniform probability density function.

THE UNIFORM PROBABILITY DENSITY FUNCTION
f(X) = 1/(b − a) if a ⩽ X ⩽ b and 0 elsewhere (6.5)
where a 5 the minimum value of X b 5 the maximum value of X Equations 6.6 and 6.7 define the mean and variance of the uniform distribution. TH E M E AN A N D VA R IA N CE OF T H E U NI F O R M DI ST R I B UT I O N a+b 2 (6.6) (b − a)2 12 (6.7) μ= σ2 = 3 Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 234 CHAPTER 6 THE NORMAL DISTRIBUTION AND OTHER CONTINUOUS DISTRIBUTIONS Figure 6.23 illustrates the uniform distribution with a 5 0 and b 5 1. The total area of the rectangle is equal to base 3 height 5 1 3 1 5 1, thus satisfying the requirement that the area under any probability density function equals 1. In such a distribution, what is the probability of getting a value between 0.1 and 0.3? The area between 0.1 and 0.3, depicted in Figure 6.24, is equal to the base (0.3 2 0.1 5 0.2) multiplied by the height (1.0). Therefore: P(0.1 < X < 0.3) = base × height = 0.2 × 1 = 0.2 Figure 6.23 Probability density function for a uniform distribution with a 5 0 and b 5 1 f (x ) 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Figure 6.24 Finding P (0.1 , X , 0.3) for a uniform distribution with a 5 0 and b 5 1 x f (x ) 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 x From Equations 6.6 and 6.7, the mean and standard deviation of the uniform distribution for a 5 0 and b 5 1: μ= σ2 = a+b 0+1 = = 0.5 2 2 (b − a)2 (1 − 0)2 1 = 0.0833… = = 12 12 12 σ = 0.0833… = 0.2886… Thus, the mean is 0.5 and the standard deviation is 0.2887. Problems for Section 6.4 LEARNING THE BASICS APPLYING THE CONCEPTS 6.21 Suppose you sample one value from a uniform distribution with a 5 0 and b 5 10. a. What is the probability of getting a value: i. between 5 and 7? ii. between 2 and 3? b. What is the mean? c. What is the standard deviation? 6.22 The time between arrivals of customers at a bank between noon and 1 pm has a uniform distribution over an interval from 0 to 120 seconds. a. 
What is the probability that the time between the arrival of two customers will be: i. less than 20 seconds? ii. between 10 and 30 seconds? iii. more than 35 seconds? Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 6.5 The Exponential Distribution 235 b. What is the mean and standard deviation of the time between arrivals? 6.23 The time of failure for a continuous operation monitoring device of air quality has a uniform distribution over a 24-hour day. a. If a failure occurs on a day when it is daylight between 5.55 am and 7.38 pm, what is the probability that the failure will occur during daylight hours? b. If the device is in secondary mode from 10 pm to 5 am, what is the probability that a failure occurs during secondary mode? c. If the device has a self-checking computer chip that determines whether the device is operational every hour on the hour, what is the probability that a failure will be detected within 10 minutes of its occurrence? d. If the device has a self-checking computer chip that determines whether the device is operational every hour on the hour, what is the probability that it will take at least 40 minutes to detect that a failure has occurred? 6.24 In an apartment building the waiting time for a lift is found to be uniformly distributed between 0 and 3 minutes. a. What is the probability of waiting: i. no more than a minute? ii. between 1 and 2 minutes? iii. more than 2 minutes? b. What is the mean and standard deviation of waiting time? 6.5 THE EXPONENTIAL DISTRIBUTION LEARNING OBJECTIVE The exponential distribution is a continuous distribution that is right skewed and ranges from zero to positive infinity (see panel C of Figure 6.1). The exponential distribution is widely used in waiting line (or queuing) theory to model the length of time between random and independent events or the time to the first occurrence of an event. 
For example, the exponential random variable can be used to model the: • time between arrivals of customers at a bank’s ATM or a fast-food restaurant • time between patients entering a hospital emergency room • time between hits on a website • time between outages to an Internet banking system • time to failure of a certain item or component. Calculate probabilities from the exponential distribution exponential distribution Continuous probability distribution, used to model the interval between Poisson events. The exponential and Poisson distributions are closely related. The Poisson distribution is used to count the number of times an event occurs in some interval, while the exponential distribution is used to measure the interval between Poisson events or until the first event. The exponential distribution is defined by a single parameter, l(lambda), the expected number of events per interval; note that this is the mean of the corresponding Poisson distribution. Equation 6.8 can be used to calculate exponential probabilities. PRO BABILIT Y T H AT A N E XP ON E N T IA L R AN DO M VAR I A BL E I S LE SS THAN A If X is an exponential random variable, 0 8 X 8 ∞, then P(X < A) = 1 − e−λA 4 (6.8) where l 5 expected number of events in interval e 5 2.71828… is the base of natural logarithms A is a given value of the exponential random variable X From Equation 6.8, using the complement rule, we obtain: P(X ⩾ A) = 1 − P(X < A) = 1 − (1 − e−λA) = e−λA Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 236 CHAPTER 6 THE NORMAL DISTRIBUTION AND OTHER CONTINUOUS DISTRIBUTIONS T H E M E A N , VA R IA NC E A ND STA NDA R D D E V I AT I O N O F T HE E XP ON E N T IA L DIST R I BU T I O N μ=σ= σ2 = 1 λ (6.9) 1 λ2 (6.10) where l 5 expected number of events in interval. 
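As a minimal sketch, Equations 6.8 to 6.10 translate directly into a few lines of Python. The function names and the illustrative rate of λ = 2 events per interval are assumptions introduced here for illustration; they do not come from the text.

```python
import math


def exp_prob_less_than(a: float, lam: float) -> float:
    """P(X < A) = 1 - e^(-lambda * A), Equation 6.8."""
    if lam <= 0 or a < 0:
        raise ValueError("lam must be positive and a must be non-negative")
    return 1.0 - math.exp(-lam * a)


def exp_prob_at_least(a: float, lam: float) -> float:
    """P(X >= A) = e^(-lambda * A), obtained with the complement rule."""
    return 1.0 - exp_prob_less_than(a, lam)


def exp_mean(lam: float) -> float:
    """Mean (and standard deviation) = 1 / lambda, Equation 6.9."""
    return 1.0 / lam


def exp_variance(lam: float) -> float:
    """Variance = 1 / lambda^2, Equation 6.10."""
    return 1.0 / lam ** 2


# Hypothetical illustration: lambda = 2 events per interval, A = 0.5 interval.
lam = 2.0
print(round(exp_prob_less_than(0.5, lam), 4))  # 1 - e^(-1)
print(round(exp_prob_at_least(0.5, lam), 4))   # e^(-1)
print(exp_mean(lam), exp_variance(lam))        # 0.5 0.25
```

Note that λ and A must be expressed in the same interval units (for example, events per hour and hours) before the functions are called.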
For example, if the expected number of events in a minute is λ = 4, then the mean time between events is μ = 1/4 = 0.25 minutes or 15 seconds.

To illustrate the exponential distribution, suppose that customers arrive at an ATM randomly and independently at the rate of 20 per hour. If a customer has just arrived, what is the probability that the next customer will arrive within 6 minutes (i.e. 0.1 hour)? For this example, X = time in hours until next customer is exponential with λ = 20 per hour. Using Equation 6.8 and A = 6 minutes = 0.1 hour:

$$P(X < 0.1) = 1 - e^{-20 \times 0.1} = 1 - e^{-2} = 1 - 0.13533\ldots = 0.86466\ldots$$

Thus, the probability that a customer will arrive within 6 minutes is 0.8647.

You can also use Microsoft Excel to calculate this probability. Figure 6.25 shows a Microsoft Excel worksheet, using the Excel inbuilt exponential function EXPON.DIST(x,lambda,cumulative). For Excel 2007 and earlier the corresponding exponential function is EXPONDIST(x,lambda,cumulative).

Figure 6.25 Microsoft Excel worksheet for finding exponential probabilities

     A                         B         C
1    Exponential probability
3    Data
4    λ                         20
5    X value                   0.1
7    Results
8    P(<=X)                    0.8647    =EXPON.DIST(B5, B4, TRUE)
9    P(>X)                     0.1353    =1-B8

EXAMPLE 6.9 CALCULATING EXPONENTIAL PROBABILITIES

In the ATM example, what is the probability that the next customer will arrive within 3 minutes (i.e. 0.05 hour)?

SOLUTION
Using Equation 6.8:

$$P(X < 0.05) = 1 - e^{-20 \times 0.05} = 1 - e^{-1} = 1 - 0.3678\ldots = 0.63212\ldots$$

Thus, the probability that a customer will arrive within 3 minutes is 0.6321.

EXAMPLE 6.10 CALCULATING EXPONENTIAL PROBABILITIES

Past data show that two serious workplace accidents resulting in employees taking time off work occur annually at Innovative Kitchens. A serious workplace accident has just occurred. What is the probability that there will not be another serious workplace accident in the next year and the probability that there will be at least one serious workplace accident in the next six months?

SOLUTION
X = time in years until next serious workplace accident is exponential with λ = 2 per year. Using Equation 6.8 and the complement rule:

$$P(X > 1) = e^{-2 \times 1} = e^{-2} = 0.13533\ldots$$

Thus, the probability that there will not be another serious workplace accident in the next year is 0.1353.

Using Equation 6.8:

$$P(X \leqslant 0.5) = 1 - e^{-2 \times 0.5} = 1 - e^{-1} = 1 - 0.36787\ldots = 0.6321\ldots$$

Thus, the probability that there will be at least one serious workplace accident in the next six months is 0.632.

think about this: Memoryless distribution

Suppose customers arrive at an average rate of one per minute. If no customer has arrived in the last minute, what is the probability that no customer will arrive in the next 2 minutes?

To answer this, let X = time until next customer arrives in minutes. Then X is exponential with λ = 1. We want the probability that we will wait at least another 2 minutes for the next customer, given that we have waited 1 minute already; that is:

$$P(X > 2 + 1 \mid X > 1) = \frac{P(X > 3)}{P(X > 1)} = \frac{e^{-3 \times 1}}{e^{-1 \times 1}} = 0.1353\ldots$$

Now suppose that a customer has just arrived. The probability that no customer will arrive in the next two minutes is:

$$P(X > 2) = e^{-2 \times 1} = e^{-2} = 0.13533\ldots$$

What do you notice? The probability has not changed. This illustrates the memoryless property of the exponential distribution. It means that it does not matter how long you have waited for a customer. If a customer has not arrived at time T, the distribution of the waiting time from time T until the next customer arrives is the same as when a customer has just arrived.
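The memoryless calculation above can also be checked numerically. The sketch below, which is an illustration rather than part of the text, compares the conditional and unconditional probabilities exactly and then by simulation; the helper name `survival`, the random seed and the sample size are arbitrary assumptions.

```python
import math
import random


def survival(a: float, lam: float) -> float:
    """P(X > A) = e^(-lambda * A) for an exponential random variable."""
    return math.exp(-lam * a)


# One customer per minute; already waited T = 1 minute, asking about A = 2 more.
lam, a, t = 1.0, 2.0, 1.0

# Exact check: P(X > A + T | X > T) = P(X > A + T) / P(X > T)
conditional = survival(a + t, lam) / survival(t, lam)
unconditional = survival(a, lam)

# Simulation check: keep only waits longer than T, then see how many exceed A + T.
rng = random.Random(42)
waits = [rng.expovariate(lam) for _ in range(200_000)]
still_waiting = [w for w in waits if w > t]
simulated = sum(w > a + t for w in still_waiting) / len(still_waiting)

print(round(conditional, 4), round(unconditional, 4))  # both 0.1353
print(round(simulated, 3))  # expected to be close to 0.135
```

The simulated proportion only approximates e⁻² and will vary slightly with the seed and sample size.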
In general, it can be shown that if X is exponential, then X is a memoryless random variable and:

$$P(X > A + T \mid X > T) = P(X > A) \quad \text{for } A, T \geqslant 0$$

Problems for Section 6.5

LEARNING THE BASICS

6.25 Given an exponential distribution with λ = 10, what is the probability that X is:
a. less than 0.1?
b. greater than 0.1?
c. between 0.1 and 0.2?
d. less than 0.1 or greater than 0.2?

6.26 Given an exponential distribution with λ = 30, what is the probability that X is:
a. less than 0.1?
b. greater than 0.1?
c. between 0.1 and 0.2?
d. less than 0.1 or greater than 0.2?

6.27 Given an exponential distribution with λ = 20, what is the probability that X is:
a. less than 4?
b. greater than 0.4?
c. between 0.4 and 0.5?
d. less than 0.4 or greater than 0.5?

APPLYING THE CONCEPTS

6.28 Vehicles arrive, randomly and independently, at a toll booth located at the entrance to a bridge at the rate of 240 per hour between 1 am and 2 am. Suppose a vehicle has just arrived.
a. What is the probability that the next vehicle arrives within the next minute?
b. What is the probability that no vehicle arrives in the next 30 seconds?
c. What is the mean time between arrivals at the toll booth?
d. What are your answers to (a) to (c) if the rate of arrival of vehicles is 300 per hour?
e. What are your answers to (a) to (c) if the rate of arrival of vehicles is 210 per hour?

6.29 Customers arrive at the drive-through window of a fast-food restaurant at an average of two per minute during the lunch hour.
a. What is the probability that the next customer will arrive within 1 minute?
b. What is the probability that the next customer will arrive within 5 minutes?
c. During the dinner time period, the average arrival rate is one per minute. What are your answers to (a) and (b) for this period?

6.30 The time between unplanned shutdowns of a power plant has an exponential distribution with a mean of 20 days. Find the probability that the time between two unplanned shutdowns is:
a. less than 14 days
b. more than 21 days
c. less than 7 days

6.31 Golfers arrive at the starter’s booth of a public golf course at an average of eight per hour during the Monday-to-Thursday midweek period.
a. If a golfer has just arrived:
   i. what is the probability that the next golfer arrives within 15 minutes (0.25 hour)?
   ii. what is the probability that the next golfer arrives within 3 minutes (0.05 hour)?
b. The average arrival rate on Fridays is 15 per hour. What are your answers to (a) on Fridays?

6.32 The number of floods in a certain region is approximately Poisson distributed with an average of three floods every 10 years. A flood has just occurred.
a. What is the probability that:
   i. a flood occurs in the next year?
   ii. there isn’t a flood in the next two years?
   iii. a flood occurs in the next month?
   iv. at least one flood occurs in the next six months?
b. What is the average time between floods?

6.6 THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

LEARNING OBJECTIVE 5: Use the normal distribution to approximate probabilities from the binomial distribution

In earlier sections of this chapter, the normal probability distribution was introduced. In this section we use the normal distribution to approximate the binomial distribution. When, as in this case, a continuous distribution is used to approximate a discrete probability distribution, a continuity correction factor is required.

Need for a Continuity Correction

There are two major reasons why a continuity correction is needed when using a continuous random variable to approximate a discrete random variable. First, discrete random variables such as binomial random variables can take on only specified (integer) values, while continuous random variables such as normal random variables can take on any values within a continuum or interval. When using the normal distribution to approximate the binomial distribution, more accurate approximations of the probabilities are obtained when a continuity correction is used.

Second, with a continuous distribution such as the normal distribution, the probability of getting a specific value of a random variable is zero. However, when a continuous distribution is used to approximate a discrete distribution, a continuity correction is used to obtain the approximate probability of a specific value of the discrete distribution.

Consider an experiment in which we toss a fair coin 10 times. Suppose we want to calculate the probability of getting exactly four heads. Whereas a discrete random variable can have only a specified value (such as 4), a continuous random variable used to approximate it could take on any values within an interval around that specified value, as demonstrated on the scale below:

X scale: … 2.5  3  3.5  4  4.5  5  5.5 …

The continuity correction requires adding or subtracting 0.5 from the value or values of the discrete random variable X as required. To use the normal distribution to approximate the probability of getting exactly four heads, X = 4, we need to find the area under the normal curve from X = 3.5 to X = 4.5, the lower and upper boundaries of 4.

To determine the approximate probability of getting at least four heads, we find the area under the normal curve greater than or equal to 3.5, X ⩾ 3.5, since 3.5 is the lower boundary of 4. Similarly, to determine the approximate probability of getting at most four heads, we find the area under the normal curve equal to or less than 4.5, X ⩽ 4.5, since 4.5 is the upper boundary of 4.

When using the normal distribution to approximate discrete probability distributions, semantics are important. To determine the approximate probability of getting fewer than four heads, we find the area under the normal curve less than or equal to 3.5, X ⩽ 3.5. To determine the approximate probability of getting more than four heads, we find the area under the normal curve greater than or equal to 4.5, X ⩾ 4.5. To determine the approximate probability of getting four to seven heads (inclusive), we find the area under the normal curve from 3.5 to 7.5, 3.5 ⩽ X ⩽ 7.5.

Approximating the Binomial Distribution

In Section 5.3 we saw that the binomial distribution is symmetrical (as is the normal distribution) whenever p = 0.5. When p ≠ 0.5, the binomial distribution is not symmetrical. However, the closer p is to 0.5 and/or the larger the sample size n, the more symmetrical the distribution is. On the other hand, the larger the sample size the more tedious it is to calculate the exact probabilities of success using Equation 5.11. Fortunately, whenever the sample size is large, we can use the normal distribution to approximate the exact binomial probabilities.

As a general rule, the normal distribution can be used to approximate the binomial distribution whenever np and n(1 − p) are both at least 5. From Section 5.3, the mean and standard deviation of the binomial distribution are:

$$\mu = np \qquad \sigma = \sqrt{np(1 - p)}$$

By substituting these into the transformation formula (Equation 6.2), we obtain:

$$Z = \frac{X - \mu}{\sigma} = \frac{X - np}{\sqrt{np(1 - p)}}$$

where for large enough n the random variable Z is approximately normal.
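The continuity-correction rules and the Z transformation above can be sketched in Python for the coin-tossing experiment (n = 10, p = 0.5). This is an illustrative sketch only: the helper names are assumptions, and the exact binomial values are computed directly from the binomial formula rather than read from a table.

```python
import math
from statistics import NormalDist


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def approximating_normal(n: int, p: float) -> NormalDist:
    """Normal distribution with the binomial mean np and sd sqrt(np(1 - p))."""
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("rule of thumb not met: np and n(1 - p) should both be at least 5")
    return NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))


n, p = 10, 0.5  # ten tosses of a fair coin
dist = approximating_normal(n, p)

# Exactly four heads: area between the boundaries 3.5 and 4.5.
approx_exactly_4 = dist.cdf(4.5) - dist.cdf(3.5)
exact_exactly_4 = binomial_pmf(4, n, p)

# At least four heads: area at or above the lower boundary 3.5.
approx_at_least_4 = 1 - dist.cdf(3.5)
exact_at_least_4 = sum(binomial_pmf(k, n, p) for k in range(4, n + 1))

print(round(approx_exactly_4, 4), round(exact_exactly_4, 4))
print(round(approx_at_least_4, 4), round(exact_at_least_4, 4))
```

Printing the approximate and exact values side by side shows how close the continuity-corrected normal areas are, even for a sample size that only just satisfies the rule of thumb.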
Hence, Equation 6.11 is used to find approximate probabilities corresponding to the values of the discrete binomial random variable, X.

NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}} \qquad (6.11)$$

where
μ = np, mean of the binomial distribution
σ = √(np(1 − p)), standard deviation of the binomial distribution
X_a = adjusted number of successes for the discrete random variable X, such that X_a = X ± 0.5 as appropriate

EXAMPLE 6.11 USING THE NORMAL DISTRIBUTION TO APPROXIMATE THE BINOMIAL DISTRIBUTION

A random sample of n = 1,600 tyres is selected from an ongoing production process in which 8% of all tyres produced are defective. What is the probability that 150 or fewer tyres will be defective?

SOLUTION
Since both np = 1,600 × 0.08 = 128 and n(1 − p) = 1,600 × 0.92 = 1,472 are greater than 5, the normal distribution can be used to approximate the binomial. Here, X_a, the adjusted number of successes, is 150.5 and:

$$Z \approx \frac{X_a - np}{\sqrt{np(1 - p)}} = \frac{150.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{22.5}{10.8517\ldots} \approx 2.07$$

Then, using Table E.2, the area under the curve to the left of Z = 2.07 is 0.9808 (see Figure 6.26). Therefore, the probability of 150 or fewer tyres being defective is approximately 0.98. This agrees to two decimal places with the exact binomial probability of 0.9790.

Figure 6.26 Approximating the binomial distribution (area 0.9808 to the left of X = 150.5, Z = +2.07; μ = 128 at Z = 0)

Calculating a Probability Approximation for an Individual Value

Suppose that we want to approximate the probability of getting exactly 150 defective tyres. The correction for continuity defines the integer value of interest to range from one-half unit below it to one-half unit above it. Therefore, we define the probability of getting exactly 150 defective tyres as the area under the normal curve between 149.5 and 150.5. Using Equation 6.11, the corresponding Z values are:

$$Z = \frac{149.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{21.5}{10.85} = 1.98$$

and:

$$Z = \frac{150.5 - 128}{\sqrt{(1{,}600)(0.08)(0.92)}} = \frac{22.5}{10.85} = 2.07$$

Therefore, using Table E.2, we obtain:

$$P(\text{exactly 150 tyres defective}) \approx P(149.5 \leqslant X \leqslant 150.5) \approx P(1.98 \leqslant Z \leqslant 2.07) = 0.9808 - 0.9761 = 0.0047$$

Thus, the approximate probability of getting 150 defective tyres is 0.0047. Compare this with the exact binomial probability which, to four decimal places, is 0.0048.

Problems for Section 6.6

LEARNING THE BASICS

6.33 For n = 100 and p = 0.2, use the normal distribution to approximate the probability that:
a. X = 25
b. X > 25
c. X ⩽ 25
d. X < 25

6.34 For n = 100 and p = 0.4, use the normal distribution to approximate the probability that:
a. X = 40
b. X > 40
c. X ⩽ 40
d. X < 40

APPLYING THE CONCEPTS

6.35 Consider an experiment in which a fair coin is tossed 10 times.
a. Use Equation 5.11, Table E.6 or Microsoft Excel to determine the probability of getting:
   i. four heads
   ii. at least four heads
   iii. four to seven heads
b. Use the normal approximation to the binomial distribution to approximate the probabilities in (a).

6.36 For overseas flights, an airline has three different choices on its dessert menu: ice cream, apple pie and chocolate cake. Based on past experience, the airline feels that each dessert is equally likely to be chosen. If a random sample of 90 passengers is selected, what is the approximate probability that:
a. at least 20 will choose ice cream for dessert?
b. exactly 20 will choose ice cream for dessert?
c. less than 20 will choose ice cream for dessert?

Assess your progress

Summary

In this chapter we used the normal distribution for the Tasman University Orientation scenario to study the time students spend on the ‘Introduction to TU’ module. We also used the exponential distribution to model the time between serious workplace accidents. In addition, we studied the uniform distribution, the normal probability plot and the normal approximation to the binomial distribution. In the next chapter, the normal distribution is used in developing the subject of statistical inference.

Key formulas

The normal probability density function
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(X - \mu)/\sigma]^2} \qquad (6.1)$$

Finding a Z value
$$Z = \frac{X - \mu}{\sigma} \qquad (6.2)$$

The standardised normal probability density function
$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)Z^2} \qquad (6.3)$$

Finding an X value
$$X = \mu + Z\sigma \qquad (6.4)$$

The uniform distribution probability density function
$$f(X) = \frac{1}{b - a} \ \text{ if } a \leqslant X \leqslant b \text{ and } 0 \text{ elsewhere} \qquad (6.5)$$

Mean of the uniform distribution
$$\mu = \frac{a + b}{2} \qquad (6.6)$$

Variance of the uniform distribution
$$\sigma^2 = \frac{(b - a)^2}{12} \qquad (6.7)$$

Calculating exponential probabilities
$$P(X < A) = 1 - e^{-\lambda A} \qquad (6.8)$$

Mean and standard deviation of exponential distribution
$$\mu = \sigma = \frac{1}{\lambda} \qquad (6.9)$$

Variance of exponential distribution
$$\sigma^2 = \frac{1}{\lambda^2} \qquad (6.10)$$

Normal approximation to the binomial distribution
$$Z = \frac{X_a - np}{\sqrt{np(1 - p)}} \qquad (6.11)$$

Key terms

continuous probability density function 213
cumulative standardised normal distribution 217
exponential distribution 235
normal distribution 214
normal probability density function 216
normal probability plot 231
quantile–quantile plot 231
standardised normal random variable 216
transformation formula 216
uniform (rectangular) distribution 233
Chapter review problems

CHECKING YOUR UNDERSTANDING

6.37 How do you find the area between two values under the normal curve?
6.38 How do you find the X value that corresponds to a given percentile of the normal distribution?
6.39 What are some of the properties of a normal distribution?
6.40 How can you use the normal probability plot to evaluate whether a set of data is normally distributed?
6.41 Why is a continuity correction needed when approximating a binomial probability with a normal distribution?
6.42 When can you use the normal distribution to approximate the binomial distribution?

APPLYING THE CONCEPTS

6.43 Based on past experience, it is assumed that the number of flaws per metre in rolls of grade 2 paper follows a Poisson distribution with a mean of one flaw per 5 metres of paper. A flaw has just been found.
a. What is the probability that:
   i. there is not another flaw in the remaining 10 metres of the roll?
   ii. a flaw will be found in the next metre of the roll?
   iii. at least one flaw will be found in the next 5 metres?
b. What is the mean distance between flaws?

6.44 Aircraft arrive at a regional airport at a rate of 30 per hour.
a. If the interarrival time follows an exponential distribution:
   i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals?
   ii. What is the probability that there is less than 30 seconds between arrivals?
   iii. What is the expected time between arrivals?
b. If the interarrival time follows a uniform distribution between 0 and 4 minutes:
   i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals?
   ii. What is the probability that there is less than 30 seconds between arrivals?
   iii. What is the expected time between arrivals?
c. If the interarrival time follows a normal distribution with mean 2 minutes and standard deviation 0.6 minutes:
   i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals?
   ii. What is the probability that there is less than 30 seconds between arrivals?

6.45 An orange juice producer buys all his oranges from a large orange grove. The amount of juice squeezed from each orange is approximately normally distributed with a mean of 135 mL and a standard deviation of 12 mL.
a. What is the probability that a randomly selected orange will contain between 135 mL and 140 mL of juice?
b. What is the probability that a randomly selected orange will contain between 140 mL and 155 mL of juice?
c. 77% of the oranges will contain at least how many millilitres of juice?
d. 80% of the oranges are between which two values of juice (in millilitres) symmetrically distributed around the population mean?

6.46 The hotels from the chain in problem 6.16 frequently offer discounted ‘hot deal’ rates online. The table below gives the ‘hot deal’ rates available recently on a selected Sunday, in Australian dollars. < HOTEL_RATE >

Location            Hot deals rate A$
Auckland            140
Barossa Valley      174
Brisbane            129
Canberra            230
Darwin              114
Hamilton            154
Melbourne           152
Melbourne           189
Melbourne           149
Palmerston North    80
Perth               150
Queenstown          95
Rotorua             122
Snowy Mountains     288
Sunshine Coast      170
Sydney              239
Sydney              189
Sydney              160
Wellington          105

Decide whether or not the data appear to be approximately normally distributed by:
a. evaluating the actual versus theoretical properties
b. constructing a normal probability plot

6.47 Geoscientists estimate that, on average, a given region has a major earthquake every 250 years. Assuming that the time between major earthquakes in this region is exponentially distributed, what is the probability that a major earthquake:
a. will not occur between 2020 and 2030?
b. will occur between 2020 and 2070?
c. will not occur between 2020 and 2200?

6.48 An examination consists of 40 multiple-choice questions, with each question having four options. Suppose you randomly select the answer to each question – that is, you guess. What is the probability of obtaining at least 50% in the examination?

6.49 According to Burton G. Malkiel, the daily changes in the closing price of shares follow a random walk – that is, these daily events are independent of each other and move upwards or downwards in a random manner – and can be approximated by a normal distribution.
a. To test this theory, use the daily changes in the All Ordinaries for 2016–17 financial year in < ALL_ORDS_2016_17 > to:
   i. construct a stem-and-leaf display, histogram, polygon and/or box-and-whisker plot
   ii. evaluate the actual versus theoretical properties
   iii. construct a normal probability plot
b. Discuss the results of (a). Are the daily changes in closing prices approximately normal?

6.50 From past data, Safe-As-Houses Real Estate concludes that the age of houses in the suburb of NewAcres is uniformly distributed between 20 and 40 years. What is the probability that the age of a randomly chosen house in NewAcres is:
a. more than 30 years?
b. between 25 and 35 years?
c. less than 35 years?

6.51 The time customers are on hold when ringing the IT help line for a certain ISP provider is normally distributed with a mean of 20 minutes and a standard deviation of 10 minutes.
a. What proportion of customers are on hold for more than 40 minutes?
b. What is the probability that a customer is on hold for less than 30 minutes?
c. What percentage of calls are answered within 10 minutes?

6.52 A study by the ISP provider in problem 6.51 has shown that the length of time on hold before a customer hangs up follows an approximate exponential distribution, with an average time of 15 minutes on hold before a customer hangs up.
a. What percentage of customers will hang up during the first 20 minutes on hold?
b. What is the probability that a customer will hang up during the first 10 minutes on hold?
c. What proportion of customers do not hang up when on hold for 40 minutes?

6.53 From the Household Expenditure Statistics: Year Ended 30 June 2016 (Statistics New Zealand, <www.stats.govt.nz>), the average weekly household expenditure in New Zealand was $1,300. Assuming that weekly household expenditure is approximately normal with a standard deviation of $350:
a. Find the probability that a household’s weekly expenditure is:
   i. less than $500
   ii. more than $1,750
b. What proportion of household expenditures are between $1,250 and $1,500?
c. 99% of households have weekly expenditures of less than which amount?
d. 95% of households have weekly expenditures of more than which amount?

6.54 Water_Wise (see problem 3.53) is analysing water usage for a block of one-bedroom flats. It collects data on total daily water consumption in kilolitres (kL) for 133 consecutive days. < WATER >
a. Decide whether total daily water usage in this block of flats is approximately normal by:
   i. evaluating the actual versus theoretical properties
   ii. constructing a normal probability plot
b. From part (a), assume that total daily water usage of the flats is normally distributed with a mean of 1.27 kL and standard deviation of 0.33 kL.
   i. On what percentage of days is total water usage less than 1.0 kL?
   ii. On what proportion of days is total water usage between 0.8 kL and 1.4 kL?
   iii. What is the probability that tomorrow total water usage will exceed 2.0 kL?

6.55 Suppose there is a free bus, with no timetable, which circles the city centre every 20 minutes. You arrive at a bus stop unaware of when the bus last arrived at this stop. What is the probability that you will wait for the bus:
a. less than 5 minutes?
b. between 10 and 15 minutes?
c. more than 12 minutes?

6.56 The lifespan of a certain car battery is normally distributed with a mean of 5 years and a standard deviation of 9 months.
a. What is the probability that a battery lasts more than 7 years?
b. What proportion of batteries fail within the warranty period of 3 years?
c. What warranty period, in months, should be set if only 1% of batteries fail within the warranty period?

Continuing cases

Tasman University

Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey.

a For a selection of numerical variables in the BBus student survey, decide whether the variable is approximately normally distributed by:
  i comparing data characteristics to theoretical properties
  ii constructing a normal probability plot
b For a selection of numerical variables in the MBA student survey, decide whether the variable is approximately normally distributed by:
  i comparing data characteristics to theoretical properties
  ii constructing a normal probability plot
c Write a report summarising your conclusions.
d Assume that the weighted average mark (WAM) of BBus students is normal with a mean of 63.9 and a standard deviation of 12.8.
  i What percentage of BBus students have a WAM of at least 65, a Credit average?
  ii What percentage of BBus students have a WAM of at least 75, a Distinction average?
  iii What proportion of BBus students have a WAM of at least 85, a High Distinction average?
  iv What proportion of BBus students have a WAM of less than 50?
  v What is the probability that a BBus student chosen at random has a WAM between 50 and 70?
  vi Below what WAM do the lowest 10% of BBus students achieve?
  vii What WAM is achieved by the top 5% of BBus students?
e Assume that the weighted average mark (WAM) of MBA students is normal with a mean of 73.8 and a standard deviation of 8.6.
  i What percentage of MBA students have a WAM of at least 65, a Credit average?
  ii What percentage of MBA students have a WAM of at least 75, a Distinction average?
  iii What proportion of MBA students have a WAM of at least 85, a High Distinction average?
  iv What proportion of MBA students have a WAM of less than 50?
  v What is the probability that an MBA student chosen at random has a WAM between 50 and 70?
  vi Below what WAM do the lowest 10% of MBA students achieve?
  vii What WAM is achieved by the top 5% of MBA students?

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a For a selection of numerical variables for regional city 1 state A, decide whether the variable is approximately normally distributed by:
  i comparing data characteristics to theoretical properties
  ii constructing a normal probability plot
b For a selection of numerical variables for coastal city 1 state A, decide whether the variable is approximately normally distributed by:
  i comparing data characteristics to theoretical properties
  ii constructing a normal probability plot
c Write a report summarising your conclusions.
d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.

Chapter 6 Excel Guide

EG6.1 CONTINUOUS PROBABILITY DISTRIBUTIONS
There are no Excel Guide instructions for this section.

EG6.2 THE NORMAL DISTRIBUTION
Key technique Use the NORM.DIST(X value, mean, standard deviation, True) function to calculate normal probabilities and use the NORM.S.INV(percentage) function and the STANDARDIZE function (see Section EG3.1) to calculate the Z value.
Example Calculate the normal probabilities for Examples 6.1, 6.4 and 6.5 and the X and Z values for Examples 6.6 and 6.7.
PHStat Use Normal. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Normal. In this procedure’s dialog box (shown in Figure EG6.1):
1. Enter 7 as the Mean and 2 as the Standard Deviation.
2. Check Probability for: X ≤ and enter 3.5 in its box.
3. Check Probability for: X > and enter 9 in its box.
4. Check Probability for range and enter 5 in the first box and 9 in the second box.
5. Check X for Cumulative Percentage and enter 10 in its box.
6. Check X Values for Percentage and enter 95 in its box.
7. Enter a Title and click OK.
Figure EG6.1 Normal Probability Distribution dialog box
In-depth Excel Use the COMPUTE worksheet of the Normal workbook as a template. The worksheet already contains the data for solving the problems in Examples 6.1 and 6.4 to 6.7. For other problems, change the values for the Mean, Standard Deviation, X Value, From X Value, To X Value, Cumulative Percentage and/or Percentage. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.
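For readers who want to cross-check the worksheet results outside Excel, the following is a minimal sketch in Python. It is not part of the Excel Guide or PHStat. It assumes Python 3.8 or later, whose standard library provides statistics.NormalDist. The function names are illustrative only, and the inputs are the values entered in the PHStat dialog box above (mean 7, standard deviation 2).

```python
from statistics import NormalDist


def normal_cdf(x, mean, sd):
    """Cumulative probability P(X <= x), the quantity NORM.DIST(x, mean, sd, TRUE) returns."""
    return NormalDist(mean, sd).cdf(x)


def standard_normal_inv(p):
    """Z value with cumulative area p, the quantity NORM.S.INV(p) returns."""
    return NormalDist().inv_cdf(p)


def standardize(x, mean, sd):
    """Z value for x, the quantity STANDARDIZE(x, mean, sd) returns."""
    return (x - mean) / sd


mean, sd = 7, 2
p_below = normal_cdf(3.5, mean, sd)                          # P(X <= 3.5)
p_above = 1 - normal_cdf(9, mean, sd)                        # P(X > 9)
p_range = normal_cdf(9, mean, sd) - normal_cdf(5, mean, sd)  # P(5 <= X <= 9)
x_at_10_percent = mean + standard_normal_inv(0.10) * sd      # X for cumulative 10%

print(round(p_below, 4), round(p_above, 4), round(p_range, 4), round(x_at_10_percent, 2))
```

The results should agree with the COMPUTE worksheet to the number of decimal places displayed there; any difference beyond rounding would suggest a data-entry error.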
EG6.3 EVALUATING NORMALITY
Comparing Data Characteristics to Theoretical Properties
Use the Sections EG2.3, EG3.1 and EG3.4 instructions to compare data characteristics to theoretical properties.

Constructing the Normal Probability Plot
Key technique Use an Excel Scatter (X, Y) chart with Z values calculated using the NORM.S.INV function.
Example Construct the normal probability plot for the call length data, as in Figure 6.22.
PHStat Use Normal Probability Plot. For the example, open the Call_Length file. Select PHStat ➔ Probability & Prob. Distributions ➔ Normal Probability Plot. In the Normal Probability Plot dialog box (shown in Figure EG6.2):
1. Enter or highlight A1:A21 as the Variable Cell Range.
2. Check First cell contains label.
3. Enter a Title and click OK.
Figure EG6.2 Normal Probability Plot dialog box
In addition to the chart sheet containing the normal probability plot, the procedure creates a plot data worksheet identical to the PlotData worksheet discussed in the In-depth Excel instructions.
In-depth Excel Use the worksheets of the NPP workbook as templates. The NormalPlot chart sheet displays a normal probability plot using the rank, the proportion, the Z value and the variable found in the PLOT_DATA worksheet. The PLOT_DATA worksheet already contains the call length data for the example. To construct a plot for a different variable, paste the sorted values for that variable in column D of the PLOT_DATA worksheet. If you have fewer than 20 values, delete rows from the bottom up. If you have more than 20 values, select row 21, right-click, click Insert ➔ Rows in the shortcut menu, copy down the formulas in A20:C20 to the new rows and then paste the sorted values for the variable in column D.
To create your own normal probability plot for the call length, open to the PLOT_DATA worksheet and select the cell range C1:D21. Then select Insert ➔ Scatter and select the first Scatter gallery item (that shows only points and is labeled with Scatter or Scatter with only Markers). Relocate the chart to a chart sheet, turn off the chart legend and gridlines, add axis titles and modify the chart title. If you use an Excel version older than Excel 2010, use the PLOT_OLDER worksheet and the NormalPlot_OLDER chart sheet.

EG6.4 THE UNIFORM DISTRIBUTION
There are no Excel Guide instructions for this section.

EG6.5 THE EXPONENTIAL DISTRIBUTION
Key technique Use the EXPON.DIST(X value, mean, True) function, where mean is the mean number of occurrences per unit (lambda).
Example Calculate the exponential probability for the bank ATM customer arrival example in Section 6.5.
PHStat Use Exponential. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Exponential. In the procedure’s dialog box (shown in Figure EG6.3):
1. Enter 20 as the Mean per unit (Lambda) and 0.1 as the X Value.
2. Enter a Title and click OK.
Figure EG6.3 Exponential Probability Distribution dialog box
In-depth Excel Use the COMPUTE worksheet of the Exponential workbook as a template. The worksheet already contains the data for the example. For other problems, change Lambda and X Value in cells B4 and B5. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

CHAPTER 7 Sampling distributions

PACKAGING TEA TREE SHAMPOO
For centuries, Indigenous Australian peoples used the leaves of the tea tree, Melaleuca alternifolia, for healing purposes. Now tea tree oil is being used in a variety of products for its beneficial antiseptic and antifungal properties.
Zoffira Pty Ltd is a small company that manufactures a number of tea tree oil products, including Zoffira T Shampoo. The shampoo is packaged in 500 mL clear pump-pack bottles via a conveyor belt process. You are in charge of monitoring that bottles are being filled correctly. Bottles are supposed to contain a mean of 500 mL of shampoo, as indicated on the package label. Because of the speed of the process, the volume of the contents varies from bottle to bottle, causing some bottles to be underfilled and some overfilled. If the process is not working properly, the mean volume in the bottles could vary too much from the label volume of 500 mL to be acceptable. As weighing every single bottle is too time-consuming, costly and inefficient, you must take a sample of bottles and make a decision regarding the probability that the packaging process is working properly. Each time you select a sample of bottles and check the individual contents, you calculate a sample mean $\bar{X}$. You need to determine the probability that such an $\bar{X}$ could have been randomly drawn from a population whose population mean is 500 mL. Based on this assessment, you will have to decide whether to maintain, alter or shut down the process.
© Nolan777|Dreamstime.com

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 interpret the concept of the sampling distribution
2 calculate probabilities related to the sample mean
3 recognise the importance of the Central Limit Theorem
4 calculate probabilities related to the sample proportion

In this chapter you need to make a decision about the shampoo-packaging process based on a sample of shampoo bottles. You will learn about sampling distributions and how to use them to solve business problems.
As in the previous chapter, the normal distribution is used to calculate probabilities.

7.1 SAMPLING DISTRIBUTIONS
LEARNING OBJECTIVE 1 Interpret the concept of the sampling distribution

In many applications, you want to make statistical inferences, that is, to use statistics calculated from samples to estimate the values of population parameters. In this chapter you will learn more about the sample mean, a statistic used to estimate a population mean (a parameter). You will also learn about the sample proportion, a statistic used to estimate the population proportion (a parameter). Your main concern when making a statistical inference is drawing conclusions about a population, not about a sample. For example, a political pollster is interested in the sample results only as a way of estimating the actual proportion of the votes that each candidate will receive from the population of voters. Likewise, as an operations manager for Zoffira Pty Ltd, you are interested only in using the sample mean calculated from a sample of shampoo bottles for estimating the mean volume contained in a population of bottles.

In practice, you select a single random sample of a predetermined size from the population. The items included in the sample are determined through the use of a random number generator, such as a table of random numbers (see Section 1.4 and Table E.1), or by using Microsoft Excel (see page 36). Hypothetically, to use the sample statistic to estimate the population parameter, you should examine every possible sample that could occur. A sampling distribution is the distribution of the results if you actually selected all possible samples.

sampling distribution The probability distribution of a given sample statistic with repeated sampling of the population.

7.2 SAMPLING DISTRIBUTION OF THE MEAN
In Chapter 3, several measures of central tendency are discussed. Undoubtedly, the mean is the most widely used measure of central tendency.
The sample mean is often used to estimate the population mean. The sampling distribution of the mean is the distribution of all possible sample means if you select all possible samples of a certain size.

sampling distribution of the mean The distribution of all possible sample means from samples of a given size for a given population.

The Unbiased Property of the Sample Mean
The sample mean is unbiased because the mean of all possible sample means (of a given sample size n), $\mu_{\bar{X}}$, is equal to the population mean μ. A simple example concerns a population of four candidates attempting a driver knowledge test of 45 questions in order to get a driver’s licence. Table 7.1 presents the number of errors.

unbiased If the average of all possible sample means equals the population mean then the sample mean is unbiased.

Table 7.1 Number of errors made by each of four driver’s knowledge test candidates

Candidate   Number of errors
Vicky       $X_1 = 3$
Yvana       $X_2 = 2$
Xing        $X_3 = 1$
Zac         $X_4 = 4$

This population distribution is shown in Figure 7.1.

Figure 7.1 Number of errors made by a population of four driver’s knowledge test candidates (vertical axis: Frequency; horizontal axis: Number of errors)

When you have the data from a population, you calculate the mean using Equation 7.1.

POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{7.1}$$

You calculate the population standard deviation σ using Equation 7.2.
POPULATION STANDARD DEVIATION

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \tag{7.2}$$

Thus, for the data of Table 7.1:

$$\mu = \frac{3 + 2 + 1 + 4}{4} = 2.5$$

and:

$$\sigma = \sqrt{\frac{(3 - 2.5)^2 + (2 - 2.5)^2 + (1 - 2.5)^2 + (4 - 2.5)^2}{4}} = 1.12 \text{ errors}$$

If you select samples of two candidates with replacement from this population, there are 16 possible samples ($N^n = 4^2 = 16$). Table 7.2 lists the 16 possible sample outcomes. If you average all 16 of these sample means, the mean of these values, $\mu_{\bar{X}}$, is equal to 2.5, which is also the mean of the population μ.

Table 7.2 All 16 samples of n = 2 test candidates from a population of N = 4 candidates when sampling with replacement

Sample  Candidates     Sample outcomes  Sample mean $\bar{X}_i$
1       Vicky, Vicky   3, 3             3
2       Vicky, Yvana   3, 2             2.5
3       Vicky, Xing    3, 1             2
4       Vicky, Zac     3, 4             3.5
5       Yvana, Vicky   2, 3             2.5
6       Yvana, Yvana   2, 2             2
7       Yvana, Xing    2, 1             1.5
8       Yvana, Zac     2, 4             3
9       Xing, Vicky    1, 3             2
10      Xing, Yvana    1, 2             1.5
11      Xing, Xing     1, 1             1
12      Xing, Zac      1, 4             2.5
13      Zac, Vicky     4, 3             3.5
14      Zac, Yvana     4, 2             3
15      Zac, Xing      4, 1             2.5
16      Zac, Zac       4, 4             4

$$\sum \bar{X}_i = 40 \qquad \mu_{\bar{X}} = \frac{40}{16} = 2.5$$

Since the mean of the 16 sample means is equal to the population mean, the sample mean is an unbiased estimator of the population mean. Therefore, although you do not know how close the sample mean of any particular sample selected comes to the population mean, you are at least assured that the mean of all the possible sample means that could have been selected is equal to the population mean.

Standard Error of the Mean
Figure 7.2 illustrates the variation in the sample mean when selecting all 16 possible samples.
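The enumeration in Table 7.2 can be reproduced with a short sketch. The Python below is an illustration only, not part of the text's Excel material. It lists every ordered sample of two candidates drawn with replacement, averages each one, and compares the mean and spread of the 16 sample means with the population values. The variable names are chosen for this sketch.

```python
from itertools import product
from statistics import mean, pstdev

# Number of errors for the four candidates in Table 7.1.
errors = {"Vicky": 3, "Yvana": 2, "Xing": 1, "Zac": 4}
population = list(errors.values())

# All ordered samples of size n = 2 drawn with replacement: N**n = 4**2 = 16.
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

population_mean = mean(population)         # mu
mean_of_sample_means = mean(sample_means)  # mean of the 16 sample means

population_sd = pstdev(population)         # sigma (divides by N)
sd_of_sample_means = pstdev(sample_means)  # spread of the 16 sample means

print(len(samples), population_mean, mean_of_sample_means)
print(round(population_sd, 2), round(sd_of_sample_means, 2))
```

The two means coincide at 2.5, and the 16 sample means are less spread out than the four individual values.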
In this small example, although the sample mean varies from sample to sample depending on which candidates are selected, the sample mean does not vary as much as the individual values in the population. That the sample means are less variable than the individual values in the population follows directly from the fact that each sample mean averages together all the values in the sample. A population consists of individual outcomes that can take on a wide range of values from extremely small to extremely large. However, if a sample contains an extreme value, although this value will have an effect on the sample mean, the effect is reduced because the value is averaged with all the other values in the sample. As the sample size increases, the effect of a single extreme value becomes smaller because it is averaged with more values.

Figure 7.2 Sampling distribution of the mean based on all possible samples containing two candidates (vertical axis: Frequency; horizontal axis: Number of errors)

standard error of the mean Reflects how much the sample mean varies from its average value in repeated experiments.

The value of the standard deviation of all possible sample means, called the standard error of the mean, expresses how the sample mean varies from sample to sample. Equation 7.3 defines the standard error of the mean when sampling with replacement or without replacement (see page 18) from large or infinite populations.

STANDARD ERROR OF THE MEAN
The standard error of the mean $\sigma_{\bar{X}}$ is equal to the standard deviation in the population σ divided by the square root of the sample size n.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \tag{7.3}$$

Therefore, as the sample size increases, the standard error of the mean decreases in inverse proportion to the square root of the sample size.
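As a small illustration of Equation 7.3, the sketch below evaluates the standard error for several sample sizes. The function name and the population standard deviation of 15 are illustrative choices for this sketch, not values fixed by the equation.

```python
import math


def standard_error_of_mean(sigma, n):
    """Equation 7.3: population standard deviation divided by sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


sigma = 15  # illustrative population standard deviation
for n in (1, 4, 16, 64):
    print(n, standard_error_of_mean(sigma, n))
```

Each fourfold increase in n halves the standard error, which is what division by the square root of n implies.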
You can also use Equation 7.3 as an approximation to the standard error of the mean when the sample is selected without replacement, if the sample contains less than 5% of the entire population. Example 7.1 calculates the standard error of the mean for such a situation.

EXAMPLE 7.1 CALCULATING THE STANDARD ERROR OF THE MEAN
Return to the shampoo-packaging process described in the scenario on page 248. If you randomly select a sample of 25 bottles without replacement from the thousands of bottles filled during a shift, the sample contains far less than 5% of the population. Given that the standard deviation of the shampoo-packaging process is 15 mL, calculate the standard error of the mean.

SOLUTION
Using Equation 7.3 with n = 25 and σ = 15, the standard error of the mean is:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3 \text{ mL}$$

The variation in the sample means for samples of n = 25 is much less than the variation in individual bottles of shampoo (i.e. $\sigma_{\bar{X}} = 3$ while σ = 15).

Sampling from Normally Distributed Populations
Now that the concept of a sampling distribution has been introduced and the standard error of the mean has been defined, what distribution will the sample mean follow? If you are sampling from a population that is normally distributed with mean μ and standard deviation σ, regardless of the sample size n, the sampling distribution of the mean is normally distributed with mean $\mu_{\bar{X}} = \mu$ and standard error of the mean $\sigma_{\bar{X}}$.
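The claim about the mean and standard error can be checked roughly by simulation. The sketch below is an illustration under stated assumptions: it draws 5,000 samples (an arbitrary choice) of 25 fill volumes from a normal population with μ = 500 and σ = 15, the shampoo-process values, and summarises the resulting sample means. It uses only Python's standard library, and the fixed seed is there only to make the run repeatable.

```python
import random
from statistics import mean, pstdev

random.seed(7)  # fixed seed so the sketch is repeatable

MU, SIGMA, N = 500, 15, 25  # shampoo-process values used in the text
NUM_SAMPLES = 5000          # assumed number of repeated samples


def one_sample_mean():
    """Draw N fill volumes from a normal population and return their mean."""
    return mean(random.gauss(MU, SIGMA) for _ in range(N))


sample_means = [one_sample_mean() for _ in range(NUM_SAMPLES)]

print("mean of sample means:", round(mean(sample_means), 2))
print("standard deviation of sample means:", round(pstdev(sample_means), 2))
print("sigma / sqrt(n):", SIGMA / N ** 0.5)
```

With a finite number of simulated samples the first two figures only approximate 500 and 3; they should move closer as NUM_SAMPLES grows.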
In the simplest case, if you take samples of size n = 1, each possible sample mean is a single value from the population because:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1}{1} = X_1$$

Therefore, if the population is normally distributed with mean μ and standard deviation σ, the sampling distribution of $\bar{X}$ for samples of n = 1 must also follow the normal distribution with mean $\mu_{\bar{X}} = \mu$ and standard error of the mean $\sigma_{\bar{X}} = \sigma/\sqrt{1} = \sigma$. In addition, as the sample size increases, the sampling distribution of the mean still follows a normal distribution with mean $\mu_{\bar{X}} = \mu$, but the standard error of the mean decreases, so that a larger proportion of sample means are closer to the population mean. Figure 7.3 illustrates this reduction in variability, in which 500 samples of sizes 1, 2, 4, 8, 16 and 32 were randomly selected from a normally distributed population. From the polygons in Figure 7.3, you can see that, although the sampling distribution of the mean is approximately¹ normal for each sample size, the sample means are distributed more tightly around the population mean as the sample size is increased.

To examine the concept of the sampling distribution of the mean further, consider the shampoo-packaging scenario again. The packaging equipment that is filling 500-mL bottles of shampoo is set so that the amount of shampoo in a bottle is normally distributed with a mean of 500 mL. From past experience, the population standard deviation for this filling process is 15 mL. If you randomly select a sample of 25 bottles from the many thousands that are filled in a day and the mean volume is calculated for this sample, what type of result could you expect? For example, do you think that the sample mean could be 500 mL? 300 mL? 510 mL?
Figure 7.3 Sampling distribution of the mean from 500 samples of sizes n = 1, 2, 4, 8, 16 and 32 selected from a normal population (one polygon per sample size, plotted against Z)

¹ Remember that ‘only’ 500 samples out of an infinite number of samples have been selected, so that the sampling distributions shown are only approximations of the true distributions.

The sample acts as a miniature representation of the population, so if the values in the population are normally distributed, the values in the sample should be approximately normally distributed. Thus, if the population mean is 500 mL, the sample mean has a good chance of being close to 500 mL.

How can you determine the probability that the sample of 25 bottles will have a mean below 497 mL? From the normal distribution (Section 6.2) you know that you can find the area below any value X by converting to standardised Z units:

$$Z = \frac{X - \mu}{\sigma}$$

In the examples in Section 6.2 we saw how any single value X differs from the mean. Now, in the shampoo-packaging example, the value involved is a sample mean $\bar{X}$ and we wish to determine the likelihood that a sample mean is below 497. Thus, by substituting $\bar{X}$ for X, $\mu_{\bar{X}}$ for μ and $\sigma_{\bar{X}}$ for σ, the appropriate Z value is defined in Equation 7.4.

LEARNING OBJECTIVE 2 Calculate probabilities related to the sample mean

FINDING Z FOR THE SAMPLING DISTRIBUTION OF THE MEAN
The Z value is equal to the difference between the sample mean $\bar{X}$ and the population mean μ, divided by the standard error of the mean $\sigma_{\bar{X}}$.

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \tag{7.4}$$

To find the area below 497 mL, from Equation 7.4:

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{497 - 500}{15 / \sqrt{25}} = \frac{-3}{3} = -1.00$$

The area corresponding to Z = -1.00 in Table E.2 is 0.1587. Therefore, 15.87% of all the possible samples of size 25 have a sample mean below 497 mL.
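The table lookup can also be done numerically. The sketch below is one possible way to do so in Python 3.8 or later, where statistics.NormalDist supplies the standard normal cumulative area that Table E.2 tabulates. The function name is hypothetical, introduced only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_below(x_bar, mu, sigma, n):
    """Return (Z, P(sample mean < x_bar)) for a normally distributed sample mean."""
    z = (x_bar - mu) / (sigma / sqrt(n))  # Equation 7.4
    return z, NormalDist().cdf(z)         # area below z under the standard normal curve


z, p = prob_sample_mean_below(497, mu=500, sigma=15, n=25)
print(z, round(p, 4))
```

Changing n in the call shows how the same cut-off of 497 mL becomes less likely for the sample mean as the sample size grows.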
This is not the same as saying that a certain percentage of individual bottles will have less than 497 mL of shampoo. We calculate that percentage as follows:

$$Z = \frac{X - \mu}{\sigma} = \frac{497 - 500}{15} = \frac{-3}{15} = -0.20$$

The area corresponding to Z = -0.20 in Table E.2 is 0.4207. Therefore, 42.07% of the individual bottles are expected to contain less than 497 mL. Comparing these results, we see that many more individual bottles than sample means are below 497 mL. This result is explained by the fact that each sample consists of 25 different values, some small and some large. The averaging process dilutes the importance of any individual value, particularly when the sample size is large. Thus, the chance that the sample mean of 25 bottles is far away from the population mean is less than the chance that a single bottle is far away.

Examples 7.2 and 7.3 show how these results are affected by using a different sample size.

EXAMPLE 7.2 THE EFFECT OF SAMPLE SIZE n ON THE CALCULATION OF $\sigma_{\bar{X}}$
How is the standard error of the mean affected by increasing the sample size from 25 to 100 bottles?

SOLUTION
If n = 100 bottles, then using Equation 7.3:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = \frac{15}{10} = 1.5$$

The fourfold increase in the sample size from 25 to 100 reduces the standard error of the mean by half, from 3 mL to 1.5 mL. This demonstrates that taking a larger sample results in less variability in the sample means from sample to sample.

EXAMPLE 7.3 THE EFFECT OF SAMPLE SIZE n ON THE CLUSTERING OF MEANS IN THE SAMPLING DISTRIBUTION
In the shampoo-packaging example, if you select a sample of 100 bottles, what is the probability that the sample mean is below 497 mL?

SOLUTION
Using Equation 7.4:

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{497 - 500}{15 / \sqrt{100}} = \frac{-3}{1.5} = -2.00$$

From Table E.2, the area less than Z = -2.00 is 0.0228.
Therefore, 2.28% of the samples of 100 have means below 497 mL, as compared with 15.87% for samples of 25.

Sometimes, you need to find the interval that contains a fixed proportion of the sample means. You need to determine a distance below and above the population mean containing a specific area of the normal curve. From Equation 7.4:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Solving for $\bar{X}$ results in Equation 7.5.

FINDING $\bar{X}$ FOR THE SAMPLING DISTRIBUTION OF THE MEAN

$$\bar{X} = \mu + Z \frac{\sigma}{\sqrt{n}} \tag{7.5}$$

Example 7.4 illustrates the use of Equation 7.5.

EXAMPLE 7.4 DETERMINING THE INTERVAL THAT INCLUDES A FIXED PROPORTION OF THE SAMPLE MEANS
In the shampoo-packaging example, find an interval around the population mean that will include 95% of the sample means based on samples of 25 bottles.

SOLUTION
If 95% of the sample means are in the interval, then 5% are outside the interval. Divide the 5% into two equal parts of 2.5%. The value of Z in Table E.2 corresponding to an area of 0.0250 in the lower tail of the normal curve is -1.96, and the value of Z corresponding to a cumulative area of 0.975 (i.e. 0.025 in the upper tail of the normal curve) is +1.96. The lower value of $\bar{X}$ (called $\bar{X}_L$) and the upper value of $\bar{X}$ (called $\bar{X}_U$) are found by using Equation 7.5:

$$\bar{X}_L = 500 + (-1.96)\frac{15}{\sqrt{25}} = 500 - 5.88 = 494.12$$

$$\bar{X}_U = 500 + (1.96)\frac{15}{\sqrt{25}} = 500 + 5.88 = 505.88$$

Therefore, 95% of all sample means are between 494.12 and 505.88 mL for samples of 25 bottles.

Sampling from Non-normally Distributed Populations: The Central Limit Theorem
So far in this section we have discussed the sampling distribution of the mean for a normally distributed population. However, in many instances, either you know that the population is not normally distributed or it is unrealistic to assume a normal distribution.
An important theorem in statistics, the Central Limit Theorem, deals with this situation.

LEARNING OBJECTIVE 3 Recognise the importance of the Central Limit Theorem

Central Limit Theorem If the sample size is large enough, the distribution of sample means will be approximately normal even if the samples came from a population that was not normal.

THE CENTRAL LIMIT THEOREM
The Central Limit Theorem states that, as the sample size (i.e. the number of values in each sample) gets large enough, the sampling distribution of the mean is approximately normally distributed. This is true regardless of the shape of the distribution of the individual values in the population.

What sample size is large enough? A great deal of statistical research has gone into this issue. As a general rule, statisticians have found that for many population distributions, when the sample size is at least 30, the sampling distribution of the mean is approximately normal. However, you can apply the Central Limit Theorem for even smaller sample sizes if the population distribution is approximately bell-shaped. In the uncommon case where the distribution is extremely skewed or has more than one mode, you may need sample sizes larger than 30 to ensure normality.

Figure 7.4 illustrates the application of the Central Limit Theorem to different populations. The sampling distributions from three different continuous distributions (normal, uniform and exponential) for varying sample sizes (n = 2, 5, 30) are displayed.

Panel A of Figure 7.4 shows the sampling distribution of the mean selected from a normal population. As mentioned earlier, when the population is normally distributed the sampling distribution of the mean is normally distributed for any sample size. (You can measure the variability using the standard error of the mean, Equation 7.3.) Because of the unbiasedness property, the mean of any sampling distribution is always equal to the mean of the population.
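The behaviour summarised in Figure 7.4 can be explored with a small simulation. The sketch below is an illustration under assumptions that are not in the text: a uniform population on the interval 0 to 1, an exponential population with a mean of 1, and 4,000 repeated samples per sample size. For each population and each n it reports the mean, the standard deviation and the skewness of the simulated sample means; a skewness near 0 is consistent with a symmetrical, bell-shaped distribution.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

NUM_SAMPLES = 4000  # assumed number of repeated samples per sample size


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    m, s = mean(values), pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


def sample_means(draw, n):
    """Means of NUM_SAMPLES samples of size n, each value produced by calling draw()."""
    return [mean(draw() for _ in range(n)) for _ in range(NUM_SAMPLES)]


populations = {
    "uniform": lambda: random.uniform(0, 1),         # assumed population, mean 0.5
    "exponential": lambda: random.expovariate(1.0),  # assumed population, mean 1
}

results = {}
for name, draw in populations.items():
    for n in (2, 5, 30):
        means = sample_means(draw, n)
        results[(name, n)] = (mean(means), pstdev(means), skewness(means))
        print(name, n, [round(v, 3) for v in results[(name, n)]])
```

In a run of this kind the standard deviation of the sample means should shrink as n increases for both populations, and the skewness for the exponential population should move towards 0. The exact figures depend on the seed and on the number of simulated samples.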
Panel B of Figure 7.4 depicts the sampling distribution from a population with a uniform (or rectangular) distribution (see Section 6.4). When samples of size n = 2 are selected, there is a peaking or central limiting effect already working. For n = 5, the sampling distribution is bell shaped and approximately normal. When n = 30, the sampling distribution looks very similar to a normal distribution. In general, the larger the sample size the more closely the sampling distribution will follow a normal distribution. As with all cases, the mean of each sampling distribution is equal to the mean of the population, and the variability decreases as the sample size increases.

Figure 7.4 Sampling distribution of the mean for different populations for samples of n = 2, 5 and 30 (Panel A: normal population; Panel B: uniform population; Panel C: exponential population; each panel shows the population values of X and the sampling distribution of $\bar{X}$ for n = 2, n = 5 and n = 30)

Panel C of Figure 7.4 presents an exponential distribution (see Section 6.5). This population is heavily skewed to the right. When n = 2, the sampling distribution is still highly skewed to the right but less so than the distribution of the population. For n = 5, the sampling distribution is more symmetrical with only a slight skew to the right.
When n = 30, the sampling distribution looks approximately normal. Again, the mean of each sampling distribution is equal to the mean of the population, and the variability decreases as the sample size increases.

Using the results from these well-known statistical distributions (normal, uniform and exponential), you can draw the following conclusions regarding the Central Limit Theorem:
• For most population distributions, regardless of shape, the sampling distribution of the mean is approximately normally distributed if samples of at least 30 are selected.
• If the population distribution is fairly symmetrical, the sampling distribution of the mean is approximately normal for samples as small as 5.
• If the population is normally distributed, the sampling distribution of the mean is normally distributed regardless of the sample size.

The Central Limit Theorem is of crucial importance in using statistical inference to draw conclusions about a population. It allows you to make inferences about the population mean without having to know the specific shape of the population distribution. You can explore how the Central Limit Theorem works yourself using Excel to generate samples through a Random Number Generator (see the Chapter 7 Excel Guide at the end of this chapter). PHStat also has an easy-to-use Sampling Distributions Simulation.

Problems for Section 7.2

LEARNING THE BASICS
7.1 Given a normal distribution with μ = 100 and σ = 10, if you select a sample of n = 25:
a. What is the probability that $\bar{X}$ is:
  i. less than 95?
  ii. between 95 and 97.5?
  iii. above 102.2?
b. There is a 65% chance that $\bar{X}$ is above what value?
7.2 Given a normal distribution with μ = 50 and σ = 5, if you select a sample of n = 100:
a. What is the probability that $\bar{X}$ is:
  i. less than 47?
  ii. between 47 and 49.5?
  iii. above 51.1?
b. There is a 35% chance that $\bar{X}$ is above what value?
APPLYING THE CONCEPTS
7.3 For each of the following three populations, indicate what the sampling distribution of the mean for samples of 25 would consist of.
a. Travel expense vouchers for a university in an academic year
b. Absentee records (days absent per year) in 2010 for employees of a large construction company
c. Yearly sales (in litres) of E10 fuel at service stations located in a particular state
7.4 The following data represent the number of days absent per year in a population of six employees of a small company:
1 3 6 7 9 10
a. Assuming that you sample without replacement, select all possible samples of n = 2 and construct the sampling distribution of the mean. Calculate the mean of all the sample means and also calculate the population mean. Are they equal? What is this property called?
b. Repeat (a) for all possible samples of n = 3.
c. Compare the shape of the sampling distribution of the mean in (a) and (b). Which sampling distribution has less variability? Why?
d. Assuming that you sample with replacement, repeat (a), (b) and (c) and compare the results. Which sampling distributions have the least variability, those in (a) or (b)? Why?
7.5 The number of passengers passing through a large South East Asian airport is normally distributed with a mean of 110,000 persons per day and a standard deviation of 20,200 persons. If you select a random sample of 16 days:
a. What is the sampling distribution of the mean?
b. What is the probability that the sample mean is less than 98,000 passengers per day?
c. What is the probability that the sample mean is between 102,000 and 104,500 passengers per day?
d. The probability is 60% that the sample mean will be between which two values symmetrically distributed around the population mean?
7.6 Realestate.com.au reports that the median price of houses in the Newcastle suburb of Merewether that were sold in the 13 months to March 2017 was $1,150,000 (<www.realestate.com.au/neighbourhoods/merewether-2291-nsw?cid=srp> accessed 3 April 2017). Suppose that the mean price of houses in Merewether sold during that period was $1,236,450 and the standard deviation was $150,000.
a. If you take samples of n = 2, describe the shape of the sampling distribution of X̄.
b. If you take samples of n = 100, describe the shape of the sampling distribution of X̄.
c. If you take a random sample of n = 100, what is the probability that the sample mean will be less than $1,235,000?
7.7 Travel time on a bus between two suburban stops is normally distributed with μ = 8 minutes and σ = 2 minutes.
a. If you select a random sample of 25 trips, what is the probability that the sample mean is between 6.9 and 8.2 minutes?
b. If you select a random sample of 25 trips, what is the probability that the sample mean is between 7.5 and 8 minutes?
c. If you select a random sample of 100 trips, what is the probability that the sample mean is between 6.9 and 8.2 minutes?
d. Explain the difference between the results of (a) and (c).
7.8 It is often important to monitor traffic on a website as organisations need to make online interactions with their clients faster and easier. For example, businesses applying for an Australian Business Number (ABN) online at <www.abr.gov.au> are asked to have a variety of information about their entity ready before they begin the online process. Assume that ABN online-application times are normally distributed with a mean time of 40 minutes and a standard deviation of 5 minutes. If a random sample of 50 applications is taken:
a. What is the probability that the sample mean application time is less than 38 minutes?
b. What is the probability that the sample mean is between 39 and 41 minutes?
c. The probability is 80% that the sample mean is between what two values symmetrically distributed around the population mean?
d. The probability is 90% that the sample mean is less than what value?
7.9 A company is having a new corporate website developed. In the final testing phase the download time to open the new home page is recorded for a large number of computers in office and home settings. The mean download time for the site is 3.61 seconds. Suppose that the download times for the site are normally distributed with a standard deviation of 0.5 seconds. If you select a random sample of 30 download times:
a. What is the probability that the sample mean download time is less than 3.75 seconds?
b. What is the probability that the sample mean is between 3.70 and 3.90 seconds?
c. The probability is 80% that the sample mean is between which two values symmetrically distributed around the population mean?
d. The probability is 90% that the sample mean is less than what value?

7.3 SAMPLING DISTRIBUTION OF THE PROPORTION
Consider a categorical variable that has only two categories, such as the customer prefers your brand or the customer prefers the competitor’s brand. Of interest is the proportion of items belonging to one of the categories; for example, the proportion of customers who prefer your brand. The population proportion, represented by π, is the proportion of items in the entire population with the characteristic of interest. The sample proportion, represented by p, is the proportion of items in the sample with the characteristic of interest. The sample proportion, a statistic, is used to estimate the population proportion, a parameter. To calculate the sample proportion, you assign the two possible outcomes scores of 1 or 0 to represent the presence or absence of the characteristic. You then sum all the 1 and 0 scores and divide by n, the sample size.
For example, if, in a sample of five customers, three preferred your brand and two did not, you have three ones and two zeroes. Summing the three ones and two zeroes and dividing by the sample size of 5 gives you a sample proportion of 0.60 who preferred your brand.

THE SAMPLE PROPORTION
$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \qquad (7.6)$$

The sample proportion p takes on values between 0 and 1. If all individuals possess the characteristic, you assign each a score of 1 and p is equal to 1. If half the individuals possess the characteristic, you assign half a score of 1, and assign the other half a score of 0, and p is equal to 0.5. If none of the individuals possess the characteristic, you assign each a score of 0 and p is equal to 0. While the sample mean X̄ is an unbiased estimator of the population mean μ, the statistic p is an unbiased estimator of the population proportion π. By analogy to the sampling distribution of the mean, the standard error of the proportion, σ_p, is given in Equation 7.7.

standard error of the proportion: The standard deviation of the sample proportion for repeated samples.

STANDARD ERROR OF THE PROPORTION
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} \qquad (7.7)$$

sampling distribution of the proportion: The distribution of all possible sample proportions from samples of a certain size.

If you select all possible samples of a certain size, the distribution of all possible sample proportions is referred to as the sampling distribution of the proportion. When sampling with replacement from a finite population, the sampling distribution of the proportion follows the binomial distribution, as discussed in Section 5.3.
However, you can use the normal distribution to approximate the binomial distribution when nπ and n(1 - π) are each greater than 5 (see Section 6.6). In most cases in which inferences are made about the proportion, the sample size is substantial enough to meet the conditions for using the normal approximation (see reference 1). Therefore, in many instances, you can use the normal distribution to estimate the sampling distribution of the proportion. Substituting p for X̄, π for μ and $\sqrt{\pi(1 - \pi)/n}$ for $\sigma/\sqrt{n}$ in Equation 7.4 results in Equation 7.8.

DIFFERENCE BETWEEN THE SAMPLE PROPORTION AND THE POPULATION PROPORTION IN STANDARDISED NORMAL UNITS
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \qquad (7.8)$$

LEARNING OBJECTIVE 4: Calculate probabilities related to the sample proportion

To illustrate the sampling distribution of the proportion, suppose that the manager of a railway’s WiFi services determines that 40% of all passengers have multiple WiFi-enabled devices available on board their train. You select a random sample of 200 passengers and count those with multiple WiFi-enabled devices. The probability that the sample proportion of passengers with multiple devices is less than 0.30 is calculated as follows. Because nπ = 200(0.40) = 80 > 5 and n(1 - π) = 200(0.60) = 120 > 5, the sample size is large enough to assume that the sampling distribution of the proportion is approximately normally distributed. Using Equation 7.8:

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{0.30 - 0.40}{\sqrt{\dfrac{(0.40)(0.60)}{200}}} = \frac{-0.10}{\sqrt{\dfrac{0.24}{200}}} = \frac{-0.10}{0.0346} = -2.89$$

Using Table E.2, the area under the normal curve less than Z = -2.89 is 0.0019. Therefore, the probability that the sample proportion is less than 0.30 is 0.0019, a highly unlikely event. This means that if the true proportion of successes in the population is 0.40, less than one-fifth of 1% of the samples of n = 200 are expected to have sample proportions of less than 0.30.
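The WiFi calculation can also be checked numerically. The following is a minimal sketch in Python, assuming the standard library's statistics.NormalDist is used in place of Table E.2; the function name prob_sample_proportion_below is hypothetical and was chosen only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_proportion_below(p, pi, n):
    """Normal approximation to P(sample proportion < p), following Equation 7.8.

    Hypothetical helper for illustration only. It refuses to run unless
    n*pi and n*(1 - pi) are both greater than 5, the condition stated in
    the text for using the normal approximation.
    """
    if n * pi <= 5 or n * (1 - pi) <= 5:
        raise ValueError("sample too small for the normal approximation")
    std_error = sqrt(pi * (1 - pi) / n)  # standard error, Equation 7.7
    z = (p - pi) / std_error             # standardised value, Equation 7.8
    return z, NormalDist().cdf(z)        # area under the curve to the left of Z


# WiFi example from the text: pi = 0.40, n = 200, threshold p = 0.30
z, prob = prob_sample_proportion_below(0.30, 0.40, 200)
print(f"Z = {z:.2f}, probability = {prob:.4f}")
```

Because the code carries the unrounded value of Z into the normal curve rather than the two-decimal value looked up in the table, small differences in later decimal places are to be expected; to four decimal places the result agrees with the table lookup.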
Problems for Section 7.3

LEARNING THE BASICS
7.10 In a random sample of 64 people, 48 are classified as ‘successful’. If the population proportion is 0.70:
a. Determine the sample proportion p of ‘successful’ people.
b. Determine the standard error of the proportion.
7.11 A random sample of 50 households was selected for a telephone survey. The key question asked was, ‘Has any member of your household travelled by plane in the past month?’ Of the 50 respondents, 16 said yes and 34 said no. If the population proportion is 0.40:
a. Determine the sample proportion p of households with members who have travelled by plane in the past month.
b. Determine the standard error of the proportion.
7.12 The following data represent the responses (Y for yes and N for no) from a sample of 40 university students to the question, ‘Do you currently own any shares in listed companies?’:
N N Y N N Y N Y N Y N N Y N Y Y N N N Y N Y N N N N Y N N Y Y N N N Y N N Y N N
If the population proportion is 0.30:
a. Determine the sample proportion p of university students who own shares in listed companies.
b. Determine the standard error of the proportion.

APPLYING THE CONCEPTS
7.13 A political polling organisation is conducting an analysis of sample results in order to make predictions on election night. Assuming a two-candidate election, if a specific candidate receives at least 55% of the vote in the sample, then that candidate will be forecast as the winner of the election. If you select a random sample of 100 voters:
a. What is the probability that a candidate will be forecast as the winner when:
i. the true percentage of her vote is 50.1%?
ii. the true percentage of her vote is 60%?
iii. the true percentage of her vote is 49% (and she will actually lose the election)?
b. If the sample size is increased to 400, what are your answers to (a)? Discuss.
7.14 You plan to conduct a marketing experiment in which students are to taste one of two different brands of soft drink. Their task is to identify correctly the brand they tasted. You select a random sample of 200 students and assume they have no ability to distinguish between the two brands. (Hint: If an individual has no ability to distinguish between the two soft drinks, then each brand is equally likely to be selected.)
a. What is the probability that the sample will have between 50% and 60% of the identifications correct?
b. The probability is 90% that the sample percentage is contained within which symmetrical limits of the population percentage?
c. What is the probability that the sample percentage of correct identifications is greater than 65%?
d. Which is more likely to occur: more than 60% correct identifications in the sample of 200 or more than 55% correct identifications in a sample of 1,000? Explain.
7.15 Over the past few years there has been increased monitoring of the representation of women on corporate boards. The Australian Institute of Company Directors reports in its March–May 2016 Report that 23.6% of ASX 200 board members were female (<www.companydirectors.com.au/~/media/resources/director-resource-centre/governance-and-director-issues/board-diversity/boarddiversity-pdf/05385-2-coms-gender-diversity-quarterlyreport-june16-a4_web.ashx> accessed 25 April 2017). Suppose that the true percentage of women on ASX 200 boards is now 24.6% and that a random sample of 220 board members is chosen.
a. What is the probability that in the sample less than 24% of board members will be women?
b. What is the probability that in the sample between 24.2% and 25.0% of board members will be women?
c. What is the probability that in the sample between 24.5% and 24.7% of board members will be women?
d. If a sample of 100 is taken, how does this change your answers to (a), (b) and (c)?
7.16 People with permanent visas accounted for 19.5% of the net overseas migration to Australia during 2015. The relative shares of the different visa categories were: Family visas, 6.9%; Skilled, 9.0%; and Special Eligibility and Humanitarian, 2.5% (Australian Bureau of Statistics, Migration, Australia, 2015–16, Cat. No. 3412.0, March 2017). Suppose a government department is conducting a follow-up study and randomly selects 260 people who migrated in 2015.
a. What is the probability that more than 9.1% of the people in the sample are skilled migrants?
b. What is the probability that less than 2.8% are holders of Special Eligibility or Humanitarian permanent visas?
c. If a random sample of size 500 is taken, how does this change your answers to (a) and (b)?
7.17 As technology continues to change rapidly there has been a worldwide trend towards the use of smaller and more mobile devices and away from PCs. Analysts at Gartner predicted that in 2019 only 8% of devices shipped worldwide would be traditional PCs (desktops or notebooks) (<www.consumerit.eu/index.php?option=com_content&view=article&id=3363:gartner-spending-on-the-devicesup-shipments-flat&catid=20&Itemid=100017> accessed 27 April 2017). Assume this prediction holds and you randomly select a sample of 100 people who purchase a device shipped in 2019.
a. What is the probability that between 7.5% and 8.2% purchase a traditional PC?
b. The probability is 90% that the sample percentage will be contained within which symmetrical limits of the population percentage?
c. The probability is 95% that the sample percentage will be contained within which symmetrical limits of the population percentage?
7.18 According to an Australian Government report, retail trade is the second largest employing industry in Australia with more than 1.267 million workers, or 11% of working Australians (Department of Employment, Australian Jobs 2016 <https://docs.employment.gov.au/system/files/doc/other/australianjobs2016_0.pdf> accessed 28 April 2017). This report shows that the percentage of those employed in retail trade in November 2015 who were working part-time was 49%. Assuming this percentage is still current:
a. If you select a random sample of 400 Australian retail trade workers, what is the probability that the sample has between 45% and 50% who are employed part-time?
b. If a current sample of 400 Australian retail trade workers has 50.2% who are employed part-time, what can you infer about the population estimate of 49%? Explain.
c. If a current sample of 100 Australian retail trade workers has 50.2% who are employed part-time, what can you infer about the population estimate of 49%? Explain.
d. Explain the difference between the results in (b) and (c).
7.19 The Australian Tax Office carries out a range of verification checks and audits for the goods and services tax (GST) including Business Activity Statement integrity audits. Assume that currently no additional tax is collected for 25% of such audits. Suppose that you select a random sample of 100 audits. What is the probability that the sample will have:
a. between 24% and 26% of audits that collect no additional tax?
b. between 20% and 30% of audits that collect no additional tax?
c. more than 30% of audits that collect no additional tax?
7.20 The 11th Annual Statistical Report of the HILDA Survey relates to the 2016 phase of a large longitudinal study of Australian residents. It found that 19.9% of households surveyed had HECS/HELP debts and 35.7% had debts on their home (R. Wilkins (ed), The Household, Income and Labour Dynamics in Australia Survey: Selected Findings from Waves 1 to 14, Melbourne Institute of Applied Economic and Social Research, University of Melbourne, 2016 <http://melbourneinstitute.unimelb.edu.au/__data/assets/pdf_file/0007/2155507/hildastatreport-2016.pdf> accessed 28 April 2017). Assume the same percentages found in the survey apply right now for all Australian households. In a sample of 600 of these households, what is the probability that:
a. more than 18% of households have HECS/HELP debts?
b. fewer than 33.5% of households have debts on their home?

Assess your progress

Summary
In this chapter we looked at the sampling distribution of the sample mean, the Central Limit Theorem and the sampling distribution of the sample proportion. You learned that the sample mean is an unbiased estimator of the population mean and the sample proportion is an unbiased estimator of the population proportion. By observing the mean volume in a sample of shampoo bottles filled by Zoffira Pty Ltd, you were able to draw conclusions about the mean volume in the population of shampoo bottles. In the next three chapters, techniques commonly used for statistical inference (confidence intervals and tests of hypotheses) are discussed.
Key formulas

Population mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (7.1)$$

Population standard deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (7.2)$$

Standard error of the mean
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (7.3)$$

Finding Z for the sampling distribution of the mean
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}} \qquad (7.4)$$

Finding X̄ for the sampling distribution of the mean
$$\bar{X} = \mu + Z\frac{\sigma}{\sqrt{n}} \qquad (7.5)$$

Sample proportion
$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \qquad (7.6)$$

Standard error of the sample proportion
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} \qquad (7.7)$$

Finding Z for the sampling distribution of the proportion
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \qquad (7.8)$$

Key terms
Central Limit Theorem
sampling distribution
sampling distribution of the mean
sampling distribution of the proportion
standard error of the mean
standard error of the proportion
unbiased

References
1. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).

Chapter review problems

CHECKING YOUR UNDERSTANDING
7.21 Why is the sample mean an unbiased estimator of the population mean?
7.22 Why does the standard error of the mean decrease as the sample size n increases?
7.23 Why does the sampling distribution of the mean follow a normal distribution for a large enough sample size even though the population may not be normally distributed?
7.24 What is the difference between a probability distribution and a sampling distribution?
7.25 Under what circumstances does the sampling distribution of the proportion approximately follow the normal distribution?

APPLYING THE CONCEPTS
7.26 A particular type of ballpoint pen uses minute ball bearings that are targeted to have a diameter of 0.5 mm. The lower and upper specification limits under which the ball bearing can operate are 0.49 mm (lower) and 0.51 mm (upper). Past experience has indicated that the actual diameter of the ball bearings is approximately normally distributed with a mean of 0.503 mm and a standard deviation of 0.004 mm. If you select a random sample of 25 ball bearings:
a. What is the probability that the sample mean is:
i. between the target and the population mean of 0.503?
ii. between the lower specification limit and the target?
iii. above the upper specification limit?
iv. below the lower specification limit?
b. The probability is 93.32% that the sample mean diameter will be above what value?
7.27 The fill amount of milk in plastic containers is normally distributed with a mean of 2.0 litres and a standard deviation of 0.05 litres. If you select a random sample of 25 containers:
a. What is the probability that the sample mean will be:
i. between 1.99 and 2.0 litres?
ii. below 1.98 litres?
iii. above 2.01 litres?
b. The probability is 99% that the sample mean will contain at least how much milk?
c. The probability is 99% that the sample mean will contain an amount that is between which two values (symmetrically distributed around the mean)?
7.28 The ABS has reported that in 2015, 26.78% of the 16.8 million employees in Australia worked part-time in their main job (Australian Bureau of Statistics, Characteristics of Employment, Australia, August 2015, Cat. No. 6333.0, 2016). Suppose that you select a random sample of 250 employees from around Australia.
a. What is the probability that more than 26.2% of those sampled work part-time in their main job?
b. What is the probability that the proportion of part-time employees is between 0.27 and 0.29?
c. The probability is 77% that the sample proportion of part-time employees will be above what value?
7.29 A new online advertisement for an Extra Dry beer has been designed for a target audience of Australian males aged 18 to 30. The advertisers hope that 24% of the target audience will find the ad ‘very entertaining’. Suppose that a sample of 400 male television viewers in the target age group is shown the advertisement. What is the probability that the sample will have between:
a. 18% and 22% who find it ‘very entertaining’?
b. 16% and 24% who find it ‘very entertaining’?
c. 14% and 26% who find it ‘very entertaining’?
d. 12% and 28% who find it ‘very entertaining’?
7.30 Assume that, for the first quarter of 2017, the weekly rental costs of three-bedroom dwellings in a coastal town in Western Australia are normally distributed with a mean of $260 and a standard deviation of $30. If you select a random sample of 10 dwellings from this population, what is the probability that the sample will have a mean rental cost:
a. less than $270?
b. between $265 and $275?
c. greater than $282?
7.31 APRA, the Australian Prudential Regulation Authority, monitors the return rates of large superannuation funds in Australia. Its publication Statistics: Quarterly superannuation performance, December 2016 showed an annual rate of return of 6.8% for the year <www.apra.gov.au/Super/Publications/Documents/2016QSP201612.pdf>. Imagine that a researcher with access to the APRA data finds that the average rate of return for the largest superannuation funds in the last year has been 7.5% with a standard deviation of 0.7%, and that rates of return were normally distributed.
a. If the researcher selects an individual fund at random from this population, what is the probability that the fund had a return of:
i. less than 8.2%?
ii. between 6.9% and 7.8%?
iii. greater than 7.9%?
b. If a random sample of 10 funds is selected from this population, what is the probability that the sample mean lies in the ranges given in (a)?
7.32 Assume that the returns for shares on the Chinese share market were distributed as a normal random variable, with a mean of 1.54 and a standard deviation of 10. If you select an individual share from this population, what is the probability that it would have a return:
a. less than 0 (i.e. a loss)?
b. between -10 and -20?
c. greater than -5?
If you selected a random sample of four shares from this population, what is the probability that the sample would have a mean return:
d. less than 0 (a loss)?
e. between -10 and -20?
f. greater than -5?
g. Compare your results in parts (d) to (f) to those in (a) to (c).
7.33 (Class project) The table of random numbers is an example of a uniform distribution because each digit is equally likely to occur. Starting in the row corresponding to the day of the month on which you were born, use the table of random numbers (Table E.1) to take one digit at a time. Select five different samples of n = 2, n = 5 and n = 10. Calculate the sample mean of each sample. Develop a frequency distribution of the sample means for the results of the entire class based on samples of sizes n = 2, n = 5 and n = 10. What can be said about the shape of the sampling distribution for each of these sample sizes?
7.34 (Class project) Toss a coin 10 times and record the number of heads. If each student performs this experiment five times, a frequency distribution of the number of heads can be developed from the results of the entire class. Does this distribution seem to approximate the normal distribution?
7.35 (Class project) The table of random numbers can simulate the selection of different-coloured balls from a bowl as follows:
1. Start in the row corresponding to the day of the month on which you were born.
2. Select one-digit numbers.
3. If a random digit between 0 and 6, inclusive, is selected, consider the ball white; if a random digit is a 7, 8 or 9, consider the ball red.
Select samples of n = 10, n = 25 and n = 50 digits. In each sample, count the number of white balls and calculate the proportion of white balls in the sample. If each student in the class selects five different samples for each sample size, a frequency distribution of the proportion of white balls (for each sample size) can be developed from the results of the entire class. What conclusions can you reach about the sampling distribution of the proportion as the sample size is increased?
7.36 (Class project) Suppose that step 3 of problem 7.35 uses the following rule: ‘If a random digit between 0 and 8, inclusive, is selected, consider the ball to be white; if a random digit of 9 is selected, consider the ball to be red’. Compare and contrast the results in this problem and in problem 7.35.

Continuing cases

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in <REAL_ESTATE>.
a Find the mean price for the sample of 125 properties sold in regional city 1 of state A. What is the probability of finding a sample mean at least this large if the population mean and standard deviation of prices for this city are $300,000 and $100,000 respectively?
b Now find the mean price for the sample of 125 properties sold in the coastal city of state B. What is the probability that the sample mean is less than or equal to this value if the population mean and standard deviation for this city are $595,000 and $287,000 respectively?
c Discuss why your answers to (a) and (b) are not the same as finding comparable probabilities for individual properties sold in each city.
Chapter 7 Excel Guide

EG7.1 SAMPLING DISTRIBUTION OF THE MEAN
Key technique Use an add-in procedure to create a simulated sampling distribution.
Example Create a simulated sampling distribution that consists of 100 samples of n = 30 from a uniformly distributed population.
Analysis ToolPak Use Random Number Generation.
For the example, select Data ➔ Data Analysis. In the Data Analysis dialog box, select Random Number Generation from the Analysis Tools list and then click OK. In the procedure’s dialog box (shown in Figure EG7.1):
1. Enter 100 as the Number of Variables.
2. Enter 30 as the Number of Random Numbers.
3. Select Uniform from the Distribution dropdown list.
4. Keep the Parameters values as they are.
5. Click New Worksheet Ply and then click OK.
Figure EG7.1 shows the entries for generating 100 samples of n = 30 from a uniformly distributed population.

[Figure EG7.1 Data Analysis Random Number Generation dialog box]

If you are using PHStat with either Excel for Mac 2016 or Excel 2016, see Appendix D.1 (Sampling Distribution of the Mean) to produce an enhanced version of this worksheet.

EG7.2 CENTRAL LIMIT THEOREM
By using the above method to generate 50 samples of size 3, then size 10 and size 40 from a uniform distribution, you should be able to observe how the Central Limit Theorem works. In PHStat, simply click Histogram to see the shape of the sampling distribution. If you are using Excel’s Random Number Generator, a bit more work is required. For each set of samples, use the =AVERAGE function to calculate the mean of the first sample, then drag or copy this to find the means of the remaining 49 samples. Next, create frequency distributions of the sample means using the methods described in the Chapter 2 Excel Guide. Last, compare the three frequency tables.
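For readers working outside Excel, the same experiment can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the Excel or PHStat procedure: the function name simulate_sample_means, the fixed seed and the choice of a uniform population on the interval from 0 to 1 are all assumptions made for the example.

```python
import random
import statistics


def simulate_sample_means(sample_size, num_samples=50, seed=1):
    """Return the means of num_samples samples drawn from a uniform population.

    Hypothetical helper for illustration. The population is assumed to be
    uniform between 0 and 1, and the seed is an arbitrary value used only
    so that repeated runs give the same samples.
    """
    rng = random.Random(seed)
    return [
        statistics.mean(rng.random() for _ in range(sample_size))
        for _ in range(num_samples)
    ]


# 50 samples each of size 3, 10 and 40, as in EG7.2
for n in (3, 10, 40):
    means = simulate_sample_means(n)
    print(
        f"n = {n:2d}: mean of sample means = {statistics.mean(means):.3f}, "
        f"standard deviation of sample means = {statistics.stdev(means):.3f}"
    )
```

Each list of 50 sample means plays the role of one column of =AVERAGE results, so grouping the values in each list into classes gives the three frequency tables to be compared.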
You should see that they resemble a normal distribution more closely as the sample size increases.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

End of Part 2 problems

B.1 A soft-drink bottling company maintains records of the number of unacceptable bottles of soft drink coming from the filling and capping machines. Based on past data, the probability that a bottle came from machine I and was unacceptable is 0.01, and the probability that a bottle came from machine II and was unacceptable is 0.025. Half the bottles are filled on machine I and the other half are filled on machine II. If a filled bottle of soft drink is selected at random:
a. What is the probability that it is unacceptable?
b. What is the probability that it was filled on machine I or is acceptable?
c. Suppose you know that the bottle was filled on machine I. What is the probability that it is unacceptable?
d. Suppose you know that the bottle is unacceptable. What is the probability that it was filled on machine I?
e. Explain the difference in the answers to (c) and (d). (Hint: Construct a 2 × 2 contingency table or a Venn diagram to evaluate the probabilities.)

B.2 The fill amount of soft-drink bottles is normally distributed with a mean of 2.0 litres (the listed content) and a standard deviation of 0.05 litre. Bottles that contain less than 95% of the listed net content (1.90 litres, in this case) make the manufacturer subject to penalties. Bottles that have a net content above 2.10 litres may cause excess spillage upon opening.
a. What proportion of the bottles will contain: i. between 1.90 and 2.0 litres? ii. between 1.90 and 2.10 litres? iii. less than 1.90 litres or more than 2.10 litres?
b. 99% of the bottles contain at least how much soft drink?
c. 99% of the bottles contain an amount that is between which two values (symmetrically distributed) around the mean?

B.3 In an effort to reduce the number of bottles that contain less than 1.90 litres, the bottler in problem B.2 sets the filling machine so that the mean is 2.02 litres. Under these circumstances, what are your answers to (a) to (c)?

B.4
a. If a coin is tossed seven times, how many different outcomes are possible?
b. If a die is rolled seven times, how many different outcomes are possible?
c. Discuss the differences in your answers to (a) and (b).

B.5 The time between arrivals of cars at Sheng’s carwash is exponential with an average of 6 minutes between arrivals. What is the probability that the time between successive arrivals will be:
a. less than 2 minutes?
b. more than 10 minutes?
c. between 4 and 6 minutes?

B.6 The following data represent the electricity cost in dollars during the month of July for a random sample of 50 two-bedroom apartments in a New Zealand city: < ELECTRICITY >

 96  171  202  178  147  102  153  197  127   82
157  185   90  116  172  111  148  213  130  165
141  149  206  175  123  128  144  168  109  167
 95  163  150  154  130  143  187  166  139  149
108  119  183  151  114  135  191  137  129  158

a. Decide whether the electricity cost for July is approximately normal by: i. evaluating the actual versus theoretical properties ii. constructing a normal probability plot
From part (a), assume that electricity cost for July is normally distributed with a mean of $147 and standard deviation of $31.70.
b. A two-bedroom apartment is selected at random. What is the probability that electricity cost for July is: i. less than $120? ii. between $100 and $160? iii. more than $225?
c. For 10% of two-bedroom apartments, the electricity cost for July is above what amount?
d. The cost of electricity for the middle 95% of two-bedroom apartments is between which two amounts?

B.7 An electrical retail store has found that 55% of its customers use a credit card to pay for their purchases.
a.
If 15 customers who make a purchase are randomly selected, what is the probability that: i. none use a credit card? ii. exactly five use a credit card? iii. more than two use a credit card?
b. What are the mean and the standard deviation of the probability distribution?

B.8 It has been observed that 92% of train commuters travelling during the 8.00 am to 9.00 am period use a mobile phone during their trip for various activities.
a. In a train carriage with 42 passengers during this period, what is the probability that fewer than 38 passengers use their mobile phone during their commute?
b. If the carriage has 50 passengers, what is the probability that between 43 and 47 passengers use their mobile phone?

B.9 From a consignment of 64 large garden pots in individual crates being shipped from Vietnam to a local importer, 16 have imperfections such as cracks or are broken.
a. If eight crates are shipped to a particular garden nursery, what is the probability that: i. all eight will have defective pots? ii. none will have a defective pot? iii. at least one will have a defective pot?
b. What would be your answers to (a) if eight crates have defective pots?

B.10 East Park Realty, a small real estate company located in country areas of South Australia, specialises primarily in residential listings. It is interested in determining the probability of one of its listings being sold within a certain number of days. An analysis of company sales of 800 houses in the previous year produces the following data.

                      Days listed until sold
Initial asking price  30 and under  31–90  Over 90  Total
Under $200,000                  50     40       10    100
$200,000–$299,999               40    140       70    250
$300,000–$399,999               30    270      100    400
$400,000 or more                10     30       10     50
Total                          130    480      190    800

a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of ‘asking price under $200,000’?
d. Why is ‘asking price under $200,000 and being listed more than 90 days until sold’ a joint event?
e. Given that a house had an asking price of less than $200,000, what is the probability that it took more than 90 days to sell?
f. Given that a house took more than 90 days to sell, what is the probability that its asking price was less than $200,000?
g. Explain the difference in the results in (e) and (f).
h. Are the two events – asking price less than $200,000, and taking more than 90 days to sell – statistically independent?
i. If a house is selected at random, what is the probability that: i. it is listed more than 90 days before being sold? ii. its initial asking price is at least $400,000? iii. its initial asking price is at least $400,000 and it is listed more than 90 days before being sold? iv. its initial asking price is more than $400,000 or it is listed more than 90 days before being sold?
j. Explain the difference in the results in parts (i) to (iv) above.

B.11 You are trying to develop a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

             Returns
Probability  Share X  Share Y
0.1             -$50    -$100
0.3               20       50
0.4              100      130
0.2              150      200

a. Calculate the: i. expected return for share X and for share Y ii. standard deviation for share X and for share Y iii. covariance of share X and share Y
b. Would you invest in share X or share Y? Explain.

B.12 Suppose that in problem B.11 you wanted to create a portfolio that consists of share X and share Y.
a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in share X: i. 30% ii. 50% iii. 70%
b. On the basis of the results in (a), which portfolio would you recommend? Explain.

B.13 At an ocean-side nuclear power plant, seawater is used as part of the cooling system.
This system raises the temperature of the water that is discharged back into the ocean. The amount that the water temperature is raised has a uniform distribution over the interval from 10°C to 25°C.
a. What is the probability that the temperature will increase less than 20°C?
b. What is the probability that the temperature will increase between 20°C and 22°C?
c. A temperature increase of more than 18°C is considered potentially dangerous to the environment. What is the probability that, at any point of time, the temperature increase is potentially dangerous?
d. What is the mean and standard deviation of the temperature increase?

B.14 A survey of 1,500 students at a large university gave the following data on their study mode (full- or part-time) as well as their employment status.

Employment status    Studying full-time  Studying part-time  All students
Employed full-time                   94                 558           652
Employed part-time                  292                 190           482
Not employed                        278                  88           366
All students                        664                 836         1,500

a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of ‘employed full-time’?
d. Why is ‘employed full-time and studying full-time’ a joint event?
e. If a student is selected at random, what is the probability that: i. they are employed? ii. they are studying part-time and are employed? iii. they are studying part-time or are employed?
f. Explain the difference between the results in part (e) above.

B.15 Telephone calls arrive at the information desk of a large computer software company at the rate of 15 per hour.
a. What is the probability that the next call will arrive within 3 minutes (0.05 hour)?
b. What is the probability that the next call will arrive within 15 minutes (0.25 hour)?
c. Suppose the company has just introduced an updated version of one of its software programs, and telephone calls are now arriving at the rate of 25 per hour. Given this information, redo (a) and (b).
B.16 On a tourism Twitter site, where photos of scenic views and native animals are regularly shared, the long-term average number of ‘Likes’ obtained per photo posted is 600.5, with a standard deviation of 76. A sample of 52 photos is selected at random.
a. What is the probability that the average number of ‘Likes’ for the sample is at least 630?
b. What is the probability that the average number of ‘Likes’ is less than 575?

B.17 In each game of OZ Lotto seven numbers are selected from 1 to 45. To win the first-division prize, the seven winning numbers must have been selected. On any game, what is the probability of winning the first division?

B.18 To test the effectiveness of mail X-ray screening in identifying potential illegal or threatening items, a mail centre X-rays a random sample of 500 packages and then independently searches each package. The results of this test are given below.

                          Search: items found
X-ray: items identified   Yes    No   Total
Yes                        36    12      48
No                         14   438     452
Total                      50   450     500

a. What percentage of items does the X-ray identify as potentially illegal or threatening?
b. What proportion of items identified by X-ray as potentially illegal or threatening are found to be such when searched?
c. An item is found during the search to be illegal or threatening. What is the probability that the X-ray identified it as potentially illegal or threatening?
d. What percentage of items are not found to be illegal or threatening during the search and not identified as illegal or threatening by X-ray?

B.19 Of the packages searched at the mail centre in problem B.18, 9.6% are found to contain illegal or threatening items. Suppose 10 packages are independently and randomly selected to be searched.
a. What is the probability that: i. exactly two contain illegal or threatening items? ii. none contain illegal or threatening items? iii. at least one contains illegal or threatening items? iv. more than half contain illegal or threatening items?
b. What is the expected number and standard deviation of the number of packages with illegal or threatening items?

B.20 The table below classifies the academic staff of a small regional university by gender and level of appointment.

                     Gender
Level                Female  Male  Total
Professor                13    21     34
Associate professor      16    24     40
Senior lecturer          37    52     89
Lecturer                 74    58    132
Associate lecturer       23    13     36
Total                   163   168    331

a. Calculate the following probabilities: i. A randomly selected academic staff member is female. ii. A randomly selected male academic staff member is a senior lecturer or above. iii. A randomly selected academic staff member is a female associate lecturer. iv. A randomly selected professor is female. v. A randomly selected academic staff member is an associate professor.
b. Are level of appointment and gender statistically independent? Explain.

B.21 Suppose the executive of the university in problem B.20 randomly select five senior (senior lecturer and above) academic staff members for a committee. Calculate the following probabilities:
a. The selected members of the committee are all male senior lecturers.
b. There are no professors on the committee.
c. At least half the committee is female.
d. There is exactly one professor on the committee.
e. There are three associate professors on the committee.

B.22 An on-the-job injury occurs once every 10 days on average at a car manufacturer. What is the probability that the next on-the-job injury will occur within:
a. 10 days?
b. 5 days?
c. 1 day?

B.23 In a recent opinion poll a sample of 1,200 adults (at least 20 years old) was surveyed. Of these adults, 768 were married, 684 were female and there were 459 married females. Construct a contingency table or a Venn diagram and evaluate the probability that a surveyed adult selected at random:
a. is male
b. is single
c. is a married male
d. is a single female

B.24 The following table contains the probability distribution for the number of traffic accidents per day in a small city.

Number of accidents daily (X)  P(X)
0                              0.10
1                              0.20
2                              0.45
3                              0.15
4                              0.05
5                              0.05

a. Calculate the mean or expected number of accidents per day.
b. Calculate the standard deviation.

B.25 On average 108 customers per hour join a queue at any one of the checkout counters of a grocery store. Suppose that the number of customers joining a queue at the checkout counters follows an approximate Poisson distribution.
a. What is the probability that in the next minute: i. exactly four customers join a queue? ii. at least one customer joins a queue?
b. What is the probability that in the next 5 minutes: i. exactly 10 customers join a queue? ii. at least 10 customers join a queue?

B.26 A computer Help desk has two technicians, A with advanced training, who is able to solve 95% of problems, and B with less training, who is only able to solve 85% of problems. Each technician randomly receives 50% of problems.
a. What percentage of solved problems are solved by technician A?
b. What percentage of problems are solved?

B.27 A particular weekly Bingo session consists of 20 games. In each game, there are two points where a player can win (a line and a house). Assume on a typical week that there are 100 players, each player is equally likely to win and winning is independent.
a. Calculate the probability that a player has a win (line and/or house) on a game. Ignore the possibility of multiple winners at any stage of a game.
b. Calculate the probability that a player wins at least once during the evening.
On a typical week Biff went to Bingo with four friends, each of whom won at least once but she did not.
c. Calculate the probability that in a group of five players exactly four will win at least once.
d. Calculate the probability that Biff does not have a win but her four friends do.
The Bingo session costs a player $8, with each line won paying $10 and each house $20.
e. Construct the probability distribution for the amount a player wins in a game.
f. What is the expected amount a player wins in a game?
g. What is the variance and standard deviation of the amount a player wins in a game?
h. What is a player’s expected profit (or loss) from the Bingo session?

B.28 Based on past experience, 40% of all customers at Miller’s Service Station pay for their purchases with a credit card. If a random sample of 200 customers is selected, what is the approximate probability that:
a. at least 75 pay with a credit card?
b. not more than 70 pay with a credit card?
c. between 70 and 75 customers, inclusive, pay with a credit card?

B.29 At the local golf course golfers lose golf balls at a rate of 3.8 per 18-hole round. Assume that the number of golf balls lost in an 18-hole round is distributed as a Poisson random variable.
a. What assumptions need to be made so that the number of golf balls lost in an 18-hole round is distributed as a Poisson random variable?
b. Given the assumptions made in (a), what is the probability that in an 18-hole round: i. at least one ball will be lost? ii. less than three balls will be lost? iii. more than five balls will be lost?

B.30 The Tasmanian Visitor Survey presents data in an analyser database on a number of aspects of tourism, including attractions visited by tourists aged 14 or over. The most visited attractions by 1,283,618 tourists in the October 2016 to September 2017 period were the Saturday Salamanca Market (443,600/34.6%), MONA – the Museum of Old and New Art (352,222/27.4%) and Mt Wellington (328,752/25.6%) (data obtained from <www.tvsanalyser.com.au>).
a.
If a survey of 300 people aged 14 or over who toured Tasmania during the period in question is taken, what is the probability that at least 30% visited MONA?
b. What is the probability in this survey that between 31% and 36% of tourists visited the Saturday Salamanca Market?
c. What is the probability in this survey that fewer than 23% of tourists visited Mt Wellington?

B.31 A box of nine golf gloves contains two left-handed gloves and seven right-handed gloves.
a. If two gloves are randomly selected from the box without replacement, what is the probability that both gloves will be right-handed?
b. If two gloves are randomly selected from the box without replacement, what is the probability that one right-handed glove and one left-handed glove will be selected?
c. If three gloves are selected with replacement, what is the probability that all three will be left-handed?
d. If you were sampling with replacement, what would be the answers to (a) and (b)?

B.32 Based on past experience, the owner of a stall at the local annual show states that 60% of visitors to the stall will purchase a showbag. On a certain day, the stall has 100 visitors.
a. Is the 60% figure best classified as a priori classical probability, empirical classical probability or subjective probability?
b. Find the expected number and standard deviation of sales, assuming that number of sales is binomial.
c. If the showbags cost $12 each, find the expected revenue from the sales.
d. What assumptions are necessary in (b)?

B.33 The cost of a phone call passed on to a ‘live’ operator is approximately 10 times that of a call answered by an automated customer-service system. However, as more and more companies have implemented automated systems, customer annoyance with these systems has grown. Many customers are quick to leave the automated system when given an option such as ‘Press zero to talk to a customer-service representative’.
Research has shown that approximately 40% of all callers to automated customer-service systems will automatically opt to go to a live operator when given the chance.
a. If 10 independent callers contact an automated customer-service system, what is the probability that: i. none of the callers will automatically opt to talk to a live operator? ii. exactly one will automatically opt to talk to a live operator? iii. two or fewer will automatically opt to talk to a live operator? iv. all 10 will automatically opt to talk to a live operator?
b. If all 10 automatically opt to talk to a live operator, do you think that the 40% figure applies to this particular system? Explain.

B.34 One theory concerning the Standard & Poor’s (S&P) 500 Index of US stocks is that if it increases during the first five trading days of the year, it is likely to increase during the entire year. From 1929 to 2016, early gains during the first five days predicted full-year gains approximately 69.5% (41 out of 59) of the time. Assuming that this indicator is a random event with no predictive value, you would expect that the indicator would be correct 50% of the time.
a. What is the probability of the S&P 500 Index increasing in 41 or more of 59 years with an early gain if the true probability of an increase in the S&P 500 Index is: i. 0.50? ii. 0.70? iii. 0.90?
b. Based on the results in (a), what do you think is the probability that the S&P 500 Index will increase if there is an early gain in the first five trading days of the year? Explain.

B.35 A research institute has interviewed a total of 1,764 employers. Fifty-three per cent of the 264 employers from the telecommunications industry expected to have a net increase in employment in their company during the next quarter. Only 43% of employers interviewed from other industries expected a net increase during the same period.
a. If an employer from this survey pool is selected at random and expects that there will be a net increase in employment in his company during the next quarter, what is the probability that his company is in the telecommunications industry?
b. What is the chance that an employer, selected at random, is neither from the telecommunications industry nor expects an increase?

B.36 A quinella consists of picking the horses that will place first and second in a race irrespective of order. Suppose eight horses are entered in a race.
a. How many quinella combinations are there for this race?
b. If you choose two horses randomly, what is the probability that you win the quinella?

B.37 Suppose that a quality control department has established that 0.1% of items produced are defective.
a. If 25 items are randomly selected, find the probability that: i. exactly two items are defective ii. at most one item is defective iii. at least two items are defective
b. What is the expected number and standard deviation of defective items?

B.38 Assume that the number of network errors experienced in a day on a local area network (LAN) is distributed as a Poisson random variable. The mean number of network errors experienced in a day is 2.4. What is the probability that, in any given day:
a. zero network errors will occur?
b. exactly one network error will occur?
c. two or more network errors will occur?
d. fewer than three network errors will occur?

B.39 Greenway Gardens currently has six plots available to plant tomatoes, eggplant, capsicum, cucumbers, beans and lettuce. Each vegetable will be planted in one and only one plot. How many ways are there to position these vegetables in the gardens?

B.40 Olive Construction Company is determining whether it should submit a bid for a new shopping centre. In the past, Olive’s main competitor, Base Construction Company, has submitted bids 70% of the time. If Base Construction does not bid on a job, the probability that Olive Construction will get the job is 0.50. If Base Construction bids on a job, the probability that Olive Construction will get the job is 0.25.
a. If Olive Construction gets the job, what is the probability that Base Construction did not bid?
b. What is the probability that Olive Construction will get the job?

B.41 An airline maintains statistics for mishandled bags per 1,000 passengers. Suppose that last year this airline had 7.03 mishandled bags per 1,000 passengers. What is the probability that the next 1,000 passengers on this airline will have:
a. no mishandled bags?
b. at least one mishandled bag?
c. at least two mishandled bags?

B.42 A small factory processes and bottles fruit juice. Two types of defect can occur – an incorrect fill amount (over or under the stated amount on the label) and an incorrect seal. From production data it is known that 0.5% of two-litre bottles filled have an incorrect fill amount and 0.1% are incorrectly sealed, with 0.002% having both defects – an incorrect fill amount and an incorrect seal.
a. What proportion of two-litre bottles produced have at least one type of defect?
b. What proportion of two-litre bottles produced have no defects?
c. A two-litre bottle has an incorrect fill amount. What is the probability that it also is incorrectly sealed?
d. Twenty filled two-litre bottles are randomly chosen. Determine the probability that: i. only one bottle has an incorrect fill amount ii. at least one bottle has an incorrect fill amount iii. at most, two bottles have an incorrect fill amount
e. In a random sample of 100 filled two-litre bottles, find the expected number of bottles which are incorrectly sealed.

B.43 The amount of time a bank teller spends with each customer has a population mean μ = 3.10 minutes and standard deviation σ = 0.40 minute.
a.
If you select a random sample of 16 customers: i. what is the probability that the mean time spent per customer is at least 3 minutes? ii. there is an 85% chance that the sample mean is below how many minutes?
b. What assumption must you make in order to solve both parts of (a)?
c. If you select a random sample of 64 customers, there is an 85% chance that the sample mean is below how many minutes?

B.44 A manager of a seafood restaurant is interested in both the time it takes a customer to be seated (the waiting time) and the length of time between a customer being seated and leaving the restaurant (the service time). Over a month, a random sample of 100 customers (only one per party/table) was selected and waiting and serving times, in minutes, are recorded in the file < RESTAURANT_TIMES >.
a. Construct a histogram for waiting times. Are waiting times approximately normal, exponential or uniform? Is this what you expected?
b. Construct a histogram of serving times. Are serving times approximately normal, exponential or uniform? Is this what you expected?
c. Calculate the mean and standard deviation of waiting and serving times.
d. Use the results of (a) and (c) to calculate the approximate probability that a customer will wait less than 5 minutes to be seated.
e. Use the results of (a) and (c) to calculate the approximate probability that a customer will wait more than 10 minutes to be seated.
f. Use the results of (b) and (c) to calculate the approximate probability that the serving time for a customer will be less than 1 hour.
g. Use the results of (b) and (c) to calculate the approximate probability that the serving time for a customer will be more than 90 minutes.
B.45 Data from the Bureau of Infrastructure, Transport and Regional Economics (BITRE) (<https://bitre.gov.au>) shows that in Australia during 2015, the number of motorcyclist deaths was 6.47 per 100 million vehicle kilometres travelled (VKT), while for car occupants it was 0.35 per 100 million VKT. A local council estimates that within the council boundaries there are annually 300 million VKT for cars and 5 million VKT for motorcycles. Assume that the fatality rates have not changed and that the Poisson distribution can be used to model the number of deaths.
a. For motorcyclists in the local council area, calculate the following probabilities that in the next 12 months: i. there are no deaths ii. there is at least one death iii. there is exactly one death iv. there are no more than two deaths
b. For car occupants in the local council area, calculate the following probabilities that in the next 12 months: i. there are no deaths ii. there is at least one death iii. there is exactly one death iv. there are no more than two deaths

B.46 In 2015, 16.4% of Australians aged 45 to 54 years reported a disability compared to 8.2% aged 15 to 24 years (data obtained from Australian Bureau of Statistics, Disability, Ageing and Carers, Australia: Summary of Findings, 2015, Cat. No. 4430.0 <www.abs.gov.au>). Suppose 15 Australians in each age group are randomly selected.
a. For each age group of those selected, calculate the probability that: i. none reports a disability ii. at least one reports a disability iii. exactly five report a disability iv. a majority report a disability
b. Repeat (i) to (iv) for the 90 years and over age group, of whom 85.4% report a disability.

B.47 A telemarketing firm phones households at random. Data show that 80% of such calls are answered.
a. If 100 households are called each evening, approximate the probability that: i. more than 50% of the calls are answered ii. between 70 and 90 (inclusive) calls are answered iii.
fewer than 75 calls are answered
b. Use Excel to calculate the exact probabilities for part (a).

B.48 Check$mart encourages its customers to use Internet banking. Therefore the bank is concerned with the download time (the number of seconds that passes from first linking to the website until the home page is fully displayed) of its home page. Both the design of a home page and the load on the bank’s web server affect the download time. Past data indicate that download times are approximately normal with a mean of 0.9 seconds and a standard deviation of 0.3 seconds. What is the probability that a download time is:
a. less than 1 second?
b. more than 0.5 seconds?
c. between 0.5 and 1.5 seconds?
d. more than 2 seconds?
e. less than 0.6 seconds?
f. between 1.0 seconds and 1.5 seconds?

B.49 Past records show that on average there are four unplanned outages a year to Check$mart’s Internet banking system and that these unplanned outages occur randomly and are independent of each other. An unplanned outage has just occurred.
a. What is the probability that there will: i. not be an unplanned outage in the next month? ii. not be an unplanned outage in the next three months? iii. be at least one unplanned outage in the next six months?
b. What is the mean time between unplanned outages?
c. What is the probability that there will: i. be exactly three unplanned outages in the next year? ii. be more than six unplanned outages in the next six months? iii. be fewer than two unplanned outages in the next month?
B.50 In the Household Expenditure Statistics: Year Ended 30 June 2016 (Statistics New Zealand, <www.stats.govt.nz>, licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence), 64% of the New Zealand households reported that their income was enough or more than enough to meet their everyday needs. However, of the 20% of households with an annual income of less than $35,700, 48% reported that their income was enough or more than enough for their everyday needs, while of the 20% of households with an annual income of at least $136,600, 87% reported that their income was enough or more than enough to meet their everyday needs.
a. What proportion of households reporting that their income is not enough for their everyday needs have an annual income of at least $136,600?
b. What proportion of households who report their income is enough for their everyday needs have incomes of less than $35,700?
c. What is the probability that a household has an annual income of less than $35,700 and reports that this is enough for their everyday needs?
d. What proportion of households with an annual income of at least $35,700 report that their income is enough?
e. What proportion of households with an annual income of less than $136,600 report that their income is not enough?
f. What is the probability that a household has an annual income of at least $136,600 and reports that this is not enough for their everyday needs?

B.51 In problem 6.12 on page 229, it was assumed that the number of All Ordinaries shares traded daily on the Australian Securities Exchange (ASX) is a normal random variable.
a. To test this assumption use the All Ordinaries daily volume of trade for the 2016–17 financial year < ALL_ORDS_2016_17 > to: i.
construct a stem-and-leaf display, histogram, polygon and/or box-and-whisker plot ii. evaluate the actual versus theoretical properties iii. construct a normal probability plot
b. Discuss the results in (a). Is the number of All Ordinaries shares traded daily approximately normal?

B.52 According to Burton G. Malkiel, the daily changes in the closing price of shares follow a random walk – that is, these daily events are independent of each other and move upwards or downwards in a random manner – and can be approximated by a normal distribution. To test this theory, use either a newspaper or the Internet to select three companies traded on the ASX or other stock exchange, and then do the following:
1. Obtain the daily closing share price of each company for six consecutive weeks (so that you have 30 values per company).
2. Obtain the daily changes in the closing share price of each company for six consecutive weeks (so that you have 30 values per company).
a. For each of your six data sets, decide whether the data are approximately normally distributed by: i. examining the stem-and-leaf display, histogram or polygon and the box-and-whisker plot ii. evaluating the actual versus theoretical properties iii. constructing a normal probability plot
b. Discuss the results in (a). What can you now say about your three shares with respect to daily closing prices and daily changes in closing prices?
c. Which, if any, of the data sets are approximately normally distributed?
Note: The random-walk theory pertains to the daily changes in the closing share price, not the daily closing share price.

B.53 A motoring organisation has conducted a survey of owners of new cars manufactured in 2017. It has listed the average number of problems per car as 1.27 for brand H. Let the random variable X be equal to the number of problems with a newly purchased brand H.
a. What assumptions must be made in order for X to be distributed as a Poisson random variable?
Are these assumptions reasonable?
b. Making the assumptions as in (a), if you purchased a 2017 brand H, what is the probability that the new car will have: i. zero problems? ii. two or fewer problems?
c. Give an operational definition for ‘problem’. Why is the operational definition important in interpreting the results of the survey?
B.54 Assume that in 2018 the manufacturers of brand H improve their performance, with owners of 2018 brand H reporting 1.04 problems per car.
a. If you purchased a 2018 brand H, what is the probability that the new car will have: i. zero problems? ii. two or fewer problems?
b. Compare your answers in part (a) with those for 2017 brand H in problem B.53 part (b).
B.55 Jay has had three incidents in the past 10 years where an insurance excess needed to be paid. These were a collision with a kangaroo, hail damage and a collision from behind while stationary. In this last instance the excess was refunded as the other driver was at fault. Furthermore, Jay estimates that he drives 300 days a year. Jay recently booked a rental car online for a 27-day holiday in New Zealand. During the booking process he was offered a policy at a price of $18.40 per day, to reduce the insurance excess of $2,000 to $0. However, Jay chose not to accept this offer.
a. Estimate the probability, per day of driving, that Jay will have to pay an insurance excess, even if it is refunded later because the other driver is at fault.
b. Assume that the number of days during the holiday that require insurance excess to be paid can be modelled by the binomial distribution.
i. Calculate the probability that for the 27-day holiday there are no days requiring insurance excess to be paid.
ii. Calculate the probability that for the 27-day holiday there is exactly one day requiring insurance excess to be paid.
iii.
Calculate the probability that for the 27-day holiday there are exactly two days requiring insurance excess to be paid.
iv. Calculate the probability that for the 27-day holiday there are at least three days requiring insurance excess to be paid.
v. Calculate the expected payout on this policy.
c. Assume that the number of instances during the holiday in which insurance excess is required to be paid can be modelled by the Poisson distribution.
i. Calculate the probability that for the 27-day holiday there are no instances in which insurance excess is required to be paid.
ii. Calculate the probability that for the 27-day holiday there is exactly one instance in which insurance excess is required to be paid.
iii. Calculate the probability that for the 27-day holiday there are exactly two instances in which insurance excess is required to be paid.
iv. Calculate the probability that for the 27-day holiday there are at least three instances in which insurance excess is required to be paid.
v. Calculate the expected payout on this policy.
d. Did Jay make the correct decision?
e. Calculate the probability per day of driving that Jay will have to pay an insurance excess for the policy to break even.
B.56 Sam and Jo recently lost their house to fire. Although they were insured, the insurance company has offered them 30% less than the rebuild amount for which they were insured. The amount for which they were insured was the amount specified by the insurance company and is consistent with the rebuild amount given by the insurance company’s online calculator. Therefore, Sam and Jo are not accepting the insurance company’s statement that they were over-insured. Do Sam and Jo have a case to ask for a higher amount to rebuild their house?
a. The online calculator states that ‘in approximately 80% of cases the building estimate delivers an accuracy of +/– 10%’.
Assuming that the difference between the estimated rebuild cost given by the calculator and the actual rebuild cost is normal with a mean of zero, estimate the standard deviation. b. Using the results of part (a), calculate the probability of an actual rebuild cost of at most 30% less than the estimated rebuild cost given by the online calculator. c. Comment on the insurance company’s claim that Sam and Jo were over-insured. Do you consider that they are justified in asking for a higher rebuild amount? B.57 Australia is known as a nation of sports lovers but cultural events and venues are not all well supported. A survey by the Australian Bureau of Statistics found that in 2013–14 the attendance rates for Australians aged 15 years and over at the following selected cultural events and venues were as follows: cinemas 66.3%, zoological parks and aquariums 33.9%, botanic gardens 37.2% and libraries 34.0%. It also found that only 14.8% of Australians had attended an opera or musical in the previous 12 months (Australian Bureau of Statistics, Attendance at Selected Cultural Events and Venues, Australia, 2013–14, Cat. No. 4114.0). a. If the percentages reported by the ABS are used in decimal form as probabilities, are they best classified as a priori classical probabilities, empirical classical probabilities or subjective probabilities? b. Suppose that 10 Australians aged 15 years and over are randomly sampled. Consider the random variable defined by the number of people that have attended a musical or opera in the past year. What assumptions must be made so that this random variable is distributed as a binomial random variable? c. Assuming that the number of people who have attended a musical or opera in the past year is a binomial random variable, what are the mean and standard deviation of the distribution in (b)? B.58 Refer to problem B.57. 
Calculate the probability that, of the 10 people sampled, the number who have attended a musical or opera in the past year is:
a. exactly none
b. all 10
c. more than half
d. eight or more
B.59 Refer to problem B.57.
a. For cinemas, using the given probability of attendance of 0.663, calculate the probability that, of the 10 people sampled, the number who have attended a cinema in the past year is: i. exactly none ii. all 10 iii. more than half iv. eight or more
b. Compare the results in (a) with those of problem B.58 (a) to (d).
B.60 The manager of a seafood restaurant was interested in studying ordering patterns of patrons for the Friday-to-Sunday weekend time period. Records were maintained that indicated the demand for dessert during the same period. The manager decided to study two other variables together with whether a dessert was ordered: the gender of the individual and whether a shellfish entrée was ordered. The results are as follows:

                         Gender
Dessert ordered     Male    Female    Total
Yes                   82        32      114
No                   278       208      486
Total                360       240      600

                     Shellfish entrée
Dessert ordered      Yes        No    Total
Yes                   52        62      114
No                   106       380      486
Total                158       442      600

a. A waiter approaches a table to take an order. What is the probability that the first customer to order at the table:
i. orders a dessert?
ii. orders a dessert or a shellfish entrée?
iii. is a female and does not order a dessert?
iv. is a female or does not order a dessert?
b. Suppose the first person that the waiter asks for a dessert order is a female. What is the probability that she does not order dessert?
c. Are gender and ordering dessert statistically independent?
d. Is ordering a shellfish entrée statistically independent of whether the person orders dessert?
B.61 The council for a regional city constructed a levee to protect the central business district and surrounding suburbs from flooding in up to a 1-in-10-year flood. This levee was finished 12 years ago, and has just been breached for the first time, having held during three previous floods. Assume that the number of floods that breach the levee can be modelled by a Poisson distribution.
a. What is the probability that the levee is:
i. not breached in 12 years?
ii. breached in the next 5 years?
iii. not breached in the next 20 years?
iv. breached again within 2 years?
v. not breached within 10 years?
b. Suppose the council decides to increase the height of the levee, so that the new levee will protect the central business district and surrounding suburbs from flooding in up to a 1-in-20-year flood. What is the probability that the new levee is:
i. not breached in 12 years?
ii. breached within 5 years of completion?
iii. not breached in 20 years?
iv. breached within 2 years of completion?
v. not breached within 10 years of completion?

PART 3 Drawing conclusions about populations based only on sample information

Real People, Real Stats
Rod Battye
TOURISM RESEARCH AUSTRALIA

Which company are you currently working for and what are some of your responsibilities?
Tourism Research Australia (TRA), which is currently a business unit within the Australian Trade Commission (AUSTRADE). My main responsibilities are to manage:
• the International (IVS) and National (NVS) Visitor Surveys
• the service-level agreement with funding partners
• TRA’s interactive websites
• TRA software and databases
• data requests and statistics in general
• staff and individual and team development.
List five words that best describe your personality.
Patient, precise, conscientious, creative, relaxed.
What are some things that motivate you?
Working in a team, building relationships, creating new ways to communicate messages, doing new things, getting it right and being relevant. Promoting a happy work environment.
When did you first become interested in statistics?
Started when I was writing programs to extract data in an area that handled statistics. I was always good at maths and it came naturally from there. There are many disciplines in statistics that apply to programming as well.

A QUICK Q&A
Complete the following sentence. A world without statistics …
… is uninformed and lacking the information required for good planning and decision making.

LET’S TALK STATS
What do you enjoy most about working in statistics?
Not all statistics are enjoyable to work with; the subject matter is very important. I work with tourism-related information, which covers both domestic and international topics. I am especially keen on the international topics. I enjoy working with people across the many facets of survey work I do, from collection in the field to publication and reporting. Tourism has a lot of positive and forward-looking values; it cuts across nations, genders, age, technology and much more (variety), which makes it interesting.
Describe your first statistics-related job or work experience. Was this a positive or a negative experience?
As mentioned earlier, I was always good at maths and statistics and got involved when I was writing software programs to extract data on migration and other topics. This experience increased my interest in information, wanting to know more, put the pieces together and tell a story.
What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain.
The biggest misconceptions are that it’s easy and a lot of people don’t realise there is a need to do a bit of an apprenticeship.
There are many different streams of statistics and a vast array of survey methodologies. It takes some time to gain the knowledge and experience to be competent at your job.
Do you need to be good at maths to understand and use statistics successfully?
Overall the answer is yes! I’ve seen some horror stories when people have ended in the wrong role and they are not numbers savvy. Having said that, the direction we are moving in with the way we report statistics in a simpler way, using more visual and interactive formats/technologies, is removing some of the mystery.
Is there a high demand for statisticians in your industry (or in other industries)? Please explain.
There is a demand particularly for younger people. There seems to have been a drop off in younger people coming through. There is a tendency for people to focus on policy or marketing and other avenues, as the trip to the top is considered to happen more quickly. What we really need more of is statistics and research in the one package. What I mean by that is someone who can reveal the numbers, interpret and write the story/convey the message.
What are some variables for which data have been collected in your field?
There are so many to list in terms of international visitors to Australia: where they went; what activities they undertook; their satisfaction levels with cost, food, language services, accommodation etc.; their likelihood to recommend Australia as a holiday destination; expenditure; places and attractions; tours; demographics; where they come from; and why they are here, just to name a few.
Why is sampling an important part of your work? What are some common sampling techniques that you employed in the past?
Sampling techniques are vital to what I do as they are a cost-effective means of obtaining good results for a fraction of the cost of conducting a census.
The surveys I work on are ongoing measurement surveys vital to government and the broader tourism industry. We mostly use stratified random sampling techniques. We have excellent data that we use for our sampling and benchmarking processes. Our domestic survey uses computer-assisted telephone interviewing (CATI) via random digit dialing of household phone numbers, using the last birthday method of selection. Samples are stratified using telephone prefixes and the estimated resident population of Australia by capital city/rest of state. The international survey uses computer-assisted personal interviewing (CAPI) at international departure lounges in airports throughout Australia. Immigration data using flight details, airport, country of residence and gender are used to stratify samples. Interviews are chosen at random. All TRA surveys use screening questions to check for in-scope/targeted respondents.
What are some statistical methods you have used that have assisted in solving a problem?
In our surveys we have a small number of records that end up with high weights. These weights can have a detrimental effect on results at lower levels. We use a ‘trimming’ technique to reduce these influences by trimming the weights to five standard errors from the mean. The weights are then redistributed using a raking method.
Have you ever conducted a hypothesis test where the outcome was not what you expected? If so, what did you do?
Yes, we conduct these types of tests on a regular basis when we consider adding new topics to the surveys. We conduct testing before implementation and have some expectation as to what the result would be. Occasionally we get an outcome well removed from what was expected. In this case we consult other data/information and source industry players.
We then review and update certain details, re-test, then implement. We need to be sure what we are doing does not influence results (skew them).
What are some challenges that you have faced in using statistics to provide information about a population of interest? How have you overcome these challenges?
We have had difficulty reporting travel by domestic visitors due to the growth in mobile-phone-only households (under-coverage of the population); our CATI collection has been conducted via random selection of landline telephone numbers. This has long been the accepted way of surveying the community in a cost-effective manner. However, because the growth of mobile-only households was taken up at disproportionate rates across the age groups, there had been a shift in the characteristic of travellers and an under-representation of the younger age groups. We conducted an extensive review of methodologies and the result was that CATI was still the best method of collection for our survey (a large tracking survey with complex definitions). In recent years we had looked into phoning mobiles, but this was too expensive due to the large number of invalid numbers (SIM cards that were no longer in use, SIM cards in shops, etc.). Advancements in technology that reduced the cost issue (being able to ping and identify invalid mobile numbers) had recently appeared. With this we decided to push ahead with the introduction of a dual-frame overlap survey. This type of collection is cutting-edge, and nowhere in the world is there a survey of this size (120,000 sample) that measures visitation via a dual-frame survey. Whereas before the survey sampled and weighted to the estimated resident population of Australia, we now have three distinct populations: mobile only, landline only, and mobile and landline.
The new approach is only in its early stages but all is looking very good; the weights are now distributed as they should be and we have successfully implemented all other facets of the sampling, collection, processing and weighting.

CHAPTER 8 Confidence interval estimation

AUDITING SALES INVOICES AT CALLISTEMON CAMPING SUPPLIES
Callistemon Camping Supplies Pty Ltd has several outlets that sell outdoor clothing, backpacks, tents and other camping equipment. As the company’s accountant, you are responsible for the accuracy of the integrated inventory management and sales information system. You could review the contents of every record to check the accuracy of this system, but such a detailed review would be time-consuming and costly. A better approach is to use statistical inference techniques to draw conclusions about the population of all records from a relatively small sample collected during an audit. At the end of each month, you can select a sample of the sales invoices to determine the following:
■ the mean dollar amount listed on the sales invoices for the month
■ the total dollar amount listed on the sales invoices for the month
■ any differences between the dollar amounts on the sales invoices and the amounts entered into the sales information system
■ the frequency of occurrence of various types of errors that violate the internal control policy of the distribution sites. These errors include making a shipment when there is no authorised delivery docket, failure to include the correct account number and shipment of the incorrect part.
How accurate are the results from the samples and how do you use this information? Are the sample sizes large enough to give you the information you need?
© Chris Howes/Wild Places Photography/Alamy Stock Photo

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 construct and interpret confidence interval estimates for the mean
2 construct and interpret confidence interval estimates for the proportion
3 determine the sample size necessary to develop a confidence interval for the mean or proportion
4 recognise how to use confidence interval estimates in auditing

point estimate: A single value calculated from a sample which is used to estimate an unknown population parameter.
confidence interval estimate: A range of numbers constructed about the point estimate.

Statistical inference is the process of using sample results to draw conclusions about the characteristics of a population. Inferential statistics enables you to estimate unknown population characteristics such as a population mean or a population proportion. Two types of estimates are used to estimate population parameters: point estimates and interval estimates. A point estimate is the value of a single sample statistic. A confidence interval estimate is a range of numbers, called an interval, constructed around the point estimate. The process used to construct confidence intervals tells us that the population parameter is located somewhere within the interval in a known percentage of the intervals that could be constructed from different samples.
Suppose that you would like to estimate the mean number of hours of paid work undertaken per week during term time by students in your university. The mean hours of paid work for all the students is an unknown population mean, denoted by μ. You select a sample of students and find that the sample mean is 14.8. The sample mean, $\bar{X}$, is a point estimate of the population mean μ. How accurate is 14.8?
To answer this question you must construct a confidence interval estimate. In this chapter you will learn how to construct and interpret confidence interval estimates.
Recall that the sample mean, $\bar{X}$, is a point estimate of the population mean μ. However, the sample mean will vary from sample to sample because it depends on the items selected in the sample. By taking into account the known variability from sample to sample (see Section 7.2 on the sampling distribution of the mean), you will learn how to develop the interval estimate for the population mean. The interval constructed will have a specified confidence of correctly estimating the value of the population parameter μ. In other words, there is a specified confidence that μ is somewhere in the range of numbers defined by the interval.
Suppose that after studying this chapter you find that a 95% confidence interval for the mean number of hours students at your university are employed in paid work per week is 14.75 ≤ μ ≤ 14.85. You can interpret this interval estimate by stating that you are 95% confident that the mean number of hours per week of paid work undertaken by students at your university is between 14.75 and 14.85. However, there is still a possibility that the mean number of hours is below 14.75 or above 14.85.
After learning about the confidence interval for the mean, we look at how to develop an interval estimate for the population proportion. Then we consider how large a sample to select when constructing confidence intervals, and how to perform several important estimation procedures that accountants use when performing audits.

8.1 CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ KNOWN)
In Section 7.2 we used the Central Limit Theorem and knowledge of the population distribution to determine the percentage of sample means that fall within certain distances of the population mean.
For instance, in the shampoo-bottling example used throughout Chapter 7, 95% of all sample means are between 494.12 and 505.88 mL. This statement is based on deductive reasoning. However, inductive reasoning is what we need here. We need inductive reasoning because, in statistical inference, you use the results of a single sample to draw conclusions about the population, not vice versa.

Deductive reasoning: Reasoning that starts with a hypothesis and examines possibilities to move to a specific conclusion.
Inductive reasoning: Reasoning that uses specific observations to make a general conclusion.

Suppose that in the shampoo-bottling example you wish to estimate the unknown population mean using the information from only a sample. Thus, rather than take $\mu \pm (1.96)(\sigma/\sqrt{n})$ to find the upper and lower limits around μ, as in Section 7.2, you substitute the sample mean, $\bar{X}$, for the unknown μ and use $\bar{X} \pm (1.96)(\sigma/\sqrt{n})$ as an interval to estimate the unknown μ. Although in practice you select a single sample of size n and calculate the mean $\bar{X}$, in order to understand the full meaning of the interval estimate you need to examine a hypothetical set of all possible samples of n values.
Figure 8.1 shows the actual population distribution of shampoo bottle contents at the top with a mean value of 500 and five confidence intervals for the population mean based on five different sample means. Suppose that a sample of n = 25 bottles has a mean of 496.2 mL. The interval developed to estimate μ is $496.2 \pm (1.96)(15)/\sqrt{25}$ or 496.2 ± 5.88. The interval estimate of μ is:
490.32 ≤ μ ≤ 502.08
Because the population mean μ (equal to 500) is included within the interval, this sample has led to a correct statement about μ (see Figure 8.1).
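The arithmetic for this first sample can be sketched in a few lines of Python. This is only an illustration under the values stated in the text (σ = 15, n = 25, Z = 1.96); the function name `z_interval` is a hypothetical helper introduced here, not something from the chapter.

```python
from math import sqrt

def z_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) for sample_mean ± z * sigma / sqrt(n)."""
    half_width = z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Shampoo-bottling sample: mean 496.2 mL, sigma = 15 mL, n = 25 bottles
lower, upper = z_interval(496.2, sigma=15, n=25)
print(f"{lower:.2f} <= mu <= {upper:.2f}")

# The population mean of 500 mL is known only in this hypothetical setting
contains_mu = lower <= 500 <= upper
print("Interval contains mu = 500:", contains_mu)
```

Calling the same helper with a different sample mean shows how the interval shifts while its width (2 × 5.88 mL) stays the same.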
Figure 8.1 Confidence interval estimates for five different samples of n = 25 taken from a population where μ = 500 and σ = 15
Sample mean 496.2: 490.32 to 502.08
Sample mean 501.6: 495.72 to 507.48
Sample mean 493.0: 487.12 to 498.88
Sample mean 494.12: 488.24 to 500.00
Sample mean 505.88: 500.00 to 511.76

To continue this hypothetical example, suppose that for a different sample of n = 25 bottles the mean is 501.6. The interval developed from this sample is $501.6 \pm (1.96)(15)/\sqrt{25}$ or 501.6 ± 5.88. The estimate is:
495.72 ≤ μ ≤ 507.48
Because the population mean μ (equal to 500) is also included within this interval, this statement about μ is correct.
Now, before you begin to think that correct statements about μ are always made by developing a confidence interval estimate, suppose a third hypothetical sample of n = 25 bottles is selected and the sample mean is equal to 493 mL. The interval developed here is $493 \pm (1.96)(15)/\sqrt{25}$ or 493 ± 5.88. In this case, the interval estimate of μ is:
487.12 ≤ μ ≤ 498.88
This estimate is not a correct statement, because the population mean μ is not included in the interval developed from this sample (see Figure 8.1). Thus, for some samples the interval estimate of μ is correct but for others it is incorrect. In practice, only one sample is selected and, because the population mean is unknown, you cannot determine whether the interval estimate is correct.
To resolve this dilemma of sometimes having an interval that provides a correct estimate and sometimes having an interval that provides an incorrect estimate, you need to determine the proportion of samples producing intervals that result in correct statements about the population mean μ.
To do this, consider two other hypothetical samples: the case in which $\bar{X}$ = 494.12 mL and the case in which $\bar{X}$ = 505.88 mL. If $\bar{X}$ = 494.12, the interval is $494.12 \pm (1.96)(15)/\sqrt{25}$ or 494.12 ± 5.88. This leads to the following interval:
488.24 ≤ μ ≤ 500.00
Because the population mean of 500 is at the upper limit of the interval, the statement is correct (see Figure 8.1).
When $\bar{X}$ = 505.88, the interval is $505.88 \pm (1.96)(15)/\sqrt{25}$ or 505.88 ± 5.88. The interval for the sample mean is:
500.00 ≤ μ ≤ 511.76
In this case, because the population mean of 500 is included at the lower limit of the interval, the statement is correct.
Figure 8.1 shows that when the sample mean falls anywhere between 494.12 and 505.88 mL, the population mean is included somewhere within the interval. In Section 7.2 we found that 95% of the sample means fall between 494.12 and 505.88 mL. Therefore, 95% of all samples of n = 25 bottles have sample means that include the population mean within the interval developed. The interval from 494.12 to 505.88 is referred to as a 95% confidence interval.

LEARNING OBJECTIVE 1: Construct and interpret confidence intervals for the mean
level of confidence: Represents the percentage of intervals, based on all samples of a certain size, which would contain the population parameter.

Because, in practice, you select only one sample and μ is unknown, you never know for sure whether the specific interval includes the population mean or not. However, if you take all possible samples of n and calculate their sample means, 95% of the intervals will include the population mean and only 5% of them will not. In other words, there is 95% confidence that the population mean is somewhere in the interval. Thus, we can interpret the confidence interval above as follows:
I am 95% confident that the mean amount of shampoo in the population of bottles is somewhere between 494.12 and 505.88 mL.
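The claim that about 95% of all such intervals include the population mean can be checked informally by simulation. The sketch below is an illustration only: it assumes the bottle contents are normally distributed with μ = 500 and σ = 15, and the function name `coverage_rate`, the number of trials and the random seed are arbitrary choices made here, not part of the text.

```python
import random
from math import sqrt

def coverage_rate(mu=500.0, sigma=15.0, n=25, z=1.96, trials=10_000, seed=1):
    """Estimate the proportion of intervals X-bar ± z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    half_width = z * sigma / sqrt(n)   # 5.88 mL for the values in the text
    hits = 0
    for _ in range(trials):
        # Draw one sample of n bottles (assumed normal) and compute its mean
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Proportion of intervals containing mu: {rate:.3f}")
```

The printed proportion should be close to 0.95, although it will not equal 0.95 exactly because only a finite number of samples is simulated.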
In some situations, you might want a higher degree of confidence (such as 99%) of including the population mean within the interval. In other cases, you might accept less confidence (such as 90%) of correctly estimating the population mean. In general, the level of confidence is symbolised by (1 − α) × 100%, where α is the area in the tails of the distribution that is outside the confidence interval. The area in the upper tail of the distribution is α/2, and the area in the lower tail of the distribution is α/2. We can use Equation 8.1 to construct a (1 − α) × 100% confidence interval estimate of the mean with σ known.

CONFIDENCE INTERVAL FOR A MEAN (σ KNOWN)
$$\bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \bar{X} - Z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z\frac{\sigma}{\sqrt{n}} \qquad (8.1)$$
where Z = the value corresponding to a cumulative area of 1 − α/2 from the standardised normal distribution – that is, an upper-tail probability of α/2.

The value of Z needed for constructing a confidence interval is called the critical value for the distribution. For a 95% confidence interval the value of α is 0.05. The critical Z value corresponding to a cumulative area of 0.9750 is 1.96 because there is 0.025 in the upper tail of the distribution and the cumulative area less than Z = 1.96 is 0.975.
There is a different critical value for each level of confidence 1 − α. A level of confidence of 95% leads to a Z value of 1.96 (see Figure 8.2). For a 99% level of confidence, α is 0.01. The Z value is approximately 2.58 because the upper-tail area is 0.005 and the cumulative area less than Z = 2.58 is 0.995 (see Figure 8.3).
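If statistical tables are not to hand, the critical values can be looked up with the inverse cumulative normal function in Python's standard library. This is a minimal sketch; the helper name `critical_z` is an assumption made for illustration.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Z value with cumulative area 1 - alpha/2 under the standardised normal curve."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {critical_z(level):.2f}")
```

Rounded to two decimal places, the 95% and 99% values agree with the 1.96 and 2.58 quoted above.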
Figure 8.2 Normal curve for determining the Z value needed for 95% confidence (area 0.475 on each side of the mean between Z = –1.96 and Z = +1.96; 0.025 in each tail)
Figure 8.3 Normal curve for determining the Z value needed for 99% confidence (area 0.495 on each side of the mean between Z = –2.58 and Z = +2.58; 0.005 in each tail)

critical value: The value in a distribution that cuts off the required probability in the tail for a given confidence level.

Now that various levels of confidence have been considered, why not make the confidence level as close to 100% as possible? Before doing so, you need to realise that any increase in the level of confidence is achieved only by widening (and making less precise) the confidence interval. You would have more confidence that the population mean is within a broader range of values. However, this might make the interpretation of the confidence interval less useful. The trade-off between the width of the confidence interval and the level of confidence is discussed in greater depth in the context of determining the sample size in Section 8.4. Example 8.1 illustrates the application of the confidence interval estimate.

ESTIMATING THE MEAN SALMON WEIGHT WITH 95% CONFIDENCE
Atlantic Salmon farming is an important industry in Tasmania. Fish are grown to market size in a series of large, circular, netted enclosures in areas such as the Huon River, Port Esperance, the D’Entrecasteaux Channel and around the Tasman Peninsula. When salmon are harvested to send to market they need to weigh 3.5–4 kg, so the farmer is aiming to have an average weight of 3.75 kg. We will assume that all salmon are placed in their final growing enclosure at the same time and spend 12 months there, and that the standard deviation of their weights after that time is 380 g. A farmer wishes to check whether the average weight of salmon in the enclosure falls in the required range. He weighs a sample of 50 salmon being sent to market and finds their average weight is 3,607 g.
Construct a 95% confidence interval estimate for the population mean salmon weight. EXAMPLE 8.1 SOLUTION Using Equation 8.1 with Z = 1.96 for 95% confidence: σ 380 X±Z = 3607 ± (1.96) n 50 = 3607 ± 105.33 3501.67 ⩽ μ ⩽ 3712.33 Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 284 CHAPTER 8 CONFIDENCE INTERVAL ESTIMATION Thus, with 95% confidence you can conclude that the mean weight of salmon in the enclosure is between 3,501.67 g and 3,712.33 g. This would indicate that the average weight of fish in the enclosure is below the average of 3,750 g desirable for market-ready fish. We would expect that many fish in the enclosure still need to grow larger before being harvested. To see the effect of using a 99% confidence interval, examine Example 8.2. EXAMPLE 8.2 E ST IMAT ING T H E ME AN S AL M ON WE I GHT WI TH 99% CON F I D E N CE Construct a 99% confidence interval for the population mean salmon weight. SOLUTION Using Equation 8.1 with Z = 2.58 for 99% confidence: σ 380 X±Z = 3607 ± (2.58) n 50 = 3607 ± 138.65 3468.35 ⩽ μ ⩽ 3745.65 The interval still does not contain the desired mean weight of 3.75 kg, so the fish will need to grow larger. Problems for Section 8.1 LEARNING THE BASICS 8.1 8.2 8.3 8.4 8.5 8.6 – If X = 85, σ = 8 and n = 64, construct a 95% confidence interval estimate of the population mean μ. – If X = 125, σ = 24 and n = 36, construct a 99% confidence interval estimate of the population mean μ. A market researcher states that she has 95% confidence that the mean monthly sales of a product are between $170,000 and $200,000. Explain the meaning of this statement. Why is it not possible in Example 8.1 to have 100% confidence? Explain. From the results of Example 8.1 regarding salmon farming, is it true that 95% of the sample means will fall between 3,501.67 g and 3,712.33 g? Explain. 
Is it true in Example 8.1 that you do not know for sure whether the population mean is between 3,501.67 g and 3,712.33 g? Explain. APPLYING THE CONCEPTS 8.7 The manager of a paint supply store wants to estimate the actual amount of paint contained in 4-litre cans purchased from a nationally known manufacturer. It is known from the manufacturer’s specifications that the standard deviation of the amount of paint is equal to 0.08 litres. A random sample of 8.8 50 cans is selected, and the sample mean amount of paint per 4-litre can is 3.98 litres. a. Construct a 99% confidence interval estimate of the population mean amount of paint included in a 4-litre can. b. On the basis of your results, do you think that the manager has a right to complain to the manufacturer? Why? c. Must you assume that the population amount of paint per can is normally distributed here? Explain. d. Construct a 95% confidence interval estimate. How does this change your answer to (b)? The quality control manager at a light globe factory needs to estimate the mean life of a large shipment of energy-saving light-emitting diode (LED) light globes. The standard deviation is 3,000 hours. A random sample of 64 light globes indicates a sample mean life of 34,000 hours. a. Construct a 95% confidence interval estimate of the population mean life of light globes in this shipment. b. Do you think that the manufacturer has the right to state that the light globes last an average of 35,000 hours? Explain. c. Must you assume that the population of light globe life is normally distributed? Explain. d. Suppose that the standard deviation changes to 6,000 hours. What are your answers in (a) and (b)? 
8.9 The inspection division of a state department that regulates trade measurement wants to estimate the actual amount of soft drink in 2-litre bottles at the local bottling plant of a large, nationally known soft-drink company. The bottling plant has informed the inspection division that the population standard deviation for 2-litre bottles is 0.05 litres. A random sample of 100 2-litre bottles at this bottling plant indicates a sample mean of 1.99 litres.
a. Construct a 95% confidence interval estimate of the population mean amount of soft drink in each bottle.
b. Must you assume that the population of soft-drink fill is normally distributed? Explain.
c. Explain why a value of 2.02 litres for a single bottle is not unusual, even though it is outside the confidence interval you calculated.
d. Suppose that the sample mean had been 1.97 litres. What is your answer to (a)?

8.2 CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ UNKNOWN)

Just as the mean of the population μ is usually unknown, you rarely know the actual standard deviation of the population, σ. Therefore, you need to develop a confidence interval estimate of μ using only the sample statistics X̄ and S.

Student’s t Distribution
At the beginning of the twentieth century a statistician for Guinness Breweries in Ireland (see reference 1), William S. Gosset, wanted to make inferences about the mean when σ was unknown. Because Guinness employees were not permitted to publish research work under their own names, Gosset adopted the pseudonym ‘Student’. The distribution that he developed is known as Student’s t distribution.
If the random variable X is normally distributed, then the following statistic has a t distribution with n − 1 degrees of freedom:

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

Student’s t distribution A continuous probability distribution whose shape depends on the number of degrees of freedom.

degrees of freedom Relate to the number of values in the calculation of a statistic that are free to vary.

This expression has the same form as the Z statistic in Equation 7.4 on page 254, except that S is used to estimate the unknown σ. The concept of degrees of freedom is discussed further on page 286.

Properties of the t Distribution
In appearance, the t distribution is very similar to the standardised normal distribution. Both distributions are bell shaped. However, the t distribution has more area in the tails and less in the centre than the standardised normal distribution (see Figure 8.4). Because the value of σ is unknown and S is used to estimate it, the values of t are more variable than those for Z.

The degrees of freedom n − 1 are directly related to the sample size n. As the sample size and degrees of freedom increase, S becomes a better estimate of σ and the t distribution gradually approaches the standardised normal distribution until the two are virtually identical. With a sample size of about 120 or more, S estimates σ precisely enough that there is little difference between the t and Z distributions. For this reason, most statisticians use Z instead of t when the sample size is greater than 120.

Figure 8.4 Standardised normal distribution and t distribution for 5 degrees of freedom

As stated earlier, the t distribution assumes that the random variable X is normally distributed.
In practice, however, as long as the sample size is large enough and the population is not very skewed, you can use the t distribution to estimate the population mean when σ is unknown. When dealing with a small sample size and a skewed population distribution, the validity of the confidence interval is a concern. To assess the assumption of normality, you can evaluate the shape of the sample data by using a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot. You find the critical values of t for the appropriate degrees of freedom from the table of the t distribution (Table E.3). The columns of the table represent the area in the upper tail of the t distribution. Each row represents the particular t value for each specific degree of freedom. For example, with 99 degrees of freedom, if you want 95% confidence you find the appropriate value of t as shown in Table 8.1. The 95% confidence level means that 2.5% of the values (an area of 0.025) are in each tail of the distribution. Looking in the column for an upper-tail area of 0.025 and in the row corresponding to 99 degrees of freedom gives you a critical value for t of 1.9842. Because t is a symmetrical distribution with a mean of 0, if the upper-tail value is +1.9842, the value for the lower-tail area (lower 0.025) is -1.9842. A t value of -1.9842 means that the probability that t is less than -1.9842 is 0.025, or 2.5% (see Figure 8.5). 
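The table lookup described above can also be reproduced numerically. The sketch below is one possible approach, not the method used in the text: it integrates the t density with Simpson's rule and then bisects to find the value whose upper-tail area matches the target. The function names are illustrative, and in practice a statistics package or a spreadsheet function would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_area(t, df, steps=2000):
    """Area to the right of t (for t >= 0), using Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_area, df):
    """Find t whose upper-tail area equals tail_area, by bisection."""
    low, high = 0.0, 200.0
    for _ in range(60):
        mid = (low + high) / 2
        if upper_tail_area(mid, df) > tail_area:
            low = mid   # too much area remains in the tail, so move right
        else:
            high = mid
    return (low + high) / 2

# Upper-tail area of 0.025 with 99 and with 5 degrees of freedom.
print(round(t_critical(0.025, 99), 4))
print(round(t_critical(0.025, 5), 4))
```

Under these assumptions the two printed values should agree with the 0.025 column of the t table for 99 and 5 degrees of freedom (1.9842 and 2.5706).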
Table 8.1 Determining the critical value from the t table for an area of 0.025 in each tail with 99 degrees of freedom (extracted from Table E.3 in Appendix E of this book)

                      Upper-tail areas
Degrees of freedom    .25       .10       .05       .025      .01       .005
1                     1.0000    3.0777    6.3138    12.7062   31.8207   63.6574
2                     0.8165    1.8856    2.9200    4.3027    6.9646    9.9248
3                     0.7649    1.6377    2.3534    3.1824    4.5407    5.8409
4                     0.7407    1.5332    2.1318    2.7764    3.7469    4.6041
5                     0.7267    1.4759    2.0150    2.5706    3.3649    4.0322
...                   ...       ...       ...       ...       ...       ...
96                    0.6771    1.2904    1.6609    1.9850    2.3658    2.6280
97                    0.6770    1.2903    1.6607    1.9847    2.3654    2.6275
98                    0.6770    1.2902    1.6606    1.9845    2.3650    2.6269
99                    0.6770    1.2902    1.6604    1.9842    2.3646    2.6264
100                   0.6770    1.2901    1.6602    1.9840    2.3642    2.6259

Figure 8.5 t distribution with 99 degrees of freedom (central area 1 − α = 0.95 between t = −1.9842 and t = +1.9842; area of 0.025 in each tail)

The Concept of Degrees of Freedom
In Chapter 3 we saw that the numerator of the sample variance S² (see Equation 3.9a) requires the calculation of:

$$\sum_{i=1}^{n} (X_i - \bar{X})^2$$

In order to calculate S², you first need to know X̄. Therefore, only n − 1 of the sample values are free to vary. This means that you have n − 1 degrees of freedom. For example, suppose a sample of five values has a mean of 20. How many values do you need to know before you can determine the remainder of the values? The fact that n = 5 and X̄ = 20 also tells you that:

$$\sum_{i=1}^{n} X_i = 100$$

because:

$$\frac{\sum_{i=1}^{n} X_i}{n} = \bar{X}$$

Thus, when you know four of the values, the fifth one will not be free to vary because the sum must add to 100. For example, if four of the values are 18, 24, 19 and 16, the fifth value must be 23 so that the sum equals 100.

The Confidence Interval Statement
Equation 8.2 defines the (1 − α) × 100% confidence interval estimate for the mean with σ unknown.
LEARNING OBJECTIVE 1 Construct and interpret confidence intervals for the mean

CONFIDENCE INTERVAL FOR THE MEAN (σ UNKNOWN)

$$\bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}} \quad\text{or}\quad \bar{X} - t_{n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1}\frac{S}{\sqrt{n}} \tag{8.2}$$

where t_{n−1} is the critical value of the t distribution with n − 1 degrees of freedom for an area of α/2 in the upper tail.

To illustrate the application of the confidence interval estimate for the mean when the standard deviation σ is unknown, return to the Callistemon Camping Supplies scenario presented on page 279. You select a sample of 100 sales invoices from the population of sales invoices during the month and the sample mean of the 100 sales invoices is $230.27, with a sample standard deviation of $52.62. For 95% confidence, the critical value from the t distribution (as shown in Table 8.1) is 1.9842. Using Equation 8.2:

$$\bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}} = 230.27 \pm (1.9842)\frac{52.62}{\sqrt{100}} = 230.27 \pm 10.44$$

$$\$219.83 \le \mu \le \$240.71$$

A Microsoft Excel worksheet for these data is presented in Figure 8.6. Thus, with 95% confidence, you conclude that the mean amount of all the sales invoices is between $219.83 and $240.71. The 95% confidence level indicates that if you selected all possible samples of 100 (something that is never done in practice), 95% of the intervals developed would include the population mean somewhere within the interval.
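The arithmetic in the sales invoice illustration can be sketched in a few lines of Python. The function name `t_interval` is an illustrative assumption; the critical value 1.9842 is taken from Table 8.1 rather than computed.

```python
import math

def t_interval(mean, std_dev, n, t_value):
    """Equation 8.2: mean +/- t * S / sqrt(n)."""
    half_width = t_value * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width

# Sample values from the sales invoice illustration; t value from Table 8.1.
lower, upper = t_interval(mean=230.27, std_dev=52.62, n=100, t_value=1.9842)
print(f"${lower:.2f} <= mu <= ${upper:.2f}")
```

Passing the critical value as a parameter keeps the sketch independent of any particular statistics library; a table lookup or a spreadsheet function can supply it.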
The validity of this confidence interval estimate depends on the assumption of normality for the distribution of the amount of the sales invoices. With a sample of 100, the normality assumption is not overly restrictive and the use of the t distribution is probably appropriate. Example 8.3 further illustrates how to construct the confidence interval for a mean when the population standard deviation is unknown.

Figure 8.6 Microsoft Excel 2016 worksheet to calculate a confidence interval estimate for the mean sales invoice amount for Callistemon Camping Supplies

Row   A                               B           Formula in column B
1     Estimate for the mean sales invoice amount
3     Data
4     Sample standard deviation       52.62
5     Sample mean                     230.27
6     Sample size                     100
7     Confidence level                95%
9     Intermediate calculations
10    Standard error of the mean      5.262       =B4/SQRT(B6)
11    Degrees of freedom              99          =B6 - 1
12    t value                         1.984217    =T.INV.2T(1 - B7, B11)
13    Interval half width             10.44095    =B12 * B10
15    Confidence interval
16    Interval lower limit            219.8291    =B5 - B13
17    Interval upper limit            240.7109    =B5 + B13

EXAMPLE 8.3 ESTIMATING THE MEAN HEIGHT OF FEMALE ATHLETES AGED 18–25
A manufacturer of women’s tracksuits needs to estimate the average height of female athletes in the 18–25 age group. The measurements of a sample of 30 women are taken and their heights recorded in millimetres. Table 8.2 lists these values. < HEIGHTS > Construct a 95% confidence interval estimate for the population mean height of female athletes in this age group.

Table 8.2 Heights (in millimetres) of female athletes aged 18–25
1,870  1,728  1,656  1,610  1,634  1,784  1,522  1,696  1,592  1,662
1,866  1,764  1,734  1,662  1,734  1,774  1,550  1,756  1,762  1,866
1,820  1,744  1,788  1,688  1,810  1,752  1,680  1,810  1,652  1,736

SOLUTION Figure 8.7 shows that the sample mean is X̄ = 1,723.4 mm and the sample standard deviation is S = 89.55 mm. Using Equation 8.2 to construct the confidence interval, you need to determine the critical value from the t table for an area of 0.025 in each tail with 29 degrees of freedom. Table E.3 shows that t₂₉ = 2.0452. Thus, using X̄ = 1,723.4, S = 89.55, n = 30 and t₂₉ = 2.0452:

$$\bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}} = 1{,}723.4 \pm (2.0452)\frac{89.55}{\sqrt{30}} = 1{,}723.4 \pm 33.44$$

$$1{,}689.96 \le \mu \le 1{,}756.84$$

Figure 8.7 PHStat confidence interval estimate for the mean height (in millimetres) of female athletes aged 18–25

One sample t: height
Data
Sample standard deviation       89.55083319
Sample mean                     1723.4
Sample size                     30
Confidence level                95%
Intermediate calculations
Standard error of the mean      16.34967046
Degrees of freedom              29
t value                         2.045229642
Interval half width             33.43883066
Confidence interval
Interval lower limit            1689.96
Interval upper limit            1756.84

You conclude with 95% confidence that the mean height of 18–25-year-old female athletes is between 1,689.96 and 1,756.84 mm. The validity of this confidence interval estimate depends on the assumption that the heights in the population are normally distributed. Remember, however, that you can slightly relax this assumption for large sample sizes. Thus, with a sample of 30, you can use the t distribution even if the distribution of heights is slightly skewed. From the normal probability plot displayed in Figure 8.8, or the boxplot displayed in Figure 8.9, the heights appear only slightly skewed.
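The summary statistics and interval in Example 8.3 can be recomputed from the Table 8.2 data. This is a minimal sketch using Python's standard `statistics` module; the critical value t₂₉ = 2.0452 is taken from the solution above rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# Heights (mm) of the 30 female athletes listed in Table 8.2.
heights = [
    1870, 1728, 1656, 1610, 1634, 1784, 1522, 1696, 1592, 1662,
    1866, 1764, 1734, 1662, 1734, 1774, 1550, 1756, 1762, 1866,
    1820, 1744, 1788, 1688, 1810, 1752, 1680, 1810, 1652, 1736,
]

n = len(heights)
mean = statistics.mean(heights)
s = statistics.stdev(heights)     # sample standard deviation (n - 1 divisor)
t_29 = 2.0452                     # critical value quoted from Table E.3

half_width = t_29 * s / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width

print(f"n = {n}, mean = {mean:.1f}, S = {s:.2f}")
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

Note that `statistics.stdev` divides by n − 1, which matches the sample standard deviation S used in Equation 8.2; `statistics.pstdev` would give the population form instead.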
Thus the t distribution is appropriate for these data.

Figure 8.8 PHStat normal probability plot for the height (in millimetres) of female athletes aged 18–25 (height plotted against Z value)

Figure 8.9 PHStat boxplot for the height (in millimetres) of female athletes aged 18–25

The validity of this confidence interval estimate depends on the assumption that the heights are normally distributed. What would happen if there was a small sample and the boxplot and the normal probability plot indicated that the distribution was right-skewed? In this case you would have some concern about the validity of the confidence interval in estimating the population mean. The concern is that a 95% confidence interval based on a small sample from a skewed distribution will contain the population mean less than 95% of the time in repeated sampling. In the case of small sample sizes and skewed distributions, you might consider the sample median as an estimate of central tendency and construct a confidence interval for the population median (see reference 2).

Problems for Section 8.2

LEARNING THE BASICS
8.10 Determine the critical value of t in each of the following circumstances:
a. 1 − α = 0.95, n = 10
b. 1 − α = 0.99, n = 10
c. 1 − α = 0.95, n = 32
d. 1 − α = 0.95, n = 65
e. 1 − α = 0.90, n = 16
8.11 If X̄ = 75, S = 24, n = 36, and assuming that the population is normally distributed, construct a 95% confidence interval estimate of the population mean μ.
8.12 If X̄ = 50, S = 15, n = 16, and assuming that the population is normally distributed, construct a 99% confidence interval estimate of the population mean μ.
8.13 Construct a 95% confidence interval estimate for the population mean, based on each of the following sets of data, assuming that the population is normally distributed:
Set 1: 1, 1, 1, 1, 8, 8, 8, 8
Set 2: 1, 2, 3, 4, 5, 6, 7, 8
Explain why these data sets have different confidence intervals even though they have the same mean and range.
8.14 Construct a 95% confidence interval for the population mean, based on the numbers 1, 2, 3, 4, 5, 6 and 20. Change the number 20 to 7 and recalculate the confidence interval. Using these results, describe the effect of an outlier (i.e. extreme value) on the confidence interval.

APPLYING THE CONCEPTS
You can solve problems 8.15 to 8.21 with or without Microsoft Excel.
8.15 A stationery store wants to estimate the mean retail value of greeting cards that it has in its inventory. A random sample of 20 greeting cards indicates a mean value of $4.95 and a standard deviation of $0.82.
a. Assuming a normal distribution, construct a 95% confidence interval estimate of the mean value of all greeting cards in the store’s inventory.
b. How are the results in (a) useful in assisting the store owner to estimate the total value of his inventory?
8.16 Water resources in many parts of Australia are being closely watched and restrictions or water-wise rules have been imposed on activities such as garden watering. Suppose that Sydney Water monitors water usage in a suburb and finds that for one summer the average household usage is 408 litres per day. A year later it examines records of a sample of 50 households and finds that there is a daily mean usage of 380 litres with a standard deviation of 25 litres.
a. Construct a 95% confidence interval for the population mean daily water usage in the second summer. Assume the population usage is normally distributed.
b. Interpret the interval constructed in (a).
c. Do you think water usage has changed in the second summer? Explain.
8.17 The energy consumption of refrigerators sold in Australia and New Zealand is checked and appliances are given a star rating to guide consumers who are about to make purchases. The consumption in kilowatts per annum is also displayed for each model on the website <www.energyrating.gov.au>. Suppose a consumer organisation wants to estimate the actual electricity usage of a model of refrigerator that has an advertised energy usage of 355 kW per annum. It tests a random sample of n = 18 fridges and finds a sample mean usage of 367 and a sample standard deviation of 30.
a. Assuming that the energy usage in the population is normally distributed, construct a 95% confidence interval estimate of the population mean energy usage for this model of refrigerator.
b. Do you think that the consumer organisation should accuse the manufacturer of producing fridges that do not meet the advertised energy consumption? Explain.
c. Explain why an observed energy usage of 350 kW for a particular refrigerator is not unusual, even though it is outside the confidence interval developed in (a).
8.18 The data below represent the annual account fees for cheques made by a bank for a sample of 23 clients with cheque accounts who do not undertake Internet banking. < BANK_COST1 >
26  29  20  20  21  22  25  25  18  25  15  18
20  25  25  22  30  30  30  15  20  29  20
a. Construct a 95% confidence interval for the population mean annual cheque fee.
b. What assumption must you make about the population distribution in (a)?
c. Interpret the interval constructed in (a).
8.19 One of the major measures of the quality of service provided by any organisation is the speed with which it responds to customer complaints. A large family-held department store selling furniture and flooring, including carpeting, has undergone a major expansion in the past several years. In particular, the flooring department has expanded from two installation crews to an installation supervisor, a measurer and 15 installation crews. Last year there were 50 complaints about carpet installation. The data below represent the number of days between the receipt of the complaint and the resolution of the complaint. < FURNITURE >
54   5    35   137  31   27   152  2    123  81   74   27
11   19   126  110  110  29   61   35   94   31   26   5
12   4    165  32   29   28   29   26   25   1    14   13
13   10   5    27   4    52   30   22   36   26   20   23
33   68
a. Construct a 95% confidence interval estimate of the mean number of days between receipt of the complaint and resolution of the complaint.
b. What assumption must you make about the population distribution in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Explain.
d. What effect might your conclusion in (c) have on the validity of the results in (a)?
8.20 The approval process for a life insurance policy requires a review of the application and the applicant’s medical history, possible requests for additional medical information and medical examinations, and a policy compilation stage where the policy pages are generated then delivered. The ability to deliver approved policies to customers in a timely manner is critical to the profitability of this service to the insurance company. Over a period of one month, a random sample of 27 approved policies was selected and the total processing time in days recorded. < INSURANCE >
73  19  16  64  28  28  31  90  60  56  31  56  22  18
45  48  17  17  17  91  92  63  50  51  69  16  17
a. Construct a 95% confidence interval estimate of the mean processing time.
b. What assumption must you make about the population distribution in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Use a plot and explain.
8.21 The data below represent the daily rate in Australian dollars for a double room or studio booking on the following Monday night at a sample of hotels, motels and motor lodges in 20 New Zealand cities and towns in July 2017. < MOTEL_2017 >

City/Town          Room cost    City/Town          Room cost
Lake Taupo         138          Hamilton           147
Whitianga          152          Waitomo            118
Auckland           179          Whangarei          137
Paihia             113          Russell            129
Wellington         141          Kerikeri           136
Tauranga           128          Havelock North     156
New Plymouth       149          Thames             121
Hastings           103          Palmerston North   137
Napier             135          Wanganui           114
Gisborne           122          Rotorua            132

Data obtained from <http://compare.jasons.co.nz> accessed 4 July 2017
a. Construct a 95% confidence interval for the population mean lowest room cost.
b. Construct a 99% confidence interval for the population mean lowest room cost.
c. What assumption do you need to make about the population of interest to construct the intervals in (a) and (b)?
d. Given the data presented, do you think the assumption needed in (a) and (b) is valid? Use a plot and explain.

8.3 CONFIDENCE INTERVAL ESTIMATION FOR THE PROPORTION

This section extends the concept of the confidence interval to categorical data. Here you are concerned with estimating the proportion of items in a population with a certain characteristic of interest. The unknown population proportion is represented by the Greek letter π (pronounced pi). The point estimate for π is the sample proportion, p = X/n, where n is the sample size and X is the number of items in the sample with the characteristic of interest. Equation 8.3 defines the confidence interval estimate for the population proportion.
CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION

$$p \pm Z\sqrt{\frac{p(1-p)}{n}} \quad\text{or}\quad p - Z\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + Z\sqrt{\frac{p(1-p)}{n}} \tag{8.3}$$

where
p = sample proportion = X/n = (number of items with the characteristic)/(sample size)
π = population proportion
Z = critical value from the standardised normal distribution
n = sample size
assuming both np and n(1 − p) are greater than 5

LEARNING OBJECTIVE 2 Construct and interpret confidence intervals for the proportion

You can use the confidence interval estimate of the proportion defined in Equation 8.3 to estimate the proportion of sales invoices that contain errors (see the opening scenario on page 279). Suppose that in a sample of 100 sales invoices, 10 contain errors. Thus, for these data, p = X/n = 10/100 = 0.10, so np = 10 > 5 and n(1 − p) = 90 > 5. Using Equation 8.3 and Z = 1.96 for 95% confidence:

$$p \pm Z\sqrt{\frac{p(1-p)}{n}} = 0.10 \pm (1.96)\sqrt{\frac{(0.10)(0.90)}{100}} = 0.10 \pm (1.96)(0.03) = 0.10 \pm 0.0588$$

$$0.0412 \le \pi \le 0.1588$$

Therefore, you have 95% confidence that between 4.12% and 15.88% of all the sales invoices contain errors. Figure 8.10 shows a Microsoft Excel worksheet for these data. Note that in early versions of Excel, the formula used in cell B10 would be =NORMSINV((1+B6)/2).

Example 8.4 illustrates another application of a confidence interval estimate for the proportion.
Figure 8.10 Microsoft Excel 2016 worksheet to form a confidence interval estimate for the proportion of sales invoices that contain errors

Row   A                                    B         Formula in column B
1     Proportion of in-error sales invoices
3     Data
4     Sample size                          100
5     Number of success                    10
6     Confidence level                     95%
8     Intermediate calculations
9     Sample proportion                    0.1       =B5/B4
10    Z value                              1.96      =NORM.S.INV((1 + B6)/2)
11    Standard error of the proportion     0.03      =SQRT(B9 * (1 - B9)/B4)
12    Interval half width                  0.0588    =(B10 * B11)
14    Confidence interval
15    Interval lower limit                 0.0412    =B9 - B12
16    Interval upper limit                 0.1588    =B9 + B12

EXAMPLE 8.4 ESTIMATING THE PROPORTION OF TYPOGRAPHICAL ERRORS IN ONLINE NEWSPAPERS
With the latest technology available to check written text, mistakes in newspapers are becoming less common. However, humans still make mistakes. A large media corporation wants to estimate the proportion of online newspaper articles written by a variety of journalists that have typographical errors. A random sample of 200 articles is selected from all the newspapers posted online during a single month. For this sample of 200, 7 contain some type of typographical error. Construct and interpret a 90% confidence interval for the proportion of articles posted online during the month that have a typographical error.

SOLUTION Using Equation 8.3:

$$p = \frac{7}{200} = 0.035$$

so np = 200 × 0.035 = 7 > 5 and n(1 − p) = 200 × 0.965 = 193 > 5, and with a 90% level of confidence Z = 1.645:

$$p \pm Z\sqrt{\frac{p(1-p)}{n}} = 0.035 \pm (1.645)\sqrt{\frac{(0.035)(0.965)}{200}} = 0.035 \pm (1.645)(0.0130) = 0.035 \pm 0.0214$$

$$0.0136 \le \pi \le 0.0564$$

You can conclude with 90% confidence that between 1.36% and 5.64% of the newspaper articles posted online in that month have a typographical error.
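Both proportion calculations above follow the same steps, which can be sketched as a small Python function. This is an illustrative sketch, assuming the standard-library `statistics.NormalDist` for the critical value; the function name `proportion_interval` is hypothetical, and the expression `(1 + confidence) / 2` mirrors the worksheet formula in cell B10.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence):
    """Equation 8.3: p +/- Z * sqrt(p(1 - p)/n), with the np > 5 check."""
    p = x / n
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("np and n(1 - p) should both exceed 5")
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Sales invoices (10 errors in 100, 95%) and Example 8.4 (7 in 200, 90%).
for label, x, n, conf in [("invoices", 10, 100, 0.95), ("articles", 7, 200, 0.90)]:
    low, high = proportion_interval(x, n, conf)
    print(f"{label}: {low:.4f} to {high:.4f}")
```

Because the sketch uses unrounded critical values, results can differ from hand calculations in the last decimal place for other inputs, although these two cases round to the same limits as the worked solutions.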
Equation 8.3 contains a Z statistic since you can use the normal distribution to approximate the binomial distribution when the sample size is sufficiently large. In Example 8.4, the confidence interval using Z provides an excellent approximation for the population proportion since both X and n − X are greater than 5. However, if you do not have a sufficiently large sample size, you should use the binomial distribution rather than Equation 8.3 (see references 3, 4 and 5). The exact confidence intervals for various sample sizes and proportions of successes have been tabulated by Fisher and Yates (reference 4).

Problems for Section 8.3

LEARNING THE BASICS
8.22 If n = 200 and X = 50, construct a 95% confidence interval estimate of the population proportion.
8.23 If n = 400 and X = 25, construct a 99% confidence interval estimate of the population proportion.

APPLYING THE CONCEPTS
8.24 A telco wants to estimate the proportion of mobile phone customers who would purchase a phone plan with unlimited standard calls and SMS and 2GB of data if it were made available at a substantially reduced cost. A random sample of 500 customers is selected. The results indicate that 190 of the customers would purchase the plan at a reduced cost.
a. Construct a 99% confidence interval estimate of the population proportion of customers who would purchase the unlimited 2GB plan.
b. How would the manager in charge of promotional programs for mobile customers use the results in (a)?
8.25 A survey of 500 highly educated women who left careers for family reasons found that 66% postponed their return to work due to difficulty in making suitable childcare arrangements.
a. Construct a 95% confidence interval for the population proportion of highly educated women who have postponed their return to work due to difficulty in making suitable childcare arrangements.
b. Interpret the interval in (a).
8.26 A survey of 293 inhabitants of Tropical North Queensland in 2013 found that 45% considered increased property values were a negative impact of tourism in the region (Tropical North Queensland Social Indicators 2013 <https://cdn-teq.queensland.com/~/media/d0af5b7686754e2591d7e3fad2cdb673.ashx?vs=1&d=20140515T080145> accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of all residents in the region who believe increased property values are a negative impact of tourism.
b. Construct a 90% confidence interval for the proportion of all residents in the region who believe increased property values are a negative impact of tourism.
c. Which interval is wider? Explain why this is true.
8.27 The number of older consumers in Australia is growing and they are becoming an important economic force. According to the Australian Bureau of Statistics, the proportion of the population aged 65 years and over increased from 14% in 2011 to 16% in 2016 (Australian Bureau of Statistics, Reflecting Australia - Stories from the Census, 2016, Cat. No. 2071.0, 2017). The proportion is projected to grow higher in coming years. Many older consumers feel overwhelmed when confronted with the task of selecting investments, banking services, health insurance or phone service providers. Suppose a telephone survey of 1,900 older consumers found that 27% said they felt confused when making financial decisions.
a. Construct a 95% confidence interval for the population proportion of older consumers who feel confused when making financial decisions.
b. Interpret the interval in (a).
8.28 The Australian Telecommunications Industry Ombudsman 2016 Annual Report states that 34.1% of new complaints in 2015–16 related to faults (<http://annualreport2016.tio.com. au/#Service_type_in_complaints> accessed 5 July 2017). Imagine that you take a survey of 1,000 Australian users and find that 36% of this sample report that they have had telecommunication service faults in the past three months. a. Construct a 95% confidence interval for the population proportion of users who have experienced service faults in the past three months. b. Does your interval indicate that there is a difference from the percentage reported by the Ombudsman? Give reasons why a difference may occur. 8.29 The Australian Psychological Society conducted an online survey in 2016 of 1,000 adults and 518 adolescents. It found that 69% of adolescents reported consuming food from fast food restaurants at least once a week. (Psychology Week 2016, APS Compass for Life Wellbeing Survey <www. psychology.org.au/Assets/Files/16APS-PW-Survey-Web.pdf> accessed 5 July 2017). a. Construct a 95% confidence interval for the proportion of all Australian adolescents who consume food from fast food restaurants at least once per week. b. How would your result change if it was a 99% interval? 8.30 Suppose that, in a survey of 600 employers, 126 indicate that they have used a recruitment service within the past two months to find new staff. a. Construct a 95% confidence interval for the population proportion of employers who have used a recruitment service within the past two months to find new staff. b. Construct a 99% confidence interval for the population proportion of employers who have used a recruitment service within the past two months to find new staff. c. Interpret the intervals in (a) and (b). d. Discuss the effect on the confidence interval estimate when you change the level of confidence. 
8.4 DETERMINING SAMPLE SIZE

LEARNING OBJECTIVE 3 Determine the sample size necessary to develop a confidence interval for the mean

In each example of confidence interval estimation, you selected the sample size without regard to the width of the resulting confidence interval. In the business world, determining the proper sample size is a complicated procedure, subject to the constraints of budget, time and the amount of acceptable sampling error. If, in the Callistemon Camping Supplies scenario, you want to estimate the mean dollar amount of the sales invoices or the proportion of sales invoices that contain errors, you must determine in advance how large a sampling error to allow in estimating each of the parameters. You must also determine in advance the level of confidence to use in estimating the population parameter.

Sample Size Determination for the Mean

To develop a formula for determining the appropriate sample size needed when constructing a confidence interval estimate of the mean, recall Equation 8.1 on page 282:

\[ \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \]

The amount added to or subtracted from \(\bar{X}\) is equal to half the width of the interval. This quantity represents the amount of imprecision in the estimate that results from sampling error. The sampling error e (in this context, some statisticians refer to e as the ‘margin of error’) is defined as:

\[ e = Z\frac{\sigma}{\sqrt{n}} \]

sampling error The difference in results for different samples of the same size.

Solving for n gives the sample size needed to construct the appropriate confidence interval estimate for the mean. ‘Appropriate’ means that the resulting interval will have an acceptable amount of sampling error.

SAMPLE SIZE DETERMINATION FOR THE MEAN
The sample size n is equal to the product of the Z value squared and the variance σ², divided by the sampling error e squared.
\[ n = \frac{Z^{2}\sigma^{2}}{e^{2}} \qquad (8.4) \]

To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the standardised normal distribution¹
2. the acceptable sampling error e
3. the standard deviation σ.

In some business-to-business relationships requiring estimation of important parameters, legal contracts specify acceptable levels of sampling error and the confidence level required. For companies in the food or drug sectors, government regulations often specify sampling errors and confidence levels. In general, however, it is usually not easy to specify the factors needed to determine the sample size. How can you determine the level of confidence and sampling error? Typically, these questions are answered only by the subject matter expert (i.e. the individual most familiar with the variables under study). Although 95% is the most common confidence level used, if more confidence is desired then 99% might be more appropriate; if less confidence is deemed acceptable, then 90% might be used. For the sampling error, you should think not of how much sampling error you would like to have (you really do not want any error), but of how much you can tolerate when drawing conclusions from the data.

In addition to specifying the confidence level and the sampling error, you need an estimate of the standard deviation. Unfortunately, you rarely know the population standard deviation, σ. In some instances, you can estimate the standard deviation from past data. In other situations, you can make an educated guess by taking into account the range and distribution of the variable. For example, if you assume a normal distribution, the range is approximately equal to 6σ (i.e. ±3σ around the mean) so that you estimate σ as the range divided by 6. If you cannot estimate σ in this way, you can conduct a small-scale study and estimate the standard deviation from the resulting data.
To explore how to determine the sample size needed for estimating the population mean, consider again the audit at Callistemon Camping Supplies. In Section 8.2, we selected a sample of 100 sales invoices and developed a 95% confidence interval estimate of the population mean sales invoice amount. How was this sample size determined? Should we have selected a different sample size?

Suppose that, after consultation with company officials, we determine that a sampling error of no more than ±$10 is desired, together with 95% confidence. Past data indicate that the standard deviation of the sales amount is approximately $50. Thus, e = $10, σ = $50 and Z = 1.96 (for 95% confidence). Using Equation 8.4:

\[ n = \frac{Z^{2}\sigma^{2}}{e^{2}} = \frac{(1.96)^{2}(50)^{2}}{(10)^{2}} = 96.04 \]

Because the general rule is to oversatisfy slightly the criteria by rounding the sample size up to the next whole integer, you should select a sample of size 97. Thus, the sample of size 100 used on page 287 is close to what is necessary to satisfy the needs of the company based on the estimated standard deviation, desired confidence level and sampling error. Because the calculated sample standard deviation is slightly higher than expected, $52.62 compared with $50.00, the confidence interval is slightly wider than desired. Figure 8.11 illustrates the Microsoft Excel worksheet to determine the sample size. For early versions of Excel use the formula =NORMSINV((1+B6)/2) in cell B9.

1 You use Z instead of t because to determine the critical value of t you need to know the sample size, but you do not know it yet. For most studies, the sample size needed is large enough that the standardised normal distribution is a good approximation of the t distribution.
Figure 8.11 Microsoft Excel 2016 worksheet for determining sample size for estimating the mean sales invoice amount for Callistemon Camping Supplies Pty Ltd

Row  A                                   B         Formula in column B
1    For the mean sales invoice amount
2
3    Data
4    Population standard deviation       50
5    Sampling error                      10
6    Confidence level                    95%
7
8    Intermediate calculations
9    Z value                             1.9600    =NORM.S.INV((1 + B6)/2)
10   Calculated sample size              96.0365   =((B9 * B4)/B5)^2
11
12   Result
13   Sample size needed                  97        =ROUNDUP(B10,0)

Example 8.5 illustrates another application of determining the sample size needed to develop a confidence interval estimate for the mean.

EXAMPLE 8.5 DETERMINING THE SAMPLE SIZE FOR THE MEAN
Returning to Example 8.3, suppose you want to estimate the population mean height for females who wear size 12 to within ±15 mm with 95% confidence. On the basis of a study taken the previous year, you believe that the standard deviation is 100 mm. Find the sample size needed.

SOLUTION Using Equation 8.4 on page 295 and e = 15, σ = 100 and Z = 1.96 for 95% confidence:

\[ n = \frac{Z^{2}\sigma^{2}}{e^{2}} = \frac{(1.96)^{2}(100)^{2}}{(15)^{2}} = 170.74 \]

Therefore, you should select a sample size of 171 women, because the general rule for determining sample size is always to round up to the next integer value in order to oversatisfy slightly the criteria desired. An actual sampling error slightly larger than 15 will result if the sample standard deviation calculated in this sample of 171 is greater than 100, and it will be slightly smaller if the sample standard deviation is less than 100.

Sample Size Determination for the Proportion

So far, we have seen how to determine the sample size needed for estimating the population mean.
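The worksheet logic in Figure 8.11 can also be sketched outside Excel. The following Python fragment is an illustrative sketch only, not part of the text's Excel workflow: the function name sample_size_for_mean is hypothetical, and the standard library's NormalDist is assumed as a stand-in for NORM.S.INV.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, e, confidence=0.95):
    """Equation 8.4: n = Z^2 * sigma^2 / e^2, rounded up to the next integer."""
    # Two-sided critical value, the same quantity as =NORM.S.INV((1 + B6)/2)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    n_raw = (z * sigma / e) ** 2
    return n_raw, math.ceil(n_raw)

# Callistemon Camping Supplies: sigma = $50, e = $10, 95% confidence
raw, n = sample_size_for_mean(sigma=50, e=10, confidence=0.95)
print(round(raw, 4), n)  # 96.0365 97, matching cells B10 and B13

# Example 8.5: sigma = 100 mm, e = 15 mm, 95% confidence
print(sample_size_for_mean(sigma=100, e=15)[1])  # 171
```

Because the unrounded Z value (about 1.95996) is used rather than 1.96, the intermediate result differs slightly from the hand calculation (96.0365 rather than 96.04), but the rounded-up sample size is the same.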
Now suppose that you want to determine the sample size necessary for estimating the proportion of sales invoices at Callistemon Camping Supplies that contain errors. To determine the sample size needed to estimate a population proportion (π), you use a method similar to that for a population mean. Recall that in developing the sample size for a confidence interval for the mean, the sampling error is defined by:

\[ e = Z\frac{\sigma}{\sqrt{n}} \]

LEARNING OBJECTIVE 3 Determine the sample size necessary to develop a confidence interval for the proportion

When estimating a proportion, you replace σ with \(\sqrt{\pi(1-\pi)}\). Thus, the sampling error is:

\[ e = Z\sqrt{\frac{\pi(1-\pi)}{n}} \]

Solving for n, you have the sample size necessary to develop a confidence interval estimate for a proportion.

SAMPLE SIZE DETERMINATION FOR THE PROPORTION
The sample size n is equal to the product of Z value squared, the population proportion π and 1 minus the population proportion π, divided by the sampling error e squared.

\[ n = \frac{Z^{2}\pi(1-\pi)}{e^{2}} \qquad (8.5) \]

To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the standardised normal distribution
2. the acceptable sampling error e
3. the population proportion π.

In practice, selecting these quantities requires some planning. Once you determine the desired level of confidence, you can find the appropriate Z value from the standardised normal distribution. The sampling error e indicates the amount of error that you are willing to tolerate in estimating the population proportion. The third quantity, π, is actually the population parameter that you want to estimate! How do you state a value for the very thing that you are taking a sample in order to determine?

There are two alternatives. In many situations, you may have past information or relevant experiences that provide an educated estimate of π.
If you do not, you can try to provide a value for π that would never underestimate the sample size needed. Referring to Equation 8.5, you can see that the quantity π(1 - π) appears in the numerator. Thus, you need to determine the value of π that will make the quantity π(1 - π) as large as possible. When π = 0.5, the product π(1 - π) achieves its maximum result. To show this, here are several values of π together with the accompanying products of π(1 - π):

When π = 0.9, π(1 - π) = (0.9)(0.1) = 0.09
When π = 0.7, π(1 - π) = (0.7)(0.3) = 0.21
When π = 0.5, π(1 - π) = (0.5)(0.5) = 0.25
When π = 0.3, π(1 - π) = (0.3)(0.7) = 0.21
When π = 0.1, π(1 - π) = (0.1)(0.9) = 0.09

Therefore, when you have no prior knowledge or estimate of the population proportion π, use π = 0.5 for determining the sample size. This produces the largest possible sample size and results in the highest possible cost of sampling. Using π = 0.5 may overestimate the sample size needed because you use the actual sample proportion in developing the confidence interval. You will get a confidence interval narrower than originally intended if the actual sample proportion is different from 0.5. The increased precision comes at the cost of spending more time and money for an increased sample size.

Returning to the Callistemon Camping Supplies scenario, suppose that the auditing procedures require you to have 95% confidence in estimating the population proportion of sales invoices with errors to within ±0.07. The results from past months indicate that the largest proportion has been no more than 0.15.
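Before Equation 8.5 is applied to this scenario, the effect of the planning value π on the required sample size can be checked numerically. The Python sketch below is illustrative only: the function name sample_size_for_proportion is hypothetical, and the sampling error of ±0.05 used in the loop is an assumed value chosen purely to compare different planning values of π.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(pi, e, confidence=0.95):
    """Equation 8.5: n = Z^2 * pi * (1 - pi) / e^2, rounded up."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided critical value
    n_raw = z ** 2 * pi * (1 - pi) / e ** 2
    return n_raw, math.ceil(n_raw)

# Compare planning values of pi at an assumed e = 0.05 and 95% confidence.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    raw, n = sample_size_for_proportion(pi, e=0.05)
    print(f"pi = {pi:.1f}: pi(1 - pi) = {pi * (1 - pi):.2f}, n = {n}")
```

The required sample size should peak at π = 0.5, which is why that value is the conservative choice when no prior estimate is available.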
Thus, using Equation 8.5 and e = 0.07, π = 0.15 and Z = 1.96 for 95% confidence:

\[ n = \frac{Z^{2}\pi(1-\pi)}{e^{2}} = \frac{(1.96)^{2}(0.15)(0.85)}{(0.07)^{2}} = 99.96 \]

Because the general rule is to round up the sample size to the next whole integer to slightly oversatisfy the criteria, a sample size of 100 is needed. Thus, the sample size needed to satisfy the requirements of the company based on the estimated proportion, desired confidence level and sampling error is equal to the sample size taken on page 292. The actual confidence interval is narrower than required since the sample proportion is 0.10, while 0.15 was used for π in Equation 8.5. Figure 8.12 shows a Microsoft Excel 2016 worksheet. Change the formula in cell B9 to =NORMSINV((1+B6)/2) for early versions of Excel. Example 8.6 provides a second application of determining the sample size for estimating the population proportion.

Figure 8.12 Microsoft Excel 2016 worksheet for determining sample size for estimating the proportion of sales invoices with errors for Callistemon Camping Supplies Pty Ltd

Row  A                                               B         Formula in column B
1    For the proportion of in-error sales invoices
2
3    Data
4    Estimate of true proportion                     0.15
5    Sampling error                                  0.07
6    Confidence level                                95%
7
8    Intermediate calculations
9    Z value                                         1.9600    =NORM.S.INV((1 + B6)/2)
10   Calculated sample size                          99.9563   =(B9^2 * B4 * (1 - B4))/B5^2
11
12   Result
13   Sample size needed                              100       =ROUNDUP(B10,0)

EXAMPLE 8.6 DETERMINING THE SAMPLE SIZE FOR THE POPULATION PROPORTION
You want to have 90% confidence of estimating the proportion of office workers who respond to email within an hour to within ±0.05. Because you have not previously undertaken such a study, there is no information available from past data. Determine the sample size needed.

SOLUTION Because no information is available from past data, assume π = 0.50.
Using Equation 8.5 and e = 0.05, π = 0.50 and Z = 1.645 for 90% confidence:

\[ n = \frac{(1.645)^{2}(0.50)(0.50)}{(0.05)^{2}} = 270.6 \]

Therefore, you need a sample of 271 office workers to estimate the population proportion to within ±0.05 with 90% confidence.

Problems for Section 8.4

LEARNING THE BASICS

8.31 If you want to be 95% confident of estimating the population mean to within a sampling error of ±5 and the standard deviation is assumed to be 15, what sample size is required?

8.32 If you want to be 99% confident of estimating the population mean to within a sampling error of ±20 and the standard deviation is assumed to be 100, what sample size is required?

8.33 If you want to be 99% confident of estimating the population proportion to within a sampling error of ±0.04, what sample size is needed?

8.34 If you want to be 95% confident of estimating the population proportion to within a sampling error of ±0.02 and there is historical evidence that the population proportion is approximately 0.40, what sample size is needed?

APPLYING THE CONCEPTS

8.35 A survey is planned to determine the mean annual family medical expenses of employees of a large company which subsidises the health insurance of its staff. The management of the company wishes to be 95% confident that the sample mean is correct to within ±$50 of the population mean annual family medical expenses. A previous study indicates that the standard deviation is approximately $400. a. How large a sample size is necessary? b. If management wants to be correct to within ±$25, what sample size is necessary?

8.36 If the manager of a paint supply store wants to estimate the mean amount of paint in a 4-litre can to within ±0.015 litres with 95% confidence and also assumes that the standard deviation is 0.075 litres, what sample size is needed?
8.37 If a quality control manager wants to estimate the mean life of a new type of LED light globe to within 1,000 hours with 95% confidence and also assumes that the population standard deviation is 5,000 hours, what sample size is needed? 8.38 The inspection division of a state department which regulates trade measurement wants to estimate the mean amount of soft-drink fill in 2-litre bottles to within ±0.01 litres with 95% confidence. If it assumes that the standard deviation is 0.05 litres, what sample size is needed? 8.39 A consumer group wants to estimate the mean electric bill for the month of July for single family homes in a large city. Based on studies conducted in other cities, the standard deviation is assumed to be $60. The group wants to estimate the mean bill for July to within ±$15 with 99% confidence. a. What sample size is needed? b. If 95% confidence is desired, what sample size is necessary? 8.40 An advertising agency that serves a major radio station wants to estimate the mean amount of time that the station’s audience spends listening to the radio daily. From past studies, the standard deviation is estimated as 45 minutes. a. What sample size is needed if the agency wants to be 90% confident of being correct to within ±5 minutes? b. If 99% confidence is desired, what sample size is necessary? 8.41 Suppose that an energy company wants to estimate its mean waiting time for natural gas installation to within ±5 days with 95% confidence. The company does not have access to previous data, but suspects that the standard deviation is approximately 20 days. What sample size is needed? 8.42 At a large South East Asian airport flights are classified as being ‘on time’ if they land less than 15 minutes after the scheduled time. A study of airlines using the airport finds that one of the airlines that services Australia has a record of 17% of flights arriving late. 
Suppose you were asked to perform a follow-up study for this airline in order to update the estimated proportion of late arrivals. What sample size would you use to estimate the population proportion to within a sampling error of: a. ±0.06 with 95% confidence? b. ±0.04 with 95% confidence? c. ±0.02 with 95% confidence?

8.43 The Nielsen company regularly conducts research into consumer purchases. Nielsen Homescan data for the 52 weeks ended 28 January 2017 showed that 34.5% of Australian homes had purchased Asian vegetables in that period. Households of 1–2 persons accounted for 47% of the volume in Asian vegetable sales. (Nielsen Insights <http://www.nielsen.com/au/en/insights/news/2017/green-eatersasian-vegetables-on-therise-in-australia.html> accessed 5 July 2017). Consider a follow-up study focusing on the latest calendar year. a. What sample size is needed to estimate the population proportion of Australian households that have purchased Asian vegetables to within ±0.02 with 95% confidence? b. What sample size is needed to estimate the population proportion of the volume of Asian vegetables that are purchased by 1–2 person households to within ±0.02 with 95% confidence? c. Compare the results of (a) and (b). Explain why these results differ. d. If you were to design a data collection method for a follow-up study, would you use one sample and collect data to answer both questions, or would you select two separate samples? Explain the rationale behind your decision.

8.44 Suppose that a survey of the audience at a Sydney Symphony Orchestra (SSO) concert has found that 48 out of 350 members of the audience who participated in the survey are visitors to Sydney. a.
Construct a 95% confidence interval for the population proportion of audience members at SSO concerts who are visitors to Sydney. b. Interpret the interval constructed in (a). c. To conduct a follow-up study that would provide 95% confidence that the point estimate is correct to within ±0.03 of the population proportion, how large a sample size is required? d. To conduct a follow-up study that would provide 99% confidence that the point estimate is correct to within ±0.03 of the population proportion, how large a sample size is required?

8.45 A study conducted by the Australian Securities Exchange found that 36% of 4,009 Australian adults surveyed in late 2014 held shares, either directly or indirectly through unlisted managed funds (Australian Securities Exchange, 2014 Australian Share Ownership Study, <www.asx.com.au/documents/resources/australian-share-ownership-study-2014.pdf> accessed 5 July 2017). a. Construct a 95% confidence interval for the proportion of Australian adults who held shares in late 2014. b. Interpret the interval constructed in (a). c. To conduct a follow-up study to estimate the population proportion of adults who currently hold shares to within ±0.01 with 95% confidence, how many adults would you interview?

8.5 APPLICATIONS OF CONFIDENCE INTERVAL ESTIMATION IN AUDITING

This chapter has focused on estimating either the population mean or the population proportion. Auditing is one area in business that makes widespread use of statistical sampling for the purposes of estimation.

AUDITING
Auditing is the collection and evaluation of evidence about information relating to an economic entity such as a sole business proprietor, a partnership, a corporation or a government agency in order to determine and report on how well the information corresponds to established criteria.

auditing A process of checking the accuracy of financial records.
Six advantages of statistical sampling in auditing are:
1. Results are objective and defensible. Because the sample size is based on demonstrable statistical principles, the audit is defensible before one’s superiors and in a court of law.
2. Statistical sampling provides an objective way of estimating the sample size in advance.
3. Statistical sampling provides an estimate of the sampling error.
4. Statistical sampling is often more accurate for drawing conclusions about large populations. Examining large populations is time-consuming and therefore often subject to more non-sampling error than a statistical sample.
5. Statistical sampling allows auditors to combine, and then evaluate collectively, samples collected by different individuals.
6. Statistical sampling allows auditors to generalise their findings to the population with a known sampling error.

Estimating the Population Total Amount

total amount The sum of all values.

In auditing applications we are often more interested in developing estimates of the population total amount than the population mean. Equation 8.6 shows how to estimate a population total amount.

ESTIMATING THE POPULATION TOTAL
The point estimate for the population total is equal to the population size N times the sample mean.

\[ \text{Total} = N\bar{X} \qquad (8.6) \]

Equation 8.7 defines the confidence interval estimate for the population total. The term \(\sqrt{(N-n)/(N-1)}\), the finite population correction factor, is included where sampling is from a finite population.

CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL

\[ N\bar{X} \pm N(t_{n-1})\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \qquad (8.7) \]

LEARNING OBJECTIVE 4 Recognise how to use confidence interval estimates in auditing

To demonstrate the application of the confidence interval estimate for the population total amount, we return to the Callistemon Camping Supplies scenario.
One of the auditing tasks is to estimate the total dollar amount of all sales invoices for the month. If there are 5,000 invoices for that month and \(\bar{X}\) = $110.27, then, using Equation 8.6:

\[ N\bar{X} = (5{,}000)(\$110.27) = \$551{,}350 \]

If n = 100 and S = $28.95, then, using Equation 8.7 with \(t_{99}\) = 1.9842 for 95% confidence:

\[ N\bar{X} \pm N(t_{n-1})\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = 551{,}350 \pm (5{,}000)(1.9842)\frac{28.95}{\sqrt{100}}\sqrt{\frac{5{,}000-100}{5{,}000-1}} \]
\[ = 551{,}350 \pm 28{,}721.295(0.99005) = 551{,}350 \pm 28{,}436 \]
\[ \$522{,}914 < \text{population total} < \$579{,}786 \]

Therefore, with 95% confidence, you estimate that the total amount of sales invoices is between $522,914 and $579,786. Example 8.7 further illustrates the population total.

EXAMPLE 8.7 DEVELOPING A CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
An auditor is faced with a population of 1,000 vouchers and wants to estimate the total value of the population of vouchers. A sample of 50 vouchers is selected with the following results:

Mean voucher amount (\(\bar{X}\)) = $1,076.39
Standard deviation (S) = $273.62

Construct a 95% confidence interval estimate of the total amount for the population of vouchers.

SOLUTION Using Equation 8.6, the point estimate of the population total is:

\[ N\bar{X} = (1{,}000)(1{,}076.39) = \$1{,}076{,}390 \]
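As a cross-check on the Callistemon interval above, Equations 8.6 and 8.7 can be sketched in Python. This is a minimal illustration under stated assumptions: the function name total_interval is hypothetical, and the t critical value is passed in as the tabulated figure quoted in the text (1.9842 for 99 degrees of freedom) rather than computed.

```python
import math

def total_interval(N, n, mean, s, t_crit):
    """Point estimate (Eq. 8.6) and margin (Eq. 8.7) for a population total."""
    point = N * mean                           # Equation 8.6
    fpc = math.sqrt((N - n) / (N - 1))         # finite population correction
    margin = N * t_crit * s / math.sqrt(n) * fpc
    return point, margin

# Callistemon: N = 5,000 invoices, n = 100, mean $110.27, S = $28.95
point, margin = total_interval(N=5000, n=100, mean=110.27, s=28.95, t_crit=1.9842)
print(f"point estimate: {point:,.0f}")
print(f"interval: {point - margin:,.0f} to {point + margin:,.0f}")
```

The bounds should agree with the hand calculation to within about a dollar; small differences arise because the text rounds the correction factor to 0.99005 before multiplying. The same function can be applied to the voucher data of Example 8.7 by changing the arguments.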
Auditors use difference estimation when they believe that errors exist in a set of items and they want to estimate the magnitude of the errors based only on a sample. The following steps are used in difference estimation: 1. Determine the sample size required. 2. Calculate the differences between the values reached during the audit and the original values recorded. The difference in value i, denoted Di, is equal to 0 if the auditor finds that the original value is correct, is a positive value when the audited value is larger than the original value, and is negative when the audited value is smaller than the original value. – 3. Calculate the mean difference in the sample (D) by dividing the total difference by the sample size, as shown in Equation 8.8. M E A N DIFFE R E N C E n D= ∑ Di i=1 (8.8) n where Di = audited value – original value 4. Calculate the standard deviation of the differences (SD), as shown in Equation 8.9. Remember that any item that is not in error has a difference value of 0. STA N DA R D DE VIAT I O N O F T HE D I F F E R E NC E n SD = 5. ∑ ( Di − D )2 i=1 (8.9) n−1 Use Equation 8.10 to construct a confidence interval estimate of the total difference in the population. Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e 8.5 Applications of Confidence Interval Estimation in Auditing 303 CO N FID E N CE IN T E R VA L E ST IM AT E F O R T HE TOTA L D I F F E R E NC E ND ± N (tn−1) SD N − n n N−1 (8.10) The auditing procedures for Callistemon Camping Supplies require a 95% confidence interval estimate of the difference between the actual dollar amounts on the sales invoice and the amounts entered into the integrated inventory and sales information system. 
Suppose that, in a sample of 100 sales invoices, you have 12 invoices in which the actual amount on the sales invoice and the amount entered into the integrated inventory and sales information system are different. These 12 differences < PARTS_INV > are:

$9.03 $7.47 $17.32 $8.30 $5.21 $10.80 $6.22 $5.63 $4.97 $7.43 $2.99 $4.63

The other 88 invoices are not in error. Their differences are each 0. Thus:

\[ \bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = \frac{90}{100} = 0.90 \]

and:

\[ S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^{2}}{n-1}} = \sqrt{\frac{(9.03-0.9)^{2} + (7.47-0.9)^{2} + \dots + (0-0.9)^{2}}{100-1}} \]

(In the numerator, there are 100 differences. The last 88 are all \((0-0.9)^{2}\).)

\[ S_D = 2.7518 \]

Using Equation 8.10, construct the confidence interval estimate for the total difference in the population of 5,000 sales invoices as follows:

\[ (5{,}000)(0.90) \pm (5{,}000)(1.9842)\frac{2.7518}{\sqrt{100}}\sqrt{\frac{5{,}000-100}{5{,}000-1}} = 4{,}500 \pm 2{,}702.89 \]
\[ \$1{,}797.11 < \text{total difference} < \$7{,}202.89 \]

Thus, the auditor estimates with 95% confidence that the total difference between the sales invoices, as determined during the audit, and the amount originally entered into the accounting system is between $1,797.11 and $7,202.89.

In the previous example, all 12 differences are positive because the actual amount on the sales invoice is more than the amount entered into the accounting system. In some circumstances you could have negative errors. Example 8.8 illustrates such an occurrence.

EXAMPLE 8.8 DIFFERENCE ESTIMATION
Returning to Example 8.7, suppose that 14 vouchers contain errors in the sample of 50 vouchers.
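Before the voucher errors are worked through, the Callistemon difference estimate above can be reproduced with a short Python sketch of Equations 8.8 to 8.10. It is illustrative only: the function name total_difference_interval is hypothetical, the t critical value is taken from the text rather than computed, and items not in error are padded in as zero differences.

```python
import math
from statistics import mean, stdev

def total_difference_interval(nonzero_diffs, n, N, t_crit):
    """Equations 8.8 to 8.10: interval estimate for the total difference."""
    # Items found to be correct contribute a difference of 0.
    diffs = list(nonzero_diffs) + [0.0] * (n - len(nonzero_diffs))
    d_bar = mean(diffs)                     # Equation 8.8
    s_d = stdev(diffs)                      # Equation 8.9 (divisor n - 1)
    fpc = math.sqrt((N - n) / (N - 1))      # finite population correction
    margin = N * t_crit * s_d / math.sqrt(n) * fpc  # Equation 8.10
    return d_bar, s_d, N * d_bar - margin, N * d_bar + margin

# The 12 non-zero Callistemon differences; the other 88 invoices are correct.
callistemon = [9.03, 7.47, 17.32, 8.30, 5.21, 10.80,
               6.22, 5.63, 4.97, 7.43, 2.99, 4.63]
d_bar, s_d, low, high = total_difference_interval(
    callistemon, n=100, N=5000, t_crit=1.9842)
print(f"mean difference = {d_bar:.2f}, S_D = {s_d:.4f}")
print(f"total difference between {low:,.2f} and {high:,.2f}")
```

The fourteen voucher differences of Example 8.8, listed next, can be passed to the same function with n = 50, N = 1,000 and a t critical value of 2.0096; results should match the hand calculations up to rounding.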
The values < DIFF_TEST > of the 14 errors are as follows, in which two differences are negative:

$75.41 $127.74 $38.97 $55.42 $108.54 $39.03 –$37.18 $29.41 $62.75 $47.99 $118.32 $28.73 –$88.84 $84.05

Construct a 95% confidence interval estimate of the total difference in the population of 1,000 vouchers.

SOLUTION For these data:

\[ \bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = \frac{690.34}{50} = 13.8068 \]

and:

\[ S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^{2}}{n-1}} = \sqrt{\frac{(75.41-13.8068)^{2} + (38.97-13.8068)^{2} + \dots + (0-13.8068)^{2}}{50-1}} = 37.427 \]

Using Equation 8.10, construct the confidence interval estimate for the total difference in the population:

\[ (1{,}000)(13.8068) \pm (1{,}000)(2.0096)\frac{37.427}{\sqrt{50}}\sqrt{\frac{1{,}000-50}{1{,}000-1}} = 13{,}806.8 \pm 10{,}372.63 \]
\[ \$3{,}434.17 < \text{total difference} < \$24{,}179.43 \]

Therefore, with 95% confidence you estimate that the total difference in the population of vouchers is between $3,434.17 and $24,179.43.

LEARNING OBJECTIVE 4 Recognise how to use confidence interval estimates in auditing

one-sided confidence interval Gives only an upper or lower bound to the value of the population parameter.

One-Sided Confidence Interval Estimation of the Rate of Non-Compliance with Internal Controls

Organisations use internal control mechanisms to ensure that individuals act in accordance with company guidelines. For example, Callistemon Camping Supplies requires that an authorised delivery docket is completed before goods are removed from the warehouse. During the monthly audit of the company, the auditing team is charged with the task of estimating the proportion of times goods were removed without proper authorisation. This is referred to as the rate of non-compliance with the internal control. To estimate the rate of non-compliance, auditors take a random sample of sales invoices and determine how often merchandise was shipped without an authorised delivery docket.
The auditors then compare their results with a previously established tolerable exception rate, which is the maximum allowable proportion of items in the population not in compliance. When estimating the rate of non-compliance, it is commonplace to use a one-sided confidence interval. That is, the auditors estimate an upper bound on the rate of non-compliance. Equation 8.11 defines a one-sided confidence interval for a proportion.

ONE-SIDED CONFIDENCE INTERVAL FOR A PROPORTION

\[ \text{Upper bound} = p + Z\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}} \qquad (8.11) \]

where Z = the value corresponding to a cumulative area of (1 - α) from the standardised normal distribution – that is, a right-hand tail probability of α.

If the tolerable exception rate is higher than the upper bound, then the auditor concludes that the company is in compliance with the internal control. If the upper bound is higher than the tolerable exception rate, the auditor concludes that the control non-compliance rate is too high. The auditor may then request a larger sample.

Suppose that, in the monthly audit, you select 400 of the sales invoices from a population of 10,000 invoices. In the sample of 400 sales invoices, 20 are in violation of the internal control. If the tolerable exception rate for this internal control is 6%, what should you conclude? Use a 95% level of confidence.

The one-sided confidence interval is calculated using p = 20/400 = 0.05 and Z = 1.645. Using Equation 8.11:

\[ \text{Upper bound} = p + Z\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{N-n}{N-1}} = 0.05 + 1.645\sqrt{\frac{0.05(1-0.05)}{400}}\sqrt{\frac{10{,}000-400}{10{,}000-1}} \]
\[ = 0.05 + 1.645(0.0109)(0.98) = 0.05 + 0.0176 = 0.0676 \]

Thus, you have 95% confidence that the rate of non-compliance is less than 6.76%.
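The upper bound just calculated can be checked with a short Python sketch of Equation 8.11. The function name upper_bound is hypothetical, and the standard library's NormalDist is assumed as the source of the one-sided Z value (cumulative area 1 - α), so this is an illustration rather than a prescribed audit tool.

```python
import math
from statistics import NormalDist

def upper_bound(x, n, N, confidence=0.95):
    """Equation 8.11: one-sided upper bound for a proportion."""
    p = x / n                                  # sample proportion
    z = NormalDist().inv_cdf(confidence)       # one-sided: area 1 - alpha
    se = math.sqrt(p * (1 - p) / n)            # standard error of p
    fpc = math.sqrt((N - n) / (N - 1))         # finite population correction
    return p + z * se * fpc

# Monthly audit: 20 violations in 400 invoices sampled from 10,000
print(round(upper_bound(x=20, n=400, N=10_000), 4))  # 0.0676
```

Note that the one-sided critical value uses the full confidence level (0.95 gives Z of about 1.645), unlike the two-sided intervals earlier in the chapter, which use (1 + confidence)/2.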
Because the tolerable exception rate is 6%, the rate of non-compliance may be too high for this internal control. In other words, it is possible that the non-compliance rate for the population is higher than the rate deemed tolerable. Therefore, you should request a larger sample.

In many cases, the auditor is able to conclude that the rate of non-compliance with the company’s internal controls is acceptable. Example 8.9 illustrates such an occurrence.

EXAMPLE 8.9 ESTIMATING THE RATE OF NON-COMPLIANCE

A large electronics firm makes one million direct debit payments a year. An internal control policy requires that each payment is made only after an invoice has been authorised by an accounts payable supervisor. The company’s tolerable exception rate for this control is 4%. If control deviations are found in 8 of the 400 invoices sampled, what should the auditor do? Use a 95% level of confidence.

SOLUTION The auditor constructs a 95% one-sided confidence interval for the proportion of invoices in non-compliance and compares this with the tolerable exception rate. Using Equation 8.11, p = 8/400 = 0.02 and Z = 1.645 for 95% confidence:

\[ \text{Upper bound} = p + Z\sqrt{\frac{p(1 - p)}{n}}\sqrt{\frac{N - n}{N - 1}} = 0.02 + 1.645\sqrt{\frac{0.02(1 - 0.02)}{400}}\sqrt{\frac{1{,}000{,}000 - 400}{1{,}000{,}000 - 1}} \]

\[ = 0.02 + 1.645(0.007)(0.9998) = 0.02 + 0.0115 = 0.0315 \]

The auditor concludes with 95% confidence that the rate of non-compliance is less than 3.15%. Since this is less than the tolerable exception rate, the auditor concludes that the internal control compliance is adequate. In other words, the auditor is more than 95% confident that the rate of non-compliance is less than 4%.

Problems for Section 8.5

LEARNING THE BASICS

8.46 A sample of 25 is selected from a population of 500 items.
The sample mean is 25.7 and the sample standard deviation is 7.8. Construct a 99% confidence interval estimate of the population total.

8.47 Suppose that a sample of 200 is selected from a population of 10,000 items. Ten items are found to have errors of the following amounts: < ITEM_ERR >

13.76 42.87 34.65 11.09 14.54 22.87 25.52 9.81 10.03 15.49

Construct a 95% confidence interval estimate of the total difference in the population.

8.48 If p = 0.04, n = 300 and N = 5,000, calculate the upper bound for a one-sided confidence interval estimate of the population proportion, π, using a level of confidence of:
a. 90%
b. 95%
c. 99%

APPLYING THE CONCEPTS

8.49 A stationery store wants to estimate the total retail value of the 300 greeting cards it has in its inventory. Construct a 95% confidence interval estimate of the population total value of all greeting cards that are in the inventory if a random sample of 20 greeting cards indicates an average value of $5.45 and a standard deviation of $0.82.

8.50 The personnel department of a large corporation employing 3,000 workers wants to estimate the family dental expenses of its employees to determine the feasibility of providing a dental insurance plan. A random sample of 10 employees reveals the following family dental expenses (in dollars) for the preceding year: < DENTAL >

1,110 362 2,320 1,930 3,210 208 1,730 825 616 1,179

Construct a 90% confidence interval estimate of the total family dental expenses for all employees in the preceding year.

8.51 A branch of a chain of large electronics stores is conducting an end-of-month inventory of the merchandise in stock. There are 1,546 items in inventory at the time. A sample of 50 items is randomly selected and an audit conducted, with the following results:

Value of merchandise: X̄ = $252.28, S = $93.67

Construct a 95% confidence interval estimate of the total value of the merchandise in the inventory at the end of the month.

8.52 When a trade discount is allowed by wholesalers for particular types of early payments by customers, the Goods and Services Tax (GST) payable to the Australian Tax Office needs to be adjusted. A sample of 150 items selected from a population of 4,000 invoices at the end of a period of time revealed that in 13 cases staff failed to adjust the GST amount correctly. The amounts (in dollars) by which GST was overcharged in these 13 cases are: < DISCOUNT >

6.45 15.32 97.36 230.63 104.18 84.92 132.76 66.12 26.55 129.43 88.32 47.81 89.01

Construct a 99% confidence interval estimate of the population total amount of GST overcharged.

8.53 Econe Pty Ltd is a small company that manufactures women’s dresses for sale to specialty stores. There are 1,200 inventory items, and the historical cost is recorded on a first in, first out (FIFO) basis. In the past, approximately 15% of the inventory items were incorrectly priced. However, any misstatements were usually not significant. A sample of 120 items was selected and the historical cost of each item compared with the audited value. The results indicated that 15 items differed in their historical cost and audited value. These differences were as follows: < FIFO >

Sample number   Historical cost ($)   Audited value ($)
5               261                   240
9               87                    105
17              201                   276
18              121                   110
28              315                   298
35              411                   356
43              249                   211
51              216                   305
60              21                    210
73              140                   152
86              129                   112
95              340                   216
96              341                   402
107             135                   97
119             228                   220

Construct a 95% confidence interval estimate of the total population difference in the historical cost and audited value.

8.54 The Snowy Ski Centre Pty Ltd conducts an annual audit of its financial records. An internal control policy for the company is that a cheque can be issued only after the accounts payable manager initials the invoice. The tolerable exception rate for this internal control is 0.04.
During an audit, a sample of 300 invoices is examined from a population of 10,000 invoices and 11 invoices are found to violate the internal control.
a. Calculate the upper bound for a 95% one-sided confidence interval estimate for the rate of non-compliance.
b. Based on (a), what should the auditor conclude?

8.6 MORE ON CONFIDENCE INTERVAL ESTIMATION AND ETHICAL ISSUES

You should be aware that when sampling is done without replacement from a finite population, an adjustment to the standard error of the mean or standard error of the proportion is required. This has been included in Equations 8.7, 8.10 and 8.11, where standard errors have been multiplied by the correction factor \(\sqrt{(N - n)/(N - 1)}\). The correction factor is used in confidence intervals for the population mean and proportion when the sample size, n, is large in relation to the population size, N (i.e. n is more than 5% of N).

Ethical issues relating to the selection of samples and the inferences that accompany them can arise in several ways. The major ethical issue relates to whether or not confidence interval estimates are provided together with the sample statistics. Providing a sample statistic without the confidence interval limits (typically set at 95%), the sample size used and an interpretation of the confidence interval in terms that a layperson can understand raises ethical issues. Failure to include a confidence interval estimate might mislead the user of the results into thinking that the point estimate is all that is needed to predict the population characteristic with certainty.
Thus, it is important that you indicate the interval estimate in a prominent place in any written communication, together with a simple explanation of the meaning of the confidence interval. In addition, you should highlight the size of the sample.

Ethical issues concerning estimation most commonly occur in the publication of the results of political polls. Often the results of the polls are highlighted in a prominent part of the newspaper, while the sampling error involved and the methodology used are printed on the page where the article is continued, frequently in the middle of the newspaper in print editions or behind a separate link in online ones. To ensure an ethical presentation of statistical results, the confidence levels, sample size and confidence limits should be made available for all surveys and other statistical studies.

think about this

Reporting poll results

Let’s imagine that a newspaper reports the following table in both its print and online editions.

State premier’s performance

               July–Sept 2016 (%)   Oct–Dec 2016 (%)   Jan–Mar 2017 (%)   Mar–Jun 2017 (%)   July–Sept 2017 (%)
Satisfied      52                   50                 48                 42                 33
Dissatisfied   33                   33                 41                 46                 57
Uncommitted    15                   17                 11                 12                 10

In the print edition it shows this extra information immediately below the table. In the online edition readers need to click on a link to see it.

Question: Are you satisfied or dissatisfied with the way the current state premier is performing? This poll was carried out by a phone interview of the state’s voters, with the number in each poll being a constant percentage of the estimated number of voters. The latest survey interviewed 1,560 voters.

Do you think the variation in display methods between the print and online editions will alter the way readers interpret the poll results? What other information is necessary for you to be able to evaluate the poll results effectively?
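To make the idea of reporting confidence limits alongside a poll figure concrete, the sketch below computes an approximate 95% interval for the latest 'Satisfied' figure (33% of 1,560 voters). It is an illustration under loud assumptions: the poll is treated as a simple random sample, no finite population correction is applied, and the function name `proportion_interval` is hypothetical. The newspaper's actual methodology is not stated in the text.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p, n, confidence=0.95):
    """Two-sided confidence interval for a proportion, p +/- Z*sqrt(p(1-p)/n).

    Assumes a simple random sample (an assumption of this sketch,
    not something the poll description confirms).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed critical value
    margin = z * sqrt(p * (1 - p) / n)                  # sampling error
    return p - margin, p + margin


# Latest poll: 33% satisfied, 1,560 voters interviewed.
low, high = proportion_interval(0.33, 1560)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Publishing the limits, the confidence level and the sample size together, as here, is the kind of presentation the section argues for.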
8 Assess your progress

Summary

This chapter has discussed confidence intervals for estimating the characteristics of a population, and explained how to determine the necessary sample size. We showed how an accountant of Callistemon Camping Supplies can use the sample data from an audit to estimate important population parameters such as the mean dollar amount on invoices and the proportion of shipments that are made without proper authorisation.

To determine which equation to use for a particular situation, you need to ask several questions:
• Are you developing a confidence interval or are you determining sample size?
• Do you have a numerical variable or do you have a categorical variable?
• If you have a numerical variable, do you know the population standard deviation? If you do, use the normal distribution. If you do not, use the t distribution.

The next three chapters develop a hypothesis-testing approach that makes decisions about population parameters.
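The three questions in the summary amount to a small decision procedure. One way to picture it is the sketch below, which maps the answers to the chapter's Equations 8.1 to 8.5. The function name `choose_formula` and its argument values are hypothetical labels introduced only for this illustration.

```python
def choose_formula(goal, variable, sigma_known=False):
    """Pick the chapter equation that matches the summary's three questions.

    goal: "interval" or "sample size"
    variable: "numerical" or "categorical"
    sigma_known: only matters for an interval on a numerical variable
    (argument names and values are assumptions of this sketch)
    """
    if variable not in ("numerical", "categorical"):
        raise ValueError("variable must be 'numerical' or 'categorical'")
    if goal == "interval":
        if variable == "categorical":
            return "Equation 8.3: confidence interval for the proportion (Z)"
        if sigma_known:
            return "Equation 8.1: confidence interval for the mean, sigma known (Z)"
        return "Equation 8.2: confidence interval for the mean, sigma unknown (t)"
    if goal == "sample size":
        if variable == "categorical":
            return "Equation 8.5: sample size for the proportion"
        return "Equation 8.4: sample size for the mean"
    raise ValueError("goal must be 'interval' or 'sample size'")


# Example: mean invoice amount, population standard deviation not known.
print(choose_formula("interval", "numerical", sigma_known=False))
```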
Key formulas

Confidence interval for the mean (σ known)

\[ \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \bar{X} - Z\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z\frac{\sigma}{\sqrt{n}} \qquad (8.1) \]

Confidence interval for the mean (σ unknown)

\[ \bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}} \quad \text{or} \quad \bar{X} - t_{n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1}\frac{S}{\sqrt{n}} \qquad (8.2) \]

Confidence interval estimate for the proportion

\[ p \pm Z\sqrt{\frac{p(1 - p)}{n}} \quad \text{or} \quad p - Z\sqrt{\frac{p(1 - p)}{n}} < \pi < p + Z\sqrt{\frac{p(1 - p)}{n}} \qquad (8.3) \]

Sample size determination for the mean

\[ n = \frac{Z^2\sigma^2}{e^2} \qquad (8.4) \]

Sample size determination for the proportion

\[ n = \frac{Z^2\pi(1 - \pi)}{e^2} \qquad (8.5) \]

Key terms

auditing
confidence interval estimate
critical value
deductive reasoning
degrees of freedom
difference estimation
inductive reasoning
level of confidence
one-sided confidence interval
point estimate
sampling error
Student’s t distribution
total amount

Chapter review problems

CHECKING YOUR UNDERSTANDING

8.55 Why is it that you can never really have 100% confidence of correctly estimating the population characteristic of interest?

8.56 When do you use the t distribution to develop the confidence interval estimate for the mean?

8.57 Why is it true that, for a given sample size n, an increase in confidence is achieved by widening (and making less precise) the confidence interval?

8.58 Under what circumstances do you use a one-sided confidence interval instead of a two-sided confidence interval?

8.59 When would you want to estimate the population total instead of the population mean?

8.60 How does difference estimation differ from estimating the mean?

APPLYING THE CONCEPTS

You can solve problems 8.61 to 8.75 with or without a computer. You should use Microsoft Excel or another program to solve problems 8.76 to 8.80.

8.61 A trade union of medical workers conducted a survey through its website about preferred working hours. Hospital workers visiting the website were given the opportunity to fill out an on-screen survey form. A total of 665 workers responded to a question that asked whether they would prefer a five-day working week with eight-hour shifts, or seven 12-hour shifts per fortnight. Twelve-hour shifts were the preference for 412 of the respondents.
a. Define the population from which this sample was drawn.
b. Is this a random sample from this population?
c. Is this a statistically valid study?
d. Describe how you would design a statistically valid study to investigate the proportion of hospital workers who would prefer 12-hour shifts rather than a five-day working week. Use the information above to determine the sample size needed to estimate this population proportion to within ±0.02 with 95% confidence.

8.62 In 2014–15 the Australian Bureau of Statistics conducted a multipurpose household survey and had responses from 13,686 individuals in private dwellings on their use of information technology. Assume there were 477 15–17 year old respondents and 2,256 45–54 year old respondents who used the Internet to purchase goods or services online. For 45–54 year olds, 49.2% of online purchases were on travel, accommodation or related services. By comparison, for 15–17 year olds, 60% of purchases were of music, movies, electronic games or books (Australian Bureau of Statistics, Household Use of Information Technology, Australia, 2014–15, Cat. No. 8146.0, 2016).
a. Construct a 95% confidence interval for the population proportion of all Australian Internet purchasers aged 45–54 who bought travel, accommodation or related services online in 2014–15.
b. Construct a 95% confidence interval for the population proportion of all Australian Internet purchasers aged 15–17 who bought music, movies, electronic games or books online in 2014–15.
c. Construct a 99% confidence interval for the population proportion of all Australian Internet purchasers aged 15–17 who bought music, movies, electronic games or books online in 2014–15.

8.63 The KPMG 2016 report, Global Profiles of the Fraudster, gives details of a survey of investigations between March 2013 and August 2015 relating to frauds committed by 750 people worldwide. Where fraudsters were working in collaboration with others, the most common means of detection were tip-offs and complaints (31%), but fraudsters acting alone were most often detected by management review (25%) (Global Profiles of the Fraudster: Technology and Weak Controls <https://assets.kpmg.com/content/dam/kpmg/pdf/2016/06/profiles-of-the-fraudster-au.pdf> accessed 6 July 2017). Suppose the percentages above are based on 210 single-person frauds and 240 frauds where there was collusion.
a. Find a 95% confidence interval for the proportion of all single-person fraud incidents that are detected by management review.
b. Find a 95% confidence interval for the proportion of all fraud collusion incidents that are detected due to tip-offs and complaints.

8.64 The Legal Services Council conducted a consumer survey in 2017 which asked respondents about different attitudes and experiences relating to legal costs. One question asked: ‘How well did you understand what the costs were likely to be?’ (Legal Services Council Consumer Survey 2017, <www.legalservicescouncil.org.au/Documents/consultation/LSC_Consumer_Survey_Report.pdf> accessed 6 July 2017). There were 1402 replies to this question.
The percentages of those who replied to each category were:
• Understood well: 21%
• Understood adequately: 33%
• Understood a little: 34%
• I did not understand: 12%
Construct 95% confidence interval estimates for each of these categories. What conclusions can you reach about consumers’ understanding of legal costs from these results?

8.65 A study by the Australian Bureau of Statistics looked at health and activity habits of various groups of Australians. It found that only 43.3% of males aged 35–44 had participated in sufficient physical activity in the last week for health purposes. It also found that males of all ages spent an average of 12.9 hours in the last week sitting to watch television or videos (Australian Bureau of Statistics, Australian Health Survey: Physical Activity, 2011–12, Cat. No. 4364.0.55.004, 2013). Assume you have two samples with 500 males aged 35–44 and 2000 males of all ages. Assume the values given above apply and S = 2.8 hours sitting time.
a. Construct a 95% confidence interval estimate for the mean time males sit per week to watch television or videos.
b. Construct a 95% confidence interval estimate for the population proportion of 35–44-year-old males who participate in sufficient activity per week for health.
If you want to take another survey in future, answer the following questions:
c. What sample size is required to be 95% confident of estimating the population mean to within ±2 hours assuming that the population standard deviation is equal to 3 hours?
d. What sample size is needed to be 95% confident of being within ±0.035 of the population proportion of 35–44 year old males who participate in sufficient activity if no previous estimate is available?

8.66 A researcher for a state government agriculture department wants to study various characteristics of medium-sized farms in the state. A random sample of 70 farms of between 100 and 600 hectares reveals the following:
• average area X̄ = 350 hectares, standard deviation S = 70 hectares
• 21 farms are engaged primarily in beef cattle production
a. Construct a 99% confidence interval estimate of the population mean area of medium-sized farms.
b. Construct a 95% confidence interval estimate of the population proportion of medium-sized farms which are primarily beef cattle producers.

8.67 The personnel manager of a large corporation wishes to study absenteeism among clerical workers at the corporation’s central office during the year. A random sample of 25 clerical workers reveals the following:
• absenteeism: X̄ = 9.7 days, S = 4.0 days
• 12 clerical workers were absent for more than 10 days
a. Construct a 95% confidence interval estimate of the mean number of absences for clerical workers last year.
b. Construct a 95% confidence interval estimate of the population proportion of clerical workers absent for more than 10 days last year.
If the personnel manager also wishes to take a survey in a branch office, answer these questions:
c. What sample size is needed to have 95% confidence in estimating the population mean to within ±1.5 days if the population standard deviation is 4.5 days?
d. What sample size is needed to have 90% confidence in estimating the population proportion to within ±0.075 if no previous estimate is available?
e. Based on (c) and (d), what sample size is needed if a single survey is being conducted?

8.68 The market research manager for Dalton’s department store wants to study women’s spending on cosmetics. A survey is designed to estimate the proportion of women who purchase their cosmetics primarily from Dalton’s department store, and the mean yearly amount that women spend on cosmetics. A previous survey found that the standard deviation of the amount women spend on cosmetics in a year is approximately $64.70.
a. What sample size is needed to have 99% confidence of estimating the population mean to within ±$5?
b. What sample size is needed to have 90% confidence of estimating the population proportion to within ±0.045?
c. Based on the results in (a) and (b), how many of the store’s female customers should be sampled? Explain.

8.69 A survey of Internet shopping for goods looked at how much shoppers spent on online purchases of clothing, footwear and accessories in the past year. The results from a sample of 270 customers are as follows:
• amount spent: X̄ = $528.90, S = $113.90
• 108 customers stated that they made the majority of purchases at overseas sites
a. Construct a 95% confidence interval estimate of the population mean amount spent on Internet purchases of clothing, footwear and accessories in the past year.
b. Construct a 90% confidence interval estimate of the population proportion of customers who have made the majority of purchases on overseas sites.
Assume that you wish to run a similar survey for the coming year.
c. What sample size is needed to have 95% confidence of estimating the population mean amount spent on online purchases of clothing, footwear and accessories to within ±$1.20 if the standard deviation is assumed to be $10?
d. What sample size is needed to have 90% confidence of estimating the population proportion that will make the majority of purchases on overseas sites to within ±0.04?
e. Based on your answers to (c) and (d), how large a sample should be taken?

8.70 The branch manager of an outlet (store 1) of a nationwide chain of pet supply stores wants to study the characteristics of her customers. In particular, she decides to focus on two variables: the amount of money spent by customers and whether the customers own only one dog, only one cat, or more than one dog and/or cat.
The results from a sample of 70 customers are shown below:
• amount of money spent: X̄ = $21.34, S = $9.22
• 37 customers own only a dog
• 26 customers own only a cat
• 7 customers own more than one dog and/or cat
a. Construct a 95% confidence interval estimate of the population mean amount spent in the pet supply store.
b. Construct a 90% confidence interval estimate of the population proportion of customers who own only a cat.
The branch manager of another outlet (store 2) wishes to conduct a similar survey in his store. The manager does not have any access to the information generated by the manager of store 1.
c. What sample size is needed to have 95% confidence of estimating the population mean amount spent in his store to within ±$1.50 if the standard deviation is $10?
d. What sample size is needed to have 90% confidence of estimating the population proportion of customers who own only a cat to within ±0.045?
e. Based on your answers to (c) and (d), how large a sample should the manager take?

8.71 The owner of a restaurant that serves continental food wants to study the characteristics of his customers. He decides to focus on two variables: the amount of money spent per diner on food and whether diners order dessert. The results from a sample of 60 diners are as follows:
• amount spent: X̄ = $47.20, S = $8.60
• number of diners who purchased dessert: 18
a. Construct a 95% confidence interval estimate of the population mean amount spent per diner on food.
b. Construct a 90% confidence interval estimate of the population proportion of diners who purchase dessert.
The owner of a competing restaurant wants to conduct a similar survey in her restaurant. This owner does not have access to the information generated by the owner of the first restaurant.
c. What sample size is needed to have 95% confidence of estimating the population mean amount spent by each diner on food in her restaurant to within ±$1.50, assuming the standard deviation is $9?
d. What sample size is needed to have 90% confidence of estimating the population proportion of diners who purchase dessert to within ±0.04?
e. Based on your answers to (c) and (d), how large a sample should the owner take?

8.72 The manufacturer of Tuffstuff concrete pavers claims its products have a breaking strength of 5 kN. A representative of a building advisory organisation is interested in assessing this claim and sends a number of pavers to be tested in a laboratory. The representative wants to know with 95% confidence, within ±0.05, what proportion of pavers perform the job as claimed by the manufacturer.
a. How many pavers does the laboratory need to test? What assumption should be made about the population proportion?
The laboratory tests 50 pavers, and 42 have the breaking strength claimed.
b. Construct a 95% confidence interval estimate for the population proportion that have the breaking strength claimed.
c. How can the representative use the results of (b) to advise the public about the product?

8.73 An auditor needs to estimate the percentage of times a company fails to follow an internal control procedure. A sample of 50 from a population of 1,000 items is selected, and in 7 instances the internal control procedure was not followed.
a. Construct a 90% one-sided confidence interval estimate of the population proportion of items in which the internal control procedure was not followed.
b. If the tolerable exception rate is 0.15, what should the auditor conclude?

8.74 An auditor for a government agency needs to evaluate payments that were made by Medicare for consultations in doctors’ surgeries in a particular postcode area during June. A total of 25,056 visits occurred during June in this area. The auditor wants to estimate the total amount paid by Medicare to within ±$10 with 95% confidence. On the basis of past experience, she believes that the standard deviation is approximately $60.
a. What sample size should she select?
Using the sample size selected in (a), an audit is conducted. It is discovered that for 12 of the surgery consultations an incorrect amount of reimbursement was provided.

Amount of reimbursement: X̄ = $98.70, S = $44.55

For the 12 surgery consultations for which incorrect reimbursement was provided, the differences between the amount reimbursed and the amount that the auditor determined should have been reimbursed were: < MEDICARE >

$17 $25 $14 –$10 $20 $40 $35 $30 $28 $22 $15 $5

b. Construct a 90% confidence interval estimate of the population proportion of reimbursements that contain errors.
c. Construct a 95% confidence interval estimate of the population mean reimbursement per surgery consultation.
d. Construct a 95% confidence interval estimate of the population total amount of reimbursements for this postcode area for consultations in June.
e. Construct a 95% confidence interval estimate of the total difference between the amount reimbursed and the amount that should have been reimbursed.

8.75 A large computer store is conducting an end-of-month inventory of the tablet computers in stock. An auditor for the store wants to estimate the mean value of the tablets in stock at that time. He wants to have 99% confidence that his estimate of the mean value is correct to within ±$23. On the basis of past experience, he estimates that the standard deviation of the value of a tablet is $45.
a. What sample size should he select?
b. Using the sample size selected in (a), an audit is conducted with the following results: X̄ = $575, S = $72.20. Construct a 99% confidence interval estimate of the total value of the tablets in stock at the end of the month if there were 258 tablets listed in the inventory.
8.76 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags. In this example, the label weight on the package indicates that the mean amount of tea in a bag is 5.5 g. If the bags are underfilled, two problems arise. First, customers may not be able to brew the tea to be as strong as they wish. Second, the company may be in violation of the law because of misleading labelling. On the other hand, if the mean amount of tea in a bag exceeds the label weight, the company is giving away product. Getting an exact amount of tea in a bag is problematic because of variation in the temperature and humidity inside the factory, differences in the density of the tea, and the extremely fast filling operation of the machine (approximately 170 bags a minute). The following table provides the weight in grams of a sample of 50 tea-bags produced in one hour by a single machine. < TEABAGS >

Weight of tea-bags in grams
5.65 5.57 5.47 5.77 5.61 5.44 5.40 5.40 5.57 5.45
5.42 5.40 5.53 5.34 5.54 5.45 5.53 5.54 5.55 5.62
5.56 5.46 5.47 5.61 5.53 5.32 5.67 5.29 5.42 5.58
5.58 5.50 5.32 5.50 5.44 5.25 5.56 5.63 5.50 5.57
5.52 5.44 5.49 5.53 5.67 5.41 5.51 5.55 5.58 5.36

a. Construct a 99% confidence interval estimate of the population mean weight of the tea-bags.
b. Is the company meeting the requirement set forth on the label that the mean amount of tea in a bag is 5.5 g?

8.77 A manufacturing company produces steel housings for electrical equipment. The main component of the housing is a steel trough made out of a 2-mm steel coil. It is produced using a 250-tonne progressive punch press with a wipe-down operation that puts two 90-degree forms in the flat steel to make the trough. The distance from one side of the form to the other is critical because of weatherproofing in outdoor applications. The data from a sample of 49 troughs follow: < TROUGH >

Width of trough (in mm)
203.12 204.22 204.98 204.29 204.10 204.27 203.43
204.76 204.47 204.58 204.05 204.20 203.17 203.82
204.36 204.62 203.23 204.98 203.83 204.84 204.13
204.60 204.20 204.09 203.48 204.03 204.89 204.44
203.96 204.10 204.14 204.14 204.29 204.47 203.51
204.19 204.81 204.60 204.05 203.73 203.85 204.15
204.12 204.39 204.81 204.65 204.79 204.20 204.11

a. Construct a 95% confidence interval estimate of the mean width of the troughs.
b. Interpret the interval developed in (a).

8.78 A busy landscaping supplies company sells wood chips for garden mulch. The mulch is sold by the cubic metre and delivered to households in a small truck. Each truckload is expected to be 4 cubic metres. The company decides to conduct an audit of actual load volumes by smoothing and measuring samples of loads for a two-week period. The data file < MULCH > contains the volume (in cubic metres) from a sample of 368 truckloads of cypress pine mulch and from a sample of 330 truckloads of cedar wood chips.
a. For the cypress pine wood chips, construct a 95% confidence interval estimate of the mean volume.
b. For the cedar wood chips, construct a 95% confidence interval estimate of the mean volume.
c. Evaluate whether the assumption needed for (a) and (b) has been seriously violated.
d. Based on the results of (a) and (b), what conclusions can you reach concerning the mean volume of the cypress pine and cedar wood chips?

8.79 The manufacturer of ‘Bondi’ and ‘Vincentia’ terracotta roof shingles provides its customers with a 50-year warranty on the product. To determine whether a shingle will last as long as the warranty period, accelerated life testing is conducted at the manufacturing plant. Accelerated life testing exposes the shingle to the stresses it would be subject to in a lifetime of normal use via a laboratory experiment that takes only a few hours to conduct. In this test, a shingle is repeatedly scraped with an abrasive and the particles that are removed are weighed (in grams). Shingles that experience small amounts of particle loss are expected to last longer in normal use than shingles that experience large amounts of particle loss. In this situation, a shingle should experience no more than 0.8 g of particle loss if it is expected to last the length of the warranty period. The data file < PARTICLE > contains a sample of 170 measurements made on the company’s ‘Bondi’ shingles, and 140 measurements made on ‘Vincentia’ shingles.
a. For the ‘Bondi’ shingles, construct a 95% confidence interval estimate of the mean particle loss.
b. For the ‘Vincentia’ shingles, construct a 95% confidence interval estimate of the mean particle loss.
c. Evaluate whether the assumption needed for (a) and (b) has been seriously violated.
d. Based on the results of (a) and (b), what conclusions can you reach concerning the mean particle loss of the ‘Bondi’ and ‘Vincentia’ shingles?

8.80 Diners have rated 14 North Island and 14 South Island New Zealand restaurants on the basis of food, presentation, service and toilets using an online review system with ratings from 1 to 10. The data file < REST_NZ > contains the ratings for each of these categories. For each island separately:
a. Construct 95% confidence interval estimates for the mean food rating, mean presentation rating, mean service rating and mean toilet rating.
b. What conclusions can you reach about the North and South Island restaurants from the results in (a)?

REPORT WRITING EXERCISE

8.81 Referring to the results in problem 8.77 concerning the width of a steel trough, write a report that summarises your conclusions.
Continuing cases

Tasman University
The Business School at Tasman University (TBU) has decided to gather data about its undergraduate students. It has created and distributed a survey of 14 questions and receives responses from 62 undergraduates (stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).
a For each variable included in the survey, construct a 95% confidence interval estimate for the population characteristic and write a report summarising your conclusions.
Shortly afterwards, TBU decides to undertake a similar survey for graduate students. It creates and distributes a survey of 14 questions and receives responses from 44 graduate students (stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >).
b For each variable included in the survey, construct a 95% confidence interval estimate for the population characteristic and write a report summarising your conclusions.

As Safe as Houses
While working at Safe-As-Houses Real Estate, you are told the company wishes to explore variations in the average prices of properties in towns and cities. Using data in the file < REAL_ESTATE >, find a 95% confidence interval for the mean property price in each town or city in both states. Write a report that details your findings. Have you found any evidence of differences between average prices in these towns and cities?

Chapter 8 Excel Guide

EG8.1 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ KNOWN)
Open the CIE_Sigma_Known workbook. This workbook already contains the entries for Example 8.1 on page 283 and uses the NORM.S.INV and CONFIDENCE.NORM functions (see Appendix D.2 for more information). To adapt this worksheet to other problems, change the population standard deviation, sample mean, sample size and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.2 (Confidence Interval Estimate for the Mean, sigma known) if you want PHStat to produce a worksheet for you.

EG8.2 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ UNKNOWN)
Open the CIE_Sigma_Unknown workbook, shown in Figure 8.6 on page 288. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.3 for more information). To adapt this workbook to other problems, change the sample statistics and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.3 (Confidence Interval Estimate for the Mean, sigma unknown) if you want PHStat to produce a worksheet for you.

EG8.3 CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION
Open the CIE_Proportion workbook, shown in Figure 8.10 on page 292. The workbook uses the NORM.S.INV function to determine the Z value (see Appendix D.4 for more information). To adapt this workbook to other problems, change the sample size, number of successes and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.4 (Confidence Interval Estimate for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.4 SAMPLE SIZE DETERMINATION FOR THE MEAN
Open the Sample_Size_Mean workbook, shown in Figure 8.11 on page 296. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.5 for more information). To adapt this workbook to other problems, change the population standard deviation, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.5 (Sample Size Determination for the Mean) if you want PHStat to produce a worksheet for you.

EG8.5 SAMPLE SIZE DETERMINATION FOR THE PROPORTION
Open the Sample_Size_Proportion workbook, shown in Figure 8.12 on page 298. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.6 for more information). To adapt this workbook to other problems, change the estimate of true proportion, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.6 (Sample Size Determination for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.6 CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
Open the CIE_Total workbook. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.7 for more information). To adapt this workbook to other problems, change the population size, sample mean, sample size, sample standard deviation and confidence level values in the tinted cells in rows 4 to 8.
OR See Appendix D.7 (Estimate for the Population Total) if you want PHStat to produce a worksheet for you.

EG8.7 CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE
Open the CIE_Total_Difference workbook. This two-worksheet file already contains the entries for the Callistemon Camping Supplies example used in Section 8.5. To adapt this workbook to other problems, first change the population size, sample size and confidence level values in the tinted cells in rows 4 to 6. Then select the Data worksheet and enter differences data in column A, replacing the data already there for the Section 8.5 problem. Finally, adjust the column B formulas, copying the formulas down to additional cells if you have more than 12 differences, or deleting the unneeded formulas if you have fewer than 12 differences.
OR See Appendix D.8 (Estimate for the Total Difference) if you want PHStat to produce a worksheet for you.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
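The calculations behind some of the workbooks in this guide can also be sketched outside Excel. The following is a minimal Python sketch under stated assumptions, not a reproduction of the workbooks: the function names and input values are hypothetical, `NormalDist().inv_cdf` stands in for NORM.S.INV, the half-width computed in `ci_mean_sigma_known` corresponds to what CONFIDENCE.NORM returns, and `math.ceil` stands in for ROUNDUP (the two agree for positive values rounded up to a whole number).

```python
import math
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided critical Z value, analogous to NORM.S.INV(1 - alpha/2)."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)


def ci_mean_sigma_known(sample_mean, sigma, n, confidence_level):
    """Confidence interval for the mean when sigma is known."""
    # Half-width of the interval (the quantity CONFIDENCE.NORM returns).
    half_width = z_critical(confidence_level) * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


def sample_size_mean(sigma, sampling_error, confidence_level):
    """Sample size for estimating a mean, rounded up to a whole number."""
    z = z_critical(confidence_level)
    return math.ceil((z * sigma / sampling_error) ** 2)


def sample_size_proportion(p_estimate, sampling_error, confidence_level):
    """Sample size for estimating a proportion, rounded up to a whole number."""
    z = z_critical(confidence_level)
    return math.ceil(z ** 2 * p_estimate * (1 - p_estimate) / sampling_error ** 2)


# Hypothetical inputs, chosen only for illustration.
print(ci_mean_sigma_known(sample_mean=100, sigma=10, n=25, confidence_level=0.95))
print(sample_size_mean(sigma=25, sampling_error=5, confidence_level=0.95))            # 97
print(sample_size_proportion(p_estimate=0.5, sampling_error=0.05, confidence_level=0.95))  # 385
```

The σ-unknown and population-total workbooks rely on the t distribution (T.INV.2T), which has no counterpart in the Python standard library, so they are left out of this sketch.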
CHAPTER 9
Fundamentals of hypothesis testing: One-sample tests

PATRICIO’S PASTA CO.
You have recently been appointed to oversee quality control at Patricio’s Pasta Co., which produces and packages a range of dried pasta in traditional Italian shapes. It is made from Australian durum wheat semolina, sourced from grain grown in the Narrabri region of New South Wales. The pasta is sold in 500-gram packets, and part of your job is to ensure that packets are being filled correctly and that the weight of the contents is as shown on the packet. You select and weigh a random sample of 25 filled spiral pasta packets in order to calculate a sample mean and investigate how close the weights are to the company’s specifications of a mean of 500 grams. You must make a decision and conclude whether (or not) the mean fill weight in the entire process is equal to 500 grams, in order to know whether the fill process needs adjustment. How could you rationally make this decision?
© Tim Hill/Alamy Stock Photo

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 identify the basic principles of hypothesis testing
2 explain the assumptions of each hypothesis-testing procedure, how to evaluate them and the consequences if they are seriously violated
3 use hypothesis testing to test a mean or proportion
4 recognise the pitfalls involved in hypothesis testing
5 identify the ethical issues involved in hypothesis testing

Unlike Chapter 7, in which the problem facing the operations manager was to determine whether the sample mean was consistent with a known population mean, this chapter’s opening scenario asks how the sample mean can validate the claim that the population mean is 500 grams. To validate the claim, you must first state the claim unambiguously. For example, the population mean is 500 grams. In the inferential method known as hypothesis testing you consider the evidence – the sample statistic – to see whether the evidence better supports the statement, called the null hypothesis, or the mutually exclusive alternative which, in this case, states that the population mean is not 500 grams.

In this chapter the focus is on hypothesis testing, another aspect of statistical inference that, like confidence interval estimation, is based on sample information. A step-by-step methodology is developed that enables you to make inferences about a population parameter by analysing differences between the results observed (the sample statistic) and the results you expect to get if some underlying hypothesis is actually true.
For example, is the mean weight of the retail spiral pasta packets in the sample taken at Patricio’s Pasta consistent with what you would expect if the mean of the entire population of retail packets is 500 grams? Or can you infer that the population mean is not equal to 500 grams because the sample mean is significantly different from 500 grams?

9.1 HYPOTHESIS-TESTING METHODOLOGY

The Null and Alternative Hypotheses

hypothesis testing A method of statistical inference used to make tests about the value of population parameters.
null hypothesis (H0) A statement about the value of one or more population parameters which we test and aim to disprove.

Hypothesis testing typically begins with some theory, claim or assertion about a particular parameter of a population. For example, your initial hypothesis about the pasta company example is that the process is working properly, meaning that the mean weight is 500 grams, and no corrective action is needed. The hypothesis that the population parameter is equal to the company specification is referred to as the null hypothesis. A null hypothesis is always one of status quo, and is identified by the symbol H0. Here, the null hypothesis is that the filling process is working properly and therefore the mean weight is the 500-gram specification. This is stated as:

H0: μ = 500

Even though information is available only from the sample, the null hypothesis is written in terms of the population. Remember, your focus is on the population of all retail spiral pasta packets. The sample statistic is used to make inferences about the entire filling process. One inference may be that the results observed from the sample data indicate that the null hypothesis is false. If the null hypothesis is considered false, something else must be true.

Whenever a null hypothesis is specified, an alternative hypothesis is also specified, one that must be true if
the null hypothesis is false. The alternative hypothesis, H1, is the opposite of the null hypothesis, H0. This is stated in the pasta example as:

H1: μ ≠ 500

The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis. The null hypothesis is rejected when there is sufficient evidence from the sample information that the null hypothesis is false. In the pasta example, if the weights of the sampled packets are sufficiently above or below the expected 500-gram mean specified by the company, you reject the null hypothesis in favour of the alternative hypothesis that the mean fill is different from 500 grams. You stop production and take whatever action is necessary to correct the problem. If the null hypothesis is not rejected, then you should continue to believe in the status quo, that the process is working correctly and that no corrective action is necessary. Note that this does not mean you have proved that the process is working correctly. Rather, you have failed to prove that it is working incorrectly and, therefore, you continue your (unproven) belief in the null hypothesis.

In the hypothesis-testing methodology, the null hypothesis is rejected when the sample evidence suggests that it is far more likely that the alternative hypothesis is true. However, failure to reject the null hypothesis is not proof that it is true. You can never prove that the null hypothesis is correct because the decision is based only on the sample information, not on the entire population. Therefore, if you fail to reject the null hypothesis, you can only conclude that there is insufficient evidence to warrant its rejection.
The following key points summarise the null and alternative hypotheses:
• The null hypothesis, H0, represents the status quo or the current belief in a situation.
• The alternative hypothesis, H1, is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove.
• If you reject the null hypothesis, you have statistical proof that the alternative hypothesis is correct.
• If you do not reject the null hypothesis, you have failed to prove the alternative hypothesis. Failure to prove the alternative hypothesis, however, does not mean that you have proved the null hypothesis.
• The null hypothesis, H0, always refers to a specified/hypothesised value of the population parameter (such as μ), not a sample statistic (such as X̄).
• The statement of the null hypothesis always contains an equals sign regarding the specified value of the population parameter (e.g. H0: μ = 500 or H0: μ ≥ 400).
• The statement of the alternative hypothesis never contains an equals sign regarding the specified value of the population parameter (e.g. H1: μ > 500 or H1: μ < 400).

alternative hypothesis (H1) A statement that we aim to prove about one or more population parameters; the opposite of the null hypothesis.

LEARNING OBJECTIVE 1 Identify the basic principles of hypothesis testing

EXAMPLE 9.1 THE NULL AND ALTERNATIVE HYPOTHESES
You are the manager of an Internet provider’s call centre for customer support. You want to determine whether the time taken to call back customers who elected to leave the phone queue has changed in the past month from its previous population mean value of 4.5 minutes. State the null and alternative hypotheses.

SOLUTION
The null hypothesis is that the population mean has not changed from its previous value of 4.5 minutes. This is stated as:

H0: μ = 4.5

The alternative hypothesis is the opposite of the null hypothesis.
Since the null hypothesis is that the population mean is 4.5 minutes, the alternative hypothesis is that the population mean is not 4.5 minutes. This is stated as:

H1: μ ≠ 4.5

Determining the Test Statistic

test statistic A value derived from sample data that is used to determine whether the null hypothesis should be rejected or not.
region of rejection The range of values of the test statistic where the null hypothesis is rejected; it is also called the ‘critical region’.
region of non-rejection The range of values of the test statistic where the null hypothesis cannot be rejected.

The logic behind the hypothesis-testing methodology is to determine how likely it is that the null hypothesis is true by considering the information gathered in a sample. In the Patricio’s Pasta scenario, the null hypothesis is that the mean weight of spiral pasta packets in the entire filling process is 500 grams (i.e. the population parameter specified by the company). You select a sample of packets from the filling process, weigh each packet and calculate the sample mean. This statistic is an estimate of the corresponding parameter (the population mean μ). Even if the null hypothesis is in fact true, the statistic (the sample mean X̄) is likely to differ from the value of the parameter (the population mean μ) because of variation due to sampling. However, you expect the sample statistic to be close to the population parameter if the null hypothesis is true. If the sample statistic is close to the population parameter, you have insufficient evidence to reject the null hypothesis. For example, if the sample mean is 499.9, you would conclude that the population mean has not changed (i.e. μ = 500), because a sample mean of 499.9 is very close to the hypothesised value of 500.
Intuitively, you think that it is likely that you could get a sample mean of 499.9 from a population whose mean is 500.

On the other hand, if there is a large difference between the value of the statistic and the hypothesised value of the population parameter, you will conclude that the null hypothesis is false. For example, if the sample mean is 420, you would conclude that the population mean is not 500 (i.e. μ ≠ 500), because the sample mean is very far from the hypothesised value of 500. In such a case you conclude that it is very unlikely to get a sample mean of 420 if the population mean is really 500. Therefore, it is more logical to conclude that the population mean is not equal to 500 and reject the null hypothesis.

Unfortunately, the decision-making process is not always so clear-cut. Determining what is ‘very close’ and what is ‘very different’ is arbitrary and without clear definitions. Hypothesis-testing methodology provides clear definitions for evaluating differences. It also enables you to quantify the decision-making process by calculating the probability of getting a given sample result if the null hypothesis is true. You calculate this probability by determining the sampling distribution for the sample statistic of interest (e.g. the sample mean) and then calculating the particular test statistic based on the given sample result. Because the sampling distribution for the test statistic often follows a well-known statistical distribution, such as the standardised normal distribution or t distribution, you can use these distributions to help determine whether the null hypothesis is true.

Regions of Rejection and Non-Rejection

The sampling distribution of the test statistic is divided into two regions, a region of rejection (sometimes called the critical region) and a region of non-rejection (see Figure 9.1). If the test statistic falls into the region of non-rejection, you do not reject the null hypothesis.
In the Patricio’s Pasta scenario, you see that there is insufficient evidence that the population mean fill is different from 500 grams. If the test statistic falls into the rejection region, you reject the null hypothesis. In this case, you will see that the population mean is not 500 grams.

Figure 9.1 Regions of rejection and non-rejection in hypothesis testing (axis: X̄, centred on μ; a critical value on each side separates the central region of non-rejection from the region of rejection in each tail)

The region of rejection consists of the values of the test statistic that are unlikely to occur if the null hypothesis is true. These values are more likely to occur if the null hypothesis is false. Therefore, if a value of the test statistic falls into this rejection region, you reject the null hypothesis because that value is unlikely if the null hypothesis is true.

To make a decision concerning the null hypothesis, you first determine the critical value of the test statistic. The critical value divides the non-rejection region from the rejection region. Determining this critical value depends on the size of the rejection region. The size of the rejection region is directly related to the risks involved in using only sample evidence to make decisions about a population parameter.

critical value The value in a distribution that cuts off the required probability in the tail for a given confidence level.

Risks in Decision Making Using Hypothesis Testing

When using a sample statistic to make decisions about a population parameter, there is a risk that you will reach an incorrect conclusion. You can make two different types of errors when applying hypothesis-testing methodology: a Type I error and a Type II error. A Type I error occurs if you reject the null hypothesis, H0, when in fact it is true and should not be rejected.
The probability of a Type I error occurring is α. A Type II error occurs if you do not reject the null hypothesis, H0, when in fact it is false and should be rejected. The probability of a Type II error occurring is β. In the Patricio’s Pasta scenario, you make a Type I error if you conclude that the population mean weight is not 500 when in fact it is 500. You make a Type II error if you conclude that the population mean weight is 500 when in fact it is not 500.

Type I error The rejection of a null hypothesis that is true and should not be rejected.
Type II error The non-rejection of a null hypothesis that is false and should be rejected.

The Level of Significance (α)

level of significance (α) The probability of rejecting a null hypothesis which is in fact true.

The probability of committing a Type I error, denoted by α (the lower-case Greek letter alpha), is referred to as the level of significance of the statistical test. Traditionally, you control the Type I error by deciding on the risk level, α, that you are willing to have in rejecting the null hypothesis when it is true. Because you specify the level of significance before the hypothesis test is performed, the risk of committing a Type I error, α, is directly under your control. Traditionally, you select levels of 0.01, 0.05 or 0.10. The choice of a particular risk level for making a Type I error depends on the cost of making such an error. After you specify the value for α, you know the size of the rejection region because α is the probability of rejection under the null hypothesis. From this fact, you can then determine the critical value or values that divide the rejection and non-rejection regions.

The Confidence Coefficient

The complement of the probability of a Type I error (1 − α) is called the confidence coefficient. When multiplied by 100%, the confidence coefficient yields the confidence level that was studied when constructing confidence intervals (see Section 8.1).
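The link between the level of significance, the confidence coefficient and the critical values can be illustrated with a short calculation. The following Python sketch is an illustration only and rests on two assumptions not stated at this point in the text: that the test statistic follows the standardised normal distribution, and that the rejection region is split equally between the two tails. The function name `two_tail_critical_values` is hypothetical.

```python
from statistics import NormalDist


def two_tail_critical_values(alpha):
    """Critical Z values that leave alpha/2 in each tail (assumed two-tail test)."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)
    return -upper, upper


# The three traditional levels of significance mentioned in the text.
for alpha in (0.01, 0.05, 0.10):
    lower, upper = two_tail_critical_values(alpha)
    print(
        f"alpha = {alpha:.2f}, "
        f"confidence coefficient = {1 - alpha:.2f}, "
        f"critical values = {lower:.3f} and {upper:.3f}"
    )
```

Under these assumptions, a smaller α pushes the critical values further from the centre of the distribution, which shrinks the rejection region and enlarges the non-rejection region.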
The confidence coefficient, 1 − α, is the probability that you will not reject the null hypothesis, H0, when it is true and should not be rejected. The confidence level of a hypothesis test is (1 − α) × 100%. In terms of hypothesis-testing methodology, the confidence coefficient represents the probability of concluding that the value of the parameter as specified i