Business Statistics A First Course, 8e David Levine, Kathryn Szabat, David Stephan

A Roadmap for Selecting a Statistical Method

Data Analysis Task: Describing a group or several groups
  For Numerical Variables
    Ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2, 2.4)
    Mean, median, mode, geometric mean, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation, skewness, kurtosis, boxplot, normal probability plot (Sections 3.1, 3.2, 3.3, 6.3)
    Dashboards (Section 14.2)
  For Categorical Variables
    Summary table, bar chart, pie chart, doughnut chart, Pareto chart (Sections 2.1 and 2.3)

Data Analysis Task: Inference about one group
  For Numerical Variables
    Confidence interval estimate of the mean (Sections 8.1 and 8.2)
    t test for the mean (Section 9.2)
  For Categorical Variables
    Confidence interval estimate of the proportion (Section 8.3)
    Z test for the proportion (Section 9.4)

Data Analysis Task: Comparing two groups
  For Numerical Variables
    Tests for the difference in the means of two independent populations (Section 10.1)
    Paired t test (Section 10.2)
    F test for the difference between two variances (Section 10.4)
  For Categorical Variables
    Z test for the difference between two proportions (Section 10.3)
    Chi-square test for the difference between two proportions (Section 12.1)

Data Analysis Task: Comparing more than two groups
  For Numerical Variables
    One-way analysis of variance for comparing several means (Section 11.1)
  For Categorical Variables
    Chi-square test for differences among more than two proportions (Section 12.2)

Data Analysis Task: Analyzing the relationship between two variables
  For Numerical Variables
    Scatter plot, time series plot (Section 2.5)
    Covariance, coefficient of correlation (Section 3.5)
    Simple linear regression (Chapter 13)
    t test of correlation (Section 13.7)
    Sparklines (Section 2.7)
  For Categorical Variables
    Contingency table, side-by-side bar chart, PivotTables (Sections 2.1, 2.3, 2.6)
    Chi-square test of independence (Section 12.3)

Data Analysis Task: Analyzing the relationship between two or more variables
  For Numerical Variables
    Colored scatter plots, bubble chart, treemap (Section 2.7)
    Multiple regression (Chapters 14)
    Dynamic bubble charts (Section 14.2)
    Regression trees (Section 14.3)
    Cluster analysis (Section 14.5)
    Multidimensional scaling (Section 14.6)
  For Categorical Variables
    Multidimensional contingency tables (Section 2.6)
    Drilldown and slicers (Section 2.7)
    Classification trees (Section 14.4)
    Multiple correspondence analysis (Section 14.6)
Business Statistics
A First Course
EIGHTH EDITION
David M. Levine
Department of Information Systems and Statistics
Zicklin School of Business, Baruch College, City University of New York
Kathryn A. Szabat
Department of Business Systems and Analytics
School of Business, La Salle University
David F. Stephan
Two Bridges Instructional Technology
Senior VP, Courseware Portfolio Management: Marcia Horton
Director, Portfolio Management: Deirdre Lynch
Courseware Portfolio Manager: Suzanna Bainbridge
Courseware Portfolio Management Assistant: Morgan Danna
Managing Producer: Karen Wernholm
Content Producer: Kathleen A. Manley
Senior Producer: Aimee Thorne
Associate Content Producer: Sneh Singh
Manager, Courseware QA: Mary Durnwald
Manager, Content Development: Robert Carroll
Product Marketing Manager: Kaylee Carlson
Product Marketing Assistant: Shannon McCormack
Field Marketing Manager: Thomas Hayward
Field Marketing Assistant: Derrica Moser
Senior Author Support/Technology Specialist: Joe Vetere
Manager, Rights and Permissions: Gina Cheselka
Manufacturing Buyer: Carol Melville, LSC Communications
Composition and Production Coordination: Pearson Content Services Centre, Julie Kidd
Senior Designer and Cover Design: Barbara T. Atkinson
Cover Image: ArtisticPhoto/Shutterstock
Copyright © 2020, 2016, 2013 by Pearson Education, Inc. 221 River Street, Hoboken, NJ 07030 All Rights Reserved. Printed in the United States of America.
This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system,
or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms
and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsoned.com/permissions/.
Attributions of third party content appear on page 651, which constitutes an extension of this copyright page.
PEARSON, ALWAYS LEARNING, MYLAB, MYLAB PLUS, MATHXL, LEARNING CATALYTICS, and TESTGEN are exclusive trademarks owned by
Pearson Education, Inc. or its affiliates in the U.S. and/or other countries.
Unless otherwise indicated herein, any third-party trademarks that may appear in this work are the property of their respective owners and any references to
third-party trademarks, logos or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship,
endorsement, authorization, or promotion of Pearson’s products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc.
or its affiliates, authors, licensees or distributors.
Microsoft® Windows® and Microsoft Office® are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored
or endorsed by or affiliated with the Microsoft Corporation.
Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics
published as part of the services for any purpose. All such documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or
its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability,
whether express, implied or statutory, fitness for a particular purpose, title and non-infringement. In no event shall Microsoft and/or its respective suppliers be
liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract,
negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services. The documents
and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein.
Microsoft and/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial
screen shots may be viewed in full within the software version specified.
Cataloging-in-Publication Data on file with the Library of Congress
Instructor’s Review Copy:
ISBN 13: 978-0-13-524459-3
ISBN 10: 0-13-524459-5
Student Edition:
ISBN 13: 978-0-13-517778-5
ISBN 10: 0-13-517778-2
To our spouses and children,
Marilyn, Mary, Sharyn, and Mark
and to our parents, in loving memory,
Lee, Reuben, Mary, William, Ruth and Francis J.
About the Authors
David M. Levine, Kathryn A. Szabat, and David F. Stephan
are all experienced business school educators committed to innovation and improving instruction in business statistics and related
subjects.
David Levine, Professor Emeritus of Statistics and CIS at Baruch
College, CUNY, has been a nationally recognized innovator in statistics
education for more than three decades. Levine has coauthored
14 books, including several business statistics textbooks; textbooks
and professional titles that explain and explore quality management
and the Six Sigma approach; and, with David Stephan, a trade paperback that explains statistical concepts to a general audience. Levine
has presented or chaired numerous sessions about business education
at leading conferences conducted by the Decision Sciences
Institute (DSI) and the American Statistical Association, and he
and his coauthors have been active participants in the annual DSI Data, Analytics, and Statistics
Instruction (DASI) mini-conference. During his many years teaching at Baruch College, Levine
was recognized for his contributions to teaching and curriculum development with the College’s
highest distinguished teaching honor. He earned B.B.A. and M.B.A. degrees from CCNY and a
Ph.D. in industrial engineering and operations research from New York University.
As Associate Professor of Business Systems and Analytics at La Salle University, Kathryn Szabat
has transformed several business school majors into one interdisciplinary major that better supports careers in new and emerging disciplines of data analysis including analytics. Szabat strives
to inspire, stimulate, challenge, and motivate students through innovation and curricular enhancements, and shares her coauthors’ commitment to teaching excellence and the continual improvement
of statistics presentations. Beyond the classroom she has provided statistical advice to numerous
business, nonbusiness, and academic communities, with particular interest in the areas of education,
medicine, and nonprofit capacity building. Her research activities have led to journal publications,
chapters in scholarly books, and conference presentations. Szabat is a member of the American
Statistical Association (ASA), DSI, the Institute for Operations Research and the Management Sciences
(INFORMS), and DSI DASI. She received a B.S. from SUNY-Albany, an M.S. in statistics from the
Wharton School of the University of Pennsylvania, and a Ph.D. degree in statistics, with a cognate
in operations research, from the Wharton School of the University of Pennsylvania.
Advances in computing have always shaped David Stephan’s professional life. As an undergraduate, he helped professors use statistics software that was considered advanced even though it could
compute only several things discussed in Chapter 3, thereby gaining an early appreciation for the
benefits of using software to solve problems (and perhaps positively influencing his grades). An
early advocate of using computers to support instruction, he developed a prototype of a mainframe-based system that anticipated features found today in Pearson’s MathXL and served as special assistant for computing to the Dean and Provost at Baruch College. In his many years teaching
at Baruch, Stephan implemented the first computer-based classroom, helped redevelop the CIS
curriculum, and, as part of a FIPSE project team, designed and implemented a multimedia learning
environment. He was also nominated for teaching honors. Stephan has presented at SEDSI and DSI
DASI (formerly MSMESB) mini-conferences, sometimes with his coauthors. Stephan earned a
B.A. from Franklin & Marshall College and an M.S. from Baruch College, CUNY, and completed
the instructional technology graduate program at Teachers College, Columbia University.
For all three coauthors, continuous improvement is a natural outcome of their curiosity about the
world. Their varied backgrounds and many years of teaching experience have come together to
shape this book in ways discussed in the Preface.
Brief Contents
Preface xxiii
First Things First 1
1 Defining and Collecting Data 18
2 Organizing and Visualizing Variables 44
3 Numerical Descriptive Measures 130
4 Basic Probability 176
5 Discrete Probability Distributions 207
6 The Normal Distribution 231
7 Sampling Distributions 257
8 Confidence Interval Estimation 279
9 Fundamentals of Hypothesis Testing: One-Sample Tests 314
10 Two-Sample Tests and One-Way ANOVA 354
11 Chi-Square Tests 421
12 Simple Linear Regression 450
13 Multiple Regression 502
14 Business Analytics 538
15 Applications in Quality Management (online) 15-1
Appendices A–H 565
Self-Test Solutions and Answers to Selected Even-Numbered
Problems 615
Index 641
Credits 651
Contents
Preface xxiii
First Things First 1
USING STATISTICS: “The Price of Admission” 1
FTF.1 Think Differently About Statistics 2
Statistics: A Way of Thinking 2
Statistics: An Important Part of Your Business Education 3
FTF.2 Business Analytics: The Changing Face of Statistics 4
“Big Data” 4
FTF.3 Starting Point for Learning Statistics 5
Statistic 5
Can Statistics (pl., statistic) Lie? 6
FTF.4 Starting Point for Using Software 6
Using Software Properly 8
REFERENCES 9
KEY TERMS 9
EXCEL GUIDE 10
EG.1 Getting Started with Excel 10
EG.2 Entering Data 10
EG.3 Open or Save a Workbook 10
EG.4 Working with a Workbook 11
EG.5 Print a Worksheet 11
EG.6 Reviewing Worksheets 11
EG.7 If You Use the Workbook Instructions 11
JMP GUIDE 12
JG.1 Getting Started With JMP 12
JG.2 Entering Data 13
JG.3 Create New Project or Data Table 13
JG.4 Open or Save Files 13
JG.5 Print Data Tables or Report Windows 13
JG.6 JMP Script Files 13
MINITAB GUIDE 14
MG.1 Getting Started with Minitab 14
MG.2 Entering Data 14
MG.3 Open or Save Files 14
MG.4 Insert or Copy Worksheets 14
MG.5 Print Worksheets 15
TABLEAU GUIDE 15
TG.1 Getting Started with Tableau 15
TG.2 Entering Data 16
TG.3 Open or Save a Workbook 16
TG.4 Working with Data 17
TG.5 Print a Workbook 17
1 Defining and Collecting Data 18
USING STATISTICS: Defining Moments 18
1.1 Defining Variables 19
Classifying Variables by Type 19
Measurement Scales 20
1.2 Collecting Data 21
Populations and Samples 21
Data Sources 22
1.3 Types of Sampling Methods 23
Simple Random Sample 23
Systematic Sample 24
Stratified Sample 24
Cluster Sample 24
1.4 Data Cleaning 26
Invalid Variable Values 26
Coding Errors 26
Data Integration Errors 26
Missing Values 27
Algorithmic Cleaning of Extreme Numerical Values 27
1.5 Other Data Preprocessing Tasks 27
Data Formatting 27
Stacking and Unstacking Data 28
Recoding Variables 28
1.6 Types of Survey Errors 29
Coverage Error 29
Nonresponse Error 29
Sampling Error 30
Measurement Error 30
Ethical Issues About Surveys 30
CONSIDER THIS: New Media Surveys/Old Survey Errors 31
USING STATISTICS: Defining Moments, Revisited 32
SUMMARY 32
REFERENCES 32
KEY TERMS 33
CHECKING YOUR UNDERSTANDING 33
CHAPTER REVIEW PROBLEMS 33
CASES FOR CHAPTER 1 34
Managing Ashland MultiComm Services 34
CardioGood Fitness 35
Clear Mountain State Student Survey 35
Learning with the Digital Cases 35
Bubble Charts 77
CHAPTER 1 EXCEL GUIDE 37
EG1.1 Defining Variables 37
EG1.2 Collecting Data 37
PivotChart (Excel) 77
Treemap (Excel, JMP, Tableau) 77
EG1.3 Types of Sampling Methods 37
EG1.4 Data Cleaning 38
Sparklines (Excel, Tableau) 78
2.8
Filtering and Querying Data 79
Excel Slicers 79
2.9
Pitfalls in Organizing and Visualizing Variables 81
Obscuring Data 81
Creating False Impressions 82
Chartjunk 83
EG1.5 Other Data Preprocessing 38
CHAPTER 1 JMP GUIDE 39
JG1.1 Defining Variables 39
JG1.2 Collecting Data 39
JG1.3 Types of Sampling Methods 39
JG1.4 Data Cleaning 40
JG1.5 Other Preprocessing Tasks 41
CHAPTER 1 MINITAB GUIDE 41
MG1.1 Defining Variables 41
MG1.2 Collecting Data 41
MG1.3 Types of Sampling Methods 41
MG1.4 Data Cleaning 42
MG1.5 Other Preprocessing Tasks 42
CHAPTER 1 TABLEAU GUIDE 43
TG1.1 Defining Variables 43
TG1.2 Collecting Data 43
TG1.3 Types of Sampling Methods 43
TG1.4 Data Cleaning 43
TG1.5 Other Preprocessing Tasks 43
2 Organizing and Visualizing
Variables 44
USING STATISTICS: “The Choice Is Yours” 44
2.1
2.2
Organizing Categorical Variables 45
The Summary Table 45
The Contingency Table 46
Organizing Numerical Variables 49
The Frequency Distribution 50
The Relative Frequency Distribution and the Percentage
Distribution 52
The Cumulative Distribution 54
2.3
2.4
Visualizing Categorical Variables 57
The Bar Chart 57
2.6
2.7
SUMMARY 85
REFERENCES 86
KEY EQUATIONS 86
KEY TERMS 87
CHECKING YOUR UNDERSTANDING 87
CHAPTER REVIEW PROBLEMS 87
CASES FOR CHAPTER 2 92
Managing Ashland MultiComm Services 92
Digital Case 92
CardioGood Fitness 93
The Choice Is Yours Follow-Up 93
Clear Mountain State Student Survey 93
CHAPTER 2 EXCEL GUIDE 94
EG2.1 Organizing Categorical Variables 94
EG2.2 Organizing Numerical Variables 96
EG2 Charts Group Reference 98
EG2.3 Visualizing Categorical Variables 98
EG2.4 Visualizing Numerical Variables 100
EG2.5 Visualizing Two Numerical Variables 103
EG2.6 Organizing a Mix of Variables 104
EG2.7 Visualizing a Mix of Variables 105
EG2.8 Filtering and Querying Data 106
CHAPTER 2 JMP GUIDE 106
JG2 JMP Choices for Creating Summaries 106
JG2.1 Organizing Categorical Variables 107
JG2.2 Organizing Numerical Variables 108
JG2.3 Visualizing Categorical Variables 110
The Pie Chart and the Doughnut Chart 58
JG2.4 Visualizing Numerical Variables 111
The Pareto Chart 59
JG2.5 Visualizing Two Numerical Variables 113
Visualizing Two Categorical Variables 61
JG2.6 Organizing a Mix of Variables 114
Visualizing Numerical Variables 64
The Stem-and-Leaf Display 64
The Histogram 65
The Percentage Polygon 66
The Cumulative Percentage Polygon (Ogive) 67
2.5
USING STATISTICS: “The Choice Is Yours,” Revisited 85
Visualizing Two Numerical Variables 71
The Scatter Plot 71
The Time-Series Plot 72
Organizing a Mix of Variables 74
Drill-down 75
Visualizing a Mix of Variables 76
Colored Scatter Plot 76
JG2.7 Visualizing a Mix of Variables 114
JG2.8 Filtering and Querying Data 115
JMP Guide Gallery 116
CHAPTER 2 MINITAB GUIDE 117
MG2.1 Organizing Categorical Variables 117
MG2.2 Organizing Numerical Variables 117
MG2.3 Visualizing Categorical Variables 117
MG2.4 Visualizing Numerical Variables 119
MG2.5 Visualizing Two Numerical Variables 121
MG2.6 Organizing a Mix of Variables 122
MG2.7 Visualizing a Mix of Variables 122
MG2.8 Filtering and Querying Data 123
CHAPTER 2 TABLEAU GUIDE 123
CHAPTER 3 EXCEL GUIDE 168
TG2.1 Organizing Categorical Variables 123
EG3.1 Measures of Central Tendency 168
TG2.2 Organizing Numerical Variables 124
EG3.2 Measures of Variation and Shape 168
TG2.3 Visualizing Categorical Variables 124
EG3.3 Exploring Numerical Variables 169
TG2.4 Visualizing Numerical Variables 126
EG3.4 Numerical Descriptive Measures for a Population 170
TG2.5 Visualizing Two Numerical Variables 127
TG2.6 Organizing a Mix of Variables 127
TG2.7 Visualizing a Mix of Variables 128
3 Numerical Descriptive
Measures 130
USING STATISTICS: More Descriptive Choices 130
3.1
3.2
Measures of Central Tendency 131
The Mean 131
The Median 133
The Mode 134
Measures of Variation and Shape 135
The Range 135
The Variance and the Standard Deviation 135
The Coefficient of Variation 138
Z Scores 139
Shape: Skewness 140
Shape: Kurtosis 141
3.3
Exploring Numerical Variables 145
Quartiles 145
The Interquartile Range 147
The Five-Number Summary 148
The Boxplot 149
3.4
Numerical Descriptive Measures for a Population 152
The Population Mean 152
The Population Variance and Standard Deviation 153
The Empirical Rule 154
Chebyshev’s Theorem 154
3.5
3.6
EG3.5 The Covariance and the Coefficient of Correlation 170
CHAPTER 3 JMP GUIDE 171
JG3.1 Measures of Central Tendency 171
JG3.2 Measures of Variation and Shape 171
JG3.3 Exploring Numerical Variables 171
JG3.4 Numerical Descriptive Measures for a Population 172
JG3.5 The Covariance and the Coefficient of Correlation 172
CHAPTER 3 MINITAB GUIDE 173
MG3.1 Measures of Central Tendency 173
MG3.2 Measures of Variation and Shape 173
MG3.3 Exploring Numerical Variables 174
MG3.4 Numerical Descriptive Measures for a Population 174
MG3.5 The Covariance and the Coefficient of Correlation 174
CHAPTER 3 TABLEAU GUIDE 175
TG3.3 Exploring Numerical Variables 175
4 Basic Probability
4.1
Basic Probability Concepts 177
Events and Sample Spaces 177
Types of Probability 178
Summarizing Sample Spaces 179
Simple Probability 180
Joint Probability 181
Marginal Probability 182
General Addition Rule 182
4.2
Conditional Probability 186
Calculating Conditional Probabilities 186
Decision Trees 187
Independence 189
Multiplication Rules 190
Marginal Probability Using the General Multiplication
Rule 191
Descriptive Statistics: Pitfalls and Ethical Issues 161
SUMMARY 162
REFERENCES 162
KEY EQUATIONS 162
KEY TERMS 163
176
USING STATISTICS: Possibilities at M&R Electronics World 176
The Covariance and the Coefficient of Correlation 156
The Covariance 156
The Coefficient of Correlation 157
USING STATISTICS: More Descriptive Choices,
Revisited 161
4.3
Ethical Issues and Probability 193
4.4
Bayes’ Theorem 194
CONSIDER THIS: Divine Providence and Spam 196
4.5
Counting Rules 197
CHECKING YOUR UNDERSTANDING 163
USING STATISTICS: Possibilities at M&R Electronics World,
Revisited 200
CHAPTER REVIEW PROBLEMS 164
SUMMARY 201
CASES FOR CHAPTER 3 167
REFERENCES 201
Managing Ashland MultiComm Services 167
Digital Case 167
CardioGood Fitness 167
KEY EQUATIONS 201
KEY TERMS 202
More Descriptive Choices Follow-up 167
CHECKING YOUR UNDERSTANDING 202
Clear Mountain State Student Survey 167
CHAPTER REVIEW PROBLEMS 202
CASES FOR CHAPTER 4 204
Digital Case 204
CardioGood Fitness 204
The Choice Is Yours Follow-Up 204
Clear Mountain State Student Survey 204
CHAPTER 4 EXCEL GUIDE 205
6 The Normal Distribution 231
USING STATISTICS: Normal Load Times at MyTVLab 231
6.1
Continuous Probability Distributions 232
6.2
The Normal Distribution 232
EG4.1 Basic Probability Concepts 205
Role of the Mean and the Standard Deviation 234
EG4.4 Bayes’ Theorem 205
Calculating Normal Probabilities 235
EG4.5 Counting Rules 205
CHAPTER 4 JMP GUIDE 206
JG4.4 Bayes’ Theorem 206
Finding X Values 240
CONSIDER THIS: What Is Normal? 243
6.3
CHAPTER 4 MINITAB GUIDE 206
MG4.5 Counting Rules 206
5 Discrete Probability
Distributions 207
USING STATISTICS: Events of Interest at Ricknel Home
Centers 207
5.1
5.2
The Probability Distribution for a Discrete Variable 208
Expected Value of a Discrete Variable 208
Variance and Standard Deviation of a Discrete Variable 209
Binomial Distribution 212
Histograms for Discrete Variables 215
Summary Measures for the Binomial Distribution 216
5.3
Poisson Distribution 219
USING STATISTICS: Events of Interest …,
Revisited 222
SUMMARY 222
REFERENCES 223
KEY EQUATIONS 223
KEY TERMS 223
CHECKING YOUR UNDERSTANDING 223
CHAPTER REVIEW PROBLEMS 223
CASES FOR CHAPTER 5 226
Evaluating Normality 245
Comparing Data Characteristics to Theoretical
Properties 245
Constructing the Normal Probability Plot 246
USING STATISTICS: Normal Load Times …, Revisited 249
SUMMARY 249
REFERENCES 249
KEY EQUATIONS 250
KEY TERMS 250
CHECKING YOUR UNDERSTANDING 250
CHAPTER REVIEW PROBLEMS 250
CASES FOR CHAPTER 6 252
Managing Ashland MultiComm Services 252
CardioGood Fitness 252
More Descriptive Choices Follow-up 252
Clear Mountain State Student Survey 252
Digital Case 252
CHAPTER 6 EXCEL GUIDE 253
EG6.2 The Normal Distribution 253
EG6.3 Evaluating Normality 253
CHAPTER 6 JMP GUIDE 254
JG6.2 The Normal Distribution 254
JG6.3 Evaluating Normality 254
CHAPTER 6 MINITAB GUIDE 255
MG6.2 The Normal Distribution 255
MG6.3 Evaluating Normality 256
Managing Ashland MultiComm Services 226
Digital Case 226
CHAPTER 5 EXCEL GUIDE 227
EG5.1 The Probability Distribution for a Discrete Variable 227
7 Sampling Distributions 257
EG5.2 Binomial Distribution 227
USING STATISTICS: Sampling Oxford Cereals 257
EG5.3 Poisson Distribution 227
7.1
Sampling Distributions 258
CHAPTER 5 JMP GUIDE 228
7.2
Sampling Distribution of the Mean 258
The Unbiased Property of the Sample Mean 258
JG5.1 The Probability Distribution for a Discrete Variable 228
Standard Error of the Mean 260
JG5.2 Binomial Distribution 228
Sampling from Normally Distributed Populations 261
Sampling from Non-normally Distributed Populations—
The Central Limit Theorem 264
VISUAL EXPLORATIONS: Exploring Sampling
Distributions 268
JG5.3 Poisson Distribution 229
CHAPTER 5 MINITAB GUIDE 229
MG5.1 The Probability Distribution for a
Discrete Variable 229
MG5.2 Binomial Distribution 230
MG5.3 Poisson Distribution 230
7.3
Sampling Distribution of the Proportion 269
USING STATISTICS: Sampling Oxford Cereals,
Revisited 272
CardioGood Fitness 307
SUMMARY 273
Clear Mountain State Student Survey 307
REFERENCES 273
KEY EQUATIONS 273
KEY TERMS 273
CHECKING YOUR UNDERSTANDING 273
CHAPTER REVIEW PROBLEMS 274
CASES FOR CHAPTER 7 275
More Descriptive Choices Follow-Up 307
CHAPTER 8 EXCEL GUIDE 308
EG8.1 Confidence Interval Estimate for the Mean (σ Known) 308
EG8.2 Confidence Interval Estimate for the Mean (σ Unknown) 308
EG8.3 Confidence Interval Estimate for the Proportion 309
EG8.4 Determining Sample Size 309
CHAPTER 8 JMP GUIDE 310
JG8.1 Confidence Interval Estimate for the Mean (σ Known) 310
Managing Ashland MultiComm Services 275
JG8.2 Confidence Interval Estimate for the Mean (σ Unknown) 310
Digital Case 275
JG8.3 Confidence Interval Estimate for the Proportion 311
CHAPTER 7 EXCEL GUIDE 276
EG7.2 Sampling Distribution of the Mean 276
CHAPTER 7 JMP GUIDE 277
JG7.2 Sampling Distribution of the Mean 277
CHAPTER 7 MINITAB GUIDE 278
JG8.4 Determining Sample Size 311
CHAPTER 8 MINITAB GUIDE 312
MG8.1 Confidence Interval Estimate for the Mean (σ Known) 312
MG8.2 Confidence Interval Estimate for the Mean (σ Unknown) 312
MG8.3 Confidence Interval Estimate for the Proportion 313
MG7.2 Sampling Distribution of the Mean 278
MG8.4 Determining Sample Size 313
8 Confidence Interval
Estimation 279
9 Fundamentals of Hypothesis
Testing: One-Sample Tests 314
USING STATISTICS: Getting Estimates at Ricknel Home
Centers 279
8.1
Confidence Interval Estimate for the Mean
(σ Known) 280
Sampling Error 281
USING STATISTICS: Significant Testing at Oxford
Cereals 314
9.1
Risks in Decision Making Using Hypothesis Testing 317
Can You Ever Know the Population Standard Deviation? 284
8.2
Confidence Interval Estimate for the Mean
(σ Unknown) 285
Student’s t Distribution 285
The Concept of Degrees of Freedom 286
Properties of the t Distribution 286
The Confidence Interval Statement 288
8.3
Confidence Interval Estimate for the Proportion 293
8.4
Determining Sample Size 296
Z Test for the Mean (σ Known) 319
Hypothesis Testing Using the Critical Value Approach 320
Hypothesis Testing Using the p-Value Approach 323
A Connection Between Confidence Interval Estimation and
Hypothesis Testing 325
Can You Ever Know the Population Standard Deviation? 326
9.2
t Test of Hypothesis for the Mean (σ Unknown) 327
Using the Critical Value Approach 328
Using the p-Value Approach 329
Checking the Normality Assumption 330
9.3
One-Tail Tests 333
Sample Size Determination for the Mean 296
Sample Size Determination for the Proportion 298
8.5
Confidence Interval Estimation and Ethical Issues 301
USING STATISTICS: Getting Estimates at Ricknel Home
Centers, Revisited 301
SUMMARY 301
Using the Critical Value Approach 333
Using the p-Value Approach 335
9.4
Z Test of Hypothesis for the Proportion 337
Using the Critical Value Approach 339
Using the p-Value Approach 339
9.5
Potential Hypothesis-Testing Pitfalls and Ethical
Issues 341
Important Planning Stage Questions 341
REFERENCES 302
KEY EQUATIONS 302
KEY TERMS 302
CHECKING YOUR UNDERSTANDING 303
CHAPTER REVIEW PROBLEMS 303
CASES FOR CHAPTER 8 306
Managing Ashland MultiComm Services 306
Digital Case 307
Sure Value Convenience Stores 307
Fundamentals of Hypothesis Testing 315
The Critical Value of the Test Statistic 316
Regions of Rejection and Nonrejection 317
Statistical Significance Versus Practical Significance 342
Statistical Insignificance Versus Importance 342
Reporting of Findings 342
Ethical Issues 342
USING STATISTICS: Significant Testing …, Revisited 343
SUMMARY 343
REFERENCES 343
F Test for Differences Among More Than Two Means 386
KEY EQUATIONS 344
One-Way ANOVA F Test Assumptions 391
Levene Test for Homogeneity of Variance 392
KEY TERMS 344
Multiple Comparisons: The Tukey-Kramer Procedure 393
CHECKING YOUR UNDERSTANDING 344
USING STATISTICS I: Differing Means for Selling …,
Revisited 398
CHAPTER REVIEW PROBLEMS 344
CASES FOR CHAPTER 9 346
USING STATISTICS II: The Means to Find Differences …,
Revisited 399
SUMMARY 399
Managing Ashland MultiComm Services 346
Digital Case 346
Sure Value Convenience Stores 347
REFERENCES 400
CHAPTER 9 EXCEL GUIDE 348
EG9.1 Fundamentals of Hypothesis Testing 348
KEY EQUATIONS 401
EG9.2 t Test of Hypothesis for the Mean (σ Unknown) 348
KEY TERMS 401
EG9.3 One-Tail Tests 349
CHECKING YOUR UNDERSTANDING 402
EG9.4 Z Test of Hypothesis for the Proportion 349
CHAPTER REVIEW PROBLEMS 402
CHAPTER 9 JMP GUIDE 350
CASES FOR CHAPTER 10 404
JG9.1 Fundamentals of Hypothesis Testing 350
JG9.2 t Test of Hypothesis for the Mean (σ Unknown) 350
JG9.3 One-Tail Tests 351
CardioGood Fitness 406
CHAPTER 9 MINITAB GUIDE 351
More Descriptive Choices Follow-Up 406
MG9.1 Fundamentals of Hypothesis Testing 351
Clear Mountain State Student Survey 406
MG9.2 t Test of Hypothesis for the Mean (σ Unknown) 352
MG9.3 One-Tail Tests 352
EG10.2 Comparing the Means of Two Related Populations 409
EG10.3 Comparing the Proportions of Two Independent
Populations 410
354
USING STATISTICS I: Differing Means for Selling Streaming
Media Players at Arlingtons? 354
10.1 Comparing the Means of Two Independent
Populations 355
Pooled-Variance t Test for the Difference Between Two
Means Assuming Equal Variances 355
Evaluating the Normality Assumption 358
Confidence Interval Estimate for the Difference Between Two
Means 360
Separate-Variance t Test for the Difference Between Two
Means, Assuming Unequal Variances 361
CONSIDER THIS: Do People Really Do This? 362
10.2 Comparing the Means of Two Related Populations 364
Paired t Test 365
Confidence Interval Estimate for the Mean Difference 370
10.3 Comparing the Proportions of Two Independent
Populations 372
Z Test for the Difference Between Two Proportions 372
Confidence Interval Estimate for the Difference Between Two
Proportions 376
10.4 F Test for the Ratio of Two Variances 379
USING STATISTICS II: The Means to Find Differences
at Arlingtons 383
Analyzing Variation in One-Way ANOVA 384
CHAPTER 10 EXCEL GUIDE 407
EG10.1 Comparing the Means of Two Independent
Populations 407
MG9.4 Z Test of Hypothesis for the Proportion 352
10.5 One-Way ANOVA 384
Digital Case 405
Sure Value Convenience Stores 405
JG9.4 Z Test of Hypothesis for the Proportion 351
10 Two-Sample Tests
and One-Way ANOVA
Managing Ashland MultiComm Services 404
EG10.4 F Test for the Ratio of Two Variances 411
EG10.5 One-Way ANOVA 411
CHAPTER 10 JMP GUIDE 414
JG10.1 Comparing the Means of Two Independent Populations 414
JG10.2 Comparing the Means of Two Related Populations 415
JG10.3 Comparing the Proportions of Two Independent
Populations 415
JG10.4 F Test for the Ratio of Two Variances 416
JG10.5 One-Way ANOVA 416
CHAPTER 10 MINITAB GUIDE 417
MG10.1 Comparing the Means of Two Independent Populations 417
MG10.2 Comparing the Means of Two Related Populations 417
MG10.3 Comparing the Proportions of Two Independent
Populations 418
MG10.4 F Test for the Ratio of Two Variances 418
MG10.5 One-Way ANOVA 419
11 Chi-Square Tests 421
USING STATISTICS: Avoiding Guesswork About Resort
Guests 421
11.1 Chi-Square Test for the Difference Between Two
Proportions 422
11.2 Chi-Square Test for Differences Among
More Than Two Proportions 429
11.3 Chi-Square Test of Independence 435
USING STATISTICS: Avoiding Guesswork …, Revisited 441
Residual Plots to Detect Autocorrelation 470
The Durbin-Watson Statistic 471
SUMMARY 441
12.7 Inferences About the Slope and Correlation
Coefficient 474
t Test for the Slope 474
F Test for the Slope 475
Confidence Interval Estimate for the Slope 477
t Test for the Correlation Coefficient 477
REFERENCES 442
KEY EQUATIONS 442
KEY TERMS 442
CHECKING YOUR UNDERSTANDING 442
CHAPTER REVIEW PROBLEMS 442
PHASE 2 444
12.8 Estimation of Mean Values and Prediction of Individual
Values 480
The Confidence Interval Estimate for the Mean Response 481
The Prediction Interval for an Individual Response 482
Digital Case 445
12.9 Potential Pitfalls in Regression 484
CASES FOR CHAPTER 11 444
Managing Ashland MultiComm Services 444
PHASE 1 444
CardioGood Fitness 445
USING STATISTICS: Knowing Customers …, Revisited 486
Clear Mountain State Student Survey 445
SUMMARY 487
CHAPTER 11 EXCEL GUIDE 446
EG11.1 Chi-Square Test for the Difference Between Two
Proportions 446
EG11.2 Chi-Square Test for Differences Among More Than Two
Proportions 446
EG11.3 Chi-Square Test of Independence 447
KEY EQUATIONS 488
KEY TERMS 489
CHECKING YOUR UNDERSTANDING 489
CHAPTER REVIEW PROBLEMS 490
CHAPTER 11 JMP GUIDE 448
JG11.1 Chi-Square Test for the Difference Between Two
Proportions 448
JG11.2 Chi-Square Test for Difference Among More Than Two
Proportions 448
JG11.3 Chi-Square Test of Independence 448
CASES FOR CHAPTER 12 493
Managing Ashland MultiComm Services 493
Digital Case 493
Brynne Packaging 493
CHAPTER 12 EXCEL GUIDE 494
CHAPTER 11 MINITAB GUIDE 449
MG11.1 Chi-Square Test for the Difference Between Two
Proportions 449
MG11.2 Chi-Square Test for Differences Among More Than Two
Proportions 449
MG11.3 Chi-Square Test of Independence 449
12 Simple Linear Regression
REFERENCES 488
450
EG12.2 Determining the Simple Linear Regression Equation 494
EG12.3 Measures of Variation 495
EG12.5 Residual Analysis 495
EG12.6 Measuring Autocorrelation: the Durbin-Watson
Statistic 496
EG12.7 Inferences About the Slope and Correlation Coefficient 496
EG12.8 Estimation of Mean Values and Prediction of Individual
Values 496
CHAPTER 12 JMP GUIDE 497
JG12.2 Determining the Simple Linear Regression Equation 497
USING STATISTICS: Knowing Customers at Sunflowers
Apparel 450
Preliminary Analysis 451
JG12.3 Measures of Variation 497
12.1 Simple Linear Regression Models 452
JG12.7 Inferences About the Slope and Correlation Coefficient 497
JG12.8 Estimation of Mean Values and Prediction of Individual
Values 498
12.2 Determining the Simple Linear Regression Equation 453
The Least-Squares Method 453
Predictions in Regression Analysis: Interpolation Versus
Extrapolation 456
Calculating the Slope, b1, and the Y Intercept, b0 457
12.3 Measures of Variation 461
Computing the Sum of Squares 461
The Coefficient of Determination 463
Standard Error of the Estimate 464
12.4 Assumptions of Regression 465
12.5 Residual Analysis 466
Evaluating the Assumptions 466
12.6 Measuring Autocorrelation: The Durbin-Watson
Statistic 470
JG12.5 Residual Analysis 497
JG12.6 Measuring Autocorrelation: the Durbin-Watson Statistic 497
CHAPTER 12 MINITAB GUIDE 499
MG12.2 Determining the Simple Linear Regression Equation 499
MG12.3 Measures of Variation 500
MG12.5 Residual Analysis 500
MG12.6 Measuring Autocorrelation: The Durbin-Watson Statistic 500
MG12.7 Inferences About the Slope and Correlation
Coefficient 500
MG12.8 Estimation of Mean Values and Prediction of Individual
Values 500
CHAPTER 12 TABLEAU GUIDE 501
TG12.2 Determining the Simple Linear Regression Equation 501
TG12.3 Measures of Variation 501
13 Multiple Regression 502
USING STATISTICS: The Multiple Effects of OmniPower Bars 502
13.1 Developing a Multiple Regression Model 503
Interpreting the Regression Coefficients 504
Predicting the Dependent Variable Y 506
13.2 Evaluating Multiple Regression Models 508
Coefficient of Multiple Determination, r² 508
Adjusted r² 509
F Test for the Significance of the Overall Multiple Regression Model 509
13.3 Multiple Regression Residual Analysis 512
13.4 Inferences About the Population Regression Coefficients 513
Tests of Hypothesis 514
Confidence Interval Estimation 515
13.5 Using Dummy Variables and Interaction Terms 517
Interactions 519
USING STATISTICS: The Multiple Effects …, Revisited 524
SUMMARY 524
REFERENCES 526
KEY EQUATIONS 526
KEY TERMS 526
CHECKING YOUR UNDERSTANDING 526
CHAPTER REVIEW PROBLEMS 526
CASES FOR CHAPTER 13 529
Managing Ashland MultiComm Services 529
Digital Case 529
CHAPTER 13 EXCEL GUIDE 530
EG13.1 Developing a Multiple Regression Model 530
EG13.2 Evaluating Multiple Regression Models 531
EG13.3 Multiple Regression Residual Analysis 531
EG13.4 Inferences About the Population Regression Coefficients 532
EG13.5 Using Dummy Variables and Interaction Terms 532
CHAPTER 13 JMP GUIDE 532
JG13.1 Developing a Multiple Regression Model 532
JG13.2 Evaluating Multiple Regression Models 533
JG13.3 Multiple Regression Residual Analysis 533
JG13.4 Inferences About the Population 534
JG13.5 Using Dummy Variables and Interaction Terms 534
CHAPTER 13 MINITAB GUIDE 535
MG13.1 Developing a Multiple Regression Model 535
MG13.2 Evaluating Multiple Regression Models 536
MG13.3 Multiple Regression Residual Analysis 536
MG13.4 Inferences About the Population Regression Coefficients 536
MG13.5 Using Dummy Variables and Interaction Terms in Regression Models 536
14 Business Analytics 538
USING STATISTICS: Back to Arlingtons for the Future 538
14.1 Business Analytics Categories 539
Inferential Statistics and Predictive Analytics 540
Supervised and Unsupervised Methods 540
CONSIDER THIS: What’s My Major If I Want to Be a Data Miner? 541
14.2 Descriptive Analytics 542
Dashboards 542
Data Dimensionality and Descriptive Analytics 543
14.3 Predictive Analytics for Prediction 544
14.4 Predictive Analytics for Classification 547
14.5 Predictive Analytics for Clustering 548
14.6 Predictive Analytics for Association 551
Multidimensional Scaling (MDS) 552
14.7 Text Analytics 553
14.8 Prescriptive Analytics 554
USING STATISTICS: Back to Arlingtons …, Revisited 555
REFERENCES 555
KEY EQUATIONS 556
KEY TERMS 556
CHECKING YOUR UNDERSTANDING 556
CHAPTER REVIEW PROBLEMS 556
CHAPTER 14 SOFTWARE GUIDE 558
Introduction 558
SG14.2 Descriptive Analytics 558
SG14.3 Predictive Analytics for Prediction 561
SG14.4 Predictive Analytics for Classification 561
SG14.5 Predictive Analytics for Clustering 562
SG14.6 Predictive Analytics for Association 564
15 Statistical Applications in Quality Management (online) 15-1
USING STATISTICS: Finding Quality at the Beachcomber 15-1
15.1 The Theory of Control Charts 15-2
The Causes of Variation 15-2
15.2 Control Chart for the Proportion: The p Chart 15-4
15.3 The Red Bead Experiment: Understanding Process Variability 15-10
15.4 Control Chart for an Area of Opportunity: The c Chart 15-12
15.5 Control Charts for the Range and the Mean 15-15
The R Chart 15-15
The X̄ Chart 15-18
15.6 Process Capability 15-21
Customer Satisfaction and Specification Limits 15-21
Capability Indices 15-22
CPL, CPU, and Cpk 15-23
15.7 Total Quality Management 15-26
15.8 Six Sigma 15-27
The DMAIC Model 15-28
Roles in a Six Sigma Organization 15-29
Lean Six Sigma 15-29
USING STATISTICS: Finding Quality at the Beachcomber, Revisited 15-30
REFERENCES 15-31
KEY EQUATIONS 15-31
KEY TERMS 15-32
CHAPTER REVIEW PROBLEMS 15-32
CASES FOR CHAPTER 15 15-35
The Harnswell Sewing Machine Company Case 15-35
Managing Ashland MultiComm Services 15-37
CHAPTER 15 EXCEL GUIDE 15-38
EG15.2 Control Chart for the Proportion: The p Chart 15-38
EG15.4 Control Chart for an Area of Opportunity: The c Chart 15-39
EG15.5 Control Charts for the Range and the Mean 15-40
EG15.6 Process Capability 15-41
CHAPTER 15 JMP GUIDE 15-41
JG15.2 Control Chart for the Proportion: The p Chart 15-41
JG15.4 Control Chart for an Area of Opportunity: The c Chart 15-41
JG15.5 Control Charts for the Range and the Mean 15-42
JG15.6 Process Capability 15-42
CHAPTER 15 MINITAB GUIDE 15-42
MG15.2 Control Chart for the Proportion: The p Chart 15-42
MG15.4 Control Chart for an Area of Opportunity: The c Chart 15-43
MG15.5 Control Charts for the Range and the Mean 15-43
MG15.6 Process Capability 15-43
Appendices 565
A. BASIC MATH CONCEPTS AND SYMBOLS 566
A.1 Operators 566
A.2 Rules for Arithmetic Operations 566
A.3 Rules for Algebra: Exponents and Square Roots 566
A.4 Rules for Logarithms 567
A.5 Summation Notation 568
A.6 Greek Alphabet 571
B. IMPORTANT SOFTWARE SKILLS AND CONCEPTS 572
B.1 Identifying the Software Version 572
B.2 Formulas 572
B.3 Excel Cell References 574
B.4 Excel Worksheet Formatting 575
B.5E Excel Chart Formatting 576
B.5J JMP Chart Formatting 577
B.5M Minitab Chart Formatting 578
B.5T Tableau Chart Formatting 578
B.6 Creating Histograms for Discrete Probability Distributions (Excel) 579
B.7 Deleting the “Extra” Histogram Bar (Excel) 580
C. ONLINE RESOURCES 581
C.1 About the Online Resources for This Book 581
C.2 Data Files 581
C.3 Files Integrated With Microsoft Excel 586
C.4 Supplemental Files 586
D. CONFIGURING SOFTWARE 587
D.1 Microsoft Excel Configuration 587
D.2 JMP Configuration 589
D.3 Minitab Configuration 589
D.4 Tableau Configuration 589
E. TABLE 590
E.1 Table of Random Numbers 590
E.2 The Cumulative Standardized Normal Distribution 592
E.3 Critical Values of t 594
E.4 Critical Values of χ² 596
E.5 Critical Values of F 597
E.6 The Standardized Normal Distribution 601
E.7 Critical Values of the Studentized Range, Q 602
E.8 Critical Values, dL and dU, of the Durbin-Watson Statistic, D (Critical Values Are One-Sided) 604
E.9 Control Chart Factors 605
F. USEFUL KNOWLEDGE 606
F.1 Keyboard Shortcuts 606
F.2 Understanding the Nonstatistical Functions 606
G. SOFTWARE FAQS 608
G.1 Microsoft Excel FAQs 608
G.2 PHStat FAQs 608
G.3 JMP FAQs 609
G.4 Minitab FAQs 609
G.5 Tableau FAQs 609
H. ALL ABOUT PHStat 611
H.1 What is PHStat? 611
H.2 Obtaining and Setting Up PHStat 612
H.3 Using PHStat 612
H.4 PHStat Procedures, by Category 613
Self-Test Solutions and Answers to Selected Even-Numbered Problems 615
Index 641
Credits 651
Preface
As business statistics evolves and becomes an increasingly important part of one’s business
education, how business statistics gets taught and what gets taught become all the more
important.
We, the authors, think about these issues as we seek ways to continuously improve the
teaching of business statistics. We actively participate in Decision Sciences Institute (DSI),
American Statistical Association (ASA), and Data, Analytics, and Statistics Instruction and
Business (DASI) conferences. We use the ASA’s Guidelines for Assessment and Instruction
(GAISE) reports and combine them with our experiences teaching business statistics to a
diverse student body at several universities.
When writing for introductory business statistics students, five principles guide us.
Help students see the relevance of statistics to their own careers by using examples from
the functional areas that may become their areas of specialization. Students need to learn
statistics in the context of the functional areas of business. We present each statistics topic in
the context of areas such as accounting, finance, management, and marketing and explain the
application of specific methods to business activities.
Emphasize interpretation and analysis of statistical results over calculation. We emphasize the interpretation of results, the evaluation of the assumptions, and the discussion of what
should be done if the assumptions are violated. We believe that these activities are more important to students’ futures and will serve them better than focusing on tedious manual calculations.
Give students ample practice in understanding how to apply statistics to business. We
believe that both classroom examples and homework exercises should involve actual or realistic data, using small and large sets of data, to the extent possible.
Familiarize students with the use of data analysis software. We integrate using Microsoft
Excel, JMP, and Minitab into all statistics topics to illustrate how software can assist the business decision-making process. In this edition, we also integrate using Tableau into selected
topics, where such integration makes best sense. (Using software in this way also supports our
second point about emphasizing interpretation over calculation.)
Provide clear instructions to students that facilitate their use of data analysis software.
We believe that providing such instructions assists learning and minimizes the chance that the
software will distract from the learning of statistical concepts.
What’s New in This Edition?
This eighth edition of Business Statistics: A First Course features many passages rewritten in
a more concise style that emphasize definitions as the foundation for understanding statistical
concepts. In addition to changes that readers of past editions have come to expect, such as new
examples and Using Statistics case scenarios and an extensive number of new end-of-section or
end-of-chapter problems, the edition debuts:
• A First Things First Chapter that builds on the previous edition’s novel Important
Things to Learn First Chapter by using real-world examples to illustrate how developments such as the increasing use of business analytics and “big data” have made knowing
and understanding statistics that much more critical. This chapter is available as a complimentary online download, allowing students to get a head start on learning.
• Tabular Summaries that state hypothesis test and regression example results along with
the conclusions that those results support now appear in Chapters 10 through 13.
• Updated Excel and Minitab Guides that reflect the most recent editions of these programs.
• New JMP Guides that provide detailed, hands-on instructions for using JMP to illustrate
the concepts that this book teaches. JMP provides a starting point for continuing studies
in business statistics and business analytics and features visualizations that are easy to
construct and that summarize data in innovative ways.
• For selected chapters, Tableau Guides that make best use of this software for basic and
advanced visualizations and regression analysis.
• An All-New Business Analytics Chapter (Chapter 14) that makes extensive use of JMP,
Minitab, and Tableau to illustrate predictive analytics for prediction, classification, clustering, and association as well as explaining what text analytics does and how descriptive and
prescriptive analytics relate to predictive analytics. This chapter benefits from the insights
the coauthors have gained from teaching and lecturing on business analytics as well as
research the coauthors have done for a forthcoming companion title on business analytics.
Continuing Features that Readers Have Come to Expect
This edition of Business Statistics: A First Course continues to incorporate a number of distinctive features that have led to its wide adoption over the previous editions. Table 1 summarizes
these carry-over features:
TABLE 1
Distinctive Features Continued in the Eighth Edition
Feature | Details
Using Statistics Business Scenarios | A Using Statistics scenario that highlights how statistics is used in a business functional area begins each chapter. Each scenario provides an applied context for learning in its chapter. End-of-chapter “Revisited” sections reinforce the statistical methods that a chapter discusses and apply those methods to the questions raised in the scenario. In this edition, four chapters have new or revised Using Statistics scenarios.
Emphasis on Data Analysis and Interpretation of Results | Basic Business Statistics was among the first business statistics textbooks to focus on interpretation of the results of a statistical method and not on the mathematics of a method. This tradition continues, now supplemented by JMP results complementing the Excel and Minitab results of recent prior editions.
Software Integration | Software instructions in this book feature chapter examples and were personally written by the authors, who collectively have over one hundred years’ experience teaching the application of software to business. Software usage also features templates and applications developed by the authors that minimize the frustration of using software while maximizing statistical learning.
Opportunities for Additional Learning | Student Tips, LearnMore bubbles, and Consider This features extend student-paced learning by reinforcing important points or examining side issues or answering questions that arise while studying business statistics such as “What is so ‘normal’ about the normal distribution?”
Highly Tailorable Context | With an extensive library of separate online topics, sections, and even two full chapters, instructors can combine these materials and the opportunities for additional learning to meet their curricular needs.
Software Flexibility | With modularized software instructions, instructors and students can switch among Excel, Excel with PHStat, JMP, Minitab, and Tableau as they use this book, taking advantage of the strengths of each program to enhance learning.
End-of-Section and End-of-Chapter Reinforcements | “Exhibits” summarize key processes throughout the book. “Key Terms” provides an index to the definitions of the important vocabulary of a chapter. “Learning the Basics” questions test the basic concepts of a chapter. “Applying the Concepts” problems test the learner’s ability to apply those concepts to business problems. For the more quantitatively-minded, “Key Equations” list the boxed number equations that appear in a chapter.
Innovative Cases | End-of-chapter cases include a case that continues through many chapters as well as “Digital Cases” that require students to examine business documents and other information sources to sift through various claims and discover the data most relevant to a business case problem as well as common misuses of statistical information. (Instructional tips for these cases and solutions to the Digital Cases are included in the Instructor’s Solutions Manual.)
Answers to Even-Numbered Problems | An appendix provides additional self-study opportunities by providing answers to the “Self-Test” problems and most of the even-numbered problems in this book.
Unique Excel Integration | Many textbooks feature Microsoft Excel, but Business Statistics: A First Course comes from the authors who originated the Excel Guide workbooks that illustrate model solutions, developed Visual Explorations that demonstrate selected basic concepts, and designed and implemented PHStat, the Pearson statistical add-in for Excel that places the focus on statistical learning. (See Appendix H for a complete summary of PHStat.)
Chapter-by-Chapter Changes Made for This Edition
Because the authors believe in continuous quality improvement, every chapter of Business
Statistics: A First Course contains changes to enhance, update, or just freshen this book. Table 2
provides a chapter-by-chapter summary of these changes.
TABLE 2
Chapter-by-Chapter Change Matrix
Chapter | Using Statistics Changed | JMP/Tableau Guide | Problems Changed | Selected Chapter Changes
FTF | • | J, T | n.a. | Think Differently About Statistics. Starting Point for Learning Statistics.
1 | • | J, T | 40% | Data Cleaning. Other Data Preprocessing Tasks.
2 | | J, T | 60% | Organizing a Mix of Variables. Visualizing a Mix of Variables. Filtering and Querying Data. Reorganized categorical variables discussion. Expanded data visualization discussion. New samples of 379 retirement funds and 100 restaurant meal costs for examples.
3 | | J, T | 50% | New samples of 379 retirement funds and 100 restaurant meal costs for examples. Updated NBA team values data set.
TABLE 2
Chapter-by-Chapter
Change Matrix (continued)
JMP/
Tableau
Guide
Problems
Changed
4
J
43%
5
J
60%
J
J
33%
47%
8
J
40%
9
J
20%
J
43%
11
J
43%
12
J, T
46%
13
J
30%
J, T
n.a.
Chapter
6
7
10
14
Using
Statistics
Changed
•
•
n.a.
Selected Chapter Changes
Basic Probability Concepts rewritten.
Bayes’ theorem example moved online.
Section 5.1 and Binomial Distribution
revised.
Normal Distribution rewritten.
Sampling Distribution of the Proportion
rewritten.
Confidence Interval Estimate for the
Mean revised.
Revised “Managing Ashland MultiComm Services” continuing case.
Chapter introduction revised.
Section 9.1 rewritten.
New Section 9.4 example.
New paired t test and the difference
between two proportions examples.
Extensive use of new tabular summaries.
Revised “Managing Ashland MultiComm Services” continuing case.
Chapter introduction revised.
Section 12.2 revised.
Section 13.1 revised.
Section 13.3 reorganized and revised.
New dummy variable example.
All-new chapter that introduces business
analytics.
Software Guide explains using Excel
with Power BI Desktop, JMP,
Minitab, and Tableau, for various
descriptive and predictive analytics
methods.
Serious About Writing Improvements
Ever review a textbook that reads the same as an edition from years ago? Or read a preface that
claims writing improvements but offers no evidence? Among the writing improvements in this
edition of Business Statistics: A First Course, the authors have turned to tabular summaries to
guide readers to reaching conclusions and making decisions based on statistical information.
The authors believe that this writing improvement, which appears in Chapters 9 through 13, not
only adds clarity to the purpose of the statistical method being discussed but better illustrates
the role of statistics in business decision-making processes. Judge for yourself using the sample
from Chapter 10 Example 10.1.
Previously, part of the solution to Example 10.1 was presented as:
You do not reject the null hypothesis because tSTAT = -1.6341 > -1.7341. The p-value
(as computed in Figure 10.5) is 0.0598. This p-value indicates that the probability that
tSTAT < -1.6341 is equal to 0.0598. In other words, if the population means are equal,
the probability that the sample mean delivery time for the local pizza restaurant is at least
2.18 minutes faster than the national chain is 0.0598. Because the p-value is greater than
α = 0.05, there is insufficient evidence to reject the null hypothesis. Based on these results,
there is insufficient evidence for the local pizza restaurant to make the advertising claim
that it has a faster delivery time.
In this edition, we present the equivalent solution (on page 360):
Table 10.4 summarizes the results of the pooled-variance t test for the pizza delivery data
using the calculation above (not shown in this sample) and Figure 10.5 results. Based on
the conclusions, the local branch of the national chain and a local pizza restaurant have similar
delivery times. Therefore, as part of the last step of the DCOVA framework, you and your
friends exclude delivery time as a decision criterion when choosing from which store to
order pizza.
TABLE 10.4
Pooled-variance t test summary for the delivery times for the two pizza restaurants
Result
The tSTAT = -1.6341 is greater than -1.7341.
The t test p-value = 0.0598 is greater than the
level of significance, α = 0.05.
Conclusions
1. Do not reject the null hypothesis H0.
2. Conclude that insufficient evidence exists that the mean
delivery time is lower for the local restaurant than for the
branch of the national chain.
3. There is a probability of 0.0598 that tSTAT < -1.6341.
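The two decision rules that Table 10.4 summarizes (comparing tSTAT with the critical value, and comparing the p-value with α) can be sketched in Python. This is only an illustrative sketch, not the Excel, JMP, or Minitab procedure the book uses: the function names are hypothetical, and the 18 degrees of freedom are an assumption inferred from the critical value -1.7341, because the sample sizes do not appear in this excerpt. The lower-tail probability is approximated numerically with Simpson's rule so that the example needs only the standard library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_lower_tail(t_stat, df, steps=2000):
    """Approximate P(T < t_stat) using Simpson's rule on [0, |t_stat|] and symmetry.

    steps must be even for Simpson's rule.
    """
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t_stat|
    return 0.5 - area if t_stat < 0 else 0.5 + area


def lower_tail_decision(t_stat, t_crit, p_value, alpha=0.05):
    """Return (reject by critical value, reject by p-value) for a lower-tail test."""
    return t_stat < t_crit, p_value < alpha


if __name__ == "__main__":
    # Values from Table 10.4; df=18 is an assumption (see the note above).
    p = t_lower_tail(-1.6341, df=18)
    print(round(p, 4))
    print(lower_tail_decision(-1.6341, -1.7341, p))
```

Both rules return the same answer for these values: neither calls for rejecting the null hypothesis, which matches conclusion 1 in Table 10.4.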
A Note of Thanks
Creating a new edition of a textbook is a team effort, and we thank our Pearson Education
editorial, marketing, and production teammates: Suzanna Bainbridge, Kathy Manley, Kaylee
Carlson, Thomas Hayward, Deirdre Lynch, Aimee Thorne, and Morgan Danna. And we would
be remiss not to note the continuing work of Joe Vetere to prepare our screen shot illustrations
and the efforts of Julie Kidd of Pearson CSC to ensure that this edition meets the highest standard of book production quality that is possible.
We also thank Gail Illich of McLennan Community College for preparing instructor resources for this edition and thank the following people whose comments helped us
improve this edition: Mohammad Ahmadi, University of Tennessee-Chattanooga; Sung Ahn,
Washington State University; Kelly Alvey, Old Dominion University; Al Batten, University
of Colorado-Colorado Springs; Alan Chesen, Wright State University; Gail Hafer, St. Louis
Community College-Meramec; Chun Jin, Central Connecticut State University; Benjamin
Lev, Drexel University; Lilian Prince, Kent State University; Bharatendra Rai, University of
Massachusetts Dartmouth; Ahmad Vakil, St. John’s University (NYC); and Shiro Withanachchi,
Queens College (CUNY).
We thank the RAND Corporation and the American Society for Testing and Materials for
their kind permission to publish various tables in Appendix E, and the American Statistical
Association for its permission to publish diagrams from the American Statistician. Finally, we
would like to thank our families for their patience, understanding, love, and assistance in making this book a reality.
Contact Us!
Please email us at authors@davidlevinestatistics.com or tweet us @BusStatBooks with your
questions about the contents of this book. Please include the hashtag #BSAFC8 in your tweet
or in the subject line of your email. We also welcome suggestions you may have for a future
edition of this book. And while we have strived to make this book as error-free as possible, we
also appreciate those who share with us any perceived problems or errors that they encounter.
If you need assistance using software, please contact your academic support person or
Pearson Support at support.pearson.com/getsupport/. They have the resources to resolve and
walk you through a solution to many technical issues in a way we do not.
As you use this book, be sure to make use of the “Resources for Success” that Pearson
Education supplies for this book (described on the following pages). We also invite you to visit
bsafc8.davidlevinestatistics.com (bit.ly/2Apx1xH), where we may post additional information or new content as necessary.
David M. Levine
Kathryn A. Szabat
David F. Stephan
Resources for Success
MyLab Statistics Online Course for Business Statistics:
A First Course, 8th Edition by Levine, Szabat, Stephan
(access code required)
MyLab Statistics is the teaching and learning platform
that empowers instructors to reach every student. By
combining trusted author content with digital tools and
a flexible platform, MyLab Statistics personalizes the
learning experience and improves results for
each student.
MyLab makes learning and using a variety of statistical
programs as seamless and intuitive as possible.
Download the data files that this book uses (see
Appendix C) in Excel, JMP, and Minitab formats.
Download supplemental files that
support in-book cases or extend
learning.
Instructional Videos
Access instructional support videos including
Pearson’s Business Insight and StatTalk videos,
available with assessment questions. Reference
technology study cards and instructional videos for
Excel, JMP, Minitab, StatCrunch, and R software.
Diverse Question Libraries
Build homework assignments, quizzes,
and tests to support your course learning
outcomes. From Getting Ready (GR) questions
to the Conceptual Question Library (CQL), we
have your assessment needs covered from
the mechanics to the critical understanding
of Statistics. The exercise libraries include
technology-led instruction, including new
Excel-based exercises, and learning aids to
reinforce your students’ success.
pearson.com/mylab/statistics
Instructor Resources
Instructor’s Solutions Manual presents solutions
for end-of-section and end-of-chapter problems
and answers to case questions, and provides
teaching tips for each chapter. The Instructor's
Solutions Manual is available for download at
www.Pearson.com or in MyLab Statistics.
Lecture PowerPoint Presentations, by Patrick
Schur, Miami University (Ohio), are available
for each chapter. These presentations provide
instructors with individual lecture notes to
accompany the text. The slides include many of the
figures and tables from the textbook. Instructors
can use these lecture notes as is or customize
them in Microsoft PowerPoint. The PowerPoint
presentations are available for download at
www.Pearson.com or in MyLab Statistics.
Test Bank contains true/false, multiple-choice,
fill-in, and problem-solving questions based on
the definitions, concepts, and ideas developed in
each chapter of the text. The Test Bank is available
for download at www.Pearson.com or in MyLab
Statistics.
TestGen® (www.pearsoned.com/testgen)
enables instructors to build, edit, print, and
administer tests using a computerized bank of
questions developed to cover all the objectives of
the text. TestGen is algorithmically based, allowing
instructors to create multiple but equivalent
versions of the same question or test with the click
of a button. Instructors can also modify test bank
questions or add new questions. The software and
test bank are available for download from Pearson
Education’s online catalog.
Student Resources
Student’s Solutions Manual provides detailed
solutions to virtually all the even-numbered
exercises and worked-out solutions to the
self-test problems.
(ISBN-10: 0-13-518243-3;
ISBN-13: 978-0-13-518243-7)
Online resources complement and extend the
study of business statistics and support the content
of this book. These resources include data files for
in-chapter examples and problems, templates
and model solutions, and optional topics
and chapters. (See Appendix C for a complete
description of the online resources.)
PHStat helps create Excel worksheet solutions to
statistical problems. PHStat uses Excel building
blocks to create worksheet solutions. These
worksheet solutions illustrate Excel techniques
and students can examine them to gain new Excel
skills. Additionally, many solutions are what-if
templates in which the effects of changing data on
the results can be explored. Such templates are
fully reusable on any computer on which Excel has
been installed. PHStat requires an access code and
separate download for use. PHStat access codes
can be bundled with this textbook using
ISBN-10: 0-13-399058-3;
ISBN-13: 978-0-13-399058-4.
Minitab® More than 4,000 colleges and universities
worldwide use Minitab software to help students
learn quickly and to provide them with a skill-set
that’s in demand in today’s data-driven workforce.
Minitab® includes a comprehensive collection
of statistical tools to teach beginning through
advanced courses. Bundling Minitab software
ensures students have the software they need for
the duration of their course work.
(ISBN-10: 0-13-445640-8;
ISBN-13: 978-0-13-445640-9)
JMP® Student Edition software is statistical
discovery software from SAS Institute Inc., the
leader in business analytics software and services.
JMP® Student Edition is a streamlined version of
JMP that provides all the statistics and graphics
covered in introductory and intermediate statistics
courses. Available for bundling with this textbook.
(ISBN-10: 0-13-467979-2;
ISBN-13: 978-0-13-467979-2)
First Things First
CONTENTS
“The Price of Admission”
FTF.1 Think Differently About Statistics
FTF.2 Business Analytics: The Changing Face of Statistics
FTF.3 Starting Point for Learning Statistics
FTF.4 Starting Point for Using Software
EXCEL GUIDE
JMP GUIDE
MINITAB GUIDE
TABLEAU GUIDE
USING STATISTICS
“The Price of Admission”
It’s the year 1900 and you are a promoter of theatrical productions, in the business of selling
seats for individual performances. Using your knowledge and experience, you establish a
selling price for the performances, a price you hope represents a good trade-off between
maximizing revenues and avoiding driving away demand for your seats. You print up tickets
and flyers, place advertisements in local media, and see what happens. After the event, you
review your results and consider if you made a wise trade-off.
Tickets sold very quickly? Next time perhaps you can charge more. The event failed to sell
out? Perhaps next time you could charge less or take out more advertisements to drive demand.
If you lived over 100 years ago, that’s about all you could do.
Jump ahead about 70 years. You’re still a promoter but now using a computer system that
allows your customers to buy tickets over the phone. You can get summary reports of advance
sales for future events and adjust your advertising on radio and on TV and, perhaps, add or
subtract performance dates using the information in those reports.
Jump ahead to today. You’re still a promoter but you now have a fully computerized sales
system that allows you to constantly adjust the price of tickets. You also can manage many more
categories of tickets than just the near-stage and far-stage categories you might have used many
years ago. You no longer have to wait until after an event to make decisions about changing
your sales program. Through your sales system you have gained insights about your customers
such as where they live, what other tickets they buy, and their appropriate demographic traits.
Because you know more about your customers, you can make your advertising and publicity
more efficient by aiming your messages at the types of people more likely to buy your tickets.
By using social media networks and other online media, you can also learn almost immediately
who is noticing and responding to your advertising messages. You might even run experiments
online presenting your advertising in two different ways and seeing which way sells better.
Your current self has capabilities that allow you to be a more effective promoter than any
older version of yourself. Just how much better? Turn the page.
OBJECTIVES
■ Statistics is a way of thinking that can lead to better decision making
■ Statistics requires analytical skills and is an important part of your business education
■ Recent developments such as the use of business analytics and “big data” have made knowing statistics even more critical
■ The DCOVA framework guides your application of statistics
■ The opportunity business analytics represents for business students
Now Appearing on Broadway … and Everywhere Else
student TIP
From other business courses, you may recognize that Disney’s system uses dynamic pricing.
In early 2014, Disney Theatrical Productions woke up the rest of Broadway when reports
revealed that its 17-year-old production of The Lion King had been the top-grossing Broadway
show in 2013. How could such a long-running show, whose most expensive ticket was less than
half the most expensive ticket on Broadway, earn so much while being so old? Over time, grosses
for a show decline, and weekly grosses for The Lion King had dropped about 25% by the year
2009. But, in 2013, grosses were up 67% from 2009, and weekly grosses typically exceeded the
grosses of opening weeks in 1997, adjusted for inflation!
Heavier advertising and some changes in ticket pricing helped, but the major reason for
this change was something else: combining business acumen with the systematic application of
business statistics and analytics to the problem of selling tickets. As a producer of the newest
musical at the time said, “We make educated predictions on price. Disney, on the other hand,
has turned this into a science” (see reference 3).
Disney had followed the plan of action that this book presents. It had collected its daily and
weekly results and summarized them, using techniques this book introduces in the next three
chapters. Disney then analyzed those results by performing experiments and tests on the data
collected (using techniques that later chapters introduce). In turn, those analyses were applied
to a new interactive seating map that allowed customers to buy tickets for specific seats and
permitted Disney to adjust the pricing of each seat for each performance. The whole system
was constantly reviewed and refined, using the semiautomated methods to which Chapter 14
will introduce you. The end result was a system that outperformed the ticket-selling methods
others used.
FTF.1 Think Differently About Statistics
The “Using Statistics” scenario suggests, and the Disney example illustrates, that modern-day
information technology has allowed businesses to apply statistics in ways that could not be done
years ago. This scenario and example reflect how this book teaches you about statistics. In these
first two pages, you may notice
• the lack of calculation details and “math.”
• the emphasis on enhancing business methods and management decision making.
• that none of this seems like the content of a middle school or high school statistics class
you may have taken.
You may have had some prior knowledge or instruction in mathematical statistics. This book
discusses business statistics. While the boundary between the two can be blurry, business statistics emphasizes business problem solving and shows a preference for using software to perform
calculations.
One similarity that you might notice between these first two pages and any prior instruction
is data. Data are the facts about the world that one seeks to study and explore. Some data are
unsummarized, such as the facts about a single ticket-selling transaction, whereas other facts,
such as weekly ticket grosses, are summarized, derived from a set of unsummarized data. While
you may think of data as being numbers, such as the cost of a ticket or the percentage that weekly
grosses have increased in a year, do not overlook that data can be non-numerical as well, such
as a ticket buyer’s name, seat location, or method of payment.
Statistics: A Way of Thinking
Statistics are the methods that allow you to work with data effectively. Business statistics
focuses on interpreting the results of applying those methods. You interpret those results to help
you enhance business processes and make better decisions. Specifically, business statistics provides you with a formal basis to summarize and visualize business data, reach conclusions about
that data, make reliable predictions about business activities, and improve business processes.
You must apply this way of thinking correctly. Any “bad” things you may have heard about
statistics, including the quote “there are lies, damned lies, and statistics” made famous
by Mark Twain, speak to the errors that people make when either misusing statistical methods or
mistaking statistics for a substitute for, and not an enhancement of, a decision-making process.
(Disney Theatrical Productions’ success was based on combining statistics with business acumen,
not replacing that acumen.)
DCOVA Framework To minimize errors, you use a framework that organizes the set of
tasks that you follow to apply statistics properly. The five tasks that comprise the DCOVA
framework are
• Define the data that you want to study to solve a problem or meet an objective.
• Collect the data from appropriate sources.
• Organize the data collected, by developing tables.
• Visualize the data collected, by developing charts.
• Analyze the data collected, reach conclusions, and present the results.
You must always do the Define and Collect tasks before doing the other three. The order of the
other three varies, and sometimes all three are done concurrently. In this book, you will learn
more about the Define and Collect tasks in Chapter 1 and then be introduced to the Organize
and Visualize tasks in Chapter 2. Beginning with Chapter 3, you will learn methods that help
complete the Analyze task. Throughout this book, you will see specific examples that apply the
DCOVA framework to specific business problems and examples.
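As a rough illustration of how the five DCOVA tasks fit together, the following Python sketch walks through them for a tiny set of ticket sales. The ticket records, seat categories, and prices are made up for this illustration and do not come from the Disney example, and Python is not one of the programs this book uses; treat the sketch as one possible way to picture the framework, not as a prescribed procedure.

```python
from collections import Counter
from statistics import mean

# Define: the variables of interest are the seat category and the price
# of each ticket sold (hypothetical variables chosen for illustration).

# Collect: a small, made-up list of ticket transactions.
tickets = [
    {"seat": "orchestra", "price": 120.0},
    {"seat": "balcony", "price": 60.0},
    {"seat": "orchestra", "price": 130.0},
    {"seat": "mezzanine", "price": 90.0},
    {"seat": "balcony", "price": 55.0},
]

# Organize: develop a summary table of ticket counts per seat category.
summary_table = Counter(ticket["seat"] for ticket in tickets)

# Visualize: develop a crude text bar chart from the summary table.
chart_lines = [
    f"{seat:<10} {'#' * count}"
    for seat, count in sorted(summary_table.items())
]

# Analyze: compute a statistic and present the results.
average_price = mean(ticket["price"] for ticket in tickets)

print("\n".join(chart_lines))
print(f"Average ticket price: {average_price:.2f}")
```

Note that the Define and Collect steps come first, while the Organize, Visualize, and Analyze steps could be reordered or done together without changing the results.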
Analytical Skills More Important Than Arithmetic Skills The business preference for using software to automate statistical calculations maximizes the importance of having
analytical skills while it minimizes the need for arithmetic skills. With software, you perform
calculations faster and more accurately than if you did those calculations by hand, minimizing
the need for advanced arithmetic skills. However, with software you can also generate inappropriate or meaningless results if you have not fully understood a business problem or goal under
study or if you use that software without a proper understanding of statistics.
Therefore, using software to create results that help solve business problems or meet business goals is always intertwined with using a framework. And using software does not mean
memorizing long lists of software commands or how-to operations, but knowing how to review,
modify, and possibly create software solutions. If you can analyze what you need to do and
have a general sense of what you need, you can always find instructions or illustrative sample
solutions to guide you. (This book provides detailed instructions as well as sample solutions
for every statistical activity discussed in end-of-chapter software guides and through the use of
various downloadable files and sample solutions.)
If you were introduced to using software in an application development setting or an introductory information systems class, do not mistake building applications from scratch for a
necessary skill. A “smart” smartphone user knows how to use apps such as Facebook, Instagram,
YouTube, Google Maps, and Gmail effectively to communicate or discover and use information
yet may have no idea how to construct a social media network, create a mapping system, or write an
email program. Your approach to using the software in this book should be the same as that smart
user. Use your analytical skills to focus on being an effective user and to understand conceptually
what a statistical method or the software that implements that method does.
Statistics: An Important Part of Your Business Education
Until you read these pages, you may have seen a course in business statistics solely as a
required course with little relevance to your overall business education. In just two pages, you
have learned that statistics is a way of thinking that can help enhance your effectiveness in
business—that is, applying statistics correctly is a fundamental, global skill in your business
education.
In the current data-driven environment of business, you need the general analytical skills
that allow you to work with data and interpret analytical results regardless of the discipline
in which you work. No longer is statistics only for accounting, economics, finance, or other
disciplines that directly work with numerical data. As the Disney example illustrates, the
decisions you make will be increasingly based on data and not on your gut or intuition supported by past experience. Having a well-balanced mix of statistics, modeling, and basic
technical skills as well as managerial skills, such as business acumen and problem-solving
and communication skills, will best prepare you for the workplace today … and tomorrow
(see reference 1).
FTF.2 Business Analytics: The Changing Face of Statistics
Of the recent changes that have made statistics an important part of your business education,
the emergence of the set of methods collectively known as business analytics may be the most
significant change of all. Business analytics combine traditional statistical methods with methods from management science and information systems to form an interdisciplinary tool that
supports fact-based decision making. Business analytics include
• statistical methods to analyze and explore data that can uncover previously unknown or
unforeseen relationships.
• information systems methods to collect and process data sets of all sizes, including very
large data sets that would otherwise be hard to use efficiently.
• management science methods to develop optimization models that support all levels of
management, from strategic planning to daily operations.
In the Disney Theatrical Productions example, statistical methods helped determine pricing
factors, information systems methods made the interactive seating map and pricing analysis
possible, and management science methods helped adjust pricing rules to match Disney’s goal
of sustaining ticket sales into the future. Other businesses use analytics to send custom mailings
to their customers, and businesses such as the travel review site tripadvisor.com use analytics
to help optimally price advertising as well as generate information that makes a persuasive case
for using that advertising.
Generally, studies have shown that businesses that actively use business analytics and
combine that use with data-guided management see increases in productivity, innovation, and
competitiveness (see reference 1). Chapter 14 introduces you to the statistical methods typically
used in business analytics and shows how these methods are related to statistical methods that
the book discusses in earlier chapters.
“Big Data”
Big data is a collection of data that cannot be easily browsed or analyzed using traditional
methods. Big data implies data that are being collected in huge volumes, at very fast rates or
velocities (typically in near real time), and in a variety of forms that can differ from the structured forms such as records stored in files or rows of data stored in worksheets that businesses
use every day. These attributes of volume, velocity, and variety (see reference 5) distinguish
big data from a “big” (large) set of data that contains numerous records or rows of similar data.
When combined with business analytics and the basic statistical methods discussed in this book,
big data presents opportunities to gain new management insights and extract value from the data
resources of a business (see reference 8).
Unstructured Data Big data may also include unstructured data, data that have an
irregular pattern and contain values that are not comprehensible without additional automated or manual interpretation. Unstructured data take many forms such as unstructured text,
pictures, videos, and audio tracks, with unstructured text, such as social media comments,
getting the most immediate attention today for its possible use in customer, branding, or
marketing analyses. Unstructured data can be adapted for use with a number of methods, such
as regression, which this book illustrates with conventional, structured files and worksheets.
Unstructured data may require performing data collection and preparation tasks beyond those
tasks that Chapter 1 discusses. While describing all such tasks is beyond the scope of this book,
Section 14.1 includes an example of the additional interpretation that is necessary when working
with unstructured text.
FTF.3 Starting Point for Learning Statistics
Statistics has its own vocabulary, and learning the precise meanings, or operational definitions,
of basic terms provides the basis for understanding the statistical methods that this book discusses. For example, in statistics, a variable defines a characteristic, or property, of an item or
individual that can vary among the occurrences of those items or individuals. For
the item “book,” variables would include the title and number of chapters, as these facts can
vary from book to book. For a given book, these variables have a specific value. For this book,
the value of the title variable would be “Business Statistics: A First Course,” and “15” would be
the value for the number of chapters variable. Note that a statistical variable is not an algebraic
variable, which serves as a stand-in to represent one value in an algebraic statement and could
never take a non-numerical value such as “Business Statistics: A First Course.”
Using the definition of variable, data, in its statistical sense, can be defined as the set of
values associated with one or more variables. In statistics, each value for a specific variable is
a single fact, not a list of facts. For example, what would be the value of the variable author for
this book? Without this rule, you might say that the single list “Levine, Szabat, Stephan” is the
value. However, applying this rule, one would say that the variable has three separate values:
“Levine”, “Stephan”, and “Szabat”. This distinction of using only single-value data has the
practical benefit of simplifying the task of entering data for software analysis.
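The single-value rule can be pictured with a short Python sketch. The sketch assumes a simple one-row-per-value layout (the `author_rows` list is a hypothetical structure, not something this book defines) and shows how the list “Levine, Szabat, Stephan” becomes three separate values of the author variable.

```python
# The title value comes from the text; the row layout is an assumption
# made only for this illustration.
book_title = "Business Statistics: A First Course"

# A single entry that holds a list of facts (what the rule advises against).
author_as_list = "Levine, Szabat, Stephan"

# Single-value form: one row per value of the author variable.
author_rows = [
    {"title": book_title, "author": name.strip()}
    for name in author_as_list.split(",")
]

# Each row now holds exactly one author value.
author_values = sorted(row["author"] for row in author_rows)
print(author_values)  # ['Levine', 'Stephan', 'Szabat']
```

Storing one value per entry means that software can count, sort, or filter authors without first having to split apart a combined entry.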
Using the definitions of data and variable, the definition of statistics can be restated as
the methods that analyze the data of the variables of interest. The methods that primarily help
summarize and present data comprise descriptive statistics. Methods that use data collected
from a small group to reach conclusions about a larger group comprise inferential statistics.
Chapters 2 and 3 introduce descriptive methods, many of which are applied to support the inferential methods that the rest of the book presents.
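The difference between the two kinds of methods can be sketched in a few lines of Python. The sample of ticket prices below is made up, and the interval uses a normal approximation purely to keep the sketch short; it is a simplification chosen for this illustration and may differ from the confidence interval methods that later chapters present.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# A small, made-up sample of ticket prices drawn from a larger group.
sample_prices = [52.0, 48.0, 50.0, 55.0, 45.0, 50.0]

# Descriptive statistics: summarize the data actually collected.
sample_mean = mean(sample_prices)
sample_sd = stdev(sample_prices)

# Inferential statistics: use the sample to reach a conclusion about the
# larger group. Here, an approximate 95% interval for the mean of the
# larger group, based on a normal approximation (an assumption).
z = NormalDist().inv_cdf(0.975)
margin = z * sample_sd / sqrt(len(sample_prices))
interval = (sample_mean - margin, sample_mean + margin)

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% interval: {interval[0]:.2f} to {interval[1]:.2f}")
```

The first two results describe only the six prices collected; the interval is a statement about prices that were never observed, which is what makes it inferential.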
Statistic
The previous section uses statistics in the sense of a collective noun, a noun that is the name
for a collection of things (methods in this case). The word statistics also serves as the plural
form of the noun statistic, as in “one uses methods of descriptive statistics (collective noun) to
generate descriptive statistics (plural of the singular noun).” In this sense, a statistic refers to a
value that summarizes the data of a particular variable. (More about this in coming chapters.)
In the Disney Theatrical Productions example, the statement “for 2013, weekly grosses were
up 67% from 2009” cites a statistic that summarizes the variable weekly grosses using the 2013
data—all 52 values.
When someone warns you of a possible unfortunate outcome by saying, “Don’t be a
statistic!” you can always reply, “I can’t be.” You always represent one value and a statistic
always summarizes multiple values. For the statistic “87% of our employees suffer a workplace
accident,” you, as an employee, will either have suffered or have not suffered a workplace
accident. The “have” or “have not” value contributes to the statistic but cannot be the statistic.
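A short Python sketch makes the distinction concrete. It assumes a hypothetical company of 100 employees, 87 of whom have suffered a workplace accident; the employee count is invented so that the arithmetic matches the 87% figure in the text.

```python
# Hypothetical data: one value per employee.
# True means "have suffered an accident"; False means "have not".
accident_values = [True] * 87 + [False] * 13

# A single employee contributes exactly one value ...
your_value = accident_values[0]

# ... while the statistic summarizes all of the values together.
percent_with_accident = 100 * sum(accident_values) / len(accident_values)

print(your_value)             # True
print(percent_with_accident)  # 87.0
```

The variable `your_value` can only ever be `True` or `False`, whereas `percent_with_accident` exists only because many such values were combined.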
A statistic can facilitate preliminary decision making. For example, would you immediately
accept a position at a company if you learned that 87% of their employees suffered a workplace
accident? (Sounds like this might be a dangerous place to work and that further investigation
is necessary.)
Can Statistics (pl., statistic) Lie?
The famous quote “lies, damned lies, and statistics” actually refers to the plural form of statistic
and does not refer to statistics, the field of study. Can any statistic “lie”? No. Faulty or invalid statistics can only be produced through willful misuse of statistics or when DCOVA framework tasks
are done incorrectly. For example, many statistical methods are valid only if the data being analyzed have certain properties. To the extent possible, you test the assertion that the data have those
properties, which in statistics are called assumptions. When an assumption is violated, that is, shown to be
invalid for the data being analyzed, the methods that require that assumption should not be used.
For the inferential methods that this book discusses in later chapters, you must always look
for logical causality. Logical causality means that you can plausibly claim something directly
causes something else. For example, you wear black shoes today and note that the weather is
sunny. The next day, you again wear black shoes and notice that the weather continues to be
sunny. The third day, you change to brown shoes and note that the weather is rainy. The fourth
day, you wear black shoes again and the weather is again sunny. These four days seem to suggest a strong pattern between your shoe color choice and the type of weather you experience.
You begin to think that if you wear brown shoes on the fifth day, the weather will be rainy. Then
you realize that your shoes cannot plausibly influence weather patterns, that your shoe color
choice cannot logically cause the weather. What you are seeing is mere coincidence. (On the
fifth day, you do wear brown shoes and it happens to rain, but that is just another coincidence.)
You can easily spot the lack of logical causality when trying to correlate shoe color choice
with the weather, but in other situations the lack of logical causality may not be so easily seen.
Therefore, relying on such correlations by themselves is a fundamental misuse of statistics.
When you look for patterns in the data being analyzed, you must always be thinking of logical
causes. Otherwise, you are misrepresenting your results. Such misrepresentations sometimes
cause people to wrongly conclude that all statistics are “lies.” Statistics (pl., statistic) are not lies
or “damned lies.” They play a significant role in statistics, the way of thinking that can enhance
your decision making and increase your effectiveness in business.
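To see how convincing a coincidence can look in numbers, the following Python sketch computes the coefficient of correlation for the five days of the shoe color example. Coding brown shoes and rainy weather as 1 and black shoes and sunny weather as 0 is an assumption made for this illustration, and the `pearson_r` function is a hypothetical helper written here, not a function from the software this book uses.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the coefficient of correlation between two lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Days 1-5 from the text: black/sunny, black/sunny, brown/rainy,
# black/sunny, brown/rainy.
brown_shoes = [0, 0, 1, 0, 1]  # 1 = brown shoes, 0 = black shoes
rainy = [0, 0, 1, 0, 1]        # 1 = rainy, 0 = sunny

r = pearson_r(brown_shoes, rainy)
print(f"r = {r:.2f}")
```

The result is as strong as a correlation can be, yet nothing in the calculation can tell you that shoe color cannot cause the weather. That judgment about logical causality has to come from you.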
FTF.4 Starting Point for Using Software
This book uses Microsoft Excel, JMP, Minitab, and Tableau to help explain and illustrate statistical concepts and methods as well as demonstrate how such applications can help facilitate
business decision making. To begin using the software that this book uses requires only the
knowledge of basic user interface skills, operations, and vocabulary that Table FTF.1 summarizes and which the supplemental, online Basic Computing Skills document reviews. (Learn
more about online supplemental files in Appendix C.)
TABLE FTF.1 Basic Computing Knowledge
Skill or Operation | Specifics
Identify and use standard window objects | Title bar, minimize/resize/close buttons, scroll bars, mouse pointer, menu bars or ribbons, dialog box, window subdivisions such as areas, panes, or child windows
Identify and use common dialog box items | Command button, list box, drop-down list, edit box, option button, check box, tabs (tabbed panels)
Mouse operations | Click, called select in some list or menu contexts and check or clear in some check box contexts; double-click; right-click; drag and drag-and-drop
Excel, JMP, and Minitab all use worksheets to display the contents of a data set and as the
means to enter or edit data. (JMP calls its worksheets data tables.) Worksheets are containers
that present tabular arrangements of data, in which the intersections of rows and columns form
cells, boxes into which individual entries are made. One places the data for a variable into the
cells of a column such that each column contains the data for a different variable, if more than
one variable is under study. By convention, one uses the cells in the initial row to enter the names of
the variables (variable columns).
Shown below, from back to front, are a Minitab worksheet, a JMP data table, and an Excel
worksheet. The JMP and Minitab containers contain a special unnumbered row into which
column variable names can be entered. In Excel, variable names are entered in row 1 of the
worksheet, which can sometimes lead to inadvertent errors.
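The worksheet layout described above (one column per variable, variable names in the initial row, one value per cell) can be mimicked outside these programs. The Python sketch below reads such a layout from comma-separated text; the variable names echo ones this book uses elsewhere, but the three rows of values are made up for illustration.

```python
import csv
import io

# Row 1 holds the variable names; every later row holds one value per
# variable. The values are hypothetical.
worksheet_text = """Fund Type,Assets,Expense Ratio
Growth,150.5,0.95
Value,320.0,0.80
Growth,75.25,1.10
"""

rows = list(csv.DictReader(io.StringIO(worksheet_text)))

# The names taken from the initial row identify the variable columns.
variable_names = list(rows[0].keys())

# All the data for one variable come from a single column.
assets_column = [float(row["Assets"]) for row in rows]

print(variable_names)  # ['Fund Type', 'Assets', 'Expense Ratio']
print(assets_column)   # [150.5, 320.0, 75.25]
```

Because the names occupy an ordinary first row here, just as they do in Excel, a program that mistook that row for data would try to treat “Assets” as a number, which is the kind of inadvertent error the text mentions.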
student TIP
Selected Excel, JMP, and Minitab solutions that this book presents exist as templates that simplify the production of results and serve as models for learning more about using formulas.
Appendix Section G.5 explains the limitations on using Tableau workbooks that Tableau Public users face.
Generally, entries in each cell are single data values that can be text or numbers. All three
programs also permit formulas, instructions to process data, to compute cell values. Formulas
can include functions that simplify certain arithmetic tasks or provide access to advanced processing or statistical features. Formulas play an important role in designing templates, reusable
solutions that have been previously audited and verified. However, JMP and Minitab allow only
column formulas that define calculations for all the cells in a column, whereas Excel allows only
cell formulas that define calculations for individual cells.
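One way to picture the difference between the two kinds of formulas is the Python sketch below. It is only an analogy with made-up prices and quantities: the list comprehension plays the role of a column formula, and the single computed total plays the role of a cell formula. None of the names correspond to actual features of Excel, JMP, or Minitab.

```python
# Two hypothetical variable columns.
prices = [100.0, 80.0, 120.0]
quantities = [2, 5, 1]

# Column formula analogy: one rule defines the value of every cell
# in a new column.
revenue_column = [price * quantity for price, quantity in zip(prices, quantities)]

# Cell formula analogy: a calculation attached to one individual cell,
# here a total placed beneath the revenue column.
total_revenue_cell = sum(revenue_column)

print(revenue_column)      # [200.0, 400.0, 120.0]
print(total_revenue_cell)  # 720.0
```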
These three programs save worksheet data and results as one file, called a workbook in Excel
and a project in JMP and Minitab. JMP and Minitab also allow the saving of individual worksheets or results as separate files, whereas Excel always saves a workbook even if the workbook
contains (only) one worksheet. Both JMP and Minitab can open the data worksheets of an Excel
workbook, making the Excel workbook a universal format for sharing files that contain only data,
such as the set of data files for use with this book that Appendix C documents.
Tableau Differences Like Excel, Tableau uses workbooks to store one or more worksheets, but Tableau defines the concept of worksheet differently. A Tableau worksheet stores
tabular and visual summaries that are associated with a separately defined data source that can
be a complex collection of data or be equivalent to an Excel data worksheet (as are the data
sources that this book uses for examples). Data sources can be viewed and column formulas
can be used to define new columns, but individual values cannot be edited. Data sources can
be unique to a specific Tableau worksheet or shared by several Tableau worksheets. Tableau
workbooks can also store dashboards, a concept that Chapter 14 discusses. Table FTF.2 summarizes some of the various file formats that the four programs use. (Appendix D discusses
macro and add-in files.)
TABLE FTF.2 Common File Formats for Excel, JMP, Minitab, and Tableau
File Type | Excel | JMP | Minitab | Tableau
All-in-one-file | .xlsx (workbook) | .jmpprj (project) | .mpj (project) | .twbx (workbook)
Single worksheet | .xlsx | .jmp | .mtw |
Results only | n.a. | .jrp (report) | .mgf (graph) |
Macro or add-in | .xlsm, .xlam | .jsl, .jmpaddin | .mtb, .mac |
Using Software Properly
student TIP
Check the student download web page for this book for more information about PHStat and JMP and Minitab macros and add-ins that may be available for download.
Learning to use software properly can be hard as software has limited ways to provide feedback
for user actions that are invalid operations. In addition, no software will ever know if you are
following proper procedures for using that software. The principles that Exhibit FTF.1 lists will
assist you and should govern your use of software with this book. These principles will minimize
your chance of making errors and lessen the frustration that often occurs when these principles
are unknown or overlooked by a user.
EXHIBIT FTF.1
Principles of Using Software Properly
Ensure that software is properly updated. Users who manage their own
computers often overlook the importance of ensuring that all installed software
is up to date.
Understand the basic operational tasks. Take the time to master the tasks of
starting the application, loading and entering data, and how to select or choose
commands in a general way.
Understand the statistical concepts that an application uses. Not understanding
those concepts can lead to making wrong choices in the application and can make
interpreting results difficult.
Know how to review software use for errors. Review and verify that the proper
data preparation procedures (see Chapter 1) have been applied to the data before
analysis. Verify that the correct procedures, commands, and software options
have been selected. For information entered for labeling purposes, verify that no
typographical errors exist.
Seek reuse of preexisting solutions to solve new problems. Build solutions
from scratch only as necessary, particularly if using Excel in which errors can be
most easily made. Some solutions, and almost all Excel solutions that this book
presents, exist as models or templates that can and should be reused because such
reuse models best practice.
Understand how to organize and present information from the results that
the software produces. Think about the best ways to arrange and label the data.
Consider ways to enhance or reorganize results that will facilitate communication
with others.
Use self-identifying names, especially for the files that you create and save.
Naming files Document 1, Document 2, and so on will impede the later retrieval
and use of those files.
In addition, look for ways in which you can simplify the user interface of the software you use. If using Excel with this book, consider using PHStat, supplied separately or
as part of a bundle by Pearson. PHStat simplifies the user interface by providing a consistent
dialog box-driven interface that minimizes keystrokes and mouse selections. If using JMP
or Minitab, look for macros and add-ins that simplify command sequences or automate
repetitive activities.
Software-related Conventions Table FTF.3 summarizes the software-related conventions that this book uses. These conventions are used extensively in the end-of-chapter software guides and certain appendices to provide a concise and clear way of expressing
specific user activities.
TABLE FTF.3 Conventions That This Book Uses
Convention | Example
Special key names appear capitalized and in boldface | Press Enter. Press Command or Ctrl.
Key combinations appear in boldface, with key names linked using this symbol: + | Enter the formula and press Ctrl+Enter. Press Ctrl+C.
Menu or Ribbon selections appear in boldface and sequences of consecutive selections are shown using this symbol: ➔ | Select File ➔ New. Select PHStat ➔ Descriptive Statistics ➔ Boxplot.
Targets of mouse operations appear in boldface | Click OK. Select Attendance and then click the Y button.
Entries and the location of where entries are made appear in boldface | Enter 450 in cell B5. Add Temperature to the Model Effects list.
Variable or column names sometimes appear capitalized for emphasis | This file contains the Fund Type, Assets, and Expense Ratio for the growth funds.
Placeholders that express a general case appear in italics and may also appear in boldface as part of a function definition | AVERAGE(cell range of variable). Replace cell range of variable with the cell range that contains the Asset variable.
Names of data files mentioned in sections or problems appear in a special font but appear in boldface in end-of-chapter Guide instructions | Open the Retirement Funds workbook.
When current versions of Excel and Minitab differ in their user interface, alternate instructions for older versions appear in a second color immediately following the primary instructions | In the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box. Click Edit under the Horizontal (Categories) Axis Labels heading.
REFERENCES
1. Advani, D. “Preparing Students for the Jobs of the Future.” University Business (2011), bit.ly/1gNLTJm.
2. Davenport, T., J. Harris, and R. Morison. Analytics at Work. Boston: Harvard Business School Press, 2010.
3. Healy, P. “Ticket Pricing Puts ‘Lion King’ atop Broadway’s Circle of Life.” New York Times, New York edition, March 17, 2014, p. A1, and nyti.ms.1zDkzki.
4. JPMorgan Chase. “Report of JPMorgan Chase & Co. Management Task Force Regarding 2012 CIO Losses,” bit.ly/1BnQZzY, as quoted in J. Kwak, “The Importance of Excel,” The Baseline Scenario, bit.ly/1LPeQUy.
5. Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. Stamford, CT: META Group. February 6, 2001.
6. Levine, D., and D. Stephan. “Teaching Introductory Business Statistics Using the DCOVA Framework.” Decision Sciences Journal of Innovative Education 9 (Sept. 2011): 393–398.
7. Marr, B. “20 Claims About Big Data and Why They All Are Wrong.” Data-informed.com, September 28, 2015.
8. “What Is Big Data?” IBM Corporation, www.ibm.com/big-data/us/en/.
KEY TERMS
big data
business analytics
cells
data
data table
DCOVA framework
descriptive statistics
formula
function
inferential statistics
logical causality
operational definition
project (JMP, Minitab)
statistic
statistics
summarized data
template
unstructured data
variable
workbook
worksheet
First Things First
EXCEL GUIDE
EG.1 GETTING STARTED with EXCEL
student TIP
Excel sometimes displays a task pane in the worksheet area that presents formatting and similar choices.
Opening Excel displays a window that contains the Office Ribbon tabs above a worksheet
area that displays the current worksheet of the current workbook, the name of which appears
centered in the title bar. The top of the worksheet area contains a formula bar that allows you
to see and edit the contents of the currently selected cell (cell A1 in the illustration). Immediately below the worksheet grid is a sheet tab that identifies the name of the current worksheet
(DATA). In workbooks with more than one sheet, clicking the sheet tabs navigates through
the workbook. In the illustration, the current cell is cell A1, and its content is displayed in the
formula bar.
The Basic Computing
Skills online document (see
Appendix C) discusses the
other standard features
of Microsoft Office seen in the
illustration.
The illustration shows the Retirement Funds workbook, one of many data workbooks that
support this textbook. Use the Excel data workbooks with either the Excel Guide workbooks or
PHStat to create solutions to problems or to recreate results used in examples. See Appendix C
for a description of these learning resources. (Using PHStat requires a separate download and
an access code, which may have been bundled with the purchase of this book, as Appendix H
explains.)
EG.2 ENTERING DATA
In Excel, enter data into worksheet columns, starting with
the leftmost, first column, using the cells in row 1 to enter
variable names. Avoid skipping rows or columns as such
skipping can disrupt or alter the way certain Excel procedures work. Complete a cell entry by pressing Tab or Enter,
or, if using the formula bar to make a cell entry, by clicking
the check mark icon in the formula bar. To enter or edit data
in a specific cell, either use the cursor keys to move the cell
pointer to the cell or select the cell directly.
Try to avoid using numbers as row 1 variable headings; if
you cannot avoid their use, precede such headings with apostrophes. Pay attention to special instructions in this book that
note specific orderings of variable columns that are necessary
for some Excel operations. When in doubt, use the DATA
worksheets of the Excel Guide Workbooks as the guide for
entering and arranging variable data.
EG.3 OPEN or SAVE a WORKBOOK
Use File ➔ Open or File ➔ Save As.
Open and Save As use similar means to open or save the
workbook by name while specifying the physical device or
network location and folder for that workbook. Save As dialog boxes enable one to save a file in alternate formats for programs that cannot open Excel workbooks (.xlsx files) directly.
Alternate formats include Text (Tab delimited) (*.txt), which saves the contents of the current worksheet as a simple text
file with values delimited with tab characters, and CSV (Comma delimited) (*.csv), which saves worksheet cell values as text values that are delimited with commas. Excel for Mac lists these choices as Tab Delimited Text
(.txt) and Windows Comma Separated (.csv).
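The difference between the two text formats can be seen outside Excel. The following minimal Python sketch writes the same small table both ways, using only the standard csv module. The variable names and values are hypothetical and are not taken from the Retirement Funds workbook, and the output only approximates what Excel itself would save.

```python
import csv
import io

# Hypothetical table: row 1 holds the variable names, as Section EG.2 recommends.
rows = [
    ["Fund Number", "Market Cap", "Type"],
    ["RF001", "Large", "Growth"],
    ["RF002", "Small", "Value"],
]


def to_delimited_text(table, delimiter):
    """Return the table as text with values separated by the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Comparable to Text (Tab delimited) (*.txt)
tab_text = to_delimited_text(rows, "\t")
# Comparable to CSV (Comma delimited) (*.csv)
csv_text = to_delimited_text(rows, ",")

print(tab_text)
print(csv_text)
```

Both results hold the same cell values; only the character that separates the values in each line differs.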
The illustration on the next page shows the part of the Save
As dialog box that contains the Save as type drop-down list.
(Open dialog boxes have a similar drop-down list.) In all Windows Excel versions, you can also select a file format in the Open
dialog box. Selecting All Files (*.*) from the drop-down list can
list files that had been previously saved in unexpected formats.
To open a new workbook, select File ➔ New (New
Workbook in Excel for Mac). Excel displays a new workbook
with one or more blank worksheets.
EG.4 WORKING with a WORKBOOK
Use Insert (or Insert Sheet), Delete, or Move or Copy.
Alter the contents of a workbook by adding a worksheet
or by deleting, copying, or rearranging the worksheets and
chart sheets that the workbook contains. To perform one of
these operations, right-click a sheet tab and select the appropriate choice from the shortcut menu that appears.
To add a worksheet, select Insert. In Microsoft Windows
Excel, you also click Worksheet and then click OK in the
Insert dialog box. To delete a worksheet or chart sheet, right-click the sheet tab of the worksheet to be deleted and select
Delete. To copy or rearrange the position of a worksheet or
chart sheet, right-click the sheet tab of the sheet and select
Move or Copy. In the Move or Copy dialog box, first select
the workbook and the position in the workbook for the sheet.
If copying a sheet, also check Create a copy. Then click OK.
EG.5 PRINT a WORKSHEET
Use File ➔ Print.
Excel prints worksheets and chart sheets, not workbooks. When
you select Print, Excel displays a preview of the currently
opened sheet in a dialog box or pane that allows you to select that
sheet or other sheets from the workbook. You can adjust the print
formatting of the worksheet(s) to be printed by clicking Page
Setup. Typically, in the Page Setup dialog box, you might click
the Sheet tab and then check or clear the Gridlines and Row
and column headings checkboxes to add or remove worksheet
cell gridlines and the numbered row and lettered column headings, similar to how a worksheet is displayed onscreen.
EG.6 REVIEWING WORKSHEETS
Follow the best practice of reviewing worksheets before you
use them to help solve problems. When you use a worksheet,
what you see displayed in cells may be the result of either the
recalculation of formulas or cell formatting. A cell that displays
4 might contain the value 4, might contain a formula calculation
that results in the value 4, or might contain a value such as 3.987
that has been formatted to display as the nearest whole number.
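The gap between what a cell displays and what it stores can be illustrated in a few lines of Python. This is a sketch of the general idea only: the three values mirror the cases described above, and Python number formatting stands in for Excel cell formatting.

```python
# Three cells that would all display "4" when formatted to zero decimal places.
stored_value = 4          # the cell contains the value 4
formula_result = 2 + 2    # the cell contains a formula that results in 4
formatted_value = 3.987   # the cell contains 3.987 but is formatted to show 4

cells = [stored_value, formula_result, formatted_value]

# What the screen would show for each cell.
displayed = [format(cell, ".0f") for cell in cells]
print(displayed)

# Calculations use the stored values, not the displayed ones.
sum_of_stored = sum(cells)
sum_of_displayed = sum(int(text) for text in displayed)
print(sum_of_stored, sum_of_displayed)
```

The two sums differ slightly, which is one reason to review what cells actually contain before using a worksheet.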
To display and review all formulas, press Ctrl+` (grave
accent). Excel displays the formula view of the worksheet,
revealing all formulas. (Pressing Ctrl+` a second time
restores the worksheet to its normal display.) If you use the
Excel Guide workbooks, you will discover that each workbook contains one or more FORMULAS worksheets that
provide a second way of viewing all formulas.
In the Excel solutions for this book, you will notice cell formatting operations that have changed the background color of
cells, changed text attributes such as boldface of cell entries, and
rounded values to a certain number of decimal places (typically
four). If you want to learn more about cell formatting,
Appendix B includes a summary of common formatting operations, including those used in the Excel solutions for this book.
EG.7 IF YOU USE the WORKBOOK
INSTRUCTIONS
Excel Guide Workbook instructions enable you to directly
modify the template and model worksheet solutions for problems other than the one they help solve. (In contrast, PHStat
provides a dialog box interface in which you make entries
that PHStat uses to automate such modifications.) Workbook
instructions express Excel operations in the most universal
way possible. For example, many instructions ask you to select
(click on) an item from a gallery of items and identify that item
selection by name. In some Excel versions, these names may
be visible captions for the item; in other versions, you will need
to move the mouse over the image to pop up the image name.
Guides also use the word display as in the “Format Axis
display” to refer to a user interaction that may be presented
by Excel in a task pane or a two-panel dialog box (“Format
Axis task pane” or “Format Axis dialog box”). Task panes
open to the side of the worksheet and can remain onscreen
indefinitely. Initially, some parts of a pane may be hidden and
you may need to click on an icon or label to reveal that hidden
part to complete a Workbook instruction. Two-panel dialog
boxes open over the worksheet and must be closed before you
can do other Excel activities. The left panel of such dialog
boxes is always visible, and clicking entries in the left panel
makes visible one of the right panels. (Click the system close
button at the top right of a task pane or dialog box to close
the display and remove it from the screen.)
Current Excel versions can vary in their menu sequences.
Excel Guide instructions show these variations as parenthetical
phrases. For example, the menu sequence, “select Design (or
Chart Design) ➔ Add Chart Element” tells you to first select
Design or Chart Design to begin the sequence and then to continue by selecting Add Chart Element. (Microsoft Windows
Excel uses Design and Excel for Mac uses Chart Design.)
For the current Excel versions that this book supports (see
the FAQs in Appendix G), the Workbook Instructions are generally identical. Occasionally, individual instructions may
differ significantly for one (or more) versions. In such cases,
the instructions that apply for multiple versions appear first,
in normal text, and the instructions for the unique version
immediately follow in this text color.
▼
First Things
First
JMP GUIDE
JG.1 GETTING STARTED with JMP
Opening JMP displays the JMP Home Window (shown below) that contains the main menu bar
and toolbar through which you make JMP command selections, as well as lists of recent files
and any other JMP windows that JMP has been set previously to open. In the illustration below,
JMP has opened the Retirement Fund data table window and the Retirement Fund - Chart by
Market Cap window and displays those two items in the Window List.
Windows that JMP opens or creates display independently of other windows and can be
arranged to overlap, as the illustration shows. Note that JMP displays thumbnails of results
windows associated with a data table in an evaluations done panel that appears below the data
table. In the Window List, associated results appear as indented list items under the name of
the data table window.
In many windows that JMP creates, JMP hides a copy of the home window’s menu bar
and tool bar under a “thin blue bar” as shown above. Clicking the thin blue bar, seen in the
Retirement Fund - Chart by Market Cap window, displays a copy of the home window’s user
interface. Most results windows also contain a red downward-pointing triangle to the left of a
result heading (Chart in the illustration). Clicking this red triangle displays a red triangle menu
of commands and options appropriate for the results that appear under the heading. Red triangle
menus also appear in other contexts, such as in the upper left corner of data tables where they
hide various row and column selection, data entry, and formatting commands.
A result heading also includes a gray right triangle disclosure button that hides or reveals
results (to the left of the red triangle in the Chart heading). By using the disclosure button and a
combination of red triangle menu selections you can tailor the results, what JMP calls a report,
to your specific needs.
Selecting Help ➔ Books from the main window’s menu bar displays a list of books in PDF
format that you can display in JMP or save and read when not using JMP. Consult the books
Discovering JMP and Using JMP as additional sources for getting started with JMP or to discover the JMP features and commands that the instructions in this book do not use.
JG.2 ENTERING DATA
In JMP, enter data into data table (worksheet) columns,
starting with the first numbered row and the leftmost, first
column. Never skip a cell when entering data because JMP
will interpret that skipped cell as a “missing value” (see
Section 1.4) that can affect analysis. Complete a cell entry
by pressing Enter. To enter or edit data in a specific cell,
either use the cursor keys to move the cell pointer to the cell
or select the cell directly.
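To see why a skipped cell matters, consider how a blank cell might be handled when data are read by a program. The Python sketch below is an illustration under stated assumptions, not a description of JMP internals: the data, the column names, and the read_column helper are all hypothetical.

```python
import csv
import io

# Hypothetical data with one skipped cell (fund B has no Return value).
raw_text = "Fund,Return\nA,5.2\nB,\nC,7.1\n"


def read_column(text, name):
    """Read one column, converting blank cells to None (a missing value)."""
    reader = csv.DictReader(io.StringIO(text))
    return [float(row[name]) if row[name].strip() else None for row in reader]


returns = read_column(raw_text, "Return")

# Only the values that are present can be used in a calculation.
present = [value for value in returns if value is not None]
missing_count = len(returns) - len(present)
mean_of_present = sum(present) / len(present)

print(returns, missing_count, mean_of_present)
```

The mean is computed from two values rather than three, so an accidentally skipped cell changes the result of the analysis.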
As one enters data into columns, JMP assigns default
column names in the form Column 1, Column 2, and so
forth. Change these default names to variable names by
double-clicking the name or right-clicking and selecting
Column Info from the shortcut menu. Either action displays
the Column dialog box in which you can enter the variable
name and set data type and scale, attributes of the data that
Chapter 1 explains.
JG.3 CREATE NEW PROJECT or DATA TABLE
Use File ➔ New ➔ Project.
Use File ➔ New ➔ Data Table.
While report windows that contain results can be opened and
saved separately from the data table that provides the data
for those results (see Section JG.4), best practice groups a
set of report window and data table files into one project file.
To create a project file, select File ➔ New ➔ Project. JMP
opens a new Projects window with the name Untitled. Right-click “Untitled” and rename the project. In the illustration
below, the project has been renamed Retirement Funds Market Cap Analysis.
To add a window to a project, right-click the project name and select Add Window. In the dialog box that
appears, select the window to be included and click OK. In
lieu of selecting Add Window, select Add All Windows to
add all onscreen JMP windows excluding the home window.
In the illustration below, the Retirement Funds data table
and Chart by Market Cap windows have been added to the
renamed project. Project files can be opened and saved as
Section JG.4 explains.
The data table New command opens a blank data table in
its own window. Any new data table is not automatically added
to the currently open project, and you must use the Add Window
command if you want a new data table to be part of a project.
JG.4 OPEN or SAVE FILES
Use File ➔ Open.
Use File ➔ Save As.
In JMP, all displayed windows can be opened or saved
as separate files, as can special grouping files such as projects. By default, JMP lists all JMP
file types in open operations and properly assigns the file
type in all save operations. To import an Excel workbook,
select Excel Files (*.xls, *.xlsx, *.xlsm) from the pulldown list in the Open Data File dialog box. To export a
JMP data table as an Excel file, change the Save as type
in the Save JMP File As dialog box to Excel Workbook
(*.xlsx, *.xls).
Report windows can be saved as “interactive HTML”
files that allow others to use systems on which JMP has not
been installed to explore results in an interactive way, using a
subset of JMP functionality. To save this type of file, change
the Save as type in the Save JMP File As dialog box to
Interactive HTML with Data (*.htm;*.html).
JG.5 PRINT DATA TABLES or REPORT
WINDOWS
Use File ➔ Print or File ➔ Print Preview.
To print a JMP object, select these File commands from
the window that contains the object you want to print. For
results (report) windows, first click the thin blue bar to reveal
the menu bar that contains File. If using Print Preview, JMP
opens a new window in which preview output and printing
options can be adjusted before printing. [To print, click the
leftmost (Print) icon in the window.]
JG.6 JMP SCRIPT FILES
JMP script files record many user interface actions and construct or modify JMP objects such as data tables. Using its
own JSL scripting language, JMP records your actions as
you analyze data in a script file that you can optionally save
and play back later to recreate the analysis. Saved script files
are text files that can be viewed, edited, and run in their
own JMP window or edited by word processing or text editing
applications.
JSL also includes user interface commands and directives allowing one to construct scripts that simplify and
customize the use of the JMP Home window menu bar and
toolbar. For selected chapters, JMP scripts created especially
for this book can facilitate your use of JMP for those chapters (see Appendix C). JMP scripts are sometimes packaged
as a JMP add-in that can be “installed” in JMP and directly
selected from the JMP Home window menu bar, eliminating
the need to open a script and then run the script from inside
the script window.
▼
First Things
First
MINITAB GUIDE
MG.1 GETTING STARTED with MINITAB
student TIP
To view a window that
may be obscured or
hidden, select Window
from the Minitab menu
bar, and then select the
name of the window you
want to view.
When you open Minitab, you see a main window and a number of child windows that cannot
be moved outside the boundaries of the main window. You will normally see a blank worksheet
and the Session window that records commands and displays results as the child windows.
Pictured below is a project with one opened worksheet. Besides the slightly obscured DATA
worksheet window and Session window, this figure also shows a Project Manager that lists
the contents of the current project. (Use the keyboard shortcut Ctrl+I to display the Project
Manager if it is not otherwise visible in the main window.)
MG.2 ENTERING DATA
In Minitab, enter data into worksheet columns, starting with
the first numbered row and leftmost, first column. Minitab
names columns using the form Cn, such that the first column
is named C1, the second column is C2, and the tenth column
is C10. Use the first, unnumbered and shaded row to enter variable names that can be used as a second way to refer to a column by name. If a variable name contains spaces or other special characters, such as Market Cap, Minitab will display that
name in dialog boxes using a pair of single quotation marks
('Market Cap'). You must include those quotation marks any
time you enter such a variable name in a dialog box.
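The two naming rules above can be sketched as small Python helper functions. This is an illustration only: the function names are hypothetical, and the test for "special characters" (anything other than letters, digits, and underscores) is an assumption made for this sketch rather than Minitab's actual rule.

```python
import re


def column_id(position):
    """Return the Cn identifier for a 1-based column position."""
    if position < 1:
        raise ValueError("column positions start at 1")
    return f"C{position}"


def dialog_name(variable_name):
    """Wrap a variable name in single quotation marks when it needs them.

    Assumption for this sketch: a name needs quotation marks if it contains
    anything other than letters, digits, or underscores.
    """
    if re.fullmatch(r"[A-Za-z0-9_]+", variable_name):
        return variable_name
    return f"'{variable_name}'"


print(column_id(1), column_id(10))
print(dialog_name("Market Cap"), dialog_name("Type"))
```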
If a column contains non-numerical data, Minitab
displays the column name with an appended -T such as
C1-T, C2-T, and C3-T in the worksheet shown above. If
a column contains data that Minitab interprets as either
dates or times, Minitab displays the column name with an
appended -D. If a column contains data that a column formula (see Chapter 1) computes, Minitab displays a small
green check mark above and to the right of the Cn name.
(Neither the appended -D nor the check mark is shown in
the worksheet above.)
To enter or edit data in a specific cell, either use the cursor keys to move the cell pointer to the cell or use your mouse
to select the cell directly. Never skip a cell in a numbered row
when entering data because Minitab will interpret a skipped
cell as a “missing value” (see Section 1.4).
MG.3 OPEN or SAVE FILES
Use File ➔ Open Worksheet or File ➔ Open Project and
File ➔ Save Current Worksheet or File ➔ Save Project As.
In Minitab, open and save individual worksheets or entire
projects, collections of worksheets, Session results, and
graphs. To save data in a form readable by Excel, select
Excel from the Save as type drop-down list before clicking Save. Data can also be saved as a simple text file, Text,
or as simple text with values delimited with commas, CSV.
In Minitab, individual graphs and a project’s session
window can be opened or saved, although these operations
are never used in the Minitab Guides in this book.
MG.4 INSERT or COPY WORKSHEETS
Use File ➔ New or File ➔ Open Worksheet.
To insert a new worksheet, select File ➔ New and in the New
dialog box click Minitab Worksheet and then click OK. To
insert a copy of a worksheet, select File ➔ Open Worksheet
and select the worksheet to be copied.
MG.5 PRINT WORKSHEETS
Use File ➔ Print Worksheet (or Print Graph or Print
Session Window).
Selecting Print Worksheet displays the Data Window Print
Options dialog box. In this dialog box, specify the title and
formatting options for printing. Selecting Print Graph or
Print Session Window displays a dialog box that allows you
to change the default printer settings.
If you need to change printing attributes, first select
File ➔ Print Setup and make the appropriate selections in
the Print dialog box before you select the Print command.
▼
First Things
First
TABLEAU GUIDE
TG.1 GETTING STARTED with TABLEAU
The Tableau Guides for this book feature Tableau Public, version 2018, also known as the
Tableau Desktop Public Edition. Tableau Public allows users to share data and create, in
worksheets, tabular and visual summaries that can be building blocks for a dashboard or a
story, a slide-based presentation. Tableau Public uses a drag-and-drop interface that will be
most familiar to users of the JMP Graph Builder or the Excel PivotTable feature. Tableau
Public consists of three different editions, as Appendix Section G.5 explains. Except where
noted in this book, instructions apply to all three editions.
Tableau Public uses workbooks to store tabular and visual worksheet summaries, dashboards, and stories. Opening Tableau Public displays the main Tableau window in which the
contents of a workbook can be viewed and edited. The main window shown below displays
the Mobile Electronics Store worksheet of the Arlingtons National Sales Dashboard Tableau
workbook that is used in a Chapter 14 example.
The main Tableau window contains a menu bar and toolbar, a tabbed left area that presents
data and formatting details, several special areas that Section TG.4 explains, the worksheet display area, and the Show Me gallery that displays tabular and visual summaries appropriate for
the data in the worksheet area. Worksheet tabs appear under these areas, along with a tab for the
current data source to the left of the tabs and icon shortcuts for a new worksheet, new dashboard,
and new story, respectively, to the right of the tabs. The bottom of the window displays status
information at the left, the current signed-in user (“MyUserName”), a drop-down button to sign
out of Tableau Public, and four media controls that permit browsing through the worksheet tabs
of the current workbook.
The Data tab shown on the previous page displays the workbook data sources (five) and lists
the “Dimensions” and “Measures,” the variable columns of the data source for the worksheet,
summary measures, and calculated values for the MobileElectronicsStore data source associated
with the displayed worksheet. (Section TG1.1 explains more about the significance of dimensions
and measures.)
For the worksheet shown on the previous page, the variable Store was dragged and dropped
on the Rows shelf and the variable Mobile Electronics Sales was dragged and dropped on
the Columns shelf. The worksheet also contains a filter for the Region column that selects only
those rows in the source worksheet in which the value in the Region column is Alpha. (To create
the bar chart, the horizontal bars icon was selected from the Show Me gallery.)
TG.2 ENTERING DATA
Tableau does not support the direct entry of data values. When using Tableau, one imports data
from other sources, such as Microsoft Excel workbooks, text files, JSON files, or data retrieved
from server-based data systems. Many users create and save data files in another program such
as Excel and then open the saved data file in Tableau to import the data.
TG.3 OPEN or SAVE a WORKBOOK
Use File ➔ Open or File ➔ Open from Tableau Public.
Use File ➔ Save to Tableau Public As.
Use Open to import simple data sources such as Microsoft Excel workbooks or text files or to
open a Tableau workbook that was previously downloaded to a local computing device. Most Tableau
Guide instructions begin by opening a Microsoft Excel workbook to avoid the limitations of
Tableau Public that Appendix Section G.5 discusses. (This procedure mimics the typical usage
of creating data in another program that Section TG.2 discusses.)
To open a new workbook, select File ➔ New. For a new workbook, the Data tab will display
the hyperlink Connect to Data. Clicking the hyperlink displays the Connect panel from which
data sources can be retrieved. When opening an Excel workbook that contains more than one
data worksheet, each Excel worksheet being used must be defined as a separate data source.
This Connect panel also appears when opening an Excel workbook directly using the file open
command. The illustration below shows the data source linked to the Data worksheet of the
Excel Retirement Funds workbook.
Using the Open from Tableau Public or Save to Tableau
Public As command requires a valid Tableau online account.
(Accounts are complimentary but require registration.)
Using these commands means signing into a Tableau Public
account and retrieving (or storing) a Tableau workbook from
that account or from an account that has been shared. Save
to Tableau Public As stores the Tableau workbook in the
account and opens a web browser to display the workbook
and to permit its downloading to a local computing device.
In paid-subscription editions of Tableau Desktop, these open
and save commands appear in the Server menu and not in
the File menu.
Although not used in the Tableau Guides, Tableau Public permits join and union operations to combine data
from two data sources. Join operations combine two tables,
typically by matching values in a variable column that both
original tables share. Union operations add rows. Union
operations require that tables share columns that hold values
for the same variables. Joins and unions can solve problems
that arise from seeking to perform an analysis of data on a
set of variables stored in two different places, such as two
different Excel data worksheets in the same Excel workbook.
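The distinction between the two operations can be sketched in plain Python, with each table stored as a list of rows. The tables, column names, and function names below are hypothetical, and the join shown is one common kind (an inner join that keeps only matching rows); this is an illustration of the concepts, not of how Tableau implements them.

```python
# Hypothetical tables stored as lists of dictionaries (one dictionary per row).
sales = [
    {"Store": "S1", "Sales": 120},
    {"Store": "S2", "Sales": 95},
]
regions = [
    {"Store": "S1", "Region": "Alpha"},
    {"Store": "S2", "Region": "Beta"},
]
more_sales = [
    {"Store": "S3", "Sales": 110},
]


def inner_join(left, right, key):
    """Combine columns from rows whose key values match in both tables."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in lookup.get(left_row[key], [])
    ]


def union(first, second):
    """Add the rows of the second table to the rows of the first."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("tables must share the same columns")
    return first + second


joined = inner_join(sales, regions, "Store")   # more columns, same stores
stacked = union(sales, more_sales)             # more rows, same columns

print(joined)
print(stacked)
```

The join result gains a Region column for each store, while the union result gains a row for store S3.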
TG.4 WORKING with DATA
Tableau Desktop uses the term data field to refer to what
this book calls a variable or, in some contexts, a column.
What an Excel, JMP, or Minitab user would call a formula
function, Tableau calls an aggregation. What an Excel, JMP,
or Minitab user would call a formula, Tableau calls an aggregate calculation. Tableau also invents its own vocabulary for
several user interface elements in the worksheet window that
users of Excel, JMP, or Minitab may know under more common names. Knowing this vocabulary can be helpful when
consulting the Tableau help system or other references.
Tableau calls the Pages, Filters, Columns, and Rows
areas, all shown in the Section TG.1 illustration, shelves. The
Marks area seems to be a shelf but for reasons that would be
only self-evident to a regular user of Tableau, the Marks area
is called a card. Shelves are places into which things can be
placed or dropped, such as the pills that have been placed in
the Filters, Rows, and Columns shelves. A pill represents some
data and is so named because it reminds some of a medicinal
capsule. In its simplest form, a pill represents a data field in the
data source. However, pills can represent a filtering operation,
such as the Filters shelf pill in the illustration below, or a calculated result, similar to a worksheet or data table formula.
Pills can be either blue or green, reflecting the type of numerical data, discrete or continuous (see Section TG1.1), or red,
reflecting an error condition.
In the Section TG.1 illustration, the Store Dimension has
been dragged to the Rows shelf, creating the Store pill and the
Mobile Electronics Sales Measure has been dropped on the
Columns shelf, creating the SUM(Mobile Electronics Sales)
pill (slightly truncated, see figure detail below). Dropping
a measure name on a shelf creates an aggregation (formula
function). The SUM aggregation pill sums mobile electronic
sales values by rows (each store). In the special case where
each store is represented by only one row, there is no actual
summing of values.
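What the SUM aggregation does can be sketched in Python as a sum of a measure grouped by a dimension. The rows and the sum_by function are hypothetical and serve only to illustrate the idea; they are not taken from the Arlingtons workbook.

```python
from collections import defaultdict

# Hypothetical source rows: store S1 appears in two rows, store S2 in one.
rows = [
    {"Store": "S1", "Mobile Electronics Sales": 10.5},
    {"Store": "S1", "Mobile Electronics Sales": 4.5},
    {"Store": "S2", "Mobile Electronics Sales": 8.0},
]


def sum_by(table, dimension, measure):
    """Sum the measure separately for each distinct value of the dimension."""
    totals = defaultdict(float)
    for row in table:
        totals[row[dimension]] += row[measure]
    return dict(totals)


totals = sum_by(rows, "Store", "Mobile Electronics Sales")
print(totals)
```

Store S2 has only one row, so its "sum" is simply that row's value, which is the special case noted above.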
TG.5 PRINT a WORKBOOK
Tableau Public does not contain a print function that would
print worksheets. (Commercial versions of Tableau Desktop
do.) To print a tabular or visual summary, use a screen capture utility to capture the display for later printing. For online
worksheets displayed in a web browser, use the print function
of the browser.
1 Defining and Collecting Data
CONTENTS
USING STATISTICS: Defining Moments
1.1 Defining Variables
1.2 Collecting Data
1.3 Types of Sampling Methods
1.4 Data Cleaning
1.5 Other Data Preprocessing Tasks
1.6 Types of Survey Errors
CONSIDER THIS: New Media Surveys/Old Survey Errors
Defining Moments, Revisited
EXCEL GUIDE
JMP GUIDE
MINITAB GUIDE
TABLEAU DESKTOP GUIDE
OBJECTIVES
■ Understand issues that arise when defining variables
■ How to define variables
■ Understand the different measurement scales
■ How to collect data
■ Identify the different ways to collect a sample
■ Understand the issues involved in data preparation
■ Understand the types of survey errors
▼
USING STATISTICS
Defining Moments
#1
You’re the sales manager in charge of the best-selling beverage in its category.
For years, your chief competitor has made sales gains, claiming a better tasting
product. Worse, a new sibling product from your company, known for its good
taste, has quickly gained significant market share at the expense of your product. Worried that
your product may soon lose its number one status, you seek to improve sales by improving the
product’s taste. You experiment and develop a new beverage formulation. Using methods taught
in this book, you conduct surveys and discover that people overwhelmingly like the newer formulation. You decide to use that new formulation going forward, having statistically shown that
people prefer it. What could go wrong?
#2
You’re a senior airline manager who has noticed that your frequent fliers always
choose another airline when flying from the United States to Europe. You suspect
fliers make that choice because of the other airline’s perceived higher quality. You
survey those fliers, using techniques taught in this book, and confirm your suspicions. You then
design a new survey to collect detailed information about the quality of all components of a
flight, from the seats to the meals served to the flight attendants’ service. Based on the results
of that survey, you approve a costly plan that will enable your airline to match the perceived
quality of your competitor. What could go wrong?
In both cases, much did go wrong. Both cases serve as cautionary tales that if you choose the
wrong variables to study, you may not end up with results that support making better decisions.
Defining and collecting data, which at first glance can seem to be the simplest tasks in the
DCOVA framework, can often be more challenging than people anticipate.
As the initial chapter notes, statistics is a way of thinking that can help fact-based decision making. But statistics, even properly applied using the DCOVA framework, can
never be a substitute for sound management judgment. If you misidentify the business
problem or lack proper insight into a problem, statistics cannot help you make a good decision.
Case #1 retells the story of one of the most famous marketing blunders ever, the change in the
formulation of Coca-Cola in the 1980s. In that case, Coke brand managers were so focused on
the taste of Pepsi and the newly successful sibling Diet Coke that they decided only to define
a variable and collect data about which drink tasters preferred in a blind taste test. When New
Coke was preferred, even over Pepsi, managers rushed the new formulation into production.
In doing so, those managers failed to reflect on whether the statistical results about a test that
asked people to compare one-ounce samples of several beverages would demonstrate anything
about beverage sales. After all, people were asked which beverage tasted better, not whether
they would buy that better-tasting beverage in the future. New Coke was an immediate failure, and Coke managers reversed their decision a mere 77 days after introducing their new
formulation (see reference 7).
Coke managers also overlooked other issues, such as people’s emotional connection and brand
loyalty to Coca-Cola, issues better discussed in a marketing book than this book.
Case #2 represents a composite story of managerial actions at several airlines. In some
cases, managers overlooked the need to state operational definitions for quality factors about
which fliers were surveyed. In at least one case, statistics was applied correctly, and an airline
spent great sums on upgrades and was able to significantly improve quality. Unfortunately, their
frequent fliers still chose the competitor’s flights. In this case, no statistical survey about quality
could reveal the managerial oversight that, given the same level of quality between two airlines,
frequent fliers will almost always choose the cheaper airline. While quality was a significant
variable of interest, it was not the most significant.
The lessons of these cases apply throughout this book. Due to the necessities of instruction,
the examples and problems in all but the last chapter include preidentified business problems
and defined variables. Identifying the business problem or objective to be considered is always
a prelude to applying the DCOVA framework.
1.1 Defining Variables
Identifying a proper business problem or objective enables one to begin to identify and define
the variables for analysis. For each variable identified, assign an operational definition that
specifies the type of variable and the scale, the type of measurement, that the variable uses.
EXAMPLE 1.1
Defining Data
at GT&M
You have been hired by Good Tunes & More (GT&M), a local electronics retailer, to assist in
establishing a fair and reasonable price for Whitney Wireless, a privately held chain that GT&M
seeks to acquire. You need data that would help to analyze and verify the contents of the wireless
company’s basic financial statements. A GT&M manager suggests that one variable you should
use is monthly sales. What do you do?
SOLUTION Having first confirmed with the GT&M financial team that monthly sales is a
relevant variable of interest, you develop an operational definition for this variable. Does this
variable refer to sales per month for the entire chain or for individual stores? Does the variable
refer to net or gross sales? Do the monthly sales data represent number of units sold or currency
amounts? If the data are currency amounts, are they expressed in U.S. dollars? After getting
answers to these and similar questions, you draft an operational definition for ratification by
others working on this project.
Classifying Variables by Type
The type of data that a variable contains determines the statistical methods that are appropriate
for a variable. Broadly, all variables are either numerical, variables whose data represent a
counted or measured quantity, or categorical, variables whose data represent categories. Gender
with its categories male and female is a categorical variable, as is the variable preferred-NewCoke with its categories yes and no. In Example 1.1, the monthly sales variable is numerical
because the data for this variable represent a quantity.
student TIP
Some prefer the terms quantitative and qualitative over the terms numerical and categorical when describing variables. These two pairs of terms are interchangeable.
For some statistical methods, numerical variables must be further specified as either being
discrete or continuous. Discrete numerical variables have data that arise from a counting process.
Discrete numerical variables include variables that represent a “number of something,” such as
the monthly number of smartphones sold in an electronics store. Continuous numerical variables
have data that arise from a measuring process. The variable “the time spent waiting on a checkout
line” is a continuous numerical variable because its data represent timing measurements. The data
for a continuous variable can take on any value within a continuum or an interval, subject to the
precision of the measuring instrument. For example, a waiting time could be 1 minute, 1.1 minutes,
1.11 minutes, or 1.113 minutes, depending on the precision of the electronic timing device used.
For a particular variable, one might use a numerical definition for one problem, but use a
categorical definition for another problem. For example, a person’s age might seem to always
be a numerical age variable, but what if one was interested in comparing the buying habits of
children, young adults, middle-aged persons, and retirement-age people? In that case, defining
age as a categorical variable would make better sense.
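The idea of redefining a numerical variable as a categorical one can be sketched in a few lines of Python. This is only an illustration: the function name age_group and the age cutoffs (18, 40, and 65) are assumptions chosen for the example, not definitions from this book.

```python
def age_group(age_years: float) -> str:
    """Map a numerical age to one of four categories.

    The cutoffs below are hypothetical; a real operational definition
    would have to state and justify them.
    """
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years < 18:
        return "child"
    if age_years < 40:
        return "young adult"
    if age_years < 65:
        return "middle-aged"
    return "retirement-age"


ages = [7, 23, 41, 70]                     # numerical definition of age
groups = [age_group(a) for a in ages]      # categorical definition of age
print(groups)
```

Once age is recoded this way, methods for categorical variables apply, but the exact ages can no longer be recovered from the categories.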
Measurement Scales
Determining the measurement scale that the data for a variable represent is part of defining a
variable. The measurement scale defines the ordering of values and determines if differences
among pairs of values for a variable are equivalent and whether one value can be expressed in
terms of another. Table 1.1 presents examples of measurement scales, some of which are used
in the rest of this section.
TABLE 1.1 Examples of Different Scales and Types

Data                Scale, Type             Values
Cellular provider   nominal, categorical    AT&T, T-Mobile, Verizon, Other, None
Excel skills        ordinal, categorical    novice, intermediate, expert
Temperature (°F)    interval, numerical     -459.67°F or higher
SAT Math score      interval, numerical     a value between 200 and 800, inclusive
Item cost (in $)    ratio, numerical        $0.00 or higher

student TIP
JMP and Tableau users will benefit the most from understanding measurement scales.
Define numerical variables as using either an interval scale, which expresses a difference
between measurements that do not include a true zero point, or a ratio scale, an ordered scale that
includes a true zero point. Categorical variables use measurement scales that provide less insight
into the values for the variable. For data measured on a nominal scale, category values express no
order or ranking. For data measured on an ordinal scale, an ordering or ranking of category values
is implied. Ordinal scales contain some information to compare values but not as much as interval
or ratio scales. For example, the ordinal scale poor, fair, good, and excellent allows one to know that
“good” is better than poor or fair and not better than excellent. But unlike interval and ratio scales, one
would not know that the difference from poor to fair is the same as fair to good (or good to excellent).
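A short Python sketch may help show why the true zero point matters. It compares 80°F with 40°F: the difference of 40 degrees is meaningful on the interval scale, but the naive ratio of 2 is not. Converting to kelvins, a temperature scale that does have a true zero, is used here only as an illustration and is not part of this book's examples.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (temp_f + 459.67) * 5.0 / 9.0


# Interval scale: differences are meaningful, ratios are not.
difference_f = 80.0 - 40.0        # a meaningful 40-degree difference
naive_ratio = 80.0 / 40.0         # 2.0, but 80°F is not "twice as hot" as 40°F
kelvin_ratio = fahrenheit_to_kelvin(80.0) / fahrenheit_to_kelvin(40.0)

# Ratio scale: an item cost of $80.00 really is twice a cost of $40.00.
cost_ratio = 80.00 / 40.00

print(difference_f, naive_ratio, round(kelvin_ratio, 3), cost_ratio)
```

The ratio computed on the scale with a true zero is only about 1.08, which is why ratios of interval-scale values should not be interpreted.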
PROBLEMS FOR SECTION 1.1
LEARNING THE BASICS
1.1 Four different beverages are sold at a fast-food restaurant: soft
drinks, tea, coffee, and bottled water.
Explain why the type of beverage sold is an example of a categorical
variable.
1.2 U.S. businesses are listed by size: small, medium, and large.
Explain why business size is an example of a categorical variable.
1.3 The time it takes to download a video from the Internet is
measured.
Explain why the download time is a continuous numerical variable.
APPLYING THE CONCEPTS
SELF TEST 1.4 For each of the following variables, determine
whether the variable is categorical or numerical. If the
variable is numerical, determine whether the variable is discrete or
continuous.
a. Number of cell phones in the household
b. Monthly data usage (in MB)
c. Number of text messages exchanged per month
d. Voice usage per month (in minutes)
e. Whether the cell phone is used for streaming video
1.5 The following information is collected from students upon
exiting the campus bookstore during the first week of classes.
a. Amount of time spent shopping in the bookstore
b. Number of textbooks purchased
c. Academic major
d. Gender
Classify each variable as categorical or numerical.
1.6 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Name of Internet service provider
b. Time, in hours, spent surfing the Internet per week
c. Whether the individual uses a mobile phone to stream video
d. Number of online purchases made in a month
e. Where the individual accesses social networks to find sought-after information
1.7 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Amount of money spent on clothing in the past month
b. Favorite department store
c. Most likely time period during which shopping for clothing
takes place (weekday, weeknight, or weekend)
d. Number of pairs of shoes owned
1.8 Suppose the following information is collected from Robert
Keeler on his application for a home mortgage loan at the Metro
County Savings and Loan Association.
a. Monthly payments: $2,227
b. Number of jobs in past 10 years: 1
c. Annual family income: $96,000
d. Marital status: Married
Classify each of the responses by type of data.
1.9 One of the variables most often included in surveys is income.
Sometimes the question is phrased, “What is your income (in thousands of dollars)?” In other surveys, the respondent is asked to
“Select the circle corresponding to your income level” and is given
a number of income ranges to choose from.
a. In the first format, explain why income might be considered
either discrete or continuous.
b. Which of these two formats would you prefer to use if you were
conducting a survey? Why?
1.10 If two students both score 90 on the same examination, what
arguments could be used to show that the underlying variable—test
score—is continuous?
1.11 The director of market research at a large department store
chain wanted to conduct a survey throughout a metropolitan area to
determine the amount of time working women spend shopping for
clothing in a typical month.
a. Indicate the type of data the director might want to collect.
b. Develop a first draft of the questionnaire needed by writing three
categorical questions and three numerical questions that you feel
would be appropriate for this survey.
1.2 Collecting Data
Collecting data using improper methods can spoil any statistical analysis. For example,
Coca-Cola managers in the 1980s (see page 19) faced advertisements from their competitor
publicizing the results of a “Pepsi Challenge” in which taste testers consistently favored Pepsi
over Coke. No wonder—test recruiters deliberately selected tasters they thought would likely
be more favorable to Pepsi and served samples of Pepsi chilled, while serving samples of Coke
lukewarm (not a very fair comparison!). These introduced biases made the challenge anything
but a proper scientific or statistical test. Proper data collection avoids introducing biases and
minimizes errors.
Populations and Samples
Data are collected from either a population or a sample. A population contains all the items
or individuals of interest that one seeks to study. All of the GT&M sales transactions for a
specific year, all of the full-time students enrolled in a college, and all of the registered voters
in Ohio are examples of populations. A sample contains only a portion of a population of
interest. One analyzes a sample to estimate characteristics of an entire population. For example, one might select a sample of 200 sales transactions for a retailer or select a sample of
500 registered voters in Ohio in lieu of analyzing the populations of all the sales transactions
or all the registered voters.
One uses a sample when selecting a sample will be less time consuming or less cumbersome
than selecting every item in the population or when analyzing a sample is less cumbersome or
more practical than analyzing the entire population. Section FTF.3 defines statistic as a “value
that summarizes the data of a specific variable.” More precisely, a statistic summarizes the value
of a specific variable for sample data. Correspondingly, a parameter summarizes the value of
a specific variable for population data.
learnMORE
Read the Short Takes for Chapter 1 for a further discussion about data sources.
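The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population below is made up (5,000 simulated transaction amounts), the seed is fixed only so the sketch is repeatable, and the mean is used as just one possible summary.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 5,000 sales transaction amounts in dollars.
population = [round(rng.uniform(5, 500), 2) for _ in range(5000)]

# A sample of 200 transactions, selected without replacement.
sample = rng.sample(population, 200)

parameter = statistics.mean(population)  # summarizes the whole population
statistic = statistics.mean(sample)      # summarizes only the sample

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In practice the parameter is usually unknown, and the statistic computed from the sample serves as its estimate.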
Data Sources
Data sources arise from the following activities:
• Capturing data generated by ongoing business activities
• Distributing data compiled by an organization or individual
• Compiling the responses from a survey
• Conducting an observational study and recording the results of the study
• Conducting a designed experiment and recording the outcomes of the experiment
Tableau uses the term data source in a different sense, to refer to the data being presented as a specific tabular or visual summary.
Choosing to conduct an observational study or a designed experiment on a variable of interest affects the statistical methods and the decision-making processes that can be used, as Chapters 10 and 14 further explain.
When the person conducting an analysis performs one of these activities, the data source is a
primary data source. When one of these activities is done by someone other than the person
conducting an analysis, the data source is a secondary data source.
Capturing data can be done as a byproduct of an organization’s transactional information
processing, such as the storing of sales transactions at a retailer, or as a result of a service provided
by a second party, such as customer information that a social media website business collects on
behalf of another business. Therefore, such data capture may be either a primary or a secondary
source.
Typically, organizations such as market research firms and trade associations distribute
compiled data, as do businesses that offer syndicated services, such as The Nielsen Company,
known for its TV ratings. Therefore, this source of data is usually a secondary source. (If one
supervised the distribution of a survey, compiled its results, and then analyzed those results, the
survey would be a primary data source.)
In both observational studies and designed experiments, researchers that collect data are
looking for the effect of some change, called a treatment, on a variable of interest. In an observational study, the researcher collects data in a natural or neutral setting and has no direct control
of the treatment. For example, in an observational study of the possible effects on theme park
usage patterns that a new electronic payment method might cause, one would take a sample
of guests, identify those who use the new method and those who do not, and then “observe” if
those who use the new method have different park usage patterns. As a designed experiment,
one would select guests to use the new electronic payment method and then discover if those
guests have theme park usage patterns that are different from the guests not selected to use the
new payment method.
PROBLEMS FOR SECTION 1.2
APPLYING THE CONCEPTS
1.12 The American Community Survey (www.census.gov/acs)
provides data every year about communities in the United States.
Addresses are randomly selected, and respondents are required to
supply answers to a series of questions.
a. Which of the sources of data best describe the American Community Survey?
b. Is the American Community Survey based on a sample or a
population?
1.13 Visit the website of the Gallup organization at www.gallup.com.
Read today’s top story. What type of data source is the top
story based on?
1.14 Visit the website of the Pew Research organization at www.pewresearch.org.
Read today’s top story. What type of data source
is the top story based on?
1.15 Transportation engineers and planners want to address the
dynamic properties of travel behavior by describing in detail the
driving characteristics of drivers over the course of a month. What
type of data collection source do you think the transportation engineers and planners should use?
1.16 Visit the home page of the Statistics Portal “Statista” at www.statista.com.
Examine one of the “Popular infographic topics” in
the Infographics section on that page. What type of data source is
the information presented here based on?
1.3 Types of Sampling Methods
When selecting a sample to collect data, begin by defining the frame. The frame is a complete or
partial listing of the items that make up the population from which the sample will be selected.
Inaccurate or biased results can occur if a frame excludes certain groups or portions of the population. Using different frames to collect data can lead to different, even opposite, conclusions.
Using the frame, select either a nonprobability sample or a probability sample. In a nonprobability sample, select the items or individuals without knowing their probabilities of selection. In a probability sample, select items based on known probabilities. Whenever possible,
use a probability sample as such a sample will allow one to make inferences about the population
being analyzed.
Nonprobability samples can have certain advantages, such as convenience, speed, and low
cost. Such samples are typically used to obtain informal approximations or as small-scale initial
or pilot analyses. However, because the theory of statistical inference depends on probability
sampling, nonprobability samples cannot be used for statistical inference and this more than
offsets those advantages in more formal analyses.
Figure 1.1 shows the subcategories of the two types of sampling. A nonprobability sample
can be either a convenience sample or a judgment sample. To collect a convenience sample,
select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse of
stacked items, selecting only the items located on the top of each stack and within easy reach
would create a convenience sample. So, too, would be the responses to surveys that the websites of many companies offer visitors. While such surveys can provide large amounts of data
quickly and inexpensively, the convenience samples selected from these responses will consist
of self-selected website visitors. (Read the Consider This essay on page 31 for a related story.)
To collect a judgment sample, collect the opinions of preselected experts in the subject matter.
Although the experts may be well informed, one cannot generalize their results to the population.
The types of probability samples most commonly used include simple random, systematic,
stratified, and cluster samples. These four types of probability samples vary in terms of cost,
accuracy, and complexity, and they are the subject of the rest of this section.
FIGURE 1.1 Types of samples
Nonprobability samples: judgment sample, convenience sample
Probability samples: simple random sample, systematic sample, stratified sample, cluster sample
Simple Random Sample
In a simple random sample, every item from a frame has the same chance of selection as every other
item, and every sample of a fixed size has the same chance of selection as every other sample of that
size. Simple random sampling is the most elementary random sampling technique. It forms the basis
for the other random sampling techniques. However, simple random sampling has its disadvantages.
Its results are often subject to more variation than other sampling methods. In addition, when the
frame used is very large, carrying out a simple random sample may be time consuming and expensive.
With simple random sampling, use n to represent the sample size and N to represent the
frame size. Number every item in the frame from 1 to N. The chance that any particular member
of the frame will be selected during the first selection is 1/N.
Select samples with replacement or without replacement. Sampling with replacement means
that a selected item is returned to the frame, where it has the same probability of being selected
again. For example, imagine a fishbowl containing N business cards, one card for each person. The
first selection selects the card for Grace Kim. After the pertinent information has been recorded,
the business card is placed back in the fishbowl. All cards are thoroughly mixed and a second card
is selected. On the second selection, Grace Kim has the same probability of being selected again, 1/N.
Most sampling is sampling without replacement. Sampling without replacement means
that once an item has been selected, the item cannot ever again be selected for the sample.
The chance that any particular item in the frame (for example, the business
card for Grace Kim) will be selected on the first selection is 1/N. The chance that any card not previously
chosen will be chosen on the second selection becomes 1 out of N - 1.
learnMORE
Learn to use a table of random numbers to select a simple random sample in the Section 1.3 LearnMore online topic.
When creating a simple random sample, avoid the “fishbowl” method of selecting a sample
because this method lacks the ability to thoroughly mix items and, therefore, randomly select a
sample. Instead, use a more rigorous selection method.
One such method is to use a table of random numbers, such as Table E.1 in Appendix E, for
selecting the sample. A table of random numbers consists of a series of digits listed in a randomly
generated sequence. To use a random number table for selecting a sample, assign code numbers to
the individual items of the frame. Then generate the random sample by reading the table of random
numbers and selecting those individuals from the frame whose assigned code numbers match the
digits found in the table. Because every digit or sequence of digits in the table is random, the table can
be read either horizontally or vertically. The margins of the table designate row numbers and column
numbers, and the digits are grouped into sequences of five in order to make reading the table easier.
Because the number system uses 10 digits (0, 1, 2, … , 9), the chance that any particular
digit will be randomly generated is 1 out of 10 and is equal to the probability of generating
any other digit. For a generated sequence of 800 digits, one would expect about 80 to be the
digit 0, 80 to be the digit 1, and so on.
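A software random number generator is a common alternative to a printed table of random numbers. The Python sketch below, which assumes a frame of N = 800 numbered items and a sample size of n = 40 chosen only for illustration, selects a simple random sample both without and with replacement using the standard library's random module.

```python
import random

rng = random.Random(42)  # fixed seed so the selection can be reproduced

N = 800                          # frame size (illustrative)
n = 40                           # sample size (illustrative)
frame = list(range(1, N + 1))    # number every item in the frame from 1 to N

# Without replacement: no item can be selected more than once.
without_replacement = rng.sample(frame, n)

# With replacement: each draw is made from the full frame,
# so the same item may be selected more than once.
with_replacement = rng.choices(frame, k=n)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

The selected code numbers are then matched to the items in the frame, just as they would be when reading code numbers from a random number table.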
Systematic Sample
In a systematic sample, partition the N items in the frame into n groups of k items, where
k = N/n
Round k to the nearest integer. To select a systematic sample, choose the first item to be selected
at random from the first k items in the frame. Then, select the remaining n - 1 items by taking
every kth item thereafter from the entire frame.
If the frame consists of a list of prenumbered checks, sales receipts, or invoices, taking a
systematic sample is faster and easier than taking a simple random sample. A systematic sample is also a convenient mechanism for collecting data from membership directories, electoral
registers, class rosters, and consecutive items coming off an assembly line.
To take a systematic sample of n = 40 from the population of N = 800 full-time employees, partition the frame of 800 into 40 groups, each of which contains 20 employees. Then select
a random number from the first 20 individuals and include every twentieth individual after the
first selection in the sample. For example, if the first random number selected is 008, subsequent
selections will be 028, 048, 068, 088, 108, … , 768, and 788.
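The selection rule just described can be sketched in Python. The function name systematic_sample and its parameters are hypothetical; the sketch rounds k with Python's built-in round and drops any code number that would fall beyond the end of the frame, which can happen when k has been rounded up.

```python
import random


def systematic_sample(frame_size, sample_size, start=None, rng=None):
    """Return the code numbers (1..frame_size) of a systematic sample."""
    if sample_size > frame_size:
        raise ValueError("sample_size cannot exceed frame_size")
    k = round(frame_size / sample_size)   # k = N/n, rounded to an integer
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)         # random start among the first k items
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and k")
    picks = [start + i * k for i in range(sample_size)]
    # If k was rounded up, the last picks could run past the frame.
    return [p for p in picks if p <= frame_size]


# The employee example: N = 800, n = 40, first random number 008.
selected = systematic_sample(800, 40, start=8)
print(selected[:5], "...", selected[-2:])
```

With a starting value of 8, the sketch reproduces the sequence 008, 028, 048, and so on through 788.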
Simple random sampling and systematic sampling are simpler than other, more sophisticated,
probability sampling methods, but they generally require a larger sample size. In addition, systematic sampling is prone to selection bias that can occur when there is a pattern in the frame. To overcome the inefficiency of simple random sampling and the potential selection bias involved with
systematic sampling, one can use either stratified sampling methods or cluster sampling methods.
Stratified Sample
learnMORE
Learn how to select a stratified sample in the Section 1.3 LearnMore online topic.
In a stratified sample, first subdivide the N items in the frame into separate subpopulations, or
strata. A stratum is defined by some common characteristic, such as gender or year in school.
Then select a simple random sample within each of the strata and combine the results from the
separate simple random samples. Stratified sampling is more efficient than either simple random
sampling or systematic sampling because the representation of items across the entire population is
assured. The homogeneity of items within each stratum provides greater precision in the estimates
of underlying population parameters. In addition, stratified sampling enables one to reach conclusions about each stratum in the frame. However, using a stratified sample requires that one be able to determine the variable(s) on which to base the stratification, and such a sample can also be expensive to implement.
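A minimal Python sketch of stratified sampling follows. It assumes proportional allocation, meaning each stratum contributes to the sample in proportion to its share of the frame; that allocation rule, the function name stratified_sample, and the made-up strata sizes are assumptions for illustration only.

```python
import random


def stratified_sample(strata, total_n, rng):
    """Select a simple random sample within each stratum.

    strata maps a stratum name to the list of items in that stratum.
    Sample sizes are allocated proportionally (an assumption of this
    sketch); rounding may make the total differ slightly from total_n.
    """
    frame_size = sum(len(items) for items in strata.values())
    selected = {}
    for name, items in strata.items():
        stratum_n = round(total_n * len(items) / frame_size)
        selected[name] = rng.sample(items, stratum_n)
    return selected


rng = random.Random(7)
# Hypothetical frame stratified by year in school.
strata = {
    "first-year": [f"FY{i}" for i in range(500)],
    "sophomore": [f"SO{i}" for i in range(300)],
    "junior": [f"JR{i}" for i in range(200)],
}
picked = stratified_sample(strata, 100, rng)
sizes = {name: len(items) for name, items in picked.items()}
print(sizes)
```

Because every stratum is sampled, each subgroup is guaranteed to appear in the combined sample.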
Cluster Sample
In a cluster sample, divide the N items in the frame into clusters that contain several items.
Clusters are often naturally occurring groups, such as counties, election districts, city blocks,
households, or sales territories. Then take a random sample of one or more clusters and study
all items in each selected cluster.
Cluster sampling is often more cost-effective than simple random sampling, particularly if
the population is spread over a wide geographic region. However, cluster sampling often requires
a larger sample size to produce results as precise as those from simple random sampling or
stratified sampling. A detailed discussion of systematic sampling, stratified sampling, and cluster
sampling procedures can be found in references 2, 4, and 6.
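Cluster sampling can be sketched in Python in much the same way. The frame below is hypothetical (10 sales territories of 25 customers each), and selecting two clusters is an arbitrary choice made for the example.

```python
import random

rng = random.Random(3)  # fixed seed so the selection can be reproduced

# Hypothetical frame: 10 sales territories (clusters) of 25 customers each.
clusters = {
    f"territory_{t}": [f"T{t}-C{c}" for c in range(25)]
    for t in range(10)
}

# Take a simple random sample of clusters, not of individual items.
chosen = rng.sample(sorted(clusters), 2)

# Study every item in each selected cluster.
sample = [item for name in chosen for item in clusters[name]]

print(chosen, len(sample))
```

Only the chosen territories need to be visited, which is the source of the cost savings when a population is spread over a wide region.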
PROBLEMS FOR SECTION 1.3
LEARNING THE BASICS
1.17 For a population containing N = 902 individuals, what code
number would you assign for
a. the first person on the list?
b. the fortieth person on the list?
c. the last person on the list?
1.18 For a population of N = 902, verify that by starting in row
05, column 01 of the table of random numbers (Table E.1), you need
only six rows to select a sample of n = 60 without replacement.
1.19 Given a population of N = 93, starting in row 29, column 01
of the table of random numbers (Table E.1), and reading across the
row, select a sample of n = 15
a. without replacement.
b. with replacement.
APPLYING THE CONCEPTS
1.20 For a study that consists of personal interviews with participants (rather than mail or phone surveys), explain why simple random
sampling might be less practical than some other sampling methods.
1.21 You want to select a random sample of n = 1 from a population of three items (which are called A, B, and C). The rule for
selecting the sample is as follows: Flip a coin; if it is heads, pick
item A; if it is tails, flip the coin again; this time, if it is heads,
choose B; if it is tails, choose C. Explain why this is a probability
sample but not a simple random sample.
1.22 A population has four members (called A, B, C, and D). You
would like to select a random sample of n = 2, which you decide
to do in the following way: Flip a coin; if it is heads, the sample will
be items A and B; if it is tails, the sample will be items C and D.
Although this is a random sample, it is not a simple random sample.
Explain why. (Compare the procedure described in Problem 1.21
with the procedure described in this problem.)
1.23 The registrar of a university with a population of N = 4,000
full-time students is asked by the president to conduct a survey to
measure satisfaction with the quality of life on campus. The following table contains a breakdown of the 4,000 registered full-time
students, by gender and class designation:
                 CLASS DESIGNATION
GENDER    Fr.     So.    Jr.    Sr.    Total
Female    700     520    500    480    2,200
Male      560     460    400    380    1,800
Total     1,260   980    900    860    4,000
The registrar intends to take a probability sample of n = 200
students and project the results from the sample to the entire
population of full-time students.
a. If the frame available from the registrar’s files is an alphabetical
listing of the names of all N = 4,000 registered full-time students, what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample
in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the registrar’s files is a list of the
names of all N = 4,000 registered full-time students compiled
from eight separate alphabetical lists, based on the gender and
class designation breakdowns shown in the class designation
table, what type of sample should you take? Discuss.
e. Suppose that each of the N = 4,000 registered full-time students lived in one of the 10 campus dormitories. Each dormitory
accommodates 400 students. It is college policy to fully integrate
students by gender and class designation in each dormitory. If
the registrar is able to compile a listing of all students by dormitory, explain how you could take a cluster sample.
SELF TEST 1.24 Prenumbered sales invoices are kept in a sales
journal. The invoices are numbered from 0001 to 5000.
a. Beginning in row 16, column 01, and proceeding horizontally in
a table of random numbers (Table E.1), select a simple random
sample of 50 invoice numbers.
b. Select a systematic sample of 50 invoice numbers. Use the random numbers in row 20, columns 05–07, as the starting point
for your selection.
c. Are the invoices selected in (a) the same as those selected in (b)?
Why or why not?
1.25 Suppose that 10,000 customers in a retailer’s customer database are categorized by three customer types: 3,500 prospective
buyers, 4,500 first-time buyers, and 2,000 repeat (loyal) buyers.
A sample of 1,000 customers is needed.
a. What type of sampling should you do? Why?
b. Explain how you would carry out the sampling according to the
method stated in (a).
c. Why is the sampling in (a) not simple random sampling?
1.4 Data Cleaning
Even if proper data collection procedures are followed, the collected data may contain incorrect
or inconsistent values that could affect statistical results. Data cleaning corrects such defects
and ensures that the data are of suitable quality for analysis. Cleaning is the most important data
preprocessing task and must be done before performing any analysis. Cleaning can take a significant amount of time to do. One survey of big data analysts reported that they spend 60% of
their time cleaning data but only 20% of their time collecting data and a similar percentage
analyzing data (see reference 8).
Data cleaning seeks to correct the following types of irregularities:
• Invalid variable values, including non-numerical data for a numerical variable, invalid
categorical values of a categorical variable, and numeric values outside a defined range
• Coding errors, including inconsistent categorical values, inconsistent case for categorical
values, and extraneous characters
• Data integration errors, including redundant columns, duplicated rows, differing column
lengths, and different units of measure or scale for numerical variables
With the exception of several examples designed for use with this section, data for the problems and examples in this book have already been properly cleaned to allow focus on the statistical concepts and methods that the book discusses.
By its nature, data cleaning cannot be a fully automated process, even in large business systems that contain data cleaning software components. As this chapter’s software guides explain,
Excel, JMP, Minitab, and Tableau contain functionality that lessens the burden of data cleaning.
When performing data cleaning, first preserve a copy of the original data for later reference.
Invalid Variable Values
Invalid variable values can be identified as being incorrect by simple scanning techniques so
long as operational definitions for the variables the data represent exist. For any numerical
variable, any value that is not a number is clearly an incorrect value. For a categorical variable,
a value that does not match any of the predefined categories of the variable is, likewise, clearly
an incorrect value. And for numerical variables defined with an explicit range of values, a value
outside that range is clearly an error.
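The scanning techniques described above can be sketched in Python. The variable names, the defined categories F and M for gender, and the age range of 0 to 120 are hypothetical operational definitions assumed for this example.

```python
VALID_GENDER = {"F", "M"}   # predefined categories (assumed)
AGE_RANGE = (0, 120)        # explicit range for age (assumed)


def is_number(text):
    """Return True if text can be read as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def find_invalid(rows):
    """Return (row index, variable, value) for every invalid value found."""
    problems = []
    for index, row in enumerate(rows):
        if row["gender"] not in VALID_GENDER:
            problems.append((index, "gender", row["gender"]))
        if not is_number(row["age"]):
            problems.append((index, "age", row["age"]))
        elif not AGE_RANGE[0] <= float(row["age"]) <= AGE_RANGE[1]:
            problems.append((index, "age", row["age"]))
    return problems


rows = [
    {"gender": "F", "age": "34"},
    {"gender": "New York", "age": "41"},   # not a defined category
    {"gender": "M", "age": "forty"},       # not a number
    {"gender": "M", "age": "430"},         # outside the defined range
]
problems = find_invalid(rows)
print(problems)
```

A scan like this only flags values; deciding what the correct value should have been still requires consulting the original source of the data.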
Coding Errors
Coding errors can result from poor recording or entry of data values or as the result of computerized
operations such as copy-and-paste or data import. While coding errors are literally invalid values,
coding errors may be correctable without consulting additional information whereas the invalid variable values never are. For example, for a Gender variable with the defined values F and M, the value
“Female” is a coding error that can be reasonably changed to F. However, the value “New York” for
the same variable is an invalid variable value that you cannot reasonably change to either F or M.
Unlike invalid variable values, coding errors may be tolerated by analysis software. For
example, for the same Gender variable, the values M and m might be treated as the “same” value
for purposes of an analysis by software that was tolerant of case inconsistencies, an attribute
known as being insensitive to case.
Perhaps the most frustrating coding errors are extraneous characters in a value. Visual
examination may not be able to spot extraneous characters such as nonprinting characters or
extra, trailing space characters as one scans data. For example, the value David and the value
that is David followed by three space characters may look the same to one casually scanning
them but may not be treated the same by software. Likewise, values with nonprinting characters
may look correct but may cause software errors or be reported as invalid by analysis software.
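A short Python sketch can show why such values compare as different and one way they might be cleaned. The helper names are hypothetical, and the choice to drop every Unicode control or format character is an assumption that may be too aggressive for some data (it also removes tab characters, for example).

```python
import unicodedata


def strip_extraneous(value):
    """Drop nonprinting characters and leading or trailing spaces."""
    printable = "".join(
        ch for ch in value
        if not unicodedata.category(ch).startswith("C")  # control/format chars
    )
    return printable.strip()


def same_ignoring_case(a, b):
    """Compare two cleaned values without regard to case."""
    return strip_extraneous(a).casefold() == strip_extraneous(b).casefold()


# A plain comparison treats David and David-plus-three-spaces as different.
print("David" == "David   ")

# A zero-width space (U+200B) is invisible on screen but still part of the value.
print(strip_extraneous("Da\u200bvid   ") == "David")

# After cleaning, M and m can be treated as the same value.
print(same_ignoring_case("M", "m "))
```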
Data Integration Errors
Data integration errors arise when data from two different computerized sources, such as two
different data repositories, are combined into one data set for analysis. Identifying data integration errors may be the most time-consuming data cleaning task. Because spotting these errors
requires a type of data interpretation that the automated processes of typical business computer
systems today cannot supply, spotting these errors using manual methods will be typical for the
foreseeable future.
Perhaps not surprising, supplying business systems with automated data interpretation skills
that would semi-automate this task is a goal of many companies that provide data analysis
software and services.
Some data integration errors occur because variable names or definitions for the same item
of interest have minor differences across systems. In one system, a customer ID number may
be known as Customer ID, whereas in a different system, the same variable is known as Cust
Number. Combining data from the two systems may result in having both Customer
ID and Cust Number variable columns, a redundancy that should be eliminated.
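Once a person has decided that two column names refer to the same variable, eliminating the redundancy can be automated. The following Python sketch assumes a hand-built mapping of names; the mapping, the record layout, and the sample values are all hypothetical.

```python
# Hypothetical mapping from a system-specific column name to the agreed name.
# Building this mapping is the human interpretation step.
CANONICAL_NAMES = {"Cust Number": "Customer ID"}


def harmonize(record):
    """Return a copy of record with redundant column names merged."""
    return {CANONICAL_NAMES.get(name, name): value
            for name, value in record.items()}


system_a = [{"Customer ID": 1001, "City": "Boston"}]
system_b = [{"Cust Number": 1002, "City": "Denver"}]

combined = [harmonize(r) for r in system_a + system_b]
columns = sorted({name for r in combined for name in r})
print(columns)  # one Customer ID column instead of two
```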
Duplicated rows also occur because of similar inconsistencies across systems. Consider a
Customer Name variable with the value that represents the first coauthor of this book, David
M. Levine. In one system, this name may have been recorded as David Levine, whereas in
another system, the name was recorded as D M Levine. Combining records from both systems
may result in two records, where only one should exist. Whether “David Levine” is actually the
same person as “D M Levine” requires an interpretation skill that today’s software may lack.
Likewise, different units of measurement (or scale) may not be obvious without additional,
human interpretation. Consider the variable Air Temperature, recorded in degrees Celsius in one
system and degrees Fahrenheit in another. The value 30 would be a plausible value under either
measurement system and, without further knowledge or context, would be impossible to spot as a Celsius
measurement in a column of otherwise Fahrenheit measurements.
Missing Values
Missing values are values that were not collected for a variable. For example, survey data may
include answers for which no response was given by the survey taker. Such “no responses” are
examples of missing values. Missing values can also result from integrating two data sources that
do not have a row-to-row correspondence for each row in both sources. The lack of correspondence
causes particular variable columns to be longer, that is, to contain more rows, than the other columns.
For these additional rows, missing would be the value for the cells in the shorter columns.
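To see how integration creates missing values, consider this Python sketch that combines two sources keyed by a customer ID. The two sources, their values, and the use of `None` as the missing marker are assumptions made for illustration.

```python
MISSING = None  # marker used in this sketch for a missing value

# Hypothetical sources keyed by customer ID; the IDs do not fully overlap.
purchases = {101: 250.0, 102: 75.5, 104: 310.0}
satisfaction = {101: 4, 103: 5, 104: 2}

combined = []
for customer_id in sorted(set(purchases) | set(satisfaction)):
    combined.append({
        "Customer ID": customer_id,
        # A customer absent from one source gets a missing value there.
        "Purchases": purchases.get(customer_id, MISSING),
        "Satisfaction": satisfaction.get(customer_id, MISSING),
    })

for row in combined:
    print(row)
```

Customers 102 and 103 each appear in only one source, so each combined row for them carries one missing value.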
Do not confuse missing values with miscoded values. Unresolved miscoded values—values
that cannot be cleaned by any method—might be changed to missing by some researchers or
excluded from analysis by others.
Algorithmic Cleaning of Extreme Numerical Values
For numerical variables without a defined range of possible values, you might find outliers,
values that seem excessively different from most of the other values. Such values may or may
not be errors, but all outliers require review. While there is no one standard for defining outliers,
most define outliers in terms of descriptive measures such as the standard deviation or the interquartile range that Chapter 3 discusses. Because software can compute such measures, spotting
outliers can be automated if a definition of the term that uses such a measure is adopted. As later
chapters note as appropriate, identifying outliers is important as some methods are sensitive to
outliers and produce very different results when outliers are included in analysis.
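As one illustration of automating the search for outliers, the Python sketch below flags values that lie more than 1.5 times the interquartile range beyond the first or third quartile. That cutoff is one widely used convention, not the only definition, and the function name and data values are hypothetical. Quartile methods also differ among programs, so other software may compute slightly different fences.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly data usage values, in MB.
usage_mb = [0.4, 2.7, 5.6, 4.3, 11.4, 26.8, 1.6, 1079.0, 8.3, 4.2]
print(iqr_outliers(usage_mb))
```

A flagged value still requires review. The rule only identifies candidates and does not decide whether a value is an error.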
1.5 Other Data Preprocessing Tasks
In addition to data cleaning, there are several other data preprocessing tasks that you might
undertake before visualizing and analyzing your data.
Data Formatting
Data formatting includes rearranging the structure of the data or changing the electronic
encoding of the data or both. For example, consider financial data that has been collected for a
sample of companies. The collected data may be structured as tables of data, as the contents of
standard forms, in a continuous stock ticker stream, or as messages or blog entries that appear
on various websites. These data sources have various levels of structure which affect the ease
of reformatting them for use.
Because tables of data are highly structured and are similar to the structure of a worksheet,
tables would require the least reformatting. In the best case, the rows and columns of a table
would become the rows and columns of a worksheet. Unstructured data sources, such as messages and blog entries, often represent the worst case. The data may need to be paraphrased,
characterized, or summarized in a way that does not involve a direct transfer. As the use of business analytics grows (see Chapter 14), the use of automated ways to paraphrase or characterize
these and other types of unstructured data grows, too.
Independent of the structure, collected data may exist in an electronic form that needs to
be changed in order to be analyzed. For example, data presented as a digital picture of Excel
worksheets would need to be changed into an actual Excel worksheet before that data could be
analyzed. In this example, the electronic encoding of the data changes from a picture format
such as jpeg to an Excel workbook format. Sometimes, individual numerical values that have
been collected may need to be changed, especially collected values that result from a computational process. Demonstrate this issue in Excel by entering a formula that is equivalent to the
expression 1 * (0.5 - 0.4 - 0.1). This should evaluate as 0 but Excel evaluates to a very small
negative number. Altering that value to 0 would be part of the data cleaning process.
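The same behavior appears in other software that uses binary floating-point arithmetic, because values such as 0.4 and 0.1 cannot be stored exactly. The Python sketch below reproduces the issue and shows one possible cleaning step; rounding to ten decimal places is an arbitrary choice made for this illustration.

```python
result = 1 * (0.5 - 0.4 - 0.1)
print(result)        # a very small negative number, not 0
print(result == 0)   # the computed value does not equal zero

# One possible cleaning step: round away the floating-point residue.
cleaned = round(result, 10)
print(cleaned == 0)
```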
Stacking and Unstacking Data
When collecting data for a numerical variable, subdividing that data into two or more groups
for analysis may be necessary. For example, data about the cost of a restaurant meal in an urban
area might be subdivided to consider the cost of meals at restaurants in the center city district
separately from the meal costs at metro area restaurants. When using data that represent two or
more groups, data can be arranged as either unstacked or stacked.
To use an unstacked arrangement, create separate numerical variables for each group. For
this example, create a center city meal cost variable and a second variable to hold the meal costs
at metro area restaurants. To use a stacked arrangement format, pair the single numerical variable meal cost with a second, categorical variable that contains two categories, such as center
city and metro area. If collecting data for several numerical variables, each of which will be
subdivided in the same way, stacking the data will be the more efficient choice.
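The two arrangements hold the same data, and converting between them is mechanical. The Python sketch below uses hypothetical meal costs and hypothetical helper names to show both directions.

```python
def stack(unstacked):
    """Turn {group: [values]} into a list of (group, value) rows."""
    return [(group, value)
            for group, values in unstacked.items()
            for value in values]


def unstack(stacked):
    """Turn (group, value) rows back into {group: [values]}."""
    unstacked = {}
    for group, value in stacked:
        unstacked.setdefault(group, []).append(value)
    return unstacked


# Unstacked: one numerical variable per group (hypothetical costs in dollars).
unstacked = {"Center City": [52, 38, 61], "Metro Area": [44, 29]}

# Stacked: one numerical variable paired with a categorical location variable.
stacked = stack(unstacked)
print(stacked)
print(unstack(stacked) == unstacked)
```

Note that the unstacked groups may differ in length, whereas every stacked row has the same two entries.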
When using software to analyze data, a specific procedure may require data to be stacked
(or unstacked). When such cases arise using Microsoft Excel, JMP, or Minitab for problems or
examples that this book discusses, a workbook or project will contain that data in both arrangements. For example, Restaurants, which Chapter 2 uses for several examples, contains both the
original (stacked) data about restaurants as well as an unstacked worksheet (or data table) that
contains the meal cost by location, center city or metro area.
Recoding Variables
After data have been collected, categories defined for a categorical variable may need to be
reconsidered or a numerical variable may need to be transformed into a categorical variable by
assigning individual numeric values to one of several groups. For either case, define a recoded
variable that supplements or replaces the original variable in your analysis.
For example, having already defined the variable class standing with the categories freshman, sophomore, junior, and senior, a researcher decides to investigate the differences between
lowerclassmen (freshmen or sophomores) and upperclassmen (juniors or seniors). The researcher
can define a recoded variable UpperLower and assign the value Upper if a student is a junior or
senior and assign the value Lower if the student is a freshman or sophomore.
When recoding variables, make sure that one and only one of the new categories can be assigned
to any particular value being recoded and that each value can be recoded successfully by one of your
new categories, the properties known as being mutually exclusive and collectively exhaustive.
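The class standing example can be sketched in Python with a lookup table. Because each original category appears once as a key, the new categories are mutually exclusive, and raising an error for any unlisted value is one way to check that they are collectively exhaustive. The names `RECODE` and `recode_standing` are hypothetical.

```python
# Each original category maps to exactly one new category.
RECODE = {
    "freshman": "Lower",
    "sophomore": "Lower",
    "junior": "Upper",
    "senior": "Upper",
}


def recode_standing(standing):
    """Map a class standing to Upper or Lower; reject anything else."""
    try:
        return RECODE[standing]
    except KeyError:
        raise ValueError(f"no recoded category for {standing!r}") from None


students = ["junior", "freshman", "senior", "sophomore"]
upper_lower = [recode_standing(s) for s in students]
print(upper_lower)
```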
When recoding numerical variables, pay particular attention to the operational definitions of the
categories created for the recoded variable, especially if the categories are not self-defining ranges.
For example, while the recoded categories Under 12, 12–20, 21–34, 35–54, and 55-and-over are
self-defining for age, the categories child, youth, young adult, middle aged, and senior each need
to be further defined in terms of mutually exclusive and collectively exhaustive numerical ranges.
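For numerical ranges, one way to guarantee both properties is to define each category by its lower bound only, so that every range ends where the next begins. The Python sketch below applies this idea to the age categories above; it assumes ages are recorded as whole numbers of years, and the names used are hypothetical.

```python
# (lower bound, label) pairs in ascending order. Each range runs up to the
# next lower bound, so the ranges cannot overlap or leave gaps.
AGE_GROUPS = [
    (0, "Under 12"),
    (12, "12-20"),
    (21, "21-34"),
    (35, "35-54"),
    (55, "55-and-over"),
]


def recode_age(age):
    """Return the age group for a whole-number age in years."""
    if age < 0:
        raise ValueError("age cannot be negative")
    label = None
    for lower, name in AGE_GROUPS:
        if age >= lower:
            label = name  # keep the last group whose lower bound is reached
    return label


print([recode_age(a) for a in (11, 12, 20, 21, 54, 55, 90)])
```

Categories such as child or senior could be recoded the same way once numerical lower bounds have been chosen for them.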
PROBLEMS FOR SECTIONS 1.4 AND 1.5
APPLYING THE CONCEPTS
1.26 The cell phone brands owned by a sample of 20 respondents
were:
Apple, Samsung, Appel, Nokia, Blackberry, HTC, Apple, Samsung,
HTC, LG, Blueberry, Samsung, Samsung, APPLE, Motorola,
Apple, Samsun, Apple, Samsung
a. Clean these data and identify any irregularities in the data.
b. Are there any missing values in this set of 20 respondents? Identify the missing values.
1.27 The amount of monthly data usage by a sample of 10 cell
phone users (in MB) was:
0.4, 2.7MB, 5.6, 4.3, 11.4, 26.8, 1.6, 1,079, 8.3, 4.2
Are there any potential irregularities in the data?
1.28 An amusement park company owns three hotels on an adjoining site. A guest relations manager wants to study the time it takes
for shuttle buses to travel from each of the hotels to the amusement park entrance. Data were collected on a particular day for the
recorded travel times in minutes.
a. Explain how the data could be organized in an unstacked format.
b. Explain how the data could be organized in a stacked format.
1.29 A hotel management company runs 10 hotels in a resort
area. The hotels have a mix of pricing—some hotels have budget-priced rooms, some have moderate-priced rooms, and some have
deluxe-priced rooms. Data are collected that indicate the number
of rooms that are occupied at each hotel on each day of a month.
Explain how the 10 hotels can be recoded into these three price
categories.
1.6 Types of Survey Errors
Collected data in the form of compiled responses from a survey must be verified to ensure that the
results can be used in a decision-making process. Verification begins by evaluating the validity
of the survey to make sure the survey does not lack objectivity or credibility. To do this, evaluate
the purpose of the survey, the reason the survey was conducted, and for whom the survey was
conducted.
Having validated the objectivity and credibility of the survey, determine whether the survey
was based on a probability sample (see Section 1.3). Surveys that use nonprobability samples
are subject to serious biases that render their results useless for decision-making purposes. In the
case of the Coca-Cola managers concerned about the “Pepsi Challenge” results (see page 19),
the managers failed to reflect on the subjective nature of the challenge as well as the nonprobability sample that this survey used. Had the managers done so, they might not have been so
quick to make the reformulation blunder that was reversed just weeks later.
Even after verification, surveys can suffer from any combination of the following types of
survey errors: coverage error, nonresponse error, sampling error, or measurement error. Developers of well-designed surveys seek to reduce or minimize these types of errors, often at considerable cost.
Coverage Error
The key to proper sample selection is having an adequate frame. Coverage error occurs if
certain groups of items are excluded from the frame so that they have no chance of being
selected in the sample or if items are included from outside the frame. Coverage error results
in a selection bias. If the frame is inadequate because certain groups of items in the population
were not properly included, any probability sample selected will provide only an estimate of the
characteristics of the frame, not the actual population.
Nonresponse Error
Not everyone is willing to respond to a survey. Nonresponse error arises from failure
to collect data on all items in the sample and results in a nonresponse bias. Because a
researcher cannot always assume that persons who do not respond to surveys are similar to
those who do, researchers need to follow up on the nonresponses after a specified period of
time. Researchers should make several attempts to convince such individuals to complete
the survey and possibly offer an incentive to participate. The follow-up responses are then
compared to the initial responses in order to make valid inferences from the survey (see
references 2, 4, and 6). The mode of response the researcher uses, such as face-to-face
interview, telephone interview, paper questionnaire, or computerized questionnaire, affects
the rate of response. Personal interviews and telephone interviews usually produce a higher
response rate than do mail surveys—but at a higher cost.
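A small simulation can illustrate why nonrespondents cannot be assumed to resemble respondents. In the Python sketch below, every number is hypothetical: 60% of the population favors option A, but those who favor A respond at half the rate of those who do not.

```python
import random

rng = random.Random(1936)  # fixed seed so that the run is repeatable

TRUE_SUPPORT = 0.60                        # hypothetical population share for A
RESPONSE_RATE = {True: 0.20, False: 0.40}  # hypothetical response rates


def survey(sample_size):
    """Return the share favoring A among the people who respond."""
    responses = []
    for _ in range(sample_size):
        favors_a = rng.random() < TRUE_SUPPORT
        if rng.random() < RESPONSE_RATE[favors_a]:
            responses.append(favors_a)
    return sum(responses) / len(responses)


estimate = survey(100_000)
print(f"true share {TRUE_SUPPORT:.2f}, survey estimate {estimate:.2f}")
```

Even with 100,000 people sampled, the estimate falls well below the true share, because a larger sample does nothing to correct for the difference between those who respond and those who do not.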
Sampling Error
When conducting a probability sample, chance dictates which individuals or items will or will
not be included in the sample. Sampling error reflects the variation, or “chance differences,”
from sample to sample, based on the probability of particular individuals or items being selected
in the particular samples.
When there is a news report about the results of surveys or polls in newspapers or on the
Internet, there is often a statement regarding a margin of error, such as “the results of this poll
are expected to be within ±4 percentage points of the actual value.” This margin of error is
the sampling error. Using larger sample sizes reduces the sampling error. Of course, doing so
increases the cost of conducting the survey.
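To see how sample size affects the margin of error, the Python sketch below uses a common large-sample approximation for a sample proportion. The formula is not developed in this section, and the values z = 1.96 (for 95% confidence) and p = 0.5 (the most conservative case) are assumptions made for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (600, 2400, 9600):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")
```

Under these assumptions, a sample of about 600 corresponds to roughly ±4 percentage points, and cutting the margin of error in half requires about four times as many responses, which is why reducing sampling error is costly.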
Measurement Error
In the practice of good survey research, design surveys with the intention of gathering meaningful and accurate information. Unfortunately, the survey results are often only a proxy for the
ones sought. Unlike height or weight, certain information about behaviors and psychological
states is impossible or impractical to obtain directly.
When surveys rely on self-reported information, the mode of data collection, the respondent
to the survey, and/or the survey itself can be possible sources of measurement error. Satisficing,
social desirability, reading ability, and/or interviewer effects can be dependent on the mode of
data collection. The social desirability bias or cognitive/memory limitations of a respondent can
affect the results. Vague questions, double-barreled questions that ask about multiple issues but
require a single response, or questions that ask the respondent to report something that occurs
over time but fail to clearly define the extent of time about which the question asks (the reference
period) are some of the survey flaws that can cause errors.
To minimize measurement error, standardize survey administration and respondent understanding of questions, but there are many barriers to this (see references 1, 3, and 12).
Ethical Issues About Surveys
Ethical considerations arise with respect to the four types of survey error. Coverage error
can result in selection bias and becomes an ethical issue if particular groups or individuals
are purposely excluded from the frame so that the survey results are more favorable to the
survey’s sponsor. Nonresponse error can lead to nonresponse bias and becomes an ethical
issue if the sponsor knowingly designs the survey so that particular groups or individuals
are less likely than others to respond. Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error so that
the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement
error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading
questions that guide the respondent in a particular direction; (2) an interviewer, through
mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer
or otherwise guides the respondent in a particular direction; or (3) a respondent willfully
provides false information.
Ethical issues also arise when the results of nonprobability samples are used to form
conclusions about the entire population. When using a nonprobability sampling method,
explain the sampling procedures and state that the results cannot be generalized beyond the
sample.
CONSIDER THIS
New Media Surveys/Old Survey Errors
Software company executives decide to create a “customer
experience improvement program” to record how customers use the company’s products, with the goal of using the
collected data to make product enhancements. Product
marketers decide to use social media websites to collect
consumer feedback. These people risk making the same
type of survey error that led to the quick demise of a very
successful magazine nearly 80 years ago.
By 1935, “straw polls” conducted by the magazine Literary Digest had successfully predicted five consecutive U.S.
presidential elections. For the 1936 election, the magazine
promised its largest poll ever and sent about 10 million
ballots to people all across the country. After tabulating more
than 2.3 million ballots, the Digest confidently proclaimed
that Alf Landon would be an easy winner over Franklin
D. Roosevelt. The actual results: FDR won in a landslide and
Landon received the fewest electoral votes in U.S. history.
Being so wrong ruined the reputation of Literary Digest,
and it would cease publication less than two years after it
made its erroneous claim. A review much later found that
the low response rate (less than 25% of the ballots distributed were returned) and nonresponse error (Roosevelt
voters were less likely to mail in a ballot than Landon voters)
were significant reasons for the failure of the Literary Digest
poll (see reference 11).
The Literary Digest error proved to be a watershed event
in the history of sample surveys. First, the error disproved
the assertion that the larger the sample is, the better the
results will be—an assertion some people still mistakenly
make today. The error paved the way for the modern
methods of sampling discussed in this chapter and gave
prominence to the more “scientific” methods that George
Gallup and Elmo Roper both used to correctly predict the
1936 elections. (Today’s Gallup Polls and Roper Reports
remember those researchers.)
In more recent times, Microsoft software executives
overlooked that experienced users could easily opt out of
participating in their improvement program. This created
another case of nonresponse error which may have led to
the improved product (Microsoft Office) being so poorly
received initially by experienced Office users who, by
being more likely to opt out of the improvement program,
biased the data that Microsoft used to determine Office
“improvements.”
And while those product marketers may be able to
collect a lot of customer feedback data, those data also
suffer from nonresponse error. In collecting data from social
media websites, the marketers cannot know who chose not
to leave comments. The marketers also cannot verify if the
data collected suffer from a selection bias due to a coverage error.
That you might use media newer than the mailed,
dead-tree form that Literary Digest used does not mean
that you automatically avoid the old survey errors. Just the
opposite—the accessibility and reach of new media makes
it much easier for unknowing people to commit such errors.
PROBLEMS FOR SECTION 1.6
APPLYING THE CONCEPTS
1.30 A survey indicates that the vast majority of college students
own their own smartphones. What information would you want to
know before you accepted the results of this survey?
1.31 A simple random sample of n = 300 full-time employees is
selected from a company list containing the names of all N = 5,000
full-time employees in order to evaluate job satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible nonresponse error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
SELF TEST 1.32 Results of a 2017 Computer Services, Inc. (CSI)
survey of a sample of 163 bank executives reveal insights
on banking priorities among financial institutions (goo.gl/mniYMM).
As financial institutions begin planning for a new year,
of utmost importance is boosting profitability and identifying
growth areas. The results show that 55% of bank institutions note
customer experience initiatives as an area in which spending is
expected to increase. Implementing a customer relationship management (CRM) solution was ranked as the top most important
omnichannel strategy to pursue with 41% of institutions citing digital banking enhancements as the greatest anticipated strategy to
enhance the customer experience.
Identify potential concerns with coverage, nonresponse, sampling,
and measurement errors.
1.33 A recent PwC survey of 1,379 CEOs from a wide range of
industries representing a mix of company sizes from Asia, Europe,
and the Americas indicated that CEOs are firmly convinced that it is
harder to gain and retain people’s trust in an increasingly digitalized
world (pwc.to/2jFLzjF). Fifty-eight percent of CEOs are worried
that lack of trust in business would harm their company’s growth.
Which risks arising from connectivity concern CEOs most? Eighty-seven percent believe that social media could have a negative impact
on the level of trust in their industry over the next few years. But
they also say new dangers are emerging and old ones are getting
worse as new technologies and new uses of existing technologies
increase rapidly. CEOs are particularly anxious about breaches
in data security and ethics and IT outages and disruptions. A vast
majority of CEOs are already taking steps to address these concerns, with larger-sized companies doing more than smaller-sized
companies.
What additional information would you want to know about the
survey before you accepted the results for the study?
1.34 A recent survey points to tremendous revenue potential and
consumer value in leveraging driver and vehicle data in the automobile industry. The 2017 KPMG Global Automotive Executive Study
found that automobile executives believe data will be the fuel for
future business models and that they will make money from that
data (prn.to/2q9rubN). Eighty-two percent of automobile executives agree that in order to create value and consequently monetize
data, a car needs its own ecosystem/operating system; otherwise the
valuable consumer and/or vehicle data will likely be routed through
third parties and valuable revenue streams will be lost. What additional information would you want to know about the survey before
you accepted the results of the study?
USING STATISTICS
Defining Moments, Revisited
The New Coke and airline quality cases illustrate missteps
that can occur during the define and collect tasks of the
DCOVA framework. To use statistics effectively, you must
properly define a business problem or goal and then collect
data that will allow you to make observations and reach
conclusions that are relevant to that problem or goal.
In the New Coke case, managers failed to consider that
data collected about a taste test would not necessarily provide
useful information about the sales issues they faced. The
managers also did not realize that the test used improper
sampling techniques, deliberately introduced biases, and was
subject to coverage and nonresponse errors. Those mistakes
invalidated the test, making the conclusion that New Coke
tasted better than Pepsi an invalid claim.
In the airline quality case, no mistakes in defining and collecting data
were made. The result that fliers like quality was a valid
one, but decision makers overlooked that quality was not the
most significant factor for people buying seats on transatlantic
flights (price was). This case illustrates that no matter how
well you apply statistics, if you do not properly analyze the
business problem or goal being considered, you may end
up with valid results that lead you to invalid management
decisions.
SUMMARY
In this chapter, you learned the details about the Define and
Collect tasks of the DCOVA framework, which are important first steps to applying statistics properly to decision
making. You learned that defining variables means developing an operational definition that includes establishing the
type of variable and the measurement scale that the variable
uses. You learned important details about data collection as
well as some new basic vocabulary terms (sample, population, and parameter) and a more precise definition of statistic. You specifically learned about sampling and the types
of sampling methods available to you. Finally, you surveyed
data preparation considerations and learned about the types
of survey errors you can encounter.
REFERENCES
1. Biemer, P. B., R. M. Graves, L. E. Lyberg, A. Mathiowetz, and
S. Sudman. Measurement Errors in Surveys. New York: Wiley
Interscience, 2004.
2. Cochran, W. G. Sampling Techniques, 3rd ed. New York: Wiley,
1977.
3. Fowler, F. J. Improving Survey Questions: Design and
Evaluation, Applied Special Research Methods Series, Vol. 38,
Thousand Oaks, CA: Sage Publications, 1995.
4. Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E.
Singer, and R. Tourangeau. Survey Methodology, 2nd ed. New
York: John Wiley, 2009.
5. Hellerstein, J. “Quantitative Data Cleaning for Large
Databases.” bit.ly/2q7PGIn.
6. Lohr, S. L. Sampling Design and Analysis, 2nd ed. Boston, MA:
Brooks/Cole Cengage Learning, 2010.
7. Polaris Marketing Research. “Brilliant Marketing Research
or What? The New Coke Story,” posted September 20, 2011.
bit.ly/1DofHSM (removed).
8. Press, G. “Cleaning Big Data: Most Time-Consuming, Least
Enjoyable Data Science Task, Survey Says,” posted March 23,
2016. bit.ly/2oNCwzh.
9. Rosenbaum, D. “The New Big Data Magic,” posted August 20,
2011. bit.ly/1DUMWzv.
10. Osbourne, J. Best Practices in Data Cleaning. Thousand Oaks,
CA: Sage Publications, 2012.
11. Squire, P. “Why the 1936 Literary Digest Poll Failed.” Public
Opinion Quarterly 52 (1988): 125–133.
12. Sudman, S., N. M. Bradburn, and N. Schwarz. Thinking About
Answers: The Application of Cognitive Processes to Survey
Methodology. San Francisco, CA: Jossey-Bass, 1993.
KEY TERMS
categorical variable 19
cluster 24
cluster sample 24
collectively exhaustive 28
continuous variable 20
convenience sample 23
coverage error 29
data cleaning 26
discrete variable 20
frame 23
interval scale 20
judgment sample 23
margin of error 30
measurement error 30
measurement scale 20
missing value 27
mutually exclusive 28
nominal scale 20
nonprobability sample 23
nonresponse bias 29
nonresponse error 29
numerical variable 19
operational definition 19
ordinal scale 20
outlier 27
parameter 22
population 21
primary data source 22
probability sample 23
qualitative variable 20
quantitative variable 20
ratio scale 20
recoded variable 28
sample 21
sampling error 30
sampling with replacement 23
sampling without replacement 23
secondary data source 22
selection bias 29
simple random sample 23
stacked 28
statistic 22
strata 24
stratified sample 24
systematic sample 24
table of random numbers 24
treatment 22
unstacked 28
CHECKING YOUR UNDERSTANDING
1.35 What is the difference between a sample and a population?
1.36 What is the difference between a statistic and a parameter?
1.37 What is the difference between a categorical variable and a
numerical variable?
1.38 What is the difference between a discrete numerical variable
and a continuous numerical variable?
1.39 What is the difference between a nominal scaled variable and
an ordinal scaled variable?
1.40 What is the difference between an interval scaled variable and
a ratio scaled variable?
1.41 What is the difference between probability sampling and nonprobability sampling?
1.42 What is the difference between a missing value and an outlier?
1.43 What is the difference between unstacked and stacked
variables?
1.44 What is the difference between coverage error and nonresponse error?
1.45 What is the difference between sampling error and measurement error?
CHAPTER REVIEW PROBLEMS
1.46 Visit the official website for Microsoft Excel
(products.office.com/excel), Minitab (www.minitab.com), or JMP
(www.jmp.com). Review the features of the program you chose and then
state the ways the program could be useful in statistical analysis.
1.47 Results of a 2017 Computer Services, Inc. (CSI) survey of
a sample of 163 bank executives reveal insights on banking priorities among financial institutions (goo.gl/mniYMM). As financial
institutions begin planning for a new year, of utmost importance is
boosting profitability and identifying growth areas.
The results show that 55% of bank institutions note customer experience initiatives as an area in which spending is expected to increase.
Implementing a customer relationship management (CRM) solution
was ranked as the top most important omnichannel strategy to pursue with 41% of institutions citing digital banking enhancements as
the greatest anticipated strategy to enhance the customer experience.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.48 The Gallup organization releases the results of recent polls
on its website, www.gallup.com. Visit this site and read an article
of interest.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.49 A recent PwC survey of 1,379 CEOs from a wide range of
industries representing a mix of company sizes from Asia, Europe,
and the Americas indicated that CEOs are firmly convinced that it
is harder to gain and retain people’s trust in an increasingly digitalized world (pwc.to/2jFLzjF). Fifty-eight percent of CEOs are
worried that lack of trust in business would harm their company’s
growth. Which risks arising from connectivity concern CEOs most?
Eighty-seven percent believe that social media could have a negative impact on the level of trust in their industry over the next few
years. But they also say new dangers are emerging and old ones
are getting worse as new technologies and new uses of existing
technologies increase rapidly. CEOs are particularly anxious about
breaches in data security and ethics and IT outages and disruptions. A vast majority of CEOs are already taking steps to address
these concerns, with larger-sized companies doing more than
smaller-sized companies.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c).
1.50 The American Community Survey (www.census.gov/acs)
provides data every year about communities in the United States.
Addresses are randomly selected and respondents are required to
supply answers to a series of questions.
a. Describe a variable for which data are collected.
b. Is the variable categorical or numerical?
c. If the variable is numerical, is it discrete or continuous?
1.51 Download and examine Zarca Interactive’s “Sample Survey
for Associations/Sample Questions for Surveys for Associations,”
available at bit.ly/2p5HlGO.
a. Give an example of a categorical variable included in the survey.
b. Give an example of a numerical variable included in the survey.
1.52 Three professors examined awareness of four widely disseminated retirement rules among employees at the University of
Utah. These rules provide simple answers to questions about retirement planning (R. N. Mayer, C. D. Zick, and M. Glaittle, “Public
Awareness of Retirement Planning Rules of Thumb,” Journal of
Personal Finance, 2011 10(62), 12–35). At the time of the investigation, there were approximately 10,000 benefited employees,
and 3,095 participated in the study. Demographic data collected
on these 3,095 employees included gender, age (years), education
level (years completed), marital status, household income ($), and
employment category.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Indicate whether each of the demographic variables mentioned
is categorical or numerical.
1.53 Social media provides an enormous amount of data about
the activities and habits of people using social platforms like Facebook and Twitter. The belief is that mining that data provides a
treasure trove for those who seek to quantify and predict future
human behavior. A marketer is planning a survey of Internet users in
the United States to determine social media usage. The objective of
the survey is to gain insight on these three items: key social media
platforms used, frequency of social media usage, and demographics
of key social media platform users.
a. For each of the three items listed, indicate whether the variables are categorical or numerical. If a variable is numerical, is
it discrete or continuous?
b. Develop five categorical questions for the survey.
c. Develop five numerical questions for the survey.
CHAPTER 1 CASES
Managing Ashland MultiComm Services
Ashland MultiComm Services (AMS) provides high-quality
telecommunications services in the Greater Ashland area. AMS
traces its roots to a small company that redistributed the broadcast television signals from nearby major metropolitan areas
but has evolved into a provider of a wide range of broadband
services for residential customers.
AMS offers subscription-based services for digital cable
television, local and long-distance telephone services, and
high-speed Internet access. Recently, AMS has faced competition from other service providers as well as Internet-based,
on-demand streaming services that have caused many customers
to “cut the cable” and drop their subscription to cable video
services.
AMS management believes that a combination of increased
promotional expenditures, adjustment in subscription fees, and
improved customer service will allow AMS to successfully face
these challenges. To help determine the proper mix of strategies to be taken, AMS management has decided to organize a
research team to undertake a study.
The managers suggest that the research team examine the
company’s own historical data for number of subscribers, revenues, and subscription renewal rates for the past few years.
They direct the team to examine year-to-date data as well, as the
managers suspect that some of the changes they have seen have
been a relatively recent phenomenon.
1. What type of data source would the company’s own historical
data be? Identify other possible data sources that the research
team might use to examine the current marketplace for residential broadband services in a city such as Ashland.
2. What type of data collection techniques might the team
employ?
3. In their suggestions and directions, the AMS managers have
named a number of possible variables to study, but offered no
operational definitions for those variables. What types of possible misunderstandings could arise if the team and managers
do not first properly define each variable cited?
CardioGood Fitness
CardioGood Fitness is a developer of high-quality cardiovascular
exercise equipment. Its products include treadmills, fitness bikes,
elliptical machines, and e-glides. CardioGood Fitness looks to
increase the sales of its treadmill products and has hired The
AdRight Agency, a small advertising firm, to create and implement an advertising program. The AdRight Agency plans to identify particular market segments that are most likely to buy their
clients’ goods and services and then locate advertising outlets that
will reach that market group. This activity includes collecting
data on clients’ actual sales and on the customers who make the
purchases, with the goal of determining whether there is a distinct
profile of the typical customer for a particular product or service.
If a distinct profile emerges, efforts are made to match that profile
to advertising outlets known to reflect the particular profile, thus
targeting advertising directly to high-potential customers.
CardioGood Fitness sells three different lines of treadmills.
The TM195 is an entry-level treadmill. It is as dependable as
other models offered by CardioGood Fitness, but with fewer
programs and features. It is suitable for individuals who thrive
on minimal programming and the desire for simplicity to initiate
their walk or hike. The TM195 sells for $1,500.
The middle-line TM498 adds to the features of the entry-level model two user programs and up to 15% elevation upgrade.
The TM498 is suitable for individuals who are walkers at a transitional stage from walking to running or midlevel runners. The
TM498 sells for $1,750.
The top-of-the-line TM798 is structurally larger and heavier
and has more features than the other models. Its unique features include a bright blue backlit LCD console, quick speed
and incline keys, a wireless heart rate monitor with a telemetric
chest strap, remote speed and incline controls, and an anatomical
figure that specifies which muscles are minimally and maximally activated. This model features a nonfolding platform base
that is designed to handle rigorous, frequent running; the TM798
is therefore appealing to someone who is a power walker or a
runner. The selling price is $2,500.
As a first step, the market research team at AdRight is
assigned the task of identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness.
The market research team decides to investigate whether there
are differences across the product lines with respect to customer
characteristics. The team decides to collect data on individuals
who purchased a treadmill at a CardioGood Fitness retail store
during the prior three months.
The team decides to use both business transactional data
and the results of a personal profile survey that every purchaser
completes as their sources of data. The team identifies the following customer variables to study: product purchased—TM195,
TM498, or TM798; gender; age, in years; education, in years;
relationship status, single or partnered; annual household income
($); mean number of times the customer plans to use the treadmill
each week; mean number of miles the customer expects to walk/
run each week; and self-rated fitness on a 1-to-5 scale, where 1
is poor shape and 5 is excellent shape. For this set of variables:
1. Which variables in the survey are categorical?
2. Which variables in the survey are numerical?
3. Which variables are discrete numerical variables?
Clear Mountain State Student Survey
The Student News Service at Clear Mountain State University
(CMSU) has decided to gather data about the undergraduate
students who attend CMSU. They create and distribute a survey
of 14 questions and receive responses from 111 undergraduates
(stored in StudentSurvey).
Download (see Appendix C) and review the survey document CMUndergradSurvey.pdf. For each question asked in the
survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify
whether it is discrete or continuous.
Learning with the Digital Cases
Identifying and preventing misuses of statistics is an important
responsibility for all managers. The Digital Cases allow you to
practice the skills necessary for this important task.
Each chapter’s Digital Case tests your understanding of how
to apply an important statistical concept taught in the chapter.
As in many business situations, not all of the information you
encounter will be relevant to your task, and you may occasionally discover conflicting information that you have to resolve in
order to complete the case.
To assist your learning, each Digital Case begins with a
learning objective and a summary of the problem or issue at
hand. Each case directs you to the information necessary to
reach your own conclusions and to answer the case questions.
Many cases, such as the sample case worked out next, extend
a chapter’s Using Statistics scenario. You can download digital
case files, which are PDF format documents that may contain
extended features such as interactivity or data file attachments. Open
these files with a current version of Adobe Reader, as other PDF
programs may not support the extended features. (For more
information, see Appendix C.)
To illustrate learning with a Digital Case, open the Digital
Case file WhitneyWireless.pdf that contains summary information about the Whitney Wireless business. Apparently, from
the claim on the title page, this business is celebrating its “best
sales year ever.”
Review the Who We Are, What We Do, and What We
Plan to Do sections on the second page. Do these sections contain any useful information? What questions does this passage
raise? Did you notice that while many facts are presented, no
data that would support the claim of “best sales year ever” are
presented? And were those mobile “mobilemobiles” used solely
for promotion? Or did they generate any sales? Do you think
that a talk-with-your-mouth-full event, however novel, would
be a success?
Continue to the third page and the Our Best Sales Year
Ever! section. How would you support such a claim? With
a table of numbers? Remarks attributed to a knowledgeable
source? Whitney Wireless has used a chart to present “two years
ago” and “latest twelve months” sales data by category. Are there
any problems with what the company has done? Absolutely!
Take a moment to identify and reflect on those problems.
Then turn to pages 4 through 6, which present an annotated version
of the first three pages and discuss some of the problems with
this document.
In subsequent Digital Cases, you will be asked to provide
this type of analysis, using the open-ended case questions as
your guide. Not all the cases are as straightforward as this example, and some cases include perfectly appropriate applications of
statistical methods. And none have annotated answers!
CHAPTER 1 EXCEL GUIDE
EG1.1 DEFINING VARIABLES
Classifying Variables by Type
Microsoft Excel infers the variable type from the data one
enters into a column. If Excel discovers a column that contains
numbers, it treats the column as a numerical variable. If Excel
discovers a column that contains words or alphanumeric entries,
it treats the column as a non-numerical (categorical) variable.
This imperfect method works most of the time, especially
if one makes sure that the categories for a categorical variable are words or phrases such as “yes” and “no.” However,
because one cannot explicitly define the variable type, Excel
enables one to do nonsensical things such as using a categorical variable with a statistical method designed for numerical variables. If one must use categorical values such as 1, 2,
or 3, enter them preceded with an apostrophe, as Excel treats
all values that begin with an apostrophe as non-numerical
data. (To check whether a cell entry includes a leading apostrophe, select the cell and view its contents in the formula bar.)
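The inference rule described above can be illustrated with a short Python sketch. This is not Excel's actual algorithm; it only mimics the behavior the text describes, and the function name `infer_variable_type` is a hypothetical helper introduced for illustration.

```python
def infer_variable_type(column):
    """Classify a column as 'numerical' or 'categorical' from its entries."""
    def looks_numeric(entry):
        if isinstance(entry, str):
            # A leading apostrophe marks the entry as text, as described above.
            if entry.startswith("'"):
                return False
            try:
                float(entry)
            except ValueError:
                return False
            return True
        # Plain numbers count as numeric; booleans are excluded on purpose.
        return isinstance(entry, (int, float)) and not isinstance(entry, bool)

    return "numerical" if all(looks_numeric(e) for e in column) else "categorical"


# Category codes typed as bare numbers are read as a numerical variable.
print(infer_variable_type([1, 2, 3]))
# The same codes entered with a leading apostrophe are read as categorical.
print(infer_variable_type(["'1", "'2", "'3"]))
# Words are always read as categorical.
print(infer_variable_type(["yes", "no"]))
```

The first call shows the pitfall the text warns about: category codes 1, 2, and 3 look like numbers unless they are marked as text.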
EG1.2 COLLECTING DATA
There are no Excel Guide instructions for Section 1.2.
EG1.3 TYPES of SAMPLING METHODS
Simple Random Sample
Key Technique Use the RANDBETWEEN(smallest
integer, largest integer) function to generate a random integer that can then be used to select an item from a frame.
Example 1 Create a simple random sample with replacement of size 40 from a population of 800 items.
Workbook Enter a formula that uses this function and then
copy the formula down a column for as many rows as is necessary. For example, to create a simple random sample with
replacement of size 40 from a population of 800 items, open
to a new worksheet. Enter Sample in cell A1 and enter the
formula =RANDBETWEEN(1, 800) in cell A2. Then copy
the formula down the column to cell A41.
Excel contains no functions to select a random sample
without replacement. Such samples are most easily created
using an add-in such as PHStat or the Analysis ToolPak, as
described in the following paragraphs.
Analysis ToolPak Use Sampling to create a random sample with replacement.
For the example, open to the worksheet that contains the population of 800 items in column A and that contains a column
heading in cell A1. Select Data ➔ Data Analysis. In the Data
Analysis dialog box, select Sampling from the Analysis
Tools list and then click OK. In the procedure’s dialog box
(shown below):
1. Enter A1:A801 as the Input Range and check Labels.
2. Click Random and enter 40 as the Number of Samples.
3. Click New Worksheet Ply and then click OK.
Example 2 Create a simple random sample without
replacement of size 40 from a population of 800 items.
PHStat Use Random Sample Generation.
For the example, select PHStat ➔ Sampling ➔ Random
Sample Generation. In the procedure’s dialog box (shown
below):
1. Enter 40 as the Sample Size.
2. Click Generate list of random numbers and enter 800
as the Population Size.
3. Enter a Title and click OK.
Unlike most other PHStat results worksheets, the worksheet created contains no formulas.
Workbook Use the COMPUTE worksheet of the Random
workbook as a template.
The worksheet already contains 40 copies of the formula
=RANDBETWEEN(1, 800) in column B. Because the
RANDBETWEEN function samples with replacement as
discussed at the start of this section, one may need to add
additional copies of the formula in new column B rows until
one has 40 unique values.
When the intended sample size is large, spotting duplicate values can be hard. Read the Short Takes for Chapter 1
to learn more about an advanced technique that uses formulas
to detect duplicate values.
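The difference between sampling with and without replacement in this section can be sketched in Python. This is an illustration only, not part of the Excel workbooks; the function names and the seed value are assumptions chosen for the example. The second function follows the idea of drawing additional random integers until 40 unique values are in hand.

```python
import random


def sample_with_replacement(population_size, sample_size, rng):
    # Each draw is independent, like copying =RANDBETWEEN(1, 800) down a column,
    # so the same item number can appear more than once.
    return [rng.randint(1, population_size) for _ in range(sample_size)]


def sample_without_replacement(population_size, sample_size, rng):
    # Keep drawing until sample_size unique item numbers have been collected.
    if sample_size > population_size:
        raise ValueError("sample size cannot exceed population size")
    chosen, seen = [], set()
    while len(chosen) < sample_size:
        draw = rng.randint(1, population_size)
        if draw not in seen:          # discard duplicate draws
            seen.add(draw)
            chosen.append(draw)
    return chosen


rng = random.Random(2024)             # arbitrary seed so reruns repeat the sample
with_repl = sample_with_replacement(800, 40, rng)
without_repl = sample_without_replacement(800, 40, rng)
print(len(with_repl), len(set(with_repl)))        # unique count may be below 40
print(len(without_repl), len(set(without_repl)))  # unique count equals 40
```

Comparing the number of unique values in each list is a quick way to detect duplicates when the sample is too large to check by eye.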
EG1.4 DATA CLEANING
Key Technique Use a column of formulas to detect invalid
variable values in another column.
Example Scan the DirtyDATA worksheet in the Dirty
Data workbook for invalid variable values.
PHStat Use Data Cleaning.
For the example, open to the DirtyData worksheet. Select
Data Preparation ➔ Numerical Data Scan. In the procedure’s dialog box:
1. Enter a column range as the Numerical Variable Cell
Range.
2. Click OK.
The procedure creates a worksheet that contains a column
that identifies every data value as either being numerical or
non-numerical and states the minimum and maximum values
found in the column. To scan for irregularities in categorical
data, use the Workbook instructions.
Workbook Use the ScanData worksheet of the Data
Cleaning workbook as a model solution to scan for the following types of irregularities: non-numerical data values for
a numerical variable, invalid categorical values of a categorical variable, numerical values outside a defined range, and
missing values in individual cells.
The worksheet uses several different Excel functions to
detect an irregularity in one column and display a message in
another column. For each categorical variable scanned, the
worksheet contains a table of valid values that are looked up
and compared to cell values to spot inconsistencies. Read the
Short Takes for Chapter 1 to learn the specifics of the formulas the worksheet uses to scan data.
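The kinds of checks that this section lists can be sketched in Python as follows. This is not the ScanData worksheet's formulas, which the Short Takes describe; the function names, message wording, and sample values are hypothetical and serve only to show one message per scanned cell.

```python
def scan_numerical(values, low, high):
    """Return one message per value: OK, missing, non-numerical, or out of range."""
    messages = []
    for value in values:
        if value is None or str(value).strip() == "":
            messages.append("missing")
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            messages.append("non-numerical")
            continue
        messages.append("OK" if low <= number <= high else "out of range")
    return messages


def scan_categorical(values, valid_values):
    """Compare each value with a table of valid categories."""
    messages = []
    for value in values:
        if value is None or str(value).strip() == "":
            messages.append("missing")
        elif value in valid_values:
            messages.append("OK")
        else:
            messages.append("invalid category")
    return messages


# Hypothetical dirty data: a GPA column checked against the range 0 through 4
# and a Gender column checked against two assumed valid codes.
print(scan_numerical([3.2, "", "abc", 5.1, None], 0, 4))
print(scan_categorical(["F", "M", "", "X"], {"F", "M"}))
```

Writing the message into a parallel list mirrors the worksheet design of detecting an irregularity in one column and reporting it in another.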
EG1.5 OTHER DATA PREPROCESSING
Stacking and Unstacking Variables
PHStat Use Data Preparation ➔ Stack Data (or Unstack
Data).
For Stack Data, in the Stack Data dialog box, enter an
Unstacked Data Cell Range and then click OK to create
stacked data in a new worksheet. For Unstack Data, in the
Unstack Data dialog box, enter a Grouping Variable Cell
Range and a Stacked Data Cell Range and then click OK
to create unstacked data in a new worksheet.
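To make the two layouts concrete, here is a minimal Python sketch of stacking and unstacking. It does not reproduce the PHStat procedures; the function names and the sample group names and values are hypothetical. Unstacked data hold one column per group, while stacked data hold one grouping column and one value column.

```python
def stack(unstacked):
    """Turn {group: [values]} into a list of (group, value) rows."""
    return [(group, value)
            for group, values in unstacked.items()
            for value in values]


def unstack(stacked):
    """Turn (group, value) rows back into {group: [values]}."""
    result = {}
    for group, value in stacked:
        result.setdefault(group, []).append(value)
    return result


# Hypothetical unstacked data: one list of values per group.
unstacked = {"East": [10, 12], "West": [7, 9, 11]}
stacked = stack(unstacked)
print(stacked)
print(unstack(stacked) == unstacked)   # the round trip recovers the original
```

Note that groups can have different numbers of values, which is why unstacked columns may be of uneven length.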
Recoding Variables
Key Technique To recode a categorical variable, first copy
the original variable’s column of data and then use the find-and-replace function on the copied data. To recode a numerical variable, enter a formula that returns a recoded value in
a new column.
Example Using the DATA worksheet of the Recoded
workbook, create the recoded variable UpperLower from
the categorical variable Class and create the recoded variable
Dean’s List from the numerical variable GPA.
Workbook Use the RECODED worksheet of the Recoded
workbook as a model.
The worksheet already contains UpperLower, a recoded
version of Class that uses the operational definitions on page 28,
and Dean’s List, a recoded version of GPA, in which the value
No recodes all GPA values less than 3.3 and Yes recodes all
values 3.3 or greater. The RECODED_FORMULAS worksheet in the same workbook shows how formulas in column I
use the IF function to recode GPA as the Dean’s List variable.
These recoded variables were created by first opening
to the DATA worksheet in the same workbook and then following these steps:
1. Right-click column D (right-click over the shaded “D” at
the top of column D) and click Copy in the shortcut menu.
2. Right-click column H and click the first choice in the
Paste Options gallery.
3. Enter UpperLower in cell H1.
4. Select column H. With column H selected, click Home ➔
Find & Select ➔ Replace.
In the Replace tab of the Find and Replace dialog box:
5. Enter Senior as Find what, Upper as Replace with,
and then click Replace All.
6. Click OK to close the dialog box that reports the results
of the replacement command.
7. Still in the Find and Replace dialog box, enter Junior
as Find what and then click Replace All.
8. Click OK to close the dialog box that reports the results
of the replacement command.
9. Still in the Find and Replace dialog box, enter
Sophomore as Find what, Lower as Replace with,
and then click Replace All.
10. Click OK to close the dialog box that reports the results
of the replacement command.
11. Still in the Find and Replace dialog box, enter Freshman
as Find what and then click Replace All.
12. Click OK to close the dialog box that reports the results
of the replacement command.
(This creates the recoded variable UpperLower in column H.)
13. Enter Dean’s List in cell I1.
14. Enter the formula =IF(G2 < 3.3, "No", "Yes") in cell
I2.
15. Copy this formula down the column to the last row that
contains student data (row 63).
(This creates the recoded variable Dean’s List in column I.)
The RECODED worksheet uses the IF function, which
Appendix F discusses, to recode the numerical variable into
two categories. Numerical variables can also be recoded
into multiple categories by using the VLOOKUP function
(see Appendix F).
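The two recodings in this example can also be written as a short Python sketch. The mapping of Class values to Upper and Lower and the 3.3 cutoff come from the steps above; the function names and the sample rows are hypothetical.

```python
# Class values and their recoded UpperLower values, as in the replace steps.
CLASS_TO_UPPERLOWER = {
    "Senior": "Upper",
    "Junior": "Upper",
    "Sophomore": "Lower",
    "Freshman": "Lower",
}


def recode_class(class_value):
    # Dictionary lookup plays the role of the four find-and-replace passes.
    return CLASS_TO_UPPERLOWER[class_value]


def recode_deans_list(gpa):
    # Same rule as the worksheet formula =IF(G2 < 3.3, "No", "Yes").
    return "No" if gpa < 3.3 else "Yes"


# Hypothetical student rows: (Class, GPA).
students = [("Senior", 3.6), ("Freshman", 2.9), ("Junior", 3.3)]
for class_value, gpa in students:
    print(class_value, recode_class(class_value), gpa, recode_deans_list(gpa))
```

A GPA of exactly 3.3 recodes to Yes, because only values less than 3.3 recode to No.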
CHAPTER 1 JMP GUIDE
JG1.1 DEFINING VARIABLES
Classifying Variables by Type
JMP infers the variable type and scale from the data one
enters in a column. To override any inference JMP makes,
first right-click a column name and select Column Info from
the shortcut menu. In the column info dialog box, change
the Data Type or Modeling Type (scale) to the type desired.
Shown below is the column information dialog box for the
Assets column in the Retirement Funds data table. Contents
of this dialog box will vary depending on JMP inferences
and the entries you make in the dialog box, but the dialog
box will always contain Column Name, as the JMP Guide
in the previous chapter explains. For continuous numerical
variables such as Assets, use Format to control the display of
values. For the format Fixed Dec, the Dec box entry controls
the number of decimal places to which values will be rounded
in the column.
In boxes that list names of variable columns, JMP uses
icons to represent the modeling type of the variable. Below are the
icons for nominal (CustomerID), continuous (YTD Spending),
ordinal (Texts Sent), and unstructured text (Last Text
Message) modeling types, the subset of types that examples
use in this book. (Open the Modeling Types data table to
explore these choices.)
JG1.2 COLLECTING DATA
There are no JMP Guide instructions for Section 1.2.
JG1.3 TYPES of SAMPLING METHODS
Simple Random Sample and Stratified Sample
Use Subset.
To take a simple random sample of data table data, open the
data table and select Tables ➔ Subset. In the Subset dialog
box, click Random - sample size, enter the sample size of
the sample, enter an Output table name, and click OK. In
the illustration below, 20 has been entered as the sample size
and Funds Simple Random Sample as the output table name.
To specify a stratified sample, check Stratify. JMP displays a column list box under this check box from which you
choose the variable that will be used to define the strata.
Systematic Sample
Use a formula column to help identify every kth item and then
use Subset to create the sample.
For example, to take a systematic sample of n = 20 of the
800 Employees data table that contains 800 rows of data,
first determine k = 40 (800 divided by 20). With the data
table open:
1. Right-click Fund Number (name of first column) and
select Insert Columns. The new column, Column 1,
appears to the left of the Fund Number column.
2. Right-click Column 1 heading and select Formula.
3. In the large formula composition pane of the Formula
dialog box (shown below), enter Sequence(1, 40) and
then click OK.
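The idea behind this systematic sampling procedure, labeling the rows with a recurring series from 1 to k and keeping the rows whose label matches one randomly chosen value, can be sketched in Python. This is an illustration outside JMP; the function name and the seed are assumptions, and the sketch assumes that the number of rows is an exact multiple of the sample size, as it is for 800 rows and n = 20.

```python
import random


def systematic_sample(rows, sample_size, rng):
    k = len(rows) // sample_size       # for 800 rows and n = 20, k = 40
    start = rng.randint(1, k)          # random number between 1 and k inclusive
    # Row i (counting from 1) carries the label ((i - 1) % k) + 1, which is the
    # recurring 1..k series; keep the rows whose label equals the random start.
    return [row for i, row in enumerate(rows, start=1)
            if (i - 1) % k + 1 == start]


rows = list(range(1, 801))             # stand-in for 800 data table rows
sample = systematic_sample(rows, 20, random.Random(7))  # arbitrary seed
print(len(sample))
print(sample[:3])                      # consecutive selections are k rows apart
```

Because only the starting point is random, every selected row is exactly k rows after the previous one.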
JMP fills Column 1 with the recurring series 1 to 40.
Next choose a random number between 1 and 40
inclusive by any means and:
1. Click the data table Rows red triangle and select Clear
Row States.
2. Click the data table Rows red triangle a second time
and select Row Selection ➔ Select Where.
In the Select Rows dialog box:
3. Select Column 1 from the column list.
4. Select equals from the first pull-down list.
5. Enter the random number that was selected in the edit
box to the right of the pull-down and click OK.
JMP selects a sample that contains the rows in which the
Column 1 value matches the randomly chosen number. Continue
to copy the rows to a new data table. With the rows still selected:
6. Select Tables ➔ Subset.
7. In the Subset dialog box, click Selected Rows and click
OK.
The sample appears in a new data table in its own window.
JG1.4 DATA CLEANING
Use a variety of techniques including Recode and Row
Selection.
For categorical variables, Recode can be used to spot several
kinds of invalid variable names and coding errors. For example, open the DirtyDATA data table, select the Gender
column and:
1. Select Cols ➔ Recode.
2. In the Recode dialog box, click the red triangle and
select Group Similar Values.
3. In the Grouping Options dialog box, check all check
boxes and click OK.
Back in the Recode dialog box (shown below), JMP attempts
to group together similar values. The success of the regrouping by JMP can vary, but regrouping facilitates your review,
especially of data that contain many rows. Note that the first
entry in the old and new values table is for a cell that is blank:
4. Make entries in the New Values column as necessary.
5. Click the Done pull-down list and select New Column.
JMP places the corrected data in a new column, preserving
the original dirty data in its original column.
To identify non-numerical data in a “numerical” column,
change the column data type to Numeric in the Column
Info dialog box (see Section JG1.1). This will cause JMP to
change all non-numerical data to missing values. The process
works because JMP assigns the character data type to any
column that contains non-numerical data.
To identify numeric values that are outside a defined
range, select the data table Rows red triangle and then select
Row Selection ➔ Select Where.
In the Select Rows dialog box:
1. Select the variable column to be range-checked.
2. Select a relationship from the pull-down list and enter
the appropriate comparison value in the edit box.
3. Click Add Condition to add the condition to the
Selected Conditions list.
Repeat steps 1 through 3 for as many times as necessary. For
the DirtyDATA data table, the variable GPA has a defined
range of 0 through 4 inclusive. The conditions listed in the
Select Rows dialog box (shown at top of next page) check for
GPA values outside that range.
JMP scripts and add-ins exist that semi-automate data cleaning and spot other errors that this section does not address,
sometimes using JMP techniques beyond the scope of this
book to explain.
JG1.5 OTHER PREPROCESSING TASKS
Stacking and Unstacking Variables
Use Stack or Split.
To stack data, select Tables ➔ Stack. In the Stack dialog
box, select the (unstacked) variable columns, click Stack
Columns, and click OK. The stacked data appears in a new
column, with the names of columns that were stacked as the
values in the Label column.
To unstack data, select Tables ➔ Split. In the Split dialog
box, select the categorical variable that holds grouping information and click Split By, select the numerical column (or
columns) to unstack and click Split Columns and then click
OK. The unstacked data appears in a new data table.
Recoding Variables
Use Recode.
To recode the values of either a categorical or numerical
variable, first select the variable column and then select
Cols ➔ Recode (in older JMP versions, Cols ➔ Utilities ➔
Recode). In the Recode dialog box, JMP lists all unique values
found in the column and displays a form in which you can change
one or more of those values. When you finish making changes,
click the Done pull-down list and select New Column.
Recoded values appear in a new column.
CHAPTER 1 MINITAB GUIDE
MG1.1 DEFINING VARIABLES
Classifying Variables by Type
Minitab infers the variable type from the data one enters into
a column as Section MG.2 “Entering Data” explains. Sometimes, Minitab will misclassify a variable, for example, mistaking a numerical variable for a categorical (text) variable.
In such cases, select the column, then select Data ➔ Change
Data Type, and then select one of the choices, for example,
Text to Numeric for the case in which Minitab has mistaken
a numerical variable for a categorical variable.
MG1.2 COLLECTING DATA
There are no Minitab Guide instructions for Section 1.2.
MG1.3 TYPES of SAMPLING METHODS
Simple Random Samples
Use Sample From Columns.
For example, to create a simple random sample with replacement of size 40 from a population of 800 items, first create
the list of 800 employee numbers in column C1.
Select Calc ➔ Make Patterned Data ➔ Simple Set of
Numbers. In the procedure’s dialog box (shown below):
1. Enter C1 in the Store patterned data in box.
2. Enter 1 in the From first value box.
3. Enter 800 in the To last value box.
4. Verify that the three other boxes contain 1 and then click
OK.
With the worksheet containing the column C1 list still open:
5. Select Calc ➔ Random Data ➔ Sample from Columns.
In the Sample From Columns dialog box (shown below):
6. Enter 40 in the Number of rows to sample box.
7. Enter C1 in the From columns box.
8. Enter C2 in the Store samples in box.
9. Click OK.
MG1.4 DATA CLEANING
Minitab cleans the data when one imports data by opening a
file created by another application, such as a workbook file
created by Excel. For an existing worksheet, use a combination of commands and column formulas to count the number
of missing values for a variable, change invalid categorical
values of a categorical variable to a missing value, and identify numerical values that are outside a defined range.
In the import method, select data cleaning options in the
file open dialog box. The cleaning options vary according
to the type of file being imported. For an Excel workbook,
one can specify which values represent missing values and
instruct Minitab to skip a blank row, add missing values to
uneven columns, remove nonprintable characters and extra
spaces, and correct case mismatches.
Read the Short Takes for Chapter 1 to learn the specifics of the Minitab commands and formulas that one can use
to scan data.
MG1.5 OTHER PREPROCESSING TASKS
Recoding Variables
Use the Replace command to recode a categorical variable
and Calculator to recode a numerical variable.
For example, to create the recoded variable UpperLower
from the categorical variable Class (C4-T), open to the DATA
worksheet of the Recode project and:
1. Select the Class column (C4-T).
2. Select Editor ➔ Replace.
In the Replace in Data Window dialog box:
3. Enter Senior as Find what, Upper as Replace with,
and then click Replace All.
4. Click OK to close the dialog box that reports the results
of the replacement command.
5. Still in the Find and Replace dialog box, enter Junior
as Find what (replacing Senior), and then click
Replace All.
6. Click OK to close the dialog box that reports the results
of the replacement command.
7. Still in the Find and Replace dialog box, enter Sophomore
as Find what, Lower as Replace with, and then click
Replace All.
8. Click OK to close the dialog box that reports the results
of the replacement command.
9. Still in the Find and Replace dialog box, enter Freshman
as Find what, and then click Replace All.
10. Click OK to close the dialog box that reports the results
of the replacement command.
To create the recoded variable Dean’s List from the numerical
variable GPA (C7), with the DATA worksheet of the Recode
project still open:
1. Enter Dean’s List as the name of the empty column C8.
2. Select Calc ➔ Calculator.
In the Calculator dialog box (shown below):
3. Enter C8 in the Store result in variable box.
4. Enter IF(GPA < 3.3, "No", "Yes") in the Expression
box.
5. Click OK.
Variables can also be recoded into multiple categories by
using the Data ➔ Code command. Read the Short Takes
for Chapter 1 to learn more about this advanced recoding
technique.
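As a rough illustration of recoding a numerical variable into more than two categories, the following Python sketch looks up each value in a table of cut points. It does not reproduce the Data ➔ Code command; the cut points, labels, and function name are hypothetical and chosen only for the example.

```python
from bisect import bisect_right

# Hypothetical lower bounds that separate four GPA categories.
CUT_POINTS = [2.0, 3.0, 3.5]
LABELS = ["Below 2.0", "2.0 to 2.99", "3.0 to 3.49", "3.5 or higher"]


def recode_gpa(gpa):
    # bisect_right counts how many cut points are less than or equal to gpa,
    # which is the position of the matching label.
    return LABELS[bisect_right(CUT_POINTS, gpa)]


for gpa in [1.9, 2.0, 3.49, 3.5]:
    print(gpa, recode_gpa(gpa))
```

Keeping the cut points in a separate table means that changing the category definitions does not require rewriting the recoding logic.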
CHAPTER 1 TABLEAU GUIDE
TG1.1 DEFINING VARIABLES
Classifying Variables by Type
Tableau infers three attributes of a column from the data that
a column contains. Tableau assigns a data role to each column, initially considering any column that contains categorical data to be a Dimension and any column that contains
numerical data to be a Measure. The data role of the column,
in turn, affects how Tableau processes the column data as
well as which choices Tableau presents in the Show Me visualization gallery (see Chapter 2).
Tableau also classifies a column as either having discrete or
continuous data. Tableau uses the terms discrete and continuous
in the sense that this chapter defines, except that Tableau considers all categorical data as discrete data, too. Tableau colors a
column name green if the column contains continuous data and
colors a column name blue if the column contains discrete data.
Tableau also determines whether the data type in the column
is one of several data types. In the Data tab, Tableau displays
a hashtag icon (#) before the names of columns that contain
whole or decimal numbers, displays an Abc icon before the
names of columns that contain alphanumeric data and displays
a calendar icon before the names of columns that contain date
or date and time data. Tableau precedes a data type icon with an
equals sign if the column contains the results of a calculation or
represents data copied from another column. Tableau presents a
Dimension or Measure name in italics if the name represents a
filtered set. In the illustration below, Number of Records in the
Measures list is a calculation that is also a filtered set.
To change the data role assigned to a column, right-click the
column name and select Convert to Dimension or Convert to
Measure from the shortcut menu. To change the assignment of
discrete or continuous to a column, right-click the column name
and select Convert to Continuous or Convert to Discrete
from the shortcut menu. To change the assigned data type, right-click the column name and select Change Data Type and then
select the appropriate choice from the shortcut menu.
For example, in the illustration above Tableau has classified Region Code as a continuous (green) measure that has
whole or decimal number data. Region Code is a recoding
of Region that uses the numerals 1, 2, 3, and 4 to replace
categorical values. Therefore, Region Code should be a
Dimension that contains discrete, alphanumeric data. Properly
reclassifying Region Code requires three selections from the
column’s shortcut menu: Convert to Dimension, Convert to
Discrete, and Change Data Type ➔ String. (“String” is the
Tableau term for alphanumeric data.)
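The Region Code case can be illustrated with a short Python sketch. It is a simplified stand-in, not Tableau's actual inference logic: the function name infer_data_role, the all-numeric rule, and the sample values are assumptions made for the example. It shows why numerals that stand for categories are first treated as numerical data, and how storing them as strings changes that classification.

```python
def infer_data_role(values):
    """Guess a data role from the values alone (simplified, assumed rule).

    Columns whose values are all numbers are treated as a Measure;
    every other column is treated as a Dimension.
    """
    all_numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in values
    )
    return "Measure" if all_numeric else "Dimension"


# Made-up Region Code values: numerals that stand for four regions.
region_code = [1, 2, 3, 4, 2, 1]
role_before = infer_data_role(region_code)

# Storing the codes as strings marks them as labels rather than quantities.
region_code_as_string = [str(code) for code in region_code]
role_after = infer_data_role(region_code_as_string)
```

Under these assumptions, role_before is "Measure" and role_after is "Dimension". An inference made from values alone cannot tell a code from a quantity, which is why a manual reclassification is needed.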
TG1.2 COLLECTING DATA
There are no Tableau Guide instructions for this section.
TG1.3 TYPES OF SAMPLING METHODS
There are no Tableau Guide instructions for this section.
TG1.4 DATA CLEANING
Use the Data Interpreter.
For certain types of structured data, such as Excel worksheets
and Google sheets, Tableau can apply its Data Interpreter
procedure to examine data for common types of errors. For
an Excel workbook, the interpreter not only imports corrected values but also inserts into the workbook a series of worksheets that document
its work. To use the interpreter, check Use Data Interpreter in the left panel of a Data
Source display (shown below left). When the data cleaning
finishes and Tableau has imported the data to be analyzed,
the panel display changes to Cleaned with Data Interpreter
(shown below right). In the illustration, the data come from the Data
worksheet of the Retirement Funds Excel workbook.
TG1.5 OTHER PREPROCESSING TASKS
There are no Tableau Guide instructions for this section.