Get Complete eBook Download Link below for instant download https://browsegrades.net/documents/2 86751/ebook-payment-link-for-instantdownload-after-payment A Roadmap for Selecting a Statistical Method Data Analysis Task Describing a group or several groups For Numerical Variables Ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2, 2.4) For Categorical Variables Summary table, bar chart, pie chart, doughnut chart, Pareto chart (Sections 2.1 and 2.3) Mean, median, mode, geometric mean, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation, skewness, kurtosis, boxplot, normal probability plot (Sections 3.1, 3.2, 3.3, 6.3) Dashboards (Section 14.2) Inference about one group Comparing two groups Confidence interval estimate of the mean (Sections 8.1 and 8.2) Confidence interval estimate of the proportion (Section 8.3) t test for the mean (Section 9.2) Z test for the proportion (Section 9.4) Tests for the difference in the means of two independent populations (Section 10.1) Z test for the difference between two proportions (Section 10.3) Paired t test (Section 10.2) Chi-square test for the difference between two proportions (Section 12.1) F test for the difference between two variances (Section 10.4) Comparing more than two groups One-way analysis of variance for comparing several means (Section 11.1) Chi-square test for differences among more than two proportions (Section 12.2) Analyzing the relationship between two variables Scatter plot, time series plot (Section 2.5) Contingency table, side-by-side bar chart, PivotTables (Sections 2.1, 2.3, 2.6) Covariance, coefficient of correlation (Section 3.5) Simple linear regression (Chapter 13) t test of correlation (Section 13.7) Sparklines (Section 2.7) Analyzing the relationship between two or more variables Chi-square test of independence (Section 12.3) Colored scatter plots, bubble chart, treemap (Section 2.7) Multidimensional contingency tables (Section 2.6) Multiple regression (Chapters 14) Drilldown and slicers (Section 2.7) Dynamic bubble charts (Section 14.2) Classification trees (Section 14.4) Regression trees (Section 14.3) Multiple correspondence analysis (Section 14.6) Cluster analysis (Section 14.5) Multidimensional scaling (Section 14.6) Business Statistics A First Course Business Statistics A First Course EIGHTH EDITION David M. Levine Department of Information Systems and Statistics Zicklin School of Business, Baruch College, City University of New York Kathryn A. Szabat Department of Business Systems and Analytics School of Business, La Salle University David F. Stephan Two Bridges Instructional Technology Senior VP, Courseware Portfolio Management: Marcia Horton Director, Portfolio Management: Deirdre Lynch Courseware Portfolio Manager: Suzanna Bainbridge Courseware Portfolio Management Assistant: Morgan Danna Managing Producer: Karen Wernholm Content Producer: Kathleen A. Manley Senior Producer: Aimee Thorne Associate Content Producer: Sneh Singh Manager, Courseware QA: Mary Durnwald Manager, Content Development: Robert Carroll Product Marketing Manager: Kaylee Carlson Product Marketing Assistant: Shannon McCormack Field Marketing Manager: Thomas Hayward Field Marketing Assistant: Derrica Moser Senior Author Support/Technology Specialist: Joe Vetere Manager, Rights and Permissions: Gina Cheselka Manufacturing Buyer: Carol Melville, LSC Communications Composition and Production Coordination: Pearson Content Services Centre, Julie Kidd Senior Designer and Cover Design: Barbara T. Atkinson Cover Image: ArtisticPhoto/Shutterstock Copyright © 2020, 2016, 2013 by Pearson Education, Inc. 221 River Street, Hoboken, NJ 07030 All Rights Reserved. Printed in the United States of America. This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsoned.com/permissions/. Attributions of third party content appear on page 651, which constitutes an extension of this copyright page. PEARSON, ALWAYS LEARNING, MYLAB, MYLAB PLUS, MATHXL, LEARNING CATALYTICS, and TESTGEN are exclusive trademarks owned by Pearson Education, Inc. or its affiliates in the U.S. and/or other countries. Unless otherwise indicated herein, any third-party trademarks that may appear in this work are the property of their respective owners and any references to third-party trademarks, logos or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson’s products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc. or its affiliates, authors, licensees or distributors. Microsoft® Windows®, and Microsoft office® are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation. Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics published as part of the services for any purpose. All such documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title and non-infringement. In no event shall microsoft and/or its respective suppliers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services. The documents and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein. Microsoftand/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial screen shots may be viewed in full within the software version specified. Cataloging-in-Publication Data on file with the Library of Congress Instructor’s Review Copy: ISBN 13: 978-0-13-524459-3 ISBN 10: 0-13-524459-5 Student Edition: ISBN 13: 978-0-13-517778-5 ISBN 10: 0-13-517778-2 To our spouses and children, Marilyn, Mary, Sharyn, and Mark and to our parents, in loving memory, Lee, Reuben, Mary, William, Ruth and Francis J. About the Authors David M. Levine, Kathryn A. Szabat, and David F. Stephan are all experienced business school educators committed to innovation and improving instruction in business statistics and related subjects. David Levine, Professor Emeritus of Statistics and CIS at Baruch College, CUNY, is a nationally recognized innovator in statistics education for more than three decades. Levine has coauthored 14 books, including several business statistics textbooks; textbooks and professional titles that explain and explore quality management and the Six Sigma approach; and, with David Stephan, a trade paperback that explains statistical concepts to a general audience. Levine has presented or chaired numerous sessions about business eduKathryn Szabat, David Levine, and David Stephan cation at leading conferences conducted by the Decision Sciences Institute (DSI) and the American Statistical Association, and he and his coauthors have been active participants in the annual DSI Data, Analytics, and Statistics Instruction (DASI) mini-conference. During his many years teaching at Baruch College, Levine was recognized for his contributions to teaching and curriculum development with the College’s highest distinguished teaching honor. He earned B.B.A. and M.B.A. degrees from CCNY. and a Ph.D. in industrial engineering and operations research from New York University. As Associate Professor of Business Systems and Analytics at La Salle University, Kathryn Szabat has transformed several business school majors into one interdisciplinary major that better supports careers in new and emerging disciplines of data analysis including analytics. Szabat strives to inspire, stimulate, challenge, and motivate students through innovation and curricular enhancements, and shares her coauthors’ commitment to teaching excellence and the continual improvement of statistics presentations. Beyond the classroom she has provided statistical advice to numerous business, nonbusiness, and academic communities, with particular interest in the areas of education, medicine, and nonprofit capacity building. Her research activities have led to journal publications, chapters in scholarly books, and conference presentations. Szabat is a member of the American Statistical Association (ASA), DSI, Institute for Operation Research and Management Sciences (INFORMS), and DSI DASI. She received a B.S. from SUNY-Albany, an M.S. in statistics from the Wharton School of the University of Pennsylvania, and a Ph.D. degree in statistics, with a cognate in operations research, from the Wharton School of the University of Pennsylvania. Advances in computing have always shaped David Stephan’s professional life. As an undergraduate, he helped professors use statistics software that was considered advanced even though it could compute only several things discussed in Chapter 3, thereby gaining an early appreciation for the benefits of using software to solve problems (and perhaps positively influencing his grades). An early advocate of using computers to support instruction, he developed a prototype of a mainframe-based system that anticipated features found today in Pearson’s MathXL and served as special assistant for computing to the Dean and Provost at Baruch College. In his many years teaching at Baruch, Stephan implemented the first computer-based classroom, helped redevelop the CIS curriculum, and, as part of a FIPSE project team, designed and implemented a multimedia learning environment. He was also nominated for teaching honors. Stephan has presented at SEDSI and DSI DASI (formerly MSMESB) mini-conferences, sometimes with his coauthors. Stephan earned a B.A. from Franklin & Marshall College and an M.S. from Baruch College, CUNY, and completed the instructional technology graduate program at Teachers College, Columbia University. For all three coauthors, continuous improvement is a natural outcome of their curiosity about the world. Their varied backgrounds and many years of teaching experience have come together to shape this book in ways discussed in the Preface. ix Brief Contents Preface xxiii First Things First 1 1 Defining and Collecting Data 18 2 Organizing and Visualizing Variables 44 3 Numerical Descriptive Measures 130 4 Basic Probability 176 5 Discrete Probability Distributions 207 6 The Normal Distribution 231 7 Sampling Distributions 257 8 Confidence Interval Estimation 279 9 Fundamentals of Hypothesis Testing: One-Sample Tests 314 10 Two-Sample Tests and One-Way ANOVA 354 11 Chi-Square Tests 421 12 Simple Linear Regression 450 13 Multiple Regression 502 14 Business Analytics 538 15 Applications in Quality Management (online) 15-1 Appendices A–H 565 Self-Test Solutions and Answers to Selected Even-Numbered Problems 615 Index 641 Credits 651 xi Contents Preface xxiii First Things First 1 1 Defining and Collecting Data 18 USING STATISTICS: Defining Moments 18 USING STATISTICS: “The Price of Admission” 1 1.1 FTF.1 Think Differently About Statistics 2 Statistics: A Way of Thinking 2 Statistics: An Important Part of Your Business Education 3 Classifying Variables by Type 19 Measurement Scales 20 1.2 FTF.2 Business Analytics: The Changing Face of Statistics 4 “Big Data” 4 FTF.3 Starting Point for Learning Statistics 5 Statistic 5 Can Statistics (pl., statistic) Lie? 6 Data Sources 22 1.3 Types of Sampling Methods 23 Simple Random Sample 23 Systematic Sample 24 Stratified Sample 24 Cluster Sample 24 1.4 Data Cleaning 26 Invalid Variable Values 26 KEY TERMS 9 Coding Errors 26 EXCEL GUIDE 10 Data Integration Errors 26 EG.1 Getting Started with Excel 10 Missing Values 27 EG.2 Entering Data 10 EG.3 Open or Save a Workbook 10 EG.4 Working with a Workbook 11 Algorithmic Cleaning of Extreme Numerical Values 27 1.5 Other Data Preprocessing Tasks 27 Data Formatting 27 Stacking and Unstacking Data 28 Recoding Variables 28 1.6 Types of Survey Errors 29 Coverage Error 29 Nonresponse Error 29 Sampling Error 30 Measurement Error 30 EG.5 Print a Worksheet 11 EG.6 Reviewing Worksheets 11 EG.7 If You use the Workbook Instructions 11 JMP GUIDE 12 JG.1 Getting Started With JMP 12 JG.2 Entering Data 13 JG.3 Create New Project or Data Table 13 JG.4 Open or Save Files 13 JG.5 Print Data Tables or Report Windows 13 JG.6 Jmp Script Files 13 MINITAB GUIDE 14 MG.1 Getting Started with Minitab 14 MG.2 Entering Data 14 MG.3 Open or Save Files 14 MG.4 Insert or Copy Worksheets 14 MG.5 Print Worksheets 15 TABLEAU GUIDE 15 TG.1 Getting Started with Tableau 15 TG.2 Entering Data 16 TG.3 Open or Save a Workbook 16 TG.4 Working with Data 17 TG.5 Print a Workbook 17 Collecting Data 21 Populations and Samples 21 FTF.4 Starting Point for Using Software 6 Using Software Properly 8 REFERENCES 9 Defining Variables 19 Ethical Issues About Surveys 30 CONSIDER THIS: New Media Surveys/Old Survey Errors 31 USING STATISTICS: Defining Moments, Revisited 32 SUMMARY 32 REFERENCES 32 KEY TERMS 33 CHECKING YOUR UNDERSTANDING 33 CHAPTER REVIEW PROBLEMS 33 CASES FOR CHAPTER 1 34 Managing Ashland MultiComm Services 34 CardioGood Fitness 35 Clear Mountain State Student Survey 35 Learning with the Digital Cases 35 xiii xiv CONTENTS Bubble Charts 77 CHAPTER 1 EXCEL GUIDE 37 EG1.1 Defining Variables 37 EG1.2 Collecting Data 37 PivotChart (Excel) 77 Treemap (Excel, JMP, Tableau) 77 EG1.3 Types of Sampling Methods 37 EG1.4 Data Cleaning 38 Sparklines (Excel, Tableau) 78 2.8 Filtering and Querying Data 79 Excel Slicers 79 2.9 Pitfalls in Organizing and Visualizing Variables 81 Obscuring Data 81 Creating False Impressions 82 Chartjunk 83 EG1.5 Other Data Preprocessing 38 CHAPTER 1 JMP GUIDE 39 JG1.1 Defining Variables 39 JG1.2 Collecting Data 39 JG1.3 Types of Sampling Methods 39 JG1.4 Data Cleaning 40 JG1.5 Other Preprocessing Tasks 41 CHAPTER 1 MINITAB GUIDE 41 MG1.1 Defining Variables 41 MG1.2 Collecting Data 41 MG1.3 Types of Sampling Methods 41 MG1.4 Data Cleaning 42 MG1.5 Other Preprocessing Tasks 42 CHAPTER 1 TABLEAU GUIDE 43 TG1.1 Defining Variables 43 TG1.2 Collecting Data 43 TG1.3 Types of Sampling Methods 43 TG1.4 Data Cleaning 43 TG1.5 Other Preprocessing Tasks 43 2 Organizing and Visualizing Variables 44 USING STATISTICS: “The Choice Is Yours” 44 2.1 2.2 Organizing Categorical Variables 45 The Summary Table 45 The Contingency Table 46 Organizing Numerical Variables 49 The Frequency Distribution 50 The Relative Frequency Distribution and the Percentage Distribution 52 The Cumulative Distribution 54 2.3 2.4 Visualizing Categorical Variables 57 The Bar Chart 57 2.6 2.7 SUMMARY 85 REFERENCES 86 KEY EQUATIONS 86 KEY TERMS 87 CHECKING YOUR UNDERSTANDING 87 CHAPTER REVIEW PROBLEMS 87 CASES FOR CHAPTER 2 92 Managing Ashland MultiComm Services 92 Digital Case 92 CardioGood Fitness 93 The Choice Is Yours Follow-Up 93 Clear Mountain State Student Survey 93 CHAPTER 2 EXCEL GUIDE 94 EG2.1 Organizing Categorical Variables 94 EG2.2 Organizing Numerical Variables 96 EG2 Charts Group Reference 98 EG2.3 Visualizing Categorical Variables 98 EG2.4 Visualizing Numerical Variables 100 EG2.5 Visualizing Two Numerical Variables 103 EG2.6 Organizing a Mix of Variables 104 EG2.7 Visualizing a Mix of Variables 105 EG2.8 Filtering and Querying Data 106 CHAPTER 2 JMP GUIDE 106 JG2 JMP Choices for Creating Summaries 106 JG2.1 Organizing Categorical Variables 107 JG2.2 Organizing Numerical Variables 108 JG2.3 Visualizing Categorical Variables 110 The Pie Chart and the Doughnut Chart 58 JG2.4 Visualizing Numerical Variables 111 The Pareto Chart 59 JG2.5 Visualizing Two Numerical Variables 113 Visualizing Two Categorical Variables 61 JG2.6 Organizing a Mix of Variables 114 Visualizing Numerical Variables 64 The Stem-and-Leaf Display 64 The Histogram 65 The Percentage Polygon 66 The Cumulative Percentage Polygon (Ogive) 67 2.5 USING STATISTICS: “The Choice Is Yours,” Revisited 85 Visualizing Two Numerical Variables 71 The Scatter Plot 71 The Time-Series Plot 72 Organizing a Mix of Variables 74 Drill-down 75 Visualizing a Mix of Variables 76 Colored Scatter Plot 76 JG2.7 Visualizing a Mix of Variables 114 JG2.8 Filtering and Querying Data 115 JMP Guide Gallery 116 CHAPTER 2 MINITAB GUIDE 117 MG2.1 Organizing Categorical Variables 117 MG2.2 Organizing Numerical Variables 117 MG2.3 Visualizing Categorical Variables 117 MG2.4 Visualizing Numerical Variables 119 MG2.5 Visualizing Two Numerical Variables 121 MG2.6 Organizing a Mix of Variables 122 MG2.7 Visualizing a Mix of Variables 122 MG2.8 Filtering and Querying Data 123 CONTENTS CHAPTER 2 TABLEAU GUIDE 123 CHAPTER 3 EXCEL GUIDE 168 TG2.1 Organizing Categorical Variables 123 EG3.1 Measures of Central Tendency 168 TG2.2 Organizing Numerical Variables 124 EG3.2 Measures of Variation and Shape 168 TG2.3 Visualizing Categorical Variables 124 EG3.3 Exploring Numerical Variables 169 TG2.4 Visualizing Numerical Variables 126 EG3.4 Numerical Descriptive Measures for a Population 170 TG2.5 Visualizing Two Numerical Variables 127 TG2.6 Organizing a Mix of Variables 127 TG2.7 Visualizing a Mix of Variables 128 3 Numerical Descriptive Measures 130 USING STATISTICS: More Descriptive Choices 130 3.1 3.2 Measures of Central Tendency 131 The Mean 131 The Median 133 The Mode 134 Measures of Variation and Shape 135 The Range 135 The Variance and the Standard Deviation 135 The Coefficient of Variation 138 Z Scores 139 Shape: Skewness 140 Shape: Kurtosis 141 3.3 Exploring Numerical Variables 145 Quartiles 145 The Interquartile Range 147 The Five-Number Summary 148 The Boxplot 149 3.4 Numerical Descriptive Measures for a Population 152 The Population Mean 152 The Population Variance and Standard Deviation 153 The Empirical Rule 154 Chebyshev’s Theorem 154 3.5 3.6 EG3.5 The Covariance and the Coefficient of Correlation 170 CHAPTER 3 JMP GUIDE 171 JG3.1 Measures of Central Tendency 171 JG3.2 Measures of Variation and Shape 171 JG3.3 Exploring Numerical Variables 171 JG3.4 Numerical Descriptive Measures for a Population 172 JG3.5 The Covariance and the Coefficient of Correlation 172 CHAPTER 3 MINITAB GUIDE 173 MG3.1 Measures of Central Tendency 173 MG3.2 Measures of Variation and Shape 173 MG3.3 Exploring Numerical Variables 174 MG3.4 Numerical Descriptive Measures for a Population 174 MG3.5 The Covariance and the Coefficient of Correlation 174 CHAPTER 3 TABLEAU GUIDE 175 TG3.3 Exploring Numerical Variables 175 4 Basic Probability 4.1 Basic Probability Concepts 177 Events and Sample Spaces 177 Types of Probability 178 Summarizing Sample Spaces 179 Simple Probability 180 Joint Probability 181 Marginal Probability 182 General Addition Rule 182 4.2 Conditional Probability 186 Calculating Conditional Probabilities 186 Decision Trees 187 Independence 189 Multiplication Rules 190 Marginal Probability Using the General Multiplication Rule 191 Descriptive Statistics: Pitfalls and Ethical Issues 161 SUMMARY 162 REFERENCES 162 KEY EQUATIONS 162 KEY TERMS 163 176 USING STATISTICS: Possibilities at M&R Electronics World 176 The Covariance and the Coefficient of Correlation 156 The Covariance 156 The Coefficient of Correlation 157 USING STATISTICS: More Descriptive Choices, Revisited 161 xv 4.3 Ethical Issues and Probability 193 4.4 Bayes’ Theorem 194 CONSIDER THIS: Divine Providence and Spam 196 4.5 Counting Rules 197 CHECKING YOUR UNDERSTANDING 163 USING STATISTICS: Possibilities at M&R Electronics World, Revisited 200 CHAPTER REVIEW PROBLEMS 164 SUMMARY 201 CASES FOR CHAPTER 3 167 REFERENCES 201 Managing Ashland MultiComm Services 167 Digital Case 167 CardioGood Fitness 167 KEY EQUATIONS 201 KEY TERMS 202 More Descriptive Choices Follow-up 167 CHECKING YOUR UNDERSTANDING 202 Clear Mountain State Student Survey 167 CHAPTER REVIEW PROBLEMS 202 xvi CONTENTS CASES FOR CHAPTER 4 204 Digital Case 204 CardioGood Fitness 204 The Choice Is Yours Follow-Up 204 Clear Mountain State Student Survey 204 CHAPTER 4 EXCEL GUIDE 205 6 The Normal Distribution 231 USING STATISTICS: Normal Load Times at MyTVLab 231 6.1 Continuous Probability Distributions 232 6.2 The Normal Distribution 232 EG4.1 Basic Probability Concepts 205 Role of the Mean and the Standard Deviation 234 EG4.4 Bayes’ Theorem 205 Calculating Normal Probabilities 235 EG4.5 Counting Rules 205 CHAPTER 4 JMP GUIDE 206 JG4.4 Bayes’ Theorem 206 Finding X Values 240 CONSIDER THIS: What Is Normal? 243 6.3 CHAPTER 4 MINITAB GUIDE 206 MG4.5 Counting Rules 206 5 Discrete Probability Distributions 207 USING STATISTICS: Events of Interest at Ricknel Home Centers 207 5.1 5.2 The Probability Distribution for a Discrete Variable 208 Expected Value of a Discrete Variable 208 Variance and Standard Deviation of a Discrete Variable 209 Binomial Distribution 212 Histograms for Discrete Variables 215 Summary Measures for the Binomial Distribution 216 5.3 Poisson Distribution 219 USING STATISTICS: Events of Interest …, Revisited 222 SUMMARY 222 REFERENCES 223 KEY EQUATIONS 223 KEY TERMS 223 CHECKING YOUR UNDERSTANDING 223 CHAPTER REVIEW PROBLEMS 223 CASES FOR CHAPTER 5 226 Evaluating Normality 245 Comparing Data Characteristics to Theoretical Properties 245 Constructing the Normal Probability Plot 246 USING STATISTICS: Normal Load Times …, Revisited 249 SUMMARY 249 REFERENCES 249 KEY EQUATIONS 250 KEY TERMS 250 CHECKING YOUR UNDERSTANDING 250 CHAPTER REVIEW PROBLEMS 250 CASES FOR CHAPTER 6 252 Managing Ashland MultiComm Services 252 CardioGood Fitness 252 More Descriptive Choices Follow-up 252 Clear Mountain State Student Survey 252 Digital Case 252 CHAPTER 6 EXCEL GUIDE 253 EG6.2 The Normal Distribution 253 EG6.3 Evaluating Normality 253 CHAPTER 6 JMP GUIDE 254 JG6.2 The Normal Distribution 254 JG6.3 Evaluating Normality 254 CHAPTER 6 MINITAB GUIDE 255 MG6.2 The Normal Distribution 255 MG6.3 Evaluating Normality 256 Managing Ashland MultiComm Services 226 Digital Case 226 CHAPTER 5 EXCEL GUIDE 227 EG5.1 The Probability Distribution for a Discrete Variable 227 7 Sampling Distributions 257 EG5.2 Binomial Distribution 227 USING STATISTICS: Sampling Oxford Cereals 257 EG5.3 Poisson Distribution 227 7.1 Sampling Distributions 258 CHAPTER 5 JMP GUIDE 228 7.2 Sampling Distribution of the Mean 258 The Unbiased Property of the Sample Mean 258 JG5.1 The Probability Distribution for a Discrete Variable 228 Standard Error of the Mean 260 JG5.2 Binomial Distribution 228 Sampling from Normally Distributed Populations 261 Sampling from Non-normally Distributed Populations— The Central Limit Theorem 264 VISUAL EXPLORATIONS: Exploring Sampling Distributions 268 JG5.3 Poisson Distribution 229 CHAPTER 5 MINITAB GUIDE 229 MG5.1 The Probability Distribution for a Discrete Variable 229 MG5.2 Binomial Distribution 230 MG5.3 Poisson Distribution 230 7.3 Sampling Distribution of the Proportion 269 CONTENTS USING STATISTICS: Sampling Oxford Cereals, Revisited 272 CardioGood Fitness 307 SUMMARY 273 Clear Mountain State Student Survey 307 REFERENCES 273 KEY EQUATIONS 273 KEY TERMS 273 CHECKING YOUR UNDERSTANDING 273 CHAPTER REVIEW PROBLEMS 274 CASES FOR CHAPTER 7 275 xvii More Descriptive Choices Follow-Up 307 CHAPTER 8 EXCEL GUIDE 308 EG8.1 Confidence Interval Estimate for the Mean (s Known) 308 EG8.2 Confidence Interval Estimate for the Mean (s Unknown) 308 EG8.3 Confidence Interval Estimate for the Proportion 309 EG8.4 Determining Sample Size 309 CHAPTER 8 JMP GUIDE 310 JG8.1 Confidence Interval Estimate for the Mean (s Known) 310 Managing Ashland MultiComm Services 275 JG8.2 Confidence Interval Estimate for the Mean (s Unknown) 310 Digital Case 275 JG8.3 Confidence Interval Estimate for the Proportion 311 CHAPTER 7 EXCEL GUIDE 276 EG7.2 Sampling Distribution of the Mean 276 CHAPTER 7 JMP GUIDE 277 JG7.2 Sampling Distribution of the Mean 277 CHAPTER 7 MINITAB GUIDE 278 JG8.4 Determining Sample Size 311 CHAPTER 8 MINITAB GUIDE 312 MG8.1 Confidence Interval Estimate for the Mean (s Known) 312 MG8.2 Confidence Interval Estimate for the Mean (s Unknown) 312 MG8.3 Confidence Interval Estimate for the Proportion 313 MG7.2 Sampling Distribution of the Mean 278 MG8.4 Determining Sample Size 313 8 Confidence Interval Estimation 279 9 Fundamentals of Hypothesis Testing: One-Sample Tests 314 USING STATISTICS: Getting Estimates at Ricknel Home Centers 279 8.1 Confidence Interval Estimate for the Mean (σ Known) 280 Sampling Error 281 USING STATISTICS: Significant Testing at Oxford Cereals 314 9.1 Risks in Decision Making Using Hypothesis Testing 317 Can You Ever Know the Population Standard Deviation? 284 8.2 Confidence Interval Estimate for the Mean (σ Unknown) 285 Student’s t Distribution 285 The Concept of Degrees of Freedom 286 Properties of the t Distribution 286 The Confidence Interval Statement 288 8.3 Confidence Interval Estimate for the Proportion 293 8.4 Determining Sample Size 296 Z Test for the Mean (s Known) 319 Hypothesis Testing Using the Critical Value Approach 320 Hypothesis Testing Using the p-Value Approach 323 A Connection Between Confidence Interval Estimation and Hypothesis Testing 325 Can You Ever Know the Population Standard Deviation? 326 9.2 t Test of Hypothesis for the Mean (σ Unknown) 327 Using the Critical Value Approach 328 Using the p-Value Approach 329 Checking the Normality Assumption 330 9.3 One-Tail Tests 333 Sample Size Determination for the Mean 296 Sample Size Determination for the Proportion 298 8.5 Confidence Interval Estimation and Ethical Issues 301 USING STATISTICS: Getting Estimates at Ricknel Home Centers, Revisited 301 SUMMARY 301 Using the Critical Value Approach 333 Using the p-Value Approach 335 9.4 Z Test of Hypothesis for the Proportion 337 Using the Critical Value Approach 339 Using the p-Value Approach 339 9.5 Potential Hypothesis-Testing Pitfalls and Ethical Issues 341 Important Planning Stage Questions 341 REFERENCES 302 KEY EQUATIONS 302 KEY TERMS 302 CHECKING YOUR UNDERSTANDING 303 CHAPTER REVIEW PROBLEMS 303 CASES FOR CHAPTER 8 306 Managing Ashland MultiComm Services 306 Digital Case 307 Sure Value Convenience Stores 307 Fundamentals of Hypothesis Testing 315 The Critical Value of the Test Statistic 316 Regions of Rejection and Nonrejection 317 Statistical Significance Versus Practical Significance 342 Statistical Insignificance Versus Importance 342 Reporting of Findings 342 Ethical Issues 342 USING STATISTICS: Significant Testing …, Revisited 343 SUMMARY 343 xviii CONTENTS REFERENCES 343 F Test for Differences Among More Than Two Means 386 KEY EQUATIONS 344 One-Way ANOVA F Test Assumptions 391 Levene Test for Homogeneity of Variance 392 KEY TERMS 344 Multiple Comparisons: The Tukey-Kramer Procedure 393 CHECKING YOUR UNDERSTANDING 344 USING STATISTICS I: Differing Means for Selling …, Revisited 398 CHAPTER REVIEW PROBLEMS 344 CASES FOR CHAPTER 9 346 USING STATISTICS II: The Means to Find Differences …, Revisited 399 SUMMARY 399 Managing Ashland MultiComm Services 346 Digital Case 346 Sure Value Convenience Stores 347 REFERENCES 400 CHAPTER 9 EXCEL GUIDE 348 EG9.1 Fundamentals of Hypothesis Testing 348 KEY EQUATIONS 401 EG9.2 t Test of Hypothesis for the Mean (s Unknown) 348 KEY TERMS 401 EG9.3 One-Tail Tests 349 CHECKING YOUR UNDERSTANDING 402 EG9.4 Z Test of Hypothesis for the Proportion 349 CHAPTER REVIEW PROBLEMS 402 CHAPTER 9 JMP GUIDE 350 CASES FOR CHAPTER 10 404 JG9.1 Fundamentals of Hypothesis Testing 350 JG9.2 t Test of Hypothesis for the Mean (s Unknown) 350 JG9.3 One-Tail Tests 351 CardioGood Fitness 406 CHAPTER 9 MINITAB GUIDE 351 More Descriptive Choices Follow-Up 406 MG9.1 Fundamentals of Hypothesis Testing 351 Clear Mountain State Student Survey 406 MG9.2 t Test of Hypothesis for the Mean (s Unknown) 352 MG9.3 One-Tail Tests 352 EG10.2 Comparing the Means of Two Related Populations 409 EG10.3 Comparing the Proportions of Two Independent Populations 410 354 USING STATISTICS I: Differing Means for Selling Streaming Media Players at Arlingtons? 354 10.1 Comparing the Means of Two Independent Populations 355 Pooled-Variance t Test for the Difference Between Two Means Assuming Equal Variances 355 Evaluating the Normality Assumption 358 Confidence Interval Estimate for the Difference Between Two Means 360 Separate-Variance t Test for the Difference Between Two Means, Assuming Unequal Variances 361 CONSIDER THIS: Do People Really Do This? 362 10.2 Comparing the Means of Two Related Populations 364 Paired t Test 365 Confidence Interval Estimate for the Mean Difference 370 10.3 Comparing the Proportions of Two Independent Populations 372 Z Test for the Difference Between Two Proportions 372 Confidence Interval Estimate for the Difference Between Two Proportions 376 10.4 F Test for the Ratio of Two Variances 379 USING STATISTICS II: The Means to Find Differences at Arlingtons 383 Analyzing Variation in One-Way ANOVA 384 CHAPTER 10 EXCEL GUIDE 407 EG10.1 Comparing the Means of Two Independent Populations 407 MG9.4 Z Test of Hypothesis for the Proportion 352 10.5 One-Way ANOVA 384 Digital Case 405 Sure Value Convenience Stores 405 JG9.4 Z Test of Hypothesis for the Proportion 351 10 Two-Sample Tests and One-Way ANOVA Managing Ashland MultiComm Services 404 EG10.4 F Test For The Ratio of Two Variances 411 EG10.5 One-Way ANOVA 411 CHAPTER 10 JMP GUIDE 414 JG10.1 Comparing the Means of Two Independent Populations 414 JG10.2 Comparing the Means of Two Related Populations 415 JG10.3 Comparing the Proportions of Two Independent Populations 415 JG10.4 F Test for the Ratio of Two Variances 416 JG10.5 One-Way ANOVA 416 CHAPTER 10 MINITAB GUIDE 417 MG10.1 Comparing the Means of Two Independent Populations 417 MG10.2 Comparing the Means of Two Related Populations 417 MG10.3 Comparing the Proportions of Two Independent Populations 418 MG10.4 F Test for the Ratio Of Two Variances 418 MG10.5 One-Way ANOVA 419 11 Chi-Square Tests 421 USING STATISTICS: Avoiding Guesswork About Resort Guests 421 11.1 Chi-Square Test for the Difference Between Two Proportions 422 11.2 Chi-Square Test for Differences Among More Than Two Proportions 429 11.3 Chi-Square Test of Independence 435 CONTENTS USING STATISTICS: Avoiding Guesswork …, Revisited 441 xix Residual Plots to Detect Autocorrelation 470 The Durbin-Watson Statistic 471 SUMMARY 441 12.7 Inferences About the Slope and Correlation Coefficient 474 t Test for the Slope 474 F Test for the Slope 475 Confidence Interval Estimate for the Slope 477 t Test for the Correlation Coefficient 477 REFERENCES 442 KEY EQUATIONS 442 KEY TERMS 442 CHECKING YOUR UNDERSTANDING 442 CHAPTER REVIEW PROBLEMS 442 PHASE 2 444 12.8 Estimation of Mean Values and Prediction of Individual Values 480 The Confidence Interval Estimate for the Mean Response 481 The Prediction Interval for an Individual Response 482 Digital Case 445 12.9 Potential Pitfalls in Regression 484 CASES FOR CHAPTER 11 444 Managing Ashland MultiComm Services 444 PHASE 1 444 CardioGood Fitness 445 USING STATISTICS: Knowing Customers …, Revisited 486 Clear Mountain State Student Survey 445 SUMMARY 487 CHAPTER 11 EXCEL GUIDE 446 EG11.1 Chi-Square Test for the Difference Between Two Proportions 446 EG11.2 Chi-Square Test for Differences Among More Than Two Proportions 446 EG11.3 Chi-Square Test of Independence 447 KEY EQUATIONS 488 KEY TERMS 489 CHECKING YOUR UNDERSTANDING 489 CHAPTER REVIEW PROBLEMS 490 CHAPTER 11 JMP GUIDE 448 JG11.1 Chi-Square Test for the Difference Between Two Proportions 448 JG11.2 Chi-Square Test for Difference Among More Than Two Proportions 448 JG11.3 Chi-Square Test of Independence 448 CASES FOR CHAPTER 12 493 Managing Ashland MultiComm Services 493 Digital Case 493 Brynne Packaging 493 CHAPTER 12 EXCEL GUIDE 494 CHAPTER 11 MINITAB GUIDE 449 MG11.1 Chi-Square Test for the Difference Between Two Proportions 449 MG11.2 Chi-Square Test for Differences Among More Than Two Proportions 449 MG11.3 Chi-Square Test of Independence 449 12 Simple Linear Regression REFERENCES 488 450 EG12.2 Determining the Simple Linear Regression Equation 494 EG12.3 Measures of Variation 495 EG12.5 Residual Analysis 495 EG12.6 Measuring Autocorrelation: the Durbin-Watson Statistic 496 EG12.7 Inferences About the Slope and Correlation Coefficient 496 EG12.8 Estimation of Mean Values and Prediction of Individual Values 496 CHAPTER 12 JMP GUIDE 497 JG12.2 Determining the Simple Linear Regression Equation 497 USING STATISTICS: Knowing Customers at Sunflowers Apparel 450 Preliminary Analysis 451 JG12.3 Measures of Variation 497 12.1 Simple Linear Regression Models 452 JG12.7 Inferences About the Slope and Correlation Coefficient 497 JG12.8 Estimation of Mean Values and Prediction of Individual Values 498 12.2 Determining the Simple Linear Regression Equation 453 The Least-Squares Method 453 Predictions in Regression Analysis: Interpolation Versus Extrapolation 456 Calculating the Slope, b1, and the Y Intercept, b0 457 12.3 Measures of Variation 461 Computing the Sum of Squares 461 The Coefficient of Determination 463 Standard Error of the Estimate 464 12.4 Assumptions of Regression 465 12.5 Residual Analysis 466 Evaluating the Assumptions 466 12.6 Measuring Autocorrelation: The Durbin-Watson Statistic 470 JG12.5 Residual Analysis 497 JG12.6 Measuring Autocorrelation: the Durbin-Watson Statistic 497 CHAPTER 12 MINITAB GUIDE 499 MG12.2 Determining the Simple Linear Regression Equation 499 MG12.3 Measures of Variation 500 MG12.5 Residual Analysis 500 MG12.6 Measuring Autocorrelation: The Durbin-Watson Statistic 500 MG12.7 Inferences About the Slope and Correlation Coefficient 500 MG12.8 Estimation of Mean Values and Prediction of Individual Values 500 CHAPTER 12 TABLEAU GUIDE 501 TG12.2 Determining the Simple Linear Regression Equation 501 TG12.3 Measures of Variation 501 xx CONTENTS 13 Multiple Regression 502 USING STATISTICS: The Multiple Effects of OmniPower Bars 502 13.1 Developing a Multiple Regression Model 503 Interpreting the Regression Coefficients 504 Predicting the Dependent Variable Y 506 13.2 Evaluating Multiple Regression Models 508 Coefficient of Multiple Determination, r 2 508 Adjusted r 2 509 F Test for the Significance of the Overall Multiple Regression Model 509 13.3 Multiple Regression Residual Analysis 512 13.4 Inferences About the Population Regression Coefficients 513 Tests of Hypothesis 514 Confidence Interval Estimation 515 14 Business Analytics 538 USING STATISTICS: Back to Arlingtons for the Future 538 14.1 Business Analytics Categories 539 Inferential Statistics and Predictive Analytics 540 Supervised and Unsupervised Methods 540 CONSIDER THIS: What’s My Major If I Want to Be a Data Miner? 541 14.2 Descriptive Analytics 542 Dashboards 542 Data Dimensionality and Descriptive Analytics 543 14.3 Predictive Analytics for Prediction 544 14.4 Predictive Analytics for Classification 547 14.5 Predictive Analytics for Clustering 548 14.6 Predictive Analytics for Association 551 Multidimensional Scaling (MDS) 552 14.7 Text Analytics 553 13.5 Using Dummy Variables and Interaction Terms 517 Interactions 519 14.8 Prescriptive Analytics 554 USING STATISTICS: The Multiple Effects …, Revisited 524 USING STATISTICS: Back to Arlingtons …, Revisited 555 SUMMARY 524 REFERENCES 555 REFERENCES 526 KEY EQUATIONS 556 KEY EQUATIONS 526 KEY TERMS 556 KEY TERMS 526 CHECKING YOUR UNDERSTANDING 556 CHECKING YOUR UNDERSTANDING 526 CHAPTER REVIEW PROBLEMS 556 CHAPTER REVIEW PROBLEMS 526 CHAPTER 14 SOFTWARE GUIDE 558 Introduction 558 SG14.2 Descriptive Analytics 558 CASES FOR CHAPTER 13 529 Managing Ashland MultiComm Services 529 Digital Case 529 CHAPTER 13 EXCEL GUIDE 530 EG13.1 Developing a Multiple Regression Model 530 EG13.2 Evaluating Multiple Regression Models 531 EG13.3 Multiple Regression Residual Analysis 531 EG13.4 Inferences About the Population Regression Coefficients 532 EG13.5 Using Dummy Variables and Interaction Terms 532 CHAPTER 13 JMP GUIDE 532 SG14.3 Predictive Analytics for Prediction 561 SG14.4 Predictive Analytics for Classification 561 SG14.5 Predictive Analytics for Clustering 562 SG14.6 Predictive Analytics for Association 564 15 Statistical Applications in Quality Management (online) 15-1 JG13.1 Developing a Multiple Regression Model 532 JG13.2 Evaluating Multiple Regression Models 533 JG13.3 Multiple Regression Residual Analysis 533 JG13.4 Inferences About the Population 534 JG13.5 Using Dummy Variables And Interaction Terms 534 CHAPTER 13 MINITAB GUIDE 535 MG13.1 Developing a Multiple Regression Model 535 MG13.2 Evaluating Multiple Regression Models 536 MG13.3 Multiple Regression Residual Analysis 536 MG13.4 Inferences About the Population Regression Coefficients 536 MG13.5 Using Dummy Variables and Interaction Terms In Regression Models 536 USING STATISTICS: Finding Quality at the Beachcomber 15-1 15.1 The Theory of Control Charts 15-2 The Causes of Variation 15-2 15.2 Control Chart for the Proportion: The p Chart 15-4 15.3 The Red Bead Experiment: Understanding Process Variability 15-10 15.4 Control Chart for an Area of Opportunity: The c Chart 15-12 15.5 Control Charts for the Range and the Mean 15-15 The R Chart 15-15 The X Chart 15-18 CONTENTS 15.6 Process Capability 15-21 Customer Satisfaction and Specification Limits 15-21 Capability Indices 15-22 CPL, CPU, and Cpk 15-23 15.7 Total Quality Management 15-26 15.8 Six Sigma 15-27 The DMAIC Model 15-28 Roles in a Six Sigma Organization 15-29 Lean Six Sigma 15-29 USING STATISTICS: Finding Quality at the Beachcomber, Revisited 15-30 REFERENCES 15-31 KEY EQUATIONS 15-31 KEY TERMS 15-32 CHAPTER REVIEW PROBLEMS 15-32 CASES FOR CHAPTER 15 15-35 The Harnswell Sewing Machine Company Case 15-35 Managing Ashland Multicomm Services 15-37 CHAPTER 15 EXCEL GUIDE 15-38 EG15.2 Control Chart for the Proportion: The p Chart 15-38 EG15.4 Control Chart for an Area Of Opportunity: The c Chart 15-39 EG15.5 Control Charts for the Range And The Mean 15-40 EG15.6 Process Capability 15-41 CHAPTER 15 JMP GUIDE 15-41 JG15.2 Control Chart for the Proportion: The p Chart 15-41 JG15.4 Control Chart for an Area of Opportunity: The c Chart 15-41 JG15.5 Control Charts for the Range and the Mean 15-42 JG15.6 Process Capability 15-42 CHAPTER 15 MINITAB GUIDE 15-42 MG15.2 Control Chart for the Proportion: The p Chart 15-42 MG15.4 Control Chart for an Area of Opportunity: The c Chart 15-43 MG15.5 Control Charts for the Range and the Mean 15-43 MG15.6 Process Capability 15-43 Appendices 565 A. BASIC MATH CONCEPTS AND SYMBOLS 566 A.1 A.2 A.3 A.4 A.5 A.6 Operators 566 Rules for Arithmetic Operations 566 Rules for Algebra: Exponents and Square Roots 566 Rules for Logarithms 567 Summation Notation 568 Greek Alphabet 571 B. IMPORTANT SOFTWARE SKILLS AND CONCEPTS 572 B.1 B.2 Identifying the Software Version 572 Formulas 572 B.3 B.4 B.5E B.5J B.5M B.5T B.6 B.7 xxi Excel Cell References 574 Excel Worksheet Formatting 575 Excel Chart Formatting 576 JMP Chart Formatting 577 Minitab Chart Formatting 578 Tableau Chart Formatting 578 Creating Histograms for Discrete Probability Distributions (Excel) 579 Deleting the “Extra” Histogram Bar (Excel) 580 C. ONLINE RESOURCES 581 C.1 C.2 C.3 C.4 About the Online Resources for This Book 581 Data Files 581 Files Integrated With Microsoft Excel 586 Supplemental Files 586 D. CONFIGURING SOFTWARE 587 D.1 D.2 D.3 D.4 Microsoft Excel Configuration 587 JMP Configuration 589 Minitab Configuration 589 Tableau Configuration 589 E. TABLE 590 E.1 E.2 E.3 E.4 E.5 E.6 E.7 E.8 E.9 Table of Random Numbers 590 The Cumulative Standardized Normal Distribution 592 Critical Values of t 594 Critical Values of x2 596 Critical Values of F 597 The Standardized Normal Distribution 601 Critical Values of the Studentized Range, Q 602 Critical Values, dL and dU, of the Durbin-Watson Statistic, D (Critical Values Are One-Sided) 604 Control Chart Factors 605 F. USEFUL KNOWLEDGE 606 F.1 F.2 Keyboard Shortcuts 606 Understanding the Nonstatistical Functions 606 G. SOFTWARE FAQS 608 G.1 G.2 G.3 G.4 G.5 Microsoft Excel FAQs 608 PHStat FAQs 608 JMP FAQs 609 Minitab FAQs 609 Tableau FAQs 609 H. ALL ABOUT PHStat 611 H.1 H.2 H.3 H.4 What is PHStat? 611 Obtaining and Setting Up PHStat 612 Using PHStat 612 PHStat Procedures, by Category 613 Self-Test Solutions and Answers to Selected Even-Numbered Problems 615 Index 641 Credits 651 Preface A s business statistics evolves and becomes an increasingly important part of one’s business education, how business statistics gets taught and what gets taught becomes all the more important. We, the authors, think about these issues as we seek ways to continuously improve the teaching of business statistics. We actively participate in Decision Sciences Institute (DSI), American Statistical Association (ASA), and Data, Analytics, and Statistics Instruction and Business (DASI) conferences. We use the ASA’s Guidelines for Assessment and Instruction (GAISE) reports and combine them with our experiences teaching business statistics to a diverse student body at several universities. When writing for introductory business statistics students, five principles guide us. Help students see the relevance of statistics to their own careers by using examples from the functional areas that may become their areas of specialization. Students need to learn statistics in the context of the functional areas of business. We present each statistics topic in the context of areas such as accounting, finance, management, and marketing and explain the application of specific methods to business activities. Emphasize interpretation and analysis of statistical results over calculation. We emphasize the interpretation of results, the evaluation of the assumptions, and the discussion of what should be done if the assumptions are violated. We believe that these activities are more important to students’ futures and will serve them better than focusing on tedious manual calculations. Give students ample practice in understanding how to apply statistics to business. We believe that both classroom examples and homework exercises should involve actual or realistic data, using small and large sets of data, to the extent possible. Familiarize students with the use of data analysis software. We integrate using Microsoft Excel, JMP, and Minitab into all statistics topics to illustrate how software can assist the business decision making process. In this edition, we also integrate using Tableau into selected topics, where such integration makes best sense. (Using software in this way also supports our second point about emphasizing interpretation over calculation). Provide clear instructions to students that facilitate their use of data analysis software. We believe that providing such instructions assists learning and minimizes the chance that the software will distract from the learning of statistical concepts. What’s New in This Edition? This eighth edition of Business Statistics: A First Course features many passages rewritten in a more concise style that emphasize definitions as the foundation for understanding statistical concepts. In addition to changes that readers of past editions have come to expect, such as new examples and Using Statistics case scenarios and an extensive number of new end-of-section or end-of-chapter problems, the edition debuts: • A First Things First Chapter that builds on the previous edition’s novel Important Things to Learn First Chapter by using real-world examples to illustrate how developments such as the increasing use of business analytics and “big data” have made knowing xxiii xxiv PREFACE • • • • • and understanding statistics that much more critical. This chapter is available as complimentary online download, allowing students to get a head start on learning. Tabular Summaries that state hypothesis test and regression example results along with the conclusions that those results support now appear in Chapters 10 through 13. Updated Excel and Minitab Guides that reflect the most recent editions of these programs. New JMP Guides that provide detailed, hands-on instructions for using JMP to illustrate the concepts that this book teaches. JMP provides a starting point for continuing studies in business statistics and business analytics and features visualizations that are easy to construct and that summarize data in innovative ways. For selected chapters, Tableau Guides that make best use of this software for basic and advanced visualizations and regression analysis. An All-New Business Analytics Chapter (Chapter 14) that makes extensive use of JMP, Minitab, and Tableau to illustrate predictive analytics for prediction, classification, clustering, and association as well as explaining what text analytics does and how descriptive and prescriptive analytics relate to predictive analytics. This chapter benefits from the insights the coauthors have gained from teaching and lecturing on business analytics as well as research the coauthors have done for a forthcoming companion title on business analytics. Continuing Features that Readers Have Come to Expect This edition of Business Statistics: A First Course continues to incorporate a number of distinctive features that has led to its wide adoption over the previous editions. Table 1 summaries these carry-over features: TABLE 1 Distinctive Features Continued in the Eighth Edition Feature Details Using Statistics Business Scenarios A Using Statistics scenario that highlights how statistics is used in a business functional area begins each chapter. Each scenario provides an applied context for learning in its chapter. End-of-chapter “Revisited” sections reinforces the statistical methods that a chapter discusses and apply those methods to the questions raised in the scenario. In this edition, four chapters have new or revised Using Statistics scenarios. Basic Business Statistics was among the first business statistics textbooks to focus on interpretation of the results of a statistical method and not on the mathematics of a method. This tradition continues, now supplemented by JMP results complimenting the Excel and Minitab results of recent prior editions. Software instructions in this book feature chapter examples and were personally written by the authors, who collectively have over one hundred years experience teaching the application of software to business. Software usage also features templates and applications developed by the authors that minimize the frustration of using software while maximizing statistical learning Student Tips, LearnMore bubbles, and Consider This features extend student-paced learning by reinforcing important points or examining side issues or answering questions that arise while studying business statistics such as “What is so ‘normal’ about the normal distribution?” With an extensive library of separate online topics, sections, and even two full chapters, instructors can combine these materials and the opportunities for additional learning to meet their curricular needs. With modularized software instructions, instructors and students can switch among Excel, Excel with PHStat, JMP, Minitab, and Tableau as they use this book, taking advantage of the strengths of each program to enhance learning. Emphasis on Data Analysis and Interpretation of Results Software Integration Opportunities for Additional Learning Highly Tailorable Context Software Flexibility PREFACE xxv TABLE 1 Distinctive Features Continued in the Eighth Edition (continued) Feature Details End-of-Section and End-of-Chapter Reinforcements “Exhibits” summarize key processes throughout the book. “Key Terms” provides an index to the definitions of the important vocabulary of a chapter. “Learning the Basics” questions test the basic concepts of a chapter. “Applying the Concepts” problems test the learner’s ability to apply those problems to business problems. For the more quantitatively-minded, “Key Equations” list the boxed number equations that appear in a chapter. End-of-chapter cases include a case that continues through many chapters as well as “Digital Cases” that require students to examine business documents and other information sources to sift through various claims and discover the data most relevant to a business case problem as well as common misuses of statistical information. (Instructional tips for these cases and solutions to the Digital Cases are included in the Instructor’s Solutions Manual.) An appendix provides additional self-study opportunities by provides answers to the “Self-Test” problems and most of the even-numbered problems in this book. Innovative Cases Answers to Even-Numbered Problems Unique Excel Integration Many textbooks feature Microsoft Excel, but Business Statistics: A First Course comes from the authors who originated both the Excel Guide workbooks that illustrate model solutions, developed Visual Explorations that demonstrate selected basic concepts, and designed and implemented PHStat, the Pearson statistical add-in for Excel that places the focus on statistical learning. (See Appendix H for a complete summary of PHStat.) Chapter-by-Chapter Changes Made for This Edition Because the authors believe in continuous quality improvement, every chapter of Business Statistics: A First Course contains changes to enhance, update, or just freshen this book. Table 2 provides a chapter-by-chapter summary of these changes. TABLE 2 Chapter-by-Chapter Change Matrix Chapter Using Statistics Changed JMP/ Tableau Guide Problems Changed FTF • J, T n.a. 1 • J, T 40% 2 J, T 60% 3 J, T 50% Selected Chapter Changes Think Differently About Statistics Starting Point for Learning Statistics Data Cleaning Other Data Preprocessing Tasks Organizing a Mix of Variables Visualizing A Mix of Variables Filtering and Querying Data Reorganized categorical variables discussion. Expanded data visualization discussion. New samples of 379 retirement funds and 100 restaurant meal costs for examples. New samples of 379 retirement funds and 100 restaurant meal costs for examples. Updated NBA team values data set. xxvi PREFACE TABLE 2 Chapter-by-Chapter Change Matrix (continued) JMP/ Tableau Guide Problems Changed 4 J 43% 5 J 60% J J 33% 47% 8 J 40% 9 J 20% J 43% 11 J 43% 12 J, T 46% 13 J 30% J, T n.a. Chapter 6 7 10 14 Using Statistics Changed • • n.a. Selected Chapter Changes Basic Probability Concepts rewritten. Bayes’ theorem example moved online. Section 5.1 and Binomial Distribution revised. Normal Distribution rewritten. Sampling Distribution of the Proportion rewritten. Confidence Interval Estimate for the Mean revised. Revised “Managing Ashland MultiComm Services” continuing case. Chapter introduction revised. Section 9.1 rewritten. New Section 9.4 example. New paired t test and the difference between two proportions examples. Extensive use of new tabular summaries. Revised “Managing Ashland MultiComm Services” continuing case. Chapter introduction revised. Section 12.2 revised. Section 13.1 revised. Section 13.3 reorganized and revised. New dummy variable example. All-new chapter that introduces business analytics. Software Guide explains using Excel with Power BI Desktop, JMP, Minitab, and Tableau, for various descriptive and predictive analytics methods. Serious About Writing Improvements Ever review a textbook that reads the same as an edition from years ago? Or read a preface that claims writing improvements but offers no evidence? Among the writing improvements in this edition of Business Statistics: A First Course, the authors have turned to tabular summaries to guide readers to reaching conclusions and making decisions based on statistical information. The authors believe that this writing improvement, which appears in Chapters 9 through 13, not only adds clarity to the purpose of the statistical method being discussed but better illustrates the role of statistics in business decision-making processes. Judge for yourself using the sample from Chapter 10 Example 10.1. Previously, part of the solution to Example 10.1 was presented as: You do not reject the null hypothesis because tSTAT = - 1.6341 7 - 1.7341. The p-value (as computed in Figure 10.5) is 0.0598. This p-value indicates that the probability that tSTAT 6 - 1.6341 is equal to 0.0598. In other words, if the population means are equal, the probability that the sample mean delivery time for the local pizza restaurant is at least PREFACE xxvii 2.18 minutes faster than the national chain is 0.0598. Because the p-value is greater than a = 0.05, there is insufficient evidence to reject the null hypothesis. Based on these results, there is insufficient evidence for the local pizza restaurant to make the advertising claim that it has a faster delivery time. In this edition, we present the equivalent solution (on page 360): Table 10.4 summarizes the results of the pooled-variance t test for the pizza delivery data using the calculation above (not shown in this sample) and Figure 10.5 results. Based on the conclusions, local branch of the national chain and a local pizza restaurant have similar delivery times. Therefore, as part of the last step of the DCOVA framework, you and your friends exclude delivery time as a decision criteria when choosing from which store to order pizza. TABLE 10.4 Pooled-variance t test summary for the delivery times for the two pizza restaurants Result Conclusions The tSTAT = - 1.6341 is greater than - 1.7341. The t test p@value = 0.0598 is greater than the level of significance, a = 0.05. 1. Do not reject the null hypothesis H0. 2. Conclude that insufficient evidence exists that the mean delivery time is lower for the local restaurant than for the branch of the national chain. 3. There is a probability of 0.0598 that tSTAT 6 - 1.6341. A Note of Thanks Creating a new edition of a textbook is a team effort, and we thank our Pearson Education editorial, marketing, and production teammates: Suzanna Bainbridge, Kathy Manley, Kaylee Carlson, Thomas Hayward, Deirdre Lynch, Aimee Thorne, and Morgan Danna. And we would be remiss not to note the continuing work of Joe Vetere to prepare our screen shot illustrations and the efforts of Julie Kidd of Pearson CSC to ensure that this edition meets the highest standard of book production quality that is possible. We also thank Gail Illich of McLennan Community College for preparing instructor resources for this edition and thank the following people whose comments helped us improve this edition: Mohammad Ahmadi, University of Tennessee-Chattanooga; Sung Ahn, Washington State University; Kelly Alvey, Old Dominion University; Al Batten, University of Colorado-Colorado Springs; Alan Chesen, Wright State University; Gail Hafer, St. Louis Community College-Meramec; Chun Jin, Central Connecticut State University; Benjamin Lev, Drexel University; Lilian Prince, Kent State University; Bharatendra Rai, University of Massachusetts Dartmouth; Ahmad Vakil, St. John’s University (NYC); and Shiro Withanachchi, Queens College (CUNY). We thank the RAND Corporation and the American Society for Testing and Materials for their kind permission to publish various tables in Appendix E, and to the American Statistical Association for its permission to publish diagrams from the American Statistician. Finally, we would like to thank our families for their patience, understanding, love, and assistance in making this book a reality. xxviii PREFACE Contact Us! Please email us at authors@davidlevinestatistics.com or tweet us @BusStatBooks with your questions about the contents of this book. Please include the hashtag #BSAFC8 in your tweet or in the subject line of your email. We also welcome suggestions you may have for a future edition of this book. And while we have strived to make this book as error-free as possible, we also appreciate those who share with us any perceived problems or errors that they encounter. If you need assistance using software, please contact your academic support person or Pearson Support at support.pearson.com/getsupport/. They have the resources to resolve and walk you through a solution to many technical issues in a way we do not. As you use this book, be sure to make use of the “Resources for Success” that Pearson Education supplies for this book (described on the following pages). We also invite you to visit bsafc8.davidlevinestatistics.com (bit.ly/2Apx1xH), where we may post additional information or new content as necessary. David M. Levine Kathryn A. Szabat David F. Stephan Resources for Success MyLab Statistics Online Course for Business Statistics: A First Course, 8th Edition by Levine, Szabat, Stephan (access code required) MyLab Statistics is the teaching and learning platform that empowers instructors to reach every student. By combining trusted author content with digital tools and a flexible platform, MyLab Statistics personalizes the learning experience and improves results for each student. MyLab makes learning and using a variety of statistical programs as seamless and intuitive as possible. Download the data files that this book uses (see Appendix C) in Excel, JMP, and Minitab formats. Download supplemental files that support in-book cases or extend learning. Instructional Videos Access instructional support videos including Pearson’s Business Insight and StatTalk videos, available with assessment questions. Reference technology study cards and instructional videos for Excel, JMP, Minitab, StatCrunch, and R software. Diverse Question Libraries Build homework assignments, quizzes, and tests to support your course learning outcomes. From Getting Ready (GR) questions to the Conceptual Question Library (CQL), we have your assessment needs covered from the mechanics to the critical understanding of Statistics. The exercise libraries include technology-led instruction, including new Excel-based exercises, and learning aids to reinforce your students’ success. pearson.com/mylab/statistics Resources for Success Instructor Resources Instructor’s Solutions Manual, presents solutions for end-of-section and end-of-chapter problems and answers to case questions, and provides teaching tips for each chapter. The Instructor's Solutions Manual is available for download at www.Pearson.com or in MyLab Statistics. Lecture PowerPoint Presentations, by Patrick Schur, Miami University (Ohio), are available for each chapter. These presentations provide instructors with individual lecture notes to accompany the text. The slides include many of the figures and tables from the textbook. Instructors can use these lecture notes as is or customize them in Microsoft PowerPoint. The PowerPoint presentations are available for download at www.Pearson.com or in MyLab Statistics. Test Bank , contains true/false, multiple-choice, fill-in, and problem-solving questions based on the definitions, concepts, and ideas developed in each chapter of the text. The Test Bank is available for download at www.Pearson.com or in MyLab Statistics. TestGen® (www.pearsoned.com/testgen) enables instructors to build, edit, print, and administer tests using a computerized bank of questions developed to cover all the objectives of the text. TestGen is algorithmically based, allowing instructors to create multiple but equivalent versions of the same question or test with the click of a button. Instructors can also modify test bank questions or add new questions. The software and test bank are available for download from Pearson Education’s online catalog. Student Resources Student’s Solutions Manual, provides detailed solutions to virtually all the even-numbered exercises and worked-out solutions to the self-test problems. (ISBN-10: 0-13-518243-3; ISBN-13: 978-0-13-518243-7) Online resources complement and extend the study of business statistics and support the content of this book. These resources include data files for in-chapter examples and problems, templates and model solutions, and optional topics and chapters. (See Appendix C for a complete description of the online resources.) PHStat helps create Excel worksheet solutions to statistical problems. PHStat uses Excel building blocks to create worksheet solutions. These worksheet solutions illustrate Excel techniques and students can examine them to gain new Excel skills. Additionally, many solutions are what-if templates in which the effects of changing data on the results can be explored. Such templates are fully reusable on any computer on which Excel has been installed. PHStat requires an access code and separate download for use. PHStat access codes can be bundled with this textbook using ISBN-10: 0-13-399058-3; ISBN-13: 978-0-13-399058-4. Minitab® More than 4,000 colleges and universities worldwide use Minitab software to help students learn quickly and to provide them with a skill-set that’s in demand in today’s data-driven workforce. Minitab® includes a comprehensive collection of statistical tools to teach beginning through advanced courses. Bundling Minitab software ensures students have the software they need for the duration of their course work. (ISBN-10: 0-13-445640-8; ISBN-13: 978-0-13-445640-9) JMP® Student Edition software is statistical discovery software from SAS Institute Inc., the leader in business analytics software and services. JMP® Student Edition is a streamlined version of JMP that provides all the statistics and graphics covered in introductory and intermediate statistics courses. Available for bundling with this textbook. (ISBN-10: 0-13-467979-2; ISBN-13: 978-0-13-467979-2) First Things First CONTENTS “The Price of Admission” FTF.1 Think Differently About Statistics FTF.2 Business Analytics: The Changing Face of Statistics FTF.3 Starting Point for Learning Statistics FTF.4 Starting Point for Using Software EXCEL GUIDE JMP GUIDE ▼ USING STATISTICS MINITAB GUIDE TABLEAU GUIDE “The Price of Admission” I t’s the year 1900 and you are a promoter of theatrical productions, in the business of selling seats for individual performances. Using your knowledge and experience, you establish a selling price for the performances, a price you hope represents a good trade-off between maximizing revenues and avoiding driving away demand for your seats. You print up tickets and flyers, place advertisements in local media, and see what happens. After the event, you review your results and consider if you made a wise trade-off. Tickets sold very quickly? Next time perhaps you can charge more. The event failed to sell out? Perhaps next time you could charge less or take out more advertisements to drive demand. If you lived over 100 years ago, that’s about all you could do. Jump ahead about 70 years. You’re still a promoter but now using a computer system that allows your customers to buy tickets over the phone. You can get summary reports of advance sales for future events and adjust your advertising on radio and on TV and, perhaps, add or subtract performance dates using the information in those reports. Jump ahead to today. You’re still a promoter but you now have a fully computerized sales system that allows you to constantly adjust the price of tickets. You also can manage many more categories of tickets than just the near-stage and far-stage categories you might have used many years ago. You no longer have to wait until after an event to make decisions about changing your sales program. Through your sales system you have gained insights about your customers such as where they live, what other tickets they buy, and their appropriate demographic traits. Because you know more about your customers, you can make your advertising and publicity more efficient by aiming your messages at the types of people more likely to buy your tickets. By using social media networks and other online media, you can also learn almost immediately who is noticing and responding to your advertising messages. You might even run experiments online presenting your advertising in two different ways and seeing which way sells better. Your current self has capabilities that allow you to be a more effective promoter than any older version of yourself. Just how much better? Turn the page. OBJECTIVES ■ ■ ■ ■ ■ Statistics is a way of thinking that can lead to better decision making Statistics requires analytical skills and is an important part of your business education Recent developments such as the use of business analytics and “big data” have made knowing statistics even more critical The DCOVA framework guides your application of statistics The opportunity business analytics represents for business students 1 2 First Things First Now Appearing on Broadway … and Everywhere Else student TIP From other business courses, you may recognize that Disney’s system uses dynamic pricing. In early 2014, Disney Theatrical Productions woke up the rest of Broadway when reports revealed that its 17-year-old production of The Lion King had been the top-grossing Broadway show in 2013. How could such a long-running show, whose most expensive ticket was less than half the most expensive ticket on Broadway, earn so much while being so old? Over time, grosses for a show decline, and weekly grosses for The Lion King had dropped about 25% by the year 2009. But, in 2013, grosses were up 67% from 2009, and weekly grosses typically exceeded the grosses of opening weeks in 1997, adjusted for inflation! Heavier advertising and some changes in ticket pricing helped, but the major reason for this change was something else: combining business acumen with the systematic application of business statistics and analytics to the problem of selling tickets. As a producer of the newest musical at the time said, “We make educated predictions on price. Disney, on the other hand, has turned this into a science” (see reference 3). Disney had followed the plan of action that this book presents. It had collected its daily and weekly results and summarized them, using techniques this book introduces in the next three chapters. Disney then analyzed those results by performing experiments and tests on the data collected (using techniques that later chapters introduce). In turn, those analyses were applied to a new interactive seating map that allowed customers to buy tickets for specific seats and permitted Disney to adjust the pricing of each seat for each performance. The whole system was constantly reviewed and refined, using the semiautomated methods to which Chapter 14 will introduce you. The end result was a system that outperformed the ticket-selling methods others used. FTF.1 Think Differently About Statistics The “Using Statistics” scenario suggests, and the Disney example illustrates, that modern-day information technology has allowed businesses to apply statistics in ways that could not be done years ago. This scenario and example reflect how this book teaches you about statistics. In these first two pages, you may notice • the lack of calculation details and “math.” • the emphasis on enhancing business methods and management decision making. • that none of this seems like the content of a middle school or high school statistics class you may have taken. You may have had some prior knowledge or instruction in mathematical statistics. This book discusses business statistics. While the boundary between the two can be blurry, business statistics emphasizes business problem solving and shows a preference for using software to perform calculations. One similarity that you might notice between these first two pages and any prior instruction is data. Data are the facts about the world that one seeks to study and explore. Some data are unsummarized, such as the facts about a single ticket-selling transaction, whereas other facts, such as weekly ticket grosses, are summarized, derived from a set of unsummarized data. While you may think of data as being numbers, such as the cost of a ticket or the percentage that weekly grosses have increased in a year, do not overlook that data can be non-numerical as well, such as ticket-buyer’s name, seat location, or method of payment. Statistics: A Way of Thinking Statistics are the methods that allow you to work with data effectively. Business statistics focuses on interpreting the results of applying those methods. You interpret those results to help you enhance business processes and make better decisions. Specifically, business statistics provides you with a formal basis to summarize and visualize business data, reach conclusions about that data, make reliable predictions about business activities, and improve business processes. FTF.1 Think Differently About Statistics 3 You must apply this way of thinking correctly. Any “bad” things you may have heard about statistics, including the famous quote “there are lies, damned lies, and statistics” made famous by Mark Twain, speak to the errors that people make when either misusing statistical methods or mistaking statistics as a substitution for, and not an enhancement of, a decision-making process. (Disney Theatrical Productions’ success was based on combining statistics with business acumen, not replacing that acumen.) DCOVA Framework To minimize errors, you use a framework that organizes the set of tasks that you follow to apply statistics properly. The five tasks that comprise the DCOVA framework are • • • • • Define the data that you want to study to solve a problem or meet an objective. Collect the data from appropriate sources. Organize the data collected, by developing tables. Visualize the data collected, by developing charts. Analyze the data collected, reach conclusions, and present the results. You must always do the Define and Collect tasks before doing the other three. The order of the other three varies, and sometimes all three are done concurrently. In this book, you will learn more about the Define and Collect tasks in Chapter 1 and then be introduced to the Organize and Visualize tasks in Chapter 2. Beginning with Chapter 3, you will learn methods that help complete the Analyze task. Throughout this book, you will see specific examples that apply the DCOVA framework to specific business problems and examples. Analytical Skills More Important Than Arithmetic Skills The business preference for using software to automate statistical calculations maximizes the importance of having analytical skills while it minimizes the need for arithmetic skills. With software, you perform calculations faster and more accurately than if you did those calculations by hand, minimizing the need for advanced arithmetic skills. However, with software you can also generate inappropriate or meaningless results if you have not fully understood a business problem or goal under study or if you use that software without a proper understanding of statistics. Therefore, using software to create results that help solve business problems or meet business goals is always intertwined with using a framework. And using software does not mean memorizing long lists of software commands or how-to operations, but knowing how to review, modify, and possibly create software solutions. If you can analyze what you need to do and have a general sense of what you need, you can always find instructions or illustrative sample solutions to guide you. (This book provides detailed instructions as well as sample solutions for every statistical activity discussed in end-of-chapter software guides and through the use of various downloadable files and sample solutions.) If you were introduced to using software in an application development setting or an introductory information systems class, do not mistake building applications from scratch as being a necessary skill. A “smart” smartphone user knows how to use apps such as Facebook, Instagram, YouTube, Google Maps, and Gmail effectively to communicate or discover and use information and has no idea how to construct a social media network, create a mapping system, or write an email program. Your approach to using the software in this book should be the same as that smart user. Use your analytical skills to focus on being an effective user and to understand conceptually what a statistical method or the software that implements that method does. Statistics: An Important Part of Your Business Education Until you read these pages, you may have seen a course in business statistics solely as a required course with little relevance to your overall business education. In just two pages, you have learned that statistics is a way of thinking that can help enhance your effectiveness in business—that is, applying statistics correctly is a fundamental, global skill in your business education. 4 First Things First In the current data-driven environment of business, you need the general analytical skills that allow you to work with data and interpret analytical results regardless of the discipline in which you work. No longer is statistics only for accounting, economics, finance, or other disciplines that directly work with numerical data. As the Disney example illustrates, the decisions you make will be increasingly based on data and not on your gut or intuition supported by past experience. Having a well-balanced mix of statistics, modeling, and basic technical skills as well as managerial skills, such as business acumen and problem-solving and communication skills, will best prepare you for the workplace today … and tomorrow (see reference 1). FTF.2 Business Analytics: The Changing Face of Statistics Of the recent changes that have made statistics an important part of your business education, the emergence of the set of methods collectively known as business analytics may be the most significant change of all. Business analytics combine traditional statistical methods with methods from management science and information systems to form an interdisciplinary tool that supports fact-based decision making. Business analytics include • statistical methods to analyze and explore data that can uncover previously unknown or unforeseen relationships. • information systems methods to collect and process data sets of all sizes, including very large data sets that would otherwise be hard to use efficiently. • management science methods to develop optimization models that support all levels of management, from strategic planning to daily operations. In the Disney Theatrical Productions example, statistical methods helped determine pricing factors, information systems methods made the interactive seating map and pricing analysis possible, and management science methods helped adjust pricing rules to match Disney’s goal of sustaining ticket sales into the future. Other businesses use analytics to send custom mailings to their customers, and businesses such as the travel review site tripadvisor.com use analytics to help optimally price advertising as well as generate information that makes a persuasive case for using that advertising. Generally, studies have shown that businesses that actively use business analytics and combine that use with data-guided management see increases in productivity, innovation, and competition (see reference 1). Chapter 14 introduces you to the statistical methods typically used in business analytics and shows how these methods are related to statistical methods that the book discusses in earlier chapters. “Big Data” Big data is a collection of data that cannot be easily browsed or analyzed using traditional methods. Big data implies data that are being collected in huge volumes, at very fast rates or velocities (typically in near real time), and in a variety of forms that can differ from the structured forms such as records stored in files or rows of data stored in worksheets that businesses use every day. These attributes of volume, velocity, and variety (see reference 5) distinguish big data from a “big” (large) set of data that contains numerous records or rows of similar data. When combined with business analytics and the basic statistical methods discussed in this book, big data presents opportunities to gain new management insights and extract value from the data resources of a business (see reference 8). Unstructured Data Big data may also include unstructured data, data that has an irregular pattern and contain values that are not comprehensible without additional automated or manual interpretation. Unstructured data takes many forms such as unstructured text, pictures, videos, and audio tracks, with unstructured text, such as social media comments, Get Complete eBook Download Link below for instant download https://browsegrades.net/documents/2 86751/ebook-payment-link-for-instantdownload-after-payment FTF.3 Starting Point for Learning Statistics 5 getting the most immediate attention today for its possible use in customer, branding, or marketing analyses. Unstructured data can be adapted for use with a number of methods, such as regression, which this book illustrates with conventional, structured files and worksheets. Unstructured data may require performing data collection and preparation tasks beyond those tasks that Chapter 1 discusses. While describing all such tasks is beyond the scope of this book, Section 14.1 includes an example of the additional interpretation that is necessary when working with unstructured text. FTF.3 Starting Point for Learning Statistics Statistics has its own vocabulary and learning the precise meanings, or operational definitions, of basic terms provides the basis for understanding the statistical methods that this book discusses. For example, in statistics, a variable defines a characteristic, or property, of an item or individual that can vary among the occurrences of those items or individuals. For example, for the item “book,” variables would include the title and number of chapters, as these facts can vary from book to book. For a given book, these variables have a specific value. For this book, the value of the title variable would be “Business Statistics: A First Course,” and “15” would be the value for the number of chapters variable. Note that a statistical variable is not an algebraic variable, which serves as a stand-in to represent one value in an algebraic statement and could never take a non-numerical value such as “Business Statistics: A First Course.” Using the definition of variable, data, in its statistical sense, can be defined as the set of values associated with one or more variables. In statistics, each value for a specific variable is a single fact, not a list of facts. For example, what would be the value of the variable author for this book? Without this rule, you might say that the single list “Levine, Szabat, Stephan” is the value. However, applying this rule, one would say that the variable has three separate values: “Levine”, “Stephan”, and “Szabat”. This distinction of using only single-value data has the practical benefit of simplifying the task of entering data for software analysis. Using the definitions of data and variable, the definition of statistics can be restated as the methods that analyze the data of the variables of interest. The methods that primarily help summarize and present data comprise descriptive statistics. Methods that use data collected from a small group to reach conclusions about a larger group comprise inferential statistics. Chapters 2 and 3 introduce descriptive methods, many of which are applied to support the inferential methods that the rest of the book presents. Statistic The previous section uses statistics in the sense of a collective noun, a noun that is the name for a collection of things (methods in this case). The word statistics also serves as the plural form of the noun statistic, as in “one uses methods of descriptive statistics (collective noun) to generate descriptive statistics (plural of the singular noun).” In this sense, a statistic refers to a value that summarizes the data of a particular variable. (More about this in coming chapters.) In the Disney Theatrical Productions example, the statement “for 2013, weekly grosses were up 67% from 2009” cites a statistic that summarizes the variable weekly grosses using the 2013 data—all 52 values. When someone warns you of a possible unfortunate outcome by saying, “Don’t be a statistic!” you can always reply, “I can’t be.” You always represent one value and a statistic always summarizes multiple values. For the statistic “87% of our employees suffer a workplace accident,” you, as an employee, will either have suffered or have not suffered a workplace accident. The “have” or “have not” value contributes to the statistic but cannot be the statistic. A statistic can facilitate preliminary decision making. For example, would you immediately accept a position at a company if you learned that 87% of their employees suffered a workplace accident? (Sounds like this might be a dangerous place to work and that further investigation is necessary.) 6 First Things First Can Statistics (pl., statistic) Lie? The famous quote “lies, damned lies, and statistics” actually refers to the plural form of statistic and does not refer to statistics, the field of study. Can any statistic “lie”? No, faulty or invalid statistics can only be produced through willful misuse of statistics or when DCOVA framework tasks are done incorrectly. For example, many statistical methods are valid only if the data being analyzed have certain properties. To the extent possible, you test the assertion that the data have those properties, which in statistics are called assumptions. When an assumption is violated, shown to be invalid for the data being analyzed, the methods that require that assumption should not be used. For the inferential methods that this book discusses in later chapters, you must always look for logical causality. Logical causality means that you can plausibly claim something directly causes something else. For example, you wear black shoes today and note that the weather is sunny. The next day, you again wear black shoes and notice that the weather continues to be sunny. The third day, you change to brown shoes and note that the weather is rainy. The fourth day, you wear black shoes again and the weather is again sunny. These four days seem to suggest a strong pattern between your shoe color choice and the type of weather you experience. You begin to think if you wear brown shoes on the fifth day, the weather will be rainy. Then you realize that your shoes cannot plausibly influence weather patterns, that your shoe color choice cannot logically cause the weather. What you are seeing is mere coincidence. (On the fifth day, you do wear brown shoes and it happens to rain, but that is just another coincidence.) You can easily spot the lack of logical causality when trying to correlate shoe color choice with the weather, but in other situations the lack of logical causality may not be so easily seen. Therefore, relying on such correlations by themselves is a fundamental misuse of statistics. When you look for patterns in the data being analyzed, you must always be thinking of logical causes. Otherwise, you are misrepresenting your results. Such misrepresentations sometimes cause people to wrongly conclude that all statistics are “lies.” Statistics (pl., statistic) are not lies or “damned lies.” They play a significant role in statistics, the way of thinking that can enhance your decision making and increase your effectiveness in business. FTF.4 Starting Point for Using Software This book uses Microsoft Excel, JMP, Minitab, and Tableau to help explain and illustrate statistical concepts and methods as well as demonstrate how such applications can help facilitate business decision making. To begin using the software that this book uses requires only the knowledge of basic user interface skills, operations, and vocabulary that Table FTF.1 summarizes and which the supplemental, online Basic Computing Skills document reviews. (Learn more about online supplemental files in Appendix C.) TABLE FTF.1 Basic Computing Knowledge Skill or Operation Specifics Identify and use standard window objects Title bar, minimize/resize/close buttons, scroll bars, mouse pointer, menu bars or ribbons, dialog box, window subdivisions such as areas, panes, or child windows Command button, list box, drop-down list, edit box, option button, check box, tabs (tabbed panels) Click, called select in some list or menu contexts and check or clear in some check box contexts; double-click; right-click; drag and drag-and-drop Identify and use common dialog box items Mouse operations Excel, JMP, and Minitab all use worksheets to display the contents of a data set and as the means to enter or edit data. (JMP calls its worksheets data tables.) Worksheets are containers that present tabular arrangements of data, in which the intersections of rows and columns form cells, boxes into which individual entries are made. One places the data for a variable into the FTF.4 Starting Point for Using Software 7 cells of a column such that each column contains the data for a different variable, if more than one variable is under study. By convention, one uses the cell in the initial row to enter names of the variables (variable columns). Shown below, from back to front, are a Minitab worksheet, a JMP data table, and an Excel worksheet. The JMP and Minitab containers contain a special unnumbered row into which column variable names can be entered. In Excel, variable names are entered in row 1 of the worksheet, which can sometimes lead to inadvertent errors. student TIP Selected Excel, JMP, and Minitab solutions that this book presents exist as templates that simplify the production of results and serve as models for learning more about using formulas. Appendix Section G.5 explains the limitations on using Tableau workbooks that Tableau Public users face. Generally, entries in each cell are single data values that can be text or numbers. All three programs also permit formulas, instructions to process data, to compute cell values. Formulas can include functions that simplify certain arithmetic tasks or provide access to advanced processing or statistical features. Formulas play an important role in designing templates, reusable solutions that have been previously audited and verified. However, JMP and Minitab allow only column formulas that define calculations for all the cells in a column, whereas Excel allows only cell formulas that define calculations for individual cells. These three programs save worksheet data and results as one file, called a workbook in Excel and a project in JMP and Minitab. JMP and Minitab also allow the saving of individual worksheets or results as separate files, whereas Excel always saves a workbook even if the workbook contains (only) one worksheet. Both JMP and Minitab can open the data worksheets of an Excel workbook, making the Excel workbook a universal format for sharing files that contain only data, such as the set of data files for use with this book that Appendix C documents. Tableau Differences Like Excel, Tableau uses workbooks to store one or more worksheets, but Tableau defines the concept of worksheet differently. A Tableau worksheet stores tabular and visual summaries that are associated with a separately defined data source that can be a complex collection of data or be equivalent to an Excel data worksheet (as are the data sources that this book uses for examples). Data sources can be viewed and column formulas can be used to define new columns, but individual values cannot be edited. Data sources can be unique to a specific Tableau worksheet or shared by several Tableau worksheets. Tableau workbooks can also store dashboards, a concept that Chapter 14 discusses. Table FTF.2 summarizes some of the various file formats that the four programs use. (Appendix D discusses macro and add-in files.) TABLE FTF.2 Common File Formats for Excel, JMP, Minitab, and Tableau File Type Excel JMP Minitab Tableau All-in-one-file Single worksheet Results only Macro or add-in .xlsx (workbook) .xlsx n.a. .xlsm, .xlam .jmpprj (project) .jmp .jrp (report) .jsl, .jmpaddin .mpj (project) .mtw .mgf (graph) .mtb, .mac .twbx (workbook) 8 First Things First student TIP Using Software Properly Check the student download web page for this book for more information about PHStat and JMP and Minitab macros and add-ins that may be available for download. Learning to use software properly can be hard as software has limited ways to provide feedback for user actions that are invalid operations. In addition, no software will ever know if you are following proper procedures for using that software. The principles that Exhibit FTF.1 lists will assist you and should govern your use of software with this book. These principles will minimize your chance of making errors and lessen the frustration that often occurs when these principles are unknown or overlooked by a user. EXHIBIT FTF.1 Principles of Using Software Properly Ensure that software is properly updated. Users who manage their own computers often overlook the importance of ensuring that all installed software is up to date. Understand the basic operational tasks. Take the time to master the tasks of starting the application, loading and entering data, and how to select or choose commands in a general way. Understand the statistical concepts that an application uses. Not understanding those concepts can lead to making wrong choices in the application and can make interpreting results difficult. Know how to review software use for errors. Review and verify that the proper data preparation procedures (see Chapter 1) have been applied to the data before analysis. Verify that the correct procedures, commands, and software options have been selected. For information entered for labeling purposes, verify that no typographical errors exist. Seek reuse of preexisting solutions to solve new problems. Build solutions from scratch only as necessary, particularly if using Excel in which errors can be most easily made. Some solutions, and almost all Excel solutions that this book presents, exist as models or templates that can and should be reused because such reuse models best practice. Understand how to organize and present information from the results that the software produces. Think about the best ways to arrange and label the data. Consider ways to enhance or reorganize results that will facilitate communication with others. Use self-identifying names, especially for the files that you create and save. Naming files Document 1, Document 2, and so on will impede the later retrieval and use of those files. In addition, also look for ways in which you can simplify the user interface of the software you use. If using Excel with this book, consider using PHStat, supplied separately or as part of a bundle by Pearson. PHStat simplifies the user interface by providing a consistent dialog box driven interface that minimizes keystrokes and mouse selections. If using JMP and Minitab, look for macros and add-ins that simplify command sequences or automate repetitive activities. Software-related Conventions Table FTF.3 on page 9 summarizes the softwarerelated conventions that this book uses. These conventions are used extensively in the end-ofchapter software guides and certain appendices to provide a concise and clear way of expressing specific user activities. Key Terms 9 TABLE FTF.3 Conventions That This Book Uses Convention ▼ Example Special key names appear capitalized and in boldface Press Enter. Press Command or Ctrl. Key combinations appear in boldface, with key names linked using this symbol: + Enter the formula and press Ctrl+Enter. Press Ctrl+C. Menu or Ribbon selections appear in boldface and sequences of consecutive selections are shown using this symbol: ➔ Select File ➔ New. Select PHStat ➔ Descriptive Statistics ➔ Boxplot. Target of mouse operations appear in boldface Click OK. Select Attendance and then click the Y button. Entries and the location of where entries are made appear in boldface Enter 450 in cell B5. Add Temperature to the Model Effects list. Variable or column names sometimes appear capitalized for emphasis This file contains the Fund Type, Assets, and Expense Ratio for the growth funds. Placeholders that express a general case appear in italics and may also appear in boldface as part of a function definition AVERAGE (cell range of variable) Replace cell range of variable with the cell range that contains the Asset variable. Names of data files mentioned in sections or problems appear in a special font but appear in boldface in end-ofchapter Guide instructions Open the Retirement Funds workbook. When current versions of Excel and Minitab differ in their user interface, alternate instructions for older versions appear in a second color immediately following the primary instructions In the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box. Click Edit under the Horizontal (Categories) Axis Labels heading. REFERENCES 1. Advani, D. “Preparing Students for the Jobs of the Future.” University Business (2011), bit.ly/1gNLTJm. 2. Davenport, T., J. Harris, and R. Morison. Analytics at Work. Boston: Harvard Business School Press, 2010. 3. Healy, P. “Ticker Pricing Puts ‘Lion King’ atop Broadway’s Circle of Life.” New York Times, New York edition, March 17, 2014, p. A1, and nyti.ms.1zDkzki. 4. JP Morgan Chase. “Report of JPMorgan Chase & Co. Management Task Force Regarding 2012 CIO Losses,” bit.ly/1BnQZzY, as quoted in J. Ewok, “The Importance of Excel,” The Baseline Scenario, bit.ly/1LPeQUy. ▼ Retirement Funds 5. Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. Stamford, CT: META Group. February 6, 2001. 6. Levine, D., and D. Stephan. “Teaching Introductory Business Statistics Using the DCOVA Framework.” Decision Sciences Journal of Innovative Education 9 (Sept. 2011): 393–398. 7. Marr, B. “20 Claims About Big Data and Why They All Are Wrong.” Data-informed.com, September 28, 2015. 8. “What Is Big Data?” IBM Corporation, www.ibm.com/ big-data/us/en/. KEY TERMS big data 4 cells 6 data 2 data table 6 business analytics 4 DCOVA framework 3 descriptive statistics 5 formula 7 function 7 inferential statistics 5 logical causality 6 operational definition 5 project (JMP, Minitab) 7 statistic 5 statistics 2 summarized data 2 template 7 unstructured data 4 variable 5 workbook 7 worksheet 6 ▼ First Things First EXCEL GUIDE EG.1 GETTING STARTED with EXCEL student TIP Excel sometimes displays a task pane in the worksheet area that presents formatting and similar choices. Opening Excel displays a window that contains the Office Ribbon tabs above a worksheet area that displays the current worksheet of the current workbook, the name of which appears centered in the title bar. The top of the worksheet area contains a formula bar that allows you to see and edit the contents of the currently selected cell (cell A1 in the illustration). Immediately below the worksheet grid is a sheet tab that identifies the name of current worksheet (DATA). In workbooks with more than one sheet, clicking the sheet tabs navigates through the workbook. In the illustration, the current cell is cell A1, and its content is displayed in the formula bar. The Basic Computing Skills online document (see Appendix C) discusses the other standard features Microsoft Office seen in the illustration. The illustration shows the Retirement Funds workbook, one of many data workbooks that support this textbook. Use the Excel data workbooks with either the Excel Guide workbooks or PHStat to create solutions to problems or to recreate results used in examples. See Appendix C for description of these learning resources. (Using PHStat requires a separate download and an access code, which may have been bundled with the purchase of this book, as Appendix H explains.) EG.2 ENTERING DATA In Excel, enter data into worksheet columns, starting with the leftmost, first column, using the cells in row 1 to enter variable names. Avoid skipping rows or columns as such skipping can disrupt or alter the way certain Excel procedures work. Complete a cell entry by pressing Tab or Enter, or, if using the formula bar to make a cell entry, by clicking the check mark icon in the formula bar. To enter or edit data in a specific cell, either use the cursor keys to move the cell pointer to the cell or select the cell directly. Try to avoid using numbers as row 1 variable headings; if you cannot avoid their use, precede such headings with apostrophes. Pay attention to special instructions in this book that note specific orderings of variable columns that are necessary for some Excel operations. When in doubt, use the DATA worksheets of the Excel Guide Workbooks as the guide for entering and arranging variable data. 10 EG.3 OPEN or SAVE a WORKBOOK Use File ➔ Open or File ➔ Save As. Open and Save As use similar means to open or save the workbook by name while specifying the physical device or network location and folder for that workbook. Save As dialog boxes enable one to save a file in alternate formats for programs that cannot open Excel workbooks (.xlsx files) directly. Alternate formats include a simple text file with values delimited with tab characters, Text (Tab delimited) (*.txt) that saves the contents of the current worksheet as a simple text file, and CSV (Comma delimited) (*.csv) that saves worksheet cell values as text values that are delimited with commas. Excels for Mac list these choices as Tab Delimited Text (.txt) and Windows Comma Separated (.csv). The illustration on the next page shows the part of the Save As dialog box that contains the Save as type drop-down list. EXCEL Guide (Open dialog boxes have a similar drop-down list.) In all Windows Excel versions, you can also select a file format in the Open dialog box. Selecting All Files (*.*) from the drop-down list can list files that had been previously saved in unexpected formats. To open a new workbook, select File ➔ New (New Workbook in Excel for Mac). Excel displays a new workbook with one or more blank worksheets. EG.4 WORKING with a WORKBOOK Use Insert (or Insert Sheet), Delete, or Move or Copy. Alter the contents of a workbook by adding a worksheet or by deleting, copying, or rearranging the worksheets and chart sheets that the workbook contains. To perform one of these operations, right-click a sheet tab and select the appropriate choice from the shortcut menu that appears. To add a worksheet, select Insert. In Microsoft Windows Excel, you also click Worksheet and then click OK in the Insert dialog box. To delete a worksheet or chart sheet, rightclick the sheet tab of the worksheet to be deleted and select Delete. To copy or rearrange the position of a worksheet or chart sheet, right-click the sheet tab of the sheet and select Move or Copy. In the Move or Copy dialog box, first select the workbook and the position in the workbook for the sheet. If copying a sheet, also check Create a copy. Then click OK. EG.5 PRINT a WORKSHEET Use File ➔ Print. Excel prints worksheets and chart sheets, not workbooks. When you select Print, Excel displays a preview of the currently opened sheet in a dialog box or pane that allows you to select that sheet or other sheets from the workbook. You can adjust the print formatting of the worksheet(s) to be printed by clicking Page Setup. Typically, in the Page Setup dialog box, you might click the Sheet tab and then check or clear the Gridlines and Row and column headings checkboxes to add or remove worksheet cell gridlines and the numbered row and lettered column headings that are similar to how a worksheet is displayed onscreen. EG.6 REVIEWING WORKSHEETS Follow the best practice of reviewing worksheets before you use them to help solve problems. When you use a worksheet, what you see displayed in cells may be the result of either the recalculation of formulas or cell formatting. A cell that displays 4 might contain the value 4, might contain a formula calculation that results in the value 4, or might contain a value such as 3.987 that has been formatted to display as the nearest whole number. 11 To display and review all formulas, press Ctrl+` (grave accent). Excel displays the formula view of the worksheet, revealing all formulas. (Pressing Ctrl+` a second time restores the worksheet to its normal display.) If you use the Excel Guide workbooks, you will discover that each workbook contains one or more FORMULAS worksheets that provide a second way of viewing all formulas. In the Excel solutions for this book, you will notice cell formatting operations that have changed the background color of cells, changed text attributes such as boldface of cell entries, and rounded values to a certain number of decimal places (typically four). However, if you want to learn more about cell formatting, Appendix B includes a summary of common formatting operations, including those used in the Excel solutions for this book. EG.7 IF YOU USE the WORKBOOK INSTRUCTIONS Excel Guide Workbook instructions enable you to directly modify the template and model worksheet solutions for problems other than the one they help solve. (In contrast, PHStat provides a dialog box interface in which you make entries that PHStat uses to automate such modifications.) Workbook instructions express Excel operations in the most universal way possible. For example, many instructions ask you to select (click on) an item from a gallery of items and identify that item selection by name. In some Excel versions, these names may be visible captions for the item; in other versions, you will need to move the mouse over the image to pop up the image name. Guides also use the word display as in the “Format Axis display” to refer to a user interaction that may be presented by Excel in a task pane or a two-panel dialog box (“Format Axis task pane” or “Format Axis dialog box”). Task panes open to the side of the worksheet and can remain onscreen indefinitely. Initially, some parts of a pane may be hidden and you may need to click on an icon or label to reveal that hidden part to complete a Workbook instruction. Two-panel dialog boxes open over the worksheet and must be closed before you can do other Excel activities. The left panel of such dialog boxes are always visible and clicking entries in the left panel makes visible one of the right panels. (Click the system close button at the top right of a task pane or dialog box to close the display and remove it from the screen.) Current Excel versions can vary in their menu sequences. Excel Guide instructions show these variations as parenthetical phrases. For example, the menu sequence, “select Design (or Chart Design) ➔ Add Chart Element” tells you to first select Design or Chart Design to begin the sequence and then to continue by selecting Add Chart Element. (Microsoft Windows Excel uses Design and Excel for Mac uses Chart Design.) For the current Excel versions that this book supports (see the FAQs in Appendix G), the Workbook Instructions are generally identical. Occasionally, individual instructions may differ significantly for one (or more) versions. In such cases, the instructions that apply for multiple versions appear first, in normal text, and the instructions for the unique version immediately follows in this text color. ▼ First Things First JMP GUIDE JG.1 GETTING STARTED with JMP Opening JMP displays the JMP Home Window (shown below) that contains the main menu bar and toolbar through which you make JMP command selections, as well as lists of recent files and any other JMP windows that JMP has been set previously to open. In the illustration below, JMP has opened the Retirement Fund data table window and the Retirement Fund - Chart by Market Cap window and displays those two items in the Window List. Windows that JMP opens or creates display independently of other windows and can be arranged to overlap, as the illustration shows. Note that JMP displays thumbnails of results windows associated with a data table in an evaluations done panel that appears below the data table. In the Windows List, associated results appear as indented list items under the name of the data table window. In many windows that JMP creates, JMP hides a copy of the home window’s menu bar and tool bar under a “thin blue bar” as shown above. Clicking the thin blue bar, seen in the Retirement Fund - Chart by Market Cap window, displays a copy of the home window’s user interface. Most results windows also contain a right downward-pointing triangle to the left of a result heading (Chart in the illustration). Clicking this red triangle displays a red triangle menu of commands and options appropriate for the results that appear under the heading. Red triangle menus also appear in other contexts, such as in the upper left corner of data tables where they hide various row and column selection, data entry, and formatting commands. A result heading also includes a gray right triangle disclosure button that hides or reveals results (to the left of the red triangle in the Chart heading). By using the disclosure button and a combination of red triangle menu selections you can tailor the results, what JMP calls a report, to your specific needs. Selecting Help ➔ Books from the main window’s menu bar displays a list of books in PDF format that you can display in JMP or save and read when not using JMP. Consult the books Discovering JMP and Using JMP as additional sources for getting started with JMP or to discover the JMP features and commands that the instructions in this book do not use. 12 JMP Guide JG.2 ENTERING DATA In JMP, enter data into data table (worksheet) columns, starting with the first numbered row and the leftmost, first column. Never skip a cell when entering data because JMP will interpret that skipped cell as a “missing value” (see Section 1.4) that can affect analysis. Complete a cell entry by pressing Enter. To enter or edit data in a specific cell, either use the cursor keys to move the cell pointer to the cell or select the cell directly. As one enters data into columns, JMP assigns default column names in the form Column 1, Column 2, and so forth. Change these default names to variable names by double-clicking the name or right-clicking and selecting Column Info from the shortcut menu. Either action displays the Column dialog box in which you can enter the variable name and set data type and scale, attributes of the data that Chapter 1 explains. JG.3 CREATE NEW PROJECT or DATA TABLE Use File ➔ New ➔ Project. Use File ➔ New ➔ Data Table. While report windows that contain results can be opened and saved separately from the data table that provides the data for those results (see Section JG.4), best practice groups a set of report window and data table files into one project file. To create a project file, select File ➔ New ➔ Project. JMP opens a new Projects window with the name Untitled. Rightclick “Untitled” and rename the project. In the illustration below, the project has been renamed Retirement Funds Market Cap Analysis. To add a window to a project, right-click the project name and select Add Window. In the dialog box that appears, select the window to be included and click OK. In lieu of selecting Add Window, select Add All Windows to add all onscreen JMP windows excluding the home window. In the illustration below, the Retirement Funds data table and Chart by Market Cap windows has been added to the renamed project. Project files can be opened and saved as Section JG.4 explains. The data table New command opens a blank data table in its own window. Any new data table is not automatically added to the currently open project, and you must use the Add Window command if you want a new data table to be part of a project. JG.4 13 OPEN or SAVE FILES Use File ➔ Open. Use File ➔ Save As. In JMP, all displayed windows can be opened or saved as separate files, as well as open and save special grouping files such as projects. By default, JMP lists all JMP file types in open operations and properly assigns the file type in all save operations. To import an Excel workbook, select Excel Files (*.xls, *.xlsx, *.xlsm) from the pulldown list in the Open Data File dialog box. To export a JMP data table as an Excel file, change the Save as type in the Save JMP File As dialog box to Excel Workbook (*.xlsx, *.xls). Report windows can be saved as “interactive HTML” files that allow others to use systems on which JMP has not been installed to explore results in an interactive way, using a subset of JMP functionality. To save this type of file, change the Save as type in the Save JMP File As dialog box to Interactive HTML with Data (*.htm;*.html). JG.5 PRINT DATA TABLES or REPORT WINDOWS Use File ➔ Print or File ➔ Print Preview. To print a JMP object, select these File commands from the window that contains the object you want to print. For results (report) windows, first click the thin blue bar to reveal the menu bar that contains File. If using Print Preview, JMP opens a new window in which preview output and printing options can be adjusted before printing. [To print, click the leftmost (Print) icon in the window.] JG.6 JMP SCRIPT FILES JMP script files record many user interface actions and construct or modify JMP objects such as data tables. Using its own JSL scripting language, JMP records your actions as you analyze data in a script file that you can optionally save and play back later to recreate the analysis. Saved script files are text files that can be viewed, edited, and run in their own JMP window or edited by word or texting processing applications. JSL also includes user interface commands and directives allowing one to construct scripts that simplify and customize the use of the JMP Home window menu bar and toolbar. For selected chapters, JMP scripts created especially for this book can facilitate your use of JMP for those chapters (see Appendix C). JMP scripts are sometimes packaged as a JMP add-in that can be “installed” in JMP and directly selected from the JMP Home window menu bar, eliminating the need to open a script and then run the script from inside the script window. ▼ First Things First MINITAB GUIDE MG.1 GETTING STARTED with MINITAB student TIP To view a window that may be obscured or hidden, select Window from the Minitab menu bar, and then select the name of the window you want to view. MG.2 When you open Minitab, you see a main window and a number of child windows that cannot be moved outside the boundaries of the main window. You will normally see a blank worksheet and the Session window that records commands and displays results as the child windows. Pictured below is a project with one opened worksheet. Besides the slightly obscured DATA worksheet window and Session window, this figure also shows a Project Manager that lists the contents of the current project. (Use the keyboard shortcut Ctrl+I to display the Project Manager if it is not otherwise visible in the main window.) ENTERING DATA In Minitab, enter data into worksheet columns, starting with the first numbered row and leftmost, first column. Minitab names columns using the form Cn, such that the first column is named C1, the second column is C2, and the tenth column is C10. Use the first, unnumbered and shaded row to enter variable names that can be used as a second way to refer to a column by name. If a variable name contains spaces or other special characters, such as Market Cap, Minitab will display that name in dialog boxes using a pair of single quotation marks ('Market Cap'). You must include those quotation marks any time you enter such a variable name in a dialog box. If a column contains non-numerical data, Minitab displays the column name with an appended -T such as C1-T, C2-T, and C3-T in the worksheet shown above. If a column contains data that Minitab interprets as either dates or times, Minitab displays the column name with an appended -D. If a column contains data that a column formula (see Chapter 1) computes, Minitab displays a small green check mark above and to the right of the Cn name. (Neither the appended -D nor the check mark are shown in the worksheet above.) To enter or edit data in a specific cell, either use the cursor keys to move the cell pointer to the cell or use your mouse 14 to select the cell directly. Never skip a cell in a numbered row when entering data because Minitab will interpret a skipped cells as a “missing value” (see Section 1.4). MG.3 OPEN or SAVE FILES Use File ➔ Open Worksheet or File ➔ Open Project and File ➔ Save Current Worksheet or File ➔ Save Project As. In Minitab, open and save individual worksheets or entire projects, collections of worksheets, Session results, and graphs. To save data in a form readable by Excel, select Excel from the Save as type drop-down list before clicking Save. Data can also be saved as a simple text file, Text, or as simple text with values delimited with commas, CSV. In Minitab, individual graphs and a project’s session window can be opened or saved, although these operations are never used in the Minitab Guides in this book. MG.4 INSERT or COPY WORKSHEETS Use File ➔ New or File ➔ Open Worksheet. To insert a new worksheet, select File ➔ New and in the New dialog box click Minitab Worksheet and then click OK. To insert a copy of a worksheet, select File ➔ Open Worksheet and select worksheet to be copied. TABLEAU Guide MG.5 PRINT WORKSHEETS Use File ➔ Print Worksheet (or Print Graph or Print Session Window). Selecting Print Worksheet displays the Data Window Print Options dialog box. In this dialog box, specify the title and formatting options for printing. Selecting Print Graph or ▼ Print Session Window displays a dialog box that allows you to change the default printer settings. If you need to change printing attributes, first select File ➔ Print Setup and make the appropriate selections in the Print dialog box before you select the Print command. TABLEAU GUIDE TG.1 15 First Things First GETTING STARTED with TABLEAU The Tableau Guides for this book feature Tableau Public, version 2018, also known as the Tableau Desktop Public Edition. Tableau Public allows users to share data and create, in worksheets, tabular and visual summaries that can be building blocks for a dashboard or a story, a slide-based presentation. Tableau Public uses a drag-and-drop interface that will be most familiar to users of the JMP Graph Builder or the Excel PivotTable feature. Tableau Public consists of three different editions, as Appendix Section G.5 explains. Except where noted in this book, instructions apply to all three versions. Tableau Public uses workbooks to store tabular and visual worksheet summaries, dashboards, and stories. Opening Tableau Public displays the main Tableau window in which the contents of a workbook can be viewed and edited. The main window shown below displays the Mobile Electronics Store worksheet of the Arlingtons National Sales Dashboard Tableau workbook that is used in a Chapter 14 example. The main Tableau window contains a menu bar and toolbar, a tabbed left area that presents data and formatting details, several special areas that Section TG.4 explains, the worksheet display area, and the Show Me gallery that displays tabular and visual summaries appropriate for the data in the worksheet area. Worksheet tabs appear under these areas, along with a tab for the current data source to the left of the tabs and icon shortcuts for a new worksheet, new dashboard, 16 First Things First and new story, respectively, to the right of the tabs. The bottom of the window displays status information at the left, the current signed-in user (“MyUserName”), a drop-down button to sign out of Tableau Public, and four media controls that permit browsing through the worksheet tabs of the current workbook. The Data tab shown on the previous page displays the workbook data sources (five) and lists the “Dimensions” and “Measures,” the variable columns of the data source for the worksheet, summary measures, and calculated values for the MobileElectronicsStore data source associated with the displayed worksheet. (Section TG1.1 explains more about the significance of dimensions and measures.) For the worksheet shown on the previous page, the variable Store was dragged-and-dropped in the Rows shelf and the variable Mobile Electronics Sales was dragged-and-dropped in the Column shelf. The worksheet also contains a filter for the Region column that selects only those rows in the source worksheet in which the value in the Region column is Alpha. (To create the bar chart, horizontal bars icon was selected from the Show Me gallery.) TG.2 ENTERING DATA Tableau does not support the direct entry of data values. When using Tableau, one imports data from other sources, such as Microsoft Excel workbooks, text files, JSON files, or data retrieved from server-based data systems. Many users create and save data files in another program such as Excel and then open the saved data file in Tableau to import the data. TG.3 OPEN or SAVE a WORKBOOK Use File ➔ Open or File ➔ Open from Tableau Public. Use File ➔ Save to Tableau Public As. Use Open to import simple data sources such as Microsoft Excel workbooks or text files or to open a Tableau workbook that previously downloaded to a local computing device. Most Tableau Guide instructions begin by opening a Microsoft Excel workbook to avoid the limitations of Tableau Public that Appendix Section G.5 discusses. (This procedure mimics the typical usage of creating data in another program that Section TG.2 discusses.) To open a new workbook, select File ➔ New. For a new workbook, the Data tab will display the hyperlink Connect to Data. Clicking the hyperlink displays the Connect panel from which data sources can be retrieved. When opening an Excel workbook that contains more than one data worksheet, each Excel worksheet being used must be defined as a separate data source. This Connect panel also appears when opening an Excel workbook directly using the file open command. The illustration below shows the data source linked to the Data worksheet of the Excel Retirement Funds workbook. TABLEAU Guide Using the Open from Tableau Public or Save to Tableau Public As command requires a valid Tableau online account. (Accounts are complimentary but require registration.) Using these commands means signing into a Tableau Public account and retrieving (or storing) a Tableau workbook from that account or from an account that has been shared. Save to Tableau Public As stores the Tableau workbook in the account and opens a web browser to display the workbook and to permit its downloading to a local computing device. In paid-subscription editions of Tableau Desktop, these open and save commands appear in the Server menu and not in the File menu. Although not used in the Tableau Guides, Tableau Public permits join and union operations to combine columns from two data sources. Join operations combine two tables, typically by matching values in a variable column that both original tables share. Union operations add rows. Union operations require that tables share columns that hold values for the same variables. Joins and unions can solve problems that arise from seeking to perform an analysis of data on a set of variables stored in two different places, such as two different Excel data worksheets in the same Excel workbook. TG.4 WORKING with DATA Tableau Desktop uses the term data field to refer to what this book calls a variable or, in some contexts, a column. What an Excel, JMP, or Minitab user would call a formula function, Tableau calls an aggregation. What an Excel, JMP, or MInitab user would call a formula, Tableau calls an aggregate calculation. Tableau also invents its own vocabulary for several user interface elements in the worksheet window that users of Excel, JMP, or Minitab may know under more common names. Knowing this vocabulary can be helpful when consulting the Tableau help system or other references. Tableau calls the Pages, Filters, Columns, and Rows areas, all shown in the Section TG.1 illustration, shelves. The Marks area seems to be a shelf but for reasons that would be only self-evident to a regular user of Tableau, the Marks area is called a card. Shelves are places into which things can be placed or dropped, such as the pills that have been placed in the Filters, Row, and Columns shelves. A pill represents some data and is so named because it reminds some of a medicinal capsule. In its simplest form, a pill represents a data field in the 17 data source. However, pills can represent a filtering operation, such as the Filters shelf pill in the illustration below, or a calculated result, similar to a worksheet or data table formula. Pills can be either blue or green, reflecting the type of numerical data, discrete or continuous (see Section TG1.1), or red, reflecting an error condition. In the Section TG.1 illustration, the Store Dimension has been dragged to the Rows shelf, creating the Store pill and the Mobile Electronics Sales Measure has been dropped on the Columns shelf, creating the SUM(Mobile Electronics Sales) pill (slightly truncated, see figure detail below). Dropping a measure name on a shelf creates an aggregation (formula function). The SUM aggregation pill sums mobile electronic sales values by rows (each store). In the special case where each store is represented by only one row, there is no actual summing of values. TG.5 PRINT a WORKBOOK Tableau Public does not contain a print function that would print worksheets. (Commercial versions of Tableau Desktop do.) To print a tabular or visual summary, use a screen capture utility to capture the display for later printing. For online worksheets displayed in a web browser, use the print function of the browser. 1 Defining and Collecting Data CONTENTS USING STATISTICS: Defining Moments 1.1 Defining Variables 1.2 Collecting Data 1.3 Types of Sampling Methods 1.4 Data Cleaning 1.5 Other Data Preprocessing Tasks 1.6 Types of Survey Errors CONSIDER THIS: New Media Surveys/Old Survey Errors Defining Moments, Revisited EXCEL GUIDE JMP GUIDE MINITAB GUIDE TABLEAU DESKTOP GUIDE OBJECTIVES ■ ■ ■ ■ ■ ■ ■ Understand issues that arise when defining variables How to define variables Understand the different measurement scales How to collect data Identify the different ways to collect a sample Understand the issues involved in data preparation Understand the types of survey errors 18 ▼ USING STATISTICS Defining Moments #1 You’re the sales manager in charge of the best-selling beverage in its category. For years, your chief competitor has made sales gains, claiming a better tasting product. Worse, a new sibling product from your company, known for its good taste, has quickly gained significant market share at the expense of your product. Worried that your product may soon lose its number one status, you seek to improve sales by improving the product’s taste. You experiment and develop a new beverage formulation. Using methods taught in this book, you conduct surveys and discover that people overwhelmingly like the newer formulation. You decide to use that new formulation going forward, having statistically shown that people prefer it. What could go wrong? #2 You’re a senior airline manager who has noticed that your frequent fliers always choose another airline when flying from the United States to Europe. You suspect fliers make that choice because of the other airline’s perceived higher quality. You survey those fliers, using techniques taught in this book, and confirm your suspicions. You then design a new survey to collect detailed information about the quality of all components of a flight, from the seats to the meals served to the flight attendants’ service. Based on the results of that survey, you approve a costly plan that will enable your airline to match the perceived quality of your competitor. What could go wrong? In both cases, much did go wrong. Both cases serve as cautionary tales that if you choose the wrong variables to study, you may not end up with results that support making better decisions. Defining and collecting data, which at first glance can seem to be the simplest tasks in the DCOVA framework, can often be more challenging than people anticipate. 1.1 Defining Variables 19 A Coke managers also overlooked other issues, such as people’s emotional connection and brand loyalty to Coca-Cola, issues better discussed in a marketing book than this book. s the initial chapter notes, statistics is a way of thinking that can help fact-based decision making. But statistics, even properly applied using the DCOVA framework, can never be a substitute for sound management judgment. If you misidentify the business problem or lack proper insight into a problem, statistics cannot help you make a good decision. Case #1 retells the story of one of the most famous marketing blunders ever, the change in the formulation of Coca-Cola in the 1980s. In that case, Coke brand managers were so focused on the taste of Pepsi and the newly successful sibling Diet Coke that they decided only to define a variable and collect data about which drink tasters preferred in a blind taste test. When New Coke was preferred, even over Pepsi, managers rushed the new formulation into production. In doing so, those managers failed to reflect on whether the statistical results about a test that asked people to compare one-ounce samples of several beverages would demonstrate anything about beverage sales. After all, people were asked which beverage tasted better, not whether they would buy that better-tasting beverage in the future. New Coke was an immediate failure, and Coke managers reversed their decision a mere 77 days after introducing their new formulation (see reference 7). Case #2 represents a composite story of managerial actions at several airlines. In some cases, managers overlooked the need to state operational definitions for quality factors about which fliers were surveyed. In at least one case, statistics was applied correctly, and an airline spent great sums on upgrades and was able to significantly improve quality. Unfortunately, their frequent fliers still chose the competitor’s flights. In this case, no statistical survey about quality could reveal the managerial oversight that given the same level of quality between two airlines, frequent fliers will almost always choose the cheaper airline. While quality was a significant variable of interest, it was not the most significant. The lessons of these cases apply throughout this book. Due to the necessities of instruction, the examples and problems in all but the last chapter include preidentified business problems and defined variables. Identifying the business problem or objective to be considered is always a prelude to applying the DCOVA framework. 1.1 Defining Variables Identifying a proper business problem or objective enables one to begin to identify and define the variables for analysis. For each variable identified, assign an operational definition that specifies the type of variable and the scale, the type of measurement, that the variable uses. EXAMPLE 1.1 Defining Data at GT&M You have been hired by Good Tunes & More (GT&M), a local electronics retailer, to assist in establishing a fair and reasonable price for Whitney Wireless, a privately held chain that GT&M seeks to acquire. You need data that would help to analyze and verify the contents of the wireless company’s basic financial statements. A GT&M manager suggests that one variable you should use is monthly sales. What do you do? SOLUTION Having first confirmed with the GT&M financial team that monthly sales is a relevant variable of interest, you develop an operational definition for this variable. Does this variable refer to sales per month for the entire chain or for individual stores? Does the variable refer to net or gross sales? Do the monthly sales data represent number of units sold or currency amounts? If the data are currency amounts, are they expressed in U.S. dollars? After getting answers to these and similar questions, you draft an operational definition for ratification by others working on this project. Classifying Variables by Type The type of data that a variable contains determines the statistical methods that are appropriate for a variable. Broadly, all variables are either numerical, variables whose data represent a counted or measured quantity, or categorical, variables whose data represent categories. Gender 20 CHAPTER 1 | Defining and Collecting Data student TIP Some prefer the terms quantitative and qualitative over the terms numerical and categorical when describing variables. These two pairs of terms are interchangeable. with its categories male and female is a categorical variable, as is the variable preferred-NewCoke with its categories yes and no. In Example 1.1, the monthly sales variable is numerical because the data for this variable represent a quantity. For some statistical methods, numerical variables must be further specified as either being discrete or continuous. Discrete numerical variables have data that arise from a counting process. Discrete numerical variables include variables that represent a “number of something,” such as the monthly number of smartphones sold in an electronics store. Continuous numerical variables have data that arise from a measuring process. The variable “the time spent waiting on a checkout line” is a continuous numerical variable because its data represent timing measurements. The data for a continuous variable can take on any value within a continuum or an interval, subject to the precision of the measuring instrument. For example, a waiting time could be 1 minute, 1.1 minutes, 1.11 minutes, or 1.113 minutes, depending on the precision of the electronic timing device used. For a particular variable, one might use a numerical definition for one problem, but use a categorical definition for another problem. For example, a person’s age might seem to always be a numerical age variable, but what if one was interested in comparing the buying habits of children, young adults, middle-aged persons, and retirement-age people? In that case, defining age as a categorical variable would make better sense. Measurement Scales Determining the measurement scale that the data for a variable represent is part of defining a variable. The measurement scale defines the ordering of values and determines if differences among pairs of values for a variable are equivalent and whether one value can be expressed in terms of another. Table 1.1 presents examples of measurement scales, some of which are used in the rest of this section. TABLE 1.1 Examples of Different Scales and Types student TIP JMP and Tableau users will benefit the most from understanding measurement scales. Data Scale, Type Cellular provider Excel skills Temperature (°F) SAT Math score Item cost (in $) nominal, categorical ordinal, categorical interval, numerical interval, numerical ratio, numerical Values AT&T, T-Mobile, Verizon, Other, None novice, intermediate, expert - 459.67°F or higher a value between 200 and 800, inclusive $0.00 or higher Define numerical variables as using either an interval scale, which expresses a difference between measurements that do not include a true zero point, or a ratio scale, an ordered scale that includes a true zero point. Categorical variables use measurement scales that provide less insight into the values for the variable. For data measured on a nominal scale, category values express no order or ranking. For data measured on an ordinal scale, an ordering or ranking of category values is implied. Ordinal scales contain some information to compare values but not as much as interval or ratio scales. For example, the ordinal scale poor, fair, good, and excellent allows one to know that “good” is better than poor or fair and not better than excellent. But unlike interval and ratio scales, one would not know that the difference from poor to fair is the same as fair to good (or good to excellent). PROBLEMS FOR SECTION 1.1 LEARNING THE BASICS 1.2 U.S. businesses are listed by size: small, medium, and large. 1.1 Four different beverages are sold at a fast-food restaurant: soft drinks, tea, coffee, and bottled water. Explain why business size is an example of categorical variable. Explain why the type of beverage sold is an example of a categorical variable. 1.3 The time it takes to download a video from the Internet is measured. Explain why the download time is a continuous numerical variable. 1.2 Collecting Data APPLYING THE CONCEPTS SELF 1.4 For each of the following variables, determine TEST whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. a. Number of cell phones in the household b. Monthly data usage (in MB) c. Number of text messages exchanged per month d. Voice usage per month (in minutes) e. Whether the cell phone is used for streaming video 1.5 The following information is collected from students upon exiting the campus bookstore during the first week of classes. a. Amount of time spent shopping in the bookstore b. Number of textbooks purchased c. Academic major d. Gender Classify each variable as categorical or numerical. 1.6 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. a. Name of Internet service provider b. Time, in hours, spent surfing the Internet per week c. Whether the individual uses a mobile phone to stream video d. Number of online purchases made in a month e. Where the individual accesses social networks to find soughtafter information 1.7 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. a. Amount of money spent on clothing in the past month b. Favorite department store 21 c. Most likely time period during which shopping for clothing takes place (weekday, weeknight, or weekend) d. Number of pairs of shoes owned 1.8 Suppose the following information is collected from Robert Keeler on his application for a home mortgage loan at the Metro County Savings and Loan Association. a. Monthly payments: $2,227 b. Number of jobs in past 10 years: 1 c. Annual family income: $96,000 d. Marital status: Married Classify each of the responses by type of data. 1.9 One of the variables most often included in surveys is income. Sometimes the question is phrased, “What is your income (in thousands of dollars)?” In other surveys, the respondent is asked to “Select the circle corresponding to your income level” and is given a number of income ranges to choose from. a. In the first format, explain why income might be considered either discrete or continuous. b. Which of these two formats would you prefer to use if you were conducting a survey? Why? 1.10 If two students both score 90 on the same examination, what arguments could be used to show that the underlying variable—test score—is continuous? 1.11 The director of market research at a large department store chain wanted to conduct a survey throughout a metropolitan area to determine the amount of time working women spend shopping for clothing in a typical month. a. Indicate the type of data the director might want to collect. b. Develop a first draft of the questionnaire needed by writing three categorical questions and three numerical questions that you feel would be appropriate for this survey. 1.2 Collecting Data Collecting data using improper methods can spoil any statistical analysis. For example, Coca-Cola managers in the 1980s (see page 19) faced advertisements from their competitor publicizing the results of a “Pepsi Challenge” in which taste testers consistently favored Pepsi over Coke. No wonder—test recruiters deliberately selected tasters they thought would likely be more favorable to Pepsi and served samples of Pepsi chilled, while serving samples of Coke lukewarm (not a very fair comparison!). These introduced biases made the challenge anything but a proper scientific or statistical test. Proper data collection avoids introducing biases and minimizes errors. Populations and Samples Data are collected from either a population or a sample. A population contains all the items or individuals of interest that one seeks to study. All of the GT&M sales transactions for a specific year, all of the full-time students enrolled in a college, and all of the registered voters in Ohio are examples of populations. A sample contains only a portion of a population of interest. One analyzes a sample to estimate characteristics of an entire population. For example, one might select a sample of 200 sales transactions for a retailer or select a sample of 500 registered voters in Ohio in lieu of analyzing the populations of all the sales transactions or all the registered voters. One uses a sample when selecting a sample will be less time consuming or less cumbersome than selecting every item in the population or when analyzing a sample is less cumbersome or 22 CHAPTER 1 | Defining and Collecting Data learnMORE Read the Short Takes for Chapter 1 for a further discussion about data sources. more practical than analyzing the entire population. Section FTF.3 defines statistic as a “value that summarizes the data of a specific variable.” More precisely, a statistic summarizes the value of a specific variable for sample data. Correspondingly, a parameter summarizes the value of a population for a specific variable. Data Sources Data sources arise from the following activities: • • • • • Tableau uses the term data source in a different sense, to refer to the data being presented as a specific tabular or visual summary. Choosing to conduct an observation study or a designed experiment on a variable of interest affects the statistical methods and the decision-making processes that can be used, as Chapters 10 and 14 further explain. Capturing data generated by ongoing business activities Distributing data compiled by an organization or individual Compiling the responses from a survey Conducting an observational study and recording the results of the study Conducting a designed experiment and recording the outcomes of the experiment When the person conducting an analysis performs one of these activities, the data source is a primary data source. When one of these activities is done by someone other than the person conducting an analysis, the data source is a secondary data source. Capturing data can be done as a byproduct of an organization’s transactional information processing, such as the storing of sales transactions at a retailer, or as result of a service provided by a second party, such as customer information that a social media website business collects on behalf of another business. Therefore, such data capture may be either a primary or a secondary source. Typically, organizations such as market research firms and trade associations distribute complied data, as do businesses that offer syndicated services, such as The Neilsen Company, known for its TV ratings. Therefore, this source of data is usually a secondary source. (If one supervised the distribution of a survey, compiled its results, and then analyzed those results, the survey would be a primary data source.) In both observational studies and designed experiments, researchers that collect data are looking for the effect of some change, called a treatment, on a variable of interest. In an observational study, the researcher collects data in a natural or neutral setting and has no direct control of the treatment. For example, in an observational study of the possible effects on theme park usage patterns that a new electronic payment method might cause, one would take a sample of guests, identify those who use the new method and those who do not, and then “observe” if those who use the new method have different park usage patterns. As a designed experiment, one would select guests to use the new electronic payment method and then discover if those guests have theme park usage patterns that are different from the guests not selected to use the new payment method. PROBLEMS FOR SECTION 1.2 APPLYING THE CONCEPTS 1.12 The American Community Survey (www.census.gov/acs) provides data every year about communities in the United States. Addresses are randomly selected, and respondents are required to supply answers to a series of questions. a. Which of the sources of data best describe the American Community Survey? b. Is the American Community Survey based on a sample or a population? 1.13 Visit the website of the Gallup organization at www.gallup .com. Read today’s top story. What type of data source is the top story based on? 1.14 Visit the website of the Pew Research organization at www .pewresearch.org. Read today’s top story. What type of data source is the top story based on? 1.15 Transportation engineers and planners want to address the dynamic properties of travel behavior by describing in detail the driving characteristics of drivers over the course of a month. What type of data collection source do you think the transportation engineers and planners should use? 1.16 Visit the home page of the Statistics Portal “Statista” at www .statista.com. Examine one of the “Popular infographic topics” in the Infographics section on that page. What type of data source is the information presented here based on? 1.3 Types of Sampling Methods 23 1.3 Types of Sampling Methods When selecting a sample to collect data, begin by defining the frame. The frame is a complete or partial listing of the items that make up the population from which the sample will be selected. Inaccurate or biased results can occur if a frame excludes certain groups, or portions of the population. Using different frames to collect data can lead to different, even opposite, conclusions. Using the frame, select either a nonprobability sample or a probability sample. In a nonprobability sample, select the items or individuals without knowing their probabilities of selection. In a probability sample, select items based on known probabilities. Whenever possible, use a probability sample as such a sample will allow one to make inferences about the population being analyzed. Nonprobability samples can have certain advantages, such as convenience, speed, and low cost. Such samples are typically used to obtain informal approximations or as small-scale initial or pilot analyses. However, because the theory of statistical inference depends on probability sampling, nonprobability samples cannot be used for statistical inference and this more than offsets those advantages in more formal analyses. Figure 1.1 shows the subcategories of the two types of sampling. A nonprobability sample can be either a convenience sample or a judgment sample. To collect a convenience sample, select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse of stacked items, selecting only the items located on the top of each stack and within easy reach would create a convenience sample. So, too, would be the responses to surveys that the websites of many companies offer visitors. While such surveys can provide large amounts of data quickly and inexpensively, the convenience samples selected from these responses will consist of self-selected website visitors. (Read the Consider This essay on page 31 for a related story.) To collect a judgment sample, collect the opinions of preselected experts in the subject matter. Although the experts may be well informed, one cannot generalize their results to the population. The types of probability samples most commonly used include simple random, systematic, stratified, and cluster samples. These four types of probability samples vary in terms of cost, accuracy, and complexity, and they are the subject of the rest of this section. FIGURE 1.1 Nonprobability Samples Probability Samples Types of samples Judgment Sample Convenience Sample Simple Random Sample Systematic Stratified Sample Sample Cluster Sample Simple Random Sample In a simple random sample, every item from a frame has the same chance of selection as every other item, and every sample of a fixed size has the same chance of selection as every other sample of that size. Simple random sampling is the most elementary random sampling technique. It forms the basis for the other random sampling techniques. However, simple random sampling has its disadvantages. Its results are often subject to more variation than other sampling methods. In addition, when the frame used is very large, carrying out a simple random sample may be time consuming and expensive. With simple random sampling, use n to represent the sample size and N to represent the frame size. Number every item in the frame from 1 to N. The chance that any particular member of the frame will be selected during the first selection is 1>N. Select samples with replacement or without replacement. Sampling with replacement means that selected items are returned to the frame, where it has the same probability of being selected again. For example, imagine a fishbowl containing N business cards, one card for each person. The first selection selects the card for Grace Kim. After the pertinent information has been recorded, the business card is placed back in the fishbowl. All cards are thoroughly mixed and a second card selected. On the second selection, Grace Kim has the same probability of being selected again, 1>N. Most sampling is sampling without replacement. Sampling without replacement means that once an item has been selected, the item cannot ever again be selected for the sample. 24 CHAPTER 1 | Defining and Collecting Data learnMORE Learn to use a table of random numbers to select a simple random sample in the Section 1.3 LearnMore online topic. The chance that any particular item in the frame will be selected—for example, the business card for Grace Kim—on the first selection is 1>N. The chance that any card not previously chosen will be chosen on the second selection becomes 1 out of N - 1. When creating a simple random sample, avoid the “fishbowl” method of selecting a sample because this method lacks the ability to thoroughly mix items and, therefore, randomly select a sample. Instead, use a more rigorous selection method. One such method is to use a table of random numbers, such as Table E.1 in Appendix E, for selecting the sample. A table of random numbers consists of a series of digits listed in a randomly generated sequence. To use a random number table for selecting a sample, assign code numbers to the individual items of the frame. Then generate the random sample by reading the table of random numbers and selecting those individuals from the frame whose assigned code numbers match the digits found in the table. Because every digit or sequence of digits in the table is random, the table can be read either horizontally or vertically. The margins of the table designate row numbers and column numbers, and the digits are grouped into sequences of five in order to make reading the table easier. Because the number system uses 10 digits (0, 1, 2, … , 9), the chance that any particular digit will be randomly generated is equal 1 out of 10 and is equal to the probability of generating any other digit. For a generated sequence of 800 digits, one would expect about 80 to be the digit 0, 80 to be the digit 1, and so on. Systematic Sample In a systematic sample, partition the N items in the frame into n groups of k items, where N k = n Round k to the nearest integer. To select a systematic sample, choose the first item to be selected at random from the first k items in the frame. Then, select the remaining n - 1 items by taking every kth item thereafter from the entire frame. If the frame consists of a list of prenumbered checks, sales receipts, or invoices, taking a systematic sample is faster and easier than taking a simple random sample. A systematic sample is also a convenient mechanism for collecting data from membership directories, electoral registers, class rosters, and consecutive items coming off an assembly line. To take a systematic sample of n = 40 from the population of N = 800 full-time employees, partition the frame of 800 into 40 groups, each of which contains 20 employees. Then select a random number from the first 20 individuals and include every twentieth individual after the first selection in the sample. For example, if the first random number selected is 008, subsequent selections will be 028, 048, 068, 088, 108, … , 768, and 788. Simple random sampling and systematic sampling are simpler than other, more sophisticated, probability sampling methods, but they generally require a larger sample size. In addition, systematic sampling is prone to selection bias that can occur when there is a pattern in the frame. To overcome the inefficiency of simple random sampling and the potential selection bias involved with systematic sampling, one can use either stratified sampling methods or cluster sampling methods. Stratified Sample learnMORE Learn how to select a stratified sample in the online in the Section 1.3 LearnMore online topic. In a stratified sample, first subdivide the N items in the frame into separate subpopulations, or strata. A stratum is defined by some common characteristic, such as gender or year in school. Then select a simple random sample within each of the strata and combine the results from the separate simple random samples. Stratified sampling is more efficient than either simple random sampling or systematic sampling because the representation of items across the entire population is assured. The homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters. In addition, stratified sampling enables one to reach conclusions about each strata in the frame. However, using a stratified sample requires that one can determine the variable(s) on which to base the stratification and can also be expensive to implement. Cluster Sample In a cluster sample, divide the N items in the frame into clusters that contain several items. Clusters are often naturally occurring groups, such as counties, election districts, city blocks, 1.3 Types of Sampling Methods 25 households, or sales territories. Then take a random sample of one or more clusters and study all items in each selected cluster. Cluster sampling is often more cost-effective than simple random sampling, particularly if the population is spread over a wide geographic region. However, cluster sampling often requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling. A detailed discussion of systematic sampling, stratified sampling, and cluster sampling procedures can be found in references 2, 4, and 6. PROBLEMS FOR SECTION 1.3 LEARNING THE BASICS 1.17 For a population containing N = 902 individuals, what code number would you assign for a. the first person on the list? b. the fortieth person on the list? c. the last person on the list? 1.18 For a population of N = 902, verify that by starting in row 05, column 01 of the table of random numbers (Table E.1), you need only six rows to select a sample of N = 60 without replacement. 1.19 Given a population of N = 93, starting in row 29, column 01 of the table of random numbers (Table E.1), and reading across the row, select a sample of N = 15 a. without replacement. b. with replacement. APPLYING THE CONCEPTS 1.20 For a study that consists of personal interviews with participants (rather than mail or phone surveys), explain why simple random sampling might be less practical than some other sampling methods. 1.21 You want to select a random sample of n = 1 from a population of three items (which are called A, B, and C). The rule for selecting the sample is as follows: Flip a coin; if it is heads, pick item A; if it is tails, flip the coin again; this time, if it is heads, choose B; if it is tails, choose C. Explain why this is a probability sample but not a simple random sample. 1.22 A population has four members (called A, B, C, and D). You would like to select a random sample of n = 2, which you decide to do in the following way: Flip a coin; if it is heads, the sample will be items A and B; if it is tails, the sample will be items C and D. Although this is a random sample, it is not a simple random sample. Explain why. (Compare the procedure described in Problem 1.21 with the procedure described in this problem.) 1.23 The registrar of a university with a population of N = 4,000 full-time students is asked by the president to conduct a survey to measure satisfaction with the quality of life on campus. The following table contains a breakdown of the 4,000 registered full-time students, by gender and class designation: CLASS DESIGNATION GENDER Female Male Total Fr. So. Jr. Sr. Total 700 560 1,260 520 460 980 500 400 900 480 380 860 2,200 1,800 4,000 The registrar intends to take a probability sample of n = 200 students and project the results from the sample to the entire population of full-time students. a. If the frame available from the registrar’s files is an alphabetical listing of the names of all N = 4,000 registered full-time students, what type of sample could you take? Discuss. b. What is the advantage of selecting a simple random sample in (a)? c. What is the advantage of selecting a systematic sample in (a)? d. If the frame available from the registrar’s files is a list of the names of all N = 4,000 registered full-time students compiled from eight separate alphabetical lists, based on the gender and class designation breakdowns shown in the class designation table, what type of sample should you take? Discuss. e. Suppose that each of the N = 4,000 registered full-time students lived in one of the 10 campus dormitories. Each dormitory accommodates 400 students. It is college policy to fully integrate students by gender and class designation in each dormitory. If the registrar is able to compile a listing of all students by dormitory, explain how you could take a cluster sample. SELF 1.24 Prenumbered sales invoices are kept in a sales TEST journal. The invoices are numbered from 0001 to 5000. a. Beginning in row 16, column 01, and proceeding horizontally in a table of random numbers (Table E.1), select a simple random sample of 50 invoice numbers. b. Select a systematic sample of 50 invoice numbers. Use the random numbers in row 20, columns 05–07, as the starting point for your selection. c. Are the invoices selected in (a) the same as those selected in (b)? Why or why not? 1.25 Suppose that 10,000 customers in a retailer’s customer database are categorized by three customer types: 3,500 prospective buyers, 4,500 first time buyers, and 2,000 repeat (loyal) buyers. A sample of 1,000 customers is needed. a. What type of sampling should you do? Why? b. Explain how you would carry out the sampling according to the method stated in (a). c. Why is the sampling in (a) not simple random sampling? 26 CHAPTER 1 | Defining and Collecting Data 1.4 Data Cleaning Even if proper data collection procedures are followed, the collected data may contain incorrect or inconsistent data that could affect statistical results. Data cleaning corrects such defects and ensures the data contain suitable quality for analysis. Cleaning is the most important data preprocessing task and must be done before performing any analysis. Cleaning can take a significant amount of time to do. One survey of big data analysts reported that they spend 60% of their time cleaning data, while only 20% of their time collecting data and a similar percentage for analyzing data (see reference 8). Data cleaning seeks to correct the following types of irregularities: With the exception of several examples designed for use with this section, data for the problems and examples in this book have already been properly cleaned to allow focus on the statistical concepts and methods that the book discusses. • Invalid variable values, including non-numerical data for a numerical variable, invalid categorical values of a categorical variable, and numeric values outside a defined range • Coding errors, including inconsistent categorical values, inconsistent case for categorical values, and extraneous characters • Data integration errors, including redundant columns, duplicated rows, differing column lengths, and different units of measure or scale for numerical variables By its nature, data cleaning cannot be a fully automated process, even in large business systems that contain data cleaning software components. As this chapter’s software guides explain, Excel, JMP, Minitab, and Tableau contain functionality that lessens the burden of data cleaning. When performing data cleaning, first preserve a copy of the original data for later reference. Invalid Variable Values Invalid variable values can be identified as being incorrect by simple scanning techniques so long as operational definitions for the variables the data represent exist. For any numerical variable, any value that is not a number is clearly an incorrect value. For a categorical variable, a value that does not match any of the predefined categories of the variable is, likewise, clearly an incorrect value. And for numerical variables defined with an explicit range of values, a value outside that range is clearly an error. Coding Errors Coding errors can result from poor recording or entry of data values or as the result of computerized operations such as copy-and-paste or data import. While coding errors are literally invalid values, coding errors may be correctable without consulting additional information whereas the invalid variable values never are. For example, for a Gender variable with the defined values F and M, the value “Female” is a coding error that can be reasonably changed to F. However, the value “New York” for the same variable is an invalid variable value that you cannot reasonably change to either F or M. Unlike invalid variable values, coding errors may be tolerated by analysis software. For example, for the same Gender variable, the values M and m might be treated as the “same” value for purposes of an analysis by software that was tolerant of case inconsistencies, an attribute known as being insensitive to case. Perhaps the most frustrating coding errors are extraneous characters in a value. Visual examination may not be able to spot extraneous characters such as nonprinting characters or extra, trailing space characters as one scans data. For example, the value David and the value that is David followed by three space characters may look the same to one casually scanning them but may not be treated the same by software. Likewise, values with nonprinting characters may look correct but may cause software errors or be reported as invalid by analysis software. Data Integration Errors Data integration errors arise when data from two different computerized sources, such as two different data repositories are combined into one data set for analysis. Identifying data integration errors may be the most time-consuming data cleaning task. Because spotting these errors requires a type of data interpretation that automated processes of a typical business computer 1.5 Other Data Preprocessing Tasks Perhaps not surprising, supplying business systems with automated data interpretation skills that would semi-automate this task is a goal of many companies that provide data analysis software and services. 27 systems today cannot supply, spotting these errors using manual methods will be typical for the foreseeable future. Some data integration errors occur because variable names or definitions for the same item of interest have minor differences across systems. In one system, a customer ID number may be known as Customer ID, whereas in a different system, the same variable is known as Cust Number. A result of combining data from the two systems may result in having both Customer ID and Cust Number variable columns, a redundancy that should be eliminated. Duplicated rows also occur because of similar inconsistencies across systems. Consider a Customer Name variable with the value that represents the first coauthor of this book, David M. Levine. In one system, this name may have been recorded as David Levine, whereas in another system, the name was recorded as D M Levine. Combining records from both systems may result in two records, where only one should exist. Whether “David Levine” is actually the same person as “D M Levine” requires an interpretation skill that today’s software may lack. Likewise, different units of measurement (or scale) may not be obvious without additional, human interpretation. Consider the variable Air Temperature, recorded in degrees Celsius in one system and degrees Fahrenheit in another. The value 30 would be a plausible value under either measurement system and without further knowledge or context impossible to spot as a Celsius measurement in a column of otherwise Fahrenheit measurements. Missing Values Missing values are values that were not collected for a variable. For example, survey data may include answers for which no response was given by the survey taker. Such “no responses” are examples of missing values. Missing values can also result from integrating two data sources that do not have a row-to-row correspondence for each row in both sources. The lack of correspondence creates particular variable columns to be longer, to contain additional rows than the other columns. For these additional rows, missing would be the value for the cells in the shorter columns. Do not confuse missing values with miscoded values. Unresolved miscoded values—values that cannot be cleaned by any method—might be changed to missing by some researchers or excluded for analysis by others. Algorithmic Cleaning of Extreme Numerical Values For numerical variables without a defined range of possible values, you might find outliers, values that seem excessively different from most of the other values. Such values may or may not be errors, but all outliers require review. While there is no one standard for defining outliers, most define outliers in terms of descriptive measures such as the standard deviation or the interquartile range that Chapter 3 discusses. Because software can compute such measures, spotting outliers can be automated if a definition of the term that uses a such a measure is used. As later chapters note as appropriate, identifying outliers is important as some methods are sensitive to outliers and produce very different results when outliers are included in analysis. 1.5 Other Data Preprocessing Tasks In addition to data cleaning, there are several other data preprocessing tasks that you might undertake before visualizing and analyzing your data. Data Formatting Data formatting includes the rearranging the structure of the data or changing the electronic encoding of the data or both. For example, consider financial data that has been collected for a sample of companies. The collected data may be structured as tables of data, as the contents of standard forms, in a continuous stock ticker stream, or as messages or blog entries that appear on various websites. These data sources have various levels of structure which affect the ease of reformatting them for use. 28 CHAPTER 1 | Defining and Collecting Data Because tables of data are highly structured and are similar to the structure of a worksheet, tables would require the least reformatting. In the best case, the rows and columns of a table would become the rows and columns of a worksheet. Unstructured data sources, such as messages and blog entries, often represent the worst case. The data may need to be paraphrased, characterized, or summarized in a way that does not involve a direct transfer. As the use of business analytics grows (see Chapter 14), the use of automated ways to paraphrase or characterize these and other types of unstructured data grows, too. Independent of the structure, collected data may exist in an electronic form that needs to be changed in order to be analyzed. For example, data presented as a digital picture of Excel worksheets would need to be changed into an actual Excel worksheet before that data could be analyzed. In this example, the electronic encoding of the data changes from a picture format such as jpeg to an Excel workbook format. Sometimes, individual numerical values that have been collected may need to be changed, especially collected values that result from a computational process. Demonstrate this issue in Excel by entering a formula that is equivalent to the expression 1 * (0.5 - 0.4 - 0.1). This should evaluate as 0 but Excel evaluates to a very small negative number. Altering that value to 0 would be part of the data cleaning process. Stacking and Unstacking Data When collecting data for a numerical variable, subdividing that data into two or more groups for analysis may be necessary. For example, data about the cost of a restaurant meal in an urban area might be subdivided to consider the cost of meals at restaurants in the center city district separately from the meal costs at metro area restaurants. When using data that represent two or more groups, data can be arranged as either unstacked or stacked. To use an unstacked arrangement, create separate numerical variables for each group. For this example, create a center city meal cost variable and a second variable to hold the meal costs at metro area restaurants. To use a stacked arrangement format, pair the single numerical variable meal cost with a second, categorical variable that contains two categories, such as center city and metro area. If collecting data for several numerical variables, each of which will be subdivided in the same way, stacking the data will be the more efficient choice. When using software to analyze data, a specific procedure may require data to be stacked (or unstacked). When such cases arise using Microsoft Excel, JMP, or Minitab for problems or examples that this book discusses, a workbook or project will contain that data in both arrangements. For example, Restaurants , that Chapter 2 uses for several examples, contains both the original (stacked) data about restaurants as well as an unstacked worksheet (or data table) that contains the meal cost by location, center city or metro area. Recoding Variables After data have been collected, categories defined for a categorical variable may need to be reconsidered or a numerical variable may need to be transformed into a categorical variable by assigning individual numeric values to one of several groups. For either case, define a recoded variable that supplements or replaces the original variable in your analysis. For example, having already defined the variable class standing with the categories freshman, sophomore, junior, and senior, a researcher decides to investigate the differences between lowerclassmen (freshmen or sophomores) and upperclassmen (juniors or seniors). The researcher can define a recoded variable UpperLower and assign the value Upper if a student is a junior or senior and assign the value Lower if the student is a freshman or sophomore. When recoding variables, make sure that one and only one of the new categories can be assigned to any particular value being recoded and that each value can be recoded successfully by one of your new categories, the properties known as being mutually exclusive and collectively exhaustive. When recoding numerical variables, pay particular attention to the operational definitions of the categories created for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories Under 12, 12–20, 21–34, 35–54, and 55-and-over are self-defining for age, the categories child, youth, young adult, middle aged, and senior each need to be further defined in terms of mutually exclusive and collectively exhaustive numerical ranges. 1.6 Types of Survey Errors 29 PROBLEMS FOR SECTIONS 1.4 AND 1.5 APPLYING THE CONCEPTS 1.26 The cell phone brands owned by a sample of 20 respondents were: Apple, Samsung, Appel, Nokia, Blackberry, HTC, Apple, Samsung, HTC, LG, Blueberry, Samsung, Samsung, APPLE, Motorola, Apple, Samsun, Apple, Samsung a. Clean these data and identify any irregularities in the data. b. Are there any missing values in this set of 20 respondents? Identify the missing values. 1.27 The amount of monthly data usage by a sample of 10 cell phone users (in MB) was: 0.4, 2.7MB, 5.6, 4.3, 11.4, 26.8, 1.6, 1,079, 8.3, 4.2 Are there any potential irregularities in the data? 1.28 An amusement park company owns three hotels on an adjoining site. A guest relations manager wants to study the time it takes for shuttle buses to travel from each of the hotels to the amusement park entrance. Data were collected on a particular day for the recorded travel times in minutes. a. Explain how the data could be organized in an unstacked format. b. Explain how the data could be organized in a stacked format. 1.29 A hotel management company runs 10 hotels in a resort area. The hotels have a mix of pricing—some hotels have budgetpriced rooms, some have moderate-priced rooms, and some have deluxe-priced rooms. Data are collected that indicate the number of rooms that are occupied at each hotel on each day of a month. Explain how the 10 hotels can be recoded into these three price categories. 1.6 Types of Survey Errors Collected data in the form of compiled responses from a survey must be verifed to ensure that the results can be used in a decision-making process. Verification begins by evaluating the validity of the survey to make sure the survey does not lack objectivity or credibility. To do this, evaluate the purpose of the survey, the reason the survey was conducted, and for whom the survey was conducted. Having validated the objectivity and credibility of the survey, determine whether the survey was based on a probability sample (see Section 1.3). Surveys that use nonprobability samples are subject to serious biases that render their results useless for decision-making purposes. In the case of the Coca-Cola managers concerned about the “Pepsi Challenge” results (see page 19), the managers failed to reflect on the subjective nature of the challenge as well as the nonprobability sample that this survey used. Had the managers done so, they might not have been so quick to make the reformulation blunder that was reversed just weeks later. Even after verification, surveys can suffer from any combination of the following types of survey errors: coverage error, nonresponse error, sampling error, or measurement error. Developers of well-designed surveys seek to reduce or minimize these types of errors, often at considerable cost. Coverage Error The key to proper sample selection is having an adequate frame. Coverage error occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample or if items are included from outside the frame. Coverage error results in a selection bias. If the frame is inadequate because certain groups of items in the population were not properly included, any probability sample selected will provide only an estimate of the characteristics of the frame, not the actual population. Nonresponse Error Not everyone is willing to respond to a survey. Nonresponse error arises from failure to collect data on all items in the sample and results in a nonresponse bias. Because a researcher cannot always assume that persons who do not respond to surveys are similar to those who do, researchers need to follow up on the nonresponses after a specified period of time. Researchers should make several attempts to convince such individuals to complete 30 CHAPTER 1 | Defining and Collecting Data the survey and possibly offer an incentive to participate. The follow-up responses are then compared to the initial responses in order to make valid inferences from the survey (see references 2, 4, and 6). The mode of response the researcher uses, such as face-to-face interview, telephone interview, paper questionnaire, or computerized questionnaire, affects the rate of response. Personal interviews and telephone interviews usually produce a higher response rate than do mail surveys—but at a higher cost. Sampling Error When conducting a probability sample, chance dictates which individuals or items will or will not be included in the sample. Sampling error reflects the variation, or “chance differences,” from sample to sample, based on the probability of particular individuals or items being selected in the particular samples. When there is a news report about the results of surveys or polls in newspapers or on the Internet, there is often a statement regarding a margin of error, such as “the results of this poll are expected to be within { 4 percentage points of the actual value.” This margin of error is the sampling error. Using larger sample sizes reduces the sampling error. Of course, doing so increases the cost of conducting the survey. Measurement Error In the practice of good survey research, design surveys with the intention of gathering meaningful and accurate information. Unfortunately, the survey results are often only a proxy for the ones sought. Unlike height or weight, certain information about behaviors and psychological states is impossible or impractical to obtain directly. When surveys rely on self-reported information, the mode of data collection, the respondent to the survey, and or the survey itself can be possible sources of measurement error. Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent on the mode of data collection. The social desirability bias or cognitive/memory limitations of a respondent can affect the results. Vague questions, double-barreled questions that ask about multiple issues but require a single response, or questions that ask the respondent to report something that occurs over time but fail to clearly define the extent of time about which the question asks (the reference period) are some of the survey flaws that can cause errors. To minimize measurement error, standardize survey administration and respondent understanding of questions, but there are many barriers to this (see references 1, 3, and 12). Ethical Issues About Surveys Ethical considerations arise with respect to the four types of survey error. Coverage error can result in selection bias and becomes an ethical issue if particular groups or individuals are purposely excluded from the frame so that the survey results are more favorable to the survey’s sponsor. Nonresponse error can lead to nonresponse bias and becomes an ethical issue if the sponsor knowingly designs the survey so that particular groups or individuals are less likely than others to respond. Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error so that the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading questions that guide the respondent in a particular direction; (2) an interviewer, through mannerisms and tone, purposely makes a respondent obligated to please the interviewer or otherwise guides the respondent in a particular direction; or (3) a respondent willfully provides false information. Ethical issues also arise when the results of nonprobability samples are used to form conclusions about the entire population. When using a nonprobability sampling method, explain the sampling procedures and state that the results cannot be generalized beyond the sample. 1.6 Types of Survey Errors 31 CONSIDER THIS New Media Surveys/Old Survey Errors Software company executives decide to create a “customer experience improvement program” to record how customers use the company’s products, with the goal of using the collected data to make product enhancements. Product marketers decide to use social media websites to collect consumer feedback. These people risk making the same type of survey error that led to the quick demise of a very successful magazine nearly 80 years ago. By 1935, “straw polls” conducted by the magazine Literary Digest had successfully predicted five consecutive U.S. presidential elections. For the 1936 election, the magazine promised its largest poll ever and sent about 10 million ballots to people all across the country. After tabulating more than 2.3 million ballots, the Digest confidently proclaimed that Alf Landon would be an easy winner over Franklin D. Roosevelt. The actual results: FDR won in a landslide and Landon received the fewest electoral votes in U.S. history. Being so wrong ruined the reputation of Literary Digest, and it would cease publication less than two years after it made its erroneous claim. A review much later found that the low response rate (less than 25% of the ballots distributed were returned) and nonresponse error (Roosevelt voters were less likely to mail in a ballot than Landon voters) were significant reasons for the failure of the Literary Digest poll (see reference 11). The Literary Digest error proved to be a watershed event in the history of sample surveys. First, the error disproved the assertion that the larger the sample is, the better the results will be—an assertion some people still mistakenly make today. The error paved the way for the modern methods of sampling discussed in this chapter and gave prominence to the more “scientific” methods that George Gallup and Elmo Roper both used to correctly predict the 1936 elections. (Today’s Gallup Polls and Roper Reports remember those researchers.) In more recent times, Microsoft software executives overlooked that experienced users could easily opt out of participating in their improvement program. This created another case of nonresponse error which may have led to the improved product (Microsoft Office) being so poorly received initially by experienced Office users who, by being more likely to opt out of the improvement program, biased the data that Microsoft used to determine Office “improvements.” And while those product marketers may be able to collect a lot of customer feedback data, those data also suffer from nonresponse error. In collecting data from social media websites, the marketers cannot know who chose not to leave comments. The marketers also cannot verify if the data collected suffer from a selection bias due to a coverage error. That you might use media newer than the mailed, dead-tree form that Literary Digest used does not mean that you automatically avoid the old survey errors. Just the opposite—the accessibility and reach of new media makes it much easier for unknowing people to commit such errors. PROBLEMS FOR SECTION 1.6 APPLYING THE CONCEPTS 1.30 A survey indicates that the vast majority of college students own their own smartphones. What information would you want to know before you accepted the results of this survey? 1.31 A simple random sample of n = 300 full-time employees is selected from a company list containing the names of all N = 5,000 full-time employees in order to evaluate job satisfaction. a. Give an example of possible coverage error. b. Give an example of possible nonresponse error. c. Give an example of possible sampling error. d. Give an example of possible measurement error. SELF 1.32 Results of a 2017 Computer Services, Inc. (CSI) TEST survey of a sample of 163 bank executives reveal insights on banking priorities among financial institutions (goo.gl/ mniYMM). As financial institutions begin planning for a new year, of utmost importance is boosting profitability and identifying growth areas. The results show that 55% of bank institutions note customer experience initiatives as an area in which spending is expected to increase. Implementing a customer relationship management (CRM) solution was ranked as the top most important omnichannel strategy to pursue with 41% of institutions citing digital banking enhancements as the greatest anticipated strategy to enhance the customer experience. Identify potential concerns with coverage, nonresponse, sampling, and measurement errors. 1.33 A recent PwC survey of 1,379 CEOs from a wide range of industries representing a mix of company sizes from Asia, Europe, and the Americas indicated that CEOs are firmly convinced that it is harder to gain and retain people’s trust in an increasingly digitalized world (pwc.to/2jFLzjF). Fifty-eight percent of CEOs are worried that lack of trust in business would harm their company’s growth. 32 CHAPTER 1 | Defining and Collecting Data Which risks arising from connectivity concern CEOs most? Eightyseven percent believe that social media could have a negative impact on the level of trust in their industry over the next few years. But they also say new dangers are emerging and old ones are getting worse as new technologies and new uses of existing technologies increase rapidly. CEOs are particularly anxious about breaches in data security and ethics and IT outages and disruptions. A vast majority of CEOs are already taking steps to address these concerns, with larger-sized companies doing more than smaller-sized companies. What additional information would you want to know about the survey before you accepted the results for the study? ▼ 1.34 A recent survey points to tremendous revenue potential and consumer value in leveraging driver and vehicle data in the automobile industry. The 2017 KPMG Global Automotive Executive Study found that automobile executives believe data will be the fuel for future business models and that they will make money from that data (prn.to/2q9rubN). Eighty-two percent of automobile executives agree that in order to create value and consequently monetize data, a car needs its own ecosystem/operating system; otherwise the valuable consumer and/or vehicle data will likely be routed through third parties and valuable revenue streams will be lost. What additional information would you want to know about the survey before you accepted the results of the study? USING STATISTICS Defining Moments, Revisited T he New Coke and airline quality cases illustrate missteps that can occur during the define and collect tasks of the DCOVA framework. To use statistics effectively, you must properly define a business problem or goal and then collect data that will allow you to make observations and reach conclusions that are relevant to that problem or goal. In the New Coke case, managers failed to consider that data collected about a taste test would not necessarily provide useful information about the sales issues they faced. The managers also did not realize that the test used improper sampling techniques, deliberately introduced biases, and were subject to coverage and nonresponse errors. Those mistakes invalidated the test, making the conclusion that New Coke tasted better than Pepsi an invalid claim. ▼ SUMMARY In this chapter, you learned the details about the Define and Collect tasks of the DCOVA framework, which are important first steps to applying statistics properly to decision making. You learned that defining variables means developing an operational definition that includes establishing the type of variable and the measurement scale that the variable uses. You learned important details about data collection as ▼ In the airline quality case, no mistakes in defining and collecting data were made. The results that fliers like quality was a valid one, but decision makers overlooked that quality was not the most significant factor for people buying seats on transatlantic flights (price was). This case illustrates that no matter how well you apply statistics, if you do not properly analyze the business problem or goal being considered, you may end up with valid results that lead you to invalid management decisions. well as some new basic vocabulary terms (sample, population, and parameter) and a more precise definition of statistic. You specifically learned about sampling and the types of sampling methods available to you. Finally, you surveyed data preparation considerations and learned about the type of survey errors you can encounter. REFERENCES 1. Biemer, P. B., R. M. Graves, L. E. Lyberg, A. Mathiowetz, and S. Sudman. Measurement Errors in Surveys. New York: Wiley Interscience, 2004. 2. Cochran, W. G. Sampling Techniques, 3rd ed. New York: Wiley, 1977. 3. Fowler, F. J. Improving Survey Questions: Design and Evaluation, Applied Special Research Methods Series, Vol. 38, Thousand Oaks, CA: Sage Publications, 1995. 4. Groves R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. Survey Methodology, 2nd ed. New York: John Wiley, 2009. 5. Hellerstein, J. “Quantitative Data Cleaning for Large Databases.” bit.ly/2q7PGIn. 6. Lohr, S. L. Sampling Design and Analysis, 2nd ed. Boston, MA: Brooks/Cole Cengage Learning, 2010. 7. Polaris Marketing Research. “Brilliant Marketing Research or What? The New Coke Story,” posted September 20, 2011. bit.ly/1DofHSM (removed). 8. Press, G. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” posted March 23, 2016. bit.ly/2oNCwzh. Chapter Review Problems 9. Rosenbaum, D. “The New Big Data Magic,” posted August 20, 2011. bit.ly/1DUMWzv. 10. Osbourne, J. Best Practices in Data Cleaning. Thousand Oaks, CA: Sage Publications, 2012. 11. Squire, P. “Why the 1936 Literary Digest Poll Failed.” Public Opinion Quarterly 52 (1988): 125–133. ▼ mutually exclusive 28 nominal scale 20 nonprobability sample 23 nonresponse bias 29 nonresponse error 29 numerical variable 19 operational definition 19 ordinal scale 20 outlier 27 parameter 22 population 21 primary data source 22 probability sample 23 qualitative variable 20 quantitative variable 20 ratio scale 20 recoded variable 28 sample 21 sampling error 30 sampling with replacement 23 sampling without replacement 23 secondary data source 22 selection bias 29 simple random sample 23 stacked 28 statistic 22 strata 24 stratified sample 24 systematic sample 24 table of random numbers 24 treatment 22 unstacked 28 CHECKING YOUR UNDERSTANDING 1.35 What is the difference between a sample and a population? 1.36 What is the difference between a statistic and a parameter? 1.37 What is the difference between a categorical variable and a numerical variable? 1.38 What is the difference between a discrete numerical variable and a continuous numerical variable? 1.39 What is the difference between a nominal scaled variable and an ordinal scaled variable? 1.40 What is the difference between an interval scaled variable and a ratio scaled variable? ▼ 12. Sudman, S., N. M. Bradburn, and N. Schwarz. Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. San Francisco, CA: Jossey-Bass, 1993. KEY TERMS categorical variable 19 cluster 24 cluster sample 24 collectively exhaustive 28 continuous variable 20 convenience sample 23 coverage error 29 data cleaning 26 discrete variable 20 frame 23 interval scale 20 judgment sample 23 margin of error 30 measurement error 30 measurement scale 20 missing value 27 ▼ 33 1.41 What is the difference between probability sampling and nonprobability sampling? 1.42 What is the difference between a missing value and an outlier? 1.43 What is the difference between unstacked and stacked variables? 1.44 What is the difference between coverage error and nonresponse error? 1.45 What is the difference between sampling error and measurement error? CHAPTER REVIEW PROBLEMS 1.46 Visit the official website for Microsoft Excel (products .office.com/excel), Minitab (www.minitab.com), or JMP (www .jmp.com). Review the features of the program you chose and then state the ways the program could be useful in statistical analysis. 1.47 Results of a 2017 Computer Services, Inc. (CSI) survey of a sample of 163 bank executives reveal insights on banking priorities among financial institutions (goo.gl/mniYMM). As financial institutions begin planning for a new year, of utmost importance is boosting profitability and identifying growth areas. The results show that 55% of bank institutions note customer experience initiatives as an area in which spending is expected to increase. Implementing a customer relationship management (CRM) solution was ranked as the top most important omnichannel strategy to pursue with 41% of institutions citing digital banking enhancements as the greatest anticipated strategy to enhance the customer experience. a. Describe the population of interest. b. Describe the sample that was collected. c. Describe a parameter of interest. d. Describe the statistic used to estimate the parameter in (c). 34 CHAPTER 1 | Defining and Collecting Data 1.48 The Gallup organization releases the results of recent polls on its website, www.gallup.com. Visit this site and read an article of interest. a. Describe the population of interest. b. Describe the sample that was collected. c. Describe a parameter of interest. d. Describe the statistic used to estimate the parameter in (c). 1.49 A recent PwC survey of 1,379 CEOs from a wide range of industries representing a mix of company sizes from Asia, Europe, and the Americas indicated that CEOs are firmly convinced that it is harder to gain and retain people’s trust in an increasingly digitalized world (pwc.to/2jFLzjF). Fifty-eight percent of CEOs are worried that lack of trust in business would harm their company’s growth. Which risks arising from connectivity concern CEOs most? Eighty-seven percent believe that social media could have a negative impact on the level of trust in their industry over the next few years. But they also say new dangers are emerging and old ones are getting worse as new technologies and new uses of existing technologies increase rapidly. CEOs are particularly anxious about breaches in data security and ethics and IT outages and disruptions. A vast majority of CEOs are already taking steps to address these concerns, with larger-sized companies doing more than smaller-sized companies. a. Describe the population of interest. b. Describe the sample that was collected. c. Describe a parameter of interest. d. Describe the statistic used to estimate the parameter in (c). 1.50 The American Community Survey (www.census.gov/acs) provides data every year about communities in the United States. Addresses are randomly selected and respondents are required to supply answers to a series of questions. a. Describe a variable for which data are collected. b. Is the variable categorical or numerical? c. If the variable is numerical, is it discrete or continuous? 1.51 Download and examine Zarca Interactive’s “Sample Survey for Associations/Sample Questions for Surveys for Associations,” available at bit.ly/2p5HlGO. a. Give an example of a categorical variable included in the survey. b. Give an example of a numerical variable included in the survey. 1.52 Three professors examined awareness of four widely disseminated retirement rules among employees at the University of Utah. These rules provide simple answers to questions about retirement planning (R. N. Mayer, C. D. Zick, and M. Glaittle, “Public Awareness of Retirement Planning Rules of Thumb,” Journal of Personal Finance, 2011 10(62), 12–35). At the time of the investigation, there were approximately 10,000 benefited employees, and 3,095 participated in the study. Demographic data collected on these 3,095 employees included gender, age (years), education level (years completed), marital status, household income ($), and employment category. a. Describe the population of interest. b. Describe the sample that was collected. c. Indicate whether each of the demographic variables mentioned is categorical or numerical. 1.53 Social media provides an enormous amount of data about the activities and habits of people using social platforms like Facebook and Twitter. The belief is that mining that data provides a treasure trove for those who seek to quantify and predict future human behavior. A marketer is planning a survey of Internet users in the United States to determine social media usage. The objective of the survey is to gain insight on these three items: key social media platforms used, frequency of social media usage, and demographics of key social media platform users. a. For each of the three items listed, indicate whether the variables are categorical or numerical. If a variable is numerical, is it discrete or continuous? b. Develop five categorical questions for the survey. c. Develop five numerical questions for the survey. CHAPTER CASES Managing Ashland MultiComm Services Ashland MultiComm Services (AMS) provides high-quality telecommunications services in the Greater Ashland area. AMS traces its roots to a small company that redistributed the broadcast television signals from nearby major metropolitan areas but has evolved into a provider of a wide range of broadband services for residential customers. AMS offers subscription-based services for digital cable television, local and long-distance telephone services, and high-speed Internet access. Recently, AMS has faced competition from other service providers as well as Internet-based, on-demand streaming services that have caused many customers to “cut the cable” and drop their subscription to cable video services. 1 AMS management believes that a combination of increased promotional expenditures, adjustment in subscription fees, and improved customer service will allow AMS to successfully face these challenges. To help determine the proper mix of strategies to be taken, AMS management has decided to organize a research team to undertake a study. The managers suggest that the research team examine the company’s own historical data for number of subscribers, revenues, and subscription renewal rates for the past few years. They direct the team to examine year-to-date data as well, as the managers suspect that some of the changes they have seen have been a relatively recent phenomena. 1. What type of data source would the company’s own historical data be? Identify other possible data sources that the research Cases team might use to examine the current marketplace for residential broadband services in a city such as Ashland. 2. What type of data collection techniques might the team employ? 3. In their suggestions and directions, the AMS managers have named a number of possible variables to study, but offered no operational definitions for those variables. What types of possible misunderstandings could arise if the team and managers do not first properly define each variable cited? CardioGood Fitness CardioGood Fitness is a developer of high-quality cardiovascular exercise equipment. Its products include treadmills, fitness bikes, elliptical machines, and e-glides. CardioGood Fitness looks to increase the sales of its treadmill products and has hired The AdRight Agency, a small advertising firm, to create and implement an advertising program. The AdRight Agency plans to identify particular market segments that are most likely to buy their clients’ goods and services and then locate advertising outlets that will reach that market group. This activity includes collecting data on clients’ actual sales and on the customers who make the purchases, with the goal of determining whether there is a distinct profile of the typical customer for a particular product or service. If a distinct profile emerges, efforts are made to match that profile to advertising outlets known to reflect the particular profile, thus targeting advertising directly to high-potential customers. CardioGood Fitness sells three different lines of treadmills. The TM195 is an entry-level treadmill. It is as dependable as other models offered by CardioGood Fitness, but with fewer programs and features. It is suitable for individuals who thrive on minimal programming and the desire for simplicity to initiate their walk or hike. The TM195 sells for $1,500. The middle-line TM498 adds to the features of the entrylevel model two user programs and up to 15% elevation upgrade. The TM498 is suitable for individuals who are walkers at a transitional stage from walking to running or midlevel runners. The TM498 sells for $1,750. The top-of-the-line TM798 is structurally larger and heavier and has more features than the other models. Its unique features include a bright blue backlit LCD console, quick speed and incline keys, a wireless heart rate monitor with a telemetric chest strap, remote speed and incline controls, and an anatomical figure that specifies which muscles are minimally and maximally activated. This model features a nonfolding platform base that is designed to handle rigorous, frequent running; the TM798 is therefore appealing to someone who is a power walker or a runner. The selling price is $2,500. As a first step, the market research team at AdRight is assigned the task of identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The market research team decides to investigate whether there are differences across the product lines with respect to customer characteristics. The team decides to collect data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months. The team decides to use both business transactional data and the results of a personal profile survey that every purchaser 35 completes as their sources of data. The team identifies the following customer variables to study: product purchased—TM195, TM498, or TM798; gender; age, in years; education, in years; relationship status, single or partnered; annual household income ($); mean number of times the customer plans to use the treadmill each week; mean number of miles the customer expects to walk/ run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. For this set of variables: 1. Which variables in the survey are categorical? 2. Which variables in the survey are numerical? 3. Which variables are discrete numerical variables? Clear Mountain State Student Survey The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate students who attend CMSU. They create and distribute a survey of 14 questions and receive responses from 111 undergraduates (stored in StudentSurvey ). Download (see Appendix C) and review the survey document CMUndergradSurvey.pdf. For each question asked in the survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify whether it is discrete or continuous. Learning with the Digital Cases Identifying and preventing misuses of statistics is an important responsibility for all managers. The Digital Cases allow you to practice the skills necessary for this important task. Each chapter’s Digital Case tests your understanding of how to apply an important statistical concept taught in the chapter. As in many business situations, not all of the information you encounter will be relevant to your task, and you may occasionally discover conflicting information that you have to resolve in order to complete the case. To assist your learning, each Digital Case begins with a learning objective and a summary of the problem or issue at hand. Each case directs you to the information necessary to reach your own conclusions and to answer the case questions. Many cases, such as the sample case worked out next, extend a chapter’s Using Statistics scenario. You can download digital case files which are PDF format documents that may contain extended features as interactivity or data file attachments. Open these files with a current version of Adobe Reader, as other PDF programs may not support the extended features. (For more information, see Appendix C.) To illustrate learning with a Digital Case, open the Digital Case file WhitneyWireless.pdf that contains summary information about the Whitney Wireless business. Apparently, from the claim on the title page, this business is celebrating its “best sales year ever.” Review the Who We Are, What We Do, and What We Plan to Do sections on the second page. Do these sections contain any useful information? What questions does this passage raise? Did you notice that while many facts are presented, no data that would support the claim of “best sales year ever” are presented? And were those mobile “mobilemobiles” used solely 36 CHAPTER 1 | Defining and Collecting Data for promotion? Or did they generate any sales? Do you think that a talk-with-your-mouth-full event, however novel, would be a success? Continue to the third page and the Our Best Sales Year Ever! section. How would you support such a claim? With a table of numbers? Remarks attributed to a knowledgeable source? Whitney Wireless has used a chart to present “two years ago” and “latest twelve months” sales data by category. Are there any problems with what the company has done? Absolutely! Take a moment to identify and reflect on those problems. Then turn to pages 4 though 6 that present an annotated version of the first three pages and discusses some of the problems with this document. In subsequent Digital Cases, you will be asked to provide this type of analysis, using the open-ended case questions as your guide. Not all the cases are as straightforward as this example, and some cases include perfectly appropriate applications of statistical methods. And none have annotated answers! CHAPTER 1 EXCEL GUIDE EG1.1 DEFINING VARIABLES Classifying Variables by Type Microsoft Excel infers the variable type from the data one enters into a column. If Excel discovers a column that contains numbers, it treats the column as a numerical variable. If Excel discovers a column that contains words or alphanumeric entries, it treats the column as a non-numerical (categorical) variable. This imperfect method works most of the time, especially if one makes sure that the categories for a categorical variables are words or phrases such as “yes” and “no.” However, because one cannot explicitly define the variable type, Excel enables one to do nonsensical things such as using a categorical variable with a statistical method designed for numerical variables. If one must use categorical values such as 1, 2, or 3, enter them preceded with an apostrophe, as Excel treats all values that begin with an apostrophe as non-numerical data. (To check whether a cell entry includes a leading apostrophe, select the cell and view its contents in the formula bar.) EG1.2 COLLECTING DATA There are no Excel Guide instructions for Section 1.2. EG1.3 TYPES of SAMPLING METHODS Simple Random Sample Key Technique Use the RANDBETWEEN(smallest integer, largest integer) function to generate a random integer that can then be used to select an item from a frame. Example 1 Create a simple random sample with replacement of size 40 from a population of 800 items. Workbook Enter a formula that uses this function and then copy the formula down a column for as many rows as is necessary. For example, to create a simple random sample with replacement of size 40 from a population of 800 items, open to a new worksheet. Enter Sample in cell A1 and enter the formula =RANDBETWEEN(1, 800) in cell A2. Then copy the formula down the column to cell A41. Analysis dialog box, select Sampling from the Analysis Tools list and then click OK. In the procedure’s dialog box (shown below): 1. Enter A1:A801 as the Input Range and check Labels. 2. Click Random and enter 40 as the Number of Samples. 3. Click New Worksheet Ply and then click OK. Example 2 Create a simple random sample without replacement of size 40 from a population of 800 items. PHStat Use Random Sample Generation. For the example, select PHStat ➔ Sampling ➔ Random Sample Generation. In the procedure’s dialog box (shown below): 1. Enter 40 as the Sample Size. 2. Click Generate list of random numbers and enter 800 as the Population Size. 3. Enter a Title and click OK. Excel contains no functions to select a random sample without replacement. Such samples are most easily created using an add-in such as PHStat or the Analysis ToolPak, as described in the following paragraphs. Analysis ToolPak Use Sampling to create a random sam- ple with replacement. For the example, open to the worksheet that contains the population of 800 items in column A and that contains a column heading in cell A1. Select Data ➔ Data Analysis. In the Data Unlike most other PHStat results worksheets, the worksheet created contains no formulas. 37 38 CHAPTER 1 | Defining and Collecting Data Workbook Use the COMPUTE worksheet of the Random workbook as a template. The worksheet already contains 40 copies of the formula =RANDBETWEEN(1, 800) in column B. Because the RANDBETWEEN function samples with replacement as discussed at the start of this section, one may need to add additional copies of the formula in new column B rows until one has 40 unique values. When the intended sample size is large, spotting duplicate values can be hard. Read the Short Takes for Chapter 1 to learn more about an advanced technique that uses formulas to detect duplicate values. EG1.4 DATA CLEANING Key Technique Use a column of formulas to detect invalid variable values in another column. Example Scan the DirtyDATA worksheet in the Dirty Data workbook for invalid variable values. PHStat Use Data Cleaning. For the example, open to the DirtyData worksheet. Select Data Preparation ➔ Numerical Data Scan. In the procedure’s dialog box: 1. Enter a column range as the Numerical Variable Cell Range. 2. Click OK. The procedure creates a worksheet that contains a column that identifies every data value as either being numerical or non-numerical and states the minimum and maximum values found in the column. To scan for irregularities in categorical data, use the Workbook instructions. Workbook Use the ScanData worksheet of the Data Cleaning workbook as a model solution to scan for the following types of irregularities: non-numerical data values for a numerical variable, invalid categorical values of a categorical variable, numerical values outside a defined range, and missing values in individual cells. The worksheet uses several different Excel functions to detect an irregularity in one column and display a message in another column. For each categorical variable scanned, the worksheet contains a table of valid values that are looked up and compared to cell values to spot inconsistencies. Read the Short Takes for Chapter 1 to learn the specifics of the formulas the worksheet uses to scan data. EG1.5 OTHER DATA PREPROCESSING Stacking and Unstacking Variables PHStat Use Data Preparation ➔ Stack Data (or Unstack Data). For Stack Data, in the Stack Data dialog box, enter an Unstacked Data Cell Range and then click OK to create stacked data in a new worksheet. For Unstack Data, in the Unstack Data dialog box, enter a Grouping Variable Cell Range and a Stacked Data Cell Range and then click OK to create unstacked data in a new worksheet. Recoding Variables Key Technique To recode a categorical variable, first copy the original variable’s column of data and then use the findand-replace function on the copied data. To recode a numerical variable, enter a formula that returns a recoded value in a new column. Example Using the DATA worksheet of the Recoded workbook, create the recoded variable UpperLower from the categorical variable Class and create the recoded Variable Dean’s List from the numerical variable GPA. Workbook Use the RECODED worksheet of the Recoded workbook as a model. The worksheet already contains UpperLower, a recoded version of Class that uses the operational definitions on page 28, and Dean’s List, a recoded version of GPA, in which the value No recodes all GPA values less than 3.3 and Yes recodes all values 3.3 or greater. The RECODED_ FORMULAS worksheet in the same workbook shows how formulas in column I use the IF function to recode GPA as the Dean’s List variable. These recoded variables were created by first opening to the DATA worksheet in the same workbook and then following these steps: 1. Right-click column D (right-click over the shaded “D” at the top of column D) and click Copy in the shortcut menu. 2. Right-click column H and click the first choice in the Paste Options gallery. 3. Enter UpperLower in cell H1. 4. Select column H. With column H selected, click Home ➔ Find & Select ➔ Replace. In the Replace tab of the Find and Replace dialog box: 5. Enter Senior as Find what, Upper as Replace with, and then click Replace All. 6. Click OK to close the dialog box that reports the results of the replacement command. 7. Still in the Find and Replace dialog box, enter Junior as Find what and then click Replace All. 8. Click OK to close the dialog box that reports the results of the replacement command. 9. Still in the Find and Replace dialog box, enter Sophomore as Find what, Lower as Replace with, and then click Replace All. 10. Click OK to close the dialog box that reports the results of the replacement command. 11. Still in the Find and Replace dialog box, enter Freshman as Find what and then click Replace All. 12. Click OK to close the dialog box that reports the results of the replacement command. (This creates the recoded variable UpperLower in column H.) JMP Guide 13. Enter Dean’s List in cell I1. 14. Enter the formula =IF(G2 < 3.3, “No”, “Yes”) in cell I2. 15. Copy this formula down the column to the last row that contains student data (row 63). 39 The RECODED worksheet uses the IF function, that Appendix F discusses to recode the numerical variable into two categories. Numerical variables can also be recoded into multiple categories by using the VLOOKUP function (see Appendix F). (This creates the recoded variable Dean’s List in column I.) CHAPTER 1 JMP GUIDE JG1.1 DEFINING VARIABLES Classifying Variables by Type JMP infers the variable type and scale from the data one enters in a column. To override any inference JMP makes, first right-click a column name and select Column Info from the shortcut menu. In the column info dialog box, change the Data Type or Modeling Type (scale) to the type desired. Shown below is the column information dialog box for the Assets column in the Retirement Funds data table. Contents of this dialog box will vary depending on JMP inferences and the entries you make in the dialog box, but the dialog box will always contain Column Name as the JMP Guide in the previous chapter explains. For continuous numerical variables such as Assets, use Format to control the display of values. For the format Fixed Dec, the Dec box entry controls the number of decimal places to which values will be rounded in the column. In boxes that lists names of variable columns, JMP uses icons to represent the modeling type of the variable. Below are the icons for nominal (CustomerID), continuous (YTD Spending), ordinal (Texts Sent), and unstructured text (Last Text Message) modeling types, the subset of types that examples use in this book. (Open the Modeling Types data table to explore these choices.) JG1.2 COLLECTING DATA There are no JMP Guide instructions for Section 1.2. JG1.3 TYPES of SAMPLING METHODS Simple Random Sample and Stratified Sample Use Subset. To take a simple random sample of data table data, open the data table and select Tables ➔ Subset. In the Subset dialog box, click Random - sample size, enter the sample size of the sample, enter an Output table name, and click OK. In the illustration below, 20 has been entered as the sample size and Funds Simple Random Sample as the output table name. 40 CHAPTER 1 | Defining and Collecting Data To specify a stratified sample, check Stratify. JMP displays a column list box under this check box from which you choose the variable that will be used to define the strata. Systematic Sample Use a formula column to help identify every kth item and then use Subset to create the sample. For example, to take a systematic sample of n = 20 of the 800 Employees data table that contains 800 rows of data, first determine k = 40 (800 divided by 20). With the data table open: 1. Right-click Fund Number (name of first column) and select Insert Columns. The new column, Column 1, appears to the left of Fund Number column. 2. Right-click Column 1 heading and select Formula. 3. In the large formula composition pane of the Formula dialog box (shown below), enter Sequence(1, 40) and then click OK. JG1.4 DATA CLEANING Use a variety of techniques including Recode and Row Selection. For categorical variables, Recode can be used to spot several kinds of invalid variable names and coding errors. For example, open the DirtyDATA data table, select the Gender column and: 1. Select Cols ➔ Recode. 2. In the Recode dialog box, click the red triangle and select Group Similar Values. 3. In the Grouping Options dialog box, check all check boxes and click OK. Back in the Recode dialog box (shown below), JMP attempts to group together similar values. The success of the regrouping by JMP can vary, but regrouping facilitates your review, especially of data the contain many rows. Note that the first entry in the old and new values table for a cell that is blank: 4. Make entries in the New Values column as necessary. 5. Click the Done pull-down list and select New Column. JMP fills Column 1 with the recurring series of 1 to 40 in Column 1. Next choose a random number between 1 and 40 inclusive by any means and: 1. Click the data table Rows red triangle and select Clear Row States. 2. Click the data table Rows red triangle a second time and select Row Selection ➔ Select Where. In the Select Rows dialog box: 3. Select Column 1 from the column list. 4. Select equals from the first pull-down list. 5. Enter the random number that was selected in the edit box to the right of the pull-down and click OK. JMP selects a sample that contains the rows in which the Column 1 value matches the randomly chosen number. Continue to copy the rows to a new data table. With the rows still selected: 6. Select Tables ➔ Subset. 7. In the Subset dialog box, click Selected Rows and click OK. The sample appears in a new data table in its own window. JMP places the corrected data in a new column, preserving the original dirty data in its original column. To identify non-numerical data in a “numerical” column, changing the column data type to Numeric in the Column Info dialog box (see Section JG1.1). This will cause JMP to change all non-numerical data to missing values. The process works because JMP assigns the character data type to any column that contains non-numerical data. To identify numeric values that are outside a defined range, Select the data table Rows red triangle and then select Row Selection ➔ Select Where. In the Select Rows dialog box: 1. Select the variable column to be range-checked. 2. Select a relationship from the pull-down list and enter the appropriated comparison value in the edit box. 3. Click Add Condition to add the condition to the Selected Conditions list. Repeat steps 1 through 3 for as many times as necessary. For the DirtyDATA data table, the variable GPA has a defined range of 0 through 4 inclusive. The conditions listed in the Select Row dialog box (shown at top of next page) check for GPA values outside that range. MINITAB Guide 41 JG1.5 OTHER PREPROCESSING TASKS Stacking and Unstacking Variables Use Stack or Split. To stack data, select Tables ➔ Stack. In the Stack dialog box, select the (unstacked) variable columns, click Stack Columns, and click OK. The stacked data appears in a new column, with the names of columns that were stacked as the values in the Label column. To unstack data, select Tables ➔ Split. In the Split dialog box, select the categorical variable that holds grouping information and click Split By, select the numerical column (or columns) to unstack and click Split Columns and then click OK. The unstacked data appears in a new data table. JMP scripts and add-ins exist that semi-automate data cleaning and spot other errors that this section does not address, sometimes using JMP techniques beyond the scope of this book to explain. Recoding Variables Use Recode. To recode the values of either a categorical or numerical variable, first select the variable column and then select Cols ➔ Recode (in older JMP versions, Cols ➔ Utilities ➔ Recode). In the Recode dialog box, JMP lists all unique values found in the column and display form in which you can change one or more of those values. When you finish making changes, click the Done pull-down list and select New Column. Recoded values appear in a new column. CHAPTER 1 MINITAB GUIDE MG1.1 DEFINING VARIABLES Classifying Variables by Type Minitab infers the variable type from the data one enters into a column as Section MG.2 “Entering Data” explains. Sometimes, Minitab will misclassify a variable, for example, mistaking a numerical variable for a categorical (text) variable. In such cases, select the column, then select Data ➔ Change Data Type, and then select one of the choices, for example, Text to Numeric for the case of when Minitab has mistaken a numerical variable as a categorical variable. MG1.2 COLLECTING DATA There are no Minitab Guide instructions for Section 1.2. MG1.3 TYPES of SAMPLING METHODS Simple Random Samples Use Sample From Columns. For example, to create a simple random sample with replacement of size 40 from a population of 800 items, first create the list of 800 employee numbers in column C1. Select Calc ➔ Make Patterned Data ➔ Simple Set of Numbers. In the procedure’s dialog box (shown below): 1. 2. 3. 4. Enter C1 in the Store patterned data in box. Enter 1 in the From first value box. Enter 800 in the To last value box. Verify that the three other boxes contain 1 and then click OK. 42 CHAPTER 1 | Defining and Collecting Data With the worksheet containing the column C1 list still open: 5. Select Calc ➔ Random Data ➔ Sample from Columns. In the Sample From Columns dialog box (shown below): 6. Enter 40 in the Number of rows to sample box. 7. Enter C1 in the From columns box. 8. Enter C2 in the Store samples in box. 9. Click OK. MG1.4 DATA CLEANING Minitab cleans the data when one imports data by opening a file created by another application, such as a workbook file created by Excel. For an existing worksheet, use a combination of commands and column formulas to count the number of missing values for a variable, change invalid categorical values of a categorical variable to a missing value, and identify numerical values that are outside a defined range. In the import method, select data cleaning options in the file open dialog box. The cleaning options vary according to the type of file being imported. For an Excel workbook, one can specify which values represent missing values and instruct Minitab to skip a blank row, add missing values to uneven columns, remove nonprintable characters and extra spaces, and correct case mismatches. Read the Short Takes for Chapter 1 to learn the specifics of the Minitab commands and formulas that one can use to scan data. MG1.5 In the Replace in Data Window dialog box: 3. Enter Senior as Find what, Upper as Replace with, and then click Replace All. 4. Click OK to close the dialog box that reports the results of the replacement command. 5. Still in the Find and Replace dialog box, enter Junior as Find what (replacing Senior), and then click Replace All. 6. Click OK to close the dialog box that reports the results of the replacement command. 7. Still in the Find and Replace dialog box, enter Sophomore as Find what, Lower as Replace with, and then click Replace All. 8. Click OK to close the dialog box that reports the results of the replacement command. 9. Still in the Find and Replace dialog box, enter Freshman as Find what, and then click Replace All. 10. Click OK to close the dialog box that reports the results of the replacement command. To create the recoded variable Dean’s List from the numerical variable GPA (C7), with the DATA worksheet of the Recode project still open: 1. Enter Dean’s List as the name of the empty column C8. 2. Select Calc ➔ Calculator. In the Calculator dialog box (shown below): 3. Enter C8 in the Store result in variable box. 4. Enter IF(GPA < 3.3, "No", "Yes") in the Expression box. 5. Click OK. OTHER PREPROCESSING TASKS Recoding Variables Use the Replace command to recode a categorical variable and Calculator to recode a numerical variable. For example, to create the recoded variable UpperLower from the categorical variable Class (C4-T), open to the DATA worksheet of the Recode project and: 1. Select the Class column (C4-T). 2. Select Editor ➔ Replace. Variables can also be recoded into multiple categories by using the Data ➔ Code command. Read the Short Takes for Chapter 1 to learn more about this advanced recoding technique. CHAPTER 1 TABLEAU GUIDE TG1.1 DEFINING VARIABLES Classifying Variables by Type Tableau infers three attributes of a column from the data that a column contains. Tableau assigns a data role to each column, initially considering any column that contains categorical data to be a Dimension and any column that contains numerical data to be a Measure. The data role of the column, in turn, affects how Tableau processes the column data as well as which choices Tableau presents in the Show Me visualization gallery (see Chapter 2). Tableau also classifies a column as either having discrete or continuous data. Tableau uses the terms discrete and continuous in the sense that this chapter defines, except that Tableau considers all categorical data as discrete data, too. Tableau colors a column name green if the column contains continuous data and colors a column name blue if the column contains discrete data. Tableau also determines whether the data type in the column is one of several data types. In the Data tab, Tableau displays a hashtag icon (#) before the names of columns that contain whole or decimal numbers, displays an Abc icon before the names of columns that contain alphanumeric data and displays a calendar icon before the names of columns that contain date or date and time data. Tableau precedes a data type icon with an equals sign if the column contains the results of a calculation or represents data copied from another column. Tableau presents a Dimension or Measure name in italics if the name represents a filtered set. In the illustration below, Number of Records in the Measures list is a calculation that is also a filtered set. To change the data role assigned to a column, right-click the column name and select Convert to Dimension or Convert to Measure from the shortcut menu. To change the assignment of discrete or continuous to a column, right-click the column name and select Convert to Continuous or Convert to Discrete from the shortcut menu. To change the assigned data type, rightclick the column name and select Change Data Type and then select the appropriate choice from the shortcut menu. For example, in the illustration above Tableau has classified Region Code as a continuous (green) measure that has whole or decimal number data. Region Code is a recoding of Region that uses the numerals 1, 2, 3, and 4 to replace categorical values. Therefore, Region Code should be a Dimension that contains discrete, alphanumeric data. For the column’s shortcut menu, the three right-click selections Convert to Dimension, Convert to Discrete, and Change Data Type ➔ String would be necessary to proper reclassify Region Code. (“String” is the Tableau term for alphanumeric data.) TG1.2 COLLECTING DATA There are no Tableau Guide instructions for this section. TG1.3 TYPES of SAMPLING METHODS There are no Tableau Guide instructions for this section. TG1.4 DATA CLEANING Use the Data Interpreter. For certain types of structured data, such as Excel worksheets and Google sheets, Tableau can apply its Data Interpreter procedure to examine data for common types of error. For an Excel workbook, the interpreter will not only import corrected values but insert a series of worksheets that documents its work into the workbook that is being used. To use the interpreter, check Use Data Interpreter in the left panel of a Data Source display (shown below left). The panel display changes to Cleaned with Data Interpreter (shown below right), when the data cleaning finishes and Tableau has imported the data to be analyzed (in the illustration, the data from the Data worksheet of the Retirement Funds Excel workbook.) TG1.5 OTHER PREPROCESSING TASKS There are no Tableau Guide instructions for this section. 43 Get Complete eBook Download Link below for instant download https://browsegrades.net/documents/2 86751/ebook-payment-link-for-instantdownload-after-payment