What Issues do Big Data Present for Business Education?
Bob Andrews (Virginia Commonwealth University)

Relevant Statements for Big Data Statistics
Analytics “describes any use of data and statistical analysis to drive business decisions from data whether the purpose is predictive or simply descriptive.” Sharpe, DeVeaux & Velleman 3rd edition
“Data mining refers to extracting useful knowledge from what may otherwise appear to be an overwhelming amount of noisy data.” Stine & Foster

Definition of Statistics
“Statistics is the science and art of extracting answers from data. Some of the answers do require numbers and formulas, but you can also do every statistical analysis with pictures - graphs and tables. … in this course you’ll learn how to use statistics to interpret data and answer interesting questions.” Stine & Foster

Definition of Statistics
“Statistics is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world.” Sharpe, DeVeaux & Velleman 2nd edition
Statistics is the science of uncertainty. (Andrews)

The Signal and the Noise by Nate Silver
“Finding patterns is easy in any kind of data-rich environment; … The key is in determining whether the patterns represent noise or signal.” (pg. 240)
“… sampling error does not always tell the whole story …” (pg. 252)
“If you’re using a biased instrument, it doesn’t matter how many measurements you take - you’re aiming at the wrong target.” (pg. 253)
“… the era of Big Data only seems to be worsening the problems of false positive findings in the research literature.” (pg. 253)

The Signal and the Noise by Nate Silver
“Essentially, the frequentist approach toward statistics seeks to wash its hands of the reason that predictions most often go wrong: human error. It views uncertainty as something intrinsic to the experiment rather than something intrinsic to our ability to understand the real world.
The frequentist method also implies that, as you collect more data, your error will eventually approach zero; this will be necessary and sufficient to solve any problem.” (pg. 253)

The Signal and the Noise by Nate Silver
“Fisher’s notion of statistical significance, which uses arbitrary cutoffs devoid of context to determine what is a ‘significant’ finding and what isn’t, is much too clumsy …” (pg. 256)
“… some professions have considered banning Fisher’s hypothesis test from their journals.” (pg. 260)
“The Null Hypothesis Testing Controversy in Psychology,” JASA, December 1999 by David H. Krantz

The New Statistics
Geoff Cumming
http://pss.sagepub.com/content/25/1/7
“in response to renewed recognition of the severe flaws of null-hypothesis significance testing (NHST), we need to shift from reliance on NHST to estimation and other preferred techniques.”
“The new statistics refers to recommended practices, including estimation based on effect sizes, confidence intervals, and meta-analysis.”

The Signal and the Noise by Nate Silver
“The goal of any predictive model is to capture as much signal as possible and as little noise as possible. Striking the right balance is not always so easy, and our ability to do so will be dictated by the strength of the theory and the quality and quantity of the data. In economic forecasting, the data is very poor and the theory is weak, hence Armstrong’s argument that the more complex you make the model the worse the forecast gets.” (pg. 388)
(He is referring to Scott Armstrong of the Wharton School of U. Penn.)
“What matters most, as always, is how well the predictions do in the real world.”

Wikipedia definition of Big Data 2-19-2014
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
These are NOT issues to be addressed in introductory statistics.

A Statistician’s Definition of Big Data
Michael Horrigan from the Bureau of Labor Statistics sees “Big Data as nonsampled data, characterized by the creation of databases from electronic sources whose primary purpose is something other than statistical inference.”
Horrigan, Michael W., “Big Data: A Perspective from the BLS,” Amstatnews, Issue #427 (January 2013), pp. 25-27.

The V’s of Big Data
3 V’s: Volume, Velocity & Variety
4th V: Veracity, Validity or Verification
5th V: Value

Sources of Uncertainty/Variation
1. Standard Error of the Statistic
2. Uncertainty surrounding the veracity of the data used to calculate the Standard Error
For Big Data the second source of uncertainty becomes much more important.

What is Driving Big Data Use?
What is the value from Big Data?
It’s Better Business Decision Making
Consider these titles of Tom Davenport’s books
Competing on Analytics: The New Science of Winning
Analytics at Work: Smarter Decisions, Better Results

Big Data is about Decision Making
Big Data is not about Hypothesis Testing.

Statistics Instruction for Big Data should have more emphasis on
Statistical Thinking rather than statistical mechanics.
Graphing and Visualization to effectively communicate the data’s story.
Decision Making rather than hypothesis testing
Determining the veracity/validity of the data for making decisions about the phenomenon of interest. (Understanding implications of data being obtained over time)
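
Illustration: Standard Error vs. Data Veracity
The two sources of uncertainty listed under Sources of Uncertainty/Variation can be made concrete with a small simulation. The sketch below is not from the original slides. It assumes a hypothetical population with true mean 50 and standard deviation 10, and a hypothetical recording bias of +2 units standing in for nonsampled data whose veracity is in doubt. The function name simulate and all numeric values are illustrative assumptions.

```python
import random
import statistics

TRUE_MEAN = 50.0   # assumed population mean (illustrative)
TRUE_SD = 10.0     # assumed population standard deviation (illustrative)


def simulate(n, bias=0.0, seed=0):
    """Return (sample mean, estimated standard error) for n recorded values.

    `bias` is a hypothetical systematic shift added to every recorded
    value, standing in for a data source that does not measure the
    phenomenon of interest correctly.
    """
    rng = random.Random(seed)
    values = [rng.gauss(TRUE_MEAN, TRUE_SD) + bias for _ in range(n)]
    mean = statistics.fmean(values)
    # Usual standard error of the mean: s / sqrt(n)
    se = statistics.stdev(values) / n ** 0.5
    return mean, se


results = {}
for n in (100, 10_000, 100_000):
    results[n] = simulate(n, bias=2.0)
    mean, se = results[n]
    # The standard error shrinks as n grows; the error caused by the
    # bias does not.
    print(f"n={n:>7,}  mean={mean:6.2f}  SE={se:.4f}  "
          f"error vs. truth={mean - TRUE_MEAN:+.2f}")
```

Under these assumptions the reported standard error falls toward zero as n grows, while the estimate stays about 2 units away from the true mean. The standard error describes only the first source of uncertainty and says nothing about the second.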