Introduction
Malathi Veeraraghavan
Professor
Charles L. Brown Dept. of Electrical and Computer Engineering
University of Virginia
mvee@virginia.edu

Outline
• Increasing interest in data
• Course: From Data to Knowledge
• Summary

“The data deluge”
“Data, data everywhere”
• Economist special issue, Feb. 27-Mar. 5, 2010
• Walmart's databases alone are estimated at more than 2.5 petabytes (2010 figure; a petabyte is 1 million gigabytes)
• From businesses to governments, data collection and analysis is rapidly becoming the next big thing.
• 2012: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all

“The data deluge”
• “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”
• Hal Varian, Google’s chief economist, notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”

Business intelligence
• Nestlé sells more than 100,000 products in 200 countries using 550,000 suppliers
• Problem: it was not using its huge buying power effectively
• Used SAP software to analyze its data
• For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year
• Annual savings from such operational improvements: $1 billion
Economist special issue

Medical use
• Dr. Carolyn McGregor from the University of Ontario Institute of Technology
• Goal: spot potentially fatal infections in premature babies
• Monitors subtle changes in 7 streams of real-time data, such as heart rate and blood pressure
• ECG alone takes 1000 readings/second
• Infections are detected before obvious symptoms emerge
• The naked eye cannot see the changes, but the computer can!
• Who programs these? Stats experts.
• Another term: Evidence-Based Medicine
Economist special issue

Government usage
• An add-on to a 1986 law required firms to disclose the harmful chemicals they release.
• Once the public started tracking these numbers, American businesses reduced their emissions of the chemicals covered under the law by 40% by 2000
Economist special issue

Best-sellers
• “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres
• “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis
• “The Long Tail” by Chris Anderson
• Malcolm Gladwell's books, e.g., “Outliers”
• “Microtrends” by Mark Penn (elections)
• “Freakonomics” by S. Dubner and S. Levitt

Moneyball example
• 2002 season: the richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it?
• Billy Beane, general manager of the Oakland A’s
  – Respected statistics
  – Hired Paul DePodesta, a Harvard graduate, who applied Bill James’ formulas and selected players based on their statistics
  – Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks)
  – Jeremy Brown: the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight
  – Scouts vs. statisticians!
• The tendency of everyone to generalize wildly from his own experience: most people think their own experience is typical!

Malcolm Gladwell's “Outliers”: hockey players story
• Why Canadian hockey players born early in the year have a big advantage: the age cutoff date was Jan. 1
• ESPN conducted a little study of all 2008-season NHL players born from 1980 to 1990. [Later disputed for 2011 players]
• Sure enough, many more were born early in the year than late:

  Month   Players     Month   Players
  Jan.    51          Jul.    36
  Feb.    46          Aug.    41
  Mar.    61          Sep.    36
  Apr.    49          Oct.    34
  May     46          Nov.    33
  June    49          Dec.    30

http://sports.espn.go.com/espn/page2/story?page=merron/081208

Examples from “The Long Tail”
• Rhapsody, an online music store that had 1.5M tracks in Dec. 2005, reported that even its 100,000th most popular track was downloaded thousands of times per month, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks.
• Rhapsody reports that 40% of its total sales came from Long Tail products, i.e., those not available in retail stores.
• Anderson gives several such examples, calling these businesses Long-Tail aggregators
  – Google as the long-tail aggregator of advertising
  – eBay of goods
  – Amazon of books
  – Apple of music
  – Netflix of movies

Experts vs. intuition
• Ian Ayres’ book
  – “The future belongs to people like Wolfers who are comfortable with both intuition and numbers”
  – Wolfers analyzed 44,000 college basketball games (over 16 years)
• Also see Jonah Lehrer’s “How We Decide”, another bestseller
Ian Ayres’ book, page 220

What Wolfers did
• Plot the density function of how games finished relative to the Las Vegas spread
  – Perfect normal bell curve!
• Look only at games with point spreads of 12 or less
  – Perfect normal bell curve
• Look at games with point spreads > 12
  – 47% chance that the favored team beat the spread (53% failed to cover the spread)
  – More than 20% of games fell in this category
  – Is it point shaving?
• Look at the score five minutes before the end of the game
  – Right on track to beat the spread 50% of the time!
  – Indeed a stronger case for point shaving
Ian Ayres’ book, page 216

2SD Rule: To understand variability
• There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean
• Statistical significance is a simple, intuitive concept: there is less than a 5% chance that a normally distributed random variable will be more than two standard deviations away from its mean.
• Stanford Law School students knew that professors were required to give a 3.2 mean grade. They wanted to know if the professor was a “spreader” or a “clumper”!
Ian Ayres’ book, page 221

Technology trends enabling all this data analysis
• Cloud computing
  – Amazon, Google, Yahoo, Microsoft
• Open source software
  – R programming language (NY Times article, Jan. 7, 2009)
  – Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers
Economist special issue

Technology or techniques?
• Moore’s Law
  – Processing power doubles every two years
  – Supercrunching does need CPUs, but computing power has been available
• More important: Kryder’s Law
  – Storage capacity of hard drives has been doubling every two years
  – Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate
Ian Ayres’ book, page 151

Three techniques
• Regressions
  – $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
  – Error term $\epsilon_i \sim N(0, \sigma^2)$
• Randomization
  – Run experiments by treating different samples in different ways
• Neural networks
  – Functional form is not assumed to be linear or anything specific
Ian Ayres’ book

Course material
• From Data to Knowledge
• Focus on data sets
• Less on details of statistical techniques
• Learn R programming through class-provided R programs and assignments
• http://www.ece.virginia.edu/mv/edu/D2K/index.htm

Summary
• Importance of data analysis in every walk of life!
• How to extract the “story” hidden in the data set?
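
Worked example: regression and the 2SD rule
The regression model from the “Three techniques” slide and the 2SD rule can be tried out on simulated data. The sketch below is illustrative only: it uses Python rather than the course's R, and the coefficient values (beta0 = 2.0, beta1 = 0.5, sigma = 1.0), the sample size, the seed, and all function names are assumptions made up for this example, not taken from the course material. It fits a straight line by ordinary least squares and then checks what share of the residuals falls within two standard deviations of their mean.

```python
import random
import statistics


def simulate(n=10_000, beta0=2.0, beta1=0.5, sigma=1.0, seed=42):
    """Draw n points from y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2).

    All parameter values are made up for illustration.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (b0, b1)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def share_within_two_sd(values):
    """Fraction of values within two sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= 2 * sd)
    return inside / len(values)


xs, ys = simulate()
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
share = share_within_two_sd(residuals)

print(f"estimated intercept: {b0:.3f}")
print(f"estimated slope:     {b1:.3f}")
print(f"share of residuals within 2 SD: {share:.3f}")
```

With normally distributed errors, the estimates should land near the assumed coefficients and the share should come out close to 95%, which is what the 2SD rule predicts.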