MIS5401: Data Analytics for Management
What is Analytics? Principles of Data Visualization
Sunil Wattal

Focus of this Course
Analyze, Communicate, Manage …using data

Discuss (5 minutes)
• From reading Matchmaking with Math, what is a good definition of analytics?
• From reading Unwin, what are some guidelines for good data visualizations?
• From reading Davenport, how do analytics and visualization facilitate storytelling?

Analytics is
Extracting information from data
Discovering meaningful patterns

What is Big Data?
What is the role of Hadoop and MapReduce?
By the way, you don’t need Big Data to do analytics!

The Three “V”s of Big Data
Velocity
Volume
Variety

• Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.
• Manufacturers are also monitoring social networks, but with a different goal than marketers: They are using it to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
• Financial Services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.
• Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums.
• Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed, and which ones need a validating in-person visit from an agent.
• Hospitals are analyzing medical data and patient records to predict those patients that are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
• Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
• Sports teams are using data for tracking ticket sales and even for tracking team strategies.
• Pause: what other examples can you think of?

Case Study: Matchmaking with Math
What did they do?
How did they do it?
What do you think of the result?
What do you think of the process?

Communicating using data
Minard’s map of Napoleon’s campaign into Russia, 1869
Reprinted in Tufte (2009), p. 41
What can you learn from this map?

http://www.popvssoda.com/countystats/total-county.html

Some basic principles (adapted from Tufte 2009)
1. The chart should tell a story
2. The chart should have graphical integrity
3. The chart should minimize graphical complexity
Tufte’s fundamental principle: Above all else show the data

Principle 1: Graphics should tell a story
Clear
Enable comparison
Yield insight

Roumeliotis’ steps for communicating analysis
My understanding of the business problem
How will I measure the business impact?
What’s the available data?
The initial solution hypothesis
The solution
The business impact of the solution
But avoid communicating:
• Technical terminology
• Step-by-step methodology
• Complex statistics
All of these inform the data visualization
From: Davenport, T. (2013). Telling a Story with Data. Deloitte University Press.

What story does this tell?
Causes of death in the Crimean war:
Blue – preventable diseases
Red – wounds
Black – other
http://en.wikipedia.org/wiki/Florence_Nightingale

What story does this tell?
http://fivethirtyeight.com/datalab/most-of-the-u-s-saw-little-recovery-in-2013/

What story do these tell?
http://fivethirtyeight.com/features/the-potential-bubble-the-federal-reservecares-most-about/
Finding the next “bubble”…

Principle 2: The chart should have graphical integrity
• Basically, it shouldn’t “lie” (mislead the reader)
• Tufte’s “Lie Factor”:
– $\text{Lie Factor} = \dfrac{\text{size of effect shown in graphic}}{\text{size of effect in data}}$
Should be ~ 1
< 1 = understated effect
> 1 = exaggerated effect

“Lie factor”: Visualizing Area
A = $500, B = $250
• WIDTH of A is twice width of B, but AREA of A is 4x area of B! Lie factor of 2: (4 ÷ 2) = 2
vs.
• WIDTH of A is 1.4x width of B, so area of A is now 2x area of B! Lie factor of 1 (no lie!): (2 ÷ 2) = 1
Adapted from Krum, R. (2014). Cool infographics: Effective Communication with Data Visualization.

“Lie factor”: Axes should include “0”
[Figure: two versions of the bar chart “2012 Presidential Run: GOP Candidates” (Huckabee 63%, Palin 70%, Romney 60%), shown side by side for comparison]

Other tips to avoid “lying”
Adjust for inflation
[Figure: “Hypothetical Industries, Inc.”, Revenue and Adjusted Revenue by Year, 2003 to 2010]
Make sure the context is presented
[Figure: “Hypothetical City Crime”, thefts per 100000 citizens, 2009 to 2010 only vs. 2003 to 2010]

Principle 3: The chart should minimize graphical complexity
Generally, the simpler the better…
Key concepts
Sometimes a table is better
Data-ink

When a table is better than a chart
For a few data points, a table can do just as well…
[Figure: bar chart “Total Sales by Salesperson”, axis from $0.00 to $250,000.00]

Total Sales by Salesperson
Salesperson    Total Sales
Peacock        $225,763.68
Leverling      $201,196.27
Davolio        $182,500.09
Fuller         $162,503.78
Callahan       $123,032.67
King           $116,962.99
Dodsworth       $75,048.04
Suyama          $72,527.63
Buchanan        $68,792.25

The table carries more information in less space and is more precise.
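
The Lie Factor arithmetic from the $500 vs. $250 area example under Principle 2 can be checked with a few lines of Python. This is only a sketch: the function names `area_ratio` and `lie_factor` are hypothetical, and it assumes the drawn shapes are scaled in both dimensions, so that area grows with the square of the width (as in the slide's example).

```python
import math


def area_ratio(width_a, width_b):
    """Ratio of drawn areas, assuming both shapes scale in width AND height."""
    return (width_a / width_b) ** 2


def lie_factor(effect_in_graphic, effect_in_data):
    """Tufte's Lie Factor: effect shown in the graphic / effect in the data."""
    if effect_in_data == 0:
        raise ValueError("effect in data must be non-zero")
    return effect_in_graphic / effect_in_data


# Data from the slide: A is $500, B is $250, so A is 2x B.
data_ratio = 500 / 250

# Case 1: A is drawn twice as wide as B, so its area is 4x.
doubled_width = lie_factor(area_ratio(2.0, 1.0), data_ratio)

# Case 2: A is drawn sqrt(2) (about 1.4) times as wide, so its area is 2x.
sqrt2_width = lie_factor(area_ratio(math.sqrt(2), 1.0), data_ratio)

print(f"Width doubled:   lie factor = {doubled_width:.2f}")  # 2.00
print(f"Width x sqrt(2): lie factor = {sqrt2_width:.2f}")    # 1.00
```

When a value is encoded as area, scaling the width by the square root of the data ratio keeps the Lie Factor at about 1.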
Data Ink
• The amount of “ink” devoted to data in a chart
• Tufte’s Data-Ink ratio:
– $\text{Data-ink ratio} = \dfrac{\text{data-ink}}{\text{total ink used in graphic}}$
Should be ~ 1
< 1 = more non-data related ink in graphic
= 1 implies all ink devoted to data
Tufte’s principle: Erase ink whenever possible

Being conscious of data ink
Note the use of labels and legends
[Figure: versions of the “Hypothetical City Crime” chart (thefts per 100000 citizens, 2003 to 2010), from a lower data-ink ratio (worse) to a higher data-ink ratio (better)]

Evaluate these graphics.
http://www.nytimes.com/2014/09/11/world/middleeast/obama-speech-isis.html
http://fivethirtyeight.com/features/skeptical-footballmanning-vs-messi-and-andrew-luck-experiment/
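
As a rough illustration of Tufte's data-ink ratio from the Data Ink slide, the sketch below computes the ratio before and after erasing non-data ink. The function name `data_ink_ratio` and the pixel counts are hypothetical, chosen only to show the arithmetic; in practice "ink" is hard to measure exactly, so treat this as a way of thinking rather than a measurement tool.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of a graphic's ink that is devoted to data (between 0 and 1)."""
    if total_ink <= 0:
        raise ValueError("total ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data ink must be between 0 and total ink")
    return data_ink / total_ink


# Hypothetical chart: 250 px of data ink (the line itself) plus
# 750 px of gridlines, borders, and background shading.
before = data_ink_ratio(data_ink=250, total_ink=1000)

# Same chart after erasing most of the non-data ink:
# the 250 px of data ink stays, only 150 px of non-data ink remains.
after = data_ink_ratio(data_ink=250, total_ink=400)

print(f"Before erasing: {before:.3f}")  # 0.250
print(f"After erasing:  {after:.3f}")   # 0.625
```

The data ink does not change between the two cases; the ratio improves only because non-data ink was erased, which is the point of "Erase ink whenever possible".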