DATA CONFUSION
How to confuse yourself and others with Data Analysis

AGENDA FOR TODAY’S TALK
• Good Graphs – Bad Graphs
• The Law of Averages
• PTBD Analysis
• Enumerative & Analytical Problems
• PARC Analysis
• Wrong Methods of Analysis

“There are three kinds of lies: Lies, damned lies and statistics”
Attributed to Benjamin Disraeli by Mark Twain

GOOD GRAPHS AND BAD GRAPHS

DATA RELEVANCE
• Graphs are only as good as the data they display
• No amount of creativity can produce good graphs from dubious data

DATA CONTENT
• Don’t produce graphs from very small amounts of data
• The human brain can grasp 1, 2 or 3 numbers without a graph

RULES FOR PRODUCING GOOD GRAPHS
• KEEP IT SIMPLE AND STUPID – Jesse Ventura
• Tell the truth – don’t distort the data

GOOD GRAPHS
• Portray information without distortion
• Contain no distracting elements
  – No false third dimensions, irrelevant decoration, or colour (chartjunk)
• Use an appropriate scale
• Label axes and tick marks properly, including measurement units
• Have a descriptive title and/or caption and legend
• Have a low ink-to-information ratio

GOOD GRAPH / BAD GRAPH
[Figures: two versions of "Temperature (degC) of Air and Subject during one day", plotting Temperature (degC) against Time of Day (6 am, Noon, 6 pm, Midnight, 6 am) for the series Air and Subject; one y-axis runs from 15 to 40, the other from 0 to 40]

EVEN BETTER GRAPH / BAD GRAPH
[Figures: the same temperature chart, one version with the lines labelled "air" and "subject" directly and a y-axis from 15 to 40, the other with a legend and a y-axis from 0 to 100]

BAD GRAPH / GOOD GRAPH / GOOD GRAPH
[Figures: a chart of A, B, C, D, E with a y-axis from 0 to 18; "Boxplot of A, B, C, D, E"; "Dotplot of A, B, C, D, E"; the boxplot and dotplot show Data over roughly 11.5 to 17]
Project: more chewchat data.MPJ; Worksheet: Worksheet 1; 12/04/2006; Graham Errington

GRAPHS THAT CONFUSE
[Figures: three "MONTHLY REJECTS" charts of No. REJECTS by MONTH (Jan to Dec) for the years 2001 to 2005, with y-axes running 0 to 9000, 0% to 100%, and 0 to 35000]

CHART JUNK
[Figures: further "MONTHLY REJECTS" charts for 2001 to 2005, with a data label printed on every point and y-axes running 0 to 8000, 0 to 35000, and 0% to 100%]

GRAPHS THAT TELL A STORY
[Figure: "Time Series Plot of No. REJECTS", No. REJECTS against MONTH and YEAR, Jan 2001 to Nov 2005]
Project: Untitled; Worksheet: Worksheet 3; 04/04/2006; Graham Errington
[Figure: "I-MR Chart of No. REJECTS by YEAR", Jan 2001 to Nov 2005, with several points flagged]
  Individual Value: UCL = 8406, X-bar = 5926, LCL = 3445
  Moving Range:     UCL = 3048, MR-bar = 933, LCL = 0
Project: Data for ChewChat 13 Apr 2006.MPJ; Worksheet: Worksheet 3; 04/04/2006; Graham Errington

HISTOGRAMS
[Figure: "Histogram", Frequency (0 to 20) against Bin, with bin labels 3505, 4757.8571, 6010.7142, 7263.5714]
[Figure: "Histogram of C20", Frequency (0 to 14) against Data (4000 to 8000)]
• No meaningless gaps
• Reasonable choice of bins
• Easy to choose or adjust bins
• Good aspect ratio
• Meaningful labels on axes
• Appropriate labels on bin tick marks
Project: DATA FOR CHEWCHAT 13 APR 2006.MPJ; Worksheet: Worksheet 1; 07/04/2006; Graham Errington

TRENDING RANDOM VARIATION
[Figure: a series annotated with the labels “Upward trend”, “Setback”, “Downturn”, “Turnaround”, “Rebound”, “Downward trend”]

THE LAW OF AVERAGES
“If I sit in a freezer and plunge my head into a pan of boiling chip fat ..... on average, I’m quite comfortable.”

SHEWHART’S RULES FOR PRESENTATION OF DATA
• Rule One – Data should always be presented in a way that preserves the evidence in the data
• Rule Two – When an average, standard deviation or histogram is used to summarize data, the user should not be misled into taking action they would not take if the data were presented in a time series

USING THE WRONG METHODS

Process:      A      B      C      D
 1        11.85  11.85  11.75  12.14
 2        11.83  11.86  11.95  12.01
 3        11.87  11.87  11.80  11.88
 4        11.84  11.87  11.94  12.07
 5        11.85  11.88  11.95  11.95
 6        11.86  11.89  12.00  11.87
 7        11.85  11.89  12.05  12.06
 8        11.85  11.90  11.85  11.94
 9        11.84  11.92  11.94  11.84
10        11.86  11.91  11.85  12.05
11        12.05  11.93  12.05  11.93
12        12.06  11.93  11.85  11.83
13        12.03  11.95  12.05  12.04
14        12.02  11.97  11.95  11.92
15        12.03  11.96  11.95  11.82
16        12.04  11.99  11.95  12.03
17        12.06  12.00  11.85  11.91
18        12.06  12.00  12.10  11.81
19        12.04  12.16  12.00  12.01
20        12.08  12.25  12.15  11.81

Descriptive Statistics: A, B, C, D

Variable   N    Mean  StDev  CoefVar  Minimum  Maximum
A         20  11.950  0.102     0.85    11.83    12.08
B         20  11.950  0.100     0.84    11.85    12.25
C         20  11.950  0.102     0.86    11.75    12.15
D         20  11.950  0.100     0.84    11.81    12.14

NO SIGNIFICANT DIFFERENCE HERE!

One-way ANOVA: A, B, C, D

Source  DF      SS      MS     F      P
Factor   3  0.0000  0.0000  0.00  1.000
Error   76  0.7746  0.0102
Total   79  0.7746

S = 0.1010   R-Sq = 0.00%   R-Sq(adj) = 0.00%

                          Individual 95% CIs For Mean Based on Pooled StDev
Level   N    Mean  StDev  --------+---------+---------+---------+
A      20  11.950  0.102  (-----------------*-----------------)
B      20  11.950  0.100  (-----------------*-----------------)
C      20  11.950  0.102  (-----------------*-----------------)
D      20  11.950  0.100  (-----------------*-----------------)
                          --------+---------+---------+---------+
                             11.925    11.950    11.975    12.000

Pooled StDev = 0.101

NO DIFFERENCE?!?
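The point of the table and ANOVA output above can be checked by hand. The following is a minimal sketch, not the speaker's own analysis: it recomputes the means, standard deviations and the one-way ANOVA F statistic from the twenty samples as transcribed from the slide, then adds a crude time-order check (mean of samples 11 to 20 minus mean of samples 1 to 10). The function names `one_way_f` and `half_shift` are illustrative, and because the slide values are rounded to two decimals, the results match the printed output only approximately.

```python
from statistics import mean, stdev

# Data transcribed from the process table (samples 1 to 20, in production order).
processes = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}


def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def half_shift(values):
    """Mean of the second half minus mean of the first half (crude time-order check)."""
    mid = len(values) // 2
    return mean(values[mid:]) - mean(values[:mid])


summary = {name: (mean(v), stdev(v)) for name, v in processes.items()}
f_stat = one_way_f(list(processes.values()))
shifts = {name: half_shift(v) for name, v in processes.items()}

for name in processes:
    m, s = summary[name]
    print(f"{name}: mean={m:.3f} sd={s:.3f} second-half shift={shifts[name]:+.3f}")
print(f"One-way ANOVA F = {f_stat:.3f}")
```

The summary statistics and the F statistic cannot tell the four processes apart, whereas even this simple order-aware check shows process A stepping up by roughly 0.2 halfway through the run. A plot of the values in time order reveals the same thing at a glance.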
FOUR PROCESSES WITH SAME MEAN AND SD
Mean = 11.95, SD = .10
[Figure: four panels (A, B, C, D), each plotting the process values (11.8 to 12.2) against Sample (1 to 20)]
Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Worksheet 1; 05/04/2006; Graham Errington

ALWAYS CARRY OUT PTBD ANALYSIS
PLOT THE B….. DOTS!

TYPES OF STATISTICAL STUDIES
• Descriptive
• Enumerative
• Analytic

DESCRIPTIVE STUDY
• Count all fish in barrel
• Count number of goldfish
• Proportion of goldfish applies to the fish population in this barrel and no other barrels of fish

ENUMERATIVE STUDY
• Take a sample of fish from the barrel, and count the number of goldfish in the sample
• Point estimate of the proportion of goldfish in the barrel population
• Many statistical procedures do this
• Cannot make any inference about any other barrels of fish

ANALYTICAL STUDY
Fish Packing Process over Time
• Will we get the same proportion of goldfish in the future as we got in the past?
• An analytical study allows prediction within limits

ANALYTICAL STUDY
[Figure: "C Chart of No goldfish per Barrel", Sample Count against Week No. (1 to 19); UCL = 10, C-bar = 4, LCL = 0]
• Proportion of goldfish is stable over time
• Fish packing process is predictable within limits
• We can expect, on average, 4 goldfish per barrel, but as many as 10 and as few as 0 in any single barrel
Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Fish in Barrel; 05/04/2006; Graham Errington

ENUMERATIVE vs ANALYTICAL METHODS
• Enumerative methods
  – seek to provide numeric summaries, confidence intervals, etc.
  – use significance tests, ANOVA, descriptive stats, etc., and assume a single, stable population
• Analytical methods
  – seek to understand the system under study
  – use primarily graphical tools such as run charts, control charts, histograms, box plots, etc.
  – in the real world, most problems are analytical

“Analysis of variance, t-tests, confidence intervals, and other statistical techniques taught in books,….., are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production.”
W.E. Deming, Out of the Crisis

Traditional statistical methods have their place, but are widely abused in the real world. When this is the case, statistics do more to cloud the issue than to enlighten.

PARC ANALYSIS
• Practical Accumulated Records Compilation
• Passive Analysis (by) Regression Correlations
• Planning After Research Completed
• Profound Analysis Relying (on) Computers
note inverse relationship with
• Continuous Recording (of) Administrative Procedures
• Constant Repetition (of) Anecdotal Perceptions

PLANNING A PROCESS IMPROVEMENT STUDY
• Why collect the data?
• What statistical methods for analysis?
• What data will be collected?
• How much data do we need?
• How will the data be measured?
• How good is the measurement system?
• When and where will data be collected?
• Who will collect the data?
• Remember: GARBAGE IN – GARBAGE OUT

WHAT’S SIGNIFICANT?

Two-sample T for C1 vs C2
Mean A = 13.7, Mean B = 14.4

     N    Mean  StDev  SE Mean
A    5  13.652  0.487     0.22
B    5  14.369  0.646     0.29

Difference = mu (C1) - mu (C2)
Estimate for difference: -0.716615
95% CI for difference: (-1.551531, 0.118301)
T-Test of difference = 0 (vs not =): T-Value = -1.98  P-Value = 0.083  DF = 8
Both use Pooled StDev = 0.5725

Not significant?

Two-sample T for C3 vs C4
Mean A = 13.5, Mean B = 13.7

       N    Mean  StDev  SE Mean
A    200  13.510  0.501    0.035
B    200  13.667  0.492    0.035

Difference = mu (C3) - mu (C4)
Estimate for difference: -0.157292
95% CI for difference: (-0.254935, -0.059649)
T-Test of difference = 0 (vs not =): T-Value = -3.17  P-Value = 0.002  DF = 398
Both use Pooled StDev = 0.4967

Significant?

WHAT SHOULD I DO WITH OUTLIERS?
• Data point far away from the rest of the data
• Don’t remove outliers to make data “look good”
• Do you know why it is different?
• If you do, remove it. If you don’t, leave it in
• Could have a big impact on the analysis
• Re-run analysis without outlier, and compare results

“REGRESSION” WITH EXCEL
• Usually means drawing an X-Y plot, fitting a straight line and coming up with an R² value.
• As long as R² is high, everything’s hunky-dory.
WRONG!

“REGRESSION” WITH EXCEL
[Figure: "Defects vs Cure Time", No. of Defects (-2 to 6) against Cure Time s (20 to 50), with fitted line y = 0.1913x - 5.5192, R² = 0.5079]
Relationship is clearly not linear, and should not be presented as such

“REGRESSION” WITH EXCEL
• Regression model checking – in Excel?
• Residual plots:
  – Normally distributed
  – Random pattern when plotted vs fitted values
[Figure: three residual plots labelled "OK", "Variance not homogeneous" and "Model incorrect"]

PITFALLS OF REGRESSION ANALYSIS
• Non-Linear Relationships
• Influential Points
• Extrapolating
• Lurking Variables
• Summary Data
• Assuming Causation

• THAT’S (WITH REASONABLE PROBABILITY) THE END FOLKS!
And remember,
• With statistics, you never have to say you’re certain!
• THANK YOU FOR YOUR ATTENTION
• ARE THERE ANY QUESTIONS?
• GOOD LUCK!!
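As a worked check on the two-sample T outputs under WHAT’S SIGNIFICANT?, the sketch below recomputes the pooled t statistics from the summary figures printed on the slide. It is an illustration under stated assumptions, not the original Minitab analysis: the helper `pooled_t` is a hypothetical name, the inputs are the rounded slide values (so results agree only to about two decimals), and the 5% two-sided critical values in `T_CRIT` are taken from standard t tables rather than computed.

```python
from math import sqrt


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled two-sample t test from summary statistics.

    Returns (difference, standard error, t statistic, degrees of freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, diff / se, df


# Summary statistics as printed on the slide (rounded values).
small = pooled_t(5, 13.652, 0.487, 5, 14.369, 0.646)
large = pooled_t(200, 13.510, 0.501, 200, 13.667, 0.492)

# Two-sided 5% critical values from standard t tables (assumed, not computed here).
T_CRIT = {8: 2.306, 398: 1.966}

for label, (diff, se, t, df) in (("n=5", small), ("n=200", large)):
    half_width = T_CRIT[df] * se
    verdict = "significant" if abs(t) > T_CRIT[df] else "not significant"
    print(f"{label}: diff={diff:.3f} t={t:.2f} "
          f"95% CI=({diff - half_width:.3f}, {diff + half_width:.3f}) -> {verdict}")
```

The larger difference (about 0.72, from five samples per group) fails the test, while the much smaller difference (about 0.16, from 200 per group) passes it. Statistical significance here reflects sample size as much as the size of the effect, so whether either difference matters in practice is a separate question.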