Lies and Statistics.doc

Liars Figure, and Figures Lie

How many times have you heard that old joke used to deflate statistics, “liars figure and figures lie”? Or the quote Mark Twain popularized, “Lies, Damned Lies, and Statistics”? This month I will demonstrate where these old truisms come from, and how to avoid them. I will run four different analyses on exactly the same data and arrive at four completely different interpretations: three lies and one truth.

Let the saga begin . . .

I have retrieved 25 months of operational data which management wants analyzed. The first method we will use is a common one, a bar chart. Most people would probably succumb to the demand “I only want to see the current year of data!”, but let us assume I am allowed to show all 25 months. The resulting bar chart is shown as Figure 1. For the sake of discussion, increasing numbers are “bad”.

Figure 1. Bar chart of the 25 monthly values, Jan-98 through Jan-00 (vertical axis 0 to 18).

My assessment, given to the manager of this process (in a breathless manner befitting the adverse trend that has developed), is as follows:

“The past three months in a row have been increasing! In fact, the current month is at the highest value in more than a year! We must do something! We must find out why this month was so high!”

Note the interpretation would be the same if I were showing only one year of data, except that the current month would now be the “highest on the whole graph!”

The manager who owns the process that generated the data (and thus must be accountable, or find someone to hold accountable for the increase) says, “Wait a minute. In an Excel spreadsheet you can add a ‘trendline’ to these charts. This trendline will tell us if we are overall increasing, or overall decreasing.”

We dutifully go to Excel and generate Figure 2. This figure shows the “trendline”, which is a straight line fitted by “least squares”.
It is the same kind of fit many of us learned in high school science class.

Figure 2. The same monthly data with the Excel least-squares trendline, y = -0.132x + 11.987 (Jan-98 through Jan-00; vertical axis 0 to 18).

“Aha! The trend line is negative! We have an improvement occurring, the rate is decreasing! It is obvious, the Excel trend line shows us the slope is negative. In addition, a projection ahead shows that we should achieve a value of less than 8 by July 2000.”

We get ready for a celebration pizza party . . .

But wait. A consultant arrives saying that he always uses moving averages to smooth out the fluctuations in the raw data. Let us see what interpretation a moving average (which averages the last six months together) gives.

Figure 3. Six-month moving average of the monthly data, Jan-98 through Jan-00 (vertical axis 0 to 18).

“See!” says the consultant. “The moving average shows that we were improving until June 1999, then we got worse. Also, look how high above the average the current month is! We must determine what happened back in June that made us worse!”

The Three Big Lies

The first interpretation simply reacted to the raw data. Of course at least one point on the graph will be the “highest on the graph”. Likewise, there will also be a lowest. Many people succumb to explaining, in gory detail, exactly why the current result came out the way it did. We must also find those to hold accountable (that is, blame).

The second interpretation simply laid a linear regression line (least-squares fit) over the data. It is highly unlikely that the slope of such a line will be exactly zero, so there will always be a “positive trend” or a “negative trend” declared. What most people fail to do is examine the “R-squared” value and determine the statistical significance of the slope.
The question is: “Is the slope of the line significantly different from zero?” In this case, R-squared = 0.11, usually considered to be a pretty poor fit. An R-squared of 1.0 is a perfect fit.

The moving average is next to useless. It also fails to tell you what is significant and what is not. All you seem to learn is whether the current month was above or below the previous average. In reality, as you update the moving average, the current month replaces the earliest month in the previous average. If the current month was higher than the earliest month, then the moving average increases. Of course, roughly half the data will be above average and half below average.

An even worse structure is a cumulative average, where each average value has a differing number of data points in it. Thus a given shift in the data will either make a huge apparent change (early on in the accumulation of data) or hardly any change (once a large amount of data has accumulated).

What is truth in this case? Let us try to find the answer using a control chart. The control chart is shown below as Figure 4. This chart shows that the data are actually stable, that no change has occurred. There are no significant trends on this graph. For more details on control charting, please see the Hanford Trending Primer at http://www.hanford.gov/safety/vpp/trend.htm.

Figure 4. Control chart of the monthly data, Jan-98 through Jan-00 (vertical axis 0 to 25), showing the Upper Control Limit, the Average = 10.3 (Jan98 - Jan00), and the Lower Control Limit.

And where did this data come from? It was generated by a random number generator drawing from a normal distribution with an average of 10 and a standard deviation of 3. Only the control chart gave us the correct interpretation of the data.

Steven S Prevette
ASQ Certified Quality Engineer
This article is to appear in the October 1999 ASQ newsletter
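For readers who want to try the comparison themselves, here is a minimal Python sketch of the analyses discussed in the article: a least-squares trendline with its R-squared, a six-month moving average, and control limits, all applied to 25 random values drawn from a normal distribution with average 10 and standard deviation 3. It is not the author's spreadsheet. The function names, the seed, and the choice of control limits at the average plus or minus three sample standard deviations are assumptions made for illustration; the article does not say how its limits were calculated, and an individuals (moving-range) chart is another common choice. The random values will also differ from the article's data.

```python
import math
import random
import statistics


def least_squares(ys):
    """Fit y = slope * x + intercept for x = 1..n; return (slope, intercept, r_squared)."""
    xs = list(range(1, len(ys) + 1))
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


def slope_t_statistic(r_squared, n):
    """t statistic (n - 2 degrees of freedom) for testing slope = 0; needs r_squared < 1."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


def moving_average(ys, window=6):
    """Trailing average of the most recent `window` points."""
    return [statistics.fmean(ys[i - window:i]) for i in range(window, len(ys) + 1)]


def control_limits(ys, sigmas=3.0):
    """Return (lower, center, upper).

    ASSUMPTION: limits sit at the average +/- 3 sample standard deviations.
    """
    center = statistics.fmean(ys)
    spread = statistics.stdev(ys)
    return center - sigmas * spread, center, center + sigmas * spread


def demo(seed=0, months=25):
    """Run all analyses on one set of random data (the seed is arbitrary)."""
    rng = random.Random(seed)
    data = [rng.gauss(10, 3) for _ in range(months)]
    slope, intercept, r2 = least_squares(data)
    lower, center, upper = control_limits(data)
    outside = [y for y in data if y < lower or y > upper]
    print(f"trendline: y = {slope:.3f}x + {intercept:.3f}, R-squared = {r2:.2f}")
    print(f"t statistic for slope: {slope_t_statistic(r2, months):.2f}")
    print(f"latest 6-month moving average: {moving_average(data)[-1]:.1f}")
    print(f"control limits: {lower:.1f} to {upper:.1f}, center {center:.1f}")
    print(f"points outside the limits: {len(outside)}")


if __name__ == "__main__":
    demo()
```

As a worked check on the second lie: with R-squared = 0.11 and 25 points, `slope_t_statistic(0.11, 25)` gives about 1.69. The usual two-sided 5% critical value for 23 degrees of freedom is about 2.07, so by that test the slope is not significantly different from zero.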
