Sage Note: Statistics in Excel

Excel as a Tool for Statistical Analysis

This note examines the statistics features in Microsoft Excel, available through Tools > Data Analysis, and shows that these features are flawed. While Excel generally produces correct results, erroneous results arise with disturbing frequency. On the other hand, Excel continues to be arguably the world's most widely distributed statistical software package – it behooves us to anticipate such problems! Note the following problems:

(1) Excel may produce numerically inaccurate results (usually through the use of obsolete computing algorithms).
(2) Microsoft has not responded to earlier commentary about the quality of Excel's statistical computing.
(3) Excel's data orientation (cells rather than variables) creates difficulties in basic calculations.
(4) Excel lacks the flexibility to deal with serious statistical investigations.

Are these fair criticisms? After all, any piece of software will have bugs. As we will see, the problems with Excel are not merely rounding errors, nor are they nuances of usage in esoteric procedures. Excel has fundamental problems of correctness and accuracy in very commonly occurring analyses. For many of these errors, awareness of the risks is sufficient for an experienced data analyst to detect and overcome the problems through additional analysis, checks and transformations of the data. Experience in use, and caution in application and interpretation, are essential if Excel is to be regarded as a useful adjunct tool in statistical analysis.

Note that these objections to Excel relate primarily to its statistical "features" and its use as a statistical analysis package. Nevertheless, some of these problems seriously challenge the idea that Excel has been, and continues to be, a good spreadsheet program.

Any person considering the use of Excel for statistical purposes should consult B. D. McCullough and Berry Wilson, "On the accuracy of statistical procedures in Microsoft Excel 97," Computational Statistics & Data Analysis, Volume 31, 1999. The abstract gives their position:

    The reliability of statistical procedures in Excel is assessed in three areas: estimation (both linear and nonlinear), random number generation, and statistical distributions (e.g. for calculating p-values). Excel's performance in all three areas is found to be inadequate. Persons desiring to conduct statistical analyses of data are advised not to use Excel.

Several other sources from the scholarly literature are also cited. In addition, note the points provided by Minitab, Inc., in its "white paper." Note that Minitab is not a disinterested party in this discussion!

Excel may produce numerically inaccurate results

McCullough and Wilson examined software by analyzing data in the Standard Reference Datasets released by the National Institute of Standards and Technology (McCullough and Wilson, p 27). They are careful to state that their concern is not merely the number of correct digits that an algorithm produces, as the need for precision will vary among users and among applications. However, if there exists a reliable algorithm that solves a particular problem to a certain precision, then a program that fails to get close to this precision should be judged inadequate. McCullough and Wilson show that for "easy" data sets, Excel will get acceptable precision for the mean and standard deviation.
However, for not-so-easy data sets, Excel will get one significant figure (or worse) correct in computing a standard deviation (Table 2, p 30). Among these are the data sets Numacc3 and Numacc4.

Set Numacc3 has length n = 1,001:
    Value 1,000,000.1 occurs 500 times
    Value 1,000,000.2 occurs once
    Value 1,000,000.3 occurs 500 times

The exact standard deviation is 0.10. Minitab finds the standard deviation as 0.10000 through Calc > Column Statistics. SAS likewise produces an accurate result. Excel (MS Excel 2000), however, reports the standard deviation as 0.234787138 (using either the =STDEV() function or the Descriptive Statistics tool from the Analysis ToolPak). From Excel 2000:

    Mean                        1000000.2
    Standard Error            0.007420912
    Median                      1000000.2
    Mode                        1000000.1
    Standard Deviation        0.234787138
    Sample Variance              0.055125
    Kurtosis                 -2.003003003
    Skewness                  5.23168E-07
    Range                             0.2
    Minimum                     1000000.1
    Maximum                     1000000.3
    Sum                        1001000200
    Count                            1001
    Confidence Level(95.0%)   0.014562347

Set Numacc4 has length n = 1,001:
    Value 10,000,000.1 occurs 500 times
    Value 10,000,000.2 occurs once
    Value 10,000,000.3 occurs 500 times

The exact standard deviation is 0.10. Minitab finds the standard deviation as 0.10000 through Calc > Column Statistics. Excel, however, reports the standard deviation as 0.912140340.

Now, one could argue that these are terribly unrealistic data sets, in that the only variability in the numbers occurs in the low-order digits. Two comments seem appropriate. First, financial data taken over a short time period could well have precisely this character, and calculating the variance of a short series of financial figures is a key element of determining the value of, for example, an option. A sixty-fold error in the magnitude of the volatility of a series of cashflows will have an enormous effect on the value of the option! Secondly, this type of computing difficulty comes from the use of $\sum x_i^2 - n\bar{x}^2$ to compute a sum of squares around an average, and it has been known in the statistical literature for decades that this is an inferior algorithm. (The HELP feature in Excel 2000 acknowledges the use of this formula.) Note that Excel performs perfectly well with this dataset if a two-pass algorithm is implemented "by hand".

As the ratio of the disturbances to the signal (i.e. the coefficient of variation) gets larger, the ratio of the standard deviation as calculated by Excel to the true standard deviation gets closer to 1. The chart below, which shows the CV on the horizontal axis and the standard deviation ratio on the vertical axis, demonstrates this.

[Chart: ratio of Excel's standard deviation to the true value (vertical axis, 0 to 20) against the coefficient of variation (horizontal axis, 0 to 50); the ratio falls towards 1 as the CV increases.]

A curiosity

Interestingly, if we change Numacc3 a little, so that it is centred on the value 1,000,000 (i.e. we have 500 values of 999,999.9, one value of 1,000,000, and 500 values of 1,000,000.1), then the scale is unchanged, of course. The results are not quite what one might have predicted! The two-pass algorithm is fine, of course, returning the exact standard deviation of 0.1. The =STDEV() function, however, gives us the result 0. Keeping a similar scale (i.e. roughly +/- 0.1 around a signal of approximately 1 million), we can get a wide variety of values for the standard deviation from the =STDEV() function.
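The difference between the two algorithms is easy to demonstrate outside Excel. The following short Python sketch (offered purely as an illustration; it is not Excel's actual code) applies both formulas to the Numacc3 data in ordinary double-precision arithmetic:

    import math

    # Numacc3 from the NIST Standard Reference Datasets:
    # 500 copies of 1,000,000.1, one 1,000,000.2, 500 copies of 1,000,000.3.
    data = [1000000.1] * 500 + [1000000.2] + [1000000.3] * 500
    n = len(data)
    mean = sum(data) / n

    # One-pass "calculator" formula (the form Excel's HELP acknowledges):
    # it subtracts two nearly equal quantities of order 1e15, so almost
    # all of the significant figures cancel away.
    ss_one_pass = sum(x * x for x in data) - n * mean * mean

    # Two-pass formula: compute the mean first, then sum squared deviations.
    ss_two_pass = sum((x - mean) ** 2 for x in data)

    print(ss_one_pass)                       # exact value is 10.0; the one-pass
                                             # result is badly wrong, and can
                                             # even come out negative
    print(math.sqrt(ss_two_pass / (n - 1)))  # 0.1, correct to near machine precision

The exact output of the one-pass line depends on the platform's rounding behaviour, but the cancellation guarantees that most of its significant figures are lost – which is precisely the hazard the Numacc datasets were designed to expose.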
Regression

Similarly, Excel may also have problems with linear regression. For example, the file Filip in the Standard Reference Datasets is a polynomial regression problem in which Excel is not able to get even one significant figure correct! This is an example of an ill-conditioned problem, meaning that the matrix of independent variables is very nearly rank-deficient. It may seem that this problem is too exotic to be worthy of concern, but in fact problems that are rank-deficient (or nearly so) are common. (And polynomial regression is one of the standard alternatives offered in the Excel "Add Trendline" option for scatter charts!)

Note also that L. Knüsel examined the computations related to the statistical distributions used in Excel, including the normal, chi-squared, F, and t. The article is "On the accuracy of statistical distributions in Microsoft Excel 97," Computational Statistics & Data Analysis, Volume 26, 1998, pp 375-377. He found many defects in the algorithms, and judged Excel to be inadequate. Generally, these inadequacies lie in the tails of the distributions; for most hypothesis-testing tasks, the approximations are perfectly adequate. However, the inaccuracies in the tails may make Excel a poor choice for simulation studies, except under carefully selected conditions. (Excel is perfectly adequate, for example, for class demonstrations of basic probability ideas and the Central Limit Theorem.)

The paired t-test

Let's be very specific about the kinds of things that Excel might do to an unsuspecting user. Consider the following data set, which appears in Minitab's "white paper" regarding Excel. Excel gives the wrong t-statistic and p-value for a paired t-test if there are missing values. Here are the data, reading down the columns; each column has one blank cell, and the blanks fall in different rows:

    Sample 1:  3 4 3 2 4 4 3 2 4 4 3 4 3 2 3 3 4 4 2
    Sample 2:  2 2 3 3 3 3 4 3 2 2 2 3 2 2 2 4 2 2 3

The paired t-test involves differences, Sample 1 minus Sample 2. There are 20 lines in the data set, but there are only 18 usable differences, and consequently 17 degrees of freedom for estimating the standard error of the difference in means. Here is the Minitab output:

    Paired T-Test and CI: Sample 1, Sample 2

    Paired T for Sample 1 - Sample 2

                  N    Mean   StDev   SE Mean
    Sample 1     18   3.167   0.786     0.185
    Sample 2     18   2.556   0.705     0.166
    Difference   18   0.611   1.145     0.270

    95% CI for mean difference: (0.042, 1.180)
    T-Test of mean difference = 0 (vs not = 0): T-Value = 2.26  P-Value = 0.037

Minitab correctly works around the missing values. Minitab reports the means of each column for the 18 relevant entries. Minitab gives the correct values, t = 2.26 and p = 0.037. (Minitab does not print the number of degrees of freedom, but it is easy to determine that this is 17; moreover, the p-value corresponds to 17 degrees of freedom.) (SAS enforces explicit calculation of the differences for the paired t-test and, of course, produces the correct results.)

Here is the Excel output:

    t-Test: Paired Two Sample for Means

                                   Sample 1   Sample 2
    Mean                             3.2105     2.5789
    Variance                         0.6199     0.4795
    Observations                         19         19
    Pearson Correlation             -0.1770
    Hypothesized Mean Difference          0
    df                                   18
    t Stat                           1.7143
    P(T<=t) one-tail                 0.0518
    t Critical one-tail              1.7341
    P(T<=t) two-tail                 0.1036
    t Critical two-tail              2.1009

Excel gives t = 1.7143 and p = 0.1036, which are incorrect. Excel gives 18 degrees of freedom, which is also incorrect. (The combination of p-value, t, and degrees of freedom is at least internally consistent.)
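The correct treatment is straightforward in any environment that represents missing values explicitly: drop every pair with a missing component, then test the remaining differences. A minimal Python sketch of the procedure (using scipy, with a small hypothetical fragment of data; this illustrates the correct calculation, not Minitab's or Excel's internal code):

    from scipy import stats

    def paired_t(sample1, sample2):
        """Paired t-test that drops any pair with a missing (None) entry,
        which is how Minitab and SAS treat these data."""
        pairs = [(x, y) for x, y in zip(sample1, sample2)
                 if x is not None and y is not None]
        xs, ys = zip(*pairs)
        result = stats.ttest_rel(xs, ys)
        return result.statistic, result.pvalue, len(pairs) - 1  # t, p, df

    # Hypothetical fragment with one missing value in each column:
    s1 = [3, 4, None, 2, 4, 4, 3]
    s2 = [2, 2, 3, None, 3, 3, 4]
    t, p, df = paired_t(s1, s2)
    print(t, p, df)   # computed from the 5 complete pairs, on 4 df

Applied to the full white-paper data, this procedure reproduces Minitab's t = 2.26 and p = 0.037 on 17 degrees of freedom.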
Excel's handling of this data set is unfortunate: it cannot even determine which data points are to be used in the analysis. If the data are "cleaned up" prior to use of the tool (i.e. all pairs that have a missing component are deleted), then the Excel routine works fine (modulo its problems with precision). Unfortunately, most Excel functions, such as =AVERAGE(), =STDEV() and so on, are designed to take account of missing values, so the less sophisticated user may also trust the statistical tools to take account of missing values in a sensible way. This, of course, is not the case! Incidentally, if there are entries in the missing places, such as '*' or '.', then the Excel tool throws up its hands in horror and declares that the "Input range contains non-numeric data".

Note that "P(T<=t) one-tail" seems to be a poor description. What is intended is something like P(T > |t|), one tail.

Note also that Excel does not in general carry through arithmetic properly if there are missing values – empty cells in arithmetic expressions are (usually) interpreted as zeroes. In this example, the value in cell B3 should be missing, but Excel gives it as -1.5:

         A    B
    1    X    Resids
    2    1    -0.5    (=A2 - AVERAGE(A2:A4))
    3         -1.5    (=A3 - AVERAGE(A2:A4))
    4    2     0.5    (=A4 - AVERAGE(A2:A4))

Again, this type of behaviour is common in Excel, which does not have strong typing: it readily converts blank cells to numeric cells (zero), and will convert the text string "1" to the number 1 within an arithmetic expression, but not within a function evaluation such as =SUM(). This can lead to enormous problems with data received from a third party, particularly if several files have been merged to create a final dataset.

(The operational unit in Excel is the cell – in statistical languages such as S-Plus, Minitab, R, SAS and so on, the basic unit is the variable. Consequently, all values of a particular variable are necessarily of the same type (numeric, character, etc), while this is not the case with Excel. In particular, variables with mixed types can create havoc with the output from pivot tables and the statistical tools in the ToolPak. Pivot tables are particularly at risk in this regard, as all the functions applied within the pivot table happily omit character values without comment. The contrast with a variable-oriented system is sketched below; see also the section further on regarding Basic Calculations.)
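In a variable-oriented system the missing value propagates instead of silently becoming zero. A small illustration in Python with pandas (hypothetical, purely to show the contrast with the spreadsheet fragment above):

    import pandas as pd

    x = pd.Series([1.0, None, 2.0])   # the middle observation is missing
    resid = x - x.mean()              # the mean, 1.5, is computed from the
                                      # two values that are present
    print(resid)
    # 0   -0.5
    # 1    NaN    <- stays missing, rather than becoming -1.5
    # 2    0.5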
Linear Regression

Now let's consider a linear regression situation. The independent variables are Frittata, Salmon_Crepes, and so on. In this example there is an exact linear relationship among the independent variables, as TotalPieces is just the sum of the other variables. Incidentally, Minitab recognises this sort of situation and automatically removes one of the variables in such a constraint. Here is the Excel output:

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.959
    R Square             0.920
    Adjusted R Square    0.910
    Standard Error      21.269
    Observations            96

    ANOVA
                 df          SS         MS        F   Significance F
    Regression   11  439623.940  39965.813   88.349      2.39613E-41
    Residual     84   37998.560    452.364
    Total        95  477622.500

                       Coefficients  Standard Error  t Stat  P-value     Lower 95%    Upper 95%
    Intercept                13.398           5.863   2.285    0.025         1.739       25.056
    Frittata                  9.608     1109566.045   0.000    1.000  -2206484.709  2206503.924
    Salmon_Crepes             0.623     1109566.045   0.000    1.000  -2206493.693  2206494.939
    Mini_Bagels               0.717     1109566.045   0.000    1.000  -2206493.599  2206495.033
    Onion_Olive_Tarts         0.440     1109566.045   0.000    1.000  -2206493.877  2206494.756
    Crostini                  0.007     1109566.045   0.000    1.000  -2206494.310  2206494.323
    Mini_Pizza                0.151     1109566.045   0.000    1.000  -2206494.165  2206494.467
    Sandwiches               -0.026     1109566.045   0.000    1.000  -2206494.342  2206494.291
    Mini_Muffins              0.107     1109566.045   0.000    1.000  -2206494.210  2206494.423
    Fishcakes                 0.360     1109566.045   0.000    1.000  -2206493.957  2206494.676
    Thai_Crepes               0.519     1109566.045   0.000    1.000  -2206493.797  2206494.836
    TotalPieces               0.310     1109566.045   0.000    1.000  -2206494.006  2206494.626

Excel quite happily has a crack at calculating all the usual regression statistics, and, of course, comes up with some fairly weird results. This just reflects the fact that the calculations of regression are simply that: calculations – feed in the numbers, and something will come out. Minitab is a program designed for statistical analysis, and consequently attempts to give statistical guidance and make sensible decisions about the statistical analysis. Excel is not designed for statistical analysis, and will give no guidance beyond the strangeness of the results. With regression analyses in Excel, the "neck-top" computer is a vital tool, and in this case the signals are certainly there to see: standard errors that are enormous (or possibly vanishingly small in other cases), incalculable p-values, and so on. This is not so much a criticism of Excel as it is a cautionary tale, which indicates that regression output should be looked at with some care. (A comment that can equally be made about the regression output from any statistical program.)

Below is the output from SAS, which copes quite well with the constraint:

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Tot_Cost

    Analysis of Variance

    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model             10           439624         43962     98.34   <.0001
    Error             85            37999     447.04188
    Corrected Total   95           477623

    Root MSE          21.14337   R-Square   0.9204
    Dependent Mean   167.12500   Adj R-Sq   0.9111
    Coeff Var         12.65123

    NOTE: Model is not full rank. Least-squares solutions for the parameters
    are not unique. Some statistics will be misleading. A reported DF of 0 or
    B means that the estimate is biased.

    NOTE: The following parameters have been set to 0, since the variables
    are a linear combination of other variables as shown.

    totpiece = Fritata + SalmCrep + MiniBag + OnOlTt + Crstini + MiniPizz +
               Sandwich + MiniMuffs + Fishcks + ThaiCrps

    Parameter Estimates

    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1             13.39793          5.82805      2.30     0.0240
    Fritata      B              9.91777          1.78338      5.56     <.0001
    SalmCrep     B              0.93333          0.09349      9.98     <.0001
    MiniBag      B              1.02691          0.11159      9.20     <.0001
    OnOlTt       B              0.74971          0.06739     11.13     <.0001
    Crstini      B              0.31679          0.02522     12.56     <.0001
    MiniPizz     B              0.46124          0.06545      7.05     <.0001
    Sandwich     B              0.28449          0.02319     12.27     <.0001
    MiniMuffs    B              0.41665          0.05598      7.44     <.0001
    Fishcks      B              0.66981          0.03934     17.03     <.0001
    ThaiCrps     B              0.82959          0.08966      9.25     <.0001
    totpiece     0                    0                .         .          .
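An analyst working in an environment that performs no such check can make it directly before fitting. A minimal NumPy sketch (the data are hypothetical, constructed to mirror the TotalPieces constraint above):

    import numpy as np

    # Hypothetical design matrix: ten item columns plus a column that is
    # their row sum, mirroring the TotalPieces constraint above.
    rng = np.random.default_rng(0)
    items = rng.integers(0, 10, size=(96, 10)).astype(float)
    X = np.column_stack([items, items.sum(axis=1)])

    print(np.linalg.matrix_rank(X), X.shape[1])  # 10 11: rank-deficient
    print(np.linalg.cond(X))                     # enormous condition number,
                                                 # another warning sign

    # lstsq copes with the deficiency (it returns a minimum-norm solution)
    # and reports the rank, much as SAS flags the constraint:
    y = X[:, :10] @ np.arange(1.0, 11.0) + rng.normal(size=96)
    beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    print(rank)                                  # 10 again

Checking the rank (or the condition number) of the design matrix before regressing is exactly the safeguard that Excel's regression tool omits.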
Simple Linear Regression, Numerically Unstable

The next example is taken from a posting on edstat by Mark Eakin, 18 September 1996. This is a simple Y-on-X regression:

    Y     X
    1.1   10000000.1
    1.9   10000000.2
    3.1   10000000.3
    3.9   10000000.4
    4.9   10000000.5
    6.1   10000000.6

Here is the Excel output (2000 version):

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R          65535
    R Square           -0.816
    Adjusted R Square  -1.271
    Standard Error      2.808
    Observations            6

    ANOVA
                 df       SS       MS       F   Significance F
    Regression    1  -14.174  -14.174  -1.798            #NUM!
    Residual      4   31.534    7.884
    Total         5   17.360

               Coefficients     Standard Error   t Stat   P-value        Lower 95%
    Intercept  -233538842.107    102864523.600   -2.270     0.086   -519137136.696
    X                  23.354           10.286    2.270     0.086           -5.206

This has a negative SS for regression, and a negative F. Again, one might claim that the data are highly contrived. Nonetheless, the problem can be solved correctly by other programs, and it indicates again that Excel is using obsolete algorithms. Again, the output is indicative of something seriously amiss.

Incidentally, the short-cut way of fitting a simple linear regression in Excel is to create the scatter plot and "Add Trendline". This results in a quite remarkable plot of the data plus fitted line:

[Chart: the six data points with the fitted trendline, labelled y = 2.09E+01x - 2.09E+08 and R² = 2.09E+00.]

So the algorithm used in the "Add Trendline" function is different to that used in the regression tool, and quite markedly wrong: different coefficients, and a different R² value – one greater than 1! Minitab gets this one correct, by the way, as does SAS.
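The instability is again cancellation: the x-values agree to eight significant figures, so forming the centred sums of squares by the one-pass formula destroys the information. A careful analyst can sidestep the problem by centring (coding) the x-variable by hand before fitting. A short NumPy sketch of the idea (illustrative only):

    import numpy as np

    x = np.array([10000000.1, 10000000.2, 10000000.3,
                  10000000.4, 10000000.5, 10000000.6])
    y = np.array([1.1, 1.9, 3.1, 3.9, 4.9, 6.1])

    # Centre x before fitting; the slope is unchanged, and the arithmetic
    # now works on numbers of order 0.1 instead of 1e7.
    xc = x - x.mean()
    slope, intercept_c = np.polyfit(xc, y, 1)
    intercept = intercept_c - slope * x.mean()
    print(slope, intercept)   # slope approx. 9.94, intercept approx. -9.94e+07

This is the same coding trick that pre-computer textbooks recommended, and it remains good insurance whenever a predictor has a large mean relative to its spread.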
Regression through the Origin

Now let's note that Excel also does not correctly perform a regression through the origin. Consider this data set:

    X     Y
    3.5   24.4
    4.0   32.1
    4.5   37.1
    5.0   40.4
    5.5   43.3
    6.0   51.4
    6.5   61.9
    7.0   66.1
    7.5   77.2
    8.0   79.2

We want to fit the regression model $Y_i = \beta X_i + \varepsilon_i$. Here is the output from Excel:

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R         0.952081
    R Square           0.906459
    Adjusted R Square  0.795348
    Standard Error     5.818497
    Observations             10

    ANOVA
                 df        SS        MS        F   Significance F
    Regression    1  2952.635  2952.635  87.2144         1.41E-05
    Residual      9  304.6941   33.8549
    Total        10  3257.329

                  Coefficients  Standard Error    t Stat   P-value  Lower 95%  Upper 95%
    Intercept                0            #N/A      #N/A      #N/A       #N/A       #N/A
    X Variable 1      9.130107        0.310458  29.40852  2.97E-10   8.427802   9.832412

(This F-value is wrong. For a model with a single predictor, F must equal the square of the t-statistic for the slope, and 29.40852² = 864.86 is the correct value, as in the Minitab output below. Excel's 87.2144 is instead the ratio of its incorrectly centred regression sum of squares to the residual mean square.)

Here is the same run in Minitab:

    The regression equation is
    Y = 9.13 X

    Predictor       Coef   SE Coef       T       P
    Noconstant
    X             9.1301    0.3105   29.41   0.000

    S = 5.818

    Analysis of Variance

    Source            DF      SS      MS        F       P
    Regression         1   29280   29280   864.86   0.000
    Residual Error     9     305      34
    Total             10   29584

It should be noted that both programs produce the same fitted equation, Y = 9.13 X, and in the analysis of variance both programs get the same residual sum of squares, 305. However, Minitab correctly works from the correct total sum of squares, $\sum_{i=1}^{10} Y_i^2 = 29{,}584$. Excel makes the mistake of computing around the average; that is, Excel computes $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 3{,}257.3$. The average of Y has no bearing in a regression through the origin. As a result, Excel reports an incorrect R². The correct computation of R² for this problem is indicated by Sen and Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer-Verlag, New York, 1990, or just about any book on statistics at the elementary level (a worked check follows below). Thus, Excel gets most of the analysis of variance table wrong, including the F test. Provided the data are well-behaved, and it is intended that inferences are to be made on the coefficients, then the "regression through the origin" option in the regression tool can be used with a little confidence – and a bucket of caution!
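As a worked check, using the uncentred (no-intercept) definitions and the rounded sums of squares printed above:

$$
\mathrm{SS}_{\text{total}} = \sum_{i=1}^{10} Y_i^2 = 29{,}584, \qquad
\mathrm{SS}_{\text{regression}} = 29{,}584 - 305 = 29{,}279,
$$
$$
F = \frac{29{,}279/1}{305/9} \approx 864, \qquad
R^2 = 1 - \frac{305}{29{,}584} \approx 0.990.
$$

Minitab's 864.86 is the same quantity computed from the unrounded sums of squares; compare Excel's F of 87.2 and R² of 0.906.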
Excel uses obsolete computing algorithms

This point was noted earlier, with regard to computing $\sum (x_i - \bar{x})^2$ through $\sum x_i^2 - n\bar{x}^2$. The latter formula has been known to be inaccurate for decades. Excel is not able to deal with singularity or near-singularity in the matrix of independent variables in the regression problem; Excel is again using an inappropriate computing algorithm. See also the earlier note on the findings of Knüsel regarding statistical distributions: he found many defects in Excel's algorithms. For the most part, these problems may be detected by checking such things as relative scale (the CV, and so on), and by using alternative algorithms for calculating tail probabilities, particularly for very small values. Awareness of the shortcomings, however, is vital.

Microsoft has not responded to earlier comments about the quality of its statistical computing

The essence of the McCullough and Wilson findings is that Excel, for many examples, does not produce answers with the same number of significant figures that other programs can obtain. In their exploration of the computational quality of statistical programs, they applied their criteria to many statistical and econometric packages. This comment is quite relevant:

    It is worth noting that vendors of these statistical and econometric packages participated fully in the application of this methodology to their products. These vendors verified all calculations, provided information on algorithms when such information was not included in the documentation, and otherwise assisted the process. ... Since it is conceivable that more statistical calculations are performed using Excel than any other package, it is important that the statistical capabilities of Excel be assessed. Microsoft was invited to participate in this evaluation but chose not to do so.

It is hard to know the precise reasons for Excel's failings, as Microsoft did not cooperate in the investigation. The failure to calculate precisely the sample standard deviation $\sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$ in the earlier example leads to the suggestion that Excel used the hand-calculator form $\sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n-1}}$. Unfortunately, as noted by McCullough and Wilson (p 30), it has been known for most of the 20th century that this formula is highly unreliable. McCullough and Wilson cite R. F. Ling, "Comparison of several algorithms for computing sample means and variances," Journal of the American Statistical Association, Volume 69, 1974, pp 859-866. In short, Microsoft has not paid attention to earlier reports about computational problems and, surprisingly, refused to cooperate in the McCullough and Wilson study.

Excel lacks the flexibility to deal with serious statistical investigations

This is in some senses a non-criticism, since Excel was never intended for use as a serious package for statistical analysis. Nevertheless, it is worth noting some of its "absences". The Data Analysis main menu in Excel lists these features:

    Anova: Single Factor
    Anova: Two-Factor with Replication
    Anova: Two-Factor without Replication
    Correlation
    Covariance
    Descriptive Statistics
    Exponential Smoothing
    F-Test Two-Sample for Variances
    Fourier Analysis
    Histogram
    Moving Average
    Random Number Generation
    Rank and Percentile
    Regression
    Sampling
    t-Test: Paired Two Sample for Means
    t-Test: Two-Sample Assuming Equal Variances
    t-Test: Two-Sample Assuming Unequal Variances
    z-Test: Two Sample for Means

This is a very modest subset of the statistical tools that an analyst would use. Omitted topics include discriminant analysis, logistic regression, factor analysis, principal components, leverage values, and on and on. Many of these can be calculated using the matrix primitives available in Excel, but the numerical properties of such home-grown algorithms must be at best suspect, given the poor implementation of even very simple and well-known calculations such as the standard deviation. Maximum likelihood estimates can be derived using the SOLVER add-in (but recall the problems with the implementation of some of the distribution functions in Excel), and it appears that the technology underlying SOLVER is a great deal more sophisticated than that underlying the Data Analysis ToolPak. Again, Excel can be a reasonable stop-gap alternative in the hands of an experienced and cautious analyst.

Often, those statistical tasks that Excel does perform require an inordinate amount of setup. For example, the chi-squared test for the two-by-two table requires that the user actually set up the table of expected values, and, as we have seen, if there are missing values within the dataset, these need to be removed (in the multivariate sense) prior to applying any of the tools in the toolkit. A dedicated statistical environment does all of this in a single call, as the sketch below illustrates.
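(Python with scipy, as one example; the counts are hypothetical.)

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 table of observed counts:
    observed = [[12,  5],
                [ 8, 15]]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(expected)        # the table of expected counts that Excel's
                           # CHITEST requires the user to build by hand
    print(chi2, p, dof)    # note: scipy applies Yates' continuity
                           # correction to 2x2 tables by default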
Minitab's "white paper" lists additional complaints. They note that Excel does not do stem-and-leaf plots, boxplots, or dotplots. Few of Excel's analyses come with automatic graphs, and even those that do are in fact incorrectly constructed. The histogram, for example, is not a histogram, nor is the normal probability plot a normal probability plot of any kind. Again, these charts can be produced, but it may involve a fair amount of hard work to persuade Excel's only real graph, the XY plot, to do so. The "white paper" also notes that the documentation is minimal and sometimes inaccurate; it cites "Spreadsheets and Statistics: The Formulas and the Words," by Herman Callaert, Chance, Volume 12, No. 2, 1999.

Basic Calculations

Of more interest and concern, perhaps, is the loose typing employed by Excel throughout. As mentioned before, the operational unit in Excel is the cell, whilst in statistical thinking it is the variable, usually represented in Excel as a column of values. In statistical thinking, we tend to regard all the values of a variable as having the same type, be it real numbers, integers, or character values. Excel makes no such constraint, and we may get trapped, for example, when a column of what appear to be numbers contains a number of character entries. For example, consider the spreadsheet fragment below:

[Spreadsheet fragment: a column of entries in cells C2:C10, all displayed right-adjusted, with two "Sum" rows beneath in cells C12 and C13.]

They look like numbers! All are right-adjusted (the default display for numbers, whereas the default for strings is left-adjusted), but of course Excel allows us to choose the justification within an area of the sheet. What of the entries in rows 12 and 13? The sum in cell C12 is the result of the usual formula for the sum of a range, and the entry in cell C13 is the result of adding together the individual entries in cells C2 to C10, as shown in the fragment below (displaying the formulae):

    Sum     =SUM(C2:C10)
    Sum2    =C2+C3+C4+C5+C6+C7+C8+C9+C10

So, what's the problem here? Several of the entries in the range C2:C10 are actually text entries, not numerical entries. When the operator '+' is applied to such entries, it converts the entry from type text to type numeric; when the SUM() function is applied to such entries, it applies no conversion! This sort of thing can arise when an old spreadsheet is re-used (i.e. we open an old spreadsheet, delete all the entries, and then enter our new values): some areas of the spreadsheet may have been formatted as text, and when we type numbers into those areas they are stored by Excel as text values. Another interesting phenomenon is revealed here: the entry in C13 is left-adjusted – a text value! Thus Excel has performed the type conversion within the repeated application of '+', but has converted the result to a text value! A very amusing thing happens if we try to edit this value: it becomes the text string

    =C2+C3+C4+C5+C6+C7+C8+C9+C10

perhaps not quite what was intended! These problems make Excel a dangerous tool for financial calculations and the construction of financial models – perhaps its primary use! Because mistyped cells are invisible on the face of the sheet, it pays to screen imported columns before analysis; a small sketch of such a screen appears at the end of this section.

Dates, too, can cause enormous problems in Excel, particularly through the confusion between dates formatted as dd/mm/yy and as mm/dd/yy. Is 12/11/00 the 12th of November, or the 11th of December? Choosing a non-ambiguous format is always safer. Safer still is to enter dates as three separate fields: one for the day, another for the month, and a third for the year (in full: 1989, not 89!). Date values can then be constructed (and manipulated) in a consistent and interpretable fashion.
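A screening pass of the kind mentioned above is simple to write. A minimal Python sketch (the data are hypothetical values standing in for the raw contents of the column discussed earlier):

    def looks_numeric(s):
        """True if a string would parse as a number."""
        try:
            float(s)
            return True
        except ValueError:
            return False

    def audit_column(values):
        """Flag entries stored as text, especially text that displays as a number."""
        for row, v in enumerate(values, start=2):   # the column starts at C2
            if isinstance(v, str):
                kind = "text that displays as a number" if looks_numeric(v) else "text"
                print(f"C{row}: {v!r} is {kind}")

    # Hypothetical raw values as they might come back from a worksheet:
    audit_column([870.0, "870", 99.5, "99.5", 1200.0, "n/a"])

Reading the cells with a library such as openpyxl would expose the stored type directly; the point is simply that the check must be made, because the sheet itself will not make it.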
Final Note

One defence of Excel runs as follows. The Analysis ToolPak is an add-on, bought in from some probably extinct software company. The current Excel staff have neither the interest nor the expertise to dive into it and fix it up, nor would doing so be cost-effective. The analysis tool is for quick and dirty analysis, and anyone needing a really serious tool would already have SAS, SPSS, Minitab or something of that nature. Thus we need not worry about the problems on which the analysis tool "fails". The same argument probably goes for the statistical functions. This is, of course, a facile argument, and many very careful analysts have been trapped by the shortcomings of Excel as a data analytic tool. The old saw "If it's worth doing, it's worth doing well" would seem to hold as well today, and in this sphere, as it did when I was browbeaten with it by my father as a child!

It is unlikely that this note will have any effect on Microsoft – my hope is that it may make some users a little more careful, a little more critical, when applying the dangerous tools in this toolkit!