Stats in Excel

Sage Note: Statistics in Excel
Excel as a Tool for Statistical Analysis
This note, regarding the statistics features in Microsoft Excel, available through
Tools > Data Analysis, shows that these features are flawed. While Excel generally produces
correct results, erroneous results arise with disturbing frequency. On the other hand,
Excel continues to be arguably the world’s most widely distributed statistical software
package – it behooves us to anticipate such problems!
Note the following problems:
(1) Excel may produce numerically inaccurate results (usually through the use
    of obsolete computing algorithms).
(2) Excel has not responded to earlier commentary about the quality of its
    statistical computing.
(3) Excel’s data orientation creates difficulties in basic calculations: cells vs
    variables.
(4) Excel lacks the flexibility to deal with serious statistical investigations.
Are these fair criticisms? After all, any piece of software will have bugs. As we will
see, the problems with Excel are not merely rounding errors, nor are these problems
nuances of usage in esoteric procedures. Excel has fundamental problems of correctness
and accuracy in very commonly occurring problems. For many of these errors,
awareness of the existence of these risks is sufficient for an experienced data analyst to
detect the problem(s) and overcome them, through additional analysis, checks and
transformations of the data. Experience in use, and caution in application and
interpretation are essential if Excel is to be regarded as a useful adjunct tool in statistical
analysis.
Note that these objections to Excel are with regard primarily to its statistical “features”
and use as a statistical analysis package. Nevertheless, some of these problems seriously
challenge the idea that Excel has been, and continues to be, a good spreadsheet program.
Any person considering the use of Excel for statistical purposes should consult B. D.
McCullough and Berry Wilson, “On the accuracy of statistical procedures in Microsoft
Excel 97,” Computational Statistics & Data Analysis, Volume 31, 1999. This
simple abstract gives their position:
The reliability of statistical procedures in Excel is assessed in three areas:
estimation (both linear and nonlinear), random number generation, and statistical
distributions (e.g. for calculating p-values). Excel’s performance in all three areas
is found to be inadequate. Persons desiring to conduct statistical analyses of data
are advised not to use Excel.
Several other sources from the scholarly literature are also cited. In addition, note points
provided by Minitab, Inc., in its “white paper.” Note that Minitab is not a disinterested
party in this discussion!
For business strategy updates … visit www.sqm.co.nz
Excel may produce numerically inaccurate results.
McCullough and Wilson examined software by analyzing data in the Standard Reference
Datasets released by the National Institute of Standards and Technology. (McCullough
and Wilson, p 27). They are careful to state that their concern is not merely the number
of correct digits that an algorithm produces, as the need for precision will vary among
users and among applications. However, if there exists a reliable algorithm that solves a
particular problem to a certain precision, then a program that fails to get close to this
precision should be judged inadequate.
McCullough and Wilson show that for “easy” data sets, Excel will get acceptable
precision for the mean and standard deviation. However, for not-so-easy data sets, Excel
will get one significant figure (or worse) correct in computing a standard deviation
(Table 2, p 30). Among these are data sets Numacc3 and Numacc4.
Set Numacc3 has length n = 1,001.
Value 1,000,000.1 occurs 500 times
Value 1,000,000.2 occurs once
Value 1,000,000.3 occurs 500 times
The exact standard deviation is 0.10.
Minitab finds the standard deviation as 0.10000 through Calc > Column
Statistics. SAS likewise produces an accurate result. Excel (MS Excel
2000), however, reports the standard deviation as 0.234787138, using
either the =STDEV() function or the Descriptive Statistics tool from the
Analysis Toolpak.
From Excel 2000:
    Mean                       1000000.2
    Standard Error             0.007420912
    Median                     1000000.2
    Mode                       1000000.1
    Standard Deviation         0.234787138
    Sample Variance            0.055125
    Kurtosis                   -2.003003003
    Skewness                   5.23168E-07
    Range                      0.2
    Minimum                    1000000.1
    Maximum                    1000000.3
    Sum                        1001000200
    Count                      1001
    Confidence Level(95.0%)    0.014562347
Set Numacc4 has length n = 1,001.
Value 10,000,000.1 occurs 500 times
Value 10,000,000.2 occurs once
Value 10,000,000.3 occurs 500 times
The exact standard deviation is 0.10.
Minitab finds the standard deviation as 0.10000 through Calc > Column
Statistics. Excel, however, reports the standard deviation as
0.912140340.
Now, one could argue that these are terribly unrealistic data sets, in that the only
variability in the numbers occurs in the low-order digits. Two comments seem
appropriate. First of all, financial data taken over a short time period could well have
precisely this character, and calculating the variance of a short series of financial figures
is a key element of determining the value of, for example, an option. A sixty-fold error in
the magnitude of the volatility of a series of cashflows will have an enormous effect on
the value of the option! Secondly, this type of computing difficulty comes from the use of
$\sum x_i^2 - n\bar{x}^2$ to compute a sum of squares around an average, and it has been known in
the statistical literature for decades that this is an inferior algorithm. (The HELP feature
in Excel 2000 acknowledges the use of this formula.)
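The difference between the two formulas can be sketched in a few lines of Python. This is only an illustration, assuming IEEE double-precision floats; the function names are made up for the example, and the textbook result depends on the arithmetic of the platform, so it need not reproduce Excel's figures digit for digit.

```python
import math

def textbook_std(xs):
    """One-pass 'hand calculator' form: sqrt((sum(x^2) - n*mean^2) / (n - 1))."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum(x * x for x in xs) - n * mean * mean
    # Cancellation can even make ss negative; report that case as NaN.
    return math.sqrt(ss / (n - 1)) if ss >= 0 else float("nan")

def two_pass_std(xs):
    """Two-pass form: compute the mean first, then sum the squared deviations."""
    n = len(xs)
    mean = math.fsum(xs) / n
    ss = math.fsum((x - mean) ** 2 for x in xs)
    return math.sqrt(ss / (n - 1))

def numacc_like(base):
    """500 values of base+0.1, one of base+0.2 and 500 of base+0.3."""
    return [base + 0.1] * 500 + [base + 0.2] + [base + 0.3] * 500

for base in (1_000_000, 10_000_000):
    data = numacc_like(base)
    print(base, "textbook:", textbook_std(data), "two-pass:", two_pass_std(data))
```

The two-pass version subtracts the mean before squaring, so the quantities being summed are of order 0.01 rather than of order 10^12 or more, and the low-order digits that carry the variability survive.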
Note that Excel performs perfectly well with this dataset if a two-pass algorithm is
implemented “by hand”. As the size of the disturbances relative to the signal
(i.e. the coefficient of variation) gets larger, the ratio of the standard deviation as
calculated by Excel to the true standard deviation gets closer to 1. The chart below, which
shows the CV on the horizontal axis and the standard deviation ratio on the vertical axis,
demonstrates this.
[Chart: ratio of Excel's standard deviation to the true standard deviation (vertical axis, 0 to 20) plotted against the CV (horizontal axis, 0 to 50).]
A curiosity
Interestingly, if we change NumAcc3 a little, so that it is centred on the value 1000000
(i.e. we have 500 values 999999.9; one value 1000000, and 500 values 1000000.1), then
the scale is unchanged, of course. The results are not quite what one might have
predicted! The two-pass algorithm is fine, of course, returning the exact standard
deviation of 0.1. The stdev() function, however, gives us the result 0. Keeping a similar
scale (i.e. roughly +/- 0.1 around a signal of approximately 1 million), we can get a wide
variety of values for the standard deviation from the =STDEV() function.
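Since the standard deviation does not change when a constant is subtracted from every value, one of the "transformations of the data" that protects the one-pass formula is to shift the data first. The Python sketch below illustrates this on the centred data set described above; the function name and the choice of the first observation as the shift are assumptions made for the example, not anything Excel does.

```python
import math

def shifted_std(xs, shift=None):
    """Sample standard deviation via the one-pass formula applied to shifted data.

    The shift defaults to the first observation, an arbitrary but convenient
    value that lies close to the data.
    """
    if shift is None:
        shift = xs[0]
    d = [x - shift for x in xs]
    n = len(d)
    mean_d = math.fsum(d) / n
    ss = math.fsum(v * v for v in d) - n * mean_d * mean_d
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(ss, 0.0) / (n - 1))

# 500 values of 999999.9, one of 1000000 and 500 of 1000000.1
centred = [999_999.9] * 500 + [1_000_000.0] + [1_000_000.1] * 500
print(shifted_std(centred))
```

After the shift the values are of order 0.1, so squaring them no longer discards the digits in which the data differ.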
Regression
Similarly, Excel may also have problems with linear regression. For example,
the file Filip in the Standard Reference Datasets is a polynomial regression problem in
which Excel is not able to achieve even one significant figure correct! This is an
example of an ill-conditioned problem, meaning that the matrix of independent variables
is very nearly rank-deficient. It may seem that this problem is too exotic to be worthy of
concern, but in fact problems that are rank-deficient (or nearly so) are common. (And the
polynomial regression is one of the standard alternatives offered in the Excel “Add
Trendline” option for scatter charts!)
Note also that L. Knüsel examined the computations related to the statistical distributions
used in Excel, including normal, chi-squared, F, and t. The article is “On the accuracy of
statistical distributions in Microsoft Excel 97,” Computational Statistics and Data
Analysis, Volume 26, 1998, pp 375-377. He found many defects in the algorithms, and
judged Excel to be inadequate. Generally, these inadequacies lie in the tails of the
distributions; for most hypothesis-testing related tasks, the approximations are perfectly
adequate. However, these inaccuracies in the tails may make Excel a poor choice for
simulation studies, except under carefully selected conditions. (Excel is perfectly
adequate, for example, for class demonstrations of basic probability ideas, and the
Central Limit Theorem.)
The paired t-test
Let’s be very specific about the kinds of things that Excel might do to an unsuspecting
user. Consider the following data set, which appears in Minitab’s “white paper”
regarding Excel. Excel gives the wrong t-statistic and p-value for a paired t-test if there
are missing values. Here are the data:
Sample 1 (19 recorded values, in order): 3 4 3 2 4 4 3 2 4 4 3 4 3 2 3 3 4 4 2
Sample 2 (19 recorded values, in order): 2 2 3 3 3 3 4 3 2 2 2 3 2 2 2 4 2 2 3
Each column has one empty cell, in different rows of the 20-row sheet; the empty cells are not shown here.
The paired t-test involves differences, Sample 1 minus Sample 2. There are 20 lines in
the data set, but there are only 18 usable differences, and consequently 17 degrees of
freedom for estimating the standard error of the difference in means.
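The handling of missing values described here (use only the rows where both values are present) can be sketched in Python as follows. The function name and the five-row data set are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the data of this example.

```python
import math

def paired_t(sample1, sample2):
    """Paired t statistic from the rows where both values are present (None = missing).

    Returns (t, degrees of freedom, number of usable pairs).
    """
    diffs = [a - b for a, b in zip(sample1, sample2)
             if a is not None and b is not None]
    n = len(diffs)
    mean_d = math.fsum(diffs) / n
    var_d = math.fsum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var_d / n)          # standard error of the mean difference
    return mean_d / se, n - 1, n

# Hypothetical data: one missing value in each column, in different rows.
x = [3, 4, None, 2, 5]
y = [2, 2, 3, None, 3]
t, df, n_used = paired_t(x, y)
print("t =", t, "df =", df, "pairs used =", n_used)
```

The same arithmetic applied to Minitab's summary figures below gives 0.611 / (1.145 / sqrt(18)), which is approximately 2.26 on 17 degrees of freedom.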
Here is the Minitab output:
Paired T-Test and CI: Sample 1, Sample 2

Paired T for Sample 1 - Sample 2

               N    Mean   StDev  SE Mean
Sample 1      18   3.167   0.786    0.185
Sample 2      18   2.556   0.705    0.166
Difference    18   0.611   1.145    0.270

95% CI for mean difference: (0.042, 1.180)
T-Test of mean difference = 0 (vs not = 0): T-Value = 2.26  P-Value = 0.037
Minitab correctly works around the missing values. Minitab reports the means for each
column for the 18 relevant entries. Minitab gives the correct values, t = 2.26 and
p = 0.037. (Minitab does not print the number of degrees of freedom, but it is easy to
determine that this is 17; moreover the p-value corresponds to 17 degrees of freedom.)
(SAS enforces explicit calculation of the differences for the paired t-test, and of course,
produces the correct results.)
Here is the Excel output:
t-Test: Paired Two Sample for Means

                                Sample 1   Sample 2
Mean                              3.2105     2.5789
Variance                          0.6199     0.4795
Observations                          19         19
Pearson Correlation              -0.1770
Hypothesized Mean Difference           0
df                                    18
t Stat                            1.7143
P(T<=t) one-tail                  0.0518
t Critical one-tail               1.7341
P(T<=t) two-tail                  0.1036
t Critical two-tail               2.1009
Excel gives t = 1.7143 and p = 0.1036, which are incorrect. Excel gives 18 degrees of
freedom, which is also incorrect. (The combination of p-value, t, and degrees of
freedom is at least internally consistent.) Excel’s handling of this data set is unfortunate:
it cannot even determine which data points are to be used in the analysis. If the data are
“cleaned up” prior to use of the tool (i.e. all pairs that have a missing component are
deleted), then the Excel routine works fine (modulo its problems with precision).
Unfortunately, most Excel functions, such as =AVERAGE(), =STDEV() etc, are
designed to take account of missing values, so the less sophisticated user may also trust
the statistical tools to take account of missing values in a sensible way. This, of course, is
not the case!
Incidentally, if there are entries in the missing places, such as ‘*’ or ‘.’, then the Excel
tool throws up its hands in horror and declares that the “Input range contains non-numeric
data”.
Note that “P(T<=t) one tail” seems to be a poor description. What is intended is
something like P(T>|t|) one tail.
Note also that Excel does not in general carry through arithmetic properly if there are
missing values – empty cells in arithmetic expressions are (usually) interpreted as zeroes.
In this example, the value in cell B3 should be missing, but Excel gives it as -1.5:
         A      B
    1    X      Resids
    2    1      -0.5    (=A2 - AVERAGE(A2:A4))
    3           -1.5    (=A3 - AVERAGE(A2:A4))
    4    2       0.5    (=A4 - AVERAGE(A2:A4))
Again, this type of behaviour is common in Excel, which does not have strong typing,
and readily converts blank cells to numeric cells (zero), and will convert the text string
“1” to the number 1 within an arithmetic expression, but not within a function evaluation
such as =SUM(). This can lead to enormous problems with data received from a third
party, particularly if several files have been merged to create a final dataset.
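The residuals example above can be mimicked in Python to show the difference between keeping a missing value missing and silently reading it as zero. This is a sketch only: the function names are made up, and `None` stands in for an empty cell.

```python
def residuals(values):
    """Deviations from the mean of the non-missing values; missing stays missing."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [None if v is None else v - mean for v in values]

def residuals_blank_as_zero(values):
    """Mimics the behaviour described above: the average skips the blank,
    but the blank is read as 0 in the subtraction."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [(0 if v is None else v) - mean for v in values]

column_a = [1, None, 2]                    # cells A2:A4, with A3 empty
print(residuals(column_a))                 # the middle residual stays missing
print(residuals_blank_as_zero(column_a))   # the middle residual becomes -1.5
```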
(The operational unit in Excel is the cell – in statistical languages such as S-Plus,
Minitab, R, SAS etc, the basic unit is the variable. Consequently, all values of a
particular variable are necessarily of the same type (numeric, character etc), while this is
not the case with Excel. In particular, variables with mixed types can create havoc with
the output from Pivot tables and the statistical tools in the toolpak. Pivot tables are
particularly at risk in this regard, as all the functions applied within the pivot table
happily omit character variables without comment. See the section further on regarding
Basic Calculations.)
Linear Regression
Now let’s consider a linear regression situation. The independent variables are Frittata,
Salmon_crepes, …. In this example, there is an exact linear relationship among the
independent variables, as Totalpieces is just the sum of the other variables. Incidentally,
Minitab recognizes this sort of situation and automatically removes one of the variables
in such a constraint.
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.959
R Square              0.920
Adjusted R Square     0.910
Standard Error        21.269
Observations          96

ANOVA
              df    SS            MS           F         Significance F
Regression    11    439623.940    39965.813    88.349    2.39613E-41
Residual      84    37998.560     452.364
Total         95    477622.500

                     Coefficients   Standard Error   t Stat   P-value   Lower 95%       Upper 95%
Intercept            13.398         5.863            2.285    0.025     1.739           25.056
Frittata             9.608          1109566.045      0.000    1.000     -2206484.709    2206503.924
Salmon_Crepes        0.623          1109566.045      0.000    1.000     -2206493.693    2206494.939
Mini_Bagels          0.717          1109566.045      0.000    1.000     -2206493.599    2206495.033
Onion_Olive_Tarts    0.440          1109566.045      0.000    1.000     -2206493.877    2206494.756
Crostini             0.007          1109566.045      0.000    1.000     -2206494.310    2206494.323
Mini_Pizza           0.151          1109566.045      0.000    1.000     -2206494.165    2206494.467
Sandwiches           -0.026         1109566.045      0.000    1.000     -2206494.342    2206494.291
Mini_Muffins         0.107          1109566.045      0.000    1.000     -2206494.210    2206494.423
Fishcakes            0.360          1109566.045      0.000    1.000     -2206493.957    2206494.676
Thai_Crepes          0.519          1109566.045      0.000    1.000     -2206493.797    2206494.836
TotalPieces          0.310          1109566.045      0.000    1.000     -2206494.006    2206494.626
Excel quite happily has a crack at calculating all the usual regression statistics, and, of
course, comes up with some fairly weird results. This just reflects the fact that the
calculations of regression are simply that: calculations – feed in the numbers, and
something will come out. Minitab is a program designed for statistical analysis, and
consequently attempts to give statistical guidance, and make sensible decisions about the
statistical analysis. Excel is not designed for statistical analysis, and will not give any
guidance beyond the strangeness of the results. With regression analyses in Excel, the
“neck-top” computer is a vital tool, and in this case, the signals are certainly there to see:
standard errors that are enormous (or possibly vanishingly small in other cases),
incalculable P-values and so on. This is not so much a criticism of Excel as it is a
cautionary tale, which indicates that regression output should be looked at with some
care. (A comment that can equally be made about the regression output from any
statistical program.)
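A check for an exact linear relationship among the independent variables can be made before running the regression. The Python sketch below computes the rank of a matrix of predictor values by Gaussian elimination; the function name, the absolute tolerance of 1e-9 and the small data set are all assumptions for illustration, and this is not a description of how Minitab or SAS detect the constraint.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (a list of rows) by Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Use the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Hypothetical counts of two items plus their total (third column = sum of the first two).
design = [[2, 5, 7], [4, 1, 5], [3, 3, 6], [6, 2, 8]]
print(matrix_rank(design), "independent columns out of", len(design[0]))
```

A rank smaller than the number of columns signals that at least one variable must be dropped (or the coefficients otherwise constrained) before the regression output can be interpreted.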
Below is the output from SAS, which copes quite well with the constraint:
The REG Procedure
Model: MODEL1
Dependent Variable: Tot_Cost

Analysis of Variance
                            Sum of         Mean
Source             DF      Squares       Square    F Value    Pr > F
Model              10       439624        43962      98.34    <.0001
Error              85        37999    447.04188
Corrected Total    95       477623

Root MSE           21.14337    R-Square    0.9204
Dependent Mean    167.12500    Adj R-Sq    0.9111
Coeff Var          12.65123

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some
statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear combination of
other variables as shown.

totpiece = Fritata + SalmCrep + MiniBag + OnOlTt + Crstini + MiniPizz
           + Sandwich + MiniMuffs + Fishcks + ThaiCrps

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1     13.39793     5.82805       2.30      0.0240
Fritata       B      9.91777     1.78338       5.56      <.0001
SalmCrep      B      0.93333     0.09349       9.98      <.0001
MiniBag       B      1.02691     0.11159       9.20      <.0001
OnOlTt        B      0.74971     0.06739      11.13      <.0001
Crstini       B      0.31679     0.02522      12.56      <.0001
MiniPizz      B      0.46124     0.06545       7.05      <.0001
Sandwich      B      0.28449     0.02319      12.27      <.0001
MiniMuffs     B      0.41665     0.05598       7.44      <.0001
Fishcks       B      0.66981     0.03934      17.03      <.0001
ThaiCrps      B      0.82959     0.08966       9.25      <.0001
totpiece      0            0           .          .           .
Simple Linear Regression, Numerically unstable
The next example is obtained from a posting on edstat by Mark Eakin, 9/18/1996. This is
a simple Y-on-X regression.
    Y      X
    1.1    10000000.1
    1.9    10000000.2
    3.1    10000000.3
    3.9    10000000.4
    4.9    10000000.5
    6.1    10000000.6
Here is the Excel output (2000 version).
SUMMARY OUTPUT

Regression Statistics
Multiple R            65535
R Square              -0.816
Adjusted R Square     -1.271
Standard Error        2.808
Observations          6

ANOVA
              df    SS         MS         F         Significance F
Regression    1     -14.174    -14.174    -1.798    #NUM!
Residual      4     31.534     7.884
Total         5     17.360

             Coefficients      Standard Error    t Stat    P-value    Lower 95%
Intercept    -233538842.107    102864523.600     -2.270    0.086      -519137136.696
X            23.354            10.286            2.270     0.086      -5.206
This has negative SS for regression, and negative F.
Again, one might claim that the data are highly contrived. Nonetheless, the problem can
be solved correctly by other programs, and it indicates again that Excel is using obsolete
algorithms. Again, the output is indicative of something seriously amiss.
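For comparison, the same fit can be computed from deviations about the means, which avoids the cancellation. The Python sketch below is one way to do this, assuming double-precision floats; the function name is illustrative, and no claim is made that Minitab or SAS use exactly this method.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x computed from deviations about the means."""
    n = len(x)
    x_bar = math.fsum(x) / n
    y_bar = math.fsum(y) / n
    dx = [xi - x_bar for xi in x]
    dy = [yi - y_bar for yi in y]
    sxx = math.fsum(d * d for d in dx)
    syy = math.fsum(d * d for d in dy)
    sxy = math.fsum(a * b for a, b in zip(dx, dy))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared

y = [1.1, 1.9, 3.1, 3.9, 4.9, 6.1]
x = [10000000.1, 10000000.2, 10000000.3, 10000000.4, 10000000.5, 10000000.6]
slope, intercept, r_squared = simple_regression(x, y)
print("slope:", slope, "intercept:", intercept, "R^2:", r_squared)
```

Worked by hand on the deviations, the slope is 1.74 / 0.175, roughly 9.94, and R² is roughly 0.997, both far from the values in the Excel output above.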
Incidentally, the short-cut way of fitting a simple linear regression in Excel is to create
the scatter plot and “Add Trendline”. This results in a quite remarkable plot of the data
plus fitted line:
[Chart: scatter plot of the data with the fitted trendline, labelled y = 2.09E+01x - 2.09E+08 and R² = 2.09E+00. Horizontal axis: 10000000.00 to 10000000.70; vertical axis: -8 to 8.]
So the algorithm used in the “Add Trendline” function is different from that used in the
regression tool (and quite markedly wrong)! The coefficients are different, and so is the
R² value, which is greater than 1!
Minitab gets this one correct, by the way, as does SAS.
Regression through the Origin
Now let’s note that Excel also does not correctly perform a regression through the origin.
Consider this data set:
    X      Y
    3.5    24.4
    4.0    32.1
    4.5    37.1
    5.0    40.4
    5.5    43.3
    6.0    51.4
    6.5    61.9
    7.0    66.1
    7.5    77.2
    8.0    79.2
We want to fit the regression model $Y_i = \beta X_i + \varepsilon_i$.
Here is the output from Excel:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.952081
R Square              0.906459
Adjusted R Square     0.795348
Standard Error        5.818497
Observations          10

ANOVA
              df    SS          MS          F          Significance F
Regression    1     2952.635    2952.635    87.2144    1.41E-05
Residual      9     304.6941    33.8549
Total         10    3257.329

                Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept       0              #N/A             #N/A       #N/A       #N/A        #N/A        #N/A          #N/A
X Variable 1    9.130107       0.310458         29.40852   2.97E-10   8.427802    9.832412    8.427802      9.832412
(This F-value is very strange! It is the ratio 2952.635/33.8549 of the mean squares in
Excel's own table, but it is not equal to the square of the t-statistic for the slope,
which is, incidentally, the correct value, 864.86, as in the Minitab output below.)
Here is the same run in Minitab:
The regression equation is
Y = 9.13 X

Predictor      Coef    SE Coef        T        P
Noconstant
X            9.1301     0.3105    29.41    0.000

S = 5.818

Analysis of Variance
Source            DF       SS       MS         F        P
Regression         1    29280    29280    864.86    0.000
Residual Error     9      305       34
Total             10    29584
It should be noted that both programs produce the same fitted equation Y = 9.13 X.
In the analysis of variance, both programs get the same residual sum of squares, 305.
However, Minitab works from the correct total sum of squares,
$\sum_{i=1}^{10} Y_i^2$ = 29,584.
Excel makes the mistake of computing around the average; that is, Excel computes
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = 3,257.33. The average Y has no bearing in a regression through the
origin. As a result, Excel reports an incorrect R². The correct computation of R² for this
problem is indicated by Sen and Srivastava, Regression Analysis: Theory, Methods, and
Applications, Springer-Verlag, New York, 1990, or just about any book on statistics at
the elementary level.
Thus, Excel gets most of the analysis of variance table wrong, including the F test.
Provided the data are well-behaved, and it is intended that inferences are to be made on
the coefficients, then the “regression through the origin” option in the regression tool can
be used with a little confidence, and a bucket of caution!
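The analysis of variance for the regression through the origin can be recomputed directly from the data given earlier. The Python sketch below does this with the uncorrected total sum of squares; the variable names are illustrative, and the uncorrected form of R² shown is one common definition for a no-intercept model, stated here as an assumption rather than quoted from Sen and Srivastava.

```python
import math

x = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
y = [24.4, 32.1, 37.1, 40.4, 43.3, 51.4, 61.9, 66.1, 77.2, 79.2]
n = len(x)

# Least-squares slope for Y = beta * X (no intercept).
slope = math.fsum(a * b for a, b in zip(x, y)) / math.fsum(a * a for a in x)
ss_resid = math.fsum((b - slope * a) ** 2 for a, b in zip(x, y))

# Uncorrected total: the appropriate total when the model has no intercept.
ss_total_uncorrected = math.fsum(b * b for b in y)

# Corrected total (about the mean of Y): the quantity in Excel's table.
y_bar = math.fsum(y) / n
ss_total_corrected = math.fsum((b - y_bar) ** 2 for b in y)

ss_regression = ss_total_uncorrected - ss_resid
f_stat = ss_regression / (ss_resid / (n - 1))      # 1 and n - 1 degrees of freedom
r_squared = ss_regression / ss_total_uncorrected   # assumed (uncorrected) definition

print("slope:", round(slope, 6))
print("SS total (uncorrected):", round(ss_total_uncorrected, 2))
print("SS total (corrected):", round(ss_total_corrected, 3))
print("SS residual:", round(ss_resid, 4))
print("F:", round(f_stat, 2), "R^2:", round(r_squared, 4))
```

The slope and residual sum of squares agree with both programs, while the F statistic agrees with Minitab's 864.86 rather than Excel's 87.2144.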
Excel uses obsolete computing algorithms.
This point was noted earlier, with regard to computing $\sum (x_i - \bar{x})^2$ through
$\sum x_i^2 - n\bar{x}^2$. The latter formula has been known to be inaccurate for decades.
Excel is not able to deal with singularity or near-singularity in the matrix of independent
variables in the regression problem; Excel is again using an inappropriate computing
algorithm.
See also the earlier note on the findings of Knüsel regarding statistical distributions. He
found many defects in Excel’s algorithms.
For the most part, these problems may be detected by checking such things as relative
scale (CV etc), and using alternative algorithms for calculating tail probabilities,
particularly for very small values. Awareness of the shortcomings, however, is vital.
Microsoft has not responded to earlier comments about the
quality of its statistical computing.
The essence of the McCullough and Wilson findings is that Excel, for many examples,
does not produce answers with the same number of significant figures that other
programs can obtain. In their exploration of computational quality of statistical
programs, they applied their criteria to many statistical and econometric packages. This
comment is quite relevant:
It is worth noting that vendors of these statistical and econometric packages
participated fully in the application of this methodology to their products. These
vendors verified all calculations, provided information on algorithms when such
information was not included in the documentation, and otherwise assisted the
process. … Since it is conceivable that more statistical calculations are performed
using Excel than any other package, it is important that the statistical capabilities
of Excel be assessed. Microsoft was invited to participate in this evaluation but
chose not to do so.
It is hard to know the precise reasons for Excel’s failings, as Microsoft did not cooperate
in the investigation. The failure to calculate precisely the sample standard deviation
$\sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$ in the Numacc3 and Numacc4 examples above leads to
the suggestion that Excel used the hand calculator form
$\sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n-1}}$. Unfortunately, as noted by McCullough and
Wilson, p 30, it has been known for most of the 20th century that this formula is highly
unreliable. McCullough and Wilson cite R. F. Ling, “Comparison of several algorithms
for computing sample means and variances,” Journal of the American Statistical
Association, vol 69, 1974, pp 859-866.
Microsoft has not paid attention to earlier reports about computational problems and
surprisingly, Microsoft refused to cooperate in the McCullough and Wilson study.
Excel lacks the flexibility to deal with serious statistical
investigations.
This is in some senses a non-criticism, since Excel was never intended for use as a
serious package for statistical analysis. Nevertheless, it is worth noting some of its
“absences”.
The Data Analysis main menu in Excel lists these features:
Anova: Single Factor
Anova: Two-Factor with Replication
Anova: Two-Factor without Replication
Correlation
Covariance
Descriptive Statistics
Exponential Smoothing
F-Test Two-Sample for Variances
Fourier Analysis
Histogram
Moving Average
Random Number Generation
Rank and Percentile
Regression
Sampling
t-Test: Paired Two Sample for Means
t-Test: Two-Sample Assuming Equal Variances
t-Test: Two-Sample Assuming Unequal Variances
z-Test: Two Sample for Means
This is a very modest subset of the statistical tools that an analyst would use. Omitted
topics include discriminant analysis, logistic regression, factor analysis, principal
components, leverage values, and on and on. Many of these can be calculated using
some of the matrix primitives available in Excel, but the numerical properties of some of
these algorithms must be at best suspect given the poor implementation of even very
simple and well-known algorithms such as the standard deviation algorithm. Maximum
likelihood estimates can be derived using the SOLVER add-in, (but recall the problems
with the implementation of some of the distribution functions in Excel) and it appears
that the technology underlying the SOLVER is a great deal more sophisticated than that
underlying the Data Analysis Toolpak. Again, Excel can be a reasonable stop-gap
alternative, in the hands of an experienced and cautious analyst.
Often, those statistical tasks performed by Excel require an inordinate amount of setup.
For example, the chi-squared test for the two-by-two table requires that the user actually
set up the table of expected values, and as we have seen, if there are missing values
within the dataset, these need to be removed (in the multivariate sense) prior to applying
any of the tools in the toolkit.
Minitab’s “white paper” lists additional complaints. They note that Excel does not do
stem-and-leaf plots, boxplots, or dotplots. Few of Excel’s analyses come with automatic
graphs, and even those that do are in fact incorrectly constructed. The histogram, for
example, is not a histogram, nor is the normal probability plot a normal probability plot
of any kind. Again, these charts can be produced, but it may involve a fair amount of
“hard work” to persuade Excel’s only real graph, the XY plot, to do so.
The “white paper” notes that the documentation is minimal and sometimes inaccurate.
They cite “Spreadsheets and Statistics: The Formulas and the Words,” by Herman
Callaert, Chance, Volume 12, No. 2, 1999.
Basic Calculations
Of more interest and concern, perhaps, is the loose typing employed by Excel throughout.
As mentioned before, the operational unit in Excel is the cell, whilst in statistical
thinking, it is the variable, usually represented in Excel as a column of values. In
statistical thinking, we tend to regard all the values of a variable as having the same type,
be it real numbers, integers, or character values. Excel makes no such constraint, and we
may get trapped, for example, when a column of what appear to be numbers contains a
number of character entries. For example, consider the spreadsheet fragment below:
They look like numbers! All are right-adjusted (the default display for numbers, whereas
the default for strings is left-adjusted), but of course Excel allows us to choose the
justification within an area of the sheet. What of the entries in rows 12 and 13? Each is
labeled “Sum”: the sum in cell C12 is the result of the usual formula for the sum of a
range, and the entry in cell C13 is the result of adding together the individual entries in
cells C2 to C10, as shown in the fragment below (displaying the formulae):
Sum
Sum2
=SUM(C2:C10)
=C2+C3+C4+C5+C6+C7+C8+C9+C10
So, what’s the problem here? Several of the entries in the range C2:C10 are actually text
entries, not numerical entries. When the function ‘+’ is applied to such entries, it
converts the entry from type text to type numerical; when the SUM() function is applied
to such entries, it applies no conversion! This sort of thing can arise when an old
spreadsheet is re-used (i.e. we open an old spreadsheet, delete all the entries, and then
enter our new values): some areas of the spreadsheet may have been formatted as text, and
when we type numbers into these areas, they are stored by Excel as text values.
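The two behaviours, and a simple audit for text entries in a column that is meant to be numeric, can be sketched in Python. The function names and the column contents are hypothetical; the code mimics the behaviour described above rather than calling Excel.

```python
def sum_numeric_only(cells):
    """Adds the numeric entries and skips text, like the SUM() behaviour described above."""
    return sum(c for c in cells
               if isinstance(c, (int, float)) and not isinstance(c, bool))

def sum_with_coercion(cells):
    """Converts numeric-looking text to numbers before adding, as repeated '+' does."""
    return sum(float(c) for c in cells)

def text_entries(cells):
    """Positions (0-based) of the entries that are stored as text."""
    return [i for i, c in enumerate(cells) if isinstance(c, str)]

# Hypothetical column: three values were typed into cells formatted as text.
column_c = [12, "7", 3.5, "10", 8, "2.5", 4, 6, 1]
print(sum_numeric_only(column_c))   # text entries ignored
print(sum_with_coercion(column_c))  # text entries converted
print(text_entries(column_c))       # where the text entries are
```

Comparing the two sums, or simply listing the text positions, is a cheap check to run on any column received from a third party before analysing it.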
Another interesting phenomenon is revealed here: the entry in C13 is left-adjusted – a
text value! Thus Excel has performed the type conversion within the repeated application
of ‘+’, but has converted the result to a text value! A very amusing thing happens if we
try to edit this value: it becomes the text string
=C2+C3+C4+C5+C6+C7+C8+C9+C10
perhaps not quite what was intended!
These problems in Excel make it a dangerous tool for financial calculations and the
construction of financial models, perhaps its primary use!
Dates, too, can cause enormous problems in Excel, particularly in the confusion between
dates that are formatted as dd/mm/yy vs mm/dd/yy. Is 12/11/00 the 12th of November, or
the 11th of December? Choosing a non-ambiguous format is always safer. Safer still is to
enter dates as three separate fields, one for the day, another for the month, and a third for
the year (in full! 1989, not 89!). Date values can then be constructed (and manipulated)
in a consistent and interpretable fashion.
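The ambiguity, and the suggestion of building dates from three separate fields, can be illustrated with Python's standard `datetime` module. The helper `build_date` and its four-digit-year check are assumptions made for this sketch.

```python
from datetime import date, datetime

ambiguous = "12/11/00"
as_day_first = datetime.strptime(ambiguous, "%d/%m/%y").date()
as_month_first = datetime.strptime(ambiguous, "%m/%d/%y").date()
print(as_day_first.isoformat(), "versus", as_month_first.isoformat())

def build_date(day, month, year):
    """Builds a date from separate day, month and year fields (year given in full)."""
    if year < 1000:
        raise ValueError("year must be given in full, e.g. 1989 rather than 89")
    return date(year, month, day)

print(build_date(12, 11, 2000).isoformat())  # unambiguous ISO form, year first
```

The same string yields two different dates depending on the assumed format, whereas the three-field form leaves nothing to interpretation.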
Final Note
The analysis tool is an add-on bought from some probably extinct software company.
Current Excel staff have neither the interest nor the expertise to dive into it and fix it up,
nor would it be cost effective. The argument is basically that the analysis tool is for quick
and dirty analysis and that anyone needing a really serious tool would already have SAS,
SPSS or Minitab or something of that nature. Thus we need not worry about the problems
on which the analysis tool "fails". The same argument probably goes for the statistical
functions.
This is, of course, a facile argument, and many very careful analysts have been trapped
by the shortcomings of Excel as a data analytic tool. The old saw “If it’s worth doing,
it’s worth doing well” would seem to hold as well today, and in this sphere, as when I was
browbeaten with it by my father when I was a child! It is unlikely that this note will
have any effect on Microsoft; my hope is that it may make some users a little more
careful, a little more critical, when applying the dangerous tools in this toolkit!