Spreadsheets in Statistical Practice—Another Look


J. C. Nash

Many authors have criticized the use of spreadsheets for statistical data processing and computing because of incorrect statistical functions, no log file or audit trail, inconsistent behavior of computational dialogs, and poor handling of missing values. Some improvements in some spreadsheet processors and the possibility of audit trail facilities suggest that the use of a spreadsheet for some statistical data entry and simple analysis tasks may now be acceptable. A brief outline of some issues and some guidelines for good practice are included.

KEY WORDS: Audit trail; Data entry; Statistical computing.

1. CONCERNS ABOUT SPREADSHEETS

The ubiquity of spreadsheets has encouraged their use in statistics as well as most other areas of quantitative endeavour. Panko and Ordway (2005; see also panko.cba.hawaii.edu/ssr/) showed that a vast majority of financial and management planning and decision-making uses spreadsheets, sometimes with disastrous consequences (Brethour 2003). The European Spreadsheet Risks Interest Group, which in fact has worldwide participation, considers these issues. See www.eusprig.org for many useful examples and links to their conference proceedings.

Many statisticians dislike spreadsheets in statistical practice, first because of bugs or inaccuracies in the mathematical or statistical functions of the spreadsheet programs. A sample of references includes Cryer (2002), Nash and Quon (1996), Nash, Quon, and Gianini (1995), and contributions by McCullough (1998, 1999) and McCullough and Wilson (2002, 2005).
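Many of the reported accuracy failures share a common numerical root: use of a one-pass "calculator" formula for quantities such as the variance. The Python sketch below is illustrative only (it is not the spreadsheets' actual code) and shows how that formula cancels catastrophically when values are large relative to their spread, while the standard two-pass formula remains stable.

```python
# One-pass "calculator" variance: sum(x^2) - n*mean^2, then divide.
# Numerically dangerous: the two large terms nearly cancel.
def variance_one_pass(xs):
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x * sum_x / n) / (n - 1)

# Two-pass formula: compute the mean first, then sum squared deviations.
# Stable because the deviations are small and exact here.
def variance_two_pass(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Large offset, small spread: the classic stress case.
data = [1e8 + d for d in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(variance_two_pass(data))   # 2.5, the correct sample variance
print(variance_one_pass(data))   # typically far from 2.5 on this data
```

On well-scaled data the two formulas agree; the point is that a spreadsheet user has no way to know which formula a built-in function uses.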

A second concern is data entry and edit, where the lack of an audit trail of changes to the spreadsheet data is an invitation to poor and unverifiable work (Nash and Quon 1996). Yet spreadsheet use is almost casual, for example, by Mount et al. (2004):

“Data were entered into Microsoft Access and Microsoft Excel and exported to Stata (version 7) for analysis.”

Practitioners are well aware of how easily errors and falsifications arise in data collection. An excellent and entertaining overview was given by Gentleman (2000). Popular statistical packages offer an audit or log file as an aid for checking work performed.

A third issue is that the use of “one tool for all tasks” may leave students unaware of the diversity of tools and unable to select the most appropriate software for their needs (Hunt 1995; College Entrance Exam Board 2002). Despite the pedagogical convenience of familiar software, statisticians have a role in promoting the use of tools appropriate to the task.

Most of us are likely, however, to use spreadsheets or spreadsheet-like interfaces, possibly in statistical packages such as Minitab, Statistica, UNISTAT, and NCSS. There are good reasons for this. Spreadsheets allow the user to access the data more or less randomly. That is, we can go to any cell and make a change. If cells contain formulas or functions, the spreadsheet computational paradigm is supposed to ensure that all dependent cells of the dataset are updated.

Updating is useful, but it is also dangerous, since we can do a lot of damage with clumsy fingers on the keyboard. Furthermore, as noted by Nash and Quon (1996), some of the statistical dialogs of spreadsheets, for example, regression, result in static outputs—a violation of the spreadsheet paradigm that results in errors when users do not re-run the calculations after updating their data. The confusion is worsened by different behavior depending on the calculation chosen and the spreadsheet processor.
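The contrast between the updating paradigm and a static dialog output can be sketched in a few lines of Python (cell names and the layout are hypothetical, chosen only for illustration):

```python
# Toy model of the spreadsheet dependency paradigm: formula cells
# recompute from current inputs, while a "static" result (like some
# spreadsheets' regression dialog output) is a frozen copy.
cells = {"A1": 2.0, "A2": 3.0}
formulas = {"B1": lambda c: c["A1"] + c["A2"]}   # B1 = A1 + A2

def value(name):
    # A formula cell is recomputed on demand from the current inputs.
    return formulas[name](cells) if name in formulas else cells[name]

static_B1 = value("B1")      # snapshot, as a static dialog would paste it
cells["A1"] = 10.0           # the user edits an input cell
print(value("B1"))           # 13.0 -- the live formula reflects the edit
print(static_B1)             # 5.0  -- stale, and silently out of date
```

The stale snapshot carries no visible warning, which is exactly why static dialog output violates users' expectations of a spreadsheet.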

In Excel 2003, ANOVA updates while regression does not. A “recalculate” instruction does not suffice.

Nevertheless, developments in spreadsheets may render them suitable for some statistical work. I will try to suggest some appropriate applications.

2. MOTIVATIONS AND GOALS

My main objective is to encourage statisticians to learn where and how spreadsheets (indeed any software) may be appropriate in their work. Software developments, some outside statistics, offer potentially “safer” ways to use spreadsheets in statistical work. Where good statistical packages or well-constructed databases for data entry and edit are unavailable, spreadsheets may prove useful. My message is harm reduction as opposed to abstinence. The developments, some incomplete, that inform my view on spreadsheet use in statistics involve

• improved statistical functions;

• audit trails of spreadsheet work; and

• improved data and program transfer (e.g., http://www.oasis-open.org).

These ideas offer potential benefits to statistical practitioners, especially because many of the ideas are being developed collaboratively with involvement of users.

J. C. Nash is Professor, School of Management, University of Ottawa, ON K1N 9B5, Canada (E-mail: nashjc@uottawa.ca). This article would not have been written without the stimulation and interaction with Neil Smith, Andy Adler, Sylvie Noël, and Jody Goldberg. The author is involved with preparing test spreadsheets for the Gnumeric project.

3. IMPROVING SPREADSHEET FUNCTIONS

Computational “add-ins” to spreadsheets, especially Microsoft Excel, claiming to allow “correct” statistical computations to be performed include Analyse-It (www.analyse-it.com), UNISTAT (www.unistat.com), and Palisade StatTools (www.palisade.com/html/stattools.asp). Alternatively, RSvr is a freely available tool that allows Excel to use functions in the open-source R statistical package (cran.r-project.org/contrib/extra/dcom/).

© 2006 American Statistical Association DOI: 10.1198/000313006X126585 The American Statistician, August 2006, Vol. 60, No. 3 287

The latter tool is but one of many contributions associated with Erich Neuwirth (see sunsite.univie.ac.at/Spreadsite/).

Add-ins allow for a quick remediation of defective functionality, but may fragment the user community. For example, if user A uses add-in X but user B uses add-in Y while user C uses the base spreadsheet processor, we may expect some—hopefully minor—differences in their statistical results. Unfortunately, even small differences give rise to worries that results may be “wrong,” and the causes of differences may be difficult to elucidate.

Alternatively, like the Gnumeric community, one can attempt to provide a “best possible” spreadsheet processor. The market dominance of Excel means in some cases including a way to “work like Excel” even including its errors, but extra functionality is also possible, such as Gnumeric’s hypergeometric function. Gnumeric.org also offers a set of test spreadsheets to allow a spreadsheet processor, and in particular new “builds” of Gnumeric, to be verified. See www.gnumeric.org for either the open-source spreadsheet processor or the test sheets, which are in .xls format. Unfortunately, these test spreadsheets are not nearly as extensive as one might like. The author invites interested readers to join him in helping to improve these.

Gnumeric has already influenced statistical computing. Jody Goldberg, the lead maintainer of Gnumeric, found some improvements to the statistical distribution function codes from R which were used as the basis for Gnumeric’s functions. These improvements have, I am informed, now been incorporated back into R.


4. AUDIT TRAILS

Audit trails help us find and correct errors. Most major statistical packages include this facility. Velleman (1993) presented some ideas. For spreadsheets, we want to know who changed a particular cell, when they changed it, and the content of the cell before and after the change. Spreadsheets, traditionally, have not provided this capability.

There are several ways to include an audit trail with a spreadsheet. One is to note, as Neil Smith and I did in late 2002, that the change-recording facility of modern spreadsheet processors could provide a log if we can ensure the change record is not tampered with or accidentally altered. The resulting changes list is large, requiring tools to filter it and ease the task of analyzing the audit trail, for example, to show only those cells where a formula has been replaced with a number.
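The filtering step described above might look like the following Python sketch. The record layout and field names are assumptions for illustration, not TellTable's actual change-record format.

```python
# Hypothetical audit-trail filter: given change records (cell, old
# content, new content), keep only those changes where a formula was
# overwritten by a plain number -- a common sign of tampering or error.
def formula_replaced_by_number(record):
    old, new = record["old"], record["new"]
    was_formula = isinstance(old, str) and old.startswith("=")
    now_number = isinstance(new, (int, float))
    return was_formula and now_number

changes = [
    {"cell": "B2", "old": "=SUM(A1:A9)", "new": 42.0},    # formula -> number
    {"cell": "C3", "old": 7.0,           "new": 8.0},     # number edit
    {"cell": "D4", "old": "=A1*2",       "new": "=A1*3"}, # formula edit
]
suspect = [r["cell"] for r in changes if formula_replaced_by_number(r)]
print(suspect)   # ['B2']
```

A real filter would also carry timestamps and user identities, but the essential operation is this kind of predicate applied over the change list.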

After overcoming many annoying issues of technical detail, we found success by running the OpenOffice.org spreadsheet processor calc on a secure Web server. The details and other developments are described by Adler, Nash, and Noël (2006). The server software is available under the GNU Lesser General Public License at http://telltable-s.sf.net. Descriptions of the filtering program are available at http://www.telltable.com.

Another approach, currently being tried in Gnumeric, is to include the audit capability in the spreadsheet software itself. Clearly it helps to have the source code (Nash and Goldberg 2005).

Finally, there are some commercial tools that claim to offer audit trail capability. Wimmer Systems (www.wimmersystems.com) provides an Excel add-in to do this, while Cluster Seven (www.clusterseven.com) uses (apparently) a large-scale enterprise groupware system to monitor changes.

5. WHEN SHOULD WE USE SPREADSHEETS IN STATISTICS?

Data entry, edit, and transformation is first on my list of statistical applications of spreadsheets. In the absence of a well-configured database system, a spreadsheet with an audit trail is easy to use and relatively “safe.” If functions are computed properly, we can perform transformations, recodings, and simple preliminary analyses. I find an audit trail serves best when it allows me to catch my own errors. Test calculations, as in the Gnumeric test spreadsheets, serve a similar role, but improvement is needed in the usability, the capability, and the output of such tools.

For statistical analysis, I use a spreadsheet for modest computations that can be programmed within the spreadsheet’s own functions, avoiding special dialogs such as regression that (usually!) require user intervention to re-run them if data change, or that may vary in how they behave across spreadsheet processors (e.g., ANOVA). Graphics usually update if the inputs change, so these are useful if they are simple enough to prepare, though my own preference is to use a statistical package. It is conceivable that regression could be included within the regular spreadsheet functions by using vector-valued or selectable outputs. This is a direction I am investigating, as it would make these important statistical computations “updateable” with the data. A central theme, however, is that any analysis by spreadsheet should be simple to set up. A single formula applied to a large block of data is preferred to several formulas applied to only a few cells.
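The idea of regression as an ordinary, recomputed-on-demand function with a vector-valued output can be sketched as follows. The function name and interface are illustrative, not an existing spreadsheet API.

```python
# Sketch of a vector-valued "regression as a plain function" over a
# data block: called on the current data, it always returns up-to-date
# coefficients, just as an ordinary spreadsheet formula would.
def linfit(xs, ys):
    # Ordinary least squares for y = a + b*x, returned as a vector.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b   # (intercept, slope)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
print(linfit(xs, ys))   # (1.0, 2.0): y = 1 + 2x
```

Because the result is just a function of the data block, any edit to the block would propagate through the normal recalculation machinery, with no dialog to re-run.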

Typical analyses where spreadsheets can be used include:

• evaluation of probability distributions, as in solutions and marking guides for classroom exams, or computation of simple confidence intervals or hypothesis test results;

• data conversions, check totals, and modest tables;

• simple trend lines or smoothings of data; and

• simple descriptive statistics of columns or blocks of data.
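As an example of the first item above, a simple normal-approximation confidence interval reduces to a handful of ordinary function evaluations. Python's standard library stands in here for spreadsheet functions; the data are made up for illustration.

```python
# A 95% normal-approximation confidence interval for a mean, computed
# entirely from simple function calls -- the kind of calculation that
# maps directly onto ordinary spreadsheet formulas.
from math import sqrt
from statistics import NormalDist, mean, stdev

data = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]   # illustrative sample
n = len(data)
m = mean(data)
se = stdev(data) / sqrt(n)         # standard error of the mean
z = NormalDist().inv_cdf(0.975)    # ~1.96 for a 95% interval
low, high = m - z * se, m + z * se
print(round(low, 3), round(high, 3))
```

With larger samples this is adequate; for small samples a t quantile would replace z, and that too is a single function evaluation in most spreadsheets.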

Macros should be avoided. These are programs that can be launched (sometimes automatically and against the user’s wishes) from within the spreadsheet. For Excel, macros are written in a form of Visual Basic. Other spreadsheet programs generally allow similar constructs but using different coding methods.

From the point of view of security and quality, macros raise a large red flag since they often use random access to the spreadsheet data in a way that is difficult to track and debug. Clearly add-ins can be criticized similarly if there is not a well-defined mechanism for interaction with the spreadsheet data.

6. SIDE BENEFITS AND SYNERGIES

We built TellTable using a Web interface in order to protect our audited spreadsheet from interference. After the fact, we discovered that it could run many software packages in a way that allowed for controlled collaboration. Users normally “lock” a file to prevent conflicting edits, but a user can elect to share the screen with one or more others who are at different locations.

As a proof of concept we had two users share a single Matlab session where both could modify inputs to an animated graphical output that was the solution of a set of equations. Collaboration on statistical modeling at a distance would be a similar task.

Although many statistical packages provide a form of spreadsheet for data entry and manipulation, these are often poor imitations of general spreadsheet capabilities. Using standardized file formats, members of a team of users should each be able to choose the tools they find most suited to their needs and tastes without fearing platform or file-format conversion difficulties. Dissociating information content from the tools used to process it allows customization for individual productivity while maintaining group progress. For example, using ssconvert from Gnumeric allows me to move datasets and outputs back and forth easily between spreadsheets and R.

The growth of Web-based applications and interfaces permits multiple, small, easily linked statistical applications to work on standardized files. The building blocks exist now, are relatively straightforward to program, and are usually platform independent as a bonus. Even on a local machine a Web interface is a convenient way to build a graphical front-end to a set of simple, possibly non-windowed applications (Nash and Wang 2003).

The three themes here—collaboration over distance and time, standardized files, and Web interfaces—are complementary to each other and to spreadsheet use for selected statistical purposes.

7. CONCLUSION

My goal has been to highlight several technological developments in spreadsheets and computational practice that promise improvement in statistical data processing. A decade ago, I warned against spreadsheet use for any statistical application. Now I see the possibility of some useful, low-risk statistical applications of spreadsheets. Furthermore, statisticians, rather than complaining about the faults of spreadsheets, can become insiders to open-source software projects that let them improve their own tools and the ways they use them.

REFERENCES

Adler, A., and Nash, J. C. (2004), “Knowing What was Done: Uses of a Spreadsheet Log File,” Spreadsheets in Education. Available online at http://www.sie.bond.edu.au/articles/1.2/AdlerNash.pdf.

Adler, A., Nash, J. C., and Noël, S. (2006), “Evaluating and Implementing a Collaborative Office Document System,” Interacting with Computers, 18, 665–682.

Brethour, P. (2003), “Human Error Costs TransAlta $24-million on Contract Bids,” Globe and Mail (Toronto), online edition, June 4, 2003, http://www.bpm.ca/TransAlta.htm.

College Entrance Exam Board (2002), “Advanced Placement Program: Statistics Teachers Guide.” Available online at apcentral.collegeboard.com/repository/ap02 stat techneed fi 20406.pdf.

Cryer, J. (2002), “Problems with using Microsoft Excel for Statistics,” in Proceedings of the 2001 Joint Statistical Meetings [CD-ROM], Alexandria, VA: American Statistical Association.

Gentleman, J. F. (2000), “Data’s Perilous Journey: Data Quality Problems and 14 Other Impediments to Health Information Analysis,” Statistics and Health, Edmonton Statistics Conference 2000, Edmonton, Alberta.

Hunt, N. (1995), “Teaching Statistical Concepts Using Spreadsheets,” in Proceedings of the 1995 Conference of the Association of Statistics Lecturers in Universities, Teaching Statistics Trust. Available online at http://www.mis.coventry.ac.uk/nhunt/aslu.htm.

McCullough, B. D. (1998), “Assessing the Reliability of Statistical Software: Part I,” The American Statistician, 52, 358–366.

——— (1999), “Assessing the Reliability of Statistical Software: Part II,” The American Statistician, 53, 149–159.

McCullough, B. D., and Wilson, B. (2002), “On the Accuracy of Statistical Procedures in Microsoft Excel 2000 and Excel XP,” Computational Statistics and Data Analysis, 40, 713–721.

——— (2005), “On the Accuracy of Statistical Procedures in Microsoft Excel 2003,” Computational Statistics and Data Analysis, 49, 1244–1252.

Mount, A. M., Mwapasa, V., Elliott, S. R., Beeson, J. G., Tadesse, E., Lema, V. M., Molyneux, M. E., Meshnick, S. R., and Rogerson, S. J. (2004), “Impairment of Humoral Immunity to Plasmodium falciparum Malaria in Pregnancy by HIV Infection,” The Lancet, 363, 1860–1867.

Nash, J. C. (1991), “Software Reviews: Optimizing Add-Ins: The Educated Guess,” PC Magazine, 10 (7), April 16, 1991, 127–132.

Nash, J. C., and Goldberg, J. (2005), “Why, How and When Spreadsheet Tests Should be Used,” in Proceedings of the EuSpRIG 2005 Conference on Managing Spreadsheets in the Light of Sarbanes-Oxley, ed. David Ward, London: European Spreadsheet Risks Interest Group, pp. 155–160.

Nash, J. C., and Quon, T. (1996), “Issues in Teaching Statistical Thinking with Spreadsheets,” Journal of Statistics Education, 4 (1). Available online at http://www.amstat.org/publications/jse/v4n1/nash.html.

Nash, J. C., Quon, T., and Gianini, J. (1995), “Statistical Issues in Spreadsheet Software,” 1994 Proceedings of the Section on Statistical Education, Alexandria, VA: American Statistical Association, pp. 238–241.

Nash, J. C., Smith, N., and Adler, A. (2003), “Audit and Change Analysis of Spreadsheets,” in Proceedings of the 2003 Conference of the European Spreadsheet Risks Interest Group, eds. David Chadwick and David Ward, Dublin, London: EuSpRIG, pp. 81–90.

Nash, J. C., and Wang, S. (2003), “Approaches to Extending or Customizing Statistical Software Using Web Technology,” Working Paper 03-24, School of Management, University of Ottawa.

Panko, R. R., and Ordway, N. (2005), “Sarbanes-Oxley: What About All the Spreadsheets?” in Proceedings of the EuSpRIG 2005 Conference on Managing Spreadsheets in the Light of Sarbanes-Oxley, ed. David Ward, London: European Spreadsheet Risks Interest Group, pp. 15–60.

Velleman, P. (1993), “Statistical Computing: Editor’s Notes,” The American Statistician, 47, 46–47.

