A Look at Alternatives to the Usual Statistical Software Packages
Dominic T. Moore
Forrest W. Young
University of North Carolina at Chapel Hill
Contact: dmoore@bios.unc.edu
Web: http://www.mindspring.com/~dtmoore

SAS, SPSS, SPlus, Systat, Stata, Minitab, Statview, Statistica. Others?

“Free software is software that comes with permission for anyone to use, copy and distribute, either verbatim or with modifications, either gratis or for a fee. In particular, this means that source code is available…”
Free Software Foundation

Introduction to Open Source
The basic idea behind open source is very simple. When programmers on the Internet can read, redistribute, and modify the source for a piece of software, it evolves. People improve it, people adapt it, people fix bugs. And this can happen at a speed that, if one is used to the slow pace of conventional software development, seems astonishing. We in the open-source community have learned that this rapid evolutionary process produces better software than the traditional closed model, in which only a very few programmers can see source and everybody else must blindly use an opaque block of bits.

How is `open source' related to `free software'?
Open Source is a marketing program for free software. It's a pitch for `free software' on solid pragmatic grounds rather than ideological tub-thumping. The winning substance has not changed, the losing attitude and symbolism have.
OPENSOURCE.ORG

Numerous ways to philosophize or think about software
Politically
Economically
Religiously??? (Microsoft as the Great Satan?)

Two Free Statistical Software Environments
R-code: based on and like the S language
XLisp-Stat: based on the Lisp language

LISP-STAT: AN OBJECT-ORIENTED ENVIRONMENT FOR STATISTICAL COMPUTING AND DYNAMIC GRAPHICS
By Luke Tierney
Wiley Series in Probability and Mathematical Statistics

The implementation of Lisp-Stat is known as XLisp-Stat (XLS), since it is built on XLisp, which David Betz developed and whose source code he made available to the public.

About XLisp-Stat
Object-oriented programming: prototyping, statistical model representation
Portable windows interface: Macintosh, X Windows, Microsoft Windows
Graphics: dynamic and customizable
The Lisp language: extends Lisp arithmetic to element-wise operations; adds statistical and linear algebra functions

More about XLisp-Stat
Comes with complete source code in ANSI C.
It is free: you can give it away, and people can use and extend it on any number of computers.
Porting from Common Lisp is now easy.
The whole environment (windows, dialog boxes, menus, etc.) can be controlled and programmed in XLS.
Dynamic graphics: interactivity, with visualizations.

Things I like about XLisp-Stat:
Learning to program “inside out”.
Interactive, iterative programming.
Very different from “data stepping” and “PROC-ing”.

Pearson Chi-square:
$$Q_p = \sum_{i=1}^{s} \sum_{j=1}^{r} \frac{\left(\text{Observed}(ij) - \text{Expected}(ij)\right)^2}{\text{Expected}(ij)}$$
where
$$\text{Expected}(ij) = \frac{(i\text{th RowTotal})\,(j\text{th ColumnTotal})}{N_{\text{total}}}$$

The Quadratic Form for Qp
$$Q_p = (n - m)' A' (A V A')^{-1} A (n - m)$$
where $n = (n_{11}, n_{12}, \ldots, n_{1r}, \ldots, n_{s1}, \ldots, n_{sr})'$ is the vector of cell counts, $m$ is the corresponding expected vector, $V$ is the covariance matrix, and
$$A = [I_{(r-1)},\ 0_{(r-1)}] \otimes [I_{(s-1)},\ 0_{(s-1)}]$$

ViSta is:
Professor Forrest W. Young’s 10-year software development project that utilizes cognitive science and visualization techniques.
Designed particularly for students and teachers of statistics (at all levels).
Used as a research and development tool in computational and graphical statistics.
Free, extensible, and downloadable from the web (and runs on a variety of platforms).
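Looking back at the Pearson chi-square slide, the statistic Qp can be computed directly from a table of counts. The following is a minimal sketch in Python rather than XLisp-Stat, assuming an s x r table given as a list of rows with all row and column totals positive. The function name `pearson_chi_square` and the example counts are made up for illustration and do not come from the talk.

```python
def pearson_chi_square(table):
    """Return (Q_p, expected) for an s x r table of observed counts.

    Assumes every row total and column total is positive, so that no
    expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n_total = sum(row_totals)

    # Expected(ij) = (ith row total)(jth column total) / N_total
    expected = [[rt * ct / n_total for ct in col_totals] for rt in row_totals]

    # Q_p = sum over all cells of (Observed - Expected)^2 / Expected
    q_p = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return q_p, expected


# Hypothetical 2x2 table of counts, for illustration only.
observed = [[10, 20], [30, 40]]
q_p, expected = pearson_chi_square(observed)
print(q_p)
print(expected)
```

Under the usual hypothesis of no association, Qp would be compared against a chi-square distribution with (s-1)(r-1) degrees of freedom; that step is left out here to keep the sketch free of external libraries.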
ViSta can:
Reveal structure in your data
Guide you through an analysis
Show you results of your analyses
Structure your data analysis process

Neat Things about ViSta
Automatically linked graphics.
Data as objects.
Data entry is intuitive, not a remote data step (e.g., a 2x2 table).
Point-and-click guidemap.
A visual interface for novices, a command line for experts.
Everything can be programmed (windows, dialog boxes, menus).
Add-ons in the form of plug-ins.

The Structured Desktop
ViSta's Desktop has WorkMaps, GuideMaps, SpreadPlots, Datasheets and other features designed to structure and assist the statistical analyst.

Here is an example of a ViSta desktop:

Datasheets
Datasheets display data and provide a simple datasheet editor to create and edit your data. Here is an example of a datasheet:

WorkMaps are ViSta's visualization technique for structuring data analysis sessions. WorkMaps are created by ViSta as the data analysis session progresses.

Here is an example of a ViSta WorkMap:

SpreadPlots
SpreadPlots help you explore your data (and models of the data) to see what they seem to say. SpreadPlots are state of the art: they are structured, multi-window, linked, dynamic and interactive.

Multi-Window: SpreadPlots are groups of several plot-windows.
Structured: Each plot-window shows a particular aspect of your data or model.
Linked: The plot-windows can be linked by the data's observations or variables.
Dynamic: Each plot-window shows a dynamic graphic. For example, spinplots spin to communicate 3D structure. Boxplots can show a moving parallel coordinate plot to communicate higher-dimensional structure.
Interactive: You interact with the spinplot to make it spin, and with the boxplot to move the parallel coordinate lines.
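The idea of plot-windows linked by observations can be pictured with a small sketch. This is not ViSta's implementation (ViSta is written in XLisp-Stat). It is an illustrative Python model, assuming a shared selection object that notifies every attached view. The class names `LinkedSelection` and `PlotView` and the method names are hypothetical.

```python
class LinkedSelection:
    """Shared state: the set of currently selected observation indices."""

    def __init__(self):
        self._selected = set()
        self._views = []

    def attach(self, view):
        # Register a view so it is notified of selection changes.
        self._views.append(view)

    def select(self, indices):
        # Replace the selection and push it to every attached view.
        self._selected = set(indices)
        for view in self._views:
            view.refresh(self._selected)


class PlotView:
    """Stand-in for a plot-window; it only records what it highlights."""

    def __init__(self, name, link):
        self.name = name
        self.highlighted = set()
        self.link = link
        link.attach(self)

    def brush(self, indices):
        # Brushing in one window updates the shared selection,
        # which in turn refreshes all linked windows.
        self.link.select(indices)

    def refresh(self, selected):
        self.highlighted = set(selected)


link = LinkedSelection()
scatter = PlotView("scatterplot", link)
box = PlotView("boxplot", link)

scatter.brush([2, 5])          # select observations 2 and 5 in one window
print(sorted(box.highlighted)) # the other window highlights them too
```

Keeping the selection in one shared object, rather than having windows talk to each other directly, means a new plot type only has to implement `refresh` to take part in the linking.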
Spreadplot for Three-Way Table

Analyses
Exploratory and Descriptive Data Analysis
Dynamic Exploratory Graphics include Spinplots, Scatterplots, Scatterplot Matrices, Histograms, Boxplots, Parallel Coordinate Plots, Mosaic Plots, Quantile Plots, Normal Probability Plots, Quantile-Quantile Plots, Diamond Plots, Dotplots, Biplots, and Guided Tour Plots. Plots support brushing and labeling, and are dynamically linked. Smoothers and Contours can be added to several plots.
Descriptive Statistics include Means, Standard Deviations, Variances, Ranges, Quartiles, Medians, Correlations, Covariances, and Distances.

Analyses
Univariate Analysis
Univariate Tests include T- and Z-tests (with confidence intervals) for single-sample, paired-samples, and two-independent-samples data, with Wilcoxon Signed-Rank and Mann-Whitney tests in appropriate situations.
ANOVA: Univariate Analysis of Variance for balanced and unbalanced, one-way or multi-way data (data must be complete). The model may or may not include two-way (but not higher-way) interactions. The model visualization is a spreadplot composed of a boxplot, diamond plot, quantile plot, quantile-quantile plot and effects plot.

More Univariate Analysis
Multiple Regression: Univariate regression includes simple, multiple, robust, and monotonic regression. The model visualization is a spreadplot composed of regression, added-variable, influence, leverage, and residuals plots. Weight plots are also included for robust and monotonic regression.

Analyses
Multivariate Analysis
Multiple Regression: Multivariate Multiple Regression Analysis. The spreadplot consists of a biplot, spinplot, histogram and scatterplot matrix.
Principal Component Analysis of correlations or covariances. The model visualization is a spreadplot composed of a biplot, spinplot, scree plot and scatterplot matrix.
Multidimensional Scaling of one or more symmetric or asymmetric matrices.
The model visualization is a spreadplot composed of a scatterplot, spinplot, scree plot and scatterplot matrix. The spreadplot supports graphical re-estimation of model parameters.
Correspondence Analysis of two-way contingency tables. The model visualization is a spreadplot composed of a biplot, spinplot, residuals plot and scree plot. The spreadplot supports graphical re-estimation of model parameters.