R: An Open Source Statistical Environment
Valentin Todorov, UNIDO, v.todorov@unido.org
MSIS 2008 (Luxembourg, 7-9 April 2008)
8.4.2008

Outline
• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn?)
• R Extensibility (R Packages)
• R and the Others (Interfaces)
• R Graphics
• R for Time Series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R Features (Web, Missing Data, OOP, GUI)
• Summary and Conclusions

What is R
• R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
• Developed after the S language and environment
  – S was developed at Bell Labs (John Chambers et al.)
  – S-Plus: a value-added implementation of the S language (Insightful Corporation)
  – much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect

What is R (cont.)
• Ihaka and Gentleman, University of Auckland (New Zealand)
  – 1993: a preliminary version of R
  – 1995: released under the GNU General Public License
  – Now: R-core team consisting of 17 members, including John Chambers
• R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
• R is available as Free Software under the terms of the GNU General Public License (GPL).
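
To make the list of statistical techniques above more concrete, here is a minimal sketch of linear modelling and a classical test using only the base distribution. The choice of the built-in `cars` and `sleep` data sets is an assumption made for illustration; it does not come from the slides.

```r
# Linear modelling: stopping distance as a function of speed
# (built-in 'cars' data, chosen here only as an illustration).
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))

# Classical statistical test: Welch two-sample t-test
# (built-in 'sleep' data, again only an illustration).
tt <- t.test(extra ~ group, data = sleep)
print(tt$p.value)

# Results are ordinary R objects, so they can be processed further,
# e.g. extracting the confidence interval of the slope.
slope_ci <- confint(fit)["speed", ]
print(slope_ci)
```

Both `fit` and `tt` are plain lists with a class attribute, which is why their components can be passed on to other functions.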

R Extensibility (R Packages)
• One of the most important features of R is its extensibility through packages of functions and data.
• The R package system provides a framework for developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository, http://cran.r-project.org, which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms.
• The ‘base’ and ‘recommended’ packages are included in all binary distributions.

R and the Others (R Interfaces)
• Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
• Reading and writing data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave: package foreign
• Emulation of Matlab: package matlab
• Communication with RDBMS (ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC): large data sets, concurrency
• Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java

R Graphics
• One of the most important strengths of R: simple exploratory graphics as well as well-designed publication-quality plots.
• The graphics can include mathematical symbols and formulae where needed.
• Can produce graphics in many formats:
  – On screen
  – PS and PDF for including in LaTeX and pdfLaTeX or for distribution
  – PNG or JPEG for the Web
  – On Windows, metafiles for Word, PowerPoint, etc.
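
The output formats listed above correspond to graphics devices in R. The following is a minimal sketch that draws one plot, with a formula in the legend, into PDF, PostScript and (where available) PNG files. The file names, the plot itself and the use of the built-in `iris` data are assumptions made for illustration.

```r
# Draw a histogram of iris sepal widths with a fitted normal density.
draw_plot <- function() {
  w <- iris$Sepal.Width
  m <- mean(w)
  s <- sd(w)
  hist(w, freq = FALSE, main = "Histogram of Sepal.Width", xlab = "Sepal.Width")
  # curve() evaluates the expression in 'x' over the current x-axis range.
  curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)
  # Plotmath expression: a mathematical formula rendered in the legend.
  legend("topright", legend = expression(N(bar(x), s^2)), lty = 2, bty = "n")
}

# Illustrative output locations in the session's temporary directory.
out_dir  <- tempdir()
pdf_file <- file.path(out_dir, "sepal_width.pdf")
ps_file  <- file.path(out_dir, "sepal_width.ps")
png_file <- file.path(out_dir, "sepal_width.png")

# PDF, e.g. for pdfLaTeX or for distribution.
pdf(pdf_file, width = 6, height = 4)
draw_plot()
invisible(dev.off())

# PostScript, e.g. for LaTeX.
postscript(ps_file, width = 6, height = 4, horizontal = FALSE,
           onefile = FALSE, paper = "special")
draw_plot()
invisible(dev.off())

# PNG for the Web; this device is not available in every R build.
if (capabilities("png")) {
  png(png_file, width = 600, height = 400)
  draw_plot()
  invisible(dev.off())
}
```

Wrapping the plotting commands in a function keeps the figure identical across devices; only the device-opening call changes.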

R Graphics: basic and multipanel plots (trellis)
[Figure: panels "Boxplot", "Histogram", "Normal Q-Q Plot" and "Bagplot" for the iris variables Sepal.Width and Sepal.Length (species setosa, versicolor, virginica), and a "Scatter Plot Matrix" titled "Three Varieties of Iris" (Sepal Length, Sepal Width, Petal Length)]

R Graphics: parallel plot and coplot
[Figure: parallel plot of "Three Varieties of Iris" (Sepal Length, Sepal Width, Petal Length; axes from Min to Max) and coplot of lat against long, given depth]

R for Time Series
• Package stats: classical time series modelling tools
  – arima() for Box-Jenkins type analysis
  – structural time series: StructTS()
  – filtering and decomposition: decompose() and HoltWinters()
• Package forecast: additional forecast methods and graphical tools
• Analyzing monthly or lower frequency time series:
  – TRAMO/SEATS
  – X-12-ARIMA accessible through the Gretl library
• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html

R for Time Series: Example
• Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics

R for Survey Analysis
• Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
• STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis (cont.)
• R: package survey, http://faculty.washington.edu/tlumley/survey/
  – Designs: stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
  – Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
  – Variances by Taylor linearization or by replicate weights (BRR, jackknife, bootstrap, or user-supplied)
  – Graphics: histograms, hexbin scatterplots, smoothers
• Other packages: pps, sampling, sampfling

R and the Outliers (Robust Statistics in R)
• What are outliers?
  – atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
  – may arise through contamination, errors in data gathering, or misspecification of the model
  – classical statistical methods are very sensitive to such data
• What are robust methods?
  – Produce reasonable results even when one or more outliers may appear in the data
  – Robust regression: robustbase
  – Robust multivariate methods: rrcov, robustbase
  – Robust time series analysis: robust-ts

R and the Outliers: Example
• Example: Wages and Hours, http://lib.stat.cmu.edu/DASL/
  – a national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
  – estimate y = the labour supply (average hours) from the available data; for the example we will consider only one variable, x = average age of the respondents: $y = \beta_0 + \beta_1 x$
  – We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model

R and the Outliers: Example OLS
[Figure: (a) HRS against AGE; (b) standardized LS residuals against Index, with reference lines at -2.5 and 2.5]

R and the Outliers: Example LTS
[Figure: (c) HRS against AGE; (d) standardized LTS residuals against Index, with reference lines at -2.5 and 2.5]

R and the Outliers: Example Covariance
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with: $n = 20,\quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
• sample correlation: 0.81
• interchange the largest and smallest value in the first coordinate
• the sample correlation becomes 0.05
[Figure: "Tolerance ellipse (97.5%)" for the clean and the contaminated data]

More R…
• R and the Web: several projects provide possibilities to use R over the Web
• R and the Missing: advanced missing value handling
  – mvnmle: ML estimation for multivariate data with missing values
  – mitools: tools for multiple imputation of missing data
  – mice: Multivariate Imputation by Chained Equations
  – EMV: Estimation of Missing Values for a Data Matrix
  – VIM: methods for the visualisation as well as imputation of missing data
• R Objects: R is an object-oriented language (however, in a quite different sense from C++, Java, C#)

More R… (cont.)
• R GUI
  – R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
  – SciViews: a suite of companion applications for Windows
• R and SDMX
• R Reports
  – package xtable: coerce data to LaTeX and HTML tables
  – package Sweave: a framework for mixing text and R code for automatic report generation

Summary
• Output Management System
  – SAS/SPSS: it is rarely used for routine work
  – R: output is easily passed from one function to another for further processing and to obtain more results
• Macro Language
  – SAS/SPSS: a special language with its own syntax.
    The new functions are not run in the same way as the built-in procedures
  – R: R itself is a programming language
• Matrix Language
  – SAS/SPSS: a special language with its own syntax
  – R: a vector and matrix based language complemented by additional packages: Matrix, SparseM

Summary (cont.)
• Publishing results
  – SAS/SPSS: cut and paste to a word processor or export to a file
  – R: produce LaTeX output (including graphics) using, for example, the Sweave package
• Data size
  – SAS/SPSS: limited by the size of the disk
  – R: limited by the size of the RAM; (non-trivial) use of databases for large data sets is possible
• Data structure
  – SAS/SPSS: rectangular data set
  – R: rectangular data frame, vector, list

Summary (cont.)
• Interface to other programming languages
  – SAS/SPSS: not available
  – R: can be easily mixed with Fortran, C, C++ and Java
• Source code
  – SAS/SPSS: not available
  – R: the source code of R itself as well as of its packages is part of the distribution

References
• Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
• Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
• López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
• Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
• Murrell, P. (2005) R Graphics, Chapman & Hall
• R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL: http://www.r-project.org/
• Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
• Wheeler, D.A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!