R: An Open Source Statistical Environment
Valentin Todorov
UNIDO
v.todorov@unido.org
MSIS 2008 (Luxembourg, 7-9 April 2008)
8.4.2008
MSIS 2008, Luxembourg: Valentin Todorov
Outline
• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn)
• R Extensibility (R Packages)
• R and the others (Interfaces)
• R Graphics
• R for Time series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R features (WEB, Missing data, OOP, GUI)
• Summary and Conclusions
What is R
• R is “a system for statistical computation and graphics. It
provides, among other things, a programming language,
high-level graphics, interfaces to other languages and
debugging facilities”
• Developed after the S language and environment
– S was developed at Bell Labs (John Chambers et al.)
– S-Plus: a value-added implementation of the S language by Insightful
Corporation
– much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect
What is R (cont.)
• Ihaka and Gentleman, University of Auckland (New
Zealand)
– 1993 a preliminary version of R
– 1995 released under the GNU General Public License
– Now: R-core team consisting of 17 members including John
Chambers
• R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series
analysis, classification, clustering, robust methods and
many more) and graphical techniques
• R is available as Free Software under the terms of the
GNU General Public License (GPL).
R Extensibility (R Packages)
• One of the most important features of R is its extensibility
by creating packages of functions and data.
• The R package system provides a framework for
developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and
foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository –
http://cran.r-project.org - currently more than 1300
packages covering a wide variety of statistical methods
and algorithms. ‘base’ and ‘recommended’ packages are
included in all binary distributions.
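The split between 'base' and 'recommended' packages can be inspected from within R. The following is a minimal sketch using installed.packages() from the utils package; the lists it prints depend on the R version and installation at hand, so no particular output is assumed.

```r
# Names of the installed packages, selected by the priority recorded
# in their DESCRIPTION files ("base" or "recommended").
base_pkgs <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))

print(base_pkgs)
print(recommended_pkgs)
```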
R and the Others (R Interfaces)
• Reading and writing data (text files, XML, spreadsheet-like
data, e.g. Excel)
• Read and write data formats of SAS, S-Plus, SPSS,
STATA, Systat, Octave – package foreign.
• Emulation of Matlab – package matlab.
• Communication with RDBMS – ROracle, RMySQL,
RSQLite, RmSQL, RPgSQL, RODBC – large data sets,
concurrency
• Package filehash – a simple key-value style database, the
data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java
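As an illustration of the simplest of these interfaces, text files, the sketch below writes a built-in data set to a CSV file and reads it back. It uses only base R; the temporary file name is generated for the example. Reading the SAS, SPSS or Stata formats mentioned above would go through the foreign package instead, which is not shown here.

```r
# Write a built-in data set to a temporary CSV file and read it back.
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)
iris_back <- read.csv(csv_file, stringsAsFactors = TRUE)

# The round trip should preserve the dimensions and the column names.
print(dim(iris_back))
print(names(iris_back))

# Remove the temporary file again.
unlink(csv_file)
```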
R Graphics
• One of the most important strengths of R – simple
exploratory graphics as well as well-designed publication
quality plots.
• The graphics can include mathematical symbols and
formulae where needed.
• Can produce graphics in many formats:
– On screen
– PS and PDF for including in LaTeX and pdfLaTeX or for distribution
– PNG or JPEG for the Web
– On Windows, metafiles for Word, PowerPoint, etc.
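A minimal sketch of writing the same plot to two of these formats, PDF and PostScript, with a mathematical annotation in the title. The helper draw_plot() and the temporary file names are made up for the example. The png() and jpeg() devices are used in the same way, but their availability depends on how R was built, so they are left out here.

```r
# Hypothetical helper: a normal density curve whose title uses
# R's mathematical annotation (plotmath) for the Greek letters.
draw_plot <- function() {
  curve(dnorm(x), from = -3, to = 3, ylab = "Density",
        main = expression(paste("Normal density, ", mu == 0, ", ", sigma == 1)))
}

# PDF output, e.g. for pdfLaTeX or for distribution.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 4)
draw_plot()
dev.off()

# PostScript output, e.g. for LaTeX.
ps_file <- tempfile(fileext = ".ps")
postscript(ps_file, width = 6, height = 4)
draw_plot()
dev.off()
```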
R Graphics: basic and multipanel plots (trellis)
[Figure: basic plots of the iris data (Sepal.Length, Sepal.Width): Boxplot, Histogram (density scale), Normal Q-Q Plot (norm quantiles) and Bagplot; Scatter Plot Matrix "Three Varieties of Iris" (Sepal Length, Sepal Width, Petal Length; setosa, versicolor, virginica).]
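Plots of this kind can be sketched with base graphics and the built-in iris data. The code below is an approximation, not the code behind the slide: the bagplot is omitted because it needs an add-on package, and the layout and output file are choices made for the example.

```r
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 8, height = 8)

# Four basic plots on one page (2 x 2 layout).
par(mfrow = c(2, 2))
boxplot(Sepal.Width ~ Species, data = iris, main = "Boxplot")
hist(iris$Sepal.Width, freq = FALSE, main = "Histogram", xlab = "Sepal.Width")
lines(density(iris$Sepal.Width))
qqnorm(iris$Sepal.Width, main = "Normal Q-Q Plot")
qqline(iris$Sepal.Width)
plot(iris$Sepal.Length, iris$Sepal.Width, col = as.integer(iris$Species),
     xlab = "Sepal.Length", ylab = "Sepal.Width", main = "Scatter plot")

# Scatter plot matrix of the four measurements, on a second page.
pairs(iris[, 1:4], col = as.integer(iris$Species),
      main = "Three Varieties of Iris")
dev.off()
```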
R Graphics: parallel plot and coplot
[Figure: parallel plot "Three Varieties of Iris" (Sepal Length, Sepal Width, Petal Length, each scaled from Min to Max; panels setosa, versicolor, virginica); coplot of lat against long, given depth.]
R for Time Series
• Package stats
– classical time series modeling tools – arima() for Box-Jenkins type
analysis
– structural time series – StructTS()
– filtering and decomposition – decompose() and HoltWinters()
• Package forecast – additional forecast methods and
graphical tools
• Analyzing monthly or lower frequency time series:
– TRAMO/SEATS
– X-12-ARIMA
(both accessible through the Gretl library)
• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
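The decomposition and filtering functions from package stats listed above can be tried on a built-in series. A minimal sketch, assuming the monthly co2 data set as an example series and a 12-month forecast horizon chosen for illustration:

```r
# Classical decomposition of the built-in monthly CO2 series into
# trend, seasonal and random components.
parts <- decompose(co2)
print(names(parts))

# Holt-Winters exponential smoothing, then a 12-month-ahead forecast.
hw <- HoltWinters(co2)
forecast_12 <- predict(hw, n.ahead = 12)
print(forecast_12)
```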
R for Time Series: Example
• Fitting an ARIMA model to a univariate time series with arima() and
using tsdiag() to plot time series diagnostics
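A minimal sketch of such a fit. The series (the built-in lh data) and the model order, AR(1) with a mean term, are assumptions made for the example; the slide does not say which series or order was used.

```r
# Fit an AR(1) model with a mean term to the built-in 'lh' series.
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostic plots: standardized residuals, ACF of the residuals and
# p-values of the Ljung-Box statistic, written to a temporary PDF.
diag_file <- tempfile(fileext = ".pdf")
pdf(diag_file)
tsdiag(fit)
dev.off()
```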
R for Survey Analysis
• Complex survey samples are usually analysed by
specialized software packages: SUDAAN, Bascula 4
(Statistics Netherlands), etc.
• STATA provides much more comprehensive support for
analysing survey data than SAS and SPSS and could
successfully compete with the specialized packages
R for Survey Analysis (cont.)
• R – package survey - http://faculty.washington.edu/tlumley/survey/
– stratification, clustering, possibly multistage sampling, unequal
sampling probabilities or weights; multistage stratified random
sampling with or without replacement
– Summary statistics: means, totals, ratios, quantiles, contingency
tables, regression models, for the whole sample and for domains
– Variances by Taylor linearization or by replicate weights (BRR,
jack-knife, bootstrap, or user-supplied)
– Graphics: histograms, hexbin scatterplots, smoothers
• Other packages: pps, sampling, sampfling
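To make the idea of weighted estimates for the whole sample and for domains concrete, here is a base R sketch on a small invented data set. It does not use the survey package and computes point estimates only, without variances; the data frame smp and all of its values are hypothetical.

```r
# Hypothetical stratified sample: one row per sampled unit.
smp <- data.frame(
  stratum = c("A", "A", "A", "B", "B"),
  domain  = c("urban", "rural", "urban", "rural", "urban"),
  y       = c(10, 12, 14, 30, 50),
  weight  = c(20, 20, 20, 50, 50)   # design weight: units represented
)

# Weighted estimate of the population total of y.
total_hat <- sum(smp$weight * smp$y)

# Weighted estimate of the population mean (a ratio of two totals).
mean_hat <- total_hat / sum(smp$weight)

# The same weighted mean, estimated separately within each domain.
domain_means <- sapply(split(smp, smp$domain),
                       function(d) sum(d$weight * d$y) / sum(d$weight))

print(total_hat)
print(mean_hat)
print(domain_means)
```

In practice the design would be declared once and the variances computed by the survey package; the sketch only shows what the point estimates amount to.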
R and the Outliers (Robust Statistics in R)
• What are Outliers
– atypical observations which are inconsistent with the rest of the
data or deviate from the postulated model
– may arise through contamination, errors in data gathering, or
misspecification of the model.
– classical statistical methods are very sensitive to such data
• What are Robust methods
– Produce reasonable results even when one or more outliers may
appear in the data
– Robust regression - robustbase
– Robust multivariate methods – rrcov, robustbase
– Robust time series analysis - robust-ts
R and the Outliers: Example
• Example: Wages and Hours - http://lib.stat.cmu.edu/DASL/
– a national sample of 6000 households with a male head earning
less than $15,000 annually in 1966 - 9 independent variables;
classified into 39 demographic groups
– estimate y = the labour supply (average hours) from the available
data (for the example we will consider only one variable: x =
average age of the respondents):
$y = \beta_0 + \beta_1 x$
– We will fit an Ordinary Least Squares (OLS) and a robust Least
Trimmed Squares (LTS) model
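The contrast between the two fits can be sketched on simulated data. Two substitutions are made here and should be kept in mind: the data are simulated rather than the Wages and Hours data, and the LTS fit uses lqs() from the recommended package MASS rather than the robustbase package named earlier. The true slope of 0.5 and the four shifted observations are choices made for the example.

```r
library(MASS)  # recommended package; provides lqs()

# Simulated data (an assumption, not the Wages and Hours data):
# a linear trend with slope 0.5 and four gross outliers at the right end.
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)
y[27:30] <- y[27:30] - 20

# Ordinary least squares: every observation enters the fit.
ols <- lm(y ~ x)

# Least trimmed squares: the fit is driven by the subset of
# observations with the smallest squared residuals.
lts <- lqs(y ~ x, method = "lts")

print(coef(ols))
print(coef(lts))
```

The outliers pull the OLS slope away from the value used in the simulation, while the LTS slope stays close to it.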
R and the Outliers: Example OLS
[Figure: (a) HRS against AGE; (b) standardized LS residuals against index, with reference lines at ±2.5.]
R and the Outliers: Example LTS
[Figure: (c) HRS against AGE; (d) standardized LTS residuals against index, with reference lines at ±2.5.]
R and the Outliers: Example Covariance
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with:
$n = 20, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
• sample correlation: 0.81
• interchange the largest and smallest
value in the first coordinate
• the sample correlation becomes 0.05
[Figure: tolerance ellipses (97.5%) for the clean and the contaminated data.]
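The effect of this interchange can be reproduced in spirit with simulated data. The sketch below does not use the maryo data set from rrcov. It draws its own sample of 20 points with population correlation 0.8, so the correlations it prints will differ from the 0.81 and 0.05 quoted above.

```r
# Simulated bivariate normal sample with population correlation 0.8
# (an assumption; this is not the maryo data set).
set.seed(1)
n <- 20
x <- rnorm(n)
y <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(n)
r_clean <- cor(x, y)

# Interchange the largest and the smallest value of the first coordinate.
x_cont <- x
i_max <- which.max(x)
i_min <- which.min(x)
x_cont[c(i_max, i_min)] <- x[c(i_min, i_max)]
r_cont <- cor(x_cont, y)

print(r_clean)
print(r_cont)
```

Only two of the 20 observations are changed, and the marginal distributions of both coordinates stay exactly the same, yet the classical correlation estimate drops.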
More R…
• R and the WEB - several projects that provide possibilities
to use R over the WEB
• R and the Missing – advanced missing value handling
– mvnmle: ML estimation for multivariate data with missing values
– mitools: Tools for multiple imputation of missing data
– mice: Multivariate Imputation by Chained Equations
– EMV: Estimation of Missing Values for a Data Matrix
– VIM: provides methods for the visualisation as well as imputation of
missing data
• R Objects – R is an Object Oriented language (however in
a quite different sense from C++, Java, C#)
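One way in which the object model differs: in R's S3 system, methods belong to generic functions rather than to classes, and the generic picks a method from the class attribute of its argument. A minimal sketch; the generic area() and the classes circle and rectangle are invented for the example.

```r
# A generic function: the method is chosen from the class of its argument.
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named <generic>.<class>.
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Objects are plain lists carrying a class attribute.
circle <- structure(list(r = 1), class = "circle")
rectangle <- structure(list(w = 2, h = 3), class = "rectangle")

print(area(circle))
print(area(rectangle))
```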
More R… (cont.)
• R GUI
– R Commander: a basic statistics GUI, consisting of a window
containing several menus, buttons, and information fields
– SciViews: a suite of companion applications for Windows
• R and SDMX
• R Reports
– package xtable: coerce data to LaTeX and HTML tables
– package Sweave: a framework for mixing text and R code for
automatic report generation
Summary
• Output Management System
– SAS/SPSS: it is rarely used for routine work
– R: output is easily passed from one function to another to do
further processing and to obtain more results
• Macro Language
– SAS/SPSS: a special language with own syntax. The new
functions are not run in the same way as the built-in procedures
– R itself is a programming language
• Matrix Language
– SAS/SPSS: A special language with own syntax
– R is a vector and matrix based language complemented by
additional packages: Matrix, SparseM
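The first and third points can be seen in a few lines. In the sketch below the result of a model fit is an ordinary object that is passed on to other functions, and the same coefficients are then recomputed with matrix operations. The built-in cars data set and the simple regression are choices made for the example.

```r
# The fitted model is an object; other functions take it as input.
fit <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit))       # estimates, std. errors, t and p values
slope_ci <- confint(fit)["speed", ]    # confidence interval for the slope

print(coef_table)
print(slope_ci)

# The same least squares coefficients via the normal equations,
# written directly in R's matrix notation.
X <- cbind(1, cars$speed)
beta <- solve(t(X) %*% X, t(X) %*% cars$dist)
print(beta)
```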
Summary (cont.)
• Publishing results
– SAS/SPSS: Cut and paste to a Word processor or exporting to a
file
– R: produce LaTeX output (including graphics) using for example
the Sweave package
• Data size
– SAS/SPSS: Limited by the size of the disk
– R: Limited by the size of the RAM; (non-trivial) use of databases
for large data sets is possible
• Data structure
– SAS/SPSS: Rectangular data set
– R: Rectangular data frame, vector, list
Summary (cont.)
• Interface to other programming languages
– SAS/SPSS: Not available
– R: R can be easily mixed with Fortran, C, C++ and Java
• Source code
– SAS/SPSS: Not available
– R: the source code of R itself as well as of its packages is a part of
the distribution