R in the Statistical Office: The UNIDO Experience

Valentin Todorov
UNIDO
v.todorov@unido.org
MSIS 2010 (Daejeon, 26-29 April 2010)

26.4.2010 MSIS 2010, Daejeon: Valentin Todorov

Outline

• Introduction: the R Platform and Availability
• R Extensibility (R Packages)
• R in the UNIDO statistical process: three examples
  – R as a Mediator (R Interfaces)
  – R as a Graphics Engine (R, LaTeX and Sweave)
  – Nowcasting tool for the Manufacturing Value Added (MVA)
• Summary and Conclusions

What is R: Platform

• R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
• Developed after the S language and environment
  – S was developed at Bell Labs (John Chambers et al.)
  – S-Plus: a value-added implementation of the S language (Insightful Corporation)
  – much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect

What is R: History

• Ihaka and Gentleman, University of Auckland (New Zealand)
  – 1993: a preliminary version of R
  – 1995: released under the GNU General Public License
  – Now: R-core team consisting of 17 members including John Chambers
• R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques

What is R: Availability

• R is available as Free Software under the terms of the GNU General Public License (GPL)
• R is available for:
  – a wide variety of UNIX platforms (including FreeBSD and Linux)
  – Windows
  – MacOS
• Add-on functionality is available in the form of packages from CRAN:
http://cran.r-project.org/

R Extensibility (R Packages)

• One of the most important features of R is its extensibility by creating packages of functions and data.
• The R package system provides a framework for developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository (http://cran.r-project.org), which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms.
• ‘base’ and ‘recommended’ packages are included in all binary distributions.

I. R as a mediator (R Interfaces)

• Using a statistical system is not done in isolation
• Import data for analysis
• Export data for further processing: use the right tool for the right work
• Export results for report writing
• Even in a small research department (UNIDO): SAS, Stata, Eviews, Octave, SPSS and R users

R as a mediator (R Interfaces)

• Reading and writing data (text files, XML, spreadsheet-like data, e.g.
Excel)
• Read and write data formats of SAS, S-Plus, SPSS, EpiInfo, Stata, SYSTAT, Octave: package foreign
• Emulation of Matlab: package matlab

R as a mediator: the foreign package

    library(foreign)
    df <- read.dbf("myfile.dbf")          # DBase
    df <- read.epiinfo("myfile.epiinfo")  # Epi Info
    df <- read.mtp("myfile.mtp")          # Minitab portable worksheet
    df <- read.octave("myfile.octave")    # Octave
    df <- read.ssd("mydir", "myfile")     # SAS data set 'myfile' in library 'mydir' (needs a SAS installation)
    df <- read.xport("myfile.xport")      # SAS XPORT file
    df <- read.spss("myfile.sav", to.data.frame = TRUE)  # SPSS (a list is returned by default)
    df <- read.dta("myfile.dta")          # Stata
    df <- read.systat("myfile.sys")       # Systat

R as a mediator (Accessing data on the Internet)

• Reading data from a URL:
  – readLines() to read arbitrary text
  – read.table() to read a file with observations and variables (first line can be used for variable names)
  – read.csv() to read comma-separated values
• Example (from Kleinman and Horton, 2009)

    ch <- url("http://www.math.smith.edu/sasr/testdata")
    df <- readLines(ch)
    ## df <- read.table("http://www.math.smith.edu/sasr/testdata")
    ## df <- read.csv("http://www.math.smith.edu/sasr/file.csv")

R as a mediator (XML processing)

• Use package XML
  – xmlTreeParse() to parse the file and xmlRoot() to get its top-level node
  – xmlSApply() and xmlValue() are called recursively to process the file.
  – A character matrix is returned: columns correspond to observations and rows correspond to variables.
• Example (from Kleinman and Horton, 2009)

    library(XML)
    surl <- "http://www.math.smith.edu/sasr/datasets/help.xml"
    doc <- xmlRoot(xmlTreeParse(surl))
    tmp <- xmlSApply(doc, function(x) xmlSApply(x, xmlValue))
    df <- t(tmp)[, -1]

SDMX example: Retrieve IMF/IFS data

    library(XML)
    surl <- "c:/download/Exrate4Unido.xml"
    doc <- as.list(xmlRoot(xmlTreeParse(surl)))
    ## Get the data for Korea (CountryName is the third attribute)
    kr <- doc[[which(xmlSApply(doc, function(x) xmlAttrs(x)[3]) == "Korea")]]
    xmlAttrs(kr)

      Frequency    "A"
      Database     "IFS"
      CountryName  "Korea"
      Country      "542"
      TS_Key       "542..RF.ZF..."
      Descriptor   "MARKET RATE, PERIOD AVERAGE"
      Units        "National Currency per US Dollar"
      Scale        "None"

SDMX example: Retrieve IMF/IFS data (2)

    ## Append the attributes of each observation node as a numeric row of 'out'
    getExdata <- function(x) {
        out <<- rbind(out, as.numeric(xmlAttrs(x)))
    }
    out <- data.frame()
    xmlSApply(kr, getExdata)
    out

    …
    27 1974 404.4725
    28 1975 484.0000
    29 1976 484.0000
    30 1977 484.0000
    31 1978 484.0000
    32 1979 484.0000
    33 1980 607.4325
    34 1981 681.0283
    …

R as a mediator (Databases)

• Communication with RDBMS
  – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC
  – large data sets, concurrency
• Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java

R as a mediator: IDSB Example

• Industrial Demand and Supply Balance (IDSB) Database: data sets based on ISIC Rev.3 at the 4-digit level
• Contains annual time series data (in current US dollars) for eight interrelated items
• Data are derived from:
  – INDSTAT: Output data reported by National Statistical Offices
  – COMTRADE: UNIDO estimates for ISIC-based international trade data
• A new product related to IDSB will also contain
Index of Industrial Production data (UNSD)

R as a mediator: IDSB Example (2)

• The generation of the final data set involves
  – combination of two independent data sets (INDSTAT and COMTRADE),
  – conversion from one classification (SITC) to another (ISIC),
  – conversion of the monetary values from current national currency to current USD and other minor adjustments of the data.
• Each single data set is verified thoroughly and its quality is guaranteed
• But the verification of the synthesized data set is a serious challenge for the statistical staff of the Unit: a comprehensive screening data set is created

R as a mediator: IDSB Example (3)

[Figure]

R as a mediator: IDSB Example: R code

    ## First load the RODBC library. If not yet installed, install it using
    ## install.packages("RODBC")
    library(RODBC)

    ## Open the ODBC connection to the MDB file
    ch <- odbcConnectAccess("C:/work/idsb34screen.mdb")

    ## Create an SQL query of the type:
    ##   "SELECT * FROM table_name WHERE where_condition"
    ## Execute the query and obtain the selected data in a data frame
    sql <- "SELECT * FROM idsb34 WHERE MXMARK <> ''"
    xdata <- sqlQuery(ch, sql)

    ## Close the connection when done
    odbcClose(ch)

II.
R as a Graphics Engine

• A natural way to visualize data is with graphs and plots
• Publication quality displays should be both informative and aesthetically pleasing (Tufte, 2001):
  – present many numbers in a small space;
  – encourage the eye to compare different pieces of data
• The graphics have to be combined with text explaining and commenting on them
• The standard approach: POINT & CLICK, WYSIWYG, COPY & PASTE
• The proposed solution: R + LaTeX + BibTeX => Sweave => PDF

The Example: International Yearbook of Industrial Statistics

• A unique and comprehensive source of information, the only international publication providing worldwide statistics on performance and trends in the manufacturing sector.
• Designed to facilitate international comparisons relating to manufacturing activity, industrial development and performance.
• Data which can be used to analyze patterns of growth and related long-term trends, structural change and industrial performance in individual industries.
• A new graphical section presenting the major trends of growth and distribution of manufacturing in the world.

Yearbook Graphics: Requirements

• The software tool we are looking for should fulfil, as a minimum, the following requirements
  – Create publication quality graphics
  – Interface easily with the other components of the production line (SAS, Sybase, .Net)
  – Comply with the submission guidelines of the publisher, e.g. the final document must contain only embedded fonts.
  – Provide means for easy text and image placement. Whenever the data are changed the document should be (preferably automatically) regenerated.
  – Use the same fonts in figure labels as in the main document
  – Easy to maintain and extend

The Components: R Graphics

• One of the most important strengths of R: simple exploratory graphics as well as well-designed publication quality plots
• The graphics can include mathematical symbols and formulae where needed
• Can produce graphics in many formats:
  – On screen
  – PS and PDF for including in LaTeX and pdfLaTeX or for distribution
  – PNG or JPEG for the Web
  – On Windows, metafiles for Word, PowerPoint, etc.

R Graphics: basic and multipanel plots (trellis)

[Figure: iris data shown as boxplot, histogram, normal Q-Q plot, bagplot and scatter plot matrix]

R Graphics: parallel plot and coplot

[Figure: parallel plot of the three iris varieties (setosa, versicolor, virginica); coplot of lat against long, given depth]

The Components: TeX and LaTeX

• TeX: a typesetting system (computer program) for producing nicely printed, publication quality output, freely available: Donald E. Knuth, 1974
• LaTeX: a component designed to shield the author from the details of TeX; Lamport (1994)
  – Available for free from http://www.latex-project.org/ftp.html for Linux, MacOS and Windows.
• BibTeX: a simple tool to create a bibliography in a LaTeX document
  – a uniform style is achieved, which can easily be replaced by another
  – a unified library of references shared among publications and authors

The Components: Sweave

• A tool that allows the code for a complete data analysis to be embedded in documents (see Leisch, 2002)
• Creates dynamic reports, which can be updated automatically if data or analysis change
• The master document (.Rnw) contains:
  – the programming code, written in R, needed to obtain the graphs, tables, etc.
  – the text, written in LaTeX
• The document is run through R
  – all the data analysis is performed on the fly
  – the generated output (tables, graphs, etc.) is inserted into the final LaTeX document.

III. Nowcasting MVA for Cross-country Comparison

• UNIDO maintains a unique industrial statistics database (INDSTAT), updated regularly with data collected from NSOs
• A separate database: compilation of statistics related to MVA (growth rate and share in GDP)
• Published in the International Yearbook of Industrial Statistics and on the statistical pages of the UNIDO web site
• For current economic analysis it is crucial that the Yearbook presents data for the most recent years

Nowcasting MVA: The Model

• The database consists of annual values of MVA and GDP at constant 2000 prices for around 200 countries
• GDP data are available up to the current year:
  – For earlier years the actual GDP values are used
  – For the most recent one or two years the GDP values are derived from the nowcasts of GDP growth rates reported in the World Economic Outlook of the IMF (see Artis, 1996)
• MVA: a time gap of at least one year, hence nowcasting
• MVA is strongly connected to GDP
• This suggests nowcasting MVA on the
basis of the estimated relationship between contemporaneous values of MVA and GDP

Nowcasting MVA: The Model (2)

• We consider models based on the following general representation of MVA

    $$MVA_{i,t} = MVA_{i,t-1}\,(1 + gMVA_{i,t})$$

  where the MVA growth rate is modelled as

    $$gMVA_{i,t} = a_i + b_i\, gGDP_{i,t} + c_i\, gMVA_{i,t-1} + e_{i,t}$$

  and $e_{i,t}$ is white noise.
• This general model can be specialized to four different models (see Boudt, Todorov and Upadhyaya, 2009)

Nowcasting MVA: Estimation

• The standard OLS estimator may be biased because of
  – violation of the assumption of exogeneity of the regressors with respect to the error term
  – presence of outliers in the data
• What are outliers:
  – atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
  – may arise through contamination, errors in data gathering, or misspecification of the model
  – classical statistical methods are very sensitive to such data
• For this reason we also consider a robust alternative to the OLS estimator, namely the MM estimator

Nowcasting MVA: MM-Estimator

• Robust methods: produce reasonable results even when one or more outliers appear in the data
• The MM regression estimator is a two-step estimator:
  – First step: LTS (Least Trimmed Squares) estimates the parameter vector that minimizes the sum of the 50% smallest squared residuals
  – This estimate is used as a starting value for M-estimation, where a loss function that downweights outliers is minimized
• Has a high efficiency under the linear regression model with normally distributed errors
• Because
of the LTS initialization it is highly robust
• For details see Maronna et al. (2006)

Nowcasting MVA: MM-Estimator in R

• Package robustbase:
• Provides “essential robust statistics” within R, available in a single package
• Provides tools that allow analyzing data with robust methods:
  – Regression including model selection
  – Multivariate statistics
• Aims to cover the book of Maronna et al. (2006)

Summary and Outlook

• There is an increasing demand for statistical tools which combine ease of use with availability of the newest analytical methods.
• This is provided by the flexibility of the statistical programming language and environment R
• Illustrated by examples from the statistical production process of UNIDO
• Future development:
  – R for survey data analysis
  – Detection of outliers in survey data with R
  – Imputation of missing values in multivariate data with R
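
Illustration: a nowcasting sketch in R

The growth-rate model from the slide "Nowcasting MVA: The Model (2)" can be tried out on made-up numbers. The sketch below is only an illustration under stated assumptions, not the UNIDO production code: the data are simulated, the variable names (gGDP, gMVA, d, fit.ols, fit.mm) are hypothetical, and a single country is used. It fits the regression by OLS with lm() and, if the robustbase package happens to be installed, also with lmrob(). Note that lmrob() computes an MM-type estimate whose default starting point is an S-estimate, which may differ from the LTS start described on the slides.

```r
## Sketch: nowcast MVA for one country from a growth-rate regression.
## All data are simulated; nothing here is UNIDO data.
set.seed(1)

n    <- 30                                    # years with observed MVA
gGDP <- rnorm(n + 1, mean = 0.03, sd = 0.02)  # GDP growth, known one year further

## Simulate gMVA_t = a + b * gGDP_t + c * gMVA_{t-1} + e_t (assumed true values)
a <- 0.005; b <- 1.1; c1 <- 0.2
gMVA <- numeric(n)
gMVA[1] <- a + b * gGDP[1] + rnorm(1, sd = 0.01)
for (t in 2:n) {
    gMVA[t] <- a + b * gGDP[t] + c1 * gMVA[t - 1] + rnorm(1, sd = 0.01)
}
MVA <- 100 * cumprod(1 + gMVA)                # MVA levels for years 1..n

## Regression data for years 2..n, with lagged MVA growth as a regressor
d <- data.frame(g = gMVA[2:n], gdp = gGDP[2:n], lag = gMVA[1:(n - 1)])

## Classical fit
fit.ols <- lm(g ~ gdp + lag, data = d)

## Robust MM-type fit, only if robustbase is available
fit.mm <- if (requireNamespace("robustbase", quietly = TRUE)) {
    robustbase::lmrob(g ~ gdp + lag, data = d)
} else NULL

## Use the robust fit when available, otherwise fall back to OLS
fit <- if (!is.null(fit.mm)) fit.mm else fit.ols
cf  <- coef(fit)

## Nowcast for year n + 1: predicted growth, then the level
g.hat   <- cf[["(Intercept)"]] + cf[["gdp"]] * gGDP[n + 1] + cf[["lag"]] * gMVA[n]
MVA.hat <- MVA[n] * (1 + g.hat)
```

Comparing coef(fit.ols) with coef(fit.mm) after replacing one or two rows of d with grossly wrong values is a quick way to see how the two estimators react to outliers.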