UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE
CONFERENCE OF EUROPEAN STATISTICIANS

Working paper 2013/28
27 November 2013

High-level group for the modernisation of statistical production and services (HLG)
Meeting, Geneva, 27 November 2013

Istat generalised software development and sharing: experiences and strategy¹

1. Background

By “generalised (or generic) software” for statistical production we mean the set of systems and IT tools specifically designed to support the different phases of a statistical production process, as defined by the Generic Statistical Business Process Model (GSBPM). Besides the productivity benefits of using products that require little or no ad hoc code development, generalised systems usually implement methodological solutions that help ensure the highest quality of the statistical information produced.

Because these systems have a limited number of users (essentially National Statistical Institutes and, to a lesser extent, other bodies involved in statistical research), private companies have, with some exceptions, shown no interest in this development area. It has instead been covered by leading official statistical institutions such as Statistics Canada, Statistics Netherlands, the U.S. Bureau of the Census and a few others.

Istat, the Italian National Institute of Statistics, has also been developing its own generalised IT tools since the early 1990s, mainly generalised systems for edit and imputation and for sampling design and estimation (Concord, Diesis, Mauss, Genesees). Where necessary, generalised systems were acquired from outside (namely Blaise, ACTR and GEIS/Banff), so as not to leave any suitable phase of the production process uncovered.
So far, no strategic decision has been taken within the community of National Statistical Institutes (NSIs) to coordinate their efforts and produce a complete and optimised suite of products covering the whole statistical production chain. Only recently has interest in this coordination arisen and have initiatives been undertaken.

A first initiative was carried out by UNECE through its Sharing Advisory Board, which aims at collecting information on the generalised software used by official statistical institutions². A second set of initiatives comprised the launch of ESSnet projects dedicated to the definition of a common architecture and environment (CORA and CORE, respectively) to facilitate the sharing of IT tools and, more recently, the constitution of the HLG group for the definition of a Common Statistical Production Architecture (CSPA), based on a Service Oriented Architecture approach. None of these initiatives aims directly at the cooperative development of software systems; rather, they prepare the ground for the adoption of common, optimal IT solutions.

In the VIP.Programme, the project “Shared Services” will contain an ESSnet (to be launched in 2014) dedicated to investigating the opportunities offered by Free and Open Source Software (FOSS) projects useful to NSIs and the ESS. The main objectives include stocktaking, standard setting and benchmarking. The project will also investigate opportunities for a Centre of Expertise on FOSS projects for official statistics.

It is therefore inappropriate to proceed in an uncoordinated and isolated manner in the field of generalised software production: NSIs have to define a clear strategy in order to achieve the goal of developing and adopting a common set of standard or recommended IT tools.
1 Prepared by Giulio Barcaroli (“Methods, Tools and Methodological Support” Division, ISTAT, barcarol@istat.it)
2 http://www1.unece.org/stat/platform/display/msis/Software+Sharing

Istat generalised software development and sharing Page 1 of 10

2. Generalised software in Istat

Since the early 1990s ISTAT's goal has been to make available, for each step of the production process, one or more recommended software systems that allow the step to be performed in an optimal way, from the point of view of both effectiveness (quality of results) and efficiency (reduction of related costs).

Table 1 reports the current situation. It shows that whenever an IT tool has been developed by ISTAT, a clear choice in favour of an open source approach has been made, at least since 2007. This follows from ISTAT's policy on the development and/or acquisition of IT tools, which must meet the following requirements: (1) implement the most advanced methods and techniques in terms of quality and efficiency gains, and (2) be fully interoperable, and therefore usable in the first place by the whole community of official statistics.

As for the first requirement, the open source system R is well known to be at the cutting edge of research, thanks to the very large number of libraries (about 5000) produced by a vast community of developers: the availability of these libraries is a great advantage when developing IT tools, which can build on solutions that already exist.

A prerequisite for the second requirement is the adoption of development technologies that do not jeopardise interoperability. Therefore, whatever tool we develop should not be tied to a particular commercial DBMS or system. For instance, ISTAT's standard DBMS is Oracle, but the aim is to develop IT tools that, when a relational database is required, connect through ODBC instead of binding directly to a particular DBMS.
The use of SAS to develop generalised software is a blocking factor, because SAS has to be installed in order to run SAS-based IT tools. To overcome this limitation, some ten years ago we asked SAS Institute either to allow the creation of executables that do not require a SAS installation or to provide the tools' functionalities as a kind of web service. As the answer was negative in both cases, we decided to abandon SAS as a development technology. As a consequence, since 2007 the R environment and programming language have been chosen as the preferred instruments for developing generalised software.

SAS was replaced by R not only because of a general policy aimed at decreasing ISTAT's dependence on proprietary software vendors, but also because of interoperability requirements. ISTAT has many relationships with other entities in the National Statistical System, and also with Statistical Institutes in developing countries: when transferring ISTAT software to partners, one essential requirement is that they must not be obliged to purchase commercial software in order to run it.

Currently, the software acquired on a commercial basis comes from other Statistical Institutes or International Organisations: Canada (ACTR, Banff), Netherlands (Blaise), OECD (OECD.Stat). Some of these products are planned to be replaced with FOSS solutions. In particular, ACTR is being migrated to an R version, and Banff will be replaced by a set of alternative tools:
o “editrules” for error localisation (developed by Statistics Netherlands as an R package),
o an R package for nearest neighbour and Predictive Mean Matching imputation, being developed in ISTAT,
o “RSPA” for minimal record adjustment after imputation (another R package developed by Statistics Netherlands).

As for licensing, from the very beginning we decided to disseminate the software we produced completely free of charge.
Our first generalised products (Concord and Genesees, SAS versions) were offered to any institution or private body asking for them, with no restrictions. ISTAT's policy now is to release its software under a FOSS licence; almost all of our tools are licensed under the EUPL (“European Union Public Licence”). The EUPL grants the licensee the following rights:
1. Obtain the source code of the software from a free access repository
2. Use the software in any circumstance and for all usage
3. Reproduce (copy, duplicate) the software
4. Modify the original software, and/or make derivative works out of it
5. Communicate the software to the public (i.e. using it through a public network or distributing services based on the software via Internet)
6. Distribute the software or copies thereof to other users (inside or outside the licensee's organisation)
7. Lend and rent the software or copies thereof
8. Sub-licence rights in the software or copies thereof.
Istat generalised software development and sharing Page 3 of 10 GSBPM Software Functions Developer Charact eristics State Licensing condition s MAUSS-R (“Multivariate Allocation of Units in Sampling Surveys”) Design singlestage stratified samples ISTAT Design twostage stratified samples ISTAT Currently used in ISTAT and available for anyone Completing development and testing FOSS (EUPL) BEAT (“Bethel Allocation R-based system (R core + Java interface) R package ISTAT R packages (core + Tcl-TK interface) Beta version FOSS Frame optimisation for stratified sampling and selection of units from optimised strata Develop applications for CAPI, CATI and CADI Develop applications for interactive coding ISTAT R package Currently used in ISTAT and available for anyone FOSS (GPL2) (on the CRAN) Currently used in ISTAT Commercial StatMatch Statistical matching ISTAT R package FOSS (EUPL) (on the CRAN) RELAIS (REcord Linkage At IStat) Develop record linkage applications ISTAT R-based system (R core + Java interface) ACTR Develop applications for batch coding Statistics Canada Concord-Java Develop applications for Fellegi-Holt based edit and imputation procedures (categorical data) ISTAT Fortran (core) + Java (interface) Currently used in ISTAT and available for anyone Currently used in ISTAT and available for anyone Currently used in ISTAT, but migrating to an R version Currently used in ISTAT and available for anyone Diesis Develop applications for Fellegi-Holt and/or NearestNeighbour based edit and imputation procedures ISTAT C Currently used in ISTAT but not yet available outside Will be FOSS phases and sub-processes 2. Design 2.1 Design frame and samples 2. Design 2.1 Design frame and samples 2. Design 2.1 Design frame and samples 4. Collect 4.1 Select samples 2. Design FS4 (First Stage Sample Stratification and Selection) SamplingStrata 2.1 Design frame and samples 4. Collect 4.1 Select samples 4. Collect Blaise 4.3 Run collection 5. Process 5.2. Classify and code 6. 
Process Statistics Netherlands 5.1 Integrate 5. Process 5.1 Integrate 5. Process 5.2. Classify and code 5.Process 5.3 Review, validate and edit 5.4 Impute 5.Process 5.3 Review, validate and edit 5.4 Impute Istat generalised software development and sharing FOSS FOSS Commercial FOSS (EUPL) Page 4 of 10 GSBPM Software Functions Developer Charact eristics State Licensing condition s Statistics Netherlands R package Testing in ISTAT FOSS (GPL2) (on the CRAN) Statistics Canada SAS procedures Commercial ISTAT R package Currently used in ISTAT, planned substitution with open source alternative Currently used in ISTAT and available for anyone phases and sub-processes 5. Process editrules 5.3 Review, validate and edit 5. Process Banff 5.3 Review, validate and edit 5.4 Impute 5. Process 5.3 Review, SelEMix (“SELective Editing via MIXture models”) validate and edit 5. Process 5.6 Calculate weights (categorical and continuous data) Develop applications for Fellegi-Holt based edit and imputation procedures (categorical and continuous data) Develop applications for Fellegi-Holt based edit and imputation procedures (continuous data) Develop applications for optimised selective editing FOSS (EUPL) (on the CRAN) ReGenesees (“R evolved GENeralised Software for Sampling Estimates and ErrorS” Calibration and sampling errors estimation (using analytic methods) ISTAT R packages (core + Tcl-TK interface) Currently used in ISTAT and available for anyone FOSS (EUPL) EVER (“Estimation of Variance by Efficient Replication”) Calibration and sampling errors estimation (using replication methods) ISTAT R package Used in ISTAT and available for anyone FOSS (EUPL) (on the CRAN) Client version: VisualBasic Web version: Java Used in ISTAT, not yet available outside FOSS Used in ISTAT Commercial 5.7 Calculate aggregates 5. Process 5.6 Calculate weights 5.7 Calculate aggregates 6. Analyse Ranker Calculation and analysis of composite indicators ISTAT 7. 
Disseminate OECD.stat Statistical data warehousing OECD Table 1 – Generalised software and IT tools for statistical production in ISTAT (in Annex A detailed information is reported with respect to the ISTAT developed software) Istat generalised software development and sharing Page 5 of 10 3. Future developments and collaboration strategy ISTAT will continue its policy of developing and / or acquiring IT tools fully interoperable and shareable with other entities in the official statistical community. ISTAT is a promoter of the “ESSnet on Free and Open Source Software for Statistical Production”, already approved by the ESSC, to be launched in 2014. The aim of this ESSnet is “is to document, explore, and educate the ESS on the use of FOSS projects for statistical production and to evaluate the case for an ongoing Centre of Expertise to support the Official Statistics FOSS community. In doing so it will provide some investment to properly address the potential role of FOSS within a Generic Statistical Information Model and its applicability in the ESS. It will also spread knowledge of FOSS solutions and how they may benefit the business architecture, and provide NSIs and the ESS with informed research on which to base decisions on the future of statistical production. The importance of “plug and play” solutions to statistical production has been noted by the High Level Group for the Modernisation of Statistical Production and Services. To this end while a one off project in the form of an ESSnet to assess some of these plug and play solutions will be valuable, evaluating the case for a Centre of Expertise to support the Official Statistics FOSS community that has a strong methodological viewpoint is also critical to the success of moving away from the stovepipe model of statistical production and embracing new and improved software as they are developed.” In this framework, ISTAT has a positive attitude towards collaborative development of new IT tools. 
With reference to the Statistics Canada proposal³, ISTAT agrees with the particular collaborative development model suggested, and is willing to participate in a pilot experience. A crucial point is the choice of the tool to be developed collaboratively:
o first, the area of interest (or, more precisely, the GSBPM sub-process) in which the tool will be used has to be identified. In choosing this area, the related methods and techniques must be well established, with no need for further research. Sub-processes that fulfil this requirement are, for instance, “Review, validate and edit”, “Impute” or “Calculate weights”, which are therefore natural candidates;
o on the basis of the best methods conceptually available for the chosen sub-process, a review of the existing tools (if any) that implement them should be carried out, updating the one contained in the Sharing Advisory Board website. This review can lead to different situations:
1) one existing tool is deemed completely satisfactory and is adopted as a standard for the official statistics community;
2) one existing tool is considered potentially adequate, but needs further development to provide additional functionalities or to improve the available ones;
3) no tool is adequate.

In cases (2) and (3) a new development project can be launched, according to the collaborative development model proposed by Statistics Canada.
3 “Software Collaboration and Sharing at Statistics Canada”, HLG meeting, Geneva, June 2013, Working paper 2013/19

Annex A: ISTAT IT tools (R packages / R-based systems)

Package StatMatch
Author: Marcello D'Orazio (madorazi@istat.it)
Paper: (package vignette) Statistical Matching and Imputation of Survey Data with the Package StatMatch for the R Environment
http://rm.mirror.garr.it/mirrors/CRAN/web/packages/StatMatch/vignettes/Statistical_Matching_with_StatMatch.pdf
Phase: 5.4 Impute / 5.1 Integrate
Link: http://cran.r-project.org/web/packages/StatMatch/index.html

StatMatch provides R functions to perform statistical matching, i.e. the integration of two data sources that refer to the same target population and share a number of common variables. Some functions can also be used to impute missing values in data sets through hot deck imputation methods. Methods to perform statistical matching on data from complex sample surveys (via weights calibration) are available too.

Package SeleMix
Authors: Ugo Guarnera, M. Teresa Buglielli (buglielli@istat.it)
Paper: PAPER_Q2010.pdf
Phase: 5.3 Review, validate and edit
Link: http://cran.r-project.org/web/packages/SeleMix/index.html

SeleMix (Selective Editing via Mixture models) is an R package for selective editing. It includes functions for the identification of outliers and influential errors in numerical data. For each unit, it also provides anticipated values (predictions) for both observed and non-observed variables. The method is based on explicitly modelling both the true (error-free) data and the error mechanism. Specifically, the true data are assumed to follow a normal or log-normal distribution. We assume that only a subset of the data is affected by error, and that the error mechanism is specified through a Gaussian random variable with zero mean vector and covariance matrix proportional to the covariance matrix characterising the true data distribution.
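The contamination model described for SeleMix can be illustrated with a minimal univariate sketch. It assumes that the model parameters (mean `mu`, variance `sigma2`, variance inflation factor `lam` and error probability `w`) are already known, whereas in practice they have to be estimated from the data. All function and parameter names are hypothetical and are not taken from the SeleMix package, which handles the multivariate case.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def contamination_posterior(y, mu, sigma2, lam, w):
    """Posterior probability that the observed value y is affected by error.

    Assumed model (univariate simplification):
      true value        ~ N(mu, sigma2)
      error (prob. w)   ~ N(0, lam * sigma2), added to the true value
    so an erroneous observation is distributed as N(mu, (1 + lam) * sigma2).
    """
    clean = (1.0 - w) * normal_pdf(y, mu, sigma2)
    dirty = w * normal_pdf(y, mu, (1.0 + lam) * sigma2)
    return dirty / (clean + dirty)


def anticipated_value(y, mu, sigma2, lam, w):
    """Expected true value given the observation y.

    If y is error-free the true value is y itself; if y is erroneous the
    conditional expectation of the true value shrinks y towards mu by the
    factor 1 / (1 + lam).
    """
    tau = contamination_posterior(y, mu, sigma2, lam, w)
    shrunk = mu + (y - mu) / (1.0 + lam)
    return (1.0 - tau) * y + tau * shrunk


# Illustrative parameters (assumed, not estimated from real data).
MU, SIGMA2, LAM, W = 0.0, 1.0, 9.0, 0.1
central = contamination_posterior(0.0, MU, SIGMA2, LAM, W)
extreme = contamination_posterior(10.0, MU, SIGMA2, LAM, W)
```

A value close to the mean receives a posterior error probability below the prior `w`, while an extreme value is flagged almost with certainty and its anticipated value is pulled back towards the mean. In selective editing, scores of this kind can be used to prioritise the units to be reviewed.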
Package SamplingStrata
Author: Giulio Barcaroli (barcarol@istat.it)
Paper: (package vignette) Optimization of sampling strata with the SamplingStrata package
http://cran.r-project.org/web/packages/SamplingStrata/vignettes/SamplingStrataVignette.pdf
Phase: 2.4 Design frame and sample methodology
Link: http://cran.r-project.org/web/packages/SamplingStrata/index.html

In the field of sampling design (in particular for stratified sampling), this package offers an approach for determining the best stratification of a sampling frame, i.e. the one that ensures the minimum sample size while satisfying precision constraints in a multivariate and multidomain setting. The approach is based on a genetic algorithm: each solution (i.e. a particular partition of the sampling frame into strata) is considered as an individual in a population; the fitness of each individual is evaluated by calculating (using the Bethel-Chromy algorithm) the sample size that satisfies the accuracy constraints on the target estimates. Functions in the package allow the user to: (a) analyse the results of the optimisation step; (b) assign the new strata labels to the sampling frame; (c) select a sample from the new frame according to the best allocation. There is also a function for building the most important input to the optimisation step, i.e. the “strata” dataframe, which contains information (means and standard errors) on the distributions of the target variables in the different strata, using the sampling frame or data from previous rounds of the same survey.
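The fitness evaluation described above can be sketched in a simplified form. The example below does not implement the Bethel-Chromy algorithm or the genetic search: as a stand-in, it assumes a single target variable and uses the classical Neyman allocation formula to obtain the minimum sample size that meets a constraint on the coefficient of variation of the estimated mean. The function name and the toy frame are hypothetical.

```python
import math
from statistics import stdev


def required_sample_size(strata, target_cv):
    """Minimum overall sample size under Neyman allocation (one variable).

    strata    : list of lists, the frame values of the target variable
                in each stratum (at least two units per stratum)
    target_cv : maximum admissible coefficient of variation of the
                estimated population mean

    Uses n = (sum W_h S_h)^2 / (V + sum W_h S_h^2 / N), where V is the
    target variance. The cap n_h <= N_h is ignored in this sketch.
    """
    frame_size = sum(len(s) for s in strata)
    overall_mean = sum(sum(s) for s in strata) / frame_size
    target_var = (target_cv * overall_mean) ** 2
    weights = [len(s) / frame_size for s in strata]
    sds = [stdev(s) for s in strata]
    numerator = sum(w * sd for w, sd in zip(weights, sds)) ** 2
    fpc_term = sum(w * sd ** 2 for w, sd in zip(weights, sds)) / frame_size
    return math.ceil(numerator / (target_var + fpc_term))


# Toy frame: the values 1..100, partitioned in two alternative ways.
frame = list(range(1, 101))
interleaved = [frame[0::2], frame[1::2]]   # strata as heterogeneous as the frame
by_size = [frame[:50], frame[50:]]         # homogeneous strata

n_interleaved = required_sample_size(interleaved, target_cv=0.05)
n_by_size = required_sample_size(by_size, target_cv=0.05)
```

In a genetic search of the kind described, a function like `required_sample_size` would play the role of the fitness function: the candidate stratification with the smaller required sample size (here the one that groups similar units together) is the fitter individual.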
Software MAUSS-R (Multivariate Allocation of Units in Sampling Surveys)
Authors: Teresa Buglielli, Daniela Pagliuca
Papers:
1. User and methodological manual http://www.istat.it/it/files/2011/02/user_and_methodological_manual.pdf
2. mauss reference manual mauss.pdf
Phase: 2.4 Design frame and sample methodology
Link: http://www.istat.it/it/strumenti/metodi-e-software/software/mauss-r

Mauss is a tool for defining the sampling design of sample surveys on finite populations. It offers optimality criteria, flexibility and easy management for those who are responsible for designing and conducting such surveys. Once the objectives and the operational constraints of the survey have been defined, it enables the user to choose the best sampling design among those obtained by adopting different definitions of the key features of the survey, such as the type of stratification, the desired accuracy of the estimates, the sample size, the type of domains of study and the variables of interest. The use of this software also ensures transparency, standardisation and accuracy of the methods used.

RELAIS
Authors: Nicoletta Cibella, Marco Fortini, Monica Scannapieco, Laura Tosco, Tiziana Tuoto, Luca Valentino
Software download and user guide:
http://www3.istat.it/strumenti/metodi/software/registrazione/datiuser.html?software=relais22
https://joinup.ec.europa.eu/software/relais/release/all
Selected papers:
- Cibella N., Fernandez G.L., Fortini M., Guigò M., Hernandez F., Scannapieco M., Tosco L., Tuoto T. (2009) "Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences", NTTS 2009
- Cibella N., Fortini M., Scannapieco M., Tosco L., Tuoto T. "Theory and practice of developing a record linkage software", "Combination of surveys and administrative data", Wien, 29-30 May 2008
- Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco L. "RELAIS: Don't Get Lost in a Record Linkage Project", FCSM 2008
- Fortini M., Scannapieco M., Tosco L., Tuoto T. "Towards an Open Source Toolkit for Building Record Linkage Workflows", IQIS 2006
- User's Guide Version 2.2 http://www.istat.it/it/files/2011/03/Relais2.2UserGuide.pdf
Phase: 5.1 Integrate data

RELAIS (Record Linkage At Istat) is a toolkit for record linkage. RELAIS allows techniques to be combined for each of the record linkage phases, so that the resulting workflow is built on the basis of the requirements of the application at hand. More specifically, the RELAIS toolkit is composed of a collection of techniques for each record linkage phase, which can be dynamically combined in order to build the best record linkage workflow. RELAIS has been implemented in Java and R and relies on a database architecture (MySQL). Specifically, the estimation phase (EM) of the probabilistic decision model has been implemented in R, as has the 1:1 reduction phase, which implements the LP-solve algorithm. The other techniques and the GUIs are implemented in Java.

Software ReGenesees (R evolved Generalised software for sampling estimates and errors in surveys)
Authors: Raffaella Cianchetta, Diego Zardetto
Paper: ReGenesees reference manual ReGenesees.pdf, ReGenesees.GUI.pdf
Phase: 5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs
Link: https://joinup.ec.europa.eu/software/regenesees/home

ReGenesees is a full-fledged software system entirely developed in R. It has a clear-cut two-layer architecture. The application layer of the system is embedded into an R package itself named ReGenesees. A second R package, called ReGenesees.GUI, implements the presentation layer of the system. Both packages can be run under Windows as well as under most Unix-like operating systems. While the ReGenesees.GUI package requires the ReGenesees package, the latter can also be used without the GUI on top of it.
This means that the statistical functions of the system will always be accessible to users interacting with R through the traditional command-line interface. Less experienced R users, on the other hand, will benefit from the user-friendly point-and-click graphical interface.

Software EVER (Estimation of Variance by Efficient Replication)
Author: Diego Zardetto
Paper: EVER reference manual http://cran.r-project.org/web/packages/EVER/EVER.pdf
Phase: 5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs
Link: http://cran.r-project.org/web/packages/EVER/index.html

EVER is mainly intended for calculating estimates and standard errors in complex surveys. Variance estimation is based on the extended DAGJK (Delete-A-Group Jackknife) technique proposed by Dr. Phillip S. Kott. The advantage of the DAGJK method over the traditional jackknife is that, unlike the latter, it remains computationally manageable even when dealing with “complex and big” surveys (tens of thousands of PSUs arranged in a large number of strata with widely varying sizes). In fact, the DAGJK method is known to provide, for a broad range of sampling designs and estimators, (nearly) unbiased standard error estimates even with a “small” number (e.g. a few tens) of replicate weights. Besides its distinctive computational efficiency, the DAGJK method shares the strong points of the most common replication methods. EVER is designed to fully exploit this versatility: the package gives users a friendly tool for calculating estimates, standard errors and confidence intervals for estimators that they define themselves (even non-analytic ones). This functionality makes EVER especially appealing whenever variance estimation by Taylor linearisation can be applied only at the price of crude approximations (e.g. poverty estimates).
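The replication idea can be illustrated with a minimal sketch of a basic delete-a-group jackknife. It is a simplification under stated assumptions: a single stratum, units assigned to groups at random, and the standard variance formula (G - 1)/G times the sum of squared deviations of the replicate estimates from the full-sample estimate. It is not the extended DAGJK variant used by EVER, and all names below are hypothetical.

```python
import random


def weighted_mean(values, weights):
    """Example of a user-defined estimator: the weighted mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def dagjk_variance(values, weights, estimator, n_groups, seed=0):
    """Basic delete-a-group jackknife estimate of the variance of an estimator.

    Units are randomly split into n_groups groups of (nearly) equal size.
    For each group, replicate weights are built by setting the weights of
    that group to zero and inflating the others by G / (G - 1); the
    estimator is then recomputed on each set of replicate weights.
    Returns (full-sample estimate, variance estimate).
    """
    if n_groups < 2 or n_groups > len(values):
        raise ValueError("n_groups must be between 2 and the number of units")

    # Random assignment of units to groups.
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)
    group = [0] * len(values)
    for position, unit in enumerate(order):
        group[unit] = position % n_groups

    full_estimate = estimator(values, weights)
    factor = n_groups / (n_groups - 1)
    replicate_estimates = []
    for g in range(n_groups):
        rep_weights = [0.0 if group[i] == g else w * factor
                       for i, w in enumerate(weights)]
        replicate_estimates.append(estimator(values, rep_weights))

    variance = (n_groups - 1) / n_groups * sum(
        (r - full_estimate) ** 2 for r in replicate_estimates)
    return full_estimate, variance


# Toy data: with one unit per group this reduces to the delete-one jackknife.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0] * len(y)
estimate, variance = dagjk_variance(y, w, weighted_mean, n_groups=len(y))
```

Because the estimator is passed in as a function, the same routine works for any statistic that can be computed from values and weights, which mirrors the point made above about user-defined estimators. Only `n_groups` re-computations are needed, however large the sample is.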