UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE
CONFERENCE OF EUROPEAN STATISTICIANS
Working paper 2013/28
27 November 2013
High-level group for the modernisation of statistical production and services (HLG)
Meeting, Geneva, 27 November 2013
Istat generalised software development and sharing:
experiences and strategy¹
1. Background
By “generalised (or generic) software” for statistical production we mean the set of systems and IT tools specifically
designed to ensure production capabilities in the different phases of a statistical production process (as defined by
the Generic Statistical Business Process Model, GSBPM). Aside from the productivity benefits that result from using
products applicable with little or no need for ad hoc code development, it should be noted that generalised systems
usually implement methodological solutions designed to maximise the quality of the statistical information
produced.
The limited number of users of these systems (essentially National Institutes of Statistics and, to a lesser extent,
other bodies involved in statistical research) has meant that private companies are, with some exceptions, not
interested in this development area, which has instead been covered by leading official statistical institutions
such as Statistics Canada, Statistics Netherlands, the U.S. Bureau of the Census and a few others.
Istat, the Italian National Institute of Statistics, has likewise been developing its own generalised IT tools since the
early 1990s (mainly generalised systems for edit & imputation and for sampling design & estimation: Concord, Diesis,
Mauss, Genesees). Where necessary, generalised systems were acquired from outside (namely Blaise, ACTR and
GEIS/Banff), so as not to leave any suitable phase of the production process uncovered.
So far, however, no strategic decision has been taken within the community of National Statistical Institutes
to coordinate their efforts and produce a complete and optimised suite of products covering the whole
statistical production chain. Only recently has interest in such coordination arisen and have initiatives been
undertaken.
A first initiative was carried out by UNECE through its Sharing Advisory Board, aiming to collect information on
generalised software used by official statistical institutions².
A second set of initiatives comprised the launch of ESSnet projects dedicated to the definition of a common architecture
and environment (respectively, CORA and CORE) facilitating the sharing of IT tools and, more recently, the
establishment of the HLG group for the definition of a Common Statistical Production Architecture (CSPA), based on
a Service Oriented Architecture approach.
None of these initiatives aims directly at the cooperative development of software systems; rather, they
prepare the ground for the adoption of common optimal IT solutions.
In the VIP.Programme the project “Shared Services” will contain an ESSnet (to be launched in 2014) dedicated to
investigating the opportunities offered by Free and Open Source Software (FOSS) projects useful to NSIs and the
ESS. The main objectives include stocktaking, standard setting, and benchmarking. The project will also investigate
opportunities for a Centre of Expertise on FOSS projects for official statistics.
It is thus inappropriate to proceed in an uncoordinated and isolated manner in the field of generalised software
production: NSIs have to define a clear strategy in order to achieve the goal of developing and adopting a common
set of standards or recommended IT tools.
1 Prepared by Giulio Barcaroli (“Methods, Tools and Methodological Support” Division, ISTAT, barcarol@istat.it)
2 http://www1.unece.org/stat/platform/display/msis/Software+Sharing
Istat generalised software development and sharing
Page 1 of 10
2. Generalised software in Istat
Since the early 1990s, ISTAT’s goal has been to make available, for each step of the production process, one or more
recommended software systems that allow the step to be performed in an optimal way, from the point of view of both
effectiveness (quality of results) and efficiency (reduction of related costs).
In table 1, the current situation is reported.
It can be seen that whenever an IT tool has been developed by ISTAT, a clear choice in favor of an open source
approach has been made, at least since 2007.
This is due to ISTAT’s policy on the development and/or acquisition of IT tools, which must meet the following
requirements: (1) implement the most advanced methods and techniques in terms of the gains they offer in
quality and efficiency, and (2) be fully interoperable, and therefore usable in the first place by the whole
community of official statistics.
As for the first requirement, the open source system R is widely regarded as being at the cutting edge of research,
because of the very large number of libraries (about 5,000) produced by a vast community of developers: the
availability of these libraries is a great advantage for the development of IT tools that can harness already
available solutions.
A prerequisite for the second requirement is the adoption of development technologies that do not jeopardise
interoperability. Therefore, whatever tool we develop should not be tied to a particular commercial DBMS or
system.
For instance, ISTAT’s standard DBMS is Oracle, but the aim is to develop IT tools that, when a relational database is
required, connect through ODBC instead of being bound directly to a particular DBMS.
The use of SAS to develop generalised software is a blocking factor, because SAS has to be installed to run SAS-based
IT tools. To overcome this limit, some ten years ago we asked SAS Institute either to allow the creation of
executables not requiring a SAS installation or to provide the tools’ functionality as a kind of web service; as the
answer was negative in both cases, we decided to abandon SAS as a development technology.
As a consequence, since 2007 the R environment and programming language have been chosen as the preferred
instruments for developing generalised software. SAS was replaced by R not only because of a general policy aimed
at decreasing ISTAT’s dependence on proprietary software vendors, but also because of interoperability
requirements. In fact, ISTAT has many relationships with other entities in the National Statistical System, and also
with other Statistical Institutes in developing countries: in transferring ISTAT software to partners, one key
requirement is that this must not oblige them to purchase commercial software in order to run it.
Currently, the software acquired on a commercial basis comes from other Statistical Institutes or International
Organisations: Canada (ACTR, Banff), Netherlands (Blaise), OECD (OECD.Stat). Some of these products are planned to be
replaced with FOSS solutions.
In particular, ACTR is being migrated to an R version, and Banff will be substituted by a set of alternative tools:
o “editrules” for error localisation (developed by Statistics Netherlands as an R package),
o an R package for nearest neighbor and Predictive Mean Matching imputation, being developed in ISTAT,
o “RSPA” for minimal record adjustment after imputation (another R package developed by Statistics
Netherlands).
As for licensing, from the very beginning we decided to disseminate the software we produced
completely free of charge. Our first generalised products (Concord and Genesees, SAS versions) were offered to any
institution or private body asking for them, with no restrictions.
ISTAT’s current policy is to release its software under a FOSS licence. Almost all of it is released under the EUPL
(“European Union Public Licence”). The EUPL grants the following rights to the licensee:
1. Obtain the source code of the software from a free access repository
2. Use the software in any circumstance and for all usage
3. Reproduce (copy, duplicate) the software
4. Modify the original software, and/or make derivative works out of it
5. Communicate the software to the public (i.e. using it through a public network or distributing services based
on the software via Internet)
6. Distribute the software or copies thereof to other users (inside or outside the licensee's organisation)
7. Lend and rent the software or copies thereof
8. Sub-licence rights in the software or copies thereof.
| GSBPM phases and sub-processes | Software | Functions | Developer | Characteristics | State | Licensing conditions |
|---|---|---|---|---|---|---|
| 2. Design: 2.1 Design frame and samples | MAUSS-R (“Multivariate Allocation of Units in Sampling Surveys”) | Design single-stage stratified samples | ISTAT | R-based system (R core + Java interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 2. Design: 2.1 Design frame and samples | BEAT (“Bethel Allocation …”) | Design two-stage stratified samples | ISTAT | R package | Completing development and testing | FOSS |
| 2. Design: 2.1 Design frame and samples; 4. Collect: 4.1 Select samples | FS4 (First Stage Sample Stratification and Selection) | | ISTAT | R packages (core + Tcl-TK interface) | Beta version | FOSS |
| 2. Design: 2.1 Design frame and samples; 4. Collect: 4.1 Select samples | SamplingStrata | Frame optimisation for stratified sampling and selection of units from optimised strata | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (GPL2) (on the CRAN) |
| 4. Collect: 4.3 Run collection; 5. Process: 5.2 Classify and code | Blaise | Develop applications for CAPI, CATI and CADI; develop applications for interactive coding | Statistics Netherlands | | Currently used in ISTAT | Commercial |
| 5. Process: 5.1 Integrate | StatMatch | Statistical matching | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (EUPL) (on the CRAN) |
| 5. Process: 5.1 Integrate | RELAIS (REcord Linkage At IStat) | Develop record linkage applications | ISTAT | R-based system (R core + Java interface) | Currently used in ISTAT and available for anyone | FOSS |
| 5. Process: 5.2 Classify and code | ACTR | Develop applications for batch coding | Statistics Canada | | Currently used in ISTAT, but migrating to an R version | Commercial |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Concord-Java | Develop applications for Fellegi-Holt based edit and imputation procedures (categorical data) | ISTAT | Fortran (core) + Java (interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Diesis | Develop applications for Fellegi-Holt and/or Nearest Neighbour based edit and imputation procedures | ISTAT | C | Currently used in ISTAT but not yet available outside | Will be FOSS |
| 5. Process: 5.3 Review, validate and edit | editrules | Develop applications for Fellegi-Holt based edit and imputation procedures (categorical and continuous data) | Statistics Netherlands | R package | Testing in ISTAT | FOSS (GPL2) (on the CRAN) |
| 5. Process: 5.3 Review, validate and edit; 5.4 Impute | Banff | Develop applications for Fellegi-Holt based edit and imputation procedures (continuous data) | Statistics Canada | SAS procedures | Currently used in ISTAT, planned substitution with open source alternative | Commercial |
| 5. Process: 5.3 Review, validate and edit | SelEMix (“SELective Editing via MIXture models”) | Develop applications for optimised selective editing | ISTAT | R package | Currently used in ISTAT and available for anyone | FOSS (EUPL) (on the CRAN) |
| 5. Process: 5.6 Calculate weights; 5.7 Calculate aggregates | ReGenesees (“R evolved GENeralised Software for Sampling Estimates and ErrorS”) | Calibration and sampling errors estimation (using analytic methods) | ISTAT | R packages (core + Tcl-TK interface) | Currently used in ISTAT and available for anyone | FOSS (EUPL) |
| 5. Process: 5.6 Calculate weights; 5.7 Calculate aggregates | EVER (“Estimation of Variance by Efficient Replication”) | Calibration and sampling errors estimation (using replication methods) | ISTAT | R package | Used in ISTAT and available for anyone | FOSS (EUPL) (on the CRAN) |
| 6. Analyse | Ranker | Calculation and analysis of composite indicators | ISTAT | Client version: VisualBasic; Web version: Java | Used in ISTAT, not yet available outside | FOSS |
| 7. Disseminate | OECD.stat | Statistical data warehousing | OECD | | Used in ISTAT | Commercial |
Table 1 – Generalised software and IT tools for statistical production in ISTAT
(in Annex A detailed information is reported with respect to the ISTAT developed software)
3. Future developments and collaboration strategy
ISTAT will continue its policy of developing and/or acquiring IT tools that are fully interoperable and shareable with
other entities in the official statistical community.
ISTAT is a promoter of the “ESSnet on Free and Open Source Software for Statistical Production”, already
approved by the ESSC, to be launched in 2014. The aim of this ESSnet “is to document, explore, and educate the
ESS on the use of FOSS projects for statistical production and to evaluate the case for an ongoing Centre of
Expertise to support the Official Statistics FOSS community. In doing so it will provide some investment to properly
address the potential role of FOSS within a Generic Statistical Information Model and its applicability in the ESS. It
will also spread knowledge of FOSS solutions and how they may benefit the business architecture, and provide NSIs
and the ESS with informed research on which to base decisions on the future of statistical production. The
importance of “plug and play” solutions to statistical production has been noted by the High Level Group for the
Modernisation of Statistical Production and Services. To this end while a one off project in the form of an ESSnet to
assess some of these plug and play solutions will be valuable, evaluating the case for a Centre of Expertise to
support the Official Statistics FOSS community that has a strong methodological viewpoint is also critical to the
success of moving away from the stovepipe model of statistical production and embracing new and improved
software as they are developed.”
In this framework, ISTAT has a positive attitude towards the collaborative development of new IT tools.
With reference to Statistics Canada’s proposal³, ISTAT agrees with the particular collaborative development model
suggested, and is willing to participate in a pilot experience.
A crucial point concerns the choice of the tool to be developed collaboratively:
o first, the area of interest (or, more precisely, the GSBPM sub-process) in which the tool will be used
has to be identified. The methods and techniques that can be employed in the chosen area should be
well established, with no need for further research. Sub-processes that
fulfil this requirement are, for instance, “Review, validate and edit”, “Impute” or “Calculate weights”, which are
thus natural candidates;
o on the basis of the best methods conceptually available for the chosen sub-process, a review of existing tools (if
any) implementing them should be carried out (updating the one contained in the Sharing Advisory Board
website); this review can lead to one of three outcomes:
1) one existing tool is deemed completely satisfactory and is adopted as a standard for the official statistics
community;
2) one existing tool is considered potentially adequate, but needs further development to provide
additional functionalities or to improve the available ones;
3) no tool is adequate.
In cases (2) and (3) a new development project can be launched, according to the collaborative development model
proposed by Statistics Canada.
3 “Software Collaboration and Sharing at Statistics Canada” – HLG meeting, Geneva, June 2013, Working paper 2013/19
Annex A — ISTAT IT tools (R packages / R based systems)
Package StatMatch
Author: Marcello D'Orazio (madorazi@istat.it)
Paper: (package Vignette) Statistical Matching and Imputation of Survey Data with the Package
StatMatch for the R Environment
http://rm.mirror.garr.it/mirrors/CRAN/web/packages/StatMatch/vignettes/Statistical_Matching_with_StatMatch.pdf
Phase: 5.4 Impute / 5.1 Integrate
Link: http://cran.r-project.org/web/packages/StatMatch/index.html
StatMatch provides R functions to perform statistical matching, i.e. the integration of two data
sources referring to the same target population that share a number of common variables. Some
functions can also be used to impute missing values in data sets through hot deck
imputation methods. Methods to perform statistical matching when dealing with data from complex
sample surveys (via weights calibration) are also available.
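To make the idea of distance hot deck matching concrete, the following is a minimal sketch in Python rather than R. It does not reproduce the StatMatch API: the function name, the variable names and the toy records are all hypothetical, and plain Euclidean distance on the common variables is assumed.

```python
import math


def nearest_neighbour_hot_deck(recipients, donors, match_vars, target):
    """Copy `target` into each recipient from its closest donor (illustrative only)."""
    completed = []
    for rec in recipients:
        # Distance is computed on the variables shared by both data sources.
        closest = min(
            donors,
            key=lambda d: math.dist(
                [rec[v] for v in match_vars], [d[v] for v in match_vars]
            ),
        )
        completed.append({**rec, target: closest[target]})
    return completed


# Hypothetical toy data: donors observe income, recipients do not.
donors = [
    {"age": 30, "hours": 40, "income": 28000},
    {"age": 55, "hours": 20, "income": 19000},
]
recipients = [{"age": 52, "hours": 25}, {"age": 33, "hours": 38}]

matched = nearest_neighbour_hot_deck(recipients, donors, ["age", "hours"], "income")
```

In practice the matching variables would usually be rescaled before computing distances, and donors might be restricted to classes defined by categorical variables; both refinements are left out of this sketch.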
Package SeleMix
Author: Ugo Guarnera, M. Teresa Buglielli (buglielli@istat.it)
Paper:
PAPER_Q2010.pdf
Phase: 5.3 Review, validate and edit
Link: http://cran.r-project.org/web/packages/SeleMix/index.html
SeleMix (Selective Editing via Mixture models) is an R package for selective editing. It includes functions
for the identification of outliers and influential errors in numerical data. For each unit, it also provides
anticipated values (predictions) for both observed and non-observed variables. The method is based on
explicitly modelling both the true (error-free) data and the error mechanism. Specifically, the true data are assumed
to follow a normal or log-normal distribution. Only a subset of the data is assumed to be affected by error, and
the error mechanism is specified through a Gaussian random variable with zero mean vector and
covariance matrix proportional to the covariance matrix characterising the true data distribution.
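The model described above can be illustrated in one dimension. The sketch below, in Python, assumes the parameters are already known (in practice they would be estimated from the data) and computes the posterior probability that an observed value is affected by error. The function and parameter names are hypothetical and are not taken from the SeleMix package; `inflation` stands for the proportionality factor between the error variance and the variance of the true data.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def error_probability(y, mean, var, weight, inflation):
    """Posterior probability that observation y is contaminated (univariate sketch)."""
    # Error-free observations follow the true-data distribution.
    clean = (1.0 - weight) * normal_pdf(y, mean, var)
    # Adding an independent zero-mean error with variance inflation * var
    # gives contaminated observations the variance (1 + inflation) * var.
    contaminated = weight * normal_pdf(y, mean, (1.0 + inflation) * var)
    return contaminated / (clean + contaminated)


# Hypothetical parameters: 5% of units in error, error variance 99 times larger.
p_typical = error_probability(0.0, mean=0.0, var=1.0, weight=0.05, inflation=99.0)
p_extreme = error_probability(6.0, mean=0.0, var=1.0, weight=0.05, inflation=99.0)
```

A value close to the centre of the distribution receives a low error probability, while a value far in the tail receives a high one; for log-normal data the same calculation could be applied after taking logarithms.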
Package SamplingStrata
Author: Giulio Barcaroli (barcarol@istat.it)
Paper: (package Vignette) Optimization of sampling strata with the SamplingStrata package
http://cran.r-project.org/web/packages/SamplingStrata/vignettes/SamplingStrataVignette.pdf
Phase - 2.4 Design frame and sample methodology
Link: http://cran.r-project.org/web/packages/SamplingStrata/index.html
In the field of sampling design (in particular for stratified sampling), this package offers an approach for
determining the best stratification of a sampling frame, i.e. the one that ensures the minimum sample
size while satisfying precision constraints in a multivariate and multidomain case. This
approach is based on a genetic algorithm: each solution (i.e. a particular partition of
the sampling frame into strata) is considered as an individual in a population; the fitness of all individuals is
evaluated by calculating (using the Bethel-Chromy algorithm) the sample size satisfying the accuracy
constraints on the target estimates. Functions in the package allow the user to: (a) analyse the results of
the optimisation step; (b) assign the new strata labels to the sampling frame; (c) select a sample from the
new frame according to the best allocation. There is also a function that builds the most
important input to the optimisation step, i.e. the “strata” dataframe, containing information (means and
standard errors) on the distributions of the target variables in the different strata, using the sampling
frame or data from previous rounds of the same survey.
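The fitness evaluation described above can be sketched for the simplest case. The Python example below is not the Bethel-Chromy algorithm and contains no genetic search: it assumes a single target variable, a single domain, stratified simple random sampling and Neyman allocation, for which the minimum sample size has a closed form. All function names and the toy frame are hypothetical.

```python
import math
import statistics


def build_strata(values, labels):
    """Summarise a frame variable by stratum: size N, mean and standard deviation."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return [
        {"N": len(g), "mean": statistics.mean(g), "sd": statistics.stdev(g)}
        for g in groups.values()
    ]


def required_sample_size(strata, target_cv):
    """Smallest total n meeting a CV target on the estimated mean (Neyman allocation)."""
    pop = sum(s["N"] for s in strata)
    mean = sum(s["N"] * s["mean"] for s in strata) / pop
    target_var = (target_cv * mean) ** 2  # largest admissible variance of the mean
    sum_ns = sum(s["N"] * s["sd"] for s in strata)
    sum_ns2 = sum(s["N"] * s["sd"] ** 2 for s in strata)
    return math.ceil(sum_ns ** 2 / (pop ** 2 * target_var + sum_ns2))


frame = list(range(1, 101))                          # toy frame variable
one_stratum = [0] * len(frame)                       # candidate A: no stratification
two_strata = [0 if y <= 50 else 1 for y in frame]    # candidate B: split at the median

n_one = required_sample_size(build_strata(frame, one_stratum), target_cv=0.05)
n_two = required_sample_size(build_strata(frame, two_strata), target_cv=0.05)
```

A candidate stratification that needs a smaller sample for the same precision target is the fitter individual; a genetic algorithm would generate and recombine many such label vectors and keep those with the lowest required size.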
Software MAUSS-R Multivariate Allocation of Units in Sampling Surveys
Authors: - Teresa Buglielli, Daniela Pagliuca
Paper:
1- User and methodological manual
http://www.istat.it/it/files/2011/02/user_and_methodological_manual.pdf
2- mauss reference manual
mauss.pdf
Phase - 2.4 Design frame and sample methodology
Link: http://www.istat.it/it/strumenti/metodi-e-software/software/mauss-r
Mauss is a tool for defining the sampling design of sample surveys on finite populations. It offers
optimality criteria, flexibility and easy management for those responsible for designing and
conducting such surveys. Once the objectives and the operational constraints of
the survey have been defined, it enables the user to choose the best sampling design among those obtained by
adopting different definitions of
the key features of the survey, such as the type of stratification, the desired accuracy of the estimates, the
sample size, the type of domains of study, and the variables of interest.
The use of this software also ensures transparency, standardisation and accuracy of the methods used.
RELAIS
Authors: Nicoletta Cibella, Marco Fortini, Monica Scannapieco, Laura Tosco, Tiziana Tuoto, Luca
Valentino
Link (software download and user guide):
http://www3.istat.it/strumenti/metodi/software/registrazione/datiuser.html?software=relais22
https://joinup.ec.europa.eu/software/relais/release/all
Selected papers:
- Cibella N., Fernandez G.L., Fortini M., Guigò M., Hernandez F., Scannapieco M., Tosco L., Tuoto T.(2009) "Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish
Experiences" - NTTS 2009
- Cibella N., Fortini M., Scannapieco M., Tosco L., Tuoto T. "Theory and practice of developing a record
linkage software" - "Combination of surveys and administrative data" Wien 29-30 May - 2008
- Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco L. "RELAIS: Don't Get Lost in a Record
Linkage Project" - FCSM 2008
- Fortini M., Scannapieco M., Tosco L., Tuoto T. "Towards an Open Source Toolkit for Building Record
Linkage Workflows" - IQIS 2006
- User’s Guide Version 2.2
http://www.istat.it/it/files/2011/03/Relais2.2UserGuide.pdf
Phase: 5.1 Integrate data
RELAIS (Record Linkage At Istat) is a toolkit for record linkage. RELAIS allows techniques to be combined
for each of the record linkage phases, so that the resulting workflow is built on the basis of the
requirements of the application at hand. More specifically, the RELAIS toolkit is composed of a
collection of techniques for each record linkage phase that can be dynamically combined in order to build
the best record linkage workflow. RELAIS has been implemented in Java and R and has a database
architecture (MySQL). Specifically, the estimation phase (EM) for the probabilistic decision model has been
implemented in R, as has the 1:1 reduction phase, which implements the LP-solve algorithm. The other techniques
and the GUIs are implemented in Java.
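As an illustration of what a probabilistic decision model computes for a candidate pair of records, the Python sketch below uses the Fellegi-Sunter form of the match weight. That RELAIS uses exactly this form is an assumption here, and the m- and u-probabilities, which an EM step would normally estimate, are hypothetical values supplied by hand.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Sum of log2 likelihood ratios over the compared fields (illustrative only)."""
    total = 0.0
    for field, agrees in agreements.items():
        m = m_probs[field]  # P(field agrees | records are a true match)
        u = u_probs[field]  # P(field agrees | records are not a match)
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


# Hypothetical probabilities for two comparison fields.
m_probs = {"surname": 0.9, "birth_year": 0.8}
u_probs = {"surname": 0.1, "birth_year": 0.2}

w_agree = match_weight({"surname": True, "birth_year": True}, m_probs, u_probs)
w_disagree = match_weight({"surname": False, "birth_year": False}, m_probs, u_probs)
```

Pairs with a weight above an upper threshold would be declared links and pairs below a lower threshold non-links, with the pairs in between left for clerical review; the thresholds themselves are not shown.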
Software ReGenesees (R evolved Generalised software for sampling estimates and errors in surveys)
Author: - Raffaella Cianchetta, Diego Zardetto
Paper:
ReGenesees reference manual
ReGenesees.pdf
ReGenesees.GUI.pdf
Phase - 5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs
Link:
https://joinup.ec.europa.eu/software/regenesees/home
ReGenesees is a full-fledged software system entirely developed in R. It has a clear-cut two-layer
architecture. The application layer of the system is embedded in an R package itself named
ReGenesees. A second R package, called ReGenesees.GUI, implements the presentation layer of
the system. Both packages can be run under Windows as well as under most Unix-like
operating systems. While the ReGenesees.GUI package requires the ReGenesees package, the
latter can also be used without the GUI on top of it. This means that the statistical functions of the
system will always be accessible to users interacting with R through the traditional command-line
interface. Conversely, less experienced R users will benefit from the user-friendly
point-and-click graphical interface.
Software EVER (Estimation of Variance by Efficient Replication)
Author: - Diego Zardetto
Paper:
EVER reference manual
http://cran.r-project.org/web/packages/EVER/EVER.pdf
Phase - 5.6 Calculate weights / 5.7 Calculate aggregates / 6.2 Validate outputs
Link:
http://cran.r-project.org/web/packages/EVER/index.html
EVER is mainly intended for calculating estimates and standard errors in complex surveys.
Variance estimation is based on the extended DAGJK (Delete-A-Group Jackknife) technique
proposed by Dr. Phillip S. Kott.
The advantage of the DAGJK method over the traditional jackknife is that, unlike the latter, it
remains computationally manageable even when dealing with “complex and big” surveys (tens of
thousands of PSUs arranged in a large number of strata with widely varying sizes). In fact, the
DAGJK method is known to provide, for a broad range of sampling designs and estimators, (near)
unbiased standard error estimates even with a “small” number (e.g. a few tens) of replicate weights.
Besides its distinctive computational efficiency, the DAGJK method shares the strong
points of the most common replication methods. As a notable example, EVER is
designed to fully exploit DAGJK's versatility: the package provides a user-friendly
tool for calculating estimates, standard errors and confidence intervals for estimators defined by
users themselves (even non-analytic ones). This functionality makes EVER especially appealing
whenever variance estimation by Taylor linearisation can be applied only at the price of crude
approximations (e.g. poverty estimates).
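The replication idea can be sketched as follows. This Python example implements only the basic delete-a-group jackknife for an unstratified single-stage sample, not the extended DAGJK used by EVER, and it does not reproduce the package's interface. Units are assigned to groups systematically by position, which assumes the sample has already been put in a suitable order; all names and data are hypothetical.

```python
def weighted_mean(values, weights):
    """Example estimator; any function of (values, weights) can be plugged in."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def dagjk_variance(values, weights, estimator, n_groups):
    """Return (full-sample estimate, delete-a-group jackknife variance)."""
    full = estimator(values, weights)
    replicates = []
    for g in range(n_groups):
        # Drop group g (units whose position is congruent to g) and
        # inflate the remaining weights by G / (G - 1).
        rep_weights = [
            0.0 if i % n_groups == g else w * n_groups / (n_groups - 1)
            for i, w in enumerate(weights)
        ]
        replicates.append(estimator(values, rep_weights))
    variance = (n_groups - 1) / n_groups * sum((r - full) ** 2 for r in replicates)
    return full, variance


# Hypothetical sample of 12 units with equal design weights, split into 4 groups.
incomes = [12.0, 15.0, 9.0, 30.0, 22.0, 18.0, 11.0, 25.0, 14.0, 40.0, 16.0, 20.0]
design_weights = [10.0] * len(incomes)
estimate, variance = dagjk_variance(incomes, design_weights, weighted_mean, n_groups=4)
standard_error = variance ** 0.5
```

Because the estimator is passed in as a function, the same replicate weights serve for any user-defined statistic, which is the versatility the text refers to; only as many re-estimations as there are groups are needed, regardless of the number of PSUs.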