
ENABLING DATA COMPILATION FROM COUNTRIES USING SDMX
Prepared by
Edgardo GREISING
Database Manager, Department of Statistics
International Labour Office
Route de Morillons 4, CH-1211 Geneva 22, Switzerland
E-mail: greising@ilo.org
I. Rationale
1. Every international organization collecting statistical indicators from countries aims to achieve the best possible response rate, to reduce the delay between the production and the publication of the information received, and to improve the overall quality of the data published, while avoiding duplication of effort and reducing the reporting burden on countries.
2. Among the different data channels provided, electronic data interchange through the use of the SDMX standard seems to be a good solution to achieve these objectives.
3. “Doing” SDMX requires good conceptual knowledge of the standard and the implementation (or adaptation to one's own environment) of appropriate software tools to generate the SDMX data flows defined and to pull the data from the countries' databases.
4. Nevertheless, and beyond the costs and efforts demanded by this implementation, many developing countries do not have a repository of indicators. In those countries the indicators are calculated as part of a publication plan and then tabulated, and the results are published on the website of the NSI, often in pdf or Excel. However, the data is not preserved in a central repository, which makes the generation of SDMX files (or even csv) very difficult, because there is no database from which to take this information.
5. Moreover, other institutions do have an indicators database, but the resources required to disseminate the data in a particular format (like SDMX) are not available. And SDMX is not easy to understand, and harder still to implement, even for simple data reporting.
6. Many people tend to think that “SDMX is too complex, and a simpler format like csv should be used”. This argument is fallacious, since SDMX is a standard protocol for exchanging statistical data and metadata that comprises its own information model and can be implemented on top of XML or UN/EDIFACT syntax, while “csv” is just another file format syntax comparable to XML or UN/EDIFACT. If somebody wants to interchange data using csv files, a protocol will have to be defined; and if it includes an information model to support both data and metadata, and is generic enough to become an ISO standard, it will probably be as complex as SDMX is.
7. Even though it is possible to define a protocol for data interchange using csv files, it is not easy to create these files without such a repository, especially if the files have to include descriptive metadata.
8. This paper elaborates on the idea of implementing an “out-of-the-box” software tool, easy to deploy and maintain by IT specialists and statisticians (not SDMX experts) of any NSI or any reporting agency, that could act as a central repository for the statistical indicators administered and provided by the agency and that is able to generate SDMX data flows to disseminate the data, including transmission to international organizations.
II. Requirements
9. The tool should be very easy to install and maintain, be supported on free software platforms (DBMS, application server, etc.) and come with good documentation, tutorials, and training.
10. To standardize the structure of the tables (multi-dimensional cubes) to be collected, the compiler (the international organization) should be able to define it. For this purpose, the tool should be able to take a definition file (i.e. in SDMX, a DSD and a Dataflow definition), define the table structure to hold the data and import the code lists and descriptive metadata from the attributes. The DSD could be taken from an SDMX registry, or, without connecting to a registry, by importing the xml file.
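As an illustration of this requirement, the sketch below retrieves a DSD as SDMX-ML from a registry's REST interface and lists its dimensions and code lists, from which a table structure could be derived. The registry URL and the DSD identifiers are placeholders; the element names follow the SDMX-ML 2.1 structure message, matched by local name to avoid depending on exact namespace prefixes.

```python
# Sketch: derive a table structure from a DSD fetched from an SDMX registry.
# The registry URL and the DSD identifiers below are hypothetical placeholders.
import urllib.request
import xml.etree.ElementTree as ET

REGISTRY = "https://registry.example.org/ws/rest"      # hypothetical registry
DSD_PATH = "/datastructure/ILO/DSD_EXAMPLE/1.0"        # hypothetical agency/id/version

def load_dsd(url):
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

def describe_structure(root):
    """List the dimensions and code lists of a DSD, ignoring namespaces."""
    dims, codelists = [], {}
    for elem in root.iter():
        tag = elem.tag.split('}')[-1]          # keep only the local element name
        if tag in ("Dimension", "TimeDimension"):
            dims.append(elem.get("id"))
        elif tag == "Codelist":
            codes = [c.get("id") for c in elem if c.tag.split('}')[-1] == "Code"]
            codelists[elem.get("id")] = codes
    return dims, codelists

if __name__ == "__main__":
    root = load_dsd(REGISTRY + DSD_PATH)
    dimensions, codelists = describe_structure(root)
    # Each dimension becomes a key column of the data table; the code lists
    # populate the corresponding reference (lookup) tables.
    print("Key columns:", dimensions)
    print("Code lists :", {k: len(v) for k, v in codelists.items()})
```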
11. The tool must have an integrated data discovery and data visualization tool, and a full-screen editor to enter/modify data and descriptive metadata (including footnotes), based on the structural metadata that defines the tables. An interactive tool for defining the tables (a structural metadata editor) would be desirable as well.
12. The tool should provide different formats for downloading/uploading tables: Excel, csv, pdf (reports) and, of course, SDMX. Dissemination of data should be supported in the form of web services and/or downloadable files.
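For the web service part of this requirement, the SDMX 2.1 RESTful API already defines a conventional query pattern (data/{flowRef}/{key}/{providerRef}, with a dot-separated key). The snippet below builds such a query URL from a dimension selection; the base URL and the dataflow reference are placeholders.

```python
# Sketch: build an SDMX 2.1 REST data query from a dimension selection.
# Base URL and dataflow reference are hypothetical placeholders.
BASE_URL = "https://data.example.org/ws/rest"   # hypothetical SDMX web service

def build_data_query(flow_ref, selection, dimension_order, start=None, end=None):
    """selection maps dimension id -> list of codes; an empty list means 'all'."""
    # The key is dot-separated, one position per dimension, '+' for OR within a position.
    key = ".".join("+".join(selection.get(dim, [])) for dim in dimension_order)
    url = f"{BASE_URL}/data/{flow_ref}/{key}/all"
    params = []
    if start:
        params.append(f"startPeriod={start}")
    if end:
        params.append(f"endPeriod={end}")
    return url + ("?" + "&".join(params) if params else "")

# Example: an indicator for two countries, both sexes, all ages, 2010 onwards.
print(build_data_query(
    "DF_EXAMPLE",                                        # hypothetical dataflow id
    {"REF_AREA": ["FRA", "ITA"], "SEX": ["T"]},
    dimension_order=["REF_AREA", "SEX", "AGE"],          # AGE left empty = all codes
    start="2010"))
```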
III. Possible solutions
13. At the time of writing this paper, we are still looking for existing solutions that could cover most of the requirements. Several alternatives have been identified so far, each one with its strengths and weaknesses.
14. FAO’s CountryStat project “is a web-based information technology system for food and agriculture statistics at
the national and subnational levels. In practice, it acts as a one stop center which centralizes and integrates the
data coming from various sources and allows to harmonize it according to international standards while ensuring
data quality and reliability.”1
15. The application can be deployed as a stand-alone solution on a server located in the country or can be used in a Software-as-a-Service mode hosted centrally by FAO. In either case, the database administration is left under the country's responsibility.
16. Among the features of CountryStat it is worth highlighting a great flexibility for uploading the data into the local repository, although there is no editor for entering or modifying the data. The data can be disseminated locally and is transmitted to the central database using a proprietary protocol (SDMX is in the pipeline). The time series are one-dimensional (not cubes) and the descriptive metadata management is limited.
17. The .Stat2 data warehouse from the OECD and the Statistical Information System Collaboration Community is a very good statistical data warehouse with a well-organized support and collaboration model behind it.
18. The product is basically oriented towards dissemination, and includes the possibility of displaying tables as well as publishing SDMX data. Data entry is based on the Data Provider Interface (DPI), which allows defining the structural metadata and the mappings to the production database, from where data is then processed by the EntryGate through XML files. It should be improved in its online data collection and editing capabilities.
19. The Fusion toolsuite3 from Metadata Technology Ltd., composed of several tools built around the SDMX paradigm (some of them free, others with a license cost), can achieve most of the required functionality. It should be complemented with some additional interfaces for data collection and editing.
20. Since 2009 Eurostat has been developing its SDMX Reference Infrastructure (SDMX-RI)4 to help any system that is not “SDMX enabled” to make use of the standard for data collection and dissemination. It allows for defining the mapping between the SDMX concepts and the production database and provides the interfaces through web services. Nevertheless, SDMX-RI does not include a data repository; one is assumed to be already present at the agency that wants to collect/disseminate the data.
21. Dataloader5 is a standalone tool created by Istat (Italy's NSI) to enable decentralized data collection at the national level, which could probably be extrapolated to the international level.
22. The idea was to extend SDMX-RI by creating and populating a dissemination database interacting with the
Mapping Store database from which it retrieves the structural metadata information necessary to upload the data.
Dissemination is then assured by the SDMX-RI components.
23. For data entry it is able to gather csv or fixed-length record (flr) files by defining the mapping between the different concepts in a DSD and the columns of the file. It should be improved in its online data collection and editing capabilities.
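To illustrate what such a mapping amounts to, the sketch below maps the columns of a csv file onto the concepts of a DSD and emits the series key, value and attributes for each row. The column names and concept identifiers are invented for the example, not taken from the Dataloader itself.

```python
# Sketch: map csv columns onto DSD concepts (dimensions, measure, attributes).
# Column names and concept ids are hypothetical; a real mapping would be
# defined against the imported DSD.
import csv

COLUMN_TO_CONCEPT = {
    "country":  "REF_AREA",     # dimension
    "sex":      "SEX",          # dimension
    "year":     "TIME_PERIOD",  # time dimension
    "value":    "OBS_VALUE",    # primary measure
    "status":   "OBS_STATUS",   # observation-level attribute (flag)
}
DIMENSIONS = ["REF_AREA", "SEX", "TIME_PERIOD"]

def read_observations(path):
    """Yield (series key, value, attributes) tuples from a mapped csv file."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            mapped = {COLUMN_TO_CONCEPT[col]: val for col, val in row.items()
                      if col in COLUMN_TO_CONCEPT}
            key = tuple(mapped[d] for d in DIMENSIONS)
            value = mapped.get("OBS_VALUE")
            attrs = {k: v for k, v in mapped.items()
                     if k not in DIMENSIONS and k != "OBS_VALUE"}
            yield key, value, attrs

# for key, value, attrs in read_observations("indicators.csv"):
#     load_into_repository(key, value, attrs)   # hypothetical loader call
```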
1 http://www.fao.org/economic/ess/countrystat/en/
2 See Appendix I - What is .Stat
3 See Appendix II - Fusion Toolsuite
4 https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php/SDMX_Reference_Infrastructure_SDMX-RI
5 See Appendix III - Data Loader
24. Another alternative to analyse is the possibility of adding to the common tools used to compute and disseminate the indicators (Stata, SPSS, R, SAS, etc.) the ability to write SDMX datasets. In that case, the user would be able to generate an SDMX-compliant file in the same way as the Excel or pdf reports used for dissemination are generated.
25. It is important to take into account that for this alternative the software should be able not only to write SDMX-compliant output but also to import a DSD containing the definition of the concepts and code lists to be used.
26. On the other hand, the handling of descriptive metadata (footnotes, flags) to be attached to the observation values and to the table's components may be complex when thinking of producing the SDMX data file directly from the statistical processor.
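As a rough illustration of what such an export routine would have to produce, the sketch below writes a simplified, SDMX-like structure-specific dataset in XML, with the series key carried as element attributes and a flag attached at observation level. It deliberately omits the exact SDMX-ML 2.1 namespaces and header that a real implementation would have to follow.

```python
# Sketch: write a simplified SDMX-like XML dataset from tabulated results.
# This is NOT a schema-valid SDMX-ML message; the namespaces, header and exact
# element names required by SDMX-ML 2.1 are omitted for brevity.
import xml.etree.ElementTree as ET

def write_dataset(series, path):
    """series: list of (key_dict, [(period, value, flags_dict), ...])."""
    dataset = ET.Element("DataSet")
    for key, observations in series:
        serie = ET.SubElement(dataset, "Series", attrib=key)   # key as attributes
        for period, value, flags in observations:
            obs = ET.SubElement(serie, "Obs",
                                TIME_PERIOD=period, OBS_VALUE=str(value))
            for attr, code in flags.items():                   # e.g. OBS_STATUS="E"
                obs.set(attr, code)
    ET.ElementTree(dataset).write(path, xml_declaration=True, encoding="utf-8")

write_dataset(
    [({"REF_AREA": "ITA", "SEX": "T"},
      [("2012", 10.7, {"OBS_STATUS": "A"}),
       ("2013", 12.1, {"OBS_STATUS": "E"})])],   # "E" used here as an example flag
    "dataset.xml")
```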
IV. At a glance

Requirement | CountryStat | .Stat | Fusion Toolsuite | SDMX-RI | SDMX-RI + Dataloader
Developer | FAO | OECD – SIS-CC | Metadata Technology | Eurostat | Istat
Governance/Deployment experience | Yes | Yes | No | Yes | No
Documentation, tutorials, training | Yes | ? | ? | Yes | ?
Multidimensional cubes | No | Yes | Yes | Yes | Yes
Interactive cube structure definition | ? | Yes | Yes | Yes | Yes
Define cube structure by DSD | No | No | Yes | Yes | Yes
Data discovery & visualization | Yes | Yes | Yes | No | No
Online data & metadata editor | No | No | No | No | No
Descriptive metadata management | Limited | Yes | ? | Yes | ?
SDMX exports/web services | No | Yes | Yes | Yes | Yes
Other formats exports | Yes | Yes | Yes | Converter | Converter
Commercial status | Free | Collab. Community | $ | Free | ?
APPENDIX I
What is .Stat
.Stat is the central repository ("warehouse") where validated statistics and related metadata are stored. It provides the
sole and coherent source of statistical data and related metadata for an organisation’s statistical data sharing,
publication and electronic dissemination processes.
.Stat enables users to easily locate required data from a single online source, rather than having to navigate multiple databases and data formats, each with its own query/manipulation systems. Access to systematic metadata in .Stat helps ensure appropriate selection and use of statistical information.
Main features
• a rich data visualisation and exploration interface, allowing dimension selection, query saving and data extraction in multiple formats
• data providers are autonomous for most data loads and management features, with overall reduction in data administration and a centralised system for access rights management by data providers
• separation of the production and dissemination environments for improved performance and data integrity
• a single repository that can integrate with analytical tools and publication processes through various interfaces
• web services in support of Open data access, built on internationally recognised standards through the use of the Statistical Data and Metadata Exchange Web Service (SDMX)
.Stat current evolutions
• Search API: connect .Stat to any third-party search engine using a dedicated web service
• Open data APIs (BETA): SDMX-JSON WS, OData WS
(Figure: .Stat Web Services and interaction with the Web browser (WBOS) – Authentication, Search, Query Manager, Bulk Extract, SDMX Data Extractor (hide empty rows), Table Generator, Dataset Browser)
(Figure: .Stat Architecture – Data Flow)
Data Entry
The Data Provider Interface (DPI) is a web-based application and the interface between data production and data publishing that enables data providers to manage their data in the Data Warehouse. The DPI creates work packages for the entry gate to process and load data into the Data Warehouse.
• Data provider interface (DPI)
  – web-based application and the interface between data production and data publishing,
  – enables data providers to manage their data in the data warehouse, including:
    • dataset and dimension structure,
    • data content,
    • data presentation,
    • metadata,
    • core data view,
    • and related files.
  – The DPI creates work packages for the entry gate to process and load data into the data warehouse.
• EntryGate
  – a data transfer “protocol” that uses a well-defined XML stream and hot folders to manage the loading of data into .Stat (a minimal polling sketch is given after this list).
  – A permanently running Entry-Gate service watches and processes all “incoming” XML files one at a time; all other requests are queued.
  – The data provider is informed (as specified through the DPI) by email about the outcome (success, failure, errors, warnings). If any error occurred and the processing failed, .Stat performs a complete rollback to the situation before the submission of the XML stream.
  – The data provider can automate the update of data by creating batch jobs that regularly extract the relevant data from the production database, create the necessary XML file and put a copy of this file at the .Stat Entry-Gate.
  – The XML format was chosen for its:
    • flexibility,
    • platform and software independence,
    • readability,
    • and well-structured format.
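The following is a minimal sketch of the hot-folder idea described above: a loop that watches an incoming directory and hands XML submissions to a loader one at a time, queuing the rest by leaving them on disk. The folder names and the load/rollback routines are placeholders, not the actual .Stat EntryGate implementation.

```python
# Sketch of a hot-folder watcher in the spirit of the EntryGate description.
# Folder names and process_submission() are hypothetical placeholders.
import time
from pathlib import Path

INCOMING = Path("hot_folder/incoming")   # data providers drop XML files here
PROCESSED = Path("hot_folder/processed")
FAILED = Path("hot_folder/failed")

def process_submission(xml_path):
    """Placeholder for parsing the XML stream and loading it into the warehouse.
    Should raise an exception on any error so the caller can react."""
    raise NotImplementedError

def watch(poll_seconds=30):
    for folder in (INCOMING, PROCESSED, FAILED):
        folder.mkdir(parents=True, exist_ok=True)
    while True:
        # Oldest file first; remaining files simply stay queued in the folder.
        for xml_file in sorted(INCOMING.glob("*.xml"), key=lambda p: p.stat().st_mtime):
            try:
                process_submission(xml_file)
                xml_file.rename(PROCESSED / xml_file.name)
                # here: notify the data provider of success by email
            except Exception:
                xml_file.rename(FAILED / xml_file.name)
                # here: roll back the partial load and email the error report
        time.sleep(poll_seconds)
```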
Data Storage
Based on standard star schema data warehouse technology, the Data Warehouse is able to process and handle multiple updates per hour in a synchronised way.
• Data warehouse
  – Microsoft SQL Server 2005 or above,
  – standard star schema data warehouse technology,
  – OECD has over 800 datasets (tables),
  – the database is approximately 330 GB in size but built to be scalable,
  – able to process and handle multiple updates per hour, and all updates are synchronised,
  – a maximum of 16 dimensions, with a current maximum of 7000 items per dimension for optimal performance,
  – snapshots and CSV exports provide archive functionality.
Data Exit
A single exit point serves all outputs from the Data Warehouse. This consists of a set of Web Services, which are exploited by the various dissemination tools such as the .Stat Web Browser.
• Services
  – A number of services complete the .Stat solution. These include:
    • internationally recognised standards through the Statistical Data and Metadata Exchange Web Service (SDMX),
    • Entry Gate, which enables the processing of XML command files to load data from production data systems through the DPI,
    • Dataset Browser, which serves up data to the OECD.Stat browser interface,
    • and Authentication, which allows for single sign-on (SSO) management of groups and dataset access permissions.
• Dissemination
  – The .Stat browser allows for:
    • tabular views of datasets,
    • pivoting functionality,
    • export to CSV, Excel, SDMX,
    • saving queries,
    • and merging queries to compare datasets.
  – The eXplorer component viewer is a graphical component that provides mapping and charting of data.
APPENDIX II
Fusion Toolsuite
All the following solutions are products developed by Metadata Technology which have been built with
SdmxSource at the core. Each product plays a specific role, to support data, structural metadata, and reference
metadata. Each can operate on its own, or they can work together to provide an integrated solution.
Fusion Matrix
Functional Role
Figure 1 – Static mock-up of the Fusion Matrix user interface
The Fusion Matrix is a dissemination database. The database tables are generated automatically based on the Data Structure Definition. The tables are indexed and optimised for data retrieval. On data import, the data are validated against the DSD and any Constraints defined (i.e. a subset of valid codes or keys). The resulting table structure is very simple, with only two tables for each Data Structure.
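To make the "two tables per Data Structure" idea concrete, the sketch below generates a series table and an observation table from a list of DSD dimensions. The column types and naming are an assumption for illustration, not the actual Fusion Matrix schema.

```python
# Sketch: generate a two-table layout (series + observations) for one DSD.
# Column types and naming are assumptions; the real Fusion Matrix schema differs.
def ddl_for_dsd(dsd_id, dimensions):
    series_cols = ",\n  ".join(f"{dim} VARCHAR(50) NOT NULL" for dim in dimensions)
    series_table = (
        f"CREATE TABLE {dsd_id}_SERIES (\n"
        f"  SERIES_ID INTEGER PRIMARY KEY,\n  {series_cols}\n);"
    )
    obs_table = (
        f"CREATE TABLE {dsd_id}_OBS (\n"
        f"  SERIES_ID INTEGER REFERENCES {dsd_id}_SERIES(SERIES_ID),\n"
        f"  TIME_PERIOD VARCHAR(20) NOT NULL,\n"
        f"  OBS_VALUE FLOAT,\n"
        f"  OBS_STATUS VARCHAR(5)\n);"
    )
    return series_table, obs_table

for stmt in ddl_for_dsd("DSD_EXAMPLE", ["REF_AREA", "SEX", "AGE"]):
    print(stmt)
```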
The Fusion Matrix offers SDMX compliant web services for external clients to directly query the database. The
Fusion Matrix provides a user interface to view the data, and related metadata. The Fusion Browser Excel
plugin tool integrates with the Fusion Matrix, allowing users to directly query the database from within Excel.
The Fusion Matrix currently does not have a back-office user interface. All data is imported/deleted through the command line tool. Data can be imported in any SDMX format, and in CSV format.
The Fusion Matrix can make use of the Fusion Registry to obtain the Structural Metadata (codelists, concepts,
data structures).
The Fusion Matrix can make use of Fusion Scribe to obtain the Reference Metadata.
The development plan for Fusion Matrix includes the addition of a back-office GUI which allows users to insert/amend/delete data directly in the database. The plan also includes allowing the user to identify observation(s) or keys that require metadata using the Fusion Matrix UI. In addition, the generation of pdf output is in the plan schedule.
Technical Requirements
There is no installation required. Fusion Matrix consists of a command line tool to perform imports, and a web
application for dissemination. Fusion Matrix requires the following software to be available (all of which are
freely available):
• MySQL database
• Apache Tomcat web server
Fusion Registry
Functional Role
The Fusion Registry is a structural metadata maintenance and dissemination application. It provides web services for external applications to obtain structures. Fusion Matrix does not require a Fusion Registry to be present; it just requires structural metadata. Structural metadata can be provided via any SDMX-conformant web service, including the Fusion Registry, or via an SDMX file.
Technical Requirements
There is no installation required. The Fusion Registry is a single web application, which is deployed to Apache Tomcat. Fusion Registry requires the following software to be available (all of which are freely available):
• MySQL/Oracle/SQL Server database
• Apache Tomcat web server
Fusion Scribe
Functional Role
The Fusion Scribe is a reference metadata repository, which allows the authoring and dissemination of reference metadata. Fusion Matrix does not require Fusion Scribe; however, if it has access to Fusion Scribe, it will integrate with it. This means the Fusion Matrix can enrich the data responses with additional metadata from Fusion Scribe.
Technical Requirements
There is no installation required. The Fusion Scribe consists of a web application for metadata maintenance and one for dissemination; both are deployed to Apache Tomcat. Fusion Scribe requires the following software to be available (all of which are freely available):
• MongoDB database
• Apache Tomcat web server
Fusion Browser
Functional Role
Figure 2 – Fusion Browser – navigate the Dataflows (called topics)
Figure 3 – On selecting a Dataflow (topic), selecting values per dimension. Available choices are updated on click to prevent the user from making an invalid combination of selections
Figure 4 – Showing the results of a data query. The user is presented with additional tools to chart, pivot, and perform arithmetic on the data
Fusion Browser is an Excel plugin for Excel 2007 and upwards. It adds a new option to the Excel Ribbon,
which allows users to browse data from configurable datasources. Fusion Matrix has been enabled to support
the Fusion Browser. The Fusion Browser is an open source product from Metadata Technology, and the Beta
release is imminent.
Technical Requirements
The Fusion Browser does not require installation and, other than Excel, has no other software requirements.
Benefits of the Solution to the Reporting Agency
The use of the Fusion Matrix integrated solution provides the reporting agency with:
• An easy-to-use and performant database for permanent storage that can be used for data reporting and data visualisation, supporting multiple statistical domains
• The ability to report data in a standardised format
• The ability to query and visualise data using an Excel add-in
• An easy-to-use metadata authoring and reporting tool that integrates metadata automatically with the data that is queried
The benefits to the reporting agency of using the Fusion Matrix integrated solution are:
1. The need for technical resources is minimal, as the integrated toolset has minimal installation requirements.
2. The solution can be used for statistical data storage, reporting and visualisation of any type of data, as the toolset is not specific to any particular statistical domain.
3. Interfacing to the toolset is via software already installed on the user desktop – web browser, Excel – with which the user is well acquainted.
4. It enables the organisation to better manage its statistical resources by combining the major aspects of statistical data and metadata storage, visualisation and dissemination.
Additional Tools
Introduction
There will be reporting organisations that wish to take advantage of plugging into the SDMX world (and therefore to be able to take advantage of other SDMX tools) using an existing database, but do not have the resources to learn the SDMX standards. The way this can be achieved is covered in this section.
A possible way for a data collection organisation to collect data is to implement a web service which allows users to upload datasets. It is probable that such a service will need to have sufficient security to ensure the data it collects comes from validated data reporters and is compliant in terms of the data itself. This topic is also covered briefly in this section.
Organisations with an Existing Database
and which wish to use an easy way to retrieve and format data for reporting or dissemination
Solution
The SdmxSource has a variety of components that can be packaged into a “data access” utility that comprises
two major components:
Query reader – this makes it easy for an organisation to process an SDMX query without knowing the syntax
of SDMX: it just needs to understand that there are dimensions each of which has one or more query values and
the time range queried. The organisation then needs to write the SQL code that will make the query on its
database.
Data writer – this makes it easy to write the data returned as a result of the SQL query without knowing the syntax of SDMX: it just needs to understand that there are series keys, observations, time periods, and attributes. There are many implementations of the data writer in SdmxSource, so a variety of output formats are available, and it is very easy to write an output for a different format: it does not have to be an SDMX format (indeed a CSV format is already available, and formats for packages such as R and RDF (using the Data Cube Vocabulary) are under development).
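The division of labour described above can be sketched as follows: the organisation receives an already-parsed query (dimension selections plus a time range) and only has to turn it into SQL against its own tables, then hand the rows to a pluggable writer. The table and column names are invented; SdmxSource itself is a Java library, so this Python fragment only mirrors the idea.

```python
# Sketch: turn a parsed SDMX-style query into SQL against a local table and
# hand the rows to a generic writer. Table/column names are hypothetical;
# SdmxSource itself is a Java library, this only mirrors the division of labour.
import sqlite3

def build_sql(table, selections, start=None, end=None):
    """selections: dimension/column -> list of requested codes."""
    clauses, params = [], []
    for column, codes in selections.items():
        if codes:                                   # empty list = no restriction
            clauses.append(f"{column} IN ({','.join('?' * len(codes))})")
            params.extend(codes)
    if start:
        clauses.append("TIME_PERIOD >= ?"); params.append(start)
    if end:
        clauses.append("TIME_PERIOD <= ?"); params.append(end)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return f"SELECT * FROM {table}{where}", params

def write_series(rows, writer):
    """Pass each row (series key, period, value) to whatever writer is plugged in."""
    for row in rows:
        writer(row)

conn = sqlite3.connect(":memory:")                  # stand-in for the production DB
conn.execute("CREATE TABLE UNEMP (REF_AREA TEXT, SEX TEXT, TIME_PERIOD TEXT, OBS_VALUE REAL)")
conn.execute("INSERT INTO UNEMP VALUES ('ITA', 'T', '2013', 12.1)")
sql, params = build_sql("UNEMP", {"REF_AREA": ["ITA", "FRA"], "SEX": ["T"]}, start="2010")
write_series(conn.execute(sql, params), print)      # 'print' stands in for an SDMX writer
```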
Benefits to the Reporting Agency
• The ability to report data in a standardised format with minimal effort and no need to learn SDMX
• When coupled with the Fusion Registry, the Excel plug-in can be used to query and visualise the data
Data Upload to a Central Location
Data collection can be streamlined by providing a data upload facility to reporting organisations. Such facilities are easy to build using the SdmxSource and are simple to use. Note that such files can be zipped automatically to minimise their size for the transfer. Validation and visualisation components can be integrated easily into such a collection system.
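A reporting side for such an upload facility could look like the sketch below: the dataset file is zipped to minimise the transfer size and posted to the collector's upload endpoint over HTTPS. The endpoint URL and the token-based authorization are placeholders; the collecting organisation would define the real interface and its security requirements.

```python
# Sketch: zip a dataset and upload it to a (hypothetical) central collection endpoint.
import zipfile
import urllib.request
from pathlib import Path

UPLOAD_URL = "https://collector.example.org/upload"   # hypothetical endpoint

def zip_dataset(dataset_path):
    """Compress the dataset to minimise the size of the transfer."""
    archive = Path(dataset_path).with_suffix(".zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(dataset_path, arcname=Path(dataset_path).name)
    return archive

def upload(archive_path, token):
    request = urllib.request.Request(
        UPLOAD_URL,
        data=Path(archive_path).read_bytes(),
        headers={"Content-Type": "application/zip",
                 "Authorization": f"Bearer {token}"},   # placeholder auth scheme
        method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# status = upload(zip_dataset("dataset.xml"), token="...")
```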
APPENDIX III
Data Loader
Scenario
The Data Loader is an application for loading the data into a repository that interacts with the SDMX-RI for dissemination through SDMX web services.
(Figure: The data flow in the Data Loader application)
As shown in the picture above, the Loader application creates and populates a dissemination database interacting with the Mapping Store database, from which it retrieves the structural metadata information necessary to upload the data. Through this interaction the user loads the DSDs and dataflows that are needed into the Loader database.
This choice, although it creates a duplication of structural metadata (Structural Tables in the figure), is due to two different reasons:
• better performance during the loading of data,
• better completeness of the information stored in the Loader database.
After the loading, the user can create the data tables (Data Tables in the figure). These data tables depend on the DSD to which they refer.
The next step for the user is to load a data file. It is recommended that the data in the data file be described using the structural metadata previously loaded. The data file to be loaded can have one of the following formats: SDMX (compact, generic and cross-sectional), fixed-length record (flr), GESMES/TS and character-separated values (csv). Loading a data file in fixed-length record, GESMES/TS or character-separated format requires an internal operation that provides to the application the exact way in which the data can be read (a mapping). This mapping is not necessary for SDMX data files. After the loading of the data it is still necessary (in this version) to perform a mapping of the data using the “Mapping Assistant” to make the data available through the SDMX-RI web service.
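The mapping for a fixed-length record file essentially tells the loader which character positions hold which DSD concept. A minimal sketch, with invented field positions and concept ids:

```python
# Sketch: read a fixed-length record (flr) file using a position-to-concept mapping.
# Field positions and concept ids are invented for the example.
FLR_MAPPING = [
    ("REF_AREA",    0,  3),   # characters 0-2: country code
    ("SEX",         3,  4),   # character 3: sex code
    ("TIME_PERIOD", 4,  8),   # characters 4-7: year
    ("OBS_VALUE",   8, 16),   # characters 8-15: value
    ("OBS_STATUS", 16, 17),   # character 16: flag
]

def read_flr(path):
    """Yield one dict per record, keyed by DSD concept."""
    with open(path, encoding="ascii") as f:
        for line in f:
            record = {concept: line[start:end].strip()
                      for concept, start, end in FLR_MAPPING}
            yield record

# Example record: "ITAT2013    12.1E" -> {'REF_AREA': 'ITA', 'SEX': 'T', ...}
```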
With the same application it is also possible to create static data files in SDMX format and related RSS files for organizations that do not want to use a web service for the dissemination of data.
The application also provides the possibility to define different users with different levels of security.