ENABLING DATA COMPILATION FROM COUNTRIES USING SDMX

Prepared by Edgardo GREISING
Database Manager, Department of Statistics
International Labour Office
Route de Morillons 4, CH-1211 Geneva 22, Switzerland
E-mail: greising@ilo.org

I. Rationale

1. Every international organization that collects statistical indicators from countries aims to achieve the best possible response rate, to reduce the delay between the production and publication of the information received, and to improve the overall quality of the data published, while avoiding duplication of effort and reducing the burden on countries.

2. Among the different data channels available, electronic data interchange through the SDMX standard seems to be a good solution for achieving these objectives.

3. “Doing” SDMX requires good conceptual knowledge of the standard and the implementation (or adaptation to the local environment) of appropriate software tools to generate the defined SDMX data flows and pull the data from the countries’ databases.

4. Nevertheless, and beyond the cost and effort that this implementation demands, many developing countries do not have a repository of indicators. In those countries the indicators are calculated as part of a publication plan, then tabulated, and the results are published on the website of the NSI, often in pdf or Excel. The data is not preserved in a central repository, however, which makes the generation of SDMX files (or even csv files) very difficult, because there is no database from which to take this information.

5. Moreover, other institutions do have an indicators database but lack the resources required to disseminate the data in a particular format (like SDMX). And SDMX is not easy to understand, and is harder still to implement, even for simple data reporting.

6. Many people tend to think “SDMX is too complex, and a simpler format like csv should be used”.
This argument is fallacious: SDMX is a standard protocol for exchanging statistical data and metadata that comprises its own information model and can be implemented on top of XML or UN/EDIFACT syntax, whereas csv is merely a file format syntax, comparable to XML or UN/EDIFACT. Anyone who wants to interchange data using csv files will have to define a protocol; and if that protocol includes an information model supporting both data and metadata, and is generic enough to become an ISO standard, it will probably be as complex as SDMX.

7. Even though it is possible to define a protocol for data interchange using csv files, it is not easy to create these files without a central repository of indicators, especially if the files have to include descriptive metadata.

8. This paper elaborates on the idea of implementing “out-of-the-box” software that is easy to deploy and maintain by the IT specialists and statisticians (not SDMX experts) of any NSI or other reporting agency. The software would act as a central repository for the statistical indicators administered and provided by the agency, and would be able to generate SDMX data flows to disseminate the data, including transmission to international organizations.

II. Requirements

9. The tool should be very easy to install and maintain, be supported on free software platforms (DBMS, application server, etc.) and come with good documentation, tutorials and training.

10. To standardize the structure of the tables (multi-dimensional cubes) to be collected, the compiler (the international organization) should be able to define that structure. For this purpose, the tool should be able to take a definition file (e.g. in SDMX, a DSD and a Data Flow definition), define the table structure to hold the data, and import code lists and descriptive metadata from the attributes. The DSD could be taken from an SDMX registry or, without connecting to a registry, by importing the XML file.

11.
The tool must have an integrated data discovery and data visualization tool, and a full-screen editor to enter and modify data and descriptive metadata (including footnotes), based on the structural metadata that defines the tables. An interactive tool for defining the tables (a structural metadata editor) would be desirable as well.

12. The tool should provide different formats for downloading and uploading tables: Excel, csv, pdf (reports) and, of course, SDMX. Dissemination of data should be supported in the form of web services and/or downloadable files.

III. Possible solutions

13. At the time of writing, we are still looking for existing solutions that could cover most of the requirements. Several alternatives have been identified so far, each with its strengths and weaknesses.

14. FAO’s CountryStat project “is a web-based information technology system for food and agriculture statistics at the national and subnational levels. In practice, it acts as a one stop center which centralizes and integrates the data coming from various sources and allows to harmonize it according to international standards while ensuring data quality and reliability.”¹

15. The application can be deployed as a stand-alone solution on a server located in the country, or can be used in Software-as-a-Service mode hosted centrally by FAO. In either case, database administration remains the country’s responsibility.

16. Among the features of CountryStat, it is worth highlighting the great flexibility for uploading data into the local repository, although there is no editor for entering or modifying the data. The data can be disseminated locally and is transmitted to the central database using a proprietary protocol (SDMX is in the pipeline). The time series are one-dimensional (not cubes) and descriptive metadata management is limited.

17.
.Stat², the data warehouse from the OECD and the Statistical Information System Collaboration Community (SIS-CC), is a very good statistical data warehouse backed by a well-organized support and collaboration model.

18. The product is basically oriented towards dissemination, and includes the possibility of displaying tables as well as publishing SDMX data. Data entry is based on the Data Provider Interface (DPI), which allows the structural metadata and the mappings to the production database to be defined; data is then processed from there by the EntryGate through XML files. Its online data collection and editing capabilities should be improved.

19. The Fusion toolsuite³ from Metadata Technology Ltd. is composed of several tools built around the SDMX paradigm (some of them free, others with a license cost) and can provide most of the required functionality. It should be complemented with some additional interfaces for data collection and editing.

20. Since 2009 Eurostat has been developing its SDMX Reference Infrastructure (SDMX-RI)⁴ to help any system that is not “SDMX enabled” make use of the standard for data collection and dissemination. It allows the mapping between the SDMX concepts and the production database to be defined, and provides the interfaces through web services. Nevertheless, SDMX-RI does not include a data repository; one is assumed to be present in the agency that wants to collect or disseminate the data.

21. Dataloader⁵ is a standalone tool created by Istat (Italy’s NSI) to enable decentralized data collection at the national level, an approach that could probably be extended to the international level.

22. The idea was to extend SDMX-RI by creating and populating a dissemination database that interacts with the Mapping Store database, from which it retrieves the structural metadata needed to upload the data. Dissemination is then handled by the SDMX-RI components.

23.
For data entry, it can load csv or fixed length record (flr) files once a mapping has been defined between the concepts in a DSD and the columns of the file. Its online data collection and editing capabilities should be improved.

24. Another alternative to analyse is the possibility of adding, to the common reporting tools used to compute and disseminate the indicators (Stata, SPSS, R, SAS, etc.), the ability to write SDMX datasets. In that case, the user would be able to generate an SDMX-compliant file in the same way as the Excel or pdf reports used for dissemination are generated.

25. For this alternative, it is important to take into account that the software should be able not only to write SDMX-compliant output but also to import a DSD containing the definition of the concepts and code lists to be used.

26. On the other hand, the handling of descriptive metadata (footnotes, flags) to be attached to the observation values and to the table’s components may be complex when the SDMX data file is produced directly from the statistical package.

¹ http://www.fao.org/economic/ess/countrystat/en/
² See Appendix I - What is .Stat
³ See Appendix II - Fusion Toolsuite
⁴ https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php/SDMX_Reference_Infrastructure_SDMX-RI
⁵ See Appendix III - Data Loader

IV. At a glance

Requirement                            CountryStat  .Stat              Fusion Toolsuite     SDMX-RI   SDMX-RI + Dataloader
Developer                              FAO          OECD – SIS-CC      Metadata Technology  Eurostat  Istat
Governance/Deployment experience       Yes          Yes                No                   Yes       No
Documentation, tutorials, training     Yes          ?                  ?                    Yes       ?
Multidimensional cubes                 No           Yes                Yes                  Yes       Yes
Interactive cube structure definition  ?            Yes                Yes                  Yes       Yes
Define cube structure by DSD           No           No                 Yes                  Yes       Yes
Data discovery & visualization         Yes          Yes                Yes                  No        No
Online data & metadata editor          No           No                 No                   No        No
Descriptive metadata management        Limited      Yes                ?                    Yes       ?
SDMX exports/web services              No           Yes                Yes                  Yes       Yes
Other formats exports                  Yes          Yes                Yes                  Converter Converter
Commercial status                      Free         Collab. Community  $                    Free      ?

APPENDIX I

What is .Stat

.Stat is the central repository ("warehouse") where validated statistics and related metadata are stored. It provides the sole and coherent source of statistical data and related metadata for an organisation’s statistical data sharing, publication and electronic dissemination processes.

.Stat enables users to easily locate required data from a single online source, rather than having to navigate multiple databases and data formats, each with its own query/manipulation systems. Access to systematic metadata in .Stat also helps ensure appropriate selection and use of statistical information.

Main features

• a rich data visualisation and exploration interface, allowing dimension selection, query saving and data extraction in multiple formats
• data providers are autonomous for most data loads and management features, with an overall reduction in data administration and a centralised system for access rights management by data providers
• separation of the production and dissemination environments for improved performance and data integrity
• a single repository that can integrate with analytical tools and publication processes through various interfaces
• web services in support of open data access, built on internationally recognised standards through the use of the Statistical Data and Metadata Exchange (SDMX) web service

.Stat current evolutions

• Search API: connect .Stat to any third-party search engine using a dedicated web service
• Open data APIs (BETA): SDMX-JSON WS, OData WS

Figure: .Stat Web Services and interaction with Web browser (WBOS). Components shown: Authentication, Search, Query Manager, Bulk Extract, SDMX, Data Extractor (Hide empty rows), Table Generator, DatasetBrowser.

.Stat Architecture – Data Flow

Data Entry

Data Provider Interface (DPI): a web-based application and the interface between data production and data publishing that enables data providers to manage their data in the Data Warehouse.
The DPI creates work packages for the entry gate to process and load data into the Data Warehouse.

• Data provider interface (DPI)
  – web-based application and the interface between data production and data publishing,
  – enables data providers to manage their data in the data warehouse, including:
    • dataset and dimension structure,
    • data content,
    • data presentation,
    • metadata,
    • core data view,
    • and related files.
  – The DPI creates work packages for the entry gate to process and load data into the data warehouse.
• EntryGate
  – a data transfer “protocol” that uses a well-defined XML stream and hot folders to manage the loading of data to .Stat.
  – A permanently running EntryGate service watches and processes all “incoming” XML files one at a time; all other requests are queued.
  – The data provider is informed by email (as specified through the DPI) about the outcome (success, failure, errors, warnings). If an error occurs and the processing fails, .Stat performs a complete rollback to the situation before the submission of the XML stream.
  – The data provider can automate the update of data by creating batch jobs that regularly extract the relevant data from the production database, create the necessary XML file and put a copy of this file at the .Stat EntryGate.
  – The XML format was chosen for its:
    • flexibility,
    • platform and software independence,
    • readability,
    • and well-structured format.

Data Storage

Based on standard star schema data warehouse technology, the Data Warehouse is able to process and handle multiple updates per hour in a synchronised way.
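As an illustration only, the following Python sketch combines two ideas described above: a star schema (dimension tables around a central fact table) and the all-or-nothing loading behaviour of the EntryGate. It uses SQLite so that it is self-contained, whereas .Stat itself runs on Microsoft SQL Server; the table names, column names and sample values are hypothetical and are not taken from .Stat.

```python
import sqlite3


def create_star_schema(conn):
    """Create two dimension tables and one central fact table (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_area (id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL);
        CREATE TABLE dim_time (id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL);
        CREATE TABLE fact_obs (
            area_id INTEGER NOT NULL REFERENCES dim_area(id),
            time_id INTEGER NOT NULL REFERENCES dim_time(id),
            value   REAL NOT NULL,
            PRIMARY KEY (area_id, time_id)
        );
    """)


def load_batch(conn, rows):
    """Load (area, period, value) rows; undo the whole batch if any row fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            for area, period, value in rows:
                # Register dimension members the first time they are seen.
                conn.execute("INSERT OR IGNORE INTO dim_area(code) VALUES (?)", (area,))
                conn.execute("INSERT OR IGNORE INTO dim_time(code) VALUES (?)", (period,))
                # The fact row stores only the dimension keys and the value.
                conn.execute(
                    "INSERT INTO fact_obs(area_id, time_id, value) "
                    "SELECT a.id, t.id, ? FROM dim_area a, dim_time t "
                    "WHERE a.code = ? AND t.code = ?",
                    (value, area, period),
                )
        return True
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_star_schema(conn)
    # Made-up sample values, for illustration only.
    print(load_batch(conn, [("CH", "2012", 4.2), ("CH", "2013", 4.4)]))
    # The second batch repeats an existing key, so it is rejected as a whole.
    print(load_batch(conn, [("FR", "2012", 9.8), ("CH", "2012", 4.0)]))
```

In this sketch a failed batch leaves no trace, not even the new dimension member it introduced, which mirrors the complete rollback described for the EntryGate.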
• Data warehouse
  – Microsoft SQL Server 2005 or above,
  – standard star schema data warehouse technology,
  – OECD has over 800 datasets (tables),
  – database is approximately 330 GB in size but built to be scalable,
  – able to process and handle multiple updates per hour, and all updates are synchronised,
  – maximum of 16 dimensions, with a current maximum of 7000 items per dimension for optimal performance,
  – snapshots and CSV exports provide archive functionality.

Data Exit

A single exit point serves all outputs from the Data Warehouse. It consists of a set of web services, which are used by the various dissemination tools such as the .Stat Web Browser.

• Services
  – A number of services complete the .Stat solution. These include:
    • internationally recognised standards through the Statistical Data and Metadata Exchange (SDMX) web service,
    • EntryGate, which enables the processing of XML command files to load data from production data systems through the DPI,
    • Dataset Browser, which serves up data to the OECD.Stat browser interface,
    • and Authentication, which allows for single sign-on (SSO) management of groups and dataset access permissions.
• Dissemination
  – The .Stat browser allows for:
    • tabular views of datasets,
    • pivoting functionality,
    • export to CSV, Excel, SDMX,
    • saving queries,
    • and merging queries to compare datasets.
  – The eXplorer component viewer is a graphical component that provides mapping and charting of data.

APPENDIX II

Fusion Toolsuite

All the following solutions are products developed by Metadata Technology and built with SdmxSource at the core. Each product plays a specific role in supporting data, structural metadata, or reference metadata. Each can operate on its own, or they can work together to provide an integrated solution.

Fusion Matrix

Functional Role

Figure 1 – Static mock-up of the Fusion Matrix user interface

The Fusion Matrix is a dissemination database.
The database tables are generated automatically from the Data Structure Definition. The tables are indexed and optimised for data retrieval. On data import, the data are validated against the DSD and any Constraints defined (i.e. a subset of valid codes or keys). The resulting table structure is very simple, with only two tables for each Data Structure.

The Fusion Matrix offers SDMX-compliant web services that allow external clients to query the database directly. The Fusion Matrix provides a user interface to view the data and related metadata. The Fusion Browser Excel plugin integrates with the Fusion Matrix, allowing users to query the database directly from within Excel.

The Fusion Matrix currently does not have a back-office user interface. All data is imported or deleted through the command line tool. Data can be imported in any SDMX format and in CSV format.

The Fusion Matrix can make use of the Fusion Registry to obtain the structural metadata (codelists, concepts, data structures), and of Fusion Scribe to obtain the reference metadata.

The development plan for Fusion Matrix includes the addition of a back-office GUI that allows users to insert, amend and delete data directly in the database. The plan also includes allowing the user to identify observation(s) or keys that require metadata using the Fusion Matrix UI. In addition, the generation of pdf output is in the plan schedule.

Technical Requirements

There is no installation required. Fusion Matrix consists of a command line tool to perform imports and a web application for dissemination. Fusion Matrix requires the following software to be available (all of which are freely available):

• MySQL database
• Apache Tomcat web server

Fusion Registry

Functional Role

The Fusion Registry is a structural metadata maintenance and dissemination application. It provides web services for external applications to obtain structures.
Fusion Matrix does not require a Fusion Registry to be present; it just requires structural metadata. Structural metadata can be provided via any SDMX-conformant web service, including the Fusion Registry, or via an SDMX file.

Technical Requirements

There is no installation required. The Fusion Registry is a single web application, which is deployed to Apache Tomcat. Fusion Registry requires the following software to be available (all of which are freely available):

• MySQL, Oracle or SQL Server database
• Apache Tomcat web server

Fusion Scribe

Functional Role

The Fusion Scribe is a reference metadata repository, which allows the authoring and dissemination of reference metadata. Fusion Matrix does not require Fusion Scribe; however, if it has access to Fusion Scribe, it will integrate with it. This means the Fusion Matrix can enrich the data responses with additional metadata from Fusion Scribe.

Technical Requirements

There is no installation required. The Fusion Scribe consists of a web application for metadata maintenance and one for dissemination; both are deployed to Apache Tomcat. Fusion Scribe requires the following software to be available (all of which are freely available):

• MongoDB database
• Apache Tomcat web server

Fusion Browser

Functional Role

Figure 2 – Fusion Browser: navigating the Dataflows (called topics)

Figure 3 – On selecting a Dataflow (topic), the user selects values per dimension. Available choices are updated on click to prevent the user from making an invalid combination of selections

Figure 4 – Showing the results of a data query. The user is presented with additional tools to chart, pivot, and perform arithmetic on the data

Fusion Browser is an Excel plugin for Excel 2007 and upwards. It adds a new option to the Excel Ribbon, which allows users to browse data from configurable data sources. Fusion Matrix has been enabled to support the Fusion Browser.
The Fusion Browser is an open source product from Metadata Technology, and the Beta release is imminent.

Technical Requirements

The Fusion Browser does not require installation and, other than Excel, has no software requirements.

Benefits of the Solution to the Reporting Agency

The use of the Fusion Matrix integrated solution provides the reporting agency with:

• An easy-to-use and performant database for permanent storage that can be used for data reporting and data visualisation, supporting multiple statistical domains
• The ability to report data in a standardised format
• The ability to query and visualise data using an Excel add-in
• An easy-to-use metadata authoring and reporting tool that automatically integrates metadata with the data that is queried

The benefits to the reporting agency of using the Fusion Matrix integrated solution are:

1. The need for technical resources is minimal, because the toolset is integrated and has minimal installation requirements.
2. The solution can be used for statistical data storage, reporting and visualisation of any type of data, as the toolset is not specific to any particular statistical domain.
3. Interfacing to the toolset is via software already installed on the user’s desktop (web browser, Excel), with which the user is well acquainted.
4. It enables the organisation to better manage its statistical resources by combining the major aspects of statistical data and metadata storage, visualisation and dissemination.

Additional Tools

Introduction

There will be reporting organisations that wish to plug into the SDMX world (and so take advantage of other SDMX tools) using an existing database, but that do not have the resources to learn the SDMX standards. The way this can be achieved is covered in this section.

A possible way for a data collection organisation to collect data is to implement a web service which allows users to upload datasets.
Such a service will probably need sufficient security to ensure that the data it collects comes from validated data reporters and that the data itself is compliant. This topic is also covered briefly in this section.

Organisations with an Existing Database

...and which wish to use an easy way to retrieve and format data for reporting or dissemination

Solution

The SdmxSource has a variety of components that can be packaged into a “data access” utility comprising two major components:

• Query reader – this makes it easy for an organisation to process an SDMX query without knowing the syntax of SDMX: it just needs to understand that there are dimensions, each of which has one or more query values, and the time range queried. The organisation then needs to write the SQL code that will run the query on its database.
• Data writer – this makes it easy to write the data returned by the SQL query without knowing the syntax of SDMX: it just needs to understand that there are series keys, observations, time periods, and attributes. There are many implementations of the data writer in SdmxSource, so a variety of output formats are available, and it is very easy to write an output for a different format. It does not have to be an SDMX format (indeed, a CSV format is already available, and formats for packages such as R and for RDF (using the Data Cube Vocabulary) are under development).

Benefits to the Reporting Agency

• Ability to report data in a standardised format with minimal effort and no need to learn SDMX
• When coupled with the Fusion Registry, the Excel plugin can be used to query and visualise the data

Data Upload to a Central Location

Data collection can be streamlined by providing a data upload facility to reporting organisations. Such facilities are easy to build using the SdmxSource and simple to use. Note that such files can be zipped automatically to minimise size in the transfer.
Validation and visualisation components can be integrated easily into such a collection system.

APPENDIX III

Data Loader

Scenario

The Data Loader is an application for loading data into a repository that interacts with the SDMX-RI for dissemination through SDMX web services.

Figure: the data flow in the Data Loader application

As shown in the figure above, the Loader application creates and populates a dissemination database that interacts with the Mapping Store database, from which it retrieves the structural metadata necessary to upload the data. Through this interaction, the user loads the DSDs and dataflows that are needed into the Loader database. This choice, although it creates a duplication of structural metadata (Structural Tables in the figure), was made for two reasons:

• better performance during the loading of data
• more complete information stored in the Loader database

After this loading, the user can create the data tables (Data Tables in the figure). These data tables depend on the DSD to which they refer.

The next step for the user is to load a data file. It is recommended that the data in the data file be described using the structural metadata previously loaded. The data file to be loaded can be in any of the following formats: SDMX (compact, generic and cross-sectional), fixed length record (flr), GESMES/TS and character separated values (csv). Loading a data file in fixed length record, GESMES/TS or csv format requires a preliminary operation that tells the application exactly how the data is to be read (a mapping). This mapping is not necessary for SDMX data files.

After the data has been loaded, it is still necessary (in this version) to map the data using the “Mapping assistant” to make it available through the SDMX-RI web service.
The same application can also create static data files in SDMX format, and the related RSS file, for organizations that do not want to use a web service for the dissemination of data. The application also makes it possible to define different users with different levels of security.
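The mapping step that this appendix describes for csv files can be sketched as follows. This is an illustrative Python sketch and not the Data Loader's actual implementation: the concept identifiers (REF_AREA, SEX, TIME_PERIOD, OBS_VALUE), the column names and the sample values are assumptions chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping: DSD concept id -> column header in the csv file.
MAPPING = {
    "REF_AREA": "country",
    "SEX": "sex",
    "TIME_PERIOD": "year",
    "OBS_VALUE": "value",
}
# Dimensions that form the series key, in the order assumed for the DSD.
DIMENSIONS = ["REF_AREA", "SEX"]


def read_series(csv_text, mapping=MAPPING, dimensions=DIMENSIONS):
    """Group csv rows into series: {series key tuple: {time period: value}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Refuse files that do not contain every mapped column.
    missing = set(mapping.values()) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"csv file lacks mapped columns: {sorted(missing)}")
    series = defaultdict(dict)
    for row in reader:
        # The series key is built from the dimension columns only.
        key = tuple(row[mapping[d]] for d in dimensions)
        period = row[mapping["TIME_PERIOD"]]
        series[key][period] = float(row[mapping["OBS_VALUE"]])
    return dict(series)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = (
        "country,sex,year,value\n"
        "CH,F,2012,4.5\n"
        "CH,F,2013,4.25\n"
        "CH,M,2012,4.0\n"
    )
    for key, observations in read_series(sample).items():
        print(key, observations)
```

Keeping the mapping in a plain dictionary means that the same reading code can serve files with different column layouts; only the mapping changes from one data provider to another.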