S-DWH Modular Workflow
WP: 3, Deliverable: 3.2
Version: 4.1, February 2013
NSIs: Statistics Estonia, Istat, Statistics Lithuania
Authors: Allan Randlepp, Antonio Laureti Palma and Valerij Žavoronok

ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics

Version 1.0, February 25, 2013: Allan Randlepp
Version 2.0, February 27, 2013: Allan Randlepp
Version 3.0, March 1, 2013: Antonio Laureti Palma
Version 4.0, March 4, 2013: Allan Randlepp
Version 4.1, June 17, 2013: Valerij Žavoronok

Summary
1 Introduction
2 Statistical production models
2.1 Stovepipe model
2.2 Integrated model
2.3 Warehouse approach
3 Integrated Warehouse model
3.1 Technical platform integration
3.2 Process integration
3.3 Warehouse – reuse of data
4 S-DWH as layered modular system
4.1 Layered architecture
4.2 CORE services and reuse of components
5 Conclusion
6 References

1 Introduction

A statistical system is a complex system of data collection, data processing, statistical analysis and so on. The following figure (from Sundgren (2004)) shows a statistical system as a precisely defined, man-designed system that measures external reality. The planning and control system in the figure corresponds to phases 1–3 and 9 in GSBPM notation, and the statistical production system in the figure corresponds to phases 4–8 in GSBPM. This is a general, standardized view of a statistical system; it can represent one survey, a whole statistical office or even an international organization.

How such a system is built up and organized in real life varies greatly. Some implementations of statistical systems have worked quite well so far, others less well. The local environments of statistical systems differ somewhat, but the big changes in the environment are increasingly global. It no longer matters how well a system has performed so far: some global changes in the environment are so large that every system has to adapt and change.

This paper presents the strengths and weaknesses of the main statistical production models, based on W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009), “Terminology Relating To The Implementation Of The Vision On The Production Method Of EU Statistics”. This is followed by a proposal on how to combine the integrated production model and the warehouse approach.
Finally, an overview is provided of how the layered architecture of a statistical warehouse gives modularity to the statistical system as a whole.

2 Statistical production models

2.1 Stovepipe model

Today's prevalent production model in statistical systems is the stovepipe model. It is the outcome of a historic process in which statistics in individual domains have developed independently. In the stovepipe model, a statistical action or survey is independent from other actions in almost every phase of the statistical production value chain.

Advantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)):
1. The production processes are best adapted to the corresponding products.
2. It is flexible in that it can adapt quickly to relatively minor changes in the underlying phenomena that the data describe.
3. It is under the control of the domain manager and it results in a low-risk business architecture, as a problem in one of the production processes should normally not affect the rest of the production.

Disadvantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)):
1. It may impose an unnecessary burden on respondents when the collection of data is conducted in an uncoordinated manner and respondents are asked for the same information more than once.
2. The stovepipe model is not well adapted to collecting data on phenomena that cover multiple dimensions, such as globalization, sustainability or climate change.
3. Last but not least, this way of production is inefficient and costly, as it does not make use of standardization between areas and collaboration between the Member States. Redundancies and duplication of work, be it in development, in production or in dissemination processes, are unavoidable in the stovepipe model.

The stovepipe model is the dominant model in the ESS and is reproduced at Eurostat level as well, where it is called the augmented stovepipe model.
2.2 Integrated model

The integrated model is a new and innovative way of producing statistics. It is based on the combination of various data sources. This integration can be horizontal or vertical.

1. “Horizontal integration across statistical domains at the level of National Statistical Institutes and Eurostat. Horizontal integration means that European statistics are no longer produced domain by domain and source by source but in an integrated fashion, combining the individual characteristics of different domains/sources in the process of compiling statistics at an early stage, for example households or business surveys.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
2. “Vertical integration covering both the national and EU levels. Vertical integration should be understood as the smooth and synchronized operation of information flows at national and ESS levels, free of obstacles from the sources (respondents or administration) to the final product (data or metadata). Vertical integration consists of two elements: joint structures, tools and processes and the so-called European approach to statistics.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

The integrated model was created to avoid the disadvantages of the stovepipe model: the burden on respondents, the unsuitability for surveying multi-dimensional phenomena, and the inefficiencies and high costs. “By integrating data sets and combining data from different sources (including administrative sources) the various disadvantages of the stovepipe model could be avoided. This new approach would improve efficiency by elimination of unnecessary variation and duplication of work and create free capacities for upcoming information needs.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

Moving from the stovepipe model to the integrated model is not an easy task at all.
In his answer to the UNSC about the draft of the paper “Guidelines on Integrated Economic Statistics”, W. Radermacher writes: “To go from a conceptually integrated system such as the SNA to a practically integrated system is a long term project and will demand integration in the production of primary statistics. This is the priority objective that Eurostat has given to the European Statistical System through its 2009 Communication to the European Parliament and the European Council on the production method of the EU statistics ("a vision for the new decade").”

The Sponsorship on Standardisation, a strategic task force in the European Statistical System, has compared the traditional and the integrated approach to statistical production. It concludes that “in the current situation, it is clearly shown that there are high level risks and low level opportunities” and that “the full integration situation is more balanced than the current situation, and the most interesting point is that risks are mitigated and opportunities exploded.” (The Sponsorship on Standardisation (2013)) It therefore seems strategically wise to move away from stovepipes and partly integrated statistical systems toward fully integrated statistical production systems.

2.3 Warehouse approach

In addition to the stovepipe model, the augmented stovepipe model and the integrated model, W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) also describe the warehouse approach: “The warehouse approach provides the means to store data once, but use it for multiple purposes. A data warehouse treats information as a reusable asset. Its underlying data model is not specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented design, the underlying repository design is modelled based on data inter-relationships that are fundamental to the organisation across processes.”

[Figure: Conceptual model of data warehousing in the ESS (European Statistical System) (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))]

“Based on this approach statistics for specific domains should not be produced independently from each other, but as integrated parts of comprehensive production systems, called data warehouses. A data warehouse can be defined as a central repository (or "storehouse") for data collected via various channels.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

3 Integrated Warehouse model

The Integrated Warehouse model combines the integrated model and the warehouse approach into one model. To have an integrated, warehouse-centric statistical production system, different statistical domains need more methodological consistency and must share common tools and a distributed architecture. First we look at integration, then at the warehouse, and then we combine both into one model.

3.1 Technical platform integration

Let us look at a classical production system and try to find the key integration points where statistical activities meet each other. A classical stovepipe statistical system looks like this:

Let us begin the integration of the platform from the end of the production system. Every well-integrated statistical production system has main dissemination databases where all detailed statistics are published: one for in-house use and the other for public use. To produce rich and integrated output, especially cross-domain output, we need a warehouse where data are stored once but can be used for multiple purposes. Such a warehouse should sit between the process and analyse phases. And of course there should be a raw database. Depending on the specific tools used or other circumstances, one may have more than one raw database, warehouse or dissemination database, but fewer is better. For example, Statistics Estonia has three integrated raw databases. The first is a web-based tool for collecting data from enterprises. The second is a data collection system for social surveys.
And the third one is for administrative and other data sources.

But this is not all; let us look at the planning and design phases. Descriptions of all statistical actions, all classifications in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on: all these meta-objects should be collected during the design and build phases into one metadata repository. And the needs of clients should be stored in a central CRM database.

These are the main integration points at the database level, but this is nothing new or revolutionary. In addition, software tools could be shared between statistical actions. How many data collection systems do we need? How many processing or dissemination tools do we need, both at the local and the international level? Do we need different processing software for every statistical action, or for every statistical office?

This kind of technological integration at the database and software level is important and not an easy task, but it is not enough. We must go deeper into the processes and find ways to standardize sub-processes and methods. One way to go deeper into the process is to look at the variables in each statistical activity.

3.2 Process integration

“Integration should address all stages of the production process, from design of the collection system to the compilation and dissemination of data.” (W. Radermacher (2011)) Each statistical action designs its samples and questionnaires according to its own needs, uses variations of classifications as needed, selects data sources according to the needs of the action, and so on. A statistical system contains a number of statistical actions, and each action collects some input variables and produces some output variables. One way to find common ground between different statistical actions and sources is to focus on variables, especially input variables, because data collection and processing are the most costly phases of statistical production.
Standardizing these phases yields the fastest and biggest savings. Output variables will be standardized under the SDMX initiative. Statistical actions should collect unique input variables, not just the rows and columns of tables in a questionnaire. Each input variable should be collected and processed once in each period of time, and in such a way that the outcome, the input variable in the warehouse, can be used to produce various different outputs.

This variable-centric focus triggers changes in almost all phases of the statistical production process. Samples, questionnaires, processing rules, imputation methods, data sources and so on must be designed and built in compliance with standardized input variables, not according to the needs of one specific statistical action. A variable-based statistical production system reduces the administrative burden, lowers the cost of data collection and processing, and makes it possible to produce richer statistical output faster. Of course, this holds within the boundaries of the standardized design. If there is a need for a special survey, one can design one's own sample, questionnaire and so on, but then this is a separate project with its own price tag. Producing regular statistics this way is not reasonable.

3.3 Warehouse – reuse of data

To organize the reuse of already collected and processed data in the statistical production system, the boundaries of statistical actions must be removed. What remains if statistical actions are removed? Statistical actions are collections of input and output variables, processing methods, etc. When we talk about data and reuse, we are interested in variables, in samples or the estimation frame, and in the timing of surveys.

The following figure represents a typical scenario with two surveys and one administrative data source. Survey 1 collects two input variables, A and B, with questionnaires and may use the variable B’ from the administrative source.
Survey 1 analyses variables A and B*, where B* is either B from the questionnaire or B’ imputed from the administrative source. Survey 2 collects variables C and D and analyses B’, C* and D. This is a statistical-action-based stovepipe model. In this case it is hard to reuse data in the interpretation layer, because the imputation choices made in the integration layer for B* and C* are made “locally”, and the interpretation layer contains several similar variables, like B* and B’. Also, the samples of Survey 1 and Survey 2 may not be coherent, which means that a third survey, wanting to analyse variables A, B’ and D in the interpretation layer without collecting them again, faces problems of coherence and sampling.

To solve the problem we should invest some time and effort in planning and preparing Surveys 1 and 2 so that they are coherent in a unique, integrated, variable- and sampling-centric warehouse. In addition to analysing data and generating output cubes, the interpretation layer can be used to access the production data. In the interpretation layer statisticians can plan and prepare Surveys 1 and 2 by coordinating surveys and archives for a common estimation frame and by defining unique variables. The information gained during this phase is the basis for developing and tuning regular production processes in the integration layer. This means that a coherent approach can be used if statisticians plan their actions following a logical hierarchy of variable estimation in a common frame. What the IT must then support is an adequate environment for designing this strategy.

Then, according to a common strategy, the two surveys which collect data with questionnaires and the one administrative data source serve as examples. But this time the design-phase decisions, such as the design of the questionnaire, sample selection and imputation method, are made “globally”, in view of the interests of all three surveys. In this way, the integration of processes gives us reusable data in the warehouse.
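To make the variable-centric idea concrete, the coordinated scenario above can be sketched in a few lines of code. This is an illustrative sketch only, assuming a simple key-value store; the class and method names are hypothetical and not part of any S-DWH specification.

```python
# Hypothetical sketch: a warehouse holding each input variable exactly
# once per reference period, so that later surveys reuse it instead of
# collecting it again.

class Warehouse:
    def __init__(self):
        self._store = {}  # (variable, period) -> values

    def load(self, variable, period, values):
        """Load a variable once per period; a second load is a design error."""
        key = (variable, period)
        if key in self._store:
            raise ValueError(f"{variable} already collected for {period}; reuse it")
        self._store[key] = values

    def get(self, variable, period):
        """Any subsequent statistical action reuses the stored variable."""
        return self._store[(variable, period)]

wh = Warehouse()
# Survey 1 loads A and B* once; Survey 2 loads C* and D.
wh.load("A", "2013Q1", [10, 20])
wh.load("B*", "2013Q1", [1, 2])
wh.load("C*", "2013Q1", [5, 6])
wh.load("D", "2013Q1", [7, 8])

# A third survey analyses A, B* and D without collecting anything.
reused = {v: wh.get(v, "2013Q1") for v in ("A", "B*", "D")}
```

The point of the `load` guard is exactly the design rule above: a variable collected twice in the same period signals uncoordinated surveys, not a richer warehouse.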
Our warehouse now contains each variable only once, making it much easier to reuse and manage our valuable data.

Another way of reusing data already in the warehouse is to calculate new variables. The following figure illustrates a scenario where a new variable E is calculated from variables C* and D, which are already loaded into the warehouse. This means that data can be moved back from the warehouse to the integration layer. Warehouse data can be used in the integration layer for multiple purposes; calculating new variables is only one example.

An integrated, variable-based warehouse opens the way to new, subsequent statistical actions that do not have to collect and process data and can produce statistics straight from the warehouse. By skipping the collection and processing phases, new statistics and analyses can be produced very fast and much more cheaply than with a classical survey. Designing and building a statistical production system according to the Integrated Warehouse model initially takes more time and effort than building a stovepipe model. But the maintenance costs of an integrated warehouse system should be lower, and the new products that can be produced faster and more cheaply to meet changing needs should soon compensate for the initial investment.

4 S-DWH as layered modular system

4.1 Layered architecture

In a generic S-DWH system we identify four functional layers in which we group functionalities. The ground level corresponds to the area where external sources come in and are interfaced, while the top of the pile is where produced data are published to an external user or system. In the intermediate layers we manage the ETL functions of the DWH, in which coherence analysis, data mining, and the design of possible new strategies or data reuse are carried out.
Specifically, from the top to the bottom of the architectural pile, we define:

IV – access layer, for the final presentation, dissemination and delivery of the information sought;
III – interpretation and data analysis layer, specifically for statisticians, enabling data analysis, data mining and support for designing production processes or data reuse;
II – integration layer, where all operational activities needed for any statistical production process are carried out;
I – source layer, the level in which we locate all the activities related to storing and managing internal or external data sources.

The S-DWH layers are in a specific order, and data go through the layers without skipping any of them. It is impossible to use data directly from a non-adjacent layer: if data are needed, they have to be moved to the layer where they are needed, and they can only be moved between neighbouring layers. For example, to publish data in the access layer, raw data need to be collected into a raw database in the source layer, then extracted into the integration layer for processing, then loaded into the warehouse in the interpretation layer; after that someone can calculate statistics or make an analysis and publish it in the access layer.

Another example: sometimes it is necessary to monitor the collection process and analyse the raw data during collection. The raw data are then extracted from the collection raw database and either processed in the integration layer, so that they can easily be analysed with the specific tools in use for operational activities, or loaded into the interpretation layer, where they can be freely analysed. This process is repeated as often as needed, for example once a day, once a week or hourly.
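The rule that data may only move between neighbouring layers can be sketched as follows. This is a minimal illustration of the constraint, assuming the four layers above; the function name and data representation are hypothetical.

```python
# Sketch of the S-DWH layer-adjacency rule: data move one layer at a
# time, never skipping a layer.

LAYERS = ["source", "integration", "interpretation", "access"]

def move(data, from_layer, to_layer):
    """Move data one layer up or down; skipping layers is not allowed."""
    i, j = LAYERS.index(from_layer), LAYERS.index(to_layer)
    if abs(i - j) != 1:
        raise ValueError(f"cannot move directly from {from_layer} to {to_layer}")
    return dict(data, layer=to_layer)

# Publishing raw data therefore takes three hops through the pile:
d = {"values": [1, 2, 3], "layer": "source"}
for nxt in ["integration", "interpretation", "access"]:
    d = move(d, d["layer"], nxt)
```

Attempting `move(d, "source", "interpretation")` would fail, which is exactly the architectural constraint described in the publishing example above.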
Another example: when we receive a new dataset, it is loaded by the integration layer from the source to the interpretation layer, where statisticians can make their source evaluation or, due to changes in administrative regulations, define new variables or new process updates for existing production processes. It is worth noting that such updates must be included in a coherent S-DWH by proper metadata.

4.2 CORE services and reuse of components

There are three main groups of workflows in an S-DWH. One workflow updates data in the warehouse, the second updates the in-house dissemination database and the third updates the public dissemination database. These three automated data flows are quite independent from each other.

Flow 1 is the biggest and most complex. It extracts raw data from the source layer, processes them in the integration layer and loads them into the interpretation layer. In the other direction, it brings cleansed data back to the source layer for prefilling questionnaires, prepares sample data for collection systems, etc. Let us name this flow the processing flow. Flows 2 and 3 are very similar: both generate standard output to a dissemination database. One updates data in the in-house dissemination database and the other in the public database. Both are unidirectional flows. Let us call Flow 2 generate cube and Flow 3 publish cube. In this context a cube is a multidimensional table, for example a .Stat or PC-Axis table.

Processing flows should be built around input variables or groups of input variables to feed the variable-based warehouse. Generate and publish cube flows are built around cubes, i.e. each flow generates or publishes a cube. There are many software tools available to build these modular flows. The layered architecture of the S-DWH itself provides the possibility to use different platforms and software in separate layers, i.e. to reuse components already available in-house or internationally.
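As a rough illustration of what a generate cube flow does, the following sketch aggregates warehouse microdata into a multidimensional table keyed by dimension values. The record fields, dimension names and function name are hypothetical, chosen only to show the shape of the operation.

```python
# Illustrative "generate cube" step: one cell per combination of
# dimension values, summing a measure over the microdata.

from collections import defaultdict

def generate_cube(records, dimensions, measure):
    """Aggregate records into a cube keyed by the dimension tuple."""
    cube = defaultdict(float)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        cube[key] += rec[measure]
    return dict(cube)

records = [
    {"region": "EE", "nace": "C", "turnover": 100.0},
    {"region": "EE", "nace": "C", "turnover": 50.0},
    {"region": "LT", "nace": "C", "turnover": 70.0},
]
cube = generate_cube(records, ["region", "nace"], "turnover")
# cube: {("EE", "C"): 150.0, ("LT", "C"): 70.0}
```

A publish cube flow would then push such a table, unchanged, into a .Stat or PC-Axis dissemination database.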
In addition, different software can be used inside the same layer to build up one particular flow. Problems arise when we try to use these different modules and different data formats together. If CORE services are used to move data between the S-DWH layers, and also inside the layers between different sub-tasks (e.g. edit, impute, etc.), then it is easier to use software provided by the statistical community or to reuse self-developed components to build flows for different purposes.

Generally, CORE (COmmon Reference Environment) is an environment supporting the definition of statistical processes and their automated execution. CORE processes are designed in a standard way, starting from available services; specifically, a process definition is provided in terms of abstract statistical services that can be mapped to specific IT tools. CORE goes in the direction of fostering the sharing of tools among NSIs. Indeed, a tool developed by a specific NSI can be wrapped according to CORE principles, and thus easily integrated within a statistical process of another NSI. Moreover, having a single environment for the execution of all statistical processes provides a high level of automation and complete reproducibility of process executions.

NSIs produce official statistics with very similar goals, hence several activities related to the production of statistics are common. Nevertheless, such activities are currently carried out independently, without relying on shared solutions. Sharing a common architecture would result in a reduction of costs due to duplicated activities and, at the same time, in an improvement of the quality of the produced statistics, due to the adoption of standardized solutions. The main principles underlying the CORA design are:

a) Platform Independence.
NSIs use various platforms (hardware, operating systems, database management systems, statistical software, etc.), hence an architecture is bound to fail if it endeavours to impose standards at a technical level. Moreover, platform independence makes it possible to model statistical processes at a “conceptual level”, so that they do not need to be modified when the implementation of a service changes.

b) Service Orientation. The vision is that the production of statistics takes place through services calling other services; services are thus the modular building blocks of the architecture. By having clear communication interfaces, services implement principles of modern software engineering such as encapsulation and modularity.

c) Layered Approach. According to this principle, some services are rich and are positioned at the top of the statistical process: for instance, a publishing service requires the output of all sorts of services positioned earlier in the statistical process, such as collecting data and storing information. The ambition of this model is to bridge the whole range of layers from collection to publication by describing all layers in terms of services delivered to a higher layer, in such a way that each layer depends only on the layer directly below it.
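The platform-independence and service-orientation principles can be sketched together: a process is defined over abstract service names, and a registry maps each name to a concrete tool, so swapping an implementation does not change the process definition. The service names and tool functions below are hypothetical, not taken from CORA.

```python
# Sketch: abstract statistical services mapped to concrete tools.
# The process definition stays at the "conceptual level"; only the
# registry changes when a tool is replaced.

def edit_with_tool_x(data):
    """Concrete 'edit' tool: drop missing values."""
    return [v for v in data if v is not None]

def impute_with_tool_y(data):
    """Concrete 'impute' tool: replace zeros with a default value."""
    return [v if v != 0 else 1 for v in data]

REGISTRY = {"edit": edit_with_tool_x, "impute": impute_with_tool_y}

def run_process(process, data):
    """Execute an abstract process by resolving each service name."""
    for service_name in process:
        data = REGISTRY[service_name](data)
    return data

result = run_process(["edit", "impute"], [3, None, 0, 5])
# result: [3, 1, 5]
```

Replacing `edit_with_tool_x` with another NSI's editing tool requires only a new registry entry; the process definition `["edit", "impute"]` is untouched, which is the point of modelling at the conceptual level.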
For us it is very important to make transitions and mappings between different models and approaches. Unfortunately, mapping a CORE process to a business model directly is not possible, because the CORE model is an information model and there is no direct way to map a business model onto an information model: the two models are about different things. They can only be connected if the connection is itself part of the models. The CORE information model was designed with such a mapping in mind. Within this model a statistical service is an object, and one of its attributes is a reference to its GSBPM process. Considering the GSBPM a business model, any mapping of the CORE model onto a business model has to go through this reference to the GSBPM.

Usually different services, each with its own tools, expect different data formats, so service interactions need conversions, and conversions are expensive. By using CORE services for interactions, the number of conversions can be reduced noticeably. In a general sense, an integration API makes it possible to wrap a tool in order to make it CORE-compliant, i.e. a CORE executable service. A CORE service is indeed composed of an inner part, which is the tool to be wrapped, and of input and output integration APIs. Such APIs transform from/to the CORE model into the tool-specific format. As anticipated, CORE mappings are designed for classes of tools, and hence the integration APIs should support the admitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE and CORE-to-Relational, etc. Basically, the integration API consists of a set of transformation components.
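The wrapping just described can be sketched as follows. Note the heavy simplification: the "CORE model" is reduced here to a list of dictionaries and the transformations to plain CSV text; the real CORE mappings are far richer, so every name and format in this sketch is an assumption for illustration only.

```python
# Hedged sketch of a CORE-compliant service: an inner CSV-based tool
# wrapped by input/output transformation components.

import csv, io

def transform_from_core(core_data):
    """CORE-to-CSV: render the (simplified) CORE representation as CSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=core_data[0].keys())
    writer.writeheader()
    writer.writerows(core_data)
    return buf.getvalue()

def transform_to_core(csv_text):
    """CSV-to-CORE: parse the tool's CSV output back into CORE form."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def core_service(inner_tool):
    """Wrap a CSV-based tool so it consumes and produces CORE data."""
    def wrapped(core_data):
        return transform_to_core(inner_tool(transform_from_core(core_data)))
    return wrapped

def uppercase_name(csv_text):
    """A trivial inner tool, working purely on CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for r in rows:
        r["name"] = r["name"].upper()
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=rows[0].keys())
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue()

service = core_service(uppercase_name)
out = service([{"id": "1", "name": "tallinn"}])
```

The inner tool never sees CORE data at all; the two transformation components do all the format work, which is why a tool wrapped this way can be dropped into another NSI's process.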
Each transformation component corresponds to a specific data format, and the principal elements of their design are specific mapping files, description files and transform operations. In order to provide input to a tool (the inner part of a CORE service), the Transform-from-CORE operation is invoked. Conversely, the output of the tool is converted by the Transform-to-CORE operation. For each single input or output file a transformation must be launched. In this way the reuse of components can be performed in an easy and efficient way.

5 Conclusion

Today the prevalent model for producing statistics is the stovepipe model, but there are also the integrated model and the warehouse approach. Apart from the unavoidable statistical methodological strategy needed for defining common statistical variable definitions, common reference samples and a common estimation frame, in this paper the integrated model and the warehouse approach were put together. Integration can be looked at from three main viewpoints:

1. Technical integration – integrating IT platforms and software tools.
2. Process integration – integrating statistical processes like survey design, sample selection, data processing, etc.
3. Data integration – data are stored once but used for multiple purposes.

When we put all three integration aspects together, we get the S-DWH, which is built on integrated technology, uses integrated processes to produce statistics and reuses data efficiently. At the same time, the S-DWH environment can be used to manage changes in the statistical production process through the interpretation layer, on the same data, which can support new statistical processing strategies able to refine the S-DWH itself.

6 References

B. Sundgren (2010) “The Systems Approach to Official Statistics”, Official Statistics in Honour of Daniel Thorburn, pp. 225–260. Available at: https://sites.google.com/site/bosundgren/mylife/Thorburnbokkap18Sundgren.pdf?attredirects=0

W. Radermacher (2011) “Global consultation on the draft Guidelines on Integrated Economic Statistics”.

UNSC (2012) “Guidelines on Integrated Economic Statistics”. Available at: http://unstats.un.org/unsd/statcom/doc12/RD-IntegratedEcoStats.pdf

The Sponsorship on Standardisation (2013) “Standardisation in the European Statistical System”.

W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) “Terminology Relating To The Implementation Of The Vision On The Production Method Of EU Statistics”. Available at: http://ec.europa.eu/eurostat/ramon/coded_files/TERMS-IN-STATISTICS_version_4-0.pdf

European Union, Communication from the Commission to the European Parliament and the Council on the production method of EU statistics: a vision for the next decade, COM(2009) 404 final. Available at: http://eurlex.europa.eu/LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF

ESSnet CORE (COmmon Reference Environment). Available at: http://www.cros-portal.eu/content/core-0