SDMX STATISTICAL CAPACITY BUILDING
GUIDELINES FOR DESIGNING DATA STRUCTURE DEFINITIONS (DSDs)
SDMX Global Conference 2013, Paris
“Global SDMX Implementation: Modernising Official Statistics”

Overview
• Design principles
• Exchange context
• Design process
• Data structuring approaches
• DSD analysis: STES as example

Design Principles
Structural
• Parsimony
• Simplicity
• Purity
• Unambiguousness
• Exhaustiveness
• Orthogonality
Other
• Re-use of existing artefacts
• Flexibility and future needs
• Fitness for use throughout statistical business process
• User-friendliness

Data Exchange Context
• Single- or multi-domain
• Single- or multi-purpose
• Type of data
• Human or machine as recipient
• Level of data exchange
• Role in data exchange
• Process pattern
• Phase of statistical process

Design Process
1. Specify context
2. Identify relevant existing DSDs
   • available: continue with step 3
   • not available: define new DSDs (step 4.3)
3. Check DSD suitability
   • partly suitable: 4.1. Define modified DSDs
   • suitable: 4.2. Use suitable DSDs
   • not suitable: 4.3. Define new DSDs
5. Define supporting artefacts

Design Process: Define new DSDs
4.3.1. Specify concepts
4.3.2. Specify code lists
4.3.3. Specify data formats
4.3.4. Assemble DSDs
Steps 4.3.1 to 4.3.4 are iterative.

Design Process: Specify concepts
4.3.1.1. Decide structuring approach
4.3.1.2. Identify relevant existing concepts (available / not available)
4.3.1.3. Check concept suitability
   • suitable: 4.3.1.4.1. Use suitable concepts
   • not suitable: 4.3.1.4.2. Define new concepts
4.3.1.5. Define concept roles
4.3.1.6. Define groups
4.3.1.7. Define attribute attachment levels
Earlier steps are revised where needed.

Design Process: Specify code lists
4.3.2.1. Identify relevant existing code lists (available / not available)
4.3.2.2. Check code list suitability
   • suitable: 4.3.2.3.1. Use suitable code lists
   • partly suitable: 4.3.2.3.2. Define modified code lists
   • not suitable: 4.3.2.3.3. Define new code lists

Data structuring approaches
• Number and content of dimensions
• Number of DSDs
The two choices are not completely independent: a larger number of DSDs means fewer concepts and dimensions in the key.

Number and content of dimensions
Data characteristics: C1, C2, C3, C4 (Sex, Age, Sector, Employment status, …)
• Pure concepts: 1 characteristic = 1 concept (Sex; Age; Sector; …)
• Composite concepts: more characteristics = 1 concept (e.g. Sex and Age)
Wider use of composite concepts gives a lower number of dimensions.

Pure vs. composite concepts
[Figure: key Dim1.Dim2. … .DimN, with code lists Codelist1, Codelist2, …, CodelistN of lengths K1, K2, …, KN. Many pure concepts increase horizontal complexity (more dimensions in the key); few mixed concepts increase vertical complexity (longer code lists).]

Many pure concepts
• clean data structure
• flexible in terms of mappings to other data structures … may be mapped to any mixed representation
• flexible in terms of defining queries (for a skilled user)
• short and simple code lists
• long observation keys
• difficult for the end user to handle (long codes; many dimensions), but more flexible for skilled users
• special values (not applicable; total) widely used
• creates sparseness
• needs many constraints (due to sparseness)
Some of the critical points may be overcome through a different strategy in choosing the number of DSDs. More DSDs reduce sparseness and the need for constraints, and would result in shorter keys.

Strategies
All pure concepts: too many? Then there is a trade-off between:
• Composite concepts
• Many different DSDs

Composite concepts
SDMX technical notes, annex 6 (343): “Avoid composite dimensions”, but in particular contexts they may be useful, e.g. to disseminate a few key economic indicators (multi-domain).

Number of DSDs
One DSD or many DSDs?
A possible approach: master and satellite artefacts (derived via constraints)

Master DSD matrix
Data exchange scenario   Concepts
                         SC1   SC2   SC3   ……   SCm
#1                       X     X     X     X    X
#2                       X     O     X     X    X
#3                       O     X     O     O    O
……                       …     …     …     …    …
#n                       X     O     O     X    X

Master and satellite DSDs: multiple satellite DSDs
Master DSD, restricted by constraints, gives DSD1, DSD2, …, DSDn.
Multiple satellite DSDs (unique key structures)

Master and satellite DSDs: one DSD, multiple data flows
Master DSD, restricted by constraints, gives Dataflow 1, Dataflow 2, …, Dataflow n.
One DSD

Master and satellite DSDs: multiple satellite DSDs
A slightly different approach: the master DSD, after dropping the dimensions that are not relevant, gives DSD1, DSD2, …, DSDn.
Multiple satellite DSDs (multiple key structures)

Example: Short-term Economic Statistics
DSD: Dimensions and attributes
CONCEPT          DESCRIPTION                                            CODE LIST ID
SUBJECT          Subject matter                                         CL_SUBJECT
MEASURE          Quantitative variable value                            CL_MEASURE
FREQ             Periodicity                                            CL_FREQ
REFERENCE_AREA   “Reference area” and/or “Counterpart area”             CL_AREA
ADJUSTMENT       Seasonal adjustment                                    CL_ADJUSTMENT
UNIT             Generic list with code values                          CL_UNIT
TIME_PERIOD      Defines the observation period

ATTRIBUTES       DESCRIPTION                                            CODE LIST ID
UNIT_MULT        Indicating the magnitude in the units of measurements  CL_UNIT_MULT
OBS_STATUS       The observation status                                 CL_OBS_STATUS

DSD analysis: Design principles
• Reuse of existing code lists and future needs: Adjustment, frequency, reference area, subject matter.
• Parsimony, simplicity, density: The DSD is not redundant and has a small number of dimensions. The DSD provides data for most of the cells.
• Purity: In this case the code list CL_UNIT is not pure, but it adds to simplicity.
• Unambiguousness and orthogonality: The code list CL_MEASURE seems to be ambiguous, and CL_UNIT and CL_MEASURE show overlaps.
• Exhaustiveness: It is possible to identify all data in the flow.

DSD analysis: Dimensions
• The DSD includes the dimension MEASURE to differentiate the indicators expressed as an index number from the rest.
• This item was added to the DSD as an independent dimension although, by its nature, it could be incorporated into the UNIT dimension (code list CL_UNIT).
• In the code list of the UNIT dimension, codes of different nature were included:
   • Physical unit measures
   • Monetary units
   • Several base periods for index numbers

CL_MEASURE
Description: A summary (means, mode, total, index, etc.) of the individual quantitative variable values for the statistical units in a specific group (study domains).
Code   Description
ST     Number, rate, value
IXNB   Index

DSD analysis: Code lists
CL_UNIT
Description: Generic list with code values (including currency, base period, measures)
Code      Description
1995100   1995=100
2000100   2000=100
2003100   2003=100
2005100   2005=100
2008100   2008=100
2010100   2010=100
AUD       Australian Dollar
BPA       Barrels per day
BPM       Barrels per month
BRL       Brazilian Real
CAD       Canadian Dollar
CHF       Swiss Franc
CLP       Chilean Peso
CNY       Yuan Renminbi
CZK       Czech Koruna
DKK       Danish Krone
DW        Dwellings
EUR       Euro
GBP       Pound Sterling
GWH       Gigawatt hour
HUF       Forint
IDR       Rupiah
ILS       New Israeli Sheqel
INR       Indian Rupee
ISK       Iceland Krona
JB        Jobs

DSD analysis: Suggestions
• Eliminate the MEASURE dimension.
• Add to CL_UNIT the code IXNB = Index number, so that indicators expressed as indices can be identified.
• Eliminate from CL_UNIT the codes for base period.
• Create a new concept to specify the base period, with its own code list / format.
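
The suggestions above can be illustrated with a minimal Python sketch that rewrites one series key from the current STES structure into the suggested one. Several names are assumptions, not part of the slides: the concept ID BASE_PER for the new base-period concept, the helper restructure, the use of the bare year as base-period value, and the placeholder values for SUBJECT, FREQ and REFERENCE_AREA. Only the CL_UNIT and CL_MEASURE codes are taken from the code lists shown earlier.

```python
import re

# Base-period codes currently mixed into CL_UNIT look like "2005100" (2005=100).
BASE_PERIOD_CODE = re.compile(r"(\d{4})100")


def restructure(key: dict) -> dict:
    """Rewrite one series key according to the suggestions (hypothetical helper).

    - the MEASURE dimension is dropped;
    - a base-period code in UNIT is replaced by the new unit code IXNB;
    - the base period moves to its own concept, here called BASE_PER (assumed ID).
    """
    new_key = {dim: code for dim, code in key.items() if dim != "MEASURE"}
    match = BASE_PERIOD_CODE.fullmatch(key["UNIT"])
    if match:
        new_key["UNIT"] = "IXNB"              # index number, now identified via UNIT
        new_key["BASE_PER"] = match.group(1)  # e.g. "2005"
    else:
        new_key["BASE_PER"] = None            # no base period for non-index series
    return new_key


# Placeholder keys: only the MEASURE and UNIT codes come from the slides.
index_series = {"SUBJECT": "PROD", "MEASURE": "IXNB", "FREQ": "M",
                "REFERENCE_AREA": "FR", "UNIT": "2005100"}
value_series = {"SUBJECT": "PROD", "MEASURE": "ST", "FREQ": "M",
                "REFERENCE_AREA": "FR", "UNIT": "EUR"}

print(restructure(index_series))
print(restructure(value_series))
```

After the rewrite, UNIT no longer mixes units with base periods, and an index series is recognised by UNIT = IXNB alone, which is what makes the MEASURE dimension redundant in this sketch.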