Introduction to SDMX guidelines for the design of Data

SDMX STATISTICAL CAPACITY BUILDING
GUIDELINES FOR DESIGNING DATA
STRUCTURE DEFINITIONS (DSDs)
SDMX Global Conference 2013, Paris
“Global SDMX Implementation: Modernising Official Statistics”
Overview
• Design principles
• Exchange context
• Design process
• Data structuring approaches
• DSD analysis: STES as example
Design Principles
Structural
• Parsimony
• Simplicity
• Purity
• Unambiguousness
• Exhaustiveness
• Orthogonality
Other
• Re-use of existing artefacts
• Flexibility and future needs
• Fitness for use throughout statistical business process
• User-friendliness
Data Exchange Context
• Single- or multi-domain
• Single- or multi-purpose
• Type of data
• Human or machine as recipient
• Level of data exchange
• Role in data exchange
• Process pattern
• Phase of statistical process
Design Process
1. Specify context
2. Identify relevant existing DSDs
   • available: continue with step 3
   • not available: continue with step 4.3
3. Check DSD suitability
   • partly suitable: 4.1. Define modified DSDs
   • suitable: 4.2. Use suitable DSDs
   • not suitable: 4.3. Define new DSDs
5. Define supporting artefacts
Design Process
Define new DSDs
4.3.1. Specify concepts
4.3.2. Specify code lists
4.3.3. Specify data formats
4.3.4. Assemble DSDs
Design Process
Specify concepts
4.3.1.1. Decide structuring approach
4.3.1.2. Identify relevant existing concepts
   • available: continue with 4.3.1.3
   • not available: continue with 4.3.1.4.2
4.3.1.3. Check concept suitability
   • suitable: 4.3.1.4.1. Use suitable concepts
   • not suitable: 4.3.1.4.2. Define new concepts
4.3.1.5. Define concept roles
4.3.1.6. Define groups
4.3.1.7. Define attribute attachment levels
Feedback loops in the flowchart are labelled “revise”.
Design Process
Specify code lists
4.3.2.1. Identify relevant existing code lists
   • available: continue with 4.3.2.2
   • not available: continue with 4.3.2.3.3
4.3.2.2. Check code list suitability
   • suitable: 4.3.2.3.1. Use suitable code lists
   • partly suitable: 4.3.2.3.2. Define modified code lists
   • not suitable: 4.3.2.3.3. Define new code lists
Design Process
Iterative
4.3.1. Specify concepts
4.3.2. Specify code lists
4.3.3. Specify data formats
4.3.4. Assemble DSDs
Data structuring approaches
• Number and content of dimensions
• Number of DSDs
Not completely independent: a larger number of DSDs goes with fewer concepts and dimensions in the key.
Number and content of dimensions
Data characteristics: C1, C2, C3, C4, … (Sex, Age, Sector, Employment status, …)
• Pure concepts: 1 characteristic = 1 concept (Sex; Age; Sector; …)
• Composite concepts: more than one characteristic = 1 concept (e.g. Sex and Age)
Wider use of composite concepts means a lower number of dimensions.
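The difference between the two approaches can be made concrete with a small sketch. The code lists and codes below are hypothetical and invented for illustration only; the point is that a composite concept trades two short code lists for one longer one and a shorter key.

```python
from itertools import product

# Hypothetical pure code lists (codes and labels are assumptions).
CL_SEX = {"F": "Female", "M": "Male", "T": "Total"}
CL_AGE = {"Y15-24": "15 to 24 years", "Y25-64": "25 to 64 years", "T": "Total"}


def composite_codelist(*codelists, sep="_"):
    """Build one composite code list as the cross product of pure code lists."""
    combined = {}
    for combo in product(*(cl.items() for cl in codelists)):
        code = sep.join(code for code, _ in combo)
        label = ", ".join(label for _, label in combo)
        combined[code] = label
    return combined


CL_SEX_AGE = composite_codelist(CL_SEX, CL_AGE)

# Pure approach: two dimensions, each with a short code list.
pure_key = ".".join(["F", "Y15-24"])
# Composite approach: one dimension with a longer code list.
composite_key = "F_Y15-24"

print(len(CL_SEX), len(CL_AGE), len(CL_SEX_AGE))
print(pure_key, composite_key, CL_SEX_AGE[composite_key])
```

The composite code list grows multiplicatively with every characteristic folded into it, which is one reason composite dimensions are usually kept for narrow dissemination contexts.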
Pure vs. composite concepts
[Figure: the key Dim1.Dim2.….DimN with one code list per dimension (Codelist1, Codelist2, …, CodelistN, each listing codes 1, 2, … up to K1, K2, …, KN). Axes: horizontal complexity, from “many pure” to “few mixed” concepts, and vertical complexity.]
Many pure concepts
● clean data structure
● flexible in terms of mapping to other data structures: may be mapped to any mixed representation
● flexible in terms of defining queries (for a skilled user)
● short and simple code lists
● long observation keys
● difficult for end users to handle (long codes; many dimensions), but more flexible for skilled users
● special values (not applicable; total) widely used
● creates sparseness
● needs many constraints (due to sparseness)
Some of the critical points may be overcome through a different strategy in choosing the number of DSDs. More DSDs reduce sparseness and the need for constraints, and result in shorter keys.
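Sparseness can be illustrated as the share of theoretically possible keys that actually carry data. The sketch below is an assumption-laden toy example: the dimensions, the codes and the "_Z" code for "not applicable" are hypothetical, and the `density` helper is not part of any SDMX tool.

```python
from itertools import product
from math import prod


def density(codelists, observed_keys):
    """Share of the possible keys (cross product of code lists) that carry data."""
    possible = prod(len(cl) for cl in codelists)
    return len(set(observed_keys)) / possible


# Hypothetical pure dimensions.
SEX = ["F", "M", "T"]
AGE = ["Y15-24", "Y25-64", "T"]
SECTOR = ["A", "B", "C", "T"]
NA = "_Z"  # special value "not applicable" (assumed code)

# One DSD: breakdown by sex and age (sector not applicable),
# plus a breakdown by sector (sex and age as totals).
by_sex_age = [(s, a, NA) for s, a in product(SEX, AGE)]
by_sector = [("T", "T", sec) for sec in SECTOR]
one_dsd = density([SEX, AGE, SECTOR + [NA]], by_sex_age + by_sector)

# Two DSDs: each breakdown gets its own, shorter key.
dsd_sex_age = density([SEX, AGE], list(product(SEX, AGE)))
dsd_sector = density([SECTOR], [(sec,) for sec in SECTOR])

print(round(one_dsd, 3), dsd_sex_age, dsd_sector)
```

In this toy case the single DSD needs the special value and leaves most of the cube empty, while the two smaller DSDs are fully dense and need no constraints.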
Strategies
• All pure concepts: too many?
• Trade-off: composite concepts or many different DSDs
Composite concepts
SDMX technical notes, annex 6 (343): “Avoid composite dimensions”,
but in particular contexts they may be useful.
E.g. to disseminate a few key economic indicators (multi-domain)
Number of DSDs
ONE DSD or MANY DSDs?
A possible approach:
Master and satellite artefacts
(derived via constraints)
Master DSD matrix
Data exchange     Concepts
scenario          SC1    SC2    SC3    ……     SCm
#1                X      X      X      X      X
#2                X      O      X      X      X
#3                O      X      O      O      O
……                ….     ….     …      …      …
#n                X      O      O      X      X
Master and satellite DSDs
Multiple satellite DSDs
Constraints applied to the master DSD yield DSD1, DSD2, ……, DSDn: multiple satellite DSDs (unique key structures).
Master and satellite DSDs
One DSD, multiple data flows
Constraints applied to the master DSD yield Dataflow 1, Dataflow 2, ……, Dataflow n, all based on ONE DSD.
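A minimal sketch of the "one DSD, multiple data flows" idea, assuming a hypothetical three-dimension master DSD and hypothetical dataflow ids, codes and constraints. Each dataflow keeps the master key structure and only restricts the codes allowed for some dimensions.

```python
# Hypothetical master DSD: ordered dimension ids of the key.
MASTER_DSD = ["FREQ", "REFERENCE_AREA", "SUBJECT"]

# Hypothetical constraints: per dataflow, the codes allowed per dimension.
# Dimensions that are not listed are left unrestricted.
CONSTRAINTS = {
    "DF_MONTHLY": {"FREQ": {"M"}},
    "DF_QUARTERLY_EU": {"FREQ": {"Q"}, "REFERENCE_AREA": {"FR", "DE"}},
}


def key_allowed(dataflow, key):
    """Check a dot-separated key against the constraint of one dataflow."""
    codes = key.split(".")
    if len(codes) != len(MASTER_DSD):
        return False  # every dataflow shares the master key structure
    allowed = CONSTRAINTS[dataflow]
    return all(
        code in allowed[dim]
        for dim, code in zip(MASTER_DSD, codes)
        if dim in allowed
    )


print(key_allowed("DF_MONTHLY", "M.FR.PROD"))
print(key_allowed("DF_QUARTERLY_EU", "Q.US.PROD"))
```

Because the key structure never changes, data from any of the dataflows can be combined without remapping keys.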
Master and satellite DSDs
Multiple satellite DSDs
A slightly different approach: dropping the dimensions that are not relevant from the master DSD yields DSD1, DSD2, ……, DSDn: multiple satellite DSDs (multiple key structures).
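Deriving satellite DSDs by dropping dimensions can be sketched from a master DSD matrix. The sketch assumes that X marks a concept as relevant for a data exchange scenario and O as not relevant; the concept ids, scenario names and matrix values below are hypothetical.

```python
# Hypothetical master DSD: ordered concept ids.
MASTER_DIMENSIONS = ["SC1", "SC2", "SC3", "SC4"]

# Hypothetical master DSD matrix: True = X (relevant), False = O (not relevant).
MATRIX = {
    "scenario_1": {"SC1": True, "SC2": True, "SC3": True, "SC4": True},
    "scenario_2": {"SC1": True, "SC2": False, "SC3": True, "SC4": True},
    "scenario_3": {"SC1": False, "SC2": True, "SC3": False, "SC4": False},
}


def satellite_dimensions(scenario):
    """Key structure of a satellite DSD: master dimensions minus the dropped ones."""
    row = MATRIX[scenario]
    return [dim for dim in MASTER_DIMENSIONS if row[dim]]


def group_by_key_structure():
    """Scenarios with identical key structures can share one satellite DSD."""
    groups = {}
    for scenario in MATRIX:
        groups.setdefault(tuple(satellite_dimensions(scenario)), []).append(scenario)
    return groups


for structure, scenarios in group_by_key_structure().items():
    print(".".join(structure), scenarios)
```

Unlike the constraint-based variant, each satellite here has its own key structure, so combining data across satellites requires mapping back to the master DSD.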
Example: Short-term Economic Statistics
DSD: Dimensions and attributes

DIMENSIONS
CONCEPT          DESCRIPTION                                            CODE LIST ID
SUBJECT          Subject matter                                         CL_SUBJECT
MEASURE          Quantitative variable value                            CL_MEASURE
FREQ             Periodicity                                            CL_FREQ
REFERENCE_AREA   “Reference area” and/or “Counterpart area”             CL_AREA
ADJUSTMENT       Seasonal adjustment                                    CL_ADJUSTMENT
UNIT             Generic list with code values                          CL_UNIT
TIME_PERIOD      Defines the observation period

ATTRIBUTES
CONCEPT          DESCRIPTION                                            CODE LIST ID
UNIT_MULT        Indicating the magnitude in the units of measurements  CL_UNIT_MULT
OBS_STATUS       The observation status                                 CL_OBS_STATUS
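The STES structure above can be written down as a small data structure to show how a series key is assembled. This is a sketch only: it assumes the key follows the order in which the dimensions are listed, and the codes PROD, M, FR and SA are hypothetical (IXNB and 2005100 appear in CL_MEASURE and CL_UNIT).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    concept: str
    codelist: Optional[str] = None  # None for non-coded components


# Dimensions in the order listed in the DSD (assumed to be the key order).
DIMENSIONS = [
    Component("SUBJECT", "CL_SUBJECT"),
    Component("MEASURE", "CL_MEASURE"),
    Component("FREQ", "CL_FREQ"),
    Component("REFERENCE_AREA", "CL_AREA"),
    Component("ADJUSTMENT", "CL_ADJUSTMENT"),
    Component("UNIT", "CL_UNIT"),
]
TIME_DIMENSION = Component("TIME_PERIOD")
ATTRIBUTES = [
    Component("UNIT_MULT", "CL_UNIT_MULT"),
    Component("OBS_STATUS", "CL_OBS_STATUS"),
]


def series_key(values):
    """Join one code per dimension into a key of the form Dim1.Dim2.….DimN."""
    missing = [d.concept for d in DIMENSIONS if d.concept not in values]
    if missing:
        raise ValueError(f"missing dimension values: {missing}")
    return ".".join(values[d.concept] for d in DIMENSIONS)


example = {
    "SUBJECT": "PROD", "MEASURE": "IXNB", "FREQ": "M",
    "REFERENCE_AREA": "FR", "ADJUSTMENT": "SA", "UNIT": "2005100",
}
print(series_key(example))
```

The example key carries the index information twice, once in MEASURE (IXNB) and once in UNIT (a base period code), which is the kind of overlap the analysis of this DSD looks at.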
DSD analysis
Design principles
• Reuse of existing code lists and future needs:
  Adjustment, frequency, reference area, subject matter.
• Parsimony, simplicity, density:
  The DSD is not redundant and has a small number of dimensions. The DSD provides data for most of the cells.
• Purity:
  In this case the code list CL_UNIT is not pure, but it adds to simplicity.
DSD analysis
Design principles
• Unambiguousness and orthogonality:
  The code list CL_MEASURE seems to be ambiguous, and CL_UNIT and CL_MEASURE show overlaps.
• Exhaustiveness:
  It is possible to identify all data in the flow.
DSD analysis
Dimensions
• The DSD includes the dimension MEASURE to differentiate the indicators expressed as an index number from the rest.
• This item was added to the DSD as an independent dimension although, by its nature, it could be incorporated into the UNIT dimension (code list CL_UNIT).
• The code list of the UNIT dimension includes codes of different natures:
  - Physical unit measures
  - Monetary units
  - Several base periods for index numbers
DSD analysis
Code lists

CL_MEASURE
Description: A summary (means, mode, total, index, etc.) of the individual quantitative variable values for the statistical units in a specific group (study domains).
Code      Description
ST        Number, rate, value
IXNB      Index

CL_UNIT
Description: Generic list with code values (including currency, base period, measures)
Code      Description
1995100   1995=100
2000100   2000=100
2003100   2003=100
2005100   2005=100
2008100   2008=100
2010100   2010=100
AUD       Australian Dollar
BPA       Barrels per day
BPM       Barrels per month
BRL       Brazilian Real
CAD       Canadian Dollar
CHF       Swiss Franc
CLP       Chilean Peso
CNY       Yuan Renminbi
CZK       Czech Koruna
DKK       Danish Krone
DW        Dwellings
EUR       Euro
GBP       Pound Sterling
GWH       Gigawatt hour
HUF       Forint
IDR       Rupiah
ILS       New Israeli Sheqel
INR       Indian Rupee
ISK       Iceland Krona
JB        Jobs
DSD analysis
Suggestions
• Eliminate the MEASURE dimension.
• Add the code IXNB = Index number to CL_UNIT, so that indicators expressed as indices can be identified.
• Eliminate the base period codes from CL_UNIT.
• Create a new concept to specify the base period, with its own code list / format.
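Taken together, the suggestions amount to a recoding of existing keys. The sketch below is one possible reading, not a prescribed method. It assumes that base period codes in CL_UNIT consist of a four-digit year followed by 100 (as in 1995100 to 2010100), that MEASURE = IXNB always goes with such a code, and it uses BASE_PER as a hypothetical id for the new base period concept.

```python
import re

# Base period codes as they appear in CL_UNIT: four-digit year followed by "100".
BASE_PERIOD_CODE = re.compile(r"^(\d{4})100$")


def migrate(measure, unit):
    """Map an old (MEASURE, UNIT) pair to the suggested (UNIT, BASE_PER) pair.

    BASE_PER is a hypothetical concept id for the new base period concept.
    """
    match = BASE_PERIOD_CODE.match(unit)
    if (measure == "IXNB") != bool(match):
        # Assumption: an index always has a base period unit, and vice versa.
        raise ValueError(f"inconsistent MEASURE/UNIT pair: {measure}/{unit}")
    if match:
        return {"UNIT": "IXNB", "BASE_PER": match.group(1)}
    return {"UNIT": unit, "BASE_PER": None}


print(migrate("IXNB", "2005100"))
print(migrate("ST", "EUR"))
```

After the recoding, UNIT alone identifies indices, and the base year lives in a concept of its own instead of being mixed with currencies and physical units.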