Statistical Composites:

Metadata Models in Survey Computing
Some Results of MetaNet – WG 2
W. Grossmann
University of Vienna
METIS 2004, Geneva
Contents
- METANET
- Requirements for Models
- Key Features of the Model
- Implications for Terminology
10.02.2004
METIS 2004 Geneva
METANET 1
- A network of excellence funded by EUROSTAT
- 2000/01 – 2003, 5 work groups
  - WG 1: Methodology and Tools
  - WG 2: Harmonisation of Metadata – Structure and Definitions
  - WG 3: Best Practice for Migration
  - WG 4: Adoption Issues
  - WG 5: Terminology (ad hoc)
METANET 2
Within WG 2, two different approaches:
- Terminology Model (cf. WP 12)
- Unified Metadata Architecture for Statistics (UMAS model)
METANET 3
Intention of the UMAS model
- Statistics deals with different kinds of data, e.g. surveys, registers, classifications, …
- These data show a dynamic defined by statistical processing activities
- Define a model that supports, besides the description of data, the description of this statistical dynamic
Requirement Analysis 1
Method
Requirement analysis is based on
- Examination of a number of activities in survey processing
- Documentation of these activities inside statistical systems, in particular the proposals of
  - Banca d’Italia, DDI, OECD, SCB-DOK, SDDS, Statistics Netherlands (Input-Throughput-Output model)
- General methods for documentation, e.g.
  - Dublin Core, Facet Classifications, ISO standards
Requirement Analysis 2
Example A: Sampling
Terminology view
- Sampling is the process of selecting a number of cases from all the cases in a particular group or universe
Operational view
- Input: Sampling Frame
- Output: Sample
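The operational view treats sampling as an operation with a defined input and output. A minimal sketch of that idea, assuming simple random sampling without replacement (the slides do not prescribe a technique); the names `SamplingFrame`, `Sample` and `draw_sample` and the example frame are illustrative and not part of UMAS:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingFrame:
    """Input of the operation: the unit identifiers to select from."""
    name: str
    unit_ids: tuple


@dataclass(frozen=True)
class Sample:
    """Output of the operation: selected units plus a link back to the frame."""
    frame_name: str
    unit_ids: tuple


def draw_sample(frame: SamplingFrame, n: int, seed: int = 0) -> Sample:
    """Simple random sampling without replacement (an assumed technique)."""
    if n > len(frame.unit_ids):
        raise ValueError("sample size exceeds frame size")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return Sample(frame.name, tuple(rng.sample(frame.unit_ids, n)))


# Hypothetical frame with 100 unit identifiers
frame = SamplingFrame("household_register", tuple(range(1, 101)))
sample = draw_sample(frame, n=10)
```

Keeping the frame name on the sample is one simple way to record which input produced which output.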
Requirement Analysis 3
Example A: Sampling
Details of operational view
- What is an appropriate definition of the sampling frame given the problem (e.g. coverage)?
- What kind of additional information should be available for the sampling frame (e.g. auxiliary variables)?
- How can we obtain an appropriate representation of the desired sampling frame (e.g. merging existing frames, selecting from existing frames)?
- Who is responsible for the frame in the future?
Requirement Analysis 4
Example A: Sampling
- Which sampling technique is appropriate for our problem? (Note that the structure of the sampling frame and the possible sampling techniques are related.)
- In which form is the output (i.e. the sample) represented in the system?
- Who is responsible for the sampling procedure?
Requirement Analysis 5
Example B: Editing
Terminology view
- Editing is the process of detecting and adjusting individual errors in data records resulting from data collection and capture…
Operational view
- Input: A variable together with a set of admissible values for the variable within a specific context
- Output: A summary statement about quality of the variable or a listing of errors for each case
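The input/output pair above can be sketched as a single function. This is an illustration only: the function name `edit_variable`, the record layout and the wording of the summary are assumptions, not part of the model.

```python
def edit_variable(records, variable, admissible):
    """Check one variable against its set of admissible values.

    Returns (errors, summary): errors lists the offending cases,
    summary is a one-line quality statement for the variable.
    """
    errors = [
        {"case": i, "variable": variable, "value": rec.get(variable)}
        for i, rec in enumerate(records)
        if rec.get(variable) not in admissible
    ]
    share = len(errors) / len(records) if records else 0.0
    summary = (
        f"{variable}: {len(errors)} of {len(records)} "
        f"cases inadmissible ({share:.0%})"
    )
    return errors, summary


# Hypothetical records; the admissible set plays the role of the context
records = [{"sex": "f"}, {"sex": "m"}, {"sex": "x"}, {"sex": "f"}]
errors, summary = edit_variable(records, "sex", admissible={"f", "m"})
```

Both forms of output named on the slide are returned together, so a caller can use the listing, the summary, or both.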
Requirement Analysis 6
Example B: Editing
Details of operational view
Context may be defined in various ways
- By subject matter considerations, e.g. only one person in a household can claim to be head of household
- By more technical reasons, e.g. use of 1000 € as the measurement unit for annual income
- By purely technical reasons, e.g. “f” for female and “m” for male
Requirement Analysis 7
Example B: Editing
- Context defines rules for the admissible values of the variable
  - Within one data set
  - Within one infological model (e.g. person-household)
  - Within a time series
- Rules may be formulated
  - As strong constraints, i.e. logical conditions on combinations of values
  - As soft constraints, i.e. statistical conditions on combinations of values
- Rules have to be processed in algorithmic form and maintained by an administrative procedure
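The difference between the two kinds of rules can be made concrete in code. A minimal sketch under stated assumptions: the strong constraint is the head-of-household rule, the soft constraint flags incomes far above the median, and the factor of 10 is an arbitrary illustrative threshold. Function names and data are hypothetical.

```python
from statistics import median


def one_head_per_household(persons):
    """Strong constraint: every household has exactly one head (logical rule).

    Returns the households that violate the rule.
    """
    heads = {}
    for p in persons:
        heads[p["household"]] = heads.get(p["household"], 0) + int(p["is_head"])
    return [hh for hh, count in heads.items() if count != 1]


def unusual_income(persons, factor=10.0):
    """Soft constraint: flag incomes far above the median (statistical rule).

    A flagged value is suspicious, not necessarily wrong.
    """
    typical = median(p["income"] for p in persons)
    return [p["id"] for p in persons if p["income"] > factor * typical]


# Hypothetical person records, income in units of 1000 EUR
persons = [
    {"id": 1, "household": "A", "is_head": True, "income": 30},
    {"id": 2, "household": "A", "is_head": True, "income": 28},
    {"id": 3, "household": "B", "is_head": True, "income": 35},
    {"id": 4, "household": "B", "is_head": False, "income": 900},
]
violations = one_head_per_household(persons)
suspicious = unusual_income(persons)
```

A violated strong constraint is an error by definition, while a soft constraint only produces candidates for review.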
Requirement Analysis 8
Example C: Weighting
Terminology view
- Weight is the importance of an object in relation to a set of objects to which it belongs; …
Operational view
- Input: A statistical dataset together with appropriate information
- Output: Statistical dataset augmented by the weight information
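As one instance of this operation, post-stratification assigns each record the weight N_h / n_h, where N_h is the population count and n_h the sample count of its stratum. The sketch below assumes the population information arrives as a summary table; the function name, keys and counts are made up for illustration.

```python
from collections import Counter


def poststratify(dataset, population_counts, stratum_key="stratum"):
    """Return a copy of the dataset with a 'weight' added to every record.

    weight = population count of the stratum / sample count of the stratum
    """
    sample_counts = Counter(rec[stratum_key] for rec in dataset)
    return [
        {
            **rec,
            "weight": population_counts[rec[stratum_key]]
            / sample_counts[rec[stratum_key]],
        }
        for rec in dataset
    ]


# Hypothetical sample and population summary table
dataset = [
    {"id": 1, "stratum": "urban"},
    {"id": 2, "stratum": "urban"},
    {"id": 3, "stratum": "rural"},
]
population_counts = {"urban": 600, "rural": 400}
weighted = poststratify(dataset, population_counts)
```

The input dataset is left unchanged, so the augmented dataset is a separate output of the operation.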
Requirement Analysis 9
Example C: Weighting
Details of operational view
- Which subject matter problem should be solved by weighting (e.g. representation of strata, post-stratification, …)?
- Which procedure should be used for weighting (e.g. base weights, calibration weights, …)?
- In which form are the data and the additional information about the population available (e.g. population data as summary table or as register with auxiliary variables)?
Requirement Analysis 10
Example C: Weighting
- How can we access and combine the different data?
- Who is responsible for the different datasets?
- How is the output represented (e.g. as weight for the dataset, as weights for the sampling procedure, as summary table)?
- Are we interested in reuse of the procedure for new data sets (e.g. the same weighting procedure within a series)?
Requirement Analysis 11
Example D: Analytical Units
Terminology view
- Analytical units represent real or artificially constructed units for which statistics are compiled
Operational view
- Input: Two or more statistical units
- Output: A new statistical unit
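One possible reading of this operation is the construction of a household unit from the person units that belong to it. The sketch below follows that reading; the function name, field names and derived variables are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_households(persons):
    """Derive one household unit from the person units sharing a household id."""
    members = defaultdict(list)
    for p in persons:
        members[p["household_id"]].append(p)
    return [
        {
            "household_id": hid,
            "size": len(group),
            "total_income": sum(p["income"] for p in group),
            "member_ids": [p["person_id"] for p in group],
        }
        for hid, group in sorted(members.items())
    ]


# Hypothetical person units
persons = [
    {"person_id": 1, "household_id": "H1", "income": 20},
    {"person_id": 2, "household_id": "H1", "income": 15},
    {"person_id": 3, "household_id": "H2", "income": 40},
]
households = build_households(persons)
```

The `member_ids` field keeps the link from the new analytical unit back to the units it was built from.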
Requirement Analysis 12
Example D: Analytical Units
Details of operational view
- What is the conceptual definition of the statistical units?
- How are the conceptual definitions captured by operational characteristics (e.g. auxiliary variables)?
- How can we access and manipulate the operational characteristics in order to produce the new analytical unit?
- How is the new analytical unit embedded into an existing administrative framework?
Requirement Analysis 13
Summary
The examples show that
- Models should be based on terminology but are more than terminology
- Models have to consider different types of “statistical” objects
- For these objects we have to know the concepts represented as data, together with the relations between the concepts
- We have to know the statistical meaning of the objects, together with their statistical relations
Requirement Analysis 14
Summary
- We have to take into account the specific format of realisation of the objects as physical datasets
- We must include statements about responsibility, access rights and other administrative details
- We need a flexible coupling mechanism for the objects according to processing needs
- We have to develop a description formalism for statistical processing
- We have to take into account information requirements of external users
Key Features of the Model 1
To meet the different aspects of the requirement analysis, a model with four different facets was designed, resembling the idea of facet classifications used by librarians and archivists
Key Features of the Model 2

“Structure Facet”
- The objects of interest, so-called “statistical categories”:
  - statistical unit
  - statistical population
  - statistical variables
  - statistical values, together with a number of related objects like classifications
  - statistical datasets
Key Features of the Model 3

- “Statistical domains” for coupling objects according to processing needs (basically a system of catalogues for the other objects)
- Each instance of the structure has a twofold representation inside a system
  - As data (“Category-Instance data”)
  - As description (“Category-Instance model”, i.e. metadata)
Key Features of the Model 4

“View Facet” describes the instances
- “Conceptual point of view”
  - Subject matter definition
- “Statistical point of view”
  - The statistical properties of the instances necessary for processing
- “Data management point of view”
  - All information necessary for machine-supported storage and manipulation
- “Administrative point of view”
  - Management and bookkeeping of the structures
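The four views, together with the split between description and data, suggest a simple record layout. A minimal sketch, assuming one field per view; the class names, field names and all example values are hypothetical and not taken from the UMAS specification.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryInstanceModel:
    """Description (metadata) of one instance, split by the four views."""
    conceptual: str        # subject matter definition
    statistical: dict      # statistical properties needed for processing
    data_management: dict  # storage and manipulation details
    administrative: dict   # management and bookkeeping


@dataclass
class CategoryInstance:
    """One instance of a statistical category: description plus data."""
    category: str
    model: CategoryInstanceModel
    data: list = field(default_factory=list)


# Hypothetical instance of the category "statistical variable"
income = CategoryInstance(
    category="statistical variable",
    model=CategoryInstanceModel(
        conceptual="Annual income of a person",
        statistical={"scale": "ratio", "unit": "1000 EUR"},
        data_management={"column": "income", "dtype": "float"},
        administrative={"responsible": "unit X"},
    ),
    data=[30.0, 28.5, 35.0],
)
```

Separating `model` from `data` lets the description exist before any data are collected, for example during design.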
Key Features of the Model 5

“Stage Facet” describes processing at the data level as well as at the metadata level
[Figure: stage labels DEFINITION and DESIGN; PROCESSING; PRODUCTION; DISSEMINATION and EXCHANGE. Metadata level: { CM1, CM2, … } and { TM(CM1, CM2, …) }. Data level: { CID1, CID2, … } and { TD(CID1, CID2, …) }.]
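The notation can be read as follows, although the slide does not spell it out: CM stands for a Category-Instance model, CID for Category-Instance data, and TM and TD for transformations applied at the metadata level and at the data level. Under that assumed reading, a processing step changes data and description together. The function names and the mean example are hypothetical.

```python
def td_mean(cid):
    """TD: transformation at the data level (here: mean of the values)."""
    return sum(cid) / len(cid)


def tm_mean(cm):
    """TM: matching transformation at the metadata level."""
    return {**cm, "name": f"mean({cm['name']})", "derived_from": [cm["name"]]}


def process(cm, cid, tm, td):
    """Apply one processing step to an instance at both levels."""
    return tm(cm), td(cid)


cm1 = {"name": "income", "unit": "1000 EUR"}  # assumed Category-Instance model
cid1 = [30.0, 20.0, 40.0]                     # assumed Category-Instance data
cm_out, cid_out = process(cm1, cid1, tm_mean, td_mean)
```

Pairing each data transformation with a metadata transformation is what keeps the description valid after processing.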
Key Features of the Model 6
- “Production blueprint”
  - Records how the instance is set up inside the system according to the four different views of the view facet
- “Processing blueprint”
  - Describes the processing activities for the instances according to the four different views of the view facet
Key Features of the Model 7

“Function facet”
- All aspects of communication and usage of meta-information by humans inside the system as well as in connection with dissemination and exchange
  - Who is involved in communication?
  - What information is communicated?
  - How is the information communicated?
Implications for Terminology 1
- Statistical information systems use terminology from different sources
  - Statistics, Computer Science, Economics, Social Sciences, …
- We can at best collect terminology and bring it into an order according to some model
- This ordering defines a “statistical ontology” which sits above terminology and has to rely on a few commonly agreed terms
Implications for Terminology 2
Activities of METANET WG 4 showed that there is fairly broad agreement among statisticians about the main terms for important structures:
- Statistical Unit
- Statistical Variable
- Statistical Values
Implications for Terminology 3
Based on such agreement we can assign each terminology item a specification according to the ontology
UMAS proposes the following classification
- What is the source of the terminology (e.g. statistics, general standards, application area, …)?
- To which structure does the term apply?
- For which view is the term used?
- In which processing stage is the term used?
- Which function aspects does the term cover?
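The five questions can be read as the fields of a record attached to each term. A minimal sketch of such a specification; the class name, the field values for "sampling frame" and the helper `terms_for_view` are illustrative assumptions, not an official UMAS classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSpecification:
    """One terminology item classified along the five questions."""
    term: str
    source: str     # where the term comes from
    structure: str  # structure the term applies to
    view: str       # view in which the term is used
    stage: str      # processing stage in which the term is used
    function: str   # function aspect the term covers


# Hypothetical classification of one term
spec = TermSpecification(
    term="sampling frame",
    source="statistics",
    structure="statistical population",
    view="statistical",
    stage="definition and design",
    function="internal communication",
)


def terms_for_view(specs, view):
    """Select the terms that are used in a given view."""
    return [s.term for s in specs if s.view == view]
```

With terms stored this way, a glossary can be filtered by facet, for example `terms_for_view([spec], "statistical")`.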
Implications for Terminology 4
MCV (SDMX) proposes the following classification
- Administration (close relation to administrative view)
- Concepts, Definitions, Standards (close relation to conceptual view and structure)
- Data collection, manipulation/accounting conventions (close relation to stage facet)
- Quality and performance metadata (close relation to function facet in connection with dissemination and exchange)
Summary
- Statistical processing activities define metadata requirements usually not considered in traditional data modelling
- The processing activities require a model which supports flexible coupling of entities according to processing needs
- Besides terminology, we also need a specification of the terminology in the context of a statistical ontology
Thank you!
wilfried.grossmann@univie.ac.at