Metadata Models in Survey Computing
Some Results of MetaNet – WG 2

W. Grossmann
University of Vienna
METIS 2004, Geneva

Contents
- Metanet
- Requirements for Models
- Key Features of the Model
- Implications for Terminology

10.02.2004 METIS 2004 Geneva

METANET 1
A network of excellence funded by EUROSTAT, 2000/01 – 2003, with 5 work groups:
- WG 1: Methodology and Tools
- WG 2: Harmonisation of Metadata – Structure and Definitions
- WG 3: Best Practice for Migration
- WG 4: Adoption Issues
- WG 5: Terminology (ad hoc)

METANET 2
Within WG 2 two different approaches:
- Terminology Model (cf. WP 12)
- Unified Metadata Architecture for Statistics (UMAS model)

METANET 3
Intention of the UMAS model
- Statistics deals with different kinds of data, e.g. surveys, registers, classifications, …
- These data show a dynamic defined by statistical processing activities
- Define a model which supports, besides the description of data, the description of the statistical dynamic

Requirement Analysis 1
Method
Requirement analysis is based on
- Examination of a number of activities in survey processing
- Documentation of these activities inside statistical systems, in particular the proposals of Banca d’Italia, DDI, OECD, SCB-DOK, SDDS, Statistics Netherlands (Input-Throughput-Output model)
- General methods for documentation, e.g. Dublin Core, Facet Classifications, ISO-Standards

Requirement Analysis 2
Example A: Sampling
Terminology view
- Sampling is the process of selecting a number of cases from all the cases in a particular group or universe
Operational view
- Input: Sampling Frame
- Output: Sample

Requirement Analysis 3
Example A: Sampling
Details of operational view
- What is an appropriate definition for the sampling frame given the problem (e.g. coverage)?
- What kind of additional information should be available for the sampling frame (e.g. auxiliary variables)?
- How can we obtain an appropriate representation of the desired sampling frame (e.g. merging existing frames, selecting from existing frames)?
- Who is responsible for the frame in the future?

Requirement Analysis 4
Example A: Sampling
- Which sampling technique is appropriate for our problem? (Note that there are some relations between the structure of the sampling frame and the possible sampling techniques.)
- In which form is the output (i.e. the sample) represented in the system?
- Who is responsible for the sampling procedure?

Requirement Analysis 5
Example B: Editing
Terminology view
- Editing is the process of detecting and adjusting individual errors in data records resulting from data collection and capture…
Operational view
- Input: A variable together with a set of admissible values for the variable within a specific context
- Output: A summary statement about the quality of the variable or a listing of errors for each case

Requirement Analysis 6
Example B: Editing
Details of operational view
Context may be defined in various ways
- By subject matter considerations, e.g. only one person in a household can claim to be head of household
- By more technical reasons, e.g. the use of 1000 € as measurement unit for annual income
- By purely technical reasons, e.g. “f” for female and “m” for male

Requirement Analysis 7
Example B: Editing
Context defines rules for the admissible values of the variable
- Within one data set
- Within one infological model (e.g. person-household)
- Within a time series
Rules may be formulated
- As strong constraints, i.e. logical conditions on combinations of values
- As soft constraints, i.e. statistical conditions on combinations of values
Rules have to be processed in algorithmic form and maintained by an administrative procedure

Requirement Analysis 8
Example C: Weighting
Terminology view
- Weight is the importance of an object in relation to a set of objects to which it belongs; …
Operational view
- Input: A statistical dataset together with appropriate information
- Output: Statistical dataset augmented by the weight information

Requirement Analysis 9
Example C: Weighting
Details of operational view
- Which subject matter problem should be solved by weighting (e.g. representation of strata, poststratification, …)?
- Which procedure should be used for weighting (e.g. base weights, calibration weights, …)?
- In which form are the data and the additional information about the population available (e.g. population data as summary table or as register with auxiliary variables)?

Requirement Analysis 10
Example C: Weighting
- How can we access and combine the different data?
- Who is responsible for the different datasets?
- How is the output represented (e.g. as weight for the dataset, as weights for the sampling procedure, as summary table)?
- Are we interested in reuse of the procedure for new data sets (e.g. the same weighting procedure within a series)?

Requirement Analysis 11
Example D: Analytical Units
Terminology view
- Analytical units represent real or artificially constructed units for which statistics are compiled
Operational view
- Input: Two or more statistical units
- Output: A new statistical unit

Requirement Analysis 12
Example D: Analytical Units
Details of operational view
- What is the conceptual definition of the statistical units?
- How are the conceptual definitions captured by operational characteristics (e.g. auxiliary variables)?
- How can we access and manipulate the operational characteristics in order to produce the new analytical unit?
- How is the new analytical unit embedded into an existing administrative framework?

Requirement Analysis 13
Summary
The examples show that
- Models should be based on terminology but are more than terminology
- Models have to consider different types of “statistical” objects
- For these objects we have to know the concepts represented as data, together with the relations between the concepts
- We have to know the statistical meaning of the objects, together with their statistical relations

Requirement Analysis 14
Summary
- We have to take into account the specific format of realisation of the objects as physical datasets
- We must include statements about responsibility, access rights and other administrative details
- We need a flexible coupling mechanism for the objects according to processing needs
- We have to develop a description formalism for statistical processing
- We have to take into account information requirements of external users

Key Features of the Model 1
To meet the different aspects of the requirement analysis, a model with four different facets was designed, resembling the idea of facet classifications used by librarians and archivists

Key Features of the Model 2
“Structure Facet”
The objects of interest, so-called “statistical categories”:
- statistical unit,
- statistical population,
- statistical variables,
- statistical values
together with a number of related objects like classifications, statistical datasets,

Key Features of the Model 3
“statistical domains” for coupling objects according to processing needs (basically a system of catalogues for the other objects)
Each instance of the structure has a twofold representation inside a system
- As data (“Category-Instance data”)
- As description (“Category-Instance model”, i.e. metadata)

Key Features of the Model 4
“View Facet” describes the instances
- “Conceptual point of view”: subject matter definition
- “Statistical point of view”: the statistical properties of the instances necessary for processing
- “Data management point of view”: all information necessary for machine supported storage and manipulation
- “Administrative point of view”: management and bookkeeping of the structures

Key Features of the Model 5
“Stage Facet” describes processing at the data as well as at the metadata level
[Diagram]
- Stages: DEFINITION and DESIGN, PRODUCTION, PROCESSING, DISSEMINATION and EXCHANGE
- Metadata level: { CM1, CM2, ... } and { TM(CM1, CM2, ...) }
- Data level: { CID1, CID2, ... } and { TD(CID1, CID2, ...) }

Key Features of the Model 6
- “Production blueprint”: keeps the information on how the instance is set up inside the system according to the four different views of the view facet
- “Processing blueprint”: describes the processing activities for the instances according to the four different views of the view facet

Key Features of the Model 7
“Function facet”
All aspects of communication and usage of meta-information by humans inside the system as well as in connection with dissemination and exchange
- Who is involved in communication?
- What information is communicated?
- How is the information communicated?
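
As an illustration of how the facets could fit together, here is a minimal sketch in Python. It is not part of the UMAS specification: the names `Views`, `CategoryInstance`, `process`, `to_thousands` and `relabel_unit`, as well as all field values, are assumptions made for this example. The sketch pairs the data of an instance with its description under the four points of view, and treats a processing step as a pair of transformations, one at the data level and one at the metadata level.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, List


@dataclass(frozen=True)
class Views:
    """Description of an instance under the four points of view (assumed field names)."""
    conceptual: str = ""
    statistical: str = ""
    data_management: str = ""
    administrative: str = ""


@dataclass
class CategoryInstance:
    """Twofold representation: the description (model) and the data itself."""
    category: str                                   # e.g. "statistical variable"
    model: Views                                    # metadata side
    data: List[Any] = field(default_factory=list)   # data side


def process(instance: CategoryInstance,
            transform_data: Callable[[List[Any]], List[Any]],
            transform_model: Callable[[Views], Views]) -> CategoryInstance:
    """Apply one processing step at the data level and at the metadata level."""
    return CategoryInstance(
        category=instance.category,
        model=transform_model(instance.model),
        data=transform_data(instance.data),
    )


# Hypothetical example: annual income, rescaled from EUR to 1000 EUR.
income = CategoryInstance(
    category="statistical variable",
    model=Views(
        conceptual="annual income",
        statistical="ratio scale",
        data_management="unit: EUR",
        administrative="owner: hypothetical unit A",
    ),
    data=[21000, 35500, 48000],
)


def to_thousands(values: List[Any]) -> List[Any]:
    # Data-level transformation
    return [v / 1000 for v in values]


def relabel_unit(views: Views) -> Views:
    # Metadata-level transformation: only the data management view changes
    return replace(views, data_management="unit: 1000 EUR")


income_k = process(income, to_thousands, relabel_unit)
print(income_k.data, income_k.model.data_management)
```

Keeping the two transformations as separate arguments makes it explicit that a change to the data has to be accompanied by a matching change to its description. The original instance is left unchanged, so both the state before and after processing remain available.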

Implications for Terminology 1
- Statistical information systems use terminology from different sources: Statistics, Computer Science, Economics, Social Sciences, …
- We can at best collect terminology and bring it into an order according to some model
- This ordering defines a “statistical ontology”, which is above terminology and has to use a few commonly agreed terms

Implications for Terminology 2
Activities of METANET WG 4 showed that there is rather large agreement between statisticians about the main terms for important structures:
- Statistical Unit
- Statistical Variable
- Statistical Values

Implications for Terminology 3
Based on such agreement we can assign each terminology item a specification according to the ontology
UMAS proposes the following classification
- What is the source of the terminology (e.g. statistics, general standards, application area, …)?
- To which structure does the term apply?
- For which view is the term used?
- In which processing stage is the term used?
- Which function aspects does the term cover?

Implications for Terminology 4
MCV (SDMX) proposes the following classification
- Administration (close relation to administrative view)
- Concepts, Definitions, Standards (close relation to conceptual view and structure)
- Data Collection, manipulating/accounting convention (close relation to stage facet)
- Quality and performance metadata (close relation to function facet in connection with dissemination and exchange)

Summary
- Statistical processing activities define metadata requirements usually not considered in traditional data modelling
- The processing activities require a model which supports flexible coupling of entities according to processing needs
- Besides terminology, we also need a specification of the terminology in the context of a statistical ontology

Thank you!
wilfried.grossmann@univie.ac.at