Who is this poster for? What is its aim?
DDI introduces new terminology and formalises the definitions and relationships amongst the things it deals with: the DDI “Artefacts”. This poster provides an overview of these artefacts and shows how they relate to each other. It is not technical. Senior management and senior statistical staff in organisations using DDI as a key standard should understand DDI at this level.

Link types shown on the diagram: one to many; zero to many; one to one (required); one to one (optional).

[Diagram labels: Coverage, Dimensions, Universe, Instrument, Analysis Unit, Variable]

Coverage
Describes the temporal, topical, and geographic coverage of the study in terms of dates, a controlled vocabulary, and geographic structures. Coverage links to Geographic Structures and Geographic Locations for geographic coverage.

Data Collection
A Data Collection is a container holding metadata related to the collection phase. It is shown here simply to link in some metadata that does not have other links. There are several other similar containers not shown on this diagram.

Methodology
Describes the sampling procedures used for the study. Essentially textual and fairly limited.

Universe
A Universe is what statisticians often call a “Population”. It identifies the real-world entities that the data will describe. There are levels of sub-universes to describe sub-groups of the population.

Processing Event
Describes processes that will be applied to the collected data (like editing and weighting), with some facilities to record details of software and parameters.

Collection Event
Describes events planned in the collection process. Essentially textual.

DDI Containers
DDI metadata is organised into “Containers”: higher-level groupings of metadata, some of which are very important and some of which are less important. Most containers are not shown on this diagram.

Analysis Unit
An Analysis Unit identifies the units in the Universe that are the subject of the linked item. Analysis units are identified by a controlled vocabulary. Variables relate to particular units in the Universe (population).
Variable
A Variable is a key DDI construct. It links to a Concept, which indicates what its value means. It links to Question Items and Control Constructs, which are the most common sources of data. It links to Processing Events that may alter data, and it links to Code Schemes for its representation (ie, the actual values it holds). Record Layouts link to Variables to describe record structures.

Logical Record
A Logical Record collects the Variables that describe some unit from the Universe, eg a Person. Logical Records ultimately form the basis for stored physical records.

Concept
Concepts are key building blocks in DDI. Everything used in describing data must be defined as a concept. So must many other items related to processing or option choices.

Data Element Concept
A Data Element Concept is an explicit ISO 11179 version of a Concept. Concepts may be described either in the DDI format or the ISO 11179 format.

Data Relationship
A Data Relationship links the Logical Records appearing in a Physical Structure and holds the Record Relationship information.

Physical Structure
Describes the physical layout of records in a file structure.

Physical Instance
A Physical Instance describes an actual data file or set of data in a database.

Record Layout
Describes the physical layout (giving the actual physical position of Variables in records).

Link notes:
A Variable references just one Concept that defines what it represents.
A Variable may link to a number of Questions that may provide its value.
A Variable may link to a Code Scheme for its representation.
A Logical Record is made up of a number of Variables.
A Data Relationship may contain multiple Record Relationships linking related Logical Records.
Control Constructs may assign values to Variables or use them in loop control.

[Diagram labels: Source Record, Target Record, Record Relationship, Data Relationship, Physical Structure, Physical Instance]

Response Domain
Questions must have a Response Domain indicating what are valid responses.
There are various types of response domain, including Code Domain and Category Domain.

[Diagram labels: Response Domain, Code Domain, Category Domain, Code Scheme, Interviewer Instructions]

Link notes:
The conditional constructs use the values of Question Items to control the sequencing.
Multiple Question Items refer to lower-level Question Items and Multiple Question Items to make a multi-level question module.

The highest level containers are very important. A Study Unit corresponds to a single cycle of a survey and contains metadata for the cycle. Study Units can be contained inside Groups or Sub-Groups (which can be inside Groups or other Sub-Groups) to bring all the cycles of a collection together (in a Group) or to bring related collections and all their cycles together (in Sub-Groups inside a Group). Collections and cycles then share metadata defined in their containers. A Resource Package is a container for explicitly passing metadata to someone else. A Local Holding Package records local modifications to metadata from a Resource Package. The outer DDI container is a DDI Instance.

Identifiable, Versionable, and Maintainable
Most DDI metadata is Identifiable, Versionable, and Maintainable.
Identifiable means it has a unique (apart from versions) Id that can be used to reference it.
Versionable means it has a Version number so that there may be multiple versions with the same Id.
Maintainable means it has an Owner (an organisation or organisation unit) who is responsible for it.
Most DDI metadata has a status represented as an IsPublished attribute. If IsPublished is false, the metadata is considered “test” and may be modified but should not be used for production purposes. If IsPublished is true, the metadata is available for production use but may not be modified in any way. When modifications are actually needed for business reasons, a new version must be produced with the same Id, and both versions are then available. Thus existing production users are protected from changes (though they may of course move to the new version).
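The Identifiable, Versionable, and Maintainable rules described above can be illustrated with a minimal Python sketch. It is not part of DDI or of any DDI tool: the `MetadataItem` and `Registry` classes, their method names, and the example values (`"example-item"`, `"ExampleOrg"`) are all hypothetical, and the sketch assumes integer version numbers purely for simplicity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetadataItem:
    """Hypothetical stand-in for one piece of DDI metadata."""
    item_id: str                # Identifiable: unique apart from versions
    version: int                # Versionable: several versions may share one Id
    owner: str                  # Maintainable: the responsible organisation
    content: str
    is_published: bool = False  # False means "test", True means production


class Registry:
    """Toy in-memory store keyed by (Id, version). Illustrative only."""

    def __init__(self):
        self._items = {}

    def add(self, item):
        key = (item.item_id, item.version)
        if key in self._items:
            raise ValueError(f"{key} already exists")
        self._items[key] = item

    def get(self, item_id, version):
        return self._items[(item_id, version)]

    def modify(self, item_id, version, content):
        # Only unpublished ("test") metadata may be changed in place.
        item = self._items[(item_id, version)]
        if item.is_published:
            raise PermissionError("published metadata may not be modified")
        self._items[(item_id, version)] = replace(item, content=content)

    def publish(self, item_id, version):
        item = self._items[(item_id, version)]
        self._items[(item_id, version)] = replace(item, is_published=True)

    def new_version(self, item_id, content):
        # A change to published metadata yields a new version with the same
        # Id; the earlier version stays available to existing users.
        latest = max(v for (i, v) in self._items if i == item_id)
        base = self._items[(item_id, latest)]
        item = replace(base, version=latest + 1, content=content,
                       is_published=False)
        self.add(item)
        return item


# Example: publish version 1, then make a change as version 2.
registry = Registry()
registry.add(MetadataItem("example-item", 1, "ExampleOrg", "first text"))
registry.publish("example-item", 1)
v2 = registry.new_version("example-item", "second text")
```

In this sketch, calling `registry.modify("example-item", 1, ...)` after publication raises an error, while version 2 remains editable until it is published in turn.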
At the lowest level are Schemes, discussed in a box below. The containers at the next level have somewhat curious names: Conceptual Component, Data Collection (actually shown above), Logical Product, and others. As containers they group sets of related metadata but do not have much logical significance.

[Diagram labels: Other Domain types (text, numeric, date-time, geographic), Control Construct, Question Item and Multiple Question Item, Record Layout]

Link notes:
Concepts may refer to similar concepts.
Control Constructs refer to other Control Constructs in Sequence, If-Then-Else, Repeat, and Loop constructs.

Question Item and Multiple Question Item
Questions are the basic elements laid out in a questionnaire. They have multi-lingual, possibly conditional, text and a “Question Intent”. A Question can relate to multiple Concepts to identify what is being collected and how it may be qualified (eg “Income from Non-Farm activities”). Multiple Question Items define what are sometimes called “Question Modules”: complex questions that include follow-up questions on a particular item.

Control Construct
Control Constructs determine the selection and sequencing of Question Items and questionnaire text. They also link to Interviewer Instructions. There are a variety of Control Construct types:
a Sequence construct that performs a sequence of other Control Constructs;
If-Then-Else constructs that choose which Construct is selected;
RepeatUntil, RepeatWhile, and Loop constructs that allow looping over other constructs;
a Question construct that actually selects a Question Item (or Multiple Question Item);
a Statement construct that simply adds text to the questionnaire;
a Computation construct that assigns a value to a Variable.

Instrument
An Instrument is the starting point for producing a questionnaire. It identifies the top-level Control Construct that lays out the questionnaire; mostly it would identify a Sequence construct. It also uses controlled vocabularies to identify instrument type and the software associated with it.
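The way Control Constructs sequence a questionnaire can be sketched as a small tree walk in Python. This is only an illustration under stated assumptions: the class names loosely follow the construct types named above, but the fields, the `run` function, and the example question names (`has_job`, `hours_worked`) are hypothetical and are not taken from the DDI schema. The looping constructs (RepeatUntil, RepeatWhile, Loop) are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Statement:
    text: str                  # text added to the questionnaire


@dataclass
class Question:
    item: str                  # name of the Question Item to ask


@dataclass
class Computation:
    variable: str              # Variable that receives the value
    expression: Callable[[Dict[str, Any]], Any]


@dataclass
class Sequence:
    children: List[Any]        # other constructs, performed in order


@dataclass
class IfThenElse:
    condition: Callable[[Dict[str, Any]], bool]
    then: Any
    otherwise: Optional[Any] = None


def run(construct, responses, values=None, script=None):
    """Walk the construct tree from the top-level construct.

    `responses` simulates a respondent's answers. Returns the ordered
    questionnaire script and the values collected or computed on the way.
    """
    values = {} if values is None else values
    script = [] if script is None else script
    if isinstance(construct, Sequence):
        for child in construct.children:
            run(child, responses, values, script)
    elif isinstance(construct, IfThenElse):
        # Conditional constructs use earlier values to control sequencing.
        branch = construct.then if construct.condition(values) else construct.otherwise
        if branch is not None:
            run(branch, responses, values, script)
    elif isinstance(construct, Question):
        script.append(f"ASK {construct.item}")
        values[construct.item] = responses[construct.item]
    elif isinstance(construct, Statement):
        script.append(f"SAY {construct.text}")
    elif isinstance(construct, Computation):
        values[construct.variable] = construct.expression(values)
    else:
        raise TypeError(f"unknown construct: {construct!r}")
    return script, values


# A top-level Sequence, as an Instrument would typically identify.
questionnaire = Sequence([
    Statement("Welcome"),
    Question("has_job"),
    IfThenElse(
        condition=lambda v: v["has_job"] == "yes",
        then=Question("hours_worked"),
        otherwise=Statement("Skip to next section"),
    ),
    Computation("employed", lambda v: v["has_job"] == "yes"),
])

script, values = run(questionnaire, {"has_job": "no"})
```

Here the respondent answers "no", so the walk takes the `otherwise` branch and never asks `hours_worked`; the Computation still assigns a value to the `employed` variable.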
Link notes:
Question Constructs select a question to be included in the questionnaire to be asked of particular units in the Universe (population).
Geographic Structures and Geographic Locations provide geographic references for Universes.

[Diagram labels: Concept, Data Element Concept, Logical Record, Processing Event, Data Collection, Measures, Geographic Structure, Geographic Location, NCube, Link legend]

Variables are the containers that will hold the values from question responses and that will be organised into records in data sets or databases. A Variable is associated with just one Concept, which its data value represents. It may be associated with multiple Question Items that may provide it with values. Values can also come from Computation Control Constructs, which may be referenced from other places (eg Processing Event).

A Record Relationship defines the relationship between related records, say Household and Person.

NCube
An NCube describes a table to be produced from the data. Tables are described in terms of “Dimensions” and “Measures” (sometimes called enumeration variables by statisticians). Both dimensions and measures are identified by variables.

DDI 3.1 Metadata Artefacts
How the major artefacts relate to each other
DDI Artefacts Poster V 1.0, March 2012

The DDI Artefact diagram
This diagram shows the major DDI metadata artefacts. It describes DDI Version 3.1. A new version, DDI V 3.2, is expected to be standardised during 2012. The diagram is primarily intended to allow management and business area people to get a broad overview of DDI without having to come to grips with the schema proper, which is large and difficult. There is of course much internal detail for each artefact, and this can be found in the DDI Schemas using a tool such as XMLSpy. The schemas can be downloaded from the DDI Alliance web site at www.ddialliance.org.

Interviewer Instructions are the familiar instructions to interviewers that most survey organisations use.
They are linked in from the Control Constructs as the questionnaire is sequenced.

DDI Schemes
Most of the DDI metadata artefacts are organised into “Schemes”. A few Schemes (Code Schemes and Category Schemes) are shown on this diagram because we do not much use their items individually; we mainly deal with the collection as a whole. There are Concept Schemes, Variable Schemes, Question Schemes, Universe Schemes, Control Construct Schemes, and many more. A scheme is a collection of items and is maintainable (it has an owning organisation). In all cases schemes can include items from other schemes of the same type, allowing sharing and reuse of the metadata and allowing some explicit indication of how new versions of a scheme relate to earlier versions.

Code Scheme
Code Schemes correspond to what statisticians call a “Classification”, but they are also used for other purposes. Code Schemes link Codes to Categories from one or more Category Schemes. Examples of codes are “FR” linked to “France”, “10314” linked to “Butcher”, “27” linked to “Cancer”, “F” linked to “Female”. Code Schemes are also used for “Controlled Vocabularies”: lists of standard words or names that may be used in some context.

Category Scheme
Category Schemes enumerate “Categories”: values of attributes of population units. Examples of categories are “France”, “Butcher”, “Cancer”, “Bananas”, “Female”.

Link notes:
Category Schemes may include Categories from other Category Schemes, allowing the reuse of categories in different contexts or mapping the evolution of a Category Scheme over versions.
Code Schemes may include Codes from other Code Schemes, mapping the evolution of a Code Scheme (a Classification) over versions.

Controlled Vocabularies
Controlled Vocabularies are mentioned several times on this diagram. A controlled vocabulary is a standardised set of terms used in some context, eg for instrument type, or to identify the process used for estimation, aggregation, weighting, or creating an index.
A controlled vocabulary is defined in a Code Scheme. Controlled Vocabularies are not predefined. Organisations create their own (or may share them) for their own purposes.

© Australian Bureau of Statistics 2012