Major DDI Artefacts

Who is this poster for? What is its aim?

DDI introduces new terminology and formalises the definitions and relationships amongst the
things it deals with – the DDI “Artefacts”.

This poster provides an overview of these artefacts and shows how they relate to each other.
It is not technical. Senior management and senior statistical staff in organisations using DDI
as a key standard should understand DDI at this level.
Coverage links to Geographic Structures and Geographic Locations for geographic coverage
one to many
zero to many
one to one required
one to one optional
Describes the temporal, topical, and
geographic coverage of the study in
terms of dates, a controlled
vocabulary, and geographic
structures
Coverage
Dimensions
A Data Collection is a container holding metadata
related to the collection phase. It is shown here
simply to link in some metadata that does not have
other links. There are several other similar
containers not shown on this diagram
Methodology
Describes the sampling
procedures used for the study.
Essentially textual and fairly
limited
A Universe is what statisticians often call
a “Population”. It identifies the real-world
entities that the data will describe. There
are levels of sub-universes to describe
sub-groups of the population
Describes processes that will be applied to the
collected data (like editing and weighting) with
some facilities to record details of software and
parameters
Collection Event
Describes events planned in the
collection process. Essentially textual.
DDI Containers
DDI metadata is organised into “Containers” – higher-level groupings of
metadata, some of which are very important and some of which are less
important. Most containers are not shown on this diagram.
Universe
Instrument
Analysis Unit
Variables relate to particular
units in the Universe
(population)
Variable
An Analysis Unit identifies the units in
the Universe that are the subject of
the linked item. Analysis units are
identified by a controlled vocabulary.
A Variable is a key DDI construct. It links to a Concept which
indicates what its value means. It links to Question Items and
Control Constructs which are the most common source of data.
It links to Processing Events that may alter data, and it links to
Code Schemes for its representation (ie, the actual values it
holds). Record Layouts link to Variables to describe record
structures.
A Logical Record collects the Variables
that describe some unit from the
Universe, eg a Person. Logical Records
ultimately form the basis for stored
physical Records
A Variable references just
one Concept that defines
what it represents
A Variable may link to a
number of Questions that
may provide its value
A Logical Record is made up
of a number of Variables
Target Record
Record
Relationship
A Data Relationship may
contain multiple Record
Relationships linking
related Logical Records
A Data Relationship links the
Logical Records appearing in a
Physical Structure and holds the
Record Relationship information.
A Physical Instance describes
an actual data file or set of data
in a database.
Data
Relationship
Control Constructs may
assign values to Variables
or use them in Loop
Control
Concepts are key building blocks in DDI.
Everything used in describing data must
be defined as a concept. So must many
other items related to processing or
option choices
A Data Element Concept is an explicit ISO
11179 version of a Concept. Concepts may
be described either in the DDI format or the
ISO 11179 format.
Source Record
A Variable may link to a
Code Scheme for its
representation
Physical
Structure
Describes the physical
layout of records in a file
structure
Physical
Instance
Describes the physical layout
(giving the actual physical
position of Variables in records)
Questions must have a
Response Domain indicating
what are valid responses. There
are various types of response
domain
Response Domain
Code Domain
Identifiable, Versionable, and Maintainable
Most DDI metadata is Identifiable, Versionable, and Maintainable. Identifiable means it has a unique (apart from
versions) Id that can be used to reference it. Versionable means it has a Version number so that there may be multiple
versions with the same Id. Maintainable means it has an Owner (an organisation or organisation unit) who is
responsible for it.
Most DDI metadata has a status represented as an IsPublished attribute. If IsPublished is false the metadata is
considered “test” and may be modified but should not be used for production purposes. If IsPublished is true the
metadata is available for production use but may not be modified in any way. When modifications are actually needed
for business reasons a new version must be produced, with the same Id, with both versions then being available. Thus
existing production users are protected from changes (though they may of course move to use the new version).
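The versioning rule above can be sketched in a few lines of Python. This is only an illustration: the class `MetadataItem`, its field names, and the helper functions `modify` and `publish` are hypothetical and are not taken from the DDI schema. The sketch assumes that changing a published item yields a new, unpublished version with the same Id.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetadataItem:
    """Hypothetical stand-in for an Identifiable, Versionable, Maintainable item."""
    id: str                     # Identifiable: unique apart from versions
    version: int                # Versionable: several versions may share one Id
    owner: str                  # Maintainable: the responsible organisation
    label: str
    is_published: bool = False  # mirrors the IsPublished attribute


def modify(item: MetadataItem, new_label: str) -> MetadataItem:
    """Apply a change while respecting the IsPublished rule."""
    if item.is_published:
        # Published metadata is never altered: a new, unpublished version is made.
        return replace(item, version=item.version + 1,
                       label=new_label, is_published=False)
    # Unpublished ("test") metadata may be changed at the same version.
    return replace(item, label=new_label)


def publish(item: MetadataItem) -> MetadataItem:
    """Mark an item as available for production use."""
    return replace(item, is_published=True)


draft = MetadataItem(id="concept-income", version=1, owner="example.org", label="Incme")
fixed = modify(draft, "Income")          # still version 1, still unpublished
v1 = publish(fixed)
v2 = modify(v1, "Household income")      # version 2; v1 is left untouched
```

Because the dataclass is frozen, `v1` cannot be changed after publication, so anything that references version 1 keeps seeing the same content.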
Code Scheme
The conditional constructs use the values of Question Items to control the sequencing.
Interviewer
Instructions
Multiple Question Items refer to
lower-level Question Items and
Multiple Question Items to make
a multi-level question module
Category Domain
The highest level containers are very important. A Study Unit corresponds to
a single cycle of a survey and contains metadata for the cycle. Study Units
can be contained inside Groups or Sub-Groups (which can be inside Groups
or other Sub-Groups) to bring all the cycles of a collection together (in a
Group) or to bring related collections and all their cycles together (in Sub-Groups
inside a Group). Collections and cycles then share metadata defined
in their containers. A Resource Package is a container for explicitly passing
metadata to someone else. A Local Holding Package records local
modifications to metadata from a Resource Package. The outer DDI
container is a DDI Instance.
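One way to picture metadata sharing through containers is a lookup that starts at a Study Unit and walks outwards through its enclosing Sub-Groups and Group. The sketch below is an assumption about how such sharing could be resolved, not a statement of the DDI inheritance rules; the class `Container`, the method `resolve`, and all example names and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Container:
    """Hypothetical container: a Group, Sub-Group or Study Unit."""
    name: str
    kind: str                                  # "Group", "SubGroup" or "StudyUnit"
    parent: Optional["Container"] = None       # the enclosing container, if any
    metadata: Dict[str, str] = field(default_factory=dict)

    def resolve(self, key: str) -> Optional[str]:
        """Look for a metadata value here first, then in enclosing containers."""
        node: Optional[Container] = self
        while node is not None:
            if key in node.metadata:
                return node.metadata[key]
            node = node.parent
        return None


# Illustrative hierarchy: one Group, one collection (Sub-Group), one cycle (Study Unit).
group = Container("Labour statistics", "Group",
                  metadata={"owner": "Example Agency"})
collection = Container("Example Labour Survey", "SubGroup", parent=group,
                       metadata={"universe": "Civilian population"})
cycle = Container("Example Labour Survey, March 2012", "StudyUnit", parent=collection,
                  metadata={"reference_period": "2012-03"})
```

Here `cycle.resolve("universe")` finds the value defined once on the collection, and `cycle.resolve("owner")` finds the value defined on the Group.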
Control Constructs determine the selection and sequencing of Question Items and
questionnaire text. They also link to Interviewer Instructions. There are a variety of Control
Construct types – a Sequence construct that performs a sequence of other Control
Constructs, If-Then-Else constructs that choose which Construct is selected, RepeatUntil,
RepeatWhile, and Loop that allow looping over other constructs, a Question construct that
actually selects a Question Item (or Multiple Question Item), a Statement construct that
simply adds text to the questionnaire, and a Computation Construct that assigns a value to a
Variable.
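The way Control Constructs sequence a questionnaire can be sketched as a small tree that is walked from the top. The classes below are hypothetical simplifications of four of the construct types named above (the looping and Computation constructs are omitted), and the `walk` function and the scripted `answers` dictionary are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class QuestionConstruct:
    question_item: str                           # the Question Item it selects


@dataclass
class StatementConstruct:
    text: str                                    # text added to the questionnaire


@dataclass
class Sequence:
    children: List["Construct"]                  # performed in order


@dataclass
class IfThenElse:
    condition: Callable[[Dict[str, str]], bool]  # tests earlier responses
    then: "Construct"
    otherwise: Optional["Construct"] = None


Construct = Union[QuestionConstruct, StatementConstruct, Sequence, IfThenElse]


def walk(construct: Construct, answers: Dict[str, str]) -> List[str]:
    """Return the questionnaire items in the order they would be presented."""
    if isinstance(construct, QuestionConstruct):
        return [f"Q: {construct.question_item}"]
    if isinstance(construct, StatementConstruct):
        return [f"Text: {construct.text}"]
    if isinstance(construct, Sequence):
        items: List[str] = []
        for child in construct.children:
            items.extend(walk(child, answers))
        return items
    if isinstance(construct, IfThenElse):
        branch = construct.then if construct.condition(answers) else construct.otherwise
        return walk(branch, answers) if branch is not None else []
    raise TypeError(f"Unknown construct: {construct!r}")


# An Instrument would point at a top-level construct such as this Sequence.
top_level = Sequence([
    StatementConstruct("Welcome"),
    QuestionConstruct("employed"),
    IfThenElse(lambda a: a.get("employed") == "yes",
               QuestionConstruct("occupation"),
               QuestionConstruct("seeking_work")),
])

employed_path = walk(top_level, {"employed": "yes"})
not_employed_path = walk(top_level, {"employed": "no"})
```

The two paths differ only in the final question, which is the branch chosen by the If-Then-Else construct from the response to the earlier Question Item.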
Concepts may refer
to similar concepts
Other Domain types (text,
numeric, date-time,
geographic)
At the lowest level are Schemes, discussed in a box below. The containers at the next
level have somewhat curious names – Conceptual Component, Data Collection
(actually shown above), Logical Product, and others. As containers they
group sets of related metadata but do not have much logical significance.
Control Constructs refer to other Control
Constructs in Sequence, If-Then-Else, Repeat,
and Loop constructs
Control
Construct
Questions are the basic elements laid out in a questionnaire. They
have multi-lingual, possibly conditional, text and a “Question Intent”.
A Question can relate to multiple Concepts to identify what is being
collected and how it may be qualified (eg “Income from Non-Farm
activities”). Multiple Question Items define what are sometimes
called “Question Modules” – complex questions that include
follow-up questions on a particular item
Question Item and
Multiple Question Item
Record
Layout
An Instrument is the starting point
for producing a questionnaire. It
identifies the top-level Control
Construct that lays out the
questionnaire – mostly it would
identify a Sequence construct. It
also uses controlled vocabularies
to identify instrument type and the
software associated with it.
Question Constructs select a
question to be included in the
questionnaire to be asked of
particular units in the
Universe (population)
Concept
Data Element Concept
Logical Record
Processing Event
Data Collection
Geographic Structures and Geographic Locations
provide geographic references for Universes
Measures
Variables are the containers that will hold the values from
question responses and that will be organised into records in data
sets or databases. A Variable is associated with just one
Concept, which is what its data value represents. It may be
associated with multiple Question Items that may provide it with
values. Values can also come from Computation Control
Constructs which may be referenced from other places (eg
Processing Event).
DDI Artefacts Poster V 1.0 – March 2012
The diagram is primarily intended to allow management and business area people to get a broad
overview of DDI without having to come to grips with the schema proper, which is large and difficult.
Geographic Structure
NCube
Link legend
A Record Relationship defines
the relationship between related
records, say Household and
Person.
This diagram shows the major DDI metadata artefacts. It describes DDI Version 3.1. A new
version DDI V 3.2 is expected to be standardised during 2012. There is of course much internal
detail for each artefact and this can be found in the DDI Schemas using a tool such as XMLSpy.
The schemas can be downloaded from the DDI Alliance web site at www.ddialliance.org.
How the major artefacts relate to each other
Geographic Location
An NCube describes a table to be
produced from the data. Tables are
described in terms of “Dimensions” and
“Measures” (sometimes called
enumeration variables by statisticians).
Both dimensions and measures are
identified by variables.
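As a rough illustration of dimensions and measures, the sketch below builds a table from a handful of records. The function `build_ncube`, the variable names, and the sample records are all hypothetical, and summing the measure within each cell is an assumption: DDI describes the table, it does not prescribe how the cells are computed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_ncube(records: List[Dict[str, object]],
                dimensions: Sequence[str],
                measure: str) -> Dict[Tuple[object, ...], float]:
    """Sum the measure variable within each combination of dimension values."""
    cells: Dict[Tuple[object, ...], float] = defaultdict(float)
    for record in records:
        # The cell is addressed by the values of the dimension variables.
        key = tuple(record[d] for d in dimensions)
        cells[key] += float(record[measure])
    return dict(cells)


# Invented sample data: each dictionary plays the role of one record.
records = [
    {"sex": "F", "region": "North", "income": 100.0},
    {"sex": "F", "region": "North", "income": 50.0},
    {"sex": "M", "region": "North", "income": 70.0},
    {"sex": "F", "region": "South", "income": 30.0},
]

# Dimensions and the measure are each identified by a variable name.
table = build_ncube(records, dimensions=("sex", "region"), measure="income")
```

Each key of `table` is one cell of the NCube, for example `("F", "North")`, and its value is the total of the measure for the records falling in that cell.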
The DDI Artefact diagram
DDI 3.1 Metadata Artefacts
Interviewer Instructions are the familiar instructions to
interviewers that most survey organisations use.
They are linked in from the Control Constructs as the
questionnaire is sequenced
DDI Schemes
Category Schemes may include
Categories from other Category
Schemes allowing the reuse of
categories in different contexts or
mapping the evolution of a
Category Scheme over versions
Most of the DDI metadata artefacts are organised into “Schemes”. A few
Schemes (Code Schemes and Category Schemes) are shown on this
diagram because we do not much use their items individually – we mainly
deal with the collection as a whole. There are Concept Schemes, Variable
Schemes, Question Schemes, Universe Schemes, Control Construct
Schemes and many more.
A scheme is a collection of items and is maintainable (it has an owning
organisation). In all cases schemes can include items from other schemes of
the same type, allowing sharing and reuse of the metadata and allowing
some explicit indication of how new versions of a scheme relate to earlier
versions.
Category Scheme
Code Schemes correspond to what statisticians call a
“Classification” but they are also used for other purposes.
Code Schemes link Codes to Categories from one or more
Category Schemes. Examples of codes are “FR” linked to
“France”, “10314” linked to “Butcher”, “27” linked to
“Cancer”, “F” linked to “Female”.
Code Schemes are also used for “Controlled
Vocabularies” – lists of standard words or names that may
be used in some context.
Code Schemes may include
Codes from other Code
Schemes mapping the evolution
of a Code Scheme (a
Classification) over versions
Category Schemes enumerate
“Categories” – values of attributes of
population units. Examples of
categories are “France”, “Butcher”,
“Cancer”, “Bananas”, “Female”
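The split between Codes and Categories can be sketched as follows, reusing two of the examples given on this poster. The classes `Category`, `CategoryScheme` and `CodeScheme` and their methods are hypothetical simplifications, not the DDI schema types; the scheme names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Category:
    label: str                      # e.g. "France" or "Female"


@dataclass
class CategoryScheme:
    name: str
    categories: Dict[str, Category] = field(default_factory=dict)

    def add(self, label: str) -> Category:
        """Create a Category in this scheme and return it."""
        category = Category(label)
        self.categories[label] = category
        return category


@dataclass
class CodeScheme:
    name: str
    codes: Dict[str, Category] = field(default_factory=dict)

    def link(self, code: str, category: Category) -> None:
        """Link a code value to a Category from any Category Scheme."""
        self.codes[code] = category

    def decode(self, code: str) -> str:
        """Return the label of the Category that a code stands for."""
        return self.codes[code].label


countries = CategoryScheme("Countries")
sexes = CategoryScheme("Sexes")

# One Code Scheme may draw its Categories from more than one Category Scheme.
scheme = CodeScheme("Example codes")
scheme.link("FR", countries.add("France"))
scheme.link("F", sexes.add("Female"))
```

Keeping the label in the Category and only the value in the Code Scheme means the same Category could be given a different code in another Code Scheme without being redefined.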
Controlled Vocabularies
Controlled Vocabularies are mentioned several times on this diagram. A controlled
vocabulary is a standardised set of terms used in some context – eg, for instrument
type, to identify the process used for estimation, aggregation, weighting, or creating an
index. A controlled vocabulary is defined in a Code Scheme.
Controlled Vocabularies are not predefined. Organisations create their own (or may share
them) for their own purposes.
© Australian Bureau of Statistics 2012