DDI TRAINING
WORKSHOP
Wendy Thomas
November 28-29, 2012
Overview of Workshop – Day 1
• DDI Use Cases
• Identification, Versioning, and Referencing
• Modules (structural overview)
• Questionnaire content and layout
• Concepts, Variables, Logical Record, Physical Store
• Use of DDI within a research process
• Use of DDI within a archival/management system
Overview of Workshop – Day 2
• SND Issue areas / information and discussion
• Geography and DDI
• DDI 3.1 changes and the future of DDI-L
• Tools and resources
• The status of DDI-RDF
Credits
• Unspecified slides - Wendy Thomas (MPC)
• DDI in 60 Seconds – Arofan Gregory, (ODaF)
• OAIS diagram - Herve L’Hours (UKDA)
• Remainder of Slides (source indicator in upper left):
• The slides were developed for several DDI workshops at IASSIST
conferences and at GESIS training in Dagstuhl/Germany
• Major contributors
• Wendy Thomas, Minnesota Population Center
• Arofan Gregory, Open Data Foundation
• Further contributors
• Joachim Wackerow, GESIS – Leibniz Institute for the Social Sciences
• Pascal Heus, Open Data Foundation
• Attribute: http://creativecommons.org/licenses/by-sa/3.0/legalcode
5
S01
License
Details on next slide.
6
S01
License (cont.)
On-line available at: http://creativecommons.org/licenses/by-sa/3.0/
This is a human-readable summary of the Legal Code at:
http://creativecommons.org/licenses/by-sa/3.0/legalcode
7
S03
DDI-L Lifecycle Model
Metadata Reuse
Learn DDI-L in
60 Seconds
using
Survey
Instruments
Study
made up of
measures
about
Questions
Concepts
Universes
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
with values of
Categories/
Codes,
Numbers
Questions
Variables
collect
made up of
Responses
Data Files
resulting in
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
THAT’S PRETTY MUCH IT.
Studies
Concepts
Variables
Concepts
Codes
Physical
Location
Categories
Summary
Statistics
13
S06
DDI-L USE CASES
Learning DDI: Pack S06
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
S06
14
Archival Ingestion and
Metadata Value-Add
• This use case concerns how DDI 3 can support the ingest
and migration functions of data archives and data
libraries.
S06
Supports automation
of processing if good
DDI metadata is
captured upstream
<DDI 3>
[Full metadata set]
(?)
+
Microdata/
Aggregates
Provides good format &
foundation for valueadded metadata by
archive
Provides a neutral
format for data
migration as analysis
packages are
versioned
Ingest
Processing
<DDI 3>
[Full or
additional
metadata]
Archival events
Dissemination
Systems
Data Archive
Data Library
Can package
Data and
metadata for
preservation
purposes
– populate
other standard
formats
Preservation
Systems
16
S06
<g:LocalHoldingPackage>
<s:StudyUnit>
<s:StudyUnit>
with full content
OR
<g:Group>
with full content
+
new value added content
<a:Archive>
<a:LifeCycleEvents>
capture ingest
processing events
S06
Data Dissemination/Data Discovery
• This use case concerns how DDI-L can support the
discovery and dissemination of data.
17
S06
<DDI-L>
Can add archival
events meta-data
Codebooks
<DDI-L>
[Full metadata set]
+
Microdata/
Aggregates
Rich metadata supports
auto-generation of websites,
packages of specific, related
materials, and other delivery
formats and applications
Websites
Databases,
repositories
Research
Data Centers
Data-Specific
Info Access
Systems
Registries
Catalogues
Question/Concept/
Variable Banks
19
S06
<c:ConceptScheme>
<c:UniverseScheme>
<c:GeographicStructureScheme>
<c:GeographicLocationScheme>
<d:QuestionScheme>
<d:ControlConstructScheme>
<l:VariableScheme>
<l:CategoryScheme>
<l:CodeScheme>
<p:PhysicalStructureScheme>
<p:RecordLayoutScheme>
<a:OrganizationScheme>
<s:StudyUnit> [descriptive content]
• Store as
separate
resources
• Use content to
feed a different
registry
structure
S06
20
Question/Concept/Variable Banks
• This use case describes how DDI 3 can support question, concept,
and variable banks. These are often termed “registries” or “metadata
repositories” because they contain only metadata – links to the data
are optional, but provide implied comparability. The focus is metadata
reuse.
S06
Because DDI has links, each type of bank
functions in a modular, complementary way.
<DDI 3>
Questions
Flow Logic
Codings
<DDI 3>
Variables
Categories
Codes
<DDI 3>
Concepts
Question
Bank
Variable
Bank
Concept
Bank
Supports but
does not require
ISO 11179
<DDI 3>
Questions
Flow Logic
Codings
Users
and
Applications
<DDI 3>
Variables
Categories
Codes
Users
and
Applications
<DDI 3>
Concepts
Users
and
Applications
S06
<g:ResourcePackage>
• Question Bank
• <d:QuestionScheme>
• <d:ControlConstructScheme>
• Variable Bank
• <l:CategoryScheme>
• <l:CodeScheme>
• <l:VariableScheme>
• Concept Bank
• <c:ConceptScheme>
22
S06
23
Questionnaire Generation, Data Collection, and
Processing
• This use case concerns how DDI 3 can support
the creation of various types of
questionnaires/CAI, and the collection and
processing of raw data into microdata.
S06
<DDI 3>
Concepts
Universes
Questions
Flow Logic
Final
Types of Metadata:
Paper
• Concepts (conceptual module)
Questionnaire • Universe (conceptual module)
• Questions (datacollection module)
• Flow Logic (datacollection module)
Online Survey • Variables (logicalproduct module)
Instrument
• Categories/Codes (logicalproduct module)
• Coding (datacollection module)
CAI
Instrument
Raw Data
DDI captures
the content – XML
allows for each
application to do
its own presentation
<DDI 3>
Concepts
Universes
Questions
Flow Logic
+
<DDI 3>
Variables
Coding
Microdata
+
<DDI 3>
Categories
Codes
Physical Data Product
Physical Instance
S06
studyunit.xsd
conceptualcomponent.xsd
datacollection.xsd
logicalproduct.xsd
physicaldataproduct.xsd
physicalinstance.xsd
Previous structure PLUS
<l:LogicalProduct>
<l:DataRelationship>
<l:VariableScheme>
<p:PhysicalDataProduct>
<p:PhysicalStructureScheme>
<p:RecordLayoutScheme>
<pi:PhysicalInstance>
25
S06
26
DDI For Use within a Research Project
• This use case concerns how DDI-L can support
various functions within a research project, from
the conception of the study through collection and
publication of the resulting data.
S06
Prinicpal
Investigator
Research Staff
Collaborators
<DDI-L>
Concepts
Universe
Methods
Purpose
People/Orgs
+
Submitted
Proposal
<DDI-L>
Funding
Revisions
$
€£
+
<DDI-L>
Variables
Physical Stores
<DDI-L>
Questions
Instrument
+
+
<DDI-L>
Data Collection
Data Processing
Presentations
+
Publication
Data
Archive/
Repository
S06
<s:StudyUnit>
<s:Abstract>
<s:Purpose>
<r:FundingInformation>
<c:ConceptualComponents>
<c:Concepts>
<c:Universe>
<d:DataCollection>
<d:Methodology>
<d:QuestionScheme>
<d:ControlConstructScheme>
<l:LogicalProduct>
<l:DataRelationship>
<l:CategoryScheme>
<l:CodeScheme>
<l:VariableScheme>
<p:PhysicalDataProduct>
<pi:PhysicalInstance>
<a:Archive>
<a:OrganizationScheme>
28
• Version 1.0.0
Preparing the
proposal for funding
S06
<s:StudyUnit>
<s:Abstract>
<s:Purpose>
<r:FundingInformation>
<c:ConceptualComponents>
<c:Concepts>
<c:Universe>
<d:DataCollection>
<d:Methodology>
<d:QuestionScheme>
<d:ControlConstructScheme>
<l:LogicalProduct>
<l:DataRelationship>
<l:CategoryScheme>
<l:CodeScheme>
<l:VariableScheme>
<p:PhysicalDataProduct>
<pi:PhysicalInstance>
<a:Archive>
<a:OrganizationScheme>
29
• Version 1.0.0
Preparing the
proposal for funding
• Version 1.1.0
Entering funding
information and
revising/versioning
earlier content
S06
<s:StudyUnit>
<s:Abstract>
<s:Purpose>
<r:FundingInformation>
<c:ConceptualComponents>
<c:Concepts>
<c:Universe>
<d:DataCollection>
<d:Methodology>
<d:QuestionScheme>
<d:ControlConstructScheme>
<l:LogicalProduct>
<l:DataRelationship>
<l:CategoryScheme>
<l:CodeScheme>
<l:VariableScheme>
<p:PhysicalDataProduct>
<pi:PhysicalInstance>
<a:Archive>
<a:OrganizationScheme>
30
• Version 1.0.0
Preparing the
proposal for funding
• Version 1.1.0
Entering funding
information and
revising/versioning
earlier content
• Version 2.0.0
Preparing for data
collection
S06
<s:StudyUnit>
<s:Abstract>
<s:Purpose>
<r:FundingInformation>
<c:ConceptualComponents>
<c:Concepts>
<c:Universe>
<d:DataCollection>
<d:Methodology>
<d:QuestionScheme>
<d:ControlConstructScheme>
<l:LogicalProduct>
<l:DataRelationship>
<l:CategoryScheme>
<l:CodeScheme>
<l:VariableScheme>
<p:PhysicalDataProduct>
<pi:PhysicalInstance>
<a:Archive>
<a:OrganizationScheme>
31
• Version 1.0.0
Preparing the
proposal for funding
• Version 1.1.0
Entering funding
information and
revising/versioning
earlier content
• Version 2.0.0
Preparing for data
collection
• Version 3.0.0
Completing the
study and preparing
the data
S06
32
Metadata Mining for Comparison, etc.
• This use case concerns how collections of DDI-L
metadata can act as a resource to be explored, providing
further insight into the comparability and other features of
a collection of data to help researchers identify data sets
for re-use.
Questions
Universe
Types of Metadata
•Universe (conceputualcomponent module)
•Concept (conceputualcomponent module)
Concepts •Question (datacollection module)
•Variable (logicalproduct module)
Variable
Metadata
Repositories/
Registries
<DDI-L>
Instances
?
Data Sets
<DDI-L>
Comparison
•Questions
•Categories
•Codes
•Variables
•Universe
•Concepts
Recodes
Harmonizations
S06
34
Register/Administrative Data
• This use case concerns how DDI-L can support
the retrieval, organization, presentation, and
dissemination of register data
S06
Generation Instruction (data collection module)
Lifecycle Events (Archive module)
Query/
Request
Register/
Administrative
Data Store
Other
Data
Collection
Processing (Data
Collection module)
Register
Admin.
Data
File
Variables, Categories, Codes,
Concepts, Etc.
Comparison/mapping
(Comparison module)
[Lifecycle continues normally]
Integrated Data Set
S06
<g:Group>
<cm:Comparison>
<s:StudyUnitReference>
<s:StudyUnit>
<d:DataCollection>
<d:Methodology>
<d:ProcessEvent>
<l:LogicalProduct>
<l:DataRelationship>
<l:VariableScheme>
<p:PhyscialDataProduct>
<pi:PhyscialInstance>
36
Emphasis is on
the process of
collection
May include
NCube Logical
Product
If data is obtained
from multiple
studies, Group and
comparison may
be used
Implementing GSBPM Content
•This use case concerns the use of DDI as an
underlying model within GSBPM and how DDI can
be used to implement the model
38
S20
The Generic Staistical Business Process
Model (GSBPM)
• The METIS group is a part of UN/ECE which addresses
metadata issues for national statistical agencies (and
other producers of official statistics)
• This community uses both SDMX and DDI
• They have produced a reference model of the statistical
production process
• The DDI 3 Lifecycle Model was a major input
• GSBPM has a much greater level of detail
S20
39
Getting into the details
• Some technical basics
• Identification, Versioning, and Reference
• Overall structures for organizing and packaging metadata
• Modules and Schemes
• Data capture
• Questionnaire structure
• Data description and storage
• Concepts, Variables, Records, Data files (physical stores)
• View from the bottom up
Identification, Versioning and
Reference
42
S08
Rationale
• Because several organizations are involved in the
creation of a set of metadata throughout the lifecycle flow:
• Rules for maintenance, versioning, and identification must be
universal
• Reference to other organization’s metadata is necessary for re-use
– and very common
43
S08
Maintenance Rules
• A maintenance agency is identified by a reserved code
based on its domain name (similar to it’s website and email)
• There is a register of DDI agency identifiers which we will look at
later in the course
• Maintenance agencies own the objects they maintain
• Only they are allowed to change or version the objects
• Other organizations may reference external items in their
own schemes, but may not change those items
• You can make a copy which you change and maintain, but once
you do that, you own it!
S08
44
Versioning Rules
• If a “published” object changes in any way, its
version changes
• This will change the version of any containing
maintainable object
• Typically, objects grow and are versioned as they
move through the lifecycle
• Versionables inherit their agency from the
maintainable object they live in at the time of
origin
45
S08
Versioning: Changes
ConceptScheme X
V 1.0.0
- Concept A v 1.0.0
- Concept B v 1.0.0 references
- Concept C v 1.0.0
ConceptScheme X
V 1.1.0
- Concept A v 1.1.0
- Concept B v 1.0.0 references
- Concept C v 1.1.0
Add:
Concept D v 1.0.0
Note: You can also reference entire
schemes and make additions
references
ConceptScheme X
V 2.0.0
- Concept A v 1.2.0
- Concept B v 1.0.0
- Concept C v 1.2.0
- Concept D v 1.1.0
Add:
Concept E v 1.0.0
ConceptScheme X
V 3.0.0
- Concept D v 1.1.0
- Concept E v 1.0.0
references
S08
46
Identifiable Rules
• Identifiers are assigned to each identifiable
object, and are unique within their maintainable
parent
• Identifiable objects inherit their version from their
containing versionable parent (if any) at their time
of origin
• Identifiable objects inherit their maintaining
agency from the maintainable object they live in
at the time of origin
S08
47
Maintainable, Versionable, and Identifiable
• DDI 3 places and emphasis on re-use
• This creates lots of inclusion by reference!
• This raises the issue of managing change over time
• The Maintainable, Versionable, and Identifiable scheme in
DDI was created to help deal with these issues
• An identifiable object is something which can be
referenced, because it has an ID
• A versionable object is something which can be
referenced, and which can change over time – it is
assigned a version number
• A maintainable object is something which is maintained by
a specified agency, and which is versionable and can be
referenced – it is given a maintenance agency
48
S08
Basic Element Types
Maintainable
Versionable
Identifiable
All ELEMENTS
Differences from DDI 1/2
--Every element is NOT identifiable
--Many individual elements or
complex elements may be
versioned
--A number of complex elements
can be separately maintained
49
S08
DDI 3.1 Identifiers
• There are two ways to provide identification for a DDI 3
object:
• Using a set of XML fields
• Using a specially-structured URN
• The structured URN approach is preferred
• URNs are a very common way of assigning a universal, public
identifier to information on the Internet
• However, they require explicit statement of agency, version, and ID
information in DDI 3
• Providing element fields in DDI 3 allows for much
information to be defaulted
• Agency can be inherited from parent element
• Version can be inherited or defaulted to “1.0.0”
50
S08
Parts of the Identification Series
• Identifiable Element
• Identifier:
• Variable
• Identifier:
• ID
• V1
• Identifying Agency
• us.mpc
• Version
• 1.1.0 [default is 1.0.0]
• Version Date
• 2007-02-10
• Version Responsibility
• Wendy Thomas
• Version Rationale
• Spelling correction
• UserID
• Object Source
51
S08
URN Detailed Example
This is a URN From DDI
In a variable scheme
The scheme agency is us.mpc
urn=“urn:ddi:us.mpc:VariableScheme.
VarSch01.1.4.0:Variable.V1.1.1.0”
With identifier
VarSch01
For a variable
Version 1.1.0
Version 1.4.0
Variable ID is V1
S08
52
Referencing
• When referencing an object, you must provide:
• The maintenance agency
• The identifier
• The version
• Often, these are inherited from a maintainable object
• This is part of their identification
53
S08
DDI References
• References in DDI may be within a single instance or
across instances
• Metadata can be re-packaged into many different groups and
instances
• “Internal” references are made to objects in the same instance
• “External” reference are made to objects in other DDI instances
• Identifiers must provide:
• The containing maintainable (a module or a scheme)
• Agency, ID, and Version
• The identifiable/versionable object
• ID (and version if versionable)
• Like identifiers, DDI references may be made using URNs
or element fields
S08
54
Reference Examples
• Internal
<VariableReference isReference=“true”
isExternal=“false” lateBound=“false”>
<Scheme isReference=“true” isExternal=“false”
lateBound=“false”>
<ID>VarSch01</ID>
<IdenftifyingAgency>us.mpc</IdentifyingAgency>
<Version>1.4.0</Version>
</Scheme>
<ID>V1</ID>
<IdenftifyingAgency>us.mpc</IdentifyingAgency>
<Version>1.1.0</Version>
</VariableReference>
S08
55
Reference Examples
• External
<VariableReference isReference=“true”
isExternal=“true” lateBound=“false”>
<urn>urn:ddi:us.mpc:VariableScheme.VarSch01.1.
4.0:Variable.V1.1.1.0</urn>
</VariableReference>
56
S09
DDI XML Schemas
and Main Structures
Learning DDI: Pack S09
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
S09
57
DDI-L Main Structures and Concepts
• XML Schemas
• DDI Modules
• DDI Schemes
• DDI Profiles
• A Simple Example
58
S09
XML Schemas, DDI Modules,
and DDI Schemes
XML Schemas
DDI Modules
Correspond
to a stage in
the lifecycle
<file>.xsd
<file>.xsd
<file>.xsd
<file>.xsd
May
Correspond
May
Contain
DDI Schemes
59
S09
XML Schemas
•
•
•
•
•
•
•
•
•
•
•
•
archive
comparative
conceptualcomponent
datacollection
dataset
dcelements
DDIprofile
ddi-xhtml11
ddi-xhtml11-model-1
ddi-xhtml11-modules-1
group
inline_ncube_recordlayout
• instance
• logicalproduct
• ncube_recordlayout
• physicaldataproduct
• physicalinstance
• proprietary_record_layout
• reusable
• simpledc20021212
• studyunit
• tabular_ncube_recordlayout
• xml
• set of xml schemas to support xhtml
S09
60
Reminder: DDI Modules and Schemes
• DDI has two important structures:
• “Modules”
• “Schemes”
• A module is a package of metadata corresponding to a
stage of the lifecycle or a specific structural function
• A scheme is a list of reusable metadata items of a specific
type
• Many DDI modules contain DDI schemes
61
S09
XML Schemas, DDI Modules,
and DDI Schemes
Data Collection
Instance
Study Unit
Logical Product
Reusable
Ncube
Physical Instance
Physical Data
Structure
Inline ncube
DDI Profile
Archive
Tabular ncube
Comparative
Conceptual
Component
Proprietary
Dataset
62
S09
XML Schemas, DDI Modules,
and DDI Schemes
Data Collection
Instance
Study Unit
Logical Product
Reusable
Ncube
Physical Instance
Physical Data
Structure
Inline ncube
DDI Profile
Archive
Tabular ncube
Comparative
Conceptual
Component
Proprietary
Dataset
63
S09
XML Schemas, DDI Modules,
and DDI Schemes
Data Collection
Question Scheme
Control Construct Scheme
Interviewer Instruction Scheme
Reusable
Logical Product
Instance
Category Scheme
Ncube
Code Scheme
Study Unit
Inline ncube
Variable Scheme
Physical Instance
NCube Scheme
Tabular ncube
Physical
Data
Structure
DDI Profile
Physical Structure Scheme Proprietary
Comparative
Record Layout Scheme
Dataset
Archive
Organization Scheme
Conceptual Component
Concept Scheme
Universe Scheme
Geographic Structure Scheme
Geographic Location Scheme
64
S09
Why Schemes?
• You could ask “Why do we have all these
annoying schemes in DDI?”
• There is a simple answer: reuse!
• DDI-L supports the concept of metadata registries
(e.g., question banks, variable banks)
• DDI-L also needs to show specifically where
something is reused
• Including metadata by reference helps avoid error and
confusion
• Reuse is explicit
Packaging structures
66
S04
DDI Instance
Citation
Coverage
Other Material / Notes
Translation Information
Study Unit
3.1 Local
Holding
Package
Group
Resource
Package
67
S04
Study Unit
Citation / Series Statement
Abstract / Purpose
Coverage / Universe / Analysis Unit / Kind of Data
Other Material / Notes
Funding Information / Embargo
Conceptual
Components
Physical
Instance
Data
Collection
Logical
Product
Archive
Physical
Data
Product
DDI
Profile
68
S04
Group
Citation / Series Statement
Abstract / Purpose
Coverage / Universe
Other Material / Notes
Funding Information / Embargo
Conceptual
Components
Sub Group
Data
Collection
Logical
Product
Study Unit
Comparison
Archive
Physical
Data
Product
DDI
Profile
69
S04
Resource Package
Citation / Series Statement
Abstract / Purpose
Coverage / Universe
Other Material / Notes
Funding Information / Embargo
Any module
EXCEPT
Study Unit,
Group
Or
Local Holding
Package
Any Scheme:
Organization
Concept
Universe
Geographic Structure
Geographic Location
Question
Interviewer Instruction
Control Construct
Category
Code
Variable
NCube
Physical Structure
Record Layout
70
S04
Local Holding Package (3.1 and later)
Citation / Series Statement
Abstract / Purpose
Coverage / Universe
Other Material / Notes
Funding Information / Embargo
Depository
Study Unit OR
Group
Reference:
[A reference to
the stored
version of the
deposited study
unit.]
Local Added
Content:
[This contains all
content available
in a Study Unit
whose source is
the local archive.]
71
S04
Study Unit
• Study Unit
• Identification
• Coverage
• Topical
• Identification is mapped to Dublin
Core and basic Dublin Core is
included as an option
• Geographic coverage mapped to
FGDC / ISO 19115
• Temporal
• bounding box
• Spatial
• spatial object
• Conceptual Components
• Universe
• Concept
• Representation (optional
replication)
• Purpose, Abstract, Proposal,
Funding
• polygon description of levels and
identifiers
• Universe Scheme, Concept
Scheme
• link of concept, universe,
representation through Variable
• also allows storage as a ISO/IEC
11179 compliant registry
72
S04
Archive
• An archive is whatever organization or individual has
current control over the metadata
• Contains persistent lifecycle events
• Contains archive specific information
• local identification
• local access constraints
73
S04
Data Collection
• Methodology
• Question Scheme
• Question
• Response domain
• Instrument
• using Control Construct
Scheme
• Coding Instructions
• question to raw data
• raw data to public file
• Interviewer Instructions
• Question and Response
Domain designed to
support question banks
• Question Scheme is a
maintainable object
• Organization and flow of
questions into Instrument
• Used to drive systems like
CASES and Blaise
• Coding Instructions
• Reuse by Questions,
Variables, and comparison
74
S04
Logical Product
• Category Schemes
• Categories are used as both
• Coding Schemes
question response domains and
by code schemes
• Codes are used as both
question response domains and
variable representations
• Link representations to concepts
and universes through
references
• Built from variables (dimensions
and attributes)
• Variables
• NCubes
• Variable and NCube Groups
• Data Relationships
• Map directly to SDMX structures
• More generalized to
accommodate legacy data
75
S04
Physical storage
• Physical Data Structure
• Links to Data Relationships
• Links to Variable or NCube Coordinate
• Description of physical storage structure
• in-line, fixed, delimited or proprietary
• Physical Instance
• One-to-one relationship with a data file
• Coverage constraints
• Variable and category statistics
76
S04
Group
• Resource Package
• Allows packaging of any maintainable item as a
resource item
• Group
• Up-front design of groups – allows inheritance
• Ad hoc (“after-the-fact”) groups – explicit comparison
using comparison maps for Universe, Concept,
Question, Variable, Category, and Code
• Local Holding Package
• Allows attachment of local information to a deposited
study without changing the version of the study unit
itself
77
S04
DDI Lifecycle Model and Related Modules
Groups and Resource Packages are a
means of publishing any portion or
combination of sections of the life cycle
Study
Unit
Data
Collection
Logical
Product
Local
Holding
Package
Physical
Data
Product
Physical
Instance
Archive
78
S04
Building from Component Parts
UniverseScheme
CategoryScheme
NCube
Scheme
CodeScheme
ConceptScheme
QuestionScheme
ControlConstructScheme
Variable
Scheme
RecordLayout
Scheme
[Physical Location]
Instrument
LogicalRecord
PhysicalInstance
79
S09
Study Unit Example: Schematic
Study Unit
Logical
product
Physical data
product
Concepts
Variables
Record
Layout
Universes
Codes
Conceptual
component
Physical
instance
Data collection
Questions
Categories
Category
Stats
S09
80
DDI’s “Meta-Module”
• One module is unlike all of the others in DDI – the DDI
Profile
• This is a “meta-module” – it talks about how the DDI-L is
being used by a specific application or organization
81
S09
DDI Profiles
• The DDI Profile module lets you describe which fields you
use in your institution’s flavor of DDI
• It is useful for performing machine validation of received instances
• It is useful documentation for human users
• You provide a set of information for each element allowed
in a complete DDI instance
• If it is used or not used
• If optional fields (per the XML schema) are required
• Provides the ability to describe DDI Templates
• Element AlternateName, Description and Instructions
• Required, default, fixed values
S09
82
<pr:DDIProfile xmlns="ddi:profile:3_1"
id="DDIProfileSTUDYNO">
<pr:XPathVersion>1.0</pr:XPathVersion>
<pr:DDINamespace>3.1</pr:DDINamespace>
<pr:XMLPrefixMap>
<pr:XMLPrefix>s</pr:XMLPrefix>
<pr:XMLNamespace>ddi:studyunit:3_1<pr:/XMLNamespace>
</pr:XMLPrefixMap>
<pr:Used path="/DDIInstance/VersionResponsibility"/>
<pr:Used path="/DDIInstance/Citation/Title“/>
<pr:Used path="/DDIInstance/Citation/Creator" required="true" >
<pr:AlternateName>Author</pr:AlternateName>
<pr:Used path="/DDIInstance/StudyUnit/Citation/Title"/>
.....
<pr:NotUsed path="/DDIInstance/StudyUnit/FundingInformation"/>
</pr:DDIProfile>
Content details
• Questionnaire content and design
• Breaking up content into its component parts
• Separating processes that occur at different points in the lifecycle
• Sharing common components between different points and objects
within the lifecycle
• Data Dictionary basics
• Conceptual components
• Variables
• Organization of variables into records
• Physical data stores
• A quick look from the bottom up
84
S11
Questions and Instruments
• DDI 3 separates the questions which make up a survey
instrument from the survey instrument itself
• Questions can be re-used!
• There are several different types of question text
• Many of these are the normal string types found throughout DDI 3
S11
Questionnaires
• Questions
• Question Text
• Response Domains
• Statements
• Pre- Post-question text
• Instructions
• Routing information
• Explanatory materials
• Question Flow
85
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you attend?
(1) Public school
(2) Private school
(3) Do not attend school
86
87
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you
attend?
(1) Public school
(2) Private school
(3) Do not attend school
• Questions
88
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you
attend?
(1) Public school
(2) Private school
(3) Do not attend school
• Questions
• Response
Domains
• Code
• Numeric
• Text
S11
89
Representing Response Domains
• There are many types of response domains
• Many questions have categories/codes as answers
• Textual responses are common
• Numeric responses are common
• Other response domains are also available in DDI 3 (time, mixed
responses)
90
S11
Category and Code Domains
•
Use CategoryDomain when NO codes are provided for
the category response
[ ] Yes
[ ] No
•
Use CodeDomain when codes are provided on the
questionnaire itself
1. Yes
2. No
91
S11
Category Schemes and Code Schemes
• Use the same structure as variables
• Create the category scheme or schemes first (do not
duplicate categories)
• Create the code schemes using the categories
• A category can be in more than one code scheme
• A category can have different codes in each code scheme
S11
92
Numeric and Text Domains
• Numeric Domain provides information on the
range of acceptable numbers that can be entered
as a response
• Text domains generally indicate the maximum
length of the response and can limit allowed
content using a regular expression
• Additional specialized domains such as DateTime
are also available
• Structured Mixed Response domain allows for
multiple response domains and statements within
a single question, when multiple response types
are required
93
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you
attend?
(1) Public school
(2) Private school
(3) Do not attend school
• Questions
• Response
Domains
• Code
• Numeric
• Text
• Statements
94
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you
attend?
(1) Public school
(2) Private school
(3) Do not attend school
• Questions
• Response
Domains
• Code
• Numeric
• Text
• Statements
• Instructions
95
S11
Simple Questionnaire
Please answer the following:
1.
Sex
(1) Male
(2) Female
2.
Are you 18 years or older?
(0) Yes
(1) No (Go to Question 4)
3.
How old are you? ______
4.
Who do you live with?
__________________
5.
What type of school do you
attend?
(1) Public school
(2) Private school
(3) Do not attend school
• Questions
• Response
Domains
Skip Q3
• Code
• Numeric
• Text
• Statements
• Instructions
• Flow
96
S11
Statement 1
Question 1
Question 2
Is Q2 = 0 (yes)
No
Yes
Question 3
Question 4
Question 5
S11
Approach to Survey Analysis
• Identify
• Question Text
• Statements
• Instructions or informative materials
• Response Domains (by type)
• Determine the universe structure and concepts
• Walk through the flow logic
97
S11
Completing Question Items
• Create CodeSchemes reusing common categories
• Determine range for NumericDomains
• Determine maximum length of TextDomains
• Write up control constructs (easiest is to list all
QuestionConstruct, all Statement Items)
98
99
S11
Example: Reusing Categories
Full list of all categories:
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Yes
No
Don’t know
Yes
No
BECOMES
Yes
No
Yes, always
Sometimes
Some do, some don’t
Not to my knowledge
Never – I don’t let them
Never – I don’t have a television
Yes
No
Not to my knowledge
Shorter list of reusable categories:
• Yes
• No
• Don’t know
• Yes, always
• Sometimes
• Some do, some don’t
• Not to my knowledge
• Never – I don’t let them
• Never – I don’t have a television
S11
Flow Logic
• Master Sequence
• Every instrument has one top-level sequence
• Question and statement order
• Routing – IfThenElse (see next slide)
• After Statement 2 (all respondents read this)
• After Q2 Else goes to statement
• After Q5 Else goes back to a sequence
100
101
S11
Else
SI 1
Q1
IfThenElse
1
SI 2
end
Then
IfThenElse
2
Q2
Else
SI 3
Then
Else
Q3
Q4
IfThenElse
3
Q5
Q8
Then
Q6
Q7
SI 4
102
S11
Example: Master Sequence
• Statement 1
• Question 1
• Statement 2
• IFThenElse 1
• Then Sequence 1
• Question 2
• IFThenElse 2
• Then SEQuence 2
Question 3, Question 4, IFThenElse 3, Question 8, Statement 4
[Then SEQuence 3 (Question 6,Question 7)]
• Else Statement 3
103
S12
Process Items
• General Coding Instruction
• Missing Data (left as blanks)
• Suppression of confidential information such as name or
address
• Generation Instructions
• Recodes
• Review of text answers where items listed as free text result in more
than one nominal level variable
• Create variable for each with 0=no 1=yes
• Or a count of the number of different items provided by a
respondent
• Aggregation etc.
• The creation of new variables whose values are programmatically
populated (mostly from existing variables)
S10
104
Conceptual Components
• Conceptual components are defined early in the study
process. They are the who, what, where, and when of the
study.
105
S10
Difference Between Conceptual
Components and Coverage
• Conceptual Components
• For use by the study,
organization, community
• Coverage
• Spatial Coverage
• Topical Coverage
• Temporal Coverage
• High level search and
links to geographic
systems
• High level search and
links to broader world of
knowledge
• High level search
S10
106
Concepts
• A concept may be structured or unstructured and consists
of a Name, a Label, and a Description. A description is
needed if you want to support comparison. Concepts are
what questions and variables are designed to measure
and are normally assigned by the study (organization or
investigator).
S10
107
Universe
• This is the universe of the study which can combines the
who, what, when, and where of the data
• Census top level universe: “The population and
households within Kenya in 2010”
• Sub-universes: Households, Population, Males,
Population between 15 and 64 years of age, …
S10
108
Universe Structure
• Hierarchical
• Makes clear that “Owner Occupied Housing Units” are part of the
broader universe “Housing Units”
• Can be generated from the flow logic of a questionnaire
• Referenced by variables and question constructs
• Provides implicit comparability when 2 items reference the same
universe
S10
109
Population and Housing
Units in Kenya in 2010
Housing
Units
Population
Males
Variable A
Universe Reference:
Persons
15 years
and Older
Males, 15 years of
age and older in
Kenya in 2010
110
S20
Universe
Concept
Variable OR
Question
Construct
Data Element
Concept
New in 3.2
Data Element
Variable Representation
Question Response Domain
ISO/IEC 11179-1
International Standard ISO/IEC 11179-1: Information technology – Specification and
standardization of data elements – Part 1: Framework for the specification and standardization
of data elements Technologies de l’informatin – Spécifiction et normalization des elements de
données – Partie 1: Cadre pout la specification et la normalization des elements de données.
First edition 1999-12-01 (p26) http://metadata-standards.org/11179-1/ISO-IEC_111791_1999_IS_E.pdf
S12
111
Variables
• Variables are created as a result of data processing,
either from questions or other data collection/harvesting
activities.
112
S12
General Variable Components
• VariableName, Label and Description
• Links to Concept, Universe, Question, and Embargo
information
• Provides Analysis and Response Unit
• Provides basic information on its role:
• isTemporal
• isGeographic
• isWeight
• Describes Representation
S12
113
Representation
• Detailed description of the role of the variable
• References related weights (standard and variable)
• References all instructions regarding coding and
imputation
• Describes concatenated values
• Additivity and aggregation method
• Value representation
• Specific Missing Value description (proposed DDI
3.2)
• Can be used in combination with any representation type
114
S12
Value Representation
• Provides the following elements/attributes to all
representation types:
• classification level (“nominal”, “ordinal”, “interval”,
“ratio”, “continuous”)
• blankIsMissingValue (“true” “false”)
• missingValue (expressed as an array of values)
• These last 2 may be replaced in 3.2 by a missing values
representation section
• Is represented by one of four representation
types (numeric, text, code, date time)
• Additional types are under development (i.e.,
scales)
S12
115
Code Representation
• Code schemes link category labels and content to a code
used in the data file
• Codes can be numeric or text
• Hierarchies are described by level, completeness, and
relationship of items contained in a level
S12
116
Code Scheme Options
• Use in its entirety
• Use only specified levels
• Use only most discrete items (higher levels are treated as
group labels)
• Use only the specified codes or code range
117
S12
<l:CodeScheme id=”CS_1”>
<l:CategorySchemeReference>
<r:ID>CatScheme_1</r:ID>
</l:CategorySchemeReference>
<l:HierarchyType>Irregular</l:HierarchyType>
<l:Level levelNumber=”1”>
<l:Name>2 digit code</l:Name>
</l:Level>
<l:Level levelNumber=”2” >
<l:Name>4 digit code</l:Name>
</l:Level>
.....
S12
118
....
<l:Code isDiscrete=”false” levelNumber=”1” >
<l:CategoryReference><r:ID>C_1</r:ID></l:CategoryReference>
<l:Value>10</l:Value>
<l:Code isDiscrete=”true” levelNumber=”2”>
<l:CategoryReference><r:ID>C_2</r:ID></l:CategoryReference>
<l:Value>1010</l:Value>
</l:Code>
<l:Code isDiscrete=” true” levelNumber=”2” >
<l:CategoryReference><r:ID>C_3</r:ID></l:CategoryReference>
<l:Value>1020</l:Value>
</l:Code>
</l:Code>
<l:Code isDiscrete=” true” levelNumber=”1” >
<l:CategoryReference><r:ID>C_4</r:ID></l:CategoryReference>
<l:Value>20</l:Value>
</l:Code>
</l:CodeScheme>
S12
Numeric
• Use for variables where numeric response is self
•
•
•
•
explanatory (e.g., age in years)
Continuous or discrete
Specific valid levels or ranges
Missing value codes can be identified
Data is intended to be analyzed as numbers
119
S12
120
Text
• Data is intended to be analyzed as text
• Geographic codes may be numbers but are analyzed as text or
string (leading zeros used)
• Content can be any text
• Constrain length
• Constrain regular expression
• A US ZIP Code is text
• 5 characters
• numeric characters 0-9 only
S12
Date Time
• Allows specification of format
• Allows statistical software to handle appropriately
121
122
S14
Data Relationship
Learning DDI: Pack S14
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
S14
123
What we’re covering
• How Data Relationship provides the link between the
physical record storage and their logical intellectual
content
• How Variables and NCubes are grouped into Logical
Records
• How Logical Records define complex file relationships
124
S14
Understanding Data Relationships
• Data files can be described as following a structure
• What are the record types?
• What variables make up each record type?
• How do I know which record type I have?
• How can I find a unique record of a specific type?
• How are records related?
• DDI provides the information to automate processing of
the data files themselves
Logical vs. Physical
• Every data file has one or more “logical records”
(a record of analysis rather than a physical
record)
• The logical description separates the support
provided in the variable content from the physical
structure
• DDI provides both human readable and machine
actionable information to support programming
• Minimal information is REQUIRED even for
single record type simple files. The
LogicalRecord ID is the link between the
physical store of the data and the logical
description of its content
S14
125
S14
126
Data Relationship
• Logical Record:
• Assigns an ID to the logical record
• Provides information on the logical record type
• Identifies support for breaking the logical record into 2
or more physical segments in a storage structure
• Explains unique case identification
• Provides the content of logical record (Variables and
NCubes)
• Record Relationship
• Provides links or “keys” between logical record types
127
S14
Logical Record
• Identification
• Description
• hasLocator [boolean] and Variable Value
Reference
• To a variable that declares the record type
• Support for Multiple Segments
• Specifies variable for this information
• Case Identification
• Options for identifying a unique case within a record
type
• Variables OR NCubes in record and
variableQuantity [integer]
S14
128
Logical Record
Minimum Requirements
• Identification
• Description
• hasLocator [boolean] and Variable Value Reference
• Support for Multiple Segments
• Case Identification
• Variables OR NCubes in record and variableQuantity
[integer]
129
S14
Record Type Locator
• EXAMPLE 1:
• Household Record
• variable rectype =
“H”
• Person Record
• variable rectype =
“P”
• EXAMPLE 2:
• Record Type A
• variable chariter =
[blank]
• Record Type B
• variable chariter ≥
‘000’
S14
Case Identification
• Simple case examples:
• Case Number
• Survey form number
• Any single variable unique number within a record type
• Complex case identification:
• Concatenated keys
• Conditional concatenated keys
130
131
S14
Complex Files and Record Relationships
• Complex files consist of more than one record type stored
in one or more files
• Contains the complete Logical Record description for
each record type
• Provides information on the relationship between records
• Provides the link(s) to other records
• Provides the link(s) between waves
S14
132
RecordRelationship
• A pairwise relationship of a source and target record
• Describes the relationship:
• Source and target record
• Type of relationship (=, >, <, ≠, ≤, ≥)
• Notice that the case identification of a record type is
frequently used as a key for the relationship link
Logical Record Structure
Person
ID
Age
Gender
Household
ID
Data File:
Persons
Note: this is a logical
relationship – the fact that
the records are in two files
instead of one is
unimportant.
Data File:
Households
Household
Household Household
Type
ID
Income
Logical Record Structure
Housing Unit
Type
134
S15
Describing Data Storage
• To describe how data is stored, DDI-L separates the
storage structures from the file actually containing the
data
• The storage structures are reused
• The storage structure is called a physical data product
• The data files are called physical instances
135
S16
Study Unit
Data Collection
Logical Data File
Physical Structure 1
Physical Instance (full file)
Physical Instance (subset of records)
Physical Structure 2
Physical Instance (full file)
136
S15
Linkages Step 1
• Define and identify the LogicalRecord within Data
Relationship in the Logical Product
• PhysicalDataProduct – Physical Structure
• Format
• Default values
• Link to LogicalRecord
• Declaration of its physical segments (in terms of its storage in this
structure)
S15
137
Linkages Step 2
• PhysicalDataProduct – Record Layout
• Link from RecordLayout to PhysicalRecordSegment
• Link from DataItem to a Variable or NCube description and to the
physical location of the data in the data file
• PhysicalInstance
• Link to the RecordLayout(s) found in the file
• Link to the actual file of data
Logical Product
LogicalRecord
Variables
1..n
PhysicalDataProduct
PhysicalStructure
REF: LogicalProduct
Defines
PhysicalRecordSegments
RecordLayout
REF: PhysicalRecordSegment
DataItem
REF: Variable
Gives physical location in record
Technically a
1..n but the
additional data
files must be
the equivalent
of an identical
backup copy
n..n
Physical Instance
REF: RecordLayout
REF: Data File
Summary Statistics
REF: Variables
S15
Data File
1..1
S15
139
Complexity
• May seem like a lot of referencing and indirection
for a simple example
• Structure is designed to handle much more
complex structures in a consistent manner
• For example health interview surveys may have
records for multiple person types, incidence or
event records, biomarkers, and relationship or
situational change records stored and linked in
many different ways
• Same structure handles all levels of complexity
S15
140
Describing the Physical Store
• Link to a LogicalRecord
• Different structures to describe different storage formats
• We use XML Schema substitutions
• ASCII, internal, proprietary, etc.
• Information on relational links between record types
stored in one or many data files (physical relationship)
• Links to Variables and NCube cells
141
S15
Physical Description
• Physical Data Product
• Can describe any number of physical stores of data
• Describes the gross record layout
• Reference to a LogicalRecord
• Information on the use of multiple physical segments to store the
data in the LogicalRecord
• Provides default values for various data typing information
• Describes the record layout in detail
• Links to a GrossRecordStructure
• Provides a detailed link between a specific variable and its
physical storage location
142
S15
PhysicalDataProduct
PhysicalStructureScheme:
Reference to LogicalRecord
GrossRecordStructure
Identifies PhysicalSegments
RecordLayoutScheme
RecordLayoutScheme:
Reference to PhysicalStructure
BaseRecordLayout
proprietary
Uses XML
Schema
“substitution
groups”
Alternates for
BaseRecordLayout
DataSet
inline_ncube_recordlayout
ncube_recordlayout
tabular_ncube_recordlayout
S15
143
PhysicalDataProduct
• ncube_recordlayout
• Allows for a record per aggregation case containing
multiple ncubes listed in a fixed or comma delimited
layout (used by the example)
• inline_ncube_recordlayout
• Allows the data to be listed as a table in-line in the
PhysicalDataProduct
• tabular_ncube_recordlayout
• Describes a 2-dimensional tabular layout as used by
spreadsheets
144
S15
Proprietary Record Layout
• Used for describing data files for proprietary software
packages
• Statistical packages (SPSS, SAS, etc.)
• Relational databases (Oracle, SQL Server, etc.)
• Uses a “handle” (DataItemAddress) to define the variable
location, instead of a known location within the file
• The files are typically binary, so a positional or delimited location
does not work
• Examples: variable name, column name
• Allows for proprietary datatypes, outputs, and properties
• These describe software-specific parameters that can be defined
by the user, according to the software package they use
145
S15
Data Set
• Allows for capturing the data in a DDI-specific
XML format, as part of the DDI file
• Useful for archival storage of the data, where the data
and metadata live in the same file/package
• Useful for feeding temporary data files to visualization
packages/Web services
• Usually subsets of the full data file
• Many visualization packages expect data in XML format
• Web services demand that the communications are performed in
an XML format
• This is a very verbose way of expressing the data
– files get much larger!
146
S16
Physical Instance
Learning DDI: Pack S16
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
147
S16
Files of Data
• Data files are represented in the DDI metadata with a
module called a physical instance
• This is just a metadata object which represents the
existence of a physical file
• It also carries summary and category statistics because these
change from data file to data file
148
S16
Physical Instance
• Has a one-to-one relationship with a physical file
of data (plus a back-up if one exists)
• Allows for full record subsets of a large data set
using record selection
• Houses summary and category statistics that are
specific to a particular file
• Note that these can be in-line, referenced in another
physical instance, or referenced as a separate data file
(with complete logical product, physical data structure,
and physical instance)
S16
Record of a Physical Instance
• Link to a physical storage structure
• Specifics of the range of records in the file
• Record type selection
• Geographic selection
• Topical selection
• Summary statistics
• Identification and location of the actual data file
149
S16
150
Record Subsets
• PhysicalRecordSegment
• 79 record segments with each segment in its own file
(US 2000 Census SF3)
• Geography
• Using SpatialCoverage to limit to a single country
(Eurobarometer – Germany)
• Date/Time
• Use TemporalCoverage to limit to a single year
(General Social Survey – 1998)
• Topic
• Use TopicalCoverage to limit to a single topical
definition (Female cases only)
151
S16
NHGIS Processing:
NHGIS project separates physical data
files by geographic type
Alabama
Alaska
Arizona
Arkansas
State
County
Place
Tract
State
County
Place
Tract
One file per state
with all
geographic levels
Alabama
Alaska
Arizona
Arkansas
One file per
geographic level
with all states
152
S16
PhysicalInstance
Identfication
Refeference to RecordLayout(s)
ID/location of Data File
Data Fingerprint
Coverage limitations
Summary Statistics
[Variables]
GrossFileStructure
[Check sums and processing info]
Category Statistics
[allows for single level filters]
153
S17
From the Bottom Up
• This section summarizes what we have learned
starting from the data item and working up to
the full metadata description
Learning DDI: Pack S17
Copyright © GESIS – Leibniz Institute for the Social Sciences, 2010
Published under Creative Commons Attribute-ShareAlike 3.0 Unported
154
S17
DDI-L from the data item up
LogicalProduct
PhysicalDataProduct
PhysicalInstance
DDI-L breaks down a data file
into three major components:
-The LogicalProduct
describes the data dictionary
-The PhysicalDataProduct
describes the file structure
-The PhysicalInstance
describes an actual instance of
the file
155
S17
DDI-L from the data item up
LogicalProduct
PhysicalDataProduct
The PhysicalInstance refers to
a record layout in the
PhysicalDataProduct
(the same data can be stored in different
formats/locations or the same record can
contain different data)
PhysicalInstance
DataFileIdentification, GrossFileStructure,
Statistics, ProprietaryInfo
The PhysicalInstance
identifies the file (name,
path/uri), holds statistics
(#recs, #vars, freq, min, max,
etc.) and other applicable
proprietary info
156
S17
DDI-L from the data item up
LogicalProduct
The PhysicalStructure refers to
a logical record in the
LogicalProduct
(the same set of variables can be stored in
different ways)
PhysicalDataProduct
PhysicalStructureScheme/PhysicalStructure
RecordLayoutScheme/RecordLayout OR
RecordLayoutScheme/ProprietaryRecordLayout
PhysicalInstance
DataFileIdentification, GrossFileStructure,
Statistics, ProprietaryInfo
The PhysicalDataProduct
describes the
PhysicalStructure of the file
and (what are the data
components) and its
RecordLayout (variable
location, formatting, etc).
The same structure can be
used by multiple layouts
Different layouts are used to
describe text and proprietary
files.
157
S17
DDI-L from the data item up
The LogicalProduct result from
earlier life cycle stages.
(this information is not available in
traditional data files)
LogicalProduct
CategoryScheme/Category
CodeScheme/Code
VariableScheme/Variable
DataRelationship/LogicalRecord
PhysicalDataProduct
PhysicalStructureScheme/PhysicalStructure
RecordLayoutScheme/RecordLayout OR
RecordLayoutScheme/ProprietaryRecordLayout
PhysicalInstance
DataFileIdentification, GrossFileStructure,
Statistics, ProprietaryInfo
The LogicalProduct describes
the data dictionary Variables
(name,label, formats, etc.), the
Codes & Categories
(classifications) as well as the
the Logical Record for storage.
The Data Relationship can
describe complex hierarchical
structures and indexes.
158
S17
DDI-L from the data item up
DDIInstance/StudyUnit
Abstract, Coverage, Purpose, …
ConceptualComponents
ConceptScheme/Concept
UniverseScheme/Universe
DataCollection
QuestionScheme/Question
ControlConstruct, Instruction, Instrument,…
LogicalProduct
CategoryScheme/Category
CodeScheme/Code
VariableScheme/Variable
DataRelationship/LogicalRecord
PhysicalDataProduct
PhysicalStructureScheme/PhysicalStructure
RecordLayoutScheme/RecordLayout OR
RecordLayoutScheme/ProprietaryRecordLayout
PhysicalInstance
DataFileIdentification, GrossFileStructure,
Stattistics, ProprietaryInfo
The XML is contained by a
StudyUnit wrapped by a
DDIInstance.
The ConceptualComponents
describes the concepts,
universe and the
DataCollecion module
captures the questionnaire and
survey instrument.
159
S17
DDI-L from the data item up
DDIInstance/StudyUnit
Abstract, Coverage, Purpose, …
ConceptualComponents
ConceptScheme/Concept
UniverseScheme/Universe
DataCollection
QuestionScheme/Question
ControlConstruct, Instruction, Instrument,…
LogicalProduct
CategoryScheme/Category
CodeScheme/Code
VariableScheme/Variable
DataRelationship/LogicalRecord
PhysicalDataProduct
PhysicalStructureScheme/PhysicalStructure
RecordLayoutScheme/RecordLayout OR
RecordLayoutScheme/ProprietaryRecordLayout
PhysicalInstance
DataFileIdentification, GrossFileStructure,
Stattistics, ProprietaryInfo
DDI in context
• Managing Research
• Individual research
• Large research projects – longitudinal multi-researcher
• Managing digital resources
Managing research
Individual researchers
• Tools – using the software they know
• Clarifying what metadata needs to be captured for future
preservation and discovery
• Building/locating metadata resources that support
comparison
• Getting metadata from individual researchers is not a new
problem – DDI can’t solve it but can provide some
direction
The Longitudinal Version of GSBPM
• In 2011 at a Dagstuhl workshop on Longitudinal metadata
a modification of the GSBPM was developed to describe
data production for large on-going research projects
• This work is still under development but may result in a
more detailed lifecycle model for DDI moving forward
S01
164
Note the similarity to the DDI Combined Lifecycle Model
and the top level of the GSBPM
S01
165
S01
166
167
S05
Upstream Metadata Capture
• Because there is support throughout the lifecycle,
you can capture the metadata as it occurs
• It is re-useable throughout the lifecycle
• It is versionable as it is modified across the lifecycle
• It supports production at each stage of the
lifecycle
• It moves into and out of the software tools used at each
stage
168
S05
Metadata Driven Data Capture
• Questions can be organized into survey instruments
documenting flow logic and dynamic wording
• This metadata can be used to create control programs for Blaise,
CASES, CSPro and other CAI systems
• Generation Instructions can drive data capture from
registry sources and/or inform data processing post
capture
169
S05
Reuse of Metadata
• You can reuse many types of metadata,
benefitting from the work of others
•
•
•
•
•
Concepts
Variables
Categories and codes
Geography
Questions
• Promotes interoperability and standardization
across organizations
• Can capture (and re-use) common cross-walks
Managing digital resources
Management of Data and Metadata
• Managing metadata:
• Capture – goal is to capture at point of origin
• Reuse – reduce burden, reduce error, comparison
• Quality control – reuse, replication, paradata
• Preservation – metadata in a non-proprietary format
• Provenance – how the data was created
• Processing – metadata driven processing
• Discovery and access
• Analysis support and information
• Digital objects
• Data as a unique object – without metadata its just a number
Data/Metadata Mgmt Activities
• Data Capture
• determining what is to be collected from whom and how
• Data Processing
• cleaning, normalizing, aggregating, harmonizing,
•
•
•
•
creation of data products
Process evaluation and revision
• quality control, process improvement, evaluation and
analysis
Data Discovery
• Finding data, accessing data
Preservation
• short term and archival
Administrative tracking
• who has control, where in the process
Data/Metadata Management
• Downside: There’s a lot more to manage
• Greater depth than many other digital objects
• Greater detail that can be leveraged for discovery, access, and
application
• Costly to translate into a standard format
• Upside: We’ve been managing digital data for over 40
years
• No need to reinvent the wheel
• DDI as a metadata structure is not an “all or nothing” approach
• DDI uptake has moved out of the archives and is moving into the
production process
Working within a Library/Archive System
• Actionable and informational metadata
• What do you need to “do” with the metadata?
• Discovery
• How deep do you want to go?
• How integrated do you want the results to be?
• Visualization / Manipulation
• Analysis
• Preservation / Archive
Archive/Data Discovery and Delivery
• Data and Metadata are generally received from
external organizations
• Focus is on moving data and metadata to a
preservation format and supporting discovery and
delivery tools
• Management of ingest process (process
management)
• “Value Added” material
Archive/Data Discovery and Delivery
• Capturing full content
• Machine actionable
• Information for discovery
• Retaining links to other materials, collections and grouping
• Added value metadata from archive
• Variable, question, and data element groups related to subject and
keyword access
• Linking to a common geography description
• Linking to an overall organization description
• Tracking archival management activities and processes
Working with producers/researchers
• How much can you influence depositors?
• Ingest tools that result in DDI metadata
• Provision of reusable materials (schemes) or controlled
vocabularies
• metadata management tools
• Training
• What can be pushed back to long term
depositors?
• Resource package material?
• Metadata of deposited data so that only differences are reported?
• Tools to manage change over time?
General use statements
180
S05
Upstream Metadata Capture
• Because there is support throughout the lifecycle,
you can capture the metadata as it occurs
• It is re-useable throughout the lifecycle
• It is versionable as it is modified across the lifecycle
• It supports production at each stage of the
lifecycle
• It moves into and out of the software tools used at each
stage
181
S05
Reuse of Metadata
• You can reuse many types of metadata,
benefitting from the work of others
•
•
•
•
•
Concepts
Variables
Categories and codes
Geography
Questions
• Promotes interoperability and standardization
across organizations
• Can capture (and re-use) common cross-walks
182
S05
Metadata Driven Data Capture
• Questions can be organized into survey
instruments documenting flow logic and dynamic
wording
• This metadata can be used to create control programs
for Blaise, CASES, CSPro and other CAI systems
• Generation Instructions can drive data capture
from registry sources and/or inform data
processing post capture
183
S05
Management of Information, Data, and
Metadata
• An organization can manage its organizational
information, metadata, and data within
repositories using DDI 3 to transfer information
into and out of the system to support:
• Controlled development and use of concepts,
questions, variables, and other core metadata
• Development of data collection and capture processes
• Support quality control operations
• Develop data access and analysis systems
DAY 2
S01
185
DDI-C and DDI-L
• DDI has 2 development lines
• DDI Codebook (DDI-C)
• DDI Lifecycle (DDI-L)
• Both lines will continue to be improved
• DDI-C focusing just on single study codebook structures
• DDI-L focusing on a more inclusive lifecycle model and
support for machine actionability
186
S02
Background
• Concept of DDI and definition of needs grew out
of the data archival community
• Established in 1995 as a grant funded project
initiated and organized by ICPSR
• Members:
• Social Science Data Archives (US, Canada, Europe)
• Statistical data producers (including US Bureau of the
Census, the US Bureau of Labor Statistics, Statistics
Canada and Health Canada)
• February 2003 – Formation of DDI Alliance
• Membership based alliance
• Formalized development procedures
S02
187
Early DDI:
Characteristics of DDI-C
• Focuses on the static object of a codebook
• Designed for limited uses
• End user data discovery via the variable or high level
study identification (bibliographic)
• Only heavily structured content relates to information
used to drive statistical analysis
• Coverage is focused on single study, single data
file, simple survey and aggregate data files
• Variable contains majority of information
(question, categories, data typing, physical
storage information, statistics)
S02
188
Limitations of these Characteristics
• Treated as an “add on” to the data collection
process
• Focus is on the data end product and end users
(static)
• Limited tools for creation or exploitation
• The Variable must exist before metadata can be
created
• Producers hesitant to take up DDI creation
because it is a cost and does not support their
development or collection process
189
S03
Origins of the DDI Alliance
• DDI-C was developed by an informal network of
individuals from the social science community
and official statistics
• Funding was through grants
• It was decided that a more formal organization
would help to drive the development of the
standard forward
• Many new features were requested
• The DDI Alliance was born to facilitate the development
in a consistent and on-going fashion
190
S03
DDI Alliance Structure
• DDI-L specifications are created by committees drawn from
among the member organizations
• Some outside experts are invited to attend
• The Steering Committee governs the organization
• The Expert Committee votes to approve all published work
• One representative per member organization
• The Technical Implementation Committee (TIC) creates the
technical work products (XML schemas, UML models,
documentation, etc.)
• Working Groups are short term groups working on future
DDI topical content (i.e., Survey Design & Implementation)
• Tools Catalog Group describing tools and software to work
with DDI
• Web Site Maintenance Group
S03
191
Moving from DDI-C to DDI-L
• DDI Alliance members wished to support current
DDI-C users and will continue to support this
specification
• The limitations of DDI-C needed to be addressed
in order to move the standard forward to a
broader audience and user base
• Requirements for DDI-L came out of the original
committee as well as the broader data archive
community
• The development of the first wave of software for
DDI-C raised additional requirements
S03
192
Requirements for DDI-L
• Improve and expand the machine-actionable aspects of
•
•
•
•
•
the DDI to support programming and software systems
Support CAI instruments through expanded description of
the questionnaire (content and question flow)
Support the description of data series (longitudinal
surveys, panel studies, recurring waves, etc.)
Support comparison, in particular comparison by design
but also comparison-after-the fact (harmonization)
Improve support for describing complex data files (record
and file linkages)
Provide improved support for geographic content to
facilitate linking to geographic files (shape files, boundary
files, etc.)
193
S03
DDI Lifecycle Model
Metadata Reuse
S20
194
Relationship to Other Standards:
Archival
• Dublin Core
• Basic bibliographic citation information
• Basic holdings and format information
• METS
• Upper level descriptive information for managing digital objects
• Provides specified structures for domain specific metadata
• OAIS
• Reference model for the archival lifecycle
• PREMIS
• Supports and documents the digital preservation process
S20
195
Relationship to Other Standards: NonArchival
• ISO 19115 – Geography
• Metadata structure for describing geographic feature files such as
shape, boundary, or map image files and their associated attributes
• ISO/IEC 11179
• International standard for representing metadata in a Metadata
Registry
• Consists of a hierarchy of “concepts” with associated properties for
each concept
• ISO 17369 SDMX
• Exchange of statistical information (time series/indicators)
• Supports metadata capture as well as implementation of registries
196
S05
Mining the Archive
• With metadata about relationships and structural
similarities
• You can automatically identify potentially comparable data sets
• You can navigate the archive’s contents at a high level
• You have much better detail at a low level across divergent data
sets
197
S20
Metadata Coverage
• Dublin Core
• ISO/IEC 11179
• ISO 19115
• Statistical Packages
• METS
• PREMIS
• SDMX
• DDI
•
•
•
•
•
•
[Packaging]
Citation
Geographic Coverage
Temporal Coverage
Topical Coverage
Structure information
• Physical storage description
• Variable (name, label, categories, format)
•
•
•
•
•
•
•
•
Source information
Methodology
Detailed description of data
Processing
Relationships
Life-cycle events
Management information
Tabulation/aggregation
S03
198
Moving from DDI 1/2 to DDI 3
• DDI Alliance members wished to support current
DDI 1/2 users and will continue to support this
specification
• The limitations of DDI 1/2 needed to be
addressed in order to move the standard forward
to a broader audience and user base
• Requirements for DDI 3 came out of the original
committee as well as the broader data archive
community
• The development of the first wave of software for
DDI 1/2 raised additional requirements
S03
199
Requirements for 3.0
• Improve and expand the machine-actionable aspects of
•
•
•
•
•
the DDI to support programming and software systems
Support CAI instruments through expanded description of
the questionnaire (content and question flow)
Support the description of data series (longitudinal
surveys, panel studies, recurring waves, etc.)
Support comparison, in particular comparison by design
but also comparison-after-the fact (harmonization)
Improve support for describing complex data files (record
and file linkages)
Provide improved support for geographic content to
facilitate linking to geographic files (shape files, boundary
files, etc.)
DDI 1 / 2
200
S03
Document Description
Citation of the codebook document
Guide to the codebook
Document status
Source for the document
Study Description
Citation for the study
Study Information
Methodology
Data Accessibility
Other Study Material
File Description
File Text (record and relationship information)
Location Map (required for nCubes optional for microdata)
Data Description
Variable Group and nCube Group
Variable (variable specification, physical location, question, & statistics)
nCube
Other Material
201
S03
Our Initial Thinking…
The metadata payload
from DDI 1/2 was reorganized to cover these
areas.
202
S03
File Text
Location Map
Physical Location
Statistics
Study Citation
Document Source
Study Information
Data Accessibility
Study Methodology
Questions
Other Material
Variable specification
nCubes
Variable & nCube Groups
S03
Wrapper
For later parts
of the lifecycle,
metadata is
reused heavily
from earlier
Modules.
The discovery
and analysis
itself creates
data and
metadata, reused in future
cycles.
204
S03
Realizations
• Many different organizations and individuals are
involved throughout this process
• This places an emphasis on versioning and exchange
between different systems
• There is potentially a huge amount of metadata
reuse throughout an iterative cycle
• We needed to make the metadata as reusable as
possible
• Every organization acts as an “archive” (that is, a
maintainer and disseminator) at some point in the
lifecycle
• When we say “archive” in DDI 3, it refers to this function
205
DDI 3 and the Data Life Cycle
• A survey is not a static process: It dynamically evolves across time and involves many
•
•
•
•
S03
agencies/individuals
DDI 1/2 is about archiving, DDI 3 across the entire “life cycle”
DDI 3 focuses on metadata reuse (minimizes redundancies/discrepancies, support
comparison)
Also supports multilingual, grouping, geography, and others
DDI 3 is extensible
206
Approach
• Shift from the codebook centric model of early versions of
DDI to a lifecycle model, providing metadata support from
data study conception through analysis and repurposing
of data
• Shift from an XML Data Type Definition (DTD) to an XML
Schema model to support the lifecycle model, reuse of
content and increased controls to support programming
needs
• Redefine a “single DDI instance” to include a “simple
instance” similar to DDI 1/2 which covered a single study
and “complex instances” covering groups of related
studies. Allow a single study description to contain
multiple data products (for example, a microdata file and
aggregate products created from the same data
collection).
• Incorporate the requested functionality in the first
published edition
S03
207
S03
Development of DDI 3
• 2004 – Acceptance of a new
DDI paradigm
• Lifecycle model
• Shift from the codebook centric /
variable centric model to
capturing the lifecycle of data
• Agreement on expanded areas of
coverage
• 2005
• Presentation of schema structure
• Focus on points of metadata
creation and reuse
• 2006
• Presentation of first complete 3.0
model
• Internal and public review
• 2007
• Vote to move to Candidate
Version
• Establishment of a set of use
cases to test application and
implementation
• 2008
• April: DDI 3.0 published
• 2009
• DDI 3.1 approved for publication
in May 2009
• Published October 2009
• Bugs and feature corrections
identified during the first year of
use, some were backward
incompatible
208
S03
DDI 3.2
• Currently working on DDI 3.2 which will address
bug and feature corrections
• Publication for review in 2011
• Noted areas of correction:
• Broader support for controlled vocabularies
• Clarification of record relationship
• Clarification of ID and URN structures
• Missing value declarations
• Expanded Response Domain/Representation options
209
S05
Change
• DDI 3 is a major change from DDI 1/2 in terms of content
and structure. Lets step back and look at:
• Basic differences between DDI 1/2 and DDI 3
• Applications for DDI 1/2 and DDI 3
• Differences that allow DDI 3 to do more
• How these differences provide support for better management of
information, data, and metadata
S05
210
Differences Between DDI 1/2 and 3
• DDI 1/2
• Codebook based
• Format XML DTD
• After-the-fact
• Static
• Metadata replicated
• Simple study
• Limited physical storage
options
• DDI 3
• Lifecycle based
• Format XML Schema
• Point of occurrence
• Dynamic
• Metadata reused
• Simple study, series,
grouping, inter-study
comparison
• Unlimited physical
storage options
S05
211
DDI 1/2 Applications
• Simple survey capture
• High level study description with variable information for
stand alone studies
• Descriptions of basic nCubes (individual statistical tables)
• Replicating the contents of a codebook including the data
dictionary
• Collection management beyond bibliographic records
S05
212
DDI 3 Applications
• Describing a series of studies such as a longitudinal
survey or cross-cultural survey
• Capturing comparative information between studies
• Sharing and reusing metadata outside the context of
a specific study
• Capturing data in the XML
• Capturing process steps from conception of study
through data capture to data dissemination and use
• Capturing lifecycle information as it occurs, and in a
way that can inform and drive production
• Management of data and metadata within an
organization for internal use or external access
S05
213
Why can DDI 3 do more?
• It is machine-actionable – not just documentary
• It’s more complex with a tighter structure
• It manages metadata objects through a structured
identification and reference system that allows sharing
between organizations
• It has greater support for related standards
• Reuse of metadata within the lifecycle of a study and
between studies
S05
214
Reuse Across the Lifecycle
• This basic metadata is reused across the lifecycle
• Responses may use the same categories and codes which the
variables use
• Multiple waves of a study may re-use concepts, questions,
responses, variables, categories, codes, survey instruments, etc.
from earlier waves
215
S05
Reuse by Reference
• When a piece of metadata is re-used, a reference can be
made to the original
• In order to reference the original, you must be able to
identify it
• You also must be able to publish it, so it is visible (and can
be referenced)
• It is published to the user community – those users who are
allowed access
216
S05
Change over Time
• Metadata items change over time, as they move
through the data lifecycle
• This is especially true of longitudinal/repeat cross-
sectional studies
• This produces different versions of the metadata
• The metadata versions have to be maintained as
they change over time
• If you reference an item, it should not change: you
reference a specific version of the metadata item
S05
217
DDI Support for Metadata Reuse
• DDI allows for metadata items to be identifiable
• They have unique IDs
• They can be re-used by referencing those IDs
• DDI allows for metadata items to be published
• The items are published in resource packages
• Metadata items are maintainable
• They live in “schemes” (lists of items of a single type) or in
“modules” (metadata for a specific purpose or stage of the lifecycle)
• All maintainable metadata has a known owner or agency
• Maintainable metadata may be versionable
• Versions reflect changes over time
• The versionable metadata has a version number
218
S05
Study A
Study B
Ref=
“Variable X”
uses
re-uses by
reference
Variable ID=“X”
Resource Package
published in
219
S05
Variable Scheme ID=“123” Agency=“GESIS”
contained in
Variable ID=“X” Version=“1.0”
changes over time
Variable ID=“X” Version=“1.1”
changes over time
Variable ID=“X” Version=“2.0”
220
S05
Management of Information, Data, and
Metadata
• An organization can manage its organizational
information, metadata, and data within
repositories using DDI 3 to transfer information
into and out of the system to support:
• Controlled development and use of concepts,
questions, variables, and other core metadata
• Development of data collection and capture processes
• Support quality control operations
• Develop data access and analysis systems
221
S05
Upstream Metadata Capture
• Because there is support throughout the lifecycle,
you can capture the metadata as it occurs
• It is re-useable throughout the lifecycle
• It is versionable as it is modified across the lifecycle
• It supports production at each stage of the
lifecycle
• It moves into and out of the software tools used at each
stage
222
S05
Metadata Driven Data Capture
• Questions can be organized into survey instruments
documenting flow logic and dynamic wording
• This metadata can be used to create control programs for Blaise,
CASES, CSPro and other CAI systems
• Generation Instructions can drive data capture from
registry sources and/or inform data processing post
capture
223
S05
Reuse of Metadata
• You can reuse many types of metadata,
benefitting from the work of others
•
•
•
•
•
Concepts
Variables
Categories and codes
Geography
Questions
• Promotes interoperability and standardization
across organizations
• Can capture (and re-use) common cross-walks
224
S05
Virtual Data
• When researchers use data, they often combine
variables from several sources
• This can be viewed as a “virtual” data set
• The re-coding and processing can be captured as
useful metadata
• The researcher’s data set can be re-created from this
metadata
• Comparability of data from several sources can be
expressed
225
S05
Mining the Archive
• With metadata about relationships and structural
similarities
• You can automatically identify potentially comparable data sets
• You can navigate the archive’s contents at a high level
• You have much better detail at a low level across divergent data
sets
DDI - Codebook
Nesstar – HANDS ON
S11
Data Collection/Processing
• Data collection in the lifecycle
• Representing question text
• Questions and questionnaires
• Representing response domains
• Processing collected data
228
229
S11
Production
• Evaluate current processes
• What is done?
• Who does it?
• How is it done (existing software, processes)?
• Where do sections of DDI 3 fit into the process?
• Where does metadata first come into existence?
• What metadata can be reused?
• What sections of metadata be “produced” directly
from existing metadata?
• Time/cost savings
• Consistency
230
S11
Internal Consistency
• Standards within an organization or community
• Concept schemes
• Question schemes
• Coding schemes
• Interoperability between different proprietary software
systems
• allows forward flexibility for software decisions
• allows specialized software for sub-processes
S11
231
Data Collection / Production
• End use is no longer the only focus
• Major selling point of DDI 3 to production organizations is
its ability to “inform and drive the process”
• Metadata content is reused in DDI 3 so capturing it early
is an advantage to the producer
• Metadata captured early can drive the production process
Metadata-Driven Processing:
An Example
Survey Design
Tool
CAI
Tool
Survey
Documentation
What is
your socioeconomic
status?
Interviewer
I’m very,
very
wealthy!
Respondent
Generated
from DDI 3
DDI 3
Question Bank
This replaces older processes where
surveys/CAI were created by hand, and
documented after-the-fact.
S11
233
Capturing and Reusing Metadata
• Whether captured at inception or created after-
the-fact some sections must be completed before
other sections can be completed
• The capture of metadata at point of inception in a
non-proprietary structure that can be transferred
out-of and into process software provides
incentive for metadata creation during the life
cycle of the data
S11
234
Metadata Flow
• DDI is built on the life cycle of the data and some
information naturally occurs earlier than other information
• Reuse of and reference to certain types of information
such as universe, concepts, categories, and coding
schemes prescribe a creation order
235
S11
STEP 1
STEP 2
Universe
Scheme
Category
Scheme
Concept
Scheme
Coding
Scheme
Organization /
Individual
STEP 4
Variable
Scheme
Data
Relationships
STEP 5
Record
Structure
Remaining
Physical Data
Product Items
NCube
STEP 3 optional
StudyUnit
Citation
StudyUnit
Coverage
Question
Scheme
Control
Construct
Scheme
Instrument
Processing
Event
(coding)
STEP 6
Remaining
Logical Product
items
Physical
Instance
STEP 7
Archive / Group
/ etc.
Questions to Variables
REGISTRY
Question
Development
Software
Instrument
Development
Software CAI
Identifying
Universe and
Concepts
Organizing
questions and
flow logic
Building or
Importing
Question Text
and Response
Domains
S11
DDI
Capturing raw
response data
and process
data
Data Processing
Software
Data cleaning
and verification
DDI
Recoding and/or
deriving new
data elements
using existing or
new categories
or coding schemes
236
DAY 2
S10
238
Geographic Structure
• Level
• Code, Name, coverage limitation, description
• Parent
• Reference to a single parent geography
• This is used to describe single hierarchies
• OR Geographic Layer
• References multiple base levels where multiple hierarchies are
layered to create a resulting polygon
S10
239
STATE
County
S10
240
COUNTY
County
Subdivision
Census Tract
Place
241
S10
Hierarchies and Layers
• State (040)
• County (050)
• County Subdivision (060)
• Census Tract (140)
• Place (160)
• Portion of a Census
Tract within a County
Subdivision within a
Place
• Layer References:
• 140
• 060
• 160
242
S10
Geographic Location
• Level description and/or a reference to the level
description in the Geographic Structure
• Reference to the variable containing the identifier
of the geographic location
• Description of a specific geographic location:
• Code
• Name
• Geographic time
• Bounding Polygon
• Excluding Polygon
243
S10
Structure and Location
STRUCTURE:
• Level: 040
• Name: State
• U.S. State or state
equivalent including Legal
Territories and the District
of Columbia
• Parent: 010 [country]
LOCATION:
• Level Reference: 040
• Variable Reference:
STATEFP
• Name: Minnesota
• Code Value: 27
• Geographic Time:
Start:
1857 End: 9999
• Bounding Polygon or
Shape File Reference: for
each boundary over time
DDI Basics (continued)
• Study level information (continued)
• Data capture
• Questions, question flow
• Collection and processing events
• Variables
• Data dictionary contents
• Record relationships
• Physical storage
• Statistics
• From the bottom up
• Grouping and comparison
S18
245
Comparison
• There are two types of comparison in DDI 3:
• Comparison by design
• Ad-hoc (after-the-fact) comparison
• Comparison by design can be expressed using
the grouping and inheritance mechanism
• Ad-hoc comparison can be described using the
comparison module
• The comparison module is also useful for
describing harmonization when performing case
selection activities
S18
246
Data Comparison
• To compare data from different studies (or even waves of the
same study) we use the metadata
• The metadata explains which things are comparable in data sets
• When we compare two variables, they are comparable if they
have the same set of properties
• They measure the same concept for the same high-level universe, and
have the same representation (categories/codes, etc.)
• For example, two variables measuring “Age” are comparable if they
have the same concept (e.g., age at last birthday) for the same top-level
universe (i.e., people, as opposed to houses), and express their value
using the same representation (i.e., an integer from 0-99)
• They may be comparable if the only difference is their representation
(i.e., one uses 5-year age cohorts and the other uses integers) but this
requires a mapping
247
S18
DDI Support for Comparison
• For data which is completely the same, DDI provides a
way of showing comparability: Grouping
• These things are comparable “by design”
• This typically includes longitudinal/repeat cross-sectional studies
• For data which may be comparable, DDI allows for a
statement of what the comparable metadata items are:
the Comparison module
• The Comparison module provides the mappings between similar
items (“ad-hoc” comparison)
• Mappings are always context-dependent (e.g., they are sufficient
for the purposes of particular research, and are only assertions
about the equivalence of the metadata items)
248
S18
Comparability
• The comparability of a question or variable can be
complex. You must look at all components. For example,
with a question you need to look at:
• Question text
• Response domain structure
• Type of response domain
• Valid content, category, and coding schemes
• The following table looks at levels of comparability for a
question with a coded response domain
• More than one comparability “map” may be needed to
accurately describe comparability of a complex
component
249
S18
Detail of question comparability
Comparison
Map
Textual Content
of Main Body
Same
Question
Similar
Category
Same
X
X
X
X
Similar
X
X
X
X
Different
X
X
X
Same
X
X
X
Code Scheme
X
X
X
X
X
X
X
X
X
X
Tools and resources
251
S20
Tools/Projects
• DDI-L has only been an official standard since
April 2008
• Despite this, many tools are being developed
• Some useful tools already exist
• Some tools are available, others are projects
which would be willing to share code (or partner)
as the basis for further development
• The list may not be complete
• IASSIST has a DDI Tools panel every year – see online
presentations
• There is an online tools database at the DDI Alliance
site
252
S20
Tools/Projects (cont.)
• Nesstar (developed by Norwegian Social
Sciences Data Services)
• Commercial product supporting DDI 1.*/2.* (Editor is
free.)
• Provides an editing interface, visualization/tabulation,
and server-to-server data exchange
• Nesstar editor is used by the IHSN Metadata Toolkit,
which adds publishing functionality for HTML, PDF, and
CD-ROMs
• Useful for migration to DDI-L
S20
253
Tools/Projects (cont.)
• DDI Foundation Tools Program
• Joint initiative by several organizations to develop opensource tools for DDI-L
• Includes DeXT (UKDA) and GESIS-developed tools for
transformations to and from DDI 1.0 – 3.0 and statistical
packages (SAS, SPSS. Stata)
• Provides a utilities package for Java development,
including validation, XML beans, URN resolution
• Now developing a suite of tools for editing DDI-L
instances based on a common application framework
(work is lead out of the Danish Data Archive)
254
S20
Tools/Projects (cont.)
• Canadian RDC Network
• Producing DDI-L-based tools for many DDI use cases
•
•
•
•
•
Editing
Migration from DDI-Codebook
Registries
Repositories
Metadata mining
• All tools will be open-source when completed (over next 2 years)
• Some available now on request
S20
255
Tools/Projects (cont.)
• Colectica (by Algenta)
• Commercial tool supporting survey instrument creation,
and other editing functions of DDI
• Has a repository component
• Has Web and PDF publishing functionality
• Supports DDI-C, DDI-L, Blaise, Cases, and CSPro files
• DDI 3.1 is the native file format
• CSPro
• Is currently developing support for DDI-L
• Already supports DDI-C
• Free product
Tools/Projects (Cont.)
• Space-Time Research
• Has DDI-C and DDI-L support in their line of products (SuperCross,
SuperWeb, etc.), for loading micro-data into their proprietary
databases
• Commercial tool providing point-and-click functionality for
tabulation of microdata
• Support for SDMX expression of tabulations
• Uses SDMX RESTful Web services (sort of…)
• Questacy
• Based on an online documentation tool for the LISS panel study at
CentERdata
• Willing to partner to productize the code base
• Database-driven application using PHP and other easy Web
development technologies
Tools/Projects (Cont.)
• Exanda
• Online tabulation system based on DDI-L
• Intended to be released as open source, but no
committed delivery date
• Uses freely available software components (Flex,
Apache Cocoon, etc.)
• QDDS
• Documentation system for questionnaires developed by
GESIS - Leibniz Institute for the Social Sciences
• Uses DDI-C, plans for supporting DDI-L in future
• Freely available, but not open source
Tools/Projects (Cont.)
• University of Tokyo
• Producing a multi-lingual DDI editor
• English-language interface not yet available (2012/13?)
• Will be open-source
• Stat Transfer
• Has implemented support for going to/from statistical packages to
DDI 3.1
Tools/Projects (Cont.)
• Blaise
• Has support for exporting DDI-L descriptions of surveys
• Developed at University of Michigan (ISR - SRO)
• Various (GESIS, University of Kansas, etc.)
• Code for exporting DDI from statistical packages (SAS, SPSS)
• Generally available free if you know who to ask
260
S20
DDI Resources
• DDI Alliance Site
• http://www.ddialliance.org
• General link to all resources/news
• Link to Sourceforge for standards distributions
• Link to prototype page – good for examples
• There is a DDI newsletter you can subscribe to
• Tools/Resources Page
• http://tools.ddialliance.org
• Best place for tools, slides, and resources
261
S20
DDI Resources (cont.)
• Mailing Lists
• www.icpsr.umich.edu/mailman/admin/
• All of the lists starting with “DDI” are related to DDI
topics
• General list
• List for each sub-committee
• Not all groups are active
• User list is the best general place
• Open Data Foundation Site
• www.opendatafoundation.org
• White papers, other resources/tools
S20
262
DDI Resources (cont.)
• DDI Agency Registry
• http://tools.ddialliance.org/?lvl1=community&lvl2=agencyi
d
• Sign up for unique global agency identifier – helps
provide interoperability between organizations
• Currently deploying permanent registry
• International Household Survey Network
• http://surveynetwork.org
• DDI-C-based toolkit available for developing countries
(some free tools)
• Catalog of surveys, many documented in DDI (NADA) –
open source
S20
263
Best Practices
(available at DDI Alliance website)
• Implementation and Governance
• Work flows - Data Discovery and Dissemination: User Perspective
• Work flows - Archival Ingest and Metadata Enhancement
• Work flows for Metadata Creation Regarding Recoding, Aggregation
and Other Data Processing Activities
• Controlled Vocabularies
• Creating a DDI Profile
• DDI 3.0 Schemes
• Versioning and Publication
• DDI as Content for Registries
• Management of DDI 3.0 Unique Identifiers
• DDI 3.0 URNs and Entity Resolution
• High-Level Architectural Model for DDI Applications
S20
264
Use Cases
(available at DDI Alliance website)
• Questasy: Documenting and Disseminating Longitudinal
•
•
•
•
•
•
Data Online Using DDI 3
Building a Modular DDI 3 Editor
Using DDI 3 for Comparison
Extracting Metadata From the Data Analysis Workflow
Questionnaire Management and DDI: The QDDS Case
Grouping of Survey Series Using DDI 3
An Archive's Perspective on DDI 3
S20
265
DDI Events
• IASSIST
• www.iassistdata.org
• Not an official DDI event, but many DDI-related presentations
and meetings
• DDI Alliance Expert Committee meets before or after every
year
• 38th Meeting in Washington DC, was hosted by NORC, June
2012
• 39th Meeting in Köln, Germany, hosted by GESIS - Leibniz
Institute for the Social Sciences
• DDI Workshops often given day before the meeting
• Annual meetings go US-Canada-US-Outside North AmericaUS-Canada-US-Outside North America etc.
S20
266
DDI Events (cont.)
• European DDI User’s Group
• 3rd Meetings was last December at Gothenburg,
Sweden
• 4th Meeting will be in Bergen, Norway, December 2012
• Preceded by a DDI Implementers workshop
• North American User Group now being formed
• GESIS-Sponsored Autumn Events
• Schloss Dagstuhl workshops
• Open Data Foundation meetings
• Spring meeting in Europe
• Winter meeting in the US
• DDI is a major topic of discussion