Describing longitudinal data in DDI

Marking up Longitudinal Data in DDI
Sanda Ionescu - ICPSR
We support the “Complex Files Proposal,” finding that it provides a good basis for
identifying related data files and the key variables to be used in merges.
In addition, we have identified a number of requirements for adequate markup of
longitudinal data in DDI.
Variables mapping
(See sheet 1 of attached spreadsheet)
The ability to map identical, or closely similar, variables is a key requirement in
documenting longitudinal studies, in which panels of respondents are usually asked the
same set(s) of questions over time. Data analyses often focus on comparing the answers
to these questions. Variables mappings are thus useful for both search and discovery
purposes and for enabling software to manipulate the physical structure of the data files
(split, merge, or recombine).
With an awareness of the multitude of factors that affect variables comparability – some
of which (like context) cannot be captured in a systematic and structured way – we
propose to limit variables mapping in DDI to the most common and easily identifiable
types of matches:
Type A: “Identical” variables, in which both question text and categories (values and
labels) are exactly the same
Type B: Variables with identical question text, but different categories (changes in
values, labels, or even number of categories)
Type C: Variables with changes in the question wording, but with identical categories
Type D: Variables in which both question wording and categories are (somewhat, but not
entirely) different, although they clearly and recognizably attempt to measure the
same concept
We believe that the type of match needs to be specified in the markup. In the attached
spreadsheet we have included a data construct (we see it as an attribute) that separately
identifies each type of match.
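As a rough illustration only, the match-type attribute could be restricted to four coded values, one per type. The sketch below uses W3C XML Schema. SameQstnCats and DiffQstnSameCats are the values used in the markup example in this document, which we read as Types A and C; the type name MatchTypeCode and the values SameQstnDiffCats and DiffQstnDiffCats are hypothetical names introduced here for Types B and D.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type name; one enumerated value per match type -->
  <xs:simpleType name="MatchTypeCode">
    <xs:restriction base="xs:NMTOKEN">
      <xs:enumeration value="SameQstnCats"/>     <!-- Type A -->
      <xs:enumeration value="SameQstnDiffCats"/> <!-- Type B: hypothetical value -->
      <xs:enumeration value="DiffQstnSameCats"/> <!-- Type C -->
      <xs:enumeration value="DiffQstnDiffCats"/> <!-- Type D: hypothetical value -->
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

A closed list of this kind would let software decide, from the attribute alone, whether categories can be compared directly (Types A and C) or need recoding first (Types B and D).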
We also believe that for each variable included in the mapping we should specify the
wave it belongs to, and reference the study (ItemGroup) if we’re matching variables from
different studies.
It will also be necessary to identify, for each variable, the respondent, analysis unit,
universe, and certainly the physical data file containing the variable. We assume at this
point that it will be possible to pull out these pieces of information from the actual
variable description (Logical Product module) and its own link to the Physical Data
Instance. If that is not the case, then we will need to specify these, and IDREF the
physical data file.
Some PIs/data producers draw up and distribute the variables mapping themselves. To
distinguish between such mappings and archive-generated ones, the source attribute
(MatchSource in the markup example below) will indicate who is responsible for the mapping.
We offer below a markup example of the desired variables mapping layout, as included
in an ItemGroup/ItemGroup Comparison module.
The MatchGroup element is designed to group matches of different types where all
variables are designed to measure the same concept (for instance, a variable meant to
measure “trust” might have the same question text in the first two waves of a study, then
a slightly different question text in the third wave, and then revert to the original text in
the fourth wave).
<VariablesMapping>
  <MatchGroup MatchReference="VM1 VM2"/>
  <VariablesMatch ID="VM1" MatchType="SameQstnCats" MatchSource="producer">
    <VariableDescription VarRef="V10" VarName="HEALTH" Wave="2002"
      Geography="US" ItemGrpRef="IG3"/>
    <VariableDescription VarRef="V100" VarName="RHEALTH" Wave="2001"
      Geography="Mexico" ItemGrpRef="IG5"/>
  </VariablesMatch>
  <VariablesMatch ID="VM2" MatchType="DiffQstnSameCats">… etc., etc.
  </VariablesMatch>
</VariablesMapping>
[Please note that given the early stage of Version 3.0 development, we cannot presume to
provide a definitive markup solution at this point. The actual code may well be different,
as long as the basic requirements are covered.]
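To make the role of MatchGroup more concrete, the “trust” scenario described above might be marked up along the following lines. This is a sketch only. The IDs (VM10, VM11, V21 to V24), the variable names, the wave labels, and the MatchSource value "archive" are all hypothetical, and we assume for illustration that the reworded wave 3 question kept the same categories. Geography and ItemGrpRef are left out because all variables come from one study.

```xml
<VariablesMapping>
  <MatchGroup MatchReference="VM10 VM11"/>
  <!-- Waves 1, 2 and 4: same question text and same categories -->
  <VariablesMatch ID="VM10" MatchType="SameQstnCats" MatchSource="archive">
    <VariableDescription VarRef="V21" VarName="TRUST1" Wave="1"/>
    <VariableDescription VarRef="V22" VarName="TRUST2" Wave="2"/>
    <VariableDescription VarRef="V24" VarName="TRUST4" Wave="4"/>
  </VariablesMatch>
  <!-- Wave 3: reworded question, categories assumed unchanged, matched against wave 1 -->
  <VariablesMatch ID="VM11" MatchType="DiffQstnSameCats" MatchSource="archive">
    <VariableDescription VarRef="V21" VarName="TRUST1" Wave="1"/>
    <VariableDescription VarRef="V23" VarName="TRUST3" Wave="3"/>
  </VariablesMatch>
</VariablesMapping>
```

Grouping the two matches lets a user retrieve all four “trust” variables at once while still seeing which of them are strictly identical.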
Flagging the attrition variable
(Sheet 2 of attached spreadsheet)
For analysis purposes it is important, in longitudinal studies, to flag the attrition variable,
most likely with an attribute under <var> (in the Logical Data Structure) that would be
similar to our current “weight” flag.
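A minimal sketch of what such a flag might look like, modeled on the wgt attribute that <var> carries in the current DDI. The attrition attribute and its value are hypothetical and are not part of any DDI version; the variable IDs, names, and labels are invented for illustration.

```xml
<dataDscr>
  <!-- Existing convention: a weight variable is flagged with wgt="wgt" -->
  <var ID="V7" name="WEIGHT" wgt="wgt">
    <labl>Sampling weight</labl>
  </var>
  <!-- Hypothetical analogue for the attrition variable -->
  <var ID="V8" name="ATTRIT" attrition="attrition">
    <labl>Respondent left the panel</labl>
  </var>
</dataDscr>
```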
Providing an accurate description of complex files
(Sheet 2 of attached spreadsheet)
The discussion around complex files has so far focused mainly on marking up
conceptually related “simple” files (one wave, one respondent, one level of a hierarchy)
with a view to merging them into complex files.
Yet both data archives and data producers hold and deliver complex physical data files
(multiple waves, respondents, questionnaires, etc.) that may need to be split, for analysis
purposes, into “simple(r)” units. We need to ensure that the new Data Model design
provides for identifying the wave, and where relevant the respondent, for each and every
variable in such datasets.
We may follow up with a more concrete/elaborate proposal when we have a better idea of
how the actual specification will look.
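Purely as a placeholder until then, the requirement could be met by something as simple as per-variable attributes in a complex file. In the sketch below, Wave and Respondent are hypothetical attribute names, and the variable IDs, names, and values are invented; the actual Version 3.0 design may well express this differently.

```xml
<dataDscr>
  <!-- One physical file holding two waves and two respondent types -->
  <var ID="V10" name="HEALTH01" Wave="2001" Respondent="self">
    <labl>Self-rated health, 2001</labl>
  </var>
  <var ID="V11" name="HEALTH02" Wave="2002" Respondent="self">
    <labl>Self-rated health, 2002</labl>
  </var>
  <var ID="V12" name="PHEALTH02" Wave="2002" Respondent="proxy">
    <labl>Proxy-rated health, 2002</labl>
  </var>
</dataDscr>
```

With this information attached to every variable, software could split the file into one-wave, one-respondent units without any knowledge of naming conventions.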
Matching variables by keywords
(Sheet 2 of attached spreadsheet)
Keywords (concepts) are a “looser” but equally useful way of comparing variables,
especially across studies.
Repeatable keyword tags are now present in the variables description section of the DDI.
However, for comparison purposes, and considering that many longitudinal studies are
structured by topical sections and subsections, we are finding that for a more accurate,
in-depth classification we need to use a keyword hierarchy, taking us from more general
subject areas down to narrower concepts.
A group of variables, for instance, could be classified under
    Health
        Risk Factors
            Smoking
while another group might fall under
    Health
        Risk Factors
            Substance abuse (or Obesity, or whatever else)
(An easy and straightforward way to enable such a hierarchical classification would be to
introduce nestable concept tags:
<concept>Health
  <concept>Risk Factors
    <concept>Smoking</concept>
  </concept>
</concept>
If nesting is not favored in the design of the new DDI Data Model, any other pattern
should be adopted that would achieve the same result.)
In the attached spreadsheet we use IDREF-ing (pointing up to the parent) on the
assumption that this technique will be adopted for V 3.0:
<Variable>
  <Concept ID="C1V1" Vocabulary="vocabularyname" VocabularyURI="URI"
    ParentConcept="">Health</Concept>
  <Concept ID="C2V1" Vocabulary="vocabularyname" VocabularyURI="URI"
    ParentConcept="C1V1">Risk factors</Concept>
</Variable>
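Carried through to the narrowest level of the Health / Risk Factors / Smoking example, the same pattern might look as follows. The ID C3V1 is hypothetical. As an assumption on our part, the top-level concept here simply omits ParentConcept, since an empty value would not validate if the attribute is declared as an IDREF.

```xml
<Variable>
  <!-- Top-level concept: no parent, so ParentConcept is omitted (assumption) -->
  <Concept ID="C1V1" Vocabulary="vocabularyname" VocabularyURI="URI">Health</Concept>
  <Concept ID="C2V1" Vocabulary="vocabularyname" VocabularyURI="URI"
    ParentConcept="C1V1">Risk factors</Concept>
  <!-- Hypothetical third level, pointing up to "Risk factors" -->
  <Concept ID="C3V1" Vocabulary="vocabularyname" VocabularyURI="URI"
    ParentConcept="C2V1">Smoking</Concept>
</Variable>
```

Following the ParentConcept references upward from C3V1 recovers the full path Health, Risk factors, Smoking, which is the same information the nested form conveys.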