Marking up Longitudinal Data in DDI
Sanda Ionescu - ICPSR

We support the “Complex Files Proposal,” finding that it provides a good basis for identifying related data files and the key variables to be used in merges. In addition, we have identified a number of requirements for adequate markup of longitudinal data in DDI.

Variables mapping (See sheet 1 of attached spreadsheet)

The ability to map identical, or closely similar, variables is a key requirement in documenting longitudinal studies, in which panels of respondents are usually asked the same set(s) of questions over time. Data analyses often focus on comparing the answers to these questions. Variables mappings are thus useful both for search and discovery purposes and for enabling software to manipulate the physical structure of the data files (split, merge, or recombine).

With an awareness of the multitude of factors that affect variables comparability, some of which (like context) cannot be captured in a systematic and structured way, we propose to limit variables mapping in DDI to the most common and easily identifiable types of matches:

Type A: “Identical” variables, in which both question text and categories (values and labels) are exactly the same
Type B: Variables with identical question text, but different categories (changes in values, labels, or even number of categories)
Type C: Variables with changes in the question wording, but with identical categories
Type D: Variables in which both question wording and categories are (somewhat, but not entirely) different, although they clearly and recognizably attempt to measure the same concept

We believe that the type of match needs to be specified in the markup. In the attached spreadsheet we have included a data construct (we see it as an attribute) that separately identifies each type of match.
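To make the four match types concrete, the following is a minimal sketch of how software might assign a type to a pair of variables that have already been judged to measure the same concept. The Variable class, its fields, and the match_type function are hypothetical names introduced only for this illustration; they are not part of DDI. The sketch compares question text and categories by exact equality, which is only one possible reading of "exactly the same".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical, minimal stand-in for a variable description (not a DDI type)."""
    name: str
    question_text: str
    categories: tuple  # ordered (value, label) pairs


def match_type(a: Variable, b: Variable) -> str:
    """Return the proposed match type ("A", "B", "C" or "D") for two variables.

    Assumption: the caller has already decided that both variables try to
    measure the same concept. That judgement (needed for Type D) cannot be
    derived from the text alone and is not attempted here.
    """
    same_question = a.question_text == b.question_text
    same_categories = a.categories == b.categories
    if same_question and same_categories:
        return "A"  # identical question text and categories
    if same_question:
        return "B"  # identical question text, different categories
    if same_categories:
        return "C"  # different question wording, identical categories
    return "D"      # both differ, same concept by assumption


# Usage: a "trust" question whose wording changes between waves.
cats = ((1, "Yes"), (2, "No"))
wave1 = Variable("TRUST1", "Generally speaking, can most people be trusted?", cats)
wave3 = Variable("TRUST3", "Would you say that most people can be trusted?", cats)
print(match_type(wave1, wave3))  # C
```

Returning the letter codes rather than attribute values is deliberate: the spreadsheet, not this sketch, defines how each type would be named in the markup.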
We also believe that for each variable included in the mapping we should specify the wave it belongs to, and reference the study (ItemGroup) if we are matching variables from different studies. It will also be necessary to identify, for each variable, the respondent, analysis unit, universe, and certainly the physical data file containing the variable. We assume at this point that it will be possible to pull these pieces of information from the actual variable description (Logical Product module) and its own link to the Physical Data Instance. If that is not the case, then we will need to specify them in the mapping itself, and IDREF the physical data file.

Some PIs/data producers draw up and distribute the variables mapping themselves. To distinguish between such mappings and archive-generated ones, a source attribute (MatchSource in the markup example below) will indicate who is responsible for the mapping.

We offer below a markup example of the desired variables mapping layout, as included in an ItemGroup/ItemGroup Comparison module. The MatchGroup element is designed to group matches of different types where all variables are designed to measure the same concept (for instance, a variable meant to measure “trust” might have the same question text in the first two waves of a study, then a slightly different question text in the third wave, and then revert to the original text in the fourth wave).

<VariablesMapping>
  <MatchGroup MatchReference="VM1 VM2"/>
  <VariablesMatch ID="VM1" MatchType="SameQstnCats" MatchSource="producer">
    <VariableDescription VarRef="V10" VarName="HEALTH" Wave="2002" Geography="US" ItemGrpRef="IG3"/>
    <VariableDescription VarRef="V100" VarName="RHEALTH" Wave="2001" Geography="Mexico" ItemGrpRef="IG5"/>
  </VariablesMatch>
  <VariablesMatch ID="VM2" MatchType="DiffQstnSameCats">… etc., etc. </VariablesMatch>
</VariablesMapping>

[Please note that given the early stage of Version 3.0 development, we cannot presume to provide a definitive markup solution at this point.
The actual code may well be different, as long as the basic requirements are covered.]

Flagging the attrition variable (Sheet 2 of attached spreadsheet)

For analysis purposes it is important, in longitudinal studies, to flag the attrition variable, most likely with an attribute under <var> (in the Logical Data Structure) that would be similar to our current “weight” flag.

Providing an accurate description of complex files (Sheet 2 of attached spreadsheet)

The discussion around complex files has so far focused mainly on marking up conceptually related “simple” files (one wave, one respondent, one level of a hierarchy) with a view to merging them into complex files. Yet both data archives and data producers hold and deliver complex physical data files (multiple waves, respondents, questionnaires, etc.) that may need to be split, for analysis purposes, into “simple(r)” units. We need to ensure that the new Data Model design provides for identifying the wave for each and every variable in such datasets. We may follow up with a more concrete and elaborate proposal when we have a better idea of how the actual specification will look.

Matching variables by keywords (Sheet 2 of attached spreadsheet)

Keywords (concepts) are a “looser” but equally useful way of comparing variables, especially across studies. Repeatable keyword tags are now present in the variables description section of the DDI. However, for comparison purposes, and considering that many longitudinal studies are structured by topical sections and subsections, we find that a more accurate, in-depth classification requires a keyword hierarchy, taking us from more general subject areas down to narrower concepts.
A group of variables, for instance, could be classified under

Health > Risk Factors > Smoking

while another group might fall under

Health > Risk Factors > Substance abuse (or Obesity, or whatever else)

(An easy and straightforward way to enable such a hierarchical classification would be to introduce nestable concept tags:

<concept>Health
  <concept>Risk Factors
    <concept>Smoking</concept>
  </concept>
</concept>

If nesting is not favored in the design of the new DDI Data Model, any other pattern that achieves the same result could be adopted.)

In the attached spreadsheet we use IDREF-ing (pointing up to the parent) on the assumption that this technique will be adopted for V 3.0:

<Variable>
  <Concept ID="C1V1" Vocabulary="vocabularyname" VocabularyURI="URI" ParentConcept="">Health</Concept>
  <Concept ID="C2V1" Vocabulary="vocabularyname" VocabularyURI="URI" ParentConcept="C1V1">Risk factors</Concept>
</Variable>
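To illustrate how the parent-pointing technique could be consumed by software, here is a minimal sketch that rebuilds the full keyword path for each concept by following ParentConcept references upward. It assumes the element and attribute names from the example above, which may well change in the final specification. The third Concept entry ("Smoking", ID "C3V1") and the function name concept_paths are hypothetical additions for the purpose of the illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup: the first two Concept elements come from the example above;
# the third ("Smoking") is a hypothetical extension following the same pattern.
SAMPLE = """
<Variable>
  <Concept ID="C1V1" Vocabulary="vocabularyname" VocabularyURI="URI" ParentConcept="">Health</Concept>
  <Concept ID="C2V1" Vocabulary="vocabularyname" VocabularyURI="URI" ParentConcept="C1V1">Risk factors</Concept>
  <Concept ID="C3V1" Vocabulary="vocabularyname" VocabularyURI="URI" ParentConcept="C2V1">Smoking</Concept>
</Variable>
"""


def concept_paths(variable_xml: str) -> dict:
    """Map each Concept ID to its list of labels, most general concept first."""
    root = ET.fromstring(variable_xml.strip())
    # ID -> (label, parent ID or None); an empty ParentConcept marks a top-level concept.
    concepts = {
        c.get("ID"): ((c.text or "").strip(), c.get("ParentConcept") or None)
        for c in root.iter("Concept")
    }
    paths = {}
    for concept_id in concepts:
        path, seen, current = [], set(), concept_id
        while current is not None:
            if current in seen:
                raise ValueError(f"cycle in ParentConcept references at {current!r}")
            if current not in concepts:
                raise ValueError(f"unknown ParentConcept {current!r}")
            seen.add(current)
            label, parent = concepts[current]
            path.append(label)
            current = parent
        paths[concept_id] = list(reversed(path))
    return paths


print(" > ".join(concept_paths(SAMPLE)["C3V1"]))  # Health > Risk factors > Smoking
```

One consequence of pointing up to the parent, rather than nesting, is that a broken or circular reference becomes possible; the sketch raises an error in both cases instead of guessing.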