The PowerPoint slides (ppt)

advertisement
Using XML Mapper and XMLMAP to
Read Data Documented by Data
Documentation Initiative (DDI) Files
Larry Hoyle
Policy Research Institute
University of Kansas
SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or Trademarks of their respective companies.
Overview

A SAS program reads an XML metadata file and
writes a SAS program to read the raw data file
described by the metadata file.
makeReader.sas
Read2825.sa
s
2825.xml
DDI.ma
p
Work.ICPSR2825Househol
d
Work.ICPSR2825Family
Read2825.sas
Work.ICPSR2825Person
Da2825.txt
DDI

“an international effort to establish a standard for
technical documentation describing social science
data” - http://www.icpsr.umich.edu/DDI/index.html
DDI Files

XML
• DTD - http://www.icpsr.umich.edu/DDI/users/dtd/index.html

Metadata about:
• The DDI file itself
• The study that collected the data
• The data file
• Variables within the data file
• Other Material
The Minimal DDI File
<?xml version="1.0"?>
<codeBook>
<stdyDscr>
<citation>
<titlStmt>
<titl>Howdy World: Valid but Useless Metadata</titl>
</titlStmt>
</citation>
</stdyDscr>
</codeBook>
Real Example: ICPSR 6084 Raw Data File
100001
161132146211115555299 9991199
99911219
200001
49992
30000102534
100002
000325222641942
12213112421111212112222221121
2122
12
200002
30000202574
3834101202
12221
000756221622052
4261103202
ICPSR 6084 – About The Study
<citation>
<titlStmt>
<titl>CBS
News Monthly
Poll #2, August
1992</titl>
ICPSR 6084 – About The File
<dimensns>
<caseQnty>1,546</caseQnty>
<varQnty>70</varQnty>
<logRecL>80</logRecL>
<recPrCas>3</recPrCas>
ICPSR 6084 – Reading The File With SAS
<dimensns>
<caseQnty>1,546</caseQnty>
<varQnty>70</varQnty>
<logRecL>80</logRecL>
<recPrCas>3</recPrCas>
<recNumTot>4,638</recNumTot>
</dimensns>
infile 'C:\DDRIVE\data\icpsr\data\6084\da6084.txt' LRECL=80
PAD;
More From ICPSR 6084 – first variable
More From ICPSR 6084 – Reading the
first variable
input
#1 cardno 1-1
More From ICPSR 6084 – another
variable
#3 respno 2-6
The Tasks

Pull the necessary information from a hierarchical
xml file into SAS as tables
• Use XML libname engine with an XMLMAP file

Use that information in SAS to read the raw data
file
Making the XMLMAP File – SAS XML Mapper
Defining Tables –What Defines Rows
Drag the
element that
defines rows to
the root of the
XMLMAP
structure
Defining Tables –Row Defined
Defining Tables – What Defines Columns
Drag an
element
that
defines a
column to
the root of
the table
Defining Tables – Column Defined
Viewing The XMLMap File
Viewing The XMLMap File – Row Path
Viewing The XMLMap File - Column Path
Viewing Sample SAS Code
Previewing the Table
XMLMapper Limitations

Not every XML file will have all the elements of any
possible XML file of that type.
• Use XML Schema instead of XML file

An XML Schema file may not work
• XML file type defined by DTD
• XML Schema too complex for XML Mapper
What then

You can use XMLMapper to start and then hand
edit the XML MAP file.
Lots of Tables From DDI
Mostly for Comments
DATADSCR_VAR
DATADSCR_VARGRP
DATADSCR_VAR_CATGRY
DATADSCR_VAR_INVALRNG
DATADSCR_VAR_INVALRNG_ITEM
DATADSCR_VAR_INVALRNG_RANGE
DATADSCR_VAR_VALRNG_ITEM
DATADSCR_VAR_VALRNG_RANGE
DOCDSCR_CITATION__AUTHENTY
DOCDSCR_CITATION__COPYRIGHT
DOCDSCR_CITATION__IDNO
DOCDSCR_CITATION__OTHID
DOCDSCR_CITATION__PRODDATE
DOCDSCR_CITATION__PRODUCER
DOCDSCR_CITATION__TITL
FILEDSCR_FILETXT
FILEDSCR_FILETXT_RECGRP
STDYDSCR_CITATION_BIBLCIT
STDYDSCR_CITATION_TITLSTMT
STDYDSCR_CITATION_VERSTMT
STDYDSCR_CITATION__AUTHENTY
STDYDSCR_CITATION__COPYRIGHT
STDYDSCR_CITATION__DISTRBTR
STDYDSCR_CITATION__FUNDAG
STDYDSCR_CITATION__GRANTNO
STDYDSCR_CITATION__PRODDATE
STDYDSCR_CITATION__PRODUCER
STDYDSCR_CITATION__SOFTWARE
STDYDSCR_METHOD__COLLMODE
STDYDSCR_METHOD__DATACOLLECTOR
STDYDSCR_METHOD__FREQUENC
STDYDSCR_METHOD__RESINSTRU
STDYDSCR_METHOD__SAMPPROC
STDYDSCR_METHOD__TIMEMETH
STDYDSCR_METHOD__WEIGHT
STDYDSCR_STDYINFO_ABSTRACT
STDYDSCR_STDYINFO__ANLYUNIT
STDYDSCR_STDYINFO__COLLDATE
STDYDSCR_STDYINFO__DATAKIND
STDYDSCR_STDYINFO__GEOGCOVER
STDYDSCR_STDYINFO__KEYWORD
STDYDSCR_STDYINFO__NATION
STDYDSCR_STDYINFO__TIMEPRD
STDYDSCR_STDYINFO__TOPCCLAS
STDYDSCR_STDYINFO__UNIVERSE
Write a SAS Program – Metadata Comment
data _null_;
file reader lrecl=1024 ;
length vEdited $ 2000;
set DDIfile.stdyDscr_citation_titlStmt;
if _n_=1 then put '/*' /
'
SAS program to read ' agency ' ' IDNo ;
stdyDscrTitl= compbl(tranwrd(translate(stdyDscrTitl, ' ', '09'x),
'*/', '*_/'));
put 'Study Title' _n_ ': ' stdyDscrTitl;
altTitl=compbl(tranwrd(translate(altTitl,
' ','09'x),'*/','*_/'));
put '
' altTitl;
/*
SAS program to read ICPSR 6084
Study Title1 : CBS News Monthly Poll #2, August 1992
August National Poll II, Republican National Convention
Cntlin file for Formats
data makeTheFormats;
input fmtname $ 1-7 type $ 9-9 start $ 11-26 default 28-35 /
label :&$512.;
datalines;
V00006f N 1
1
Yes
V00006f N 3
1
Converted Refusal
;
run;
proc format cntlin=makeTheFormats;
run;
Input

Logic to produce different input statements for:
• Fixed column data
• Delimited data
Output

Multiple datasets if different record types
• Separate keep dataset options
• Logic to output to appropriate dataset
if "&fileStructureType" eq "hierarchical" then do;
put 'if left(_RecordSetIdentifier) eq left("' catValu '") then DO;'
/ " output " safeAgency +(-1) safeIDNo +(-1) rectype ';' / 'END;' //;
end;
if left(_RecordSetIdentifier) eq left("2 ") then DO;
output ICPSR2825FAMILY ;
END;
Some of the Limitations

DDI can describe nCubes, geographic coverage,
variable groups
• Current makeReader.sas can’t handle these

DDI definition includes recursive elements
• E.g. recGrps within recGrps
• Current makeReader,sas would not find nested elements
Questions?
About the Speaker
Larry Hoyle
Associate Scientist
Policy Research Institute,
University of Kansas
1541 Lilac Lane
Lawrence, KS 66044-3177
LarryHoyle@ku.edu
Download