SUGI 31 Hoyle paper 019-31

Reading Microsoft Word XML files with SAS®
Larry Hoyle, Policy Research Institute, University of Kansas

Three Scenarios
• Extracting text and attributes
• Extracting data from tables
• Extracting drawing object parameters

XML - Syntax
• Must begin with this prolog tag
• Paired tags, must have 1 root tag, case sensitive
• Tags and content called "element"
• Tags can be qualified by attributes
• Elements can be nested, start and end in same parent

    <?xml version="1.0" ?>
    <LarryRootTag>
      <EmptyTag/>
      <nestedTag> Some content </nestedTag>
      <nestedTag anAttribute="wha"> Other content </nestedTag>
    </LarryRootTag>

Word XML

Extracting Text and Properties

What Does SAS Need?
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate XMLMAP
• Only needs to be generated once for each type of extract

Example Document
Styles and Colors Have Meaning
I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

Style and Color
• Style is “Treated” - a statement about treatment
• Color is “Red” - represents negative affect

Example Document as XML
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

Rows
• The XMLMap has to describe a path that delineates rows
• In this case it’s each text element in a run (in a paragraph…)

    <TABLE-PATH syntax="XPath">
      /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
    </TABLE-PATH>

Columns - the Text
• The XMLMap has to describe a path that delineates each column
• The text itself is:

    <COLUMN name="t">
      <PATH syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
      </PATH>

Columns - the Text Element Number
• A sequential number for the text element is:

    <COLUMN name="tNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
      </INCREMENT-PATH>

Columns - the Paragraph Number
• A sequential number for the paragraph is:

    <COLUMN name="pNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p
      </INCREMENT-PATH>

Columns - Paragraph Color

    <COLUMN name="PColorVal" retain="YES">
      <PATH syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val
      </PATH>

Columns - Run Color

    <COLUMN name="RColorVal" retain="YES">
      <PATH syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val
      </PATH>

Columns - Run Style

    <COLUMN name="RStyleval">
      <PATH syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val
      </PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>11</LENGTH>
    </COLUMN>

The Data as Read into SAS

Tables

Our Sample Tables
• Read all data from all tables into one dataset
• Add variables to indicate table, row, column

The Tables Dataset

Word XML - Tables
• Absolute Path
    /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
• Relative Path
    w:tc/w:p/w:r/w:t

Count Table Beginnings

    <INCREMENT-PATH beginend="BEGIN" syntax="XPath">
      w:tbl
    </INCREMENT-PATH>

Count Table Endings

    <INCREMENT-PATH beginend="END" syntax="XPath">
      w:tbl
    </INCREMENT-PATH>

Graphics

Drawing Object Parameters
VML - Vector Markup Language
• This example will only read lines (they’re easiest)
• Other drawing objects have different XML elements

Our Example Drawing

Word XML - Drawn Lines

One Row for Each Line Element

    <TABLE-PATH syntax="XPath">
      /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
    </TABLE-PATH>

Columns - Parameters as Attributes

    <COLUMN name="from">
      <PATH syntax="XPath">
        /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from
      </PATH>

The Dataset

Example Code in Paper
• Convert colors
• Parse stroke weight (e.g. 2pt)
• Detect the keyword “flip” and flip coordinates

As Drawn by SAS

Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
LarryHoyle@ku.edu
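
Sketch: the text-extraction fragments assembled into one XMLMap

The slides on extracting text show the row path and each column definition separately. As a rough illustration of how they could fit together, the sketch below places them inside a single map file. Everything not shown on the slides is an assumption made for this example: the SXLEMAP version value, the table name WordText, the numeric TYPE and integer DATATYPE on the two counter columns, and the LENGTH values for the text and color columns. It is not the map from the paper.

```xml
<?xml version="1.0" ?>
<!-- Sketch only. Version, table name, counter types and most lengths are assumed. -->
<SXLEMAP version="1.2">
  <TABLE name="WordText">
    <!-- One output row per text element in a run -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

    <!-- Paragraph counter: incremented each time a paragraph begins -->
    <COLUMN name="pNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- Text element counter: incremented each time a text element begins -->
    <COLUMN name="tNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- Paragraph color, retained so it stays available on later rows -->
    <COLUMN name="PColorVal" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH> <!-- assumed length -->
    </COLUMN>

    <!-- Run color, retained in the same way -->
    <COLUMN name="RColorVal" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH> <!-- assumed length -->
    </COLUMN>

    <!-- Run style, as given on the slide -->
    <COLUMN name="RStyleval">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>11</LENGTH>
    </COLUMN>

    <!-- The text itself -->
    <COLUMN name="t">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH> <!-- assumed length -->
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

A map like this is normally passed to the SAS XML engine through the XMLMAP= option of the LIBNAME statement. Because RStyleval is not retained in the slide fragment, a run without a style would presumably yield a blank value rather than inheriting the previous run's style.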