Reading Microsoft Word XML files with SAS
August 25, 2005
Larry Hoyle -- Policy Research Institute, University of Kansas
revised 8/18/2005

3 scenarios
• Extracting text along with associated properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in drawings

XML - syntax
• Must begin with the prolog tag (the first line of the example below)
• Tags are paired, and there must be 1 root tag
• Tag names are case sensitive
• Empty tags end with />
• Tags and their content are called an "element"
• Tags can be qualified by attributes
• Elements can be nested, and must start and end in the same parent

<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>

Word XML

Extracting text and properties
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate XMLMAP
• Only needs to be generated once for each type of extract

Example Document
I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

XML - Example Document
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
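For orientation before the map details, here is a minimal sketch of how the SAS XML engine might be pointed at a saved Word XML file and its map. The file paths, the libref and fileref names, and the table name "runs" are hypothetical and do not come from the slides; the table name has to match whatever TABLE name the map defines.

```sas
/* Hypothetical path: substitute the XMLMap written for this extract. */
filename wordmap 'C:\temp\comments.map';

/* Assign a libref with the XML engine; XMLMAP= names the map file.   */
/* The quoted path is the Word document saved as XML (hypothetical).  */
libname worddoc xml 'C:\temp\comments.xml' xmlmap=wordmap access=readonly;

/* "runs" is a hypothetical table name; it must match the TABLE name  */
/* declared in the map.                                               */
proc print data=worddoc.runs;
run;

libname worddoc clear;
```

Because the map is tied to the type of extract rather than to one document, the same map file could be reused with a different XML path in the LIBNAME statement.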
Rows
• The XMLMap has to describe a path that delineates rows
• In this case it’s each text element in a run (in a paragraph…)
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

Columns – the text
• The XMLMap has to describe a path that delineates each column
• The text itself is:
<COLUMN name="t">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

Columns – the text element number
• A sequential number for the text element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

Columns – the paragraph number
• A sequential number for the paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>

Columns – paragraph color
<COLUMN name="PColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

Columns – run color
<COLUMN name="RColorVal" retain="YES">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>

Our dataset

Tables

All Tables Into One Dataset

Tables – Word XML

Tables – Dataset Rows
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Tables – Table Number
<COLUMN name="tblNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>

Tables – Row Number
<COLUMN name="trNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>

We Could Add Properties if Needed

Nested tables

Nested Tables – Absolute Path for Rows
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Nested Tables – Rootless Path for Rows
<TABLE-PATH syntax="XPath">
w:tbl/w:tr/w:tc/w:p/w:r/w:t
</TABLE-PATH>

Drawing Objects

VML – Vector Markup Language
• Drawings in Word get stored as XML also
• We’ll just look at lines

Dataset – One Row for Each Line
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

Dataset – Column: From
<COLUMN name="from">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>

Dataset – Column: To
<COLUMN name="to">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>

Dataset – Column: StrokeColor
<COLUMN name="strokecolor">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>

The Dataset

Usage Example: Annotate dataset
/* Excerpt from the DATA step; xyPattern is a compiled regular expression defined elsewhere. */
if prxmatch(xyPattern, from) then do;
   function = 'move';
   /* x comes from capture buffer 1 of the "from" coordinate */
   x = input(prxposn(xyPattern, 1, from), 10.);
   /* If the style contains flip:y, y is read from "to"; otherwise from "from". */
   /* y is negated in both cases. The capture positions used by prxposn are    */
   /* those of the match against "from" above.                                 */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(prxposn(xyPattern, 2, to), 10.);
   else
      y = -1 * input(prxposn(xyPattern, 2, from), 10.);
   output;
/* The rest of the do-group is not shown. */

Plotted in SAS

Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
LarryHoyle@ku.edu
http://www.ku.edu/pri/ksdata/sashttp/sugi31