SUGI31 PowerPoint slides (ppt)

SUGI 31
Hoyle paper 019-31
Reading Microsoft Word XML files with SAS®
Larry Hoyle,
Policy Research Institute,
University of Kansas
Three Scenarios
• Extracting text and attributes
• Extracting data from tables
• Extracting drawing object parameters
XML - Syntax
• Must begin with the prolog tag (the first line of the sample below)
• Tags are paired, and there must be exactly 1 root tag
• Tag names are case sensitive
• Tags and their content are called an "element"
• Tags can be qualified by attributes
• Elements can be nested, but must start and end in the same parent
<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag >
<nestedTag anAttribute="wha">
Other content
</nestedTag >
</LarryRootTag>
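The syntax rules above can be checked with any XML parser. The following is a minimal sketch in Python (standard library only, not part of the original slides) that parses the sample and lists each child element with its content and attributes. The function name `describe` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, unchanged.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag >
<nestedTag anAttribute="wha">
Other content
</nestedTag >
</LarryRootTag>"""


def describe(xml_text):
    """Return the root tag and a (tag, content, attributes) tuple per child."""
    root = ET.fromstring(xml_text)
    children = [
        (child.tag, (child.text or "").strip(), dict(child.attrib))
        for child in root
    ]
    return root.tag, children


root_tag, children = describe(SAMPLE)
print(root_tag)
for tag, content, attributes in children:
    print(tag, repr(content), attributes)
```

Because tag names are case sensitive, a start tag `<nestedTag>` closed by `</NestedTag>` is rejected by the parser with a parse error.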
Word XML
Extracting Text and Properties
What Does SAS Need?
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate XMLMAP
• Only needs to be generated once for each type of extract
Example Document
Styles and Colors Have Meaning
I have never been so humiliated in my life. That was very rude treatment.
What a pleasant experience. Your staff was both quick and pleasant.
It took about the time I expected to reach someone.
I have nothing to say. The sky is blue and the sea is green.
You are the worst organization in the world.
I love you guys.
Style and Color
• Style is “Treated” - a statement about treatment
• Color is “Red” - represents negative affect
Example Document as XML
I have never been so humiliated in my life. That was very rude treatment.
What a pleasant experience. Your staff was both quick and pleasant.
It took about the time I expected to reach someone.
I have nothing to say. The sky is blue and the sea is green.
You are the worst organization in the world.
I love you guys.
Paragraph property:
/w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property:
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows
• The XMLMap has to describe a path that delineates rows
• In this case it’s each text element in a run (in a paragraph…)
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
</TABLE-PATH>
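To see which elements this path selects, the same path can be evaluated outside SAS. The sketch below uses Python's standard library rather than the SAS XML engine. It assumes the Word 2003 namespace URIs for the `w` and `wx` prefixes and a small hand-written document, and the name `text_rows` is hypothetical. Each selected `w:t` element corresponds to one row.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}

# Hand-written stand-in for a Word XML file; the second paragraph has two runs.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
<w:body><wx:sect>
<w:p><w:r><w:t>I love you guys.</w:t></w:r></w:p>
<w:p><w:r><w:t>It took about the time I </w:t></w:r><w:r><w:t>expected to reach someone.</w:t></w:r></w:p>
</wx:sect></w:body>
</w:wordDocument>"""


def text_rows(xml_text):
    """Return one entry per w:t element, in document order."""
    root = ET.fromstring(xml_text)  # root is w:wordDocument
    return [t.text for t in root.findall("w:body/wx:sect/w:p/w:r/w:t", NS)]


for row in text_rows(DOC):
    print(repr(row))
```

A paragraph with two runs yields two rows, which is why the later slides add paragraph and text-element counters.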
Columns – the Text
• The XMLMap has to describe a path that delineates each column
• The text itself is:
<COLUMN name="t">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
</PATH>
Columns – the Text Element Number
• A sequential number for the text element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
</INCREMENT-PATH>
Columns – the Paragraph Number
• A sequential number for the paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p
</INCREMENT-PATH>
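The effect of the two ordinal columns can be pictured as counters that advance each time the named element begins. The following Python sketch is an illustration only, not the SAS XML engine. It assumes the Word 2003 namespace URIs and a hand-written document, and the name `numbered_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}

# Hand-written stand-in for a Word XML file.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
<w:body><wx:sect>
<w:p><w:r><w:t>I love you guys.</w:t></w:r></w:p>
<w:p><w:r><w:t>It took about the time I </w:t></w:r><w:r><w:t>expected to reach someone.</w:t></w:r></w:p>
</wx:sect></w:body>
</w:wordDocument>"""


def numbered_rows(xml_text):
    """One row per w:t, with running paragraph and text-element numbers."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = 0  # advances at the start of every w:t
    paragraphs = root.findall("w:body/wx:sect/w:p", NS)
    for p_num, para in enumerate(paragraphs, start=1):  # advances at every w:p
        for t in para.findall("w:r/w:t", NS):
            t_num += 1
            rows.append({"pNum": p_num, "tNum": t_num, "t": t.text})
    return rows


for row in numbered_rows(DOC):
    print(row)
```

Rows that share a pNum value belong to the same paragraph, so the text pieces of a paragraph can be regrouped after extraction.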
Columns – Paragraph Color
<COLUMN name="PColorVal" retain="YES">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val
</PATH>
Columns – Run Color
<COLUMN name="RColorVal" retain="YES">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val
</PATH>
Columns – Run Style
<COLUMN name="RStyleval">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val
</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>11</LENGTH>
</COLUMN>
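The run-level paths above point at attributes of elements inside `w:rPr`. As a rough illustration in Python (standard library, not the SAS XML engine), the sketch below reads the run color and run style next to each text element. It assumes the Word 2003 namespace URIs, assumes the `val` attribute carries the `w` prefix in the file, and uses a hand-written document. The name `run_rows` is hypothetical, and the `retain="YES"` behavior of the XMLMap is not emulated: a run without a property simply yields `None`.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}
VAL = f"{{{W}}}val"  # namespaced attribute name for w:val

# Hand-written stand-in: the first run has a style and a color, the second has neither.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
<w:body><wx:sect>
<w:p><w:r><w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
<w:t>That was very rude treatment.</w:t></w:r></w:p>
<w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
</wx:sect></w:body>
</w:wordDocument>"""


def run_rows(xml_text):
    """One row per w:t with the color and style of its enclosing run."""
    root = ET.fromstring(xml_text)
    rows = []
    for run in root.findall("w:body/wx:sect/w:p/w:r", NS):
        color = run.find("w:rPr/w:color", NS)
        style = run.find("w:rPr/w:rStyle", NS)
        for t in run.findall("w:t", NS):
            rows.append({
                "t": t.text,
                "RColorVal": color.get(VAL) if color is not None else None,
                "RStyleval": style.get(VAL) if style is not None else None,
            })
    return rows


for row in run_rows(DOC):
    print(row)
```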
The Data as Read into SAS
Tables
Our Sample Tables
• Read all data from all tables into one dataset
• Add variables to indicate table, row, column
The Tables Dataset
Word XML – Tables
• Absolute Path
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
• Relative Path
w:tc/w:p/w:r/w:t
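The goal of one dataset holding every cell, with variables for table, row, and column, can be sketched in Python as nested loops over `w:tbl`, `w:tr`, and `w:tc`. This is an illustration with the standard library, not the XMLMap from the paper. It assumes the Word 2003 namespace URIs and a hand-written document, and the name `table_cells` and the output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}


def cell(text):
    """Build a w:tc element holding one paragraph, one run, one text element."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Hand-written stand-in: a 2x2 table followed by a 1x1 table.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
<w:body><wx:sect>
<w:tbl><w:tr>{cell("a")}{cell("b")}</w:tr><w:tr>{cell("c")}{cell("d")}</w:tr></w:tbl>
<w:tbl><w:tr>{cell("e")}</w:tr></w:tbl>
</wx:sect></w:body>
</w:wordDocument>"""


def table_cells(xml_text):
    """One row per table cell, numbered by table, row, and column."""
    root = ET.fromstring(xml_text)
    rows = []
    for tbl_num, tbl in enumerate(root.findall("w:body/wx:sect/w:tbl", NS), 1):
        for row_num, tr in enumerate(tbl.findall("w:tr", NS), 1):
            for col_num, tc in enumerate(tr.findall("w:tc", NS), 1):
                # The relative path from the slide, starting below w:tc.
                text = "".join(t.text or "" for t in tc.findall("w:p/w:r/w:t", NS))
                rows.append({"table": tbl_num, "row": row_num,
                             "col": col_num, "t": text})
    return rows


for row in table_cells(DOC):
    print(row)
```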
Count Table Beginnings
• <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> w:tbl </INCREMENT-PATH>
Count Table Endings
• <INCREMENT-PATH beginend="END" syntax="XPath"> w:tbl </INCREMENT-PATH>
Graphics
Drawing Object Parameters
VML – Vector Markup Language
• This example will only read lines
– (they’re easiest)
• Other drawing objects have different XML elements
Our Example Drawing
Word XML – Drawn Lines
One Row for Each Line Element
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
</TABLE-PATH>
Columns
Parameters as Attributes
<COLUMN name="from">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from
</PATH>
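Since the line parameters are stored as attributes of `v:line`, reading them amounts to one row per line element and one column per attribute. The Python sketch below (standard library, not the SAS XML engine) illustrates this. It assumes the Word 2003 and VML namespace URIs and a hand-written drawing. The `from` attribute comes from the slide, while `to`, `strokecolor`, and `strokeweight` are assumed companion attributes, and the name `line_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for Word 2003 XML and VML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
V = "urn:schemas-microsoft-com:vml"
NS = {"w": W, "wx": WX, "v": V}

# Hand-written stand-in for a drawing with two lines in one group.
DOC = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}" xmlns:v="{V}">
<w:body><wx:sect><w:p><w:r><w:pict><v:group>
<v:line from="1800,2160" to="3600,2160" strokecolor="red" strokeweight="2pt"/>
<v:line from="1800,2160" to="1800,3960"/>
</v:group></w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"
ATTRIBUTES = ("from", "to", "strokecolor", "strokeweight")


def line_rows(xml_text):
    """One row per v:line; attributes missing from the element become None."""
    root = ET.fromstring(xml_text)
    return [
        {name: line.get(name) for name in ATTRIBUTES}
        for line in root.findall(LINE_PATH, NS)
    ]


for row in line_rows(DOC):
    print(row)
```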
The Dataset
Example Code in Paper
• Convert colors
• Parse stroke weight (e.g. 2pt)
• Detect the keyword “flip” and flip coordinates
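The paper's own code for these three steps is written in SAS and is not reproduced on the slides. The Python sketch below only illustrates the kind of post-processing involved, under stated assumptions: colors arrive as `#rrggbb` or `#rgb` hex strings and are converted to the SAS `CXrrggbb` form, stroke weights are given in points, endpoints are unitless `x,y` pairs, and a `flip:x` or `flip:y` entry in the style string means the corresponding endpoint coordinates are swapped. All function names are hypothetical.

```python
def sas_color(vml_color):
    """Convert '#rrggbb' or '#rgb' to the SAS RGB color name 'CXRRGGBB'."""
    digits = vml_color.strip().lstrip("#")
    if len(digits) == 3:  # short form: each digit is doubled
        digits = "".join(d * 2 for d in digits)
    if len(digits) != 6:
        raise ValueError(f"unsupported color: {vml_color!r}")
    int(digits, 16)  # raises ValueError if not hexadecimal
    return "CX" + digits.upper()


def stroke_points(weight):
    """Parse a stroke weight such as '2pt' into a number of points."""
    text = weight.strip().lower()
    if not text.endswith("pt"):
        raise ValueError(f"unsupported stroke weight: {weight!r}")
    return float(text[:-2])


def parse_point(pair):
    """Parse 'x,y' into a tuple of floats (assumes unitless coordinates)."""
    x, y = pair.split(",")
    return float(x), float(y)


def flip_axes(style):
    """Return the set of axes named by a 'flip' entry in a CSS-like style string."""
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if name.strip().lower() == "flip":
            return set(value.lower().split())
    return set()


def endpoints(from_pair, to_pair, style=""):
    """Return the two endpoints, swapping coordinates on each flipped axis."""
    (x1, y1), (x2, y2) = parse_point(from_pair), parse_point(to_pair)
    axes = flip_axes(style)
    if "x" in axes:
        x1, x2 = x2, x1
    if "y" in axes:
        y1, y2 = y2, y1
    return (x1, y1), (x2, y2)


print(sas_color("#ff0000"), stroke_points("2pt"))
print(endpoints("100,200", "300,400", "position:absolute;flip:y"))
```

Named colors such as `red` are deliberately rejected here; a lookup table would be needed to handle them.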
As Drawn by SAS
Contact Information
Larry Hoyle
Policy Research Institute,
University of Kansas
LarryHoyle@ku.edu