Kansas City Area SAS User Group (KCASUG) PowerPoint slides (ppt)

Reading Microsoft Word XML
files with SAS
August 25, 2005
Larry Hoyle -- Policy Research Institute
University of Kansas
revised 8/18/2005
3 scenarios
• Extracting text along with associated
properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in
drawings
XML - syntax
• Must begin with this prolog tag
• Paired tags; must have 1 root tag
• Case sensitive
• Empty tags end with />
• Tags and content together are called an "element"
• Tags can be qualified by attributes
• Elements can be nested; each must start and end in the same parent
<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag >
<nestedTag anAttribute="wha">
Other content
</nestedTag >
</LarryRootTag>
Word XML
Extracting text and properties
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate
XMLMAP
• Only needs to be generated once for
each type of extract
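A minimal sketch of how the XML engine might be pointed at a Word XML file and its XMLMap. Both file paths and the names comments and cmap are hypothetical; only the general LIBNAME pattern is intended.

```sas
/* Sketch only: both paths are hypothetical */
filename comments 'C:\temp\comments.xml';  /* Word document saved as XML */
filename cmap     'C:\temp\comments.map';  /* XMLMap, e.g. from XML Mapper */

/* The libref matches the first fileref, so the engine reads that file */
libname comments xml xmlmap=cmap access=readonly;

/* List the tables that the XMLMap defines */
proc datasets library=comments;
run;
quit;
```

Because the XMLMap describes the structure rather than one particular file, the same map can be reused for other documents of the same type by changing only the first FILENAME statement.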
Example Document
I have never been so humiliated in my life.
That was very rude treatment.
What a pleasant experience. Your staff was
both quick and pleasant.
It took about the time I expected to
reach someone.
I have nothing to say. The sky is blue and the
sea is green.
You are the worst organization in the world.
I love you guys.
XML - Example Document
I have never been so humiliated in my
life. That was very rude treatment.
What a pleasant experience. Your staff
was both quick and pleasant.
It took about the time I
expected to reach someone.
I have nothing to say. The sky is blue and
the sea is green.
You are the worst organization in the
world.
I love you guys.
Paragraph property:
/w:wordDocument/w:body
/wx:sect/w:p/w:pPr
Run property:
/w:wordDocument/w:body
/wx:sect/w:p/w:r/w:rPr
Rows
• The XMLMap has to describe a path that
delineates rows:
• In this case it’s each text element in a run
(in a paragraph…)
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the text
• The XMLMap has to describe a path that
delineates each column:
• The text
itself is:
<COLUMN name="t">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
Columns – the text element
number
• A sequential number for the text
element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the paragraph
number
• A sequential number for the
paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
Columns – paragraph color
<COLUMN name="PColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – run color
<COLUMN name="RColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
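The fragments above fit together into one XMLMap file. The following is a sketch, not the author's actual map: the table name runs and every TYPE, DATATYPE and LENGTH value are assumptions added to make the file complete.

```xml
<?xml version="1.0" ?>
<!-- Sketch: table name, TYPE, DATATYPE and LENGTH values are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="runs">
    <!-- One dataset row per text element -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

    <COLUMN name="pNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <COLUMN name="tNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <COLUMN name="PColorVal" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>8</LENGTH>
    </COLUMN>

    <COLUMN name="RColorVal" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>8</LENGTH>
    </COLUMN>

    <COLUMN name="t">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The retain="YES" columns carry the most recent paragraph or run value forward, so every text row keeps the properties that were in effect when it was read.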
Our dataset
Tables
All Tables Into One Dataset
Tables – Word XML
Tables - DataSet Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
</TABLE-PATH>
Tables – Table Number
<COLUMN name="tblNum" ordinal="YES"
retain="YES">
<INCREMENT-PATH beginend="BEGIN"
syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl
</INCREMENT-PATH>
Tables – Row Number
<COLUMN name="trNum" ordinal="YES"
retain="YES">
<INCREMENT-PATH beginend="BEGIN"
syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr
</INCREMENT-PATH>
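The table and row counters do not distinguish cells within a row. Following the same pattern, a cell counter could be added. This fragment is a sketch: the column name tcNum and the TYPE and DATATYPE values are assumptions, not part of the original map.

```xml
<!-- Hypothetical extra column: counts cells (w:tc) so that text rows
     can be grouped by table, row and cell. The name tcNum is assumed. -->
<COLUMN name="tcNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc</INCREMENT-PATH>
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>
```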
We Could Add Properties if Needed
Nested tables
Nested Tables – Absolute Path for Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
</TABLE-PATH>
Nested Tables – Rootless Path for Rows
<TABLE-PATH syntax="XPath">
w:tbl/w:tr/w:tc/w:p/w:r/w:t
</TABLE-PATH>
Drawing Objects
VML – Vector Markup Language
• Drawings in Word get stored as XML also
• We’ll just look at lines
Dataset – One Row for Each Line
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
</TABLE-PATH>
Dataset – Column: From
<COLUMN name="from">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from
</PATH>
Dataset – Column: To
<COLUMN name="to">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to
</PATH>
Dataset – Column: StrokeColor
<COLUMN name="strokecolor">
<PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor
</PATH>
The Dataset
Usage Example: Annotate dataset
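/* xyPattern: compiled regex (from PRXPARSE) that captures x and y in an "x,y" string */
if prxmatch(xyPattern, from) then do;
  function='move';
  x= input(prxposn(xyPattern, 1, from),10.);
  y= -1* input(prxposn(xyPattern, 2, from),10.);
  /* flipped in y: the line starts at the y of the "to" point, so re-match on to */
  if prxmatch('/flip:y/',style) and prxmatch(xyPattern, to) then
    y= -1* input(prxposn(xyPattern, 2, to),10.);
  output;
end;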
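The fragment above sits inside a larger DATA step. A fuller sketch of that step follows, under stated assumptions: the input dataset name lines, the style column, the regular expression, and integer "x,y" coordinates without units are all hypothetical rather than taken from the original program.

```sas
/* Sketch only: assumes WORK.LINES has character columns FROM, TO and
   STYLE, with coordinates written as plain integers such as "1800,2160" */
data anno;
  length function $8 xsys ysys $1;
  retain xsys ysys '2';            /* data coordinate system */
  retain xyPattern;
  set lines;

  /* Compile the pattern once: capture group 1 is x, group 2 is y */
  if _n_ = 1 then xyPattern = prxparse('/(-?\d+)\s*,\s*(-?\d+)/');

  /* Parse the start point; skip rows that do not match */
  if prxmatch(xyPattern, from) = 0 then delete;
  x1 = input(prxposn(xyPattern, 1, from), 10.);
  y1 = input(prxposn(xyPattern, 2, from), 10.);

  /* Parse the end point */
  if prxmatch(xyPattern, to) = 0 then delete;
  x2 = input(prxposn(xyPattern, 1, to), 10.);
  y2 = input(prxposn(xyPattern, 2, to), 10.);

  /* A line flipped in y starts at the other y value */
  if prxmatch('/flip:y/', style) then do;
    tmp = y1; y1 = y2; y2 = tmp;
  end;

  /* Word's y axis points down, so negate y for plotting */
  function = 'move'; x = x1; y = -y1; output;
  function = 'draw'; x = x2; y = -y2; output;

  keep function xsys ysys x y;
run;

/* One way to display the result: let GANNO scale the data coordinates */
proc ganno annotate=anno datasys;
run;
```

Each line in the drawing becomes a move/draw pair, which is the form an Annotate dataset expects for line segments.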
Plotted in SAS
Contact Information
Larry Hoyle
Policy Research Institute,
University of Kansas
LarryHoyle@ku.edu
http://www.ku.edu/pri/ksdata/sashttp/sugi31