Introduction to XML Spreadsheets for SAS® Programmers, Parts I & II
James J. Van Campen, SRI International, Menlo Park, CA
ABSTRACT
This paper provides an introduction to Microsoft’s® XML Spreadsheet Schema (XMLSS) for SAS programmers.
XMLSS is used to store Microsoft Excel® workbooks as XML files. Understanding XMLSS is very useful for SAS
programmers who use the ODS ExcelXP tagset or who write their own programs to generate XML spreadsheets. Part
I of this paper provides an overview of XMLSS, and Part II describes some SAS macros written by the author to
create XML spreadsheets. All of the examples in this paper were done using SAS 9.2 and MS Office 2003.
PART I: OVERVIEW OF THE XML SPREADSHEET SCHEMA (XMLSS)
XML stands for Extensible Markup Language. In XML, nested markup tags are used to define a hierarchical structure
for data. XML can be used for just about any type of data, including Excel spreadsheets. Unlike HTML, XML has no
predefined tags. In XML, schemas are used to define the tags, the attributes, and the nested structure. The tags
define what are called elements. Typically, elements are data. An element (parent element) may contain other
elements (child elements). The parent-child relationship of the elements provides the hierarchical structure to the
data. All XML files must be fully nested; that is, both the opening and closing tags of all child elements must be
contained between the opening and closing tags of the parent element. All XML files have a root tag that contains all
the other tags. Elements may have attributes. Typically, attributes are meta-data, such as formatting information.
Attributes can be required or optional.
XML parsers are very picky: no parser will read a file that is not well formed, and a validating parser will also
reject a file that is not valid. Well formed means the file follows the XML syntax (e.g., the tags are properly
nested, all required closing tags exist, etc.). Valid means the XML file conforms to the associated schema (e.g., the
element names and attributes are consistent with the schema). An XML file cannot be valid without also being well
formed; however, a well-formed file is not necessarily valid.
Microsoft’s XMLSS was developed to permit Excel workbooks to be saved as XML and XML files conforming to
XMLSS to be opened as formatted workbooks. Excel is capable of opening any well-formed XML file; however, only
valid files conforming to XMLSS can modify the default method of display. Details of XMLSS, including descriptions of
the elements (tags) and attributes, can be found at the following website:
http://msdn.microsoft.com/en-us/library/aa140066(office.10).aspx
The following example illustrates the essentials of XMLSS. The contents of a simple Excel workbook, consisting of
one worksheet with three rows and two columns, are shown in Figure 1. Default Excel formatting is used throughout
the worksheet, except for the column headings, which are bold.
Name    Age
Jack    9
Jill    7
Figure 1
The following XML file conforms to XMLSS, and when opened with Excel, has the same content and formatting as
Figure 1. The first tag is a special tag that indicates this is an XML file and specifies the XML version. The second tag
is also special, and it indicates that the file should be opened as an XML spreadsheet. The root element is Workbook.
All the other elements in the file are contained between the opening and closing Workbook tags. The Workbook tag
contains five attributes. The attributes specify namespaces where the elements and attributes are defined. A
discussion of namespaces, however, is beyond the scope of this paper. The first two tags and the Workbook tag
along with its namespace attributes can be treated as boilerplate which must be included in every XML spreadsheet
file.
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:o="urn:schemas-microsoft-com:office:office"
          xmlns:x="urn:schemas-microsoft-com:office:excel"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:html="http://www.w3.org/TR/REC-html40">
  <Styles>
    <Style ss:ID="s21">
      <Font ss:Bold="1"/>
    </Style>
  </Styles>
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell ss:StyleID="s21"><Data ss:Type="String">Name</Data></Cell>
        <Cell ss:StyleID="s21"><Data ss:Type="String">Age</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">Jack</Data></Cell>
        <Cell><Data ss:Type="Number">9</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">Jill</Data></Cell>
        <Cell><Data ss:Type="Number">7</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
Similar to HTML, XML tags begin with “<” and end with “>”. The element name is the first character string inside the
tag. The name is followed by the attributes, if any. The first child element of Workbook is Styles. Each style is defined
separately with Style tags. The Style element is a child of the Styles element. In this example, only one style is
defined. The ID is a required attribute of the Style element and is set to “s21”. The “ss:” part of the attribute
specification indicates the namespace where the attribute is defined. A default style can be defined, but it is not
required; no default style is defined in this example. In the style defined, the Bold attribute of the Font element is set
to “1” (True). Closing tags for elements differ from opening tags by preceding the name with a forward slash and not
having any attributes specified. All non-empty elements must have both opening and closing tags for the XML file to
be well formed. Empty elements may have only one tag. In those cases, the closing bracket is preceded by a forward
slash. The Font element is an example of an empty element.
After the Styles element comes the Worksheet element. The worksheet name is a required attribute of the Worksheet
element. In this case, it is set to “Sheet1”, the Excel default. The Table element is a child of the Worksheet element.
The Column and Row elements are children of the Table element. In this example, there are no Column elements.
The Cell elements are children of the Row elements, and two of the Cell elements have their StyleID attribute set to
“s21” for the bold font style. The Data element is a child of the Cell element. Type is a required attribute of the Data
element. Type can be set to “String”, “Number”, “DateTime”, “Boolean”, or “Error”. The data is written, without quotes,
between the opening and closing Data tags. In XML, white space is preserved, so no spaces are allowed to precede or
to follow numeric data. In fact, in XMLSS, numeric data must be a plain number: no thousands separators, currency
symbols, or percent signs may accompany the digits, the decimal point, and the minus sign of a negative value.
This has important consequences when writing your SAS code, as will be seen in the next section.
After all the rows are specified, the closing Table, Worksheet, and Workbook tags are written. The XML code above
constitutes a well-formed and valid XML file conforming to XMLSS.
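The white space restriction on numeric data can be illustrated with a minimal sketch. The data set AGES and its
variables are hypothetical names used only for illustration. List-style PUT writes a blank after each variable value,
so the pointer control +(-1) backs up one column before the closing tag is written.

```sas
/* Hypothetical sample data, for illustration only */
data ages;
  input name $ age;
  datalines;
Jack 9
Jill 7
;
run;

data _null_;
  set ages;
  file log;  /* write to the SAS log so the result is easy to inspect */
  /* +(-1) moves the pointer back over the blank that list-style PUT
     writes after AGE, so no space precedes the closing Data tag */
  put '<Cell><Data ss:Type="Number">' age +(-1) '</Data></Cell>';
run;
```

For Jack, this should write <Cell><Data ss:Type="Number">9</Data></Cell> to the log. Without +(-1), a blank would
follow the 9 and the numeric cell would no longer be acceptable to Excel.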
To create workbooks with formatting more complex than in this example requires using additional elements and
attributes from XMLSS. One way to learn more about XMLSS is to go to the website specified above. Often the
quickest and easiest way to learn about the elements and attributes in XMLSS is to create a workbook with the
formatting you desire and save it as XML. You can then open the XML file created by Excel with a text editor or your
Internet browser to see the XML code.
The complete hierarchy of XMLSS tags is as follows.
<ss:Workbook>
  <ss:Styles>
    <ss:Style>
      <ss:Alignment/>
      <ss:Borders>
        <ss:Border/>
      </ss:Borders>
      <ss:Font/>
      <ss:Interior/>
      <ss:NumberFormat/>
      <ss:Protection/>
    </ss:Style>
  </ss:Styles>
  <ss:Names>
    <ss:NamedRange/>
  </ss:Names>
  <ss:Worksheet>
    <ss:Names>
      <ss:NamedRange/>
    </ss:Names>
    <ss:Table>
      <ss:Column/>
      <ss:Row>
        <ss:Cell>
          <ss:NamedCell/>
          <ss:Data>
            <Font/>
            <B/>
            <I/>
            <U/>
            <S/>
            <Sub/>
            <Sup/>
            <Span/>
          </ss:Data>
          <x:PhoneticText/>
          <ss:Comment>
            <ss:Data>
              <Font/>
              <B/>
              <I/>
              <U/>
              <S/>
              <Sub/>
              <Sup/>
              <Span/>
            </ss:Data>
          </ss:Comment>
          <o:SmartTags>
            <stN:SmartTag/>
          </o:SmartTags>
        </ss:Cell>
      </ss:Row>
    </ss:Table>
    <c:WorksheetOptions>
      <c:DisplayCustomHeaders/>
    </c:WorksheetOptions>
    <x:WorksheetOptions>
      <x:PageSetup>
        <x:Layout/>
        <x:PageMargins/>
        <x:Header/>
        <x:Footer/>
      </x:PageSetup>
    </x:WorksheetOptions>
    <x:AutoFilter>
      <x:AutoFilterColumn>
        <x:AutoFilterCondition/>
        <x:AutoFilterAnd>
          <x:AutoFilterCondition/>
        </x:AutoFilterAnd>
        <x:AutoFilterOr>
          <x:AutoFilterCondition/>
        </x:AutoFilterOr>
      </x:AutoFilterColumn>
    </x:AutoFilter>
  </ss:Worksheet>
  <c:ComponentOptions>
    <c:Toolbar>
      <c:HideOfficeLogo/>
    </c:Toolbar>
  </c:ComponentOptions>
  <o:SmartTagType/>
</ss:Workbook>
As can be seen from the example spreadsheet above, most of the tags can be omitted, and you will still have a viable
XML spreadsheet.
There are some important things to remember when specifying an XML spreadsheet. First, you must define a style for
each different kind of cell formatting. It can be time consuming to create all the different styles you want, but once
they are created, you can use them for all the XML spreadsheets you create. Custom column widths are specified
using the Width attribute of the Column element. Likewise, custom row heights are specified using the Height attribute
of the Row element. Cells can be merged using the MergeAcross and MergeDown attributes of the Cell element.
Most other formatting is done in the PageSetup element which is a child of the WorksheetOptions element. Headers
and footers are created by specifying the Data attribute of the Header and Footer elements. The codes for specifying
headers and footers are somewhat cryptic, and the best way to learn how to write them is to create example headers
or footers in a worksheet and save the file as an XML spreadsheet. Then open the XML file with a text editor and see
what Excel wrote.
PART II: EXAMPLE MACROS FOR CREATING XML SPREADSHEETS
XML files are plain old text files, and it is easy to create text files from SAS. Creating an XML spreadsheet from SAS
simply requires using PUT statements in a DATA _NULL_ step. It is seriously old school SAS programming. The steps
for writing an XML spreadsheet are as follows.
1. Start workbook
2. Write styles
3. Write worksheet
   a. Start worksheet
   b. Specify columns
   c. Specify row and cells
      i. Start row
      ii. Specify cells
      iii. End row
   d. Repeat step c for each row
   e. End worksheet
4. Repeat step 3 for each additional worksheet
5. End workbook
A macro was created for each of the steps. Macros are well suited for these tasks since the same operation is done
repeatedly. Also, since no errors in the XML code can be tolerated, full automation of the process is optimal.
STARTING THE WORKBOOK
This step is the same for all files. The macro is passed one required parameter, the output filename (fileout). Since it
is an XML file, the filename should end in .xml. This macro writes the boilerplate code mentioned in Part I. Notice
there is no mod option in the file statement. This macro will always start a new file, or overwrite an existing file with the
same name. All of the other macros use the mod option to append to the file started by %workbook_start.
%macro workbook_start (fileout);
  data _null_;
    file "&fileout";
    put
      '<?xml version="1.0"?>' /
      '<?mso-application progid="Excel.Sheet"?>' /
      '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" ' /
      '  xmlns:o="urn:schemas-microsoft-com:office:office" ' /
      '  xmlns:x="urn:schemas-microsoft-com:office:excel" ' /
      '  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" ' /
      '  xmlns:html="http://www.w3.org/TR/REC-html40" >';
  run;
%mend workbook_start;
WRITING THE STYLES
The %workbook_styles macro establishes all the formats available for use in the workbook. The styles must be
defined in advance of specifying the worksheets. The sample macro below creates two styles, Default and Bold. The
default shows some of the different child elements of Style. Each of the formatting elements in Style has a set of
attributes that can be specified. For example, in the Bold style, the Bold attribute of the Font element is set to “1”
(true). To learn more about the available attributes for the style elements please see the website referenced above.
Typically, many more styles are specified within this macro since the point of using this method is to achieve custom
formatting.
%macro workbook_styles (fileout);
  data _null_;
    file "&fileout" mod;
    put
      ' <Styles>' /
      '  <Style ss:ID="Default">' /
      '   <Alignment ss:Vertical="Bottom"/>' /
      '   <Borders/>' /
      '   <Font/>' /
      '   <Interior/>' /
      '   <NumberFormat/>' /
      '   <Protection/>' /
      '  </Style>' /
      '  <Style ss:ID="Bold">' /
      '   <Font ss:Bold="1"/>' /
      '  </Style>' /
      ' </Styles>';
  run;
%mend workbook_styles;
WRITING THE WORKSHEET
The %worksheet_write macro is where the programmer specifies the content and formatting of the worksheet. This is
done through a series of %column, %row, and %cell macro calls. This macro must be customized for each project.
The worksheet is initialized by calling the %worksheet_start macro and specifying the sheetname parameter
(“Example” in this case). The column widths are specified with the %column macro. If no column widths are specified,
a default width is used. Then the rows and cells are specified. Empty rows are specified by a single call of the %row
macro with no parameters. Rows with cell content require two %row macro calls, one for opening and one for closing
the row. In the %cell macro, content is specified using the content parameter, and formatting is specified with the style
parameter. Calling the %cell macro with no parameters creates a blank cell. Column and cell order start from the left,
and row order starts from the top.
%macro worksheet_write (fileout);
  data _null_;
    file "&fileout" mod;
    %worksheet_start(Example);
    %column(width=80);
    %row(type=open);
    %cell(content=Example, style=Bold);
    %row(type=close);
    %worksheet_end;
  run;
%mend worksheet_write;
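The example above writes a single literal cell. As a hedged sketch of how the same macros might be driven by a data
set, the variant below writes a header row followed by one row per observation. The macro name
%worksheet_write_class is hypothetical and not part of the author's macro set, SASHELP.CLASS is the sample data set
shipped with SAS, and the %worksheet_start, %column, %row, %cell, and %worksheet_end macros are assumed to be defined
as listed in this paper.

```sas
/* Hypothetical variant of %worksheet_write, for illustration only */
%macro worksheet_write_class (fileout);
  data _null_;
    set sashelp.class end=last;
    file "&fileout" mod;
    /* Open the worksheet and write the bold header row once */
    if _n_ = 1 then do;
      %worksheet_start(Class);
      %column(width=80, span=1);
      %row(type=open);
      %cell(content=Name, style=Bold);
      %cell(content=Age, style=Bold);
      %row(type=close);
    end;
    /* One spreadsheet row per observation; variable=yes writes variable values */
    %row(type=open);
    %cell(content=name, variable=yes);
    %cell(content=age, variable=yes, type=Number);
    %row(type=close);
    /* Close the table and worksheet after the last observation */
    if last then do;
      %worksheet_end;
    end;
  run;
%mend worksheet_write_class;
```

Because the macros only generate PUT statements, they can sit inside ordinary DATA step logic such as the IF-THEN/DO
blocks used here.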
The %worksheet_start macro has one required parameter, sheetname. It is also the place to specify which row(s) are
repeated at the top of each page should the worksheet exceed one page in length. In this example, the first row is to
be repeated as indicated in the NamedRange element. The Names element can be omitted without consequence if
desired.
%macro worksheet_start (sheetname);
  put
    ' <Worksheet ss:Name="' "&sheetname" '">' /
    '  <Names>' /
    '   <NamedRange ss:Name="Print_Titles" ss:RefersTo="=' "&sheetname" '!R1:R1"/>' /
    '  </Names>' /
    '  <Table>';
%mend worksheet_start;
The %column macro is used when custom column widths are desired – simply specify the width parameter. If
consecutive columns have the same width, setting the span parameter to the number of additional columns makes it
so only one %column macro call is necessary.
%macro column (width=46.5, span=0);
  put '   <Column ss:Width="' "&width" '" ss:Span="' "&span" '"/>' ;
%mend column;
Depending on the type parameter, the %row macro creates an opening, closing, or empty row element. Formatting
can be specified at the row level. Row level formatting can be pre-empted by cell level formatting. The cell formatting,
however, is not superposed on the row formatting. That is, unspecified elements and attributes in the cell format take
on the worksheet default not the value in the row format.
%macro row (type=empty, height=12.75, style=Default);
  %if "&type"= "open" %then
    put '   <Row ss:Height="' "&height" '" ss:StyleID="' "&style" '">' ;
  %else %if "&type"= "close" %then
    put '   </Row>' ;
  %else
    put '   <Row/>' ;
%mend row;
The %cell macro can be passed text strings, macro variables, and variable values of different types. For text strings
and macro variables the variable parameter should be set to “no” (default). For variable values (numeric or character)
it should be set to “yes”. The allowable data types in XMLSS are String (default), Number, DateTime, Boolean, and
Error. Multiple cells can be merged by setting the merge parameter to the number of additional cells.
%macro cell (content=none, variable=no, type=String, merge=0, style=Default);
  /* content=none writes an empty cell; variable=no writes literal text;  */
  /* variable=yes writes the value of the DATA step variable named in     */
  /* content, using +(-1) to remove the trailing blank left by PUT.       */
  %if "&content"= "none" %then
    put '    <Cell ss:StyleID="' "&style" '"/>' ;
  %else %if "&variable"= "no" %then
    put '    <Cell ss:StyleID="' "&style" '" ss:MergeAcross="' "&merge" '">' /
        '     <ss:Data ss:Type="' "&type" '" xmlns="http://www.w3.org/TR/REC-html40">' "%unquote(&content)" '</ss:Data>' /
        '    </Cell>' ;
  %else %do;
    if missing(&content) then
      put '    <Cell ss:StyleID="' "&style" '"/>' ;
    else
      put '    <Cell ss:StyleID="' "&style" '" ss:MergeAcross="' "&merge" '">' /
          '     <Data ss:Type="' "&type" '">' &content +(-1) '</Data>' /
          '    </Cell>' ;
  %end;
%mend cell;
When specifying the content parameter for the %cell macro, be wary of special characters, both in SAS and in XML. For
example, a macro quoting function such as %str() or %nrstr() must be used to pass text strings containing commas.
Otherwise, SAS will interpret the comma as a parameter delimiter and the macro will fail. String data containing less
than (<), greater than (>), or ampersand (&) symbols can trip up XML parsers. Those symbols should be replaced
with “&lt;”, “&gt;”, and “&amp;”, respectively. If the source is a character variable in SAS, a function such as
tranwrd() can be used to do the substitution.
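A minimal sketch of the tranwrd() substitution follows. The variable names RAW and SAFE and the sample text are
hypothetical. The ampersand is replaced first so that the ampersands introduced by the other two replacements are not
escaped a second time.

```sas
data _null_;
  length raw $40 safe $200;   /* SAFE is longer because escaping lengthens the text */
  raw  = 'Smith & Jones <R&D>';
  /* Replace the ampersand first so the entities added next are not re-escaped */
  safe = tranwrd(raw,  '&', '&amp;');
  safe = tranwrd(safe, '<', '&lt;');
  safe = tranwrd(safe, '>', '&gt;');
  put safe=;                  /* writes the escaped string to the log */
run;
```

The log should show safe=Smith &amp; Jones &lt;R&amp;D&gt;. Inside a worksheet DATA step, the escaped variable could
then be passed to the %cell macro with variable=yes.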
The %worksheet_end macro is where the page setup is specified. Page orientation, headers, footers, margins, and
more can be specified here. In the example macro, the Data attribute of the Header element specifies that the header
contain “WUSS Example” centered and in bold. Similarly, the footer displays the date on the left, the page number in
the center, and the author’s name on the right. The header and footer specifications are cryptic and the best way to
learn how to specify them is to save examples as XML spreadsheets and look at the XML code.
%macro worksheet_end;
  /* Ampersands in the header and footer codes are written as &amp; */
  /* so that the attribute values remain well-formed XML.           */
  put
    '  </Table>' /
    '  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">' /
    '   <PageSetup>' /
    '    <Layout x:Orientation="Landscape"/>' /
    '    <Header x:Data="&amp;C&amp;B' "WUSS Example" '"/>' /
    '    <Footer x:Data="&amp;L&amp;D&amp;C&amp;P&amp;RJames Van Campen"/>' /
    '    <PageMargins x:Left="0.5" x:Right="0.5" x:Top=".75" x:Bottom=".75"/>' /
    '   </PageSetup>' /
    '  </WorksheetOptions>' /
    ' </Worksheet>';
%mend worksheet_end;
ADDITIONAL WORKSHEETS AND ENDING THE WORKBOOK
After the %worksheet_end macro has executed, another worksheet can be added to the workbook simply by adding
another _NULL_ data step, like the first, in the %worksheet_write macro – be sure to specify a different worksheet
name. When all the worksheets have been specified, the workbook must be finalized by calling the %workbook_end
macro. This macro simply writes the closing workbook tag.
%macro workbook_end (fileout);
  data _null_;
    file "&fileout" mod;
    put '</Workbook> ';
  run;
%mend workbook_end;
CONVERTING THE XML TO EXCEL
Creating an XML spreadsheet is great, but the ultimate goal for most programmers is to produce a regular Excel
workbook. To do this I wrote a conversion macro. The %xml_to_xls macro has two required parameters, the input and
output filenames. The input filename will be the XML file created in the previous steps. The output filename must end
in .xls. Two options, xwait and xsync, must be turned off before DDE is used. The Excel application is launched
using the X command. The path must specify the location of the Excel application on your machine. SAS is instructed
to wait two seconds while the application launches. The minimum required sleep time can vary from computer to
computer. The DDE topic "system" is defined in order to send commands to Excel. Then Excel is instructed to open
the input file and save it with the output filename in Excel format. Notice the “,1” after the output filename. That is a
crucial parameter which specifies the file is to be saved in Excel format instead of XML. Without that parameter, you
will simply get an XML file with a .xls extension. Finally, Excel is instructed to quit.
%macro xml_to_xls (filein, fileout);
  options noxwait noxsync;
  x '"C:\Program Files\Microsoft Office\Office10\Excel.exe"';
  data _null_;
    rc = sleep(2);
  run;
  filename ddecmds dde "excel|system";
  data _null_;
    file ddecmds;
    put "[open(" '"' "&filein" '"' ")]";
    put "[save.as(" '"' "&fileout" '"' ",1)]";
    put '[quit()]';
  run;
%mend xml_to_xls;
PUTTING IT ALL TOGETHER
The macros are called in the following order.
%workbook_start (fileout);
%workbook_styles (fileout);
%worksheet_write (fileout);
%workbook_end (fileout);
%xml_to_xls (filein,fileout);
In review, the %workbook_styles and %worksheet_write macros must be customized for each specific project. The
%worksheet_write macro will have a separate DATA _NULL_ step for each worksheet written. Also, the
%worksheet_write macro makes use of the following macros.
%worksheet_start (sheetname);
%column (width=46.5, span=0);
%row (type=empty, height=12.75, style=Default);
%cell (content=none, variable=no, type=String, merge=0, style=Default);
%worksheet_end;
The arrangement of the %column, %row and %cell macro calls in the %worksheet_write macro is specific to the
worksheet being written. The %column macro calls are for customizing column widths. The %row macro calls will
create blank rows when no type parameter is specified. Otherwise, the %row macro indicates the start (type= open) or
the end (type= close) of a row specification. The %cell macro will create a blank cell if the content parameter is
omitted. Otherwise, the %cell macro will display the content per the style specified. The %worksheet_end macro also
must be customized for each project, since that is where headers, footers and other page setup options are specified.
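As a usage sketch, the full sequence might look like the following once real filenames are supplied. The paths and the
macro variable names XMLFILE and XLSFILE are hypothetical, and the closing macro is assumed to be defined under the
name %workbook_end, as it is called in the text.

```sas
/* Hypothetical file locations; adjust to your own environment */
%let xmlfile = C:\temp\example.xml;
%let xlsfile = C:\temp\example.xls;

%workbook_start (&xmlfile);            /* creates or overwrites the XML file */
%workbook_styles (&xmlfile);           /* appends the Styles element         */
%worksheet_write (&xmlfile);           /* appends one worksheet              */
%workbook_end (&xmlfile);              /* appends the closing Workbook tag   */
%xml_to_xls (&xmlfile, &xlsfile);      /* opens the XML in Excel, saves .xls */
```

Only %workbook_start omits the mod option, so it must run first. Rerunning the whole sequence starts a fresh file
rather than appending to the previous one.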
CONCLUSION
Writing XML spreadsheets from SAS can be an effective method for programmers to create formatted output in Excel.
The chief advantages are total control of the location and the formatting of output in the spreadsheet. Also, by
automating with macros a great deal of time can be saved compared to formatting by hand. This is especially true for
complex spreadsheets that must be generated on a repeated basis. The disadvantages are that the system is
complex and finicky. Also, there is no way to handle charts or other graphics in an XML spreadsheet.
REFERENCES
James Van Campen, Using SAS with XML to Create Custom Formatted Excel Workbooks, WUSS 2005
James Van Campen, SAS Macros for Creating Custom Formatted Excel Workbooks, WUSS 2006
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
James Van Campen
SRI International, BS-153
333 Ravenswood Avenue
Menlo Park CA 94025
Phone: 650-859-2906
Email: james.vancampen@sri.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.