Introduction to XML Spreadsheets for SAS® Programmers, Parts I & II
James J. Van Campen, SRI International, Menlo Park, CA

ABSTRACT
This paper provides an introduction to Microsoft’s® XML Spreadsheet Schema (XMLSS) for SAS programmers. XMLSS is used to store Microsoft Excel® workbooks as XML files. Understanding XMLSS is very useful for SAS programmers who use the ODS ExcelXP tagset or who write their own programs to generate XML spreadsheets. Part I of this paper provides an overview of XMLSS, and Part II describes some SAS macros written by the author to create XML spreadsheets. All of the examples in this paper were done using SAS 9.2 and MS Office 2003.

PART I: OVERVIEW OF THE XML SPREADSHEET SCHEMA (XMLSS)
XML stands for Extensible Markup Language. In XML, nested markup tags are used to define a hierarchical structure for data. XML can be used for just about any type of data, including Excel spreadsheets. Unlike HTML, XML has no predefined tags. In XML, schemas are used to define the tags, the attributes, and the nested structure.

The tags define what are called elements. Typically, elements hold the data. An element (parent element) may contain other elements (child elements). The parent-child relationship of the elements provides the hierarchical structure to the data. All XML files must be fully nested; that is, both the opening and closing tags of all child elements must be contained between the opening and closing tags of the parent element. All XML files have a root element whose tags contain all the other tags.

Elements may have attributes. Typically, attributes are meta-data, such as formatting information. Attributes can be required or optional. XML parsers are very picky: no parser will accept a file that is not well formed, and a validating parser will also reject a file that is not valid. Well formed means the file follows the XML syntax (e.g., the tags are properly nested, all required closing tags exist, etc.).
Valid means the XML file conforms to the associated schema (e.g., the element names and attributes are consistent with the schema). An XML file cannot be valid without also being well formed; however, a well-formed file is not necessarily valid.

Microsoft’s XMLSS was developed to permit Excel workbooks to be saved as XML and XML files conforming to XMLSS to be opened as formatted workbooks. Excel is capable of opening any well-formed XML file; however, only valid files conforming to XMLSS can modify the default method of display. Details of XMLSS, including descriptions of the elements (tags) and attributes, can be found at the following website:

http://msdn.microsoft.com/en-us/library/aa140066(office.10).aspx

The following example illustrates the essentials of XMLSS. The contents of a simple Excel workbook, consisting of one worksheet with three rows and two columns, are shown in Figure 1. Default Excel formatting is used throughout the worksheet, except for the column headings, which are bold.

    Name    Age
    Jack    9
    Jill    7

    Figure 1

The following XML file conforms to XMLSS, and when opened with Excel, has the same content and formatting as Figure 1. The first tag is a special tag (the XML declaration) that indicates this is an XML file and specifies the XML version. The second tag is also special (a processing instruction), and it indicates that the file should be opened as an XML spreadsheet. The root element is Workbook. All the other elements in the file are contained between the opening and closing Workbook tags. The Workbook tag contains five attributes. The attributes specify namespaces where the elements and attributes are defined. A discussion of namespaces, however, is beyond the scope of this paper. The first two tags and the Workbook tag along with its namespace attributes can be treated as boilerplate which must be included in every XML spreadsheet file.

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <Styles>
  <Style ss:ID="s21">
   <Font ss:Bold="1"/>
  </Style>
 </Styles>
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell ss:StyleID="s21"><Data ss:Type="String">Name</Data></Cell>
    <Cell ss:StyleID="s21"><Data ss:Type="String">Age</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">Jack</Data></Cell>
    <Cell><Data ss:Type="Number">9</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">Jill</Data></Cell>
    <Cell><Data ss:Type="Number">7</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

Similar to HTML, XML tags begin with “<” and end with “>”. The element name is the first character string inside the tag. The name is followed by the attributes, if any.

The first child element of Workbook is Styles. Each style is defined separately with Style tags. The Style element is a child of the Styles element. In this example, only one style is defined. The ID is a required attribute of the Style element and is set to “s21”. The “ss:” part of the attribute specification indicates the namespace where the attribute is defined. A default style can be defined, but it is not required; no default style is defined in this example. In the style defined, the Bold attribute of the Font element is set to “1” (True).

Closing tags for elements differ from opening tags by preceding the name with a forward slash and not having any attributes specified. All non-empty elements must have both opening and closing tags for the XML file to be well formed. Empty elements may have only one tag. In those cases, the closing bracket is preceded by a forward slash. The Font element is an example of an empty element.

After the Styles element comes the Worksheet element.
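The well-formedness rules just described (proper nesting, and a closing tag for every non-empty element) can be checked before a file is ever opened in Excel. What follows is a minimal sketch in Python rather than SAS, offered only as an illustration. It relies on the standard xml.etree.ElementTree parser; the names GOOD, BAD, is_well_formed, and cell_values are hypothetical helpers invented for this example and are not part of XMLSS. The check covers well-formedness only; it says nothing about validity against XMLSS.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to both the default and the "ss:" prefix in the example.
SS = "urn:schemas-microsoft-com:office:spreadsheet"

# A cut-down copy of the example workbook: one style and one header cell.
GOOD = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Styles>
  <Style ss:ID="s21"><Font ss:Bold="1"/></Style>
 </Styles>
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell ss:StyleID="s21"><Data ss:Type="String">Name</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""

# The same text with the closing Row tag deleted, so the nesting is broken.
BAD = GOOD.replace("</Row>", "")


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def cell_values(xml_text):
    """Return (type, text) for every Data element, in document order."""
    root = ET.fromstring(xml_text)
    return [(d.get("{%s}Type" % SS), d.text)
            for d in root.iter("{%s}Data" % SS)]


if __name__ == "__main__":
    print(is_well_formed(GOOD))   # True
    print(is_well_formed(BAD))    # False
    print(cell_values(GOOD))      # [('String', 'Name')]
```

The parser reports element and attribute names together with their namespace URI, which is why the lookups use the full spreadsheet namespace rather than the “ss:” prefix.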
The worksheet name is a required attribute of the Worksheet element. In this case, it is set to “Sheet1”, the Excel default. The Table element is a child of the Worksheet element. The Column and Row elements are children of the Table element. In this example, there are no Column elements. The Cell elements are children of the Row elements, and two of the Cell elements have their StyleID attribute set to “s21” for the bold font style.

The Data element is a child of the Cell element. Type is a required attribute of the Data element. Type can be set to “String”, “Number”, “DateTime”, “Boolean”, or “Error”. The data is written, without quotes, between the opening and closing Data tags. In XML, white space is preserved. No spaces are allowed to precede or to follow numeric data. In fact, in XMLSS, numeric data should consist of nothing but the number itself: numerals, at most one decimal point, and a leading minus sign for negative values. Formatting characters such as commas or dollar signs are not permitted. This has important consequences when writing your SAS code, as will be seen in the next section. After all the rows are specified, the closing Table, Worksheet, and Workbook tags are written.

The XML code above constitutes a well-formed and valid XML file conforming to XMLSS. Creating workbooks with formatting more complex than in this example requires additional elements and attributes from XMLSS. One way to learn more about XMLSS is to go to the website specified above. Often the quickest and easiest way to learn about the elements and attributes in XMLSS is to create a workbook with the formatting you desire and save it as XML. You can then open the XML file created by Excel with a text editor or your Internet browser to see the XML code.

The complete hierarchy of XMLSS tags is as follows.

<ss:Workbook>
  <ss:Styles>
    <ss:Style>
      <ss:Alignment/>
      <ss:Borders>
        <ss:Border/>
      </ss:Borders>
      <ss:Font/>
      <ss:Interior/>
      <ss:NumberFormat/>
      <ss:Protection/>
    </ss:Style>
  </ss:Styles>
  <ss:Names>
    <ss:NamedRange/>
  </ss:Names>
  <ss:Worksheet>
    <ss:Names>
      <ss:NamedRange/>
    </ss:Names>
    <ss:Table>
      <ss:Column/>
      <ss:Row>
        <ss:Cell>
          <ss:NamedCell/>
          <ss:Data>
            <Font/> <B/> <I/> <U/> <S/> <Sub/> <Sup/> <Span/>
          </ss:Data>
          <x:PhoneticText/>
          <ss:Comment>
            <ss:Data>
              <Font/> <B/> <I/> <U/> <S/> <Sub/> <Sup/> <Span/>
            </ss:Data>
          </ss:Comment>
          <o:SmartTags>
            <stN:SmartTag/>
          </o:SmartTags>
        </ss:Cell>
      </ss:Row>
    </ss:Table>
    <c:WorksheetOptions>
      <c:DisplayCustomHeaders/>
    </c:WorksheetOptions>
    <x:WorksheetOptions>
      <x:PageSetup>
        <x:Layout/>
        <x:PageMargins/>
        <x:Header/>
        <x:Footer/>
      </x:PageSetup>
    </x:WorksheetOptions>
    <x:AutoFilter>
      <x:AutoFilterColumn>
        <x:AutoFilterCondition/>
        <x:AutoFilterAnd>
          <x:AutoFilterCondition/>
        </x:AutoFilterAnd>
        <x:AutoFilterOr>
          <x:AutoFilterCondition/>
        </x:AutoFilterOr>
      </x:AutoFilterColumn>
    </x:AutoFilter>
  </ss:Worksheet>
  <c:ComponentOptions>
    <c:Toolbar>
      <c:HideOfficeLogo/>
    </c:Toolbar>
  </c:ComponentOptions>
  <o:SmartTagType/>
</ss:Workbook>

As can be seen from the example spreadsheet above, most of the tags can be omitted, and you will still have a viable XML spreadsheet.

There are some important things to remember when specifying an XML spreadsheet. First, you must define a style for each different kind of cell formatting. It can be time consuming to create all the different styles you want, but once they are created, you can use them for all the XML spreadsheets you create. Custom column widths are specified using the Width attribute of the Column element. Likewise, custom row heights are specified using the Height attribute of the Row element. Cells can be merged using the MergeAcross and MergeDown attributes of the Cell element. Most other formatting is done in the PageSetup element, which is a child of the WorksheetOptions element.
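Because every kind of cell formatting needs its own style definition, a cell that refers to a style ID that was never defined is an easy mistake to make. The sketch below, again in Python and only as an illustration, lists such references. The function name undefined_style_ids and the SAMPLE workbook are hypothetical; the Width, Height, and MergeAcross attributes in SAMPLE are the ones named above, with arbitrary values.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to both the default and the "ss:" prefix in XMLSS files.
SS = "urn:schemas-microsoft-com:office:spreadsheet"


def undefined_style_ids(xml_text):
    """Return the sorted StyleID values that have no matching Style element."""
    root = ET.fromstring(xml_text)
    defined = {s.get("{%s}ID" % SS) for s in root.iter("{%s}Style" % SS)}
    used = set()
    for element in root.iter():  # any element may carry a StyleID attribute
        style_id = element.get("{%s}StyleID" % SS)
        if style_id is not None:
            used.add(style_id)
    return sorted(used - defined)


# Hypothetical workbook: "Bold" is defined, "Total" is used but never defined.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Styles>
  <Style ss:ID="Bold"><Font ss:Bold="1"/></Style>
 </Styles>
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Column ss:Width="80"/>
   <Row ss:Height="15">
    <Cell ss:StyleID="Bold" ss:MergeAcross="1"><Data ss:Type="String">Title</Data></Cell>
   </Row>
   <Row>
    <Cell ss:StyleID="Total"><Data ss:Type="Number">16</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""

if __name__ == "__main__":
    print(undefined_style_ids(SAMPLE))  # ['Total']
```

The check is deliberately strict: it reports any style ID that does not appear in the Styles element of the same file, without assuming that any style is built in.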
Headers and footers are created by specifying the Data attribute of the Header and Footer elements. The codes for specifying headers and footers are somewhat cryptic, and the best way to learn how to write them is to create example headers or footers in a worksheet and save the file as an XML spreadsheet. Then open the XML file with a text editor and see what Excel wrote.

PART II: EXAMPLE MACROS FOR CREATING XML SPREADSHEETS
XML files are plain old text files, and it is easy to create text files from SAS. Creating an XML spreadsheet from SAS simply requires using PUT statements in a DATA _NULL_ step. It is seriously old school SAS programming. The steps for writing an XML spreadsheet are as follows.

1. Start workbook
2. Write styles
3. Write worksheet
   a. Start worksheet
   b. Specify columns
   c. Specify row and cells
      i. Start row
      ii. Specify cells
      iii. End row
   d. Repeat step c for each row
   e. End worksheet
4. Repeat step 3 for each additional worksheet
5. End workbook

A macro was created for each of the steps. Macros are well suited for these tasks since the same operation is done repeatedly. Also, since no errors in the XML code can be tolerated, full automation of the process is optimal.

STARTING THE WORKBOOK
This step is the same for all files. The macro is passed one required parameter, the output filename (fileout). Since it is an XML file, the filename should end in .xml. This macro writes the boilerplate code mentioned in Part I. Notice there is no mod option in the file statement. This macro will always start a new file, or overwrite an existing file with the same name. All of the other macros use the mod option to append to the file started by %workbook_start.

%macro workbook_start (fileout);
  /* No MOD option: always creates or overwrites &fileout. */
  data _null_;
    file "&fileout";
    put '<?xml version="1.0"?>' /
        '<?mso-application progid="Excel.Sheet"?>' /
        '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" ' /
        ' xmlns:o="urn:schemas-microsoft-com:office:office" ' /
        ' xmlns:x="urn:schemas-microsoft-com:office:excel" ' /
        ' xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" ' /
        ' xmlns:html="http://www.w3.org/TR/REC-html40" >';
  run;
%mend workbook_start;

WRITING THE STYLES
The %workbook_styles macro establishes all the formats available for use in the workbook. The styles must be defined in advance of specifying the worksheets. The sample macro below creates two styles, Default and Bold. The Default style shows some of the different child elements of Style. Each of the formatting elements in Style has a set of attributes that can be specified. For example, in the Bold style, the Bold attribute of the Font element is set to “1” (true). To learn more about the available attributes for the style elements, please see the website referenced above. Typically, many more styles are specified within this macro since the point of using this method is to achieve custom formatting.

%macro workbook_styles (fileout);
  /* MOD appends to the file started by %workbook_start. */
  data _null_;
    file "&fileout" mod;
    put ' <Styles>' /
        ' <Style ss:ID="Default">' /
        ' <Alignment ss:Vertical="Bottom"/>' /
        ' <Borders/>' /
        ' <Font/>' /
        ' <Interior/>' /
        ' <NumberFormat/>' /
        ' <Protection/>' /
        ' </Style>' /
        ' <Style ss:ID="Bold">' /
        ' <Font ss:Bold="1"/>' /
        ' </Style>' /
        ' </Styles>';
  run;
%mend workbook_styles;

WRITING THE WORKSHEET
The %worksheet_write macro is where the programmer specifies the content and formatting of the worksheet. This is done through a series of %column, %row, and %cell macro calls. This macro must be customized for each project. The worksheet is initialized by calling the %worksheet_start macro and specifying the sheetname parameter (“Example” in this case). The column widths are specified with the %column macro.
If no column widths are specified, a default width is used. Then the rows and cells are specified. Empty rows are specified by a single call of the %row macro with no parameters. Rows with cell content require two %row macro calls, one for opening and one for closing the row. In the %cell macro, content is specified using the content parameter, and formatting is specified with the style parameter. Calling the %cell macro with no parameters creates a blank cell. Column and cell order start from the left, and row order starts from the top.

%macro worksheet_write (fileout);
  data _null_;
    file "&fileout" mod;
    %worksheet_start(Example);
    %column(width=80);
    %row(type=open);
    %cell(content=Example, style=Bold);
    %row(type=close);
    %worksheet_end;
  run;
%mend worksheet_write;

The %worksheet_start macro has one required parameter, sheetname. It is also the place to specify which row(s) are repeated at the top of each page should the worksheet exceed one page in length. In this example, the first row is to be repeated as indicated in the NamedRange element. The Names element can be omitted without consequence if desired.

%macro worksheet_start (sheetname);
  put ' <Worksheet ss:Name="' "&sheetname" '">' /
      ' <Names>' /
      ' <NamedRange ss:Name="Print_Titles" ss:RefersTo="=' "&sheetname" '!R1:R1"/>' /
      ' </Names>' /
      ' <Table>';
%mend worksheet_start;

The %column macro is used when custom column widths are desired; simply specify the width parameter. If consecutive columns have the same width, setting the span parameter to the number of additional columns means only one %column macro call is necessary.

%macro column (width=46.5, span=0);
  put ' <Column ss:Width="' "&width" '" ss:Span="' "&span" '"/>' ;
%mend column;

Depending on the type parameter, the %row macro creates an opening, closing, or empty row element. Formatting can be specified at the row level. Row level formatting can be pre-empted by cell level formatting.
The cell formatting, however, is not superimposed on the row formatting. That is, unspecified elements and attributes in the cell format take on the worksheet default, not the value in the row format.

%macro row (type=empty, height=12.75, style=Default);
  /* Each branch generates a PUT statement without its own semicolon; */
  /* the semicolon that follows the macro call ends the statement.    */
  %if "&type"= "open" %then
    put ' <Row ss:Height="' "&height" '" ss:StyleID="' "&style" '">' ;
  %else %if "&type"= "close" %then
    put ' </Row>' ;
  %else
    put ' <Row/>' ;
%mend row;

The %cell macro can be passed text strings, macro variables, and variable values of different types. For text strings and macro variables the variable parameter should be set to “no” (default). For variable values (numeric or character) it should be set to “yes”. The allowable data types in XMLSS are String (default), Number, DateTime, Boolean, and Error. Multiple cells can be merged by setting the merge parameter to the number of additional cells.

%macro cell (content=none, variable=no, type=String, merge=0, style=Default);
  /* content=none: blank cell; variable=no: literal text or macro variable; */
  /* variable=yes: value of a DATA step variable, with +(-1) backing up     */
  /* over the trailing blank that list output writes after the value.       */
  %if "&content"= "none" %then
    put ' <Cell ss:StyleID="' "&style" '"/>' ;
  %else %if "&variable"= "no" %then
    put ' <Cell ss:StyleID="' "&style" '" ss:MergeAcross="' "&merge" '">' /
        ' <ss:Data ss:Type="' "&type" '" xmlns="http://www.w3.org/TR/REC-html40">'
        "%unquote(&content)" '</ss:Data>' /
        ' </Cell>' ;
  %else %do;
    if missing(&content) then
      put ' <Cell ss:StyleID="' "&style" '"/>' ;
    else
      put ' <Cell ss:StyleID="' "&style" '" ss:MergeAcross="' "&merge" '">' /
          ' <Data ss:Type="' "&type" '">' &content +(-1) '</Data>' /
          ' </Cell>' ;
  %end;
%mend cell;

When specifying the content parameter for the %cell macro, be wary of special characters, both in SAS and in XML. For example, a macro quoting function such as %str() or %nrstr() must be used to pass text strings containing commas. Otherwise, SAS will interpret the comma as a parameter delimiter and the macro will fail. String data containing less than (<), greater than (>), or ampersand (&) symbols can trip up XML parsers.
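What happens to these three symbols can be seen in a short sketch, written in Python purely for illustration; the same substitutions carry over to a SAS character variable. The helper names escape_for_xml and escape_wrong_order are hypothetical, while escape is the function of that name in Python's standard xml.sax.saxutils module. The second helper exists only to show a pitfall: if the ampersand is handled last, the ampersands introduced by the other two substitutions get escaped a second time.

```python
from xml.sax.saxutils import escape


def escape_for_xml(text):
    """Replace the three special characters, ampersand first."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def escape_wrong_order(text):
    """Same substitutions with the ampersand last; shown only as a pitfall."""
    return (text.replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("&", "&amp;"))


if __name__ == "__main__":
    print(escape_for_xml("R&D <Phase 2>"))   # R&amp;D &lt;Phase 2&gt;
    print(escape("R&D <Phase 2>"))           # same result from the library
    print(escape_wrong_order("a<b"))         # a&amp;lt;b  (double escaped)
```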
Those symbols should be replaced with “&lt;”, “&gt;”, and “&amp;”, respectively. If the source is a character variable in SAS, a function such as tranwrd() can be used to do the substitution. Replace the ampersand first, so that the ampersands introduced by the other two substitutions are not themselves replaced.

The %worksheet_end macro is where the page setup is specified. Page orientation, headers, footers, margins, and more can be specified here. In the example macro, the Data attribute of the Header element specifies that the header contain “WUSS Example” centered and in bold. Similarly, the footer displays the date on the left, the page number in the center, and the author’s name on the right. The header and footer specifications are cryptic, and the best way to learn how to specify them is to save examples as XML spreadsheets and look at the XML code.

%macro worksheet_end;
  put ' </Table>' /
      ' <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">' /
      ' <PageSetup>' /
      ' <Layout x:Orientation="Landscape"/>' /
      ' <Header x:Data="&amp;C&amp;B' "WUSS Example" '"/>' /
      ' <Footer x:Data="&amp;L&amp;D&amp;C&amp;P&amp;RJames Van Campen"/>' /
      ' <PageMargins x:Left="0.5" x:Right="0.5" x:Top=".75" x:Bottom=".75"/>' /
      ' </PageSetup>' /
      ' </WorksheetOptions>' /
      ' </Worksheet>';
%mend worksheet_end;

ADDITIONAL WORKSHEETS AND ENDING THE WORKBOOK
After the %worksheet_end macro has executed, another worksheet can be added to the workbook simply by adding another DATA _NULL_ step, like the first, in the %worksheet_write macro. Be sure to specify a different worksheet name. When all the worksheets have been specified, the workbook must be finalized by calling the %workbook_end macro. This macro simply writes the closing Workbook tag.

%macro workbook_end (fileout);
  data _null_;
    file "&fileout" mod;
    put '</Workbook> ';
  run;
%mend workbook_end;

CONVERTING THE XML TO EXCEL
Creating an XML spreadsheet is great, but the ultimate goal for most programmers is to produce a regular Excel workbook. To do this I wrote a conversion macro.
The %xml_to_xls macro has two required parameters, the input and output filenames. The input filename will be the XML file created in the previous steps. The output filename must end in .xls. Two options, xwait and xsync, must be turned off before DDE is used. The Excel application is launched using the x command. The path must specify the location of the Excel application on your machine. SAS is instructed to wait two seconds while the application launches. The minimum required sleep time can vary from computer to computer. The DDE topic "system" is defined in order to send commands to Excel. Then Excel is instructed to open the input file and save it with the output filename in Excel format. Notice the “,1” after the output filename. That is a crucial parameter which specifies the file is to be saved in Excel format instead of XML. Without that parameter, you will simply get an XML file with a .xls extension. Finally, Excel is instructed to quit.

%macro xml_to_xls (filein, fileout);
  options noxwait noxsync;
  /* Adjust this path to the location of Excel on your machine. */
  x '"C:\Program Files\Microsoft Office\Office10\Excel.exe"';
  /* Give Excel time to launch before sending DDE commands. */
  data _null_;
    rc = sleep(2);
  run;
  FILENAME ddecmds DDE "excel|system";
  DATA _NULL_;
    FILE ddecmds;
    PUT "[open("'"'"&filein"'"'")]";
    PUT "[save.as("'"'"&fileout"'"'",1)]";
    PUT '[quit()]';
  RUN;
%mend xml_to_xls;

PUTTING IT ALL TOGETHER
The macros are called in the following order.

%workbook_start (fileout);
%workbook_styles (fileout);
%worksheet_write (fileout);
%workbook_end (fileout);
%xml_to_xls (filein,fileout);

In review, the %workbook_styles and %worksheet_write macros must be customized for each specific project. The %worksheet_write macro will have a separate DATA _NULL_ step for each worksheet written. Also, the %worksheet_write macro makes use of the following macros.

%worksheet_start (sheetname);
%column (width=46.5, span=0);
%row (type=empty, height=12.75, style=Default);
%cell (content=none, variable=no, type=String, merge=0, style=Default);
%worksheet_end;

The arrangement of the %column, %row, and %cell macro calls in the %worksheet_write macro is specific to the worksheet being written. The %column macro calls are for customizing column widths. The %row macro calls will create blank rows when no type parameter is specified. Otherwise, the %row macro indicates the start (type=open) or the end (type=close) of a row specification. The %cell macro will create a blank cell if the content parameter is omitted. Otherwise, the %cell macro will display the content per the style specified. The %worksheet_end macro also must be customized for each project, since that is where headers, footers, and other page setup options are specified.

CONCLUSION
Writing XML spreadsheets from SAS can be an effective method for programmers to create formatted output in Excel. The chief advantages are total control of the location and the formatting of output in the spreadsheet. Also, by automating with macros, a great deal of time can be saved compared to formatting by hand. This is especially true for complex spreadsheets that must be generated on a repeated basis. The disadvantages are that the system is complex and finicky. Also, there is no way to handle charts or other graphics in an XML spreadsheet.

REFERENCES
James Van Campen, Using SAS with XML to Create Custom Formatted Excel Workbooks, WUSS 2005
James Van Campen, SAS Macros for Creating Custom Formatted Excel Workbooks, WUSS 2006

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

James Van Campen
SRI International, BS-153
333 Ravenswood Avenue
Menlo Park CA 94025
Phone: 650-859-2906
Email: james.vancampen@sri.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.