XML

Microsoft introduced built-in XML support with SQL Server 2000. This chapter provides a brief overview of XML itself, as well as SQL Server 2000 XML support and related technologies. For a detailed treatment of the topics touched on here, see SQL Server 2000 XML Distilled by Kevin Williams et al. (ISBN 1-904347-08-8).

XML

XML stands for Extensible Markup Language. XML is used primarily to make documents self-describing. Self-describing data is easier to interpret when it is moved from one business unit to another, or from one company to another.

XML looks similar to HTML, in that it makes use of tags. Unlike HTML, however, XML tags do not have predefined meanings; in XML, you define your own tags. Tags in HTML describe the format of the data, whereas XML tags describe the data itself. For example, in HTML, the following block means that the string Welcome! is presented in bold characters:

    <b> Welcome! </b>

In XML, the tag <b> doesn't have a predefined meaning.

XML is not a programming language; it is a data format. XML presents data hierarchically, and not in a relational format. The following is an example of relational data from the Leaders table versus a hierarchical XML document.

Relational data:

    ID  vchTitle        vchCountry      vchLastName  vchFirstName
    1   President       Afghanistan     Karzai       Hamid
    2   President       Albania         Moisiu       Alfred
    3   Prime Minister  United Kingdom  Blair        Tony

Hierarchical XML data:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <TheWorld>
      <Leaders ID="1">
        <vchTitle>President</vchTitle>
        <vchCountry>Afghanistan</vchCountry>
        <vchLastName>Karzai</vchLastName>
        <vchFirstName>Hamid</vchFirstName>
      </Leaders>
      <Leaders ID="2">
        <vchTitle>President</vchTitle>
        <vchCountry>Albania</vchCountry>
        <vchLastName>Moisiu</vchLastName>
        <vchFirstName>Alfred</vchFirstName>
      </Leaders>
      <Leaders ID="3">
        <vchTitle>Prime Minister</vchTitle>
        <vchCountry>United Kingdom</vchCountry>
        <vchLastName>Blair</vchLastName>
        <vchFirstName>Tony</vchFirstName>
      </Leaders>
    </TheWorld>

We can use XML to share data more easily by formatting company data in shared dialects and schemas.
XML data can be transferred across the Internet, rendered in web pages, stored in XML files, and converted to various file types. This flexibility lets you separate data from presentation, and keeps the data platform independent.

The Anatomy of an XML Document

In the previous section's example of an XML document, the first line was:

    <?xml version="1.0" encoding="ISO-8859-1"?>

This is the XML declaration. Strictly speaking, the XML 1.0 specification makes the declaration optional, so a document without one can still be well-formed (meaning that it follows the basic XML syntax rules). It is nevertheless good practice always to include it, because it tells the consumer which version and character encoding to expect. Most XML parsers (involved in consuming, translating, or transforming the document) will reject a document that is not well-formed.

The declaration defines the XML version of the document (almost always 1.0, as defined by the World Wide Web Consortium (W3C)). The encoding, which can differ, defines the character set of the document.

The previous example also included elements and attributes. Elements are what HTML calls tags. The following XML chunk has an element called vchTitle, with an opening tag <vchTitle> and a closing tag </vchTitle> (notice the forward slash that marks the end of the element):

    <vchTitle>President</vchTitle>

The next example shows the use of an attribute. The element is <Leaders>, and ID="3" is the attribute describing the <Leaders> element. Attributes are normally used to describe elements, or to provide additional information about elements:

    <Leaders ID="3">

There are no fixed rules on when to use attributes and when to use elements. However, if an item can occur more than once within its parent, it must be an element, since the same attribute name cannot appear more than once on one element.

A well-formed XML document must have a root element.
The XML document hierarchy is made up of parent and child elements, all of which must be within the root element; the root element is the ancestor of all other elements within a document. In the following example <TheWorld> is the root element, and all child elements exist between the opening <TheWorld> and closing </TheWorld> tags:

    <TheWorld>
    ...
    </TheWorld>

Elements within other elements are called child elements; for example, the <Leaders></Leaders> elements were child elements of the <TheWorld> root element, and <vchTitle> and the other elements are child elements of the <Leaders> parent. Child and parent elements must always be properly nested, meaning that their opening and closing tags must not overlap, but must always be contained within one another, beginning with the root node.

Another requirement for an XML document is that each opening tag be matched by a closing tag; always use closing tags in XML. XML parsers are much less forgiving than HTML parsers (usually a web browser) in this regard. Also, be aware that element names are case-sensitive, and that attribute values must always be quoted (for example ID="3", not ID=3).

XML Technologies

There are a number of XML-related technologies. Some of these technologies are used for displaying XML, like XHTML (a reformulation of HTML as XML), XSL (Extensible Stylesheet Language), and XSLT (XSL Transformations). Others are used to model and map the XML document, such as DTD (Document Type Definition) or XML Schema. DOM (Document Object Model) and SAX (Simple API for XML) are used to manipulate the contents of an XML document programmatically, and XPath (XML Path Language), XLink (XML Linking Language), and XQL (XML Query Language) are used to address, link, and query the contents of an XML document. SOAP (Simple Object Access Protocol) is used to transfer arbitrary XML documents between systems. The latest version of SQLXML, introduced below, includes SOAP functionality.
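The well-formedness rules and the tree and path technologies described above can be tried out with almost any XML parser. The following is a minimal sketch in Python, chosen here only as a neutral illustration and not something SQL Server itself uses; the function names is_well_formed and last_names_by_title are hypothetical helpers, and the document is a shortened copy of the Leaders example. Python's ElementTree offers a tree API in the spirit of DOM and supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the chapter's Leaders document (assumption: two rows only).
DOC = """<TheWorld>
  <Leaders ID="1">
    <vchTitle>President</vchTitle>
    <vchCountry>Afghanistan</vchCountry>
    <vchLastName>Karzai</vchLastName>
  </Leaders>
  <Leaders ID="3">
    <vchTitle>Prime Minister</vchTitle>
    <vchCountry>United Kingdom</vchCountry>
    <vchLastName>Blair</vchLastName>
  </Leaders>
</TheWorld>"""


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def last_names_by_title(text, title):
    """Collect vchLastName for every Leaders element with the given title."""
    root = ET.fromstring(text)
    # Path expression relative to the root element; the predicate keeps only
    # Leaders elements whose vchTitle child has exactly this text.
    matches = root.findall("./Leaders[vchTitle='%s']" % title)
    return [leader.findtext("vchLastName") for leader in matches]


# Each of these breaks one rule: overlapping tags, an unquoted attribute
# value, and a closing tag whose case does not match the opening tag.
BROKEN = [
    "<a><b></a></b>",
    "<Leaders ID=3></Leaders>",
    "<vchTitle>President</VCHTITLE>",
]

if __name__ == "__main__":
    print(is_well_formed(DOC))
    print([is_well_formed(text) for text in BROKEN])
    print(last_names_by_title(DOC, "Prime Minister"))
```

The parser rejects all three broken fragments outright, which is the strictness the text contrasts with HTML browsers.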
Shortly after SQL Server 2000 was released, Microsoft began offering free downloads of SQLXML (XML for SQL Server), which further extended interoperability between XML and SQL Server 2000. At the time of writing, SQLXML is at version 3.0 SP1 (service pack 1). SOAP with SQLXML enables SQL Server stored procedures, user-defined functions, and other SQLXML technologies (described next) to be exposed as web services. Such web services can then be accessed from multiple platforms or programming languages by using the SOAP protocol.

In addition to the aforementioned XML-related technologies, there are multiple methods available for importing, exporting, and manipulating XML data as it relates to SQL Server 2000:

OPENXML

Reviewed in more detail later, OPENXML uses Transact-SQL extensions and system stored procedures to load an XML document into the SQL Server 2000 memory space. The source XML document must be supplied as CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, or NTEXT data, for example from a table column or a local variable. Once the document is in memory, we can use a rowset view of the XML data. We can use the results of this rowset within other Transact-SQL operations (such as importing the results into a table). OPENXML uses the Microsoft XML parser, called Microsoft XML Core Services (MSXML), for parsing the document.

MSXML

MSXML can be used with or independently of SQL Server, and is available as a free download from the Microsoft site. MSXML includes an XML parser, XSLT engine, DOM APIs, XSD (XML Schema definition language), XPath, and SAX. At the time of writing, MSXML is at version 4.0 SP1. Using a programming language such as Visual Basic 6.0, and ADO (Microsoft's data access interface), you can export or load data to or from SQL Server by accessing the DOM exposed by MSXML. MSXML also includes two programming classes that enable HTTP access.
Microsoft XML OLE DB Simple Provider

ADO can also work with the Microsoft XML OLE DB Simple Provider, which can be used to read XML documents into a recordset, which can then be imported into SQL Server 2000. For more information on this provider, see Microsoft Knowledge Base article 271722, HOWTO: Access Hierarchical XML Data with the XML OLE DB Simple Provider.

SQLXML XML Bulk Load Utility

SQLXML version 1.0 introduced the XML Bulk Load utility, which allows high-speed bulk loads of data packaged inside XML tags into SQL Server. Unlike OPENXML, XML Bulk Load does not load the entire XML document into memory, so it can be used to load very large documents. XML Bulk Load is a standalone COM object, and can be referenced and invoked by COM-compliant programming languages.

FOR XML

The FOR XML clause is reviewed in the next section, and is used within a Transact-SQL SELECT statement to output data in hierarchical XML format.

SQLXML Client-side XML Processing

SQLXML 3.0 SP1 includes client-side XML processing, which converts a relational result set to a hierarchical XML document on the client side. If you issue a SELECT ... FOR XML query with client-side XML formatting enabled, only the SELECT statement (without the FOR XML clause) is passed to SQL Server. The rowset is then converted to an XML document by SQLXML on the client workstation.

XML Schema Definition (XSD) Annotated Schemas

XSD is the W3C XML Schema definition language. Microsoft added annotations to it; annotated schemas allow you to avoid complicated FOR XML EXPLICIT statements by binding the XSD definition directly to the database schema. These bindings are also called XML views, and they can be queried with the XML Path language (XPath). You should consider using annotated XSD schemas, instead of FOR XML, if you are more familiar with XML and XPath than with Transact-SQL. Bear in mind that XPath is not as expressive as Transact-SQL (for example, XPath lacks wildcards, has limited data types, and doesn't have a UNION clause).
Direct URL Queries

Internet Information Services (IIS) 5.0 includes XML and SQL Server 2000 integration features. In conjunction with IIS, you can embed SQL queries directly into URL strings. SQL Server can then return the results as an XML document, and display the results in the browser.

XML Templates

Also in conjunction with IIS 5.0, XML templates contain SQL statements that incoming URL requests can invoke. Templates are more secure and flexible than direct URL queries, and are not limited in size or complexity. XML templates are stored on the server itself.

SQLXML Updategrams

Introduced in SQLXML version 1.0, updategrams allow you to modify data in SQL Server by using special XML tags. With updategrams, you use an XML grammar to specify before and after images for fragments of the modified data. Updategrams implement an XML-to-SQL mapping that eliminates the need to write Transact-SQL update queries. This may be beneficial for developers who are unfamiliar with Transact-SQL and more comfortable with XML and related XML languages.

SQLXML Diffgrams

Diffgrams are similar to updategrams, but they can be generated automatically from an ADO.NET DataSet object (ADO.NET is the next-generation data access model for Microsoft's .NET Framework). Aside from ADO.NET, diffgrams can also be used with ADO version 2.6.

SQLXML Managed Classes

Introduced in SQLXML version 2.0, the SQLXML managed classes consist of .NET objects (classes) that allow programmers unfamiliar with traditional SQL to use XML templates or server-side XPath queries against SQL Server instead. To use these classes, the .NET Framework and the free SQLXML download must be installed on the machine where you plan to use them.
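Of the techniques listed above, client-side XML processing is perhaps the easiest to picture: the server returns an ordinary rowset, and the client turns it into XML. The following Python sketch illustrates only that idea and makes no claim about how SQLXML does it internally; the function rows_to_xml, its parameters, and the row and root element names are all hypothetical, and the rows are represented as plain dictionaries.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, row_name="row", root_name="root", elements=False):
    """Convert an already-fetched rowset (a list of dicts) into an XML string.

    With elements=False each column becomes an attribute of the row element;
    with elements=True each column becomes a child element instead.
    """
    root = ET.Element(root_name)
    for row in rows:
        node = ET.SubElement(root, row_name)
        for column, value in row.items():
            if elements:
                ET.SubElement(node, column).text = str(value)
            else:
                node.set(column, str(value))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Stand-in for a rowset returned by a plain SELECT (no FOR XML clause).
    rowset = [
        {"ID": 1, "vchLastName": "Karzai"},
        {"ID": 2, "vchLastName": "Moisiu"},
    ]
    print(rows_to_xml(rowset))
    print(rows_to_xml(rowset, elements=True))
```

The two modes correspond loosely to attribute-centric and element-centric output, a distinction that comes up again with FOR XML and OPENXML.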
Some excellent references on the Web include:

SQLXML SQL Server 2000 (a Kevin Williams site) at http://www.sqlxml.org/
PerfectXML at www.PerfectXML.com/SQLXML.asp
Microsoft's XML Web Services page at http://www.microsoft.com/sql/techinfo/xml/default.asp

Kevin Williams' book, SQL Server 2000 XML Distilled (ISBN 1-904347-08-8), is also a great place for more information.

18.1 How to… Use FOR XML

The FOR XML clause of the Transact-SQL SELECT statement allows you to convert relational rowset data into hierarchical XML output. FOR XML has three modes, which determine how the hierarchical data is formatted: AUTO, RAW, and EXPLICIT.

AUTO mode returns each table used in the FROM clause of the query as an element, and each column referenced in the SELECT clause as an attribute of that element. AUTO mode is ideal for queries using complex JOIN operations, easing the conversion to hierarchical format. The GROUP BY clause and aggregate functions are not allowed in conjunction with FOR XML AUTO.

The following is an example of using the FOR XML AUTO clause:

    SELECT TOP 2 Leaders.ID,
           Leaders.vchTitle,
           Leaders.vchCountry,
           Leaders.vchLastName,
           Country.iPopulation
    FROM Leaders, Country
    WHERE Leaders.vchCountry = Country.vchName
    FOR XML AUTO

This returns:

    <Leaders ID="1" vchTitle="President" vchCountry="Afghanistan" vchLastName="Karzai">
      <Country iPopulation="26668251"/>
    </Leaders>
    <Leaders ID="2" vchTitle="President" vchCountry="Albania" vchLastName="Moisiu">
      <Country iPopulation="3119000"/>
    </Leaders>

AUTO mode can also be combined with the ELEMENTS keyword; when ELEMENTS is added, AUTO mode maps columns to elements instead of attributes.
For example:

    SELECT TOP 2 Leaders.ID,
           Leaders.vchTitle,
           Leaders.vchCountry,
           Leaders.vchLastName,
           Country.iPopulation
    FROM Leaders, Country
    WHERE Leaders.vchCountry = Country.vchName
    FOR XML AUTO, ELEMENTS

This returns:

    <Leaders>
      <ID>1</ID>
      <vchTitle>President</vchTitle>
      <vchCountry>Afghanistan</vchCountry>
      <vchLastName>Karzai</vchLastName>
      <Country>
        <iPopulation>26668251</iPopulation>
      </Country>
    </Leaders>
    <Leaders>
      <ID>2</ID>
      <vchTitle>President</vchTitle>
      <vchCountry>Albania</vchCountry>
      <vchLastName>Moisiu</vchLastName>
      <Country>
        <iPopulation>3119000</iPopulation>
      </Country>
    </Leaders>

RAW mode generates one row element for each row of the query result set, and includes the column data as the row element's attributes. RAW mode does not support retrieval of binary data. The following is an example of using the FOR XML RAW clause:

    SELECT TOP 2 Leaders.ID,
           Leaders.vchTitle,
           Leaders.vchCountry,
           Leaders.vchLastName,
           Country.iPopulation
    FROM Leaders, Country
    WHERE Leaders.vchCountry = Country.vchName
    FOR XML RAW

This returns:

    <row ID="1" vchTitle="President" vchCountry="Afghanistan" vchLastName="Karzai" iPopulation="26668251"/>
    <row ID="2" vchTitle="President" vchCountry="Albania" vchLastName="Moisiu" iPopulation="3119000"/>

By adding the XMLDATA keyword to the FOR XML clause, a schema definition is prepended to the XML output. This is useful for returning the data type of each column.
For example:

    SELECT TOP 2 Leaders.ID,
           Leaders.vchTitle,
           Leaders.vchCountry
    FROM Leaders, Country
    FOR XML AUTO, XMLDATA

This returns two XML blocks, the first defining the schema of the XML document (the elements, attributes, and associated data types), and the second containing the XML data itself:

    <Schema name="Schema4" xmlns="urn:schemas-microsoft-com:xml-data"
            xmlns:dt="urn:schemas-microsoft-com:datatypes">
      <ElementType name="Leaders" content="empty" model="closed">
        <AttributeType name="ID" dt:type="i4"/>
        <AttributeType name="vchTitle" dt:type="string"/>
        <AttributeType name="vchCountry" dt:type="string"/>
        <attribute type="ID"/>
        <attribute type="vchTitle"/>
        <attribute type="vchCountry"/>
      </ElementType>
    </Schema>
    <Leaders xmlns="x-schema:#Schema4" ID="1" vchTitle="President" vchCountry="Afghanistan"/>
    <Leaders xmlns="x-schema:#Schema4" ID="1" vchTitle="President" vchCountry="Afghanistan"/>

Note that this query lists Country in the FROM clause without a join condition, so its rows come from a cross join; this is why the same Leaders row appears twice in the output.

The BINARY BASE64 keyword is used with FOR XML AUTO and FOR XML EXPLICIT (described next) when binary information needs to be embedded in the XML document in base64-encoded format. For example, the picture column from the Categories table is output using FOR XML AUTO and BINARY BASE64 (results not shown, as the BINARY BASE64 output of one row alone takes up several pages):

    SELECT TOP 1 CategoryID, picture
    FROM Categories
    FOR XML AUTO, BINARY BASE64

EXPLICIT mode is the most complex mode to learn, but offers the most control. Using FOR XML EXPLICIT, you can specify exactly which columns of your query result set become elements and which become attributes. You can specify nesting, and use subqueries and UNION queries to create sophisticated XML hierarchies. EXPLICIT format is based on the concept of a universal table, which contains information about the hierarchical structure of the XML document. The keywords TAG and PARENT are used in a FOR XML EXPLICIT SELECT statement.
TAG is used to define the tag number of the current element, and the PARENT column stores the tag number of the parent element (which is always NULL for the top level of the hierarchy). After the TAG and PARENT values are defined in the SELECT statement, the elements and attributes must then be defined via the XML identifier syntax:

    ElementName!TagNumber!AttributeName!Directive

The parts of the identifier are:

ElementName: Defines the generic identifier of the element.
TagNumber: Defines the nesting value of the element within the XML tree.
AttributeName: The name of the XML attribute if Directive is not specified; otherwise, the name of the contained element when Directive is xml, cdata, or element.
Directive: Controls the XML format returned.

Permitted Directive values include:

xml: Directs the column to act as a contained element, without entity encoding taking place (entity encoding is the conversion of special characters to their corresponding entities).
cdata: Wraps the data in a CDATA section, without entity encoding.
element: Generates a contained element with the specified name, with entity encoding applied to the contents of the element.
ID, IDREF, IDREFS: These three types all facilitate XML intra-document links.
hide: The attribute will not be displayed, but can still be used for ordering.
xmltext: Wraps the contents of the column in a single tag, to be consumed or parsed later on, or used for overflow data.

Example 18.1.1: Using FOR XML EXPLICIT

In this example, the top two rows are returned from the Leaders table.
The ID and vchLastName columns are defined as attributes of the Leaders parent element, and the vchCountry column is defined as an element:

    SELECT TOP 2 1 as TAG,
           NULL as PARENT,
           ID as [Leaders!1!Leaders!id],
           vchCountry as [Leaders!1!vchCountry!element],
           vchLastName as [Leaders!1!vchLastName]
    FROM dbo.Leaders
    FOR XML EXPLICIT

This returns:

    <Leaders Leaders="1" vchLastName="Karzai">
      <vchCountry>Afghanistan</vchCountry>
    </Leaders>
    <Leaders Leaders="2" vchLastName="Moisiu">
      <vchCountry>Albania</vchCountry>
    </Leaders>

For multiple levels in the XML tree, use the UNION ALL operator with additional SELECT statements.

18.2 How to… Use sp_makewebtask to Output XML Documents

You can use FOR XML to output your relational data as hierarchical data; but how do you get it out of SQL Server? This chapter briefly reviewed various technologies that can assist you with outputting data into XML format and files (URL queries, templates, SQLXML); however, there is one easy method that is built in to SQL Server 2000: the sp_makewebtask procedure.

You can use the sp_makewebtask procedure to produce an HTML or XML document. This procedure has a total of 33 parameters, but for our purposes we only need three of them. The user executing this procedure must have SELECT permissions for the query that returns the data (using FOR XML in this case), CREATE PROCEDURE permissions in the database where the query is run, and permissions to write the generated document to the selected location.

The syntax of sp_makewebtask using just these three parameters (see Microsoft SQL Server Books Online for the full array of choices) is as follows:

    sp_makewebtask [@outputfile =] 'outputfile',
                   [@query =] 'query'
                   [, [@templatefile =] 'templatefile']

Parameter: Description

@outputfile = 'outputfile': The location of the HTML or XML file to be generated (UNC names are allowed for remote computers). Depending on your security context, your SQL Server service account, SQL Server Agent, or proxy accounts should be configured to have permissions to write to this location.
@query = 'query': The Transact-SQL query used to output the results.
@templatefile = 'templatefile': Template files are used to generate the HTML or XML document, and contain placeholders and formatting instructions. The <%insert_data_here%> tag indicates where query results should be added to an HTML table. The <%begindetail%> and <%enddetail%> tags define a complete row format (see SQL Server Books Online for a review of the formatting options).

For our upcoming example, create a template file in Notepad called C:\temp\leaders.tpl, the contents of which should be:

    <leaders>
    <%begindetail%>
    <%insert_data_here%>
    <%enddetail%>
    </leaders>

Notice that we have placed a <leaders> root tag to make this a well-formed document.

Example 18.2.1: Using sp_makewebtask to generate an XML document

Once you have created the template file described above, you can execute the sp_makewebtask stored procedure in Query Analyzer, with the following example parameters:

    sp_makewebtask @outputfile = 'C:\temp\leaders.xml',
                   @query = 'SELECT * FROM Leaders FOR XML AUTO',
                   @templatefile = 'C:\temp\leaders.tpl'

This produces a leaders.xml file with the results of the SELECT * FROM Leaders query in XML format. If you open this file with Microsoft Internet Explorer (version 5 or higher), the browser displays the document as an XML tree.

[Figure: leaders.xml displayed in Internet Explorer]

18.3 How to… Use OPENXML

The OPENXML Transact-SQL command allows us to query an XML document, such as one stored in a table column, as if it were a relational table. OPENXML allows retrieval of both elements and attributes from an XML document fragment. The generated rowset can then easily be used within other SQL statements (a subquery or an INSERT statement, for example).
The sp_xml_preparedocument system stored procedure is used in conjunction with OPENXML to move the XML document into memory. The document is stored in the internal cache of SQL Server, using up to one eighth of the total available SQL Server memory. The procedure reads the XML document and parses it internally using the MSXML (Microsoft XML) parser. The sp_xml_preparedocument stored procedure returns a handle that we can use to access the internal representation of the XML document. The handle can be used for the duration of the user connection to SQL Server, or until the handle is de-allocated using sp_xml_removedocument. The sp_xml_removedocument procedure is used to clear the document from memory once you have finished with it.

The syntax for sp_xml_preparedocument is as follows:

    sp_xml_preparedocument hdoc OUTPUT
        [, xmltext]
        [, xpath_namespaces]

Parameter: Description

hdoc OUTPUT: An integer OUTPUT parameter that receives the handle of the newly created XML document.
xmltext: The value containing the XML document. Its data type can be CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, or NTEXT.
xpath_namespaces: Specifies the namespace declarations used in the row and column XPath expressions. This value defaults to <root xmlns:mp="urn:schemas-microsoft-com:xml-metaprop">. A namespace is a collection of element type and attribute names that is significant to the consuming parser or XML-related technology (in this case XPath).

The syntax for sp_xml_removedocument is as follows:

    sp_xml_removedocument hdoc

This procedure takes only one parameter, hdoc, which is the integer handle of the XML document loaded into memory with sp_xml_preparedocument. Always remember to de-allocate your XML documents using sp_xml_removedocument when you have finished using them; otherwise, the memory will not be de-allocated until the user session that loaded the document disconnects.
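The prepare, use, and remove pattern described above is an ordinary handle lifecycle, and it may help to see it outside Transact-SQL. The following Python sketch is an analogy only: DocumentCache and its methods are hypothetical names invented for illustration, and nothing here reflects how SQL Server implements its internal cache. It shows an integer handle being returned for a parsed document, and a try/finally block guaranteeing that the document is released even if the work in between fails.

```python
import itertools
import xml.etree.ElementTree as ET


class DocumentCache:
    """Hypothetical stand-in for a cache of parsed documents keyed by handle."""

    def __init__(self):
        self._documents = {}
        self._handles = itertools.count(1)

    def prepare(self, xml_text):
        """Parse the text and return an integer handle to the parsed tree."""
        tree = ET.fromstring(xml_text)  # raises ParseError if not well-formed
        handle = next(self._handles)
        self._documents[handle] = tree
        return handle

    def get(self, handle):
        """Look up the parsed tree; raises KeyError for an unknown handle."""
        return self._documents[handle]

    def remove(self, handle):
        """Release the parsed tree so that its memory can be reclaimed."""
        del self._documents[handle]

    def __len__(self):
        return len(self._documents)


cache = DocumentCache()
hdoc = cache.prepare('<TheWorld><Leaders ID="1"/><Leaders ID="2"/></TheWorld>')
try:
    # Work with the document only through its handle.
    ids = [node.get("ID") for node in cache.get(hdoc).findall("Leaders")]
finally:
    # Always release the handle, even if the work above raised an error.
    cache.remove(hdoc)
```

In Transact-SQL the equivalent discipline is simply to pair every sp_xml_preparedocument call with a matching sp_xml_removedocument call.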
OPENXML uses XPath, which is a general-purpose query language for addressing, sorting, and filtering elements of an XML document, including the text within them. OPENXML is used in the FROM clause, or anywhere else a rowset provider (table or view) is allowed. The syntax is:

    OPENXML (idoc, rowpattern [, flags])
        [WITH (SchemaDeclaration | TableName)]

Parameter: Description

idoc: Integer document handle of the XML document.
rowpattern: XPath pattern used to identify the nodes to be processed as rows.
flags: Indicates the mapping between the XML data and the relational rowset. 0 defaults to attribute-centric mapping. 1 uses attribute-centric mapping. 2 uses element-centric mapping. When 1 and 2 are combined (a value of 3), attribute-centric mapping is applied first, and then element-centric mapping is applied for columns not yet processed. 8, which can be combined with 1 or 2, specifies that consumed data should not be copied to the overflow property @mp:xmltext.
WITH (SchemaDeclaration | TableName): When WITH (SchemaDeclaration) is used, each entry is made up of a column name (rowset name), a column type (a valid data type), an optional column pattern (an XPath pattern), and optional meta data properties (meta data about the XML nodes). If TableName is used instead of a SchemaDeclaration, a table with the desired schema must already exist; its column definitions are used for the rowset.

Example 18.3.1: Using OPENXML

This example extracts the ID and vchLastName fields from an XML document (held in a VARCHAR local variable), and presents the data in a columnar report format.

The beginning of this example declares an integer variable @idoc for use by the sp_xml_preparedocument system stored procedure. The @doc VARCHAR(1000) local variable holds the XML document string. The sp_xml_preparedocument procedure then parses the @doc string and returns a handle to the in-memory document in the @idoc integer variable.
Lastly, OPENXML is used with @idoc, along with an XPath row pattern, to return the values of the ID and vchLastName columns as a relational result set. The handle is released with sp_xml_removedocument once the query has run:

    DECLARE @idoc int
    DECLARE @doc varchar(1000)

    SELECT @doc = '<TheWorld>
    <Leaders ID="1" vchLastName="Karzai">
    <vchCountry>Afghanistan</vchCountry>
    </Leaders>
    <Leaders ID="2" vchLastName="Moisiu">
    <vchCountry>Albania</vchCountry>
    </Leaders>
    </TheWorld>'

    -- Parse the document and obtain a handle to it
    EXEC sp_xml_preparedocument @idoc OUTPUT, @doc

    SELECT *
    FROM OPENXML (@idoc, '/TheWorld/Leaders', 1)
    WITH (ID int, vchLastName varchar(100))

    -- Release the memory held by the parsed document
    EXEC sp_xml_removedocument @idoc

This returns:

    ID   vchLastName
    1    Karzai
    2    Moisiu

OPENXML works best with smaller documents (1MB or less), since documents are loaded into memory. Documents can only consume up to one eighth of the total available SQL Server memory. This limit can be reached either when several user sessions load documents concurrently, or when a single document exceeds the memory available.
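To see how the flags value changes what OPENXML returns, it can help to emulate the mapping outside SQL Server. The sketch below is a rough Python approximation, not the actual implementation: the shred function and the XML_ATTRIBUTES and XML_ELEMENTS constants are hypothetical names, the flag meanings follow this author's reading of SQL Server Books Online (1 is attribute-centric, 2 is element-centric, 3 tries attributes first and then elements, 0 defaults to attribute-centric), every value is returned as a string with no data type conversion, and the row pattern is written relative to the root element because Python's ElementTree does not accept an absolute path such as /TheWorld/Leaders.

```python
import xml.etree.ElementTree as ET

XML_ATTRIBUTES = 1  # attribute-centric mapping
XML_ELEMENTS = 2    # element-centric mapping

DOC = """<TheWorld>
<Leaders ID="1" vchLastName="Karzai">
<vchCountry>Afghanistan</vchCountry>
</Leaders>
<Leaders ID="2" vchLastName="Moisiu">
<vchCountry>Albania</vchCountry>
</Leaders>
</TheWorld>"""


def shred(xml_text, row_pattern, columns, flags=XML_ATTRIBUTES):
    """Return one tuple per node matched by row_pattern.

    Each requested column is looked up as an attribute, as a child element,
    or as both in that order, depending on flags. Unmatched columns are None.
    """
    mode = (flags & 3) or XML_ATTRIBUTES  # treat 0 as attribute-centric
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_pattern):
        row = []
        for name in columns:
            value = None
            if mode & XML_ATTRIBUTES:
                value = node.get(name)
            if value is None and mode & XML_ELEMENTS:
                value = node.findtext(name)
            row.append(value)
        rows.append(tuple(row))
    return rows


if __name__ == "__main__":
    for flags in (1, 2, 3):
        print(flags, shred(DOC, "./Leaders", ["ID", "vchCountry"], flags))
```

With these assumptions, requesting ID and vchCountry under flag 1 yields the IDs but no countries, flag 2 yields the countries but no IDs, and flag 3 yields both, which is why a document that mixes attributes and child elements usually calls for the combined value.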