CobWeb: A Constraint-based XML for the Web¹

Timothy D. McKernan, Bharat Jayaraman, Sangeetha Raghavan, Srivatsava Shanker
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260-2000
E-Mail: {tdm4, bharat, sr33, sshanker}@cse.buffalo.edu
Phone: (716) 645-3180 x 111
Fax: (716) 645-3464

Abstract

We present a constraint-based extension of XML for specifying the structure and semantic coherence of websites and their data. This extension is motivated by the fact that many websites, especially organizational websites and corporate intranets, largely contain structured information. Such websites can be regarded as databases. XML can help view, store, manipulate, and transfer semi-structured data that exists in files, often web pages, and offers a less ad hoc method of handling data than HTML allows. In support of the view that a website is a database, we introduce CobWeb, a constraint-based extension of XML. CobWeb allows developers to express the semantic integrity of a database. Constraints may be used in a document type definition (DTD) to place restrictions on the values of both elements and attributes in the DTD. These constraints can effectively govern the contents of otherwise disparate web pages in a website, thereby ensuring both the structural and the semantic integrity of the site as a whole. We believe that constraint XML can be the lingua franca of B2B E-commerce. It is easier and less expensive to use on the World Wide Web than Electronic Data Interchange, making it the best alternative for on-line transactions. We provide unary (or domain) constraints, binary constraints (including various comparison operations), as well as aggregation constraints (sum, average, etc.). We also define data types not included in the XML 1.0 specification, so that constraints can be declared more easily.
By taking advantage of XML’s modular design, we can create a parser that works in conjunction with and extends existing XML parsers. We have built a prototype implementation using this idea and tested several of the examples presented in this paper. We are presently extending this implementation and plan to apply it to practical problems.

1 Introduction

The motivation for this work stems from the observation that many websites, especially those of businesses and organizations, contain structured information, such as listings and itemized descriptions of people, places, and events. Such websites can be regarded as databases, in that the structure of the information can be described formally. We might consider using a language such as XML to represent this information. XML provides the concept of document type definitions (DTDs) to describe the logical structure of a web page. (Note that HTML is not a good choice because it is more oriented towards describing the layout, or presentation, of a document.) Defining the logical structure of a website is important, since it is akin to defining the schema of a database: not only can we check the validity of the content of a web page, but we can also facilitate searching the website.

¹ This paper may be referenced as Technical Report 2000-06, Department of Computer Science and Engineering, State University of New York at Buffalo, May 2000. Last Revision: June 2001.

The specification of the logical structure of a website by itself is not enough. In many instances, we also need to ensure that the data in the web pages is semantically coherent. For example, suppose a web page contains a listing of sales figures for different geographic regions and also an additional field for the total sales. In this case, we would like to state a constraint that the total sales figure is the sum of the regional sales figures.
As another example, given a listing of the monthly sales figures for a region, we might have a separate field for the average monthly sales and a constraint that the value in this field is the average of the monthly sales figures. Such semantic conditions are akin to the notion of integrity constraints² in databases.

XML has constructs to describe the logical structure of a web page, but one cannot specify integrity constraints using XML. To overcome this limitation, in this paper we present a constraint-based extension of XML for specifying both the logical structure and the semantic coherence of a website. Essentially, constraints may be used in a DTD to place restrictions on the values of the elements as well as the attributes in the DTD. We provide unary (or domain) constraints, binary constraints (including various comparison operations), and aggregation constraints (sum, average, etc.). Although not considered in this paper, the built-in constraint predicates can be augmented with user-defined predicates, which may then be used in constraint specifications.

A website whose content is specified using CobWeb will typically be authored using a special editor that is capable of checking constraints (and, in some cases, generating values of fields based upon the constraints). Thus, constraints are checked when the website is built (build-time) rather than when the website is viewed or browsed (browse-time). Such a website is therefore correct by construction. Moreover, a browser need not be equipped with the ability to check constraints, and will not incur any time delays checking constraints during browsing. In this approach, the DTDs of constraint-based XML pages are simply translated into standard DTDs.
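As an illustration of what a build-time check of the regional sales example might involve, here is a minimal Python sketch. It is not the CobWeb implementation: the element names (sales_report, region, total) and the helper check_total are hypothetical and chosen only to mirror the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page content; the element names are illustrative only.
PAGE = """<sales_report>
  <region name="East">120.5</region>
  <region name="West">79.5</region>
  <total>200.0</total>
</sales_report>"""

def check_total(xml_text, tolerance=1e-9):
    """Return True if <total> equals the sum of all <region> values."""
    root = ET.fromstring(xml_text)
    regional = [float(node.text) for node in root.findall("region")]
    total = float(root.findtext("total"))
    return abs(sum(regional) - total) <= tolerance

if __name__ == "__main__":
    # An authoring tool could refuse to publish the page when this is False.
    print(check_total(PAGE))
```

Because the check runs when the page is authored, a browser that later displays the page needs no knowledge of the constraint.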
The remainder of this paper is organized as follows: section 2 describes our constraint-based extensions; section 3 introduces inheritance; section 4 covers miscellaneous features; sections 5.1 and 5.2 present two illustrative examples, an on-line brochure and a product comparison; section 6 compares CobWeb with other schema languages such as Schematron, DSD, XDR, and SOX; and the final section gives the current status of our implementation of CobWeb and future directions. We assume knowledge of XML Document Type Definitions, including elements, attributes, namespaces, and links. For an introduction to these features, the reader may refer to www.xml.org.

2 Constraint-based Extension of XML

Despite XML’s strengths in handling data, it has several weaknesses that hinder it from describing many types of structured data. Specifically, XML does not define data types other than character data (#PCDATA), and it does not support operations on the data it defines. We use constraints to deal with these problems. A constraint declaration is expressed through a constraint expression.

    Constraint_expr ::= [NOT] Constraint (Lop Constraint)*
    Lop ::= AND | OR
    Constraint ::= Complex_id (Rop Complex_id)*
    Rop ::= ge | gt | le | lt | =
    Complex_id ::= Simple_id (.Simple_id)* [:attribute | :href(Complex_id)]
    Simple_id ::= identifier | identifier '[' Index ']'
    Index ::= first | last | pos_integer
    Aggregate_constraint ::= Agg_term Rop Agg_term | Constraint Agg_op (Complex_id)
    Agg_term ::= Complex_id | Agg_func (Complex_id)
    Constraint ::= Complex_id (Rop Agg_op Complex_id)*
    Agg_op ::= Agg_func | Agg_pred
    Agg_func ::= SUM | AVE | COUNT
    Agg_pred ::= ASCENDING | DESCENDING
    Quantified_constraint ::= [EVERY | EXISTS] (x : Complex_id) Constraint_expr

² See integrity constraints in XML: http://ftp.sas.com/techsup/download/technote/ts594.html

A constraint is simply a predicate (or its negation) applied to its arguments. In the above syntax, Lop can be AND or OR. The syntax allows constraints to be placed on elements as well as attributes.
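To make the Complex_id production more concrete, the following Python sketch splits an identifier into its path steps and optional attribute. It is an illustrative reading of the grammar above, not the CobWeb parser: the function name parse_complex_id and the returned tuple shape are assumptions, and the :href(...) form is returned unparsed.

```python
import re

# One Simple_id: an identifier with an optional [first | last | pos_integer] index.
_STEP = re.compile(r"([A-Za-z_][\w-]*)(?:\[(first|last|[1-9]\d*)\])?$")

def parse_complex_id(text):
    """Split a Complex_id into (steps, attribute).

    steps is a list of (name, index) pairs, where index is None, 'first',
    'last', or an int. attribute is the text after ':' or None; an
    :href(...) suffix is returned as-is rather than parsed.
    """
    compact = "".join(text.split())          # tolerate spaces around '.' and ':'
    path, _, attribute = compact.partition(":")
    steps = []
    for part in path.split("."):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"bad Simple_id: {part!r}")
        name, index = match.groups()
        if index is not None and index.isdigit():
            index = int(index)
        steps.append((name, index))
    return steps, attribute or None

if __name__ == "__main__":
    print(parse_complex_id("a.b[2].c[last]"))
    print(parse_complex_id("adult:age"))
```

Keeping the index as 'first', 'last', or an integer leaves the choice of how to resolve a step against a document tree to the caller.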
While a DTD specifies the logical structure of some piece of data, constraints augment a DTD by specifying the semantic conditions on the integrity of the data. A CONSTRAINT declaration is placed in the DTD because the DTD defines all other aspects of the meaning of the data. Note that constraints may be placed on both elements and their attributes.

Constraints are often categorized as either unary or binary. A unary constraint acts upon a single variable, such as a domain constraint, while a binary constraint relates two or more variables. For example, suppose a variable x is constrained to be even. This is a unary constraint because it acts only on the variable x. A binary constraint involves a predicate such as lt with two arguments, e.g. x lt y. Other common relational operators are included in constrained XML: gt (greater than), = (equal to), != (not equal to), le (less than or equal to), and ge (greater than or equal to). These operators allow simple comparisons between values.

The logical operators are AND, OR, and NOT. The AND operator is included for completeness, and to give programmers a familiar way to express conjunctions, even though they can already be expressed with the tools described so far. For example, an element that needs two constraints can be given two separate constraint declarations or a single declaration using the AND operator.

Unlike standard XML, in which a tag’s data has only a character type, constrained XML also uses integer, real, and string types. These types are specified in the ELEMENT declaration as #INTEGER, #REAL, and #STRING. In ATTRIBUTE declarations these types are specified as INTEGER, REAL, and STRING. The inclusion of aggregate types, such as a DATE type, is being considered for future versions of CobWeb.

Extensions to the Element definition

In addition to the constraint-based extension, we also extend the element definition by introducing inheritance and sets.
This section gives a preview of inheritance; a detailed explanation can be found in Section 3.

1. Inheritance: <!ELEMENT child :: parent element_body>
2. ENUM
3. Set Abstraction: an element body can contain a set in addition to a sequence, e.g. {a, b, c} is short-hand notation for (a, b, c) | (a, c, b) | (b, a, c) | (b, c, a) | (c, a, b) | (c, b, a).

Complex Identifier

An XML document has a tree structure. CobWeb therefore uses a “Complex Identifier” to identify nodes within the tree. A complex identifier such as a.b.c defines a path from the root of the tree to a node such that the sequence of nodes on the path have element names a, b, and c.

    a.b
    a.b.c
    a . b[2] . c[3]
    a[1] . b[2] . c[3]
    a : attrA
    a . b : attrB

The index 2 in b[2] selects the second b sub-element. When we place constraints on attributes, we use the colon identifier. Readers may refer to example 1 in section 3 for this aspect. We can also place a constraint on the attribute of a sub-element through these identifiers.

2.1 Simple Examples

Example. A constraint may be placed on a nested element. The relation of the parent element to the nested element is shown by using a dot notation:

    <?xml encoding = "US-ASCII"?>
    <!ELEMENT adult (age)>
    <!ELEMENT age (#INTEGER)>
    <!CONSTRAINT (adult.age ge 18)>

This restricts the constraint on age to those age elements contained within an adult element.

Example. We could specify the same constraint using attributes; they are specified in a similar manner using a colon. We reformulate the previous example using an attribute.
    <?xml encoding = "US-ASCII"?>
    <!ELEMENT adult EMPTY>
    <!ATTLIST adult age REAL #REQUIRED >
    <!CONSTRAINT (adult:age ge 18)>

Example. Constraints may be combined to form more powerful declarations:

    <?xml encoding = "US-ASCII"?>
    <!ELEMENT gender (#PCDATA)>
    <!CONSTRAINT (gender = "male" OR gender = "female") >
    <!ELEMENT age (#REAL)>
    <!ELEMENT patient (age,gender)>
    <!ELEMENT m_a_p (patient)>
    <!CONSTRAINT (m_a_p.age ge 18 AND m_a_p.gender = "male")>

Here we use conjunctive and disjunctive operators to define both the gender and patient elements for a web page that displays information based on the patient’s age and gender.

    <?xml version="1.0"?>
    <!DOCTYPE male_patient SYSTEM "patient.dtd">
    <male_patient>
      <patient>
        <age>21</age>
        <gender>male</gender>
      </patient>
    </male_patient>

Example. We now illustrate the use of links and constructs to specify how to access data in external resources, as well as how to specify the type of data that should exist in these resources.

    <?xml encoding="US-ASCII"?>
    <!ELEMENT patient EMPTY>
    <!ATTLIST patient
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED >
    <!ELEMENT adult (age)>
    <!ELEMENT age (#INTEGER)>
    <!CONSTRAINT (patient:href() = adult)>
    <!CONSTRAINT (patient:href(adult.age) ge 18)>

The first constraint declares that the document found by following the patient’s href attribute must contain an adult element as its document root. The constraint on the age remains the same as in previous examples, but now it is accessed by following the patient’s href attribute to the adult page.

2.2 Aggregate Operations

    <!CONSTRAINT (Complex_id Rop Agg_op (Complex_id))>

An aggregate operation is a function that maps a set of values to a single value, e.g., the sum of the members of the set, the minimum or maximum value in the set, the average value, etc.
Since many webpages contain collections of entities, we often wish to express a constraint in terms of the aggregate value of some collection. In the example below, we show a constraint that makes use of the average value of a set of student grades.

    <?xml encoding="US-ASCII"?>
    <!ELEMENT class (course, average, student+)>
    <!ELEMENT course (#PCDATA)>
    <!ELEMENT student (name, grade)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT grade (#REAL)>
    <!ELEMENT average (#REAL)>
    <!CONSTRAINT (average = AVERAGE(class.student.grade))>

The introduction of data typing allows more complex data structures to be created. In XML, as in HTML, elements may be nested within elements. This is similar to objects containing other objects in an object-oriented language; some of the implications are the same. One element may contain a set of elements describing a student (i.e. name, age, id number, grades, etc.). Several student elements may be combined with a similar teacher element to create a class element, and so on. Adding constraints to these structures allows more precise combinations to be created.

We make constraints available to the programmer through several predicate functions. The functions return a Boolean value depending upon whether or not the specified constraint has been met. Using predicate functions rather than a more declarative syntax gives web authors a programming device that is familiar to most of them.

A simple example of a predicate function is COUNT(c_element). The function compares the value of the constrained element to the number of occurrences of c_element, and returns true if they are equal, and false otherwise. In a DTD, it would look something like this:

    <?xml encoding="US-ASCII"?>
    <!ELEMENT packing_slip (address, item+, total_items)>
    <!ELEMENT address (#PCDATA)>
    <!ELEMENT item (#INTEGER)>
    <!ELEMENT total_items (#INTEGER)>
    <!CONSTRAINT (total_items = COUNT(item))>

The value of total_items must equal the number of item tags in a document.
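A checker for this COUNT constraint could be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's code: the function count_constraint_holds and the sample document are hypothetical, and only the element names packing_slip, address, item, and total_items come from the DTD above.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the packing_slip DTD.
SLIP = """<packing_slip>
  <address>(hypothetical address)</address>
  <item>3</item>
  <item>7</item>
  <total_items>2</total_items>
</packing_slip>"""

def count_constraint_holds(xml_text, counted="item", holder="total_items"):
    """Check holder = COUNT(counted) for direct children of the document root."""
    root = ET.fromstring(xml_text)
    declared = int(root.findtext(holder))      # value stored in <total_items>
    actual = len(root.findall(counted))        # number of <item> tags
    return declared == actual

if __name__ == "__main__":
    print(count_constraint_holds(SLIP))
```

Note that COUNT compares against the number of item tags, not the integer stored inside each one.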
A packing slip is a natural use of this constraint: the slip lists the total number of items in a package, and itemizes each one. Each item is listed as an item tag (perhaps with the name of the item in a tag nested within the item tag). The total_items tag must be an accurate count of these items so that the packers can check that they have filled the package correctly.

Unlike COUNT(), some of the functions have more than one format. The SUM() function can have one or more parameters: SUM(element), and SUM(element a, element b, ...). SUM() always computes the sum of the data values for each element that is passed to it, over every instance of that element in the document. A computer parts distributor might have a document containing a list of products ordered by each customer. A table at the bottom of the page lists how many of each product are needed. SUM() would check how many total speakers, modems, etc., have been ordered by adding the values of the data in the quantity tag of each product.

    <?xml encoding="US-ASCII"?>
    <!ELEMENT shipping_orders (customer*, modem_total, monitor_total, keyboard_total, drive_total)>
    <!ELEMENT customer (name, address, items*)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT address (#PCDATA)>
    <!ELEMENT items ((modems | monitors | keyboards | drives)*) >
    <!ELEMENT modems (quantity)>
    <!ELEMENT monitors (quantity)>
    <!ELEMENT keyboards (quantity)>
    <!ELEMENT drives (quantity)>
    <!ELEMENT quantity (#INTEGER)>
    <!ELEMENT modem_total (#INTEGER)>
    <!CONSTRAINT (modem_total = SUM(customer.items.modems.quantity)) >
    <!ELEMENT monitor_total (#INTEGER)>
    <!CONSTRAINT (monitor_total = SUM(customer.items.monitors.quantity)) >
    <!ELEMENT keyboard_total (#INTEGER)>
    <!CONSTRAINT (keyboard_total = SUM(customer.items.keyboards.quantity)) >
    <!ELEMENT drive_total (#INTEGER)>
    <!CONSTRAINT (drive_total = SUM(customer.items.drives.quantity)) >

2.2.1 Ordering

    <!CONSTRAINT (ASCENDING(Complex_id))>
    <!CONSTRAINT (DESCENDING(Complex_id))>

Ordering is a form of aggregation constraint.
The ASCENDING() and DESCENDING() constraints require tags to appear in either alphabetical or numerical order according to their data. This immediately allows lists of names, lists of products, chronologies, etc., to be specified in a DTD:

    <!ELEMENT roster (member*)>
    <!CONSTRAINT (ASCENDING(roster.member.name))>
    <!ELEMENT member (name)>
    <!ELEMENT name (#PCDATA)>

Any XML document that uses this constraint must fulfill it in the following way: the roster element has been constrained so that its member elements must be ordered alphabetically by the name tags within them.

A roster of members usually has more structure to it than just an alphabetical order. Consider a roster of faculty members. Often a department’s web page will first group faculty according to title (i.e., full professor, associate professor, lecturer, etc.), and then alphabetize within the ranks. Ordering by title is an interesting problem because the title names are not in alphabetical order. As far as a standard XML application is concerned, there is no way to order these titles.

The enumeration data type can be very useful for aggregation and for performing comparison operations when data is in the form of strings.

    <!ELEMENT week (days+)>
    <!ENUM days MON|TUE|WED|THU|FRI|SAT|SUN>

Given only the ASCENDING() and DESCENDING() constraints, a separate tag must be created within the faculty tag and given a numeric rank, so that the new rank tags may be ordered:

    <!ELEMENT roster (member*)>
    <!CONSTRAINT (ASCENDING(roster.member.rank))>
    <!CONSTRAINT (ASCENDING(roster.member.name))>
    <!ELEMENT member (name, rank)>
    <!ATTLIST member title CDATA #REQUIRED >
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT rank (#INTEGER)>

2.2.2 Quantified Constraints

    <!CONSTRAINT (EVERY (x : Complex_id) constraint_expr)>
    <!CONSTRAINT (EXISTS (x : Complex_id) constraint_expr)>

We also provide the EVERY and EXISTS constraints for stating conditions that must be satisfied by all or some elements, respectively.
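The ordering and quantified constraints can be sketched in a few lines of Python. This is only an illustration: the helper names (values, ascending, every, exists) and the sample roster are hypothetical, and treating ASCENDING as non-strict (equal neighbors allowed) is an assumption, since the text does not say how ties are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical roster instance using the member, name, and rank elements.
ROSTER = """<roster>
  <member><name>Adams</name><rank>1</rank></member>
  <member><name>Baker</name><rank>1</rank></member>
  <member><name>Clark</name><rank>2</rank></member>
</roster>"""

def values(root, path, convert=str):
    """Collect the converted text of every element reached by an ElementTree path."""
    return [convert(node.text) for node in root.findall(path)]

def ascending(seq):
    """ASCENDING: each value is <= its successor (ties allowed by assumption)."""
    return all(a <= b for a, b in zip(seq, seq[1:]))

def every(seq, predicate):
    """EVERY: the predicate holds for all values."""
    return all(predicate(v) for v in seq)

def exists(seq, predicate):
    """EXISTS: the predicate holds for at least one value."""
    return any(predicate(v) for v in seq)

root = ET.fromstring(ROSTER)
names = values(root, "member/name")
ranks = values(root, "member/rank", int)

if __name__ == "__main__":
    print(ascending(names), ascending(ranks))
    print(every(ranks, lambda r: r == 1), exists(ranks, lambda r: r == 2))
```

Separating value collection from the predicates keeps each constraint a one-line function over a plain list.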
For example, to state the condition that every faculty member in a hiring committee must have a rank of 1, we can write:

    <!ELEMENT hiring_committee (member*)>
    <!CONSTRAINT (EVERY (x : hiring_committee.member.rank) x = 1)>
    <!ELEMENT member (name, rank)>
    <!ATTLIST member title CDATA #REQUIRED >
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT rank (#INTEGER)>

On the other hand, to state that at least one member of the graduate affairs committee should be a student, we can write:

    <!ELEMENT graduate_affairs (member*)>
    <!CONSTRAINT (EXISTS (x : graduate_affairs.member.rank) x = 3)>
    <!ELEMENT member (name, rank)>
    <!ATTLIST member title CDATA #REQUIRED >
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT rank (#INTEGER)>

Referential Integrity

We can achieve referential integrity using quantified constraints. For example, every instructor who teaches a class should be a faculty member of some department. Hence we have a foreign key constraint as follows:

    (∀x: Instructor) (∃y: faculty_of_some_department) x = y

    <!CONSTRAINT (EVERY (x : instructor)) (EXISTS (y : department)) x = y>

3 Inheritance

    <!ELEMENT child :: parent element_body>

The ‘double colon’ ( :: ) notation is used for inheritance. Inheritance is essential in any language that follows object-oriented technology. Inheritance is achieved by extending the base type. Consider the base class student and the derived classes ‘teaching assistant’ and ‘research assistant’.

    <?xml encoding="US-ASCII"?>
    <!ELEMENT student (student-id,name, transcript)>
    <!ELEMENT transcript (student-id,course-list)>
    <!ELEMENT course-list (course-id,semester,grade)+>
    <!ELEMENT student-id (#PCDATA)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT grade (#PCDATA)>
    <!ELEMENT semester (#PCDATA)>
    <!ELEMENT course-id (#PCDATA)>
    <!CONSTRAINT (student.transcript.course-list.course-id.grade != "F")>

A student can be a Teaching Assistant or a Research Assistant. Thus the ‘student’ superclass has two subclasses, ‘TA’ and ‘RA’.
    <!ELEMENT TA :: student (work, stipend)>
    <!ATTLIST TA
        duties #PCDATA (#REQUIRED) default='grade students'
        stipend #REAL (#REQUIRED) >
    <!CONSTRAINT (TA.transcript.course-list.course-id.grade ge "B+") >

    <!ELEMENT RA :: student (work, stipend)>
    <!ATTLIST RA
        duties #PCDATA (#REQUIRED) default='grade students'
        stipend #REAL (#REQUIRED) >
    <!CONSTRAINT (RA.transcript.course-list.course-id.grade ge "B") >

Series-Parallel Circuits

    <!-- The circuit example -->
    <?xml encoding="US-ASCII"?>
    <!ELEMENT component >
    <!ATTLIST component
        voltage #REAL (#REQUIRED)
        current #REAL (#REQUIRED)
        resistance #REAL (#REQUIRED) >
    <!CONSTRAINT (voltage = current*resistance)>

    <!ELEMENT series::component (ser_comps)>
    <!ELEMENT ser_comps (component+)>
    <!CONSTRAINT voltage = SUM(ser_comps.component:voltage)>
    <!CONSTRAINT current = EVERY(ser_comps.component:current)>
    <!CONSTRAINT resistance = SUM(ser_comps.component:resistance)>

    <!ELEMENT parallel::component (par_comps)>
    <!ELEMENT par_comps (component+)>
    <!CONSTRAINT voltage = EVERY(par_comps.component:voltage)>
    <!CONSTRAINT current = SUM(par_comps.component:current)>
    <!CONSTRAINT 1/resistance = SUM(1/(par_comps.component:resistance))>

    <!ELEMENT Battery >
    <!ATTLIST Battery voltage #REAL (#REQUIRED) >
    <!ELEMENT connect (battery, component)>
    <!CONSTRAINT (battery:voltage = component:voltage)>

4 Miscellaneous features

Include & Import

With the ‘include’ statement, externally defined schema fragments that have the same target namespace as the current schema are pulled in for convenience. Modular schema definitions improve readability and maintenance.

    Include DTD location= ‘Universal Resource Indicator’

Once this statement is given, all definitions at the URI are automatically included in the working schema. The same applies to ‘import’, but with a subtle difference.
With the ‘import’ statement, externally defined schema fragments that have a different target namespace from the current schema are pulled in for convenience.

Default value for element

While a DTD offers default values for attributes, we extend this property by allowing default values for elements as well.

    <!ELEMENT address (city,state)>
    <!ELEMENT city (#PCDATA)> default='Buffalo'
    <!ELEMENT state (#PCDATA)> default='NY'

Link Traversal

Binary constraints also offer a way to check the integrity of a whole website. The arguments to any constraint may cross over to remote files without the need for a separate syntax. A teacher may make a page for a class that includes the class average. This average may be checked by accessing each student’s individual page and reading the student’s average. A new version of the class average example from section 2.2 looks like this:

    <?xml encoding="US-ASCII"?>
    <!ELEMENT class (course, average, student+)>
    <!CONSTRAINT (class.student:href() = student_page) >
    <!ELEMENT course (#PCDATA)>
    <!ELEMENT average (#REAL)>
    <!CONSTRAINT (average = AVERAGE(class.student:href(student_page.average))) >
    <!ELEMENT student EMPTY>
    <!ATTLIST student
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED >
    <!ELEMENT student_page (name,grade)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT grade (#REAL)>

A CobWeb-aware application knows to traverse the link, look for a student_page tag (the document root), and then look for the average tag, which contains the value we want. In order for the constraint-checking application to know what tags are in a student’s page, the DTD for a student page must also be declared in the class page, or the elements may simply be included in the same DTD.

Combining link traversal with ordering allows even more complicated structures to be defined.
Consider a school’s online brochure, which is composed of a main page containing the table of contents, which links to each section of the brochure. The order of the sections is important. In addition, each section has a link to the previous and following sections, as well as back to the main page. Constraints may be used to ensure that the links to previous, next, and main are correct for each page. This model would allow the structure of multiple web pages to be checked. Brochures, tutorials, and well-structured literature (such as plays) could all be checked in this way. The online brochure example will be fully explained in a detailed example below.

5 Case studies

5.1 The Online Brochure Example

Using the concepts described in the earlier sections, we now demonstrate how CobWeb can be used as a solution to a common problem in websites: verifying data content and links across multiple, interconnected web pages. Many websites contain online brochures or other similarly structured web pages, such as tutorials and books. In these web pages, a table of contents is defined which links together several ordered sections.

[Figure: the "Table of Contents" page with numbered entries 1 through 5, shown with two section pages, "Section 1: What is Computer Science ... Computer Engineering?" and "Section 2: General Information about undergraduate programs"; the visible links include Home and Next.]

The table of contents from the University at Buffalo’s Computer Science and Engineering Department undergraduate brochure is given above as an example. The sections refer to each other: Section 1 contains links to the table of contents and to Section 2. Section 2 contains links to the table of contents, Section 1, and Section 3, and so on. The links within sections create two ordered lists, one linking Section 1 through Section 18, and the other linking the sections in the reverse order. Larger and more complex linking schemes are possible.
A group of interconnected tutorials, such as the one Sun Microsystems has for Java technology (http://web2.java.sun.com/docs/books/tutorial/), could have its structure verified by CobWeb. Checking the accuracy of these links is necessary, and yet tedious and impractical for large structures. While this problem cannot be solved using standard XML, CobWeb offers a concise solution. The DTD below defines both the table of contents and the sections.

    <?xml encoding="US-ASCII"?>
    <!ELEMENT toc (toc_section+, toc_loc)>
    <!ELEMENT toc_section (toc_description)>
    <!ATTLIST toc_section
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED
        xlink:role CDATA #IMPLIED >
    <!ELEMENT toc_description (toc_number, toc_title)>
    <!ELEMENT toc_number (#INTEGER)>
    <!ELEMENT toc_title (#PCDATA)>
    <!ELEMENT toc_loc EMPTY>
    <!ATTLIST toc_loc
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED
        xlink:role CDATA #FIXED "toc_loc"
        xlink:title CDATA #FIXED "Table of Contents" >
    <!CONSTRAINT (toc_section:href(section.toc_loc:href) = /toc_loc:href) >
    <!CONSTRAINT (toc_section.toc_description.toc_number =
        toc_section:href(section.toc_description.toc_number)) >
    <!CONSTRAINT (ASCENDING(toc_section.toc_description.toc_number)) >

    <!ELEMENT section (toc_description, section_data, toc_loc, prev_loc?, next_loc?) >
    <!ELEMENT section_data (#PCDATA)>
    <!ELEMENT prev_loc EMPTY>
    <!ATTLIST prev_loc
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED
        xlink:role CDATA #FIXED "prev_loc"
        xlink:title CDATA #REQUIRED >
    <!ELEMENT next_loc EMPTY>
    <!ATTLIST next_loc
        xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
        xlink:type (locator) #FIXED "locator"
        xlink:href CDATA #REQUIRED
        xlink:role CDATA #FIXED "next_loc"
        xlink:title CDATA #REQUIRED >
    <!CONSTRAINT ((section.prev_loc:href(section.toc_description.toc_number) + 1) =
        section.toc_description.toc_number) >
    <!CONSTRAINT ((section.next_loc:href(section.toc_description.toc_number) - 1) =
        section.toc_description.toc_number)>

The reader may refer to an XML instance of the above DTD in Appendix II.

A toc document (Table Of Contents document) is composed of multiple toc_section elements, as well as a pointer to itself. Each toc_section contains a link to a corresponding webpage with the expected text describing a particular facet of the undergraduate program. Each toc_section also contains a toc_description, which has a number and a section title.

A section element contains a toc_description whose data matches the corresponding toc_description in the table of contents webpage. section_data is the text of the section. toc_loc, prev_loc, and next_loc are the links to the table of contents page, the previous section, and the next section, respectively.

The correct ordering of the sections is checked using multiple constraints. Each toc_section is checked to make certain that its corresponding section correctly points to the table of contents. Each toc_section also has its toc_number compared against the corresponding section’s toc_number to make certain they are equal. Within each section, the prev_loc and next_loc links are checked using their toc_number values.
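The effect of these constraints can be sketched over a small in-memory model in Python. This is an illustration, not the CobWeb checker: the PAGES dictionary, its keys, the file names, and the function brochure_errors are all hypothetical stand-ins for fetching and parsing the linked pages.

```python
# Hypothetical in-memory model of a brochure: each key stands for an href.
PAGES = {
    "toc.xml": {"sections": [(1, "s1.xml"), (2, "s2.xml"), (3, "s3.xml")]},
    "s1.xml": {"number": 1, "toc": "toc.xml", "prev": None, "next": "s2.xml"},
    "s2.xml": {"number": 2, "toc": "toc.xml", "prev": "s1.xml", "next": "s3.xml"},
    "s3.xml": {"number": 3, "toc": "toc.xml", "prev": "s2.xml", "next": None},
}

def brochure_errors(pages, toc_href):
    """Return a list of violated constraints (empty if the brochure is consistent)."""
    errors = []
    entries = pages[toc_href]["sections"]
    numbers = [number for number, _ in entries]
    # ASCENDING(toc_section.toc_description.toc_number)
    if any(a > b for a, b in zip(numbers, numbers[1:])):
        errors.append("toc numbers are not ascending")
    for number, href in entries:
        section = pages[href]
        # The section must link back to the table of contents.
        if section["toc"] != toc_href:
            errors.append(f"{href}: toc_loc does not point back to {toc_href}")
        # The number in the table of contents must match the section's own number.
        if section["number"] != number:
            errors.append(f"{href}: toc_number differs from the table of contents")
        prev, nxt = section["prev"], section["next"]
        # prev_loc target number + 1 = this number; next_loc target number - 1 = this number.
        if prev is not None and pages[prev]["number"] + 1 != section["number"]:
            errors.append(f"{href}: prev_loc is not the preceding section")
        if nxt is not None and pages[nxt]["number"] - 1 != section["number"]:
            errors.append(f"{href}: next_loc is not the following section")
    return errors

if __name__ == "__main__":
    print(brochure_errors(PAGES, "toc.xml"))
```

Returning a list of messages rather than a single Boolean lets an authoring tool report every broken link in one pass.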
This combination of constraints ensures that the proper structure of the brochure is maintained. Only four constraints are added, which comprise approximately 25% of the code, yet the effects of the constraints are powerful: without them, a separate and unique application is needed to check this structure.

5.2 Product Comparison Example

In this example, data from other websites is collected, analyzed, and displayed for a user to browse. Unlike the online brochure, in which constraints are used to maintain the structure of the web site, we now use constraints as a way of collecting and modifying data. An application is written that searches popular e-commerce sites (in this case, Barnes & Noble and Amazon.com) for product prices and shipping costs. The results are mapped into an XML page, which adds up the total product plus shipping costs for each site, displays each site’s relevant information, and then displays the site with the lowest cost.

This example makes use of two systems. The first is the information-gathering application that collects the product information from each site. This application may or may not get information in the correct XML format. It transforms that data into a syntactically correct XML tree (in memory) and presents it to the XML parser. The parser is the second system, where the constraint checking occurs. It is through the constraint checking that the parser adds product and shipping costs, and determines the lowest price.

The comparator.dtd file specifies how site and product information must be presented in order for the parser to correctly compare sites. The comparator element contains a list of sites as well as a separate best_buy element, which holds the site with the lowest price. A site has a name ("Amazon.com"), a list of products, shipping costs, a total price which combines the products and shipping costs, and a link to the store’s website.
The products element may contain books, CDs, and DVDs, each with a title and price, and either a catalog number (for CDs and DVDs) or an ISBN (for books), as well as shipping costs. Constraints are added to each product: the title, catalog number or ISBN, and price of each product are all derived from each store's website. The total_price element then uses this information to add up the prices and the shipping costs. Finally, a constraint is placed on the best_buy element, which says that its site.total_price must be equal to the minimum of the total prices of all sites.

The basic comparing mechanism described above can easily be extended to support more complicated analyses. Information regarding when the product will ship ("shipped by" or "in stock") could be collected and used to find the best choice in those cases where the best choice depends upon shipping dates.

The DTD for the comparator looks like this:

<!-- add links to site and add constraints -->
<?xml encoding="US-ASCII"?>
<!ELEMENT comparator (site*, best_buy)>
<!ELEMENT site (name, products, shipping, total_price)>
<!ATTLIST site
    xmlns:xlink  CDATA      #FIXED "http://www.w3.org/XML/XLink/0.9"
    xlink:type   (locator)  #FIXED
    xlink:href   CDATA      #REQUIRED
    xlink:role   CDATA      #FIXED "data location"
    xlink:title  CDATA      #FIXED "Click to visit the site"
>
<!ELEMENT name (#PCDATA)>
<!ELEMENT products (book*, cd*, dvd*)>
<!ELEMENT book (title, ISBN, price, shipping?)>
<!ELEMENT cd (title, catalog_number, price, shipping?)>
<!ELEMENT dvd (title, catalog_number, price, shipping?)>
<!ELEMENT title (#PCDATA)>
<!CONSTRAINT (title = comparator.site:href(product.title))>
<!ELEMENT ISBN (#PCDATA)>
<!CONSTRAINT (ISBN = comparator.site:href(product.ISBN))>
<!ELEMENT price (#REAL)>
<!CONSTRAINT (price = comparator.site:href(product.price))>
<!ELEMENT catalog_number (#PCDATA)>
<!ELEMENT shipping (#REAL)>
<!CONSTRAINT (shipping = comparator.site:href(shipping.ground))>
<!ELEMENT total_price (#REAL)>
<!CONSTRAINT (SUM(comparator.site.products.*.price)
    + SUM(comparator.site.products.*.shipping)
    + comparator.site.shipping)
>
<!ELEMENT best_buy (site)>
<!CONSTRAINT (best_buy.site.total_price = MIN(comparator.site.total_price)) >

An XML instance of the above DTD is given in Appendix II.

6 Comparison with other Schema Languages

XML Schema and CobWeb

The XML Schema specification describes several similar features. In particular, XML Schema defines data types and promotes the creation of new types by combining different elements [6] [7]. The most important difference between XML Schema and CobWeb is CobWeb's support for link traversal. CobWeb allows developers to define whole websites and to extract data from different web pages in order to check constraints; XML Schema does not include support for this. Nor does XML Schema support the range of constraints that CobWeb does. Also, the CobWeb syntax tends to be more concise and uses the DTD to declare constraints, whereas XML Schema does not use a DTD and instead requires a separate file for its specification (an XSD file).

Below we give sample XML and XSD files taken from the W3C website [8]. Then we define the same XML document using CobWeb. The XML document contains data for a purchase order, in which there are a shipTo and a billTo address, room for comments, and an unlimited number of items. With the exception of a user-defined data type based on regular expressions (Sku) in the XSD file, CobWeb is capable of reproducing the same results as the XSD file. Support for regular expressions to describe strings is planned for a future version of CobWeb.
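As a rough illustration of what such a regular-expression data type would have to check, the following Python sketch tests part numbers against the Sku pattern \d{3}-[A-Z]{2} that appears in the XSD file. It is not part of CobWeb, and the function name is_valid_sku is hypothetical, introduced here only for the example.

```python
import re

# Pattern copied from the Sku simpleType in the XSD file:
# three digits, a hyphen, then two upper-case letters.
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")


def is_valid_sku(part_num):
    """Return True if the whole string matches the Sku pattern."""
    return SKU_PATTERN.fullmatch(part_num) is not None


# The two part numbers used in the sample purchase order.
for part_num in ("872-AA", "926-AA"):
    print(part_num, is_valid_sku(part_num))
```

The check uses fullmatch rather than search so that extra leading or trailing characters are rejected, which is how a schema pattern facet is normally interpreted.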
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <price>148.95</price>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <price>39.98</price>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>

The XSD file below defines the structure and data types of the elements and attributes used in the XML document.

<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">

  <xsd:annotation>
    <xsd:documentation>
      Purchase order schema for Example.com.
      Copyright 2000 Example.com. All rights reserved.
    </xsd:documentation>
  </xsd:annotation>

  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>

  <xsd:complexType name="PurchaseOrderType">
    <xsd:element name="shipTo" type="Address"/>
    <xsd:element name="billTo" type="Address"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>

  <xsd:complexType name="Address">
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
    <xsd:attribute name="country" type="xsd:NMTOKEN" use="fixed" value="US"/>
  </xsd:complexType>

  <xsd:complexType name="Items">
    <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:element name="productName" type="xsd:string"/>
        <xsd:element name="quantity">
          <xsd:simpleType base="xsd:positiveInteger">
            <xsd:maxExclusive value="100"/>
          </xsd:simpleType>
        </xsd:element>
        <xsd:element name="price" type="xsd:decimal"/>
        <xsd:element ref="comment" minOccurs="0"/>
        <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
        <xsd:attribute name="partNum" type="Sku"/>
      </xsd:complexType>
    </xsd:element>
  </xsd:complexType>

  <xsd:simpleType name="Sku" base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:simpleType>

</xsd:schema>

Here is a DTD for the same XML document:

<?xml encoding="US-ASCII"?>
<!ELEMENT purchaseOrder (shipTo, billTo, comment?, items)>
<!ATTLIST purchaseOrder
    orderDate CDATA #REQUIRED
>
<!ELEMENT shipTo (name, street, city, state, zip)>
<!ATTLIST shipTo
    country CDATA #FIXED "US"
>
<!ELEMENT billTo (name, street, city, state, zip)>
<!ATTLIST billTo
    country CDATA #FIXED "US"
>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#INTEGER)>
<!CONSTRAINT (zip ge 10000 AND zip le 99999)>
<!ELEMENT items (item*)>
<!ELEMENT item (productName, quantity, price, comment?, shipDate?)>
<!ATTLIST item
    partNum CDATA #REQUIRED
>
<!ELEMENT productName (#PCDATA)>
<!ELEMENT quantity (#INTEGER)>
<!CONSTRAINT (quantity ge 1 AND quantity lt 100)>
<!ELEMENT price (#REAL)>
<!CONSTRAINT (price ge 0)>
<!ELEMENT comment (#PCDATA)>
<!ELEMENT shipDate (#PCDATA)>

CobWeb and XML Schema: CobWeb places its element definitions and constraints in the Document Type Definition, while XML Schema expresses them in XML instance syntax. Although it may be advantageous to have the element definitions and constraints written in the same syntax as the instance, doing so can become tedious. Since the early days of programming languages, declarations have traditionally been kept separate and given their own notation. All XML Schema data types have been incorporated into CobWeb. We are strongly critical of the open content model, as it creates loopholes in the system.

CobWeb and Schematron

Schematron is a schema language for validating XML using patterns. Its focus is on validating rather than declaring. It provides powerful constraint specification via XPath query patterns (expressed through assert and report) for defining rules and checks. With just a subset of XPath, powerful XSLT style sheets can be created to process very complex XML instances. The basic elements of a Schematron schema are an optional title, zero or more prefixes and namespaces, and several patterns, each containing rules whose context holds assert and report tests. Although Schematron supports constraints to some extent through XPath, it does not support inheritance, whereas CobWeb does. Elements and attributes do not have default values in Schematron.
Constraints have not been explored in depth in Schematron: for instance, CobWeb handles quantified constraints, whereas Schematron can express only simple constraints using the "assert" statement.

CobWeb and XDR

XDR stands for XML-Data Reduced. There is broad recognition that the existing XML DTD is an inadequate or inappropriate language for expressing what many current and anticipated applications of XML need to include in schemas. XML-Data provides an alternative approach that uses XML instance syntax to address these needs.

CobWeb and SOX (version 2.0)

SOX is the Schema for Object-Oriented XML. It is intended for defining the syntactic structure and partial semantics of XML document types. It extends the DTD in an object-oriented way by allowing extensible data types and inheritance among element types. SOX was initially developed to support the development of large-scale, distributed electronic commerce applications but is applicable across the whole range of markup applications. Compared with XML DTDs, SOX dramatically decreases the complexity of supporting interoperation among heterogeneous applications by facilitating software mapping of XML data structures, expressing domain abstractions and common relationships directly and explicitly, enabling reuse at the document design and application programming levels, and supporting the generation of common application components.

CobWeb and DSD (version 1.0)

DSD is Document Structure Description. A DSD document is a specification of a class of XML documents together with a default mechanism and documentation. The goals of DSD were context-dependent descriptions of elements and attributes, flexible default insertion mechanisms, and very good expressive power. It is particularly strong on schema constraints, and it guarantees processing time linear in the size of the application document.
CobWeb and DSD: DSD also has some constraints built into it, but its support is not full-fledged because, like Schematron, it does not support quantified constraints or inheritance. Furthermore, it does not support namespaces.

Table 1: Summary of the feature comparisons.

Features | CobWeb | DTD 1.0 | XML Schema 1.0 | XDR 1.0 | SOX 2.0 | Schematron 1.4 | DSD 1.0
Schema
Syntax in XML | No | No | Yes | Yes | Yes | Yes | Yes
Namespace | Yes | No | Yes | Yes | Yes | Yes | No
Include | Yes | No | Yes | No | Yes | No | Yes
Import | Yes | No | Yes | No | Yes | No | No
Data type
Built-in type | 38 | 10 | 37 | 33 | 17 | 0 | 0
User-defined type | Yes | No | Yes | No | Yes | No | Yes
Domain constraint | Yes | No | Yes | No | Partial | Yes | Yes
Null | No | No | Yes | No | No | No | No
Attribute
Default value | Yes | Yes | Yes | Yes | Yes | No | Yes
Choice | Yes | No | No | No | No | Yes | Yes
Optional vs. required | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Domain constraint | Yes | Partial | Yes | Partial | Partial | Yes | Yes
Conditional definition | Yes | No | No | No | No | Yes | Yes
Element
Default value | Yes | No | Partial | No | No | No | Yes
Content model | Yes | Yes | Yes | Yes | Partial | Yes | Yes
Ordered sequence | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Unordered sequence | Yes | No | Yes | Yes | No | Yes | Yes
Choice | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Min & max occurrence | Yes | Partial | Yes | Yes | Yes | Yes | Partial
Open model | No | No | No | Yes | No | Yes | No
Conditional definition | Yes | No | No | No | No | Yes | Yes
Inheritance
Simple type by extension | Yes | No | No | No | No | No | No
Simple type by restriction | Yes | No | Yes | No | Yes | No | No
Complex type by extension | Yes | No | Yes | No | Yes | No | No
Complex type by restriction | No | No | Yes | No | No | No | No
Being unique or key
Uniqueness for attribute | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Uniqueness for non-attribute | Yes | No | Yes | Partial | No | Yes | No
Key for attribute | Yes | No | Yes | No | No | Yes | No
Key for non-attribute | Yes | No | Yes | No | No | Yes | No
Foreign key for attribute | Yes | Partial | Yes | Partial | Partial | Yes | Yes
Foreign key for non-attribute | Yes | No | Yes | No | No | No | Yes
Miscellaneous
Dynamic constraint | Yes | No | No | No | No | Yes | No
Version | Yes | No | No | No | No | No | Yes
Documentation | Yes | No | Yes | No | Yes | Yes | Yes
Embedded HTML | No | No | Yes | No | Yes | Partial | Yes
Self-describability | No | No | Partial | No | No | Partial | Yes

7 Conclusions and Further Work

In this paper we have discussed the need for a constraint-based extension of XML. Our proposed extension offers several powerful features. Using constrained XML allows developers to expand upon the notion of viewing the web as part of a database. Not only is data checked for validity (i.e., conformance with the grammatical part of the DTD), but it is also checked for semantic integrity (i.e., satisfaction of the constraints). Using constrained XML to author websites will allow us to provide tools that automatically ensure their structural and semantic correctness.

We currently have a prototype implementation of the constraints described in this paper, built to demonstrate their usefulness and to test the appropriateness of the syntax and design [2]. This implementation was developed using the IBM XML4J parser. Several of the examples in the paper have been tested using this implementation, and we have found constraint XML to be useful and easy to use. The implementation does not yet support links and namespaces, and we are in the process of extending it.

Using CobWeb as an authoring language is only one aspect of the language. Several scenarios fall outside the scope of this paper yet could benefit from a constraint-based extension of XML. Many content-driven websites create web pages dynamically as a natural part of their operation. In these cases, using CobWeb to generate solutions to the constraints is preferable to writing a separate application to handle the data. Similarly, business-to-business applications often involve data transactions between multiple servers. CobWeb can be used to further restrict the data that is to be transferred and to help define the data interfaces between companies. CobWeb can be extended to include solutions for these scenarios.

CobWeb makes a distinction between checking and fulfilling constraints.
All of the examples given so far check the validity of the data using the given constraints. However, there are circumstances in which fulfilling constraints becomes useful. In one such circumstance, we emulate the SQL SELECT query. In the following example, we want to know the average grade of all failing students in a given class:

<?xml encoding="US-ASCII"?>
<!ELEMENT class (student+, passing_student*, failing_student*)>
<!ELEMENT student (name, grade)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT grade (#REAL)>
<!ELEMENT failing_student (student)>
<!CONSTRAINT (class.student.grade lt 50)>
<!ELEMENT passing_student (student)>
<!ELEMENT failing_average (#REAL)>
<!CONSTRAINT (AVERAGE(class.failing_student))>

Failing students are chosen from all student elements in the class. Unlike in previous examples, the author of the document does not populate the failing_student tags. Instead, the parser itself takes on this duty. The SQL query that this emulates would look like this:

SELECT student
FROM class
WHERE student.grade < 50

Such a feature would allow us to broaden CobWeb's functionality beyond checking constraints at build-time and to specify whether constraints should be checked or fulfilled at build-time, browse-time, or during querying.

References

[1] Bosak, Jon. XML, Java, and the Future of the Internet. http://www.xml.com/pub/w3j/s3.bosak.html, November 1997. Also published in World Wide Web Journal.
[2] Saradha, K. Design and Implementation of a Parser for a Constraint-Based Extension of XML. Dept. of Computer Science and Engineering, University at Buffalo, January 2001.
[3] Lee, Dongwon, and Wesley W. Chu. Comparative analysis of six schema languages.
[4] Thompson, H. S., D. Beech, M. Maloney, and N. Mendelsohn (eds.). XML Schema Part 1: Structures. W3C, April 2000.
[5] Lee, Dongwon, and Wesley W. Chu. Constraints-preserving Transformation from XML Document Type Definition to Relational Schema.
[6] World Wide Web Consortium. Extensible Markup Language (XML) 1.0. Available at: http://www.w3.org/TR/REC-xml.
[7] World Wide Web Consortium. Namespaces in XML. Available at: http://www.w3.org/TR/RECxmlnames/.
[8] World Wide Web Consortium. XML Linking Language (XLink). Available at: http://www.w3.org/TR/2000/WD-xlink-20000221/.
[9] World Wide Web Consortium. XML Schema Part 1: Structures. Available at: http://www.w3.org/TR/xmlschema-1/.
[10] World Wide Web Consortium. XML Schema Part 2: Datatypes. Available at: http://www.w3.org/TR/xmlschema-2/.
[11] World Wide Web Consortium. XML Schema Part 0: Primer. Available at: http://www.w3.org/TR/xmlschema-0/.
[12] St. Laurent, Simon. XML Elements of Style. New York: McGraw-Hill, 2000.

Appendix I: A list of constraints in XML

The following is a list of constraints and their definitions.

Common numerical and string operators:

+     addition
-     subtraction
*     multiplication
/     division
=     equals
lt    less than
gt    greater than
le    less than or equal to
ge    greater than or equal to

Keywords used for ordering: ASCENDING, DESCENDING

Keywords used for quantification: EVERY, SOME

AVERAGE(element)
    the constrained element must be equal to the average of all occurrences of the element passed in.

CEILING(element)
    the constrained element must be the integer ceiling of the element passed in.

COUNT(element)
    the constrained element must be equal to the number of occurrences of the element that is passed in.

FLOOR(element)
    the constrained element must be the integer floor of the element passed in.

ISINT()
    the value of the constrained element must be an integer.

ISREAL()
    the value of the constrained element must be a real number.

ISTHIS(element)
    the value of the constrained element must equal the value of at least one occurrence of the element passed in.
MAX(element)
    the value of the constrained element must be the maximum value of all instances of the element in the scope of the element.

MIN(element)
    the value of the constrained element must be the minimum value of all instances of the element in the scope of the element.

SUM(element[, ...])
    the constrained element must be equal to the sum of the elements passed in. Note that this may take several forms. Either multiple elements may be passed in, and their sum taken:

    SUM(apples.total, oranges.total, bananas.total)

    or a single element may be passed in, in which case the application must search for all occurrences of the element in the page and add each instance's value:

    SUM(product.total)

Appendix II

Online Brochure Example

<?xml version="1.0"?>
<!DOCTYPE toc SYSTEM "brochure.dtd">
<toc>
  <toc_loc xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/brochure.xml" />
  <toc_section xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/whatiscs.xml" >
    <toc_description>
      <toc_number>1</toc_number>
      <toc_title>
        What is Computer Science? Computer Engineering?
      </toc_title>
    </toc_description>
  </toc_section>
  <toc_section xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/general.xml" >
    <toc_description>
      <toc_number>2</toc_number>
      <toc_title>
        General Information about Undergraduate Programs
      </toc_title>
    </toc_description>
  </toc_section>
  <toc_section xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/admission.xml" >
    <toc_description>
      <toc_number>3</toc_number>
      <toc_title>Admission to the CS Major (B.A., B.S. Degree Programs)</toc_title>
    </toc_description>
  </toc_section>
</toc>

Each section listed within the table of contents has its own XML file. The file for the first section is given below:

<?xml version="1.0"?>
<!DOCTYPE section SYSTEM "brochure.dtd">
<section>
  <toc_description>
    <toc_number>1</toc_number>
    <toc_title>
      What is Computer Science? Computer Engineering?
    </toc_title>
  </toc_description>
  <toc_loc xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/brochure.xml" />
  <next_loc xlink:href="http://www.cse.buffalo.edu/pub/WWW/undergrad/general.xml" />
  <section_data>
    The Department of Computer Science (CS) at the State University of New York at Buffalo, which was established in 1967, became the Department of Computer Science and Engineering (CSE) in 1998...
  </section_data>
</section>

Product Comparison Example

A sample XML file, comparing data from Amazon.com and Barnes & Noble:

<comparator>
  <site xlink:href="http://www.amazon.com/">
    <name>Amazon.com</name>
    <products>
      <book>
        <title>Fountainhead</title>
        <ISBN>0451191153</ISBN>
        <price>7.19</price>
        <shipping>0.99</shipping>
      </book>
      <book>
        <title>Running Linux</title>
        <ISBN>156592469X</ISBN>
        <price>6.99</price>
        <shipping>0.99</shipping>
      </book>
    </products>
    <shipping>3.00</shipping>
    <total_price>19.16</total_price>
  </site>
  <site xlink:href="http://www.barnesandnoble.com/products/search.jsp" >
    <name>Barnes &amp; Noble</name>
    <products>
      <book>
        <title>Fountainhead</title>
        <ISBN>0451191153</ISBN>
        <price>7.19</price>
        <shipping>0.95</shipping>
      </book>
      <book>
        <title>Running Linux</title>
        <ISBN>156592469X</ISBN>
        <price>26.96</price>
        <shipping>0.95</shipping>
      </book>
    </products>
    <shipping>3.00</shipping>
    <total_price>39.05</total_price>
  </site>
  <best_buy>
    <site xlink:href="http://www.amazon.com/">
      <name>Amazon.com</name>
      <products>
        <book>
          <title>Fountainhead</title>
          <ISBN>0451191153</ISBN>
          <price>7.19</price>
          <shipping>0.99</shipping>
        </book>
        <book>
          <title>Running Linux</title>
          <ISBN>156592469X</ISBN>
          <price>6.99</price>
          <shipping>0.99</shipping>
        </book>
      </products>
      <shipping>3.00</shipping>
      <total_price>19.16</total_price>
    </site>
  </best_buy>
</comparator>
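The total_price and best_buy constraints of the comparator DTD can be checked by hand against the sample above. The following Python sketch is not the CobWeb parser; it is only an illustration under stated assumptions. The sample is trimmed to the elements the two constraints use, the xlink attributes are dropped and the ampersand is escaped so that a standard XML parser accepts the text, and the helper names (dec, expected_total, check_comparator) are hypothetical.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: only the elements used by the two constraints.
# Assumption: xlink attributes removed and "&" escaped for a standard parser.
SAMPLE = """
<comparator>
  <site>
    <name>Amazon.com</name>
    <products>
      <book><price>7.19</price><shipping>0.99</shipping></book>
      <book><price>6.99</price><shipping>0.99</shipping></book>
    </products>
    <shipping>3.00</shipping>
    <total_price>19.16</total_price>
  </site>
  <site>
    <name>Barnes &amp; Noble</name>
    <products>
      <book><price>7.19</price><shipping>0.95</shipping></book>
      <book><price>26.96</price><shipping>0.95</shipping></book>
    </products>
    <shipping>3.00</shipping>
    <total_price>39.05</total_price>
  </site>
  <best_buy>
    <site><name>Amazon.com</name><total_price>19.16</total_price></site>
  </best_buy>
</comparator>
"""


def dec(element):
    """Read an element's text as an exact decimal value."""
    return Decimal(element.text.strip())


def expected_total(site):
    """SUM(product prices) + SUM(product shipping) + site shipping."""
    products = list(site.find("products"))
    prices = sum((dec(p.find("price")) for p in products), Decimal(0))
    per_item = sum(
        (dec(p.find("shipping")) for p in products
         if p.find("shipping") is not None),  # shipping is optional per product
        Decimal(0),
    )
    return prices + per_item + dec(site.find("shipping"))


def check_comparator(root):
    """Return a list of violated constraints (empty if all hold)."""
    errors = []
    sites = root.findall("site")  # direct children only, so best_buy/site is excluded
    for site in sites:
        if dec(site.find("total_price")) != expected_total(site):
            errors.append("total_price mismatch for " + site.findtext("name"))
    lowest = min(dec(s.find("total_price")) for s in sites)
    if dec(root.find("best_buy/site/total_price")) != lowest:
        errors.append("best_buy is not the minimum total_price")
    return errors


root = ET.fromstring(SAMPLE)
print(check_comparator(root))  # prints []
```

Decimal is used instead of floating point so that sums such as 7.19 + 6.99 + 0.99 + 0.99 + 3.00 compare exactly with the stated total of 19.16.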