Structured Data Web: A web-like method for sharing and interpreting structured information for bottom-up development

Douglas E. Dyer, PhD
Active Computing
27 May 04

Introduction

Large software systems traditionally have been designed from the top down. While this conventional approach offers good control of the resulting product, it can also be time-consuming and seems to have scalability limitations, perhaps because of the increased communication, coordination, and synchronization required of software development teams. Top-down designs usually require consensus (design by committee) or specification (design by decree), and neither of these methods is known for excellence.

This paper is about a simple alternative: bottom-up development. In bottom-up development, design and implementation are rarely synchronized, and work proceeds incrementally. Typically, a developer of one component aims to integrate information from a third-party application, often without much help from that application’s developer.

When trying to integrate a third-party application, two key difficulties frequently consume a lot of development effort. First, there is the problem of understanding what variables the third-party application has to offer and, deeper, the precise meaning of those that seem interesting. There are several methods for discovering the meaning of third-party variables, including reading documentation, talking with the original developers, getting clues from names used in the source, or using an ontology supplied by the original developers. The second problem is more mundane: how exactly are variables accessed? There are an incredible number of possibilities, thanks to the creativity of software engineers, and the third-party application may support zero, one, or many of them. The same methods used to discover the meaning of variables apply to figuring out how to access them.
Sometimes, the third-party developers are available to create new interfaces to support the integration task. If not, the integration task can be harder than re-implementing the application.

In 1990, Tim Berners-Lee introduced the web technologies: a representation for formatting text (HTML), a server for providing the representation on request, a protocol for clients to interact with the server (HTTP), and a common index called a uniform resource locator (URL). Because these technologies were straightforward and publishing a web page was “cool,” many people were able to contribute content and the web grew exponentially. HTTP has become a dominant protocol, meaning clients require fewer interfaces.

For applications that deal with structured data, an analogous approach should work well. Let’s call this set of technologies the “structured data web.” This paper describes one possible implementation for the structured data web and characterizes its utility for sharing and understanding data when integrating applications from the bottom up.

Implementing the structured data web

There are probably several different ways to design and implement the structured data web. In DARPA’s Active Templates program, we implemented several prototype clients that work using the same basic method. We’ll illustrate using an application for making decisions about business travel and the specific method used in a tool called d3i.

Representation. Variables (e.g., Destination) are represented in the context of their application (BusinessTravel) and in terms of some problem-solving episode. Episodes of business travel generally are identified by the person traveling and the particular trip. Sometimes applications (e.g., BusinessTravel) are designed around people and sometimes they are about objects (e.g., FuelMonitor, an application that alerts a pilot when the flight plan exceeds fuel available).
In any event, the primary requirement is to uniquely identify an episode with some number of attributes. Variable values are represented as text strings [1]. Variables often have metadata associated with them, examples being the source of the value and the time it was set. The application, variable, user (or object), and user’s instance, together with the value and its metadata, are collectively referred to as an information element.

Server. In Active Templates, we found it straightforward to use a relational database as a server (we used MySQL, but any relational database should work). There are many reasons for choosing a relational database as the server for the structured data web, including technical maturity, availability of free or low-cost servers, access control, atomicity of transactions, access efficiency, and simplicity. In our implementation, we created a table “element” used by all applications. Element has the following attributes:

    template (another name for application)
    userIdentity (we used primary email address)
    instance (we used a number and often find the number from the value of other variables)
    element (another name for variable)
    value
    source (Who set the value? Choices: ‘user,’ ‘rule,’ and ‘case,’ the latter two being AI)
    more (used for annotations and more information, normally unstructured)
    time (seconds since 1970, universal time)

All of these attributes are of type text except for time, which is an integer. Text values offer efficient storage but are not length-constrained, and thus are most flexible.

[1] Most data types have a text representation, and text can be cast appropriately by all major programming languages. Since we wanted maximum interoperability and simplicity, text was our choice. Austin Tate has suggested adding “type” as metadata, and I think this is a good idea.

In our implementation, “element” holds only current values for all applications hosted on a particular server and database.
Another table with the same attributes is called “history” and contains all values recorded over time for information elements. A third table, “elementDescription,” is used to describe applications and their variables and to serve applications. These other tables are not yet central to the current discussion, but those who peruse our example server now know the basics [2].

Protocol. The protocol for our implementation of the structured data web is any method of accessing a relational database, including ODBC and the SQL language.

“URL for structured information.” Within a particular server and database, the first four attributes of the element table constitute a URL for variables and metadata: application, user, instance, variable name. More generally, the URL consists of the server, database name [3], application, user, instance, and variable. Like web URLs, our URLs are hierarchical [4] in the order given, and there are requirements for name uniqueness: servers must all have different names (enforced by internet protocols), database names must be distinct for a particular server (but different servers may have databases with the same name), and so on. Because of this hierarchical nature, different applications essentially have separate namespaces, so, for example, no variable name collisions occur between applications.

So, that’s a complete description of our implementation for a structured data web. Applications should publish information elements to be shared by inserting and updating tuples in the element table with SQL. To integrate your application with third-party applications, access their information elements via SQL select statements. Readers should now understand our solution to the mundane problem of access. Now it’s time to think about which variables are available and how to figure out exactly what they mean.
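The publish-and-access cycle described above can be sketched in a few lines. This is a minimal illustration, not the d3i implementation: it uses Python’s built-in sqlite3 module as a stand-in for the MySQL server, the primary key on the four “URL” attributes is an assumption made here so that the table holds only current values, and the helper names publish and read, like the example email address, are hypothetical.

```python
import sqlite3
import time

# In-memory SQLite stands in for the MySQL server described in the paper.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE element (
           template TEXT, userIdentity TEXT, instance TEXT, element TEXT,
           value TEXT, source TEXT, more TEXT, time INTEGER,
           -- Assumption: one current value per (application, user, instance, variable).
           PRIMARY KEY (template, userIdentity, instance, element))"""
)


def publish(conn, template, user, instance, element, value, source="user", more=""):
    """Insert or overwrite the current value of one information element."""
    conn.execute(
        "INSERT OR REPLACE INTO element VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (template, user, instance, element, value, source, more, int(time.time())),
    )
    conn.commit()


def read(conn, template, user, instance, element):
    """Return the current value addressed by the four-part 'URL', or None."""
    row = conn.execute(
        "SELECT value FROM element WHERE template = ? AND userIdentity = ? "
        "AND instance = ? AND element = ?",
        (template, user, instance, element),
    ).fetchone()
    return row[0] if row else None


# One application publishes and then updates a variable; another reads it.
publish(conn, "BusinessTravel", "traveler@example.org", "1", "Destination", "Edinburgh")
publish(conn, "BusinessTravel", "traveler@example.org", "1", "Destination", "Arlington")
current = read(conn, "BusinessTravel", "traveler@example.org", "1", "Destination")
print(current)
```

INSERT OR REPLACE is SQLite syntax. On MySQL the same effect would normally be obtained with REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE, given a suitable unique key.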
[2] Currently, an example of our server is up and running at 66.255.97.12 as the database “d3i”. You may check it out using the user “d3i” and the password “d3iMCCCXX”.
[3] At least in MySQL, a database server can host many different databases.
[4] Actually, this is only approximately true. The instance attribute’s parent is userIdentity, but the variable attribute’s parents include the application and the pair <userIdentity, instance>.

Achieving understanding

When development teams are not necessarily working closely together, it isn’t always easy to talk directly with someone who understands a third party’s application. In this case, developers typically rely on third-party documentation (including an ontology, if available), on experimenting with the application, or on examining the source code. I refer to documentation and ontology definitions as “semantics by declaration” because they are created based on a developer’s statements. If the developer is thorough and systematic, all questions are answered. Being thorough and systematic is difficult, time-consuming, and costly, and it can be hard to predict whether this effort will pay off. Perhaps this is why it’s often necessary to resort to other means for more complete understanding. The structured data web supports and extends these other methods and enables “semantics by example,” a method that requires less effort by developers and augments any available documentation.

Semantics by example is meaning you perceive based not on documents or the names of variables but on the range of values that variables evidently have. For example, suppose you wish to integrate a third-party application called BusinessTravel. If BusinessTravel has a variable called Visiting, you might be confused about its meaning. Does Visiting refer to a person you are meeting? Or a city you are traveling to?
With the structured data web, if the BusinessTravel application is in regular use, there will be many examples of the values Visiting takes on, and these example values empirically help identify the meaning of that variable. Suppose Visiting has values that include:

    Joint Forces Command
    DARPA
    SRI International
    University of Edinburgh

From these values, it should be clear that Visiting refers to an organization, not a person or city (although there are likely relationships with the person visited and the location of the organization). Although the range of current values does not necessarily cover all meanings a developer might have originally intended, semantics by example seems like a powerful method.

Semantics by example also answers these questions: What are all the applications? For each application, what are the variables? The data in the element table provide the current answer for each server and database.

In our implementation, the elementDescription table mentioned earlier is a good place to store any available documentation, including an ontology. Attributes in that table include:

    userIdentity
    template (application)
    element (variable)
    description
    keywords
    XML (this is actually the declarative part of an Active Templates application)
    rules (this is the procedural part of an Active Templates application)

As before, all attributes are of type text. Documentation for an application should be placed in the description attribute in a tuple with the element attribute set to “” (an empty string). An ontology can be included, for example, or a natural language description. Documentation for individual elements (variables) should appear in the description attribute for each element. The other attributes are less important to the current discussion. The userIdentity attribute implies that every user can have different versions of an application. The keywords attribute is intended to support a search engine for structured information.
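The three questions behind semantics by example (which applications exist, which variables each has, and what values a variable takes) each reduce to a select distinct query on the element table. The sketch below is illustrative only: it uses Python’s built-in sqlite3 in place of MySQL, loads the Visiting values quoted above under made-up user identities, and the function names and the Destination row are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element (template TEXT, userIdentity TEXT, instance TEXT, "
    "element TEXT, value TEXT, source TEXT, more TEXT, time INTEGER)"
)
# Sample rows; the user identities and the Destination row are invented.
rows = [
    ("BusinessTravel", "a@example.org", "1", "Visiting", "Joint Forces Command"),
    ("BusinessTravel", "a@example.org", "2", "Visiting", "DARPA"),
    ("BusinessTravel", "b@example.org", "1", "Visiting", "SRI International"),
    ("BusinessTravel", "b@example.org", "2", "Visiting", "University of Edinburgh"),
    ("BusinessTravel", "b@example.org", "2", "Destination", "Edinburgh"),
]
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, 'user', '', 0)", rows)


def applications(conn):
    """What are all the applications on this server and database?"""
    query = "SELECT DISTINCT template FROM element ORDER BY template"
    return [r[0] for r in conn.execute(query)]


def variables(conn, template):
    """For one application, what are the variables?"""
    query = "SELECT DISTINCT element FROM element WHERE template = ? ORDER BY element"
    return [r[0] for r in conn.execute(query, (template,))]


def example_values(conn, template, element, limit=20):
    """A sample of the distinct values one variable currently takes."""
    query = (
        "SELECT DISTINCT value FROM element "
        "WHERE template = ? AND element = ? ORDER BY value LIMIT ?"
    )
    return [r[0] for r in conn.execute(query, (template, element, limit))]


print(applications(conn))
print(variables(conn, "BusinessTravel"))
print(example_values(conn, "BusinessTravel", "Visiting"))
```

A developer reading the last list would likely conclude, as in the text, that Visiting names an organization.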
The remaining two attributes, XML and rules, are used, at least in d3i, to serve applications from the database.

On ontologies built from the bottom up

An ontology for a domain or application is a set of relevant variables and a set of relationships between them. The utility of an ontology is clear understanding during design and integration and, ultimately, machine interpretation of software component interfaces. In top-down design, the goal is to create an ontology that covers the entire system and perhaps potential future extensions of the system [5]. The result is a single ontology for the entire system. In contrast, when integrating applications from the bottom up, each third-party application extends its own ontology, but having Developer A integrate Application B does nothing to extend B’s ontology. In bottom-up development, there is one ontology for each application, but each can be extended as needed. Users dealing with multiple applications do not necessarily have a common ontology that covers them.

On Performance

Our primary concerns for integration are access and understanding, but developers often worry, in addition, about performance because of its impact on usability. The implementation of the structured data web described in this paper has not been fully characterized, but database performance is normally dominated by the number of disk accesses required, and we have found this to be true for large databases. For example, using current hardware and ignoring network delays, a small database returns more than 2000 variables per second, while an un-indexed database of 1 million variables takes more than a second to return a single value [6]. Indexing the database restores much of the performance lost, returning more than 1000 variables per second. Database performance is market-driven. We expect some combination of indexing, distributing the load to multiple servers, and query optimization to provide the performance required for any system contemplated.
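To make the effect of indexing concrete, the sketch below creates the three-column index used in the paper’s tests (template, userIdentity, instance) and asks the database how it plans to answer a single-variable lookup before and after. It is only an illustration: Python’s built-in sqlite3 stands in for MySQL, the index name element_episode is invented here, and no timings are claimed, since those depend on hardware and data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element (template TEXT, userIdentity TEXT, instance TEXT, "
    "element TEXT, value TEXT, source TEXT, more TEXT, time INTEGER)"
)

QUERY = (
    "SELECT value FROM element WHERE template = ? AND userIdentity = ? "
    "AND instance = ? AND element = ?"
)
ARGS = ("BusinessTravel", "traveler@example.org", "1", "Destination")


def plan(conn):
    """Return SQLite's description of how it would execute QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ARGS).fetchall()
    # The last column of each row is the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


before = plan(conn)  # without an index, every tuple must be checked
conn.execute(
    "CREATE INDEX element_episode ON element (template, userIdentity, instance)"
)
after = plan(conn)  # with the index, only the matching episode is examined

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, and MySQL reports plans through its own EXPLAIN statement, but in both cases the first plan should be a full table scan and the second an index search.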
[5] This is a kind of “slippery slope” leading to premature commitment to a large ontological design effort. Predicting future extensions is difficult, but integrating with third-party applications is a likely requirement. If so, then a common ontology is desired that covers all integrated applications. Since you can’t predict which applications will be integrated, you really need a global ontology, right? Argh! You’ll never finish designing such a thing. A much better approach is to define the ontology incrementally, to get the coverage needed for the immediate future.
[6] These tests were conducted by writing 1 instance of 1 million different variables for one application and one user. The number of queries equaled the number of variables. In our tests with MySQL, 1 million variables consumed 100 MB: each tuple required about 100 bytes, roughly the amount of text in the attribute values. The text type is space efficient. An index on template, userIdentity, and instance was used to test performance improvement. Creating an index removes the need to check every tuple against the criteria in the where clause of a select statement. The insert speed drops somewhat for indexed tables.

On Messaging, Services, and Alerts

Often, developers desire synchronous message-passing between applications, perhaps to support a service such as remote procedure call or web services. Message passing begins with a client sending a message via a channel and waiting. If the server receives the message and does not fail, then it replies with another message; otherwise no reply is sent, in which case the client may hang or time out. The structured data web implementation supports message passing in the following way. A client writes a variable to the database and begins polling for another variable, the server’s reply. The server, continually polling the database for “request” variables, notices the client’s request and responds by inserting or updating its reply variable.
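The request/reply exchange just described can be sketched as two polling functions over the element table. Everything specific here is an assumption made for illustration: the variable names request and reply, the toy “Echo” service that upper-cases its input, the explicit now arguments (used instead of the system clock so the example is deterministic), and Python’s built-in sqlite3 standing in for MySQL. The timestamp comparison is one plausible way to tell a fresh reply from one left over from an earlier episode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element (template TEXT, userIdentity TEXT, instance TEXT, "
    "element TEXT, value TEXT, source TEXT, more TEXT, time INTEGER, "
    "PRIMARY KEY (template, userIdentity, instance, element))"
)


def write(conn, key, element, value, now):
    """Set the current value of one variable; key is (template, user, instance)."""
    conn.execute(
        "INSERT OR REPLACE INTO element VALUES (?, ?, ?, ?, ?, 'user', '', ?)",
        (*key, element, value, now),
    )


def read(conn, key, element):
    """Return (value, time) for one variable, or None if it is not set."""
    return conn.execute(
        "SELECT value, time FROM element WHERE template = ? AND userIdentity = ? "
        "AND instance = ? AND element = ?",
        (*key, element),
    ).fetchone()


def server_poll(conn, key, now):
    """One polling pass of the service: answer a request that has no newer reply."""
    request = read(conn, key, "request")
    reply = read(conn, key, "reply")
    # Assumes a new request always carries a later timestamp than the last reply.
    if request and (reply is None or reply[1] < request[1]):
        write(conn, key, "reply", request[0].upper(), now)


def client_poll(conn, key, sent_at):
    """One polling pass of the client: accept only a reply no older than the request."""
    reply = read(conn, key, "reply")
    return reply[0] if reply and reply[1] >= sent_at else None


key = ("Echo", "client@example.org", "1")
write(conn, key, "request", "hello", now=100)
early = client_poll(conn, key, sent_at=100)   # server has not run yet
server_poll(conn, key, now=101)
first = client_poll(conn, key, sent_at=100)   # reply to the first request
write(conn, key, "request", "again", now=200)
stale = client_poll(conn, key, sent_at=200)   # old reply is ignored
server_poll(conn, key, now=201)
second = client_poll(conn, key, sent_at=200)  # reply to the second request
print(early, first, stale, second)
```

In practice both sides would call their polling function in a loop with a short sleep, and the client would give up after a timeout, which corresponds to the “hang or time out” case above.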
Timestamps are used to determine whether a value applies to a previous service episode.

Interestingly, with a structured data web, services may be set up to watch for certain state conditions, which may be arbitrarily complex and involve an arbitrary number of applications. When specified state conditions are met, the service may take a variety of actions including processing, writing variables, or sending messages. Such services are called alerts or sentinels.

On the equivalency with other representations

The structured data web represents at least <object><attribute><value> triples (OAV) [7]. Since the goal is integration, the objects of interest are applications, and the attributes are instances of variables. OAV is the same representation used in many other language systems including objects in object-oriented programming languages, the W3C’s resource description framework (RDF), and the web ontology language (OWL) intended to implement the semantic web. In the discussion to this point, relationships have not been explicitly treated, but the OAV framework can also be used to represent relationships.

[7] To index a value V in OAV representations, two items are used, namely O and A, the object and attribute. In our implementation of the structured data web, to index values we use four items, namely application, user, userInstance, and variable. The extra items are used to represent the problem-solving episode or instance associated with the value. We tend to think of variables in both class and instance terms. Example: phone number is a variable. But whose phone number? And when is that phone number valid for that person? Many examples of OAV representations are about static information, but because we were interested in dynamic information, we chose to use more items for indexing values.

Resources on the web

A PowerPoint presentation that explores some related issues in the context of military command and control is available at http://www.activecomputing.org/Why-C2-hasstagnated.pps

If you would like to explore our current implementation of the structured data web, download and install d3i from http://www.activecomputing.org/d3i-AcT-1.0-Setup.exe. You may execute arbitrary SQL statements using the console available from the Development menu. SQL statements may be called as string arguments to the sql procedure. Example:

    % sql "select distinct template from element"
    TerroristAttack BigCallForFire Travel EgressPlan TaskOrganization
    TravelSupport MeetingRequest CAST-lite texttest travel-scheduler
    WeatherDatabase CoffeeInfo Taxi4 CheckPointThreat AirfieldSeizure
    NewsItem EditInformationElement CCIR EmailAddressHKEY
    IndividualEquipment TaskForceStatus TravelPreferencesViewer
    TravelPreferences TravelVoucher ForceDescription PersonnelInformation
    MeetingInvitationResponse MetaMail MemberPayStatus CalorieCounter
    IDcardData WebScraperMaker CharacterCertification MyProxies
    TravelBudget

Summary

For large-scale software integration with less development effort, we have proposed the idea of a structured version of the web that can store and serve variable values for different applications and specific problem-solving episodes. This paper has described a simple implementation for the structured data web based on essentially a single table in a relational database and an index similar to URLs. A key advantage of the structured data web is the possibility of semantics by example, and we think this is important for bottom-up development. Comments are solicited.