Structured Data Web:
A web-like method for sharing and interpreting
structured information for bottom-up development
Douglas E. Dyer, PhD
Active Computing
27 May 04
Introduction
Large software systems traditionally have been designed from the top down. While this
conventional approach offers good control of the resulting product, it can also be
time-consuming and seems to have scalability limitations, perhaps because of increased
communication, coordination, and synchronization required by software development
teams. Top-down designs usually require consensus (design by committee) or
specification (design by decree), and neither of these methods is known for excellence.
This paper is about a simple alternative: bottom-up development. In bottom-up
development, design and implementation are rarely synchronized, and work proceeds
incrementally. Typically, a developer of one component aims to integrate information
from a third-party application, often without much help from that application’s developer.
When trying to integrate a third-party application, two key difficulties frequently
consume a lot of development effort. First, there is the problem of understanding what
variables the third-party application has to offer and, deeper, the precise meaning of those
that seem interesting. There are several methods for discovering the meaning of
third-party variables, including reading documentation, talking with the original developers,
getting clues from names used in the source, or using an ontology supplied by the original
developers.
The second problem is more mundane: How exactly are variables accessed? There are
an incredible number of possibilities, thanks to the creativity of software engineers, and
the third-party application may support zero, one, or many of them. The same methods
used to discover the meaning of variables apply to figuring out how to access them.
Sometimes, the third-party developers are available to create new interfaces to support
the integration task. If not, the integration task can be harder than re-implementing the
application.
In 1990, Tim Berners-Lee introduced the web technologies: a representation for
formatting text (HTML), a server for providing the representation on request, a protocol
for clients to interact with the server, and a common index called a uniform resource
locator (URL). Because these technologies were straightforward and publishing a web
page was “cool,” many people were able to contribute content and the web grew
exponentially. HTTP has become a dominant protocol, meaning clients require fewer
interfaces. For applications that deal with structured data, an analogous approach should
work well. Let’s call this set of technologies the “structured data web.” This paper
describes one possible implementation for the structured data web and characterizes its
utility for sharing and understanding data when integrating applications from the bottom
up.
Implementing the structured data web
There are probably several different ways to design and implement the structured data
web. In DARPA’s Active Templates program, we implemented several prototype clients
that work using the same basic method. We’ll illustrate using an application for making
decisions about business travel, and the specific method used in a tool called d3i.
Representation. Variables (e.g., Destination) are represented in the context of their
application (BusinessTravel), and in terms of some problem-solving episode. Episodes
of business travel generally are identified by the person traveling and the particular trip.
Sometimes applications (e.g., BusinessTravel) are designed around people and sometimes
they are about objects (e.g., FuelMonitor, an application that alerts a pilot when the flight
plan exceeds fuel available). In any event, the primary requirement is to uniquely
identify an episode with some number of attributes. Variable values are represented as
text strings [1]. Variables often have metadata associated with them, examples being the
source of the value and the time it was set. The application, variable, user (or object),
and user’s instance, together with the value and its metadata, are collectively referred to
as an information element.
Server. In Active Templates, we found it straightforward to use a relational database as
a server (we used MySQL, but any database will work). There are many reasons for
choosing a relational database as the server for the structured data web including
technical maturity, availability of free or low-cost servers, access control, atomicity of
transactions, access efficiency, and simplicity. In our implementation, we created a table
“element” used by all applications. Element has the following attributes:
template (another name for application)
userIdentity (we used primary email address)
instance (we used a number and often find the number from the value of other variables)
element (another name for variable)
value
source (Who set the value? Choices: ‘user,’ ‘rule,’ and ‘case,’ the latter two being AI)
more (used for annotations and more information, normally unstructured)
time (seconds since 1970, universal time)
All of these attributes are of type text except for time, which is an integer. Text values
offer efficient storage but are not length-constrained, and thus are most flexible.
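As a minimal sketch of the table just described, the following creates “element” and stores one information element. SQLite stands in for MySQL so that the example runs without a server; that substitution, the SQL column types, and the sample values (including the email address) are assumptions and are not taken from d3i.

```python
import sqlite3
import time

# SQLite stands in for the MySQL server described in the text (an assumption
# made so the sketch runs without a database server).
conn = sqlite3.connect(":memory:")

# Column names follow the attribute list in the text: everything is text
# except time, which is an integer.
ELEMENT_COLUMNS = """
    template     TEXT,     -- application name
    userIdentity TEXT,     -- e.g. primary email address
    instance     TEXT,     -- identifies the episode for this user
    element      TEXT,     -- variable name
    value        TEXT,
    source       TEXT,     -- 'user', 'rule', or 'case'
    more         TEXT,     -- free-form annotations
    time         INTEGER   -- seconds since 1970, universal time
"""
conn.execute(f"CREATE TABLE element ({ELEMENT_COLUMNS})")

# One information element: the Destination variable of one BusinessTravel
# episode. The user and value are made-up sample data.
conn.execute(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("BusinessTravel", "traveler@example.org", "1", "Destination",
     "Edinburgh", "user", "", int(time.time())),
)
row = conn.execute("SELECT element, value, source FROM element").fetchone()
print(row)
```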
[1] Most data types have a text representation, and text can be cast appropriately by all major programming
languages. Since we wanted maximum interoperability and simplicity, text was our choice. Austin Tate
has suggested adding “type” as metadata, and I think this is a good idea.
In our implementation, “element” holds only current values for all applications which are
hosted on a particular server and database. Another table with the same attributes is
called “history” and contains all values recorded over time for information elements. A
third table, “elementDescription,” is used to describe applications and their variables and
to serve applications. These other tables are not central to the current discussion, but
those who peruse our example server now know the basics [2].
Protocol. The protocol for our implementation of the structured data web is any method
of accessing a relational database including ODBC and the SQL language.
“URL for structured information.” Within a particular server and database, the first
four attributes of the element table constitute a URL for variables and metadata:
application, user, instance, variable name. More generally, the URL consists of the
server, database name [3], application, user, instance, and variable. Like web URLs, our
URLs are hierarchical [4] in the order given, and each level must be unique within its parent:
servers must all have different names (enforced by internet protocols),
database names must be distinct on a particular server (but different servers may have
databases with the same name), and so on. Because of this hierarchy, each application
effectively has its own namespace, so variable names never collide between applications.
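The six-part address can be pictured as an ordered tuple. The sketch below is only an illustration: the class name ElementAddress, the method local_key, and the server name are hypothetical, and the paper does not define a string syntax for these URLs, so none is invented here.

```python
from typing import NamedTuple


class ElementAddress(NamedTuple):
    """Hypothetical container for the six URL parts, in hierarchical order."""
    server: str
    database: str
    application: str
    user: str
    instance: str
    variable: str

    def local_key(self):
        # Within one server and database, the last four parts identify a value.
        return tuple(self[2:])


a = ElementAddress("db.example.org", "d3i", "BusinessTravel",
                   "traveler@example.org", "1", "Destination")
# Same variable name under a different application: a distinct address.
b = a._replace(application="TravelVoucher")

values = {a: "Edinburgh", b: "Arlington"}
print(len(values), a.local_key())
```

Because the application is part of the address, the two Destination variables occupy separate entries, which is the namespace property described above.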
So, that’s a complete description of our implementation for a structured data web.
Applications should publish information elements to be shared by using SQL insert and
update statements on the element table. To integrate your application with third-party
applications, access their information elements via SQL select statements.
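Publishing and reading might look like the sketch below. The helper names publish and read are hypothetical, SQLite again stands in for MySQL, and appending every write to the history table is an assumption about how that table is maintained; the paper says only that history holds all values recorded over time.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cols = ("template TEXT, userIdentity TEXT, instance TEXT, element TEXT, "
        "value TEXT, source TEXT, more TEXT, time INTEGER")
conn.execute(f"CREATE TABLE element ({cols})")
conn.execute(f"CREATE TABLE history ({cols})")

KEY = "template = ? AND userIdentity = ? AND instance = ? AND element = ?"


def publish(conn, template, user, instance, element, value,
            source="user", more=""):
    """Set the current value of one information element and log the change."""
    key = (template, user, instance, element)
    now = int(time.time())
    with conn:  # one transaction covers both tables
        cur = conn.execute(
            f"UPDATE element SET value = ?, source = ?, more = ?, time = ? "
            f"WHERE {KEY}",
            (value, source, more, now) + key,
        )
        if cur.rowcount == 0:  # no current value yet, so insert one
            conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                         key + (value, source, more, now))
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                     key + (value, source, more, now))


def read(conn, template, user, instance, element):
    """Return the current value of one information element, or None."""
    row = conn.execute(f"SELECT value FROM element WHERE {KEY}",
                       (template, user, instance, element)).fetchone()
    return row[0] if row else None


trip = ("BusinessTravel", "traveler@example.org", "1", "Destination")
publish(conn, *trip, "Edinburgh")
publish(conn, *trip, "Arlington")
print(read(conn, *trip))
```

After the two calls, element holds one current row for the variable and history holds both recorded values.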
Readers should now understand our solution to the mundane problem of access. Now it’s
time to think about which variables are available and how to figure out exactly what they
mean.
Achieving understanding
When development teams are not necessarily working closely together, it isn’t always
easy to talk directly with someone who understands a third party’s application. In this
case, developers typically rely on third-party documentation (including an ontology, if
available), experiment with the application, or examine the source code. I
refer to documentation and ontology definitions as “semantics by declaration” because
they are created based on a developer’s statements. If the developer is thorough and
systematic, all questions are answered. Being thorough and systematic is difficult,
time-consuming, and costly, and it can be hard to predict whether this effort will pay off.
[2] Currently, an example of our server is up and running at 66.255.97.12 as the database “d3i”. You may
check it out using the user “d3i” and the password “d3iMCCCXX”.
[3] At least in MySQL, a database server can host many different databases.
[4] Actually, this is only approximately true. The instance attribute’s parent is userIdentity, but the variable
attribute’s parents include the application and the pair <userIdentity, instance>.
Perhaps this is why it’s often necessary to resort to other means for more complete
understanding.
The structured data web supports and extends these other methods and enables
“semantics by example,” a method that requires less effort by developers and augments
any available documentation. Semantics by example is meaning you perceive based not
on documents or the names of variables but on the range of values that variables
evidently have. For example, suppose you wish to integrate a third-party application
called BusinessTravel. If BusinessTravel has a variable called Visiting, you might be
confused about its meaning. Does Visiting refer to a person you are meeting? Or a city
you are traveling to? With the structured data web, if the BusinessTravel application is
in regular use, there will be many examples of the values Visiting takes on, and these
example values empirically help identify the meaning of that variable. Suppose Visiting
has values that include:
Joint Forces Command
DARPA
SRI International
University of Edinburgh
From these values, it should be clear that Visiting refers to an organization, not a person
or city (although there are likely relationships with the person visited and the location of
the organization). Although the range of current values does not necessarily cover all
meanings a developer might have originally intended, semantics by example seems like a
powerful method. Semantics by example also answers these questions: What are all the
applications? For each application, what are the variables? The data in the element table
provide the current answer for each server and database.
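The three questions above map onto three small queries. In this sketch the function names are hypothetical, SQLite stands in for MySQL, and the rows are made-up sample data built from the Visiting values listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element (template TEXT, userIdentity TEXT, "
             "instance TEXT, element TEXT, value TEXT, source TEXT, "
             "more TEXT, time INTEGER)")
samples = [  # (user, instance, value); users are invented for illustration
    ("alice@example.org", "1", "Joint Forces Command"),
    ("alice@example.org", "2", "DARPA"),
    ("bob@example.org", "1", "DARPA"),
    ("bob@example.org", "2", "SRI International"),
    ("carol@example.org", "1", "University of Edinburgh"),
]
conn.executemany(
    "INSERT INTO element VALUES "
    "('BusinessTravel', ?, ?, 'Visiting', ?, 'user', '', 0)", samples)


def applications(conn):
    """What are all the applications on this server and database?"""
    return [r[0] for r in conn.execute(
        "SELECT DISTINCT template FROM element ORDER BY template")]


def variables(conn, template):
    """For one application, what are the variables?"""
    return [r[0] for r in conn.execute(
        "SELECT DISTINCT element FROM element WHERE template = ? "
        "ORDER BY element", (template,))]


def example_values(conn, template, element, limit=20):
    """Values a variable takes on across episodes, most frequent first."""
    return conn.execute(
        "SELECT value, COUNT(*) AS n FROM element "
        "WHERE template = ? AND element = ? "
        "GROUP BY value ORDER BY n DESC, value LIMIT ?",
        (template, element, limit)).fetchall()


print(applications(conn))
print(variables(conn, "BusinessTravel"))
for value, n in example_values(conn, "BusinessTravel", "Visiting"):
    print(n, value)
```

Counting how often each value occurs is an addition for convenience; the paper relies only on the range of values being visible.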
In our implementation, the elementDescription table mentioned earlier is a good place to
store any available documentation including an ontology. Attributes in that table include:
userIdentity
template (application)
element (variable)
description
keywords
XML (This is actually the declarative part of an Active Templates application)
rules (This is the procedural part of an Active Templates application)
As before, all attributes are of type text. Documentation for an application should be
placed in the description attribute in a tuple with the element attribute set to “” (an
empty string). An ontology can be included, for example, or a natural language
description. Documentation for individual elements (variables) should appear in the
description attribute for each element. The other attributes are less important to the
current discussion. The userIdentity attribute implies that every user can have different
versions of an application. The keywords attribute is intended to support a search
engine for structured information. The other two attributes, XML and rules, are used, at least in d3i, to
serve applications from the database.
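The documentation convention can be sketched as follows. The describe helper is hypothetical, the description and keyword texts are invented samples, and the sketch ignores per-user versions of an application for brevity. SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elementDescription (userIdentity TEXT, "
             "template TEXT, element TEXT, description TEXT, keywords TEXT, "
             "XML TEXT, rules TEXT)")
docs = [
    # element = "" marks the documentation for the application as a whole
    ("author@example.org", "BusinessTravel", "",
     "Supports decisions about business trips.", "travel trip"),
    ("author@example.org", "BusinessTravel", "Visiting",
     "The organization being visited.", "organization host"),
]
conn.executemany(
    "INSERT INTO elementDescription VALUES (?, ?, ?, ?, ?, '', '')", docs)


def describe(conn, template, element=""):
    """Return the description of an application, or of one of its variables."""
    row = conn.execute(
        "SELECT description FROM elementDescription "
        "WHERE template = ? AND element = ?", (template, element)).fetchone()
    return row[0] if row else None


print(describe(conn, "BusinessTravel"))
print(describe(conn, "BusinessTravel", "Visiting"))
```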
On ontologies built from the bottom-up
An ontology for a domain or application is a set of relevant variables and a set of
relationships between them. The utility of an ontology is clear understanding during
design and integration, and ultimately, machine interpretation of software component
interfaces. In top-down design, the goal is to create an ontology that covers the entire
system and perhaps potential future extensions of the system [5]. The result is a single
ontology for the entire system. In contrast, when integrating applications from the
bottom up, each third-party application extends its own ontology, but having Developer
A integrate Application B does nothing to extend B’s ontology. In bottom-up
development, there is one ontology for each application, but each can be extended as
needed. Users dealing with multiple applications do not necessarily have a common
ontology that covers them.
On Performance
Our primary concerns, for integration, are access and understanding, but developers often
worry, in addition, about performance because of its impact on usability. The
implementation of the structured data web described in this paper has not been fully
characterized, but database performance is normally dominated by the number of disk
accesses required, and we have found this to be true for large databases. For example,
using current hardware and ignoring network delays, a small database returns more than
2000 variables per second, while an un-indexed database of 1 million variables takes
more than a second to return a single value [6]. Indexing the database restores much of the
performance lost, returning more than 1000 variables per second. Database performance
is market-driven. We expect some combination of indexing, distributing the load to
multiple servers, and query optimization to provide the performance required for any
system contemplated.
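The effect of indexing can be seen in a query plan without timing anything. This sketch uses SQLite rather than MySQL, a hypothetical application named Load, and a much smaller table than the one in our tests, so it shows only the change from a full scan to an index search and is not a reproduction of the figures above. The index covers template, userIdentity, and instance; the index name episode_idx is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element (template TEXT, userIdentity TEXT, "
             "instance TEXT, element TEXT, value TEXT, source TEXT, "
             "more TEXT, time INTEGER)")
# 100 episodes with 100 variables each (made-up load data).
conn.executemany(
    "INSERT INTO element VALUES ('Load', 'u@example.org', ?, ?, ?, 'user', '', 0)",
    ((str(i), f"var{j}", f"{i}:{j}") for i in range(100) for j in range(100)),
)

QUERY = ("SELECT value FROM element WHERE template = ? AND userIdentity = ? "
         "AND instance = ? AND element = ?")
ARGS = ("Load", "u@example.org", "42", "var7")


def plan(conn):
    """Return SQLite's description of how it would run QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ARGS).fetchall()
    return " ".join(r[-1] for r in rows)  # the last column is the detail text


before = plan(conn)
conn.execute(
    "CREATE INDEX episode_idx ON element (template, userIdentity, instance)")
after = plan(conn)
result = conn.execute(QUERY, ARGS).fetchone()[0]

print("without index:", before)
print("with index:   ", after)
```

Without the index the plan is a scan of every tuple; with it, the database searches the index for the episode and checks only that episode's rows.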
[5] This is a kind of “slippery slope” leading to premature commitment to a large ontological design effort.
Predicting future extensions is difficult, but integrating with third-party applications is a likely requirement.
If so, then a common ontology is desired that covers all integrated applications. Since you can’t predict
which applications will be integrated, you really need a global ontology, right? Argh! You’ll never finish
designing such a thing. A much better approach is to define the ontology incrementally, to get the coverage
needed for the immediate future.
[6] These tests were conducted by writing 1 instance of 1 million different variables for one application and
one user. The number of queries equaled the number of variables. In our tests with MySQL, 1 million
variables consumed 100 MB: each tuple required about 100 bytes, roughly the amount of text in the
attribute values. The text type is space efficient. An index on template, userIdentity, and instance was used
to test performance improvement. Creating an index removes the need to check each
tuple against the criteria in the where clause of a select statement. The insert speed drops somewhat for indexed tables.
On Messaging, Services, and Alerts
Often, developers desire synchronous message-passing between applications, perhaps to
support a service such as remote procedure call or web services. Message passing begins
with a client sending a message via a channel and waiting. If the server receives the
message and does not fail, then it replies with another message, otherwise no reply is
sent, in which case the client may hang or time out. The structured data web
implementation supports message passing in the following way. A client writes a
variable to the database and begins polling for another variable, the server’s reply. The
server, continually polling the database for “request” variables, notices the client’s
request, and responds by inserting or updating its reply variable. Timestamps are used to
determine whether a value applies to a previous service episode. Interestingly, with a
structured data web, services may be set up to watch for certain state conditions which
may be arbitrarily complex and involve an arbitrary number of applications. When
specified state conditions are met, the service may take a variety of actions including
processing, writing variables, or sending messages. Such services are called alerts or
sentinels.
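The request and reply exchange can be sketched in a single process. Everything named here is an assumption made for illustration: the variable names request and reply, the application Upper, and the helper functions. Timestamps are passed in explicitly so the run is deterministic; a live system using one-second timestamps would need finer resolution or a request identifier to separate episodes that fall within the same second. SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element (template TEXT, userIdentity TEXT, "
             "instance TEXT, element TEXT, value TEXT, source TEXT, "
             "more TEXT, time INTEGER)")
KEY = "template = ? AND userIdentity = ? AND instance = ? AND element = ?"


def write(conn, key, value, now):
    """Replace the current value of the variable addressed by key."""
    with conn:
        conn.execute(f"DELETE FROM element WHERE {KEY}", key)
        conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, 'user', '', ?)",
                     key + (value, now))


def read(conn, key):
    """Return (value, time) for the variable addressed by key, or None."""
    return conn.execute(f"SELECT value, time FROM element WHERE {KEY}",
                        key).fetchone()


def serve_once(conn, handler, now):
    """One polling pass by the server: answer requests lacking a current reply."""
    pending = conn.execute(
        "SELECT template, userIdentity, instance, value, time "
        "FROM element WHERE element = 'request'").fetchall()
    for template, user, instance, value, asked in pending:
        reply = read(conn, (template, user, instance, "reply"))
        if reply is None or reply[1] < asked:  # a reply to an earlier episode
            write(conn, (template, user, instance, "reply"), handler(value), now)


def poll_reply(conn, episode, asked):
    """One polling pass by the client: return the reply only if it is current."""
    reply = read(conn, episode + ("reply",))
    return reply[0] if reply is not None and reply[1] >= asked else None


episode = ("Upper", "client@example.org", "1")
write(conn, episode + ("request",), "hello", now=100)
early = poll_reply(conn, episode, asked=100)   # server has not polled yet
serve_once(conn, str.upper, now=101)
first = poll_reply(conn, episode, asked=100)

write(conn, episode + ("request",), "again", now=200)
stale = poll_reply(conn, episode, asked=200)   # old reply is ignored
serve_once(conn, str.upper, now=201)
second = poll_reply(conn, episode, asked=200)
print(early, first, stale, second)
```

In a deployed system each side would repeat its polling pass in a loop with a delay and a timeout, which corresponds to the client hanging or timing out when no reply arrives.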
On the equivalency with other representations
The structured data web represents at least <object><attribute><value> triples (OAV) [7].
Since the goal is integration, the objects of interest are applications, and the attributes are
instances of variables. OAV is the same representation used in many other language
systems including objects in object-oriented programming languages, the W3C’s resource
description framework (RDF), and the web ontology language (OWL) intended to
implement the semantic web. In the discussion to this point, relationships have not been
explicitly treated, but the OAV framework can also be used to represent relationships.
[7] To index a value V in OAV representations, two items are used, namely O and A, the object and attribute.
In our implementation of the structured data web, to index values we use four items, namely application,
user, userInstance, and variable. The extra items are used to represent the problem-solving episode or
instance associated with the value. We tend to think of variables in both class and instance terms.
Example: phone number is a variable. But whose phone number? And when is that phone number valid
for that person? Many examples of OAV representations are about static information, but because we were
interested in dynamic information, we chose to use more items for indexing values.
Resources on the web
A PowerPoint presentation that explores some related issues in the context of military
command and control is available at http://www.activecomputing.org/Why-C2-hasstagnated.pps
If you would like to explore our current implementation of the structured data web,
download and install d3i from http://www.activecomputing.org/d3i-AcT-1.0-Setup.exe.
You may execute arbitrary SQL statements using the console available from the
Development menu. SQL statements may be called as string arguments to the sql
procedure. Example:
% sql "select distinct template from element"
TerroristAttack BigCallForFire Travel EgressPlan TaskOrganization TravelSupport
MeetingRequest CAST-lite texttest travel-scheduler WeatherDatabase CoffeeInfo Taxi4
CheckPointThreat AirfieldSeizure NewsItem EditInformationElement CCIR
EmailAddressHKEY IndividualEquipment TaskForceStatus TravelPreferencesViewer
TravelPreferences TravelVoucher ForceDescription PersonnelInformation
MeetingInvitationResponse MetaMail MemberPayStatus CalorieCounter IDcardData
WebScraperMaker CharacterCertification MyProxies TravelBudget
Summary
For large-scale software integration with less development effort, we have proposed the
idea of a structured version of the web that can store and serve variable values for
different applications and specific problem-solving episodes. This paper has described a
simple implementation for the structured data web based on essentially a single table in a
relational database and an index similar to URLs. A key advantage of the structured data
web is the possibility of semantics by example, and we think this is important for
bottom-up development.
Comments are solicited.