“Split Personalities” for Scientific Databases

advertisement
“Split Personalities” for Scientific Databases:
Targeting Database Middleware and Interfaces to Specific Audiences
Cherri M. Pancake1, Mark Newsome1, and F. Joe Hanus2
Dept. of Computer Science1, Dept. of Botany and Plant Pathology2
Oregon State University
Corvallis, OR 97331
{pancake@cs,newsome@cs,hanusj@bcc}.orst.edu
Technical Report 98-60-06
Department of Computer Science
Oregon State University
Abstract
Scientific researchers are anxious to discover new insights into the relationships and interactions among the
exceedingly diverse components of our physical, biological and ecological environment. To do this, individual
scientists must be able to synthesize conclusions from data drawn from disciplines outside their domains of expertise.
While key datasets have already been brought online, they are housed in diverse agencies, using different database
software on a variety of platforms. This paper explores the problems in providing middleware and interface support
for these autonomous research databases (ARDs).
The development of interfaces to ARDs is complicated by the fact that potential end-users range from
highly-specialized research scientists to the general public. To adequately support such diverse users, the interface
must assume different “personalities.” We describe how interfaces can be targeted to three categories of end-users:
domain specialists, non-domain specialists, and students. By automating the activities that are most frustrating, timeconsuming, or error-prone, target-specific interfaces can significantly improve database usability. There is also
heterogeneity among the people who must develop ARD interfaces. We target three different classes of implementers
— database-technology familiar, database-content familiar, and database unfamiliar. By developing different
middleware personalities that respond to the specific skills and interests of each audience, we have been able to keep
our software simple and usable without sacrificing flexibility.
One of society's most pressing needs is a comprehensive understanding of environmental
interrelationships that can move us toward a sustainable biosphere, one that is both ecologically sound and
economically feasible [18]. To meet this challenge, scientific research must yield new insights into the
relationships and interactions among the exceedingly diverse components of our physical environment (e.g.,
climate, species distribution, land utilization patterns, air quality, etc.). That, in turn, requires that scientists
become more effective at identifying, analyzing, and interpreting data derived from many different disciplines.
Cholera offers an example of how cross-disciplinary information on environmental factors permits
better understanding of biological processes that strongly influence human well-being [5]. This disease reveals
a strong association with the sea, with the great pandemics following the coastlines of the world. Vibrio
cholerae not only survives in an almost dormant state during unfavorable conditions in the ocean, but also can
attach itself to certain forms of ocean crustaceans. A single, barely visible copepod can carry 104 organisms,
while just 103 organisms are sufficient to cause disease. When El Niño caused warming and heavy rains along
the coast of South America, the warming of surface waters and an influx of nutrients precipitated a
phytoplankton bloom, followed by a zooplankton bloom (copepods); these were monitored by satellite. A
cholera epidemic followed in Peru, as far as 50 to 150 km. inland. As Colwell pointed out, the importance of
this example is that it requires data from disciplines as diverse as oceanography, microbiology, epidemiology,
and satellite imagery in order to develop predictive models for critical diseases.
The problem for the scientist, then, is not how to acquire and store personal research data, but how to
access and make sense of data gathered by different groups and for different purposes. Pertinent data resides
in large-scale databases managed through diverse database software, on diverse platforms, and with entirely
different data structures, each maintained by a distinct agency or individual. While the scientists who use each
database are expert in its particular domain and are aware of its value and limitations, outside researchers may
have great difficulty determining which data are more important, most reliable, or even most recent. The fact
that formats and nomenclature are idiosyncratic to each database complicates the process. At a DOE
workshop participants discussed the technical and sociological constraints that make it impractical to merge all
genomic data into a single, consistent repository; they projected that autonomous organizations will continue to
be the dominant mode for managing genomic information into the foreseeable future [31]. This situation
extends to many other scientific disciplines as well.
The problems are exacerbated when data spans multiple disciplines, as is the case for data that might
establish significant ecological interrelationships. Now the researcher must cope not only with difficult access
and lack of guidance about the database, but also with the distinct traditions of nomenclature, field
methodology, data organization, etc., within each disciplinary domain. Many of the most compelling research
problems in the biological and environmental sciences will not be approachable, however, until scientists
access, interrelate, and synthesize data across disciplinary boundaries.
1
Database standardization is one possible solution, but this has proven to be problematical in practice.
Consider the example of NSF’s Long-Term Ecological Research (LTER) [15] program, a systematic network
of ecological research sites in operation since 1980. The twenty LTER sites span a range of ecosystem types,
from arctic tundra to tropical rainforest. Although each site has a different investigator-oriented mission, some
measurements are made in every site in every year and the results are available in standard form — including
primary production, nitrogen mineralization rates, standing crop, abundances of soil cations, detritus
production, and censuses of dominant plant species. The majority of their data, however, is acquired and
maintained in site-specific form. Where administrative units like LTER don’t exist, standards are even more
difficult to enforce. At the same time, many valuable datasets are already in place; these are unlikely to be
restructured or converted in the near future. Yet these independent, relatively small-scale (typically less than
100,000 records) databases, maintained by geographically disperse research teams, contain some of the most
critical and most up-to-date scientific information anywhere.
Table 1. Characteristics of Autonomous Research Databases (ARDs).
Independent: Created and managed by diverse, relatively isolated research groups
Autonomous: Ongoing need to maintain control over data and access
Heterogeneous: Hardware/software varies dramatically from site to site
Non-uniform: Each site has distinct ways of organizing data
Non-standardized: Each site may employ distinct data formats and nomenclature
Limited expertise: Little or no database technology expertise available in-house
Inward focus: Primary motivation is to support in-house research
Financial constraints: Little or no funding for making data available to others
This paper explores the problems in providing middleware and interface support for such databases,
which we call autonomous research databases (ARDs). The key properties of ARDs are summarized in Table
1.
In addition to the issues that were already outlined, ARDs typically are constrained in terms of the
amount of funding or level of effort available to support database maintenance or the construction of access
mechanisms. The original accumulation and computerization of scientific research data is typically covered by
discipline-specific research grants. Such grants are for conducting science, however, and do not generally
include the hiring of database administrators nor the development of database interfaces. As a result, virtually
all ARDs are created and maintained entirely by domain specialists. They have only scant familiarity with
database or interface technologies. Further, their chief concern is that the data be useful for their particular
research projects. It is their sense of responsibility to the broader scientific community, not explicit support by
the granting agencies, that motivates them to make their data available for other uses.
The scientific community has become increasingly vocal in its concern about accessing the many
research databases that have already been populated at a considerable investment of time and dollars [16, 34,
6]. All too often, scientists find themselves effectively shut out of even major databases, due to the user-hostile
nature of query languages and interfaces. Smaller databases — even those with critical information for many
other disciplines — may not even be available on the Internet. Yet if scientists cannot interact freely with
remote databases, they must curtail the scope of their scientific inquiries or duplicate previous efforts by
collecting similar data. With resources, facilities, and funding for research becoming more scarce, access to
ARDs is now an economic as well as a scientific issue.
2
The remainder of this paper discusses how such access can be provided, without violating any of the
essential constraints imposed by the nature of ARDs. In the next section, we explore the real costs involved in
access to ARDs, and how these constraints enforce a new approach to software support. The following two
sections describe how the constraints can be accommodated through “split personalities”: interfaces and
middleware layers that target different segments of the ARD community. Examples demonstrate how
interfaces and middleware can respond specifically to the kinds of user needs we have identified in our
collaborations with three research alliances: the Oregon Coalition of Interdisciplinary Databases (OCID [25]),
the Long Term Ecological Research (LTER [15]) program, and the National Partnership for Advanced
Computing Infrastructure (NPACI [19]). Another section compares our work with related research efforts,
underscoring how usability considerations caused us to adopt an unusually flexible and platform-independent
approach. The paper concludes with a discussion of how ARDs might be expected to change over the next
few years, and the implications of these changes in terms of improving access to scientific databases.
Why “Split Personalities” Are Needed
While modern computing and communication systems provide the infrastructure to send bits
anywhere, anytime in mass quantities, connectivity of itself does not ensure that communications are useful
[11]. It certainly doesn't guarantee that professionals outside the scientific community — such as business
leaders who are attempting sustainable use of natural resources or educators who wish to use “real-life” data
in the classroom — will be able to access appropriate information. As Hammond explained, “The ideal
[medical data] system permits us to have a system in which all who need data can have exactly what they need.
It's complicated by the fact that there are [many] types of people that probably have a legitimate need for
access to health care information” [24]. This is a problem of usability. It’s important that data be presented
according to the context in which they will be used.
We propose that in order to meet this challenge, a data resource must have multiple “personalities.”
This can be accomplished by presenting alternate interfaces to the database, each designed to accommodate a
different segment of the user community or to facilitate a different type of use. No single interface is sufficient
to reach dramatically different audiences. A recent workshop arrived at the same conclusion about access to
digital libraries:
You cannot have the same interface for a specialist in a scientific discipline and a schoolchild,
so we should stop fooling ourselves. What we can have, though, are multiple interfaces that
all access the same [digital] library but target different levels of understanding, and then
provide ways to move back and forth between the levels. You may come in at a novice level,
use the library, then decide you’re more of an expert, that you’d like to see a more
professional approach. [35]
For the purposes of this discussion, we consider three general end-user expertise profiles:
•
Domain specialist: scientist with expertise in the particular disciplinary area(s) represented by the
ARD
•
Non-domain specialist: scientist whose expertise is in some other disciplinary area; i.e., a crossdisciplinary user
•
Domain novice: student who is specializing in the disciplinary area(s) represented by the ARD.
3
While ultimately the audience for scientific databases could extend to the general populace (i.e., non-domain
novices) as well, our work has focused on the needs of scientists and students of science.
If the database is to be truly usable, its interfaces must respond to the requirements of different user
profiles. This evidenced by the fact that software users are no longer willing to put up with products that are
difficult to learn or use [9]. Simply creating an elegant, powerful scientific database doesn’t mean that it will
be used. Even scientists and engineers — traditionally the group that was most tolerant of software
idiosyncracies — now expect their software to show evidence of usability [27, 26]. Yet the needs of students
clearly differ from those of professional scientists, just as the needs of a microbiologist clearly differ from
those of a forester. Table 2 gives examples of some of the distinctions among the three profiles.
Table 2. Examples of how needs differ according to end-user expertise profile.
Example
Domain specialist
Non-domain specialist
Domain novice
Type of terminology
Precision is desirable
May need mapping to terms
from other disciplines
Specialized terms need
explanation
Need for explanation
Only if ambiguous or
controversial
Only if ambiguous or unique
to this discipline
Explanation is essential
Usefulness of “extra”
explanations
Gets in the way and slows
access to needed data
If tangential, could obfuscate
key issues or data use
Highly desirable as tool for
clarification
Key query needs
Speed/ease of access (How
quickly can I get the info I
seek?)
Obvious relevance of results
(Is any of the data relevant to
my work?)
Didactic value (How does
this help me learn?)
Above and beyond the need for supporting different end-user profiles is the fact that the software for
constructing multiple personalities must be tailored to the needs of ARDs. In an ideal world, there would be
sufficient human resources to guarantee that staff with expertise in databases and user interface design were
available to construct sets of split-personality interfaces for each ARD. The current situation is far from ideal.
The typical ARD staff — as discovered through our OCID, LTER, and NPACI collaborations — is composed
exclusively of professional scientists. Only major collaborative efforts, such as the LTER program, have the
resources to fund programmers or consultants with database expertise, and as far as we have been able to
determine, no ARD has found it possible to hire specialists in user interface design.
Thus, only some of the people who develop database interfaces can be considered “programmers.”
This has implications for the types of middleware that will be support ARD interface development. Moreover,
end-users have indicated that they want to be able to create their own interfaces. At the least, they want the
capability of adapting or tailoring pre-built interfaces to their own research needs [28]. As they point out, if
an interface is intended to provide access to a database in its entirety, it cannot also provide the quickest,
simplest access to a very specific subset of that data.
Here, we consider three general implementer expertise profiles:
•
DB-technology familiar: database administrator or other person trained in computer science aspects
of database management
•
DB-content familiar: scientist who was responsible for acquiring the data and populating the ARD,
or who followed that person in the role of ARD maintainer
4
•
DB unfamiliar: scientist who is not a member of the group that created the ARD, but who has a
particular need or interest in developing interfaces to it
Each profile represents a different skill set with respect to the technology that may be required to implement
interfaces. It also represents a different degree of comfort with the intent and content of the ARD, and the
knowledge of how to make that content more accessible to target users. Table 3 gives examples of how the
profiles differ.
Table 3. Examples of how needs differ according to implementer expertise profile.
Example
DB-technology familiar
DB-content familiar
DB unfamiliar
Database skills
Strong skills, based on both
training and experience
Weak to strong skills, based on
experience
Unlikely to have skills
Programming skills
Very comfortable with
programming and script
languages, as well as SQL
Weakly to moderately
comfortable with SQL
Unlikely to have skills
Knowledge of database
organization
Extremely familiar with
database organization in
general, as well as this
database’s structure
Familiar with this database’s
structure
Unfamiliar with this database
Understanding of
purpose/limitations of data
Moderately cognizant of
database contents
Extremely cognizant of
database contents
Unfamiliar with database
contents
We propose that these distinctions, too, imply the need for split personalities, each targeted at a
different expertise profile. This time, however, it is the database middleware that should respond to the
needs of its user communities. The rationale is again based on usability criteria: if it is too difficult or timeconsuming to construct interfaces to an ARD, they simply will not be provided. For example, a strategy that
requires that the database be in Boyce-Codd Normal Form will effectively rule out its implementation by any
ARD that does not have DB-technology familiar staff.
The next sections describe how our middleware products are designed to support different implementer
expertise profiles, allowing ARDs to constuct sets of interfaces that respond to different user expertise
profiles. A key characteristic of the products is that they automate many of the activities that are most
frustrating, time-consuming, or error-prone. This is a classic approach to improving usability, although it has
not been applied very frequently to software for scientific users [14, 17, 2, 26, 27].
Split Personalities for Database Interfaces
The Web has been cited frequently for its role in making the Internet easier to use and more broadly
meaningful [24]. The ubiquity of Web clients, the ease with which Web pages can be produced, and the
familiarity of users with the format make the Web page an excellent choice for interfaces that can be tailored
to the needs of widely differing groups. From the perspective of research scientists, however, Web-based
access to databases has been only poorly supported and existing interfaces are extremely non-intuitive. In
particular, two key developments are essential in order to achieve multiple personalities for research data.
First, interface features must make it possible to minimize the amount of text that must be entered by
particular user expertise profiles. Since the likelihood of inappropriate or misspelled values is much higher
among a non-domain specialist or domain novice audience, it should be possible to develop query interfaces
5
that are based on recognition (e.g., point-and-click from scrollable lists) rather than recall (e.g., typed input
fields). For similar reasons, it must be possible to provide additional information (e.g., context-sensitive
annotation, instruction, or help) that is no more than a click away in order to accommodate varying levels of
user expertise.
Second, the interface must be able to adapt as the database grows. ARDs are active research tools.
Periodic updates are to be expected, and users should see completely up-to-date information at all times. The
values included in scrollable lists or drop-down menus, for example, should be obtained dynamically at time
the screen is displayed, so that they reflect any recent additions or updates.
How can these mechanisms be exploited to develop split personalities? We describe a case study from
OCID [25]. Sherry Pittam, a lichenologist from the Dept. of Botany and Plant Pathology at Oregon State, had
developed a database of Lichens of the Pacific Northwest, using data from several herbarium collections. (A
lichen consists of two mutually dependent organisms, fungi and algae, living in a symbiotic relationship.) The
relational database, managed using Sybase [33], houses an extensive collection of literature, image,
taxonomic, chemical, and ecological characteristics, as well as the registration data for the herbarium
specimens. Pittam also downloaded and installed the HyperSQL middleware described in the next section,
which she used to developed Web interfaces targeting different user communities. We use her work as an
example of how split personalities can improve database usability. At the same time, we point out the
middleware features that are needed to support interface personalities.
Interface A, for domain specialists with fully-specified queries. For professional lichenologists, the
primary need is quick access to the information on particular lichens. The simplest interface to any database is
a form where the user types in the desired characteristics — for example, naming the desired genus, species,
etc. — and those values are passed to the database as query criteria. One was implemented for the lichens, but
it had the obvious disadvantage of being prone to typing errors. Further, in lichenology as in many other
biological disciplines, taxonomical nomenclature has evolved over time so that a particular species may have
had several names over the last century. This means that it was possible for a user to request information on a
species and be told that no information was available, when in fact the lichen was present but under a different
name.
To accommodate this problem, the interface was modified to eliminate all user typing. Instead of
presenting blank input areas, it pre-fetches all possible values for an attribute and displays them in a list
(pulldown for a small number of values; scrollable when there can be many). The user selects values from the
lists that are of interest, such as a particular genus, letting the other lists default to “any match.” This provides
a robust and simple-to-use interface for anyone interested in retrieving information about a specific lichen or
about lichens that share a specific characteristic, such as occupying a particular type of habitat. [Authors’
note: because this interface is so simple, we do not include a figure; one can be added later if desired.]
The interface has an added advantage for first-time or infrequent users of the database. The user can
choose query criteria by recognition, rather than having to remember or guess a valid keyword or value. This
allows user to effectively “see” what is included in the database before they actually initiate any queries. For
example, a lichenologist interested only in Peltigera spp., or only in lichens from estuarine habitats, would be
able to determine immediately if the database included such information.
Interface B, for domain specialists with incompletely-specified queries. Note that interface A
requires that the user be able to specify the query criteria completely. To support more open-ended queries, a
second interface allows the user to submit queries that are only partially specified. This interface supports the
identification of lichens on the basis of chemical reactions. Lichenologists may supply the reactions from a
full suite of chemical tests, thereby guaranteeing that a unique species will be identified. Alternatively, tests
may be applied in some arbitrary order, with one or more results used as the basis for an incompletely-
6
specified query (i.e., some attributes are left as “don’t-care” values). In some cases, as few as one or two tests
may be sufficient for successful identification; see Figure 1.
Figure 1. Lichen database, Interface B, showing how the user can specify just partial results of chemical
tests.
This is possible because of how the interface is structured. Normally, the identification of biological
specimens requires making a series of comparisons using structured questions (e.g., Is it bigger than a tennis
ball? If so, is it slick or is it fuzzy?). This classic structure is called a dichotomous key because of its reliance
on yes/no questions. The drawback is that if one of the decisions cannot be made (e.g., you cannot determine
if the color is mauve or taupe), the process fails. A more human-friendly approach is to support identification
through a synoptic key, which uses a checklist of compiled characteristics, each with a set of potential values.
The user identifies an organism by checking off known characteristics. Organisms not fitting the pattern are
eliminated, resulting in a list of one or more organisms that match the user's observations. Thus, synoptic keys
allow the user to make an identification on the basis of incomplete or imprecise information.
Synoptic keys can be implemented using any Web-to-database middleware that is capable of prefetching and displaying lists of potential attribute values. The user chooses as many or as few attributes as
desired, and the query is formulated as a series of conjunctions (for performance reasons, it is usually desirable
to build special tables or indices reflecting the dependencies between attributes). A successful identification
has been made when sufficient attributes have been selected to reduce the “scope” of the query to just one
species. HyperSQL adds a key feature that significantly improves usability for this type of incompletelyspecified queries: iterative query scoping.
Consider the case of a scientist who wishes to minimize the number of tests needed to identify a lichen.
Without special mechanisms, it will be necessary to query the database repeatedly as each test is performed,
specifying all results known so far and then seeing if the result has been narrowed to a single species. Note
that although the database includes embedded information that could tell the lichenologist which of the
remaining tests will further reduce the scope of the results, there is no way for the user to access this
7
information directly. HyperSQL provides that service. After selecting the known test results, the user can
choose to re-scope the query instead of submitting it. This directs the middleware to determine which
remaining test values can coexist with the partially specified ones. The lists of remaining attribute values are
updated to reflect the narrower scope. When the user’s selections reduce the scope of some other attribute to
just one value, it is shown as the “selection.” This effectively tells the scientist that a test need not be made, as
it will not further narrow the scope of the query. Figure 2, for example, shows how query re-scoping changed
all remaining fields from “*” to some value. This indicates that the result of the single Medulla KC test was
sufficient to uniquely identify the lichen.
Figure 2. Lichen database, Interface B, after the query has been re-scoped; it turns out that no other tests are
needed.
Interface C, for non-domain specialists. To support the needs of scientists who are not lichen
specialists, a third interface implements a synoptic key for identifying specimens based on their physical
characteristics. This allows the user to select the most familiar or obvious characteristics (e.g., lobe shape)
first. The availability of query re-scoping means that selections based on more obscure features (e.g., podetia)
can be deferred until it is certain that they will be needed to successfully identify the specimen.
Two other HyperSQL mechanisms are particularly useful in the context of users from other scientific
disciplines. First, cascaded queries can be used to link together related information that may not be essential
for domain specialists, but might be important for outsiders. The availability of information on lichen habitat
is an example. From the perspective of a lichenologist, this is tangential information that is not used in keying
out specimens. For foresters or ecological scientists, on the other hand, this may well be the most important
data. Interface C is structured to include habitat information with every query result. Moreover, the habitat
appears as a link, similar to a normal Web hyperlink; when selected, it transparently generates an additional
query and presents the user with information on the habitat and a list of all species associated with that habitat
(see Figure 3).
8
Figure 3. Lichen database, Interface C, showing the effects of a cascaded query to reveal other lichens
occupying the same habitat as Tuckermannopsis platyphylla (Tuck.) Hale.
Second, the user can choose to restrict the number of results that will be returned by the query. This
feature is particularly important for first-time or infrequent users of the interface, who have no idea
whether their query is sufficiently specific to return one, ten, or ten thousand records. Explicit interface
controls allow the user to decide the relative number of results that are appropriate for his or her particular
use. Without them, the browser may be stalled while large numbers of records are downloaded, or the user
may be forced to interrupt the process manually.
Interface D, for biology students. Recognizing that the database content, but not the technical
presentation, would be useful for educational settings, Pittam also constructed an interface called LichenLand
for use by high school and undergraduate students in biology courses. The appearance is dramatically
different, since the synoptic key is icon-driven. Each taxonomic characteristic is illustrated
9
Figure 4. Lichen database, Interface D, showing how cartoons are used to illustrate the taxonomic
characteristics.
with a cartoon to help the student create associations for the technical terminology (Figure 4). As for Interface
C, iterative query scoping and cascaded queries facilitate free exploration through related information..
The interface also exploits HyperSQL’s mechanism for context-sensitive explanation to improve its
usefulness for the target user profile. Each of the taxonomic characteristics is linked to an illustrated
discussion of what the characteristic means, including pitures reflecting each possible value (Figures 5). One
consequence is that in many cases, lichen identifications can be accomplished entirely by comparing the
specimen with pictures. This approach makes it possible for even a completely untrained user to experience
success in taxonomic identification. A number of educators have adopted the interface for use in high school
biology and general science courses [22].
10
Figure 5. Lichen database, Interface D, showing context-sensitive explanatory material that helps students
learn about lichen characteristics.
Split Personalities for Database Middleware
Just as database end-users impose multiple sets of requirements on interfaces, the users who build
those interfaces reflect multiple communities, with distinct characteristics and distinct types of needs. In this
section, we discuss how database middleware can respond to them by supporting multiple personalities. In
particular, middleware design has to take into account the fact that ARDs have limited expertise, time, and
funding available. Adding database personalities cannot place an undue burden on the research group that
maintains the data and is making it available to others.
The OCID partners have established a set of general requirements for middleware. Since they are
intended to make it feasible for research databases from many disciplines to participate in cross-database
integration efforts, they encapsulate some of the unique needs of ARDs:
Expertise required. It is critical that the expertise necessary to develop a Web
interface be within the skill range of a scientific researcher. Many of the key repositories of
scientific information are in the hands of small research groups that do not include computer
scientists or database professionals. Also, the effort involved in bringing a new database
online should be minimal, so as not to disenfranchise researchers who work alone or who
receive no external funding. In particular, database owners must not have to restructure,
normalize, or otherwise transform their data simply to permit access to outsiders.
Security. It is also important that research databases remain autonomous, and that
policies for granting access remain under the control of the individual institutions or agencies
that established them. Some research data, such as distributions of endangered species or
11
personal health data, must be safeguarded as sensitive information. It should be possible to
maintain access privileges at the database and operating system level, and not be dependent on
Web security mechanisms (which are known to be unreliable). In addition, it must not be
necessary to maintain the Web interfaces on the same machine as the database; this is
particularly important when the data includes sensitive information. It should be possible to
support a full range of queries while granting only read-access to the database.
Software required. Research groups must not be required to have particular database
software in order to participate. Both commercial and public-domain software are in
widespread use; it is not realistic to expect researchers to acquire new software and convert
their databases. In terms of the Web interface software, it should be capable of interacting
with a range of database software and should not require any special browser capabilities.
Other features. To the extent possible, the interfaces to research databases should
always reflect up-to-date research data. Special mechanisms are needed to support this; they
must be simple, however, in order to overcome researchers’ inclination to hardcode
information as static lists in Web interfaces. Finally, it should be possible for a skilled user to
build his/her own specially tailored interface to a database. Ideally, the user could start with
an interface provided by the research group, and then modify it to suit individual needs.
No commercial software products satisfy these requirements. Indeed, several fail to satisfy any of them at all
[20]. In response to what is clearly an important emerging need, we worked with other researchers at the
Northwest Alliance for Computational Science and Engineering (NACSE [23]) to develop Web-to-database
middleware that would meet all the basic requirements of ARDs. Further, we structured the middleware as a
suite of components, each addressing the needs of a different type of interface implementer. The components
provide three different approaches for specifying database interfaces: markup language, script-based tool, and
GUI-based tool. Each approach is described below, in terms of the type of support it offers to a particular
interface implementer profile. By way of example, we show how a simple query interface (Figure 6) would be
implemented using each middleware product.
Markup language, for implementers who are DB-technology familiar. With this approach, Web
interface files are constructed by embedding special directives and macros into HTML text files. Before the
file is sent to the browser, a translator executes the special constructs and performs whatever SQL queries are
needed to acquire specified data from the target database. Our markup language is called Query Markup
Language, or QML. An example interface specification appears in Figure 7.
QML is intended for DB-technology familiar users. Therefore, it exploits the user’s assumed
familiarity with HTML and SQL — a QML interface specification is basically an HTML file with additional
tags, macros, and fragments of SQL. The HTML is used to control interface appearance and user interaction,
while the QML constructs manage the interaction between the interface and the target database. The notation
is straightforward for users who have already mastered HTML forms. Any of the standard forms components
(e.g., scrolled lists, buttons, pull-down lists, checkboxes) can be used in the normal way. For example, the
following HTML tag is used to create a group of radio buttons, where the user can select exactly one item by
clicking on the small “radio button” to the left of the item:
<input type=radio name=choice value=”first”> first-choice-here
<input type=radio name=choice value=“next”> next-choice-here
...
12
Figure 7. Simple interface to the Microbial Germplasm Database [8].
<qml_screen name=query default>
<center><H1>General Query></H1></center><p>
<center>* (star) is the wildcard character</center>
<p><b>Type of Organism:</b><p>
<input type=radio Name=organism_type value=Algae checked>Alga
<input type=radio Name=organism_type value=Bacterium >Bacterium
<input type=radio Name=organism_type value=Fungus >Fungus
<input type=radio Name=organism_type value=Nematode>Nematode
<input type=radio Name=organism_type value=Virus>Virus
<p><b>Organism Name:</b>
<table<tr><td>
Genus: <qml_input name=genus sql="Select distinct genus from Organism"></td>
<td>Species: <input type=text name=species value="%"></td>
<td>Variety: <input type=text name=variety value="%"></td>
<td>Strain: <input type=text name=strain value="%"></td>
</tr></table>
<table><tr><td><td><qml_submit target=results><td><qml_new></table>
<p><hr><p>
Constructed using QML
</qml_screen>
Figure 6. QML specification for the example interface shown in Figure 6.
One input tag is needed for each radio button to be included in the group. Collectively, the tags direct
the browser to display a list of radio button items, each showing a button followed by the indicated string
(“first-choice-here,” “next-choice-here”), etc. When the user chooses a button, the corresponding value (e.g.,
“first”) will be stored in the variable choice.
13
QML tags make it possible to populate lists and other forms elements with data that is actually
retrieved from the database. To present the user with radio buttons for selecting what species he/she wants
information on, the equivalent QML construct is used:
<qml_input type=radio name=choice
sql=”Select distinct species from Organism where genus=’$genus’ ” multiple>
This one statement directs the QML translator to issue a query to retrieve the species names from table
Organism, and use the results to build a list of radio buttons, each labeled with the name of a different species
found in the database. The advantage of a markup language approach is that the developer retains complete
control over the layout and functionality of the interface. Since QML’s special directives and macros give the
developer the expressive power of a programming language — such as loops, assignments, and conditional
logic — it is possible to build highly customized query interfaces.
The tradeoff is that, like other markup languages, QML leaves error detection and recovery to its user.
Misspelled tags are ignored by both the translator and the browser. As with other flexible uses of SQL, it is
perfectly possible to write SQL queries that are syntactically correct, but which produce spurious results (for
examples, see [20]). If errors do occur, the end-user is directly exposed to the extremely cryptic error
messages generated by the database management software.
The learning curve is small for experienced database technologists, but is potentially very steep for
non-computer scientists. The user of QML is assumed to be extremely familiar with SQL and database
terminology from computer science. The directives and macros constitute a programming language, so he/she
is also expected to be familiar with programming concepts. QML relies on HTML structures for managing
interface appearance and cgi-bin scripts for handling user input, so familiarity with these languages is needed
as well. Finally, in order to formulate queries that cross table boundaries, the database software’s proprietary
command language must be used to examine table and foreign key structure.
Script-based tool, for implementers who are DB-content familiar. This type of tool analyzes
schema specifications written in a specialized scripting language, then generates the appropriate query files
and HTML text. A high-level specification language is used to shield the implementer from the low-level
details of SQL, HTML, and proprietary database management libraries. The tool reports problems with the
specification and automatically generates SQL and HTML for query input and output pages. Our scriptbased tool is HyperSQL, or HSQL. The specification for the example interface is shown in Figure 8.
This tool is targeted at DB-content familiar users. Unlike QML, the scripting language used by
HSQL does not try to extend on the user’s familiarity with HTML. Instead, it streamlines interface design by
establishing defaults for almost every aspect of both interface appearance and interactions with the target
database. Thus, it is possible to construct an HSQL interface with just a couple of dozen lines of text. HSQL
does this by producing the basic style of interface most commonly requested by scientists: a “query screen”
where the user indicates the characteristics to be searched for; a “results screen” providing a summary listing
of the records that were found to match, and a collection of navigatable “browse screens” providing detailed
information, graphics, sound clips, etc.
The HSQL specification is structured into logical sections reflecting that organization. Each section
is delimited by begin/end statements, allowing the HSQL tool to detect and report many potential errors.
Within a section, the user includes statements such as
OUTPRINT SUPPRESS_IF_EMPTY “Genus: DATA_FIELD=Genus”
to display the value retrieved for genus. Special options make it easy to deal with interface issues that have
traditionally been problematical. For example, SUPPRESS_IF_EMPTY indicates that the label (“Genus:”)
should not be shown if the record being displayed has no value stored for this field. The terminology is
simplified to reflect the way scientists think about databases — e.g., “field” and “list” rather than the database
terms “attribute” and “relation”
14
QUERY BEGIN
BANNER "<center><b>General Query</b></center>";
HEADER "<center>* (star) is the wildcard character</center>";
INPRINT "<b>Type of Organism:</b><p>";
INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Alga";
INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Bacterium";
INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Fungus";
INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Nematode";
INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Virus";
INPRINT "<p><b>Organism name:</b>";
INPRINT "<table><tr><td>Genus: ";
INPUT Genus TYPE=QUERYLIST= "SELECT distinct genus FROM Organism ";
INPRINT "</td><td>Species: ";
INPUT Species TYPE=TEXTINPUT ;
INPRINT "</td><td>Variety: ";
INPUT Variety TYPE=TEXTINPUT ;
INPRINT "</td><td>Strain:" ;
INPUT Strain TYPE=TEXTINPUT ;
INPRINT "</td></tr></table>";
FOOTER "<hr>Constructed using HyperSQL" ;
QUERY END
Figure 8. HSQL specification for the example interface shown in Figure 6.
A tool of this type provides considerably more automation than a markup language can. Correct SQL
is composed automatically; all the user needs to supply is the fields that should be searched or returned by the
query. Similarly, the HTML needed to format the interface pages is generated automatically. Special format
features include automatic inclusion of buttons for submitting the query, limiting the number of results
returned, and performing query re-scoping (features described in the section on interface personalities).
Extensive error-checking services are provided, not only to check the validity of the HSQL specification itself,
but also to “re-package” any errors returned by the database so that they will be intelligible to end-users.
The tradeoff is that HSQL imposes more structure than a markup language does. The implementer is
no longer free to use any conceivable format, nor can SQL be used in unorthodox ways.
GUI-based tool, for implementers who are DB unfamiliar. In this case, a graphical tool is used to
construct a graphical specification, which is converted to the appropriate query files. A separate gateway
program, invoked by the Web server, translates the query files into a stream of HTML text for the browser
and streams of SQL commands for the target database. Our tool is called QueryDesigner.
GUI-based tools are intended for people who wish to construct database interfaces but are not
necessarily familiar with HTML and SQL programming. QueryDesigner takes this concept a step further,
supporting users who are not even familiar with the data or the database organization. The highly structured
environment makes it easier for inexperienced users to quickly tailor an existing query interface — or compose
a new one — without direct exposure to SQL or any textual specification mechanism.Like HSQL,
QueryDesigner makes assumptions about the types of interfaces that scientists need. In fact, it mimics the
15
styles that we observed scientists creating with other specification tools, including HSQL [27]. To personalize
an existing interface, the user clicks on the area of the screen to be changed. A series of guided screens let the
user change the appearance of the screen, eliminate fields or add new ones, associate default values with
particular fields, or link to associated data and/or Web pages, all through simple point-and-click operations.
Figure 9 shows one such screen, as it would appear if a user wanted to modify the example interface.
A unique aspect of QueryDesigner is the way it supports interface design for unfamiliar databases.
Figure 9. QueryDesigner screen for defining the example interface shown in Figure 6.
Note that to employ markup languages or script-based tools, the user must know the names of database tables
and attributes (to specify query criteria), something about the format of the data (to specify how it should be
displayed), and which attributes serve as primary and foreign keys (to specify how tables can be joined for
complex queries). QueryDesigner eliminates the need for this specialized information.
Instead, the tool contacts the target database on behalf of the user and queries it to obtain the structure
of the database. Any associated metadata is retrieved as well, although there is no requirement that it be
present. QueryDesigner uses this information to construct an entity-relationship diagram showing how data is
organized into tables and how tables relate to one another. (The choice of an E-R diagram was based on
observations that scientists find its meaning intuitively clear [20].) The diagram is actually interactive. By
clicking on boxes and connectors, the user can compose queries without even understanding what a join is
(Figure 10).
16
Another important feature of QueryDesigner is its ability to prevent errors. It has been demonstrated
that the interfaces generated using this tool reduce by 76% the number of potential error points associated with
SQL queries [20]. Because all SQL is generated automatically, and because the tool has detailed knowledge
about the organization of the target database, it can prevent the user from specifying queries that would violate
reasonable constraints. Because the tool is graphical, typographic and orthographic errors are also eliminated
outright.
The tradeoff here is that the user cedes all direct control over the interface specification. Everything is
accomplished indirectly, via the tool. Interface appearance and query functionality are of necessity more
estricted than they are with markup languages or script-based tools.
Related Work
Figure 10. QueryDesigner screen showing the organization of the target database for the example.
17
Many of the proposed solutions for constructing Web-based query interfaces have not progressed
beyond the stage of research prototypes, but there are a few products currently available. All employ a
Common Gateway Interface (CGI), where a middleware component, or gateway, is positioned between the
browser and the target database. The gateway receives user input from forms displayed by the browser,
translates it into appropriate queries and sends them to the database, then receives the query results and reformats them as HTML documents for display in the browser's window.
From the perspective of usability, current middleware differs along three dimensions: (1) how many
languages the query designer must know in order to build the forms and specify how queries should be
constructed; (2) restrictions on where the gateway can execute; and (3) what types of target database are
supported. The systems are summarized in Table 3; they have been described in detail elsewhere [20, 21].
Table 3. Comparison of Web-based Query Interface Builders
System
Language Expertise Needed
Location of Gateway
Target Databases Supported
Cold Fusion [1]
HTML, SQL, additional
markup language
same Windows machine as
database
some Windows-based databases
JFactory [32]
HTML, SQL, Java plus C or
C++
same Windows/NT machine as
database
Weblogic T3/Client server and
JDBC (but low-level calls to
drivers must be coded manually)
Genera/Web [13]
tool-specific schema
definition language
same machine as database
Sybase
WDB [30]
HTML, SQL, Perl
same machine as database
Sybase, Informix, or Mini-SQL
WebinTool [3]
HTML, SQL, tool-specific
macro language
same machine as database
Ingres, Informix or Mini-SQL
dbWeb [12]
HTML, SQL, additional
markup language
same Windows/NT local-area
network as database
Windows/NT-based ODBCcompliant databases
Sapphire/Web [4]
HTML, C
same Windows95/NT local-area
network as database
Windows/NT-based Sybase,
Oracle, or Informix
W3QS [10]
specialized version of SQL
any UNIX machine
Searches Web documents rather
than databases
HyperSQL [21]
minimal HTML, minimal
SQL, HSQL scripting
language
database, browser, or any third
machine (UNIX or
Windows/NT) accessible by
Internet
Sybase, Oracle, or any ODBCcompliant database, on any
platform
QML [7]
HTML, minimal SQL, QML
markup language
database, browser, or any third
machine (UNIX or
Windows/NT) accessible by
Internet
Sybase, Oracle, or any ODBCcompliant database on any
platform
18
QueryDesigner
[29]
minimal HTML
database, browser, or any third
machine (UNIX or
Windows/NT) accessible by
Internet
Sybase, Oracle, or any ODBCcompliant database, on any
platform
The large database management system vendors — including Oracle, Informix, Sybase, and IBM,
among others — also have released products that facilitate the construction of Web interfaces. In each case,
however, service is restricted to a particular database product, and the Web server must reside on the database
machine or its LAN. HyperSQL [21], QueryDesigner [29], and QML [7] are the only examples of Web-todatabase middleware that are capable of connecting to remote databases located anywhere. They are also the
only products that support a full range of databases on both UNIX and Windows platforms.
Conclusions
To meet the challenge of understanding complex environmental relationships, scientists must be able
to access and assimilate research data from many ARDs representing many disciplines. Acquiring such
information is problematical for a variety of reasons. Since ARDs were created for many different purposes,
their data are stored using distinct formats and nomenclature, database organization, and hardware/software
platforms. The databases are independent, reside at geographically dispersed sites, and are unlikely ever to be
co-located. Database standardization is a potential solution, but in practice ARDs are maintained for
particular research projects that do not have funding to support such efforts. It is the scientists’ sense of
responsibility to the broader community that prompts them to add interfaces for more general access to their
data.
The Web has been welcomed as a mechanism for supporting platform-independent access to remote
data. Web access alone, however, does not guarantee that ARDs will be usable. We identified several key
usability requirements from the perspective of potential end-users:
(1) Interfaces must not require that end-users have a priori knowledge about the organization of the
database being queried (i.e., should make it possible to explore and use unfamiliar ARDs).
(2) End-users should be able to perform basic searches by selecting from available choices, rather
than having to supply input values.
(3) It should be possible for users to navigate through complex hierarchies of data in incremental steps
that successively narrow the scope of search.
(4) Potential sources of user error should be minimized, as should sources of inadvertent delays such
as lengthy downloads of information that was not actually desired.
No single interface can satisfy these needs for all types of end-users. As demonstrated through a series of
examples, interface personalities provide a way to service multiple audiences. Interface appearance and
functionality can be tailored to the specific needs of each type of end-user.
The same concept can be applied to interface middleware. Working with scientist end-users, we built
middleware that addresses the key usability requirements of ARD owners:
(1) To shield DRD owners from having to learn specialized languages or techniques in order to
publish their data, middleware installation and maintenance must be within the skill set of a
reasonably adept, non-computer professional (e.g., cannot require expertise in operating systems,
object-oriented systems, SQL, or Web servers).
(2) Middleware must not require the purchase of specialized software or involve complex installation
procedures.
(3) The steps required to implement database interfaces must be fast and easy-to-learn.
(4) In order to facilitate sharing and reuse of query forms by end-users, query interfaces must be
customizable, storable, and shareable by end-users.
19
By creating middleware personalities that respond to the specific skill sets and interests of different
implementer expertise profiles, we were able to keep the software simple and usable without sacrificing
flexibility.
Clearly, there are tradeoffs at different levels of user support. The interface personalities requiring the
least user expertise are restricted in terms of how queries can be constrained or combined. For middleware
personalities, too, flexibility must sacrificed to enjoy more structure and guidance. Because the personalities
coexist, however, the end-user or implementer is always free to move to a different level of support. The
ultimate advantage of split personalities, then, is that as needs and skill sets evolve, the user can elect to
change to a new, more appropriate personality.
Acknowledgments
The lichen databases and their Web interfaces were developed by Sherry Pittam of the Oregon
Coalition of Interdisciplinary Databases; she now works for the Northwest Alliance for Computational
Science and Engineering (NACSE). QML was developed by Ken Ferschweiler of NACSE. We gratefully
acknowledge their help in providing the examples shown here.
HyperSQL and Query Designer were developed by Mark Newsome, in collaboration with Cherri
Pancake and Joe Hanus, as part of a joint project of the Microbial Germplasm Database and NACSE.
MGD is funded by NSF grants BIR95-03712 and BIR 94-02192 and the U.S. Department of
Agriculture. NACSE is a Metacenter Regional Alliance sponsored by NSF grant ASC 95-23629, the
National Partnership for Advanced Computational Infrastructure, and the U.S. Department of Defense HPC
Modernization Program.
20
References
[1] Allaire Corporation. 1995. Cold Fusion White Paper. Available online at
http://www.allaire.com/cfusion.
[2] Baecker, R.M. and W.A.S. Buxton. 1987. Cognition and Human Information Processing. In Readings
in Human-Computer Interaction: A Multidisciplinary Approach, Baecker and Buxton, eds., Morgan
Kaufmann, pp. 207-218.
[3] BBSRC Roslin Institute. 1995. WebinTool. Available online at http://anita.jax.org/webintool0.921/docs/user-guide.txt.
[4] Bluestone Corporation. 1995. Sapphire/Web Manual. Available online at http://www.bluestone.com/.
[5] Colwell, R.R. 1996. Global Climate and Infectious Disease: The Cholera Paradigm. Science, 274:20252031.
[6] Davis, F.W. 1995. Information Systems for Conservation Research, Policy, and Planning. Bioscience,
special biodiversity supplement, Summer 1995, pp. 36-41.
[7] Ferschweiler, K. 1998. Query Markup Language. Available online at http://www.nacse.org/qml.
[8] Hanus, J. F., M. Newsome, L. Moore and C. Pancake 1995. The Microbial Germplasm Database.
Available online at http://mgd.nacse.org/cgi-bin/mgd.
[9] Holtzblatt, K. and H. Beyer. 1993. Making Customer Centered Design Work for Teams.
Communications of the ACM, 36(10):93-103.
[10] Konopnicki, D. and O. Shmueli. 1995. W3QS: A Query System for the World-Wide Web, Proceedings
of the 21st VLDB Conference, Zurich, Switzerland.
[11] Kyng, M. 1991. Designing for Cooperation: Cooperating in Design. Communications of the ACM,
34(12):65-73.
[12] Laurel, J. 1995. dbWeb White Paper. Aspect Software Engineering. Available online at
http://www.microsoft.com/winterdev/.
[13] Letovsky, S.I. 1994. Web/Genera. Johns Hopkins Medical Institutes. Available online at
http://gdbdoc.gdb.org/letovsky/genera/genera.html.
[14] Lewis, C. and D.A. Norman. 1987. Designing for Error. In Readings in Human-Computer Interaction:
A Multidisciplinary Approach, Baecker and Buxton, eds., Morgan Kaufmann, pp. 627-638.
[15] LTER: National Research Sites with a Common Commitment. Available online at http://lternet.edu/.
[16] Lubchenco, J. 1995. The Role of Science in Formulating a Biodiversity Strategy. Bioscience, special
biodiversity supplement, Summer 1995, pp. 7-9.
21
[17] Marcus, A. 1993. Human Communications Issues in Advanced UIs. Communications of the ACM,
36(4):101-109.
[18] Mooney, H. and C.J. Gabriel. 1995. Toward a National Strategy on Biological Diversity. Bioscience.
Science & Biodiversity Policy Supplement, preface, Summer 1995.
[19] National Partnership for Advanced Computational Infrastructure. 1998. Materials available online at
http://www.npaci.edu.
[20] Newsome, M. 1997. A Browser-based Tool for Designing Query Interfaces to Scientific Databases.
Ph.D. Dissertation, Department of Computer Science, Oregon State University.
[21] Newsome, M., C. Pancake, and J. Hanus. 1997. HyperSQL: Web-based Query Interfaces for Biological
Databases. Proceedings of the 30th Hawaii International Conference on System Sciences, IV:329-339.
[22] Newsome, M., J. Hanus and C Pancake. 1997. Simplifying Web Access to Remote Scientific
Databases, Dept. of Computer Science, Oregon State University, Technical Report CSTR97-60-07.
[23] Northwest Alliance for Computational Science and Engineering. 1998. Materials available online at
http://www.nacse.org.
[24] NII 2000 Steering Committee, Computer Science and Telecommunications Board Commission on
Physical Sciences, Mathematics, and Applications National Research Council. National Academy of
Science. 1996. The Unpredictable Certainty: Information Infrastructure Through 2000, Chapter 6:
Public Policy and Private Action. National Academy Press Washington, D.C. Available online at
http://www.nap.edu/readingroom/books/unpredictable/.
[25] Oregon Coalition of Interdisciplinary Databases. 1998. Available online at http://www.nacse.org/ocid.
[26] Pancake, C. 1997. Can Users Play an Effective Role in Parallel Tools Research? International Journal
of Supercomputing and HPC, 11(1):84-94.
[27] Pancake, C. 1997. Improving the Usability of Numerical Software through User-Centered Design. In
The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice. Chapman
& Hall, London, pp. 44-60.
[28] Pancake, C., et al. 1992-1998. Interview and field notes from meetings with users of large-scale
scientific data.
[29] Pittam, S., J. Hanus, M. Newsome and C. Pancake. 1998. Multiple Database Personalities: Facilitating
Access to Research Data, In Proceedings of 4th International Conference on Information Systems,
Analysis and Synthesis, pp. 200-207.
[30] Rasmussen, B.F. 1994. WDB: A Web Interface to Sybase. In Fourth Annual Conference on
Astronomical Data Analysis Software and Systems, Baltimore, MD.
[31] Robbins, R.J. 1994. Genome Informatics I: Community Databases. Report of the Invitational DOE
Workshop on Genome Informatics, April 26-27, 1993. Journal of Computational Biology, 1(3):173-190.
22
[32] Rogue Wave Corporation. 1996. Factory User's Manual. Rogue Wave Corporation, Corvallis, OR.
[33] Sybase, Inc. 1994. Open Client-Library/C Reference Manual, Sybase, Inc. Emeryville, CA.
[34] Tiedje, J.M. 1994. Microbial Diversity: Of Value to Whom? ASM News, 60 (10):524-525.
[35] Williams, R. et al. 1998. Workshop on Interfaces to Scientific Data Archives. Center for Advanced
Computing Research, California Institute of Technology, Technical Report CACR-160. Available online
at http://www.cacr.caltech.edu/isda.
23
Download