“Split Personalities” for Scientific Databases: Targeting Database Middleware and Interfaces to Specific Audiences

Cherri M. Pancake(1), Mark Newsome(1), and F. Joe Hanus(2)
(1) Dept. of Computer Science, (2) Dept. of Botany and Plant Pathology
Oregon State University, Corvallis, OR 97331
{pancake@cs,newsome@cs,hanusj@bcc}.orst.edu

Technical Report 98-60-06
Department of Computer Science, Oregon State University

Abstract

Scientific researchers are eager to discover new insights into the relationships and interactions among the exceedingly diverse components of our physical, biological and ecological environment. To do this, individual scientists must be able to synthesize conclusions from data drawn from disciplines outside their domains of expertise. While key datasets have already been brought online, they are housed in diverse agencies, using different database software on a variety of platforms. This paper explores the problems in providing middleware and interface support for these autonomous research databases (ARDs). The development of interfaces to ARDs is complicated by the fact that potential end-users range from highly-specialized research scientists to the general public. To adequately support such diverse users, the interface must assume different “personalities.” We describe how interfaces can be targeted to three categories of end-users: domain specialists, non-domain specialists, and students. By automating the activities that are most frustrating, time-consuming, or error-prone, target-specific interfaces can significantly improve database usability. There is also heterogeneity among the people who must develop ARD interfaces. We target three different classes of implementers — database-technology familiar, database-content familiar, and database unfamiliar. By developing different middleware personalities that respond to the specific skills and interests of each audience, we have been able to keep our software simple and usable without sacrificing flexibility.
One of society's most pressing needs is a comprehensive understanding of environmental interrelationships that can move us toward a sustainable biosphere, one that is both ecologically sound and economically feasible [18]. To meet this challenge, scientific research must yield new insights into the relationships and interactions among the exceedingly diverse components of our physical environment (e.g., climate, species distribution, land utilization patterns, air quality, etc.). That, in turn, requires that scientists become more effective at identifying, analyzing, and interpreting data derived from many different disciplines. Cholera offers an example of how cross-disciplinary information on environmental factors permits better understanding of biological processes that strongly influence human well-being [5]. This disease reveals a strong association with the sea, with the great pandemics following the coastlines of the world. Vibrio cholerae not only survives in an almost dormant state during unfavorable conditions in the ocean, but also can attach itself to certain forms of ocean crustaceans. A single, barely visible copepod can carry 10^4 organisms, while just 10^3 organisms are sufficient to cause disease. When El Niño caused warming and heavy rains along the coast of South America, the warming of surface waters and an influx of nutrients precipitated a phytoplankton bloom, followed by a zooplankton bloom (copepods); these were monitored by satellite. A cholera epidemic followed in Peru, as far as 50 to 150 km inland. As Colwell pointed out, the importance of this example is that it requires data from disciplines as diverse as oceanography, microbiology, epidemiology, and satellite imagery in order to develop predictive models for critical diseases. The problem for the scientist, then, is not how to acquire and store personal research data, but how to access and make sense of data gathered by different groups and for different purposes.
Pertinent data resides in large-scale databases managed through diverse database software, on diverse platforms, and with entirely different data structures, each maintained by a distinct agency or individual. While the scientists who use each database are expert in its particular domain and are aware of its value and limitations, outside researchers may have great difficulty determining which data are more important, most reliable, or even most recent. The fact that formats and nomenclature are idiosyncratic to each database complicates the process. At a DOE workshop participants discussed the technical and sociological constraints that make it impractical to merge all genomic data into a single, consistent repository; they projected that autonomous organizations will continue to be the dominant mode for managing genomic information into the foreseeable future [31]. This situation extends to many other scientific disciplines as well. The problems are exacerbated when data spans multiple disciplines, as is the case for data that might establish significant ecological interrelationships. Now the researcher must cope not only with difficult access and lack of guidance about the database, but also with the distinct traditions of nomenclature, field methodology, data organization, etc., within each disciplinary domain. Many of the most compelling research problems in the biological and environmental sciences will not be approachable, however, until scientists access, interrelate, and synthesize data across disciplinary boundaries.

Database standardization is one possible solution, but this has proven to be problematical in practice. Consider the example of NSF’s Long-Term Ecological Research (LTER) [15] program, a systematic network of ecological research sites in operation since 1980. The twenty LTER sites span a range of ecosystem types, from arctic tundra to tropical rainforest.
Although each site has a different investigator-oriented mission, some measurements are made in every site in every year and the results are available in standard form — including primary production, nitrogen mineralization rates, standing crop, abundances of soil cations, detritus production, and censuses of dominant plant species. The majority of their data, however, is acquired and maintained in site-specific form. Where administrative units like LTER don’t exist, standards are even more difficult to enforce. At the same time, many valuable datasets are already in place; these are unlikely to be restructured or converted in the near future. Yet these independent, relatively small-scale (typically less than 100,000 records) databases, maintained by geographically dispersed research teams, contain some of the most critical and most up-to-date scientific information anywhere.

Table 1. Characteristics of Autonomous Research Databases (ARDs).

  Independent:            Created and managed by diverse, relatively isolated research groups
  Autonomous:             Ongoing need to maintain control over data and access
  Heterogeneous:          Hardware/software varies dramatically from site to site
  Non-uniform:            Each site has distinct ways of organizing data
  Non-standardized:       Each site may employ distinct data formats and nomenclature
  Limited expertise:      Little or no database technology expertise available in-house
  Inward focus:           Primary motivation is to support in-house research
  Financial constraints:  Little or no funding for making data available to others

This paper explores the problems in providing middleware and interface support for such databases, which we call autonomous research databases (ARDs). The key properties of ARDs are summarized in Table 1. In addition to the issues that were already outlined, ARDs typically are constrained in terms of the amount of funding or level of effort available to support database maintenance or the construction of access mechanisms.
The original accumulation and computerization of scientific research data is typically covered by discipline-specific research grants. Such grants are for conducting science, however, and do not generally include the hiring of database administrators or the development of database interfaces. As a result, virtually all ARDs are created and maintained entirely by domain specialists. They have only scant familiarity with database or interface technologies. Further, their chief concern is that the data be useful for their particular research projects. It is their sense of responsibility to the broader scientific community, not explicit support by the granting agencies, that motivates them to make their data available for other uses. The scientific community has become increasingly vocal in its concern about accessing the many research databases that have already been populated at a considerable investment of time and dollars [16, 34, 6]. All too often, scientists find themselves effectively shut out of even major databases, due to the user-hostile nature of query languages and interfaces. Smaller databases — even those with critical information for many other disciplines — may not even be available on the Internet. Yet if scientists cannot interact freely with remote databases, they must curtail the scope of their scientific inquiries or duplicate previous efforts by collecting similar data. With resources, facilities, and funding for research becoming more scarce, access to ARDs is now an economic as well as a scientific issue.

The remainder of this paper discusses how such access can be provided, without violating any of the essential constraints imposed by the nature of ARDs. In the next section, we explore the real costs involved in access to ARDs, and how these constraints enforce a new approach to software support.
The following two sections describe how the constraints can be accommodated through “split personalities”: interfaces and middleware layers that target different segments of the ARD community. Examples demonstrate how interfaces and middleware can respond specifically to the kinds of user needs we have identified in our collaborations with three research alliances: the Oregon Coalition of Interdisciplinary Databases (OCID [25]), the Long Term Ecological Research (LTER [15]) program, and the National Partnership for Advanced Computing Infrastructure (NPACI [19]). Another section compares our work with related research efforts, underscoring how usability considerations caused us to adopt an unusually flexible and platform-independent approach. The paper concludes with a discussion of how ARDs might be expected to change over the next few years, and the implications of these changes in terms of improving access to scientific databases.

Why “Split Personalities” Are Needed

While modern computing and communication systems provide the infrastructure to send bits anywhere, anytime in mass quantities, connectivity of itself does not ensure that communications are useful [11]. It certainly doesn't guarantee that professionals outside the scientific community — such as business leaders who are attempting sustainable use of natural resources or educators who wish to use “real-life” data in the classroom — will be able to access appropriate information. As Hammond explained, “The ideal [medical data] system permits us to have a system in which all who need data can have exactly what they need. It's complicated by the fact that there are [many] types of people that probably have a legitimate need for access to health care information” [24]. This is a problem of usability. It’s important that data be presented according to the context in which they will be used.
We propose that in order to meet this challenge, a data resource must have multiple “personalities.” This can be accomplished by presenting alternate interfaces to the database, each designed to accommodate a different segment of the user community or to facilitate a different type of use. No single interface is sufficient to reach dramatically different audiences. A recent workshop arrived at the same conclusion about access to digital libraries:

    You cannot have the same interface for a specialist in a scientific discipline and a schoolchild, so we should stop fooling ourselves. What we can have, though, are multiple interfaces that all access the same [digital] library but target different levels of understanding, and then provide ways to move back and forth between the levels. You may come in at a novice level, use the library, then decide you’re more of an expert, that you’d like to see a more professional approach. [35]

For the purposes of this discussion, we consider three general end-user expertise profiles:

• Domain specialist: scientist with expertise in the particular disciplinary area(s) represented by the ARD
• Non-domain specialist: scientist whose expertise is in some other disciplinary area; i.e., a cross-disciplinary user
• Domain novice: student who is specializing in the disciplinary area(s) represented by the ARD

While ultimately the audience for scientific databases could extend to the general populace (i.e., non-domain novices) as well, our work has focused on the needs of scientists and students of science. If the database is to be truly usable, its interfaces must respond to the requirements of different user profiles. This is evidenced by the fact that software users are no longer willing to put up with products that are difficult to learn or use [9]. Simply creating an elegant, powerful scientific database doesn’t mean that it will be used.
Even scientists and engineers — traditionally the group that was most tolerant of software idiosyncrasies — now expect their software to show evidence of usability [27, 26]. Yet the needs of students clearly differ from those of professional scientists, just as the needs of a microbiologist clearly differ from those of a forester. Table 2 gives examples of some of the distinctions among the three profiles.

Table 2. Examples of how needs differ according to end-user expertise profile.

  Type of terminology
    Domain specialist:      Precision is desirable
    Non-domain specialist:  May need mapping to terms from other disciplines
    Domain novice:          Specialized terms need explanation
  Need for explanation
    Domain specialist:      Only if ambiguous or controversial
    Non-domain specialist:  Only if ambiguous or unique to this discipline
    Domain novice:          Explanation is essential
  Usefulness of “extra” explanations
    Domain specialist:      Gets in the way and slows access to needed data
    Non-domain specialist:  If tangential, could obfuscate key issues or data use
    Domain novice:          Highly desirable as tool for clarification
  Key query needs
    Domain specialist:      Speed/ease of access (How quickly can I get the info I seek?)
    Non-domain specialist:  Obvious relevance of results (Is any of the data relevant to my work?)
    Domain novice:          Didactic value (How does this help me learn?)

Above and beyond the need for supporting different end-user profiles is the fact that the software for constructing multiple personalities must be tailored to the needs of ARDs. In an ideal world, there would be sufficient human resources to guarantee that staff with expertise in databases and user interface design were available to construct sets of split-personality interfaces for each ARD. The current situation is far from ideal. The typical ARD staff — as discovered through our OCID, LTER, and NPACI collaborations — is composed exclusively of professional scientists.
Only major collaborative efforts, such as the LTER program, have the resources to fund programmers or consultants with database expertise, and as far as we have been able to determine, no ARD has found it possible to hire specialists in user interface design. Thus, only some of the people who develop database interfaces can be considered “programmers.” This has implications for the types of middleware that will best support ARD interface development. Moreover, end-users have indicated that they want to be able to create their own interfaces. At the least, they want the capability of adapting or tailoring pre-built interfaces to their own research needs [28]. As they point out, if an interface is intended to provide access to a database in its entirety, it cannot also provide the quickest, simplest access to a very specific subset of that data. Here, we consider three general implementer expertise profiles:

• DB-technology familiar: database administrator or other person trained in computer science aspects of database management
• DB-content familiar: scientist who was responsible for acquiring the data and populating the ARD, or who followed that person in the role of ARD maintainer
• DB unfamiliar: scientist who is not a member of the group that created the ARD, but who has a particular need or interest in developing interfaces to it

Each profile represents a different skill set with respect to the technology that may be required to implement interfaces. It also represents a different degree of comfort with the intent and content of the ARD, and the knowledge of how to make that content more accessible to target users. Table 3 gives examples of how the profiles differ.

Table 3. Examples of how needs differ according to implementer expertise profile.
  Database skills
    DB-technology familiar:  Strong skills, based on both training and experience
    DB-content familiar:     Weak to strong skills, based on experience
    DB unfamiliar:           Unlikely to have skills
  Programming skills
    DB-technology familiar:  Very comfortable with programming and script languages, as well as SQL
    DB-content familiar:     Weakly to moderately comfortable with SQL
    DB unfamiliar:           Unlikely to have skills
  Knowledge of database organization
    DB-technology familiar:  Extremely familiar with database organization in general, as well as this database’s structure
    DB-content familiar:     Familiar with this database’s structure
    DB unfamiliar:           Unfamiliar with this database
  Understanding of purpose/limitations of data
    DB-technology familiar:  Moderately cognizant of database contents
    DB-content familiar:     Extremely cognizant of database contents
    DB unfamiliar:           Unfamiliar with database contents

We propose that these distinctions, too, imply the need for split personalities, each targeted at a different expertise profile. This time, however, it is the database middleware that should respond to the needs of its user communities. The rationale is again based on usability criteria: if it is too difficult or time-consuming to construct interfaces to an ARD, they simply will not be provided. For example, a strategy that requires that the database be in Boyce-Codd Normal Form will effectively rule out its implementation by any ARD that does not have DB-technology familiar staff. The next sections describe how our middleware products are designed to support different implementer expertise profiles, allowing ARDs to construct sets of interfaces that respond to different user expertise profiles. A key characteristic of the products is that they automate many of the activities that are most frustrating, time-consuming, or error-prone. This is a classic approach to improving usability, although it has not been applied very frequently to software for scientific users [14, 17, 2, 26, 27].

Split Personalities for Database Interfaces

The Web has been cited frequently for its role in making the Internet easier to use and more broadly meaningful [24].
The ubiquity of Web clients, the ease with which Web pages can be produced, and the familiarity of users with the format make the Web page an excellent choice for interfaces that can be tailored to the needs of widely differing groups. From the perspective of research scientists, however, Web-based access to databases has been only poorly supported and existing interfaces are extremely non-intuitive. In particular, two key developments are essential in order to achieve multiple personalities for research data. First, interface features must make it possible to minimize the amount of text that must be entered by particular user expertise profiles. Since the likelihood of inappropriate or misspelled values is much higher among a non-domain specialist or domain novice audience, it should be possible to develop query interfaces that are based on recognition (e.g., point-and-click from scrollable lists) rather than recall (e.g., typed input fields). For similar reasons, it must be possible to provide additional information (e.g., context-sensitive annotation, instruction, or help) that is no more than a click away in order to accommodate varying levels of user expertise. Second, the interface must be able to adapt as the database grows. ARDs are active research tools. Periodic updates are to be expected, and users should see completely up-to-date information at all times. The values included in scrollable lists or drop-down menus, for example, should be obtained dynamically at the time the screen is displayed, so that they reflect any recent additions or updates. How can these mechanisms be exploited to develop split personalities? We describe a case study from OCID [25]. Sherry Pittam, a lichenologist from the Dept. of Botany and Plant Pathology at Oregon State, had developed a database of Lichens of the Pacific Northwest, using data from several herbarium collections.
(A lichen consists of two mutually dependent organisms, a fungus and an alga, living in a symbiotic relationship.) The relational database, managed using Sybase [33], houses an extensive collection of literature, image, taxonomic, chemical, and ecological characteristics, as well as the registration data for the herbarium specimens. Pittam also downloaded and installed the HyperSQL middleware described in the next section, which she used to develop Web interfaces targeting different user communities. We use her work as an example of how split personalities can improve database usability. At the same time, we point out the middleware features that are needed to support interface personalities.

Interface A, for domain specialists with fully-specified queries. For professional lichenologists, the primary need is quick access to the information on particular lichens. The simplest interface to any database is a form where the user types in the desired characteristics — for example, naming the desired genus, species, etc. — and those values are passed to the database as query criteria. Such a form was implemented for the lichen database, but it had the obvious disadvantage of being prone to typing errors. Further, in lichenology as in many other biological disciplines, taxonomical nomenclature has evolved over time so that a particular species may have had several names over the last century. This means that it was possible for a user to request information on a species and be told that no information was available, when in fact the lichen was present but under a different name. To accommodate this problem, the interface was modified to eliminate all user typing. Instead of presenting blank input areas, it pre-fetches all possible values for an attribute and displays them in a list (pulldown for a small number of values; scrollable when there can be many).
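The pre-fetch step can be sketched with any SQL backend. The following illustration uses Python with an in-memory SQLite database standing in for the Sybase system; the table and column names are invented for illustration and do not reflect the actual lichen schema:

```python
import sqlite3

# Toy stand-in for the lichen database (the real system used Sybase;
# this schema is invented for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lichen (genus TEXT, species TEXT, habitat TEXT)")
conn.executemany("INSERT INTO lichen VALUES (?, ?, ?)", [
    ("Peltigera", "canina", "forest floor"),
    ("Peltigera", "membranacea", "forest floor"),
    ("Usnea", "longissima", "conifer branches"),
])

def pick_list_values(column):
    """Fetch the current set of values for an attribute each time the
    screen is built, so lists reflect recent additions and updates.
    The column name comes from the interface definition, not from
    end-user input, so interpolating it here is safe."""
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM lichen ORDER BY {column}")
    return [value for (value,) in rows]

print(pick_list_values("genus"))  # → ['Peltigera', 'Usnea']
```

Because the list is rebuilt on every display rather than hardcoded into the page, a newly registered specimen appears in the pick list immediately.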
The user selects values from the lists that are of interest, such as a particular genus, letting the other lists default to “any match.” This provides a robust and simple-to-use interface for anyone interested in retrieving information about a specific lichen or about lichens that share a specific characteristic, such as occupying a particular type of habitat. [Authors’ note: because this interface is so simple, we do not include a figure; one can be added later if desired.] The interface has an added advantage for first-time or infrequent users of the database. The user can choose query criteria by recognition, rather than having to remember or guess a valid keyword or value. This allows users to effectively “see” what is included in the database before they actually initiate any queries. For example, a lichenologist interested only in Peltigera spp., or only in lichens from estuarine habitats, would be able to determine immediately if the database included such information.

Interface B, for domain specialists with incompletely-specified queries. Note that interface A requires that the user be able to specify the query criteria completely. To support more open-ended queries, a second interface allows the user to submit queries that are only partially specified. This interface supports the identification of lichens on the basis of chemical reactions. Lichenologists may supply the reactions from a full suite of chemical tests, thereby guaranteeing that a unique species will be identified. Alternatively, tests may be applied in some arbitrary order, with one or more results used as the basis for an incompletely-specified query (i.e., some attributes are left as “don’t-care” values). In some cases, as few as one or two tests may be sufficient for successful identification; see Figure 1.

Figure 1. Lichen database, Interface B, showing how the user can specify just partial results of chemical tests.

This is possible because of how the interface is structured.
Normally, the identification of biological specimens requires making a series of comparisons using structured questions (e.g., Is it bigger than a tennis ball? If so, is it slick or is it fuzzy?). This classic structure is called a dichotomous key because of its reliance on yes/no questions. The drawback is that if one of the decisions cannot be made (e.g., you cannot determine if the color is mauve or taupe), the process fails. A more human-friendly approach is to support identification through a synoptic key, which uses a checklist of compiled characteristics, each with a set of potential values. The user identifies an organism by checking off known characteristics. Organisms not fitting the pattern are eliminated, resulting in a list of one or more organisms that match the user's observations. Thus, synoptic keys allow the user to make an identification on the basis of incomplete or imprecise information. Synoptic keys can be implemented using any Web-to-database middleware that is capable of prefetching and displaying lists of potential attribute values. The user chooses as many or as few attributes as desired, and the query is formulated as a series of conjunctions (for performance reasons, it is usually desirable to build special tables or indices reflecting the dependencies between attributes). A successful identification has been made when sufficient attributes have been selected to reduce the “scope” of the query to just one species. HyperSQL adds a key feature that significantly improves usability for this type of incompletely-specified query: iterative query scoping. Consider the case of a scientist who wishes to minimize the number of tests needed to identify a lichen. Without special mechanisms, it will be necessary to query the database repeatedly as each test is performed, specifying all results known so far and then seeing if the result has been narrowed to a single species.
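Both mechanisms can be sketched in a few lines. The following illustration (Python with SQLite; the miniature synoptic-key table and its chemical-test columns are invented for illustration and are not the real schema) shows the query formulated as a conjunction of the user's selections, and a re-scoping step that computes which values of a remaining attribute can still co-occur with the partial selections:

```python
import sqlite3

# Invented miniature of a synoptic-key table: one row per species,
# one column per chemical test result.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE synoptic_key (species TEXT, cortex_k TEXT, medulla_kc TEXT)")
conn.executemany("INSERT INTO synoptic_key VALUES (?, ?, ?)", [
    ("species A", "yellow", "red"),
    ("species B", "yellow", "none"),
    ("species C", "none",   "none"),
])

def matches(selections):
    """Conjunction of the user's selections; unselected attributes
    default to 'any match'.  Column names come from the interface
    definition; only the values are user-supplied (bound as parameters)."""
    where = " AND ".join(f"{col} = ?" for col in selections) or "1=1"
    rows = conn.execute(
        f"SELECT species FROM synoptic_key WHERE {where}",
        list(selections.values()))
    return [s for (s,) in rows]

def rescope(selections, remaining_col):
    """Which values of a remaining attribute can coexist with the
    partially specified query?  A single surviving value means that
    test is already decided and need not be performed."""
    where = " AND ".join(f"{col} = ?" for col in selections) or "1=1"
    rows = conn.execute(
        f"SELECT DISTINCT {remaining_col} FROM synoptic_key WHERE {where}",
        list(selections.values()))
    return [v for (v,) in rows]

print(matches({"medulla_kc": "red"}))   # → ['species A']
print(rescope({"medulla_kc": "none"}, "cortex_k"))  # two candidate values remain
```

In this toy data, a single Medulla KC result of “red” already narrows the scope to one species, mirroring the situation shown in Figure 2.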
Note that although the database includes embedded information that could tell the lichenologist which of the remaining tests will further reduce the scope of the results, there is no way for the user to access this information directly. HyperSQL provides that service. After selecting the known test results, the user can choose to re-scope the query instead of submitting it. This directs the middleware to determine which remaining test values can coexist with the partially specified ones. The lists of remaining attribute values are updated to reflect the narrower scope. When the user’s selections reduce the scope of some other attribute to just one value, it is shown as the “selection.” This effectively tells the scientist that a test need not be made, as it will not further narrow the scope of the query. Figure 2, for example, shows how query re-scoping changed all remaining fields from “*” to some value. This indicates that the result of the single Medulla KC test was sufficient to uniquely identify the lichen.

Figure 2. Lichen database, Interface B, after the query has been re-scoped; it turns out that no other tests are needed.

Interface C, for non-domain specialists. To support the needs of scientists who are not lichen specialists, a third interface implements a synoptic key for identifying specimens based on their physical characteristics. This allows the user to select the most familiar or obvious characteristics (e.g., lobe shape) first. The availability of query re-scoping means that selections based on more obscure features (e.g., podetia) can be deferred until it is certain that they will be needed to successfully identify the specimen. Two other HyperSQL mechanisms are particularly useful in the context of users from other scientific disciplines. First, cascaded queries can be used to link together related information that may not be essential for domain specialists, but might be important for outsiders.
The availability of information on lichen habitat is an example. From the perspective of a lichenologist, this is tangential information that is not used in keying out specimens. For foresters or ecological scientists, on the other hand, this may well be the most important data. Interface C is structured to include habitat information with every query result. Moreover, the habitat appears as a link, similar to a normal Web hyperlink; when selected, it transparently generates an additional query and presents the user with information on the habitat and a list of all species associated with that habitat (see Figure 3).

Figure 3. Lichen database, Interface C, showing the effects of a cascaded query to reveal other lichens occupying the same habitat as Tuckermannopsis platyphylla (Tuck.) Hale.

Second, the user can choose to restrict the number of results that will be returned by the query. This feature is particularly important for first-time or infrequent users of the interface, who have no idea whether their query is sufficiently specific to return one, ten, or ten thousand records. Explicit interface controls allow the user to decide the relative number of results that are appropriate for his or her particular use. Without them, the browser may be stalled while large numbers of records are downloaded, or the user may be forced to interrupt the process manually.

Interface D, for biology students. Recognizing that the database content, but not the technical presentation, would be useful for educational settings, Pittam also constructed an interface called LichenLand for use by high school and undergraduate students in biology courses. The appearance is dramatically different, since the synoptic key is icon-driven. Each taxonomic characteristic is illustrated with a cartoon to help the student create associations for the technical terminology (Figure 4).

Figure 4. Lichen database, Interface D, showing how cartoons are used to illustrate the taxonomic characteristics.

As for Interface C, iterative query scoping and cascaded queries facilitate free exploration through related information. The interface also exploits HyperSQL’s mechanism for context-sensitive explanation to improve its usefulness for the target user profile. Each of the taxonomic characteristics is linked to an illustrated discussion of what the characteristic means, including pictures reflecting each possible value (Figure 5). One consequence is that in many cases, lichen identifications can be accomplished entirely by comparing the specimen with pictures. This approach makes it possible for even a completely untrained user to experience success in taxonomic identification. A number of educators have adopted the interface for use in high school biology and general science courses [22].

Figure 5. Lichen database, Interface D, showing context-sensitive explanatory material that helps students learn about lichen characteristics.

Split Personalities for Database Middleware

Just as database end-users impose multiple sets of requirements on interfaces, the users who build those interfaces reflect multiple communities, with distinct characteristics and distinct types of needs. In this section, we discuss how database middleware can respond to them by supporting multiple personalities. In particular, middleware design has to take into account the fact that ARDs have limited expertise, time, and funding available. Adding database personalities cannot place an undue burden on the research group that maintains the data and is making it available to others.

The OCID partners have established a set of general requirements for middleware. Since they are intended to make it feasible for research databases from many disciplines to participate in cross-database integration efforts, they encapsulate some of the unique needs of ARDs:

Expertise required.
It is critical that the expertise necessary to develop a Web interface be within the skill range of a scientific researcher. Many of the key repositories of scientific information are in the hands of small research groups that do not include computer scientists or database professionals. Also, the effort involved in bringing a new database online should be minimal, so as not to disenfranchise researchers who work alone or who receive no external funding. In particular, database owners must not have to restructure, normalize, or otherwise transform their data simply to permit access by outsiders.

Security. It is also important that research databases remain autonomous, and that policies for granting access remain under the control of the individual institutions or agencies that established them. Some research data, such as distributions of endangered species or personal health data, must be safeguarded as sensitive information. It should be possible to maintain access privileges at the database and operating system level, rather than depending on Web security mechanisms (which are known to be unreliable). In addition, it must not be necessary to maintain the Web interfaces on the same machine as the database; this is particularly important when the data includes sensitive information. It should be possible to support a full range of queries while granting only read access to the database.

Software required. Research groups must not be required to have particular database software in order to participate. Both commercial and public-domain software are in widespread use; it is not realistic to expect researchers to acquire new software and convert their databases. The Web interface software should be capable of interacting with a range of database software and should not require any special browser capabilities.

Other features. To the extent possible, the interfaces to research databases should always reflect up-to-date research data.
Special mechanisms are needed to support this; they must be simple, however, in order to overcome researchers' inclination to hardcode information as static lists in Web interfaces. Finally, it should be possible for a skilled user to build his/her own specially tailored interface to a database. Ideally, the user could start with an interface provided by the research group, and then modify it to suit individual needs.

No commercial software products satisfy these requirements. Indeed, several fail to satisfy any of them at all [20]. In response to what is clearly an important emerging need, we worked with other researchers at the Northwest Alliance for Computational Science and Engineering (NACSE [23]) to develop Web-to-database middleware that would meet all the basic requirements of ARDs. Further, we structured the middleware as a suite of components, each addressing the needs of a different type of interface implementer. The components provide three different approaches for specifying database interfaces: markup language, script-based tool, and GUI-based tool. Each approach is described below, in terms of the type of support it offers to a particular interface implementer profile. By way of example, we show how a simple query interface (Figure 6) would be implemented using each middleware product.

Markup language, for implementers who are DB-technology familiar. With this approach, Web interface files are constructed by embedding special directives and macros into HTML text files. Before the file is sent to the browser, a translator executes the special constructs and performs whatever SQL queries are needed to acquire specified data from the target database. Our markup language is called Query Markup Language, or QML. An example interface specification appears in Figure 7. QML is intended for DB-technology familiar users.
Therefore, it exploits the user's assumed familiarity with HTML and SQL — a QML interface specification is basically an HTML file with additional tags, macros, and fragments of SQL. The HTML is used to control interface appearance and user interaction, while the QML constructs manage the interaction between the interface and the target database. The notation is straightforward for users who have already mastered HTML forms. Any of the standard forms components (e.g., scrolled lists, buttons, pull-down lists, checkboxes) can be used in the normal way. For example, the following HTML tags are used to create a group of radio buttons, where the user can select exactly one item by clicking on the small "radio button" to the left of the item:

    <input type=radio name=choice value="first"> first-choice-here
    <input type=radio name=choice value="next"> next-choice-here
    ...

One input tag is needed for each radio button to be included in the group.

Figure 6. Simple interface to the Microbial Germplasm Database [8].

    <qml_screen name=query default>
    <center><H1>General Query</H1></center><p>
    <center>* (star) is the wildcard character</center>
    <p><b>Type of Organism:</b><p>
    <input type=radio name=organism_type value=Algae checked>Alga
    <input type=radio name=organism_type value=Bacterium>Bacterium
    <input type=radio name=organism_type value=Fungus>Fungus
    <input type=radio name=organism_type value=Nematode>Nematode
    <input type=radio name=organism_type value=Virus>Virus
    <p><b>Organism Name:</b>
    <table><tr><td>Genus:
      <qml_input name=genus sql="Select distinct genus from Organism"></td>
    <td>Species: <input type=text name=species value="%"></td>
    <td>Variety: <input type=text name=variety value="%"></td>
    <td>Strain: <input type=text name=strain value="%"></td>
    </tr></table>
    <table><tr><td><td><qml_submit target=results><td><qml_new></table>
    <p><hr><p>
    Constructed using QML
    </qml_screen>

Figure 7. QML specification for the example interface shown in Figure 6.
Collectively, the tags direct the browser to display a list of radio button items, each showing a button followed by the indicated string ("first-choice-here," "next-choice-here"), etc. When the user chooses a button, the corresponding value (e.g., "first") will be stored in the variable choice.

QML tags make it possible to populate lists and other forms elements with data that is actually retrieved from the database. To present the user with radio buttons for selecting the species he/she wants information on, the equivalent QML construct is used:

    <qml_input type=radio name=choice
      sql="Select distinct species from Organism where genus='$genus'" multiple>

This one statement directs the QML translator to issue a query to retrieve the species names from table Organism, and to use the results to build a list of radio buttons, each labeled with the name of a different species found in the database.

The advantage of a markup language approach is that the developer retains complete control over the layout and functionality of the interface. Since QML's special directives and macros give the developer the expressive power of a programming language — such as loops, assignments, and conditional logic — it is possible to build highly customized query interfaces. The tradeoff is that, like other markup languages, QML leaves error detection and recovery to its user. Misspelled tags are ignored by both the translator and the browser. As with other flexible uses of SQL, it is perfectly possible to write SQL queries that are syntactically correct but produce spurious results (for examples, see [20]). If errors do occur, the end-user is directly exposed to the extremely cryptic error messages generated by the database management software. The learning curve is small for experienced database technologists, but potentially very steep for non-computer scientists. The user of QML is assumed to be extremely familiar with SQL and with database terminology from computer science.
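To make the translator's role concrete, the following sketch shows the kind of expansion a QML-style translator performs: the embedded SQL is executed, and each result row is rewritten as one standard HTML radio button. This is purely illustrative code, not part of the actual QML implementation; the function name and the hard-coded result rows are invented for the example.

```python
# Illustrative sketch of a QML-style expansion (NOT the actual QML
# translator): each row returned by the embedded SQL query becomes
# one HTML <input type=radio> element labeled with the row's value.

def expand_radio_group(name, rows):
    """Rewrite query results as a group of HTML radio buttons."""
    buttons = []
    for value in rows:
        buttons.append(
            '<input type=radio name=%s value="%s"> %s' % (name, value, value)
        )
    return "\n".join(buttons)

# In the real system, `rows` would come from executing the SQL embedded
# in the qml_input tag (e.g., "Select distinct species from Organism ...").
# Here they are simply hard-coded for illustration.
rows = ["platyphylla", "orbata"]
html = expand_radio_group("choice", rows)
```

The browser then renders these elements exactly as it would hand-written HTML, which is why the end-user never sees any difference between static and database-populated forms.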
The directives and macros constitute a programming language, so he/she is also expected to be familiar with programming concepts. QML relies on HTML structures for managing interface appearance and on cgi-bin scripts for handling user input, so familiarity with these is needed as well. Finally, in order to formulate queries that cross table boundaries, the database software's proprietary command language must be used to examine table and foreign-key structure.

Script-based tool, for implementers who are DB-content familiar. This type of tool analyzes schema specifications written in a specialized scripting language, then generates the appropriate query files and HTML text. A high-level specification language shields the implementer from the low-level details of SQL, HTML, and proprietary database management libraries. The tool reports problems with the specification and automatically generates SQL and HTML for query input and output pages. Our script-based tool is HyperSQL, or HSQL. The specification for the example interface is shown in Figure 8.

This tool is targeted at DB-content familiar users. Unlike QML, the scripting language used by HSQL does not build on the user's familiarity with HTML. Instead, it streamlines interface design by establishing defaults for almost every aspect of both interface appearance and interactions with the target database. Thus, it is possible to construct an HSQL interface with just a couple of dozen lines of text. HSQL does this by producing the basic style of interface most commonly requested by scientists: a "query screen" where the user indicates the characteristics to be searched for, a "results screen" providing a summary listing of the records found to match, and a collection of navigable "browse screens" providing detailed information, graphics, sound clips, etc. The HSQL specification is structured into logical sections reflecting that organization.
Each section is delimited by begin/end statements, allowing the HSQL tool to detect and report many potential errors. Within a section, the user includes statements such as

    OUTPRINT SUPPRESS_IF_EMPTY "Genus: DATA_FIELD=Genus"

to display the value retrieved for genus. Special options make it easy to deal with interface issues that have traditionally been problematical. For example, SUPPRESS_IF_EMPTY indicates that the label ("Genus:") should not be shown if the record being displayed has no value stored for this field. The terminology is simplified to reflect the way scientists think about databases — e.g., "field" and "list" rather than the database terms "attribute" and "relation."

    QUERY BEGIN
      BANNER "<center><b>General Query</b></center>";
      HEADER "<center>* (star) is the wildcard character</center>";
      INPRINT "<b>Type of Organism:</b><p>";
      INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Alga";
      INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Bacterium";
      INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Fungus";
      INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Nematode";
      INPUT organism_type TYPE=RADIOLIGHT DEFAULT="Virus";
      INPRINT "<p><b>Organism name:</b>";
      INPRINT "<table><tr><td>Genus: ";
      INPUT Genus TYPE=QUERYLIST="SELECT distinct genus FROM Organism";
      INPRINT "</td><td>Species: ";
      INPUT Species TYPE=TEXTINPUT;
      INPRINT "</td><td>Variety: ";
      INPUT Variety TYPE=TEXTINPUT;
      INPRINT "</td><td>Strain: ";
      INPUT Strain TYPE=TEXTINPUT;
      INPRINT "</td></tr></table>";
      FOOTER "<hr>Constructed using HyperSQL";
    QUERY END

Figure 8. HSQL specification for the example interface shown in Figure 6.

A tool of this type provides considerably more automation than a markup language can. Correct SQL is composed automatically; all the user needs to supply is the fields that should be searched or returned by the query. Similarly, the HTML needed to format the interface pages is generated automatically.
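The automatic query composition just described can be illustrated with a short sketch. This is a reconstruction of the general technique, not HSQL's actual code; the function name and the wildcard-translation rule (mapping the interface's `*` to SQL's `%` for a LIKE match) are assumptions made for the example.

```python
# Illustrative sketch (not HSQL's actual implementation) of composing
# a syntactically correct SELECT statement from user-supplied fields.
# "*" is treated as the interface wildcard and mapped to SQL's "%";
# empty or pure-wildcard fields contribute no constraint at all.

def compose_query(table, fields):
    """Build a SELECT statement from a dict of field -> user input."""
    clauses = []
    for field, value in sorted(fields.items()):
        if value in ("", "*"):        # unconstrained field: skip it
            continue
        pattern = value.replace("*", "%")
        clauses.append("%s LIKE '%s'" % (field, pattern))
    sql = "SELECT * FROM %s" % table
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql

# Genus constrained by a partial match; the wildcard species is dropped.
sql = compose_query("Organism", {"genus": "Tuck*", "species": "*"})
```

A production tool must of course also escape quote characters in user input; the point here is only that the implementer supplies field names and values, never raw SQL.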
Special format features include automatic inclusion of buttons for submitting the query, limiting the number of results returned, and performing query re-scoping (features described in the section on interface personalities). Extensive error-checking services are provided, not only to check the validity of the HSQL specification itself, but also to "re-package" any errors returned by the database so that they will be intelligible to end-users. The tradeoff is that HSQL imposes more structure than a markup language does. The implementer is no longer free to use any conceivable format, nor can SQL be used in unorthodox ways.

GUI-based tool, for implementers who are DB unfamiliar. In this case, a graphical tool is used to construct a graphical specification, which is converted to the appropriate query files. A separate gateway program, invoked by the Web server, translates the query files into a stream of HTML text for the browser and streams of SQL commands for the target database. Our tool is called QueryDesigner.

GUI-based tools are intended for people who wish to construct database interfaces but are not necessarily familiar with HTML and SQL programming. QueryDesigner takes this concept a step further, supporting users who are not even familiar with the data or the database organization. The highly structured environment makes it easier for inexperienced users to quickly tailor an existing query interface — or compose a new one — without direct exposure to SQL or any textual specification mechanism. Like HSQL, QueryDesigner makes assumptions about the types of interfaces that scientists need. In fact, it mimics the styles that we observed scientists creating with other specification tools, including HSQL [27]. To personalize an existing interface, the user clicks on the area of the screen to be changed.
A series of guided screens lets the user change the appearance of the screen, eliminate fields or add new ones, associate default values with particular fields, or link to associated data and/or Web pages, all through simple point-and-click operations. Figure 9 shows one such screen, as it would appear if a user wanted to modify the example interface.

Figure 9. QueryDesigner screen for defining the example interface shown in Figure 6.

A unique aspect of QueryDesigner is the way it supports interface design for unfamiliar databases. Note that to employ markup languages or script-based tools, the user must know the names of database tables and attributes (to specify query criteria), something about the format of the data (to specify how it should be displayed), and which attributes serve as primary and foreign keys (to specify how tables can be joined for complex queries). QueryDesigner eliminates the need for this specialized information. Instead, the tool contacts the target database on behalf of the user and queries it to obtain the structure of the database. Any associated metadata is retrieved as well, although there is no requirement that it be present. QueryDesigner uses this information to construct an entity-relationship diagram showing how data is organized into tables and how tables relate to one another. (The choice of an E-R diagram was based on observations that scientists find its meaning intuitively clear [20].) The diagram is actually interactive: by clicking on boxes and connectors, the user can compose queries without even understanding what a join is (Figure 10).

Another important feature of QueryDesigner is its ability to prevent errors. It has been demonstrated that the interfaces generated using this tool reduce by 76% the number of potential error points associated with SQL queries [20].
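The schema-discovery step described above can be approximated with ordinary catalog queries. The sketch below is hypothetical: the production tool interrogated the catalogs of databases such as Sybase and Oracle, whereas SQLite stands in here (via its PRAGMA facilities) purely to make the idea concrete, and the two-table lichen-style schema is invented for the example.

```python
# Sketch of schema introspection of the kind QueryDesigner performs.
# SQLite is used only as a stand-in target; the real middleware talked
# to Sybase/Oracle system catalogs. The schema itself is invented.
import sqlite3

def discover_schema(conn):
    """Return {table: {"columns": [...], "foreign_keys": [(col, ref_table)]}}."""
    schema = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        cols = [r[1] for r in conn.execute("PRAGMA table_info(%s)" % t)]
        fks = [(r[3], r[2]) for r in conn.execute(
            "PRAGMA foreign_key_list(%s)" % t)]
        schema[t] = {"columns": cols, "foreign_keys": fks}
    return schema

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Habitat (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE Organism (genus TEXT, species TEXT, "
             "habitat_id INTEGER REFERENCES Habitat(id))")
schema = discover_schema(conn)
```

The tables become the boxes of the E-R diagram and the foreign-key pairs become its connectors, which is exactly the information a user needs in order to click together a join without writing one.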
Because all SQL is generated automatically, and because the tool has detailed knowledge of the organization of the target database, it can prevent the user from specifying queries that would violate reasonable constraints. Because the tool is graphical, typographic and orthographic errors are also eliminated outright. The tradeoff here is that the user cedes all direct control over the interface specification. Everything is accomplished indirectly, via the tool. Interface appearance and query functionality are of necessity more restricted than they are with markup languages or script-based tools.

Figure 10. QueryDesigner screen showing the organization of the target database for the example.

Related Work

Many of the proposed solutions for constructing Web-based query interfaces have not progressed beyond the stage of research prototypes, but there are a few products currently available. All employ a Common Gateway Interface (CGI), where a middleware component, or gateway, is positioned between the browser and the target database. The gateway receives user input from forms displayed by the browser, translates it into appropriate queries and sends them to the database, then receives the query results and reformats them as HTML documents for display in the browser's window. From the perspective of usability, current middleware differs along three dimensions: (1) how many languages the query designer must know in order to build the forms and specify how queries should be constructed; (2) restrictions on where the gateway can execute; and (3) what types of target database are supported. The systems are summarized in Table 3; they have been described in detail elsewhere [20, 21].

Table 3.
Comparison of Web-based Query Interface Builders

Cold Fusion [1]
  Language expertise needed: HTML, SQL, additional markup language
  Location of gateway: same Windows machine as database
  Target databases supported: some Windows-based databases

JFactory [32]
  Language expertise needed: HTML, SQL, Java plus C or C++
  Location of gateway: same Windows/NT machine as database
  Target databases supported: Weblogic T3/Client server and JDBC (but low-level calls to drivers must be coded manually)

Genera/Web [13]
  Language expertise needed: tool-specific schema definition language
  Location of gateway: same machine as database
  Target databases supported: Sybase

WDB [30]
  Language expertise needed: HTML, SQL, Perl
  Location of gateway: same machine as database
  Target databases supported: Sybase, Informix, or Mini-SQL

WebinTool [3]
  Language expertise needed: HTML, SQL, tool-specific macro language
  Location of gateway: same machine as database
  Target databases supported: Ingres, Informix, or Mini-SQL

dbWeb [12]
  Language expertise needed: HTML, SQL, additional markup language
  Location of gateway: same Windows/NT local-area network as database
  Target databases supported: Windows/NT-based ODBC-compliant databases

Sapphire/Web [4]
  Language expertise needed: HTML, C
  Location of gateway: same Windows95/NT local-area network as database
  Target databases supported: Windows/NT-based Sybase, Oracle, or Informix

W3QS [10]
  Language expertise needed: specialized version of SQL
  Location of gateway: any UNIX machine
  Target databases supported: searches Web documents rather than databases

HyperSQL [21]
  Language expertise needed: minimal HTML, minimal SQL, HSQL scripting language
  Location of gateway: database, browser, or any third machine (UNIX or Windows/NT) accessible by Internet
  Target databases supported: Sybase, Oracle, or any ODBC-compliant database, on any platform

QML [7]
  Language expertise needed: HTML, minimal SQL, QML markup language
  Location of gateway: database, browser, or any third machine (UNIX or Windows/NT) accessible by Internet
  Target databases supported: Sybase, Oracle, or any ODBC-compliant database, on any platform

QueryDesigner [29]
  Language expertise needed: minimal HTML
  Location of gateway: database, browser, or any third machine (UNIX or Windows/NT) accessible by Internet
  Target databases supported: Sybase, Oracle, or any ODBC-compliant database, on any platform

The large database management system vendors — including Oracle, Informix, Sybase, and IBM, among others — have also released products that facilitate the construction of Web interfaces. In each case, however, service is restricted to a particular database product, and the Web server must reside on the database machine or its LAN.
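The common gateway pattern shared by all of these systems can be reduced to a short sketch: form input comes in, a query goes out, and the result rows come back reformatted as HTML. This is a hypothetical illustration, not code from any of the products above; the table name `Organism`, the simple form-to-WHERE-clause rule, and the stand-in database function are all invented, and a real gateway would add quoting/escaping, connection management, and error repackaging.

```python
# Minimal sketch of the CGI gateway pattern common to these systems
# (hypothetical; no product's actual code). Form fields are translated
# into a query, and the result rows are reformatted as an HTML table.

def handle_request(form, run_query):
    """form: dict of field -> value; run_query: callable(sql) -> rows."""
    clauses = ["%s = '%s'" % (f, v) for f, v in sorted(form.items()) if v]
    sql = "SELECT * FROM Organism"          # invented target table
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    rows = run_query(sql)
    body = "".join(
        "<tr><td>%s</td></tr>" % "</td><td>".join(r) for r in rows)
    return "<html><body><table>%s</table></body></html>" % body

# Stand-in for the round trip to the database:
def fake_db(sql):
    return [("Tuckermannopsis", "platyphylla")]

page = handle_request({"genus": "Tuckermannopsis"}, fake_db)
```

The three dimensions of Table 3 map directly onto this sketch: the languages the designer must know determine how the form and the SQL template are authored, the gateway's location determines where this function runs, and the supported databases determine what can sit behind `run_query`.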
HyperSQL [21], QueryDesigner [29], and QML [7] are the only examples of Web-to-database middleware that are capable of connecting to remote databases located anywhere. They are also the only products that support a full range of databases on both UNIX and Windows platforms.

Conclusions

To meet the challenge of understanding complex environmental relationships, scientists must be able to access and assimilate research data from many ARDs representing many disciplines. Acquiring such information is problematical for a variety of reasons. Since ARDs were created for many different purposes, their data are stored using distinct formats and nomenclature, database organization, and hardware/software platforms. The databases are independent, reside at geographically dispersed sites, and are unlikely ever to be co-located. Database standardization is a potential solution, but in practice ARDs are maintained for particular research projects that do not have funding to support such efforts. It is the scientists' sense of responsibility to the broader community that prompts them to add interfaces for more general access to their data.

The Web has been welcomed as a mechanism for supporting platform-independent access to remote data. Web access alone, however, does not guarantee that ARDs will be usable. We identified several key usability requirements from the perspective of potential end-users:

(1) Interfaces must not require that end-users have a priori knowledge about the organization of the database being queried (i.e., they should make it possible to explore and use unfamiliar ARDs).

(2) End-users should be able to perform basic searches by selecting from available choices, rather than having to supply input values.

(3) It should be possible for users to navigate through complex hierarchies of data in incremental steps that successively narrow the scope of search.
(4) Potential sources of user error should be minimized, as should sources of inadvertent delays, such as lengthy downloads of information that was not actually desired.

No single interface can satisfy these needs for all types of end-users. As demonstrated through a series of examples, interface personalities provide a way to serve multiple audiences. Interface appearance and functionality can be tailored to the specific needs of each type of end-user.

The same concept can be applied to interface middleware. Working with scientist end-users, we built middleware that addresses the key usability requirements of ARD owners:

(1) To shield ARD owners from having to learn specialized languages or techniques in order to publish their data, middleware installation and maintenance must be within the skill set of a reasonably adept non-computer professional (e.g., it cannot require expertise in operating systems, object-oriented systems, SQL, or Web servers).

(2) Middleware must not require the purchase of specialized software or involve complex installation procedures.

(3) The steps required to implement database interfaces must be fast and easy to learn.

(4) To facilitate sharing and reuse of query forms, query interfaces must be customizable, storable, and shareable by end-users.

By creating middleware personalities that respond to the specific skill sets and interests of different implementer expertise profiles, we were able to keep the software simple and usable without sacrificing flexibility. Clearly, there are tradeoffs at different levels of user support. The interface personalities requiring the least user expertise are restricted in terms of how queries can be constrained or combined. For middleware personalities, too, flexibility must be sacrificed to enjoy more structure and guidance. Because the personalities coexist, however, the end-user or implementer is always free to move to a different level of support.
The ultimate advantage of split personalities, then, is that as needs and skill sets evolve, the user can elect to change to a new, more appropriate personality.

Acknowledgments

The lichen databases and their Web interfaces were developed by Sherry Pittam of the Oregon Coalition of Interdisciplinary Databases; she now works for the Northwest Alliance for Computational Science and Engineering (NACSE). QML was developed by Ken Ferschweiler of NACSE. We gratefully acknowledge their help in providing the examples shown here. HyperSQL and QueryDesigner were developed by Mark Newsome, in collaboration with Cherri Pancake and Joe Hanus, as part of a joint project of the Microbial Germplasm Database and NACSE. MGD is funded by NSF grants BIR 95-03712 and BIR 94-02192 and the U.S. Department of Agriculture. NACSE is a Metacenter Regional Alliance sponsored by NSF grant ASC 95-23629, the National Partnership for Advanced Computational Infrastructure, and the U.S. Department of Defense HPC Modernization Program.

References

[1] Allaire Corporation. 1995. Cold Fusion White Paper. Available online at http://www.allaire.com/cfusion.
[2] Baecker, R.M. and W.A.S. Buxton. 1987. Cognition and Human Information Processing. In Readings in Human-Computer Interaction: A Multidisciplinary Approach, Baecker and Buxton, eds., Morgan Kaufmann, pp. 207-218.
[3] BBSRC Roslin Institute. 1995. WebinTool. Available online at http://anita.jax.org/webintool0.921/docs/user-guide.txt.
[4] Bluestone Corporation. 1995. Sapphire/Web Manual. Available online at http://www.bluestone.com/.
[5] Colwell, R.R. 1996. Global Climate and Infectious Disease: The Cholera Paradigm. Science, 274:2025-2031.
[6] Davis, F.W. 1995. Information Systems for Conservation Research, Policy, and Planning. Bioscience, special biodiversity supplement, Summer 1995, pp. 36-41.
[7] Ferschweiler, K. 1998. Query Markup Language. Available online at http://www.nacse.org/qml.
[8] Hanus, J.F., M. Newsome, L. Moore and C.
Pancake. 1995. The Microbial Germplasm Database. Available online at http://mgd.nacse.org/cgi-bin/mgd.
[9] Holtzblatt, K. and H. Beyer. 1993. Making Customer Centered Design Work for Teams. Communications of the ACM, 36(10):93-103.
[10] Konopnicki, D. and O. Shmueli. 1995. W3QS: A Query System for the World-Wide Web. Proceedings of the 21st VLDB Conference, Zurich, Switzerland.
[11] Kyng, M. 1991. Designing for Cooperation: Cooperating in Design. Communications of the ACM, 34(12):65-73.
[12] Laurel, J. 1995. dbWeb White Paper. Aspect Software Engineering. Available online at http://www.microsoft.com/winterdev/.
[13] Letovsky, S.I. 1994. Web/Genera. Johns Hopkins Medical Institutes. Available online at http://gdbdoc.gdb.org/letovsky/genera/genera.html.
[14] Lewis, C. and D.A. Norman. 1987. Designing for Error. In Readings in Human-Computer Interaction: A Multidisciplinary Approach, Baecker and Buxton, eds., Morgan Kaufmann, pp. 627-638.
[15] LTER: National Research Sites with a Common Commitment. Available online at http://lternet.edu/.
[16] Lubchenco, J. 1995. The Role of Science in Formulating a Biodiversity Strategy. Bioscience, special biodiversity supplement, Summer 1995, pp. 7-9.
[17] Marcus, A. 1993. Human Communications Issues in Advanced UIs. Communications of the ACM, 36(4):101-109.
[18] Mooney, H. and C.J. Gabriel. 1995. Toward a National Strategy on Biological Diversity. Bioscience, Science & Biodiversity Policy Supplement, preface, Summer 1995.
[19] National Partnership for Advanced Computational Infrastructure. 1998. Materials available online at http://www.npaci.edu.
[20] Newsome, M. 1997. A Browser-based Tool for Designing Query Interfaces to Scientific Databases. Ph.D. Dissertation, Department of Computer Science, Oregon State University.
[21] Newsome, M., C. Pancake, and J. Hanus. 1997. HyperSQL: Web-based Query Interfaces for Biological Databases. Proceedings of the 30th Hawaii International Conference on System Sciences, IV:329-339.
[22] Newsome, M., J. Hanus and C. Pancake. 1997. Simplifying Web Access to Remote Scientific Databases. Dept. of Computer Science, Oregon State University, Technical Report CSTR97-60-07.
[23] Northwest Alliance for Computational Science and Engineering. 1998. Materials available online at http://www.nacse.org.
[24] NII 2000 Steering Committee, Computer Science and Telecommunications Board, Commission on Physical Sciences, Mathematics, and Applications, National Research Council, National Academy of Sciences. 1996. The Unpredictable Certainty: Information Infrastructure Through 2000, Chapter 6: Public Policy and Private Action. National Academy Press, Washington, D.C. Available online at http://www.nap.edu/readingroom/books/unpredictable/.
[25] Oregon Coalition of Interdisciplinary Databases. 1998. Available online at http://www.nacse.org/ocid.
[26] Pancake, C. 1997. Can Users Play an Effective Role in Parallel Tools Research? International Journal of Supercomputing and HPC, 11(1):84-94.
[27] Pancake, C. 1997. Improving the Usability of Numerical Software through User-Centered Design. In The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice. Chapman & Hall, London, pp. 44-60.
[28] Pancake, C., et al. 1992-1998. Interview and field notes from meetings with users of large-scale scientific data.
[29] Pittam, S., J. Hanus, M. Newsome and C. Pancake. 1998. Multiple Database Personalities: Facilitating Access to Research Data. In Proceedings of the 4th International Conference on Information Systems, Analysis and Synthesis, pp. 200-207.
[30] Rasmussen, B.F. 1994. WDB: A Web Interface to Sybase. In Fourth Annual Conference on Astronomical Data Analysis Software and Systems, Baltimore, MD.
[31] Robbins, R.J. 1994. Genome Informatics I: Community Databases. Report of the Invitational DOE Workshop on Genome Informatics, April 26-27, 1993. Journal of Computational Biology, 1(3):173-190.
[32] Rogue Wave Corporation. 1996. JFactory User's Manual.
Rogue Wave Corporation, Corvallis, OR.
[33] Sybase, Inc. 1994. Open Client-Library/C Reference Manual. Sybase, Inc., Emeryville, CA.
[34] Tiedje, J.M. 1994. Microbial Diversity: Of Value to Whom? ASM News, 60(10):524-525.
[35] Williams, R. et al. 1998. Workshop on Interfaces to Scientific Data Archives. Center for Advanced Computing Research, California Institute of Technology, Technical Report CACR-160. Available online at http://www.cacr.caltech.edu/isda.