Indexing Semistructured Data Jason McHugh Jennifer Widom Serge Abiteboul Qingshan Luo Anand Rajaraman Computer Science Department Stanford University Stanford, CA 94305 Firstname.Lastname@cs.stanford.edu http://www-db.stanford.edu/lore Abstract This paper describes techniques for building and exploiting indexes on semistructured data: data that may not have a xed schema and that may be irregular or incomplete. We rst present a general framework for indexing values in the presence of automatic type coercion. Then based on Lore, a DBMS for semistructured data, we introduce four types of indexes and illustrate how they are used during query processing. Our techniques and indexing structures are fully implemented and integrated into the Lore prototype. 1 Introduction We call data that is irregular or that exhibits type and structural heterogeneity semistructured, since it may not conform to a rigid, predened schema. Such data arises frequently on the Web, or when integrating information from heterogeneous sources. In general, semistructured data can be neither stored nor queried in relational or object-oriented database management systems easily and eciently. We are developing Lore1, a database management system designed specically for semistructured data [MAG+ 97]. Building Lore from scratch allows us to explore how semistructured data aects each component of a database management system. In this paper we focus on indexing. In any DBMS, the tradeo between ecient query performance versus space and update cost must be considered. Indexing allows fast access to data by essentially replicating portions of the database in special-purpose structures. However, these structures must be kept up to date incrementally: each change to the base data must be reected in all applicable indexes. Despite the cost of index maintenance, the added storage, and the added complexity in the query engine, indexes have shown themselves to be a useful and integral part of all database systems. The considerations involving the use of indexes are very similar in a database system for semistructured data: The administrator may choose to build some indexes which, if exploited properly by the query processor, will decrease query processing time. Consistency between the index structures and the data must be maintained. Many applications of semistructured data (such as querying Web or integrated data) tend to be This work was supported by the Air Force Rome Laboratories and DARPA under Contracts F30602-95-C-0119 and F30602-96-1-031. 1 Lore, for Lightweight Object REpository, has grown from a \lightweight" system for \lightweight" objects into a full-featured DBMS. read-intensive, in which case the balance falls towards maintaining extensive indexing structures to speed up query processing. For Lore applications we often nd this to be the approach of choice. Semistructured data does introduce some novel problems concerning the actual mechanics of indexing. First, indexes in relational and object-oriented database systems are created over a set of attributes for some collection where the types of the attributes are dened in advance. In Lore, the data is essentially an arbitrary labeled directed graph [PGMW95], so it is dicult to isolate an attribute of a collection to index, and the type of objects are not known in advance. Second, Lore automatically performs type coercion when comparing objects of dierent types|an essential feature when dealing with semistructured data. Traditional value indexing techniques must be modied considerably in order to be used in the presence of coercion. Indexing atomic values in our graph-based data model allows the query engine to quickly locate specic leaf objects. However, since almost all queries also explore the data via labeled traversals through the graph, we introduce additional indexes that eciently locate edges and paths through the data. Lore also includes a simple full-text indexing system that eciently supports Information Retrieval style predicates within Lore's query language. Finally, we describe how Lore's query optimizer pieces together the available indexes to create ecient query plans. All structures, index creation algorithms, index maintenance, and query plans described in this paper are implemented and fully functional in Lore. 1.1 Related Work A description of the Lore architecture and query execution engine can be found in [MAG+97]. Lore's cost-based query optimizer was introduced in [MW97] with a focus on how the optimizer nds ecient query plans. Lore DataGuides [GW97], which are dynamic structural summaries of a database, serve both as a tool for the end-user and as a simple path index. A preliminary version of Lore's query language, Lorel, was introduced in [QRS+ 95]. Details of the current version of Lorel appear in [AQM+ 97]. Needless to say, there has been a signicant amount of work in indexing for object-oriented databases, e.g., [KKD89, SS94, RK95, KM92, CCY94, BG92, XH94]. All of this work depends on the database having a xed schema based on a known, strongly-typed class hierarchy. In our environment we must take a dierent approach to indexing, since we do not have a xed schema, and comparable objects may take on dierent types. Other related work includes recent research into ecient evaluation of generalized path expressions. In [CCM96], an algebraic framework and transformations are presented to optimize evaluation of generalized path expressions, but the approach does not consider the use of indexes. In [FS98], a notion of \state extent" is introduced for path expression evaluation that is similar in spirit to how we use DataGuide path indexes. However, [FS98] does not consider explicit storage or maintenance of state extents, nor how they are integrated with query plan generation, so they cannot be considered indexes in the traditional sense. 2 DB &1 Movie Movie Movie References &2 Title Actor &5 &3 Year Price &7 &6 Title Writer &8 &9 "Blade 1982 "1984" Runner" Name Character Amount Currency &17 "Harrison Ford" &18 "Deccard" &10 "George Orwell" &4 Price Actor Actor Year Title &11 &12 19.95 1956 &13 Price &14 &15 &16 "25" Name Character &19 &20 &21 &22 20 "US $" "Harrison Ford" "Han" Original &23 AKA &24 "Star "Adventures of Wars" the Starkiller" Name Character &20 &25 &21 &26 "Mark Hamill" "Luke" Figure 1: An OEM database 1.2 Paper Outline Section 2 provides a brief introduction to the data model and query language used in Lore and motivates Lore's query execution strategies. A detailed foundation for indexing values in the presence of automatic type coercion, which is instantiated in Lore, is given in Section 3. The four index types implemented in the Lore system are described in Section 4. Query operators and plans that use the index structures are described in Section 5. 2 Preliminaries and Motivation To set the stage for our discussion of indexing semistructured data, we rst introduce Lore's data model, query language, and query execution strategies. For further details see [AQM+ 97, MW97]. 2.1 The Object Exchange Model The Object Exchange Model (OEM) [PGMW95] is designed for semistructured data. Data in this model can be thought of as a labeled directed graph. For example, the OEM graph shown in Figure 1 depicts a tiny portion of a database containing information about movies. (Although the example database is mostly tree-structured, OEM and Lore permit arbitrary graph-structured databases.) The vertices in the graph are objects; each object has a unique object identier (OID), such as &5. Atomic objects have no outgoing edges and contain a value from one of the basic atomic types such as integer, real, string, gif, java, audio, etc. All other objects may have outgoing edges and are called complex objects. Object &4 is complex and its subobjects are &13, &14, &15, and &16. 3 Object &5 is atomic and has value \Blade Runner". Names are special labels that serve as aliases for objects and as entry points into the database. In Figure 1, DB is a name that denotes object &1. In an OEM database there is no notion of xed schema. All the schematic information is included in the labels, which may change dynamically. Thus, an OEM database is self-describing, and there is no regularity imposed on the data. The model is designed to handle incompleteness of data, as well as structure and type heterogeneity as exhibited in the example database. Observe in Figure 1 that, for example: (i) movies have zero, one, or more actors; (ii) a price may be a string, a real, or even a complex object. For an OEM object X and a label l, the expression X:l denotes the set of all l-labeled subobjects of X . If X is an atomic object, or if l is not an outgoing label from X , then X:l is the empty set. Such \dot expressions" are used in Lore's query language, described next. 2.2 The Lorel Query Language We introduce the Lorel query language with a few examples. A full specication of Lorel, which is an extension of OQL, appears in [AQM+ 97]. Here we highlight those features of the language that are designed specically for querying semistructured data and are relevant to our indexing scheme. Many other features of Lorel (some inherited from OQL and others not) will not be covered. Our rst example query introduces the basic building block of Lorel: the simple path expression, which is a name followed by a sequence of labels. For example, DB.Movie.Actor is a simple path expression. It denotes the set of objects that can be reached starting with the DB object, following an edge labeled Movie, then following an edge labeled Actor. Range variables can be assigned to path expressions, e.g., \DB.Movie X" species that X ranges over the movie subobjects of the name DB. Our rst example gives a simple Lorel query which, when executed over the database in Figure 1, returns the titles of all movies that Harrison Ford acted in. In the query result, indentation is used to represent graph structure. Example 2.1 select DB.Movie.Title where DB.Movie.Actor.Name = \Harrison Ford" RESULT Title "Blade Runner" Title Original "Star Wars" AKA "Adventures of the Starkiller" The database in Figure 1 presents a number of irregularities, even with respect to this simple query. A guiding principle in Lorel is that to write a query one should not have to worry about such irregularities (e.g., not all movies have actors), or know the precise structure of objects (e.g., the 4 structure of Titles ), nor should one have to bother with precise types. As we will show, automatic coercion between atomic values with dierent types complicates the use of value indexes. The Lore query processor rewrites all queries into an OQL style. Our next example gives the translation of Example 2.1 into the equivalent OQL-like query. The user is free to enter either form of the query. The Lore system always builds its query execution strategy based upon the translated query. Example 2.2 select T from DB.Movie M, M.Title T where exists A in M.Actor : exists N in A.Name : N = \Harrison Ford" Lorel, like Postquel [SK91], allows the user to omit the from clause, as we have done in Example 2.1. Notice that in the rewritten version of the query a from clause has been introduced. Also notice that the comparison on the subpath Actor.Name has been transformed into two existential conditions. Thus, the user can write DB.Movie.Actor.Name = "Harrison Ford" regardless of whether Actor.Name is known to be single-valued, known to be set-valued, or unknown. Lorel oers a richer form of \declarative navigation" in OEM databases than simple path expressions, namely general path expressions. Intuitively, the user loosely species a desired pattern of labels in the database: one can specify patterns for paths (to match sequences of labels), patterns for labels (to match sequences of characters), and patterns for atomic values. The use of a path pattern is shown in our next example. Example 2.3 select DB.Movie.Price(.Amount)? where DB.Movie.Year DB.Movie.Title < RESULT Price 19.95 A \?" indicates that a subpath (in parentheses) is optional, so the path expression in the select clause will match both DB.Movie.Price and DB.Movie.Price.Amount. The query will return all Price and Amount subobjects for every movie where its year of production is less than its title: i.e., movies whose titles refer to future events, such as the movie \1984" which was produced in 1956. According to Lorel semantics, the comparison in the where clause returns false unless the values being compared can be coerced into the same atomic type [AQM+ 97]. In our example database, only &9 and &12 can be coerced into comparable types, and the comparison returns true since 1956 < 1984. Thus, object &3 satises the query and its Price subobject is returned. Further details and a formalization of atomic type coercion as implemented in Lore are given in Section 3. 5 2.3 Query Execution Strategies Now let us consider some possible query execution strategies for Example 2.2, repeated here for convenience: select T from DB.Movie M, M.Title T where exists A in M.Actor : exists N in A.Name : N = \Harrison Ford" The most straightforward approach is to rst consider the from clause, nding objects along the path DB.Movie.Title. Then, for each Movie, we look for the existence of a path Actor.Name whose value is \Harrison Ford". We call this a top-down execution strategy since we begin at the named object DB (the top) and evaluate the path expressions in a forward manner. This query execution strategy results in a depth-rst traversal of the graph and does not require any indexes. Another way to execute the same query is to rst identify all objects satisfying the where clause by using an index structure that locates all atomic objects with the value \Harrison Ford". Once we have an object satisfying the predicate, we traverse backwards through the data, matching path expressions in reverse. Since Lore's object store does not support parent (inverse) pointers, we must use an additional index structure to locate parents of a given object. We call this query execution strategy bottom-up since we rst identify atomic objects and then attempt to work back up to a named object. Obviously, whether bottom-up outperforms top-down depends on the query and the data, but we have found bottom-up to be a good strategy in practice when it is applicable. A third strategy is to evaluate some, but not necessarily all, of the from clause in a top-down fashion and create a set of \valid" objects. Then we directly identify those atomic objects that satisfy the where using an indexing structure, and using another index we traverse up to the same point as the top-down exploration. By intersecting the sets of \valid" objects and combining traversed paths we obtain the result of the query. We call this approach a hybrid plan, since it operates both top-down and bottom-up, meeting in the middle of a path expression. Yet another query execution strategy works inside-out by identifying portions of the graph that match the middle of a path expression, then traversing up through the parents and down through the children to match the remaining portions of the path expression. Here an index is needed to identify objects that match the middle of a path expression, as well as an additional index for the upward traversal. These execution strategies give the reader a avor of query processing and the use of indexes in Lore. We will return in more detail to Lore's query plans after we cover the indexing of coercible values and describe in more concrete terms the various indexes available in Lore. 3 Indexing Coercible Values Automatically coercing values of dierent types for comparisons is an important feature in semistructured data, when the user may not know the types of atomic data elements, or when the types of related atomic data may vary. In Example 2.3 it made sense to attempt to coerce the movie name 6 Integer ASCII Real Postscript MSWord String (b) (a) Figure 2: Partial orders among types to an integer so it could be compared with the production year. Any query comparing the movie prices in our database will require coercion since the prices are stored as a variety of types. Automatic type coercion makes indexing values in Lore more complicated than in traditional database systems. Before discussing the specic \coercible indexes" we have implemented in the Lore system, we present a general framework for indexing coercible values. 3.1 Types and Coercion Suppose we are given a set of types T , and a partial ordering on the types in T . When t1 t2 , we say that t1 is a more specic type than t2 . This relationship indicates that some (but not necessarily all) values of type t2 can be coerced into a value of type t1 . In Figure 2(a), for example, we see that integers are more specic than reals and reals are more specic than strings. We call t1 an ancestor of t2 if t1 t2 or t1 = t2 . Note that is transitive. The least upper bound of types t1 and t2 , denoted by lub(t1 ; t2), is the least common ancestor of t1 and t2 if one exists. Each type has associated with it a (usually innite) set of values of that type. We denote the set of values of type t by dom(t). Also associated with each pair of types ti ; tj such that ti tj is a coercion function fji , which is a partial function from dom(tj ) to dom(ti ). We assume coercion functions are transitive, i.e., for a value v , fij (fjk (v )) is dened i fik (v ) is dened, and when both are dened the coerced values are equal. For convenience, we let fii denote the identity function for all types ti . In the example of Figure 2(a) with types integer, real, and string, there are two points to note: 1. The coercion functions are true partial functions; for example, not all strings can be coerced into reals and not all reals can be coerced into integers. 2. The coercion functions are generally many-to-one and therefore not invertible; for example, the strings "5" and "05" are coerced into the same real number and same integer. Figure 2(b) shows an example of a nonlinear partial order: certain Postscript and certain MS Word documents can be coerced into plain-text ASCII documents. 7 3.2 Flexible Comparisons For developing the theoretical foundation, we say that an OEM database D is a collection of objects of dierent types. An object is represented by a triple h o; t; v i where o is the OID, t 2 T is a type, and v 2 dom(t) is a value of type t. The basic object lookup problem is to nd all atomic objects with a given value v in the database. That is, we wish to compute the set fo j h o; t; x i 2 D and x = v g. Obviously, object lookup can be a useful operation when executing a query, since it can directly locate those objects satisfying an equality condition in the query predicate. (We extend to general comparison operators below.) Often in semistructured data, the same or similar concepts are represented using dierent types. For example, the number 5 could be represented in some places as the integer 5 and in other places as the string \5". When we perform object lookup for the value 5, we would like to match not only an object with the integer value 5, but also an object with the string value \05". We formalize the idea by dening the notion of exible equality. Suppose v1 2 dom(t1 ) and v2 2 dom(t2 ). Intuitively, we must \promote" v1 and v2 as small a distance up the partial order as possible so that we can compare them. It is important to limit the distance that a value is promoted since the conversion functions are partial functions. We are guaranteed that if t3 t2 t1 and f13 (v ) is dened then f12 (v ) is dened, but not vice versa since the type t3 is more specic than t2 . Formally, we say that v1 is equal to v2 under exible equality, written v1 =f v2 , if one of the following holds: 1. t1 = t2 and v1 = v2 . 2. t3 = lub(t1; t2 ) exists, f13(v1 ) and f23(v2 ) are both dened, and f13(v1 ) = f23(v2 ). In the partial order shown in Figure 2(a), we have 5 =f \05", because the least upper bound of integer and string is integer, and the coercion of \05" into type integer results in the value 5. However, it is not the case that \5" =f \05" because the least upper bound of string and string is string and \5" 6= \05". We now generalize exible equality to exible comparisons. Suppose is a binary comparison function that is dened over all types. That is, given two values u and v of the same type, u v is either true or false. For example, could be one of the standard arithmetic comparison operators =, <, >, , or . We dene the exible comparison operator f corresponding to as follows: 1. v1 2. v1 f f v2 if v1 and v2 are of the same type and v1 v2 . v2 if v1 is of type t1 , v2 is of type t2 , t3 = lub(t1 ; t2 ) exists, f13 (v1 ) and f23 (v2 ) are both dened, and f13 (v1) f23(v2 ). Our denition of the exible comparison operator corresponding to preserves commutativity of but not transitivity. That is, if is commutative, f is also commutative, but if is transitive, then f is not in general transitive. For example, \5" =f 5, 5 =f \05", but \5" 6=f \05". Not preserving transitivity does eliminate certain optimization opportunities in query processing, however we feel that this notion of exible comparisons is appropriate for semistructured data. 8 string real int string string ! real both ! real real string ! real int ! real int both ! real int ! real Table 1: Coercion for integer, real, and string types 3.3 Flexible Indexes For developing the theoretical foundation, we say that an index I for a type t contains a set of objects h O; t; v i. It eciently supports the index lookup operation for one or more operators : ( ; v ) = fo j h o; t; x i 2 O and x v g L I; Our goal is thus to construct a set of indexes for a given database such that there is an algorithm to do a \exible" object lookup of the form x f v based on index lookups. We rst describe the set of indexes to create, and then in the next subsection the actual lookup algorithm. Let Oi denote the set of objects of type ti in the database D . For each type tj that is an ancestor of ti , we create an index Jij . Let: Oij = fh o; tj ; fij (v ) ij h o; ti; v i 2 Oi and fij (v ) is dened g Then Jij is the index for type tj containing the objects in Oij . In other words, Jij indexes all values of type ti that can be coerced into type tj . The coerced values are stored in the index. In the special case where i = j all values of type ti are indexed. Example 3.1 For the partial order given in Figure 2(a), we have the following 6 indexes, where we denote the type string by s, real by r, and integer by i: 1. Jss indexes all string values. 2. Jsr indexes all strings that can be coerced into reals (as reals). 3. Jsi indexes all strings that can be coerced into integers (as integers). 4. Jrr indexes all reals. 5. Jri indexes all reals that can be coerced into integers (as integers). 6. Jii indexes all integers. It is easy to compute the number of indexes in a partial order such as the one in Figure 2(a). There is one index associated with the top type, two indexes associated with the second type, and so on. In general, if there are n types, the number of indexes is n(n2+1) . The number of indexes is 9 less than this number for a nonlinear partial order with n types. For example, there are 5 indexes for Figure 2(b). It is possible to reduce the number of indexes if certain properties hold. In particular, suppose types t1 and t2 are such that: 1. t1 ; and t2 2. there exists a coercion function f12 from dom(t1 ) to dom(t2 ) such that for every pair of values u; v 2 dom(t1 ), f12 (u) and f12 (v ) exist and u v i f12 (u) f12 (v ). In this case we can \fold" the indexes for t1 into the indexes for t2 . For example, the types integer (t1 ) and real (t2 ) satisfy the above conditions for the operators =; <; >; ; , so we can fold all integer indexes into real indexes. This reduces the total number of indexes in Example 3.1 from six to three, keeping items 1, 2 and 4 (where 4 now contains all integers coerced to reals as well as all reals). Using this optimization, the full set of coercion rules is summarized in Table 1, and this is the approach we take to value indexing in Lore. 3.4 The Index Lookup Algorithm Suppose we wish to look up value vi of type ti under operator f . The algorithm is displayed in Figure 3. Recall that L(J; ; v ) is the lookup operation in index J on value v with operator . The set S returned contains the set of objects that \match" the value vi under the exible comparison operator f . An interesting point is that for a partial order of n types, at most n index lookups are needed regardless of the type ti . An index lookup need not occur when two types do not have a least upper bound. Procedure IndexLookup(Value , Type , Operator ) (1) (2) (3) (4) (5) (6) (7) S vi := ; foreach Type ti tj = lub( ). if 6= ; and ( ) is dened then := ( ( )). := [ . tk ti ; tj tk Sj S return fik vi L Jjk ; S ; fik vi Sj S Figure 3: Index lookup algorithm Example 3.2 Suppose we wish to match the value \05" given the partial order shown in Figure 2(a) without the optimization discussed in Section 3.3. Procedure IndexLookup would generate the following index lookups: (i) Lookup \05" in Jss ; (ii) Lookup 5.0 in Jrr ; (iii) Lookup 5 in Jii . string 10 Example 3.3 Suppose we wish to match the real number 5.0 given the partial order shown in Figure 2(a) without the optimization discussed in Section 3.3. Procedure IndexLookup would generate the following index lookups: (i) Lookup 5.0 in Jsr ; (ii) Lookup 5.0 in Jrr ; (iii) Lookup 5 in Jii . Theorem 3.1 At the end of executing Procedure IndexLookup, the set contains all and only the objects that match i under f . Moreover, each union in line (6) of IndexLookup is a disjoint union; that is, j \ k = ; for = 6 . S v S S j k Proof: The result of each lookup is: Sj = L(Jjk ; ; fik (vi )) = fo j h o; tj ; vj i 2 Ojk and vj f fik (vi)g The fact that the set S contains all and only the objects that match vi under f follows directly from the denition of Sj . To see why each union is a disjoint union, suppose Sj \ Sk 6= ; for some j 6= k . Then there must exist some o that is returned by indexes Jxy and Jwz : Recall that Jst is the index for objects of type s that have been coerced into type t. Since an object only has a single type then x = w. Now suppose y 6= z (if not then Sj = Sk and we are done). But this is impossible since line (5) only uses x a single time, and y 6= z implies that there are two least upper bounds for lub(ti ; tj ), a contradiction. 2 The fact that we are computing disjoint unions is signicant, because it shows that we are doing as little work as possible: we never obtain the same matching object from two dierent index lookups. Moreover, it also reduces the complexity of the algorithm because disjoint unions are much easier to compute than general unions. 4 Indexes In Lore We now describe the types of indexes that can be built over a Lore database. We will then describe the query plan operators that use the index structures, followed by a description of the query plans themselves. To speed up query processing in a Lore database, we can build four dierent types of index structures. The rst two identify objects that have specic values; the next two are used to eciently traverse the database graph. 1. A value index, or Vindex, locates atomic objects with certain values. Vindexes are based on the coercible indexing scheme introduced in Section 3. Vindexes can be built selectively over those objects with certain incoming labels. 2. A text index, or Tindex, locates string atomic values containing specic words or groups of words. The text index is a simplied version of information-retrieval style inverted indexes [Sal89]. Like Vindexes, Tindexes can be built selectively over those objects with certain incoming labels. 3. Since our current implementation of OEM does not support parent pointers, a link index, or Lindex, locates parents of a specic object. 11 4. A path index, or Pindex, provides fast access to all objects reachable via a given labeled path. As in all database systems, the choice of which indexes to build and maintain is made by the database administrator based on expected queries and updates. We now describe each of these types of indexes in more detail. 4.1 Value Index (Vindex) A Vindex in Lore is built over all atomic objects of base type integer, real, or string that have an incoming edge with a given label l.2 This Vindex allows the query engine to quickly locate all objects reachable by an l edge and matching a comparison predicate. While we could have chosen to support a single label-independent Vindex, a specic desired incoming label usually is known at query processing time (see Section 5.2), so it is useful to partition the Vindex by label. This approach also allows the administrator to build Vindexes selectively for frequently used labels. Our Vindex query operator does allow the label to be omitted, in which case all Vindexes are searched if they exist for all labels. Vindexes use the coercible indexing scheme described in Section 3, with the optimization outlined in Section 3.3. Thus, if the administrator requests a Vindex for label l, we actually create three indexes: one for reals and integers coerced to reals, one for strings, and one for strings coerced to reals. Each index is implemented as a B+-tree to support inequality as well as equality lookups. The lookup procedure for Vindexes is similar to the IndexLookup procedure given in Figure 3, except we extend it slightly to accept a label. Example 4.1 Suppose we create a Vindex for incoming label Price over the database in Figure 1. If we perform a lookup for values 15 00 with incoming edge Price, the result is f&11 &15g. > : ; 4.2 Text Index (Tindex) A Vindex is useful for nding values that satisfy basic comparisons such as =; <, etc. However, for string values an information-retrieval style keyword search can be very useful, especially for strings containing a signicant amount of text. In these situations, the Vindex is not powerful enough and a dierent indexing structure, the Tindex, is used. Text indexes in Lore are implemented using inverted lists, which map a given word w and label l to a list of atomic values with incoming edge l that contain word w . Like the Vindex, Tindexes are created by the administrator for a given label, for the reasons outlined earlier, but the label can always be omitted for a full search. A Tindex lookup returns a list of postings, where each posting is of the form h o; n i. A posting indicates that w appears in object o as the nth word in the value, and o has an incoming edge labeled l . The inverted lists are stored in hash tables on disk keyed on w . Example 4.2 Consider a Tindex lookup for all objects with an atomic string value containing the word \Ford" and an incoming edge Name, performed over the database in Figure 1. The result is fh &17; 2 i; h &21; 2 ig. 2 Lore currently does not support value indexing of \novel" types such as gif, audio, etc. 12 As will be seen in Section 5.1, a higher-level query plan operator above the Tindex, and a corresponding operator in the query language, support additional text search features such as AND, OR, NEAR, etc. 4.3 Link Index (Lindex) Since we do not support inverse pointers in OEM graphs, the Lindex provides a mechanism for retrieving the parents of an object via a given label. A Lindex lookup takes a \child" object c and a label l, and returns all parents p such that there is an l-labeled edge from p to c. The Lindex also supports lookups with no label, in which case all parents and their labels are returned. In Lore, the Lindex is implemented using extendible hashing [FNPS79] since for the Lindex we are always doing equality lookups. Currently we do not support selective creation of Lindexes, i.e., one Lindex is created for the entire database graph. Example 4.3 Suppose we had located all atomic objects containing the word \Ford" via the Tindex lookup in Example 4.2, and we now wanted to traverse up to parent subobjects connected via the label Name. The Lindex lookup for object &17 returns parent object &6, and the lookup for object &21 returns object &13. As will be seen in Section 5.1, a higher-level query plan operator above the Lindex supports nding ancestors objects that are reached by following a component of a general path expression that may include regular expression operations. 4.4 Path Index (Pindex) Finding all objects reachable by a given labeled path through the database is an important part of query processing. A Pindex lookup for a path p returns the set of objects O reachable via p. Currently in Lore we only index those paths that begin at named objects and contain no regular expressions. By doing so, we can easily store the set of reachable objects for each path p in Lore's DataGuide [GW97]. The DataGuide is a dynamic structural summary of all possible paths within the database at any given point in time. In addition to providing a kind of dynamic \schema" for the user to browse, the DataGuide also stores OIDs and statistics for those objects reachable via each path beginning from a name. The algorithm to compute as well as incrementally maintain the DataGuide Pindex appears in [GW97]. Example 4.4 Suppose our query is simply \select DB.Movie.Title" applied over the database in Figure 1. Instead of exploring the graph, we can use the Pindex to directly locate all objects reachable via DB.Movie.Title. The Pindex lookup operation returns f&5; &9; &14g. While the Pindex may appear very attractive, we cannot use it for all queries and path expressions: in addition to the limitations above, in some queries we also must obtain the objects along a given path in addition to those at the end of the path. See [MW97] for details. 13 DB OA[1].Movie OA[2].Title OA[2].Actor OA[4].Name Exists OA[5]? OA[1] OA[2] OA[3] OA[4] OA[5] OA[6] Figure 4: Object Assignment for Example 2.2 5 Query Plans Using Indexes A query plan in the Lore system is a tree of query operators that describes the specic sequence of steps used to answer a query. We use a recursive iterator approach in query processing, as described in, e.g., [Gra93]. With iterators, execution begins at the top of the query plan, with each node in the plan requesting a tuple at a time from its children and performing some operation on the tuple(s). After a node completes its operation, it passes a resulting tuple up to its parent. The \tuples" we operate on are Object Assignments, or OAs. An OA is a simple data structure containing slots corresponding to range variables in the query, along with some additional slots depending on the form of the query. Intuitively, each slot within an OA holds the OID of a node on a data path currently being considered by the query engine. Figure 4 contains an OA that will be described in more detail later. At a given point during query processing, not all slots of the current OA necessarily contain a valid OID; the goal of query execution is to build complete OAs. Once a valid OA reaches the top of the query plan, OIDs in appropriate slots are used to construct a component of the query result. 5.1 Query Operators We briey explain some of the key query operators that can appear in our query plans, especially those related to indexing. Each operator takes a number of arguments. If an operator produces a result then the last argument is the OA slot that contains the result. Some operators do not produce a result but aect the ow of control when executing the plan. Each of the indexing structures introduced in Section 4, the Vindex, Tindex, Lindex, and Pindex, have a corresponding query operator that supports the appropriate lookup operation during query execution. The Vindex and Pindex operators are simple \wrappers" around the corresponding index structures. We discuss the Lindex and Tindex operators in more detail since they add additional capabilities beyond what the corresponding indexes support. Like the Lindex data structure, the Lindex query plan operator can be used to nd the l-labeled parents of an object o. In addition, the Lindex operator can also nd the ancestor objects of o that are reached by matching some path described by a component of a general path expression. For example, the Lindex can nd all ancestors of o that are reached by following the path (.A|.B)*, which matches any path having zero or more A or B labeled edges. The Lindex operator keeps a run-time stack of objects visited in order to match a sequence of zero or more edges. (Since we do not yet support full regular expressions, a complete nite-state automaton is not required.) The Tindex operator uses the Tindex data structure, but it adds considerable power beyond simple single-word searches. Specically, the Tindex operator takes three arguments, a text pattern 14 to match, a label, and a destination OA slot that will contain the OIDs of all matching string atomic objects with the given incoming label. The possible patterns are dened recursively: Word or Phrase match. A string matches a word or phrase i it contains the word or phrase. The Tindex operator supports both case-sensitive and case-insensitive match. NEAR match. A string matches phrase1 NEAR phrase2 i it matches both phrase1 and phrase2, and the two matching phrases in the string are within 10 words of each other. AND operator. A string matches phrase1 AND phrase2 i it matches both phrase1 and phrase2. OR operator. A string matches phrase1 OR phrase2 i it matches either phrase1 or phrase2. ANDNOT operator. A string matches phrase1 ANDNOT phrase2 i it matches phrase1 but not phrase2. While this functionality is a simple subset of that oered by, e.g., Web search engines, we have found it very useful in practice to speed up a large class of common queries over string or text values. In addition to the four index operators, there are many other operators to perform the standard tasks necessary in query processing. (For complete details see [MW97].) For example, the Scan operator returns all OIDs that are subobjects of a given object following a specied path expression. It accepts an object o and a component of a path p and returns the set of objects that are reachable starting from o and matching p. Like the Lindex operator, the Scan operator may match a path containing regular expression operators by using a simple run-time stack. The Join, Project, and Select operators are nearly identical to their corresponding relational operators. Like a relational nested-loop join, the Join operator coordinates its left and right children. For each partially completed OA that the left child returns, the right child is called exhaustively until no more new OAs are possible. Then the left child is instructed to retrieve its next (partial) OA. The iteration continues until the left side produces no more OAs. The Project operator is used to limit which objects should be returned by specifying a set of OA slots, while the Select operator applies a predicate to the object identied by the OID in the specied OA slot. The Aggregation operator implements standard aggregation operations as well as existential quantication. At a high level, the aggregation operator calls its child exhaustively, storing the results temporarily or computing the aggregate incrementally. When the child can produce no more valid OAs, a new object is created whose value is the nal aggregation; this new object is identied within the target OA slot. For the aggregation \operation" Exists the operator adds true if the existential quantication is satised and false otherwise. Filtering of OAs whose quantication is true occurs in a Select operator which must appear immediately above the Aggregation node. Note that the exists aggregation operator \short circuits" when it nds the rst satisfying OA, while other aggregation operators may need to look at all OAs. 5.2 Query Plans We present Lore query plans through several examples. There are numerous possible plans for each query, and Lore includes a traditional cost-based query optimizer for plan enumeration and selection 15 Project (OA[3]) Join Select (OA[6] (OA6 ==TRUE) TRUE) Join Scan (OA[2],"Title",OA[3]) Join Scan (Root,"DB",OA[1]) Scan (OA[1],"Movie",OA[2]) Aggr (Exists, (Exists,OA[5], OA5, OA[6]) OA6) Select (OA[5] ="Harrison Ford") Join Scan (OA[2],"Actor",OA[4]) Scan (OA[4],"Name",OA[5]) Figure 5: Top-down plan for query in Example 2.2 as detailed in [MW97]. Here we focus on describing how query plans exploit indexes. Recall the query introduced in Example 2.2: select T from DB.Movie M, M.Title T where exists A in M.Actor : exists N in A.Name : N = \Harrison Ford" A possible OA structure for this query is shown in Figure 4. Notice that OA[6] is used for the result of the aggregation operation. A specic OA structure is created for each query plan. The OA in Figure 4 works with the \top-down" query plan explained next. 5.2.1 Top-Down Query Plan A top-down strategy (as introduced in Section 2.3) for this query does not exploit indexes. It attempts to match the path in the from clause rst by providing bindings for OA[1], then OA[2], then OA[3]. The where clause is handled in a similar fashion, and nally the result is returned. This execution strategy is reected in the query plan shown in Figure 5. The left subtree of the top Join node handles the from clause by rst scanning for the named object DB. From the resulting object of that scan the second Scan operator looks for a Movie subobject and places the result in OA[2]. The third Scan retrieves the movie's Title subobject. The right subtree rst looks for an Actor subobject of the current movie, then the bottom-right Scan and the Select operators nd all Name subobjects of the object in OA[4] with value \Harrison Ford". To satisfy the existential condition, the query plan uses the Aggregation/Select sequence, as described in Section 5.1. Note 16 Project (OA[3]) Join Join Join Join Vindex ("Name",=,"Harrsion Ford",OA[5]) Once (OA[4]) Lindex (OA[5],"Name",OA[4]) NamedObj ("DB",OA[1]) Scan (OA[2],"Title",OA[3]) Once (OA[2]) Lindex (OA[4],"Actor",OA[2]) Lindex (OA[2],"Movie",OA[1]) Figure 6: Bottom-up plan for query in Example 2.2 that in this query plan it is not necessary to test the existential condition for Actor, since if no actor subobject exists then we would never be able to satisfy the existential condition for Name. This query plan performs poorly if there are many movies in the database that \Harrison Ford" did not appear in, since in that case it would traverse many paths unnecessarily. 5.2.2 Bottom-Up Query Plan We now consider a very dierent query plan for the same query, which uses some of Lore's indexing structures through their corresponding index operators. Refer to Figure 6. In this plan the left subtree of the top-most Join operator rst satises the where clause. The Vindex operator will nd all atomic objects with an incoming edge Name whose value is \Harrison Ford". The Lindex operators then attempt to match backwards the edges Name and then Actor starting from the objects identied via the Vindex operator. The Once operator is the bottom-up equivalent of existential quantication: it ensures that each object is passed up to the parent at most once. After the where clause has been processed, we match a Movie edge via the third Lindex operator. It is then necessary to conrm that the parent of the movie object is the named object DB, accomplished via the NamedObject operator. Since the query requests all Title subobjects of a movie, we use the Scan operator to obtain the title. This query plan performs poorly if there are many objects with value \Harrison Ford" and incoming edge Name, but few that are reachable via the path DB.Movie.Actor.Name. 5.2.3 Hybrid Query Plan In certain cases, it is advantageous to combine the two techniques described so far, bottom-up and top-down, into a single hybrid plan. For instance, it may be benecial to rst locate all objects that satisfy the where clause via Vindex and Lindex operators and then, instead of continuing a bottom17 A D ... Project (OA[5]) D ... B B B B Intersect (OA[2],OA[4],OA[5]) B Join ... C C C 4 5 Vindex ("C",=,5,OA[1]) Join Lindex (OA[1],"C",OA[2]) Scan (Root,"A",OA[3]) Scan (OA[3],"B",OA[4) ... 4 Figure 7: A database where the hybrid execution strategy is preferred along with a hybrid query plan up traversal, identify all objects that satisfy the from clause in a top-down fashion. Consider the following simple query: select X from A.B X where exists Y in X.C : Y = 5 Now consider the database and query plan in Figure 7. A top-down strategy would visit all the leaf objects, but only a single one satises the predicate. A bottom-up strategy would identify the single object satisfying the predicate, but would then unnecessarily visit all of the nodes in the upper right portion of the database. In this case, a hybrid plan where we use bottom-up execution to nd the objects satisfying the where clause, then top-down execution to obtain the set of all A.B objects, then nally intersect the sets, would be a good query execution strategy. The hybrid query plan is shown in Figure 7. Notice the Intersect operation splits the execution of the where clause (appearing on the left) and the from clause (appearing on the right). 5.2.4 Full Text and Path Indexing To illustrate the use of text and path indexes, suppose we would like to nd all movies such that a Title or Description subobject contains the word \black" near the word \sheep". This query can be written as follows, where hasword is the Lorel query operator that exposes the functionality of our text indexing system: select DB.Movie where DB.Movie(.Title|.Description) hasword \black NEAR sheep" In Figure 8, we present a hybrid query plan that uses the Pindex and Tindex. The left subplan of the Intersect operator identies (using the Lindex) all parents of objects that are connected via an edge Title or Description to a child containing the word \black" near the word \sheep" (located via the Tindex). The right child of the Intersect operator identies all objects reachable via the 18 Project (OA[6]) Intersect (OA[5],OA[2],OA[6]) Pindex ("DB.Movie",OA[1]) Join Tindex ("black NEAR sheep", OA[4]) Lindex (OA[4],".Title|.Descrip(OA4,".Title|.Description",OA[3]) tion",OA3) Figure 8: A query plan that uses both Tindex and Pindex operators path DB.Movie, found using the Pindex operator. The intersection of the results generated by each subplan produces the answer to the query. 6 Conclusion This paper presented the management of indexes in Lore, a DBMS for semistructured data. We provided a general framework for indexing values in the presence of automatic type coercion. Lore's Vindex, Tindex, Lindex, and Pindex structures were introduced. We then described various query plans in Lore and how they make use of indexes. All indexing structures, index maintenance, query operators, and query plans described in this paper are fully implemented within the Lore system. Preliminary performance results indicate that at least an order of magnitude improvement is observed on a wide class of common databases and queries when indexes are used for query processing.3 References [AQM+ 97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68{88, April 1997. [BG92] E. Bertino and C. Guglielmina. Optimization of object-oriented queries using path indices. In Proceedings of the Second International Workshop on Research Issues in Data Engineering, pages 140{149, February 1992. [CCM96] V. Christophides, S. Cluet, and G. Moerkotte. Evaluating queries with generalized path expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 413{422, Montreal, Canada, June 1996. While time constraints prevented us from including a full performance study in this paper, we currently are conducting such a study and results could be added to the nal published report. 3 19 [CCY94] S. S. Chawathe, M. S. Chen, and P. S. Yu. On index selection schemes for nested object hierarchies. In Proceedings of the Twentieth International Conference on Very Large Data Bases, pages 331{341, Santiago, Chile, September 1994. [FNPS79] R. Fagin, J. Nievergelt, N. Pippenger, and H. Strong. Extendible hashing { A fast access method for dynamic les. ACM Transactions on Database Systems, 4(3):315{ 344, September 1979. [FS98] M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas, 1998. To appear in Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, February 1998. [Gra93] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73{170, 1993. [GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the Twenty-Third International Conference on Very Large Data Bases, pages 436{445, Athens, Greece, August 1997. [KKD89] W. Kim, K.-C. Kim, and A. Dale. Indexing techniques for object-oriented databases. In W. Kim and F.H. Lochovsky, editors, Object-Oriented Concepts, Databases and Applications, chapter 15. Addison-Wesley, 1989. [KM92] A. Kemper and G. Moerkotte. Access support relations: An indexing method for object bases. Information Systems, 17(2):117{145, 1992. [MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54{66, September 1997. [MW97] J. McHugh and J. Widom. Query optimization in semistructured data. Technical report, Stanford University Database Group, 1997. Available at http://www-db.stanford.edu/pub/papers/qo.ps. [PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of the Eleventh International Conference on Data Engineering, pages 251{260, Taipei, Taiwan, March 1995. [QRS+ 95] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases (DOOD), pages 319{344, Singapore, December 1995. [RK95] S. Ramaswamy and P. C. Kanellakis. OODB indexing by class-division. SIGMOD Record, 24(2):139{150, June 1995. 20 [Sal89] Gerard Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989. [SK91] M. Stonebraker and G. Kemnitz. The POSTGRES next-generation database management system. Communications of the ACM, 34(10):78{92, October 1991. [SS94] B. Sreenath and S. Seshadri. The hcC-tree: An ecient index structure for object oriented databases. In Proceedings of the Twentieth International Conference on Very Large Data Bases, pages 203{213, Santiago, Chile, September 1994. [XH94] Z. Xie and J. Han. Join Index Hierarchies for Supporting Ecient Navigations in ObjectOriented Databases. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 522{533, Santiago, Chile, September 1994. 21