A Database Clustering Methodology and Tool

Tae-Wan Ryu
Department of Computer Science, California State University, Fullerton
Fullerton, California 92834
tryu@ecs.fullerton.edu

Christoph F. Eick
Department of Computer Science, University of Houston
Houston, Texas 77204-3010
ceick@cs.uh.edu

To appear in Information Sciences, Spring 2005.

Abstract

Clustering is a popular data analysis and data mining technique. However, applying traditional clustering algorithms directly to a database is not straightforward, because a database usually consists of structured and related data; moreover, there might be several object views of the database to be clustered, depending on a data analyst's particular interest. Finally, in many cases there is a data model discrepancy between the format used to store the database to be analyzed and the representation format that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current research. This paper focuses on identifying those discrepancies and on analyzing their impact on the application of clustering techniques to databases. We are particularly interested in the question of how clustering algorithms can be generalized to become more directly applicable to real-world databases. The paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set representation framework for database clustering that characterizes objects to be clustered through sets of tuples, and introduce preprocessing techniques and tools to generate object views based on this framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are suitable for the proposed data set representation framework. We also demonstrate that our approach is capable of dealing with the relationship information commonly found in databases through bag-oriented clustering. Finally, we argue that our bag-oriented data representation framework is more suitable for database clustering than the commonly used flat file format and produces clusters of better quality.

Keywords and Phrases: database clustering, preprocessing in KDD, data mining, data model discrepancy, similarity measures for bags.

1 Introduction

Current technologies for collecting data, such as scanners and other data collection tools, have generated a huge amount of data, and the volume of data is growing rapidly every year. Database systems provide tools and an environment for managing and accessing this large volume of data systematically and efficiently. However, extracting useful knowledge from databases is very difficult without additional computer assistance and more powerful analytical tools. In general, there is a significant gap between data generation and data understanding. Consequently, powerful automatic analytical tools for discovering useful and interesting patterns in databases are desirable. Knowledge discovery in databases (KDD) is such a generic approach to analyzing and extracting useful knowledge from databases using fully automated techniques. Recently, many techniques and tools [HK01] have been proposed for this purpose. Popular KDD tasks include classification, data summarization, dependency modeling, and deviation detection. The focus of this paper is database clustering.
The goal of database clustering is to take a database that stores information concerning a particular type of objects (e.g., customers or purchases) and to identify subgroups of those objects, such that objects belonging to the same subgroup are very similar to each other, and objects belonging to different subgroups are quite different from each other.

Figure 1. Example of Database Clustering: a restaurant database is preprocessed into an object view for clustering, a clustering algorithm produces a set of similar object clusters, and summarization yields groups such as "young at midnight", "white collar for dinner", and "retired for lunch".

Suppose that a restaurant owner has a database that contains customer information, and he wants to obtain a better understanding of his main customer groups for marketing purposes. In order to accomplish this goal, as depicted in Figure 1, the restaurant database is first preprocessed for clustering, and a clustering algorithm is applied to the preprocessed data set; for example, the algorithm might reveal that there are three clusters in the customer database. Finally, characteristic knowledge that summarizes each cluster can be generated, telling the restaurant owner that his major customer groups are young people who come at midnight, white collar people who come for dinner, and retirees who come for lunch. This knowledge will definitely be useful for marketing purposes and for designing his menu.

The paper is organized as follows. Section 2 introduces the different steps that have to be taken when clustering a database, and explains how database clustering is different from traditional flat file data clustering. Based on the discussion of Section 2, Section 3 introduces a "new" data set representation framework for database clustering that characterizes objects through sets of tuples. Moreover, preprocessing techniques for generating object views based on this framework are introduced. In Section 4, similarity measures for our bag-oriented knowledge representation framework are introduced. Section 5 introduces the architecture and the components of a database clustering environment we developed; moreover, the problems of generalizing traditional clustering algorithms for database clustering are addressed in this section. Section 6 reviews the related literature, and Section 7 summarizes the findings and contributions of the paper.

2 Database Clustering

2.1 Steps of Database Clustering

Because database clustering has not been discussed much in the literature, we think it is useful to first discuss its different steps. In general, we consider database clustering to be an activity that is conducted by passing through the following seven steps:

(1) Define Object-View
(2) Select Relevant Attributes
(3) Generate Suitable Input Format for the Clustering Tool
(4) Define Similarity Measure
(5) Select Parameter Settings for the Chosen Clustering Algorithm
(6) Run Clustering Algorithm
(7) Characterize the Computed Clusters

The first three steps of the suggested database clustering methodology center on preprocessing the database and on generating a data set that can be processed by the employed clustering algorithm(s). In these steps, a decision has to be made as to which objects in the database (databases usually contain multiple types of objects) and which of their properties will be used for the purpose of clustering; moreover, the relevant information has to be converted to a format that can be processed by the selected clustering tool(s).
In the fourth step, similarity measures for the objects to be clustered have to be defined. Finally, in steps 5-7, the clustering algorithm is run and summaries of the obtained clusters are generated.

2.2 Differences between Database Clustering and Ordinary Clustering

Data collections are stored in many different formats, such as flat files and relational or object-oriented databases. The flat file format is the simplest and most frequently used format in the traditional data analysis area. When the flat file format is used, data objects (e.g., records, cases, examples) are represented through vectors in n-dimensional space, each of which describes an object, and the object is characterized by n attributes, each of which has a single value. Almost all existing data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that the data sets to be analyzed are represented in a flat file format. The well-known inductive learning environment C4.5 [Quin93] and similar decision tree based rule induction algorithms [Domi96], conceptual clustering algorithms such as COBWEB [Fish87], AutoClass [Chee96], and ITERATE [Bisw95], statistical packages, etc. make this assumption.

Because databases are more complex than flat files, database clustering faces additional problems that do not exist when clustering flat files; these problems include:

- Databases contain objects that belong to different types; consequently, it has to be defined which objects in the database need to be clustered.
- Databases contain 1:1, 1:n, and n:m relationships between objects of the same and different types.
- The definition of object similarity is more complex due to the presence of bags of values (or related information) that characterize an object.
- Attributes of objects have different types, which makes the selection of an appropriate similarity measure more difficult.

The first two problems are analyzed in more detail in the next two subsections; the third and fourth problems are addressed in Section 4.

2.3 Support for Object Views for Database Clustering

Because databases usually contain objects belonging to different classes, there can be several ways of viewing a database, depending on which classes of objects need to be clustered. To illustrate the problems of database clustering, let us use the following simple relational database that consists of a Customer and a Purchase table; a particular state of this database is shown in Figure 2 (a). The underlined attributes in each relation represent the primary key of the relation. It is not possible to directly apply a clustering algorithm to a relational database, such as the one depicted in Figure 2 (a). Before a clustering algorithm can be applied to a database, it has to be determined which classes of objects should be clustered: should customers or purchases be clustered? After it has been decided which objects have to be clustered, in the next step relevant attributes have to be associated with the particular objects to be clustered. The availability of preprocessing tools that facilitate the generation of such object-views is highly desirable for database clustering, because generating such object-views manually can be quite time consuming.
2.4 Problems with Relationships

In general, a relational database consists of several related relations (or of related classes, when the object-oriented model is used), which frequently describe many-to-one and many-to-many relationships between objects. For example, let us assume that we are interested in clustering the customers belonging to the relational database depicted in Figure 2 (a). It is obvious that the attributes found in the Customer relation alone are not sufficient to accomplish this goal, because many important characteristics of persons are found in other "related" relations, such as the Purchase relation. Prior to clustering customers, the relevant information has to be extracted from the relational database and associated with each customer object. We call a data structure that stores the results of this process an object view. An example of such an object view is depicted in Figure 2 (c). The depicted data set was generated by grouping related tuples into a unique object (based on cid). The attributes p.oid, p.pgid, p.ptype, and p.amount are called related attributes with respect to the Customer relation, because they had to be imported from a foreign relation, in this particular case the Purchase relation.

(a) A data collection consisting of two relations, Customer and Purchase. The underlined attributes are the keys of each relation. cid (customer id) is a foreign key in the relation Purchase. oid is an order id, pgid is a product group id, and ptype is a payment type (e.g., 1 for cash, 2 for credit card, and 3 for check). The cardinality ratio between the two relations is 1:n.

Customer:
cid  name   age  gender
1    Johny  43   M
2    Andy   21   F
3    Post   67   M
4    Jenny  35   F

Purchase:
oid  pgid  cid  ptype  amount  date
1    p1    1    1      400     02-10-96
1    p2    1    1      70      02-10-96
1    p3    1    1      200     02-10-96
2    p2    2    2      390     02-23-96
3    p3    2    3      100     03-03-96
4    p1    3    1      30      03-03-96

(b) A single-valued data set created by performing an outer join on cid. The related attributes (from the Purchase relation) have the prefix p; for example, p.pgid is the pgid of the Purchase relation.

cid  name   age  gender  p.oid  p.pgid  p.ptype  p.amount  date
1    Johny  43   M       1      p1      1        400       02-10-96
1    Johny  43   M       1      p2      1        70        02-10-96
1    Johny  43   M       1      p3      1        200       02-10-96
2    Andy   21   F       2      p2      2        390       02-23-96
2    Andy   21   F       3      p3      3        100       03-03-96
3    Post   67   M       4      p1      1        30        03-03-96
4    Jenny  35   F       null   null    null     null      null

(c) A multi-valued data set created by grouping related tuples into an object. For example, the three tuples that characterize Johny are grouped into one "Johny" object, using separate bags for his product groups, payment types, and the amounts spent on each product group.

cid  name   age  gender  p.pgid      p.ptype  p.amount
1    Johny  43   M       {p1,p2,p3}  {1,1,1}  {400,70,200}
2    Andy   21   F       {p2,p3}     {2,3}    {390,100}
3    Post   67   M       p1          1        30
4    Jenny  35   F       null        null     null

(d) A single-valued data set created by averaging the multi-valued attributes in (c). For the symbolic multi-valued attributes, such as p.pgid and p.ptype, we picked the first value in the bag (arbitrarily), since averages cannot be calculated for them.

cid  name   age  gender  p.pgid  p.ptype  p.amount
1    Johny  43   M       p1      1        223
2    Andy   21   F       p2      2        245
3    Post   67   M       p1      1        30
4    Jenny  35   F       null    null     null

Figure 2. Various representations of a data set consisting of two related relations

In general, as the example shows, object views frequently contain bags of values if the relationship cardinality between the two relations is 1:n. Note that in a relational database, n:m relationships are typically decomposed into two 1:n relationships.
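To make the preprocessing step concrete, the following sketch shows how the three representations of Figure 2 (b), (c), and (d) could be derived from the two base relations. It is only a minimal illustration using the pandas library; the paper's own tool (described in Section 5) is implemented differently, and the column names below simply mirror Figure 2.

```python
import pandas as pd

customer = pd.DataFrame({"cid": [1, 2, 3, 4],
                         "name": ["Johny", "Andy", "Post", "Jenny"],
                         "age": [43, 21, 67, 35],
                         "gender": ["M", "F", "M", "F"]})
purchase = pd.DataFrame({"oid": [1, 1, 1, 2, 3, 4],
                         "pgid": ["p1", "p2", "p3", "p2", "p3", "p1"],
                         "cid": [1, 1, 1, 2, 2, 3],
                         "ptype": [1, 1, 1, 2, 3, 1],
                         "amount": [400, 70, 200, 390, 100, 30]})

# (b) single-valued data set: outer join on cid; Johny appears three times.
single_valued = customer.merge(purchase, on="cid", how="left")

# (c) multi-valued data set: group the related tuples of each customer
# into bags (represented here as Python lists, which preserve duplicates).
bags = purchase.groupby("cid").agg(list).reset_index()
multi_valued = customer.merge(bags, on="cid", how="left")

# (d) average-valued data set: collapse each numeric bag to its mean;
# for symbolic attributes such as pgid we arbitrarily keep the first value.
avg = purchase.groupby("cid").agg(pgid=("pgid", "first"),
                                  ptype=("ptype", "first"),
                                  amount=("amount", "mean")).reset_index()
average_valued = customer.merge(avg, on="cid", how="left")
```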
Unlike a set, a bag allows duplicate elements, but the elements must take values from the same domain. For example, the bag {400, 70, 200} for the amount attribute might represent three purchases, of 400, 70, and 200 dollars, by the customer "Johny". Ryu and Eick [Ryu98c] call a data set such as the one in Figure 2 (c) a multi-valued data set, and use the term single-valued data set for traditional flat files like those in (a) or (b). They use curly brackets to represent a bag of values whose cardinality is greater than one (e.g., {1,2,3}), null to denote an empty bag, and just the element itself if the bag has exactly one element. Most traditional similarity measures for single-valued attributes cannot deal with multi-valued attributes such as p.pgid, p.ptype, and p.amount. Measuring similarity between bags of values requires group similarity measures. For example, how do we compute the similarity between the pair of objects "Andy" and "Post" for the multi-valued attribute p.amount, that is, between {390,100} and 30, or between "Andy" and "Johny", that is, between {390,100} and {400,70,200}?

One simple idea is to replace the bag of values of a multi-valued attribute by a single value, by applying some aggregate function (e.g., average, sum, or count), as depicted in Figure 2 (d). Another alternative is to use an outer join on the cid attribute to obtain a single-valued data set, as depicted in Figure 2 (b). The problem with the first approach is that, by applying the aggregate function, valuable information is frequently lost. For example, if the average purchase amount is used to replace the bag of individual purchase amounts, this approach does not consider other potentially relevant information, such as the total amount and the number of purchases, in computing similarity. Another problem is that aggregate functions are only applicable to numerical attributes; using aggregate functions for symbolic attributes, such as the attribute pgid or ptype in the example database, does not make sense at all. In summary, the approach of replacing a bag of values by a single value faces many technical difficulties. If we look at the single-valued data set in Figure 2 (b), which was generated using an outer join, we observe a different problem: a clustering algorithm treats each tuple of the obtained single-valued data set as a separate object (e.g., Johny's three purchases are considered to be different objects, and not data related to the customer "Johny"), which means that no longer the 4 customers but rather the 7 tuples of the joined table would be clustered; obviously, if our goal is to cluster customers, clustering purchases instead is quite confusing.
3 A Data Set Representation Framework for Database Clustering

In the following, a data set representation framework for database clustering is proposed; similarity measures that are suitable in the context of the proposed framework are then introduced in Section 4. In general, the framework consists of the following mechanisms:

- An object identification mechanism that defines which classes of objects will be clustered and how those objects will be uniquely identified.
- Mechanisms to define modular units of object similarity; each modular unit represents a particular perspective on the objects to be clustered, and the similarity of different modular units is measured independently. In the context of the relational data model, modular units are defined as procedures that associate a bag of tuples with a given object.

Using this framework, objects to be clustered are characterized by a set of bags of tuples, one bag for each modular unit. The similarity between two objects is measured as a weighted sum of the similarities of all its modular units. To be able to do that, a weight and a (bag) similarity measure have to be provided for each modular unit.

Figure 3. An example of the bag-oriented clustering framework: each of the four customers is characterized by three bags of tuples, namely an (age, gender) tuple, a bag of (pgid, amount) tuples, and a bag of (sum(amount), date) tuples summarizing the daily spending.

To illustrate this framework, let us assume that we are still interested in clustering customers. In this case the attribute cid of the relation Customer, which uniquely identifies customers, serves as our object identification mechanism. After the object identification mechanism has been selected, relevant attributes for defining similarity between customers have to be selected. In this particular case, we assume that we consider the customers' age/gender information, the amount of money they spend on various product groups, and the customers' daily spending patterns to be relevant for defining customer similarity. In the next step, modular units for measuring customer similarity have to be defined. In this particular example, we identify three modular units, each of which characterizes customers through a bag of tuples. For example, the customer with cid 1 is characterized as a 43-year-old male who spent 400, 70, and 200 dollars on product groups p1, p2, and p3, and who purchased all his goods on a single day of the reporting period, spending 670 dollars in total.

There are different approaches to defining modular units. When the relational data model is used, modular units can be defined using SQL queries that associate customers (using cid) with a bag of tuples that is specific for the modular unit. In the example depicted in Figure 3, the following three SQL queries associate customers with the characteristic knowledge with respect to each modular unit:

Modular Unit 1 := SELECT cid, age, gender
                  FROM Customer;
Modular Unit 2 := SELECT Customer.cid, pgid, amount
                  FROM Customer, Purchase
                  WHERE Customer.cid = Purchase.cid;
Modular Unit 3 := SELECT Customer.cid, SUM(amount), date
                  FROM Customer, Purchase
                  WHERE Customer.cid = Purchase.cid
                  GROUP BY Customer.cid, date;
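For illustration, the object view resulting from these three queries (the one depicted in Figure 3) could be held in memory as a mapping from each customer's cid to one bag of tuples per modular unit. This is only a sketch of one possible in-memory representation, not the storage format of the paper's tool; bags are represented as Python lists so that duplicates are preserved.

```python
# One entry per customer (keyed by cid); each modular unit is a bag
# (list) of tuples, mirroring Figure 3.
object_view = {
    1: {"unit1": [(43, "M")],
        "unit2": [("p1", 400), ("p2", 70), ("p3", 200)],
        "unit3": [(670, "02-10-96")]},
    2: {"unit1": [(21, "F")],
        "unit2": [("p2", 390), ("p3", 100)],
        "unit3": [(390, "02-23-96"), (100, "03-03-96")]},
    3: {"unit1": [(67, "M")],
        "unit2": [("p1", 30)],
        "unit3": [(30, "03-03-96")]},
    4: {"unit1": [(35, "F")], "unit2": [], "unit3": []},  # empty bags for Jenny
}
```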
As we have seen throughout the discussion of the last two sections, many different object views can be constructed from a given database. There are "simple" object views based on the flat file format, such as those in Figure 2 (b) and Figure 2 (d); in this section, a more sophisticated scheme for defining object views has been introduced that characterizes objects through sets of bags of tuples. We claim that this data set representation framework is more suitable for database clustering, and we present arguments to support this claim in Section 5. When the proposed methodology is followed, object views based on the definition of modular units are constructed; in the next step, similarity measures have to be defined with respect to the chosen object view, which is the subject of the discussion in the next section.

4 Similarity Measures for Database Clustering

In the previous section, we introduced a data set representation framework for database clustering. In this section, we introduce several similarity measures that are suitable for the proposed framework.

As discussed earlier, in the proposed framework each object to be clustered is described through a set of bags of tuples, one bag for each modular unit. In the case of single-valued data sets, each bag degenerates to a single tuple. When defining object similarity for this framework, we assume that a similarity measure is used to evaluate object similarity with respect to a particular modular unit; object similarity itself is measured as the weighted sum of the similarities of its modular units. More formally, let:

- $O$ be the set of objects to be clustered, $a, b \in O$;
- $m_i : O \rightarrow X_i$ denote the function that computes the bag of tuples of the $i$th modular unit;
- $\psi_i$ denote the similarity function for the $i$th modular unit;
- $w_i$ denote the weight of the $i$th modular unit.

Based on these definitions, the similarity between two objects $a$ and $b$ can be defined as follows:

$\Psi(a,b) = \sum_{i=1}^{n} w_i \, \psi_i(m_i(a), m_i(b)) \; / \; \sum_{i=1}^{n} w_i$   (0)

where $n$ is the number of modular units. Figure 4 illustrates how the similarity between two objects, Object$_a$ and Object$_b$, is computed from their modular units in our similarity framework.

Figure 4. Similarity framework: the similarity of Object$_a$ and Object$_b$ is computed by applying $\psi_1, \ldots, \psi_n$ to the corresponding modular units and combining the results using the weights $w_1, \ldots, w_n$.
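Formula (0) can be transcribed into code almost directly. The following is a minimal sketch, assuming objects are stored as dictionaries of per-unit bags (as in the object_view sketch above) and that the per-unit similarity functions and weights are supplied by the caller.

```python
def object_similarity(a, b, units, psi, w):
    """Weighted modular-unit similarity, formula (0).

    a, b  -- objects, each a dict mapping unit name -> bag of tuples
    units -- list of modular unit names
    psi   -- dict mapping unit name -> bag similarity function in [0, 1]
    w     -- dict mapping unit name -> non-negative weight
    """
    weighted = sum(w[u] * psi[u](a[u], b[u]) for u in units)
    return weighted / sum(w[u] for u in units)
```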
There are many similarity metrics and concepts proposed in the literature from a variety of disciplines, including engineering and science [Ande73, Ever93, Jain88, Wils97] and psychology [Ashb88, Shep62]. In this paper, we broadly categorize attributes into quantitative and qualitative types, introduce existing similarity measures for these two types, and generalize them to cope with the special characteristics of our framework.

4.1 Similarity Measures for Quantitative Types

A class of distance functions known as the Minkowski metric is the most popular dissimilarity function for quantitative attributes. It is defined as follows:

$d_r(a,b) = \left( \sum_{i=1}^{m} |a_i - b_i|^r \right)^{1/r}, \quad r \ge 1$   (1)

where $a$ and $b$ are two objects with $m$ quantitative attributes, $a = (a_1, \ldots, a_m)$ and $b = (b_1, \ldots, b_m)$. For $r = 2$ it is the Euclidean metric, $d_2(a,b) = (\sum_{i=1}^{m} (a_i - b_i)^2)^{1/2}$; for $r = 1$ it is the city-block (also known as taxicab or Manhattan) metric, $d_1(a,b) = \sum_{i=1}^{m} |a_i - b_i|$; and for $r = \infty$ it is the dominance metric, $d_\infty(a,b) = \max_{1 \le i \le m} |a_i - b_i|$. The Euclidean metric is the most commonly used of the Minkowski metrics. Wilson and Martinez [Wils97] discuss many other distance functions and their properties.

One simple way to measure the similarity between modular units in our similarity framework is to substitute group means for the $i$th attribute of an object in the formulae for inter-object measures such as Euclidean distance, city-block distance, or the squared Mahalanobis distance [Jain88]. For example, suppose that group A has the mean vector $\bar{A} = [\bar{x}_{a1}, \bar{x}_{a2}, \ldots, \bar{x}_{am}]$ and group B has the mean vector $\bar{B} = [\bar{x}_{b1}, \bar{x}_{b2}, \ldots, \bar{x}_{bm}]$; then the Euclidean distance between the two groups can be defined as

$d(A,B) = \left( \sum_{i=1}^{m} (\bar{x}_{ai} - \bar{x}_{bi})^2 \right)^{1/2}$   (2)

Another approach is to measure the distance between the closest or furthest members of the two groups, one from each group, known as the nearest-neighbor or furthest-neighbor distance [Ever93]; this approach is used in hierarchical clustering algorithms such as single-linkage and complete-linkage. The main problems with these two approaches are that the similarity is insensitive to the variance of the values and that it does not account for the number of elements in a group.

Another approach, known as group average, measures inter-group similarity by taking the average of all inter-object measures over those pairs of objects whose members lie in different groups. For example, the average dissimilarity between groups A and B can be defined as

$d(A,B) = \frac{1}{n} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} d(a_i, b_j)$   (3)

where $n = n_a n_b$ is the total number of object pairs, $n_a$ and $n_b$ are the numbers of objects in groups A and B, respectively, and $d(a_i, b_j)$ is the dissimilarity function for a pair of objects $a_i \in A$, $b_j \in B$. For example, with the city-block metric, the average dissimilarity between the amount bags {390,100} and {400,70,200} of Figure 2 (c) is (10 + 320 + 190 + 300 + 30 + 100)/6, which is approximately 158. Note that a dissimilarity function (usually a distance function) can easily be converted into a similarity function, for example by taking its reciprocal.

4.2 Similarity Measures for Qualitative Types

Two coefficients, the Matching coefficient and Jaccard's coefficient, are the most commonly used similarity measures for qualitative attributes [Ever93, Jain88]. The Matching coefficient is the ratio of the number of features two objects have in common to the total number of features; Jaccard's coefficient is the Matching coefficient with negative matches excluded. For example, let $m$ be the total number of features, $m_{11}$ the number of features present in both objects (positive matches), $m_{00}$ the number of features absent from both objects (negative matches), and $m_{01}$ and $m_{10}$ the numbers of features present in exactly one of the two objects. Then the Matching coefficient and Jaccard's coefficient are defined as $(m_{00}+m_{11})/m$ and $m_{11}/(m - m_{00})$, respectively. Other variants exist that give more weight to either matching or mismatching features, depending on accepted practice. These coefficient measures can be extended to multi-valued qualitative attributes.

Restle [Rest59] has investigated the concepts of distance and ordering on sets, and several other set-theoretical models of similarity have been proposed [Ashb88, Tver77]. Tversky [Tver77] proposed his contrast model and ratio model, which generalize several set-theoretical similarity models proposed at that time. Tversky considers objects as sets of features instead of geometric points in a metric space. To illustrate his models, let $a$ and $b$ be two objects, and let $m_a$ and $m_b$ denote the sets of features associated with $a$ and $b$, respectively. Tversky proposed the following similarity measure, called the contrast model:

$S(a,b) = \theta f(m_a \cap m_b) - \alpha f(m_a - m_b) - \beta f(m_b - m_a)$   (4)

for some $\theta, \alpha, \beta \ge 0$; $f$ is a set operator (usually set cardinality is used). Here, $m_a \cap m_b$ represents the features that are common to both $a$ and $b$; $m_a - m_b$ the features that belong to $a$ but not to $b$; and $m_b - m_a$ the features that belong to $b$ but not to $a$. In earlier models, the similarity between objects was determined only by their common features, or only by their distinctive features. In the contrast model, the similarity of a pair of objects is expressed as a linear combination, namely a weighted difference, of the measures of their common and distinctive features. The following similarity measure represents the ratio model:

$S(a,b) = f(m_a \cap m_b) \,/\, [f(m_a \cap m_b) + \alpha f(m_a - m_b) + \beta f(m_b - m_a)], \quad \alpha, \beta \ge 0$   (5)

In the ratio model, the similarity value is normalized to the range between 0 and 1.
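As an illustration, the ratio model (5) with $f$ chosen as set cardinality can be written down directly. The following sketch treats the feature bags as sets; with alpha = beta = 1 it reduces to Jaccard's coefficient, as the example shows.

```python
def tversky_ratio(ma, mb, alpha=1.0, beta=1.0):
    """Tversky's ratio model, formula (5), with f = set cardinality."""
    a, b = set(ma), set(mb)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0  # two empty sets: treat as identical

# With alpha = beta = 1 this is Jaccard's coefficient, e.g. for the
# pgid bags of Johny and Andy in Figure 2 (c):
print(tversky_ratio({"p1", "p2", "p3"}, {"p2", "p3"}))  # 2/3, i.e. about 0.67
```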
The ratio model generalizes a wide variety of similarity models that are based on matching coefficients for qualitative attributes, as well as several other set-theoretical models of similarity [Eisl59]. For example, if $\alpha = \beta = 1$, then $S(a,b)$ becomes the matching coefficient $f(m_a \cap m_b)/f(m_a \cup m_b)$ discussed in Section 4.2. Note that the sets in Tversky's model are crisp sets; Santini et al. [Santi96] extend Tversky's model to cope with fuzzy sets. Wilson and Martinez [Wils97] discuss the Value Difference Metric (VDM) introduced by Stanfill and Waltz (1986) and propose the Heterogeneous Value Difference Metric (HVDM) for handling nominal attributes. Gibson et al. [Gibs98] introduce a sophisticated approach that handles similarity arising from the co-occurrence of values in a data set, using an iterative method for assigning and propagating weights on the qualitative values. Their approach can handle a limited form of transitive similarity; e.g., if O_a is similar to O_b, and O_b is similar to O_c, then O_a is considered to be similar to O_c.

4.3 Similarity Measures for Mixed Types

In many real-world problems, we often encounter data sets with a mixture of attribute types. Specifically, if algorithms are to be applied to databases, it may not be sensible to assume a single attribute type, since the data can be generated from multiple tables with different properties. A similarity measure proposed by Gower [Gowe71] is particularly useful for data with mixed attribute types. This measure is defined as:

$S(a,b) = \sum_{i=1}^{m} w_i \, s_i(a_i, b_i) \; / \; \sum_{i=1}^{m} w_i$   (6)

where $a$ and $b$ are two objects with $m$ attributes, $a = (a_1, \ldots, a_m)$ and $b = (b_1, \ldots, b_m)$. In this formula, $s_i(a_i, b_i)$ is the normalized similarity index, in the range between 0 and 1, between the objects $a$ and $b$ as measured by the function $s_i$ for the $i$th attribute, and $w_i$ is the weight of the $i$th attribute. The similarity index $s_i(a_i, b_i)$ can be any appropriate function among the similarity measures defined in Sections 4.1 and 4.2, depending on the attribute types or the application; higher weights are assigned to more important attributes. As the reader may already have observed, our approach to assessing object similarity, defined in formula (0), relies on Gower's similarity measure and associates the similarity measures $\psi_i$ with modular units that represent different facets of objects. Wilson and Martinez [Wils97] introduce comprehensive similarity measures, called HVDM, IVDM, and WVDM, for handling mixed attribute types; Gower's similarity framework for mixed attribute types can be extended toward Wilson and Martinez's framework by adding the appropriate similarity measure for each attribute type defined in HVDM.
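The following sketch shows formula (6) in code. It is illustrative only; the helpers range_sim and equal_sim are hypothetical examples of per-attribute similarity indices (a range-normalized index for quantitative attributes and exact match for qualitative ones), not part of the paper's tool.

```python
def gower_similarity(a, b, sims, weights):
    """Gower's mixed-type similarity, formula (6).

    a, b    -- objects as tuples/lists of attribute values
    sims    -- per-attribute similarity functions, each returning [0, 1]
    weights -- per-attribute weights
    """
    num = sum(w * s(x, y) for s, w, x, y in zip(sims, weights, a, b))
    return num / sum(weights)

# Hypothetical per-attribute indices:
def range_sim(lo, hi):
    # normalized similarity for a quantitative attribute with range [lo, hi]
    return lambda x, y: 1.0 - abs(x - y) / (hi - lo)

def equal_sim(x, y):
    # exact-match similarity for a qualitative attribute
    return 1.0 if x == y else 0.0

# Age and gender of Johny (43, M) vs. Andy (21, F), with equal weights:
print(gower_similarity((43, "M"), (21, "F"),
                       [range_sim(21, 67), equal_sim], [1.0, 1.0]))
```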
4.4 Support for the Contextual Assessment of Similarity

The similarity measures introduced so far do not take into account that attributes frequently have to be interpreted in the context of other attributes. For example, consider the data set of Figure 2, in which customer "Johny" made purchases in three product groups, p1 for $400, p2 for $70, and p3 for $200, while customer "Andy" spent $390 on product group p2 and $100 on product group p3. If the product ids p1, p2, and p3 stand for "TV", "fruit", and "jewelry", respectively, it might not be sensible to compute the similarity of the purchase amount attribute between "Johny" and "Andy" without considering the type of product bought, because purchases of fruit might not be considered similar to purchases of TV-related products even if the amounts spent on the purchases are similar. That is, the similarity of the amount attribute needs to be evaluated in the context of the product attribute. In the following, we introduce a new similarity measure for this purpose.

Let us assume that the similarity of attribute $\beta$ has to be evaluated in the context of attribute $\alpha$, which we denote by $\beta|\alpha$. The similarity between two objects with respect to $\beta|\alpha$ is then defined as follows:

$s_{\beta|\alpha}(a,b) = \sum_k \psi(\alpha_k) \, s(\beta_k) \; / \; \sum_k \psi(\alpha_k)$   (7)

where $\psi$ is a matching function for the context attribute $\alpha$, $s$ is a similarity function for the attribute $\beta$, and $k$ ranges over the elements of the bags. For a qualitative context attribute, the value of $\psi$ is 1 if both objects take the same value for $\alpha$, and 0 otherwise (no matching values); for a quantitative context attribute, $\psi$ takes a value between 0 and 1 (a normalized distance value) that represents the degree of relevancy between the two objects with respect to $\alpha$. Note that the contextual relationship is not commutative (i.e., $\beta|\alpha \neq \alpha|\beta$). In addition, $\alpha$ and $\beta$ can theoretically be expanded to conjunctive or disjunctive lists of attributes; accordingly, the general form of $\beta|\alpha$ is $\beta_1 \wedge \beta_2 \wedge \ldots \wedge \beta_p \,|\, \alpha_1 \; op \; \alpha_2 \; op \ldots op \; \alpha_n$, where $op$ is either $\wedge$ or $\vee$, and $p$ and $n$ are the numbers of attributes involved in the similarity computation between the two objects. However, since the similarity between two objects is computed attribute by attribute over the selected list of attributes, this can be rewritten as $\beta \,|\, \alpha_1 \; op \; \alpha_2 \; op \ldots op \; \alpha_n$. Examples of contextual relationships are $\beta|\alpha$, $\beta|\alpha_1 \wedge \alpha_2 \wedge \ldots \wedge \alpha_n$, and $\beta|\alpha_1 \vee \alpha_2 \vee \ldots \vee \alpha_n$. For the case $\beta|\alpha_1 \wedge \alpha_2$, for instance, $\psi$ is 1 for qualitative context attributes when both objects take the same values for both $\alpha_1$ and $\alpha_2$. In this definition, the information from the related multi-valued attributes is combined in an orderly way to yield a similarity value. This similarity measure is embedded into our similarity framework.

Figure 5 illustrates how the similarity is computed under contextual assessment, showing how $\psi(k)$ and $s(k)$ are used when computing $s_{amount|pgid}$ between the example objects "Johny" and "Andy".

Objects  Product Id ($\alpha$)  Amount ($\beta$)
Johny    TV (p1)                400
         Fruit (p2)             70
         Jewelry (p3)           200
Andy     Fruit (p2)             390
         Jewelry (p3)           100

$\psi(1)$: TV = 0, $\psi(2)$: Fruit = 1, $\psi(3)$: Jewelry = 1. Assuming the city-block metric, the normalized similarity indices are computed as $s(1) = 0.0$, $s(2) = 0.18$, $s(3) = 0.5$. The similarity between Johny and Andy for amount in the context of product id is therefore $s_{amount|pgid}(Johny, Andy) = (0 + 0.18 + 0.5)/(0 + 1 + 1) = 0.34$.

Figure 5. Two objects with attributes that carry contextual information, and an example of the similarity computation

Note that the proposed contextual similarity is not designed to find sequential patterns, as PrefixSpan [Pei01] does, or to measure transitive similarity [Gibs98], but to take valid contextual information into account in the similarity computation.
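A sketch of formula (7) is given below. It reproduces the numbers of Figure 5 under the assumption, which matches the figure's indices of 0.18 and 0.5, that the normalized amount similarity of two matched purchases x and y is taken as min(x, y)/max(x, y); other normalizations are possible.

```python
def contextual_similarity(bag_a, bag_b, sim):
    """Contextual similarity s_{beta|alpha}, formula (7).

    bag_a, bag_b -- bags of (context_value, target_value) pairs
    sim          -- normalized similarity function for the target attribute
    """
    ctx_b = {ctx: val for ctx, val in bag_b}
    psi_sum, weighted = 0.0, 0.0
    for ctx, val in bag_a:
        psi = 1.0 if ctx in ctx_b else 0.0  # qualitative context: exact match
        psi_sum += psi
        if psi:
            weighted += psi * sim(val, ctx_b[ctx])
    return weighted / psi_sum if psi_sum else 0.0

# Assumed normalization reproducing Figure 5's indices (0.18 and 0.5):
ratio_sim = lambda x, y: min(x, y) / max(x, y)

johny = [("TV", 400), ("Fruit", 70), ("Jewelry", 200)]
andy = [("Fruit", 390), ("Jewelry", 100)]
print(round(contextual_similarity(johny, andy, ratio_sim), 2))  # 0.34
```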
5 Architecture of the Database Clustering System

Figure 6 depicts the architecture of the database clustering system that we are currently developing. The system consists of three major tools: a data preparation tool, a clustering tool, and a similarity measure tool. The data preparation tool is used to generate an object view from a relational database based on the user's requirements. The clustering tool guides the user in choosing an appropriate clustering algorithm for an application from a library of clustering algorithms that contains various algorithms, such as nearest-neighbor and hierarchical clustering. Once a clustering algorithm has been selected, the similarity measure tool assists the user in constructing an appropriate similarity measure for his or her application and the chosen clustering algorithm. While the user constructs the similarity measure, the system inquires about the types, weights, and other characteristics of the attributes, offering alternatives and choices to the user if more than one similarity measure seems appropriate.

Figure 6. Architecture for database clustering: through a user interface, the data preparation tool generates an object view from the DBMS, the similarity measure tool constructs a similarity measure from a library of similarity measures using type, weight, default-choice, and domain information, and the clustering tool applies an algorithm from a library of clustering algorithms to produce a set of clusters.

If the user does not provide the necessary information, default assumptions are made based on the attribute types (e.g., Euclidean distance is chosen for quantitative types, and Tversky's ratio model is our default choice for qualitative types). The range information for quantitative attributes can easily be retrieved from a given data set by scanning the column vectors of the quantitative attributes; this range information is used to normalize the similarity index. Normalizing the similarity index is important for combining the similarity values of attributes with possibly different types. Finally, the clustering tool takes the constructed similarity measure and the object view as its input and returns a set (or a hierarchy) of object clusters as its output.

5.1 A Framework to Generate Object Views from Databases

Figure 7. A framework for generating object views: based on the database name, the data set of interest, the object attribute(s), and the selected attributes supplied through the user interface according to the user's interests and objectives, the structured database is preprocessed into either a bag-based object view for our generalized clustering algorithms or a flat-file-based object view for conventional clustering algorithms.

Figure 7 illustrates the proposed framework for generating object views from a database. One of the key ideas of the proposed research for dealing with the problems raised in Section 2.2 is to develop a semi-automatic data preparation tool that generates object views from a (relational) database based on the user's interests and objectives. The tool automates the first three steps of the database clustering methodology introduced in Section 2.1. The tool is interactive, so that the user can define his or her object view and the relevant attributes; based on these inputs, an object view is generated automatically by the tool. In order to generate an object view from a database, our approach is first to enter a database name and then to select a table called the data set of interest, the object attribute(s), and the selected attributes. The data set of interest is an anchor table for the other related tables the user is interested in for clustering. The object attribute(s) (usually a key attribute of the data set of interest) define the object view of the particular clustering task.
An object in a relational database is defined as a collection of tuples that have the same values for all object attribute(s); this set of tuples is viewed as describing the same object. Consequently, when generating an object view, the information in tuples that agree in the object attributes is combined into a single object, in the format shown in Figure 2 (c), whereas tuples that do not agree are represented as different objects in the generated object view. The selected attributes are attributes from all the related tables the user has chosen. Although the tool can generate an object view in the conventional flat file format for conventional clustering algorithms, the main object view format in our approach is bag-based.

Figure 9 shows the implemented interface of the data preparation tool, which generates an object view from a relational database; we used Visual Basic to implement this tool. Using the information provided by the user through the interface, the algorithm for generating an object view works as follows: once the database name and the data set of interest are given, the attributes of the data set of interest are first extracted; next, the related attributes in the related tables are collected by joining (usually outer-joining) with the related tables; finally, the object attribute(s) are selected from the attributes, and the object view is created by grouping the tuples with the same values for the object attribute(s) into one object, with bags of values for the related attributes [Ryu98c and Zehu98 give a more detailed description of the algorithm].

Figure 9. Interface of the data preparation tool

5.2 Features of the Clustering Tool

Figure 10 shows the class diagram of our clustering tool in UML (the Unified Modeling Language, a notational language for software design and architecture [Mart97]). The class diagram describes the developed classes, their attributes and operations, and the relationships among the classes. The GetAnalysisInfo class receives basic information from the user, such as the name of the selected data set, the attributes of interest, the data types of the attributes, and the similarity measure to be applied to the selected data set. The ReadDataSetObjects class reads the selected data set. The Similarity Measure class defines our similarity measure; for this implementation, we chose the average dissimilarity measure for quantitative attributes and Tversky's ratio model for qualitative attributes, taking the contextual assessment of similarity into account. The Clustering class defines a clustering algorithm that uses the similarity measure defined in the Similarity Measure class.

Figure 10. Class diagram of the clustering tool

For the clustering algorithm, we chose the nearest-neighbor algorithm, a partitioning clustering method in which two objects are considered similar and are put into the same cluster if they are neighbors or share neighbors. In this algorithm, the first object o1 of a data set D = {o1, o2, o3, ..., on}, which is to be partitioned into K clusters, is assigned to a cluster C1. For each remaining object oi, the nearest neighbor among the objects already assigned to clusters is determined; oi is then assigned to the cluster CJ of this nearest neighbor if the distance between oi and the nearest neighbor is at most t (t is a threshold of the nearest-neighbor algorithm, selected by the user); otherwise, the object oi is assigned to a new cluster CR. This step is repeated until all objects in the analyzed data set have been assigned to clusters.
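The following is a minimal sketch of this threshold-based nearest-neighbor clustering loop. It assumes a symmetric distance function dist(a, b) is supplied (in our setting, for example, one minus the similarity of formula (0)) and returns a list of clusters, each a list of objects.

```python
def nearest_neighbor_clustering(objects, dist, t):
    """Threshold-based nearest-neighbor clustering (Section 5.2).

    Each object joins the cluster of its nearest already-assigned
    neighbor if that neighbor is within distance t; otherwise it
    starts a new cluster.
    """
    clusters = []   # list of clusters; each cluster is a list of objects
    assigned = []   # (object, cluster index) pairs processed so far
    for o in objects:
        if not assigned:
            clusters.append([o])
            assigned.append((o, 0))
            continue
        nn, cj = min(assigned, key=lambda pair: dist(o, pair[0]))
        if dist(o, nn) <= t:
            clusters[cj].append(o)
            assigned.append((o, cj))
        else:
            clusters.append([o])
            assigned.append((o, len(clusters) - 1))
    return clusters
```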
When using the nearest-neighbor algorithm, the user has to provide the threshold value to be used in the clustering process. The threshold value sets the condition under which two objects can be grouped together in the same cluster; consequently, the threshold value affects the number of generated clusters: as the threshold increases, fewer clusters are generated. We selected the nearest-neighbor algorithm because it is directly applicable to clustering the proposed bag-based object view. Other algorithms that compute the similarity directly between two objects are also applicable to our framework. However, generalizing a clustering algorithm such as K-means is not trivial, because of the difficulty of computing centroids for clusters of objects that are characterized by sets of bags of values.

Figure 11. Interface of the clustering tool

Figure 11 illustrates the implemented interface of the clustering tool. In order to cluster an object view generated by the data preparation tool, the user needs to select a tab for a flat-file-based or bag-based data set, the attributes, the attribute types, the corresponding weights, the threshold value, and the output of the clustering.

5.3 Database Clustering Examples

We used two relational databases: the Movies database, available in the UC Irvine data set archive [UCIML04], and an online customer database received from a local company [Ryu02]. In these experiments, we generated three different data sets for each database, using the three data representation formats shown in Figure 2: a single-valued data set and an average-valued data set for the conventional representation, and a multi-valued data set for the proposed representation, in order to see whether the data set representation affects the quality of the clusters. For the clustering algorithm, we chose the nearest-neighbor algorithm. For the similarity measure, we used formula (0), which is based on Gower's similarity function (6). For the multi-valued data set, the function consists of formula (3) for quantitative attributes, formula (5) for qualitative attributes, and formula (7) for the contextual assessment. For the single-valued and average-valued data sets, the function consists of formula (1) (the Euclidean metric) for quantitative attributes and formula (5) for qualitative attributes, but not formula (7). Other similarity measures can also be incorporated into the proposed framework, depending on the user's choice.

The Movies database holds information about movies produced since 1900, such as their titles, types, directors, producers, and years of release. Table 1 lists the selected attributes, their types, and the assigned weights for the Movies database.

Attribute  Properties                   Assigned Weight
Film_Id    Single-valued, Qualitative   0
Year       Single-valued, Quantitative  0.2
Director   Single-valued, Qualitative   0.6
Category   Multi-valued, Qualitative    0.7
Awards     Multi-valued, Qualitative    0.5

Table 1. Selected attributes, types, and weights for the Movies database

The key attribute of this data set is Film_Id. All attributes in this data set except Year (we are not very interested in the year information) are of qualitative type. The attributes Year and Director are single-valued; Category and Awards are multi-valued. The empirically selected threshold value for the clustering algorithm in this experiment is 0.36 [Sala00]. The number of clusters generated by each technique is shown in Table 2.
The same clustering algorithm, with the same similarity framework but slightly different similarity formulas, was applied to the different data sets, with the exception of the average-valued data set: we did not generate an average-valued data set for the Movies database, because most attributes in this database are symbolic and cannot easily be converted to a meaningful quantitative scale.

Approach                 Number of Clusters
Single-valued approach   136
Average-valued approach  N/A
Multi-valued approach    130

Table 2. Number of clusters generated by each approach

Both the single-valued approach and the multi-valued approach produced the same clusters for the single-valued objects. This is not surprising, since the multi-valued approach is basically a generalization of the single-valued approach. The number of clusters generated by the multi-valued approach is smaller than that generated by the single-valued approach.

Film_id                              category    director    Awards  Year
Asf10, T:I'll be Home for Christmas  Comd        D:Sanford   Null    1998
Asf8, T:A Very Brady Sequel          Comd        D:Sanford   Null    1996
Atk10, T:Hilary and Jackie           BioP, Comd  D:A.Tucker  Null    1998
Atk12, T:Map of Love                 BioP, Comd  D:A.Tucker  Null    1999

Table 3. Some objects in cluster-A from the multi-valued approach

Moreover, as we expected, in the clustering result of the single-valued approach the same objects with multiple values were grouped into different clusters; obviously, no such problem occurs in the multi-valued approach. Some objects of a cluster generated by the multi-valued approach are shown in Table 3; this cluster contains four different objects with similar properties. Note that the attribute category has the highest weight. As Tables 4 and 5 illustrate, some objects appear in two different clusters generated by the single-valued approach, even though they belong to the same cluster in Table 3.

Film_id                     category  Director    awards  year
Atk10, T:Hilary and Jackie  BioP      D:A.Tucker  Null    1998
Atk12, T:Map of Love        BioP      D:A.Tucker  Null    1999

Table 4. Some objects in cluster-B from the single-valued approach

Film_id                              category  Director    awards  Year
Asf10, T:I'll be Home for Christmas  Comd      D:Sanford   Null    1998
Asf8, T:A Very Brady Sequel          Comd      D:Sanford   Null    1996
Atk10, T:Hilary and Jackie           Comd      D:A.Tucker  Null    1998
Atk12, T:Map of Love                 Comd      D:A.Tucker  Null    1999

Table 5. Some objects in cluster-C from the single-valued approach

For example, the two objects "Atk10, T:Hilary and Jackie" and "Atk12, T:Map of Love" are split across different clusters by the single-valued approach; note that these objects appear in both cluster-B and cluster-C. This clustering result may confuse data analysts. We could not compare the quality of the clusters for each data set, since no class information is available for the Movies database.

In the second experiment, we used the online customer database of a local Internet company that sells popular climate control products, such as portable heaters, window air conditioners, etc. The size of this data set is 25,221 records, after eliminating redundant, incomplete, and inconsistent data. Ryu and Chang [Ryu02] studied this database to identify the characteristics of customers, using decision tree [Quin93] and association rule mining [Agra93] approaches. They found three major groups of customers for the company, as shown in Figure 12.

Figure 12. Three customer groups with a higher buying tendency than the average (East Coast High Rise Renters, Young Immigrant Families, and Immigrants); the vertical axis represents the percentage of buyers.
The attributes and the weights shown in Table 6 were therefore selected and assigned based on the analysis results of Ryu and Chang [Ryu02].

Attribute     Properties                   Assigned Weight
CustID        Single-valued, Qualitative   0
Age           Single-valued, Quantitative  0.7
Ethnic group  Single-valued, Qualitative   0.7
Amount        Multi-valued, Quantitative   0.6
PayType       Multi-valued, Qualitative    0.6
City          Single-valued, Qualitative   0.8
State         Single-valued, Qualitative   0.8

Table 6. Selected attributes, types, and weights for the online customer database

Again, in this experiment we want to see whether the clusters generated by each data representation approach are compatible with the previous analysis results. The clustering results are shown in Table 7. The multi-valued approach generates fewer clusters than the other approaches. However, the number of clusters generated by each approach is much larger than the three groups shown in Figure 12.

Approach                 Number of Clusters
Single-valued approach   251
Average-valued approach  95
Multi-valued approach    81

Table 7. Number of clusters generated by each approach

We therefore manually examined the contents of each cluster and found that many clusters can eventually be merged into the three groups shown in Figure 12. This job was much easier for the clusters generated by the multi-valued approach. For the clusters generated by the single-valued approach, however, it was very difficult, since the same objects with multiple values appear in different clusters. For example, Table 8 shows some objects in a cluster-A generated by the single-valued approach.

CustID  age  Ethnic group  Amount  payType  City         State
12001   27   A             25      Credit   Brooklyn     NY
13100   30   A             30      Credit   Newark       NJ
12200   33   A             50      Credit   Los Angeles  CA
13200   29   B             55      Credit   Bronx        NY

Table 8. Some objects in cluster-A generated by the single-valued approach

Table 9 shows some objects assigned to another cluster generated by the single-valued approach. As we can see, the customers 12001 and 13200 are represented as two different objects and assigned to different clusters; they should have been grouped into either cluster-A or cluster-B.

CustID  age  Ethnic group  Amount  payType  City       State
12001   27   A             280     Paypal   Brooklyn   NY
12005   30   B             125     Paypal   Sunnyside  NY
13200   29   B             280     Paypal   Bronx      NY
13200   29   B             235     Paypal   Bronx      NY

Table 9. Some objects in cluster-B generated by the single-valued approach

There are 157 customers (out of 5,271) that, like customers 12001 and 13200, are grouped into more than one cluster. The average-valued and multi-valued approaches do not create this type of confusion. However, the clustering result of the average-valued approach is not as accurate as that of the multi-valued approach, or even that of the single-valued approach. One possible reason is that, in the average-valued approach, the mapping from qualitative to quantitative data, or the representative values (the first values picked from the tuples of an object when such a mapping is not possible) for the qualitative attributes, may be inappropriate (see the example format in Figure 2 (d)). In summary, the overall quality of the clusters generated by the multi-valued approach is better than that of the other approaches; in addition, analyzing the clustering result generated by the multi-valued approach is much easier. Intuitively, one might expect the run-time of the multi-valued approach to be longer than that of the other approaches because of the additional computation; however, the overall run-times, including preprocessing, of the approaches were not very different.
This may be because the multi-valued approach deals with a much smaller number of records during clustering than the single-valued approach; the average-valued approach, in turn, requires additional preprocessing time.

6 Related Work on Structural Data Analysis

In this section, we review the literature on approaches for dealing with structural data sets. We categorize these approaches into two general groups: the data set conversion approach, which converts a database into a single flat data set without modifying the data mining methods, and the method generalization approach, which generalizes data mining algorithms so that they can be applied directly to structured objects. We also discuss previously proposed database clustering algorithms.

6.1 Data Set Conversion Approach

In order to convert a structured data set into a single flat file, related data sets are usually joined, and various aggregate functions and/or generalization operators have to be applied to remove multi-valued attributes (for example, by averaging sets of values or by storing generalizations) before data mining techniques are applied to the given data set. Conventional data mining techniques are then applied to the "flattened" data set without any need for generalization. Many existing statistical analysis and data mining techniques employ this approach [Agra93, Shek96, Bisw95, Han96, Haim97].

Nishio et al. [Nish93] proposed generalization operators that can be applied to convert a set of values into a higher-level concept description that encompasses the set of values, within an object-oriented database framework. For example, the set of values {tennis, soccer, volleyball} can be generalized into the single higher-level concept "sports". They categorize attributes into several types, such as single-valued, set-valued, list-valued, and structure-valued attributes, and propose generalization mechanisms for each category of attribute. Applying a generalization operator to related values may be a reasonable idea, since the generalized values of an attribute may preserve the related information. However, it is not always possible to generalize a set of values into a correct and consistent high-level concept description, particularly for quantitative attributes, since the same set of values can be generalized in several ways. Moreover, in many application domains suitable generalization hierarchies for symbolic attributes are not available.

Gibson [Gib00] and, similarly, Ganti [Gant99] introduce novel formalizations of a cluster for categorical attributes and propose clustering algorithms for data sets with categorical attributes. DuMouchel et al. [DuMo99] proposed a methodology that squashes flat files using statistical approaches, mainly to resolve the scalability problem of data mining. The methodology consists of three steps: grouping, momentizing, and generating. These steps describe a squashing pipeline whereby the original data set is sectioned into mutually exclusive groups; within each group a series of low-order moments is computed; and finally these moments are passed to a routine that generates pseudo data that accurately reproduce the moments. They claim that the squashed data set preserves the structure of the original data set.

6.2 Method Generalization Approach

The other way to cope with structured data sets is to generalize existing data mining methods so that they can perform data mining tasks in structured domains.
A few approaches have been proposed in the literature [Gold95, Haye78, Step86, Mich83, Wass85, Thom91, Kett95, Mana91, Mcke96, Hold94, Biss92, Kiet94, Kauf96] that directly represent structured data sets using more complex data structures and that generalize data mining techniques for those data structures. In this section we review only those approaches that we consider most relevant for database clustering.

Goldberg and Senator [Gold95] restructure databases for data mining by consolidation and link formation. Consolidation relates identifiers present in a database to a set of real-world entities (RWEs) that are not uniquely identified in the database. This process can be viewed as a transformation of representation from the identifiers present in the original database to the RWEs. Link formation constructs structured relationships between consolidated RWEs through identifiers and events explicitly represented in the database. Both consolidation and link formation may be interpreted as transformations of representation from the identifications originally present in a database to the RWEs of interest.

McKearney and Roberts [Mcke96] produce a single data set for data mining by generating a query after analyzing the dependencies between attributes and the relationships between data sets (e.g., relations or classes). This is somewhat similar to our approach, except that our approach employs sets of queries (and not a single query) to construct modular units for similarity assessment.

LABYRINTH [Thom91] is a system that extends the well-known conceptual clustering system COBWEB [Fish87] to structured domains. It forms concept hierarchies incrementally and integrates many interesting features, such as incremental, probabilistic, and unsupervised operation as well as relationship and component features, as used in earlier systems. For example, it learns probabilistic concepts and decomposes objects into sets of components to constrain matching, like MERGE [Wass85]. LABYRINTH can make effective generalizations by using a more powerful structured representation language. Ketterlin et al. [Kett95] also generalize COBWEB to cope with complex (or composite) objects, i.e., objects that have many other related objects (or components), in order to deal with 1:n or n:m relationships in structured databases. The basic idea of their system is to find a characterization of a cluster of composite objects through component clustering: components are clustered first, leading to a component-cluster hierarchy, and then the composite objects are clustered. The systems KBG [Biss92] and KLUSTER [Kiet94] both employ high-level languages (first-order logic and description logic, respectively), and both build a DAG (directed acyclic graph) of clusters instead of a hierarchy.

Ribeiro et al. [Ribe95, Kauf96] extend the discovery system INLEN [Kauf91] to discover knowledge in multiple data sets. In order to discover knowledge across multiple data sets, they include information on the primary and foreign keys of the target data sets (e.g., relations); that is, keys serve as the links across the data sets, as is the case in our approach. INLEN first discovers knowledge in each database or relation; then the discovered knowledge is associated with related information using the foreign key information; finally, all the discovered knowledge for each database is integrated into a single knowledge base.

Gibson et al. [Gibs98] proposed an approach for clustering categorical data based on dynamical systems.
Gibson et al. [Gibs98] proposed an approach for clustering categorical data based on dynamical systems. The approach captures the similarity structure arising from the co-occurrence of values in a data set, using an iterative method that assigns and propagates weights on the categorical values. Ganti et al. [Gant99] proposed an improved approach, called CACTUS, for categorical data clustering; it uses inter-attribute and intra-attribute summaries to compute "candidate" clusters, which are then validated to determine the actual set of clusters.

6.3 Database Clustering Algorithms

Several clustering algorithms for large databases, such as CLARANS [Ng94], DBSCAN [Este96], BIRCH [Zhan96], and STING [Wang97], have been proposed. However, most of these algorithms target spatial databases rather than structural databases such as business-oriented relational databases or object-oriented databases. Moreover, like many conventional clustering algorithms, they make the one-tuple one-object assumption. Bradley et al. [Brad98] proposed a scalable clustering framework for large databases based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. That framework focuses on the scalability of database clustering algorithms.

7 Summary and Conclusion

In this paper, methodologies, techniques, and tools for clustering databases were introduced. One critical problem of database clustering is the data model discrepancy between the representation format used to store the target data and the input format that clustering algorithms expect. In most databases, data are stored in several tables or classes, and related information is represented as relationships among those tables or classes, while most traditional clustering algorithms assume that the input data are stored in a single flat file. Based on this observation, we showed that the traditional flat file format is not appropriate for storing related information, since it restricts each attribute to a single value, whereas once related objects in related tables or classes are combined, objects are frequently characterized by bags of values (or tuples). We proposed a better-suited data representation format for database clustering that relies on bags of tuples and modular units, and we introduced similarity measures that fit the proposed representation framework. Moreover, we reviewed the features of a database clustering environment that employs the proposed representation framework. We proposed a unified framework of similarity measures to cope with mixed-type attributes that may carry sets or bags of values, and with the object relationships commonly found in databases. Most of the similarity measures we recommend were introduced in the literature long ago; however, we also introduced a new similarity measure that allows attribute similarity to be defined in the context of other attributes.

We conducted experiments in which we clustered different types of data sets with the nearest-neighbor algorithm and analyzed the results. In these experiments, we performed a cluster analysis for the same data represented in various formats, including a single-valued, an average-valued, and a multi-valued data set, to assess the effectiveness of the proposed framework (the sketch below pictures the three formats). Based on our analysis, we found that the proposed multi-valued data representation produced clearer (better-quality) clusters than the traditional representations (the single-valued and average-valued data sets). Interpreting the clustering results generated by the proposed approach is also easier: in the results generated by the traditional approaches, as we expected, an object with multiple values is often split across different clusters, which may confuse data analysts.
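The following Python sketch is our own illustration (the data are invented) of how one object with a bag of related values appears under each of the three representations used in the experiment:

    # One customer with three related purchase amounts (a bag of values).
    amounts = [12.0, 15.0, 80.0]

    # Single-valued data set: one record per related tuple, so the same
    # object turns into three records that a conventional algorithm
    # treats as three independent objects.
    single_valued = [{"cust": "c1", "amount": a} for a in amounts]

    # Average-valued data set: the bag is aggregated to one value,
    # discarding the spread of the related values.
    average_valued = [{"cust": "c1", "amount": sum(amounts) / len(amounts)}]

    # Multi-valued data set: the bag is kept intact, and a bag-oriented
    # similarity measure compares such objects directly.
    multi_valued = [{"cust": "c1", "amount": amounts}]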
Future Work and Issues

Although we claim that the introduced knowledge representation framework is useful for representing related information, several issues are not yet analyzed or understood in sufficient detail. In general, the proposed representation framework applies well to clustering algorithms that compute the similarity directly between pairs of objects; nearest-neighbor clustering and DBSCAN [Este96] belong to this category. For other clustering algorithms, such as K-means or COBWEB, a major modification would be required in order to cluster object views based on bags of tuples.

Generalizing decision tree algorithms such as C4.5 to cope with structural information is another challenging problem. One way to approach it would be to generalize the decision tree learning algorithm so that it can be applied to bags of values (or tuples). Such a generalization could reuse the preprocessing techniques described in this paper and would make it possible to apply concept learning algorithms directly to databases that consist of multiple related data sets, such as relational databases, which is currently not possible.

Another subject that needs further investigation is the scalability of the nearest-neighbor clustering framework we proposed. When running our prototype system, we observed that, as the number of objects and/or the amount of information associated with a particular object grows, the construction of the object similarity matrix becomes the performance bottleneck of our clustering algorithm; the sketch below shows why. We believe that, in order to obtain better scalability, special efficient data structures for object views need to be developed that facilitate both the construction of object views and the similarity computations for pairs of objects.
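The following sketch (ours; the similarity function is a placeholder for the bag-oriented measures discussed earlier) shows where the cost arises: constructing the matrix requires n(n-1)/2 similarity evaluations, and each evaluation may itself compare two bags of tuples, so the running time grows quadratically in the number of objects and further with the bag sizes:

    # Pairwise similarity matrix over n object views; sim is assumed to be
    # a (possibly expensive) bag-oriented similarity function normalized so
    # that an object has similarity 1.0 to itself.
    def similarity_matrix(objects, sim):
        n = len(objects)
        m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for i in range(n):                 # n(n-1)/2 pairs in total ...
            for j in range(i + 1, n):
                s = sim(objects[i], objects[j])   # ... each comparing two bags
                m[i][j] = m[j][i] = s
        return m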
Bibliography

[Ande73] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.

[Agra93] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, In Proc. of the ACM SIGMOD Conference, pp. 207-216, 1993.

[Ashb88] F.G. Ashby, N.A. Perrin, Toward a unified theory of similarity and recognition, Psychological Review 95(1), pp. 124-150, 1988.

[Biss92] G. Bisson, Conceptual clustering in a first order logic representation, In Proc. of the Tenth European Conference on Artificial Intelligence, John Wiley & Sons, 1992.

[Bisw95] G. Biswas, J. Weinberg, C. Li, ITERATE: a conceptual clustering method for knowledge discovery in databases, In Innovative Applications of Artificial Intelligence in the Oil and Gas Industry, B. Braunschweig, R. Day (Eds.), 1995.

[Brad98] P.S. Bradley, U. Fayyad, C. Reina, Scaling clustering algorithms to large databases, In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, 1998.

[Chee96] P. Cheeseman, J. Stutz, Bayesian classification (AutoClass): theory and results, In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), AAAI/MIT Press, Cambridge, MA, pp. 153-180, 1996.

[Domi96] P. Domingos, Linear-time rule induction, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.

[DuMo99] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, In Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, 1999.

[Este96] M. Ester, H-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.

[Ever93] B.S. Everitt, Cluster Analysis, 3rd edition, Edward Arnold, co-published by Halsted Press, an imprint of John Wiley & Sons Inc., 1993.

[Fish87] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2, pp. 139-172, 1987.

[Gib00] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approach based on dynamical systems, VLDB Journal 8(3-4), pp. 222-236, 2000.

[Gant99] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS - clustering categorical data using summaries, In Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, pp. 73-83, 1999.

[Gibs98] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approach based on dynamical systems, In Proc. of the 24th International Conference on Very Large Databases, New York, 1998.

[Gowe71] J.C. Gower, A general coefficient of similarity and some of its properties, Biometrics 27, pp. 857-872, 1971.

[Haim97] I.J. Haimowitz, O. Gur-Ali, H. Schwarz, Integrating and mining distributed customer databases, In Proc. of the 3rd Int'l Conf. on Knowledge Discovery and Data Mining, Newport Beach, California, 1997.

[Han96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O.R. Zaiane, DBMiner: a system for mining knowledge in large relational databases, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.

[Hart75] J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., 1975.

[Haye78] F. Hayes-Roth, J. McDermott, An interference matching technique for inducing abstractions, Communications of the ACM 21, pp. 401-410, 1978.

[HK01] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.

[Hold94] L.B. Holder, D.J. Cook, S. Djoko, Substructure discovery in the SUBDUE system, In Proc. of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, Washington, 1994.

[UCIML04] http://www.ics.uci.edu/AI/ML/Machine-Learning.html, 2004.

[Jain88] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.

[Jarv73] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared near neighbors, IEEE Transactions on Computers C-22, pp. 1025-1034, 1973.

[Kauf91] K.A. Kaufman, R.S. Michalski, L. Kerschberg, Mining for knowledge in databases: goals and general description of the INLEN system, In Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.

[Kauf96] K.A. Kaufman, R.S. Michalski, A method for reasoning with structured and continuous attributes in the INLEN-2 multistrategy knowledge discovery system, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.

[Kett95] A. Ketterlin, P. Gancarski, J.J. Korczak, Conceptual clustering in structured databases: a practical approach, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, 1995.
[Kiet94] J.-U. Kietz, K. Morik, A polynomial approach to the constructive induction of structural knowledge, Machine Learning 14, pp. 193-217, 1994.

[Lu78] S.Y. Lu, K.S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Transactions on Systems, Man and Cybernetics SMC-8, pp. 381-389, 1978.

[Mana91] M. Manago, Y. Kodratoff, Induction of decision trees from complex structured data, In Knowledge Discovery in Databases, AAAI/MIT Press, pp. 289-306, 1991.

[Mart97] M. Fowler, K. Scott, UML Distilled: Applying the Standard Object Modeling Language, Addison Wesley Longman Inc., 1997.

[Mich83] R.S. Michalski, R.E. Stepp, Learning from observation: conceptual clustering, In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, pp. 331-363, 1983.

[Ng94] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, In Proc. of the 20th Int'l Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155, 1994.

[Nish93] S. Nishio, H. Kawano, J. Han, Knowledge discovery in object-oriented databases: the first step, In Proc. of the AAAI-93 Workshop on Knowledge Discovery in Databases (KDD-93), Washington, D.C., 1993.

[Pei01] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M-C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, In Proc. of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2001.

[Quin93] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[Ribe95] J.S. Ribeiro, K. Kaufmann, L. Kerschberg, Knowledge discovery from multiple databases, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada, 1995.

[Ryu98a] T.W. Ryu, C.F. Eick, Discovering discriminant characteristic queries from databases through clustering, In Proc. of the Fourth International Conference on Computer Science and Informatics (CS&I'98), Research Triangle Park, NC, 1998.

[Ryu98b] T.W. Ryu, Discovery of Characteristic Knowledge in Databases Using Cluster Analysis and Genetic Programming, Ph.D. Dissertation, Department of Computer Science, University of Houston, Houston, 1998.

[Ryu98c] T.W. Ryu, C.F. Eick, Similarity measures for multi-valued attributes for database clustering, In Proc. of the Conference on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Data Mining and Rough Sets (ANNIE'98), St. Louis, Missouri, 1998.

[Ryu02] T.W. Ryu, W-Y. Chang, Customer analysis using decision tree and association rule mining, In Proc. of the International Conference on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Artificial Life, and Data Mining (ANNIE'02), ASME Press, St. Louis, Missouri, 2002.

[Sala00] H. Salameh, Nearest-Neighbor Clustering Algorithm for Relational Databases, Master of Science Thesis, Department of Computer Science, California State University, Fullerton, 2000.

[Shek96] E.C. Shek, R.R. Muntz, E. Mesrobian, K. Ng, Scalable exploratory data mining of distributed geoscientific data, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.

[Shep62] R.N. Shepard, The analysis of proximities: multidimensional scaling with an unknown distance function, Part I, Psychometrika 27, pp. 125-140, 1962.

[Stan86] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM 29, pp. 1213-1228, 1986.
[Step86] R.E. Stepp, R.S. Michalski, Conceptual clustering: inventing goal-oriented classifications of structured objects, In Machine Learning: An Artificial Intelligence Approach, Vol. 2, Morgan Kaufmann, San Mateo, CA, pp. 471-498, 1986.

[Thom91] K. Thompson, P. Langley, Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, D.H. Fisher, M. Pazzani, P. Langley (Eds.), Morgan Kaufmann, 1991.

[Tver77] A. Tversky, Features of similarity, Psychological Review 84(4), pp. 327-352, 1977.

[Wang97] W. Wang, J. Yang, R.R. Muntz, STING: a statistical information grid approach to spatial data mining, In Proc. of the Int'l Conf. on Very Large Data Bases, pp. 186-195, 1997.

[Wass85] K. Wasserman, Unifying Representation and Generalization: Understanding Hierarchically Structured Objects, Doctoral Dissertation, Department of Computer Science, Columbia University, New York, 1985.

[Wils97] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research 6, pp. 1-34, 1997.

[Zhan96] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pp. 103-114, 1996.

[Zehu98] W. Zehua, Design and Implementation of a Tool to Extract Structural Information from Relational Databases, Master of Science Thesis, Department of Computer Science, University of Houston, Houston, 1998.