DISCOVERING TOPICAL STRUCTURES OF DATABASES
Professor: Michalis Petropoulos
CSE 705
Megha Ramesh Kumar

Topics to be covered
- Introduction
- Problem Definition
- The iDisc approach
- Handling Complex Aggregations
- Finding Cluster Representatives
- Empirical Evaluation
- Related Work
- Conclusion

Overview
- Discovering topical structures of databases to support semantic browsing and large-scale data integration.
- iDisc is a multi-strategy learning framework.
- It exploits instance values to construct multiple database representations.
- It employs a set of base clusterers to discover preliminary topical clusters of tables from the database representations, and aggregates them into final clusters via meta-clustering.

Difficulties in Integrating Databases I
- Documentation and metadata for enterprise databases are often scattered throughout the IT department. They are incomplete, inaccurate, or missing.
- Scale of the database plus lack of documentation: the cost of integrating databases increases.
- Designers have left the company without leaving any design documents.
- The implementation of a database might not be consistent with its design.

Difficulties in Integrating Databases II
- Reverse-engineering and integrating the databases becomes difficult.
- Key step: identify the semantic correspondences (mappings) among attributes from different databases.
- Existing matching solutions attempt to find mappings between every two attributes.
- Assumptions made:
  - Databases are small.
  - Attributes in one database are potentially relevant to all attributes in another database.

[Figure: Without Topical Structures / With Topical Structures]

Gist
- We formally define the problem of discovering topical structures of databases and demonstrate how they can support semantic browsing and large-scale data integration.
- We propose a novel multi-strategy discovery framework and describe the iDisc system, which realizes this framework.
- We propose novel clustering aggregation techniques to address limitations of existing solutions.
- We propose a new approach to finding representative tables using a novel measure of table importance.
- iDisc is evaluated over real-world databases, and the results indicate that it discovers topical clusters with a high level of accuracy.

Problem Definition I
Topical Relationship
- Consider a database D, a set of tables where each table T is associated with a topic p ∈ P, written topic(T) = p.
- There exists a topical relationship between tables S and T if topic(T) = topic(S), denoted ρ(S, T).
- Example: suppose P = {Invoice, Shipment, Product}. Then ρ(InvoiceItem, InvoiceTerm) holds, since topic(InvoiceItem) = topic(InvoiceTerm).
- Properties:
  - ρ is transitive, reflexive, and symmetric.
  - ρ therefore defines equivalence classes, which are mutually exclusive.

Problem Definition II
Topical Structure
- Describes how the tables in a database are grouped based on their topical relationship.
- Consider a set of topics P, a database D, and the topical relationship ρ between tables in D with respect to P.
- The topical structure of InvDB with respect to P is {C1, C2, C3}, where
  - C1 = {InvoiceStatus, InvoiceTerm, Invoice, InvoiceItem}
  - C2 = {Shipment, ShipmentMethod}
  - C3 = {Product, ProductCategory, Category}

Problem Definition
Given a database D with a set of tables, discover:
- A set of topics P, which the tables in D are about;
- The topical structure of D with respect to P, in the form of a partition C = {C1, C2, ..., Ck} over the tables in D, where k = |P|.

The iDisc Approach

Model Builder
- The Model Builder examines database D, with its set of tables, from a number of perspectives and obtains a variety of representations for D.
- It constructs these representations from the schema information and the instance data of the database.
- The representations fall into three categories:
  - Vector-based
  - Graph-based
  - Similarity-based

Vector-Based Representations
- Capture the topical structure of a database via the descriptive information on the tables.
- Table: a text document. Database: a collection of documents.
- The structures of the tables are ignored.
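As a rough illustration of the table-as-document idea, the sketch below builds a bag of tokens from a table name and its attribute names. It is only one possible construction: the helper name `table_document` and the identifier-splitting rule are assumptions made for this example, not something iDisc prescribes.

```python
import re
from collections import Counter

# Assumed splitting rule: break CamelCase / snake_case identifiers into words.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def table_document(table_name, attribute_names):
    """Return the list of lowercase tokens describing one table.

    The table structure is ignored: the name and the attribute names are
    simply flattened into one token sequence (a "document").
    """
    tokens = []
    for identifier in [table_name, *attribute_names]:
        tokens.extend(part.lower() for part in _WORD.findall(identifier))
    return tokens


# Two InvDB tables described by their names and attribute names.
doc_status = table_document("InvoiceStatus", ["InvoiceID", "StatusCode"])
doc_term = table_document("InvoiceTerm", ["InvoiceID", "TermType"])

print(doc_status)  # ['invoice', 'status', 'invoice', 'id', 'status', 'code']

# Token counts per document; tokens shared by both tables hint at a common topic.
shared = set(Counter(doc_status)) & set(Counter(doc_term))
print(sorted(shared))  # ['id', 'invoice']
```

Building the documents from other sources (table names only, or sampled instance values) would give a different representation of the same database.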
- Different ways of constructing these documents result in different representations of the database.
- Suppose the number of unique tokens among the documents for the tables in database D is n:
  - Each document d is an n-dimensional vector <w1, w2, ..., wn>.
  - The i-th dimension corresponds to the i-th unique token in D.
  - wi is the weight of the i-th token for document d.
  - Many weighting functions are possible, e.g., the TF*IDF weight.

[Figure: InvDB, an invoice management database]

Graph-Based Representations
- Capture the topical structure of a database via the linkage among the tables.
- Nodes: tables. Edges: linkage between tables.
- iDisc first discovers primary keys and then discovers foreign keys.
- Because of the cost of enforcing constraints, information on keys and foreign keys is often missing from the catalogs.
- Consider a key attribute A in T1 and an attribute B in T2. The conditions checked are:
  - |B ∩ A| = |B|
  - |B| > 0.8 |A|
  - |B| > 2
  - NameSim(A, B) > 0.5, where NameSim measures the similarity of attribute names.

Similarity-Based Representations I
- Capture the topical structure of a database via the value-based similarity between the tables.
- Idea: if two tables are about the same topic, they may have several attributes containing similar values.
- The representation is a |D| × |D| matrix M:
  - |D| is the number of tables in D.
  - M[i, j] stores the similarity between the i-th and j-th tables in D.
- Different ways of evaluating similarity result in different representations of the database.
- iDisc procedure:
  - Evaluate value similarity between attributes.
  - Discover matching attributes.
  - Evaluate table similarity.

Similarity-Based Representations II
- Evaluate value similarity between attributes: for every two attributes X and Y, one from each table, compute the similarity as the Jaccard similarity between the sets of values in X and Y.
- Discover matching attributes (greedy matching):
  1. Let Z = ∅, U = all attributes in T, and V = all attributes in T'.
  2. Find u ∈ U and v ∈ V that have the maximum (positive) similarity among all pairs of attributes from U and V.
  3. Add the attribute pair (u, v) to Z; remove u from U and v from V.
  4. Repeat steps 2 and 3 until no more such pairs can be found.
- Example: consider tables T = InvoiceStatus and T' = InvoiceTerm.
  - J(T.InvoiceID, T'.InvoiceID) = 0.75
  - J(T.InvoiceID, T'.TermType) = 0.2
  - J(T.StatusCode, T'.TermType) = 0.15

Similarity-Based Representations III
- Evaluate table similarity: the similarity between T and T' is computed from the matched attribute pairs in Z, where |T| is the number of attributes in T.
- Example: Sim(InvoiceStatus, InvoiceTerm) = (0.75 + 0.15)/2
- iDisc's goal is not to build the best models (which typically do not exist), but to show that it can produce a better solution by building and combining many different (possibly imperfect) models.

Base Clusterers
- Take a database representation and discover a preliminary clustering over the tables in the database.
- iDisc first implements several generic clustering algorithms and then instantiates them into clusterers.
- Generic algorithms:
  - Similarity-based
  - Linkage-based
- Instantiation:
  - Vector-based representation
  - Graph-based representation
  - Similarity-based representation

Generic Similarity-Based Algorithm I
SimClust(T, M, ClsrSim, Q) → C
Input:
  T, a set of tables {T1, T2, ..., T|T|}
  M, a similarity matrix for the tables in T
  ClsrSim, a cluster similarity function
  Q, a clustering quality metric
Output:
  C, a partition of the tables in T
1. Set up initial clusters: let i = 1 and C1 = {{T1}, {T2}, ..., {T|T|}}.
2. Repeat until |Ci| = 1:
   a. Evaluate the quality of Ci via Q.
   b. Evaluate the similarities of the clusters in Ci via ClsrSim.
   c. Find Cx, Cy ∈ Ci with a maximum similarity.
   d. Merge clusters Cx and Cy.
   e. i ← i + 1
3. Return the Ci with a maximum Q value.

Generic Similarity-Based Algorithm II
- ClsrSim is a cluster similarity function that takes the similarity matrix M and two clusters of tables, Cx and Cy, and computes a similarity value between Cx and Cy.
- Implementations of ClsrSim:
  - single-link: in each step, merge the two clusters whose two closest members have the smallest distance (i.e., the two clusters with the smallest minimum pair-wise distance).
  - complete-link: in each step, merge the two clusters whose merger has the smallest diameter (i.e., the two clusters with the smallest maximum pair-wise distance).
  - average-link: a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains.

Generic Similarity-Based Algorithm III
- Q is a metric for evaluating the quality of clusterings.
- Existing options: the elbow criterion, gap statistics, cross-validation. There is no single best solution.
- The metric used:

  $$Q(C) = \sum_{C_i \in C} \frac{|C_i|}{N} \left( \mathrm{IntraSim}(C_i) - \mathrm{InterSim}(C_i) \right)$$

  - N is the total number of tables in the database.
  - |Ci| is the number of tables in cluster Ci ∈ C.
  - IntraSim(Ci) is the average similarity of the tables within cluster Ci.
  - InterSim(Ci) is the maximum similarity of Ci with any other cluster in C.

Generic Linkage-Based Algorithm
LinkClust(T, G, EdgeDel, Q') → C
Input:
  T, a set of tables
  G, a linkage graph for the tables in T
  EdgeDel, a function that suggests edges to be removed
  Q', a clustering quality metric
Output:
  C, a partition of the tables in T
1. Let i = 1.
2. Repeat until G has no edges:
   a. Let Ci = the connected components of G.
   b. Evaluate the quality of Ci via Q'.
   c. Let Ec = EdgeDel(G).
   d. Remove the edges in Ec from G.
   e. i ← i + 1
3. Return the Ci with a maximum Q' value.

Shortest-Path Betweenness (SP)
- First find the shortest paths between vertices, then measure the betweenness β(e) of an edge e by the fraction of shortest paths that contain the edge:

  $$\beta(e) = \sum_{s,t \in V,\; s \neq t} \frac{\sigma_{st}(e)}{\sigma_{st}}$$

  - σst is the number of distinct shortest paths between vertices s and t.
  - σst(e) is the number of distinct shortest paths between s and t that contain the edge e.
- EdgeDel(G) then returns an edge with a maximum β value.

Spectral Graph Partitioning (SPC)
- EdgeDel returns an edge-cut of G, i.e., a set of edges that are likely to lie between two clusters.
- Consider G's Laplacian matrix LG = DG − AG:
  - DG is a diagonal matrix whose entry D[i, i] is the degree of the i-th vertex in G.
  - AG is G's adjacency matrix.
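To make the Laplacian concrete, here is a minimal sketch that assembles LG = DG − AG for a small undirected graph and uses it to count the edges crossing a bipartition. The function names, the toy graph, and the use of the standard identity xᵀ L x = Σ over edges (x_u − x_v)² (so that a ±1 vector x gives four times the cut size) are illustrative additions, not taken from the slides.

```python
def laplacian(n, edges):
    """Return L = D - A as a list of lists for an undirected simple graph
    on vertices 0..n-1 given as a list of (u, v) edges."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1  # degree contribution (matrix D)
        L[v][v] += 1
        L[u][v] -= 1  # adjacency contribution (matrix -A)
        L[v][u] -= 1
    return L


def cut_size(L, x):
    """Number of edges crossing the bipartition encoded by x (+1 / -1).

    Uses x^T L x = sum over edges of (x_u - x_v)^2, which is 4 per cut edge.
    """
    n = len(L)
    quad = sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
    return quad // 4


# Hypothetical linkage graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
L = laplacian(6, edges)

print(L[2])                                   # [-1, -1, 3, -1, 0, 0]
print(cut_size(L, [1, 1, 1, -1, -1, -1]))     # 1  (only edge 2-3 is cut)
print(cut_size(L, [1, 1, -1, -1, -1, -1]))    # 2  (edges 0-2 and 1-2 are cut)
```

Every row of L sums to zero, and a bipartition that separates the two triangles cuts a single edge, which is the kind of split an edge-cut based EdgeDel is looking for.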
- It can then be shown that finding a minimum edge-cut of G corresponds to finding the smallest positive eigenvalue λ2 of LG.
- The eigenvector for λ2 suggests a possible bi-partitioning of the vertices in G: the vertices with positive values are placed in one cluster and the vertices with negative values in the other.

Metric Q'
- Measures the quality of the clusterings in LinkClust.
- Captures the intuition that a good partition of the network is one where the nodes within a community are well connected, while only a few edges connect different communities.
  - |E| is the total number of edges in the graph.
  - |Eii| is the number of edges connecting two vertices that are both in Ci.
  - |Ei| is the number of edges that are incident to at least one vertex in Ci.
  - |Eii|/|E| is the observed probability that an edge falls into the cluster Ci.
  - (|Ei|/|E|)^2 is the expected probability under the assumption that the connections between vertices are random.

Generating Base Clusterers
Graph-based representations
- iDisc generates base clusterers by instantiating LinkClust.
- If the input is a directed graph, it is first transformed into an undirected graph by ignoring the direction of the edges.
- For example: LinkClust(T, G, SP, default_Q')

Generating Base Clusterers
Vector-based representations
- iDisc generates base clusterers by instantiating SimClust.
- Consider database D with tables T = {T1, T2, ..., T|T|}, and denote the token vector for table Ti as $\hat{T}_i$.
- For every two tables, the similarity can be evaluated by a variety of methods, e.g., the Cosine function:

  $$\mathrm{Cos}(\hat{T}_i, \hat{T}_j) = \frac{\hat{T}_i \cdot \hat{T}_j}{\lVert \hat{T}_i \rVert \, \lVert \hat{T}_j \rVert}$$

- For example: SimClust(T, M, single-link, default_Q)
Similarity-based representations
- As with vector-based representations, iDisc generates base clusterers by instantiating SimClust.
- The difference is that the similarity matrix in the representation is used directly for the instantiation.

Meta-clusterer I
- Given a set of preliminary clusterings C from the base clusterers, the goal of the meta-clusterer is to find a clustering C' that agrees with the clusterings in C as much as possible.
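The preliminary clusterings that feed the meta-clusterer come from instantiations such as SimClust(T, M, single-link, default_Q). The following is a minimal sketch of that base clusterer, not iDisc's actual implementation. Several details are assumptions made so the code can run: tables are integer indices into M, single-link is expressed over similarities (the most similar pair of members), a singleton cluster gets IntraSim = 0, InterSim is measured with the same ClsrSim function, a lone cluster gets InterSim = 0, and the final one-cluster clustering is scored as well.

```python
from itertools import combinations


def single_link(M, cx, cy):
    """Cluster similarity = similarity of the two most similar members."""
    return max(M[a][b] for a in cx for b in cy)


def default_q(M, clustering, clsr_sim):
    """Q(C) = sum over clusters of |Ci|/N * (IntraSim(Ci) - InterSim(Ci))."""
    n = sum(len(c) for c in clustering)
    q = 0.0
    for c in clustering:
        pairs = list(combinations(c, 2))
        # Assumption: a singleton cluster has IntraSim = 0.
        intra = sum(M[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0
        # Assumption: InterSim uses clsr_sim; 0 when there is no other cluster.
        others = [d for d in clustering if d is not c]
        inter = max((clsr_sim(M, c, d) for d in others), default=0.0)
        q += len(c) / n * (intra - inter)
    return q


def sim_clust(tables, M, clsr_sim, q):
    """Agglomerative clustering that returns the clustering with maximum Q."""
    clustering = [[t] for t in tables]
    best, best_q = [list(c) for c in clustering], q(M, clustering, clsr_sim)
    while len(clustering) > 1:
        # Find the pair of clusters with maximum similarity and merge it.
        x, y = max(
            combinations(range(len(clustering)), 2),
            key=lambda p: clsr_sim(M, clustering[p[0]], clustering[p[1]]),
        )
        merged = clustering[x] + clustering[y]
        clustering = [c for k, c in enumerate(clustering) if k not in (x, y)]
        clustering.append(merged)
        score = q(M, clustering, clsr_sim)
        if score > best_q:
            best, best_q = [list(c) for c in clustering], score
    return best


# Hypothetical similarity matrix: tables 0-1 and 2-3 form two obvious groups.
M = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
result = sim_clust([0, 1, 2, 3], M, single_link, default_q)
print(sorted(sorted(c) for c in result))  # [[0, 1], [2, 3]]
```

Swapping `single_link` for a function that takes the minimum or the mean of the pair-wise similarities would give complete-link or average-link variants of the same clusterer.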
- The disagreement between C and C' is denoted d(C, C').

[Figure: example base clusterers B1, B2, B3 over vector-based representations (complete-link, single-link) and a linkage-based representation]

Meta-clusterer II
- iDisc implements a voting-style clustering scheme, with the difference that it automatically determines an appropriate number of clusters based on the particular votes from the input clusterers.
- The problem of finding the best aggregated clustering can be shown to be NP-complete.
- Most of the approximation algorithms are based on a majority-vote scheme.
- The meta-clustering algorithm is also based on voting, with a key difference: it does not assume an explicit number of clusters. Instead, the algorithm automatically determines an appropriate number of clusters in the aggregated clustering based on the particular votes from the input clusterers.

Meta-clusterer III
- The algorithm has two phases: vote-based similarity evaluation and re-clustering.
- Vote-based similarity evaluation:
  - Consider two tables T, T' and a clustering Ci. The vote from Ci, written V_{T,T'}(Ci), takes the value 1 if T and T' are placed in the same cluster of Ci, and 0 otherwise.
  - Based on the votes from the base clusterers, the similarity between two tables T, T' ∈ T is computed as

    $$\frac{1}{m} \sum_{i=1}^{m} V_{T,T'}(C_i)$$

    where m is the number of base clusterers.

[Figure: Vote-based similarity]

Meta-clusterer IV
- Re-clustering:
  - A similarity matrix Mv is constructed from the previous step.
  - iDisc generates the meta-clusterer as SimClust(T, Mv, single-link, default_Q).
- Existing research, however, has focused mostly on combining different clustering algorithms (single-link vs. complete-link) and not different representation models.

Handling Complex Aggregations I
- The meta-clusterer has to identify and remove the errors in the input clusterers and combine the strengths of different clusterers to produce better clusters.
- All input clusterers are treated as being equally good by the meta-clusterer.
- However, the performance of clusterers may vary a lot depending on the characteristics of the particular data set, i.e., the same clusterer might perform well on one data set but poorly on another.
- Thus, we dynamically adjust the weights of the clusterers so that better-performing clusterers are weighted more.

Handling Complex Aggregations II
- We use a clusterer boosting approach, which first estimates the performance of a clusterer by comparing it to other clusterers that are likely to be accurate.
- The results from the clusterers are then re-aggregated based on the new weights.
- Clusterer boosting involves the following steps:
  1. Determining a pseudo-solution.
  2. Ranking the input clusterers.
  3. Adjusting the weights.

Handling Complex Aggregations III
- Aggregation tree H.
- Level of aggregation: the depth of the deepest internal node in H.
- Single-level clustering: the base clusterers are aggregated at once by a single meta-clusterer.
- Multiple-level clustering: there are multiple meta-clusterers, some of which take as input the aggregated clusterings from previous meta-clusterers.

Handling Complex Aggregations IV
Similarity levels:
1. Clusterers that take the same representation (e.g., a vector-based representation) but employ different clustering algorithms (e.g., single-link vs. complete-link versions of the similarity-based algorithm).
2. Clusterers that take the same kind of representation (e.g., a vector-based representation constructed from table names vs. a vector-based representation constructed from both table and attribute names).
3. Clusterers that take different kinds of representations (e.g., a vector-based vs. a graph-based representation).
Furthermore, if one of the clusterers is a meta-clusterer, the similarity level is given by the least similarity level among all the base clusterers.

Handling Complex Aggregations V
Tree construction:
1. Initialize a set W of current clusterers with all the base clusterers.
2. Determine the maximum similarity level l among all the clusterers in W.
3. Find the set S of all clusterers with similarity level l.
4. Aggregate the clusterers in S using a meta-clusterer M and remove them from W. Add M to W.
5. Repeat steps 2–4 until there is only one clusterer left in W, which is the root meta-clusterer.

Finding Cluster Representatives I
- There may be a large number of tables on the same topic, so we need to know which tables are important.
- These tables are the cluster representatives. They serve as entry points to a cluster and give users a general idea of what the cluster is about.
- iDisc includes a Representative Finder, which discovers representative tables on the basis of their importance.
- Observation: a table that is important should be at a focal point in the linkage graph for the cluster.
- Hence, iDisc measures the importance of a table based on its centrality in the linkage graph.

Finding Cluster Representatives II
- Given a linkage graph G(V, E), the centrality of a vertex v ∈ V, denoted ζ(v), is computed as follows:

  $$\zeta(v) = \sum_{s,t \in V,\; s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

  where:
  - σst is the number of distinct shortest paths between vertices s and t.
  - σst(v) is the number of shortest paths between s and t that pass through the vertex v.

Finding Cluster Representatives III
Representative Discovery (REPDISC)
Input:
  Clustering C = {C1, C2, ..., Ck} over database D.
  Linkage graph G of D.
  A desired number r of representative tables for each cluster in C.
Output:
  The top r representative tables of each cluster.
For each cluster Ci:
1. Obtain the linkage graph GCi, the subgraph of G induced by the set of tables in Ci.
2. Evaluate the centrality scores.
3. Rank the tables in descending order of their centrality scores and return the top r tables in the ranked list.

Finding Cluster Representatives IV
Complexity of REPDISC:
- Consider a cluster Ci ∈ C and denote the induced graph for Ci as G(Vr, Er), where Vr is the set of tables in Ci and Er is the set of linkage edges between the tables in Ci.
- Step 1: the time to create the graph is O(|Vr| + |Er|). For every two tables in Vr, determine whether there is an edge between them. Suppose G is implemented with an adjacency matrix; this can be done in O(|Vr|^2). Thus, the overall complexity of step 1 is O(|Vr|^2).
- Step 2: the complexity can be shown to be O(|Vr| ∗ |Er|).
- Step 3: the complexity is O(|Vr|).
- So the overall complexity for steps 1–3 is O(|Vr| ∗ |Er|), with the dominant factor being the time for step 2.

Empirical Evaluation
Experiment Setup
- Data set:
  - HR1: engagement management
  - HR2: skill development
  - HR3: invoice tracking
- For each database, the following were determined and used as the gold standard for the experiments:
  - The set of topics in the database.
  - Which topic each table in the database is about.
- Performance metrics:
  - Precision (P): the percentage of table pairs determined by iDisc to be on the same topic that are on the same topic according to the gold standard.
  - Recall (R): the percentage of table pairs determined by the domain expert to be on the same topic that are discovered by iDisc.
  - F-measure (F1): combines precision P and recall R with equal weight, i.e., F1 = 2PR/(R + P).

Empirical Evaluation
Experiment Setup
- Experiments:
  - The utility of the various database representations and the accuracy of the individual base clusterers.
  - The aggregation accuracy of the baseline meta-clustering algorithm.
  - The impact of the proposed complex aggregation techniques.
- For all the base clusterers and meta-clusterers, the default Q and Q' were employed.
- Vector-based representations were constructed from table and attribute names, and the Cosine function was employed for computing vector similarities.
- Since the databases contain a huge number of rows, a sample of size 4k was created for each attribute and used for discovering foreign keys and attribute matches.

Results & Observations
- Base clusterers employing a complete-link algorithm (CL) tend to have higher precision and lower recall than ones based on a single-link algorithm (SL).
- CL-based clusterers typically produce a large number of small clusters, while SL-based ones produce a small number of large clusters.
- The precision of base clusterers using graph-based representations is relatively low in HR1 and HR3.
- The base clusterers utilizing vector-based representations perform consistently well over all three databases.
  This is because similar tables in these databases tend to have many common words, e.g., Emp, Emp Resume, and Emp Photo.
- Base clusterers utilizing similarity-based representations performed poorly on HR2, since a large number of tables in HR2 have several similar timestamp-like columns, e.g., create dt and del dt for table creation and deletion datetimes. Thus many tables are falsely determined to be similar to each other.

Meta-clusterers
Observations:
- The effects of "bad" base clusterers can be cancelled out.
- In HR1, the precision of Meta-Vec (87.6%) is much higher than that of Vec-SL (44.8%).
- Meta-All is far more accurate than Vec-SL, Graph-SP, and Graph-SPC, and its F1 is higher than that of all base clusterers.

Number of Topics
- The numbers of topics were compared.
  - (a) plots the total number of topics for each database.
  - (b) plots the total number of topics with at least two tables for each database.
- Observation: the numbers of topics discovered by iDisc are very close to the numbers given by the gold standard.

Empirical Evaluation
Discussion
- iDisc may disagree with the domain expert on the granularity of the partitioning and the number of subject areas in a database.
- iDisc and the domain expert may also disagree on the assignment of tables to clusters, particularly for "boundary" tables that connect several related entities.
- Some databases in our experiments contain reference tables such as country (with attributes like name, region, and ISO code) and language (with attributes like name and code). These tables are often referred to from multiple subject areas.

Related Work
Mining Database Structures:
- Bellman discovers join relationships among tables in a database, i.e., join-able attributes (attributes that are semantically similar).
- Similar attributes are found using set resemblance functions similar to the Jaccard function.
- Two tables connected via a join relationship may not be on the same topic.
- The goal of our work is to identify such inter-topic links and partition the tables accordingly.
- Data modeling products like Erwin and RDA facilitate modular development during a top-down modeling process. They enable users to create and maintain databases, streamline the design process, and synchronize the model with the database design.
- Our solution complements these functions by enabling users to reverse-engineer subject areas from a large-scale physical database during a bottom-up modeling process.

Related Work
Information Integration & Complexity Issues
- Information integration is a key problem today.
- A fragment-oriented approach to matching large schemas reduces matching complexity. It decomposes large schemas into several sub-schemas or fragments (an XML schema segment, a relational table, or a manually specified fragment) and performs fragment-wise matching.
- Our work provides an automatic approach to partitioning a large schema into semantically meaningful fragments.
Multi-Strategy Learning & Clustering Aggregation
- LSD (a system that matches source schemas against a mediated schema for data integration) employs a set of base learners. The predictions from these base learners are combined via a meta-learner.
- LSD requires training data. In contrast, the base clusterers and meta-clusterers in iDisc do not require training, i.e., iDisc is unsupervised.

Future Work
Two directions to extend iDisc:
- Develop soft-clustering and meta-clustering techniques and incorporate them into iDisc to examine their impact on its performance.
- Extend iDisc to produce a hierarchical topical structure, where each topic may be further divided into subtopics.
Advantages:
- Enable directory-style semantic browsing.
- Further support the divide-and-conquer approach to schema matching.
- Reduce the complexity of large-scale integration.

Conclusions
iDisc is unique in that:
- It examines the database from varied perspectives to construct multiple representations.
- It employs a multi-strategy framework to effectively combine evidence through meta-clustering.
- It employs novel multi-level aggregation and clusterer boosting techniques to handle complex aggregations.
- It employs a novel measure of table importance to effectively discover cluster representatives.
Experiments over several large real-world databases indicate that iDisc is highly effective, with an accuracy rate of up to 87%.

Thank you