DISCOVERING TOPICAL
STRUCTURES OF DATABASES
Professor: Michalis Petropoulos
CSE 705
Megha Ramesh Kumar
Topics to be covered








Introduction
Problem Definition
The iDisc approach
Handling Complex Aggregations
Finding Cluster Representatives
Empirical Evaluation
Related Work
Conclusion
Overview




Discovering topical structures of databases to support
semantic browsing and large scale data integration.
iDisc - multi-strategy learning framework.
It exploits instance values to construct multiple database
representations.
It employs a set of base clusterers to discover preliminary
topical clusters of tables from database representations &
aggregates them to final clusters via meta-clustering.
Difficulties in Integrating Databases I






Documentation and metadata for enterprise databases are
often scattered throughout the IT department.
They are incomplete, inaccurate or missing.
Scale of database + lack of documentation.
Cost of integrating databases increases.
Designers left the company but did not leave any design
documents.
Implementation of databases might not be consistent with
the design.
Difficulties in Integrating Databases II




Reverse-engineering and integrating the databases becomes
difficult.
Key step – Identify the semantic correspondences
(mappings) among attributes from different databases.
Existing matching solutions attempt to find mappings
between every two attributes.
Assumptions made:
 Databases are small.
 Attributes in one database are potentially relevant to all
attributes in another database.
Without Topical Structures
With Topical Structures
Gist





We formally define the problem of discovering topical structures
of databases and demonstrate how they can support semantic
browsing & large scale data integration.
We propose a novel multi-strategy discovery framework &
describe the iDisc system, which realizes this framework.
We propose novel clustering aggregation techniques to address
limitations of existing solutions.
We propose a new approach to finding representative tables using
a novel measure of table importance.
iDisc is evaluated over real-world databases, and the results indicate
that it discovers topical clusters with a high level of accuracy.
Problem Definition I

Topical Relationship
Consider database D, set of tables
where each table is associated with a
topic p ∈ P, topic(T) = p.
There exists a topical relationship
between tables S and T if topic(T) =
topic(S) [Denoted as ρ(S, T)].

Example:
Suppose P = {Invoice, Shipment, Product}
ρ(InvoiceItem, InvoiceTerm) since
topic(InvoiceItem) = topic(InvoiceTerm)

Properties:


ρ is reflexive, symmetric, and transitive.
ρ is therefore an equivalence relation: it partitions the tables
into mutually exclusive equivalence classes.
Problem Definition II

Topical Structure
Describes how the tables in a database are grouped based on their
topical relationship.
Consider set of topics P, database D, topical relationship ρ
between tables in D wrt P.
Topical structure of InvDB with respect to P is {C1,C2 , C3}, where
 C1 = {InvoiceStatus, InvoiceTerm, Invoice, InvoiceItem}
 C2 = {Shipment, ShipmentMethod}
 C3 = {Product, ProductCategory,Category}
Problem Definition
Given a database D with a set of tables, discover:
 A set of topics P, which the tables in D are about;
 The topical structure of D with respect to P, in the form of a
partition C = {C1,C2, ..., Ck} over the tables in D, where k=|P|.
The iDisc Approach
Model Builder



The Model Builder examines database D with a set of
tables from a number of perspectives and obtains a variety
of representations for D.
It constructs varied representations for the database from
its schema information and instance data.
These representations fall into three categories:
 Vector–Based
 Graph–Based
 Similarity–Based
Vector – Based Representations







It captures the topical structure of a database via the
descriptive information on the tables.
Table – Text document
Database – Collection of documents
Structures of tables are ignored.
Different ways of constructing these documents results in
different representations of database.
Suppose number of unique tokens among documents for
tables in database D is n:
Each document d – n-dimensional vector <w1 , w2, … wn>



i-th dimension - i-th unique token in D.
wi - weight of token for document d.
Many weighting functions – e.g.: TF * IDF weight.
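A minimal sketch of this vector construction, assuming each table's document is already a list of tokens and using raw term frequency times log(N/df) as the TF * IDF weight. That is one common variant; the slides do not say which weighting iDisc uses. The function name `tfidf_vectors` and the sample tokens are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a table name to its list of tokens (its "text document")."""
    # The i-th dimension corresponds to the i-th unique token in the database.
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many table documents each token occurs.
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    vectors = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        # Assumed weighting: w_i = tf * log(N / df).
        vectors[name] = [tf[tok] * math.log(n_docs / df[tok]) for tok in vocab]
    return vocab, vectors

# Hypothetical token documents for three tables.
docs = {
    "Invoice": ["invoice", "id", "date"],
    "InvoiceItem": ["invoice", "item", "id"],
    "Shipment": ["shipment", "id", "method"],
}
vocab, vectors = tfidf_vectors(docs)
```

A token that appears in every table, such as `id` here, gets weight 0 and so does not contribute to any table's topic.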
InvDB- An invoice management database
Graph – Based Representations




It captures the topical structure of a database via the linkage
among the tables.
Nodes – Table
Edges – Linkage between tables.
iDisc discovers primary keys and then discovers foreign keys.


Due to the cost of enforcing constraints, information on keys and
foreign keys is often missing in the catalogs.
Consider a key attribute A in T1 & an attribute B in T2. B is treated as a
foreign key referencing A when:
• |B ∩ A| = |B|
• |B| > 0.8 |A|
• |B| > 2
• NameSim(A, B) > 0.5, where NameSim is a measure of the similarity of
attribute names.
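The foreign-key test above can be sketched as follows. This is a hedged illustration, not iDisc's code: it assumes |A| and |B| count distinct values, that all four conditions must hold together, and it uses `difflib.SequenceMatcher` as a stand-in for NameSim, which the slides do not define. The names `name_sim` and `is_foreign_key` are hypothetical.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Stand-in for NameSim (assumption): character-level similarity ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_foreign_key(key_name, key_values, attr_name, attr_values):
    """Check whether attribute B plausibly references key attribute A."""
    A, B = set(key_values), set(attr_values)  # assumed: distinct values
    return (
        len(B & A) == len(B)                   # every value of B occurs in A
        and len(B) > 0.8 * len(A)              # B covers most of the key values
        and len(B) > 2                         # B has enough distinct values
        and name_sim(key_name, attr_name) > 0.5
    )

# Hypothetical sample values.
invoice_ids = [1, 2, 3, 4, 5]
print(is_foreign_key("InvoiceID", invoice_ids, "InvoiceID", [1, 2, 3, 4, 5, 5]))
```

On databases with many rows, the same check can be run over a sample of each attribute's values rather than the full column.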

Similarity – Based Representations I



It captures the topical structure of a database via the value-based similarity between the tables.
Idea: If two tables are about the same topic , they may have
several attributes containing similar values.
|D|× |D| matrix M




|D| is the no. of tables in D
M[i,j] stores the similarity between i-th and j-th tables in D
Different ways of evaluating similarity results in different
representations for the database.
iDisc procedure:



Evaluate value similarity between attributes.
Discover matching attributes.
Evaluate table similarity.
Similarity – Based Representations II

Evaluate value similarity between attributes.
• For every two attributes X and Y, one from each table, compute the
similarity as the Jaccard similarity between the sets of values in X and Y.

Discover matching attributes (greedy matching).
1. Let Z = ∅, U = all attributes in T, and V = all attributes in T’.
2. Find u ∈ U and v ∈ V such that they have a maximum (positive)
similarity among all pairs of attributes from U and V.
3. Add attribute pair (u, v) to Z, remove u from U and v from V.
4. Repeat steps 2 and 3 until no more such pairs can be found.

Consider tables T = InvoiceStatus & T’ = InvoiceTerm
J(T.InvoiceID, T’.InvoiceID) = 0.75
J(T.InvoiceID, T’.TermType) = 0.2
J(T.StatusCode,T’.TermType) = 0.15
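The Jaccard step and the greedy matching can be sketched in a few lines. The function names are illustrative. The example reuses the three similarity values listed for InvoiceStatus and InvoiceTerm; the fourth attribute pair is not given on the slide, so it is simply omitted here, which treats it as having no positive similarity (an assumption).

```python
def jaccard(x, y):
    """Jaccard similarity between the value sets of two attributes."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def greedy_match(sim):
    """sim maps (u, v) attribute pairs, one from each table, to a similarity."""
    # Only pairs with positive similarity are candidates.
    remaining = {pair: s for pair, s in sim.items() if s > 0}
    Z = []
    while remaining:
        # Pick the pair with maximum similarity among unmatched attributes.
        (u, v), s = max(remaining.items(), key=lambda kv: kv[1])
        Z.append((u, v, s))
        # Remove u from U and v from V: drop every pair that uses either one.
        remaining = {(a, b): t for (a, b), t in remaining.items()
                     if a != u and b != v}
    return Z

# Similarities from the InvoiceStatus / InvoiceTerm example.
sim = {
    ("InvoiceID", "InvoiceID"): 0.75,
    ("InvoiceID", "TermType"): 0.2,
    ("StatusCode", "TermType"): 0.15,
}
Z = greedy_match(sim)
```

Note that the pair (InvoiceID, TermType) with similarity 0.2 is never matched, because InvoiceID is consumed by the first, stronger match.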
Similarity – Based Representations III

Evaluate table similarity

Similarity between T and T ‘
where |T| - number of attributes in T.

Example:


Sim(InvoiceStatus, InvoiceTerm) = (0.75 + 0.15)/2
iDisc’s goal is not to build best models (which typically do
not exist), but to show that it can produce a better solution
by building & combining many different (possibly
imperfect) models.
Base Clusterers



Take a database representation and discover preliminary
clustering over the tables in the database.
iDisc first implements several generic clustering
algorithms and then instantiates them into clusterers.
Generic Algorithms:
Similarity-based
 Linkage-based.


Instantiation
Vector based representation.
 Graph based representation.
 Similarity based representation.

Generic Similarity-Based Algorithm I

SimClust(T, M, ClsrSim, Q) → C

Input:
• T, a set of tables {T1, T2, ..., T|T|}
• M, a similarity matrix for the tables in T
• ClsrSim, a cluster similarity function
• Q, a clustering quality metric

Output: C, a partition of the tables in T

1. Set up initial clusters:
   Let i = 1
   Let C1 = {{T1}, {T2}, ..., {T|T|}}
2. Repeat until |Ci| = 1:
   Evaluate the quality of Ci via Q
   Evaluate the similarities of clusters in Ci via ClsrSim
   Find Cx, Cy ∈ Ci with a maximum similarity
   Merge clusters Cx and Cy to obtain Ci+1
   i ← i + 1
3. Return the Ci with a maximum Q value
Generic Similarity-Based Algorithm II


ClsrSim is a cluster similarity function which takes the
similarity matrix M and two clusters of tables, Cx and Cy,
and computes a similarity value between Cx and Cy.
Implementations of ClsrSim:
• single-link: in each step, merge the two clusters whose
two closest members have the smallest distance (i.e., the two
clusters with the smallest minimum pair-wise distance).
• complete-link: in each step, merge the two clusters
whose merger has the smallest diameter (i.e., the two clusters
with the smallest maximum pair-wise distance).
• average-link: a compromise between the sensitivity of
complete-link clustering to outliers and the tendency of
single-link clustering to form long chains.

Generic Similarity-Based Algorithm III

Q is a metric for evaluating the quality of clusterings.




Elbow criterion
Gap statistics
Cross-validation
But there is no best solution.
Q(C) = ∑_{Ci ∈ C} (|Ci|/N) · (IntraSim(Ci) − InterSim(Ci))
• N is the total number of tables in the database.
• |Ci| is the number of tables in cluster Ci ∈ C.
• IntraSim(Ci) is the average similarity of tables within the cluster Ci.
• InterSim(Ci) is the maximum similarity of Ci with any other cluster in C.
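SimClust with a single-link ClsrSim and the metric Q can be sketched as below. Several details are assumptions, since the slides leave them open: the matrix M is a dict keyed by unordered table pairs, the similarity between two clusters inside InterSim is their average pairwise table similarity, IntraSim of a one-table cluster is 0, and InterSim is 0 when there is only one cluster. The table names and similarity values are made up.

```python
from itertools import combinations

def pair_sim(M, a, b):
    return M[frozenset((a, b))]

def single_link(M, cx, cy):
    """ClsrSim: similarity of the two most similar members."""
    return max(pair_sim(M, a, b) for a in cx for b in cy)

def avg_sim(M, cx, cy):
    return sum(pair_sim(M, a, b) for a in cx for b in cy) / (len(cx) * len(cy))

def default_q(clusters, M):
    """Q(C) = sum over Ci of |Ci|/N * (IntraSim(Ci) - InterSim(Ci))."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        pairs = list(combinations(c, 2))
        # Assumption: a singleton cluster has IntraSim 0.
        intra = sum(pair_sim(M, a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        others = [o for o in clusters if o != c]
        # Assumption: cluster-to-cluster similarity is the average pairwise one.
        inter = max((avg_sim(M, c, o) for o in others), default=0.0)
        total += len(c) / n * (intra - inter)
    return total

def sim_clust(tables, M, clsr_sim, Q):
    clusters = [frozenset([t]) for t in tables]
    best, best_q = clusters, Q(clusters, M)
    while len(clusters) > 1:
        # Merge the two most similar clusters.
        x, y = max(combinations(clusters, 2),
                   key=lambda p: clsr_sim(M, p[0], p[1]))
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        q = Q(clusters, M)
        if q > best_q:
            best, best_q = clusters, q
    return set(best)

tables = ["Invoice", "InvoiceItem", "Shipment", "ShipmentMethod"]
M = {frozenset(p): 0.1 for p in combinations(tables, 2)}
M[frozenset(("Invoice", "InvoiceItem"))] = 0.9
M[frozenset(("Shipment", "ShipmentMethod"))] = 0.8
result = sim_clust(tables, M, single_link, default_q)
```

Because every intermediate clustering is scored, the number of clusters does not have to be supplied up front; it falls out of whichever step maximizes Q.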

Generic Linkage-Based Algorithm

LinkClust(T, G, EdgeDel, Q’) → C

Input:
• T, a set of tables
• G, a linkage graph for the tables in T
• EdgeDel, a function that suggests edges to be removed
• Q’, a clustering quality metric

Output: C, a partition of the tables in T

1. Let i = 1
2. Repeat until G has no edges:
   Let Ci = connected components in G
   Evaluate the quality of Ci via Q’
   Let Ec = EdgeDel(G)
   Remove edges in Ec from G
   i ← i + 1
3. Return the Ci with a maximum Q’ value.

Shortest – path betweenness (SP)

First, find the shortest paths between vertices, and then
measure the betweenness β(e) of an edge e by the fraction of
shortest paths that contain the edge:

β(e) = ∑_{s,t ∈ V, s ≠ t} σst(e)/σst
• σst is the number of distinct shortest paths between vertices s and t.
• σst(e) is the number of distinct shortest paths between s and t that
contain the edge e.

EdgeDel(G) then returns an edge with a maximum β value.
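One way to compute β(e) for an unweighted, undirected linkage graph is a breadth-first search from every vertex followed by back-propagation of path counts (the Brandes-style accumulation). The sketch below is an illustration under that choice, not necessarily how iDisc computes it; it counts each unordered pair {s, t} once, which does not change which edge has the maximum β. The graph and the names `edge_betweenness` and `edge_del_sp` are hypothetical.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """adj maps each vertex to a list of its neighbours (undirected graph)."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, sharing credit along edges.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += credit
                delta[v] += credit
    # Each unordered pair was visited from both endpoints.
    return {e: b / 2.0 for e, b in bet.items()}

def edge_del_sp(adj):
    bet = edge_betweenness(adj)
    return max(bet, key=bet.get)

# Two triangles joined by the single edge c-d.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
bet = edge_betweenness(adj)
```

In this example the bridge c-d lies on every shortest path between the two triangles, so it is the edge an SP-based EdgeDel would remove first.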
Spectral Graph Partitioning (SPC)
EdgeDel returns an edge-cut of G, which comprises a set of
edges which are likely lying between two clusters.
 Consider G’s Laplacian matrix LG = DG–AG
 DG is a diagonal matrix whose entry D[i, i] is the degree of ith vertex in G.
 AG is G’s adjacency matrix.

Then it can be shown that finding a minimum edge-cut of G
corresponds to finding the smallest positive eigenvalue λ2 of LG.
 The eigenvector for λ2 suggests a possible bi-partitioning of the
vertices in G, where the vertices with positive values are placed
in one cluster and the vertices with negative values in the other
cluster.


Metric Q’
It measures the quality of the clusterings in LinkClust.
• It captures the intuition that a good partition of the network
should be such that the nodes within a community are well-connected,
while there are only a few edges connecting different
communities.

Q’(C) = ∑_{Ci ∈ C} (|Eii|/|E| − (|Ei|/|E|)²)

• |E| is the total number of edges in the graph.
• |Eii| is the number of edges connecting two vertices both in Ci.
• |Ei| is the number of edges that are incident to at least one vertex in Ci.
• |Eii|/|E| is the observed probability that an edge falls into the
cluster Ci.
• (|Ei|/|E|)² is the expected probability under the assumption that
the connections between vertices are random.
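The term definitions above suggest a modularity-style score that sums, over clusters, the observed probability minus the expected probability. The sketch below assumes exactly that form, Q’(C) = ∑ (|Eii|/|E| − (|Ei|/|E|)²), and uses the slide's definition of |Ei| (edges incident to at least one vertex in Ci). The function name `q_prime` and the sample graph are illustrative.

```python
def q_prime(clusters, edges):
    """clusters: list of sets of vertices; edges: list of (u, v) pairs."""
    m = len(edges)
    if m == 0:
        return 0.0
    total = 0.0
    for c in clusters:
        # Edges with both endpoints inside the cluster.
        e_ii = sum(1 for u, v in edges if u in c and v in c)
        # Edges touching the cluster at one or both endpoints.
        e_i = sum(1 for u, v in edges if u in c or v in c)
        total += e_ii / m - (e_i / m) ** 2
    return total

# Two triangles joined by the single edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
split = [{"a", "b", "c"}, {"d", "e", "f"}]
whole = [{"a", "b", "c", "d", "e", "f"}]
print(q_prime(split, edges) > q_prime(whole, edges))
```

Splitting the graph at the bridge scores higher than keeping everything in one cluster, which matches the intuition stated above.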
Generating Base Clusters

Graph- based representations
iDisc generates base clusters by
instantiating LINKCLUST.
 If input is directed graph, it first
transforms it into undirected
graph by ignoring the direction of
the edges
For example:
 LinkClust(T, G, SP, default_Q’)

Generating Base Clusters

Vector- based representations





iDisc generates base clusters by instantiating SIM-CLUST.
Consider database D with tables T = {T1, T2, ..., T|T|} & the token
vector for table Ti, denoted Ti^.
For every two tables, we evaluate similarity by a variety of
methods, e.g., the Cosine function:
Cos(Ti^, Tj^) = Ti^ · Tj^ / (║Ti^║ · ║Tj^║)
SimClust(T, M, single-link, default_Q)
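As a small illustration of how the similarity matrix M could be filled from token vectors, the sketch below computes the cosine for every pair of tables. The vectors are made-up weights, and storing M as a dict keyed by unordered table pairs is an assumption made for this example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0

# Hypothetical token-weight vectors for three tables.
vectors = {
    "Invoice": [1.0, 0.4, 0.0],
    "InvoiceItem": [0.9, 0.5, 0.0],
    "Shipment": [0.0, 0.0, 1.0],
}
M = {frozenset((a, b)): cosine(vectors[a], vectors[b])
     for a, b in combinations(vectors, 2)}
```

A matrix built this way can then be handed to a similarity-based clusterer together with a ClsrSim such as single-link.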
Similarity- based representations
Similar to vector-based representation, iDisc generates base
clusters by instantiating SIM-CLUST.
 Difference is similarity matrix in the representation is directly
used for instantiation.

Meta- clusterer I



Given a set of preliminary clusterings C from the base clusterers,
the goal of the meta-clusterer is to find a clustering C’ that
agrees with the clusterings in C as much as possible.
The disagreement between C and C’ is denoted d(C, C’).
B1: vector-based representation, complete-link
B2: vector-based representation, single-link
B3: linkage-based representation
Meta- clusterer II



iDisc implements clustering aggregation as a voting scheme, but with the
difference that it automatically determines an appropriate
number of clusters based on the particular votes from the input
clusterers.
The problem of finding the best aggregated clustering can be
shown to be NP-complete. Most of the approximation algorithms
are based on a majority-vote scheme.
The meta-clustering algorithm is also based on a voting
scheme, but with a key difference.

• It does not assume an explicit number of clusters; instead, the algorithm
automatically determines an appropriate number of clusters in the
aggregated clustering based on the particular votes from the input
clusterers.
Meta- clusterer III


Algorithm has two phases:
Vote-based similarity evaluation

Consider two tables T, T’ & a clustering Ci. The vote from Ci, written V_{T,T’}(Ci), takes the value 1
if T and T’ are placed in the same cluster of Ci & 0 otherwise.
Based on the votes from the base clusterers, the similarity
between two tables T, T’ ∈ T is computed as:
• (1/m) · ∑_{i=1…m} V_{T,T’}(Ci), where m is the number of base clusterers.


Vote- based similarity
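The vote-based similarity can be sketched directly from its definition: each base clustering casts one vote per table pair, and the similarity is the fraction of votes in favour. The function name and the three sample clusterings are illustrative, and the matrix is stored as a dict keyed by unordered table pairs for convenience.

```python
from itertools import combinations

def vote_similarity(tables, clusterings):
    """clusterings: list of m clusterings, each a list of sets of tables."""
    m = len(clusterings)
    Mv = {}
    for a, b in combinations(tables, 2):
        # A clustering votes 1 if it places a and b in the same cluster.
        votes = sum(1 for C in clusterings
                    if any(a in c and b in c for c in C))
        Mv[frozenset((a, b))] = votes / m
    return Mv

tables = ["A", "B", "C", "D"]
B1 = [{"A", "B"}, {"C"}, {"D"}]
B2 = [{"A", "B", "C"}, {"D"}]
B3 = [{"A", "B"}, {"C", "D"}]
Mv = vote_similarity(tables, [B1, B2, B3])
```

Here A and B receive all three votes, while C is tied to A, B and D by only one vote each, so a re-clustering over Mv would keep A and B together first.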
Meta- clusterer IV

Re-clustering



A similarity matrix Mv is constructed from the previous step.
iDisc then instantiates the meta-clusterer as
SimClust(T, Mv, single-link, default_Q)
Prior research, however, has focused mostly on combining different
clustering algorithms (e.g., single-link vs. complete-link) and not
different representation models.
Handling Complex Aggregations I




The meta-clusterer has to identify and remove the errors in
the input clusterers and combine the strength of different
clusterers to produce better clusters.
All input clusterers are treated as being equally good by the
meta-clusterer.
However, the performance of clusterers may vary a lot
depending on the characteristics of a particular data set, i.e., the
same clusterer might perform well on one data set but perform
poorly on another.
Thus, we dynamically adjust the weights of clusterers so that
the better performing clusterers are weighted more.
Handling Complex Aggregations II



We use clusterer boosting approach which first estimates the
performance of a clusterer by comparing it to other clusterers
which are likely to be accurate.
The results from the clusterers are re-aggregated based on
the new weights.
Clusterer boosting involves the following steps.



Determining a pseudo- solution.
Ranking input clusterers.
Adjusting weights.
Handling Complex Aggregations III

Aggregation tree H
Level of aggregation –
depth of deepest internal
node in H.

Single level clustering:



Base clusters are
aggregated at once by a
single meta-clusterer.
Multiple level clustering

Multiple meta-clusterers are used, and
some take as input the
aggregated clusterings from
previous meta-clusterers.
Handling Complex Aggregation IV

Similarity levels:
1. Clusterers which take the same representation (e.g., a vector-based
representation), but employ different clustering algorithms
(e.g., single-link vs. complete-link versions of the similarity-based
algorithm).
2. Clusterers which take the same kind of representations (e.g., a
vector-based representation constructed from table names vs. a
vector-based representation constructed from both table &
attribute names).
3. Clusterers which take different kinds of representations (e.g., a
vector-based vs. a graph-based representation).
Furthermore, if one of the clusterers is a meta-clusterer, their similarity
level is given by the least similarity level among all the base clusterers.
Handling Complex Aggregation V

Tree construction
1. Initialize a set W of current clusterers with all the base
clusterers.
2. Determine the maximum similarity level l among all the
clusterers in W.
3. Find a set S of all clusterers with the similarity level l.
4. Aggregate the clusterers in S using a meta-clusterer M and
remove them from W. Add M into W.
5. Repeat steps 2–4 until there is only one clusterer left in W,
which is the root meta-clusterer.
Finding Cluster Representatives I






There are a large number of tables on the same topic. We need to know
the important tables.
These tables are cluster representatives.
They serve as entry points to a cluster and give users a general idea
of what the cluster is about.
iDisc includes a Representative Finder, which discovers representative tables
on the basis of their importance.
Observation: A table that is important should be at the focal point in
the linkage graph for the cluster.
Hence , iDisc measures the importance of a table based on its
centrality on the linkage graph.
Finding Cluster Representatives II

Given a linkage graph G(V, E), the centrality of a vertex v ∈
V, denoted as ζ(v), is computed as follows:
ζ(v) = ∑_{s,t ∈ V, s ≠ v ≠ t} σst(v)/σst
where:
 σst : The number of distinct shortest paths between vertices s
and t.
 σst (v) : The number of shortest paths between s and t that pass
through the vertex v.
Finding Cluster Representatives III

Representative Discovery (REPDISC)

Input:
• Clustering C = {C1, C2, ..., Ck} over database D.
• Linkage graph G of D.
• A desired number r, which is the number of representative tables in each
cluster Ci.

Output: the top r representative tables for each cluster.

For each cluster Ci ∈ C:
1. Obtain the linkage graph GCi, a
subgraph of G induced by the set of
tables in Ci.
2. Evaluate centrality scores.
3. Rank the tables in descending
order of their centrality scores and
return the top r tables in the ranked
list.
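The three REPDISC steps can be sketched as below, assuming the centrality is shortest-path (betweenness) centrality computed by breadth-first search on the unweighted induced subgraph, with each unordered pair {s, t} counted once. Function names, the tie-breaking by table name, and the sample linkage graph are assumptions made for this illustration.

```python
from collections import deque

def betweenness(adj):
    """Vertex centrality: sum over pairs of sigma_st(v) / sigma_st."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0.0)  # shortest-path counts from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2.0 for v, c in score.items()}

def rep_disc(clusters, adj, r):
    reps = []
    for cluster in clusters:
        # Step 1: subgraph of G induced by the tables in the cluster.
        sub = {v: adj[v] & cluster for v in cluster}
        # Step 2: centrality scores on the induced subgraph.
        score = betweenness(sub)
        # Step 3: rank by descending centrality (ties broken by name).
        reps.append(sorted(sub, key=lambda v: (-score[v], v))[:r])
    return reps

links = [("Invoice", "InvoiceItem"), ("Invoice", "InvoiceTerm"),
         ("Invoice", "InvoiceStatus"), ("InvoiceItem", "Product"),
         ("Product", "ProductCategory"), ("ProductCategory", "Category")]
adj = {}
for u, v in links:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
clusters = [{"Invoice", "InvoiceItem", "InvoiceTerm", "InvoiceStatus"},
            {"Product", "ProductCategory", "Category"}]
reps = rep_disc(clusters, adj, 1)
```

Restricting the graph to each cluster first means the link from InvoiceItem to Product does not influence the ranking inside the invoice cluster.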
Finding Cluster Representatives IV
Complexity of REPDISC:

Consider cluster Ci ∈ C and denote the induced graph for Ci as G(Vr,Er)


where Vr is a set of tables in Ci and Er is a set of linkage edges between the
tables in Ci.
The time to create the graph is O(|Vr| + |Er|).
1. For every two tables in Vr, determine if there is an edge between them.
Suppose G is implemented with an adjacency matrix. This can be done
in O(|Vr|²). Thus, the overall complexity of step 1 is O(|Vr|²).
2. The complexity can be shown to be O(|Vr| ∗ |Er|).
3. The complexity is O(|Vr|).

So the overall complexity for steps 1–3 is O(|Vr| ∗ |Er|), with the
dominant factor being the time for step 2.
Empirical Evaluation
Experiment Setup

Data Set



HR1- engagement management, HR2 – skill development, HR3 –
invoice tracking.
Determine
 Set of topics in database.
 Which topic each table in the database is about.
 These were used as gold standard for experiments.
Performance Metrics
Precision (P)- The percentage of table pairs determined by iDisc to
be on the same topic that are on the same topic according to the
gold standard.
 Recall (R) - The percentage of table pairs determined by the
domain expert to be on the same topic that are discovered by iDisc.
 F -measure (F1) -F-measure is used when precision P and recall R
are equally weighted, i.e., F1 = 2PR/(R + P).
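These pairwise metrics are straightforward to compute once both clusterings are expressed as sets of same-topic table pairs. The sketch below follows the definitions above; the function names and the toy clusterings are illustrative and are not taken from the experiments.

```python
from itertools import combinations

def same_topic_pairs(clustering):
    """All unordered table pairs that a clustering puts on the same topic."""
    return {frozenset(p) for c in clustering
            for p in combinations(sorted(c), 2)}

def pairwise_prf(predicted, gold):
    pred, ref = same_topic_pairs(predicted), same_topic_pairs(gold)
    tp = len(pred & ref)                      # pairs both agree on
    p = tp / len(pred) if pred else 0.0       # precision
    r = tp / len(ref) if ref else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"C", "D", "E"}]
print(pairwise_prf(predicted, gold))
```

In this toy case the predicted clustering proposes four same-topic pairs, two of which are in the gold standard, and it recovers two of the four gold pairs.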

Empirical Evaluation
Experiment Setup

Experiments
The utility of various database representations and the accuracy of
individual base clusterers.
 The aggregation accuracy of the baseline meta-clustering
algorithm.
 The impact of the proposed complex aggregation techniques.
 For all the base clusterers and meta-clusterers, the default Q and
Q’ were employed.
 Vector-based representations were constructed from table &
attribute names and the Cosine function was employed for
computing vector similarities. Since the databases contain a huge
number of rows, a sample of 4k size for each attribute was created,
and was used for discovering foreign keys and attribute matches .

Results &
Observations


Base clusterers employing a
complete-link algorithm(CL)
tend to have higher precision
and lower recall than ones
based on a single-link
algorithm (SL). CL-based
clusterers typically produce a
large number of small
clusters, while SL-based ones
produce a small number of
large clusters.
The precision of base
clusterers using graph-based
representations is relatively
low in HR1 & HR3.


The base clusterers utilizing
vector-based representations
perform consistently well over
all three databases. This is due
to the fact that similar tables in
these database tend to have
many common words, e.g., Emp,
Emp Resume, and Emp Photo.
Base clusterers utilizing
similarity-based
representations had poor
performance on HR2 since a
large number of tables in HR2
have several similar timestamp-like columns, e.g., create dt and
del dt for table creation and
deletion datetimes. Thus many
tables are falsely determined to
be similar to each other.


Meta - clusterers
Observations:
 The effects of “bad” base
clusterers can be cancelled
out.
 In HR1, the precision of
Meta-Vec (87.6%) is much
higher than that of Vec-SL
(44.8%).
 Meta-All is far more
accurate than Vec-SL,
Graph-SP, and Graph-SPC,
and its F1 is higher than
that of all base clusterers.

Number of Topics
The numbers of topics were compared.
• (a) plots the total number of topics for each database.
• (b) plots the number of topics with at least two tables for
each database.
• Observation:
• The numbers of topics discovered by iDisc are very close to the numbers
given by the gold standard.

Empirical Evaluation
Discussion



iDisc may disagree with the domain expert on the
granularity of partitioning and the number of subject areas
in a database.
iDisc and the domain expert may also disagree on the
assignment of the tables to the clusters, particularly for
those “boundary” tables that connect several related
entities.
Some databases in our experiments contain reference
tables such as country (with attributes like name, region
and ISO code) and language (with attributes like name and
code). These tables are often referred to from multiple
subject areas.
Related Work

Mining Database Structures:

Bellman discovers join relationships (join-able attributes, i.e.,
attributes which are semantically similar) among tables in a
database. Similar attributes are identified using set
resemblance functions similar to the Jaccard function. Two tables
connected via a join relationship may not be on the same topic. The goal
of our work is to identify such inter-topic links and partition
tables accordingly.

Data modeling products like Erwin & RDA facilitate modular
development during a top-down modeling process. It enables
us to create and maintain databases. It streamlines design
process and synchronizes the model with the database design.
Our solution complements these functions by enabling users to
reverse-engineer subject areas from a large-scale physical
database during a bottom-up modeling process.
Related Work

Information Integration & Complexity Issues





Key problem today.
Fragment oriented approach to match large schemas to reduce
matching complexity.
It decomposes large schemas into several sub-schemas or
fragments (either a XML schema segment or relational table or
manually specified) & performs fragment-wise matching.
Our work provides an automatic approach to partitioning a large
schema into semantically meaningful fragments.
Multi-Strategy Learning & Clustering Aggregation
LSD (a system that matches source schemas against a
mediated schema for data integration) employs a set of base
learners. The predictions from these base learners are
combined via the meta-learner. LSD requires training data.
• In contrast, base clusterers and meta-clusterers in iDisc do not
require training, i.e., iDisc is unsupervised.

Future Work

Two directions to extend iDisc :



Plan to develop soft- clustering and meta clustering
techniques and incorporate them in iDisc to examine their
impact on its performance.
Plan to extend iDisc to produce hierarchical topical
structure, where each topic may be further divided into subtopics.
Advantages:
Enable directory style semantic browsing.
 Further support the divide-and-conquer approach to
schema matching.
 Reduce the complexity of large scale integration.

Conclusions

iDisc is unique in that





It examines the database from varied perspectives to construct
multiple representations.
It employs a multi-strategy framework to effectively combine
evidence through meta-clustering.
It employs novel multi-level aggregation and clusterer
boosting techniques to handle complex aggregations.
It employs a novel measure of table importance to effectively
discover cluster representatives.
Experiments over several large real-world databases
indicate that iDisc is highly effective, with an accuracy rate
of up to 87%.
Thank you