A Framework for Matching Schemas
of Relational Databases
by
Ahmed Saimon Adam
Supervisor
Dr. Jixue Liu
A minor thesis submitted for the degree of
Master of Science (Computer and Information Science)
School of Computer and Information Science
University of South Australia
7th February 2011
Declaration
I declare that this thesis does not incorporate without acknowledgment any
material previously submitted for a degree or diploma in any university; and
that to the best of my knowledge it does not contain any materials previously
published or written by another person except where due reference is made in
the text.
………………………………………………..
Ahmed Saimon Adam
February 2011
Acknowledgments
I would like to thank my supervisor Dr. Jixue Liu for all the help and support
and advice he has provided to me during the course of writing this thesis.
Without his expert advice and invaluable guidance, I would not have been able to
complete this thesis.
I would also like to thank our program director Dr. Jan Stanek for providing
outstanding support and encouragement in times of difficulty and at all times.
CONTENTS
1. Introduction
2. Schema matching architecture
  2.1. Input component
  2.2. Schema matching process
    2.2.1. Matching at element level
      2.2.1.1. Pre-processing element names
      2.2.1.2. String matching
        2.2.1.2.1. Prefix matching
        2.2.1.2.2. Suffix matching
      2.2.1.3. String similarity matching
        2.2.1.3.1. Edit distance (Levenshtein distance)
        2.2.1.3.2. N-gram
      2.2.1.4. Identifying structural similarities
        2.2.1.4.1. Data type constraints
        2.2.1.4.2. Calculating final structural similarity score
    2.2.2. Matching at instance level
      2.2.2.1. Computing the instance similarity
  2.3. Schema matching algorithms
    2.3.1. Algorithm 1: Find name similarity of schema elements
    2.3.2. Algorithm 2: Find structural similarities of schemas
  2.4. Similarity computation
    2.4.1. Computation at matcher level
    2.4.2. Computation at attribute level
    2.4.3. Computing similarity at table level
    2.4.4. Tie-breaking
      2.4.4.1. Tie breaking in table matching
      2.4.4.2. Tie breaking in attribute matching
3. Empirical evaluation
  3.1. Experimental setup
    3.1.1. Database
    3.1.2. Schema matcher prototype
  3.2. Evaluating the accuracy of matching algorithms
    3.2.1. Prefix matching algorithm
    3.2.2. Suffix matching algorithm
    3.2.3. N-grams matching algorithm
    3.2.4. Structural matching algorithm
    3.2.5. Instance matching
    3.2.6. Overall accuracy of similarity algorithms
    3.2.7. Effect of schema size on the overall precision
    3.2.8. Efficiency of schema matching process
4. Conclusion
5. References
List of tables
Table 1: Variations in data types across database vendors
Table 2: Schema matching algorithms
Table 3: Examples of string tokenization
Table 4: Prefix matching example 1
Table 5: Prefix matching example 2
Table 6: Suffix matching example
Table 7: Structural metadata for structural comparison
Table 8: Structural matching example
Table 9: Data type matching example
Table 10: Properties utilized for instance matching
Table 11: Sample dataset for Schema1
Table 12: Examples showing how the statistical calculations are performed
Table 13: Matcher level similarity calculation example
Table 14: Combined similarity calculation at attribute level
Table 15: Table similarity calculation example 1
Table 16: Table similarity calculation example 2
Table 17: Tie breaking example in attribute matching
Table 18: Summary of experimental schemas
Table 19: Software tools used for prototype
Table 20: Prefix matching sample results
Table 21: Suffix matching sample results
Table 22: N-gram matching sample results
Table 23: Structure matching sample results
Table 24: Instance matching sample results
Table 25: Best matching attributes for smaller schemas with 4 tables
Table 26: Best matching attributes for single table schemas
Table 27: Decrease in efficiency with increase in schema size
List of figures
Figure 1: Schema matcher architecture
Figure 2: Sample snippet of a schema in DDL
Figure 3: Combined final similarity matrix
Figure 4: Tie breaking example 1
Figure 5: Tie breaking example 2
Figure 7: Schema matcher prototype
Figure 8: Precision of matching algorithms (%)
Figure 9: Precision vs. schema size
Figure 10: Schema size vs. processing time
Abstract
Schema matching, the process of identifying semantic correlations between database
schemas, is one of the vital tasks in database integration.
This study is based on relational databases. In this thesis, we investigate past
research in schema matching to study the techniques and methodologies in the field.
We define an architecture for a schema matching system and propose a framework based
on it, considering the factors of scalability, efficiency and accuracy. With the
framework, we also develop several algorithms that can be used to detect
similarities in schemas. We build a prototype and conduct an experimental evaluation
in order to evaluate the framework and the effectiveness of its algorithms.
1. INTRODUCTION
1.1. BACKGROUND
The schema of a database describes how its concepts, their relationships and
constraints are structured. Schema matching, the process of identifying semantic
correlations between database schemas, is one of the vital and most challenging tasks
in database integration (Halevy, Rajaraman & Ordille 2006).
Many organizations, in adapting to the rapid and continuous changes taking place
over the years, have been developing and using various databases for their varied
applications (Parent & Spaccapietra 1998), and this continues to be the case after so
many years (Ziegler & Dittrich 2004). This tendency to create diverse applications
and databases is driven by a vibrant business environment; structural changes to
organizations, the opening of new branches at geographically dispersed locations and
the exploitation of new business markets are some of the reasons.
As their databases grow, organizations need to use them in a consolidated manner
for research and analytical purposes. Data warehousing and data mining are the
results of some of these large-scale database integrations (Doan & Halevy 2005).
Currently, many data sharing applications such as data warehousing, e-commerce and
semantic web processing require matching schemas for this purpose. In such
applications, a unified virtual database comprising multiple databases needs to be
created in order to seamlessly access the information available from those databases
(Doan & Halevy 2005).
Database integration research has been ongoing since the late 1970s (Batini, Lenzerini &
Navathe 1986; Doan & Halevy 2005), and the first such mechanism was attained in
1980 (Ziegler & Dittrich 2004). Since the early 1990s, database integration methods have
been put to use commercially as well as in the research field (Doan & Halevy 2005;
Halevy, Rajaraman & Ordille 2006).
1.2. Motivation
Schema matching has traditionally been done in a highly manual fashion (Do, H 2006; Do, H &
Rahm 2007); but it is laborious, time consuming and highly susceptible to errors
(Huimin & Sudha 2004; Doan & Halevy 2005; Po & Sorrentino 2010) and sometimes
not even possible (Domshlak, Gal & Roitman 2007).
One of the major issues in the schema matching process is the disparity in the
structure and semantics of the databases involved. Often, database designers perceive
the same concept differently. Therefore, the same real world concept
might have different representations, or two different concepts might be
represented the same way in the database (Batini, Lenzerini & Navathe 1986).
Moreover, when building a database integration project, the information necessary for
correctly mapping semantics might not be available. The major sources of
information, such as the original designers, might not be available, or the documentation
might not be sufficient. Consequently, the matching has to be performed from the limited
evidence available from schema element names and instance values (Doan & Halevy
2005).
Nonetheless, even the information available from the schema elements might not be
enough for adequate identification of matches (Doan & Halevy 2005). Usually, in
order to find the right correspondences between schema elements, an exhaustive matching
process has to be conducted where every element is evaluated to ensure the
best match among all the available elements; this is very costly and cumbersome
(Doan & Halevy 2005).
In addition to these obstacles, schema matching is largely domain specific. For
example, an attribute pair declared as highly correlated in one domain might be
completely unrelated in another domain (Doan & Halevy 2005). Hence, because of its
highly cognitive nature, the likelihood of making the process fully automatic remains
uncertain (Doan, Noy & Halevy 2004; Aumueller et al. 2005; Doan & Halevy 2005;
Do, H & Rahm 2007; Po & Sorrentino 2010).
However, the demand for more efficient, scalable and autonomous schema matching
systems is continuously rising (Doan, Noy & Halevy 2004; Po & Sorrentino 2010)
due to the emergence of the semantic web, advancements in e-commerce and the
increasing level of collaboration among organizations.
Consequently, according to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002;
Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), researchers
have proposed several semi-automatic methods. Some adopt heuristic criteria (Doan
& Halevy 2005), while others use machine learning and information retrieval
techniques (Rahm & Bernstein 2001; Shvaiko & Euzenat 2005).
However, no one method can be chosen as the best schema matching solution due to
the diversity of data sources (Bernstein et al. 2004; Domshlak, Gal & Roitman 2007).
Therefore, a generic solution that can be easily customised for dealing with different
types of applications is more useful (Bernstein et al. 2004).
1.3. Previous work
According to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan, Noy &
Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007),
many approaches and techniques have been proposed in the past for resolving schema
matching problems. Some methods adopt techniques from machine learning and
information retrieval, while others apply heuristic criteria in composite and hybrid
approaches.
For a better interpretation of the underlying semantics in the schemas, a combination of
different types of information is used in many systems, and mainly two levels of
database information are exploited in schema matching techniques (Rahm & Bernstein
2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005;
Do, H & Rahm 2007). They are the metadata available at schema level and the actual
instance data.
Schema information is used for measuring similarities at element level (e.g. attribute
and table names) and at structural level (e.g. constraints such as data type, field length,
value range); at instance level, patterns in data instances are analysed. Moreover,
many algorithms use auxiliary information such as user feedback, dictionaries,
thesauri and domain knowledge for dealing with various problems in schema
matching (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy
2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007).
1.4. Research objectives and methodology
Investigations on schema matching
In this research, our focus is on relational databases only. This research will
investigate existing schema matching techniques and their architectures, and identify
what information can be exploited, how, and under what circumstances (e.g. what
technique can be used when attribute names are ambiguous?).
Propose a schema matching framework
Based on the findings, we propose a schema matching framework for relational
databases. This framework is based on the findings on how to best represent input
schemas, output results and other interfaces in the architecture, considering scalability,
accuracy and customizability. Taking into account the findings from this investigation,
we also implement multiple matching algorithms that exploit different clues available
from the database, and specify how different types of algorithms can be implemented
on the system. We also establish how matching accuracy can be measured in a
quantifiable manner.
Prototype and empirical evaluation
Based on this framework, we build a prototype and conduct empirical evaluation to
demonstrate the validity of the research. In our experiments, we demonstrate the
accuracy of the matching algorithms and scalability and customisability of the
architecture.
We also demonstrate the effects on matching accuracy when we use different
combinations of algorithms. We use precision and recall, measurement techniques
from information retrieval (IR), for measuring accuracy and reporting results.
2. SCHEMA MATCHING ARCHITECTURE
The schema matching architecture of this system is defined in three major
components, similar to many other systems (Rahm & Bernstein 2001; Doan & Halevy
2005; Shvaiko & Euzenat 2005):
Input component: establishes acceptable formats and standards for the schemas, and
converts schemas into a format that can be manipulated by the schema matching
software.
Matching component: performs the matching tasks. It consists of
a set of elementary algorithmic functions, each of which is called a matcher. Each
matcher utilizes some type of information (e.g. element names in name matchers) to
perform a sub-function in the matching process. Matchers are implemented by
executing them in series.
Output component: delivers the final mapping results.
This process is depicted in the diagram below (Figure 1).
Figure 1: Schema matcher architecture
2.1. Input component
Schemas of relational databases are often created in SQL DDL, which is a series of
SQL "CREATE" statements. Therefore, information about schema elements such as
table names, attribute names, data types, keys and other available information can be
obtained from the schema by the input module in a standard format.
In this framework, the input component accepts database schemas in SQL DDL (Data
Definition Language) format and parses them, similar to (Li & Clifton 2000), to
extract schema information. The software accepts two schemas at a time and the SQL
DDL has to be in a text file with “.sql” extension. Once the required information is
extracted, the input component uses these data to build the appropriate Schema
objects and hands them over to the Matching Component. These Schema
objects are manipulated by the software to perform all the necessary operations.
Schemas from different database engines vary in terms of
data types, constraints and other vendor specific properties (Li & Clifton 2000). For
example, in Table 1, the attribute income from Schema A has the data type
SMALLMONEY and needs to be matched with an attribute in Schema B.
As Schema B is from an Oracle database, it does not have a SMALLMONEY data
type as in SQL Server.
            Database engine   Data type    Attribute name
Schema A    MS SQL Server     SMALLMONEY   Income
Schema B    Oracle            NUMBER       income
Table 1: Variations in data types across database vendors
Therefore, similar to the data type conversion process in (Li & Clifton 2000), we use
conversion tables to map data types if the schemas are from different database types,
so that schemas from different databases can also be compared. Currently, the system
supports conversion between Oracle 11i and SQL Server 2008 schemas. The
conversion tables can be modified for supporting additional relational database
engines such as MySQL. This conversion table is given in APPENDIX 1.
The example in Figure 2 shows a snippet of a schema that has been converted into
standard SQL DDL format and is ready to be used by the system for conducting schema
matching operations.
…..
CREATE TABLE ADAAS001.CUSTOMERS (
CUSTOMER_ID NUMBER NOT NULL,
FIRST_NAME VARCHAR2(10) NOT NULL,
LAST_NAME VARCHAR2(10) NOT NULL,
DOB DATE,
PHONE VARCHAR2(12),
PRIMARY KEY (CUSTOMER_ID)
);
CREATE TABLE ADAAS001.EMPLOYEES (
EMPLOYEE_ID NUMBER NOT NULL,
MANAGER_ID NUMBER,
FIRST_NAME VARCHAR2(10) NOT NULL,
LAST_NAME VARCHAR2(10) NOT NULL,
TITLE VARCHAR2(20),
SALARY NUMBER(6),
PRIMARY KEY (EMPLOYEE_ID)
);
…..
Figure 2: Sample snippet of a schema in DDL
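As an illustration of this parsing step, the following C# sketch (an assumption about how such extraction could be done, not the prototype's actual code) pulls the table name and the attribute name/data type pairs out of a CREATE TABLE statement like those above:

using System;
using System.Text.RegularExpressions;

class DdlParserSketch
{
    static void Main()
    {
        string ddl = "CREATE TABLE ADAAS001.CUSTOMERS (" +
                     "CUSTOMER_ID NUMBER NOT NULL, " +
                     "FIRST_NAME VARCHAR2(10) NOT NULL, " +
                     "PRIMARY KEY (CUSTOMER_ID));";

        // Capture the qualified table name and the body between the outer parentheses.
        Match m = Regex.Match(ddl, @"CREATE\s+TABLE\s+([\w\.]+)\s*\((.*)\)\s*;",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Console.WriteLine("Table: " + m.Groups[1].Value);

        // Each comma-separated entry that is not a key constraint defines an attribute.
        foreach (string part in m.Groups[2].Value.Split(','))
        {
            string def = part.Trim();
            if (def.StartsWith("PRIMARY", StringComparison.OrdinalIgnoreCase)) continue;
            string[] tokens = def.Split(' ');
            Console.WriteLine("  Attribute: " + tokens[0] + "  Type: " + tokens[1]);
        }
    }
}

In the framework, names, types and key information extracted in this way populate the Schema objects handed to the Matching Component.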
2.2. Schema matching process
It is very likely that elements that correspond to the same concept in the real
world will have similarities in databases in terms of structure and patterns in
data instances (Li & Clifton 2000; Huimin & Sudha 2004; Halevy, Rajaraman &
Ordille 2006). Therefore, according to (Rahm & Bernstein 2001; Do, H, Melnik &
Rahm 2002; Doan, Noy & Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat
2005; Do, H & Rahm 2007), by using a combination of different types of information,
a better interpretation of the underlying semantics in the schemas can be achieved.
At schema level, we exploit metadata available from the schemas for measuring
similarities at element level (e.g. attribute and table names) and at structural level (e.g.
constraints such as data type, field length, value range).
As in other composite matching systems (Doan, Domingos & Halevy 2001; Embley,
Jackman & Xu 2001; Do, Hai & Rahm 2002), we perform the schema matching in
stages where source schema SS is compared with target schema, ST.
Once the Schema objects constructed by the input component are received at the
Matching Component, the matching operations begin. At the beginning of the
schema matching process, a similarity matrix, M, is initialized, and at each stage a
match function is executed to compute the similarity score S for that function. M is
refined with the values of S at each stage.
The sequence of match execution, as in (Giunchiglia & Yatskevich 2004), and their
scoring weight are predefined. The algorithms we use are in Table 2.
Sequence   Matcher name
1          matchPrefix
2          matchSuffix
3          matchNgrams
4          matchEditDistance
5          matchStructure
6          matchInstance
Table 2: Schema matching algorithms
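The staged execution above can be sketched in C# as follows; this is a minimal illustration assuming each matcher is a function from two element names to a score in [0, 1] and that the matrix M accumulates the scores of the matchers executed in sequence (the actual matchers are described in the following sections):

using System;

class MatchPipeline
{
    // Each matcher maps a pair of element names to a score in [0, 1].
    delegate double Matcher(string a, string b);

    static void Main()
    {
        string[] source = { "teaMat", "salary" };       // elements of schema SS
        string[] target = { "mathsTeacher", "income" }; // elements of schema ST

        // Matchers run in the predefined sequence of Table 2; an exact-name matcher
        // stands in here for the real ones.
        Matcher[] matchers = {
            (a, b) => a.Equals(b, StringComparison.OrdinalIgnoreCase) ? 1.0 : 0.0
            // ... matchPrefix, matchSuffix, matchNgrams, matchEditDistance, ...
        };

        double[,] M = new double[source.Length, target.Length]; // similarity matrix M
        foreach (Matcher match in matchers)
            for (int i = 0; i < source.Length; i++)
                for (int j = 0; j < target.Length; j++)
                    M[i, j] += match(source[i], target[j]);     // refine M at each stage

        Console.WriteLine(M[0, 0]);
    }
}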
2.2.1. Matching at element level
In the first stage of the schema matching process, emphasis is placed on determining the
syntactic similarity of attribute and table names by employing several name matching
functions. In every match, each element in source schema SS is compared with
each element of target schema ST to obtain the similarity score S for the match.
2.2.1.1. Pre-processing element names
When designing schemas, abbreviations or multiple words are often used for naming
elements instead of single words (Huimin & Sudha 2004); as a result, element
names that have the same meaning might differ syntactically (Madhavan, Bernstein &
Rahm 2001).
Prior to performing the string matching process, element names need to be pre-processed
for a better understanding of the semantics in element names and higher accuracy
in the subsequent processes.
Expansion of abbreviations and acronyms: First, we tokenize the element names,
similar to (Monge & Elkan 1996; Do, Hai & Rahm 2002; Wu et al. 2004), by isolating
them based on delimiters such as punctuation marks, spaces and substrings in camel-case
names. Tokenization is conducted in order to uncover parts of the name that may
otherwise go undetected. For example, consider comparing
the two strings hum-res and human-resource. A prefix matching operation on
them will discover that these two strings are the same if it is done after tokenizing. However,
executing the same operation without tokenizing will not detect the full extent of their
similarity. The tokenization process is depicted in Table 3, and a sketch of it is shown
after the table.
Element name    Tokenized substrings    Isolation based on
finHRdistr      fin, HR, distr          Camel-case naming
Daily-income    Daily, income           Hyphen delimitation
In_patient      In, patient             Underscore delimitation
Table 3: Examples of string tokenization
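A minimal C# sketch of this tokenization step, assuming hyphen/underscore/space delimiters and the camel-case conventions of Table 3 (the helper name is illustrative, not the prototype's API):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Tokenizer
{
    public static List<string> Tokenize(string name)
    {
        // Break at a lower-to-upper camel-case boundary: finHRdistr -> fin-HRdistr
        string s = Regex.Replace(name, "([a-z0-9])([A-Z])", "$1-$2");
        // Break an acronym (two or more capitals) from a following lowercase word:
        // fin-HRdistr -> fin-HR-distr
        s = Regex.Replace(s, "([A-Z]{2,})([a-z])", "$1-$2");
        var tokens = new List<string>();
        foreach (string t in s.Split(new[] { '-', '_', ' ' },
                                     StringSplitOptions.RemoveEmptyEntries))
            tokens.Add(t);
        return tokens;
    }
}

For instance, Tokenize("finHRdistr") yields (fin, HR, distr) and Tokenize("In_patient") yields (In, patient), matching Table 3.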
2.2.1.2. String matching
Once pre-processing has been done, string matching operations are performed
consecutively. A string matching algorithm, called a matcher, is used to calculate the
similarity score at each matching step.
Matching suffixes and prefixes are two forms of string matching done in many
systems as in (Madhavan, Bernstein & Rahm 2001; Do, Hai & Rahm 2002; Melnik,
Garcia-Molina & Rahm 2002; Giunchiglia & Yatskevich 2004; Shvaiko & Euzenat
2005).
2.2.1.2.1. Prefix matching
First, prefix matching is done between the element names of the schemas to identify
whether the string with the longer length starts with the shorter string. This type of
matching is especially useful in detecting abbreviations (Shvaiko & Euzenat 2005). It
gives a score in the range [0, 1], where 0 indicates no similarity and 1 an exact
match.
When comparing two name strings, every token in one string is compared
with every token in the other string. The match score for both prefix matching and suffix
matching is calculated using this formula:
S = k / ((x + y) / 2)
where k is the number of matching substrings,
x is the number of substrings in element name A from SS, and
y is the number of substrings in element name B from ST.
In both prefix and suffix matching, we use a first-come, first-served tie breaking
strategy similar to (Langville & Meyer 2005). The rationale for using this strategy is
further explained in section 2.4.4 Tie-breaking.
The prefix matching process is depicted in the following example.
Consider two strings being compared, inst_tech_demonstrator and institute-technology-demo.
After tokenization, the two strings become
inst_tech_demonstrator -> (inst, tech, demonstrator) and
institute-technology-demo -> (institute, technology, demo)
By performing a prefix matching operation on every token with each other, the
number of matching substrings, k, is obtained. Table 4 shows an example of prefix
matching.

             Element A
Element B    inst   tech   demonstrator
technology   0      1      0
institute    1      0      0
demo         0      0      1
Table 4: Prefix matching example 1
In this case, k = 3.
Similarity S between A and B
= 3 / ((3 + 3) / 2)
= 1
A score of 1 indicates that these two fields match exactly when prefix matching is
done.
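The token-level comparison can be sketched in C# as below; this is a hedged illustration of the approach, not the prototype's code. With StartsWith swapped for EndsWith (the prefix flag) the same routine becomes the suffix matcher of the next section, and the matchFoundList of the algorithm in section 2.3.1 appears here as the used set:

using System;
using System.Collections.Generic;

static class NameMatcher
{
    // Camel-case splitting is omitted for brevity; see the Tokenizer sketch above.
    static string[] Tokens(string s) =>
        s.Split(new[] { '-', '_' }, StringSplitOptions.RemoveEmptyEntries);

    // prefix = true: the longer token must start with the shorter one;
    // prefix = false: it must end with it (suffix matching).
    public static double Match(string a, string b, bool prefix)
    {
        string[] P = Tokens(a), Q = Tokens(b);
        var used = new HashSet<int>();          // each target token matches at most once
        int k = 0;                              // number of matching substrings
        foreach (string p in P)
            for (int j = 0; j < Q.Length; j++)
            {
                if (used.Contains(j)) continue;
                string longer  = p.Length >= Q[j].Length ? p : Q[j];
                string shorter = p.Length >= Q[j].Length ? Q[j] : p;
                bool hit = prefix
                    ? longer.StartsWith(shorter, StringComparison.OrdinalIgnoreCase)
                    : longer.EndsWith(shorter, StringComparison.OrdinalIgnoreCase);
                if (hit) { used.Add(j); k++; break; }
            }
        return k / ((P.Length + Q.Length) / 2.0);   // S = k / ((x + y) / 2)
    }
}

Match("inst_tech_demonstrator", "institute-technology-demo", true) returns 3 / 3 = 1, as in Table 4.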
Sometimes, prefix matching is not very accurate in determining the similarity. It
might give a high score even though the strings are not related, as in the example
below.

            Element A
Element B   tea   mat
maths       0     1
teacher     1     0
Table 5: Prefix matching example 2
This example shows a perfect match although the two names have different meanings. Therefore, to
reduce the effect of such false positives, we use multiple string matching functions.
2.2.1.2.2. Suffix matching
Suffix matching identifies whether one string ends with another. This type of
matching is especially useful in detecting words that are related in meaning (Shvaiko
& Euzenat 2005) though they are not exactly the same. For example, 'saw' can be
matched with handsaw, hacksaw and jigsaw; 'ball' can be matched with volleyball,
baseball and football; and nut with peanut, chestnut and walnut.
String matching calculations are done the same way as in prefix matching. Consider the
example below:

            Element A
Element B   human   resource   type
human       1       0          0
resource    0       1          0
fund        0       0          0
Table 6: Suffix matching example
Similarity S between Human-Resource-Type and Human-Resource-fund
= 2 / ((3 + 3) / 2)
= 0.67
Nevertheless, suffix matching does not always guarantee accurate results for every
matching operation. It is not that effective in matching some words (e.g. car ->
Madagascar, rent -> current). Therefore, we use additional string similarity matching
functions as described in the next sections to reduce such effects.
2.2.1.3. String similarity matching
2.2.1.3.1. Edit distance (Levenshtein distance)
Similar to (Do, Hai & Rahm 2002; Chua, CEH, Chiang, RHL & Lim, E-P 2003;
Cohen, Ravikumar & Fienberg 2003; Giunchiglia & Yatskevich 2004), we use a form
of edit distance, Levenshtein distance, to calculate the similarity between name
strings. This is for determining how similar the characters in the two strings are.
In Levenshtein distance, the number of operations – character substitutions, insertions
and deletions – required to transform one string into another is calculated, assigning a
value of 1 to each operation performed. This value indicates the Levenshtein distance,
k, a measure of error or dissimilarity (Navarro 2001) between the two strings; shorter
the distance, higher is the similarity (Navarro 2001; Giunchiglia & Yatskevich 2004).
As we need to compute the similarity rather than the dissimilarity, the similarity S is
expressed as a value between 0 and 1, as in (Chua, CEH, Chiang, RHL & Lim, E-P
2003; Giunchiglia & Yatskevich 2004), by excluding the error ratio. This is
done by the following equation (Navarro 2001; Chua, CEH, Chiang, RHL & Lim, E-P
2003; Giunchiglia & Yatskevich 2004). Let A and B be the two strings being compared; then
S = 1 - k / max(length(A), length(B))
where S is the similarity value between the two strings and
k is the Levenshtein distance.
Hence, for an identical match, the edit distance k = 0 and the similarity score S will be 1.
For example, the similarity between infoSc and informationScience is
S = 1 - 12/18
= 0.333
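A compact C# sketch of the Levenshtein computation and the similarity formula above (the standard dynamic programming distance; illustrative rather than the prototype's code):

using System;

static class EditDistance
{
    public static double Similarity(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insertions
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;   // substitution cost
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        int k = d[a.Length, b.Length];                     // Levenshtein distance
        return 1.0 - (double)k / Math.Max(a.Length, b.Length);
    }
}

Similarity("infoSc", "informationScience") gives 1 - 12/18 = 0.333, as above.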
2.2.1.3.2. N-gram
With Levenshtein distance, the reflected similarity value might not be very accurate in
some types of strings. Therefore, in the next step of string similarity matching, we use
n-gram matching as in, (Do, Hai & Rahm 2002; Giunchiglia, Shvaiko & Yatskevich
2004).
In this technique, the number of common n-grams, n, between the two strings is
counted and a similarity score, S, is given by the following equation:
Let A and B be the two strings being compared; then
S = n / max(ngrams(A), ngrams(B))
For comparing some forms of strings, n-gram performs better than edit distance. For
example, consider two strings ‘Contest’ and ‘Context’.
ngrams(Contest) -> Con, ont, nte, tes, est = 5
ngrams(Context) -> Con, ont, nte, tex, ext = 5
S = 3/5 = 0.6
With n-grams, the similarity score is 0.6, whereas edit distance on these two strings
gives a similarity score of 0.86.
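A C# sketch of the n-gram similarity (trigrams by default, as in the example above; an illustration, not the prototype's code):

using System;
using System.Collections.Generic;

static class NgramMatcher
{
    public static double Similarity(string a, string b, int g = 3)
    {
        List<string> P = Ngrams(a, g), Q = Ngrams(b, g);
        var used = new HashSet<int>();          // indices in Q already matched
        int common = 0;                         // number of common n-grams
        foreach (string p in P)
            for (int j = 0; j < Q.Count; j++)
                if (!used.Contains(j) && Q[j] == p) { used.Add(j); common++; break; }
        return (double)common / Math.Max(P.Count, Q.Count);   // S = n / max(|P|, |Q|)
    }

    static List<string> Ngrams(string s, int g)
    {
        var result = new List<string>();
        for (int i = 0; i + g <= s.Length; i++)
            result.Add(s.Substring(i, g));      // sliding window of length g
        return result;
    }
}

Similarity("Contest", "Context") returns 3/5 = 0.6.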
2.2.1.4. Identifying structural similarities
Schema elements that correspond to the same real-world entities are likely to have similar
structural properties (Li & Clifton 2000); therefore, structural properties are believed
to provide some evidence for discovering the conceptual meanings embedded within
schemas. Similar to (Li & Clifton 2000; Huimin & Sudha 2004), the structural data
(metadata) utilized in this framework are given in Table 7.
#    Metadata       Details
1    Data type      Method in section 2.2.1.4.1
2    Field length   1 if same length
3    Range          1 if same range
4    Primary key    1 if both primary key
5    Foreign key    1 if both foreign key
6    Unique         1 if both unique key
7    Nullable       1 if both null allowed / not allowed
8    Default        1 if both same default
9    Precision      1 if both same precision
10   Scale          1 if both same scale
Table 7: Structural metadata for structural comparison
In this stage, elements in schema SS are compared with those in schema ST for the
structural similarities listed in Table 7. The match function, matchStructure, checks
for structural similarities in the sequence specified in Table 7 and gives the score S in
an m x n matrix as in the example below (Table 8):
Attributes       m1 vs n1   m1 vs n2   m2 vs n1   m2 vs n2
Data type        0.65       0.8        0.85       0.7
Field length     0          1          1          0
Range            1          0          1          0
Primary key      0          0          ...        ...
Foreign key      0          0          ...        ...
Unique           1          0          ...        ...
Nullable         1          0          ...        ...
Default          0          1          ...        ...
Precision        1          1          ...        ...
Scale            0          1          ...        ...
Table 8: Structural matching example
For each type of metadata except data type, a similarity value of 1 is given if that
property is common to both fields, and 0 is given if they do not match.
2.2.1.4.1. Data type constraints
For comparing similarities between data types, we construct a data type synonyms table
and use it for the data type comparisons, as done in other similar
research (Li & Clifton 2000; Do, Hai & Rahm 2002; Thang & Nam 2008; Karasneh
et al. 2009; Tekli, Chbeir & Yetongnon 2009).
Since there are variations in data types across different database systems (Li & Clifton
2000), recognizing similarities in data types is not very straightforward. Therefore,
based on (Oracle 2008) and (Microsoft 2008), we first construct a Vendor Specific
Data Types Table (VSDT) that consists of data type mappings for Oracle and SQL
Server. This table is in APPENDIX 1.
Based on the VSDT table, data types that have a high level of similarity can be
detected. For example, Oracle has a data type called NUMBER, but SQL Server does
not; from the VSDT table it can be derived that NUMBER in Oracle is
equivalent to FLOAT in SQL Server. Therefore, the two data types can be mapped as a
match with maximum similarity. Similarly, SQL Server has a DATETIME data
type but Oracle does not; it has a DATE type instead. In such a case, the two data
types are mapped and given a maximum data type similarity score of 1.
There are often cases where two data types are not the same but possess some level
of similarity. For example, integer and float are not the same, but they do have some
similarity as both types are numbers (Li & Clifton 2000); likewise, char and
varchar are similar cases. In order to give consideration to
data types that have some level of similarity, we further categorise all the data types
into a more generic data type classification, similar to (Li & Clifton 2000). We make
this classification based on (Li & Clifton 2000; Oracle 2003; Microsoft 2007) and
give a fixed similarity value of 0.5 for such cases, and a score of 0 if there is no match.
We call this table the Generic Data Type Table (GDTT). The GDTT is in APPENDIX 2.
As an example, consider the table below (Table 9). In this example, the attribute
quantity in SchemaA has the data type integer and the attribute amount in SchemaB has the
data type float. In this case, as integer maps to float in the Generic Data Type
Table (GDTT), the pair gets a similarity value of 0.5.
          Attribute   Data type   Data type similarity
SchemaA   quantity    integer     0.5
SchemaB   amount      float
Table 9: Data type matching example
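The two-table lookup can be sketched in C# as follows. The entries shown are a small illustrative subset and an assumption for the example; the full VSDT and GDTT tables are in the appendices:

using System;
using System.Collections.Generic;

static class DataTypeMatcher
{
    // Generic category per concrete type (GDTT-style); simplified sample entries.
    static readonly Dictionary<string, string> Generic = new Dictionary<string, string> {
        { "INTEGER", "numeric" }, { "FLOAT", "numeric" }, { "NUMBER", "numeric" },
        { "CHAR", "string" }, { "VARCHAR", "string" }, { "VARCHAR2", "string" },
        { "DATE", "temporal" }, { "DATETIME", "temporal" }
    };

    // Cross-vendor equivalences (VSDT-style), e.g. Oracle NUMBER ~ SQL Server FLOAT.
    static readonly HashSet<(string, string)> Equivalent = new HashSet<(string, string)> {
        ("NUMBER", "FLOAT"), ("DATE", "DATETIME")
    };

    public static double Similarity(string t1, string t2)
    {
        t1 = t1.ToUpperInvariant(); t2 = t2.ToUpperInvariant();
        if (t1 == t2 || Equivalent.Contains((t1, t2)) || Equivalent.Contains((t2, t1)))
            return 1.0;      // identical or vendor-mapped types
        if (Generic.TryGetValue(t1, out var g1) && Generic.TryGetValue(t2, out var g2)
            && g1 == g2)
            return 0.5;      // same generic category only
        return 0.0;          // no match
    }
}

Similarity("integer", "float") returns 0.5, as in Table 9.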
2.2.1.4.2. Calculating final structural similarity score
As described in sections 2.2.1.4 and 2.2.1.4.1, we obtain the property-comparison values for all the
attributes and calculate their average to get a final structural similarity score for every
attribute pair. That is, we add the property comparison values and divide by 10, as we
utilize 10 different metadata properties. The formula is
S = k / N
where S is the structural similarity score,
k is the total property comparison score, and
N is the total number of properties considered.
2.2.2. Matching at instance level
Instance values of the databases can also be used as an additional source for
discovering relationships between databases (Huimin & Sudha 2004). It is possible that the
same data might be represented differently in different databases (Huimin & Sudha 2004). For
example, ‘morning’ might be represented as ‘am’ while another database could
represent it as ‘m’ or ‘1’. Although this issue exists in databases, information from
analysis of actual data values is often complementary to schema level matching (Do,
Hai & Rahm 2002; Chua, C, Chiang, R & Lim, E 2003) and can be valuable
especially in circumstances where available schema information is limited (Huimin &
Sudha 2004).
Similar to (Li & Clifton 2000; Huimin & Sudha 2004), we use several statistical
features of the instance values for assessing similarities in database fields. The
features we utilize in this framework are given in the table below (Table 10).
#   Properties
1   Mean length of the values
2   Standard deviation of length of the values
3   Mean ratio of number of numeric characters
4   Mean ratio of number of non-alphanumeric characters
5   Mean ratio of number of distinct values to total tuples in the table
6   Mean ratio of blank values to total tuples in the table
Table 10: Properties utilized for instance matching
For showing how the statistical analysis is performed on a table of instances, the table
below (Table 11) shows some sample data of a book store database and Table 12
shows details of how the calculations are performed.
ISBN         Author             Title                                                 Ref_no
0062039741   Justin Bieber      First Step 2 Forever                                  XA345
0061997811   Brian Sibley       Harry Potter Film Wizardry                            NAHKW1
1423113381   Rick Riordan       The Red Pyramid
1423101472   Mary-Jane Knight   Percy Jackson and the Olympians: The Ultimate Guide
1617804061   Justin Bieber      My World: Easy Piano                                  F9876001
Table 11: Sample dataset for Schema1
ISBN:
  1. Mean(Length) = (10+10+10+10+10)/5 = 10; normalised: 0
  2. StdDev(Length) = 0
  3. Mean(Numeric) = (10/10 + 10/10 + 10/10 + 10/10 + 10/10)/5 = 1
  4. Mean(Non-alphanumeric) = 0
  5. Mean(Distinct) = 5/5 = 1
  6. Mean(Blanks) = 0

Author:
  1. Mean(Length) = (13+12+12+16+13)/5 = 13.2; normalised: (13.2 - 12)/(16 - 12) = 0.3
  2. StdDev(Length): Var = [(0.25-0.3)² + (0-0.3)² + (0-0.3)² + (1-0.3)² + (0.25-0.3)²]/(5-1)
     = 0.675/4 = 0.16875; StdDev = √0.16875 = 0.411
  3. Mean(Numeric) = (0+0+0+0+0)/5 = 0
  4. Mean(Non-alphanumeric) = (0 + 0 + 0 + 1/16 + 0)/5 = 0.0625/5 = 0.013
  5. Mean(Distinct) = 4/5 = 0.8
  6. Mean(Blanks) = (1/13 + 1/12 + 1/12 + 1/16 + 1/13)/5 = 0.383/5 = 0.077

Title:
  1. Mean(Length) = (20+26+15+51+20)/5 = 26.4; normalised: (26.4 - 15)/(51 - 15) = 0.317
  2. StdDev(Length) = 0.397
  3. Mean(Numeric) = (1/20 + 0 + 0 + 0 + 0)/5 = 0.01
  4. Mean(Non-alphanumeric) = (0 + 0 + 0 + 1/51 + 1/20)/5 = 0.0696/5 = 0.014
  5. Mean(Distinct) = 5/5 = 1
  6. Mean(Blanks) = (3/20 + 3/26 + 2/15 + 5/51 + 3/20)/5 = 0.647/5 = 0.129

Ref_no:
  1. Mean(Length) = (5+6+0+0+9)/5 = 4; normalised: 0.444
  2. StdDev(Length) = 0.437
  3. Mean(Numeric) = (3/5 + 1/6 + 0 + 0 + 8/9)/5 = 1.656/5 = 0.331
  4. Mean(Non-alphanumeric) = 0
  5. Mean(Distinct) = 4/5 = 0.8
  6. Mean(Blanks) = 0

Table 12: Examples showing how the statistical calculations are performed
Although each of these properties contributes to the semantics at different
levels, establishing the degree of relevance is not a straightforward task (Huimin &
Sudha 2004), as the original dimensions are measured in different units.
Therefore, we normalize the values if they do not fall within the range [0, 1] (Li &
Clifton 2000; Huimin & Sudha 2004). Of the 6 measurements considered
in this framework, only the first two need to be normalized in this manner;
the other measurements, being ratios, always fall within this range.
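A C# sketch of the feature extraction for one column of instance values. It follows the assumptions of the worked example in Table 12 (min-max normalisation of lengths, sample standard deviation over the normalised lengths, and blanks measured as the ratio of blank space characters per value); it is an illustration, not the prototype's code:

using System;
using System.Collections.Generic;
using System.Linq;

static class InstanceStats
{
    // Returns {Mean(Length) normalised, StdDev(Length), Mean(Numeric),
    //          Mean(Non-alphanumeric), Mean(Distinct), Mean(Blanks)}.
    public static double[] Features(IList<string> values)
    {
        int n = values.Count;
        var lengths = values.Select(v => (double)v.Length).ToList();
        double min = lengths.Min(), max = lengths.Max();
        double span = Math.Max(max - min, 1e-9);           // avoid division by zero

        // Min-max normalise the lengths, then take their mean and sample standard
        // deviation, as in the Author column of Table 12 (mean 0.3, StdDev 0.411).
        var norm = lengths.Select(l => (l - min) / span).ToList();
        double meanNorm = norm.Average();
        double variance = norm.Sum(x => (x - meanNorm) * (x - meanNorm)) / Math.Max(n - 1, 1);

        double numeric = values.Average(v => v.Length == 0 ? 0.0
            : v.Count(char.IsDigit) / (double)v.Length);
        double nonAlnum = values.Average(v => v.Length == 0 ? 0.0
            : v.Count(c => !char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c)) / (double)v.Length);
        double distinct = values.Distinct().Count() / (double)n;
        // As in the worked example, "blanks" is the mean ratio of space characters.
        double blanks = values.Average(v => v.Length == 0 ? 0.0
            : v.Count(char.IsWhiteSpace) / (double)v.Length);

        return new[] { meanNorm, Math.Sqrt(variance), numeric, nonAlnum, distinct, blanks };
    }
}

Applied to the Author column of Table 11 this reproduces {0.3, 0.411, 0, 0.013, 0.8, 0.077}.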
2.2.2.1. Computing the instance similarity
Similar to the works in (Chen 1995; Jain et al. 2002; Yeung & Tsang 2002; Kaur
2010), we calculate the similarity between two fields based on the average Manhattan
distance using this formula:

S = 1 - ( Σ(i=1..n) |x_i - y_i| ) / n

where x_i is the distance measure of a field property in Schema1,
y_i is the distance measure of the corresponding field property in Schema2, and
n is the number of statistical properties or dimensions considered. In this
framework, as we utilize 6 properties, n = 6 by default.
After obtaining the matrices of statistical values for each of the fields in both the
schemas, we compute the instance similarities between them.
For example, Table A in Schema1 has the following statistical values (from Table 12):
ISBN {0, 0, 1, 0, 1, 0}
Author {0.3, 0.411, 0, 0.013, 0.8, 0.077}
Title {0.317, 0.397, 0.01, 0.014, 1, 0.129}
Ref_no {0.444, 0.437, 0.331, 0, 0.8, 0}
Suppose that Table B in Schema2 has similar attributes with statistical values:
ISBN {0, 0, 1, 0, 1, 0}
Author {0.4, 0.45, 0.1, 0.023, 0.7, 0.06}
Title {0.3, 0.29, 0.04, 0.02, 1, 0.14}
Code_no {0.439, 0.433, 0.337, 0, 0.6, 0}
We calculate the similarity from the formula:

S(ISBN) = 1 - (|0 - 0| + |0 - 0| + |1 - 1| + |0 - 0| + |1 - 1| + |0 - 0|)/6
= 1

S(Author) = 1 - (|0.3 - 0.4| + |0.411 - 0.45| + |0 - 0.1| + |0.013 - 0.023| + |0.8 - 0.7| + |0.077 - 0.06|)/6
= 1 - 0.366/6
= 1 - 0.061
= 0.939

From the above two calculations, it can be seen that TableA.ISBN and TableB.ISBN
have a similarity value of 1, indicating the highest possible similarity.
TableA.Author and TableB.Author show a similarity of 0.939.
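In C#, this reduces to a short computation over the two feature vectors (an illustrative sketch):

using System;
using System.Linq;

static class InstanceSimilarity
{
    // S = 1 - (sum of |x_i - y_i|) / n, i.e. one minus the average Manhattan distance
    public static double Compute(double[] x, double[] y) =>
        1.0 - x.Zip(y, (xi, yi) => Math.Abs(xi - yi)).Sum() / x.Length;
}

Compute(new[] { 0.3, 0.411, 0, 0.013, 0.8, 0.077 }, new[] { 0.4, 0.45, 0.1, 0.023, 0.7, 0.06 }) returns 0.939, matching the Author calculation above.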
2.3. Schema matching algorithms
2.3.1. Algorithm 1: Find name similarity of schema elements
Input:
  Set of attribute names in schema SS: X = {Ua.A1, Ua.A2, ... Ua.Ai ... Ua.An}
    Ua is the set of tables in schema SS; Ai is an attribute in table Ua
  Set of attribute names in schema ST: Y = {Vb.B1, Vb.B2, ... Vb.Bn}
    Vb is the set of tables in schema ST; Bi is an attribute in table Vb
Output:
  Table Similarity Matrix: a set of table similarity pairs in the form
    SS.Ua : ST.Vb -> STB
  where SS is the source schema, Ua a source table, ST the target schema,
  Vb a target table and STB the similarity between the two tables

begin
  call matchPrefix(X, Y)        // X and Y are the sets of attributes from SS and ST
  call matchSuffix(X, Y)
  call matchNgram(X, Y, ngram)  // ngram is the n-gram length used
  call getTableMatchResult()
end
function matchPrefix(X, Y)
{
  foreach Ai in X do            // compare every attribute in schema SS with every attribute in ST
  {
    foreach Bj in Y do
    {
      if Ai = Bj then S = 1
      else                      // if the two names are not identical strings, tokenize them
      {
        P = call tokenize(Ai)   // P and Q are the sets of tokens in Ai and Bj respectively
        Q = call tokenize(Bj)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach Pm in P do      // compare every token in Ai with every token in Bj
        {
          foreach Qn in Q do
          {
            if Qn not in matchFoundList then   // Qn may match at most one token
            {
              // if the longer token starts with the shorter one, a token match is found
              if max[length(Pm), length(Qn)] starts with min[length(Pm), length(Qn)] then
              {
                k = k + 1       // the match count is incremented for each token match
                add Qn to matchFoundList
              }
            }
          }
        }
        // S is the prefix similarity score between attributes Ai and Bj
        S(Ai, Bj) = k / { ( |P| + |Q| ) / 2 }
      }
      // UpdateAMSM updates the Attribute Matcher Similarity Matrix (AMSM);
      // matchPrefix is the matcher name, SS.Ua.Ai and ST.Vb.Bj are the attributes of
      // schemas SS and ST respectively, and S is the similarity between them
      call UpdateAMSM(matchPrefix, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}

// tokenizes a string X into a set of substrings
function tokenize(X)
{
  initialize C                  // C is the set of tokens in X
  d = {-, _}                    // d is the set of accepted delimiters
  foreach di in d do
    add split(X, di) to C
  return C
}
function matchSuffix(X, Y)
{
  foreach Ai in X do            // compare every attribute in schema SS with every attribute in ST
  {
    foreach Bj in Y do
    {
      if Ai = Bj then S = 1
      else                      // if the two names are not identical strings, tokenize them
      {
        P = call tokenize(Ai)   // P and Q are the sets of tokens in Ai and Bj respectively
        Q = call tokenize(Bj)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach Pm in P do      // compare every token in Ai with every token in Bj
        {
          foreach Qn in Q do
          {
            if Qn not in matchFoundList then   // Qn may match at most one token
            {
              // if the longer token ends with the shorter one, a token match is found
              if max[length(Pm), length(Qn)] ends with min[length(Pm), length(Qn)] then
              {
                k = k + 1       // the match count is incremented for each token match
                add Qn to matchFoundList
              }
            }
          }
        }
        // S is the suffix similarity score between attributes Ai and Bj
        S(Ai, Bj) = k / { ( |P| + |Q| ) / 2 }
      }
      call UpdateAMSM(matchSuffix, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}
function matchNgram(X, Y, ngram)
{
  foreach Ai in X do            // compare every attribute in schema SS with every attribute in ST
  {
    foreach Bj in Y do
    {
      if Ai = Bj then S = 1
      else                      // if the two names are not identical strings, get their n-grams
      {
        P = call getNgrams(Ai, ngram)   // P and Q are the sets of n-grams in Ai and Bj
        Q = call getNgrams(Bj, ngram)
        k = 0                   // initialize the match counter
        initialize new matchFoundList   // n-grams that have already found a match
        foreach Pm in P do      // compare every n-gram in Ai with every n-gram in Bj
        {
          foreach Qn in Q do
          {
            if Qn not in matchFoundList then
            {
              if Pm = Qn then
              {
                add Qn to matchFoundList   // record that Qn has been matched
                k = k + 1       // the match count is incremented for each n-gram match
              }
            }
          }
        }
        // S is the n-gram similarity score between attributes Ai and Bj
        S(Ai, Bj) = k / max( |P|, |Q| )
      }
      call UpdateAMSM(matchNgram, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM                   // return the Attribute Matcher Similarity Matrix
}

// returns the set of n-grams of a string A; g is the n-gram length used
function getNgrams(A, g)
{
  n = length(A)
  r = n - g + 1                 // r is the last n-gram start position
  i = 1
  while i <= r do               // get n-grams until all possible n-grams are obtained
  {
    ngram_i = getSubstring(A, i, g)
    add ngram_i to C            // C is the set of n-grams obtained
    i = i + 1
  }
  return C
}
// updates the Attribute Matcher Similarity Matrix (AMSM);
// matcher is the matcher name, SS.Ua.Ai and ST.Vb.Bj are the attributes of
// schemas SS and ST respectively, and S is the similarity between them
function UpdateAMSM(matcher, SS.Ua.Ai, ST.Vb.Bj, S)
{
  // AttributeMatcherSimilarity is an object that holds similarity data
  initialize new AttributeMatcherSimilarity(matcher, SS.Ua.Ai, ST.Vb.Bj, S)
  // AMSM holds the similarity information in an array list
  add to AMSM(matcher, SS.Ua.Ai, ST.Vb.Bj, S)
}

function getAttributeMatchResult()
{
  foreach x in AMSM do
  {
    Ai = x.getSourceAttr()      // source attribute in the form schema.table.attribute
    Bj = x.getTargetAttr()      // target attribute in the form schema.table.attribute
    Sx = x.getScore()
    // an AttributeMatchResult object holds the aggregate similarity score for an attribute pair
    initialize new AttributeMatchResult(Ai, Bj, Sx)
    if AttributeMatchResult not in ASM then   // if the pair is not yet in ASM, add it
      add AttributeMatchResult to ASM
    else                                      // if a score for the pair exists, add to it
      add AttributeMatchResult.score() to ASM.score(Ai, Bj)
  }
  // calculate the average score for each pair in ASM (the Attribute Similarity Matrix)
  foreach y in ASM do
  {
    Saggregate = ASM.score(Ai, Bj)            // total score for the pair
    Saverage = Saggregate / ASM.countMatchers // average over the matchers executed
    updateASMscore(SS.Ua.Ai, ST.Vb.Bj, Saverage)   // update ASM with the average score
  }
  return ASM                    // return the Attribute Similarity Matrix
}
function getTableMatchResult()
{
  foreach w in ASM do
  {
    Pm = w.getSourceTable()     // source table in the form schema.table
    Qn = w.getTargetTable()     // target table in the form schema.table
    Sr = w.getScore()
    // a TableMatchResult object holds the aggregate similarity score for a table pair
    initialize new TableMatchResult(Pm, Qn, Sr)
    // TSM is the Table Similarity Matrix
    if TableMatchResult not in TSM then       // if the pair is not yet in TSM, add it
      add TableMatchResult to TSM
    else                                      // if a score for the pair exists, add to it
      add TableMatchResult.score() to TSM.score(Pm, Qn)
  }
  // calculate the average score for each pair in TSM
  foreach z in TSM do
  {
    Saggregate = TSM.score(Pm, Qn)            // total score for the table pair
    Saverage = Saggregate / TSM.possibleMaximumScore(Pm, Qn)
    updateTSMscore(SS.Ua, ST.Vb, Saverage)    // update TSM with the average score
  }
  return TSM                    // return the Table Similarity Matrix
}

function getSchemaMatchResult()
{
  foreach x in TSM do
  {
    Pm = x.getSourceSchema()    // source schema
    Qn = x.getTargetSchema()    // target schema
    Sr = x.getScore()
    // a SchemaMatchResult object holds the aggregate similarity score for a schema pair
    initialize new SchemaMatchResult(Pm, Qn, Sr)
    // SSV is the Schema Similarity Value object
    if SchemaMatchResult not in SSV then      // if the pair is not yet in SSV, add it
      add SchemaMatchResult to SSV
    else                                      // if a score for the pair exists, add to it
      add SchemaMatchResult.score() to SSV.score(Pm, Qn)
  }
  // calculate the average score for the schema pair
  Saverage = SSV.getScore() / SSV.possibleMaximumScore(Pm, Qn)
  updateSSVscore(SS, ST, Saverage)            // update SSV with the average score
  return SSV                    // return the Schema Similarity Value
}
2.3.2. Algorithm 2: Find structural similarities of schemas
Input:
  Structural properties of the attributes in the source and target schemas. Each
  attribute is passed with its name in the form Schema.Table.Attribute, followed by
  its structural properties:
    DT - data type      FL - field length    R  - range
    PK - primary key    FK - foreign key     UK - unique key
    NU - nullable       DE - default value   PR - precision
    SC - scale
  Set of attributes in schema SS:
    X = { (Ua.A1, DT1, FL1, R1, PK1, FK1, UK1, NU1, DE1, PR1, SC1), ...,
          (Ua.Ai, DTi, FLi, Ri, PKi, FKi, UKi, NUi, DEi, PRi, SCi), ...,
          (Ua.An, DTn, FLn, Rn, PKn, FKn, UKn, NUn, DEn, PRn, SCn) }
  Set of attributes in schema ST:
    Y = { (Vb.B1, DT1, FL1, R1, PK1, FK1, UK1, NU1, DE1, PR1, SC1), ...,
          (Vb.Bj, DTj, FLj, Rj, PKj, FKj, UKj, NUj, DEj, PRj, SCj), ...,
          (Vb.Bn, DTn, FLn, Rn, PKn, FKn, UKn, NUn, DEn, PRn, SCn) }
Output:
  Set of table similarity pairs in the form
    SS.Ua : ST.Vb -> STB
  // SS is the source schema; Ua is a source table; ST is the target schema; Vb is a target table

begin
  call matchStructure(X, Y)
  call getMatchResult()
end
// calculates the structural similarity between two schemas
function matchStructure(X, Y)
{
  if SS and ST are from the same type of database server then
    data type ref table = same-DB table
  else
    data type ref table = conversion table (VSDT/GDTT)
  // compare each attribute in the source schema with each in the target schema
  foreach Ai in X do
  {
    foreach Bj in Y do
    {
      k = 0                                          // reset the property match count per pair
      if DT of Ai = DT of Bj then k = k + 1          // data type
      if FL of Ai = FL of Bj then k = k + 1          // field length
      if R  of Ai = R  of Bj then k = k + 1          // range
      if Ai is PK and Bj is PK then k = k + 1        // primary key
      if Ai is FK and Bj is FK then k = k + 1        // foreign key
      if Ai is UK and Bj is UK then k = k + 1        // unique key
      if Ai is NU and Bj is NU then k = k + 1        // nullable
      if DE of Ai = DE of Bj then k = k + 1          // default value
      if PR of Ai = PR of Bj then k = k + 1          // precision
      if SC of Ai = SC of Bj then k = k + 1          // scale
      // maxPossibleSimilarity is the count of structural properties (10)
      S(Ai, Bj) = k / maxPossibleSimilarity()
      // structure is the matcher name, SS.Ua.Ai and ST.Vb.Bj are the attributes of
      // schemas SS and ST respectively, and S is the similarity between them
      call UpdateAMSM(structure, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM          // return the Attribute Matcher Similarity Matrix
}
function matchInstance(X, Y)
{
  foreach Ai in X do   // compare the values of every field in schema SS
  {                    // with those of every field in ST
    foreach Bj in Y do
    {
      call getMeanLength(Bj)
      // ... the remaining statistical properties of Table 10 are computed analogously
    }
  }
}

// T is the set of instances of an attribute whose mean length is to be calculated
function getMeanLength(T)
{
  foreach ti in T do
    add length(ti) to totalLength
  meanLength = totalLength / |T|
  return meanLength
}
2.4. Similarity computation
2.4.1. Computation at matcher level
Match operations are performed on every element of schema SS = {SS1, SS2, ..., SSm}
with every element of ST = {ST1, ST2, ..., STn}, and a similarity value is computed for
each operation on a table-by-table basis, as in the example below.
Similarity scores for the source element teaMat against two target elements:

Match operation   mathsTeacher   diplomat
Prefix            1              0
Suffix            0              0.667
Edit Distance     0.25           0.375
n-gram            0.2            0.167
Structural        0.5            0.3
Instance          0.04           0.2
Table 13: Matcher level similarity calculation example
2.4.2. Computation at attribute level
To obtain a combined matching result, the average score is computed for the pair of
elements, similar to (Do, Hai & Rahm 2002; Bozovic & Vassalos 2008). For example,
after the above match operations, combined similarity values are computed as in the
table below (Table 14).
                     Schema ST, Table y
Schema SS, Table x   mathsTeacher   diplomat
teaMat               0.3625         0.3022
Table 14: Combined similarity calculation at attribute level
Consequently, the combined final result of the matching operations is given in a
similarity matrix, M, with all the elements m in SS and all the elements n in ST in an
m x n matrix as in the example below.

                           Elements in Table y, ST
Elements in Table x, SS    n1      n2      n3
m1
m2
m3
m4
Figure 3: Combined final similarity matrix
The highest score in each row of the matrix indicates the element in SS that has the
highest similarity to the corresponding element in ST.
2.4.3. Computing similarity at table level
After computing the similarities for attributes in SS and ST, we compute the combined
similarity between tables. This is done by taking the ratio of all the similarity values
between table x in schema SS and table y in schema ST with maximum possible
similarity, similar to computations in (Do, Hai & Rahm 2002).
Table similarity computed from the formula given below.
Table Similarity =
Sum of similarities between x and y
Combined maximum similarity of x and y
Example 1

Schema SS, Table x1 against Schema ST, Table y1:

      n1     n2
m1    0.25   0.3
m2    1      0.5
m3    0.7    1
Table 15: Table similarity calculation example 1

Table Similarity = (0.25 + 0.3 + 1 + 0.5 + 0.7 + 1) / 6
= 0.625
Example 2

Schema SS, Table x1 against Schema ST, Table y2:

      n1    n2    n3
m1    0.4   0.5   0.8
m2    1     0.7   0.8
m3    0.9   1     1
Table 16: Table similarity calculation example 2

Table Similarity = (0.4 + 0.5 + 0.8 + 1 + 0.7 + 0.8 + 0.9 + 1 + 1) / 9
= 0.789
From the above two examples, Table x1 has a higher similarity to Table y2 than to Table
y1.
The end result of matching table similarity between two schemas is a matrix that gives
similarity values between all the tables in both the schemas.
                      Tables in schema SS
Tables in schema ST   Table x1   Table x2   Table x3   Table x4
Table y1              0.12       0.6        1          0.5
Table y2              0.9        0.74       0.27       0.91
Table y3              0.61       0.58       0.2        0.1
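A short C# sketch of the table-level computation; since the maximum possible similarity is 1 per attribute pair, the score reduces to the average of the matrix entries:

using System;

static class TableSimilarity
{
    public static double Compute(double[,] attributeSims)
    {
        double sum = 0;
        int pairs = 0;
        foreach (double s in attributeSims) { sum += s; pairs++; }  // every attribute pair
        return sum / pairs;   // ratio of total similarity to the maximum possible (1 per pair)
    }
}

For the matrix of Table 15 this gives 3.75 / 6 = 0.625.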
2.4.4. Tie-breaking
As matching attributes or tables are determined based on the highest similarity score,
there is a possibility of a tie if an attribute or table gets more than one matching
element with the same similarity score. In such cases, it is necessary to implement a
tie breaking strategy to resolve such issues.
2.4.4.1. Tie breaking in table matching
When selecting the best match for tables, we use a maximum cardinality matching
strategy (Wu et al. 2004). For example, consider the set of tables in schema1 as
{m1, m2, m3} and the set of tables in schema2 as {n1, n2, n3}. In maximum cardinality
matching, we choose the matching pairs so that the overall similarity across all tables
is maximized. In this case, the best matching is (m1, n3), (m2, n1), (m3, n2).
      n1     n2     n3
m1    1      0.4    1
m2    1      0.2    0.7
m3    0.8    0.4    0.2
Figure 4: Tie breaking example 1
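The matching procedure itself is not listed in this thesis; for a handful of tables, a brute-force sketch that tries every one-to-one assignment and keeps the one with the highest total similarity reproduces the behaviour above (illustrative code; a real implementation would use a polynomial-time assignment algorithm). Because a strictly greater score is required to replace the current best, ties keep the first assignment found, which also gives the first-come, first-served behaviour discussed next.

static class MaxMatching
{
    // Brute force over all permutations of target indices; keeps the
    // assignment with the highest total similarity. best[i] is the index
    // of the target table matched to source table i.
    public static int[] BestAssignment(double[,] sim)
    {
        int n = sim.GetLength(0);
        var best = new int[n];
        var perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        double bestScore = -1.0;
        Permute(perm, 0, sim, ref bestScore, best);
        return best;
    }

    static void Permute(int[] p, int k, double[,] sim, ref double bestScore, int[] best)
    {
        if (k == p.Length)
        {
            double score = 0;
            for (int i = 0; i < p.Length; i++) score += sim[i, p[i]];
            if (score > bestScore) { bestScore = score; p.CopyTo(best, 0); }
            return;
        }
        for (int i = k; i < p.Length; i++)
        {
            int t = p[k]; p[k] = p[i]; p[i] = t;   // swap in
            Permute(p, k + 1, sim, ref bestScore, best);
            t = p[k]; p[k] = p[i]; p[i] = t;       // swap back
        }
    }
}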
To illustrate the tie breaking strategy, consider the example below.
      n1     n2     n3
m1    1      0.5    1
m2    1      1      0.4
m3    0.6    1      1
Figure 5: Tie breaking example 2
In this example, matching (m1, n3), (m2, n1), (m3, n2) produces the same total as matching (m1, n1), (m2, n2), (m3, n3). In a case like this, the latter matching is chosen because we use a first-come, first-served tie-breaking strategy, similar to (Langville & Meyer 2005).
2.4.4.2. Tie breaking in attribute matching
On the other hand, when matching attributes, we use a greedy matching strategy (Wu et al. 2004), with the intuition that there is a semantic correspondence in how fields are ordered in a database (Wu et al. 2004). The notion that database structure semantically resembles the real world is also mentioned in (Li & Clifton 2000). Therefore, we retain the semantics embedded within the field order to improve the matching accuracy.

For example, consider two fields, departure_date and arrival_date, in one schema. It is typically the case that the departure field occurs before the arrival field. If the corresponding fields in another schema are date_leaving and date_returning, then retaining the field order increases the attribute matching accuracy.
                  date_leaving   date_returning   flg_no
departure_date    1              1                0.2
arrival_date      0.4            1                0.9
seat_no           0.3            0.8              0.2
Table 17: Tie breaking example in attribute matching
By adopting a greedy matching strategy, we can break ties that occur and at the same time preserve these semantics, thereby producing results with higher accuracy: (departure_date, date_leaving), (arrival_date, date_returning), (seat_no, flg_no). However, if we use maximum cardinality matching, as in the case of matching tables, the matching accuracy is reduced: (departure_date, date_returning), (arrival_date, date_leaving), (seat_no, flg_no).
3. EMPIRICAL EVALUATION
In this section, we evaluate the accuracy of the matching algorithms and the scalability and efficiency of the framework.
3.1. Experimental setup
3.1.1. Database
Experiments are conducted on two Oracle database schemas publicly available for download from (Oracle 2002). The first schema, Schema1, is from a Human Resources (HR) database with 9 tables, 32 attributes and a total of 50 instances. The other schema, Schema2, is from an Order Entry (OE) database with 9 tables, 51 attributes and a total of 2652 instances. These details are given in Table 18.
In order to determine the effect of schema size on precision, we also use rescaled versions of the above two schemas. The structure of these schemas can be found in section 3.2.8. These schemas are installed on an Oracle 10.2 database server in a UniSA laboratory. We use iSQL*Plus 10.2 and RazorSQL v5.2.5 to access the database.
Schema1 (HR)              Schema2 (OE)
CUSTOMERS: 5 (5)          CUSTOMERS: 8 (319)
DIVISIONS: 2 (4)          DEPARTMENTS: 4 (27)
EMPLOYEES: 6 (4)          EMPLOYEES: 11 (107)
JOBS: 2 (5)               INVENTORIES: 3 (1112)
ORDER_STATUS: 3 (2)       JOBS: 4 (19)
PRODUCTS: 5 (12)          JOB_HISTORY: 5 (10)
PRODUCT_TYPES: 2 (5)      ORDERS: 8 (105)
PURCHASES: 4 (9)          ORDER_ITEMS: 5 (665)
SALARY_GRADES: 3 (4)      PRODUCT_DESCRIPTIONS: 4 (288)
Table 18: Summary of experimental schemas; each entry gives the number of attributes (instances) per table
3.1.2. Schema matcher prototype
For evaluating the framework and testing the effectiveness of the schema matching algorithms, we built a prototype in Microsoft .NET/C# as a command-line program (Figure 7).
Figure 7: Schema matcher prototype
We use this program to extract schema information from text files that contain schemas in SQL DDL format and to build internal class objects from that information. The matching algorithms and all the relevant functions are implemented in this program. We also use it to connect to the Oracle database at UniSA over the internet and extract instance information from the database. A list of software tools used for this prototype is given in Table 19.
Type of software tool          Details
Database server                Oracle 10.2
Database manipulation          iSQL*Plus 10.2 and RazorSQL v5.2.5
Programming environment        Microsoft Visual Studio 2008
Programming language           Microsoft .NET/C#
.NET framework                 Version 3.5
Database connector for .NET    Oracle Data Provider for .NET 11.2.0.1.2 (available from
                               http://www.oracle.com/technetwork/topics/dotnet/utilsoft-086879.html)
Table 19: Software tools used for prototype
3.2. Evaluating the accuracy of matching algorithms
To assess how accurately the matching algorithms can detect similarities, we use precision, a commonly used measurement in information retrieval (IR), similar to (Do, Hai & Rahm 2002; Kang & Naughton 2003; Wang et al. 2004; Madhavan et al. 2005; Blake 2007; Nottelmann & Straccia 2007). Precision measures how many of the detected matches are correct in reality (Rahm & Bernstein 2001).
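Formally, precision = (number of correctly detected matches) / (total number of detected matches); for example, 28 correct matches out of 36 detected gives 28/36 ≈ 77.8%, as reported for the prefix matcher below.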
To assess the quality of the match results, we first execute the matching algorithms individually. From the detected matches, we determine the correct matches manually to obtain the precision measurement. Next, we execute all the algorithms together to get a holistic match result and, again, manually determine which of the detected matches are correct in reality.
3.2.1. Prefix matching algorithm
Evaluating accuracy by applying only the prefix matching algorithm shows that it is quite effective in detecting similarity, S, between strings. Identical strings are detected as perfect matches (S = 1). Strings that match partially also show high similarity if they have delimiters such as underscores, e.g. FIRST_NAME and CUST_FIRST_NAME (S = 0.8).
Strings with an unbalanced number of delimited substrings show some similarity, but lower than in the previous cases, e.g. EMPLOYEES.SALARY and JOBS.MIN_SALARY. In this case, SALARY has only one substring while MIN_SALARY has two substrings delimited by an underscore. This is a positive characteristic because it indicates that the two strings are not semantically identical.
A trend can be observed in the similarities: as the number of substrings becomes more unbalanced between the two strings, the similarity reduces. This trend can be clearly seen in the sample similarity results in Table 20, which shows the attributes in Schema.Table.Attribute format. More detailed results are in APPENDIX 3.

With a minimum similarity threshold of 0.8, we achieve a precision of 77.8%; that is, out of the 36 detected similarities, 28 are correct when determined manually. A sample of the results is given in Table 20.
Schema 1 Attributes                      Schema 2 Attributes                             Similarity
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID                   1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME                    1
Schema1.CUSTOMERS.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME                     1
Schema1.PURCHASES.QUANTITY               Schema2.ORDER_ITEMS.QUANTITY                    1
Schema1.PURCHASES.PRODUCT_ID             Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID         1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME               0.8
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME                0.8
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_STATUS                     0.8
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS                 0.8
Schema1.DIVISIONS.NAME                   Schema2.DEPARTMENTS.DEPARTMENT_NAME             0.8
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.LAST_NAME                     0.8
Schema1.EMPLOYEES.TITLE                  Schema2.JOBS.JOB_TITLE                          0.8
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MIN_SALARY                         0.8
Schema1.CUSTOMERS.FIRST_NAME             Schema2.DEPARTMENTS.DEPARTMENT_NAME             0.667
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.LAST_NAME                     0.667
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.INVENTORIES.PRODUCT_ID                  0.667
Schema1.CUSTOMERS.FIRST_NAME             Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME    0.667
Schema1.ORDER_STATUS.LAST_MODIFIED       Schema2.EMPLOYEES.LAST_NAME                     0.667
Schema1.SALARY_GRADES.LOW_SALARY         Schema2.JOBS.MAX_SALARY                         0.667
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUST_FIRST_NAME               0.5
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.ACCOUNT_MGR_ID                0.4
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.CUSTOMERS.ACCOUNT_MGR_ID                0.333
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.PHONE_NUMBERS                 0
Table 20: Prefix matching sample results
3.2.2. Suffix matching algorithm
In this experiment, the results are very similar to the prefix matching results because of the high number of attribute names with delimiters. Suffix matching is much more useful for matching strings without delimiters, e.g. phone and telephone. Table 21 shows some sample results. We achieve 77.8% precision with a 0.8 similarity threshold.
Schema 1 Attributes                      Schema 2 Attributes                    Similarity
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID          1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME           1
Schema1.EMPLOYEES.SALARY                 Schema2.EMPLOYEES.SALARY               1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME      0.8
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.INVENTORIES.PRODUCT_ID         0.8
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS        0.667
Schema1.CUSTOMERS.FIRST_NAME             Schema2.DEPARTMENTS.DEPARTMENT_NAME    0.5
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.ORDERS.SALES_REP_ID            0.4
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.CUSTOMERS.ACCOUNT_MGR_ID       0.333
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.PHONE_NUMBERS        0
Table 21: Suffix matching sample results
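The prefix and suffix similarity values reported above are consistent with a Dice coefficient over the underscore-delimited tokens of the two names, 2·|common tokens| / (|tokens A| + |tokens B|); e.g. FIRST_NAME vs CUST_FIRST_NAME gives 2·2/5 = 0.8 and PHONE vs PHONE_NUMBERS gives 2·1/3 ≈ 0.667. The sketch below counts only exact token matches, so it cannot distinguish prefix from suffix matching; the actual algorithms (section 2.3) additionally credit partial token prefixes and suffixes, such as CUST against CUSTOMER or phone against telephone.

using System.Linq;

static class TokenDice
{
    // Dice coefficient over underscore-delimited, case-folded name tokens.
    public static double Similarity(string a, string b)
    {
        string[] ta = a.ToUpperInvariant().Split('_');
        string[] tb = b.ToUpperInvariant().Split('_');
        int common = ta.Intersect(tb).Count();
        return 2.0 * common / (ta.Length + tb.Length);
    }
}

// TokenDice.Similarity("FIRST_NAME", "CUST_FIRST_NAME") = 0.8
// TokenDice.Similarity("PHONE", "PHONE_NUMBERS")        ≈ 0.667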
3.2.3. N-grams matching algorithm
From the n-gram matching results, it can be seen that even if two strings have some common words, the similarity is not very high. This is because the name strings are compared by their n-grams, and the similarity depends on the number of common n-grams. As the string length increases in one attribute, the number of n-grams also increases and, therefore, the similarity decreases. For example, CUSTOMER_ID and ACCOUNT_MGR_ID have a similarity of 0.1. This similarity is due to the trigram RID, which is common to both once delimiters are removed before the comparison. By default, we use trigrams (n = 3).
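The reported n-gram values are reproduced by the following sketch, which strips delimiters, collects the distinct trigrams of each name and divides the number of shared trigrams by the larger trigram count; e.g. ORDER_STATUS_ID vs ORDER_STATUS gives 9/11 ≈ 0.818 and CUSTOMER_ID vs ACCOUNT_MGR_ID gives 1/10 = 0.1. (Illustrative code; the algorithm itself is defined in section 2.2.1.3.2.)

using System;
using System.Collections.Generic;
using System.Linq;

static class NGramMatcher
{
    // Shared distinct n-grams divided by the larger n-gram count.
    public static double Similarity(string a, string b, int n = 3)
    {
        HashSet<string> ga = Grams(a, n), gb = Grams(b, n);
        if (ga.Count == 0 || gb.Count == 0) return 0.0;
        int common = ga.Intersect(gb).Count();
        return (double)common / Math.Max(ga.Count, gb.Count);
    }

    static HashSet<string> Grams(string s, int n)
    {
        s = s.Replace("_", "").ToUpperInvariant();   // remove delimiters first
        var grams = new HashSet<string>();
        for (int i = 0; i + n <= s.Length; i++)
            grams.Add(s.Substring(i, n));
        return grams;
    }
}

// NGramMatcher.Similarity("ORDER_STATUS_ID", "ORDER_STATUS") ≈ 0.818
// NGramMatcher.Similarity("CUSTOMER_ID", "ACCOUNT_MGR_ID")   = 0.1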
With a threshold of 0.6, we achieve a precision of 100%; that is, all 28 detected matches are correct. Table 22 shows some samples of the results.
Schema 1 Attributes                     Schema 2 Attributes                   Similarity
Schema1.EMPLOYEES.EMPLOYEE_ID           Schema2.JOB_HISTORY.EMPLOYEE_ID       1
Schema1.EMPLOYEES.LAST_NAME             Schema2.EMPLOYEES.LAST_NAME           1
Schema1.ORDER_STATUS.ORDER_STATUS_ID    Schema2.ORDERS.ORDER_STATUS           0.818
Schema1.CUSTOMERS.FIRST_NAME            Schema2.CUSTOMERS.CUST_FIRST_NAME     0.636
Schema1.CUSTOMERS.FIRST_NAME            Schema2.EMPLOYEES.LAST_NAME           0.571
Schema1.EMPLOYEES.TITLE                 Schema2.JOBS.JOB_TITLE                0.5
Schema1.ORDER_STATUS.STATUS             Schema2.ORDERS.ORDER_STATUS           0.444
Schema1.PRODUCTS.PRODUCT_TYPE_ID        Schema2.INVENTORIES.PRODUCT_ID        0.455
Schema1.DIVISIONS.DIVISION_ID           Schema2.DEPARTMENTS.LOCATION_ID       0.375
Schema1.CUSTOMERS.PHONE                 Schema2.CUSTOMERS.PHONE_NUMBERS       0.3
Schema1.CUSTOMERS.FIRST_NAME            Schema2.ORDERS.ORDER_STATUS           0.111
Schema1.CUSTOMERS.CUSTOMER_ID           Schema2.CUSTOMERS.ACCOUNT_MGR_ID      0.1
Table 22: N-gram matching sample results
3.2.4. Structural matching algorithm
It can be observed that there is very high structural similarity when the data types and field lengths are the same. For example, EMPLOYEES.TITLE and CUSTOMERS.CUST_FIRST_NAME have maximum similarity, as their constraints are also the same; ORDER_STATUS.LAST_MODIFIED and EMPLOYEES.HIRE_DATE are a similar case. However, even if two fields look very similar, they might have slight differences that decrease the similarity value significantly. For example, Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.CUSTOMER_ID look almost identical: they have the same data type, and both are primary keys of a table with the same name. Their difference is that the Schema2 element has a field length defined whereas the Schema1 element does not.

However, at times, even when two fields are very different semantically, they can have maximum structural similarity. For example, EMPLOYEES.SALARY and ORDERS.PROMOTION_ID show maximum structural similarity because they have the same data type, Number, and all their constraints are also the same. Hence, to avoid this type of situation, we need more than one type of matching algorithm; indeed, when the instance matching algorithm is run on these two fields, it shows a very low similarity. Table 23 shows some sample results of this evaluation.
Schema 1 Attributes                   Schema 2 Attributes                    Similarity
Schema1.EMPLOYEES.TITLE               Schema2.CUSTOMERS.CUST_FIRST_NAME      1
Schema1.EMPLOYEES.SALARY              Schema2.ORDERS.PROMOTION_ID            1
Schema1.ORDER_STATUS.LAST_MODIFIED    Schema2.EMPLOYEES.HIRE_DATE            1
Schema1.CUSTOMERS.CUSTOMER_ID         Schema2.CUSTOMERS.CUSTOMER_ID          0.75
Schema1.CUSTOMERS.PHONE               Schema2.CUSTOMERS.PHONE_NUMBERS        0.75
Schema1.CUSTOMERS.FIRST_NAME          Schema2.EMPLOYEES.HIRE_DATE            0.5
Schema1.CUSTOMERS.DOB                 Schema2.CUSTOMERS.CUSTOMER_ID          0.25
Schema1.PRODUCTS.PRODUCT_ID           Schema2.DEPARTMENTS.DEPARTMENT_NAME    0
Table 23: Structural matching sample results
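The similarity values in Table 23 are all multiples of 0.25, which is consistent with four structural features compared with equal weight. The sketch below assumes the features are data type, field length and two constraint flags; the exact feature set and weights are those defined in section 2.2.1.4, so treat this purely as an illustration.

// Hypothetical field descriptor; the prototype's internal classes differ.
sealed class FieldStructure
{
    public string DataType = "";
    public int? Length;          // null when no field length is defined
    public bool IsPrimaryKey;
    public bool IsNullable;
}

static class StructuralMatcher
{
    // One quarter of the score per matching feature. For the CUSTOMER_ID pair
    // in Table 23, everything matches except the field length, giving 0.75.
    public static double Similarity(FieldStructure a, FieldStructure b)
    {
        int matches = 0;
        if (a.DataType == b.DataType) matches++;
        if (a.Length == b.Length) matches++;
        if (a.IsPrimaryKey == b.IsPrimaryKey) matches++;
        if (a.IsNullable == b.IsNullable) matches++;
        return matches / 4.0;
    }
}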
With a minimum threshold of 0.7, the precision is extremely low: only 23 out of 295 matches are correct, a precision of 7.8%. Even if we increase the minimum threshold to 1, there is not much difference; the precision is 7.4%.

From this we can deduce that structural comparison does not carry much weight in detecting correct matches, and we suspect that this algorithm can have a negative effect on the overall match quality. We will need to reduce the weight given to structural matching, or conduct further investigation to find the reasons for this negative effect and revise the algorithm.
3.2.5. Instance matching
Evaluation of the instance matching algorithm shows that it is very effective in identifying some of the similar fields. For example, Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.CUSTOMER_ID show maximum similarity, while CUSTOMERS.LAST_NAME and Schema2.INVENTORIES.PRODUCT_ID show high dissimilarity. On the other hand, as in the structural matching case, EMPLOYEES.SALARY and CUSTOMERS.CUSTOMER_ID show maximum similarity although they are very different in reality. This situation arises because their data types are the same and the patterns in the actual instances are almost identical.

Another observation is that even when the data instances are very different, if their patterns are similar the algorithm reports a high instance similarity, e.g. Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.PHONE_NUMBERS. It can be deduced that this unfavourable effect is due to the linear normalization of the instances into the range [0, 1]: when normalized, their patterns become very much alike and the values also become similar. A sample of the results is in Table 24.
Schema 1 Attributes              Schema 2 Attributes                    Similarity
Schema1.CUSTOMERS.CUSTOMER_ID    Schema2.CUSTOMERS.CUSTOMER_ID          1
Schema1.EMPLOYEES.SALARY         Schema2.CUSTOMERS.CUSTOMER_ID          1
Schema1.CUSTOMERS.FIRST_NAME     Schema2.JOB_HISTORY.JOB_ID             0.928
Schema1.DIVISIONS.NAME           Schema2.DEPARTMENTS.DEPARTMENT_NAME    0.962
Schema1.EMPLOYEES.FIRST_NAME     Schema2.EMPLOYEES.LAST_NAME            0.907
Schema1.EMPLOYEES.EMPLOYEE_ID    Schema2.JOB_HISTORY.EMPLOYEE_ID        0.95
Schema1.EMPLOYEES.SALARY         Schema2.JOB_HISTORY.EMPLOYEE_ID        0.95
Schema1.CUSTOMERS.LAST_NAME      Schema2.INVENTORIES.PRODUCT_ID         0.49
Schema1.CUSTOMERS.CUSTOMER_ID    Schema2.CUSTOMERS.PHONE_NUMBERS        0.807
Table 24: Instance matching sample results
With a minimum threshold value of 1, we get a precision of 2/23 (8.7%), and if we decrease the threshold to 0.9, we get a precision of 14/137 (10.2%).
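To see why normalization causes this effect, consider the min-max scaling sketch below (illustrative helper): two numeric columns with very different magnitudes but similar spacing, such as sequential customer IDs and sequential phone numbers, normalize to nearly identical patterns in [0, 1].

using System.Linq;

static class Normalizer
{
    // Linear (min-max) normalization of a numeric column into [0, 1].
    public static double[] Normalize(double[] values)
    {
        double min = values.Min(), max = values.Max();
        if (max == min)                                  // constant column
            return values.Select(_ => 0.0).ToArray();
        return values.Select(v => (v - min) / (max - min)).ToArray();
    }
}

// Normalize(new double[] { 1, 2, 3, 4 })                         -> 0, 0.33, 0.67, 1
// Normalize(new double[] { 5550001, 5550002, 5550003, 5550004 }) -> 0, 0.33, 0.67, 1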
3.2.6. Overall accuracy of similarity algorithms
After evaluating the individual matching algorithms, we execute the algorithms sequentially to obtain a collective precision value. 32 matches are detected as having the highest similarity. When we manually evaluate these matches, we find that 18 of them are correctly determined, giving a precision of 56.3%. Among the individual matchers, the n-gram matching algorithm performs its task, finding similarities in strings, very effectively. The results of these evaluations are depicted in Figure 8 below.
[Bar chart of precision (%) for the Prefix, Suffix, N-gram, Structural, Instance and Overall matchers.]
Figure 8: Precision of matching algorithms (%)
3.2.7. Effect of schema size on the overall precision
Next, we conducted the schema matching evaluation again, but with smaller schemas, and obtained the overall precision. Table 25 shows the results of this evaluation; it gives the best matching attribute in Schema 2 for each of the attributes in Schema 1.
Schema 1 Attributes              Schema 2 Attributes                    Similarity
Schema1.EMPLOYEES.FIRST_NAME     Schema2.EMPLOYEES.FIRST_NAME           0.75
Schema1.EMPLOYEES.EMPLOYEE_ID    Schema2.EMPLOYEES.EMPLOYEE_ID          0.75
Schema1.EMPLOYEES.LAST_NAME      Schema2.EMPLOYEES.LAST_NAME            0.75
Schema1.EMPLOYEES.SALARY         Schema2.EMPLOYEES.SALARY               0.75
Schema1.CUSTOMERS.CUSTOMER_ID    Schema2.CUSTOMERS.CUSTOMER_ID          0.75
Schema1.EMPLOYEES.MANAGER_ID     Schema2.EMPLOYEES.MANAGER_ID           0.65
Schema1.JOBS.JOB_ID              Schema2.EMPLOYEES.JOB_ID               0.6
Schema1.CUSTOMERS.FIRST_NAME     Schema2.CUSTOMERS.CUST_FIRST_NAME      0.597
Schema1.CUSTOMERS.LAST_NAME      Schema2.CUSTOMERS.CUST_LAST_NAME       0.59
Schema1.CUSTOMERS.PHONE          Schema2.EMPLOYEES.PHONE_NUMBER         0.483
Schema1.DIVISIONS.NAME           Schema2.DEPARTMENTS.DEPARTMENT_NAME    0.45
Schema1.DIVISIONS.DIVISION_ID    Schema2.DEPARTMENTS.DEPARTMENT_ID      0.3
Schema1.EMPLOYEES.TITLE          Schema2.CUSTOMERS.CUST_EMAIL           0.15
Schema1.JOBS.NAME                Schema2.CUSTOMERS.PHONE_NUMBERS        0.15
Schema1.CUSTOMERS.DOB            Schema2.EMPLOYEES.COMMISSION_PCT       0.1
Table 25: Best matching attributes for smaller schemas with 4 tables
From these results, we can deduce that 10 out of the 15 matches are correct; the precision is therefore 66.7%.

We conducted another evaluation with rescaled schemas that have only one table in each schema. From the results in Table 26, we can observe an improvement in the overall precision: 4 out of 5 matches are correct, giving 80% precision.
Schema 1 Attributes              Schema 2 Attributes                  Similarity
Schema1.CUSTOMERS.CUSTOMER_ID    Schema2.CUSTOMERS.CUSTOMER_ID        0.7
Schema1.CUSTOMERS.FIRST_NAME     Schema2.CUSTOMERS.CUST_FIRST_NAME    0.597
Schema1.CUSTOMERS.LAST_NAME      Schema2.CUSTOMERS.CUST_LAST_NAME     0.59
Schema1.CUSTOMERS.PHONE          Schema2.CUSTOMERS.PHONE_NUMBERS      0.477
Schema1.CUSTOMERS.DOB            Schema2.CUSTOMERS.CUST_EMAIL         0.1
Table 26: Best matching attributes for single table schemas
Based on these results, we can conclude that as the schema size increases, there is a drop in the overall accuracy of the match results. This is depicted in Figure 9.
[Bar chart of overall precision (%) for schemas of 5, 15 and 32 attributes.]
Figure 9: Precision vs. schema size
3.2.8. Efficiency of schema matching process
First we use the largest schemas, each with 9 tables. Executing the matching algorithms, except for the instance matching algorithm, takes very little time, just a few seconds. For the instance matcher, the process of reading the instances, computing the statistical measurements and setting the values on the internal Schema objects takes 11 minutes and 30 seconds over a 1 Mbps internet connection.

In the prototype, we provide 5 options for generating the results in different forms. Option 1 generates a list of attribute similarities for each matcher algorithm. This operation takes less than a second, because all the information is already embedded within the class objects and only needs to be displayed. Option 2 displays aggregate similarities for every attribute when all algorithms are combined; it takes 1 hour and 40 minutes to complete this process and show the results. Option 3 shows the highest matching attribute for each attribute, option 4 shows table similarities, and the last option shows the overall schema similarity.

We observe that the latter three options take 1 hour and 9 minutes each, and all of these times grow drastically as the schema size increases. These long times arise because, each time an option is chosen, the array object is iterated and the aggregate values are recalculated. We conclude that if we instead calculated the values in the same iteration, we could achieve higher throughput; however, this can increase the programming complexity drastically.
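A sketch of this single-pass idea (hypothetical types; not the prototype's actual classes): every aggregate that options 2 to 5 need is accumulated during the one iteration over the pairwise scores and then served from the cache, instead of re-iterating the array for each option.

using System.Collections.Generic;

sealed class AggregateCache
{
    readonly Dictionary<string, double> attributeSim = new Dictionary<string, double>();
    readonly Dictionary<string, double> tableSum = new Dictionary<string, double>();
    readonly Dictionary<string, int> tablePairs = new Dictionary<string, int>();

    // Called once per attribute pair while the similarity array is iterated.
    public void Add(string tableX, string attrA, string tableY, string attrB, double sim)
    {
        attributeSim[attrA + "|" + attrB] = sim;     // option 2: per-attribute aggregate
        string key = tableX + "|" + tableY;
        double s; tableSum.TryGetValue(key, out s);
        tableSum[key] = s + sim;                      // option 4: numerator
        int c; tablePairs.TryGetValue(key, out c);
        tablePairs[key] = c + 1;                      // option 4: number of pairs
    }

    // Table similarity = sum of pairwise similarities / number of pairs.
    public double TableSimilarity(string tableX, string tableY)
    {
        string key = tableX + "|" + tableY;
        return tableSum[key] / tablePairs[key];
    }
}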
Therefore, we performed all the options in sequence to calculate the total time it takes to complete the whole process and estimate the efficiency. Table 27 shows these readings. We observe that as the number of attributes increases, the processing time increases drastically. This is depicted in Figure 10.
Schema size    Time taken
1 table        1 sec
4 tables       1 min 38 sec
9 tables       4 hr 50 min
Table 27: Decrease in efficiency with increase in schema size
[Bar chart of processing time in seconds for schemas of 1, 4 and 9 tables.]
Figure 10: Schema size vs. processing time
Moreover, executing the algorithms sequentially in this framework affects the efficiency as well. If we executed the algorithms in parallel, we could achieve higher throughput. Nevertheless, we prefer sequential execution in our design because it makes the framework more scalable and allows a more comprehensive matching process.
4. CONCLUSION
In this thesis, we conducted a comprehensive study of schema matching. We considered various factors that can make schema matching architectures scalable, efficient and accurate, and proposed a framework incorporating these factors. With the framework, we also developed some algorithms for the purpose of identifying similarities in schemas. We built a prototype to evaluate the framework and these algorithms and conducted some experiments. From these experiments it can be deduced that the various algorithms perform differently under various conditions and that a combination of these algorithms produces better results. We found that there are many angles from which we can improve the architecture in order to make it more efficient, scalable and accurate.
REFERENCES
Aumueller, D, Do, H, Massmann, S & Rahm, E 2005, 'Schema and ontology matching
with COMA++'.
Batini, C, Lenzerini, M & Navathe, S 1986, 'A comparative analysis of methodologies for database schema integration', ACM Computing Surveys (CSUR), vol. 18, no. 4, pp. 323-364.
Bernstein, P, Melnik, S, Petropoulos, M & Quix, C 2004, 'Industrial-strength schema
matching', ACM SIGMOD Record, vol. 33, no. 4, pp. 38-43.
Blake, R 2007, 'A Survey of Schema Matching Research', University of Massachusetts
Boston. The College of Management. Working Papers: September.
Bozovic, N & Vassalos, V 2008, 'Two-phase schema matching in real world relational
databases'.
Chen, S 1995, 'Measures of similarity between vague sets', Fuzzy Sets and Systems, vol.
74, no. 2, pp. 217-223.
Chua, CEH, Chiang, RHL & Lim, E-P 2003, 'Instance-based attribute identification in database integration', The VLDB Journal, vol. 12, no. 3, pp. 228-243.
Cohen, W, Ravikumar, P & Fienberg, S 2003, 'A comparison of string metrics for
matching names and records'.
Do, H 2006, 'Schema matching and mapping-based data integration', Verlag Dr. Müller
(VDM), pp. 3-86550.
Do, H, Melnik, S & Rahm, E 2002, 'Comparison of schema matching evaluations', Web,
Web-Services, and Database Systems, pp. 221-237.
Do, H & Rahm, E 2002, COMA: a system for flexible combination of schema matching
approaches, VLDB Endowment, Hong Kong, China, pp. 610-621.
Do, H & Rahm, E 2007, 'Matching large schemas: Approaches and evaluation',
Information systems, vol. 32, no. 6, pp. 857-885.
Doan, A, Domingos, P & Halevy, A 2001, 'Reconciling schemas of disparate data
sources: A machine-learning approach'.
Doan, A & Halevy, A 2005, 'Semantic integration research in the database community: A
brief survey', AI magazine, vol. 26, no. 1, p. 83.
Doan, A, Noy, N & Halevy, A 2004, 'Introduction to the special issue on semantic
integration', ACM SIGMOD Record, vol. 33, no. 4, pp. 11-13.
Domshlak, C, Gal, A & Roitman, H 2007, 'Rank aggregation for automatic schema
matching', IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 4, p.
538.
Embley, D, Jackman, D & Xu, L 2001, 'Multifaceted exploitation of metadata for
attribute match discovery in information integration'.
Giunchiglia, F, Shvaiko, P & Yatskevich, M 2004, 'S-match: an algorithm and an
implementation of semantic matching', The semantic web: research and applications, pp.
61-75.
Giunchiglia, F & Yatskevich, M 2004, 'Element level semantic matching'.
Halevy, A, Rajaraman, A & Ordille, J 2006, 'Data integration: The teenage years'.
Huimin, Z & Sudha, R 2004, 'Clustering Schema Elements for Semantic Integration of
Heterogeneous Data Sources', Journal of Database Management, vol. 15, no. 4, p. 88.
Jain, R, Murthy, S, Chen, P & Chatterjee, S 2002, 'Similarity measures for image
databases'.
Kang, J & Naughton, J 2003, 'On schema matching with opaque column names and data
values'.
Karasneh, Y, Ibrahim, H, Othman, M & Yaakob, R 2009, 'A model for matching and
integrating heterogeneous relational biomedical databases schemas'.
Kaur, G 2010, 'Similarity measure of different types of fuzzy sets'.
Langville, A & Meyer, C 2005, 'A survey of eigenvector methods for web information
retrieval', SIAM review, vol. 47, no. 1, pp. 135-161.
Li, W & Clifton, C 2000, 'SEMINT: A tool for identifying attribute correspondences in
heterogeneous databases using neural networks', Data and Knowledge Engineering, vol.
33, no. 1, pp. 49-84.
Madhavan, J, Bernstein, P, Doan, A & Halevy, A 2005, 'Corpus-based schema matching'.
Madhavan, J, Bernstein, P & Rahm, E 2001, 'Generic schema matching with cupid'.
Melnik, S, Garcia-Molina, H & Rahm, E 2002, 'Similarity flooding: A versatile graph
matching algorithm and its application to schema matching'.
Microsoft 2008, Data Type Mapping for Oracle Publishers: SQL Server 2008, viewed 22 September 2010, <http://msdn.microsoft.com/en-us/library/ms151817(v=SQL.100).aspx>.
Microsoft 2007, Equivalent ANSI SQL Data Types, viewed 22 September 2010,
<http://msdn.microsoft.com/en-us/library/bb177899.aspx>.
Monge, A & Elkan, C 1996, 'The field matching problem: Algorithms and applications'.
Navarro, G 2001, 'A guided tour to approximate string matching', ACM computing
surveys (CSUR), vol. 33, no. 1, p. 88.
Nottelmann, H & Straccia, U 2007, 'Information retrieval and machine learning for
probabilistic schema matching', Information Processing & Management, vol. 43, no. 3,
pp. 552-576.
Oracle 2003, Heterogeneous Connectivity Administrator’s Guide, viewed 28 September
2010, <http://download.oracle.com/docs/cd/B12037_01/appdev.101/b10795.pdf>.
Oracle 2002, Oracle9i Sample Schemas, Release 2 (9.2), viewed 28 September 2010,
<http://download.oracle.com/docs/cd/B10501_01/server.920/a96539.pdf>.
Oracle 2008, SQL Developer Supplementary Information for Microsoft SQL Server and Sybase
Adaptive Server Migrations Release 1.5, viewed 29 September 2010,
<http://download.oracle.com/docs/cd/E12151_01/doc.150/e12156.pdf>.
Parent, C & Spaccapietra, S 1998, 'Issues and approaches of database integration',
Communications of the ACM, vol. 41, no. 5es, pp. 166-178.
Po, L & Sorrentino, S 2010, 'Automatic generation of probabilistic relationships for
improving schema matching', Information systems.
Rahm, E & Bernstein, P 2001, 'A survey of approaches to automatic schema matching',
The VLDB Journal, vol. 10, no. 4, pp. 334-350.
Shvaiko, P & Euzenat, J 2005, 'A survey of schema-based matching approaches', Journal
on Data Semantics IV, pp. 146-171.
Tekli, J, Chbeir, R & Yetongnon, K 2009, 'Extensible User-Based XML Grammar
Matching', Conceptual Modeling-ER 2009, pp. 294-314.
Thang, H & Nam, V 2008, 'XML Schema Automatic Matching Solution', International
Journal of Computer Systems Science and Engineering, vol. 4, no. 1, pp. 68-74.
Wang, J, Wen, J, Lochovsky, F & Ma, W 2004, 'Instance-based schema matching for web
databases by domain-specific query probing'.
Wu, W, Yu, C, Doan, A & Meng, W 2004, 'An interactive clustering-based approach to
integrating source query interfaces on the deep web'.
Yeung, D & Tsang, E 2002, 'A comparative study on similarity-based fuzzy reasoning
methods', Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,
vol. 27, no. 2, pp. 216-227.
Ziegler, P & Dittrich, K 2004, 'Three Decades of Data Integration—all Problems Solved?', Building the Information Society, pp. 3-12.
APPENDIX 1: Data conversion table
This table is constructed based on the information available from (Oracle 2008) and
(Microsoft 2008).
SQL Server                Oracle
BINARY(n)                 RAW(n)
BINARY(n)                 BLOB
BIT                       NUMBER(1)
CHAR(18)                  ROWID
CHAR(18)                  UROWID
CHAR(n)                   CHAR(n)
DATETIME                  DATE
DATETIME                  INTERVAL
DATETIME                  TIMESTAMP
FLOAT                     FLOAT
FLOAT                     NUMBER
FLOAT                     REAL
IMAGE                     BLOB
IMAGE                     LONG RAW
INTEGER                   NUMBER(10)
MONEY                     NUMBER(19,4)
NCHAR([1-1000])           NCHAR([1-1000])
NCHAR(n)                  CHAR(n*2)
NUMERIC([0-38],[1-38])    NUMBER([0-38],[1-38])
NUMERIC([1-38])           NUMBER([1-38])
NUMERIC(38)               INT
NVARCHAR([1-2000])        NVARCHAR2([1-2000])
NVARCHAR(MAX)             NCLOB
NVARCHAR(n)               VARCHAR(n*2)
REAL                      FLOAT
SMALLDATETIME             DATE
SMALLINT                  NUMBER(5)
SMALLMONEY                NUMBER(10,4)
SYSNAME                   VARCHAR2(30)
SYSNAME                   VARCHAR2(128)
TEXT                      CLOB
TIMESTAMP                 NUMBER
TINYINT                   NUMBER(3)
VARBINARY([1-2000])       RAW([1-2000])
VARBINARY(MAX)            BFILE
VARBINARY(MAX)            BLOB
VARBINARY(n)              RAW(n)
VARBINARY(n)              BLOB
VARCHAR(37)               TIMESTAMP WITH TIME ZONE
VARCHAR(MAX)              CLOB
VARCHAR(MAX)              LONG
VARCHAR(n)                VARCHAR2(n)
APPENDIX 3: Prefix Matching
Highest similarity: 1
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME
Schema1.CUSTOMERS.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.ORDERS.CUSTOMER_ID
Schema1.EMPLOYEES.MANAGER_ID             Schema2.DEPARTMENTS.MANAGER_ID
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.EMPLOYEES.EMPLOYEE_ID
Schema1.EMPLOYEES.MANAGER_ID             Schema2.EMPLOYEES.MANAGER_ID
Schema1.EMPLOYEES.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME
Schema1.EMPLOYEES.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME
Schema1.EMPLOYEES.SALARY                 Schema2.EMPLOYEES.SALARY
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.JOB_HISTORY.EMPLOYEE_ID
Schema1.JOBS.JOB_ID                      Schema2.EMPLOYEES.JOB_ID
Schema1.JOBS.JOB_ID                      Schema2.JOBS.JOB_ID
Schema1.JOBS.JOB_ID                      Schema2.JOB_HISTORY.JOB_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID
Schema1.PURCHASES.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID
Schema1.PURCHASES.PRODUCT_ID             Schema2.INVENTORIES.PRODUCT_ID
Schema1.PURCHASES.CUSTOMER_ID            Schema2.ORDERS.CUSTOMER_ID
Schema1.PURCHASES.PRODUCT_ID             Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PURCHASES.QUANTITY               Schema2.ORDER_ITEMS.QUANTITY
Schema1.PURCHASES.PRODUCT_ID             Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID

Similarity: 0.8
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME
Schema1.EMPLOYEES.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME
Schema1.EMPLOYEES.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_ID
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_STATUS
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDER_ITEMS.ORDER_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID

Similarity: 0.667
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS
Schema1.CUSTOMERS.PHONE                  Schema2.EMPLOYEES.PHONE_NUMBER
Schema1.DIVISIONS.NAME                   Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.FIRST_NAME
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.LAST_NAME
Schema1.DIVISIONS.NAME                   Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.EMPLOYEES.TITLE                  Schema2.JOBS.JOB_TITLE
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MIN_SALARY
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MAX_SALARY
Schema1.JOBS.NAME                        Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.JOBS.NAME                        Schema2.EMPLOYEES.FIRST_NAME
Schema1.JOBS.NAME                        Schema2.EMPLOYEES.LAST_NAME
Schema1.JOBS.NAME                        Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.ORDER_STATUS.STATUS              Schema2.ORDERS.ORDER_STATUS
Schema1.PRODUCTS.NAME                    Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.PRODUCTS.NAME                    Schema2.EMPLOYEES.FIRST_NAME
Schema1.PRODUCTS.NAME                    Schema2.EMPLOYEES.LAST_NAME
Schema1.PRODUCTS.PRICE                   Schema2.ORDER_ITEMS.UNIT_PRICE
Schema1.PRODUCTS.NAME                    Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.PRODUCTS.DESCRIPTION             Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_DESCRIPTION
Schema1.PRODUCT_TYPES.NAME               Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.EMPLOYEES.FIRST_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.EMPLOYEES.LAST_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.SALARY_GRADES.LOW_SALARY         Schema2.EMPLOYEES.SALARY
Schema1.SALARY_GRADES.HIGH_SALARY        Schema2.EMPLOYEES.SALARY