A Framework for Matching Schemas of Relational Databases

by Ahmed Saimon Adam

Supervisor: Dr. Jixue Liu

A minor thesis submitted for the degree of Master of Science (Computer and Information Science)
School of Computer and Information Science
University of South Australia
7th February 2011

Declaration

I declare that this thesis does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university; and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text.

Ahmed Saimon Adam
February 2011

Acknowledgments

I would like to thank my supervisor, Dr. Jixue Liu, for all the help, support and advice he has provided to me during the course of writing this thesis. Without his expert advice and invaluable guidance, I would not have been able to complete this thesis. I would also like to thank our program director, Dr. Jan Stanek, for providing outstanding support and encouragement in times of difficulty and at all times.

CONTENTS

1. Introduction
2. Schema matching architecture
2.1. Input component
2.2. Schema matching process
2.2.1. Matching at element level
2.2.1.1. Pre-processing element names
2.2.1.2. String matching
2.2.1.2.1. Prefix matching
2.2.1.2.2. Suffix matching
2.2.1.3. String similarity matching
2.2.1.3.1. Edit distance (Levenshtein distance)
2.2.1.3.2. N-gram
2.2.1.4. Identifying structural similarities
2.2.1.4.1. Data type constraints
2.2.1.4.2. Calculating final structural similarity score
2.2.2. Matching at instance level
2.2.2.1. Computing the instance similarity
2.3. Schema matching algorithms
2.3.1. Algorithm 1: Find name similarity of schema elements
2.3.2. Algorithm 2: Find structural similarities of schemas
2.4. Similarity computation
2.4.1. Computation at matcher level
2.4.2. Computation at attribute level
2.4.3. Computing similarity at table level
2.4.4. Tie-breaking
2.4.4.1. Tie breaking in table matching
2.4.4.2. Tie breaking in attribute matching
3. Empirical evaluation
3.1. Experimental setup
3.1.1. Database
3.1.2. Schema matcher prototype
3.2. Evaluating the accuracy of matching algorithms
3.2.1. Prefix matching algorithm
3.2.2. Suffix matching algorithm
3.2.3. N-grams matching algorithm
3.2.4. Structural matching algorithm
3.2.5. Instance matching
3.2.6. Overall accuracy of similarity algorithms
3.2.7. Effect of schema size on the overall precision
3.2.8. Efficiency of schema matching process
4. Conclusion
5. References

List of tables

Table 1: Variations in data types across database vendors
Table 2: Schema matching algorithms
Table 3: Examples of string tokenization
Table 4: Prefix matching example 1
Table 5: Prefix matching example 2
Table 6: Suffix matching example
Table 7: Structural metadata for structural comparison
Table 8: Structural matching example
Table 9: Data type matching example
Table 10: Properties utilized for instance matching
Table 11: Sample dataset for Schema1
Table 12: Examples showing how the statistical calculations are performed
Table 13: Matcher level similarity calculation example
Table 14: Combined similarity calculation at attribute level
Table 15: Table similarity calculation example 1
Table 16: Table similarity calculation example 2
Table 17: Tie breaking example in attribute matching
Table 18: Summary of experimental schemas
Table 19: Software tools used for prototype
Table 20: Prefix matching sample results
Table 21: Suffix matching sample results
Table 22: N-gram matching sample results
Table 23: Structure matching sample results
Table 24: Instance matching sample results
Table 25: Best matching attributes for smaller schemas with 4 tables
Table 26: Best matching attributes for single table schemas
Table 27: Decrease in efficiency with increase in schema size

List of figures

Figure 1: Schema matcher architecture
Figure 2: Sample snippet of a schema in DDL
Figure 3: Combined final similarity matrix
Figure 4: Tie breaking example 1
Figure 5: Tie breaking example 2
Figure 7: Schema matcher prototype
Figure 8: Precision of matching algorithms (%)
Figure 9: Precision vs. schema size
Figure 10: Schema size vs. processing time
Abstract

Schema matching, the process of identifying semantic correspondences between database schemas, is one of the vital tasks in database integration. This study is based on relational databases. In this thesis, we investigate past research in schema matching to study the techniques and methodologies in the field. We define an architecture for a schema matching system and propose a framework based on it, considering the factors of scalability, efficiency and accuracy. With the framework, we also develop several algorithms that can be used to detect similarities in schemas. We build a prototype and conduct an experimental evaluation of the framework and the effectiveness of its algorithms.

1. INTRODUCTION

1.1. BACKGROUND

The schema of a database describes how its concepts, their relationships and constraints are structured. Schema matching, the process of identifying semantic correspondences between database schemas, is one of the vital and most challenging tasks in database integration (Halevy, Rajaraman & Ordille 2006).

In adapting to the rapid and continuous changes taking place over the years, many organizations have been developing and using various databases for their varied applications (Parent & Spaccapietra 1998), and this continues to be the case many years later (Ziegler & Dittrich 2004). This tendency to create diverse applications and databases stems from a vibrant business environment: structural changes to organizations, the opening of new branches at geographically dispersed locations and the exploitation of new business markets are some of the reasons. As their databases grow, organizations need to use them in a consolidated manner for research and analytical purposes. Data warehousing and data mining are the results of some of the large scale integrations of databases (Doan & Halevy 2005). Currently, many data sharing applications such as data warehousing, e-commerce and semantic web processing require matching schemas for this purpose. In such applications, a unified virtual database that comprises multiple databases needs to be created in order to seamlessly access the information available from those databases (Doan & Halevy 2005).

Database integration research has been ongoing since the late 1970s (Batini, Lenzerini & Navathe 1986; Doan & Halevy 2005), and the first such mechanism was attained in 1980 (Ziegler & Dittrich 2004). Since the early 1990s, database integration methods have been put to use commercially as well as in the research field (Doan & Halevy 2005; Halevy, Rajaraman & Ordille 2006).

1.2. Motivation

Schema matching has traditionally been done in a highly manual fashion (Do, H 2006; Do, H & Rahm 2007); but it is laborious, time consuming and highly susceptible to errors (Huimin & Sudha 2004; Doan & Halevy 2005; Po & Sorrentino 2010) and sometimes not even possible (Domshlak, Gal & Roitman 2007). One of the major issues in the schema matching process is the disparity in the structure and semantics of the databases involved. Often, database designers perceive the same concept differently from one another. Therefore, the same real world concept might have different representations, or two different concepts might be represented identically in the database (Batini, Lenzerini & Navathe 1986). Moreover, when building a database integration project, the information necessary for correctly mapping semantics might not be available.
The original designers, one of the major sources of such information, might not be available, or the documentation might not be sufficient. Consequently, the matching has to be performed from the limited evidence available in schema element names and instance values (Doan & Halevy 2005). Nonetheless, even the information available from the schema elements might not be enough to adequately identify matches (Doan & Halevy 2005). Usually, in order to find the right correspondences between schema elements, an exhaustive matching process has to be conducted in which every element is evaluated to ensure the best match among all the available elements; this is very costly and cumbersome (Doan & Halevy 2005). In addition to these obstacles, schema matching is largely domain specific. For example, an attribute pair declared as highly correlated in one domain might be completely unrelated in another domain (Doan & Halevy 2005). Hence, because of its highly cognitive nature, the likelihood of making the process fully automatic is still uncertain (Doan, Noy & Halevy 2004; Aumueller et al. 2005; Doan & Halevy 2005; Do, H & Rahm 2007; Po & Sorrentino 2010).

However, the demand for more efficient, scalable and autonomous schema matching systems is continuously rising (Doan, Noy & Halevy 2004; Po & Sorrentino 2010) due to the emergence of the semantic web, advancements in e-commerce and the increasing level of collaboration among organizations. Consequently, according to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), researchers have proposed several semi-automatic methods. Some adopt heuristic criteria (Doan & Halevy 2005), while others use machine learning and information retrieval techniques (Rahm & Bernstein 2001; Shvaiko & Euzenat 2005). However, no single method can be chosen as the best schema matching solution, due to the diversity of data sources (Bernstein et al. 2004; Domshlak, Gal & Roitman 2007). Therefore, a generic solution that can be easily customised for dealing with different types of applications is more useful (Bernstein et al. 2004).

1.3. Previous work

According to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan, Noy & Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), many approaches and techniques have been proposed in the past for resolving schema matching problems. Some methods adopt techniques from machine learning and information retrieval, while others use heuristic criteria in composite and hybrid techniques. For a better interpretation of the underlying semantics in the schemas, a combination of different types of information is used in many systems, and mainly two levels of database information are exploited in schema matching techniques (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007): the metadata available at schema level and the actual instance data. Schema information is used for measuring similarities at element level (e.g. attribute and table names) and at structural level (e.g. constraints such as data type, field length and value range); at instance level, patterns in data instances are analysed. Moreover, many algorithms use auxiliary information such as user feedback, dictionaries, thesauri and domain knowledge for dealing with various problems in schema matching (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007).
1.4. Research objectives and methodology

Investigations on schema matching

In this research, our focus is on relational databases only. This research investigates existing schema matching techniques and their architectures, and identifies what information can be exploited, how, and under what circumstances (e.g. what technique can be used when attribute names are ambiguous?).

Propose a schema matching framework

Based on the findings, we propose a schema matching framework for relational databases. This framework is based on the findings on how to best represent input schemas, output results and other interfaces in the architecture, considering scalability, accuracy and customizability. Taking into account the findings from this investigation, we also implement multiple matching algorithms that exploit different clues available from the database, and specifications of how to implement different types of algorithms on the system. We also establish how matching accuracy can be measured in a quantifiable manner.

Prototype and empirical evaluation

Based on this framework, we build a prototype and conduct an empirical evaluation to demonstrate the validity of the research. In our experiments, we demonstrate the accuracy of the matching algorithms and the scalability and customisability of the architecture. We also demonstrate the effects on matching accuracy when we use different combinations of algorithms. We use precision and recall measurements, techniques from information retrieval (IR), for measuring accuracy and reporting results.

2. SCHEMA MATCHING ARCHITECTURE

The schema matching architecture of this system is defined in three major components, similar to many other systems (Rahm & Bernstein 2001; Doan & Halevy 2005; Shvaiko & Euzenat 2005):

Input component: establishes acceptable formats and standards for the schemas. It also converts schemas into a format that can be manipulated by the schema matching software.

Matching component: performs the matching tasks. It consists of a set of elementary algorithmic functions, each of which is called a matcher. Each matcher utilizes some type of information (e.g. element names in name matchers) to perform a sub-function in the matching process. Matchers are implemented by executing them in series.

Output component: delivers the final mapping results.

This process is depicted in the diagram below (Figure 1: Schema matcher architecture).

Figure 1: Schema matcher architecture

2.1. Input component

Schemas of relational databases are often created in SQL DDL, which is a series of SQL "CREATE" statements. Therefore, information about schema elements such as table names, attribute names, data types, keys and other available information can be obtained from the schema, by the input module, in a standard format.

In this framework, the input component accepts database schemas in SQL DDL (Data Definition Language) format and parses them, similar to (Li & Clifton 2000), to extract schema information. The software accepts two schemas at a time, and the SQL DDL has to be in a text file with a ".sql" extension. Once the required information is extracted, the input component uses these data to build the appropriate Schema objects and hands them over to the Matching Component. These Schema objects are manipulated by the software to perform all the necessary operations.

Schemas from different database engines will vary in terms of data types, constraints and other vendor specific properties (Li & Clifton 2000). For example, in Table 1, the attribute income from Schema A has the data type SMALLMONEY and needs to be matched with an attribute in Schema B. As Schema B is from an Oracle database, it does not have a SMALLMONEY data type as in SQL Server.

            Database engine   Data type     Attribute name
Schema A    MS SQL Server     SMALLMONEY    Income
Schema B    Oracle            NUMBER        income

Table 1: Variations in data types across database vendors

Therefore, similar to the data type conversion process in (Li & Clifton 2000), we use conversion tables to map data types when the schemas are from different database types, so that schemas from different databases can also be compared. Currently, the system supports conversion between Oracle 11i and SQL Server 2008 schemas. The conversion tables can be modified to support additional relational database engines such as MySQL. This conversion table is given in APPENDIX 1.
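As an illustration, such a conversion table can be sketched in C# (the language of our prototype) as a simple dictionary lookup. This is a minimal sketch only: the class and method names are ours, and the type pairs shown are a small illustrative subset of the full mapping in APPENDIX 1.

using System;
using System.Collections.Generic;

class DataTypeConverter
{
    // Illustrative subset of the vendor-specific data type (VSDT) mapping;
    // the complete Oracle 11i <-> SQL Server 2008 table is in APPENDIX 1.
    static readonly Dictionary<string, string> SqlServerToOracle =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "SMALLMONEY", "NUMBER" },
            { "DATETIME",   "DATE"   },
            { "FLOAT",      "NUMBER" }
        };

    // Maps a SQL Server type name to its Oracle equivalent, or returns the
    // input unchanged when no mapping is defined for it.
    static string ToOracle(string sqlServerType) =>
        SqlServerToOracle.TryGetValue(sqlServerType, out var oracleType)
            ? oracleType
            : sqlServerType;

    static void Main()
    {
        Console.WriteLine(ToOracle("SMALLMONEY")); // prints NUMBER
    }
}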
Figure 2 shows a snippet of a schema that has been converted into standard SQL DDL format, ready to be used by the system for conducting schema matching operations.

…..
CREATE TABLE ADAAS001.CUSTOMERS (
    CUSTOMER_ID NUMBER NOT NULL,
    FIRST_NAME VARCHAR2(10) NOT NULL,
    LAST_NAME VARCHAR2(10) NOT NULL,
    DOB DATE,
    PHONE VARCHAR2(12),
    PRIMARY KEY (CUSTOMER_ID)
);
CREATE TABLE ADAAS001.EMPLOYEES (
    EMPLOYEE_ID NUMBER NOT NULL,
    MANAGER_ID NUMBER,
    FIRST_NAME VARCHAR2(10) NOT NULL,
    LAST_NAME VARCHAR2(10) NOT NULL,
    TITLE VARCHAR2(20),
    SALARY NUMBER(6),
    PRIMARY KEY (EMPLOYEE_ID)
);
…..

Figure 2: Sample snippet of a schema in DDL

2.2. Schema matching process

It is very likely that elements corresponding to the same concept in the real world will have similarities in databases in terms of structure and patterns in data instances (Li & Clifton 2000; Huimin & Sudha 2004; Halevy, Rajaraman & Ordille 2006). Therefore, according to (Rahm & Bernstein 2001; Do, H, Melnik & Rahm 2002; Doan, Noy & Halevy 2004; Doan & Halevy 2005; Shvaiko & Euzenat 2005; Do, H & Rahm 2007), by using a combination of different types of information, a better interpretation of the underlying semantics in the schemas can be achieved. At schema level, we exploit metadata available from the schemas for measuring similarities at element level (e.g. attribute and table names) and at structural level (e.g. constraints such as data type, field length and value range).

As in other composite matching systems (Doan, Domingos & Halevy 2001; Embley, Jackman & Xu 2001; Do, Hai & Rahm 2002), we perform the schema matching in stages, where a source schema SS is compared with a target schema ST. Once the Schema objects are constructed by the input component and received at the Matching Component, the matching operations begin. At the beginning of the schema matching process, a similarity matrix, M, is initialized, and at each stage a function, match, is executed to compute the similarity score, S, for that function. M is refined with the values of S at each stage. The sequence of match execution, as in (Giunchiglia & Yatskevich 2004), and the matchers' scoring weights are predefined. The algorithms we use are listed in Table 2; a sketch of this staged execution follows the table.

Sequence   Matcher name
1          matchPrefix
2          matchSuffix
3          matchNgrams
4          matchEditDistance
5          matchStructure
6          matchInstance

Table 2: Schema matching algorithms
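A minimal C# sketch of the staged execution is given below. The delegate type, the stub matchers and the matrix handling are illustrative assumptions rather than the prototype's actual API, and the predefined scoring weights are omitted for brevity.

using System;
using System.Collections.Generic;

class MatcherPipeline
{
    // A matcher scores one (source, target) attribute name pair in [0, 1].
    delegate double Matcher(string source, string target);

    static void Main()
    {
        string[] sourceAttrs = { "CUSTOMER_ID", "FIRST_NAME" };
        string[] targetAttrs = { "CUST_ID", "CUST_FIRST_NAME" };

        // Matchers run in the predefined sequence of Table 2 (stubs here).
        var matchers = new List<Matcher>
        {
            (a, b) => a == b ? 1.0 : 0.0,  // placeholder for matchPrefix
            (a, b) => a == b ? 1.0 : 0.0   // placeholder for matchSuffix, etc.
        };

        // M accumulates the per-pair scores as each stage refines it.
        var m = new double[sourceAttrs.Length, targetAttrs.Length];
        foreach (var match in matchers)
            for (int i = 0; i < sourceAttrs.Length; i++)
                for (int j = 0; j < targetAttrs.Length; j++)
                    m[i, j] += match(sourceAttrs[i], targetAttrs[j]);
    }
}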
2.2.1. Matching at element level

In the first stage of the schema matching process, the emphasis is on determining the syntactic similarity of attribute and table names by employing several name matching functions. In every match, each of the elements in the source schema SS is compared with those of the target schema ST to obtain the similarity score S for the match.

2.2.1.1. Pre-processing element names

In designing schemas, abbreviations or multiple words are mostly used for naming elements instead of single words (Huimin & Sudha 2004); as a result, element names that have the same meaning might differ syntactically (Madhavan, Bernstein & Rahm 2001). Prior to performing the string matching process, element names need to be preprocessed for a better understanding of the semantics in element names and for achieving higher accuracy in the subsequent processes.

Expansion of abbreviations and acronyms: First, we tokenize the element names, similar to (Monge & Elkan 1996; Do, Hai & Rahm 2002; Wu et al. 2004), by isolating them based on delimiters such as punctuation marks, spaces and substrings in camel-case names. Tokenization is conducted in order to uncover parts of a name that may go undetected if this possibility is not considered. For example, consider comparing the two strings hum-res and human-resource. Performing a prefix matching operation on them after tokenizing will discover that these two strings are the same. However, executing the same operation without tokenizing will not detect the similarity to the level it deserves. The tokenization process is depicted in Table 3; a sketch of this step follows the table.

Element name    Tokenized substrings    Isolation based on
finHRdistr      fin, HR, distr          Camel-case naming
Daily-income    Daily, income           Hyphen delimitation
In_patient      In, patient             Underscore delimitation

Table 3: Examples of string tokenization
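A minimal C# sketch of this tokenization step is shown below. It splits on hyphen and underscore delimiters and on case transitions, which covers the three cases in Table 3; the helper name is ours, not the prototype's.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Tokenizer
{
    // Splits an element name into tokens on '-' and '_' delimiters and on
    // camel-case boundaries, e.g. "finHRdistr" -> [fin, HR, distr].
    static List<string> Tokenize(string name)
    {
        var tokens = new List<string>();
        foreach (var part in name.Split('-', '_'))
        {
            if (part.Length == 0) continue;
            // Break before a lower-to-upper transition, and between an
            // uppercase run and a following lowercase run (acronym boundary).
            foreach (var token in Regex.Split(part,
                     "(?<=[a-z])(?=[A-Z])|(?<=[A-Z]{2})(?=[a-z])"))
                if (token.Length > 0) tokens.Add(token);
        }
        return tokens;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Tokenize("finHRdistr"))); // fin, HR, distr
        Console.WriteLine(string.Join(", ", Tokenize("In_patient"))); // In, patient
    }
}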
2.2.1.2. String matching

Once pre-processing has been done, string matching operations are performed consecutively. A string matching algorithm, a matcher, is used to calculate the similarity score in each matching process. Matching suffixes and prefixes are two forms of string matching done in many systems, as in (Madhavan, Bernstein & Rahm 2001; Do, Hai & Rahm 2002; Melnik, Garcia-Molina & Rahm 2002; Giunchiglia & Yatskevich 2004; Shvaiko & Euzenat 2005).

2.2.1.2.1. Prefix matching

First, prefix matching is done between the element names of the schemas to identify whether the longer string starts with the shorter string. This type of matching is especially useful in detecting abbreviations (Shvaiko & Euzenat 2005). It gives a score in the range [0, 1], where 0 means no similarity and 1 an exact match. When comparing two name strings, every token in one string is compared with every token in the other string. The match score for both prefix matching and suffix matching is calculated using this formula:

S = k / ((x + y) / 2)

where k is the number of matching substrings, x is the number of substrings in element name A from SS, and y is the number of substrings in element name B from ST.

In both prefix and suffix matching, we use a first-come, first-served tie breaking strategy similar to (Langville & Meyer 2005). The rationale for this strategy is further explained in section 2.4.4 Tie-breaking.

The prefix matching process is depicted in the following example. Consider two strings being compared, inst_tech_demonstrator and institute-technology-demo. After tokenization, the two strings become inst_tech_demonstrator (inst, tech, demonstrator) and institute-technology-demo (institute, technology, demo). By performing a prefix matching operation on every token with each other, the number of matching substrings, k, is obtained. Table 4 shows an example of prefix matching.

                Element A tokens
Element B       inst   tech   demonstrator
technology      0      1      0
institute       1      0      0
demo            0      0      1

Table 4: Prefix matching example 1

In this case, k = 3, and the similarity S between A and B = 3 / ((3 + 3) / 2) = 1. A score of 1 indicates that these two fields match exactly when prefix matching is done.

Sometimes, prefix matching is not very accurate in determining similarity. It might give a high score even though the strings are not related:

                Element A tokens
Element B       tea    mat
maths           0      1
teacher         1      0

Table 5: Prefix matching example 2

This shows a perfect match although the two names have entirely different meanings. Therefore, to reduce the effect of such false positives, we use multiple string matching functions.

2.2.1.2.2. Suffix matching

Suffix matching identifies whether one string ends with another. This type of matching is especially useful in detecting words that are related in meaning though not exactly the same (Shvaiko & Euzenat 2005). For example, 'saw' can be matched with handsaw, hacksaw and jigsaw; 'ball' can be matched with volleyball, baseball and football; and nut with peanut, chestnut and walnut. String matching calculations are done the same way as in prefix matching. Consider the example below:

                Element A tokens
Element B       human   resource   type
human           1       0          0
resource        0       1          0
fund            0       0          0

Table 6: Suffix matching example

Similarity S between human-resource-type and human-resource-fund = 2 / ((3 + 3) / 2) = 0.67.

Nevertheless, suffix matching does not always guarantee accurate results. It is not that effective in matching some words (e.g. car and Madagascar, rent and current). Therefore, we use additional string similarity matching functions, described in the next sections, to reduce such effects. A sketch of the token-level prefix and suffix comparison is given below.
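The token-level comparison for both matchers can be sketched in C# as follows. This is a minimal sketch under one reading of the algorithm in section 2.3.1: each target token is consumed at most once, first-come, first-served, as described above. The class and method names are illustrative; suffix matching is identical except that EndsWith replaces StartsWith.

using System;
using System.Collections.Generic;

class AffixMatcher
{
    // Prefix-matches two token lists: a token pair matches when the longer
    // token starts with the shorter one. Score: S = k / ((|P| + |Q|) / 2).
    static double PrefixSimilarity(List<string> p, List<string> q)
    {
        var used = new HashSet<int>(); // target tokens already matched
        int k = 0;
        foreach (var pm in p)
            for (int j = 0; j < q.Count; j++)
            {
                if (used.Contains(j)) continue;
                string longer  = pm.Length >= q[j].Length ? pm : q[j];
                string shorter = pm.Length >= q[j].Length ? q[j] : pm;
                if (longer.StartsWith(shorter, StringComparison.OrdinalIgnoreCase))
                {
                    k++;
                    used.Add(j);
                    break; // first match wins for this source token
                }
            }
        return k / ((p.Count + q.Count) / 2.0);
    }

    static void Main()
    {
        var a = new List<string> { "inst", "tech", "demonstrator" };
        var b = new List<string> { "institute", "technology", "demo" };
        Console.WriteLine(PrefixSimilarity(a, b)); // 1, as in Table 4
    }
}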
2.2.1.3. String similarity matching

2.2.1.3.1. Edit distance (Levenshtein distance)

Similar to (Do, Hai & Rahm 2002; Chua, CEH, Chiang, RHL & Lim, E-P 2003; Cohen, Ravikumar & Fienberg 2003; Giunchiglia & Yatskevich 2004), we use a form of edit distance, Levenshtein distance, to calculate the similarity between name strings. This determines how similar the characters of the two strings are. In Levenshtein distance, the number of operations (character substitutions, insertions and deletions) required to transform one string into another is counted, assigning a value of 1 to each operation performed. This value is the Levenshtein distance, k, a measure of error or dissimilarity between the two strings (Navarro 2001); the shorter the distance, the higher the similarity (Navarro 2001; Giunchiglia & Yatskevich 2004). As we need to compute the similarity, not the dissimilarity, the similarity S is represented as a value between 0 and 1, as in (Chua, CEH, Chiang, RHL & Lim, E-P 2003; Giunchiglia & Yatskevich 2004), by excluding the error ratio. This is done by the following equation (Navarro 2001; Chua, CEH, Chiang, RHL & Lim, E-P 2003; Giunchiglia & Yatskevich 2004). Let A and B be the two strings being compared; then

S = 1 - k / max(length(A), length(B))

where S is the similarity value between the two strings and k is the Levenshtein distance. Hence, for an identical match, the edit distance is k = 0 and the similarity score S will be 1. For example, the edit distance similarity between infoSc and informationScience is S = 1 - 12/18 = 0.333.

2.2.1.3.2. N-gram

With Levenshtein distance, the reflected similarity value might not be very accurate for some types of strings. Therefore, in the next step of string similarity matching, we use n-gram matching as in (Do, Hai & Rahm 2002; Giunchiglia, Shvaiko & Yatskevich 2004). In this technique, the number of common n-grams, n, between the two strings is counted, and a similarity score, S, is given by the following equation. Let A and B be the two strings being compared; then

S = n / max(ngrams(A), ngrams(B))

For comparing some forms of strings, n-gram performs better than edit distance. For example, consider the two strings 'Contest' and 'Context':

ngrams(Contest) = Con, ont, nte, tes, est = 5
ngrams(Context) = Con, ont, nte, tex, ext = 5

S = 3/5 = 0.6

With n-gram, the similarity score is 0.6; but if edit distance is applied to these two strings, it gives a similarity score of 0.86. Sketches of both measures follow.
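Both measures can be sketched compactly in C#. The Levenshtein routine below is the standard dynamic-programming formulation, normalized as in the formula above; the n-gram routine counts common n-grams one-to-one. The class and method names are ours.

using System;

class StringSimilarity
{
    // Levenshtein distance via dynamic programming, normalized as
    // S = 1 - k / max(|a|, |b|).
    static double EditSimilarity(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return 1.0 - (double)d[a.Length, b.Length] / Math.Max(a.Length, b.Length);
    }

    // Counts common n-grams (each used at most once):
    // S = n / max(ngrams(a), ngrams(b)).
    static double NgramSimilarity(string a, string b, int g)
    {
        int common = 0;
        var used = new bool[b.Length];
        for (int i = 0; i + g <= a.Length; i++)
            for (int j = 0; j + g <= b.Length; j++)
                if (!used[j] && a.Substring(i, g) == b.Substring(j, g))
                {
                    common++; used[j] = true; break;
                }
        int max = Math.Max(a.Length - g + 1, b.Length - g + 1);
        return (double)common / max;
    }

    static void Main()
    {
        Console.WriteLine(NgramSimilarity("Contest", "Context", 3)); // 0.6
        Console.WriteLine(EditSimilarity("Contest", "Context"));     // ~0.857
    }
}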
2.2.1.4. Identifying structural similarities

Schema elements that correspond to the same entities in the real world are likely to have similar structural properties (Li & Clifton 2000); therefore, structural properties are believed to carry some evidence for discovering the conceptual meanings embedded within schemas. Similar to (Li & Clifton 2000; Huimin & Sudha 2004), the structural data (metadata) utilized in this framework are given in Table 7.

#    Metadata       Details
1    Data type      Method in section 2.2.1.4.1
2    Field length   1 if same length
3    Range          1 if same range
4    Primary key    1 if both primary key
5    Foreign key    1 if both foreign key
6    Unique         1 if both unique key
7    Nullable       1 if both null allowed / not allowed
8    Default        1 if both same default
9    Precision      1 if both same precision
10   Scale          1 if both same scale

Table 7: Structural metadata for structural comparison

In this stage, elements in schema SS are compared with those in schema ST for the structural similarities listed in Table 7. The match function, matchStructure, checks the structural properties in the order listed in Table 7 and gives the score S in an m x n matrix, as in the example below (Table 8).

Attributes             n1      n2
m1   Data type         0.65    0.8
     Field length      0       1
     Range             1       0
     Primary key       0       0
     Foreign key       0       0
     Unique            1       0
     Nullable          1       0
     Default           0       1
     Precision         1       1
     Scale             0       1
m2   Data type         0.85    0.7
     Field length      1       0
     Range             1       0
     …                 …       …

Table 8: Structural matching example

For each type of metadata except the data type, a similarity value of 1 is given if that property is common to both fields, and 0 if they do not match.

2.2.1.4.1. Data type constraints

For comparing similarities between data types, we construct a data type synonyms table and use it for data type comparisons, as done in other similar research (Li & Clifton 2000; Do, Hai & Rahm 2002; Thang & Nam 2008; Karasneh et al. 2009; Tekli, Chbeir & Yetongnon 2009). Since there are variations in data types across different database systems (Li & Clifton 2000), recognizing similarities in data types is not very straightforward. Therefore, based on (Oracle 2008) and (Microsoft 2008), we first construct a Vendor Specific Data Types (VSDT) table that consists of the data type mappings for Oracle and SQL Server. This table is in APPENDIX 1. Based on the VSDT table, data types that have a high level of similarity can be detected. For example, Oracle has a data type called NUMBER, but SQL Server does not; from the VSDT table it can be derived that NUMBER in Oracle is equivalent to FLOAT in SQL Server. Therefore, the two data types can be mapped as a match with maximum similarity. Similarly, SQL Server has a DATETIME data type but Oracle does not; it has a DATE type instead. In such a case, these two data types are mapped and given the maximum data type similarity score of 1.

There will often be cases where two data types are not the same but possess some level of similarity. For example, integer and float are not the same, but they do have some similarity as both types are numbers (Li & Clifton 2000); likewise, char and varchar are a similar case. In view of this situation, in order to give consideration to data types that have some level of similarity, we further categorise all the data types into a more generic data type classification, similar to (Li & Clifton 2000). We make this classification based on (Li & Clifton 2000; Oracle 2003; Microsoft 2007) and give a fixed similarity value of 0.5 for such cases, and a score of 0 if there is no match. We call this table the Generic Data Type Table (GDTT); it is given in APPENDIX 2. As an example, consider Table 9. The attribute quantity in SchemaA has data type integer, and the attribute amount in SchemaB has data type float. As integer maps to float in the GDTT, the pair gets a similarity value of 0.5.

           Attribute   Data type   Data type similarity
SchemaA    quantity    integer
SchemaB    amount      float       0.5

Table 9: Data type matching example

2.2.1.4.2. Calculating final structural similarity score

As described in sections 2.2.1.4 and 2.2.1.4.1, we obtain the property comparison values for all the attributes and calculate their average to get a final structural similarity score for every attribute pair. That is, we add the property comparison values and divide by 10, as we are utilizing 10 different metadata properties. The formula is:

S = k / N

where S is the structural similarity score, k is the total property comparison score, and N is the total number of properties considered. A sketch of this computation follows.
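The property-by-property comparison can be sketched in C# (version 9 or later) as below. This is a minimal sketch assuming only four of the ten properties of Table 7, with the 0.5 generic-type score taken from the GDTT discussion above; the record shape, field names and the GenericTypeMatch stand-in are illustrative.

using System;

class StructureMatcher
{
    // Illustrative subset of the attribute metadata of Table 7.
    record Field(string DataType, int Length, bool PrimaryKey, bool Nullable);

    // Averages the property comparison scores: S = k / N (here N = 4
    // instead of the framework's 10).
    static double StructuralSimilarity(Field a, Field b)
    {
        double k = 0;
        // Data type: 1 for an exact/VSDT match, 0.5 for a GDTT (generic) match.
        k += a.DataType == b.DataType ? 1.0 : GenericTypeMatch(a, b) ? 0.5 : 0.0;
        if (a.Length == b.Length) k += 1;          // same field length
        if (a.PrimaryKey && b.PrimaryKey) k += 1;  // 1 only if both are PKs
        if (a.Nullable == b.Nullable) k += 1;      // same null policy
        return k / 4.0;
    }

    // Stand-in for the GDTT lookup: here, both types being numeric.
    static bool GenericTypeMatch(Field a, Field b) =>
        IsNumeric(a.DataType) && IsNumeric(b.DataType);

    static bool IsNumeric(string t) => t is "INTEGER" or "FLOAT" or "NUMBER";

    static void Main()
    {
        var quantity = new Field("INTEGER", 4, false, true);
        var amount   = new Field("FLOAT",   8, false, true);
        Console.WriteLine(StructuralSimilarity(quantity, amount)); // 0.375
    }
}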
2.2.2. Matching at instance level

Instance values of the databases can also be used as an additional source for discovering relationships between databases (Huimin & Sudha 2004). It is possible that the same data might be represented differently in different databases (Huimin & Sudha 2004). For example, 'morning' might be represented as 'am' in one database, while another could represent it as 'm' or '1'. Although this issue exists, information from the analysis of actual data values is often complementary to schema level matching (Do, Hai & Rahm 2002; Chua, C, Chiang, R & Lim, E 2003) and can be valuable especially in circumstances where the available schema information is limited (Huimin & Sudha 2004). Similar to (Li & Clifton 2000; Huimin & Sudha 2004), we use several statistical features of the instance values for assessing similarities between database fields. The features we utilize in this framework are given in Table 10.

#   Properties
1   Mean length of the values
2   Standard deviation of the length of the values
3   Mean ratio of the number of numeric characters
4   Mean ratio of the number of non-alphanumeric characters
5   Mean ratio of the number of distinct values to total tuples in the table
6   Mean ratio of blank values to total tuples in the table

Table 10: Properties utilized for instance matching

To show how the statistical analysis is performed on a table of instances, Table 11 gives some sample data from a book store database, and Table 12 shows how the calculations are performed.

ISBN         Author             Title                                                  Ref_no
0062039741   Justin Bieber      First Step 2 Forever                                   XA345
0061997811   Brian Sibley       Harry Potter Film Wizardry                             NAHKW1
1423113381   Rick Riordan       The Red Pyramid
1423101472   Mary-Jane Knight   Percy Jackson and the Olympians: The Ultimate Guide
1617804061   Justin Bieber      My World: Easy Piano                                   F9876001

Table 11: Sample dataset for Schema1

                        ISBN                       Author                                        Title                                        Ref_no
1 Mean(Length)          (10+10+10+10+10)/5 = 10    (13+12+12+16+13)/5 = 13.2                     (20+26+15+51+20)/5 = 26.4                    (5+6+0+0+9)/5 = 4
  normalised            0                          (13.2-12)/(16-12) = 0.3                       (26.4-15)/(51-15) = 0.317                    (6.667-5)/(9-5) = 0.444
2 StdDev(Length)        0                          0.411                                         0.397                                        0.437
3 Mean(Numeric)         (5 x 10/10)/5 = 1          0                                             (1/20 + 0 + 0 + 0 + 0)/5 = 0.01              (3/5 + 1/6 + 0 + 0 + 8/9)/5 = 0.331
4 Mean(Non-alphanum.)   0                          (0 + 0 + 0 + 1/16 + 0)/5 = 0.013              (0 + 0 + 0 + 1/51 + 1/20)/5 = 0.014          0
5 Mean(Distinct)        5/5 = 1                    4/5 = 0.8                                     5/5 = 1                                      4/5 = 0.8
6 Mean(Blanks)          0                          (1/13 + 1/12 + 1/12 + 1/16 + 1/13)/5 = 0.077  (3/20 + 3/26 + 2/15 + 5/51 + 3/20)/5 = 0.129  0

Table 12: Examples showing how the statistical calculations are performed

The standard deviations in row 2 are computed over the normalised lengths; for Author, for example, Var = ((0.25-0.3)^2 + (0-0.3)^2 + (0-0.3)^2 + (1-0.3)^2 + (0.25-0.3)^2) / (5-1) = 0.16875 and StdDev = sqrt(0.16875) = 0.411.

Although each of these properties contributes to the semantics at different levels, establishing the degree of relevance is not a straightforward task (Huimin & Sudha 2004), as the original dimensions have different units of measurement. Therefore, we normalize the values if they do not fall within the range [0, 1] (Li & Clifton 2000; Huimin & Sudha 2004). Of the 6 measurements considered in this framework, only the first two need to be normalized in this manner; the other measurements, being ratios, always fall in this range.

2.2.2.1. Computing the instance similarity

Similar to the works in (Chen 1995; Jain et al. 2002; Yeung & Tsang 2002; Kaur 2010), we calculate the similarity between two fields based on the average Manhattan distance, using this formula:

S = 1 - (1/n) * SUM(i = 1 to n) |xi - yi|

where xi is the i-th statistical measure of a field in Schema1, yi is the corresponding measure of a field in Schema2, and n is the number of statistical properties (dimensions) considered. As this framework utilizes 6 properties, n = 6 by default.

After obtaining the matrices of statistical values for each of the fields in both schemas, we compute the instance similarities between them. For example, Table A in Schema1 has the statistical values (from Table 12):

ISBN   {0, 0, 1, 0, 1, 0}
Author {0.3, 0.411, 0, 0.013, 0.8, 0.077}
Title  {0.317, 0.397, 0.01, 0.014, 1, 0.129}
Ref_no {0.444, 0.437, 0.331, 0, 0.8, 0}

Suppose that Table B in Schema2 has similar attributes with the statistical values:

ISBN    {0, 0, 1, 0, 1, 0}
Author  {0.4, 0.45, 0.1, 0.023, 0.7, 0.06}
Title   {0.3, 0.29, 0.04, 0.02, 1, 0.14}
Code_no {0.439, 0.433, 0.337, 0, 0.6, 0}

We calculate the similarities from the formula:

S(ISBN) = 1 - (|0-0| + |0-0| + |1-1| + |0-0| + |1-1| + |0-0|) / 6 = 1

S(Author) = 1 - (|0.3-0.4| + |0.411-0.45| + |0-0.1| + |0.013-0.023| + |0.8-0.7| + |0.077-0.06|) / 6 = 1 - 0.366/6 = 1 - 0.061 = 0.939

From the above two calculations, TableA.ISBN and TableB.ISBN have a similarity value of 1, indicating the highest possible similarity, while TableA.Author and TableB.Author show a similarity of 0.939. A sketch of this computation is given below.
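The computation over the six-dimensional feature vectors can be sketched in C# as follows; the vectors are the ones worked out above, and the class and method names are illustrative.

using System;
using System.Linq;

class InstanceMatcher
{
    // Average Manhattan distance turned into a similarity:
    // S = 1 - (1/n) * sum_i |x_i - y_i|.
    static double InstanceSimilarity(double[] x, double[] y)
    {
        double sum = x.Zip(y, (xi, yi) => Math.Abs(xi - yi)).Sum();
        return 1.0 - sum / x.Length;
    }

    static void Main()
    {
        // Feature vectors of TableA.Author and TableB.Author from the example.
        var authorA = new[] { 0.3, 0.411, 0.0, 0.013, 0.8, 0.077 };
        var authorB = new[] { 0.4, 0.45, 0.1, 0.023, 0.7, 0.06 };
        Console.WriteLine(InstanceSimilarity(authorA, authorB)); // 0.939
    }
}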
2.3. Schema matching algorithms

2.3.1. Algorithm 1: Find name similarity of schema elements

Input:
Set of attribute names in schema SS: X = {Ua.A1, Ua.A2, … Ua.Ai, … Ua.An}
  where Ua is a table in schema SS and Ai is an attribute of table Ua
Set of attribute names in schema ST: Y = {Vb.B1, Vb.B2, … Vb.Bn}
  where Vb is a table in schema ST and Bi is an attribute of table Vb
Output:
Table Similarity Matrix: a set of table similarity pairs of the form SS.Ua : ST.Vb, STB
  where SS is the source schema, Ua a source table, ST the target schema, Vb a target table
  and STB the similarity between the two tables

begin
  call matchPrefix(X, Y)        // X and Y are the attribute sets of SS and ST
  call matchSuffix(X, Y)
  call matchNgram(X, Y, ngram)  // ngram is the n-gram size
  call getTableMatchResult()
end

function matchPrefix(X, Y) {
  foreach Ai in X do {      // compare every attribute in SS with every attribute in ST
    foreach Bj in Y do {
      if Ai = Bj then S = 1
      else {                // if the two names are not identical, tokenize them
        P = call tokenize(Ai)   // P and Q are the token sets of Ai and Bj
        Q = call tokenize(Bj)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach Pm in P do {    // compare every token of Ai with every token of Bj
          foreach Qn in Q do {
            if Qn not in matchFoundList then {   // each Qn may match only once
              // if the longer token starts with the shorter, a token match is found
              if max[length(Pm), length(Qn)] starts with min[length(Pm), length(Qn)] then {
                k = k + 1
                add Qn to matchFoundList
              }
            }
          }
        }
        S(Ai, Bj) = k / ((|P| + |Q|) / 2)   // prefix similarity between Ai and Bj
      }
      // UpdateAMSM records the score in the Attribute Matcher Similarity Matrix
      call UpdateAMSM(matchPrefix, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM   // the Attribute Matcher Similarity Matrix
}

// tokenizes a string X
function tokenize(X) {
begin
  initialize C       // C is the set of tokens of X
  d = {-, _}         // d is the set of accepted delimiters
  foreach di in d do
    add split(X, di) to C
  return C
end
}

function matchSuffix(X, Y) {
  foreach Ai in X do {      // compare every attribute in SS with every attribute in ST
    foreach Bj in Y do {
      if Ai = Bj then S = 1
      else {                // if the two names are not identical, tokenize them
        P = call tokenize(Ai)   // P and Q are the token sets of Ai and Bj
        Q = call tokenize(Bj)
        k = 0                   // initialize the match counter
        initialize new matchFoundList
        foreach Pm in P do {    // compare every token of Ai with every token of Bj
          foreach Qn in Q do {
            if Qn not in matchFoundList then {   // each Qn may match only once
              // if the longer token ends with the shorter, a token match is found
              if max[length(Pm), length(Qn)] ends with min[length(Pm), length(Qn)] then {
                k = k + 1
                add Qn to matchFoundList
              }
            }
          }
        }
        S(Ai, Bj) = k / ((|P| + |Q|) / 2)   // suffix similarity between Ai and Bj
      }
      // UpdateAMSM records the score in the Attribute Matcher Similarity Matrix
      call UpdateAMSM(matchSuffix, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM   // the Attribute Matcher Similarity Matrix
}

function matchNgram(X, Y, ngram) {
  foreach Ai in X do {      // compare every attribute in SS with every attribute in ST
    foreach Bj in Y do {
      if Ai = Bj then S = 1
      else {
        P = call getNgrams(Ai, ngram)   // P and Q are the n-gram sets of Ai and Bj
        Q = call getNgrams(Bj, ngram)
        k = 0                           // initialize the match counter
        initialize new matchFoundList   // n-grams that have already been matched
        foreach Pm in P do {            // compare every n-gram of Ai with every n-gram of Bj
          foreach Qn in Q do {
            if Qn not in matchFoundList then {
              if Pm = Qn then {         // a matching n-gram is found
                add Qn to matchFoundList
                k = k + 1
              }
            }
          }
        }
        S(Ai, Bj) = k / max(|P|, |Q|)   // n-gram similarity between Ai and Bj
      }
      call UpdateAMSM(matchNgram, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM
}

// returns the set of n-grams of a string A; g is the n-gram size
function getNgrams(A, g) {
begin
  initialize C              // C is the set of n-grams obtained
  i = 1
  r = length(A) - g + 1     // position of the last n-gram
  while i <= r do {         // collect n-grams until all possible n-grams are obtained
    add getSubstring(A, i, g) to C
    i = i + 1
  }
  return C
end
}

// records one similarity entry in the Attribute Matcher Similarity Matrix;
// matcher is the matcher name, SS.Ua.Ai and ST.Vb.Bj are the attributes,
// S is the similarity between them
function UpdateAMSM(matcher, SS.Ua.Ai, ST.Vb.Bj, S) {
  // AttributeMatcherSimilarity is an object holding one similarity entry
  initialize new AttributeMatcherSimilarity(matcher, SS.Ua.Ai, ST.Vb.Bj, S)
  // AMSM holds the similarity information in a list
  add AttributeMatcherSimilarity to AMSM
}

function getAttributeMatchResult() {
  foreach x in AMSM do {
    Ai = x.getSourceAttr()   // source attribute as schema.table.attribute
    Bj = x.getTargetAttr()   // target attribute as schema.table.attribute
    Sx = x.getScore()
    // AttributeMatchResult holds the aggregate score for an attribute pair
    initialize new AttributeMatchResult(Ai, Bj, Sx)
    if AttributeMatchResult not in ASM then
      add AttributeMatchResult to ASM          // first score for this pair
    else
      add AttributeMatchResult.score() to ASM.score(Ai, Bj)   // accumulate
  }
  // compute the average score for each pair in the Attribute Similarity Matrix
  foreach y in ASM do {
    Saggregate = ASM.score(Ai, Bj)             // total score for the pair
    Saverage = Saggregate / ASM.countMatchers  // average over the matchers
    updateASMscore(SS.Ua.Ai, ST.Vb.Bj, Saverage)
  }
  return ASM
}

function getTableMatchResult() {
  foreach w in ASM do {
    Pm = w.getSourceTable()   // source table as schema.table
    Qn = w.getTargetTable()   // target table as schema.table
    Sr = w.getScore()
    // TableMatchResult holds the aggregate score for a table pair
    initialize new TableMatchResult(Pm, Qn, Sr)
    if TableMatchResult not in TSM then        // TSM is the Table Similarity Matrix
      add TableMatchResult to TSM
    else
      add TableMatchResult.score() to TSM.score(Pm, Qn)
  }
  // compute the average score for each table pair in TSM
  foreach z in TSM do {
    Saggregate = TSM.score(Pm, Qn)
    Saverage = Saggregate / TSM.possibleMaximumScore(Pm, Qn)
    updateTSMscore(SS.Ua, ST.Vb, Saverage)
  }
  return TSM
}

function getSchemaMatchResult() {
  foreach x in TSM do {
    Pm = x.getSourceSchema()
    Qn = x.getTargetSchema()
    Sr = x.getScore()
    // SchemaMatchResult holds the aggregate score for a schema pair
    initialize new SchemaMatchResult(Pm, Qn, Sr)
    if SchemaMatchResult not in SSV then       // SSV is the Schema Similarity Value object
      add SchemaMatchResult to SSV
    else
      add SchemaMatchResult.score() to SSV.score(Pm, Qn)
  }
  // compute the average score for the schema pair
  Saverage = SSV.getScore() / SSV.possibleMaximumScore(Pm, Qn)
  updateSSVscore(SS, ST, Saverage)
  return SSV   // the Schema Similarity Value
}

2.3.2. Algorithm 2: Find structural similarities of schemas

Input:
// Structural properties of attributes; each attribute is passed as
// Schema.Table.Attribute followed by its structural properties:
// DT data type, FL field length, R range, PK primary key, FK foreign key,
// UK unique key, NU nullable, DE default value, PR precision, SC scale
Set of attributes in schema SS:
X = {Ua.A1, DT1, FL1, R1, PK1, FK1, UK1, NU1, DE1, PR1, SC1;
     Ua.A2, DT2, FL2, R2, PK2, FK2, UK2, NU2, DE2, PR2, SC2;
     … Ua.An, DTn, FLn, Rn, PKn, FKn, UKn, NUn, DEn, PRn, SCn}
Set of attributes in schema ST:
Y = {Vb.B1, DT1, FL1, R1, PK1, FK1, UK1, NU1, DE1, PR1, SC1;
     Vb.B2, DT2, FL2, R2, PK2, FK2, UK2, NU2, DE2, PR2, SC2;
     … Vb.Bn, DTn, FLn, Rn, PKn, FKn, UKn, NUn, DEn, PRn, SCn}
Output:
Set of table similarity pairs of the form SS.Ua : ST.Vb, STB
// SS is the source schema, Ua a source table, ST the target schema, Vb a target table

begin
  call matchStructure(X, Y)
  call getMatchResult()
end

// calculates the structural similarity between two schemas
function matchStructure(X, Y) {
begin
  if SS and ST are from the same type of database server then
    data type reference table = same-DB table
  else
    data type reference table = conversion (VSDT) table
  foreach Ai in X do {   // compare each source attribute with each target attribute
    foreach Bj in Y do {
      k = 0
      if DT of Ai = DT of Bj then k = k + 1    // data type
      if FL of Ai = FL of Bj then k = k + 1    // field length
      if R of Ai = R of Bj then k = k + 1      // range
      if Ai is PK and Bj is PK then k = k + 1  // primary key
      if Ai is FK and Bj is FK then k = k + 1  // foreign key
      if Ai is UK and Bj is UK then k = k + 1  // unique key
      if NU of Ai = NU of Bj then k = k + 1    // nullable
      if DE of Ai = DE of Bj then k = k + 1    // default value
      if PR of Ai = PR of Bj then k = k + 1    // precision
      if SC of Ai = SC of Bj then k = k + 1    // scale
      // maxPossibleSimilarity is the count of structural properties (here 10)
      S(Ai, Bj) = k / maxPossibleSimilarity()
      call UpdateAMSM(structure, SS.Ua.Ai, ST.Vb.Bj, S)
    }
  }
  return AMSM   // the Attribute Matcher Similarity Matrix
end
}

function matchInstance(X, Y) {
  foreach Ai in X do {   // compare the instance statistics of every field pair
    foreach Bj in Y do {
      call getMeanLength(Bj)   // likewise for the other statistics of Table 10
    }
  }
}

// T is the set of instances of an attribute whose mean length is to be calculated
function getMeanLength(T) {
  foreach ti in T do
    add length(ti) to totalLength
  meanLength = totalLength / |T|
  return meanLength
}

2.4. Similarity computation

2.4.1. Computation at matcher level

Match operations are performed on every element of schema SS = {SS1, SS2, … SSm} with every element of ST = {ST1, ST2, … STn}, and a similarity value is computed for each operation on a table by table basis, as in the example below.

Element names                Prefix   Suffix   Edit distance   n-gram   Structural   Instance
teaMat vs mathsTeacher       1        0        0.25            0.2      0.5          0.04
teaMat vs diplomat           0        0.667    0.375           0.167    0.3          0.2
…                            …        …        …               …        …            …

Table 13: Matcher level similarity calculation example

2.4.2. Computation at attribute level

To obtain a combined matching result, the average score is computed for each pair of elements, similar to (Do, Hai & Rahm 2002; Bozovic & Vassalos 2008). For example, after the above match operations, the combined similarity values are computed as in Table 14.

Schema SS, Table x    Schema ST, Table y
                      mathsTeacher   diplomat
teaMat                0.3625         0.3022
…                     …              …

Table 14: Combined similarity calculation at attribute level

Consequently, the combined final result of the matching operations is given in a similarity matrix, M, with all the elements m of SS and n of ST in an m x n matrix, as in the example below.

Elements in Table x, SS (rows m1 … m4) vs. elements in Table y, ST (columns n1 … n3)

Figure 3: Combined final similarity matrix

The highest score in each row of the matrix indicates the element in ST that has the highest similarity to the corresponding element in SS. A sketch of this attribute-level combination follows.
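As a minimal sketch, the per-matcher scores for one pair are combined by simple averaging. Note that the prototype also applies the predefined matcher weights mentioned in section 2.2, which are omitted here, so this plain average need not reproduce Table 14 exactly; the class and values are illustrative.

using System;
using System.Linq;

class ScoreCombiner
{
    static void Main()
    {
        // Per-matcher scores for the pair (teaMat, mathsTeacher) from Table 13:
        // prefix, suffix, edit distance, n-gram, structural, instance.
        double[] scores = { 1, 0, 0.25, 0.2, 0.5, 0.04 };

        // Combined attribute-level similarity as an unweighted average.
        Console.WriteLine(scores.Average()); // 0.332
    }
}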
2.4.3. Computing similarity at table level

After computing the similarities of the attributes in SS and ST, we compute the combined similarity between tables. This is done by taking the ratio of the sum of all the similarity values between table x in schema SS and table y in schema ST to the maximum possible similarity, similar to the computations in (Do, Hai & Rahm 2002). Table similarity is computed from the formula:

Table similarity = (sum of similarities between x and y) / (combined maximum similarity of x and y)

Example 1

Schema SS, Table x1 vs. Schema ST, Table y1
        n1     n2
m1      0.25   0.3
m2      1      0.5
m3      0.7    1

Table 15: Table similarity calculation example 1

Table similarity = (0.25 + 0.3 + 1 + 0.5 + 0.7 + 1) / 6 = 0.625

Example 2

Schema SS, Table x1 vs. Schema ST, Table y2
        n1     n2     n3
m1      0.4    0.5    0.8
m2      1      0.7    0.8
m3      0.9    1      1

Table 16: Table similarity calculation example 2

Table similarity = (0.4 + 0.5 + 0.8 + 1 + 0.7 + 0.8 + 0.9 + 1 + 1) / 9 = 0.789

From the above two examples, Table x1 has a higher similarity to Table y2 than to Table y1. The end result of matching table similarity between two schemas is a matrix that gives similarity values between all the tables of both schemas, as shown below; a sketch of the computation follows the matrix.

Tables in schema SS vs. tables in schema ST
            Table y1   Table y2   Table y3
Table x1    0.12       0.9        0.61
Table x2    0.6        0.74       0.58
Table x3    1          0.27       0.2
Table x4    0.5        0.91       0.1
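A minimal C# sketch of the table-level ratio over an attribute similarity matrix; the class and method names are ours.

using System;

class TableSimilarity
{
    // Table similarity = sum of attribute pair similarities / maximum
    // possible similarity (1.0 per cell of the m x n matrix).
    static double Compute(double[,] sims)
    {
        double sum = 0;
        foreach (double s in sims) sum += s;
        return sum / sims.Length;
    }

    static void Main()
    {
        // Attribute similarities between Table x1 and Table y1 (Table 15).
        var x1y1 = new double[,] { { 0.25, 0.3 }, { 1, 0.5 }, { 0.7, 1 } };
        Console.WriteLine(Compute(x1y1)); // 0.625
    }
}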
3. EMPIRICAL EVALUATION
In this section, we evaluate the accuracy of the matching algorithms and the scalability and efficiency of the framework.

3.1. Experimental setup

3.1.1. Database
Experiments are conducted on two Oracle database schemas publicly available for download from (Oracle 2002). The first schema, Schema1, is from a Human Resources (HR) database with 9 tables, 32 attributes and a total of 50 instances. The other schema, Schema2, is from an Order Entry (OE) database with 9 tables, 51 attributes and a total of 2652 instances. These details are given in Table 18. In order to determine the effect of schema size on precision, we also use rescaled versions of the above two schemas; the structure of these schemas can be found in section 3.2.8. These schemas are installed on an Oracle 10.2 database server in a UniSA laboratory. We use iSQL*Plus 10.2 and RazorSQL v5.2.5 to access the database.

Schema1 (HR)              Schema2 (OE)
CUSTOMERS: 5 (5)          CUSTOMERS: 8 (319)
DIVISIONS: 2 (4)          DEPARTMENTS: 4 (27)
EMPLOYEES: 6 (4)          EMPLOYEES: 11 (107)
JOBS: 2 (5)               INVENTORIES: 3 (1112)
ORDER_STATUS: 3 (2)       JOBS: 4 (19)
PRODUCTS: 5 (12)          JOB_HISTORY: 5 (10)
PRODUCT_TYPES: 2 (5)      ORDERS: 8 (105)
PURCHASES: 4 (9)          ORDER_ITEMS: 5 (665)
SALARY_GRADES: 3 (4)      PRODUCT_DESCRIPTIONS: 4 (288)

Table 18: Summary of experimental schemas (format: table name: number of attributes (number of instances))

3.1.2. Schema matcher prototype
For evaluating the framework and testing the effectiveness of the schema matching algorithms, we built a prototype in Microsoft .NET/C# as a command line program (Figure 7).

[Figure 7: Schema matcher prototype]

We use this program to extract schema information from text files that contain the schemas in SQL DDL format and to build internal class objects from that information. All the matching algorithms and relevant functions are implemented in this program. We also use it to connect to the Oracle database at UniSA over the internet and extract instance information from the database. The software tools used for this prototype are listed in Table 19.

Type of software tool             Details
Database server                   Oracle 10.2
Database manipulation             iSQL*Plus 10.2 and RazorSQL v5.2.5
Programming environment           Microsoft Visual Studio 2008
Programming language              Microsoft .NET/C#
.NET framework                    Version 3.5
Database connector for .NET (1)   Oracle Data Provider for .NET 11.2.0.1.2

(1) Available for download from http://www.oracle.com/technetwork/topics/dotnet/utilsoft-086879.html

Table 19: Software tools used for the prototype

3.2. Evaluating the accuracy of matching algorithms
To assess how accurately the matching algorithms detect similarities, we use a measurement commonly used in information retrieval (IR), precision, similar to (Do, Hai & Rahm 2002; Kang & Naughton 2003; Wang et al. 2004; Madhavan et al. 2005; Blake 2007; Nottelmann & Straccia 2007). Precision defines how many detected matches are correct in reality (Rahm & Bernstein 2001). To assess the quality of the match results, we first execute the matching algorithms individually and, from the detected matches, determine the correct matches manually to obtain the precision measurement. Next, we execute all the algorithms together to get a holistic match result and again manually determine which of the detected matches are correct in reality. The computation is illustrated below.
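Concretely, the precision figures reported in the following subsections are computed as:

Precision = (number of manually verified correct matches) / (total number of detected matches)

For example, if an algorithm detects 36 matches above its similarity threshold and 28 of them are verified manually as correct, its precision is 28/36 ≈ 77.8 % (this is the figure reported for prefix matching in section 3.2.1).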
3.2.1. Prefix matching algorithm
Evaluating accuracy with only the prefix matching algorithm shows that it is quite effective in detecting similar strings. Identical strings are detected as perfect matches (S = 1). Strings that match partially also show high similarity if they have delimiters such as underscores, e.g. FIRST_NAME and CUST_FIRST_NAME (S = 0.8). Strings with an unbalanced number of delimited substrings show some similarity, but lower than in the previous cases, e.g. EMPLOYEES.SALARY and JOBS.MIN_SALARY: SALARY consists of a single substring, while MIN_SALARY has two substrings delimited by an underscore. This is a desirable characteristic, because it indicates that the two strings are not semantically the same. A clear trend can be observed: the more unbalanced the substring counts of the two strings, the lower the similarity. This trend can be seen in the sample results in Table 20, which shows the attributes in Schema.Table.Attribute format; more detailed results are in APPENDIX 3. With a minimum similarity threshold of 0.8, we achieve a precision of 77.8 %; that is, of the 36 detected similarities, 28 are correct when determined manually. Some samples of the results are given in Table 20.

Schema 1 Attributes                      Schema 2 Attributes                              Similarity
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID                    1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME                     1
Schema1.CUSTOMERS.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME                      1
Schema1.PURCHASES.QUANTITY               Schema2.ORDER_ITEMS.QUANTITY                     1
Schema1.PURCHASES.PRODUCT_ID             Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID          1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME                0.8
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME                 0.8
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_STATUS                      0.8
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS                  0.667
Schema1.DIVISIONS.NAME                   Schema2.DEPARTMENTS.DEPARTMENT_NAME              0.667
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.LAST_NAME                      0.667
Schema1.EMPLOYEES.TITLE                  Schema2.JOBS.JOB_TITLE                           0.667
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MIN_SALARY                          0.667
Schema1.CUSTOMERS.FIRST_NAME             Schema2.DEPARTMENTS.DEPARTMENT_NAME              0.5
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.LAST_NAME                      0.5
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.INVENTORIES.PRODUCT_ID                   0.5
Schema1.CUSTOMERS.FIRST_NAME             Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME     0.5
Schema1.ORDER_STATUS.LAST_MODIFIED       Schema2.EMPLOYEES.LAST_NAME                      0.5
Schema1.SALARY_GRADES.LOW_SALARY         Schema2.JOBS.MAX_SALARY                          0.5
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUST_FIRST_NAME                0.4
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.ACCOUNT_MGR_ID                 0.4
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.CUSTOMERS.ACCOUNT_MGR_ID                 0.333
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.PHONE_NUMBERS                  0

Table 20: Prefix matching sample results

3.2.2. Suffix matching algorithm
In this experiment, the results are very similar to the prefix matching results, because of the high number of attribute names with delimiters. Suffix matching is much more useful for matching strings without delimiters, e.g. PHONE and TELEPHONE. We achieve 77.8 % precision with a 0.8 similarity threshold. Table 21 shows some sample results; an illustrative sketch of delimiter-aware prefix scoring is given after the table.

Schema 1 Attributes                      Schema 2 Attributes                     Similarity
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID           1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME            1
Schema1.EMPLOYEES.SALARY                 Schema2.EMPLOYEES.SALARY                1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME       0.8
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.INVENTORIES.PRODUCT_ID          0.8
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS         0.667
Schema1.CUSTOMERS.FIRST_NAME             Schema2.DEPARTMENTS.DEPARTMENT_NAME     0.5
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.ORDERS.SALES_REP_ID             0.4
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.CUSTOMERS.ACCOUNT_MGR_ID        0.333
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.PHONE_NUMBERS         0

Table 21: Suffix matching sample results
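For illustration, the following C# sketch shows one simple way to realise delimiter-aware prefix matching: each name is split on underscores and the score is the fraction of substring pairs that share a prefix, relative to the larger substring count. This is only a sketch of the idea from section 2.2.1.2.1 under our own simplifying assumptions; it will not reproduce the reported scores exactly (e.g. it gives about 0.67 rather than 0.8 for FIRST_NAME vs. CUST_FIRST_NAME). Suffix matching can be sketched analogously with EndsWith in place of StartsWith.

    // Illustrative only: a simple token-based prefix similarity.
    using System;

    class PrefixSimilaritySketch
    {
        // Split on underscores; a substring pair matches when one is a prefix
        // of the other; score = matched substrings / larger substring count.
        static double Score(string a, string b)
        {
            string[] ta = a.ToUpperInvariant().Split('_');
            string[] tb = b.ToUpperInvariant().Split('_');
            bool[] used = new bool[tb.Length];
            int matched = 0;
            foreach (string s in ta)
                for (int j = 0; j < tb.Length; j++)
                    if (!used[j] && (tb[j].StartsWith(s) || s.StartsWith(tb[j])))
                    {
                        used[j] = true;
                        matched++;
                        break;
                    }
            return (double)matched / Math.Max(ta.Length, tb.Length);
        }

        static void Main()
        {
            // FIRST_NAME vs CUST_FIRST_NAME: 2 of 3 substrings match -> about 0.67
            Console.WriteLine(Score("FIRST_NAME", "CUST_FIRST_NAME"));
        }
    }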
3.2.3. N-grams matching algorithm
The n-gram matching results show that even when two strings have some words in common, the similarity is not necessarily high. This is because the name strings are compared by their n-grams, and the similarity depends on the number of common n-grams. As the string length of one attribute increases, its number of n-grams also increases and, therefore, the similarity decreases. For example, CUSTOMER_ID and ACCOUNT_MGR_ID have a similarity of 0.1; this similarity is due to the n-gram RID, which is common to both names, since we remove delimiters before the comparison. By default, we use trigrams (n = 3). With a threshold of 0.6 we achieve a precision of 100 %; that is, all 28 detected matches are correct. Table 22 shows some samples of the results; a sketch of the trigram computation follows the table.

Schema 1 Attributes                      Schema 2 Attributes                     Similarity
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.JOB_HISTORY.EMPLOYEE_ID         1
Schema1.EMPLOYEES.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME             1
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_STATUS             0.818
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME       0.636
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.LAST_NAME             0.571
Schema1.EMPLOYEES.TITLE                  Schema2.JOBS.JOB_TITLE                  0.5
Schema1.ORDER_STATUS.STATUS              Schema2.ORDERS.ORDER_STATUS             0.444
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.INVENTORIES.PRODUCT_ID          0.455
Schema1.DIVISIONS.DIVISION_ID            Schema2.DEPARTMENTS.LOCATION_ID         0.375
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS         0.3
Schema1.CUSTOMERS.FIRST_NAME             Schema2.ORDERS.ORDER_STATUS             0.111
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.ACCOUNT_MGR_ID        0.1

Table 22: N-gram matching sample results
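The sample values above are consistent with computing the similarity as the number of shared trigrams divided by the larger trigram count, after removing delimiters: for example, ORDERSTATUSID and ORDERSTATUS share 9 trigrams out of a maximum of 11, giving 0.818, and CUSTOMERID and ACCOUNTMGRID share only RID out of a maximum of 10, giving 0.1. The following C# sketch implements the measure under that assumption (it is not taken from the prototype's source).

    // Trigram similarity: strip delimiters, collect all 3-character substrings,
    // and divide the number of shared trigrams by the larger trigram count.
    using System;
    using System.Collections.Generic;

    class TrigramSimilarity
    {
        static List<string> Trigrams(string name)
        {
            string s = name.Replace("_", "").ToUpperInvariant();
            var grams = new List<string>();
            for (int i = 0; i + 3 <= s.Length; i++)
                grams.Add(s.Substring(i, 3));
            return grams;
        }

        static double Score(string a, string b)
        {
            List<string> ga = Trigrams(a), gb = Trigrams(b);
            var setB = new HashSet<string>(gb);
            int common = 0;
            foreach (string g in new HashSet<string>(ga))  // count each shared trigram once
                if (setB.Contains(g))
                    common++;
            return (double)common / Math.Max(ga.Count, gb.Count);
        }

        static void Main()
        {
            Console.WriteLine(Score("CUSTOMER_ID", "ACCOUNT_MGR_ID")); // 0.1 (only RID shared)
            Console.WriteLine(Score("FIRST_NAME", "CUST_FIRST_NAME")); // 0.636...
        }
    }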
3.2.4. Structural matching algorithm
It can be observed that the structural similarity is very high when the data types and field lengths are the same. For example, EMPLOYEES.TITLE and CUSTOMERS.CUST_FIRST_NAME have maximum similarity, as their constraints are also the same; ORDER_STATUS.LAST_MODIFIED and EMPLOYEES.HIRE_DATE are a similar case. However, even when two fields look very similar, slight differences can decrease the similarity value significantly. For example, Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.CUSTOMER_ID look almost identical: they have the same data type, and both are primary keys of tables with the same name. The difference is that the Schema2 element has a field length defined, whereas the Schema1 element does not. At other times, even when two fields are semantically very different, they can have maximum structural similarity. For example, EMPLOYEES.SALARY and ORDERS.PROMOTION_ID show maximum structural similarity because they have the same data type, Number, and all their constraints are also the same. Hence, to avoid this type of situation, we need more than one type of matching algorithm; when the instance matching algorithm is run on these two fields, it shows a very low similarity. Table 23 shows some sample results of this evaluation.

Schema 1 Attributes                      Schema 2 Attributes                     Similarity
Schema1.EMPLOYEES.TITLE                  Schema2.CUSTOMERS.CUST_FIRST_NAME       1
Schema1.EMPLOYEES.SALARY                 Schema2.ORDERS.PROMOTION_ID             1
Schema1.ORDER_STATUS.LAST_MODIFIED       Schema2.EMPLOYEES.HIRE_DATE             1
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID           0.75
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS         0.75
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.HIRE_DATE             0.5
Schema1.CUSTOMERS.DOB                    Schema2.CUSTOMERS.CUSTOMER_ID           0.25
Schema1.PRODUCTS.PRODUCT_ID              Schema2.DEPARTMENTS.DEPARTMENT_NAME     0

Table 23: Structure matching sample results

With a minimum threshold of 0.7, the precision is extremely low: only 23 out of 295 matches are correct, a precision of 7.8 %. Even if we increase the minimum threshold to 1, there is not much difference; the precision is 7.4 %. From this we can deduce that structural comparison does not carry much weight in detecting correct matches, and we suspect that this algorithm can have a negative effect on the overall match quality. We will need to reduce the weight given to structural matching, or investigate the reasons for this negative effect further and revise the algorithm.

3.2.5. Instance matching
Evaluation of the instance matching algorithm shows that it is very effective in identifying some of the similar fields. For example, Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.CUSTOMER_ID show maximum similarity, while CUSTOMERS.LAST_NAME and Schema2.INVENTORIES.PRODUCT_ID show high dissimilarity. On the other hand, as in the structural matching case, EMPLOYEES.SALARY and CUSTOMERS.CUSTOMER_ID show maximum similarity although they are very different in reality. This situation arises because their data types are the same and the patterns in the actual instances are almost identical. Another observation is that even when the data instances are very different, similar patterns still produce a high instance similarity, e.g. Schema1.CUSTOMERS.CUSTOMER_ID and Schema2.CUSTOMERS.PHONE_NUMBERS. It can be deduced that this unfavourable effect is due to the linear normalization of the instances into the range [0, 1]: when normalized, their patterns become very much alike and the values also become similar (a small sketch illustrating this effect is given after Table 24). A sample of the results is in Table 24.

Schema 1 Attributes                      Schema 2 Attributes                     Similarity
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID           1
Schema1.EMPLOYEES.SALARY                 Schema2.CUSTOMERS.CUSTOMER_ID           1
Schema1.CUSTOMERS.FIRST_NAME             Schema2.JOB_HISTORY.JOB_ID              0.928
Schema1.DIVISIONS.NAME                   Schema2.DEPARTMENTS.DEPARTMENT_NAME     0.962
Schema1.EMPLOYEES.FIRST_NAME             Schema2.EMPLOYEES.LAST_NAME             0.907
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.JOB_HISTORY.EMPLOYEE_ID         0.95
Schema1.EMPLOYEES.SALARY                 Schema2.JOB_HISTORY.EMPLOYEE_ID         0.95
Schema1.CUSTOMERS.LAST_NAME              Schema2.INVENTORIES.PRODUCT_ID          0.49
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.PHONE_NUMBERS         0.807

Table 24: Instance matching sample results

With a minimum threshold value of 1, we get a precision of 2/23 (8.7 %); if we decrease the threshold to 0.9, we get a precision of 14/137 (10.2 %).
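To illustrate the normalization effect described above, the following C# sketch applies a min-max style linear normalization into [0, 1] (our reading of the "linear normalization" referred to; the prototype's exact transform may differ). Two columns with very different magnitudes but similar spacing end up with nearly identical normalized profiles, which explains false matches such as CUSTOMER_ID vs. PHONE_NUMBERS.

    // Illustrative sketch: min-max linear normalization of a numeric column.
    using System;
    using System.Linq;

    class InstanceNormalizationSketch
    {
        static double[] Normalize(double[] values)
        {
            double min = values.Min(), max = values.Max();
            if (max == min)
                return values.Select(v => 0.0).ToArray(); // constant column
            return values.Select(v => (v - min) / (max - min)).ToArray();
        }

        static void Main()
        {
            // Customer IDs and phone-number-like values: very different magnitudes,
            // but identical profiles (0, 0.25, 0.5, 0.75, 1) after normalization.
            double[] ids = { 1, 2, 3, 4, 5 };
            double[] phones = { 8000001, 8000002, 8000003, 8000004, 8000005 };
            Console.WriteLine(string.Join(", ", Normalize(ids).Select(v => v.ToString("0.00")).ToArray()));
            Console.WriteLine(string.Join(", ", Normalize(phones).Select(v => v.ToString("0.00")).ToArray()));
        }
    }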
3.2.6. Overall accuracy of similarity algorithms
After evaluating the individual matching algorithms, we execute all the algorithms sequentially in order to obtain a collective precision value. 32 matches are detected as having the highest similarity. When we evaluate these matches manually, we find that 18 of them are correctly determined, giving a precision of 56.3 %. Of the individual matchers, the n-gram matching algorithm performs its task of finding string similarities the most effectively. The results of these evaluations are depicted in Figure 8.

[Figure 8: Precision of the matching algorithms (%) - Prefix, Suffix, N-gram, Structural, Instance and Overall]

3.2.7. Effect of schema size on the overall precision
Next, we conducted the schema matching evaluation again with smaller schemas and obtained the overall precision. Table 25 shows the results of this evaluation: for each attribute in Schema 1, it shows the best matching attribute in Schema 2.

Schema 1 Attributes                  Schema 2 Attributes                     Similarity
Schema1.EMPLOYEES.FIRST_NAME         Schema2.EMPLOYEES.FIRST_NAME            0.75
Schema1.EMPLOYEES.EMPLOYEE_ID        Schema2.EMPLOYEES.EMPLOYEE_ID           0.75
Schema1.EMPLOYEES.LAST_NAME          Schema2.EMPLOYEES.LAST_NAME             0.75
Schema1.EMPLOYEES.SALARY             Schema2.EMPLOYEES.SALARY                0.75
Schema1.CUSTOMERS.CUSTOMER_ID        Schema2.CUSTOMERS.CUSTOMER_ID           0.75
Schema1.EMPLOYEES.MANAGER_ID         Schema2.EMPLOYEES.MANAGER_ID            0.65
Schema1.JOBS.JOB_ID                  Schema2.EMPLOYEES.JOB_ID                0.6
Schema1.CUSTOMERS.FIRST_NAME         Schema2.CUSTOMERS.CUST_FIRST_NAME       0.597
Schema1.CUSTOMERS.LAST_NAME          Schema2.CUSTOMERS.CUST_LAST_NAME        0.59
Schema1.CUSTOMERS.PHONE              Schema2.EMPLOYEES.PHONE_NUMBER          0.483
Schema1.DIVISIONS.NAME               Schema2.DEPARTMENTS.DEPARTMENT_NAME     0.45
Schema1.DIVISIONS.DIVISION_ID        Schema2.DEPARTMENTS.DEPARTMENT_ID       0.3
Schema1.EMPLOYEES.TITLE              Schema2.CUSTOMERS.CUST_EMAIL            0.15
Schema1.JOBS.NAME                    Schema2.CUSTOMERS.PHONE_NUMBERS         0.15
Schema1.CUSTOMERS.DOB                Schema2.EMPLOYEES.COMMISSION_PCT        0.1

Table 25: Best matching attributes for smaller schemas with 4 tables

From these results, 10 out of 15 matches are correct; therefore, the precision is 66.7 %. We conducted another evaluation with rescaled schemas that have only one table each. From the results in Table 26, we can observe an improvement in the overall precision: 4 out of 5 matches are correct, a precision of 80 %.

Schema 1 Attributes              Schema 2 Attributes                     Similarity
Schema1.CUSTOMERS.CUSTOMER_ID    Schema2.CUSTOMERS.CUSTOMER_ID           0.7
Schema1.CUSTOMERS.FIRST_NAME     Schema2.CUSTOMERS.CUST_FIRST_NAME       0.597
Schema1.CUSTOMERS.LAST_NAME      Schema2.CUSTOMERS.CUST_LAST_NAME        0.59
Schema1.CUSTOMERS.PHONE          Schema2.CUSTOMERS.PHONE_NUMBERS         0.477
Schema1.CUSTOMERS.DOB            Schema2.CUSTOMERS.CUST_EMAIL            0.1

Table 26: Best matching attributes for single table schemas

Based on these results, we conclude that as the schema size increases, the overall accuracy of the match results drops. This is depicted in Figure 9.

[Figure 9: Precision vs. schema size - 5, 15 and 32 attributes]

3.2.8. Efficiency of schema matching process
First we use the largest schemas, each of which has 9 tables. Executing the matching algorithms, except for the instance matching algorithm, takes very little time, just a few seconds.
For the instance matcher, the process of reading the instances, computing the statistical measurements and setting the values on the internal Schema objects takes 11 minutes and 30 seconds over a 1 Mbps internet connection. In the prototype, we provide 5 options for generating the results in different forms. Option 1 generates a list of attribute similarities for each matcher algorithm. This operation takes less than a second, because all the information is already embedded in the class objects and only needs to be displayed. Option 2 displays the aggregate similarities of every attribute when all algorithms are combined; it takes 1 hour and 40 minutes to complete this process and show the results. Option 3 shows the highest matching attribute for each attribute, Option 4 shows the table similarities, and the last option shows the overall schema similarity. We observe that the latter three options take 1 hour and 9 minutes each, and all of these times grow drastically as the schema size increases. These long times arise because, each time an option is chosen, the array object is iterated and the aggregate values are recalculated. If we instead calculated the values in a single iteration, we could achieve higher throughput; however, this would increase the programming complexity drastically. Therefore, we performed all the options in sequence to measure the total time the whole process takes and estimate the efficiency. Table 27 shows these readings. We observe that as the number of attributes increases, the processing time increases drastically. This is depicted in Figure 10.

Schema size     Time taken
1 table         1 sec
4 tables        1 min 38 sec
9 tables        4 hr 50 min

Table 27: Decrease in efficiency with increase in schema size

[Figure 10: Schema size vs. processing time (s) - 1 table, 4 tables, 9 tables]

Moreover, executing the algorithms sequentially in this framework also affects the efficiency; executing them in parallel could achieve higher throughput. Nevertheless, we prefer sequential execution in our design because it is more scalable and can perform a more comprehensive matching process.

4. CONCLUSION
In this thesis, we conducted a comprehensive study of schema matching. We considered various factors that can make schema matching architectures scalable, efficient and accurate, and proposed a framework incorporating these factors. With the framework, we also developed several algorithms for identifying similarities in schemas. We built a prototype to evaluate the framework and these algorithms and conducted a number of experiments. From these experiments it can be deduced that the various algorithms perform differently under various conditions and that a combination of these algorithms produces better results. We also found many angles from which the architecture can be improved to make it more efficient, scalable and accurate.

REFERENCES

Aumueller, D, Do, H, Massmann, S & Rahm, E 2005, 'Schema and ontology matching with COMA++'.
Batini, C, Lenzerini, M & Navathe, S 1986, 'A comparative analysis of methodologies for database schema integration', ACM Computing Surveys (CSUR), vol. 18, no. 4, pp. 323-364.
Bernstein, P, Melnik, S, Petropoulos, M & Quix, C 2004, 'Industrial-strength schema matching', ACM SIGMOD Record, vol. 33, no. 4, pp. 38-43.
Blake, R 2007, 'A Survey of Schema Matching Research', University of Massachusetts Boston, College of Management Working Papers, September.
Bozovic, N & Vassalos, V 2008, 'Two-phase schema matching in real world relational databases'.
Chen, S 1995, 'Measures of similarity between vague sets', Fuzzy Sets and Systems, vol. 74, no. 2, pp. 217-223.
Chua, CEH, Chiang, RHL & Lim, E-P 2003, 'Instance-based attribute identification in database integration', The VLDB Journal, vol. 12, no. 3, pp. 228-243.
Cohen, W, Ravikumar, P & Fienberg, S 2003, 'A comparison of string metrics for matching names and records'.
Do, H 2006, Schema matching and mapping-based data integration, VDM Verlag Dr. Müller.
Do, H, Melnik, S & Rahm, E 2002, 'Comparison of schema matching evaluations', Web, Web-Services, and Database Systems, pp. 221-237.
Do, H & Rahm, E 2002, 'COMA: a system for flexible combination of schema matching approaches', VLDB Endowment, Hong Kong, China, pp. 610-621.
Do, H & Rahm, E 2007, 'Matching large schemas: Approaches and evaluation', Information Systems, vol. 32, no. 6, pp. 857-885.
Doan, A, Domingos, P & Halevy, A 2001, 'Reconciling schemas of disparate data sources: A machine-learning approach'.
Doan, A & Halevy, A 2005, 'Semantic integration research in the database community: A brief survey', AI Magazine, vol. 26, no. 1, p. 83.
Doan, A, Noy, N & Halevy, A 2004, 'Introduction to the special issue on semantic integration', ACM SIGMOD Record, vol. 33, no. 4, pp. 11-13.
Domshlak, C, Gal, A & Roitman, H 2007, 'Rank aggregation for automatic schema matching', IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 4, p. 538.
Embley, D, Jackman, D & Xu, L 2001, 'Multifaceted exploitation of metadata for attribute match discovery in information integration'.
Giunchiglia, F, Shvaiko, P & Yatskevich, M 2004, 'S-Match: an algorithm and an implementation of semantic matching', The Semantic Web: Research and Applications, pp. 61-75.
Giunchiglia, F & Yatskevich, M 2004, 'Element level semantic matching'.
Halevy, A, Rajaraman, A & Ordille, J 2006, 'Data integration: The teenage years'.
Huimin, Z & Sudha, R 2004, 'Clustering Schema Elements for Semantic Integration of Heterogeneous Data Sources', Journal of Database Management, vol. 15, no. 4, p. 88.
Jain, R, Murthy, S, Chen, P & Chatterjee, S 2002, 'Similarity measures for image databases'.
Kang, J & Naughton, J 2003, 'On schema matching with opaque column names and data values'.
Karasneh, Y, Ibrahim, H, Othman, M & Yaakob, R 2009, 'A model for matching and integrating heterogeneous relational biomedical databases schemas'.
Kaur, G 2010, 'Similarity measure of different types of fuzzy sets'.
Langville, A & Meyer, C 2005, 'A survey of eigenvector methods for web information retrieval', SIAM Review, vol. 47, no. 1, pp. 135-161.
Li, W & Clifton, C 2000, 'SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks', Data and Knowledge Engineering, vol. 33, no. 1, pp. 49-84.
Madhavan, J, Bernstein, P, Doan, A & Halevy, A 2005, 'Corpus-based schema matching'.
Madhavan, J, Bernstein, P & Rahm, E 2001, 'Generic schema matching with Cupid'.
Melnik, S, Garcia-Molina, H & Rahm, E 2002, 'Similarity flooding: A versatile graph matching algorithm and its application to schema matching'.
Microsoft 2008, Data Type Mapping for Oracle Publishers: SQL Server 2008, viewed 22 September 2010, <http://msdn.microsoft.com/en-us/library/ms151817(v=SQL.100).aspx>.
Microsoft 2007, Equivalent ANSI SQL Data Types, viewed 22 September 2010, <http://msdn.microsoft.com/en-us/library/bb177899.aspx>.
Monge, A & Elkan, C 1996, 'The field matching problem: Algorithms and applications'.
Navarro, G 2001, 'A guided tour to approximate string matching', ACM Computing Surveys (CSUR), vol. 33, no. 1, p. 88.
Nottelmann, H & Straccia, U 2007, 'Information retrieval and machine learning for probabilistic schema matching', Information Processing & Management, vol. 43, no. 3, pp. 552-576.
Oracle 2003, Heterogeneous Connectivity Administrator's Guide, viewed 28 September 2010, <http://download.oracle.com/docs/cd/B12037_01/appdev.101/b10795.pdf>.
Oracle 2002, Oracle9i Sample Schemas, Release 2 (9.2), viewed 28 September 2010, <http://download.oracle.com/docs/cd/B10501_01/server.920/a96539.pdf>.
Oracle 2008, SQL Developer Supplementary Information for Microsoft SQL Server and Sybase Adaptive Server Migrations Release 1.5, viewed 29 September 2010, <http://download.oracle.com/docs/cd/E12151_01/doc.150/e12156.pdf>.
Parent, C & Spaccapietra, S 1998, 'Issues and approaches of database integration', Communications of the ACM, vol. 41, no. 5es, pp. 166-178.
Po, L & Sorrentino, S 2010, 'Automatic generation of probabilistic relationships for improving schema matching', Information Systems.
Rahm, E & Bernstein, P 2001, 'A survey of approaches to automatic schema matching', The VLDB Journal, vol. 10, no. 4, pp. 334-350.
Shvaiko, P & Euzenat, J 2005, 'A survey of schema-based matching approaches', Journal on Data Semantics IV, pp. 146-171.
Tekli, J, Chbeir, R & Yetongnon, K 2009, 'Extensible User-Based XML Grammar Matching', Conceptual Modeling - ER 2009, pp. 294-314.
Thang, H & Nam, V 2008, 'XML Schema Automatic Matching Solution', International Journal of Computer Systems Science and Engineering, vol. 4, no. 1, pp. 68-74.
Wang, J, Wen, J, Lochovsky, F & Ma, W 2004, 'Instance-based schema matching for web databases by domain-specific query probing'.
Wu, W, Yu, C, Doan, A & Meng, W 2004, 'An interactive clustering-based approach to integrating source query interfaces on the deep web'.
Yeung, D & Tsang, E 2002, 'A comparative study on similarity-based fuzzy reasoning methods', IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 27, no. 2, pp. 216-227.
Ziegler, P & Dittrich, K 2004, 'Three Decades of Data Integration - all Problems Solved?', Building the Information Society, pp. 3-12.

APPENDIX 1: Data conversion table
This table is constructed based on the information available from (Oracle 2008) and (Microsoft 2008).
SQL Server                 Oracle
BINARY(n)                  RAW(n)
BINARY(n)                  BLOB
BIT                        NUMBER(1)
CHAR(18)                   ROWID
CHAR(18)                   UROWID
CHAR(n)                    CHAR(n)
DATETIME                   DATE
DATETIME                   INTERVAL
DATETIME                   TIMESTAMP
FLOAT                      FLOAT
FLOAT                      NUMBER
FLOAT                      REAL
IMAGE                      BLOB
IMAGE                      LONG RAW
INTEGER                    NUMBER(10)
MONEY                      NUMBER(19,4)
NCHAR([1-1000])            NCHAR([1-1000])
NCHAR(n)                   CHAR(n*2)
NUMERIC([0-38],[1-38])     NUMBER([0-38],[1-38])
NUMERIC([1-38])            NUMBER([1-38])
NUMERIC(38)                INT
NVARCHAR([1-2000])         NVARCHAR2([1-2000])
NVARCHAR(MAX)              NCLOB
NVARCHAR(n)                VARCHAR(n*2)
REAL                       FLOAT
SMALLDATETIME              DATE
SMALLINT                   NUMBER(5)
SMALLMONEY                 NUMBER(10,4)
SYSNAME                    VARCHAR2(30)
SYSNAME                    VARCHAR2(128)
TEXT                       CLOB
TIMESTAMP                  NUMBER
TINYINT                    NUMBER(3)
VARBINARY([1-2000])        RAW([1-2000])
VARBINARY(MAX)             BFILE
VARBINARY(MAX)             BLOB
VARBINARY(n)               RAW(n)
VARBINARY(n)               BLOB
VARCHAR(37)                TIMESTAMP WITH TIME ZONE
VARCHAR(MAX)               CLOB
VARCHAR(MAX)               LONG
VARCHAR(n)                 VARCHAR2(n)

APPENDIX 3: Prefix Matching

Highest similarity: 1
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID
Schema1.CUSTOMERS.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME
Schema1.CUSTOMERS.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME
Schema1.CUSTOMERS.CUSTOMER_ID            Schema2.ORDERS.CUSTOMER_ID
Schema1.EMPLOYEES.MANAGER_ID             Schema2.DEPARTMENTS.MANAGER_ID
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.EMPLOYEES.EMPLOYEE_ID
Schema1.EMPLOYEES.MANAGER_ID             Schema2.EMPLOYEES.MANAGER_ID
Schema1.EMPLOYEES.FIRST_NAME             Schema2.EMPLOYEES.FIRST_NAME
Schema1.EMPLOYEES.LAST_NAME              Schema2.EMPLOYEES.LAST_NAME
Schema1.EMPLOYEES.SALARY                 Schema2.EMPLOYEES.SALARY
Schema1.EMPLOYEES.EMPLOYEE_ID            Schema2.JOB_HISTORY.EMPLOYEE_ID
Schema1.JOBS.JOB_ID                      Schema2.EMPLOYEES.JOB_ID
Schema1.JOBS.JOB_ID                      Schema2.JOBS.JOB_ID
Schema1.JOBS.JOB_ID                      Schema2.JOB_HISTORY.JOB_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_ID              Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID
Schema1.PURCHASES.CUSTOMER_ID            Schema2.CUSTOMERS.CUSTOMER_ID
Schema1.PURCHASES.PRODUCT_ID             Schema2.INVENTORIES.PRODUCT_ID
Schema1.PURCHASES.CUSTOMER_ID            Schema2.ORDERS.CUSTOMER_ID
Schema1.PURCHASES.PRODUCT_ID             Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PURCHASES.QUANTITY               Schema2.ORDER_ITEMS.QUANTITY
Schema1.PURCHASES.PRODUCT_ID             Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID
Similarity: 0.8
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME
Schema1.CUSTOMERS.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME
Schema1.EMPLOYEES.FIRST_NAME             Schema2.CUSTOMERS.CUST_FIRST_NAME
Schema1.EMPLOYEES.LAST_NAME              Schema2.CUSTOMERS.CUST_LAST_NAME
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_ID
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDERS.ORDER_STATUS
Schema1.ORDER_STATUS.ORDER_STATUS_ID     Schema2.ORDER_ITEMS.ORDER_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCTS.PRODUCT_TYPE_ID         Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.INVENTORIES.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.ORDER_ITEMS.PRODUCT_ID
Schema1.PRODUCT_TYPES.PRODUCT_TYPE_ID    Schema2.PRODUCT_DESCRIPTIONS.PRODUCT_ID

Similarity: 0.6
Attribute 1                              Attribute 2
Schema1.CUSTOMERS.PHONE                  Schema2.CUSTOMERS.PHONE_NUMBERS
Schema1.CUSTOMERS.PHONE                  Schema2.EMPLOYEES.PHONE_NUMBER
Schema1.DIVISIONS.NAME                   Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.FIRST_NAME
Schema1.DIVISIONS.NAME                   Schema2.EMPLOYEES.LAST_NAME
Schema1.DIVISIONS.NAME                   Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.EMPLOYEES.TITLE                  Schema2.JOBS.JOB_TITLE
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MIN_SALARY
Schema1.EMPLOYEES.SALARY                 Schema2.JOBS.MAX_SALARY
Schema1.JOBS.NAME                        Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.JOBS.NAME                        Schema2.EMPLOYEES.FIRST_NAME
Schema1.JOBS.NAME                        Schema2.EMPLOYEES.LAST_NAME
Schema1.JOBS.NAME                        Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.ORDER_STATUS.STATUS              Schema2.ORDERS.ORDER_STATUS
Schema1.PRODUCTS.NAME                    Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.PRODUCTS.NAME                    Schema2.EMPLOYEES.FIRST_NAME
Schema1.PRODUCTS.NAME                    Schema2.EMPLOYEES.LAST_NAME
Schema1.PRODUCTS.PRICE                   Schema2.ORDER_ITEMS.UNIT_PRICE
Schema1.PRODUCTS.NAME                    Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.PRODUCTS.DESCRIPTION             Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_DESCRIPTION
Schema1.PRODUCT_TYPES.NAME               Schema2.DEPARTMENTS.DEPARTMENT_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.EMPLOYEES.FIRST_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.EMPLOYEES.LAST_NAME
Schema1.PRODUCT_TYPES.NAME               Schema2.PRODUCT_DESCRIPTIONS.TRANSLATED_NAME
Schema1.SALARY_GRADES.LOW_SALARY         Schema2.EMPLOYEES.SALARY
Schema1.SALARY_GRADES.HIGH_SALARY        Schema2.EMPLOYEES.SALARY