ITCS6010 Fall 2008
Anuradha Venkataraman 800556407

Semantic Mapping:
Specifies how data in two different data sources are related and how to transform data from one data source to the other. It is a 2-step process:
◦ Find semantic matches between attributes of the two schemas
◦ location=concat(city,state)
◦ Find mappings that transform data in one schema to the other based on the matches found (SQL queries, Clio)
◦ location=select concat(city,state) from S2
Semantic mapping is a fundamental task in DI and other data sharing systems such as DW, EDI, cooperations and collaborations.

At the time the paper was written, semantic mapping was done manually, which was time consuming and error prone.
The majority of prior related work focused on creating 1-1 matches.
Complex matches are common in real-world schemas.

1-1 matches: an attribute in the source schema is matched to another attribute in the target schema
◦ name=custname
Complex matches: a combination of attributes in the source schema relates to a combination of attributes in the target schema
◦ name=concat(fname,lname)
◦ Finding complex matches is more difficult because the search space is unbounded.
Schema matching is done on relational schemas; the idea can be used for other types as well.

Goal: provide an interactive design environment where a human can create mappings between schemas quickly by using the system's suggestions.
Create a semi-automatic system to find complex matches between two schemas.
Find complex matches.
Use domain knowledge, external data and overlap data.

Match generation by using search techniques over the solution space
Uses a number of searchers (search modules); each searcher searches a subset of the search space.
◦ Text searcher: searches for matches for textual target attributes, uses the concatenation operator
Beam search is deployed to control the search through the search space.
Machine learning, statistics and heuristics are used to evaluate candidate matches.
The diminishing returns principle is used to determine when the search should be terminated.
Considers name similarity, domain knowledge, integrity constraints, external data and overlap data to re-rank and select the best matches.
Provides an explanation module that gives the user the reasons for the decisions taken at various stages.

Semi-automatic discovery of complex matches that combines search through a set of candidate matches with methods for evaluating each match
Use of new kinds of domain knowledge (overlap data and mining external data)
A mechanism for explaining the decisions made by the matching system

[Figure: system architecture. Labels: target schema T and source schema S; match generator; match candidates; similarity estimator; similarity matrix; match selector; 1-1 and complex matches; user; domain knowledge and data; explanation module.]

Match Generator
◦ 2 input schemas (S - source schema, T - target schema)
◦ For each attribute t in T, generates possible match candidates by employing various searchers
Similarity Estimator
◦ Computes a similarity score for each match candidate, indicating its level of similarity to target attribute t
◦ Outputs a similarity score matrix
Match Selector
◦ Selects the best matches based on the similarity scores and domain constraints
The three modules use external data, domain knowledge and overlap data during processing to improve accuracy and efficiency.
The modules also interact with an explanation module that explains the actions each module performs.

Search over the space of possible matches
The search space is extremely large
Uses a multi-searcher strategy
◦ Each searcher has a specific purpose
◦ Each searcher searches only a subset of the search space
Advantages:
◦ Makes the system extensible: new searchers can be added and integrated
◦ Each searcher considers small, relevant portions of the search space

Search Strategy:
◦ Uses beam search to control the search process
◦ Evaluates each candidate with a scoring function and retains only the top k matches at each level.
Match Evaluation
◦ Assigns a score based on the semantic distance between the candidate match and the target attribute.
◦ Uses machine learning, statistics and heuristics.
Termination
◦ Uses the diminishing returns principle
◦ Stops when the difference between the scores of the best matches at consecutive levels is less than a threshold.

Target attributes that are numeric
◦ price, total, age
Restricted to common operations such as add, subtract, multiply and divide
◦ total=qty*unitprice
Considers the similarity between the value distributions of the target attribute and the complex match candidate, using the Kullback-Leibler divergence measure.
◦ This measure quantifies the divergence between two distributions, based on information and probability theory.

Examines concatenations of text attributes in S
◦ location=concat(city,state,country)
Uses a Naïve Bayes text classifier to calculate the score
◦ The classifier is trained on the data in the target schema to learn the target attribute. It returns a probability value for each data instance produced by the candidate mapping from the source; the average of these values is the score of the mapping.
◦ Starts with 1-1 matches and proceeds with concatenation using the beam search strategy

Target attribute t is categorical if its data instances have fewer than x distinct values
◦ product-category=product-type
Finds conversions between categorical attributes
◦ waterfront=f(near-water)
Uses the KL divergence measure to determine similarity

Relates the data of one schema to the schema of the other
Focuses on binary attributes in the target schema (yes/no)
◦ mp3support = {yes/no}
Identifies binary attributes in T and searches for the attribute name in the data of S
◦ description = "…plays mp3,wav…"
Transforms source attribute s into a categorical attribute of t if it contains more than x instances of the attribute name of t.
◦ mp3support = yes if description contains mp3; no if description does not contain mp3

Finds matches that are a conversion between two different units
◦ length-mm=10 * size-cm
Identifies physical attributes by looking for the presence of units in the data or in the attribute name

Finds complex matches for date attributes
◦ Bdate=bday/bmonth/byear
Uses an ontology to identify attributes that are dates or parts of dates
◦ Bdate is ontology concept Date; bday is ontology concept day, etc.
Uses the ontology to identify relationships that determine the type of conversion.

Employs scoring techniques that cannot be used in the searchers for efficiency reasons.
Uses two modules:
◦ Name-based evaluator: similarity between the names (concatenation of the names in the complex match) and the target attribute
◦ Naïve Bayes evaluator: based on a Naïve Bayes classifier

Selects the best matches based on the scores and domain integrity constraints.
◦ Constraint: a name does not contain numbers
◦ Match name=login_name {avenkat5} is ranked lower
Uses domain knowledge to clean up complex matches
◦ Candidate matches: total = price*(qty - product_id), pid=product_id
◦ Domain constraint: total and pid are not related.
◦ iMap drops product_id to get total=price*qty

Directs the search process and prunes meaningless candidates early
◦ Knowledge: fname and location are not related; the searcher does not consider complex matches that contain both attributes
Uses domain knowledge as early as possible to prune the search space

Types of Domain Knowledge used
Domain Constraints
◦ Present in the schema or provided by the user. Used in various phases depending on the type of constraint.
◦ Example: the average total number of rooms is less than 10; all matches that evaluate to more than 10 are ignored. This check may be postponed to a later stage if evaluating it is time consuming.
Past Complex Matches
◦ Uses knowledge gained from previous matching of schemas in similar domains.
◦ Extracts expression templates for complex operations, which are then used by the searchers.
◦ Previous match: cost=price*(1 + 0.50); template: attribute1=variable*(1 + constant); the matcher looks for similar expressions in complex mappings.
External Data
◦ Mines properties of attributes from external sources. These properties are used as domain constraints in various stages.
◦ The external data source is generally provided by the user (domain expert).
◦ Information from mined data: the average unit cost does not exceed $1000.

In the real world, source and target schemas often share some data: common tuples, or data that represent the same entity.
This overlap data provides important information about the attribute mappings.
Special overlap searchers are used when the source and target schemas have overlap data.

Overlap Text Searcher
◦ Used instead of the text searcher
◦ Uses overlap data to re-evaluate the mappings generated by the text searcher.
◦ The new score is the fraction of overlap data that the mapping satisfies.
◦ Overlap data: John Smith = concat(John,Smith)
◦ Mapping name=concat(fname,city) does not match the overlap data: score 0; name=concat(fname,lname) matches the overlap data: score 1.
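
The overlap scoring rule for text mappings can be illustrated with a minimal sketch. It assumes overlap data is available as pairs of a source row and the target value for the same entity, and that concatenation joins values with a single space. The helper names (`overlap_score`, `concat`) and the sample rows are hypothetical and are not taken from iMap itself.

```python
def concat(*attrs, sep=" "):
    """Build a candidate mapping that concatenates the named source attributes.
    The single-space separator is an assumption made for this sketch."""
    def apply(row):
        return sep.join(row[a] for a in attrs)
    return apply


def overlap_score(mapping, overlap_pairs):
    """Fraction of overlap pairs (source_row, target_value) that the mapping reproduces."""
    if not overlap_pairs:
        return 0.0
    hits = sum(1 for row, target in overlap_pairs if mapping(row) == target)
    return hits / len(overlap_pairs)


# Hypothetical overlap data: the same person appears in both schemas.
overlap = [
    ({"fname": "John", "lname": "Smith", "city": "Seattle"}, "John Smith"),
    ({"fname": "Mary", "lname": "Jones", "city": "Austin"}, "Mary Jones"),
]

good = overlap_score(concat("fname", "lname"), overlap)  # matches every pair
bad = overlap_score(concat("fname", "city"), overlap)    # matches no pair
```

Here `good` is 1.0 and `bad` is 0.0, mirroring the name=concat(fname,lname) versus name=concat(fname,city) example.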
Overlap Numeric Searcher
◦ Used instead of the numeric searcher
◦ Uses an equation discovery system to find the best arithmetic expression that matches attribute t.
Overlap Category and Schema Mismatch Searchers
◦ Use a technique similar to the overlap text searcher

Explanations help users better understand the system.
The system uses complex processes, so its decisions need to be explained to the user. This helps the user guide the system to the correct matches.
Types of questions a user can ask:
◦ Explain a match: why was numrooms=baths+beds+floor generated?
◦ Explain why a match is not present: why was price=listprice+agent_fee not generated?
◦ Explain the ranking between matches: why is price=listprice ranked better than price=listprice+agent_fee?
Questions can be asked of any specific module.

Uses a dependency graph that is built as the process flows through each module
Records matches, assumptions and data
◦ Nodes: attributes, assumptions, candidate matches, domain knowledge.
◦ Edges: directed, labeled with the module that was responsible for the decision.

Domains and Data
◦ Cricket, Inventory, Financial, Real estate
◦ Collected from various sources
◦ Uses both schemas with overlap data and disjoint schemas
Performance Measure
◦ Top-1 matching accuracy: % of target attributes for which the best-ranked match is the correct match
◦ Top-3 matching accuracy: % of target attributes for which the top 3 ranked matches contain the correct match. This is more relevant because the tool is only used to suggest a ranked list of matches.
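
The two performance measures can be computed as in the following sketch. It assumes the matcher's output is a ranked list of candidate matches per target attribute and that a single correct match is known for each attribute. The function name `top_k_accuracy` and the sample rankings are hypothetical illustrations, not results from the paper.

```python
def top_k_accuracy(ranked, correct, k):
    """Percentage of target attributes whose correct match is among the top k candidates.

    ranked:  dict mapping target attribute -> list of candidate matches, best first
    correct: dict mapping target attribute -> the correct match
    """
    if not correct:
        return 0.0
    hits = sum(
        1 for attr, truth in correct.items()
        if truth in ranked.get(attr, [])[:k]
    )
    return 100.0 * hits / len(correct)


# Hypothetical ranked suggestions for three target attributes.
ranked = {
    "price": ["listprice+agent_fee", "listprice"],
    "numrooms": ["baths+beds+floor"],
    "name": ["concat(fname,city)", "login_name", "custname", "concat(fname,lname)"],
}
correct = {
    "price": "listprice",
    "numrooms": "baths+beds+floor",
    "name": "concat(fname,lname)",
}

top1 = top_k_accuracy(ranked, correct, 1)  # only numrooms is correct at rank 1
top3 = top_k_accuracy(ranked, correct, 3)  # price and numrooms are within the top 3
```

In this made-up data one of three attributes is right at rank 1 and two of three within the top 3, which shows why top-3 accuracy can never be lower than top-1.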
Overlap Data
◦ Default iMap: 58-74% accuracy
◦ Exploiting domain knowledge: 68-92%
Disjoint Data
◦ Default iMap: 55-76% accuracy
◦ Exploiting domain knowledge: 62-79%
Top-1 accuracy: 62-92%
Top-3 accuracy: 64-95%
Exploiting domain knowledge and the presence of overlap data increase accuracy

Default system: 33-55%
Exploiting domain constraints and overlap data: 50-86%
Top-1 Accuracy
◦ Disjoint data: 27-58%
◦ Overlap data: 70-85%
Top-3 Accuracy
◦ Disjoint data: 43-92%
◦ Overlap data: 43-92%
Exploiting domain constraints and overlap data makes a significant difference
High accuracy in top-3 matches for complex matches
Top-3 accuracy is better than top-1

Removing small attributes that act as noise in a complex match is difficult
◦ Phone=concat(id,areacode,number): id is a small value that acts as noise
Finding small attributes that form part of a complex match is difficult
◦ Address=concat(street,city,state): apt# is a small value that is missed

L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements. In Proc. of the Semantic Integration Workshop at ISWC-2003.
◦ Uses a domain ontology to find relationships and mappings between two schemas
◦ Can be useful in some contexts, such as the date searcher. More such modules can be added.
Clio: a complementary work that uses schema mappings to provide an interactive system that generates transformations from one schema to another.

Finding complex matches is a tough task
iMap identifies both 1-1 and complex matches
Uses search modules to generate matches
Exploits domain knowledge and overlap data to improve accuracy
Provides an explanation feature to better assist the user
The accuracy levels obtained in the experiments suggest that further fine tuning and improvement could raise them

Handle types of data sources other than relational ones
Provide a better user interaction style
Integrate more ontology-based searchers

How is overlap data identified?
How are the domain constraints taken as input?
How does the user interact with the system?
How can the accuracy be improved?
Can the system learn from past experiences such as user suggestions (not just past matches)?
Is it scalable? Can it handle multiple-schema matching (as in DI systems, where a number of schemas are matched with a mediated schema)?
What other search modules do you think need to be integrated?
How efficient is it when handling complex schemas (the schemas used in the experiments are rather small)?
Can it be extended to find complex matches over multiple tables more accurately?