Learning to Map between Structured Representations of Data
AnHai Doan
Database & Data Mining Group, University of Washington, Seattle
Spring 2002

Data Integration Challenge
[Figure: a new faculty member asks "Find houses with 2 bedrooms priced under 200K" across realestate.com, homeseekers.com, and homes.com]

Architecture of Data Integration System
[Figure: the query "Find houses with 2 bedrooms priced under 200K" is posed against a mediated schema, which is mapped to source schema 1 (realestate.com), source schema 2 (homeseekers.com), and source schema 3 (homes.com)]

Semantic Mappings between Schemas
Mediated schema: price, agent-name, address
homes.com:
  listed-price  contact-name  city     state
  320K          Jane Brown    Seattle  WA
  240K          Mike Smith    Miami    FL
– 1-1 mapping: price = listed-price
– complex mapping: address = concat(city, state)

Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management
AI
– knowledge bases, ontology merging, information-gathering agents, ...
Web
– e-commerce
– marking up data using ontologies (Semantic Web)

Why Schema Matching is Difficult
Schema & data never fully capture semantics!
– not adequately documented
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location
Intended semantics can be subjective
– house-style = house-description?
Cannot be fully automated; needs user feedback!

Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand; labor intensive & error prone
– data integration at GTE [Li & Clifton, 2000]: 40 databases, 27,000 elements, estimated time: 12 years
Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data
Need semi-automatic approaches to scale up!
Many current research projects
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...

Goals and Contributions
Vision for schema-matching tools
– learn from previous matching activities
– exploit multiple types of information
– incorporate domain integrity constraints
– handle user feedback
My contributions: a solution for semi-automatic schema matching
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy (66 -- 97%) on real-world data

Road Map
– Introduction
– Schema matching [SIGMOD-01]: 1-1 mappings for data integration; the LSD (Learning Source Description) system, which learns from previous matching activities, employs multi-strategy learning, and exploits domain constraints & user feedback
– Creating complex mappings [Tech. Report-02]
– Ontology matching [WWW-02]
– Conclusions

Schema Matching for Data Integration: the LSD Approach
Suppose the user wants to integrate 100 data sources.
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD predicts mappings for the remaining 97 sources

Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
realestate.com:
  listed-price  contact-name  contact-phone   office          comments
  $250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  $320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location
homes.com:
  sold-at  contact-agent   extra-info
  $350K    (206) 634 9435  Beautiful yard
  $230K    (617) 335 4243  Close to Seattle
– If "office" occurs in the name => office-phone
– If "fantastic" & "great" occur frequently in data instances => description

Must Exploit Multiple Types of Information!
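As a minimal illustration of the schema-name clue above ("office" occurring in a name points to office-phone), a rule-style name matcher can score a source element name against mediated-schema elements by token overlap. This is a hypothetical sketch, not LSD's actual Name Learner; `name_score` and the score values are illustrative only.

```python
# Minimal sketch (hypothetical): score a source-schema element name against
# mediated-schema element names by token overlap.

def name_score(source_name: str, mediated_name: str) -> float:
    """Fraction of the mediated element's name tokens found in the source name."""
    src = set(source_name.lower().replace("-", " ").split())
    med = set(mediated_name.lower().replace("-", " ").split())
    return len(src & med) / len(med) if med else 0.0

mediated = ["price", "agent-name", "agent-phone", "office-phone", "description"]

# "office" occurring in the source name points to office-phone.
scores = {m: name_score("office", m) for m in mediated}
best = max(scores, key=scores.get)
```

Here `best` comes out as `office-phone`, since "office" overlaps half of that element's name tokens and none of the others'.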
(The same realestate.com / homes.com example as above, showing both schema-name clues and data-instance clues.)

Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
To match a schema element of a new source
– apply the base learners
– combine their predictions using a meta-learner
Meta-learner
– uses the training sources to measure base-learner accuracy
– weighs each learner based on its accuracy

Base Learners
[Figure: in training, examples (X1,C1), (X2,C2), ..., (Xm,Cm) produce a classification model (hypothesis); in matching, an observed object X is assigned labels weighted by confidence score]
Name Learner
– training: ("location", address), ("contact name", name)
– matching: agent-name => (name, 0.7), (phone, 0.3)
Naive Bayes Learner
– training: ("Seattle, WA", address), ("250K", price)
– matching: "Kent, WA" => (address, 0.8), (name, 0.2)

The LSD Architecture
[Figure: in the training phase, the mediated schema and source schemas provide training data for Base-Learner1 ... Base-Learnerk, which produce Hypothesis1 ... Hypothesisk; in the matching phase, the base learners are applied to the new source]
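The Naive Bayes base learner above can be sketched roughly as follows: a toy word-frequency classifier with add-one smoothing, not LSD's implementation. The class name and training values mirror the slide's example ("Seattle, WA" => address, "$250K" => price).

```python
# Minimal sketch (not LSD's implementation): a Naive Bayes base learner that
# classifies a data value by the words it contains.
import math
from collections import Counter, defaultdict

class NaiveBayesLearner:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter()

    def train(self, examples):                    # examples: [(value, label), ...]
        for value, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(value.lower().replace(",", " ").split())

    def predict(self, value):
        """Return (label, confidence) pairs sorted by confidence, summing to 1."""
        words = value.lower().replace(",", " ").split()
        vocab = set(w for c in self.word_counts.values() for w in c)
        log_scores = {}
        for label, n in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / sum(self.label_counts.values()))  # prior
            for w in words:                       # add-one smoothing over vocab
                score += math.log((self.word_counts[label][w] + 1)
                                  / (total + len(vocab)))
            log_scores[label] = score
        z = sum(math.exp(s) for s in log_scores.values())
        return sorted(((l, math.exp(s) / z) for l, s in log_scores.items()),
                      key=lambda p: -p[1])

nb = NaiveBayesLearner()
nb.train([("Seattle, WA", "address"), ("Miami, FL", "address"),
          ("$250K", "price"), ("$320K", "price")])
top_label, top_conf = nb.predict("Kent, WA")[0]
```

On this tiny training set, "Kent, WA" is labeled address, driven by the shared token "wa"; the exact confidence differs from the slide's illustrative 0.8.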
[Figure, continued: the base learners' predictions for instances feed the Meta-Learner and then the Prediction Combiner; predictions for elements, together with domain constraints, go to the Constraint Handler, which outputs the mappings; the Meta-Learner supplies weights for the base learners]

Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com: location, price, contact-name, contact-phone, office, comments
  Miami, FL   $250K  James Smith  (305) 729 0831  (305) 616 1822  Fantastic house
  Boston, MA  $320K  Mike Doan    (617) 253 1429  (617) 112 2315  Great location
Name Learner training examples
– ("location", address), ("price", price), ("contact name", agent-name), ("contact phone", agent-phone), ("office", office-phone), ("comments", description)
Naive Bayes Learner training examples
– ("Miami, FL", address), ("$250K", price), ("James Smith", agent-name), ("(305) 729 0831", agent-phone), ("(305) 616 1822", office-phone), ("Fantastic house", description), ("Boston, MA", address), ...

Meta-Learner: Stacking [Wolpert-92; Ting & Witten-99]
Training
– uses training data to learn weights, one for each (base-learner, mediated-schema element) pair
– e.g., weight(Name-Learner, address) = 0.2; weight(Naive-Bayes, address) = 0.8
Matching: combine the predictions of the base learners
– computes a weighted average of the base-learner confidence scores
Example: for the area column (Seattle, WA / Kent, WA / Bend, OR)
– Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)

The LSD Architecture
(The training-phase and matching-phase architecture figure is shown again.)
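The stacking step can be reproduced directly with the slide's numbers. This is a minimal sketch; the `weights` table and the helper `combine` are illustrative names, not LSD's API.

```python
# Minimal sketch of the stacking step: the meta-learner combines base-learner
# confidence scores using one learned weight per (learner, element) pair.

weights = {                               # learned from the training sources
    ("name_learner", "address"): 0.2,
    ("naive_bayes", "address"): 0.8,
}

def combine(element, predictions):
    """Weighted average of base-learner confidences for one mediated element."""
    return sum(weights[(learner, element)] * conf
               for learner, conf in predictions.items())

# For "Seattle, WA": Name Learner says (address, 0.4), Naive Bayes (address, 0.9).
score = combine("address", {"name_learner": 0.4, "naive_bayes": 0.9})
# 0.4*0.2 + 0.9*0.8 = 0.8, matching the slide
```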
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
For the area column (Seattle, WA / Kent, WA / Bend, OR):
– the Name Learner and Naive Bayes produce per-instance predictions, e.g. (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4)
– the Meta-Learner combines them per instance, e.g. (address, 0.7), (description, 0.3)
– the Prediction Combiner aggregates over instances: area => (address, 0.7), (description, 0.3)
Combined predictions for homes.com:
– sold-at => (price, 0.9), (agent-phone, 0.1)
– contact-agent => (agent-phone, 0.9), (description, 0.1)
– extra-info => (address, 0.6), (description, 0.4)

Domain Constraints
Encode user knowledge about the domain; specified by examining the mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id, then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination, can verify whether it satisfies a given constraint, e.g.:
  area: address, sold-at: price, contact-agent: agent-phone, extra-info: address

The Constraint Handler
Predictions from the Prediction Combiner:
– area => (address, 0.7), (description, 0.3)
– sold-at => (price, 0.9), (agent-phone, 0.1)
– contact-agent => (agent-phone, 0.9), (description, 0.1)
– extra-info => (address, 0.6), (description, 0.4)
Domain constraint: at most one element matches address
Candidate combinations and scores:
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
– area: description, sold-at: agent-phone, contact-agent: description, extra-info: description => 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268 (best valid combination)
– searches the space of mapping combinations efficiently
– can handle arbitrary constraints
– also used to incorporate user feedback, e.g. "sold-at does not match price"

The Current LSD System
Can also handle data in XML format: matches XML DTDs
Base learners
– Naive Bayes [Duda & Hart-93; Domingos & Pazzani-97]: exploits frequencies of words & symbols
– WHIRL nearest-neighbor classifier [Cohen & Hirsh, KDD-98]: employs an information-retrieval similarity metric
– Name Learner [SIGMOD-01]: matches elements based on their names
– County-Name Recognizer [SIGMOD-01]: stores all U.S. county names
– XML Learner [SIGMOD-01]: exploits the hierarchical structure of XML data

Empirical Evaluation
Four domains: Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– created a mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements; source schemas: 13 - 48
Ten runs per domain; in each run
– manually provided 1-1 mappings for 3 sources
– asked LSD to propose mappings for the remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified

High Matching Accuracy
[Chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
– LSD's accuracy: 71 - 92%
– best single base learner: 42 - 72%
– + meta-learner: + 5 - 22%
– + constraint handler: + 7 - 13%
– + XML learner: + 0.8 - 6%

Contribution of Schema vs. Data
[Chart: average matching accuracy (%) on the four domains for LSD with only schema information, LSD with only data information, and the complete LSD]
More experiments in [Doan et al., SIGMOD-01]

LSD Summary
LSD
– learns from previous matching activities
– exploits multiple types of information, by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy
LSD focuses on 1-1 mappings
Next challenge: discover more complex mappings!
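The constraint handler's search can be illustrated with the slide's own numbers. This exhaustive sketch is for clarity only; LSD searches the space of mapping combinations efficiently rather than enumerating it.

```python
# Minimal sketch of the constraint handler: enumerate mapping combinations,
# score each by the product of its confidences, and keep the best one that
# satisfies the domain constraint "at most one source element matches address".
from itertools import product

predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def satisfies_constraints(mapping):
    targets = [t for t, _ in mapping.values()]
    return targets.count("address") <= 1

best_mapping, best_score = None, 0.0
elements = list(predictions)
for combo in product(*predictions.values()):
    mapping = dict(zip(elements, combo))
    if not satisfies_constraints(mapping):
        continue
    score = 1.0
    for _, conf in combo:
        score *= conf
    if score > best_score:
        best_mapping, best_score = mapping, score
# best: area->address, sold-at->price, contact-agent->agent-phone,
# extra-info->description, with score 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
```

The unconstrained top combination (score 0.3402) maps both area and extra-info to address and is rejected, so the handler settles on the 0.2268 combination, exactly as on the slide.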
– the COMAP (Complex Mapping) system

The COMAP Approach
Mediated schema: price, num-baths, address
homes.com:
  listed-price  agent-id  full-baths  half-baths  city     zipcode
  320K          53211     2           1           Seattle  98105
  240K          11578     1           1           Miami    23591
For each mediated-schema element
– searches the space of all mappings
– finds a small set of likely mapping candidates
– uses LSD to evaluate them
To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...

The COMAP Architecture [Doan et al., 02]
[Figure: the mediated schema and the source schema + data feed Searcher1, Searcher2, ..., Searcherk, which produce mapping candidates; the candidates pass through LSD (Base-Learner1 ... Base-Learnerk, Meta-Learner, Prediction Combiner, and the Constraint Handler with domain constraints) to yield the mappings]

An Example: Text Searcher
Beam search in the space of all concatenation mappings
Example: find mapping candidates for address
homes.com:
  listed-price  agent-id  full-baths  half-baths  city     zipcode
  320K          532a      2           1           Seattle  98105
  240K          115c      1           1           Miami    23591
Candidate columns:
– concat(agent-id,city): "532a Seattle", "115c Miami"
– concat(agent-id,zipcode): "532a 98105", "115c 23591"
– concat(city,zipcode): "Seattle 98105", "Miami 23591"
Best mapping candidates for address
– (agent-id, 0.7), (concat(agent-id,city), 0.75), (concat(city,zipcode), 0.9)

Empirical Evaluation
Current COMAP system: eight searchers
Three real-world domains
– in real estate & product inventory
– mediated schemas: 6 -- 26 elements; source schemas: 16 -- 31
Accuracy: 62 -- 97%
Sample discovered mappings
– agent-name = concat(first-name, last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)

Road Map
– Introduction
– Schema matching: LSD system
– Creating complex mappings: COMAP system
– Ontology matching: GLUE system
– Conclusions

Ontology Matching
Increasingly critical for knowledge bases and the Semantic Web
An ontology
– concepts organized into a taxonomy tree
– each concept has a set of attributes and a set of instances
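The Text Searcher's candidate generation might be sketched as follows: a toy enumerator of pairwise concatenation mappings over the source columns. The real system uses beam search and LSD's learners to score the candidates; the function and column names here are illustrative.

```python
# Minimal sketch (hypothetical): enumerate single-column and pairwise
# concatenation mappings over source columns, as the Text Searcher's
# candidate space for a mediated element like address.
from itertools import combinations

columns = {
    "agent-id": ["532a", "115c"],
    "city":     ["Seattle", "Miami"],
    "zipcode":  ["98105", "23591"],
}

def candidates(cols):
    """Single columns plus all pairwise concatenations, keyed by mapping name."""
    cands = {name: vals for name, vals in cols.items()}
    for a, b in combinations(cols, 2):
        cands[f"concat({a},{b})"] = [f"{x} {y}" for x, y in zip(cols[a], cols[b])]
    return cands

cands = candidates(columns)
# e.g. cands["concat(city,zipcode)"] == ["Seattle 98105", "Miami 23591"]
```

Each candidate's synthesized value column would then be handed to LSD's learners, which score it against the mediated element exactly as they score an ordinary source column.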
– relations among concepts
[Figure: the CS Dept. US taxonomy: Entity splits into Courses (Undergrad Courses, Grad Courses) and People (Faculty: Assistant Professor, Associate Professor, Professor; and Staff); a Professor instance has name: Mike Burns, degree: Ph.D.]
Matching involves concepts, attributes, and relations

Matching Taxonomies of Concepts
[Figure: the CS Dept. US taxonomy (Entity; Undergrad Courses, Grad Courses; People: Faculty with Assistant Professor, Associate Professor, Professor, and Staff) matched against the CS Dept. Australia taxonomy (Entity; Courses; Staff: Academic Staff with Lecturer, Senior Lecturer, Professor, and Technical Staff)]

Constraints in Taxonomy Matching
Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
– two nodes match if their parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik & Garcia-Molina, ICDE-02; Milo & Zohar, VLDB-98; Noy et al., IJCAI-01; Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach

Solution: Relaxation Labeling
Relaxation labeling [Hummel & Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds the best label assignment, given a set of constraints
– starts with an initial label assignment
– iteratively improves labels, using the constraints
Standard relaxation labeling is not directly applicable; extended it in many ways [Doan et al., WWW-02]
Experiments
– three real-world domains in course catalogs & company listings
– 30 -- 300 nodes per taxonomy
– accuracy 66 -- 97%, vs. 52 -- 83% for the best base learner
– relaxation labeling is very fast (under a few seconds)

Related Work
Hand-crafted rules; exploit schema; 1-1 mappings
– TRANSCM [Milo & Zohar-98], ARTEMIS [Castano & Antonellis-99], [Palopoli et al.-98], CUPID [Madhavan et al.-01], PROMPT [Noy et al.-00]
Rules; exploit data; 1-1 + complex mappings
– CLIO [Miller et al.-00], [Yan et al.-01]
Single learner; exploit data; 1-1 mappings
– SEMINT [Li & Clifton-94], ILA [Perkowitz & Etzioni-95], DELTA [Clifton et al.-97]
Learners + rules, using multi-strategy learning; exploit schema + data; 1-1 + complex mappings; exploit domain constraints
– LSD [Doan et al., SIGMOD-01], COMAP [Doan et al.-02, submitted], GLUE [Doan et al., WWW-02]

Future Work
Learning source descriptions
– formal semantics for mappings
– query capabilities, source schema, scope, reliability of data, ...
Dealing with changes in source descriptions
Matching objects across sources
More sophisticated user feedback
Focus on distributed information management systems
– data integration, web-service integration, peer data management
– goal: significantly reduce the complexity of construction & maintenance

Conclusions
Efficiently creating semantic mappings is critical
Developed a solution for semi-automatic schema matching
– learns from previous matching activities
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy
Made contributions to machine learning
– developed a novel method to classify XML data
– extended relaxation labeling

Backup Slides

Training the Meta-Learner
For address:
  Extracted XML instances       Name Learner  Naive Bayes  True predictions
  <location> Miami, FL </>      0.5           0.8          1
  <listed-price> $250,000 </>   0.4           0.3          0
  <area> Seattle, WA </>        0.3           0.9          1
  <house-addr> Kent, WA </>     0.6           0.8          1
  <num-baths> 3 </>             0.3           0.3          0
  ...
Least-squares linear regression over these examples yields:
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9

Sensitivity to Amount of Available Data
[Chart: average matching accuracy (%) vs. number of data listings per source (0 - 500), Real Estate I]

Contribution of Each Component
[Chart: average matching accuracy (%) on the four domains for LSD without the Name Learner, without Naive Bayes, without the WHIRL learner, without the constraint handler, and the complete LSD system]

Exploiting Hierarchical Structure
Existing learners flatten out all structures:
  <contact>
    <name> Gail Murphy </name>
    <firm> MAX Realtors </firm>
  </contact>
  <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
Developed an XML learner
– similar to the Naive Bayes learner: input instance = bag of tokens
– differs in one crucial aspect: considers not only text tokens but also structure tokens

Reasons for Incorrect Matchings
Unfamiliarity
– e.g., suburb; solution: add a suburb-name recognizer
Insufficient information
– correctly identified the general type, but failed to pinpoint the exact type
– e.g., agent-name vs. phone: "Richard Smith (206) 234 5412"
– solution: add a proximity learner
Subjectivity
– house-style = description? (house-style: Victorian, Mexican; description: "Beautiful neo-gothic house", "Great location")

Evaluate Mapping Candidates
For address, the Text Searcher returns
– (agent-id, 0.7)
– (concat(agent-id,city), 0.8)
– (concat(city,zipcode), 0.75)
Employ multi-strategy learning to evaluate the mappings
Example: (concat(agent-id,city), 0.8)
– Naive Bayes Learner: 0.8
– Name Learner: "address" vs. "agent id city" => 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
– (agent-id, 0.59)
– (concat(agent-id,city), 0.65)
– (concat(city,zipcode), 0.70)

Relaxation Labeling
Applied to similar problems in vision, NLP, and hypertext classification
[Figure: nodes of the Dept. Australia taxonomy (People, Courses, Acad. Staff, Tech. Staff, Staff) being labeled with nodes of the Dept. U.S. taxonomy (Courses, People, Faculty, Staff)]
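The least-squares weight training described above can be sketched with the slide's illustrative scores. This is a hand-rolled two-variable normal-equation solver (`lstsq_2` is a hypothetical helper); the fitted weights will not equal the slide's rounded 0.1/0.9, but the Naive Bayes learner does receive the larger weight.

```python
# Minimal sketch of weight training: fit one weight per base learner by least
# squares on (base-learner score, true match) pairs from the training sources.

def lstsq_2(x1, x2, y):
    """Solve the 2x2 normal equations for y ~ w1*x1 + w2*x2 (no intercept)."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    d1 = sum(u * v for u, v in zip(x1, y))
    d2 = sum(u * v for u, v in zip(x2, y))
    det = a * c - b * b
    return (d1 * c - b * d2) / det, (a * d2 - b * d1) / det

name_scores = [0.5, 0.4, 0.3, 0.6, 0.3]   # Name Learner confidences for address
nb_scores   = [0.8, 0.3, 0.9, 0.8, 0.3]   # Naive Bayes confidences
truth       = [1, 0, 1, 1, 0]             # 1 iff the instance really is an address

w_name, w_nb = lstsq_2(name_scores, nb_scores, truth)
# Naive Bayes, being more predictive on this data, gets the larger weight.
```

With these weights, the combined score separates the examples: the true addresses score above 0.5 and the non-addresses below it.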
Relaxation Labeling for Taxonomy Matching
Must define
– the neighborhood of a node
– k features f1, ..., fk of the neighborhood
– how to combine the influence of the features: P(N = L | f1, f2, ..., fk)
Algorithm
– init: for each pair <N, L>, compute P(N = L | Δ)
– loop: for each pair <N, L>, re-compute P(N = L | Δ) = Σ_M P(N = L | M, Δ) · P(M | Δ), summing over neighborhood configurations M
Example neighborhood configuration: Staff = People, with Acad. Staff: Faculty and Tech. Staff: Staff

Relaxation Labeling for Taxonomy Matching (cont.)
Huge number of neighborhood configurations!
– typically the neighborhood = immediate nodes; here the neighborhood can be the entire graph
– 100 nodes, 10 labels => 10^100 configurations
Solution
– label abstraction + dynamic programming
– guarantees quadratic time for a broad range of domain constraints
Empirical evaluation
– GLUE system [Doan et al., WWW-02]
– three real-world domains
– 30 -- 300 nodes per taxonomy
– high accuracy: 66 -- 97%, vs. 52 -- 83% for the best base learner
– relaxation labeling very fast, finished in several seconds
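A toy version of the relaxation-labeling loop might look like this. All probabilities, node names, and the single parent-compatibility feature are made up for illustration; GLUE's actual algorithm uses richer neighborhood features, label abstraction, and dynamic programming.

```python
# Toy sketch of relaxation labeling for taxonomy matching: start from initial
# label probabilities and iteratively re-weight each node's labels by how
# compatible they are with the current labels of its parent, renormalizing
# each round.

parents = {"Acad. Staff": "Staff"}                 # source-taxonomy edges
target_parents = {"Faculty": "People",             # parent of each label in the
                  "Courses": "Entity",             # target taxonomy
                  "Staff": "People"}

# initial P(node = label) from the base learners (made-up numbers)
probs = {
    "Staff":       {"People": 0.7, "Entity": 0.3},
    "Acad. Staff": {"Courses": 0.55, "Faculty": 0.45},
}

def compatible(label, parent_label):
    """A child's label fits if that label's target parent equals parent_label."""
    return target_parents.get(label) == parent_label

for _ in range(5):                                 # a few relaxation rounds
    new = {}
    for node, dist in probs.items():
        parent = parents.get(node)
        scores = {}
        for label, p in dist.items():
            boost = 1.0
            if parent is not None:                 # parent-compatibility feature
                boost += sum(q for pl, q in probs[parent].items()
                             if compatible(label, pl))
            scores[label] = p * boost
        z = sum(scores.values())
        new[node] = {l: s / z for l, s in scores.items()}
    probs = new

best = {node: max(dist, key=dist.get) for node, dist in probs.items()}
```

Although "Acad. Staff" initially leans toward Courses (0.55 vs. 0.45), its parent's strong People label boosts the Faculty hypothesis each round, so the constraint flips the decision, which is the effect relaxation labeling is meant to produce.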