Semantic Interoperability: Automatically Resolving Vocabularies
Chuck Mosher
8500 Leesburg Pike
Vienna, VA
cmosher@metamatrix.com
4th Semantic Interoperability Conference
February 10, 2006

Interoperable Information Backbone
[Diagram labels: Enterprise Data Service Layer; Applications; MetaMatrix; Data Sources]
• Enterprise-wide data abstraction layer for applications
• Integrated views of data from multiple sources
  – Relational databases, applications, files
• Reusable Data Services for data consistency
• Metadata-driven data management and integration
• Complements other data integration tools (ETL, EAI, quality, etc.)

Data Services
• A type of Web Service
• Does all of the work to transform any data in any format to a W3C-compliant service
  – Implements all of the logic to effect the transformation
  – Provides access to data sources, regardless of source API or technology
• Does not implement application logic
• Decouples the data from the application while making the data discoverable and accessible

Model-Based Approach Maximizes Re-use
Data Abstraction Without Coding
[Diagram labels:
 Enterprise Information Sources (EIS): databases, services, warehouses, geo-spatial, rich media, xml, spreadsheets, …
 Reusable Integrated Business Objects
 Exposed Information Services: SOAP, JDBC, ODBC; <WSDL> (contract), shown three times
 Information Consumers: Web Services, Business Processes; Packaged Apps; Custom Apps; EAI, Data warehouses; Reporting, Analytics
 XML sample: <sale/> <value/> </ sale >]

Meta Object Facility (MOF)
[Diagram labels: Metamodel; Model; Data]

MetaMatrix MetaBase Modeler
• Model disparate information sources
  – Relational DBs
  – Content Management Systems
  – Files
  – Services
  – Applications
• Uses and retains domain-specific modeling terminology
  – Relational models have “Tables”, “Foreign Keys”, “Columns”, etc.
  – UML models have “Packages”, “Classes”, “Attributes”, etc.
MetaMatrix MetaBase Modeler
• Define reusable data services/business objects
• Transformations defined with:
  – Selects
  – Joins
  – Criteria
  – Unions
  – Functions
  – User defined
• Perform schema and semantic matching, data type conversion

Semantic Mediation: The Problem
[Diagram labels, top to bottom:
 Consumers: Business Intelligence Applications (ODBC/JDBC); Portal Applications (JDBC); Web Services (SOAP)
 Virtual XML Document: <a> <b> … </b> </a>
 Logical Data Model, connected through "T" markers
 Source fields: Location_ID, bldg_type, bldg_id, Location_Type, Depot_Number, SITENUM, Facility_ID
 Multiple Internal/External Information Sources]
Aggregate Data Services:
• Relational or XML
• Application-specific
• Access via ODBC, JDBC, or SOAP APIs
Enterprise-wide or COI-driven Data Model
• Rationalization and Semantic mediation Layer
• Harmonization
• Data Catalog/Dictionary
Data Sources
  – Authoritative
  – Redundant
  – Overlapping

Building Enterprise Semantic Model(s)
[Diagram labels, top to bottom:
 Consumers: Business Intelligence Applications (ODBC/JDBC); Portal Applications (JDBC); Web Services (SOAP)
 J-8 Force Structure; J-7 Operational Plans; J-6 C4CS; J-5 Plans & Policy; J-4 Logistics (GCSS); J-3 Operations; J-2 Intelligence; J-1 Manpower / Personnel
 "T" markers
 Multiple Internal/External Information Sources]
Enterprise-wide or COI-driven Data Models
• Rationalization
• Harmonization
• Data Catalogs
Data Sources
  – Authoritative
  – Redundant
  – Overlapping

Biggest Challenge in Creating Data Services?
• Semantics!!!
• Structural differences are straightforward
• Differing definitions among data sources
• Differing vocabularies among COIs
• Established, emerging, and evolving data standards
  – C2IEDM, JC3IEDM, GJXDM, NIEM, GFM, many more
• Not addressed by ETL, EAI, SOA

A Previously Intractable Problem
• TWPDES has 1000+ core entities
• NIEM has 100,000+!
• Even a limited program with a dozen data sources could yield tens of thousands of potential mappings
• Humans cannot address this without help
• Indeed, it has stopped many data integration/reconciliation programs in their tracks.
Automated Semantic Matching

DISCLAIMER
• Semantic matching can't really be done automatically yet!
• It requires intelligence to understand the context and semantics.
• So use computers to do most of the work, but then have the user confirm or check the result.

The Matching Problem
• Given two symbols, calculate a measure of the relationship between them:
    amount    quantity
Doesn’t seem so hard…

The Matching Problem
• Given two symbols, calculate a measure of the relationship between them:
    ftuqky    aqfkyeyr
This is what a computer “sees.”

The Matching Problem
• Even after extracting likely symbols, matching is a difficult problem.
• Symbols alone are not enough to generate good matches:
  – “ID” -> “SocialSecurityNumber” or “NY”
• The solution relies on context:
  – “NJ”, “MA”, “CA”, “ID”
  – “Ego”, “SuperEgo”, “ID”
• MatchIt provides that context

MatchIT 1.0
• Integrated component of the MetaMatrix Semantic Data Services product
• Based on an ontology-driven semantic knowledge base
  – Word relationships, dictionaries, lexicons, thesauri
• Plug-in architecture
• Standards-compliant:
  – OWL
  – RDF
  – Inference engines
  – OSGi
  – Eclipse
  – JDBC

(Semi-)Automated Semantic Mediation
* An extensible semantic knowledge base provides dictionary- and thesaurus-like information on “words”, their “meanings”, and their relationships to other words.
* A sophisticated set of matching algorithms provides string similarity matches and semantic matches with confidence ratings and explanations.
[Diagram labels: Ontology; “Sex” semantically related to “Gender”; Gender ID; Matched (Confidence of 90%); Person Sex Code; Data Source Services: FBI, CBP, NYC, NY, NJ]

Matching Techniques
• MatchIT uses two types of matching techniques:
  – String Matching
    • Attempts to determine how similar two strings are based on the lexical distance between them.
  – Semantic Matching
    • Attempts to determine how similar two strings are based on the ontological distance between them within a semantic ontology.
• Generate Match Sets
• Can be run individually or in combinations
• Pluggable architecture allows for algorithmic extensibility

String Matching
• What is the lexical distance between two symbols?
  – “PUZZLE”, “PUZZ”
  – “ID”, “IDENTIFIER”
  – “STRONG”, “SONG”

Semantic Matching
• How semantically similar are two concepts?
[Diagram: "is a" hierarchy over the concepts vehicle, wheeled vehicle, self-propelled vehicle, motor vehicle, car, truck, craft, aircraft, heavier-than-air craft, airplane]
car and truck are very similar
car and airplane are less similar

Semantic Matching Objectives
• Find and rank the potential matches, but let the user review and decide for sure.
• I.e., eliminate 99+% of the things that don't match, and let the user review the <1%.
• Many times, a user can visually scan a small list of the top 1% and very quickly agree or disagree with the results.
• Favor false positives over false negatives.

Semantic Matching in MetaMatrix
[Diagram labels:
 Enterprise Information Sources: Custom, Any Source, XML, JDBC, File System, RDBMS
 Conceptual/Logical/Physical Data Models: Relational Domain [UML/ER]; XML; Ontologies [OWL/RDF]; Representations
 MetaMatrix Connector Framework: Data/Content Access
 MetaMatrix Importer Framework: Import, Export
 MatchIt: Find Matches; Data Harmonization; Complete; Schema-level Match; Instance-level Match
 MetaBase Modeler: Metadata Access; Analyze, Visualize, Collaborate, Transform
 Semantic Knowledge Base: Ontology; Ontological Semantics Access; Fact Repository; Onomasticons; Lexicons
 MetaBase Repository: Models & Files [versioned]; Search Index
 Web Reporting]

Example

Overall process
• Import two nontrivial vocabularies
  – ERwin model of large data warehouse
  – TWPDES XML schema
• Extract symbols
  – Schema-specific tokenization algorithms
• Assign semantics to each
  – Symbols are keys into dictionaries
• Perform semantic matching between them
• Analyze results

ERwin Data Warehouse Model

TWPDES XML Schema
Mapping Classes for each XML fragment in hierarchy
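The String Matching and Semantic Matching slides describe lexical and ontological distance only at a high level, and the Overall process slide mentions tokenization without detail. The following is a minimal Python sketch of those ideas, not MatchIt's actual algorithms. It assumes Levenshtein edit distance as the lexical measure, path length through the nearest common ancestor as the ontological distance, and an is-a table that is a guessed reconstruction of the vehicle diagram. All function names, the IS_A table, and the scoring formulas are hypothetical.

```python
import re

def tokenize(name):
    """Split a schema identifier (e.g. 'Location_ID') into lowercase symbols."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # split camelCase
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def string_similarity(a, b):
    """Lexical score in [0, 1]; 1.0 means identical ignoring case (assumed formula)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

# Assumed child -> parent links, reconstructed from the slide's concept names.
IS_A = {
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "self-propelled vehicle",
    "self-propelled vehicle": "wheeled vehicle",
    "wheeled vehicle": "vehicle",
    "airplane": "heavier-than-air craft",
    "heavier-than-air craft": "aircraft",
    "aircraft": "craft",
    "craft": "vehicle",
}

def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def ontological_distance(a, b):
    """Count is-a edges between a and b via their nearest common ancestor."""
    depth_in_b = {c: i for i, c in enumerate(ancestors(b))}
    for i, c in enumerate(ancestors(a)):
        if c in depth_in_b:
            return i + depth_in_b[c]
    return None  # no common ancestor in this toy ontology

def semantic_similarity(a, b):
    """Semantic score in (0, 1]; 0.0 when the concepts are unrelated (assumed formula)."""
    d = ontological_distance(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

def rank_matches(symbol, candidates, scorer, threshold=0.0):
    """Score every candidate and return (candidate, score) pairs, best first."""
    scored = [(c, scorer(symbol, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)

if __name__ == "__main__":
    print(tokenize("Location_ID"))
    for pair in [("PUZZLE", "PUZZ"), ("ID", "IDENTIFIER"), ("STRONG", "SONG")]:
        print(pair, round(string_similarity(*pair), 2))
    print(rank_matches("car", ["airplane", "truck"], semantic_similarity))
```

Returning a ranked list with a low default threshold, rather than a single best match, follows the stated objective of favoring false positives and leaving the final decision to the user.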
Generated Symbol Dictionary (TWPDES)

Generated Symbol Dictionary (ERwin model)

Editing the Dictionary
Modify Definition

Editing the Semantics
Control Senses

Target Model Match Results

Examine Details

Match Details

Matches Used to Build Mappings

The Integrating Function of the Common Semantic Model – via Domain-level Mapping
From Pat Cassidy & COSMO
[Diagram labels: GenericObligation SameAs Obligation; SameAs Duty]

MatchIt Semantic Matching Tool
• A way to use ontologies in a world where nearly 100% of what already exists is not in an ontology.
• Map connections between ontologies that are being built and artifacts currently in use:
  – RDBMS schemas
  – XML and XSD files
  – Spreadsheet data
  – More coming, including ontologies!
• Map an imported model to a Vocabulary, and a Vocabulary to an Ontological structure

Thank you