Semantic Web & Semantic Web Processes (continued, Part II)
A course at Universidade da Madeira, Funchal, Portugal, June 16-18, 2005

Dr. Amit P. Sheth
Professor, Computer Sc., Univ. of Georgia
Director, LSDIS lab
CTO/Co-founder, Semagix, Inc
Special Thanks: Cartic Ramakrishnan, Karthik Gomadam

Part II
• Metadata, Enabling techniques and technologies
• Automated metadata extraction and annotation
• Computation and reasoning with focus on relationships
• Example commercial Semantic Web platform for building ontology-driven applications

What is Metadata?
• Data about data
– Statements, contexts
– Recursive – data about “data about data”
• Applications
– Content management
– Cataloguing
– Information retrieval, search
– …

"A Web content repository without metadata is like a library without an index," - Jack Jia, IWOV

A continuum – from data to knowledge

Metadata extraction from heterogeneous content/data
[Figure: content sources (WWW, Enterprise Repositories; Nexis, UPI, AP; Feeds/Documents; Digital Videos; Digital Audios; Digital Images; Digital Maps; Data Stores; ...) feed EXTRACTORS, which produce METADATA]
Create/extract as much (semantic) metadata automatically as possible, from:
• Any format (HTML, XML, RDB, text, docs)
• Many media
• Push, pull
• Proprietary, Deep Web, Open Source

Extracting a Text Document: Syntactic approach
[Slide annotations on the sample report: LAYOUT; Date => day month int ‘,’ int]

INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM.
A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning.

Organizing Information: Automatic Classification

Traditional Text Categorization
[Figure labels: Customer Training Set; Statistical/AI Techniques; Classify (place in a taxonomy); Routing/Distribution; Customer; Article Feed 4715; Classification of Article 4715]
Standard Metadata
• Feed Source: iSyndicate
• Posted Date: 11/20/2000

Taalee’s Categorization & Automatic Metadata Creation
[Figure labels: Knowledge-base & Statistical/AI Techniques; Taalee Training Set; Customer Training Set; Classify (place in a taxonomy); Catalog Metadata; Automated Content Enrichment (ACE); Article Feed 4715; Taalee Enterprise Content Manager; Customization Suite; Precise syndication/filtering; Routing/Distribution; Map to another taxonomy]
Article 4715 Metadata
• Standard metadata
– Feed Source: iSyndicate
– Posted Date: 11/20/2000
• Semantic metadata
– Company Name: France Telecom, Equant
– Ticker Symbol: FTE, ENT
– Exchange: NYSE
– Topic: Company News
Classification of Article 4715
• FTE: Company Analysis, Conference Calls, Earnings, Stock Analysis
• ENT: Company Analysis, Conference Calls, Earnings, Stock Analysis
• NYSE: Member Companies, Market News, IPOs

Automatic Categorization & Metadata Tagging
Video Segment with Associated Text

ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO.
GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN.
[Figure labels: Segment Description; Auto Categorization; Semantic Metadata]

Automatic Categorization & Metadata Tagging (Web page)
Video with Editorialized Text on the Web
[Figure labels: Auto Categorization; Semantic Metadata]

Automatic Categorization & Metadata Tagging (Feed)
Text From Bloomberg
[Figure labels: Auto Categorization; Semantic Metadata]

Taalee Extraction and Knowledgebase Enhancement
[Figure labels: Web Page; Enhanced Metadata Asset; Extraction Agent]
Taalee, Inc. 1999-2002
Sheth et al, 2002 Managing Semantic Content for the Web

Semantic Enhancement Server
The Semantic Enhancement Server classifies content into the appropriate topic/category (if not already pre-classified), and subsequently performs entity extraction and content enhancement with semantic metadata from the Semagix Freedom Ontology.
How does it work?
• Uses a hybrid of statistical, machine learning and knowledge-base techniques for classification
• Not only classifies, but also enhances semantic metadata with associated domain knowledge
© Semagix, Inc.

Ambiguity Resolution during Metadata Extraction from content text
[Figure: flowchart in which the Document goes through Ontology lookup in the SES to produce Entity Candidates]
• Find Entity Candidates in the document:
– Names and Synonyms
– Common variations (Jr, Sr, III, PLC, .com, etc.)
– ...
– Note: Entity Candidates can be restricted to a relevant subset of ontology
• Multiple matches found during entity lookup? (Yes / No)
• If yes, resolve ambiguities for the entity using any/all of these criteria:
– Direct/Indirect relationships with other entities found
– Proximity analysis of related entities
– Entity refinement using subset analysis (‘Doe’ vs. ‘John Doe’)
• Ambiguity resolved
• List relationships between identified entities in same document (optional in output)
• List relationship trails e.g.
CompExec → position → CompanyName
Politician → party → country → watchList

Overcoming the key issue of resolving ambiguities in facts & evidence
• Aggregation and normalization of any type of fact and evidence into the domain ontology
– Resolution of issues over terminology
• e.g., “Benefit number” is an alias of “SSN”
– Resolution of issues over identity
• e.g., is executive “Larry Levy” an existing entity or a new entity?
– Enabling decisions to be made on the trustworthiness of existing facts
• Which source did the data originate from?
• How much supporting evidence was there?
– Validating and enforcing constraints, e.g. cardinality
• President of the United States (has cardinality) = Single
• Terrorist (has cardinality) = Multiple

Overcoming the key issue of resolving ambiguities in facts & evidence (Contd…)
• Managing temporal aspects of the domain
– Expiration of entity instances
– E.g., “Hillary Clinton” is no longer the First Lady of the United States but was until “May 3rd 2001”
• Providing auditing capabilities
– Stamping evidence with date, time and source
– E.g., Terrorist: “Seamus Monaghan”; date extracted: “2003-01-30”; time extracted: 16:45:27; source: FBI Watch list
• Ontological relationships make for a more expressive model and provide a better semantic description (compared to taxonomies)
– Information can be presented in natural language format
– E.g., “Bob Scott” is a founder member of business entity “AIX LLP” that has traded in “Iran” that is on “FATF watch-list”

Example Scenario 1
Sample content text: Have you ever been to Athens? How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe]
- B: Athens[, Georgia, United States of America, North America]
- C: Athens[, Ohio, United States of America, North America]
- D: Athens[, Tennessee, United States of America, North America]
- E: Japan[, Asia]
Scores: A, B, C, D and E all scored equally, hence no ambiguity resolution is possible

Example Scenario 2
Sample content text: Have you ever been to Athens?
Or anywhere else in Georgia? How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe]
- B: Athens[, Georgia, United States of America, North America]
- C: Athens[, Ohio, United States of America, North America]
- D: Athens[, Tennessee, United States of America, North America]
- E: Georgia[, Asia]
- F: Georgia[, United States of America, North America]
- G: Georgia On My Mind, Inc.
- H: Japan[, Asia]
Scores: B and F scored highest because of exact text match and relationship
Result: Entity Ambiguity Resolved

Automatic Semantic Annotation of Text: Entity and Relationship Extraction
KB, statistical and linguistic techniques
Semantic Enhancement Engine, 2002

Metadata Extraction and Semantic Enhancement [Hammond, Sheth, Kochut 2002]

Automatic Semantic Annotation
• COMTEX Tagging: limited tagging (mostly syntactic)
• Value-added Semagix Semantic Tagging: content ‘enhancement’, rich semantic metatagging
Value-added relevant metatags added by Semagix to existing COMTEX tags:
• Private companies
• Type of company
• Industry affiliation
• Sector
• Exchange
• Company Execs
• Competitors

Ontology Design – Fundamental Principles
• There is no one correct way to model a domain; there are always viable alternatives.
• The best solution almost always depends on the application that you have in mind and the extensions that you anticipate.
• Ontology development is necessarily an iterative process.
• Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest.
• These are most likely to be nouns (objects) or verbs (relationships) in sentences that describe your domain.
Ontology Development 101: A Guide to Creating Your First Ontology, Natalya F. Noy and Deborah L. McGuinness

Semantic Associations: Beyond simple relationships
Mechanisms for querying about and retrieving complex relationships between entities.
1. A is related to B by x.y.z
[Figure: graph over nodes A, B and C with edges labelled x, y, y’, z, z’, u and v, and a “?” between B and C; Semantic Discovery project]
2.
A is related to C by
i. x.y’.z’
ii. u.v (undirected path)
3. A is “related similarly” to B as it is to C (y’ ⊆ y and z’ ⊆ z ⟹ x.y.z ≈ x.y’.z’)
So are B and C related?

Why do we need this?
• Exploit the ability to create ontologies and metadata for knowledge discovery and gaining insight
• Very useful in information analytics
– national security
– business intelligence
– biology

Association: Semantically Connected
• Two entities e1 and en are semantically connected if there exists a sequence e1, P1, e2, P2, e3, …, en-1, Pn-1, en in an RDF graph where ei, 1 ≤ i ≤ n, are entities and Pj, 1 ≤ j < n, are properties
[Figure: Semantically Connected; RDF graph with resources &r1, &r5, &r6 and literals “M’mmed”, “Atta”, “Abdulaziz”, “Alomari”]

Association: Semantic Similarity
• Two entities are semantically similar if both have ≥ 1 similar paths starting from the initial entities, such that for each segment of the path:
– Property Pi is either the same as, or a subproperty of, the corresponding property in the other path
– Entity Ei belongs to the same class, to classes that are siblings, or to a class that is a subclass of the corresponding class in the other path
[Figure: Semantic Similarity; two parallel paths, &r1 purchased &r2 paidby &r3 and &r7 purchased &r8 paidby &r9, over the classes Passenger, Ticket and Cash, with lname literals “M’mmed”, “Atta”, “Marwan”, “Al-Shehhi”]

The Need For Ranking
• Current test bed with > 6,000 entities and > 11,000 explicit relations
• The semantic association query (“Nasir Ali”, “AlQeada”) results in 2,234 associations
• The results must be presented to the user in a relevant fashion, hence the need for ranking

SemRank: Ranking complex relationship search

Context: Why, What, How?
• Context => Relevance; Reduction in computation space
• Context captures the user's interest, so that the user is given the relevant knowledge from among the numerous relationships between the entities
• By defining regions (or sub-graphs) of the ontology, we capture the user's areas of interest

Context Weight - Example
[Figure: ontology sub-graph with entities e1:Person, e2:Financial Organization, e3:Organization, e4:Terrorist Organization, e5:Person, e6:Financial Organization, e7:Terrorist Organization, e8:Terrorist Attack, e9:Location; relationships has Account, supports, works For, involved In, member Of, friend Of, located In, at location]
• Region1: Financial Domain, weight=0.50
• Region2: Terrorist Domain, weight=0.75

Semantic (Web) Technology – State of the Art

Semantic Technology – Key Features
• Design ontology schema
• Automatically populate ontology with domain knowledge (at Enterprise Scale)
• Maintain freshness of ontology (almost) automatically
• Processing of heterogeneous information (structured, semistructured and unstructured)
• Automatic Semantic Metadata Extraction using lexical, statistical or NLP techniques*
• Automatic Semantic Metadata Extraction using populated ontology (Knowledgebase approach)
• Logic based reasoning (inferencing)
• Graph/relationship traversal based reasoning

Ontology-driven Information System Lifecycle
[Figure labels: Schema Creation; Ontology Population; Metadata Extraction; Application Creation; Analytic Application Creation; Semantic Visualization; Ontology API; MB; KB; BSBQ]

Semagix Freedom Architecture: for building ontology-driven information system
Sheth et al, 2002 Managing Semantic Content for the Web

Ontology Creation and Maintenance Steps
[Figure labels: Ontology; Semantic Query Server]
1. Ontology Model Creation (Description)
2. Knowledge Agent Creation
3. Automatic aggregation of Knowledge
4. Querying the Ontology
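
Looking back at the entity disambiguation in Example Scenarios 1 and 2, the scoring idea can be sketched in a few lines of Python. This is only an illustration, not the Semantic Enhancement Server's actual algorithm. The `Candidate` class, the helpers `contains`, `score` and `resolve`, and the weights (1.0 for an exact name match, 0.5 for a partial match, +1.0 per directly related candidate) are all assumptions, chosen so that the toy data behaves as the two scenarios describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    key: str        # label used on the slide (A, B, ...)
    name: str       # entity name as stored in the (toy) ontology
    parents: tuple  # containment chain, nearest ancestor first


def contains(outer, inner):
    """True if `inner` sits directly inside `outer` in the containment chain."""
    return inner.parents == (outer.name,) + outer.parents


def score(cand, mention, everyone):
    # Assumed weights: 1.0 for an exact name match, 0.5 for a partial one.
    total = 1.0 if cand.name == mention else 0.5
    # Assumed bonus: +1.0 for each other candidate that is directly related.
    for other in everyone:
        if other is not cand and (contains(other, cand) or contains(cand, other)):
            total += 1.0
    return total


def resolve(matches):
    """Map each mention to its single best candidate key, or None on a tie."""
    everyone = [c for cands in matches.values() for c in cands]
    result = {}
    for mention, cands in matches.items():
        scored = sorted(((score(c, mention, everyone), c.key) for c in cands),
                        reverse=True)
        tied = len(scored) > 1 and scored[0][0] == scored[1][0]
        result[mention] = None if tied else scored[0][1]
    return result


US = ("United States of America", "North America")
athens = [
    Candidate("A", "Athens", ("Greece", "Europe")),
    Candidate("B", "Athens", ("Georgia",) + US),
    Candidate("C", "Athens", ("Ohio",) + US),
    Candidate("D", "Athens", ("Tennessee",) + US),
]

# Scenario 1: "Have you ever been to Athens? How about Japan?"
scenario1 = {"Athens": athens, "Japan": [Candidate("E", "Japan", ("Asia",))]}

# Scenario 2 adds the mention "Georgia" to the text.
scenario2 = {
    "Athens": athens,
    "Georgia": [
        Candidate("E", "Georgia", ("Asia",)),
        Candidate("F", "Georgia", US),
        Candidate("G", "Georgia On My Mind, Inc.", ()),
    ],
    "Japan": [Candidate("H", "Japan", ("Asia",))],
}

print(resolve(scenario1))
print(resolve(scenario2))
```

Under these assumed weights, Scenario 1 leaves “Athens” unresolved because its four candidates tie, while in Scenario 2 candidates B and F reinforce each other through the containment relationship and are selected.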