Metadata, Enabling techniques and technologies
A CSCI8350 lecture
Amit P. Sheth

Metadata, Enabling techniques and technologies
• What is Metadata?
• Metadata Descriptions and Standards
• Metadata Storage/Exchange/Infrastructure
• (Automated) Metadata Creation/Extraction/Tagging
• Metadata Usage/Applications

What is Metadata?
• Data about data
  – Statements, contexts
  – Recursive – data about “data about data”
• Applications
  – Content management
  – Cataloguing
  – Information retrieval, search
  – …
"A Web content repository without metadata is like a library without an index," - Jack Jia, IWOV

Information Interoperability: key metadata objective and benefit
• System
• Syntax
• Structure
• Semantics
Protocols Metadata Domain Modeling, Ontologies
A continuum – from data to knowledge

Types of Metadata for digital media
• Media type-specific metadata
  – e.g., texture of images, font size…
• Media processing-specific metadata
  – e.g., search, retrieval, personalized filtering
• Content-specific metadata
  – e.g., rocket-related video and documents

Dublin Core Metadata Initiative
• Simple element set designed for resource description
• International, inter-discipline, W3C community consensus
• “Semantic” interface among resource description communities (very limited form of semantics)
Source: www.desire.org

Dublin Core RDF
<xml>
  <?namespace href = "http://w3.org/rdf-schema" as = "RDF">
  <?namespace href = "http://metadata.net/DC" as = "DC">
  <RDF:Abbreviated>
    <RDF:Assertion RDF:HREF = "http://www.mysite.com/mydoc.html"
        DC:Title = "I've Never Metadata I've Never Liked"
        DC:Creator = "Mary Crystal"
        DC:Subject = "Metadata, Dublin Core, Stuff"/>
  </RDF:Abbreviated>
</xml>

Metadata for Digital Data
Metadata | Data Type | Metadata Type
Q-Features [Jain and Hampapur] | Image, Video | Domain Specific
R-Features [Jain and Hampapur] | Image, Video | Domain Independent
Meta-Features [Jain and Hampapur] | Image, Video | Content Independent
Impression Vector [Kiyoki et al.] | Image | Content Descriptive
NDVI, Spatial Registration [Anderson and Stonebraker] | Image | Domain Specific
Speech Feature Index [Glavitsch et al.] | Audio | Direct Content Based
Topic Change Indices [Chen et al.] | Audio | Direct Content Based
Document Vectors [Deerwester et al.] | Text | Direct Content Based
Inverted Indices [Kahle and Medlar] | Text | Direct Content Based
Content Classification Metadata [Bohm and Rakow] | MultiMedia | Domain Specific
Document Composition Metadata [Bohm and Rakow] | MultiMedia | Domain Independent
Metadata Templates [Ordille and Miller] | Media Independent | Domain Specific
Land Cover, Relief [Sheth and Kashyap] | Media Independent | Domain Specific
Parent Child Relationships [Shklar et al.] | Text | Domain Independent
Contexts [Sciore et al., Kashyap and Sheth] | Structured | Domain Specific
Concepts from Cyc [Collet et al.] | Structured | Domain Specific
User’s Data Attributes [Shoens et al.] | Text, Structured | Domain Specific
Domain Specific Ontologies [Mena et al.] | Media Independent | Domain Specific
Sheth, Klas: Multimedia Data Management 1998

Multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain
Kansas State FGDC Metadata Model | UDK Metadata Model
Theme keywords: digital line graph, hydrography, transportation... | Search terms: digital line graph, hydrography, transportation...
Title: Dakota Aquifer | Topic: Dakota Aquifer
Online linkage: http://gisdasc.kgs.ukans.edu/dasc/ | Adress Id: http://gisdasc.kgs.ukans.edu/dasc/
Direct Spatial Reference Method: Vector | Measuring Techniques: Vector
Horizontal Coordinate System Definition: Universal Transverse Mercator | Co-ordinate System: Universal Transverse Mercator
… | …
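The FGDC/UDK comparison above suggests that interoperability between the two models starts with a tag-name mapping. The following is a minimal sketch in Python, assuming a flat record represented as a dictionary; the tag names are copied from the slide, while the function name `fgdc_to_udk` and the choice to pass unmapped tags through unchanged are illustrative assumptions, not part of either standard.

```python
# Tag correspondences taken from the FGDC/UDK comparison on the slide.
FGDC_TO_UDK = {
    "Theme keywords": "Search terms",
    "Title": "Topic",
    "Online linkage": "Adress Id",  # spelled as on the slide
    "Direct Spatial Reference Method": "Measuring Techniques",
    "Horizontal Coordinate System Definition": "Co-ordinate System",
}


def fgdc_to_udk(record):
    """Rename FGDC tags to their UDK counterparts.

    Assumption: tags without a known counterpart are kept under their
    original name rather than dropped.
    """
    return {FGDC_TO_UDK.get(tag, tag): value for tag, value in record.items()}


fgdc_record = {
    "Title": "Dakota Aquifer",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = fgdc_to_udk(fgdc_record)
print(udk_record)
```

A renaming table like this only addresses differences in tag names; differences in value vocabularies or structure would need additional, domain-specific mapping.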
Different views of Metadata
Domain Independent Specifications (RDF)
Frameworks/Infrastructures (XCM)
Application Specific: ICE
Media Specific: MPEG7, VoiceXML
Domain Specific: NewsML, FGDC/UDK

Creating and Serving Metadata to Power the Life-cycle of Content
Taalee Infrastructure Services
Taalee Content Applications
Produce, Aggregate, Catalog/Index, Integrate, Syndicate, Personalize, Interactive Marketing
Where is the content? Whose is it?
What is this content about?
What other content is it related to?
What is the right content for this user?
What is the best way to monetize this interaction?
Taalee Semantic MetaBase
Broadcast, Wireline, Wireless, Interactive TV

Types of Specs and Standards (or MetaModels)
• Domain Independent: (MCF), RDF, MOF, DublinCore
• Media Specific: MPEG4, MPEG7, VoiceXML
• Domain/Industry Specific (metamodels): MARC (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing)
• Application Specific: ICE (Syndication)
• Exchange/Sharing: XCM, XMI
• Orthogonal/(Other): RDFS, namespaces, conceptual models (UML), ontologies (OWL)

What can RDF do for metadata?
• Designed to impose structural constraints on syntax to support consistent encoding, exchange and processing of metadata.
• Domain Independent Metadata standard.

Metadata extraction from heterogeneous content/data
Sources shown: WWW, Enterprise Repositories; Nexis, UPI, AP; Feeds/Documents; Digital Videos; Data Stores; Digital Maps; Digital Images; Digital Audios
Create/extract as much (semantic) metadata automatically as possible, from:
• Any format (HTML, XML, RDB, text, docs)
• Many media
• Push, pull
• Proprietary, Deep Web, Open Source
Diagram labels: EXTRACTORS; METADATA

Alternatives for Metadata Extraction
Statistical methods/Cluster Analysis
Learning/AI and Collab.
Filtering
Word or Phrase
Reference data/Concept-terms/Dictionary/Thesaurus
By topic/industry/subject/domain
Ontologies/Domain Models
deeper understanding
KnowledgeBase
By Entities and Relationships

Extracting a Text Document: Syntactic approach

INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II

CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.

SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.

CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning.

Slide callouts: LAYOUT; Date => day month int ‘,’ int
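The slide's syntactic rule `Date => day month int ',' int` can be sketched as a regular expression. The following Python example is one possible reading of that rule, assuming English day and month names written in full; the names `DATE_RE` and `extract_dates` are illustrative and not taken from the lecture.

```python
import re
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Date => day month int ',' int
DATE_RE = re.compile(
    r"\b(?P<day>" + "|".join(DAYS) + r")\s+"
    r"(?P<month>" + "|".join(MONTHS) + r")\s+"
    r"(?P<dom>\d{1,2})\s*,\s*(?P<year>\d{4})\b"
)


def extract_dates(text):
    """Return every calendar date in `text` that matches the rule."""
    found = []
    for match in DATE_RE.finditer(text):
        month = MONTHS.index(match.group("month")) + 1
        try:
            found.append(date(int(match.group("year")), month,
                              int(match.group("dom"))))
        except ValueError:
            # Syntactically valid but impossible date, e.g. "February 31".
            continue
    return found


header = "INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT"
print(extract_dates(header))
```

The pattern is purely syntactic: it does not check that the weekday name agrees with the calendar date, which is the kind of gap the knowledge-based approaches later in the lecture aim to close.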
Organizing Information: Automatic Classification

Traditional Text Categorization
Diagram labels: Customer Training Set; Statistical/AI Techniques; Classify; Place in a taxonomy; Routing/Distribution; Customer; Article Feed 4715
Classification of Article 4715
Standard Metadata
  Feed Source: iSyndicate
  Posted Date: 11/20/2000

Taalee’s Categorization & Automatic Metadata Creation
Diagram labels: Knowledge-base & Statistical/AI Techniques; Taalee Training Set; Customer Training Set; Classify; Place in a taxonomy; Catalog Metadata; Automated Content Enrichment (ACE)
Article 4715 Metadata
Standard metadata
  Feed Source: iSyndicate
  Posted Date: 11/20/2000
Semantic metadata
  Company Name: France Telecom, Equant
  Ticker Symbol: FTE, ENT
  Exchange: NYSE
  Topic: Company News
Taxonomy placement
  FTE: Company Analysis, Conference Calls, Earnings, Stock Analysis
  ENT: Company Analysis, Conference Calls, Earnings, Stock Analysis
  NYSE: Member Companies, Market News, IPOs
Classification of Article 4715
Diagram labels: Article Feed 4715; Taalee Enterprise Content Manager; Customization Suite; Precise syndication/filtering; Routing/Distribution; Map to another taxonomy

Automatic Categorization & Metadata Tagging (Taalee, Inc.)
Video Segment with Associated Text
ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN.
Diagram labels: Segment Description; Auto Categorization; Semantic Metadata
Taalee Inc, 2000

Automatic Categorization & Metadata Tagging (Web page)
Video with Editorialized Text on the Web
Diagram labels: Auto Categorization; Semantic Metadata
Taalee Inc, 2000

Automatic Categorization & Metadata Tagging (Feed)
Text from Bloomberg
Diagram labels: Auto Categorization; Semantic Metadata
Taalee Inc, 2000

Taalee Extraction and Knowledgebase Enhancement
Diagram labels: Web Page; Enhanced Metadata Asset; Extraction Agent
Taalee, Inc. 1999-2002
Sheth et al, 2002 Managing Semantic Content for the Web

Semantic Enhancement Server
The Semantic Enhancement Server classifies content into the appropriate topic/category (if not already pre-classified), and subsequently performs entity extraction and content enhancement with semantic metadata from the Semagix Freedom Ontology.
How does it work?
• Uses a hybrid of statistical, machine learning and knowledge-base techniques for classification
• Not only classifies, but also enhances semantic metadata with associated domain knowledge
© Semagix, Inc.

Ambiguity Resolution during Metadata Extraction from content text
Diagram labels: Document; Ontology lookup; Entity Candidate; SES
Find Entity Candidates in the document:
  – Names and Synonyms
  – Common variations (Jr, Sr, III, PLC, .com, etc.)
  – ...
Note: Entity Candidates can be restricted to a relevant subset of ontology
Multiple matches found during entity lookup? (Yes / No)
If yes, resolve ambiguities for the entity using any/all of these criteria:
  – Direct/Indirect relationships with other entities found
  – Proximity analysis of related entities
  – Entity refinement using subset analysis (‘Doe’ vs. ‘John Doe’)
Ambiguity resolved
List relationships between identified entities in same document (optional in output)
List relationship trails e.g.
CompExec position CompanyName Politician party country watchList

Overcoming the key issue of resolving ambiguities in facts & evidence
• Aggregation and normalization of any type of fact and evidence into the domain ontology
  – Resolution of issues over terminology
    • e.g., “Benefit number” is an alias of “SSN”
  – Resolution of issues over identity
    • e.g., is executive “Larry Levy” an existing entity or a new entity?
  – Enabling decisions to be made on the trustworthiness of existing facts
    • Which source did the data originate from?
    • How much supporting evidence was there?
  – Validating and enforcing constraints, e.g. cardinality
    • President of the United States (has cardinality) = Single
    • Terrorist (has cardinality) = Multiple

Overcoming the key issue of resolving ambiguities in facts & evidence (Contd…)
• Managing temporal aspects of the domain
  – Expiration of entity instances
  – E.g., “Hillary Clinton” is no longer the First Lady of the United States but was until “May 3rd 2001”
• Providing auditing capabilities
  – Stamping evidence with date, time and source
  – E.g., Terrorist: “Seamus Monaghan”; date extracted: 2003-01-30; time extracted: 16:45:27; source: FBI Watch list
• Ontological relationships make for a more expressive model and provide a better semantic description (compared to taxonomies)
  – Information can be presented in natural language format
  – E.g., “Bob Scott” is a founder member of business entity “AIX LLP” that has traded in “Iran” that is on “FATF watch-list”

Example Scenario 1
Sample content text: Have you ever been to Athens? How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe ]
- B: Athens[, Georgia, United States of America, North America ]
- C: Athens[, Ohio, United States of America, North America ]
- D: Athens[, Tennessee, United States of America, North America ]
- E: Japan[, Asia]
Scores: A, B, C, D and E all scored equally – hence no ambiguity resolution possible

Example Scenario 2
Sample content text: Have you ever been to Athens?
Or anywhere else in Georgia? How about Japan?
Ontology Matches:
- A: Athens[, Greece, Europe ]
- B: Athens[, Georgia, United States of America, North America ]
- C: Athens[, Ohio, United States of America, North America ]
- D: Athens[, Tennessee, United States of America, North America ]
- E: Georgia[, Asia ]
- F: Georgia[, United States of America, North America ]
- G: Georgia On My Mind, Inc.
- H: Japan[, Asia]
Scores: B and F scored highest because of exact text match and relationship
Result: Entity Ambiguity Resolved

Automatic Semantic Annotation of Text: Entity and Relationship Extraction
KB, statistical and linguistic techniques
Semantic Enhancement Engine, 2002
Metadata Extraction and Semantic Enhancement [Hammond, Sheth, Kochut 2002]

Automatic Semantic Annotation
COMTEX Tagging: limited tagging (mostly syntactic)
Value-added Semagix Semantic Tagging: content ‘enhancement’, rich semantic metatagging
Value-added relevant metatags added by Semagix to existing COMTEX tags:
• Private companies
• Type of company
• Industry affiliation
• Sector
• Exchange
• Company Execs
• Competitors
© Semagix, Inc.

Metadata Usage: Keyword, Attribute and Content Based Access

Keyword Search vs Attribute Search with Semantic metadata
Panel titles: Taalee Metadata on Football Assets; Metadata from Typical Cataloging of Football Assets; Virage Search on football touchdown
Rich Media Reference Page
Baltimore 31, Pit 24
http://www.nfl.com
Brian Griese Interview Part Four: Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline...
Jimmy Smith Interview Part Seven: Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline...
Quandry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens’ lead to 31-24 in the third quarter.
League: Teams: Score: Players: Event: Produced by: Posted date: Professional Ravens, Steelers Bal 31, Pit 24 Quandry Ismail, Tony Banks Touchdown NFL.com 2/02/2000 Taalee’s Semantic Search Highly customizable, precise and freshest A/V search Delightful, relevant information, exceptional targeting opportunity Context and Domain Specific Attributes Uniform Metadata for Content from Multiple Sources, Can be sorted by any field Creating a Web of related information What can a context do? Taalee Directory Georgia Bulldogs System recognizes ENTITY & CATEGORY Taalee Directory Careless whisper Semantic Relationships Metadata Application Example Semantic Applications for highly relevant and fresh content: Personalization and Targeting/interactive marketing Please contact Taalee for live demonstrations Personalized Directory Change Context Obtain a whole universe of information (that you may not even have thought of) about some entities that have always been of interest to you. Please enter such semantic keywords below. Personalized Queries & Hot Topics Personalized Queries 1. My Stock Portfolio Microsoft suffers serious hack attack Cisco Systems Inc PERSONALIZATION Analyst Safa Rashtchy on Yahoo! PeopleSoft, Inc AT&T Corp. 2. My Football Fantasy Team more… Gators' Spurrier ready for 'big' game Tech's Vick looks to become complete QB Bucs excited about Hamilton Jasper Sanks rumbles into the end zone… HOT Topics!!! Edwards explains reasons for leaving BYU 1. Election 2000 more… Video: Explaining the electoral map Race for White House hots up 3. Julia Roberts Collection SeniorsHill" Give Gore Florida Edge Movie Trailer: "Notting more… Trailer - Runaway Bride 2. Middle East Peace Conflict Patrick More die as Israel steps up security Movie Trailer: "Stepmom" Israel braces for suicide bombs Conspiracy Theory more… Pentagon probes Cole's security 4. Pink Floyd Collection 3. 
Napster Controversy Set the Controls for the Heart of the Sun… more… The Brain Behind Napster Wish You Were Here Napster Lawsuit Round And Around Keep Talking Creative Nomad II more… The Post War Dream more… Metadata: Targeting Semantic/Interactive Targeting Buy Al Pacino Videos Buy Russell Crowe Videos Buy Christopher Plummer Videos Buy Diane Venora Videos Buy Philip Baker Hall Videos Buy The Insider Video Precisely targeted through the use of Structured Metadata and integration from multiple sources Web: Extreme Personalization Realtime Feeds Web sites and Pages Interests, Preferences Time-Shifted Content Aggregator Content Databases Personalized Content Content Personalized Content Semantic EngineTM Structured, Hi-Quality Semantic Metabase Application of Semantic Metadata and Automatic Content Enrichment MyMedia $ MyStocks News w Sports Music % % User has already completed Web Based registration and personalization at Voquette’s Enterprise Customer site. User’s “Wireless Home page” shows the categories for his interests. There is an alert (new content) for his stock and sports categories. Application of Semantic Metadata and Automatic Content Enrichment My Stocks MyMedia $ MyStocks News w Sports Music % % CSCO NT IBM Market Clicking on MyStocks brings down user’s Personal Portfolio list. The user wants to see news items about Cisco (see next slide). Search at the bottom is a semantic search that understands the financial domain, and the knowledge of user’s portfolio. Typically search can be done by typing one word or selecting from a dynamic, personalized menu. Application of Semantic Metadata and Automatic Content Enrichment CSCO My Stocks MyMedia $ MyStocks News w Sports Music Analyst Call CSCO % Conf Call NT Earnings Different types of recent audio content about Cisco are available. The user clicks to see a listing of Analyst Calls on Cisco (next slide). 
%IBM Market %
Icons at the bottom of the screen enable contextually relevant functions: listen, set alert on story, add to playlist.

Application of Semantic Metadata and Automatic Content Enrichment
Screen labels: CSCO Analysis; My Stocks; MyMedia $; Analyst Call; MyStocks; News w; Sports; Music; CSCO % %; CSCO; NT; Conf Call; Earnings; IBM; Market %
Listing shown:
  11/08 ON24 Payne
  11/07 ON24 H&Q
  CC 11/06 CBS Langlesis
Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icons denote additional metadata, such as “Strong Buy” by H&Q Analyst.

iTV: Taalee’s Extreme Personalization
Diagram labels: Immediate Interests, Preferences; Content Provider (DBS, DISH, Wink, AOL-TV); Content, “Programs”; Meta-Data Tagged Content; Semantic EngineTM; Structured, Hi-Quality Semantic Metabase; Personalized Content Capsules, Redirects and Programming

Metadata for Automatic Content Enrichment
Interactive Television
• This screen is customizable with an interactivity feature using metadata such as whether there is a new Conference Call video on CSCO.
• Part of the screen can be automatically customized to show conference call specific information, including transcript, participation, etc., all of which are relevant metadata.
• The Conference Call itself can have embedded metadata to support personalization and interactivity.
• This segment has embedded or referenced metadata that is used by the personalization application to show only the stocks that the user is interested in.
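The personalization step described above, showing only segments about stocks the user follows, can be sketched as a metadata filter. This is a minimal illustration in Python; the `Segment` class, its `tickers` field, the segment titles and the `personalize` function are all hypothetical stand-ins for whatever embedded or referenced metadata a real iTV segment would carry. Only the ticker symbols CSCO, NT and IBM come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A content segment with hypothetical embedded ticker metadata."""
    title: str
    tickers: set = field(default_factory=set)


def personalize(segments, portfolio):
    """Keep only segments whose ticker metadata overlaps the portfolio."""
    wanted = {ticker.upper() for ticker in portfolio}
    return [seg for seg in segments if seg.tickers & wanted]


segments = [
    Segment("Conference call A", {"CSCO"}),
    Segment("Analyst call B", {"MSFT"}),
    Segment("Earnings report C", {"IBM", "NT"}),
    Segment("General market news"),  # no ticker metadata attached
]
selected = personalize(segments, ["csco", "nt", "ibm"])
print([seg.title for seg in selected])
```

Because the filter relies only on structured metadata, the same segments could be re-targeted for a different user by changing the portfolio, without reprocessing the content itself.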
Metadata in Enterprise Apps Collection Sony Processing Production Support Network Content Categorize Affiliate Feeds Catalog Integrate Public Sources Rich Data Metabase Filter, Search, Consolidate, Personalize, Archive, Licensing, Syndication Description Produced by : CNN Posted Date : 12/07/2000 Reporter : David Lewis Event : Election 2000 Location : Tallahassee, Florida, USA People : Al Gore (1.33) – 12/06/00 - ABC (2.53) - 12/06/00 - CBS (5.16) - 12/06/00 - ABC (2.46) - 12/06/00 - FOX (1.33) - 12/06/00 - NBC -- Breaking News -Gore Demands That Recount Restart (5.33) - 12/06/00 (1.33) - 12/06/00 - CBS (1.33) - 12/06/00 - ABC Gore Says Fla. Can't Name Electors (3.57) - 12/06/00 - CBS (2.33) - 12/06/00 - CBS Bush Meets Colin Powell at Ranch (4.27) - 12/06/00 - ABC (3.12) - 12/06/00 - NNS Market Tumbles on Earnings Warning (3.44) - 12/06/00 - FOX (0.32) - 12/06/00 - CBS Barak Outlines His Peace Plan (1.33) - 12/06/00 - CBS (7.24) - 12/06/00 - CBS TALLAHASSEE, Florida (CNN) – Though the two presidential candidates have until noon Wednesday to file briefs in Al Gore's appeal to the Florida Supreme Court, the outcome of two trials set on the same day in Leon County, Florida, may offer Gore his best hope for the presidency. Democrats in Seminole County are seeking to have 15,000 absentee ballots thrown out in that heavily Republican jurisdiction -- a move that would give Gore a lead of up to 5,000 votes statewide. Lawyers for the plaintiff, Harry Jacobs, claim the ballots should be rejected because they say County Elections Supervisor Sandra Goard allowed Republican workers to fill out voter identification numbers on 2,126 incomplete absentee ballot applications sent in by GOP voters, while refusing to allow Democratic workers to do the same thing for Democratic voters. The GOP says that suit, and one similar to it from Martin County, demonstrates Democratic Party politics at its most desperate. Gore is not a party to either of those lawsuits. 
On Tuesday, the judge in the

Metadata’s role in emerging iTV infrastructure
Diagram labels: Video; Enhanced Digital Cable; MPEG-2/4/7; MPEG Encoder; Create Scene Description Tree; MPEG Decoder; Retrieve Scene Description Track; Node = AVO Object; Scene Description Tree; “Cisco Systems” Node; Taalee Semantic Engine; Object Content Information (OCI); Enhanced XML Description; “Cisco Systems” Metadata-rich Value-added Node; GREAT USER EXPERIENCE
Channel sales through Video Server Vendors, Video App Servers, and Broadcasters
License metadata decoder and semantic applications to device makers
Example metadata:
  Produced by: Fox Sports
  Creation Date: 12/05/2000
  League: NFL
  Teams: Seattle Seahawks, Atlanta Falcons
  Players: John Kitna
  Coaches: Mike Holmgren, Dan Reeves
  Location: Atlanta

Ontology Design – Fundamental Principles
• There is no one correct way to model a domain; there are always viable alternatives.
• The best solution almost always depends on the application that you have in mind and the extensions that you anticipate.
• Ontology development is necessarily an iterative process.
• Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest.
• These are most likely to be nouns (objects) or verbs (relationships) in sentences that describe your domain.
Ontology Development 101: A Guide to Creating Your First Ontology
Natalya F. Noy and Deborah L.
McGuinness

Semantic (Web) Technology – State of the Art

Semantic Technology – Key Features
• Design ontology schema
• Automatically populate ontology with domain knowledge (at Enterprise Scale)
• Maintain freshness of ontology (almost) automatically
• Processing of heterogeneous information (structured, semistructured and unstructured)
• Automatic Semantic Metadata Extraction using lexical, statistical or NLP techniques*
• Automatic Semantic Metadata Extraction using populated ontology (Knowledgebase approach)
• Logic based reasoning (inferencing)
• Graph/relationship traversal based reasoning

Ontology-driven Information System Lifecycle
Diagram labels: Schema Creation; Analytic Application Creation; Ontology API; MB; Ontology Population; KB; BSBQ Application Creation; Semantic Visualization; Metadata Extraction
Semagix Freedom Architecture: for building ontology-driven information system
Sheth et al, 2002 Managing Semantic Content for the Web
© Semagix, Inc.

Ontology Creation and Maintenance Steps
1. Ontology Model Creation (Description)
2. Knowledge Agent Creation
3. Automatic aggregation of Knowledge
4. Querying the Ontology
Diagram labels: Ontology; Semantic Query Server
© Semagix, Inc.
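As one illustration of querying a populated ontology, the Athens/Georgia disambiguation from Example Scenarios 1 and 2 can be sketched in Python. This is not the Semagix algorithm: the scoring rule (one point for an exact name match plus one point per containment relationship with a candidate of another mention) is an assumption chosen so that the outcomes agree with the two scenarios, and the names `Entity`, `ONTOLOGY` and `resolve` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    key: str        # label used on the slide, e.g. "B"
    name: str
    parents: tuple  # containing regions, nearest first

    @property
    def path(self):
        return (self.name,) + self.parents


USA = ("United States of America", "North America")
ONTOLOGY = [
    Entity("A", "Athens", ("Greece", "Europe")),
    Entity("B", "Athens", ("Georgia",) + USA),
    Entity("C", "Athens", ("Ohio",) + USA),
    Entity("D", "Athens", ("Tennessee",) + USA),
    Entity("E", "Georgia", ("Asia",)),
    Entity("F", "Georgia", USA),
    Entity("G", "Georgia On My Mind, Inc.", ()),
    Entity("H", "Japan", ("Asia",)),
]


def contains(outer, inner):
    """True if `outer` is one of the regions that `inner` lies within."""
    return (len(outer.path) < len(inner.path)
            and inner.path[-len(outer.path):] == outer.path)


def related(a, b):
    return contains(a, b) or contains(b, a)


def resolve(mentions, ontology):
    """Map each mention to a single entity key, or None if still ambiguous."""
    candidates = {m: [e for e in ontology if m.lower() in e.name.lower()]
                  for m in mentions}
    result = {}
    for mention, own in candidates.items():
        others = [e for m, es in candidates.items() if m != mention for e in es]
        scores = {e.key: int(e.name.lower() == mention.lower())
                  + sum(related(e, o) for o in others) for e in own}
        best = max(scores.values(), default=0)
        winners = [key for key, score in scores.items() if score == best]
        result[mention] = winners[0] if len(winners) == 1 else None
    return result


scenario_1 = resolve(["Athens", "Japan"], ONTOLOGY)
scenario_2 = resolve(["Athens", "Georgia", "Japan"], ONTOLOGY)
print(scenario_1)
print(scenario_2)
```

In the first scenario every Athens candidate ties, so the mention stays unresolved; in the second, the mention of Georgia links candidates B and F, which lifts both above their competitors. The sketch reuses the Scenario 2 labels (A to H) for both runs.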