SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
Keynote, CONTENT- AND SEMANTIC-BASED INFORMATION RETRIEVAL @ SCI 2002
Amit Sheth
CTO, Voquette*, Inc.
Large Scale Distributed Information Systems (LSDIS) Lab, University of Georgia; http://lsdis.cs.uga.edu
*Now Semagix, http://www.semagix.com
July 15, 2002 © Amit Sheth

New Enterprise Content Management Challenges
1. More variety and complexity
  - More formats (MPEG, PDF, MS Office, WM, Real, AVI, etc.)
  - More types (Docs, Images -> Audio, Video, variety of text: structured, unstructured)
  - More sources (internal, extranet, internet, feeds)
2. Scalability, Information Overload
  - Too much data, precious little information (Relevance)
3. Creating Value from Content
  - How to distribute the right content to the right people as needed? (Personalization -- book of business)
  - Customized delivery for different consumption options (mobile/desktop, devices)
  - Insight, Decision Making (Actionable)

New Enterprise Content Management Technical Challenges
1. Aggregation
  - Feed handlers/Agents that understand content representation and media semantics
  - Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types
2. Homogenization and Enhancement
  - Enterprise-wide common view
  - Domain model, taxonomy/classification, metadata standards
  - Semantic Metadata, created automatically if possible
3. Semantic Applications
  - Search, personalization, directory, alerts, etc.
    using metadata and semantics (semantic association and correlation), for improved relevance, intelligent personalization, customization

Semantics: The Next Step in the Web’s Evolution
The Semantic Web -- a vision with several views:
• “The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what data means to process it.” [B99]
• “The semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [BHL01]
• “The Semantic Web is a vision: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.” [W3C01]

Semantics for the Web
On the Semantic Web every resource (people, enterprises, information services, application services, and devices) is augmented with machine-processable descriptions to support finding, reasoning about (e.g., which service is best), and using (e.g., executing or manipulating) the resource. The idea is that self-descriptions of data and other techniques would allow context-understanding programs to selectively find what users want, or for programs to work on behalf of humans and organizations to make them more efficient and productive.

Central Role of Metadata
[Figure: content flow from Back End to Applications: Produce, Aggregate, Catalog/Index, Integrate, Syndicate, Personalize, Interactive Marketing. Other labels: Semantic Metadata; Broadcast, Wireline, Wireless, Interactive TV]
Questions shown along the flow:
  - Where is the content? Whose is it?
  - What is this content about?
  - What other content is it related to?
  - What is the right content for this user?
  - What is the best way to monetize this interaction?

"A Web content repository without metadata is like a library without an index."
- Jack Jia, IWOV
“Metadata increases content value in each step of content value chain.”
- Amit Sheth

A Metadata Classification
More semantics for relevance to tackle information overload! Levels, from the user down to the data:
  - User
  - Ontologies, Classifications, Domain Models
  - Domain Specific Metadata: area, population (Census); land-cover, relief (GIS); metadata concept descriptions from ontologies
  - Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
  - Direct Content Based Metadata (inverted lists, document vectors, LSI)
  - Content Dependent Metadata (size, max colors, rows, columns...)
  - Content Independent Metadata (creation-date, location, type-of-sensor...)
  - Data (Heterogeneous Types/Media)

Semantic Content Organization and Retrieval Engine (SCORE) technology
• Automatically aggregates and extracts information from disparate sources and multiple formats
• Automatically tags/annotates and categorizes content
• Automatically creates relevant associations
  - Maps content topics and their relationships
• Semantic query engine relates information and knowledge both internal and external to the organization into a single view

SCORE Architecture
  - Fast main-memory based query engine with APIs and XML output
  - Toolkit to design and maintain the Knowledgebase
  - Distributed agents that automatically extract/mine knowledge from trusted sources
  - Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content
  - Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel
  - CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata
  - WorldModel specifies enterprise’s normalized view of information (ontology)

Voquette Enterprise Semantic Platform
[Figure: product components. Enhancement Engine (Automatic Classification, Classification Committee, Entity Extraction, Enhanced Metadata, Domain Experts); Content Sources read by Content Agents (CA), with Content Agent Monitor and CA Toolkit; Knowledge Sources (KS) read by Knowledge Agents (KA), with Knowledge Agent Monitor and KA Toolkit; source types shown: Databases, Email, Reports, Documents, XML/Feeds, Websites; KB Toolkit and WM Toolkit; World Model, Knowledgebase and Metabase; Semantic Engine with Main Memory Index; Enterprise Applications (EA) through XML, APIs, Web Services: Search, Alerts, Portals, Personalize, Directory]

JIVA Knowledge Sources Used
  - Office of Foreign Assets Control (OFAC)
  - Capital Advantage (CA)
  - Federal Bureau of Investigation (FBI)
  - The Interdisciplinary Center (ICT)
  - Central Intelligence Agency (CIA)
  - Federation of American Scientists (FAS)
  - Data supplied from NASA (DPL)
  - Hoover’s (H)
  - ZDNet (ZD)
  - Market Guide (MG)

Entity Classes and Relationships populated by these knowledge sources:
PERSON (OFAC, FBI, DPL)
-politician (OFAC, FBI, CIA, CA)
  politician associated with politicalOrganization
  politician held politicalOffice
  politician associated with politicalOffice
-terrorist (OFAC, FBI, DPL)
  terrorist memberOf organization
  terrorist appears on watchList
-companyExecutive (MG)
  companyExecutive holdsOffice companyPosition
person has permanent address address (OFAC, FBI)
person has dob (date of birth) (OFAC, FBI)
person has pob (place of birth) (OFAC, FBI)
THING
-event (ICT)
  terroristOrganization participated in terroristSponsoredEvent (ICT)
-politicalOffice (CIA, CA)
  politicalOffice office(s) within govtOrganization
  politicalOffice associated with organization
-watchList (OFAC, FBI, DPL)
  terroristOrganization appears on watchList (OFAC, FBI, DPL)
-organization (OFAC, FBI, FAS, ICT, CA, CIA)
  organization appears on watchList
  organization memberOf suborganization
-company
  company manufactures product (ZD)
  company identifiedBy tickerSymbol (H)
  companyposition position in company (MG)
  company memberOf industry (H)
-tickerSymbol (H)
  tickerSymbol memberOf exchange (H)
PLACE
-organization located in place (H, OFAC)
-religiousAffiliation practiced in place (CIA)
-company headquarters in city (H)

SCORE Capabilities
• Semantics (understanding of content and user needs)
• Extreme relevance
• Semantic associations
• Near real-time
• Multiple applications/usage patterns (not just search)
• Automation
• Scalability in all aspects

Technologies Involved
• Ontology driven architecture (definitional, assertional components)
• Automatic Classification with classifier committee (multiple technologies, rather than one size fits all)
• Automatic Semantic Metadata Extraction/Annotation
• Semantic associations/knowledge inferences
• Scalability throughout with distributed architecture and implementation (number of content and knowledge sources, indexing, etc.)
• Main memory implementation, incremental checkpointing

Performance
  Queries per server per hour: > 1,980,000
  Query Response Time (light load): 1 - 10 ms
  Query Response Time (64 concurrent users): 65 ms
  Incremental Index Update Frequency: 1 minute (near real-time)
  Population/update rate in a Knowledgebase with 1 million entities/relationships: > 10,000 entities/relationships per hr.

Information Extraction for Metadata Creation
Sources shown: WWW, Enterprise Repositories, Nexis, UPI, AP, Feeds/Documents, Digital Videos, Data Stores, Digital Maps,
Digital Images, Digital Audios
EXTRACTORS -> METADATA
Key challenge: Create/extract as much (semantic) metadata automatically as possible

Automatic Categorization & Metadata Tagging (Web page)
Video with Editorialized Text on the Web
  - Auto Categorization
  - Semantic Metadata

Content Extraction and Knowledgebase Enhancement
Web Page -> Extraction Agent -> Enhanced Metadata Asset

Content Enhancement Workflow
Content Asset Index Evolution (Syntax Metadata, then Semantic Metadata)
Components and their roles:
  - Extractor Agent for Bloomberg (XML Feed): scans text for analysis; metadata extracted automatically; creates asset (index) out of extracted metadata
  - Categorization & Auto-Cataloging System (CACS): scans text for analysis; classifies document into pre-defined category/topic; appends topic metadata to asset
  - Semantic Engine with Knowledge Base: leverages knowledge to enhance metatagging; enhanced content asset indexed
Asset, Syntax Metadata:
  Producer: BusinessWire
  Source: Bloomberg
  Date: Sept. 10 2001
  Location: San Jose, CA
  URL: http://bloomberg.com/1.htm
  Media: Text
Asset, Semantic Metadata (grows from Company alone, to Company and Topic, to the full set):
  Company: Cisco Systems, Inc.
  Topic: Company News
  Ticker: CSCO
  Exchange: NASDAQ
  Industry: Telecomm.
  Sector: Computer Hardware
  Executive: John Chambers
  Competition: Nortel Networks
  Headquarters: San Jose, CA
Knowledge Base entry used for the enhancement:
  Company: Cisco Systems
  Headquarters: San Jose
  Sector: Computer Hardware
  Executives: John Chambers
  Industry: Telecomm.
  Exchange: NASDAQ
  Ticker: CSCO
  Competition: Nortel Networks

Intelligent Content Empowers the User
End-User Intelligent Content =
  Content which does contain the words the user asked for (Extractor Agents)
  + Content which does not contain the words the user asked for, but is about what he asked for (Value-added Metadata)
  + Content the user did not think to ask for, but which he needs to know (Semantic Associations)

Example 1: Snapshots (“Jamal Anderson”)
  - Search for ‘Jamal Anderson’ in ‘Football’
  - Click on first result for Jamal Anderson
  - View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They are value-additions to the metadata to facilitate easier search.
  - View metadata. Note that Team name and League name are also included in the metadata

Semantic Application Example: Research Dashboard
  - Automatic 3rd party content integration
  - Focused relevant content organized by topic (semantic categorization)
  - Related relevant content not explicitly asked for (semantic associations)
  - Competitive research inferred automatically
  - Automatic Content Aggregation from multiple content providers and feeds

Semantic Web: Intelligent Content
Intelligent Content = What You Asked for + What you need to know!
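
The enhancement idea from the workflow slides (an extractor tags an entity, the knowledge base contributes related facts, and search then matches on metadata as well as text) can be illustrated with a minimal Python sketch. This is not Voquette's implementation: the `Asset` class, the dictionary layout, the function names, and the sample sentence are all assumptions made for illustration; only the Cisco metadata values come from the slides.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory knowledge base keyed by company name. The facts are
# the Cisco values shown on the workflow slide; the layout is an assumption.
KNOWLEDGE_BASE = {
    "Cisco Systems, Inc.": {
        "Ticker": "CSCO",
        "Exchange": "NASDAQ",
        "Industry": "Telecomm.",
        "Sector": "Computer Hardware",
        "Executive": "John Chambers",
        "Competition": "Nortel Networks",
        "Headquarters": "San Jose, CA",
    }
}


@dataclass
class Asset:
    text: str                                     # raw content
    syntax: dict                                  # syntax metadata (producer, source, ...)
    semantic: dict = field(default_factory=dict)  # semantic metadata, filled in below


def extract_company(asset, kb):
    """Toy extractor: tag the first known company whose short name occurs in the text."""
    for name in kb:
        short_name = name.split(",")[0]           # e.g. "Cisco Systems"
        if short_name.lower() in asset.text.lower():
            asset.semantic["Company"] = name
            return


def enhance(asset, kb):
    """Copy knowledge-base facts about the tagged company into the asset metadata."""
    facts = kb.get(asset.semantic.get("Company"), {})
    for key, value in facts.items():
        asset.semantic.setdefault(key, value)


def search(assets, term):
    """Match the term against semantic metadata values, not only the raw text."""
    term = term.lower()
    return [
        a for a in assets
        if term in a.text.lower()
        or any(term in str(v).lower() for v in a.semantic.values())
    ]


# Invented sample sentence; the syntax metadata mirrors the slide.
asset = Asset(
    text="Cisco Systems announced quarterly results today.",
    syntax={"Producer": "BusinessWire", "Source": "Bloomberg", "Media": "Text"},
)
extract_company(asset, KNOWLEDGE_BASE)
enhance(asset, KNOWLEDGE_BASE)

# "NASDAQ" never occurs in the text, but the enhanced metadata makes the asset findable.
hits = search([asset], "nasdaq")
print(len(hits), asset.semantic["Ticker"])
```

The same pattern would account for the Jamal Anderson example: team and league names are absent from the source page but present in the metadata, so a metadata-aware search can still return the page.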
[Figure: associations around a COMPANY. Labels: Related Stock News; Competition; COMPANIES in INDUSTRY with Competing PRODUCTS; COMPANIES in Same or Related INDUSTRY; Regulations; Technology; Products Important to INDUSTRY or COMPANY; Industry News; EPA; SEC; Impacting INDUSTRY or Filed By COMPANY]

Knowledge-based & Manual Associations
[Figure labels: Syntax Metadata; Semantic Metadata; Same entity; led by; Human-assisted inference]

Intelligence Analyst Browsing Scenario

Innovations that affect User Experience
• BSBQ: Blended Semantic Browsing and Querying
  - Ability to query and browse relevant desired content in a highly contextual manner
• Seamless access/processing of Content, Metadata and Knowledge
  - Ability to retrieve relevant content, view related metadata, access relevant knowledge and switch between all the above, allowing user to follow his train of thought
• dACE: dynamic Automatic Content Enhancement
  - Ability to provide enhanced annotation features, allowing the user to retrieve relevant knowledge about significant pieces of content during content consumption
• Semantic Engine APIs with XML output
  - Ability to create customized APIs for the Semantic Engine involving Semantic Associations with XML output to cater to any user application

[Figure: aviation security architecture. Labels: Boarding Gate; Interrogation; Security Portal; ARC AvSec Manager; Data Management; Data Mining; Check-in; IPG; Airport; Airspace; Visionics; AcSys; Voquette Knowledgebase, Metabase; Threat Scoring; Airport LEO; Passenger Records; Reservation Data; Airline Data; Airport Data; Airline and Airport Data; Gov’t Watchlists; News Media; Web Info; LexisNexis; RiskWise; Future and Current Risks]

Sources Used
Knowledge Sources and Content Sources:
FBI - Most Wanted Terrorists Denied Persons Lists Terrorism Files ICT Office of Foreign Asset Control (OFAC) Hamas terrorists CNN Locations FAA_Airport_Codes About.com Comtex_International Hindustan Times JerusalemPost CNN Newstrove_Hamas Africa News Service AFX News – Asia/UK/Europe AP Worldstream Asia Pulse BusinessWire ComputerWire (CTW) EFE News Services FWN Select Itar-TASS Knight
Ridder News (Open) Knight-Ridder Open M2 - International M2 Airline Industry Information New World Publishing PR Newswire PRLine (PRL) Resource News International RosBusiness United Press International UPI Spotlights

Interrogation Kiosk: Unique Advantages of Voquette
Voquette’s Semantic Technology enables flight authorities to:
  - take a quick look at the passenger’s history
  - check quickly if the passenger is on any official watchlist
  - interpret and understand passenger’s links to other organizations (possibly terrorist)
  - verify if the passenger has boarded the flight from a “high risk” region
  - verify if the passenger originally belongs to a “high risk” region
  - check if the passenger’s name has been mentioned in any news article along with the name of a known bad guy

Threat Score Components (passenger: John Smith)
  Flight Country Check: 45
  Person Country Check: 25, 0.15
  Nested Organizations Check: 75, 0.8
  Aggregate Link Analysis Score: 17.7
  appearsOn watchList: FBI

KNOWLEDGEBASE SEARCH
Action: Voquette’s rich knowledgebase is searched for this name and associated information like position, aliases, relationships (past or present) of this name to other organizations, watchlists, country, etc. are retrieved
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge about a passenger and automatically co-relate it with other data in the knowledgebase to present a clear picture about the passenger to the flight official

METABASE SEARCH
Action: Voquette’s rich metabase is searched for this name and associated content stories mentioning the passenger’s name are retrieved
Ability Proven: Ability to automatically aggregate and retrieve relevant content stories, field reports, etc. about the passenger that can be used by flight officials to determine if the passenger has any connections with known bad people or organizations

LEXIS NEXIS ANNOTATION
Action: Information about or related to the passenger returned by Lexis Nexis is automatically enhanced by linking important entities to Voquette’s rich knowledgebase
Ability Proven: Ability to recognize entities in a piece of text and further automatically co-relate it with other data in the knowledgebase to present a visual association picture to the flight official

WATCHLIST ANALYSIS
Action: Voquette’s rich knowledgebase is searched for the possible appearance of this name on any of the watchlists
Ability Proven: Ability to aggregate rich domain knowledge, recognize entities in a piece of text and automatically co-relate it with other data in the knowledgebase to present an overall idea of the passenger on the watchlist front

LINK ANALYSIS
Action: Semantic analysis of the various components (watchlist, Lexis Nexis, knowledgebase search, metabase search, etc.) to come up with an aggregate threat score for the passenger
Ability Proven: Ability to automatically aggregate relevant rich domain knowledge, recognize entities in a piece of text, automatically co-relate it with other data in the knowledgebase, search for relevant content and rank the threat factors to indicate the threat level of the passenger, allowing the flight official to take quick action

Query Comparison: Voquette vs.
RDBMS
What it will take RDBMS to support flight security application

Link Analysis Component / step | # Queries (Voquette) | # Queries (RDBMS) | Time (Voquette) | Time (RDBMS)
Direct Watchlist Match (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Organization Watchlist Match (person name, organization name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Nested Organization Watchlist Match (person name, organization name)
  look up organization entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve the organization's relationships to organizations | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organizations' relationships to watchlists | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Flight Origin (country name)
  retrieve country entity | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  see if country is on a list containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Person Origin (person name)
  lookup person entity | 1 CACS Request | 5-10 SQL Queries | .05 sec | 5-10 sec.
  retrieve person's home country | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  retrieve the organization's relationships to lists containing "high-risk" countries | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Field Report Search (person name)
  perform SSE query for field reports that mention this person | 1 SSE Request | 2 SQL Queries | .03 sec | 5-30 sec
  retrieve a list of people associated with these field reports | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
  determine which people are on watchlists, terrorists, etc… | 1 SQL Query | 1 SQL Query | .005 sec | .005 sec
Total | 18 requests | 39-64 SQL Queries | .33 sec | 30-80 sec.

JIVA Functionality Interface
JIVA Semantic Console Start-up Interface
The mission of the JIVA project is to gather and analyze as much information of diverse kinds as possible about suspected individuals, terrorist and other groups, organizations, events, etc. For this Terrorism domain, the JIVA Semantic Console provides an information retrieval interface (shown below) that displays some fundamental semantic attributes (based on a corresponding Terrorism domain model) to enable information retrieval in the right context.
  - Search interface with more search features (explained later)
  - Most fundamental semantic attributes specific to the Terrorism domain (fully customizable)
  - Analyst can enter search values in the appropriate attribute fields (to search in the right context)
  - Syntactic or domain-independent attributes for general and media-specific search
  - Once all other values are set, click the “Search” button to search semantically
  - Analyst can choose the type of media of the desired content

JIVA “Complete Picture” View: Knowledgebase Results
This section of the ‘Complete Picture’ shows factually known real-world information about the entity (person, organization, event, etc.) of interest along with its contextual classification(s) and relationships with other entities in the Knowledgebase, to provide a comprehensive overview of the entity. Such knowledge is kept up-to-date by means of automated knowledge extractor agents that aggregate such knowledge about millions of entities from various trusted knowledge sources.
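
As a rough illustration of how relationships between Knowledgebase entities can be followed, for instance in the nested organization watchlist match from the flight security comparison, here is a minimal Python sketch of a breadth-first walk over relationship triples. It is a sketch under stated assumptions, not the JIVA or Voquette implementation: the dictionary layout, the `related` and `watchlist_links` helpers, and the organizations "Org A" and "Org B" are hypothetical; only the relationship names (memberOf, appearsOn) and the "John Smith" label come from the slides.

```python
from collections import deque

# Hypothetical knowledgebase: (subject, relationship) -> list of objects.
# Relationship names follow the slides; "Org A" and "Org B" are invented.
KB = {
    ("John Smith", "memberOf"): ["Org A"],
    ("Org A", "memberOf"): ["Org B"],           # Org A is nested inside Org B
    ("Org B", "appearsOn"): ["FBI watchlist"],
}


def related(entity, relationship):
    """Return the entities reachable from `entity` via `relationship`."""
    return KB.get((entity, relationship), [])


def watchlist_links(person, max_depth=3):
    """Breadth-first walk over memberOf edges, collecting watchlist hits.

    Returns (path, watchlist) pairs, where path lists the entities from the
    person to the entity that appears on the watchlist. max_depth bounds the
    number of memberOf hops that are followed.
    """
    hits = []
    queue = deque([[person]])
    seen = {person}
    while queue:
        path = queue.popleft()
        current = path[-1]
        for watchlist in related(current, "appearsOn"):
            hits.append((path, watchlist))
        if len(path) - 1 >= max_depth:          # hop budget used up
            continue
        for org in related(current, "memberOf"):
            if org not in seen:                 # avoid cycles
                seen.add(org)
                queue.append(path + [org])
    return hits


for path, watchlist in watchlist_links("John Smith"):
    print(" -> ".join(path), "appearsOn", watchlist)
```

Each hop in the walk corresponds to one "retrieve relationships" step in the comparison table, which is why the number of requests grows with the nesting depth.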
  - Fraud investigation of focal entity placing it in one of five levels of threats, based on score
  - Entity’s classifications in taxonomy
  - Entity’s aliases and other names
  - Entity’s real-world relationships to various other entities across multiple entity classes (as defined in the Terrorism domain model)
  - Individual related entities are clickable to navigate to a new knowledge page for that entity, e.g. Al Qaeda (Knowledgebase Navigation)
  - Entity’s canonical name
  - While browsing through relevant knowledge, analyst can search for content on the focal entity or any of the related entities. The analyst can also search for specific relationships between two or more entities by checking corresponding entity boxes for search (Blended Semantic Browsing & Querying, BSBQ)

JIVA Facilitating Knowledge Discovery
On clicking any bin Laden-related entity (e.g. Al Qaeda), a page is displayed to the analyst showing knowledge pertaining to that entity, which can be used in a BSBQ mode, as described on the previous screen. Continuing this integrated approach of Semantic Browsing and Querying, the analyst has the necessary ammunition to perform Knowledge Discovery. The analyst can follow his train of thought as he browses and queries to possibly discover unexpected relationships and links between entities at various levels in an indirect manner. Automatically uncovering such hidden related entities facilitates addition of new and meaningful entities and relationships to the analyst’s assessment tasks.

Wireless Application of Semantic Metadata and Automatic Content Enrichment
[Screens: MyMedia and MyStocks menus (News, Sports, Music; CSCO, NT, IBM, Market); CSCO Analysis listing (Analyst Call, Earnings): Conf Call 11/08 ON24 Payne; 11/07 ON24 H&Q; CC 11/06 CBS Langlesis]
Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company.
The icons denote additional metadata, such as “Strong Buy” by H&Q Analyst.

Metadata’s role in emerging iTV infrastructure
[Figure: Video -> MPEG Encoder (Create Scene Description Tree) -> Enhanced Digital Cable, MPEG-2/4/7 -> MPEG Decoder (Retrieve Scene Description Track) -> GREAT USER EXPERIENCE. Other labels: Node = AVO Object; Scene Description Tree; “NSF Playoff” Node; Voquette/Taalee Semantic Engine; Object Content Information (OCI); Enhanced XML Description; Metadata-rich Value-added Node]
  - Channel sales through Video Server Vendors, Video App Servers, and Broadcasters
  - License metadata decoder and semantic applications to device makers
“NSF Playoff” node metadata:
  Produced by: Fox Sports
  Creation Date: 12/05/2000
  League: NFL
  Teams: Seattle Seahawks, Atlanta Falcons
  Players: John Kitna
  Coaches: Mike Holmgren, Dan Reeves
  Location: Atlanta

Metadata for Automatic Content Enrichment
Interactive Television
  - This screen is customizable with interactivity features using metadata, such as whether there is a new Conference Call video on CSCO.
  - Part of the screen can be automatically customized to show conference call specific information, including transcript, participation, etc., all of which are relevant metadata.
  - The Conference Call itself can have embedded metadata to support personalization and interactivity.
  - This segment has embedded or referenced metadata that is used by the personalization application to show only the stocks that the user is interested in.

Future
• Multimodal interfaces
• Multimodal semantics
• Multivalent Semantics

Metadata Usage: Keyword, Attribute and Content Based Access