SCORE
Voquette Company Confidential

Presentation Overview
• Industry Requirements
• Capabilities
• System Architecture and Technologies
• Examples and Scenarios
• Measures (Quality, Performance, Scalability, Robustness)
• Deployment Information
• Questions & Answers: What if
• Business Development Issues
• Milestones and Schedules

Intelligence Content Management Challenges
1. The Problem: massive, disparate information
• Multiple isolated sources of intelligence information (FBI, CIA, etc.) that are not shared or integrated
• Large variety (format, media) of open source, partner, FAA and IC information
2. The Difficulty: inability to have timely actionable info
• Amount of data too overwhelming to use constructively
• Manual methods of aggregating data not scalable => lack of a “complete picture” to make decisions
• Inability to make timely, accurate and actionable conclusions based on information at hand
3. The Solution: Voquette’s Semantic Technology
• Technology to analyze and integrate data from disparate sources to provide a near real-time, reliable, scalable and actionable solution for intelligence and security applications

New Technical Challenges in Enterprise Content Management
1. Aggregation
• Feed handlers/Agents that understand content representation and media semantics
• Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types from proprietary, partner and open source
2. Homogenization and Enhancement
• Enterprise-wide common and customizable view (information organization)
• Domain model, taxonomy/classification, metadata standards
• Semantic Metadata – created automatically if possible
• Semantic associations/inferences (link analysis)
3. Semantic Applications (in near real-time)
• Search, personalization, alerts, knowledge browsing/inference for improved relevance, intelligent personalization, customization

Voquette’s Unique Capabilities
• Semantics (understanding of content and user needs)
• Extreme relevance
• Knowledge inferencing (semantic associations)
• Near real-time
• Multiple applications/usage patterns (not just search)
• Automation
• Scalability in all aspects

Voquette Semantic Technology System Architecture
[Architecture diagram; component labels:]
• Fast main-memory based query engine with APIs and XML output
• Toolkit to design and maintain the Knowledgebase
• Distributed agents that automatically extract/mine knowledge from trusted sources
• Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content
• Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel
• CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata
• WorldModel specifies enterprise’s normalized view of information (ontology)

Workflow Process
• WorldModel™ (Domain Model), Taxonomy/Classification, Knowledge base schema
• Classifiers
• Knowledge and Content Extraction Agents
• Automated or human-supervised run-time (for classification and metadata enhancement, knowledge base maintenance)
• Semantic Applications
All components support incremental extensions.

Technological Innovation
• Semantic approach (classification/taxonomy, domain model, entities and relationships) [All components]
• Semantic associations/knowledge inferences
• Classification committee (multiple technologies, rather than one size fits all) [CACS]
• Scalability throughout with distributed architecture and implementation (number of content and knowledge sources, indexing, etc.)
• Main memory implementation, incremental checkpointing [SSE]

Example
Domain: Intelligence
Sub-domain: People, Org, Places
(Other Sub-domains: Financing, Methods & Training, Materials)

Intelligence WorldModel™
What is it?
WorldModel™: Template infrastructure to organize and index content contextually
What does it consist of?
Domains (categories) and domain-specific attributes, with geo-spatial and temporal info

Setting up a Terrorist Intelligence WorldModel™
What are the information pieces of possible interest (that can be modeled as WorldModel™ attributes)?
• Groups: Nationalist, Terrorist, Political groups
• Person: Terrorist, Suicide Bomber, Hijacker, Personality
• Event: Flight hijacking, WTC Crash, Kidnapping, Terrorist training
• Bank: Swiss bank, Belgian bank (where groups have accts)
• Attack Material: Knives, Plastic Explosives, RDX, AK47 Gun
• Name Alias: Aliases of terrorists (Osama BL = Usama BL)
• Alias Email Addresses: Email addresses for alias names
• Location: Location related with event of interest
• Time: Date/time related to event of interest
[Figure: Terrorism Intelligence WorldModel™ (simplified); root “Terrorism Intelligence” with attributes Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time]

Intelligence Extractor Agents
What is it?
Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand
How do they work?
• Intelligence extractor agents use the Intelligence WorldModel™ definition for meaningful metadata extraction from trusted Intelligence content
• Extractor agents exploit the structure of Intelligence content and automatically “pick up” meaningful Intelligence metadata information (as defined in the WorldModel™)
[Figure: Extractor Agent for CIA Confidential Content, mapped to the Terrorism Intelligence WorldModel™ attributes (Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time). The agent picks up: syntax metadata, group name, person, attack material, bank name, location/date/time, name aliases. Output: metadata extracted.]

Intelligence Knowledge Base
What is it?
Knowledge Base: Network of Intelligence objects (significant pieces of information) and a representation of the real-world relationships (associations) between them
• Group originated in Country (“Al Qaeda” originated in “Afghanistan”)
• Group accounts in Bank (“Al Qaeda” accounts in “Swiss bank”)
• Group works with Group (“Irish IRA” works with “Colombian Group”)
• Person works for Group (“Nabil Almarabh” works for “Al Qaeda”)
• Person has alias Alias (“Bin Laden” has alias “Mohammed”)
• Alias has email Email add (“Mohammed” has email “mohd@un.com”)
• Person leads Group (“Bin Laden” leads “Al Qaeda”)
• Person involved in Event (“Bin Laden” involved in “WTC Crash”)
• Event occurred at Location (“WTC Crash” occurred at “New York, USA”)
• Event occurred at Time (“WTC Crash” occurred at “0903, 9/11/01”)
[Figure: Terrorism Intelligence WorldModel™ (Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time) alongside the Intelligence Knowledge Base Definition (EmailAdd, Alias, Person, Event, Bank, Group, Country, Location, Time)]

Categorization and Auto-Cataloging System (CACS)
What is it?
CACS: Module that categorizes content and automatically creates metadata of content
How does it work?
Uses a hybrid of statistical, machine learning and Intelligence knowledge-base techniques
Application in Intelligence
CACS could be trained to intelligently process Intelligence content to classify the content piece as a terrorism-related event (WTC Crash, Flight hijacking, etc.)
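
As a rough illustration of the hybrid approach described above, the sketch below combines two toy classifiers in a simple voting committee: a keyword scorer (standing in for a statistical technique) and a knowledge-base entity matcher. The names `KEYWORDS`, `KB_ENTITIES` and `classify`, the sample vocabulary, and the voting rule are all assumptions made for illustration; the slides do not describe how CACS combines its techniques.

```python
from collections import Counter

# Hypothetical keyword lists per category (stand-in for a statistical model).
KEYWORDS = {
    "Flight hijacking": {"hijack", "hijacked", "hijacking", "flight"},
    "Kidnapping": {"kidnap", "kidnapped", "ransom"},
}

# Hypothetical knowledge-base terms associated with each category.
KB_ENTITIES = {
    "Flight hijacking": {"hijacker"},
    "Kidnapping": {"hostage"},
}


def keyword_vote(text):
    """Return the category whose keywords occur most often, or None."""
    words = [w.strip('.,"') for w in text.lower().split()]
    scores = Counter()
    for category, vocab in KEYWORDS.items():
        scores[category] = sum(1 for w in words if w in vocab)
    best, count = scores.most_common(1)[0]
    return best if count > 0 else None


def kb_vote(text):
    """Return the category with the most known KB terms mentioned, or None."""
    lowered = text.lower()
    scores = Counter()
    for category, entities in KB_ENTITIES.items():
        scores[category] = sum(1 for e in entities if e in lowered)
    best, count = scores.most_common(1)[0]
    return best if count > 0 else None


def classify(text):
    """Committee decision: most common non-abstaining vote, else None.

    With only two voters, a disagreement resolves to the keyword vote
    because it is counted first.
    """
    votes = Counter(v for v in (keyword_vote(text), kb_vote(text)) if v)
    return votes.most_common(1)[0][0] if votes else None
```

For example, `classify("The flight was hijacked by an armed hijacker.")` returns "Flight hijacking" because both committee members vote for it, while a text with no matching terms yields `None`. A production committee would presumably weight its members and use trained models rather than fixed word lists.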
[Figure: CACS exchanges information with the Intelligence Knowledge Base Definition (EmailAdd, Bank, Alias, Group, Person, Country, Event, Location, Time) for metadata creation. Input: structured OR unstructured Intelligence content.]
Example output
Event: Pentagon Attack
Metadata extracted:
• Affiliation Country: Afghanistan
• Terrorist Group: Al Qaeda
• Person: Bin Laden
• Allied Group: Saudi Misaal
• Location: Washington, USA
• Person Alias: Mohammed
• Time: 0918 hrs

Intelligence Semantic Engine
What is it?
Semantic Engine: Fast main memory-based front end query engine that enables the end-user to retrieve highly relevant and personalized content via custom APIs
Features and Functionality
• Minimal input from security agent – system intelligent enough to provide all possible relevant content to security agent (type in “Bin Laden” and get all relevant information on him and other items related to him)
• Applications: Search, personalization, alerts, notifications, directory
[Figure: User query submitted by the Confidential Agent → Semantic Engine (Content Enhancement Technology, Intelligent Inference) → highly relevant content returned. Applications: Search, Personalization, Directory, Alerts/Notifications, Analyst WorkBench, Custom Apps.]

Scenario 1: Intelligent Analysis of Confidential Email

Scenario 1: Intelligent Analysis of Email (Contd.)
• Items underlined in blue are important metadata elements automatically picked up by the Intelligence extractor agents
• Items shown in red boxes are names of terrorists (stored in our Knowledge Base) that are also automatically picked up by the Intelligence extractor agents
• CACS can determine by content analysis that this content concerns a “Terrorist Meeting”
• Intelligent inferencing is possible due to semantic associations of the Knowledge Base
[Figure: “Mohamed Atta met with Abdulaziz Alomari” (picked up off explicit mention in email). Voquette Knowledge Associations: works for Al Qaeda, which originated in Afghanistan; works for Saudi Misaal, which originated in Saudi Arabia.]
Inference: Al Qaeda and Saudi Misaal have possibly started working together as allied groups
Inference: Afghanistan and Saudi Arabia have groups that probably collaborate - look for other relationships

Scenario 2: Analyst Workbench
• Voquette’s Semantic Technology enables highly relevant and comprehensive terrorist research
• Example: A security agent wishes to perform research on “Bin Laden” (as he is prime suspect)
• News/Information directly about Bin Laden is retrieved (that mentions his name explicitly)
• News/Information on Al Qaeda is retrieved (Bin Laden ↔ Al Qaeda association in KB)
• News/Information on WTC Crash is retrieved (WTC Crash ↔ Bin Laden association in KB)
• News/Information on Mohammed is retrieved (Mohammed ↔ Bin Laden ‘alias assoc.’ in KB)
• News/Information (intelligence) on Afghanistan is retrieved (Al Qaeda ↔ Afghanistan in KB)
• News/Information (intelligence) on Swiss bank is retrieved (Al Qaeda ↔ Swiss bank in KB)
• Combined, this correlated information is extremely valuable in bringing together multiple actionable perspectives and points of view on one screen
• Result: Less time spent, faster and much better decision making, more security!
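
The retrieval behaviour in Scenario 2 can be sketched as a breadth-first expansion over knowledge-base associations. This is a minimal sketch, assuming the Knowledge Base can be viewed as (subject, relation, object) triples; the triples reuse the relationship examples from the slides, while the function names and the two-hop limit are hypothetical choices, not a description of the actual Semantic Engine.

```python
from collections import defaultdict

# Relationship instances from the slides, stored as
# (subject, relation, object) triples.
TRIPLES = [
    ("Bin Laden", "leads", "Al Qaeda"),
    ("Bin Laden", "has alias", "Mohammed"),
    ("Bin Laden", "involved in", "WTC Crash"),
    ("Al Qaeda", "originated in", "Afghanistan"),
    ("Al Qaeda", "accounts in", "Swiss bank"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency map (relation labels dropped)."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def related_entities(graph, start, max_hops=2):
    """Return every entity reachable from start within max_hops associations."""
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for neighbour in sorted(graph[entity]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen - {start}


graph = build_graph(TRIPLES)
expanded = related_entities(graph, "Bin Laden")
```

Here `expanded` contains Al Qaeda, Mohammed and WTC Crash (one hop) plus Afghanistan and Swiss bank (two hops), which matches the set of topics the scenario says is retrieved for a “Bin Laden” query. Limiting the hop count is one simple way to keep the expansion from pulling in loosely related content.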

Knowledge Inferencing Workflow
[Figure: workflow from Syntax Metadata to Semantic Metadata, with association labels “same entity” and “led by”, ending in human-assisted inference by the Analyst]

Usage Scenarios/Interfaces for Knowledge Inference
Analysts can use:
• Search
• Knowledge Base Browser / Directory
• Personalization/Alerts
• APIs for custom applications
All options support Reference Pages, Semantic Associations, Knowledge-based browsing

Intelligence Analyst Browsing Scenario

Core Competencies of Voquette’s Semantic Technology
Content Aggregation, Integration and Normalization
• Create a Customized WorldModel™ (domain model with customized domain attributes)
• Content aggregation and integration from multiple sources, formats and media (text/audio/video)
• Support push or pull delivery/ingestion of content
• Patented extractor agent technology
• Metadata extraction from structured, semi-structured and unstructured text (fully automated)
• Automatically homogenize content feed tags (fully automated)
Categorization and Auto-Cataloging
• Automatically categorize structured and unstructured text
• Create contextually relevant semantic metadata from unstructured text (fully automated)
• Uniquely uses a hybrid of statistical, machine learning and knowledge-base techniques for classification

Core Competencies of Voquette’s Semantic Technology (continued)
Content Enhancement using Knowledge Base
• Create and maintain a Customized Knowledge Base for any domain
• Automatically create content tags based on the text itself (fully automated)
• Automatically enhance content tags based on information outside of the text (fully automated) by exploiting the Knowledge Base
• Provide the end user not only the relevant content he asked for, but also relevant content that he did not explicitly ask for but needs to know
Semantic Engine
• Fast, main-memory based Semantic Engine
• Response time of the order of 10s of milliseconds
• Performance: 1 million queries per hour per server
• Real-time indexing (stories indexed for search/personalization within a minute)
• Near real-time search/personalization of new content and breaking news
• Information retrieval based on quality and not quantity
• Semantic Applications: Search, Directory, Personalization, Alert, Notifications, Custom enterprise applications

SCORE Implementation Architecture
[Architecture diagram; component labels:]
• Fast main-memory based query engine with APIs and XML output
• Toolkit to design and maintain the Knowledgebase
• Distributed agents that automatically extract/mine knowledge from trusted sources
• Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content
• Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel
• CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata
• WorldModel specifies enterprise’s normalized view of information (ontology)

Example
Domain: Financial Services
Sub-domain: Equity Market
(Other potential sub-domains: Fixed Income, Mutual Funds, …)

Content Enhancement Workflow
[Figure: workflow from Syntax Metadata to Semantic Metadata]

Content Asset Index Evolution
Asset (syntax metadata only)
• Syntax Metadata: Producer: BusinessWire; Source: Bloomberg; Date: Sept. 10 2001; Location: San Jose, CA; URL: http://bloomberg.com/1.htm; Media: Text
Asset (syntax and semantic metadata)
• Syntax Metadata: Producer: BusinessWire; Source: Bloomberg; Date: Sept. 10 2001; Location: San Jose, CA; URL: http://bloomberg.com/1.htm; Media: Text
• Semantic Metadata: Company: Cisco Systems, Inc.; Topic: Company News
Processing steps shown in the diagram:
• XML Feed → Extractor Agent for Bloomberg: scans text for analysis; metadata extracted automatically; creates asset (index) out of extracted metadata
• Categorization & Auto-Cataloging System (CACS): scans text for analysis; classifies document into pre-defined category/topic; appends topic metadata to asset
• Knowledge Base: leverages knowledge to enhance metatagging
• Semantic Engine: enhanced content asset indexed
Enhanced asset
• Syntax Metadata: Producer: BusinessWire; Source: Bloomberg; Date: Sept. 10 2001; Location: San Jose, CA; URL: http://bloomberg.com/1.htm; Media: Text
• Semantic Metadata: Company: Cisco Systems, Inc.; Topic: Company News; Ticker: CSCO; Exchange: NASDAQ; Industry: Telecomm.; Sector: Computer Hardware; Executive: John Chambers; Competition: Nortel Networks; Headquarters: San Jose, CA
[Figure: Knowledge Base entry for Company “Cisco Systems” with associations Headquarters: San Jose; Sector: Computer Hardware; Executives: John Chambers; Industry: Telecomm.; Exchange: NASDAQ; Ticker: CSCO; Competition: Nortel Networks]

Voquette WorldModel™
What is it?
WorldModel™: Template infrastructure to organize and index content contextually
What does it consist of?
Domains (categories) and domain-specific attributes
Examples
Equity WorldModel™ definition
• Domain: Equity
• Equity-specific attributes: Company, Ticker, Industry, Sector, Executive, Headquarters
Sports WorldModel™ definition
• Domain: Sports
• Sub-Domain: Golf
• Sub-Domain: Football
• Sports-specific attributes: Sport Name, Location
• Golf-specific attributes: Golfer, Tourney, Golf Course
• Football-specific attributes: Player, Team, League, Coach

Voquette Extractor Agents
What is it?
Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand
How do they work?
• Extractor agents use the WorldModel™ definition for metadata extraction
• Extractor agents exploit the structure of content and automatically “pick up” meaningful metadata information
• Write once, extract permanently – schedulable according to needs
• Can work on Web content, feeds, XML, corporate databases, etc.
• Extractor agents are specific to the structure of the content at hand
[Figure: Extractor Agent for CNNfN, mapped to the Equity WorldModel™ attributes (Company, Ticker, Industry, Sector, Executive, Headquarters). The agent picks up: syntax metadata, company, ticker, industry, sector, executives, headquarters. Output: metadata extracted.]

Voquette Knowledge Base
What is it?
Knowledge Base: Network of entity objects (significant pieces of information) and a representation of the real-world relationships (associations) between them
What does it consist of?
Entities (person, location, organization, etc.) and Entity-Relationships
How does it work?
• Structured closely to the structure of the WorldModel™
• Entity and relationship template definitions for the domain at hand
• Works with knowledge extractor agents to collect instances of entities from trusted sources
• Automatically creates relationships between instances using type definitions
[Figure: Equity WorldModel™ (Company, Ticker, Industry, Sector, Executive, Headquarters) → Equity Knowledge Base Definition (Company, Headquarters, Sector, Executives, Industry, Exchange, Ticker, Competition) → Knowledge Base instance for Company “Cisco Systems”: Headquarters: San Jose; Sector: Computer Hardware; Executives: John Chambers; Industry: Telecomm.; Exchange: NASDAQ; Ticker: CSCO; Competition: Nortel Networks]

Voquette Categorization and Auto-Cataloging System (CACS)
What is it?
CACS: Module that categorizes content and automatically creates metadata of content
How does it work?
Uses a hybrid of statistical, machine learning and knowledge-base techniques
Features
• Core competency – Not only categorizes, but also catalogs (extracts metadata)
• Unique solution for semantic metadata extraction from unstructured content
• Flexibly adaptable for diverse domains
[Figure: CACS exchanges information with the Equity Knowledge Base Definition (Headquarters, Executives, Company, Exchange, Sector, Industry, Ticker) for metadata creation. Input: structured or unstructured content.]
Example output
Topic: Company News
Metadata extracted:
• Company: Convera
• Ticker: CNVR
• Exchange: NASDAQ
• Industry: Content Management
• Sector: Computer Software
• Headquarters: Vienna, VA
• Executives: Ronald Whittier

Voquette Semantic Engine
What is it?
Semantic Engine: Fast main memory-based front end query engine that enables the end-user to retrieve highly relevant and personalized content via custom APIs
Features and Functionality
• Minimal input from user – system intelligent enough to provide only relevant content to user
• Deep levels of personalization
• Applications: Search, personalization, alerts, notifications, directory, routing, syndication
• Custom applications: Research Dashboard (demo)
[Figure: User query submitted → Semantic Engine (Content Enhancement Technology) → highly relevant content returned. Applications: Search, Personalization, Directory, Alerts/Notifications, Syndication, Dashboard, Custom Apps.]
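
To make the idea of attribute-value queries over semantic metadata concrete, here is a minimal sketch of a main-memory index keyed on (attribute, value) pairs. The `MetadataIndex` class, its methods and the asset ids are hypothetical and do not reflect the Semantic Engine's actual API; the metadata values are taken from the Convera and Cisco examples in the slides.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index from (attribute, value) pairs to asset ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, asset_id, metadata):
        """Index every attribute/value pair of an asset (case-insensitive)."""
        for attribute, value in metadata.items():
            self._postings[(attribute.lower(), value.lower())].add(asset_id)

    def query(self, **criteria):
        """Return sorted ids of assets matching every attribute=value pair."""
        result = None
        for attribute, value in criteria.items():
            ids = self._postings.get((attribute.lower(), value.lower()), set())
            result = set(ids) if result is None else result & ids
        return sorted(result) if result is not None else []


index = MetadataIndex()
index.add("doc-1", {"company": "Convera", "ticker": "CNVR",
                    "exchange": "NASDAQ", "topic": "Company News"})
index.add("doc-2", {"company": "Cisco Systems, Inc.", "ticker": "CSCO",
                    "exchange": "NASDAQ", "topic": "Company News"})

nasdaq_assets = index.query(exchange="NASDAQ")
cisco_assets = index.query(exchange="NASDAQ", ticker="CSCO")
```

Because each lookup is a dictionary access followed by set intersection, a query touches only the postings for the requested values, which is one plausible reason a main-memory design can answer in milliseconds.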

Semantic Application Example – Research Dashboard
[Figure: Research Dashboard screenshot; callouts:]
• Automatic 3rd party content integration
• Focused relevant content organized by topic (semantic categorization)
• Related relevant content not explicitly asked for (semantic associations)
• Automatic content aggregation from multiple content providers and feeds
• Competitive research inferred automatically

COMTEX Content Enhancement - Value-added metatagging
• COMTEX Tagging: limited tagging (mostly syntactic)
• Value-added Voquette Semantic Tagging (content ‘enhancement’): rich semantic metatagging
Value-added relevant metatags added by Voquette to existing COMTEX tags:
• Private companies
• Type of company
• Industry affiliation
• Sector
• Exchange
• Company Execs
• Competitors

COMTEX Content Enhancement - Tag Normalization
[Figure: tag normalization through the Voquette Knowledge Base]
Source A Document: <company_name=Merrill Lynch, Inc.>
Voquette Knowledge Base: Company name: Merrill Lynch & Co.
Normalized tag applied to each source document: <company_name= Merrill Lynch & Co.>
Source A Document with normalized tag
Source B Document: <company_name=Merrill Lynch Corp.>
Source B Document with normalized tag

Classification & Extraction Technology Comparisons
Technology | Classification | Metadata | Features and Advantages | Disadvantages and Limitations
Manual | Yes | Yes | Intelligent, adaptable to changing business needs, high levels of accuracy, rapid integration and deployment, minimal upfront investment | Extremely slow, high cost of maintenance and ownership; may not be possible to scale with very high volume; difficult to have uniformity across humans
Information Retrieval/Document Indexing | No | No | Keyword-based search | Typically poor relevance if used alone on a large data set
Clustering | May be | N/A | User/Enterprise does not need to give taxonomy | Many clusters might be meaningless; broad commercial success not yet demonstrated
Lexical/Natural language (NLP) | N/A | No | Often better than keyword-based search; natural language querying/phrases; good for summarizing document | Does not help beyond search and summarization; generally cannot associate one document with another (no inferencing)
Rules-based | Yes | No | Works well with complex taxonomies, high consistency | Intelligence bounded, high cost of maintenance, high computation cost and possible scalability limitations

Classification & Extraction Technology Comparisons (Contd.)
Technology | Classification | Metadata | Features and Advantages | Disadvantages and Limitations
Machine Learning/AI (Bayesian, HMM, Neural Network) | Yes | No | User/Enterprise can define taxonomy; combined with indexing can lead to better keyword-based search by limiting search to a node in the taxonomy; broad variety of technology choices and good experience in applying the technology | User needs to provide training set; retraining needed if taxonomy is changed; success dependent on training; usually unstructured documents/data only, not structured or semi-structured content
Thesaurus, Reference data (Ontology) | N/A | Limited | Metadata limited to terms in reference data or ontology | How is reference data kept up to date? Context is limited and applications are limited to narrow areas; sometimes “one size fits all”, good for Web search but not necessarily for Enterprise applications; power of relationship missing
Domain Model and Information Extractors | Yes | Yes | For structured data and semi-structured data (Feeds, Web sites); domain model allows user/enterprise to define contextually relevant metadata; allows more precise query formulation (attribute-value); homogenization/integration; semantic search | Need substantial toolkit support for writing extraction, mapping heterogeneous sources to uniform domain model
Knowledge Base (Entities/Classes plus Relationships) | Enhances | Enhances | Extremely powerful, especially when combined with Domain Model; automatic metadata enhancement; very highly relevant search; beyond search (personalization, semantic associations) | Requires creation and maintenance of knowledge base and access to trusted sources for mining/synthesizing knowledge

ROI Comparative Effort Chart
Activity | Traditional Effort | CET Effort | Comments
Categorization of Web pages | 50 pages/day/editor | 1,000 pages/day (with human supervision) [at least an order of magnitude higher without supervision] | Much higher quality metadata generation, in addition to higher quantity
Metatagging of news feeds | 10-20 feeds (syntactic + semantic metadata); 100 feeds (syntactic metadata) | 5,000-10,000 feeds/day (fully automatic) | No human supervision needed
Metatagging of internal/enterprise research content | 50-100 assets/day/research editor | 500-1,000 assets (with human supervision) | Human supervision supports higher quality metadata
Metatagging of content from multiple internal or external sources | Content editors using internally developed tools typically manage 1 to 5 sources | Single person can supervise automatic tagging of content from 20-50 sources |

Deployment System Architecture
• Toolkits (Workstation): Knowledge Base Toolkit, Extractor Toolkit
• Enterprise S/W (Server): Categorization and Auto Cataloging System, Semantic Engine
• Platforms: Linux/Solaris, NT (any system supporting JVM)
• Shared components: WorldModel™, Knowledge Base
• Scaling dimensions: more developers, more sources, higher performance, redundancy, more content . . .

Measures
• Quality
– Categorization accuracy: around 90% (domain and training dependent)
– Metadata extraction: limited only by WorldModel™ and KB (for which we have automated maintenance support)
– Relevance: near 100% (unlike IR techniques, typical precision/recall limitations do not apply when we have metadata)
• Scalability
– Millions of documents per server (for Semantic Engine)
– Unlimited number of documents due to distributed index seamlessly spanning multiple servers
– Few to hundreds of content sources (distributed SW agents)

Measures (Continued)
• Performance
– Inclusion of new content source: 2 to 8 hrs
– Building WorldModel™ and Knowledge Base: 2 to 8 weeks per domain for an effort leading to useful results (approx. 1 million entities and relationships)
– Extraction: several documents per second (processing time)
– Near real-time search/personalization of new content and breaking news (sub-minute, due to incremental indexing)
– 1 million queries per hour per server, or 1 to 10s of ms query response/inference time due to main-memory indexing/data structures
• Robustness
– Semantic Engine has not needed rebooting for over 400 days!
– Many other engineering solutions (HW/SW redundancy) to meet any SLA

Quantitative Measures: Voquette vs. The Rest
Reading and Classification
Pages Read and Classified | Voquette | Average Human
Per Minute | 600 – 10,000 (batch mode) | 1
Per Hour | 36,000 – 600,000 | 60
Per Day | 864,000 – 14.4 Million | 480
Per Year | 315 Million – 5.2 Billion | 120,000

Reading, Classification, Metadata Extraction, Normalization, Enhancement
Pages Read, Classified, Metadata Extracted, Normalized & Enhanced | Voquette | Average Human
Per Minute | 30 | 1
Per Hour | 1,800 | 60
Per Day | 43,200 | 480
Per Year | 16 Million | 120,000

Quantitative Comparison (Continued): Voquette Specifications
Semantic Engine & Knowledge Base Specs | Voquette
Queries per hour per server | 1 Million
Query Response Time (lightly loaded server) | 1 to 10 ms
Query Response Time (heavily loaded server) | 100 to 200 ms
Semantic associations created per hour | 10,000
Semantic associations per domain | Over 1 million
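
The rate figures above can be cross-checked with simple arithmetic. The sketch below scales a per-minute rate to per-hour, per-day and per-year totals and converts the query rate to queries per second. The operating schedules are assumptions, not statements from the slides: the automated figures are assumed to run 24 hours a day, 365 days a year, and the human figures 8 hours a day, 250 days a year, because those schedules reproduce the per-day and per-year columns. The function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def scale_per_minute(pages_per_minute, hours_per_day=24, days_per_year=365):
    """Scale a per-minute rate to (per hour, per day, per year) totals."""
    per_hour = pages_per_minute * 60
    per_day = per_hour * hours_per_day
    per_year = per_day * days_per_year
    return per_hour, per_day, per_year


def queries_per_second(queries_per_hour):
    """Convert an hourly query rate to an average per-second rate."""
    return queries_per_hour / SECONDS_PER_HOUR


low = scale_per_minute(600)          # lower bound of batch-mode classification
high = scale_per_minute(10_000)      # upper bound of batch-mode classification
full = scale_per_minute(30)          # full pipeline including enhancement
human = scale_per_minute(1, hours_per_day=8, days_per_year=250)
qps = queries_per_second(1_000_000)  # roughly 278 queries per second
```

Under these assumptions the upper batch-mode rate scales to 14.4 million pages per day and about 5.26 billion per year, the full pipeline to about 15.8 million pages per year, and one million queries per hour averages about 278 queries per second, or roughly 3.6 ms per query if handled one at a time, which is consistent with the quoted 1 to 10 ms response time on a lightly loaded server.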