Ontotext @ JRC 5-6 Oct 2005

Semantic Web
• The Semantic Web is the abstract representation of data on the WWW, based on RDF and other standards
• The SW is being developed by the W3C, in collaboration with a large number of researchers and industrial partners
http://www.w3.org/2001/sw/
http://www.SemanticWeb.org

Semantic Web (II)
• “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [Berners-Lee et al. 2001]
The spirit:
• Automatically processable metadata regarding:
  – the structure (syntax) and
  – the meaning (semantics)
  of the content.
• Presented in a standard form;
• Dynamic interpretation for unforeseen purposes

Semantic Web: Languages
• RDF(S) – the next slides
• SHOE, XOL, etc. – the pioneers
• Topic Maps – a metadata language with limited impact
• OIL – Ontology Interchange Language, the basis of the next two, http://www.ontoknowledge.org/oil/
  – Description Logics-based multilayered language
• DAML+OIL – the predecessor of OWL, not to be developed further
• OWL – the W3C standard for Semantic Web ontology language, http://www.w3.org/2001/sw/WebOnt/
  – Extends RDF(S), but also constrains it
  – Has multiple layers (Lite, DL, Full)
  – Transitive/symmetric/etc. properties, disjointness, cardinality restrictions

Semantic Web: Problems
• A critical mass of metadata is necessary
• Still a lack of consensus on many issues (like query languages)
• Lack of practices at the proper scale and complexity
• Lack of robust Semantic (nowadays RDFS) repositories:
  – Should be as flexible, multi-purpose and easy to use as HTTP servers and
  – As efficient in structured knowledge management as RDBMS

What are Sirma & Ontotext?
• Established in 1992 as a Bulgarian AI Lab.
• Current structure:
  – Sirma Group International Corp, Montreal, Canada;
  – 8 subsidiary companies; the most important ones follow below.
• Sirma AI, Sofia – the R&D backbone of the group with two divisions:
  – Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy;
  – Ontotext Lab: Knowledge and Language Engineering.
• EngView Systems, Montreal – CAD/CAM systems and applications.
• WorkLogic.Com, Ottawa – Web-based collaboration, workflow, e-Gov.

Software Development and Research since 1992
• Track record of success – large companies and government organizations in the US, Canada, Western Europe and Bulgaria;
• Top-3 Software Company in Bulgaria;
• About 70 developers;
• ISO 2001 Certificate;
• 1999 EIST prize winner;

Sirma Businesses and Domains
Diverse business, ranging from COTS products to custom projects, consultancy, and outsourcing services.
Major areas:
• AI – expert systems (besides Ontotext);
• b2b market places;
• CAD/CAM (for packaging, quality control);
• e-Government, CSCW, Groupware, Workflow;
• Banking;
• C3/C4 Systems (military, airport traffic);
• VOIP billing systems;
• e-Publishing, Proofing tools.

Ontotext Lab
An R&D lab of Sirma for Knowledge and Language Engineering
Research and core technology development for knowledge discovery, management, and engineering.
Specialized for applications in Semantic Web, Knowledge Management, and Web Services.
Aside from the scientific matters, most of us are just professional software developers.

Leading Semantic Web Technology Provider
Ontotext is a leading Semantic Web technology provider, being:
• the developer of the KIM Semantic Annotation Platform;
• a co-developer of the GATE language engineering platform;
• a co-developer of the Sesame semantic repository and the OWLIM high-performance OWL reasoner;
• the developer of the WSMO4J semantic web services API;
• a partner in the SWAN Semantic Web Annotator project.
Ontotext is part of most of the major European research projects in the field, and is the most successful Bulgarian participant in FP6.

Mission
• A critical mass of research in a number of AI areas has made efficient KM almost possible.
• The technology on the market is mostly of two sorts:
  – Expensive black boxes
  – Academic prototypes
Our mission is:
• To develop and popularize open, skillfully engineered tools
• For Information Extraction and Knowledge Management,
• Which considerably reduce the cost of implementing and using KM applications.

Major Research Areas
We focus on building cutting-edge expertise and technology in the following areas:
• ontology design, management, and alignment;
• knowledge representation, reasoning;
• information extraction (IE), applications in IR;
• semantic web services;
• upper-level ontologies and lexical semantics;
• NLP: POS, gazetteers, co-reference resolution, named entity recognition (NER);
• machine learning (HMM, NN, etc.)

Academic & Technology Partners
• NLP Group, Sheffield University, UK;
• Digital Enterprise Research Institute (DERI), Institut für Informatik, Innsbruck, Austria, and National University of Ireland, Galway;
• Aduna (Aidministrator) b.v., The Netherlands;
• Linguistic Modelling Lab, CLPOI, Bulgarian Academy of Sciences;
• British Telecommunications Plc (BT), UK;
• Forschungszentrum Informatik (FZI) and Institut AIFB, Karlsruhe, Germany.

Customers
• SemanticEdge GmbH, Berlin, Germany;
• QinetiQ Ltd, UK;
• Fairway Consultants, UK;

Research Projects
We were/are part of a number of FP5 research projects:
• On-To-Knowledge – the project which invented OIL. Ontology Middleware Module and a DAML+OIL reasoner.
• VISION – Towards Next Generation Knowledge Management.
• OntoWeb – Ontology-based information exchange for knowledge management ….
• SWWS – Semantic Web enabled Web Services.

Research Projects (II)
FP6 integrated projects that started in Jan 2004, with durations of ~3 years:
• SEKT: Semantic Knowledge Technologies. Targeting a synergy of Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology.
• DIP: Data, Information, and Process Integration with Semantic Web Services.
• PrestoSpace: Preservation towards storage and access. Standardized Practices for Audiovisual Contents in Europe.
• Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development Platforms for Web-Service Enabled Applications Using Semantic Web Technologies, Distributed Decision Support Units and Multi-Agent-Systems

Introduction to Ontologies
Despite the formal definitions, ontologies are:
• Conceptual models or schemata
  – Represented in a formalism which allows
  – Unambiguous “semantic” interpretation
  – Inference
• Can be considered a combination of:
  – DB schema
  – XML Schema
  – OO-diagram (e.g. UML)
  – Subject hierarchy/taxonomy (think of Yahoo)
  – Business logic rules

Introduction to Ontologies (II)
• Imagine a DB storing “John is a son of Mary”.
• It will be able to “answer” just:
  – Which are the sons of Mary? Which son is John?
• Now consider an ontology with a definition of the family relationships.
  It could infer:
  – John is a child of Mary (more general);
  – Mary is a woman;
  – Mary is the mother of John (inverse);
  – Mary is a relative of John (generalized inverse).
• The above facts would remain “invisible” to a typical DB, whose model of the world is limited to data structures of strings and numbers.

Products
• The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR and KM applications based on Semantic Web standards
• An extension of the Sesame RDF(S) repository that adds a Knowledge Control System.
• OMM integration options: Built-In, RMI, SOAP, HTTP.
[Diagram: Knowledge Control System, with the components Meta-Information, Tracking Changes, Change Investigation, Current User Info, and Access Control]

Products
• BOR – a DAML+OIL reasoner.
• Proprietary GATE components:
  – Hash Gazetteer. A high-performance lookup tool.
  – Hidden Markov Model Learner. A stochastic module for filtering annotations, disambiguation, etc., based on confidence measures.
• The News Collector is a web service, collecting and indexing articles from the top-10 global news wires:
  – About 1000 articles/day, annotated and indexed using KIM;
  – Used to validate the heuristics and resources of KIM;

Products (II)
• The KIM Platform (the next slides), http://www.ontotext.kim.
• SWWS Studio (http://swws.ontotext.com)
  – Semantic Web Service description development environment
  – Developed in the course of the SWWS project
  – Based on WSMO (http://www.wsmo.org)
• WSMO4J (http://wsmo4j.sourceforge.net)
  – A WSMO API and a reference implementation for building Semantic Web Services applications
  – Used in WSMO Studio (http://www.wsmostudio.org/)
  – The basis for ORDI, used in OMWG (http://www.omwg.org)
  – Used in the projects DIP, SEKT, Infrawebs

OWLIM
• OWLIM is a high-performance OWL repository
• Storage and Inference Layer (SAIL) for the Sesame RDF database
• OWLIM performs OWL DLP reasoning
• It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining and “total materialization”
• In-memory reasoning and query evaluation
• OWLIM provides reliable persistence, based on RDF N-Triples
• OWLIM can manage millions of statements on desktop hardware
• Extremely fast upload and query evaluation even for huge ontologies and knowledge bases

Scalability: Upload and Reasoning
[Chart: upload speed (statements/sec, 0 to 20000) against size of repository (0.5 to 10 million explicit statements); series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB, 1/log]

Scalability: Query Answering
[Chart: evaluation time of Q2 (msec, 0 to 400) against size of repository (0.5 to 10 million explicit statements); series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB]
• Q2: Pattern of 12 statement-joins and a LIKE literal constraint

OWLIM under LUBM Benchmark
• The Lehigh University evaluation is one of the most comprehensive benchmark experiments published recently (ISWC 2004, WSJ 2005)
• Synthetically generated OWL knowledge bases
• The biggest set generated is LUBM(50,0) – 6M explicit statements
• 14 queries, checking different inferences
• OWLIM on LUBM:
  – On a desktop machine OWLIM loads LUBM(50,0) in 10 min
  – The only other system known to load it takes 12 hours
  – All the queries are answered correctly
• Based on this we can claim that:
  – OWLIM is the fastest OWL repository in the world!

JOCI
• “Jobs & Contacts Intelligence”, Innovantage, Fairway Consultants
• Gathering recruitment-related information from the web-sites of UK organizations
• Offering services on top of this data to recruitment agencies, job portals, and others.
• JOCI uses KIM for information extraction (IE, text-mining)
• JOCI makes use of a domain ontology to:
  – support the IE process,
  – structure the knowledge base with the obtained results, and
  – facilitate semantic queries.
• Sirma is a shareholder in Fairway Consultants

JOCI Dataflow
[Diagram: JOCI dataflow, with the components UK Web Space, Web UI, Focused Crawler, Crawler, Classifier, Information Extraction, KIM Server, Single-Document IE, Semantic Repository, Object Consolidation, Document Store]

JOCI: Vacancy Consolidation/Matching
[Diagram: a Consolidated Vacancy linked to Vacancy 1 and Vacancy 2; edge labels: hasJobTitle, sub-string, locatedIn, subRegionOf, type, subClassOf; job titles: “IT Applications Support Analyst”, “Support Analyst”; nodes: U.K., Scotland, Glasgow, Country, City, Location]

JOCI Statistics
• The figures below are indicative and reflect an old state of the JOCI system:
  – The actual figures are to be announced after the launch of JOCI
• Web-sites inspected: 0.5M
• Web-sites with vacancy announcements: 30K
• Extracted vacancies: 100K

The KIM Platform
• A platform offering services and infrastructure for:
  – (semi-) automatic semantic annotation and
  – ontology population
  – semantic indexing and retrieval of content
  – query and navigation over the formal knowledge
• Based on Information Extraction technology

KIM: What’s Inside?
The KIM Platform includes:
• Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB
• KIM Server – with a set of APIs for remote access and integration
• Front-ends: Web-UI and plug-in for Internet Explorer.

The AIM of KIM
• Aim: to arm Semantic Web applications
  – by providing a metadata generation technology
  – in a standard, consistent, and scalable framework

What does KIM do? Semantic Annotation

Simple Usage: Highlight, Hyperlink, and…

Simple Usage: … Explore and Navigate

Simple Usage: … Enjoy a Hyperbolic Tree View

KIM is Based On…
KIM is based on the following open-source platforms:
• GATE – the most popular NLP and IE platform in the world, developed at the University of Sheffield. Ontotext is its biggest co-developer. www.gate.ac.uk and www.ontotext.com/gate
• OWLIM – OWL repository, compliant with the Sesame RDF database from Aduna B.V. www.ontotext.com/owlim
• Lucene – an open-source IR engine by Apache.
  jakarta.apache.org/lucene/

How KIM Searches Better
KIM can match a query like:
  Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002.
With a document containing:
  “At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”
Classical IR could not match:
– Vodafone with a “telecom company in Europe”, which requires knowing that:
  • Vodafone is a mobile operator, which is a sort of a telecom;
  • Vodafone is in the UK, which is a part of Europe.
– 10th of May with a “date in the first half of 2002”;
– “John G. Smith” with “John Smith”.

Entity Pattern Search

Pattern Search: Entity Results

Entity Pattern Search: KIM Explorer

Semantic Metadata in KIM…
• Provides a specific metadata schema,
  – focusing on named entities (particulars),
  – as well as number and time-expressions, addresses, etc.,
  – everything “specific”, apart from the general concepts.
• Defines specific tasks for generation and usage of the metadata which are well-understood and measurable.
• Why not metadata about general things (universals)?
  – It is too complex…
  – but we leave the door open.
• The particulars seem to provide a good 80/20 compromise.

World Knowledge in KIM
Rationale:
• provide common knowledge about world entities;
• KIM bets on scale and avoids heavy semantics; minimum modeling of common-sense, almost no axioms;
• The ontology is encoded in OWL Lite and RDF.
• In addition, a number of rules (generative axioms) are defined, e.g.:
  <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>
• Axioms of this sort are supported by OWLIM and they provide a consistent mechanism for “custom” extensions to the OWL or RDF(S) semantics with respect to a particular ontology

PROTON
• Name. PROTON is an acronym for Proto Ontology
  – ex-names: BULO (basic upper-level ontology), GO (generic ontology);
  – not a Russian space rocket
  – “proto” – used in the sense of “primary”, “beginning”, “giving rise to”, vs. “first in time” or “oldest”;
  – connotations: positive, fundamental, elemental, “in favour of”, even romantic (like a science-fiction novel from the 60s)
• Intended usage. A Basic Upper-Level Ontology like PROTON is used for:
  – ontology population
  – knowledge modelling and integration strategy of a KM environment;
  – generation of domain, application, and other ontologies.

PROTON Design
• Design principles:
  1. domain-independence;
  2. light-weight logical definitions;
  3. compliance with popular metadata standards;
  4. good coverage of concrete and/or named entities (e.g. people, organizations, numbers);
  5. no specific support for general concepts (such as “apple”, “love”, “walk”), however the design allows for such extensions

Some Figures…
• PROTON defines about 250 classes and 100 properties
• Providing coverage of most of the upper-level concepts necessary for semantic annotation, indexing, and retrieval
• A modular architecture, allowing for great flexibility of usage and extension:
  – SYSTEM module – contains a few meta-level primitives (6 classes and 7 properties); introduces the notion of 'entity', which can have aliases;
  – TOP module – the highest, most general, conceptual level, consisting of about 20 classes;
  – UPPER module – over 200 general classes of entities, which often appear in multiple domains.
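
The generative rule from the World Knowledge in KIM slide, <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>, can be illustrated with a naive forward-chaining loop. This is only a sketch in plain Python: it is not OWLIM's engine or API, the function name `materialize` is made up for illustration, and the sample facts are assembled from entities mentioned elsewhere in these slides.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def materialize(triples: Set[Triple]) -> Set[Triple]:
    """Apply <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>
    repeatedly until no new statement can be inferred."""
    closure = set(triples)
    while True:
        # Collect the subRegionOf edges known so far.
        sub_region = {(s, o) for (s, p, o) in closure if p == "subRegionOf"}
        # Fire the rule once over every matching pair of statements.
        inferred = {
            (x, "locatedIn", z)
            for (x, p, y) in closure
            if p == "locatedIn"
            for (y2, z) in sub_region
            if y2 == y
        }
        new = inferred - closure
        if not new:  # fixpoint reached: nothing new was inferred
            return closure
        closure |= new


# Illustrative explicit statements (assumed, not taken from the KIM World KB).
explicit = {
    ("Vodafone", "locatedIn", "UK"),
    ("UK", "subRegionOf", "Europe"),
    ("Glasgow", "locatedIn", "Scotland"),
    ("Scotland", "subRegionOf", "UK"),
}
closure = materialize(explicit)
```

Storing the inferred statements next to the explicit ones is what makes a later query such as "companies located in Europe" a simple lookup; the price is a larger repository, which is in line with the explicit versus after-inference statement counts reported for the KIM World KB.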

PROTON Ontology Language
• The current version of the ontology is encoded in OWL Lite.
• A few custom entailment rules (axioms) are also defined for usage in tools that support them, for instance:
  Premise:
    <xxx, protont:roleHolder, yyy>
    <xxx, protont:roleIn, zzz>
    <yyy, rdf:type, protont:Agent>
  Consequent:
    <yyy, protont:involvedIn, zzz>
• Axioms of this sort are interpreted by OWLIM
• PROTON is portable to any OWL (Lite)-compliant tool.
• PROTON can also be used without such axioms.

Other Standards: Relations
• ADL Feature Type Thesaurus and GNS – the backbone of the Location branch;
  – in its turn aligned with the geographic feature designators of the GNS database of NIMA;
  – PROTON is more coarse-grained, taking about 80 out of 300 types.
• Dublin Core – the basic element set is available as properties of the protont:InformationResource and protont:Document classes;
  – the resource type vocabulary is mapped to sub-classes of InformationResource.
• OpenCyc and WordNet – consulted and referred to in glosses.
• ACE (Automatic Content Extraction) annotation types – covered.
• FOAF – easy mapping is assured (e.g. the Account class was added).
• DOLCE, EuroWordnet Top, and others – consulted to various extents.

Other Standards: Compliance
• Other models are not directly imported (for consistency reasons)
• The mapping of the appropriate primitives is easy, on the basis of
  – a compliant design, and
  – formal notes in the PROTON glosses, which indicate the appropriate mappings.
• For instance, in PROTON, a protont:inLanguage property is defined
  – as an equivalent of the dc:language element in Dublin Core
  – with a domain protont:InformationResource
  – and a range protont:Language

KIM World KB
A quasi-exhaustive coverage of the most popular entities in the world …
• What a person is expected to have heard about that is beyond the horizons of his country, profession, and hobbies.
• Entities of general importance … like the ones that appear in the news …
KIM “knows”:
• Locations: mountains, cities, roads, etc.
• Organizations, all important sorts of: business, international, political, government, sport, academic…
• Specific people, etc.

KIM World KB: Entity Description
The NE-s are represented with their Semantic Descriptions via:
• Aliases (Florida & FL);
• Relations with other entities (Person hasPosition Position);
• Attributes (latitude & longitude of geographic entities);
• their proper Class

The Scale of KIM World KB
                        Small KB     Full KB
RDF Statements
  - explicit             444,086   2,248,576
  - after inference    1,014,409   5,200,017
Instances
  - Entity:               40,804     205,287
  - Location:             12,528      35,590
  - Country:                 261         261
  - Province:              4,262       4,262
  - City:                  4,400       4,417
  - Organization:          8,339     146,969
  - Company:               7,848     146,262
  - Person:                6,022       6,354
  - Alias:                64,589     429,035

KIM IE Pipeline

JAPE Grammars
• JAPE grammars are based on the last MUSE version
• Class/instance information included
• Better class granularity in grammars
• Relation recognition grammars – LocatedIn and HasPositionWithinOrganization

Disambiguation & Filtering
• Simple disambiguation (longest match), e.g. San Francisco Journal
• Based on the main alias, e.g. “Beijing”
• By priority of the class, instance or relative class priority
  – E.g. Brand “Microsoft” vs. Company “Microsoft Corp.”
  – We assign a priority (1-1000) to each class and instance
  – For pairs of classes we define a relative priority
  – If the difference between the priorities is greater than a certain threshold, the possible reference to the entity with the lower priority is ignored
• Still to be improved

KIM Scaling on Data
• The Semantic Repository is based on OWLIM
• In our practical tests we observe perfect performance on top of:
  – 1.2M entity descriptions;
  – about 15M explicit statements;
  – above 30M statements after forward chaining.
• Document and Annotation storage and indexing with Lucene:
  – One million docs, processed on a $1000-worth machine;
  – retrieval in milliseconds.

Entity Ranking: a sketch for Jan-May 2004
No  Instance                        Label                       Rank
1   Country_T.5                     United States               0.032
4   Country_T.IZ                    Republic of Iraq            0.011
6   Person_T.51                     George W. Bush              0.010
9   Country_T.IS                    State of Israel             0.006
11  DayOfWeek_T.4                   Tuesday                     0.005
12  NewsAgency_T.6                  The Associated Press        0.005
14  InternationalOrganization_T.13  United Nations              0.005
27  Country_T.CH                    People's Republic of China  0.004
32  City_T.3068                     New York                    0.004
36  InternationalOrganization_T.18  European Union              0.004
40  Person_T.115                    Ariel Sharon                0.003
43  Country_T.JA                    Japan                       0.003
44  Country_T.UK                    United Kingdom              0.003
45  CountryCapital_T.93             Baghdad                     0.003

SWAN/KIM Cluster Architecture
• At present, KIM is used for massive semantic annotation in the context of the SWAN and SEKT projects
Here are some of its features:
• support for a virtually unlimited number of annotators;
• centralized ontology storage and querying;
• centralized meta-data (annotations) and document storage, indexing, and querying;
• support for multiple crawlers (or other data sources);
• dynamic reconfiguration of the cluster (e.g. starting new crawlers or annotators on demand).
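
Looking back at the Disambiguation & Filtering slide, the priority-threshold step can be sketched roughly as follows. This is a hedged illustration, not KIM's implementation: the `Candidate` type, the function name, the concrete priorities (800 and 300), and the thresholds are all invented for the example, and the relative priority for pairs of classes is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    label: str     # entity label, e.g. "Microsoft Corp."
    cls: str       # ontology class of the entity
    priority: int  # 1..1000; a higher value is assumed to mean "preferred"


def filter_by_priority(candidates: List[Candidate], threshold: int) -> List[Candidate]:
    """Keep the top-priority candidate and every candidate whose priority
    is within `threshold` of it; drop the rest."""
    if not candidates:
        return []
    top = max(c.priority for c in candidates)
    return [c for c in candidates if top - c.priority <= threshold]


# Hypothetical priorities for the two readings of the string "Microsoft".
company = Candidate("Microsoft Corp.", "Company", 800)
brand = Candidate("Microsoft", "Brand", 300)

strict = filter_by_priority([brand, company], threshold=200)   # gap 500 > 200
lenient = filter_by_priority([brand, company], threshold=600)  # gap 500 <= 600
```

With the strict threshold only the Company reading survives; with the lenient one both readings are kept and the ambiguity is left to later processing steps.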

SWAN/KIM Cluster Console

SWAN Project: Semantic Web Annotator
Large Scale Annotation of human language for the Semantic Web using Human Language Technology (HLT).
Hosted by DERI (NUIG, Galway); it also involves:
• the GATE team (from Sheffield University's NLP Group) and
• Ontotext Lab.
• For more details take a look at http://deri.ie/projects/swan/
The current status:
• KIM Cluster of 7 servers in DERI
• Above 0.5TB shared storage
• 6 AMD64 Opterons, 6 Xeons, 36GB RAM

CoreDB: Name and Goals
• CoreDB is a component of KIM
• Stands for: Co-Occurrence and Ranking of Entities DB
• In a nutshell, it is designed to allow fast queries of the sort:
  – Q1: the number of appearances of “UK” in documents during Jan 2005
  – Q2: all people co-occurring with John Smith and some bank institution in documents from the second half of 2003
  – Q3: Q2 + where the documents contain “fraud” and the name of the institution contains “capital”

CoreDB: Functionality
• It allows asking in a structured manner for:
  – The number of references to entities in a (sub-)set of documents
  – The entities which co-occur with other entities
• Entities can be constrained by:
  – Class (and its sub-classes)
  – Keyword/token in one of its names/aliases/labels
• Documents can be constrained according to DC-like features:
  – Date (range; could be any date in the doc)
  – Type (exact match; could be any string)
  – Authors
  – Title and Sub-title
  – Keyword/token in the content, authors or the title fields

The Scale of Ambition
• The major point is to allow such queries in an *efficient* manner over data with the following cardinality:
  – 10^6 entities/terms
  – 10^7 documents
  – 10^2 entities occurring in an average document
• This means managing and querying efficiently 10^9 entity occurrences
• We have tested the current implementations with 10^7 occurrences and they answer the basic queries in milliseconds.

CoreDB Applications
• Detection of “associative” links between entities, based on co-occurrence in documents
  – It is an alternative to the detection of strong links based on local context parsing
• Ranking, i.e. measuring the popularity of an entity over a set of documents
  – The ranking is only as good/relevant/representative as the set of documents is
• Computing timelines (changes over time) for entity ranking or co-occurrence
  – “How did our popularity in the IT press change during June?” (i.e. “What is the effect of this 1.5MEuro media campaign ?!?”)
  – “How does the strength of association between organization X and RDF change over Q1?”

Implementation
• It is a new component in the architecture of KIM
  – Having an API (part of the KIM API) allows different implementations
• There are now a couple of RDBMS-based implementations:
  – Derby (free, open-source, 100% Java, formerly Cloudscape from IBM)
  – ORACLE (v. 10g)
• The Derby implementation
  – does not allow for efficient searches involving keywords
• The ORACLE implementation is also used for FTS-style indexing of the document contents
  – Makes possible an efficient combination of semantic and keyword search (which is already available through the SemanticQuery API)
• In both RDBMS implementations:
  – Part of the ontology and the KB are replicated
  – The same goes for part of the document and index related information

Ontotext Facts
• Founded in 2000
• 14 employees (permanent, without the shared personnel and associates)
• Daily statistics for http://www.ontotext.com: over 150 visits; 2000 hits
• Number of scientific publications: above 30
• Number of projects running: 9
• More than 20 partners we directly cooperate with on projects
• Average age: about 28
• Number of servers per developer: 0.7

Ontotext Lab
Robust Technology and Professional Services for Knowledge and Language Engineering
http://www.ontotext.com
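
As a closing illustration of the CoreDB query types, the following in-memory sketch answers a Q1-style count and a simplified co-occurrence query. Everything in it is assumed for the example: the `Doc` structure, the function names, and the sample documents are hypothetical, and the real component is RDBMS-based (Derby or ORACLE), not a Python list scan.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Doc:
    doc_date: date
    mentions: Dict[str, int]  # entity id -> number of mentions in this document


def in_range(docs: List[Doc], start: date, end: date) -> List[Doc]:
    """Documents whose date falls inside [start, end]."""
    return [d for d in docs if start <= d.doc_date <= end]


def count_mentions(docs: List[Doc], entity: str, start: date, end: date) -> int:
    """Q1-style query: total appearances of an entity in a date range."""
    return sum(d.mentions.get(entity, 0) for d in in_range(docs, start, end))


def co_occurring(docs: List[Doc], anchor: str, wanted_class: str,
                 entity_class: Dict[str, str], start: date, end: date) -> Counter:
    """Simplified Q2-style query: entities of a given class that share a
    document with `anchor`, counted by number of shared documents."""
    counts: Counter = Counter()
    for d in in_range(docs, start, end):
        if anchor not in d.mentions:
            continue
        for entity in d.mentions:
            if entity != anchor and entity_class.get(entity) == wanted_class:
                counts[entity] += 1
    return counts


# Hypothetical sample data.
entity_class = {"UK": "Country", "John Smith": "Person",
                "Ariel Sharon": "Person", "Vodafone": "Company"}
docs = [
    Doc(date(2005, 1, 10), {"UK": 3, "John Smith": 1}),
    Doc(date(2005, 1, 25), {"UK": 1, "Vodafone": 2}),
    Doc(date(2005, 2, 2), {"UK": 5, "John Smith": 2, "Ariel Sharon": 1}),
]

uk_in_january = count_mentions(docs, "UK", date(2005, 1, 1), date(2005, 1, 31))
people_with_smith = co_occurring(docs, "John Smith", "Person", entity_class,
                                 date(2005, 1, 1), date(2005, 12, 31))
```

A linear scan like this would not reach the 10^9 occurrences targeted on the Scale of Ambition slide; it only shows the shape of the queries that an indexed occurrence table would have to answer.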