+ Advanced Intelligence Community R&D Meets the Semantic Web!
Lucian Russell, PhD
Expert Reasoning & Decisions LLC
Semantic Technology Conference
May 24th, 2007

+ Context – Data Sharing in the U.S. Government
• The U.S. Government collects a vast array of data about
– The Natural World
  The Large
  – Our Planet Earth
  – The Solar System
  – The Universe
  The Medium: Our ecosphere
  The Small
  – Our and other creatures’ DNA
  – Fundamental chemical reactions
  – Physics data from sub-atomic particles’ interactions
– The Social World
  The government
  For-Profit Organizations (e.g. through the Securities and Exchange Commission)
  Non-Profit Organizations (e.g. NIH grant recipients)
– Life History Data of People
  Vital Statistics
  Census Statistics
  Medical Records
  Governmental Interactions

+ The Good, the Bad, the Conclusion and the Status
• The Good
– The data collection processes are well thought out
– The data forms for collecting data have a large amount of information
– The data is relatively well maintained
– The data has been collected for many years
– The Computer Science technology used was the best available at the time
• The Bad
– Schemas are not particularly well documented
– People who really understand some applications have retired
– People who really understand some applications have passed away
– The Computer Science technology used was the best available at the time
• The Conclusion
– If it were cost effective to make the data sharable we should do so
• The Status
– When the Federal Data Reference Model (prescription for data sharing) was released in 2005 the cost/benefit ratio limited the amount of data sharing
– Due to new advances, investments are now worthwhile!

+ The FEA & the Data Reference Model
• The U.S.
Federal government has a law stating that all Federal Agencies must have an Enterprise Architecture (EA) consistent with the government-wide Federal Enterprise Architecture (FEA)
– A Business Architecture
– A Technical Architecture
– A Service Architecture
– A Performance Architecture
– A Data Architecture
• To show how to create them the U.S. government created a Reference Model for each category
– A Business Reference Model describing the Agencies' lines of business
– A Technical Reference Model describing the infrastructure HW/SW
– A Service Reference Model describing the application services
– A Performance Reference Model describing the improvements to be realized due to changes to be made in business processes
– A Data Reference Model
• Version 2.0 of the Data Reference Model (DRM) was released in December 2005

+ The Data Reference Model Version 2.0
• Recognized that there were many types of data
– Textual Data – Documents in English
– Structured Data – Data having a schema (e.g. SQL, OO Data Bases)
– Multimedia Data – Web, image, audio, video data
– Geospatial Data – Map data of many viewpoints
– Scientific Data – Collections from many instruments
– Product Data – Manufactured Items/descriptions, e.g. CAD
– Simulation Data – Simulations of phenomena, man made and natural
– Logical inference chains – Reasoning justifications
• Defined
– Data Definitions
– Data Context
– Data Sharing, whose services use the Data Definitions and Data Context
• Designated Communities of Practice (CoPs) as the persons responsible for realizing the data sharing. These persons were supposed to overcome problems as best they could.
+ While you were not watching …
• In 2000 the Intelligence Community set in motion the Advanced Research and Development Activity (ARDA)
• Some programs had no restrictions on the ability of their researchers to publish
• A multi-discipline activity was started in Information Exploitation (Info-X)
• One major program was the AQUAINT program, established by Dr. John Prange
– This program, Advanced Question Answering for Intelligence, was bold, and its goals of advancing the State of the Art seemed overly ambitious
– Fortunately that was not the case – the program is now in its Third Phase
– It remains unclassified, just not widely known
• Another major program was NIMD. It looked at a number of issues including reasoning. Its findings are generally labeled For Official Use Only, but fortunately one no longer is: IKRIS
– The Interoperable Knowledge Representation for Intelligence Support is a new extension of logic, incorporating OWL, ISO Common Logic and other features
– It is the new features that enable a breakthrough in Semantics

+ Why You Should Care
• Def: Semantic Interoperability is a state of an information system artifact.
When an Artifact A is semantically interoperable, a service which wishes to dynamically discover the meaning of data associated with the artifact can do so precisely
• We do not have semantic interoperability today
– XML is a message format
– UDDI and WSDL are means for pre-agreed data descriptions to be communicated
– OWL is a formalism to describe IS-A relationships which include Functions
• Semantic Interoperability requires
– Computers that understand human language
– Schema descriptions that are precise
• Prior to Info-X we had neither
• On April 19th 2006 it became possible to develop Semantic Interoperability
• WARNING: Computers cannot detect lies and miscommunication and cannot compensate for incorrect or intentionally ambiguous language

+ Barriers to Interoperability: Texts and Schemas
• Ideally an English language document describing an Artifact should suffice to describe it for Semantic Interoperability purposes.
– Databases could be defined clearly as to the nature and purposes of their data elements
– Text documents could be read by the computer and described by summaries as well as by extracted key concepts
• Barriers:
– Human language is ambiguous
  Google gets around it by using the “MySpace” model for Web Pages, a social engineering construct, plus paid placements
  Lacking URLs and reference frequencies one is left with pre-culled word lists reduced to stems whose frequency is used as a surrogate for significance
  Well-meaning attempts by non-linguists to create OWL Ontologies do not get at the real problem of correctly specifying concepts
– The Schema Mismatch problem
  Database schemas use names for Entities and Attributes that are too abbreviated to be, by themselves, of use for Semantic Interoperability.
  Data Dictionaries, though helpful, are rarely implemented and there is no standard on the content of descriptions
  There are syntax mismatches (e.g. SSN), and an Entity in one schema can be an Attribute or a Value in another

+ We can now make more Database data sharable
• Previously Databases were documented by writing text that only human beings could read. Due to constraints on time and budgets these were not well written, not understood, and not kept up to date, so they were ignored.
• Database names of Entities and Attributes were short, and hence easy to confuse across multiple databases.
• However, documents that can describe Databases accurately can be processed by the new software that has come out of AQUAINT and put into a knowledge base. CYC, for example, can do this, and so the data can be related to existing Ontologies.
• However, there are problems in “Data Modeling” that must be addressed first in order to get this payoff: the documents must have the right descriptive information, and all of it.
• We have to write data descriptions for the computer, not the human reader.
• To start with, we must use the English Language correctly. To understand the English Language we need to look at the accomplishments of the WordNet project.
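As one illustration of the syntax mismatch mentioned above for SSNs, the following is a minimal sketch, assuming one database stores the number as formatted text and another as an integer. The function name and both storage formats are hypothetical and not taken from any particular agency system.

```python
import re


def normalize_ssn(raw):
    """Return a 9-digit string for an SSN stored as text or as an integer.

    Hypothetical helper: assumes the only variations are separators
    (hyphens, spaces) and integer storage that drops leading zeros.
    """
    if isinstance(raw, int):
        # Integer columns lose leading zeros, so pad back to 9 digits.
        digits = str(raw).zfill(9)
    else:
        # Text columns may carry hyphens or spaces as separators.
        digits = re.sub(r"[\s-]", "", str(raw))
    if not re.fullmatch(r"\d{9}", digits):
        raise ValueError(f"not a recognizable SSN: {raw!r}")
    return digits


# The same (made-up) number as it might appear in two databases.
same_person = normalize_ssn("001-23-4567") == normalize_ssn(1234567)
```

Normalizing the syntax only removes the surface mismatch. It says nothing about what the field means in each database, which is the harder problem the slides go on to address.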
+ The First Building Block: WordNet
• WordNet is found at (http://wordnet.princeton.edu/)
• WordNet disambiguates the English language by listing all the senses of the most common words in English
– Synset: a set of words that can be considered synonyms; each has a number
  With nouns these are generally interchangeable
  With verbs the situation is not so precise: there may be a shade of difference
  All entries for a word meaning have an associated phrase – a gloss – where it is used
– Four parts of speech are used: nouns, verbs, adjectives and adverbs
• WordNet started in 1990 – but has EVOLVED
– Although the project remains the same the content of the system is very different
– When the book on WordNet was published 10 years ago there was WordNet 1.6
– The WordNet system and database are now on Release 3.0 (free download)
– All glosses consist of words that are marked up with their synset numbers
– Category words are distinguished from instances, e.g. “Atlantic” as a noun is an instance – the Atlantic Ocean
– WordNet might be more aptly named Wordnets

+ In what ways are words networked?
• There are at least 10 in WordNet 3.0
– Nouns
  Hypernyms are more general terms and hyponyms the more specific ones
  Holonyms are higher level parts and meronyms are lower level parts
  Telic relationships: “A chicken is a bird” vs. “A chicken is a food” – short for “A chicken is used as a food”. The latter is a telic relationship.
  Synonyms
  Antonyms
– Verbs
  Hypernyms exist but there are 4 types of “hyponyms”, spanning the different aspects (in time) of entailment.
  Holonyms are higher level parts and meronyms are lower level parts of a process
• Coming Soon: others, e.g. noun to verb forms
• HOWEVER:
– A NOUN IS NOT A VERB
– A VERB IS NOT A NOUN
• Nouns have inheritance properties with hypernyms that differ from the hypernyms of verbs.
Do not put a verb in an OWL-DL Ontology.

+ Document Reading and AQUAINT
• Given the new WordNet and the results from AQUAINT-funded projects, documents can now be read (i.e. the content understood).
– To answer a question posed by a user a system must be able to
  Understand the question
  Determine if it entails a number of sub-questions; it must get an answer for each one
  Read each document to find if it has the answers
  Evaluate the answer from each one of them
  Combine the results
  Provide the reasoning about the answer to the user
– Obviously WordNet and sources like FrameNet, VerbNet and other reference collections must be examined and all results combined
• The key to understanding the content is two-fold
– WordNet to distinguish word meanings of the same text string
– A markup that correctly describes relationships in TIME, because AQUAINT has found that there are dozens of logical forms of sentences and distinguishing them means understanding real world temporal relationships (AQUAINT has funded a project that has provided a new, more powerful markup language for time, TimeML)

+ What this means to Semantic Interoperability
• We now have a means of creating a text document that is precise
– All the word meanings are disambiguated
– All the time relationships are correctly stated
• For previously generated text documents a correct identification of concepts is much more likely
• Language Computer Corporation has a suite of tools that are State of the Art in realizing this capability, and it works across languages
• For previously generated Relational or Object databases it is now worth the effort to describe the data precisely, e.g.
– Attributes’ full relationships to the Entities can be described
– Inter-relationships among heretofore ambiguous dates can be clearly stated
• In other words: the schema mismatch problem can be addressed directly – the semantics of such databases can be precisely specified and one can reason about the different forms of the data. Why now?
• Because of IKRIS!

+ What is IKRIS and why is it a Breakthrough?
• IKRIS as a project has created the IKL, the IKRIS Knowledge Language
• Using IKL one can specify
– Any construct in any version of OWL (according to Dr. Chris Welty, IKRIS Co-PI, OWL is First Order Predicate Calculus without Variables)
– Any construct in First Order Logic
– Certain expressions in Second Order Logic
– CONTEXT assumptions, which allow expressions in Non-Monotonic Logic
• How has this been used? One way is to show the interoperability among different languages that specify processes, e.g.
– CYC-L, the language used by CYCORP for its massive Ontology
– PSL, the manufacturing process language developed at NIST
– SOA-S, the proposed Ontology for IT Services
• MORE IMPORTANT: Combine Context and Process specification to create the Contrafactual Conditional!

+ The What?
• The Contrafactual Conditional is a logical statement that is against fact. It is used to specify scientific laws, e.g.
– “Glass is Brittle” means “Were a pane of glass to be struck by a hammer it would shatter”
– Question: “Is this pane of glass brittle? We cannot tell because it is intact!”
• Using the CONTEXT clause we make a logical assertion:
– “At Time T the pane of glass G is P” where P is the process “hit by a hammer H”.
– Conclusion: At time T + T1 G is shattered into a set of shards {Si}
• Reasoning:
– Goal: Find a process Q which keeps the glass G intact
– Conclusion: do not select process P as an instance of Q
What’s Important: Any model of the Real World that can be described in a database can be subjected to real-world reasoning by a computer that has the relevant collection of Real-World laws!

+ Yes but …
• OK, not everything is possible to do perfectly.
• Language
– Despite all the advances due to better linguistic reference databases there are still issues to be considered.
– Disambiguating texts depends on the ability to recognize the Part of Speech (POS) for any given word.
– Finding the POS is possible ONLY when we have detected sentences – but this requires more than just finding “.” marks: “St. John Newfoundland”, “e.g.”, “i.e.”
– Although the most frequent construct is Subject Verb Object we also have: Subject Verb Object Object Verb Object Subject Verb Verb
• Schema Problems – next 2 Slides
• However, we can build later on what is done correctly today!

+ Schema Data Model Deficiencies
• Data recorded in databases comes from one of four sources
– The Real World
– The Social World
– The individual
– Mathematical Patterns
• Any Data about the Real World is a SAMPLE, and most phenomena are continuous; data may record initial conditions and end conditions.
• Data about the Social World is in logical States
• SAMPLES ARE NOT STATES
– Logical State Changes can be captured in update transactions
– Sampling requires knowing the time at which data is measured
– This distinction is not enforced in data models
• Attribute Values in Entities may be captured at very different times, and therefore recognizing the semantics associated with the data is not possible without clarification.
• The processes that create the data values must be described!
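The sample-versus-state distinction above can be made explicit in a record type. What follows is a minimal sketch under stated assumptions: the class name, field names, and the two `kind` labels are hypothetical, chosen only to show a data model that refuses a sample without a measurement time or a value without a described process.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AttributeValue:
    """One attribute value plus the process and time that produced it (hypothetical model)."""
    attribute: str
    value: object
    kind: str                               # "sample" (Real World) or "state" (Social World)
    process: str                            # description of the process that created the value
    measured_at: Optional[datetime] = None  # required for samples

    def __post_init__(self):
        if self.kind not in ("sample", "state"):
            raise ValueError(f"unknown kind: {self.kind!r}")
        if not self.process:
            raise ValueError("the process that created the value must be described")
        if self.kind == "sample" and self.measured_at is None:
            raise ValueError("a sample needs the time at which it was measured")


# A measured value carries its measurement time; a logical state need not.
reading = AttributeValue("A1", 21.5, "sample",
                         "thermometer reading (hypothetical)",
                         datetime(2007, 5, 24, 9, 0))
status = AttributeValue("A2", "registered", "state",
                        "registry update transaction (hypothetical)")
```

The design choice here is to reject incomplete records at creation time, so the distinction that data models do not enforce is at least enforced by the container that holds the values.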
+ Document the Time and Process for each Attribute
Source          Process       Entity attribute
Real World      Process P1    Attribute A1: Value at time T1
Social World    Process P2    Attribute A2: Value at time T2
Person “X”      Process P3    Attribute A3: Value at time T3
Choose model    Process P4    Attribute A4: Value at time T4
Data Modeling Fallacy: All data for One Entity should be combined. No: first collect only data values from the same process, and then combine them with whatever time semantics are needed.

+ So we can fully describe database schemas and …
• The Data Dictionary text can be implemented as texts that describe the data in a relational database as a collection of information about static states that are the result of real world processes.
– The processes that create the data can be described accurately
– The processes that update the data can be described accurately
– Therefore a Service can be created that reasons about whether the Real-World processes that created database A are suited to the needs of the creators of database B
• Semantic Interoperability can now be realized
• Further, there can be a real Service Oriented Architecture, not just application programs repackaged as web-services
• It also means that ill-formed OWL Ontologies, ones that do not correspond to linguistic principles, can be replaced by Knowledge Representations in which accurate OWL-DL structures are specified on the one hand and the processes that use them are described separately using a more powerful representation

+ Build knowledge bases - NOW
• There are two types of Knowledge bases that are needed for Semantic Interoperability, linguistic and Real-World
• CYCORP has the capabilities that are used by IKL and has them now
– This means that there is no reason not to use CYC for building knowledge bases because its representation can be converted by IKL to any other suitably powerful representation
• It also means that whatever other tools are used the knowledge bases created by them can be
shared using IKL as a translator

+ In Summary
• Prior to 2006 Semantic Interoperability was stalled
– The principles of Computer Science to do the job right were not present
– As usual people did the best that they could with the tools at hand
– Many low-level computer processes were incorrectly named as “Semantic” when they were not. They were “gilded farthings”
• In 2006 there were four new developments
– IKRIS was specified by a Multi-University/Industry team
– TimeML was developed by James Pustejovsky’s team at Brandeis
– WordNet 3.0 was completed by Christiane Fellbaum’s team at Princeton
– The AQUAINT Phase II projects to understand language were completed
• What could only be wished for was now possible
Note: The Computer Science breakthroughs were paid for by your tax dollars. There IS a role for government funding!