+ Advanced Intelligence Community R&D Meets
the Semantic Web!
Lucian Russell, PhD
Expert Reasoning & Decisions LLC
Semantic Technology Conference
May 24th, 2007
Page 1
+ Context – Data Sharing in the U.S. Government
• The U.S. Government collects a vast array of data about
– The Natural World
  ▪ The Large
    – Our Planet Earth
    – The Solar System
    – The Universe
  ▪ The Medium: Our ecosphere
  ▪ The Small
    – Our and other creatures’ DNA
    – Fundamental chemical reactions
    – Physics data from sub-atomic particles’ interactions
– The Social World
  ▪ The government
  ▪ For-Profit Organizations (e.g. data filed with the Securities and Exchange Commission)
  ▪ Non-Profit Organizations (e.g. NIH grant recipients)
– Life History Data of People
  ▪ Vital Statistics
  ▪ Census Statistics
  ▪ Medical Records
  ▪ Governmental Interactions
Page 2
+ The Good, the Bad, the Conclusion and the Status
• The Good
– The data collection processes are well thought out
– The data forms for collecting data have a large amount of information
– The data is relatively well maintained
– The data has been collected for many years
– The Computer Science technology used was the best available at the time
• The Bad
– Schemas are not particularly well documented
– People who really understand some applications have retired
– People who really understand some applications have passed away
– The Computer Science technology used was the best available at the time
• The Conclusion
– If it were cost-effective to make the data sharable, we should do so
• The Status
– When the Federal Data Reference Model (a prescription for data sharing) was
released in 2005, the cost/benefit ratio limited the amount of data sharing
– Due to new advances, investments are now worthwhile!
Page 3
+ The FEA & the Data Reference Model
• The U.S. Federal government has a law stating that all Federal Agencies
must have an Enterprise Architecture (EA) consistent with the government-wide
Federal Enterprise Architecture (FEA), comprising
– A Business Architecture
– A Technical Architecture
– A Service Architecture
– A Performance Architecture
– A Data Architecture
• To show how to create them the U.S. government created a Reference
Model for each category
– A Business Reference Model describing the Agencies' lines of business
– A Technical Reference Model describing the infrastructure HW/SW
– A Service Reference Model describing the application services
– A Performance Reference Model describing the improvements to be realized due
to changes to be made in business processes
– A Data Reference Model
• Version 2.0 of the Data Reference Model (DRM) was released in December 2005
Page 4
+ The Data Reference Model Version 2.0
• Recognized that there were many types of data
– Textual Data – Documents in English
– Structured Data – Data having a schema (e.g. SQL, OO databases)
– Multimedia Data – Web, image, audio, video data
– Geospatial Data – Map data of many viewpoints
– Scientific Data – Collections from many instruments
– Product Data – Manufactured items/descriptions, e.g. CAD
– Simulation Data – Simulations of phenomena, man-made and natural
– Logical inference chains – Reasoning justifications
• Defined
– Data Definitions
– Data Context
– Data Sharing, whose services use the Data Definitions and Data Context
• Designated Communities of Practice (CoPs) as the persons responsible for
realizing the data sharing. These persons were supposed to overcome
problems as best they could.
Page 5
+ While you were not watching …
• In 2000 the Intelligence Community set in motion the Advanced Research
and Development Activity (ARDA)
• Some programs had no restrictions on the ability of their researchers to
publish
• A multi-discipline activity was started in Information Exploitation (Info-X)
• One major program was the AQUAINT program, established by Dr. John
Prange
– This program, Advanced Question Answering for Intelligence, was bold, and its
goals of advancing the State of the Art seemed “extremely ambitious”
– Fortunately that was not the case – the program is now in its Third Phase
– It remains unclassified, just not widely known
• Another major program was NIMD (Novel Intelligence from Massive Data). It
looked at a number of issues including reasoning. Its findings are generally
labeled For Official Use Only, but fortunately one no longer is: IKRIS
– The Interoperable Knowledge Representation for Intelligence Support is a new
extension of logic, incorporating OWL, ISO Common Logic and other features
– It is the new features that enable a breakthrough in Semantics
Page 6
+ Why You Should Care
• Def: Semantic Interoperability is a state of an information system artifact.
When an Artifact A is semantically interoperable, a service that
wishes to dynamically discover the meaning of the data associated with the
artifact can do so precisely
• We do not have semantic interoperability today
– XML is a message format
– UDDI and WSDL are means for pre-agreed data descriptions to be
communicated
– OWL is a formalism to describe IS-A relationships which include Functions
• Semantic Interoperability requires
– Computers that understand human language
– Schema descriptions that are precise
• Prior to Info-X we had neither
• On April 19th 2006 it became possible to develop Semantic Interoperability
• WARNING: Computers cannot detect lies and miscommunication and
cannot compensate for incorrect or intentionally ambiguous language
Page 7
+ Barriers to Interoperability: Texts and Schemas
• Ideally an English language document describing an Artifact should suffice
to describe it for Semantic Interoperability purposes.
– Databases could be defined clearly as to the nature and purposes of their data
elements
– Text documents could be read by the computer and described by summaries as
well as key concepts extracted
• Barriers:
– Human language is ambiguous
  ▪ Google gets around it by using the “MySpace” model for Web Pages, a social engineering
    construct, plus paid placements
  ▪ Lacking URLs and reference frequencies, one is left with pre-culled word lists reduced to
    stems whose frequency is used as a surrogate for significance
  ▪ Well-meaning attempts by non-linguists to create OWL Ontologies do not get at the real
    problem of correctly specifying concepts
– The Schema Mismatch problem
  ▪ Database schemas use names for Entities and Attributes that are too abbreviated to be,
    by themselves, of use for Semantic Interoperability. Data Dictionaries, though helpful, are
    rarely implemented, and there is no standard on the content of descriptions
  ▪ There are syntax mismatches (e.g. in how an SSN is formatted), and what is an Entity in
    one schema can be an Attribute or a Value in another

Page 8
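The schema mismatches listed under Barriers can be made concrete with a small sketch. Everything here is hypothetical: the three record layouts, the field names and the sample values are invented for illustration and do not come from any real government database. The sketch shows one fact (a person's employer) stored as an Entity, as an Attribute and as a Value, with the SSN written three different ways.

```python
import re

def normalize_ssn(raw: str) -> str:
    """Reduce differently formatted SSN strings to nine digits (syntax mismatch)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        raise ValueError(f"not a 9-digit SSN: {raw!r}")
    return digits

# Hypothetical databases recording the same fact in three shapes.
# A: "employer" is an Entity with its own table.
db_a = {"employer": [{"id": 7, "name": "NIST"}],
        "person": [{"ssn": "123-45-6789", "employer_id": 7}]}
# B: "employer" is an Attribute of person.
db_b = {"person": [{"ssn": "123456789", "employer": "NIST"}]}
# C: "employer" is a Value of a generic affiliation_type attribute.
db_c = {"person": [{"ssn": "123 45 6789",
                    "affiliation_type": "employer",
                    "affiliation_value": "NIST"}]}

def employer_from_a(db):
    names = {e["id"]: e["name"] for e in db["employer"]}
    return {normalize_ssn(p["ssn"]): names[p["employer_id"]] for p in db["person"]}

def employer_from_b(db):
    return {normalize_ssn(p["ssn"]): p["employer"] for p in db["person"]}

def employer_from_c(db):
    return {normalize_ssn(p["ssn"]): p["affiliation_value"]
            for p in db["person"] if p["affiliation_type"] == "employer"}
```

Each extractor had to be written by someone who already knew what the schema meant, which is exactly the knowledge that abbreviated names and missing Data Dictionaries fail to convey.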
+ We can now make more Database data sharable
• Previously Databases were documented by writing text that only human
beings could read. Due to constraints on time and budgets these were not
well written, not understood, and not kept up to date, so they were ignored.
• Database names of Entities and attributes were short, and hence easy to
confuse across multiple databases.
• However, documents that describe Databases accurately can be
processed by the new software that has come out of AQUAINT and put into a
knowledge base. CYC, for example, can do this, and so the data can be
related to existing Ontologies.
• However, there are problems in “Data Modeling” that must be addressed
first in order to get this payoff: The documents must have the right
descriptive information, and all of it.
• We have to write data descriptions for the computer, not the human reader.
• To start with, we must use the English Language correctly. To understand
the English Language we need to look at the accomplishments of the
WordNet project.
Page 9
+ The First Building Block: WordNet
• WordNet is found at (http://wordnet.princeton.edu/)
• WordNet disambiguates the English language by listing all the senses of
the most common words in English
– Synset: a set of words that can be considered synonyms; each has a number
  ▪ With nouns these are generally interchangeable
  ▪ With verbs the situation is not so precise: there may be a shade of difference
  ▪ All entries for a word meaning have an associated phrase – a gloss – where it is used
– Four parts of speech are used: nouns, verbs, adjectives and adverbs
• WordNet started in 1990 – but has EVOLVED
– Although the project remains the same, the content of the system is very different
– When the book on WordNet was published 10 years ago there was WordNet 1.6
– The WordNet system and database is on Release 3.0 (free download)
– All glosses consist of words that are marked up with their synset numbers
– Category words are distinguished from instances, e.g. “Atlantic” as a noun is an
instance – the Atlantic Ocean
– WordNet might be more aptly named Wordnets
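A minimal sketch of the synset idea, assuming a toy in-memory inventory. This is not WordNet's actual file format or API: the identifiers, glosses and the `senses` helper are made up for illustration. It shows how one text string maps to several numbered senses, each with a part of speech and a gloss, and how an instance is flagged separately from a category word.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Synset:
    synset_id: str             # made-up identifier, not a real WordNet number
    pos: str                   # "noun", "verb", "adjective" or "adverb"
    words: tuple               # words treated as synonyms in this sense
    gloss: str                 # short phrase showing the sense in use
    is_instance: bool = False  # True for instances such as "Atlantic"

# Illustrative entries only.
INVENTORY = [
    Synset("n-0001", "noun", ("bank", "depository"),
           "he cashed a check at the bank"),
    Synset("n-0002", "noun", ("bank", "riverbank"),
           "they sat on the bank of the river"),
    Synset("v-0001", "verb", ("bank", "deposit"),
           "she banks her paycheck every Friday"),
    Synset("n-0003", "noun", ("Atlantic", "Atlantic Ocean"),
           "the ocean between America and Europe", True),
]

def senses(word, pos=None):
    """Return every synset containing the word, optionally limited to one part of speech."""
    return [s for s in INVENTORY
            if word in s.words and (pos is None or s.pos == pos)]
```

Calling `senses("bank")` returns three candidate meanings for the same string; choosing among them is the disambiguation step the slide refers to.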
Page 10
+ In what ways are words networked?
• There are at least 10 in WordNet 3.0
– Nouns
  ▪ Hypernyms are more general terms and hyponyms the more specific ones
  ▪ Holonyms are higher level parts and meronyms are lower level parts
  ▪ Telic relationships: “A chicken is a bird” vs. “A chicken is a food” – short for “A chicken is
    used as a food”. The latter is a telic relationship.
  ▪ Synonyms
  ▪ Antonyms
– Verbs
  ▪ Hypernyms exist but there are 4 types of “hyponyms”, spanning the different aspects (in
    time) of entailment.
  ▪ Holonyms are higher level parts and meronyms are lower level parts of a process
• Coming Soon: others, e.g. noun to verb forms
• HOWEVER:
– A NOUN IS NOT A VERB
– A VERB IS NOT A NOUN
• Nouns have inheritance properties with hypernyms that differ from the
hypernyms of verbs. Do not put a verb in an OWL-DL Ontology
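The point that words are networked in several distinct ways can be sketched as a graph with typed edges. The terms and edges below are illustrative, not taken from WordNet, and the `add` and `ancestors` helpers are hypothetical names. The sketch keeps each relation type separate, so that following hypernyms from "chicken" yields "bird" but never "food".

```python
from collections import defaultdict

# relation name -> {term: set of related terms}; all entries are illustrative
RELATIONS = defaultdict(lambda: defaultdict(set))

def add(relation, source, target):
    RELATIONS[relation][source].add(target)

add("hypernym", "chicken", "bird")
add("hypernym", "bird", "animal")
add("telic", "chicken", "food")   # "a chicken is used as a food"
add("holonym", "wing", "bird")    # a wing is part of a bird

def ancestors(term, relation="hypernym"):
    """Follow one relation type transitively, never mixing relation types."""
    seen, stack = set(), [term]
    while stack:
        for parent in RELATIONS[relation][stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Treating the telic edge as if it were an IS-A edge would make a chicken inherit the properties of food, which is the kind of modeling error the warning about nouns and verbs is guarding against.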
Page 11
+ Document Reading and AQUAINT
• Given the new WordNet and the results from AQUAINT-funded projects,
documents can now be read (i.e. the content understood).
– To answer a question posed by a user a system must be able to
  ▪ Understand the question
  ▪ Determine if it entails a number of sub-questions; it must get an answer for each one
  ▪ Read each document to find if it has the answers
  ▪ Evaluate the answer from each one of them
  ▪ Combine the results
  ▪ Provide the reasoning about the answer to the user
– Obviously WordNet and sources like FrameNet, VerbNet and other reference
collections must be examined and all results combined
• The key to understanding the content is two-fold
– WordNet to distinguish word meanings of the same text string
– A markup that correctly describes relationships in TIME, because AQUAINT has
found that there are dozens of logical forms of sentences and distinguishing them
means understanding real world temporal relationships (AQUAINT has funded a
project that has provided a new more powerful markup language for time,
TimeML)
Page 12
+ What this means to Semantic Interoperability
• We now have a means of creating a text document that is precise
– All the word meanings are disambiguated
– All the time relationships are correctly stated
• For previously generated text documents a correct identification of concepts
is much more likely
• Language Computer Corporation has a suite of tools that are State of the
Art in realizing this capability, and it works across languages
• For previously generated Relational or Object databases it is now worth the
effort to describe the data precisely, e.g.
– Attributes’ full relationships to the Entities can be described
– Inter-relationships among heretofore ambiguous dates can be clearly stated
• In other words: the schema mismatch problem can be addressed directly –
the semantics of such databases can be precisely specified and one can
reason about the different forms of the data. Why now?
• Because of IKRIS!
Page 13
+ What is IKRIS and why is it a Breakthrough?
• IKRIS as a project has created the IKL, the IKRIS Knowledge Language
• Using IKL one can specify
– Any construct in any version of OWL (according to Dr. Chris Welty, IKRIS Co-PI,
OWL is First Order Predicate Calculus without Variables)
– Any construct in First Order Logic
– Certain expressions in Second Order Logic
– CONTEXT assumptions, which allow expressions in Non-Monotonic Logic
• How has this been used? One way is to show the interoperability among
different languages that specify processes, e.g.
– CYC-L, the language used by CYCORP for its massive Ontology
– PSL, the manufacturing process language developed at NIST
– SOA-S, the proposed Ontology for IT Services
• MORE IMPORTANT: Combine Context and Process specification to create
the Contrafactual Conditional!
Page 14
+ The What?
• The Contrafactual Conditional is a logical statement that is against fact. It is
used to specify scientific laws, e.g.
– “Glass is Brittle” means “Were a pane of glass to be struck by a hammer, it
would shatter”
– Question: “Is this pane of glass brittle? We cannot tell because it is intact!”
• Using the CONTEXT clause we make a logical assertion:
– “At Time T the pane of glass G is P” where P is the process “hit by a hammer H”.
– Conclusion: At time T + T1 G is shattered into a set of shards {Si}
• Reasoning:
– Goal: Find a process Q which keeps the glass G intact
– Conclusion: do not select process P as an instance of Q
What’s Important: Any model of the Real World
that can be described in a database can be
subjected to real-world reasoning by a computer
that has the relevant collection of Real-World laws!
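The glass example can be sketched in a few lines, under heavy simplification. This is plain Python, not IKL: the `Law` record, the single law in `LAWS` and the helper names are assumptions made for illustration. The idea shown is that a law is applied hypothetically to predict an outcome, and the prediction is then used to reject process P when searching for a process Q that keeps the glass intact.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Law:
    """If `process` is applied to an object having `property_name`,
    the object ends up in state `outcome`."""
    property_name: str
    process: str
    outcome: str

# Illustrative collection of real-world laws (one entry only).
LAWS = [Law("brittle", "hit by a hammer", "shattered")]

def predict(properties, state, process):
    """Hypothetically apply a process; nothing is actually done to the object."""
    for law in LAWS:
        if law.process == process and law.property_name in properties:
            return law.outcome
    return state  # no law applies, so the state is assumed unchanged

def safe_processes(properties, state, candidates):
    """Select the processes Q whose predicted outcome keeps the current state."""
    return [p for p in candidates if predict(properties, state, p) == state]
```

For a brittle, intact pane, `safe_processes({"brittle"}, "intact", ["hit by a hammer", "wrapped in foam"])` keeps only the second candidate, without the pane ever being struck.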
Page 15
+ Yes but …
• OK, not everything is possible to do perfectly.
• Language
– Despite all the advances due to better linguistic reference databases there are
still issues to be considered.
– Disambiguating texts depends on the ability to recognize the Part of Speech
(POS) for any given word.
– Finding the POS is possible ONLY when we have detected sentences – but this
requires more than just finding “.” marks:
  ▪ “St. John Newfoundland”
  ▪ “e.g.”, “i.e.”
– Although the most frequent construct is Subject Verb Object we also have:
  ▪ Subject Verb Object Object
  ▪ Verb Object
  ▪ Subject Verb
  ▪ Verb
• Schema Problems – next 2 Slides
• However, we can build later on what is done correctly today!
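Why sentence detection needs more than finding “.” marks can be seen in a short sketch. The abbreviation list is a small illustrative assumption, not a real lexicon, and `split_sentences` is a hypothetical helper rather than the method used by any AQUAINT project. It splits on terminal punctuation unless the token is a known abbreviation.

```python
import re

# Small illustrative list; a real system would need a much larger lexicon.
ABBREVIATIONS = {"st.", "e.g.", "i.e.", "dr.", "mr.", "mrs."}

def split_sentences(text):
    """Split on ., ! or ? at the end of a token, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = re.search(r"[.!?]$", token) is not None
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences
```

Even this version is wrong whenever a sentence happens to end with an abbreviation, which illustrates why the slide treats sentence detection as an open issue rather than a solved one.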
Page 16
+ Schema Data Model Deficiencies
• Data recorded in databases comes from one of four sources
– The Real World
– The Social World
– The individual
– Mathematical Patterns
• Any Data about the Real World is a SAMPLE, and most phenomena are
continuous; data may record initial conditions and end conditions.
• Data about the Social World is in logical States
• SAMPLES ARE NOT STATES
– Logical State Changes can be captured in update transactions
– Sampling requires knowing the time at which data is measured
– This distinction is not enforced in data models
• Attribute Values in Entities may be captured at very different times, and
therefore recognizing the semantics associated with the data is not possible
without clarification.
• The processes that create the data values must be described!
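One way to make the sample/state distinction explicit is to give the two kinds of data different types. The class and field names below are hypothetical and chosen for illustration. A `Sample` cannot be constructed without its measurement time, while a `LogicalState` changes only through a recorded update transaction.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Sample:
    """A measurement of a continuous real-world phenomenon;
    it has no meaning without the time at which it was taken."""
    value: float
    measured_at: datetime

@dataclass
class LogicalState:
    """A social-world fact that holds until an update transaction changes it."""
    value: str
    history: list = field(default_factory=list)

    def update(self, new_value, at):
        # Record the transaction (time, old value, new value), then change state.
        self.history.append((at, self.value, new_value))
        self.value = new_value
```

Because `measured_at` has no default, omitting it raises a `TypeError`, so the data model itself enforces the rule that sampling requires knowing the time of measurement.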
Page 17
+ Document the Time and Process for each Attribute
[Diagram: four sources each supply one Attribute of an Entity through a separate Process]
Real World → Process P1 → Attribute A1: Value at time T1
Social World → Process P2 → Attribute A2: Value at time T2
Person “X” → Process P3 → Attribute A3: Value at time T3
Choose model → Process P4 → Attribute A4: Value at time T4
Data Modeling Fallacy: All data for One Entity should be
combined. No – first collect only data values from the same
process, and then combine them with whatever time semantics are needed.
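A small sketch of the alternative to the fallacy, assuming each attribute value carries the name of the process that produced it and the time it was captured. The record type, the labels such as "P1" and "T1", and the sample values are illustrative. Values are grouped by entity and process first, rather than merged into one row per entity.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeValue:
    entity_id: str
    attribute: str
    value: object
    process: str   # which process produced the value, e.g. "P1"
    time: str      # when the value was captured, e.g. "T1"

def group_by_process(values):
    """Collect values per (entity, process) so that differently sourced data stay apart."""
    groups = defaultdict(list)
    for v in values:
        groups[(v.entity_id, v.process)].append(v)
    return dict(groups)

# Hypothetical values for one entity, coming from two different processes.
rows = [
    AttributeValue("E1", "A1", 37.2, "P1", "T1"),
    AttributeValue("E1", "A2", "employed", "P2", "T2"),
    AttributeValue("E1", "A1", 36.9, "P1", "T5"),
]
groups = group_by_process(rows)
```

Here `groups` has two keys, `("E1", "P1")` and `("E1", "P2")`; how the groups are later combined depends on the time semantics each process needs.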
Page 18
+ So we can fully describe database schemas and …
• The Data Dictionary text can be implemented as texts that describe the
data in a relational database as a collection of information about static
states that are the result of real world processes.
– The processes that create the data can be described accurately
– The processes that update the data can be described accurately
– Therefore a Service can be created that reasons about whether the Real-World
processes that created database A are suited to the needs of the creators of
database B
• Semantic Interoperability can now be realized
• Further, there can be a real Service Oriented Architecture, not just
application programs repackaged as web-services
• It also means that ill-formed OWL Ontologies, ones that do not correspond
to linguistic principles, can be replaced by Knowledge Representations
where accurate OWL-DL structures are specified on the one hand and the
processes that use them can be described separately using a more
powerful representation
Page 19
+ Build knowledge bases - NOW
• There are two types of Knowledge bases that are needed for Semantic
Interoperability, linguistic and Real-World
• CYCORP has the capabilities that are used by IKL and has them now
– This means that there is no reason not to use CYC for building knowledge bases
because its representation can be converted by IKL to any other suitably
powerful representation
• It also means that whatever other tools are used, the knowledge bases
created by them can be shared using IKL as a translator
Page 20
+ In Summary
• Prior to 2006 Semantic Interoperability was stalled
– The principles of Computer Science to do the job right were not present
– As usual people did the best that they could with the tools at hand
– Many low-level computer processes were incorrectly named as “Semantic” when
they were not. They were “gilded farthings”
• In 2006 there were four new developments
– IKRIS was specified by a Multi-University/Industry team
– TimeML was developed by James Pustejovsky’s team at Brandeis
– WordNet 3.0 was completed by Christiane Fellbaum’s team at Princeton
– The AQUAINT Phase II projects to understand language were completed
• What could only be wished for was now possible
Note: The Computer Science breakthroughs
were paid for by your tax dollars. There IS a
role for government funding!
Page 21