CYC: Lessons Learned in Large-Scale Ontological Engineering
Dr. Douglas B. Lenat, 3721 Executive Center Drive, Suite 100, Austin, TX 78731
Email: Lenat@cyc.com   Phone: (512) 342-4001   Fax: (512) 342-4040
November 17, 2005
2 July 2005

What Led to Cyc?
1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck”: NL understanding, speech understanding, robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an engineering task than a scientific research task.
3. The time was right (1984).

ELIZA (DOCTOR), 1965, Joe Weizenbaum, MIT
• Carl Rogers-like reflection
  Patient: “I swear a lot.”
  “How do you feel about the fact that you swear a lot?”
• Count on patient-to-doctor respect/awe/…
  Patient: “My dog’s mother died recently.”
  “blah blah blah mother blah blah blah”
  “Tell me more about your mother.”

40 years after ELIZA: ALICE
• JUDGE11 [9:26:0] I WILL ASK YOU SOME SIMPLE QUESTIONS AND GIVE SIMPLE COMMANDS
• PROGRAM [9:26:01] Do you think your plan will succeed?
• JUDGE11 [9:26:16] WHAT IS THE COLOR OF A BLUE CAR
• PROGRAM [9:26:16] WHAT COLOR IS a blue car
• JUDGE11 [9:26:35] WHERE IS SUE’S NOSE WHEN SUE IS IN HER HOUSE
• PROGRAM [9:26:35] Where it belongs. Try searching the World Wide Web.

Is Natural Language a Good Repr.?
• + there’s a lot of it; everyone understands it
• + index and search it, rapidly, using keywords
  – Boolean combinations of keywords
  – Synonyms, hyponyms, hypernyms,… of keywords
• - there are a lot of different languages
• - meanings vary (era, place, age group…)
• - often the analyst’s query requires finding and combining n pieces of data
• - can be inefficient

Is Edward an ancestor or descendant of Sue?
Carol and Sam begat Sara and Fred. Fred and Jane begat Ethan, Elaine, and Edward. John and Sara begat Steven, Mary, and Seth. Ann and Andy begat Sue and Bob. But then Sara cleaved not to John and with Bob begat Joan.
[Figure: family tree. Ann -- Andy: Sue, Bob. Carol -- Sam: Sara, Fred. Bob -- Sara: Joan. Sara -- John: Steven, Mary, Seth. Fred -- Jane: Ethan, Elaine, Edward.]

Five friends get together to play 5 doubles matches, with a different group of 4 players each time. The sums of the ages of the players for the different matches are 124, 128, 130, 136 and 142 years. What is the age of the youngest player?
  v+w+x+y = 124
  v+w+x+z = 128
  v+w+y+z = 130
  v+x+y+z = 136
  w+x+y+z = 142

Natural Language Understanding requires having lots of knowledge
1. The pen is in the box.
   The box is in the pen.
2. The police watched the demonstrators…
   …because they feared violence.
   …because they advocated violence.
3. Every American has a mother.
   Every American has a president.
4. Mary and Sue are sisters.
   Mary and Sue are mothers.
5. The White House announced today that...
6. John saw his brother skiing on TV. The fool…
   ...didn’t have a coat on!
   …didn’t recognize him!

Logically and Arithmetically Combining n Pieces of Info.
An example: an analyst’s query posed as part of HPKB (1996) that Cyc answered.
• Information from multiple sources
• Knowledge about the domain in general
• Commonsense knowledge about the real world
The original dream of Arpanet, EDI, EDR, the Semantic Web,…
Ontology holds the key to doing this!
BUT there are so many ways to “cut corners” and unwittingly fool oneself!

Query: “How different in age were Uday and Qusay Hussein?”
Non-ontology-based methods for DB integration are quadratic
[Figure: the sources DB4, DB8, FBI Most Wanted, CATS, CDE, NARCL, USGS, OFAC.]
DB4 (Sept. 9, 2003)
  SuspN         | YOB
  Qusay Hussein |
  Uday Hussein  | 1964
DB8 (Dec. 31, 1996)
  Prenom | Surnom  | ann
  Qusai  | Hussein | 30
  Odai   | Hussein |

Ontology-Based Methods of DB Integration Can Scale Linearly
[Figure: the same sources around a central CYC; other labels include HAL and “you!”.]
CONCEPTS: #$QusayHusseinAl-Takriti, #$UdaiHusseinAl-Takriti
RULES:
  (age ?PERSON (YearsDuration ?AGE))
  (birthDate ?PERSON ?BIRTH-DATE)
DB4 (Sept. 9, 2003)
  SuspN         | YOB
  Qusay Hussein | 1966
  Uday Hussein  | 1964
DB8 (Dec. 31, 1996)
  Prenom | Surnom  | ann
  Qusai  | Hussein | 30
  Odai   | Hussein | 32
(…and, by the way, enables DB population/enrichment)

A Solution that Scales Linearly
[Figure: the same sources and the same filled-in DB4 and DB8 tables.]
(…and, by the way, enables DB population/enrichment)

“What major US cities are particularly vulnerable to an anthrax attack?”
The answer is logically implied by data dispersed through several sources:
USGS GNIS DB, AMVA KB, RAND R, UN FAO DB, DTRA CATS DB

“What major US cities are particularly vulnerable to an anthrax attack?”
• “major US city”: ?C is a U.S. City with >1M population
  (> (NumberOfInhabitantsFn ?C) 10^6)
• “particularly vulnerable to an anthrax attack”:
  – the current ambient temperature at ?C is above freezing, and
  – ?C has more than 100 people for each hospital bed, and
  – the number of anthrax host animals near ?C exceeds 100k
Don’t add #pullets and #chickens

USGS GNIS DB: The Geographic Names Information System (GNIS) DB maintained by the US Geological Survey (USGS).
  state | name              | type  | county     | state_fips
  TX    | Dallas            | ppl   | Dallas     | 48
  MN    | Hennepin County   | civil | Hennepin   | 27
  CA    | Sacramento County | civil | Sacramento | 6
  AZ    | Phoenix           | ppl   | Maricopa   | 4

  primary_lat | primary_long | elevation | population | status
  32.78333    | -96.8        | 463       | 1022830    | BGN 1978 1959
  45.01667    | -93.45       | 0         | 1032431    |
  38.46667    | -121.31667   | 0         | 1041219    |
  33.44833    | -112.07333   | 1072      | 1048949    | BGN 1931 1900 1897

So how do we explain to our system that:
• row 1 of that table is “about” the city of Dallas, TX
• the population field of that table contains the number of inhabitants of the city that that row is “about”
• here is exactly how to access tuples of that database
• that access will be fast, accurate, recent, complete

• the population field of that table contains the number of inhabitants of the city that that row is “about”
We provide the field encodings and decodings, some of which correspond to explicit fields like population, two-letter state codes, etc.:
  (fieldDecoding Usgs-Gnis-LS ?x
    (TheFieldCalled "population")
    (numberOfInhabitants (TheReferentOfTheRow Usgs-Gnis) ?x))

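The field-decoding idea can be sketched in Python. This is only an illustration, not Cyc's SKSI implementation: the `referent_of_row` and `decode_row` helpers, the tuple encoding of assertions, the entity naming scheme, and the reading of `type = ppl` as "populated place" are all assumptions made for the example. The rows are the four shown in the GNIS table, and the filter is the "major US city" test (more than 1M inhabitants).

```python
# Sketch only: decode rows of a GNIS-like table into logical assertions,
# then ask which entities are US cities with more than 1M inhabitants.
ROWS = [
    {"state": "TX", "name": "Dallas", "type": "ppl", "population": 1022830},
    {"state": "MN", "name": "Hennepin County", "type": "civil", "population": 1032431},
    {"state": "CA", "name": "Sacramento County", "type": "civil", "population": 1041219},
    {"state": "AZ", "name": "Phoenix", "type": "ppl", "population": 1048949},
]


def referent_of_row(row):
    """Hypothetical stand-in for (TheReferentOfTheRow Usgs-Gnis):
    names the real-world entity that the row is "about"."""
    return f"{row['name'].replace(' ', '')}-{row['state']}"


def decode_row(row):
    """Turn explicit fields into (predicate, subject, value) assertions."""
    entity = referent_of_row(row)
    assertions = [("numberOfInhabitants", entity, row["population"])]
    if row["type"] == "ppl":  # assumption: "ppl" marks a populated place
        assertions.append(("isa", entity, "USCity"))
    return assertions


def major_us_cities(rows, threshold=10**6):
    """Entities asserted to be cities whose population exceeds the threshold."""
    facts = [a for row in rows for a in decode_row(row)]
    cities = {s for (p, s, v) in facts if p == "isa" and v == "USCity"}
    return sorted(
        s for (p, s, v) in facts
        if p == "numberOfInhabitants" and s in cities and v > threshold
    )


print(major_us_cities(ROWS))
```

The two county rows also exceed one million inhabitants, but they are not decoded as cities, so the query has to combine a type assertion with the population field rather than filter on the number alone.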
• row 1 of that table is “about” the city of Dallas, TX
We provide the field encodings and decodings, some of which correspond to explicit fields like population, and some of which correspond to entities whose existence is merely implied by the existence of that row in that table. In this case, the first row implies the existence of -- and describes some specifics of -- the geographic entity that is the real-world city of Dallas, Texas, which is represented in Cyc’s KB by the term #$CityOfDallasTexas.
There is a logical field name for that entity, (TheReferentOfTheRow Usgs-Gnis), even though it is only talked about by the explicit fields.

• how to access tuples of that database
We provide all the information needed for a JDBC connection script. We assert, in the context (MappingMtFn Usgs-KS), all of these:
  (passwordForSKS Usgs-KS "geografy")
  (portNumberForSKS Usgs-KS 4032)
  (serverOfSKS Usgs-KS "sksi.cyc.com")
  (sqlProgramForSKS Usgs-KS PostgreSQL)
  (structuredKnowledgeSourceName Usgs-KS "usgs")
  (subProtocolForSKS Usgs-KS "postgresql")
  (userNameForSKS Usgs-KS "sksi")

• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the database, about the completeness etc. of various kinds of data in the DB, etc. We assert, in the context (MappingMtFn Usgs-KS):
  (schemaCompleteExtentKnownForValueTypeInArg Usgs-Gnis-LS USCity numberOfInhabitants 1)

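As a rough illustration of how the connection assertions listed under "how to access tuples of that database" could be consumed, the sketch below assembles a JDBC-style URL from them. The `SKS_FACTS` dictionary and the `jdbc_url` helper are hypothetical, and treating the structuredKnowledgeSourceName value as the database name is an assumption; the URL shape `jdbc:<subprotocol>://<host>:<port>/<database>` is the usual PostgreSQL JDBC form.

```python
# Sketch only: the connection facts, keyed by the predicate that asserts them.
SKS_FACTS = {
    "serverOfSKS": "sksi.cyc.com",
    "portNumberForSKS": 4032,
    "subProtocolForSKS": "postgresql",
    "structuredKnowledgeSourceName": "usgs",  # assumed to be the database name
    "userNameForSKS": "sksi",
}


def jdbc_url(facts):
    """Build a JDBC connection URL from the asserted connection facts."""
    return "jdbc:{sub}://{host}:{port}/{db}".format(
        sub=facts["subProtocolForSKS"],
        host=facts["serverOfSKS"],
        port=facts["portNumberForSKS"],
        db=facts["structuredKnowledgeSourceName"],
    )


print(jdbc_url(SKS_FACTS))
```

Keeping the connection details as assertions rather than hard-coding them means the same reasoning machinery that answers queries can also inspect how each source is reached.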
Cyc automatically gathers statistics like these, and uses them to order search:
  (resultSetCardinality Usgs-Gnis-PS
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state"))
    TheEmptySet
    60.0)
  (resultSetCardinality Usgs-Gnis-PS
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "primary_long")
            (PhysicalFieldFn Usgs-Gnis-PS "primary_lat")
            (PhysicalFieldFn Usgs-Gnis-PS "name"))
    (TheSet (PhysicalFieldFn Usgs-Gnis-PS "county")
            (PhysicalFieldFn Usgs-Gnis-PS "state"))
    530.36)

Semantic Knowledge Source Integration (SKSI) summary
• Some of the knowledge needed will generally be in the Cyc KB already
• Some will reside in already-mapped sources: data bases, web pages, simulators, etc.
• For each needed new source, explain the meaning of its schema elements to Cyc
  – Write Cyc assertions to convey the meaning of each field, each polymorphism, each idiosyncratic entry code, plus metainformation: when this was created/updated, level of granularity, its sources, its degree of completeness, what it can do quickly, what it can do (slowly), how to access it, etc.

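One plausible use of the result-set cardinality statistics shown earlier is to decide which lookup to run first. The sketch below is not Cyc's planner; it simply orders candidate access patterns by estimated result-set size, using the two figures from the statistics slide. The `AccessPattern` type and the `order_by_selectivity` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    wanted: frozenset   # fields to retrieve
    bound: frozenset    # fields whose values are already known
    est_rows: float     # estimated result-set cardinality


STATS = [
    AccessPattern(
        wanted=frozenset({"primary_long", "primary_lat", "name"}),
        bound=frozenset({"county", "state"}),
        est_rows=530.36,
    ),
    AccessPattern(wanted=frozenset({"state"}), bound=frozenset(), est_rows=60.0),
]


def order_by_selectivity(patterns):
    """Smallest expected result set first, so later lookups start from fewer bindings."""
    return sorted(patterns, key=lambda p: p.est_rows)


plan = order_by_selectivity(STATS)
print([sorted(p.wanted) for p in plan])
```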
How “general knowledge” helps search
• Query: “Someone smiling”
  Caption: “A man helping his daughter take her first step”
• Query: “Show me pictures of strong and adventurous people”
  Caption: “A man climbing a rock face”
• Query: “Outdoor explosions in terrorist events Lebanon between 1990 and 2001”
  Text document: “1993 pipe bombing on the patio of the Beirut Olive Garden”

How “general knowledge” + domain knowledge helps search
• Query: “Threats to low-flying US airliners in Lebanon”
  Text document: “Hezballah buys ten SA-7’s.”

Find and clean (consistency-check) information by inference (+KB)
XYZCo
  ID # | birth date | hire date | salutation | first name | last name | emerg contact | signif other
  8041 | 9/1/57     | 8/5/91    | Mr         | Pat        | Jones     | 8053          | 8053
  8053 | 3/3/49     | 2/9/48    | Ms         | Jan        | Smith     | 8053          | 8199
If Pat and Jan are married, their date of marriage should be the same; their address is likely to be the same; their genders are likely to differ; and so on.

Cyc is…
• Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
  – The typical bird has 1 beak, 1 heart, lots of feathers,…
  – Hearts are internal organs; feathers are external protrusions
  – Most vehicles are steered by an awake, sane, adult,… human
  – Tangible objects can’t be in 2 (disjoint) places at once
  – Badly injuring a child is much worse than killing a dog
  – Causes temporally precede (i.e., start before) their effects
  – A stabbing requires 2 cotemporal and proximate actors
  – etc.
• Each of these is represented in formal logic
• Info. about a set of hundreds of thousands of terms
• Language-independent
  [Figure: words (ArabicWordForWritingPen, EnglishWord-Plume, EnglishWord-Pen, FrenchWord-Plume) linked to concepts (Penitentiary, WritingPen, Corral, BirdFeather, Authoring, …).]
• An inference engine that draws from those the same sorts of inferences that people would.
• Interfaces so the system can communicate with people, data bases, spreadsheets, websites, etc.

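The consistency-checking idea from the XYZCo personnel table can be sketched as a handful of rule functions. Everything here is illustrative: the column alignment of the two records is a reconstruction of the table, the two-digit years are assumed to be 19xx, and the three checks (hired after birth, not one's own emergency contact, reciprocal significant-other links) are examples of commonsense constraints in the spirit of the slide rather than rules quoted from Cyc.

```python
from datetime import date

# Sketch only: the two XYZCo records, keyed by employee ID.
RECORDS = {
    8041: {"name": "Pat Jones", "birth": date(1957, 9, 1), "hire": date(1991, 8, 5),
           "emerg_contact": 8053, "signif_other": 8053},
    8053: {"name": "Jan Smith", "birth": date(1949, 3, 3), "hire": date(1948, 2, 9),
           "emerg_contact": 8053, "signif_other": 8199},
}


def check(records):
    """Return (employee_id, problem) pairs for records that violate a constraint."""
    problems = []
    for emp_id, rec in sorted(records.items()):
        if rec["hire"] <= rec["birth"]:
            problems.append((emp_id, "hired before being born"))
        if rec["emerg_contact"] == emp_id:
            problems.append((emp_id, "is own emergency contact"))
        other = records.get(rec["signif_other"])
        if other is None or other["signif_other"] != emp_id:
            problems.append((emp_id, "significant-other link is not reciprocal"))
    return problems


for emp_id, problem in check(RECORDS):
    print(emp_id, problem)
```

None of these checks is specific to personnel databases; each follows from general knowledge about people, dates, and relationships, which is the point of cleaning data by inference.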
[Figure: Cyc architecture. Knowledge Users; User Interface (with Natural Language Dialog); Other Applications; Cyc API; Cyc Reasoning Modules; Knowledge Entry Tools; Knowledge Authors; Cyc Ontology & Knowledge Base; Interface to External Data Sources; External Data Sources (Data Bases, Web Pages, Text Sources, Other KBs).]

Painful Evolution of our Representation from Frames&Slots to Contextualized HOL
• Upper Ontology: EVENT, TEMPORAL-THING, PARTIALLY-TANGIBLE-THING
• Core Theories:
  ∀(a, b) a ∈ EVENT ∧ b ∈ EVENT ∧ causes(a, b) ⇒ precedes(a, b)
• Domain-Specific Theories:
  ∀(m, a) m ∈ MAMMAL ∧ a ∈ ANTHRAX ⇒ causes(exposed-to(m, a), infected-by(m, a))
• Very specific information (some indirect, via SKSI):
  (ist FtLaudHolyCrossERCase#403921 (caused CutaneousAnthrax (SkinLesions Ahmed_al-Haznawit)))
First Order Predicate Calculus: unambiguous; enables mechanical reasoning
  Every American has a president.   ∃y.∀x. Amer(x) ⇒ president(x, y)
  Every American has a mother.      ∀x.∃y. Amer(x) ⇒ mother(x, y)
Higher Order Logic (nth-order predicate calculus): contexts, predicates as variables, nested modals, reflection,…

• Cyc is not monolithic
  The Knowledge Base is divided into thousands of contexts by: granularity, topic, culture, geospatial place, time,...
• Cyc is not committed to any one reasoning mechanism
  The inference engine is a community of 720 “agents” that attack every problem and, recursively, every subproblem (subgoal). One of these 720 is a general theorem prover; the others have special-purpose data structures/algorithms to handle the most important, most common cases, very fast.
• Cyc is not monotonic
  98% of its content is marked as merely being usually true. So reasoning in Cyc is default (gather up all the pro/con arguments, and compare them).
• Cyc is not committed to its own reasoning mechanisms
  Think of reasoning modules 721, 722, 723… as being all manner of external databases, simulators, translators…

Cyc Knowledge Base
Cyc contains: 15,000 Predicates; 300,000 Concepts; 3,200,000 Assertions
Represented in: • First Order Logic • Higher Order Logic • Context Logic • Micro-theories
[Figure: map of KB topic areas. Thing; Intangible Thing; Individual; Sets; Relations; Temporal Thing; Spatial Thing; Partially Tangible Thing; Logic; Math; Time; Space; Events; Scripts; Agents; Artifacts; Paths; Spatial Paths; Borders; Geometry; Materials; Parts; Statics; Physical Objects; Living Things; Life Forms; Plants; Animals; Ecology; Natural Geography; Political Geography; Weather; Earth & Solar System; Human Beings; Human Anatomy & Physiology; Emotion, Perception, Belief; Human Behavior & Actions; Human Artifacts; Products, Devices; Conceptual Works; Vehicles, Buildings, Weapons; Mechanical & Electrical Devices; Software, Literature, Works of Art; Language; Actors, Actions; Movement; State Change; Dynamics; Plans, Goals; Physical Agents; Agent Organizations; Organization; Organizational Actions; Organizational Plans; Social Relations, Culture; Social Behavior; Social Activities; Human Activities; Business & Commerce; Purchasing, Shopping; Types of Organizations; Politics; Warfare; Sports, Recreation, Entertainment; Transportation & Logistics; Human Organizations; Nations, Governments, Geo-Politics; Professions, Occupations; Travel; Communication; Law; Everyday Living; Business, Military Organizations. Bottom layers: General Knowledge about Various Domains; Specific data, facts, and observations.]

Cyc KB extended with domain knowledge about terrorism
[Figure: the same topic map and counts, with the bottom layers labeled General Knowledge about Terrorism and Specific data, facts, and observations about terrorist groups and activities.]

Building Cyc qua Engineering Task
[Figure, shown in three builds: axis label “amount known”; the later builds add the label “CYC”.]

Guiding Principle: “We have to get it to work, not appear to work”
– Don’t defer hard problems (time/space/emotions…)
– No “NIH”! Harness every good idea that others have
– Take an engineering approach, not a scientific research one: instead of one TOE (elegant full solution), find a set of partial solutions that together cover the most common cases
– Pursue applications that require large amounts of real-world knowledge (they need Cyc and also will drive it)

Eschew the 5 pitfalls (ways to cut ontological corners and end up with something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler repr.

Choosing what to add to Cyc
• Bottom-up: Look at a sentence, see what knowledge the writer assumed the reader already had about the world. Generalize that piece of knowledge.
• Top-down: Articulate the scope of a (sub)topic, and articulate queries that should be answerable. Get the missing knowledge by introspecting or just asking Cyc.

The Cyc Knowledge Base
Cyc contains: 15,000 Predicates; 300,000 Concepts; 3,200,000 Assertions
Represented in: • First Order Logic • Higher Order Logic • Context Logic • Microtheories
[Figure: map of KB topic areas, from Thing (Intangible Thing, Individual, Sets, Relations, Temporal Thing, Spatial Thing, Partially Tangible Thing, Logic, Math, Time, Space, Events, Scripts, Agents, Artifacts, …) through physical, biological, human, social, and organizational topics (Materials, Living Things, Human Beings, Human Artifacts, Social Behavior, Business & Commerce, Politics, Warfare, Transportation & Logistics, Law, Everyday Living, …). Bottom layers: Real World Domain Knowledge; Specific cases, facts, details,…]

Cyc KB “Whitman’s Sampler”
• Temporal Relations
• Senses of “x is a physical part of y”
• Senses of “x is physically in y”
• Events and their performers (role types)
• Organizations
• Propositional Attitudes
• Biology
• Materials
• Devices
• Weather
• Information-bearing objects

Temporal Relations
37 Relations Between Temporal Things
  #$temporalBoundsIntersect   #$temporallyIntersects
  #$temporalBoundsContain     #$temporalBoundsIdentical
  #$startsAfterStartingOf     #$startsDuring
  #$endsAfterEndingOf         #$overlapsStart
  #$startingDate              #$startingPoint
  #$temporallyContains        #$simultaneousWith
  #$temporallyCooriginating   #$after

Temporal Relations
Some of these relations are very general, such as #$temporallyIntersects. Such relations are particularly useful when they are known not to hold between a pair of individuals:
  (#$not (#$temporallyIntersects ?X ?Y))
That implies all of these:
  (#$not (#$spouse PERSON-X PERSON-Y))
  (#$not (#$consultant AGENT-X AGENT-Y))
  (#$not (#$accountHolder ACCOUNT-X AGENT-Y))
  (#$not (#$residesInRegion AGENT-X REGION-Y))
  (#$not (#$officiator EVENT-X PERSON-Y))

Senses of ‘Part’
  #$parts                    #$physicalParts
  #$intangibleParts          #$externalParts
  #$subInformation           #$internalParts
  #$subEvents                #$anatomicalParts
  #$physicalDecompositions   #$constituents
  #$physicalPortions         #$functionalPart

Senses of ‘In’
• Can the inner object leave by passing between members of the outer group?
  – Yes -- Try #$in-Among
• Does part of the inner object stick out of the container?
  – None of it. -- Try #$in-ContCompletely
  – Yes -- Try #$in-ContPartially
• If the container were turned around could the contained object fall out?
  – Yes -- Try #$in-ContOpen
  – No -- Try #$in-ContClosed
• Is it attached to the inside of the outer object?
  – Yes -- Try #$connectedToInside
• Can it be removed, if enough force is used, without damaging either object?
  – Yes -- Try #$in-Snugly or #$screwedIn
• Does the inner object stick into the outer object?
  – Yes -- Try #$sticksInto

Event Types
  #$PhysicalStateChangeEvent   #$TemperatureChangingProcess
  #$BiologicalDevelopmentEvent #$ShapeChangeEvent
  #$MovementEvent              #$ChangingDeviceState
  #$GivingSomething            #$DiscoveryEvent
  #$Cracking   #$Carving   #$Buying   #$Thinking
  #$Mixing     #$Singing   #$CuttingNails   #$PumpingFluid
11,000 more

A few event types pertaining to Vehicular Transportation
  #$TransportationEvent
  #$ControllingATransportationDevice
  #$TransportWithMotorizedLandVehicle
  (#$SteeringFn #$RoadVehicle)
  #$TransporterCrashEvent
  #$VehicleAccident
  #$CarAccident
  #$Colliding
  #$IncurringDamage
  #$TippingOver
  #$Navigating
  #$EnteringAVehicle

Relations Between an Event and its Participants
  #$performedBy        #$causes-EventEvent
  #$objectPlaced       #$objectOfStateChange
  #$outputsCreated     #$inputsDestroyed
  #$assistingAgent     #$beneficiary
  #$fromLocation       #$toLocation
  #$deviceUsed         #$driverActor
  #$damages            #$vehicle
  #$providerOfMotiveForce   #$transportees
Over 400 more.

These ActorSlots express each type of relation between an Event and its actors and subevents.
Here are some slot: value pairs for Attack874
  isa: TerroristAttack.
  performedBy: JihadGroup.
  deviceUsed: Bomb8388.
  eventOccursAt: CityOfLondonEngland.
  victim: Person9399.
  victim: Person52666.
  assistingAgent: AlQaeda.
  objectsDestroyed: Structure2990.
  objectsDestroyed: Vehicle523452.

Organization “Slots”
• #$governingBody
• #$parentCompany
• #$subOrgs-Command
• #$subOrgs-Permanent
• #$subOrgs-Temporary
• #$physicalQuarters
• #$hasHQinCountry
• #$officeInCountry
• #$memberTypes
• #$organizationHead
• #$PolicyFn
• #$mainProductType
+ those predicates that make sense for each generalization of Organization (e.g., #$startingTime, #$alsoKnownAs).

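The slot: value pairs listed for Attack874 map naturally onto a multi-valued frame. The sketch below is only an illustration; the `Frame` class, the `agents_involved` helper, and the choice of which slots count as agent slots are hypothetical and not Cyc's API. It shows why a slot such as victim or objectsDestroyed must be able to hold several values.

```python
from collections import defaultdict


class Frame:
    """A named unit whose slots may each hold several values."""

    def __init__(self, name):
        self.name = name
        self.slots = defaultdict(list)

    def add(self, slot, value):
        self.slots[slot].append(value)

    def get(self, slot):
        # .get avoids creating an empty entry for slots that were never filled
        return list(self.slots.get(slot, []))


attack = Frame("Attack874")
for slot, value in [
    ("isa", "TerroristAttack"),
    ("performedBy", "JihadGroup"),
    ("deviceUsed", "Bomb8388"),
    ("eventOccursAt", "CityOfLondonEngland"),
    ("victim", "Person9399"),
    ("victim", "Person52666"),
    ("assistingAgent", "AlQaeda"),
    ("objectsDestroyed", "Structure2990"),
    ("objectsDestroyed", "Vehicle523452"),
]:
    attack.add(slot, value)

# Assumption for the example: these two slots are the ones filled by agents.
AGENT_SLOTS = ("performedBy", "assistingAgent")


def agents_involved(frame):
    """Collect every agent that plays some role in the event."""
    return [v for slot in AGENT_SLOTS for v in frame.get(slot)]


print(attack.get("victim"))
print(agents_involved(attack))
```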
Emotion
• Types of Emotions:
  – #$Adulation
  – #$Abhorrence
  – #$Relaxed-Feeling
  – #$Gratitude
  – #$Anticipation-Feeling
  – Over 120 of these
• Predicates For Defining and Attributing Emotions:
  – #$contraryFeelings
  – #$appropriateEmotion
  – #$actionExpressesFeeling
  – #$feelsTowardsObject
  – #$feelsTowardsPersonType

Propositional Attitudes
Relations Between Agents and Propositions
• #$goals
• #$intends
• #$desires
• #$hopes
• #$expects
• #$beliefs
• #$opinions
• #$knows
• #$rememberedProp
• #$perceivesThat
• #$seesThat
• #$tastesThat

Materials
• Common Substances
• Attributes of Materials
• States Of Matter
  – SolidStateOfMatter
  – LiquidStateOfMatter
  – GaseousStateOfMatter
• Solutions
• Electrical Conductivity
• Thermal Conductivity
• Structural Attributes
• Tangible Attributes
  – SolidTangibleThing
  – LiquidTangibleThing
  – GaseousTangibleThing

Devices
• Over 4000 Specializations of #$PhysicalDevice
  – #$ClothesWasher
  – #$NuclearAircraftCarrier
• Vocabulary for Describing Device Functions
  – #$primaryFunction-DeviceType
• Device Specific Predicates
  – #$gunCaliber
  – #$speedOf
• Device States (40+)
  – #$DeviceOn
  – #$CockedState

Vehicular Transport Devices
• Over 800 Specializations of #$RoadVehicle
  – #$AcuraCar
  – #$SportUtilityVehicle
  – #$Humvee
• Over 100 Specializations of #$AutoPart
  – #$AutomobileTire
  – #$ShockAbsorber
  – #$Windshield
• Five Facets of #$RoadVehicle
  – #$RoadVehicleByChassisType
  – #$RoadVehicleTypeByBodyStyle
  – #$RoadVehicleTypeByModel
  – #$RoadVehicleTypeByPowerSource
  – #$RoadVehicleTypeByUse
• Specialized Predicates
  – #$highwayFuelConsumption
  – #$vehicleLoadClass
  – #$trafficableForVehicle
  – #$vehicle

Weather
• Weather Events
  – #$TornadoAsEvent
  – #$SnowProcess
• Weather Objects
  – #$CloudInSky
  – #$SnowMob
• Weather Attributes
  – #$ClearWeather
  – #$Visibility
  – (#$LowAmountFn #$Raininess)

Information-Bearing Things
Books, web-page copies, radio broadcasts, utterances, intell cables, TV series,…

What is “Moby Dick”?
• InformationBearingThing (IBT)
• AbstractInformationStructure (AIS): “’Tis Moby Dick!”
• PropositionalInformationThing (PIT):
  (#$thereExists ?SEE
    (#$and (#$isa ?SEE Seeing)
           (#$objectPerceived ?SEE #$MobyDick)
           (#$perceiver ?SEE #$CaptainAhab)))

What is “Moby Dick”?
[Figure: InformationBearingThing (IBT), ConceptualWork (CW), AbstractInformationStructure (AIS), and PropositionalInformationThing (PIT), linked by textOfIBT, instantiationOfCW, ContainsInfo-Propositional-CW, InfoStructureOfCW, PITOfIBTFn, and #$infoStructureRepresents.]

Bridging the Knowledge Gap
• upper ontology: Water is wet
• intermediate ontology: Vehicles slow down in bad weather
• lower ontology (task-specific knowledge): HUMMV’s lose 18% traction in 4-inch-deep mud

KR Lessons Learned
We started with a straightforward “Frames & Slots” representation (in 1972), improving it over the years as -- but only as -- we needed to.
  FredAlbertson
    ownsA: Dog
    isA: Person
    worksFor: UT
    . . .
But Frames&Slots are inadequate to naturally express
• disjunction (“Fred owns a dog or a parakeet.”)
• negation (“Fred does not own a dog.”)
• modals (“Fred believes Israel wants Egypt to expect…”)
• meta-assertions (“That rule is 50 years old but reliable.”)
• nested quantification (w)(x)(y)(z)…
  “Every American has a president.” versus “Every American has a mother.”

KR Lessons Learned
2. On the one hand, we must move from Frames&Slots to Logic. But on the other hand: Theorem-proving is too slow!
Solution: Do it, and to recoup efficiency, separate:
• The Epistemological Problem (what should the system know?)
• The Heuristic Problem (how can it reason efficiently with & about what it knows?)
I.e., represent each assertion in (at least) 2 ways: one standard logical (predicate calculus) form (EL), and one (or more) efficient special-purpose representations (HL)

Lessons Learned
• Bridging the knowledge gap: do the “intermediate theories.”
• Rather than struggling to reason in NL sentences, use a more formal representation language. Make this as simple as possible (but, year by year, we had to make it ever more expressive).
• Similarly, represent only -- but all -- useful distinctions. Sounds trivial but leads to huge ontologies of objects, predicates, scripts…
• Distinguish the EL and HL. Rather than striving in vain for a single fast inference engine, use a suite of 720 heuristic modules that each handle some commonly-occurring problems very fast.
• Probabilities are great iff known; often relative likelihood is known
• Most knowledge is default; reason by argumentation
• Rather than striving in vain for a monolithic consistent KB, divide the KB up into many locally-consistent contexts

Contexts (Microtheories)
Global Consistency: Can’t Live With It, Can’t Give It Up!
What’s the real source of the problem?
Each rule is rich: it is a simplified statement that obscures a plethora of unstated assumptions and details.
As long as the rules are all in one coherent small context, they are likely to make the same simplifying assumptions, and hence are likely to work together consistently.

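A minimal sketch of what dividing a KB into locally consistent contexts can look like, under assumptions made purely for illustration: assertions are stored per context, a context may inherit from more general ones, and a query is answered only from what is visible in the asking context. The `ContextKB` class, the parent links, and the example contexts and facts are all hypothetical; real Cyc microtheories are far richer.

```python
class ContextKB:
    """Assertions are stored per context; each context sees its ancestors' assertions."""

    def __init__(self):
        self.parents = {}   # context -> list of more general contexts
        self.facts = {}     # context -> set of assertions made directly in it

    def add_context(self, ctx, parents=()):
        self.parents[ctx] = list(parents)
        self.facts.setdefault(ctx, set())

    def assert_in(self, ctx, fact):
        self.facts[ctx].add(fact)

    def visible(self, ctx):
        """All assertions made in ctx or in any context it inherits from."""
        seen, stack, out = set(), [ctx], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            out |= self.facts.get(current, set())
            stack.extend(self.parents.get(current, []))
        return out

    def holds(self, ctx, fact):
        return fact in self.visible(ctx)


kb = ContextKB()
kb.add_context("BaseKB")
kb.add_context("RealWorldMt", parents=["BaseKB"])
kb.add_context("VampireFictionMt", parents=["BaseKB"])

kb.assert_in("BaseKB", ("isa", "Dracula", "Character"))
kb.assert_in("VampireFictionMt", ("exists", "Vampires"))
kb.assert_in("RealWorldMt", ("not", ("exists", "Vampires")))

print(kb.holds("VampireFictionMt", ("exists", "Vampires")))
print(kb.holds("RealWorldMt", ("exists", "Vampires")))
```

The two sibling contexts hold assertions that would contradict each other in a single global theory, yet neither context sees the other's assertion, so each stays consistent on its own.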
“If it’s raining, carry an umbrella”
• the performer is a human being, the performer is sane, the performer can carry an umbrella; thus: the performer is not a baby, not unconscious, not dead,
• the performer is going to go outdoors now/soon,
• their actions permit them a free hand (e.g., not wheelbarrowing)
• their actions wouldn’t be unduly hampered by it (e.g., marathon-running)
• the wind outside is not too fierce (e.g., hurricane strength)
• the time period of the action is after the invention of the umbrella
• the culture is one that uses umbrellas as a rain- (not just sun-)protection device,
• the performer has easy access to an umbrella; thus: not too destitute, not someone who lives where it practically never rains, not at the office/theater/… caught without an umbrella
• the performer is going to be unsheltered for some period of time; the more waterproof their clothing, the gentler the rain, and the warmer the air, the longer that time period
• the performer will not be wet anyway (e.g., swimming)
• the rain is annoying -- but merely annoying. Thus: not ammonia rain on Venus, radioactive post-apocalyptic rain, biblical (Noah’s-ark-sized, or frogs/blood as rained on Pharaoh)
• the performer is not a hydrophobic person, gingerbread man, etc., and not a hydrophilic person, someone dying of thirst, etc.

Each assertion should be situated in a context: in a region of context-space
• We identified 12 dimensions of mt-space:
– Anthropacity
– Time
– GeoLocation
– TypeOfPlace
– TypeOfTime
– Culture
– Sophistication/Security
– Topic
– Granularity
– Modality/Disposition/Epistemology
– Argument-Preference
– Justification
• We developed a vocabulary of predicates and terms to describe points and regions along each of those 12 dimensions; and
• We have been situating assertions more and more precisely, and we have been working out calculi for inferring contexts
– E.g., if P is true in C1, and P=>Q is true in C2, in what context can Q be validly concluded?

Mathematical Factoring of Context-space Dimensions
UnitedStatesIn1985Context: Ronald Reagan is president. There are at least 900,000 doctors.
PennsylvaniaIn1985Context: Dick Thornburgh is governor.
LehighCountyInFebruary1985Context:
– Dick Thornburgh is governor and Ronald Reagan is president.
– Dick Thornburgh is governor and there are at least 900,000 doctors.

Time Indices and Granularities
Doug is talking, at 10:30 to 11:30, on 11/17/05.
Therefore: Doug is talking, at 10:50 to 11:05, on 11/17/05.
But not: Doug is talking, at 10:55:11 to 10:55:13, on 11/17/05.

Time Indices and Granularities
Doug is talking, at 10:30 to 11:30, on 11/17/05, with temporal granularity = calendar minute.
P = Doug is talking.
t = that one hour interval
[Figure: timeline of calendar minutes showing the interval t and the future]
So: talking during that 15-minute interval? Yes
Talking during that 2-second interval: Unknown

Summary (1): Technology
• Cyc is a power source, not a single application. Like oil, electricity, telephony, computers,… Cyc can spawn and sustain a new industry.
• It can cost-effectively underlie almost all apps. (Provide a common-sense layer to reduce brittleness when faced with unexpected inputs/situations)
• To apply Cyc, we extend its ontology, its KB, and possibly its suite of specialized reasoning modules

20 Motivating Applications (1984)

5 More Recent Application Ideas

Recent/Current Government Apps
• Dept. of Defense (mostly DARPA, ONR)
– CoABS, HPKB, CPoF, DAML, ACIP
– RKF (OE-ing by non-logicians via clarification dialogue)
– BUTLER: Knowledge-based machine learning
– ResearchCyc: Clean, document, speed up, interface, etc.
– ONR: Level 2 and 3 Information Fusion (sense-making)
• Other US Government Agencies (NSF, ARDA, NIST)
– NIST ATP: Jumpstarting a Nat’l. Knowledge Infrastructure
– AQUAINT, NIMD, Topsail, Eagle, KSP-ATD,…
– Building a comprehensive terrorism KB for the US
– Automated generation of plausible terrorism threat scenarios
– Modeling intelligence analysts (script learning/recognition)
– Semantic knowledge source integration
– Efficient Inference in Large Knowledge Bases

Recent/Current Commercial Apps
• using Cyc as the basis for a medical ontology
– aligning Cyc with Snomed/UMLS/Mesh/...
• multiple-thesaurus manager (align n 300k-term lists)
• spider the entire Web (indexing it in terms of Cyc concepts)
• identify inter-sentential references in NPR transcripts
• improved web (and website) search query/follow-ups
• vulnerability assessment (reason about a scanned network)
• semantic matching for a better customer experience

Summary (2): Cycorp
• 50 employees (almost all MTS’s)
• Revenue about $7M/year (some commercial licenses and app.’s, but >50% US Government R&D contracts)
• Employee-owned (VC-free and debt-free)
• $75M development effort (750 PY’s over 21 years)
– Mostly spent on building up its ontology and KB
– To a lesser extent, its reasoning modules and interfaces
– Focus: automatically growing Cyc via learning
– Focus: enabling Cyc users to directly extend it
– Focus: making inference orders of magnitude faster

Summary (3): The Message: What Needs to be Shared?
• bits/bytes/streams/network…
• alphabet, special characters,…
• words, morphological variants,…
• syntactic meta-level markups (HTML)
• semantic meta-level markups (SGML, XML)
• content (logical representation of doc/page/...)
• context (common sense, recent utterances, and n dimensions of metadata: time, space, level of granularity, the source’s purpose, etc.)

Summary (3): The Message: What Needs to be Shared?
Tiny vocabulary (# distinctions) of standard relations: rdf:type, subclass, label, domain, range, comment,…
Beyond which diversity is tolerated
Which means divergence is inevitable
“What do you mean we have no standard, we have lots of standards!”

DAML+OIL adds a few more distinctions: inverses, unambiguous properties, unique properties, lists, restrictions, cardinalities, pairwise disjoint lists, datatypes, …
To do the logical/arithmetic combination across information sources, we need tens of thousands of relations, not tens

From the User’s POV
• The user has a question they want answered
– “Which first-run movies star a teenager born in Texas and are showing today at a theater < 10 minutes’ drive from this building?”
• The data needed to answer it is available to them, but not in one single, obvious, reliable place
• The answers follow logically (and/or arithmetically) from m elements in n sources
• Do want the answer to be found automatically, not a bunch of relevant pages for them to peruse.
• Don’t want to have to know, ahead of time, what sources to go to, how to access them, how to combine the intermediate results.
• Do want to be able to limit, ahead of time, the uncertainty, recency, granularity, ideology… (and/or see such meta-level info for each answer)

End of “The Message”
End of “The Summary”
Delve into a typical domain – answering intelligence analysts’ queries – where Cyc can really help, because that domain thwarts all five of the “ontological corner-cutting” solutions
(+ digressions for OpenCyc, ResearchCyc,…)

Eschew the 5 pitfalls (ways to cut ontological corners and end up with something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler repr.

The Analyst’s Knowledge Base
CT Analyst: “Were there any attacks on targets of symbolic value to Muslims since 1987 on a Christian holy day?”
Domain Experts: “What sequences of events could lead to the destruction of Hoover Dam?”
[Figure: architecture of the Analyst’s Knowledge Base (AKB). Labeled components: Query Formulator; Explanation Generator; Scenario Generator; Cyc Reasoning Modules; Cycorp Tools for Ontology-Building, -Browsing, -Editing, & Fact/Rule Entry; Others’/GOTS Analysis and Collaboration Components; General Knowledge; Terrorism Knowledge; Terrorism Knowledge Base (AKB); Relational DB “projection” of the AKB; Interface to Data Repositories (HUMINT messages, SIGINT data, message content, INS border crossings, geopolitical data, global observations, terrain data, weather data, travel records, credit card records, satellite intel, military intel)]

1. Fusion of available structured terrorism knowledge sources: A tiny fraction of the Comprehensive AKB.
2. Terrorism domain experts met to develop a schema for the missing knowledge.
3. They and others are working remotely, collaboratively, to flesh out the missing 95% of the AKB.
4. Cyc uses general and domain knowledge to convert the simple English phrases into formal logic.
5. The Comprehensive AKB: First useful state: will contain over 4M facts and rules of thumb, about half of which is pre-existing general knowledge already in Cyc.
[Figure labels: output of COTS Text Extraction Systems; preexisting structured terrorism knowledge sources (MIPT, TKS2, TKS3, MATRIX, TTT, TKS6, TKS7, TKS8); relevant knowledge; counts 80k and 1.92M]

Templatized Terrorism Analysis Queries
(1) List the [ORGANIZATIONS] at which [AGENT] was [STATUS] and when.
(1a) List the schools at which [Mohammed Atta] was [enrolled] and when.
(1b) List the companies at which [Mark Fulton] was [employed] and when.
(3) What percentage of [ATTACK-TYPE] are [ATTACK-TYPE]?
(3a) What percentage of [terrorist attacks] are [poisonings]?
(3b) What percentage of [bombings] are [suicide bombings]?
(4) Between what times was the [AGENT] a/an [ROLE-PREDICATE] in what types of acts and where?
(4a) Between what times was the [Aum Supreme Truth] a [performer] in what types of acts and where?
(4b) Between what times was the [Ulster Volunteer Force] an [assisting agent] in what types of acts and where?

Templatized Terrorism Analysis Queries
(13) List all [AGENT-TYPE] in [LOCATION] that have used [DEVICE-TYPE] and list the specific types of (devices) that each has used.
(13a) List all [revolt organizations] in [Northern Ireland] that have used [pipe bombs] and list the specific types of pipe bombs that each has used.
(13b) List all [right wing terrorist groups] in [North America] that have used [package bombs] and list the specific types of package bombs that each has used.
(22) List the [AGENT-TYPE] who have [RELATION] [TYPE] to [AGENT] and what those supplies were.
(22a) List the [Terrorist groups] who have [given] [supplies] to [Hamas] and what those supplies were.
(22b) List the [state sponsored terrorist agents] who have [provided] [support] to [Osama Bin Laden] and what those supplies were.

CIA Intelligence Report
“Seeking Information: Ahmad Said”
July 26, 2004
Ahmad Said, an expert on remote-controlled bombs with a degree in chemical engineering, was seen travelling to Lebanon early this month. Said claimed to be a member of the Lebanese Hizballah from the mid 1980s until late July 1999. It is currently believed that Said assisted in the July 22nd car bombing in Beirut that damaged police barracks and destroyed several retail stores. Lebanese Hizballah's spokesman, Emad Mugniyeh, issued a statement on July 26th to the Al Aman newspaper denying the group's involvement in the attack.

Deeper Analytical Question Answering
What factors argue <for/against> the conclusion that <ETA> <performed> <the March 2004 Madrid attacks>?
For:
- ETA often executes attacks near national elections
- ETA has performed multi-target coordinated attacks
- Over the past 30 years, ETA performed 75% of all terrorist attacks in Spain
- Over the past 30 years, 98% of all terrorist attacks in Spain were performed by Spain-based groups, and ETA is a Spain-based group.
Against:
- ETA warns (a few minutes ahead of time) of attacks that would result in a high number of civilian casualties, to prevent them. There was no such warning prior to this attack.
- ETA generally takes responsibility for its attacks, and it did not do so this time.
- ETA has never been known to falsely deny responsibility for an attack, and it did deny responsibility for this attack.

Automatic Link Detection

Intelligent Fusion: Disparate Data
• USS Lake Champlain is scheduled to return to its homeport (NavBase San Diego) 1300 4 September
• Hurricane Howard predicted to make landfall at Tijuana, Mexico approx. 0100 5 September
• 0600 4 September: satellite imagery reveals 126 boats berthed at Silver Gate Yacht Club.
• 0600 4 September: Silver Gate Yacht Club harbormaster manifest only lists 124 craft.
• 1135 4 September: Coast Guard reports two cigarette boats, traveling together at 54 knots, on a trajectory consistent with a path from the Silver Gate Yacht Club to the entrance of the San Diego Naval Base.
• Monitoring of cell phone activity of a suspected Red Dawn terrorist cell member in Syria has identified four calls, each of 30 seconds’ duration, placed to that suspect from Shelter Island between 2300 September 3 and 1100 September 4.

Automatic Generation of Plausible (Counter)Terrorism Scenarios
• Start from seed, if given one
• End at target, if given one
• (meet in middle)
• Each step should be both plausible and interesting
• Grow whole populations of such paths, not just one.
• Employ heuristics to evaluate each node’s “promise”: plausibility x interestingness

Generate chains of action and plausible reaction
Often a step is just a response, by 1 or more agents, to the prior step (or, if going right to left, it is an enabler/cause of the already-known successor step)
Each step can be a…
• Political event (e.g., an election)
• Diplomatic event (communique’)
• Military event (buildup along border)
• Terrorist event (suicide bombing)
• Economic event (loan; arms sale)
• Infrastructure event (power outage)
• Act of Nature (illness; hurricane)

Generate chains of action and plausible reaction
Example chain, read from first step to target:
– Al Qaida does a sudden, atypical liquidizing of $1M of its assets (Al Qaida has high net worth (assets) and the will to do it)
– buy it for $1M from Pakistan (Pakistan has such devices and is financially hurting)
– detonate a crude 100 kton nuclear bomb, 1 km away
– Destroy 3.24M tons of concrete
– Hoover dam is blown up

Auto. Scen.Gen.: Lessons Learned
• Forward generation is too explosive
• Backward generation is too sterile
• Instead, use a sort of “cardiac rhythm”
– Take a large step backward (ABDUCTION)
– Work forward a little from it (DEDUCTION)
– Repeat.
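The “cardiac rhythm” can be sketched as a loop that alternates a few backward (abductive) steps with a short burst of forward (deductive) steps. The Python below is a toy illustration only: the rule table, the event names, the function `cardiac_rhythm`, and the policy of always taking the alphabetically first cause are all assumptions made for the example, not Cyc's scenario generator, which the slides say also scores each node for plausibility and interestingness.

```python
# Hypothetical causal rules, cause -> list of effects (neutral toy events).
RULES = {
    "hurricane": ["power outage", "flooding"],
    "power outage": ["pumps fail", "traffic lights fail"],
    "pumps fail": ["reservoir overflows"],
    "flooding": ["roads closed"],
}


def invert(rules):
    """Build effect -> causes so we can step backward from a target."""
    causes_of = {}
    for cause, effects in rules.items():
        for effect in effects:
            causes_of.setdefault(effect, []).append(cause)
    return causes_of


def cardiac_rhythm(target, rules, back_steps=2, forward_steps=1, beats=3):
    """Alternate backward (abduction) and forward (deduction) steps."""
    causes_of = invert(rules)
    scenario = {target}
    frontier = target
    for _ in range(beats):
        # ABDUCTION: take a large step backward from the current frontier.
        node = frontier
        for _ in range(back_steps):
            causes = sorted(causes_of.get(node, []))
            if not causes:
                break
            node = causes[0]  # toy policy: first cause alphabetically
            scenario.add(node)
        if node == frontier:
            break  # nothing further back to hypothesize
        # DEDUCTION: work forward a little from the hypothesized cause.
        layer = {node}
        for _ in range(forward_steps):
            layer = {e for c in layer for e in rules.get(c, [])}
            scenario |= layer
        frontier = node
    return scenario


events = cardiac_rhythm("reservoir overflows", RULES)
print(sorted(events))
```

Keeping `forward_steps` small is what stops the forward phase from becoming "too explosive", while the backward phase supplies causes that pure forward search would rarely reach; here the forward step also surfaces side effects such as "traffic lights fail" that are not on the direct path to the target.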
Targeted Fact Gathering: Web Search
Query: (foundingDate AbuSayyaf ?X)
Search Strings:
• Abu Sayyaf was founded in ___
• Al Harakat Islamiya, established in ___
• ASG was established in ___
Retrieved text (local storage): Abu Sayyaf was founded in the early 1990s
Parse
Suggested Fact: (foundingDate AbuSayyaf (EarlyPartFn (DecadeFn 199)))

Targeted Fact Gathering: Web Search
Query: (maritalStatus YassirArafat ?X)
PersonTypeByMaritalStatus
All Possible Facts:
• (maritalStatus YassirArafat Single)
• (maritalStatus YassirArafat Married)
• (maritalStatus YassirArafat Divorced)
…
• (maritalStatus YassirArafat Cohabitating-Unmarried)
Search Strings:
• Yasser Arafat’s fiance
• Yasser Arafat’s wife
• Yasser Arafat’s ex-wife
• Yasser Arafat divorced
Suggested Fact: (maritalStatus YassirArafat Married)

Harnessing Lots of Users
• Identify underpopulated common sense predicates
• Use semantic constraints + shallow parsing to identify possible fact completions
• Present multiple choice questions to novices to complete facts
150-400 commonsense GAFs/hour
useful distinguishing facts
Hat worn on: Head / Neck / Foot / Leg

OpenCyc
Open Source release of: [most of] the Cyc Ontology + Simple Relns. + Inference Engine
ResearchCyc
Almost All of Cyc (for free for R&D purposes)

The OpenCyc Release
• Runs on Windows, Linux
• OpenCyc Knowledge Base
– LGPL license
– 47,000 terms
– 306,000 facts
• Cyc Inference Engine
– Free license for binary runtime engine
• Application Programming Interface
– Java, SubL, Python
• Extensive documentation
– Ontological Engineer’s Handbook
– Online Cyc 101 course

Why Do We Release All This?
• Advance the starting line for AI
• Enable a large number of users to in effect help us to grow the Cyc Knowledge Base
• Help Cyc become a critical component
– in the Semantic Web
– in more and more applications
– using OpenCyc hopefully leads to using ResearchCyc for free, eventually licensed

OpenCyc is Upward-Compatible with ResearchCyc
ResearchCyc contains
• OpenCyc
• Natural Language Processing subsystem
• Many more facts/rules per term
– The “extent” of non-structural predicates

60,000 OpenCyc Users/Contributors, 50 Active ResearchCyc User Groups:
Categories: Government, Government-related, University, Commercial, NPOs
Groups: Language Computer Corporation; Air Force Rome Labs; 21st Century Technologies; Houston VA Medical Center; Xerox PARC; Stone’s Throw Technologies; SRI; ISI; Daxtron Labs; Austin Info Systems; Lockheed Martin ATLD; U of Illinois Urbana-Champaign; U of Maryland; MIT Media Lab; Northwestern U; ANSER, Inc.; NTT Communications Science Laboratories (Japan); Fraunhofer Institute; Sapio Systems (Denmark); Terra Incognita; Trimtab Consulting; Stanford NLP Dept.; TNO-DMV (Netherlands); U of Pennsylvania; Rensselaer AI and Reasoning Lab; Microfabrica, Inc.; LBJ School of Public Affairs; U of Toronto; Radboud U (Netherlands); Knowledge Media Institute, Open University; U of Stuttgart; U of Minnesota; Witan International; New Mexico Highlands Univ.; Harvard U; Linkoping U (Sweden); U of Hawaii; Institute for the Study Of Accelerating Change; Tokyo Inst. of Technology

Problem: 5 Factors slowing IC inference
(F1) Constant stream of new assertions, new data to assimilate.
– “elaboration tolerance” vs. tuned, optimized, “compiled” representations.
(F2) Theory Size: Huge vocab. and #instances (people, specific reports,…)
(F3) Sophisticated assertions and constraints strain even FOPC
– More repr. language “features” (e.g., quantification) => slower inference
(F4) Assertions are often true in one context and false in another
– Contextualized data and queries => exponentially larger search space
(F5) Truth maintenance must be “on”, to assimilate new data properly, and to provide the symbolic justifications behind its conclusions.
– Each new datum can trigger an avalanche of TMS reactions in the KB
– There can be multiple answers, each with multiple justifications

Slow Queries
• Queries that take a long time (okay, but faster is better)
– Generate scenarios resulting in destruction of NY Stock Exchange
  Still running after 2 months
– Answer query Q modulo a small number of plausible “unknown” clauses
• Queries that take a long time and shouldn’t
– (capableOf ArnoldSchwarzenegger RunningForPresidentOfUS)
  Takes 40 minutes to return False.
  Why: Wasting time seeing if Arnold is an x where x can’t be President (e.g., Cow)
– (hasBeliefSystems AdolfHitler AntiSemitism)
  In the context of World History 1944, takes 16 minutes to return True.
  Why: Lots of ways this might not be true

Effic. Reasoning Hypotheses
• Hypothesis 1: There is no silver bullet, no one magic key waiting to be discovered which will unlock efficient pathfinding on huge knowledge-spaces.
– Rather, such inference will only be improved incrementally, by bringing to bear a large number of efficient partial solutions.
• Hypothesis 2: These special-case solutions are not random, but factor into a handful of different categories.
– A 2-day workshop meeting could productively be held for each such category
– Important interstitial work to be done, collaboratively, before and after the meetings.

6 categories (workshop topics)
• Reasoners that exploit limitations in the expressivity of the repr. language they operate over
– Description Logic, 1st order, etc.
– What simplifications enable what speedups?
– At what risk?
• Domain-specific (incl. Context-specific) reasoners
– What sorts of domain knowledge do they utilize?
– How do they use that to speed up inference?
– Contexts, dimensions of context-space, algorithms for exploiting that structure of the KB to do faster reasoning
• Statistical/Bayesian Reasoners
– How can these cooperate with, help, and be helped by non-statistical reasoners (acting as independent agents)?
– How can statistical and symbolic inference be more tightly integrated in a single reasoner (cf. Koller)?
• “Unsound” (but presumably useful) reasoners
– Abduction, induction, analogy, abstraction (ignoring details which hopefully won’t matter), scen. generation
– How can these cooperate with, help, and be helped…?
– How can unsound and sound inference be more tightly integrated in a single reasoning engine?
• Meta-reasoners (tacticians) and Meta² (strategists)
– Do/Improve object-level meta-level reasoning
– Types of meta-… (prior & tacit; trails; reflection;…)
• “Other”
– Parallel processing
– Hardware acceleration (special purpose chips etc.)
– New types of reasoning modules and strategies, that don’t fit in any above group, that folks are working on.
– What specific gaps are there (useful, doable, efficient reasoners no one has even started to research yet)?

Background & Lit. Review
• Instantiation-based reasoning systems
• Lifted DPLL procedures (Davis Putnam Logemann Loveland)
• Completion/Boolean Ring based methods
• ContractNet
• TeamWork
• Scatter-gather algorithms
• Auto. theory decomposition by static analysis
• Explanation-based learning/partial evaluation mechanisms that learn generalized proof schemata

Effic. Reasoning Hypotheses
1. No silver bullet
2. 6 types of powerful partial solutions already exist
– Reasoners that exploit limitations in the expressivity of the representation language they operate over
– Domain-specific (incl. Context-specific) reasoners
– Statistical/Bayesian Reasoners
– “Unsound” (but presumably useful) reasoners
– Meta-reasoners (tacticians) and Meta² (strategists)
– “Other”, HW accel., parallel processing
3. They can cooperate / synergize (neutral harness)

Effic. Reasoning Hypotheses
• Hypothesis 3: They can cooperate / synergize.
– Explicitly characterize, for each “agent” (reasoner):
• A trigger -- in effect specifying its area of competence
• A procedure for estimating its cost, its chance to succeed, etc.
– Cyc’s immense KB and EL/HL architecture make it an efficient reasoning module “magnet” or “universal recipient”

Effic. Reasoning Hypotheses/SOW
• Hypothesis 3: They can cooperate / synergize. More than that, we can and will harness ~10 of them.
• Hold 3 workshops, on the 6 topics, in 2006
– Participation by all the leading experts
– Pre: readings. Post: actually harness them
• Use Cyc [and ARDA-related assertions/queries in it] as a testbed for
– operationally “publishing” the results of each workshop
– experiments on comparative and collaborative power

Efficient Pathfinding in Very Large Data Spaces
GOALS
• Develop an ontology and a standard for specifying the applicability, % success, estimated resource cost, etc., of bringing various reasoning modules to bear on a problem
• Build an Integration Framework, a Harness, that enables several of the world’s leading reasoning systems to cooperatively solve problems [using the above ontology and standard to act as agents, broadcast subproblems, etc.] Actually hook them up to this Harness and run them, on test problems from NIMD, AQUAINT, etc.
• Overcome the 5 problems that make IC reasoning hard:
(1) New assertions constantly (can’t just “compile” the KB)
(2) Each is true in some contexts (in 2003; believed by x)
(3) Many are complex (x believes that y believes that…)
(4) Huge vocabulary size and number of instances
(5) Justifications / sources matter (truth maint. must be “on”)
APPROACH
• Identify the most important ways in which automated reasoners gain efficiency: limit domain, limit expressiveness, integrate probabilistic and symbolic reasoning, meta-reasoning, and unsound reasoning (e.g., analogy)
• Hold a workshop on each topic (16 invitees; 15 said “Yes”)
• After/between the workshops, get these system builders to “publish” their reasoner to the growing Framework/harness so each can bid for, work on, and broadcast subproblems
Workshop Highlights
4Q 05: Pre-start invitations and Steering Comm. planning
1Q 06: Project starts. 1st workshop: gaining efficiency by limiting representation language expressivity
2Q 06: Interstitial work on ontology and standard; building the initial Framework/harness; try out 2 “agents”; 2nd workshop: gaining efficiency by limiting the domain, the type of problem to be solved, etc.
3Q 06: 3rd workshop: Integrating Bayesian probability and statistical reasoning with symbolic theorem-proving
4Q 06: 4th workshop: meta-reasoning (tactics & strategy); 5th workshop: unsound reasoning (e.g., analogy)
1Q 07: 6th workshop; Final Report; Hand-off to I.C./Ops “Champions” for tech transfer/operationalization
Workshop “Steering Committee”: R.V. Guha, Google; Chris Welty & Andrew Tomkins, IBM; Andrei Voronkov, Manchester; + I.C./Ops. “Champions”
Workshop PI’s: Doug Lenat, Cycorp; Michael Genesereth, Stanford

CYC: Lessons Learned in Large-Scale Ontological Engineering
The pursuit of Artificial Intelligence -- from robotics to natural language processing to automated learning -- has been held back by the "brittleness bottleneck" caused by the need for common sense. For 21 years, we've been priming the pump, building up a formalized corpus of such knowledge, Cyc. Along the way, we've had to revise our preconceptions and theories, to expand our representation language and arsenal of inference methods, to find approximate yet adequate engineering solutions to problems that philosophers have grappled with for millennia such as ontologizing aspects of substances versus individual objects, time, space, causality, belief, social interactions, and so on. The process of ontological engineering had to grow and evolve throughout this enterprise, as well, such as how Cyc represents and reasons with contradictions and context. In this talk I will try to cover both the large scale picture of what we've built and why, and the detailed picture of how it's built, and the lessons learned along the way in how and how not to do large-scale OE. I will report on our recent efforts to make Cyc more accessible to the broader community through OpenCyc and ResearchCyc, which raises issues of how multiple individuals and groups can share and integrate their extensions (and settle their differences).
Finally, I will discuss an exciting new effort we have just had funded, to gather automated reasoning researchers together for a series of workshops in 2006 on speeding up inference in large knowledge bases by orders of magnitude.
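The harness idea behind those workshops (each reasoner advertising a trigger plus estimates of its cost and chance of success) can be pictured with the small Python sketch below. Everything in it is hypothetical: the `Reasoner` record, the `ask` dispatcher, the chance/cost ranking, the two toy reasoners, and the numeric estimates are assumptions for illustration, not the Cyc HL module interface or the proposed standard.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Toy knowledge: ground facts and a subclass chain (both made up).
FACTS = {("isa", "Fido", "Dog")}
GENLS = {"Dog": "Mammal", "Mammal": "Animal"}


@dataclass
class Reasoner:
    """Hypothetical description of one reasoning module ("agent")."""
    name: str
    trigger: Callable[[tuple], bool]       # area of competence
    estimate: Callable[[tuple], tuple]     # returns (cost, chance of success)
    solve: Callable[[tuple], Optional[bool]]  # None means "no answer"


def lookup_solve(query):
    """Answer only if the query is a stored ground fact."""
    return True if query in FACTS else None


def taxonomy_solve(query):
    """Answer isa-queries by walking up the subclass chain."""
    _, individual, target = query
    for pred, who, cls in FACTS:
        if pred == "isa" and who == individual:
            while cls is not None:
                if cls == target:
                    return True
                cls = GENLS.get(cls)
    return None


REASONERS = [
    Reasoner("lookup", lambda q: True, lambda q: (1.0, 0.3), lookup_solve),
    Reasoner("taxonomy", lambda q: q[0] == "isa", lambda q: (5.0, 0.9),
             taxonomy_solve),
]


def ask(query, reasoners=REASONERS):
    """Try triggered reasoners in order of estimated chance per unit cost."""
    def promise(r):
        cost, chance = r.estimate(query)
        return chance / cost

    applicable = [r for r in reasoners if r.trigger(query)]
    for r in sorted(applicable, key=promise, reverse=True):
        answer = r.solve(query)
        if answer is not None:
            return r.name, answer
    return None


print(ask(("isa", "Fido", "Dog")))
print(ask(("isa", "Fido", "Animal")))
print(ask(("likes", "Fido", "Bones")))
```

The cheap lookup module is tried first and the costlier taxonomy module only when lookup has no answer; a real harness would also let modules broadcast subproblems to one another, which this sketch leaves out.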