
>> Eric Horvitz: Okay, we'll get started. It's an honor today to have Ken Forbus with us. He's the Walter P. Murphy Professor of Computer Science and Professor of Education at Northwestern University. Ken did his Ph.D. work and undergraduate work at MIT.
He's been, in my mind, one of the core people over the decades working on the challenges of representation and reasoning with qualitative knowledge, as well as analogical reasoning and learning, and spatial reasoning. More recently, from watching his work over the years, sketch understanding and natural language understanding.
He's been a long-term advocate and leader in this realm of cognitive architectures: what's an architecture for reasoning about the world, more generally? He's been working in a number of application areas. He's a fellow of AAAI, of the ACM, and of the Cognitive Science Society. He's received a number of awards for his work over the years. It's always a pleasure to have him visit us. Ken.
>> Ken Forbus: Thank you, Eric, for that lovely introduction. I'm going to talk to you today about some of the work we're doing involving learning from what we call Bespoke data, custom data. Why Bespoke data? Well, as you all know, big data is a big thing these days. My heavens, it's wonderful, right. Speech recognition is finally getting seriously good in many ways. Simultaneous translation in Skype, all sorts of great things it's capable of. But it's not what you want for everything.
For example, if you're teaching a child to understand stories and read, then you're actually interacting with that child. If it takes that child a million exposures to a word to learn it, you've got a problem. There's evidence that human children can actually learn new nouns in a single exposure. There's something we're doing that's very different from what today's Machine Learning systems can do.
If you're teaching a child to do a jigsaw puzzle, or you're teaching an assistant a new task, you don't want to have to teach someone how to fill out a form a million times. You want to have data that's actually tuned towards your estimation of their capabilities, to be able to work with them effectively with the same kinds of ranges of examples that human collaborators take.
Now, why do I care? Well, one of the things we're trying to do is actually achieve human-level AI. Now, if you think about today's AI systems, they're like drag racers. They're very efficient. They're very fast. Okay, but like a drag racer, each one does one thing. It's carefully crafted to do one thing. It has to be carefully maintained and babied, and nursed back to health every time it falls down, by carefully trained experts who know its internals, in a way that we don't know each other's internals. If we did, cognitive science would be trivial, okay.
Now think instead about dogs. You can task a dog to do all sorts of cool things. They don't require a whole bunch of maintenance. You feed them, you give them some affection. They don't blue screen on you a lot, right. There's all sorts of great things. What if our AI systems were as robust and trainable, and taskable as mammals? One way we describe this sometimes is that we're trying to build a sixth-grade idiot savant from Mars. Okay, very smart but clearly not from around here. Enough where we can communicate with it and work with it, but we're not trying to make something that passes for human. That's not the point.
Now, the way we're doing this is the Companion Cognitive Architecture. What we've done is reformulate the goal in a way that I think is actually pretty productive. We're trying to build software social organisms: things that work with people using natural modalities. For us that means natural language and sketching. Those are sweet spots because natural language gives you expressive power. Sketching gives you spatial aspects.
They should learn and adapt over extended periods of time, things that can learn for weeks and months, or years, without having human beings maintaining them. In fact they should be maintaining themselves most of the time. We shouldn't have to know their internal representations, just as we don't know the internal representations of our assistants, or associates, or our children.
Now, why social organisms? Well, first of all it's going to make them very useful. But I also think that's actually essential to making things that are as smart as we are. Mike Tomasello has made this argument very plainly in several books now. I think there's a lot of convincing evidence for it. Also Vygotsky has argued that much of our knowledge is learned by interactions with people. We actually sometimes call Companions the first Vygotskyan cognitive architecture. That's a goal, not an achievement. Okay, we're trying for that.
There's many things we understand that we've never directly experienced. Yes, you need to ground things in sensory information. But you know, none of us lived through the Revolutionary War. None of us have seen molecules directly except with the aid of a whole bunch of very carefully crafted equipment. We see the results of plate tectonics, but that's inference as opposed to actually watching plate tectonics in action, given the time scale that happens on, and the same thing with most examples of evolution. You have to have conceptual knowledge as well as physical knowledge.
Now, if you think about cognitive architecture, there's been a lot of cognitive architectures. Some people, when they hear about a new cognitive architecture, say, oh my lord, why one more? Well, if you think about it, Newell actually broke things down by time scale of tasks: the Biological Band with neural modeling, the Cognitive Band thinking about skills, the Rational Band more about problem solving and reasoning, the Social Band, and in fact there's a Developmental Band above that.
If you look at most cognitive architectures, they focused here. For instance, two of the leading architectures, ACT-R and Soar, have as their signature phenomenon skill learning. If you look at ACT-R, where they're going is down here. They've worked a lot to model fMRI data, for example. Where Soar's been going is here. For instance, they've had Soar agents that run for many hours in training exercises. A real tour de force and all sorts of stuff, and they're actually headed even further up.
Companions, we care about these two bands. Just as when you're modeling a complex phenomenon, you model it at multiple layers. Ultimately those models all have to talk together. You need to explore each layer. You can't say I'm going to solve this one before I solve that one, because then we'd still be doing quantum mechanics. We wouldn't have chemistry and we wouldn't have meteorology, okay. You have to do things in parallel and have them talk to each other.
For us we’re starting here. We assume these folks are handling skill learning really well. They are, but
we’re more interested in how you do reasoning and learning at these broader time scales. That’s the big
idea of companions.
Now, I'm going to tell you about the hypotheses in a little bit more detail. I'm going to walk you through three kinds of examples. One is learning games, both strategy games and a very simple Tic-Tac-Toe thing; I'll show you a video of that. Then learning spatial concepts, and learning plausible reasoning. One thing you'll see is that in all these cases it doesn't take much data to actually learn these things. You can get a surprising amount if you know a lot and if you use analogical learning as your learning mechanism.
By analogy I mean Gentner's Structure-Mapping Theory. The idea is that analogy and similarity involve structured relational representations. You have entities with relations connecting them up. You're comparing one description, the base, against another description, the target. That gives rise to a bunch of correspondences saying what goes with what. Then if there's things in the base that don't have anything corresponding in the target, you can project those as candidate inferences, to do a kind of pattern completion of the symbolic descriptions.
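To make that concrete, here is a minimal, illustrative sketch in Python. It is only a toy stand-in for SME, with made-up facts: it greedily aligns base and target statements that share a predicate, then projects unmatched base statements as candidate inferences; SME's one-to-one constraints and structural evaluation scoring are omitted.

```python
# Toy structure-mapping sketch (illustrative only, NOT SME).
# Facts are nested tuples: (predicate, arg, arg, ...); entities are strings.

def align(base, target):
    mapping = {}            # base entity -> target entity
    matched = set()
    for bfact in base:
        for tfact in target:
            if tfact in matched or bfact[0] != tfact[0] or len(bfact) != len(tfact):
                continue
            trial, ok = dict(mapping), True
            for b_arg, t_arg in zip(bfact[1:], tfact[1:]):
                if trial.get(b_arg, t_arg) != t_arg:
                    ok = False
                    break
                trial[b_arg] = t_arg
            if ok:
                mapping = trial
                matched.add(tfact)
                break
    return mapping

def candidate_inferences(base, target, mapping):
    """Project base facts the target lacks, introducing skolems for unmapped entities."""
    inferences = []
    for bfact in base:
        projected = (bfact[0],) + tuple(mapping.get(a, ":skolem-" + a) for a in bfact[1:])
        if projected not in target:
            inferences.append(projected)
    return inferences

if __name__ == "__main__":
    # Base: water flows from a beaker to a vial through a pipe.
    base = [("greater", "pressure-beaker", "pressure-vial"),
            ("connects", "pipe", "beaker", "vial"),
            ("flow", "water", "beaker", "vial")]
    # Target: heat situation; the flow fact is missing.
    target = [("greater", "temp-coffee", "temp-icecube"),
              ("connects", "bar", "coffee", "icecube")]
    m = align(base, target)
    print(candidate_inferences(base, target, m))   # projects a "flow" fact onto the target
```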
Now, it turns out there's a substantial amount of psychological evidence supporting this model, in phenomena ranging from high-level and medium-level vision, auditory learning, learning mental models, textbook problem solving, and conceptual change. The evidence is both from purely psychological work with humans in laboratories, like reaction time studies and protocol analyses, but then also computational modeling. The story is actually pretty interesting.
We have these three hypotheses about human cognition. We think that to build AIs we need to actually take these seriously. The first is that analogy is a central mechanism of reasoning and learning. A lot of things we think of as rule-based learning are probably actually applying something that's a high-level structure that still has some concrete stuff attached to it, by analogy. Dedre Gentner's article, Why We're So Smart, is a very good introduction to this idea.
Common sense is basically analogical reasoning and learning from experience. You start out with within-domain analogies. They provide robustness and rapid predictions. When I go to start up a new car, if it's the same kind of car I've had before, then I do the same thing and that works. I don't have to have a rule formalized saying, oh, it's a lock, I have to turn the key in the lock; or it's one of those fancy fobs I just keep in my pocket, I don't have to touch it. I just do it, okay.
Now, so even with one example you can do an analogy to learn some other stuff. But we do generalize. The generalization process gets us to first-principles reasoning. But those principles emerge slowly as generalizations from examples. Now, by slowly I don't mean millions. I mean dozens, okay. It's very different in terms of time scale. I think it's much more like what humans get.
Qualitative representations are central to this. Qualitative representations provide a bridge between
perception and cognition. They provide an appropriate level of understanding for communication, and
action, and generalization. Those are like the three big hypotheses we’re exploring.
Now, there's these models we have of analogical processing. I'm not going to go into gory detail about each one of them, because each one of them is an hour talk. But I'm going to give you a picture of how they all fit together. That will show you how they get used in these subsequent things.

Here, very roughly, is what we think happens. You have the Structure-Mapping Engine, which matches two descriptions, either examples or generalizations, and the candidate inferences give you things like predictions or explanations, or a possible principle to apply in dealing with some new situation.
Now, where do you get these things? You get them from your memory, of course. We have a memory model called MAC/FAC, Many Are Called but Few Are Chosen. It's designed to scale up to human-sized memories, because the first stage is a cheap filter using flattened versions of the structured representations, with a special kind of vector designed so that it predicts what SME will produce in terms of overall match quality.

It's an inaccurate prediction, of course. That's why you have the second stage, which actually uses SME over several examples in parallel. That's where you get your stuff. Now, the generalizations happen because you have basically another model, which uses SME and MAC/FAC to sit there and, given instances of a concept, build models, like models for a word. We've done this with spatial prepositions of contact, for example, in English and Dutch. You basically build up models analogically by combining these things using matching, keeping the stuff that's in common, and deprecating the stuff that isn't in common by lowering its probability of being there.
Generalizations for us are probabilistic. We have frequency information about every statement in them. They're partially abstracted. There can still be very concrete things in them. They don't have logical variables. We don't need to go to variables, because we can just use structure mapping to do the matching.
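Here is a minimal sketch of that kind of analogical generalization, assuming the examples have already been aligned onto shared entity names (real SAGE does that alignment with SME, and the facts below are invented):

```python
# Toy SAGE-style generalization sketch (illustrative only).
# A generalization tracks how often each statement appears across assimilated
# examples; probability = count / number of examples, so non-shared statements
# fade without ever introducing logical variables.

from collections import Counter

class Generalization:
    def __init__(self):
        self.counts = Counter()
        self.n_examples = 0

    def assimilate(self, example_facts):
        """Merge one example (a set of statements) into the generalization."""
        self.n_examples += 1
        self.counts.update(set(example_facts))

    def probability(self, fact):
        return self.counts[fact] / self.n_examples if self.n_examples else 0.0

    def model(self):
        """Statements with their frequencies; concrete details survive at low probability."""
        return {f: self.probability(f) for f in self.counts}

if __name__ == "__main__":
    on_concept = Generalization()
    on_concept.assimilate({("touches", "figure", "ground"),
                           ("above", "figure", "ground"),
                           ("color", "figure", "red")})    # concrete detail
    on_concept.assimilate({("touches", "figure", "ground"),
                           ("above", "figure", "ground"),
                           ("color", "figure", "blue")})
    print(on_concept.model())
    # touches/above stay at 1.0; each color statement drops to 0.5
```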
Now, sometimes we do introduce variables. In fact, there's a couple of cases where we've gone through, for various reasons, and turned these things into hardcore, real-life logical rules, or probabilistic rules. But that's not what we have to do by default to make knowledge useful. That's the essence of how the Companion architecture turns over, sort of the high-level lessons. Yes?
>>: In generalization you said one of the things you do is you take a bunch of similar examples and abstract out the commonalities. How do you know which examples are similar, similarly…
>> Ken Forbus: There's two answers to that question. First, we assume there are these generalization contexts that actually are sort of the analogical models we're building for a problem. Like, how do I interpret "on"? Or how do I interpret "in"? Another example: if I'm doing language processing and doing word sense disambiguation, for this word in this sense, what are the ways in which that's been used, okay? That's categorizing things in terms of relevance to a problem.
Then, to get the things that you compare against, you use the same retrieval model to retrieve from the pool of things you've been building up. If two of them are sufficiently similar, you store the assimilated version of those things back as a generalization. The same retrieval model's used again. You basically are, it's just, yeah?
>>: I’m still not sure. Like, so this question like so what is similarity? Like is similarity like a general, like
why does the similarity matching? Does it have…
>> Ken Forbus: Okay, the similarity metric is computed, no, no, it's computed by SME. It's defined in advance. It's the same similarity metric for word sense disambiguation, matching visual stimuli, story understanding, counter-terrorism, moral reasoning; insanely robust. It's the structural evaluation score that SME computes for two descriptions. It really is that general.
Okay, so here's some experiments that we did early on: learning physics from cross-domain analogies. You start with a model of linear kinematics and transfer into rotational kinematics, electricity, and thermal problems. Now, MAC/FAC retrieves the correct thing only forty percent of the time. If you know the psychology of that, that's actually high, because cross-domain analogies are relatively infrequent.
Well, the beast didn't know much, so it's going to be able to find stuff more easily. Like humans, if you actually gave it the precedent, it was able to transfer the information eighty-seven percent of the time, if it had that precedent.
We've also done this for learning games, with sixty games which were generated by an external contractor; the experiment was run by an external contractor. I think this is still the largest experiment in cross-domain analogies that anyone has ever done.
John Laird at Michigan developed a sort of basic game, which the contractor then went and did, went crazy over. You're trying to take this character and make it to the exit by building a bridge to escape. Tom Hinrichs from our lab did a version of Rogue, a mini version of Rogue. There was a little version of Mummy Maze here, a variant of that.
What happened was you'd learn a game, a game that you'd never seen before. The Companion would basically learn HTN strategies by experimentation. Now, it knew enough about games that if it couldn't master the game in ten trials it quit, because it never would, okay. These games are easily the complexity of the kind of Atari games that it takes the system in the Nature paper thirty-eight days to actually master. I'm not so impressed by those results.
Now, given one of those learned games as a base, you would then try to learn some new games, sorry, you'd measure how much faster you learn. Positive transfer is good. I would have learned fifty percent faster if I had the game almost sixty percent of the time. There is some negative transfer, so fifteen percent of the time, in fact, I'd learn twenty-five percent more slowly than if I didn't have the analogy. We're kind of excited about that.
Now, modalities. We want natural interaction. That's still a really hard problem. That's still very much an open research problem. We've made some progress, but, well, let me show you where we are. Our natural language approach is kind of different. We're focusing on deep understanding, all the way down to things you can do reasoning with.
We're willing to simplify syntax, just like human cultures do with children. We're quite happy to simplify syntax because we want to go all the way to reasoning and decision making. We use James Allen's parser and ResearchCyc Knowledge Base contents. We've been building our own lexicon to replace COMLEX. We use Discourse Representation Theory as the representations for doing semantic interpretation. We have a query-based abductive semantic interpreter.
We use the idea of narrative function: trying to figure out what a statement's telling you in the context that you're working in. For example, in solving moral decision problems the system is looking for a set of choices that it has to make and decide among. It's looking for the utilitarian costs of each decision. It's also looking to see if any sacred values are involved. It will basically decide in a human-like way what to do, based on those factors. The interesting thing about that is those factors are the same for any moral decision problem.
Normally, abduction grows exponentially with the number of sentences. In this system it actually grows with the number of questions you're asking, as opposed to the number of sentences. By basically formulating it as a top-down problem you can do pretty well.
Here's an example of some of the things you can express: "Because of a dam on a river, twenty species of fish will be extinct." On the left is the predicate calculus, with predicates drawn from the Cyc Knowledge Base. If you're going to implement DRT, Cyc's microtheory machinery is your friend, okay, because it's all about context. Each of these boxes is a microtheory in Cyc, with statements relating the microtheories to each other. This is the DRT version of that same hunk of predicate calculus. Okay, so it takes some work. But it can be done.
Now, what do you do with this? Well, one of the testbeds we use is the strategy game Freeciv. It's an open-source version of Civilization II. It's cool because it's got spatial concepts. You've got terrain of different types. You have to design transportation networks and figure out where to place your cities. You've got a complex economy. You can go bankrupt. You have a guns-and-butter tradeoff. You have investment versus immediate spending. There's all sorts of complex stuff in here. It's a wonderfully rich domain.
Now, one of the things you notice when you think about this game: qualitative reasoning, a lot of it in dynamics, grew up around engineering. In engineering you have a blueprint. You know all the parts in advance. That's a really simple world. In this world it's not that way. In this world you're reasoning about things that don't exist yet. You're reasoning about things that can get destroyed.

Object-level representations are impractical when you've got limited attention and storage, and processing time. If I have to build an explicit qualitative model for every tile in the game, I'm hosed, okay. That's not going to scale. You have to plan for things that don't exist yet. You have to build models of the dynamics for things that don't exist.
We've introduced a kind of type-level qualitative model that uses predicates and collections as arguments. For instance, qprop+ normally says (I shouldn't stand in front of the screen) that the first quantity is causally determined, in part, by the second quantity. The type-level version says a quantity of this type and a quantity of this type, applied to objects of these types, with this relationship between them, have a qprop between them. In other words, this cashes out to: for all x, for all y, if x and y are GaseousObjects and x is the same thing as y, then the pressure of x is qualitatively proportional to the temperature of y. Okay, so that's how that's translated to the instance level. But you do the reasoning at the type level as much as possible.
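Written out at the instance level, that type-level statement cashes out to roughly the following (my notation for this transcript, not the actual KB syntax):

```latex
\forall x\,\forall y\;\bigl(\mathrm{GaseousObject}(x)\wedge \mathrm{GaseousObject}(y)\wedge x=y\bigr)\;\Rightarrow\;\mathrm{qprop}^{+}\!\bigl(\mathrm{pressure}(x),\ \mathrm{temperature}(y)\bigr)
```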
Now, if you're dealing with generic statements, like reading a simplified version of the Freeciv manual, this is a true blessing. Here's a little bit of the manual. Here's the translation after the whole abductive reasoning process has happened on the first sentence. It's detected these; I've written this in frame-like syntax just for easy visual processing. Normally there's a lot more parentheses and they're all independent statements.
You've got a type-level process that's a generic production process. That kind of event is actually tied to the language in the Cyc KB; we didn't introduce that. There's some event; it refers to a production from the sentence. That's a discourse variable. It's done by something that's a Freeciv city. The output created is something that's a type of amount of food. There's a positive influence on the amount of food from the rate of production.
Now take the second sentence, and remember this because you'll see it again: as the population of the city increases, the food production of the city increases. This is actually a classic qualitative proportionality introduction. It's saying there is a positive qualitative proportionality, the constrained quantity is the rate of food production, and the thing that's governing it is the city population.

Finally, "a citizen consumes food in the city": you're introducing another process. Notice it's a destruction event, another kind of thing from the ontology. Now you're filling out its dynamic consequences.
These qualitative models are very powerful. One of the things about strategies in games like this is they
involve tradeoffs. Analysis of the qualitative model can identify tradeoffs and let you reason about
those.
Now even a little bit of advice improves performance.
>>: Is time represented in the previous slide?
>> Ken Forbus: Because these are all things that are happening continuously while the process is active.
In other words it’s a mechanism by which behavior is generated as opposed to the description of the
behavior itself. We have other ways of describing the behavior itself.
Okay, so it turns out the type-level representations are very useful for advice. "Irrigating a place increases food production", absolutely true in the game. "Adding a university in a city increases its science output." Now you take a half dozen pieces of advice like that. You basically say, okay, how well do I do at producing science? This is averaged over ten different games. By ninety-seven turns the experimental condition is still doing better on population. Not hugely better, because that's not what it's optimizing for; it's optimizing for science output.
Here you’re getting to the point where you can build libraries and other units, right. Before that you
couldn’t build them. These two look pretty much the same on science output. But now the advice kicks
in and you start getting more science produced for the cities. Okay, so that’s one modality. It’s also a
little bit of the game learning.
Now, what about the other modality? Sketching is a very natural way for people to communicate knowledge. This is a picture of a geological formation. If you're an instructor in geoscience, what you want to see when students mark it up is something like this. That's the fault, the main fault. These are the directions of movement. These are called marker beds. These indicate the displacement of the marker beds, okay.
We've actually built something with our sketch understanding system that handles things like this. It's domain-independent, which is very important for us because we want this to scale. In an experiment at the University of Wisconsin-Madison, a geoscience grad student made fifteen worksheets, and showed significant pre-post gains using those worksheets in their intro geoscience class.
We’ve had similar results with a unit on the heart for fifth grade Biology. We’re about to go back into
the classrooms in an Engineering Design course for learning Engineering Graphics. We have
independent evidence from laboratory studies that you can actually do assessment by looking at what
order people draw things in, in a sketch. What they include and don’t include when they’re annotating
diagrams.
This is a massive effort involving a lot of people. He is a well known Structural Geo-Scientist. Bridget is
the grad student who actually cranked out the worksheets. She’s really good at this. Maria and Jeff are
two of the CogSketch Developers. That’s a whole talk there. But I’m not going to go there. But I’m
instead going to pivot to other roles in sketching. Sketching’s also a tool for thinking.
>>: [inaudible] completely. Can you say a few words about the ordering next?
>> Ken Forbus: Okay, so if I have, for instance, a photosynthesis diagram, then a student who doesn't know photosynthesis will start with the visually salient parts. A student who understands photosynthesis will start from the input and go through the causal chain to the output. Very simple to catch, and if you've got digital ink you know what order people drew things in.
Yeah, this has happened in a couple of domains now. Basically if a student understands a domain well,
for instance in geo-science there’s certain things they’ll pick up versus certain surface features that
literally are irrelevant. It’s pretty easy; it’s not rocket science to tell the difference. It’s not a subtle
signal.
Okay, so this is a sketch by a painter, Shonah Trescott. She was on an Arctic expedition. You'll see some little stick figures here. You'll see some notes about the background: no distinct horizon, disappearing figures. In the Arctic you can't really paint, okay. Even if you're in a more genial climate, artists often do this. This is the painting that resulted. She's perfectly capable of doing all sorts of very fine, subtle visual work. But to just think it out, she first did a sketch.
This is pretty common. Sketching is an aid to thinking. What you'd really like is systems that can sketch with us and sketch for themselves in human-like ways. Here's the long-term vision. You want software that understands sketches as people do. Now, what does that mean? Here's a ramp and a block under gravity. You can infer the block will slide to the right and down, perfectly sensible.
Now, that requires fluent, natural interaction and human-like visual and spatial reasoning, conceptual reasoning about the contents of the sketch, and you want it to be domain-general. If you look at the sketch recognition literature right now, every sub-problem and every domain is a separate system. You have to train the recognizers, and you have to build new software. That doesn't scale.
That's one of the reasons why we've done some engineering workarounds for segmentation and conceptual labeling. We actually don't do recognition on the whole, okay, because it turns out you don't need to. When people are talking to each other during sketching, that's how a lot of the labeling happens. It doesn't require recognition. Recognition is at best a catalyst.
If you look at the sketch recognition research, it's focused solely on that topic. We're focused on the rest. Even if recognition were perfect, and there's reasons to suspect that it can never be perfect at the level you need it to be, especially for education purposes, you still have to do what we're doing.
Let me show you what that involves. Now, the thing I'm about to show you is actually something you can do yourself, if you download CogSketch and fire up the Design Coach and ask it a question about the behavior. What I'm going to show is basically a rational reconstruction of the reasoning in the truth maintenance system that the system's actually doing.
Here's our ramp, so we have the ramp and block. CogSketch recognizes visually that they're touching directly, which causes it to extract an edge representing that surface contact. It computes the surface normal of that, because that's very relevant in terms of how forces transmit. It's in quadrant three here. That's a qualitative way of describing angle.
Now you think about gravity, and we're now using some of Jon Wetzel's qualitative mechanics work. You have a force applied to the object in the downward direction. That is the only force on it; we're assuming friction doesn't matter here. Now you have a translational constraint from this edge saying it can't move in the downward direction. That, plus a little bit of other reasoning, says the translational motion will be in quadrant four, i.e. to the right and down. That's how it infers that stuff.
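A toy version of that last inference step, assuming directions are reduced to sign pairs and quadrants; this is not Wetzel's actual qualitative mechanics code, and the ramp normal below is an assumed example value:

```python
# Toy qualitative-mechanics sketch (illustrative only).
# Quadrants: Q1 = right/up, Q2 = left/up, Q3 = left/down, Q4 = right/down.

import math

def quadrant(sx, sy):
    names = {(1, 1): "quadrant 1 (right and up)",   (-1, 1): "quadrant 2 (left and up)",
             (-1, -1): "quadrant 3 (left and down)", (1, -1): "quadrant 4 (right and down)"}
    return names.get((sx, sy), "axis-aligned or no motion")

def sliding_direction(force, surface_normal):
    """Drop the force component blocked by the contact; keep the tangential part."""
    nx, ny = surface_normal
    norm = math.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    fx, fy = force
    dot = fx * nx + fy * ny
    if dot >= 0:                          # force points away from the surface: free motion
        tx, ty = fx, fy
    else:                                 # contact blocks the into-the-surface component
        tx, ty = fx - dot * nx, fy - dot * ny
    sign = lambda v: (v > 1e-9) - (v < -1e-9)
    return quadrant(sign(tx), sign(ty))

if __name__ == "__main__":
    gravity = (0.0, -1.0)                 # the only applied force: straight down
    ramp_normal = (0.5, 1.0)              # assumed ramp surface, outward normal up and right
    print(sliding_direction(gravity, ramp_normal))   # -> quadrant 4 (right and down)
```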
Now, it turns out there's two parallel literatures, one in artificial intelligence about qualitative reasoning, and one in cognitive psychology that they call categorical versus coordinate, and they're the same thing. They're interestingly complementary literatures. The other thing is that the structure-mapping processes are used in visual reasoning.
Let me show you some work from Andrew Lovett's thesis. Geometric analogies: A is to B as C is to one of those. If you download CogSketch you'll see all of Evans' problems in a sketch that you can play with and experiment with. Andrew's model has the lovely feature that, of course, it gets them all right. This is a very easy task, it turns out. It makes reaction time predictions that are borne out in human behavioral experiments.
In fact, there's a later paper where, by adding working memory constraints and two strategies, which is the model that you can actually play with in CogSketch, the correlation coefficient is point nine something. It's insanely good. It really is a simple task.
A much harder task is Raven's Progressive Matrices, which is commonly used as a test of fluid intelligence in humans. Andrew's model is better than most adult Americans. Again, it makes reaction time predictions. What's hard for people is hard for the model and vice versa.
Finally, there's an oddity task that Dehaene and Spelke used to look at cross-cultural differences in geometric processing between Americans and the Munduruku. Andrew's model, again with the same representations for all these things, all automatically computed from PowerPoint stimuli copied and pasted into CogSketch, again solves most of the problems. What's hard for it is hard for people and vice versa. You can do ablation studies on the models, obviously not the people, and get some insights as to what's happening across these different cultures.
Okay, now at the risk of putting everybody to sleep. I’m going to switch to, briefly to a video. This is
showing you a companion learning Tic-Tac-Toe.
[video]
This is a demonstration of flexible multi-modal instruction in the Companion cognitive architecture. We're going to teach our computer to play Tic-Tac-Toe through a combination of natural language and sketch interaction. We start by introducing the topic. I say I'm going to teach you to play Tic-Tac-Toe. This provides some expectations about how to interpret future statements.
>> Ken Forbus: You’ll see the predicate calculus coming down here by the way.
[video]
We create a new sketch and I start by classifying the game. I say Tic-Tac-Toe is a marking game as
opposed to a piece moving game, or a construction game. This tells it to expect some kinds of marks as
well as a board.
Now, we’re going to draw the Tic-Tac-Toe board. It doesn’t know what this is. It’s not recognizing the
ink or anything. But the board is going to be the background. We’re going to explain what this is in
natural language.
I type: Tic-Tac-Toe is played on a three by three grid of squares. That contains a lot of information. It tells us that there's a spatial configuration, that it's Cartesian coordinates, and that the maximum extent of any dimension is three. On that basis it's able to label the glyph we just made as the board.
Now, we go on to classify the game some more. We tell it that Tic-Tac-Toe is a two-player game, which means that it can expect us to introduce some player roles. X is a player. It understands that to mean that X is a game player, not an athlete or a musician. We can now draw an X, and it doesn't recognize the X, but rather it understands in context that this must be the X. You can see that it labels it with its own little x below it. Now, I can proceed to make an O. I can do this before I've introduced the player. It doesn't recognize it. I enter O as a player. We can see that it understands and labels the O.
The next thing we do is describe the actions in this game. I tell it that X and O take turns marking empty
squares. That contains a lot of information about turn taking and the kinds of marks it makes, and the
precondition for making marks on empty squares.
The next thing we enter is a description of the goal. We're only describing one goal here. We're saying that the first player to mark three squares in a row wins. It only understands row to mean a horizontal row, not a column or diagonal. But through demonstration we will teach it the other win conditions, such as vertical columns or diagonals.
Then, X goes first is highly ambiguous and context-sensitive, but it tells it who starts the game. At this point there's enough information in this representation to play a legal game. I tell it to start a game. What it does is it takes those marks I've made and puts them in a catalog, a separate layer, creates a new layer for the game state, and makes its move. It made an X in the middle left square. I respond with an O in the center square.
It's playing a blind random game. It doesn't understand strategy yet. I'm going to be a little cruel here. I'm going to demonstrate winning by a diagonal. I mark three in a row on the diagonal. In order to help teach it a new rule, I select the winning configuration.
>> Ken Forbus: Our way of doing deictic reference in sketches.
[video]
It doesn’t know that it’s lost yet, so I tell it I win. It takes the winning configuration and creates a new
rule for winning on the diagonal. We can inspect this rule if we look in the command transcript window.
In this case the rule is specific to the particular diagonal. It will require another trial to learn the
opposite diagonal.
But if we demonstrate winning on a vertical column, it will generalize this to all columns, analogously to the rule it already has for rows. We only need a total of three trials to learn the complete rules of Tic-Tac-Toe through a combination of instruction and demonstration.
This concludes our demonstration of flexible multimodal…
[video ended]
>>: Let's say that middle vertical column was not a win, in this special variant of no-vertical-win Tic-Tac-Toe; what would happen at this point? Would it generalize, then say, oh, I've generalized it in an inappropriate way, and back off and create a special sub-case?
>> Ken Forbus: Yes, except now, this version won’t. Actually if it learns a bad rule it’s hosed.
>>: It’s really hosed it can’t go [inaudible]?
>> Ken Forbus: This version, no, right. Yeah, no, we're focusing, I mean, for that project our focus was entirely on what you have to do in terms of multimodal interaction to bootstrap these things, okay.
Okay, so what was going on under the hood there? You've got your language coming in, translated to a general logical form, a predicate calculus interpretation. It's interpreted in terms of communication events and instructional events. It understands how to interpret them. You know, in some cases we have to fall back on general heuristics for interpretation.
In this case, no, because there's enough knowledge about the context: I'm being taught a game, okay. That's an incredible driver. The digital ink gets turned into symbolic representations also. Interface gestures, like they just added a glyph and things like that, also get turned into the same communication stream. That's what leads to building game rules.
The interpretation process is as described earlier. This turns out to be a lot of fun, because the Cyc Knowledge Base has a boatload of stuff in it. In an earlier system we talked about the Jordan River entering the Dead Sea. It thought there was an entering-a-container event where the container was the musical group the Grateful Dead.
Okay, so you get all sorts of amazing things if you have something that knows stuff about the world. It’s
sitting there basically interpreting these things. To where it gets down to a point saying well X plays the
role of something in a game. There’s a whole set of instructional events.
If you look at the intelligent tutoring literature, this is a spin on some of the kinds of instructional events you see there, but game-specific. For instance, win conditions are not something general. But defining configurations, that is pretty general, and the introduction and action generalization are pretty general. The game classification is from earlier work on GDL. There's basically nothing that we really added ourselves. Communication events: there's both the stuff from language and the stuff from interpreting what's happening in the sketch world.
Now, where we're going next with this: we've done Hexapawn. That's kind of trivial; the only difference is it's a piece-moving game. Tom actually can now describe all the regular moves of chess to the system by language and sketching, okay. It can't do castling, capturing en passant, and pawn promotion. All those things are things that you really want to do by language. You really don't want to do those by demonstration, right. How do you show that you can't castle if the king has already moved? That doesn't work so well. We'd like to then expand to discussing strategy as well as rules for play.
Now, spatial concepts. One place we've done this is Freeciv, where we looked at geospatial concepts. We can map a Freeciv map into CogSketch and sketch on top of the map. Here, Matt has basically drawn a circle and said this is a strait. What that causes it to do is take the encoding of that region and shove it into a generalization context for the word strait. It's using that to build up a model of the concept. Then you can draw circles elsewhere and ask it to basically classify those things.
It's basically saying, if I do analogical retrieval over my encoding of this across the whole set of generalization contexts for the concepts, then which one gives the strongest match? That's its way of doing classification. This is okay. I mean, you know, sixty examples, ten per concept, tenfold cross-validation, about seventy-six percent accuracy.
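Sketched below, as a toy stand-in, is that classify-by-strongest-match loop. Companions use MAC/FAC retrieval and SME's structural evaluation score rather than this simple overlap count, and the facts are invented:

```python
# Toy classification-by-analogical-retrieval sketch (illustrative only).
# Each concept has a pool of stored cases (sets of qualitative facts); a new
# encoding gets the label of the concept whose best-matching case scores highest.

def match_score(case_a, case_b):
    return len(case_a & case_b)          # crude stand-in for SME's structural evaluation

def classify(encoding, concept_pools):
    return max(concept_pools,
               key=lambda concept: max((match_score(encoding, case)
                                        for case in concept_pools[concept]), default=0))

if __name__ == "__main__":
    pools = {
        "Strait": [{("narrow", "water"), ("between", "water", "land1", "land2"),
                    ("connects", "water", "sea1", "sea2")}],
        "Bay":    [{("indentation", "water"), ("surrounds", "land", "water"),
                    ("opens-onto", "water", "sea1")}],
    }
    new_region = {("narrow", "water"), ("connects", "water", "sea1", "sea2")}
    print(classify(new_region, pools))   # -> Strait
```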
I'm not excited by that number. But when you look at similar tasks in this world it's not out of band. I think we should be able to do much better. But one thing I worry about is overfitting. It's always dangerous to just use your own data.
There's this lovely corpus by some folks at TU Berlin and Brown. What they did was they took two hundred fifty concepts, like snowman, ice cream cone, giraffe, grapes, bunny rabbits, airplanes, etcetera, and got people on Mechanical Turk to make eighty sketches per concept. That's twenty thousand sketches. Now, they claim in their paper that this exhausts everyday objects. I don't believe it. It includes things like hand grenades and flying saucers, neither of which are everyday objects for me, and I especially hope the two never together become everyday objects for me, okay. But it's still a great corpus. It's very tough.
Here's the experiment. What they did was they used pixel-level image encoding properties and Machine Learning classifiers. We're using CogSketch encoding, because it's all ink; we just suck it in, plus analogical learning. Now, we start with ten concepts. Why ten? Partly runtime, because we have not engineered this in the way we need to to scale up for large batch experiments, but also another problem, as you're going to see in loving detail in a minute. Okay, because some of these things we just can't do yet. We know why. We have partial solutions.
Encode with CogSketch, eight-fold cross-validation, and those numbers of concepts. Typically when we do analogical generalization, within twelve examples we've gotten as good as we're going to get. The only example before this that was different was counter-terrorism; there we needed about thirty to reliably figure out who the perpetrators were in some events. This was how they broke it up. We encoded it their way.
Here's the results for three versions of our system. These are two extremes in SAGE. This is the knob where you say I want everything to assimilate: I set my assimilation threshold so low it always merges everything into one model, okay. Turn it the other way, and it has to be a perfect match before you'll assimilate it. It's a way of doing both prototypes and exemplars in the same model. The third line is like these two, using SAGE, but it also has near misses, automatically derived near misses.
If you get something retrieved from a neighboring concept that's mutually exclusive, that's obviously a near miss. It actually constructs small hypotheses about what those differences are. That in some cases significantly helps in terms of telling them apart. Now, here the differences turn out to be not so much. If you look at the pixel-level stuff, it's also about fifty-six percent, but over all two hundred and fifty categories.
Okay, so for the categories we can do at the moment, hey, we do about the same as what they do. But there's two things that disturb me about this. The first thing that disturbs me is, you know, why so low and why so slow? The second thing that disturbs me is humans are up there, by a Turk experiment that they did on their dataset. That's a lot of headroom. Okay, that really bothers me, because usually we get human-like performance in a small number of examples.
What's going on? Well, if you look at it closely, here's at least one thing that's going on, and there could be others. If you take a turtle, this is from their corpus, here's a visualization of the CogSketch analysis of it. It breaks things up into edges. You can see the junctions between the edges. It also constructs edge cycles, connected sets of edges. These things often are cool because they segment stuff into objects. These edge-connected objects are sort of higher-level descriptions that group those things, okay.
Now, bad move: instead of just picking one level, the one that's most informative, we threw them all in. Okay, so that's a lot of facts. That's four hundred and sixty-four facts for this turtle. Now, in other experiments it looks like for analogical matching, if you can't do between ten and a hundred relations per description, you're not in the game. You're not going to understand stories. You're not going to do moral decision making. You're not going to do visual processing. You're not going to solve textbook problems.
But this is a lot. Now, when you start doing more textures it gets worse. Okay, so a lot of texture in this turtle, right, a lot of regions in this turtle, and so what do you think's happening? Here's a principle we're trying to extract out of this: that part of the goal of encoding processes should be to construct concise, informative representations. We've always worried about informativeness. But we've never really worried about conciseness. With perceptual processing that's a mistake.
We think this is going to be an internal metric on the cognitive architecture. You start thinking in terms of upper bounds on assertions, and trying to abstract away when you hit that. There is some evidence people do this. But it's very weak evidence. I would not take it to the bank.
>>: Are you talking about a concise representation, or just a way of separating, essentially figuring [indiscernible], or so that it may be a layered representation?
>> Ken Forbus: Oh, yeah, no we’re assuming layered.
>>: Okay.
>> Ken Forbus: CogSketch has three, actually four, different layers of representation. It can do groups, objects, and edges, and there's a fourth thing in terms of surfaces. It'll actually dynamically move from representation to representation. It couldn't do Raven's without it, for instance, or the [indiscernible] task.
But we have to add more. This is clearly, you know, if you're thinking of this in vision terms, this is a texture problem. We're borrowing from the vision literature planar Ising models to handle these regular repeated structures. You basically take a whole bunch of them that are similar in some way, turn those into a description that's one big chunk of stuff, and say it's got a whole bunch of things that are about this size and about this eccentricity, and all that. You've basically re-represented it.
Now, that works for some of the turtles. It works nicely for this one, and this one, oh my god, that was a nasty little turtle before, right, because look at all these different textures in there. It works for this one; this is all one unit now. It's got some extra properties on it, talking about the visual properties of the things inside.
>>: I missed what you’re doing here to abstract [inaudible]?
>> Ken Forbus: The idea of these Ising models is you say, I'm going to put basically little control points at these various spots and ask, can I get rid of these? Are these things sufficiently alike that I can merge them together into one unit? Then you do some energy minimization over that.
>>: [inaudible] understand that?
>> Ken Forbus: Well we didn’t invent the technique. We just got it off the shelf.
>>: Okay [inaudible].
>> Ken Forbus: But it isn’t like…
>>: It’s for texture detection, right?
>> Ken Forbus: What?
>>: Ising models are for texture detection, right?
>> Ken Forbus: Yeah, and we’re treating this like texture.
>>: It would be hard to do for hair for example whether or not. I don’t know like strands of hair might
be [indiscernible]?
>> Ken Forbus: Yeah and you know people put in what the sketch recognition community calls
adornment or decoration, right.
>>: [inaudible].
>> Ken Forbus: Yeah, people do that. You really have to learn how to handle that. Now, that works sometimes; it does not work all the time. Here it didn't actually figure out that you should group these things that way. Here it turned the entire turtle into one big blob, okay. We're still trying to figure out what's going on here.
Now, I'll close with one more example. This is learning to do inference over structures. This is not a problem we picked out; this is a Machine Learning community problem. The idea was, well, you've got the semantic web, which is growing by leaps and bounds. You've got the whole knowledge graph thing at Google, and I'm sure Microsoft has an equivalent thing.
>>: It looks like [inaudible].
>> Ken Forbus: I’m sorry?
>>: It’s like [indiscernible].
>> Ken Forbus: Okay.
>>: Yeah, it’s just like [indiscernible].
[laughter]
>> Ken Forbus: Can you do traditional logical inference? Sort of, but the data's incomplete and noisy. Statistics? Well, they're structures, not feature vectors. If you're a dedicated Machine Learning person you'll say, we can fix that. We can vectorize those suckers, okay.

You make a distributional vector space. You take the relations. You mush those into vectors and you crank around. You make a tensor network and you do stuff, okay. Now, I think that's sort of sad. I think feature vectors are not as expressive inherently as relational representations. If you had the structure to begin with, it's really a mistake not to exploit it. You get good accuracy, but it requires a lot of examples. It has to train over all the relations at the same time. It lacks interpretability. You get this number saying this is the best one. You can't say why.
Let me show you a better way. You build cases for analogy by path-finding. The insight here, which is not unique to us, is that if you have a relation between two entities, they're likely to be indirectly related to each other in a bunch of other ways. As usual with these sorts of algorithms, you put in limits on branching and search for tractability.
Say one person is the parent of another; you also generate negative examples by corrupting that triple into something else. That's the same technique that these folks used. Now, what we do is we basically search through the database and grab a bunch of relations tying those two things together. Those become a case. We only grab ten positive cases and permute those to get ten negative cases. Ten, remember that number; it'll come up again.
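Here is a minimal sketch of that case-building step, with made-up entities and relations rather than anything from WordNet or Freebase:

```python
# Toy case construction by path-finding over a triple store (illustrative only).
# Collect facts lying on short paths between the query entities, with limits on
# depth and branching for tractability; corrupt the query triple for a negative.

import random
from collections import defaultdict

def build_index(triples):
    by_entity = defaultdict(list)
    for h, r, t in triples:
        by_entity[h].append((h, r, t))
        by_entity[t].append((h, r, t))
    return by_entity

def case_for(head, tail, index, max_depth=2, max_branch=5):
    """Facts reachable within max_depth hops of the head, limited per-entity branching."""
    case, frontier, seen = set(), {head}, {head, tail}
    for _ in range(max_depth):
        next_frontier = set()
        for entity in frontier:
            for fact in index[entity][:max_branch]:
                case.add(fact)
                other = fact[2] if fact[0] == entity else fact[0]
                if other not in seen:
                    seen.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return case

def corrupt(triple, entities):
    h, r, t = triple
    return (h, r, random.choice([e for e in entities if e != t]))

if __name__ == "__main__":
    kb = [("ana", "parentOf", "ben"), ("ana", "livesIn", "tonga"),
          ("ben", "livesIn", "tonga"), ("ben", "siblingOf", "cara"),
          ("cara", "livesIn", "fiji")]
    index = build_index(kb)
    print(case_for("ana", "ben", index))                       # the positive case
    print(corrupt(("ana", "parentOf", "ben"),
                  ["ana", "ben", "cara", "tonga", "fiji"]))     # a corrupted negative
```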
Then, for template learning, we do analogical generalization. We basically take two cases and match them to find out what corresponds. Then we basically use SAGE to construct templates. If after this match these are mushed together, then things that only appear in one are going to have probability point five. Things that appear in both are going to have probability one point oh. If you keep going, you can see the things that aren't in common are going to get smaller and smaller probabilities, and the things that are common stay high, okay.
Now, you can do better. The usual way we do these things is just say, hey, we use SME and we let that be that. Now, what Chen figured out was, okay, there are cases where, for a task, some properties are really more important than others. You'd like to learn what those are. You'd like to bias the match. Think about the numbers computed by a structure-mapping computation; now convolve that with task-specific importance weights.
He computes those by extending logistic regression to work on structured data. Think about an analogy between vectors and structured cases: dot product becomes structure mapping, and vector addition becomes structure addition, which SAGE is already doing with the alignment step. Then you train with gradient descent, etcetera.
In the traditional way of doing it with input vectors, each vector position tells you what goes with what. That's trivial, and then you get your dot product. With structure mapping these are expressions now; I have to compute what things go with each other. Then I get my stuff that I can then train by gradient descent.
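A small sketch of the overall idea, not Chen's actual algorithm: each template statement becomes a feature, a case's feature value records whether the match aligns something with that statement (approximated here by predicate identity), and logistic regression learns importance weights by gradient descent. All names and the toy data are invented.

```python
# Toy "structure mapping as dot product" sketch (illustrative only).

import math

def features(case, template):
    """1/0 per template statement: did the case align with it (same predicate)?"""
    case_preds = {fact[0] for fact in case}
    return [1.0 if stmt[0] in case_preds else 0.0 for stmt in template]

def train(examples, template, lr=0.5, epochs=200):
    w, b = [0.0] * len(template), 0.0
    for _ in range(epochs):
        for case, label in examples:
            x = features(case, template)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label                              # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

if __name__ == "__main__":
    template = [("livesIn", "?x", "?c"), ("parentEthnicity", "?x", "?e"),
                ("likesColor", "?x", "?col")]          # last statement is noise
    data = [([("livesIn", "a", "tonga"), ("parentEthnicity", "a", "tongan")], 1.0),
            ([("livesIn", "b", "tonga")], 0.0),
            ([("parentEthnicity", "c", "tongan"), ("likesColor", "c", "red")], 1.0),
            ([("likesColor", "d", "red")], 0.0)]
    w, b = train(data, template)
    print([round(wi, 2) for wi in w], round(b, 2))
    # on this toy data the parentEthnicity feature ends up with the largest weight
```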
How well does it do? Well, there's two datasets that people have used for this. They have a WordNet dataset and a Freebase dataset, looking at eleven and thirteen relations respectively, a large test set, and a whole bunch of training data. The other models typically use ten thousand training examples and train on all relations at once. We use ten, and we can train for each relation independently. If you add a relation, we don't have to retrain all the others.
How does it come out? That's the scoreboard. That's our system, and you notice the top systems in the two cases are not the same. We're right up there. We're not the best on either corpus. But we're right up there, with three orders of magnitude less training data.
I think it's pretty cool. I'm excited about this. You know, we've done this plenty of times before on datasets that were generated from our own reasoning systems. But to be able to do it on datasets that the Machine Learning community has done, and get the same performance, we're very happy about that.
>>: Do you gain anything with more training examples?
>> Ken Forbus: Not at the moment. We’re trying to figure out why that is.
>>: The examples randomly chosen?
>> Ken Forbus: Yeah, they’re randomly chosen.
>>: Or [indiscernible] space in an interesting way based on the [inaudible]?
>> Ken Forbus: They're randomly chosen. Now, the other thing it gives you is explanations. You can say, basically by sort of evidential explanations, what's in the matching. This example is deliberately chosen to not be easily interpretable. If you had a natural language generation system to test your knowledge, you'd have a hypothesis saying, well, this person's Tongan, and you ask the human yes or no. They're going to say, uh, right? Unless they have a lot of world knowledge, in which case why are you learning this knowledge?
But you can say, well, his parent is Tongan, and for a person with the same country as him, his ethnicity is also Tongan. Then you can look at the explanation and now have more context. You have some knowledge about the reason why the system believes that. We think that's also very valuable. By sticking with structure you learn faster and you get explanations, which are not inconsiderable advantages.
Just to wrap up. For some purposes Bespoke data is better than big data. If I'm trying to train software assistants, and I'm looking at you, Cortana, you want to say something to it once, like never show me Fox News in my notification stream ever again. I can't say that. Or I can say that and it won't listen.
You want human-like learning of tasks, because you want assistants to be at least as good as we are at learning about the tasks that we're training them for. Rich knowledge supports learning from Bespoke data. We didn't use analogy in learning Tic-Tac-Toe. We actually just talked it through, and because the system knew about games more broadly, and knew about sketching, and knew how to connect those things, and knew about instruction, it was able to interpret that information in a way that mattered.
If you're doing learning, structure mapping can support learning with a small number of examples. Better representations, like sticking with structure when you have it, can drop your data needs by three orders of magnitude in the case we just saw. We think that's true more broadly.
A little joke here, just like orange is the new black, structure mapping is the new dot product.
[laughter]
Okay, so I end by thanking all the people who really made all this happen. Thank you.
[applause]
Questions, yeah?
>>: You mentioned earlier on in your talk. You mentioned the word understanding.
>> Ken Forbus: Yes.
>>: Do you have a definition for that? What does understanding mean?
>> Ken Forbus: That the system constructs representations that enable it to do the kinds of things that
we would expect people to do given that same material and the same context.
>>: The only way to measure if a system understands is by having humans in the loop?
>> Ken Forbus: Or by measuring its performance in some other way. I mean, so here's an example from another study we've done. You take little kids and you give them a forced choice task where you have things that are the same in one dimension versus another. It turns out that you can pick the task so that four-year-olds can't do it, they're at chance, and eight-year-olds can do it, okay. Then by cleverly reordering the stimuli you can get four-year-olds to learn it. Okay, without feedback, okay.
This is something Dedre Gentner and her student Laura Kotovsky did quite a while ago. Now, here's the cool thing. If you're going to simulate that, you need to do two things. You need to figure out how to do the forced choice task; in this case it's, are these things similar enough, you know, which of these is more similar? And you need a signal that says I'm not doing it right yet.
How do you do that? Well, in the forced choice task, if my encoding of the situations is not sufficient to make it clear which one to pick, I know internally, without someone telling me, that this is not a good encoding. The system actually casts about a bit to figure out what a better encoding is. Then, snap, you know, it gets it, okay.
Internal signals like that we think are critically important. We're kind of on the lookout for those now, right, which is why, on the encoding thing, we're now thinking conciseness should be a big signal. If you're getting too much stuff, you're clearly encoding it wrong. You have to think about this thing differently. Whether you get advice from people on how to do it, or you have to search your own space of alternatives, is an open question. But I think those internal signals are also crucial.
One thing that happened in Learning Reader is, there's good news and bad news. If you want to detect surprise, then you could give it some new example. If it's been building up an analogical model of something, and it's got a bunch of examples in it, and nothing's retrieved, that's a surprise, right, because if you've really seen all the space then you should get something.
Now, it turns out the bad news is there's two reasons for that to have happened. Either you really are seeing a different part of the space, or you have an error in your natural language understanding system. As you get more things with that same error, telling those two situations apart is really hard. Okay, so that's the downside of it. AI's hard.
[laughter]
>>: [inaudible] some questions about the approach that you've been taking over your career, versus the kinds of architectures and results coming out of some of the larger-scale big data approaches. I know those things were touched on, but, you know, the Atari game learning by DeepMind and other groups that were doing that kind of thing, where they're thinking of things along these lines as general community challenge problems in larger spaces like [indiscernible] games. People that look at, connectionist is the old name for it, style models that promise to learn from the data, potentially supercharging those someday with these kinds of abstractions, versus starting at the abstractions and using a little bit of learning on top.
What are your thoughts on methodology and approach? Some people say that those will sooner or later bury these more human-authored kinds of approaches, even when humans just draw big boxes and put arrows between them.
>> Ken Forbus: Well, okay, so if you think about what we're doing, we're making the bet that you can safely package off a lot of the perceptual processing and hand-jam that. For instance, our CogSketch, for better or worse, is our model of 2D geometric processing and vision. It doesn't do everything in vision by any means. Think about texture, color, shape, and shading; there's just a raft of problems it doesn't try to address. But it's a sweet spot for interacting with people.
You know, gerbils learn in slow ways, as do people. I mean, if you think about how long rehabilitation takes after an accident: physical therapy takes forever, right, because our motor learning systems aren't that fast. You probably wouldn't want them to be that fast; they'd over-adapt in bad ways. But the stuff that humans can do that's pretty darn amazing is coupling the stuff that lets you interact with the world in reasonable ways with all this amazingly complex conceptual knowledge.
I think if I were to make a bet, and I have made this bet: you know, we're going to have human-level AIs helping us figure out how brains work, rather than understanding how brains work being central to making human-level AIs.
There's an analogy that Ken Ford and Pat Hayes drew between AI and artificial flight. They pointed out the first flying machines did not flap their wings and did not have feathers. There were plenty of attempts to make those at that time. But the ones that didn't, and succeeded, basically said we're going to understand the principles of flight; so too the principles of intelligence.
Now, to understand those principles I still take a lot of guidance from humans, because humans are the best example we've got. But I'm also willing to look at other animals, okay. Eventually, in fact now that materials have gotten better and different, people are trying to build flying machines that are like birds. But that's over a hundred years later. I think that's what will happen with AI too.
You know, eventually we are going to have one story that goes all the way down and says, you know, these dopamine receptors hook in [indiscernible] all the way up to, you know, your knowledge about intention. You know, Meltzoff's Like Me hypothesis, where you think of it as an analogy between yourself and other people, right. I can state that hypothesis without talking about dopamine. I can have a completely good scientific model of conceptual reasoning without knowing about neurons.
Yeah, and so that's why I'm betting the way I do. I'm glad other people are making other bets, because I could be wrong. Ultimately the whole story's got to all hook together. You really have to have all these things going on.
You guys are going to do some game learning?
>> Eric Horvitz: We talked offline about it. But actually we’d like to get your opinions.
>> Ken Forbus: Oh, yeah. I think games are great. We did the sketch games as a DARPA seedling. No more money for that, alas. But I do think it's the kind of platform where you can imagine really open-ended conversations.
>> Eric Horvitz: Right. We can play later [inaudible] and we’ll just talk about it offline, everybody out
there.
>> Ken Forbus: But, like, you know, if you think about the way people are playing Go or chess, it's not very human-like, right. Think back to the old Chase and Simon results. One of the reasons why we're doing this is we'd love to make a Companions executable that has the sketch game capability built in, as well as a whole bunch of conceptual reasoning built in, so that people could then experiment.

I'll bet you that you could make a really cool, human-like chess player that could really explain what it's doing, in terms like those you would see in a chess book, for example, okay. Yeah, and even if it wasn't the best chess player, hey, it would still be kind of cool.
>> Eric Horvitz: Any other questions? Okay, thanks very much.
>> Ken Forbus: Thank you.
[applause]