UNIVERSITY OF PENNSYLVANIA INSTITUTE FOR RESEARCH IN
COGNITIVE SCIENCE
14th Annual Pinkel Endowed Lecture
Modeling Common-Sense Reasoning with
Probabilistic Programs
Friday, March 23, 2012
[START 90840518.mp3]
MR. JOHN TRUESWELL: I'm John Trueswell [phonetic]. Welcome to
the 14th Annual Benjamin and Ann Pinkel Endowed Lecture. The
Benjamin and Ann Pinkel Endowed Lecture Fund was established
through a generous gift from Sheila Pinkel on behalf of the
estate of her parents, Benjamin and Ann Pinkel, and serves as
a memorial tribute to the lives of her parents, Benjamin
Pinkel, who received a bachelor's degree in electrical
engineering from the University of Pennsylvania in 1930, and
was actively interested in the philosophy of the mind, and
published a monograph on the subject, which I have here.
Consciousness, Matter and Energy: The Emergence of Mind in
Nature, published in '92, the objective of which is a
reexamination of the mind/body problem in light of new
scientific information.
This lecture series is intended to advance the discussion and
rigorous study of the sorts of questions which engage Dr.
Pinkel's investigations. Indeed, the lecture fund has
brought many esteemed scientists to Penn to speak on some of
the most important topics that populate the modern study of
the human mind.
This year's speaker, Josh Tenenbaum, is no exception, and is
a welcome addition to that group. Here to introduce Josh
today is the director for the Institute for Research in
Cognitive Science, David Brainard. And you get a copy of the
book.
MR. DAVID BRAINARD: Thanks, John. I'm very pleased to welcome
and introduce Josh Tenenbaum, this year's Pinkel Endowed
Lecture speaker. Josh received his PhD in Brain and
Cognitive Sciences at MIT, and then after a short postdoctoral fellowship, took on an assistant professorship in
the psychology department at Stanford. After a few years, he
got tired of the West Coast, came back to MIT where he
remains and is now Associate Professor of Cognitive Science
and Computation.
A perusal of Josh's CV will tell you that he's won more
awards than I even knew existed, but many of them best paper
at X over the years. But in particular, I want to mention
two. One is the Early Investigator Award of the Society of
Experimental Psychologists, where he was also named a fellow
in 2007, and the other is the National Academy of Sciences
Troland Award, another early career--very prestigious early
career award.
Josh's work focuses, consistent with the theme of the
Pinkel Lecture, on an absolutely central question of human
cognition, namely how it is that we learn to generalize
effective concepts that work well for us in the world from a
small number of exemplars, the problem of inductive reasoning
in learning. As Josh notes in his web page, after seeing
only three or four examples of a horse, even a two-year-old
will confidently and mostly correctly judge whether any new
entity is a horse or not, so one generalizes from a small
number of examples, or as he puts it in a paper, how does the
mind--the question how does the mind get so much for so
little, or from so little.
Being able to learn inductively in this way isn't something
that you can just solve logically. That is to say a small
number of examples underconstrains the generalizations you
might make. So something else needs to be brought into the
problem. There's no such thing as a free lunch. The thing
that Josh's work has emphasized that could be brought to the
problem and can be effectively brought to the problem is
taking advantage of statistical regularities in the natural
environment where we operate and social world where we
function. In that regard, he's been really the leader and
pioneer of using - - methods, which provide a natural
language and machinery for expressing the use of statistical
regularities and bringing together this information into
the realm of cognition.
These ideas have been applied previously in what I would call
lower level problems in human information processing. Josh
has really led the way both in terms of actually developing
effective statistical algorithms--he has a foot in the world
of machine learning, and I think a slightly larger foot in
the world of psychology, where he designs and conducts
elegant experiments, always informative, and that speak to
the formal models in effective ways.
So I always learn when I read Josh's papers, and I'm really
looking forward to his talk today. So please join me in
welcoming him, and I look forward to his remarks.
[Applause]
MR. JOSH TENENBAUM: Can you hear me okay? Thank you very much.
It's a great honor to be here. I don't know that much about
- - except when I was skimming during those very generous
introductions. But it's actually I think a very fitting
match in the sense that he's an engineer thinking about the
brain - - . And if there's one thing to summarize to me what
essentially is the work that I'm trying to do is to think
about how the brain and the mind work from the perspective of
an engineer.
This is - - . I like to think of it as reverse engineering
of how the mind works, which doesn't just mean--the purpose
of doing these different things - - to me-[Break in audio]
MR. TENENBAUM: On now? Great, sorry about that. It's been a
great success story of many of these fields, viewing
intelligence as in some sense statistics on a grand scale. I
want to say a little bit about the success of that parallel
and mostly talk about what it leaves out because that's
really what I'm most interested in right here. I think some
of the basic challenges are--not to say this is wrong, but
what do we need to think about besides statistics.
So this is a view of mind, brain, and intelligence. It has data
coming in through sensors like high dimensional spaces of
retinal activation or auditory, olfactory sensoria. And what
the mind essentially does is find patterns in that data,
various kinds of structure, clusters, correlations, low
dimensional manifolds, linear, nonlinear and so on. We can
describe the math of finding structure in data in very
elegant formulations. Hopefully many of you have seen these
sorts of equations, so just treat this as evocative here.
If you know for example about Hebbian learning or delta-rule
backpropagation learning, we think about that as optimizing
some kind of error function to have rules for finding the
sort of best setting of a high dimensional parameter system,
like say the weights in an artificial neural network to best
find structure in data given the inputs. And these learning
algorithms, which are mechanisms for finding structure in
data, have remarkably been shown to have some correspondence
to the actual mechanisms of adaptive plasticity in brains,
what goes on when synapses between real neurons change their
weights. There are even established causal links between
phenomena like LTP, long-term potentiation, which to some
extent follow the math of Hebbian learning, and actual
behavior in humans and many other mammals.
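To make that concrete, here is a minimal sketch of a delta-rule update in Python; the network (a single linear unit), the data, and the learning rate are illustrative assumptions, not the specific models discussed in the talk.

import numpy as np

# Minimal delta-rule sketch: one linear unit whose weights are adjusted to
# reduce squared prediction error on each example (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # input patterns
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                            # targets generated by a hidden rule
w = np.zeros(5)                           # the unit's synaptic weights
learning_rate = 0.01
for epoch in range(50):
    for x_i, y_i in zip(X, y):
        error = y_i - w @ x_i             # prediction error for this example
        w += learning_rate * error * x_i  # delta rule: error times input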
The same kind of theoretical computational paradigm has been
incredibly useful on the engineering side. There's many
applications, some of which we take for granted, some of
which you will soon take for granted. You just don't know it
yet. For example, pedestrian detection systems. We're used
to face detection in our digital cameras and online photo
sharing sites. Increasingly over the next few years, you're
going to see things like this system here, which is a system
developed by Mobileye, a start-up company which is a rather
large company at this point, and is being commercialized by most
of the major auto manufacturers. It can reliably detect
other cars and pedestrians in ways that are reliable enough
they sort of pass the lawyer test.
The lawyers for that company and Volvo and Audi and BMW and
increasingly other brands will let you put those in the car
and hook it up directly to the brake pedal to basically stop
the car if you're going to hit something. It's more reliable
than you at that task. Similarly, Google is more powerful
than any human at one particular task, of searching for
useful documents or information to match text queries. A lot
of attention to other kinds of AI technologies recently like
IBM's Watson Jeopardy-playing system that beat some of the
world champions in Jeopardy.
If you have the new iPhone with Siri, that's sort of a voice
- - language interface. Most recently, did people read about
Dr. Fill, the new expert-level crossword-playing machine? So
we have these what I would call AI technologies, technologies
that achieve human level, even expert level performance in
particular domains, and they do it by using basically the
math of statistical machine learning and some other
technologies.
But yet none of them are really intelligent. None of them
are. They are AI technologies, and they would impress any of
the early founders of AI, but they don’t have anything like
what we would think of as intelligence or the common sense
even of a human child. This is what I want to push on. One
way to illustrate the lack of common sense that you get from
statistical patterns, here's a little demo from Google.
It's easy to pick on Google because it's so ubiquitous and so
successful. Don’t get me wrong, this is no sign of
disrespect. In particular, though, what I'm going to point
to is something about Google's spelling correction system.
This is a very cool thing. We take it for granted that we
can type in a query like this one up here. I won't read it
because I can't really read it, but you can look at that and
see what I meant to type was probably what Google figures
out. How does Google's spelling correction work? Remarkably,
it corrects those errors and returns a useful document.
But consider a more disrupted string of characters, which is
this one. What does this say? It's not so easy. How about
this? Okay. Right, so you might be familiar with these kind
of things. Sometimes they go around on e-mail or Facebook.
The only difference between these two slides, this one and
this one, is I rearranged the words. Here they are in the
correct--so each word is sort of permuted letters of a real
English word, but here they follow the order of words in an
actual English sentence that makes sense. You can read this
sentence even though every word is misspelled. Here I've
mixed up the order of the words, and it's very, very hard to
read, right?
Google doesn't know what to make of either of these things,
in particular the one that you have no trouble reading, even
those of you who aren't native English speakers. Google
completely chokes on it. What's the difference? Google's
spelling correction works by finding frequently occurring
clusters of letters and saying, that's a word, and trying to
look for the most likely interpretation of the actual strings
that you typed in, in terms of the words, those patterns that
it knows. But it doesn't understand the more basic meaning,
that sentences in language actually express propositions
or thoughts. Words express concepts, and a proposition or a
thought links those concepts together into this larger unit.
It's about something. It's about a world. It's about a
world of people who have goals, who have beliefs, abilities
and so on. That's the kind of common sense thing we're
looking at.
Look at the IBM Watson Jeopardy playing system. Very
impressive again. It's easy to pick on because it's so good.
But look at what IBM had to do. They invested a great deal
of financial and human capital to build this system, and they
trained it on basically all the world's data, and
instantiated it in a basically massive super computer
cluster, and it works really well for that. But it can't do
anything else. It can't explain why it works. If you change
the rules a little bit, it can't play chess. That's a
different IBM system. It can't do crossword puzzles. It
can't give a lecture. It can't do anything but that.
But contrast this with your mind, or anyone's--imagine the
first time you played Jeopardy, someone could explain the
rules to you, or you could basically just figure them out by
watching the game. It's a slightly funny way of asking
questions. Without any training or massive engineering, you
achieve competence, and it's that ability to achieve
competence in an endless number of everyday tasks that is
part of what we mean by intelligence.
How do we do that? How do we get the kind of common sense
that supports that into a machine? There's some again, very
interesting work that's easy to pick on because it's famous
and good. In AI and machine learning, there are various ways
that people have been trying recently to get common sense
into machines, often using language as an input
representation or as a data source. I think part of the
motivation for this is that with the web, we have massive
amounts of text, which expresses in some direct or indirect
form a lot of common sense, and also, using the web, you can
get people to type in things.
So this is data here from a project called the Learner
Project by a researcher named Chklovski, and he
got people and then computers to sort of extend and
generalize millions and millions of common sense facts--
these propositions like abalones can be steamed.
That's true. They can be squashed. They can be used for
food. They can be put somewhere. They can be fried. These
are all facts that we all basically recognize as true. A
ballpoint pen can be made in Singapore. A balloon can break.
This is a rather small sample of the effectively infinite
number of common sense facts.
One approach is to just try to grow and grow and grow this
list and have a system that can reason over that, and maybe
that can be common sense. Or Tom Mitchell over here and his
team at CMU have this NELL system. This is a figure from the
New York Times where it's basically doing something similar,
trying to automatically read the web and the newspaper and
grow out a propositional network of knowledge. But systems
like IBM's Watson system or this one, while they do
incredible things, they also make basic obvious mistakes that
no human child would make.
So in this article in the New York Times, there's this
anecdote about how the NELL system learned about internet
cookies. Anybody remember this? So it had learned a
category of desserts, basically, words like cake, cookie,
pudding, whatever, ice cream, things you could eat that were
sweet, that were delicious, looking at co-occurrences in
text. Then it learned about maybe different kinds of
cookies, chocolate chip cookies, molasses cookies, sugar
cookies, and then it saw this new thing called internet
cookies.
Okay. Maybe that's a kind of cookie. That's a hypothesis.
Then you start to see the various other words that go along
with that, like you can delete your internet cookies, so
maybe that's something you can do to desserts. So then maybe
files, I can delete my files, so then files and e-mail
addresses and address books, all of those things became
hypothetical desserts. With enough data, it will correct
that error. But my point there is while this strategy of
distributional learning is an important one for how children
might form categories of words and concepts, including people
here who have worked on this in important ways, it's constrained
by some common sense ontology of the world that says there's
a basic difference between physical objects and abstract
concepts or information.
No human child would ever mistake an internet cookie for a
cookie or a dessert or an e-mail address and so on. So this
is hopefully starting to get at what I'm trying to get at
here. I want to understand from an engineering point of view
what you could call this sort of core of common sense. It's
a set of ideas that I've come to, that the field certainly
has come to, and it's relatively new to my work, to be
interested in this, but it's basically paying attention to
what colleagues have been saying from many different areas.
Infant cognitive development, older children, people in
lexical semantics, people in visual - - understanding, all
seem to be verging on this idea over the last couple of
decades, and it's in itself a common sense idea. That our
understanding of the world comes to us not through high
dimensional vectors of data, even if at the raw sensory level
that's a way to think of it, but cognition builds on a basic
structure of objects, physical objects and intentional agents
like goal-directed actors, and their interactions. That's
what I want to understand.
How can we get these common sense abstract systems of
knowledge about physical objects and intentional agents, what
are sometimes called intuitive theories, although that word
means different things to different people. Theory-like
systems of abstract knowledge about the physical and social
or psychological world. How do we describe those in
computational terms? How are they used by your brain in
often very automatic, implicit, unconscious ways, including
how they are used and developed in the brains of young
children, even infants? That's what we want to understand.
To motivate the kind of approach we're taking, I want to have
a little digression here into vision because unlike say what
a lot of the current work in machine learning has been doing,
I think the route into common sense can be motivated from
language, but is not best accessed through language. The
work on infant developmental psychology by people like Liz
Spelke, Renée Baillargeon, and many others
who have in particular studied infants' development of the
concepts of physical objects, shows that even before children
are moving around very much, speaking or understanding very
much of language, at four months, maybe even three months,
maybe even younger--Liz Spelke would argue, very famously, of
course, innately--that even at that age, the infants
already grasp basic concepts of physical objects. I think
some compelling work in linguistics shows these might be
exactly some of the basic semantic representations that
language then builds on.
So let's think about vision, scenes and objects. The state
of the art in computer vision scene understanding is
represented by some of the figures I showed before. Here's
another very nice piece of work by some of my colleagues at
MIT for understanding scenes where basically what people are
trying to do these days, this main state of the art, is put
boxes around parts of images that correspond to some
semantically meaningful object in the world, like a face or a
person or a sofa or a piece of road, okay? And you can do
this by getting a lot of labeled data automatically labeled
or semi-automatically, or hand-labeled by people, somehow
putting boxes around - - , this is a road, this is a road,
this is a road, and then learn a road detector.
The idea is to learn a whole bunch of detectors with these
basic level word category labels like road, person, sofa, and
that's scene understanding. But think about real scene
understanding from a common sense point of view. What kind
of person detector would detect all the people here? Anyone
want to be a volunteer person detector? Well, in a slightly
smaller audience, I would literally ask for a volunteer, but
since we don't have all day and the room is big, I'll do it.
If I say point to each of the people in this scene here, you
see all these bikers, but of course we know that basically
there's people--every bicycle helmet back there is a person.
We know that. Whereas no machine vision person detector like
the Mobileye thing that's in your new Volvo, is going to
detect that. Similarly here, every black parallelogram is a
person. Every blue one here is, even ones that barely show
up as just a few pixels. But interestingly here while
there's two of those same kind of shaped things, there's no
people in that scene, unless it was some kind of weird
Beckett play where the people had been buried in
the lawn.
Or take other sorts of examples of the kind of scene
understanding I'd like to achieve with common sense. Look at
this scene here, and we don't just see a bunch of wooden
planks, and we don't just see a house in its outline or frame
form, but we can analyze certain physical stability things.
Like, we can say this looks like yeah, it's basically
standing up. You can make guesses about which planks are
structurally important. If I remove this, the thing might
fall over, but if I remove this one, nothing would. You can
figure out what's attached. These things here are probably
freely floating. You could pick them up whereas many of
these other things up here, you couldn't just go up and pick
up. You'd have to pry them off.
How do we look at these kind of scenes of social interaction
and infer what people are thinking, feeling, wanting from
others, or you know, take a very practical application,
again, of street scene understanding. We don't just want to
be able to detect pedestrians and analyze where they are and
how close I am and when I have to step on the brakes, but we
want to understand their motion in terms of what is actually
driving their goals.
You'd like a car, as an automated driver system, just as
a human driver who's paying attention is able to do, to
see people moving around and infer where they're going, why
they're going, when they might be likely to stop or start.
That's the way we look at scenes like this. And if you're
paying attention as a driver and visibility is good, you're
very good at not only not hitting the pedestrians, but
anticipating where they're going, stopping, starting, making
eye contact, using that to really be a sophisticated
interactor with that scene. The current generation of
robotic car systems don't do that.
But it's acknowledged that they would need to do that. I'm
saying in order to do that, you need to understand these
things as objects in the world, which are moving because of
their intentional states. Maybe the most concrete
motivating application for me, although it's sort of, again,
basically a laboratory setting, are these kinds of animations. Many
people are probably familiar with the classic Heider and
Simmel animation on the right. I'll show this one first, on
the left, by developmental psychologists - - and Csibra.
I'll show this a couple of times while I talk over it. This
is a short excerpt from a stimulus in an experiment done on
13 month old infants where the infants see this, and just
like you, they've showed that the infants don't simply
perceive this as a blue sphere and a red sphere translating
or rolling on a green plane around some yellow-tan blocks,
but as something kind of intentional, right?
So how would we describe this scene? How would you describe
it? Something like the blue is chasing the red one. The red
one is fleeing. It's a sort of slight--the blue's chasing is
a slightly dumb one because he thinks he can fit through
those holes that he can't. We perceive that, if you pay
attention and look carefully. Or take, for example, the
classic Heider and Simmel one here. Again, you could see this
as two triangles and a circle moving, but instead we see them
as objects, which are agents. We see the forces as one hits
into the other. We see the big one looks like he was kind of
scaring and bullying that one off. The little circle is
hiding in the corner until the big one goes after him. A
little nervous laughter, cue the scary music here.
If you haven't seen this before, don't worry, it ends
happily. At least for the little circle. So what is it that
you see in any of these stimuli? As David said, I'm very
interested in how the mind gets so much from so little.
Think about the data here as input. It's basically these low
dimensional time series. It takes you just a few numbers to
describe over time how these shapes are moving in the image
sequence, but you don't see it that way. You see it as
objects with physical interactions, and intentional agents
with goals. Basically simple kinds of mental states. Goals,
beliefs, false beliefs, even emotions, moral status.
So how does that work? If I could just get a model in good
reverse engineering terms that could understand this video, I
would be very satisfied, and I think that it would actually
be the same things I need, at heart, to drive this kind of
technology on the applied side. So the questions that I want
to address in this talk are just the beginnings of this
enterprise. I think I could spend a lot of time on this, and
many others could, too. If you're interested and want to
work on this, I would be very happy to talk with you about
all these issues.
So I want to say what kinds of--or we want to start looking
at what kind of computational ideas could answer these
questions. The questions of the form of the knowledge, how
it's used and how it's learned. And one starting point,
which I think is familiar to maybe many people, certainly
coming from computer science, but increasingly in psychology
and neuroscience, is a language for building models of causal
processes in the world that's known as Bayesian networks.
They're a kind of graphical model.
Just curious here, how many people know what a Bayesian
network is? Okay. How many people have never seen a Bayesian network?
This is what you would expect. Judea Pearl, who
is the computer scientist most associated with developing
this, just won the Turing Award, the award for
lifetime achievement in computer science, and it's a
fitting acknowledgement of that that pretty much you can go
to any sophisticated audience that's interested in
intelligence, and most of them have heard of or seen this
idea.
It's certainly influential in all of the work that I do. The
basic idea of one of these models is to start off with a
graph, a directed graph with nodes and arrows to represent
both basically the causal processes that are operating in the
world that give rise to the data we see, and then perception,
reasoning, decision involves observing some of these nodes,
which correspond to certain aspects of the world, the more
observable ones. Making inferences about the variables out
there that you can't directly observe, and then predictions
or decisions about the ones that provide key value to you.
So a classic example is this kind of medical diagnosis
network where you'd have two levels of nodes here
representing the presence or absence of diseases and symptoms
in a hypothetical patient, and the arrows correspond to which
diseases cause which symptoms. The graph represents
basically what causes what.
Then you can put probabilities on the graph. This is an
example here where the basic knowledge about the world that's
going to be useful for making inferences from incomplete
observations basically says for each node, you look at its
parents in the graph, and you say, what's the conditional
probability of that node's variable, let's say the symptom,
it's just a binary present or absent, taking on some value
conditioned on the state of its parents? And it doesn't--it's
a very lightweight notion of causality, right?
All it says is what causes what, and you put some numbers to
capture roughly the statistical contingencies that go with
that causal pattern. And these models have been very, very
valuable on the engineering side and the scientific side, but
they're not enough. If there's one technical message to take
home from this talk, it's that we need a richer language.
I'm going to skip this slide because if you've seen Bayesian
networks, you're probably familiar with the basics of Bayesian
inference, which is the process of observing, for example,
some of the symptoms, and then working backwards, going
against the flow of causality to infer what causes best
explain the effects. That's all you really need to know
about Bayesian inference for the purpose of this talk, that you
can do this and there's principled ways of formalizing this
and effective algorithms for doing it.
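As a minimal sketch of that kind of backwards inference, here is a two-node disease/symptom network in Python; the probabilities are made-up illustrations, not numbers from the talk.

# Tiny disease -> symptom network; all numbers are illustrative.
p_disease = 0.01                  # prior P(disease present)
p_sym_given_d = 0.90              # P(symptom | disease present)
p_sym_given_not_d = 0.05          # P(symptom | disease absent)

# Observe the symptom, then reason against the causal arrow with Bayes' rule.
joint_present = p_disease * p_sym_given_d
joint_absent = (1 - p_disease) * p_sym_given_not_d
posterior = joint_present / (joint_present + joint_absent)
print(posterior)                  # P(disease | symptom) is roughly 0.15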
But what I want to emphasize is the way that we're going to
represent our causal knowledge, that we need to go beyond
graphs, basically, to programs. So there's a new idea that
is gaining currency in various quarters. Some people in our
group have worked on this. It's this idea of a probabilistic
program, and sometimes you'll see the language probabilistic
programming, but that's a little bit more ambiguous. Think of
a probabilistic program as a very fancy kind of graphical
model, one that, unlike standard graphical models, isn't
built just out of the basic tools of statistics. When people
actually go look at the meat of graphical models or Bayesian
networks, what you see are distributions, exponential-family
distributions, - - , binomials, multinomials, standard things
that in any stats class you'll learn about for describing
these basic representational features, these nodes are random
variables, and then standard kind of concepts again that
we're familiar with from statistics and engineering for
representing relations between variables, like linear models
or nonlinear things.
Sigmoid nonlinearities, additive interactions, those kinds of
things. This toolkit of sort of linear-Gaussian, or
linear-plus-plus, Gaussian-plus-plus--that is the language
of graphical models. But what I think, and many others think,
is that in order to describe something like common sense
knowledge for AI or cognition, we need to marry probability
theory not only to the language of graphs, but to the full
computer scientist's toolkit.
The same way that you could use a graph to represent the
control flow in a program. We often make these little flow
charts, but that isn't the program. It often fails to
capture much of the richness of what's actually going on in
the program, what you actually need it to do to work for you.
The idea here is to use programs to describe causal processes
in a much more sophisticated, fine-grained, powerful way.
These are going to be what we call stochastic programs.
They're programs that flip coins or roll dice to again
capture things that we're uncertain about.
And when I use the language of probabilistic programming to
distinguish a probabilistic program which is a certain kind
of knowledge representation, there's also the idea, I think,
that's important that the full computer scientist toolkit for
describing the world, which includes data structures of much
more flexible sorts than we're used to seeing in Bazian
networks or graphical models, that that's going to be
important for understanding common sense. Data structures
which deal very naturally with recursion, for example.
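A minimal sketch of what a stochastic program with recursion looks like in Python; the particular generative process (random binary trees) is a toy assumption chosen only to illustrate the idea.

import random

# A stochastic program: each run flips coins and may call itself recursively,
# so running it many times defines a probability distribution over trees.
def random_tree(depth=0):
    if depth > 4 or random.random() < 0.5:       # coin flip: stop here?
        return "leaf"
    return [random_tree(depth + 1), random_tree(depth + 1)]

samples = [random_tree() for _ in range(3)]      # three draws from the program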
So I want to mostly introduce this idea in the context of a
certain kind of intuitive physics. I've shown many examples
of this already, but an understanding of the mechanics of how
objects interact from visual data. I'll say a little bit
depending on time about these ideas in more of an intuitive
psychology context. How we understand the action of
intentional agents.
The particular kind of judgments that we're going to be
looking at, and most of what I'm going to show you are
basically behavioral experiments and computational models
that are not so much AI or engineering applications, but
hopefully you can start to see where those would build on
these ideas. We've done a lot of work recently focusing on
phenomena of stability. What makes--for example, you look
at these two tables. The one on the left looks more stable.
You wouldn't mind leaving your laptop on it or leaning
against it, whereas you wouldn't want to do that to the one
on the right.
Or consider these wooden blocks up here. You can just look
at it, and right away without much effort at all, your eye is
naturally drawn to the points of instability, like the
accidents waiting to happen. For example, up here, that's
about to fall over. You could--if I said point to the points
of instability, pretty much all of us would point to the same
things. There's games that engage this behavior like the
game of Jenga. How many people know the game of Jenga?
Cool.
We're going to basically be doing experiments on things like
that game. Your goal is maybe to take a block out without
knocking over the stack, but in such a way that will leave it
as precarious as possible for your opponent. How do you do
that reasoning? Here's a couple of slides in the middle here
that I took from my own version of Jenga at the last
conference I was at. We can see this looks like--this is not
a good place to let go of the coffee cup, and this is what
happened right after that.
Over here on the left are actually stability illusions.
There's this art of rock stacking, I guess, or I guess that's
what it's called, which is practiced on beaches mostly on the
West Coast. There's probably some New England rock stackers
who will take either manmade blocks, bricks or typically the
rocks you find at the beach and stack them up into
arrangements like this, which are physically stable in the
sense that they're in static equilibrium. They may not look
like it. That's the point. They're illusions. They're
visual illusions.
They seem so precarious that they must actually be in motion,
but the whole point is to arrange these in some kind of
counterintuitive way.
One of the things we'd like to do in a lot of - - vision is
actually explain what's going on here. Why these things can
be stable when they don't look stable, as well as how our
system mostly gets it right. I would say our ability to
analyze stability is quite powerful.
So here's how you'd build a probabilistic program for this,
and it's going to look a lot like, to start off with, a
Bazian network or graphical model. Again, that's just a nice
way to sketch how it's describing what's going on in the
world that gives rise to this data and captures the aspects
of the world that we want to be reasoning about here.
You can think of this the same way people often talk about--it's common to talk about vision as a kind of inverse graphics.
This is a kind of inverse graphics model, but I put it in
Pixar to emphasize the dynamic dimension. So we say we
observe an image, which is the product of some kind of
underlying world state, some scene, the world state being
some three dimensional configuration of objects in the world,
the image being some two-dimensional set of pixels or
photoreceptor activities.
There's some kind of causal process that gives rise to images
from world states, which you could call graphics or
projection. It's how the 3D world gives rise to the image. But
the causal picture is more interesting than that because you
have time. The world state evolves in time, and you have
something like physics, which describes the processes going
on in that direction.
And our task here is to observe maybe one image or maybe a
sequence of images, and infer something about the underlying
world state. That might be the perception or the reasoning,
and maybe also if we wanted to do learning, we might want to
learn the physics or the graphics from a bunch of examples of
this process.
Again, if you're familiar with graphical models, you'd say
this is of course a very natural graphical model. It's a
hidden Markov model. That's a very standard kind of model.
But the point here is that yes, you can draw this in this
way, just again, if you know about hidden Markov models,
trying to represent the physics of the world in those scenes
as an HMM and the graphics as an HMM, where you've got
discrete output symbols and discrete hidden states, and you
make a big matrix that represents the transitions and the
outputs--we'd be talking about infinite-by-infinite
matrices, and you still wouldn't be capturing things at the
right grain.
So a probabilistic program, you can think of it as a
graphical model but with thick nodes and thick arrows, and
maybe some amount of recursion. What we really care about is
the stuff that's inside those nodes, and the arrows--what we
don’t care about--I mean, the interesting thing here isn't
that the world at, say, time T gives rise to the world at
time T plus one.
But how does physics work? Think about Newtonian mechanics,
or any other kind of mechanics: what's interesting, the real
content of the theory, isn't the arrows but the equations.
So what are the right equations that describe how the world
evolves causally, and the right equations that describe how
world states give rise to images?
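One hedged way to write down the structure being described, with W_t the world state and I_t the image at time t (the notation is assumed for exposition, not taken from the slides):

P(W_1..T, I_1..T) = P(W_1) * prod over t of P(W_{t+1} | W_t) * prod over t of P(I_t | W_t)

where P(W_{t+1} | W_t) is the physics, P(I_t | W_t) is the graphics, and perception is inference about the posterior P(W_1..T | I_1..T). On the probabilistic-program view, those two conditionals are simulator programs rather than lookup tables or transition matrices.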
Let me get even more concrete with a set of images from one
of our experiments.
These are--it's a little bit dark, but hopefully you can get
the basic idea. It looks a lot brighter on my screen. Is it
possible--I don't want to mess with your light balance by
saying turn down the lights. You'll get the idea. If you
can see different configurations here, that's good.
So we showed people these sets of blocks, in this case ten
blocks that are basically Jenga blocks, but colored to make
it easy to pick them out individually, and some of these
configurations look very stable. This is relatively stable.
Here's one that looks pretty unstable. Everyone agree?
Pretty unstable.
This one's also unstable or stable? Yeah, pretty unstable.
Here are some kind of intermediate cases. So people make
judgments on a scale of 1 to 7 how stable is this or how
likely is the tower to fall. You can ask the question in
different ways. It doesn't make that much difference because
we all know what we're talking about. This one is very
unstable. This one is pretty stable.
And then the way we're modeling this is we're saying you
observe an image at some time, T, and then you have some kind
of probabilistic approximation to both of these processes.
The graphics process and the physics process. Then you do a
Bayesian inference to infer the likely variables that would've
given rise to what you observed. So in this particular case,
what world state under a graphics rendering function is most
likely to have given rise to that image, maybe something like
that.
So the model that represents this is something like a CAD
model. This is a set of 3D blocks, and they're sitting on
top of each other in some way. Because there's uncertainty,
maybe you only look at the image briefly, maybe you don't
know exactly how graphics works, your ability to recover the
correct 3D position of the objects is limited. This might
just be one sample from the Bayesian posterior, one sample from
your best set of best guesses here.
Here's another one. So I guess I slightly went out of
sequence, but given one of these samples, you can then say,
well, how is this going to evolve forward in time? So there
we have a probabilistic approximation to certain aspects of
classical mechanics, which basically allows us to run a
simulation and see for this stack, it's going to kind of
maybe fall like that. For this stack, it maybe falls like
that. And we're going to make inferences effectively by
doing inference and simulation.
Another way to sum this up is that we're positing--in some
sense, you have a video game engine in your head. How many
people are familiar with these physics video games? Angry
Birds of course is very famous, but there's ones which are a
lot more interesting, like stacking blocks or Cut the Rope,
these kind of things. They're endlessly fun for people who
know them, and what makes them work is that basically there's
some kind of rapid approximate simulator for the physical
world. It has parameters that allow you to stretch beyond
things that are natural physics. If you ever played Jelly
Car, you have this car that can get all mushed up.
So sometimes these games have their own internal physics,
which you have to learn, and there's some kind of graphics
engine, which simulates this thing forward based on your
interaction and renders pictures pretty quickly. We're
suggesting you have something like a causal model in your
head that behaves like that, and you can put probabilities on
the parts of the model that you don't directly observe, to
make inferences about those from the parts that you do
observe.
On this particular model here, we're trying to do fairly
quantitative--maybe psychophysics for those of you who
actually do psychophysics is a little bit the wrong word,
but I should say the post-doc who's been leading this project
is Peter Battaglia, and he definitely comes from a
strong psychophysics background, and our intention is
to really make this as compellingly, quantitatively rigorous
as the best of the psychophysics.
In this case, we've been exploring models that just have two
free parameters that we can explore how they interact with a
range of different stimuli and judgments. One is this sigma, state
uncertainty, which basically captures the variation in these
plots here. How well can you localize the position of blocks
in three dimensions given the image? So the higher sigma,
the more uncertainty you have about the true state of the
blocks. Then this - - , which basically says how much
uncertainty do you have about the forces that are acting on
this thing? So the idea is you basically understand about
gravity and sort of inertial force interactions, but you
might allow for the possibility that there can be some unseen
forces.
This table the blocks are sitting on could be bumped, or
somebody could poof air on it or something like that. In
this task, you might wonder, well, we didn't say--I just
showed you a bunch of blocks. I didn't say anything about
perturbing forces, but we can--so hopefully this parameter
won't do a whole lot.
We'll see some evidence that even here it has a little bit of
an effect, and then we'll explore it in a subsequent
experiment.
So the basic summary of the model, then, is we show people
these stimuli. We simulate inference in this by drawing a
set of samples from this scene posterior parameterized by
sigma, the state uncertainty, then propagating them forward
with a kind of approximate classical mechanics game engine,
and look at what happens. You can also think of it as a kind
of probabilistic mental imagery.
And looking at the outputs of a bunch of simulations, we can
then ask various queries. Compute the values of predicates
on the outputs. If we're interested in how much of the tower
will fall, how likely is the tower to fall, we might look at
say--just count the percentage of the blocks that fell. Or
here we might be interested in which way are they going to
fall. That's another query or predicate we could - - on the
value of the very same simulations.
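Here is a minimal Python sketch of that sample-and-simulate loop; simulate_physics is a hypothetical stand-in for the approximate game-style physics engine, and sigma, phi, the block representation (an array of 3D positions), and the fall criterion are all illustrative assumptions.

import numpy as np

def simulate_physics(block_positions, latent_force):
    # Hypothetical stand-in: a real implementation would step the blocks
    # forward under gravity plus the latent bump force until they settle,
    # and return their final resting positions.
    return block_positions

def expected_fraction_fallen(observed_positions, sigma=0.05, phi=0.2, n_samples=50):
    rng = np.random.default_rng(0)
    fractions = []
    for _ in range(n_samples):
        # Sample a scene from the posterior: jitter each block's position by
        # sigma (state uncertainty) and draw a small latent force of scale phi.
        scene = observed_positions + rng.normal(0.0, sigma, observed_positions.shape)
        force = rng.normal(0.0, phi, size=3)
        final = simulate_physics(scene, force)
        # Query a predicate on the simulation output: which blocks moved a lot.
        fell = np.linalg.norm(final - scene, axis=1) > 0.5
        fractions.append(fell.mean())
    return float(np.mean(fractions))   # expected proportion of the tower that falls

Other queries (which way the tower falls, which color of blocks hits the floor) are just different predicates computed on the very same simulation outputs.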
An example of one of these experiments is shown here, the
results from it. In this experiment, subjects were given 60
different stimuli, these towers, of which I've shown you a
few before. They were designed to cover the spectrum from
very stable to very unstable. I think this picture shows
you, this is the stable end, this is the unstable end up
here.
On the X axis, we're plotting the model prediction for its
stability, which is basically the expected proportion of the
tower that will fall, and on the Y axis, we're plotting
humans' judgments, as normalized Z-scores.
And what you can see is there's basically a pretty good
correlation. It's usually around 0.9 in different versions
of this model and different versions of this experiment.
We've done this experiment many different ways, probably at
least half a dozen different ways, with feedback, without
feedback, asking the question, how likely is the tower to
fall, or a 2AFC (two-alternative forced-choice) task.
Basically it doesn't really matter. You get pretty similar
kind of model fits in all of these cases, and it's
interesting that it doesn't actually depend on--we can give
people feedback. After every one of the 360 trials, they see
each of the 60 stimuli six times in randomly arranged blocks.
We can tell you what happens, or we don't tell you what
happens in any one of them. We just give you a little bit of
feedback on a few different towers at the beginning to get
used to the task.
It makes almost no difference in people's responses.
To
explore a little bit into the role of uncertainty here, you
might say, well, of course, so I showed you a model that fits
pretty well, but how well, and how much do these parameters
matter? Is there any evidence that there's really some kind
of a probabilistic physics involved here? This plot is meant
to give you some insight into that, where it's sort of a meta
plot. The previous scattergram here was the fit of the model
for one value of that state uncertainty parameter, and no
latent forces.
Now, here, that's basically this intermediate setting. It's
a value of about sigma 0.05, and what that means is basically
a standard deviation about a quarter the size of the smallest
edge of one of these blocks. That's the
place where the model fits best, and what this solid line is
showing, the correlation between model and humans, which
again maxes out around 0.9 for that value of the parameter.
The dashed line is the no-feedback experiment, which tracks it
almost perfectly.
So it fits a little bit worse, but the correlation between
these conditions is about 0.95 just in the behavioral data,
from one group of subjects to the next.
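A hedged sketch of how such a curve can be computed, reusing the expected_fraction_fallen sketch above; towers and human_ratings are placeholder arrays, not the actual experimental data.

import numpy as np

def correlation_vs_sigma(towers, human_ratings, sigmas):
    # Z-score the human judgments once, as in the plots described above.
    z_human = (human_ratings - human_ratings.mean()) / human_ratings.std()
    curve = []
    for sigma in sigmas:
        preds = np.array([expected_fraction_fallen(t, sigma=sigma) for t in towers])
        curve.append(np.corrcoef(preds, z_human)[0, 1])   # Pearson correlation
    return np.array(curve)   # model-human correlation as a function of sigma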
Now, it probably won't surprise you, if I increase the model
uncertainty, so the model is less able to localize the
objects, the fit goes down because at some point if you have
no idea where the objects are, you can't predict anything.
What's maybe more interesting is what happens when you lower
the model uncertainty. So at this zero end here, that's
where the model has a basically perfect grasp of the
situation that's being simulated here. It knows exactly
where the blocks are, and it has the physics exactly right,
at least as far as the simulation captures.
Interestingly, that's where it fits human judgments worst,
right? So this is an ideal observer model, but it's got a
key role for uncertainty. Why does it fit human judgments
worst here? Well, a lot of it's being driven by a couple of
interesting outliers, but look at this red data point here.
It corresponds to a stimulus that the model says is very
unstable, and people also say is very unstable, but here when
you eliminate that uncertainty, the model says, oh, no,
that's actually stable, but people still say, of course,
because the Y axis is the same, that's just behavioral data.
This corresponds to one of these towers I showed you before,
this one up here, the one in red, which looks to all of us
very unstable, but in fact it is physically stable. It is in
static equilibrium. It's actually a simple version of one of
these illusions. And this is surely not the whole story of
what's going on here. This is more complex in these real
world cases, but roughly we'd say yes, what allows for
illusions here is basically what allows for illusions here.
It's possible to arrange objects in configurations, which are
stable, but kind of precarious, a small perturbation will
move them or will make them unstable, and we think that
people's sense of stability is fundamentally probabilistic in
this sense. It recognizes that kind of precariousness, which
can come from state uncertainty, or from the possibility of
latent forces. Forces like the wind blowing or somebody kind
of walking nearby or bumping it. I won't go into the
details, but you can get basically similar effects of adding
in forces of a very small, latent magnitude. Basically
allowing for the possibility that the table might be bumped
by a very small force also improves the model fit in a very
similar way.
Now, again, I could spend hours talking about this model and
the experiments that Peter and Jess Hamrick, who
is a master's student working with him, have done. For example,
one of the things they've done is investigate various
heuristic alternatives that don't involve anything like a
rich physical causal model. And we can show that for
example, for this kind of task, height is a pretty reasonable
heuristic. It correlates at about 0.77 with people's
judgments. But there's all sorts of reasons we have to
believe that height is not primarily driving judgments here,
even though in these examples I showed you, the more unstable
one is also the taller one.
For example, you can do a purely height-controlled
experiment like we showed here, where we made 60 new towers
that are all the same height, and people are still able to do
this.
Not quite as well, but of course that's because in forcing
the height to be the same, we're also compressing the dynamic
range, and there's a lot--it's kind of like zooming in on
this part of the scattergram here, where there would also be
less of a good correlation. But basically - - that
compression of the dynamic range doesn't make a difference.
But I'll show you an experiment in a second where it does
make a difference, where something like a height-type heuristic
might be a more plausible account. What we've--part of the
reason to be doing this work is common sense. Even in this
very simple domain, the point of having an intuitive
mechanics theory is what we're used to seeing in other areas of
cognitive science, most famously the idea of language as an
infinitely generative system where with a finite system of
rules, you can produce and understand a constrained but,
within those constraints, unbounded set of
structures.
We think that intuitive theories of physics or psychology
work the same way. That by building these simulations and
running them, there's an effectively infinite number of
questions we could ask, which can be represented in language
for our purposes, but we think have to be represented in some
kind of formal predicate system as far as the simulation
models are concerned.
So in addition to just asking will the blocks fall over, we
can ask which way will they fall or how far will they fall,
or counterfactual predictions. If you had removed this
block, or if you hadn't removed this block, would it still
have fallen over? What happens if you bump the table with a
pretty hard force? What's going to happen? All sorts of
things. We can ask about other latent properties. Which of
these blocks is heavier or lighter than the others? Which
ones may be smoother or rougher? And we're doing experiments
on all these predicates. So to give a couple of examples,
here's an experiment that simultaneously asks two questions.
People see a set of one of these towers, and they have to
draw visually how far and in which direction they think the
blocks will fall. Hopefully it's pretty intuitive. This is
not a very good answer, but the next one will be a little bit
better. So you see the user, the subject, is adjusting this
little polar coordinate system, and trying to align the
circle with how far the blocks are going to fall, and that
line with this sort of basically where the mass of the blocks
are, so here's a pretty good answer. Hopefully you can all
see that one is pretty good and the first one wasn't so good.
I think I cut out the slide with the data there, but
basically what we've found is that--here's a little question
for you. So consider these two judgments, which people can
make: how far will the blocks fall, and which way will they
fall. I'll tell you that on one of them, people are very well
correlated with basically the exact same model that I gave
you before, and on another one they're not so good, at least
as far as the model is concerned. This physical simulation model isn't very good
at predicting people's judgments.
So which do you think people are better at, from the point of view of the - - observer model? How far will they fall, or which
way will they fall? How many people say how far? How many
people say which way? You're right. We didn't need to do
the experiment. Why? Anyone know why? I would've been
embarrassed if everyone raised their hand and said why.
That's a little bit less obvious.
So I mean, there's various reasons why, but I would say one
basic thing is it's a harder task for the model. In order to
know whether the blocks will fall or which way they'll fall,
you only have to do a very short simulation, right? If they
start to fall--if you imagine, okay, as soon as they start to
fall, they're falling. It's over. Which way are they
falling? It's kind of once they start to fall in this way,
they're basically going to fall in that way, right?
But if you want to say how far are they going to fall, watch
this movie, you have to run the simulation to the end. They
could bounce around. It's a very complex dynamical system,
and you can look at this quantitatively and say how accurate
is the simulation, either about ground truth or about
people's judgments as a function of how long you run it
forward? And basically for the first thing I showed you,
stability and the question of which way will they fall, these
probabilistic simulations become very accurate very quickly.
But for the question of how far will they fall, you have to
run it for a much longer period of time, about a second
instead of a couple of hundred milliseconds before you get
anything useful or anything in particular as useful as a
simpler heuristic, which is height.
So here, a height heuristic actually predicts better than our
physics model how far the blocks are going to fall, and
predicts it of course instantly. You just have to look at
the thing. So we think it's an example of a simple heuristic
that might make us smart, I guess, if you're fond of those
sorts of ideas. But I think the real challenge for us, one
of the real challenges here is to understand the interaction
between some richer causal understanding of the world and
simple heuristic rules like that. We think this is sort of
an interesting case of an adaptive mixture of the two where
your brain seems to realize in a sense that a simple rule
like this height rule is about equally well correlated with
the two judgments: how stable is the tower, and how far will
the blocks fall.
But we really seem to be using it in only one case, the case
where, on a cost-benefit analysis, it provides that useful bit of
information much more quickly than the more complex
simulation you could run.
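To make that cost-benefit point concrete, here is a minimal sketch in
Python of the kind of comparison being described. Everything in it is
illustrative: the one-dimensional "tower" dynamics, the noise level,
and the read-outs (direction of fall versus distance fallen) are
stand-ins for a real rigid-body simulation, not the lab's actual model.

    import random

    def simulate_fall(height, horizon, noise=0.3):
        """Crude 1-D stand-in for a noisy rigid-body simulation.
        Returns (direction, distance) after `horizon` steps."""
        tilt = random.gauss(0.0, noise)       # latent initial perturbation
        position = 0.0
        for _ in range(horizon):
            tilt += random.gauss(0.0, noise)  # forces accumulate noisily
            position += tilt
        direction = 1 if position > 0 else -1  # which way it falls
        distance = min(abs(position), height)  # how far the blocks end up
        return direction, distance

    def monte_carlo(height, horizon, n=200):
        """Average many noisy simulations, as the model does."""
        samples = [simulate_fall(height, horizon) for _ in range(n)]
        mean_dir = sum(d for d, _ in samples) / n
        mean_dist = sum(x for _, x in samples) / n
        return mean_dir, mean_dist

    def height_heuristic(height):
        """The simple rule: taller towers fall farther."""
        return height

    if __name__ == "__main__":
        for horizon in (3, 10, 100):           # a few steps vs. a full run
            print(horizon, monte_carlo(height=10, horizon=horizon))
        print("heuristic:", height_heuristic(10))

The point of the sketch is only that the direction estimate stabilizes
after a short horizon, while the distance estimate keeps changing until
the simulation is run much longer, which is where a cheap heuristic can
win.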
Here's a task, which is a little bit more unusual. And we
wanted to study this task for a few reasons, but one is to
take us outside of a regime of tasks, which everybody is
pretty familiar with. You might not have done exactly that
task before, but you've played with blocks and sort of judged
whether they're stable or not if you've played Jenga. This
is a task that isn't exactly like any one that you've done
before, but it starts to show a little bit of the expressive
power of this model. So let's say we have these blocks
on a table here, the red and yellow blocks, and imagine the
table is bumped hard enough to knock some of the blocks onto
the floor. Is it more likely to be red blocks or yellow ones
that hit the floor? What do you see, red or yellow?
Something unstable. Is that red or yellow? How about here?
Why is that one funny? All right. So you get the idea. You
can make these judgments pretty quickly, and, I think, with a
fair amount of consistency, although you can see from the
reaction times and the variance that there's some
uncertainty, and those are things our model wants to capture.
So this is a similar kind of data plot for an experiment
where there were 60 different configurations of red and
yellow blocks, and they were designed in a kind of factorial
way to vary things like how high the different color stacks
were, how close they were to the edge, how precarious each
individual stack was.
Pretty complex scene factors. It's hard to say exactly how
each of these factors turns into the final judgment, except
that, as far as our model is concerned, they all factor into the
simulation. Again, the model basically fits as well here as it
did for the simpler judgment of just how stable this stack is.
The model is exactly the same as the model before, but with one
extra feature--a feature that gets exercised more here--which is that latent
force uncertainty. Before, we allowed for the possibility
that the table could be slightly perturbed, but now we've
said the table's bumped fairly hard, enough to knock some of
these blocks off. So we varied that parameter and found,
sure enough, that you get a model fit like this only when
that parameter is in a fairly reasonable range and it's
actually strong enough to knock some of the blocks off onto
the floor.
So it's nice that it's basically the same model with one
twist, but that twist is exactly the one we described to
subjects. If you want to think about a very different kind
of alternative account, often people look at this if they
come from machine learning and say I think I could design
some sort of classifier that I could train to detect this.
It's an interesting challenge, and we've done a lot of work
to try and do that. If you want to take it on, I'd be very
happy to talk to you about that and see the results. Again,
think about what your brain is able to do here: it takes a
sentence in language, which says that the table is now bumped,
and it knows how to take that information and turn it into
the relevant parameter of the physics model. That's the kind of
common sense we're talking about, and we think it needs a
representation where words like force actually mean what you
think they mean. So we think we fundamentally need such a
representation.
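As a rough sketch of that one-twist version of the model, here is a
hedged Python outline of the red-versus-yellow judgment. The
run_physics helper is a placeholder for a real rigid-body engine, and
the prior over bump strength is an assumption made up for illustration;
the only point is that "the table is bumped hard" becomes a prior over
one latent force parameter that the simulation integrates over.

    import random

    def run_physics(blocks, bump_force):
        """Placeholder for a rigid-body engine: which blocks get knocked
        onto the floor, given a horizontal bump of a certain strength."""
        fallen = []
        for b in blocks:
            p_fall = min(1.0, bump_force * b["edge_proximity"])
            if random.random() < p_fall:
                fallen.append(b)
        return fallen

    def p_red_vs_yellow(blocks, n=1000):
        """Monte Carlo estimate of P(more red than yellow blocks fall),
        integrating over uncertainty about how hard the table is bumped."""
        red_wins = 0.0
        for _ in range(n):
            # "Bumped hard enough to knock some blocks off" is modeled
            # here as a prior over the latent force magnitude (assumed).
            bump_force = random.uniform(0.5, 2.0)
            fallen = run_physics(blocks, bump_force)
            n_red = sum(1 for b in fallen if b["color"] == "red")
            n_yellow = sum(1 for b in fallen if b["color"] == "yellow")
            if n_red > n_yellow:
                red_wins += 1
            elif n_red == n_yellow:
                red_wins += 0.5
        return red_wins / n

    if __name__ == "__main__":
        scene = [{"color": "red", "edge_proximity": 0.8},
                 {"color": "red", "edge_proximity": 0.6},
                 {"color": "yellow", "edge_proximity": 0.2},
                 {"color": "yellow", "edge_proximity": 0.1}]
        print(p_red_vs_yellow(scene))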
Here's just one last physics judgment, which is to show
things are not just about prediction, but inference of latent
properties. It's sort of the beginnings of learning, if you
like. So look at these objects. I'm going to play this
movie in a second. It's frozen at the beginning, and your
task is to say which is heavier, the red blocks or the yellow
blocks. They might have different density, different mass.
I'll play it one more time. What do you think? Red or
yellow? Yeah, yellow. And for those of you who maybe didn't
quite see this, there's two places where you can see it at
least. One is at the very beginning, just the way it tips
over is a sign that the yellow is heavier, but also at the
end, look at how they bounce off of each other. So it's
interesting, right? People in perception and psychophysics
have long studied cue combination. I can say, here are two
cues that are relevant. Something about initial
configuration and something about bouncing.
It's very hard to articulate what those features are, but as
far as our model is concerned, those are just two things that
are there in the simulation, and we think this is the kind of
evidence that you're able to run this analog physical
simulation at a much finer grain than language, although when
we start to talk about the scene, that's the representation
we're talking about.
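A hedged sketch of what inferring a latent property like mass looks
like under this kind of model: simulate the scene under different
candidate mass ratios and score each candidate by how closely its
simulated trajectories match what was observed. The toy dynamics, the
candidate grid, and the error-based likelihood below are illustrative
assumptions, not the actual model.

    import math
    import random

    def simulate_motion(mass_ratio, noise=0.1, steps=20):
        """Placeholder dynamics: a trajectory summary whose shape depends
        on the ratio of yellow mass to red mass."""
        x = 0.0
        traj = []
        for _ in range(steps):
            x += mass_ratio + random.gauss(0.0, noise)  # heavier yellow pushes harder
            traj.append(x)
        return traj

    def likelihood(observed, mass_ratio, n=100):
        """Approximate P(observed | mass_ratio) by comparing the observed
        trajectory to simulated ones (a crude simulation-based likelihood)."""
        total = 0.0
        for _ in range(n):
            sim = simulate_motion(mass_ratio)
            err = sum((a - b) ** 2 for a, b in zip(observed, sim)) / len(sim)
            total += math.exp(-err)
        return total / n

    def posterior_over_mass(observed, candidates=(0.5, 1.0, 2.0)):
        """Bayesian inference with a uniform prior over candidate ratios."""
        scores = {m: likelihood(observed, m) for m in candidates}
        z = sum(scores.values())
        return {m: s / z for m, s in scores.items()}

    if __name__ == "__main__":
        observed = simulate_motion(mass_ratio=2.0)   # pretend this is the movie
        print(posterior_over_mass(observed))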
I'm running fairly close to the end, so I'm going to skip,
but I'll just point you to some interesting work showing very
simple versions of this kind of object physics model can
actually be related to not just children's behavior, but even
infant behavior. For those of you who know the sort of infant
looking time literature, with - - violation of expectation
measures and how long infants look at different scenes--that's
how most of what we know about infants' understanding of
objects comes to us: from these kinds of studies by people
like Spelke [phonetic], Baillargeon [phonetic], and many
others.
And what we were able to do in this work--with - - and Ed Vul
[phonetic] as the co-first authors, and which came out of Luca
Bonatti's [phonetic] experimental lab--was to basically take
simple versions of the models I showed you and use them to
quantitatively predict infants' looking time in simple
versions of these kinds of red/yellow object motion displays.
And it's probably the first evidence, at least that I know of,
of a quantitative, psychophysics-like model being tested in
infants, and it's showing two things that I think are
valuable. One is that infants' judgments as measured by
looking time are not just qualitative; they have some
quantitative computational content to them.
The notion of surprise can be directly linked to probability,
but also it shows the quantitative value of building models
about objects and the dynamic interaction between objects
even in the earliest stages of cognition.
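One way to read the "surprise can be directly linked to probability"
claim, as a small sketch (the specific linking function below is an
assumption, not the paper's exact one): run the internal simulation
model forward many times, estimate the probability of the outcome the
infant actually sees, and take surprisal as negative log probability,
with looking time assumed to increase with surprisal.

    import math

    def surprisal(p_outcome):
        """Surprisal of an observed outcome under the internal model."""
        return -math.log(max(p_outcome, 1e-9))

    def estimate_outcome_probability(samples, observed_outcome):
        """Fraction of forward simulations producing the observed outcome."""
        return sum(1 for s in samples if s == observed_outcome) / len(samples)

    if __name__ == "__main__":
        # Pretend 1000 forward simulations produced these outcomes.
        sims = ["expected"] * 950 + ["unexpected"] * 50
        for outcome in ("expected", "unexpected"):
            p = estimate_outcome_probability(sims, outcome)
            print(outcome, "surprisal =", round(surprisal(p), 2))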
I just want to show a few slides about this other key part of
common sense knowledge to give a second example of what
probabilistic programs would be like and show you the scope
for common sense knowledge going back to things like these
Heider and Simmel type displays. So here if we want to make a
simple graphical model, we could say the relevant causal
structure is that we observe actions, and those are a
function of some latent mental states, like an agent's
beliefs and desires. That's a classic theory of mind, or a
theory of a rational agent.
You could again make a - - or something based on that. But
really the hard work is done by the thick nodes and the thick
arrows. What's the propositional content that goes in those
latent variables? They are not just some low dimensional or
even high dimensional gaussian distribution. It's got to be
structured knowledge. And what's that arrow that relates
beliefs and desires to actions? It's not some linear or non
linear - - gaussian noise thing. It's something like a
planning algorithm.
So while the programs before, the probabilistic programs were
based on physics and graphics programs, now we're looking at
planning programs. What is planning? Planning is what you do
when, let's say, you're building a robot or some other kind of
sequential decision system: you write a program that takes as
input, effectively, something like a belief--a representation
of the environment--and a desire--a utility function--and
comes up with a sequence of actions which tries to maximize
expected utility, basically. That's the kind of classical,
rational, probabilistic planning. A lot of robotics these days
is based on doing some kind of efficient probabilistic
approximation to that idea.
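As a minimal sketch of what such a planning program does, assuming a
tiny, fully enumerable action space rather than anything a real robot
would face: score candidate action sequences by expected utility under
the agent's beliefs, and pick the best one. The belief and utility
functions below are toy placeholders.

    import itertools
    import random

    def expected_utility(plan, belief, utility, n=200):
        """Average the utility of a plan over worlds sampled from the belief."""
        total = 0.0
        for _ in range(n):
            world = belief()                  # sample a possible world
            total += utility(plan, world)     # how good the plan is there
        return total / n

    def plan(actions, horizon, belief, utility):
        """Brute-force planner: best action sequence up to `horizon` steps."""
        candidates = itertools.product(actions, repeat=horizon)
        return max(candidates,
                   key=lambda seq: expected_utility(seq, belief, utility))

    if __name__ == "__main__":
        # Toy setup: walk left/right toward a goal whose location is uncertain.
        def belief():
            return random.choice([-2, 3])                    # possible goal spots
        def utility(seq, goal):
            pos = sum(+1 if a == "right" else -1 for a in seq)
            return -abs(goal - pos) - 0.1 * len(seq)          # closer is better; steps cost
        print(plan(["left", "right"], horizon=3, belief=belief, utility=utility))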
Coming from an economic point of view, you could see this as
taking the classical economic idea of make decisions by
maximizing expected utility and scaling it up to more complex
environments in sequential settings where you have to make a
whole sequence of actions, a plan, basically, not just a
one-step decision.
And this idea of understanding intuitive psychology as a kind
of inverse economics, inverse optimal control, inverse
planning, that these are different words for more or less the
same general idea, has become extremely influential in
cognitive science these days. There's lots and lots of
people showing the value of this idea, the same way that
modeling intuitive physics is a kind of inverse Pixar
problem.
I'll just show you a couple of studies very briefly of the
kind of things that we did, and this is mostly work of Chris
Baker, who's a graduate student in our lab, but it's also
been collaborative with Rebecca Saxe [phonetic]. The next
slide comes from Tomer Ullman [phonetic], who's another grad
student in the lab.
But basically what we've done here is we're trying to create
in a sort of psychophysical lab setting things like the Heider
and Simmel or that Southgate and - - display I showed you with
the little shapes chasing each other. But do it in a way
where we can vary factors in a controlled way and sort of
quantitatively assess inferences about these latent mental
states, like desires and beliefs. This was our first and
simplest one, where, if you're a subject in this experiment,
you observe an agent moving through a two dimensional room.
You're looking through the top, basically, like a little
maze. And there's three possible goal objects, A, B, and C.
There may be a wall in the middle of the room, and the wall
might have a hole in the middle, and on different trials, you
can vary all these things including where the objects are and
the path that the agent takes. You typically only see an
incomplete path, and we might stop the trial at a certain
point along the path, and you have to make a judgment on a
scale again of one to seven how likely is the agent's goal to
be the A, B or C object, or red, green and blue.
So this slide is showing, somewhat densely, a sample of the
results from this experiment. Here are several different
scenarios where you see the constraints of the environment,
the object locations are changed, and then you see the path
with a bunch of numbers. Those are the steps the agent
takes, and the bold faced ones are the ones that on different
trials we queried people at. So at the beginning, after
three steps, seven steps, ten and so on. Then in the second
row, you can see people's judgments, and in the third row,
you can see the model predictions.
So the judgments again are basically normalized to a
probability scale of how likely at each point along the path
do you think that the agent's goal is the blue, green or red,
and then the model is doing the same thing. So you can see
all sorts of different kinds of inferences going on, cases
like this one where you're basically unsure between blue and
green, and then there's a key step which disambiguates them.
You have these goal switches here where you think that his
goal is green, and then it seems like he changed his mind very
suddenly. You can see this here.
Most of you probably think up till now it looks like he's
very clearly headed to B, and then very suddenly it looks
like he changed his mind and is headed to A. You can get
double goal switches, all sorts of stuff going on. It's a
lot of texture to that data. What was really surprising to
me was how closely this model captured it. You can see here
the model predictions are almost perfectly capturing all
those little bumps and variations.
If we look over all the trials, the correlation between the
model and people is 0.98. Again, that's usually the kind of
thing we don't expect to see in high-level intuitive social
psychology but more in quantitative psychophysics. And the
model is in some ways complex, in some ways simple. It's
complex in that it says, just like we think you have a game
engine in your head, we're saying you effectively have a
probabilistic, approximately rational planner in your head.
We implement that, for people who are familiar with the
technicalities of planning, with a kind of MDP or POMDP
solver.
But the actual parameters that we vary to try to see what
versions of planning people have and how that works here are
very simple. Basically there's just a small cost for each
step you take. This is obviously way over simplified, but
this agent is assumed to incur some cost for each step, and
then to get a big benefit, a positive utility, when they get
to the goal. We then say if that's how you set up--and then
there's a little bit of randomness. They don't always do the
best thing.
Then you say, under that generative model, they tend to take
steps which maximize their expected utility. Then the question
is, observing a partial sequence of actions, what do you think
their goal was? What goal would best make the observed
sequence of actions a probabilistic, utility-maximizing
sequence? Just doing that, basically, is enough to get this
model to make such predictions.
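Here is a hedged sketch of that inference on a tiny one-dimensional
grid (the step cost, goal reward, softmax temperature, and grid size
are all placeholder values, not the fitted parameters): compute a
value function for each candidate goal, assume a softmax, noisily
rational policy, and then use Bayes' rule to score each goal by how
probable it makes the observed partial path.

    import math

    GRID = 5                                   # a 1-D corridor of 5 cells
    ACTIONS = (-1, +1)                         # step left or right
    STEP_COST, GOAL_REWARD, BETA = -1.0, 10.0, 2.0

    def value_iteration(goal, n_iters=50):
        """Value function for an agent trying to reach `goal` (max backup;
        a softmax backup would be the fully 'soft' variant)."""
        V = [0.0] * GRID
        for _ in range(n_iters):
            for s in range(GRID):
                if s == goal:
                    V[s] = GOAL_REWARD
                    continue
                qs = [STEP_COST + V[max(0, min(GRID - 1, s + a))] for a in ACTIONS]
                V[s] = max(qs)
        return V

    def action_probs(s, V):
        """Softmax ('noisily rational') policy: better actions are more likely."""
        qs = [STEP_COST + V[max(0, min(GRID - 1, s + a))] for a in ACTIONS]
        exps = [math.exp(BETA * q) for q in qs]
        z = sum(exps)
        return {a: e / z for a, e in zip(ACTIONS, exps)}

    def goal_posterior(path, goals):
        """P(goal | observed partial path) with a uniform prior over goals."""
        scores = {}
        for g in goals:
            V = value_iteration(g)
            p = 1.0
            for s, s_next in zip(path, path[1:]):
                p *= action_probs(s, V).get(s_next - s, 1e-9)
            scores[g] = p
        z = sum(scores.values())
        return {g: s / z for g, s in scores.items()}

    if __name__ == "__main__":
        observed = [2, 3, 4]                   # agent keeps moving right
        print(goal_posterior(observed, goals=[0, 4]))

Run on the toy path above, the posterior concentrates on the right-hand
goal, and a path that doubles back would shift it, which is the kind of
goal-switch behavior described in the slides.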
This other project that I mentioned, we extend this to a
multi agent setting where we're asking people to make
judgments about when two agents are interacting, sometimes
you can see one agent appear to be helping or hindering
another. There's some really nice infant work done by Hamlin
[phonetic] and Kuhlmeier [phonetic] in Karen Wynn and Paul
Bloom's lab at Yale a few years ago, and we again took this
and did it in a sort of adult psychophysical setting where
you have two agents moving around in this four room maze. I
won't go through the detail since I guess I should probably
be wrapping up, but again, we can model judgments here about
whether an agent is acting on his own or whether he's helping
or hindering another agent by having a multiagent planning
system.
Now we have these recursively defined utility functions. So
what does it mean for A to be helping B? It's for A to have
a utility function that's a positive function of B's, and to
be hindering is to have a negative function of B's. So I get
rewarded
when you get rewarded. A kind of golden rule.
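A tiny sketch of that recursion, where the weight alpha is an
illustrative parameter rather than a fitted value:

    def social_utility(own_reward, other_reward, alpha):
        """Agent A's utility when A cares about B's outcomes.
        alpha > 0 means helping (A is rewarded when B is rewarded);
        alpha < 0 means hindering; alpha = 0 means acting alone."""
        return own_reward + alpha * other_reward

    if __name__ == "__main__":
        print(social_utility(own_reward=1.0, other_reward=5.0, alpha=+0.5))  # helper
        print(social_utility(own_reward=1.0, other_reward=5.0, alpha=-0.5))  # hinderer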
Or maybe the most interesting development of this so far from
our standpoint, for those of you who are familiar with the
literature on theory of mind in developmental psychology,
where things start to become interesting is when kids are
able to make inferences about not only the goals of an agent
but also their beliefs, and how beliefs and desires, which
are both typically unobservable, how they interact,
particularly when we might have trouble understanding some
kind of agent's behavior unless we realize they have a false
belief or that they have maybe a surprisingly true belief.
Maybe their goal is different than what - - are.
This is a good thing to talk about on the Penn campus because
you guys are famous for your food trucks, and this was our
food truck experiment. So this is the last study I'll talk
about. I'll try to give a high level overview of it. The
way the experiment works here is this is like a little toy
domain of a university campus, and we have a hungry grad
student, Harold, who comes out of his office, and he goes
foraging for food. We show subjects in advance that there
are three food trucks in this world, K, L and M, which stand
for Korean, Lebanese and Mexican, all right?
And there happen to be only two parking spaces, so only two
of the three trucks can be present on any one day. Sometimes
only one space is occupied, so there's always one or two
spaces occupied, and I gather some trucks are like this
around here. They're sort of portable. In this world at
least, the trucks can go in any parking space on any day,
kind of first come, first served.
So Harold doesn't know in advance which truck is where, but
he does have line of sight visual access. So he comes out of
his office, and he can see that the Korean truck is there.
What does he do? He walks all the way down here, past the
Korean truck, goes around the corner where now he can see
that the Lebanese truck is on the other side, and he goes
back to the Korean truck.
So the question is which is his favorite truck of the three?
Korean, Lebanese or Mexican? Right, Mexican. You get that,
right? And it's pretty interesting because there's no
Mexican truck present. So if you imagine training some
detector, and some people have suggested this in both machine
vision and maybe as a model for what's going on in the early
stages of infant action understanding, like Amanda Woodward's
classic work on goal-directed reaching. You know, some would
say whatever you're moving toward or reaching toward, that's
what you want. But of course what he's moving toward isn't
what he most wants here--what he most wants isn't present.
You have to do this more mentalistic analysis. That's what
people and the model do here.
And moreover, the model is also making a joint inference
about the agent's initial belief, which is what did he think
was present behind the wall before he started out? It says,
he thought it was pretty likely that the Mexican truck was
there as opposed to the Lebanese truck, or nothing. That
makes sense if you think about it because if he was sure that
there was the Lebanese truck there, he wouldn't have even
bothered to go check. He would've gone straight to the
Korean, right? Or he was sure there was nothing.
So he had to be kind of optimistic. He had to want Mexican
and think it was likely to be there. And again, we can vary
many different features of the task. This is showing a
sample of several of the conditions. And again, we can do
quantitative psychophysics. The inferences here particularly
for beliefs are not quite as good as desires. It's
interesting that your ability to quantitatively judge an
agent's hidden beliefs is still quantitatively consistent
with this sort of ideal agent model. Not perfect, and we
could talk more about that if you're interested.
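A hedged sketch of the joint inference in the food-truck example. The
enumeration below is over pairs of (favorite truck, initial belief
about what is parked behind the wall), and the likelihood is a
placeholder that just asks what route a rational, line-of-sight agent
with that belief and desire would take; the route strings and the noise
level are made up for illustration.

    from itertools import product

    TRUCKS = ("Korean", "Lebanese", "Mexican")
    BELIEFS = TRUCKS + ("nothing",)        # what Harold might think is behind the wall

    def predicted_route(favorite, belief_behind_wall, actually_behind_wall="Lebanese"):
        """What a rational agent does, given that he can see the Korean truck
        from his office and must walk past it to see the far parking spot."""
        if favorite == "Korean" or belief_behind_wall != favorite:
            return "straight to Korean"
        # He hopes his favorite is behind the wall, so he goes to check.
        if actually_behind_wall == favorite:
            return "check far spot and stay"
        return "check far spot, then return to Korean"

    def route_likelihood(observed_route, favorite, belief, eps=0.05):
        """Noisy-rational likelihood: predicted route with probability 1 - eps."""
        return (1.0 - eps) if predicted_route(favorite, belief) == observed_route else eps

    def joint_posterior(observed_route):
        """Uniform prior over (desire, belief); posterior by Bayes' rule."""
        scores = {}
        for favorite, belief in product(TRUCKS, BELIEFS):
            scores[(favorite, belief)] = route_likelihood(observed_route, favorite, belief)
        z = sum(scores.values())
        return {hb: s / z for hb, s in scores.items()}

    if __name__ == "__main__":
        post = joint_posterior("check far spot, then return to Korean")
        best = max(post, key=post.get)
        print("most probable (favorite, initial belief):", best)

On the observed walk-past-and-come-back route, the posterior
concentrates on the hypothesis that his favorite is Mexican and that he
initially believed the Mexican truck was likely to be in the far spot,
which is the joint belief-desire inference described above.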
Then this is just a very dense survey, just to show in the
last few years how much progress has been made in
understanding intuitive psychology as something like this
inverse planning or decision making. There's one part of the
talk, which I didn't get to talk about, but that's okay.
It's mostly wide open, and I'm happy to take questions on
this, which is if this is a view of--or if this is a
reasonable view of common sense knowledge, and how it might
be used to make inferences from sparse data, there's still
this question--maybe the hardest, most interesting question--
of learning: how did you get it? Where does it come from?
Both behaviorally, empirically, and computationally: what kind
of mechanisms could, either in the span of development or
maybe over evolution, build these kinds of probabilistic
programs for physics or for planning--build them into your
head? This is a really, really hard problem.
Basically, we don't know. Somehow, the view has to be that if
common sense knowledge is modeled as probabilistic programs,
then learning is a kind of programming your own mind, right?
Something like a search through a space of programs to come
up with the one that best explains what you've observed.
We've been starting to work on this. I guess I'll show you
one example of how we could study this again with adults.
Although we'd like to do this kind of thing with infants.
This is again work that Tomer Ullman and others are doing.
I'll show--just look at these movies here as a representative
sample. In each one, you see a few objects of different
sorts moving around, and they vary--they sort of follow some
aspects of Newtonian mechanics. They basically follow
inertial motion, F equals ma, but they vary in other things,
which you can hopefully start to see, right? What kind of
things are varying in these movies? Anyone want to venture a
guess? There's mass, there's friction, there's forces that
attract objects to each other or to different parts of the
scene, like gravity type things or winds blowing in one
direction or another. Basically what we've been doing is
showing these kind of videos to people and asking them to
figure out, to make judgments about the relative masses, the
friction of different patches on the surface, what forces are
active between objects and so on. And again, people seem to
be pretty good at this, at doing a kind of very rapid physical
law induction, but they also make some systematic errors,
which we're trying to understand where those come from.
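A sketch of what that "artificial physics learning" inference could
look like computationally. The toy dynamics, the candidate parameter
grid, and the squared-error scoring below are all placeholders, not the
lab's method: propose settings of latent parameters like mass or
friction, simulate trajectories under each, and keep the settings that
best reproduce the observed motion.

    import random

    def simulate(mass, friction, force=1.0, steps=30, noise=0.02):
        """Toy Newtonian dynamics: F = m*a with velocity-proportional friction."""
        x, v = 0.0, 0.0
        traj = []
        for _ in range(steps):
            a = (force - friction * v) / mass
            v += a
            x += v + random.gauss(0.0, noise)
            traj.append(x)
        return traj

    def fit_parameters(observed, masses=(0.5, 1.0, 2.0), frictions=(0.0, 0.2, 0.5)):
        """Grid search over latent physical parameters: score each setting
        by squared error between simulated and observed trajectories."""
        best, best_err = None, float("inf")
        for m in masses:
            for f in frictions:
                sim = simulate(m, f, noise=0.0)
                err = sum((a - b) ** 2 for a, b in zip(sim, observed))
                if err < best_err:
                    best, best_err = (m, f), err
        return best

    if __name__ == "__main__":
        observed = simulate(mass=2.0, friction=0.2)   # pretend this is one of the movies
        print("inferred (mass, friction):", fit_parameters(observed))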
So to wrap up then, I've tried to introduce the beginnings of
this research program on what I call the common sense core:
the roots of common sense knowledge in the understanding of
physical objects, intentional agents, and their interactions.
And I've tried to say how we could approach this
computationally: what formalisms can express this knowledge,
can explain how it's deployed quickly for inference, and how ultimately it
might be learned. And I think we've made a lot of progress,
given that we've only been working on this for a couple of
years. I'm very excited about it, and more generally the
idea of using probabilistic programs to give a new generation
of probabilistic models that capture common sense. But there
are also some huge open problems, which I just started to
hint at: actually doing inference in these models is very
difficult if you have to rely on stochastic simulation.
Those of you coming from statistics or AI and machine
learning know about the challenges there. Learning of course
is very hard, and - - toward the end, and maybe the hardest
problem is how any of these things might be implemented in
the brain. I think that that's mostly an interesting
question to ask rather than a place where we have any kind of
answers, but I'm happy to talk about any of this in the
questions. Thanks.
[Applause]
MR. TENENBAUM:
Go for it.
MALE VOICE: [off-mic]. As I understood it, part of the reason
they did this experiment is they wanted to show - - .
MR. TENENBAUM: Well, this is a really neat study because they did
the experiment after we said, why don't you do this
experiment, and they said, we already did it.
MALE VOICE: So I think that what they were trying to show is that
the inference would show a kind of intelligence where it
wasn't clear what aspect of their experience they'd be
generalizing from. In constructing such an experiment, it's
a struggle because you're always going to be fighting against
the ambiguity of the notion of analogy, right? It's like who
knows what aspect of their experience. But it seems like as
a modeler, you have the same problems as the designers of the
experiment. So when you're--well, how do you--what kind of
strategy can you follow to be sure that you're not building
in too much of the answer into the model?
MR. TENENBAUM: Part of the point of studying this kind of knowledge
is that we think a lot of the answer is built in. Not
necessarily innate, but to model the capacity, we're not
trying to say how does your ability to reason in a
sophisticated way about the physical world emerge from goo or
mush, right? We're trying to actually initially argue that
you have very structured, rich models of the causal processes
in the world.
It turns out you don't actually need a structured model. I
mean, here while I said it's a simple version of that model,
basically the only physics that this model knows is sort of
Spelke object physics, which means it knows that objects
don't wink in and out of existence, and they don't teleport,
but it doesn't know about gravity or inertia. The objects
basically just random-walk around.
They move smoothly in space and time. They don't teleport,
and I think that's important. It's another way to think
about the contribution of this model. It takes the Spelke
object concept and shows what it can do if it's driving
probabilistic inference.
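To be concrete about how little physics that model assumes, here is a
minimal sketch of that kind of object prior; the one-dimensional
positions and the noise scale are illustrative assumptions. Objects
persist and move smoothly, never teleporting, and probabilistic
inference is just scoring observed motion against that prior.

    import math
    import random

    def sample_path(start, steps, step_noise=0.5):
        """A 'Spelke object' prior: the object persists and moves smoothly
        (a small random walk); it never jumps discontinuously."""
        x = start
        path = [x]
        for _ in range(steps):
            x += random.gauss(0.0, step_noise)
            path.append(x)
        return path

    def log_prob_path(path, step_noise=0.5):
        """Log probability of an observed path under the smooth-motion prior.
        Big jumps (teleports) get very low probability."""
        lp = 0.0
        for a, b in zip(path, path[1:]):
            d = b - a
            lp += -0.5 * (d / step_noise) ** 2 - math.log(step_noise * math.sqrt(2 * math.pi))
        return lp

    if __name__ == "__main__":
        smooth = [0.0, 0.3, 0.7, 1.0]
        teleport = [0.0, 0.3, 5.0, 5.2]
        print("smooth motion:", round(log_prob_path(smooth), 2))
        print("teleporting:  ", round(log_prob_path(teleport), 2))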
So the question you're asking is an important one; it gets at
a few issues that are often kind of confounded in some of the
developmental literature, and in the paper I think we worked
hard to try to be clear on these issues because they're hard
ones.
What is the role of experience versus innate structure? That's
what a lot of infant work has seemed to be addressing. And one
way of thinking about that is--how do I even--it's so
complicated. I can think of three different ways to answer
it. We don't have any--so the paper did make a pretty
forceful argument that your ability to do this is not based
on the kind of simple statistical learning abilities that
some other recent infant work has studied in the lab. For
example, the classic work of Saffran and Newport and so on, that
says here we're going to show you a bunch of data in this
experiment and check your ability to do a kind of statistical
inference in this experiment.
We thought what's going on here is a kind of probabilistic
reasoning that's operating on some more abstract knowledge
representation, something like a Spelke object physical
theory. We are agnostic in this work about where that
abstract knowledge representation comes from. - - could be
right. It could be built in innately, or Scott Johnson could
be right. It could be learned and emerge through
development. Those are both consistent with this because
this study was done at 12 months of age for infants where
everybody agrees by then something like a - - concept is very
well entrenched.
So the way I see the question that you're asking is it's
really getting at two of the questions that I was interested
in, but I want to tease them apart. One is do we have
evidence that infants are able to do this kind of
probabilistic, theory-like reasoning, and we think yes, they
are, in the sense that just showing them this kind of funny,
novel situation--kind of like this red/yellow thing I showed,
this blue/red judgment--there's not statistical data in the
actual experiment itself that's sufficient on its own to get
the right answer.
But we think yes, there's information in this which, when you
combine it with your more general idea of how objects might
move in the world, even a very weak Spelke-like notion, then
there is enough statistical information in the experiment,
and that's what our model formalizes: how that works.
But then it's a separate question, how you build those
theories, how you program your brain, how you come up with
those probabilistic programs that capture how objects move,
and we want to study that both empirically and theoretically,
but that's sort of a later stage of the project. Just like
in classic work in linguistics--I think I'm very influenced
by the sort of very general Chomskyan [phonetic] program that
says if you want to study language acquisition, that goes
together with the structure of language. But you need to
have some idea of what language might be like in the actual
adult state before you can study acquisition.
Maybe before--or, the right way to put it is, I see those as
interacting, but at least here, until the kind of work that
we've been doing, I think there have not been very good
models of intuitive physics at a level that could be tested
at all in these kinds of paradigms. So that's at least the
high leverage point where we wanted to enter into this.
But the last movies I showed you, those are partly designed
to be the kind of things you can show to infants, and we're
very interested in how you could do Saffran-Newport, - - type
very quick statistical learning experiments on those kinds of
stimuli and see how much plasticity you could get. It's like
artificial language learning. It's like artificial physics
learning, basically. That's essentially what we're doing
here.
Just like following the sort of Newport-Aslin [phonetic]
generalized research program, we're starting off with this
with adults, and then we do hope to take this to infants.
Hopefully that addresses your set of concerns. Yeah?
MALE VOICE:
[off-mic]
MR. TENENBAUM:
I'll try to give a very long answer.
MALE VOICE: [off-mic] There were, as you remember, two
introductions. One was the introduction of the format of the
platform of the court. They were - - invited to speak. The
second introduction was introducing you. What have you
learned from the first introduction and since you, in three
minutes, became familiar with the book itself? What have you
learned from that experience?
MR. TENENBAUM: You mean what did I learn about Dr. Pinkel and the
Pinkel Lecturers and his thought?
MALE VOICE:
Yes.
MR. TENENBAUM: Well, I mean, I told you what I learned. I
learned that he's an engineer thinking about how the brain
and mind work, and he takes seriously some of the same kind
of questions I do, like the interaction between the brain and
the mind and the analogies between a computer and the brain.
I take it--and I got that by skimming this book. So I feel
you must be asking something incredibly deep there about how
I learned that, and though I'm not sure if this is what
you're getting at, it's certainly--I'd say it's a sort of one
theme of what I talked about here, but it's definitely a
theme of the larger research program I've worked on, is our
ability to make inferences from very sparse data. In this
case, this sparse data is just a couple of shapes moving
around a screen, and you make an inference about objects in
the world, their forced interactions and their goals,
assuming that they're intentional agents. That's a notion of
projecting beyond the observed data.
And I think we do this all the time. I guess when I gathered
a little bit of observed data, which was really just a couple
of section headings and a little bio on the back that he
worked for NASA and the Rand Corporation. So I have some
larger schemas about science and engineering and brains. I
can project a lot of things, which I don't think I need to do
here.
But we all can, right? And that's a very general thing that
our mind does is take some kind of abstract schemas and
interpret the data of our immediate experience informed by
that. You could say that is what our minds do, or if there's
any one thing that our minds do, that's what they do. And
here what I've tried to do is study that in the context of
what I've called this common sense core. This set of
particular abstract schemas that are some of the most
important ones because they emerge extremely early in
development.
I think they underlie the meanings of words in a richer way
than at least most computational models of word learning,
including stuff that I've done and many other people in
psychology and machine learning. I don't know. To sort of
invoke the spirit of - - --she's not here, right? The idea
of verbs and how verbs are different from nouns.
To be able to understand the representations of the concepts
of intentional action, which are encoded in verbs, for
example, there's been a lot of really interesting work on
statistical learning of words. But until we can do the
statistics over something like the kind of representations of
intentional action I'm talking about here, we're not going to
have good statistical word learning models, I think. So my
focus has been not on the very general problem of schemas--
abstract schemas for making inferences from sparse data--but
on the particular ones that I think are at the heart of
human intelligence. Hopefully--is that what you were getting
at, or do you want something more? Do you think these aren't
the right ones? Hopefully. Okay. Hopefully.
I think it's not that often in a field like--this isn't
physics. It's not that often in our field that you get so many
lines of evidence converging on the same thing. You look at
the most successful accounts of lexical semantics, at least
the most compelling ones, that seem like they're saying
something about the aspects of word meaning that might
explain the composition of language? Or what is the infant
research telling us? Or what are people in visual scene
understanding, where are they going? What's getting the most
traction in getting computers to look at images and movies
and get out the things that are of value in a real
application of a car that has to drive and not hit people.
And so you have different areas of our science and engineering
all kind of converging on this idea. I come at this as someone
with a certain mathematical, computational toolkit, very much
influenced by probabilistic graphical models, various kinds
of - - analyses, and I see those things are great as far as
what they cover, but they don't have the representational
capacity to get at these kinds of common sense core notions
that all these other areas of our science and engineering are
telling us we need.
So I think there's great value in trying to make that bridge,
and that's what I'm trying to do here. Yeah?
MALE VOICE: [off-mic] --is not always common to everyone, to
every society. So have you ever thought about or attempted
to address issues of how common are these intuitions?
MR. TENENBAUM: So I certainly would like to. I've only been
doing this research for--the physics stuff is only about a
year old. The intuitive psychology is only a few years old.
It's one-and-a-half graduate students worth of work. Sorry,
- - . Yeah, that's right.
And so we haven't done these experiments cross culturally,
for example, but yes, we would very much like to. And I
think there is--does anyone know, I think there's been some
sort of informal cross-cultural work on things like the Heider
and Simmel display. I think, for example, not everybody sees
those two triangles and the circle immediately as
intentional.
Some people will look at that and describe them as there's
two triangles, or there's a triangle like this. The way that
work is often described is to say, look, you know, you can't
resist looking at these shapes and seeing them as intentional
agents. That might be very cultural or context specific.
That's not really my point. My point is, look, if you look
at these as intentional agents, which is kind of compelling,
I think, then you see all this stuff. You see the forces,
the interactions, the goals, the beliefs, the fears and
aspirations and so on.
To me, that tells me something about the basic
representations of the world. They're so basic and
so quickly deployed they can even work on just triangles and
circles moving around in the two dimensional plane.
Certainly one of the things we'd like to do going forward--
and I've been talking a little bit with some researchers who
work with various kinds of relatively isolated and indigenous
tribes in the Amazon--is actually trying to test these things
with people who come from very different cultural
backgrounds.
[END 90840518.mp3]