UNIVERSITY OF PENNSYLVANIA
INSTITUTE FOR RESEARCH IN COGNITIVE SCIENCE
14th Annual Pinkel Endowed Lecture

Modeling Common-Sense Reasoning with Probabilistic Programs
Friday, March 23, 2012

Ubiqus/Nation-Wide Reporting & Convention Coverage
22 Cortlandt Street, Suite 802 - New York, NY 10007
Phone: 212-227-7440 / 800-221-7242  Fax: 212-227-7524

[START 90840518.mp3]

MR. JOHN TRUESWELL: I'm John Trueswell. Welcome to the 14th Annual Benjamin and Ann Pinkel Endowed Lecture. The Benjamin and Ann Pinkel Endowed Lecture Fund was established through a generous gift from Sheila Pinkel on behalf of the estate of her parents, Benjamin and Ann Pinkel, and serves as a memorial tribute to their lives. Benjamin Pinkel received a bachelor's degree in electrical engineering from the University of Pennsylvania in 1930, was actively interested in the philosophy of mind, and published a monograph on the subject, which I have here: Consciousness, Matter and Energy: The Emergence of Mind in Nature, published in '92, the objective of which is a reexamination of the mind/body problem in light of new scientific information. This lecture series is intended to advance the discussion and rigorous study of the sorts of questions which engaged Dr. Pinkel's investigations. Indeed, the lecture fund has brought many esteemed scientists to Penn to speak on some of the most important topics that populate the modern study of the human mind. This year's speaker, Josh Tenenbaum, is no exception, and is a welcome addition to that group. Here today to introduce Josh is the director for the Institute for Research in Cognitive Science, David Brainard. And you get a copy of the book.

MR. DAVID BRAINARD: Thanks, John. I'm very pleased to welcome and introduce Josh Tenenbaum, this year's Pinkel Endowed Lecture speaker. Josh received his PhD in Brain and Cognitive Sciences at MIT, and then after a short postdoctoral fellowship, took on an assistant professorship in the psychology department at Stanford. After a few years, he got tired of the West Coast and came back to MIT, where he remains and is now Associate Professor of Cognitive Science and Computation. A perusal of Josh's CV will tell you that he's won more awards than I even knew existed, many of them best-paper awards at X over the years. But in particular, I want to mention two. One is the Early Investigator Award of the Society of Experimental Psychologists, where he was also named a fellow in 2007, and the other is the National Academy of Sciences Troland Award, another very prestigious early career award.

Josh's work focuses, consistent with the theme of the Pinkel Lecture, on an absolutely central question in human cognition, namely how it is that we learn to generalize effective concepts that work well for us in the world from a small number of exemplars: the problem of inductive reasoning in learning. As Josh notes on his web page, after seeing only three or four examples of a horse, even a two-year-old will confidently and mostly correctly judge whether any new entity is a horse or not. So one generalizes from a small number of examples, or, as he puts it in a paper, the question is how does the mind get so much from so little. Being able to learn inductively in this way isn't something that you can just solve logically.
That is to say, a small number of examples underconstrains the generalizations you might make. So something else needs to be brought into the problem; there's no such thing as a free lunch. The thing that Josh's work has emphasized that could be brought to the problem, and can be effectively brought to the problem, is taking advantage of statistical regularities in the natural environment where we operate and the social world where we function. In that regard, he's been really the leader and pioneer of using - - methods, which provide a natural language and machinery for expressing the use of statistical regularities and bringing this information into the realm of cognition. These ideas have been applied previously in what I would call lower-level problems in human information processing. Josh has really led the way, both in terms of actually developing effective statistical algorithms--he has a foot in the world of machine learning, and I think a slightly larger foot in the world of psychology, where he designs and conducts elegant experiments, always informative, that speak to the formal models in effective ways. So I always learn when I read Josh's papers, and I'm really looking forward to his talk today. So please join me in welcoming him, and I look forward to his remarks.

[Applause]

MR. JOSH TENENBAUM: Can you hear me okay? Thank you very much. It's a great honor to be here. I don't know that much about - - except what I was skimming during those very generous introductions. But it's actually, I think, a very fitting match in the sense that he's an engineer thinking about the brain - - . And if there's one thing to summarize what essentially is the work that I'm trying to do, it is to think about how the brain and the mind work from the perspective of an engineer. This is - - . I like to think of it as reverse engineering of how the mind works, which doesn't just mean--the purpose of doing these different things - - to me--

[Break in audio]

MR. TENENBAUM: On now? Great, sorry about that. It's been a great success story of many of these fields, viewing intelligence as in some sense statistics on a grand scale. I want to say a little bit about the success of that parallel and mostly talk about what it leaves out, because that's really what I'm most interested in right here. I think some of the basic challenges are--not to say this is wrong, but what do we need to think about besides statistics?

So this is a view of mind, brain, intelligence. It has data coming in through sensors, like high dimensional spaces of retinal activation or auditory, olfactory sensoria. And what the mind essentially does is find patterns in that data, various kinds of structure: clusters, correlations, low dimensional manifolds, linear, nonlinear and so on. We can describe the math of finding structure in data in very elegant formulations. Hopefully many of you have seen these sorts of equations, so just treat this as evocative here. If you know, for example, about Hebbian learning or delta-rule backpropagation learning, we think about that as optimizing some kind of error function, as having rules for finding the best setting of a high dimensional parameter system, like say the weights in an artificial neural network, to best find structure in data given the inputs.
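To make that concrete, here is a minimal sketch of the kind of error-minimizing rule being alluded to--a delta-rule update for a single linear unit. It is only an illustration; the data, learning rate, and variable names are made up for the example.

```python
import numpy as np

# Toy delta-rule learner: nudge the weights w to reduce the squared error
# between the unit's output and a target, for each input pattern x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 input patterns, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # the "structure in the data" to be found
y = X @ true_w + 0.1 * rng.normal(size=100)    # targets with a little noise

w = np.zeros(5)                                # initial weights
lr = 0.01                                      # learning rate
for epoch in range(100):
    for x_i, y_i in zip(X, y):
        error = y_i - w @ x_i                  # prediction error
        w += lr * error * x_i                  # delta rule: follow the error gradient down

print(np.round(w, 2))                          # ends up close to true_w
```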
And these learning algorithms, which are mechanisms for finding structure in data, have remarkably been shown to have some correspondence to the actual mechanisms of adaptive plasticity in brains, what goes on when synapses between real neurons change their weights. There are even established causal links between phenomena like LTP, long-term potentiation, which to some extent follows the math of Hebbian learning, and actual behavior in humans and many other mammals.

The same kind of theoretical computational paradigm has been incredibly useful on the engineering side. There are many applications, some of which we take for granted, some of which you will soon take for granted--you just don't know it yet. For example, pedestrian detection systems. We're used to face detection in our digital cameras and online photo sharing sites. Increasingly over the next few years, you're going to see things like this system here, which is a system developed by Mobileye, a start-up company which is a rather large company at this point, and it's being commercialized by most of the major auto manufacturers. It can detect other cars and pedestrians in ways that are reliable enough that they sort of pass the lawyer test: the lawyers for that company and Volvo and Audi and BMW and increasingly other brands will let you put those in the car and hook it up directly to the brake pedal to basically stop the car if you're going to hit something. It's more reliable than you at that task.

Similarly, Google is more powerful than any human at one particular task, searching for useful documents or information to match text queries. There's been a lot of attention to other kinds of AI technologies recently, like IBM's Watson Jeopardy-playing system that beat some of the world champions in Jeopardy. If you have the new iPhone with Siri, that's sort of a voice - - language interface. Most recently, did people read about Dr. Fill, the new expert-level crossword-playing machine?

So we have these what I would call AI technologies, technologies that achieve human-level, even expert-level performance in particular domains, and they do it by using basically the math of statistical machine learning and some other technologies. But yet none of them are really intelligent. None of them are. They are AI technologies, and they would impress any of the early founders of AI, but they don't have anything like what we would think of as intelligence, or even the common sense of a human child. This is what I want to push on. One way to illustrate the lack of common sense that you get from statistical patterns: here's a little demo from Google.

It's easy to pick on Google because it's so ubiquitous and so successful. Don't get me wrong, this is no sign of disrespect. In particular, though, what I'm going to point to is something about Google's spelling correction system. This is a very cool thing. We take it for granted that we can type in a query like this one up here. I won't read it because I can't really read it, but you can look at that and see that what I meant to type was probably what Google figures out. How does Google's spelling correction work? Remarkably, it corrects those errors and returns a useful document. But consider a more disrupted string of characters, which is this one. What does this say? It's not so easy.
How about this? Okay. Right, so you might be familiar with these kinds of things; sometimes they go around on e-mail or Facebook. The only difference between these two slides, this one and this one, is that I rearranged the words. Here they are in the correct order--so each word is a permutation of the letters of a real English word, but here they follow the order of words in an actual English sentence that makes sense. You can read this sentence even though every word is misspelled. Here I've mixed up the order of the words, and it's very, very hard to read, right? Google doesn't know what to make of either of these things, in particular the one that you have no trouble reading, even those of you who aren't native English speakers. Google completely chokes on it.

What's the difference? Google's spelling correction works by finding frequently occurring clusters of letters and saying, that's a word, and trying to look for the most likely interpretation of the actual strings that you typed in, in terms of the words, those patterns that it knows. But it doesn't understand the more basic point, that sentences in language actually express propositions or thoughts. Words express concepts, and a proposition or a thought links those concepts together into this larger unit. It's about something. It's about a world. It's about a world of people who have goals, who have beliefs, abilities and so on. That's the kind of common sense thing we're looking for.

Look at the IBM Watson Jeopardy-playing system. Very impressive again. It's easy to pick on because it's so good. But look at what IBM had to do. They invested a great deal of financial and human capital to build this system, and they trained it on basically all the world's data, and instantiated it in a basically massive supercomputer cluster, and it works really well for that. But it can't do anything else. It can't explain why it works. If you change the rules a little bit, it can't play chess--that's a different IBM system. It can't do crossword puzzles. It can't give a lecture. It can't do anything but that.

But contrast this with your mind, or anyone's. Imagine the first time you played Jeopardy: someone could explain the rules to you, or you could basically just figure them out by watching the game. It's a slightly funny way of asking questions. Without any training or massive engineering, you achieve competence, and it's that ability to achieve competence in an endless number of everyday tasks that is part of what we mean by intelligence. How do we do that? How do we get the kind of common sense that supports that into a machine?

There's some, again, very interesting work that's easy to pick on because it's famous and good. In AI and machine learning, there are various ways that people have been trying recently to get common sense into machines, often using language as an input representation or as a data source. I think part of the motivation for this is that with the web, we have massive amounts of text, which expresses in some direct or indirect form a lot of common sense, and also, using the web, you can get people to type in things. So this is data here from a project called the Learner project by a researcher named Chklovski [phonetic], and he got people and then computers to sort of extend and generalize millions and millions of common-sense facts, propositions like: abalones can be steamed. That's true. They can be squashed.
They can be used for food. They can be put somewhere. They can be fried. These are all facts that we all basically recognize as true. A ballpoint pen can be made in Singapore. A balloon can break. This is a rather small sample of the effectively infinite number of common sense facts. One approach is to just try to grow and grow and grow this list and have a system that can reason over that, and maybe that can be common sense. Or Tom Mitchell over here and his team at CMU have this NELL system. This is a figure from the New York Times where it's basically doing something similar, trying to automatically read the web and the newspaper and grow out a propositional network of knowledge.

But systems like IBM's Watson system or this one, while they do incredible things, also make basic obvious mistakes that no human child would make. So in this article in the New York Times, there's this anecdote about how the NELL system learned about internet cookies. Anybody remember this? So it had learned a category of desserts, basically: words like cake, cookie, pudding, whatever, ice cream, things you could eat that were sweet, that were delicious, looking at co-occurrences in text. Then it learned about maybe different kinds of cookies--chocolate chip cookies, molasses cookies, sugar cookies--and then it saw this new thing called internet cookies. Okay. Maybe that's a kind of cookie. That's a hypothesis. Then you start to see the various other words that go along with that, like you can delete your internet cookies, so maybe that's something you can do to desserts. So then maybe files--I can delete my files--so then files and e-mail addresses and address books, all of those things became hypothetical desserts.

With enough data, it will correct that error. But my point here is that while this strategy of distributional learning is an important one for how children might form categories of words and concepts--and people here have worked on this in important ways--it's constrained by some common sense ontology of the world that says there's a basic difference between physical objects and abstract concepts or information. No human child would ever mistake an internet cookie for a cookie or a dessert, or an e-mail address, and so on.

So this is hopefully starting to get at what I'm trying to get at here. I want to understand from an engineering point of view what you could call this sort of core of common sense. It's a set of ideas that I've come to, that the field certainly has come to--and it's relatively new to my work to be interested in this--but it's basically paying attention to what colleagues have been saying from many different areas. Infant cognitive development, older children, people in lexical semantics, people in visual - - understanding, all seem to be converging on this idea over the last couple of decades, and it's in itself a common sense idea: that our understanding of the world comes to us not through high dimensional vectors of data, even if at the raw sensory level that's a way to think of it, but that cognition builds on a basic structure of objects--physical objects and intentional agents, goal-directed actors--and their interactions. That's what I want to understand.
How can we get these common sense abstract systems of knowledge about physical objects and intentional agents, what are sometimes called intuitive theories, although that word means different things to different people--theory-like systems of abstract knowledge about the physical and social or psychological world? How do we describe those in computational terms? How are they used by your brain in often very automatic, implicit, unconscious ways, including how they are used and developed in the brains of young children, even infants? That's what we want to understand.

To motivate the kind of approach we're taking, I want to have a little digression here into vision, because unlike, say, what a lot of the current work in machine learning has been doing, I think the route into common sense can be motivated from language, but is not best accessed through language. The work in infant developmental psychology by people like Liz Spelke, Renée Baillargeon, and many others who have in particular studied infants' development of the concept of physical objects shows that even before children are moving around very much, or speaking or understanding very much language--at four months, maybe even three months, maybe even younger; Liz Spelke would argue, very famously, of course, innately--even at that age, infants already grasp basic concepts of physical objects. I think some compelling work in linguistics shows these might be exactly some of the basic semantic representations that language then builds on.

So let's think about vision, scenes and objects. The state of the art in computer vision scene understanding is represented by some of the figures I showed before. Here's another very nice piece of work by some of my colleagues at MIT for understanding scenes, where basically what people are trying to do these days, the main state of the art, is put boxes around parts of images that correspond to some semantically meaningful object in the world, like a face or a person or a sofa or a piece of road, okay? And you can do this by getting a lot of labeled data, automatically labeled or semi-automatically, or hand-labeled by people, somehow putting boxes around - - , this is a road, this is a road, this is a road, and then learning a road detector. The idea is to learn a whole bunch of detectors with these basic-level word category labels like road, person, sofa, and that's scene understanding.

But think about real scene understanding from a common sense point of view. What kind of person detector would detect all the people here? Anyone want to be a volunteer person detector? Well, in a slightly smaller audience, I would literally ask for a volunteer, but since we don't have all day and the room is big, I'll do it. If I say point to each of the people in this scene here, you see all these bikers, but of course we know that basically there are people--every bicycle helmet back there is a person. We know that. Whereas no machine vision person detector, like the Mobileye thing that's in your new Volvo, is going to detect that. Similarly here, every black parallelogram is a person. Every blue one here is, even ones that barely show up as just a few pixels. But interestingly here, while there are two of those same kind of shaped things, there are no people in that scene, unless it was some kind of weird Beckett play where the people had been buried in the lawn.
Or take other sorts of examples of the kind of scene understanding I'd like to achieve with common sense. Look at this scene here. We don't just see a bunch of wooden planks, and we don't just see a house in its outline or frame form, but we can analyze certain physical stability things. Like, we can say this looks like, yeah, it's basically standing up. You can make guesses about which planks are structurally important: if I remove this, the thing might fall over, but if I remove this one, nothing would. You can figure out what's attached. These things here are probably freely floating--you could pick them up--whereas many of these other things up here, you couldn't just go up and pick up. You'd have to pry them off.

How do we look at these kinds of scenes of social interaction and infer what people are thinking, feeling, wanting from others? Or, you know, take a very practical application, again, of street scene understanding. We don't just want to be able to detect pedestrians and analyze where they are and how close I am and when I have to step on the brakes, but we want to understand their motion in terms of what's actually driving it--their goals. You'd like a car with an automated driving system, just as a human driver who's paying attention is able to do, to see people moving around and infer where they're going, why they're going there, when they might be likely to stop or start. That's the way we look at scenes like this. And if you're paying attention as a driver and visibility is good, you're very good at not only not hitting the pedestrians, but anticipating where they're going, stopping, starting, making eye contact, using that to really be a sophisticated interactor with that scene. The current generation of robotic car systems don't do that. But it's acknowledged that they would need to do that. I'm saying that in order to do that, you need to understand these things as objects in the world which are moving because of their intentional states.

Maybe the most concrete motivating application for me, although it's again basically a laboratory setting, are these kinds of animations. Many people are probably familiar with the classic Heider and Simmel animation on the right. I'll show this one first, on the left, by developmental psychologists - - and Csibra. I'll show this a couple of times while I talk over it. This is a short excerpt from a stimulus in an experiment done on 13-month-old infants, where the infants see this, and just like you, they've shown that the infants don't simply perceive this as a blue sphere and a red sphere translating or rolling on a green plane around some yellow-tan blocks, but as something kind of intentional, right? So how would we describe this scene? How would you describe it? Something like: the blue one is chasing the red one. The red one is fleeing. The blue one's chasing is a slightly dumb chase, because he thinks he can fit through those holes that he can't. We perceive that, if you pay attention and look carefully.

Or take, for example, the classic Heider and Simmel one here. Again, you could see this as two triangles and a circle moving, but instead we see them as objects which are agents. We see the forces as one hits into the other. We see the big one looks like he was kind of scaring and bullying that one off. The little circle is hiding in the corner until the big one goes after him.
A little nervous laughter; cue the scary music here. If you haven't seen this before, don't worry, it ends happily--at least for the little circle. So what is it that you see in any of these stimuli? As David said, I'm very interested in how the mind gets so much from so little. Think about the data here as input. It's basically these low dimensional time series. It takes just a few numbers to describe over time how these shapes are moving in the image sequence, but you don't see it that way. You see it as objects with physical interactions, and intentional agents with goals--basically simple kinds of mental states: goals, beliefs, false beliefs, even emotions, moral status. So how does that work? If I could just get a model, in good reverse engineering terms, that could understand this video, I would be very satisfied, and I think it would actually be the same thing I'd need to drive, at its heart, this kind of technology on the applied side.

So the questions that I want to address in this talk are just the beginnings of this enterprise. I think I could spend a lot of time on this, and many others could, too. If you're interested and want to work on this, I would be very happy to talk with you about all these issues. So we want to start looking at what kinds of computational ideas could answer these questions: the questions of the form of the knowledge, how it's used and how it's learned.

And one starting point, which I think is familiar to maybe many people, certainly coming from computer science, but increasingly in psychology and neuroscience, is a language for building models of causal processes in the world that's known as Bayesian networks. They're a kind of graphical model. Just curious here, how many people know what a Bayesian network is? Okay. How many people have never seen a Bayesian network? This is what you would expect. Judea Pearl, who is the computer scientist most associated with developing this, just won the Turing Award, the award for lifetime achievement in computer science, and it's a fitting acknowledgement of that, that pretty much you can go to any sophisticated audience that's interested in intelligence, and most of them have heard of or seen this idea. It's certainly influential in all of the work that I do.

The basic idea of one of these models is to start off with a graph, a directed graph with nodes and arrows, to represent basically the causal processes that are operating in the world that give rise to the data we see. Then perception, reasoning and decision involve observing some of these nodes, which correspond to certain aspects of the world, the more observable ones; making inferences about the variables out there that you can't directly observe; and then making predictions or decisions about the ones that provide key value to you. So a classic example is this kind of medical diagnosis network, where you'd have two levels of nodes here representing the presence or absence of diseases and symptoms in a hypothetical patient, and the arrows correspond to which diseases cause which symptoms. The graph represents basically what causes what. Then you can put probabilities on the graph.
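For readers following along, here is a minimal sketch of a two-disease, two-symptom network of the kind being described, with made-up probabilities; inference over the hidden diseases is done by brute-force enumeration. The numbers and the dependence structure are purely illustrative.

```python
import itertools

# Hypothetical priors on two diseases (made-up numbers).
p_flu, p_cold = 0.05, 0.20

def p_fever(flu, cold):
    # Conditional probability of the symptom given the states of its parent diseases.
    return {(0, 0): 0.01, (0, 1): 0.30, (1, 0): 0.90, (1, 1): 0.95}[(flu, cold)]

def p_cough(flu, cold):
    return {(0, 0): 0.05, (0, 1): 0.70, (1, 0): 0.60, (1, 1): 0.90}[(flu, cold)]

def joint(flu, cold, fever, cough):
    # Product of each node's probability conditioned on its parents: that is the Bayes net.
    pf = p_flu if flu else 1 - p_flu
    pc = p_cold if cold else 1 - p_cold
    s1 = p_fever(flu, cold) if fever else 1 - p_fever(flu, cold)
    s2 = p_cough(flu, cold) if cough else 1 - p_cough(flu, cold)
    return pf * pc * s1 * s2

# Bayesian inference: observe symptoms, work backwards against the causal arrows.
evidence = dict(fever=1, cough=0)
num = sum(joint(1, c, **evidence) for c in (0, 1))
den = sum(joint(f, c, **evidence) for f, c in itertools.product((0, 1), repeat=2))
print(f"P(flu | fever, no cough) = {num / den:.3f}")
```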
This is an example here where the basic knowledge about the world that's going to be useful for making inferences from incomplete observations basically says: for each node, you look at its parents in the graph, and you say, what's the conditional probability of that node's variable--let's say the symptom; it's just binary, present or absent--taking on some value conditioned on the state of its parents? It's a very lightweight notion of causality, right? All it says is what causes what, and you put some numbers on it to capture roughly the statistical contingencies that go with that causal pattern. And these models have been very, very valuable on the engineering side and the scientific side, but they're not enough. If there's one technical message to take home from this talk, it's that we need a richer language.

I'm going to skip this slide, because if you've seen Bayesian networks, you're probably familiar with the basics of Bayesian inference, which is the process of observing, for example, some of the symptoms, and then working backwards, going against the flow of causality, to infer what causes best explain the effects. That's all you really need to know about Bayesian inference for the purpose of this talk: that you can do this, and there are principled ways of formalizing it and effective algorithms for doing it. But what I want to emphasize is the way that we're going to represent our causal knowledge--that we need to go beyond graphs, basically, to programs.

So there's a new idea that is gaining currency in various quarters. Some people in our group have worked on this. It's this idea of a probabilistic program, and sometimes you'll see the language "probabilistic programming," but that's a little bit more ambiguous. Think of a probabilistic program as a very fancy kind of graphical model, one that, unlike graphical models, doesn't just look like the basic tools of statistics. When you actually go look at the meat of graphical models or Bayesian networks, what you see are distributions--exponential family distributions, - - , binomials, multinomials, standard things that you'll learn about in any stats class--for describing the basic representational features, these nodes that are random variables, and then standard concepts, again ones we're familiar with from statistics and engineering, for representing relations between variables, like linear models or nonlinear things: sigmoid nonlinearities, additive interactions, those kinds of things. This toolkit of sort of linear Gaussian, or linear-plus-plus, Gaussian-plus-plus, that is the language of graphical models.

But what I, and many others, think is that in order to describe something like common sense knowledge for AI or cognition, we need to marry probability theory not only to the language of graphs, but to the full computer scientist's toolkit. It's the same way that you could use a graph to represent the control flow in a program: we often make these little flow charts, but that isn't the program. It often fails to capture much of the richness of what's actually going on in the program, what you actually need it to do to work for you. The idea here is to use programs to describe causal processes in a much more sophisticated, fine-grained, powerful way. These are going to be what we call stochastic programs. They're programs that flip coins or roll dice to, again, capture things that we're uncertain about.
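As a minimal illustration--written in plain Python rather than in a dedicated probabilistic programming language such as Church--here is what a stochastic program looks like: a recursive procedure that flips coins, so that each run is a sample from the distribution the program implicitly defines. The particular program is generic, not code from the work described in the talk.

```python
import random

def flip(p=0.5):
    """A coin flip: the elementary random choice in a stochastic program."""
    return random.random() < p

def random_stack(p_stop=0.3):
    """Recursively generate a stack of blocks of random height:
    keep adding blocks until a biased coin says stop."""
    if flip(p_stop):
        return []                                  # stop: no more blocks
    block = {"width": random.uniform(0.5, 2.0)}    # a randomly sized block
    return [block] + random_stack(p_stop)          # recursion builds an unbounded structure

# Running the program repeatedly draws samples from the distribution it defines;
# probabilistic inference amounts to conditioning such runs on observed data.
print([len(random_stack()) for _ in range(5)])     # geometrically distributed stack heights
```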
And when I use the language of probabilistic programming--to distinguish it from a probabilistic program, which is a certain kind of knowledge representation--there's also the idea, I think, that's important: that the full computer scientist's toolkit for describing the world, which includes data structures of much more flexible sorts than we're used to seeing in Bayesian networks or graphical models, is going to be important for understanding common sense. Data structures which deal very naturally with recursion, for example.

So I want to mostly introduce this idea in the context of a certain kind of intuitive physics. I've shown many examples of this already, but it's an understanding of the mechanics of how objects interact, from visual data. I'll say a little bit, depending on time, about these ideas in more of an intuitive psychology context--how we understand the actions of intentional agents. As for the particular kinds of judgments we're going to be looking at, most of what I'm going to show you are basically behavioral experiments and computational models that are not so much AI or engineering applications, but hopefully you can start to see where those would build on these ideas.

We've done a lot of work recently focusing on the phenomenon of stability. What makes--for example, you look at these two tables. The one on the left looks more stable. You wouldn't mind leaving your laptop on it or leaning against it, whereas you wouldn't want to do that to the one on the right. Or consider these wooden blocks up here. You can just look at it, and right away, without much effort at all, your eye is naturally drawn to the points of instability, the accidents waiting to happen. For example, up here, that's about to fall over. If I said point to the points of instability, pretty much all of us would point to the same things.

There are games that engage this behavior, like the game of Jenga. How many people know the game of Jenga? Cool. We're going to basically be doing experiments on things like that game. Your goal is maybe to take a block out without knocking over the stack, but in such a way that will leave it as precarious as possible for your opponent. How do you do that reasoning? Here's a couple of slides in the middle here that I took from my own version of Jenga at the last conference I was at. We can see this looks like--this is not a good place to let go of the coffee cup, and this is what happened right after that.

Over here on the left are actually stability illusions. There's this art of rock stacking, I guess, or I guess that's what it's called, which is practiced on beaches mostly on the West Coast. There are probably some New England rock stackers who will take either manmade blocks, bricks, or typically the rocks you find at the beach, and stack them up into arrangements like this, which are physically stable in the sense that they're in static equilibrium. They may not look like it. That's the point. They're illusions. They're visual illusions. They seem so precarious that they must actually be in motion, but the whole point is to arrange these in some kind of counterintuitive way. One of the things we'd like to do in a lot of - - vision is actually explain what's going on here.
Why can these things be stable when they don't look stable, as well as how does our system mostly get it right? I would say our ability to analyze stability is quite powerful. So here's how you'd build a probabilistic program for this, and to start off with, it's going to look a lot like a Bayesian network or graphical model. Again, that's just a nice way to sketch how it's describing what's going on in the world that gives rise to this data, and it captures the aspects of the world that we want to be reasoning about here. You can think of this the same way people often talk about vision as a kind of inverse graphics. This is a kind of inverse graphics model, but I put Pixar in there to emphasize the dynamic dimension.

So we say we observe an image, which is the product of some kind of underlying world state, some scene--the world state being some three dimensional configuration of objects in the world, the image being some two dimensional set of pixels or photoreceptor activities. There's some kind of causal process that gives rise to images from world states, which you could call graphics or projection. It's how the 3D world gives rise to the image. But the causal picture is more interesting than that, because you have time. The world state evolves in time, and you have something like physics, which describes the processes going on in that direction. And our task here is to observe maybe one image, or maybe a sequence of images, and infer something about the underlying world state. That might be the perception or the reasoning, and maybe also, if we wanted to do learning, we might want to learn the physics or the graphics from a bunch of examples of this process.

Again, if you're familiar with graphical models, you'd say this is of course a very natural graphical model. It's a hidden Markov model. That's a very standard kind of model. But the point here is that, yes, you can draw this in this way, but--again, if you know about hidden Markov models--try representing the physics of the world in those scenes as an HMM, and the graphics as an HMM where you've got discrete output symbols and discrete hidden states, and you make a big matrix that represents the transitions and the outputs: we'd be talking about infinite-by-infinite matrices, and you still wouldn't be capturing things at the right grain.

So a probabilistic program--you can think of it as a graphical model but with thick nodes and thick arrows, and maybe some amount of recursion. What we really care about is the stuff that's inside those nodes and arrows. The interesting thing here isn't that the world state at, say, time T gives rise to the world state at time T plus one, but how physics works. Think about Newtonian mechanics or any other kind of mechanics: what's interesting, the real content of the theory, isn't the arrows but the equations. So what are the right equations that describe how the world evolves causally, and the right equations that describe how world states give rise to images?

Let me get even more concrete with a set of images from one of our experiments. These are--it's a little bit dark, but hopefully you can get the basic idea. It looks a lot brighter on my screen. Is it possible--I don't want to mess with your light balance by saying turn down the lights. You'll get the idea.
If you can see different configurations here, that's good. So we showed people these sets of blocks, in this case ten blocks that are basically Jenga blocks, but colored to make it easy to pick them out individually, and some of these configurations look very stable. This is relatively stable. Here's one that looks pretty unstable. Everyone agree? Pretty unstable. This one's also unstable, or stable? Yeah, pretty unstable. Here are some kind of intermediate cases. So people make judgments on a scale of 1 to 7 of how stable this is, or how likely the tower is to fall. You can ask the question in different ways; it doesn't make that much difference, because we all know what we're talking about. This one is very unstable. This one is pretty stable.

And then the way we're modeling this is we're saying you observe an image at some time, T, and then you have some kind of probabilistic approximation to both of these processes, the graphics process and the physics process. Then you do a Bayesian inference to infer the likely variables that would've given rise to what you observed. So in this particular case, what world state, under a graphics rendering function, is most likely to have given rise to that image? Maybe something like that. So the model that represents this is something like a CAD model. This is a set of 3D blocks, and they're sitting on top of each other in some way. Because there's uncertainty--maybe you only look at the image briefly, maybe you don't know exactly how graphics works--your ability to recover the correct 3D position of the objects is limited. This might just be one sample from the Bayesian posterior, one sample from your set of best guesses here. Here's another one.

So I guess I slightly went out of sequence, but given one of these samples, you can then say, well, how is this going to evolve forward in time? There we have a probabilistic approximation to certain aspects of classical mechanics, which basically allows us to run a simulation and see: for this stack, it's going to kind of maybe fall like that. For this stack, it maybe falls like that. And we're going to make inferences effectively by doing inference and simulation.

Another way to sum this up is that we're positing that, in some sense, you have a video game engine in your head. How many people are familiar with these physics video games? Angry Birds of course is very famous, but there are ones which are a lot more interesting, like stacking blocks or Cut the Rope, these kinds of things. They're endlessly fun for people who know them, and what makes them work is that basically there's some kind of rapid approximate simulator for the physical world. It has parameters that allow you to stretch beyond things that are natural physics--if you ever played JellyCar, you have this car that can get all mushed up. So sometimes these games have their own internal physics, which you have to learn, and there's some kind of graphics engine which simulates this thing forward based on your interaction and renders pictures pretty quickly. We're suggesting you have something like a causal model in your head that behaves like that, and you can put probabilities on the parts of the model that you don't directly observe, to make inferences about those from the parts that you do observe.
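Here is a minimal sketch, under loose assumptions, of the inference-plus-simulation scheme just described: draw noisy hypotheses about the 3D block positions (the state-uncertainty parameter sigma discussed below), jitter them with small unobserved forces, run an approximate physics rollout, and average a predicate such as the proportion of blocks that fall. The simulate function here is a trivial stand-in for the rigid-body game engine used in the actual work, and every name and number is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scene(observed_xyz, sigma):
    """One sample from the approximate posterior over block positions:
    the observed configuration plus Gaussian state uncertainty of scale sigma."""
    return observed_xyz + rng.normal(scale=sigma, size=observed_xyz.shape)

def simulate(xyz, latent_force_sd=0.0):
    """Stand-in for an approximate rigid-body rollout: perturb the scene with small
    unobserved forces and report which blocks end up displaced ('fallen').
    A real model would call a physics engine here."""
    perturbed = xyz + rng.normal(scale=latent_force_sd, size=xyz.shape)
    instability = np.abs(perturbed[:, 0]) + 0.1 * perturbed[:, 2]   # toy criterion
    return instability > 1.0                     # which blocks fall

def prob_fall(observed_xyz, sigma=0.05, latent_force_sd=0.0, n_samples=200):
    """Monte Carlo estimate of a predicate: expected proportion of blocks that fall."""
    return float(np.mean([
        simulate(sample_scene(observed_xyz, sigma), latent_force_sd).mean()
        for _ in range(n_samples)
    ]))

# A hypothetical 10-block tower: (x, y, z) block centers, Jenga-style.
tower = np.column_stack([rng.uniform(-0.8, 0.8, 10), np.zeros(10), 0.3 * np.arange(10)])
print(prob_fall(tower))                          # graded stability judgment in [0, 1]
```

Other queries (which way the blocks fall, how far, what happens if the table is bumped) would simply be different predicates computed on the same simulated outcomes.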
On this particular model here, we're trying to do fairly quantitative--maybe psychophysics, for those of you who actually do psychophysics, is a little bit the wrong word, but I should say the postdoc who's been leading this project is Peter Battaglia, and he definitely comes from a strong psychophysics background, and our intention is to really make this as compellingly, quantitatively rigorous as the best psychophysics. In this case, we've been exploring models that have just two free parameters, and we can explore how they interact with a range of different stimuli and judgments. One is this sigma, the state uncertainty, which basically captures the variation in these plots here: how well can you localize the position of blocks in three dimensions given the image? The higher sigma is, the more uncertainty you have about the true state of the blocks. Then there's this - - , which basically says how much uncertainty you have about the forces that are acting on this thing. So the idea is you basically understand gravity and sort of inertial force interactions, but you might allow for the possibility that there are some unseen forces. The table the blocks are sitting on could be bumped, or somebody could puff air on it, or something like that. In this task, you might wonder--well, we didn't say; I just showed you a bunch of blocks, and I didn't say anything about perturbing forces--so hopefully this parameter won't do a whole lot. We'll see some evidence that even here it has a little bit of an effect, and then we'll explore it in a subsequent experiment.

So the basic summary of the model, then, is: we show people these stimuli. We simulate inference by drawing a set of samples from this scene posterior, parameterized by sigma, the state uncertainty, then propagating them forward with a kind of approximate classical mechanics game engine, and looking at what happens. You can also think of it as a kind of probabilistic mental imagery. And looking at the outputs of a bunch of simulations, we can then ask various queries--compute the values of predicates on the outputs. If we're interested in how much of the tower will fall, or how likely the tower is to fall, we might just count the percentage of the blocks that fell. Or here we might be interested in which way they are going to fall. That's another query or predicate we could - - on the value of the very same simulations.

An example of one of these experiments, and the results from it, are shown here. In this experiment, subjects were given 60 different stimuli, these towers, of which I've shown you a few before. They were designed to cover the spectrum from very stable to very unstable. I think this picture shows you: this is the stable end, this is the unstable end up here. On the X axis, we're plotting the model prediction for stability, which is basically the expected proportion of the tower that will fall, and on the Y axis, we're plotting humans' judgments, normalized Z-scores. And what you can see is there's basically a pretty good correlation. It's usually around 0.9 in different versions of this model and different versions of this experiment. We've done this experiment many different ways, probably at least half a dozen different ways: with feedback, without feedback, asking the question "how likely is the tower to fall," or as a 2AFC task. Basically it doesn't really matter.
You get pretty similar kinds of model fits in all of these cases, and it's interesting that it doesn't actually depend on feedback. We can give people feedback after every one of the 360 trials--they see each of the 60 stimuli six times in randomly arranged blocks--where we tell you what happens, or we don't tell you what happens in any one of them; we just give you a little bit of feedback on a few different towers at the beginning to get used to the task. It makes almost no difference in people's responses.

To explore a little bit the role of uncertainty here, you might say, well, of course, so I showed you a model that fits pretty well, but how well, and how much do these parameters matter? Is there any evidence that there's really some kind of probabilistic physics involved here? This plot is meant to give you some insight into that, where it's sort of a meta-plot. The previous scattergram here was the fit of the model for one value of that state uncertainty parameter, and no latent forces. Now, here, that's basically this intermediate setting. It's a value of sigma of about 0.05, and what that means is it's basically a standard deviation of about a quarter the size of the smallest edge of one of these blocks. That's the place where the model fits best, and what this solid line is showing is the correlation between model and humans, which again maxes out around 0.9 for that value of the parameter. The dashed line is the no-feedback experiment, which tracks it almost perfectly. It fits a little bit worse, but the correlation between these conditions is about 0.95 just in the behavioral data, from one group of subjects to the next.

Now, it probably won't surprise you that if I increase the model uncertainty, so the model is less able to localize the objects, the fit goes down, because at some point, if you have no idea where the objects are, you can't predict anything. What's maybe more interesting is what happens when you lower the model uncertainty. At this zero end here, that's where the model has a basically perfect grasp of the situation that's being simulated. It knows exactly where the blocks are, and it has the physics exactly right, at least as far as the simulation captures it. Interestingly, that's where it fits human judgments worst, right? So this is an ideal observer model, but it's got a key role for uncertainty.

Why does it fit human judgments worst there? Well, a lot of it's being driven by a couple of interesting outliers, but look at this red data point here. It corresponds to a stimulus that the model says is very unstable, and people also say is very unstable, but here, when you eliminate that uncertainty, the model says, oh, no, that's actually stable, while people still say, of course--because the Y axis is the same; that's just behavioral data. This corresponds to one of these towers I showed you before, this one up here, the one in red, which looks to all of us very unstable, but in fact it is physically stable. It is in static equilibrium. It's actually a simple version of one of these illusions. And this is surely not the whole story of what's going on here--it's more complex in these real-world cases--but roughly we'd say that what allows for illusions here is basically what allows for illusions there.
It's possible to arrange objects in configurations which are stable but kind of precarious--a small perturbation will move them or make them unstable--and we think that people's sense of stability is fundamentally probabilistic in this sense. It recognizes that kind of precariousness, which can come from state uncertainty, or from the possibility of latent forces: forces like the wind blowing, or somebody walking nearby or bumping it. I won't go into the details, but you can get basically similar effects by adding in forces of a very small, latent magnitude. Basically, allowing for the possibility that the table might be bumped by a very small force also improves the model fit in a very similar way.

Now, again, I could spend hours talking about this model and the experiments that Peter and Jess Hamrick, who is a master's student working with him, have done. For example, one of the things they've done is investigate various heuristic alternatives that don't involve anything like a rich physical causal model. And we can show that, for example, for this kind of task, height is a pretty reasonable heuristic. It correlates at about 0.77 with people's judgments. But there are all sorts of reasons to believe that height is not primarily driving judgments here, even though in these examples I showed you, the more unstable one is also the taller one. For example, you can do a purely height-controlled experiment, like we showed here, where we made 60 new towers that are all the same height, and people are still able to do this. Not quite as well, but of course that's because in forcing the height to be the same, we're also compressing the dynamic range--it's kind of like zooming in on this part of the scattergram here, where there would also be less of a good correlation. But basically - - that compression of the dynamic range doesn't make a difference. I'll show you an experiment in a second where it does make a difference, or where something like a height-type heuristic might be a more plausible account.

Part of the reason to be doing this work is common sense. Even in this very simple domain, the point of having an intuitive mechanics theory is the kind of generativity we're used to seeing in other areas of cognitive science, most famously the idea of language as an infinitely generative system, where with a finite system of rules you can produce and understand an infinite--constrained, but within those constraints unbounded--set of structures. We think that intuitive theories of physics or psychology work the same way: that by building these simulations and running them, there's an effectively infinite number of questions we could ask, which can be represented in language for our purposes, but which we think have to be represented in some kind of formal predicate system as far as the simulation models are concerned.

So in addition to just asking will the blocks fall over, we can ask which way will they fall, or how far will they fall, or ask for counterfactual predictions: if you had removed this block, or if you hadn't removed this block, would it still have fallen over? What happens if you bump the table with a pretty hard force? What's going to happen? All sorts of things. We can ask about other latent properties. Which of these blocks is heavier or lighter than the others? Which ones may be smoother or rougher? And we're doing experiments on all these predicates.
So to give a couple of examples, here's an experiment that simultaneously asks two questions. People see one of these towers, and they have to indicate visually how far and in which direction they think the blocks will fall. Hopefully it's pretty intuitive. This is not a very good answer, but the next one will be a little bit better. So you see the subject is adjusting this little polar coordinate system, trying to align the circle with how far the blocks are going to fall, and that line with basically where the mass of the blocks will go, so here's a pretty good answer. Hopefully you can all see that one is pretty good and the first one wasn't so good. I think I cut the slide with the data there, but basically what we've found is that--here's a little question for you. These are two judgments which people can make: how far will the blocks fall, and which way will they fall. On one of them, I'll tell you, people are very well correlated with basically the exact same model that I gave you before, and on the other they're not so good, at least as far as the model is concerned--this physical simulation model isn't very good at predicting people's judgments.

So which do you think people are better at, from the point of view of the ideal observer model? How far will they fall, or which way will they fall? How many people say how far? How many people say which way? You're right. We didn't need to do the experiment. Why? Anyone know why? I would've been embarrassed if everyone raised their hand and said why. That's a little bit less obvious. So I mean, there are various reasons why, but I would say one basic thing is that it's a harder task for the model. In order to know whether the blocks will fall, or which way they'll fall, you only have to do a very short simulation, right? If they start to fall--if you imagine, okay, as soon as they start to fall, they're falling. It's over. Which way are they falling? Once they start to fall in this way, they're basically going to fall in that way, right? But if you want to say how far they are going to fall--watch this movie--you have to run the simulation to the end. They could bounce around. It's a very complex dynamical system, and you can look at this quantitatively and ask how accurate the simulation is, either about ground truth or about people's judgments, as a function of how long you run it forward.

And basically, for the first thing I showed you, stability, and for the question of which way they will fall, these probabilistic simulations become very accurate very quickly. But for the question of how far they will fall, you have to run it for a much longer period of time, about a second instead of a couple of hundred milliseconds, before you get anything useful--anything, in particular, as useful as a simpler heuristic, which is height. So here, a height heuristic actually predicts better than our physics model how far the blocks are going to fall, and predicts it, of course, instantly. You just have to look at the thing. So we think it's an example of a simple heuristic that might make us smart, I guess, if you're fond of those sorts of ideas.
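The cost-benefit point can be sketched in code: a "which way will it fall" query can be read off a truncated rollout, while "how far will it fall" needs the rollout run much longer (or replaced by a cheap heuristic such as height). The toy dynamics and time constants below are purely illustrative stand-ins for a real physics engine.

```python
import numpy as np

def rollout(n_steps, dt=0.01):
    """Toy stand-in for stepping a physics engine forward: a block tips and falls."""
    pos = np.array([0.0, 1.0])                   # horizontal offset, height
    vel = np.array([0.0, 0.0])
    for _ in range(n_steps):
        vel = vel + dt * np.array([0.3, -9.8])   # imaginary tipping force plus gravity
        pos = np.maximum(pos + dt * vel, [-np.inf, 0.0])   # crude floor at height zero
    return pos

# Direction of fall: already stable after a short simulation.
print("direction:", np.sign(rollout(n_steps=20)[0]))

# Distance fallen: keeps changing, so the rollout must be run much longer
# (or you fall back on a quick heuristic such as the tower's height).
for n in (20, 100, 1000):
    print(n, "steps ->", round(float(rollout(n_steps=n)[0]), 3))
```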
But I think the real challenge for us, one of the real challenges here, is to understand the interaction between some richer causal understanding of the world and simple heuristic rules like that. We think this is an interesting case of an adaptive mixture of the two, where your brain seems to realize, in a sense, that a simple rule like this height rule is about equally well correlated with two judgments--how stable is the tower, and how far will the blocks fall--but we really seem to be using it in only one case, the case where, on a cost-benefit analysis, it provides that useful bit of information much more quickly than the more complex simulation you could run.

Here's a task which is a little bit more unusual. We wanted to study this task for a few reasons, but one is to take us outside of a regime of tasks which everybody is pretty familiar with. You might not have done exactly that task before, but you've played with blocks and sort of judged whether they're stable or not if you've played Jenga. This is a task that isn't exactly like any one that you've done before, but it starts to show a little bit of the expressive power of this model. So let's say we have these blocks on a table here, the red and yellow blocks, and imagine the table is bumped hard enough to knock some of the blocks onto the floor. Is it more likely to be red blocks or yellow ones that hit the floor? What do you see, red or yellow? Something unstable. Is that red or yellow? How about here? Why is that one funny? All right. So you get the idea. You can make these judgments pretty quickly, I think, and with a fair amount of consistency, although you can see from reaction times and variance that there's some uncertainty, and those are things our model wants to capture.

So this is a similar kind of data plot for an experiment where there were 60 different configurations of red and yellow blocks, and they were designed in a kind of factorial way to vary things like how high the different colored stacks were, how close they were to the edge, how precarious each individual stack was--pretty complex scene factors. It's hard to say exactly how each factor turns into the final judgment, except that as far as our model is concerned, they all factor into the simulation. Again, it basically fits as well as the simpler judgment of just how stable this stack is. The model is exactly the same as the model before, but with one extra feature--a more exercised feature, which is that latent force uncertainty. Before, we allowed for the possibility that the table could be slightly perturbed, but now we've said the table's bumped fairly hard, enough to knock some of these blocks off. So we varied that parameter and found, sure enough, that you get a model fit like this only when that parameter is in a fairly reasonable range, where it's actually strong enough to knock some of the blocks off onto the floor. So it's nice that it's basically the same model with one twist, but that twist is exactly the one we described to subjects.

If you want to think about a very different kind of alternative account: often people who come from machine learning look at this and say, I think I could design some sort of classifier that I could train to detect this. It's an interesting challenge, and we've done a lot of work to try to do that, if you want to take it on.
I'd be very happy to talk with you about that and see the results. Again, think about what your brain is able to do here. It takes a sentence in language, which says the table is now bumped, and it knows how to take that information and turn it into the relevant parameter of the physics model. That's the kind of common sense we're talking about, and we think it needs a representation where words like force actually mean what you think they mean. So we think we fundamentally need such a representation. Here's just one last physics judgment, which is to show that things are not just about prediction, but about inference of latent properties. It's the beginnings of learning, if you like. So look at these objects. I'm going to play this movie in a second. It's frozen at the beginning, and your task is to say which is heavier, the red blocks or the yellow blocks. They might have different densities, different masses. I'll play it one more time. What do you think? Red or yellow? Yeah, yellow. And for those of you who maybe didn't quite see it, there are two places where you can see it, at least. One is at the very beginning: just the way it tips over is a sign that the yellow is heavier. But also at the end, look at how they bounce off of each other. So it's interesting, right? People in perception and psychophysics have long studied cue combination. I can say, here are two cues that are relevant: something about the initial configuration and something about the bouncing. It's very hard to articulate what those features are, but as far as our model is concerned, those are just two things that are there in the simulation, and we think this is the kind of evidence that you're able to run this analog physical simulation at a much finer grain than language, although when we start to talk about the scene, that's the representation we're talking about. I'm running fairly close to the end, so I'm going to skip ahead, but I'll just point you to some interesting work showing that very simple versions of this kind of object physics model can actually be related not just to children's behavior, but even to infant behavior. For those of you who know the infant looking-time literature, with - - violation-of-expectation measures and how long infants look at different scenes--that's largely how we know what we know about infants' understanding of objects, from those kinds of studies by people like Spelke, Baillargeon, and many others. And what we were able to do in this work, with - - and Ed Vul as the co-first authors, which came out of Luca Bonatti's experimental lab, was to take simple versions of the models I showed you and use them to quantitatively predict infants' looking time in simple versions of these kinds of red/yellow object-motion displays. And it's probably the first evidence that I know of for a quantitative, psychophysics-like model being tested in infants, and it shows two things that I think are valuable. One is that infants' judgments, as measured by looking time, are not just qualitative; they have some quantitative computational content to them--the notion of surprise can be directly linked to probability. But it also shows the quantitative value of building models of objects and the dynamic interactions between objects even in the earliest stages of cognition.
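Before moving on, here is a rough sketch of the "which color is heavier" inference just described, done in the same simulation-based way: enumerate candidate mass ratios, run the (here, placeholder) physics forward under each, and weight each hypothesis by how well its simulated motion matches what was observed. All names and numbers below are illustrative assumptions, not the actual model used in the experiments.

import math

CANDIDATE_RATIOS = [0.25, 0.5, 1.0, 2.0, 4.0]   # hypothesized yellow/red mass ratios

def simulate_trajectory(scene, mass_ratio):
    """Placeholder for running the physics forward under a hypothesized mass
    ratio and returning the resulting positions over time; a real model would
    call a physics engine here. This stub just returns the scene unchanged."""
    return scene

def trajectory_error(observed, simulated):
    """Sum of squared differences between observed and simulated positions."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))

def mass_posterior(scene, observed, obs_noise=0.1):
    """Weight each candidate mass ratio by how well its simulated motion
    matches the observed motion (Gaussian likelihood), then normalize."""
    weights = {}
    for ratio in CANDIDATE_RATIOS:
        err = trajectory_error(observed, simulate_trajectory(scene, ratio))
        weights[ratio] = math.exp(-err / (2 * obs_noise ** 2))
    total = sum(weights.values())
    return {ratio: w / total for ratio, w in weights.items()}

# Toy usage with made-up numbers; with a real simulator, the ratio whose
# simulated collision best matches the observed bounce would win.
print(mass_posterior([0.0, 1.0, 2.0], [0.1, 0.9, 2.2]))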
I just want to show a few slides about this other key part of common-sense knowledge, to give a second example of what probabilistic programs would be like and to show you the scope of common-sense knowledge, going back to things like these Heider and Simmel-type displays. So here, if we want to make a simple graphical model, we could say the relevant causal structure is that we observe actions, and those are a function of some latent mental states, like an agent's beliefs and desires. That's classic theory of mind, or a theory of a rational agent. You could again make a - - or something based on that. But really the hard work is done by the thick nodes and the thick arrows. What's the propositional content that goes in those latent variables? They're not just some low-dimensional or even high-dimensional Gaussian distribution. It's got to be structured knowledge. And what's that arrow that relates beliefs and desires to actions? It's not some linear or nonlinear - - Gaussian noise thing. It's something like a planning algorithm. So while the programs before, the probabilistic programs, were based on physics and graphics programs, now we're looking at planning programs. What is planning? Planning is what you do when, let's say, you're building a robot or some other kind of sequential decision system, and you write a program that takes as input something like a belief or desire--a representation of the environment and a utility function--and comes up with a sequence of actions which tries to maximize expected utility, basically. That's classical rational probabilistic planning. A lot of robotics these days is based on doing efficient probabilistic approximations to that idea. Coming from an economic point of view, you could see this as taking the classical economic idea of making decisions by maximizing expected utility and scaling it up to more complex environments and sequential settings, where you have to make a whole sequence of actions--a plan, basically--not just a one-step decision. And this idea of understanding intuitive psychology as a kind of inverse economics, inverse optimal control, inverse planning--these are different words for more or less the same general idea--has become extremely influential in cognitive science these days. There are lots and lots of people showing the value of this idea, in the same way that modeling intuitive physics is a kind of inverse Pixar problem. I'll just show you a couple of studies very briefly of the kind of things that we did. This is mostly the work of Chris Baker, who's a graduate student in our lab, but it's also been collaborative with Rebecca Saxe. The next slide comes from Tomer Ullman, who's another grad student in the lab. But basically what we've done here is try to create, in a sort of psychophysical lab setting, things like the Heider and Simmel or that Southgate and - - display I showed you, with the little shapes chasing each other, but do it in a way where we can vary factors in a controlled way and quantitatively assess inferences about these latent mental states, like desires and beliefs. This was our first, simplest one: if you're a subject in this experiment, you observe an agent moving through a two-dimensional room.
You're looking down from the top, basically, at a little maze. And there are three possible goal objects, A, B, and C. There may be a wall in the middle of the room, and the wall might have a hole in the middle, and on different trials you can vary all of these things, including where the objects are and the path that the agent takes. You typically only see an incomplete path--we might stop the trial at a certain point along the path--and you have to make a judgment, on a scale again of one to seven, of how likely the agent's goal is to be the A, B, or C object, or red, green, and blue. So this slide is showing, somewhat densely, a sample of the results from this experiment. Here are several different scenarios where you see the constraints of the environment and the object locations changed, and then you see the path with a bunch of numbers. Those are the steps the agent takes, and the boldfaced ones are the ones at which, on different trials, we queried people: at the beginning, after three steps, seven steps, ten, and so on. Then in the second row you can see people's judgments, and in the third row you can see the model predictions. So the judgments again are basically normalized to a probability scale of how likely, at each point along the path, you think the agent's goal is the blue, green, or red object, and the model is doing the same thing. So you can see all sorts of different kinds of inferences going on: cases like this one where you're basically unsure between blue and green, and then there's a key step which disambiguates them. You have these goal switches here, where you think that his goal is green, and then it seems like he changed his mind very suddenly. You can see this here: most of you probably think that up until now it looks like he's very clearly headed to B, and then very suddenly it looks like he changed his mind and is headed to A. You can get double goal switches--all sorts of stuff going on. There's a lot of texture to that data. What was really surprising to me was how closely this model captured it. You can see here that the model predictions are almost perfectly capturing all those little bumps and variations. If we look over all the trials, the correlation between the model and people is 0.98. Again, that's the kind of thing we usually don't expect to see in high-level intuitive social psychology, but more in quantitative psychophysics. And the model is in some ways complex, in some ways simple. It's complex in that it says that just as we think you have a game engine in your head, you effectively have a probabilistic, approximately rational planner in your head. We implement that--for people who are familiar with the technical side of planning--with a kind of MDP or POMDP solver. But the actual parameters that we vary, to try to see what versions of planning people have and how that works here, are very simple. Basically, there's just a small cost for each step you take. This is obviously way oversimplified, but this agent is assumed to incur some cost for each step, and then to get a big benefit, a positive utility, when they get to the goal. Then there's a little bit of randomness--they don't always do the best thing. So you say that under that generative model, they tend to take steps which maximize their expected utility.
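Here is a minimal, self-contained sketch of that recipe on a toy grid: value iteration for an agent with a small step cost and a big goal reward, a softmax (noisily rational) choice of actions, and then Bayes' rule over candidate goals given a partial path, which is the inversion discussed next. The grid layout, step cost, goal reward, and softmax parameter are all made-up illustrative values, not those of the actual experiments or of the MDP/POMDP solver used.

import math
import itertools

GRID = 5                      # 5x5 room, no walls in this toy version
GOALS = {"A": (4, 4), "B": (4, 0), "C": (0, 4)}
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
STEP_COST = -1.0              # small cost for every step (assumed value)
GOAL_REWARD = 10.0            # big benefit at the goal (assumed value)
BETA = 2.0                    # softmax "rationality" parameter (assumed value)

def in_grid(s):
    return 0 <= s[0] < GRID and 0 <= s[1] < GRID

def step(s, a):
    nxt = (s[0] + a[0], s[1] + a[1])
    return nxt if in_grid(nxt) else s   # bumping a wall leaves you in place

def values(goal, n_iters=50):
    """Value iteration for an agent heading to `goal` (goal is absorbing)."""
    V = {s: 0.0 for s in itertools.product(range(GRID), repeat=2)}
    for _ in range(n_iters):
        newV = {}
        for s in V:
            if s == goal:
                newV[s] = 0.0
                continue
            newV[s] = max(STEP_COST + (GOAL_REWARD if step(s, a) == goal
                                       else V[step(s, a)])
                          for a in ACTIONS)
        V = newV
    return V

def action_probs(s, goal, V):
    """Softmax (noisily rational) policy: better actions are more likely."""
    q = [STEP_COST + (GOAL_REWARD if step(s, a) == goal else V[step(s, a)])
         for a in ACTIONS]
    exps = [math.exp(BETA * qi) for qi in q]
    z = sum(exps)
    return [e / z for e in exps]

def goal_posterior(path):
    """Observe a partial path (list of states); infer P(goal | path) by Bayes'
    rule, assuming a uniform prior over the three candidate goals."""
    post = {}
    for name, goal in GOALS.items():
        V = values(goal)
        logp = 0.0
        for s, s_next in zip(path, path[1:]):
            probs = action_probs(s, goal, V)
            # probability of whichever action moves s -> s_next
            p = sum(pr for a, pr in zip(ACTIONS, probs) if step(s, a) == s_next)
            logp += math.log(max(p, 1e-12))
        post[name] = math.exp(logp)
    z = sum(post.values())
    return {k: v / z for k, v in post.items()}

# Example with made-up data: an agent walking up the right-hand wall from the
# bottom-right corner looks more likely to be heading for A (top-right) than
# C (top-left), and far more likely than B (bottom-right).
print(goal_posterior([(4, 0), (4, 1), (4, 2)]))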
Then the question is, observing a partial sequence of actions, what do you think their goal was? What goal would best make the observed sequence of actions a probabilistic, utility-maximizing sequence? Just doing that is basically enough to get this model to make these predictions. In this other project that I mentioned, we extend this to a multi-agent setting, where we're asking people to make judgments about two agents interacting: sometimes you can see one agent appearing to help or hinder another. There's some really nice infant work done by Hamlin and Kuhlmeier in Karen Wynn and Paul Bloom's lab at Yale a few years ago, and we again took this and did it in a sort of adult psychophysical setting, where you have two agents moving around in this four-room maze. I won't go through the details, since I guess I should probably be wrapping up, but again, we can model judgments here about whether an agent is acting on his own or whether he's helping or hindering another agent, by having a multi-agent planning system. Now we have these recursively defined utility functions. So what does it mean for A to be helping B? It's for A to have a utility function that's a positive function of B's; to be hindering is to have a negative function of it. So I get rewarded when you get rewarded--a kind of golden rule. Or maybe the most interesting development of this so far, from our standpoint, for those of you who are familiar with the literature on theory of mind in developmental psychology: where things start to become interesting is when kids are able to make inferences not only about the goals of an agent but also about their beliefs, and about how beliefs and desires, which are both typically unobservable, interact--particularly when we might have trouble understanding an agent's behavior unless we realize they have a false belief, or maybe a surprisingly true belief. Maybe their goal is different from what - - are. This is a good thing to talk about on the Penn campus, because you guys are famous for your food trucks, and this was our food truck experiment. So this is the last study I'll talk about, and I'll try to give a high-level overview of it. The way the experiment works is that this is a little toy domain of a university campus, and we have a hungry grad student, Harold, who comes out of his office and goes foraging for food. We show subjects in advance that there are three food trucks in this world, K, L, and M, which stand for Korean, Lebanese, and Mexican, all right? And there happen to be only two parking spaces, so only two of the three trucks can be present on any one day. Sometimes only one space is occupied, so there's always one or two spaces occupied--and I gather some trucks around here are like this; they're sort of portable. In this world at least, the trucks can go in any parking space on any day, kind of first come, first served. So Harold doesn't know in advance which truck is where, but he does have line-of-sight visual access. He comes out of his office, and he can see that the Korean truck is there. What does he do? He walks all the way down here, past the Korean truck, goes around the corner, where now he can see that the Lebanese truck is on the other side, and he goes back to the Korean truck.
So the question is, which is his favorite truck of the three? Korean, Lebanese, or Mexican? Right, Mexican. You get that, right? And it's pretty interesting, because there's no Mexican truck present. So if you imagine training some detector--and some people have suggested this both in machine vision and maybe as a model for what's going on in the early stages of infant action understanding, like Amanda Woodward's classic work on goal-directed reaching--some would say that whatever you're moving toward or reaching toward, that's what you want. But of course that's not what you most want here--it's not present. You have to do this more mentalistic analysis. That's what people and the model do here. And moreover, the model is also making a joint inference about the agent's initial belief, which is: what did he think was present behind the wall before he started out? It says he thought it was pretty likely that the Mexican truck was there, as opposed to the Lebanese truck or nothing. That makes sense if you think about it, because if he had been sure the Lebanese truck was there, he wouldn't have even bothered to go check--he would've gone straight to the Korean, right? Or if he had been sure there was nothing. So he had to be kind of optimistic. He had to want Mexican and think it was likely to be there. And again, we can vary many different features of the task. This is showing a sample of several of the conditions, and again we can do quantitative psychophysics. The inferences here, particularly for beliefs, are not quite as good as for desires. It's interesting that your ability to quantitatively judge an agent's hidden beliefs is still quantitatively consistent with this sort of ideal agent model--not perfect, and we could talk more about that if you're interested. Then this is just a very dense survey, just to show how much progress has been made in the last few years in understanding intuitive psychology as something like this inverse planning or decision making. There's one part of the talk which I didn't get to, but that's okay; it's mostly wide open, and I'm happy to take questions on it. Which is: if this is a reasonable view of common-sense knowledge and of how it might be used to make inferences from sparse data, there's still this question, maybe the hardest and most interesting question, of learning: how did you get it? Where does it come from? Both behaviorally, empirically, and computationally: what kinds of mechanisms could, either over the span of development or maybe over evolution, build these kinds of probabilistic programs for physics or for planning and build them into your head? This is a really, really hard problem. Basically, we don't know. Somehow, if common-sense knowledge is modeled as probabilistic programs, then learning is a kind of programming of your own mind, right? Something like a search through a space of programs to come up with the one that best explains what you've observed. We've been starting to work on this, and I guess I'll show you one example of how we can study this, again with adults, although we'd like to do this kind of thing with infants. This is again work that Tomer Ullman and others are doing. I'll show--just look at these movies here as a representative sample.
In each one, you see a few objects of different sorts moving around, and they follow some aspects of Newtonian mechanics. They basically follow inertial motion, F = ma, but they vary in other things, which you can hopefully start to see, right? What kinds of things are varying in these movies? Anyone want to venture a guess? There's mass, there's friction, there are forces that attract objects to each other or to different parts of the scene, like gravity-type things or winds blowing in one direction or another. Basically, what we've been doing is showing these kinds of videos to people and asking them to make judgments about the relative masses, the friction of different patches on the surface, what forces are active between objects, and so on. And again, people seem to be pretty good at this--a kind of very rapid physical law induction--but they also make some systematic errors, and we're trying to understand where those come from.

So to wrap up, then: I've tried to introduce the beginnings of this research program on what I call the common sense core, the roots of common-sense knowledge in the understanding of physical objects, intentional agents, and their interactions, and I've tried to say how we could approach this computationally: what formalisms can express this knowledge, can explain how it's deployed quickly for inference, and how it might ultimately be learned. I think we've made a lot of progress, given that we've only been working on this for a couple of years. I'm very excited about it, and more generally about the idea of using probabilistic programs to give a new generation of probabilistic models that capture common sense. But there are also some huge open problems, which I've just started to hint at. Actually doing inference in these models is very difficult if you have to rely on stochastic simulation--those of you coming from statistics or AI and machine learning know about the challenges there. Learning, of course, is very hard, and - - toward the end, maybe the hardest problem is how any of these things might be implemented in the brain. I think that's mostly an interesting question to ask, rather than a place where we have any kind of answers, but I'm happy to talk about any of this in the questions. Thanks.

[Applause]

MR. TENENBAUM: Go for it.

MALE VOICE: [off-mic]. As I understood it, part of the reason they did this experiment is they wanted to show - - .

MR. TENENBAUM: Well, this is a really neat study, because they did the experiment after we said, why don't you do this experiment, and they said, we already did it.

MALE VOICE: So I think that what they were trying to show is that the inference would show a kind of intelligence where it wasn't clear what aspect of their experience they'd be generalizing from. In constructing such an experiment, it's a struggle, because you're always going to be fighting against the ambiguity of the notion of analogy, right? It's like, who knows what aspect of their experience. But it seems like as a modeler, you have the same problems as the designers of the experiment. So what kind of strategy can you follow to be sure that you're not building too much of the answer into the model?
MR. TENENBAUM: Part of the point of studying common-sense knowledge is that we think a lot of the answer is built in--not necessarily innate, but built into the model of the capacity. We're not trying to say how your ability to reason in a sophisticated way about the physical world emerges from goo or mush, right? We're trying to argue, at least initially, that you have very structured, rich models of the causal processes in the world. It turns out that here you don't actually need that structured a model. I mean, while I said it's a simple version of that model, basically the only physics this model knows is Spelke object physics, which means it knows that objects don't wink in and out of existence and they don't teleport, but it doesn't know about gravity or inertia. The objects basically just random-walk around. They move smoothly in space and time, they don't teleport, and I think that's important. It's another way to think about the contribution of this model: it takes the Spelke object concept and shows what it can do if it's driving probabilistic inference. So the question you're asking, which is an important one, gets at a few issues that are often kind of confounded in some of the developmental literature, and in the paper I think we worked hard to try to be clear on these issues, because they're hard ones. What is the role of experience versus innate structure--which is what a lot of infant work has seemed to be addressing. And one way of thinking about that--how do I even--it's so complicated; I can think of three different ways to answer it. So the paper did make a pretty forceful argument that your ability to do this is not based on the kind of simple statistical learning abilities that some other recent infant work has studied in the lab--for example, the classic work of Saffran and Newport and so on, which says, here we're going to show you a bunch of data in this experiment and check your ability to do a kind of statistical inference on it. We thought what's going on here is a kind of probabilistic reasoning that operates on some more abstract knowledge representation, something like a Spelke object physical theory. We are agnostic, in this work, about where that abstract knowledge representation comes from. - - could be right: it could be built in innately. Or Scott Johnson could be right: it could be learned and emerge through development. Those are both consistent with this, because this study was done with 12-month-old infants, and everybody agrees that by then something like a - - concept is very well entrenched. So the way I see the question you're asking, it's really getting at two of the questions that I was interested in, but I want to tease them apart. One is, do we have evidence that infants are able to do this kind of probabilistic, theory-like reasoning? And we think yes, they are, in the sense that when you show them this kind of funny, novel situation--kind of like the red/yellow thing I showed, this blue/red judgment--there's no statistical data in the actual experiment itself that's sufficient on its own to get the right answer.
But we think yes, there's information in this which, when you combine it with a more general idea of how objects might move in the world--even a very weak, Spelke-like notion--then there is enough statistical information in the experiment, and what our model formalizes is how that works. But then it's a separate question how you build those theories, how you program your brain, how you come up with those probabilistic programs that capture how objects move, and we want to study that both empirically and theoretically, but that's sort of a later stage of the project. Just as in classic work in linguistics--I'm very influenced by the sort of very general Chomskyan program that says that studying language acquisition goes together with studying the structure of language. You need to have some idea of what language might be like in the adult state before you can study acquisition. Maybe the right way to put it is that I see those interacting, but at least here, until the kind of work that we've been doing, there have not been very good models of intuitive physics at a level that could be tested at all in these kinds of paradigms. So that's at least the high-leverage point where we wanted to enter into this. But the last movies I showed you are partly designed to be the kind of things you can show to infants, and we're very interested in how you could do Saffran-Newport, - - type, very quick statistical learning experiments on those kinds of stimuli and see how much plasticity you could get. It's like artificial language learning--it's artificial physics learning, basically. That's essentially what we're doing here. Just as in the sort of generalized Newport-Aslin research program, we're starting off with adults, and then we do hope to take this to infants. Hopefully that addresses your set of concerns. Yeah?

MALE VOICE: [off-mic]

MR. TENENBAUM: I'll try to give a very long answer.

MALE VOICE: [off-mic] There were, as you remember, two introductions. One was the introduction of the format of the platform of the court. They were - - invited to speak. The second introduction was introducing you. What have you learned from the first introduction, and since you, in three minutes, became familiar with the book itself, what have you learned from that experience?

MR. TENENBAUM: You mean, what did I learn about Dr. Pinkel and the Pinkel Lecture and his thought?

MALE VOICE: Yes.

MR. TENENBAUM: Well, I mean, I told you what I learned. I learned that he's an engineer thinking about how the brain and mind work, and he takes seriously some of the same kinds of questions I do, like the interaction between the brain and the mind and the analogies between a computer and the brain. And I got that by skimming this book. So I feel you must be asking something incredibly deep there about how I learned that, and though I'm not sure if this is what you're getting at, it's certainly--I'd say it's one theme of what I talked about here, but it's definitely a theme of the larger research program I've worked on: our ability to make inferences from very sparse data.
In this case, the sparse data is just a couple of shapes moving around a screen, and you make an inference about objects in the world, their force interactions, and their goals, assuming that they're intentional agents. That's a notion of projecting beyond the observed data. And I think we do this all the time. Here I gathered a little bit of observed data, which was really just a couple of section headings and a little bio on the back saying he worked for NASA and the Rand Corporation, and I have some larger schemas about science and engineering and brains, so I can project a lot of things--which I don't think I need to do here, but we all can, right? And that's a very general thing that our mind does: take some kind of abstract schemas and interpret the data of our immediate experience informed by them. You could say that is what our minds do--if there's any one thing our minds do, that's it. And here what I've tried to do is study that in the context of what I've called this common sense core, this set of particular abstract schemas that are some of the most important ones, because they emerge extremely early in development. I think they underlie the meanings of words in a richer way than at least most computational models of word learning, including work that I've done and many other people in psychology and machine learning have done. I don't know--to sort of invoke the spirit of - - --she's not here, right? The idea of verbs and how verbs are different from nouns. To be able to understand the representations of the concepts of intentional action which are encoded in verbs, for example--there's been a lot of really interesting work on statistical learning of words, but until we can do the statistics over something like the kinds of representations of intentional action I'm talking about here, we're not going to have good statistical word-learning models, I think. So my focus has been not on the very general problem of abstract schemas for making inferences from sparse data, but on the particular ones that I think are at the heart of human intelligence. Hopefully--is that what you were getting at, or did you want something more? Do you think these aren't the right ones? Hopefully. Okay. Hopefully. It's not that often in a field like ours--this isn't physics--it's not that often in our field that you get so many lines of evidence converging on the same thing. You look at the most successful accounts of lexical semantics, at least the most compelling ones that seem like they're saying something about the aspects of word meaning that might explain the composition of language. Or what is the infant research telling us? Or where are people in visual scene understanding going? What's getting the most traction in getting computers to look at images and movies and get out the things that are of value in a real application, like a car that has to drive and not hit people?
So different areas of our science and engineering are all kind of converging on this idea. And I come at it as someone with a certain mathematical and computational toolkit, very much influenced by probabilistic graphical models and various kinds of - - analyses, and I see that those things are great as far as what they cover, but they don't have the representational capacity to get at these kinds of common-sense core notions that all these other areas of our science and engineering are telling us we need. So I think there's great value in trying to make that bridge, and that's what I'm trying to do here. Yeah?

MALE VOICE: [off-mic] --is not always common to everyone, to every society. So have you ever thought about or attempted to address issues of how common these intuitions are?

MR. TENENBAUM: So I certainly would like to. I've only been doing this research for--the physics stuff is only about a year old, and the intuitive psychology is only a few years old. It's one-and-a-half graduate students' worth of work. Sorry, - - . Yeah, that's right. And so we haven't done these experiments cross-culturally, for example, but yes, we would very much like to. And I think there has been--does anyone know?--I think there's been some sort of informal cross-cultural work on things like the Heider and Simmel display. I think, for example, not everybody sees those two triangles and the circle immediately as intentional. Some people will look at that and describe it as, there's two triangles, or there's a triangle like this. The way that work is often described is to say, look, you can't resist looking at these shapes and seeing them as intentional agents. That might be very cultural or context-specific. But that's not really my point. My point is, look, if you do look at these as intentional agents, which is kind of compelling, I think, then you see all this stuff: you see the forces, the interactions, the goals, the beliefs, the fears and aspirations and so on. To me, that tells me something about the basic representations of the world. They're so basic and so quickly deployed that they can work even on just triangles and circles moving around in a two-dimensional plane. Certainly one of the things we'd like to do going forward--and I've been talking a little bit with some researchers who work with various relatively isolated and indigenous tribes in the Amazon--is actually trying to test these things with people who come from very different cultural backgrounds.

[END 90840518.mp3]