>> Susan Dumais: It's my pleasure this morning to introduce Larry Birnbaum from Northwestern University, where he heads the Knight News Innovation Center, which he might say a little bit about during the talk. It's a recently funded effort by the Knight Foundation to do lots of exploration about how to present textual and maybe multimedia information in ways that are very different than we currently do with news feeds. So today Larry's going to talk a little bit about a couple of things he's been working on. One is a general notion of context and how that's important in understanding -- we worry about it a lot in understanding queries, but also in presenting results to people. I think he's also going to talk about a fun project on text generation that takes raw data feeds, typically numeric, and tries to generate an interesting narrative around them, mostly for sports scores but for lots of other things. I think historically Larry's work has been around the intersection of AI and natural language and interfaces with people doing real tasks on real Web-scale systems. So Larry, it's a pleasure. >> Larry Birnbaum: Susan, thank you very much. Eric claimed he would watch this later. >> Susan Dumais: Say hi to Eric. >> Larry Birnbaum: So $5, Eric, if you actually do watch this later. Thanks for having me. This talk will be a little bit personal, and I don't know whether to call it solipsistic or narcissistic, as the case may be. But I'll talk a little bit about our own struggles, and by all means chime in with your struggles as they pertain to it. So I started out doing sort of big semantic modeling of the old school in natural language. And I love those models, the models we built like 25 years ago. I think there was a lot of beauty in those models and even a lot of truth to those models. But there were a lot of things I didn't understand at that time, in fact that none of us understood. And I think the first thing, which is idiotic that I didn't understand it, was that if it doesn't scale there's no point in doing it on a computer. And it might seem ridiculous for someone to be trained in computer science and not understand that, but I was and I didn't. But the other thing is -- and I think this is sort of a well-known maxim now -- that scalability is not something that you ever add later; it's got to be baked in from the beginning in whatever it is you're doing. And so as we started thinking about how do we cope with these realities, I was working with my colleague, Kris Hammond. And he's the one who really got me originally interested in looking at problems in information retrieval and intelligent information systems, and in thinking about how larger semantic models could be built on top of the processes that information retrieval and statistical methods generally provide us, and kind of help to make them do kind of interesting things. So we started in this work about a dozen years ago, or more than a dozen years ago, with this as our challenge. It is easier for a rich man to get into heaven than it is for you to actually take the context of what you're thinking and stick it in this little box and have the machine really do the right thing. And yet, it works okay, but when it doesn't work that's extraordinarily frustrating. And so we got interested in the idea of just avoiding this entirely. Just get rid of this thing. And so our first system, in some sense, was the unfortunately named Watson system of Jay Budzik, our Ph.D. student. This, again, I have to say was more Kris' and Jay's work.
I contributed some things to this project, but I was really more of a kibitzer. But Watson was one. There were a number of systems at this time. There was also the Remembrance Agent, which was done at the Media Lab. It was a similar system. I liked ours, actually. I thought it was a really good one. But what Watson did was it would look at a document that you were reading or writing online in a variety of applications, for instance, in PowerPoint, and it would analyze it statistically and heuristically, and then it would go out and build some queries and manage those queries and come back with stuff that was relevant to what you were doing. And it was a really good system, and it worked exceptionally well. I mean, it would come back with stuff which was very on point. Here's a PowerPoint on jazz in American culture and you get all this great stuff about a passion for jazz, styles of jazz music, blah, blah, blah. It's pretty good stuff. And it worked really well. It was fast. It was automatic. And it was twice as good as you. What I mean by that is that Jay ran studies where he would take a query that you would write -- and by you, I mean a master's student in computer science -- and ship it to a search engine, and then ones that were automatically generated by the system, and you could put together multiple ones, too, and mash up the results, and give them back to you to identify whether you thought they were on point. You would say well north of 60 percent of Watson's results were on point whereas yours were in the 30 to 40 percent range. It seemed like a fantastic opportunity to start a company, which we did. And it didn't work. The company didn't work. And I'm not going to -- I think it's pretty clear what the system did at sort of a high level. I'm not going to go through this. It really didn't work. And what we found out as we started showing it to people for real, saying you should use this in your knowledge management system, you should use this in your productivity system in some shape or form, people would come back and say these results are really on point but they're not interesting. It's more of the same. It's like what I started with. It's the same stuff over and over again. And we had a duh moment, which I think a lot of other people in this business have had, which was we were doing what everybody else does, using similarity as a proxy for relevance, and actually similarity as a proxy for relevance has its strong points and weak points very obviously. And our final realization of the weak points is: what's the most similar document to a document you have in your hand? It's another copy of that document, which is completely -- well, it might be useful to know somebody else has a copy of that document, but that's a completely different kind of information than the information that's contained in the document. So we had this realization that what would make something useful would be for it to be similar in certain respects and dissimilar in other respects. And I think lots of other people had this realization, too, obviously. Watson was relying on the size of the Internet and basically on noise to actually bring back stuff that might be useful to you. That might be enough on point to be relevant and enough off point to actually add information. And the question that came to our mind was: Can we do better than that? I'll back up, actually. Well, no, I'll do this one.
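To make the mechanics concrete, here is a minimal sketch of the kind of context-driven query generation Watson performed, assuming a plain TF-IDF term weighting and a generic search backend; the real system combined statistical and heuristic cues (position, markup, and so on) that are not reproduced here.

```python
# A minimal sketch of Watson-style contextual query generation, assuming a plain
# TF-IDF weighting over the document the user is reading or writing. The real
# system used additional heuristic cues not reproduced here.
import math
import re
from collections import Counter

def top_terms(document, background_df, n_docs, k=6):
    """Pick the k highest-weighted terms from the user's working context."""
    words = re.findall(r"[a-z]+", document.lower())
    tf = Counter(w for w in words if len(w) > 3)
    def weight(w):
        df = background_df.get(w, 1)          # document frequency in a background corpus
        return tf[w] * math.log(n_docs / df)  # classic TF-IDF
    return sorted(tf, key=weight, reverse=True)[:k]

def contextual_queries(document, background_df, n_docs):
    """Build a few overlapping queries from the working context instead of
    asking the user to compress their task into a search box."""
    terms = top_terms(document, background_df, n_docs)
    # One broad query plus narrower ones, so results can be merged and re-ranked.
    return [" ".join(terms)] + [" ".join(terms[i:i + 3]) for i in range(0, len(terms), 3)]
```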
So this is actually an older copy of a competitor's news system, which I think of as an interesting failure. I don't know what they think about it. I mean, it's a little bit better than this now. We'll talk about ways in which it's better. For example, it gives you a bunch of stories, ranked by how many stories are like this or something -- whatever they're using for that. And it will show you -- it will show you this thing here. If you don't want to look at this one, you can look at 533 other stories, which you're not going to look at. It's a lovely engineer's number right here. It's sort of like no one's going to look at those stories. And they don't show -- unless you actually do the expansion they don't show this to you anymore. They don't put this number in your face. But then there's the question to me, like why are they showing me a story from Fox News, and why is the alternative something from Bloomberg Businessweek, or why are they showing me something from the Wall Street Journal and why is the Baltimore Sun an alternative thing to show me? These again are very -- those things are particularly hermetic, it seems to me. It's very unclear why they're showing me that. They're optimizing something. But it's not clear what they're optimizing, and I guess to sort of cut to the chase, we started to come to the realization that these are editorial decisions that these systems are making. Editorial -- what editors do is attentional engineering. They decide what is going to get your attention by positioning it on a page or in a place or at a time when they think it will have your attention. And these machines are making editorial decisions, and they're applying editorial values to make editorial judgments, and we actually don't have any clear idea about what their editorial values are or how they're making the editorial judgments that they're making. And I actually dare say that the engineers who built that thing I showed you have no clue either, because they don't think of themselves for the most part as editors, and they don't think of themselves as making editorial judgments. They just think of themselves as trying to optimize something. They're just not thinking about it that way. We'll talk about the ways in which that's improved in a minute. So we ended up starting to think that what we really wanted here was a deliberate mechanism for finding alternatives or for pursuing information relationships, and here's the thing where I think we can have some discussion or debate, which is that it seemed to us this is where our preference for semantic approaches actually might play a role, because to be deliberate it seemed to us you had to be thinking about dimensions of similarity and dissimilarity that actually had some kind of semantic component to them. I'll tell you what I mean. You could look back at what Watson was doing wrong and say what I really want to do is look for the golden donut of things that are dissimilar enough that they're useful or golden -- I don't know what you call it. It's the golden -- you know, it's not a donut. It's like a nested sphere. It's the things that are far enough away from where you started to be interesting, but not too far away to be useless. You can imagine this kind of area. And that does -- I think -- and people do work on that. I think it's a really good idea, but you're not getting any notion -- you're still depending on noise and the size of the Internet to bring back stuff.
You're not actually in a deliberate or thoughtful fashion deciding to explore that space. You're actually just saying okay, this is where the sweet spot will be. And I think that's good. So we got interested in the idea of specific dimensions of similarity and dissimilarity that we could look at for particular content verticals. So here's an example. And actually this was the point of the thing we did just earlier this morning when we looked at that. So this is -- we actually did a similar system to Watson for -- you'll have to excuse me for a minute. These shoelaces are really not well designed. And I should get new ones, but I don't. We did a system like that for video that looked at the metadata associated with the video and brought back relevant information as well. And it worked pretty nicely. It brings back a bunch of recipes relevant to the things he's talking about here. If you type lasagna recipe into a search engine -- actually it keeps growing. This morning's results were ten million results for a lasagna recipe. The last time I looked it was six million results for a lasagna recipe, and the time before that it was one million results. But the point is there are a lot of lasagna recipes online. If I'm showing you a lasagna recipe already, the question is -- this is the problem with Watson. Watson would bring you back 10,000 other lasagna recipes. That is completely pointless at some level. This is obvious, but to get something going here, what should I be showing you if I'm showing you a lasagna recipe? This is actually a question. >>: [inaudible] of the data. >> Larry Birnbaum: Right. In particular, what kinds of -- to put it another way, if I'm already showing you a lasagna recipe, what other lasagna recipe should I show you that would actually be useful to you? >>: Better lasagna recipes. >> Larry Birnbaum: That's true. So highly thought of lasagna recipes. What else? >>: Different ingredients. >> Larry Birnbaum: Right. Vegetarian lasagna recipes. Low calorie lasagna recipes, low salt lasagna recipes, easy lasagna recipes, classic lasagna recipes. And in fact if you do query completion on a search engine, if you type lasagna recipe L, you'll get low salt, low fat, low cal, low carb, which I think is fascinating -- but maybe so, I'd love to hear about that. >>: You could also find other things that are related to lasagna recipes. So they share ingredients but are not lasagna. You don't see much of that. >> Larry Birnbaum: Absolutely. Other recipes you might consider cooking. >>: Like a nice salad or the right wine. >> Larry Birnbaum: The right wine. Caesar salad and Chianti, or whatever it is you should have, and garlic bread. That's the standard, cheap -- at some point in everybody's life, let's say at 22 to 23 years old, it's lasagna, garlic bread, Caesar salad and a bottle of Chianti. You've cooked somebody a nice dinner when you've done that. It's still a nice dinner. I don't mean to be patronizing. I can't even cook it even today. So that's what we ended up doing, we ended up saying, look -- now, if you have a lot of data and you want to go mine it, you can find these things in what people are asking for.
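As a rough illustration of what deliberate dimensions of similarity and dissimilarity can look like for a content vertical, here is a minimal sketch with hypothetical, editor-supplied modifier lists; the actual dimensions would come from an editor or from query-log mining, as discussed here, not from these particular lists.

```python
# A minimal sketch of deliberate query mutation for a recipe vertical: an editor
# supplies the dimensions (dietary, difficulty, accompaniments), and the system
# generates alternative queries from the recipe the user is already looking at.
# The modifier lists below are illustrative assumptions, not the real system's.
RECIPE_DIMENSIONS = {
    "dietary":    ["vegetarian", "low salt", "low calorie"],
    "difficulty": ["quick and easy", "classic"],
    "course":     ["salad to serve with", "wine pairing for"],
}

def mutate_recipe_query(dish):
    """Turn 'lasagna' into queries that are similar in one respect (the dish)
    and deliberately different in another (the editorial dimension)."""
    queries = []
    for dimension, modifiers in RECIPE_DIMENSIONS.items():
        for m in modifiers:
            q = f"{m} {dish} recipe" if dimension != "course" else f"{m} {dish}"
            queries.append((dimension, q))
    return queries

for dim, q in mutate_recipe_query("lasagna"):
    print(dim, "->", q)
```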
We actually like a relatively simple approach, because frankly a person who edits cookbooks is a professional at knowing what kind of information people actually need about recipes, or what other alternatives they want. While I'm prepared to believe there will be a lot of undiscovered gems of kinds of relationships in the data that people who build cookbooks never thought of in their imaginations, on the other hand, there will be things experts know that will be useful to you that won't typically show up in query logs, because not enough people will grasp that they should have looked at that, for example. And it could be. We're interested in putting a tool in the hands of an editor that will say take this recipe and mutate it in the following ways to find the things that you want. And we're also interested in things like how would you find out whether it was a quick and easy lasagna recipe. It might say quick and easy in it. Or you might look at it and actually see how many steps it has in it. Or, et cetera. So you can imagine lots of ways to actually estimate that. This is just a way of saying that when you go through the sort of thing that Watson did -- our Watson, the original Watson -- it's these places where you can actually make a decision about what am I going to look at in the context, how am I going to select the sources, how am I going to form the queries, how will I do all these things. These are all the little places in the algorithm where you can stick in a little bit of nudging or control in relatively simple ways. Thank you very much. But, you know -- so let me talk about a couple of other systems we built along those lines. This one was built by my student Jiahui Liu, and the idea here was if you give Watson a story about, for example, Israel's use of cluster bombs in the 2006 Lebanon war -- is that when it was? there have been so many, unfortunately -- you will get 50,000 other stories about Israel's use of cluster bombs in that war. If you give this to her system, what you'll get back is stories about U.S. use of cluster bombs in Iraq, NATO use of cluster bombs in the Balkans and Afghanistan. So the idea was to actually try and find stories that had very similar activities or actions but were located in different places or involved different actors. Here's an example. This is the case of Oracle trying to buy MySQL, and we get things back -- IBM as the comparable entity. It goes out and finds stories about IBM basically trying to pursue an open source strategy in a variety of other places. So, for example, these are the other two stories it found, and it actually tells you what corresponded to what in some sense in these things and what was in common. So you can see that the idea here was that maybe you would actually understand this story better by virtue of comparing it to something that was a little bit different from it, that that would actually be a way of coming to some better understanding of it. This is the -- >>: This is based on entities. >> Larry Birnbaum: Absolutely. It's a relatively simple thing. So I'll come back to that in a second. We actually did a couple of things where we could do, for example, if you did a story about, for instance, GM laying off workers, it might find a story about Honda hiring workers or Ford hiring workers. So we could look a little bit at antonymic activities. And do things around adjectives.
If you gave it a story about a bloody coup, it might come back with a relatively peaceful coup. It had some capacity to do that, but in general, that's right, it actually -- what it did is it took the sort of top stuff in the story, it ripped out the named entities, and it did a search just with that, without any named entities in it. And it was doing that to mine comparable named entities. The reason is all the stories you'll find that way are the comparable stories, but they're pretty mixed in all over the place because they're not very highly ranked. It did that to try and mine, for instance, IBM, or NATO, or Afghanistan versus Lebanon, for example, and once it mined those things, which it did by finding similar sentences high up in the first paragraph, where they played similar syntactic roles, it would go back and reformulate the query using those entities it had found. Does that make sense? Okay. And it was -- you know, I thought it was a reasonably successful system, at least in finding things like that, that were a little bit different, that might actually add value to your understanding of the situation. There was something else I was going to say that I -- I'll get back to it later. I actually wanted to use something like -- this was her idea -- I wanted to use -- this is a case where I thought that actually prebuilt taxonomies would be helpful. Like, for example, the Department of Commerce has lists of U.S. companies classified by industry. And you could imagine saying I'm just going to go through that and say here's another company, and I'm just going to go through and systematically substitute in names off this list. We tried to use brand X's set operation to do this. And it didn't work that well, interestingly enough. It didn't help us. I will say something about what this says about information retrieval compared to logic and why we were doing this. So what you really want to say here is I want a story -- for example, you might say something like the following: I'm looking at a story about, let's say, Iran and nuclear proliferation and weapons. And I want a story about this that's not about Iran, because I want to think about similar cases and situations that have happened previously. I just can't take some compressed query about this thing and add not Iran. That's not going to do the right thing, for a variety of reasons. First of all, a lot of the specificity in queries comes from names, because that's what names are for. Names exist to be relatively individuated symbols that stand for something, and a lot of the oomph in indexing comes from them. A lot of the specificity. But the second reason is that somebody might have actually written an entire case study comparing Iran with North Korea or Libya or whatever, right? And that's actually something you do want to retrieve. If you say not Iran you're shutting yourself out. This is just a way of saying that negation doesn't work properly in IR systems, because they're not full-blown predicate logic systems. So when you want to do something like negation, what you have to do is actually do substitution of the things that you mean otherwise. So not Iran means Iraq, Libya, North Korea, whatever the United States is currently categorizing as rogue states or something like that. So that becomes the -- does that make sense?
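Here is a minimal sketch of that retrieval loop, with a crude capitalization heuristic standing in for real named-entity extraction and a generic search() callable standing in for whatever engine is queried; the actual system also checked that candidate entities filled similar syntactic roles high up in the story, which is not reproduced here.

```python
# A minimal sketch of the analogy-retrieval loop: strip the names, search on the
# residue, mine comparable entities from the results, and reformulate. Entity
# extraction is a crude stand-in for real NER, and search() is an assumed
# callable returning hits with a "snippet" field.
import re

def named_entities(text):
    # Crude capitalization heuristic standing in for a real NER step.
    return set(re.findall(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b", text))

def analogous_stories(story, search):
    entities = named_entities(story)
    # 1. Strip the names: what's left describes the activity (e.g. "use of
    #    cluster bombs in a war") without tying it to particular actors.
    residue = story
    for e in entities:
        residue = residue.replace(e, " ")
    # 2. Search on the residue and mine comparable entities from the results.
    #    This is also how "NOT Iran" gets realized -- by substitution rather
    #    than a negation operator the retrieval engine doesn't really support.
    candidates = set()
    for hit in search(residue)[:20]:
        candidates |= named_entities(hit["snippet"]) - entities
    # 3. Reformulate: same activity terms, different actors or locations.
    return [search(residue + " " + c) for c in list(candidates)[:5]]
```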
So that's sort of the model here of sort of building -- and again this is sort of our notion -- building a little bit more complicated semantics on top of what the information retrieval systems afford us. I'm not the best evaluator in the world, my students are better than me, but the precision was pretty good. So about 70 percent or so of the stuff that came back, people would look at it and say, yeah, this is a good analogy to the story you originally gave me. The recall is a little bit funny, but I'll tell you what that means. So it turns out that reporters will very often allude to comparable cases in the stories that they give you. And so what Jiahui did was she took the actual comparable cases that were alluded to in the set of stories she was starting with about the original query, the original topic, whatever that was, down to some depth, like 20 or 50 or whatever. She took all of those situations and asked did the system find those? And the answer is it found them about 60 percent of the time. That doesn't mean that -- I mean, you get the point. It doesn't mean that the things it found weren't good ones; they were perfectly good, but they just weren't the ones -- this is something that brand X took to heart. So Jiahui went to intern at Google News while she was doing this work. And this relates to the fact that it shows you the stories, and it shows you the follow-up stories will be from Bloomberg Businessweek, and why are you showing me that? And we had this realization that if you -- let's say you see a story coming out of Pakistan. And you read about it on the New York Times or CNN, and the question is where else would you like to read about it. I mean, I've already seen it on -- maybe the Guardian will give me a slightly different point of view, you know what I'm saying? But it's still, you know, it's still London as opposed to New York. The minute you ask yourself this question it answers itself. What you're interested in is what are people in Pakistan saying about this? That's the obvious question. So if you happen to know that the Dawn of Lahore, Pakistan, is one of the premier English-language news sites in Pakistan, you go see what they're saying about it, but we didn't know that until we started this project. So this is actually a very simple piece of technology. It just has -- it basically has lists of interesting venues associated with countries, locations, and stakeholders, like organizations. And it just says if the story is about them, go look at what they say about this. Okay? So, for example, here's Putin visiting Iran when he was -- this was in 2007. So I guess he was president before he became president again. And here you can see that it finds some stories about this visit that are from Iran, in addition to the standard stories. It actually sees the United States as related, and a variety of other organizations. It looks at the story from each of their points of view in some sense. If you look at brand X's news site now you will see that they do, when it's relevant, identify the location the news is from and try and show you -- so if the news is out of Philadelphia they'll try to show you something from Philadelphia as well. And I think that's a really good idea. And it's a very simple idea. It's not a very technically difficult idea. It's 11:00. >>: 11 minutes left. >> Larry Birnbaum: I'm sorry. >>: 11 minutes left. >> Larry Birnbaum: That's right. 11 minutes, right. 11. Good. I was confused. We're counting down.
It's almost like New Year's. I hope this makes sense. I like the system for that reason. This is an example of the kind of system I like to build, because it's technically not that hard, and it sort of implements a relatively obvious semantic idea and does it in a straightforward way and it works. But this is one that I think was a little bit more challenging, that Jiahui did as well, and somebody gave me this fabulous phrase for it: epistemic dimensions. I was kind of baffled by what that was. The idea is I want to look at a topic and I want to look at it from multiple points of view, and by multiple points of view I don't mean liberal versus conservative, or for or against; I wanted something more abstract than that. So practical versus theoretical would be an example of what I mean by a dimension like that. And what we ended up with here is I want to talk about a story from a business point of view, a religious point of view, a technical point of view, a medical point of view, a legal point of view. For example, if I put in a topic like abortion, I could look at abortion from a legal, political, medical or religious point of view. And I'd get really different, very different kinds of -- and so again I think these are the kinds of dimensions I'm thinking about when I say, look, I'm looking for systematically different dimensions of information. This is the kind of dimension that is exciting to me. Here's an example from when the EU was going after Intel. This was either before or after they went after you guys. Well, whatever other large American company they decided to go after. But that's their job. You know what I'm saying? So I don't think we should begrudge the EU their mission. So here we have a technology and IT entrepreneur point of view on this issue. And then you can find blog entries from lawyers, basically on the legal point of view. And these typically came from lawyers and legal blogs. >>: [inaudible]. >> Larry Birnbaum: That's how many it found, I guess. It didn't go that deep. In general, we're not really that into completeness. I mean, we're going to be showing things to people rather than aggregating them. And so precision is more important than recall for us. I'm surprised the numbers are so low, but it is what it is. I'm not sure how deeply she dug. The main thing we were interested in when we did this was that we didn't want to have to download the actual contents of the documents to analyze them. We wanted to see if we could do it from the snippets. And it turned out that this worked, because bloggers are relatively consistent in the epistemic points of view that they take on topics. In other words, a blogger who typically takes a legal point of view may take a legal and a political point of view, but -- does that make sense? I hope that sort of makes sense. So you could actually just -- so what the system did, once a blog entry came back as relevant to this topic, you would look at the source, and the first thing you'd do is say have I analyzed this source before, because it would cache those results. If it had, it would say this is a thing from a legal point of view. And if it hadn't, it would just go back and do a query just for the source alone to get back 20 snippets, 20 random or recent snippets, of documents from this person. And then it would analyze those. And then we had -- so at this point I was just going to say there's a bunch of details about how to do that.
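A minimal sketch of those details as just described: per-source caching, a snippet-only query for the source, and snippet-level classification. The fetch_recent_snippets() and classify_snippet() functions are assumed stand-ins for the real components, and the aggregation threshold is illustrative.

```python
# A minimal sketch of the source-profiling step: given a blog hit, decide which
# epistemic point of view (legal, technical, business, ...) its source usually
# takes, working only from snippets and caching the answer per source.
from collections import Counter

_source_cache = {}

def point_of_view(source, fetch_recent_snippets, classify_snippet):
    """Label a source (e.g. a blog) with its dominant point(s) of view."""
    if source in _source_cache:                 # have we analyzed this source before?
        return _source_cache[source]
    snippets = fetch_recent_snippets(source, n=20)
    # Classify each snippet independently, then aggregate the votes -- rather
    # than concatenating the snippets into one pseudo-document.
    votes = Counter(classify_snippet(s) for s in snippets)
    label = [pov for pov, count in votes.items() if count >= 0.3 * len(snippets)]
    _source_cache[source] = label
    return label
```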
But it actually turned out to be the case that it was better to classify the snippets independently and then aggregate those results than to throw all the snippets into one document and try and categorize that. And that's what this says. I can't believe I'm putting up something with the words neural net in it, but I am. That happens. So actually it worked pretty well. When people looked at the results, they thought they were, A, on point and, B, represented that point of view 80 percent of the time. I'm trying to remember what this means. I think it means that if you did the query yourself and looked in the top 20 or 50 things, what percentage of the things that you would actually say were on that point of view did it actually find. So this is conditional on what came back from that query. It's not actually casting into the whole Web to find something. Did that make sense to everybody? Here's a system that was built by our student Francisco Iacobelli. I really like this because it's really dirt simple. There are these metrics of novelty that people work on in a variety of text-tracking kinds of things. He actually invented a couple of new ones, but he basically said, look, I want to look at a news story and then I want to find other news stories on this topic that have new information in them. And he's defining new information extremely syntactically and simply. New information is just additional actors, additional data or additional quotes. So it's going to look through these related stories, either because they're already clustered together in a news source, or because he'll dynamically generate the queries a la Watson to do that -- our Watson again. And then he'll actually start digging through these stories to find stories that have this additional information. For example, this was done, as you can see, at the time last year when the BP well was out of control in the Gulf of Mexico. And it goes through the story in the Tribune and finds other stories like it that have additional actors, like Magellan Midstream Partners LP or the American Petroleum Institute. It finds these additional numbers. So, for example, this story doesn't talk about it, but it turns out that the technicians want to be able to find that the pressure readings in this containment cap -- it has to be able to handle eight to 9,000 pounds per square inch. And also that the vessels that will be collecting oil on the surface have to be able to pull between two and a half and three and a half million gallons a day of oil. That's what they're going to try to collect. These are the numbers it found. And finally, additional quotes. Here's a quote: "The hope is we can slowly turn off the valves, close the capping completely and test the pressure to see how well the well is performing," said the point man on the disaster. That's an additional quote. The metric he used -- again, he's finding a bunch of these things, and the question is, in general we didn't have a lot of real estate here, which two or three was he showing you? And here he used the inverted pyramid model. So the idea was that a new name, number or quote that was higher up in the story that you found it in -- it would actually prefer to show that one, on the grounds that some human being thought it was important enough to put it high up in the story. >>: To look at stuff that happened between this?
>> Larry Birnbaum: So this is not new in the sense of -- >>: Not new in terms of novel, but if I've already read this I may know previous history. So it's sort of like the notion of relevance versus interestingness; the things that are after this might be more interesting to me. >> Larry Birnbaum: It didn't. But it should have. You're right. I mean -- yeah, newness here just simply meant additional information as opposed to genuinely new. I could imagine an application certainly where -- there's the old story that all the estimates change, typically, like in disasters and so on, as they get updated. So you could certainly imagine that you would want to paw through and find the most recent numbers about something. And that's something I'd be interested in. I'm not averse, for example, to saying I'm going to have a relatively small model of a disaster. And by small, I don't mean a complete model of what natural disasters are like, but just that disaster stories go through the following progression. There's the original situation. There are some numbers and metrics associated with that, which will be the extent of the storm or the size of the tsunami, or the range of it. The next thing will be -- numbers will start to come in of people killed or injured, things destroyed. And so those numbers will start to build over time. And then eventually you'll start to hear about rescue efforts. Right? So that kind of model -- a relatively crude model of the time course of a natural disaster and its aftermath -- and we haven't done that particular one, but that's exactly the kind of thing that I'm talking about. I'm very happy to build a relatively simple model like that, just out of my head, and say, okay, now I'm going to be looking for tweets or whatever that meet the things in each of these stages that I've identified, right? As a way of sort of organizing the thing. And that will bring some of that temporality with it, but it's not automatically determined. I'm not going to talk about this. Or this. This was very precise, which was good. Actually, sometimes it would identify, I don't know how it did this, but it would identify as new information something that was not in fact new information. Or maybe it wasn't on point. By recall here, again, we mean of the things that came back from these queries or that were in this cluster that you should have identified as being, for instance, a new quote, did you actually identify it. And, again, honestly I can't tell you why it wasn't perfect. It should have been perfect. So all of these systems have this nice property that they're contextualized. I mean, the spectrum isn't quite that way. You saw the type-a-query-in end of the spectrum. But I know we're all on pins and needles now. There it is. [applause] Okay. 11.11.11.11. Is this five? I guess that's 5, right? 11.11.11.11. 5 11s. 5 11s beats six sigma, I don't know. That's what it reminded me of for some reason. It's an auspicious moment to be giving a talk. So we ended up getting -- we usually got rid of the search box, or we at least had something that was allowing you to identify these dimensions and bring back sort of interesting stuff. But the output was the same. It was like a list of results. At some point it dawned on us, as it dawned on a lot of other people, that you don't want 100 stories. You don't want ten stories. You want one story.
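Before moving on to synthesis, here is a minimal sketch of the "new information" check Francisco's system was doing, together with the inverted-pyramid preference; actor, number, and quote extraction are deliberately crude stand-ins for whatever the real system used.

```python
# A minimal sketch of the novelty check: compared with the story the reader
# already has, a related story is interesting if it contributes actors,
# numbers, or quotes not seen before, and candidates found higher up in their
# story are preferred (the inverted-pyramid assumption).
import re

def extract_facts(text):
    return {
        "actors":  set(re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)),
        "numbers": set(re.findall(r"\b\d[\d,.]*\b", text)),
        "quotes":  set(re.findall(r'"([^"]{20,})"', text)),
    }

def new_information(seed_story, candidate_story):
    seen = extract_facts(seed_story)
    findings = []
    for i, paragraph in enumerate(candidate_story.split("\n\n")):
        facts = extract_facts(paragraph)
        for kind in ("actors", "numbers", "quotes"):
            for item in facts[kind] - seen[kind]:
                # Lower paragraph index = higher in the story = more important,
                # on the grounds that a human editor put it there.
                findings.append((i, kind, item))
    return sorted(findings)[:3]   # the two or three items there is room to show
```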
And I think this is not a dream that's only our dream, but I think what we're all thinking now is that you're walking around in the world and sort of one information experience -- not meaning necessarily a story or a document, but one information experience that is tailored for you right now, that addresses your needs right now -- is dynamically constructed based on the information that's available in the world and presented to you, and you can interact with it. And I think that's what we're all kind of thinking now. Synthetic documents, in other words. So Kris and his student Nate Nichols built a gizmo called News At Seven. That was a lot of fun. It was in some sense our first foray into this. And what it did is it imitated Siskel and Ebert. You would give it the name of a movie and it would go to IMDB and to Rotten Tomatoes and it would pull out positive and negative reviews, and then -- first of all, it would look to see whether this was a polarizing movie or not, because it would do different things. If it was a polarizing movie, it would pull up positive and negative reviews. He, for instance, would become positive. And she would become negative, or vice versa. And then they would talk about the movie. In the meantime, in the background you'd have stills or B-roll from the movie. They would actually argue about it. And one of the reasons for doing this, by the way, is that synthetic voices are still kind of crufty. So it turns out listening to two of them alternating is not quite as obnoxious as listening to one drone on and on. So that was part of the reason there was that sort of presentational thing. But it worked okay. I'll tell you, my addition to this was -- it comes back to the idea of simple dimensions or simple semantics -- what do people talk about when they talk about a movie? What do they talk about? They talk about the acting. They talk about the directing. They talk about the music. The story. I mean, if they're really into it, they talk about the costumes or production values or cinematography. So the idea is there are dimensions of movie analysis or criticism, for want of a better word. To some extent, the system did try to kind of keep itself on point. In other words, if it found a quote from some guy saying I thought X was a great actor in this movie, it would come back with either, if it could find it, somebody saying they thought X was a lousy actor in the movie, or at least find some other person who was a lousy actor in the movie and talk about that. So you would actually get -- does that make sense? If it couldn't do that it would just change the subject. Which turned out to be amazingly effective. In other words, when you listen to this thing, you actually -- that got us to thinking how often did Siskel and Ebert actually talk to each other as opposed to saying whatever it is they thought they were saying. There were syntactic clues they were having a conversation. It wasn't real conversation, was it? I think the interesting question about a thing like this -- there's a lot of questions. One, it's a fun experience to watch. The next question is, is it adding any value in the world? In other words, those comments are already out there. You can go to Rotten Tomatoes and read them yourself. The question becomes do they gain any additional meaning by being embedded in a document like this or an experience like this -- is the whole greater than the sum of the parts, in other words?
Because if it is then we're adding value when we build something like this; if it is not, then we're not. I think we thought it was. I think we thought actually pulling things together like this made them more poignant or something. But we were -- you know, we were still sort of -- this is still working by bricolage. In other words, it's not generating anything. It does have a simple rhetorical model of a back-and-forth argument like this, and it does have a notion of the kinds of topics that matter in movies. But it doesn't have -- it's not generating any text on its own. So that got us back into the idea of generating stories, which is something we had thought about for a long time. We had this brain wave, which it turns out other people have had, too, because there are never any good new ideas in the world -- everybody's had them -- of generating stories from data. And actually, this got a lot of currency. So we actually hit xkcd, which excited me, because I'll have to tell you something, I didn't even know what the fuck xkcd was actually until one of my students showed this to me. It was like wow, we made xkcd, what does xkcd mean? But apparently it meant something good. "Weighted random number generator" -- I love that "weighted" -- "weighted random number generator just produced a new batch of numbers. Let's use them to build narratives, all sports commentaries." This is what we build. Let me tell you something about how we built it. One of the things we've been doing in the last few years is teaching joint courses with the journalism school. Northwestern has a very famous school of journalism called the Medill School of Journalism. And we started teaching joint courses with them about three years ago. What I mean by that is these are joint projects classes where we take computer science students and journalism students and we stick them in these classes, and we stick them on teams together. And we give them an assignment. And then we mentor them gently yet firmly to try and get something built in the course of a quarter. And we actually gave them this project to generate sports stories from -- and even the sports stories thing was sort of accidental. We were thinking about what should we do? We thought, well, it has to be a field where there's a lot of data. We thought sports or business? We decided to go with sports. I think that was a good idea, actually. I think business -- it turns out that -- so there's now a company, Narrative Science, which is commercializing this technology. I hope so far successfully. So far so good. But obviously business is much more a part of their product line than sports is, because that's where the money is. But sports turned out to be a really good thing to do. We haven't hooked it to a random number generator yet. Although we probably should. We thought about fantasy sports, actually, and doing that kind of thing. Let me show you what it builds. So here is -- this is not the one that was built in the class. So we built -- we had two great journalism students actually on this team with us. One of them was actually a developer also. And one of them was a budding sports journalist, which is why sports turned out to be useful. This is data from a game -- this is not from the proof of concept that came out of the class but from the first real prototype that we built in the lab.
We love this project and we hired those students afterwards to actually work with us. This is the line score, the box score and the play-by-play from a game a little more than two years ago, the White Sox in Seattle, in fact. I should say about the play-by-play that it's actually in English. Play-by-plays are telegraphic English, but it's a limited-vocabulary English. This is things like X gets a double, X strikes out, things like that. Okay? And we just put them in this sort of quasi-numerical form to make it clear that we're not -- it's not really English exactly. And the main thing I think that's clear is, by the way, you can take a wall of data in and produce a wall of text. That's not a story. Right? I mean, anyway, you push this little button here, and this is not cached so -- >>: How much do you need to know about baseball to do this? >> Larry Birnbaum: I'll show you a little bit about what it knows about baseball. "Down to their last out, Griffey saves Mariners. Seattle - The Seattle Mariners were down to their final out when Ken Griffey came to the plate against reliever Tony Pena on Wednesday, but he delivered a single to carry the Mariners to a 1-0 victory over the Chicago White Sox in a 14-inning marathon at Safeco Field. After Adrian Beltre singled and Jack Hannahan walked, the game was tied." I like that. It was 0-0. "When Griffey came to the plate against Pena with two outs and runners on first and second, he singled, scoring Beltre from second, which gave the Mariners the lead for good." This is a pretty good story. Actually, this story is about as good as -- we haven't done any formal evaluation, but a number of sports writers -- there was a sports writer at the Toronto Star who, on his own, without having any conversation with us, grabbed one of our stories and a story he had written about that game and did kind of a side-by-side comparison. And he was pretty happy, actually. He noticed ours was grammatically correct and his wasn't. But I don't think that's so important. By the way, I want to make it clear that we have nothing interesting to say about text generation per se. And we make no claims about that. It's a very standard phrasal -- it's a special-purpose phrasal generation with gaps that walks the tree and generates the sub-pieces and so forth. There's nothing there. Well, there's something there, but that's not the part that we really -- there's nothing innovative there, why don't I put it that way? >>: [inaudible] how do you know that the facts are correct, for example; that Ken Griffey works for the Seattle Mariners? >> Larry Birnbaum: This comes from a company called Stats, Inc., which has the contract to do most major league sports in the United States. It's owned jointly by the AP and Fox Sports. It's actually based in Northbrook, Illinois, and they do -- their stats are pretty good; occasionally there are problems. We can talk a little bit about the issue of data validation, which came up today also. >>: So that's what would have listed the whole list of facts. >> Larry Birnbaum: You give it that. Data validation is a big issue for us, obviously, if you don't get the right data. Let me talk a little bit about what it knows -- let's talk about what it does, first of all. It does two things, basically. It does analytics based on -- some of it is based on sort of sabermetrics ideas. This is based on Bill James' game score model. These things are actually -- it also uses predictive modeling to help identify useful positions in the game.
In other words, if you predict something is going to happen and after the play your predictions change a lot, that was an important play. And also threshold crossing. So I don't want to actually make this sound too complicated. I mean, if a company is cash flow negative one quarter and cash flow positive the next quarter -- crossing zero is always a big deal, you know what I mean? There are a couple of simple things like maximum, minimum, thresholds -- you can think of them yourself -- that turn out to be drivers for potentially interesting events and things in situations. And after having built all these derived features, it then builds this outline, and the outline -- actually I'll go back to it in a second -- the outline shows what it thinks is going on. So this system is kind of a big expert system. And what it has is big semantic models of situations like this that it's trying to match to this game. In this instance, the one that works here is called take the lead late; that's the model that actually matched this. And if you think about what they are, you can think about -- I'll actually tell you about the rhetorical structure of a baseball game recap. We talked about it last night. We thought the reporters would actually have well thought out rhetorical models and kinds of stories. They don't. And I think this is because reporting -- I think maybe long form, like magazine reporters, might. But news reporters don't, and I think it's because it's an ephemeral medium. It's like it doesn't matter. It doesn't have to be well written. It has to work. And so they don't really [inaudible] about exactly what does a baseball game recap look like. They just write them. Here's what a baseball game recap, it turns out, looks like. The first part of the baseball game recap is what happened and why it happened. What was the outcome and why was that the outcome? What led to that outcome? The second part of a baseball game recap is the pitching; you must talk about the pitching. The third part of the baseball game recap is other interesting stuff that happened, in chronological order or, alternatively, organized around the players. So those are the two ways you can do that section, okay? And different kinds of patterns matter in different parts of this structure. So, for example, at the beginning part what matters is was it a come-from-behind victory, was it a take-the-lead-late, was it a blowout, was it back and forth the whole way, was it a piling on. You can do it from either point of view. You can have a comeback, or you can have a comeback that almost makes it -- you can almost have a late victory but not quite. And from the point of view of the people who actually won, that's called holding off. You held them off at the end, right? And in addition to those, it has a bunch of other kinds of angles that are around sort of how did that come to be. For example, it has an angle for heroic individual effort, where it can look in the game data and find that this player in these places made the difference, versus great team effort. And again these patterns are visible in the data. It finds them and uses them to actually figure out how it's going to describe what happened in the game. I hope that makes sense. >>: I'm curious, is there any kind of like -- I'm wondering, it seems like sometimes these stories probably talk about events that are sort of external to the game, like there's some scandal involving the players or something like that.
So have you thought about looking at Wikipedia or Freebase or something? >> Larry Birnbaum: We are interested in pulling out -- first of all, you'll notice that the thing actually pulled out Griffey's picture; it had to go somewhere and do that. I think it's Griffey's picture. I can't believe we have to -- I'm sorry that -- it doesn't even cache it then. It is Griffey's picture. But it doesn't really -- so we do go out and find stuff. And one of the nice things about doing this in this context is your query is very specific. Because you know exactly what you're looking for. Like I'm looking for a quote by this player about this play in this game, in this location, at this time. Okay? I would be very excited -- we haven't quite done this yet -- to analyze tweets from people who were at the game. I think that would be awesome -- and pull them back, especially around things like that, to get that extra stuff that happened that you can't quite get. And again you have to be able to anticipate what that might be. So I think it's pretty obvious that you can get either people's reactions to, or descriptions of why, or better descriptions of particular events. If a spaceship lands in the middle of the field during the game I'm not sure what we'll be able to make of that. You know what I'm saying? But if somebody hurts themselves -- you can expect that somebody might get injured in the game. You actually should be looking out for that, right? I'm sorry. Excuse me. This was just there in case I didn't have connectivity. So we actually are doing Big Ten football, baseball, basketball and women's softball games -- previews, in-game updates. We tweet during the game and we do recaps after the games. You might say, well -- I want to talk about why we think this matters. So the thing is there's no point in having a machine write Big Ten football game stories. Big Ten football is a big deal, there's a lot of money in Big Ten football, and there will be no problem at all paying human reporters to go and write those stories. It's not a problem, okay? Big Ten women's softball, okay, this is, it turns out -- here's a story from last year: "Kramer blanks Oakland to help Spartans win. It wasn't her best effort, but Lauren Kramer was the backbone in the Michigan State Spartans' 4-0 victory over the Oakland Golden Grizzlies on Thursday. The Golden Grizzlies weren't able to score a run off Kramer during her seven innings on the mound. She allowed five hits, two walks, struck out nine. Biggest win for Michigan State since their '92 win over Eastern Michigan on March 30th." And we can get some historical information in there as the season progresses. You can actually start to put in horse-race angles around getting into the playoffs, for example. You can start to change the stories toward the end of the season; they can become richer that way as well. So it turns out that the Daily Northwestern publishes about two to three stories about women's softball, about Northwestern's softball team, every season, and the team plays 20 games, 25 games. Most of these games go unreported. And if you ask, even by the school newspapers, okay, if you ask who wants to read a story about women's collegiate softball, the question is: Who does? Who wants to read a story about women's collegiate softball? The kids on the team, their parents, their friends. Alumni who used to be on the team. And in this great land of ours, I'm sure there are some people who actually care about women's collegiate softball just because they do.
It's got to be -- it's a big country. So out of how many is that -- 200 people, 2,000 people? Probably between 200 and 2,000 people. Yet the stories never get written. They're never written. How much does it cost to write a story? It turns out that a stringer will be paid 50 to $75 to cover a sporting event by a newspaper. That's a labor of love; if he has to pay to get into the game, that's included. That comes out of that money. Transportation comes out of that money. Okay. It's a labor of love. It's something people do because they love to do it, and they go and do it. But it's still too expensive for women's collegiate softball. So that's the idea. The idea is to actually say, look, there are things that just don't get reported on that you can get reported quickly and easily and cheaply this way if you can get the data. There's a company called GameChanger, which has a baseball scoring app for Little League and middle school and high school baseball teams, available for iPhones and Androids. We generated 300,000 Little League stories last summer. This program is the nation's most prolific author of women's collegiate softball game stories, right? And they're decent stories. What's the larger take-away here? There's a lot of data. We all -- I mean -- >>: [inaudible] reading these stories? >> Larry Birnbaum: These stories will show up on BigTenNetwork.com. You can read these stories on BigTenNetwork.com. We're also doing some very simple stock market stories. So the company Narrative Science now has some actual clients. And some of them will attribute and some will not attribute, depending on whether they actually want to make it clear that these are stories written by machines. Some are happy to say they are. And many of them, it turns out, are not so happy to say that. And we've gone back and forth on whether that matters to us or not. Two customers who I can actually tell you about are BigTenNetwork.com, which is a joint venture of the Big Ten and Fox Sports again, and also Forbes -- we do stock market updates on Forbes. Those stories are much simpler, by the way. But anyway, if you want to read these you can go here. You know, there's a lot of data. This is a complex point. There's a tremendous amount of data out there. We all understand that the data is an incredible opportunity and an incredible challenge at the same time. The data has to be made actionable if it's going to make a difference. And what will it mean to make it actionable? Either action will be taken by the machines on the basis of the data -- and that's going to happen a lot -- but sometimes the data is going to have to be presented to people to make decisions, and if you ask how are we thinking about doing that right now, I would say for the most part people are thinking about visualization. So a lot of work in visualization comes down to we're going to have to figure out a way to make the data understandable to people. So I'm just going to say that narrative is actually a time-honored method of conveying information. And for some kinds of things, especially things that have somewhat of a temporal component, it seems like a really good way of actually presenting that. We talked about all these things. Here's a partial list of the baseball angles in case you're interested in what it knows about baseball. These are the kinds of patterns it has that it's looking for. And again, these are all built by hand.
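To make "built by hand" concrete, here is a minimal sketch of what one such angle could look like as a predicate over derived features of the game. The feature definitions and thresholds are illustrative assumptions, not the ones the actual system uses, and a real implementation would work over the full play-by-play rather than inning-level score differentials.

```python
# A minimal sketch of hand-built angles as predicates over derived game
# features. diffs[i] = home score minus away score after inning i; thresholds
# are illustrative only.
def derived_features(diffs):
    return {
        "final_margin": diffs[-1],
        "max_deficit":  -min(diffs),
        "lead_changes": sum(1 for a, b in zip(diffs, diffs[1:])
                            if (a <= 0 < b) or (a >= 0 > b)),   # zero crossings
        "late_go_ahead": diffs[-1] > 0 and all(d <= 0 for d in diffs[:-3]),
    }

ANGLES = [
    ("blowout",            lambda f: abs(f["final_margin"]) >= 8),
    ("come_from_behind",   lambda f: f["final_margin"] > 0 and f["max_deficit"] >= 4),
    ("take_the_lead_late", lambda f: f["late_go_ahead"]),
    ("back_and_forth",     lambda f: f["lead_changes"] >= 4),
]

def best_angle(diffs):
    f = derived_features(diffs)
    for name, matches in ANGLES:        # ordered by editorial priority
        if matches(f):
            return name
    return "straightforward_win" if f["final_margin"] > 0 else "straightforward_loss"
```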
So a lot of our job as technologists is to come up with tools that make it easy to do this, and libraries of these things that allow us to do abstraction properly. I'll give you an example. A team could win because of one player doing a great job or because everybody does a good job; a company can meet its earnings expectations for the quarter because every division did a little bit better than expected or because one division did dramatically better than expected. Those strike me as very analogous in my mind. Right now when we write these things we're writing them separately, but I think in my mind we have to actually get these abstractions right so that we can do that. I'll give you another example of the kind of thing I'm very excited about right now. We're coming to the end of the talk and I apologize, I'll summarize quickly. I'm very excited about situations like this. A team gets a lot of hits, gets a lot of men on base, doesn't score a lot of runs, loses the game. A team gets a lot of yardage, completes a lot of passes, gets a lot of first downs, doesn't score a lot of touchdowns, loses the game. A company, you know, increases market share, launches new products, blah, blah, blah, still doesn't improve its profits very much. What do these situations have in common? What they have in common is you're achieving your instrumental goals and you're failing to achieve your ultimate external objective. Okay? So sometimes this is called failing to capitalize on your opportunities, right? You build the preconditions for achieving some success, and then fail to ultimately put it together to achieve the success. That's a good example of a kind of semantic angle that I'm interested in getting into these machines to be able to tell those stories. I'm also -- go on. >>: You didn't talk a lot about the choice of vocabulary. Is that just not interesting or do you use -- >> Larry Birnbaum: I think it's really interesting. But I have nothing interesting to say about it. I think it's a phenomenal -- I think it's a great problem. But so that -- >>: You always use the same vocabulary? >> Larry Birnbaum: We have a lot we do. So one -- so let me tell you a couple of things about that. So the variance I am most interested in is the variance that makes a meaningful difference. In other words, I'm telling it from the winner's point of view or the loser's point of view, or this is really what happened in the game, so it changes how we're going to talk about it. It is certainly the case that if you read a lot of our stories over and over again, we should actually be thinking about other kinds of variation. I wish I could say to you that every single variation we put in there is a thoughtful decision. It isn't always. You obviously have to put some variation of lexical choice in just so they don't all read the same way, even though -- and I think that's fine. I have nothing against that. That's probably the place where we're going to start doing our mining, by the way. I mean, that's a place where I think we could do a better job gathering that stuff statistically. I mean, I think starting at the edges is where you probably would make progress automating some of this stuff. I could certainly see doing the Marti Hearst kind of thing where I take, for instance, a bunch of stuff about the team, or a bunch of the data and the names that were relevant to a sentence that we generated, and go out and find a lot of other sentences that mention the same things and see what words they used.
At the very least you could present that to a human, even if you didn't do anything more intelligent with it. I'll also say: this is what a comeback looks like in baseball, this is what a comeback looks like in basketball, and this is what a hold-off looks like in basketball. We don't do this either, but I'm kind of interested in the idea of getting a program to say: here's a story with the same angle and the same actors but different details; same angle, same details, different actors. This is sort of compare and contrast in some ways, right? Same angle across domains, or an opposing angle. Like, the last time they were in this situation they held them off, but this time they failed to hold them off and they lost the game, right? I think that's kind of nice -- that will enrich the stories a lot. I don't have a lot more time, so I'm going to talk briefly about themes. These angles are kind of like big themes. One thing I actually came to realize about sports -- and now is the time for me to confess that I'm not actually inherently interested in sports or sports writing, so this has been a slightly ironic moment in my career, really -- is that the reason people like sports is that sports are actually about human virtues and failings. They're about persistence or failing to persist, heroism or failing to be heroic -- they're sort of Greek in that way. They're about big human themes, really. And I'm really interested in these kinds of things. They're very hard for machines to actually detect -- to look at a situation, or actually at a story, and recognize them. Here's an example of what I mean. The London riots just happened. A number of people characterized these riots by mentioning the epigram "idle hands are the devil's playground." That's what I mean by a big theme. Here's another one. When the economy was melting down about three or four years ago, Hank Paulson was the Secretary of the Treasury and he was in charge of the TARP, and it was not uncommon to say that putting Hank Paulson in charge of TARP was like setting the fox to guard the hen house. I'm interested in this kind of tagging because it's not topical tagging. Topical tagging would be: it's a business story, or it's a political story, or it's a story about the TARP. This is not topical and it's not sentiment; it's sort of something else. And I'm kind of excited about it, just to see that there were 105,000 results for this. "Paulson TARP glitters gold" actually got a fair number of results, too. Now, I'm not sure I want to use these numbers. We're actually looking at whether these numbers alone are sufficient to know whether people really thought this was a good label or not, but I think we're going to have to look inside the actual documents and see whether somebody actually said "this is like this." That's kind of the ultimate gold standard for that. By the way, news stories don't do that. Reporters won't say it, because that's opinionizing; they'll only do it if they can find somebody to say it whom they can quote. Headline writers will say it, because headline writers have more leeway. And comments on news stories -- that's where people put these labels on them. It's in the comments on the stories that you see this kind of thing coming out.
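Before the next example, here is a minimal sketch of the kind of non-topical, non-sentiment tagging being described: look for a theme's cue phrases in text, such as reader comments, that also mentions the story's actors. The theme lexicon and the substring matching below are toy assumptions of mine, not the system under discussion.

```python
# Sketch of thematic tagging: does a figurative theme show up in text
# (e.g. reader comments) alongside the story's entities?
# The theme phrases and the simple matching rule are illustrative only.
THEMES = {
    "fox guarding the hen house": ["fox to guard the hen house",
                                   "fox guarding the hen house"],
    "idle hands": ["idle hands are the devil's playground",
                   "idle hands are the devil's workshop"],
}

def thematic_tags(story_entities, comments):
    """Return themes whose cue phrases appear in comments that also mention
    at least one of the story's entities."""
    tags = set()
    for comment in comments:
        text = comment.lower()
        if not any(entity.lower() in text for entity in story_entities):
            continue
        for theme, phrases in THEMES.items():
            if any(phrase in text for phrase in phrases):
                tags.add(theme)
    return tags

comments = ["Putting Paulson in charge of TARP is like setting the fox "
            "to guard the hen house."]
print(thematic_tags(["Paulson", "TARP"], comments))
# {'fox guarding the hen house'}
```

The "gold standard" he mentions corresponds to the entity check here: the theme only counts if someone explicitly connected it to these actors.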
This is an example I want to show very quickly of something we built recently. I actually had nothing to do with this. I was very excited, though, because as a teacher you're always excited when your intellectual DNA, for better or worse, shows up in your students. Our student Patrick McNally and a student of his named Don -- I've forgotten his last name -- came up with this idea of looking for the seven deadly sins on Twitter. And, again, it's a relatively simple model. But here they are. Here's sloth: "Does anyone know if Mabel, Bedford, Garfield, or Warrens will see snow? Too lazy to look." Gluttony -- this one is for you, Susan: "Could really go for some peanut butter cookies," a craving. Greed: "I need my money pronto." "Craving Starbucks coffee." "Wow, how hard is it to block the Raiders? This is pissing me off." "It's pissing me off, stands there with a welcome mat." "I ate so much sushi, now I'm in pain." It's actually kind of fun to watch these, and I was thinking that if we can get accurate enough, we can imagine doing "today in world sin" -- what's today's winner? >>: Topic modeling. We can treat it as a semi-supervised learning problem. >> Larry Birnbaum: Yeah, I think so. What I like about it -- so, A, I think that's right. But, B, what I like about it is the angle on things. This is kind of a way of looking at the data, and maybe this really gets to the point: to actually blow this out will often require a lot of statistical work, but what I like about it is that it's a way of looking at the data which starts from some thematic point of view. It says the sins are a particular lens through which to look at what people are saying about the world. I'm going to skip this one.
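To give a flavor of what using the sins as a lens might look like mechanically, here is a minimal sketch that tags tweets from a tiny hand-built cue-phrase lexicon. The lexicon and the matching rule are toy assumptions on my part; the actual student project presumably used a richer model, as the topic-modeling comment suggests.

```python
# Sketch of a "lens" classifier: tag tweets with deadly sins using a
# hand-built cue-phrase lexicon. The lexicon here is a toy assumption.
SIN_LEXICON = {
    "sloth":    ["too lazy", "can't be bothered", "skipping the gym"],
    "gluttony": ["ate so much", "craving", "could really go for"],
    "greed":    ["need my money", "want it all", "pronto"],
    "wrath":    ["pissing me off", "so angry", "furious"],
}

def tag_sins(tweet: str):
    """Return every sin whose cue phrases appear in the tweet."""
    text = tweet.lower()
    return [sin for sin, cues in SIN_LEXICON.items()
            if any(cue in text for cue in cues)]

tweets = [
    "Will Mabel or Garfield see snow? Too lazy to look.",
    "Could really go for some peanut butter cookies right now.",
    "Wow, how hard is it to block the Raiders? This is pissing me off.",
]
for t in tweets:
    print(tag_sins(t), "-", t)
# ['sloth'] - ...   ['gluttony'] - ...   ['wrath'] - ...
```

The "today in world sin" idea would then just be a matter of tallying which lens fires most often across a day's worth of tweets.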
A couple of quick current projects. This is what we're doing in our course this quarter. We're doing this Twitter profiling system: news and blog recommendations based on your tweet history. This solves a problem: if you're a big organization, people click all over you all day long or type queries to you, so you can begin to understand who they are and what they care about. If you're a smaller organization, where do you get the data from? Well, if your users are on Twitter, you might just get the data from them, if they'll tell you what their Twitter handle is. Actually, it turns out Paul and his team have a project something like this, I learned this morning. The news of the day -- I kind of like this one. It's sort of been done, but it hasn't really been done. There are a lot of things out there like CRM systems -- salesforce.com, for instance, will, if you're going to meet with somebody, bring up the entire history of meetings with them, but also your e-mails with them, or their LinkedIn profile gets brought up automatically. I think there's a system called Supportive that does that. But I haven't seen one that brings up news about the company -- one that actually looks at your calendar -- though there may well be one. In any case, we gave it to these students. It looks things up; at the very least you'll know if the stock tanked this morning before you go in the door to meet somebody, or vice versa, or a new CEO was named, or a new product was launched, whatever. It tries to look up the people you're meeting with and the organization they're from and just display that to you, based on your calendar. The notion was, if you're a honcho, of course, somebody already does this for you -- I'm sure it's still printed out, for that matter. I'm sure for a lot of people it's put in a manila folder and handed to them in the morning, something like that, and they get in the back of their car and it's given to them. This next one is a thing to do automatic news and social media aggregation for city council elections. It was actually done for the Seattle city council elections that happened last week, because Chicago's [inaudible] elections were last year, so we wanted to pick one that was current. I then discovered that Seattle has a very bizarre system where everybody is at large, but they run for at-large seats against each other, which makes no sense to me whatsoever, but okay. It's really crazy -- in other words, there's a seat called seat 1 and two people compete for seat 1 citywide, and a seat 3 and two people compete for seat 3 citywide. It's a very odd model. So this is not a deep project, but it's a good example of where there's a need. There are a number of clever people in the news business who have done things like: I'm going to do automatic aggregation from the campaigns' Twitter feeds and their Facebook pages, and I'll automatically look up the news; I'll have some query running in the background that will always feed me stuff. It's surprising how many news organizations do this by hand. A news organization I visited earlier this year was very proud to show me a thing they had done for a city council election, and I looked at it and everything was stale. I thought, wow, this is kind of stale. He said, I'm not updating it anymore. I said, what do you mean? He said, I actually put all the data in and curate it by hand; every day I put in the news stories and the links. I'm like, what, are you kidding me? But okay. It's not a big deal. It's not a big project, but if we make it simple and configurable, this will matter to news organizations. These things are not rocket science, but if you can make them robust, reliable, and easy to use, they will get used. The porkometer: we're looking to see if we can identify earmarks in federal legislation. I'd like to do state legislation as well if we can get this to work, and I think we can. Twitter Debate is sort of the Twitter version of News at Seven. Tweets are atomized things, and I'm wondering if you can make them more meaningful by putting them in a context. Originally we thought we'd say Perry on Social Security and Romney on Social Security and have a back and forth. Twitter doesn't afford that; you can't do an extended debate in Twitter. What I think you can do is A-B-A: point, counterpoint, final comeback. That's the rhetorical structure we're trying to organize tweets into, topically, for political candidates. So that's that project and, again, I think it's kind of fun.
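As a rough illustration of that A-B-A idea, here is a minimal sketch that picks on-topic tweets from two candidates and arranges them as point, counterpoint, and comeback. The matching rule, the data shapes, and the example tweets are all invented for illustration; this is not the project's actual code.

```python
# Sketch of the Twitter Debate idea: arrange two candidates' tweets on one
# topic into a point / counterpoint / final-comeback (A-B-A) structure.
# The matching rule, data shapes, and example tweets are invented here.
from typing import Dict, List, Optional

def on_topic(tweet: str, topic_terms: List[str]) -> bool:
    text = tweet.lower()
    return any(term.lower() in text for term in topic_terms)

def build_debate(tweets_a: List[str], tweets_b: List[str],
                 topic_terms: List[str]) -> Optional[Dict[str, str]]:
    """Pick an A-B-A sequence of on-topic tweets: point, counterpoint, comeback."""
    a_hits = [t for t in tweets_a if on_topic(t, topic_terms)]
    b_hits = [t for t in tweets_b if on_topic(t, topic_terms)]
    if len(a_hits) < 2 or not b_hits:
        return None
    return {"point": a_hits[0], "counterpoint": b_hits[0], "comeback": a_hits[1]}

candidate_a = ["Social Security is broken and everyone under 40 knows it.",
               "I'll say it again: we must fix Social Security for the next generation."]
candidate_b = ["Scaring seniors about Social Security is irresponsible."]
print(build_debate(candidate_a, candidate_b, ["social security"]))
```

Requiring two on-topic tweets from the first candidate is what supplies the final comeback; anything richer, such as stance detection or ordering by time, would be layered on top of a skeleton like this.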
A quick blurb about us. We just got this -- I mean, we've been doing things like this -- and I think, as everybody discovers, you build a cool project and getting it actually out into the world is an entire other universe of pain, one that you may or may not actually have any capacity for. Even in an organization like this, where you're sitting on the edge of an entire gigantic campus of people whose job it is to build and ship products to people worldwide, I'm sure getting their attention, getting their mind share, and getting them to say, no, we really do want what you have, is a nightmare, isn't it? And we have the same problem. We build these really cool prototypes, and then they just fall on the ground and nobody ever picks them up and uses them. So we got tired of that. The Knight Foundation, it turns out, was also tired of that. They had been seeding a lot of technology in the news space for the past five years and felt it had a good impact in terms of building interest and expertise in the field, but the specific technology projects were not necessarily getting picked up and used. So we started talking together about our joint pain here, and eventually decided that we would build an incubator, or development system, for news technologies at Northwestern. So technology that we invent, and also technology invented elsewhere, will come into that pipeline and get built out. There's a fair amount of money behind it -- about $4 million over the next four years, which goes a long way in a university. And I'm hopeful. I'm loving what Narrative Science is doing right now, but that's an example of going the other way. I believe -- I hope -- we'll be able to have our technology have impact through that route too, but that's a big deal: you're starting a company, you're getting venture capital, you're getting management, and it's a completely different thing -- again, it's a lot of work to do that. I'm hoping this will be a way of actually speeding that up, at least in the early stages. >>: [inaudible] in this case? >> Larry Birnbaum: Well, we're starting out with media partners in the Chicago area. So large and small media organizations, very broadly construed -- it could be non-profits or governmental organizations. You know, the PTA, or the Sun-Times, or WBEZ, which is the local public radio station. So those are the organizations we're starting to work with, and we've actually hired an executive director from the media business in Chicago who knows all these people personally, which has been a really great boon to us. And we're starting to hire some developers, because again there are things you can't ask graduate students to do, like, you know, use version control. I mean, they do use version control: the date is appended to the file name. That's the kind of version control. I understand that, because I was a student, too. But when you're actually going to ship product to the Chicago Sun-Times, it turns out you can't possibly do that. So, yes, we will use Git, and we'll actually do these kinds of things. So it's pretty exciting. I'm looking forward to it. I think the Twitter profiling project is probably the first thing we'll be shipping out. And if these things work -- we're not really sure what the final form is. If something actually clicks, we might just put it out open source, or it might be the case that other companies get started. I want to thank my research partner Kris Hammond and all the students. Nick Allen and John Templon built the narrative generation system with us. Jay Budzik did Watson, along with a number of other folks there. Francisco Iacobelli did Tell Me More. Jiahui Liu did compare and contrast. Patrick McNally did the last thing I just showed you, on lust and -- I guess lust was not up there, but sloth and gluttony. Nate Nichols did News at Seven. Sean is doing the profiling thing. I guess I'll stop there. Earl Wagner helped me dream up the thematic tagging stuff, and actually did more of it than me. And these are my colleagues over in the school of journalism. Thank you very much. [applause]. >> Larry Birnbaum: Further questions?
I guess people have asked them. Thank you. I enjoyed that. Let me know if any of these things seem interesting to you. I mean, I'm really here partly because I feel like, you know, our kind -- I realize this is still running, so I'm going to choose my words. We tend to be pretty seat-of-the-pants, I'll put it that way, in terms of how we get things built, and a lot of the time that works pretty well, and sometimes it doesn't work quite as well. So if any of these projects seem interesting to you, I think we'd certainly be interested in talking about what we might do to push them forward together. >>: You'll be meeting a lot of folks around three or four thematic groups. One is Web, one is social, one is NLP. >> Larry Birnbaum: Right. Thank you very much.