>> Christian Bird: Today we have the opportunity to have Emily Hill come in and visit us for a day and give us a talk. She did her PhD at Delaware with Lori Pollock and for the past year, almost year and a half, she has been an associate professor at Montclair State, and she is here visiting us today to talk about her work on natural language programming and software engineering. Take it away. >> Emily Hill: All right. In general my research is motivated by the problem where you've got this huge source code base and someone's got to maintain it, and the poor maintenance developer needs to somehow identify the code that they are looking for. There are a couple of steps that they take in trying to locate that code. If they don't have an expert available to tell them where to look, then they have to do something else. One way to locate relevant methods and fields is by searching the source code and trying to look for the big regions of the code that might be relevant, and then further exploring those areas to refine their understanding and really see what else is relevant to exactly the task that they are trying to solve. Today what I am going to talk about is how we can use the natural language in the source code, the words in the comments and the identifiers, to help the developer search and explore and understand their code more effectively. In fact, research has shown that developers spend more time finding and understanding code than fixing bugs, so we can help reduce the high cost of software maintenance if we can speed up this process. So what are the current approaches that developers typically use in addressing these issues? Well, there is a wide variety of navigation and exploration tools, and those are commonly built into IDEs. They use the program structure, like the AST, the call graph, and the type hierarchy, and they allow the developer to jump to related source code. These are techniques that developers use all the time, and they are great. They take advantage of the program structure, but sometimes they can be predominantly manual and slow for very large and scattered code bases, because each navigation step has to be initiated, and if locating your code takes multiple steps, every time you are locating a new piece of code you are initiating navigation step after navigation step to navigate that program structure. So what is an alternative? Well, there are search tools which work similar to how we search the internet using either Google or Bing, and they apply string matching with these comments and identifiers, and so they do allow you to locate large and scattered code, but they tend to have a problem with returning many irrelevant results and missing a lot of relevant ones, because if the developer enters a query and it doesn't match the words the original developer used in the source code, then the search is not going to return anything relevant. So both tools have strengths, but both also have challenges. So how can we go about improving these software maintenance tools to help facilitate software maintenance? Our observation is that programmers express concepts when writing code. They use the program structure: if statements, method calls, the algorithmic steps, the order they organize their statements within their code; but also the natural language: the words in the comments and the identifiers. So our approach is to leverage both of these sources of information to try to build more effective software engineering tools, and our specific target is software maintenance.
So let me give you an example of combining program structure and natural language information together. Let's say we have an auction sniping program. It will allow us to automatically bid on an eBay auction online, and we are looking for the code that implements adding an auction, so the user is going to add an auction to the system, and I happen to know from prior experience with the system that DoAction is the method that handles all user triggered events. If I am just using program structure I can see that DoAction calls 40 methods. That is not terrible, but only two of those 40 are relevant, so going through that list of 40 is a poor use of the developer's time. If I use natural language alone and search the entire code base, I get about 50 methods, 90 matches across 50 methods, and I located the two relevant ones, but I also located tons of irrelevant ones. But if we combine this information and put it together, we can locate the two relevant ones with just one false positive, so narrowing our list of 40 or 50 methods to just three for the developer to look through. So we wanted to try to combine program structure and natural language to help us improve tools and get better information. Uh-huh, oh yes, please, feel free to interrupt. >>: Was that an intersection, the program structure answers and the natural language answers, to get to… >> Emily Hill: Yes. Basically we used natural language search techniques on the program structure, so the subset: we only searched the 40 callees of DoAction. Good question. >>: When you say sort of natural language alone, are you talking about static analysis at all or are you just saying just the comment sections? >> Emily Hill: So comments and identifiers, so any text that shows up. >>: But any sort of syntactical analysis, are you doing that also? When you say natural language? >> Emily Hill: Usually I mean bag of words at the base level, although we have been working to build more semantic and syntactic analysis on top of that. I will actually show you what I mean by that [laughter] down the road. But strictly natural language information, what it boils down to is somehow using the words, whether that is just a straight list of the set of words in a method, or whether it's more advanced than that. And actually thank you, that leads me right to my next point, which is that when using this natural language information and combining it with program structure, it is not enough to use the words alone independently. The context of how the words appear is very important. For example, we have three occurrences of the word map. We have map object in a method name, where map is playing the role of a verb or an action, versus object map, which is like a hash map that contains objects, so that is really its noun sense, and then we might have the words map and object just on two completely unrelated statements in the method, but the word map shows up. So the context of how that word is appearing, if it's a query word, is very important in improving our accuracy for the search, as well as the location of the word. For example, a method signature is typically a better summary or abstraction of what a method is doing than a random word just anywhere in the method body, and so we try to leverage that information to help improve accuracy as well. So let me show you an example of why using context and location is so important.
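Before that example, as a rough illustration of the DoAction intersection just described, here is a minimal Java sketch. The MethodInfo record and the toy callee list are hypothetical stand-ins, not the actual auction-sniping code, and the real callee list would come from a call-graph analysis that is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal sketch: intersect a structural neighborhood (the callees of DoAction)
// with a natural-language match on the query words "add auction".
public class StructurePlusTextSearch {

    // Hypothetical container for a method's name and its comments/identifiers as text.
    record MethodInfo(String name, String text) {}

    // Keep only the callees whose text contains every query word.
    static List<MethodInfo> searchCallees(List<MethodInfo> callees, String... queryWords) {
        List<MethodInfo> hits = new ArrayList<>();
        for (MethodInfo m : callees) {
            String text = m.text().toLowerCase(Locale.ROOT);
            boolean allMatch = true;
            for (String w : queryWords) {
                if (!text.contains(w.toLowerCase(Locale.ROOT))) { allMatch = false; break; }
            }
            if (allMatch) hits.add(m);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy stand-ins for three of DoAction's 40 callees.
        List<MethodInfo> calleesOfDoAction = List.of(
            new MethodInfo("DoAdd", "add a new auction entry to the auction list"),
            new MethodInfo("DoPasteFromClipboard", "paste auction ids and add each auction"),
            new MethodInfo("DoSave", "save the current configuration to disk"));

        // Search only within the structural neighborhood, not the whole code base.
        searchCallees(calleesOfDoAction, "add", "auction")
            .forEach(m -> System.out.println(m.name()));
    }
}
```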
So for example, I like adding things, so let's say we are searching for add item in a piece of shopping cart software. On the left I have a method add entry and on the right I have a method called sum. So both are different senses of the word add. When I talk about context I am talking about going from lexical concepts, which is the individual word itself, commonly referred to as a bag of words approach in information retrieval, to phrasal concepts. So if we look at just straight word occurrences, both of these methods contain the words add and item; both equally match. But if we evolve that to phrasal concepts, so concepts that consist of multiple words, we can see that on the left-hand side, add entry actually is adding an item, whereas sum is adding a price. So by taking advantage of these phrasal concepts, we can better identify the relevant add entry method over the irrelevant sum. In addition, location further helps us. The phrasal concept in the signature for add entry is add item entry, which again contains the query words add item, whereas the signature in sum is simply sum, and we could put in a direct object there if we wanted. But looking at the location, the method signature versus the body, further helps us figure out what the topic of this method is, what action it is really trying to take. So I'm going to talk a little bit about our work on query reformulation, which is where we help the developer select query words and determine result relevance, which was the motivation for our next step, which was developing a model of word usage in software that actually captures these phrasal concepts in a general way that is usable by software engineering tools besides just software search. And then, how does that model with phrasal concepts help us improve software search? How can we take advantage of natural language and program structure in program exploration, and then if we combine these two pieces together, how much improvement can that lead to? Any questions before I change topics? Well, not really changing topics, just changing problems slightly. With query reformulation we are concerned with helping the developer pick the right query words to help maximize the results of their search and determine if the results are relevant or not. When developers search source code they typically start off with a query that is executed by some search method on the source code base and then those results are returned to the developer. I am sure you have all experienced this before, whether it is on the web or on a source code base. And if those results are relevant, the developer can stop their search, and if they are not what they are looking for they can continue to repeat this process until they either get relevant results or they get so frustrated that they stop and walk away and use some other means to locate the code that they are looking for. In this process the developer faces two key challenges: first in deciding what query words they actually have to search for, and secondly in determining whether or not those results are relevant. And I am going to go into detail as to why those are so challenging. So first, why is selecting a query difficult? When we are searching software we have to guess what words the original developer used to implement the concept, and actually research has shown that two people, when trying to select words to describe a familiar concept, only agree about 10 to 15% of the time.
So this is a really, really common problem, not just in code search, but in searching in general. >>: Is that disagreement without talking to each other, or after? >> Emily Hill: Without talking to each other. So two people trying to describe, like maybe they saw a picture and they are trying to pick words that describe it. >>: And are those developers or just general people? >> Emily Hill: I think that the target was developers for this case, although I would have to double check that; don't hold me to that. It may be a more general information retrieval research result there, because I don't think that that study has been done for developers. Although Biggerstaff has done some work on how difficult it is to describe those concepts. So there are three major challenges in selecting query words. First, you can have multiple words with the same meaning, so you might formulate the query delete, but that concept is implemented as remove, rm or del as an abbreviation. Then you might have a single word that has multiple meanings, so add, as we saw in our prior example, can mean either appending or adding to a list versus summing, and those are two different senses and you are going to get irrelevant results if you use that one general word. And even if you pick exactly the right word to describe the concept you are looking for, let's say, going back to our auction sniping program example, that you want the code that implements the data storage for the auctions in the system. Auction is clearly the right word, but it is an auction sniping program. The word auction is going to appear everywhere throughout the code. So it is not a good discriminator. So even if the word is perfectly correct and accurate, if it is too frequently occurring, it is not going to be specific enough to get you the results you are looking for. So all three of these challenges really conspire to make it difficult to come up with a good query, and it really makes it difficult for the search tool to try and suit any arbitrary query anyone could think up, all of the time. So it actually becomes very challenging. So that is the first challenge we are trying to address. The second challenge: why is determining result relevance so difficult? Well, in a typical IDE when we do a keyword search or a regular expression search, typically the results are presented just in a flat list, and we have to read through the code to figure out whether the results are relevant or not. So think about the challenge when we search the web: it is very easy to pull out where our query words appear and what their context is. The query words are boldfaced in the context of the sentences where they are used. The titles of the web pages are nice and in a big old font; they're bigger; they are in a different color, and so it's much easier: when I enter a query into something like Bing or Google, I can quickly see, ah, was my query even right, before I even go looking to answer whatever question I have, the reason I made the search; I can quickly figure out if my query is even in the ballpark. But with source code, the developer could actually waste time trying to understand code that is not even relevant, and to me that is the biggest crime: understanding code is hard enough without having to understand code just to figure out if your query was getting you the right source code. Uh-huh? >>: So couldn't you do something similar to what Google does? They highlight the relevant words and they show it in some context, right?
Is there a reason why you couldn't do that, or is that where you're going? >> Emily Hill: That is kind of where I'm going, and actually you could take that idea further. I have only pushed a little bit, in terms of we are going to use phrases to embed those query words and give that context, because it is easier to read a natural language phrase than some source code. But you could even go further and highlight even more. Uh-huh? >>: Are there studies to show how long it takes to scan through a list of search results like the old style? >> Emily Hill: I don't think so. And I think it's highly dependent on how expert you are and how familiar you are with the code system. So we normally assume that the person searching has very little familiarity with the code system, and so they are going to take the longest. If you know the code base, you're probably going to get pretty good at filtering out irrelevant results, but a newcomer to a system that is really unfamiliar, they are probably going to have to read each result, and it depends on how fast you read, how quick you are, but no, I am not aware of any studies that have evaluated that. >>: [inaudible] don't know if it is in any of their papers, but it can take like multiple minutes per result, and people typically give up after five or ten; if they seem irrelevant going down the list they might not even go down the list. >>: Like Google, you give up after the first three, or one. >> Emily Hill: Yeah, 5 to 10. >>: I mean they don't scroll. >> Emily Hill: Yeah, I think for information retrieval in general, the average is 5 to 10. They will look at the first 5 or 10 results and if they don't see it they will give up. But that code list is just alphabetical in most cases, and so if in your alphabetical listing of file names it doesn't show up right away, you might have the right query and just not know it. If we could really get the developer to figure out whether their query is even right and then hone in on the correct results, that is our goal. So the problem is that search results in general can't be quickly scanned, which we are going to try to change, and the results are poorly organized, so you have to decide the relevance of each result. You have 50 matches; you might look at the first 5 to 10. If you are really, really exhaustive, you might look at all of them, but you have to keep determining the relevance and making that decision for each search result. So we would like to change that. So our key insight is that the context of the query words in the source code is going to enable skimming, organization of results and faster feedback for poor queries. We don't claim that we can automatically correct the developer's query; only they understand their information need, but if we can give them that feedback faster so they can more quickly change their query, that is a win for us. So we are going to automatically capture that context by generating phrases from the source code. For example, if I had the signature add item, I could generate a phrase add new book item, for example. Or update event, compare playlist file to object, or load history. So for example if our query is load, we can quickly see whether this result is loading history versus loading a file, downloading a file, delivering a payload, so just seeing how the query word appears with other words in the signature will help us make that determination more quickly.
And we try to make it faster to read, because usually humans can read natural language phrases faster than source code. We are going to organize these phrases into a hierarchy, and I will show you an example of how we do that. If we take an example task, let's say we are searching a JavaScript interpreter for signed integer conversion. So our query is to int and there are 30 results for to int, which I have listed to the right. And we could look through this entire list, or we could use our phrases to try to group them together, and we might be able to hone in on the relevant results faster. So the phrase hierarchy at the very top is the query to int, with 30 results, and below that we have three sub-phrases: add value to code int 16, object to int map and to int 32. And since signed integer conversion involves 32 bits, to int 32 is the sub-phrase that we are really interested in. That is where we think we will find our relevant results. We are able to discard 27 of the other results, and the context of the query words in the phrase helps us to determine the relevance more quickly. So we are reducing the number of relevance decisions from 30 results down to just three phrases, and then three results to verify that those three results are the correct relevant signatures. >>: [inaudible] generate phrases? >> Emily Hill: Yes. So we automatically generate those phrases from the signatures and use a partial opportunistic phrase matching that greedily groups them together into a hierarchy. Uh-huh? >>: Are those phrases every identifier from the signature of the method that you are considering, or is it just a subset? >> Emily Hill: We usually generate them for the entire signature, and I think in this iteration we also generated them for the parameters too. But we tried to match the longest sub-phrase, so that usually prevented us from grouping based on, like, formal parameter names unless that provided the largest grouping. Does that make sense? >>: Partially, but just keep going. >> Emily Hill: Okay. Yes, we do generate them for the parameters, but usually they are grouped based on their signatures, and so typically, for example with to int, you notice we can split up to and int; they don't have to be right next to each other, like add value to code int 16, things like that. >>: So are these phrases represented generally as just words, or do you have some sort of semantic model to work from? >> Emily Hill: Thank you. That is exactly where we are going. We started off with strict phrases, and then we recognized the potential: if we could build that general model, any software engineering tool could use it, so that is our ultimate goal. And that is actually our next section, so almost there. [laughter]. And just to give you a sense of how we generate the phrases, because this was kind of our starting point for building the model: for fields and constructors we naïvely assumed that they were all noun phrases, that they didn't involve actions, so file writer, report display panel; and then we assumed that method names were verb phrases, that they started with a verb and they had an optional object after them. So verb phrases consist of a verb followed by a direct object and an optional preposition and an indirect object, and if you have forgotten your grammar, before I did this research I didn't remember the difference either. If we take an example phrase like add item to list, add is the verb, to is a preposition, item is the direct object and list is the indirect object.
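To make that verb-phrase rule concrete, here is a simplified Java sketch. It only does camel-case splitting against a tiny preposition list and naively takes the first word as the verb; the real phrase generator also falls back to parameters and the class name and uses part-of-speech analysis, none of which is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Simplified sketch of the verb-phrase rule for method names:
// verb + direct object [+ preposition + indirect object].
public class VerbPhraseSketch {

    // A tiny preposition list; the real analysis uses a fuller dictionary.
    static final Set<String> PREPOSITIONS = Set.of("to", "from", "with", "for", "in", "on", "of");

    record VerbPhrase(String verb, String directObject, String preposition, String indirectObject) {}

    // Split a camelCase identifier into lower-case words.
    static List<String> split(String identifier) {
        return Arrays.stream(identifier.split("(?<=[a-z0-9])(?=[A-Z])"))
                     .map(String::toLowerCase)
                     .toList();
    }

    // Naively assume the first word is the verb, words up to a preposition form the
    // direct object, and words after the preposition form the indirect object.
    static VerbPhrase parseMethodName(String name) {
        List<String> words = split(name);
        String verb = words.get(0);
        List<String> direct = new ArrayList<>();
        List<String> indirect = new ArrayList<>();
        String prep = null;
        for (String w : words.subList(1, words.size())) {
            if (prep == null && PREPOSITIONS.contains(w)) { prep = w; continue; }
            (prep == null ? direct : indirect).add(w);
        }
        return new VerbPhrase(verb, String.join(" ", direct), prep, String.join(" ", indirect));
    }

    public static void main(String[] args) {
        // "addItemToList" -> verb=add, direct object=item, preposition=to, indirect object=list
        System.out.println(parseMethodName("addItemToList"));
        System.out.println(parseMethodName("comparePlaylistFileToObject"));
    }
}
```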
So we always look for a verb and a direct object, and if there is a preposition in the method name then we go hunting for the indirect object. Our real challenge was identifying the direct and indirect objects of the verb, and we typically look first in the name. That is obviously the best indicator. And if it wasn't there, then we looked at the first formal parameter and then at the class name. So for example, get connection type, run iReport compiler, update event, or compare playlist file to object. Uh-huh? >>: Did you find any computer science or programmer specific idioms that you needed to also handle? I see a lot of code that says X 2 Y for transformations from X to Y. >> Emily Hill: Yeah, so we mostly avoided to, although I did have a version of an identifier splitter that preprocessed to and tried to use it to infer convert. So we did some work with idioms like convert, especially if it starts with the preposition to, like to string; that's converting something to a string. >>: What about the digit 2? >> Emily Hill: No, I know. If I handled it, it was during the identifier splitting phase where I could try to detect that, but in general if it started with a to preposition, we might infer convert, and there were a couple of cases, but again it is all a question of how much time you want to spend in doing that. And so for query reformulation, just generating these phrases, we didn't really need that level of detail. It still worked pretty well. But as we go to the more general model, we have to spend more and more time in making that more accurate and doing that parsing. And I will show you how we go about doing that in general. So to evaluate, I called our query reformulation technique contextual search because it uses the context of the query words, and to evaluate it, we compared it with an existing technique called verb direct object, which is very similar to our technique except that it uses only the verb and the direct object. It doesn't consider any general noun phrase method names or prepositional phrases. And we compared search results from 22 developers on 28 maintenance tasks; they were searching for 28 concerns, or search tasks. And here we have box plots for the comparison between contextual search, which I call context, and Verb DO on the right, and they are box plots, so the middle shaded box represents the middle 50% of the data. The horizontal line is the median, the plus is the mean and Xs represent outliers. So we compared these two techniques in terms of effort, which we measured using the number of queries the user entered. Ideally, we would've liked to have measured effort in terms of time, but we didn't want to tell our subjects that they were being timed, and some of them ate during one half of the experiment but not in the other, and so unfortunately all we have is the number of queries. We also compared them in terms of effectiveness, using the common information retrieval measure of F measure, which combines precision and recall. And we could see that contextual search requires less effort than Verb DO and returned more effective results. Because contextual search significantly outperforms Verb DO, it justifies going down this path: if, instead of just stopping with verbs and direct objects, we make our information more accurate and really try to model noun phrases and prepositional phrases, we can actually get significant improvements. Is that a hand? >>: The measurements, the comparisons, did they hold true for every subject?
>> Emily Hill: Yes, because that is how we ran it. We did it paired; we ran it both ways. We did the two sample t-test as well as the paired, because it was kind of a mixed model result, but yes, it held for both of them. Although a lot of the subjects liked--what Verb DO did that contextual search didn't was that it also did co-occurring pairs, so if you entered--your query had to be a verb followed by a direct object. But if you entered a verb, it would list all of the other co-occurring direct objects. And if you entered a direct object, it would list all other co-occurring verbs, and the subjects did like seeing what other words co-occurred with their query words, so they did really like that. But it was so limited because it only matched using verb and direct object. There were some search tasks that they could not formulate queries for, and that is partly what led to it. A combination ultimately would be ideal and we are actually still working on trying to take that to the next level. So any other questions about that before I move on? So as you mentioned before, we started getting inspired by these phrases and thinking, gosh, what else could we do with them? And another student at the time actually wanted to work on automatically generating comments, and we thought if we could really turn these phrases into a generalized model of the semantics of the program structure and the natural language in the underlying source code, it could be used in almost any software engineering tool that uses textual information. And so the challenge was, well, how do we go from phrases to a generalized model that more people can take advantage of. So with query reformulation our phrases capture noun phrase and verb phrase phrasal concepts for methods and fields. So for example, convert result, load history, synchronized list. But we needed to generalize that from a textual representation with phrases to a model of this phrasal structure, so these could be annotated with their different roles in the natural language. And we also needed to improve the accuracy. For example, if I am going from a field signature to a phrase, I could actually mistakenly label a verb as a noun and the phrase would still come out readable and correct. But when we want to internally represent it as a phrasal concept, we have to have much higher accuracy. So our goal is to represent the conceptual knowledge of the programmer as expressed in both the program structure and the natural language through these phrasal concepts. We are trying to provide a generalized model that can be used in automated tools and that represents or encodes what a human sees when they read code. That's our goal, where we are trying to get to. So this is an overview of our Software Word Usage Model, which I will call SWUM, and it consists of three layers. The top layer is the program model; any program analysis, any program structure you have used before would fall into that layer: ASTs, call graphs, type hierarchies. That's the traditional analysis layer. At the bottom there is a word layer, so each word individually, and that is what has typically been used by textual analysis techniques in the past, that so-called bag of words model. So our real insight, our contribution, is this interior, middle layer, SWUM core, which models the phrasal concepts, and that is where we do the parsing of the words into verb phrases and noun phrases and start annotating them with action and theme.
Now at this level I am switching to the words action and theme from verb and direct object, because verb and direct object are syntactic-layer information whereas action and theme are more semantic, higher-level concepts. But for all intents and purposes, you can think of them as verb and direct object and you won't be far off. So we have three different types of nodes, one for each layer: program element nodes, word nodes, and then phrase structure nodes which represent the phrasal concepts. And in terms of edges, within each layer we have edges. At the top we have structural edges. In the middle we have parse edges. At the bottom we have word edges, so we can represent things like synonyms or stems; if you want to know that adding is the same as add, you could put that kind of word relationship in the bottom layer. And in between the layers we have the bridge edges, which allow us to go from the program structure to the phrase structure, so you can navigate and take advantage of all of the information of the AST and call graph, as well as all of the semantic information in the parses and the phrasal concepts. So we are really trying to provide integrated solutions so that tool developers don't have to understand all of the parsing details, but they can still leverage textual information in their software engineering tools. And so our goal is that if we had a model like this we could provide an interface between people who want to use textual information and people who are working on improving the accuracy of the parsing layer, similar to how the PDG became an interface for researchers and developers using program analyses. So that is our ultimate goal. It might not be SWUM, it could be something similar, but that is what we are working towards. So what are some of the challenges in automatically constructing such a model? Well, first we have to accurately identify the part of speech. This is a well understood problem for natural language, but in the sub-domain of software it becomes even more challenging. So for example, the same word might have multiple parts of speech, and actually I really like the example fire, because in natural language it is typically a noun. You see fire and you try to put it out. But in source code fire is often a verb; it can be a noun modifier like an adjective, or it can be a noun if it is in a gaming system. And so for every word in an identifier we have to somehow identify some kind of part of speech if we want to accurately parse the identifier names. So our approach is to use both the position of the word in the identifier and its location--is it in a field, is it in a method, is it in a constructor--to help us try to disambiguate what part of speech that word is. After we have identified the parts of speech, then we parse them by identifying the action, theme and secondary arguments for any method verb phrases that we have. Noun phrases are very simple. We don't go beyond noun modifiers and nouns, so we don't differentiate between adjectives or nouns that have become adjectives or things like that. But verb phrases, and identifying these themes and secondary arguments, that is where the challenge is. For example, we have a reactive method action performed, which doesn't tell us much about what the method is doing, so that is one we don't have a very good solution for yet; we generate handle action performed. Or tear down set groups test, convert restriction to a minimum cardinality, or add auction entry.
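A minimal sketch of how the layered representation described above might look as plain data classes, using the add auction entry example. The type and field names here are illustrative guesses, not SWUM's actual implementation; synonym/stem edges, parse edges, and structural edges are omitted, and the assumed parameter name is hypothetical.

```java
import java.util.List;

// Minimal sketch of the three SWUM layers as plain data classes.
// Names are illustrative only, not the actual SWUM implementation.
public class SwumSketch {

    // Word layer: one node per unique word (synonym/stem edges could link these).
    record WordNode(String word) {}

    // SWUM core layer: a phrase structure node annotated with semantic roles.
    record PhraseNode(String phraseType,          // e.g. "VerbPhrase", "NounPhrase"
                      List<WordNode> action,      // e.g. [add]
                      List<WordNode> theme,       // e.g. [auction, entry]
                      List<WordNode> secondaryArg // e.g. words tied to a preposition
    ) {}

    // Program model layer: a program element bridged to its phrase structure.
    record ProgramElementNode(String kind,        // e.g. "Method", "Field", "MethodCall"
                              String signature,
                              PhraseNode bridge   // bridge edge into the SWUM core layer
    ) {}

    public static void main(String[] args) {
        // The method addAuctionEntry(...) as a SWUM-style node (parameter name assumed).
        PhraseNode phrase = new PhraseNode("VerbPhrase",
                List.of(new WordNode("add")),
                List.of(new WordNode("auction"), new WordNode("entry")),
                List.of());
        ProgramElementNode method =
                new ProgramElementNode("Method", "addAuctionEntry(Auction auction)", phrase);
        System.out.println(method);
    }
}
```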
And what we've done in phrase generation is we just generated all of the phrases, so we would generate add entry, add auction entry; we would just generate them all. But in building this model, we tried to take a step back so we can present as much and preserve as much information as possible for the end tool, because we don't know exactly what that tool is going to be. So now what we do is we say the action is add and there are two themes, entry and auction entry, and those are equivalent. They describe the same thing, so we would figure out: if there is a direct object in the name, does it overlap a parameter? Do the head words, the last words to the right of the phrase, overlap? And so we would identify that those are equivalent. Uh-huh? Do you want me to go back? Ask at the end? [laughter] okay. How do we go about developing these SWUM construction rules? Our research process is to analyze how developers actually use words in code. And the concept behind any machine learning or natural language technique is that if a human can recognize it, we can train some automatic tool to recognize it. But you have to be careful of the cost-benefit analysis. Sure, I can recognize anything a human can, but how long is it going to take me to develop those rules? So we have been highly motivated by our target software engineering applications; query reformulation required the least analysis. It still works really well. We generated really readable phrases with not-as-accurate rules, and then for search I didn't need to be quite as accurate as we needed to be for comment generation. When we are actually generating text for human consumption that summarizes a method, we had to be even more accurate. So we have been refining our rule identification process to be more and more accurate, each iteration with each new tool we are targeting. So I started with 9,000 open source Java programs, because they are available. That is what I had on hand. And we start with those identifier names and try to classify each name into a partition. The first, easiest way is to classify them into method names and field names. And then I will analyze each partition and evaluate the accuracy of our current approach on a random subset. For example, we could start and assume that every method name starts with a verb, and in fact that is where we started with phrase generation for query reformulation: we assumed every method name did start with a verb. And we look at our random subset and we can see that that is true for the first three methods, but size and length are actually getters with noun phrase beginnings. To string and next start with prepositions, and synchronized list actually starts with an adjective. So our next challenge is to refine our approach and our classification. First we need to find which partitions are missing. That is usually the easy part. But then we have to figure out how to automatically identify and categorize these method signatures into those partitions. And we would continue repeating this process on a random sample until we were happy with the level of accuracy for our target software engineering application. So as we keep evolving this representation over time, we are working to improve the accuracy more and more. So we have this model, but how expensive is it? [laughter]. Is it going to scale to really, really big software? In terms of space, if you build the entire model, it contains a node for every identifier and every unique word that is used.
And the number of edges is linear with respect to the number of words within those identifiers and whatever structure or word information is included in the model. So that may be very dependent on your target software engineering application, based on how much program structure information you need. Do you just need the AST, or do you need more than that? In terms of time, it can be built incrementally and constructed on demand, so that helps limit the costs. I created an unoptimized research prototype, and to give you a sense of how long that took, I analyzed signatures for a 74,000 line of code program in 11 seconds and 1.5 million lines of code in 11 minutes. So we consider that to be reasonable for most of the code bases that we are looking at, but I don't think they are quite as large as what you guys might be looking at. [laughter]. So that would definitely be something to consider. And there are some optimizations that can be done. First, you can optimize by the level of program structure and accuracy that you need. For example, for query reformulation I didn't need the level of accuracy that I needed for searching. So some optimizations can be made that way. And it can also be constructed once and used in many software engineering tools. So if you wanted to commit to this kind of representation for a wide variety of software engineering tools, it would make more sense to use the expensive analysis, because you would get to reuse it over and over again across different software engineering tools. And because it can be built incrementally, it can be updated incrementally overnight, so you just have the one cost up front, the first big batch, and then you can incrementally update it as the code evolves. So what other software engineering tools can it be used in? So far we have applied it to source code search, also known as concern location. As well, in program comprehension and development, we have applied it to automatically generating comments to summarize what a method is doing. It could also be used for automatic documentation of program changes, automatic recommendation of API methods, a novice programming tutor; anywhere you could use text to help solve a software engineering problem, you can take advantage of this kind of analysis. In terms of traceability, linking software artifacts together: external documentation, e-mails, bug reports to the source code. That involves getting a representation that is similar to SWUM for those natural language artifacts. In theory that is the easier problem, because analysis tools exist for natural language text in general, although they probably have to be tweaked for certain types of software artifacts. We can also work on building more intuitive natural language based interfaces; for example, for debugging, the Whyline interface by Ko and Myers lets you ask questions about the program execution, but those questions were pre-canned, preprogrammed in. We might be able to allow the user to ask more informative questions. They could initiate questions rather than just having a list of questions, possibly. And also in mining of software repositories: for example, we can use this kind of representation to automatically build a WordNet of software synonyms by looking at verbs that are in the method signature as well as in the body. And also to continue improving our SWUM construction rules, so we can use SWUM to help improve SWUM in the future and make it more accurate.
But anywhere you could use text to solve a software engineering problem, that is really where this could be used, as long as it is worth it, as long as this is adding something: adding value, adding accuracy. So any questions about the general model before I show its improvement in something like search? Yes? >>: When you were trying to distinguish between add entry and whether it is an [inaudible] entry or just add entry, have you considered also looking at the call sites to see what the variable is, the variable name of the thing that got passed in as the argument to that method? So another name for that [inaudible]. >> Emily Hill: Yes, so I was just demonstrating the signature-level analysis, but yes, when we actually analyze a method call, we take into account both the formal and the actual and the type of the variable; we have four sources of information: the variable's name and type for both the actual and the formal. And we may have an additional source of information if the method call as a whole is nested inside another method call; the formal parameter of whatever it is a parameter for is also related. So yes, when we get to the method body analysis, we do chain those all together, to extract every last drop of information we can. >>: I guess two questions. It seems like this is specific to the natural language being used. I suspect that a large majority of code uses like English identifiers and [inaudible] but how difficult would it be if you were working on a German code base or Chinese or whatever? Do you have any notion of how prevalent that is? I mean, do we see open source code that is written in a different language? >> Emily Hill: Yeah, of the 9,000 programs, there are programs that contain German and French and Spanish and Italian. Not a lot, but it's there; it is clearly there. [laughter]. >>: [inaudible] change your technique [inaudible] different languages so that the structure could be different? >> Emily Hill: If they structure their identifiers differently. So the challenge is, if they are used to writing English and they just start writing in another language, they might actually still follow English naming convention patterns, just with different words. That is really simple to address. But if they are actually changing the structure of how they name things, like German can have a different phrase structure than English does, and if they don't start their method names with verbs anymore, then you have to completely develop a new part of speech analysis for that. So it is challenging if it's not just a substitution. If things are still kind of in the same positions and they follow similar naming conventions, just different words, that's just a new dictionary. That is easy. But if they actually reorder it… >>: So this would be like an off-the-shelf classifier, like what is a noun, what is a verb… >> Emily Hill: Right. And there are a lot of them that exist for other natural languages, and it's just a matter of tailoring them. The same or similar techniques to what we've used to specialize them for software would work there, but you need some sense of the naming conventions used. I think really the big limitation of this is that it is based on naming conventions, and if you change those significantly, whether it's another natural language or another programming language, you're going to have to do a lot more work.
This is mostly done in Java; if you're going to other object-oriented languages, like C++, there are many similarities, but you have to just reverify them, make sure that they are still following the same naming conventions, and that would apply whether you are looking at a natural language or a programming language change. Uh-huh? >>: Do you find that this information is just not very useful? Like when names are poorly chosen? >> Emily Hill: So for scientific software all bets are off; like predominantly highly parallel codes, scientific codes where the variable names are all XYZ, ABC, this is not going to work well. We know that. It is kind of a sub-domain that we are analyzing separately, because it has separate challenges. So we predominantly looked at open source codes, typically GUI applications. They have user interfaces. They have features that are typically well named, because they are open source and they have to use the source code as a communication mechanism between the developers. Other places where it doesn't work well are what we call reactive method names, like API method names. Like, if you are overriding an interface, you didn't get any choice in selecting that method name. So we have to really rely on the method bodies to build the semantic model, or generate the summaries for comment generation, for example. But as long as inside that API method you have implemented some meaningful words, then we can still use it. >>: But you are saying that you also do look at the program structure within a function, the actual statements… >> Emily Hill: Yeah, depending on which problem we are solving. For search I haven't gone to that level because it's too expensive, but for comment generation we have to, because we are trying to generate a summary of a method automatically. But yes, we do have mechanisms for analyzing and trying to summarize these sub-statements, [inaudible] analysis for loops, for if statements, for blocks of statements, to summarize what they are doing and generally summarize that action. And so the same concepts can be used to automatically debug method names by looking at what the inside is. Does it match what the method name itself is, like a setter that doesn't set anything? That is an example of things that we can attack using this mechanism. Any other questions before I move on? I ran out of water. So now, the target application that I have been most interested in using this model for is to improve search. Can we make search more accurate for software? And really I am most concerned with improving the precision, and so that is where the phrasal concepts come in. So this is a specific example of SWUM to give you a better understanding of how we are using it. In the top left I have a very small snippet of code from MainObject.java. The method is called handle fatal error and it has one line of code, sysLogger.doPrint, and it is printing an error. The program structure representation of that method call in the body is that the method do print is invoked on the expression sys logger and it has an actual parameter of error, and that maps to the phrase structure all the way to the right. I have gone ahead and put the word nodes right into the phrase structure layer. That is usually how I think about it, but technically these can be three separate layers and that helps with the optimization. But for readability I have put them all up here. So the gray nodes are the phrase structure nodes.
So we have the verb phrase, prepositional phrase and a noun phrase. The white nodes are the word nodes. For search, what we use are these different semantic roles. We have an action, do print. We have a theme or a direct object, error. Our secondary argument is to sys logger. In this case we have inferred the preposition to, and we have some rules to do that, but it is not general. There are just some specific ones that we can look for. And we also have auxiliary arguments, if we have additional formal parameters. So for example, error is our theme; we might find that it is equivalent to the error in the formal parameter. So we can have additional auxiliary arguments, especially if there is a whole list of additional formal parameters; any of them that is not Boolean is usually added to the auxiliary argument list, unless the name starts with a verb that we know typically has Boolean arguments. But I am getting into low-level details there. So the really important thing is that we have these different semantic roles: action, theme, secondary argument if there is some kind of preposition involved, and any remaining auxiliary arguments, so that we can throw all of the information from the signature, all the information we can find, into one of the semantic roles, and we take that into account in calculating our relevance score. We also take into account the head distance, which is the location within the phrase structure. In natural language phrases there is this concept that the word all the way to the right in the phrase, the last word in the phrase, is the head word, and it is really the theme of that phrase. So for example, we have the phrase sys logger; it is less about sys or system and more about logger, because logger is in the position of the head. So logger would be labeled as the head and sys would be labeled as one away from the head. And we use that head distance because if a query word appears in the head position, that method or that phrase is more likely to be relevant to the query. So the different sources of information we use: as I just mentioned, we use the semantic role, and we assume that query word occurrences in the action and the theme are more relevant than occurrences in other argument roles. That is inspired by the verb direct object approach that was used before. And we also take into account the head distance, which is a new aspect that has not been used in software search before: the closer the query word is to the head position, the more strongly the phrase relates to the query word. So for example, in our auction example, special auction has more to do with auction than auction server does, because auction server is really about a server which happens to hold auctions, whereas a special auction is actually an auction. The idea is to be greedy, with a diminishing head distance score, so that as long as the word appears somewhere in the phrase, it comes up as relevant. We have chosen the score so that if the word appears in the head position that obviously hits first, and later down on the list we will have other occurrences of the query words, just in case, to be greedy, if the query word never appeared in the head position. So we try to do a best effort. And additional information we use is the location: query words appearing in the signature, we believe, more strongly indicate relevance than appearances in the body.
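Here is an illustrative sketch of how those signals could be folded into a single relevance score. The role weights, the head-distance discount, and the signature-versus-body factor are guesses for exposition, not the published SWUM scoring function, and an optional IDF factor is included to anticipate the frequency weighting discussed next.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative relevance score combining semantic role, head distance, and
// location (signature vs. body). The weights are guesses, not SWUM's actual values.
public class SwumScoreSketch {

    // Words in one semantic role of a phrase, ordered so the last word is the head.
    record RolePhrase(String role, List<String> words, boolean inSignature) {}

    // Role weights: action and theme count more than other arguments.
    static final Map<String, Double> ROLE_WEIGHT =
            Map.of("action", 1.0, "theme", 1.0, "secondaryArg", 0.5, "auxArg", 0.25);

    static double score(List<RolePhrase> phrases, List<String> query, Map<String, Double> idf) {
        double total = 0.0;
        for (RolePhrase p : phrases) {
            for (String q : query) {
                int idx = p.words().indexOf(q.toLowerCase(Locale.ROOT));
                if (idx < 0) continue;                                // query word not in this phrase
                int headDistance = p.words().size() - 1 - idx;
                double headFactor = 1.0 / (1 + headDistance);         // closer to the head is better
                double locationFactor = p.inSignature() ? 1.0 : 0.5;  // signature beats body
                double idfFactor = idf.getOrDefault(q, 1.0);          // rare words count more
                total += ROLE_WEIGHT.getOrDefault(p.role(), 0.25)
                        * headFactor * locationFactor * idfFactor;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Query "print error" against the handle fatal error example.
        List<RolePhrase> handleFatalError = List.of(
                new RolePhrase("action", List.of("do", "print"), false),
                new RolePhrase("theme", List.of("error"), false),
                new RolePhrase("secondaryArg", List.of("sys", "logger"), false),
                new RolePhrase("action", List.of("handle"), true),
                new RolePhrase("theme", List.of("fatal", "error"), true));
        System.out.println(score(handleFatalError, List.of("print", "error"), Map.of()));
    }
}
```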
Traditional information retrieval techniques typically use inverse document frequency to approximate usage in the rest of the program: frequently occurring words throughout the entire program typically aren't good discriminators, and so we inversely weight their contribution to the score using IDF. How's that? Okay? >>: In that one, do you segment the difference between the signature and the body? Because if you have print, it is going to be frequent [inaudible] lots of bodies, but as a method signature there is only one. >> Emily Hill: We segment it just based on identifier splitting and whether or not we are using stemming. So we just split all the words and we use that as the IDF. We haven't done a location-based IDF, although that would be an interesting thing to try. The problem is that we don't know what the user is searching for. Do they want just the signatures or not? And so that is the challenge, figuring out how the user specifies that. Do they know that they are looking in a certain role? If they have that information, certainly we can take advantage of it. But I think that is a challenge and why we haven't done it yet. Any more? Okay. >>: So this is great, but it is very different from a browser search. If users are used to doing it one way, how can you wake them up and say, hey, we do things differently but it's better? Have you thought about that? >> Emily Hill: Well, the idea is that we want to make the query mechanism as simple as possible. We want the query mechanism to be a short 2 to 3 word phrase, the same way you would search on the internet. That is our goal, and that is why we are jumping through all these hoops to try and make a short query be effective, because really the search problems are very different. When you are searching the web you have an information need and you probably have a question, and as soon as you get one webpage that is relevant, that answers your question, you are done. But when I am searching code for maintenance purposes, I need every relevant occurrence. I am not satisfied with just one relevant result; I need all of the relevant results. And so that is why we are working so hard to really try to get precise, and then we bring in program exploration techniques to improve the recall. Right now we are searching over so many different methods; how can we find the ones that are the most relevant to the query, and then can we refine those further to improve the recall? That is kind of our approach. Uh-huh? >>: It seems like you're operating with the constraint that a query is like a sequence of words. By providing some summary, you are allowing them to say, oh, I am looking for a signature, or I am looking for something. But couldn't you, rather than displaying everything so they can filter, allow them to filter preemptively, by just saying when you query, instead of providing just words, also here are some things I care about, like I only care about methods, or I only care about a class--providing some additional information in the query instead of trying to provide it in the summary later on? Does that make sense? >> Emily Hill: Definitely. The more information they can give us, the better; we just don't want to enforce that. We want to allow the ability--the Holy Grail for me has been that I should be able to search my source code as easily as I search the web with Google or Bing.
But as we refine this and try to better meet developer needs, I think we are going to find that we are going to have to add things like that into it. But so far we are just trying to make a general solution: how far can we push it? How accurate can we get? But it is really hard to make a general solution that works well, because there are so many different types of information needs and so many different reasons a developer might be searching. It is hard to be all things to everyone, so I think our next steps are further specializing. Yes? >>: [inaudible] searching, do you frequently have this [inaudible] page optimization [inaudible]? If you would change [inaudible] identifiers [inaudible] how would you change? Like what would make it easier for your approach? >> Emily Hill: Oh, right. So based on the rules that we have learned, we can provide guidelines to developers that if you write your code and follow these patterns, we are going to be better able to find it, definitely. So what we have tried to do is use naming conventions and patterns that developers use over a wide variety of source code, but especially if there are company mandated naming conventions and you follow those, we can improve the rules and the accuracy a lot. So definitely, if developers can provide that information, it would definitely help us improve our accuracy, certainly. Although we have made our problem harder by assuming that we don't have that luxury and trying to still be successful. How far can we push it? How accurate can we get? I really think that the accuracy is still only around 70% F measure, because there is a limitation to using the words alone; sometimes there are going to be methods that just don't contain any relevant words, and that is a challenge. There is like a bar, and we are just trying to see whether we can reach that bar and then how we keep going beyond it. Uh-huh? >>: [inaudible] methods and relevant words, what do you do for abbreviations? >> Emily Hill: I have a technique for abbreviation expansion, but it is not quite accurate enough yet, so I haven't thrown it in here. But that is partly why we have pushed the query reformulation technique, so the developer can more quickly explore how it is actually implemented, so that if they wanted to use both the abbreviation and the full form, they could add that in by seeing what the words are used for. Right now we are not taking that into account. There is certainly more room for synonyms, abbreviations, all of those things, but right now we are just strictly going off the words themselves. Uh-huh? >>: Is there any way that you could leverage developers to help you with this task, so that if you know, here are my blind spots, here are my methods that I just can't reason about, could you say, okay, you get an hour of a developer's time to annotate? Like, I don't know these abbreviations. I can't expand them, or something like that. Have you thought about--because people aren't going to annotate everything, but sometimes if you can use people's time really effectively and there is a payoff later… >> Emily Hill: No, definitely, we haven't really thought about that, but that is a really good idea if we could get developers to do that. A lot of this unfortunately we do ourselves, and so we are relying on our analysis. >>: [inaudible] warning like this isn't very well named. >> Emily Hill: Exactly. >>: If you know [inaudible] that's all right. I have seen where it actually says this is named badly, fix this.
>> Emily Hill: Exactly. And yeah, if we could integrate that idea and collect that information, then we could really help improve our tools, definitely. Any information is helpful. >>: So one additional source of information that I know has been crucial for web search is the notion of a static rank for a page, like what is the prior relevance of this piece of information. And it feels like you could incorporate that same sort of information here; like maybe if a piece of code has a lot of callers or callees, like it is sort of a [inaudible] authority in the call graph [inaudible] greater relevance; if it spends more execution time inside that piece of code, maybe it is more important; maybe if it is closer to the main function it is more important. It feels like there is a bunch of prior signals about the relevance of a piece of code that could not only be used to help relevance but also identify where you get the most bang for the buck if you're going to ask your developers to spend a little more time on things. Have you put any time into this prior? >> Emily Hill: No, we haven't used any relevance feedback yet, although there are some techniques that have, if you use the hub and authority type of mechanism, although it was counterintuitive and they had to actually turn it around. It was like the hubs were not the places you wanted to go, because they were so interconnected. That means they are so general they are not useful, but they have taken that into account. So we focus purely on how much we can get from the structure and the words, but actually adding in some kind of hub and authority would really be helpful, I think, if we could use it to accurately identify that. Because obviously getters and setters, low-level methods, we don't want those. We probably don't want ones that are too high either. You kind of want ones that are in the middle, and I think you could use call graph information to help, definitely. We haven't gone that step yet, but definitely we could totally--any information you've got, we could put into it and further increase the accuracy. I have just been focused on how far we can push the words themselves, and then once we get there and figure out what that barrier is, keep going. So I see another hand. >>: What about presenting the search results in a more graphical, structural way? Like maybe as you build up this model of all these functions you have sort of a functional model of the whole app, and it would be interesting to view the search results in the context of like a graph, or a call graph. >> Emily Hill: Definitely, in fact I personally really like seeing results in a call graph format, and that is part of the reason why we have worked towards integrating search and exploration, because that allows us to present it in a more graphical way and you just get more of a context. That is my personal feeling; I don't know what developers in general want to see, and we would have to undertake a study to see how people want to see it. And in an informal study of a handful of developers, we found that depending on what they were using it for, they really wanted a map where they could zoom in and out. So presenting the results in a format where they could possibly zoom out and get more context or zoom in, which I think you guys have done work on [laughter]. But we have not actually gone that far yet. We are working on whether we can automatically restrict that graph so that we are not overwhelming them with information, using these search and exploration tools.
But how these results are represented--so far all we have really contributed there is query reformulation and that phrase hierarchy, but that is definitely not where we want to stay. We want to keep evolving it, but we need to study what developers really want to see first, unless we can leverage what some other people have studied [laughter]. Other questions on this? Okay. I can show you some results of what we have done. We evaluated our SWUM-based search technique against some existing search techniques. There is ELex, which is Eclipse's regular expression search; it is similar to grep. We also used Google Desktop search, which has been integrated into Eclipse; that is called GES. And then we also have FindConcept, which is really where we started from--that was the inspiration for our approach--and it is similar to the verb-DO approach that we used before except that it also uses synonyms in the query reformulation. So FindConcept, given a verb-direct object query, searches for verb-DO pairs in comments and method signatures and allows the user to do query reformulations with synonyms and co-occurring verbs and direct objects. SWUMT has a similar interface to Google Desktop search because we are using a similar query mechanism, and relevance is determined by our SWUM score exceeding a threshold that we dynamically determine based on the average of the top 20 results. For search tasks we used eight concerns from a previous study, which had 60 relevant methods that we were searching for across 10,000 irrelevant ones in four different programs. In terms of queries, we used the top-performing queries based on a prior evaluation. We did not want to compare how well users could use these search tools; we wanted to see, when a user was really able to get a good query in terms of precision, recall, or F measure, when the techniques were most effective, and compare them under those ideal situations. The measures we used were precision, recall, and F measure, commonly used in information retrieval. So what does it look like? Here we have a box plot of the F measure. Just as a quick reminder, the shaded middle region is the middle 50% of the data, the horizontal line in the middle is the median, and the plus is the mean. As we look from ELex to GES, FindConcept, and SWUMT all the way to the right, if we look at the height of the box of SWUMT, we consider SWUMT to be more consistently effective than the other techniques. It doesn't have the shortest box, but on the whole it has a compact box that is also the highest. When we analyzed recall and precision, we found that ELex, similar to grep, had good recall, but the precision was so poor that overall it inundated the developer with irrelevant results. In terms of precision, we found that SWUMT and FindConcept were best, which means using phrasal concepts did improve our precision; but in terms of recall, GES, which was the Google equivalent, and SWUMT were the best. So the advantage of SWUM over our prior competitor FindConcept was that it had just as good precision, but it slightly improved the recall because it is using a more general representation of phrasal concepts and not just verb-direct object pairs anymore. So this was really a preliminary study and we would like to do a more widespread study to help flesh out these results, because these results are not statistically significant--we were using a small number of queries, just the best in terms of precision, recall, and F measure. 
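For readers who want the measures and the cutoff pinned down, here is a minimal sketch of precision, recall, F measure, and a dynamic threshold set to the average score of the top 20 results, as described above. The function names and data shapes are illustrative assumptions, not the actual SWUMT code.

```python
# Illustrative only: standard IR measures and a dynamic cutoff of the kind
# described (keep results scoring above the average of the top 20).
def precision_recall_f(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

def dynamic_cutoff(scored_methods, top_n=20):
    """scored_methods: list of (method, score) pairs from some search.
    Keeps methods whose score exceeds the mean of the top-N scores."""
    ranked = sorted(scored_methods, key=lambda pair: pair[1], reverse=True)
    top_scores = [score for _, score in ranked[:top_n]]
    if not top_scores:
        return []
    threshold = sum(top_scores) / len(top_scores)
    return [method for method, score in ranked if score > threshold]
```

With `scored_methods` as a list of (method, score) pairs, `dynamic_cutoff` returns the set a threshold-based ranker of this kind would call relevant.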
So we want to do a broader, more general study to further evaluate this. So slightly switching gears just for a second… Yes? >>: Do you have an example of what kind of query the users were using on these? >> Emily Hill: Each type of search is going to take a different type of query. So ELex is going to be a regular expression query. >>: [inaudible] type of regular expression [inaudible] or something like that. >> Emily Hill: And the users were allowed to interact with the tool until they were satisfied and… >>: And so they were given, like, here is what you're searching for; now go do it using that. >> Emily Hill: Yes. Good question, thank you. So GES and SWUMT used the same keyword queries. FindConcept had a specific verb followed by a specific direct object. They could look at the search results and stop when they were satisfied, and the last query was the one that we used. >>: And the sort of things they were searching for were like, find me a method that prints out logging information or something? >> Emily Hill: Well, it was more feature oriented. So they might be shown a screenshot and told, find the code that implements this feature. They might be given a snippet of documentation and told, okay, find the code that implements this aspect of the system. So it was more feature based. Good question. So I'm slightly switching gears, because after we have done this general search to find these seeds to start from, then we want to further refine that and explore the program further. But these are really two different problems with two different goals. In search we are trying to find seeds, whereas in exploration we are starting from these seed starting points. We have got these pegs in the code that we can start hanging things on, and we are trying to build our understanding of the code around them locally. We are looking at relevant elements that are structurally connected to these seed starting points. So in search our goal is really high precision, because we are searching the entire code base and have this huge set of methods that we are trying to prune down, whereas in exploration we are trying to improve the recall further. Our solution for search was to use phrasal concepts and SWUM to improve precision. And actually, even though I have complained about the bag of words approach from information retrieval, it is very good for high recall--it is very greedy--so when we are exploring, we actually argue for the bag of words. For our solution we created a tool called Dora the Program Explorer, which uses program structure and natural language as well as location, signature versus body. In general, this is like the example I showed before: we use the frequency of the query words. For example, I have do add on the left and an irrelevant method delete comment on the right, and the relevant method had six occurrences of the query words while the irrelevant one had just two. We weighted the contribution of the frequency based on the signature being more relevant than the body, and we trained two weights using [inaudible] regression on a training set to calculate that score. We also compared it to additional techniques. So we compared our Dora score, which was more advanced, to two naïve approaches, 'and' and 'or'. 'And' returned true if all the query words were present; 'or' marked something as relevant if any one of the query words was present. 
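To make the scoring idea concrete, here is a rough sketch of a Dora-style score (weighted query-word frequency, with signature matches weighted more heavily than body matches) alongside the naïve 'and'/'or' baselines just described. The tokenizer, the 0.8/0.2 weights, and the function names are placeholders; the real weights were trained, and the real tool does more than this.

```python
import re

def tokenize(text):
    """Split identifiers and comments into lower-case words (camelCase, underscores)."""
    return [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)]

def dora_like_score(query, signature, body, w_sig=0.8, w_body=0.2):
    """Weighted query-word frequency: matches in the signature count more than
    matches in the body. The 0.8/0.2 weights are invented for illustration;
    the weights in the actual work were trained."""
    query_words = set(tokenize(query))
    sig_hits = sum(1 for w in tokenize(signature) if w in query_words)
    body_hits = sum(1 for w in tokenize(body) if w in query_words)
    return w_sig * sig_hits + w_body * body_hits

def and_baseline(query, method_text):
    """Naive 'and': relevant only if every query word appears somewhere."""
    words = set(tokenize(method_text))
    return all(w in words for w in tokenize(query))

def or_baseline(query, method_text):
    """Naive 'or': relevant if any query word appears."""
    words = set(tokenize(method_text))
    return any(w in words for w in tokenize(query))
```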
And we also compared our technique to a purely structural approach called Suade and evaluated it on eight concerns mapped by three independent developers, which translated to 160 methods and over 1800 pages with overlap. What we found was that using natural language and program structure together does outperform just using program structure, but you have to be careful how you integrate that natural language information. If you just add natural language information naively--for example, if you selected the 'and' naïve approach--you would be worse off than using program structure alone. So how you combine the natural language information is very important: success is highly dependent on the textual scoring performance, and our more advanced Dora did appear to outperform the other techniques. Our real question, though, is: if we take our highly precise search technique and a greedier exploration technique like Dora to improve recall, how much more of the concern can we get? How many more relevant results can we get for each search task? So what we did is compare the three state-of-the-art search techniques with SWUM search plus Dora exploration. On the bottom we have ELex, which is like grep, GES, FindConcept, and then all the way to the right we have SWUM search plus Dora exploring one edge away. If we look at the medians, we can see that the median results are significantly higher than for search alone, so right now we see that this is a promising direction to go in, that we can continue improving the results in general. If you're going to pick one solution, you are going to want to pick the solution that has the highest median, the one that is most effective most of the time. We are never going to have one silver bullet that is a perfect search all of the time, but search plus Dora does a better job in general than the other techniques. We also found that results can be further improved if we assume that there is a human pruning away the irrelevant search results before they go to the exploration phase. In the first bar, S plus Dora, I took every search result in the top 10 and explored one edge away, and that was the accuracy; if we assume a human is pruning away some of those irrelevant ones, we get even better results. But again, the F measure is still only at 60 because that is about the limit that words are going to get us, even with the program structure, even with Dora. So this is a preliminary result and we found it very exciting. We also did some other studies and found that when we were searching using any base search technique, if we went 2 to 3 edges away from the starting seeds, we could get like 100% of the relevant results. So within 2 to 3 edges of the call graph you can get almost the entire concern--because programs are so highly interconnected, I believe, is the reason for that [laughter]. >>: Did you look at how many edges that required you to look at? >> Emily Hill: Well, it grew… >>: If you can reach like 20% or 30% of the program from any point in three edges then… >> Emily Hill: No. The sweet spot was returning the top seven results and then going and looking at the top five results, two edges away. We found that was the sweet spot--we got 80% of the correct results across the eight concerns that we were looking at using that. So it's possible that you can pick these thresholds and combine them in such a way that you can get a win, because we found that to get every relevant result we needed to add two more results. 
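A rough sketch of the search-plus-exploration combination being measured here: take the top-ranked search results as seeds and collect everything within a fixed number of call-graph edges. The dictionary-based graph is an assumption, and the top-7 / two-edge defaults simply echo the "sweet spot" numbers mentioned above.

```python
from collections import deque

def explore_from_seeds(call_graph, seeds, max_edges=2):
    """Collect every method within `max_edges` call-graph edges of any seed.
    call_graph: dict mapping a method to its structural neighbors
    (callers and callees together, as an iterable)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        method, dist = frontier.popleft()
        if dist == max_edges:
            continue
        for neighbor in call_graph.get(method, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen

def search_plus_explore(ranked_results, call_graph, top_n=7, max_edges=2):
    """Seed with the top-ranked search results, then explore a few edges out."""
    return explore_from_seeds(call_graph, ranked_results[:top_n], max_edges)
```

Tightening `top_n` or `max_edges` trades recall against how many methods the developer has to look at, which is exactly the threshold trade-off discussed next.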
So it grew exponentially, but there was a threshold where you are not overwhelming the developer and you're still returning more relevant results. But finding that threshold is hard, and it might be different from person to person as well, because different people want to look at different numbers of results. >>: [inaudible] different programs, right? >> Emily Hill: Exactly. Well, and it is highly dependent on the program itself. And actually what makes this problem so challenging is that the query is really one of the most important determining factors in the success of the search, even more so than the word choice in the program and the structure, because if the query is a bad query, it doesn't matter how good the search technique is; it is going to be a bad result. So it is a function of the query itself, the word distribution in the program, and the structure. That is why it's so hard to make a general solution. So what is the research impact we've had so far doing this work? Navigation and exploration tools were typically manual and slow for large and scattered code; we added automated support that leverages natural language and program structure information, as well as location, to outperform competing state-of-the-art techniques. Search tools typically return irrelevant results and miss relevant ones; we helped improve precision by capturing the semantics of word occurrences using these phrasal concepts in SWUM, as well as improving recall by combining search and exploration. But there is certainly more we could do along these lines. So just to summarize, the insights I tried to share with you today are combining natural language and program structure, taking advantage of word location, and using word context through phrasal concepts. I have talked about using that to improve query reformulation, software search, and program exploration, but there are tons of other software engineering applications where this could be used. I am just one woman and I haven't had time to try it out in all of these different places. SWUM captures phrasal concepts, and our goal is that it can become an interface for software engineering tool designers and researchers to help improve linguistic analyses for software. That is our long-term goal in trying to develop this. In the future we are hoping to explore the other ways that text, and specifically this SWUM model, can be used to solve other software engineering problems, and to keep pushing the search further by studying what actual developers are searching for, so we can further refine it and better meet developer needs. Maybe they are not all just general purpose; maybe we need to start specializing. Okay. So that's it for me, unless you have more questions. [applause]. >>: One more question. Have you thought at all about how one might change languages, or annotations that programmers can add, to improve this process? I am really more interested in the language, because anything you can do at the language level could make this easier or more accurate. >> Emily Hill: Right now developers have tons of choices in choosing their identifiers, and I think that's great power because they can be really flexible, but at the same time it makes it really hard. There are no standard naming conventions. If how you call and name methods were slightly more constrained into a structure of verbs and direct objects, there would be a lot less ambiguity. 
If I knew that this is the action that is taking place and this is the object that it is working on, that would make it a lot clearer, I think, for what we are trying to do. >>: Verifying the noun verb… >> Emily Hill: Right, the action and the object. >>: [inaudible] the names. >> Emily Hill: Yes. And really what we found in general is that actions and verbs in source code are used very interchangeably and synonymously, and that is actually the biggest source of issues. But the nouns tend to be pretty consistent, because they are typically objects that get one name and are used everywhere; that is one fixed name. So it is an interesting blend of word choice and word restriction. It is way more restrictive than an average natural language document, because you don't get all of these different forms of the words--once that identifier is fixed, everywhere else in the program has to use it that same exact way--but for the actions, since objects typically don't encapsulate actions, there's a lot more word choice and variability. So anything from execute to fire to do--we have so many synonyms for that one simple concept--compute, compare; there are a lot of different verbs that are used to mean the same thing. >>: So it seems like it may not actually be changing the language but helping the developers. Like, you could have an IDE that gives you choices about the [inaudible] you should be using at certain points, or there are words with the squiggles underneath… >> Emily Hill: Yeah, like, are you sure you mean this? >>: Yeah. >> Emily Hill: Yeah, we had a method like do print--or what are the semantics of helping verbs like can fire: can something fire, and what does that mean for what that method is doing? There is a lot that you can program in and learn from how it's used right now. For example, Høst et al. did work on the programmer's phrase book, where they analyzed the verbs--when is a verb used, and what does the method structure typically look like when that verb is used there--so they can debug poorly named methods. So encoding that and building it into the IDE would really help us better leverage the text that's in there, because it would be more organized. The more ambiguity that you can take away, the better the results are going to be. Uh-huh? >>: In the opposite direction, we try to preserve that ambiguity as much as possible in syntactic analysis, right? So that you don't just take--I mean, I don't know to what extent you do this already. Are you taking just one best analysis, or do you have some packed forest representation of the natural language side, or? >> Emily Hill: We try to preserve the original as much as possible. >>: I mean, of course it explodes, right? But in practice--a lot of [inaudible] is in machine translation, right? When you throw syntax in, you can't explore all of the syntactic possibilities presented by one English sentence when you are translating into Japanese, but you can explore a highly likely subset, and if you have that ambiguity-preserving representation [inaudible] exponential combinations, we could get much better wins that way than just looking over the one best that syntactic [inaudible]. >> Emily Hill: Yes, so right now in the model that I have shown you, it is just one best. I pick one way of doing it. 
But we had an undergrad who was working on using more advanced analysis and even more positional information, and she was looking at all of the different possibilities and then choosing between them using accuracies and things like that. So we have pushed it. It's not quite integrated, because every time you change the part-of-speech tagging, you have to change the parsing rule implementation, and so we are working on making a very general way that that can be done--in a file or something--to make it really easy to change. Right now our challenge is how to design this interface in the system to make that really easy to change in the future. But definitely, the more you can take it--we have tried to avoid presenting multiple possibilities, other than something like an equivalence: okay, these two things are connected; we think they are the same. We have tried to avoid giving two parses, because you could have completely different semantic parses if you have two different syntactic parses. So we have tried to pick one, but maybe associate an accuracy with it. That is not implemented yet, but the goal would be to associate with each rule an accuracy for both the part-of-speech tagging and the semantic parsing. >> Christian Bird: All right, cool. >> Emily Hill: Thanks.