>> Christian Bird: Today we have the opportunity to have Emily Hill come in and visit
us for a day and give us a talk. She did her PhD at Delaware with Lori Pollock and for
the past year, almost year and a half she has been an associate professor at Montclair
State, and she is here visiting us today to talk about her work on natural language
programming and software engineering. Take it away.
>> Emily Hill: All right. In general my research is motivated by the problem where
you've got this huge source code base and someone's got to maintain it, and the poor
maintenance developer needs to somehow identify the code that they are looking for.
There are a couple of steps that they take in trying to locate that code. If they don't have
an expert available to tell them where to look, then they have to do something else. One
way to locate relevant methods and fields is by searching the source code and trying to
look for the big regions of the code that might be relevant, and then further exploring
those areas to refine their understanding and really see what else is relevant to
the exact task that they are trying to solve.
Today what I am going to talk about is how we can use the natural language in the source
code, the words and the comments and the identifiers to help the developer search and
explore and understand their code more effectively. In fact, research has shown that
developers spend more time finding and understanding code than fixing bugs, so we can
help reduce the high cost of software maintenance if we can speed up this process. So
what are the current approaches that developers typically use in addressing these issues?
Well, there are a wide variety of navigation and exploration tools, and those are commonly
built into IDEs. Using the program structure like the AST, the call graph, and the type
hierarchy, they allow the developer to jump to related source code. These are techniques
that developers use all the time, and they are great.
They take advantage of the program structure, but sometimes they can be predominantly
manual and slow for very large and scattered code bases, because each navigation step
has to be initiated, and if locating your code takes multiple steps, every time you are locating a
new piece of code you are initiating navigation step after navigation step to navigate that
program structure. So what is an alternative? Well, there are search tools which work
similar to how we search the internet using either Google or Bing, and they apply string
matching on the comments and identifiers, and so they do allow you to locate large
and scattered code, but they tend to have a problem with returning many irrelevant
results and missing a lot of relevant ones, because if the developer enters a query and it
doesn't match the words the original developer used in the source code, then the search
results are not going to return anything relevant.
So both tools have strengths, but both also have challenges. So how can we go about
improving these software maintenance tools to help facilitate software maintenance? So
our observation is that programmers express concepts when writing code. They use the
program structure: if statements, method calls, the algorithmic steps, the order in which they
organize their statements within their code, but also the natural language, the words in
the comments and the identifiers. So our approach is to leverage both of these sources of
information to try to build more effective software engineering tools and our specific
target is software maintenance. So let me give you an example of combining program
structure and natural language information together.
Let's say we have an auction sniping program. It will allow us to automatically bid on an
eBay auction online and we are looking for the code that implements adding an auction,
so the user is going to add an auction to the system, and I happen to know from prior
experience with the system that DoAction is the method that handles all user triggered
events. If I am just using program structure I can see that DoAction calls 40 methods.
That is not terrible, but only two of those 40 are relevant, so going through that list of 40
is a poor use of the developer’s time. If I use natural language alone and search the entire
code base, I get about 90 matches across 50 methods, and I located the relevant
two, but I also located tons of irrelevant ones. But if we combine this information and
put it together, we can locate the two relevant ones with just one false positive so
narrowing our list of 40 or 50 methods to just three for the developer to look through.
So we wanted to try to combine program structure and natural
language to help us improve tools and get better information. Uh-huh, oh yes, please,
feel free to interrupt.
>>: Was that an intersection that was the programming language answers and the natural
language answers to get to…
>> Emily Hill: Yes. Basically we used search techniques of the natural language on the
program structure, so the subset, we only searched the 40 callees of DoAction. Good
question.
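To make that combination concrete, here is a minimal sketch, not the speaker's actual tool, of intersecting a structural neighborhood (the callees of a seed method) with a lexical keyword match over comments and identifiers; all of the names and data below are hypothetical.

```python
# Minimal sketch: intersect a structural neighborhood (callees of a seed
# method) with a lexical keyword search over comment/identifier text.

def lexical_matches(methods, query_words):
    """Return methods whose identifier/comment text contains every query word."""
    hits = []
    for m in methods:                      # m is a dict: {"name": ..., "text": ...}
        text = m["text"].lower()
        if all(w in text for w in query_words):
            hits.append(m["name"])
    return hits

def search_callees(call_graph, seed, methods_by_name, query_words):
    """Search only the callees of the seed method instead of the whole program."""
    callees = call_graph.get(seed, [])     # e.g. the 40 callees of DoAction
    candidates = [methods_by_name[c] for c in callees if c in methods_by_name]
    return lexical_matches(candidates, query_words)

# Example: narrowing "add auction" to the callees of DoAction
call_graph = {"DoAction": ["addAuction", "cancelAuction", "paintIcon"]}
methods_by_name = {
    "addAuction":    {"name": "addAuction",    "text": "add auction entry to list"},
    "cancelAuction": {"name": "cancelAuction", "text": "cancel the auction"},
    "paintIcon":     {"name": "paintIcon",     "text": "paint toolbar icon"},
}
print(search_callees(call_graph, "DoAction", methods_by_name, ["add", "auction"]))
# -> ['addAuction']
```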
>>: When you say sort of natural language alone, are you talking about static analysis at
all or are you just saying just the comment sections?
>> Emily Hill: So comments and identifiers, so any text that shows up.
>>: But any sort of syntactical analysis, are you doing that also? When you say natural
language?
>> Emily Hill: Usually I mean bag of words at the base level, although we have been
working to build more semantic and syntactic analysis on top of that. I will actually
show you what I mean by that [laughter] down the road. But strictly natural language
information, what it boils down to is somehow using the words, whether that is just
straight lists of this set of words in a method, or if it's more advanced than that. And
actually thank you, that leads me right to my next point, is that when using this natural
language information and combining it with program structure, it is not enough to use the
words alone independently. The context of how the words appear is very important. For
example, we have three occurrences of the word map. So we have map object in the
method name where map is playing the role of a verb or an action, versus object map
which is like a hash map that contains objects, so that is really its noun sense and then we
might have the words map an object just on two completely unrelated statements in the
method, but the word map shows up. So the context of how that word is appearing if it's
a query word is very important in improving our accuracy for the search, as well as the
location of the word. For example, a method signature is typically a better summary
of what a method is doing than a random word just anywhere in the method
body, and so we try to leverage that information to help improve accuracy as well.
So let me show you an example of why using context and location is so important. So for
example, I like adding things so if we are searching for add item in a piece of shopping
cart software for example, on the left I have a method add entry and on the right I have a
method called sum. So both are different senses of the word add. So when I talk about
context I am talking about going from lexical concepts, which is the individual word
itself commonly referred to as a bag of words approach for information retrieval versus
phrasal concepts. So if we look at just straight word occurrences, both of these methods
contain the words add and item, so both match equally. But if we evolve that to phrasal
concepts, so concepts that consist of multiple words, we can see that the left-hand side, add
entry, actually is adding an item, whereas sum is adding a price. So by taking advantage
of these phrasal concepts, we can better identify the relevant add entry method over the
irrelevant sum.
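As a small illustration of that difference, here is a toy comparison between a bag-of-words match and a phrasal match for the query add item; the method data below is made up for this example.

```python
# Toy comparison: bag-of-words matching vs. phrasal-concept matching
# for the query "add item".

QUERY = ("add", "item")

methods = {
    # name -> (action, theme) from the signature, plus the bag of body words
    "addEntry": {"phrase": ("add", "item entry"), "words": {"add", "item", "entry"}},
    "sum":      {"phrase": ("add", "price"),      "words": {"add", "item", "price"}},
}

def bag_of_words_match(m):
    # Both methods contain both query words, so this cannot tell them apart.
    return all(w in m["words"] for w in QUERY)

def phrasal_match(m):
    # Require the query verb to be the action and the query noun to be in the theme.
    action, theme = m["phrase"]
    return QUERY[0] == action and QUERY[1] in theme.split()

for name, m in methods.items():
    print(name, bag_of_words_match(m), phrasal_match(m))
# addEntry True True   <- the phrasal concept singles out the relevant method
# sum      True False
```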
In addition, location further helps us. So the phrasal concept in the signature for
add entry is add item entry, which again contains the query words add item, whereas
the signature in sum is simply sum, and we could put in a direct object there if we wanted.
But looking at the location, the method signature versus the body, further helps us figure
out what the topic of this method is, what action it is really trying to take. So I'm going to
talk a little bit about our work for query reformulation, which is where we help the
developer select query words and determine result relevance, which was the motivation
for our next step which was developing a model of word usage in software that actually
captures these phrasal concepts in a general way that is usable by software engineering
tools besides just software search. And then how does that model with phrasal concepts
help us improve software search? How can we take advantage of natural language and
program structure in program exploration, and then if we combine these two pieces
together, how much improvement can we get?
Any questions before I change topics? Well, not really changing topics, just changing
problems slightly. With query reformulation we are concerned with helping the
developer pick the right query words to help maximize the results of their search and
determine if the results are relevant or not. When developers search source code they
typically start off with a query that is executed by some search method on the source code
base and then those results are returned to the developer. I am sure you have all
experienced this before, whether it is on the web or on a source code base. And if those
results are relevant, the developer can stop their search, and if they are not what they are
looking for they can continue to repeat this process until they either get relevant results or
they get so frustrated that they stop and walk away and they use some other means to
locate the code that they are looking for.
In this process the developer faces two key challenges. First in deciding what query
words they actually have to search for, and secondly in determining whether or not those
results are relevant. And I am going to go into detail as to why those are so challenging.
So first why is selecting a query difficult? When we are searching software we have to
guess what words the original developer used to implement the concept, and actually
research has shown that two people when trying to select words to describe a familiar
concept, only agree about 10 to 15% of the time. So this is a really, really common
problem, not just in code search, but in searching in general.
>>: Is that disagreement without talking to each other, or after?
>> Emily Hill: Without talking to each other. So two people trying to describe, like
maybe they saw a picture and they are trying to pick words that describe it.
>>: And are those developers or just general people?
>> Emily Hill: I think that the target was developers for this case, although I would have
to double check that; don't hold me to that. It may be more general information retrieval
research result there, because I don't think that that study has been done for developers.
Although, Biggerstaff has done some work on how difficult it is to describe those
concepts. So the three major challenges in selecting query words, first you can have
multiple words with the same meaning, so you might formulate the query delete, but that
concept is implemented as remove, rm or del as an abbreviation. Then you might have a
single word that has multiple meanings, so add, as we saw in our prior example can mean
either appending or adding to a list versus summing, and those are two different senses
and you are going to get irrelevant results if you use that one general word. And even if
you pick the exactly right word to describe the concept you are looking for, let's say
going back to our auction sniping program example, let's say you want the code that
implements the data storing the auctions in the system. Auction is clearly the right word,
but it is an auction sniping program. The word auction is going to appear everywhere
throughout the code. So it is not a good discriminator. So even if the word is perfectly
correct and accurate, if it is too frequently occurring, it is not going to be specific enough
to get you the results you are looking for.
So all three of these challenges really conspire to make it difficult to come up with a good
query, and it really makes it difficult for the search tool to try and suit any arbitrary query
anyone could think up all of the time. So it actually becomes very challenging. So that is
the first challenge we are trying to address.
The second challenge is why is determining result relevance so difficult? Well, typically
in an IDE, when we do a keyword search or a regular expression search, the results
are presented just in a list, and we have to read through
the code to find out whether the results are relevant or not. So the challenge: if we think about
when we search the web, it is very easy to pull out where our query words appear, what
their context is. The query words are boldfaced in the context of the sentences where
they are used. The titles of the web pages are nice and in a big old font; they're bigger;
they are in a different color, and so it's much easier when I enter a query into something
like Bing or Google, I can quickly see ah, was my query even right, before I even go
looking to answer whatever question I have, the reason I made the search, I can quickly
figure out if my query is even in the ballpark. But with source code, the developer could
actually waste time trying to understand code that is not even relevant, and to me that is
the biggest crime: understanding code is hard enough without having to
understand code just to figure out whether your query was getting you the right source code.
Uh-huh?
>>: So couldn't you do something similar to what Google does? They highlight the
relevant words and they show it in some context, right? Is there a reason why you
couldn't do that, or is that where you're going?
>> Emily Hill: That is kind of where I'm going and actually you could take that idea
further. I have only pushed a little bit in terms of we are going to use phrases to embed
those query words and give that context, because it is easier to read a natural language
phrase than some source code. But you could even go further and highlight even more.
Uh-huh?
>>: Are there studies to show how long this takes to scan through a list of search results
like the old-style?
>> Emily Hill: I don't think so. And I think it's highly dependent on how expert you are
and how familiar you are with the code system. So we normally assume that the person
searching has very little familiarity with the code system, and so they are going to take
the longest. If you know the code base, you're probably going to get pretty
good at filtering out irrelevant results, but a newcomer to a system that is really
unfamiliar, they are probably going to have to read each result and it depends on how fast
you read, how quick you are, but no, I am not aware of any studies that have evaluated
that.
>>: [inaudible] I don't know if it is in any of their papers, but it can take like multiple
minutes per result, and people typically give up after five or ten; if the results going down
the list seem irrelevant, they might not even go down the list.
>>: Like Google, you give up after the first three, or one.
>> Emily Hill: Yeah, 5 to 10.
>>: I mean they don't scroll.
>> Emily Hill: Yeah, I think for information retrieval in general, the average is 5 to 10.
They will look at the first 5, 10 results and if they don't see it they will give up. But in
that code list, it is just alphabetical in most cases, and so if in your alphabetical listing
of file names it doesn't show up right away, you might have the right query and just not
know it. If we could really get the developer to figure out is their query even right and
then hone in on the correct results, that is our goal. So the problem is that search results
in general can't be quickly scanned, which we are going to try to change, and the results
are poorly organized so you have to decide the relevance of each result. You have 50
matches; you might look at the first 5 to 10. If you are really, really exhaustive, you
might look at all of them, but you have to keep determining the relevance and making
that decision for each search result. So we would like to change that. So our key insight
is that the context of the query words in the source code is going to enable skimming,
organization of results and provide faster feedback for poor queries. We don't claim that
we can automatically correct the developer’s query; only they understand their
information, but if we can give them that feedback faster so they can more quickly
change their query, that is a win for us. So we are going to automatically capture that
context by generating phrases from the source code.
For example, if I had the signature add item, I could generate a phrase add new book
item, for example. Or update event, compare playlist file to object or load history. So for
example if our query is load, we can quickly see that this result is loading history versus
loading a file, downloading a file, delivering a payload, so just by seeing how the query
word appears with other words in the signature, we can make that determination
more quickly. And we try to make it a faster read because usually humans can read natural
language sentences faster than source code.
We are going to organize these phrases into a hierarchy and I will show you an example
as to how we do that. If we take an example task, let's say we are searching a JavaScript
interpreter for a signed integer conversion. So our query is to int and there are 30 results
for to int which I have listed to the right. And we could look through this entire list or if
we could use our phrases to try to group them together, we might be able to hone in on
the relevant results faster. So in the phrase hierarchy, at the very top is the query to int with its 30 results,
and below that we have three sub phrases, add value to code int 16, object to int map
and to int 32. And since signed integer conversion involves 32-bit values, to int 32 is the sub
phrase that we are really interested in. That is where we think we will find our relevant
results.
We are able to discard 27 of the 30 results, and the context of the query
words in the phrase helps us to determine the relevance more quickly. So we are
reducing the number of relevance decisions from 30 results down to just three phrases
and then three results to verify that those three results are the correct relevant signatures.
>>: [inaudible] generate phrases?
>> Emily Hill: Yes. So we automatically generate those phrases from the signatures and
use a partial opportunistic phrase matching that greedily groups them together into a
hierarchy. Uh-huh?
>>: Are those phrases every identifier from the signature of the method that you are
considering, or is it just a subset?
>> Emily Hill: We usually generate them for the entire signature, and I think in this
iteration we also generated it for the parameters too. But we tried to match the longest
sub phrase so that usually prevented us from grouping based on like formal parameter
names unless that provided the largest grouping. Does that make sense?
>>: Partially, but just keep going.
>> Emily Hill: Okay. Yes, we do generate phrases for the parameters, but usually they are
grouped based on their signatures. So typically, for example with to int, you
notice we can split up to and int; they don't have to be right next to each other, like add
value to code int 16, things like that.
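Here is a rough sketch of that grouping idea. It is a simplification of the partial, opportunistic phrase matching described above: the grouping key used here, the smallest window of each phrase that covers the query words plus one word of right context, is an assumption, not the actual algorithm.

```python
# Sketch: group generated phrases under a query into a small hierarchy,
# keyed by the smallest covering window of the query words (plus one word
# of right context as a crude "head").

from collections import defaultdict

def group_key(phrase, query_words):
    """Smallest contiguous window covering all query words, extended one word right."""
    words = phrase.split()
    best = None   # (start, end) of the smallest covering window
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            if all(q in words[i:j] for q in query_words):
                if best is None or (j - i) < (best[1] - best[0]):
                    best = (i, j)
                break  # any larger j only widens this window
    if best is None:
        return None
    i, j = best
    j = min(j + 1, len(words))   # pull in one word of right context
    return " ".join(words[i:j])

def build_hierarchy(phrases, query):
    query_words = query.split()
    groups = defaultdict(list)
    for p in phrases:
        key = group_key(p, query_words)
        if key is not None:
            groups[key].append(p)
    return dict(groups)

phrases = ["add value to code int 16", "object to int map",
           "to int 32", "convert object to int 32"]
print(build_hierarchy(phrases, "to int"))
# {'to code int 16': ['add value to code int 16'],
#  'to int map': ['object to int map'],
#  'to int 32': ['to int 32', 'convert object to int 32']}
```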
>>: So are these phrases represented generally as just words, or do you have some sort
of semantic model to work from?
>> Emily Hill: Thank you. That is exactly where we are going. We started off with
strict phrases, and then we recognized the potential if we could build that general model,
any software engineering tool could use it, so that is our ultimate goal. And that is
actually our next section, so almost there. [laughter]. And just to give you a sense of
how we generate the phrases, because this was kind of our starting point for building the
model: for fields and constructors we naïvely assumed that they were all noun phrases.
They didn't involve actions, so file writer, report display panel. And then we assumed that
method names were verb phrases, that they started with a verb and they had an optional
object after them. So verb phrases consist of a verb followed by a direct object and an
optional preposition and an indirect object, and if you have forgotten your grammar,
before I did this research I didn't remember the difference either. If we take an example phrase
like add item to list, add is the verb and then to is a preposition and then item is a direct
object and list is the indirect object. So we always look for a verb and a direct object and
if there is a preposition in the method name then we go hunting for the indirect object.
Our real challenge was identifying the direct and indirect objects of the verb and we
typically look first in the name. That is obviously the best indicator. And if it wasn't
there, then we looked at the first formal parameter and then at the class name.
So for example, get connection type, run iReport compiler, update event, or compare
playlist file to object. Uh-huh?
>>: Did you find any computer science or programmer specific idioms that you needed
to also heavily mine? I see a lot of code that says X-2-Y for transformations from
X to Y.
>> Emily Hill: Yeah, so we mostly avoided to, although I did have a version of an
identifier splitter that preprocessed to and tried to use it and make it convert. So we did
some work with idioms like convert, especially if it starts with the preposition to, to
string; that's converting something to a string.
>>: What about the digit 2?
>> Emily Hill: No, I know. If I handled it, it was during the identifier splitting phase
where I could try to detect that, but in general if it started like with a TO preposition, we
might infer convert, and there were a couple of cases, but again it is all how much time
do you want to spend in doing that? And so for query reformulation, just generating
these phrases, we didn't really need that level of detail. It still worked pretty well. But as
we go to the more general model, we have to spend more and more time in making that
more accurate and doing that parsing. And I will show you how we go about doing that
in general.
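As a simplified sketch of the kind of signature parsing described above: the verb list, the to-idiom, and the fallback order (name, then first formal parameter, then class name) follow the talk, but the real rule set is much richer than this.

```python
# Simplified sketch of parsing a split method name into a verb phrase.

COMMON_VERBS = {"add", "get", "set", "run", "update", "compare", "load", "remove"}
PREPOSITIONS = {"to", "from", "with", "for", "in", "on"}

def parse_method(name_words, first_param=None, class_name=None):
    """Parse a split method name into verb, direct object, and optional
    preposition/indirect object, falling back to the first formal parameter
    and then the class name when the name itself has no object."""
    words = [w.lower() for w in name_words]
    if words[0] == "to":
        # idiom: toString() reads as "convert <something> to string"
        return {"verb": "convert", "direct_object": first_param or class_name,
                "preposition": "to", "indirect_object": " ".join(words[1:])}
    if words[0] not in COMMON_VERBS:
        return {"noun_phrase": " ".join(words)}      # e.g. getters like size(), length()
    verb, rest = words[0], words[1:]
    prep_idx = next((i for i, w in enumerate(rest) if w in PREPOSITIONS), None)
    if prep_idx is None:
        return {"verb": verb,
                "direct_object": " ".join(rest) or first_param or class_name}
    return {"verb": verb,
            "direct_object": " ".join(rest[:prep_idx]) or first_param or class_name,
            "preposition": rest[prep_idx],
            "indirect_object": " ".join(rest[prep_idx + 1:]) or first_param or class_name}

print(parse_method(["add", "item", "to", "list"]))
print(parse_method(["compare", "playlist", "file", "to", "object"]))
print(parse_method(["to", "string"], class_name="Auction"))
```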
So to evaluate, I called our query reformulation technique contextual search because it uses
the context of the query words and to evaluate it, we compared it with an existing
technique called verb direct object which is very similar to our technique except that it is
only the verb and the direct object. It doesn't consider any general noun phrase method
names or prepositional phrases. And we compared search results from 22 developers on
28 maintenance tasks; they were searching for 28 concerns, or search tasks. And here we
have box plots for the comparison between contextual search which I called context and
Verb DO on the right and they are box plots, so the middle shaded box represents the
middle 50% of the data. The horizontal line is the median and the plus is the mean and
Xs represent outliers. So we compared these two techniques in terms of effort which we
measured using the number of queries the user entered. Ideally, we would've liked to
have measured effort in terms of time, but we didn't want to tell our subjects that they
were being timed and some of them ate during one half of the experiment but not in the
other, and so unfortunately all we have is the number of queries, and also in terms of
effectiveness using the common information retrieval measure of F measure which
combines precision and recall. And we could see that contextual search requires less
effort than Verb DO and returned more effective results. Because contextual search
significantly outperforms Verb DO, it justifies going down this path: if we make our
information more accurate, instead of just stopping with verbs and direct
objects and really trying to model noun phrases and prepositional phrases, we can
actually get significant improvements. Is that a hand?
>>: The measurements, did the comparisons hold true for the subjects, for every
subject?
>> Emily Hill: Yes, because that is how we ran it. We did it paired; we ran both ways.
We did the two sample t-test as well as the paired, because it was kind of a mixed model
result, but yes, it held for both of them. Although a lot of the subjects like--what Verb
DO did that contextual search didn't, was that it also did co-occurring pairs, so if you
entered--your query had to be a verb followed by a direct object. But if you entered a
verb, it would list all of the other co-occurring direct objects. And if you entered a direct
object, it would list all other co-occurring verbs, and the subjects did like seeing what
other words co-occurred with their query words, so they did really like that. But it was so
limited because it only matched using verb and direct object. It couldn't, there were some
search tasks that they could not formulate queries for and that is partly what led to it. A
combination ultimately would be ideal and we are actually still working on trying to take
that to the next level. So any other questions about that before I move on?
So as I mentioned before, we started getting inspired by these phrases and thinking
gosh, what else could we do with them? And another student at the time actually wanted
to work on automatically generating comments, and we thought if we could really turn
these phrases into a generalized model of the semantics of the program structure and
the natural language in the underlying source code, it could be used in almost any
software engineering tool that uses textual information. And so the challenge was well
how do we go from phrases to a generalized model that more people can take advantage
of.
So with query reformulation our phrases capture noun phrase and verb phrase phrasal
concepts for methods and fields. So for example, convert result, load history,
synchronized list. But we needed to generalize that model from a textual representation
with phrases to a model of this phrasal structure so these could be annotated with their
different roles in the natural language. And we also needed to improve the accuracy. For
example, if I am going from a field signature to a phrase, I could actually mistakenly
label a verb as a noun and the phrase would still come out readable and correct. But
when we want to internally represent it as a phrasal concept, we have to have a lot higher
accuracy. So our goal is to represent the conceptual knowledge of the programmer as
expressed in both the program structure and the natural language through these phrasal
concepts. So any piece--we are trying to provide a generalized model that can be used in
automated tools and that represents or encodes what a human sees when they read code. That's
our goal, where we are trying to get to.
So this is an overview of our software word usage model, which I will call SWUM, and it
consists of three layers. The top layer is the program model; any program analysis,
any program structure you have used before would fall into that layer: ASTs, call graphs,
type hierarchies, that's the traditional analysis layer. At the bottom there is a word layer,
so each word individually, and that is what has typically been used by textual analysis
techniques in the past, that so-called bag of words model. So our real insight, our
contribution is this interior, middle layer SWUM core which models the phrasal concepts
and that is where we do the parsing of the words into verb phrases and noun phrases and
start annotating them with action and theme. Now at this level I am switching to the
words action and theme from verb and direct object, because verb and direct object are syntactic-level
information whereas action and theme are more semantic, higher-level concepts. But for
all intents and purposes, you can think of them as verbs and direct objects and you won't
be far off.
So we have three different types of nodes, one for each layer, program element nodes,
word nodes, and then phrase structure nodes which represent the phrasal concepts. And
in terms of edges, within each layer we have edges. At the top we have structural edges.
In the middle we have parse edges. At the bottom we have word edges, so we can
represent things, for example, we can do synonyms or stems if you want to know that
adding is the same as add, you could put that kind of word relationship in the bottom
layer. And in between the layers we have the bridge edges which allow us to go from the
program structure to the phrase structure so you can navigate and take advantage of all of
the information of the AST and call graph, as well as all of the semantic information
between the parses and the phrasal concepts.
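A minimal data-structure sketch of those three kinds of nodes and the within-layer and bridge edges might look like the following; this is only an illustration of the shape of the model, not the published implementation.

```python
# Illustrative sketch of SWUM-style nodes and edges (not the actual tool).

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str      # "program_element", "phrase", or "word"
    label: str     # e.g. "handleFatalError", "VP", "error"

@dataclass
class SWUM:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src, dst, edge_kind)

    def add_node(self, kind, label):
        node = Node(kind, label)
        self.nodes.append(node)
        return node

    def add_edge(self, src, dst, edge_kind):
        # edge_kind: "structural" (program layer), "parse" (phrase layer),
        # "word" (e.g. synonym/stem links), or "bridge" (between layers)
        self.edges.append((src, dst, edge_kind))

swum = SWUM()
method = swum.add_node("program_element", "handleFatalError")
vp     = swum.add_node("phrase", "VP")
action = swum.add_node("word", "handle")
theme  = swum.add_node("word", "fatal error")
swum.add_edge(method, vp, "bridge")     # program element -> its phrase structure
swum.add_edge(vp, action, "parse")      # phrase -> action word
swum.add_edge(vp, theme, "parse")       # phrase -> theme words
```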
So we are really trying to provide integrated solutions so that people don't, tool
developers don't have to understand all of the parsing details, but they can still leverage
textual information in their software engineering tools. And so our goal is that if we had
a model like this we could provide an interface between people who want to use textual
information and people who are working on improving the accuracy of the parsing layer,
similar to how the PDG became an interface for researchers and developers using
program analyses. So that is our ultimate goal. It might not be SWUM; it could be
something similar, but that is what we are working towards.
So what are some of the challenges in automatically constructing such a model? Well,
first we have to accurately identify the part of speech. This is a well understood problem
for natural language, but in the sub domain of software, it becomes even more
challenging. So for example, the same word might have multiple parts of speech, and
actually I really like the example fire because in natural language it is typically a noun.
You see fire and you try to put it out. But in source code fire is often a verb; it can be a
noun modifier like an adjective, or it can be a noun if it is in a gaming system. And so
for every word in an identifier we have to somehow identify some kind of part of speech
for it if we want to accurately parse the identifier names. So our approach is to use both
the position of the word in the identifier and its location: is it in a field, is it in a method,
is it in a constructor? This helps us try to disambiguate what part of speech that word is.
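A toy version of that position-and-location heuristic could look like the sketch below; the rules and the word list are illustrative guesses in the spirit of the talk, not the actual trained rule set.

```python
# Toy part-of-speech tagging for a word inside an identifier, based on its
# position in the split name and the location of the identifier.

KNOWN_VERBS = {"get", "set", "add", "fire", "handle", "load", "remove", "update"}

def tag_word(word, position, location):
    """position: 'first', 'middle', or 'last' within the split identifier.
    location: 'method', 'field', or 'constructor'."""
    w = word.lower()
    if location in ("field", "constructor"):
        # fields and constructors are read as noun phrases: the last word is
        # the head noun, earlier words act as noun modifiers
        return "noun" if position == "last" else "noun_modifier"
    # methods: a known verb in the first position is read as the action, even
    # for words like "fire" that are usually nouns in everyday English
    if position == "first" and w in KNOWN_VERBS:
        return "verb"
    return "noun"

print(tag_word("fire", "first", "method"))   # verb  (as in fireEvent)
print(tag_word("fire", "last", "field"))     # noun  (as in forestFire)
print(tag_word("event", "last", "method"))   # noun
```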
After we have identified the parts of speech, then we parse them by identifying the
action, theme and secondary arguments for any method verb phrases that we have. Noun
phrases are very simple. We don't go beyond noun modifiers and nouns so we don't
differentiate between adjectives or nouns that have become adjectives or things like that.
But verb phrases and identifying these themes and secondary arguments, that is where the
challenge is. For example, we have a reactive method, action performed, which doesn't
tell us much about what the method is doing; we don't have a very good solution
for that yet beyond handle action performed. Tear down set groups test, convert restriction to a
minimum cardinality, or add auction entry. And what we've done in phrase generation, we
just generated all of the phrases, so we would generate add entry, add auction entry, we
just would generate them all. But in building this model, we tried to take a step back so
we can present as much and preserve as much information as possible for the end tool
because we don't know exactly what that tool is going to be. So now what we do is we
say the action is add and there are two themes, entry and auction entry and those are
equivalent. They describe the same thing, so we would figure out where, if there is a
direct object in the name, does it overlap a parameter? Do the head words, the last words
to the right of the phrase, do they overlap? And so we would identify that those are
equivalent. Uh-huh? Do you want me to go back? Ask at the end? [laughter] okay.
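A small illustration of collapsing equivalent themes in the spirit of the add auction entry example; the equivalence test used here, matching head words, is a simplification of the actual rules.

```python
# Sketch: keep the direct object from the name and any parameter whose head
# word matches it as equivalent themes; other parameters become auxiliary args.

def head(phrase):
    return phrase.split()[-1]          # the rightmost word is the head

def extract_roles(action, name_object, params):
    themes, aux = [name_object], []
    for p in params:
        (themes if head(p) == head(name_object) else aux).append(p)
    return {"action": action, "themes": themes, "auxiliary_args": aux}

print(extract_roles("add", "entry", ["auction entry"]))
# {'action': 'add', 'themes': ['entry', 'auction entry'], 'auxiliary_args': []}
```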
How do we go about developing these SWUM construction rules? Our research process
is to analyze how developers actually use words in code. And so the concept behind any
machine learning or natural language technique is that if a human can recognize it, we
can train some automatic tool to recognize it. But you have to be careful of cost-benefit
analysis. Sure I can recognize anything a human can but how long is it going to take me
to develop those rules? So we have been highly motivated by our target software
engineering applications; query reformulation required the least analysis. It still works
really well. We generated really readable phrases with not-as-accurate rules,
and then for search I didn't need to be quite as accurate as we needed to be for comment
generation. When we are actually generating text for human consumption that summarizes a
method, we had to be even more accurate. So we have been refining our rule
identification process to be more and more accurate, each iteration with each new tool we
are targeting.
So I started with 9000 open source Java programs, because they are available. That is
what I had on hand. And we will start with those identifier names and try to classify each
name into a partition. The first easiest way is to classify them into method names and
field names. And then I will analyze each partition and evaluate the accuracy of our
current approach on a random subset. For example, we could start and assume that every
method name starts with a verb, and in fact that is where we started with phrase
generation for query reformulation: we assumed every method name did
start with a verb. And we look at our random subset and we can see that that is true for
the first three methods, but for size and length, those are actually getters with noun
phrases, noun beginnings. To string and next start with prepositions and synchronized
list actually starts with an adjective. So our next challenge is to refine our approach in
our classification. First we need to find which partitions are missing. That is usually the
easy part. But then we have to figure out how to automatically identify and categorize
these method signatures into those partitions. And we would continue repeating this
process on a random sample until we were happy with the level of accuracy for our
target software engineering application. So as we keep evolving this representation over
time, we are working to improve the accuracy more and more.
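As a sketch of the partitioning step in that process, one might bucket a sample of method names by how they start and then inspect each bucket by hand; the categories mirror the examples above, and the word lists are placeholders.

```python
# Sketch: bucket a sample of split method names for manual inspection.

from collections import Counter

VERBS = {"add", "get", "set", "run", "update", "compare", "load"}
PREPOSITIONS = {"to", "from", "with", "on"}
ADJECTIVES = {"synchronized", "new"}

def classify(method_name_words):
    """Bucket a split method name by how it starts."""
    first = method_name_words[0].lower()
    if first in VERBS:
        return "starts_with_verb"
    if first in PREPOSITIONS:
        return "starts_with_preposition"
    if first in ADJECTIVES:
        return "starts_with_adjective"
    return "noun_phrase_or_unknown"

sample = [["add", "item"], ["size"], ["to", "string"], ["synchronized", "list"]]
print(Counter(classify(name) for name in sample))
# Counter({'starts_with_verb': 1, 'noun_phrase_or_unknown': 1,
#          'starts_with_preposition': 1, 'starts_with_adjective': 1})
```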
So we have this model, but how expensive is it? [laughter]. Is it going to scale to really,
really big software? In terms of space if you build the entire model, it contains a node for
every identifier and every unique word that is used. And the number of edges is linear
with respect to the number of words within those identifiers and whatever structure or
word information is included in the model. So that may be very dependent on your target
software engineering application, based on how much program structure
information you need. Do you just need the AST, or do you need more than that? In
terms of time, it can be built incrementally and constructed
on-demand, so that helps limit the costs. I created an unoptimized research prototype and
to give you a sense for how long that took, I analyzed signatures for a 74,000 line of code
program in 11 seconds and 1.5 million lines of code in 11 minutes. So we consider that
to be reasonable for most of the code bases that we are looking at, but I don't think they are
quite as large as what you guys might be looking at. [laughter]. So that would definitely
be something to consider.
And there are some optimizations that can be done. First, you can optimize by the level
of program structure and accuracy that you need. For example, for query reformulation I
didn't need the level of accuracy that I needed for searching. So some optimizations can
be made that way. And it can also be constructed once and used in many software
engineering tools. So if you wanted to commit to this kind of representation for a wide
variety of software engineering tools, it would make more sense to use the expensive
analysis because you would get to reuse it over and over again across different software
engineering tools. And because it can be built incrementally, it can be updated
incrementally overnight, so you just have the one cost up front, the first big batch,
and then you could incrementally update it as the code evolves.
So what other software engineering tools can it be used in? So far we have applied it to
source code search, also known as concern location. In terms of program comprehension
and development, we have applied it to automatically generating comments to summarize
what a method is doing. It could also be used for automatic documentation of program
changes, automatic recommendation of API methods, a novice programming tutor,
anywhere you could use text to help solve a software engineering problem, you can take
advantage of this kind of analysis. In terms of traceability, linking software artifacts
together, external documentation, e-mails, bug reports to the source code. That involves
getting a representation that is similar to SWUM for those natural language artifacts. In
theory that is the easier problem because analysis tools exist for natural language text in
general, although they probably have to be tweaked for certain types of software artifacts.
We can also work on building more intuitive natural language based interfaces, for
example for debugging, the Whyline interface by Ko and Myers lets users ask
questions about the program execution, but they were pre-canned, preprogrammed in. We
might be able to allow the user to ask more informative questions that they could initiate,
rather than just having a list of questions, possibly. And it also touches mining of
software repositories; for example, we can use this kind of representation to automatically
build a WordNet for software synonyms by looking at verbs that are in the method
signature as well as in the body. And also to continue improving our SWUM construction
rules, so we can use SWUM to help improve SWUM in the future and make it more
accurate. But anywhere you could use text to solve a software engineering problem, that
is really where this could be used as long as it is worth it, as long as this is adding
something, adding value, adding accuracy. So any questions about the general model
before I show its improvement in something like search? Yes?
>>: When you were trying to distinguish between add entry and whether it is an
[inaudible] entry or just add entry, have you considered also looking at the call sites to
see what the variable is that they, the variable name of the thing that got passed in as the
argument to that method? So another name for that [inaudible].
>> Emily Hill: Yes, so I was just demonstrating the signature level analysis, but yes,
when we actually analyze a method call, we take into account both the formal and the actual
and their types; we have four sources of information, the variable's name and type for
both the actual and the formal. And we may have an additional source of information if
the method call as a whole is nested inside another method call; that is, the formal
parameter of whatever it is a parameter for is also related. So yes, we do chain
them together within the method body analysis, we do chain those all
together, to extract every last drop of information we can.
>>: I guess two questions. It seems like this is specific to the natural language being
used. I suspect that a large majority of code uses like English identifiers and [inaudible]
but how difficult would it be if you were working on a German code base or Chinese or
whatever, do you have any notion of how prevalent that is? I mean we see open source
code that is written in a different language?
>> Emily Hill: Yeah, 99,000 programs contain German and French and Spanish and
Italian. Not a lot, but it's there; it is clearly there. [laughter].
>>: [inaudible] change your technique [inaudible] different languages so that the
structure could be different?
>> Emily Hill: If they structure their identifiers differently, so the challenge is if they are
used to writing English and they just start writing in another language, they might
actually still follow English naming convention patterns just with different words. That
is really simple to address. But if they are actually changing the structure of how they
name things, like Germans can have a different phrase structure than English does and if
they don't start their method names with verbs anymore, then you have to completely
develop a new part of speech analysis for that. So it is challenging if it's not just
a substitution. If things are still kind of in the same positions and they follow similar
naming conventions, just different words, that's just a new dictionary. That is easy. But
if they actually reorder it…
>>: So this would be like an off-the-shelf classifier, like what is a noun, what is a verb…
>> Emily Hill: Right. And there are a lot of them that exist for other like natural
languages, and it's just a matter of tailoring them. The same or similar techniques to what
we've used to specialize them for software would work there, but you need some sense of
the naming conventions used. I think really the big limitation of this is that it is based on
naming conventions and if you change those significantly, whether it's another language,
natural language or another programming language, you're going to have to do a lot more
work. This is mostly done in Java; if you're going to other object
oriented languages, like C++, there are many similarities, but you have to just
reverify them, make sure that they are still following the same naming conventions, and
that would apply whether you are looking at a natural language or a programming language change. Uh-huh?
>>: Do you find cases where this information is just not very useful? Like names poorly
chosen?
>> Emily Hill: So for scientific software all bets are off, like predominantly highly
parallel codes, scientific codes where the variable names are all XYZ, ABC, this is not
going to work well. We know that. It is kind of a sub domain that we are analyzing
separately, because it has separate challenges. So we predominantly looked at open source
codes or typically GUI applications. They have user interfaces. They have features that
are typically well named because they are open source and they have to use the source
code as a communication mechanism between the developers. Other places where it
doesn't work well are what we call reactive method names, like API method names. Like
if you are overriding an interface, you didn't get any choice in selecting that method
name. So we have to really rely on the method bodies to build the semantic model, or
generate the summaries for comment generation, for example. But as long as
inside that API method you have implemented something with some meaningful words, then
we can still use it.
>>: But you are saying that you also do look at the program structure within a function
that the actual statements…
>> Emily Hill: Yeah, depending on which problem we are solving. For search I haven't
gone to that level because it's too expensive, but for comment generation we have to,
because we are trying to generate a summary of a method automatically. But yes, we do
have mechanisms for analyzing and trying to summarize these sub statements [inaudible]
analysis for loops, for if statements, for blocks of statements to summarize what they are
doing, generally summarizing that action. And so the same concepts can be used to
automatically debug method names by looking at what the inside is. Does it match what
the method name itself is, like a setter that doesn't set anything. You know, that is an
example of things that we can attack using this mechanism. Any other questions before I
move on?
I ran out of water. So now the target application that I have been most interested in
using this model for is to improve search. Can we make search more accurate for software?
And really I am most concerned with improving the precision, and so that is where the
phrasal concepts come in. So this is a specific example of SWUM to give you a better
understanding of how we are using it. So in the top left I have a very small snippet of
code, so it is MainObject.java. The method is called handle fatal error and it has one line
of code, syslogger.do print and it is printing an error. The program structure
representation of that method call in the body is syslogger, so the method do print is
invoked on the expression syslogger and it has an actual parameter of error and that maps
to the phrase structure all the way to the right. I have gone ahead and put the word nodes
right into the phrase structure layer. That is usually how I think about it, but technically
these can be three separate layers and it helps with the optimization. But for readability I
have put them all up here. So the gray nodes are the phrase structure nodes. So we have
the verb phrases, prepositional phrase and a noun phrase. The white nodes are the word
nodes.
For search what we use are these different semantic roles. We have an action, do print.
We have a theme or a direct object, error. Our secondary argument is to sys logger. In
this case we have inferred the preposition to and we have some rules to do that, but it is
not general. It is just that there are some specific ones that we can look for. And we also
have an auxiliary argument, if we have additional formal parameters. So for example,
error is our theme; we might find that that is equivalent to the error in the formal
parameter. So we can have additional auxiliary arguments, especially if there is a whole
list of additional formal parameters, any of them that is not Boolean is usually added to
the auxiliary argument list unless it starts with a verb that we know typically has Boolean
arguments. But I am getting into low-level details there. So the really important thing is
that we have these different semantic roles. Action, theme, secondary argument if there
is some kind of preposition involved, and any remaining auxiliary arguments so that we
can throw all of the information from the signature, all the information we can find into
one of the semantic roles, and we take that into account in calculating our relevance
score.
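Written out as data, the sysLogger.doPrint(error) example above boils down to roles like these; the field names are mine, not the tool's actual data model.

```python
# The handleFatalError example, written out as extracted semantic roles.
do_print_call = {
    "action": "do print",
    "theme": "error",                          # the direct object
    "secondary_args": [("to", "sys logger")],  # inferred preposition + receiver
    "auxiliary_args": [],                      # extra formal parameters would go here
}
```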
We also take into account the head distance, which is the location within the phrase
structure. So in natural language phrases there is this concept that the word all the way to
the right in the phrase, the last word in the phrase is the head word and it is really the
theme of that phrase. So for example, we have the phrase sys logger; it is less about sys or
system and more about logger because logger is in the position of the head. So logger
would be labeled as head and sys would be labeled as one away from the head. And so
we also use that head distance, because if a query word appears in the head position, that
method or that phrase is more likely to be relevant to the query in that case.
So the different sources of information we use: as I just mentioned, we use the semantic
role, and we assume that query word occurrences in the action and the theme are more
relevant than occurrences in other argument roles. That is inspired by the verb direct
object approach that was used before. And we also take into account the head distance
which is a new aspect that has not been involved in software search before. That is,
the closer the query word is to this head position, the more strongly the phrase relates to
the query word. So for example, in our auction example, special auction has more to do
with auction than auction server because auction server is really about a server, which
happens to hold auctions, whereas a special auction is actually an auction.
The idea is to be greedy, with diminishing weight by head distance, so that as long as the
word appears somewhere in the phrase, it comes up as relevant. We have chosen the
score so that if it appears in the head position it obviously hits first, and later
down on the list we will have other occurrences of the query words just in case, to be
greedy, if the query word never appeared in the head position. So we try to do a best
effort. And additional information we use is the location: query words appearing in the
signature, we believe, more strongly indicate relevance than appearances in the body.
And with traditional information retrieval techniques, they typically use inverse
document frequency to approximate usage in the rest of the program so that frequently
occurring words throughout the entire program typically aren't good discriminators, and
so we inversely weight their contribution to the score using IDF. How's that? Okay?
>>: In that one, do you segment the difference between the signature and
the body? Because if you have print, it is going to frequently [inaudible] lots of bodies,
but as a method signature, there is only one.
>> Emily Hill: We segment it just based on identifier splitting and whether or not we are
using stemming. So we just split all the words and we use that as the IDF. We haven't
done a location-based IDF, although that would be an interesting thing to try. The
problem is that we don't know what the user is searching for. Do they want just the
signatures or not? And so that is the challenge, is figuring out how does the user specify?
Did they know that they are looking in a certain role, and if they have that information,
certainly we can take advantage of it. But I think that is a challenge as to why we haven't
done it yet. Any more questions? Okay.
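To make the scoring just described concrete, here is a rough sketch combining semantic role, head distance, location, and IDF; the particular weights and the decay are made-up placeholders, not the published formula.

```python
# Sketch of a relevance score: query words count more in the action/theme,
# more near the head (rightmost) position, more in the signature than the
# body, and less when they are frequent program-wide (IDF).

import math

ROLE_WEIGHT = {"action": 1.0, "theme": 1.0, "secondary": 0.5, "aux": 0.25}
LOCATION_WEIGHT = {"signature": 1.0, "body": 0.5}

def idf(word, doc_freq, num_methods):
    return math.log(num_methods / (1 + doc_freq.get(word, 0)))

def score(query_words, occurrences, doc_freq, num_methods):
    """occurrences: list of (word, role, head_distance, location) tuples."""
    total = 0.0
    for word, role, head_distance, location in occurrences:
        if word not in query_words:
            continue
        total += (ROLE_WEIGHT[role]
                  * (1.0 / (1 + head_distance))      # diminish with distance from head
                  * LOCATION_WEIGHT[location]
                  * idf(word, doc_freq, num_methods))
    return total

# e.g. "auction" as the head of the theme in a signature vs. buried in a body
occs = [("auction", "theme", 0, "signature"), ("auction", "aux", 2, "body")]
print(score({"auction"}, occs, {"auction": 120}, 1000))
```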
>>: So this is great, but it is very different from a browser search. If users are used to
doing it one way, how can you wake them up and say hey, we do things differently but
it's better, have you thought about that?
>> Emily Hill: Well, the idea is that we want to make the query mechanism as simple as
possible. We want the query mechanism to be a short 2 to 3 word phrase the same way
you would search on the internet. That is our goal and that is why we are jumping
through all these hoops to try and make a short query be effective, because really the
search problems are very different. When you are searching the web you have
an information need and you probably have a question, and as soon as you get one webpage that is
relevant that answers your question, you are done. But when I am searching code for
maintenance purposes, I need every relevant occurrence. I am not satisfied with just one
relevant result; I need all of the relevant results. And so that is why we are working so
hard to really try to get precise, and then we bring in program exploration techniques to
improve the recall. Right now we are searching over so many different methods, how
can we find the ones that are the most relevant to the query, and then can we refine those
further to improve the recall; that is kind of our approach. Uh-huh?
>>: It seems like you're operating with the constraint that a query is like a sequence of
words. By providing some summary, you are allowing them to, oh, I am looking for a
signature, or I am looking for something, but couldn't you rather than displaying
everything so they can filter, allow them to filter preemptively by just saying when you
query, instead of just providing just words, also here are some things I care about like I
only care about methods, or I only care about a class, or providing some additional
information in the query instead of trying to provide it in the summary later on, does that
make sense?
>> Emily Hill: Definitely. You can definitely, the more information they can give us,
we just don't want to enforce that. We want to allow the ability--the holy Grail for me
has been I should be able to search for my source code as easily as I search the web with
Google or Bing. But as we refine this and try to better meet developer needs, I think we
are going to find that we are going to have to add things like that into that. But so far we
are just trying to make a general solution, how far can we push it? How accurate can we
get? But it is really hard to make a general solution that works well, because there are so
many different types of information needs and so many different reasons a developer
might be searching. It is hard to be all things to everyone, so I think our next steps are
further specializing. Yes?
>>: [inaudible] searching do you frequently have this [inaudible] page optimization
[inaudible]? If you would change [inaudible] identifiers [inaudible] how would you
change? Like what would be, what would make it easier for your approach?
>> Emily Hill: Oh right. So you could, based on the rules that we have learned, we can
provide guidelines to developers that if you write your code and follow these patterns we
are going to be better able to find it, definitely. So what we have tried to do is use
naming conventions and patterns that developers use over a wide variety of source code,
but especially if there are company mandated naming conventions and you follow those,
we can increase the rules and the accuracy a lot. So definitely, if developers can have
that information, it would definitely help us improve our accuracy, certainly. Although
we have made our problem harder by assuming that we don't have that luxury and trying
to still be successful. How far can we push it? How accurate can we get? I really think
that the accuracy is still only around 70% F measure, because there is a limitation to
using the words alone because sometimes there are going to be methods that just don't
contain any relevant words and that is a challenge. There is like a bar and we are just
trying to see can we reach that bar and then how do we keep going beyond it. Uh-huh?
>>: [inaudible] methods and relevant words, what do you do for abbreviations?
>> Emily Hill: I have a technique for abbreviation expansion, but it is not quite accurate
enough yet, so I haven't thrown it in here. But that is partly why we have pushed the
query reformulation technique, so the developer can more quickly explore how it is
actually implemented, so that if they wanted to use both the abbreviation and the full
form, they could add that in, but by seeing what the words are used for. Right now we
are not taking that into account. There is certainly more room for synonyms,
abbreviations all of those things, but right now we are just strictly going off the words
themselves. Uh-huh?
>>: Is there any way that you could leverage developers to help you with this task, so that
if you know your blind spots, here are the methods that I just can't reason about?
Could you say okay, you get an hour of a developer’s time to annotate? Like, I don't
know these abbreviations. I can't expand them or something like that. Have you thought
about--because people aren't going to annotate everything, but sometimes if you can use
people's time really effectively and there is a payoff later,…
>> Emily Hill: No definitely, we haven't really thought about that, but that is a really
good idea if we could get developers to do that. A lot of this unfortunately we do
ourselves and so we are relying on our analysis.
>>: [inaudible] warning like this isn't very well named.
>> Emily Hill: Exactly.
>>: If you know [inaudible] that's all right. I have seen where it actually says this is
named badly, fix this.
>> Emily Hill: Exactly. And yeah, if we could integrate that idea and collect that
information then we could really help improve our tools, definitely. Any information is
helpful.
>>: So one additional source of information that I know has been crucial for web search
is the notion of a static rank for a page, like what is the prior relevance of this piece of
information. And it feels like you could incorporate that same sort of information here,
like maybe if a piece of code has a lot of callers or callees, like is sort of a [inaudible]
authority in the callee graph [inaudible] greater relevance, if it spends more execution
time inside that piece of code, maybe it is more important. Maybe it is closer to the main
function that is more important. It feels like there is a bunch of sort of prior signals about
the relevance of the piece of code that could not only be used to help relevance but also
identify where you get the most bang for the buck if you're going to ask your developers
to spend a little more time on things. Have you put any time in this prior?
>> Emily Hill: No, we haven't used any relevance feedback yet, although there are some
techniques that have used a hub-and-authority type of mechanism. It was counterintuitive,
and they actually had to turn it around: the hubs were not the places you wanted to go,
because they were so interconnected. That means they are so general that they are not
useful, but those techniques have taken that into account. So we have focused purely on
how much we can get from the structure and the words, but actually adding in some kind
of hub-and-authority measure would really be helpful, I think, if we could use it to
accurately identify relevance. Because obviously getters and setters, low-level methods,
we don't want those. We probably don't want ones that are too high either. You kind of
want ones that are in the middle, and I think you could use call graph information to help,
definitely. We haven't gone that step yet, but definitely--any information you've got, we
could put into it and further increase the accuracy. I have just been focused on how far
we can push the words themselves, and then once we get there and figure out what that
barrier is, keep going. So I see another hand.
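Purely as an illustration of the suggestion above (this is not part of the presented work), a structural prior could down-weight both very low-degree methods like getters and setters and overly general high-degree hubs. The degree thresholds and weights in this sketch are invented:

```python
def structural_prior(call_graph, method, low=2, high=20):
    """Favor methods with a moderate number of callers/callees."""
    degree = len(call_graph.get(method, ()))
    return 1.0 if low <= degree <= high else 0.5

def combined_score(text_score, call_graph, method):
    """Blend a textual relevance score with the structural prior."""
    return text_score * structural_prior(call_graph, method)
```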
>>: What about presenting the search results in a more graphical, structural way? Maybe
as you build up this model of all these functions, you have sort of a functional model of
the whole program, and it would be interesting to view the search results in the context
of a graph, or a call graph.
>> Emily Hill: Definitely. In fact, I personally really like seeing results in a call graph
format, and that is part of the reason why we have worked toward integrating search and
exploration, because that allows us to present it in a more graphical way and you just get
more context. That is my personal feeling; I don't know what developers in general want
to see, and we would have to undertake a study to see how people want to see it. In an
informal study of a handful of developers, we found that depending on what they were
using it for, they really wanted a map where they could zoom in and out. So we would
like to present the results in a format where they could possibly zoom out and get more
context, or zoom in, which I think you guys have done work on [laughter]. But we have
not actually gone that far yet. We are working on whether we can automatically restrict
that graph so that we are not overwhelming them with information using these search and
exploration tools. But in terms of how these results are represented, so far all we have
really contributed there is query reformulation and that phrase hierarchy, and that is
definitely not where we want to stay. We want to keep evolving it, but we need to study
what developers really want to see first, unless we can leverage what some other people
have studied [laughter]. Other questions on this?
Okay. I can show you some results of what we have done. We evaluated our SWUM-based
search technique against some existing search techniques. There is ELex, which is
Eclipse's regular expression search; it is similar to grep. We also used Google Desktop
Search, which has been integrated into Eclipse; that is called GES. And then we also have
FindConcept, which is really where we started from. That was the inspiration for our
approach, and it is similar to the verb-DO approach that we used before, except that it
also uses synonyms in the query reformulation. So FindConcept, given a verb-direct
object query, searches for verb-DO pairs in comments and method signatures and allows
the user to do query reformulations with synonyms and co-occurring verbs and direct
objects.
SWUMT has a similar interface to Google Desktop Search because we are using a similar
query mechanism, and relevance is determined by our SWUM score exceeding some
threshold, which we dynamically determine based on the average of the top 20 results.
For search tasks, we used eight concerns from a previous study, which had 60 relevant
methods, and we were searching across 10,000 irrelevant ones in four different programs.
In terms of queries, we used the top-performing queries based on a prior evaluation. We
did not want to compare how well users could use these search tools; we wanted to see,
when a user was really able to get a good query in terms of precision, recall, or F
measure, when were they most effective, and compare the techniques under those ideal
situations. The measures we used were precision, recall, and F measure, commonly used
in information retrieval.
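To make the measures and the score cutoff concrete, here is a minimal sketch in Python. The list of scored methods and its score values are hypothetical placeholders; only the "average of the top 20" cutoff rule comes from the description above.

```python
# Minimal sketch of the evaluation measures and the dynamic score cutoff.
# `scored_methods` is a hypothetical list of (method_name, swum_score) pairs,
# already sorted by descending score; the SWUM scores themselves come from elsewhere.

def precision_recall_f(retrieved, relevant):
    """Standard IR measures over sets of method identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved or not relevant:
        return 0.0, 0.0, 0.0
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

def dynamic_cutoff(scored_methods, k=20):
    """Keep methods whose score exceeds the average of the top-k scores."""
    if not scored_methods:
        return []
    top_scores = [score for _, score in scored_methods[:k]]
    threshold = sum(top_scores) / len(top_scores)
    return [name for name, score in scored_methods if score > threshold]
```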
So what does it look like? Here we have a box plot of the F measure. Just as a quick
reminder, the shaded middle region is the middle 50% of the data, the horizontal line in
the middle is the median, and the plus is the mean. As we look from ELex to GES,
FindConcept, and SWUMT all the way to the right, if we look at the height of SWUMT's
box, we consider SWUMT to be more consistently effective than the other techniques. It
doesn't have the shortest box, but on the whole it has the smallest box that is also
highest. When we analyzed recall and precision, we found that ELex, similar to grep, had
good recall, but the precision was so poor that overall it inundated the developer with
irrelevant results. In terms of precision, we found that SWUMT and FindConcept were
best, which means that using phrasal concepts did improve our precision. But in terms of
recall, GES, which was the Google equivalent, and SWUMT were the best. So the
advantage of SWUM over our prior competitor FindConcept was that it had just as good
precision but slightly improved recall, because it is using a more general representation
of phrasal concepts and not just verb-direct object pairs anymore.
So this was really a more preliminary study, and we would like to do a more widespread
study to help flesh out these results, because these results are not statistically significant;
we were using a small number of queries, just the best in terms of precision, recall, and
F measure. So we want to do a broader study to further evaluate this. So, slightly
switching gears just for a second… Yes?
>>: Do you have an example of what kind of query the users were using on these?
>> Emily Hill: Each type of search is going to use a different type of query. So ELex is
going to be a regular expression query.
>>: [inaudible] type of regular expression [inaudible] or something like that.
>> Emily Hill: And the users were allowed to interact with the tool until they were
satisfied and…
>>: And so they were given like here is what you're searching for. Now implement it
using that.
>> Emily Hill: Yes. Good question, thank you. So GES and SWUMT used the same
keyword queries. FindConcept used a specific verb followed by a specific direct object.
They could look at the search results and stop when they were satisfied, and the last
query was the one that we used.
>>: And the sort of things they were searching for were like, find me a method that prints
out logging information, or something?
>> Emily Hill: Well, it was more feature oriented. They may have been shown a screenshot
and told to find the code that implements this feature. They might have been given a
snippet of documentation and told, okay, find the code that implements this aspect of the
system. So it was more feature based. Good question.
So I'm slightly switching gears, because after we have done this general search to find
these seeds to start from, we want to further refine that and explore the program further.
But these are really two different problems with two different goals. In search we are
trying to find seeds, whereas in exploration we are starting from these seed starting
points. We've got these pegs in the code that we can start hanging things on, and we are
trying to build our understanding of the code around them locally, looking at relevant
elements that are structurally connected to these seed starting points. So in search our
goal is really high precision, because we are searching the entire code base and we have
this huge set of methods that we are trying to prune down, whereas in exploration we are
trying to improve the recall further.
So our solution was to use phrasal concepts and SWUM to improve precision. And
actually, even though I have complained about the bag of words approach from
information retrieval, it is actually very good for high recall; it is very greedy. So when
we are exploring, we actually argue for the bag of words. As our solution we created a
tool called Dora the Program Explorer, which uses program structure and natural
language as well as location: signature versus body. In general, this is like the example
I showed before: we use the frequency of the query words. So for example I have the
relevant method "do add" on the left and an irrelevant method "delete comment" on the
right, and the relevant method had six occurrences of the query words while the
irrelevant one had just two. We weighted the contribution of the frequency based on the
signature being more relevant than the body, and so we trained two weights using
[inaudible] regression on training examples to calculate that score.
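As a rough illustration of that kind of textual score (not Dora's actual implementation), the sketch below counts query-word occurrences and weights the signature more heavily than the body. The weight values and the word-splitting helper are assumptions:

```python
import re

def split_words(text):
    """Split camelCase / snake_case identifiers and comments into lowercase words."""
    return [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)]

def textual_score(query, signature, body, w_sig=0.7, w_body=0.3):
    """Weighted frequency of query words, with signature counted more than body."""
    query_words = set(split_words(query))
    sig_hits = sum(1 for w in split_words(signature) if w in query_words)
    body_hits = sum(1 for w in split_words(body) if w in query_words)
    return w_sig * sig_hits + w_body * body_hits

# Hypothetical example: for the query "add", a method whose signature and body
# mention the query words six times scores higher than one that mentions them twice.
```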
We also compared it to additional techniques. We compared our more advanced Dora
score to naive approaches, AND and OR: AND marks a method as relevant only if all of
the query words are present, and OR marks it as relevant if any one of the query words is
present. We also compared our technique to a purely structural approach called Suade,
and evaluated it on eight concerns mapped by three independent developers, which
translated to 160 methods and over 1800 pages with overlap. What we found was that
using natural language and program structure together does outperform using program
structure alone. But you have to be careful how you integrate that natural language
information. For example, if you just selected the naive AND approach, you would be
worse off than just using program structure alone. So how you combine the natural
language information is very important. Success is highly dependent on the textual
scoring performance, and our more advanced Dora score did appear to outperform the
other techniques.
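For comparison, the naive AND/OR baselines mentioned above can be sketched in a few lines; the word splitting here is a simplified assumption:

```python
import re

def _words(text):
    """Lowercased word set from identifiers and comments."""
    return {w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)}

def and_relevant(query, method_text):
    """Relevant only if the method contains every query word."""
    return _words(query) <= _words(method_text)

def or_relevant(query, method_text):
    """Relevant if the method contains any query word."""
    return bool(_words(query) & _words(method_text))
```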
So our real question is: if we take our highly precise search technique and a greedier
exploration technique like Dora to improve recall, how much more of the concern can we
get? How many more relevant results can we get for each search task? What we did is
compare the three state-of-the-art search techniques with SWUM search plus Dora
exploration. On the bottom we have ELex, which is like grep, then GES and FindConcept,
and all the way to the right we have SWUM search plus Dora exploring one edge away.
If we look at the medians, we can actually see that the median results are significantly
higher than for search alone, so right now we see that this is a promising direction to go
in, and that we can continue improving the results in general. If you're going to pick one
solution, you want to pick the solution that has the highest median, the one that is most
effective most of the time. We are never going to have one silver bullet that is a perfect
search all of the time, but search plus Dora does a better job in general than the other
techniques.
And we also found that results can be further improved if we assume that there is a
human pruning away the irrelevant search results before they go to the exploration
phase. In the first bar, S plus Dora, I took every search result in the top 10 and explored
one edge away, and that was the accuracy. If we assume a human is pruning away some
of those irrelevant ones, we get even better results. But again, the F measure is still only
at 60%, because that is about the limit that words are going to get us, even with the
program structure, even with Dora. So this is a preliminary result, and we found it very
exciting. We also did some other studies and found that when we were searching using
any base search technique, if we went 2 to 3 edges away from the starting seeds, we
could get close to 100% of the relevant results. So within 2 to 3 edges of the call graph
you can get almost the entire concern, because programs are so highly interconnected, I
believe, is the reason for that [laughter].
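A rough sketch of the search-plus-exploration combination described here: take the top-ranked search results as seeds, then walk a bounded number of call-graph edges out from them. The call graph representation is an assumption for illustration, not Dora's actual data structure:

```python
from collections import deque

def explore_from_seeds(call_graph, seeds, max_edges=1):
    """Collect every method within max_edges call-graph edges of any seed.
    call_graph maps a method name to the methods it calls or is called by."""
    reached = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        method, depth = frontier.popleft()
        if depth == max_edges:
            continue
        for neighbor in call_graph.get(method, ()):
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached

# e.g. seeds = the top 10 search results and max_edges = 1 for "one edge away",
# or max_edges = 2 or 3 to approximate the near-complete coverage mentioned above.
```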
>>: Did you look at how many edges that required you to look at?
>> Emily Hill: Well, it grew…
>>: If you can reach like 20% or 30% of the program from any point in three edges
then…
>> Emily Hill: No. The sweet spot was returning the top seven results and then going and
looking at the top five results two edges away. We found that was the sweet spot. We got
80% of the correct results across the eight concerns that we were looking at using that.
So it's possible that you can pick these thresholds and combine them in such a way that
you get a win. Because we found that to get every relevant result we needed to add two
more results, so it grew exponentially, but there was a threshold where you are not
overwhelming the developer and you're still returning more relevant results. But finding
that is hard, and it might be different from person to person as well, because different
people want to look at different numbers of results.
>>: [inaudible] different programs, right?
>> Emily Hill: Exactly. Well, it is highly dependent on the program itself. And actually,
what makes this problem so challenging is that the query is really one of the most
important determining factors in the success of the search, even more so than the word
choice in the program and the structure, because if the query is a bad query, it doesn't
matter how good the search technique is; it is going to be a bad result. So it is a function
of the query itself, the word distribution, the program, and its structure. That is why it's
so hard to make a general solution.
So what is the research impact we've had so far with this work? Navigation and
exploration tools were typically manual and slow for large and scattered code; we added
automated support that leverages natural language and program structure information, as
well as location, to outperform competing state-of-the-art techniques. Search tools
typically return irrelevant results and miss relevant ones; we helped improve precision
by capturing the semantics of word occurrences using these phrasal concepts in SWUM,
as well as improving recall by combining search and exploration. But there is certainly
more we could do along these lines. So just to summarize, the insights I tried to share
with you today are combining natural language and program structure, taking advantage
of word location, and using word context through phrasal concepts. I have talked about
using that to improve query reformulation, software search, and program exploration,
but there are tons of other software engineering applications where this could be used.
I am just one woman, and I haven't had time to try it out in all of these different places.
SWUM captures phrasal concepts, and our goal is that this can become an interface for
software engineering tool designers and researchers to help improve linguistic analyses
for software. That is our long-term goal in developing this. In the future we are hoping
to really explore the other ways that text, and specifically this SWUM model, can be
used for other software engineering applications to solve other software engineering
problems, and to keep pushing the search further by studying what actual developers are
searching for, so we can further refine it and better meet developer needs. Maybe they
are not all just general purpose; maybe we need to start specializing. Okay. So that's it
for me, unless you have more questions.
[applause].
>>: One more question. Have you thought at all about how one might change languages,
or annotations that programmers can add, to improve this process? I am really more
interested in the language, because anything you can do at the language level could make
this easier or more accurate.
>> Emily Hill: Right now developers have tons of choices in choosing their identifiers,
and I think that's great power, because they can be really flexible, but at the same time it
makes it really hard. There are no standard naming conventions. If how you call and
name methods were slightly more constrained into the structure of verbs and direct
objects, there would be a lot less ambiguity. If I knew that this is the action that is taking
place and this is the object that it is working on, that would make it a lot clearer, I think,
for what we are trying to do.
>>: Verifying the noun verb…
>> Emily Hill: Right, the action and the object.
>>: [inaudible] the names.
>> Emily Hill: Yes. And really, what we found in general is that the actions, the verbs in
source code, are typically used very interchangeably and synonymously, and that is
actually the biggest source of issues. But the nouns tend to be pretty consistent, because
they are typically objects that get one name and are used everywhere; that is one fixed
name. So it is an interesting blend of word choice and word restriction. It is way more
restrictive than an average natural language document, because you don't get all of these
different forms of the words; once that identifier is fixed, everywhere else in the program
has to use it in that same exact way. But the actions typically aren't objects, and objects
typically don't encapsulate actions, so there's a lot more word choice and variability.
Anything from execute to fire to do, we have so many synonyms for that one simple
concept, compute, compare; there are a lot of different verbs that are used to mean the
same thing.
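As a toy illustration of treating a method name as an action plus a direct object, and of collapsing synonymous verbs, consider the sketch below; the splitting heuristic and the synonym groups are assumptions, not SWUM's actual rules:

```python
import re

# Illustrative synonym groups for interchangeable action verbs.
VERB_SYNONYMS = {"execute": "run", "fire": "run", "do": "run", "perform": "run"}

def action_and_object(method_name):
    """Treat the first word of a camelCase name as the action, the rest as the object."""
    words = [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", method_name)]
    if not words:
        return None, None
    action = VERB_SYNONYMS.get(words[0], words[0])
    direct_object = " ".join(words[1:])
    return action, direct_object

# action_and_object("fireActionEvent") -> ("run", "action event")
# action_and_object("executeQuery")    -> ("run", "query")
```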
>>: So it seems like it may not actually be changing the language, but helping the
developers. Like you could have an IDE that gives you choices about the [inaudible] you
should be using at certain points, or there are words with the squiggles underneath…
>> Emily Hill: Yeah, like are you sure you mean this?
>>: Yeah.
>> Emily Hill: Yeah, we had a method like doPrint, or what are the semantics of helping
verbs like canFire: can something fire, and what does that mean for what that method is
doing? And there is a lot that you can program in and learn from how it's used right now.
For example, Høst et al. did work on a programmer's phrase book, where they analyzed
the verbs, when a verb is used, and what the method structure typically looks like when
that verb is used, and from that they can debug poorly named methods. So encoding that
and building it into the IDE would really help us better leverage the text that's in there,
because it would be more organized. The more ambiguity you can take away, the better
the results are going to be. Uh-huh?
>>: In the opposite direction, we try to preserve that ambiguity as much as possible in
syntactic analysis, right? So that you don't just take--I mean, I don't know to what extent
you do this already. Are you taking just one best analysis, or do you have some packed
forest representation on the natural language side?
>> Emily Hill: We try and preserve the original as much as possible.
>>: I mean, of course it explodes, right? But in practice, a lot of [inaudible] is in
machine translation, right? When you throw syntax in, you can't explore all of the
syntactic possibilities presented by one English sentence when you are translating into
Japanese, but you can explore a highly likely subset, and if you have that ambiguity-preserving
representation [inaudible] exponential combinations, we could get much better
wins that way than just looking over the one best that syntactic [inaudible].
>> Emily Hill: Yes, so right now, in the model that I have shown you, it is just one best;
I pick one way of doing it. But we had an undergrad who was working on using more
advanced analysis and even more positional information, and she was looking at all of
the different possibilities and then choosing between them using accuracies and things
like that. So we have pushed it. It's not quite integrated, because every time you change
the part-of-speech tagging, you have to change the parsing rule implementation, and so
we are working on a very general way that that can be specified in a file or something to
make it really easy to change. Right now our challenge is how to design this interface in
the system so that it is really easy to change in the future. But definitely, the more you
can take it--we have tried to avoid presenting multiple possibilities other than something
like an equivalence: okay, these two things are connected; we think they are the same.
We have tried to avoid giving two parses, because two different syntactic parses could
have completely different semantic parses. So we have tried to pick one, but maybe
associate an accuracy with it. That is not implemented yet, but the goal would be to
associate an accuracy with each rule, for both the part-of-speech tagging and the
semantic parsing.
>> Christian Bird: All right, cool.
>> Emily Hill: Thanks.