>> Tom Zimmermann: So it's my pleasure to welcome Seung-won Hwang from the Pohang University of Science and Technology. She actually received her Masters and Ph.D. from the University of Illinois at Urbana-Champaign. A few years ago she did an internship at Microsoft Research, and today she's going to talk to us about what we can learn from the wisdom of crowds about development. Please go ahead.
>> Seung-won Hwang: Thank you for the introduction. My name is Seung-won. And today I'd
like to talk about my recent work on software code search. And this is joint work with Professor
Sunghun Kim at Hong Kong UST and also my students at POSTECH with names listed here.
So before we start, I'll show you where I'm from. I'm from POSTECH, which is located in the city of Pohang. As you can see from the map, Pohang is in the southeast part of Korea. We're a rather young school, about 20 years old, and we admit about 300 undergraduates every year across ten departments in engineering and science.
That's geographically speaking where I'm from. Academically speaking, I'm coming from the search and database field. Since most of you are not from that field, I'll spend some time at the beginning of the talk explaining a little bit about my background.
And also as Tom introduced, I see a lot of interns here. I did my internship here about six years
ago. I spent two wonderful summers at DMX group. It's actually a homecoming for me, and I'm
very happy to be here.
So this is the outline. I'll first talk about the search problem in general, and later I will talk about how those search-driven approaches can be used to help software developers. Okay. As a very simplified summary, the search problem is about finding the right result for the given user right away.
Usually that is very tricky, because search deals with a very large search corpus. For instance, the corpus for a search engine is billions and billions of documents. A very popular analogy for finding the right document in that heap of billions of documents is finding a needle in a haystack.
So here in this figure, this person could spend 65,000 seconds to find his needle. But usually, as we all know, the users of search applications are not that patient, so we have to find a result within a second or within 0.1 seconds. Doing these two tasks together, finding the right result and finding it fast, is usually very challenging. Because of that there has been a very rich body of research on search, especially dealing with the haystack of documents.
And for today you can imagine a big haystack consisting of software code. So I'll say a little bit more about those two main tasks. First, about defining the right result. It can be tricky, because users often do not fully express what they need as a query. To give you an example, the user here is looking for a good camera for his work, and someone says the Nikon 380X is good. He sends that as a query to Bing and gets the result. It's possible that this user is fully satisfied with the Bing result. Meanwhile, there could be another user sending the same query with a slightly different search intent, and it is likely that the same result for the Nikon 380X may not be that satisfying for the second user.
Because of this challenge, there have been two different approaches to tackle it. First is personalization. In this approach we are trying to figure out the hidden intent in the query and optimize the result differently for different users, even though they share the same query keywords.
That is one approach, and it can be very effective if your prediction of the search intent is correct. But oftentimes you cannot predict it accurately, in which case personalization can be somewhat risky. For that case there is another approach of diversifying the result, showing a little bit of everything so that different people can be happy.
So it's not about making one person extremely happy, but making all users somewhat happy. In a way, the goal of diversification is to minimize dissatisfaction in case the prediction is wrong.
No matter which way you take, our interest is finding the right result within seconds. There has been lots of research on finding those right results for personalization and diversification, and there are many research questions related to that; actually, all my past and ongoing research falls into some area on this slide. For instance, for personalization, what you have to do is figure out what the user wants, the hidden intent: based on user behaviors and so on, we have to mine out what is specific about this user.
For instance, in Bing, if they figure out from the query that the user is likely to buy this camera, they will show a nice summary of the customer reviews and the price range of the products out there, so that the user can go to the best deal right away.
So you can guess, predict, or mine those intents and then personalize the result. And to return the result right away you need efficient algorithms that optimize the cost of computing different results for different people. I have been working on those algorithms.
Another direction: take the query keyword Jaguar, which can mean different things to different users. Some people mean Jaguar the cat, Jaguar the car, Jaguar the team. If you cannot know which Jaguar the user means, you can categorize the results based on the different intents and show the categories as guidance, so that people with different intents can go to the category of their interest and be satisfied.
For that, there are also many research questions, like what kinds of categories are most effective for most users, how to compute those categories, and how to cluster results efficiently.
Instead of going through the details, I will close the first part of my talk with just one example, because I believe an example is often better than many words. This example demo is what we built together with Microsoft Research Asia during a visit by myself and my student.
For the product query that I showed before, what Bing does is show product-specific information for the given query; somewhere they show a summary for just one camera. What we wanted to do was show the related cameras grouped to represent different user intents. The first user, interested in a work camera, can go to this group, and another user, looking for, say, a vacation camera, can go to that group, and you can figure out the relationship between two cameras by mousing over them.
Then if you click some camera you're interested in, that becomes the new center. So this is one example of diversification. For grouping the data into clusters that represent different groups of search intents, what we need to do first is extract the product features from the Web, like the resolution of the camera, the price, and so on. What we need to do next is define a similarity, and for diversification like this to be successful, the similarity measure should reflect actual human perception of product similarity.
For that, what we use is the collective intelligence of the people who write reviews on the Web. For instance, if one person writes a review comparing cameras A and B, that is one indication that this human being thinks the two cameras are related somehow. If you collect those co-occurrences from, say, one million product reviews, that is another indication of perceived similarity we can obtain from users. So what we did is combine the similarity we extracted from data, say two products sharing similar feature values, with this other notion of similarity from people's perception, and use the combination to organize the result like that. For details, if you are interested and if you're attending the faculty summit next week, you can find us at the Demo Fest.
So I think that was a very brief overview of search research in general, but I'm hoping you can now see why I, as a researcher, would be interested in a big pile of software code. Nowadays it is very cheap and easy to collect a large code corpus. I can easily build a corpus of all the code that I have ever written, and maybe I can use that while writing my own code, because there are recurring patterns in my code writing. Or for Microsoft, I'm sure they have a large code base archiving the code ever written at Microsoft, and you could easily do the same if you wanted to. Even collecting the source code available worldwide is not that hard: you can crawl all the source code and make this big heap.
Now the question is: what can we do with this heap? This heap has lots of potential, because I believe it represents the wisdom of, say, Microsoft developers, or the wisdom of all the developers that have ever existed on the planet. If there is a good search engine on top of it, then, for instance, when I'm trying to teach myself some new API, it's likely that there is someone else who went through the exact same process and already knows a lot about it. A good search engine can connect me to that person, or to the code he wrote, and I can definitely learn a lot from that.
And also when I'm about to write new code, it is likely that somewhere, someone in the past wanted the same functionality and already did something about it. If there is good search on top of that, then maybe you can avoid reinventing the wheel. So that's the vision I have, and I believe there are many different types of right search engines we can build. In the second part of the talk I'll introduce a sample search engine that we built on a code corpus.
There has been a lot of past research observing that developers do look at lots of resources on the Web and take advantage of them during development. So there is little need to motivate building the right search engines for those resources. Those resources include, say, API documents, explaining what each method does and what its inputs and outputs are.
This is an example document for Java. A lot of that research also observed that developers often look for examples to see how these APIs are actually used, because we developers often find reading English more complicated than reading code. So we make a few more search requests until we hit high quality examples, and then we fully understand what the API does.
Based on that, what we can think is: if you can combine very high quality documents with high quality examples, that will be a very useful resource for developers. I'm not alone in thinking that, because there do exist lots of resources combining high quality documents and examples. MSDN is a good example, where human technical writers write high quality documents and, for each document, also provide nice handcrafted examples that are designed to explain the given API. So MSDN, as you know, is a high quality example of such a resource. And you can think of the books you buy to teach yourself Python or F#, which are filled with those examples and usually come with a DVD or CD that stores all the code examples so you can learn from them.
You can run them or play with them to figure out how to program. So this is the kind of high quality resource that we're targeting. There is only one downside: humans have to handcraft the examples. It involves human labor, so these resources are rare and expensive, and in reality most API documents are generated automatically from the code using a tool like Javadoc, which extracts the documents from the code and comments.
With those extraction tools, the documents inherently lack good examples. From our observation, we found that in the document for Java, only 2 percent of the APIs have examples.
So what we wanted to do is build a system, a search engine, so that if you specify some API name, it will search an open source code base to find the actual source code that uses that API, extract high quality examples out of it, and automatically embed them into, say, example-less documents like the Java documents. So in a way our question is whether we can automatically generate MSDN quality documents using an open source code base.
Okay. So one immediate solution you can think of is using existing code search products, such as the ones from Google and Koders.com. We found that these search engines are not suitable for our goal. To show you why, this is the result we get for the query Connection prepareStatement. For this query we hope the summary will show how those APIs are called and so on. But those two code search engines just support keyword search, so the top results show whatever matches in the code. Usually the keyword connection appears somewhere, so the top result will show a comment line mentioning the word connection, or an import statement mentioning connection, or some random part mentioning the keyword connection.
Based on those examples you cannot really learn a lot about the APIs, and our goal is producing, say, MSDN quality examples. So to define what a good example is, we need to observe the handcrafted examples from MSDN. What's so great about those handcrafted examples? The first thing is that these examples show different usage types, which is related to the diversification that I mentioned at the beginning of the talk.
Based on the API name alone, we don't know what usage type the user would be interested in, so we show different usage types. This is an example for list insert: this API can be called with two arguments, three arguments, an array as an argument, and so on.
By looking at it, you can teach yourself the different usage types. Another virtue of the handcrafted examples is that they are very concise. The code is specifically designed to teach you this API, so no line is irrelevant to the API you're about to learn; every line serves the purpose of teaching you this API. We usually do not have that luxury when extracting examples from actual code, because that code was written for a different purpose: serving whatever project it belongs to as well as possible.
So it will contain lots of parts that are not related to the API of your interest, and so on. That suggests that for our system to be effective, it should be able to summarize: get rid of all the irrelevant parts and keep only the parts that can explain the API. And if there is more than one candidate that could be embedded for a given API, we will need some kind of ranking to pick the highest quality example for the documents.
So that is what we need. Another interesting thing about the handcrafted examples is that they show some semantic context. In order to understand an API, the API call is not everything you need to learn; there's something more. For instance, you need to know how this argument is declared and populated.
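To make the kind of example we are aiming for concrete, here is a minimal sketch in the spirit of the Connection prepareStatement query from earlier: a concise usage example that keeps the semantic context, showing how the connection and the statement's argument are declared and populated before the call of interest. The JDBC URL, credentials, table, and column names are placeholders, not from the talk; a MySQL driver on the classpath is assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementExample {
    public static void main(String[] args) throws Exception {
        // Semantic context: how the Connection is declared and populated.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/testdb", "user", "password"); // placeholder URL/credentials

        // The API call of interest, with its argument set before execution.
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT name FROM employees WHERE id = ?");           // placeholder SQL
        stmt.setInt(1, 42);

        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("name"));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}
```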
So good examples often include those semantic contexts as well. Inspired by those handcrafted examples, this is the system that we designed. We named the system EXOA, for Exemplar Oriented Document Generator. First we start with an example-less document, like the document for Java, which has examples for only 2 percent of the APIs. For each API in the documents we search an open source code repository to pick the candidate code that actually calls that API. The next step is summarizing the candidates to get rid of all the semantically irrelevant parts and keep only the semantic context that is related to that API call.
Then the next part is grouping them into clusters representing different types of usage, so we have groups of code examples. And the last part is that within each group, we elect the representative that best shows that group.
So for each cluster we elect one representative, which we embed into the documents to create EXOA. We'll take a look at each of those three steps one by one.
First, summarization. The summarization we saw for the snippets from Koders.com is based purely on textual vicinity: they find whatever occurrence of the keyword is in the code and show two lines before it and two lines after it.
We as developers all know that those adjacent lines may or may not be semantically relevant. So what we do is analyze the code semantically, by parsing it and figuring out the dependencies between the lines, to keep only the lines that are semantically relevant: for instance, declaring the argument, changing the value of an argument, or declaring the class of the API. We summarize like that, and then we have a corpus of summarized code.
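As a rough illustration of this dependency-based summarization, here is a minimal sketch of a backward slice over simplified def-use information: starting from the line that calls the API of interest, keep only lines that transitively declare or populate the variables that call uses. The Statement class and its fields are assumptions for illustration; a real implementation would work over a parsed AST and handle control flow and reassignment, which this sketch ignores.

```java
import java.util.*;

/** Simplified stand-in for a parsed statement: what it defines and what it reads. */
class Statement {
    final int line;
    final String defines;        // variable this line declares/assigns, or null
    final Set<String> uses;      // variables this line reads
    Statement(int line, String defines, Set<String> uses) {
        this.line = line; this.defines = defines; this.uses = uses;
    }
}

public class Summarizer {
    /** Returns the line numbers to keep: the backward slice from apiCallLine. */
    static SortedSet<Integer> summarize(List<Statement> stmts, int apiCallLine) {
        SortedSet<Integer> keep = new TreeSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        for (Statement s : stmts) {
            if (s.line == apiCallLine) {       // seed: the API call itself
                keep.add(s.line);
                worklist.addAll(s.uses);
            }
        }
        Set<String> needed = new HashSet<>();
        while (!worklist.isEmpty()) {
            String var = worklist.pop();
            if (!needed.add(var)) continue;    // variable already handled
            for (Statement s : stmts) {
                if (var.equals(s.defines)) {   // keep lines that declare/populate it
                    keep.add(s.line);
                    worklist.addAll(s.uses);
                }
            }
        }
        return keep;
    }
}
```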
The next step is to cluster them into groups representing different usage types. As I showed with the product clustering example, for good clustering you need to define a good similarity function. A natural choice would be some kind of tree similarity measure, because code is a tree. There are many tree similarity measures, like tree edit distance and so on, but these metrics are usually very expensive.
So people usually approximate the tree as a vector and then compare the vectors. In the research on detecting software code clones, the approximation method that is used is something like this: for the given code we get the parse tree and flatten it into a vector recording what types of elements are there and how many of each there are.
For instance, this tree can be summarized by two blue nodes, one green node, two gray nodes, and two white nodes, so you reflect the tree as a vector like that. Once you do that, these vectors are called characteristic vectors in the clone detection literature; they were proposed for the DECKARD clone detection tool.
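A minimal sketch of that characteristic-vector idea as described here: flatten a parse tree into a vector of node-type counts. The toy Node class and the fixed type ordering are assumptions for illustration; DECKARD's actual vector construction differs in its details.

```java
import java.util.*;

/** Toy stand-in for an AST node. */
class Node {
    final String type;               // e.g., "MethodInvocation", "VariableDeclaration"
    final List<Node> children;
    Node(String type, Node... children) {
        this.type = type;
        this.children = Arrays.asList(children);
    }
}

public class CharacteristicVectors {
    /** One counter per node type; the vector is the counts laid out in a fixed type order. */
    static int[] vectorize(Node root, List<String> typeOrder) {
        Map<String, Integer> counts = new HashMap<>();
        count(root, counts);
        int[] vec = new int[typeOrder.size()];
        for (int i = 0; i < typeOrder.size(); i++) {
            vec[i] = counts.getOrDefault(typeOrder.get(i), 0);
        }
        return vec;
    }

    private static void count(Node n, Map<String, Integer> counts) {
        counts.merge(n.type, 1, Integer::sum);
        for (Node child : n.children) count(child, counts);
    }
}
```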
Once we vectorize like that, the code corpus is a numerical database, so you can apply any existing clustering algorithm, like k-means. But here we have an additional challenge, which is that the right number of clusters K can be different for different APIs: some APIs have a single usage type, some APIs have many different usage types.
So whatever clustering algorithm you use, it should be able to find the right number of clusters and adapt the clustering differently for different APIs. That's the challenge. A sketch of one way to do that follows.
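Here is a minimal sketch of one way to adapt K per API, assuming k-means over the characteristic vectors: run k-means for several values of K and keep the K with the best penalized within-cluster cost. The penalty constant, the iteration cap, and the random seeding are placeholders, not the model-selection criterion used in the actual system.

```java
import java.util.*;

public class AdaptiveKMeans {
    static final double LAMBDA = 10.0;   // assumed penalty per extra cluster

    /** Try K = 1..maxK and return the assignment with the best penalized cost. */
    static int[] bestClustering(double[][] points, int maxK, Random rnd) {
        double bestScore = Double.MAX_VALUE;
        int[] bestAssign = null;
        for (int k = 1; k <= Math.min(maxK, points.length); k++) {
            int[] assign = kMeans(points, k, rnd);
            double score = cost(points, assign, k) + LAMBDA * k;
            if (score < bestScore) { bestScore = score; bestAssign = assign; }
        }
        return bestAssign;
    }

    /** Plain Lloyd's k-means with random seeding and a fixed iteration cap. */
    static int[] kMeans(double[][] pts, int k, Random rnd) {
        int d = pts[0].length;
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) centers[i] = pts[rnd.nextInt(pts.length)].clone();
        int[] assign = new int[pts.length];
        for (int iter = 0; iter < 50; iter++) {
            for (int i = 0; i < pts.length; i++) assign[i] = nearest(pts[i], centers);
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < pts.length; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < d; j++) sums[assign[i]][j] += pts[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
        }
        return assign;
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = sqDist(p, centers[0]);
        for (int c = 1; c < centers.length; c++) {
            double dist = sqDist(p, centers[c]);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    /** Squared Euclidean distance (enough for comparisons). */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    /** Within-cluster sum of squared distances to each cluster's mean. */
    static double cost(double[][] pts, int[] assign, int k) {
        int d = pts[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (int i = 0; i < pts.length; i++) {
            counts[assign[i]]++;
            for (int j = 0; j < d; j++) sums[assign[i]][j] += pts[i][j];
        }
        double total = 0;
        for (int i = 0; i < pts.length; i++) {
            double[] mean = new double[d];
            for (int j = 0; j < d; j++) mean[j] = sums[assign[i]][j] / counts[assign[i]];
            total += sqDist(pts[i], mean);
        }
        return total;
    }
}
```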
The last part is ranking. Now we have groups of example code representing different usage types, and we have to pick a good example to actually embed into the documents, so we elect a representative for each group. As criteria, or features, for such ranking we use these three. First, a good example should be representative: it should represent well the group it belongs to.
The second criterion is conciseness in lines of code: if an example is very short and concise, it will be easier to read. And third, we check type matching: the Java document specifies the types of the class and arguments, and among the examples we prefer the ones that match those types. Based on those criteria we can quantify a score for each feature, and then we need to combine them into an overall score and pick the example with the highest overall score for the documents.
And for that aggregation, as I mentioned in the first part of the talk, for the aggregation to be good, the aggregation function should represent well people's perception of how those features should be combined. I'll show you what we can do for that later.
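As a sketch of how the three ranking features could be combined into one score, here is a simple weighted aggregation. The feature transforms and the weights are assumptions, since the talk leaves the aggregation function to be tuned from user feedback rather than fixing it.

```java
public class ExampleRanker {
    static class Candidate {
        double representativeness;   // e.g., closeness to the cluster centroid, in [0,1]
        int linesOfCode;             // conciseness: shorter is better
        boolean typesMatch;          // does it use the documented argument/class types?
        String code;
    }

    /** Weighted combination of the three features; transforms and weights are assumptions. */
    static double score(Candidate c, double wRep, double wConcise, double wType) {
        double conciseness = 1.0 / (1.0 + c.linesOfCode);  // assumed monotone transform
        double typeBonus = c.typesMatch ? 1.0 : 0.0;
        return wRep * c.representativeness + wConcise * conciseness + wType * typeBonus;
    }

    /** Returns the highest-scoring candidate in a cluster, or null if the cluster is empty. */
    static Candidate pickRepresentative(java.util.List<Candidate> cluster) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : cluster) {
            double s = score(c, 0.5, 0.3, 0.2);            // assumed weights
            if (s > bestScore) { bestScore = s; best = c; }
        }
        return best;
    }
}
```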
Okay. If you're interested, you are invited to visit the EXOA site at postech.kr, where you can see the actual automatically generated documents from this tool. Here I'll show you one sample page of what we can do.
Starting from the Java document, the first thing we can add is this popularity bar, because we know which APIs are often used in actual open source projects. And of course we can also add an example. As you can see here, the quality is better than just using a snippet from Koders: we are showing the actual API call and, say, how the argument is declared.
Another feature is that users can give feedback on the ranking: you can just click these buttons to promote or demote an example. These are the feedbacks we can collect to tune the aggregation for selecting the right examples. We are currently collecting user feedback to see how we can adapt the ranking.
Okay. In order to evaluate how good our automatically generated documents are, we ask two key questions. First, we want to know how our documents compare to other resources like the Java docs or what we can get from code search engines. The second question is whether this can help software developers. I'll first talk about the first question. Comparing to the Java docs, as I mentioned on a previous slide, the original Java docs only have examples for 2 percent of the APIs, and after this generation process we could find examples for 75 percent of the APIs for Java. We can also compare with Google and Koders: compared to using those search snippets as examples, for EXOA documents the probability that our summary contains both the API call and all the semantic context is about 92 percent, while for Koders and Google that probability is 22 percent and 12 percent respectively, because they only support keyword search.
That is how we compare to the Google and Koders results. We can also compare to human ranking.
What we did is give human assessors the K examples that we found with EXOA together with K random results, present those 2K results to the assessors, and have them pick the K better ones. Using the human-selected K as the ground truth, we could measure precision and recall, which were 66 percent and 60 percent respectively.
For the second question, whether this tool is useful for developers, we gave out the documents blindly to 24 human subjects: half of them got EXOA documents and half of them got Java documents. We asked them to do four tasks related to databases, like establishing a connection, creating the SQL statements, executing them, and getting the result.
We measured the average completion time for the two groups and the average number of document lookups, and we found that the EXOA group could complete the tasks in less time. For document lookups, the EXOA group on average looked at only five documents, while that number was three times higher for the Java docs group. Among those lookups some are redundant: only three out of the five were distinct pages, but that ratio was much lower for the Java docs group, which means the Java docs group was looking at the same documents over and over again. Among the distinct pages, most were relevant for the EXOA group, and that ratio was again lower for the Java documents.
So that was the result of a brief user study of the EXOA documents.
Okay. So that was the first part of the talk, and more details can be found in the AAAI paper we wrote about this engine.
The second part is about whether we can make that kind of code search fast. Let me motivate it by showing a video. This is a video from the Stanford HCI group proposing a new development environment. What they propose is that during development you can look up information about an API, information like what we saw in the EXOA documents, and by looking at those examples you can learn about the API and continue development.
And if one of those examples is actually what you want to do, you can just copy and paste it into your code. This is the system called Blueprint, proposed at Stanford. What they're showing is that if good code examples or relevant code information are available, then lookup and development can be interleaved, and that can really change programming practices. I'm also interested in that interleaving scenario, where the search result should be instant. Actually, as a search researcher, I'm interested in interleaving every desktop task with search; that's a longer-term vision. I'm interested in this interleaving model for two reasons. First, as you can see from the video, your task, whatever it is, can benefit a lot from search. And the other direction is also true: the search can be much more personalized and much more accurate for you by using the task context. Here in Blueprint they're not using much of your development context, but the problem I'm interested in is using that context to get search results that are personalized to your development context. And I believe that if this can be done very accurately, it could probably even complete your next line: if the system can understand what you're doing and can search the right corpus, why not?
For development context there could be a lot of different things. In the Blueprint work, they considered the fact that the programmer is developing in Flex with version X.X, and they add that to the query, in a way, to adapt to the development context. There could be many more kinds of context. For the second part of this talk we consider the code developed so far and find existing code that shares similar parts with it, and there could be even more, like what kinds of communication the team has had, and so on. So here let's assume we are using the code developed so far as the task context and try to look for clones during development.
There does exist lots of research here; there are lots of clone detection tools, but they do not talk much about getting the result fast, because they didn't need to: the scenario they're interested in is rather offline. For example, while you are heading home you can run your clone detector, and tomorrow morning you come back and see the groups of code that are clones, and then you can do something about the different groups. The scenario they are targeting is offline, so they do not discuss much about making it online or instant. Meanwhile, code search engines do support online search; you can get a result within 0.1 seconds. But as you could see, they're not considering the semantic part.
So the goal was to combine the strengths of both: to support online semantic clone detection, an online clone detector that considers structural similarity.
I believe this can open up many interesting opportunities, because right now the clone detection model is mostly postmortem, in the sense that you let the clones happen and do something about them later. But if you can support structural clone detection online, and even interleave it with your development session, then clone detection can be more preventive: when you're about to create a clone, you know right away what kinds of clones are already out there. And based on that you can also get personalized search: you search your own stack of code for parts shared with the code you have developed so far.
So you can use clone detection in a more preventive and personalized way. The goal is to support clone detection using structural similarity online, which complements the current state of the art.
For code representation in this work I'm using the same characteristic vectors idea. But for a given context, a match could happen in any part of the tree, at any granularity, so we cannot represent a tree with a single vector. Instead we generate vectors for all the subtrees of size at least some threshold T, so a piece of code becomes a set of vectors.
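A small sketch of that representation, reusing the toy Node and CharacteristicVectors helpers from the earlier sketch: every subtree with at least T nodes contributes one vector, so a code fragment becomes a set of vectors that can match a query at any granularity. The size threshold is the only parameter here; how the real system enumerates and merges subtree vectors is not spelled out in the talk.

```java
import java.util.*;

public class SubtreeVectors {
    /** One characteristic vector per subtree with at least minSize nodes. */
    static List<int[]> vectorsForAllSubtrees(Node root, List<String> typeOrder, int minSize) {
        List<int[]> result = new ArrayList<>();
        collect(root, typeOrder, minSize, result);
        return result;
    }

    /** Post-order traversal; returns the subtree size and emits a vector if large enough. */
    private static int collect(Node n, List<String> typeOrder, int minSize, List<int[]> out) {
        int size = 1;
        for (Node child : n.children) size += collect(child, typeOrder, minSize, out);
        if (size >= minSize) out.add(CharacteristicVectors.vectorize(n, typeOrder));
        return size;
    }
}
```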
With that, we want to find the top 5, top 10, top K clones fast. A very naive way to do that, given a query, is to compare it against each and every piece of code in your corpus and rank them all. What we want to do is make it fast by avoiding looking at every piece of code in the code base: instead you build some nice infrastructure, a data structure, so that you access only the part of the code base that is likely to be high scoring, and then rank that part.
The straightforward way to do that is: now that we have generated characteristic vectors, we have numerical vectors, so we can build a multi-dimensional index on the database and then search it with the query code.
However, there is a well-known problem in data mining called the curse of dimensionality: if your vectors have dimensionality above a certain number, building such an index cannot save you much compared to the naive solution of accessing everything and comparing against everything.
So what we can do is reduce the dimensionality of the data and build the index on the reduced data space. When a query is given, you reduce the query in the same way and send the reduced version of the query code to the index to find clones.
That way you can bypass the curse of dimensionality, but you have an additional problem: the result you get is no longer exact. Instead, you get a larger set of candidate results; that is step one. Then you need an extra step to fetch the original data of those candidates and rank them. So your search now needs two steps. This high-level approach is well known in the database community as GEMINI, for generalized multimedia indexing.
Multimedia databases often have this high dimensionality problem, and although the details vary across systems, a lot of existing multimedia database systems use this high-level technique of reducing both the data and the query to get candidates first and then narrowing down to the right results later.
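Here is a minimal sketch of that filter-and-refine flow. Projecting onto the first few dimensions stands in for the real dimensionality reduction (dropping non-negative terms from a Euclidean distance can only make it smaller, so the reduced distance lower-bounds the true one), and a fixed search radius stands in for the index; a real top-K search would iterate on the radius or walk a multi-dimensional index instead of this linear filter pass.

```java
import java.util.*;

public class FilterAndRefine {
    /** Euclidean distance over the first `dims` dimensions only. */
    static double dist(double[] a, double[] b, int dims) {
        double s = 0;
        for (int i = 0; i < dims; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Returns indexes of the top-k vectors closest to the query, filter-then-refine style. */
    static List<Integer> topK(double[][] corpus, double[] query,
                              int k, int reducedDims, double radius) {
        // Step 1 (filter): reduced-space distances lower-bound the true distance, so anything
        // farther than `radius` in the reduced space cannot be within `radius` in the original.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < corpus.length; i++) {
            if (dist(corpus[i], query, reducedDims) <= radius) candidates.add(i);
        }
        // Step 2 (refine): rank only the candidates by the exact distance in the original space.
        candidates.sort(Comparator.comparingDouble(i -> dist(corpus[i], query, query.length)));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```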
For the first step, dimensionality reduction, you have two goals. First, you want to ensure correctness: the candidate set from step one should be guaranteed to contain all the right results. For that you need the distance between two pieces of code in the reduced space to be smaller than or equal to their distance in the original space.
Second, to keep the amount you need to access to get the top K results small, the distances in the reduced space should preserve the distance relationships of the original space as much as possible. Here the delta quantifies the difference between the two distances.
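Written out, as I read the two goals from this description (with f the reduction, d the original distance, and d' the reduced-space distance); this is a reconstruction, not the exact formulation on the slide:

```latex
% Correctness (lower bounding): no true result can be pruned in the filtering step.
d'\big(f(x), f(y)\big) \;\le\; d(x, y) \quad \text{for all code vectors } x, y
% Efficiency (tightness): keep the candidate set small by minimizing the total gap.
\Delta \;=\; \sum_{x, y} \Big( d(x, y) - d'\big(f(x), f(y)\big) \Big) \;\to\; \min
```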
So now you need to find a dimensionality reduction that pursues both goals, and that problem is known to be NP-hard. In the paper we develop a greedy approximation scheme; I'll skip the details in today's talk.
Once the dimensionality is reduced, for instance if your code is reduced to two-dimensional data, then each piece of code is a point in a two-dimensional space, and you can build an index structure on it. There are many multi-dimensional index structures you can use; one well-known tree structure is the R-tree. What it does is group nearby points into bounding rectangles, which become the tree nodes, and those rectangles are organized hierarchically until one root rectangle covers everything.
What's great about doing this is that once a query is given, the query is again a two-dimensional point, and we need to access the candidates around that query point. With the index structure, accessing those candidates is very efficient, because we only access the rectangles that overlap the search region; you can just access those three nodes and get all the potential top results. So we build an R-tree index, and there are many off-the-shelf R-tree implementations on the Web that you can use.
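For intuition, here is a generic sketch of querying such a structure with best-first search over minimum bounding rectangles (MBRs): nodes are visited in order of the smallest possible distance from the query to their rectangle, so far-away rectangles are never read. This is a textbook-style sketch, not the workload-aware index construction or the interleaved traversal described next.

```java
import java.util.*;

public class RTreeQuery {
    static class RNode {
        double[] min, max;                            // MBR corners
        List<RNode> children = new ArrayList<>();     // empty for leaves
        double[] point;                               // stored vector, set on leaves only
    }

    /** Smallest possible distance from the query to any point inside the node's MBR. */
    static double minDist(double[] q, RNode n) {
        double s = 0;
        for (int i = 0; i < q.length; i++) {
            double d = 0;
            if (q[i] < n.min[i]) d = n.min[i] - q[i];
            else if (q[i] > n.max[i]) d = q[i] - n.max[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Best-first k-nearest-neighbor search: expand the frontier node with the smallest minDist. */
    static List<double[]> nearest(RNode root, double[] q, int k) {
        PriorityQueue<RNode> frontier =
                new PriorityQueue<>(Comparator.comparingDouble(n -> minDist(q, n)));
        frontier.add(root);
        List<double[]> results = new ArrayList<>();
        while (!frontier.isEmpty() && results.size() < k) {
            RNode n = frontier.poll();
            if (n.point != null) results.add(n.point);  // leaf: emit the next closest point
            else frontier.addAll(n.children);           // internal node: expand its rectangles
        }
        return results;
    }
}
```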
However, we found that the performance of those off-the-shelf implementations was not efficient enough, so we designed our own algorithms for building and traversing the index. For instance, for building the index, we know what kinds of query workloads will be used to get candidates for software search, so we can use that knowledge to group together candidates that are likely to be accessed together, which incurs much less I/O cost. And for traversing the index to get the candidates, the classic approach is the two-step approach: you traverse the index first to get the candidate boundary, and then you traverse it again to get the original data of those candidates and rank them. What we propose in this paper is to combine those two steps into one; we call that interleaved. The obvious advantage is avoiding the redundant second index traversal. But there is a downside as well, because only after you are done with the entire filtering step do you know the exact candidate boundary. If you try to fetch the data during filtering, that is, interleave the two steps, the candidate bound you know during the filter phase is an overestimation of the candidate boundary, so you need to access a few extra candidates, which incurs extra cost.
So there are pluses and minuses. What we did is: yes, the search range is now larger, but we can design I/O optimization techniques so that the extra cost of the larger requests is minimized by amortizing the I/O cost.
Adding up the pluses and minuses, overall we found this interleaving strategy was the most efficient, so that is our winning strategy. Also, these two algorithms are exact clone detection, but ranking is often somewhat tolerant to a little approximation. So if your application can tolerate some loss in accuracy, you can also approximate: we also propose an approximation algorithm which is 20 to 30 times faster if you can allow a 20 percent loss in precision.
Those are the algorithms. What we did is build a code corpus out of the JDK and 400 Java open source projects we collected from the Web, and we first generated characteristic vectors using the existing tool that's used for offline clone detection.
Building the indexes for this open source corpus can take anywhere from 0.5 seconds to about two minutes. However, what we want to stress here is that the index is built once and then used over and over for different queries: it is built offline and reused. What we want to make instant is the querying time. This plot compares the querying time for up to 1.6 million vectors when we are retrieving the top 20 results. Compared to a scan, the second line is the two-step approach and the third line is the interleaving algorithm; the interleaving algorithm is up to about 40 times faster, and the querying time is within the range of one second to 0.1 seconds, which was the range we were targeting.
We also did the same experiment over different K, depending on how many top results we're retrieving, and again we are much faster than the scan and scale well as K increases.
We also compared with LSH, which is used by the DECKARD tool, and we observed that our R-tree-based algorithm is about 40 times faster than the existing tool. Here the missing points for LSH are because of insufficient memory: we couldn't get those data points with the system setting we have. And we are about 40 times faster compared to the existing tool.
So with that, I want to conclude this talk. What I wanted to say is that we can apply many search-driven approaches to contribute to the software development process. In the first paper, we found that we can use search techniques and an open source code base to extract high quality examples that can help developers learn how to use APIs, and I'm also interested in making semantic code search instant so that we can interleave lookup sessions with development.
And I think there are many open questions. We are using the characteristic vector representation, but obviously there can be many different ways to represent more semantics of the code, using structural information, static analysis, and so on.
Another thing is that I know the code is only a very small part of the haystack that you generate during software development. There is lots of information that software development generates and also looks up, related to bugs, tests, specifications, and there's lots of communication from people in forums, e-mail, chat, and so on. So there are many different types of software haystacks that are interesting for a search researcher, and all those haystacks are closely interacting, which is very interesting. And eventually, if this search can be really accurate and fast, I'm interested in whether that AI-like vision of your tool completing your code for you is feasible; in the longer term I'm interested in seeing that. So those are the open questions. And I believe this is an interesting field where experts from many different fields can contribute, like software engineering and HCI, and I'm coming from databases and search. Because of that, I'm very happy to get to meet experts from the software engineering field and present this work, and I would be happy to get inspired by your expertise through questions and comments. That's what I prepared, and thanks for your attention. [applause]
Any questions, comments?
>>: So I was curious about how your characteristic vector works. Essentially you use that as sort of a distance metric between code examples for the clustering step.
>> Seung-won Hwang: Right.
>>: So two examples are sort of similar if they have essentially the same frequency of nodes in their ASTs, is that essentially how --
>> Seung-won Hwang: Right. As you said, the characteristic vector is a frequency vector. For ranking, right now we are using Euclidean distance, but there are many different distance functions you could use.
>>: So like the -- so the ordering of like, say, statements wouldn't factor into that.
>> Seung-won Hwang: Wouldn't factor in with current approximation.
>>: Okay.
>> Seung-won Hwang: Any other questions?
>>: How do you choose K?
>> Seung-won Hwang: So what we do is try out different values of K, and there are also clustering quality metrics studied in the literature. The goal of clustering is to group items so that intra-group similarity is maximized and inter-group similarity is minimized, so based on those criteria we can quantify how good the clusterings are. We try out different K and pick the one with the best score, yeah.
>>: I'm interested in the generating of code samples for documentation; I've done some work in that area. One of the things that I found when I was experimenting was that large open source databases [inaudible] are often not very representative of actual usage [inaudible]. So if you, say, Google for how to use an SSL socket, versus searching Google Code or any of the search engines which are dominated by huge open source Apache projects and such, you're much more likely to get, well, here's the Apache special implementation of SSL sockets that it uses in its mail server. If you Google it, the top 1,000 results would be the one common thing that [inaudible] does. So it would be interesting to apply these techniques not just to large databases of code, but also to the fragments of sample code that are the number one resource for programmers learning how to use an API [inaudible].
>> Seung-won Hwang: Right. I'm very interested to hear that the corpus from Google Code Search is a lot different from what you get from Google, the general search engine. For this work right now we are using the Google tool, so if there is such a bias, I think this will suffer the same bias. One way around it is to build our own crawler. And another interesting idea is that right now we are extracting examples from actual code, but as you said we could also write a crawler for the code examples that people write in software forums and so on and just extract those. I think that would be interesting as well, in the sense that such code was written to explain something, so it is likely to be more instructive. But the downside is that the coverage could be lower, like for an API that is -- I don't know --
>>: So I think the coverage is actually much, much higher.
>> Seung-won Hwang: Higher.
>>: So we've done this experimentation, especially for obscure things: you just won't find them in Google Code, and we've got hundreds of thousands.
>> Seung-won Hwang: I see.
>>: But we published a paper on it; the tool was called the JDOP [phonetic] search engine.
>> Seung-won Hwang: I see. What we can do is crawl both the handcrafted examples on the Web and the actual open source code and combine them as a corpus. Actually, we didn't have a chance to compare the two corpora, but I'm interested in comparing them, their coverage and their pros and cons. And once both of them are collected as the corpus, the remaining process works more or less the same. So that's what we can do.
>>: When I do some searching for [inaudible] code in a search engine, I often find separate code on some [inaudible] geek's blogs. The key thing is I like their examples, but in the code there is a lot of experience, from the introduction to the completion. Some of the text is very useful, but I don't want to be [inaudible]. So have you thought about incorporating sort of [inaudible] information, their code and, sort of, their text, into the search software?
>> Seung-won Hwang: Blog.
>>: Right.
>> Seung-won Hwang: I think those two questions are related; that's what we want to do. There are also lots of social resources, and forums are interesting to crawl; I believe the quality will be very high. So, yeah, that's one of the future topics that I want to explore. I think both can be complementary to each other, and I want to know more about that.
>>: Taking good information from the [inaudible] where [inaudible].
>> Seung-won Hwang: The text as well. The same summarization technique can be used for code, and for the text, I think we can borrow some ideas from NLP; they do lots of summarization. So we can try to build something like this out of forums, which could be a very interesting thing to do in the future.
Any other questions? Okay. Thank you. [applause]