The Impact of Classifier Configuration and Classifier Combination on Bug Localization
Mrs. C. Navamani, MCA., M.Phil., M.E. (Assistant Professor) and S. Priyanka (Final Year MCA),
Department of Computer Applications, Nandha Engineering College (Anna University), Erode, Tamil Nadu.
Abstract—Bug localization is the task of determining which source code entities are relevant to a bug report. Manual bug localization is labor intensive, since developers must consider thousands of source code entities. Current research builds bug localization classifiers, based on information retrieval models, to find entities that are textually similar to the bug report. Current research, however, does not consider the effect of classifier configuration, i.e., all the parameter values that specify the behavior of a classifier. As such, the effect of each parameter, and which parameter values lead to the best performance, is unknown. In this paper we make two main contributions. First, we show that the parameters of a classifier have a significant impact on its performance. Second, we show that combining multiple classifiers, whether those classifiers are hand-picked or randomly chosen relative to intelligently defined subspaces of classifiers, improves the performance of even the best individual classifiers.
Keywords—Software maintenance, bug localization, information retrieval, VSM, LSI, LDA, classifier combination
1. INTRODUCTION
Developers typically use bug tracking databases to manage the incoming bug reports in their software projects. For several projects, developers are overwhelmed by the volume of incoming bug reports that must be addressed. For example, in the Eclipse project, developers receive an average of 115 new bug reports every day; the Mozilla and IBM Jazz projects receive 152 and 105 new reports per day, respectively. Developers must then spend considerable time and effort to read each new report and decide which source code entities are relevant for fixing the bug.
This task is known as bug localization, which is defined as a classification problem: Given n source code entities and a bug report, classify the bug report as belonging to one of the n entities. The classifier returns a ranked list of potentially relevant entities, along with a relevancy score for each entity in the list. An entity is considered relevant if it indeed needs to be changed to resolve the bug report, and irrelevant otherwise.
The developer uses the list of potentially relevant entities to identify an entity related to the bug report and make the necessary modifications. After one relevant entity is identified using bug localization, developers can use change propagation methods [1], [2] to identify any other entities that also need to be changed. Hence, the bug localization task is to find the first relevant entity; the task then switches to change propagation. Fully or partially automating bug localization can greatly reduce the development effort required to fix bugs, as much of the maintenance time is currently spent manually locating the relevant entities, which is both tedious and expensive.
Current bug localization research uses information retrieval (IR) classifiers to find source code entities that are textually similar to bug reports. However, recent results are unclear and conflicting: Some claim that the Vector Space Model (VSM) yields the best performance, while others claim that the Latent Dirichlet Allocation (LDA) model is best, while still others claim that a new IR model is necessary. These differing messages are due to the use of different datasets, different performance metrics, and different classifier configurations. A classifier configuration specifies the values of all the parameters that define the behavior of a classifier, such as the way in which the source code is preprocessed, how terms are weighted, and the similarity metric between bug reports and source code entities.
We aim to address this ambiguity by performing a large-scale empirical study to compare thousands of IR classifier configurations on a large number of bug reports. By using the same datasets and performance metrics, we can make an apples-to-apples comparison of the different configurations, quantifying the influence of configuration on performance and identifying which specific configurations yield the best performance. We find that configuration indeed has a large impact on performance: Some configurations are nearly useless, while others perform very well.
Further, we investigate the effect of combining the results of different classifiers, since combination has been shown to be useful in other domains as well as in software engineering (e.g., defect prediction). Classifier combinations are known to be helpful in several situations, such as when the individual classifiers each have expertise in only a subset of the input cases, or when the performance of the individual classifiers tends to vary widely. We present a framework for combining multiple classifiers that, as we later show, can achieve better bug localization results than any single classifier. The main intuition behind classifier combination is that when a specific source code entity is returned high in the rankings of several classifiers, we can conclude with high confidence that the entity is truly relevant for the bug report. Our framework easily extends to any kind of bug localization classifier: We can combine IR-based classifiers with dynamic analysis classifiers, defect prediction classifiers, or any other classifier that somehow solves the bug localization problem. If the classifiers are mutually uncorrelated in their incorrect answers (i.e., the classifiers tend to make different mistakes from each other), then combining them will likely improve performance. Given the nature of the bug localization problem, which has many conceivable incorrect answers for a given bug report (i.e., source code entities that are dissimilar to the bug report), combining models has the potential to perform well.
We then perform a case study on three large projects to study the performance of thousands of classifier configurations. Using these results as a baseline, we investigate the performance of combinations of classifiers. We find that:
1. The performance of a classifier is extremely sensitive to its configuration. For example, the "average" configuration of a classifier in Eclipse achieves only half the performance of the best configuration.
2. Combining classifiers almost always increases performance, often by a significant amount. This is true when the best individual classifiers are combined, and also when random classifiers are combined.
We provide our data, results, and tools online to encourage others to replicate and extend our work.
2. BACKGROUND AND RELATED WORK
All published bug localization research to date
builds classifiers using IR models. We introduce
IR models and describe existing IR-based bug
localization and concept/feature location
approaches.
2.1 Information Retrieval Models
Information retrieval (IR) is the study of querying for text within a collection of documents. For example, the Google search engine uses IR techniques to help users find snippets of text in web pages. The file search function in a typical operating system is based on the same theory, although applied at a much smaller scale.
IR-based bug localization classifiers use IR models to quantify the textual similarity between a bug report (i.e., the query) and the source code entities (i.e., the documents). For example, if a bug report contains the words "Drop 20 bytes off each imgRequest object," then an IR model looks for entities that contain these words ("drop," "bytes," "imgRequest," etc.). When a bug report and an entity contain many common words, an IR-based classifier gives the entity a high relevancy score.
IR-based classifiers have several parameters that control their performance. Specifying a value for every parameter fully defines the configuration of a classifier. Common to all IR-based classifiers are parameters that control how the raw textual documents are represented and preprocessed:
1. Which parts of the source code should be included: the comments, the identifier names, or other information, such as the previous bug reports linked to each source code entity?
2. Which parts of the bug report should be included: the title only, the description only, or both?
3. How should the source code and bug reports be preprocessed? Should compound identifier names (e.g., imgRequest) be split? Should common stop words be removed? Should words be stemmed to their root form?
Once these decisions are made, each IR model has its own set of additional parameters that control term weighting, dimensionality reduction, similarity metrics, and other aspects. We describe three typical and commonly used IR models: the Vector Space Model; Latent Semantic Indexing, an enhancement to the Vector Space Model; and latent Dirichlet allocation, a probabilistic topic model. We then discuss the preprocessing steps that can be applied to the source code and bug reports.
2.1.1 Vector Space Model
The Vector Space Model (VSM) is a simple algebraic model based on the term-document matrix of a corpus. The term-document matrix is an m x n matrix whose rows correspond to individual terms (i.e., words) and whose columns correspond to individual documents. The (i, j)th entry of the matrix is the weight of term wi in document dj. In VSM, documents are represented by their column vector in the term-document matrix: a vector containing the weights of the terms present in the document, and zeros otherwise. The similarity between two documents is computed by comparing their two vectors. In VSM, two documents will only be considered similar if they share at least one term; the more common terms they have, the higher their similarity score will be.
VSM uses the following parameters:
1. Term weighting (TW): The weight of a term in a document. Common values for this parameter are term frequency (i.e., the number of occurrences of the term in the document) or tf-idf (term frequency, inverse document frequency).
2. Similarity metric (Sim): The similarity between two document vectors. Popular parameter values are Euclidean distance, cosine distance, Hellinger distance, or KL divergence.
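To make the VSM classifier concrete, the following minimal Python sketch ranks a handful of hypothetical source code entities against a bug report, using tf-idf term weighting and cosine similarity as example choices of the TW and Sim parameters. It assumes scikit-learn is available; the entity names and contents are invented for illustration and are not drawn from the studied projects.

```python
# Minimal VSM sketch: rank source code entities against a bug report
# using tf-idf term weighting (TW) and cosine similarity (Sim).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-preprocessed source code entities (the documents).
entities = {
    "imgRequest.cpp": "img request drop bytes cache image load",
    "netSocket.cpp":  "socket connect network timeout retry",
    "guiButton.cpp":  "button click mouse gui event handler",
}
bug_report = "Drop 20 bytes off each imgRequest object"  # the query

vectorizer = TfidfVectorizer()                        # TW = tf-idf
doc_matrix = vectorizer.fit_transform(entities.values())
query_vec = vectorizer.transform([bug_report])

scores = cosine_similarity(query_vec, doc_matrix)[0]  # Sim = cosine
for name, score in sorted(zip(entities, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {name}")                     # ranked list with relevancy scores
```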
2.1.2 Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an extension to VSM in which singular value decomposition (SVD) is used to project the original term-document matrix into three new matrices: a topic-document matrix D, a term-topic matrix T, and a diagonal matrix S of eigenvalues. Importantly, LSI reduces the dimensionality of D and T to a dimensionality K, where K is a parameter provided by the user. In the projection, terms that are related by co-occurrence (i.e., terms that often appear in the same documents) are grouped together into "concepts," sometimes called "topics." For example, a GUI-related topic might contain the terms "mouse," "click," "left," and "scroll," since these terms tend to appear in the same documents. The reduced dimensionality of the topic-document matrix gives LSI improved performance over VSM when dealing with polysemy and synonymy [3]. For example, two documents can be considered similar even if they do not share any terms, but instead share terms from the same topic (e.g., document 1 contains "mouse" and document 2 contains "click").
In LSI, documents are still represented as column vectors, although LSI vectors contain the weights of topics whereas VSM vectors contain the weights of individual terms. LSI and VSM can use the same similarity metrics to determine the similarity between two documents.
LSI uses the following parameters:
1. Term weighting (TW): Similar to the VSM TW.
2. Number of topics (K): Controls how many topics are kept during the SVD reduction.
3. Similarity metric (Sim): Similar to the VSM Sim.
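Under the same assumptions (scikit-learn installed, toy documents invented for illustration), the sketch below shows the LSI projection: a tf-idf term-document matrix is reduced to K topics with truncated SVD, and the bug report is compared to the entities in the reduced topic space.

```python
# Minimal LSI sketch: reduce a tf-idf term-document matrix to K topics
# with truncated SVD, then compare query and documents in topic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [                                        # hypothetical entity texts
    "mouse click left scroll gui widget",
    "socket connect network packet timeout",
    "image request cache drop bytes",
]
bug_report = ["gui mouse scroll broken"]

K = 2                                           # number of topics (parameter K)
tfidf = TfidfVectorizer()                       # TW = tf-idf
term_doc = tfidf.fit_transform(docs)
svd = TruncatedSVD(n_components=K, random_state=0)
doc_topics = svd.fit_transform(term_doc)        # topic-document representation
query_topics = svd.transform(tfidf.transform(bug_report))

print(cosine_similarity(query_topics, doc_topics))  # Sim = cosine, one score per entity
```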
2.1.3 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) [4] is a popular statistical topic model that provides a means to automatically index, search, and cluster documents that are unstructured and unlabeled [5]. Like LSI, LDA accomplishes these tasks by first determining a set of "topics" within the documents, and then describing each document as a combination of topics. The key difference between LSI and LDA is the technique used to produce the topics. In LSI, topics are a byproduct of the SVD reduction of the term-document matrix. In LDA, topics are explicitly modeled through a generative process, using machine learning algorithms to iteratively learn which words are present in which topics, and which topics are present in which documents. While the generative process of LDA enjoys several theoretical advantages over LSI and VSM, such as model checking and no assumptions about the distribution of term counts in the corpus, the results of LDA in practical IR studies have thus far been mixed.
LDA uses the following parameters:
1. Number of topics (K): Controls how many topics are created.
2. A document-topic smoothing parameter.
3. A word-topic smoothing parameter.
4. Number of iterations (iters): The number of sampling iterations in the generative process.
5. Similarity metric (Sim): Similar to the VSM Sim.
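A minimal sketch of an LDA-based classifier is shown below, again with invented toy documents. It uses scikit-learn's LatentDirichletAllocation, whose n_components, doc_topic_prior, topic_word_prior, and max_iter arguments correspond roughly to the K, smoothing, and iters parameters listed above; note that scikit-learn fits LDA with variational inference rather than Gibbs sampling, so max_iter is only an analogue of the sampling iteration count.

```python
# Minimal LDA sketch: learn K topics over term counts, represent each
# entity and the bug report as a topic mixture, and compare the mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [                                        # hypothetical entity texts
    "mouse click left scroll gui widget button",
    "socket connect network packet timeout retry",
    "image request cache drop bytes memory",
]
bug_report = ["drop bytes off each image request object"]

counts = CountVectorizer()
term_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(
    n_components=2,          # K, the number of topics
    doc_topic_prior=0.1,     # document-topic smoothing parameter
    topic_word_prior=0.01,   # word-topic smoothing parameter
    max_iter=50,             # analogue of iters
    random_state=0,
)
doc_topics = lda.fit_transform(term_counts)              # topic mixture per entity
query_topics = lda.transform(counts.transform(bug_report))
print(cosine_similarity(query_topics, doc_topics))       # Sim = cosine
```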
2.1.4 Data Preprocessing
Before IR models are applied to source code and bug reports, several preprocessing steps are usually taken in an effort to reduce noise and improve the resulting models:
1. Characters related to the syntax of the programming language (e.g., "&&", "->") are removed; programming language keywords (e.g., "if," "while") are removed.
2. Identifier names are split, using regular expressions, into their component parts based on common naming conventions, such as camel case (oneTwo), underscores (one_two), dot separators (one.two), and capitalization changes (ONETwo) [15]. Recently, researchers have proposed more advanced methods to split identifiers, based on speech recognition, automatic expansion, and mining source code, which may be more effective than simple regular expressions.
3. Common English-language stop words (e.g., "the," "it," "on") are removed. In addition, custom stop word lists can be used, such as domain-specific jargon lists.
4. Word stemming is applied to reduce each word to its root form (e.g., "changes" becomes "chang"), usually using the Porter algorithm.
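The sketch below strings these steps together for a single line of hypothetical source code: syntax characters are stripped, identifiers are split on underscores, dots, camel case, and capitalization changes, stop words and language keywords are dropped, and the remaining tokens are stemmed with NLTK's Porter stemmer (assumed to be installed). The regular expressions and the stop word list are illustrative simplifications, not the exact rules used by our preprocessing tool.

```python
# Minimal preprocessing sketch: strip syntax, split identifiers,
# remove stop words / keywords, and stem with the Porter algorithm.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "it", "on", "a", "of", "if", "while"}  # example list only
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^A-Za-z_.\s]", " ", text)        # drop syntax characters
    tokens = []
    for token in text.split():
        for part in re.split(r"[._]", token):          # underscores and dots
            # split camel case and capitalization changes, e.g. imgRequest, ONETwo
            tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    tokens = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]           # Porter stemming

print(preprocess("if (imgRequest.dropBytes) { handle_the_ONETwo_case(); }"))
# ['img', 'request', 'drop', 'byte', 'handl', 'one', 'two', 'case']
```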
The main idea behind these preprocessing steps is to capture developers' intentions, which are thought to be embedded within the identifier names and comments in the source code. The rest of the source code (i.e., special syntax, language keywords, and stop words) is viewed as noise and is not valuable as input for the IR models. For replication purposes, we provide our preprocessing tool online.
2.2 Existing IR-Based Bug Localization Approaches
Researchers have explored the use of IR models for bug localization. For example, one study compares the performance of LSI and LDA using three small case studies. The authors build the two IR classifiers on the identifiers and comments of the source code and compute the similarity between a bug report and each source code entity using the cosine and conditional probability similarity metrics. By performing case studies on Eclipse and Mozilla (on a total of three and five bug reports, respectively), the authors find that LDA often outperforms LSI. We note that the authors use manual query expansion, which may influence their results.
Another study introduces a new topic model based on LDA, known as BugScout, in an effort to improve bug localization performance. BugScout explicitly considers past bug reports, in addition to identifiers and comments, when describing source code documents, using the two data sources concurrently to identify key technical concepts. The authors apply BugScout to four different projects and find that BugScout improves performance by up to 20 percent over LDA applied only to the source code.
Rao and Kak compare several IR models for bug localization, including VSM, LSI, and LDA, as well as various combinations. The authors perform a case study on a small dataset, iBUGS [12], and conclude that the simpler IR models often outperform more sophisticated models.
Limitations of current research. In current research, researchers consider only a single configuration or a small number of configurations of the classifiers, often with no explanation given for why each parameter value was selected out of the huge space of possible values. Worse, several parameter values are left unspecified, making replication of the results difficult or impossible. Given that there are many choices for each parameter in the configuration, and the parameters are independent, there are thousands of possible configurations for each underlying IR model. The effectiveness of each configuration, i.e., which parameters are important and which parameter values work best, is currently unknown. As a result, researchers and practitioners are left to guess which configuration to use in their projects.
2.3 IR-Based Concept/Feature Location Approaches
Closely related to IR-based bug localization is the problem of IR-based concept (or feature) location. In both problems, the aim is to identify a source code entity that is relevant to a given query. The difference is that in bug localization the query is the text within a bug report, whereas in concept location the query is manually created by the developer.
LSI was first used for concept location by Marcus et al. The authors find that LSI provides better results than the methods existing at the time, such as regular expressions and dependency graphs.
Researchers have also combined various methods to accomplish concept location. For example, one approach combines LSI with a dynamic feature location method known as scenario-based probabilistic ranking; the two approaches operate on different document sets and use different analysis methods. The results of the combined approach are much better than either individual approach, as evidenced by two case studies on large projects. Other researchers use Formal Concept Analysis to achieve similar results. Cleary et al. [6] combine several IR models with Natural Language Processing methods and conclude that NLP techniques do not significantly improve results. Finally, Revelle et al. combine LSI, dynamic analysis, and web mining algorithms for feature location and find that the combination outperforms any of the individual methods.
3. CONFIGURATION FRAMEWORK
The use of IR classifiers in bug localization requires the definition of several parameters. In fact, since there are so many parameters and some parameters can take on any numeric value, there is effectively an infinite number of possible configurations of IR classifiers. Researchers in the data mining community argue that, in an ideal world, algorithms should require no tuning parameters, to avoid possible bias by the user. However, such an ideal is rarely met, and the reality of IR classifiers is that they contain many parameters. Unfortunately, as discussed above, current bug localization research uses manual selection of parameter values and explores only a tiny fraction of all possible configurations.
We propose a configuration framework for analyzing the various configurations of a classifier. We summarize the framework as follows:
1. Define a set of classifier configurations.
2. Execute each configuration to accomplish the task at hand and measure the performance of each configuration using some effectiveness measure.
3. Analyze the performance of each configuration and of the various parameter settings, using, for example, Tukey's HSD statistical test.
We describe each step in more detail below.
3.1 Define Configurations
As mentioned previously, a classifier is defined by a configuration: a set of parameter values that specify the behavior of the classifier. To define a configuration is to specify the value of each parameter. Some parameters may be categorical (e.g., which similarity metric is used) while others may be numerical (e.g., the number of topics in LSI or LDA).
We note that several experimental design techniques exist to help select the space of configurations needed for an accurate analysis. First, parameters with numeric values are quantized from a continuous scale to a small subset of values. Second, an experimental design is chosen. For example, a full factorial design requires that all (considered) values of one parameter be examined with respect to all values of all other parameters, resulting in the maximum possible number of configurations. Smaller designs include Box-Behnken [7] and central composite designs. No matter the chosen design, the subsequent two steps in our framework remain the same.
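As a small illustration of how quickly a full factorial design grows, the sketch below crosses a few hypothetical parameters and their quantized values with itertools.product; the parameter names and value sets are examples only, not the exact configuration space studied here.

```python
# Minimal full factorial sketch: every value of every parameter is
# crossed with every value of every other parameter.
from itertools import product

parameter_space = {                      # hypothetical parameters and values
    "term_weighting": ["tf", "tfidf"],
    "similarity":     ["cosine", "hellinger"],
    "num_topics":     [32, 64, 128],     # a quantized numeric parameter
    "stemming":       [True, False],
}

configurations = [
    dict(zip(parameter_space, values))
    for values in product(*parameter_space.values())
]
print(len(configurations))               # 2 * 2 * 3 * 2 = 24 configurations
```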
3.2 Execute Configurations
The next step in our framework is to execute each configuration on the task at hand to produce a set of tuples, where each tuple contains the effectiveness of a configuration Ci. How exactly this is executed varies by the task. In bug localization, for example, this step entails preprocessing the source code and bug reports, building the index on the source code, running the queries against the index to retrieve a ranked list of results, and evaluating the results using some measure of effectiveness (such as the top-20 measure for the task of bug localization, defined later, or the F-measure or Mean Average Precision (MAP) for the task of traceability linking).
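For concreteness, the sketch below computes a top-k effectiveness measure of the kind used in this paper: the fraction of bug reports for which at least one truly relevant entity appears in the first k results of a configuration's ranked list. The ranked lists and ground-truth sets are hypothetical.

```python
# Minimal top-k sketch: fraction of bug reports with at least one
# relevant entity among the first k ranked results.
def top_k(ranked_lists, relevant_sets, k=20):
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(entity in relevant for entity in ranked[:k])
    )
    return hits / len(ranked_lists)

ranked_lists = [["a.c", "b.c", "c.c"], ["d.c", "e.c", "f.c"]]   # one list per bug report
relevant_sets = [{"b.c"}, {"z.c"}]                              # ground truth per bug report
print(top_k(ranked_lists, relevant_sets, k=20))                 # 0.5
```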
3.3 Analyze Performance of Configurations
Finally, we analyze the performance of the configurations. We have two different goals. First, we wish to determine the best and worst configurations, and second, we wish to determine which parameter values are most effective.
To determine the best configurations, we simply sort the configurations by their effectiveness measure, in ascending or descending order, depending on whether low or high values are desirable.
To determine which parameter values are statistically most effective, we use Tukey's Honestly Significant Difference (HSD) test to compare the performance of each value of each parameter. The HSD test is a statistical test on the means of the results obtained for each parameter value, holding one parameter constant and letting all other parameters vary. For a given parameter (e.g., "number of topics"), the HSD test compares the mean of each possible value with the mean of every other possible value (e.g., "32" versus "64" versus "128"). Using the studentized range distribution, the HSD test determines whether the differences between the means exceed the expected standard error. The outcome of HSD is a set of statistically equivalent groups of parameter values. If two parameter values belong to the same group, then the performances of the parameter values are not statistically different, and either may be used in place of the other. Note that a parameter value can belong to many groups, and group memberships are not transitive: If parameter value A and parameter value B belong to the same group, and parameter value B and parameter value C belong to the same group, value A and value C do not necessarily belong to the same group.
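The sketch below shows how such an HSD analysis could be run with the statsmodels library (an assumption; any HSD implementation would do): the top-20 scores of the executed configurations are grouped by the value of one parameter, and pairwise_tukeyhsd reports which groups of values are statistically equivalent. The scores are made-up numbers used only to exercise the test.

```python
# Minimal Tukey HSD sketch: hold one parameter ("number of topics") and
# compare the mean performance of each of its values.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One entry per executed configuration: its top-20 score and the value
# of the parameter under study (illustrative numbers only).
scores = np.array([0.31, 0.35, 0.33, 0.45, 0.48, 0.47, 0.52, 0.55, 0.51])
num_topics = np.array(["32", "32", "32", "64", "64", "64", "128", "128", "128"])

result = pairwise_tukeyhsd(endog=scores, groups=num_topics, alpha=0.05)
print(result)   # pairwise comparisons and statistically equivalent groups
```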
4. COMBINATION FRAMEWORK
There is a rich literature in the pattern recognition and data mining domains on combining classifiers, variously called classifier ensembles, voting experts, or hybrid methods [8]. No matter the name used, the underlying idea is the same: Different classifiers often excel on different inputs and make different errors. For example, one IR-based classifier might be good at finding links between bug reports and source code entities if they share one or more exact-match keywords, which might happen if the bug report explicitly references the variable names or method names of the relevant source code entity.
Another IR-based classifier might be good at finding links between general concepts, such as "GUI" or "networking," even if no individual keywords are shared between the bug report and the relevant entity. An entity metrics-based classifier might be good at determining which source code entities are likely to be bug-prone in the first place, regardless of the specific bug report in question. By combining these disparate classifiers, the truly relevant files are likely to "bubble up" to the top of the combined list, providing developers with fewer false positive matches to explore. While classifier combination has been successful in other domains, and recently even in other areas of software engineering [9], it has not been explored in the context of bug localization.
We present a framework to combine multiple bug localization classifiers, based largely on methods from other domains, as illustrated in Fig. 1.
Fig. 1. An illustration of the classifier combination framework. Three classifiers are created, based on the available input data and the given bug report. The classifiers are combined, using score addition, to produce a single ranked list of source code entities. In this example, fileX bubbles up to the top of the combined list because it is high on each of the three classifiers' lists.
The framework consists of two steps:
1. Any number of classifiers are created, based on the available input data and the given bug report. The choice of classifiers affects the performance of the classifier combination.
2. The classifiers are combined using any of several combination methods, such as Borda Count [10], score addition, or reciprocal rank fusion [11].
We now describe each component in more detail.
4.1 Creation of Classifiers
As mentioned in Section 3, classifiers can come in many forms. IR-based classifiers attempt to find textual similarities between the given bug report and the source code entities. Entity metric-based classifiers use entity metrics [12], such as lines of code, to determine which source code entities are likely to have the largest number of bugs, independent of the given bug report (which, as we found, has surprisingly good performance). In fact, we treat any technique that returns a ranked list of source code entities as a classifier. The result set of a classifier consists of a rank and a (relevancy) score for every source code entity in the project. Note that scores need not be unique; in fact, many entities may be assigned a score of 0. In this case, they all share a rank of M + 1, where M is the number of entities that received a nonzero score.
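A small sketch of this result-set convention follows, with hypothetical scores: entities with nonzero scores are ranked by score, and all zero-score entities share rank M + 1.

```python
# Minimal result-set sketch: rank entities by relevancy score; entities
# with a score of 0 all share rank M + 1.
def to_result_set(scores):
    """scores: dict mapping entity name -> relevancy score."""
    nonzero = sorted((e for e, s in scores.items() if s > 0),
                     key=lambda e: scores[e], reverse=True)
    m = len(nonzero)                                   # entities with nonzero scores
    ranks = {e: i + 1 for i, e in enumerate(nonzero)}
    ranks.update({e: m + 1 for e, s in scores.items() if s == 0})
    return ranks

print(to_result_set({"a.c": 0.9, "b.c": 0.0, "c.c": 0.4, "d.c": 0.0}))
# {'a.c': 1, 'c.c': 2, 'b.c': 3, 'd.c': 3}
```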
The choice of individual classifiers affects the performance of the classifier combination [13]. Since classifier combination works best when the individual classifiers err in different ways, choosing classifiers that are likely to make the most uncorrelated mistakes, such as classifiers based on different input data representations, is likely to achieve the best result.
Defining classifier subspaces. Recall that for IR-based classifiers, two data sources must be represented: that of the source code entity and that of the bug report. Accordingly, we define four IR classifier subspaces that are based on four different input data representations, using the notation from the configuration framework.
Our framework provides two basic techniques for choosing which classifiers to combine: combining the best-performing individual classifiers from each subspace, and combining randomly selected classifiers from each subspace.
To combine the best IR classifiers, we select the best classifier from each of the subspaces. Should we want to combine the best eight IR classifiers, we select the best two classifiers from each subspace, and so on.
If we do not know in advance which classifiers from each subspace perform best, we can still use the subspaces to create so-called intelligently random sets of classifiers. Here, we select random classifiers with equal probability from each of the four subspaces, so that we decrease the likelihood that the wrong answers of the selected classifiers will be correlated in any way.
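The sketch below illustrates the intelligently random selection, with hypothetical subspace contents and classifier identifiers: one classifier is drawn uniformly at random from each subspace, so the selected classifiers are built on different input data representations.

```python
# Minimal sketch of "intelligently random" selection: draw one random
# classifier from each subspace (subspace contents are hypothetical).
import random

subspaces = {
    "subspace_1": ["VSM-17", "LSI-04", "LDA-21"],
    "subspace_2": ["VSM-02", "LSI-33", "LDA-08"],
    "subspace_3": ["VSM-11", "LSI-19", "LDA-27"],
    "subspace_4": ["EM-bugcount", "EM-loc"],
}

random.seed(0)                                # fixed seed for a repeatable example
selected = [random.choice(classifiers) for classifiers in subspaces.values()]
print(selected)                               # one classifier per subspace
```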
4.2 Combination Techniques
Given a set of classifiers, we can combine them in any of many ways. A simple rank-only combination is the Borda Count method [14], which was originally devised for political election systems. For each source code entity dk, the Borda Count method assigns points based on the rank of dk in each classifier's result set. The Borda Count points from all classifiers are summed for each entity, and the entity with the maximum total Borda Count score is ranked first, and so on. The combined result is formed by ranking entities according to their total Borda score.
Instead of using the ranks, the scores of each classifier can also be used. For example, the scores of each entity dk across all classifiers Ci are summed to produce a total score for each entity. Usually, the scores of each classifier are scaled to the same range (e.g., [0, 1]) before combination, to avoid unintentionally weighting the influence of certain classifiers. However, both combination methods can be modified to explicitly weight certain classifiers differently from others, for example, by scaling the best-performing classifiers to the range [0, 2]. We leave the investigation of such weighting schemes for future work.
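The sketch below implements the two combination techniques discussed above, Borda Count and score addition with per-classifier scaling to [0, 1]; the classifier result sets are hypothetical, and min-max scaling is one reasonable normalization choice.

```python
# Minimal combination sketch: Borda Count (rank-only) and score addition
# with per-classifier min-max scaling to [0, 1].
def borda_count(result_sets):
    n = len(next(iter(result_sets)))                  # number of entities
    totals = {}
    for scores in result_sets:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, entity in enumerate(ranked):
            totals[entity] = totals.get(entity, 0) + (n - rank)   # points by rank
    return sorted(totals, key=totals.get, reverse=True)

def score_addition(result_sets):
    totals = {}
    for scores in result_sets:
        lo, hi = min(scores.values()), max(scores.values())
        for entity, s in scores.items():
            scaled = (s - lo) / (hi - lo) if hi > lo else 0.0     # scale to [0, 1]
            totals[entity] = totals.get(entity, 0.0) + scaled
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical result sets from two classifiers.
c1 = {"fileX": 0.9, "fileY": 0.2, "fileZ": 0.1}
c2 = {"fileX": 0.7, "fileY": 0.8, "fileZ": 0.3}
print(borda_count([c1, c2]))       # fileX and fileY tie on Borda points here
print(score_addition([c1, c2]))    # fileX comes out on top after scaling
```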
We note that the result of combining |C| classifiers defines a new classifier. This classifier can itself be combined with other classifiers. In this way, a hierarchy of classifiers can be constructed.
5. MAIN FINDINGS AND DISCUSSION
We highlight the main findings from each case study,
and provide a discussion of results.
5.1 Classifier Configuration
From our analysis of 3,172 classifiers, each evaluated on 8,084 bug reports, we draw the following conclusions:
1. Configuration matters. The performance difference between one classifier configuration and another is often significant.
2. The VSM classifier achieves the best overall top-20 performance. LSI is second, and LDA is last.
3. Using both the bug report's title and description results in the best overall performance, for all IR models.
4. Using the source code entities' identifiers, comments, and past bug reports results in the best overall performance, for all IR models.
5. Stop word removal, stemming, and identifier splitting all improve performance, for all IR models.
6. The new bug count is the best-performing entity metric.
No single classifier configuration is best for all three studied projects. However, some VSM configuration has the best performance for every studied project, suggesting that VSM is the overall best classification method for bug localization. For all studied projects, VSM > LSI > LDA, when considering the performance of each technique's best configuration.
Interestingly, for all studied projects, if we consider the worst configurations of the three IR models (VSM, LSI, and LDA), LSI achieves the highest performance, often significantly so. In Mozilla, for example, the worst LSI configuration has a performance of 23 percent, compared to VSM's 6 percent and LDA's 0 percent.
Bug localization in Mozilla is relatively easy compared to the other studied projects: The best top-20 performance for Mozilla is 80 percent, compared to Jazz's 69 percent and Eclipse's 55 percent. We also note that Mozilla has the smallest number of source code entities, Jazz the second smallest, and Eclipse the largest.
5.2 Classifier Combination
Based on the results of our case study on classifier combination, we conclude that:
1. Combining the best-performing classifiers always improves performance, by at least 10 percent and up to 95 percent in top-20 performance, compared to the best-performing individual classifier.
2. Combining "intelligently random" sets of classifiers very nearly always helps (84-100 percent of the time), and usually helps by a large amount: a median of 14-56 percent improvement in top-20 performance, compared to the best-performing individual classifier.
We saw the smallest relative improvements in Mozilla, and the largest in Eclipse. We note that the individual classifiers in Mozilla already have high performance (i.e., top-20 values above 0.80), leaving little room for improvement through combination. Individual classifiers in Eclipse, on the other hand, have comparatively worse performance (a maximum top-20 value of 0.54).
In general, the Borda Count combination technique performed better than the score addition method. In all four manually generated classifier sets and all studied projects, Borda Count offered a larger improvement than score addition. Also, across the 150 randomly created classifier sets, the Borda Count method offered better mean and median relative improvements than score addition for the Eclipse and Jazz projects. In Mozilla, the mean and median relative improvements of Borda Count and score addition were comparable.
In both experiments, we combined sets of five component classifiers, based on the intuition that classifiers using different data sources as input will make uncorrelated errors. We also considered combining all 3,172 component classifiers of Case Study 1. We found the top-20 performance of their combination to be comparable to the top-20 performance of the best individual classifier. Given that the set of 3,172 contains many classifiers with very low performance, it is encouraging that their combination still achieves such high performance. This combination can be practically useful, not to improve on the best individual classifier, but to allow us to achieve close to the best performance without needing to identify the best-performing individual classifier.
6. THREATS TO VALIDITY
A threat to the internal validity of our case studies is our ground-truth data collection algorithm, which was based on an algorithm proposed by Fischer et al. [15]. Although this is the state-of-the-art algorithm for linking bug reports to source code entities, recent research has questioned the algorithm's linking bias, because some change sets in the revision control system may not explicitly indicate which bug reports they are related to. However, Nguyen et al. find that this bias exists even in a near-ideal dataset with high-quality linking, suggesting that the bias is a symptom of the underlying development process rather than of the data collection method. Another potential threat to the internal validity of our study is that of false negatives in our ground-truth data. Specifically, our ground-truth data contains links between bug reports and those source code entities that were actually changed to resolve the bug report. However, it may be the case that other source code entities could have been changed instead to resolve the same bug report. Both internal threats are mitigated in our work because, whatever biases or false negatives exist, we use the same ground-truth data in the evaluation of all classifiers and combination techniques, providing an equal footing for comparison.
A potential threat to the external validity of our study is that, even though we performed three extensive case studies on large, active, real-world projects, our results still must be replicated in other contexts. In particular, our studied projects represent only a fraction of all real-world projects, domains, programming languages, and development paradigms, so we cannot conclusively say that our results will hold for all possible projects. We have provided our data and tools online to encourage others to repeat our work on other projects.
A threat to the construct validity of our case studies is our use of the top-20 metric as a proxy for developer effectiveness. While the top-20 metric is a standard metric for evaluating bug localization performance, additional research is needed to determine how well it correlates with developer effectiveness in real-world settings.
7. CONCLUSIONS AND FUTURE WORK
Solving the bug localization problem has major implications for developers because it can dramatically reduce the time and effort required to maintain software. We cast the bug localization problem as one of classification, and analyzed the effect that classifier configuration has on bug localization performance, as well as whether classifier combination can help. We summarize our main findings as follows:
1. The configuration of an IR-based classifier matters.
2. The best individual IR-based classifier uses the Vector Space Model, with the index built using term weighting over all available data in the source code entities (i.e., identifiers, comments, and past bug reports for each entity), which has been stopped, stemmed, and split, and queried with all available data in the bug report (i.e., title and description) using cosine similarity.
3. The best entity metric (EM)-based classifier uses the new bug count metric to rank source code entities.
4. Classifier combination helps in very nearly all cases, no matter the underlying classifiers used or the specific combination technique used.
We proposed two frameworks: one for defining and analyzing classifier configurations, and one for combining the results of different classifiers. We used these frameworks to conduct our empirical case studies on bug localization; researchers can use our frameworks to conduct similar analyses in other areas that involve multiple configurations of classifiers.
We found that the configuration of a classifier has a significant impact on its performance. In Eclipse, for example, the difference between a poorly configured IR classifier and a well-configured IR classifier is the difference between having a one-in-100 chance of finding a relevant entity in the top 20 results and having better than a one-in-two chance. We also found that prior bug localization research does not use the best classifier configurations. Fortunately, we found consistent results for the various configurations across all three studied projects, suggesting that the proper configuration is not completely project specific and can be applied in a general context.
By identifying which classifier configurations are best, and how to combine them in the most effective way, our results substantially advance the state of the art in bug localization. Practitioners can use our results to accelerate the task of finding and fixing bugs, resulting in improved software quality and reduced maintenance costs. We provide our data, tools, and results online.
Since IR-based bug localization has only recently gained the attention of researchers, there are many exciting avenues to explore in future work. The most obvious avenue is the addition of further classifier families to the combination framework presented in this paper, such as those used for concept location: PageRank, formal concept analysis, dynamic analysis, and static analysis. Additional IR models can also be evaluated, such as BM25F, BugScout, and other variants of LDA, such as the Relational Topic Model. Recently, researchers have proposed query expansion methods, which may be a useful preprocessing step for any IR-based classifier. We also wish to explore whether preprocessing bug reports by removing noise in the form of stack traces and code snippets could be beneficial to bug localization results. Finally, we have yet to fully investigate several possible combination techniques, such as variations of the Borda Count and Reciprocal Rank Fusion.
REFERENCES
[1] R. Arnold and S. Bohner, "Impact Analysis—Towards a Framework for Comparison," Proc. Int'l Conf. Software Maintenance, pp. 292-301, 1993.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, vol. 463, ACM Press, 1999.
[3] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and Balanced?: Bias in Bug-Fix Data Sets," Proc. Seventh European Software Eng. Conf. and Symp. Foundations of Software Eng., pp. 121-130, 2009.
[4] D.M. Blei and J.D. Lafferty, "Topic Models," Text Mining: Classification, Clustering, and Applications, pp. 71-94, Chapman & Hall, 2009.
[5] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] S. Bohner and R. Arnold, Software Change Impact Analysis, IEEE CS Press, 1996.
[7] G. Box and D. Behnken, "Some New Three Level Designs for the Study of Quantitative Variables," Technometrics, vol. 2, no. 4, pp. 455-475, 1960.
[8] C. Carpineto and G. Romano, "A Survey of Automatic Query Expansion in Information Retrieval," ACM Computing Surveys, vol. 44, no. 1, pp. 1-50, Jan. 2012.
[9] J. Chang and D.M. Blei, "Relational Topic Models for Document Networks," Proc. 12th Int'l Conf. Artificial Intelligence and Statistics, pp. 81-88, 2009.
[10] B. Cleary, C. Exton, J. Buckley, and M. English, "An Empirical Analysis of Information Retrieval Based Concept Location Techniques in Software Comprehension," Empirical Software Eng., vol. 14, no. 1, pp. 93-130, 2008.
[11] G.V. Cormack, C.L. Clarke, and S. Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," Proc. 32nd Int'l Conf. Research and Development in Information Retrieval, pp. 758-759, 2009.
[12] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," Proc. 22nd Int'l Conf. Automated Software Eng., pp. 433-436, 2007.
[13] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating Defect Prediction Approaches: A Benchmark and an Extensive Comparison," Empirical Software Eng., vol. 17, no. 4, pp. 531-577, 2012.
[14] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," J. Am. Soc. Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[15] B. Dit, L. Guerrouj, D. Poshyvanyk, and G. Antoniol, "Can Better Identifier Splitting Techniques Help Feature Location?" Proc. Int'l Conf. Program Comprehension, pp. 11-20, 2011.