International Journal of Engineering Trends and Technology (IJETT) – Volume 33, Number 2 – March 2016

The Impact of Classifier Configuration and Classifier Combination on Bug Localization

Mrs. C. Navamani, MCA, M.Phil., M.E., Assistant Professor, Department of Computer Applications, Nandha Engineering College (Anna University), Erode, Tamilnadu
S. Priyanka, Final Year MCA, Department of Computer Applications, Nandha Engineering College (Anna University), Erode, Tamilnadu

Abstract—Bug localization is the task of determining which source code entities are relevant to a bug report. Manual bug localization is labor intensive, since developers must consider thousands of source code entities. Recent research builds bug localization classifiers, based on information retrieval models, to find entities that are textually similar to the bug report. This research, however, does not consider the effect of classifier configuration, i.e., the set of parameter values that specify the behavior of a classifier. As a result, the effect of each parameter, and which parameter values lead to the best performance, is unknown. In this paper we make two main contributions. First, we show that the parameters of a classifier have a significant impact on its performance. Second, we show that combining multiple classifiers—whether those classifiers are hand picked or randomly chosen relative to intelligently defined subspaces of classifiers—improves the performance of even the best individual classifiers.

Keywords—Software maintenance, bug localization, information retrieval, VSM, LSI, LDA, classifier combination

1 INTRODUCTION

Developers typically use bug tracking databases to manage the incoming bug reports in their software projects. For many projects, developers are overwhelmed by the volume of incoming bug reports that must be addressed. For example, developers in the Eclipse project receive an average of 115 new bug reports every day, while the Mozilla and IBM Jazz projects receive 152 and 105 new reports per day, respectively. Developers must then spend considerable time and effort reading each new report and deciding which source code entities are relevant for fixing the bug. This task is known as bug localization, and it can be cast as a classification problem: given n source code entities and a bug report, classify the bug report as belonging to one of the n entities. The classifier returns a ranked list of potentially relevant entities, along with a relevancy score for each entity in the list. An entity is considered relevant if it indeed needs to be changed to resolve the bug report, and irrelevant otherwise. The developer uses the list of potentially relevant entities to identify an entity related to the bug report and make the necessary modifications. After one relevant entity has been identified through bug localization, developers can use change propagation techniques [1], [2] to identify any other entities that also need to be changed. Hence, the bug localization task is to find the first relevant entity; the task then switches to change propagation.

Fully or partially automating bug localization can greatly reduce the development effort required to fix bugs, because much of the debugging time is currently spent manually locating the appropriate entities, which is both difficult and expensive. Current bug localization research uses information retrieval (IR) classifiers to find source code entities that are textually similar to bug reports. However, existing results are unclear and conflicting: some studies claim that the Vector Space Model (VSM) yields the best performance, others claim that the Latent Dirichlet Allocation (LDA) model is best, and still others claim that a new IR model is necessary.
These conflicting conclusions are due to the use of different datasets, different performance metrics, and different classifier configurations. A classifier configuration specifies the value of every parameter that defines the behavior of a classifier, such as the way in which the source code is preprocessed, how terms are weighted, and the similarity metric used between bug reports and source code entities. We aim to address this ambiguity by performing a large-scale empirical study that compares thousands of IR classifier configurations on a large number of bug reports. By using the same datasets and performance metrics, we can make an apples-to-apples comparison of the different configurations, quantify the influence of configuration on performance, and identify which specific configurations yield the best performance. We find that configuration indeed has a large impact on performance: some configurations are nearly useless, while others perform very well.

Further, we investigate the effect of combining the results of different classifiers, since combination has been shown to be useful in other domains as well as in software engineering (e.g., defect prediction). Classifier combinations are known to be helpful in several situations, such as when the individual classifiers each have expertise in only a subset of the input cases, or when the performance of the individual classifiers tends to vary widely. We present a framework for combining several classifiers that, as we later show, can achieve better bug localization results than any individual classifier. The main intuition behind classifier combination is that when a particular source code entity is ranked highly by several classifiers, we can conclude with high confidence that the entity is truly relevant to the bug report. Our framework easily extends to any kind of bug localization classifier: we can combine IR-based classifiers with dynamic analysis classifiers, defect prediction classifiers, or any other classifier that somehow solves the bug localization problem. If the classifiers are mutually uncorrelated in their incorrect answers (i.e., the classifiers tend to make different mistakes from each other), then combining them is likely to improve performance. Given the nature of the bug localization problem, which has many possible incorrect answers for a given bug report (i.e., source code entities that are dissimilar to the bug report), combining models has the potential to perform well.

We then perform a case study on three large projects to examine the performance of thousands of classifier configurations. Using these results as a performance baseline, we investigate the performance of combinations of classifiers. We find that:

1. The performance of a classifier is highly sensitive to its configuration. For example, the "average" configuration of a classifier in Eclipse achieves only half the performance of the best configuration.
2. Combining classifiers almost always improves performance, often by a significant amount. This is true when the best individual classifiers are combined, and also when random classifiers are combined.

We provide our data, results, and tools online to encourage others to replicate and extend our work.

2 BACKGROUND AND RELATED WORK

All published bug localization research to date builds classifiers using IR models.
We introduce IR models and describe existing IR-based bug localization and concept/feature location approaches.

2.1 Information Retrieval Models

Information retrieval is the study of querying for text within a collection of documents. For example, the Google search engine uses IR techniques to help users find snippets of text in web pages. The "Find Files" feature in a typical operating system is based on the same ideas, although applied at a much smaller scale. IR-based bug localization classifiers use IR models to compute textual similarities between a bug report (i.e., the query) and the source code entities (i.e., the documents). For example, if a bug report contains the words "Drop 20 bytes off each imgRequest object," then an IR model looks for entities that contain these words ("drop," "bytes," "imgRequest," etc.). When a bug report and an entity contain many common words, an IR-based classifier gives the entity a high relevancy score.

IR-based classifiers have several parameters that control their performance. Specifying a value for every parameter fully defines the configuration of a classifier. Common to all IR-based classifiers are parameters that control how the raw textual data is represented and preprocessed:

1. Which parts of the source code should be represented: the comments, the identifier names, or other data, such as the past bug reports associated with each source code entity?
2. Which parts of the bug report should be represented: the title only, the description only, or both?
3. How should the source code and bug report be preprocessed? Should compound identifiers (e.g., imgRequest) be split? Should common stop words be removed? Should words be stemmed to their root form?

Once these choices are made, each IR model has its own set of additional parameters that control term weighting, reduction factors, similarity metrics, and other aspects. We describe three typical and commonly used IR models: the Vector Space Model; Latent Semantic Indexing, an enhancement of the Vector Space Model; and latent Dirichlet allocation, a probabilistic topic model. We then discuss the preprocessing steps that can be applied to the source code and bug reports.

2.1.1 Vector Space Model

The Vector Space Model (VSM) is a simple algebraic model based on the term-document matrix of a corpus. The term-document matrix is an m x n matrix whose rows represent individual terms (i.e., words) and whose columns represent individual documents. The (i, j)th entry of the matrix is the weight of term wi in document dj. In VSM, documents are represented by their column vectors in the term-document matrix: a vector containing the weights of the terms present in the document, and zeros otherwise. The similarity between two documents is computed by comparing their two vectors. In VSM, two documents are only deemed similar if they share at least one term; the more shared terms they have, the higher their similarity score will be. VSM uses the following parameters:

1. Term weighting (TW): the weight of a term in a document. Common values for this parameter are term frequency (i.e., the number of occurrences of the term in the document) or tf-idf (term frequency-inverse document frequency).
2. Similarity metric (Sim): the similarity between two document vectors. Popular values for this parameter are Euclidean distance, cosine distance, Hellinger distance, and KL divergence.
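To make these two parameters concrete, the following minimal sketch builds a tf-idf weighted VSM classifier with cosine similarity using scikit-learn; the library choice, the toy corpus, and the bug report text are illustrative assumptions, not the tooling or data used in this study.

```python
# Illustrative VSM bug localization classifier: tf-idf term weighting (TW)
# and cosine similarity (Sim). The corpus below is a toy example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-preprocessed source code entities (the documents).
entities = {
    "imgRequest.cpp":    "img request object drop bytes cache",
    "MouseHandler.java": "mouse click event handler gui",
    "Socket.c":          "network socket retry timeout",
}
bug_report = "drop 20 bytes off each img request object"   # the query

vectorizer = TfidfVectorizer()                       # TW = tf-idf
doc_matrix = vectorizer.fit_transform(entities.values())
query_vec  = vectorizer.transform([bug_report])

scores = cosine_similarity(query_vec, doc_matrix)[0]  # Sim = cosine
ranked = sorted(zip(entities, scores), key=lambda p: p[1], reverse=True)
print(ranked)   # entities ordered by relevancy score; imgRequest.cpp ranks first
```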
2.1.2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) is an extension of VSM in which singular value decomposition (SVD) is used to project the original term-document matrix into three new matrices: a topic-document matrix D, a term-topic matrix T, and a diagonal matrix S of eigenvalues. Importantly, LSI reduces the dimensionality of D and T to dimensionality K, where K is a parameter provided by the user. In the projection, terms that are related by co-occurrence (i.e., terms that often occur in the same documents) are grouped together into "concepts," sometimes called "topics." For example, a GUI-related topic might contain the terms "mouse," "click," "left," and "scroll," since these terms tend to appear in the same documents. The reduced dimensionality of the topic-document matrix has been shown to improve performance over VSM when dealing with polysemy and synonymy [14]. For example, documents can be deemed similar even if they share no terms, provided they share terms from the same topic (e.g., document one contains "mouse" and document two contains "click"). In LSI, documents are still represented as column vectors, although LSI vectors contain the weights of topics whereas VSM vectors contain the weights of individual terms. LSI and VSM can use the same similarity metrics to determine the similarity between two documents. LSI uses the following parameters:

1. Term weighting (TW): as in VSM.
2. Number of topics (K): controls how many topics are kept during the SVD reduction.
3. Similarity metric (Sim): as in VSM.

2.1.3 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) [4] is a popular statistical topic model that provides a means to automatically index, search, and cluster documents that are unstructured and unlabeled [5]. Like LSI, LDA accomplishes these tasks by first discovering a set of "topics" within the documents, and then describing each document as a mixture of topics. The key difference between LSI and LDA is the technique used to create the topics. In LSI, topics are a byproduct of the SVD reduction of the term-document matrix. In LDA, topics are explicitly created through a generative process, using machine learning algorithms to iteratively discover which words are present in which topics, and which topics are present in which documents. While the generative process of LDA enjoys several theoretical advantages over LSI and VSM, such as model checking and no assumptions about the distribution of term counts in the corpus, the results of LDA in practical IR studies have thus far been mixed. LDA uses the following parameters:

1. Number of topics (K): controls how many topics are created.
2. A document-topic smoothing parameter (alpha).
3. A word-topic smoothing parameter (beta).
4. Number of iterations (iters): the number of sampling iterations in the generative process.
5. Similarity metric (Sim): as in VSM.
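As an illustration of how these parameters appear in practice, the sketch below builds LSI and LDA representations of a toy corpus with scikit-learn; the library, the parameter values, and the corpus are assumptions made for illustration only, not the setup of our case studies.

```python
# Illustrative LSI and LDA classifiers; the comments map arguments to the
# parameters listed above (TW, K, smoothing, iterations, Sim).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

entities   = ["mouse click event handler gui",
              "img request object drop bytes cache",
              "network socket retry timeout"]
bug_report = ["drop bytes off each img request object"]

# LSI: SVD over the (tf-idf weighted) term-document matrix, reduced to K topics.
tfidf = TfidfVectorizer()                                   # TW
X = tfidf.fit_transform(entities)
lsi = TruncatedSVD(n_components=2, random_state=0)          # K
doc_topics = lsi.fit_transform(X)
query      = lsi.transform(tfidf.transform(bug_report))
print(cosine_similarity(query, doc_topics))                 # Sim

# LDA: generative topic model over raw term counts.
counts = CountVectorizer()
C = counts.fit_transform(entities)
lda = LatentDirichletAllocation(n_components=2,             # K
                                doc_topic_prior=0.1,        # document-topic smoothing
                                topic_word_prior=0.01,      # word-topic smoothing
                                max_iter=50,                # iterations
                                random_state=0)
doc_topics = lda.fit_transform(C)
query      = lda.transform(counts.transform(bug_report))
print(cosine_similarity(query, doc_topics))                 # Sim
```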
2.1.4 Data Preprocessing

Before IR models are applied to source code and bug reports, several preprocessing steps are usually taken in an effort to reduce noise and improve the resulting models:

1. Characters related to the syntax of the programming language (e.g., "&&", "->") are removed, and programming language keywords (e.g., "if," "while") are removed.
2. Identifier names are split, using regular expressions, into their component parts based on common naming conventions, such as camel case (oneTwo), underscores (one_two), dot separators (one.two), and capitalization changes (ONETwo) [15]. Recently, researchers have proposed more advanced methods to split identifiers, based on speech recognition, automatic expansion, and mining source code, which may be more effective than simple regular expressions.
3. Common English-language stop words (e.g., "the," "it," "on") are removed. In addition, custom stop word lists can be used, such as domain-specific jargon lists.
4. Word stemming is applied to reduce each word to its root form (e.g., "changes" and "changing" both reduce to "chang"), usually with the Porter algorithm.

The main idea behind these preprocessing steps is to capture developers' intentions, which are thought to be encoded in the identifier names and comments in the source code. The rest of the source code (i.e., special syntax, language keywords, and stop words) is viewed as noise and is not valuable as input for IR models. For replication purposes, we provide our preprocessing tool online; a simplified sketch of these steps follows.
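The sketch below illustrates steps 2-4 (identifier splitting via regular expressions, stop word removal, and Porter stemming); the regular expressions and the tiny stop word list are simplified assumptions, and NLTK's Porter stemmer stands in for whichever stemming implementation a given study uses.

```python
# Simplified preprocessing pipeline: split identifiers, drop stop words, stem.
import re
from nltk.stem.porter import PorterStemmer   # assumes NLTK is installed

STOP_WORDS = {"the", "it", "on", "off", "each", "of", "a", "an"}  # tiny illustrative list
stemmer = PorterStemmer()

def split_identifiers(text):
    text = re.sub(r"[_.]", " ", text)                        # one_two, one.two
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", text)   # ONETwo -> ONE Two
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)      # oneTwo -> one Two
    return text

def preprocess(text):
    tokens = split_identifiers(text).lower().split()
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Drop 20 bytes off each imgRequest object"))
# -> ['drop', 'byte', 'img', 'request', 'object']
```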
2.2 Existing IR-Based Bug Localization Approaches

Researchers have explored the use of IR models for bug localization. For example, one study compares the performance of LSI and LDA using three small case studies. The authors build the two IR classifiers on the identifiers and comments of the source code and compute the similarity between a bug report and each source code entity using the cosine and conditional probability similarity metrics. By performing case studies on Eclipse and Mozilla (on a total of three and five bug reports, respectively), the authors find that LDA often outperforms LSI. We note that the authors use manual query expansion, which may influence their results.

Another line of work introduces a new topic model based on LDA, known as BugScout, in an effort to improve bug localization performance. BugScout explicitly considers past bug reports, in addition to identifiers and comments, when describing source code documents, using the two data sources concurrently to identify key technical concepts. The authors apply BugScout to four different projects and find that BugScout improves performance by up to 20 percent over LDA applied only to source code.

Rao and Kak compare several IR models for bug localization, including VSM, LSI, and LDA, as well as various combinations. The authors perform a case study on a small dataset, iBUGS [12], and conclude that the simpler IR models often outperform more sophisticated models.

Limitations of current research. In current research, researchers consider only a single configuration or a small number of configurations of the classifiers, often with no explanation given for why each parameter value was selected out of the huge space of possible values. Worse, several parameter values are left unspecified, making replication of their results difficult or impossible. Given that there are many choices for each parameter in a configuration, and the parameters are independent, there are thousands of possible configurations for each underlying IR model. The effectiveness of each configuration—which parameters are important and which parameter values work best—is currently unknown. As a result, researchers and practitioners are left to guess which configuration to use in their projects.

2.3 IR-Based Concept/Feature Location Approaches

Closely related to IR-based bug localization is the problem of IR-based concept (or feature) location. In both problems, the aim is to identify a source code entity that is relevant to a given query. The difference is that in bug localization the query is the text within a bug report, whereas in concept location the query is manually created by the developer. LSI was first used for concept location by Marcus et al. The authors find that LSI provides better results than the methods existing at the time, such as regular expressions and dependency graphs. Researchers have also combined various methods to accomplish concept location. For example, one approach combines LSI with a dynamic feature location technique known as scenario-based probabilistic ranking; the two techniques operate on different document sets and use different analysis methods. The results of the combined approach are much better than either individual approach, as evidenced by two case studies on large projects. Other researchers combine IR methods with Formal Concept Analysis to achieve similar effects. Cleary et al. [10] combine several IR models with natural language processing methods and conclude that NLP techniques do not significantly improve results. Finally, Revelle et al. combine LSI, dynamic analysis, and web mining algorithms for feature location and find that the combination outperforms any of the individual methods.

3 CONFIGURATION FRAMEWORK

The use of IR classifiers for bug localization requires the definition of several parameters. In fact, since there are so many parameters, and some parameters can take on any numeric value, there is effectively an infinite number of possible configurations of IR classifiers. Researchers in the data mining community argue that, in an ideal world, algorithms should require no tuning parameters, to avoid potential bias introduced by the user. However, such an ideal is rarely met, and the reality of IR classifiers is that they contain many parameters. Unfortunately, as we showed above, current bug localization research uses manual selection of parameter values and considers only a tiny fraction of all possible configurations.

We propose a configuration framework for analyzing the various configurations of a classifier. We summarize the framework as follows:

1. Define a set of classifier configurations.
2. Execute each configuration to accomplish the task at hand and measure the performance of each configuration using some effectiveness measure.
3. Analyze the performance of each configuration and of the various parameter settings, using, for example, Tukey's HSD statistical test.

We describe each step in more detail below.

3.1 Define Configurations

As mentioned previously, a classifier is defined by a configuration—a set of parameter values that specify the behavior of the classifier. To define a configuration is to specify the value of each parameter. Some parameters may be categorical (e.g., which similarity metric is used) while others are numerical (e.g., the number of topics in LSI or LDA). We note that several experimental design techniques exist to help select the space of configurations needed for an accurate analysis. First, parameters with numeric values are quantized from a continuous scale to a small subset of values. Second, an experimental design is chosen. For example, a full factorial design requires that every (considered) value of one parameter be examined with respect to all values of all other parameters, resulting in the maximum possible number of configurations. Smaller designs include Box-Behnken [7] and central composite designs.
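As a concrete, hypothetical illustration, the sketch below enumerates a full factorial space of VSM configurations; the parameter names and the quantized levels are examples, not the exact levels evaluated in our case studies.

```python
# Full factorial enumeration of a (hypothetical) VSM configuration space.
import itertools

parameters = {
    "entity_text":    ["identifiers", "comments", "identifiers+comments+past_bugs"],
    "bug_report":     ["title", "description", "title+description"],
    "preprocessing":  ["none", "split", "split+stop+stem"],
    "term_weighting": ["tf", "tfidf"],
    "similarity":     ["cosine", "euclidean", "hellinger", "kl"],
}

configurations = [dict(zip(parameters, values))
                  for values in itertools.product(*parameters.values())]
print(len(configurations))   # 3 * 3 * 3 * 2 * 4 = 216 configurations
print(configurations[0])     # one concrete configuration to execute and evaluate
```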
No matter the chosen design, the remaining two steps of our framework stay the same.

3.2 Execute Configurations

The next step in our framework is to execute each configuration on the task at hand to produce a set of tuples, where each tuple contains the effectiveness of configuration Ci. How exactly this is executed varies by task. In bug localization, for example, this step entails preprocessing the source code and bug reports, building the index on the source code, running the queries against the index to retrieve a ranked list of results, and evaluating the results using some measure of effectiveness (such as the top-20 metric for the task of bug localization, defined later, or the F-measure or Mean Average Precision (MAP) for the task of traceability linking).

3.3 Analyze the Performance of Configurations

Finally, we analyze the performance of the configurations. We have two different goals. First, we wish to determine the best and worst configurations, and second, we wish to determine which parameter values are most effective. To determine the best configurations, we simply sort the configurations by their effectiveness measure in descending or ascending order, depending on whether high or low values are desirable, respectively. To determine which parameter values are statistically most effective, we use Tukey's Honestly Significant Difference (HSD) test to compare the performance of each value of each parameter. The HSD test is a statistical test on the means of the results produced for each parameter value—holding one parameter constant and letting all other parameters vary. For a given parameter (e.g., "number of topics"), the HSD test compares the mean of each possible value with the mean of every other possible value (e.g., "32" versus "64" versus "128"). Using the studentized range distribution, the HSD test determines whether the differences between the means exceed the expected standard error. The outcome of HSD is a set of statistically equivalent groups of parameter values. If two parameter values belong to the same group, then the performances of those parameter values are not statistically different, and either may be used in place of the other. Note that a parameter value can belong to many groups, and group membership is not transitive: if parameter value A and parameter value B belong to the same group, and parameter value B and parameter value C belong to the same group, value A and value C do not necessarily belong to the same group.
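For illustration, Tukey's HSD is available in standard statistics packages; the sketch below applies the statsmodels implementation to invented top-20 scores grouped by the value of one parameter (K), with the other parameters varying across the observations in each group.

```python
# Tukey's HSD on (invented) top-20 scores, grouped by the "number of topics" value.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

top20   = np.array([0.41, 0.44, 0.39, 0.52, 0.55, 0.50, 0.53, 0.51, 0.49])
k_value = np.array(["32", "32", "32", "64", "64", "64", "128", "128", "128"])

result = pairwise_tukeyhsd(endog=top20, groups=k_value, alpha=0.05)
print(result)  # for each pair of K values: mean difference and whether it is significant
```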
4 COMBINATION FRAMEWORK

There is a rich literature in the pattern recognition and data mining domains on combining classifiers, variously called classifier ensembles, voting experts, or hybrid methods [8]. No matter the name used, the underlying idea is the same: different classifiers often excel on different inputs and make different errors. For example, one IR-based classifier might be good at finding links between bug reports and source code entities when they share one or more exact-match keywords, which might happen if the bug report explicitly references the variable names or method names of the relevant source code entity. Another IR-based classifier might be good at finding links between general concepts, such as "GUI" or "networking," even if no individual keywords are shared between the bug report and the relevant entity. An entity metrics-based classifier might be good at determining which source code entities are likely to be bug-prone in the first place, regardless of the specific bug report in question. By combining these disparate classifiers, the truly relevant files are likely to "bubble up" to the top of the combined list, providing developers with fewer false positive matches to explore.

While classifier combination has been successful in other domains, and recently even in other areas of software engineering [9], it has not been explored in the context of bug localization. We present a framework to combine multiple bug localization classifiers, based mainly on methods from other domains and illustrated in Fig. 1:

1. Any number of classifiers are created, based on the available input data and the given bug report. The choice of classifiers affects the performance of the classifier combination.
2. The classifiers are combined using any of several combination methods, such as the Borda Count [10], score addition, or reciprocal rank fusion [11].

Fig. 1. An illustration of the classifier combination framework. Three classifiers are created, based on the available input data and the given bug report. The classifiers are combined, using score addition, to produce a single ranked list of source code entities. In this example, fileX bubbles up to the top of the combined list because it is ranked highly by each of the three classifiers.

We now describe each component in more detail.

4.1 Creation of Classifiers

As mentioned in Section 3, classifiers can come in many forms. IR-based classifiers attempt to find textual similarities between the given bug report and the source code entities. Entity metric-based classifiers use entity metrics [12], such as lines of code, to identify which source code entities are likely to have the largest number of bugs, independent of the given bug report (which, as we find, has surprisingly good performance). In fact, we consider any variant of any technique that returns a ranked list of source code entities to be a classifier. The result set of a classifier consists of a rank and a (relevancy) score for every source code entity in the project. Note that scores need not be unique; in fact, many entities may be assigned a score of 0. In this case, they all share a rank of M + 1, where M is the number of entities that received a nonzero score.

The choice of individual classifiers affects the performance of classifier combination [13]. Since classifier combination works best when the individual classifiers err in different ways, choosing classifiers that are likely to make the most uncorrelated mistakes—such as classifiers based on different input data representations—is likely to achieve the best result.

Defining classifier subspaces. Recall that for IR-based classifiers, two data sources must be represented: that of the source code entity and that of the bug report. Accordingly, we define four IR classifier subspaces that are based on four different input data representations, using the notation from the configuration framework. Our framework provides two basic techniques for choosing which classifiers to combine: combining the best-performing individual classifiers from each subspace, or combining randomly chosen classifiers from each subspace. To combine the best IR classifiers, we select the best classifier from each of the subspaces. Should we want to combine the best eight IR classifiers, for example, we select the best two classifiers from each subspace, and so on. If we do not know in advance which classifiers from each subspace perform best, we can still use the subspaces to create so-called intelligently random sets of classifiers. Here, we select random classifiers with equal probability from each of the four subspaces, so that we reduce the likelihood that the wrong answers of the selected classifiers will be correlated in any way. The sketch below illustrates both selection strategies.
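The following sketch uses hypothetical data structures (not our tool) to show both strategies over the four subspaces: picking the best classifier per subspace when performance is known, and drawing an "intelligently random" set with equal probability from each subspace otherwise.

```python
# Selecting classifiers to combine from four (hypothetical) subspaces.
import random

# Each subspace maps to a list of classifiers with a known (or estimated) top-20 score.
subspaces = {
    "subspace1": [{"name": "VSM-a", "top20": 0.51}, {"name": "LSI-a", "top20": 0.44}],
    "subspace2": [{"name": "VSM-b", "top20": 0.48}, {"name": "LDA-b", "top20": 0.21}],
    "subspace3": [{"name": "LSI-c", "top20": 0.39}, {"name": "VSM-c", "top20": 0.47}],
    "subspace4": [{"name": "LDA-d", "top20": 0.18}, {"name": "VSM-d", "top20": 0.50}],
}

def best_per_subspace(subspaces, per_subspace=1):
    """Best `per_subspace` classifiers from each subspace (e.g., 2 each for 8 total)."""
    chosen = []
    for classifiers in subspaces.values():
        ranked = sorted(classifiers, key=lambda c: c["top20"], reverse=True)
        chosen.extend(ranked[:per_subspace])
    return chosen

def intelligently_random(subspaces, seed=None):
    """One randomly chosen classifier per subspace, drawn with equal probability."""
    rng = random.Random(seed)
    return [rng.choice(classifiers) for classifiers in subspaces.values()]

print([c["name"] for c in best_per_subspace(subspaces)])
print([c["name"] for c in intelligently_random(subspaces, seed=0)])
```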
4.2 Combination Techniques

Given a set of classifiers, we can combine them in any of several ways. A simple rank-only combination is the Borda Count method [14], which was originally devised for political election systems. For each source code entity dk, the Borda Count method assigns points based on the rank of dk in each classifier's result set. The Borda Count points from all classifiers are summed for each entity, and the entity with the highest total Borda Count score is ranked first, and so on. The combined result is formed by ranking entities according to their total Borda score.

Instead of using the ranks, the scores of each classifier can also be used. For example, the scores of each entity dk from each classifier Ci are summed to produce a total score for each entity. Usually, the scores of each classifier are scaled to the same range (e.g., [0, 1]) before combination, to avoid unintentionally weighting the importance of certain classifiers. However, either scheme can be modified to explicitly weight certain classifiers differently from others, for example, by scaling the best-performing classifiers to the range [0, 2]. We leave the investigation of such weighting schemes for future work.

We note that the result of combining |C| classifiers defines a new classifier. This classifier can itself be combined with other classifiers. In this way, a hierarchy of classifiers can be constructed. Both combination methods are sketched below.
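The sketch below implements the two combination methods on invented result sets; the exact point scheme for the Borda Count (number of entities minus rank) and the [0, 1] scaling are our assumptions of the standard formulations, not necessarily the exact equations used in the study.

```python
# Combining classifier result sets. Each result set maps entity -> (rank, score).
from collections import defaultdict

def borda_count(results, num_entities):
    # Assumed point scheme: an entity at rank r receives (num_entities - r) points.
    points = defaultdict(float)
    for result in results:
        for entity, (rank, _score) in result.items():
            points[entity] += num_entities - rank
    return sorted(points, key=points.get, reverse=True)

def score_addition(results):
    # Scale each classifier's scores to [0, 1] before summing, so that no
    # classifier is unintentionally weighted more heavily than the others.
    totals = defaultdict(float)
    for result in results:
        scores = [s for _rank, s in result.values()]
        lo, hi = min(scores), max(scores)
        for entity, (_rank, score) in result.items():
            totals[entity] += (score - lo) / (hi - lo) if hi > lo else 0.0
    return sorted(totals, key=totals.get, reverse=True)

# Two hypothetical classifiers over three entities: entity -> (rank, score).
c1 = {"fileX.java": (1, 0.9), "fileY.java": (2, 0.4), "fileZ.java": (3, 0.1)}
c2 = {"fileX.java": (2, 0.5), "fileZ.java": (1, 0.8), "fileY.java": (3, 0.2)}
print(borda_count([c1, c2], num_entities=3))   # fileX.java tops the combined list
print(score_addition([c1, c2]))
```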
5 MAIN FINDINGS AND DISCUSSION

We highlight the main findings from each case study and provide a discussion of the results.

5.1 Classifier Configuration

From our analysis of 3,172 classifiers, each evaluated on 8,084 bug reports, we draw the following conclusions:

1. Configuration matters. The performance difference between one classifier configuration and another is often significant.
2. The VSM classifier achieves the best overall top-20 performance. LSI is second, and LDA is last.
3. Using both the bug report's title and description results in the best overall performance, for all IR models.
4. Using the source code entities' identifiers, comments, and past bug reports results in the best overall performance, for all IR models.
5. Stop word removal, stemming, and identifier splitting all improve performance, for all IR models.
6. New bug count is the best-performing entity metric.

No single classifier configuration is best for all three studied projects. However, some VSM configuration achieves the best performance for each studied project, suggesting that VSM is overall the best classification method for bug localization. For all studied projects, VSM > LSI > LDA when comparing the performance of each technique's best configuration. Interestingly, for all studied projects, if we consider the worst configurations of the various IR models (VSM, LSI, and LDA), LSI achieves the highest performance, often significantly so. In Mozilla, for example, the worst LSI configuration has a top-20 performance of 23 percent, compared to VSM's 6 percent and LDA's 0 percent.

Bug localization in Mozilla is relatively easy compared to the other studied projects: the best top-20 performance for Mozilla is 80 percent, compared to Jazz's 69 percent and Eclipse's 55 percent. We also note that Mozilla has the smallest number of source code entities, Jazz the second smallest, and Eclipse the largest.

5.2 Classifier Combination

Based on the results of our case study on classifier combination, we conclude that:

1. Combining the best-performing classifiers always improves performance, by at least 10 percent and up to 95 percent in top-20 performance, compared to the best-performing individual classifier.
2. Combining "intelligently random" sets of classifiers very nearly always helps (84-100 percent of the time), and usually helps by a large amount: a median improvement of 14-56 percent in top-20 performance, compared to the best-performing individual classifier.

We saw the smallest relative improvements in Mozilla and the largest in Eclipse. We note that individual classifiers in Mozilla already have high performance (i.e., top-20 values above 0.80), leaving little room for improvement through combination. Individual classifiers in Eclipse, on the other hand, have comparatively worse performance (a maximum top-20 value of 0.54).

In general, the Borda Count combination technique performed better than the score addition method. In all four manually generated classifier sets and all studied projects, the Borda Count offered a larger improvement than score addition. Also, across the 150 randomly created classifier sets, the Borda Count method offered better mean and median relative improvements than score addition for the Eclipse and Jazz projects. In Mozilla, the mean and median relative improvements of Borda Count and score addition were comparable.

In both experiments, we combined sets of five component classifiers, based on the reasoning that classifiers using different data sources as input will produce uncorrelated errors. We also considered combining all 3,172 component classifiers from Case Study 1. We found the top-20 performance of their combination to be comparable to the top-20 performance of the best individual classifier. Given that the set of 3,172 contains many classifiers with very low performance, it is encouraging that their combination still achieves such high performance. This combination can be practically useful, not to improve on the best individual classifier, but to allow us to achieve close to the best performance without needing to identify the best-performing individual classifier.

6 THREATS TO VALIDITY

A threat to the internal validity of our case studies is our ground-truth data collection algorithm, which is based on an algorithm proposed by Fischer et al. Although this algorithm is the state of the art for linking bug reports to source code entities, recent research has questioned the algorithm's linking bias, because some change sets in the revision control system may not explicitly indicate which bug reports they are related to.
However, Nguyen et al. find that this bias exists even in a near-ideal dataset with high-quality linking, suggesting that the bias is a symptom of the underlying development process rather than of the data collection method. Another potential threat to the internal validity of our study is that of false negatives in our ground-truth data. Specifically, our ground-truth data contains links between bug reports and those source code entities that were actually changed to resolve the bug report. However, it may be the case that other source code entities could have been changed instead to resolve the same bug report. Both internal threats are mitigated in our work because, whatever biases or false negatives exist, we use the same ground-truth data in the evaluation of all classifiers and combination techniques, providing an equal footing for comparison.

A potential threat to the external validity of our study is that, even though we performed three extensive case studies on large, active, real-world projects, our results still need to be replicated in other contexts. In particular, our studied projects represent only a fraction of all real-world projects, domains, programming languages, and development paradigms, so we cannot definitively say that our results will hold for all possible projects. We have provided our data and tools online to encourage others to repeat our work on other projects. A threat to the construct validity of our case studies is our use of the top-20 metric as a proxy for developer effectiveness. While the top-20 metric is a standard metric for evaluating bug localization performance, additional research is needed to determine how well it correlates with developer effectiveness in real-world settings.

7 CONCLUSIONS AND FUTURE WORK

Solving the bug localization problem has major implications for developers because it can dramatically reduce the time and effort required to maintain software. We cast the bug localization problem as one of classification, and analyzed the effect that classifier configuration has on bug localization performance, as well as whether classifier combination can help. We summarize our main findings as follows:

1. The configuration of an IR-based classifier matters.
2. The best individual IR-based classifier uses the Vector Space Model, with the index built, using the best-performing term weighting, on all available data in the source code entities (i.e., identifiers, comments, and past bug reports for each entity), which has been stop-word filtered, stemmed, and split, and queried with all available data in the bug report (i.e., title and description) using cosine similarity.
3. The best entity metric-based classifier uses the new bug count metric to rank source code entities.
4. Classifier combination helps in very nearly all cases, no matter the underlying classifiers used or the specific combination technique used.

We proposed two frameworks: one for defining and analyzing classifier configurations, and one for combining the results of different classifiers. We used these frameworks to conduct our empirical case studies on bug localization; researchers can use our frameworks to conduct similar analyses in other areas that involve multiple configurations of classifiers. We found that the configuration of a classifier has a significant impact on its performance.
In Eclipse, for example, the difference between a poorly configured IR classifier and a correctly configured IR classifier is the difference between having a one in 100 chance of finding a relevant entity in the top 20 results and having better than a one in two chance. We found that prior bug localization research does not use the best classifier configurations. Fortunately, we also found consistent results for the various configurations across all three studied projects, suggesting that the proper configuration is not completely project specific and can be useful in a general context. By identifying which classifier configurations are best, and how to combine them in the most effective way, our results substantially advance the state of the art in bug localization. Practitioners can use our results to accelerate the task of finding and fixing bugs, resulting in improved software quality and reduced maintenance costs. We provide our data, tools, and results online.

Since IR-based bug localization has only recently gained the attention of researchers, there are many exciting avenues to explore in future work. The most obvious avenue is the addition of other classifier families to the combination framework presented in this paper, such as those used for concept location: PageRank, formal concept analysis, dynamic analysis, and static analysis. Further IR models can be considered, such as BM25F, BugScout, and other variants of LDA, such as the Relational Topic Model. Recently, researchers have proposed query expansion methods, which may be a useful preprocessing step for any IR-based classifier. We also wish to explore whether preprocessing bug reports by removing noise in the form of stack traces and code snippets could be beneficial to bug localization results. Finally, we have yet to fully investigate several possible combination techniques, such as variations of the Borda Count and reciprocal rank fusion.

REFERENCES

[1] R. Arnold and S. Bohner, "Impact Analysis—Towards a Framework for Comparison," Proc. Int'l Conf. Software Maintenance, pp. 292-301, 1993.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, vol. 463, ACM Press, 1999.
[3] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and Balanced?: Bias in Bug-Fix Data Sets," Proc. Seventh European Software Eng. Conf. and Symp. Foundations of Software Eng., pp. 121-130, 2009.
[4] D.M. Blei and J.D. Lafferty, "Topic Models," Text Mining: Classification, Clustering, and Applications, pp. 71-94, Chapman & Hall, 2009.
[5] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] S. Bohner and R. Arnold, Software Change Impact Analysis, IEEE CS Press, 1996.
[7] G. Box and D. Behnken, "Some New Three Level Designs for the Study of Quantitative Variables," Technometrics, vol. 2, no. 4, pp. 455-475, 1960.
[8] C. Carpineto and G. Romano, "A Survey of Automatic Query Expansion in Information Retrieval," ACM Computing Surveys, vol. 44, no. 1, pp. 1-50, Jan. 2012.
[9] J. Chang and D.M. Blei, "Relational Topic Models for Document Networks," Proc. 12th Int'l Conf. Artificial Intelligence and Statistics, pp. 81-88, 2009.
[10] B. Cleary, C. Exton, J. Buckley, and M. English, "An Empirical Analysis of Information Retrieval Based Concept Location Techniques in Software Comprehension," Empirical Software Eng., vol. 14, no. 1, pp. 93-130, 2008.
[11] G.V. Cormack, C.L. Clarke, and S. Büttcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," Proc. 32nd Int'l Conf. Research and Development in Information Retrieval, pp. 758-759, 2009.
[12] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," Proc. 22nd Int'l Conf. Automated Software Eng., pp. 433-436, 2007.
[13] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating Defect Prediction Approaches: A Benchmark and an Extensive Comparison," Empirical Software Eng., vol. 17, no. 4, pp. 531-577, 2012.
[14] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," J. Am. Soc. Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[15] B. Dit, L. Guerrouj, D. Poshyvanyk, and G. Antoniol, "Can Better Identifier Splitting Techniques Help Feature Location?" Proc. Int'l Conf. Program Comprehension, pp. 11-20, 2011.