Multi-Abstraction Concern Localization
Tien-Duy B. Le, Shaowei Wang, and David Lo
{btdle.2012, shaoweiwang.2010, davidlo}@smu.edu.sg
Motivation

• Concern localization is the process of locating code units that match a particular textual description, such as a bug report or feature request.
• Recent concern localization techniques compare documents at a single level of abstraction (i.e., words or topics).
• A word can be abstracted at multiple levels of abstraction. For example, Eindhoven can be abstracted to North Brabant, the Netherlands, Western Europe, the European continent, Earth, etc.
• In multi-abstraction concern localization, we represent documents at multiple abstraction levels by leveraging multiple topic models.

Multi-Abstraction Retrieval

• We propose the multi-abstraction Vector Space Model (VSM^MA), which combines VSM with our abstraction hierarchy.
• In multi-abstraction VSM, document vectors are extended by adding elements corresponding to the topics in the hierarchy.
• Given a query q and a document d in a corpus D, the similarity between q and d is calculated in VSM^MA as follows:
$$sim(q, d) = \frac{q \cdot d}{\|q\| \times \|d\|}$$

$$q \cdot d = \sum_{i=1}^{V} tf\text{-}idf(w_i, q, D) \times tf\text{-}idf(w_i, d, D) + \sum_{k=1}^{L} \sum_{i=1}^{K(H_k)} \theta_{q,t_i}^{H_k} \times \theta_{d,t_i}^{H_k}$$

$$\|q\| = \sqrt{\sum_{i=1}^{V} tf\text{-}idf(w_i, q, D)^2 + \sum_{k=1}^{L} \sum_{i=1}^{K(H_k)} \left(\theta_{q,t_i}^{H_k}\right)^2}$$

$$\|d\| = \sqrt{\sum_{i=1}^{V} tf\text{-}idf(w_i, d, D)^2 + \sum_{k=1}^{L} \sum_{i=1}^{K(H_k)} \left(\theta_{d,t_i}^{H_k}\right)^2}$$

where
• V is the size of the original document vector
• w_i is the ith word in d
• L is the height of the abstraction hierarchy H
• H_i is the ith abstraction level in the hierarchy
• K(H_k) is the number of topics at abstraction level H_k
• θ^{H_k}_{d,t_i} is the probability of topic t_i appearing in d as assigned by the kth topic model in abstraction hierarchy H
• tf-idf(w, d, D) is the term frequency-inverse document frequency of word w in document d given corpus D
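The similarity computation can be illustrated with a short Python sketch. This is not the authors' implementation; the function and variable names are illustrative, and it assumes tf-idf weights and per-level topic distributions have already been computed:

```python
import math

def vsm_ma_similarity(q_tfidf, d_tfidf, q_topics, d_topics):
    """Cosine similarity over extended vectors (illustrative sketch).

    q_tfidf, d_tfidf: dicts mapping word -> tf-idf weight.
    q_topics, d_topics: lists (one entry per abstraction level) of dicts
    mapping topic id -> probability assigned by that level's topic model.
    """
    # Word-level contribution: sum of tf-idf products over shared words.
    dot = sum(w * d_tfidf.get(word, 0.0) for word, w in q_tfidf.items())
    # Topic-level contribution: one term per topic per abstraction level.
    for q_level, d_level in zip(q_topics, d_topics):
        dot += sum(p * d_level.get(t, 0.0) for t, p in q_level.items())

    def norm(tfidf, topics):
        s = sum(w * w for w in tfidf.values())
        s += sum(p * p for level in topics for p in level.values())
        return math.sqrt(s)

    nq, nd = norm(q_tfidf, q_topics), norm(d_tfidf, d_topics)
    return dot / (nq * nd) if nq > 0 and nd > 0 else 0.0

# Example with one abstraction level and toy weights.
q = {"crash": 0.9, "save": 0.4}
d = {"save": 0.7, "file": 0.5}
print(vsm_ma_similarity(q, d, [{0: 0.8, 1: 0.2}], [{0: 0.6, 1: 0.4}]))
```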
Overall Framework

[Figure: concerns and the method corpus are first text-preprocessed; the Hierarchy Creation step builds an abstraction hierarchy (Level 1, Level 2, …, Level N); the Multi-Abstraction Retrieval step combines this hierarchy with a standard retrieval technique to output ranked methods per concern.]
Text Preprocessing

• We remove Java keywords, punctuation marks, and special symbols, and break identifiers into tokens based on the camel casing convention, as shown in the sketch below.
• Finally, we apply the Porter stemming algorithm to reduce English words to their root forms.
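A minimal Python sketch of this step, assuming NLTK's PorterStemmer is available; the Java keyword list is abridged for illustration:

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

# Abridged keyword list for illustration; a real run would use all of them.
JAVA_KEYWORDS = {"public", "private", "static", "void", "class", "int",
                 "return", "new", "final", "if", "else", "for", "while"}

stemmer = PorterStemmer()

def preprocess(source_text):
    # Split identifiers on camel casing, e.g. "getUserName" -> "get User Name".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", source_text)
    # Keep alphabetic tokens only; this drops punctuation and special symbols.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Remove Java keywords and reduce words to root forms via Porter stemming.
    return [stemmer.stem(t.lower()) for t in tokens
            if t.lower() not in JAVA_KEYWORDS]

print(preprocess("public String getUserName() { return userName; }"))
```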
Hierarchy Creation Step

• We apply Latent Dirichlet Allocation (LDA) a number of times, each time with a different number of topics, to construct an abstraction hierarchy.
• Each application of LDA creates a topic model, which corresponds to one abstraction level.
• We refer to the number of topic models contained in a hierarchy as the height of the hierarchy. A sketch of this step appears below.
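A minimal sketch of hierarchy creation using gensim's LdaModel (an assumed implementation choice; the poster does not name a library). Each trained model is one abstraction level, and get_document_topics yields the θ values used in the similarity formula above:

```python
from gensim import corpora, models  # assumes gensim is installed

def build_hierarchy(docs, topic_counts=(50, 100, 150, 200)):
    """Train one LDA model per topic count; the hierarchy's height is
    len(topic_counts), and each model is one abstraction level."""
    dictionary = corpora.Dictionary(docs)            # docs: lists of tokens
    bow = [dictionary.doc2bow(doc) for doc in docs]
    levels = [models.LdaModel(bow, num_topics=k, id2word=dictionary,
                              random_state=0)
              for k in topic_counts]
    # Topic distribution of document 0 at the first abstraction level,
    # i.e. the theta values used in the similarity computation.
    theta = dict(levels[0].get_document_topics(bow[0],
                                               minimum_probability=0.0))
    return levels, theta
```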
Experiments

Effectiveness of Multi-Abstraction VSM

Hierarchy        Number of Topics     MAP      Improvement
Baseline (VSM)   N/A                  0.0669   N/A
H1               50                   0.0715   6.82%
H2               50, 100              0.0777   16.11%
H3               50, 100, 150         0.0787   17.65%
H4               50, 100, 150, 200    0.0799   19.36%
• The MAP improvement of H4 over the baseline is 19.36%.
• MAP improves as the height of the abstraction hierarchy increases; the sketch below shows how MAP is computed.
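For reference, MAP (mean average precision) over concerns can be computed as in the following sketch; function names and the toy data are illustrative:

```python
def average_precision(ranked, relevant):
    """Average precision for one concern's ranked method list."""
    hits, ap = 0, 0.0
    for rank, method in enumerate(ranked, start=1):
        if method in relevant:
            hits += 1
            ap += hits / rank  # precision at each relevant hit
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked_methods, relevant_set) pairs, one per concern."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

# Example: two concerns, each with a ranked list and its relevant methods.
print(mean_average_precision([
    (["m1", "m2", "m3"], {"m2"}),
    (["m4", "m5"], {"m4", "m5"}),
]))
```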
Future Work

• Extend the experiments with combinations of:
  – different numbers of topics at each level of the hierarchy
  – different hierarchy heights
  – different topic models (Pachinko Allocation Model, Syntactic Topic Model, Hierarchical LDA)
• Experiment with Panichella et al.'s method [1] to infer good LDA configurations for our approach.

[1] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia. How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. ICSE 2013.