Senior Thesis - Abbreviation Expansion

Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools
Zak Fry
Outline






Problem and Motivation
Automatically Identifying Abbreviation Expansions
A Scoped Approach
Analysis and Refinement: iScope
Evaluations
Conclusions
Maintenance Tasks




• Maintenance tasks make up 60-90% of the software lifecycle
• Problem: identifying where the relevant code is, i.e., where changes need to be made
• Code that performs a given task can be very scattered
• Scattered code causes difficulty for current maintenance search tools
Challenges - Coding Practices


• Identifier names are important for code documentation and understanding
• Problem: programmers’ use of abbreviations in code
  – Frequency of occurrence, e.g., character, integer, string
  – Complex inheritance leads to long class names, e.g., SecureMessageServiceClientMessageImpl
• Abbreviations negate the usefulness of identifier names and complicate program understanding
Abbreviations and Maintenance Tools

• Problem: search-based maintenance tools rely on natural language
  – Abbreviations change the natural language
• Search term: “distributed hash”
dht = (DHTPlugin)dht_pi.getPlugin();
Thread t = new AEThread( "DHTTrackerPlugin:init" )
{
    public void runSupport() {
        try {
            if ( dht.isEnabled() ) { log.log( "DDB Available" ); }
        } catch( Throwable e ) {
            log.log( "DDB Failed", e );
        }
        ...
    }
}
Automatically Identifying Abbreviation Expansions
• First, how do we identify candidates for expansion?
  – Non-dictionary words
• Abbreviation
  – Short form
• Expansion
  – Long form
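A minimal sketch of the candidate-identification step, assuming identifiers are split on underscores, digits, and camelCase boundaries and each token is then looked up in a word list. The class name CandidateFinder and the tiny DICTIONARY set are hypothetical stand-ins; the slides do not say which dictionary or splitting rules were used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CandidateFinder {
    // Hypothetical stand-in for a real English word list.
    static final Set<String> DICTIONARY = new HashSet<>(
            Arrays.asList("get", "event", "plugin", "name", "is", "enabled"));

    // Split an identifier on underscores, digits, and camelCase boundaries.
    public static List<String> split(String identifier) {
        String boundary = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|[_0-9]+";
        List<String> tokens = new ArrayList<>();
        for (String part : identifier.split(boundary)) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Non-dictionary tokens are the candidates for expansion.
    public static List<String> candidates(String identifier) {
        List<String> result = new ArrayList<>();
        for (String token : split(identifier)) {
            if (!DICTIONARY.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }
}
```

With this toy dictionary, candidates("getEvtName") keeps only "evt", and candidates("dht_pi") keeps both "dht" and "pi".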
Types of Non-Dictionary Words
Abbreviation Category | Type | Short Form | Long Form
Single Word | Prefix | int | integer
Single Word | Dropped Letter | evt | event
Multiple Word | Acronym | FBI | Federal Bureau of Investigation
Multiple Word | Combination Multiword | recblk | receive block
Domain Keywords and Special Cases | --- | parsetree, serialize | ---
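The single-word and acronym rows of the table can be read as simple string tests. The sketch below is one possible reading, not the matching rules from the thesis: the class AbbreviationTypes, its method names, and the STOP_WORDS list (needed so that "of" is skipped in the FBI example) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AbbreviationTypes {
    // Hypothetical list of words that do not contribute a letter to an acronym.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the", "and"));

    // Prefix: the short form is a leading substring of the long form (int -> integer).
    public static boolean isPrefix(String sf, String lf) {
        return lf.length() > sf.length() && lf.startsWith(sf);
    }

    // Dropped letter: same first letter, and the short form's letters appear
    // in order inside the long form (evt -> event). Prefixes also pass this
    // test, so a caller would check isPrefix first.
    public static boolean isDroppedLetter(String sf, String lf) {
        if (sf.isEmpty() || lf.length() <= sf.length() || sf.charAt(0) != lf.charAt(0)) {
            return false;
        }
        int i = 0;
        for (int j = 0; j < lf.length() && i < sf.length(); j++) {
            if (sf.charAt(i) == lf.charAt(j)) {
                i++;
            }
        }
        return i == sf.length();
    }

    // Acronym: the short form's letters are the initials of the long form's
    // words, ignoring stop words (FBI -> Federal Bureau of Investigation).
    public static boolean isAcronym(String sf, String lf) {
        int i = 0;
        for (String word : lf.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            if (i >= sf.length() || word.charAt(0) != Character.toLowerCase(sf.charAt(i))) {
                return false;
            }
            i++;
        }
        return i == sf.length() && i > 1;
    }
}
```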
State of the Art

• Lawrie, Feild, and Binkley (LFB)
  – Abbreviation Expansion
• Problem:
  – Lack of precision
  – No support for choosing between multiple matches
Scoped Approach

• How to choose between multiple possible long forms:
  – By manual inspection, we found that correct long forms are more likely to be found in certain locations
  – Also, correctly identifying the long forms for certain types of abbreviations is easier than for others
Order of Types
Abbreviation Type
1: Acronym
2: Prefix
3: Dropped Letter
4: Combo Multiword
5: Most Common
Order of Program Context
Context
1: Javadoc
2: Type
3: Method Name
4: Statement
5: Method
6: Method Comments
7: Class Comments
General Algorithm
Acronym: Javadoc → Type → Method Name → …
Prefix: Javadoc → Type → Method Name → …
Multiple matches


• We assume one best candidate, though multiple might be present at the same level of scope
• If multiple matches:
  1. Examine frequencies
  2. Stem long forms and reexamine frequencies
  3. Broaden scope and reexamine frequencies
  4. Most frequent expansion
Most Frequent Expansion (MFE)

• If still no ideal candidate is found:
  – We mined long forms from a 1.5 million LOC Java 5 code base
  – Return the most frequent long form as a last resort
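A rough sketch of how the frequency check and the MFE fallback could fit together. It covers only step 1 (examine frequencies) and step 4 (most frequent expansion); stemming and scope broadening are left out. The class ExpansionChooser and the mfe lookup map are hypothetical, and treating a tie as "no ideal candidate" is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpansionChooser {
    // Returns the single most frequent candidate, or null when the list is
    // empty or the top frequency is shared by several candidates.
    public static String mostFrequent(List<String> candidates) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : candidates) {
            counts.merge(c, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
                tie = false;
            } else if (e.getValue() == bestCount) {
                tie = true;
            }
        }
        return tie ? null : best;
    }

    // Prefer a clear winner from the current scope; otherwise fall back to
    // the mined most-frequent-expansion table (hypothetical map, may return null).
    public static String choose(String shortForm, List<String> candidatesInScope,
                                Map<String, String> mfe) {
        String winner = mostFrequent(candidatesInScope);
        return winner != null ? winner : mfe.get(shortForm);
    }
}
```

In a fuller version, the tie branch would first stem the candidates and then widen the scope before reaching for the mined table.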
Evaluation of Scoped Approach



• 250 abbreviations from 5 subject programs
• Gold standard developed by a human developer inspecting the code manually
• Implemented LFB according to its description
  – Except combination words, due to a missing database
[Figure: accuracy]
Analysis and Refinement - iScope


• Analyzed results and found 3 major sources of problems
• Developed iScope by addressing these 3 major problem areas
Order of Scoping
•Problem:
  –Scoped approach ordering: examine every context for an abbreviation type, then go to the next type
•Insight: Context is more sensitive than type
  –Investigating broader contexts for one type before even the narrowest context for another type is likely to yield incorrect matches
•Solution: Check each type at each context level, then go to the next context level (switch order)
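The two orderings differ only in which loop is on the outside. The sketch below assumes a hypothetical Matcher callback that reports whether a given abbreviation type finds a long form in a given context; the type and context lists are taken from the earlier ordering slides, with the "Most Common" fallback left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ScopeOrder {
    // Hypothetical callback: a long form if this type matches in this context.
    public interface Matcher {
        Optional<String> find(String type, String context, String shortForm);
    }

    static final List<String> TYPES = Arrays.asList(
            "acronym", "prefix", "dropped letter", "combo multiword");
    static final List<String> CONTEXTS = Arrays.asList(
            "javadoc", "type", "method name", "statement",
            "method", "method comments", "class comments");

    // Scope ordering: exhaust every context for one type before the next type.
    public static Optional<String> typeFirst(String shortForm, Matcher matcher) {
        for (String type : TYPES) {
            for (String context : CONTEXTS) {
                Optional<String> longForm = matcher.find(type, context, shortForm);
                if (longForm.isPresent()) {
                    return longForm;
                }
            }
        }
        return Optional.empty();
    }

    // iScope ordering: try every type at one context level before broadening.
    public static Optional<String> contextFirst(String shortForm, Matcher matcher) {
        for (String context : CONTEXTS) {
            for (String type : TYPES) {
                Optional<String> longForm = matcher.find(type, context, shortForm);
                if (longForm.isPresent()) {
                    return longForm;
                }
            }
        }
        return Optional.empty();
    }
}
```

If an acronym match exists only in the class comments while a prefix match exists in the Javadoc, typeFirst returns the acronym match from the broad context and contextFirst returns the prefix match from the narrow one.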
Single Letter Abbreviations
•Problem:
  –Developers use single letter abbreviations differently than multiple letter abbreviations
  –A large subset are actually semantically meaningless
  –Reader r = new BufferedReader()
  –Single letter abbreviations are very easily matched, especially because prefix matching is greedy
•Insight: Based on manual inspection, we found that meaningful single letter short forms were identifiers whose long forms were also their type name
•Solution: Limit contextual scope to type only
Hyper-Common Abbreviation
•Problem: Some abbreviations are used so often in code that the long form rarely, if ever, co-occurs, leading to incorrect expansion based on coincidence
•Solution: Mine a small set of extremely common abbreviations and use it as a preprocessing step
Mined list of hyper-common abbreviations
Evaluations

• Is our method accurate enough to be useful?
  – Reevaluation of previous experiment
• Does abbreviation expansion help maintenance tasks?
  – Simple Search
  – Concern Location Task
1. Reevaluation of Previous Test

• Based on our previous experimental methodology and metrics, how much improvement was made from Scope to iScope?
• Modified gold set based on new assumptions
  – single letter abbreviations
1. Reevaluation of Previous Test - Results
•Compare LFB with Scope and iScope using non-combination word (NCW) accuracy values
•Compare JavaMFE, ProgMFE, Scope, and iScope using the total accuracy values
2. Simple Search Evaluation


• When abbreviations are expanded in software, how many more search results are returned than without expansion?
• Focus: Recall
  – Not missing important results; want as many potentially relevant results as possible
• Metric: Percent increase in results
  – P.I. = (raw returned results with expansion / raw returned results without expansion) × 100% - 100%
2. Simple Search Evaluation (cont)

• Subjects: 215 concerns (Eaddy et al.), each annotated by 3 people, for a total of 645 queries
  – Developed independently of the idea of abbreviation expansion; many queries might not be affected by abbreviation expansion at all
• “Match”: if any word in the query matches any word in the method, the method is considered a match and returned as a result
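The match rule is loose enough to sketch in a few lines. The version below assumes case-insensitive comparison and splitting on non-word characters, neither of which is stated on the slide; the class SimpleSearch and its matches method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SimpleSearch {
    // A method matches when any query word equals any word in the method text.
    public static boolean matches(String query, String methodText) {
        Set<String> methodWords = new HashSet<>(
                Arrays.asList(methodText.toLowerCase(Locale.ROOT).split("\\W+")));
        for (String word : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty() && methodWords.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, the query "distributed hash" does not match a method whose only related word is "dht", but it does match once "dht" has been replaced by an expansion such as "distributed hash table" (an assumed long form, used here only for illustration).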
2. Simple Search Evaluation - Results
Approach | Total Returned Results | Percent Increase
No Expansion | 240,752 | --
Scope | 284,160 | 18.03
iScope | 282,489 | 17.34
•Smaller increase with iScope, due to fewer single letter abbreviation false positives
•Ideally, this means quality is better
•See experiment 3
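The percent-increase column can be rechecked from the raw totals. This is a small sketch of the P.I. metric as read from the slides (ratio of results with expansion to results without, as a percentage, minus 100%); the class and method names are illustrative only.

```java
import java.util.Locale;

public class PercentIncrease {
    // P.I. = (with expansion / without expansion) * 100% - 100%
    public static double percentIncrease(long withExpansion, long withoutExpansion) {
        return 100.0 * withExpansion / withoutExpansion - 100.0;
    }

    public static void main(String[] args) {
        long noExpansion = 240_752;
        System.out.println(String.format(Locale.ROOT, "Scope:  %.2f",
                percentIncrease(284_160, noExpansion)));
        System.out.println(String.format(Locale.ROOT, "iScope: %.2f",
                percentIncrease(282_489, noExpansion)));
    }
}
```

Running main reproduces the reported 18.03 and 17.34.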
3. Evaluation with Concern Location


• Concern location task: identification of methods that are deemed relevant to the given search term
• How much increase in effectiveness can be gained from expanding abbreviations in source code when performing concern location tasks?
3. Evaluation Methodology

• Tools: Latent Semantic Indexing (LSI) and Log Entropy-based concern location
  – Goals: attempt to calculate similarity values based on location and frequency of potential query matches
• Subjects: same as previous experiment
3. Methodology (cont)

• Metric: Mean Average Precision (MAP)
  – Precision: # true positives / total # of positives
  – MAP:
    • Collect precision values for every new true positive, going down the ranked returned results
    • Then take the average of all results
  – Attempts to reward highly ranked true positives
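A short sketch of the metric as described above: walk down one ranked result list, record the precision at each true positive, average those values, then average across queries. The class name and the boolean-list input are assumptions for illustration; note that some definitions of average precision divide by the total number of relevant items rather than by the number retrieved, which this sketch does not do.

```java
import java.util.List;

public class MeanAveragePrecision {
    // relevant.get(i) is true when the result at rank i + 1 is a true positive.
    public static double averagePrecision(List<Boolean> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevant.size(); i++) {
            if (relevant.get(i)) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    // Mean of the per-query average precision values.
    public static double meanAveragePrecision(List<List<Boolean>> queries) {
        if (queries.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Boolean> query : queries) {
            total += averagePrecision(query);
        }
        return total / queries.size();
    }
}
```

For a ranking of (hit, miss, hit) the collected precisions are 1/1 and 2/3, so the average precision is about 0.83; moving the second hit up to rank 2 would raise it to 1.0, which is how the metric rewards highly ranked true positives.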
3. Concern Location Tasks - Results
Conclusions


• Abbreviation expansion was shown to be helpful in maintenance tools and processes
• The iScope approach improves upon Scope, and greatly upon the state of the art
Future Work



• Further refinement of the expansion process to achieve the highest possible accuracy
• Full integration into a maintenance tool
• Extension to other programming languages
Acknowledgments


Emily Hill and Haley Boyd
Dr. Vijay K. Shanker and Dr. Lori Pollock
Questions?
Inherent Inaccuracy
•Problem: Additional errors in code are not generalizable into solvable problems
•Insight: There will always be inherent error when developing automatic systems for non-standard input