Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools
Zak Fry

Outline
• Problem and Motivation
• Automatically Identifying Abbreviation Expansions
• A Scoped Approach
• Analysis and Refinement: iScope
• Evaluations
• Conclusions

Maintenance Tasks
• 60-90% of software lifecycle
• Problem: identifying where the relevant code is, i.e., where changes need to be made
• Code to perform a certain task can be very scattered
• Causes difficulty for current maintenance search tools

Challenges - Coding Practices
• Identifier names are important for code documentation and understanding
• Problem: Programmers’ use of abbreviations in code
  – Frequency of occurrence: character, integer, string
  – Complex inheritance, long class names: SecureMessageServiceClientMessageImpl
• Negates usefulness of identifier names and complicates program understanding

Abbreviations and Maintenance Tools
• Problem: Search-based maintenance tools rely on natural language
  – Abbreviations change the natural language
• Search Term: “distributed hash”

    dht = (DHTPlugin)dht_pi.getPlugin();
    Thread t = new AEThread( "DHTTrackerPlugin:init" ) {
        public void runSupport() {
            try {
                if ( dht.isEnabled() ) {
                    log.log( "DDB Available" );
                }
            } catch( Throwable e ) {
                log.log( "DDB Failed", e );
            }
            ...
        }
    }

Automatically Identifying Abbreviation Expansions
• First, how do we identify candidates for expansion?
  – Non-dictionary words
• Abbreviation: short form
• Expansion: long form

Types of Non-Dictionary Words

    Category                            Abbreviation Type       Short Form             Long Form
    Single Word                         Prefix                  int                    integer
    Single Word                         Dropped Letter          evt                    event
    Multiple Word                       Acronym                 FBI                    Federal Bureau of Investigation
    Multiple Word                       Combination Multiword   recblk                 receive block
    Domain Keywords and Special Cases   ---                     parsetree, serialize   ---

State of the Art
• Lawrie, Feild, and Binkley (LFB): Abbreviation Expansion
• Problem:
  – Lack of precision
  – No support for choosing between multiple matches

Scoped Approach
• How to choose between multiple possible long forms:
  – By manual inspection, we found that correct long forms are more likely to be found in certain locations
  – Also, correctly identifying the long forms for certain types of abbreviations is easier than for others

Order of Types
Abbreviation Type
1: Acronym
2: Prefix
3: Dropped Letter
4: Combo Multiword
5: Most Common

Order of Program Context
Context
1: Javadoc
2: Type
3: Method Name
4: Statement
5: Method
6: Method Comments
7: Class Comments

General Algorithm
• Acronym: Javadoc, Type, Method Name, …
• Prefix: Javadoc, Type, Method Name, …

Multiple matches
• We assume one best candidate, though multiple might be present at the same level of scope
• If multiple matches:
  1. Examine frequencies
  2. Stem long forms and reexamine frequencies
  3. Broaden scope and reexamine frequencies
  4. Most frequent expansion

Most Frequent Expansion (MFE)
• If still no ideal candidate is found:
  – We mined long forms from a 1.5 million LOC Java 5 code base
  – Return the most frequent long form as a last resort

Evaluation of Scoped Approach
• 250 abbreviations from 5 subject programs
• Gold standard developed by a human developer inspecting the code manually
• Implemented LFB according to its description
  – Except combination words, due to missing database
[Chart: accuracy]

Analysis and Refinement - iScope
• Analyzed results and found 3 major sources of problems
• Developed iScope by addressing these 3 major problem areas

Order of Scoping
•Problem:
  •Scoped approach ordering: examine every context for an abbreviation type, then go to the next type
•Insight: Context is more sensitive than type
  •Investigating broader contexts for one type before even the narrowest context for another type is likely to yield incorrect matches
•Solution: Check each type at each context level, then go to the next context level (switch order)

Single Letter Abbreviations
•Problem:
  •Developers use single letter abbreviations differently than multiple letter abbreviations
  •A large subset are actually semantically meaningless
  •Reader r = new BufferedReader()
  •Single letters are very easily matched, especially because prefix matching is greedy
•Insight: Based on manual inspection, we found that meaningful single letter short forms were identifiers whose long forms were also their type name
•Solution: Limit contextual scope to type only

Hyper-Common Abbreviation
•Problem: Some abbreviations are used so often in code that the long form rarely ever co-occurs, leading to incorrect expansion based on coincidence
•Solution: Mine a small set of extremely common abbreviations and use it as a preprocessing step
[Table: mined list of hyper-common abbreviations]

Evaluations
• Is our method accurate enough to be useful?
  – Reevaluation of previous experiment
• Does abbreviation expansion help maintenance tasks?
  – Simple Search
  – Concern Location Task

1. Reevaluation of Previous Test
• Based on our previous experimental methodology and metrics, how much improvement was made from Scope to iScope?
• Modified goldset based on new assumptions
  – Single letter abbreviations

1. Reevaluation of Previous Test - Results
•Compare LFB with Scope and iScope using non combinational word (NCW) accuracy values
•Compare JavaMFE, ProgMFE, Scope, and iScope using the total accuracy values

2. Simple Search Evaluation
• When abbreviations are expanded in software, how many more search results are returned than without expansion?
• Focus: Recall
  – Not missing important results: want as many potentially relevant results as possible
• Metric: Percent increase in results
  – P.I. = (raw returned results with expansion / raw returned results without expansion) × 100% - 100%

2. Simple Search Evaluation (cont)
• Subjects: 215 concerns (Eaddy et al.), annotated by 3 people each for a total of 645 queries
  – Developed independent of the idea of abbreviation expansion
  – Many queries might not be affected by abbreviation expansion at all
• “Match”: if any word in the query matches any word in the method, the method is considered a match and returned as a result

2. Simple Search Evaluation - Results

    Approach       Total Returned Results   Percent Increase
    No Expansion   240,752                  ---
    Scope          284,160                  18.03
    iScope         282,489                  17.34

•Less increase with iScope: single letter abbreviation false positives decrease
•Ideally, this means quality is better
  •Examined in experiment 3

3. Evaluation with Concern Location
• Concern location task: identification of methods that are deemed to be relevant for the given search term
• How much increase in effectiveness can be gained from expanding abbreviations in source code when performing concern location tasks?

3. Evaluation Methodology
• Tools: Latent Semantic Indexing (LSI) and Log Entropy-based concern location
  – Goals: Attempt to calculate similarity values based on location and frequency of potential query matches
• Subjects: same as previous experiment

3. Methodology (cont)
• Metric: Mean Average Precision (MAP)
  – Precision: # true positives / total # of returned results (true and false positives)
  – MAP:
    – Collect precision values for every new true positive, going down the ranked returned results
    – Then take the average of all results
    – Attempts to reward highly ranked true positives

3. Concern Location Tasks - Results
[Figure: concern location results]

Conclusions
• Abbreviation expansion was shown to be helpful in maintenance tools and processes
• The iScope approach improves upon Scope, and greatly upon the state of the art

Future Work
• Further refinement of the expansion process to achieve the highest possible accuracy
• Full integration into a maintenance tool
• Extension to other programming languages

Acknowledgments
• Emily Hill and Haley Boyd
• Dr. Vijay K. Shanker and Dr. Lori Pollock

Questions?

Inherent Inaccuracy
•Problem: Additional errors in code are not generalizable into solvable problems
•Insight: There will always be inherent error when developing automatic systems for non-standard input
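
Backup: Evaluation Metrics Sketch
The two metrics used in the evaluations (percent increase in returned results, and Mean Average Precision) can be sketched roughly as below. This is a minimal illustration, not the code used in the study: the function names and the sample method identifiers are hypothetical, and the average precision here averages only over true positives that actually appear in the ranking, following the slide's wording. The more common IR definition divides by the total number of relevant items instead.

```python
from typing import List, Sequence, Set, Tuple


def percent_increase(with_expansion: int, without_expansion: int) -> float:
    """P.I. = (results with expansion / results without expansion) * 100% - 100%."""
    return with_expansion / without_expansion * 100.0 - 100.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of the precision values taken at each new true positive.

    Assumption: only true positives present in `ranked` contribute;
    relevant methods that were never returned are not penalized.
    """
    hits = 0
    precisions: List[float] = []
    for rank, method in enumerate(ranked, start=1):
        if method in relevant:
            hits += 1
            # Precision at this rank: true positives so far / results so far.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Totals taken from the Simple Search results table.
    print(round(percent_increase(284_160, 240_752), 2))  # Scope
    print(round(percent_increase(282_489, 240_752), 2))  # iScope

    # Hypothetical ranked results and relevant sets for two queries.
    queries = [
        (["m1", "m2", "m3", "m4"], {"m1", "m3"}),
        (["a", "b"], {"b"}),
    ]
    print(mean_average_precision(queries))
```

Applied to the totals in the Simple Search results table, percent_increase reproduces the reported 18.03 and 17.34. Because precision is recorded at the rank of each true positive, a ranking that places relevant methods near the top scores higher than one that returns the same methods further down.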