Lori Pollock Professor, CIS Program Analysis, Software Development & Maintenance Tools, Optimizing Compilers ‘81 ’81-’86 B.S. CS and Econ, Allegheny PhD in CS, U of Pittsburgh Married Mark ’86-’90 Assistant Prof, Rice U Lauren ‘88; Lindsay ‘90 ’91Assistant, Associate, Full Prof UD CIS Matt ’95 Today: Lauren & Lindsay here at UD, Matt 15, 3 PhD students, 3 masters students, and a few undergraduate researchers What I do here at UD • Research – Software Engineering and Compilation Lab (Hiperspace) • 213 Smith Hall – Collaborations • Vijay Shanker (UD CIS), Terry Harvey (UD CIS), Lisa Marvel (Army Research Lab), Martin Swany (UD CIS), Guang Gao (UD ECE) – Funding • Primarily NSF grants; previously some Army funding • Graduate Teaching – – – – CISC 672 Compilers CISC 673 Program Analysis and Transformations CISC 879 Software Testing and Maintenance CISC 879 Software Tools and Environments • Undergraduates – Programming XO laptops for local middle school teachers – Study abroad programs What I do outside UD • Associate Editor, Transactions on Software Engineering and Methodology (TOSEM) • Computing Research Association (CRA)’s Committee on the Status of Women in Computer Research (CRA-W) • Mentoring – speaker at mentoring workshops for undergrads, grads, assistant and associate profs, and industry lab researchers • Program committees, conf org, NSF panels, paper reviews,… (typical of university researchers) Current Students in Training Antony Danalis PhD Xiaoran Wang PhD Giri Sridhara PhD Masters: Divya, Suparna, Poonam Current Undergraduates: Sana Malik, Michelle Allen Recently Completed PhD 2007-10 David Shepherd Postdoc, Startup Ben Breech Army Research Lab Sara Sprenkle Assistant Prof Washington & Lee U Mike Jochen Assistant Prof East Stroudsburg U Emily Gibson Hill Assist Prof, Montclair State U Overview: Research Projects Natural Language Analysis of Programs Giri, Xiaoran Shanker, Hill Green Software Engineering Clause Winbladh Testing Web Applications Program Analysis Sprenkle Compiler Technology Optimizing Cluster Parallel Programs Antony, Swany Identifying Code vulnerabilities To Malware Software Tools…………..Testing…..........Compilers….……Parallel Computing Optimizing Cluster Parallel Programs Research Problem - How can scientific codes be scaled to a cluster of many CPUs? Major Challenge – Communication Costs Approach and Contributions: An integrated system to hide communication latency -Surveyor: Collect “knowledge” of cluster -Compiler: analyze dependencies and transform to create maximal communication/computation overlap -Communication Library: Use a companion library to MPI ASPhALT: Automatic System for Parallel AppLication Transformations Contribution: FIRST to cluster-optimize MPI codes Testing Web Applications Web Application Structure Client (HTML) Server (Java) Browser Front End Database (MySql) Back End • Combination of – Stand-alone applications – GUIs and Database applications – Distributed applications • Numerous technologies and components Traditional Software Testing Process Hard to obtain when testing web applications!! Application Representation Test Case User-session-based Testing Application Specification Application Implementation Expected Results Generator Test Cases Test Cases Replay Tool Actual Results Oracle Pass/ Fail User-session-based Testing Process User 1: register.jsp?name=ss&pass=tst User-session-based login.jsp?name=ss&pass=tst logout.jsp Users Beta Web Application (v.0.9) Deployment Web Application Implementation (v.1.0) Expecte d Results User Log User Sessions Create/Reduce Create test Requests cases Test Cases Test Test Cases Cases Replay Tool Actual Results Oracle Pass/ Fail Maintenance Testing for Web Applications Research Problem: How can we exploit user session logging for testing of web applications after initial deployment, with minimal tester effort? Contributions: Scalable, practical, automated structural testing framework for web applications * Test case generation * Test suite reduction * Test oracles * Test coverage criteria in terms of URLs, parameters, values Analyzing the Names in Software Research Problem - 60-90% software costs are in reading and navigating large software systems to fix bugs and add new features. Can we help with automation of search, navigation, location of relevant code? - Key: Programmers leave clues of their intent as they choose names. Focus on actions -Correspond to verbs -Verbs need Direct Object - Phrases more useful Proposed Approach – Develop, extend, and apply natural language-based analysis to the identifier names and comments Contribution - Aid understanding, debugging, maintenance, development Our Research Focus and Impact Search Exploration Understanding … Software Maintenance Tools NLPA Natural Language Analysis Word splitting Word relations (synonyms, antonyms, … Part of speech tagging Abbreviations… How do SE tool users search code now? Determine Relevance of Results (Re)formulate Query Query User Search Results Search Method 1. User formulates query 2. Query executed by search method 3. User views search results 4. Repeat as necessary Source Code Our focus: Natural language (NL) queries User faces 2 challenges: • Decide what query words to search for • Determine whether the results are relevant “compile report” vs. “compil*report” Our Contextual Search Process (Re)formulate Query Query Search Method Source Code Informatio Information n Extraction Extraction Process Process Determine Relevance of Results User Search Method NL Phrase Mapping Partial Source Phrase Code Matching Hierarchical Search Results Another Tool - UD-Summarize: An Example /* Update linear edge view. If width <= 1, draw line to given graphics2d, else draw polyline to graphics2d */ Challenges in Summary Comment Generation for Methods • Method name does not always suffice as summary Ex: main(), returnPressed() • There is no notion of what constitutes an important statement for a method summary • A selected statement may have too little or too much detail for a succinct summary • Generated phrases should be understandable without reading the associated code and its context within the method body UD-Summarize: The Process Method M signature and body Build required structural and linguistic program representations UD-GenPhrase: Generate Phrases for Selected Statements and Combine Phrases Summary comment for M Developing Basic NL Analyses Search Exploration Understanding … Software Maintenance Tools NLPA Natural Language Analysis Word splitting Word relations (synonyms, antonyms, … Part of speech tagging Abbreviations… Illustrating some issues: Extracting Clues from Signatures 1. 2. 3. 4. Split Name into Words Part-of-speech tag method name Chunk method name Identify Verb and Direct-Object (DO) public UserList getUserListFromFile( String path ) throws IOException { POS Tag get get try { User List From <verb> <adj> <noun> <prep> File <noun> File tmpFile = new File( path ); return parseFile(tmpFile); Chunk User List From File <verb phrase> <noun phrase> <prep phrase> } catch( java.io.IOException e ) { throw new IOrException( ”UserList format issue" + path + " file " + e ); • • • Automatic Abbreviation Expansion Don’t want to miss relevant code with abbreviations Given a code segment, identify character sequences that are short forms and determine long form non-dictionary word no boundary Approach: Mine expansions from code [MSR 08] 1.Split Identifiers: 2.Identify non-dictionary words 3.Determine long form Issues with a Simple Dictionary Approach • Manually create a lookup table of common abbreviations in code - Vocabulary evolves over time, must maintain table - Same abbreviation can have different Control Flow Graph expansions depending on domain AND context: ? cfg Context-Free Grammar configuration configure Overview: Research Projects Natural Language Analysis of Programs Giri, Xiaoran Shanker, Hill Green Software Engineering Clause Winbladh Testing Web Applications Program Analysis Sprenkle Compiler Technology Optimizing Cluster Parallel Programs Antony, Swany Identifying Code vulnerabilities To Malware Software Tools…………..Testing…..........Compilers….……Parallel Computing What are we doing now? • Emily – Developing a general software word usage model – Implementing SWUM – Evaluating for search • Giri – Developing techniques for automatic comment generation • Which statements to include in summary content? • How to generate phrases for that content? – Providing feedback on SWUM refinement • Divya – Analyzing specific Java statement structures – Developing templates for comment phrase generation What have we learned overall? • Evaluation studies indicate Natural language analysis has far more potential to improve software maintenance tools than we initially believed • Existing technology falls short Synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG • Keys to further success Improve recall Extract additional NL clues Another of our Tools: Dora the Program Explorer* Query Natural Language Query • Maintenance request • Expert knowledge • Query expansion Program Structure • Representation • Current: call graph • Seed starting point Dora Relevant Neighborhood • Subgraph relevant to query * Dora Relevant Neighborhood comes from exploradora, the Spanish word for a female explorer.