A SOURCE CODE QUERY TO SUPPORT STRUCTURED PROGRAM UNDERSTANDING

DAHLIA BINTI DIN

A thesis submitted in fulfillment of the requirements for the award of the degree of Master of Science (Computer Science)

Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia

JANUARY 2010

ALHAMDULLILAH

For my beloved parents, Emak and Abah, my lovely sisters, Edah and Adik, and the rest of my family members who give me strength and courage.

ACKNOWLEDGEMENT

I would like to take this opportunity to thank my supervisor, Assoc. Prof. Dr Suhaimi Bin Ibrahim, and Dato’ Prof. Dr Norbik Bin Bashah for their motivation, advice and inspiration throughout this research. Special thanks go to my friends, Rita and Fiza, for their encouragement and support, which helped make this research successful. I would also like to thank my friends and the post-graduate students of CASE, Universiti Teknologi Malaysia, Kuala Lumpur, for their participation in the controlled experiment.

ABSTRACT

Most software undoubtedly undergoes changes and needs maintenance because of changes in environment, technologies and domain knowledge. A maintainer is responsible for analyzing and understanding the code occurrences before making any changes. A change in one part of the code can affect other parts of the same code. Therefore, program understanding is an important factor in understanding the code occurrences and in the effectiveness of software maintenance. A program understanding activity involves browsing and exploring the source code or the software documentation. However, not all developed software has up-to-date documentation. In this situation the documents tend to be more or less obsolete, and the source code remains the only reliable source left for maintainers to understand the software.
In this case, maintainers need to spend more time traversing the source code to understand the code occurrences. Thus, a flexible code query method is proposed to enhance the understanding of code occurrences. This method applies parsing, pattern matching and regular expression techniques to extract software artifacts from source code. The extracted artifacts are analyzed and presented at multiple levels of abstraction. The results present multiple artifacts, the relationships among the artifacts and their locations in the source code. In this research, a prototype was developed to provide a source code query method that supports structured program understanding in C programs. The method was then tested in a controlled experiment using a case study to demonstrate the effectiveness of querying artifact occurrences in source code to support program understanding.

ABSTRAK (translated from Malay)

Most software undergoes change and needs to be maintained because of changes in environment, technology and domain knowledge. Software maintainers are responsible for analyzing and understanding code occurrences before making modifications. A modification to one piece of code can affect other parts of the same source code. Therefore, program understanding is one of the important factors in understanding code occurrences and its effectiveness in software maintenance. Program understanding involves browsing and examining the source code or the software documents. However, not all software has up-to-date documentation. In this situation, the documents are poorly updated and only the program code remains a reliable source for understanding the software. Maintainers need a long time to understand the code occurrences. Therefore, a flexible code query method is proposed to improve the understanding of code occurrences. This method uses parsing, pattern matching and regular expression techniques to obtain program artifacts from the source code. The artifacts obtained are analyzed and abstracted at multiple levels. The result of the abstraction displays multiple artifacts, the relationship links among the artifacts and their locations in the source code. In this study, a prototype was developed to provide a source code query method that supports structured program understanding in the C programming language. The method was then tested in a controlled experiment using a case study to demonstrate the effectiveness of querying artifact occurrences in source code to support program understanding.

TABLE OF CONTENTS

TITLE PAGE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND SYMBOLS
LIST OF APPENDICES

CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Background of the Research Problem
1.3 Statement of the Problem
1.4 Objective of the Study
1.5 Scope of Work
1.6 Importance of the Study
1.7 Thesis Outline
1.8 Summary

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Introduction of Software Maintenance
  2.2.1 Software Maintenance Categories
  2.2.2 Problem in Software Maintenance
2.3 Program Understanding
  2.3.1 Program Understanding Support Mechanism
    2.3.1.1 Unaided Browsing
    2.3.1.2 Leveraging Corporate Knowledge and Experience
    2.3.1.3 Computer Aided Technique
  2.3.2 Program Understanding via Reverse Engineering
2.4 Reverse Engineering
  2.4.1 Reverse Engineering Concept and Definition
  2.4.2 Challenges in Reverse Engineering
  2.4.3 Program Understanding in Reverse Engineering Automating Approaches
2.5 Parsing Technique
  2.5.1 Two Way of Parsing
  2.5.2 Parsing Methods
    2.5.2.1 Directionality
    2.5.2.2 Search Techniques
    2.5.2.3 Left Corner Parsing
  2.5.3 Time Requirement
2.6 Extraction Process
  2.6.1 Pattern Matching
  2.6.2 Regular Expression
    2.6.2.1 Basic Concepts
    2.6.2.2 Portable Operating System Interface (POSIX) Syntax
  2.6.3 Pattern Matching and Regular Expression in Artifact Extraction
2.7 Abstraction Process
  2.7.1 Graphical Representation
  2.7.2 Textual Representation
2.8 Concept Location
  2.8.1 Concept Location in Source Code
  2.8.2 Static Concept Location Techniques
    2.8.2.1 String Pattern Matching Technique
    2.8.2.2 Dependency Search Technique
    2.8.2.3 IR-based Technique
2.9 Code Query in Reverse Engineering Tools
  2.9.1 Windows Grep
  2.9.2 Rigi
    2.9.2.1 Rigi Features
    2.9.2.2 Rigi Query Technique
  2.9.3 CodeSurfer
    2.9.3.1 CodeSurfer Features
    2.9.3.2 CodeSurfer Query Technique
  2.9.4 The Comparative Evaluation of Existing Tools
2.10 Proposed Solution
2.11 Summary

CHAPTER 3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Operational Framework
  3.2.1 Phase 1: Formulation of Research Problem
    3.2.1.1 Literature Reviews
      3.2.1.1.1 Understanding the Need of Change Request Process
      3.2.1.1.2 Understanding Structured Programming Concept
      3.2.1.1.3 Understanding the Extraction Process
      3.2.1.1.4 Understanding the Abstraction Technique
    3.2.1.2 Analysis Current Approach and Existing Tools
    3.2.1.3 Research Proposal
  3.2.2 Phase 2: Prototype Development
    3.2.2.1 Code Query Model Design
    3.2.2.2 Code Query Prototype Development
  3.2.3 Phase 3: Implementation and Evaluation
    3.2.3.1 Supporting Tools
    3.2.3.2 Choose Case Study
    3.2.3.3 Experimental
    3.2.3.4 Evaluation
  3.2.4 Phase 4: Research Report
3.3 Research Assumption
3.4 Summary

CHAPTER 4 CODE QUERY MODEL
4.1 Introduction
4.2 Overview of Code Query
4.3 Code Query in Structured Programming
  4.3.1 Structured Programming Concept
    4.3.1.1 Relationship in Structured Programming
    4.3.1.2 Dependencies in Structured Programming
    4.3.1.3 Observations about Structured Programming
4.4 A Proposed Code Query
  4.4.1 Keyword
  4.4.2 Extraction of Artifacts
    4.4.2.1 Parser
    4.4.2.2 Pattern Matching
    4.4.2.3 Regular Expression
  4.4.3 Abstraction of Artifacts
    4.4.3.1 Code Query in Textual Representation
    4.4.3.2 Code Query in Graphical Representation
4.5 Summary

CHAPTER 5 DESIGN AND IMPLEMENTATION OF CODE QUERY
5.1 Introduction
5.2 Code Query Design
  5.2.1 Code Query Architecture
    5.2.1.1 Problem Change Request
    5.2.1.2 Artifacts Repository
    5.2.1.3 Extraction Process
    5.2.1.4 Abstraction Process
  5.2.2 Code Query Use Case
  5.2.3 Code Query Class Interactions
5.3 Code Query Implementation and User Interfaces
5.4 Other Supporting Tools
5.5 Summary

CHAPTER 6 EVALUATION
6.1 Introduction
6.2 Case Study
  6.2.1 Outlines of Case Study
  6.2.2 GI Project Briefing
6.3 Controlled Experimental
  6.3.1 Subject and Environment
  6.3.2 Questionnaires
  6.3.3 Experimental Procedures
  6.3.4 Possible Threats and Validity
6.4 The Analysis
  6.4.1 Analysis of the Controlled Experiment
  6.4.2 Analysis of the Usability Study
6.5 Finding Analysis
  6.5.1 Acceptance Tool
  6.5.2 Qualitative Evaluation
6.6 Summary

CHAPTER 7 CONCLUSION AND FUTURE WORK
7.1 Introduction
7.2 Contribution
7.3 Research Limitation and Future Works
7.4 Summary

REFERENCES
Appendices A-B

LIST OF TABLES

2.1 Quantifiers of Regular Expression
2.2 Metacharacters for BRE Standard
2.3 Features of The Static Concept Location Techniques
2.4 Existing Features of current tools
4.1 Relationship Types
4.2 Regular Expression for Match Common Programming Language
6.1 Job versus frequencies
6.2 Year of Experience in Software Development
6.3 Year of Experience in Software Maintenance
6.4 Mean of scores for Code Query
6.5 Mean of Usefulness of Tools
6.6 Mean Comparison between Tools
6.7 Existing Features of Code Query Systems

LIST OF FIGURES

2.1 Software Maintenance Process
2.2 Reverse Engineering
2.3 Forward Engineering
2.4 Overview of Parser Process
2.5 Most concept location techniques rely on an intermediate representation of the source code
2.6 Window Grep Search Results
2.7 View produced by Rigi via RigiEdit
2.8 CodeSurfer Project Viewer
2.9 Finder Viewer
3.1 Operational Framework
4.1 Overview of Code Query Model
4.2 Structured Programming: Tax Calculation
4.3 Function Relationship
4.4 Code Query Approach
4.5 Metrics on the important files
4.6 Graphical presentation for function and variables of vehicle simulation
5.1 Code Query Architecture
5.2 Use Case Diagram of Code Query System
5.3 Code Query Class Diagram
5.4 Code Query Sequence Diagrams
5.5 Code Query Process Algorithm Flowchart
5.6 Code Query Introduction Screen
5.7 First user interface of Code Query
5.8 File Path and the Keyword Field
5.9 Textual Representation
5.10 Graphical Representation
5.11 Low Level of Abstraction - Source Code Viewer
5.12 High Level of Abstraction - Detail Relationship of Artifacts
6.1 Usefulness and Usability of Tools
6.2 Usefulness of Tool

LIST OF ACRONYMS AND SYMBOLS

BRE - Basic Regular Expressions
GUI - Graphical User Interface
IEEE - The Institute of Electrical and Electronics Engineers, Inc
LSI - Latent Semantic Indexing
PBS - Portable Bookshelf
PCR - Program Change Request
RUP - Rational Unified Process
SDLC - Systems Development Life Cycle
SHriMP - Simple Hierarchical Multi-Perspective
SLC - Software Life Cycle
SLCM - Software Life Cycle Model
SLCP - Software Life Cycle Process
UML - Unified Modeling Language
WWW - World Wide Web

LIST OF APPENDICES

A Questionnaire On Usability Of Software Understanding Tool
B User Manual

CHAPTER 1

INTRODUCTION

1.1 Introduction

This chapter provides an introduction to the research work presented in this thesis.
It describes the research overview that motivates the introduction of a source code query to support structured program understanding. This is followed by a discussion of the research background, problem statement, objectives and importance of the study. Finally, it briefly explains the scope of work and the structure of the thesis.

1.2 Background of the Research Problem

Software maintenance is a process that begins once there is a requirement to fix, change or adapt a software system. Whatever the requirement, the maintainer must fully understand the system before carrying out the maintenance. Understanding what a program does, how it works technically and why it was designed the way it was is critical to software maintenance. Program understanding is therefore needed in the software maintenance phase. Program understanding involves cognition of software, that is, the mental process of knowing, learning and understanding the software system. Source code or documentation is used as the source material for program understanding. Unfortunately, documentation of the system structure is often missing or outdated, even though accurate documentation is indispensable given the complexity of today’s software systems (Eichberg, 2005). Therefore, the best reference for the system is its source code, which is the representation of an executable equivalent of the software system.

In a large, long-term software project, a software maintainer must often get to know an unfamiliar portion of the source code in order to fix a bug or to add a feature that meets a maintenance requirement. The code may be unfamiliar, for instance, because a different programmer was previously responsible for that portion of the code, or because the software is in a maintenance phase where responsibility for the code is no longer strictly apportioned among the team’s programmers. A programmer facing such a task often relies on little more than the executable code itself.
Exploring source code is one of the most common activities performed by software maintainers to understand a program during the maintenance phase. In this activity, they use a tool that can search for the location of the objects to be changed and their relationships with other objects in the source code. Hence, a number of studies have been conducted to assist the cognition of a software system based on its source code. One technique that supports program understanding is reverse engineering, the process of analyzing a subject system to identify the system’s artifacts and their relationships and to create representations of the system in another form or at a higher level of abstraction (Koschke, 2001).

Based on the above factors, the code query in this research uses a reverse engineering technique to parse the source code and extract the artifacts. The extracted artifacts are then presented at multiple levels of abstraction in textual and graphical representations. The abstraction is based on a keyword from the PCR (Program Change Request), to help the maintainer focus on and understand the objects in the source code that are related to the change request.

1.3 Statement of the Problem

This research is intended to deal with problems related to program understanding. The main question is: “How can a more effective method be produced for parsing source code and extracting software artifacts that can enhance the understanding of existing software for software maintenance?”

The sub-questions of the main research question are as follows:

i. Why do the current maintenance models, approaches and tools not provide enough artifacts to support program understanding?
ii. Which artifacts need to be extracted from the source code in order to support program understanding?
iii. How can the source code artifacts be extracted?
iv. Which technique is suitable for presenting the extracted artifacts in order to help the maintainer understand the program?
1.4 Objectives of the Study

The above problem statement serves as a premise for a set of specific objectives that constitute the major milestones of this research. The objectives of this research are as follows:

i. To build a model that can extract reliable source code artifacts.
ii. To develop a prototype tool to support the proposed model and approach.
iii. To demonstrate and evaluate the practicability of the proposed model and approach in supporting program understanding.

1.5 Scope of Work

The scope of work is as follows:

1. The research focuses on artifact extraction that can support program understanding.
2. The research focuses on structured program understanding.
3. The analysis is based on C programs.
4. The research uses only small-scale software systems.

1.6 Importance of the Study

The research is based on a set of problems in program understanding during the software maintenance process. Program understanding is the most expensive task of the software maintenance process because it includes reading documents, scanning the source code and understanding the change to be made (Sulaiman, 2004). To overcome the problems of program understanding in cases where the documents are absent or not up to date, a mechanism is needed to browse and explore the source code in order to acquire knowledge of the software. The mechanism would provide a proper and effective code query approach that extracts artifacts from source code in order to automate the abstraction of the software. The abstraction is used to enhance program understanding in maintenance activities. It is expected that the completion of this thesis will benefit other researchers in the software maintenance field, as well as software maintainers who use the prototype tool developed.

1.7 Thesis Outline

This thesis discusses specific issues associated with source code query and how this research was carried out. The rest of the thesis is organized as follows.
Chapter 2: Discusses the literature on software maintenance, program understanding, reverse engineering, parsing techniques, pattern matching techniques, regular expressions and concept location. A few areas of interest are identified, for which the related issues, works and approaches are highlighted. This chapter also discusses some techniques and approaches to program understanding, as well as some tools that exist in the industry. This leads to improvement opportunities that form a basis for developing a new source code query to support structured program understanding.

Chapter 3: Provides a research methodology that describes the research design, the formulation of the research problems and validation considerations. This chapter gives an overview of data gathering and analysis, followed by some research assumptions.

Chapter 4: Describes the newly proposed code query model and approach to support program understanding. The model is established over the artifacts of the C structured programming language, their relationships and dependencies, and their occurrences in software work products. The chapter begins with an overview of the code query model, followed by the proposed model and approach.

Chapter 5: Presents the design and functionality of the tools developed to support the source code query for structured program understanding. This includes an implementation of the design and the component tools.

Chapter 6: Evaluates the source code query for its effectiveness, usability and accuracy. The evaluation criteria and methods are described and applied to the model, including modelling validation, a case study and an experiment. This research performs the evaluation based on quantitative and qualitative results.
Quantitative results are checked against a set benchmark, and qualitative results are collected from user perception and from a comparative study of the existing models and approaches.

Chapter 7: Presents the research achievements, contributions and conclusion of the thesis. This is followed by the research limitations and suggestions for future work.

1.8 Summary

This research focuses on structured program understanding supported by pattern matching techniques, regular expressions and concept location. The study uses structured programming as the main material for implementing the source code query. A model is built to implement a source code query that supports structured program understanding by searching for a keyword in the source code. The prototype is developed as a validation tool for the proposed model.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Software maintenance is the most costly activity in software engineering (Erlikh, 2000) and is recognized as an important part of the software development life cycle. Software maintenance activities currently account for more than half of the typical software budget. In addition, more than 50 percent of global software developers are engaged in modifying existing applications. A growing body of research allows software maintenance researchers and practitioners to examine the key issues facing software maintenance activities. These studies address issues in program understanding or program comprehension, reverse engineering, reengineering, program transformation, impact analysis, regression testing, software reuse, software configuration management (SCM), WWW (World Wide Web) based maintenance, maintenance process models and maintenance standards (Sulaiman, 2004).

This chapter discusses the literature on software maintenance, program understanding, reverse engineering, pattern matching techniques and regular expressions.
Part of the discussion in each section explains the link between the topic discussed and the research in this thesis.

2.2 Introduction of Software Maintenance

Software maintenance is the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment (ISO/IEC 14764, 2006). The ISO/IEC 14764 standard describes the software maintenance process as shown in Figure 2.1.

[Figure 2.1: Software Maintenance Process (labels in the figure: Preparation, Analysis, Modification, Acceptance, Migration, Retirement, Exceptional)]

1. Software preparation and transition activities
This covers the maintenance project plan, the preparation for handling problems identified during development, and the follow-up on product configuration management. It also includes change requests from individuals in the organization.

2. The problem and modification analysis
In this process, the maintenance programmer is responsible for analyzing each request, confirming it (by simulating the problem situation) and checking its validity, investigating it and proposing a solution, documenting the request and, finally, obtaining from the managing group all the authorizations required to apply the modification.

3. Modification implementation
Once the managing group approves, the maintenance programmer can start the modification.

4. Acceptance of modification
Every modification must be checked by the individual who submitted the request, to make sure that the solution provided meets the request.

5. Migration
The migration process is exceptional and is not part of daily maintenance tasks. If the software needs to be ported to another platform without any change in functionality, this process is used and a maintenance project team is usually assigned to the task.

6. Retirement
This process is used when the software can no longer be modified or ported to another platform. The software is then withdrawn from service and replaced with new software.
The software maintenance phase is critical and very costly. Lee (1998) highlighted that software maintenance is a costly and difficult phase. In the SDLC (Systems Development Life Cycle), time and cost are the main measures in carrying out software maintenance. Over the life of a software system, the software maintenance effort has been estimated to consume more than 50% of its total life cycle cost. This maintenance cost also shows no sign of declining (Lee, 1998).

Given the time and cost issues in software maintenance, the problem and modification analysis process should be done properly. The maintenance programmers need to understand the existing software before the changes are implemented. They need to know where the concept is located in the source code and how the change relates to other concept locations in the code. Understanding the code helps the maintenance programmers to implement the changes more easily and quickly. This activity also helps the management team to estimate the cost and time before proposing the solution.

2.2.1 Software Maintenance Categories

There are four categories of software maintenance, whose descriptions are summarized from Sommerville (1997), Hoffer et al. (1999) and Kajko-Mattsson (2000). The literature prior to 1998 did not describe preventive maintenance because IEEE (The Institute of Electrical and Electronics Engineers, Inc.) only recognized the category in 1998. The categories are listed below:

i. Corrective Maintenance: repairing design and programming errors.
ii. Adaptive Maintenance: modifying the system in response to environmental changes without radical changes to software functionality.
iii. Perfective Maintenance: adding desired (not necessarily required) new features to improve performance, maintainability or other attributes of a computer program.
iv. Preventive Maintenance: maintenance performed to prevent problems before they occur.
Awareness of these categories is important in this research because different categories of maintenance may need different approaches and levels of information abstraction in order to solve a software maintenance task. Understanding the categories of maintenance is necessary to identify what a maintenance environment consists of and which techniques and methods are involved.

2.2.2 Problem in Software Maintenance

Maintenance activities become difficult when a module is expected to interact with other modules in the software, and the effect becomes obvious when modification and revalidation take place. To avoid this, analysis, testing and bug fixing are needed in order to observe the interaction between modules in the software. The problem becomes critical when a new programmer who was not part of the development team has to take over the maintenance of legacy software whose existing documents are outdated. Several problems may occur in the software maintenance phase:

1. Software maintenance is not related to the design and implementation phases.
2. Maintenance is often ignored in software engineering studies, where it is treated as unimportant.
3. Maintenance activities are not understood by the maintenance programmer.
4. The maintenance programmer has no knowledge of the existing program.

Thus, it is necessary to identify the techniques and methods that can solve these software maintenance problems. Basically, program understanding and reverse engineering techniques are required. As mentioned earlier, the scope of the research undertaken in this thesis covers the issues of program understanding and reverse engineering techniques. For program understanding, this research focuses on enhancing the abstraction method.
In addition, the reverse engineering technique required in this research involves a parsing-based method to extract the required artifacts.

2.3 Program Understanding

Program understanding, also called program comprehension or software comprehension, is a process that uses existing knowledge to acquire new knowledge that ultimately meets the goals of a code cognition task. This process draws on both existing and newly acquired knowledge to build a mental model of the software under consideration. Understanding depends entirely on strategies. Though these cognition strategies vary, they all formulate hypotheses and then resolve, revise or abandon them (Deursen, 2001).

Program understanding is a software engineering discipline that aims to understand computer code written in a high-level programming language. It is useful for reuse, inspection, maintenance, reverse engineering and many other activities in the context of software engineering (Beron et al., 2006). Program understanding is the most expensive task of the software maintenance process, since it includes reading the documentation, scanning the source code and understanding the changes that should be made (Kwon et al., 1998).

Most studies in program comprehension examine how programmers comprehend a program or its source code during software maintenance and evolution. As indicated by Kwon et al. (1998), bottom-up and top-down are the two main approaches in program comprehension. These approaches are reflected in most cognitive models.

In order to maintain a software system, software maintainers need to understand how the software works. The software system can be understood through its system documentation, which provides details of the software including its architecture and detailed design, or through the source code, which is the representation of an executable equivalent of the software system. Nevertheless, documentation is often absent.
Even when documentation exists, it is mostly outdated or incomplete. Hence, software maintainers need to read the source code thoroughly before making changes to the software. As stated by Kwon et al. (1998), most problems with current maintenance practice stem from the fact that all maintenance is conducted at the code level. Likewise, according to Pigoski (1997), programmers spend 40 to 60 percent of their time reading the code and attempting to understand its logic.

The main goal of program understanding is to acquire sufficient knowledge about a software system so that it can evolve in a disciplined manner. The essence of program understanding is identifying artifacts and understanding their relationships; this process is fundamentally pattern matching at various abstraction levels. It involves the identification, manipulation and exploration of artifacts in a particular representation of a subject system via mental pattern recognition by the software engineer. These artifacts are then aggregated to form a more abstract representation of the system.

Researchers have given many definitions of program understanding, some of which are difficult to follow. The following definitions are given by previous research:

1. Program understanding is the process that uses current knowledge to generate new knowledge in order to meet goals related to the role of the original source code.
2. Program understanding is the task of recovering a system design, either partially or fully, from its source code.
3. Program understanding is a process performed to acquire information about a computer program.

From these definitions, two basic characteristics of program understanding can be identified:

1. The main input artifact is the source code of the software.
2. The output is an improved understanding of the program.

2.3.1 Program Understanding Support Mechanism

Program understanding needs a mechanism to support the cognitive process.
The support mechanisms that can be used for program understanding are:

1. Unaided browsing
2. Leveraging corporate knowledge and experience
3. Computer-aided technique

2.3.1.1 Unaided Browsing

This mechanism relies heavily on people (humanware). A person flips manually through the source code in printed form or browses it online, perhaps using the file system to help navigate the source code files. A good software engineer may be able to keep track of approximately 50,000 lines of code in their head. Beyond that amount, it becomes difficult to keep track of the information in the source code.

2.3.1.2 Leveraging Corporate Knowledge and Experience

This mechanism draws on human knowledge and experience of the subject system, through mentoring or through informal interviews with personnel who are knowledgeable about the system. It is very valuable if there are people available who have been associated with the system as it has evolved over time. Such individuals carry important information in their heads about why the system was designed the way it was, the major changes that have occurred over its life cycle and which subsystems have proven particularly troublesome. However, this mechanism is not always available, because the system designers may have left the company or external parties from other companies may have stopped providing their services.

2.3.1.3 Computer Aided Technique

This mechanism uses software maintenance techniques such as reverse engineering to support program understanding. A reverse engineering environment can manage the complexities of program understanding by helping the software engineer extract high-level information from low-level artifacts such as the source code. This frees the software engineer from tedious, manual and error-prone tasks such as code reading, searching and pattern matching by inspection.
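As a small illustration of the kind of searching that a computer-aided technique can take over from manual inspection, the following Python sketch locates every line of a C fragment in which a keyword occurs as a whole identifier. It is only a sketch under stated assumptions: the C fragment, the function name `find_keyword` and the choice of Python are hypothetical and are not taken from the prototype described in this thesis.

```python
import re

# Hypothetical C fragment, used only for illustration.
C_SOURCE = """\
#include <stdio.h>

float calc_tax(float income) {
    float rate = 0.1f;
    return income * rate;
}

int main(void) {
    float tax = calc_tax(5000.0f);
    printf("%f\\n", tax);
    return 0;
}
"""


def find_keyword(source, keyword):
    """Return (line_number, stripped_line) for each line that contains
    the keyword as a whole identifier (not as part of a longer name)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


# Report where the identifier "tax" occurs in the fragment.
for number, text in find_keyword(C_SOURCE, "tax"):
    print(f"line {number}: {text}")
```

Because the word boundary `\b` treats the underscore as part of an identifier, a search for `tax` does not report `calc_tax`; a plain substring search would report both.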
2.3.2 Program Understanding via Reverse Engineering

One of the most popular approaches to the problem of software evolution is program understanding technology. It has been estimated that fifty to ninety percent of evolution work is devoted to program understanding. Developers and maintainers require programming knowledge, domain knowledge and comprehension strategies to understand a program's source code. For instance, one might extract syntactic knowledge from the source code and rely on programming knowledge to form semantic abstractions.

The theory of domain bridging describes the programming process as one of constructing mappings from a problem domain to an implementation domain, possibly through multiple levels. Program understanding then involves reconstructing part or all of these mappings. Furthermore, the programming process is a cognitive one involving the assembly of programming plans and implementation techniques that realize goals in another domain. Hence, program understanding also tries to pattern match between a set of mental models and the source code of the subject software. For large legacy systems, matching the mental models manually is complicated.

One way of augmenting the program understanding process is through reverse engineering techniques. Although there are many forms of reverse engineering, the common goal is to extract information from existing software systems. This knowledge can then be used to improve subsequent development and ease maintenance.

In this research, source code is used as the reliable source for programmers to understand the software. Hence, through tool automation via reverse engineering techniques, the source code is parsed and the extracted artifacts are represented at multiple levels of abstraction to support program understanding.
2.4 Reverse Engineering

Reverse engineering (RE) can be defined as the process of analyzing a subject system to identify the system's artifacts and their interrelationships, and to create representations of the system in another form or at a higher level of abstraction. The primary objective of reverse engineering a software system is to increase the overall understanding of the system for both maintenance and new development. The six key objectives stated by Chikofsky and Cross II (Nelson, 2005) are:

i. Cope with complexity: The reverse engineering process analyses a subject system to identify its artifacts and interrelations automatically.
ii. Generate alternative views: The extracted artifacts can be presented in textual or graphical form.
iii. Recover lost information: When the documents are outdated, the latest information can be extracted directly from the source code via the reverse engineering process.
iv. Detect side effects: A change made in a program may cause side effects; reverse engineering can help detect the resulting anomalies and problems before users report them as bugs.
v. Synthesize higher abstractions: The extracted components can be presented at a higher level of abstraction.
vi. Facilitate reuse: Reverse engineering can help identify candidates for reusable software artifacts in present systems.

These six key objectives are in line with the objectives of this research. The code query method is proposed to present the software artifacts so that code locations and their relationships can be found through textual and graphical presentations.

2.4.1 Reverse Engineering Concept and Definition

There are many definitions of RE. One of them is adapted from the canonical taxonomy: RE is the process of identifying software components and their interrelationships, and representing these entities at a higher level of abstraction. RE by itself involves only analysis of the current software, not changing it. 'Program comprehension' and 'program understanding' are terms often used interchangeably with RE.
Meanwhile, forward engineering (FE) is the opposite of RE; the term is used to distinguish the traditional software engineering process from RE (Nelson, 2005).

Figure 2.2: Reverse Engineering

Figure 2.3: Forward Engineering

RE covers four areas, listed here in order of increasing impact:

1. Redocumentation. This is the weakest form of RE. Redocumentation merely involves the creation or revision of system documentation at the same level of abstraction.
2. Design Rediscovery. This resembles redocumentation, but it uses domain knowledge and other external information, where possible, to create a model of the system at a higher level of abstraction.
3. Restructuring. This is the lateral transformation of the system within the same level of abstraction. It maintains the same level of functionality and semantics.
4. Reengineering. Generally, reengineering involves a combination of RE for comprehension and a reapplication of FE to reexamine which functionalities need to be retained, deleted or added.

The main concept of RE is to provide an analytical process for identifying software components and their relationships and representing them as simple text or diagrams (Tammy, 2005). Therefore, RE is well suited to enhancing the understanding of software artifacts. The need for software understanding arises in two domains:

1. Application problem domain. This arises when the software changes with respect to implementation language, terminology or business logic.
2. Environmental domain. This arises when the software changes with respect to physical issues such as the operating system, hardware and configuration.

2.4.2 Challenges in Reverse Engineering

Reverse engineering is a challenging task because it involves mapping between different worlds in at least five distinct areas (Nelson, 2005):

1. Application domain mapping to programming language. A programming language is a model environment for solving some real problem.
While tools exist to support understanding code behavior from the code perspective, there is little to aid the reverse engineer in determining what is occurring in the code from a domain perspective.
2. Machines and programs mapping to abstract, high-level design. Computer science education is largely about mapping from the abstract to the detailed implementation, but there is little to aid the mapping in the reverse direction.
3. Original coherent, structured system mapping to actual system with decaying structure. Even when good documentation is available for the software, maintenance gradually causes the structure to drift from the original specification (Nelson, 2005). The reverse engineer must be able to resolve and synchronize the documented design and the currently implemented design.
4. Hierarchical programs mapping to cognitive association. Computer programs are formal, hierarchical expressions, whereas humans think in associative pieces of data. The reverse engineer must be able to "build up correct high level pieces of data from the low level details evident in the program" (Nelson, 2005).
5. Bottom-up code analysis mapping to top-down application analysis. Code analysis is by its nature a bottom-up exercise. It requires, simultaneously, a higher-level meaning to be extracted from code fragments and higher-level concepts to be mapped to lower-level implementations. To make this task even more difficult, the engineer must be able to handle confusion such as interleaving (Nelson, 2005).

At present, reverse engineering is heavily dependent on human interaction and steering. While several existing tools assist the reverse engineer in program understanding, they are not fully automated.

2.4.3 Program Understanding in Reverse Engineering Approaches

A variety of approaches offer automated assistance to the reverse engineer in program understanding. Some of the more prominent approaches include:

1. Textual, lexical and syntactic analyses. These approaches focus on the source code and its representation. Many lexical analysis tools exist in the software engineering field. One of them is Columbus, an automated parser that extracts information from the code to give hints about design and abstraction. The unit of examination is the program source itself.
2. Graphing methods. There are many styles of graphing methods to support program understanding. These include the control flow of the program, the data flow of the program, and program dependence graphs. The unit of examination is a graphical representation of the source code (Nelson, 2005).
3. Execution and testing. Dynamic testing and debugging are well known, and several tools are available for this function. The unit of examination is a full, partial or simulated execution of the program.

A variety of techniques are also used to extract artifacts automatically from source code using reverse engineering approaches. In this research, parsing, extraction and abstraction techniques are applied to generate textual and graphical representations to support program understanding.

2.5 Parsing Technique

'Parsing' is the term used to describe the process of automatically building syntactic analyses of a sentence in terms of a given grammar and lexicon. To parse a string according to a grammar means to reconstruct the production tree (or trees) that indicates how the given string can be produced from the given grammar. There are two important points here: one is that we do require the entire production tree, and the other is that there may be more than one such tree (Grune and Jacobs, 2008).

The requirement to recover the production tree is not natural. After all, a grammar is a condensed description of a set of strings, i.e., a language, and our input string either belongs or does not belong to that language; no internal structure or production path is involved.
If we adhere to this formal view, the only meaningful question we can ask is whether a given string can be recognized according to a grammar; any question as to how would be a sign of senseless, even morbid curiosity. In practice, however, grammars have semantics attached to them; specific semantics are attached to specific rules, and in order to find out which rules were involved in the production of a string and how, we need the production tree. Recognition is not enough; with some programming effort, parsing is needed to get the full benefit of the syntactic approach. Figure 2.4 shows an overview of the parser process.

Figure 2.4: Overview of Parser Process

In short, it is very expensive (and often impractical) to construct the knowledge base(s) necessary for parsing approaches to extract even reasonable semantic information from source code and associated documentation (Maletic and Marcus, 2001).

2.5.1 Two Ways of Parsing

The basic connection between a sentence and the grammar it derives from is the parse tree, which describes how the grammar was used to produce the sentence. To reconstruct this connection we need a parsing technique. When we consult the extensive literature on parsing techniques, we seem to find dozens of them, yet there are only two techniques to do parsing; all the rest is technical detail and embellishment.

The first method tries to imitate the original production process by rederiving the sentence from the start symbol. This method is called top-down, because the production tree is reconstructed from the top downwards. The second method tries to roll back the production process and to reduce the sentence back to the start symbol. Quite naturally, this technique is called bottom-up.

2.5.2 Parsing Methods

There is a large number of parsing methods and techniques, often with unclear interrelationships.
This research, however, focuses on three aspects of parsing methods, which are discussed in the subsections below.

2.5.2.1 Directionality

There are two kinds of methods with respect to directionality. The non-directional methods construct the parse tree while accessing the input in any order they see fit; this of course requires the entire input to be in memory before parsing can start. There is a top-down and a bottom-up version (Grune and Jacobs, 2008).

The directional methods process the input symbol by symbol, from left to right. This has the advantage that parsing can start, and indeed progress, considerably before the last symbol of the input is seen. It is also possible to parse from right to left, using a mirror image of the grammar; this is occasionally useful. The directional methods are all based explicitly or implicitly on the parsing automaton, where the top-down method performs predictions and matches and the bottom-up method performs shifts and reduces.

2.5.2.2 Search Techniques

There are in general two methods for solving problems in which there are several alternatives at well-determined points: depth-first search and breadth-first search.

Depth-first search concentrates on one half-solved problem; if the problem bifurcates at a given point P, it stores one alternative for later processing and keeps concentrating on the other alternative. If this alternative turns out to be a failure, the technique rolls back its actions to point P and continues with the stored alternative. This is called backtracking.

Breadth-first search keeps a set of half-solved problems. From this set it calculates a new set of better half-solved problems by examining each old half-solved problem; for each alternative, it creates a copy in the new set. Eventually, the set will come to contain all solutions.
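The two search strategies described above can be sketched on a toy recognizer. This is a minimal illustration under stated assumptions, not a method taken from the cited literature or from the prototype: the grammar `S -> a S b | a b`, the function names and the pruning rule are all hypothetical and chosen only to keep the example small.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | a b  (the language a^n b^n, n >= 1).
GRAMMAR = {"S": [("a", "S", "b"), ("a", "b")]}


def expand(form, target):
    """Yield the forms obtained by rewriting the leftmost nonterminal of `form`."""
    for i, symbol in enumerate(form):
        if symbol in GRAMMAR:
            for rhs in GRAMMAR[symbol]:
                new = form[:i] + rhs + form[i + 1:]
                # Restrict the search: drop forms that can no longer match the target.
                if len(new) <= len(target) and new[:i] == target[:i]:
                    yield new
            return  # only the leftmost nonterminal is rewritten


def depth_first(form, target):
    """Follow one alternative at a time; a False result backtracks to the caller."""
    if form == target:
        return True
    return any(depth_first(nxt, target) for nxt in expand(form, target))


def breadth_first(start, target):
    """Keep a whole collection of half-solved problems and advance them in turn."""
    frontier = deque([start])
    while frontier:
        form = frontier.popleft()
        if form == target:
            return True
        frontier.extend(expand(form, target))
    return False


print(depth_first(("S",), tuple("aabb")))    # recognized
print(breadth_first(("S",), tuple("abab")))  # rejected
```

In the depth-first version the stored alternatives live implicitly on the call stack, whereas the breadth-first version holds every open alternative in `frontier` at once, which is where its larger memory demand comes from.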
Depth-first search has the advantage that it requires an amount of memory that is proportional to the size of the problem, unlike breadth-first search, which may require exponential memory. Breadth-first search has the advantage that it will find the simplest solution first. Both methods require in principle exponential time; if we want more efficiency (and exponential requirements are virtually unacceptable), we need some means to restrict the search (Grune and Jacobs, 2008).

2.5.2.3 Left Corner Parsing

In left-corner parsing, the right-hand side of each production rule is divided into two parts: the left part is called the left corner and is identified by bottom-up methods. The division of the right-hand side is done so that once its left corner has been identified, parsing of the right part can proceed by a top-down method. Although left-corner parsing has advantages of its own, it tends to combine the disadvantages, or at least the problems, of top-down and bottom-up parsing, and is hardly used in practice.

2.5.3 Time Requirement

When parsing strings consisting of more than a few symbols, it is important to have some idea of the time requirements of the parser, i.e., the dependency of the time required to finish the parsing on the number of symbols in the input string. Expected lengths of input range from some tens (sentences in natural languages) to some tens of thousands (large computer programs); the length of some input strings may even be virtually infinite (the sequence of buttons pushed on a coffee vending machine over its lifetime). The dependency of the time requirements on the input length is also called time complexity.

In this research, the parsing technique is used to parse strings from the source code. The artifacts are extracted during the parsing process, based on matched strings, using pattern matching and regular expression techniques.

2.6 Extraction Process

Extraction is one of the processes in reverse engineering.
The extracted information is obtained from the parsing process. In the extraction process, pattern matching with regular expressions is used to match the information with the objects in the components.

2.6.1 Pattern Matching

Pattern matching is the act of checking for the presence of the constituents of a given, rigidly specified pattern (Pattern, 2003; Phanindra et al., 2007). Pattern matching is used to test whether things have a desired structure, to find relevant structure, to retrieve the aligning parts, and to substitute the matching part with a given keyword or variable. A pattern matching engine provides an optimal match between the given pattern and a decomposition of the legacy system entities by satisfying the inter/intra-module constraints defined by the pattern (Sartipi et al., 2000).

Pattern matching is needed to ease communication among developers and to speed up software development and maintenance. Additionally, pattern matching is necessary to detect simple text fragments in all kinds of editors.

Patterns retrieved from pattern matching, as well as idioms, represent the lowest level of patterns (Buschmann et al., 1996). Idioms are mostly language specific, and they capture existing programming experience. These patterns and idioms can be detected by regular, context-free or context-sensitive languages. An implementation pattern is called regular if it defines a regular language, and context-free if it defines a context-free language.

There are two types of patterns in pattern matching: sequence patterns and tree patterns. Sequence patterns, also known as text string patterns, are often described using regular expressions and matched using the respective algorithms. Sequences can also be seen as trees branching for each element into the respective element and the rest of the sequence, or as trees that immediately branch into all elements.
Tree patterns can be used in programming languages as a general tool to process data based on its structure. Some functional programming languages such as Haskell and ML, and the symbolic mathematics language Mathematica, have a special syntax for expressing tree patterns and a language construct for conditional execution and value retrieval based on it. For simplicity and efficiency reasons, these tree patterns lack some features that are available in regular expressions. Depending on the language, pattern matching can be used for function arguments, in case expressions, whenever new variables are bound, or in very limited situations such as only for sequences in assignment (in Python). Often it is possible to give alternative patterns that are tried one by one, which yields a powerful conditional programming construct. Pattern matching can benefit from guards.

2.6.2 Regular Expression

Regular expressions, also referred to as regex or regexp, provide a concise and flexible means for identifying strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification (Regular, 2007).

Regular expressions are used in many applications to specify patterns because any regular expression can be compiled into a very efficient one-pass pattern matcher called a finite automaton. Finding matches is useful, but even more useful is parse extraction, which describes in detail how a pattern matches some input. Parse extraction makes it easy to find the parts of the input that match the search pattern.

The following examples illustrate a few specifications that could be expressed in a regular expression. Regular expressions can be much more complex than these examples.

• The sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate".
• The word "car" when it appears as an isolated word.
• The word "car" when preceded by the word "blue" or "red".
• A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits.

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For instance, Perl, Ruby and Tcl have a powerful regular expression engine built directly into the language, while languages such as Java and C++ provide regular expressions through libraries. Several utilities provided by Unix distributions, including the editor ed and the filter grep, were the first to popularize the concept of regular expressions.

As an example of the syntax, the regular expression \bex can be used to search for all instances of the string "ex" that occur after a "word boundary", which is signified by \b. In layman's terms, \bex will find the matching string "ex" in two possible locations: (1) at the beginning of words, and (2) between two characters in a string, where one is a word character and the other is not a word character. Thus, in the string "Texts for experts", \bex matches the "ex" in "experts" but not in "Texts", because the "ex" in "Texts" occurs inside a word and not immediately after a word boundary.

Many modern computing systems provide wildcard characters for matching filenames in a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally expressing only very limited forms of alternatives.

2.6.2.1 Basic Concepts

A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all its elements. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be described by the pattern H(ä|ae?)ndel; alternatively, it is said that the pattern matches each of the three strings.
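The specifications and examples above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the sample sentence and the variable names are assumptions made for illustration, and the patterns are one possible way of writing each specification rather than the only one.

```python
import re

# Hypothetical sample text, chosen so that every specification has a match.
text = "A red car, a cartoon and bicarbonate cost $12.50 or $7."

anywhere = re.findall(r"car", text)                    # "car" in any context
isolated = re.findall(r"\bcar\b", text)                # "car" as an isolated word
coloured = re.findall(r"\b(?:blue|red) car\b", text)   # preceded by "blue" or "red"
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)        # dollar sign, digits, optional cents

# \bex matches "ex" only directly after a word boundary.
ex_starts = [m.start() for m in re.finditer(r"\bex", "Texts for experts")]

# One pattern describing the three spellings from the text.
handel = re.compile(r"H(ä|ae?)ndel")
candidates = ("Handel", "Händel", "Haendel", "Hendel")
spellings = [s for s in candidates if handel.fullmatch(s)]

print(anywhere, isolated, coloured, prices, ex_starts, spellings)
```

Here `ex_starts` contains a single offset, the start of "experts", because the "ex" inside "Texts" is not preceded by a word boundary, and "Hendel" is rejected because the pattern requires "a", "ä" or "ae" after the "H".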
In most formalisms, if there is any regex that matches a particular set, then there is an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions:

1. Boolean "or". A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
2. Grouping. Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".
3. Quantification. A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign +. See Table 2.1.

Table 2.1: Quantifiers of Regular Expression

Quantifier    Description
?             The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".
*             The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+             The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For instance, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.

2.6.2.2 Portable Operating System Interface (POSIX) Syntax

Traditional Unix regular expression syntax followed common conventions but often differed from tool to tool.
The IEEE POSIX Basic Regular Expressions (BRE) standard (released alongside an alternative flavor called Extended Regular Expressions, or ERE) was designed mostly for backward compatibility with the traditional (Simple Regular Expression) syntax, but it provided a common standard which has since been adopted as the default syntax of many Unix regular expression tools, though there is often some variation or additional features. Many such tools also support ERE syntax through command-line arguments.

In the BRE syntax, most characters are treated as literals and match only themselves (e.g., a matches "a"). The exceptions, listed in Table 2.2, are called metacharacters or metasequences.

Table 2.2: Metacharacters for BRE Standard

Metacharacter                  Description
.                              Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
[ ]                            A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first character within the brackets, or if it is escaped with a backslash: [abc-], [-abc], or [a\-bc].
[^ ]                           Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.
^                              Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
$                              Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
BRE: \( \)   ERE: ( )          Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group.
\n                             Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.
*                              Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)* matches "", "ab", "abab", "ababab", and so on.
BRE: \{m,n\}   ERE: {m,n}      Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regular expressions.

2.6.3 Pattern Matching and Regular Expression in Artifact Extraction

In this research, pattern matching and regular expressions are used to extract different artifacts from the source code and represent them at higher levels of abstraction. Regular-expression-based extraction was chosen for its simplicity, ease of use, matching power and robustness. The regular expressions use pattern matching to extract the desired system artifacts. Hierarchical, nested and abstract specifications are designed to match the required patterns in the source code. The regular expression technique is flexible in the sense that it can be applied to different kinds of system artifacts, including source code (in different languages) and data files, and only syntactic knowledge of the subject is required.
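As an illustration of regular-expression-based artifact extraction, the sketch below pulls two kinds of artifacts, included headers and function definitions, out of a small C fragment and reports the line on which each occurs. It is a minimal sketch under stated assumptions and not the prototype's actual implementation: the C fragment, the two patterns and the helper names `line_of` and `extract` are hypothetical, and the function pattern is a heuristic that only recognizes definitions starting in the first column.

```python
import re

# Hypothetical C fragment used as the subject system.
C_SOURCE = """\
#include <stdio.h>
#include "stack.h"

static int top = -1;

int push(int value)
{
    return ++top;
}

void report(void) {
    puts("stack");
}
"""

# Assumed patterns: an #include directive, and a function header followed by "{".
INCLUDE = re.compile(r'^[ \t]*#[ \t]*include[ \t]*[<"]([^>"]+)[>"]', re.MULTILINE)
FUNCTION = re.compile(r"^(?:\w+[ \t*]+)+(\w+)\s*\([^)]*\)\s*\{", re.MULTILINE)


def line_of(source, offset):
    """Convert a character offset into a 1-based line number."""
    return source.count("\n", 0, offset) + 1


def extract(pattern, source):
    """Return (artifact name, line number) pairs for every match of `pattern`."""
    return [(m.group(1), line_of(source, m.start(1))) for m in pattern.finditer(source)]


print(extract(INCLUDE, C_SOURCE))   # headers and their locations
print(extract(FUNCTION, C_SOURCE))  # function names and their locations
```

Only syntactic knowledge of C is encoded in the two patterns, which is what makes the approach easy to adapt; the price is that constructs the patterns do not anticipate, such as indented or macro-generated definitions, are silently missed.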
The engineer designs the regular expression pattern, matches the pattern against the source code and, as a result, obtains valuable information which is further used for extracting other patterns (Rasool and Philippow, 2008).

2.7 Abstraction Process

Abstraction is the process of hiding the details and exposing only the essential features of a particular concept or object. Abstraction should improve the understanding of a software system prior to making changes to it. Abstraction is a primary concept in software engineering and is, in fact, a basic property for understanding reality and managing the complexity of software systems (Damasevicius, 2006). Hence, abstraction is used in program understanding to enhance the comprehension of the program through textual and graphical abstraction. In this research, the extracted artifacts are represented together with their locations and their relationships with other artifacts in the code.

The following criteria can be used to evaluate the effectiveness of reverse engineering tools (Sulaiman, 2004):

i. Abstraction level refers to the sophistication of the design information that can be extracted from source code. Ideally, the abstraction level should be as high as possible, in which case the reverse engineering process should be capable of deriving:
a. Procedural design representations (a low level of abstraction): determine the structure of each procedure.
b. Program and data structure information (a slightly higher level of abstraction): indicate the structure of a program and its dependencies, including details of data and its decomposition.
c. Data and control flow models (a relatively high level of abstraction): identify data usage among programs.
d. Entity-relationship models (a high level of abstraction): indicate dependencies among modules or systems.
ii. Completeness of a reverse engineering process refers to the level of detail that is provided at an abstraction level.
In most cases, completeness decreases as the abstraction level increases.
iii. Directionality can be one-way, in which all information extracted from the source code is provided to maintainers who later use it during any maintenance activity. In two-way directionality, the information is fed to a forward engineering tool that attempts to restructure or regenerate the old program.

2.7.1 Graphical Representation

Graphical representation provides an alternative or additional presentation of the information, which can help the developer understand the graph. Graphical abstractions offer a middle ground between viewing the graph at the node/edge level, where the developer can see only a portion of the graph, and the layout overview, where the developer can see the entire graph but not the details of individual nodes. Graphical abstraction can be subdivided into two parts: representation and definition. Representation refers to how the graphical abstractions are presented to the developer, and definition refers to how the user specifies the graph or sub-graph to be used for the graphical abstraction (Paulisch, 1993).

CASE (Computer-Aided Software Engineering) products such as reverse engineering tools can automatically extract components from existing source code and provide graphical representations of the artifacts to assist software engineers in program comprehension or software understanding (Sulaiman, 2004).

2.7.2 Textual Representation

In this research, textual representation is used in support of software comprehension and maintenance. Text has the advantages of being easily communicated, effectively manipulated with existing tools, and highly scalable (Cox and Collard, 2005). The use of textual markup models provides a combination of advantages that other models do not possess. These advantages are:

i. Robustness: Text models can be used when other models are not extractable, such as when source code cannot be parsed effectively.
ii. Scalability: Large-scale text data sets can be efficiently stored and retrieved using text-oriented databases (e.g., Google) and processed as memory-efficient streams (e.g., SAX).
iii. Searchability: Text is easily searched using string-based or index-based tools (e.g., grep or Google).
iv. Independence: Tools for manipulating text (e.g., Perl) or marked-up text (e.g., XSLT) can be used on any textually represented programming language.
v. Adoption-Centric: Text manipulation tools already exist and are regularly used for a variety of tasks (e.g., Perl, AWK, grep, vi, emacs).
vi. Readability: Text is always readable by maintainers, potentially increasing maintainers' trust and understanding.
vii. Communicable: Text is easily communicated between tools, hosts, and environments.
viii. Transparency: The relationship between extracted information and the source code is easily seen without the need to build and maintain an external mapping.
ix. Abstraction: Textual representations can support many different levels of abstraction.

In this research, both representations are used to support program understanding. The abstraction shows the list of artifacts, the locations of the artifacts and the relationships between the artifacts.

2.8 Concept Location

Searching in source code or documentation is one of the most common activities performed by software engineers during maintenance (Singer et al., 1997). Concept location is one such searching activity, in which software engineers try to locate the part of the source code that implements a specific domain concept. This activity is also referred to as the concept assignment problem (Marcus et al., 2005).

Concept location is the process of locating a feature or concept in a software system (Rajlich and Wilde, 2002). Locating a concept is relatively easy in small systems that the programmer fully understands. For large and complex systems, it can be considerably more difficult.
Concept location assumes that a maintainer understands the concepts of the program domain but does not know where in the code those concepts are located. It is directly applicable to program understanding because programmers or maintainers always have some pre-existing knowledge; otherwise the process of understanding would not be possible (Rajlich and Wilde, 2002). In short, concept location is a part of the incremental change process, and it allows programmers to determine the initial location of a change within the source code.

Typically, concepts appear as nouns, verbs, or short clauses in the change request. These concepts are also embedded within the structures of the source code and appear as variables, classes, or methods. Concept location is the process that finds the implementations of these concepts (Rajlich and Wilde, 2002). Effective concept location techniques are crucial for software engineers, since they provide the means for evolving large software systems without understanding the entire body of the code (Marcus et al., 2005) and they identify the place in the software where the change is to be made.

2.8.1 Concept Location in Source Code

Concept location occurs frequently during incremental change of software (Marcus et al., 2005). Here, concepts are extracted from change requests, while the concept location process identifies the starting point of the change. The complexity and importance of the concept location process increase with the size of the software. Different techniques achieve this in different ways. One common feature of the various approaches is that the source code is often decomposed into units other than files (e.g., classes, functions, etc.) and is enriched with additional information (e.g., relationships between elements of the source code). The software decomposition determines the unit of the search, while the additional information determines the searching criteria.
Software decomposition and analysis create an intermediate representation of the software system (see Figure 2.5). Deterministic mappings are defined between this representation and the source code. The user interacts with and searches this representation, but the results of the search are presented as elements of the source code.

Figure 2.5: Most concept location techniques rely on an intermediate representation of the source code (Marcus et al., 2005).

At a high level, the concept location process can be defined as follows: i) concept formulation, usually in natural language; ii) query formulation and execution based on the intermediate representation; and iii) investigation of results.

Most concept location techniques have the following attributes (Marcus et al., 2005):
1. Prerequisites (e.g., complete and executable program, test suite, incomplete program, libraries, etc.);
2. Intermediate representation:
   a. Format (e.g., string, graph, database, etc.);
   b. Content (e.g., text, dependencies, data flow, control flow, execution traces, etc.);
   c. Preprocessing/analysis (e.g., manual, automatic, dynamic analysis, static analysis, parsing, knowledge base, etc.);
3. Query information;
4. Format and granularity of results (e.g., line of text, line of code, function, class, file, etc.).

Based on the preprocessing needed to create the intermediate representation, two major classes of techniques can be differentiated: static and dynamic. Static techniques create the intermediate representation from source code that may be incomplete. Dynamic techniques require complete executable programs and test suites. All static techniques share the same prerequisites and have compatible preprocessing; this research therefore focuses on three of the most popular static techniques, based on regular expression matching, static program dependencies, and information retrieval.
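To make the idea of searching a representation and reporting the results as elements of the source code more concrete, the following is a minimal sketch of the first static technique, regular expression matching. It is not taken from any of the tools discussed in this thesis. The function `grep_lines`, the dictionary `SOURCES` and the two small C files it contains are hypothetical and are introduced only for illustration. The intermediate representation here is simply the text of each file split into lines, and every match is mapped back to a file name and a line number.

```python
import re

def grep_lines(pattern, sources):
    """Return (filename, line_number, line_text) for every matching line.

    `sources` maps a file name to the full text of that file.
    """
    regex = re.compile(pattern)
    hits = []
    for filename, text in sources.items():
        # The line number is the mapping back to the source code.
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((filename, number, line.strip()))
    return hits

# Hypothetical two-file C project, used only for illustration.
SOURCES = {
    "tax.c": "int taxComputation(int income) {\n    return income / 10;\n}\n",
    "main.c": "int main(void) {\n    int t = taxComputation(500);\n    return payTax(t);\n}\n",
}

# Query: identifiers that start with the lower-case concept word "tax".
for hit in grep_lines(r"\btax\w*", SOURCES):
    print(hit)
```

The query is case-sensitive, so `payTax` is not reported. This illustrates why the formulation of the search pattern matters so much for this class of techniques.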
2.8.2 Static Concept Location Techniques

One common characteristic of static location techniques is that they can be used rapidly without much preparation. They allow work on incomplete programs, design documents, and other work products. This allows programmers to combine them and leverage their respective advantages. Table 2.3 summarizes the specific attributes of each technique (Marcus et al., 2005).

Table 2.3: Features of the static concept location techniques

Attribute | Pattern Matching | Dependency Based | IR Based
Prerequisites | None | None | None
Internal representation format | String of tokens | Graph | Document vector space
Internal representation content | Characters/tokens | Function dependencies | Identifiers, comments
Internal representation analysis | None | Static dependencies analysis | Parsing, LSI
Query | Regular expression | Depth first search in graph | Natural language
Results | Line of text | Call function | Functions/files

2.8.2.1 String Pattern Matching Technique (Grep)

'Grep' is commonly expanded as "global regular expression print"; the name derives from the ed editor command g/re/p. It is a tool that prints the lines that contain a match for a regular expression. Even though several more advanced pattern matching tools exist, Grep is one of the most popular tools in this category and can therefore be taken as representative of this kind of searching tool. Marcus et al. (2005) refer to the string pattern matching technique as the grep-based technique.

The critical part of the "grepping" process is the formulation of the search pattern. This can be facilitated by certain heuristics, and in most cases it is highly dependent on the experience of the person who performs the search. The technique does not make any assumptions about the structure of the software, and hence it can be used on C systems without any adaptation (Marcus et al., 2005).

2.8.2.2 Dependency Search Technique

The static dependency search is a variant of the depth first search, conducted by a programmer rather than a computer.
The programmer follows the dependencies among the modules; the technique is adapted to C programming by dealing with procedures or functions and their dependencies (Marcus et al., 2005).

When searching for concepts, the functionality of the C files that the programmer encounters can be viewed in two different ways. First, there is the composite functionality, defined as the complete functionality of a C file combined with all its supporting C files. The second type, local functionality, consists of the concepts that are actually implemented in the C file and are not delegated to others (Marcus et al., 2005).

2.8.2.3 IR-based Technique

IR-based methods for concept location share the following general pattern (Marcus et al., 2005):
1. Preprocessing of the source code and documentation.
2. Indexing that creates the intermediate representation.
3. Execution of queries formulated in natural language.
4. Retrieving and analyzing the results, which are returned as a ranked list.

The IR-based system uses latent semantic indexing (LSI) (Marcus et al., 2005) as the intermediate representation for the identifiers and comments extracted from the source code. The source code is partitioned into a set of documents. A document can be any contiguous set of lines of the source code; therefore different document definitions are possible.

2.9 Code Query in Reverse Engineering Tools

The following sections discuss the areas related to code query methods in reverse engineering tools. They report a study of existing reverse engineering environments and tools, conducted to evaluate qualitatively the functionalities, features and methods they provide. The study covers Windows Grep, Rigi (Sulaiman, 2004) and CodeSurfer (Anderson and Zarins, 2005).

2.9.1 Windows Grep

Windows Grep is a tool for searching files for text strings that you specify.
Although Windows and many other programs have built-in file searching capabilities, Windows Grep is presented by its vendor as more powerful and versatile than these. The program combines the power and flexibility of the traditional command line grep utilities available on DOS, UNIX and other platforms with the ease of use of Microsoft Windows (WinGrep, 2007). In addition to searching, Windows Grep can perform global replacement in files.

Windows Grep is designed for searching plain ASCII text files, such as program source, HTML, RTF and batch files, but it can also search binary files such as word processor documents, databases, spreadsheets and executables. The primary feature of Windows Grep is to search the contents of one or more files for occurrences of the specified text strings and display the results. Once found, matches can be replaced with other strings. See Figure 2.6.

Figure 2.6: Windows Grep Search Results

2.9.2 Rigi

Rigi is a tool that assists in understanding and re-documenting the abstractions of software systems, particularly legacy systems (Rigi, 2004). Rigi is an interactive, visual tool designed to help software maintainers better understand the software. Rigi includes parsers that read the source code of the subject software and produce a graph of extracted artifacts such as procedures, variables, calls, and data accesses. To manage the complexity of the graph, an editor allows the software maintainer to automatically or manually collapse related artifacts into subsystems. These subsystems typically represent concepts such as abstract data types or personnel assignments. The created hierarchy can be navigated, analyzed, and presented using various automatic or user-guided graphical layouts (Sulaiman, 2004).

2.9.2.1 Rigi Features

The discovered structural information is useful for making informed development and management decisions.
The information serves as documentation that is up-to-date and accurate because it is derived from the actual source code. Thus, Rigi helps in understanding legacy software systems for which the existing documentation may be missing or lacking. Rigi aids reengineering tasks that need to discover design information in existing software.

Rigi provides several useful features for the software maintainer. These features allow the maintainer to determine dependencies and potential impacts. They include:
i. Easy-to-use visual interface
ii. Parsers for C, C++ and COBOL
iii. Selection, filtering, and editing operations
iv. Dependency and change impact reports
v. Standard, overview, and projection perspectives
vi. Metrics for cohesion and coupling
vii. Views to capture interesting perspectives
viii. Scripting language and command library
ix. Adaptable to different languages and purposes
x. Customizable user interface
xi. Simple file format to represent graphs

2.9.2.2 Rigi Query Technique

Rigi works by feeding the subject software to the parser for the specified language. The parser generates a set of triples, called "tuples", into a text file with the extension ".rsf" (token file). The file can then be loaded in RigiEdit (see Figure 2.7), which presents the extracted software artifacts. The SHriMP view (Storey, 1998) is also based on the Rigi reverse engineering environment.

Figure 2.7: View produced by Rigi via RigiEdit

2.9.3 CodeSurfer

CodeSurfer, shown in Figure 2.8, is a tool that provides a wide range of program understanding capabilities by exposing the results of a static-semantic analysis to the user in novel and interesting ways. The tool performs a number of whole-program analyses, including pointer analysis, and creates a system dependence graph for the program. The user can browse these dependences through the GUI in a manner akin to surfing the web.
An open architecture fosters the development of plug-ins that can extend the basic functionality. These include tools for reasoning about the paths through the program, and for software assurance (CodeSurfer, 2007).

Figure 2.8: CodeSurfer Project Viewer

2.9.3.1 CodeSurfer Features

A partial list of the features included in the CodeSurfer Standard Package is given below. Many of CodeSurfer's advanced features (e.g., pointer analysis) are reported to be unavailable in other commercial tools.

1. Syntax Highlighting: distinguishes between code, comments, preprocessor directives, macros, and conditionally-compiled-out code.
2. Navigation from
   i. a variable occurrence to the statements that can assign its value
   ii. an assignment to a variable to a use of the value assigned
   iii. a statement to the control points that affect whether the statement gets executed
   iv. a control point to the statements whose execution it controls
   v. a macro use to its definition
   vi. a variable occurrence to its declaration
   vii. a variable occurrence to the declaration of its type
   viii. a function call to the function definition
   ix. a #include directive to the included file
3. Call Graphs: display a call graph, modify its layout, and print or save it.
4. Pointer Analysis
   i. Display the variables a pointer can point to
   ii. Display the pointers that can point to a variable
   iii. Navigate from an indirect function call site to the targets of the call
5. GMOD/GREF Analysis
   i. For any function, display all the variables it modifies
   ii. For any function, display all the variables it uses
6. Finder (see Figure 2.8): syntax-based searches for
   i. function definitions, calls, and indirect calls
   ii. variable declarations, definitions, and uses
   iii. types
7. Impact Analysis: a program slicer that does
   i. forward program slicing
   ii. backward program slicing
8. Metrics: calculate cyclomatic complexity and other metrics.

2.9.3.2 CodeSurfer Query Technique

CodeSurfer provides a query facility called Finder.
The Finder provides advanced searching capabilities. It uses regular expression matching to query variables, functions or files in a C source code directory. For example, maintainers can find all the uses of a specified variable's value, including indirect uses via pointers (where the variable name does not occur textually). Results are hyperlinked to the code. Its weakness is that no visualization is produced after a query.

Figure 2.9: Finder Viewer

2.9.4 Comparative Evaluation of Existing Tools

The research has identified several advantages and disadvantages in Windows Grep, Rigi and CodeSurfer. Windows Grep is a good searching tool for finding text strings in files. It lists location information for each matching file, such as file name, file type, folder, number of matches, file size and date/time, along with other textual results. However, Windows Grep is not suitable for program understanding.

Rigi is an interactive, visual tool designed to help developers better comprehend the software. As described in Section 2.9.2, its parsers read the source code and produce a graph of extracted artifacts such as procedures, variables, calls, and data accesses; related artifacts can be collapsed into subsystems, and the resulting hierarchy can be navigated, analyzed, and presented using various automatic or user-guided graphical layouts. However, a major disadvantage of Rigi is that its searching facility is not user-friendly.

CodeSurfer is the most complete program understanding and tracing tool of the three for the software maintainer. It also provides a query facility called Finder, which offers advanced searching capabilities. However, there is no visualization after a successful query. This is the biggest disadvantage of CodeSurfer.
Table 2.4 shows a summary of the existing tools' approaches against some defined criteria.

Table 2.4: Existing features of current tools

Current tools | Search Technique | Automatic Textual Report | Visualization Type | Search Coverage | Pragmatic
Windows Grep | Regular expression, pattern matching | Available | Textual | All (query includes all file types) | TD, BU
Rigi | Pattern matching | Not available | Graphical | Function, call, indirect call | TD, BU
CodeSurfer | Syntax based | Available | Graphical, textual | Function definitions, calls, indirect calls, variables, definitions, uses, types | TD, BU

Note: TD = Top Down, BU = Bottom Up

2.10 Proposed Solution

The above discussion motivates the development of a new source code query tool to support structured program understanding. This research proposes a model that could assist the developer in program understanding. The proposed model is called CQuery. CQuery applies regular expression and pattern matching techniques to search C language syntax in the software project directories. What distinguishes CQuery is that its results display the function name, file name, line number and path name for both callee and caller. The results can also be displayed in textual and graphical modes. A more detailed design of CQuery is given in Chapter 4.

2.11 Summary

In summary, the aim of this chapter is to study the link between program understanding, concept location, visualization and the level of information required to perform the maintenance tasks in the cases given. The study assumes that there is an extremely important relationship between program understanding, concept location and visualization, as well as the level of information required by software engineers. Three tools are used as a reference in this research. The tools, Windows Grep, Rigi and CodeSurfer, provide most of the information types in the code query technique.
The weaknesses and strengths of the existing tools, in terms of features provided and methods employed, have also been highlighted. To better enhance program understanding, a source code query method is proposed. The method should also reduce software maintainers' cognitive overhead while still producing sufficient information at different levels of abstraction.

CHAPTER 3

RESEARCH METHODOLOGY

3.1 Introduction

This chapter first discusses the research procedures, or operational framework, of this research and the formulation of the research problem. The operational framework explains the research process from the proposal of the research until its end, including the formulation of the research problem. The evaluation plans cover a brief description of data gathering and analysis. At the end of this chapter, some assumptions of the research and a summary are given.

3.2 Operational Framework

This research is based on the operational framework illustrated in Figure 3.1. The operational framework is divided into four phases:
1. Phase 1: Formulation of Research Problem
2. Phase 2: Prototype Development
3. Phase 3: Implementation and Evaluation
4. Phase 4: Research Report

Figure 3.1: Operational Framework

3.2.1 Phase 1: Formulation of Research Problem

The objective of Phase 1 is to understand and identify the problems that need research. In this phase, the research problem focuses on current program understanding tools in industry. The research starts with the idea, the literature review and the analysis of current tools. The author then proposes a solution based on the research problem identified.

3.2.1.1 Literature Reviews

The research starts with a preliminary study to gather all the essential data and information needed as part of the literature review. Information is obtained on the following:
1. Current and existing research on program understanding tools.
2.
Current and existing program understanding tools with concept location that have query features.

Literature review is one of the important methods that can contribute ideas for understanding and identifying the research problem. In this step, a deeper understanding of the main research material is needed to gain better results at the end of the research.

3.2.1.1.1 Understanding the Need of Change Request Process: Finding the Keyword

Traditionally, the keyword in a change request is determined manually by the software maintainer, by reading the software documentation and traversing the source code. While this method is normally sufficient for small programs, for legacy software it is very exhausting and highly prone to errors. Legacy software needs repositories to organize and store its software components, which grow increasingly large in subtle ways. If the change request is performed manually, it can be time-consuming, expensive and strenuous. Thus, ideally, the software maintainer needs a computer-aided system to carry out the change request and query the keyword easily.

3.2.1.1.2 Understanding Structured Programming Concept

An understanding of the structured programming concept is very important in code query. The researcher needs to understand the syntax of the structured programming language. The researcher also needs to identify which types of artifacts need to be extracted from the code. In this case, the C programming language is the subject.

3.2.1.1.3 Understanding the Extraction Process

The researcher needs to understand the extraction process and identify which techniques could be used during artifact extraction. The extraction process may involve a code parser, pattern matching and regular expressions. In this research, the scope of the proposed extraction process is to extract all the required artifacts from the C code.
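As an illustration of how pattern matching and regular expressions can extract artifacts from C code, the following is a minimal sketch. It is not the extraction process of the prototype, and it rests on a simplifying assumption: a function definition is recognized only when its header fits on one line. The names `FUNC_DEF`, `KEYWORDS`, `extract_functions` and the C fragment in `SAMPLE` are hypothetical.

```python
import re

# One-line C function header: one or more type tokens, a function name,
# a parameter list, then an opening brace or the end of the line.
# A prototype ends in ';' and is therefore not matched.
FUNC_DEF = re.compile(
    r"^\s*(?:[A-Za-z_]\w*[\s*]+)+([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*(?:\{|$)"
)

# Control keywords that can look like a function name, e.g. "else if (x) {".
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

def extract_functions(filename, text):
    """Return one record per function definition found in `text`."""
    artifacts = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = FUNC_DEF.match(line)
        if match and match.group(1) not in KEYWORDS:
            artifacts.append(
                {"function": match.group(1), "file": filename, "line": number}
            )
    return artifacts

# Hypothetical C fragment used only for illustration.
SAMPLE = """#include <stdio.h>
int payTax(int amount);

static int taxComputation(int income) {
    if (income > 0) {
        return income / 10;
    }
    return 0;
}

int main(void)
{
    return payTax(taxComputation(500));
}
"""

for artifact in extract_functions("tax.c", SAMPLE):
    print(artifact)
```

A line-oriented pattern of this kind misses headers that span several lines and can be confused by macros or comments, which is one reason an extraction process may combine regular expressions with a parser.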
3.2.1.1.4 Understanding the Abstraction Technique

There are a few reasons why abstraction is needed in code query. Abstraction summarizes the details of a software system. It is a primary concept in software engineering and is, in fact, a basic property for understanding software systems. In this research, abstraction is used to present the results in textual and graphical representations to enhance the maintainer's program understanding.

3.2.1.2 Analysis of Current Approaches and Existing Tools

Research on the current approaches and existing tools is very important for understanding and identifying their limitations. Based on the review of current tools and approaches in Chapter 2, some of the tools are missing very important information such as the line number of the code and the call function. This information must be available, especially during the change process. Hence, the idea of developing a prototype comes from the limitations of the existing tools.

3.2.1.3 Research Proposal

The research proposal for code query is the starting point for the subsequent research. The problem background, objectives, scope, importance and significance of the study, and the operational framework are explained in the proposal. The proposal is presented to evaluators, who determine whether the scope of the research is relevant to the research study.

3.2.2 Phase 2: Prototype Development

The prototype development process involves demonstrating that the model design meets all of the research objectives specified in Chapter 1. The main objectives of the prototype development serve as guidelines to ensure that the model can be built as designed and that it meets the research requirements.

3.2.2.1 Code Query Model Design

A model design needs to be developed before prototype development. The model design is built on extended extraction with abstraction of keyword and artifact occurrences. There are three components in this model:
1. Keyword
2. Extraction
3.
Abstraction

The details of the model design are discussed in Chapter 4.

3.2.2.2 Code Query Prototype Development

A prototype is developed to demonstrate Code Query. The prototype targets the C programming language and aims to enhance program understanding. It serves as a guide that helps the development team understand a program more easily. In addition, developers can learn the potential effects associated with the queried keyword before deciding to make changes. The tools used to develop this prototype are the NetBeans 5.5 IDE for Java, the JGraph 1.5 Java library, and Microsoft Access 2003 as the C component repository.

3.2.3 Phase 3: Implementation and Evaluation

The code query approach needs an appropriate plan and design for proof of concept, so that appropriate actions can be taken to validate and verify the significant results. For this approach, the key issue is the determination or measurement of accuracy. Accuracy is based on how closely the keyword locations returned by a query match the actual keyword locations in the source code. The query results are obtained from the Code Query supporting tool, and the actual keyword locations are determined by a software expert in the particular knowledge domain. Knowledge domain refers to the specific domain of a software project (Ibrahim, 2006).

A new case study is created based on a software project chosen from existing software systems, together with its software developers. The software developers involved in the project act as software maintainers in the new case study.

In this research, a change request and human involvement are needed for the experimental method of the empirical study. The purpose of the experiment is to let software maintainers apply the approach, using a Code Query tool to query a keyword, and to compare the result with the actual result gained from exploring the code manually. The detailed evaluation is described in the following sub-sections.
3.2.3.1 Supporting Tools

The supporting tools need the right environment to ensure that they work appropriately. There are two categories of supporting tools:

1. Experimentation-wise
A few supporting tools are used to produce the C components and their relationships in the source code. The code parser is used to extract all the components from the C source code.

2. Development-wise
The proposed approach is intended for any type of system environment. However, for research purposes, the development supports only C programs and runs on Windows operating systems such as Windows XP, 2000 and Vista. The prototype development is based on the proposed approach. The prototype is developed using the NetBeans 5.5 IDE (Java Development Kit 1.6), and the C component repository uses Microsoft Access 2003.

3.2.3.2 Choose Case Study

The new case study is based on a software project chosen from existing software systems, together with its software developers. The software developers who are involved in the project act as software maintainers in the new case study.

3.2.3.3 Experimental

A problem change request (PCR) and human involvement are needed for the experimental method of the empirical study. The purpose of the experiment is to allow software maintainers to apply the approach, using a Code Query tool to find the keyword based on the problem change request and to generate the occurrences of the keyword in the source code. The researcher will engage a group of software developers with software maintenance knowledge as subjects. In addition, the researcher will provide a complete software project with documentation and current tools as a case study.

3.2.3.4 Evaluation

Evaluation is the systematic collection and analysis of the data needed to make decisions or to verify the concept behind the prototype or software. The data analysis is done using the SPSS statistical package to evaluate the quantitative results.
The researcher will collect dependent variables from the experiment, such as accuracy and time consumption. Accuracy is based on how close the keyword query results are to the actual keyword locations. Time consumption concerns how much faster the keyword can be found using the prototype than by browsing manually. The details are discussed in Chapter 6.

3.2.4 Phase 4: Research Report

The report explains the details of the research work, including the details of implementation and evaluation. The most important parts to be explained are the research work and the results of the research.

3.3 Research Assumption

In this research, the prototype focuses on C structured programming with the selected case study only, and it is not suitable for other programming languages. The research makes the following assumptions:
1. The subjects who use the prototype are software developers or software engineers.
2. The selected project system is a structured program.

3.4 Summary

Chapter 3 explains the details of the research methodology. The operational framework sets out the research idea and the procedure of the research. The most important phase is implementation and evaluation. In this phase, the case study is used for the experiment, data gathering and analysis methods. This should provide a means to evaluate and validate the significant contribution of the proposed approach. Some assumptions of this research are stated to guide its implementation.

CHAPTER 4

CODE QUERY MODEL

4.1 Introduction

This chapter describes the newly proposed software code query model and an approach to support program understanding. The model is established on the components, relationships and dependencies of the C structured programming language in software work products. The chapter begins with an overview of the code query model, followed by the proposed model and approach.
4.2 Overview of Code Query

Code Query is a method for searching for a keyword and its related artifacts in the source code. This method extends extraction with abstraction of the occurrences of the keyword and the artifacts. There are three main components in this model, as shown in Figure 4.1: keyword, artifacts extraction and artifacts abstraction.

1. Keyword: The maintainer needs to find the keyword from the problem change request (PCR).
2. Artifacts Extraction: The research focuses on this component. Code Query uses concept location, a parsing technique, pattern matching and regular expressions to extract the artifacts from the source code.
3. Artifacts Abstraction: Textual and graphical representations are used to show the artifacts, the relationships between artifacts and the locations of artifacts in the code.

Figure 4.1: Overview of Code Query Model

4.3 Code Query in Structured Programming

In this research, structured programming is selected as the medium language for program understanding. Structured programming (sometimes known as modular programming) is a subset of procedural programming that enforces a logical structure on the program being written, to make it more efficient and easier to understand and modify. In Code Query, the structured programming interaction needs to be defined to show the relationships and dependencies of caller and callee functions. The interaction includes relationships, message flow and dependencies between functions and procedures. All the interactions are presented in textual and graphical representations.

4.3.1 Structured Programming Concept

Structured programming can be seen as a subset or sub-discipline of imperative programming, one of the major programming paradigms. Structured programs are often composed of simple, hierarchical program flow structures. Structured programming is beneficial for organizing and coding computer programs that employ a hierarchy of modules. This means that control is passed downwards only through the hierarchy.
Examples of structured programming languages include Ada, Pascal, Fortran and C. These programming languages are designed with features that encourage or enforce a logical program structure.

Structured programming often uses a top-down design model, in which developers map out the overall program structure into separate subsections from top to bottom. In the top-down design model, programs are drawn as rectangles. A top-down design means that the whole program is broken down into smaller sections known as modules. A program may have one module or several modules.

A well-structured program should devote a single procedure to the solution of a single problem. The splitting of problems into sub-problems should be reflected by breaking down a single procedure into a number of procedures. The idea of program development by stepwise refinement advocates that this is done in a top-down fashion.

Figure 4.2 describes the fundamental ideas in structured programming systems. The structured programming concept consists of functions and procedures.

Figure 4.2: Structured Programming: Tax Calculation

Structured programs are separated into modules or subprograms. The instructions of a structured program are executed one after the other, calling the subprograms when needed. For instance, Control Program is the main module of the tax calculation. The control program module is divided into individual modules; each module represents a specific processing task.

4.3.1.1 Relationship in Structured Programming

The function call is the main relationship in structured programming. Figure 4.3 shows the function relationships in Tax Calculation. For example, to pay the tax, the main() function calls taxComputation() for the tax calculation and then calls payTax() to execute the tax payment.

Figure 4.3: Function Relationship

Table 4.1 summarizes the relationships in the C structured programming language.
Table 4.1 : Relationship Types

Relationship Type    Relationship Sign
Caller               Function A → Function B
Callee               Function B ← Function A

4.3.1.2 Dependencies in Structured Programming

Dependency analysis follows the data flow that links relationships and dependencies. The dependencies used in this research are:

i. Data dependency: data flow from the data definition and data use relationships.
ii. Control dependency: data flow that executes program control such as if-then-else, for loop, while loop, do-until and case.
iii. Component dependency: data flow from file to file.

4.3.1.3 Observations about Structured Programming

Structured programming is not the wrong way to write programs. Similarly, object-oriented programming is not necessarily the right way. Object-oriented programming (OOP) is an alternative program development technique that often tends to be better when dealing with large programs and when program reusability matters.

Observations about structured programming are:

i. Structured programming is narrowly oriented towards solving one particular problem.
   a. It would be preferable if programming efforts could be oriented more broadly.
ii. Structured programming is carried out by gradual decomposition of the functionality.
   a. It has been observed that the structure formed by functionality, actions and control is not the most stable part of a program.
   b. Focusing on data structures instead of control structures would be an alternative approach.
iii. Real systems have no single top; they may have multiple tops.
   a. It may therefore be natural to consider alternatives to the top-down approach.

4.4 A Proposed Code Query

Code Query is a method to support and enhance structured program understanding. The research related to this work primarily comes from source-code-based maintenance of structured programming software. The model's main components are shown in Figure 4.4.
[Figure 4.4 diagram labels: Maintainer; PCR; User Interface (Find Keywords, Open C files); Processing Tools (Extraction); Processing Tools (Abstraction); C files; Repository (C component: Function, Struct, etc.; Filename)]

Figure 4.4: Code Query Approach

When exploring source code, a maintainer normally uses a suite of specialized exploration tools in a text editor or development environment, each with its own unique capabilities and interface. These tools help the maintainer gain the knowledge needed to understand the system. Here, Code Query replaces that conventional approach.

4.4.1 Keyword

When a PCR has been raised, the maintainer finds the keyword from the PCR. Through Code Query, the keyword is matched against the source code using pattern matching. The match results, such as function name, line number, filename and the direct relationships between the artifacts, are stored in the repository.

4.4.2 Extraction of Artifacts

There are three sub-components in the artifacts extraction. These three components are important for determining the artifacts and the concept location in the source code. The components are the parser, pattern matching and regular expressions. These components extract the following artifacts from the source code:

i. Callee
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
ii. Caller
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path

4.4.2.1 Parser

To parse a string according to a grammar means to reconstruct the production tree (or trees) that indicates how the given string can be produced from the given grammar. There are two ways of parsing:

i. Top-down parsing
ii. Bottom-up parsing

This research uses the top-down parsing technique to extract the necessary artifacts from the source code. Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right.
Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.

4.4.2.2 Pattern Matching

The concept of code query is based on pattern matching. Pattern matching is the act of checking for the presence of the constituents of a given, rigidly specified pattern. Pattern matching is used to test whether things have a desired structure, to find relevant structure, to retrieve the aligning parts, and to substitute the matching part with a given keyword or variable.

Pattern matching detects simple patterns in a source program. These source code patterns and expressions can be described by regular, context-free or context-sensitive languages. An implementation pattern is called regular if the pattern defines a regular language. A pattern is called context-free if the pattern defines a context-free language. Hence, pattern matching is used to perform searches with regular expressions.

4.4.2.3 Regular Expression

A regular expression is a sequence or pattern of characters that is matched against a string of text when performing searches (RegEx, 2003). Once a regular expression has been created, it is tested against a string. The regular expression is enclosed in forward slashes. For instance, the regular expression /struct/ might be matched against the code snippet below.

    struct date {
        int month;
        int day;
        int year;
    };

If struct is contained in the string, there is a successful match. Table 4.2 below gives examples of regular expressions that match common programming language constructs.

Table 4.2: Regular Expressions for Matching Common Programming Language Constructs

Construct        Example                                            Regular Expression
Comments         /* This is comments of C-Style */                  /\*.*?\*/
Strings          Allows the string to span across multiple lines    "[^"]*"
Numbers          Matches a positive integer number                  \b\d+\b
Reserved Word    To find enum or int                                \b(enum|int)\b

4.4.3 Abstraction of Artifacts

The next process manipulates the extracted artifacts from the database and generates textual and graphical representations. The simple representation shows the artifacts and detailed information based on the keyword. An abstraction generally serves as a presentation that helps developers understand the program structure, architecture and behavior. Abstraction consists of two levels:

i. The low level of abstraction focuses on the source code and presents artifacts from the source code such as function name, line number and filename.
ii. The high level of abstraction focuses on the architecture of the program and presents function roles, either caller or callee.

4.4.3.1 Code Query in Textual Representation

A textual representation is very helpful for forming a basic idea of how to understand the software. In the textual representation, the occurrences of the keyword are stated by line number and filename, and the keyword is highlighted in the code snippet.

In other research, textual representation is used to analyze the relationships between other components of the system. Figure 4.5 shows the metrics on the important files related to HTChunk.c. From this textual representation the maintainer can analyze how many files would be involved if HTChunk.c were modified. These findings indicate that HTAAFile.c is a file closely related to HTChunk.c.

Figure 4.5 : Metrics on the important files related to HTChunk.c

4.4.3.2 Code Query in Graphical Representation

The simple graphical representation is used to show the artifact and the detailed information based on the keyword. Each function is a node, and the arrows between the function nodes represent the relationships between the functions. From the graph the maintainer can also identify the caller and callee functions.
For example, consider the computed results for queries of the form "Which functions in the program could directly access the representation of component X of variable Y?" Figure 4.6 shows the results of a query on the "current vehicle" field of the map_manager_global global variable (queries involving local variables and function parameters are also allowed). The shaded nodes are the definitions that directly access representations; that is, whose code constrains the representation of the value in question. In this case, the value in question is a structure, and the shaded nodes constrain the type by accessing fields of the structure. Given that the "veh_" functions are operations on the vehicle abstract data type, it is easy to see that abstraction may be violated in the functions map_mgr_process_image, map_mgr_process_geometry_range_window and map_mgr_comp_range_window, but nowhere else.

Figure 4.6: Graphical presentation for function and variables of vehicle simulation.

4.5 Summary

The code query model highlights the ability to find a keyword in the source code. The model focuses on static analysis in order to capture a rich set of relationships among structured programming artifacts. The model makes it possible to find occurrences of the keyword in the source code. The artifacts, the relationships between artifacts and the occurrences of the artifacts are presented via multiple levels of abstraction.

CHAPTER 5

DESIGN AND IMPLEMENTATION OF CODE QUERY

5.1 Introduction

The objective of this chapter is to present the design and implementation of the proposed program understanding model and approach. The design includes its architecture, use case, interaction and operation. It is followed by a brief explanation of the implementation.

5.2 Code Query Design

The proposed system is viewed as a complete system within its defined scope that covers several subsystems and interfaces.
The design of the proposed system is divided into the code query architecture, use case diagram, modules and operations to aid understanding of the design process.

5.2.1 Code Query Architecture

Code Query is designed to fulfill a critical need of the software maintenance process. The application attempts to automate the cognition of code components and to abstract the artifacts in order to support program understanding, which is a fundamental issue in software maintenance. The application is not intended to provide a total solution to software maintenance, but it can serve as an additional tool to support program understanding. Before proceeding to the implementation, the architecture of Code Query should be understood first.

Figure 5.1 : Code Query Architecture

Code Query is developed in Java Swing and runs on Microsoft Windows. It is specially built to understand target systems written in the C programming language. Code Query (Figure 5.1) was designed to consist of two main functionalities: extracting code artifacts and abstracting the artifacts. The extracted information needs to be analyzed and transformed into the artifacts repository for abstraction purposes.

5.2.1.1 Program Change Request

The program change request (PCR) is a documented deviation from, or addition to, the project specifications in whatever form they are found on a particular project. A PCR is used when the software needs to be modified. In this context, the PCR is translated into a keyword for the initial target of change. With this keyword in hand, the software maintainer interacts with Code Query to find the occurrences of effects in the related source code.

When a PCR arises, the maintainer needs to find the keyword. Through Code Query, the keyword is matched against the source code using pattern matching. The match results, such as function name, line number, filename and the direct relationships between artifacts, are stored in the repository.
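The keyword matching step described above can be sketched roughly as follows. This is a minimal Python illustration and not the prototype's Java code; the function name locate_keyword, the file name tax.c and the sample lines are hypothetical, and treating the keyword as a whole-word regular expression is an assumption of this sketch.

```python
import re

def locate_keyword(lines, keyword, filename):
    """Return (filename, line_number, text) for every line that
    contains the keyword as a whole word (assumed matching rule)."""
    # \b keeps "tax_rate" from matching inside "my_tax_rate_old".
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b')
    return [(filename, number, text.strip())
            for number, text in enumerate(lines, start=1)
            if pattern.search(text)]

# Hypothetical source lines standing in for one C file.
code = [
    "int tax_rate;",
    "int rate = tax_rate * 2;",
    "int my_tax_rate_old;",
]
hits = locate_keyword(code, "tax_rate", "tax.c")
for hit in hits:
    print(hit)
```

Each tuple corresponds to one match result (file name, line number and the matching line) of the kind that would be stored in the repository; function names and relationships would need the further extraction steps described in Section 5.2.1.3.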
5.2.1.2 Artifacts Repository

The artifacts repository is the storage that keeps a collection of artifacts, such as functions, filenames, line numbers and their relationships, captured from the available source code. The repository is prepared as a database that stores all the related information and makes it readily available to the Code Query implementation. The artifacts stored in the repository are:

i. Callee
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
ii. Caller
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path

5.2.1.3 Extraction Process

Extraction is the process of capturing the artifacts and their relationships from the available resources. This process involves the code and the supporting tools. Supporting tools are required to extract the existing artifacts using static analysis, as described in the previous chapter. Three techniques are used during the extraction process:

i. Parsing technique: to parse the source code using top-down parsing.
ii. Pattern matching technique: to check for the presence of the constituents of a given keyword and specified pattern.
iii. Regular expression technique: to create a pattern of characters that is matched against a string of text when performing searches.

The artifacts that need to be extracted from the source code are identified as follows:

a) Caller / callee function
   - synthesizes the function name defined in the module
   - used to determine the beginning of each function
b) Typedef
   - contains the function return type
c) Line number
   - indicates the line number of each artifact in the source code
d) File information
   - indicates the file name and file path

Some useful information must be extracted from the source code before the subsequent keyword query takes place. The code parser is used to parse the code. Each line of code is parsed into a single string.
That single string is tested against the given keyword, using pattern matching, to obtain the concept location of the keyword. The concept location returns the line number, file name and file path of the matched keyword. Based on the concept location found, Code Query checks the keyword type: whether the keyword is a function, or a variable used in one of the functions in the source code. The function found is defined as the callee function and the functions that call it are defined as caller functions. To capture the function name, pattern matching with a regular expression for the function pattern in C code is used. The concept location information and the captured artifact relationships are stored in the artifacts repository before the abstraction process is carried out.

5.2.1.4 Abstraction Process

Code Query is developed as a software prototype tool that implements the model in order to observe its effectiveness and usability. The Code Query system provides two abstractions, a textual representation and a graphical representation.

i. Textual representation
   The search results are displayed in a caller-callee table and a code snippet based on the given keyword. The information is retrieved from the artifacts repository based on the keyword and its related artifacts.
ii. Graphical representation
   This function manipulates the extracted components from the database and abstracts the artifacts. The abstraction shows the artifacts and detailed information based on the keyword using node and arrow relationships.

5.2.2 Code Query Use Case

Figure 5.2 shows the Use Case diagram of the Code Query system, which involves a user and processes such as Get Program Change Request (PCR), Identify Keyword, Do Query, Do Visualization and Diagram Display.
[Figure 5.2 diagram labels: Code Query System; actor: Software maintainer; use cases: Get PCR, Identify Keyword, Do Extraction, Do Abstraction]

Figure 5.2 : Use Case Diagram of Code Query System

The user represents a software maintainer who is responsible for making changes to the software system based on the PCR. The software maintainer interacts with the Do Query process by keying in the keyword. In response to this request, the Do Query process performs the search and controls and classifies the file path and the location of the keyword. Do Query then interacts with Do Visualization to abstract the links between the keyword and other source code files and to display the diagram to the software maintainer. The processes in the Use Case Diagram of the Code Query System shown in Figure 5.2 are explained below.

1. Get PCR
   a) Aim
      This process is used to get the change request so that the software maintainer can perform problem analysis.
   b) Characteristic of Activation
      It is activated upon a program change request.
   c) Pre-Condition
      Not available.
   d) Basic Flows
      i) The change request is accepted by the software maintainer.
      ii) The software maintainer analyses the problem.

2. Identify Keyword
   a) Aim
      This process is used to identify the keyword after analyzing the PCR.
   b) Characteristic of Activation
      It is activated once the PCR has been analyzed.
   c) Pre-Condition
      Keyword is identified.
   d) Basic Flows
      i) The software maintainer analyzes the PCR.
      ii) The software maintainer identifies the keyword.

3. Do Extraction
   a) Aim
      This process is used to query the defined keyword and related artifacts.
   b) Characteristic of Activation
      It is activated once the keyword has been defined.
   c) Pre-Condition
      Code Query is initialized.
   d) Basic Flows
      i) Browse to the source code location.
      ii) The software maintainer keys in the keyword.
      iii) Search for the keyword in the source code files.
      iv) Search for artifacts related to the keyword.
      v) Get detailed information for the keyword and other artifacts.
      vi) Store the information in the artifacts repository.

4. Do Abstraction
   a) Aim
      This process provides a service to the software maintainer by abstracting the artifacts in the source code that are related to a keyword.
   b) Characteristic of Activation
      It is activated when Do Extraction has finished the search process.
   c) Pre-Condition
      Needs the information from the repository.
   d) Basic Flows
      i) Get the keyword's relationships with other artifacts.
      ii) Retrieve the relationships from the repository.
      iii) Populate the result into the abstract representation.
      iv) The Use Case ends.

5.2.3 Code Query Class Interactions

The Code Query system is made up of seven main classes, specifically designed to automate software code query. Figure 5.3 shows the code query class diagram, in which the arrows represent the activations between classes. Upon receiving an activation, each class, via its main operation, manages and activates other related operations to implement some tasks. Each class contains a set of operations and attributes; however, the attributes are purposely left out of the figure to show the class functionalities. Figure 5.4 complements the class diagram by showing the sequence of operations of the code query implementation.

Figure 5.3 : Code Query Class Diagram

Figure 5.4 : Code Query Sequence Diagrams

1. SplashScreen
   The main() operation launches the introduction of the CQuery application.
   i) Pre Condition
      The CQueryWindow screen is displayed.
   ii) Post Condition
      Code Query is executed.
   iii) Algorithm
      Below is the simple algorithm of the operation.

      Activate ShowSplash() to initialize MainFrame

2. CQueryWindow
   The CQueryWindow() operation loads the database connection and frame properties.
   i) Pre Condition
      The database connection is physically and logically configured for the Microsoft Access database. initProperties() defines the CQueryWindow properties.
   ii) Post Condition
      CQueryWindow and jInternalFrame1 are displayed.
   iii) Algorithm
      Below is a brief algorithm of the operation for initializing components.
      Check database connection
      Set and initialize the system status to workable
      Activate all global variables to initialize values and variables
      Activate initComponents() for MainFrame GUI
      Activate initProperties() for internalPrimary data
      Close the system

      Below is a brief algorithm of the operation for keyword searching and artifacts extraction.

      Perform search for the keyword
      Initiate DirectoryFile.exists() to check whether the directory exists
      If the directory exists
         Delete the existing data from the repository
         Activate searchAllFile() to search all file names
         Activate searchAllFunction() to search all functions
         Activate searchAllCaller() to search all caller functions
         Store the information into the repository
         Activate appendStatistics() to create the code snippet results and statistical information
         Activate createTable() to create the report list
         End of searching
      If the directory does not exist
         End of searching

      Below is a brief algorithm of the operation for abstraction.

      Perform abstraction
      Retrieve the information from the repository
      Activate createVertexEdge() to create Vertex Edge
      Activate createInternalPane() to create Internal Pane
      Display textual/graphical representation

[Figure 5.5 flowchart labels: Start; Receive Keyword; Parse Source Code (C Files); Keyword Checking (Match Keyword? Yes/No); Extract Artifacts (Match Pattern? Yes/No); Get Line Number / File Name / File Path; Relationship Information; Code Query Repository; Do Abstraction; Textual / Graphical Representation; End]

Figure 5.5: Code Query Process Algorithm Flowchart.

Process algorithm explanation:

i. Receive the keyword from the PCR.
ii. Parse the source code from the C files.
iii. Check whether the keyword matches the source code.
   a) If the keyword matches, store it in the Code Query repository.
   b) If it does not match, end the process.
iv. Extract artifacts to get the relationships for the keyword (caller / callee function).
   a) If the function pattern matches, get the line number, file name and path name. Store the data in the repository.
   b) If not, repeat the process for the next line of code.
v. Do abstraction for a successful keyword search.
   a) Fetch the artifacts from the Code Query repository and transform them into the textual representation.
   b) Fetch the artifacts from the Code Query repository and transform them into the graphical representation.

3. SourceCodeViewer
   The SourceCodeViewer() operation displays the source code.
   i) Pre Condition
      The press action is the key to viewing the source code.
   ii) Post Condition
      The source code frames are displayed.
   iii) Algorithm
      Below is a brief algorithm of the operation for the source code viewer.

      Activate btnEnter() for source code viewer
      Activate btnEnterActionPerformed()
      Initialize initComponents() to prepare source code frame
      Activate ViewCodeC() to display source code with highlighted functions
      Display SourceCodeViewer

5.3 Code Query Implementation and User Interfaces

Code Query assumes that a change request has already been translated and expressed in terms of some acceptable functions. Code Query was designed to search for the keyword based on the program change request. Given a keyword as a change request, Code Query determines its effects on other functions and its locations in the related source code files.

Figure 5.6: Code Query Introduction Screen.

Figure 5.6 shows the Code Query introduction screen, where the user is introduced to the Code Query information. Figure 5.7 shows the first user interface of Code Query, from which the keyword search is started.

Figure 5.7: First user interface of Code Query.

Code Query displays a file path field for the maintainer to browse to the source code directory and a keyword field for the maintainer to define the keyword. The Visualization button is activated once the keyword has been successfully searched.

Figure 5.8: File Path and the Keyword Field.

Figure 5.8 shows that the maintainer defines the directory for the C source code as D:\Generate_index\GI Codes\GI and dmsc_open as the keyword.
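A query such as the one set up in Figure 5.8 can be illustrated with a rough sketch of how the callers of a keyword function might be collected. This is a Python illustration under stated assumptions, not the prototype's Java implementation: the FUNC_DEF pattern, the find_callers function and the bodies in SAMPLE are hypothetical, function headers are assumed to fit on one line, and braces inside strings or comments are not handled.

```python
import re

# Assumed pattern for a one-line C function header such as
# "void index_File(void) {"; prototypes ending in ";" do not match.
FUNC_DEF = re.compile(
    r'^\s*(?:[A-Za-z_]\w*[\s\*]+)+([A-Za-z_]\w*)\s*\([^;{}]*\)\s*\{?\s*$')

def find_callers(source, keyword):
    """Return (caller_name, line_number) for each call to keyword."""
    call = re.compile(r'\b' + re.escape(keyword) + r'\s*\(')
    results = []
    depth = 0        # current brace nesting; 0 means outside any function
    current = None   # name of the function whose body is being scanned
    for line_no, line in enumerate(source.splitlines(), start=1):
        if depth == 0:
            match = FUNC_DEF.match(line)
            if match:
                current = match.group(1)
        elif current is not None and current != keyword and call.search(line):
            results.append((current, line_no))
        depth += line.count('{') - line.count('}')
    return results

# Hypothetical C source; only the function names come from the case study.
SAMPLE = """\
int dmsc_open(char *name)
{
    return 1;
}

void dictionary_Load(void)
{
    int h = dmsc_open("dict");
}

void index_File(void) {
    dmsc_open("idx");
}
"""

for caller, line_no in find_callers(SAMPLE, "dmsc_open"):
    print(caller, line_no)
```

The sketch treats the function that defines the keyword as the callee and every other function whose body contains a call to it as a caller, which mirrors the caller and callee roles described in Section 5.2.1.3. A real C parser would be needed for multi-line headers, macros and preprocessor conditionals.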
Figure 5.9: Textual Representation

Figure 5.9 shows the source code that matches the keyword, including the results list, the code snippet results and the statistical information. The results list shows the caller and callee details such as function name, file name, line number and file path. In the results list, dmsc_open is defined as the callee and dictionary_Load, document_GenerateIndex and index_File are defined as callers. The code snippet results show the keyword match in a single line of source code. The statistical information shows the matched and searched totals for lines of code, files and directories.

Figure 5.10: Graphical Representation

Figure 5.10 shows the functions that are involved with the selected keyword. In this window, the maintainer can see the detailed information for dmsc_open. For instance, dictionary_load, document_GenerateIndex and index_File are the functions that call dmsc_open.

Figure 5.11: Low Level of Abstraction - Source Code viewer

Figure 5.11 shows the location of the keyword dmsc_open, highlighted in red. From the source code viewer, the maintainer can directly read and understand the logic of the code.

Figure 5.12: High Level of Abstraction - Detail Relationship of Artifacts

Figure 5.12 shows the detailed relationships between the functions. For example, the dictionary_load function calls the dmsc_open, constructionWord and dmsc_close functions, and dictionary_load is called by the document_GenerateIndex function. Source code changes in a callee function might affect some parts of its caller functions.

5.4 Other Supporting Tools

A supporting tool was used to facilitate the Code Query model and approach, namely Regular Expression – Java Pattern.
A regular expression specifies a set of strings that match it; a regular expression library lets a program check whether a particular string matches a given regular expression. In this research, Java Pattern is used as the regular expression facility to extract or parse the source code and token files to get the related information. All the information is stored in the repository for use in the Code Query system.

5.5 Summary

A Code Query approach was specifically developed in this research to support the crucial need for source code query and to obtain better analysis results. This chapter has described the system architecture of the proposed model and its design, including the system Use Case and class diagrams. The implementation covers the class interactions, the Code Query algorithms and the user interfaces. The main contribution of the proposed system is its ability to support abstraction based on keyword query. Other supporting tools were used to capture the relationship information in the source code files.

CHAPTER 6

EVALUATION

6.1 Introduction

This chapter discusses the evaluation methods for the program understanding model. The approach involves evaluating the model, the case study and the experiment. The objective of this evaluation is to verify that the proposed code query model and approach support program understanding. To achieve this objective, this research first elaborates on the modeling itself and how each model specification item is satisfied. Secondly, the output of the analysis is tested on a case study through a controlled experiment. Thirdly, the quantitative and qualitative evaluations are measured based on the scored results. Some other qualitative values are also considered through a comparative study of existing program understanding models and approaches in order to strengthen the overall findings.
This chapter concludes with a summary of the evaluation.

6.2 Case Study

The main purpose of the Code Query empirical study is to verify the concept. The criteria should also allow Code Query to be accepted as a software maintenance model. The following criteria were applied in the empirical study:

i. The case study is a software project with complete source code.
ii. The software project has both caller and callee functions, so that the proposed model can be demonstrated.

6.2.1 Outlines of Case Study

In this research, the code query model was applied to a case study of a software project called Generate Index (GI), written in C. GI is an individual development project assigned to postgraduate students of software engineering at the Centre for Advanced Software Engineering (CASE), Universiti Teknologi Malaysia (UTM). The main reason for selecting this case study is that the CASE students, as subjects, were already familiar with the case study and its domain knowledge, which is one of the crucial criteria for establishing an experiment. Prior to the experiment, the code was analyzed to trace the occurrences of the function relationships.

6.2.2 GI Project Briefing

GI is a word processing system that is launched by a user specifying the name of the document to be analyzed; a word document consists of a file of characters created and edited by a user. Additionally, the system produces a document that keeps the index of each word in the analyzed word document by referring to a dictionary that contains the list of words to be indexed. The system consists of four modules: MMIMS, SystDoc, LibAdt and DMSC.

The GI system was introduced to the subjects five months before the experiment was conducted, when they studied the system to perform their minor project assignment. The subjects had also taken a C language module in their previous semester.
Consequently, the subjects had some idea of what the system was about and of the C language itself. Their previous experience reduced the effort needed to brief them on the subject system because they already had some domain and application knowledge. This made it possible to focus on training the subjects in using the tools to solve the assigned maintenance tasks.

6.3 Controlled Experiment

The aim of the experiment is to see how the prototype can support maintenance tasks in terms of its efficiency and accuracy in locating the keyword in the source code. In addition, the experiment examines how the prototype can support program understanding. In this experiment, the subjects were provided with a complete GI system that included a set of source code and documentation.

6.3.1 Subject and Environment

The subjects of the study consisted of professionals in the industry and post-graduate students in the final semester of the software engineering course at the Centre for Advanced Software Engineering, Universiti Teknologi Malaysia (UTM). They were equipped with the basics of software maintenance, as this subject was taught in the post-graduate program. The students were familiar with C programming, as this language was taught and used in some coursework projects alongside C++ and Java. In the coursework program, the students were taught software project management and how to read and write documentation based on the DOD standard MIL-STD-498. The subjects were motivated to take part in the experiment by making clear that they would gain valuable experience in software maintenance know-how. From investigation, all but one of the subjects had at least one year of working experience in the software industry.

6.3.2 Questionnaires

The questionnaires were specially formulated to support the user evaluation in the controlled experiment.
The main objective of the assessment was to measure the efficiency and accuracy of keyword searching in the context of program understanding. The questionnaire consists of two sections, designed to cover the subject's current professional background and the evaluation of the tools (see Appendix A).

6.3.3 Experimental Procedures

Twenty participants were involved in the GI maintenance project case study described above. The subjects were given a briefing and a training session before the actual experiment took place. They were provided with a set of system documentation, the Rigi, Grep, CodeSurfer and Code Query tools (Appendix B) and the source code of the GI project. The subjects then had to perform some cognitive understanding tasks prior to the experiment. The idea behind this training session was to avoid confounding factors, such as subjects not being aware of the experimental setting and procedures or of how to perform searches in the required manner.

In this experiment, each subject was given a set of change request questions from which to find the keyword. The keyword was then queried using the four tools provided. The subjects needed to use the tools to understand the GI code using the keyword identified earlier.

6.3.4 Possible Threats and Validity

There are a few factors that may threaten the validity of this study. The factors are listed below:

1. The participants involved in this experiment are post-graduate students and professionals in the industry. Hence, it is possible that they were already aware of the GI project.
2. There could be an issue of unfamiliarity with the tool under evaluation. To overcome this learning curve, a short training session on the use of the Code Query tool was held prior to the experiment. The controlled experiment was conducted on all subjects under proper supervision.
A two-day experiment was allocated in order to secure a good result and to foster the participants' enthusiasm for the study.

3. Lastly, the subjects could have made careless mistakes in their results, caused by human error in manually examining the software components. This factor was mitigated by double-checking the work flows of each group after their experiment.

6.4 Experimental Result

The analysis consists of two parts: the analysis of the controlled experiment and the analysis of the usability.

6.4.1 Analysis of the Controlled Experiment

The aim of the experiment was to see how the prototype can support the maintenance task in terms of its efficiency and understanding of the software. The subjects were provided with a complete GI system that included a set of source code and documentation. Given some change requests, the subjects were first asked to obtain the occurrences of the keyword in the source code.

The analysis of the controlled experiment was based on the values of variables derived from the metrics specified in Table 6.1. Based on the "Past Experience" factor, Table 6.1 reveals the variation among the sample population. The largest group of subjects (40 percent) were software engineers / system analysts, followed by programmers (30 percent), other jobs, namely system engineer and product engineer (15 percent), project leaders / project managers (10 percent) and lecturers (5 percent).

Table 6.1: Cross tabulation of job versus frequencies

Job | Frequency | Percent | Valid Percent | Cumulative Percent
Lecturer | 1 | 5.0 | 5.0 | 5.0
Project Leader / Project Manager | 2 | 10.0 | 10.0 | 15.0
Other | 3 | 15.0 | 15.0 | 30.0
Programmer | 6 | 30.0 | 30.0 | 60.0
Software Engineer / System Analyst | 8 | 40.0 | 40.0 | 100.0
Total | 20 | 100.0 | 100.0 |

The same procedure was used to analyze the "years of experience" factor.
The subjects' ranges of experience in software development and in software maintenance are shown in Table 6.2 and Table 6.3 respectively.

Table 6.2: Cross tabulation of experience in software development

Year of Experience | Frequency | Percent | Valid Percent | Cumulative Percent
1-2 years | 5 | 25.0 | 25.0 | 25.0
3-4 years | 4 | 20.0 | 20.0 | 45.0
Less than 1 year | 2 | 10.0 | 10.0 | 55.0
More than 4 years | 9 | 45.0 | 45.0 | 100.0
Total | 20 | 100.0 | 100.0 |

Five subjects had 1 to 2 years of development experience, four had 3 to 4 years, only two had less than 1 year, and nine had more than 4 years (see Table 6.2).

Table 6.3: Cross tabulation of experience in software maintenance

Year of Experience | Frequency | Percent | Valid Percent | Cumulative Percent
1-2 years | 4 | 20.0 | 20.0 | 20.0
3-4 years | 3 | 15.0 | 15.0 | 35.0
Less than 1 year | 8 | 40.0 | 40.0 | 75.0
More than 4 years | 4 | 20.0 | 20.0 | 95.0
No experience | 1 | 5.0 | 5.0 | 100.0
Total | 20 | 100.0 | 100.0 |

Besides development, most of the subjects also had experience in maintenance. Only one subject had no experience in maintenance tasks and eight subjects had less than one year of experience, while four subjects had more than 4 years of maintenance experience (see Table 6.3).

6.4.2 Analysis of the Usefulness and Usability Study

The subjects were asked to evaluate the Code Query tool (Table 6.5) with respect to its usefulness and usability in supporting program understanding, based on 6 basic scales (1-Not At All, 2-Low, 3-Moderate, 4-Useful, 5-Very Useful, 6-Extremely Useful). The study also gathered feedback from the subjects comparing the usefulness and usability of the four tools: GREP (GP), CodeSurfer (CS), Rigi (RG) and Code Query (CQ). Figure 6.1 shows the mean evaluation score for usefulness and usability of the four tools. The questions are listed in Appendix A.

[Figure 6.1: line chart; vertical axis "Evaluation Scale (Mean)", 0 to 5; horizontal axis "Question No.", 1 to 15; series CS, RG, GP and CQ]
Figure 6.1: Usefulness and Usability of Tools

Table 6.4: Mean of scores for Code Query

Question No. | CS | RG | GP | CQ
1 | 3.00 | 2.17 | 3.67 | 4.50
2 | 2.50 | 2.50 | 3.67 | 4.33
3 | 2.50 | 2.50 | 3.83 | 4.17
4 | 2.33 | 2.50 | 3.33 | 3.83
5 | 2.67 | 2.33 | 3.33 | 3.83
6 | 2.67 | 2.83 | 3.00 | 4.50
7 | 2.33 | 2.33 | 2.83 | 3.33
8 | 2.83 | 2.83 | 3.17 | 3.83
9 | 3.17 | 3.33 | 3.83 | 4.00
10 | 3.17 | 3.17 | 3.67 | 4.00
11 | 2.33 | 3.33 | 3.17 | 4.33
12 | 2.33 | 3.17 | 2.83 | 4.00
13 | 2.50 | 3.00 | 2.67 | 4.17
14 | 2.17 | 2.00 | 2.83 | 3.83
15 | 2.33 | 2.00 | 3.17 | 3.67

For Table 6.4, the subjects were asked to evaluate the overall usefulness of Code Query in supporting program understanding based on 5 basic scales (1-Strongly Disagree, 2-Disagree, 3-Normal, 4-Agree, 5-Strongly Agree). On overall ease of use (Question 1) in Figure 6.1, CQ received a mean score of 4.50 (between agree and strongly agree), GP 3.67, CS 3.00 and RG 2.17. The majority of the sample agreed that the Code Query tool was easy to use overall. On Question 3, CQ received a mean score of 4.17, GP 3.83, and RG and CS 2.50 each (between disagree and normal). The result in Table 6.4 shows that Question 3 was rated as very satisfactory for CQ.

Based on Table 6.4 and Figure 6.1, most of the subjects agreed or strongly agreed that CQ was a very useful tool for keyword query searching. CQ provided the query capability to search for occurrences across functions in the source code. The subjects placed CQ comfortably between 3 and 5 on every question, which means mostly between agree and strongly agree.

Figure 6.2: Usefulness of Tool (mean values based on Likert scale 1: Very Useless, 2: Useless, 3: Normal, 4: Useful, 5: Very Useful)

Table 6.5: Mean of Usefulness of Tools

Tools | N | Mean | Std. Deviation
CQ | 8 | 4.62 | .518
CS | 2 | 2.00 | .000
GP | 4 | 3.25 | .500
RG | 6 | 3.67 | .516
Total | 20 | 3.80 | .951

Figure 6.2 shows that CQ was rated the most useful tool compared with the three control tools. RG was rated more positively than GP and CS. Both software maintenance tools, Code Query and Rigi, had values above the mean value of 3.00 (normal scale), while the mean value for CQ lay between 4.00 and 5.00, within the range of useful to very useful.

6.5 Analysis of Finding

This section discusses the findings of the research from two perspectives, gathered from the controlled experiment and the usability study. The analysis is based on the subjects' acceptance of the tools. Acceptance here means that the Code Query method could support program understanding.

6.5.1 Acceptance Tool

Tool acceptance means that the subjects agreed that a tool could help them in program understanding. Twenty subjects were involved in this experiment and they were asked to use the four tools, including the newly proposed tool. Table 6.6 displays the mean comparison between the tools.

Table 6.6: Mean Comparison between Tools

Tools | N | Mean | Std. Deviation | % of Total N
CQ | 8 | 4.62 | .518 | 40.0%
CS | 2 | 2.00 | .000 | 10.0%
GP | 4 | 3.25 | .500 | 20.0%
RG | 6 | 3.67 | .516 | 30.0%
Total | 20 | 3.80 | .951 | 100.0%

From the above result, CQ appears to be a good tool to support the subjects in program understanding. It accounts for 40% of the total N, the largest share among the four tools, which is considered sufficient to indicate that CQ is accepted by the subjects.

6.5.2 Qualitative Evaluation

The research application was exposed to users by letting them use it and evaluate its effectiveness under a controlled experiment. User perception took into account the feedback and comments from users on the usefulness of the Code Query prototype tool in supporting program understanding of C software.
Some questionnaires were designed to establish the usability of the prototype, that is, whether it is useful and effective in supporting program understanding. The majority of the participants agreed that Code Query is easy to use for program understanding. They found it easy to identify the keyword at source code level. On the question of speeding up search time, the majority responded that Code Query can return results in seconds. They also agreed that the keyword search made it easy to identify the inter-relationships between functions.

Table 6.7: Existing Features of Code Query Systems

Features | Rigi | CodeSurfer | Grep | Code Query
Query | N | Y | Y | Y
Language | C, C++ | C, C++ | All extension | C
Search Technique | Regular Expression, Pattern Matching | Regular Expression | Pattern Matching | Regular Expression, Pattern Matching
Concept Location | N | Line no, filename, file path for all function | Line no, filename, file path by the searched keyword | Line no, filename, file path for caller and callee
Graphical Representation | Show function name on node | Show function name on node | N | Show line number, function name on node
Textual Representation | N | Location for searched keyword | Location for searched keyword | Contain caller and callee location based on searched keyword
Source Viewer | N | Highlighted selected statement | Highlighted searched word | Highlighted searched keyword
Dependency | Y | Y | N | Y
Occurrences | Y | Y | N | Y

The table above is a qualitative comparison between Rigi, CodeSurfer, Grep and Code Query. Each tool supports program understanding in its own way. Rigi does not have a query facility and is based on visualization. CodeSurfer, Grep and Code Query provide a search facility. Grep can search for any keyword in a file of any extension and is not tied to a programming language. Code Query and CodeSurfer are focused on source code query. All the tools except Grep provide source code abstractions, which help to enhance program understanding through a textual or graphical representation.
All the tools except Rigi have a concept location feature. The difference is that Code Query reports both caller and callee information, each with its own location. In other words, Code Query provides concept location information for both the source and the destination of a call, which gives it an advantage in the qualitative comparison.

6.6 Summary

This chapter described the analysis and findings of the controlled experiment, which evaluated how far the proposed code query model achieves effectiveness and accuracy in providing a query facility for program understanding. The results produced by the subjects using the prototype to complete the experiment were taken as the variables of interest. The proposed model was accepted by the subjects, and it is deduced that the model makes some significant contributions to program understanding. The subjects also agreed that the tool provides some useful interfaces and improves the productivity of software maintenance. This approach was compared qualitatively with other similar approaches. In general, this model is able to improve on some aspects of the features discussed in the earlier literature.

CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 Introduction

This chapter summarizes the research by providing its conclusion and its significant contributions to both academia and practice. It also provides suggestions for future research work. The software understanding approach proposed earlier in this research is summarized in this chapter. The chapter starts with a summary and an explanation of how the research achieves each objective set earlier. It is followed by the main contributions of this research in response to the change process in software evolution. Finally, it describes some limitations of the current scope and possible areas of future research.

7.2 Contribution

The main contributions of the proposed software Code Query model and approach can be summarized as follows.

1. The new model of Code Query provides a keyword query functionality to search for a keyword or code in the source code.

2. The new model provides a more useful and informative extraction of reliable source code artifacts, such as the line number, file path, filename and snippets of the code, based on a keyword or code query.

3. The new model provides information on the caller and callee based on the keyword or code query.

4. The new model provides abstractions to show details of the function and the occurrences of the keyword or code in the source code.

7.3 Research Limitation and Future Works

Despite the above contributions, the findings are subject to certain limitations, as follows:

1. As large systems are extremely complex, the usefulness of this approach has so far been tested on, and made applicable to, a small-sized software system only. Large systems may involve integrated applications on different platforms and environments.

2. The new software Code Query model focuses only on software written in C and covers only structured programming.

Based on the scope established and the limitations discovered, the following areas may constitute possible future work:

1. Limitation (2) above can be addressed in future work. At present, only the C language is the subject of the research. In future research, a mechanism should be found to handle both the C and C++ languages.

2. Future work should cover both object-oriented and structured programs.

3. Future work should focus more on visualization methods, because visualization is very useful for program understanding.

4. Future work should preferably provide an editable source code viewer. This will help the maintainer to immediately edit the concept in the source code.

7.4 Summary

The set of research objectives defined in the early stages set the direction of this research.
The prototype was developed and tested by twenty people, and the experiment achieved satisfactory results. The results of the experiment support the correctness and usefulness of the newly proposed approach. Hence, this approach could help maintainers to understand the software easily via the source code, with less dependence on software documentation, which usually does not reflect the working version of the software.

REFERENCES

Anderson, P. and Zarins, M. (2005). The CodeSurfer software understanding platform. Proceedings 13th International Workshop on Program Comprehension. 147-148.

Asif, N. (2008). Artifacts Recovery at Different Levels of Abstractions. Information Technology Journal.

Beron, M. M., Henriques, P. R., Pereira, M. J. V., Uzal, R. and Montejano, G. (2006). A Language Processing Tool for Program Comprehension. XII Argentine Congress on Computer Science (CACIC06). Potrero de los Funes, San Luis, Argentina. October 2006.

Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P. and Stal, M. (1996). Pattern-oriented software architecture: A system of patterns. New York, USA: John Wiley & Sons.

CodeSurfer (2007). CodeSurfer Overview. http://www.grammatech.com/products/codesurfer/overview.html

Cox, A. and Collard, M. L. (2005). Textual Views of Source Code to Support Comprehension. 13th International Workshop on Program Comprehension (IWPC 2005). St. Louis, MO, USA.

Damasevicius, R. (2006). On the Quantitative Estimation of Abstraction Level Increase in Metaprograms. Journal of Computer Science and Information System. June 2006.

Deursen, A. V. (2001). Program Comprehension Risks and Opportunities in Extreme Programming. Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE'01). Stuttgart, Germany. 176-188. October 2001.

Eichberg, M., Haupt, M., Mezini, M. and Schafer, T. (2005). Comprehensive Software Understanding with SEXTANT. 21st IEEE International Conference on Software Maintenance (ICSM'05). 315-324. Budapest, Hungary. September 2005.

Erlikh, L. (2000). Leveraging Legacy System Dollars for E-business. IEEE IT Professional Publication. 17-23. May 2000.

Frisch, A. and Cardelli, L. (2004). In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP'04). Turku, Finland.

Froehlich, J. and Dourish, P. (2004). Unifying Artifacts and Activities in a Visual Tool for Distributed Software Development Teams. Proceedings of the 26th International Conference on Software Engineering. Edinburgh, Scotland, UK. May 2004.

Grune, D. and Jacobs, C. J. H. (2008). Parsing Techniques: A Practical Guide. (2nd Edition). Ellis Horwood Ltd, UK: Prentice Hall.

Hoffer, J. A., George, J. F. and Valacich, J. S. (1999). Modern Systems Analysis and Design. (2nd Edition). USA: Prentice Hall.

Ibrahim, S. (2006). A Document-Based Software Traceability to Support Change Impact Analysis of Object-Oriented Software. University Of Technology Malaysia (UTM): PhD Thesis.

ISO/IEC 14764 (2006). IEEE Std 14764-2006 Software Engineering - Software Life Cycle Processes - Maintenance.

Kajko-Mattsson, M. (2000). Preventive Maintenance! Do we know what it is? Proceedings of the International Conference on Software Maintenance. San Jose, USA. 12-14. October 2000.

Koschke, R. (2001). Software Visualization. International Seminar Dagstuhl Castle, Germany. May 20-25, 2001.

Kwon, O. C., Boldyreff, C. and Munro, M. (1998). Software Configuration Management for a Reusable Software Library within a Software Maintenance Environment. The International Journal of Software Engineering and Knowledge Engineering (IJSEKE). September 1998.

Lange, C., Sneed, H. M. and Winter, A. (2001). Comparing Graph-based Program Comprehension Tools to Relational Database-based Tools. 9th International Workshop on Program Comprehension (IWPC'01). Toronto, Canada. May 2001.

Lee, M. (1998). Change Impact Analysis of Object-Oriented Software. George Mason University: Master Thesis.

Maletic, J. I. and Marcus, A. (2001). Supporting Program Comprehension Using Semantic and Structural Information. 23rd International Conference on Software Engineering (ICSE'01). Toronto, Canada. May 2001.

Marcus, A., Rajlich, V., Buchta, J., Petrenko, M. and Sergeyev, A. (2005). Static Techniques for Concept Location in Object-Oriented Code. Proceedings of the 13th International Workshop on Program Comprehension (IWPC 2005). 33-42. Washington, DC, USA.

Marcus, A., Sergeyev, A., Rajlich, V. and Maletic, J. I. (2004). An Information Retrieval Approach to Concept Location in Source Code. Proceedings of the 11th Working Conference on Reverse Engineering. Delft, The Netherlands. November 2004.

Nelson, L. M. (2005). A Survey of Reverse Engineering and Program Comprehension. ODU CS 551 - Software Engineering Survey.

Pattern Matching (2003). http://en.wikipedia.org/wiki/Pattern_matching

Paul, S. and Prakash, A. (1994). A Framework for Source Code Search Using Program Patterns. IEEE Transactions on Software Engineering (TSE). June 1994.

Paulisch, F. N. (1993). The Design of an Extendible Graph Editor. Germany: Springer-Verlag Berlin Heidelberg.

Phanindra, G., Shankar, K. V. V. N. R. and Sreenivas, P. D. (2007). A Fast Multiple Pattern Matching Algorithm using Context Free Grammar and Tree Model. International Journal of Computer Science and Network Security (IJCSNS). September 2007.

Pigoski, T. M. (1997). Practical Software Maintenance: Best Practices for Managing your Software Investment. USA: John Wiley & Sons.

Rajlich, V. and Wilde, N. (2002). The Role of Concepts in Program Comprehension. Proceedings of 10th International Workshop on Program Comprehension. June 27-29. France: IEEE Computer Society. 271-278.

Rasool, G. and Philippow, I. (2008). Recovering Artifacts from Legacy Systems using Pattern Matching. Proceedings of World Academy of Science, Engineering and Technology. December 2008.

Regular Expression (2007). http://en.wikipedia.org/wiki/Regular_expression

Rigi (2004). Rigi Group Home Page. http://www.rigi.csc.uvic.ca/

Sartipi, K., Kontogiannis, K. and Mavaddat, F. (2000). A Pattern Matching Framework for Software Architecture Recovery and Restructuring. 8th International Workshop on Program Comprehension (IWPC'00). Limerick, Ireland. June 2000.

Singer, J., Lethbridge, T., Vinson, N. and Anquetil, N. (1997). An Examination of Software Engineering Work Practices. Proceedings Conference of Centre for Advanced Studies on Collaborative Research. Toronto, Ontario. November 1997.

Sommerville, I. (1997). Software Engineering. (5th Edition). England: Addison Wesley.

Storey, M.-A. D. (1998). A Cognitive Framework for Describing and Evaluating Software Exploration Tools. Simon Fraser University, Canada: PhD Dissertation.

Storey, M.-A. (2005). Theories, Tools and Research Methods in Program Comprehension: Past, Present and Future. Proceedings of the 13th International Workshop on Program Comprehension (IWPC 2005). May 2005.

Sulaiman, S. (2004). A Document-Like Software Visualization Method for Effective Cognition of C-Based Software Systems. University Of Technology Malaysia (UTM): PhD Thesis.

Sulaiman, S. (2004). Viewing Software Artifacts for Different Software Maintenance Categories Using Graph Representations. Malaysian Journal of Computer Science. December 2004.

Tammy, V. (2005). Reading Before Writing: Can Students Read and Understand Code and Documentation? Proceedings of the 36th SIGCSE Technical Symposium on Computer Science Education. St. Louis, Missouri, USA. February 2005.

Tilley, S. R., Smith, D. B. and Paul, S. (1996). Towards a Framework for Program Understanding. Proceedings of the 4th International Workshop on Program Comprehension (WPC '96). March 1996.

WinGrep (2007). Windows Grep - Advanced searching for Windows. http://www.wingrep.com/
APPENDIX A

UNIVERSITI TEKNOLOGI MALAYSIA
Centre for Advanced Software Engineering
City Campus, Jalan Semarak, 54100 Kuala Lumpur

QUESTIONNAIRE ON USABILITY OF SOFTWARE UNDERSTANDING TOOL

Software maintenance is a problem that plagues the software industry. Software maintenance takes up approximately 50-75% of the cost of software development [I. Sommerville, Software Engineering, 2004]. Software understanding is central to maintenance because the programmer working on a piece of code is often not the original programmer of that software, so the programmer must take the time to understand the code. Even if the original programmers work on their own code during maintenance, they may not remember what the code does, so software understanding becomes a critical issue. Several tools have been developed to aid software understanding. Essentially, this questionnaire attempts to obtain your opinions on the usefulness and usability of software understanding tools.

Objectives

1. To identify the usefulness and usability of software understanding tools.
2. To identify the weaknesses and strengths of existing software understanding tools.

Remarks

It should take approximately 20-30 minutes to complete the questionnaire. I would like to ask for your sincere participation; your cooperation in answering this questionnaire is very much appreciated. Thank you very much.

Prepared by:
DAHLIA BINTI DIN
Faculty of Computer Science and Information System
Universiti Teknologi Malaysia
ailhaddin@yahoo.com

Professional Background

Questions in this section relate to your current position and previous experience (tick √ for the answer):

1. Which category does your current company belong to?
   [ ] Software  [ ] Telecommunication  [ ] Academic  [ ] Banking  [ ] Production  [ ] Other, please specify:

2. What is your current job?
   [ ] Programmer  [ ] Software Engineer / System Analyst  [ ] Project Leader / Project Manager  [ ] Quality Engineer  [ ] Lecturer  [ ] Researcher  [ ] Other, please specify:

3. Number of years you have been involved in software development?
   [ ] Less than 1 year  [ ] 1-2 years  [ ] 3-4 years  [ ] More than 4 years  [ ] No experience at all

4. Number of years you have been involved in software maintenance?
   [ ] Less than 1 year  [ ] 1-2 years  [ ] 3-4 years  [ ] More than 4 years  [ ] No experience at all

5. What types of work products have you been involved with in software maintenance? (Tick √ one or more answers)
   [ ] Code  [ ] Design  [ ] Specifications  [ ] Testing  [ ] Requirement or project management  [ ] Documentation  [ ] Other, please specify:

6. Which programming languages have you been involved with during software maintenance? (Tick √ one or more answers)
   [ ] C  [ ] C++  [ ] Java  [ ] Visual Basic  [ ] Other, please specify:

7. Which tools do you use to support your software maintenance process? (Tick √ one or more answers)
   [ ] Grep  [ ] Rigi  [ ] Columbus  [ ] McCabe  [ ] CodeSurfer  [ ] Not use any tools  [ ] Other, please specify:

8. How do you find the task of software maintenance? (Tick √ one or more answers)

   Descriptions | Agree | Normal | Disagree | No Opinion
   Boring task | | | |
   Tedious task | | | |
   Time consuming | | | |
   Critical but crucial job | | | |
   Need skill and experience | | | |
   Others (please specify): | | | |

GI Project

Questions in this section relate to your experience in the GI project:

1. How long (in hours) do you take to maintain the changes? (Average from 6 PCR) (Tick √ for the answer)
   [ ] 1-5 hours  [ ] 6-10 hours  [ ] 11-24 hours  [ ] More than 24 hours

2. Between the source code and the software document, which is more helpful for you to understand the software? (Tick √ for the answer)
   [ ] Source code  [ ] Software document

3. Between the source code and the software document, which is more helpful to you for software maintenance? (Tick √ for the answer)
   [ ] Source code  [ ] Software document

4. Do you use any existing tools to help you do the software maintenance (such as Rigi, CodeSurfer, etc.)? (Tick √ for the answer)
   [ ] Yes, please specify:  [ ] No

5. List the steps you used to manage the software maintenance:
   (1) (2) (3) (4) (5) (6) (7) (8) (9) (10)

6. Other comments or issues:

Evaluation of the Tools

For this section, you need to evaluate all four tools. Use the following codes to indicate the tools:

GP: Grep
CS: Colombus
RG: Rigi
CQ: CQuery

1. Specify your opinion on the usability of the four tools by writing the codes above in the corresponding row and column. The scale columns are: Strongly Disagree (1), Disagree (2), Normal (3), Agree (4), Strongly Agree (5). The two example rows on the form (Eg. Criteria 1 and Eg. Criteria 2) show tool codes such as RG, CS, GP and CQ written under the chosen scale column.

   Criteria:
   1. Easy to use in overall
   2. Easy to understand
   3. Can speed up time in searching the keyword location
   4. Search utility provided in graphical views is sufficient
   5. The tasks can be completed using this tool effectively
   6. Easy to indicate the inter-relationship of function
   7. Provide basis for cost estimation and schedule planning
   8. Support potential change impact analysis
   9. Information provided is well organized
   10. Textual information provided is sufficient
   11. Graphical views are simple and easy to understand
   12. Graphic information is sufficient
   13. Easy to trace link between graphical and source code
   14. The interface is good
   15. Less time for software understanding
   16. Other, please specify:

2. Indicate the usefulness of the four tools provided to understand the software (tick √ the appropriate row and column)

   Tools | Very Useless (1) | Useless (2) | Normal (3) | Useful (4) | Very Useful (5)
   Grep | | | | |
   Colombus | | | | |
   Rigi | | | | |
   CQuery | | | | |

3. Other comments:

- Thank you -

APPENDIX B

USER MANUAL

Source Code Query to Support Structured Program Understanding (CQuery)

How to start using the CQuery tool?

To use CQuery, follow these instructions:

1. Double click on the CQuery shortcut on your desktop.

2. Click on the 'Browse' button to select the source code location.
   [Screenshot callout: Browse the source code location]

3. Enter the keyword in the keyword field, then click on the 'Search' button to start your keyword search.
   [Screenshot callouts: Enter the keyword; Click here to start the keyword searching]

4. The search result will appear as below.
   Then click on the 'Visualize' button to visualize the keyword.
   [Screenshot callout: Click here to visualize the keyword]

5. The abstraction of the keyword is shown. To view the source code, click on a node in the diagram, and then click on the 'View Source Code' button. The source code viewer window will be displayed.

6. Source code viewer. The keyword will be highlighted in the source code.
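
The search steps in the manual above (choose a source folder, enter a keyword, list the matches, highlight the keyword in the viewer) can be illustrated with a minimal sketch. This is not the CQuery implementation: the function names `search_keyword` and `highlight`, the `>>...<<` highlight markers and the sample file `gi_demo.c` are all hypothetical, and the sketch assumes the keyword is a plain C identifier matched as a whole word with a regular expression. It only reports the file path, filename and line number of each match; caller and callee extraction is not attempted.

```python
import os
import re
import tempfile


def search_keyword(root, keyword, extensions=(".c", ".h")):
    """Return (file path, filename, line number, line text) for every line
    under `root` that contains `keyword` as a whole word."""
    # re.escape keeps the keyword literal; \b limits matches to whole words,
    # so "speed" does not match "speedy" or "set_speed".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    hits = []
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # visit sub-folders in a stable order
        for name in sorted(files):
            if not name.endswith(extensions):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as src:
                for line_no, line in enumerate(src, start=1):
                    if pattern.search(line):
                        hits.append((path, name, line_no, line.rstrip("\n")))
    return hits


def highlight(line, keyword):
    """Mark each whole-word occurrence of `keyword`, as a source viewer might."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return pattern.sub(lambda m: ">>" + m.group(0) + "<<", line)


if __name__ == "__main__":
    # Hypothetical sample source, written to a temporary folder for the demo.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "gi_demo.c"), "w", encoding="utf-8") as f:
            f.write("int speed;\nvoid set_speed(int v) { speed = v; }\n")
        for path, name, line_no, text in search_keyword(root, "speed"):
            print(name, line_no, highlight(text, "speed"))
```

A plain whole-word regular expression was chosen here because it mirrors the "regular expression and pattern matching" search technique named in the comparison table; a real tool would also need to skip comments and string literals, which this sketch does not do.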