A SOURCE CODE QUERY TO SUPPORT STRUCTURED PROGRAM UNDERSTANDING

DAHLIA BINTI DIN
UNIVERSITI TEKNOLOGI MALAYSIA
A thesis submitted in fulfillment of the
requirements for the award of the degree of
Master of Science (Computer Science)
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
JANUARY 2010
ALHAMDULLILAH
For
my beloved parents, Emak and Abah,
my lovely sisters, Edah and Adik
and the rest of my family members
who give me the strength and courage.
ACKNOWLEDGEMENT
I would like to take this opportunity to thank my supervisor Assoc.
Prof. Dr Suhaimi Bin Ibrahim and Dato’ Prof. Dr Norbik Bin Bashah for their
motivation, advice and inspiration throughout this research.
Special thanks go to my friends, Rita and Fiza, for their encouragement and
support, which helped to make this research successful.
I would also like to thank my friends and the post-graduate students of CASE,
Universiti Teknologi Malaysia, Kuala Lumpur, for their participation in the
controlled experiment.
ABSTRACT
Most software inevitably undergoes change and needs maintenance as its
environment, technologies and domain knowledge change. A maintainer is
responsible for analyzing and understanding the code occurrences before making
any change, because a change to one part of the code can affect other parts of
the same code. Program understanding is therefore an important factor in
understanding code occurrences and in the effectiveness of software maintenance.
A program understanding activity involves browsing and exploring the source code
or the software documentation. However, not all software has up-to-date
documentation. In such cases the documents tend to be more or less obsolete, and
the source code remains the only reliable source left for maintainers to
understand the software. Maintainers then need to spend more time traversing the
source code to understand the code occurrences. Thus, a flexible code query
method is proposed to enhance the understanding of code occurrences. This method
applies parsing, pattern matching and regular expression techniques to extract
software artifacts from source code. The extracted artifacts are analyzed and
presented at multiple levels of abstraction. The results present multiple
artifacts, the relationships among the artifacts and their locations in the
source code. In this research, a prototype was developed to provide a source
code query method that supports structured program understanding in C programs.
The method was then tested in a controlled experiment using a case study to
demonstrate the effectiveness of querying artifact occurrences in source code to
support program understanding.
ABSTRAK
Most software undergoes change and must be maintained because of changes in
the environment, technology and domain knowledge. A software maintainer is
responsible for analyzing and understanding code occurrences before making
modifications. A modification to one piece of code can affect other parts of the
same source code. Program understanding is therefore an important factor in
understanding code occurrences and their effect on software maintenance. Program
understanding involves exploring and examining source code or software
documents. However, not all software has up-to-date documents. In this
situation the documents are poorly maintained and only the program code remains
a reliable source for understanding the software. Maintainers need a long time
to understand the code occurrences. A flexible code query method is therefore
proposed to improve the understanding of code occurrences. The method uses
parsing, pattern matching and regular expression techniques to obtain program
artifacts from source code. The artifacts obtained are analyzed and abstracted
at multiple levels. The abstraction results display the various artifacts, the
relationship links among them and their locations in the source code. In this
study, a prototype was developed to provide a source code query method that
supports structured program understanding in the C programming language. The
method was then tested in a controlled experiment using a case study to
demonstrate the effectiveness of querying artifact occurrences in source code to
support program understanding.
TABLE OF CONTENTS
CHAPTER TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND SYMBOLS
LIST OF APPENDICES
1 INTRODUCTION
1.1 Introduction
1.2 Background of the Research Problem
1.3 Statement of the Problem
1.4 Objective of the Study
1.5 Scope of Work
1.6 Importance of the Study
1.7 Thesis Outline
1.8 Summary
2 LITERATURE REVIEW
2.1 Introduction
2.2 Introduction of Software Maintenance
2.2.1 Software Maintenance Categories
2.2.2 Problem in Software Maintenance
2.3 Program Understanding
2.3.1 Program Understanding Support Mechanism
2.3.1.1 Unaided Browsing
2.3.1.2 Leveraging Corporate Knowledge and Experience
2.3.1.3 Computer Aided Technique
2.3.2 Program Understanding via Reverse Engineering
2.4 Reverse Engineering
2.4.1 Reverse Engineering Concept and Definition
2.4.2 Challenges in Reverse Engineering
2.4.3 Program Understanding in Reverse Engineering Automating Approaches
2.5 Parsing Technique
2.5.1 Two Way of Parsing
2.5.2 Parsing Methods
2.5.2.1 Directionality
2.5.2.2 Search Techniques
2.5.2.3 Left Corner Parsing
2.5.3 Time Requirement
2.6 Extraction Process
2.6.1 Pattern Matching
2.6.2 Regular Expression
2.6.2.1 Basic Concepts
2.6.2.2 Portable Operating System Interface (POSIX) Syntax
2.6.3 Pattern Matching and Regular Expression in Artifact Extraction
2.7 Abstraction Process
2.7.1 Graphical Representation
2.7.2 Textual Representation
2.8 Concept Location
2.8.1 Concept Location in Source Code
2.8.2 Static Concept Location Techniques
2.8.2.1 String Pattern Matching Technique
2.8.2.2 Dependency Search Technique
2.8.2.3 IR-based Technique
2.9 Code Query in Reverse Engineering Tools
2.9.1 Windows Grep
2.9.2 Rigi
2.9.2.1 Rigi Features
2.9.2.2 Rigi Query Technique
2.9.3 CodeSurfer
2.9.3.1 CodeSurfer Features
2.9.3.2 CodeSurfer Query Technique
2.9.4 The Comparative Evaluation of Existing Tools
2.10 Proposed Solution
2.11 Summary
3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Operational Framework
3.2.1 Phase 1: Formulation of Research Problem
3.2.1.1 Literature Reviews
3.2.1.1.1 Understanding the Need of Change Request Process
3.2.1.1.2 Understanding Structured Programming Concept
3.2.1.1.3 Understanding the Extraction Process
3.2.1.1.4 Understanding the Abstraction Technique
3.2.1.2 Analysis Current Approach and Existing Tools
3.2.1.3 Research Proposal
3.2.2 Phase 2: Prototype Development
3.2.2.1 Code Query Model Design
3.2.2.2 Code Query Prototype Development
3.2.3 Phase 3: Implementation and Evaluation
3.2.3.1 Supporting Tools
3.2.3.2 Choose Case Study
3.2.3.3 Experimental
3.2.3.4 Evaluation
3.2.4 Phase 4: Research Report
3.3 Research Assumption
3.4 Summary
4 CODE QUERY MODEL
4.1 Introduction
4.2 Overview of Code Query
4.3 Code Query in Structured Programming
4.3.1 Structured Programming Concept
4.3.1.1 Relationship in Structured Programming
4.3.1.2 Dependencies in Structured Programming
4.3.1.2 Observations about Structured Programming
4.4 A Proposed Code Query
4.4.1 Keyword
4.4.2 Extraction of Artifacts
4.4.2.1 Parser
4.4.2.2 Pattern Matching
4.4.2.3 Regular Expression
4.4.3 Abstraction of Artifacts
4.4.3.1 Code Query in Textual Representation
4.4.3.1 Code Query in Graphical Representation
4.5 Summary
5 DESIGN AND IMPLEMENTATION OF CODE QUERY
5.1 Introduction
5.2 Code Query Design
5.2.1 Code Query Architecture
5.2.1.1 Problem Change Request
5.2.1.2 Artifacts Repository
5.2.1.3 Extraction Process
5.2.1.4 Abstraction Process
5.2.2 Code Query Use Case
5.2.3 Code Query Class Interactions
5.3 Code Query Implementation and User Interfaces
5.4 Other Supporting Tools
5.5 Summary
6 EVALUATION
6.1 Introduction
6.2 Case Study
6.2.1 Outlines of Case Study
6.2.2 GI Project Briefing
6.3 Controlled Experimental
6.3.1 Subject and Environment
6.3.2 Questionnaires
6.3.3 Experimental Procedures
6.3.4 Possible Threats and Validity
6.4 The Analysis
6.4.1 Analysis of the Controlled Experiment
6.4.2 Analysis of the Usability Study
6.5 Finding Analysis
6.5.1 Acceptance Tool
6.5.2 Qualitative Evaluation
6.6 Summary
7 CONCLUSION AND FUTURE WORK
7.1 Introduction
7.2 Contribution
7.3 Research Limitation and Future Works
7.4 Summary
REFERENCES
Appendices A-B
LIST OF TABLES
TABLE NO. TITLE
2.1 Quantifiers of Regular Expression
2.2 Metacharacters for BRE Standard
2.3 Features of The Static Concept Location Techniques
2.4 Existing Features of current tools
4.1 Relationship Types
4.2 Regular Expression for Match Common Programming Language
6.1 Job versus frequencies
6.2 Year of Experience in Software Development
6.3 Year of Experience in Software Maintenance
6.4 Mean of scores for Code Query
6.5 Mean of Usefulness of Tools
6.6 Mean Comparison between Tools
6.7 Existing Features of Code Query Systems
LIST OF FIGURES
FIGURE NO. TITLE
2.1 Software Maintenance Process
2.2 Reverse Engineering
2.3 Forward Engineering
2.4 Overview of Parser Process
2.5 Most concept location techniques rely on an intermediate representation of the source code
2.6 Window Grep Search Results
2.7 View produced by Rigi via RigiEdit
2.8 CodeSurfer Project Viewer
2.9 Finder Viewer
3.1 Operational Framework
4.1 Overview of Code Query Model
4.2 Structured Programming – Tax Calculation
4.3 Function Relationship
4.4 Code Query Approach
4.5 Metrics on the important files
4.6 Graphical presentation for function and variables of vehicle simulation
5.1 Code Query Architecture
5.2 Use Case Diagram of Code Query System
5.3 Code Query Class Diagram
5.4 Code Query Sequence Diagrams
5.5 Code Query Process Algorithm Flowchart
5.6 Code Query Introduction Screen
5.7 First user interface of Code Query
5.8 File Path and the Keyword Field
5.9 Textual Representation
5.10 Graphical Representation
5.11 Low Level of Abstraction - Source Code Viewer
5.12 High Level of Abstraction - Detail Relationship of Artifacts
6.1 Usefulness and Usability of Tools
6.2 Usefulness of Tool
LIST OF ACRONYMS AND SYMBOLS
BRE - Basic Regular Expressions
GUI - Graphical User Interface
IEEE - The Institute of Electrical and Electronics Engineers, Inc
LSI - Latent Semantic Indexing
PBS - Portable Bookshelf
PCR - Program Change Request
RUP - Rational Unified Process
SDLC - Systems Development Life Cycle
SHriMP - Simple Hierarchical Multi-Perspective
SLC - Software Life Cycle
SLCM - Software Life Cycle Model
SLCP - Software Life Cycle Process
UML - Unified Modeling Language
WWW - World Wide Web
LIST OF APPENDICES
APPENDIX TITLE
A Questionnaire On Usability Of Software Understanding Tool
B User Manual
CHAPTER 1
INTRODUCTION
1.1 Introduction
This chapter provides an introduction to the research work presented in this
thesis. It describes the research overview that motivates the introduction of a source
code query to support structured program understanding. This is followed by a
discussion on the research background, problem statements, objectives and
importance of the study. Finally, it briefly explains the scope of work and the
structure of the thesis.
1.2 Background of the Research Problem
Software maintenance begins once there is a requirement to fix, change or
adapt a software system. Whatever the reason, the maintainer must fully
understand the system before carrying out the maintenance. Understanding what a
program does, how it works technically and why it was designed the way it was
is critical to software maintenance.
Program understanding is therefore needed in the software maintenance
phase. It involves cognition of software, that is, the mental process of
knowing, learning and understanding the software system. Source code or
documentation is used as the source material for program understanding.
Unfortunately, documentation of the system structure is often missing or
outdated, even though accurate documentation is indispensable given the
complexity of today’s software systems (Eichberg, 2005). Therefore, the best
reference for the system is its source code, which is the representation of an
executable equivalent of the software system.
In a large, long-term software project, a software maintainer must often get to
know an unfamiliar portion of the source code in order to fix a bug or to add a
feature to meet a maintenance requirement. The code may be unfamiliar, for
instance, because a different programmer was previously responsible for that portion
of the code or because the software is in a maintenance phase where responsibility
for the code is no longer strictly apportioned among the team’s programmers. A
programmer facing such a task often relies on little more than the executable code
itself.
Exploring source code is one of the most common activities performed by
software maintainers to understand a program during the maintenance phase. In
this activity, they use a tool that can search for the location of the objects
to be changed and their relationships with other objects in the source code.
Hence, a number of studies have been conducted to assist the cognition of a
software system based on its source code.
One way to support program understanding is reverse engineering, which is
the process of analyzing a subject system to identify the system’s artifacts
and their relationships and to create representations of the system in another
form or at a higher level of abstraction (Koschke, 2001).
Based on the above factors, the code query in this research uses a reverse
engineering technique: it parses the source code to extract the artifacts and
then presents the extracted artifacts at multiple levels of abstraction in
textual and graphical representations. The abstraction is based on a keyword
from the Program Change Request (PCR), which helps the maintainer to focus on
and understand the objects in the source code that are related to the change
request.
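The keyword-driven query described above can be illustrated with a minimal
sketch. It is written in Python for brevity and is not the prototype developed
in this research; the function name query_keyword, the sample C fragment and
the keyword are all assumptions made for illustration. The sketch reports the
line numbers at which a keyword taken from a change request occurs as a whole
identifier in C source text.

```python
import re


def query_keyword(source: str, keyword: str) -> list[tuple[int, str]]:
    """Return (line_number, line_text) for lines using `keyword` as a whole identifier."""
    # \b keeps "tax" from matching inside "tax_rate" or "compute_tax".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical C fragment used only to exercise the query.
C_SOURCE = """\
float tax_rate = 0.1f;

float compute_tax(float income) {
    return income * tax_rate;
}

int main(void) {
    float tax = compute_tax(5000.0f);
    return 0;
}
"""

for number, line in query_keyword(C_SOURCE, "compute_tax"):
    print(number, line)
```

A plain substring search would also report tax_rate and compute_tax when the
keyword is tax; matching on identifier boundaries narrows the result to the
artifact named in the change request.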
1.3 Statement of the Problem
This research is intended to deal with the problems related to program
understanding. The main question is: “How can a more effective method be
produced for parsing source code and extracting software artifacts, so as to
enhance the understanding of existing software for software maintenance?”
The sub-questions of the main research question are as follows:
i. Why do the current maintenance models, approaches and tools not provide
enough artifacts to support program understanding?
ii. What artifacts need to be extracted from the source code in order to
support program understanding?
iii. How can the source code artifacts be extracted?
iv. Which technique is suitable for presenting the extracted artifacts in
order to help the maintainer understand the program?
1.4 Objectives of the Study
The above problem statement serves as a premise to establish a set of specific
objectives that constitute the major milestones of this research. The
objectives of this research are as follows:
i. To build a model that could extract reliable source code artifacts.
ii. To develop a prototype tool to support the proposed model and approach.
iii. To demonstrate and evaluate the practicability of the proposed model and
approach to support program understanding.
1.5 Scope of Work
The scope of work is as follows:
1. The research focuses on artifact extraction that could support program
understanding.
2. The research focuses on structured program understanding.
3. The analysis is based on C programs.
4. The research uses only small-scale software systems.
1.6 Importance of the Study
The research is based on a set of problems in program understanding during
the software maintenance process. Program understanding is the most expensive
task of the software maintenance process because it includes reading documents,
scanning source code and understanding the change to be made (Sulaiman, 2004).
To overcome the problems of program understanding in cases where the documents
are outdated or absent, a mechanism is needed to browse and explore the source
code in order to acquire knowledge of the software. The mechanism proposed here
is a code query approach that extracts artifacts from source code and automates
the abstraction of the software. The abstraction is used to enhance program
understanding in maintenance activities. It is expected that this thesis will
benefit other researchers in the software maintenance field as well as software
maintainers who use the prototype tool developed.
1.7 Thesis Outline
This thesis discusses specific issues associated with source code query and
how this research was carried out. The rest of the thesis is organized as
follows.
Chapter 2: Discusses the literature on software maintenance, program
understanding, reverse engineering, parsing techniques, pattern matching,
regular expressions and concept location. A few areas of interest are
identified, from which the related issues, works and approaches are
highlighted. The chapter also discusses techniques and approaches to program
understanding and some tools that exist in the industry. This leads to
improvement opportunities that form the basis for developing a new source code
query to support structured program understanding.
Chapter 3: Provides a research methodology that describes the research design
and formulation of research problems and validation considerations. This chapter
leads to an overview of data gathering and analysis. It is followed by some
research assumptions.
Chapter 4: Describes the newly proposed code query model and approach to
support program understanding. The model is established over the artifacts of
the C structured programming language, their relationships, dependencies and
occurrences in software work products. The chapter begins with an overview of
the code query model, followed by the proposed model and approach.
Chapter 5: Presents the design and functionality of the tools developed to
support the source code query for structured program understanding. This
includes the implementation of the design and the component tools.
Chapter 6: The software source code query to support structured program
understanding is evaluated for its effectiveness, usability and accuracy. The
evaluation criteria and methods are described and implemented in the model that
includes modelling validation, a case study and experiment. This research
performs evaluation based on quantitative and qualitative results. Quantitative
results are checked against a benchmark set forth and qualitative results are
collected based on user perception and comparative study made on the existing
models and approaches.
Chapter 7: The statements on the research achievements, contributions and
conclusion of the thesis are presented in this chapter. This is followed by the
research limitations and suggestions for future work.
1.8 Summary
This research focuses on structured program understanding supported by
pattern matching, regular expressions and concept location. Structured
programs are the main material on which the source code query is implemented. A
model is built for a source code query that supports structured program
understanding by searching for a keyword in the source code. The prototype is
developed as a validation tool for the proposed model.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Software maintenance is the most costly activity in software engineering
(Erlikh, 2000) and is recognized as an important part of the software
development life cycle. Software maintenance activities currently account for
more than half of the typical software budget. In addition, more than 50 percent
of global software developers are engaged in modifying existing applications. A
growing number of studies by software maintenance researchers and practitioners
examine the key issues facing software maintenance activities. They address
program understanding (or program comprehension), reverse engineering,
reengineering, program transformation, impact analysis, regression testing,
software reuse, software configuration management (SCM), WWW (World Wide Web)
based maintenance, maintenance process models and maintenance standards
(Sulaiman, 2004). This chapter reviews the literature on software maintenance,
program understanding, reverse engineering, pattern matching and regular
expressions. Part of the discussion in each section explains the link between
the topic discussed and the research in this thesis.
2.2 Introduction of Software Maintenance
Software maintenance is the modification of a software product after delivery
to correct faults, to improve performance or update other attributes or to adapt the
product to a modified environment (ISO/IEC 14764, 2006). The ISO/IEC 14764
standard describes software maintenance process as in Figure 2.1.
Figure 2.1: Software Maintenance Process (diagram labels: Exceptional,
Preparation, Analysis, Modification, Acceptance, Migration, Retirement)
1. Software preparation and transition activities
The maintenance project plan is prepared, together with the preparation for
handling problems identified during development and the follow-up on product
configuration management. This includes change requests raised by individuals
in an organization.
2. Problem and modification analysis
In this process, the maintenance programmer is responsible for analyzing each
request, confirming it (by simulating the problem situation) and checking its
validity, investigating it and proposing a solution, documenting the request
and, finally, obtaining from the managing group all the authorizations required
to apply the modification.
3. Modification implementation
Once the managing group approves, the maintenance programmer can start the
modification.
4. Acceptance of modification
Every modification must be checked by the individual who submitted the request
in order to confirm that the solution provided is satisfactory.
5. Migration
The migration process is exceptional and is not a part of daily maintenance
tasks. It is used when the software needs to be ported to another platform
without any change in functionality, and a maintenance project team is usually
assigned to this task.
6. Retirement
This process is used when the software can neither be modified nor ported to
another platform. The software is taken out of service and replaced with new
software.
The software maintenance phase is critical and very costly. Lee (1998)
highlighted that software maintenance is both costly and difficult. In the SDLC,
time and cost are the main measures of software maintenance implementation. Over
the life of a software system, the software maintenance effort has been
estimated to consume more than 50% of its total life cycle cost. This
maintenance cost also shows no sign of declining (Lee, 1998).
Given the time and cost issues in software maintenance implementation, the
problem and modification analysis process should be done properly. The
maintenance programmers need to understand the existing software before the
changes are implemented. It is necessary to know where the concept is located in
the source code and how the change relates to other concept locations in the
code. Understanding the code helps the maintenance programmers to implement the
changes more easily and quickly. It also helps the management team to estimate
the cost and time before proposing the solution.
2.2.1 Software Maintenance Categories
There are four categories of software maintenance, whose descriptions are
summarized from Sommerville (1997), Hoffer et al. (1999) and Kajko-Mattsson
(2000). The literature prior to 1998 did not describe preventive maintenance
because the IEEE (The Institute of Electrical and Electronics Engineers, Inc.)
only recognized the category in 1998.
The categories are listed below:
i. Corrective Maintenance – repair design and programming errors.
ii. Adaptive Maintenance – modify the system to suit environmental changes
without radical changes to software functionality.
iii. Perfective Maintenance – add desired (not necessarily required) new
features to improve performance, maintainability or other attributes of a
computer program.
iv. Preventive Maintenance – performed for the purpose of preventing problems
before they occur.
Awareness of these categories is important in this research because
different categories of maintenance might need different approaches and levels
of information abstraction in order to solve a software maintenance task.
Understanding the categories of maintenance is necessary to identify what a
maintenance environment consists of and which techniques and methods are
involved.
2.2.2 Problem in Software Maintenance
Maintenance activities become difficult when a module interacts with other
modules in the software; the effect becomes obvious when modification and
revalidation take place. Analysis, testing and bug fixing are therefore needed
in order to observe the interaction between modules in the software. The
problem becomes critical when a new programmer, who was not part of the
development team, has to take over the maintenance of legacy software whose
existing documents are outdated.
Several problems may occur in the software maintenance phase:
1. Software maintenance is not related to the design and implementation phases.
2. Maintenance is often ignored in the study of software engineering and is
treated as unimportant.
3. Maintenance activities are not understood by the maintenance programmer.
4. The maintenance programmer has no knowledge of the existing program.
Thus, it is necessary to identify the techniques and methods that can
address these software maintenance problems. Basically, program understanding
and reverse engineering techniques are required. As mentioned earlier, the
scope of the research undertaken in this thesis covers the issues of program
understanding and reverse engineering techniques. For program understanding,
this research focuses on enhancing the abstraction method. The reverse
engineering technique required in this research involves a parsing-based method
to extract the required artifacts.
2.3 Program Understanding
Program understanding, also called program comprehension or software
comprehension, is a process that uses existing knowledge to acquire new
knowledge that ultimately meets the goals of a code cognition task. This
process references both existing and newly acquired knowledge to build a mental
model of the software under consideration. Understanding is entirely dependent
on strategies. Though these cognition strategies vary, they all formulate
hypotheses and then resolve, revise or abandon them (Deursen, 2001). Program
understanding is a Software Engineering discipline which aims to understand
computer code written in a high-level programming language. It is useful for
reuse, inspection, maintenance, reverse engineering and many other activities
in the context of Software Engineering (Beron et al., 2006).
Program understanding is the most expensive task of the software
maintenance process since it includes reading of the documentation, scanning the
source and understanding the changes that should be made (Kwon et al., 1998). Most
studies in program comprehension deal with the study on how programmers
comprehend a program or a source code during software maintenance and evolution.
As indicated by Kwon et al. (1998), bottom-up and top-down are the two main
approaches to program comprehension. These approaches are reflected in most
cognitive models.
In order to maintain a software system, software maintainers need to
understand how the software works. The sources for understanding the software
system are its system documentation, which provides details of the software
including its architecture and detailed design, and the source code, which is
the representation of an executable equivalent of the software system.
Nevertheless, documentation is often absent, and even where it exists it is
mostly outdated or incomplete. Hence, software maintainers need to read the
source code thoroughly prior to making changes in the software. As stated by
Kwon et al. (1998), most problems with current maintenance practice stem from
the fact that all maintenance is conducted at the code level. Likewise,
according to Pigoski (1997), programmers spend 40 percent to 60 percent of
their time reading the code and attempting to understand its logic.
The main goal of program understanding is to acquire sufficient knowledge
about a software system so that it can evolve in a disciplined manner. The
essence of program understanding is identifying artifacts and understanding
their relationships; this process is fundamentally one of pattern matching at
various abstraction levels. It involves the identification, manipulation and
exploration of artifacts in a particular representation of a subject system via
mental pattern recognition by the software engineer. These artifacts are then
aggregated to form a more abstract system representation.
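Identifying artifacts and their relationships can be sketched mechanically.
The following Python fragment is an illustrative assumption rather than the
method of any cited work: it uses a deliberately naive regular expression to
find C function definitions (the artifacts) and records which defined functions
each one calls (the relationships). The names extract_call_graph, FUNC_DEF,
CALL and the sample C fragment are hypothetical. Braces inside strings or
comments, macros and function pointers are not handled.

```python
import re

# Naive pattern for a C function definition: one or more type words,
# the function name, a parameter list and an opening brace.
FUNC_DEF = re.compile(
    r"^\s*(?:[A-Za-z_]\w*[\s*]+)+([A-Za-z_]\w*)\s*\([^;{}]*\)\s*\{",
    re.MULTILINE,
)
# Any identifier followed by "(" is treated as a call.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# Control keywords that can look like a definition, e.g. "else if (x) {".
KEYWORDS = {"if", "for", "while", "switch"}


def extract_call_graph(source: str) -> dict[str, list[str]]:
    """Map each function defined in `source` to the defined functions it calls."""
    definitions = [
        (match.group(1), match.end())
        for match in FUNC_DEF.finditer(source)
        if match.group(1) not in KEYWORDS
    ]
    defined = {name for name, _ in definitions}
    graph = {}
    for name, body_start in definitions:
        # Walk forward from the opening brace until it is balanced again.
        depth, pos = 1, body_start
        while pos < len(source) and depth > 0:
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
            pos += 1
        body = source[body_start:pos]
        graph[name] = sorted({c for c in CALL.findall(body) if c in defined})
    return graph


# Hypothetical C fragment; print_report is called but not defined here.
C_SOURCE = """\
float tax_rate(float income) {
    return income > 10000.0f ? 0.2f : 0.1f;
}

float compute_tax(float income) {
    return income * tax_rate(income);
}

int main(void) {
    float tax = compute_tax(5000.0f);
    print_report(tax);
    return 0;
}
"""

for caller, callees in extract_call_graph(C_SOURCE).items():
    print(caller, "->", callees)
```

Calls to functions that are not defined in the analyzed text, such as
print_report here, are left out of the map, so the result describes only
relationships among the extracted artifacts.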
Researchers have given many definitions of program understanding, some of
which are difficult to follow. Below are definitions given by previous
research:
1. Program understanding is the process that uses current knowledge to
generate new knowledge in order to acquire goals adjacent to original
source code role.
2. Program understanding is a task to acquire a system design, either
partial or full, from its source code.
3. Program understanding is a process performed to acquire information about
a computer program.
From these definitions, two basic characteristics of program understanding
can be identified:
1. The main input artifact is the source code of the software.
2. The output is an improved understanding of the program.
2.3.1 Program Understanding Support Mechanism
Program understanding needs a mechanism to support the cognitive process. The
support mechanisms that could be used for program understanding are:
1. Unaided browsing
2. Leveraging corporate knowledge and experience
3. Computer-aided technique
2.3.1.1 Unaided Browsing
This mechanism relies heavily on human effort. A person flips manually
through the source code in printed form or browses it online, perhaps using the
file system to help navigate the source code files.
A good software engineer may be able to keep track of approximately 50K
lines of code in their head. Beyond that amount, it becomes difficult to keep
track of the information in the source code.
2.3.1.2 Leveraging Corporate Knowledge and Experience
This mechanism draws on human knowledge and experience of the subject system. It
works through mentoring or through informal interviews with personnel who are
knowledgeable about the system. It is very valuable when there are people
available who have been associated with the system as it has evolved over time.
Such individuals carry important information in their heads about why the system
was designed the way it was, the major changes that have occurred over its life
cycle and where the subsystems have proven particularly troublesome. However, this
mechanism is not always available, because the system designers may have left the
company or external providers from other companies may have stopped their
services.
2.3.1.3 Computer Aided Technique
This mechanism uses software maintenance techniques such as reverse engineering to
support program understanding. A reverse engineering environment can manage the
complexities of program understanding by helping the software engineer extract
high-level information from low-level artifacts such as the source code. This
frees the software engineer from tedious, manual and error-prone tasks such as
code reading, searching and pattern matching by inspection.
2.3.2 Program Understanding via Reverse Engineering
One of the most popular approaches to the problem of software evolution is
program understanding technology. It has been estimated that fifty to ninety percent
of evolution work is devoted to program understanding. Developers and maintainers
require programming knowledge, domain knowledge and comprehension strategies
to understand the source code of a program. For instance, one might extract syntactic
knowledge from the source code and rely on programming knowledge to form
semantic abstractions.
The theory of domain bridging describes the programming process as one of
constructing mappings from a problem domain to an implementation domain, possibly
through multiple levels. Program understanding then involves reconstructing part or
all of these mappings. Furthermore, the programming process is a cognitive one
involving the assembly of programming plans and implementation techniques that
realize goals in another domain. Hence, program understanding also
tries to pattern match between a set of mental models and the source code of the
subject software.
For large legacy systems, matching the mental models against the code manually is
complicated. One way of augmenting the program understanding process is through
reverse engineering techniques. Although there are many forms of reverse
engineering, the common goal is to extract information from existing software
systems. This knowledge can then be used to improve subsequent development and
ease maintenance.
In this research, source code will be used as the reliable source for programmers
to understand the software. Hence, through tool automation via reverse engineering
techniques, the source code will be parsed and the extracted artifacts will be
represented at multiple levels of abstraction to support program understanding.
2.4 Reverse Engineering
Reverse engineering (RE) can be defined as the process of analyzing a subject
system to identify the system’s artifacts and their interrelationships and to
create representations of the system in another form or at a higher level of
abstraction. The primary objective of reverse engineering a software system is to
increase the overall understanding of the system for both maintenance and new
development. The six key objectives, as stated by Chikofsky and Cross II (Nelson,
2005), are:
i. Cope with complexity – The reverse engineering process analyses a subject
system to identify its artifacts and interrelations automatically.
ii. Generate alternative views – The extracted artifacts can be presented in
textual or graphical form.
iii. Recover lost information – When the documents are out of date, the latest
information can be extracted directly from the source code via the reverse
engineering process.
iv. Detect side effects – A change made in a program may cause side effects;
reverse engineering can help detect anomalies and problems before users report
them as bugs.
v. Synthesize higher abstractions – Extracted components can be presented at a
higher level of abstraction.
vi. Facilitate reuse – Reverse engineering can help identify candidates for
reusable software artifacts in existing systems.
These six key objectives are in line with the objectives of this research. The
proposed Code Query method presents the software artifacts in textual and
graphical form in order to find code locations and their relationships.
2.4.1 Reverse Engineering Concept and Definition
There are many definitions of RE. One, adapted from the canonical taxonomy, is
that RE is the process of identifying software components and their
interrelationships, and representing these entities at a higher level of
abstraction. RE by itself involves only analysis, not changes to the current
software. ‘Program comprehension’ and ‘program understanding’ are terms often used
interchangeably with RE. Meanwhile, forward engineering (FE) is the opposite of RE;
the term is used to distinguish the traditional software engineering process from
RE (Nelson, 2005).
Figure 2.2: Reverse Engineering
Figure 2.3: Forward Engineering
RE covers four areas, listed here in increasing level of impact:
1. Redocumentation
This is the weakest form of RE. Redocumentation merely involves the
creation or revision of system documentation at the same level of abstraction.
2. Design Rediscovery
This is similar to redocumentation, but it uses domain knowledge and other
external information, where possible, to create a model of the system at a
higher level of abstraction.
3. Restructuring
This is the lateral transformation of the system within the same level of
abstraction. It maintains the same level of functionality and semantics.
4. Reengineering
Generally, reengineering involves a combination of RE for comprehension,
and a reapplication of FE to reexamine which functionalities need to be
retained, deleted or added.
The main concept of RE is to provide an analytical process that identifies the
components of the software and their relationships and represents them as simple
text or diagrams (Tammy, 2005).
Therefore, RE is well suited to enhancing the understanding of software artifacts.
The need for software understanding arises in two domains:
1. Application problem domain
This arises when the software changes in terms of language implementation,
terminology or business logic.
2. Environmental domain
This arises when the physical environment changes, such as the operating
system, hardware or configuration.
2.4.2 Challenges in Reverse Engineering
Reverse engineering is a challenging task because it involves mapping
between different worlds in at least five distinct areas (Nelson, 2005):
1. Application domain mapping to Programming Language
A programming language is a modelling environment for solving some real problem.
While tools exist to support the understanding of code behavior from a code
perspective, there is little to aid the reverse engineer in determining what is
occurring with the code from a domain perspective.
2. Machines and Programs mapping to Abstract, High-Level Design
Computer science education is largely about mapping from the abstract to the
detailed implementation, but there is little to aid the mapping in the reverse
direction.
3. Original Coherent, Structured System mapping to Actual System, With
Structure Decaying
Even when good documentation is available for the software, maintenance
gradually causes the structure to drift from the original specification (Nelson,
2005). The reverse engineer must be able to reconcile the documented design with
the currently implemented design.
4. Hierarchical Programs mapping to Cognitive Association
Computer programs are formal, hierarchical expressions, whereas humans think in
associated pieces of data. The reverse engineer must be able to “build
up correct high level pieces of data from the low level details evident in the
program” (Nelson, 2005).
5. Bottom-Up Code Analysis mapping to Top-Down Application Analysis
Code analysis is by its nature a bottom-up exercise. It requires,
simultaneously, a higher level meaning to be extracted from code fragments, and
higher level concepts to be mapped to lower level implementations. To make this
task even more difficult, the engineer must be able to handle confusion such as
interleaving (Nelson, 2005).
At present, reverse engineering is heavily dependent on human interaction and
steering. While there are several existing tools to assist the reverse engineer in
program understanding, they are not fully automated.
2.4.3 Program Understanding in Reverse Engineering Approaches
A variety of approaches offer automated assistance to the reverse engineer in
program understanding. Some of the more prominent approaches include:
1. Textual, lexical and syntactic analyses.
These approaches focus on the source code and its representation. There are
many lexical analysis tools in the software engineering field; one of them is
Columbus, an automated parser that extracts information from the code to give
hints about design and abstraction. The unit of examination is the program
source itself.
2. Graphing methods.
There are many styles in graphing methods to support program
understanding. These include the control flow of the program, data flow of
the program, and program dependence graphs. The unit of examination is a
graphical representation of the source code (Nelson, 2005).
3. Execution and testing.
Dynamic testing and debugging are well known, and there are several tools
available for this function. The unit of examination is a full, partial or
simulated execution of the program.
There is also a variety of techniques for automatically extracting artifacts from
source code using reverse engineering approaches. In this research, parsing,
extraction and abstraction techniques are applied to generate textual and
graphical representations to support program understanding.
2.5 Parsing Technique
‘Parsing’ is the term used to describe the process of automatically building
syntactic analyses of a sentence in terms of a given grammar and lexicon. To parse
a string according to a grammar means to reconstruct the production tree (or
trees) that indicates how the given string can be produced from the given grammar.
There are two important points here: one is that we do require the entire
production tree, and the other is that there may be more than one such tree (Grune
and Jacobs, 2008).
The requirement to recover the production tree is not natural. After all, a
grammar is a condensed description of a set of strings, i.e., a language, and our input
string either belongs or does not belong to that language; no internal structure or
production path is involved. If we adhere to this formal view, the only
meaningful question we can ask is whether a given string can be recognized
according to a grammar; any question as to how would be a sign of senseless, even
morbid curiosity. In practice, however, grammars have semantics attached to them;
specific semantics is attached to specific rules, and in order to find out which rules
were involved in the production of a string and how, we need the production tree.
Recognition is not enough; we need parsing to get the full benefit of the
syntactic approach, with some programming effort. Figure 2.4 shows an overview of
the parser process.
Figure 2.4 : Overview of Parser Process
In short, it is very expensive (and often impractical) to construct the
knowledge base(s) necessary for parsing approaches to extract even reasonable
semantic information from source code and associated documentation (Maletic and
Marcus, 2001).
2.5.1 Two Ways of Parsing
The basic connection between a sentence and the grammar it derives from is
the parse tree, which describes how the grammar was used to produce the sentence.
For the reconstruction of this connection we need a parsing technique. When we
consult the extensive literature on parsing techniques, we seem to find dozens of
them, yet there are only two techniques to do parsing; all the rest is technical detail
and embellishment.
The first method tries to imitate the original production process by rederiving the sentence from the start symbol. This method is called top-down, because
the production tree is reconstructed from the top downwards. The second method
tries to roll back the production process and to reduce the sentence back to the start
symbol. Quite naturally this technique is called bottom-up.
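The top-down idea can be illustrated with a minimal recursive-descent sketch. The toy grammar (Expr -> Term ('+' Term)*, Term -> NUMBER | '(' Expr ')'), the tuple encoding of the production tree and all function names are assumptions made for this illustration only; they are not taken from Grune and Jacobs or from the tool developed in this research.

```python
import re


def tokenize(text):
    """Split the input into numbers, '+', '(' and ')'.

    Any other character is silently skipped in this sketch.
    """
    return re.findall(r"\d+|[+()]", text)


def parse_expr(tokens, pos=0):
    """Expr -> Term ('+' Term)* ; returns (tree, next position)."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        tree = ("Expr", tree, "+", right)
    return tree, pos


def parse_term(tokens, pos):
    """Term -> NUMBER | '(' Expr ')' ; returns (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok.isdigit():
        return ("Term", tok), pos + 1
    if tok == "(":
        tree, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return ("Term", "(", tree, ")"), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    """Start from the start symbol (Expr) and work downwards to the tokens."""
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


print(parse("(1+2)+3"))
```

Each function corresponds to one grammar rule, so the call structure mirrors the production tree being rebuilt from the top downwards. A bottom-up parser would instead start from the tokens and reduce them step by step towards Expr.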
2.5.2 Parsing Methods
The literature offers a large number of parsing techniques, often with unclear
interrelationships. This research focuses on three aspects of parsing methods,
which are discussed in the subsections below.
2.5.2.1 Directionality
Parsing methods can be non-directional or directional. A non-directional method
constructs the parse tree while accessing the input in any order it sees fit; this
of course requires the entire input to be in memory before parsing can start. There
is a top-down and a bottom-up version (Grune and Jacobs, 2008).
The directional methods process the input symbol by symbol, from left to
right. This has the advantage that parsing can start, and indeed progress,
considerably before the last symbol of the input is seen. It is also possible to
parse from right to left, using a mirror image of the grammar; this is
occasionally useful. The directional methods are all based explicitly or implicitly
on the parsing automaton, where the top-down method performs predictions and
matches and the bottom-up method performs shifts and reductions.
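The shift and reduce actions of a directional bottom-up method can be sketched as follows. The toy grammar (S -> S + n | n), the greedy reduce-first strategy and the names RULES and shift_reduce are assumptions chosen for brevity; a practical bottom-up parser would use parse tables to decide between shifting and reducing.

```python
# Toy grammar, longest right-hand side first so that "S + n" is preferred
# over the shorter rule "n" when both could apply.
RULES = [
    ("S", ("S", "+", "n")),
    ("S", ("n",)),
]


def shift_reduce(tokens):
    """Read tokens left to right; return (accepted, trace of actions)."""
    stack, trace = [], []
    rest = list(tokens)
    while True:
        for lhs, rhs in RULES:
            # Reduce: the top of the stack equals a right-hand side.
            if tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not rest:
                break
            # Shift: move the next input symbol onto the stack.
            stack.append(rest.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["S"], trace


accepted, actions = shift_reduce(["n", "+", "n"])
print(accepted)
for action in actions:
    print(action)
```

The trace shows that work begins after the first symbol has been read, which is the advantage of directional processing noted above.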
2.5.2.2 Search Techniques
There are in general two methods for solving problems in which there are
several alternatives at well-determined points: depth-first search and
breadth-first search. In depth-first search, we concentrate on one half-solved
problem; if the problem bifurcates at a given point P, we store one alternative
for later processing and keep concentrating on the other alternative. If this
alternative turns out to be a failure, we roll back our actions to point P and
continue with the stored alternative. This is called backtracking. In breadth-first
search, we keep a set of half-solved problems. From this set we calculate
a new set of better half-solved problems by examining each old half-solved problem;
for each alternative, we create a copy in the new set. Eventually, the set
will come to contain all solutions.
Depth-first search has the advantage that it requires an amount of memory
that is proportional to the size of the problem, unlike breadth-first search, which may
require exponential memory. Breadth-first search has the advantage that it will find
the simplest solution first. Both methods require in principle exponential time; if
we want more efficiency (and exponential requirements are virtually unacceptable),
we need some means to restrict the search (Grune and Jacobs, 2008).
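The two search strategies can be compared on a small parsing-like problem: splitting a string into words from a lexicon. The lexicon WORDS and the function names are hypothetical and serve only to illustrate backtracking and breadth-first expansion.

```python
from collections import deque

WORDS = {"a", "ab", "b", "c", "bc"}  # hypothetical lexicon


def segment_dfs(text, words=WORDS):
    """Depth-first search: follow one alternative to the end, then backtrack."""
    solutions = []

    def explore(pos, chosen):
        if pos == len(text):
            solutions.append(list(chosen))
            return
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in words:
                chosen.append(piece)   # commit to this alternative
                explore(end, chosen)
                chosen.pop()           # roll back to the choice point

    explore(0, [])
    return solutions


def segment_bfs(text, words=WORDS):
    """Breadth-first search: keep a queue of all half-solved problems."""
    solutions = []
    frontier = deque([(0, [])])
    while frontier:
        pos, chosen = frontier.popleft()
        if pos == len(text):
            solutions.append(chosen)
            continue
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in words:
                # Each alternative becomes a new copy in the frontier.
                frontier.append((end, chosen + [piece]))
    return solutions


print(segment_dfs("abc"))
print(segment_bfs("abc"))
```

Both functions find the same set of solutions. The depth-first version holds only the current path in memory, while the breadth-first version holds every partial solution but reports the segmentations with the fewest pieces first.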
2.5.2.3 Left Corner Parsing
In left-corner parsing, the right-hand side of each production rule is divided
into two parts: the left part, called the left corner, is identified by bottom-up
methods, while the right part is parsed top-down. The division of the right-hand
side is done so that, once its left corner has been identified, parsing of the
right part can proceed by a top-down method.
Although left-corner parsing has advantages of its own, it tends to combine the
disadvantages or at least the problems of top-down and bottom-up parsing, and is
hardly used in practice.
2.5.3 Time Requirement
When parsing strings consisting of more than a few symbols, it is important
to have some idea of the time requirements of the parser, i.e., the dependency of the
time required to finish the parsing on the number of symbols in the input string.
Expected lengths of input range from some tens (sentences in natural languages) to
some tens of thousands (large computer programs); the length of some input strings
may even be virtually infinite (the sequence of buttons pushed on a coffee vending
machine over its life-time). The dependency of the time requirements on the input
length is also called time complexity.
In this research, the parsing technique is used to parse strings from the source
code. The artifacts are extracted during the parsing process by matching strings
using pattern matching and regular expression techniques.
2.6 Extraction Process
Extraction is one of the processes in reverse engineering. The information to be
extracted is obtained from the parsing process. In the extraction process, pattern
matching with regular expressions is used to match the information against the
objects in the components.
2.6.1 Pattern Matching
Pattern matching is the act of checking for the presence of the constituents of
a given pattern, where the pattern is rigidly specified (Pattern, 2003). Pattern
matching is used to test whether things have a desired structure, to find relevant
structure, to retrieve the aligning parts, and to substitute the matching part with
a given keyword or variable. A
pattern matching engine provides an optimal match between the given pattern and a
decomposition of the legacy system entities by satisfying the inter/intra-module
constraints defined by the pattern (Sartipi et al., 2000).
Pattern matching is needed to ease communication among developers and
speed up software development and maintenance. Additionally, pattern matching is
necessary to detect simple text fragments in all kinds of editors. Pattern matching is
the act of checking for the presence of the constituents of a given pattern (Phanindra
et al., 2007).
Patterns retrieved from pattern matching, as well as idioms represent the
lowest-level of patterns (Buschmann et al., 1996). Idioms are mostly language
specific; and they capture existing programming experience. These patterns and also
idioms can be detected by regular, context-free or context-sensitive languages. An
implementation pattern is called regular if the pattern defines regular languages. A
pattern is called context free if the pattern defines context-free languages.
There are two types of patterns in pattern matching: sequence patterns and tree
patterns. Sequence patterns, also known as text string patterns, are often
described using regular expressions and matched using the respective algorithms.
Sequences can also be seen as trees branching for each element into the respective
element and the rest of the sequence, or as trees that immediately branch into all
elements.
Tree patterns can be used in programming languages as a general tool to
process data based on its structure. Some functional programming languages such as
Haskell and ML, and the symbolic mathematics language Mathematica, have a special
syntax for expressing tree patterns and a language construct for conditional
execution and value retrieval based on it. For simplicity and efficiency reasons,
these tree patterns lack some features that are available in regular expressions.
Depending on the language, pattern matching can be used for function arguments, in
case expressions, whenever new variables are bound, or in very limited situations
such as only for sequences in assignment (in Python). Often it is possible to give
alternative patterns that are tried one by one, which yields a powerful conditional
programming construct. Pattern matching can also benefit from guards, which are
additional conditions attached to a pattern.
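The limited form of pattern matching that Python offers for sequences in assignment, together with alternative patterns and a guard written as ordinary conditions, can be sketched as below. The tuple encoding of the expression tree and the function name evaluate are assumptions made for this illustration.

```python
# Sequence pattern in an assignment: the left-hand side is matched against
# the structure of the right-hand side.
head, *tail = [1, 2, 3, 4]

# A nested pattern destructures a small tree encoded as (operator, left, right).
op, (inner_op, a, b), c = ("+", ("*", 2, 3), 4)


def evaluate(tree):
    """Evaluate an expression tree by trying alternative patterns in turn."""
    if isinstance(tree, (int, float)):      # leaf pattern
        return tree
    op, left, right = tree                  # binary-node pattern
    lhs, rhs = evaluate(left), evaluate(right)
    if op == "+":
        return lhs + rhs
    if op == "*":
        return lhs * rhs
    if op == "/" and rhs != 0:              # guard: only match a non-zero divisor
        return lhs / rhs
    raise ValueError(f"no pattern matches {tree!r}")


print(head, tail)
print(inner_op, a, b, c)
print(evaluate(("+", ("*", 2, 3), 4)))
```

Languages with dedicated tree-pattern syntax express the same alternatives declaratively; here they are emulated with unpacking and if statements.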
2.6.2 Regular Expression
Regular expressions, also referred to as regex or regexp, provide a concise
and flexible means for identifying strings of text, such as particular characters,
words, or patterns of characters. A regular expression is written in a formal language
that can be interpreted by a regular expression processor, a program that either serves
as a parser generator or examines text and identifies parts that match the provided
specification (Regular, 2007).
Regular expressions are used in many applications to specify patterns
because any regular expression can be compiled into a very efficient one-pass pattern
matcher called a finite automaton. Finding matches is useful, but even more useful is
parse extraction, which describes in detail how a pattern matches some input. Parse
extraction makes it easy to find the search pattern.
The following examples illustrate a few specifications that could be
expressed in a regular expression. Regular expressions can be much more complex
than these examples.
• The sequence of characters "car" in any context, such as "car", "cartoon",
or "bicarbonate".
• The word "car" when it appears as an isolated word.
• The word "car" when preceded by the word "blue" or "red".
• A dollar sign immediately followed by one or more digits, and then
optionally a period and exactly two more digits.
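The four specifications above could be written as follows using Python's re module. The pattern names are hypothetical, and other regular expression flavors may need slightly different syntax.

```python
import re

ANYWHERE = re.compile(r"car")                           # "car" in any context
WHOLE_WORD = re.compile(r"\bcar\b")                     # "car" as an isolated word
AFTER_COLOUR = re.compile(r"\b(?:blue|red)\s+car\b")    # preceded by "blue" or "red"
DOLLARS = re.compile(r"\$\d+(?:\.\d{2})?")              # $ digits, optional .dd

print(bool(ANYWHERE.search("bicarbonate")))
print(bool(WHOLE_WORD.search("a cartoon")))
print(bool(AFTER_COLOUR.search("a red car")))
print(bool(DOLLARS.fullmatch("$10.50")))
```

The dollar pattern is checked with fullmatch so that a string such as "$10.5" is rejected as a whole rather than partially matched as "$10".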
Regular expressions are used by many text editors, utilities, and
programming languages to search and manipulate text based on patterns. For
instance, Perl and Ruby have a powerful regular expression engine built directly
into their syntax, while languages such as Java, C++ and Tcl provide regular
expressions through libraries or built-in commands. Several utilities provided by
Unix distributions, including the editor ed and the filter grep, were the first to
popularize the concept of regular expressions.
As an example of the syntax, the regular expression \bex can be used to
search for all instances of the string "ex" that occur after a "word boundary",
which is signified by \b. In layman's terms, \bex will find the matching string
"ex" in two possible locations: (1) at the beginning of words, and (2) between two
characters in a string, where one is a word character and the other is not a word
character. Thus, in the string "Texts for experts," \bex matches the "ex" in
"experts" but not the one in "Texts", because the latter occurs inside a word and
not immediately after a word boundary.
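The behavior of \bex on this string can be checked with a few lines of Python; the variable names are only illustrative.

```python
import re

text = "Texts for experts"

# Positions of every "ex", with and without the word-boundary anchor.
all_ex = [m.start() for m in re.finditer(r"ex", text)]
boundary_ex = [m.start() for m in re.finditer(r"\bex", text)]

print(all_ex)       # positions inside "Texts" and at the start of "experts"
print(boundary_ex)  # only the position at the start of "experts"
```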
Many modern computing systems provide wildcard characters for matching
filenames in a file system. This is a core capability of many command-line shells
and is also known as globbing. Wildcards differ from regular expressions in that
they generally express only very limited forms of alternatives.
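As a small illustration of the difference, Python's fnmatch module implements shell-style wildcards and can translate a wildcard into an equivalent regular expression. The file names below are hypothetical.

```python
import re
from fnmatch import fnmatchcase, translate

names = ["main.c", "util.c", "util.h", "README"]

# "*" matches any run of characters, "?" matches exactly one character.
c_files = [n for n in names if fnmatchcase(n, "*.c")]
util_files = [n for n in names if fnmatchcase(n, "util.?")]

# Every wildcard can be rewritten as a regular expression, but not vice versa.
as_regex = translate("*.c")

print(c_files)
print(util_files)
print(bool(re.match(as_regex, "main.c")))
```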
2.6.2.1 Basic Concepts
A regular expression, often called a pattern, is an expression that describes a
set of strings. Patterns are usually used to give a concise description of a set,
without having to list all of its elements. For example, the set containing the
three strings "Handel", "Händel", and "Haendel" can be described by the pattern
H(ä|ae?)ndel; alternatively, it is said that the pattern matches each of the three
strings. In most formalisms, if there is any regex that matches a particular set,
then there is an infinite number of such expressions. Most formalisms provide the
following operations to construct regular expressions:
1. Boolean “or”
A vertical bar separates alternatives. For example, gray|grey can match
"gray" or "grey".
2. Grouping
Parentheses are used to define the scope and precedence of the operators
(among other uses). For example, gray|grey and gr(a|e)y are equivalent
patterns which both describe the set of "gray" and "grey".
3. Quantification
A quantifier after a token (such as a character) or group specifies how
often that preceding element is allowed to occur. The most common
quantifiers are the question mark ?, the asterisk * (derived from the
Kleene star), and the plus sign +. See Table 2.1.
Table 2.1 : Quantifiers of Regular Expression.
Quantifier   Description
?            The question mark indicates there is zero or one of the preceding
             element. For example, colou?r matches both "color" and "colour".
*            The asterisk indicates there are zero or more of the preceding
             element. For example, ab*c matches "ac", "abc", "abbc", "abbbc",
             and so on.
+            The plus sign indicates that there is one or more of the preceding
             element. For example, ab+c matches "abc", "abbc", "abbbc", and so
             on, but not "ac".
These constructions can be combined to form arbitrarily complex
expressions, much like one can construct arithmetical expressions from numbers and
the operations +, −, ×, and ÷. For instance, H(ae?|ä)ndel and H(a|ae|ä)ndel are both
valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
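The equivalence of these patterns, and the behavior of the three quantifiers, can be checked with Python's re module. This is only a sanity check on the examples in the text; the helper name matched_set is an assumption.

```python
import re

NAMES = ["Handel", "Händel", "Haendel", "Hndel", "Haaendel"]
PATTERNS = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]


def matched_set(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


for pattern in PATTERNS:
    print(pattern, matched_set(pattern, NAMES))

# Quantifiers from Table 2.1.
print(matched_set(r"colou?r", ["color", "colour", "colouur"]))
print(matched_set(r"ab*c", ["ac", "abc", "abbbc"]))
print(matched_set(r"ab+c", ["ac", "abc", "abbbc"]))
```

All three name patterns select exactly the same three strings, which illustrates that many different expressions can describe one set.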
2.6.2.2 Portable Operating System Interface (POSIX) Syntax
Traditional Unix regular expression syntax followed common conventions
but often differed from tool to tool. The IEEE POSIX Basic Regular Expressions
(BRE) standard (released alongside an alternative flavor called Extended Regular
Expressions or ERE) was designed mostly for backward compatibility with the
traditional (Simple Regular Expression) syntax but provided a common standard
which has since been adopted as the default syntax of many Unix regular expression
tools, though there is often some variation or additional features. Many such tools
also provide support for ERE syntax with command line arguments.
In the BRE syntax, most characters are treated as literals and they match only
themselves (i.e., a matches "a"). The exceptions, listed in Table 2.2, are called
metacharacters or metasequences.
Table 2.2 : Metacharacters for BRE Standard
Metacharacter    Description
.                Matches any single character (many applications exclude newlines,
                 and exactly which characters are considered newlines is flavor,
                 character encoding, and platform specific, but it is safe to
                 assume that the line feed character is included). Within POSIX
                 bracket expressions, the dot character matches a literal dot. For
                 example, a.c matches "abc", etc., but [a.c] matches only "a", ".",
                 or "c".
[ ]              A bracket expression. Matches a single character that is contained
                 within the brackets. For example, [abc] matches "a", "b", or "c".
                 [a-z] specifies a range which matches any lowercase letter from
                 "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b",
                 "c", "x", "y", or "z", as does [a-cx-z].
                 The - character is treated as a literal character if it is the
                 last or the first character within the brackets, or if it is
                 escaped with a backslash: [abc-], [-abc], or [a\-bc].
[^ ]             Matches a single character that is not contained within the
                 brackets. For example, [^abc] matches any character other than
                 "a", "b", or "c". [^a-z] matches any single character that is not
                 a lowercase letter from "a" to "z". As above, literal characters
                 and ranges can be mixed.
^                Matches the starting position within the string. In line-based
                 tools, it matches the starting position of any line.
$                Matches the ending position of the string or the position just
                 before a string-ending newline. In line-based tools, it matches
                 the ending position of any line.
BRE: \( \)       Defines a marked subexpression. The string matched within the
ERE: ( )         parentheses can be recalled later (see the next entry, \n). A
                 marked subexpression is also called a block or capturing group.
\n               Matches what the nth marked subexpression matched, where n is a
                 digit from 1 to 9. This construct is theoretically irregular and
                 was not adopted in the POSIX ERE syntax. Some tools allow
                 referencing more than nine capturing groups.
*                Matches the preceding element zero or more times. For example,
                 ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x",
                 "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)* matches "",
                 "ab", "abab", "ababab", and so on.
BRE: \{m,n\}     Matches the preceding element at least m and not more than n
ERE: {m,n}       times. For example, a\{3,5\} matches only "aaa", "aaaa", and
                 "aaaaa". This is not found in a few, older instances of regular
                 expressions.
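Several entries of Table 2.2 can be tried out with Python's re module. Note that Python is not a POSIX implementation: it writes groups and intervals without backslashes, as in the ERE column, yet it still supports \n back-references. The example strings and names below are assumptions chosen for illustration.

```python
import re


def fully_matches(pattern, text):
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result) triples based on the table entries.
EXAMPLES = [
    (r"a.c", "abc", True),         # dot matches any single character
    (r"[a.c]", "b", False),        # inside brackets the dot is literal
    (r"[a.c]", ".", True),
    (r"[abcx-z]", "y", True),      # mixed literals and a range
    (r"[^abc]", "a", False),       # negated bracket expression
    (r"[^a-z]", "Q", True),
    (r"(ab)*", "abab", True),      # group repeated zero or more times
    (r"(ab)*", "", True),
    (r"a{3,5}", "aa", False),      # interval: between 3 and 5 repetitions
    (r"a{3,5}", "aaaa", True),
    (r"a{3,5}", "aaaaaa", False),
]
results = [fully_matches(p, t) == expected for p, t, expected in EXAMPLES]

# ^ in a line-based mode matches at the start of every line.
line_starts = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)

# \1 recalls what the first marked subexpression matched.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine")

print(all(results), line_starts, repeated.group(1))
```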
2.6.3 Pattern Matching and Regular Expression in Artifact Extraction
In this research, pattern matching and regular expressions are used to extract
different artifacts from the source code and represent them at higher levels of
abstraction. Regular-expression-based extraction is chosen for its simplicity, ease
of use, matching power and robustness. The regular expressions use pattern matching
to extract the desired system artifacts. Hierarchical, nested and abstract
specifications are designed to match the required patterns in the source code. The
regular expression technique is flexible in the sense that it can be applied to
different kinds of system artifacts, including source code (in various languages)
and data files, and only syntactic knowledge of the subject is required. The
engineer designs the regular expression pattern and matches the pattern against the
source code, and as a result obtains valuable information that is further used for
extracting other patterns (Rasool and Philippow, 2008).
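A minimal sketch of regular-expression-based artifact extraction is given below, assuming a small C-like source file. The patterns FUNC_DEF and CALL, the keyword list and the sample source are hypothetical simplifications written for this illustration; they are not the patterns used by the tool developed in this research, and they would miss many legal C constructs.

```python
import re

# Hypothetical pattern: an unindented "type name(params) {" line.
FUNC_DEF = re.compile(
    r"^(?P<type>[A-Za-z_]\w*)\s+(?P<name>[A-Za-z_]\w*)\s*"
    r"\((?P<params>[^)]*)\)\s*\{?\s*$"
)
# Hypothetical pattern: an identifier followed by "(" is treated as a call.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

SOURCE = """int add(int a, int b) {
    return a + b;
}

int main(void) {
    int x = add(1, 2);
    printf("%d", x);
    return 0;
}
"""


def extract_artifacts(source):
    """Return (functions, calls) with the line number of each artifact."""
    functions, calls = [], []
    current = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        definition = FUNC_DEF.match(line)
        if definition:
            current = definition.group("name")
            functions.append((current, lineno))
            continue
        for callee in CALL.findall(line):
            if current and callee not in KEYWORDS:
                calls.append((current, callee, lineno))
    return functions, calls


functions, calls = extract_artifacts(SOURCE)
print(functions)
print(calls)
```

The first list gives each artifact with its location, and the second gives caller and callee relationships, which is the kind of information that can then be presented at a higher level of abstraction.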
2.7 Abstraction Process
Abstraction is the process of hiding the details and exposing only the
essential features of a particular concept or object. Abstraction should be able to
improve the understanding of software system prior to making changes towards a
software system. Abstraction is a primary concept in software engineering and is, in
fact, a basic property for understanding the reality and managing the complexity of
software systems (Damasevicius, 2006). Hence, abstraction is used in program
understanding to enhance the comprehension of the program through textual and
graphical abstraction. In this research, the extracted artifacts are represented
with their locations and their relationships with other artifacts in the code.
Some criteria for evaluating the effectiveness of reverse engineering tools are as
follows (Sulaiman, 2004):
i. Abstraction level refers to the sophistication of the design information
that can be extracted from source code. Ideally, the abstraction level
should be as high as possible, so that the reverse engineering process
is capable of deriving:
a. Procedural design representations (a low level of abstraction) –
determine the structure of each procedure.
b. Program and data structure information (a slightly higher level of
abstraction) – indicate the structure of a program and its
dependencies including details of data and its decomposition.
c. Data and control flow models (a relatively high level of
abstraction) – identify data usage among programs.
d. Entity-relationship models (a high level of abstraction) – indicate
dependencies among modules or systems.
ii. Completeness of a reverse engineering process refers to the level of detail
that is provided at an abstraction level. In most cases, completeness
decreases as the abstraction level increases.
iii. Directionality can be one-way in which all information extracted from the
source code is provided to maintainers who later use it during any
maintenance activity. In two-way directionality the information is fed to a
forward engineering tool that attempts to restructure or regenerate the old
program.
2.7.1 Graphical Representation
Graphical representation provides an alternative or additional presentation of
the information, which can help the developer understand the graph. Graphical
abstractions offer a middle ground between viewing the graph at the node/edge
level, where the developer can only view a portion of the graph, and the layout
overview, where the developer can see the entire graph but not the details of the
individual nodes. Graphical abstraction can be subdivided into two parts:
representation and definition. Representation refers to how the graphical
abstractions are presented to the developer, and definition refers to how the user
specifies the graph or sub-graph to be used for the graphical abstraction
(Paulisch, 1993).
CASE (Computer-Aided Software Engineering) products such as reverse engineering
tools can automatically extract components from existing source code and provide
graphical representations of the artifacts to assist software engineers in program
comprehension or software understanding (Sulaiman, 2004).
2.7.2 Textual Representation
In this research, textual representation is used in support of software
comprehension and maintenance. Text has the advantages of being easily
communicated, effectively manipulated with existing tools, and highly scalable (Cox
and Collard, 2005). The use of textual markup models provides a combination of
advantages that other models do not possess. These advantages are:
i. Robustness: Text models can be used when other models are not extractable,
such as when source code cannot be parsed effectively.
ii. Scalability: Large scale text data-sets can be efficiently stored and
retrieved using text-oriented databases (e.g., Google) and processed as
memory-efficient streams (e.g., SAX).
iii. Searchability: Text is easily searched using string-based or index-based
tools (e.g., grep or Google).
iv. Independence: Tools for manipulating text (e.g., Perl) or marked-up text
(e.g., XSLT) can be used on any textually represented programming language.
v. Adoption-Centric: Text manipulation tools already exist and are regularly
used for a variety of tasks (e.g., Perl, AWK, grep, vi, emacs).
vi. Readability: Text is always readable by maintainers, potentially increasing
maintainers' trust and understanding.
vii. Communicable: Text is easily communicated between tools, hosts, and
environments.
viii. Transparency: The relationship between extracted information and the
source code is easily seen without the need to build and maintain an external
mapping.
ix. Abstraction: Textual representations can support many different levels of
abstraction.
In this research, both representations are used to support program
understanding. The abstraction shows the list of artifacts, the location of the
artifacts and the relationships between the artifacts.
2.8 Concept Location
Searching in source code or documentation is one of the most common
activities performed by software engineers during maintenance (Singer et al., 1997).
Concept location is one such searching activity where the software engineers try to
locate a part of the source code that implements specific domain concepts. This
activity is also referred to as the concept assignment problem (Marcus et al., 2005).
Concept location is the process of locating a feature or concept in a software
system (Rajlich and Wilde, 2002). A concept is relatively easy to locate in a
small system that the programmer fully understands. For large and complex
systems, it can be considerably more difficult. Concept location assumes that a
maintainer understands the concepts of the program domain, but does not know
where in the code the concepts are located. It is directly applicable to
program understanding, as programmers or maintainers always have some
pre-existing knowledge; otherwise the process of understanding would not be
possible (Rajlich and Wilde, 2002).
In short, concept location is part of the incremental change process and it
allows programmers to determine the initial location of a change within the
source code. Typically, concepts appear as nouns, verbs, or short clauses in the change
request. These concepts are also embedded within the structures of the source code
and appear as variables, classes, or methods. Concept location is the process that
finds the implementations of these concepts (Rajlich and Wilde, 2002). Effective
concept location techniques are crucial for software engineers since they provide the
means for evolving large software systems without understanding the entire body of
the code (Marcus et al., 2005) and identify the place in the software where the
change is to be made.
2.8.1 Concept Location in Source Code
Concept location occurs frequently during incremental change of software
(Marcus et al., 2005). Here, concepts are extracted from change requests while the
concept location process identifies the starting point of the change. The complexity
and importance of the concept location process increases with the size of the
software. Different techniques achieve this in different ways. One common feature of
various approaches is that the source code is often decomposed into units different
than files (e.g., classes, functions, etc.) and it is enriched with additional information
(e.g., relationships between elements of the source code). The software
decomposition determines the unit of the search, while the additional information
determines the searching criteria.
Software decomposition and analysis creates an intermediate representation
of the software system, see Figure 2.5. Deterministic mappings are defined between
this representation and the source code. The user interacts with and searches this
representation, but the results of the search are presented as elements of the source
code.
Figure 2.5: Most concept location techniques rely on an intermediate representation
of the source code (Marcus et al., 2005).
At a high level, the concept location process can be defined as follows: i)
concept formulation, usually in natural language; ii) query formulation and
execution based on the intermediary representation; and iii) investigation of results.
Most concept location techniques have the following attributes (Marcus et al., 2005):
1. Prerequisites (e.g., complete and executable program, test suite,
incomplete program, libraries, etc.);
2. Intermediate representation:
a. Format (e.g., string, graph, database, etc.);
b. Content (e.g., text, dependencies, data flow, control flow,
execution traces, etc.);
c. Preprocessing/analysis (e.g., manual, automatic, dynamic analysis,
static analysis, parsing, knowledge base, etc.);
3. Query information
4. Format and granularity of results (e.g., line of text, line of code, function,
class, file, etc.).
Based on the preprocessing needed to create the intermediate representation
we can differentiate two major classes of techniques: static and dynamic. Static
techniques create the intermediary representation based on the source code that can
be incomplete. Dynamic techniques require complete executable programs and test
suites.
All static techniques share the same prerequisites and have compatible
preprocessing; therefore this research is focused only on three of the most popular
static techniques based on regular expression matching, static program dependencies,
and information retrieval.
2.8.2 Static Concept Location Techniques
One common characteristic of static location techniques is that they can be
used rapidly without too much preparation. They allow work on incomplete
programs, design documents, and other work-products. This allows programmers to
combine them and leverage their respective advantages. Table 2.3 summarizes the
specific attributes of each technique (Marcus et al., 2005).
Table 2.3: Features of the static concept location techniques
                                  Pattern Matching    Dependency Based              IR Based
Prerequisites                     None                None                          None
Internal representation format    String of tokens    Graph                         Document vector space
Internal representation content   Characters/tokens   Function dependencies         Identifier, comments
Internal representation analysis  None                Static dependencies analysis  Parsing, LSI
Query                             Regular expression  Depth first search in graph   Natural language
Results                           Line of text        Call function                 Functions/files
2.8.2.1 String Pattern Matching Technique (Grep)
‘Grep’ is an acronym for "global regular expression print". It is a tool that
prints the lines that contain a match for a regular expression. Although
several more advanced pattern matching tools exist, Grep is one of the most
popular tools in this category and can therefore be taken as its
representative. Marcus et al. (2005) refer to the string pattern matching
technique as the grep-based technique.
The critical part of the “grepping” process is the formulation of the search
pattern. This can be facilitated by certain heuristics and in most cases, it is highly
dependent on the experience of the person who performs the search. The technique
does not make any assumptions about the structure of the software and hence it can
be used on C systems without any adaptation (Marcus et al., 2005).
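The grep-based technique described above can be sketched in a few lines of
Python. This is only an illustration: the function name grep_lines and the
small C fragment are assumptions made for the example, not part of Grep or of
the cited work.

```python
import re

def grep_lines(pattern, lines):
    """Return (line_number, text) pairs for lines matching the regular expression."""
    regex = re.compile(pattern)
    return [(n, text) for n, text in enumerate(lines, start=1) if regex.search(text)]

# Hypothetical C fragment used only for illustration.
c_source = [
    "int total_price(int qty, int price) {",
    "    return qty * price;",
    "}",
    "int main(void) {",
    "    int price = 5;",
    "    return total_price(2, price);",
    "}",
]

# \b restricts the match to the whole word "price", so "total_price" alone
# would not count as a match.
matches = grep_lines(r"\bprice\b", c_source)
for number, text in matches:
    print(f"{number}: {text}")
```

The result is a list of lines of text, which matches the "Line of text" result
granularity in Table 2.3. The search knows nothing about C syntax, so the
quality of the result depends entirely on the pattern the maintainer writes.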
2.8.2.2 Dependency Search Technique
The static dependency search is a variant of the depth first search, conducted
by a programmer rather than a computer. The programmer follows the dependencies
among the modules, hence the technique is adapted to C programming by dealing
with procedure or function and their dependencies (Marcus et al., 2005).
When searching for concepts, the functionality of the C files that the
programmer encounters can be viewed in two different ways. First, there is the
composite functionality, which is defined as the complete functionality of a C
file combined with all its supporting C files. The second type of
functionality, local functionality, consists of concepts that are actually
implemented in the C file and are not delegated to others (Marcus et al., 2005).
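In the technique above the search is carried out by the programmer, but the
traversal order can be illustrated with a short Python sketch. The call graph
and function names below are hypothetical; under these assumptions, the set of
functions reachable from a starting function roughly corresponds to what it
delegates to others.

```python
def depth_first_search(call_graph, start):
    """Visit functions reachable from `start`, following callee edges depth first."""
    visited = []
    stack = [start]
    while stack:
        function = stack.pop()
        if function in visited:
            continue
        visited.append(function)
        # Push callees in reverse so they are visited in their listed order.
        stack.extend(reversed(call_graph.get(function, [])))
    return visited

# Hypothetical call graph: each function maps to the functions it calls.
call_graph = {
    "main": ["read_input", "compute"],
    "compute": ["add", "log_result"],
    "read_input": [],
    "add": [],
    "log_result": [],
}

order = depth_first_search(call_graph, "main")
print(order)
```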
2.8.2.3 IR-based Technique
IR-based methods for concept location share the following general pattern
(Marcus et al., 2005):
1. Preprocessing of the source code and documentation.
2. Indexing that creates the intermediary representation.
3. Execution of queries formulated in natural language.
4. Retrieving and analyzing the results that are returned as a ranked list.
The IR-based system uses latent semantic indexing (LSI) (Marcus et al., 2005)
as the intermediate representation of the identifiers and comments extracted
from the source code. The source code is partitioned into a set of documents. A
document can be any contiguous set of lines of the source code; therefore,
different document definitions are possible.
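The four steps listed above can be illustrated with a deliberately simplified
Python sketch. It ranks documents by cosine similarity over raw term counts
and does not perform the dimensionality reduction that LSI adds, so it is a
plain vector space stand-in rather than LSI itself. The document names and
contents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Preprocessing: lowercase the text and keep alphabetic words only."""
    # Underscores are not letters, so print_report yields "print" and "report".
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Index the documents, run the query, and return (name, score) best first."""
    query_vector = Counter(tokenize(query))
    scores = {name: cosine(query_vector, Counter(tokenize(text)))
              for name, text in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical documents: here each document is one function with its comment.
documents = {
    "print_report": "/* print the sales report */ void print_report(void)",
    "read_config": "/* read configuration file */ int read_config(char *path)",
}

ranking = rank("print report", documents)
print(ranking)
```

Choosing one function per document is only one of the possible document
definitions mentioned above; a file or a fixed block of lines would work with
the same code.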
2.9 Code Query in Reverse Engineering Tools
The following sections discuss code query methods in reverse engineering
tools. They report a study of existing reverse engineering environments and
tools that qualitatively evaluates their functionality, features and methods.
The study covers Windows Grep, Rigi (Sulaiman, 2004) and CodeSurfer
(Anderson and Zarins, 2005).
2.9.1 Windows Grep
Windows Grep is a tool for searching files for text strings that you specify.
Although Windows and many other programs have file searching capabilities built-in, none can match the power and versatility of Windows Grep. The program
combines the power and flexibility of traditional command line grep utilities
available on DOS, UNIX and other platforms with the ease of use of Microsoft
Windows (WinGrep, 2007).
In addition to searching, Windows Grep also performs global replacing in
your files, with complete safety. Windows Grep is designed for searching plain ASCII text files, such as program source, HTML, RTF and batch files, but it can also
search binary files such as word processor documents, databases, spreadsheets and
executables.
The primary feature of Windows Grep is to search the contents of one or
more files on your PC for occurrences of text strings you specify and display the
results. Once found, it can replace matches with other strings. See Figure 2.6.
Figure 2.6: Windows Grep Search Results
2.9.2 Rigi
Rigi is a tool that assists in understanding and re-documenting the
abstractions of software systems, particularly legacy systems (Rigi, 2004). Rigi
is an interactive, visual tool designed to help software maintainers better
understand the software. Rigi includes parsers to read the source code of the
subject software and produce a graph of extracted artifacts such as procedures,
variables, calls, and data accesses. To manage the complexity of the graph, an
editor allows the software maintainer to automatically or manually collapse
related artifacts into subsystems. These subsystems typically represent concepts
such as abstract data types or personnel assignments. The created hierarchy can
be navigated, analyzed, and presented using various automatic or user-guided
graphical layouts (Sulaiman, 2004).
2.9.2.1 Rigi Features
The discovered structural information is useful for making informed
development and management decisions. The information serves as documentation
that is up-to-date and accurate because it is derived from the actual source code.
Thus, Rigi helps to understand legacy software systems where the existing
documentation may be missing or lacking. Rigi aids reengineering tasks that need
to discover design information in existing software. Rigi provides several
useful features that allow the software maintainer to determine dependencies
and potential impacts. The features include:
i. Easy-to-use visual interface.
ii. Parsers for C, C++ and COBOL.
iii. Selection, filtering, and editing operations
iv. Dependency and change impact reports
v. Standard, overview, and projection perspectives
vi. Metrics for cohesion and coupling
vii. Views to capture interesting perspectives
viii. Scripting language and command library
ix. Adaptable to different languages and purposes
x. Customizable user interface
xi. Simple file format to represent graphs
2.9.2.2 Rigi Query Technique
Basically, Rigi works by feeding the subject software to the parser for the
specified language, which generates a set of triples (tuples) in a text file
with the extension “.rsf” (token file). The file can then be loaded into
RigiEdit (see Figure 2.7), which presents the extracted software artifacts. The
SHriMP view (Storey, 1998) is also based on the Rigi reverse engineering
environment.
Figure 2.7: View produced by Rigi via RigiEdit
2.9.3 CodeSurfer
CodeSurfer, as shown in Figure 2.8, is a tool that provides a wide range of
program understanding capabilities by exposing the results of a static-semantic
analysis to the user in novel and interesting ways. The tool performs a number
of whole-program analyses, including pointer analysis, and creates a system
dependence graph for the program. The user can browse these dependences through
the GUI in a manner akin to surfing the web. An open architecture fosters the
development of plug-ins that can extend the basic functionality. These include
tools for reasoning about the paths through the program, and for software
assurance (CodeSurfer, 2007).
Figure 2.8: CodeSurfer Project Viewer
2.9.3.1 CodeSurfer Features
A partial list of features included in the CodeSurfer Standard Package is
given below. Many of CodeSurfer's advanced features (e.g., pointer analysis) are not
available in any other commercial tools.
1. Syntax Highlighting
Distinguishes between code, comments, preprocessor directives, macros,
and conditionally-compiled-out code
2. Navigation from
i. a variable occurrence to the statements that can assign its
value
ii. an assignment to a variable to a use of the value assigned
iii. a statement to the control points that affect whether the
statement gets executed
iv. a control point to the statements whose execution it controls
v. a macro use to its definition
vi. a variable occurrence to its declaration
vii. a variable occurrence to the declaration of its type
viii. a function call to the function definition
ix. a #include directive to the included file
3. Call Graphs
Display a call graph, modify its layout, and print or save it
4. Pointer Analysis
i. Display the variables a pointer can point to
ii. Display the pointers that can point to a variable
iii. Navigate from an indirect function call site to the targets of the
call
5. GMOD/GREF Analysis
i. For any function, display all the variables it modifies
ii. For any function, display all the variables it uses
6. Finder. See Figure 2.8. Syntax-based searches for
i. function definitions, calls, and indirect calls
ii. variable declarations, definitions, and uses types
7. Impact Analysis. A program slicer that does
i. forward program slicing
ii. backward program slicing
8. Metrics
Calculate cyclomatic complexity and other metrics.
2.9.3.2 CodeSurfer Query Technique
CodeSurfer provides a query facility called Finder, which offers advanced
searching capabilities. The Finder uses regular expression matching to query
variables, functions or files in a C source code directory. For example,
maintainers can find all the uses of a specified variable’s value, including
indirect uses via pointers (where the variable name does not occur textually).
Results are hyperlinked to the code. The weakness is that there is no
visualization after a query.
Figure 2.9: Finder Viewer
2.9.4 Comparative Evaluation of Existing Tools
The research has identified several advantages and disadvantages in
Windows Grep, Rigi and CodeSurfer. Windows Grep is a good searching tool for
finding text strings in files. It lists the location information of each
matching file, such as filename, file type, folder, number of matches, file
size and date/time, along with other textual results. Unfortunately, Windows
Grep is not suitable for program understanding.
Rigi is an interactive, visual tool designed to help developers better
comprehend the software. Rigi includes parsers to read the source code of the
subject software and produce a graph of extracted artifacts such as procedures,
variables, calls, and data accesses. To manage the complexity of the graph, an
editor allows the developers to automatically or manually collapse related
artifacts into subsystems. These subsystems typically represent concepts such
as abstract data types or personnel assignments. The created hierarchy can be
navigated, analyzed, and presented using various automatic or user-guided
graphical layouts. However, a major disadvantage of Rigi is that it is not
user-friendly for searching.
CodeSurfer is the most complete program understanding and tracing software
available to the software maintainer. The tool also provides a query facility
called Finder, which offers advanced searching capabilities. However, there is
no visualization after a successful query. This is the biggest disadvantage of
CodeSurfer.
Table 2.4 summarizes the approaches of the existing tools against some defined
criteria.
Table 2.4: Existing features of current tools
Criterion                 Windows Grep             Rigi                   CodeSurfer
Search Technique          Regular Expression,      Pattern Matching       Syntax Based
                          Pattern Matching
Automatic Textual Report  Available                Not Available          Available
Visualization Type        Textual                  Graphical              Graphical, Textual
Search Coverage           All (query includes      Function, call,        Function definition, calls,
                          all file types)          indirect call          indirect calls, variable,
                                                                          definition, uses types
Pragmatic                 TD, BU                   TD, BU                 TD, BU
Note: TD – Top Down, BU – Bottom Up
2.10 Proposed Solution
The above discussion motivates the development of a new source code query tool
to support structured program understanding. The research proposes a model that
can assist the developer in program understanding. The proposed model is called
CQuery. CQuery applies regular expression and pattern matching techniques to
search C language syntax in the software project directories. What
distinguishes CQuery is that the result displays the function name, filename,
line number in the code and path name for both callee and caller. In addition,
the result can be displayed in textual and graphical mode. A more detailed
design of CQuery is given in Chapter 4.
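The kind of result described above (caller, callee, file and line number) can
be illustrated with a small Python sketch. It is not the CQuery
implementation, which is presented in Chapter 4. The regular expressions, the
function find_calls and the sample file name are assumptions, and the sketch
only handles function definitions whose header and opening brace sit on one
unindented line.

```python
import re

# Assumed simplification: a definition starts in column 0 and ends with "{".
DEFINITION = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*\)\s*\{")
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# C keywords that look like calls because they are followed by "(".
KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}

def find_calls(filename, lines):
    """Return one record per call site: caller, callee, file and line number."""
    results = []
    current = None  # name of the function whose body is being scanned
    for number, text in enumerate(lines, start=1):
        definition = DEFINITION.match(text)
        if definition:
            current = definition.group(1)
            continue
        if current is None:
            continue
        for callee in CALL.findall(text):
            if callee not in KEYWORDS:
                results.append({"caller": current, "callee": callee,
                                "file": filename, "line": number})
    return results

# Hypothetical C file content used only for illustration.
lines = [
    "int add(int a, int b) {",
    "    return a + b;",
    "}",
    "int main(void) {",
    "    int sum = add(1, 2);",
    '    printf("%d\\n", sum);',
    "    return 0;",
    "}",
]

calls = find_calls("sample.c", lines)
for record in calls:
    print(record)
```

A regular expression cannot cope with every C construct (multi-line headers,
macros, function pointers), which is one reason a parser-based extraction may
be combined with pattern matching.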
2.11 Summary
In summary, the aim of this chapter is to study the link between program
understanding, concept location, visualization and the level of information
required to perform maintenance tasks. The study assumes that there is an
extremely important relationship between program understanding, concept
location and visualization, as well as the level of information required by
software engineers. Three tools are used as a reference in this research:
Windows Grep, Rigi and CodeSurfer. Together they provide most of the
information types in the code query technique. The weaknesses and strengths of
the existing tools in terms of features provided and methods employed have also
been highlighted. To better enhance program understanding, a source code query
is proposed. The method should also reduce software maintainers’ cognitive
overhead while still producing sufficient information at different levels of
abstraction.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
This chapter first discusses the research procedures, or operational
framework, of this research. The operational framework explains the research
process from the research proposal until the end of the research, and it
includes the formulation of the research problem. The evaluation plan gives a
brief description of data gathering and analysis. At the end of this chapter,
some assumptions of the research and a summary are given.
3.2 Operational Framework
This research is based on the operational framework illustrated in Figure
3.1. The operational framework is divided into four phases. The phases are:
1. Phase 1 – Formulation of Research Problem
2. Phase 2 – Prototype Development
3. Phase 3 – Implementation and Evaluation
4. Phase 4 – Research Report
Figure 3.1 : Operational Framework
3.2.1 Phase 1: Formulation of Research Problem
The objective of phase 1 is to understand and identify the problems that need
to be researched. In this phase, the research problem focuses on the program
understanding tools currently used in industry. The research starts with the
idea, the literature review and the analysis of current tools. The author
proposes a solution based on the research problem identified.
3.2.1.1 Literature Review
The research starts with a preliminary study to gather all the essential data
and information needed as part of the literature review. Information is
obtained on the following:
1. Current and existing research on program understanding tools.
2. Current and existing program understanding tools with concept
location that have query features.
The literature review is one of the important methods that contribute ideas
for understanding and identifying the research problem. In this step, a deeper
understanding of the main research material is needed to obtain better results
at the end of the research.
3.2.1.1.1 Understanding the Need of Change Request Process – Finding the Keyword
Traditionally, the keyword in a change request is determined manually by the
software maintainer, who reads the software documentation and traverses the
source code. While this method is normally sufficient for small programs, for
legacy software it is exhausting and highly prone to errors. Legacy software
needs repositories to organize and store its software components, which grow
increasingly large in subtle ways. If the change request is performed manually,
it can be time consuming, expensive and strenuous. Thus, ideally, the software
maintainer needs a computer-aided system to handle the change request and query
the keyword easily.
3.2.1.1.2 Understanding Structured Programming Concept
Understanding the structured programming concept is very important in code
query. The researcher needs to understand the syntax of the structured
programming language. The researcher also needs to identify what types of
artifacts need to be extracted from the code. In this case, the C programming
language is the subject.
3.2.1.1.3 Understanding the Extraction Process
The researcher needs to understand the extraction process and identify which
techniques could be used during artifact extraction. The extraction process
may involve a code parser, pattern matching and regular expressions. In this
research, the scope of the proposed extraction process is to extract all the
required artifacts from the C code.
3.2.1.1.4 Understanding the Abstraction Technique
Abstraction is needed in code query for several reasons. An abstraction
presents the details of a software system at a chosen level. It is a primary
concept in software engineering and is, in fact, basic to understanding
software systems. In this research, abstraction is used to present the result
in textual and graphical representation to enhance the maintainer’s program
understanding.
3.2.1.2 Analysis of Current Approaches and Existing Tools
Research on the current approaches and existing tools is very important for
understanding and identifying their limitations. Based on the tools and
approaches reviewed in Chapter 2, some of the tools lack very important
information such as the line number of the code and the call function. This
information is needed, especially in the change process. Hence, the idea to
develop a prototype comes from the limitations of the existing tools.
3.2.1.3 Research Proposal
The research proposal for code query is the starting point for the rest of
the research. The problem background, objectives, scope, importance and
significance of the study, and operational framework are explained in the
proposal. The proposal is presented to evaluators, who determine whether the
scope of the research is relevant to the research study.
3.2.2 Phase 2: Prototype Development
The prototype development process provides proof that the model design meets
all of the research objectives specified in Chapter 1. The main objectives of
the prototype development serve as guidelines to ensure that the model can be
built as designed and that it meets the research requirements.
3.2.2.1 Code Query Model Design
A model design needs to be developed before prototype development. The model
design is built on extended extraction with abstraction of keyword and
artifact occurrences. There are three components in this model:
1. Keyword
2. Extraction
3. Abstraction
The details of the model design are discussed in Chapter 4.
3.2.2.2 Code Query Prototype Development
A prototype is developed to demonstrate Code Query. The prototype targets the
C programming language and aims to enhance program understanding. It serves as
a tool that guides the development team in understanding the program more
easily. In addition, developers can see the potential effects of a change,
based on the keyword query, before deciding to make the change. The tools used
to develop this prototype are NetBeans 5.5 IDE for Java, the Java JGraph 1.5
library and Microsoft Access 2003 as the C component repository.
3.2.3 Phase 3: Implementation and Evaluation
The code query approach needs an appropriate plan and design for the proof of
concept, so that appropriate actions can be taken to validate and verify the
results.
For this approach, the key aspect is the measurement of accuracy. Accuracy is
based on how close the keyword locations returned by the Code Query supporting
tool are to the actual keyword locations in the source code, which are
determined by a software expert in the particular knowledge domain. A
knowledge domain refers to a specific domain of a software project
(Ibrahim, 2006).
A new case study will be created based on a software project chosen from
existing software systems, together with its software developers. The software
developers involved in the project will act as software maintainers in the new
case study.
In this research, a change request and human involvement are needed for the
experiment, which serves the purposes of an empirical study. The purpose of the
experiment is to let software maintainers apply the approach, using the Code
Query tool to query a keyword, and to compare the result with the actual
result obtained by exploring the code manually. The detailed evaluation is
described in the following sub-sections.
3.2.3.1 Supporting Tools
The supporting tools need the right environment to work properly. There are
two categories of supporting tools:
1. Experimentation-wise
A few supporting tools are used to produce the C components and their
relationships in the source code. The code parser is used to extract all the
components from the C source code.
2. Development-wise
The proposed approach is intended for any type of system environment.
However, for research purposes, the development supports only C programming
and runs on Windows operating systems such as Windows XP, 2000 and Vista.
The prototype development is based on the proposed approach. The prototype is
developed using NetBeans 5.5 IDE (Java Development Kit 1.6), and the C
components repository uses Microsoft Access 2003.
3.2.3.2 Choose Case Study
The new case study created is based on a software project that has been
chosen from the existing software systems together with its software developers. The
software developers who are involved in the project will act as software maintainers
in a new case study.
3.2.3.3 Experiment
A problem change request and human involvement are needed for the experiment,
which serves the purposes of an empirical study. The purpose of the experiment
is to allow software maintainers to apply the approach, using the Code Query
tool to find the keyword based on the problem change request and to generate
the occurrences of the keyword in the source code.
The researcher will engage a group of software developers with software
maintenance knowledge as subjects. In addition, the researcher will provide a
complete software project with documentation and current tools as a case
study.
3.2.3.4 Evaluation
Evaluation is the systematic collection and analysis of the data needed to
make decisions or to verify the concept behind the prototype or software. The
data analysis is done using the SPSS statistical package to evaluate the
quantitative results. The researcher will collect dependent variables from the
experiment such as accuracy and time consumption. Accuracy is based on how
close the keyword query results are to the actual keyword locations. Time
consumption measures how quickly the keyword can be found using the prototype
compared with browsing manually. The details are discussed in Chapter 6.
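One possible way to turn the accuracy described above into a number is shown
in the Python sketch below. The measure used here, the fraction of
expert-identified locations that the tool also reports, is an assumption made
for illustration and is not necessarily the definition used in Chapter 6. The
file names and line numbers are hypothetical.

```python
def accuracy(tool_locations, expert_locations):
    """Fraction of expert-identified locations that the tool also reported."""
    expert = set(expert_locations)
    if not expert:
        return 0.0
    return len(expert & set(tool_locations)) / len(expert)

# Hypothetical (file, line) locations of a keyword.
tool_result = [("sales.c", 12), ("sales.c", 40), ("report.c", 7)]
expert_result = [("sales.c", 12), ("report.c", 7), ("report.c", 19), ("main.c", 3)]

score = accuracy(tool_result, expert_result)
print(score)
```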
3.2.4 Phase 4: Research Report
The report will explain the details of the research work. It includes the detail
of implementation and evaluation. The important part to be explained is the research
work and the result of the research.
3.3 Research Assumption
In this research, the prototype focuses on C structured programming with the
selected case study only, and it is not suitable for other programming
languages. The research makes the following assumptions:
1. The subjects who use the prototype are software developers or software
engineers.
2. The selected project system is a structured program.
3.4 Summary
Chapter 3 explains the details of the research methodology. The operational
framework sets out the research idea and the procedure of the research. The
most important phase is implementation and evaluation. In this phase, the case
study is used for the experiment, data gathering and analysis. This should
provide a means to evaluate and validate the contribution of the proposed
approach. Some assumptions of this research are stated to guide its
implementation.
CHAPTER 4
CODE QUERY MODEL
4.1 Introduction
This chapter describes the newly proposed software code query model and an
approach to support program understanding. The model is established over the
components, relationships and dependencies of the C structured programming
language in software work products. The chapter begins with an overview of the
code query model, followed by the proposed model and approach.
4.2 Overview of Code Query
Code Query is a method for searching for a keyword and related artifacts in
the source code. The method is an extended extraction with abstraction of the
occurrences of the keyword and the artifacts. There are three main components
in this model, as shown in Figure 4.1: keyword, artifacts extraction and
artifacts abstraction.
1. Keyword – The maintainer needs to find the keyword from the PCR.
2. Artifacts Extraction – The research focuses on this component. Code
Query uses concept location, parsing, pattern matching and regular
expressions to extract the artifacts from the source code.
3. Artifacts Abstraction – Textual and graphical representations are used to
show the artifacts, the relationships between artifacts and the location of
artifacts in the code.
Figure 4.1 : Overview of Code Query Model
4.3 Code Query in Structured Programming
In this research, structured programming is selected as the programming
paradigm for program understanding. Structured programming (sometimes known as
modular programming) is a subset of procedural programming that enforces a logical
structure on the program being written to make it more efficient and easier to
understand and modify. In Code Query, the interactions in a structured program need
to be defined to show the relationships and dependencies between caller and callee
functions. The interactions include the relationships, message flow and dependencies
between functions and procedures. All the interactions are presented in textual and
graphical representations.
4.3.1 Structured Programming Concept
Structured programming can be seen as a subset or sub-discipline of
imperative programming, one of the major programming paradigms. Structured
programs are often composed of simple, hierarchical program flow structures.
Structured programming is beneficial for organizing and coding computer programs
that employ a hierarchy of modules. This means that control is passed only
downwards through the hierarchy. Examples of structured programming languages
include Ada, Pascal, Fortran and C. These programming languages are designed with
features that encourage or enforce a logical program structure.
Structured programming often uses a top-down design model in which
developers map out the overall program structure into separate subsections from top
to bottom. In the top-down design model, programs are drawn as rectangles. A
top-down design means that the whole program is broken down into smaller sections
known as modules. A program may have one module or several modules.
A well-structured program should devote a single procedure to the solution of
a single problem. The splitting of problems into sub-problems should be reflected by
breaking down a single procedure into a number of procedures. The idea of program
development by stepwise refinement advocates that this be done in a top-down
fashion.
Figure 4.2 describes the fundamental ideas in structured programming
systems. Structured programming concept consists of functions and procedures.
Figure 4.2 : Structured Programming – Tax Calculation
Structured programs are separated into modules or subprograms. The
instructions of a structured program are executed one after the other, calling the
subprograms when needed. For instance, Control Program is the main module of the
tax calculation. The control program module is divided into individual modules; each
module represents a specific processing task.
4.3.1.1 Relationship in Structured Programming
The function call is the main relationship in structured programming. Figure 4.3
shows the function relationships in the Tax Calculation example. For example, to pay
the tax, the main() function calls taxComputation() to calculate the tax and then
calls payTax() to execute the tax payment.
Figure 4.3 : Function Relationship
Table 4.1 summarizes the relationship types in the C structured programming
language.
Table 4.1 : Relationship Types
Relationship Type    Relationship Signs
Caller               Function A → Function B
Callee               Function B ← Function A
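The two relationship types in Table 4.1 can be illustrated with a small sketch. The
Python code below is not part of the thesis prototype; it only assumes the call edges
of the tax calculation example (main() calling taxComputation() and payTax()), and
the names CALL_EDGES and build_relations are hypothetical.

```python
from collections import defaultdict

# Call edges from the tax calculation example, written as (caller, callee).
CALL_EDGES = [
    ("main", "taxComputation"),
    ("main", "payTax"),
]


def build_relations(edges):
    """Return two maps: caller -> callees and callee -> callers."""
    callees_of = defaultdict(list)
    callers_of = defaultdict(list)
    for caller, callee in edges:
        callees_of[caller].append(callee)  # Function A -> Function B
        callers_of[callee].append(caller)  # Function B <- Function A
    return dict(callees_of), dict(callers_of)


callees_of, callers_of = build_relations(CALL_EDGES)
```

Keeping both directions makes either question cheap to answer: which functions a
given function calls, and which functions call it.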
4.3.1.2 Dependencies in Structured Programming
Dependency analysis follows the data flow that links relationships and
dependencies. The dependencies used in this research are:
i. Data dependency – data flow between the definition of data and its uses.
ii. Control dependency – data flow governed by control constructs such as
if-then-else, for loop, while loop, do-until and case.
iii. Component dependency – data flow from file to file.
4.3.1.3 Observations about Structured Programming
Structured programming is not the wrong way to write programs. Similarly,
object-oriented programming is not necessarily the right way. Object-oriented
programming (OOP) is an alternative program development technique that often
tends to be better when dealing with large programs and when program reusability
matters. Observations about structured programming are:
i. Structured programming is narrowly oriented towards solving one particular
problem
   a. It would be preferable if programming efforts could be oriented more
   broadly
ii. Structured programming is carried out by gradual decomposition of the
functionality
   a. It has been observed that the structure formed by
   functionality/actions/control is not the most stable part of a program
   b. Focusing on data structures instead of control structures would be an
   alternative approach
iii. Real systems have no single top; they may have multiple tops
   a. It may therefore be natural to consider alternatives to the top-down
   approach
4.4 A Proposed Code Query
Code Query is a method to support and enhance structured program
understanding. Research related to this work comes primarily from source-code-based
maintenance of structured programming software. The model has two main
components, as shown in Figure 4.4.
[Figure 4.4 labels: Maintainer; PCR; Find Keywords; User Interface; Open C files;
C files; Processing Tools; Extraction; Abstraction; Repository (C component:
Function, Struct, etc.; Filename)]
Figure 4.4: Code Query Approach
When exploring source code, a maintainer normally uses a suite of
specialized exploration tools in a text editor or development environment, each with
its own capabilities and interface, to gain the knowledge needed to understand the
system. Code Query is intended to replace that conventional approach.
4.4.1 Keyword
When a PCR has been raised, the maintainer finds the keyword in the PCR.
Code Query then matches the keyword against the source code using pattern
matching. The match results, such as function name, line number and filename,
together with the direct relationships between the artifacts, are stored in the
repository.
4.4.2 Extraction of Artifacts
There are three sub-components in the artifacts extraction: the parser, pattern
matching and regular expressions. These components are important for determining
the artifacts and the concept location in the source code. They extract the following
artifacts from the source code:
i. Callee
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
ii. Caller
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
4.4.2.1 Parser
A parser is used to parse a string according to a grammar, which means
reconstructing the production tree (or trees) that indicates how the given string can
be produced from the given grammar. There are two ways of parsing:
i. Top-down parsing
ii. Bottom-up parsing
This research uses the top-down parsing technique to extract the necessary
artifacts from the source code. Top-down parsing can be viewed as an attempt to
find left-most derivations of an input stream by searching for parse trees using a
top-down expansion of the given formal grammar rules. Tokens are consumed from
left to right. Inclusive choice is used to accommodate ambiguity by expanding all
alternative right-hand sides of grammar rules.
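The idea of top-down parsing can be sketched with a small recursive-descent parser.
This is only an illustration in Python, not the parser of the prototype: the toy
grammar, the token pattern and the function names are all assumptions. The sketch
is also deterministic, so it does not explore alternative right-hand sides for
ambiguous input.

```python
import re

# Hypothetical toy grammar (not the grammar used in the thesis):
#   expr := IDENT '(' [ expr { ',' expr } ] ')' | IDENT | NUMBER
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[(),])")
PUNCTUATION = {"(", ")", ","}


def tokenize(text):
    """Split the input into identifiers, numbers and punctuation."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens, pos=0):
    """Expand the rule for expr at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    head = tokens[pos]
    if head in PUNCTUATION:
        raise SyntaxError(f"unexpected token {head!r}")
    pos += 1
    is_call = pos < len(tokens) and tokens[pos] == "(" and not head.isdigit()
    if not is_call:
        return head, pos
    pos += 1  # consume '('
    args = []
    if tokens[pos:pos + 1] != [")"]:
        while True:
            arg, pos = parse_expr(tokens, pos)  # descend into the argument
            args.append(arg)
            if pos < len(tokens) and tokens[pos] == ",":
                pos += 1
                continue
            break
    if pos >= len(tokens) or tokens[pos] != ")":
        raise SyntaxError("expected ')'")
    return ("call", head, args), pos + 1


def parse(text):
    """Parse a whole string, consuming tokens from left to right."""
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For instance, parse("payTax(taxComputation(income), 2010)") yields a nested tree
whose root is the call to payTax and whose first argument is the call to
taxComputation.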
4.4.2.2 Pattern Matching
The concept of code query is based on pattern matching. Pattern matching
is the act of checking for the presence of the constituents of a given, rigidly
specified pattern. Pattern matching is used to test whether things have a desired
structure, to find relevant structure, to retrieve the aligning parts, and to substitute
the matching part with a given keyword or variable.
Pattern matching detects simple patterns in a source program. These source
code patterns and expressions can be described by regular, context-free or
context-sensitive languages. A pattern is called regular if it defines a regular
language, and context-free if it defines a context-free language. In this research,
pattern matching is used to perform searches with regular expressions.
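The three uses of pattern matching named above (testing for a structure, retrieving
the aligning parts and substituting the matching part) can be sketched as follows.
The example uses Python's re module as a stand-in; the pattern, the sample line and
the replacement name computeTax are hypothetical.

```python
import re

# Hypothetical pattern: an assignment whose right-hand side is a function
# call, for example "tax = taxComputation(income);".
CALL_ASSIGNMENT = re.compile(r"(\w+)\s*=\s*(\w+)\s*\(")

line = "tax = taxComputation(income);"

# 1. Test whether the line has the desired structure.
match = CALL_ASSIGNMENT.search(line)
has_structure = match is not None

# 2. Retrieve the aligning parts (the two captured groups).
target, callee = match.groups()

# 3. Substitute the matching function name with a given keyword.
renamed = re.sub(r"\btaxComputation\b", "computeTax", line)
```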
4.4.2.3 Regular Expression
A regular expression is a sequence or pattern of characters that is matched
against a string of text when performing searches (RegEx, 2003). Once a regular
expression has been created, it is tested against a string. The regular expression is
enclosed in forward slashes. For instance, the regular expression /struct/ might be
matched against the code snippet below.
struct date {
    int month;
    int day;
    int year;
};
If the string contains struct, the match is successful. Table 4.2 below gives
examples of regular expressions that match common programming language
constructs.
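The /struct/ match and the patterns listed in Table 4.2 can be tried out with a few
lines of code. The sketch below uses Python's re module rather than the Java Pattern
class used by the prototype, and the sample source string is made up for
illustration.

```python
import re

snippet = """struct date {
    int month;
    int day;
    int year;
};"""

# /struct/ succeeds because the text contains the word "struct".
found_struct = re.search(r"struct", snippet) is not None

# A made-up line of C code that contains one construct of each kind.
source = 'int n = 42; /* C-style comment */ char *s = "hello";'

# DOTALL lets a comment span several lines; non-greedy .*? stops at the
# first closing */.
comments = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)
strings = re.findall(r'"[^"]*"', source)
numbers = re.findall(r"\b\d+\b", source)
reserved = re.findall(r"\b(enum|int)\b", source)
```

The string pattern does not handle escaped quotes inside a string literal, which is
one reason a regular expression alone is only an approximation of the C grammar.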
Table 4.2: Regular Expressions for Matching Common Programming Language
Constructs
Construct word    Example                                            Regular Expression
Comments          /* This is comments of C-Style */                  /\*.*?\*/
Strings           Allows the string to span across multiple lines    "[^"]*"
Numbers           Matches a positive integer number.                 \b\d+\b
Reserved Word     To find enum or int                                \b(enum|int)\b
4.4.3 Abstraction of Artifacts
The next process manipulates the extracted artifacts from the database and
generates the textual and graphical representations. The simple representation shows
the artifacts and detailed information based on the keyword.
An abstraction generally serves as a presentation that helps developers
understand the program structure, architecture and behavior. Abstraction consists of
two levels:
i. The low level of abstraction focuses on the source code and presents artifacts
from the source code such as function name, line number and filename.
ii. The high level of abstraction focuses on the architecture of the program and
presents function roles, either caller or callee.
4.4.3.1 Code Query in Textual Representation
A textual representation is very helpful for gaining a basic idea of the
software. In the textual representation, the occurrences of the keyword are stated by
line number and filename, and the keyword is highlighted in the code snippet.
In other research, textual representation is used to analyze the relationships
between other components of the system. Figure 4.5 shows the metrics on the
important files related to HTChunk.c. From this textual representation, the
maintainer can analyze how many files will be involved if HTChunk.c is modified.
These findings indicate that HTAAFile.c is a file closely related to HTChunk.c.
Figure 4.5 : Metrics on the important files related to HTChunk.c
4.4.3.2 Code Query in Graphical Representation
The simple graphical representation is used to show the artifacts and the
detailed information based on the keyword. Each function is drawn as a node, and an
arrow between two function nodes represents the relationship between the
functions. From the graphical representation, the maintainer can also identify the
caller and callee functions.
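A node-and-arrow view of this kind can be sketched by writing the call edges in the
Graphviz DOT text format. The thesis does not state that Graphviz is used, so the
output format is an assumption, as are the function name to_dot and the sample
edges taken from the tax calculation example.

```python
def to_dot(edges, keyword):
    """Render (caller, callee) edges as DOT text; the keyword node is shaded."""
    lines = ["digraph calls {"]
    lines.append(f'    "{keyword}" [style=filled];')
    for caller, callee in edges:
        # One arrow per call relationship, pointing from caller to callee.
        lines.append(f'    "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges, reusing the tax calculation example.
edges = [("main", "taxComputation"), ("main", "payTax")]
dot_text = to_dot(edges, "payTax")
```

The resulting text can be passed to any DOT renderer to draw the nodes and arrows.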
For example, consider computed results for queries of the form “Which
functions in the program could directly access the representation of component X of
variable Y?” Figure 4.6 shows the results of a query on the “current vehicle” field of
the map_manager_global global variable (queries involving local variables and
function parameters are also allowed). The shaded nodes are the definitions that
directly access representations; that is, whose code constrains the representation of
the value in question. In this case, the value in question is a structure, and the shaded
nodes constrain the type by accessing fields of the structure.
Given that the “veh_” functions are operations on the vehicle abstract data
type, it is easy to see that abstraction may be violated in the functions
map_mgr_process_image, map_mgr_process_geometry_range_window and
map_mgr_comp_range_window, but nowhere else.
Figure 4.6: Graphical presentation for function and variables of vehicle simulation.
4.5 Summary
The code query model highlights the ability to find a keyword in the source
code. The model focuses on static analysis in order to capture a rich set of
relationships among structured programming artifacts, which makes it possible to
find the occurrences of a keyword in the source code. The artifacts, the relationships
between artifacts and the occurrences of the artifacts are presented via multiple
levels of abstraction.
CHAPTER 5
DESIGN AND IMPLEMENTATION OF CODE QUERY
5.1 Introduction
The objective of this chapter is to present the design and implementation of
the proposed program understanding model and approach. The design includes its
architecture, use cases, interactions and operations. It is followed by a brief
explanation of the implementation.
5.2 Code Query Design
The proposed system is viewed as a complete system within its defined scope
that covers several subsystems and interfaces. The description of the proposed
system is divided into the code query architecture, use case diagram, modules and
operations to aid understanding of the design process.
5.2.1 Code Query Architecture
Code Query is designed to fulfill a critical need of the software maintenance
process. The application attempts to automate the cognition of code components and
to abstract the artifacts in order to support program understanding, which is a
fundamental issue in software maintenance. The application is not meant to provide a
total solution to software maintenance, but it can serve as an additional tool to
support program understanding. Before proceeding to the implementation, it is
necessary to understand the architecture of Code Query.
Figure 5.1 : Code Query Architecture
Code Query is developed in Java Swing and runs on Microsoft
Windows. It is specially built to understand target systems written in the C
programming language. Code Query (Figure 5.1) was designed with two main
functionalities: extracting code artifacts and abstracting the artifacts. The extracted
information needs to be analyzed and transformed into the artifacts repository for
abstraction purposes.
5.2.1.1 Program Change Request
The program change request (PCR) is a documented deviation from, or
addition to, the project specifications in whatever form they are found on a particular
project. A PCR is used when the software needs to be modified. In this context, the
PCR is translated into a keyword that represents the initial target of change. With
this keyword in hand, the software maintainer interacts with Code Query to find the
occurrences of its effects in the related source code. Code Query matches the
keyword against the source code using pattern matching. The match results, such as
function name, line number, filename and the direct relationships between artifacts,
are stored in the repository.
5.2.1.2 Artifacts Repository
The artifacts repository is the storage that keeps a collection of artifacts, such
as functions, filenames, line numbers and their relationships, captured from the
available source code. The repository is prepared as a database that stores all the
related information and makes it readily available to the Code Query
implementation. The artifacts stored in the repository are:
i. Callee
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
ii. Caller
   a. Function Name
   b. File Name
   c. Line Number
   d. File Path
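A repository holding these four attributes for each caller and callee could be laid
out as a single table. The sketch below is an assumption for illustration only: it
uses an in-memory SQLite database in Python as a stand-in for the database of the
prototype, and the table name, column names and helper functions are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the repository database; the table and
# column names are hypothetical.
SCHEMA = """
CREATE TABLE artifact (
    role          TEXT NOT NULL CHECK (role IN ('caller', 'callee')),
    function_name TEXT NOT NULL,
    file_name     TEXT NOT NULL,
    line_number   INTEGER NOT NULL,
    file_path     TEXT NOT NULL
)
"""


def open_repository():
    """Create an empty in-memory repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def store(conn, role, function_name, file_name, line_number, file_path):
    """Insert one caller or callee record."""
    conn.execute(
        "INSERT INTO artifact VALUES (?, ?, ?, ?, ?)",
        (role, function_name, file_name, line_number, file_path),
    )


def find_by_role(conn, role):
    """Return the function names stored under the given role."""
    rows = conn.execute(
        "SELECT function_name FROM artifact WHERE role = ? "
        "ORDER BY function_name",
        (role,),
    )
    return [row[0] for row in rows]
```

Storing the role as a column keeps callers and callees in one table, so a single
query per role is enough to rebuild the caller-callee view later.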
5.2.1.3 Extraction Process
The extraction process captures the artifacts and their relationships from the
available resources. This process involves the source code and supporting tools.
Supporting tools are required to extract the existing artifacts using static analysis as
described in the previous chapter. During the extraction process, three techniques
are used:
i. Parsing Technique
   - To parse the source code using top-down parsing
ii. Pattern Matching Technique
   - To check for the presence of the constituents of a given keyword and
   specified pattern
iii. Regular Expression Technique
   - To create a pattern of characters that is matched against a string of
   text when performing searches
The artifacts that need to be extracted from the source code are identified as
follows:
a) Caller / callee function
   - synthesizes the function name defined in the module
   - used to determine the beginning of each function
b) Typedef
   - contains the function return type
c) Line number
   - indicates the line number of each artifact in the source code
d) File information
   - indicates the file name and file path
Some useful information must be extracted from the source code before
the subsequent keyword query takes place. The code parser is used to parse the code.
Each line of code is parsed into a single string. That string is tested against the given
keyword using pattern matching to obtain the concept location of the keyword. The
concept location gives the line number, file name and file path of the matched
keyword.
Based on the concept location found, Code Query checks the keyword type: whether
the keyword is a function or a variable used in one of the functions in the source
code. The function found is defined as the callee function, and the functions that call
it are defined as caller functions. To capture the function name, pattern matching
with a regular expression for the function pattern in C code is used. The concept
location information and the artifact relationships captured are stored in the artifacts
repository before the abstraction process is carried out.
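The line-by-line procedure described above can be sketched as follows. This is a
simplified illustration in Python, not the code of the prototype: the function
pattern FUNCTION_DEF, the record layout and the name locate are assumptions, and the
pattern only recognizes a definition header that occupies one line.

```python
import os
import re

# Assumption: a simplified pattern for a one-line C function header such as
# "int dictionary_load(char *name) {". Real C code needs a full parser.
FUNCTION_DEF = re.compile(r"^\s*(?:\w+[\s*]+)+(\w+)\s*\([^;{}]*\)\s*\{?\s*$")


def locate(keyword, file_path, lines):
    """Return one record per line, inside a function, that mentions the keyword."""
    word = re.compile(r"\b" + re.escape(keyword) + r"\b")
    records, current_function, depth = [], None, 0
    for line_number, line in enumerate(lines, start=1):
        definition = FUNCTION_DEF.match(line)
        if depth == 0 and definition:
            current_function = definition.group(1)  # start of a function
        if word.search(line) and current_function is not None:
            # The header of the keyword's own definition marks the callee;
            # any other enclosing function is recorded as a caller.
            is_definition = depth == 0 and current_function == keyword
            records.append({
                "role": "callee" if is_definition else "caller",
                "function": current_function,
                "file_name": os.path.basename(file_path),
                "file_path": file_path,
                "line_number": line_number,
            })
        depth += line.count("{") - line.count("}")
        if depth == 0 and "}" in line:
            current_function = None  # the function body has ended
    return records
```

Braces inside string literals or comments would mislead the depth counter, so the
sketch is suitable only for conventionally formatted code.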
5.2.1.4 Abstraction Process
Code Query is developed as a software prototype tool that implements the
model in order to observe its effectiveness and usability. The Code Query system
provides two abstractions: textual representation and graphical representation.
i. Textual representation
   The search results are displayed in a caller-callee table and a code snippet
   based on the given keyword. The information is retrieved from the artifacts
   repository based on the keyword and its related artifacts.
ii. Graphical representation
   The extracted components are retrieved from the database and the artifacts
   are abstracted. The abstraction shows the artifacts and detailed information
   based on the keyword, using nodes and arrows for the relationships.
5.2.2 Code Query Use Case
Figure 5.2 shows the Use Case diagram of the Code Query system, which
involves a user and processes such as Get Program Change Request (PCR), Identify
Keyword, Do Extraction and Do Abstraction.
[Figure 5.2 labels: Code Query System; Software maintainer; Get PCR; Identify
Keyword; Do Extraction; Do Abstraction]
Figure 5.2 : Use Case Diagram of Code Query System
A user represents a software maintainer who is responsible for making
changes in the software system based on the PCR. The software maintainer interacts
with the Do Extraction process by keying in the keyword. In response to this request,
the Do Extraction process performs the search and controls and classifies the file
path and location of the keyword. Do Extraction then interacts with Do Abstraction
to abstract the links between the keyword and other source code files and to display
the diagram to the software maintainer.
The processes in the Use Case diagram of the Code Query system shown in
Figure 5.2 are explained below.
1. Get PCR
a) Aim
This process is used to obtain the change request so that the software
maintainer can perform problem analysis.
b) Characteristic of Activation
It is activated upon a program change request.
c) Pre-Condition
Not available
d) Basic Flows
i) The software maintainer accepts the change request.
ii) The software maintainer analyzes the problem.
2. Identify Keyword
a) Aim
This process is used to identify the keyword after analyzing the PCR.
b) Characteristic of Activation
It is activated once the PCR has been analyzed.
c) Pre-Condition
Keyword is identified.
d) Basic Flows
i) The software maintainer analyzes the PCR.
ii) The software maintainer identifies the keyword.
3. Do Extraction
a) Aim
This process is used to query the defined keyword and related artifacts.
b) Characteristic of Activation
It is activated once the keyword has been defined.
c) Pre-Condition
Code Query is initialized.
d) Basic Flows
i) Browse to the source code location.
ii) The software maintainer keys in the keyword.
iii) Search for the keyword in the source code files.
iv) Search for artifacts related to the keyword.
v) Get detailed information for the keyword and other artifacts.
vi) Store the results in the artifacts repository.
4. Do Abstraction
a) Aim
This process provides a service to the software maintainer by abstracting
the artifacts in the source code that are related to a keyword.
b) Characteristic of Activation
It is activated when Do Extraction has completed the search process.
c) Pre-Condition
The information from the repository is needed.
d) Basic Flows
i) Get the keyword relationships with other artifacts.
ii) Retrieve the relationships from the repository.
iii) Populate the results into the abstract representation.
iv) The Use Case ends.
5.2.3 Code Query Class Interactions
The Code Query system is made up of seven main classes, specifically
designed to automate software code query. Figure 5.3 shows the code query class
diagram, in which the arrows represent the activations between classes. Upon
receiving an activation, each class, via its main operation, manages and activates
other related operations to carry out some tasks. Each class contains a set of
operations and attributes; however, the attributes are purposely left out of the figure
in order to show the class functionalities. Figure 5.4 complements the class diagram
by showing the sequence of operations of the code query implementation.
Figure 5.3 : Code Query Class Diagram
Figure 5.4 : Code Query Sequence Diagrams
1. SplashScreen
The main() operation launches the introduction screen of the CQuery application.
i) Pre Condition
The CQueryWindow screen is displayed.
ii) Post Condition
Code Query is executed.
iii) Algorithm
Below is the simple algorithm of the operation.
    Activate ShowSplash() to initialize MainFrame
2. CQueryWindow
The CQueryWindow() operation loads the database connection and the frame
properties.
i) Pre Condition
The database connection is physically and logically configured for the Microsoft
Access database. initProperties() defines the CQueryWindow properties.
ii) Post Condition
CQueryWindow and jInternalFrame1 are displayed.
iii) Algorithm
The following is a brief algorithm of the operation for initializing components.
    Check database connection
    Set and initialize the system status to workable
    Activate all global variables to initialize values and variables
    Activate initComponents() for MainFrame GUI
    Activate initProperties() for internalPrimary data.
    Close the system
The following is a brief algorithm of the operation for keyword searching and
artifacts extraction.
    Perform search the keyword
    Initiate DirectoryFile.exists() to check whether the directory exists
    If directory exists
        Delete the existing data from the repository
        Activate searchAllFile() to search all file names
        Activate searchAllFunction() to search all functions
        Activate searchAllCaller() to search all caller functions
        Store the information into the repository
        Activate appendStatistics() to create result of snippet code and
        statistical information
        Activate createTable() to create report list
        End of searching
    If directory doesn't exist
        End of searching
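The control flow of this search can be sketched in a few lines. The sketch is written
in Python rather than the Java of the prototype and is not the author's
implementation: the names search_all_files and run_search, the plain list used as a
repository and the substring test for the keyword are assumptions, and the function
and caller detection steps are left out.

```python
import os


def search_all_files(directory):
    """Collect the paths of all .c files below the directory."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".c"):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def run_search(directory, keyword, repository):
    """Check the directory, clear old data, scan the files and report statistics."""
    if not os.path.isdir(directory):
        return None  # directory does not exist: end of searching
    repository.clear()  # delete the existing data from the repository
    searched_lines = matched_lines = 0
    paths = search_all_files(directory)
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                searched_lines += 1
                if keyword in line:
                    matched_lines += 1
                    # Keep the code snippet with its location.
                    repository.append((path, line_number, line.strip()))
    return {
        "files": len(paths),
        "searched_lines": searched_lines,
        "matched_lines": matched_lines,
    }
```

Clearing the repository before each search means that the stored results always
describe the most recent keyword only.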
The following is a brief algorithm of the operation for abstraction.
    Perform abstraction
    Retrieve the information from the repository
    Activate createVertexEdge() to create Vertex Edge
    Activate createInternalPane() to create Internal Pane
    Display textual/graphical representation
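The step that turns repository records into vertices and edges might look like the
sketch below. The body of createVertexEdge() is not given in the thesis, so this
Python function is a hypothetical reconstruction of the idea only; the records are
the roles reported for the keyword dmsc_open in Section 5.3.

```python
def create_vertex_edge(records, keyword):
    """Derive graph vertices and caller -> callee edges for one keyword."""
    vertices = {keyword}
    edges = set()
    for role, function_name in records:
        if role == "caller":
            vertices.add(function_name)
            edges.add((function_name, keyword))  # arrow from caller to callee
    return sorted(vertices), sorted(edges)


# Roles reported for the keyword dmsc_open in Section 5.3.
records = [
    ("callee", "dmsc_open"),
    ("caller", "dictionary_Load"),
    ("caller", "document_GenerateIndex"),
    ("caller", "index_File"),
]
vertices, edges = create_vertex_edge(records, "dmsc_open")
```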
[Figure 5.5 labels: Start; Receive Keyword; C Files; Parse Source Code; Keyword
Checking; Match Keyword? (Yes / No); Extract Artifacts; Match Pattern? (Yes / No);
Get Line Number / File Name / File Path; Relationship Information; Code Query
Repository; Do Abstraction; Textual / Graphical Representation; End]
Figure 5.5: Code Query Process Algorithm Flowchart.
Process algorithm explanation:
i. Receive the keyword from the PCR.
ii. Parse the source code from the C files.
iii. Check whether the keyword matches the source code.
   a) If the keyword matches, store it in the Code Query repository.
   b) If it does not match, end the process.
iv. Extract artifacts to get the relationships for the keyword (caller / callee
function).
   a) If the function pattern matches, get the line number, file name and
   path name. Store the data in the repository.
   b) If not, repeat the process for the next line of code.
v. Do abstraction for a successful keyword search.
   a) Fetch the artifacts from the Code Query repository and
   transform them into the textual representation.
   b) Fetch the artifacts from the Code Query repository and
   transform them into the graphical representation.
3. SourceCodeViewer
The SourceCodeViewer() operation displays the source code.
i) Pre Condition
A button press is the trigger for viewing the source code.
ii) Post Condition
The source code frame is displayed.
iii) Algorithm
The following is a brief algorithm of the operation for the source code viewer.
    Activate btnEnter() for source code viewer
    Activate btnEnterActionPerformed().
    Initialize initComponents() to prepare source code frame.
    Activate ViewCodeC() to display source code with highlight functions.
    Display SourceCodeViewer
5.3 Code Query Implementation and User Interfaces
Code Query assumes that a change request has already been translated
and expressed in terms of some acceptable functions. Code Query was designed to
search for the keyword derived from the program change request. Given a keyword
as a change request, Code Query determines its effects on other functions and its
locations in the related source code files.
Figure 5.6: Code Query Introduction Screen.
Figure 5.6 shows the Code Query introduction screen, which presents
information about Code Query to the user. Figure 5.7 shows the first user interface of
Code Query, from which the keyword search is started.
Figure 5.7: First user interface of Code Query.
Code Query displays a file path field for the maintainer to browse to the source
code directory and a keyword field for the maintainer to define the keyword. The
Visualization button is activated when the keyword has been searched successfully.
Figure 5.8: File Path and the Keyword Field.
Figure 5.8 shows that the maintainer defines D:\Generate_index\GI Codes\GI as
the directory of the C source code and dmsc_open as the keyword.
Figure 5.9: Textual Representation
Figure 5.9 shows the source code that matches the keyword. The output
includes the results list, the code snippet results and the statistical information. The
results list shows the caller and callee details such as function name, file name, line
number and file path. In the results list, dmsc_open is defined as the callee, and
dictionary_Load, document_GenerateIndex and index_File are defined as callers.
The code snippet results show the keyword match within a single line of source
code. The statistical information shows the total matched and searched lines of code,
files and directories.
Figure 5.10: Graphical Representation
Figure 5.10 shows the functions that are involved with the selected
keyword. In this window, the maintainer can see the detailed information for
dmsc_open. For instance, dictionary_load, document_GenerateIndex and index_File
are functions that call dmsc_open.
Figure 5.11: Low Level of Abstraction - Source Code viewer
Figure 5.11 shows the location of the keyword dmsc_open, which is highlighted
in red. From the source code viewer, the maintainer can directly read and
understand the logic of the code.
Figure 5.12: High Level of Abstraction - Detail Relationship of Artifacts
Figure 5.12 shows the detailed relationships between the functions. For example,
the dictionary_load function calls the dmsc_open, constructionWord and dmsc_close
functions, and dictionary_load is itself called by the document_GenerateIndex
function. Source code changes in a callee function might affect some parts of its
caller functions.
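The remark that a change in a callee may affect its callers can be turned into a
small impact query. The sketch below, in Python, follows the call edges backwards
from a changed function. Following the edges transitively is an assumption added for
illustration, as is the function name affected_callers; the edges themselves are the
ones stated in the discussion of Figures 5.10 and 5.12.

```python
from collections import deque

# Call edges stated in the discussion of Figures 5.10 and 5.12,
# written as (caller, callee).
CALLS = [
    ("dictionary_load", "dmsc_open"),
    ("dictionary_load", "constructionWord"),
    ("dictionary_load", "dmsc_close"),
    ("document_GenerateIndex", "dictionary_load"),
    ("document_GenerateIndex", "dmsc_open"),
    ("index_File", "dmsc_open"),
]


def affected_callers(changed, calls):
    """Return every function that reaches `changed` through call edges."""
    callers_of = {}
    for caller, callee in calls:
        callers_of.setdefault(callee, set()).add(caller)
    seen, queue = set(), deque([changed])
    while queue:  # breadth-first walk from callee to callers
        current = queue.popleft()
        for caller in callers_of.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)
```

With these edges, a change to constructionWord reaches dictionary_load directly and
document_GenerateIndex indirectly.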
5.4 Other Supporting Tools
A supporting tool, namely Regular Expression – Java Pattern, was used to
facilitate the Code Query model and approach. A regular expression specifies a set of
strings that match it; the functions in this module let you check whether a particular
string matches a given regular expression (or whether a given regular expression
matches a particular string, which comes down to the same thing). In this research,
Java Pattern is used as the regular expression engine to extract or parse the source
code and token file to get the related information. All the information is stored in the
repository for use in the Code Query system.
5.5 Summary
A Code Query approach was developed in this research to support the crucial
need for source code query and for better analysis results. This chapter has described
the system architecture of the proposed model and its design, which includes the
system Use Case and class diagrams. The implementation covers class interactions,
Code Query algorithms and user interfaces. The main contribution of the proposed
system is its ability to support abstraction based on keyword query. Other supporting
tools were used to capture the relationship information in the source code files.
CHAPTER 6
EVALUATION
6.1 Introduction
This chapter discusses the evaluation methods for the program understanding
model. The approach involves evaluating the model, the case study and the
experiment. The objective of this evaluation is to verify that the proposed code query
model and approach support program understanding. To achieve this objective,
firstly this research elaborates on the modeling itself and how each model
specification item is satisfied. Secondly, the output of the analysis is then tested on a
case study through a controlled experiment. Thirdly, the quantitative and qualitative
evaluations are then measured based on the scored results. Some other qualitative
values are also considered by a comparative study made on existing program
understanding models and approaches in order to strengthen the overall findings.
This chapter is concluded with a summary of the evaluation.
6.2 Case Study
The main criterion for the Code Query empirical study is to verify the concept.
The criteria should also enable Code Query to be accepted as a software
maintenance model. The empirical study requires the following:
i. A software project with complete source code as a case study.
ii. A software project that has both caller and callee functions, in order to
demonstrate the proposed model.
6.2.1 Outlines of Case Study
GI is an individual development project assigned to postgraduate students of
software engineering at the Centre for Advanced Software Engineering (CASE),
Universiti Teknologi Malaysia (UTM). In this research, the code query model
was applied to a case study of this software project, called Generate Index (GI),
which is written in C.
The main reason for selecting this case study is that the CASE students who
acted as subjects were already familiar with the case study and its domain
knowledge, which is one of the crucial criteria for establishing an experiment. Prior
to the experiment, the code was analyzed to trace the occurrences of the function
relationships.
6.2.2 GI Project Briefing
GI was a word processing system that was launched by a user by specifying
the name of the document to be analyzed; a word document consisted of a file of
characters created and edited by a user. In addition, the system produced a document
that kept the index of each word in the analyzed word document by referring to a
dictionary that contained the list of words to be indexed. The system consisted of
four modules: MMIMS, SystDoc, LibAdt and DMSC.
The GI system was introduced to the subjects five months before the
experiment was conducted, during which time they studied the system to perform
their minor project assignment. The subjects had also taken a C language module in
their previous semester. Consequently, the subjects had some idea of what the system
was about and of the C language itself. Their previous experience removed the need
to brief the subjects on the system because they already had some domain and
application knowledge. This enabled us to focus on training the subjects in using the
tool to solve the maintenance tasks assigned.
6.3 Controlled Experiment
The aim of the experiment is to see how well the prototype supports maintenance
tasks, in terms of its efficiency and accuracy in locating a keyword in the
source code, and how well it supports program understanding. In this experiment
the subjects were provided with a complete GI system, including its source code
and documentation.
6.3.1 Subject and Environment
The subjects of the study were industry professionals and final-semester
post-graduate students of the software engineering course at the Centre for
Advanced Software Engineering, Universiti Teknologi Malaysia (UTM). They had a
basic knowledge of software maintenance, as this subject is taught in the
post-graduate program. The students were familiar with C, as this language is
taught and used in some coursework projects alongside C++ and Java. In the
coursework program, the students were also taught software project management
and how to read and write documentation based on the DOD standard MIL-STD-498.
The subjects were motivated to take part in the experiment by making clear that
they would gain valuable experience in the know-how of software maintenance. Our
investigation showed that all but one of the subjects had at least one year of
working experience in the software industry; the remaining subject had less than
one year.
6.3.2 Questionnaires
The questionnaire was specially formulated to support the user evaluation in
the controlled experiment. The main objective of our assessment was to measure
the efficiency and accuracy of keyword searching in the context of program
understanding. The questionnaire consists of two sections, covering the
subject's current professional background and the evaluation of the tools (see
Appendix A).
6.3.3 Experimental Procedures
Twenty participants took part in the GI maintenance case study described above.
The subjects were given a briefing and a training session before the actual
experiment took place. They were provided with a set of system documentation,
the Rigi, Grep, CodeSurfer and Code Query tools (Appendix B) and the source code
of the GI project. The subjects then performed some cognitive understanding
tasks prior to the experiment.
The purpose of the training session was to avoid confounding factors, such as
subjects being unaware of the experimental setting and procedures or of how to
perform the searches in the required manner. In the experiment, each subject was
given a set of change request questions from which to identify a keyword. The
keyword was then queried using the four tools provided, and the subject used the
tools to understand the GI code based on the keyword identified earlier.
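The keyword query step performed by the subjects can be pictured with a minimal sketch. The Python below is an illustration only and is not the Code Query implementation: the function name `find_keyword`, the choice of file extensions and the sample file are assumptions made here. It reports the file path, filename, line number and text of every line that contains the keyword as a whole word.

```python
import os
import re
import tempfile


def find_keyword(root, keyword, extensions=(".c", ".h")):
    """Return (file path, filename, line number, line text) for each match.

    Hypothetical helper: walks `root`, scans C source and header files,
    and matches `keyword` as a whole word.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(extensions):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line_no, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        hits.append((path, name, line_no, line.strip()))
    return hits


if __name__ == "__main__":
    # Invented sample file, used only to exercise the sketch.
    sample_lines = [
        "int add_word(char *w);",
        "int main(void) {",
        '    return add_word("a");',
        "}",
    ]
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "index.c"), "w", encoding="utf-8") as f:
            f.write("\n".join(sample_lines) + "\n")
        for path, name, line_no, text in find_keyword(root, "add_word"):
            print(name, line_no, text)
```

A plain text search of this kind corresponds to what a Grep-style tool returns; it does not by itself distinguish callers from callees.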
6.3.4 Possible Threats and Validity
A few factors may threaten the validity of this study. They are listed below:
1. The participants in this experiment were post-graduate students and industry
professionals. Hence, it is possible that they were already familiar with the
GI project.
2. The subjects may have been unfamiliar with the tool under evaluation. To
reduce this learning curve, a short training session on the use of the Code
Query tool was held prior to the experiment. The controlled experiment was
conducted on all subjects under proper supervision. Two days were allocated to
the experiment in order to secure good results and to foster the participants'
enthusiasm for the study.
3. Lastly, the subjects may have made careless mistakes in their results, caused
by human error in manually examining the software components. This factor was
addressed by double-checking the work flows of each group after the experiment.
6.4 Experimental Result
The analysis consists of two parts: the analysis of the controlled experiment
and the analysis of the usability.
6.4.1 Analysis of the Controlled Experiment
The aim of the experiment is to see how well the prototype supports the
maintenance task in terms of its efficiency and the understanding of the
software. The subjects were provided with a complete GI system, including its
source code and documentation. Given some change requests, the subjects were
first asked to obtain the occurrences of the keyword in the source code.
The analysis of the controlled experiment was based on the values of variables
derived from the metrics specified in Table 6.1. For the “Past Experience”
factor, Table 6.1 reveals the variation among the sample population. The largest
group of subjects (40 percent) were software engineers / system analysts,
followed by programmers (30 percent), other jobs, namely system engineer and
product engineer (15 percent), project leaders / project managers (10 percent)
and lecturers (5 percent).
Table 6.1: Cross tabulation of job versus frequencies
Job                                  Frequency  Percent  Valid Percent  Cumulative Percent
Lecturer                             1          5.0      5.0            5.0
Project Leader / Project Manager     2          10.0     10.0           15.0
Other                                3          15.0     15.0           30.0
Programmer                           6          30.0     30.0           60.0
Software Engineer / System Analyst   8          40.0     40.0           100.0
Total                                20         100.0    100.0
The same procedure was used to analyze the “years of experience” factor. The
subjects' ranges of experience in software development and in software
maintenance are shown in Table 6.2 and Table 6.3 respectively.
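The percent and cumulative percent columns in these tables follow directly from the frequencies. As an illustration, the Python sketch below recomputes the columns of Table 6.1. The function name `frequency_table` is an assumption introduced here and is not part of the study's tooling.

```python
def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent).

    `counts` is a list of (category, frequency) pairs in display order.
    """
    total = sum(freq for _category, freq in counts)
    rows = []
    running = 0  # running sum of frequencies, used for the cumulative column
    for category, freq in counts:
        running += freq
        rows.append((category, freq,
                     round(100.0 * freq / total, 1),
                     round(100.0 * running / total, 1)))
    return rows


# Job frequencies as reported in Table 6.1.
jobs = [
    ("Lecturer", 1),
    ("Project Leader / Project Manager", 2),
    ("Other", 3),
    ("Programmer", 6),
    ("Software Engineer / System Analyst", 8),
]

for category, freq, percent, cumulative in frequency_table(jobs):
    print(f"{category:<36}{freq:>3}{percent:>7.1f}{cumulative:>7.1f}")
```

The same function can be applied to the frequencies of Table 6.2 and Table 6.3.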
Table 6.2: Cross tabulation of experience in software development
Year of Experience   Frequency  Percent  Valid Percent  Cumulative Percent
1-2 years            5          25.0     25.0           25.0
3-4 years            4          20.0     20.0           45.0
Less than 1 year     2          10.0     10.0           55.0
More than 4 years    9          45.0     45.0           100.0
Total                20         100.0    100.0
Five subjects had 1 to 2 years of experience in development, only two had less
than 1 year, and nine had more than 4 years. Only four subjects had 3 to 4 years
of experience (see Table 6.2).
Table 6.3: Cross tabulation of experience in software maintenance
Year of Experience   Frequency  Percent  Valid Percent  Cumulative Percent
1-2 years            4          20.0     20.0           20.0
3-4 years            3          15.0     15.0           35.0
Less than 1 year     8          40.0     40.0           75.0
More than 4 years    4          20.0     20.0           95.0
No experience        1          5.0      5.0            100.0
Total                20         100.0    100.0
Besides development, most of the subjects also had experience in maintenance.
Only one subject had no experience in maintenance tasks, and eight subjects had
less than one year of experience. Four subjects had more than 4 years of
maintenance experience (see Table 6.3).
6.4.2 Analysis of the Usefulness and Usability Study
The subjects were asked to evaluate the Code Query tool (Table 6.5) with respect
to its usefulness and usability in supporting program understanding, on a
six-point scale (1-Not At All, 2-Low, 3-Moderate, 4-Useful, 5-Very Useful,
6-Extremely Useful). The study also collected the subjects' feedback comparing
the usefulness and usability of the four tools: GREP (GP), CodeSurfer (CS), Rigi
(RG) and Code Query (CQ). Figure 6.1 shows the mean evaluation score for the
usefulness and usability of the four tools. The questions are listed in
Appendix A.
[Chart: mean evaluation scale (0 to 5) plotted against question number (1 to 15) for the four tools CS, RG, GP and CQ]
Figure 6.1 : Usefulness and Usability of Tools
Table 6.4: Mean of scores for Code Query
Question No.   CS     RG     GP     CQ
1              3.00   2.17   3.67   4.50
2              2.50   2.50   3.67   4.33
3              2.50   2.50   3.83   4.17
4              2.33   2.50   3.33   3.83
5              2.67   2.33   3.33   3.83
6              2.67   2.83   3.00   4.50
7              2.33   2.33   2.83   3.33
8              2.83   2.83   3.17   3.83
9              3.17   3.33   3.83   4.00
10             3.17   3.17   3.67   4.00
11             2.33   3.33   3.17   4.33
12             2.33   3.17   2.83   4.00
13             2.50   3.00   2.67   4.17
14             2.17   2.00   2.83   3.83
15             2.33   2.00   3.17   3.67
For Table 6.4, the subjects were asked to evaluate the overall usefulness of
Code Query in supporting program understanding on a five-point scale (1-Strongly
Disagree, 2-Disagree, 3-Normal, 4-Agree, 5-Strongly Agree). For overall ease of
use (Question 1) in Figure 6.1, the mean score was 4.50 for CQ, 3.67 for GP,
2.17 for RG and 3.00 for CS. The majority of the sample agreed that the Code
Query tool was easy to use overall. For Question 3, the mean score was 4.17 for
CQ, 3.83 for GP, and 2.50 for both RG and CS, which lies between disagree and
normal. The result in Table 6.4 shows that Question 3 was rated as very
satisfactory for CQ.
Based on Table 6.4 and Figure 6.1, most of the subjects strongly agreed that CQ
was a very useful tool for keyword query searching. CQ provided the query
capability to search for the occurrences of relationships between functions in
the source code. The subjects' mean ratings for CQ fell between 3 and 5, that
is, mostly between agree and strongly agree.
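The comparison drawn from Table 6.4 can be reproduced mechanically. The Python sketch below is an illustrative check only; the helper names `best_tool_per_question` and `overall_mean` are assumptions introduced here, and the scores are copied from Table 6.4.

```python
# Mean scores per question (questions 1 to 15), copied from Table 6.4.
scores = {
    "CS": [3.00, 2.50, 2.50, 2.33, 2.67, 2.67, 2.33, 2.83,
           3.17, 3.17, 2.33, 2.33, 2.50, 2.17, 2.33],
    "RG": [2.17, 2.50, 2.50, 2.50, 2.33, 2.83, 2.33, 2.83,
           3.33, 3.17, 3.33, 3.17, 3.00, 2.00, 2.00],
    "GP": [3.67, 3.67, 3.83, 3.33, 3.33, 3.00, 2.83, 3.17,
           3.83, 3.67, 3.17, 2.83, 2.67, 2.83, 3.17],
    "CQ": [4.50, 4.33, 4.17, 3.83, 3.83, 4.50, 3.33, 3.83,
           4.00, 4.00, 4.33, 4.00, 4.17, 3.83, 3.67],
}


def best_tool_per_question(table):
    """Return, for each question, the tool code with the highest mean."""
    n_questions = len(next(iter(table.values())))
    return [max(table, key=lambda tool: table[tool][q])
            for q in range(n_questions)]


def overall_mean(values):
    """Arithmetic mean of a list of per-question means."""
    return sum(values) / len(values)


winners = best_tool_per_question(scores)
print("Highest-rated tool per question:", winners)
for tool, values in scores.items():
    print(tool, "min", min(values), "max", max(values),
          "mean", round(overall_mean(values), 2))
```

For the Table 6.4 values, CQ has the highest mean on all 15 questions, and its means range from 3.33 to 4.50, which is consistent with the 3 to 5 range described above.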
Figure 6.2: Usefulness of Tool (mean values based on Likert scale 1: Very Useless,
2: Useless, 3: Normal, 4: Useful, 5: Very Useful)
Table 6.5: Mean of Usefulness of Tools
Tools   N    Mean   Std. Deviation
CQ      8    4.62   .518
CS      2    2.00   .000
GP      4    3.25   .500
RG      6    3.67   .516
Total   20   3.80   .951
Figure 6.2 illustrates that CQ was rated the most useful tool on each usability
criterion when compared with the three control tools. For RG, most of the
usability criteria were rated more positively than for GP and CS. Both software
maintenance tools, Code Query and Rigi, had values above the mean value of 3.00
(normal scale), while all the criteria of CQ had mean values between 4.00 and
5.00, within the range from useful to very useful.
6.5 Analysis of Finding
This section discusses the findings of the research from two perspectives,
gathered from the controlled experiment and the usability study. The analysis is
based on the subjects' acceptance of the tools. This acceptance indicates that
the Code Query method could support program understanding.
6.5.1 Acceptance Tool
Tool acceptance means that the subjects agreed that a tool could help them in
program understanding. Twenty subjects were involved in this experiment and were
asked to use the four tools, including the tool based on the new approach.
Table 6.6 displays the mean comparison between the tools.
Table 6.6: Mean Comparison between Tools
Tools   N    Mean   Std. Deviation   % of Total N
CQ      8    4.62   .518             40.0%
CS      2    2.00   .000             10.0%
GP      4    3.25   .500             20.0%
RG      6    3.67   .516             30.0%
Total   20   3.80   .951             100.0%
From the above result, CQ is a good tool for supporting the subjects in program
understanding. Its 40% share of the total N is considered sufficient to indicate
that CQ is accepted by the subjects.
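The Total row of Table 6.6 can be cross-checked from the per-tool rows: the total mean is the N-weighted average of the tool means, and each percentage is the tool's N divided by the total N. The Python sketch below is only an illustrative check; the names `pooled_mean` and `share_of_total` are assumptions introduced here.

```python
# (N, mean) per tool, as reported in Table 6.6.
groups = {
    "CQ": (8, 4.62),
    "CS": (2, 2.00),
    "GP": (4, 3.25),
    "RG": (6, 3.67),
}


def pooled_mean(table):
    """N-weighted average of the group means."""
    total_n = sum(n for n, _mean in table.values())
    return sum(n * mean for n, mean in table.values()) / total_n


def share_of_total(table):
    """Percentage of the total N contributed by each group."""
    total_n = sum(n for n, _mean in table.values())
    return {tool: 100.0 * n / total_n for tool, (n, _mean) in table.items()}


print("Total mean:", round(pooled_mean(groups), 2))
for tool, share in share_of_total(groups).items():
    print(f"{tool}: {share:.1f}% of total N")
```

With the values of Table 6.6, the weighted mean rounds to 3.80 and the share for CQ is 40.0%, matching the Total row and the percentage column.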
6.5.2 Qualitative Evaluation
The research application was exposed to users by letting them use it and
evaluate its effectiveness under a controlled experiment. User perception took
into account the feedback and comments from users on the usefulness of the Code
Query prototype tool in supporting program understanding of C software. A
questionnaire was designed to establish the usability of the prototype, that is,
whether it is useful and effective in supporting program understanding. The
majority of the participants agreed that Code Query is easy to use for program
understanding. They found it easy to identify the keyword at source code level.
On the question of speeding up search time, the majority responded that Code
Query can return results in seconds. They also agreed that the tool makes it
easy to indicate the inter-relationship of functions for a keyword.
Table 6.7: Existing Features of Code Query Systems
Tools/
Features
Query
Language
Search Technique
Concept Location
Rigi
CodeSurfer
Grep
Code Query
N
C, C++
Regular
Expression,
Pattern
Matching
N
Y
C, C++
Regular Expression
Y
All extension
Pattern Matching
Y
C
Regular Expression,
Pattern Matching
Line no, filename,
file path for all
function
Show function
name on node
Line no, filename,
file path by the
searched keyword
N
Line no, filename,
file path for caller
and callee
Show line number,
function name on
node
Contain caller and
callee location based
on searched
keyword
Highlighted
searched keyword
Y
Y
Graphical
Representation
Show function
name on node
Textual
Representation
N
Location for
searched keyword
Location for
searched keyword
Source Viewer
N
Dependency
Occurrences
Y
Y
Highlighted
selected statement
Y
Y
Highlighted
searched word
N
N
The table above gives a qualitative comparison of Rigi, CodeSurfer, Grep and
Code Query. Each tool supports program understanding in its own way. Rigi has no
query facility and is based on visualization. CodeSurfer, Grep and Code Query
provide a search facility. Grep can search for a keyword in files of any
extension and is not specific to any programming language. Code Query and
CodeSurfer focus on source code query. All the tools except Grep provide source
code abstractions, which help to enhance program understanding through textual
or graphical representation.
All the tools except Rigi have a concept location feature. The difference is
that Code Query reports both caller and callee information as part of concept
location. In other words, Code Query provides concept location information from
the source (caller) to the destination (callee). This gives Code Query an
advantage in the qualitative comparison of features.
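To make the caller and callee idea concrete, the Python sketch below extracts (caller, callee, line number) triples from a small piece of C source. It is a deliberately naive, regex-based illustration and should not be read as the parsing method used by Code Query. It assumes that each function definition starts with a return type, that the opening brace is on the same line as the function header, and that no braces or call-like text appear inside strings or comments. The names `call_pairs`, `callers_of` and `callees_of` and the sample source are all assumptions introduced here.

```python
import re

# Words that look like calls ("if (...)") but are C keywords.
C_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}
# A definition: one or more type tokens, a name, a parameter list, then "{".
DEFINITION = re.compile(
    r"^\s*(?:[A-Za-z_]\w*[\s\*]+)+([A-Za-z_]\w*)\s*\([^;{]*\)\s*\{")
# A call: an identifier followed by an opening parenthesis.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def call_pairs(source):
    """Return (caller, callee, line number) triples found in C `source`."""
    pairs = []
    current = None  # function whose body is being scanned
    depth = 0       # brace nesting depth inside that function
    for line_no, line in enumerate(source.splitlines(), start=1):
        text = line
        if depth == 0:
            match = DEFINITION.match(line)
            if not match:
                continue  # prototypes, globals and blank lines are skipped
            current = match.group(1)
            depth = 1
            text = line[match.end():]  # scan only what follows the "{"
        for callee in CALL.findall(text):
            if callee not in C_KEYWORDS:
                pairs.append((current, callee, line_no))
        depth += text.count("{") - text.count("}")
        if depth <= 0:
            depth, current = 0, None
    return pairs


def callers_of(pairs, name):
    """Functions that call `name`."""
    return sorted({caller for caller, callee, _ in pairs if callee == name})


def callees_of(pairs, name):
    """Functions that `name` calls."""
    return sorted({callee for caller, callee, _ in pairs if caller == name})


# Invented sample source, used only to exercise the sketch.
SAMPLE = """\
int helper(int x) {
    return x + 1;
}

int main(void) {
    int y = helper(2);
    if (y > 0) {
        puts("positive");
    }
    return 0;
}
"""

pairs = call_pairs(SAMPLE)
print(pairs)
print("callers of helper:", callers_of(pairs, "helper"))
print("callees of main:", callees_of(pairs, "main"))
```

Querying a keyword in both directions, as `callers_of` and `callees_of` do, is what distinguishes this kind of result from a plain text match on the keyword.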
6.6 Summary
This chapter described the analysis and findings of the controlled experiment,
which evaluated how effective and accurate the proposed code query model is as a
query facility for program understanding. The results produced by the subjects
while completing the experiment with the prototype were used as the variables of
the analysis. The proposed model was accepted by the subjects, from which it is
deduced that the model makes a significant contribution to program
understanding. The subjects also agreed that the tool provides useful interfaces
and improves the productivity of software maintenance. The approach was compared
qualitatively with other similar approaches. In general, the model is able to
improve on some aspects of the features discussed in the earlier literature.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Introduction
This chapter summarizes the research by presenting its conclusions and its
significant contributions to both academia and practice. It also provides
suggestions for future research work. The chapter starts with a summary of the
software understanding approach proposed earlier in this research and an
explanation of how the research achieves each objective set earlier. It then
presents the main contributions of this research in response to the change
process in software evolution. Finally, it describes some limitations of the
current scope and possible areas of future research.
7.2 Contribution
The main contributions of the proposed Code Query model and approach can be
summarized as follows.
1. The new Code Query model provides keyword query functionality to search for
a keyword or code in the source code.
2. The new model provides a more useful and informative extraction of reliable
source code artifacts, such as the line number, file path, filename and
snippets of the code, based on a keyword or code query.
3. The new model provides information on the caller and callee based on the
keyword or code query.
4. The new model provides abstractions that show details of the function and
the occurrences of the keyword or code in the source code.
7.3 Research Limitation and Future Works
Despite the above contributions, the findings are subject to certain
limitations, as follows:
1. As large systems are extremely complex, this approach has so far been tested
on, and shown to be applicable to, a small-sized software system only. Large
systems may involve integrated applications on different platforms and
environments.
2. The new Code Query model focuses only on software written in C and covers
only structured programming.
Based on the scope established and the limitations discovered, the following
areas may constitute possible future work:
1. Research limitation (2) above can be addressed in future work. At present, C
is the only language covered by the research. Future research should find a
mechanism to support both the C and C++ languages.
2. Future work should cover both object-oriented and structured programs.
3. Future work should focus more on visualization methods, because
visualization is very useful for program understanding.
4. Future work should provide an editable source code viewer, which will help
the maintainer to edit the located concept in the source code immediately.
7.4 Summary
The research objectives defined in the early stages set the direction of this
research. The prototype was developed and tested by twenty people, and the
experiment achieved satisfactory results. The results of the experiment verify
the correctness and usefulness of the proposed approach. Hence, this approach
could help maintainers to understand software easily from its source code, with
less dependence on software documentation, which often does not reflect the
working version of the software.
REFERENCES
Anderson, P. and Zarins, M. (2005). The CodeSurfer software understanding
platform, Proceedings 13th International Workshop on Program
Comprehension, 147-148.
Asif, N. (2008). Artifacts Recovery at Different Levels of Abstractions. Information
Technology Journal.
Beron, M. M., Henriques, P. R., Pereira, M. J. V., Uzal, R. and Montejano, G.
(2006). A Language Processing Tool for Program Comprehension, XII
Argentine Congress on Computer Science (CACIC06). Potrero de los Funes,
San Luis, Argentina. October 2006.
Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., and Stal, M. (1996).
Pattern-oriented software architecture: A system of patterns. New York,
USA: John Wiley & Sons.
CodeSurfer (2007). CodeSurfer Overview.
http://www.grammatech.com/products/codesurfer/overview.html
Cox, A. and Collard, M.L. (2005). Textual Views of Source Code to Support
Comprehension. 13th International Workshop on Program Comprehension.
(IWPC 2005). St. Louis, MO, USA.
Damasevicius, R. (2006). On the Quantitative Estimation of Abstraction Level
Increase in Metaprograms. Journal of Computer Science and Information
System. June 2006.
Deursen A.V. (2001). Program Comprehension Risks and Opportunities in Extreme
Programming. Proceedings of the Eighth Working Conference on Reverse
Engineering (WCRE'01). Stuttgart, Germany. 176-188. October 2001.
Eichberg, M., Haupt, M., Mezini, M. and Schafer, T. (2005), Comprehensive
Software Understanding with SEXTANT, 21st IEEE International
Conference on Software Maintenance (ICSM'05). 315-324. Budapest,
Hungary. September 2005.
Erlikh, L. (2000). Leveraging Legacy System Dollars for E-business. IEEE IT
Professional Publication. 17-23. May 2000.
Frisch, A. and Cardelli, L. (2004). In Proceedings of the 31st International
Colloquium on Automata, Languages and Programming (ICALP'04). Turku,
Finland.
Froehlich, J. and Dourish, P. (2004). Unifying Artifacts and Activities in a Visual
Tool Distributed Software Development Teams. Proceedings of the 26th
International Conference on Software Engineering. Edinburgh, Scotland,
UK. May 2004.
Grune, D. and Jacobs, C. J. H. (2008). Parsing Techniques: A Practical Guide. (2nd
Edition). Ellis Horwood Ltd, UK: Prentice Hall.
Hoffer, J. A., George, J. F. and Valacich, J. S. (1999). Modern Systems Analysis and
Design. (2nd Edition). USA: Prentice Hall.
Ibrahim S. (2006). A Document-Based Software Traceability to Support Change
Impact Analysis of Object-Oriented Software. University Of Technology
Malaysia (UTM) : PhD Thesis.
ISO/IEC 14764 (2006). IEEE Std 14764-2006 Software Engineering -- Software Life
Cycle Processes – Maintenance.
Kajko-Mattsson, M. (2000). Preventive Maintenance! Do we know what it is?
Proceedings of the International Conference on Software Maintenance. San
Jose, USA. 12-14. October 2000.
Koschke, R. (2001), Software Visualization, International Seminar Dagstuhl Castle,
Germany, May 20-25, 2001.
Kwon, O. C., Boldyreff, C. and Munro, M. (1998). Software Configuration
Management for a Reusable Software Library within a Software Maintenance
Environment. The International Journal of Software Engineering and
Knowledge Engineering (IJSEKE). September 1998.
Lange, C., Sneed, H. M. and Winter, A. (2001). Comparing Graph-based Program
Comprehension Tools to Relational Database-based Tools. 9th International
Workshop on Program Comprehension (IWPC'01). Toronto, Canada. May
2001.
Lee, M. (1998). Change Impact Analysis of Object-Oriented Software. George
Mason University: Master Thesis.
Maletic, J. I. and Marcus, A. (2001). Supporting Program Comprehension Using
Semantic and Structural Information. 23rd International Conference on
Software Engineering (ICSE'01). Toronto, Canada. May 2001.
Marcus, A., Rajlich, V., Buchta, J., Petrenko, M. and Sergeyev, A. (2005). Static
Techniques for Concept Location in Object-Oriented Code. Program
Comprehension, 2005. IWPC 2005. Proceedings. 13th International
Workshop. 33 – 42. Washington, DC, USA.
Marcus, A., Sergeyev, A., Rajlich, V. and Maletic J. I. (2004). An Information
Retrieval Approach to Concept Location in Source Code. Proceedings of the
11th Working Conference on Reverse Engineering. Delft, The Netherlands.
November 2004.
Nelson L. M. (2005). A Survey of Reverse Engineering and Program
Comprehension. ODU CS 551 – Software Engineering Survey.
Pattern Matching (2003). http://en.wikipedia.org/wiki/Pattern_matching
Paul, S. and Prakash, A. (1994). A Framework for Source Code Search Using
Program Patterns. IEEE Transactions on Software Engineering (TSE). June
1994.
Paulisch, F. N. (1993). The Design of an Extendible Graph Editor. German:
Springer-Verlag Berlin Heidelberg.
Phanindra, G., Shankar K.V.V.N.R and Sreenivas, P. D. (2007). A Fast Multiple
Pattern Matching Algorithm using Context Free Grammar and Tree Model.
International Journal of Computer Science and Network Security (IJCSNS).
September 2007.
Pigoski, T. M. (1997). Practical Software Maintenance: Best Practices for
Managing your Software Investment. USA: John Wiley & Sons.
Rajlich, V. and Wilde, N. (2002). The Role of Concepts in Program Comprehension,
Proceedings of 10th International Workshop on Program Comprehension.
June 27-29. France: IEEE Computer Society. 271-278.
Rasool, G. and Philippow, I. (2008). Recovering Artifacts from Legacy Systems
using Pattern Matching. Proceedings of World Academy of Science,
Engineering and Technology. December, 2008.
Regular Expression (2007). http://en.wikipedia.org/wiki/Regular_expression
Rigi (2004). Rigi Group Home Page. http://www.rigi.csc.uvic.ca/
Sartipi, K., Kontogiannis, K. and Mavaddat, F. (2000). A Pattern Matching
Framework for Software Architecture Recovery and Restructuring. 8th
International Workshop on Program Comprehension (IWPC'00). Limerick,
Ireland. June 2000.
Singer, J., Lethbridge, T., Vinson, N., and Anquetil, N. (1997). An Examination of
Software Engineering Work Practices. Proceedings Conference of Centre for
Advanced Studies on Collaborative Research. Toronto, Ontario. November
1997.
Sommerville, I. (1997). Software Engineering. (5th Edition). England: Addison
Wesley.
Storey, M. -A. D. (1998). A Cognitive Framework for Describing and Evaluating
Software Exploration Tools. Simon Fraser University, Canada: PhD
Dissertation.
Storey, M. -A. (2005). Theories, Tools and Research Methods in Program
Comprehension: Past, Present and Future. Proceedings of the 13th
International Workshop on Program Comprehension (IWPC 2005). May
2005.
Sulaiman S. (2004). A Document-Like Software Visualization Method for Effective
Cognition of C-Based Software Systems. University Of Technology Malaysia
(UTM) : PhD Thesis.
Sulaiman S. (2004). Viewing Software Artifacts for Different Software Maintenance
Categories Using Graph Representations. Malaysian Journal of Computer
Science. December 2004.
Tammy V. (2005). Reading Before Writing: Can Students Read and Understand
Code and Documentation? Proceedings of the 36th SIGCSE Technical
Symposium on Computer Science Education. St. Louis, Missouri, USA.
February 2005.
Tilley, S. R., Smith, D. B. and Paul, S. (1996). Towards a Framework for Program
Understanding. Proceedings of the 4th International Workshop on Program
Comprehension (WPC '96). March 1996.
WinGrep (2007). Windows Grep - Advanced searching for Windows.
http://www.wingrep.com/
APPENDIX A
UNIVERSITI TEKNOLOGI MALAYSIA
Centre for Advanced Software Engineering
City Campus, Jalan Semarak,
54100 Kuala Lumpur
QUESTIONNAIRE ON USABILITY OF SOFTWARE UNDERSTANDING
TOOL
Software maintenance is a problem that plagues the software industry. Software
maintenance takes up approximately 50-75% of the cost of software development
[I. Sommerville, Software Engineering, 2004]. Software understanding is central
to maintenance because the programmer working on a piece of code is often not
the original programmer of that software, so the programmer must take the time
to understand the code. Even if the original programmers work on their own code
during maintenance, they may not remember what that code does, so software
understanding becomes a critical issue.
Several tools have been developed to aid in software understanding. Essentially, this
questionnaire attempts to derive your opinions on the usefulness and usability of
software understanding tools.
Objectives
1. To identify the usefulness and usability of software understanding tools.
2. To identify the weakness and strength of existing software understanding tools.
Remarks
It should take approximately 20-30 minutes to complete the questionnaires. I would
like to plead for sincere participation and your cooperation in answering these
questionnaires is very much appreciated.
Thank you very much.
Prepared by:
DAHLIA BINTI DIN
Faculty of Computer Science and Information System
Universiti Teknologi Malaysia
ailhaddin@yahoo.com
Professional Background
Questions in this section are related with your current position and previous
experience: (Tick √ for the answer)
1. Which category of your current company?
Software
Telecommunication
Academic
Banking
Production
Other, please specify:
2. What is your current job?
Programmer
Software Engineer / System Analyst
Project Leader / Project Manager
Quality Engineer
Lecturer
Researcher
Other, please specify:
3. Number of years you have been involved in software development?
Less than 1 year
1 – 2 years
3 – 4 years
More than 4 years
No experience at all
4. Number of years you have been involved in software maintenance?
Less than 1 year
1 – 2 years
3 – 4 years
More than 4 years
No experience at all
5. What types of work products have you been involved for software
maintenance? (Tick √ one or more answers)
Code
Design
Specifications
Testing
Requirement or project management
Documentation
Other, please specify:
6. Which programming language that you are involved during the software
maintenance? (Tick √ one or more answers)
C
C++
Java
Visual Basic
Other, please specify :
7. Which tools that you use to support your software maintenance process?
(Tick √ one or more answers)
Grep
Rigi
Columbus
McCabe
CodeSurfer
Not use any tools
Other, please specify:
8. How do you find the task of software maintenance? (Tick √ one or more answers)
Descriptions                 Agree   Normal   Disagree   No Opinion
Boring task
Tedious task
Time Consuming
Critical but crucial job
Need skill and experience
Others (please specify):
GI Project
Questions in this section are related to your experience in GI project:
1. How long (in hour) do you take to maintaining the changes? (Average from 6
PCR) (Tick √ for the answers)
1-5 hours
6-10 hours
11-24 hours
More than 24 hours
2. Between source code and the software document, which is more helpful for
you to understand the software? (Tick √ for the answers)
Source code
Software document
3. Between source code and the software document, which is more helpful to
you for software maintenance? (Tick √ for the answers)
Source code
Software document
4. Do you use any existing tools to help you do the software maintenance?
(Such as Rigi, CodeSurfer, etc.) (Tick √ for the answers)
Yes, please specify:
No
5. List of the step, did you use to manage the software maintenance?
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
6. Other comments or issues:
Evaluation of the Tools
For this section, you need to evaluate all the four tools. Use the codes to indicate
the tools:
• GP: Grep
• CS: Colombus
• RG: Rigi
• CQ: CQuery
1. Specify your opinion on the usability of the four tools by indicate the code as
above in the corresponding row and column.
Rating columns: Strongly Disagree (1), Disagree (2), Normal (3), Agree (4),
Strongly Agree (5)
Criteria (rows):
Eg. Criteria 1 and Eg. Criteria 2 (sample rows, filled in with the codes RG, CS,
GP and CQ under the chosen ratings)
1. Easy to use in overall
2. Easy to understand
3. Can speed up time in searching the keyword location
4. Search utility provided in graphical views is sufficient
5. The tasks can be completed using this tools effectively
6. Easy to indicate the inter-relationship of function
7. Provide basis to cost estimation and plan schedule
8. Support potential change impact analysis
9. Information provided is well organized
10. Textual information provided in sufficient
11. Graphical views are simple and easy to understand
12. Graphic information is sufficient
13. Easy to trace link between graphical and source code
14. The interface is good
15. Less time for software understand
16. Other, please specify:
2. Indicate the usefulness of the four tools provided to understand the software
(Tick √ the appropriate row and column)
Tools      Very Useless (1)   Useless (2)   Normal (3)   Useful (4)   Very Useful (5)
Grep
Colombus
Rigi
CQuery
3. Other comments:
- Thank you -
APPENDIX B
USER MANUAL
Source Code Query to Support Structured Program Understanding
(CQuery)
How to start using the CQuery tool?
To use CQuery, follow these instructions:
1. Double click on the CQuery shortcut on your desktop.
2. Click on ‘Browse’ button to determine the source code location.
Browse the source code location
3. Enter the keyword in keyword field
Enter the keyword
Click on ‘Search’ button to start your keyword searching.
Click here to start the keyword searching
4. Search result will appear as below. Then click on ‘Visualize’ button to visualize
the keyword.
Click here to visualize the keyword
5. The abstraction of the keyword is displayed. To view the source code, click on
a node in the diagram, and then click on the ‘View Source Code’ button. The
source code viewer window will be displayed.
6. Source code viewer. The keyword will be highlighted in the source code.