Supporting Component-Based Software Development with Active Component Repository Systems

by
Yunwen Ye
B.Sc., Fudan University, China, 1987
M.S., Fudan University, China, 1990
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2001
This thesis entitled:
Supporting Component-Based Software Development with Active Component Repository
Systems
written by Yunwen Ye
has been approved for the Department of Computer Science
Gerhard Fischer
James Martin
Date
The final copy of this thesis has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above
mentioned discipline.
Ye, Yunwen (Ph.D., Computer Science)
Supporting Component-Based Software Development with Active Component Repository Systems
Thesis directed by Prof. Gerhard Fischer
It is widely believed and empirically proven that component reuse improves both the
quality and productivity of software development. Before software components are reused,
however, they must be located. Component repository systems provide a means to locate software components. Current component repository systems are designed to support the paradigm
of development-with-reuse, which views reuse as a process independent of the whole software
development process and relies on programmers to take the reuse initiative. Such systems fall
short in supporting programmers who make no attempt to reuse, either because they are unaware of the
existence of reusable components or because they perceive that reuse costs more than programming from
scratch.
This dissertation advocates a paradigm shift from development-with-reuse to reuse-within-development, which views reuse as an integral part of software development, and component
repository systems as information systems that augment programmers’ insufficient knowledge
about reusable components and assist them in accomplishing their tasks. Active component
repository systems—component repository systems equipped with active information delivery
mechanisms—support reuse-within-development. They can be seamlessly integrated with programming environments. Through this integration, their active information delivery mechanism
delivers task-relevant and user-specific components, without being given explicit reuse queries,
to help programmers reuse unknown components and to reduce the cost of reuse.
An active component repository system, CodeBroker, has been developed and evaluated.
CodeBroker runs continuously in the background of a programming environment and infers
programmers’ needs for reusable components by monitoring their interactions with the environment.
Potentially reusable components that match reuse queries extracted from comments and
signatures in the programming environment are autonomously located and actively delivered to
programmers. Formal evaluations of the CodeBroker system have indicated that it motivated
programmers to reuse once relevant components were delivered, and that it was able to deliver
components relevant to both the task and the background knowledge of programmers.
Acknowledgments
I feel very fortunate that my employer, Software Research Associates, Inc. (SRA),
Tokyo, Japan, provided me the time and financial support to complete this research. In particular, I thank Kouichi Kishida, executive vice president and technical director of SRA, for
his lasting support and encouragement, without which I could not have finished this research.
Yoshitaka Matsumura, Kaoru Hayashi, and Yoshikazu Hayashi have been excellent managers
who have gone to great lengths to provide the best conditions for me to complete my research.
I also want to thank my colleague Tomohiro Oda for his help.
I am grateful to the members of my thesis committee. Gerhard Fischer, my advisor, is
simply the best advisor I could have found. His conceptual frameworks on Domain-Oriented
Design Environments and on learning have provided the foundations for this research. Without
his excellent skills in challenging my ideas and motivating me to think deeper, I could not have
finished the research in this manner. Kumiyo Nakakoji, my mentor and role model, has provided
immeasurable support, both emotionally and intellectually. She has been always there when I
needed help. Brent Reeves has spent much time patiently listening to my sometimes rough ideas
and reading my immature manuscripts, and has provided frank, yet friendly, critical feedback.
His constructive criticism has been invaluable in guiding me to frame the research problem,
prioritize my resources, and present my ideas clearly. The support from other members of my
thesis committee, Ken Anderson, James Martin, and Walter Kintsch helped me to clarify my
understanding, and their input is very much appreciated. In particular, I thank James Martin for
his excellent course on Natural Language Processing, which introduced me to the research field
of information retrieval. That was one of the best courses I have ever taken.
Members of the Center for LifeLong Learning and Design have been very supportive. I
thank Taro Adachi for numerous, wide-ranging discussions that I have greatly enjoyed over the
years. Jonathan Ostwald generously offered many times to listen to my thoughts and read my
writings. His encouragement and feedback are greatly appreciated. I was extremely delighted
to have as an officemate Eric Scharff, who had an answer to every computer problem I had,
no matter whether it was a Mac, Windows or Linux problem. Many discussions with Rogerio
de Paula helped me structure my thoughts. I thank Gerry Stahl, Hal Eden, Andy Gorman, and
Francesca Iovine for their support.
Finally, I would like to thank my family members. I thank my parents, who have taught
me the joy of learning and have always urged me to do my best. I thank my eldest daughter,
Hanlu, for understanding when she had to spend many weekends being bored because dad
had to work, and my 5-month-old daughter, Hanlei, for her innocent and sweet smiles which
provided the best comfort after a day’s hard work. Most of all, I wholeheartedly acknowledge
the endless love, understanding, and support that my wife, Yonghong Pan, has given to me. In
particular, I thank her for her unabated confidence in me, which has cheered me greatly at times
of frustration.
Contents
Chapter
1 Introduction
    1.1 Motivation
    1.2 Goal of the Research
    1.3 Active Component Repository Systems
    1.4 The CodeBroker System
    1.5 Organization of the Dissertation
2 Roles of Reusable Components in Programming
    2.1 A Process Model of Programming
    2.2 Programming Knowledge
    2.3 Opportunistic Programming
    2.4 Benefits of Software Components in Programming
3 Challenges of Software Reuse
    3.1 Overview of Software Reuse
    3.2 General Issues of Component Reuse
    3.3 Creating Reusable Components
    3.4 Understanding the Cognitive Difficulties of Component Reuse
4 The Component Locating Problem
    4.1 No Attempt to Reuse
    4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development
    4.3 Information-Enriched Workspaces
    4.4 Active Component Repository Systems
5 Active Information Systems
    5.1 Basic Issues of Active Information Systems
    5.2 Acquiring Information of User Tasks
    5.3 Personalizing Information Delivery
    5.4 Dealing with Partial, Imprecise Queries
    5.5 Comparing Active Information Systems with an Example in the Real World
    5.6 The Spectrum of Support for Locating Information
6 Indexing and Retrieval Mechanisms in CodeBroker
    6.1 Indexing and Retrieval Mechanisms
    6.2 Creating the Component Repository
7 Locating and Delivering Components in CodeBroker
    7.1 System Architecture
    7.2 Listener
    7.3 Fetcher
    7.4 Presenter
    7.5 The Retrieval-by-Reformulation Mechanism
    7.6 Summary of CodeBroker
8 Evaluations of CodeBroker
    8.1 Evaluating the Retrieval Mechanisms
    8.2 Empirical Evaluations of the CodeBroker System
    8.3 Findings about the Usage of CodeBroker
    8.4 Other Findings about Programming in General
    8.5 Problems of CodeBroker and Needed Improvements
    8.6 Summary of Evaluations
9 Related Work
    9.1 Active Information Systems
    9.2 Component Repository Systems
    9.3 Intelligent Programming Environments
10 Future Work and Conclusions
    10.1 Future Work
    10.2 Summary
    10.3 Contributions
Bibliography
Appendix
A The List of Queries and Relevant Components
B Questions Asked in the Post-Experiment Interview
C Abbreviations
D Glossary
Tables
Table
1.1 The rapid growth of the Java Core API library
4.1 Relations between reuse mode, knowledge sources, and tool support
5.1 A comparison between plan recognition and similarity analysis
8.1 Average precision and recall values for LSA, Mixed (average of LSA and Okapi), and Okapi
8.2 Programming knowledge and expertise of subjects
8.3 Overall results of evaluation experiments with programmers
8.4 Subjective evaluations of the CodeBroker system
8.5 Experiment data regarding user models
8.6 Experiment data about discourse models
Figures
Figure
1.1 The location-comprehension-modification process of reusing components
1.2 Software reuse failure modes
1.3 Overview of CodeBroker
2.1 The process model of programming
2.2 A program and its program plans
2.3 Orthogonality between program plans and software components
2.4 The role of components in problem framing
3.1 A cognitive model of the component reuse process
4.1 Different levels of programmers’ knowledge about a component repository
4.2 The development-with-reuse paradigm
4.3 The reuse-within-development paradigm
5.1 Feedforward information delivery
5.2 Autocompletion in Internet Explorer
5.3 Feedback information delivery
5.4 Two assumptions of similarity analysis
5.5 The spectrum of support to information location
6.1 The CodeIndexer and CodeBroker subsystems
6.2 The process of creating a component repository from Java programs
6.3 An example of a document generated by Javadoc
6.4 The indexing format of method documents in CodeBroker
7.1 The architecture of the CodeBroker system
7.2 Component delivery based on concept queries only
7.3 Component delivery based on both concept queries and constraint queries
7.4 Presenting more information triggered by mouse movement
7.5 An example discourse model
7.6 An example user model
7.7 An illustrative program for adaptive user modeling
7.8 The Skip Components Menu
7.9 The Direct Manipulation interface
7.10 The Query Refinement interface
7.11 Summary of CodeBroker
8.1 Recall-precision curves
Chapter 1
Introduction
1.1 Motivation
A wide gap exists between the constantly increasing demands for complex software systems
and the capability of the software industry to deliver quality software systems in a timely
and cost-effective manner. Software reuse, a development method of using existing reusable
software components to create new programs, has been shown through empirical studies to improve both the quality and productivity of software development (Basili et al., 1996; Boehm,
1999). Software reuse also increases the evolvability of software systems because complex
systems evolve faster when they are built from stable subsystems (Simon, 1996).
Programmers are knowledge workers, and programming is a process of progressive crystallization of their knowledge into a program. Knowledge needed during programming comes
either from the programmer’s head or from such external sources as books, manuals, peer workers, and computerized information systems (Norman, 1993). A lack of needed knowledge is
one of the major reasons for poor quality and productivity of programming. With the advent
of object-oriented technology, reusable software components now comprise the bulk of programming knowledge. Easy access to needed external information, in particular, reusable software components, to complement the insufficient knowledge of programmers is thus critical to
the improvement of programming quality and productivity.
If programmers know a reusable software component well enough, they may integrate
it into their programs whenever it is applicable without even realizing they are reusing because such reusable components become “ready-to-hand” to programmers (Winograd & Flores,
1986). However, repositories of reusable software components are often so large that programmers cannot learn about all of the components before they start programming. Software component repositories are not static; they are constantly evolving with new components added and old
components updated. As an example, Table 1.1 shows the rapid growth of the Java Core API
(Application Programmer Interface) library, a repository of reusable components of classes
and methods. Few Java programmers, if any, can claim that they know all the components in
this library.
Version    No. of Packages    No. of Classes    Year of Release
Java 1.0   8                  211               1996
Java 1.1   23                 503               1997
Java 1.2   59                 1525              1998
Java 2     70+                2100+             1999
Table 1.1: The rapid growth of the Java Core API library
Programmers who have not yet learned a software component have to go through the reuse
process if they want to use it in their programs. The reuse process consists of three
steps: location, comprehension, and modification (Figure 1.1). Programmers have to locate
those components that are potentially reusable in the current programming task from the component repository, comprehend their functionality and usage, and make necessary modifications
if the components do not completely fit their needs (Fischer et al., 1991).
The foremost obstacle to the success of component reuse is that programmers cannot locate needed software components quickly and easily. Locating reusable software components is
often supported by component repository systems or reuse repository systems. Like many other
information repository systems, browsing- and querying-oriented schemes have long served as
the principal techniques for programmers to locate reusable software components. More innovative schemes, such as query by reformulation (Williams et al., 1982; Fischer & Nieper-Lemke,
1989; Henninger, 1993), information filtering (Belkin & Croft, 1992), and Latent Semantic
Analysis (Landauer & Dumais, 1997), have introduced new possibilities. Unfortunately, the
problem remains that programmers simply do not actively search for components and make no
attempt to reuse. According to a study by Frakes and Fox (1995), no attempt
to reuse is the leading failure mode of software reuse (Figure 1.2). This inhibiting factor to
the wide success of reuse has been reported again and again by software companies that have
tried to introduce reuse into their organizations (Devanbu et al., 1991; Rosenbaum & DuCastel,
1995; Fichman & Kemerer, 1997).
Figure 1.1: The location-comprehension-modification process of reusing components
Successful reuse requires that programmers be able to locate, comprehend, and modify
needed reusable components. (Diagram labels: Location, Comprehension, Modification; explanation, reformulation, extraction.)
1.2 Goal of the Research
Although many factors, such as the lack of managerial commitment and the difficulty
in developing good reusable components, affect the widespread uptake of reuse, this research
focuses on the cognitive difficulties faced by programmers who try to reuse, because only when
programmers are willing and able to put reuse into their daily practice will reuse become fruitful.
This research tries to create a conceptual framework to analyze what hinders programmers from making attempts to locate reusable components and, based on the analysis, it proposes
a new approach to the design of component repository systems that can motivate and
encourage programmers to reuse by reducing the difficulty of locating components.
By applying cognitive engineering (Norman, 1986) to the reuse process, a cognitive
model of reuse is first built. Based on this cognitive model and past research on the effective use
of large information repositories (Fischer, 2001), the following two barriers to the component
locating process are identified.
- Due to the large volume and constantly evolving nature of component repositories,
  programmers often fail to anticipate the existence of reusable components; when they
  do not believe that a component exists in the repository, they will not even make an
  effort to locate it in the first place.
- Even if programmers are aware of the existence of reusable components, they do not
  want to start the locating process if they do not know how to locate the components
  or if they perceive that locating the components costs more than programming from
  scratch.
Figure 1.2: Software reuse failure modes
In the Frakes and Fox (1995) paper, seven conditions (attempting to reuse, components existing, components available, components found, components understood,
components valid, and components integratable) form a successful reuse chain.
A breakdown in any condition causes the failure of reuse. The data in the figure were
collected from 29 software development organizations. The Y-axis shows the percentage each condition plays in causing the failure of reuse.
Although reusable component repository systems have been an active research area for
more than a decade, these two issues, especially the first one, have not been given enough
attention. This is because those systems are designed to support the paradigm of development-with-reuse (Rada, 1995), which advocates reuse as a new paradigm for programming. Under
this paradigm, the reuse process is treated as an independent process, and programmers have
to change their current programming practice to embrace reuse; reusable component repository systems are researched as stand-alone systems under the assumption that programmers are
always willing to use these systems and are able to use them with well-defined queries. Consequently, research on component repository systems has focused mainly on the information
access mechanism only. Information access is an approach to obtain information that requires
users (a term used interchangeably with “programmers” in this thesis, because the users of component repository systems are programmers) to start the information locating process through browsing or querying.
This research proposes a paradigm shift from development-with-reuse to reuse-within-development. Development-with-reuse is a methodology-centered view of reuse that demands that
programmers adapt themselves to the new methodology of reuse. It does not concern itself
with the confusions and difficulties faced by programmers who try to reuse. When the approach does not meet with its expected success, programmers are labeled, due to their resistance to
change, as having the NIH (Not Invented Here) syndrome (Fafchamps, 1994), and education of
programmers about the value of reuse is called for.
Conversely, the reuse-within-development paradigm puts programmers back into the
center and views reuse as an integral part of the whole programming process. It stresses that
reusable component repository systems should serve as extensions to programmers’ limited
knowledge. Such systems should actively participate in the programming process by providing programmers with immediate and easy access to reusable software components instead of
passively waiting for programmers to explore them after they have made the decision to reuse.
Reuse-within-development needs the support of active component repository systems.
Active component repository systems are a subset of active information systems that are equipped
with the information delivery mechanism. Unlike the passive information access mechanism by
which users have to explicitly launch the information-seeking process by specifying their information needs in the form of well-defined queries or engaging in a series of browsing actions,
the information delivery mechanism presents information to users on its own initiative without
being prompted by explicit queries. With reusable components delivered by active component
repository systems, programmers are able to reuse without changing their current programming
practice and environment.
1.3 Active Component Repository Systems
In general, active information systems that just throw a piece of decontextualized information
at a user are of little use because they ignore the user’s working context. The working
context consists of the task acted upon and the user acting. The challenge of implementing an
active information system or an information delivery system is to deliver context-sensitive information related to both the task at hand and the background knowledge of the user. Task- and
user-independent information delivery systems, or “push” systems, such as Microsoft’s “Tip of
the Day,” suffer from the problem that information gets thrown at users in a decontextualized
way. The “Tip of the Day” is a feature that tries to acquaint users with some arbitrarily chosen
functionality in a complex system. Despite the possibility for interesting serendipitous encounters of information (Roberts, 1989), most users find this feature more annoying than helpful.
The specific challenge faced by this research is to deliver context-sensitive components.
In other words, how can the active component repository system capture programmers’ needs
for reusable components by understanding to some extent what their tasks at hand are and then
present only those task-relevant components that are not yet known to the programmers?
Needs for reusable software components are not determined before programming starts,
as most current component repository systems have assumed; they arise in the middle of the
programming process (Sen, 1997). Inasmuch as programmers are using computer-based development environments to develop software systems, it is possible for component repository
systems to capture the reuse needs autonomously by utilizing information available in programming environments when component repository systems and program development environments are properly integrated. For example, in a programming editor, comments inside programs and signatures—the syntactical interfaces of program modules—are good indications of
what programmers are going to develop next (Ye & Fischer, 2000). The integration of component repository systems and programming environments creates a shared workspace accessible
to both programmers and component repository systems. This shared workspace enables component repository systems to play an active role in supporting reuse by programmers,
with the delivery of task-relevant and user-specific reusable components. Presenting components specific to a programmer can be realized through user models (Fischer, 2001), because
user models that represent programmers’ knowledge about reusable components can be used as
filters by the repository system to ensure only unknown components are delivered.
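As a rough illustration of how comments and signatures can serve as a source of reuse queries, the following Python sketch scans Java source text for doc comments and pairs each one with the declaration that follows it. It is not the CodeBroker implementation; the names ReuseQuery and extract_reuse_queries, the regular expression, and the sample method getRandomNumber are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseQuery:
    """Hypothetical container for what a workspace monitor might extract."""
    concept_terms: tuple  # lowercase words taken from the doc comment
    signature: str        # the declaration that follows the comment


# A Javadoc-style comment, then everything up to the next '{' or ';'.
DOC_COMMENT = re.compile(r"/\*\*(.*?)\*/\s*([^{;]*)", re.DOTALL)


def extract_reuse_queries(java_source: str) -> list:
    """Pair each doc comment with the declaration that follows it."""
    queries = []
    for comment, declaration in DOC_COMMENT.findall(java_source):
        # Drop the leading asterisks that decorate multi-line comments.
        text = re.sub(r"^\s*\*+", " ", comment, flags=re.MULTILINE)
        terms = tuple(word.lower() for word in re.findall(r"[A-Za-z]+", text))
        # Collapse whitespace so the signature is a single normalized line.
        signature = " ".join(declaration.split())
        queries.append(ReuseQuery(terms, signature))
    return queries


# Hypothetical editor buffer; the method name and parameters are invented.
sample = """
/** Create a random number between two limits */
public static int getRandomNumber(int from, int to) {
"""

for query in extract_reuse_queries(sample):
    print(query.concept_terms, "|", query.signature)
```

In this sketch the comment words describe what the programmer intends to build, while the signature records the types involved; a repository system could use either or both when searching for matching components.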
1.4 The CodeBroker System
An active component repository system, CodeBroker, has been developed. CodeBroker
is integrated with the program development environment—Emacs. It utilizes an information
delivery mechanism to bring to the attention of Java programmers those components that are
unknown to them and yet are relevant to their current programming task by
- constructing a task model to capture the programming task through continuously monitoring programming activities in the development environment;
- identifying the domains of a programmer’s current interest by creating a discourse
  model based on the history of interaction between the system and the programmer; and
- creating a user model to represent each programmer’s knowledge about reusable components to personalize the delivery.
Integrated with CodeBroker, the development environment becomes an information-enriched workspace (Ye, 2001b) consisting of the original programming environment and an
augmented information display that presents reusable components dynamically based on the
programming task and the programmer’s background knowledge. Programmers can access potentially reusable components immediately without switching working contexts. This is a distinct advantage because it avoids interrupting the programming flow. The operational interface
of the component repository system becomes transparent to programmers, and is replaced by
three cooperative autonomous software agents (Bradshaw, 1997): Listener, Fetcher, and Presenter. The Listener agent creates reuse queries from the programming workspace as the task
model; the Fetcher agent retrieves components matching reuse queries; and the Presenter agent
presents retrieved components directly into the workspace of programmers, using discourse
models and user models as filters (Figure 1.3).
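The division of labor among the three agents can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design described later in the thesis: retrieval here is plain word overlap (a stand-in for whatever retrieval mechanism the system actually uses), the discourse model is reduced to a set of packages the programmer has dismissed, the user model is reduced to a set of known component names, and the repository entries carry paraphrased descriptions invented for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    package: str
    description: str  # illustrative text, not quoted from any real Javadoc


@dataclass
class UserModel:
    known: set = field(default_factory=set)  # components the programmer already knows


@dataclass
class DiscourseModel:
    excluded_packages: set = field(default_factory=set)  # dismissed in this session


def listener(comment_text: str) -> set:
    """Turn comment text into a bag of query terms (words longer than two letters)."""
    return {w for w in re.findall(r"[a-z]+", comment_text.lower()) if len(w) > 2}


def fetcher(query_terms: set, repository: list, limit: int = 3) -> list:
    """Rank components by word overlap with the query; a stand-in for real retrieval."""
    scored = []
    for component in repository:
        overlap = len(query_terms & listener(component.description))
        if overlap:
            scored.append((overlap, component))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in order
    return [component for _, component in scored[:limit]]


def presenter(candidates: list, user: UserModel, discourse: DiscourseModel) -> list:
    """Filter out components that are already known or outside the current interest."""
    return [c for c in candidates
            if c.name not in user.known and c.package not in discourse.excluded_packages]


repository = [
    Component("Random.nextInt", "java.util", "Returns a random number between zero and a limit"),
    Component("Math.random", "java.lang", "Returns a random number between 0.0 and 1.0"),
    Component("String.trim", "java.lang", "Removes leading and trailing whitespace"),
]

query = listener("Create a random number between two limits")
candidates = fetcher(query, repository)
delivered = presenter(candidates, UserModel(known={"Math.random"}), DiscourseModel())
print([c.name for c in delivered])
```

Keeping the filters in the last stage means the same retrieved list can be personalized differently for each programmer without rerunning retrieval.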
The information-enriched workspace created by active component repository systems
improves the “readiness-to-hand” of components because it hides the retrieval interface of component repository systems from programmers so that programmers can directly interact with
reusable components rather than the repository system.
Evaluations of the system with programmers have found that the system was effective in
supporting reuse along the following three dimensions:
- CodeBroker effectively encouraged programmers to explore the possibility of reuse.
- Programmers were able to reuse unknown software components when they were delivered by the system.
- The combination of task models, discourse models, and user models succeeded in most
  cases in delivering context-sensitive reusable components.
Figure 1.3: Overview of CodeBroker
The programming environment is augmented with a reusable component information display (the lower buffer), which presents reusable components dynamically.
These components are autonomously retrieved by three cooperative software agents
(Listener, Fetcher and Presenter) based on the programming task and the programmer’s background knowledge. In this example, the programmer can reuse the first
component (highlighted) to implement the task: “Create a random number between
two limits” (indicated in the doc comment), without leaving the programming environment or explicitly operating the component repository system.
1.5 Organization of the Dissertation
Chapter 2 of this dissertation presents a conceptual framework of programming for analyzing
the roles of reusable components in programming. Most programmers follow the opportunistic programming strategy, and the availability of reusable components affects the choice of
different development alternatives.
After overviewing the issues of instituting systematic reuse in a software development
organization, Chapter 3 analyzes in detail the difficulties of component reuse from the perspective of programmers. Through cognitive engineering, a cognitive model of the reuse process is
created and the challenges faced by programmers in each step are discussed.
Chapter 4 focuses on the central theme of this research: why locating components is difficult for programmers, and, in particular, what prevents them from attempting to reuse. Drawing
on past research on the use of large information repositories and on human cognition theories,
the argument is made that the “no attempt to reuse” phenomenon is caused by the existence of
information islands and perceived low reuse utility. The concept of active component repository
systems is introduced as a solution to this problem.
Chapter 5 delineates the challenges in implementing active information systems and
their general solutions: Task models and discourse models contribute to the task-relevance
of information delivery, and user models support the user-specific delivery. To accommodate
the dynamic nature of the information-seeking process, the concept and role of retrieval-by-reformulation are discussed.
Chapter 6 describes the retrieval mechanisms used in the CodeBroker system and the
CodeIndexer subsystem that creates the contents of the component repository from existing programs.
Chapter 7 presents the design and implementation of the CodeBroker system.
Chapter 8 presents the findings from formal evaluations of CodeBroker.
Chapter 9 compares this research with related work.
Chapter 10 concludes the thesis by discussing future research directions and summarizing the contributions of this research.
Chapter 2
Roles of Reusable Components in Programming
With the advent of object-oriented technology, reusable software components have become an indispensable part of programming knowledge: “[Reusable component] library design
is [programming] language design” (Stroustrup, 1995). In addition to those classes and methods
included in standard libraries of programming languages, such as the Java API library, many
reusable software components are developed by software development organizations specifically for reuse or repackaged from previously developed systems.
Practitioners and researchers generally believe, and experiments have empirically proven
that component reuse improves the quality and productivity of programming (Lange & Moher,
1989; Lim, 1994; Basili et al., 1996; Simon, 1996; Boehm, 1999). However, most analyses of the benefits of reusable components have been based on the products finally produced.
To better understand how reusable components help programmers produce better software systems faster—not a better product and a shorter production time, per se—we must analyze the
roles of reusable components in the programming process. After presenting the process model
of programming, drawing on design theory in general and empirical programming studies in
particular, this chapter explains the benefits of reusable components in programming.
2.1 A Process Model of Programming
Viewed as the task of creating a computer-executable representation (a program) of a real-world problem by piecing together a set of primitive elements provided by a programming language and its component libraries, programming consists of two distinct yet tightly intertwined processes: problem framing and problem solving (Schön, 1983; Hoc et al., 1990;
Fischer, 1994).
2.1.1 Intertwining of Problem Framing and Problem Solving
During the problem-framing process, commonly known as the specification process in
software engineering, programmers try to understand the problem given in the actual problem
space by building a mental representation of the programming task. This mental representation is a situation model that is the result of the interaction between the problem and the programmer’s knowledge about the problem domain (Kintsch, 1998). Different programmers with
different knowledge often come up with different situation models of the same programming
task. During the problem-solving process, or implementation in software engineering terminology, programmers create programs based on the situation model as a new representation in the
solution space defined by the programming language and its libraries.
Although problem solving starts after problem framing, these two processes are not separate. The processes of framing the problem and of solving the problem influence each other
because every transformation of the framing of the problem provides the direction in which
a partial solution is to be transformed, and every transformation of the constructed solution
determines the direction in which the framing is to be transformed. Like all other design activities, which are an
interaction between understanding (problem framing) and creation (problem solving) (Rittel,
1984; Winograd & Flores, 1986), programming is an iterative process of problem framing
and problem solving. Programmers rarely complete one process before beginning the second
one (Pennington & Grabowski, 1990) for the following two reasons.
(1) In most cases, programming tasks cannot be fully understood without considering the
solution (Ghezzi et al., 1991). For example, given the programming task of drawing
a filled circle, a programmer can define the filled circle as a trajectory of rotating one
end of a fixed line 360 degrees, or as a collection of dots whose distance to a center is
not greater than the radius. Each definition is actually based on an intended solution to
the problem.
(2) Programming involves many tentative problem-solving strategies. After those tentative
strategies have been explored and their consequences evaluated, some become eventual
commitments and some require the modification of the initial mental representation of
the problem. This modification often breeds new subtasks to be solved.
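The two framings of the filled-circle task in reason (1) can be made concrete with a minimal Java sketch. The class name FilledCircle, the boolean grid representation, and the sampling steps are illustrative assumptions and do not come from the studies cited above; the sketch only shows that each definition of the problem already implies a particular solution.

```java
// Hypothetical sketch: two framings of "filled circle", each implying its own solution.
final class FilledCircle {
    private FilledCircle() {}

    /** Framing A: every dot whose distance to the center is not greater than the radius. */
    static boolean[][] byDistance(int radius) {
        int size = 2 * radius + 1;
        boolean[][] grid = new boolean[size][size];
        for (int y = -radius; y <= radius; y++) {
            for (int x = -radius; x <= radius; x++) {
                if (x * x + y * y <= radius * radius) {
                    grid[y + radius][x + radius] = true;
                }
            }
        }
        return grid;
    }

    /** Framing B: the trajectory of a fixed-length line rotated 360 degrees about one end. */
    static boolean[][] byRotation(int radius) {
        int size = 2 * radius + 1;
        boolean[][] grid = new boolean[size][size];
        int steps = 16 * Math.max(radius, 1); // number of angular samples (an assumed resolution)
        for (int s = 0; s < steps; s++) {
            double angle = 2 * Math.PI * s / steps;
            // Mark sample points along the line from the center to its rotating end.
            for (double t = 0; t <= radius; t += 0.25) {
                int x = (int) Math.round(t * Math.cos(angle));
                int y = (int) Math.round(t * Math.sin(angle));
                grid[y + radius][x + radius] = true;
            }
        }
        return grid;
    }

    /** Counts the marked dots in a grid. */
    static int count(boolean[][] grid) {
        int n = 0;
        for (boolean[] row : grid) {
            for (boolean cell : row) {
                if (cell) n++;
            }
        }
        return n;
    }
}
```

On a coarse grid the two framings can mark different dots near the boundary, because rounding the end of the rotating line may land just outside the distance-based disk. This is one way in which the chosen framing shapes the resulting program.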
2.1.2 Programming Is Knowledge Intensive
Neither problem framing nor problem solving is a process of simple transformation that
converts one representation to another representation; instead, they are processes of interpretation. The programming task, the situation model, and the final program are representations
at different levels of formalization and abstraction intended for different purposes. Drawing on
their knowledge, programmers have to interpret the previous representation by reifying abstract
concepts, explicating the implicit, and structuring the symbols existing at the new representation
level.
Knowledge required in programming can be divided into two categories: domain knowledge and programming knowledge. Domain knowledge is the knowledge about the problem
domain and is mainly used in the process of problem framing. Programming knowledge is the
knowledge needed to construct a program in the process of problem solving. However, due
to the intertwined nature of those two processes, programming knowledge also contributes to
problem framing, and domain knowledge contributes to problem solving as well. Figure 2.1
illustrates the process model of programming and its reliance on knowledge.
2.2 Programming Knowledge
Among the many constituents of programming knowledge (for example, the operation of compilers and other tools, general data structure knowledge, and the capability of reasoning and abstracting), program plans and building blocks are two of the most important. As a series of interconnecting actions to achieve a goal (Soloway & Ehrlich, 1984; Rich & Waters, 1990), a program plan provides a skeleton structure for programs by abstracting key elements. Building blocks are the primitive elements provided by a programming language. They include basic statements of a programming language and reusable software components in repositories or libraries.
[Figure 2.1 (diagram). Labels: Problem in Actual Problem Space; Problem Framing; Situation Model in Represented Problem Space; Problem Solving; Program in Solution Space; Domain Knowledge; Domain-Specific Programming Knowledge; Programming Knowledge.]
Figure 2.1: The process model of programming
Problem framing and problem solving are intertwined and they require both domain
knowledge and programming knowledge. Domain knowledge and programming
knowledge often overlap, and the overlap becomes domain-specific programming
knowledge.
2.2.1 Program Plans
Considerable evidence exists in empirical studies of programming that program plans
are the basic cognitive chunk used in program design and understanding (Soloway & Ehrlich,
1984; Rich & Waters, 1990). Programs are often built up one plan chunk at a time (Rist, 1995;
Detienne, 1995). Because program plans are abstract representations of a solution, during the
process of programming, they need to be gradually fleshed out with building blocks. A program
often contains different plans that are interlaced. Figure 2.2 shows a program and the program
plans it uses. Program plans are hierarchical. A program plan at a higher abstraction level is
built upon program plans of lower levels. For example, in Figure 2.2, the plan Shuffling an
array comprises three other program plans: Loop over an array, Create a random number in a range, and Swap two numbers.
2.2.2 Building Blocks
Although programmers can, in principle, build a program with only the basic statements of a programming language, building a complex software system from basic program statements alone is as impractical as building a jet airplane from only nuts and bolts. Reusable software components are an indispensable part of the building blocks, especially in today’s object-oriented
programming languages. A reusable software component is a software module that can be integrated into a new program directly or after minor changes. A software module refers to a
named and addressable abstraction—either a procedural abstraction, such as a function, or a
data abstraction, such as a class. Procedures, functions, methods, and classes are all considered
software modules. In this dissertation, the term module refers to software abstractions to be
developed by programmers, and the term component is used to refer to those modules that have
been packaged for reuse. Because basic program statements of a programming language are not
of interest in this research, the term “building block” is used throughout interchangeably with
the term “software component.”
2.2.3 Orthogonality of Program Plans and Software Components
Software components are used to realize program plans. Program plans and software
components are orthogonal to each other: a program plan can be realized with different software
components, and a software component can be used in the realization of different program
plans. Figure 2.3 illustrates the orthogonal relationship between program plans and software
components.
01 public class CardDealer {
02     static int[] cards = new int[52];
03     static { for (int i = 0; i < 52; i++) cards[i] = i; }
04     /** create a random number in a range */
05     public static int getRandomNumber(int from, int to) {
06         return ((int) (Math.random() * (to - from)) + from);
07     }
08     /** shuffle the cards */
09     public static void shuffleCards() {
10         int r, temp;
11         for (int i = 0; i < 52; i++) {
12             r = getRandomNumber(i, 52);
13             temp = cards[i];
14             cards[i] = cards[r];
15             cards[r] = temp;
16         }
17     }
18     public static void main(String[] args) {
19         shuffleCards();
20         for (int i = 0; i < 52; i++) {
21             System.out.print(" " + cards[i]);
22         }
23     }
24 }
The above program contains the following program plans:

| Plan Name | Plan Description | Lines Realizing the Plan |
| --- | --- | --- |
| Create a random number in a range | Get the range; convert a random number in [0, 1.0) to the range. | 5, 7 |
| Swap two numbers | Save one value to a temporary variable; move the other value into the first value's place; move the temporary variable into the other value's place. | 13-15 |
| Loop over an array | Initialize; set the ending condition; perform operations; increase the loop variable. | 11-16; 20-22 |
| Shuffling an array | Loop over an array; create a random number in a range; swap two numbers. | 11-16 |

Figure 2.2: A program and its program plans
[Figure 2.3 (AND/OR diagram). Task: Shuffling an array. Plans: Swap two numbers; Create a random number in a range; Loop over an array; Swap two sets of numbers. Components: swapInt(); Math.random(); getInt(int, int); swapRanges().]
Figure 2.3: Orthogonality between program plans and software components
The task Shuffling an array can be implemented in at least three ways:
(1) with nodes connected with solid lines (concrete implementation shown in Figure 2.2, except that swapInt was implemented with primitive statements)
(2) with nodes connected with thick dashed lines, i.e., with the program plans
Create a random number in a range, Loop over an array, and
Swap two sets of numbers, and the components getInt(int, int)
and swapRanges()
(3) with the same program plans as in (2), and components connected with thin
dashed lines, i.e., the components swapInt() and Math.random().
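The orthogonality described in Section 2.2.3 can be sketched in Java as follows. The plan Shuffling an array is written once, and the step Create a random number in a range is supplied by interchangeable components. The class ShufflePlan and its method names are hypothetical and only illustrate the idea; they are not the components named in Figure 2.3.

```java
import java.util.Random;
import java.util.function.IntBinaryOperator;

// Hypothetical sketch: one plan ("Shuffling an array") realized with interchangeable components.
final class ShufflePlan {
    private ShufflePlan() {}

    /** Component A: a range-limited random number built on Math.random(), as in Figure 2.2. */
    static int viaMathRandom(int from, int to) {
        return (int) (Math.random() * (to - from)) + from;
    }

    /** Component B: the same plan step realized with java.util.Random instead. */
    static IntBinaryOperator viaRandom(Random rng) {
        return (from, to) -> from + rng.nextInt(to - from);
    }

    /** The plan "Swap two numbers", written with primitive statements. */
    static void swap(int[] a, int i, int j) {
        int temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }

    /** The plan itself: loop over the array, pick a random index in [i, length), and swap. */
    static void shuffle(int[] a, IntBinaryOperator randomInRange) {
        for (int i = 0; i < a.length; i++) {
            swap(a, i, randomInRange.applyAsInt(i, a.length));
        }
    }
}
```

A caller could write ShufflePlan.shuffle(cards, ShufflePlan::viaMathRandom) or ShufflePlan.shuffle(cards, ShufflePlan.viaRandom(new Random())). The plan stays the same while the component realizing one of its steps changes.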
2.3 Opportunistic Programming
Different strategies exist to develop a program. A top-down development strategy starts
with decomposing the programming task into subtasks, choosing program plans to achieve those
subtasks, and then fleshing out the program plans with reusable software components and program statements. A bottom-up development strategy starts with selecting reusable software
components, and then combining them according to the structure of a program plan.
Empirical studies have revealed, however, that most programmers follow neither the top-down nor the bottom-up design strategy. In fact, their programming activities are very opportunistic: They are a mixture of top-down and bottom-up strategies, and which strategy is chosen
depends on the knowledge of individual programmers and the particular situation (Curtis et al.,
1988; Visser, 1990). Interim decisions made during the programming process “often can lead
to subsequent decisions at arbitrary points in the [programming] space” (Hayes-Roth & Hayes-Roth, 1979).
The opportunistic nature of programming comes from the difference in each programmer’s
knowledge of program plans and software components. Simon (Simon, 1996) has pointed out
that cognitive activities are determined by the environment in which they take place. The environment includes information present in the workspace as well as information present in the
memory of human beings. Information in the workspace, including partially constructed programs, talks back to the problem solvers (programmers) and serves as cues to activate relevant
program plans and software components from memory (Schön, 1983). Due to the difference
in programmers’ familiarity with program plans and software components, which determines
the link strength from cues to the activated knowledge in memory, it is quite natural that the
programming process pursued by each programmer is different, and the resulting solutions
vary. Taking Figure 2.3 as an example, if the programmer is more familiar with the component
swapRanges, he or she may choose the program plan Swap two sets of numbers,
and the final implementation will be the one connected with thick dashed lines. Conversely,
if he or she is more familiar with the program plan Swap two numbers, he or she may
proceed from that program plan and choose the component swapInt.
The lack of knowledge about reusable software components needed to implement a program plan often prevents programmers from considering it. However, if information about
relevant reusable components is somehow present in the current workspace, it can expand programmers’ solution spaces that are limited by their knowledge. Active component repository
systems can complement programmers’ insufficient knowledge of reusable components by presenting them with immediately accessible components relevant to the current programming task
in the workspace.
2.4 Benefits of Software Components in Programming
Reusable software components have both short-term and long-term benefits for the development of software systems. Short-term benefits are the immediate benefits that a programmer
can attain during the implementation of a programming task. Long-term benefits may not be
immediately enjoyed by the programmer who reuses the components, but they extend to the
whole life cycle of the software system and to later programming activities of the programmer.
2.4.1 Short-Term Benefits
Reduced Development Time. By reusing existing software components, programmers write less code and thus spend less time programming. Furthermore, because reusable
components are usually carefully tested already, less time is needed in debugging and testing, which are the “hard and slow part” of programming (Brooks, 1995). Lim (Lim, 1994)
has reported that in a Hewlett-Packard division, a nearly linear relationship exists between the
percentage of reused code in the product and the productivity of programmers, measured in
LOC/pm (the number of lines of noncomment source code produced by a programmer in a
month). With only 5% reused code, productivity is 550 LOC/pm, and as the percentage of reused
code increases to 81%, the LOC/pm reaches 2,850. Similar reports can be found in (Browne
et al., 1990; Hallsteinsen & Paci, 1997).
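As a rough check on the two data points quoted above, and assuming the relationship is exactly linear between them (the source says only that it is nearly linear), the implied average gain per percentage point of reused code can be worked out as follows. The slope is an illustration derived from those two figures alone, not a value reported by Lim.

```latex
\frac{2850 - 550}{81 - 5} = \frac{2300}{76} \approx 30.3 \ \text{LOC/pm per percentage point of reused code}
```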
Improved Quality. Because software components are often repeatedly reused, the defect
fixes from each reuse accumulate, resulting in higher quality of the developed software systems.
Raymond has vividly described this incremental bug fix process as “given enough eyeballs,
all bugs are shallow” in his seminal essay that explains why Open Source software systems
tend to have high quality (Raymond & Bob, 2001). Basili et al. (Basili et al., 1996) have
reported that the error density (errors per thousand lines of code) drops from 6.11 for systems
developed without reuse to 0.12 for systems developed from reusable components. Similar
formal evaluations on the contribution of reuse to the improved quality of software systems can
be found in (Lim, 1994; Thomas et al., 1997).
2.4.2 Long-Term Benefits
Easy Maintenance. Reusable components contribute to easy maintenance not only because they have fewer defects, but also because they facilitate communication among software
developers by providing a common vocabulary, especially for the indirect communication between system builders and system maintainers. Because reused software components are
high-level abstractions, system maintainers do not need to look into the details of implementation to uncover the original intentions of the system builders.
Improved Evolvability. To cope with constantly changing requirements and implementation platforms, software systems must be able to evolve. Reusing software components
improves the evolvability of software systems because it can limit the needed change to components instead of identifying and changing all occurrences distributed all over the system.
Graham (Graham, 1995) has reported a very typical example as follows. Three project teams in
a company had used the same formula in their software systems. Later, they discovered an error
in the formula and needed to modify the systems. The team that had not created a component
for the formula spent 5 weeks finding and correcting each occurrence of the formula. The other two
teams, which had put the formula in a set of components, spent 1.5 days and 2 days, respectively,
to correct the system.
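The effect reported by Graham can be illustrated with a minimal Java sketch. The formula used here (simple interest) and all class names are hypothetical, since the source does not say which formula the teams shared. The sketch shows only that when a formula lives in one component, a correction is made in one place and every caller picks it up.

```java
// Hypothetical sketch: a formula packaged as one component, so a fix happens in one place.
final class InterestFormula {
    private InterestFormula() {}

    /** Simple interest; the choice of formula is an illustrative assumption. */
    static double simpleInterest(double principal, double annualRate, double years) {
        return principal * annualRate * years;
    }
}

final class LoanReport {
    private LoanReport() {}

    /** Reuses the component instead of repeating the formula inline. */
    static double totalDue(double principal, double annualRate, double years) {
        return principal + InterestFormula.simpleInterest(principal, annualRate, years);
    }
}

final class SavingsReport {
    private SavingsReport() {}

    /** A second caller of the same component; it needs no change if the formula is corrected. */
    static double balance(double deposit, double annualRate, double years) {
        return deposit + InterestFormula.simpleInterest(deposit, annualRate, years);
    }
}
```

Had LoanReport and SavingsReport each repeated the expression inline, a corrected formula would have to be located and changed at every occurrence.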
Increased Problem Framing Ability. The representation of a problem is an important
determinant of the range of solutions that will be considered, as well as an important source
of problem-solving difficulty (Hayes & Simon, 1977). Reusable software components provide
programmers with higher level concepts that are both close to application domains and easy to
implement. Components increase programmers’ ability to frame the problem into representations that are easier to solve. A component creates an abstraction for an existing solution, and
it reduces the number of items that a programmer has to hold in simultaneous contemplation
because the programmer can refer to the whole solution with the abstraction, in place of the details of the solution. Figure 2.4 illustrates the contribution software components make to the
problem-framing ability. Without the support of reusable components, programmers have to
frame each concept in the problem domain based on their knowledge of the programming language; with the support of software components, however, the difficulty of problem framing is
reduced because certain concepts can be directly mapped to the components.
[Figure 2.4 (two diagrams). Top: Concepts in Problem Domain; Programming Languages; Computer; roles Programmer and Compiler Developer. Bottom: Concepts in Problem Domain; Components; Programming Languages; Computer; roles Programmer, Component Developer, and Compiler Developer.]
Figure 2.4: The role of components in problem framing
In the top figure, programmers have to frame each concept in the problem domain
based on their knowledge of the programming language. In the bottom figure, programmers can represent some domain concepts with components directly (such as
those having the same fill pattern in concepts and components) without thinking of
their implementation.
Chapter 3
Challenges of Software Reuse
This chapter consists of two parts. The first part overviews software reuse to provide a
broad background for this research. It defines the concept and scope of software reuse; describes
different kinds of reusable software artifacts to establish the link between component reuse and
other reuse research efforts; and discusses managerial issues, legal issues, technical issues,
and cognitive issues involved in instituting a reuse program within a software development
organization. The second part of the chapter analyzes the difficulties of component-based reuse
from the perspective of programmers who want to reuse.
3.1 Overview of Software Reuse
3.1.1 Software Reuse and Reusable Software Artifacts
A broad definition of software reuse is using existing software artifacts to construct a new
software system. A software artifact can be defined as a piece of formalized knowledge that can
contribute to the software development process (Dusink & Van Katwijk, 1995). There are two
types of software artifacts: (1) software products that are created as “things” or deliverables
during the development process, and (2) development knowledge that is applied to the process.
The most commonly reused software product is source code, which is the final and most
important product of software development. In addition to code, any intermediate life cycle
products can be reused, which means that software developers can pursue the reuse of requirement documents, system specifications, modular designs, test plans, test cases, and documentation in various stages of software development.
Reusable software development knowledge and experience exist at different abstraction levels: the architecture level, the modular design level, and the program (or code) level.
Research on software architecture is currently aiming to define different software architecture
styles for different families of software systems (Perry & Wolf, 1992). A software architecture
style describes the formal arrangement of architectural elements, and can be reused by software developers to construct their new software systems once the style is well defined (Shaw &
Garlan, 1996; Taylor et al., 1996). For example, the domain-independent multifaceted architecture is an architecture style for domain-oriented design environments, which has been reused in
and refined through the development of many generations of design environments for different
domains (Fischer, 1994).
Reusable knowledge on modular design can be codified in design patterns (Alexander
et al., 1977) and frameworks. A design pattern is the description of a solution to a recurring
problem. It specifies a problem to be solved, a solution that has stood the test of time, and the
context in which the solution works (Gamma et al., 1994). Design patterns provide a common
vocabulary for software developers to discuss their designs and can be passed from one developer to another developer for reuse. The concept of framework comes from object-oriented
programming languages (Fischer et al., 1995). A framework describes the interaction pattern
among a set of collaborative classes or objects, and can be represented as a set of abstract classes
that interact with each other in a particular way (Johnson, 1997). Programmers can reuse frameworks directly in their development after providing implementations for those abstract classes.
Framework reuse is a mixture of knowledge reuse and code reuse.
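The description of a framework as a set of abstract classes that interact in a particular way can be sketched in Java as follows. ReportFramework and CardReport are hypothetical names introduced only for illustration. The abstract class fixes the interaction pattern, and the reusing programmer supplies the missing implementations.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a tiny framework: the abstract class fixes the interaction pattern.
abstract class ReportFramework {
    /** The fixed interaction pattern; subclasses cannot change the order of the calls. */
    final String run() {
        StringBuilder out = new StringBuilder();
        out.append(header()).append('\n');
        for (String item : items()) {
            out.append("- ").append(item).append('\n');
        }
        return out.toString();
    }

    /** Supplied by the programmer who reuses the framework. */
    protected abstract String header();

    /** Supplied by the programmer who reuses the framework. */
    protected abstract List<String> items();
}

// The reusing programmer provides only the implementations of the abstract methods.
final class CardReport extends ReportFramework {
    @Override
    protected String header() {
        return "Cards";
    }

    @Override
    protected List<String> items() {
        return Arrays.asList("ace", "king");
    }
}
```

The control flow in run() is the reused design knowledge, and the compiled ReportFramework class is the reused code, which matches the description of framework reuse as a mixture of the two.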
Programming knowledge at the level of code is represented as program plans that can
also be reused by programmers if a suitable representation form is defined (Rich & Waters,
1988).
3.1.2 Two Approaches to Reuse
Another dimension to classify reuse research is the approach it takes: Reuse can be
generation-based or composition-based.
The generation-based approach reuses the process of previous software development
efforts, often embodied in computer tools that automate a part of the development life cycle (Henderson-Sellers & Edwards, 1990). This approach weaves domain knowledge and programming knowledge into a very high-level programming language (VHLL), which is then converted to executable systems by a VHLL compiler or an application generator. Because VHLLs
have a higher abstraction level than most high-level languages (HLLs), such as Java and C,
they are relatively closer to programmers’ informal requirements. They are meant for programmers to describe what the computer program does instead of how it is implemented. Compilers
of VHLLs directly convert the VHLL programs into executable programs in HLLs. Lex and
Yacc in Unix are two well-known examples; other research prototypes include SETL (Dubinsky
et al., 1989) and PAISLey (Zave & Schell, 1984). Unlike VHLL compilers, which make the
conversion from VHLL programs to HLL programs in one step, application generators often
use a series of transformation rules to transform VHLL programs into HLL programs. A transformation rule maps a program in one abstraction level to a semantically equivalent but more
computationally efficient program (Feather, 1989). Transformation-based application generators allow programmers to control which transformation rule is applied when several applicable
rules exist (Biggerstaff, 2000). Problems with the generation-based approach are the following:
(1) VHLLs are often defined for an extremely small application domain.
(2) Most VHLLs use mathematical abstractions, such as set theory or logic theory, that are
actually more difficult to learn and to use than HLLs.
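The notion of a transformation rule described above, a mapping from one program to a semantically equivalent but cheaper one, can be sketched on a toy expression tree. The rule shown here is constant folding, a standard compiler rewrite chosen for brevity; it is an assumption made for illustration and is not taken from any of the application generators cited. All names in the sketch are hypothetical.

```java
// Hypothetical sketch of one transformation rule (constant folding) on a toy expression tree.
final class TransformationDemo {
    private TransformationDemo() {}

    /** An expression over a single integer variable x. */
    interface Expr {
        int eval(int x);
    }

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public int eval(int x) { return value; }
    }

    static final class Var implements Expr {
        public int eval(int x) { return x; }
    }

    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public int eval(int x) { return left.eval(x) + right.eval(x); }
    }

    /** Rewrites Add(Num a, Num b) to Num(a + b), bottom-up; all other nodes are kept. */
    static Expr fold(Expr e) {
        if (e instanceof Add) {
            Expr l = fold(((Add) e).left);
            Expr r = fold(((Add) e).right);
            if (l instanceof Num && r instanceof Num) {
                return new Num(((Num) l).value + ((Num) r).value);
            }
            return new Add(l, r);
        }
        return e;
    }
}
```

Applying fold to x + (2 + 3) yields x + 5, which evaluates to the same result for every x while performing one addition fewer at run time.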
The VHLLs in the generation-based reuse approach have many overlaps with end-user programming languages (Repenning, 1993) that provide end-users with a simple instruction set at
the abstraction level of the problem domain so they can modify the behavior of the application
systems to their own needs or add new functionality (Girgensohn, 1992; Fischer & Eisenberg,
1994).
The composition-based approach reuses existing software products in a new system to
avoid repetitive work. As mentioned in the previous section, many types of software products
can be reused. However, because this research focuses on the reuse of components, the discussion here is limited to component reuse, although many problems and solutions discussed
can be extrapolated to the reuse of other software products. Component reuse is also known as
component-based development. Based on the role that components play in the programming process, component reuse is further divided into three categories.
Black-Box Reuse. In black-box reuse, a component is directly reused without modification. A component can be reused as it is or reused through inheritance if the
programmer creates a specialized subclass of an existing class component.
White-Box Reuse. In white-box reuse, programmers reuse the component after they
have modified it to fit their needs. White-box reuse does not contribute as
much to the easier maintenance and evolution of software systems as black-box reuse
does, but it can reduce development time.
Glass-Box Reuse. In glass-box reuse, programmers do not directly reuse the component; instead, they use it as an example for their own development. For instance,
programmers can look at examples to find out how a program plan is realized and
build their own system through analogy. Glass-box reuse contributes indirectly to the
quality and productivity of programming because examples can reduce the cognitive
load of programmers (Neal, 1996).
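The two forms of black-box reuse described above, reuse as is and reuse through inheritance, can be sketched with standard Java library classes. BlackBoxReuse and Deck are hypothetical names. White-box and glass-box reuse are not shown, because they involve editing or imitating a component's source rather than calling it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of black-box reuse in its two forms.
final class BlackBoxReuse {
    private BlackBoxReuse() {}

    /** Reuse as is: an existing library component is called without modification. */
    static List<Integer> shuffledDeck(long seed) {
        List<Integer> deck = new ArrayList<>();
        for (int i = 0; i < 52; i++) {
            deck.add(i);
        }
        Collections.shuffle(deck, new Random(seed));
        return deck;
    }

    /** Reuse through inheritance: a specialized subclass of an existing class component. */
    static final class Deck extends ArrayList<Integer> {
        Deck() {
            for (int i = 0; i < 52; i++) {
                add(i);
            }
        }

        /** Removes and returns the last card; storage is inherited from ArrayList. */
        Integer deal() {
            return remove(size() - 1);
        }
    }
}
```

In both forms the source of ArrayList and Collections.shuffle is left untouched, which is what distinguishes black-box reuse from the white-box case.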
3.2 General Issues of Component Reuse
Despite its many benefits, component reuse has not yet achieved wide success in practice because of the many difficulties associated with it. Component reuse introduces two different
processes in the life cycle of software development:
(1) the process of developing for reusable components
(2) the process of developing with reusable components
The first process creates component repositories by identifying, developing, and indexing components. The second process, commonly known as the reuse process, is conducted by programmers
who want to reuse. In the reuse process, programmers need to locate, comprehend, and integrate components (see Figure 1.1 in Chapter 1). For reuse to succeed widely, the
managerial, legal, technical, and cognitive issues raised by both processes must be overcome.
3.2.1 Managerial Issues
Successful systematic reuse requires the support and commitment from the managers of
a software development organization. Managers should foster a reuse culture in their organization by encouraging programmers to reuse. For example, managers must stop evaluating the
performance of programmers based on the lines of code produced, which, unfortunately, still
occurs in many software development organizations. This evaluation criterion obviously discourages reuse by programmers because programs developed with reusable components have
fewer lines of code.
To encourage reuse, component repositories, either purchased from third parties or developed in-house, should be set up. To do so, managers must be willing to make long-term
investments. This also requires good metric models to analyze the economics of reuse and to
identify the most effective reuse strategies. Several reuse metric models have been proposed
and are in use in some companies. However, these models still lack formal validation (Frakes
& Terry, 1996).
3.2.2 Legal Issues
Protecting legal rights of creators and consumers of software components is another difficult aspect of instituting reuse. Currently, software is protected by copyrights that are designed
to protect products in the “world of atoms”. In the world of atoms, after a product is passed
from its owner to its customer, the owner no longer owns it, and the customer possesses full
ownership. Software components are made of bits. In the “world of bits”, ownership does not
change hands in the same way as in the world of atoms (Cox, 1996). It is extremely easy to
reproduce a software component without any loss of quality, and even after the software component is passed from its owner to its customer, the owner can still have the exact same software
component as the customer does. This presents a difficulty in estimating the cost of reusable
components as well as defining a suitable standard mechanism to charge fees. For example,
when a customer uses a purchased component in his or her application and sells many copies
of the application to many end users, how much should the customer who purchased the component pay to the original creator? Should the end users of the developed application pay the
original creator as well? And if that is the case, how should the law ensure that such payments
are made? The lack of a standard mechanism of charging for the use or reuse of components
inhibits the emergence of a robust market for components, which is essential to the widespread
reuse of components.
3.2.3 Technical Issues
The implementation of systematic reuse has to overcome many technical difficulties.
These difficulties exist in both phases of reuse: the initial setup of reusable component repositories and the actual reuse of components during programming.
A great investment, both intellectually and financially, is required to develop and maintain reusable component repositories. First, it is difficult to identify what kind of components
should be included in component repositories. Second, the development of reusable components
is more difficult than usual development because reusable components should be more general
and require higher quality and better documentation. It costs an estimated two or three times
more to develop a reusable component than an ordinary component (Jones, 1984; Lim, 1994).
Reusable component developers need to balance the inherent dilemma between the component
size and its reuse potential: A larger component has more reuse value than a smaller one, but
its reusability decreases because larger components are often more specific and more difficult
to understand.
Another dilemma in setting up a component repository is the relationship between the
number of components in a repository and the ease of finding the needed components. For a
component repository to really pay off requires a critical mass of available components; however, as the number of components increases, it becomes more difficult for programmers to find
the needed components. Only repositories with a large number of components can reap the real
benefits of reuse, so an effective searching mechanism must be provided for programmers to
find the needed component.
3.2.4 Cognitive Issues
Even if good reusable component repositories do exist and a favorable reuse culture is
fostered in an organization, reuse still fails if programmers do not put it into practice. After all,
reuse must be carried out by programmers. This creates another dilemma for reuse: If programmers do not reuse, the huge investment of building reuse repositories cannot be justified; and
if the investment in reuse repositories cannot be justified, companies are less willing to create
them, and then programmers have nothing to reuse.
Programmers’ resistance to reuse is often falsely dismissed as a mere attitude problem
caused by the so-called NIH syndrome. However, recent studies have revealed that most programmers do not have the NIH syndrome (Fafchamps, 1994; Frakes & Fox, 1995). On the
contrary, programmers are very motivated to reuse if they know of the component or know how
to locate the component (Lange & Moher, 1989; Isoda, 1995). What prevents programmers
from reusing is the fundamentally limited capability inherent in human cognition (Curtis, 1989;
Fischer et al., 1991): the limitation of short-term memory, the scarcity of human attention, the
mental inertia of coping with changes, the subjectiveness of evaluation, and the ambiguity of
natural language. Section 3.4 provides a more detailed analysis of what kind of cognitive challenges programmers have to overcome in order to reuse, and what kind of technology is needed
to address such challenges.
3.3 Creating Reusable Components
Although this research is not concerned in particular with the creation of reusable components,1 because “anybody who sells a technology for reuse without providing a library of
components is a snake oil salesman, a fraud, a charlatan” (Zand et al., 1997), it is worthwhile to
point out the possible sources from which reusable components will come.
As mentioned before, creating reusable components is difficult, time-consuming and
expensive, and repositories of components with good quality are rare. Nevertheless, stable
progress has been made recently in several directions.
3.3.1 Domain Analysis and Product-Line Analysis
Domain analysis is the identification, analysis, and specification of common requirements from a specific application domain for reuse on multiple projects within that application domain. Domain analysis produces a domain model, which is used as a starting point
to construct specifications and designs for many different systems within the application domain (Kang, 1998). The domain analysis can be either synthetic or evidentiary (Fischer et al.,
1995).
The synthetic domain analysis approach resembles the process of developing a single application, but it is more broadly conceived. It starts with an informal description of the application domain, identifies the common features, and develops reusable components corresponding
to each feature in the domain.
1 Components in the repository of the CodeBroker system come from existing libraries. For more details, see Section 6.2.
The evidentiary domain analysis approach starts with existing systems in an application
domain, using reverse engineering or design recovery (Ye, 1996) to identify and repackage
common components for later reuse.
Product-line analysis is a more comprehensive approach than domain analysis. A product line is a set of products, already existing or planned to be developed, that share a common set of
requirements but also exhibit significant variability in requirements (Griss, 2000). Product-line
analysis differs from domain analysis in that it not only extracts the commonality of the family
of systems but also provides a systematic way to treat their significant variability. In addition
to common reusable components, product-line analysis often creates a product-line architecture
for the family of related systems where reusable components can be plugged in (Batory et al.,
2000).
3.3.2 Commercial Off-the-Shelf
Thousands of companies worldwide are developing their own information systems. There
are three problems in this regard: (1) because most companies do not have enough expertise in
software development, they cannot produce information systems with the highest quality; (2)
because these systems are often developed internally and do not follow interoperation standards,
it is very difficult to integrate them; and (3) similar functionality has been repeatedly developed.
Although it may take decades for it to dominate software development, the market of COTS
(Commercial Off-the-Shelf) is rapidly taking shape (Morisio et al., 2000). COTS comes in a
variety of types and levels of software, e.g., components that provide specific functionality (such
as subroutines, classes, frameworks, and even complete applications) or tools used to generate
code (such as domain-oriented language processors and application generators). Many companies are providing reusable off-the-shelf components for specific domains, and those components can be purchased by developers from the market. As this trend continues, programmers
may be able to create their own systems in the future by integrating components from different
component vendors. For example, programmers or even end users may create their own word
processing applications by integrating components of outline mode, spell-checking, grammar
correction, and diagram drawing purchased on the market in the same way as they purchase standard
applications now.
3.3.3 Open-Source Components
With the advent of the Open Source movement (DiBona et al., 1999), many developer communities, such as the Gamelan website2 and the Giant Java Tree,3 have formed,
through which programmers can freely exchange their developed products. Moreover, some high-quality reusable component repositories, such as the Jun library (Aoki et al., 2001), have become
open-source too. Traditionally, reusable components are created and maintained by creators
who develop those components. Programmers who reuse those components are consumers, and
do not directly contribute to the creation and evolution of components. By giving programmers full access to the source code, Open Source breaks down the binary choice of creators
and consumers (Fischer, 1998a) so that consumers can directly participate in the maintenance
and improvement of reusable components or even derive new components from existing ones.
This encourages the natural emergence of reusable components with good quality, following the
seeding, evolutionary growth, and reseeding (SER) model (Fischer, 1998b). The initial creator
develops the component (seeding), and the component experiences evolutionary growth when
it is reused and modified by many other programmers (or consumers). As those modifications
are incorporated back into the original component (reseeding), the quality and reusability of
the component will improve. The Open Source development model is particularly promising
in pushing reuse to a large scale because programmers working on complementary projects can
each leverage the results of the other freely.
2 http://www.gamelan.com
3 http://www.gjt.com
3.4 Understanding the Cognitive Difficulties of Component Reuse
The exciting advance in the creation of reusable components cannot lead to the materialization of reuse if programmers are not able to reuse them during their own system development.
To create appropriate tools to assist programmers in reusing, we need to identify the tasks faced
by programmers when they try to reuse and the cognitive skills required to perform those tasks.
3.4.1 Cognitive Engineering
Component reuse is a cognitive activity, which is a goal-directed problem-solving effort.
To understand the complexity of cognitive activities, Norman (1986) has developed
a method called cognitive engineering that applies what is known from cognitive science to
the design and construction of tools that assist cognitive activities of human beings. Such
cognitive tools, including reusable component repository systems, must address the discrepancy
between a user’s goal, expressed in terms relevant to the user and his or her task, and the tool’s
mechanism, expressed in terms relative to it. This discrepancy creates two gulfs: the gulf of
execution and the gulf of evaluation. The gulf of execution is the gap from goals to tools, and it
must be bridged by three consecutive efforts:
Intention Formation. Users decide to do something with an internal specification of
the task created from their goal.
Action Specification. Users externalize the internal specification into a sequence of
specified actions.
Action Execution. The actions are executed with the tool.
The gulf of evaluation is the gap from the tool output to the intended goal, and it must be
bridged by another three consecutive efforts:
System Perception. Users perceive the output of the tool.
Interpretation. Users interpret the perceived output.
Evaluation. Users compare the interpretation with the original goal.
3.4.2 A Cognitive Model of Component Reuse
Like other cognitive activities, component reuse starts with a goal in the mind of a programmer. To achieve such goals, programmers have to translate their internal intentions into
a series of physical actions constrained by a component repository system. Figure 3.1 illustrates the actions needed for the reuse process to succeed based on the method of cognitive
engineering (Ye et al., 2000):
Forming Reuse Intentions. As the first step to start a reuse process, programmers
must consciously decide to use the repository with requirements for reusable components in mind. The requirements for reusable components arise from the goal of the
programming tasks.
Formulating Reuse Queries. Programmers formulate their intentions as a reuse query
in the terms provided by the component repository system. This is the step of action
specification.
Retrieving Components. Reusable components matching the queries are retrieved
from the repository system. This is the step of action execution.
Choosing Components. When the repository system returns matching components,
programmers must choose the appropriate one by comparing them with the reuse intentions. This action of choosing corresponds to the gulf of evaluation for the reuse
activity: Programmers read each retrieved component, interpret its meaning, and evaluate whether the component can be reused in their tasks. The evaluation may result
in the reformulation of the original query if no suitable components are found because
the reuse query has not been appropriate.
Integrating Components. Chosen components are integrated into the current program. If there is not a complete fit, programmers need to modify the component or
write a wrapper to adapt it. Integration is also a part of evaluation because the component is finally reused only if it can be integrated into the current programming task.
Figure 3.1: A cognitive model of the component reuse process
The cognitive model in Figure 3.1 is a refinement of the location-comprehension-modification
cycle (LCM-cycle) of a reuse process in Figure 1.1 (Fischer et al., 1991), with special emphasis
on the location process, which is the focus of this research. Although the LCM-cycle acknowledges and stresses the difficulty of formulating appropriate reuse queries in the location process,
it does not point out that formulating reuse queries must be preceded by the forming of reuse
intentions. Furthermore, the comprehension step in the LCM-cycle does not differentiate two
different levels of comprehension: comprehension for choosing components and comprehension for integrating components. Comprehension for choosing is still a part of the location
effort because it may result in the reformulation of queries.
3.4.3 Cognitive Challenges of Component Reuse
Each action in this cognitive model of component reuse poses challenges to programmers
and may deter the success of reuse without appropriate tool support.
3.4.3.1 Vocabulary Learning: A Prerequisite for Forming Reuse Intentions
Reusable components help programmers think at higher levels of abstraction and increase
the “vocabulary” programmers can use to create and interpret program designs (Krueger, 1992).
However, programmers must learn the syntax and the semantics of the new vocabulary to take
advantage of reusable components. At the least, programmers should know the existence of the
components; otherwise, they are not able to form reuse intentions in the first place, and reuse
would fail in this very first step. Vocabulary learning is a major part of the cognitive barrier to
reuse (Brooks, 1995; Fichman & Kemerer, 1997).
Unlike the syntax of programming languages, which can be learned through schooling
or tutoring before programmers start working, the mastery of reusable components cannot be
completed in classrooms or merely by reading books (Ye, 1998). Due to the large volume
of reusable components and their constantly evolving nature, total coverage is impossible and
obsolescence is unavoidable. Moreover, learning components is less effective when the components are separated from their use context; components are better learned when they are needed
for a programming task. Therefore, component learning needs to be integrated with working
where the components are reused, and programmers should learn components on demand—that
is, learn the component when it is needed (Fischer, 1991). However, a pitfall for the learning-on-demand
model is that programmers need to learn a component because they do not know
it, but because they do not know it, they may settle on a suboptimal solution by creating their
own program instead of reusing the existing component. To support learning-on-demand, component repository systems should be able to identify learning (and reusing) opportunities by
connecting programmers to the components that can be reused in their current task.
3.4.3.2 Conceptual Gap in Formulating Reuse Queries
Formulating reuse queries refers to transforming internal reuse intentions into explicit,
external reuse queries. Reuse intentions, derived from development activities, are conceptualized in the situation model (Kintsch, 1998) that is related to the application task to be solved
and to the concerns of the programmer. A situation model is the mental model programmers
have of their environment. A system model is the “actual” model of a computer system. For a
component to be retrieved, these intentions need to be mapped from the user’s situation model
onto the system model, namely the repository system (Fischer et al., 1991). Without enough
knowledge about the system model of component repository systems, programmers cannot formulate reuse queries appropriately. This conceptual gap between situation model and system
model is another cognitive barrier to reuse.
There are two types of conceptual gap between situation model and system model: vocabulary mismatch and abstraction mismatch. The vocabulary mismatch refers to the inherent
ambiguity in most natural languages. Thanks to the richness of natural languages, people use
a variety of words to refer to the same concept. Based on their systematic study of word use
of ordinary people in different domains, Furnas et al. have found that the probability that two
persons choose the same word to describe a concept is less than 20% (Furnas et al., 1987). Even
well-trained indexing experts have a 20% disparity on average in choosing terms to describe the
same document (Harman, 1995).
The abstraction mismatch refers to the difference of abstraction levels in requirements
and component descriptions. Programmers deal with concrete problems and thus tend to describe their requirements concretely, whereas reusable components are often described in abstract concepts because they are designed to be generic so they can be reused in many different
situations. For example, in one experiment to evaluate the CodeBroker system, one subject
described his task as follows:4
/** This class contains methods for converting between
western-style numbers (three numbers in a set with a
comma) and Chinese style numbers (four numbers followed
by a comma). For example, 1,000,000--> 100,0000. */
Another subject initially described the same task in a similar way:5
/** Takes a string with a Chinese formatted number and
outputs a western formatted number. */
This task can be easily implemented by setting the group size to 4 with the method setGroupingSize of the class java.text.DecimalFormat. However, the description of
this method (as follows) is abstract: It describes grouping of numbers without mentioning western style or Chinese style in particular.
public void setGroupingSize(int newValue)
Set the grouping size. Grouping size is the number of
digits between grouping separators in the integer
portion of a number. For example, in the number
"123,456.78", the grouping size is 3.
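The following is a minimal sketch of how the task could be solved with this component, assuming a comma as the grouping separator and whole-number input. The class and method names (ChineseGroupingSketch, withGroupSize, westernToChinese, chineseToWestern) are hypothetical and chosen only for illustration; the only repository component used is java.text.DecimalFormat with its setGroupingSize method.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical illustration class; not part of the CodeBroker system.
class ChineseGroupingSketch {

    // Builds a formatter that separates digit groups with commas,
    // using the requested number of digits per group.
    static DecimalFormat withGroupSize(int groupSize) {
        DecimalFormat format =
                new DecimalFormat("#,###", DecimalFormatSymbols.getInstance(Locale.US));
        format.setGroupingSize(groupSize);
        return format;
    }

    // Reads a number written with the source group size and rewrites it
    // with the target group size.
    static String regroup(String text, int sourceSize, int targetSize) {
        try {
            long value = withGroupSize(sourceSize).parse(text).longValue();
            return withGroupSize(targetSize).format(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // Western style groups digits in threes; Chinese style groups them in fours.
    public static String westernToChinese(String western) {
        return regroup(western, 3, 4);
    }

    public static String chineseToWestern(String chinese) {
        return regroup(chinese, 4, 3);
    }

    public static void main(String[] args) {
        System.out.println(westernToChinese("1,000,000")); // 100,0000
        System.out.println(chineseToWestern("100,0000"));  // 1,000,000
    }
}
```

The sketch shows why the abstraction mismatch matters: the whole task reduces to one call to setGroupingSize, yet nothing in the component description mentions Chinese or western number styles.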
3.4.3.3 Effective Retrieval Mechanisms in Retrieving
The retrieval process finds the components that match given reuse queries. An effective
retrieval mechanism—including a representation schema for indexing and a matching criterion
between a query and a component—is essential.
4 The task asks the programmer to implement a program that converts a number written in Chinese format to
an equal number in western format. The traditional way of writing big numbers in Chinese is to group numbers in
fours and add a comma before each fourth digit from the right because Chinese concepts are of ten thousand (wan),
a hundred million (yi), a thousand billion (zhao), instead of thousand, million, and billion.
5 He later realized the description was not good enough because he had not found what he wanted, and modified
it to “/** Takes a string with a Chinese formatted number (numbers grouped into 4 columns separated by commas)
and outputs a western formatted number (3 columns separated by commas). */”, which made the system deliver the
component setGroupingSize.
Many retrieval mechanisms have been proposed in the past (for more detailed descriptions, see related work in Section 9.2). There are three major approaches: text-based, descriptor-based, and formal specification-based. In text-based approaches, components are represented by
their textual documents and information retrieval technology is used to match components to
queries (Maarek et al., 1991). In descriptor-based approaches, components are represented by a
set of selected descriptors. The semantic relationships among those descriptors are captured in
a predetermined structure that can be specified by a semantic network (Henninger, 1997), an AI
frame (Ostertag et al., 1992), a taxonomic category system (Devanbu et al., 1991), or fuzzy
set theory (Damiani et al., 1997). In specification-based approaches, components are represented with formal specification languages, and automatic theorem-proving systems (Zaremski
& Wing, 1997) or specification refinement systems (Mili et al., 1997a) are used to determine
whether a component matches a query, which is also written in a formal specification language.
In terms of complexity of representation schemata, the text-based approach is the simplest and the specification-based approach the most complicated. In general, a complete and
precise representation can make the matching more precise and retrieval more effective. However, because the same representation is also used by programmers to specify their reuse queries,
the schema of representation is greatly limited by programmers’ willingness to formulate long
and precise queries. There is no point in representing every bit of relevant information about
a component if a programmer barely has the patience for typing string search regular expressions (Mili et al., 1995).
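To make the text-based approach concrete, the following sketch ranks components by the overlap between the words of a query and the words of each component's textual description. It is a deliberately simple stand-in, assuming Jaccard word overlap as the matching criterion; it is not the retrieval mechanism of any system cited above, and the class name TextRetrievalSketch and the sample descriptions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of text-based matching; not a cited system.
class TextRetrievalSketch {

    // Representation schema: a document is reduced to its set of lowercase words.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Matching criterion: Jaccard overlap, shared words divided by all words.
    public static double similarity(String query, String description) {
        Set<String> queryTerms = terms(query);
        Set<String> docTerms = terms(description);
        if (queryTerms.isEmpty() || docTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(queryTerms);
        shared.retainAll(docTerms);
        Set<String> union = new HashSet<>(queryTerms);
        union.addAll(docTerms);
        return (double) shared.size() / union.size();
    }

    // Returns the names of all components with a nonzero score, best match first.
    public static List<String> retrieve(String query, Map<String, String> repository) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> entry : repository.entrySet()) {
            if (similarity(query, entry.getValue()) > 0.0) {
                names.add(entry.getKey());
            }
        }
        names.sort(Comparator.comparingDouble(
                (String name) -> similarity(query, repository.get(name))).reversed());
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> repository = new LinkedHashMap<>();
        repository.put("String.toUpperCase",
                "Converts all of the characters in this String to upper case.");
        repository.put("DecimalFormat.setGroupingSize",
                "Set the grouping size. Grouping size is the number of digits "
                + "between grouping separators in the integer portion of a number.");
        // A query phrased in the component's own vocabulary finds it.
        System.out.println(retrieve("number of digits between grouping separators", repository));
        // A query phrased in the task's vocabulary shares no words and finds nothing.
        System.out.println(retrieve("western style numbers and Chinese style numbers", repository));
    }
}
```

Even this small example exhibits the trade-off discussed above: the representation is cheap to build and to query, but it offers no protection against vocabulary mismatch between the query and the description.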
3.4.3.4 Retrieval by Reformulation in Retrieving and Choosing
Because effective use of any information retrieval system requires users to be fairly familiar
with the structure of the information systems and their representation schemata, it is difficult
for most users to create a well-defined query on their first attempt (Jones, 1997). Component
repository systems can, at best, retrieve those components that match the queries submitted by
programmers, but not necessarily match their intentions, many of which are not articulated.
Retrieval by reformulation is a mechanism that allows users to incrementally improve their
queries to match their intentions after they have interpreted and evaluated the retrieved results
and have explored the underlying structure of the information system (Williams et al., 1982;
Fischer & Nieper-Lemke, 1989). If programmers cannot find the needed component from the
first retrieval result, they can reformulate their query by using more appropriate terms that they
learn from retrieved components, or they can narrow the search range by taking advantage of
the structure of the repository, which they may not have known before exploring the retrieved
results.
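One reformulation step can be sketched as follows, under strong simplifying assumptions: the repository is a small in-memory map, matching counts shared words, and the "learned" terms are supplied by hand to stand in for what a programmer picks up from reading the first results. All names and descriptions here (ReformulationSketch, REPOSITORY, the description strings) are hypothetical illustrations, not documentation of any real system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of retrieval by reformulation.
class ReformulationSketch {

    // Toy repository: component name -> simplified description (illustrative text only).
    static final Map<String, String> REPOSITORY = new LinkedHashMap<>();
    static {
        REPOSITORY.put("NumberFormat.format",
                "format a number as text using grouping separators");
        REPOSITORY.put("DecimalFormat.setGroupingSize",
                "set the grouping size the number of digits between grouping separators");
        REPOSITORY.put("String.trim",
                "remove leading and trailing whitespace");
    }

    // Counts how many query terms occur in a description.
    static int score(Set<String> query, String description) {
        Set<String> words = new HashSet<>(Arrays.asList(description.split(" ")));
        int hits = 0;
        for (String term : query) {
            if (words.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    // Returns matching component names, highest score first.
    public static List<String> retrieve(Set<String> query) {
        List<String> matches = new ArrayList<>();
        for (String name : REPOSITORY.keySet()) {
            if (score(query, REPOSITORY.get(name)) > 0) {
                matches.add(name);
            }
        }
        matches.sort((a, b) ->
                score(query, REPOSITORY.get(b)) - score(query, REPOSITORY.get(a)));
        return matches;
    }

    // Reformulation: keep the old query and add terms learned from retrieved results.
    public static Set<String> reformulate(Set<String> query, String... learnedTerms) {
        Set<String> refined = new LinkedHashSet<>(query);
        refined.addAll(Arrays.asList(learnedTerms));
        return refined;
    }

    public static void main(String[] args) {
        Set<String> first = new LinkedHashSet<>(Arrays.asList("chinese", "number", "format"));
        System.out.println(retrieve(first));
        // After reading the results, the programmer adopts the repository's own terms.
        Set<String> refined = reformulate(first, "grouping", "size", "digits");
        System.out.println(retrieve(refined));
    }
}
```

In this toy run the first query ranks a formatting component highest, and only the refined query, which borrows the repository's own vocabulary, brings the grouping-size component to the top.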
3.4.3.5 Component Comprehension in Choosing and Integrating
Being able to comprehend components is necessary both for choosing the right component and for integrating the chosen component.
Comprehension for choosing is focused on what the component does, and it is conducted
in two stages: information discernment and detailed evaluation (Carey & Rusli, 1995). At
the stage of information discernment, programmers avoid spending too much time by quickly
scanning the component and its description to decide whether this component is related to their
current task, and thereby also avoid any deep understanding at this point (Lange & Moher,
1989). The process of information discernment may result in the reformulation of queries if
programmers find the retrieval results are not satisfactory. Only when a promising component
is found do programmers start to evaluate the components extensively.
To integrate a component into their programs, programmers need to understand the component’s functionality, its usage, and even its implementation details, especially in cases of
white-box reuse and glass-box reuse (Section 3.1.2). Executable examples that use the component prove to be very useful to help programmers quickly understand how to reuse the component in their own programming task (Redmiles, 1992; Aoki et al., 2001).
Chapter 4
The Component Locating Problem
Before programmers can take advantage of reuse, they must be able to locate reusable
components quickly and easily. A component repository system is an information system that
helps programmers locate reusable components. It has three aspects: a collection of
reusable components, a retrieval mechanism, and a retrieval interface. Research on component
repository systems has focused mainly on the effectiveness of retrieval mechanisms. However,
even the most sophisticated and powerful component repository systems will not be effective if
programmers make no attempt to reuse. Studies on reuse have shown that no attempt to reuse
is the most significant barrier to reuse (Figure 1.2) (Frakes & Fox, 1996). This chapter analyzes
the phenomenon of “no attempt to reuse” and points out that it is caused by the existence of
information islands and perceived low reuse utility. As a solution, the concept of the active
component repository system is introduced and its benefits are analyzed.
4.1 No Attempt to Reuse
4.1.1 Three Reuse Modes
As a part of the knowledge-intensive programming process, reuse is a process of applying
the knowledge of reusable components into programs. Because few programmers know all
about reusable components, component repository systems are introduced to facilitate the easy
application of reusable components during programming. Based on the source of the knowledge
of reusable components, three modes of reuse exist: reuse-by-memory, reuse-by-recall, and
reuse-by-anticipation.
Reuse-by-Memory. In the reuse-by-memory mode, while designing a new program,
programmers notice similarities between the new program and reusable components
that they have learned in the past and know very well. Therefore, they can reuse these
known components easily during the programming, even without the support of a component repository system, because their memory assumes the role of the repository
system.
Reuse-by-Recall. In the reuse-by-recall mode, while developing a new program, programmers vaguely recall that the repository contains some reusable components with
similar functionality, but they do not remember exactly which components they are.
They need to search the repository to find what they need. In this mode, programmers
are often determined to find the needed components. An effective retrieval mechanism
is the main concern for component repository systems supporting this mode. The successful operation of reuse in this mode needs both knowledge from programmers and
knowledge from the repository.
Reuse-by-Anticipation. In the reuse-by-anticipation mode, programmers formulate
reuse intentions based on their anticipation of the existence of certain reusable components. Even though they are not certain that relevant components exist, their knowledge
of the domain, the programming environment, and the repository is enough to motivate
them to search in hopes of finding relevant components. In this mode, if programmers
cannot find quickly enough what they want from the repository, they will soon give up
reuse (Mili et al., 1995). The repository is the main source of knowledge for the successful
operation of reuse in this mode.
Programmers have little resistance to the first two modes of reuse. As has been reported
by Isoda, programmers reuse those components repeatedly once they have learned them (Isoda,
1995). Lange and Moher, in their empirical study on programming and reuse strategies, have
found that programmers search extensively for the components they know exist even if they
may not be able to name them a priori (Lange & Moher, 1989). This explains why individual
ad hoc reuse has been taking place while organization-wide systematic reuse has not achieved
the same success: programmers have individual reuse repositories in their memories so they
can reuse by memory or reuse by recall (Mili et al., 1995). For those components that have not
yet been internalized into their memories, programmers have to resort to the mode of reuse-by-anticipation. The activation of the reuse-by-anticipation mode relies on two enabling factors:
Programmers anticipate the existence of reusable components.
Programmers perceive that the cost of the reuse process is lower than that of programming from scratch.
Figure 4.1: Different levels of programmers’ knowledge about a component repository (regions labeled L1: Well Known; L2: Vaguely Known; L3: Belief; L4; Unknown Components)
4.1.2 Information Islands in Component Repositories
Unfortunately, programmers’ anticipation of available reusable components does not always match real repository systems. Empirical studies on the use of high-functionality computer
systems (component repository systems being typical examples of them) have found there are
four levels of users’ knowledge about a computing system (Figure 4.1) (Fischer, 2001).
In Figure 4.1, ovals represent the collection of components that are in a particular knowledge level of programmers, and the rectangle represents the actual information space (namely,
the whole collection of items in an information system), labeled L4. L1 includes those reusable
components that are well known, easily employed, and regularly reused by a programmer. L1
corresponds to the reuse-by-memory mode. L2 contains components known vaguely and reused
only occasionally by a programmer; they often require further confirmation before being reused.
L2 corresponds to the reuse-by-recall mode. L3 represents what programmers believe about the
repository. L3 corresponds to the reuse-by-anticipation mode.
Many components exist in the area of (L4 - L3), and their existence is not known to
programmers. Consequently, there is no possibility for programmers to reuse them simply
because people do not ask for what they do not know (Fischer & Reeves, 1995). Components in
(L4 - L3) thus become information islands (Engelbart, 1990; Ye & Fischer, 2000), inaccessible
to programmers without appropriate tools. Repositories are not static—it is expected that they
will evolve over time, and this will increase the size of (L4 - L3).
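The region (L4 - L3) is an ordinary set difference, which the following sketch computes for a toy repository. The class name InformationIslandsSketch and the component names are hypothetical; the only assumption carried over from the text is that L4 is the set of components actually in the repository and L3 is the set of components a programmer believes to be there.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the (L4 - L3) region.
class InformationIslandsSketch {

    // Components that exist in the repository (L4) but that the programmer
    // does not believe to exist (not in L3). These cannot be asked for.
    public static Set<String> informationIslands(Set<String> repository, Set<String> believed) {
        Set<String> islands = new TreeSet<>(repository);
        islands.removeAll(believed);
        return islands;
    }

    public static void main(String[] args) {
        Set<String> l4 = new TreeSet<>(Arrays.asList(
                "DecimalFormat", "NumberFormat", "StringTokenizer", "BitSet"));
        // L3 may also contain beliefs that do not match the real repository.
        Set<String> l3 = new TreeSet<>(Arrays.asList(
                "NumberFormat", "StringTokenizer", "ChineseNumberConverter"));
        System.out.println(informationIslands(l4, l3)); // [BitSet, DecimalFormat]
    }
}
```

Adding components to the repository enlarges the first argument without changing the second, so the result can only grow, which is the effect of repository evolution noted above.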
Many reports about reuse experiences of industrial software companies illustrate this
inhibiting factor of reuse. Devanbu et al. have reported that because developers are unaware
of reusable components, they repeatedly re-implement the same function—in one case, this
occurred ten times (Devanbu et al., 1991). This kind of behavior is also observed as typical
among the four companies investigated by Fichman and Kemerer (Fichman & Kemerer, 1997).
From the experience of promoting reuse, Rosenbaum and DuCastel have concluded that making
components known to developers is a key factor for successful reuse (Rosenbaum & DuCastel,
1995).
4.1.3 Low Reuse Utility
Human beings often try to be utility-maximizers in the decision-making process (Reisberg, 1997), and programmers are no exception. When programmers perceive that reuse utility,
which is the ratio of reuse value to reuse cost, is too low, they do not make an attempt to
reuse (Sen, 1997). Because there is no easy way for programmers to estimate reuse value and
reuse cost objectively, the estimation made by programmers during programming is quite subjective and suffers from cognitive biases against reuse; they tend to underestimate reuse value
and overestimate reuse cost.
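The effect of the two biases on perceived utility can be stated compactly. The symbols are introduced here for illustration only: $V$ and $C$ denote the actual reuse value and reuse cost (both positive), $\hat{V}$ and $\hat{C}$ the programmer's subjective estimates, and $U$ and $\hat{U}$ the corresponding actual and perceived reuse utility.

```latex
U = \frac{V}{C}, \qquad
\hat{U} = \frac{\hat{V}}{\hat{C}}, \qquad
\hat{V} \le V \ \text{and}\ \hat{C} \ge C
\;\Longrightarrow\;
\hat{U} \le U .
```

Under these assumptions, underestimating value and overestimating cost both push the perceived utility below the actual one, so a programmer may decline to reuse even when reuse would in fact pay off.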
4.1.3.1 Underestimated Reuse Value
The value of reuse is multifold. As stated in Section 2.4, reuse value includes:
(1) reduced development time
(2) improved quality
(3) easy maintenance
(4) improved evolvability
(5) increased problem-framing ability
However, not all programmers recognize reuse value when they are under a tight schedule to finish their current program. Most of the reuse value is long-term and shows its benefit only after
the program has been developed; for programmers, what interests them most are the short-term
benefits. In his investigation on reuse in NTT (Nippon Telegraph and Telephone Corporation),
Isoda concludes that unless programmers find the immediate benefits of applying reusable components, they will not, of their own free will, perform reuse (Isoda, 1995). It is human nature
to pay attention to the immediate benefits only and ignore long-term benefits (Grudin, 1994)
because human beings are unable to think coherently about the remote future and particularly
about the distant consequences of their actions (Simon, 1996). To encourage programmers to
recognize the full benefits of reuse, many researchers have called for reuse education. Despite
its importance, reuse education alone has not brought reuse to fruition (Joos, 1994) because being told that “it is for your own good” seldom provides adequate motivation for programmers to
change their behavior (Simon, 1996). Some organizations have also tried to provide monetary
rewards to programmers who reuse, which has not been successful either (Frakes & Fox, 1995).
4.1.3.2 Overestimated Reuse Cost
As analyzed in Section 3.4.3, the cost of reuse incurred at reuse time includes:
(1) the cost of forming reuse intentions
(2) the cost of formulating reuse queries
(3) the cost of operating the repository system to retrieve components
(4) the cost of choosing components
(5) the cost of understanding and modifying components
(6) the cost of integrating components
In addition, when reuse repository systems are separated from the current programming environment, reuse cost includes the cost of switching back and forth between the programming environment and the reuse repository system, which causes the loss of working memory and the disruption of workflow.
Depending on the reuse mode, only some of these costs may be involved. In the reuse-by-memory mode, the cost of reuse is reduced to the cost of (6) only. In the reuse-by-recall mode,
the costs of (1), (2), and (4) are quite small because programmers know what to look for and
where to find the components. In the reuse-by-anticipation mode, all of these costs are involved,
and due to the following two cognitive biases—Einstellung and loss aversion—against reuse,
those costs are often overestimated.
Einstellung. Human beings often display Einstellung in problem solving. Einstellung,
the German word for “attitude,” refers to the mechanization of problem-solving strategy. Once problem solvers discover a strategy that “gets the job done,” they are less
likely to discover new strategies until they are completely stuck (Reisberg, 1997). Due
to Einstellung, human beings often stick with what they know best. As the term production paradox (Carroll & Rosson, 1987) suggests, even though there is an effective
strategy of solving a problem, most people are not motivated to learn this new strategy
and will “play it safe” by using a suboptimal solution that they personally consider to
be safe. Even today, for most programmers, building programs from scratch is still
the proven strategy. This partially explains the observed phenomenon of “programmer
machoism”—programmers have a tendency to chronically underestimate how difficult
a programming task is and overestimate the cost of reuse (Graham, 1995).
Loss Aversion. Another known phenomenon in the decision-making process of human
beings is loss aversion—the tendency to be far more sensitive to potential loss than to
potential gain (Reisberg, 1997). Starting a reuse process requires a mental switch. The
demand on working memory and time is immediate, and the potential gain is unclear
because programmers are not sure whether the needed component exists, whether they
are able to find it even if it does exist, and whether they are able to understand and
modify it even if they find it.
4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development
4.2.1 Development-with-Reuse
Designers of current component repository systems are not particularly concerned with
the problem that programmers make no attempt to reuse because these systems are designed
to support the development-with-reuse paradigm (Rada, 1995). The development-with-reuse
paradigm views reuse as a stand-alone process, independent of the current programming process
and environment. Consequently, component repository systems are studied as self-contained
systems, with no consideration of the context from which the needs for reusable components
are derived and the components are reused. Their major focuses have been on the retrieval
mechanisms only, with the assumption that programmers have no difficulty in forming reuse intentions and formulating reuse queries. Such systems require programmers to initiate the reuse
process by switching from their current development environments to component repository
systems with properly formulated reuse queries. Whenever a programming task, either from
the original task or as a result of further decomposition, arises, programmers must divert from
their current process to execute the reuse process on their own initiative. If they fail to do so,
component repository systems are of no use, and reuse will not happen.
Figure 4.2: The development-with-reuse paradigm
In this paradigm, programmers have to take the initiative to overcome the huge gap
between program development and reuse.
(Figure labels: Program Development Process; Reuse Process; Component Repository; knowing or anticipating the existence; articulating the queries and retrieving; evaluating; integrating; loss aversion; lost working memory; low reuse utility; conceptual gap; unknown components; Einstellung.)
Figure 4.2 depicts the development-with-reuse paradigm and its relationship with the
overall programming process. At the left side are program development processes and environments, and at the right side are reuse processes and systems. They are separated from each other,
and for reuse to succeed, programmers have to bridge the cognitive gap between programming
tasks and component repository systems by making an attempt to reuse on their own initiative.
4.2.2 Reuse-within-Development
Development-with-reuse is derived from the methodology-centered perspective, which
views methodology as the most important thing and requires that programmers adapt their practice to incorporate the new methodology. In contrast, the user-centered perspective—in this
case, the programmer-centered perspective—focuses on the behavior of programmers and aims
at melding the new development methodology (reuse) into the current practice of programmers (Jarzabek & Huang, 1998).
Development-with-reuse is also a result of the company-centered perspective, which
views reuse as a method that is profitable for the company, without considering the difficulties encountered
by individual programmers (Aaen, 1992). In contrast, the programmer-centered perspective
stresses the importance of offering immediate benefits for programmers. Instead of being driven
only by the long-term productivity and quality gains for the company, it attempts to appeal to
individual programmers (Winograd, 1995).
Development-with-reuse may work if all programming activities can be planned beforehand. However, as analyzed in Chapter 2, programming is by nature opportunistic: new programming tasks arise all the time during the whole period of programming; so do the reuse
opportunities. Reuse cannot be completely planned a priori; it takes place within the context
and the process of development (Sen, 1997). The needs for reusable components cannot be determined in advance, either; instead, they emerge throughout the whole programming process.
In order to put programmers into the center of the design of component repository systems and to put reuse into the context of programming activities as a whole, a paradigm
shift from development-with-reuse to reuse-within-development is needed (Ye, 2001a). Reuse-within-development views reuse as a supporting, not a replacing, method to the current practice
of programmers. It requires that the reuse process be smoothly melded into the current programming process and environment so that there is no context change from programming to
reuse. Furthermore, it stresses that reuse should be immediately beneficial to each individual
programmer.
To support reuse-within-development, component repository systems should
(1) be integrated seamlessly with the programming environment
(2) help programmers identify reuse opportunities whenever they arise during their programming processes
(3) provide immediate access, from current programming environments, to components
potentially reusable in the current development situation so that programmers do not
need to switch contexts between programming and reuse
4.3 Information-Enriched Workspaces
Integrating a component repository system and a programming environment creates an
information-enriched workspace (Ye, 2001b). An information-enriched workspace is a special
working environment (or programming environment, in this case) that is augmented with an
information display that constantly shows the information immediately needed by users. In an
information-enriched workspace, the cost structure of accessing needed information is tuned to
the requirements of the work (programming) process using it (Robertson et al., 1993) because it
provides immediate access to the most needed information for users without interrupting their
workflow.
An observation of our own physical working environments helps us to better understand
the information needs of users and the concept of an information-enriched workspace. When we
are working, we have memos, paper, and books on our desks serving as the immediate storage
of information mostly relevant to our current task; we have file cabinets and bookshelves as
secondary storage to keep less relevant information; furthermore, libraries and bookstores serve
as tertiary storage to complement the lack of information in our offices. In such a hierarchical
structure of information storage, while the relevance to our task and the frequency of access
decrease, the cost of accessing the information increases rapidly, by orders of magnitude (Card
et al., 1991). An independent information repository system (or component repository system)
can be abstractly thought of as secondary information storage from the perspective of computer
users because information stored there is accessible only after users have stopped working on
their current tasks and switched from their workspaces. In contrast, the information display in
an information-enriched workspace takes the role of immediate storage by storing frequently needed
or immediately needed information that can be readily accessed by users without interrupting
the workflow.
Figure 4.3 shows an information-enriched workspace that helps programmers reuse within
development. A portion of the programming environment is now dedicated to the display of
information on reusable components. This information display presents components that are
extracted from the component repository based on their relevance to the programming task conducted in the programming environment. The interface to the component repository system
becomes transparent to programmers because programmers now, within the programming environment, can evaluate and integrate reusable components without operating the component
repository system directly.
Figure 4.3: The reuse-within-development paradigm
Task-relevant components from the repository are now automatically presented in
the information display, which is a part of the programming environment. In this
paradigm, reuse is seamlessly integrated with the program development process.
(Figure labels: Component Repository; Program Development Process.)
4.4 Active Component Repository Systems
To create a programming environment (workspace) enriched with information on reusable
components, the component repository system needs to predict programmers’ needs for reusable
components and to automatically present those needed components in the information display.
This task can be supported by the active information delivery mechanism.
4.4.1 The Concept of Active Component Repository Systems
In contrast to conventional information access mechanisms, in which users explicitly
initiate the information search process, active information delivery presents relevant information
to users without having been asked for it explicitly (Fischer et al., 1993). The information access
mechanism requires users to articulate and specify clearly their information needs, whereas the
information delivery mechanism infers information needs. Support for information access is
indispensable in reusable component repository systems because when programmers recall or
anticipate the existence of reusable components, they must be able to locate them. However,
reusable component repository systems need to be complemented with the information delivery
mechanism so that programmers can reuse those components they fail to anticipate.
Component repository systems equipped with active information delivery mechanisms
are called active component repository systems, or active component repositories. Traditional
component repository systems that employ information access mechanisms solely are called
passive component repository systems, or passive component repositories. Active component
repositories autonomously extract cues that reveal the programming task in a programming environment, and based on such cues, they formulate reuse queries on behalf of programmers and
deliver relevant components in the information display embedded in the programming environment.
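The cue-extraction and delivery cycle described above can be illustrated with a minimal sketch. It is not the mechanism used in this dissertation: the repository contents, the stopword list, and the word-overlap scoring are all assumptions made for illustration, and the names `REPOSITORY`, `extract_cues`, and `deliver` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical repository: component name -> short description (assumed data).
REPOSITORY = {
    "shuffle": "randomly permute the elements of a list",
    "sort": "arrange the elements of a list in ascending order",
    "read_lines": "read all lines of a text file into a list",
}

# Assumed stopword list; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "into", "all"}


def extract_cues(source_text):
    """Turn comments and identifiers in the editor buffer into word cues."""
    words = re.findall(r"[A-Za-z]+", source_text.replace("_", " "))
    return Counter(w.lower() for w in words if w.lower() not in STOPWORDS)


def deliver(source_text, repository=REPOSITORY, limit=2):
    """Formulate a query from the cues and rank components by word overlap."""
    cues = extract_cues(source_text)
    scored = []
    for name, description in repository.items():
        description_words = set(extract_cues(description))
        score = sum(cues[word] for word in description_words)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties are broken alphabetically by component name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


editor_buffer = "# randomly permute a list of cards\ndef mix_cards(cards):"
print(deliver(editor_buffer))
```

The programmer never writes a query here: the comment and the function signature in the editor buffer serve as the cues, and the ranked component names are what would appear in the information display.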
4.4.2 Benefits of Active Component Repository Systems
Active component repository systems promote reuse by offering the following benefits (Ye & Fischer, 2000).
A Bridge to Information Islands. As analyzed in Section 4.1.2, the existence of reusable
components does not guarantee their reuse if programmers do not anticipate their existence (see
Figure 4.1). Passive component repository systems can only help programmers locate those
components whose existence is anticipated. Active component repository systems can set up
a bridge to information islands in a component repository. They lower the barrier of the vocabulary learning problem by supporting learning-on-demand because they can deliver those
components that programmers have not learned and yet are reusable in the current task.
Well-Informed Decision-Making. Psychological studies on the decision-making process of human beings have shown that the presence of other alternatives affects decisions dramatically (Reisberg, 1997). The presence of actively delivered reusable components reminds
programmers of the alternative programming approach—reuse—other than their current approach of programming from scratch, and alleviates the cognitive bias against reuse caused by
Einstellung in programming. Immediately accessible reusable components can contribute to
the activation of associated program plans similar to how components in the memory do (see
Section 2.3). Active component repository systems serve as extensions to the memory of programmers, and expand the possible solution space that is bounded by the limited knowledge of
programmers.
Reduction of Perceived Reuse Cost. Compared with stand-alone passive component
repository systems, readily accessible reusable components in an information-enriched workspace
supported by active component repository systems reduce the perceived cost of reuse greatly because this approach requires less commitment of resources from programmers. Programmers
can quickly decide whether suitable components exist by scanning reusable components actively delivered, and there is no conscious context switch between programming and reusing.
In passive component repository systems, such a decision (whether reusable components exist
or not) can be made only after programmers have committed considerable working
memory and attention to the process of component location. As programmers switch from programming to reuse, their working memory of the programming activities decays with a half-life
of about 15 seconds (Norman, 1986). Therefore, the longer they spend on locating components, the more working memory gets lost. Conversely, the near-instantaneous decision-making
afforded by active component repository systems allows programmers to stay on task.
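To make the half-life figure concrete, the following sketch computes how much working memory would remain after a given time away from the programming task. It assumes a simple exponential decay model, which is an interpretation of the stated half-life and not a formula given in the text; the function name `retained_fraction` is hypothetical.

```python
def retained_fraction(seconds_away, half_life=15.0):
    """Fraction of working memory retained, assuming exponential decay.

    The 15-second default is the half-life quoted above; the exponential
    form itself is an assumption made for this illustration.
    """
    if seconds_away < 0:
        raise ValueError("seconds_away must be non-negative")
    return 0.5 ** (seconds_away / half_life)


for seconds in (0, 15, 30, 60):
    print(seconds, round(retained_fraction(seconds), 4))
```

Under this assumption, one minute spent locating components leaves roughly 6% of the working memory of the programming activity, which is why a near-instantaneous decision matters.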
Reduction of Actual Reuse Cost. Active component repository systems reduce the actual cost of reuse because programmers do not need to go through the location process explicitly.
As mentioned in the previous paragraph, shifting attention from current work to the operation
of component repository systems causes the loss of precious working memory and interrupts
the workflow. Formulating internal reuse intentions into external reuse queries also presents a
difficult cognitive activity that requires programmers to overcome the conceptual gap between
situation model and system model (Shneiderman, 1998). Active component repository systems
(1) allow programmers to interact directly with reusable components instead of interacting with
the component repository system; (2) improve the readiness-to-hand of reusable components
because the cognitive breakdown caused by the operation of component repository systems is
bridged; and (3) reduce the cost of reusing unknown or anticipated components to the cost of
reuse-by-recall or reuse-by-memory.
4.4.3 Full Support of Component Locating
In passive component repository systems, two kinds of knowledge are required for programmers to locate reusable components successfully:
(1) They must know something about or at least the existence of the components.
(2) They must know how to operate the repository system correctly by submitting well-defined queries or browsing efficiently.
Active component repository systems do not have such requirements, and they fill the void
unsupported by passive systems.
| Reuse Mode | Knowledge Required | Knowledge Sources | Features of Locating | Needed Support |
|---|---|---|---|---|
| Reuse-by-memory | Knowing components well | Programmer’s head | Replaced by learning efforts in advance | None |
| Reuse-by-recall | Knowing components vaguely | Programmer’s head and repository | Specific search | Browsing or querying |
| Reuse-by-anticipation | Anticipating the existence of components and knowing the operation of repository systems | Mostly repository | Open-end search | Browsing, querying, or delivery |
| No attempt to reuse | Not knowing the existence or the operation | Repository only | No user-initiated search | Delivery |
Table 4.1: Relations between reuse mode, knowledge sources, and tool support
Table 4.1 summarizes the knowledge required from programmers and the needed support
from repository systems to locate reusable components. Active delivery mechanisms not only
overcome the “no attempt to reuse” phenomenon, but also support reuse-by-anticipation by
speeding up the locating process.
Chapter 5
Active Information Systems
Active component repository systems are a subclass of active information systems that
support the information delivery mechanism. The information delivery mechanism is a complementary approach to the information access mechanism and is needed in situations in which
users are unable to articulate the need for information or are unaware that they may profit from
information. Examples of active information systems include, among others, active help systems (Fischer et al., 1985; Virvou & Du Boulay, 1999); critic systems (Fischer et al., 1993);
Microsoft’s “Tip of the Day” and Office Assistants; and information agents (Lieberman, 1997;
Nardi et al., 1998). This chapter describes the challenges involved in the implementation of
active information systems, possible solutions, and their applicability to component repository
systems.
5.1 Basic Issues of Active Information Systems
Implementing active information systems is quite different from implementing passive
information systems that support browsing and querying only. In passive information systems,
the process of information seeking is explicitly initiated by users, and the needs for information
are either articulated as retrieval queries or externalized through a series of browsing actions.
In active information systems, the system must determine the information needs of users and
when and how to present the retrieved information.
5.1.1 Contextualization: What to Deliver?
Users who are engaged in a task are usually not very interested in information that bears no relationship to their current task. They need only information that helps
them accomplish their task. Because different users have different knowledge backgrounds,
their needs for information are also different. For most active information systems, the critical
challenge is the contextualization of new information to the task acted upon and the user acting.
Active information systems that merely present de-contextualized information, such
as Microsoft’s “Tip of the Day,” are of little use to most users. This type of system could be
viewed as a reverse help system that exploits the communication paradigm of “Answer First,
Then Questions” in contrast to the traditional “Question-Answer” paradigm of most help systems (Owen, 1986). Despite the possibility for interesting serendipitous encounters of information (Roberts, 1989), most users find this feature more annoying than helpful. The random
presentation of information also makes it difficult to understand when or how the information
should be used due to the lack of the problem context.
Sections 5.2 through 5.6 explain in detail how to achieve this contextualization in active
information systems, which is the focus of this dissertation research.
5.1.2 Feedforward or Feedback: When to Deliver?
Depending on the temporal order between the time when the information is delivered
and the time when the user action for which the information is delivered takes place, active
information systems can provide feedforward or feedback to users.
For each action, there is a period of time called action-present, in which users have decided what to do but have not yet executed the needed operations to change the situation (Schön,
1983). Information delivered in this period of time is feedforward (Simon, 1996) information
because it can make users change the course of action or assist users in accomplishing the action
(Figure 5.1). For example, the autocompletion feature of Internet Explorer provides feedforward (Figure 5.2) to users who want to visit a website by saving some keystrokes, but more importantly,
by relieving users of the need to remember exact addresses of websites.
Figure 5.1: Feedforward information delivery
Information delivered during the period of action-present is feedforward information that can affect the execution of the action.
Figure 5.2: Autocompletion in Internet Explorer
When a user types http://www.cs into the address bar, all URLs that the user has
recently visited and that start with the typed string are shown in the pop-up menu,
and the user can choose one to revisit.
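The autocompletion example can be sketched as a prefix match over the browsing history. This is only an illustration of feedforward delivery, not Internet Explorer's actual implementation; the function name `autocomplete`, the ordering of the history, and the sample URLs are assumptions.

```python
def autocomplete(typed, visited_urls, limit=5):
    """Return visited URLs that start with the typed string, most recent first.

    visited_urls is assumed to be ordered from oldest to newest visit.
    """
    matches = []
    for url in reversed(visited_urls):
        if url.startswith(typed) and url not in matches:
            matches.append(url)
    return matches[:limit]


# Hypothetical browsing history, oldest visit first.
history = [
    "http://www.cs.example.edu/courses",
    "http://www.example.org/news",
    "http://www.cs.example.edu/people",
    "http://www.cs.example.edu/courses",
]
print(autocomplete("http://www.cs", history))
```

The suggestions are computed while the user is still typing, that is, during the action-present, so they can change how the action is completed.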
Feedback information is delivered when the action for which the information is delivered
has been finished (Figure 5.3). Feedback can create a situational backtalk of the action by
pointing out a potential breakdown the user has not known or noticed, or can augment the
situational backtalk to help users reflect better on the action just completed (Nakakoji et al.,
1998). Feedback can serve two roles. First, it creates a learning opportunity for users to improve
work performance. For example, the ACTIVIST system (Fischer et al., 1985) teaches users the
corresponding key shortcut to replace a series of complex keystrokes used in their previous
action in a text editor. Second, if the previous problematic action can be undone or modified,
it helps users reach a better solution, such as the on-the-fly spell-checking mechanism in many
word-processing systems.
Feedback is retrospective because it gives users a chance to change a problematical or
suboptimal solution; feedforward is prophylactic because it prevents a problematical or suboptimal solution. To provide feedback, systems have to compare users’ solutions with ideal
solutions to find out what went wrong; to provide feedforward, systems have to predict what is
needed by the user in the near future based on what has been done so far.
5.1.3 Interruptive or Noninterruptive: How to Deliver?
Because information delivered by active information systems is unsolicited, it has the
risk of interrupting the workflow of users whose primary goal is not to process the delivered
information. When delivered information distracts users, it becomes intrusive. The intrusiveness of a system is the degree of users’ perception of being interrupted from their current focus.
Not all intrusive information is bad. Information that prevents a user from making a mistake
that may cause all subsequent work to be void needs to be attended to promptly so the user can avoid
the cost of revising a whole chain of actions.
Figure 5.3: Feedback information delivery
Information delivered after the action has been finished is feedback information that
helps users reflect upon the action.
Information can be delivered interruptively or noninterruptively. An interruptive delivery
requires the immediate reaction of users: if users do not attend to the delivered information, they
cannot continue their current work. A noninterruptive delivery just presents information with
no reaction from users required. It is up to the user whether to pay attention to the delivered
information. Although noninterruptive deliveries present less disruption to the workflow of
users, they may go unnoticed and provide insufficient help. Noninterruptive delivery can have
various degrees of intrusiveness, depending on how the delivered information is presented, for
instance, the distance between the window displaying the information and the focal window of
users. On one end, if the window does not exist or is hidden from the current working space
and gets opened or displayed only when the users become interested, the intrusiveness does not
exist. LispCritic (Fischer & Mastaglio, 1989) is an example of this type of system. On the
other end, if the information window is placed right in the middle of the user’s current focus, the
intrusiveness is close to interruptive delivery.
Active information systems need to achieve the right balance between the cost of intrusive interruptions and the loss of context-sensitivity of deferred alerts (Horvitz et al., 1999) by
carefully considering when and how to deliver the information so that it can be utilized best
by users. Depending on the importance of the information, systems can explore a variety of
intervention modes to decide when and how to interrupt the user (Sumner, 1995).
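One way to picture the choice among intervention modes is a mapping from an importance estimate to a delivery mode. The sketch below is purely illustrative: the three modes paraphrase the range described in this section, while the numeric importance scale, the thresholds, and the names `Delivery` and `choose_delivery` are assumptions, not part of any cited system.

```python
from enum import Enum


class Delivery(Enum):
    ON_REQUEST = "hidden until the user asks for it"
    PERIPHERAL = "shown in a side window, no reaction required"
    INTERRUPTIVE = "work is blocked until the user responds"


def choose_delivery(importance, peripheral_at=0.3, interrupt_at=0.9):
    """Map an importance estimate in [0, 1] to a delivery mode.

    Both thresholds are hypothetical tuning parameters.
    """
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must be between 0 and 1")
    if importance >= interrupt_at:
        return Delivery.INTERRUPTIVE
    if importance >= peripheral_at:
        return Delivery.PERIPHERAL
    return Delivery.ON_REQUEST


for value in (0.1, 0.5, 0.95):
    print(value, choose_delivery(value).name)
```

Raising `interrupt_at` trades fewer workflow disruptions for a higher risk that critical information goes unnoticed, which is the balance discussed above.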
5.2 Acquiring Information of User Tasks
To locate information relevant to a user’s task, the information system has to know to
some extent what the task is. In passive information systems, users communicate to the information systems what their task is by articulating a query or taking a series of browsing actions
through the explicit communication channel established at the time when the user initiates the
information location process. In active information systems, systems must infer what the task
is.
In everyday communication among people, understanding draws on a shared background
and a shared context between speakers and listeners. Each speech act committed by the speaker
is interpreted by the listener against the shared background and the shared context implicitly
accessible to both (Winograd & Flores, 1986). An implicit communication channel can be
established when the workspace of users is shared with information systems because such a
shared workspace can be utilized to create the shared background and context between users
and information systems. Recent user actions in the workspace partially reveal what the current
task is, and the actions can be understood and the goal of the task can be inferred based on
the underlying domain knowledge and the relationship between the actions and the elements
existing in the workspace already. Such an inference or understanding of user actions and goal
can be used by systems to predict what kind of information would be interesting to the user. For
example, when a kitchen designer places a stove in front of a window with curtains in a kitchen
design environment, a knowledgeable observer (either a human being or a computer system)
can infer that the designer is not aware of the fire hazard of the design, even if the designer does
not say anything about it; and a piece of information about fire safety rules in kitchen design
probably would interest the designer (Nakakoji, 1993).
The remainder of this section explains how an information system can utilize information
existing in the shared workspace to locate task-relevant information. Relevance of information
to a user’s task can be assessed at two levels: the immediate task level and the larger context
level. Most endeavors of users cannot be accomplished in one action; they often need to be
divided into smaller tasks. The immediate task, or task at hand, is the portion of the whole
endeavor to which a user is currently attending, and every immediate task is conducted within
a context defined by previously accomplished tasks and the overall goal of the whole endeavor.
For example, in a programming environment, an immediate task for a programmer is a procedure or method that he or she is currently developing, and the larger context includes those
functions or methods that have been developed so far and with which the current procedure or
method will interact to make a whole system.
5.2.1 General Approaches to Capturing the Immediate Task
To find information relevant to the immediate task, information systems need a representation schema of user tasks. An abstract representation of a user task is called a task model,
and the acquisition of such task models is called task modeling. Tasks can be modeled through
either plan recognition or similarity analysis.
5.2.1.1 Plan Recognition
The plan recognition approach uses plans to describe a user task. A plan is a sequence
of user actions that achieve a certain goal.[1] In general, a plan can be represented as a rule
consisting of two parts: the condition and the result. The condition part includes a sequence of
actions required to accomplish a task, and the result part is the intended goal of the task. When
the actions of a user match, completely or partially, the condition part, the system can infer that
the user is performing that corresponding task, and information about that task is delivered.
Two kinds of approaches exist for the recognition of task plans: plan libraries and generic
plans.
Plan Libraries. In this approach, all task plans are stored in a plan library of the
information system and each task is described by a specific plan. As a user acts in a
computer system, plans whose beginnings match the sequence of actions are selected.
The ACTIVIST system (Fischer et al., 1985) takes this approach. For instance, as a
user repeatedly uses the delete key to delete a character backward in an editor, three
plans are first recognized: (1) deleting a word, (2) deleting a line up to the current
character, and (3) deleting a paragraph up to the current character. If the user stops deleting
at a space, then the first plan is finally recognized and information about the command
of deleting a word is delivered; if the user stops at the beginning of the line, then the
second plan is finally recognized and information about the command of deleting a line
is delivered; and so on. The difficulty with this approach is that system designers must
specify all plans beforehand. Moreover, as the number of plans increases in the plan
library, the performance of plan recognition becomes the bottleneck because there are
so many plans to compare with user actions.
[1] Program plans defined in Section 2.2.1 are plans specific to the programming domain.
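The plan-library approach can be sketched as prefix matching between observed actions and stored plans. The code below is a simplified illustration and not ACTIVIST's implementation: the action names, the contents of `PLAN_LIBRARY`, and the decision to collapse repeated key presses into one step are all assumptions.

```python
from itertools import groupby

# Hypothetical plan library: goal (result part) -> action sequence (condition part).
PLAN_LIBRARY = {
    "delete word": ["delete_char", "stop_at_space"],
    "delete line up to cursor": ["delete_char", "stop_at_line_start"],
    "delete paragraph up to cursor": ["delete_char", "stop_at_paragraph_start"],
    "go to end of line": ["cursor_right", "stop_at_line_end"],
}


def collapse(actions):
    """Collapse runs of the same action, so repeated key presses count once."""
    return [action for action, _ in groupby(actions)]


def candidate_plans(actions, library=PLAN_LIBRARY):
    """Goals of all plans whose beginnings match the observed actions."""
    observed = collapse(actions)
    return [goal for goal, steps in library.items()
            if steps[:len(observed)] == observed]


def recognized_plan(actions, library=PLAN_LIBRARY):
    """Goal whose complete condition matches the observed actions, or None."""
    observed = collapse(actions)
    for goal, steps in library.items():
        if steps == observed:
            return goal
    return None


deleting = ["delete_char"] * 4
print(candidate_plans(deleting))
print(recognized_plan(deleting + ["stop_at_space"]))
```

Every recognition step scans the whole library, which illustrates why performance degrades as the number of plans grows.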
Generic Plans. In this approach, a type of task plan is described in a generic rule
using regular expressions, syntactical grammar rules, or other descriptive patterns. If a
sequence of user actions is an instantiation of such a generic rule, the system recognizes
that the user is performing the corresponding task. For example, the ADD (Apple Data
Detector) system (Nardi et al., 1998) uses a regular expression to describe a URL
(Uniform Resource Locator) or an email address. When a string in text matches the
regular expression, several actions associated with it, such as “save it in the bookmark
list,” are suggested by the system and can be automatically executed if the user chooses
to do so. LispCritic (Fischer, 1987) also takes a similar approach. This approach
requires user actions to have a generalizable structure.
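A generic plan of this kind can be sketched with regular expressions. The patterns below are deliberately simplified and are not the ones used by ADD; the suggested actions other than "save it in the bookmark list" and the names `PATTERNS`, `SUGGESTED_ACTIONS`, and `detect` are hypothetical.

```python
import re

# Simplified, assumed patterns; real URL and email grammars are more involved.
PATTERNS = {
    "url": re.compile(r'https?://[^\s<>"]+'),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
}

# Hypothetical actions offered for each kind of detected structure.
SUGGESTED_ACTIONS = {
    "url": ["open in browser", "save it in the bookmark list"],
    "email": ["compose message", "add to address book"],
}


def detect(text):
    """Return (kind, matched string, suggested actions) for each match in text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(), SUGGESTED_ACTIONS[kind]))
    return found


sample = "Write to someone@example.org or see http://www.example.org/reuse for details"
for kind, value, actions in detect(sample):
    print(kind, value, actions)
```

A single rule covers every string with the described structure, so no instance needs to be enumerated in advance; the price is that only tasks with such a regular structure can be recognized.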
5.2.1.2 Similarity Analysis
The similarity analysis approach examines the contextual information surrounding the
current focus of users, and uses that contextual information to predict their information needs.
Information from the repository that has high similarity to the contextual circumstance is then
delivered. Systems taking the approach of similarity analysis do not try to directly infer the goal
of user actions; they operate based on the following two assumptions (Figure 5.4):
Similar Situations. If the current situation of a user is similar enough to another situation (Situation A), which has been encountered by either the same user or another user
before, and information X is often explored in Situation A, then the current situation
probably also needs information X.
Figure 5.4: Two assumptions of similarity analysis
Relevant information can be determined based on (a) the similarity of situations or
(b) the similarity of information.
Similar Information. If the current situation uses information X and information Y is
similar enough to information X, then the user is probably also interested in exploring
information Y.
Some recommendation systems such as Siteseer (Rucker & Polanco, 1997) use the first
assumption to recommend new information to users. Siteseer helps users discover interesting
web pages. The situation of a user is defined by his or her Bookmarks (or Favorites). Users
are thought to be in a similar situation if their Bookmarks have enough overlap. Within such
a group of users, new web pages that have the highest overlap among the Bookmarks of other
users but do not appear in one particular user’s own Bookmarks are recommended to that user.
Other systems, such as Grouplens (Konstan et al., 1997) and PHOAKS (Terveen et al., 1997),
are also designed based on this assumption. This assumption is also widely used in e-commerce
websites. For example, the Amazon.com website recommends to a book-buying customer new
books that have been purchased by other customers who have bought books similar to those the
customer has bought.
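The similar-situations assumption can be sketched in a few lines of Java. This is not Siteseer's actual algorithm: the class name, the minOverlap threshold, and the vote-counting scheme are hypothetical simplifications of the bookmark-overlap idea described above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical recommender based on overlapping bookmark sets. */
final class SimilarSituationSketch {
    /** Number of bookmarks two users share; a stand-in for "similar situation". */
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size();
    }

    /**
     * Counts, for each page the target user has not bookmarked, how many
     * sufficiently similar users have bookmarked it.
     */
    static Map<String, Integer> recommend(Set<String> target,
                                          Map<String, Set<String>> others,
                                          int minOverlap) {
        Map<String, Integer> votes = new HashMap<>();
        for (Set<String> bookmarks : others.values()) {
            if (overlap(target, bookmarks) < minOverlap) {
                continue; // this user's situation is not similar enough
            }
            for (String page : bookmarks) {
                if (!target.contains(page)) {
                    votes.merge(page, 1, Integer::sum);
                }
            }
        }
        return votes;
    }

    public static void main(String[] args) {
        Set<String> me = Set.of("a.org", "b.org", "c.org");
        Map<String, Set<String>> others = Map.of(
                "u1", Set.of("a.org", "b.org", "x.org"),
                "u2", Set.of("b.org", "c.org", "x.org", "y.org"),
                "u3", Set.of("z.org"));
        System.out.println(recommend(me, others, 2));
    }
}
```

Pages with the most votes would be recommended first; the user with no overlapping bookmarks contributes nothing.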
The second assumption—similar information—underlies many information agents, such
as Remembrance Agent (Rhodes & Starner, 1996) and WebWatcher (Armstrong et al., 1995).
These information agents deliver to users new information, such as emails or web pages, that
are determined to be similar to what the user is currently focusing on. For example, when a
user is writing a new email in an email editor, Remembrance Agent autonomously searches the
email folders and personal notes of the user and delivers messages and notes that are similar in
content to what the user is currently writing.
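The similar-information assumption is commonly realized with a text similarity measure. The following sketch uses cosine similarity over raw term frequencies; this is a standard technique chosen for illustration, not necessarily the measure used by Remembrance Agent or WebWatcher, and the class name and the tokenization rule are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical content-similarity measure over term-frequency vectors. */
final class SimilarInformationSketch {
    /** Lowercased word counts; splitting on non-letters is a simplifying assumption. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two term-frequency vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> draft = termFrequencies("Budget meeting notes for the project");
        Map<String, Integer> oldMail = termFrequencies("Notes from the last budget meeting");
        System.out.println(cosine(draft, oldMail));
    }
}
```

An agent built on this idea would score every stored message against the text currently being written and deliver the highest-scoring ones.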
5.2.1.3
Comparing Plan Recognition and Similarity Analysis
The plan recognition approach can support multiple well-defined and concrete tasks
(such as deleting a word or extracting a URL), each of which is relatively easy to
describe in a rule. A rule is activated when its condition is met. For the same task, delivered
information is always the same, and the delivered information is meant to be feedback to users
in regard to their just-finished actions. Similarity analysis often supports only one ill-defined
and abstract task (such as finding interesting information or writing a program) that is implicitly
activated when users start to use the system. The operation of the delivery requires input from
the current situation, and thus the delivered information varies in response to the differences
of the situations. Information delivered through similarity analysis can be either feedback on
the finished action or feedforward to stimulate a new action. Plan recognition approaches have
difficulties dealing with semantic aspects of user actions because they try to “understand” what
users are doing. Therefore, they are difficult to scale up because system designers have to engineer this understanding mechanism into the system beforehand. Because the similarity analysis
approach focuses on the intention communication aspect of users’ working environment (Winograd & Flores, 1986), it can circumvent the requirements of human-like understanding and give
an interpretation that makes sense to the system only. However, active information delivery
based on similarity analysis is often less accurate than delivery based on plan recognition.
These two approaches correspond, respectively, to the two models of long-term memory
retrieval of human beings: retrieval by recognition and retrieval by association. In the process
of retrieving information by recognition, information from memory is directly retrieved at the
recognition of distinctive features of a fixed pattern (Simon, 1996). Plan recognition systems
simulate the same process: from features to goal recognition and then to relevant information.
In the process of retrieving information by association, information strongly linked with the
existing perceptual elements is activated from the long-term memory. In this process, along with
information useful for the current situation, some irrelevant information may also be activated.
Therefore, humans have to select, based on their existing knowledge, the information that can be
correctly integrated with the current situation (Kintsch, 1998). Following this process, similarity
analysis-based systems retrieve and deliver information based on the link (the similarity or
association), and it is up to the users to decide which information they need to incorporate into
their task.
Table 5.1 summarizes the differences between plan recognition and similarity analysis in
terms of their underlying memory retrieval models, their technical approaches, and their major
shortcomings.
                           | Plan Recognition                                  | Similarity Analysis
Memory Retrieval Model     | Retrieval by recognition                          | Retrieval by association
Technical Approach         | User actions -> Goal -> Information               | Context -> Information
Major Technical Challenges | Defining plans and recognizing plans from actions | Determining similarity of situations and similarity of information
Major Objective            | Feedback to previous actions                      | Feedforward to immediate actions
Supported Tasks            | Multiple                                          | Single
Major Shortcomings         | Difficult to scale                                | Imprecise information
Example Systems            | Activist, LispCritic, ADD                         | Siteseer, CodeBroker, Remembrance Agent
Table 5.1: A comparison between plan recognition and similarity analysis
5.2.2
Modeling the Programming Task
In the case of locating reusable components, a plan recognition approach needs to recognize
the program plan from programs under development, using such plan recovery methods as
adopted in the Proust system (Soloway & Ehrlich, 1984), and to deliver reusable components
that can be used to realize the recognized program plan. Recognition of program plans from
complete programs is extremely difficult. Woods and Yang prove that program plan recognition is NP-hard
(Woods & Yang, 1996).
The primary goal of active component repository systems is to identify reuse opportunities by delivering reusable components to programmers before they have fully implemented the
program. Plan recognition is even more difficult in this setting because the system has to recognize program plans
from partially constructed programs. Therefore, the plan recognition approach is not suitable
for active component repository systems.
This research adopts the similarity analysis approach to locate reusable components that
are relevant to the module (a piece of program under development) by making use of the descriptive elements existing in both modules and components.
5.2.2.1
Three Aspects of Programs
A program has three aspects: concept, code, and constraint. The concept of a program
is its functional purpose, or goal; the code is the embodiment of the concept; and the constraint
regulates the environment in which the program runs (Ye et al., 2000). This characterization
is similar to the 3C model of Tracz (Tracz, 1990), who uses concept, content, and context to
describe a reusable component.
Concept
Important concepts of a program are often contained in its informal information structure. Software development is essentially a cooperative process among many programmers. Programs
include both formal information for their executability and informal information for their readability by peer programmers (Fischer & Schneider, 1984). Informal information includes structural indentation, comments, and identifier names (Soloway & Ehrlich, 1984). Comments and
identifier names are important beacons for the understanding of programs because they reveal
the important concepts of programs (Biggerstaff et al., 1994; Etzkorn & Davis, 1997a; Michail
& Notkin, 1999). The embedding of informal information in programs improves long-term indirect communication among programmers because, unlike a separate document for a program,
information about the program is stored in the place where it is most useful—in the program
itself (Reeves, 1993).
Modern programming languages such as Java enforce this embedding of informal information further by introducing the concept of document comment (doc comment for short). A
doc comment, beginning with /** and continuing until the next */, immediately precedes the
declaration of a module which is either a class or a method. Doc comments are utilized by
the Javadoc program to create online documentation from Java programs. Contents inside doc
comments are meant to describe the functionality of the following module.
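As an illustration, the following sketch shows a doc comment preceding a method declaration. The method name and parameters follow the getRandomNumber example used in this section; the enclosing class, the method body, and the choice of inclusive bounds are assumptions.

```java
import java.util.Random;

/** Hypothetical class that hosts a doc-commented method. */
final class DocCommentSketch {
    private static final Random RANDOM = new Random();

    /**
     * Returns a pseudo-random integer between two bounds.
     *
     * @param from the lower bound, inclusive
     * @param to   the upper bound, inclusive (assumed; must not be less than {@code from})
     * @return a number {@code n} with {@code from <= n <= to}
     */
    static int getRandomNumber(int from, int to) {
        // nextInt(bound) yields a value in [0, bound), so shift it by the lower bound.
        return from + RANDOM.nextInt(to - from + 1);
    }

    public static void main(String[] args) {
        System.out.println(getRandomNumber(1, 6));
    }
}
```

The words in the doc comment (random, integer, bounds) are the kind of informal information that reveals the concept of the method.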
Constraint
Constraints of a program exist at four levels: syntactical level, semantic level, architectural
level, and practical level.
The syntactical constraint of a program is captured by its signature. As the type expression of a program, a signature defines the program’s syntactical interface. The basic form of a
signature of a method or a function is:
Signature:OutputTypeExp<-InputTypeExp
where OutputTypeExp and InputTypeExp are type expressions that result from applying
a Cartesian product constructor to all their parameter types. For example, for the method,
int getRandomNumber (int from, int to)
the signature is
getRandomNumber:
int <- int x int
A signature of a class contains all the type definitions of its attributes and all the signatures of
its methods.
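A signature in this notation can be derived mechanically from a compiled method. The sketch below uses Java reflection to do so; the class name, the helper methods, and the rendering of an empty parameter list as "void" are assumptions that the notation above does not prescribe.

```java
import java.lang.reflect.Method;
import java.util.StringJoiner;

/** Hypothetical helper that renders a method in the "Output <- Input" notation. */
final class SignatureSketch {
    // Placeholder method whose signature is rendered in main; the body is irrelevant here.
    static int getRandomNumber(int from, int to) {
        return from;
    }

    /** Renders a method as "name: OutputType <- InputType x InputType ...". */
    static String signatureOf(Method m) {
        StringJoiner inputs = new StringJoiner(" x ");
        for (Class<?> parameterType : m.getParameterTypes()) {
            inputs.add(parameterType.getSimpleName());
        }
        // "void" for an empty parameter list is an assumption of this sketch.
        String in = m.getParameterCount() == 0 ? "void" : inputs.toString();
        return m.getName() + ": " + m.getReturnType().getSimpleName() + " <- " + in;
    }

    /** Convenience overload that looks the method up by name and parameter types. */
    static String signatureOf(Class<?> owner, String name, Class<?>... parameterTypes) {
        try {
            return signatureOf(owner.getDeclaredMethod(name, parameterTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + name, e);
        }
    }

    public static void main(String[] args) {
        // Prints "getRandomNumber: int <- int x int"
        System.out.println(signatureOf(SignatureSketch.class, "getRandomNumber", int.class, int.class));
    }
}
```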
The semantic constraint of a program involves the conditions with which the input and
output data have to agree. In formal specification languages, these conditions are described as
pre-conditions and post-conditions (Wing, 1990). Some programming languages, such as C, provide
an assertion facility with which programmers can express the intended semantic constraints of a
program. For object-oriented programming languages, a contract model can be used to specify
the semantic constraints of a class (Meyer, 1997). A contract of a class specifies the invariant of
the class and the legal order in which the methods of the class should be called. However, both
pre- and post-conditions and contracts are still not widely adopted by programmers. One reason
is that it is difficult to write these constraints because it requires deep mathematical knowledge,
and another reason is that it is also very difficult to develop reliable yet efficient computing tools
to check these constraints.
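A minimal sketch of how a pre-condition and a post-condition might be written in Java follows. The integer square root example and the class name are assumptions chosen for brevity; the pre-condition is enforced with an explicit check, and the post-condition with Java's assert statement, which is evaluated only when assertions are enabled (for example with the -ea flag).

```java
/** Hypothetical example of semantic constraints expressed in code. */
final class ContractSketch {
    /**
     * Integer square root.
     * Pre-condition:  n >= 0.
     * Post-condition: r * r <= n < (r + 1) * (r + 1).
     */
    static int isqrt(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("pre-condition violated: n must be >= 0");
        }
        int r = (int) Math.sqrt(n);
        // Widen to long so that (r + 1) * (r + 1) cannot overflow for large n.
        assert (long) r * r <= n && n < (long) (r + 1) * (r + 1) : "post-condition violated";
        return r;
    }

    public static void main(String[] args) {
        System.out.println(isqrt(17)); // prints 4
    }
}
```

Even this small example shows why such constraints are rarely written: the post-condition is longer than the computation it guards.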
The architectural constraint of a program is introduced when a programmer wants to
develop a system in conformance with a particular architectural style. At the class level, a programmer may want to confine a class to a chosen design pattern or to fit the class into an existing
framework. Because design patterns and frameworks prescribe how their constituent classes interact with each other, they impose extra constraints on the interface and implementation of
classes.
The practical constraint of a program includes the performance criteria required. In some
critical situations, programs need to be time-efficient to respond in a timely fashion, memory-efficient to fit within limited memory resources, thread-safe to support concurrent execution,
and so on.
Code
Code is meant to be executed by computers. It is the machine-executable representation of
concepts, and it must conform to all the required constraints.
5.2.2.2
Locating Relevant Components
Relevance of reusable components to the current programming task can be determined
by the combination of concept similarity and constraint compatibility. Concept similarity is
the similarity between the concept of the current task, as revealed through comments and
identifiers, and the concept revealed in the documentation of reusable components. Constraint
compatibility is the compatibility between the constraints required for the module under
development and those satisfied by the components from the repository. A component from
the repository whose concept is similar to the concept of the program under development has
a high probability of being reused in the current situation. Moreover, if the component has
compatible constraints, its reuse possibility is further improved. Programmers who use passive
repository systems, especially browsing mechanisms, follow the above heuristic rule to find
reusable components too. They first look at the names and short descriptions of components
and choose to explore those that suggest something similar to their task (Biggerstaff et al.,
1994; Henninger, 1993), and then choose to reuse components that can be easily integrated
(having the constraint compatibility) with their current programs.
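The heuristic described above can be sketched as a ranking function. Everything in the sketch is an assumption made for illustration: the Component fields, the use of word-set (Jaccard) overlap as a stand-in for concept similarity, exact signature equality as a stand-in for constraint compatibility, and the arbitrary bonus of 0.5 for compatible components.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical ranking of components by concept similarity and constraint compatibility. */
final class RelevanceSketch {
    /** A repository entry with a name, documentation text, and signature string. */
    static final class Component {
        final String name;
        final String doc;
        final String signature;

        Component(String name, String doc, String signature) {
            this.name = name;
            this.doc = doc;
            this.signature = signature;
        }
    }

    static Set<String> words(String text) {
        Set<String> w = new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z]+")));
        w.remove("");
        return w;
    }

    /** Jaccard overlap of word sets, standing in for concept similarity. */
    static double conceptSimilarity(String a, String b) {
        Set<String> shared = words(a);
        Set<String> union = new HashSet<>(shared);
        Set<String> other = words(b);
        union.addAll(other);
        shared.retainAll(other);
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    /** Concept similarity plus a fixed bonus (arbitrary) when the signatures are identical. */
    static double relevance(String taskComment, String taskSignature, Component c) {
        double bonus = c.signature.equals(taskSignature) ? 0.5 : 0.0;
        return conceptSimilarity(taskComment, c.doc) + bonus;
    }

    /** Returns a copy of the repository sorted by descending relevance. */
    static List<Component> rank(String taskComment, String taskSignature, List<Component> repo) {
        List<Component> ranked = new ArrayList<>(repo);
        ranked.sort(Comparator.comparingDouble(
                (Component c) -> relevance(taskComment, taskSignature, c)).reversed());
        return ranked;
    }
}
```

Given a task comment such as "create a random number between two limits" and the signature "int <- int x int", a component documented as returning a random number between two bounds with the same signature would rank ahead of components that match on only one of the two criteria.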
5.2.3
Relevance to the Larger Context of Task
Each action taken by users in a computer system serves a global goal, and actions take
place in a historical, larger context that is shaped by all preceding actions. Information systems
can provide more appropriate information by taking into consideration the overall goal and
the historical context in which the information is needed. To reach this objective, a shared
understanding of the larger context needs to be created between users and information systems.
This shared understanding can be created through up-front specification of goals and objectives
by users or created incrementally during the course of interactions between users and systems.
5.2.3.1
Specifying the Global Goal
Before users start to interact with computer systems, they can specify their goals at a
high level of abstraction. The specifications do not need to be complete and can be modified
and augmented during the work process that follows. Active information systems can utilize
partial specifications to determine what information might be most relevant to the actions taken
by users to accomplish their global goals.
The KID system, a kitchen design environment embedded with active critiquing, includes such a specification mechanism (Nakakoji, 1993). KID supports two types of critiquing:
generic and specific. Generic critics deliver design knowledge applicable to all kitchen designs,
such as accepted standards or regulations. Specific critics deliver design knowledge applicable
only to the design situation currently under consideration. Prior to design actions, a user can
characterize the kitchen to be designed by answering some questions, such as “Size of the family?” and “Is the primary cook right-handed or left-handed?” This specification is used later by
the system to fire relevant specific critics regarding design actions as well as to exclude critics
that are relevant in general but not consistent with the specification in particular.
Despite the value of a partial specification of high-level goals in delivering relevant information, some users may have difficulty articulating their requirements before they start to
perform their task, especially when they are not clear about what kind of information is provided by the system. Moreover, if the structure of the information system is too complicated
and too many questions need to be answered by users to collect a meaningful partial specification, users may get annoyed and become impatient with the system. Generally speaking,
users are more eager to do “real work” than to answer a long list of questions.
5.2.3.2
Incremental Discourse Modeling
Incremental discourse modeling is another approach to the creation of the shared understanding of the larger context. A discourse model represents the interaction history between the
user and the system. In this approach, information about the larger context is revealed to the
system piece by piece, and the shared understanding is established incrementally as users act
to achieve the global goal. Previous actions define the historical context in which the current
action takes place and limit the applicability of information in the current situation. This is
similar to the conversation structure in natural languages in which a new utterance is interpreted
by the listener in light of the conversational discourse defined by previous utterances. However,
the term “discourse model” is used here in a broader sense than it is used in natural language
understanding. In natural language understanding, discourse models are mainly used to disambiguate referring expressions such as it, this and my car (Jurafsky & Martin, 2000); whereas in
this thesis, discourse models are used to disambiguate the relevance of information regarding
the context.
Incremental discourse modeling amortizes the efforts needed from users for the specification of the global goal. For each piece of information delivered by active information systems
throughout the work process, users can choose to agree with it, disagree with it, ignore it, or declare it irrelevant. If information systems have means to capture and represent these responses
in a discourse model, more appropriate information can be delivered later.
A discourse model can be either positive or negative. A positive discourse model contains
the type of information that has interested users and is likely to be useful. In later deliveries,
information systems try to retrieve and deliver information of the same or a similar type. A negative discourse model contains the type of information in which users have not been interested.
It can be used as a filter to remove the same type of information. Negative discourse modeling
is particularly powerful in situations where misfits are much easier for users to identify than
fits (Alexander, 1964). Information filtering and information retrieval are two sides of the
same coin; both aim to improve the relevance of information. An information system can create
a discourse model with both positive and negative representations of the larger context. In systems that have this kind of mixed discourse model, information similar to positive descriptions
is considered to have higher relevance, whereas information similar to negative descriptions is
considered to have lower relevance.
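A mixed discourse model of this kind can be sketched as two sets that adjust a base relevance score. The class name, the use of a category string to stand for "type of information", and the adjustment factors 1.5 and 0.5 are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical mixed discourse model with positive and negative parts. */
final class DiscourseModelSketch {
    private final Set<String> positive = new HashSet<>(); // categories the user found useful
    private final Set<String> negative = new HashSet<>(); // categories the user rejected

    /** Records a positive response; a category is never in both sets at once. */
    void markUseful(String category) {
        negative.remove(category);
        positive.add(category);
    }

    /** Records a negative response. */
    void markIrrelevant(String category) {
        positive.remove(category);
        negative.add(category);
    }

    /** Raises or lowers a base relevance score; the factors are arbitrary assumptions. */
    double adjust(String category, double baseScore) {
        if (positive.contains(category)) {
            return baseScore * 1.5;
        }
        if (negative.contains(category)) {
            return baseScore * 0.5;
        }
        return baseScore;
    }

    public static void main(String[] args) {
        DiscourseModelSketch model = new DiscourseModelSketch();
        model.markUseful("java.util");
        model.markIrrelevant("java.awt");
        System.out.println(model.adjust("java.util", 0.4));
        System.out.println(model.adjust("java.awt", 0.8));
    }
}
```

Using the negative set as a hard filter instead, as described above, would amount to dropping any item whose category it contains rather than lowering its score.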
A discourse model can be created and augmented explicitly or implicitly. Explicit discourse modeling requires users to respond to the delivery of information with an explicit answer.
For example, a piece of information can be delivered with a mechanism that allows users to specify whether the information is useful or useless, relevant or irrelevant, and the system integrates
the user responses into the discourse model. Instead of asking users for direct input, implicit
discourse modeling observes the users’ reaction to the delivered information—whether the user
uses or ignores the information—and augments discourse models based on acquisition rules
that make inferences from the observation.
As an example, a discourse model can be incorporated into the “Tip of the Day” of
Microsoft Office to improve the context relevance of delivered tips. For instance, if a user
continues to use tables in his or her recent use of the Office system, an appropriate discourse
modeling mechanism would be able to implicitly capture this and instruct the system to deliver
tips about table operations to the user.
5.3
Personalizing Information Delivery
Information systems exist as a resource to supplement and overcome the limitation of
a user’s knowledge. However, delivering a piece of information that is already known to the user is more disruptive than helpful. Because different users have different knowledge
backgrounds, a piece of information that is helpful to one user may be distracting to another.
Therefore, active information systems should personalize the delivered information. In other
words, they should deliver user-specific information.
5.3.1
Representing Background Knowledge as User Models
Effective communication requires the ability to represent the other communicating partner’s
knowledge (Norman, 1993). User models, which represent the users’ preferences and
knowledge levels about a system, can be used in an active information system to adapt the system behavior to each user and to improve the efficiency of communication between users and
systems (Thomas, 1996). User models are the result of a user modeling mechanism embedded
in a system (Wahlster & Kobsa, 1989).
The terms “user model” and “user modeling” are overloaded with different meanings
in research literature. As a general definition, a user model is a computer system’s model of
user characteristics for the purpose of tailoring the interaction or making the dialog between
the user and system adaptive (Murray, 1987). However, user characteristics can have many
dimensions: knowledge about the computer system, knowledge about the domain, goal of the
current task, preferences, cognitive and learning abilities or disabilities, and so on. When the
user model represents the goal of the current task, it overlaps with the task modeling described
in Section 5.2.1. Furthermore, a user model has different temporal dimensions (Dieterich et al.,
1993). When it is used to describe characteristics of a user valid only in the current context or
session (short-term data), it overlaps with the discourse model described in Section 5.2.3.2.
Throughout this thesis, the term “user model” is used exclusively in the following sense.
A user model represents the background knowledge that a user possesses and is kept as long-term data on a permanent storage medium. As shown in Figure 4.1, a user’s knowledge about an
information repository falls into four levels. A user model should contain pieces of information
falling in both L1 and L2. Because information belonging to L1 is well known and regularly
used, there is no need for the system to actively deliver it. Although information belonging
to L2 has not been completely acquired by the user yet, it can still be considered as a part of
the user’s active knowledge because the user knows about it and will use it readily when it is
needed. Even if the user may need more details about it, he or she knows very well how to find
them with information access mechanisms. Accordingly, user modeling is used in this thesis
to refer to the computational mechanism embedded in the information systems for the creation,
augmentation, and maintenance of such user models.
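In its simplest form, a user model in this sense acts as a filter on candidate deliveries. The sketch below assumes that items are identified by name and that the model is an in-memory set; the class and method names are hypothetical, and persistence to permanent storage is omitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical user model holding the items a user already knows (levels L1 and L2). */
final class UserModelSketch {
    private final Set<String> known = new HashSet<>();

    /** Adds an item the user is assumed to know well or vaguely. */
    void addKnown(String item) {
        known.add(item);
    }

    /** Keeps only the candidates the user is not assumed to know, preserving their order. */
    List<String> personalize(List<String> candidates) {
        List<String> unknown = new ArrayList<>();
        for (String candidate : candidates) {
            if (!known.contains(candidate)) {
                unknown.add(candidate);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        UserModelSketch model = new UserModelSketch();
        model.addKnown("java.util.ArrayList");
        // Prints [java.util.BitSet]
        System.out.println(model.personalize(List.of("java.util.ArrayList", "java.util.BitSet")));
    }
}
```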
5.3.2
Acquiring User Models
User models cannot be created once and for all because users’ knowledge about a system
changes over time. As users’ knowledge changes, their needs for information change, and their
user models should also be modified to reflect the change.
Similar to discourse models, user models can be explicitly modified by users or implicitly
updated by the system. Direct modification from users requires that the system be adaptable,
which means users can customize the system behaviors to their own needs. Adaptive systems
automatically update user models based on information observed or inferred from monitoring
users’ interactions with the system (Fischer, 1993).
Adaptability and adaptivity complement each other. Although adaptivity requires little
effort from users, it needs a relatively long time to establish a reliable user model. Deployment
of VDDE, an active design environment for phone-based interface design, has found that experienced designers do not expect to be interrupted with information that they have told the system
is irrelevant (Sumner, 1995). Adaptability of systems gives users direct and immediate control
over what information should be delivered. However, it places extra work on users.
Another challenge in the acquisition of user models is how to initialize a user model.
Few users know nothing about an information system when they start to use it, and, obviously,
empty user models do not reflect this fact. Mechanisms supporting adaptability and adaptivity can be extended to the explicit and implicit acquisition of initial user models, respectively.
An explicit acquisition method directly asks users what they already know through an up-front
questionnaire or testing sessions when they use the system for the first time. An implicit acquisition method is suitable if artifacts that users previously created in the domain the
information system is designed to support are available. The interactive adaptivity mechanism
can be modified into a batch process that analyzes those existing artifacts to obtain the initial user
models. A third method is to create several stereotypical user models to represent different
levels of users, as is widely done in intelligent tutoring systems and intelligent help systems.
One of the stereotypical models can be chosen as the approximation of the initial user model,
or a user can copy the user model of another user who has a similar level of knowledge.
5.4
Dealing with Partial, Imprecise Queries
Locating useful information from an information system to support complex design processes
in wicked problem domains in general and locating reusable components from a component repository in particular are not the same as searching for data in a database system where
the query is well defined and completely articulated. Instead, users are looking for useful information and the usefulness can only be determined by them, depending on how they intend to
make use of it.
Users’ queries as well as items in information systems are usually not directly represented. For example, in a multimedia design environment, a user wants to find an ideal image
from a large image library for a design task. There is plainly no way to express the query
directly—after all, if the user knows how to directly express it, he or she has it already and
does not need to locate it at all! The user must rely on an abstract representation schema to
describe certain attributes of the needed image. Images in the image library are also abstractly
represented. But even with this abstract representation schema, users can begin with only a very
vague query (Nakakoji et al., 1998). In a reusable component repository system, reusable components are not directly represented either: a direct representation would be the program code itself;
instead, they are represented by surrogates such as textual descriptions and signatures. Those
abstract representations are partial, biased, and imprecise. Even in the domain of document
retrieval, where the representation schema is the same as the information itself, users often ask
the wrong questions (Jones, 1997).
5.4.1
Context-Aware Browsing
Given the fact that complete requirements for information are not available at first, information
systems cannot locate the exact information needed by users. The problem of incomplete
information requirements is more severe in active information systems because requirements are
inferred. However, active information systems can heuristically reduce the searching space to
the extent that users can easily browse and choose the one needed. This kind of approach to the
acquisition of information is defined as context-aware browsing.
Querying and browsing are the two major information access mechanisms for most users.
Querying is direct: users formulate a query and the system returns information matching the
query. However, formulating queries is a cognitively challenging task because users have to
overcome the gap from the situational model to the system model (see Section 3.4.3.2). In
browsing, users determine the usefulness or relevance of the information currently being displayed in terms of their task and traverse its associated links. People tend to find browsing more
fun than querying because they do not need to commit resources at first and can incrementally
develop their requirements after evaluating the information along the way (Lieberman, 1997).
Mili et al. claim that browsing is the predominant pattern of component repository usage
because most programmers cannot formulate clearly defined requirements for reusable
components, so they rely on browsing to get acquainted with available reusable components in
the repository (Mili et al., 1999).
However, browsing is not scalable for the following reasons. First, there is an inherent
dilemma in the design of the browsing structure of an information system: If links are too
many, users will be puzzled by the complexity; if links are too few, information is not well
connected. Second, there cannot be a structure suitable for all users and all user tasks, and
the structure defined at the design time may not be the structure needed at the use time. For
example, the Smalltalk class library is structured according to the inheritance relationship. This
structure is perfectly suitable for the execution of programs, and for locating components whose
super nodes (classes or super-classes) are known. However, it is not suitable for programmers to
find a method based on functionality. Some methods with similar functionality are scattered in
different deep nodes of the inheritance tree (Helm & Maarek, 1991). It is therefore very difficult
for programmers to find and compare all of them in order to choose the most appropriate one
to reuse. Third, in a large information system, following the right link requires users to have a
very good understanding of the structure of the whole system. Most users, especially the less
experienced users, may easily get lost in a complex network of nodes while tracing dozens of
links (Halasz, 1988).
Context-aware browsing supported by active information systems combines the strengths
of both querying and browsing: the directness of querying and the lower cognitive threshold of
browsing. Active information systems automatically collect and present information to users
based on the task context where the information is consumed. Even though the delivered information may not be precise enough due to the incompleteness of task models, users can immediately start to browse a significantly reduced information space that is organized in accordance
with their task structure.
5.4.2
Supporting Retrieval-by-Reformulation
Another approach to complement the incompleteness of information requirements is
to support retrieval-by-reformulation (see Section 3.4.3.4). Active information systems first
present users with the initial retrieval results. This initial delivery can serve the following two
purposes:
(1) Users can learn how the information system stores and organizes its information by
examining the retrieval results.
(2) Users can discover some requirements that were not present in the initial query by
comparing the retrieval results with their intentions of use in context.
Based on newly acquired knowledge of the information system and discovery of new aspects
of requirements, users can either reframe or refine the initial query to improve its completeness
and preciseness, or directly manipulate the retrieval results by filtering out apparently irrelevant
information so that those needed get more focused attention.
When the retrieval-by-reformulation mechanism is integrated with active information
systems, the reformulation process of users can be used at the same time, as a nice side effect,
to augment discourse models and user models. Details will be explained in Section 3.4.3.4 in
the context of introducing CodeBroker.
5.5
Comparing Active Information Systems with an Example in the Real
World
Locating useful information from a large information repository is very similar to locating
an item in a big store. Empirical studies (Reeves, 1991) on the interaction between
customers and sales agents in McGuckin Hardware, a store that carries more than three hundred thousand items, have revealed that customers coming to the store often have a vague idea
of what they want, and they often do not know where to start to find it. Sales agents, called
Roamers, have incomplete or vague knowledge of the items in the store, but enough knowledge
to direct customers toward an aisle where things of interest might be located, based on listening
to customers’ descriptions. Once in that aisle, customers can incrementally improve the preciseness of the problem description, based on examining existing available tools, or by talking
to another kind of agent, called Green Apron who is a specialist of the domain of that aisle.
Active information systems play the role of a Roamer by dynamically constructing a
virtual “aisle” of information of interest. Through examining and evaluating the information in
the virtual “aisle,” users have two ways to narrow their focus through retrieval-by-reformulation:
(1) directly manipulate the virtual “aisle” to remove apparently irrelevant items to narrow
the collection for easier choice, or
(2) refine or reframe their queries to start another round of locating.
5.6
The Spectrum of Support for Locating Information
As epitomized in the problem of locating reusable components analyzed in Chapters 3
and 4, a wide gap exists between the user needing information and the information system
providing information. Three approaches exist to bridge this gap as shown in Figure 5.5.
The first approach is taken by passive information systems based on pure information
access mechanisms. For users to acquire the needed information, they must bridge the gap by
themselves after they have learned how to use the system, how to write appropriate queries, and
how to anticipate the existence of information. This approach can be called the user-expert approach
because the user is trained to be an expert in using the system.
The second approach is the computer-expert approach, in which the computer system
plays the role of expert and tries to infer the needs of users and deliver the precisely relevant
information. Although this approach is ideal, due to the incompleteness of task models, it is
very difficult to implement such smart systems.
The third approach is the distributed-expert, or the human-computer cooperation approach. It acknowledges the fact that neither computers nor users have enough expertise to find
the relevant information alone, and the expertise is distributed among users and systems—users
know their needs, and systems know what exists in their repositories. To overcome this symmetry of ignorance or asymmetry of knowledge (Rittel, 1984), cooperation is needed between
users and computer systems. Active information systems incorporated with the retrieval-by-reformulation mechanism adopt such an approach. Their delivery mechanism first presents a
set of potentially relevant items of information based on inferred task models, discourse models, and user models, and then users contribute to the process of information location through
the retrieval-by-reformulation mechanism. This cooperation process is also a mutual learning
process. From the users’ reformulation process, systems learn the knowledge level of users and
the larger context of their tasks to augment the discourse model and user model that can make
systems deliver more context-relevant information later. From the deliveries of systems, users
learn the structure of information systems and the availability of relevant information, which
can be utilized in their later retrieval activities.
[Figure 5.5 diagram: the gap between the information needs of users (User Needing Information) and the Information System, bridged in three ways: pure delivery; delivery plus reformulation; pure access]
Figure 5.5: The spectrum of support to information location
The first part of the figure describes the computer-expert approach in which the
system plays the role of expert and presents relevant information based on a pure
delivery mechanism; the second part describes the distributed-expert approach, in
which the system and the user cooperate; the third part is the user-expert approach,
in which the user plays the role of expert and locates the information using a pure
access mechanism.
Chapter 6
Indexing and Retrieval Mechanisms in CodeBroker
This and the following chapters describe the system development effort of the research.
Two subsystems have been developed: CodeIndexer, which creates the component repository
from existing Java programs and libraries, and CodeBroker, which assists Java programmers in
locating and reusing components while creating new programs¹ (Figure 6.1).
Figure 6.1: The CodeIndexer and CodeBroker subsystems
After explaining the indexing and retrieval mechanisms used by CodeBroker, this chapter
describes how to use CodeIndexer to create a component repository that CodeBroker can use.
1
For the sake of brevity, throughout the thesis, the word CodeBroker refers to the whole system development
effort, and CodeIndexer is used when the description involves indexing only.
6.1
Indexing and Retrieval Mechanisms
An effective retrieval mechanism is essential in deciding whether relevant reusable components can be located. An encoding schema that determines how to represent reusable components and reuse queries, and a relevance judgment criterion that determines whether a component is relevant to a reuse query, are two major considerations in the design of retrieval mechanisms. Encoding a component for the purpose of indexing can be based on its concepts, constraints, or code (see Section 9.2 for more details about other indexing methods used for component repositories). CodeBroker encodes components based on both concepts and constraints.
The CodeBroker system extracts the concept of a component from its associated documentation
embedded in Java source programs in the format of doc comments, and the constraint from
the signature of a component. Reuse queries are represented in the same format. Relevance is
determined by the combination of concept similarity and constraint compatibility. CodeBroker
uses both probabilistic model-based indexing and retrieval techniques (Robertson & Walker,
1994) and the Latent Semantic Analysis (LSA) technique (Landauer & Dumais, 1997) to compute concept similarity. Constraint compatibility is computed by signature matching (Zaremski
& Wing, 1995). These methods are chosen because
(1) They require the least effort, among all suggested retrieval mechanisms for component
repository systems, to encode components for indexing and retrieval.
(2) They are the easiest and the most straightforward way for programmers to formulate
reuse queries for retrieval.
(3) The needed information to formulate a reuse query is readily available from the program editor.
(4) They are as effective as other complicated retrieval mechanisms in terms of retrieval
performance (Frakes & Pole, 1994; Mili et al., 1997b).
6.1.1
Free-Text Indexing and Retrieval
Free-text indexing and retrieval is concerned with finding documents in free-text form
(such as newspaper articles, research papers, books, web pages, etc.) relevant to the queries
submitted by users (Salton & McGill, 1983).
Free-text documents are indexed by terms. Terms can be controlled or uncontrolled. In
the controlled-term approach, an indexer is responsible for choosing the appropriate terms to
index the document. Those controlled terms are also known as keywords. Because this process
is often manually conducted, it is very time-consuming when the collection of documents gets
large. Another problem with controlled terms is that people often choose different terms to
describe the same document (Furnas et al., 1987; Harman, 1995), and even the same person
may not be consistent in choosing terms.
The other approach to choosing terms is to automatically extract them from documents.
Terms can be precisely the words appearing in documents. However, most free-text indexing
systems use one or both of the following two techniques: stemming and a stop list. Stemming
reduces words to their morphological root forms. For example, computer, computing,
compute, computation, and computational are all reduced to the form comput and
all five words are represented by the term comput. The advantage of stemming is to allow users
to find documents that contain morphological variations of the word in their queries. A stop list
is simply a list of high-frequency words that are not used as terms. Words that appear in almost
every document are not very useful in distinguishing one document from other documents in
terms of relevance to user queries.
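The two techniques can be illustrated with a minimal sketch. The suffix list and the stop list below are hypothetical and chosen only so that the five example words reduce to comput; real stemmers apply far more careful rules, and this is not the stemmer used by CodeBroker.

```python
# Illustrative only: a naive suffix stripper with a tiny, made-up stop list.
SUFFIXES = ("ational", "ation", "ing", "er", "e")   # checked longest first
STOP_LIST = {"the", "a", "of", "and", "is", "in", "to"}


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Split text into words, drop stop-list words, and stem the rest."""
    words = [w.strip(".,;:()") for w in text.split()]
    return [stem(w) for w in words if w and w.lower() not in STOP_LIST]
```

With these assumptions, a query containing computing and a document containing computation are both indexed under the same term.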
The two primary models for indexing and retrieving free-text documents are the
vector space model and the probabilistic model.
6.1.1.1
Vector Space Model
In the vector space model, documents and queries are represented as vectors of terms
contained in the whole collection of documents, commonly known as a corpus. The value of
each element in the vector reflects the importance of a particular term in representing the concept
or meaning of that document. In a corpus containing $N$ terms, a document is represented by a
vector in the $N$-dimensional space as follows:
$$D_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}) \tag{6.1}$$
where $w_{i,j}$ is a value denoting the importance of the term $i$ in representing the concept of
the document $D_j$.
The variation of the vector space model comes from the method of determining the value
for each $w_{i,j}$. The simplest binary value model sets $w_{i,j}$ to 1, if the term $i$ is present in the
document $j$, and to 0, if it is not. The binary value model does not reflect the fact that some
terms appear more in a document and thus contribute more to the concept of the document.
Term frequency ($tf_{i,j}$), which is the number of occurrences of term $i$ in document $j$, can be
used as the value of $w_{i,j}$. Using term frequency favors longer documents because most longer
documents tend to use the same word more often, as the verbosity hypothesis (Robertson &
Walker, 1994) states. To level the ground, $tf_{i,j}$ can be normalized by being divided by the
overall length of the document vector.
The term frequency model, normalized or not, treats all terms equally. However, terms
that are limited to a few documents are more useful for discriminating documents from the
rest of the corpus than terms that occur frequently across the entire corpus. To reflect this
discrimination power of a term $i$, $w_{i,j}$ can be multiplied by its inverse document frequency
($idf_i$). The $idf_i$ for each term is defined as follows:
$$idf_i = \log\left(\frac{N}{df_i}\right) \tag{6.2}$$
where
$N$ is the number of documents in the collection,
$df_i$ is the number of documents that include term $i$.
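The weighting scheme described above (term frequency multiplied by inverse document frequency) can be sketched as follows. This is a minimal illustration and not CodeBroker's implementation: documents are assumed to be lists of already extracted terms, the natural logarithm is assumed because the text does not state a base, and the function names are hypothetical.

```python
import math


def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for a corpus given as lists of terms."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):            # count each document once per term
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf_vector(doc, idf):
    """Weight each term of one document by its term frequency times idf."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1   # raw term frequency
    return {term: count * idf[term] for term, count in tf.items()}
```

A term that occurs in every document receives an idf of zero, so it contributes nothing to the vector, which is the intended discrimination effect.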
Queries submitted by users are also in free-text form and are represented as a query
vector in the same way as document vectors:
$$Q = (q_1, q_2, \ldots, q_N) \tag{6.3}$$
where $q_i$ is the weight of term $i$ in the query, computed from $qtf_i$, the frequency of term $i$ in the query.
The relevance of a document to a query is determined by comparing the document vector
against the query vector. It is common to use the cosine of the angle between two vectors as the
criterion to judge the similarity ($sim$) of a document and a query:
$$sim(Q, D_j) = \frac{\sum_{i=1}^{N} w_{i,j} \times q_i}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{N} q_i^2}} \tag{6.4}$$
When the document and the query (or two documents) are identical, their vectors should
be identical in the vector space and the cosine is one; and when they share no common terms,
namely, they are orthogonal to each other, the cosine is zero. Upon receiving a query from a user,
the retrieval system should thus compute the cosine for each document against the query vector,
and return those documents with higher cosine values to the user as the relevant documents.
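The cosine criterion and the resulting ranking can be sketched as follows. The sketch assumes that vectors are stored sparsely as dictionaries from term to weight; the function names are illustrative only.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * document.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0                       # an empty vector matches nothing
    return dot / (norm_q * norm_d)


def rank(query, documents):
    """Return document names ordered by decreasing cosine similarity."""
    return sorted(documents,
                  key=lambda name: cosine_similarity(query, documents[name]),
                  reverse=True)
```

Identical vectors yield a value of one (up to rounding), and vectors that share no terms yield zero, as stated above.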
6.1.1.2
Probabilistic Model
The probabilistic model ranks documents in decreasing order of their evaluated probability of relevance to a user query. It makes use of formal theories of probability and statistics
to evaluate, or estimate, those probabilities of relevance. The relevance probability is different
from the similarity computed in the vector space model. The latter generally lacks the theoretical soundness of the relevance probability, which can be defined precisely. However, the computation of a theoretically sound probability is not practically tractable, and currently the probability can only be roughly approximated based on various simplification assumptions (Crestani
et al., 1998).
The basis for all probabilistic models is the probability ranking principle, which asserts
that optimal retrieval performance can be achieved when documents are ranked according to
their probabilities of being judged relevant to a query (Robertson, 1977). Given a query $Q$,
the main task of retrieval systems based on the probabilistic model is to compute the relevance
probability $P(R \mid Q, D_j)$ for each query-document pair.
The relevance probability of a document can be estimated by assigning an appropriate weight to each term in the document corpus. Probabilistic models assume that terms are
distributed differently in relevant and irrelevant documents, which is known as the cluster hypothesis (Van Rijsbergen, 1979). If a term appears more frequently in relevant than in irrelevant
documents, it has more power to discriminate relevant from irrelevant documents. The discriminating power of a term is called its term relevance weight ($TRW$) in probabilistic models, and
its value is calculated by the following formula:
$$TRW_i^{(0)} = \log \frac{p_i \times (1 - q_i)}{q_i \times (1 - p_i)} \tag{6.5}$$
where $p_i$, $q_i$ represent the probability of the $i$th term appearing in a relevant or an irrelevant
document, respectively.
The above formula can be computed only retrospectively on test collections where the
relevance assessments of documents are known. At the time of regular document retrieval, we
do not know yet which document is relevant or irrelevant; therefore, we do not know how to
compute $p_i$ and $q_i$, and $TRW_i$ can only be estimated.
A simplified formula, proposed by Croft and Harper (Croft & Harper, 1979), uses corpus
information to make estimates and does not use the distribution probability ($p_i$ and $q_i$). In their
model, $TRW_i$ is computed as follows:
$$TRW_i^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} \tag{6.6}$$
where
$N$ is the number of documents in the collection,
$n$ is the number of documents containing the term,
$R$ is the number of relevant documents,
$r$ is the number of relevant documents containing the $i$th term.
When $R$ and $r$ are not available, $TRW_i$ can be further reduced to
$$TRW_i^{(2)} = \log \frac{N - n + 0.5}{n + 0.5} \tag{6.7}$$
which is similar to the inverse document frequency in the vector space model (see Equation (6.2)),
whose $df_i$ corresponds to $n$.
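The relationship between the two estimates can be checked numerically. The sketch below assumes the usual reading of Equations (6.6) and (6.7), with r and R the relevance counts, n the number of documents containing the term, and N the collection size; the function names are hypothetical and the natural logarithm is an assumption.

```python
import math


def trw_with_relevance(r, R, n, N):
    """Term relevance weight using relevance counts r and R (Equation 6.6)."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))


def trw_corpus_only(n, N):
    """Term relevance weight from corpus statistics alone (Equation 6.7)."""
    return math.log((N - n + 0.5) / (n + 0.5))
```

Setting both relevance counts to zero in the first function gives the same value as the second, and a term found in few documents receives a larger weight than a term found in half of them.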
In Equation (6.6), only the presence and absence of terms (the binary value model) are
considered. To take the term frequency within-document ($tf$) and term frequency within-query
($qtf$) into consideration, a more refined formula is proposed by Robertson et al. to estimate the
probability of relevance between document $D_j$ and query $Q$ (Robertson et al., 1995):
$$P(R \mid Q, D_j) = \sum_{i \in Q} TRW_i^{(2)} \times \frac{(k_1 + 1) \times tf_{i,j}}{K + tf_{i,j}} \times \frac{(k_3 + 1) \times qtf_i}{k_3 + qtf_i} \tag{6.8}$$
where
$K$ is $k_1 \times ((1 - b) + b \times dl_j / avdl)$,
$k_1$, $k_3$, $b$ are parameters depending on the nature of the queries and the collection of
the data. In CodeBroker, $k_1$ is set to 1.2, $k_3$ to 1.0, and $b$ to 0.75, according to
the data in (Walker et al., 1998),
$dl_j$ is the length of document $j$,
$avdl$ is the average length of all documents.
CodeBroker uses Equation (6.8) to calculate the relevance probability of a document to a given
query for two main reasons: it can reuse the source code that is available through the distribution of
the Remembrance Agent system (Rhodes & Starner, 1996), and the Okapi system, in which the
equation has been implemented, has achieved retrieval performance comparable to other leading
research prototypes of information retrieval systems (Robertson et al., 1995; Walker et al.,
1998). For a more detailed explanation of why Equation (6.8) provides an estimation of $TRW_i$,
please see (Robertson & Walker, 1994). For the sake of brevity, Okapi will be used to refer to
the probabilistic model used in CodeBroker.
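A minimal sketch of an Okapi-style score is given below. It is not the Remembrance Agent or Okapi source code: it assumes that the per-term weights are summed over the query terms, that the term relevance weight is the corpus-only estimate, and that documents and queries are lists of terms. The parameter values are those reported above for CodeBroker; all function and variable names are hypothetical.

```python
import math

K1, K3, B = 1.2, 1.0, 0.75   # parameter values reported for CodeBroker


def okapi_score(query_terms, doc_terms, doc_freq, n_docs, avg_len):
    """Sum the per-term Okapi weights over the distinct query terms."""
    dl = len(doc_terms)
    big_k = K1 * ((1 - B) + B * dl / avg_len)   # document length normalization
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue                             # absent terms add nothing
        qtf = query_terms.count(term)
        n = doc_freq[term]                       # documents containing term
        trw = math.log((n_docs - n + 0.5) / (n + 0.5))
        score += (trw * ((K1 + 1) * tf) / (big_k + tf)
                      * ((K3 + 1) * qtf) / (K3 + qtf))
    return score
```

For a document of average length in which a query term occurs once, both fractions equal one, so the term contributes exactly its relevance weight.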
6.1.1.3
Latent Semantic Analysis
LSA is an extension of the vector space model. The vector space model assumes that
terms are independent from each other and does not take their semantics into consideration;
therefore, it suffers from the concept-based retrieval problem (also known as vocabulary mismatch, discussed in Section 3.4.3.2): If programmers use terms different from those used in the
descriptions of components, they cannot find what they want. By constructing a large semantic
space of terms to capture the overall pattern of their associative relationship, LSA is expected
to facilitate concept-based retrieval and bridge the conceptual gap in formulating reuse queries.
The indexing process of LSA starts with creating a semantic space with a large corpus
of training documents in a specific domain. It first creates a vector for each document in the
corpus in the same way that the vector space model does. All vectors for documents of the
corpus compose a large term-by-document matrix $X$.
$$X = \begin{pmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,M} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,M} \\
\vdots & \vdots & & \vdots \\
w_{N,1} & w_{N,2} & \cdots & w_{N,M}
\end{pmatrix}$$
where the columns of the matrix represent the documents in the collection ($M$ documents in
the corpus), and the rows represent the terms ($N$ terms in the corpus). The term-by-document
matrix $X$ is then decomposed, by means of singular value decomposition, into the product of
three matrices: $T_0$, $S_0$, and $D_0'$.
$$X = T_0 \times S_0 \times D_0' =
\begin{pmatrix}
t^{(0)}_{1,1} & t^{(0)}_{1,2} & \cdots & t^{(0)}_{1,r} \\
t^{(0)}_{2,1} & t^{(0)}_{2,2} & \cdots & t^{(0)}_{2,r} \\
\vdots & \vdots & & \vdots \\
t^{(0)}_{N,1} & t^{(0)}_{N,2} & \cdots & t^{(0)}_{N,r}
\end{pmatrix}
\times
\begin{pmatrix}
s_{1,1} & 0 & \cdots & 0 \\
0 & s_{2,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & s_{r,r}
\end{pmatrix}
\times
\begin{pmatrix}
d^{(0)}_{1,1} & d^{(0)}_{1,2} & \cdots & d^{(0)}_{1,M} \\
d^{(0)}_{2,1} & d^{(0)}_{2,2} & \cdots & d^{(0)}_{2,M} \\
\vdots & \vdots & & \vdots \\
d^{(0)}_{r,1} & d^{(0)}_{r,2} & \cdots & d^{(0)}_{r,M}
\end{pmatrix}$$
$T_0$ is an orthogonal matrix having the left singular vectors of $X$, $D_0'$ is also an orthogonal matrix
having the right singular vectors of $X$, and the diagonal matrix $S_0$ is the singular value matrix
whose rank is $r$, the smaller number of $N$ and $M$. $s_{j,j}$ ($1 \le j \le r$) are singular values, and they
appear in decreasing order along the diagonal of the matrix $S_0$, namely,
$s_{1,1} \ge s_{2,2} \ge \cdots \ge s_{r,r}$.
The hypothesis behind LSA holds that because of synonymy and polysemy in natural
languages, there is much noise in the matrix $X$, and if the rank of the singular value matrix $S_0$
is reduced, by getting rid of less significant singular values, to a much smaller number $k$ to
obtain another singular value matrix $S$, the noise is reduced too. The value of $k$ often ranges
from 40 to 400, but the best value of $k$ still remains an open question and needs to be empirically
determined. The $T_0$ matrix with the size of $N \times r$ is reduced to $T$ with the size of $N \times k$, and
$D_0'$ with the size of $r \times M$ is reduced to $D'$ with the size of $k \times M$.
A new matrix $\hat{X}$, viewed as the semantic space of the domain represented by the corpus,
is constructed as the product of the three reduced matrices: $T$, $S$, and $D'$.
$$\hat{X} = T \times S \times D' =
\begin{pmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,k} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,k} \\
\vdots & \vdots & & \vdots \\
t_{N,1} & t_{N,2} & \cdots & t_{N,k}
\end{pmatrix}
\times
\begin{pmatrix}
s_{1,1} & 0 & \cdots & 0 \\
0 & s_{2,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & s_{k,k}
\end{pmatrix}
\times
\begin{pmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,M} \\
\vdots & \vdots & & \vdots \\
d_{k,1} & d_{k,2} & \cdots & d_{k,M}
\end{pmatrix}$$
In this new matrix $\hat{X}$, each row represents the position of each term in the semantic space. Terms
are re-represented in the newly created semantic space. The reduction of singular values is
important because it captures only the major, overall pattern of associative relationships among
terms by ignoring the noise that accompanies most automatic thesaurus construction based simply
on co-occurrence statistics of terms.
After the semantic space is created, each document is represented as a vector in the
semantic space based on terms contained, and so is a query. The similarity of a query and a
document is thus determined by the cosine of the two vectors as in Equation (6.4). A document
matches a query if their similarity value is above a certain threshold value.
The corpus used by CodeBroker to create the LSA semantic space for Java programming
comes from four sources: Linux on-line manuals, programming textbooks, the Java language
specification and virtual machine specification, and Java class libraries (component repositories). These four types of documents are chosen because they cover the domain knowledge
a Java programmer needs: knowledge about the computer and operating systems, which is
covered by Linux manuals; knowledge about programming in general, which is covered by
programming textbooks; knowledge about programming in Java, which is covered by the Java
specifications; and knowledge about reusable components, which is covered by the Java class
libraries. The corpus contains 78,475 documents and 10,988 different terms after common and
extremely rare words are cut off. A word is considered as extremely rare if it appears in one
document once. This is useful to remove those esoteric abbreviations that are common in Linux
on-line manuals but not used elsewhere.
6.1.2
Signature Matching
Signature matching is the process of determining the compatibility of two components in
terms of their signatures (Zaremski & Wing, 1995). It is an indexing and retrieval mechanism
based on type constraints of a module or a component (see Section 5.2.2.1).
Two signatures
Sig1 : OutTypeExp1 <- InTypeExp1
Sig2 : OutTypeExp2 <- InTypeExp2
match if and only if InTypeExp1 is in structural conformance with InTypeExp2, and OutTypeExp1 is in structural conformance with OutTypeExp2. Two type expressions are structurally conformant if they are formed by applying the same type constructor to structurally
conformant types.
This definition of signature matching is very restrictive because it misses components
whose signature does not exactly match, but that are practically similar enough to be reusable
after slight modification.
Partial signature matching relaxes the definition of structural conformance of types: A
type is considered as conforming to its more generalized form or more specialized form. For
procedural types, if there is a path from type $T_1$ to type $T_2$ in the type lattice, $T_1$ is a generalized
form of $T_2$, and $T_2$ is a specialized form of $T_1$. For example, in most programming languages,
integer is a specialized form of float; and float is a generalized form of integer. For object-oriented types, if $T_1$ is a subclass of $T_2$, $T_1$ is a specialized form of $T_2$, and $T_2$ is a generalized
form of $T_1$.
The constraint compatibility value between two signatures is the product of the conformance values between their types. The type conformance value is 1.0 if two types are in
structural conformance according to the definition of the programming language. It drops a
certain percentage if one type conversion is needed or if there is an immediate inheritance relationship between them. The signature compatibility value is 1.0 if two signatures exactly
match.
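The computation of the compatibility value can be sketched as follows. Several details are assumptions made only for illustration: the text does not give the size of the drop, so a value of 0.5 is used for one conversion step; the generalization links are a made-up two-entry lattice; types more than one step apart and signatures of different arity are given the value 0.0.

```python
# Hypothetical one-step links in a type lattice (specialized -> generalized).
GENERALIZED_FORM = {"int": "float", "float": "double"}
ONE_STEP_VALUE = 0.5   # assumed; the text only says "a certain percentage"


def type_conformance(t1, t2):
    """1.0 for identical types, a reduced value for one conversion step."""
    if t1 == t2:
        return 1.0
    if GENERALIZED_FORM.get(t1) == t2 or GENERALIZED_FORM.get(t2) == t1:
        return ONE_STEP_VALUE
    return 0.0         # simplification: anything further apart does not match


def signature_compatibility(sig1, sig2):
    """Product of type conformance values; a signature is (output, [inputs])."""
    out1, ins1 = sig1
    out2, ins2 = sig2
    if len(ins1) != len(ins2):
        return 0.0     # assumption: differing arity never matches
    value = type_conformance(out1, out2)
    for a, b in zip(ins1, ins2):
        value *= type_conformance(a, b)
    return value
```

Under these assumptions, a component returning float from two int arguments is ranked below an exact match for a query that expects int from two int arguments, but is still retrieved.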
A class signature is composed of its data definition part and method definition part. Signature matching of two classes $C_1$ and $C_2$ requires that both their data definition parts match
and their method definition parts match, respectively. Data definition parts are treated as record types, and are compared according to their structural conformance.
The method definition parts of two classes $C_1$ and $C_2$ match if for each method $m_1$ in
class $C_1$, there is a corresponding method $m_2$ in class $C_2$ such that $m_1$ matches $m_2$ according
to signature matching for methods. Correspondence from $m_1$ to $m_2$ is decided based on the
best match principle: among all methods from $C_2$, if $m_2$ has the highest signature
compatibility with $m_1$, it is considered as the matching method of $m_1$. The compatibility value
of the method definition part is thus the average value of compatibility values existing between
pairs of matching methods.
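The best match principle can be sketched as follows. The sketch is an illustration under assumptions, not part of CodeBroker (which does not implement class matching): the method-level measure is passed in as a function, an all-or-nothing placeholder measure is provided for demonstration, and a method without any counterpart simply contributes 0.0 to the average.

```python
def method_part_compatibility(methods1, methods2, compat):
    """Average, over the methods of class 1, of the best match in class 2."""
    if not methods1 or not methods2:
        return 0.0
    best_values = [max(compat(m1, m2) for m2 in methods2) for m1 in methods1]
    return sum(best_values) / len(best_values)


def exact_or_nothing(sig1, sig2):
    """Placeholder method-level measure: 1.0 only for identical signatures."""
    return 1.0 if sig1 == sig2 else 0.0
```

Any graded method-level measure, such as one based on type conformance values, can be supplied in place of the placeholder.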
Because classes inherit data and methods from their parent classes, comparing two classes
only is not enough if they do not inherit immediately from the same class. Inherited data definitions and methods must be taken into consideration. A common ancestor class is located first,
and then all data definitions and method definitions in between the common ancestor and the
compared classes should be added.
CodeBroker does not implement signature matching for classes due to the following
two considerations. First, the primary goal of CodeBroker is to deliver reusable components
before programmers start to implement the module. Unlike other object-oriented programming
languages, such as C++, in which the declaration of class interfaces is usually separated from
the implementation and is stored in a separate file such as a header file, the Java programming
language does not provide a mechanism to separate the declaration of class interfaces from
implementations. It is not very common that Java programmers start to fill in the implementation
of methods after they have finished the declaration of the class signature—the class variables
and all method signatures. Therefore, in most cases, the class signature would not be available
to CodeBroker for it to deliver components before implementation.
Second, signature matching alone is not very powerful in locating reusable components
because the limited number of primary data types in a programming language such as Java leads
to few variations of signatures. However, signature matching can play an important differential role when many components with similar concepts are retrieved, and programmers need to
choose one that fits their type constraints. The main goal of using signature matching in CodeBroker is to make choosing components easier when there is a set of components intended for
the same purpose but implemented for different data types. For example, a reasonably good
component repository that contains random number generators needs a set of components that
create random numbers of different types such as integers, floats, and long integers. These components usually have the same functionality descriptions, namely, the same concepts, and the signature-matching process would help programmers identify the desired component immediately without
too much browsing effort.
6.2
Creating the Component Repository
The component repository in CodeBroker is created by its indexing subsystem, CodeIndexer. CodeIndexer extracts and indexes functional descriptions (concepts) and signatures (constraints) from the HTML-based online documentation generated by running Javadoc over Java
source programs (Figure 6.2).
[Figure 6.2 diagram: Java Source Programs -> Javadoc -> HTML-based Documents -> CodeIndexer (Concept Indexing, Signature Indexing) -> Component Repository]
Figure 6.2: The process of creating a component repository from Java programs
Javadoc generates documentation, in HTML format, for Java programs by parsing the
source files. In the HTML documentation, each Java class has its own HTML-formatted document file, which is cross-linked to the document files of its super-classes and sub-classes. The
contents of a class document describe the functionality of the class and all of its methods. Those
descriptions are extracted from doc comments associated with each class and method. An example of a Javadoc document is shown in Figure 6.3. Documentation for Java components
distributed with JDK (Java Development Kit) by Sun Microsystems, Inc. is also generated by
Javadoc. Other component developers create documentation for their components in the same
fashion.
CodeIndexer creates indexes for Java components in two steps. First, it extracts needed
information for indexing from Javadoc documents and converts it into the CodeBroker indexing format that can be processed by the indexing program. Each method of a class is treated
as a document to be independently indexed, although in Javadoc documentation, all method
descriptions of a class appear as one physical file. Five types of information are extracted for
the purpose of indexing a method component: the full class name (including the package name
and class name); the HTML tag name which specifies the exact location of the method in the
Javadoc document; the method name; the signature; and the description of the method included
in the doc comment for the method (Figure 6.4). Doc comments of Java may use special tags,
which begin with the @ character and allow Javadoc to provide additional formatting for the
documentation. For example, some doc comments may include @author to specify the author of the component, or @see to specify a link to related methods or classes. These tags
could provide additional indexing information to narrow the range of components to be located.
For instance, a programmer may be interested in components written by a specific author only.
However, the current version of CodeBroker does not support this, and all special tags, along
with their contents, are removed.
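The record layout of Figure 6.4 can be read back with a few lines of code. The sketch below is a hypothetical reader, not the CodeIndexer implementation; it assumes that every record starts with the line NEW METHOD::, that each field begins with its three-letter name followed by a colon, and that any other non-empty line continues the preceding field.

```python
FIELDS = ("CLS", "TAG", "MET", "SIG", "DEF")


def parse_method_documents(text):
    """Split indexing text into one dict per NEW METHOD:: record."""
    records, current, last_field = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if line == "NEW METHOD::":
            current, last_field = {}, None       # start a new record
            records.append(current)
        elif current is not None and line[:4] in [f + ":" for f in FIELDS]:
            last_field = line[:3]
            current[last_field] = line[4:].strip()
        elif current is not None and last_field and line:
            current[last_field] += " " + line    # continuation line
    return records
```

Each resulting dictionary corresponds to one method component and carries the five kinds of information listed above.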
The second step of CodeIndexer creates, from the CodeBroker indexing documents in
the format of Figure 6.4, three index files: the probabilistic model index file (or Okapi index
file, for short), the LSA index file, and the signature index file. The Okapi index file and LSA
index file contain the concept indexes of components, and the signature index file contains the
signature indexes.
Figure 6.3: An example of a document generated by Javadoc
The Okapi index for a component consists of terms and their frequencies appearing in
the doc comment. A term is the stemmed form of an English word that is not included in the
stop list.
NEW METHOD::
CLS: java.lang.String
TAG: length
MET: length
SIG: int length()
DEF: Returns the length of this string. The length is equal
to the number of 16-bit Unicode characters
in the string.
NEW METHOD::
CLS: java.lang.String
TAG: charAt
MET: charAt
SIG: char charAt(int index)
DEF: Returns the character at the specified index. An index
ranges from 0 to length() - 1.
Figure 6.4: The indexing format of method documents in CodeBroker
In CodeBroker, each method is represented by an independent document, which
starts with the line NEW METHOD::, followed by 5 fields: CLS for the full class
name, TAG for the HTML tag of the method, MET for the method name, SIG for
the signature, and DEF for the description of the method.
The LSA index for a component is a float vector with length $k$ calculated by the following
equations²:
$$LSAVector = (v_1, v_2, \ldots, v_k) \tag{6.9}$$
$$v_i = \left( \sum_{j=1}^{N} tf_j \times t_{j,i} \right) \times s_{i,i}^{-1} \tag{6.10}$$
where
$k$ is the number of singular values in the pre-computed semantic space,
$N$ is the number of terms in the semantic space,
$tf_j$ is the frequency of the term $j$ in the component,
$t_{j,i}$ is the term vector of term $j$ in the pre-computed semantic space (terms in
the component but not in the corpus are discarded),
$s_{i,i}$ is the singular value.
2
$k$ is the reduced number of singular values in the LSA semantic space created from the training corpus. In
CodeBroker, $k$ is equal to 278, which is the number of non-zero singular values in the semantic space.
The signature index for a component is in the following format:
5617 getInt : int <- int x int
where the leftmost number is the identifier number assigned to each component. The string
following the number is the component name (getInt), the string following the colon : is
the returned type (int), and the string following the left arrow (<-) specifies the input type(s)
(int x int).
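The signature index format can be read back with a few string operations. The Python sketch below is only an illustration under the assumption that input types are separated by " x " exactly as in the example entry; SignatureEntry and parse_signature_entry are hypothetical names, and how CodeBroker writes entries for methods without parameters is not stated here, so the sketch simply yields an empty tuple in that case.

```python
from typing import NamedTuple, Tuple


class SignatureEntry(NamedTuple):
    component_id: int
    name: str
    return_type: str
    input_types: Tuple[str, ...]


def parse_signature_entry(line):
    """Parse an entry such as '5617 getInt : int <- int x int'."""
    head, _, signature = line.partition(":")
    id_text, name = head.split()
    return_type, _, inputs = signature.partition("<-")
    # Input types are separated by ' x '; blank pieces are dropped.
    input_types = tuple(t.strip() for t in inputs.split(" x ") if t.strip())
    return SignatureEntry(int(id_text), name, return_type.strip(), input_types)
```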
To speed up the locating process and to reduce the size of indexing files, all three index
files are encoded and stored as a database file.
The indexing mechanism can easily create a component repository from any Java source
program. However, components scavenged from ordinary programs present challenges
for reuse beyond the locating problem, such as low-quality documentation and code.
Because the focus of this research is to help programmers discover reusable components, the
current version of CodeBroker includes the Java 1.1.8 Core API library and JGL 3.0 (Java
General Library from ObjectSpace Inc.), both of which are of high quality and well documented.
There are 503 and 170 classes in Java 1.1.8 and JGL, respectively, and a total of 7,338 method
components.
Chapter 7
Locating and Delivering Components in CodeBroker
This chapter describes in detail the techniques used by CodeBroker. The interface of
CodeBroker is integrated with the programming environment to create an information-enriched
workspace that supports reuse within development. Compared to passive component repository systems that automate the retrieval process only, CodeBroker also eliminates the steps of
forming reuse intentions and formulating reuse queries (see Figure 3.1).
CodeBroker runs, as an active process, in the background of a programmer’s working environment (a program editor—Emacs) to monitor the programmer’s interactions with it. Using
the programmer’s current working context as retrieval cues, it automatically locates and delivers
context-sensitive reusable components.
7.1 System Architecture
The architecture of CodeBroker is shown in Figure 7.1. CodeBroker consists of three
software agents: Listener, Fetcher, and Presenter. A software agent is a software entity that
functions autonomously in response to the changes in its running environment without requiring human guidance or intervention (Bradshaw, 1997; Ye & Reeves, 2000). The Listener agent
captures the task in which programmers are currently engaged by monitoring and analyzing
their interactions with the editor, and creates reuse queries as task models. Those queries are
then passed to the Fetcher agent, which retrieves the matching components from the component
repository. Reusable components retrieved by Fetcher are passed to the Presenter agent, which
uses discourse models and user models to remove unwanted components and delivers the rest
in the Reusable Component Information-display (RCI-display, for short) placed below the editor window. Through the RCI-display, programmers can invoke the retrieval-by-reformulation
mechanism supported by CodeBroker to either directly manipulate the delivered components
or refine the queries if they cannot immediately find useful components. The retrieval-by-reformulation mechanism also helps to evolve discourse models and user models.
Figure 7.1: The architecture of the CodeBroker system (data flows are drawn as bubbles showing
their contents; control flows are drawn as labels showing the action)
7.2 Listener
The Listener agent runs continuously in the background of Emacs to monitor the input
of programmers. Its goal is to model the immediate programming task of programmers. As
described in Section 5.2.2, Listener uses similarity analysis to create task models that are used
as reuse queries passed to the Fetcher agent. Two types of reuse queries are autonomously
created by Listener: the concept query extracted from doc comments, and the constraint query
extracted from signatures. A task model may include the concept query only or include both
the concept query and the constraint query.
7.2.1 Creating Concept Queries from Doc Comments
Doc comments in Java begin with /** and end with */. Whenever a slash character
(/) is entered into the editor, Listener scans one character backward to see if the slash is
preceded by an asterisk (*). If that is the case, Listener scans backward further until it finds the
string /**, or the end of another statement in case the programmer has not written a legal doc
comment statement. If a legal doc comment statement is found, the contents between /** and
*/ are then extracted and passed by the Listener agent as a concept query to the Fetcher agent
that locates and delivers components that match the concept query.
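The backward scan described above can be sketched in a few lines of Python. This is a simplified illustration, not Listener's code: the editor buffer is modeled as a plain string, cursor is assumed to be the index just after the slash that was typed, and the check for "the end of another statement" is approximated by requiring that the nearest comment opener before the closing */ is /**. The name extract_concept_query is hypothetical.

```python
def extract_concept_query(buffer, cursor):
    """Return the text of the doc comment that ends at cursor, or None."""
    # The slash just typed must be preceded by an asterisk.
    if cursor < 2 or buffer[cursor - 2:cursor] != "*/":
        return None
    close = cursor - 2
    # Scan backward for the nearest comment opener; it must be "/**".
    start = buffer.rfind("/*", 0, close)
    if start == -1 or not buffer.startswith("/**", start):
        return None
    # An earlier "*/" in between means that comment was already closed.
    if "*/" in buffer[start + 2:close]:
        return None
    # Drop the leading asterisks that conventionally decorate each line.
    lines = (ln.strip().lstrip("*").strip()
             for ln in buffer[start + 3:close].splitlines())
    text = " ".join(ln for ln in lines if ln)
    return text or None
```

An ordinary block comment (/* ... */) yields None, so only doc comments trigger a concept query.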
Figure 7.2 shows an example of component delivery based on concept queries. A programmer wants to create a random number between two long integers, and before he or she
implements it (i.e., writes the code part of the program), the programmer indicates his or her
task in the doc comment. As soon as the comment is written, a task model, including the concept query only, is created and represented in the same format as is used for indexing component
documents (Section 6.2, Figure 6.4):
DEF: Create a random number between two limits.
Components whose functionality descriptions or concepts show similarity to this task model are
delivered by CodeBroker.
7.2.2 Creating Constraint Queries from Signatures
Concept queries or concept-based task models are often not complete enough to describe
what the programmer wants. For example, in Figure 7.2, the doc comment from which the
concept query is created does not say the method must take two long integers as input. Although
the fourth component in the RCI-display buffer, the signature of which is shown in the message
buffer (the last line of the window), could be modified to achieve the task, it would be better to
find a component that can be immediately integrated without modification.
Figure 7.2: Component delivery based on concept queries only
Components in RCI-display are delivered based on the task indicated in the doc
comments preceding the cursor. The third component (nextBytes) and the fourth
one (getInt) in RCI-display have the potential to be reused because their concepts
are similar to the concept of the method under development.
Further information about the task could be acquired from signatures that reveal the type
constraints of modules under development (Section 6.1.2). In the previous example (Figure 7.2),
as the programmer proceeds to declare the signature, the Listener agent refines the task model
by taking into consideration the constraint requirements for reusable components.
As Figure 7.3 shows, when the programmer types the left brace { (just before the cursor),
Listener is able to recognize it as the end of a signature definition, and creates a constraint
query in the format of signature: long <- long x long. A more precise task model,
including both the concept query and the constraint query, is created as follows:
SIG: getRandomNumber: long <- long x long
DEF: Create a random number between two limits.
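A minimal Python sketch of how such a constraint query could be derived from the text preceding the left brace is given below. It is an assumption-laden illustration rather than Listener's implementation: the regular expression covers only simple method headers (constructors and unusual modifiers are not handled), the method name used in the example is taken from the task model above, and constraint_query is a hypothetical name.

```python
import re

# Words that can precede "(...) {" without starting a method definition.
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch", "synchronized"}

# return-type, method name, parameter list, optional throws clause, end of text.
SIGNATURE = re.compile(
    r"(?P<ret>[\w.]+(?:\[\])*)\s+(?P<name>\w+)\s*\((?P<params>[^()]*)\)"
    r"\s*(?:throws\s+[\w.,\s]+)?$"
)


def constraint_query(text_before_brace):
    """Return 'name: ret <- t1 x t2 ...' for a method header, else None."""
    match = SIGNATURE.search(text_before_brace.rstrip())
    if match is None or match.group("name") in CONTROL_KEYWORDS:
        return None
    types = []
    for parameter in match.group("params").split(","):
        tokens = parameter.split()
        if len(tokens) >= 2:  # "[final] Type name": the type is next to last
            types.append(tokens[-2])
    return f"{match.group('name')}: {match.group('ret')} <- {' x '.join(types)}"
```

Control statements such as if (...) { also end with a left brace, which is why the sketch rejects a small set of keywords before building the query.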
Figure 7.3: Component delivery based on both concept queries and constraint queries
Components in RCI-display are delivered based on both the doc comment and the
signature. The first component, which was not shown in Figure 7.2, can be reused
by the programmer because it fully matches the programming task.
The RCI-display in Figure 7.3 shows components delivered based on this task model. Note that
the first component in the RCI-display has exactly the same signature, shown in the message
buffer, as the one extracted from the editor, and can be reused without further modification.
7.2.3 Updating User Models
In addition to modeling the programming task by creating concept queries and constraint
queries, the Listener agent is also responsible for the automatic updating of user models. More
details on this are available in Section 7.4.3.2.
7.3 Fetcher
The Fetcher agent performs the retrieval process based on the retrieval mechanisms introduced
in Section 6.1. When Listener passes a concept query, Fetcher computes, using both
the Okapi technique (Section 6.1.1.2) and the LSA technique (Section 6.1.1.3), the concept
similarity value from each component in the repository to the query, and returns those components whose similarity value ranks in the top 20. The number 20 is the threshold value used by
Fetcher to determine what components are to be regarded as relevant; it can be customized by
programmers.
The concept similarity value is determined by the following formula:
$\mathit{Similarity}_{concept} = w_1 \cdot \mathit{Sim}_{Okapi} + w_2 \cdot \mathit{Sim}_{LSA}$ (7.1)
where
$\mathit{Sim}_{Okapi}$ is computed according to Equation (6.8) as is used in the probabilistic
model-based Okapi system.
$\mathit{Sim}_{LSA}$ is computed according to LSA.
$w_1, w_2$ are the weights assigned to each model, and $w_1 + w_2 = 1.0$. Each of them
can be changed on the fly by programmers by issuing the command cb-setlsa-weight from Emacs.
In order to find the best retrieval mechanism, I have experimented with the following
three methods to compute the concept similarity value:
(1) Using the Okapi model only (i.e., $w_1 = 1.0$ and $w_2 = 0.0$)
(2) Averaging the similarity value computed by the Okapi model and the one computed by
LSA (i.e., $w_1 = 0.5$ and $w_2 = 0.5$)
(3) Using LSA only (i.e., $w_1 = 0.0$ and $w_2 = 1.0$).
The retrieval performance of method (1) consistently beat the other two methods. Therefore,
the default setting for the CodeBroker system is $w_1 = 1.0$ and $w_2 = 0.0$. For more details about
the evaluation of retrieval mechanisms, see Section 8.1.
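The weighted combination of the Okapi and LSA similarity values, together with the top-20 cutoff, can be sketched as follows. This is an illustrative Python sketch only: it assumes the two similarity values have already been computed on comparable scales, and the names concept_similarity and top_components are hypothetical. The default weights mirror the default setting reported above (Okapi only).

```python
def concept_similarity(okapi_sim, lsa_sim, w_okapi=1.0, w_lsa=0.0):
    """Weighted sum of the two similarity values; the weights must sum to 1.0."""
    if abs(w_okapi + w_lsa - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return w_okapi * okapi_sim + w_lsa * lsa_sim


def top_components(scores, threshold=20, w_okapi=1.0, w_lsa=0.0):
    """scores maps a component name to its (okapi_sim, lsa_sim) pair.

    Returns the names of the `threshold` best-ranked components.
    """
    ranked = sorted(
        scores,
        key=lambda name: concept_similarity(*scores[name], w_okapi, w_lsa),
        reverse=True,
    )
    return ranked[:threshold]
```

Calling top_components with w_okapi=0.5 and w_lsa=0.5 corresponds to the averaging method (2), and with w_okapi=0.0 and w_lsa=1.0 to method (3).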
When both the concept query and the constraint query are passed, the Fetcher agent computes the similarity value by combining both the concept similarity and constraint compatibility,
which is determined by the signature matching process, according to the following formula:
$\mathit{Similarity} = \mathit{ConceptSimilarity} \times w_c + \mathit{ConstraintCompatibility} \times w_s$ (7.2)
where $w_c + w_s = 1$, and the default values for them are each 0.5. Programmers can assign
different weights to $w_c$ and $w_s$ to reflect the importance they assign to the concept similarity
and constraint compatibility, respectively.
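As a small worked illustration of this combination, the Python sketch below computes the overall similarity from a concept similarity value and a constraint compatibility value. The function name and parameter names are hypothetical, and both inputs are assumed to lie on comparable scales; how the signature matching process scores compatibility is outside this sketch.

```python
def combined_similarity(concept_sim, constraint_compat,
                        w_concept=0.5, w_constraint=0.5):
    """Weighted sum of concept similarity and constraint compatibility."""
    if abs(w_concept + w_constraint - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return concept_sim * w_concept + constraint_compat * w_constraint
```

With the default weights, a component with concept similarity 0.6 and constraint compatibility 1.0 receives 0.6 x 0.5 + 1.0 x 0.5 = 0.8, which is how an exact signature match can lift a component above others with a similar description.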
7.4 Presenter
Retrieved components are shown to programmers by the Presenter agent in the RCI-display
in decreasing order of similarity value.
7.4.1 Layered Information Presentation
Information about components is presented to programmers in different layers of abstraction
due to the following two considerations. First, because the RCI-display has only a
limited size (otherwise, it would take up too much working space of programmers), presented
information should be condensed to accommodate as many components as possible for programmers to choose. Second, the evaluation of the usefulness of information by programmers
consists of two stages: information discernment, whereby they grossly determine whether the
component is relevant, and detailed evaluation, whereby they study the component thoroughly
(see Section 3.4.3.5). Therefore, information shown in the RCI-display should contain the essential information on components only, and more detailed information should be displayed to
programmers when they show interest in a particular component (Ye, 2001b).
The CodeBroker system presents information on components to programmers in three
layers. The first layer is the RCI-display in which each component is accompanied with its
rank of similarity, its similarity value, its name, and a short description (Figures 7.2 and 7.3).
The presentation of the second layer of information is triggered by the mouse movements of
programmers. Component names and short descriptions in the RCI-display are mouse-sensitive.
When the mouse cursor is moved over the component name, the signature of the component
is shown in the mini-buffer (see the last lines in Figures 7.2 and 7.3); and when the mouse
cursor is over the short description, terms contributing to the concept similarity between the
component and the concept query are shown in the mini-buffer (Figure 7.4) to reveal why this
component is retrieved and to help programmers refine their queries if necessary. The third layer
of information, the most complete description of a component, is shown in an external HTML
browser, such as Netscape Navigator. When the programmer left-clicks on the component, the
full Javadoc documentation for the component is displayed in the browser. The HTML tag
extracted at the time of indexing is used so that the browser can display the exact place of the
component description.
Figure 7.4: Presenting more information triggered by mouse movement
The mini-buffer shows the keywords (terms) that contribute most to the concept
similarity between the first component and the reuse query, which is not shown
here.
7.4.2 Larger Context-Sensitive Presentation
The task model, created by the Listener agent from doc comments and signatures, describes
the immediate programming task, namely, the module that the programmer is going to
develop. However, programmers often do not give a complete description about their tasks in
doc comments. Furthermore, a module is only a part of the whole development task, and the
functionality of this module is deeply connected with other modules that have been developed
so far. As mentioned in Section 5.2.3, if the component repository system knows what the
whole development task is and the larger context under which the current module development
is conducted, it can provide more appropriate information on reusable components.
CodeBroker captures this larger context in a discourse model (Section 5.2.3.2) that represents the previous interactions between the programmer and the system in one development
session. The discourse model is used by Presenter as a filter to remove components in which
the programmer is not interested in the current development session, although they are retrieved
by Fetcher based on incomplete task models.
Java component repositories are organized hierarchically according to packages and
classes, and packages and classes are often designed for particular application domains. For
most programming tasks, only a part of the repository is involved. CodeBroker uses negative
discourse models to capture what part of the repository is not of interest to programmers because
discourse models are incrementally evolved by programmers during their interactions with the
CodeBroker system, and in many cases it takes less effort for programmers to identify apparently
irrelevant components. Section 7.5 explains in detail how the discourse model is incrementally
augmented by programmers.
A discourse model in CodeBroker is in the format of a Lisp association list (Figure 7.5).
It specifies packages or classes in which the programmer has no interest for the current development session. Before components retrieved by Fetcher are delivered to programmers, Presenter
compares each component against the discourse model, and if the component belongs to a class
or a package in the discourse model, it is removed.
Figure 7.5: An example discourse model
A discourse model is a Lisp list of items with the format: (package-name
(class-name (method-name))). Empty class-name or method-name fields indicate that the whole package or the whole class should not be delivered in this session.
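The filtering rule can be sketched in Python with nested dictionaries standing in for the Lisp association list. The representation and the names is_filtered and apply_discourse_model are assumptions made for illustration: an empty dictionary or set plays the role of an empty class-name or method-name field.

```python
def is_filtered(component, discourse_model):
    """component is a (package, class, method) triple.

    discourse_model maps package -> {class -> set of methods}; an empty
    inner dict excludes the whole package, an empty set the whole class.
    """
    package, cls, method = component
    if package not in discourse_model:
        return False
    classes = discourse_model[package]
    if not classes:            # whole package is of no interest
        return True
    if cls not in classes:
        return False
    methods = classes[cls]
    return not methods or method in methods


def apply_discourse_model(retrieved, discourse_model):
    """Keep the retrieved components, in order, that are not filtered out."""
    return [c for c in retrieved if not is_filtered(c, discourse_model)]
```

Because the filter runs after retrieval, the ranking produced by Fetcher is preserved for the components that remain.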
Discourse models also reduce the delivery of irrelevant components caused by polysemy,
a difficult problem for any information retrieval system, by limiting searching domains because polysemous words often have different meanings in totally different domains. For example, if the programming task is to shuffle a deck of cards, the programmer may use the
word “card” in doc comments. That would make the system deliver components from the class
java.awt.CardLayout, a GUI (Graphical User Interface) class in which “card” means a
graphical element. If the current development project does not involve interface building, this
whole class is irrelevant. The programmer can add the class (java.awt.CardLayout) or
even the whole package (java.awt) to the discourse model to prevent components belonging
to it from being delivered in this development session.
7.4.3 Personalized Component Presentation
The goal of active delivery in CodeBroker is to inform programmers of those components
that fall into L3 (reuse-by-anticipation) and the area of (L4 - L3) (information islands)
in Figure 4.1 (Section 4.1.2). Delivery of components from L2 (reuse-by-recall) and especially
from L1 (reuse-by-memory) might be of little use, with the risk of making the unknown, really needed components less salient. Therefore, the system needs to know what components
the programmer already knows. CodeBroker uses user models (Section 5.3.1) to represent programmers’ knowledge about the component repository to ensure user-specific delivery of components. User models in CodeBroker are both adaptable and adaptive (Thomas, 1996; Fischer
& Ye, 2001).
User models in CodeBroker contain a list of components known to the programmer,
namely, those components from L1 and L2. An example user model is shown in Figure 7.6.
Each item in the list is a package, a class, or a method. Each component retrieved from the
component repository is looked up in the user model before it is delivered. If a method component matches a method in the user model, and the user model indicates the programmer has used
it more than three times (this number is adjustable by the programmer), the system assumes the
programmer knows it already and removes it from the delivery. If the method has no use time,
the method was added by the programmer, who has indicated that he or she knows it
very well and does not want it delivered. If the class of the method (which has no method list in
the user model), or the package of the method (which has no class list) is included in the user
model, the method is removed as well.
Figure 7.6: An example user model
A user model is a Lisp list of items with the format: (package-name (class-name (method-name use-time use-time ...))). When the use of a
component is detected by the system, it is added to the list with the current time as
the “use time.” If the component is added by the user, there is no use time. As with
discourse models, empty class-name or method-name areas mean the whole
package or class is included.
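The lookup rule for user models can be sketched in Python as follows. Nested dictionaries stand in for the Lisp list, use times are kept as a list per method, and the name is_known is hypothetical; the default of three uses mirrors the adjustable number mentioned in the text.

```python
def is_known(component, user_model, min_uses=3):
    """component is a (package, class, method) triple.

    user_model maps package -> {class -> {method -> list of use times}}.
    An empty inner dict means the whole package (or class) is known.
    """
    package, cls, method = component
    if package not in user_model:
        return False
    classes = user_model[package]
    if not classes:                 # whole package added by the programmer
        return True
    if cls not in classes:
        return False
    methods = classes[cls]
    if not methods:                 # whole class added by the programmer
        return True
    if method not in methods:
        return False
    use_times = methods[method]
    # No use time: added explicitly by the programmer.
    # Otherwise the method must have been used more than min_uses times.
    return len(use_times) == 0 or len(use_times) > min_uses
```

A method used only once or twice is still delivered, which reflects the assumption that occasional use does not yet mean the programmer knows the component well.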
7.4.3.1 Adaptable User Models
Programmers can explicitly update their user models through interactions with CodeBroker. If they find a known component is delivered, they can invoke the retrieval-by-reformulation
interface (see Section 7.5 for more details) to tell the system that they know the component already.
7.4.3.2 Adaptive User Models
Due to the large volume of components and the constantly evolving nature of repositories, it is a time-consuming task for programmers to maintain their user models. To reduce
the difficulty of maintaining user models, user models in CodeBroker are also adaptive. As
mentioned in Section 7.2.3, the Listener agent continuously monitors programmers’ input in
the programming editor. In addition to modeling the programming task, Listener detects what
components are used by a programmer and updates user models through the following three
heuristic steps.
Figure 7.7: An illustrative program for adaptive user modeling
This program is excerpted from a user experiment and is slightly modified. The line
number is added to make the explanation easier.
Step 1: Extracting Method Names
A method invocation in a Java program is followed by a pair of parentheses between
which parameters are passed. When a left parenthesis ( is entered in the editor, Listener scans backward to extract the identifier preceding the left parenthesis. For example,
in Figure 7.7, when the left parenthesis (where the cursor is placed)1 is entered, Listener extracts the identifier addElement. After that, Listener scans back further to
determine if this identifier is a legal method name using the following rules.
(1) If the identifier is a Java keyword, such as the for in line 8, it is not a method
name.
(2) If the identifier follows another word or a right square bracket, it is not a method
invocation either; instead, it is the name of a new method developed by the programmer, such as the findDuplicates in line 5.
1 The extraction of the used component starts immediately after the left parenthesis is entered, not after the whole
statement is entered, as shown in the figure. The whole statement is included in the figure simply to show what a
method invocation looks like.
(3) If the identifier follows a dot (.), it is a method invocation, and the identifier is a
legal method name. Listener scans further back to extract all characters preceding
the dot until a white space is met. If these characters constitute a legal class name
in Java, the method is a class method instead of an object method; otherwise,
these characters are recognized by Listener as a variable name which will be
used to find what class the method belongs to (described in Step 2).
(4) If the identifier follows termination characters of a Java statement, such as a semicolon (;), a left brace ({), or a right parenthesis, it is a class method name.
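The four rules of Step 1 can be sketched as a small classifier over the text that precedes the left parenthesis. This Python sketch is illustrative only: classify_identifier and the result labels are hypothetical, the keyword set is deliberately partial, and the check of whether a qualifier is a legal class name (rule 3) is left to a later step.

```python
import re

# A few Java keywords that can precede "("; the full keyword list is omitted.
KEYWORDS = {"if", "for", "while", "switch", "catch", "synchronized"}
IDENTIFIER_AT_END = re.compile(r"([A-Za-z_$][\w$]*)\s*$")


def classify_identifier(text_before_paren):
    """Return (kind, identifier, qualifier) for the identifier before '('."""
    match = IDENTIFIER_AT_END.search(text_before_paren)
    if match is None:
        return ("none", None, None)
    name = match.group(1)
    if name in KEYWORDS:                                   # rule (1)
        return ("keyword", name, None)
    before = text_before_paren[:match.start()].rstrip()
    last = before[-1:]
    if last == ".":                                        # rule (3)
        qualifier = re.search(r"(\S*)\.$", before).group(1)
        return ("invocation", name, qualifier)
    if last in (";", "{", ")"):                            # rule (4)
        return ("class-method", name, None)
    if last == "]" or last.isalnum() or last in ("_", "$"):  # rule (2)
        return ("definition", name, None)
    return ("unknown", name, None)
```

For an invocation, the returned qualifier is either a class name or a variable name; resolving a variable name to its class is the subject of Step 2.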
Step 2: Finding the Class of an Object Method
The class name of a variable can be extracted from the variable declaration statement.
A variable declaration statement is recognized by Listener based on the following BNF
syntax of Java (Gosling et al., 1996):
VariableDeclarationStatement := LocalVariableDeclaration;
LocalVariableDeclaration := final_opt Type VariableDeclarators
VariableDeclarators := VariableDeclarator |
VariableDeclarators, VariableDeclarator
VariableDeclarator := VariableDeclaratorId |
VariableDeclaratorId = VariableInitializer
VariableDeclaratorId := Identifier | VariableDeclaratorId [ ]
VariableInitializer := Expression | ArrayInitializer
Type := PrimitiveType | ReferenceType
ReferenceType := TypeName | Type []
TypeName := Identifier | PackageOrTypeName.Identifier
PackageOrTypeName := Identifier | PackageOrTypeName.Identifier
Each time a new variable is declared by a programmer, the Listener stores the variable
name and its class name in an association list. When an object method name and its
variable name are extracted, Listener looks up the variable-class association list to find
to which class the method belongs. For example, in Figure 7.7, the variable name for
the method addElement (line 9) is tmpVec, which is declared as a Vector (line
6).
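A rough Python sketch of the variable-class association list follows. It approximates the grammar above with a regular expression instead of a parser, handles only local variable declarations, and records the element type for array declarations; record_declaration and the dictionary representation are assumptions made for illustration.

```python
import re

# Leading words that cannot be the type of a local variable declaration.
NOT_A_TYPE = {"return", "throw", "new", "import", "package", "else", "case",
              "public", "private", "protected", "static"}

# [final] Type[[]...] declarators ;
DECLARATION = re.compile(
    r"^\s*(?:final\s+)?(?P<type>[A-Za-z_$][\w$.]*)(?:\s*\[\s*\])*"
    r"\s+(?P<rest>[A-Za-z_$].*);\s*$"
)


def record_declaration(statement, variable_classes):
    """Add 'variable -> class name' pairs found in one statement."""
    match = DECLARATION.match(statement)
    if match is None or match.group("type") in NOT_A_TYPE:
        return
    # Drop parenthesized initializer arguments so their commas
    # are not mistaken for declarator separators.
    rest = re.sub(r"\([^()]*\)", "", match.group("rest"))
    for declarator in rest.split(","):
        name = re.match(r"\s*([A-Za-z_$][\w$]*)", declarator)
        if name:
            variable_classes[name.group(1)] = match.group("type")
```

Looking up variable_classes["tmpVec"] after processing a declaration such as Vector tmpVec = new Vector(); then yields the class to which addElement belongs.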
Step 3: Finding the Package of a Class
Because not all class names in a Java program include their package names, and class
names are not unique, Listener needs to find to which package the class belongs. Listener first finds all packages that include the class from the list of indexed components,
created by CodeIndexer at the time of indexing. If only one package is found, it is assumed to be the package of the class. If several packages are found, Listener will pick
the package imported by the programmer in the package import statements (lines 1 and
2 in Figure 7.7). Whenever a package import statement is entered, Listener recognizes
it based on its BNF syntax shown below:
ImportDeclaration := import TypeName; |
import PackageOrTypeName.*;
and creates a list of imported packages and classes. If the package of a class is unique
in the imported package list, then the imported package becomes the package of the
class; otherwise, the programmer has probably made a mistake,2 and the extracted
method is ignored.
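The package resolution of Step 3 can be sketched in Python as shown below. The sketch assumes that the list of indexed components is available as a mapping from class name to the set of packages containing it, and that import statements have already been reduced to strings such as java.util.* or java.util.Vector; resolve_package is a hypothetical name.

```python
def resolve_package(class_name, indexed_packages, imports):
    """Return the package of class_name, or None if it cannot be decided.

    indexed_packages maps a class name to the set of packages that contain it.
    imports lists the imported names, e.g. 'java.util.*' or 'java.util.Vector'.
    """
    candidates = indexed_packages.get(class_name, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Several packages contain the class: keep those the programmer imported.
    imported = set()
    for imported_name in imports:
        package, _, last = imported_name.rpartition(".")
        if last in ("*", class_name) and package in candidates:
            imported.add(package)
    # Unique among the imports: accept it; otherwise ignore the method.
    return imported.pop() if len(imported) == 1 else None
```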
To make them easier to understand, the three steps are described here in the reverse order of
their execution. Listener creates the list of imported packages first, followed by creating the
variable-class list, followed by extracting method names. When Listener successfully extracts
the method name and determines its class name and package name, it adds the component,
including its class and package, with the current time as use time, to the user model. Listener
adds only methods to the user model; it does not add a class or a package because the use of a
class or a package does not mean that the developer knows the whole class or package.
2 This mistake will cause a compiler error. As an extension, CodeBroker could point out this error so that
programmers can correct it before they submit the program to the compiler.
7.4.3.3 Initializing User Models
Initial user models are created by analyzing the Java programs that programmers have
written so far. CodeBroker analyzes Java programs to extract each method used in the same
way as adaptive user modeling, except that it is a batch process.
7.5 The Retrieval-by-Reformulation Mechanism
To complement the incompleteness of reuse queries, CodeBroker supports two forms of
retrieval-by-reformulation (Section 3.4.3.4): direct manipulation and query refinement. After
examining the components initially delivered by CodeBroker, programmers can either refine
the query to improve its completeness and preciseness or directly manipulate the delivered
components by removing apparently irrelevant ones.
7.5.1 Direct Manipulation
Direct manipulation of the delivered components serves two purposes: to facilitate the
easy choice of components and to augment the discourse model or the user model. Each component in the RCI-display is associated with a floating menu, the Skip Components Menu
(Figure 7.8), which pops up when the component name is right-clicked. The Skip Components
Menu allows programmers to remove those components that are apparently not related to their
current development task so that they can find needed information more easily. The first item of
the menu is the method component itself; the second, its class; and the third, its package. If
programmers want to remove the method or all of the components in the class or the package
from the RCI-display, they can choose the appropriate item. Each item has three choices: This
Buffer Only, This Session Only, and All Sessions.
Figure 7.8: The Skip Components Menu
When the command This Buffer Only is chosen, the corresponding components
are removed from the RCI-display. When the command This Session Only is chosen, the
components are not only removed from the RCI-display, they are also added to the discourse
model and will not be delivered later in this development session. The discourse model is empty
when a development session starts, and it grows incrementally as programmers
interact with the system. When the command All Sessions is chosen, the components are
removed from the current RCI-display and are added to the user model. Components added to
user models through the Skip Components Menu do not have the use time field (see the
last line in Figure 7.6).
With this design, the system can obtain information to evolve discourse models and
user models without adding too much extra work for programmers, who also gain the immediate benefit because the choice of needed components becomes easier by removing those
apparently irrelevant components. For example, in Figure 7.9(a), in response to the doc comment, CodeBroker delivers some components (No. 1 through No. 4) belonging to the class
java.awt.CardLayout (a GUI class) due to the term “card.” However, the current task is
not related to the class java.awt.CardLayout, so the programmer can remove it through
the direct manipulation interface. This manipulation brings the needed component randomShuffle, obscured previously, to the salient fourth place (Figure 7.9(b)). The fact that the programmer is not interested in the class java.awt.CardLayout can be added to the discourse
model at the same time if the programmer chooses the This Session Only command, and
then no components from the class java.awt.CardLayout will be delivered later in this
development session, even if the programmer uses the word “card” in doc comments again,
which is quite possible because the programmer is developing programs about card shuffling.
Figure 7.9: The Direct Manipulation interface
Parts (a) and (b) show the delivered components before and after direct manipulation, respectively.
7.5.2 Query Refinement
Query refinement is invoked by choosing the Query Refinement command in the
same pop-up menu, or directly typing it in as an Emacs command. A buffer (Figure 7.10) will
appear for programmers to start another round of component locating after having refined the
automatically extracted reuse queries. Programmers can refine the concept query by choosing
more appropriate terms, or they can modify the constraint query to make it less restrictive or
more restrictive depending on the situation. To narrow the searching range of relevant components, the query refinement interface also provides two additional fields:
Filtered Components: for specifying classes or packages that are not of interest, and
Interested Components: for instructing the system to return components from the specified classes or packages only.
Component repository systems could provide a mechanism to let programmers specify
either of these fields before the initial use of the system. However, programmers who do not
know the structure of the repository well enough may not be able to specify these two fields. Even
a system-guided dialog mechanism to solicit user specifications, as explored in the KID system (Nakakoji, 1993), is not suitable for repository systems because component repositories are
often very large and it would take a long time to get a meaningful specification. The CodeBroker
system does not assume that programmers know the repository structure well enough, and it
solicits user input only after its delivered components have acquainted programmers with the
structure of the component repository, especially the structure of the part of the repository that
might be relevant to the task at hand.
Figure 7.10: The Query Refinement interface
The components in the RCI-display are retrieved after the refined query is submitted. Any one of the four visible components can be reused in the current situation,
depending on how the programmer wants to restructure his or her data types.
7.5.3 Comparing Retrieval-by-Reformulation and Relevance Feedback
The retrieval-by-reformulation mechanism in CodeBroker is a more comprehensive approach
to improving the retrieval performance than the relevance feedback mechanism used in
many information retrieval systems (Buckley et al., 1994). Through the adjustment of terms
used in a query by query expansion or other techniques, relevance feedback of information
retrieval systems focuses mainly on the improvement of the retrieval process itself. Instead,
the focus of retrieval-by-reformulation is to improve the relevance of information to the working context of programmers, not to the query per se. The direct manipulation tries to establish a
shared understanding of the context between the component repository system and the programmer. It uses programmers’ previous interactions with the system as filters for later deliveries.
Although it does not affect what the Fetcher agent returns, it does modify what gets shown. The
system also takes advantage of the fact that software components are organized into a hierarchy
(packages, classes, and methods) according to their application domains to let programmers
limit the retrieval range to their interests.
7.6
Summary of CodeBroker
Figure 7.11 summarizes the role of each agent and retrieval-by-reformulation in Code-
Broker. In Figure 7.11, T represents components that can potentially be reused in the current
programming task. From these components, programmers need to choose the most appropriate one. D1 and D2 represent delivered reusable components before and after user models are
considered, respectively. User models can reduce the work of choosing by removing known
components, because if the most appropriate component were known, it would have already
been reused. Ideally, active component repository systems should present to programmers the
T set with known components (black circles) removed. However, due to the incompleteness
of reuse queries, irrelevant components and missed components are unavoidable. Therefore,
retrieval-by-reformulation is needed to allow programmers to move D2 toward T incrementally.
Figure 7.11: Summary of CodeBroker
Direct manipulation of retrieved results lets programmers remove irrelevant components (circles with wavy lines) quickly, and query refinement lets programmers incorporate missed components (circles with no shade) into D2 in subsequent locating efforts with incrementally developed reuse queries.
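The relationships among T, D1, and D2 can be illustrated with plain set operations. The following is a minimal sketch only: the component names, the contents of each set, and the helper `reformulate` are hypothetical and are not taken from CodeBroker or from the experiments.

```python
# Hypothetical component names; T, D1, and D2 mirror the sets in Figure 7.11.
T = {"Calendar.roll", "Calendar.add", "Date.getDay"}      # reusable for the current task
known = {"Date.getDay", "String.length"}                  # already in the user model
D1 = {"Calendar.add", "Date.getDay", "String.length", "Random.nextInt"}

# User models remove known components from the delivery.
D2 = D1 - known

irrelevant = D2 - T            # delivered, but of no use for the task
missed = (T - known) - D2      # useful and unknown, but not delivered


def reformulate(delivered, removed_by_user, found_by_refined_query):
    """One retrieval-by-reformulation step: drop removed items, add newly found ones."""
    return (delivered - removed_by_user) | found_by_refined_query


D2_next = reformulate(D2, irrelevant, missed)
print(sorted(D2_next))
```

In this idealized example a single step moves D2 onto T with the known components removed; in practice several steps may be needed.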
Chapter 8
Evaluations of CodeBroker
This chapter presents the results of two types of evaluations conducted on the CodeBroker system. The first evaluation compares the retrieval effectiveness of the Okapi-based retrieval
mechanism and the LSA-based retrieval mechanism (Section 6.1.1). The second evaluation
empirically studies how well CodeBroker supports reuse-within-development (Section 4.2.2)
through experiments with programmers.
The purpose of the empirical studies of CodeBroker was not to analyze the quality of
programs produced by programmers and the productivity of programming, but to observe and
analyze how the system promotes reuse during programming.1 The empirical studies attempted
to answer the following questions:
• Are programmers able to reuse unknown software components with the support of CodeBroker?
• Does CodeBroker encourage programmers to explore the possibility of reuse?
• Is the task modeling based on doc comments and signatures good enough to find components relevant to the task at hand?
• Do discourse models improve the relevance of delivered components?
• Do user models contribute to the personalization of component delivery?
1
The fact that reuse improves the quality and productivity has been well studied by other researchers and has been discussed in Chapter 3.
8.1
Evaluating the Retrieval Mechanisms
Retrieval mechanisms play an important role in locating reusable components that match
reuse queries. This section presents the evaluation and comparison of two retrieval mechanisms:
the Okapi mechanism (Section 6.1.1.2) based on probabilistic models, and the LSA mechanism
(Section 6.1.1.3), used by the Fetcher agent of CodeBroker for the retrieval of conceptually
similar components.
8.1.1
The Concept of Recall and Precision
Conventionally, information retrieval systems are measured by recall and precision. Recall indicates the ability of the system to present all relevant documents, and precision indicates the ability of the system to present only the relevant documents. They can be computed by the following equations:
\[
\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents in the collection}} \tag{8.1}
\]
\[
\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} \tag{8.2}
\]
Recall and precision are not absolute, objective measurements of an information retrieval
system because
(1) the definition of relevance between documents and queries is subjective, and
(2) even if the relevance of a document is unanimously agreed, it may not be of interest to
one particular user if that user knows the document.
Nonetheless, when the relevance of documents is agreed, these two measures can be used to
compare the performance of two retrieval systems.
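As a small worked example of the two measures, the sketch below computes recall and precision for a single query. The function name and the sample document identifiers are illustrative assumptions, not part of CodeBroker.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: documents returned by the system
    relevant:  documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Four documents retrieved, three relevant overall, two of them retrieved.
r, p = recall_precision(["a", "b", "c", "d"], ["a", "c", "e"])
print(f"recall={r:.2f} precision={p:.2f}")
```

Here two of the three relevant documents are retrieved (recall 2/3) and two of the four retrieved documents are relevant (precision 1/2).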
8.1.2
Recall and Precision in CodeBroker
The purpose of computing recall and precision of the CodeBroker system is to compare
the two retrieval mechanisms to find the better one. The data should not be taken as an absolutely objective measurement of the effectiveness of retrieval mechanisms implemented in the
CodeBroker system because sample queries were not random enough.
8.1.2.1
Reuse Queries and Relevant Components
In total, 19 reuse queries were selected. Among them, 10 queries were created by me,
4 queries were chosen from questions asked in newsgroups related to Java programming with
phrases such as “How do I”, “Can someone tell me how to” removed, and 5 queries were
extracted from the evaluation experiments (Section 8.2) without any words changed.
The relevance of components was determined as follows:
(1) For queries created by me, I chose those components that I thought could be used to
implement the task.
(2) For queries from newsgroups, those components suggested by responders were considered relevant, in addition to components that I chose from the JGL library, which not all Java programmers use.
(3) For queries from experiments, only those components that were used by the programmers were considered relevant.
The sets of relevant components determined by the above criteria are by no means exhaustive. However, for the purpose of comparison, they can provide sufficient evidence.2
2
Queries and relevant components are listed in Appendix A.
8.1.2.2
Computed Recall and Precision
Two retrieval mechanisms, Okapi and LSA, are supported by CodeBroker to locate components that are conceptually similar to queries extracted from doc comments. The two mechanisms can be combined by assigning different weights to the similarity value computed by each. I tried the system by using LSA only, Okapi only, and the average of both (Mixed). The average precision values at different recall values are shown in Table 8.1. Figure 8.1 shows the recall-precision curves, which are constructed by plotting the precision values against the recall values. Superimposing the recall-precision curves of different retrieval mechanisms in the same graph makes it possible to determine which retrieval mechanism is superior. In general, the curve closest to the upper right-hand corner indicates the best performance (Salton & McGill, 1983).
          Average Precision
Recall    LSA       Mixed     Okapi
0%        35.77%    32.80%    45.82%
10%       31.86%    32.80%    45.82%
20%       30.89%    32.75%    45.82%
30%       25.62%    28.63%    41.20%
40%       20.62%    24.66%    41.01%
50%       20.44%    24.66%    40.74%
60%       13.86%    21.95%    37.46%
70%       13.82%    20.70%    37.46%
80%       13.82%    20.63%    32.71%
90%       12.32%    19.90%    32.19%
100%      12.32%    17.86%    29.43%
Table 8.1: Average precision and recall values for LSA, Mixed (average of LSA and Okapi), and Okapi
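How the precision values at each recall level were obtained is not spelled out here. A common convention, assumed in the sketch below, is interpolated precision at the eleven standard recall levels (0%, 10%, ..., 100%), averaged over all queries. The function names and the sample ranking are hypothetical.

```python
def interpolated_precision(ranked, relevant, levels=None):
    """Interpolated precision of one ranked result list at the given recall levels.

    The precision at recall level r is taken as the highest precision observed
    at any rank whose recall is at least r (an assumed, conventional choice).
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]
    relevant = set(relevant)
    points = []                                   # (recall, precision) at each relevant hit
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def average_curves(curves):
    """Average several per-query curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


# One query: relevant components "a" and "b" appear at ranks 1 and 3.
curve = interpolated_precision(["a", "x", "b", "y"], {"a", "b"})
print([round(v, 2) for v in curve])
```

Averaging such curves over all 19 queries with `average_curves` would give one column of a table like Table 8.1, under the stated assumption.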
8.1.2.3
Conclusions
It is easy to see from Table 8.1 and Figure 8.1 that Okapi has better retrieval performance than LSA and the mixed one (the average of both). The result is somewhat unexpected because other researchers have reported that LSA has better performance than other retrieval methods (Deerwester et al., 1990). The unexpectedly low performance of LSA might be caused by insufficient training documents used in CodeBroker because LSA performance is largely dependent on the quality and volume of training documents.
Figure 8.1: Recall-precision curves
Because the evaluation shows that Okapi has the best performance, the default setting of CodeBroker is to use Okapi only. Okapi is also favored over LSA because in Okapi, the system can find the terms that contribute most to the relevance between components and queries, and those terms can be shown to programmers when they move the mouse cursor over the descriptions of components in the RCI-display, to help them refine their queries (Figure 7.4, Section 7.4.1). In LSA, in contrast, the reason a component is determined to be relevant is obscure because similarity is computed in the semantic space rather than from individual terms.
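The point about term contributions can be made concrete with a sketch of a BM25-style score, which is a sum of per-term contributions. This is an illustration under assumptions: it uses a widely used BM25 variant with conventional parameters (k1 = 1.2, b = 0.75) and may differ from the exact Okapi formulation given in Section 6.1.1.2; the tiny corpus of component descriptions is invented.

```python
import math
from collections import Counter


def bm25_term_contributions(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Return {term: contribution} of each query term to the BM25 score of one document."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    contributions = {}
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)      # documents containing the term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        contributions[term] = idf * tf[term] * (k1 + 1) / norm
    return contributions


# Invented component descriptions, already tokenized.
corpus = [
    ["shuffle", "elements", "randomly"],
    ["sort", "elements", "ascending"],
    ["read", "file", "contents"],
]
contrib = bm25_term_contributions(["shuffle", "elements"], corpus[0], corpus)
top_term = max(contrib, key=contrib.get)
print(top_term, sum(contrib.values()))
```

Because the total score is the sum of the entries in `contrib`, the highest-contributing terms can be reported directly to the programmer, which has no direct counterpart in a latent semantic space.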
8.2
Empirical Evaluations of the CodeBroker System
To understand the effectiveness of the CodeBroker system in supporting reuse-within-development, formal evaluation experiments were conducted. The structure of the experiments is described in this section, and findings and conclusions are presented in the next four sections.
8.2.1
Subjects of Experiments
Subjects were recruited from undergraduate and graduate students from the Computer
Science Department. As mentioned in Chapter 2, programming involves a wide range of knowledge. Because the design goal of CodeBroker is to provide knowledge about reusable components to programmers, to minimize other factors that contribute to the difficulty of programming
in general, only students who already had extensive programming knowledge and experience
were recruited as subjects. Because CodeBroker is developed as an add-on to the existing programming environment, Emacs in Unix, a basic working knowledge of Emacs and Unix was
also required so that subjects could easily learn the operations of the system and experiments
could be focused on the support provided by the system.
Five subjects voluntarily participated in the evaluation experiments. All but one programmer had extensive knowledge in other programming languages, such as C and C++. Two
had worked as professional programmers. Three were regular contributors to several Open
Source projects. Their expertise in Java programming varied, ranging from medium to expert
level. All of them knew the syntax of Java very well; the difference in their expertise came from
the range of reusable components (classes and methods in API libraries) they knew. Table 8.2
summarizes their background knowledge about programming in general and Java in particular.
In that table, small (abbreviated as S) projects refer to projects similar to semester projects,
requiring 1 or 2 man-months; medium projects (abbreviated as M) refer to projects requiring
3 to 5 man-months; large projects (abbreviated as L) refer to projects requiring more than 6
man-months.
8.2.2
Structure of Experiments
Subjects were asked to implement two or three programming tasks with the CodeBroker system. Days before the experiments, CodeBroker created an initial user model (see Section 7.4.3 for the method, and Figure 7.6 for an example) for each subject by scanning programs the subject had written recently. Because many of the programs the subjects had written were for companies and thus were not available, no user models were complete. Nonetheless, the number and range of components included in the user models were consistent with the subjects’ self-evaluations of Java expertise.
Subject                                                              S1                        S2                        S3           S4           S5
Years of general programming                                         3 or 4                    5 or 6                    8            10+          10+
Programming experience in general (measured in number of projects)   3S, 1L                    10S                       7M, 1L       10+SM, 2L    10+L
Current major programming language                                   C++                       Java                      Java         Java         Java
Years of Java programming                                            10 months                 4                         4            7            5
Self-evaluation of Java expertise (1: Beginner - 10: Expert)         4                         7                         7 or 8       10           7
Recent frequency of programming in Java                              Not active for 3 months.  Not active for 3 months.  Every week   Every day    Not active for months
Table 8.2: Programming knowledge and expertise of subjects
After their user models had been analyzed, the subjects were assigned tasks whose implementation involved components they probably did not know well. At the beginning of the experiments, after the subjects had signed the Informed Consent Form for participating in the experiments, the main functionality of the CodeBroker system was briefly introduced with a running example. This took about 5 minutes. Before the implementation of each task, programmers were asked to describe briefly how they would implement the task, and after each task had been finished, simple questions such as “Did you know this component before?” and “Why did you choose this component?” were asked regarding their programming activities. At the end of the experiments, a post-experiment interview3 was conducted to capture the subjects’ background knowledge of programming and their subjective evaluation of the CodeBroker system based on their use.
3
Questions asked in the interview are listed in Appendix B.
Programmers were told to do programming in their normal way but to take advantage
of the support provided by CodeBroker. They could use books, the Java API Documentation
Browser, and all other support as they usually did. Two subjects actually brought and consulted
their favorite “Java in a Nutshell” (Flanagan, 1997). As an observer, I occasionally answered
their questions about the operation of the CodeBroker system.
The CodeBroker system used the following default settings in the experiments:
(1) It adopted the Okapi retrieval mechanism and the signature-matching mechanism, with
each assigned the weight of 0.5.
(2) A component was decided to be known to the programmer if the user model indicated
the programmer had used it three times.
(3) In the first four experiments with the first two subjects, the system delivered 14 components in the RCI-display because the experiments were conducted on a laptop with a
small monitor. In all other experiments that were conducted on a desktop with a large
monitor, the system delivered 20 components.
(4) The component repository contained 673 classes and 7,338 methods from both the
Core API library of Java 1.1.8 and JGL 3.0.
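The default settings above can be read as a small delivery pipeline: combine the two similarity scores, drop components the user model marks as known, and show the top-ranked remainder. The sketch below is an assumption-laden illustration, not CodeBroker's implementation: the linear combination, the order of filtering and ranking, the function name `deliver`, and all scores and usage counts are hypothetical.

```python
def deliver(okapi_scores, signature_scores, usage_counts,
            w_okapi=0.5, w_signature=0.5, known_threshold=3, display_size=20):
    """Return the component names to show, best first.

    okapi_scores / signature_scores: {component: similarity} from each mechanism
    usage_counts: {component: times used}, as recorded in a user model
    """
    names = set(okapi_scores) | set(signature_scores)
    combined = {
        name: w_okapi * okapi_scores.get(name, 0.0)
        + w_signature * signature_scores.get(name, 0.0)
        for name in names
    }
    # A component counts as known once it has been used known_threshold times.
    unknown = [n for n in combined if usage_counts.get(n, 0) < known_threshold]
    unknown.sort(key=lambda n: (-combined[n], n))     # ties broken by name
    return unknown[:display_size]


# Invented scores; display_size=14 corresponds to the laptop setting.
okapi = {"Array.randomShuffle": 0.8, "String.length": 0.9, "Calendar.roll": 0.2}
signature = {"Array.randomShuffle": 0.6, "String.length": 0.1}
usage = {"String.length": 5}
delivered = deliver(okapi, signature, usage, display_size=14)
print(delivered)
```

In this example `String.length` is filtered out because its usage count reaches the threshold of three, even though its Okapi score is the highest.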
8.2.3
Programming Tasks
Because subjects were volunteers, large and time-consuming tasks were not very suitable. The experiments used programming tasks similar to the typical assignments of a programming language course, which could be implemented with several methods in about 20 to 60 minutes. The following tasks were used in the experiments.
Task 1
You are asked to implement a program that selectively backs up files based on a list
that holds all files needed to be backed up. The list looks like:
/usr/java/private/important/letter1
/usr/joe/project/backup/getAllFiles.java
and the file name is passed as a parameter in the command line.
It requires that the back-up program retain the same hierarchical structure when the
files are backed up in another directory $BACKUPDIR, which is passed as the second
parameter in the command line; for example, getAllFiles.java should also be
found under the directory $BACKUPDIR/usr/joe/project/backup/.
Task 2
You are asked to write a program to simulate the process of card dealing. Each card is
represented by a number from 0 to 51. The program should produce a list of 52 cards,
as it would result from a human card dealer. Let us assume that if a person cuts a deck of
cards and shuffles it 7 times, the result is satisfactory.
Task 3
Traditionally, Chinese write numbers with a comma inserted after every fourth digit
from the right. For example, 1,000,000 is written as 100,0000. Please implement a
program that transforms the Chinese writing format (100,0000) to the western format
(1,000,000).
To simplify the programming task, you don’t need to read the input from the keyboard.
You can assume you can get the input anywhere you like, such as a static class variable,
a parameter of a method, or input from the command line.
Task 4
Jack has a long list of MP3 songs he has compiled. However, many of the songs are
repeated in the list. He wants to create a new list in which each song appears only
once. Assume each list has the following format TITLEa, TITLEb, TITLEc,
... where TITLEi is a string including letters only. Implement a method to create a
new list with no repetitions.
Assume the list is stored somewhere; for example, you can put it into a class variable.
Task 5
Please write a program that can calculate the day of the week. We know that today is
Jan. 19, 2001, Friday. Your program should be able to compute the day of the week
M years from today, or N months from today. Both M and N could be negative, which
means M years or N months before. Assume the convention to pass the data to your
program is:
Y 10 means 10 years from today, and
M -5 means 5 months ago.
Task 6
A processor needs to respond to a series of events. Each event is assigned a distinct
number. When the processor is busy, newly arrived events will be put into a waiting
list. When the processor finishes processing the previous event, it picks an event in the
waiting list. However, it picks the event with the largest number in the waiting list. You
are asked to implement a pair of operations: one to put a new event into the waiting
list, and the other to help the processor pick up the next event to be processed. (You
don’t need to be concerned with concurrency.)
All tasks could be implemented with different combinations of reusable components from the repository. If the subjects knew or found the right components, the implementation would be fairly easy; if they did not, they would have to use lower-level components or even basic statements. Therefore, those tasks allowed us to observe how the delivery of the system changed the programming process of subjects.
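The difference between the two situations can be sketched with Task 4 (removing repeated song titles). The experiments used Java; Python is used here only for brevity, and neither function is a solution written by a subject. Both function names are hypothetical.

```python
def dedupe_by_hand(titles):
    """Remove repetitions using only basic statements, keeping first occurrences."""
    result = []
    for title in titles:
        found = False
        for kept in result:
            if kept == title:
                found = True
                break
        if not found:
            result.append(title)
    return result


def dedupe_with_library(titles):
    """Same result with a library component: dict keys are unique and keep insertion order."""
    return list(dict.fromkeys(titles))


songs = "TITLEa, TITLEb, TITLEa, TITLEc, TITLEb".split(", ")
print(dedupe_with_library(songs))
```

Knowing that a suitable library component exists reduces the task to one line, which is the kind of difference the tasks were chosen to expose.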
8.2.4
Methods of Observation and Analysis
The CodeBroker system has an automatic log mechanism that logs the reuse queries extracted by Listener, the components retrieved by Fetcher, the components removed by Presenter, and both system-initiated and user-initiated changes to discourse models and user models.
All experiments, including interviews, were videotaped. Subjects were asked to think
aloud during the experiments. However, because thinking aloud may interfere with normal
programming practice, this was not stressed. Analysis of the system was based on the log
data, the video tapes, and transcribed interviews. The analysis was not concerned with
the quality and productivity of programming; instead, it examined how CodeBroker affects the
process of programming by encouraging programmers to reuse. Quantitative assessment was
based on log data, and qualitative evaluation was based on interviews and think-aloud protocols.
8.3
Findings about the Usage of CodeBroker
This section presents the findings about the usage of CodeBroker in the experiments.
After presenting the overall results, I discuss in detail the observed roles of active component
delivery, task models, discourse models, user models, and the retrieval-by-reformulation mechanism.
8.3.1
Overall Results
Table 8.3 summarizes the support the CodeBroker system provided to programmers during the experiments, based on automatically logged data.
Table 8.3: Overall results of evaluation experiments with programmers
(Columns: Subj.; Task; Total; Delivered, broken down into Unknown, Anticipated, and Vaguely known; Triggered.)
Numbers in each column are defined as follows.
Subj. The subject who participated in the experiment.
Task. The number of the task used in the experiment.
Total. The number of distinct method components used in the implemented program. If a
method was used more than once, it was still counted as one. Class components were
not directly counted because when they were used, some of their methods, including
constructors, must have been used somewhere in the program.
Delivered. The total number of components that programmers directly reused from the components delivered by the system. Those components are further broken down into three
categories based on subjects’ original knowledge about the components.
Unknown. Components whose existence was not expected (i.e., components from
the information islands (L4 - L3) in Figure 4.1).
Anticipated. Components subjects believed existed, but had never used before (i.e.,
components of L3 in Figure 4.1). Sometimes, they even guessed the right class
or the right package.
Vaguely known. Components subjects had used before but whose names they were not sure about
or remembered incorrectly (i.e., components of L2 in Figure 4.1).
Triggered. The number of components that were not delivered but whose reuse in the programs was triggered by the delivery. In some cases, when the subjects wanted to reuse a delivered component that needed other supplementary components, they had to find those components. Triggered components were not known by subjects before. For example, one subject wanted to use the delivered randomShuffle method that operates on Array. Because he did not know the Array class, he used the browsing mechanism to find it. Although triggered components were not directly delivered by the system, they would not have been reused without the support of the system.
Table 8.4 summarizes the responses from subjects when asked to rate the usefulness of
the system on a scale from 1 (totally useless) to 10 (extremely useful), and if they would use
the system as their daily programming environment. Although those evaluations are subjective,
they are indications of the subjects’ desire to use the system.
Subj.   Rate   Will you use CodeBroker as your daily programming environment?
S1      7      Yes.
S2      4      It is right on the threshold that maybe I would use it.
S3      8.5    Yes. It is not perfect, but it is really good and it is very helpful.
S4      7      Yes, but I have to get used to the system.
S5      8      Yes.
Table 8.4: Subjective evaluations of the CodeBroker system
As both the quantitative data in Table 8.3 and the subjective evaluations in Table 8.4 show, CodeBroker was quite effective in helping programmers locate and reuse components during the experiments.
8.3.2
Roles of Information Delivery Mechanism in Supporting Reuse
In the experiments, the information delivery mechanism of CodeBroker encouraged subjects to reuse in several ways.
8.3.2.1
Supporting the Reuse of Unknown Components
As Table 8.3 shows, in 7 out of the 12 experiments, the system delivered reused components that had not been known to subjects. In one experiment, the subject had not even known
the existence of the whole package. Without the delivery mechanism, the subjects would not
have been able to reuse them and would have created their own solutions instead, as two subjects
commented in the interviews:
“I would have never looked the roll function by myself, I would have done a
lot of stuff by hand. Just because it showed up in the list, I saw the Calendar
provided the roll feature that allowed me to do the task.”
“I did not know the isDigit thing. I would have wasted time to design that
thing.”
The delivery mechanism not only helped subjects reuse components directly from the deliveries, it also created a snowball effect that led them to reuse other unknown components that were not directly delivered but were needed to reuse those delivered components. Components in the libraries of object-oriented programming languages are often coupled through parameter passing or access to common class variables. Reusing one component often requires reusing other components tightly coupled with it. In the experiments, when those coupled components were not known, programmers used the deliveries of CodeBroker as the starting point and then followed the existing hyperlinks of the documentation system to learn and reuse them.
The delivery mechanism also created latent reuse opportunities. Sometimes, the delivered components were not immediately reused because, although they were related to the task
to some extent, they could not be directly reused right away for the immediate programming
task in which the subjects were engaged. However, as programming continued, subjects realized that something delivered before could now be reused. For example, in one experiment, the
subject was first concerned with finding how to read the contents of a file. Among the delivered
components was the isDirectory method, which could not be reused right away for the task
of “reading the contents of a file” but somehow caught the attention of the subject. Later, when
the subject moved to the task of “creating a new directory if it does not exist,” he thought of
something he had seen before, but he could not remember the name. So he asked if the system
had a mechanism allowing him to go back to previous deliveries. When told that it did not, he inserted a
temporary comment to find the isDirectory method.
8.3.2.2
Reducing the Cost of Locating Anticipated or Vaguely Known Components
In all experiments, 9 components that were reused from the deliveries of the CodeBroker system had been anticipated to some degree by the subjects (Table 8.3). In those cases, the subjects knew there was something in the Java library that could help them implement the task. Although some of them even knew the class names, they did not know the needed method names and were not sure whether all the needed functionality was supported by the class. Those components might have been reused by subjects without the support of delivery if they could have located them through browsing or querying quickly enough. However, CodeBroker made locating those components faster and easier. It is difficult to evaluate objectively how much the system reduced the cost of locating those anticipated components because (1) a fair comparison of the cost of locating the same component with and without the support of the CodeBroker system would require two programmers with the same knowledge about the repository, which determines how a programmer conducts the locating process, and no two such programmers can be found; and, more importantly, (2) programmers’ evaluation of the locating cost itself is subjective. Therefore, the conclusion that the delivery mechanism of CodeBroker reduced the cost of locating anticipated or vaguely known components was based on the subjects’ answers to the question “Did you think the system saved you time in locating this component?” (referring to a specific component anticipated or known by the subject):
“It beats browsing. Because the way that I normally would have done the
task, I would do a lot of browsing and then write the code alongside. So this
reduced the browsing and searching.”
“Yes. First, I did not have to start browsing and go through the packages, and
I did not have to go through the index of methods. I could just go to the short
list [RCI-display], found it and clicked it.”
“I thought there might be a parse method, but I also was not sure whether it
is called parse or something else. I also wasn’t sure if it was in the Format
class. Maybe it is in a different class like Integer or number or something
else. It’s helpful that I saw parse [in the RCI-display] and went through to
see that it was in the Format class. ”
“It seems to me the key benefit of this [CodeBroker] is that it gives you methods for every class, not like this one [the API Documentation Browser] that
you have to first find which class it is in then go to the class. Although it has index of methods, but it is hard to find here [the API Documentation Browser].”
Subjects also acknowledged that the reduced cost of locating components motivated them
more to explore the possibility of reuse. As one subject said:
“Having this system, I would try to explore more, I would spend more time to
see whether this thing exists or not.”
8.3.3
Effectiveness of Task Models
Task models in CodeBroker are used as queries to retrieve relevant components, and they are created from doc comments and signatures of modules. In the experiments, most doc comments written by subjects made the functionality of their programs easy for human readers to understand. At the same time, those doc comments served as good reuse queries, as evidenced by the number of successful deliveries shown in Table 8.3.
The more knowledge subjects had about the repository, the better their doc comments worked as queries for retrieving relevant components. One subject described why he wrote one particular comment as follows:
“I knew there should be a class called NumberFormat or DecimalFormat having the method format...That’s why I wrote the word ‘format’ because I knew it would catch those.”
As a result, he found what he expected from the deliveries of CodeBroker.
Probably because most subjects had a fairly good working knowledge of programming in general and Java in particular, they were able to describe, in comments, the functionality of programs in a way similar to that used to describe components in the repository.
Different subjects had different ways of writing comments. Some wrote very long and
elaborate comments to describe everything they wanted to do in the method or the class. Others
wrote very concise and short comments focused on the major task of the program. Because
descriptions of components in the repository are short and concise, the short and focused comments made the system deliver more task-relevant components.
Comments are essential for CodeBroker to deliver task-relevant components. Therefore,
in the interviews, subjects were asked if they wrote comments in their daily programming activities. Two subjects answered they always wrote comments before the implementation for most
classes and methods, one subject said he always wrote comments for classes but not always before the implementation, one subject said he mostly wrote comments for classes but not always
before the implementation, and one subject said he usually did not write comments. However,
two subjects indicated that they probably would write more comments before implementation
if they were going to use CodeBroker because they could benefit from the comments. Two
subjects changed their styles of comments within methods from C++-style (which begins with
“//” and continues until the end of the line) to the doc comment style in order to take advantage
of the system delivery. That was an unexpected use of the system,4 and showed that subjects
expected and valued the help provided by the delivery mechanism of the CodeBroker system.
4
The original design goal of the CodeBroker system was to deliver components based on doc comments preceding methods or classes, not on comments inside methods.
The signature matching mechanism of CodeBroker (Section 6.1.2) did not play much of a role in the experiments. In fact, only one subject tried once to look at the change of delivery when he finished the signature declaration of a method, but the system failed to improve the task-relevance of the delivery because there was no component in the repository that was both similar in concept and compatible in constraint with the task of the subject. In all
other experiments, subjects shifted their attention to the RCI-display immediately after they
had written the comments and started browsing. When they found components they needed,
they moved back to programming and did not pay any attention to the RCI-display until they
wrote the next doc comment. The original design goal of adopting the signature matching
mechanism in CodeBroker was to help programmers find components that can be reused to
replace the module under development. However, in the experiments, all subjects used the
system to look for components that could be reused as parts of the module implementation
instead of components to replace their intended implementation. The system was more effective
in delivering implementation parts than delivering replacement components.
The fact that subjects did not pay attention to the change of delivery caused by signature
definitions offers a clue about the boundary of the action-present (see Section 5.1.2)
period of programming. When programmers are writing comments, they are still at the stage
of planning, thinking of how to implement the program. At that stage, they are still willing
to explore alternative solutions. When they start to define the signature, they have already
committed to one chosen solution and have shifted into the stage of execution, at which they
are less inclined to explore alternatives.
8.3.4
Effectiveness of User Models
No strong and conclusive data were collected regarding the role of user models in CodeBroker.
User models removed some known components in five experiments (Table 8.5), and
in only two experiments (S1-T1 and S2-T5) were more than 8% of the components removed
because they were included in the subjects’ user models. The low number of components
removed by user models was probably due to the following two reasons:
(1) All subjects provided only a very small portion of their Java programs for CodeBroker
to create the initial user models. As a result, the user models did not sufficiently reflect
subjects’ knowledge about components.
Subj.-Task   Total No. of      No. of Comp.     No. of Comp. Added   No. of Comp. Added
             Comp. Retrieved   Removed by UM    to UM by User        to UM by System
S1-T1        168               15               0                    0
S1-T2        28                0                0                    0
S2-T3        140               5                0                    0
S2-T4        52                0                0                    0
S2-T5        160               14               2                    5
S3-T3        60                0                0                    6
S3-T5        20                1                0                    0
S3-T6        60                0                0                    0
S4-T3        80                0                0                    0
S4-T5        140               0                0                    0
S5-T3        100               1                0                    9
S5-T6        420               0                0                    0
SUM          1428              36               2                    20
Table 8.5: Experiment data regarding user models
Subj. is an abbreviation for Subject, Comp. for Components, and UM for User Model.
(2) In order to observe the role of component delivery, subjects were assigned tasks whose
implementation required components from packages and classes that the subjects did not
yet know. Therefore, most delivered components were not included in user models.
Although user models did not remove many components, all subjects said in the
interviews that they did not notice many well-known components being delivered. A careful
examination of the components removed by user models in the experiments found that none
of them could have been reused in those tasks. The subjects’ perceptions and the available, although quite limited, data still pointed to the need for, and the effectiveness of, personalizing component
delivery based on user models.
As shown in Table 8.5, neither the system nor the users added many components to
the user models. This issue is discussed further in Section 8.5.4.
Subj.-Task   Total No. of      No. of Comp.     No. of Comp.
             Comp. Retrieved   Added to DM      Removed by DM
S1-T1        168               1p, 1c           45
S1-T2        28                1p, 1c           10
S2-T3        140               4m               0
S2-T4        52                0                0
S2-T5        160               0                0
S3-T3        60                0                0
S3-T5        20                0                0
S3-T6        60                0                0
S4-T3        80                1p               7
S4-T5        140               2p               68
S5-T3        100               0                0
S5-T6        420               0                0
SUM          1428              5p, 2c, 4m       130
Table 8.6: Experiment data about discourse models
Subj. is an abbreviation for Subject, Comp. for Components, and DM for Discourse Model. In the third column, p, c, and m refer to package, class, and method,
respectively.
8.3.5
Effectiveness of Discourse Models
Discourse models, used as filters, improved the relevance of retrieved components. As
the experiments with S1 and S4 showed (Table 8.6), when subjects added packages or classes
to the discourse models, those discourse models were successful in removing irrelevant components in later deliveries, and in all four experiments with S1 and S4, the system helped
them find the components they needed. However, when only method components were added to the
discourse model, as in experiment S2-T3, the discourse model was not very effective
in filtering components. None of the components removed by discourse models were needed for
the implementation of tasks in those experiments. Therefore, discourse models did reduce the
number of irrelevant components.
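The filtering role of a discourse model can be illustrated with a minimal sketch. It assumes that components are identified by qualified names of the form package.Class.method and that an entry in the discourse model excludes everything beneath it; the class and method names are hypothetical and are not taken from the CodeBroker implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DiscourseModelFilter {
    // Packages, classes, or methods the programmer marked as irrelevant (assumed representation).
    private final Set<String> skipped = new HashSet<>();

    void skip(String packageClassOrMethod) {
        skipped.add(packageClassOrMethod);
    }

    // A component is filtered if it equals an entry or lies beneath one.
    boolean isFiltered(String qualifiedName) {
        for (String entry : skipped) {
            if (qualifiedName.equals(entry) || qualifiedName.startsWith(entry + ".")) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the retrieved components that the discourse model does not exclude.
    List<String> filter(List<String> retrieved) {
        List<String> delivered = new ArrayList<>();
        for (String name : retrieved) {
            if (!isFiltered(name)) {
                delivered.add(name);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        DiscourseModelFilter dm = new DiscourseModelFilter();
        dm.skip("java.awt"); // one package entry excludes every component beneath it
        List<String> retrieved = Arrays.asList(
                "java.awt.Color.brighter", "java.text.NumberFormat.format");
        System.out.println(dm.filter(retrieved)); // [java.text.NumberFormat.format]
    }
}
```

Under this prefix rule, a single package or class entry removes many components at once, whereas a method entry removes only itself, which is consistent with the low filtering effect observed in S2-T3.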
8.3.6
Use of the Retrieval-by-Reformulation Mechanism
The CodeBroker system supports two interfaces of retrieval-by-reformulation: direct manipulation and query refinement. Discourse models and user models were the results of using
the direct manipulation interface (Section 7.5.1), and their roles were discussed in the previous two
sections (Sections 8.3.4 and 8.3.5). This section discusses the usage of the query refinement
interface (Section 7.5.2).
The query refinement interface was invoked by three subjects (S1, S3, and S5), and
interestingly, they used three different features. S1 used the interface to modify a query and the
modification led to locating a needed component. S3 did not change the query; instead he filled
the field of Interested Components with a package name because he was sure that all his
needed components were from that package. He did, in fact, find all of his needed components
from that package. Later in the interview, he commented:
“It worked great since I knew everything was going to be in java.text.
That is a nice feature, that little refinement thing.”
In addition to query modifications, S5 also used the interface to specify packages in which he
was not interested by filling the field of Filtered Components. However, that still did not
help him find what he wanted because the terms he chose appear in many other packages he did
not specify in the Filtered Components field.
Instead of using the query refinement interface, some subjects activated the delivery
mechanism by directly modifying comments in the editor when they realized their previous
comments did not help them find what they wanted.
All of these observations confirmed that locating reusable components is an iterative
process and the support of retrieval-by-reformulation is necessary (Section 5.4). However, the
CodeBroker system did not provide a mechanism that could guide programmers in refining their
queries (for discussions, see Section 8.5.3).
8.3.7
The Role of Layered Presentation of Information
In CodeBroker, information about reusable components is presented to programmers in
different layers: names and short descriptions of components in RCI-display, signatures and
matching terms triggered by mouse movements, and full details in an external HTML Browser
(Section 7.4.1). The experiments confirmed that such a layered presentation mechanism was
effective in helping programmers choose the right component to reuse.
Subjects were observed to use the following procedure to choose a component. They
first looked at names and descriptions in RCI-display. If they found one promising component,
they clicked on it and went to the Browser for detailed information. If they did not find anything
interesting in RCI-display, they either moved back to the editor to modify their doc comments
or invoked the query refinement interface. If they found several similar components in RCI-display, they moved the mouse over the names and looked at the signatures, trying to find the
best one. This process, called information discernment in Section 3.4.3.5, was often very short,
and rarely took more than a minute; in contrast, the detailed evaluation of chosen components
often took several minutes.
8.4
Other Findings about Programming in General
This section presents findings about programming in general that are not necessarily
related to the use of CodeBroker, but are related to the theoretical framework discussed in
Chapters 2 and 3.
8.4.1
Knowledge of Components and Problem Framing
In Chapter 2, I postulated that programming is an interaction between problem framing
and problem solving, and knowledge about reusable components not only makes the solving
process easier but also increases the programmers’ capability of problem-framing because they
can frame the problem directly with those components. Findings from the experiments support
the claim.
Programming Task 3 (T3) (Section 8.2.3) was assigned to four subjects. Two subjects
(S3 and S4) knew that some classes in Java could be used, even though neither of them had ever used
the classes and they were not sure what the classes were or whether the classes included all the
needed functionality. They described their understanding and implementation plan of the task
as follows:
S3: “There are two ways that I could do this. One way is that Java might actually have some supports in Java’s international text classes for doing reading
in Chinese format and then writing out in western format because I know there
is a NumberFormat class, but I have never used it. And it might be easier
for me to just do it by hand, which is to take the number, read it from right to
left and then read it and write out another string, because it is pretty a simple
thing. I have to take the comma from here and put it right here.”
S4: “So what I want to do is that I will probably parse it using something like
date format—I don’t remember exactly what kind of number you can parse
into—and then just reformat it into another way, another date format...oh, not
date format, number format or something like that—java.util.text or
java.text or something like that package. If it doesn’t work, I am going
to do it by hand: just remove the commas and insert them back. But I believe
there should be a class that, given a pattern of a string, can convert it back to
a string of another pattern.”
Two other subjects (S2 and S5), who did not know anything about those classes, described the same task as follows:
S2: “I will convert the western number to an int primitive, run through the
int-like StringBuffer backwards, throwing sets of four numbers into
another StringBuffer. After each fourth number, check if there is a fifth
number available; and if there is, insert a comma. At the end, simply reversing
the string should give you the Chinese number.”5
S5: “Basically, I am gonna to parse the number, take out the commas and
insert the commas.”
Apparently, the problem descriptions of S3 and S4 were strongly affected by their knowledge about components, whereas S2 and S5 described a more detailed, lower-level implementation plan. As a result, both S3 and S4 implemented the task in a very compact way using
components delivered by CodeBroker. S2 implemented the task as he described because the
system failed to deliver relevant components based on his comments. S5 came up with a compact implementation too because the CodeBroker system delivered components from the NumberFormat class.
5
The subject thought it was to convert a western format number to a Chinese format number.
8.4.2
The Opportunistic Nature of Programming
The example of S5 implementing T3 illustrates well the opportunistic nature of programming (Section 2.3). According to his original implementation plan, S5 started by creating
two method interfaces: one reads the inputted number, and the other converts and returns it.
However, based on the doc comment he wrote for the second method, CodeBroker delivered
the format method of the java.text.NumberFormat class. S5 noticed it, browsed its
documentation, and completely changed his original implementation plan. His final implementation was
very close to the one created by S3, who anticipated the existence of the format method.
This example showed that actively delivered components augment programmers’ insufficient programming knowledge and create learning and reuse opportunities for programmers.
The phenomenon that delivered components changed a programmer’s original implementation
plan was also observed in experiments with two other subjects.
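To give a sense of what a compact, component-based implementation of T3 might look like, the following sketch re-groups a number written with commas every four digits into one with commas every three digits. It is an illustration only, not the subjects’ actual code: the helper name, the grouping patterns, and the assumption that the task amounts to re-grouping digits are mine, inferred from the subjects’ descriptions in Section 8.4.1.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class NumberRegrouping {
    // Hypothetical helper: converts four-digit grouping to three-digit grouping.
    static String chineseToWestern(String chineseFormat) {
        // Fix ',' as the grouping separator regardless of the default locale.
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
        DecimalFormat fourDigitGroups = new DecimalFormat("#,####", symbols);
        DecimalFormat threeDigitGroups = new DecimalFormat("#,###", symbols);
        try {
            Number value = fourDigitGroups.parse(chineseFormat);
            return threeDigitGroups.format(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + chineseFormat, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(chineseToWestern("1,2345,6789")); // 123,456,789
    }
}
```

Compared with the manual plan of removing and re-inserting commas character by character, the parse and format methods carry almost all of the logic, which is why knowing that such a class exists changes how the problem is framed.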
8.4.3
Four Levels of Knowledge about Components
Not surprisingly, the subjects did have four different levels (Figure 4.1) of knowledge
about components in the repository. The disparity between L3 and L4 was very obvious in all
subjects. For example, neither S2 nor S5 knew of the existence of the format method. Some
subjects knew a lot about GUI packages, whereas others knew a lot about the java.lang
package. Subjects who had had more Java programming experience had more anticipation
of the repository because they had explored the repository more often than less experienced
ones. Some of the anticipations were transferred from their programming experience with other
programming languages. Sometimes, subjects anticipated the existence of components that
were not included in the repository.6
For example, one subject thought there should be a
method of filling a sequence with a sequence of numbers. Before he gave up and wrote his own
code, he first wrote a comment in his program, hoping the system could deliver it, and then
browsed the documentation system quite thoroughly, trying to find the nonexistent component.
6
Note that in Figure 4.1, a part of L3 is outside of L4.
Even after he had finished his program, he still believed the component should exist in the
repository, and he said: “I am not sure there is not a fill method [that can fill a sequence with a
sequence of numbers].”
For those components that subjects had used before, there was a difference of levels of
mastery (L1 and L2) as well. As expected, many components were reused by subjects directly,
without consulting documents. However, in one experiment (S2-T5), the subject (S2) relied on
the delivery mechanism to reuse one component that he claimed he had used many times before but
whose exact name he could not remember, even though he knew that it was a static method.
8.4.4
Learning Task-Relevant Components on Demand
One important factor that contributes to the importance of the learning-on-demand model
(Section 3.4.3.1) is that learning is more effective within the working context because programmers are more motivated to learn new things that can be immediately applied to their task at
hand, and the existence of an application working context makes learning easier. Observations from
the experiments confirmed this claim, and the following example is typical of all experiments.
In the previously cited example (Section 8.4.1), neither S3 nor S4 had known how to
use the java.text.NumberFormat class or what kind of functionality it had,7 and
S5 had not even known of the existence of the java.text package. When the format method
from that class was delivered, all of them realized they could use it. However, in order to use
the format method, they needed to know how the class was structured and also needed other
methods from the same class to implement the task. Despite their claims in the beginning that
they could easily implement the task with their existing knowledge, they were all very motivated
to learn how to use the format method and other related methods. It took S3 and S4 more
than 15 minutes, and S5 more than 30 minutes, to learn all the needed methods. Given the fact
7
In fact, S4 used the java.text.DecimalFormat class, whereas S3 and S5 used the
java.text.NumberFormat class. For the sake of brevity, I will use java.text.NumberFormat to represent both.
that they are all expert programmers with extensive programming knowledge, learning those
components probably cost them more time than if they had just implemented the task by hand.
However, they were all determined to learn those new components because they wanted to use
those components instead of creating their own primitive solutions.
8.5
Problems of CodeBroker and Needed Improvements
The experiments uncovered several problems with CodeBroker, which provide guidelines
for improving the system.
8.5.1
Irrelevant Components
Although the system delivered task-relevant components, it also delivered many irrelevant components. Most subjects would have liked the deliveries of the system to be more
“focused” on their tasks. However, they also acknowledged that if they could find something
immediately useful in the deliveries that could save time, they would like to use the system
even if the deliveries were not “focused” enough.
In the experiments, subjects were asked to implement two or three unrelated tasks; therefore, the benefits of user models and discourse models were not fully utilized. As programmers
continue using the system for a relatively longer time, the number of irrelevant components can
be expected to decrease.
One particular problem was that the system looked at doc comments only. Some subjects
wanted the system also to deliver components based on comments inside methods. One subject
was disappointed that the system did not use the names of methods and variables because,
although he did not like to write comments, he always created very good names to indicate
what he wanted to implement. The system might be able to improve the relevance of
deliveries if it used not only the currently entered doc comments, but also the surrounding
comments and identifiers.
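One simple way to turn method and variable names into query terms is to split them at underscores and at case boundaries. The helper below is a hypothetical sketch of that idea; it is not part of CodeBroker, and a practical version would also need to handle abbreviations and acronyms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class IdentifierTerms {
    // Splits an identifier such as getNextEvent or event_queue into lower-case words.
    static List<String> split(String identifier) {
        List<String> terms = new ArrayList<>();
        // Break at underscores and wherever a lower-case letter or digit is followed by an upper-case letter.
        for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("getNextEvent")); // [get, next, event]
        System.out.println(split("event_queue"));  // [event, queue]
    }
}
```

The resulting words could then be appended to the terms extracted from doc comments before the query is issued.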
8.5.2
Abstraction Mismatch from Queries to Components
When programmers wrote comments to describe concretely what they wanted, the system had no means to find components that could be reused but were described in abstract terms.
For example, the system did not deliver any reusable component for Task 6 (T6), which was assigned to S3 and S5. The task could be implemented very easily with the two methods push
and pop in the jgl.PriorityQueue class. However, the descriptions of those methods in the
repository are very abstract.
public synchronized void push(Object object)
Push an object.
public synchronized Object pop()
Pop the last object that was pushed onto me.
Both subjects, however, described the task in concrete terms. Both of them interpreted the task
literally as an event management problem and described the two methods as
/** Takes the new event and puts it at the end of
the event queue */
/** Gets the next event and handles it */
/** add new event to waiting list */
/** get the next event from the list */
In order to address this abstraction mismatch problem (Section 3.4.3.2), CodeBroker
needs to index components based not only on their descriptions but also on the contexts in which they are used.
For example, despite the disparity between the two subjects’ comments and the descriptions
of the components in the repository, the comments of both subjects were quite similar. If one
programmer used the push and pop methods in his program, and the system also indexed the
two methods by using the comments of the program that used them, it might be able to deliver
them to the second programmer.
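The idea of indexing a component by its usage context can be sketched as follows. The sketch assumes a plain term-overlap score instead of CodeBroker’s actual retrieval mechanism, and the class and method names are hypothetical; it shows only how the comments of one programmer can make a component reachable by the comments of another.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class UsageContextIndex {
    private final Map<String, Set<String>> termsByComponent = new HashMap<>();

    // Lower-cases the text and splits it into alphabetic words.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Called with the repository description and, later, with comments of programs that used the component.
    void addText(String component, String text) {
        termsByComponent.computeIfAbsent(component, k -> new HashSet<>()).addAll(terms(text));
    }

    // Number of distinct query terms that also index the component (no stop-word removal).
    int overlap(String component, String query) {
        Set<String> shared = terms(query);
        shared.retainAll(termsByComponent.getOrDefault(component, new HashSet<>()));
        return shared.size();
    }

    public static void main(String[] args) {
        UsageContextIndex index = new UsageContextIndex();
        String push = "jgl.PriorityQueue.push";
        index.addText(push, "Push an object.");
        System.out.println(index.overlap(push, "add new event to waiting list")); // 0
        // Index the same method with the comment of a program that used it.
        index.addText(push, "Takes the new event and puts it at the end of the event queue");
        System.out.println(index.overlap(push, "add new event to waiting list")); // 2
    }
}
```

With the description alone, the second subject’s comment shares no term with push; once the first subject’s comment is indexed as usage context, the shared terms “new” and “event” give the component a nonzero score.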
8.5.3
Lack of Guidance in Refining Queries
The system did not provide any guidance to help programmers choose more appropriate
terms to describe their task. One subject wanted to find a sorting component, and he strongly
believed it should exist somewhere in the repository. He first created the following comment:
/** sort the list */
for which the system did not deliver anything useful. He continued to try two more queries (as
follows) by invoking the query refinement interface:
order list
returns the highest value from the list
which did not help either. In fact, the problem was caused by the term “list,” which was used in
all three queries, because this term is used in the descriptions of dozens of components in the
repository. If the system had a mechanism pointing out to the subject that the term “list” should
be dropped, the subject would have been able to find the component he needed.
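A guidance mechanism of this kind could, for instance, flag query terms that occur in a large share of component descriptions. The following sketch assumes a simple document-frequency threshold; the class name, the threshold, and the toy descriptions are all hypothetical and do not come from the CodeBroker repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class QueryTermAdvisor {
    // Returns the query terms that occur in more than maxFraction of the descriptions.
    static List<String> overlyCommonTerms(String query, List<String> descriptions, double maxFraction) {
        List<String> flagged = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            int count = 0;
            for (String description : descriptions) {
                List<String> words = Arrays.asList(description.toLowerCase(Locale.ROOT).split("\\W+"));
                if (words.contains(term)) {
                    count++;
                }
            }
            if (count > maxFraction * descriptions.size()) {
                flagged.add(term);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Toy descriptions, invented for illustration only.
        List<String> descriptions = Arrays.asList(
                "Returns the size of the list",
                "Removes an element from the list",
                "Sorts the elements in ascending order",
                "Appends an element to the list");
        System.out.println(overlyCommonTerms("order list", descriptions, 0.5)); // [list]
    }
}
```

A flagged term discriminates poorly among components, so the interface could suggest dropping it before the query is resubmitted.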
8.5.4
Problems with User Modeling
As shown in Table 8.5, only one subject (S2) added two items (one package and one
class) to the user model, and that was done by mistake: the subject actually
wanted to add them to the discourse model. This illustrated one design problem with the system.
Subjects were concerned that if they added something into the user model, which is a permanent
file, they could not get it back. An interface for programmers to edit their user models might be
able to alleviate this concern.
Subjects used at least 57 different components (shown in the Total column in Table 8.3),
but the system automatically detected and added only 20 method components in total to the user
models (shown in Table 8.5). This was caused by the inconsistency between the programming
style assumed by the adaptive mechanism in CodeBroker and the styles of the subjects. The system
assumed that programmers would import a package first, followed by variable declarations,
followed by method invocations. However, most subjects programmed in the reverse order:
they declared variables and imported packages after they had used the variables in method
invocations (Section 7.4.3.2). Modifications to the algorithm of adaptive user modeling are
needed.
The current user modeling technique in CodeBroker is too simple. One subject reused a
delivered component whose name he often mistook for something else, although he had used
the component many times before. A more sophisticated user modeling technique is needed to
capture misconceptions like that so that the system can deliver the right name of the component
when the user uses the wrong name again. Another problem was pointed out by another
subject, who said:
“When you have programmed for a very long time, you may forget what you
have used in your first program.”
Therefore, counting the number of uses is too simplistic; a forgetting mechanism should be
incorporated to decide when to remove from user models those components that have not been
used by programmers for a long time.
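One possible form of such a forgetting mechanism is sketched below. It assumes that the user model records a use count and the date of the last use for each component, and that a component counts as known only if it has been used often enough and recently enough; the thresholds and all names are illustrative assumptions, not the CodeBroker design.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

class ForgettingUserModel {
    private static final class Usage {
        int count;
        LocalDate lastUsed;
    }

    private final Map<String, Usage> usages = new HashMap<>();
    private final int minUses;        // uses required before a component counts as known
    private final long retentionDays; // days after which an unused component is treated as forgotten

    ForgettingUserModel(int minUses, long retentionDays) {
        this.minUses = minUses;
        this.retentionDays = retentionDays;
    }

    void recordUse(String component, LocalDate when) {
        Usage usage = usages.computeIfAbsent(component, k -> new Usage());
        usage.count++;
        if (usage.lastUsed == null || when.isAfter(usage.lastUsed)) {
            usage.lastUsed = when;
        }
    }

    // Known components would be withheld from delivery; forgotten ones would be delivered again.
    boolean isKnown(String component, LocalDate today) {
        Usage usage = usages.get(component);
        if (usage == null || usage.count < minUses) {
            return false;
        }
        return ChronoUnit.DAYS.between(usage.lastUsed, today) <= retentionDays;
    }

    public static void main(String[] args) {
        ForgettingUserModel model = new ForgettingUserModel(2, 365);
        model.recordUse("java.text.NumberFormat.format", LocalDate.of(2001, 1, 1));
        model.recordUse("java.text.NumberFormat.format", LocalDate.of(2001, 1, 2));
        System.out.println(model.isKnown("java.text.NumberFormat.format", LocalDate.of(2001, 6, 1))); // true
        System.out.println(model.isKnown("java.text.NumberFormat.format", LocalDate.of(2003, 1, 1))); // false
    }
}
```

Because the entry is kept and only its status changes, a component that is used again immediately counts as known once more, which also avoids the subjects’ concern about permanently losing information from the user model.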
8.5.5
Lack of Examples of Components
Some components in the repository include simple examples that explain how to
use them. During the experiments, whenever subjects found such examples, they immediately
jumped to read them instead of reading the descriptive texts. However, only very few components
are accompanied by examples in the current repository. Creating examples for each component is a time-consuming task and increases the difficulty of setting up a component repository.
A possible solution to this problem is to make the component repository system extensible by
the programmers who reuse components and to provide a mechanism that allows them to add their own
simple examples to the component repository. Programmers commonly write their own test
programs to try out potentially reusable components in order to fully understand their functionality (Lange & Moher, 1989; Aoki et al., 2001). Examples do not need to be added to
the physical storage of the repository; instead, hyperlinks from components in the repository
to example programs can be created by utilizing the link service provided by open hypermedia
systems, such as the Chimera system (Anderson et al., 2000).
8.5.6
Lack of User Configurability
Currently, the RCI-display is placed as a part of the editor buffer, and it reduces the
available workspace of programmers. This placement did not cause any discomfort in the experiments. It may become a problem in real use, however, because the experiment tasks were
very small and programmers did not need a lot of editing space. Some subjects said that because they wanted to make their effective working space as large as possible in their normal
programming practice, they wanted to be able to rearrange the RCI-display to other places.
They also wanted to control what information is shown in the RCI-display. For example, when
the working space is too small, it would be better if the RCI-display shows component names
only and shows the descriptions of components when programmers move the mouse over the
names. Some subjects did not like to use the mouse to invoke the Skip Components Menu
(Section 7.5.1); instead, they wished they could use the keyboard to filter components.
8.6
Summary of Evaluations
Overall, the findings from the experiments have indicated that
• Active component delivery can promote reuse by supporting the reuse of unknown
components and reducing the cost of locating components.
• In most cases, modeling a programmer’s task based on comments is sufficient, although
not perfect, to locate relevant components.
• Discourse models can improve the relevance of retrieved components.
• User models contribute to the personalization of delivery to individual programmers.
• Support of retrieval-by-reformulation is essential because locating reusable components is an iterative process.
All subjects reused components delivered by CodeBroker during the experiments. Most
of them have very positively acknowledged the overall effectiveness of the system and have
indicated that they would like to use CodeBroker as their daily programming environment.
However, the delivery mechanism alone is not sufficient; it must be combined with query
and browsing mechanisms. Delivery jump-starts the reuse process by giving programmers immediate access to components contextualized to their task and their background knowledge.
After that, programmers need to use query or browsing mechanisms to find other components
that are also needed because in object-oriented programming, reusing one component often
requires the reuse of other tightly coupled components.
Chapter 9
Related Work
This research has tried to find a suitable way to integrate active information delivery
with reusable component repository systems to provide a reusable component-enriched programming environment so that programmers can easily access the needed components. This
work has been greatly influenced by research efforts on active information systems, component repository systems, and intelligent programming environments. This chapter compares
CodeBroker with closely related research work in those three fields.
9.1
Active Information Systems
The simplest implementation of the information delivery mechanism in an active information system is to deliver a piece of information without considering the working context,
such as Microsoft Office’s “Tip of the Day” and a similar research prototype, the DYK (Did
You Know) system (Owen, 1986). The irrelevance of the information to the task at hand often
causes users to ignore it totally.
ACTIVIST is an active help system for a text editor that uses a plan library to infer user
goals from observed actions by matching them against the condition part of plans (Fischer et al.,
1985). It then suggests more efficient solutions that accomplish the same goals. It maintains a
user model to tune the delivery to each user. CodeBroker resembles ACTIVIST in that both
are totally embedded in the working environment, and task-relevant information is delivered into the workspace. However, ACTIVIST delivers feedback information after users have
finished their task, and the delivery is meant to improve their future work. Information delivered by CodeBroker is meant to influence the current task under execution. LispCritic (Fischer,
1987) helps programmers improve their programming skills. It uses program transformation
rules to suggest a syntactical equivalent that is either more cognitively efficient or more computationally efficient after it has recognized a less ideal code segment. Unlike CodeBroker,
which makes use of both the semantic and syntactical information of programs, LispCritic has
no knowledge about the semantics.
Apple Data Detector (ADD) (Nardi et al., 1998) is an active system that smooths the
routine workflow of extracting structured information from everyday documents and automates
the consequent actions. Both CodeBroker and ADD take advantage of the semiformal structure
(doc comments in CodeBroker, URLs in ADD) existing in the working environment and automate the following actions (retrieving reusable components in CodeBroker, storing the URL as
a bookmark in ADD) to eliminate the unnecessary switch of working contexts.
Remembrance Agent (RA) (Rhodes & Starner, 1996) tries to augment human memory
by displaying relevant documents. Like CodeBroker, RA also listens to a text editor and autonomously formulates a query based on the user’s current focus. A back-end search engine
is invoked to find relevant old emails and notes in the user’s individual information space. RA
deals with unstructured texts only; CodeBroker relies on the semiformal structure of the program to extract needed information. In addition, CodeBroker also makes use of syntactical
information. One shortcoming of RA is that it treats all documents the same, although its goal
is to remind users of forgotten documents.
Letizia (Lieberman, 1997) assists users in browsing the WWW by suggesting web pages
within a few links from the current page. Like CodeBroker, it aims at eliminating the context switch from a browsing interface to a search interface to streamline the exploration of
web information. Web pages in the bookmark list of a user are analyzed based on information retrieval techniques to create an interest profile. Suggestions are based on the similarity
between web pages and the interest profile. Other web-browsing assistant systems, such as
WebWatcher (Armstrong et al., 1995) and Lira (Balabanovic & Shoham, 1995), adopt similar
approaches.
9.2
Component Repository Systems
Research on reusable component repository systems is abundant. They differ from each
other mainly in the retrieval mechanisms adopted. These systems can be divided into three
groups according to the aspect of programs on which the retrieval mechanisms are based.
9.2.1
Concept-Based Component Repository Systems
The retrieval mechanism used in CodeBroker is similar to those of reusable component
repository systems that use free-text indexing. GURU (Maarek et al., 1991) indexes components
based on their textual documentation. Etzkorn and Davis have tried to use header comments
(similar to doc comments in CodeBroker) to index legacy object-oriented programs (Etzkorn &
Davis, 1997b). Comments and identifier names are also used for indexing in (Girardi & Ibrahim,
1995) and (DiFelice & Fonzi, 1998). Michail and Notkin have demonstrated the possibility of
using identifier names only to find similar reusable components for comparison (Michail &
Notkin, 1999).
Free-text indexing is easy both for setting up a component repository and for programmers to formulate reuse queries. Despite its simplicity, empirical studies have found that it performs no worse, in terms of retrieval effectiveness, than other more sophisticated, effort-consuming
repository systems (Frakes & Pole, 1994; Mili et al., 1997b). Nevertheless, free-text indexing-based reuse systems do not directly support shortening the conceptual gap in query formulation.
One attempt to bridge the conceptual gap is to use structured representations and knowledge bases. Both CodeFinder (Henninger, 1997) and LaSSIE (Devanbu et al., 1991) use frames
to represent reusable components. Frames in CodeFinder are connected by an associative network whose links have weights to reflect the semantic relationships among components. Searching for relevant components is supported by spreading activation. Frames in LaSSIE are structured
into hierarchical, taxonomic categories by human experts. To ease the difficulty of creating
the frame representations of components, ROSA (Girardi & Ibrahim, 1995) applies natural language processing techniques to automate the process. However, sentences that can be processed
are very limited. The multiple faceted classification schema (Prieto-Diaz, 1991) is another format of structured representations. Reusable components are represented with multiple facets,
each of which is described with a term. A conceptual distance graph has to be constructed to
reflect the semantic relationships among terms. AIRS is a system that combines multiple facets
and the frame-based approach (Ostertag et al., 1992).
Structured representation-based systems are labor intensive in creating representations
of components and knowledge bases.
9.2.2
Constraint-Based Component Repository Systems
Constraints of programs can also be used to index and retrieve reusable components.
Rittri first proposed using signatures in reusable component retrieval (Rittri, 1989). His work
was further extended by Zaremski and Wing, who give a general framework for signature matching in functional programming languages (Zaremski & Wing, 1995). Although research on
signature matching has largely focused on functional programming languages, which are often
designed with a sound type theory, signature matching is also applicable to other strongly typed
programming languages (DiCosmo, 1995). An Ada version of signature matching has been implemented in (Stringer-Calvert, 1994). CodeBroker applies this technique to a strongly typed
object-oriented programming language. Signature matching in CodeBroker is not used as the
sole method of retrieving components; rather, it is used as a filter to exclude those components
that are significantly different from the current task in terms of constraint compatibility.
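The use of signatures as a filter can be sketched with a deliberately simplified notion of compatibility: two signatures are taken to be compatible when they have the same return type and the same parameter types, ignoring parameter order. This rule is an assumption made for illustration; the matching rules actually used in CodeBroker and in the cited work are richer, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MethodSignature {
    final String returnType;
    final List<String> paramTypes;

    MethodSignature(String returnType, String... paramTypes) {
        this.returnType = returnType;
        this.paramTypes = new ArrayList<>(Arrays.asList(paramTypes));
        Collections.sort(this.paramTypes); // sorting makes the comparison independent of parameter order
    }

    // Simplified compatibility: same return type and same multiset of parameter types.
    boolean compatibleWith(MethodSignature other) {
        return returnType.equals(other.returnType) && paramTypes.equals(other.paramTypes);
    }

    public static void main(String[] args) {
        MethodSignature query = new MethodSignature("int", "String", "int");
        System.out.println(query.compatibleWith(new MethodSignature("int", "int", "String"))); // true
        System.out.println(query.compatibleWith(new MethodSignature("void", "int")));          // false
    }
}
```

Used as a filter, such a check would be applied only to components that concept-based retrieval has already returned, so that it narrows a delivery without being the sole retrieval criterion.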
The formal specification-based approach is another form of using constraints to index
and retrieve components. Zaremski and Wing have adopted pre- and post-predicates to find
components that exactly match or approximately match a reuse query (Zaremski & Wing, 1997).
A. Mili et al. have tried to classify reusable components based on refinement order existing
among their formal specifications (Mili et al., 1997a). The formal specification-based approach
could be integrated into CodeBroker to improve the precision of retrieval, if the programming
environment supports formal methods. For the majority of programmers, however, the formal
approach is too difficult to use.
9.2.3
Code-Based Reuse Repository Systems
Behavior sampling exploits the code aspect of programs to retrieve reusable components
(Hall, 1993; Podgurski & Pierce, 1993). In behavior sampling-based systems, a programmer randomly chooses a small set of sample inputs and computes the corresponding outputs after having specified the signature of the module (same as the constraint query created by
CodeBroker). Reusable components whose signature is compatible are found and executed on
the sample inputs. Components whose outputs match the outputs computed by the programmer are returned. Behavior sampling is difficult to apply to components with complicated data
structures. Moreover, it is unable to find close but not identical components.
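The procedure described above can be sketched as follows. This is a simplified illustration, not the algorithm of the cited systems: the behavior_sample function, the candidate components, and the sample values are hypothetical, and the candidates are assumed to have already passed the signature check.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (input arguments, output computed by the programmer)
Sample = Tuple[tuple, object]


def behavior_sample(candidates: Dict[str, Callable],
                    samples: Sequence[Sample]) -> List[str]:
    """Return the names of candidates that reproduce every expected output."""
    matches = []
    for name, component in candidates.items():
        try:
            if all(component(*args) == expected for args, expected in samples):
                matches.append(name)
        except Exception:
            # A component that fails on a sample input is not a match.
            continue
    return matches


# Hypothetical components with the compatible signature (int, int) -> int.
candidates = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "maximum": lambda a, b: max(a, b),
}
# Sample inputs chosen by the programmer, with outputs computed by hand.
samples = [((2, 3), 6), ((4, 1), 4), ((0, 7), 0)]
found = behavior_sample(candidates, samples)
print(found)
```

Only multiply reproduces all three outputs. The sketch also shows both limitations noted above: outputs are compared by strict equality, so a close but not identical component is never returned, and the comparison presupposes inputs and outputs that are easy to write down and compare, which is not the case for complicated data structures.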
9.2.4
Uniqueness of CodeBroker
Compared with other component repository systems, CodeBroker has three unique features:
(1) It automatically extracts and formulates reuse queries.
(2) It is adaptable and adaptive to each programmer to reflect their growing knowledge
about reusable components.
(3) It is seamlessly integrated with current programming environments.
All of the above-mentioned systems, except CodeFinder, assume that component retrieval is a one-time effort and do not support retrieval-by-reformulation. In addition to the
query refinement supported in CodeFinder, CodeBroker allows programmers to manipulate retrieved components interactively, which makes choosing among them less difficult.
9.3
Intelligent Programming Environments
CodeBroker is similar to structure-based editors, which exploit the structure of programs
to speed and ease the work of program entry. Syntax-based editors translate the syntactical
knowledge of programming languages into templates to aid novice programmers by freeing
them from remembering syntactical details (Szwillus & Neal, 1996). In terms of supporting
reuse, CodeBroker is more similar to the example-based programming environment proposed
in (Neal, 1996) and the cliche-based programming environment of KBEmacs (Rich & Waters,
1990). An example-based programming environment has a window for example programs.
Programmers can reuse them directly or use them as examples of language constructs, or of algorithm implementations. KBEmacs is implemented as an extension to Emacs. It has a knowledge base of cliches, or program plans, which programmers can reuse in their programming by
referring to them by name. However, in both systems, programmers have to activate examples
or cliches by name. That is to say, programmers have to learn the “vocabulary” before they can
reuse.
The Argo design environment (Robbins & Redmiles, 1998) incorporates plan recognition-based critics to actively deliver information for programmers to reflect upon their current design
or programs. Based on the type of programming knowledge provided, critics are grouped into
nine types: correctness, completeness, consistency, optimization, alternative, evolvability, presentation, experimental, and organizational critics. The critiquing infrastructure (or information
delivery mechanism, as it is called in this thesis) of Argo supports program design by automatically supplying, in a timely manner, general programming knowledge that is relevant to the task at hand.
It complements the support provided by CodeBroker, which focuses on supplying knowledge about
reusable components.
Chapter 10
Future Work and Conclusions
This chapter discusses future research opportunities uncovered by this research, summarizes the research approach taken, and then concludes with a description of the major contributions of this research.
10.1
Future Work
In the discussion of the problems of CodeBroker found in the experiments (Section 8.5),
I have described some future work needed to improve the system. In this section, I point out
future research questions in need of investigation based on lessons learned from the research,
and speculate on possible directions.
10.1.1
Extending CodeBroker to a Larger Scale
Currently, the component repository of CodeBroker is located on the same machine as the
programming environment and is created statically before its use because most current component repositories are closed and proprietary. As the Open Source movement attracts
more and more developers, we can expect more software systems and software components
to become open source; one example is the Jun system (a 3D Smalltalk/Java library) (Aoki
et al., 2001). This will make it increasingly difficult for programmers to keep track of newly available
open-source components. I envision a distributed CodeBroker system running on several server
computers to which programmers contribute open-source components. It dynamically indexes
components from those constantly evolving repositories and then delivers components through
networks to other programmers. Through the brokerage of CodeBroker, programmers can benefit from each other’s work and improve the productivity of software development by avoiding
unnecessary repetition of work.
CodeBroker not only can be used to make programmers aware of unknown components
developed by others, but also can be used to foster a community of developers. Forming a
relatively stable community of developers is the key factor for the success of an Open Source
project. Currently, support for the creation of such communities is rare. When CodeBroker
finds and delivers components or systems that are relevant to what a programmer is doing, it
can also make the programmer aware of other programmers who created those components and
systems, thus creating a possible opportunity for those programmers to join forces and form a
new community.
10.1.2
Extending CodeBroker to Higher Design Levels
Although CodeBroker is designed to promote reuse in the phase of coding, the underlying
principles are equally applicable to higher levels of software development activities. For example, if developers use modeling tools such as Argo (Taylor et al., 1996; Robbins & Redmiles,
1998) to create a conceptual design of a software system, they need to specify the functionality
and signature for each class. An active component repository system can utilize that information
to actively deliver potentially reusable components.
In addition to delivering reusable components, an active reuse system can also try to deliver reusable designs from a repository of existing designs by exploiting the similarity between
the current design and those existing designs. The conceptual similarity can still be computed
based on their common descriptive names and functionality descriptions; however, constraint
compatibility at the design level needs to be based on the interaction patterns of design elements.
10.1.3
Supporting More Complicated Indexing and Retrieving Mechanisms
How well an active component repository system such as CodeBroker can deliver relevant components depends on how well it can capture the programming task from existing
information in a programming environment. CodeBroker has tried to capture the task based on
conceptual information revealed through comments and constraint information revealed through
signatures. However, there are other kinds of information that can be utilized. For example, a
programmer may try to write a program using a known design pattern or framework, which may
place extra constraints on the kind of components that could be reused. This piece of information could be used by the repository system to limit its search range and improve the relevance
of delivered components.
Active component repository systems can also exploit the affinity of components to deliver relevant components. The affinity of two components is the likelihood that the two components
will be used in the same program. The repository system can deliver components that have high
affinity with the components used by programmers in their current programs. There are two possible approaches to computing the affinity of two components: coupling-based and statistics-based.
The coupling-based approach looks at how tightly two components are structurally connected.
Two components are more likely to appear together if they both access common data, or if the
data type output by one component is the same as the data type input by another. The statistics-based approach looks at how often two components appear together in a single program. Such
co-occurrence information could be obtained in a way similar to the automatic thesaurus construction in information retrieval systems by treating programs as documents and components
as terms.
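A minimal sketch of the statistics-based approach follows. The choice of a cosine-style coefficient is an assumption made for illustration (other association measures from thesaurus construction could be substituted), and the function names and the small corpus of programs are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def build_counts(programs):
    """Count component frequency and pairwise co-occurrence over programs."""
    freq, cooc = Counter(), Counter()
    for components in programs:
        used = sorted(set(components))        # each program counts once
        freq.update(used)
        cooc.update(combinations(used, 2))    # pairs are stored in sorted order
    return freq, cooc


def affinity(a, b, freq, cooc):
    """Cosine-style affinity in [0, 1]; 0 if either component is never used."""
    if a == b or not freq[a] or not freq[b]:
        return 0.0
    pair = (a, b) if a < b else (b, a)
    return cooc[pair] / sqrt(freq[a] * freq[b])


def suggest(used_now, freq, cooc, top=2):
    """Rank components not yet used by their best affinity to a used one."""
    if not used_now:
        return []
    scores = {
        other: max(affinity(u, other, freq, cooc) for u in used_now)
        for other in freq if other not in used_now
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, score in ranked[:top] if score > 0]


# Hypothetical corpus: each program is the list of components it uses.
programs = [
    ["FileReader", "BufferedReader", "StringTokenizer"],
    ["FileReader", "BufferedReader"],
    ["Random", "Math.abs"],
    ["BufferedReader", "StringTokenizer"],
]
freq, cooc = build_counts(programs)
suggestions = suggest({"FileReader"}, freq, cooc)
print(suggestions)
```

For a programmer who has used only FileReader, the sketch suggests BufferedReader first and StringTokenizer second, because those components co-occur with FileReader in the corpus, whereas Random and Math.abs never do. A coupling-based measure would instead inspect the components themselves, for example by comparing the output type of one with the input types of the other.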
10.2
Summary
“The best way to attack the essence of building software is not to build it at all” (Brooks,
1995). Reusing existing software components is one way of reaching this goal. By reuse,
programmers can avoid repetitive work and focus on the unique features of the new system.
Based on his investigation, Jones claimed that, of all programs written in 1983, fewer than 15%
of them were unique, novel and specific to individual programs, whereas 85% were common
and generic (Jones, 1984). The problem, however, is how programmers can know that they are
doing something that others have done many times before. As one programmer said:
“I could be creating a method that does exactly the same thing somebody
else’s does...even though we have access to each other’s code. We might call
them different names and we might have a bit different way of doing it, but
we’re still doing the same thing.” (Fichman & Kemerer, 1997)
This has been the central question investigated in this thesis. For programmers to
be able to reuse components whose existence is not known to them, component repository
systems need to support an information delivery mechanism in addition to the traditional information access mechanism. A conceptual framework for designing active component repository
systems was proposed. The key challenge in creating an active component repository system is
to contextualize the delivery of reusable components to the task and the background knowledge
of programmers. This thesis proposed that the contextualization could be realized through the
combination of task models, which capture the immediate programming task; discourse models,
which capture the larger context in which the current programming task takes place; and user
models, which represent programmers’ knowledge about reusable components.
Based on the conceptual framework, an active component repository system, CodeBroker, has been developed. CodeBroker postulates that comments and signatures in a program
can serve as task models to describe what programmers are going to develop. Relevant reusable
components can be retrieved by using such task models as reuse queries. The relevance of components can be further improved by using discourse models that are incrementally evolved by
programmers through interactions with the system. In addition, the delivery of components can
be personalized to each programmer using his or her user model, which is both adaptable and
adaptive.
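How the three models might cooperate at delivery time can be sketched as a chain of filters. The sketch is an illustration under stated assumptions, not the CodeBroker implementation: the representation of the discourse model as a set of excluded packages, the representation of the user model as a set of known components, the relevance threshold, and all names and scores are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class DiscourseModel:
    """Packages ruled out for the current task (assumed representation)."""
    excluded_packages: Set[str] = field(default_factory=set)


@dataclass
class UserModel:
    """Components the programmer already knows (assumed representation)."""
    known_components: Set[str] = field(default_factory=set)


def deliver(ranked: List[Tuple[str, float]], discourse: DiscourseModel,
            user: UserModel, threshold: float = 0.3) -> List[str]:
    """Filter task-relevant components through the discourse and user models."""
    delivered = []
    for name, score in ranked:
        package = name.rsplit(".", 1)[0]
        if score < threshold:
            continue  # not relevant enough to the task model (the query)
        if package in discourse.excluded_packages:
            continue  # outside the larger context of the current task
        if name in user.known_components:
            continue  # already known, so delivery adds no information
        delivered.append(name)
    return delivered


# Hypothetical retrieval result: (component, relevance to the task model).
ranked = [
    ("java.util.Random", 0.8),
    ("java.lang.Math", 0.7),
    ("java.security.SecureRandom", 0.6),
    ("java.util.Date", 0.1),
]
discourse = DiscourseModel(excluded_packages={"java.security"})
user = UserModel(known_components={"java.lang.Math"})
delivered = deliver(ranked, discourse, user)
print(delivered)
```

Here only java.util.Random is delivered: one component falls below the relevance threshold, one belongs to a package excluded through the discourse model, and one is already recorded in the user model.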
The empirical evaluation of CodeBroker found that the research met its goals at both
the technical level and the theoretical level. The technical approach taken by CodeBroker was
feasible because, in most of the experiments, CodeBroker successfully delivered task-relevant
and user-specific components to programmers. The theoretical hypothesis of the research, that
active component repository systems can promote reuse, was also validated through the
empirical studies. CodeBroker not only made it possible for programmers to reuse components
they had not known before, but also prompted programmers to look for more related components
in order to reuse the delivered ones. Moreover, because CodeBroker reduced the cost of
locating components, programmers were more willing to explore the possibility of reuse.
10.3
Contributions
The contributions of this research were threefold. First, this research contributed to a
better understanding of the cognitive difficulties faced by programmers who want to reuse. It identified two major cognitive barriers to locating reusable components: information islands and the
perceived low reuse utility, drawing on cognitive theories and empirical studies on programming
and the use of large information repositories.
Second, a new type of component repository system, the active component repository
system, was proposed, developed, and evaluated. Active component repository systems not
only automate the component-locating process but also help programmers identify reuse opportunities that otherwise would be missed because they either did not know of the existence of the
components or perceived that locating them was too costly.
Third, this research contributed to the design of active information systems in general.
Besides programming, there are many other knowledge-intensive domains where workers rely
on external information resources to augment their mental abilities to comprehend and solve
complex problems as much as programmers rely on reusable components. The challenge in
designing information systems to support information acquisition for those knowledge workers
is not only to make information available to them at any time, at any place, and in any form,
but to reduce information overload by making information relevant to the task at hand and to the
background knowledge of the users. Task modeling based on similarity analysis, in
combination with incremental discourse modeling and user modeling, as proposed and evaluated
through the implementation of CodeBroker in this research, is a step toward meeting this
challenge.
Bibliography
Aaen, I. (1992), CASE Tools Bootstrapping (How Little Strokes Fell Great Oaks), in V.-P. T.
K. Lyytinen (ed.), Next Generation CASE Tools, IOS, Netherlands, pp. 8–17.
Alexander, C. (1964), The Synthesis of Form, Harvard University Press.
Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I. & Angel, S. (1977), A
Pattern Language: Towns, Buildings, Construction, Oxford University Press, New York.
Anderson, K. M., Taylor, R. N. & Whitehead, E. J. (2000), Chimera: Hypermedia for Heterogeneous Software Development Environments, ACM Transactions on Information Systems
18(3), 211–245.
Aoki, A., Hayashi, K., Kishida, K., Nakakoji, K., Nishinaka, Y., Reeves, B., Takashima, A.
& Yamamoto, Y. (2001), A Case Study of the Evolution of Jun: An Object-Oriented
Open-Source 3D Multimedia Library, in Proceedings of 23rd International Conference on
Software Engineering (ICSE’01), Toronto, Canada, (to appear).
Armstrong, R., Freitag, D., Joachims, T. & Mitchell, T. (1995), WebWatcher: A Learning Apprentice for the World Wide Web, in Proceedings of AAAI Spring Symposium on Information Gathering, Stanford, CA, pp. 6–12.
Balabanovic, M. & Shoham, Y. (1995), Learning Information Retrieval Agents: Experiments
with Automated Web Browsing, in Proceedings of AAAI Spring Symposium on Information Gathering, Stanford, CA, pp. 13–18.
Basili, V., Briand, L. & Melo, W. (1996), How Reuse Influences Productivity in Object-Oriented
Systems, Communications of the ACM 39(10), 104–116.
Batory, D., Johnson, C., MacDonald, B. & von Heeder, D. (2000), Achieving Extensibility
through Product-Lines and Domain-Specific Languages: A Case Study, in Proceedings
of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna,
Austria, pp. 117–136.
Belkin, N. & Croft, B. (1992), Information Filtering and Information Retrieval, Communications of the ACM 35(12), 29–37.
Biggerstaff, T. J. (2000), A New Control Structure for Transformation-Based Generators,
in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 1–19.
Biggerstaff, T. J., Mitbander, B. G. & Webster, D. E. (1994), Program Understanding and the
Concept Assignment Problem, Communications of the ACM 37(5), 72–83.
Boehm, B. (1999), Managing Software Productivity and Reuse, IEEE Computer 16(9), 111–
113.
Bradshaw, J. M. (1997), An Introduction to Software Agents, in J. M. Bradshaw (ed.), Software
Agents, AAAI Press, Menlo Park, CA, pp. 1–46.
Brooks, F. P. J. (1995), The Mythical Man-Month: Essays on Software Engineering, 20th anniversary edition, Addison-Wesley, Reading, MA.
Browne, J., Lee, T. & Werth, J. (1990), Experimental Evaluation of a Reusability-Oriented Parallel Programming Environment, IEEE Transactions on Software Engineering 16(2), 111–
120.
Buckley, C., Salton, G. & Allan, J. (1994), The Effect of Adding Relevance Information in a
Relevance Feedback Environment, in W. B. Croft & C. J. v. Rijsbergen (eds.), Proceedings
of 17th Annual International ACM SIGIR Conference, Springer-Verlag, Dublin, Ireland,
pp. 292–300.
Card, S., Robertson, G. & Mackinlay, J. (1991), The Information Visualizer: An Information
Workspace, in Proceedings of Conference on Human Factors in Computing Systems, ACM
Press, pp. 181–188.
Carey, T. & Rusli, M. (1995), Usage Representations for Reuse of Design Insights: A Case
Study of Access to On-Line Books, in J. M. Carroll (ed.), Scenario-Based Design: Envisioning Work and Technology in System Development, Wiley, pp. 165–182.
Carroll, J. M. & Rosson, M. B. (1987), Paradox of the Active User, in J. M. Carroll (ed.),
Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, The MIT Press,
Cambridge, MA, pp. 80–111.
Cox, B. J. (1996), Superdistribution: Objects as Property on the Electronic Frontier, Addison-Wesley, Reading, MA.
Crestani, F., Lalmas, M., Van Rijsbergen, C. J. & Campbell, I. (1998), ’Is This Document
Relevant? ... Probably’: A Survey of Probabilistic Models in Information Retrieval, ACM
Computing Surveys 30(4), 528–552.
Croft, W. B. & Harper, D. J. (1979), Using Probabilistic Models of Document Retrieval without
Relevance Information, Journal of Documentation 35, 285–295.
Curtis, B. (1989), Cognitive Issues in Reusing Software Artifacts, in T. J. Biggerstaff & A. J.
Perlis (eds.), Software Reusability, Vol. II, ACM Press, New York, pp. 269–287.
Curtis, B., Krasner, H. & Iscoe, N. (1988), A Field Study of the Software Design Process for
Large Systems, Communications of the ACM 31(11), 1268–1287.
Damiani, E., Fugini, M. G. & Fusaschi, E. (1997), A Descriptor-Based Approach to OO Code
Reuse, IEEE Software 14(10), 73–80.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990), Indexing
by Latent Semantic Analysis, Journal of the American Society for Information Science
41(6), 391–407.
Detienne, F. (1995), Design Strategies and Knowledge in Object-Oriented Programming: Effects of Expertise, Human-Computer Interaction 10(2/3), 129–169.
Devanbu, P., Brachman, R. J., Selfridge, P. G. & Ballard, B. W. (1991), LaSSIE: A Knowledge-Based Software Information System, Communications of the ACM 34(5), 34–49.
DiBona, C., Ockman, S. & Stone, M. (eds.) (1999), Open Sources: Voices from the Open Source
Revolution, O’Reilly & Associates, Sebastopol, CA.
DiCosmo, R. (1995), Isomorphisms of Types: From Lambda Calculus to Information Retrieval
and Language Design, Birkhauser, Boston.
Dieterich, H., Malinowski, U., Kuhme, T. & Schneider-Hufschmidt, M. (1993), State of the Art
in Adaptive User Interfaces, in M. Schneider-Hufschmidt, T. Kuhme & U. Malinowski
(eds.), Adaptive User Interfaces: Principles and Practice, Elsevier Science Publishers,
Amsterdam, pp. 13–48.
DiFelice, P. & Fonzi, G. (1998), How to Write Comments Suitable for Automatic Software
Indexing, Journal of Systems and Software 42, 17–28.
Dubinsky, E., Freudenberger, S., Schonberg, E. & Schwartz, J. T. (1989), Reusability of Design
for Large Software Systems: An Experiment with the SETL Optimizer, in T. J. Biggerstaff
& A. J. Perlis (eds.), Software Reusability, Vol. I, ACM Press, New York, pp. 275–294.
Dusink, L. & Van Katwijk, J. (1995), Reuse Dimensions, in Proceedings of ACM Symposium
on Software Reuse (SSR’95), ACM Press, Seattle, WA, pp. 137–149.
Engelbart, D. C. (1990), Knowledge-Domain Interoperability and an Open Hyperdocument
System, in Proceedings of Computer Supported Cooperative Work 1990, ACM Press, New
York, pp. 143–156.
Etzkorn, L. H. & Davis, C. G. (1997a), Automated Object-Oriented Reusable Component Identification, Knowledge-Based Systems 9(8), 517–524.
Etzkorn, L. H. & Davis, C. G. (1997b), Automatically Identifying Reusable OO Legacy Code,
IEEE Computer 30(10), 66–71.
Fafchamps, D. (1994), Organizational Factors and Reuse, IEEE Software 11(5), 31–41.
Feather, M. S. (1989), Reuse in the Context of a Transformation-Based Methodology, in T. J.
Biggerstaff & A. J. Perlis (eds.), Software Reusability, ACM Press, New York, pp. 337–
360.
Fichman, R. G. & Kemerer, C. E. (1997), Object Technology and Reuse: Lessons from Early
Adopters, IEEE Software 14(10), 47–59.
Fischer, G. (1987), A Critic for LISP, in J. McDermott (ed.), Proceedings of the 10th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, Los Altos, CA, pp.
177–184.
Fischer, G. (1991), Supporting Learning on Demand with Design Environments, in L. Birnbaum
(ed.), International Conference on the Learning Sciences, Association for the Advancement of Computing in Education, Evanston, IL, pp. 165–172.
Fischer, G. (1993), Shared Knowledge in Cooperative Problem-Solving Systems—Integrating
Adaptive and Adaptable Components, in M. Schneider-Hufschmidt, T. Kuehme & U. Malinowski (eds.), Adaptive User Interfaces: Principles and Practice, Elsevier Science Publishers, Amsterdam, pp. 49–68.
Fischer, G. (1994), Domain-Oriented Design Environments, Automated Software Engineering
1(2), 177–203.
Fischer, G. (1998a), Beyond ’Couch Potatoes’: From Consumers to Designers, in Proceedings
of 1998 Asia-Pacific Computer and Human Interaction, IEEE Computer Society, Kanagawa, Japan, pp. 2–9.
Fischer, G. (1998b), Seeding, Evolutionary Growth and Reseeding: Constructing, Capturing
and Evolving Knowledge in Domain-Oriented Design Environments, Automated Software
Engineering 5(4), 447–464.
Fischer, G. (2001), User Modeling in Human-Computer Interaction, User Modeling and User-Adapted Interaction (to appear).
Fischer, G. & Eisenberg, M. (1994), Programmable Design Environments: Integrating End-User Programming with Domain-Oriented Assistance, in Human Factors in Computing
Systems, CHI’94 Conference Proceedings, Boston, MA, pp. 431–437.
Fischer, G. & Mastaglio, T. (1989), Computer-Based Critics, in Proceedings of the 22nd Annual Hawaii Conference on System Sciences (HICSS-22), Vol. III: Decision Support and
Knowledge Based Systems Track, IEEE Computer Society, Kailua-Kona, HI, pp. 427–436.
Fischer, G. & Nieper-Lemke, H. (1989), HELGON: Extending the Retrieval by Reformulation
Paradigm, in Human Factors in Computing Systems, CHI’89 Conference Proceedings,
Austin, TX, pp. 357–362.
Fischer, G. & Reeves, B. N. (1995), Beyond Intelligent Interfaces: Exploring, Analyzing and
Creating Success Models of Cooperative Problem Solving, in R. Baecker, J. Grudin,
W. Buxton & S. Greenberg (eds.), Readings in Human-Computer Interaction: Toward
the Year 2000, 2nd edition, Morgan Kaufmann, San Francisco, CA, pp. 822–831.
Fischer, G. & Schneider, M. (1984), Knowledge-Based Communication Processes in Software
Engineering, in Proceedings of 7th International Conference on Software Engineering
(ICSE’84), IEEE Computer Society, Orlando, FL, pp. 358–368.
Fischer, G. & Ye, Y. (2001), Personalizing Delivered Information in a Software Reuse Environment, in Proceedings of User Modeling 2001, Sonthofen, Germany, (to appear).
Fischer, G., Henninger, S. & Redmiles, D. (1991), Cognitive Tools for Locating and Comprehending Software Objects for Reuse, in Proceedings of 13th International Conference on
Software Engineering (ICSE’91), IEEE Computer Society, Austin, TX, pp. 318–328.
Fischer, G., Lemke, A. C. & Schwab, T. (1985), Knowledge-Based Help Systems, in Human
Factors in Computing Systems, CHI’85 Conference Proceedings, San Francisco, CA, pp.
161–167.
Fischer, G., Nakakoji, K., Ostwald, J., Stahl, G. & Sumner, T. (1993), Embedding Critics in
Design Environments, The Knowledge Engineering Review Journal 8(4), 285–307.
Fischer, G., Nakakoji, K., Ostwald, J., Stahl, G. & Sumner, T. (1998), Embedding Critics in
Design Environments, in M. T. Maybury & W. Wahlster (eds.), Readings in Intelligent
User Interfaces, Morgan Kaufmann Publisher, pp. 537–559.
Fischer, G., Redmiles, D., Williams, L., Puhr, G., Aoki, A. & Nakakoji, K. (1995), Beyond
Object-Oriented Development: Where Current Object-Oriented Approaches Fall Short,
Human-Computer Interaction, Special Issue on Object-Oriented Design 10(1), 79–119.
Flanagan, D. (1997), JAVA in a Nutshell, 2nd edition, O’Reilly & Associates, Sebastopol, CA.
Frakes, W. & Terry, C. (1996), Software Reuse: Metrics and Models, ACM Computing Surveys
28(2), 415–435.
Frakes, W. B. & Fox, C. J. (1995), Sixteen Questions about Software Reuse, Communications
of the ACM 38(6), 75–87.
Frakes, W. B. & Fox, C. J. (1996), Quality Improvement Using a Software Reuse Failure Modes
Model, IEEE Transactions on Software Engineering 22(4), 274–279.
Frakes, W. B. & Pole, T. P. (1994), An Empirical Study of Representation Methods for Reusable
Software Components, IEEE Transactions on Software Engineering 20(8), 617–630.
Furnas, G. W., Landauer, T. K., Gomez, L. M. & Dumais, S. T. (1987), The Vocabulary Problem
in Human-System Communication, Communications of the ACM 30(11), 964–971.
Gamma, E., Johnson, R., Helm, R. & Vlissides, J. (1994), Design Patterns: Elements of
Reusable Object-Oriented Software, Addison-Wesley, Reading, MA.
Ghezzi, C., Jazayeri, M. & Mandrioli, D. (1991), Fundamentals of Software Engineering, Prentice Hall, Englewood Cliffs, NJ.
Girardi, M. R. & Ibrahim, B. (1995), Using English to Retrieve Software, Journal of Systems
and Software 30, 249–270.
Girgensohn, A. (1992), End-User Modifiability in Knowledge-Based Design Environments,
Ph.D. Dissertation, University of Colorado at Boulder.
Gosling, J., Joy, B. & Steele, G. (1996), The Java Language Specification, 2nd edition, Addison-Wesley, Reading, MA.
Graham, I. (1995), Reuse: A Key to Successful Migration, Object Magazine 5(6), 82–83.
Griss, M. L. (2000), Implementing Product-Line Features with Component Reuse, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 137–152.
Grudin, J. (1994), Groupware and Social Dynamics: Eight Challenges for Developers, Communications of the ACM 37(1), 92–105.
Halasz, F. G. (1988), Reflections on NoteCards: Seven Issues for the Next Generation of Hypermedia Systems, Communications of the ACM 31(7), 836–852.
Hall, R. J. (1993), Generalized Behavior-Based Retrieval, in Proceedings of 15th International
Conference on Software Engineering (ICSE’93), ACM Press, Baltimore, MD, pp. 371–
380.
Hallsteinsen, S. & Paci, M. (eds.) (1997), Experiences in Software Evolution and Reuse: Twelve
Real World Projects, Springer-Verlag, Berlin.
Harman, D. (1995), Overview of the Third REtrieval Conference (TREC-3), in D. Harman
(ed.), Overview of the Third REtrieval Conference, National Institute of Standards and
Technology Special Publication, Gaithersburg, MD, pp. 1–21.
Hayes, J. R. & Simon, H. A. (1977), Psychological Differences among Problem Isomorphs,
in N. J. Castellan, D. B. Pisoni & G. R. Potts (eds.), Cognitive Theory, Vol. 2, Erlbaum,
Hillsdale, NJ.
Hayes-Roth, B. & Hayes-Roth, F. (1979), A Cognitive Model of Planning, Cognitive Science
3, 275–310.
Helm, R. & Maarek, Y. S. (1991), Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object-Oriented Class Libraries, in Proceedings
of the 1991 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA’91), pp. 47–61.
Henderson-Sellers, B. & Edwards, J. M. (1990), The Object-Oriented Systems Life Cycle,
Communications of the ACM 33(9), 143–159.
Henninger, S. (1993), Locating Relevant Examples for Example-Based Software Design, Ph.D.
Dissertation, University of Colorado at Boulder.
Henninger, S. (1997), An Evolutionary Approach to Constructing Effective Software Reuse
Repositories, ACM Transactions on Software Engineering and Methodology 6(2), 111–
140.
Hoc, J.-M., Green, T. R. G., Samurcay, R. & Gilmore, D. J. (eds.) (1990), Psychology of Programming, Academic Press, New York.
Horvitz, E., Jacobs, A. & Hovel, D. (1999), Attention-Sensitive Alerting, in Proceedings of
Conference on Uncertainty and Artificial Intelligence 1999, Morgan Kaufmann, San Francisco, CA, pp. 305–313.
Isoda, S. (1995), Experiences of a Software Reuse Project, Journal of Systems and Software
30, 171–186.
Jarzabek, S. & Huang, R. (1998), The Case for User-Centered CASE Tools, Communications
of the ACM 41(8), 93–99.
Johnson, R. E. (1997), Components, Frameworks, Patterns, in Proceedings of ACM Symposium
on Software Reuse (SSR’97), ACM Press, Boston, MA, pp. 10–17.
Jones, M. P. (1997), Spoken-Language Help for High-Functionality Applications, Ph.D. Dissertation, University of Colorado at Boulder.
Jones, T. C. (1984), Reusability in Programming: A Survey of the State of the Art, IEEE Transactions on Software Engineering SE-10(5), 1984.
Joos, R. (1994), Software Reuse at Motorola, IEEE Software 11(5), 42–47.
Jurafsky, D. & Martin, J. (2000), Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall,
Upper Saddle River, NJ.
Kang, K. C. (1998), Feature-Oriented Development of Applications for a Domain, in W. Frakes
(ed.), Systematic Software Reuse, Annals of Software Engineering 5, Baltzer Science Publishers, Bussum, The Netherlands, pp. 143–168.
Kintsch, W. (1998), Comprehension: A Paradigm for Cognition, Cambridge University Press,
Cambridge, UK.
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R. & Riedl, J. (1997),
GroupLens: Applying Collaborative Filtering to Usenet News, Communications of the ACM
40(3), 77–87.
Krueger, C. W. (1992), Software Reuse, ACM Computing Surveys 24(2), 131–183.
Landauer, T. K. & Dumais, S. T. (1997), A Solution to Plato’s Problem: The Latent Semantic
Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review 104(2), 211–240.
Lange, B. M. & Moher, T. G. (1989), Some Strategies of Reuse in An Object-oriented Programming Environment, in Human Factors in Computing Systems, CHI’89 Conference
Proceedings, ACM Press, Austin, TX, pp. 69–73.
Lieberman, H. (1997), Autonomous Interface Agents, in Human Factors in Computing Systems,
CHI’97 Conference Proceedings, ACM Press, Atlanta, GA, pp. 67–74.
Lim, W. C. (1994), Effects of Reuse on Quality, Productivity and Economics, IEEE Software
11(5), 23–29.
Maarek, Y. S., Berry, D. M. & Kaiser, G. E. (1991), An Information Retrieval Approach for Automatically Constructing Software Libraries, IEEE Transactions on Software Engineering
17(8), 800–813.
Meyer, B. (1997), Object-Oriented Software Construction, 2nd edition, Prentice Hall.
Michail, A. & Notkin, D. (1999), Assessing Software Libraries by Browsing Similar Classes,
Functions and Relationships, in Proceedings of 21st International Conference on Software
Engineering (ICSE’99), ACM Press, Los Angeles, CA, pp. 463–472.
Mili, A., Mili, R. & Mittermeir, R. (1997a), Storing and Retrieving Software Components: A
Refinement-Based System, IEEE Transactions on Software Engineering 23(7), 445–460.
Mili, A., Yacoub, S., Addy, E. & Hafedh, M. (1999), Toward an Engineering Discipline of
Software Reuse, IEEE Software 16(5), 22–31.
Mili, H., Ah-Ki, E., Grodin, R. & Mcheick, H. (1997b), Another Nail to the Coffin of Faceted
Controlled-Vocabulary Component Classification and Retrieval, in Proceedings of Symposium on Software Reuse (SSR’97), ACM Press, Boston, MA, pp. 89–98.
Mili, H., Mili, F. & Mili, A. (1995), Reusing Software: Issues and Research Directions, IEEE
Transactions on Software Engineering 21(6), 528–562.
Morisio, M., Seaman, C. B., Parra, A. T., Basili, V. R., Kraft, S. E. & Condon, S. E. (2000),
Investigating and Improving a COTS-Based Software Development Process, in Proceedings of 22nd International Conference on Software Engineering (ICSE’00), ACM Press,
Limerick, Ireland, pp. 31–40.
Murray, D. M. (1987), Embedded User Models, in H.-J. Bullinger & B. Shackel (eds.), Proceedings of Human-Computer Interaction (INTERACT’87), Elsevier, Amsterdam, pp. 228–235.
Nakakoji, K. (1993), Increasing Shared Understanding of a Design Task between Designers
and Design Environments: The Role of a Specification Component, Ph.D. Dissertation,
University of Colorado at Boulder.
Nakakoji, K., Yamamoto, Y., Suzuki, T., Takada, S. & Gross, M. D. (1998), From Critiquing to Representational Talkback: Computer Support for Revealing Features in Design, Knowledge-Based Systems 11(7-8), 457–468.
Nardi, B. A., Miller, J. R. & Wright, D. J. (1998), Collaborative, Programmable Intelligent
Agents, Communications of the ACM 41(3), 96–104.
Neal, L. (1996), Support for Software Design, Development and Reuse through an Example-Based Environment, in G. Szwillus & L. Neal (eds.), Structure-Based Editors and Environments, Academic Press, San Diego, CA, pp. 185–192.
Norman, D. (1986), Cognitive Engineering, in D. Norman & S. Draper (eds.), User Centered
System Design, New Perspectives on Human-Computer Interaction, Erlbaum, Hillsdale,
NJ, pp. 31–61.
Norman, D. (1993), Things That Make Us Smart, Addison-Wesley, Reading, MA.
Ostertag, E., Hendler, J., Prieto-Diaz, R. & Braun, C. (1992), Computing Similarity in a Reuse
Library System: An AI-Based Approach, ACM Transactions on Software Engineering
and Methodology 1(3), 205–228.
Owen, D. (1986), Answers First, Then Questions, in D. Norman & S. Draper (eds.), User
Centered System Design, New Perspectives on Human-Computer Interaction, Erlbaum,
Hillsdale, NJ, pp. 361–375.
Pennington, N. & Grabowski, B. (1990), The Tasks of Programming, in J.-M. Hoc, T. R. G.
Green, R. Samurcay & D. J. Gilmore (eds.), Psychology of Programming, Academic Press,
New York, pp. 45–61.
Perry, D. & Wolf, A. (1992), Foundations for the Study of Software Architecture, ACM SIGSOFT Software
Engineering Notes 17(4), 40–52.
Podgurski, A. & Pierce, L. (1993), Retrieving Reusable Software by Sampling Behavior, ACM
Transactions on Software Engineering and Methodology 2(3), 286–303.
Prieto-Diaz, R. (1991), Implementing Faceted Classification for Software Reuse, Communications of the ACM 34(5), 88–97.
Rada, R. (1995), Software Reuse: Principles, Methodologies and Practices, Ablex, Norwood,
NJ.
Raymond, E. S. & Young, B. (2001), The Cathedral and the Bazaar: Musings on Linux and Open
Source by an Accidental Revolutionary, rev. edition, O’Reilly & Associates.
Redmiles, D. F. (1992), From Programming Tasks to Solutions: Bridging the Gap through the
Explanation of Examples, Ph.D. Dissertation, University of Colorado at Boulder.
Reeves, B. N. (1991), Locating the Right Object in a Large Hardware Store – An Empirical
Study of Cooperative Problem Solving among Humans, Technical Report CU-CS-523-91,
Department of Computer Science, University of Colorado.
Reeves, B. N. (1993), Supporting Collaborative Design by Embedding Communication and
History in Design Artifacts, Ph.D. Dissertation, University of Colorado at Boulder.
Reisberg, D. (1997), Cognition, W. W. Norton & Company, New York.
Repenning, A. (1993), Agentsheets: A Tool for Building Domain-Oriented Visual Programming Environments, in Human Factors in Computing Systems, CHI’93 Conference Proceedings, ACM Press, Amsterdam, pp. 142–143.
Rhodes, B. J. & Starner, T. (1996), Remembrance Agent: A Continuously Running Automated
Information Retrieval System, in Proceedings of 1st International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology, London, pp. 487–495.
Rich, C. H. & Waters, R. C. (1988), Automatic Programming: Myths and Prospects, IEEE Computer 21(8), 40–51.
Rich, C. H. & Waters, R. C. (1990), The Programmer’s Apprentice, Addison-Wesley, Reading,
MA.
Rist, R. S. (1995), Program Structure and Design, Cognitive Science pp. 507–562.
Rittel, H. (1984), Second-Generation Design Methods, in N. Cross (ed.), Developments in Design Methodology, Wiley, New York, pp. 317–327.
Rittri, M. (1989), Using Types as Search Keys in Function Libraries, Journal of Functional
Programming 1(1), 71–89.
Robbins, J. E. & Redmiles, D. F. (1998), Software Architecture Critics in the Argo Design
Environment, Knowledge-Based Systems 11, 47–60.
Roberts, R. M. (1989), Serendipity: Accidental Discoveries in Science, Wiley, New York.
Robertson, G. G., Card, S. K. & Mackinlay, J. D. (1993), Information Visualization Using 3D
Interactive Animation, Communications of the ACM 36(4), 57–71.
Robertson, S. E. (1977), The Probability Ranking Principle in IR, Journal of Documentation
33(4), 294–304.
Robertson, S. E. & Walker, S. (1994), Some Simple Effective Approximations to the 2-Poisson
Model for Probabilistic Weighted Retrieval, in W. B. Croft & C. J. Van Rijsbergen (eds.),
Proceedings of the 17th International ACM-SIGIR Conference, Springer-Verlag, Dublin,
Ireland, pp. 232–241.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M. & Gatford, M. (1995), Okapi
at TREC-3, in D. K. Harman (ed.), The 3rd Text REtrieval Conference (TREC-3), National
Institute of Standards and Technology, Gaithersburg, MD, pp. 109–126.
Rosenbaum, S. & DuCastel, B. (1995), Managing Software Reuse–An Experience Report, in
Proceedings of 17th International Conference on Software Engineering (ICSE’95), ACM
Press, Seattle, WA, pp. 105–111.
Rucker, J. & Polanco, M. J. (1997), Siteseer: Personalized Navigation for the Web, Communications of the ACM 40(3), 73–75.
Salton, G. & McGill, M. J. (1983), Introduction to Modern Information Retrieval, McGraw-Hill, New York.
Schön, D. A. (1983), The Reflective Practitioner: How Professionals Think in Action, Basic
Books, New York.
Sen, A. (1997), The Role of Opportunism in the Software Design Reuse Process, IEEE Transactions on Software Engineering 23(7), 418–436.
Shaw, M. & Garlan, D. (1996), Software Architecture: Perspectives on an Emerging Discipline,
Prentice Hall, Upper Saddle River, NJ.
Shneiderman, B. (1998), Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd edition, Addison-Wesley, Reading, MA.
Simon, H. A. (1996), The Sciences of the Artificial, third edition, The MIT Press, Cambridge,
MA.
Soloway, E. & Ehrlich, K. (1984), Empirical Studies of Programming Knowledge, IEEE Transactions on Software Engineering SE-10(5), 595–609.
Stringer-Calvert, D. W. J. (1994), Signature Matching for Ada Software Reuse, Master’s thesis,
University of York, UK.
Stroustrup, B. (1995), The C++ Programming Language, 2nd edition, Addison-Wesley, Reading, MA.
Sumner, T. (1995), Designers and Their Tools: Computer Support for Domain Construction,
Ph.D. Dissertation, University of Colorado at Boulder.
Szwillus, G. & Neal, L. (eds.) (1996), Structure-Based Editors and Environments, Academic
Press, New York.
Taylor, R. N., Medvidovic, N., Anderson, K. M., Whitehead, E. J., Robbins, J. E., Nies, K. A.,
Oreizy, P. & Dubrow, D. L. (1996), A Component- and Message-Based Architectural Style
for GUI Software, IEEE Transactions on Software Engineering 22(6), 390–406.
Terveen, L., Hill, W., Amento, B., McDonald, D. & Creter, J. (1997), PHOAKS: A System for
Sharing Recommendations, Communications of the ACM 40(3), 59–62.
Thomas, C. G. (1996), To Assist the User: On the Embedding of Adaptive and Agent-Based
Mechanisms, R. Oldenbourg Verlag.
Thomas, W. M., Delis, A. & Basili, V. R. (1997), An Analysis of Errors in a Reuse-Oriented
Development Environment, Journal of Systems and Software 38, 211–224.
Tracz, W. (1990), The 3 Cons of Software Reuse, in Proceedings of the 3rd Annual Workshop
on Institutionalizing Software Reuse (WISR ’90), Syracuse, NY.
Van Rijsbergen, C. J. (1979), Information Retrieval, 2nd edition, Butterworths, London.
Virvou, M. & Du Boulay, B. (1999), Human Plausible Reasoning for Intelligent Help, User
Modeling and User-Adapted Interaction 9, 321–375.
Visser, W. (1990), More or Less Following a Plan During Design: Opportunistic Deviations in
Specification, International Journal of Man-Machine Studies 33(3), 247–278.
Wahlster, W. & Kobsa, A. (1989), User Models in Dialog Systems, in W. Wahlster & A. Kobsa
(eds.), User Models in Dialog Systems, Springer-Verlag, New York, pp. 4–34.
Walker, S., Robertson, S. E., Boughanem, M., Jones, G. J. F. & Sparck Jones, K. (1998), Okapi at TREC-6: Automatic ad hoc, VLC, Routing, Filtering and QSDR, in D. K. Harman (ed.), The
6th Text REtrieval Conference (TREC-6), National Institute of Standards and Technology,
Gaithersburg, MD, pp. 125–136.
Williams, M. D., Tou, F. N., Fikes, R., Henderson, A. & Malone, T. W. (1982), RABBIT:
Cognitive Science in Interface Design, in Proceedings of the 4th Annual Conference of the
Cognitive Science Society, Cognitive Science Society, Ann Arbor, MI, pp. 82–85.
Wing, J. M. (1990), A Specifier’s Introduction to Formal Methods, IEEE Computer 23(9), 8–24.
Winograd, T. (1995), From Programming Environments to Environments for Designing, Communications of the ACM 38(6), 65–74.
Winograd, T. & Flores, F. (1986), Understanding Computers and Cognition: A New Foundation
for Design, Ablex, Norwood, NJ.
Woods, S. & Yang, Q. (1996), The Program Understanding Problem: Analysis and a Heuristic Approach, in Proceedings of 18th International Conference on Software Engineering
(ICSE’96), ACM Press, Berlin, Germany, pp. 6–15.
Ye, Y. (1996), TCARE–Total Computer Aided Reverse Engineering Tool, in Proceedings of International Symposium on Software Engineering for the Next Generation, Nagoya, Japan,
pp. 89–95.
Ye, Y. (1998), Supporting Incremental Learning with Active Accumulative and Adaptable Documentation, in Proceedings of International Symposium on Future Software Technology
1998, Software Engineers Association, Hangzhou, China, pp. 185–190.
Ye, Y. (2001a), An Active and Adaptive Reuse Repository System, in Proceedings of 34th
Hawaii International Conference on System Sciences (HICSS-34), IEEE Press, Maui, HI,
pp. CD–ROM.
Ye, Y. (2001b), Information Enriched Workspaces, in Proceedings of INTERACT’01, Tokyo,
Japan, (to appear).
Ye, Y. & Fischer, G. (2000), Promoting Reuse with Active Reuse Repository Systems, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag,
Vienna, Austria, pp. 302–317.
Ye, Y. & Reeves, B. (2000), An Active and Intelligent Agent for Component Location, in
Proceedings of Software Symposium 2000, Software Engineers Association, Kanazawa,
Japan, pp. 67–74.
Ye, Y., Fischer, G. & Reeves, B. (2000), Integrating Active Information Delivery and Reuse
Repository Systems, in Proceedings of ACM SIGSOFT 8th International Symposium on
the Foundations of Software Engineering, ACM Press, San Diego, CA, pp. 60–68.
Zand, M., Arango, G., Davis, M., Johnson, R., Poulin, J. S. & Watson, A. (1997), Reuse R&D:
Is It on the Right Track, in Proceedings of ACM Symposium on Software Reuse (SSR’97),
ACM Press, Boston, MA, pp. 212–216.
Zaremski, A. M. & Wing, J. M. (1995), Signature Matching: A Tool for Using Software Libraries, ACM Transactions on Software Engineering and Methodology 4(2), 146–170.
Zaremski, A. M. & Wing, J. M. (1997), Specification Matching of Software Components, ACM
Transactions on Software Engineering and Methodology 6(4), 333–369.
Zave, P. & Schell, W. (1984), Salient Features of an Executable Specification Language and Its
Environment, IEEE Transactions on Software Engineering SE-12(2), 312–325.
Appendix A
The List of Queries and Relevant Components
The following table (next two pages) includes the queries and relevant components used
in the evaluation of retrieval mechanisms (Section 8.1). In the table, the Query Contents column
shows the queries submitted to the system. Queries 1 through 10 were created by me; queries
11 through 14 were extracted from newsgroups; and queries 15 through 19 were extracted from
experiments. The Component Name column shows the names (including both class names and
method names, and input parameter types if the component is polymorphic) of pre-determined
relevant components. The Rank column shows the rank of the component returned by each
retrieval mechanism.
[Table not reproduced: the two-page table of queries, with columns Query Contents, Component Name, and Rank, is not recoverable from the source text.]
Appendix B
Questions Asked in the Post-Experiment Interview
Q1: How many years of programming experience do you have?
Q2: How many projects have you participated in? How large were they?
Q3: What is your current major programming language?
Q4: When did you start to learn Java? How often do you program with it?
Q5: What do you think your programming level in Java is, on a scale from 1 (beginner) to 10
(guru)?
Q6: Do you write comments in general? And in Java in particular?
Q7: Did you find the automatically delivered components useful to your programming tasks?
Please give an explicit example.
Q8: Did you learn something from the deliveries? For example, was there a delivered
component that you did not use at the moment of delivery but used later, or that you think
you will use from now on? Please give examples.
Q9: Did you find that some known components were delivered?
Q10: Did the Reusable-Component-Information-display distract your attention from your work?
If yes, did you want to turn it off?
Q11: Did you find the system useful overall?
Q12: What part of the system did you like most?
Q13: What part of the system did you not like most?
Q14: On a scale from 1 (totally useless) to 10 (extremely useful), how do you rate the system?
Q15: Do you want to use the system as your daily programming environment?
Q16: Do you have any suggestions or comments on the system and the experiment?
Appendix C
Abbreviations
ADA     Apple Data Detector
API     Application Programmer Interface
COTS    Commercial Off-The-Shelf
GUI     Graphic User Interface
HLL     High-Level Language
HTML    Hyper Text Markup Language
JDK     Java Development Kit
JGL     Java General Library
LSA     Latent Semantic Analysis
LCM     Location, Comprehension, and Modification
NIH     Not Invented Here
SER     Seeding, Evolutionary growth, and Reseeding
RCI     Reusable Component Information
URL     Uniform Resource Locator
VHLL    Very High-Level Language
Appendix D
Glossary
Abstraction Mismatch, 36, 144
The difference in abstraction level between reuse requirements and component descriptions.
Programmers deal with concrete problems and thus tend to describe their requirements
concretely, whereas reusable components are often described in abstract concepts because they are designed to be generic so they can be reused in many different situations.
Related Terms: Vocabulary Mismatch
Action-Present, 56, 135
Action-present is the period of time in which users have decided what to do but have
not yet executed the needed operations to change the situation. (Schön, 1983)
Active Component Repository System, 6, 18, 51, 52-55, 117
Active component repository systems support the information delivery mechanism
which presents context-sensitive components to programmers without being given explicit reuse queries. (Ye & Fischer, 2000)
Related Terms: Active Information System, Information Delivery
Active Information System, 6, 55
Active information systems are information systems that actively deliver information
to users. The challenge in implementing active information systems is to contextualize
the information to the task and to the background knowledge of users. (Fischer et al.,
1998)
Related Terms: Information Delivery
Adaptive, 74, 109, 146
A characteristic of a system. A system is adaptive if it changes its behavior by itself.
Adaptive user modeling (or discourse modeling) means that the system automatically
updates user models (or discourse models) based on information observed or inferred
from monitoring users’ interactions with the system. (Fischer, 1993; Fischer & Ye,
2001)
Related Terms: Adaptable
Adaptable, 74, 109
A characteristic of a system. A system is adaptable if its behavior can be adjusted by
users to their own needs. Adaptable user modeling (or discourse modeling) means that
users can directly modify their own user models (or discourse models). (Fischer, 1993;
Fischer & Ye, 2001)
Related Terms: Adaptive
Black-Box Reuse, 25
In black-box reuse, a component is directly reused without modification. A component
can be reused as it is or reused through inheritance if the programmer creates a specialized subclass of an existing class component.
Related Terms: Component Reuse, Glass-Box Reuse, White-Box Reuse
Building Block, 14, 15
Building blocks are the primitive elements provided by a programming language. They
include basic statements of a programming language and reusable software components in repositories or libraries.
Related Terms: Component
Component, 15
Components are software modules that have been packaged for reuse. Both classes and
methods are components.
Related Terms: Module, Component Reuse
Component-Based Development, 25
An approach of creating new software systems by reusing existing software components. Component-based development improves the quality and productivity of software development.
Related Terms: Component Reuse
Component Reuse, 25
An approach of creating new software systems by reusing existing software components. Component reuse has three forms: black-box reuse, white-box reuse and glass-box reuse.
Related Terms: Component-Based Development
Cognitive Engineering, 32, 33
Cognitive engineering is the process of applying what is known from cognitive science
to the design and construction of tools that assist cognitive activities of human beings.
(Norman, 1986)
Component Repository System, 1, 40, 49, 51
An information system that supports the locating of reusable software components. It
has three connotations: a collection of reusable components, a retrieval mechanism,
and a retrieval interface.
Concept Similarity, 69, 83, 104
The similarity between the concept of the current programming task, as revealed
through comments and identifiers in the program under development, and the concept revealed in the documentation of reusable components. The concept of a program is its
functional purpose, or goal. A reusable component from the repository whose concept is similar to the concept of the program under development has a high probability
of being reused in the current situation. Concept similarity can be determined by using information retrieval techniques such as probabilistic models and latent semantic
analysis.
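As a rough illustration of how the concept of a program under development might be read off its comments and identifiers, the following minimal sketch turns a comment and a list of identifiers into a bag of query terms and compares it with the terms of a component description. The function names, the stop-word list, and the use of simple term overlap are assumptions made for this example; the probabilistic model and latent semantic analysis named in this entry are more elaborate.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "is", "this"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronyms (URL), capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def concept_terms(comment, identifiers):
    """Collect query terms from a comment and the identifiers near it."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", comment)]
    for identifier in identifiers:
        words.extend(split_identifier(identifier))
    return [w for w in words if w not in STOP_WORDS]


def term_overlap(terms_a, terms_b):
    """Jaccard overlap of two term lists; a crude stand-in for similarity."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0


task = concept_terms("Create a random number between two limits",
                     ["getRandomNumber", "lower_bound"])
doc = concept_terms("Returns the next pseudorandom number", ["nextInt"])
print(task)
print(term_overlap(task, doc))
```

Splitting identifiers matters because a name such as getRandomNumber carries the same vocabulary as a comment would, only without spaces.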
Constraint Compatibility, 69, 83, 104
The compatibility existing between the constraints required for the program under development and those satisfied by components from the repository. The constraint of
a program regulates the environment in which it runs. For a component to be easily
reused in a programming task, it should have compatible constraints. Signature matching is a process of determining the syntactical compatibility between a component and
a program under development.
Related Terms: Signature Matching
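A minimal sketch of signature matching, assuming a signature is reduced to a tuple of parameter type names plus a return type name. It shows an exact match and a relaxed match that ignores parameter order, two of the match kinds discussed in the signature matching literature (Zaremski & Wing, 1995). The Signature type and both function names are hypothetical and are not taken from the system described in this dissertation.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Signature(NamedTuple):
    params: Tuple[str, ...]  # parameter type names, in declaration order
    returns: str             # return type name


def exact_match(query, component):
    """Same parameter types in the same order, and the same return type."""
    return query == component


def reorder_match(query, component):
    """Same return type and the same parameter types, in any order."""
    return (query.returns == component.returns
            and Counter(query.params) == Counter(component.params))


# Signature wanted by the program under development (illustrative).
wanted = Signature(params=("String", "int"), returns="String")
# Signature offered by a repository component (illustrative).
offered = Signature(params=("int", "String"), returns="String")

print(exact_match(wanted, offered))    # False: parameter order differs
print(reorder_match(wanted, offered))  # True: same types up to order
```

Comparing type names as plain strings is the simplest possible choice; it ignores subtyping and conversions, which a fuller treatment of syntactic compatibility would have to consider.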
Context-Sensitive Information, 6, 56, 99
Information that is relevant to the working context of users. Working context consists
of the task acted upon and the user acting; therefore, context-sensitive information is
related to both the task and the background knowledge of users. (Fischer & Ye, 2001)
Development with Reuse, 5, 46, 47, 48
The development-with-reuse paradigm views reuse as a stand-alone process, independent of the current programming process and environment. Programmers have to
change their current programming practice to embrace reuse. Component repository
systems designed to support development-with-reuse assume that programmers have
no difficulty in forming reuse intentions and formulating reuse queries. (Rada, 1995;
Ye, 2001a)
Related Terms: Reuse within Development
Discourse Model, 8, 71, 73, 78, 106-108, 114, 137
A discourse model represents the interaction history between the user and the system.
It captures the larger context of the current task and can improve the task-relevance of
information.
Einstellung, 45, 52
Human beings often display Einstellung in problem solving. Einstellung, the German
word for “attitude,” refers to the mechanization of problem-solving strategies. Once
problem solvers discover a strategy that “gets the job done,” they are less likely to discover new strategies until they are completely stuck. Einstellung is one of the cognitive
biases that prevent programmers from attempting to reuse because for most programmers, programming from scratch is the proven approach. (Reisberg, 1997; Ye et al.,
2000)
Feedforward, 56, 58
Information delivered during the period of action-present. Feedforward information
affects the execution of user actions. (Simon, 1996)
Glass-Box Reuse, 25, 39
In glass-box reuse, programmers do not directly reuse the component; instead, they
use it as an example for their own development. For instance, programmers can look
at examples to find out how a program plan is realized and build their own system
through analogy. Glass-box reuse contributes indirectly to the quality and productivity
of programming because examples can reduce the cognitive load of programmers.
High-Functionality Computer Systems, 42
Systems that contain thousands of items and whose description requires thousands of
pages. For these systems, complete understanding is impossible. Component repository systems are examples of high-functionality computer systems. Also known as
HFA (High-Functionality Applications). (Fischer, 2001)
Information Access, 5, 6, 51, 55
Information access requires users to start the information locating process through
browsing or querying. Users have to anticipate the existence of information, and know
how to search the information space by specifying their information needs in the form
of well-defined queries or engaging in a series of browsing actions.
Related Terms: Information Delivery
Information Delivery, 6, 7, 51, 55, 131-133
The information delivery mechanism presents information to users on its own initiative
without being prompted by explicit queries. Information delivery complements information access and is needed in situations where users are unable to articulate the need
for information or are unaware that they may profit from information. (Fischer, 1994)
Related Terms: Active Information System
Information Discernment, 39, 105, 139
One of the two stages needed for choosing the right component. At the stage of information discernment, programmers avoid spending too much time by quickly scanning
the component and its description to decide whether this component is related to their
current task, and thereby also avoid any deep understanding at this point. (Ye, 2001b)
Information-Enriched Workspace, 8, 49, 50, 52, 99
An information-enriched workspace is a special working environment that is augmented with an information display that constantly shows the information immediately
needed by users. In an information-enriched workspace, the cost structure of accessing
needed information is tuned to the requirements of the work process using it because it
provides immediate access to the most needed information for users without interrupting their workflow. (Ye, 2001b)
Information Island, 0, 52
In an information system, those items whose existence is not anticipated by users become information islands. Information access mechanisms offer little support for users
to reach information islands. In contrast, information delivery mechanisms can build a
bridge to information islands. (Engelbart, 1990)
Intrusiveness, 58
The intrusiveness of a system is the degree of users’ perception of being interrupted
from their current focus. Active information systems need to achieve the right balance
between the cost of intrusive interruptions and the loss of context-sensitivity of deferred
alerts by carefully considering when and how to deliver the information so that it can
be utilized best by users. (Horvitz et al., 1999)
Latent Semantic Analysis, 2, 83, 89
An extension of the vector space model. By constructing a large semantic space of
terms to capture the overall pattern of their associative relationship, LSA (Latent Semantic Analysis) is expected to facilitate concept-based retrieval and bridge the conceptual gap in formulating reuse queries. (Landauer & Dumais, 1997)
Related Terms: Vector Space Model
Learning on Demand, 35, 52, 142
Situated learning in a working context which occurs at the user’s discretion—often
triggered by a breakdown. However, if users are not aware of the existence of the
knowledge they need to learn for the working context, they may miss the learning
opportunity and settle on a suboptimal solution. Active information systems can create
learning opportunities. Learning-on-demand is the only feasible way for programmers
to learn about reusable components when the repository becomes very large. (Fischer,
1991)
Loss Aversion, 46
In the decision-making process, human beings have the tendency to be far more sensitive to potential loss than to potential gain. Loss aversion is one of the cognitive biases
that prevent programmers from attempting to reuse. Starting a reuse process requires a
mental switch. The demand on working memory and time is immediate, and the potential gain is unclear because programmers are not sure whether the needed component
exists, whether they are able to find it even if it exists, and whether they are able to
understand and modify it even if they find it. (Reisberg, 1997; Ye et al., 2000)
Module, 15
A module refers to a named and addressable abstraction in software—either a procedural abstraction such as a function, or a data abstraction such as a class. Procedures,
functions, methods and classes are all considered as software modules. In this dissertation, the term module refers to software abstractions to be developed by programmers.
Related Terms: Component
Opportunistic Programming, 17, 48, 141
Most programmers follow neither the top-down nor the bottom-up design strategy. In
fact, their programming activities are very opportunistic: They are a mixture of top-down and bottom-up strategies, and which strategy is chosen depends on the knowledge of individual programmers and the particular situation. Interim decisions made
during the programming process often can lead to subsequent decisions at arbitrary
points in the programming space. (Curtis et al., 1988; Visser, 1990)
Passive Component Repository System, 51, 53
A component repository system that supports information access only. Most current
component repository systems are passive, and they fall short in supporting programmers who make no attempt to reuse.
Related Terms: Information Access, Active Component Repository System
Plan Recognition, 61, 64-66
The plan recognition approach uses plans to describe a user task. A plan is a sequence
of user actions that achieve a certain goal. In general, a plan can be represented as a
rule consisting of two parts: the condition and the result. The condition part includes
a sequence of actions required to accomplish a task, and the result part is the intended
goal of the task. When the actions of a user match, completely or partially, the condition part, the system can infer that the user is performing that corresponding task, and
information about that task is delivered. (Fischer, 1987)
Related Terms: Similarity Analysis, Task Model
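The rule structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the mechanism of any particular system: the Plan class, the action names, the in-order matching rule, and the 0.5 threshold for a partial match are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    goal: str       # result part: the intended goal of the task
    actions: tuple  # condition part: ordered actions that accomplish the task


def match_fraction(plan, observed):
    """Fraction of the plan's actions that appear, in order, in the observed actions."""
    remaining = iter(observed)
    matched = 0
    for action in plan.actions:
        # any() consumes the iterator up to the first match, preserving order.
        if any(action == seen for seen in remaining):
            matched += 1
        else:
            break
    return matched / len(plan.actions)


def recognize(plans, observed, threshold=0.5):
    """Return goals of plans matched completely or partially (threshold is an assumption)."""
    scored = [(match_fraction(p, observed), p.goal) for p in plans]
    hits = [(score, goal) for score, goal in scored if score >= threshold]
    return [goal for score, goal in sorted(hits, key=lambda h: -h[0])]


# Hypothetical plans and user actions, for illustration only.
PLANS = [
    Plan("read a file line by line", ("open_file", "read_line", "close_file")),
    Plan("sort a list", ("create_list", "compare_items", "swap_items")),
]
OBSERVED = ["open_file", "declare_var", "read_line"]

print(recognize(PLANS, OBSERVED))
```

Here two of the three actions of the first plan have been observed, so the system would infer that the user is reading a file and could deliver information about that task.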
Precision, 120-123
A metric measuring the performance of an information retrieval system. It is the ratio
of the number of retrieved relevant items to the number of total retrieved items. Precision indicates the ability of the system to present only the relevant documents. (Salton
& McGill, 1983)
Related Terms: Recall
Probabilistic Model, 83, 86, 96
The probabilistic model ranks documents in decreasing order of their evaluated probability of relevance to a user query. It makes use of formal theories of probability and
statistics in order to evaluate, or estimate, the probabilities of relevance. (Robertson &
Walker, 1994; Crestani et al., 1998)
Program Plan, 14, 15-18, 23, 52
As a series of interconnecting actions to achieve a goal, a program plan provides a
skeleton structure for programs by abstracting key elements. Program plans are the
basic chunks used in program design and understanding. (Soloway & Ehrlich, 1984;
Rist, 1995)
Recall, 120-123
A metric measuring the performance of information retrieval systems. It is the ratio
of the number of retrieved relevant items to the number of total relevant items in the
collection. Recall indicates the ability of the system to present all relevant documents.
(Salton & McGill, 1983)
Related Terms: Precision
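The two ratios defined under Precision and Recall can be computed as follows. This is a minimal sketch: the component names are made up, and returning 0.0 when the denominator is empty is a convention assumed here rather than part of the definitions.

```python
def precision(retrieved, relevant):
    """Retrieved relevant items divided by all retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Retrieved relevant items divided by all relevant items in the collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical query result: four components retrieved, three relevant overall.
retrieved = ["a", "b", "c", "d"]
relevant = ["b", "d", "e"]

print(precision(retrieved, relevant))  # 2 of 4 retrieved items are relevant
print(recall(retrieved, relevant))     # 2 of 3 relevant items were retrieved
```

Retrieving more items tends to raise recall while lowering precision, which is why the two metrics are usually reported together.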
Retrieval by Reformulation, 2, 38, 77, 113-117, 137
A mechanism that allows users to incrementally improve their queries to match their
intentions after they have interpreted and evaluated the retrieved results and have explored the underlying structure of the information systems. (Williams et al., 1982;
Fischer & Nieper-Lemke, 1989)
Reuse-by-Anticipation, 41, 42-45, 53, 54
In the reuse-by-anticipation mode, programmers formulate reuse intentions based on
their anticipation of the existence of certain reusable components. (Ye & Fischer, 2000)
Related Terms: Reuse-by-Memory, Reuse-by-Recall
Reuse-by-Memory, 41, 42-45, 53, 54
In the reuse-by-memory mode, while designing a new program, programmers notice
similarities between the new program and reusable components that they have learned
in the past and know very well. Therefore, they can reuse them easily during the
programming, even without the support of a component repository system because
their memory assumes the role of the repository system. (Ye & Fischer, 2000)
Related Terms: Reuse-by-Recall, Reuse-by-Anticipation
Reuse-by-Recall, 41, 42-45, 53, 54
In the reuse-by-recall mode, while developing a new program, programmers vaguely
recall that the repository contains some reusable components with similar functionality,
but they do not remember exactly which components they are. They need to search the
repository to find what they need. (Ye & Fischer, 2000)
Related Terms: Reuse-by-Memory, Reuse-by-Anticipation
Reuse Repository System, 1, 40
A synonym of Component Repository System.
Reuse Process, 2, 26, 33, 34, 47-49
To reuse a software component from a repository, programmers have to go
through the reuse process, which has three steps: location, comprehension, and modification. (Fischer et al., 1991)
Reuse within Development, 5, 48, 49, 50
The reuse-within-development paradigm views reuse as an integral part of software
development and component repository systems as information systems that augment
programmers’ insufficient knowledge about reusable components and assist them in
accomplishing their tasks. It requires that the reuse process be smoothly melded into
the current programming process and environment so that there is no context change
from programming to reuse. (Ye, 2001a)
Related Terms: Development with Reuse
Reuse Utility, 43
The ratio of reuse value to reuse cost. If reuse utility is perceived as too low, programmers do not make an attempt to reuse. Reducing the reuse cost with the support of
appropriate repository systems and increasing the recognition of reuse value through
education are two approaches to increasing the reuse utility. (Ye et al., 2000)
Signature, 6, 68, 76, 96, 101-103, 134, 152
The type expression of a program. A signature captures the syntactic constraints of a
program by defining the types of input and output data.
Related Terms: Signature Matching, Constraint Compatibility
Signature Matching, 92
The process of determining the compatibility of two components in terms of their signatures. It is an indexing and retrieval mechanism based on type constraints of a module or a component. (Zaremski & Wing, 1995)
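As a rough illustration of signature matching, the sketch below compares a query signature with a candidate component's signature. The Signature type, the type names, and the choice of argument reordering as the only relaxed match are assumptions made for this example. It is not a description of the matching rules used in this dissertation.

```python
from collections import Counter
from typing import NamedTuple


class Signature(NamedTuple):
    inputs: tuple  # types of the input data, in declaration order
    output: str    # type of the output data


def exact_match(query, candidate):
    """Same input types in the same order, and the same output type."""
    return query == candidate


def reorder_match(query, candidate):
    """Relaxed match: same output type and same input types in any order."""
    return (query.output == candidate.output
            and Counter(query.inputs) == Counter(candidate.inputs))


# Hypothetical signatures: the query wants (String, int) -> char.
query = Signature(("String", "int"), "char")
candidate = Signature(("int", "String"), "char")
unrelated = Signature(("String",), "int")

print(exact_match(query, candidate))    # False: argument order differs
print(reorder_match(query, candidate))  # True: same types, different order
print(reorder_match(query, unrelated))  # False: different types
```

A relaxed match of this kind widens the set of retrieved components at the cost of returning some that need adaptation before they can be reused.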
Similarity Analysis, 62-66
An approach to capturing the tasks of users for the purpose of delivering task-relevant
information. It examines the contextual information surrounding the current focus of
users, and uses that contextual information to predict their information needs. Information from the repository that has high similarity to the contextual circumstance
is then delivered. Task-relevant information can be determined based on similar situations or similar information.
Related Terms: Plan Recognition, Task Model
Situation Model, 12, 36, 53
A user’s understanding of a problem. A situation model is characterized with respect to
the goal and background knowledge of a user. The conceptual gap between the situation model and the system model includes the vocabulary mismatch and the abstraction
mismatch. (Kintsch, 1998)
Related Terms: System Model
Software Agent, 8, 9, 99
A software entity that functions autonomously in response to the changes in its running
environment without requiring human guidance or intervention. (Bradshaw, 1997; Ye
& Reeves, 2000)
System Model, 36, 53
An actual model of the computer system. The system model of a component repository system consists of the terms used in the descriptions of its components and the repository
structure used to organize and store the components.
Task Model, 7, 61, 73, 77, 80, 100-102
An abstract representation of a user task. Appropriate task models are essential in
delivering task-relevant information. The acquisition of such task models is called task
modeling. Tasks can be modeled through either plan recognition or similarity analysis.
(Fischer & Ye, 2001)
Related Terms: Plan Recognition, Similarity Analysis
User Model, 6, 8, 73, 78, 80, 109-114, 124-125, 135-136, 145-146
A representation of a user’s knowledge about an information space. User models can
be used as a filter by the information system to ensure only unknown information
is presented. The process of acquiring user models is called user modeling. User
modeling can be adaptable or adaptive. (Fischer, 2001)
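The filtering role of a user model can be sketched as below. The sketch assumes the usual reading of the two modes: an adaptable model is edited explicitly by the user, and an adaptive model is updated by the system from observed behavior. The UserModel class, its method names, and the use-count threshold are hypothetical.

```python
class UserModel:
    """Set of components a user is assumed to know already."""

    def __init__(self, use_threshold=3):
        self.known = set()
        self.use_counts = {}
        self.use_threshold = use_threshold  # assumed cutoff for "well known"

    def mark_known(self, component):
        """Adaptable: the user states explicitly that a component is known."""
        self.known.add(component)

    def observe_use(self, component):
        """Adaptive: the system infers knowledge from repeated use."""
        self.use_counts[component] = self.use_counts.get(component, 0) + 1
        if self.use_counts[component] >= self.use_threshold:
            self.known.add(component)

    def filter(self, candidates):
        """Keep only components that are not yet known to the user."""
        return [c for c in candidates if c not in self.known]


model = UserModel(use_threshold=2)
model.mark_known("Random.nextInt")    # explicit edit by the user
model.observe_use("String.indexOf")   # first observed use
model.observe_use("String.indexOf")   # second use reaches the threshold

print(model.filter(["Random.nextInt", "String.indexOf", "Collections.shuffle"]))
```

Only the component the user has neither marked nor used repeatedly is passed on for presentation.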
White-Box Reuse, 25, 39
In white-box reuse, programmers reuse a component after they have modified
it to fit their needs. White-box reuse does not contribute as much to the easier
maintenance and evolution of software systems as black-box reuse does, but it can
reduce development time.
Vector Space Model, 85
An approach to free-text indexing and retrieval. Documents and queries are represented
as vectors of terms contained in the whole collection of documents, commonly known
as a corpus. The similarity between a query and a document is measured by the closeness of
their vectors in the vector space. (Jurafsky & Martin, 2000)
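A small sketch of vector space retrieval follows. It assumes raw term counts as vector weights and cosine similarity as the closeness measure. Both are common choices but are assumptions here, and the component names and descriptions are invented for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Order document names by decreasing similarity to the query."""
    q = term_vector(query)
    return sorted(documents,
                  key=lambda name: cosine_similarity(q, term_vector(documents[name])),
                  reverse=True)


# Hypothetical component descriptions forming a tiny corpus.
DOCS = {
    "shuffle": "randomly permute the elements of a list",
    "sort": "sort the elements of a list in ascending order",
    "random": "generate a random number",
}

print(rank("sort a list in order", DOCS))
```

Practical systems usually weight terms (for example by how rare they are in the corpus) and remove common words such as "a" and "the", which this sketch omits.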
Vocabulary Mismatch, 36
The vocabulary mismatch describes the phenomenon that people use a variety of words
to refer to the same concept. Studies have found that the probability that two persons
choose the same word to describe a concept is less than 20%. Even well-trained indexing experts have 20% disparity on average in choosing terms to describe the same
document. (Furnas et al., 1987; Harman, 1995)
Related Terms: Abstraction Mismatch