CONTEXT Controlled Experiment Survey
Introduction
“There is an increasing understanding in the software engineering (SE) community that empirical studies are needed to develop or improve processes, methods and tools for software development and maintenance” [1]. What research is actually being done in the field of software engineering? What kind of knowledge does the empirical field need?
CONTEXT is a project run by the Software Engineering group at Simula Research Laboratory (from now on referred to as Simula). It aims to answer questions like these and to position the research performed at Simula within this landscape. The project started during the fall of 2002 but was formally established in February 2003. It is a survey meant to produce an overview of all controlled experiments in the field of software engineering reported in 11 leading journals and conferences during the years 1993 through 2002. The objective of the survey is to examine, from a meta-perspective, how experiments have been performed, by analysing the articles with respect to a number of identified variables expected to be of interest to the SE community. We also hope to arrive at a qualified notion of which factors characterize a state-of-the-art experiment.
The project is carried out by members of the Software Engineering group at Simula and three master's students at the Department of Informatics, University of Oslo.
Method
Employees of the SE group at Simula have determined which journals and conferences we are going to analyse; these have been selected according to which sources the community itself considers the leading ones (reference to the article Amela mentioned). From these sources, two research assistants have performed a preliminary analysis of all articles to determine which ones actually report controlled experiments. Most of the papers, approximately 90%, were found either on the World Wide Web using the search engine “Alltheweb” or by accessing the online versions of the journals. The rest were found in the paper versions of the journals. It is not easy to decide which criteria a study must meet to be labelled a controlled experiment. To decide whether an article reports a controlled experiment, an instrumental working definition has been constructed from various sources as well as from the opinions of qualified Simula employees. The initial working definition was:
A controlled experiment is in our analysis viewed as a human activity in a controlled environment where at least one variable is measured and at least one background variable is varied, where a scientist has designed the experiment and the experiment is described as an experiment by the author.
Studies where background variables are not manipulated are left out. Examples of such studies are case studies and surveys.
After reading through the material several times, a new picture emerged: independent variables seemed less essential than first assumed. The research assistants formulated a new operational definition:
- Subjects are people.
- The setting is controlled. The experimenters manipulate the setting in a certain way. Differences or variations in the setting are called independent variables. Most experiments have one or more independent variables.
- It is decided in advance what is going to be measured/observed. Dependent variables are derived from the observed results, but they do not have to be established in advance.
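Purely as an illustration, the three criteria can be read as a screening predicate. The following Python sketch uses field names I have invented for this note; they are not taken from the actual CONTEXT database:

from dataclasses import dataclass

@dataclass
class PaperScreening:
    # Hypothetical screening record for one candidate paper.
    subjects_are_people: bool          # the subjects are human participants
    setting_is_controlled: bool        # the experimenters manipulate the setting
    measures_decided_in_advance: bool  # what to measure/observe was fixed up front

def is_controlled_experiment(p: PaperScreening) -> bool:
    # All three criteria of the operational definition must hold.
    # Note that independent variables are not strictly required.
    return (p.subjects_are_people
            and p.setting_is_controlled
            and p.measures_decided_in_advance)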
Among the 4851 papers initially examined, 174 were found to describe controlled experiments according to the working definition. These numbers have since changed somewhat, and they are likely to change again as some papers are excluded and others added according to the evolving understanding and specification of what a controlled experiment is. All of these papers are available both in printouts and electronically (as PDF documents), either gathered from the World Wide Web or scanned manually from paper hardcopies. The SE group has also decided which variables are of interest in the analysis; the outcome is approximately 30 variables. As the research assistants studied the papers, they analysed these variables in every paper they judged to be a controlled experiment and registered the values in an Access database.
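A minimal sketch of what such a registration could look like, assuming one row per experiment (I use Python's standard sqlite3 module as a stand-in for Access; the table name, column names and example row are all invented):

import sqlite3

conn = sqlite3.connect("context_survey.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS experiments (
        paper_id        INTEGER PRIMARY KEY,  -- one row per controlled experiment
        source          TEXT,                 -- journal or conference
        year            INTEGER,              -- 1993..2002
        total_subjects  INTEGER,
        active_subjects INTEGER
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO experiments VALUES (?, ?, ?, ?, ?)",
    (1, "IEEE TSE", 1999, 24, 22),  # invented example row
)
conn.commit()
conn.close()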
We are going to analyse a total of 150 to 200 articles reporting controlled experiments published in the 11 journals and conferences. The variables have been divided among the three master's students and one postdoc working at Simula, creating four “blocks” of variables that belong naturally together with respect to some aspect of the analysis. Some of these variables overlap, meaning that more than one person will analyse papers with respect to them; this is due both to the variables' relevance for each individual's thesis and to the need to verify their values. We have been given empty databases to populate with data from our analyses. These analyses will verify, and quite possibly correct, the results found in the initial analysis performed by the research assistants. After the individual analyses are finished, a fully populated database will be available for various statistical meta-analyses. The end result will be a report, and we expect the outcome of the CONTEXT project to be met with great interest by the international SE community.
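Since the overlapping variables are analysed by more than one person, the duplicate codings can be cross-checked mechanically before conflicts are resolved by discussion. A minimal sketch; the function and field names are hypothetical:

def cross_check(record_a: dict, record_b: dict, shared_fields: list) -> list:
    # Return the overlapping variables on which two analysts disagree
    # for the same paper, so the conflicts can be resolved manually.
    return [f for f in shared_fields if record_a.get(f) != record_b.get(f)]

# Hypothetical usage: two analysts coded the same paper.
conflicts = cross_check(
    {"students": 18, "professionals": 4},
    {"students": 20, "professionals": 4},
    ["students", "professionals"],
)
print(conflicts)  # prints ['students']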
Discussion
My thesis focuses on the role of the subjects participating in the experiments. Exactly which questions I will address will be determined along the way, as I become familiar with the way experiments are described and pick up useful information from the papers we read. Who are the subjects? What backgrounds do they have? How do people participating in SE experiments differ from participants in other disciplines? To what extent do the authors address the heterogeneity of the subjects due to their very different backgrounds, and what implications does this have for the experiments and their generalization?
Research in the software engineering field is difficult because more or less unique solutions are developed every time. It is problematic to generalize to large populations, so it is important to keep in mind which populations one wishes the results to apply to.
Students are commonly used as subjects because they are more easily accessible than professionals. They are cheap, more flexible regarding time, and sometimes experiments can be run as part of courses they are taking. How are results from students generalized to apply to professionals, and how is this accounted for? Are there differences of interest between subjects? How do we know the results are not due to chance? How realistic are the experiments?
The essential purpose of controlled experiments is to study cause-and-effect relations. You alter one variable, usually referred to as the independent variable or the treatment, and observe how the variation of this variable affects some other variable, referred to as the dependent variable. In some sciences this can be done mathematically, without considering potential confounding effects, but as in all other sciences studying human behaviour, SE experiments need to consider very carefully what causes other than the variation of the independent variables can influence the variation in the dependent variable. For instance, learning effects can threaten the results of a within-subjects experiment, as the subjects can learn from the first treatment and thus contaminate the results of the second treatment. A between-subjects design is therefore an alternative, but this design opens for uncertainty about whether the results are caused by differences between the subjects.
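The design choice also determines which statistical test tells us whether an observed difference could be due to chance. A sketch, assuming SciPy is available and using invented numbers:

from scipy import stats

# Invented task-completion times (minutes), for illustration only.
treatment_a = [30, 42, 38, 51, 29, 44]
treatment_b = [26, 39, 33, 45, 27, 40]

# Within-subjects: the same six subjects received both treatments,
# so the observations are paired (learning effects are a threat here).
t_w, p_w = stats.ttest_rel(treatment_a, treatment_b)

# Between-subjects: two different groups of six subjects,
# so the samples are independent (group differences are a threat here).
t_b, p_b = stats.ttest_ind(treatment_a, treatment_b)

print("within-subjects:  t=%.2f, p=%.3f" % (t_w, p_w))
print("between-subjects: t=%.2f, p=%.3f" % (t_b, p_b))

With these particular numbers the paired test yields a much smaller p-value, because pairing removes the variation between subjects; which design is preferable nevertheless depends on which threat (learning effects or group differences) matters more for the experiment at hand.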
Due to the diversity of people and backgrounds, qualitative methods (as opposed to quantitative methods) are often necessary, just as they are in humanistic disciplines.
The variables I am going to study are the following (a sketch of how they might be recorded follows the list):
Total number of participants
The total number of subjects participating in the experiment. Whether the subjects completed the experiment or were removed for some reason is not important; they are all counted in this field. This number varies from fewer than 10 to several hundred.
Active participants
The number of participating subjects who actually completed the experiment and were part of the analysis. Subjects may have been removed because they were considered outliers with inappropriate influence on the analysis of the results, because they did not follow the instructions, or because they were for some other reason not viewed as representative contributors. In experiments where this is the case, it is interesting to observe how this is described and whether the removed subjects are excluded from all parts of the analysis or only some of them.
Students
The number of subjects who were students. Students include undergraduates, graduates, PhD students and postdocs.
Undergraduate students
The number of subjects who were undergraduates.
Graduate students
The number of subjects who were graduates, PhD students or postdocs.
Participating scientists
The number of scientists taking part as subjects in the experiment. Scientists involved
in setting up and running the experiment are not included.
Professional participants
The number of professional subjects in the experiment. These are typically
professional developers working in commercial companies.
Differences between participants
Differences between the results of different categories of subjects. This does not include differences between groups that receive different treatments, but differences within a treatment, e.g. between professionals and students receiving the same treatment. This is most likely to be addressed explicitly in replicated studies.
Individual or team
Did the subjects perform the experiment individually, as part of a team working together, or in a combination of both?
Selection of participants
Who are the subjects, and why are they part of the experiment? Are they taking a particular course? Were they selected from a certain company? Are there differences in background between different groups of subjects?
Information about the participants
What kind of background information is registered about the participants? What kind of experience do they have, and what kind of knowledge do they possess? This is addressed to varying degrees: some papers hardly mention it at all, while others have carefully monitored background information through questionnaires.
Recruitment
How were the subjects recruited for the experiment? Was the experiment a mandatory part of a course they were taking? Were they volunteers? Were they paid? What motivated them to participate? We find many different motivations: some participate in experiments as a mandatory part of courses they are taking, some are motivated by the chance to acquire knowledge they find useful, and some participate as part of courses or projects at work. Sometimes companies participate because they expect the outcome to be useful for their working practices.
Generalizations from students
Do the authors make any generalizations from experiments featuring students to other populations, e.g. professionals in an industrial setting? If they do not say anything about it, is it possible to read between the lines whether they assume such generalizations? Many papers address this when discussing external threats to validity, but quite often just by stating that they are aware of the problem, without discussing it further.
Replication
If the experiment is explicitly claimed to be a replication of another experiment, what type of replication is it? Basili [3] distinguishes between six types of replications: strict replications; replications that vary the manner in which the experiment is run; replications that vary variables intrinsic to the object of study (i.e., independent variables); replications that vary variables intrinsic to the focus of the evaluation (i.e., dependent variables); replications that vary context variables in the environment in which the solution is evaluated; and replications that extend the theory. This field is of particular interest because replications are necessary in order to generalize to populations, which is a major problem for experiments in this field.
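As a sketch only, the subject-related variables above could be recorded per paper roughly as follows; the field names are my own, not those of the CONTEXT database, and None stands for "not reported":

from dataclasses import dataclass
from typing import Optional

@dataclass
class SubjectVariables:
    # One record per analysed paper; None means the paper did not report it.
    total_participants: Optional[int] = None    # everyone, including drop-outs
    active_participants: Optional[int] = None   # subjects included in the analysis
    students: Optional[int] = None              # undergraduates + graduates/PhDs/postdocs
    undergraduates: Optional[int] = None
    graduates: Optional[int] = None             # graduates, PhD students, postdocs
    scientists: Optional[int] = None            # scientists as subjects, not experimenters
    professionals: Optional[int] = None         # e.g. developers in commercial companies
    individual_or_team: Optional[str] = None    # "individual", "team" or "both"
    recruitment: Optional[str] = None           # e.g. "course", "volunteer", "paid"
    generalizes_from_students: Optional[bool] = None
    replication_type: Optional[str] = None      # one of Basili's six types, if any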
Related work
A study like this has, to our knowledge, not been performed in the field of SE before, so the results are expected to be of interest to the entire SE community. Lukowicz et al. [2] surveyed over 400 research articles to establish whether computer scientists validate the results in the papers they publish, but that survey studied experimental validation only. Zelkowitz and Wallace [4] analysed a total of 612 papers in three journals from the years 1985, 1990 and 1995 to describe the qualitative nature of these papers from a meta-level, using numbers.
(Noted from a group meeting that Basili has also done related work, but at the moment I do not have an overview of what.)
Time schedule
I plan to finish analysing the articles by early July, and hope to deliver my thesis by
November 1st.
Summary
There is a need to draw a picture of the research being conducted in the field of software engineering. An overview of what has been done can provide useful information and pinpoint where the SE community stands today. A common understanding of the field's history may serve as a guideline for where and how to move on. Our survey is meant to be a contribution towards this.
References
[1] Dag I. K. Sjøberg, Bente Anda, Erik Arisholm, Tore Dybå, Magne Jørgensen, Amela Karahasanovic, Espen F. Koren and Marek Vokác, "Conducting Realistic Experiments in Software Engineering".
[2] Paul Lukowicz, Ernst A. Heinz, Lutz Prechelt and Walter F. Tichy, "Experimental Evaluation in Computer Science: A Quantitative Study", Journal of Systems and Software, vol. 28, pp. 9-18, 1995.
[3] Victor R. Basili, Forrest Shull and Filippo Lanubile, "Building Knowledge through Families of Experiments", IEEE Transactions on Software Engineering, vol. 25, no. 4, July/August 1999.
[4] Marvin V. Zelkowitz and Dolores R. Wallace, "Experimental Models for Validating Technology", IEEE Computer, vol. 31, no. 5, pp. 23-31, May 1998.