Reflections on threats to validity in controlled experiments
Nils-Kristian Liborg
MSc student
University of Oslo, Department of Informatics
E-mail: nilskril@ifi.uio.no
Abstract

My Master's Thesis is called “Reflections on threats to validity in controlled experiments”. It investigates the reflections and discussions offered by researchers about the validity of their experiments. The experimental material under investigation is analysed with regard to the following context variables: recruitment, selection of participants, grouping, target, generalization of context, external threats, internal threats, generalizations made when using students as experiment participants, and replication.

1 Introduction

The primary goal of CONTEXT (Controlled Experiment Survey) is to survey all published material concerning controlled experiments from 11 leading journals and conferences within Software Engineering in the decade 1993-2002¹. The material is systematically gathered and analysed by research assistants and Master's students at Simula Research Laboratory. All results are systematized in a database, to enable producing statistical reports and evaluating and comparing different controlled experiments. An example is producing statistics on controlled experiments conducted with professional participants.

To help recognize a controlled experiment, an operational definition has been made. By this definition, a controlled experiment is a human activity in a controlled environment, where at least one variable is measured, at least one background variable varies, a scientist has designed the experiment, and the author himself describes it as an experiment. This excludes studies where no background variable varies. As the study went on, it became clear that independent variables are not so central, so some experiments without independent variables are included as well. As a main rule, articles whose authors claim they describe a controlled experiment are included.

The aim of this study is to see what is done in the field and to establish the state of the art in Software Engineering today. It is also interesting to see what kind of knowledge is needed in the empirical Software Engineering field, and how we place ourselves in the big picture.

¹ The 11 journals and conferences are: Empirical Software Engineering; IEEE Transactions on Software Engineering; Journal of Software Engineering Methodology; ACM Transactions on Software Engineering and Methodology; Journal of Information and Software Technology; ISESE; Journal of Systems and Software; ICSE; IEEE Software; IEEE Computer; IEEE International Symposium on Software Metrics.
2 Related work
There is not much related work published, but Walter Tichy et al. [4] surveyed about 400 articles in 1995. They divided the articles into categories and analysed them with respect to their validity. To do this, they divided the articles into subclasses depending on how large a share of each article, in percent, was devoted to the discussion of validity. They found that articles in Computer Science often did not discuss validity at all, or discussed it very poorly. In the same article they also discuss the importance of replicating studies [5], arguing that replication of experiments yields better validation of the results.
Another article by Walter Tichy [2] discusses why computer scientists should do experiments. The article points out 16 common excuses computer scientists have for not doing experiments. One thing Tichy stresses is that it is important for computer scientists to test their theories by experiment; he says that is the only way to obtain validated results, and a good basis for further research in the field. He concludes that experimentation is central to the scientific process, and that only experiments can test theories, explore critical factors, and bring new phenomena to light so that theories can be formulated in the first place.
An article by Marvin Zelkowitz and Dolores R. Wallace [3] describes 12 different experimental approaches² and how these approaches were used in a survey similar to this one. In that survey they analysed about 600 articles concerning Software Engineering from three different years (1985, 1990 and 1995). Through the analysis they observed that too many papers had no experimental validation at all, that too many papers used an informal form of validation, that researchers used lessons learned and case studies about 10% of the time while the other techniques were each used only a few percent of the time at most, and that experimentation terminology is sloppy. However, the percentage of articles with no experimental validation at all dropped from about 36% in 1985 to about 19% in 1995, so it will be interesting to compare those results with our survey, to see if there has been any improvement.

² Project monitoring, Case study, Assertion, Field study, Literature search, Legacy, Lessons learned, Static analysis, Replicated, Synthetic, Dynamic analysis, Simulation.
3 The study

The first task in this Master's Thesis is to collect the data material that will be the basis of the assignment. This information is collected by reading and analysing about 200 articles describing controlled experiments within Software Engineering from the past 10 years, analysed with respect to different context variables. The articles describing controlled experiments were identified in advance by research assistants, from among about 5000 papers.
About 90% of the articles are available electronically as PDF files; the rest exist on paper and will be scanned and saved as PDF files. The electronic versions were found using the search engine Alltheweb. When identifying the articles, the research assistants systematically read through all abstracts in every journal, backward chronologically. If there was any doubt whether an article described a controlled experiment, they skimmed through the rest of the article. If doubt remained, they marked the article unsure and let a professor read it and decide whether it describes a controlled experiment.
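For illustration only, the screening steps above can be written as a small decision function. This is a minimal sketch in Python; the names (triage, Verdict) are my own and not part of any actual CONTEXT tooling:

    from enum import Enum
    from typing import Optional

    class Verdict(Enum):
        INCLUDED = "included"   # clearly describes a controlled experiment
        EXCLUDED = "excluded"   # clearly does not
        UNSURE = "unsure"       # handed to a professor for the final decision

    def triage(abstract_verdict: Optional[bool],
               full_text_verdict: Optional[bool]) -> Verdict:
        # None models "still in doubt" after reading that part of the article.
        if abstract_verdict is not None:
            return Verdict.INCLUDED if abstract_verdict else Verdict.EXCLUDED
        # Doubt after the abstract: skim the rest of the article.
        if full_text_verdict is not None:
            return Verdict.INCLUDED if full_text_verdict else Verdict.EXCLUDED
        # Still in doubt: mark unsure and let a professor decide.
        return Verdict.UNSURE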
The articles were then analysed by two research assistants. If, during the analysis, they were unsure whether an article described a controlled experiment, they marked the article unsure and stopped the analysis. These were mostly articles concerning case studies, usability studies, and meta-studies that only described how to perform a controlled experiment. Articles concerning focus groups, and controlled experiments where the participants worked in focus groups and discussed a Software Engineering topic, are mostly not included among the analysed articles. If the focus groups discussed a Software Engineering topic such as Software Process Improvement (SPI), and different variables were analysed, the article is included among the analysed articles. The exception is articles on the borderline of the operational definition of a controlled experiment. After all analysis was finished, the unsure articles were deleted from the database.
What I will do is analyse the articles once more, with special attention to some of the context variables. I will probably also go into more depth on the different variables.
The material will be collected in a database. This is done in a structured way, so that statistics can be produced from the material without further preparation.
Some conventions for the fields have already been established. If a field is empty, it simply means that the person analysing the article missed this field. “??” means that the person analysing the article looked for the information but could not find it.
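To make the two conventions concrete, here is a minimal sketch, in Python, of how a raw field value could be interpreted when reading the database; the names (interpret_field, Missing) are hypothetical, not the actual tooling:

    from enum import Enum
    from typing import Union

    class Missing(Enum):
        MISSED = "missed by analyst"        # stored as an empty field
        NOT_FOUND = "not found in article"  # stored as "??"

    def interpret_field(raw: str) -> Union[str, Missing]:
        # Map a raw field value to its meaning under the conventions above.
        if raw == "":
            return Missing.MISSED
        if raw == "??":
            return Missing.NOT_FOUND
        return raw  # an ordinary recorded value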
When all the material is collected and the basis for the assignment is laid, the results will be analysed and statistics produced. In the next sections I give a brief description of the context variables. The names of the context variables are also used as the field names in the database.
The recruitment field describes how the participants in the experiment were recruited (e.g. as part of a course, as volunteers from organisations, recruited by letter, or otherwise). In this field I also describe whether they were paid in some way (money, grades, teaching, etc.).
In the selection of participants field I describe where the participants came from; that could be graduate students taking the same course, consultants from a consultancy firm, etc. The field also contains information on which course they were taking or which company they work for, and whether there are differences between the participants and, if so, what the differences are (differences in background, not in results from the experiment).
Grouping describes how the participants were grouped, if they were. If randomisation was used, it is described how this was done.
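As an illustration of what randomised grouping typically means, here is a minimal Python sketch that shuffles the participant list and deals it out round-robin; this is a generic example, not a procedure taken from any of the surveyed experiments:

    import random
    from typing import Dict, List

    def randomise_groups(participants: List[str], n_groups: int,
                         seed: int = 42) -> Dict[int, List[str]]:
        # Randomly assign participants to n_groups groups of near-equal size.
        rng = random.Random(seed)   # fixed seed makes the split reproducible
        shuffled = participants[:]  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        groups: Dict[int, List[str]] = {g: [] for g in range(n_groups)}
        for i, person in enumerate(shuffled):
            groups[i % n_groups].append(person)  # deal out round-robin
        return groups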
The target field describes whom the authors regard as the target population for the experiment. By this we mean: if students are used in the experiment, do the authors generalize the sample to professionals as well? If the target population is not mentioned explicitly in the article, the field only contains the value ‘implicit’.
The generalization of context field describes whether or not the authors have generalized the context of the experiment. By this we mean that if, for example, they claim the results hold for all real situations even though the experiment was done on paper, they have generalized the context. If they have not described this explicitly, the field only contains the value ‘implicit’.
In the fields external threats and internal threats we describe what the authors say about general threats to the experiment's validity. If they have not discussed it, the field contains the value ‘not discussed’.
The generalization from students field describes whether or not the authors have discussed the article's possible threats to validity when students are used as the sample. The field is marked ‘discussed’ or ‘not discussed’, or ‘not relevant’ if the sample was not students. The accompanying comments field describes what they have discussed as a threat.
The last context variable I am to investigate is replication. This field describes whether or not the experiment is a replication. If it is, we mark it with an X. We only classify the experiment as a replication if the authors describe the study as one. The comments field records what type of replication it is (strict; varying the manner in which the experiment is run; varying independent variables; varying dependent variables; varying context variables; or extending the theory [1]), preferably with a short reference to the article describing the original experiment. The comments field may also note that we suspect the study to be a replication, even though the authors do not describe it as one.
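The replication types from Basili et al. [1] form a small closed vocabulary, which could be encoded as follows; a sketch with my own shorthand names, not the actual database encoding:

    from enum import Enum

    class ReplicationType(Enum):
        # Replication categories as listed in Basili et al. [1].
        STRICT = "strict replication"
        VARY_MANNER = "varies the manner in which the experiment is run"
        VARY_INDEPENDENT = "varies independent variables"
        VARY_DEPENDENT = "varies dependent variables"
        VARY_CONTEXT = "varies context variables"
        EXTEND_THEORY = "extends the theory"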
Every field has an accompanying comments field. It is used for information that may be of interest but does not fit into the fields described here, or that cannot be written in a standard way.
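Putting the fields together, a record in the database could be modelled roughly as below. This is a sketch under my own naming assumptions (the real database keeps a comments field per context variable, collapsed to one here for brevity), and it shows how a simple statistic falls out of the structured form:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ExperimentRecord:
        # One analysed article; field names mirror the context variables.
        recruitment: str = ""
        selection_of_participants: str = ""
        grouping: str = ""
        target: str = ""                        # 'implicit' if not explicit
        generalization_of_context: str = ""     # 'implicit' if not explicit
        external_threats: str = ""              # 'not discussed' if absent
        internal_threats: str = ""              # 'not discussed' if absent
        generalization_from_students: str = ""  # 'discussed' / 'not discussed' / 'not relevant'
        replication: str = ""                   # 'X' if classified as a replication
        comments: str = ""                      # free-form notes

    def share_discussing_internal_threats(records: List[ExperimentRecord]) -> float:
        # Example statistic: fraction of articles that discuss internal threats.
        if not records:
            return 0.0
        discussed = sum(1 for r in records
                        if r.internal_threats not in ("", "??", "not discussed"))
        return discussed / len(records)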
4 Time schedule

I hope to complete this Master's Thesis within the standard time, which means that the final report will be finished in spring 2004. To achieve this, I hope to have all the articles analysed by June 2003. From summer 2003 I will start to prepare the material I have collected through the analysis, and hopefully I can start writing the report in early autumn 2003. During the period of writing the report (autumn 2003 to spring 2004), I also hope to write an article about the theme.
5 Discussion

When the complete analysis of the articles is finished and the final report written, this material will be of great interest to those who want to place themselves and their research in the landscape of Software Engineering. It will also be interesting to compare the results of our analysis with similar earlier surveys, to see whether an evolution is under way. Hopefully this survey will bring some new thoughts to researchers around the world, so that we can all benefit from what others do and have done.
6 Summary

Throughout the CONTEXT project we hope to find out what the state of the art in Software Engineering is today, and to place ourselves (Simula Research Laboratory) in the landscape. Such a profound analysis of controlled experiments in Software Engineering has never been carried out before, so this is pioneering work. It will be interesting to see what scientists around the world do differently and what they do alike in their experiments, and what kinds of threats to validity they take into consideration.
References

[1] V. R. Basili, F. Shull & F. Lanubile, “Building Knowledge through Families of Experiments”, IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 456-473, Jul/Aug 1999.
[2] W. F. Tichy, “Should Computer Scientists Experiment More? 16 Excuses to Avoid Experimentation”, IEEE Computer, vol. 31, no. 5, pp. 32-40, May 1998.
[3] M. V. Zelkowitz & D. R. Wallace, “Experimental Models for Validating Technology”, IEEE Computer, vol. 31, no. 5, pp. 23-31, May 1998.
[4] P. Lukowicz, E. A. Heinz, L. Prechelt & W. F. Tichy, “Experimental Evaluation in Computer Science: A Quantitative Study”, Journal of Systems and Software, vol. 28, pp. 9-18, 1995.
[5] R. M. Lindsay & A. S. C. Ehrenberg, “The Design of Replicated Studies”, The American Statistician, vol. 47, no. 3, pp. 217-228, August 1993.