CONTEXT Controlled Experiment Survey

Introduction

"There is an increasing understanding in the software engineering (SE) community that empirical studies are needed to develop or improve processes, methods and tools for software development and maintenance" [1]. What is being done in the field of software engineering? What kind of knowledge does the empirical field need? CONTEXT is a project run by the Software Engineering group at Simula Research Labs (from now on referred to as Simula). It aims to answer questions like these and to place the research performed at Simula in that landscape.

The project started during the fall of 2002 and was formally established in February 2003. It is a survey intended to produce an overview of all controlled experiments in the field of software engineering reported in 11 leading journals and conferences during the years 1993 through 2002. The objective of the survey is to examine from a meta-perspective how experiments have been performed, by analysing the articles with respect to a number of identified variables expected to be of interest to the SE community. We also hope to arrive at a qualified notion of what factors characterize a state-of-the-art experiment. The project is carried out by people working at the Software Engineering group at Simula and by three master students at the Department of Informatics, University of Oslo.

Method

Employees at the SE group at Simula have determined which journals and conferences we are going to analyse; these have been selected according to which sources the community itself considers the leading ones (reference to the article Amela mentioned). From these sources, two research assistants performed a preliminary analysis of all articles to determine which ones actually report controlled experiments. Most of the papers, approximately 90%, were found either on the World Wide Web using the search engine "Alltheweb" or by accessing the online versions of the journals. The rest were found in the paper versions.

It is not easy to decide what criteria a study must meet to be labelled a controlled experiment. To decide whether an article reports a controlled experiment, an instrumental working definition was constructed from various sources as well as from the opinions of qualified Simula employees. The initial working definition was: a controlled experiment is a human activity in a controlled environment, in which at least one variable is measured and at least one background variable is varied, where a scientist has designed the experiment, and where the experiment is described as an experiment by the author. Studies in which background variables are not manipulated, such as case studies and surveys, are left out.

After reading through the material several times, a new picture emerged, as the independent variables seemed less essential than first assumed. The research assistants formulated a new operational definition (sketched as a checklist below):

- Subjects are people.
- The setting is controlled.
- The experimenters manipulate the setting in a certain way. Differences or variations in the setting are called independent variables. Most experiments have one or more independent variables.
- It is decided in advance what is going to be measured/observed.
- Dependent variables are deduced from the observed results, but these do not have to be established in advance.
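As an illustration only, the checklist below encodes these criteria in code; the names and structure are my own and do not reflect any actual screening tool used in the project.

    from dataclasses import dataclass

    @dataclass
    class PaperScreening:
        """Screening answers for one candidate paper (illustrative only)."""
        subjects_are_people: bool          # the subjects are humans
        setting_is_controlled: bool        # the experimenters control the environment
        setting_is_manipulated: bool       # at least one independent variable is varied
        measures_decided_in_advance: bool  # what to measure/observe was fixed beforehand

    def is_controlled_experiment(paper: PaperScreening) -> bool:
        # All criteria of the operational definition must hold; dependent
        # variables may be deduced afterwards, so they are not checked here.
        return (paper.subjects_are_people
                and paper.setting_is_controlled
                and paper.setting_is_manipulated
                and paper.measures_decided_in_advance)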
Among the 4851 papers initially examined, 174 were found to describe controlled experiments according to the working definition. These numbers have since changed somewhat, and they are likely to change again as some papers are ruled out and others are added, following the evolving understanding and specification of what a controlled experiment is. All of these papers are available as printouts as well as electronically (as PDF documents), either gathered from the World Wide Web or scanned manually from paper hardcopies.

The SE group has also decided which variables are of interest in the analysis; the outcome is approximately 30 variables. While studying the papers, the research assistants analysed these variables in all the papers they judged to be controlled experiments and registered the values in an Access database (a hypothetical sketch of such a record is given at the end of this section). We are going to analyse a total of 150 to 200 articles reporting controlled experiments published in the 11 journals and conferences. The variables have been divided between the three master students and one post doc working at Simula, creating four "blocks" of variables that belong naturally together with respect to some aspect of the analysis. Some of the variables overlap, meaning that more than one person will analyse papers with respect to them; this is due both to those variables' relevance for each individual's thesis and to the need to verify their values. We have been given empty databases to populate with data from our analyses. These analyses will verify, and quite possibly correct, the results of the initial analysis performed by the research assistants. After the individual analyses are finished, a completely populated database will be available for various statistical meta-analyses. The end result will be a report, and we expect the outcome of the CONTEXT project to be met with great interest by the international SE community.
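As a rough, hypothetical illustration of what one record in such a database could look like, the sketch below defines a table holding a few of the participant-related variables described in the Discussion section. The table and column names are my own; the actual Access database may be organized quite differently.

    import sqlite3

    # Hypothetical schema for one analysis block; the real database and
    # its approximately 30 variables may be structured differently.
    conn = sqlite3.connect("context_survey.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS experiment_paper (
            paper_id            INTEGER PRIMARY KEY,
            source              TEXT,     -- one of the 11 selected journals/conferences
            year                INTEGER,  -- 1993 through 2002
            total_participants  INTEGER,  -- all subjects who took part
            active_participants INTEGER,  -- subjects included in the analysis
            students            INTEGER,  -- undergraduates plus graduates
            professionals       INTEGER,  -- e.g. developers in commercial companies
            individual_or_team  TEXT      -- 'individual', 'team' or 'both'
        )
    """)
    conn.commit()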
Discussion

My thesis focuses on the role of the subjects participating in the experiments. Exactly which questions I will address will be determined along the way, as I become familiar with the way experiments are described and pick up useful information from the papers we read. Who are the subjects? What backgrounds do they have? What is different for people participating in SE experiments compared with other disciplines? To what extent do the authors address the heterogeneity of the subjects, given their very different backgrounds, and what implications does this have for the experiments and their generalization?

Research in the software engineering field is difficult because more or less unique solutions are developed every time. It is problematic to generalize to large populations, so it is important to keep in mind which populations you wish the results to apply to. Students are commonly used as subjects because they are more easily accessible than professionals: they are cheap, they are more flexible regarding time, and experiments can sometimes be run as part of courses they are taking. How are results from students generalized to apply to professionals, and how is this accounted for? Are there differences of interest between subjects? How do we know the results are not due to chance? How realistic are the experiments?

The essential purpose of controlled experiments is to study cause-and-effect relations. You alter one variable, usually referred to as the independent variable or the treatment, and observe how the variation of this variable affects some other variable, referred to as the dependent variable. In some sciences this can be done mathematically, without considering potential confounding effects, but as in all other sciences studying human behaviour, SE experiments need to consider very carefully what causes other than the variation of the independent variables may influence the variation in the dependent variable. For instance, learning effects can be a threat to the results of a within-subjects experiment, as the subjects can learn from the first treatment and thus contaminate the results from the second. A between-subjects design is therefore an alternative, but that design opens for uncertainty about whether the results are caused by differences between the subjects. Due to the diversity of people and backgrounds, qualitative methods (as opposed to quantitative methods) are often necessary, just as they are in the humanities.
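To illustrate the question of whether results are due to chance in a between-subjects design, the following sketch compares two hypothetical treatment groups with a simple permutation test. All numbers are invented for the example; the surveyed papers use a variety of statistical tests.

    import random

    # Invented scores for a hypothetical between-subjects design, e.g. task
    # completion times under treatment A versus treatment B.
    group_a = [34, 29, 41, 37, 30, 33]
    group_b = [45, 39, 47, 36, 44, 42]

    observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)

    # Permutation test: if the treatment had no effect, any assignment of
    # the pooled scores to the two groups would be equally likely.
    pooled = group_a + group_b
    extreme = 0
    trials = 10_000
    for _ in range(trials):
        random.shuffle(pooled)
        diff = (sum(pooled[len(group_a):]) / len(group_b)
                - sum(pooled[:len(group_a)]) / len(group_a))
        if abs(diff) >= abs(observed):
            extreme += 1

    # Approximate p-value: how often chance alone produces a difference at
    # least as large as the one observed.
    print(f"observed difference: {observed:.2f}, p ~ {extreme / trials:.3f}")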
The variables I am going to study (a small sketch after the list illustrates how the participant counts relate to each other):

- Total number of participants: the total number of subjects participating in the experiment. Whether the subjects completed the experiment or were removed for some reason does not matter here; all are counted in this field. This number varies between fewer than 10 and several hundred.
- Active participants: the number of subjects who actually completed the experiment and were part of the analysis. Subjects may have been removed because they were considered outliers with an inappropriate influence on the analysis of the results, because they did not follow the instructions, or because they for some other reason were not viewed as representative contributors. In experiments where this is the case, it is interesting to observe how this is described and whether the removed subjects are excluded from all parts of the analysis or only from some of them.
- Students: the number of subjects who were students. Students include undergraduates, graduates, PhD students and post docs.
- Undergraduate students: the number of subjects who were undergraduates.
- Graduate students: the number of subjects who were graduates, PhD students or post docs.
- Participating scientists: the number of scientists taking part as subjects in the experiment. Scientists involved in setting up and running the experiment are not included.
- Professional participants: the number of professional subjects in the experiment. These are typically professional developers working in commercial companies.
- Differences between participants: differences between the results of different categories of subjects. This does not include differences between groups that receive different treatments, but differences within a treatment, e.g. between professionals and students receiving the same treatment. This is most likely to be explicitly addressed in replicated studies.
- Individual or team: did the subjects perform the experiment individually, as part of a team working together, or in a combination of both?
- Selection of participants: who are the subjects, and why are they part of the experiment? Are they taking a particular course? Are they selected from a certain company? Are there differences in background between different groups of subjects?
- Information about the participants: what kind of background information is registered about the participants? What kind of experience do they have, and what kind of knowledge do they possess? This is addressed to varying degrees: in some papers it is hardly mentioned at all, while in others background information is carefully monitored through questionnaires.
- Recruitment: how were the subjects recruited for the experiment? Was the experiment a mandatory part of a course they were taking, were they volunteers, were they paid? What motivated them to participate? We find many different motivations: some participate in experiments as a mandatory part of courses they are taking, some are motivated by the chance to acquire knowledge they find useful, and some participate as part of courses or projects at work. Sometimes companies participate because they expect the outcome to be useful for their working practices.
- Generalizations from students: do the authors make any generalizations from experiments featuring students to other populations, e.g. professionals in an industrial setting? If they do not say anything about it, is it possible to read between the lines whether they assume such generalizations or not? Many papers address this when discussing external threats to validity, but quite often just by stating that they are aware of the problem, without discussing it further.
- Replication: if the experiment is explicitly claimed to be a replication of some other experiment, what type of replication is it? Basili [3] distinguishes between six types of replications: strict replications; replications that vary the manner in which the experiment is run; replications that vary variables intrinsic to the object of study (i.e., independent variables); replications that vary variables intrinsic to the focus of the evaluation (i.e., dependent variables); replications that vary context variables in the environment in which the solution is evaluated; and replications that extend the theory. This field is of particular interest because replications are necessary to make generalizations to populations, which is a major problem for experiments in this field.
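As a purely illustrative sketch (all field names are my own invention), the participant counts above could be checked for internal consistency like this:

    from dataclasses import dataclass

    @dataclass
    class ParticipantCounts:
        """Participant-related variables for one paper (names are illustrative)."""
        total: int           # everyone who took part
        active: int          # those who completed and were analysed
        undergraduates: int
        graduates: int       # graduates, PhD students and post docs
        scientists: int      # scientists taking part as subjects
        professionals: int

        def check(self) -> None:
            # Active participants can never exceed the total.
            assert self.active <= self.total
            # "Students" is defined as undergraduates plus graduates.
            students = self.undergraduates + self.graduates
            # The reported categories together should not exceed the total.
            assert students + self.scientists + self.professionals <= self.total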
Related work

A study like this has, to our knowledge, not been performed in the field of SE before, so the results are expected to be of interest to the entire SE community. Lukowicz [2] surveyed over 400 research articles to establish whether computer scientists validate the results in the papers they publish, but that survey considered experimental validation only. Zelkowitz [4] analysed a total of 612 papers from three journals in the years 1985, 1990 and 1995, describing their qualitative nature from a meta-level using numbers. (I have noted from a group meeting that Basili has also done something similar, but at the moment I do not have an overview of what.)

Time schedule

I plan to finish analysing the articles by early July, and I hope to deliver my thesis by November 1st.

Summary

There is a need to draw a picture of the research being conducted in the field of software engineering. Creating an overview of what has been done can provide useful information and pinpoint where the SE community stands today. A common impression of history may become a guideline for where and how to move on. Our survey is meant to be a contribution to this.

References

[1] Dag I.K. Sjøberg, Bente Anda, Erik Arisholm, Tore Dybå, Magne Jørgensen, Amela Karahasanovic, Espen F. Koren, Marek Vokác, "Conducting Realistic Experiments in Software Engineering".
[2] Paul Lukowicz, Ernst A. Heinz, Lutz Prechelt, Walter F. Tichy, "Experimental Evaluation in Computer Science: A Quantitative Study", Journal of Systems and Software, vol. 28, pp. 9-18, 1995.
[3] Victor R. Basili, Forrest Shull, Filippo Lanubile, "Building Knowledge through Families of Experiments", IEEE Transactions on Software Engineering, vol. 25, no. 4, July/August 1999.
[4] Marvin V. Zelkowitz, Dolores R. Wallace, "Experimental Models for Validating Technology", IEEE Computer, vol. 31, no. 5, pp. 23-31, May 1998.