Reflections on threats to validity in controlled experiments

Nils-Kristian Liborg
MSc student
University of Oslo, Department of Informatics
E-mail: nilskril@ifi.uio.no

1 Abstract

My Masters Thesis is called "Reflections on threats to validity in controlled experiments". It investigates the reflections and discussions made by researchers about the validity of their experiments. The experimental material under investigation is analysed with regard to the following context variables: recruitment, selection of participants, grouping, target, generalization of context, external threats, internal threats, generalizations made when using students as experiment participants, and replication.

To help recognize a controlled experiment, an operational definition has been made. By this definition, a controlled experiment is a human act in a controlled environment, where at least one variable is measured and at least one background variable varies, where a scientist has designed the experiment and the authors themselves describe it as an experiment. This excludes studies where no background variable varies. As the study progressed, it became clear that independent variables are not as central as first assumed, so some experiments without independent variables are included as well. As a main rule, articles that the authors claim describe a controlled experiment are included.

The aim of this study is to see what is done in the field and to discover the state of the art in Software Engineering today. It is also interesting to see what kind of knowledge is needed in the empirical Software Engineering field and how we place ourselves in the big picture.

2 Introduction

The primary goal of CONTEXT (Controlled Experiment Survey) is to conduct a survey of all published material concerning controlled experiments from 11 leading journals and conferences within Software Engineering in the decade 1993-2002: Empirical Software Engineering, IEEE Transactions on Software Engineering, Journal of Software Engineering Methodology, ACM Transactions on Software Engineering and Methodology, Journal of Information and Software Technology, ISESE, Journal of Systems and Software, ICSE, IEEE Software, IEEE Computer, and the IEEE International Symposium on Software Metrics. The material is systematically gathered and analysed by research assistants and Masters students at Simula Research Laboratory. All the results are systematized in a database, which makes it possible to produce statistical reports and to evaluate and compare different controlled experiments. One example is to produce statistics on controlled experiments conducted with professional participants.

Related work

There is not much related work published, but Walter Tichy et al. [4] surveyed about 400 articles in 1995. They divided the articles into categories and analysed them with respect to their validity. To do this, they divided the articles into subclasses depending on how much of each article, in percent, was devoted to the discussion of validity. They found that articles in Computer Science either did not discuss validity at all or discussed it very poorly. The same article also discusses the importance of replicating studies [5]: by replicating experiments, one obtains better validation of the results. Another article by Walter Tichy [2] discusses why computer scientists should do experiments, and points at 16 common excuses computer scientists have for not doing experiments.
One thing Tichy emphasizes in this article is that it is important that computer scientists test their theories by experiments. According to Tichy, this is the only way to obtain validated results, and a good basis for further research in the field. His conclusion is that experimentation is central to the scientific process, and that only experiments can test theories, explore critical factors and bring new phenomena to light so that theories can be formulated in the first place.

An article by Marvin Zelkowitz and Dolores R. Wallace [3] describes 12 different experimental approaches (project monitoring, case study, assertion, field study, literature search, legacy, lessons learned, static analysis, replicated, synthetic, dynamic analysis, and simulation) and how these approaches were used in a survey similar to this one. In that survey they analysed about 600 Software Engineering articles from three different years (1985, 1990 and 1995). Through the analysis they observed that too many papers had no experimental validation at all, too many papers used an informal form of validation, researchers used lessons learned and case studies about 10% of the time with the other techniques each used only a small percentage of the time at most, and that experimentation terminology is sloppy. However, the percentage of articles with no experimental validation at all dropped from about 36% in 1985 to about 19% in 1995, so it will be interesting to compare those results with our survey, to see if there has been any further improvement.

3 The study

The first step in this Masters Thesis is to collect the data material that will be the basis of the assignment. This information is collected by reading and analysing about 200 articles describing controlled experiments within Software Engineering from the past 10 years, analysed with respect to different context variables. The articles describing controlled experiments were identified by research assistants in advance, among about 5000 papers. About 90% of the articles are available electronically as PDF files; the rest are on paper and will be scanned and saved as PDF files. The electronic versions were found using the search engine "Alltheweb".

When identifying the articles, the research assistants systematically read through all the abstracts of the articles in every journal, backward chronologically. If there was any doubt that an article described a controlled experiment, they skimmed through the rest of the article. If there still was doubt, they marked the article as unsure and let a professor read through it and decide whether it describes a controlled experiment. The articles were then analysed by two research assistants. If they were unsure during the analysis whether an article described a controlled experiment, they marked it as unsure and stopped the analysis. These were mostly articles concerning case studies, usability studies and meta-studies that only described how to perform a controlled experiment.

Articles concerning focus groups, and controlled experiments where the participants worked in focus groups and discussed a Software Engineering topic, are mostly not included among the analysed articles. If the focus groups discussed a Software Engineering topic such as Software Process Improvement (SPI), and different variables were analysed, the article is included among the analysed articles. The exception to this is if the article is on the borderline of the operational definition of a controlled experiment. After all analysis was finished, the unsure articles were deleted from the database. The inclusion decision carried out by the research assistants and, in doubtful cases, a professor is sketched below.
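The following is a minimal sketch of that screening flow, given only to make the steps concrete. In CONTEXT the decisions are made by research assistants and a professor reading the articles, not by a program; the `judge` heuristic below is a hypothetical placeholder for that human judgment, and all names are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    INCLUDE = "controlled experiment"          # passed on to the detailed analysis
    EXCLUDE = "not a controlled experiment"
    UNSURE = "unsure"                          # a professor decides; still-unsure articles are deleted

@dataclass
class Article:
    abstract: str
    full_text: str

def judge(text: str) -> Optional[bool]:
    """Placeholder heuristic standing in for a human reader:
    True/False when the text settles the question, None when in doubt."""
    lowered = text.lower()
    if "controlled experiment" in lowered:
        return True
    if "experiment" not in lowered:
        return False
    return None

def screen(article: Article) -> Verdict:
    # 1. The abstract is read first.
    verdict = judge(article.abstract)
    # 2. If in doubt, the rest of the article is skimmed.
    if verdict is None:
        verdict = judge(article.full_text)
    # 3. If still in doubt, the article is marked unsure and a professor decides;
    #    articles still marked unsure after the analysis phase are removed from the database.
    if verdict is None:
        return Verdict.UNSURE
    return Verdict.INCLUDE if verdict else Verdict.EXCLUDE
```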
What I will do is to analyse the articles once more, with special attention to some context variables, and I will probably also go into more depth on the different variables. The material will be collected in a database in a structured way, so that statistics can be made from the material without further preparation. Some conventions for the fields have already been established. If a field is empty, it simply means that the person analysing the article has missed this field. "??" means that the person analysing the article has looked for the information but could not find it. When all the material is collected and the basis for the assignment is laid, the results will be analysed and statistics made.

In the following I give a brief description of the context variables. The names of the context variables are also used as the field names in the database.

The recruitment field describes how the participants in the experiment were recruited (e.g. as part of a course, as volunteers from organisations, recruited by letter, or in other ways). In this field I also describe whether they were paid in some way (money, grades, teaching etc.).

In the selection of participants field I describe where the participants came from. That could be graduate students taking the same course, consultants from a consultancy, etc. The field also contains information on which course they are taking or which company they work for, and whether there are differences between the participants and, in that case, what those differences are (differences in background, not in results from the experiment).

Grouping describes how the participants were grouped, if they were. If randomisation was used, it is described how this was done.

The target field describes whom the authors present as the target population for the experiment. By this we mean, for example, whether the authors, when students are used in the experiment, generalize the sample to also cover professionals. If this is not mentioned explicitly in the article, the field only contains the value 'implicit'.

The generalization of context field describes whether or not the authors have generalized the context of the experiment. By this we mean, for example, that if they claim the results hold for all real situations even though the experiment was done on paper, they have generalized the context. If they have not described this explicitly, the field only contains the value 'implicit'.

In the fields external threats and internal threats we describe what the authors say about general threats to the validity of the experiment. If they have not discussed it, the field contains the value 'not discussed'.

The generalization from students field describes whether or not the authors have discussed the possible threats to validity when students are used as the sample. The field is marked 'discussed', 'not discussed', or 'not relevant' if the sample was not students. In the comments field for this field, it is described what they have discussed as a threat.

The last context variable I am supposed to investigate is replication. This field describes whether or not the experiment is a replication. If it is, we mark it with an X. We only classify the experiment as a replication if the authors themselves describe the study as one. In the comments field it is described what type of replication it is (strict replications; replications that vary the manner in which the experiment is run; that vary independent variables; that vary dependent variables; that vary context variables; or that extend the theory [1]), preferably with a short reference to the article describing the original experiment. The comments field may also contain information saying that we suspect the study to be a replication, even though the authors do not describe it as one.

Every field has an associated comments field. This is used for information that may be of interest but does not fit into the fields described here, or information that cannot be written in a standard way. As an illustration, the fields and their conventions could be captured in a record structure like the sketch below.
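The sketch below is illustrative only; the actual layout of the CONTEXT database is not prescribed here. The field names follow the context variables above and the replication types follow Basili et al. [1], while the class names, the helper function and the example statistic are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

MISSING = None    # analyst missed the field
NOT_FOUND = "??"  # analyst looked for the information but could not find it

class Discussion(Enum):
    DISCUSSED = "discussed"
    NOT_DISCUSSED = "not discussed"
    NOT_RELEVANT = "not relevant"   # e.g. the sample was not students

class ReplicationType(Enum):
    STRICT = "strict"
    VARY_MANNER = "vary the manner in which the experiment is run"
    VARY_INDEPENDENT = "vary independent variables"
    VARY_DEPENDENT = "vary dependent variables"
    VARY_CONTEXT = "vary context variables"
    EXTEND_THEORY = "extend the theory"

@dataclass
class ExperimentRecord:
    recruitment: Optional[str] = MISSING                 # how participants were recruited and paid
    selection_of_participants: Optional[str] = MISSING   # where the participants came from
    grouping: Optional[str] = MISSING                    # grouping and randomisation procedure
    target: Optional[str] = MISSING                      # 'implicit' if not stated explicitly
    generalization_of_context: Optional[str] = MISSING   # 'implicit' if not stated explicitly
    external_threats: Optional[str] = MISSING            # 'not discussed' if not discussed
    internal_threats: Optional[str] = MISSING
    generalization_from_students: Optional[Discussion] = MISSING
    replication: bool = False                            # the 'X' mark
    replication_type: Optional[ReplicationType] = MISSING
    comments: dict = field(default_factory=dict)         # per-field free-text comments

def fraction_discussing_external_threats(records: list) -> float:
    """Example statistic: share of analysed experiments whose authors discuss external threats."""
    analysed = [r for r in records if r.external_threats not in (MISSING, NOT_FOUND)]
    discussing = [r for r in analysed if r.external_threats != "not discussed"]
    return len(discussing) / len(analysed) if analysed else 0.0
```

With records in this form, simple statistics such as the share of experiments that discuss external threats can be computed directly from the database, without further preparation of the material.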
4 Time schedule

I hope to complete this Masters Thesis within the standard time, which means that the final report will be finished in the spring of 2004. To achieve this, I hope to have all the articles analysed by June 2003. From the summer of 2003 I will start to prepare the material I have collected through the analysis, and hopefully I can start writing the report in the early autumn of 2003. During the period of writing the report (autumn 2003 to spring 2004), I also hope to write an article about the topic.

5 Discussion

When the complete analysis of the articles is finished and the final report written, this material will be of great interest to those who want to place themselves and their research in the landscape of Software Engineering. It will also be interesting to compare the results of our analysis with similar earlier surveys, to see whether an evolution is taking place. Hopefully this survey will bring some new thoughts to researchers around the world, so that we can all benefit from what others do and have done.

6 Summary

Through the CONTEXT project we hope to find out what the state of the art in Software Engineering is today and to place ourselves (Simula Research Laboratory) in that landscape. Such a thorough analysis of controlled experiments in Software Engineering has never been carried out before, so this is pioneering work. It will be interesting to see what scientists around the world do differently and what they do alike in their experiments, and to see what kinds of threats to validity they take into consideration.

References

[1] V. R. Basili, F. Shull & F. Lanubile, "Building Knowledge through Families of Experiments", IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 456-473, Jul/Aug 1999.
[2] W. F. Tichy, "Should Computer Scientists Experiment More? 16 Excuses to Avoid Experimentation", IEEE Computer, vol. 31, no. 5, pp. 32-40, May 1998.
[3] M. V. Zelkowitz & D. R. Wallace, "Experimental Models for Validating Technology", IEEE Computer, vol. 31, no. 5, pp. 23-31, May 1998.
[4] P. Lukowicz, E. A. Heinz, L. Prechelt & W. F. Tichy, "Experimental Evaluation in Computer Science: A Quantitative Study", Journal of Systems and Software, vol. 28, pp. 9-18, 1995.
[5] R. M. Lindsay & A. S. C. Ehrenberg, "The Design of Replicated Studies", The American Statistician, vol. 47, no. 3, pp. 217-228, August 1993.