An eLearning website for the design and analysis of experiments with application to chemical processes D.C. Woods1 , D.M. Grove1 , I. Liccardi2 , S.M. Lewis1 , and J.G. Frey3 1 2 3 Southampton Statistical Sciences Research Institute School of Electronics and Computer Science School of Chemistry University of Southampton, UK Email: D.C.Woods@maths.soton.ac.uk Summary. An eLearning website is described for the design and analysis of experiments, with particular application to chemistry research. Interactive learning content is provided via the R statistical software. Authentic chemistry examples, which make use of simulation and data analysis routines in the software, demonstrate the application of the statistical methods. We outline the content of the website, the scope of the interactive examples and the interface linking the web-browser and the R system. Key words: chemistry education, simulation, statistics education, web-based learning 1 Introduction The value of statistical design of experiments (DOE) methods has become increasing recognised by researchers in many areas of chemistry (see, for example, [?]). As a result of collaborations between statisticians and chemists at the University of Southampton, a need was recognised for tailored statistics learning resources for research chemists, specifically academics, postdoctoral researchers, PhD students and advanced undergraduate students. The development of a web-based approach using chemistry exemplars was viewed as a method of providing resources that include both clear exposition of statistical topics and interactive examples to promote problem-based learning. There is a huge variety of web-based statistics and chemistry resources available; see the peer-reviewed Merlot database at http://www.merlot.org. In statistics, many of these take the form of Java Applets which demonstrate specific statistical concepts. Broader online tutorials are generally appropriate for introductory statistics courses (see [?] for a review) and many of them are commercial ventures that require a subscription fee. Freely available web-based tutorials and textbooks, such as StatSoft’s electronic text- 1642 D.C. Woods et al book (http://www.statsoftinc.com/textbook/stathome.html), MM*Stat (http://www.quantlet.com/mdstat/) and the German-language e-stat system (http://www.e-stat.de), are not tailored to the specific needs of chemistry researchers and tend to offer a breadth of topics that may be intimidating for statistical novices. The eLearning website described in this paper was designed to provide a freely available interactive design of experiments learning tool for the chemistry community. Authentic chemistry examples are used to motivate and explain the statistical ideas and links to the R statistical software [?] are used to provide interactive content. In Section 2 we describe the design and content of the website. Section 3 outlines the interface between the webpages and the R system, and, in Section 4, we describe the use of interactive R content in the context of computer-generated and optimal design. The website is platform and browser independent and does not require any specialist software to be installed on the client machine. The site is available at http://www.doe.soton.ac.uk/elearning. 2 Design of the website The aim of the website is the introduction of the design and analysis of experiments to chemists who, although having little or no formal statistics training, are comfortable with the basics of handling data. Hence the material on the website assumes that the user has a fairly limited background in statistics, with perhaps some experience of the Normal distribution and simple linear regression (although these topics are briefly revised). A model-based approach to the design of experiments is developed on the website and, in the analysis of the experimental data, the link with regression analysis is established. Thus the focus of the website is the collection of data for subsequent model fitting. This is a different approach to that taken by much design of experiments training for chemists but, in our opinion, it allows the easy introduction of more advanced topics (such as optimal design) and allows the website to be used as a primer for subsequent courses or training. The use of embedded R code, which we shall call R scripts, allows interactive and dynamic content to be provided. It is well recognised that students learn best from the use of dynamic activities, rather than static exposition (see, for example, [?]). Use is made of simulations to allow learners to generate experimental data, fit regression models and visualise their results, all in real time. This allows the user to “try out” the methods discussed on real chemistry examples and provides demonstration of the potential application of the statistical methods. The website has six sections which form a linear story and may also be used as stand-alone modules, for example, as part of a statistics or chemometrics course. Starting with the motivation for factorial experiments, the website introduces the basic ideas of factorial experiments and regression analysis and also covers more advanced topics, including computer-generated design, model building and model selection, as described below. Design of experiments eLearning website 1643 Section 1: Introduction to designed experiments This section motivates the use of designed experiments in a chemistry context. Terminology is explained and regression models are introduced (possibly as revision). R scripts allow users to explore a variety of response surfaces that can be described by linear regression models. The efficiency of factorial experiments compared with a “one factor at a time” approach is also discussed. Section 2: Classical designs for multi-factor experiments An example on the optimisation of the desilylation of a silyl ether, taken from [?], is introduced in this section, and is used to motivate and illustrate the topics of 2-level factorial and fractional factorial designs. Aliasing between factorial effects, including interactions, is described through the regression model by considering the bias introduced into the parameter estimators. Central composite designs that allow the fitting of second-order models are also introduced. R scripts are used to demonstrate the fitting of regression models to factorial experiments, see Figure 1. The use of a simulation also helps to explain aliasing and confounding through the generation of data from a full 23 factorial experiment. Learners are able to compare the fitted models, and judge the impact of aliasing on subsequent conclusions, by fitting various models to the full factorial and the 23−1 half-replicate. Section 3: Optimal designs for multi-factor experiments In this section, the use of optimal designs obtained from computer search algorithms is motivated and introduced. D and V -optimality (see, for example, [?] Ch.10) are defined in an intuitive manner without mathematical detail. Appropriate applications for the different criteria are outlined, e.g. D-optimality for screening experiments and V -optimality for experiments to optimize a response. This section of the website is more fully described in Section 4 of this paper. Section 4: Running an experiment in practice Section 4 starts with a checklist of important questions to be asked and decisions to be made before a designed experiment is run in practice. Issues such as scoping studies, reproducibility and replication are then discussed. Some topics which are beyond the current scope of the module, such as blocking and transformations of a response, are briefly introduced. Section 5: Fitting a model to experimental data This section discusses issues of model fitting, again using an example based on that of [?]. The use of replicated experimental runs to estimate the residual error is explained and demonstrated using simulation. Hypothesis testing and the interpretation of pvalues in an Analysis of Variance table are then intuitively explained, with R scripts allowing the learner to explore these ideas interactively. The section ends with an exercise in which the learner interprets the simulated results of an experiment and decides which of the factorial effects are important. 1644 D.C. Woods et al Fig. 1. Output from an interactive R script which demonstrates the fitting of a first-order regression model to data from an experiment using two replicates of a 22 factorial design, with data simulated from a second-order model taken from [?]. Section 6: Evaluating and choosing a model Several tools for assessing the adequacy of a fitted model are introduced in this section, including residual plots, R2 and Mallows’ method (see, for example, [?]). R scripts are used to allow the learner to engage with these topics, again using the desilylation example from [?]. The final exercise invites the learner to select factor levels which optimize the response predicted by their chosen model. 3 Web-interface to the R system To embed output from the R system into the webpages requires a fast and secure interface between the web-browser and R. This is achieved using Java’s JSP technology, which allows Java code to be embedded within the HTML of a website. JSP allows the development of custom tags, which resemble HTML or XML tags. When the webpage is retrieved by a browser, the custom tag is replaced with the output of the Java code associated with the tag. This code is invoked on the server and so no Java software is required on the client machine. Design of experiments eLearning website 1645 A library of custom JSP tags has been developed to allow the embedding of R scripts within the website. An R script is dynamically invoked when an access is made to a webpage. Through tags, input boxes can be added to pages to allow values to be submitted to scripts, and the output from the scripts presented to the user. To allow pages to be interactive, parameters may be passed to scripts. Inside R, these take the form of global variables. The variables are defined within the JSP tag; for example, the following code would invoke the script named ”interactive.R” with the variable ‘x’ set to the value of 123: <rembed:script name="interactive.R"> <rembed:variable name="x" defaultValue="123" /> </rembed:script> <rembed:controls /> Use of the “controls” tag automatically adds a form to control all inputs to the script. Unlike a system such as Rweb [?], users may not upload scripts to the server for security reasons. As the user only ever accesses R through the JSP tags, the potential for malicious misuse is minimized. Tags were developed similarly to allow the embedding of plots and tables, outputted from the R scripts. JSP tags were also developed to allow data to be passed between R scripts, thereby allowing data to be passed between different pages on the website and to be manipulated on each page. This was achieved through the JSP session, which provides storage that persists between page accesses. These tags are used extensively in Sections 5 and 6, where linear “stories” are told through a sequence of webpages using simulated data. All R computations are conducted on the server. When a script is to be executed, the custom tags first wrap up the script and attach header and footer sections. The header defines various R functions which embedded scripts may use. An example is the “rembed output” function, which controls the output of data tables from the R code. The header also declares any parameter variables that are to be passed to the script. An R “interpreter” is then started and the script passed to the interpreter through an interprocess communication channel. The use of header and footer sections and the R interpreter means that standard R code requires only minimal alteration to be used on the webpages and hence development time is greatly reduced. Any information printed to the standard output by the script is added to the page automatically by the tag. More complicated output, such as tables and graphical plots, is written to files by the R script. Each R script is given a special directory by the embedding code where it can place output files. These temporary files are unique to each user and allow multiple users to access the pages at any one time. Any such output files are collected after the script terminates so that they may be presented on the page. A schematic of the communication between the browser and R is given in Figure 2. The screenshots in this paper give examples of outputs provided by the JSP tags. 4 Examples of interactivity using R To demonstrate the use of R scripts to add interactive content, we focus on Section 3 of the website, entitled “Optimal designs for multi-factor experiments”. The learner 1646 D.C. Woods et al Fig. 2. Schematic of the interface between the web browser and R. is introduced to the method of using computer algorithms to find designs under both D- and V -optimality. R scripts are used to allow learners to find optimal designs under a variety of models and for a range of run sizes. The R scripts in this section use the exchange algorithm in the AlgDesign package [?] to find the computer-generated designs. Two different examples are used: Example 1. A three factor example from Section 2 is reintroduced and used to demonstrate how computer-generated designs can be applied when there are constraints on the design region. In this example, it is impossible to experiment in one corner of the design region. This prevents the application of standard factorial and response surface designs. The learner can choose the form of the polynomial model (linear, quadratic or cubic) under which the design is evaluated, the number of runs, the optimality criterion (D or V ) and the corner of the design region to exclude from the candidate list. The resulting design is displayed graphically. This example allows a learner to observe how the computer-generated design locates its design points in relation to the excluded region. Example 2. The advantage of computer-generated designs in incorporating irregular design regions and relationships between variables is demonstrated using the nine chemical descriptors for solvents given by [?], see Figure 3. The learner may select three of these descriptors and use R to find a D- or V -optimal design for an experiment on combinations of descriptor values, again for a variety of models and numbers of runs. The candidate list of possible design points and the selected design are displayed graphically. This allows learners to compare the distribution of points in the selected design with the distribution of all possible points. Both examples use a subset of drop-down lists, numeric input boxes and tick boxes to obtain input from the learner. This user-friendly input is then translated by the JSP tags into input files for the R scripts. Checks on the requested number of runs input and, in the second example, the number of descriptors selected prevents the user from entering inputs that are not meaningful or that would cause errors in the R code. 5 Discussion The website described here is a freely available learning tool for the statistics and chemistry communities. It has been successfully used as part of postgraduate and undergraduate courses at the University of Southampton, in a blended learning Design of experiments eLearning website 1647 Fig. 3. Graphical displays of the candidate list and a D-optimal design with 12 runs for a quadratic regression model in three chemical descriptors. approach, for statistics students as well as chemists. In statistics courses, it was found to be valuable for providing background and reinforcement material, particularly on courses with students who come from a wide variety of backgrounds. The feedback from students and instructors is being used to improve both the interface and content of the pages. Further features under development include a database of multiple choice exercises, which will be accessible from within each section of the webpage, and further formative feedback from the interactive examples. These developments will enhance the use of the website as a stand-alone learning tool. Acknowledgements This work was supported by the EPSRC e-Science grant GR/R67729 and by a miniproject award from the Higher Education Academy Maths, Stats and OR Network. References [FW74] G. Furnival and R. Wilson, Regression by leaps and bounds, Technometrics 16 (1974), 499–511. [Har01] F.E. Harrell, Regression modeling strategies, Springer, New York, 2001. [HTF01] T.J. Hastie, R.J. Tibshirani, and J. Friedman, The elements of statistical learning, Springer, New York, 2001. [Urb04] S. Urbanek, Model selection and comparison using interactive graphics, Ph.D. thesis, Augsburg, 2004. [UVW03] A. R. Unwin, C. Volinsky, and S. Winkler, Parallel coordinates for exploratory modelling analysis, Computational Statistics & Data Analysis 43 (2003), no. 4, 553–564. 1648 [VR02] D.C. Woods et al W. N. Venables and B. D. Ripley, Modern applied statistics with S, 4th ed., Springer, New York, 2002.