Statistical Methods in Scientific Research
Practical 2: Discussion topics

1. Linear models

A linear model is any model of the form

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon,$$

where $y$ is a quantitative response of interest, $x_1, \dots, x_p$ are prognostic factors that might influence the value of $y$, $\beta_0, \beta_1, \dots, \beta_p$ are unknown coefficients, and $\varepsilon$ is the residual error, which is assumed to have mean 0 and standard deviation $\sigma$.

Write simple linear models that might express the following.

(a) The dependence of the yield of wheat from an experimental plot on the amounts of nitrogen, phosphorus and potassium applied as fertiliser.

(b) The dependence of the change in heart rate of a healthy human volunteer on the dose of an administered drug and on the square of the dose.

(c) The dependence of the amount of carbon dioxide emitted by cars during 10 minutes of steady driving on the speed and on the type of fuel (petrol or diesel).

(d) The dependence of the reading speed of children in their native language on age and on language (English, French, German or Italian).

In each case, identify a null hypothesis that it might be of interest to test. Express it in terms of setting certain $\beta_i$ values equal to 0. What would it mean if it were not rejected?

2. Stepwise regression

Consider a quantitative response of interest, $y$, that could be related to any of a number of factors, $x_1, \dots, x_p$. When there are many factors, the number of possible models that can be fitted becomes very large. Automatic, stepwise procedures have been suggested to overcome this problem. For example, in backward elimination, a model with all factors included is fitted first. Then $p$ simpler models are fitted, each omitting just one of the factors. Each of the simpler models is compared with the full model using an F-test, and the factor whose omission leads to the smallest F-statistic is then eliminated. The process is then repeated.
It is continued until the factor identified for elimination is one that corresponds to a significantly large F-statistic. At that stage, the model including all remaining factors is chosen as the final model.

(a) What problems might be encountered in trying to identify the full model with which to start a backward elimination procedure?

(b) Suppose that a study is conducted to assess the effectiveness of a new washing powder in cleaning white shirts. The response $y$ is some continuous measure of whiteness. The first factor, $x_1$, is the identity of the washing powder, with value 0 for the current product and value 1 for a new and potentially improved product. A second factor is $x_2$, the temperature of the wash. The study involves three washing machines, A, B and C: $x_3 = 1$ if the machine is B and is 0 otherwise, while $x_4 = 1$ if the machine is C and is 0 otherwise (thus washing machine A corresponds to setting both $x_3$ and $x_4$ equal to 0). These machines are just the available laboratory machines and are not intended to represent different types of domestic machine. Would a backward elimination procedure be suitable for the analysis of this study? If so, what full model should be taken as the starting point? If not, how should the various possible models be compared?

3. Choice of model

An epidemiological study is to be conducted of the effect of exposure to coal dust on lung function. Workers in a coal-fired power station are assessed at the beginning of the study and one year later, and the response of interest, $y$, is the reduction in forced expiratory volume between the two assessments. Factors of interest are as follows: $x_1$ is a measure of exposure to coal dust (some workers from clean areas with $x_1 = 0$ are included), $x_2$ is age and $x_3$ is smoking history (average number of cigarettes per day). What considerations would govern the decision as to whether to include $x_3$ in the model when assessing the significance of $x_1$?

4. Linear models for binary data

In an experiment on the vegetative reproduction of root-stocks for plum trees, trees were grown from cuttings taken from the roots of older trees. The response of interest, $y$, was whether the tree was alive one year after taking the cutting, and the factors included $x_1$, the length of the cutting, and $x_2$, an indicator of whether the cuttings were planted immediately (in the autumn) or carefully stored over the winter and planted in the spring.

As $y$ is a binary variable (taking the value 0 for dead or 1 for alive), it cannot be modelled as a normal random variable, and standard regression models cannot be fitted. Alternative methods, based on maximum likelihood estimation, can be applied (see Collett, 2003). Goodness-of-fit is assessed by a measure known as deviance, and the analysis takes a form similar to that for normally distributed data. In these methods, the probability $p$ that $y = 1$ is modelled as a linear function of the factor values. Consider the two possible models below.

(i) $p = \beta_0 + \beta_1 x_1 + \beta_2 x_2$;

(ii) $\log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

Discuss the advantages and disadvantages of these two choices.
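
One way to start the comparison of models (i) and (ii) is to evaluate both numerically. The sketch below is only illustrative: the coefficient values `B0`, `B1`, `B2` and the cutting lengths (in arbitrary units) are hypothetical and are not estimates from the plum root-stock experiment, and the function names are invented for this example. Using the same coefficients in both models is also an assumption made purely for convenience, since in practice each model would be fitted separately.

```python
import math

# Hypothetical coefficients, chosen only for illustration; they are not
# estimates from the plum root-stock experiment.
B0, B1, B2 = 0.10, 0.08, 0.15


def linear_predictor(x1, x2):
    """Return B0 + B1*x1 + B2*x2 for cutting length x1 and planting indicator x2."""
    return B0 + B1 * x1 + B2 * x2


def p_identity(x1, x2):
    """Model (i): p is set equal to the linear predictor itself."""
    return linear_predictor(x1, x2)


def p_logit(x1, x2):
    """Model (ii): log(p / (1 - p)) equals the linear predictor,
    so p is recovered by applying the inverse logit."""
    eta = linear_predictor(x1, x2)
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    print("length  x2  p under (i)  p under (ii)")
    for length in (0, 5, 10, 15):
        for x2 in (0, 1):
            print(f"{length:6d}  {x2:2d}  {p_identity(length, x2):11.3f}  "
                  f"{p_logit(length, x2):12.3f}")
```

Running the script tabulates $p$ under each model over a small grid of factor values. It may be worth checking, for each row, whether the value produced is a valid probability, and how the interpretation of $\beta_1$ differs between the two columns.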