Statistical Methods in Scientific Research

Practical 2: Discussion topics
1. Linear models
A linear model is any model of the form
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$
where $y$ is a quantitative response of interest, $x_1, \ldots, x_p$ are prognostic factors
that might influence the value of $y$, $\beta_0, \beta_1, \ldots, \beta_p$ are unknown
coefficients and $\varepsilon$ is called the residual error and is assumed to have mean 0
and standard deviation $\sigma$.
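As an illustration of the definition above, the following is a minimal sketch of fitting such a model by least squares. The simulated data, the coefficient values and the function name `fit_linear_model` are assumptions made for the example and do not come from the practical.

```python
import numpy as np


def fit_linear_model(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp + error.

    X is an (n, p) array of factor values. Returns (beta, sigma_hat),
    where beta[0] is the estimated intercept and sigma_hat estimates
    the standard deviation of the residual error.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # intercept column first
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    # Divide by n - p - 1: one degree of freedom per fitted coefficient.
    sigma_hat = np.sqrt(residuals @ residuals / (n - p - 1))
    return beta, sigma_hat


# Simulated data with assumed coefficients (1.0, 2.0, -0.5) and sigma = 0.3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([1.0, 2.0, -0.5])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.3, size=200)

beta_hat, sigma_hat = fit_linear_model(X, y)
print("estimated coefficients:", beta_hat)
print("estimated residual standard deviation:", sigma_hat)
```

The estimates should lie close to the assumed values, with differences that reflect the residual error.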
Write simple linear models that might express the following.
(a) The dependence of the yield of wheat from an experimental plot on the amount of
    nitrogen, phosphorus and potassium applied as fertiliser.
(b) The dependence of the change in heart rate of a healthy human volunteer on the
    dose of an administered drug and on the square of the dose.
(c) The dependence of the amount of carbon dioxide emissions from cars during 10
    minutes of steady driving on the speed and on the type of fuel (petrol or diesel).
(d) The dependence of the reading speed of children in their native language on age
    and on language (English, French, German or Italian).
In each case, identify a null hypothesis that it might be of interest to test. Express it in
terms of setting certain $\beta_i$ values to be equal to 0. What would it mean if it were not
rejected?
2. Stepwise regression
Consider a quantitative response of interest, y, that could be related to any of a number of
factors, x1, …, xp. When there are many factors, the number of possible models that can
be fitted becomes very large. Automatic, stepwise procedures have been suggested to
overcome this problem. For example, in backward elimination, a model with all factors
included is fitted first. Then p simpler models are fitted, each omitting just one of the
factors. Each of the simpler models is compared with the full model using an F-test, and
the factor which when omitted leads to the smallest F-statistic is then eliminated. The
process is then repeated. It is continued until the factor identified for elimination is one
that corresponds to a significantly large F-statistic. At that stage the model including all
remaining factors is chosen to be the final model.
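The backward elimination procedure described above can be sketched as follows. This is an illustrative sketch only. It assumes that every factor occupies a single column of the data (one degree of freedom each) and it uses a fixed cut-off `f_threshold=4.0`, which is roughly the 5% point of an F distribution with 1 and about 60 degrees of freedom. The function names, the cut-off and the simulated data are all assumptions of the example.

```python
import numpy as np


def rss(X, y, columns):
    """Residual sum of squares for a model with an intercept and the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def backward_elimination(X, y, f_threshold=4.0):
    """Return the column indices retained by backward elimination."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    kept = list(range(X.shape[1]))  # start from the full model
    while kept:
        rss_full = rss(X, y, kept)
        df_resid = n - len(kept) - 1
        # F-statistic for omitting each remaining factor in turn (1 df each).
        f_stats = {}
        for j in kept:
            reduced = [k for k in kept if k != j]
            f_stats[j] = (rss(X, y, reduced) - rss_full) / (rss_full / df_resid)
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] >= f_threshold:
            break  # smallest F is significantly large: stop here
        kept.remove(weakest)
    return kept


# Simulated data: only columns 0 and 2 truly influence y (assumed values).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

selected = backward_elimination(X, y)
print("factors retained (0-based column indices):", selected)
```

Because each comparison is a significance test, a factor with no real effect can occasionally be retained by chance.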
(a) What problems might be encountered in trying to identify the full model with
    which to start a backward elimination procedure?
(b) Suppose that a study is conducted to assess the effectiveness of a new washing
    powder in cleaning white shirts. The response y is some continuous measure of
    whiteness. The first factor x1 is the identity of the washing powder, with value 0
    for the current product and value 1 for a new and potentially improved product. A
    second factor is x2, the temperature of the wash. The study involves three
    washing machines, A, B and C, and x3 = 1 if the machine is B and is zero
    otherwise, while x4 = 1 if the machine is C and is zero otherwise (thus washing
    machine A corresponds to setting both x3 and x4 equal to 0). These machines are
    just the available laboratory machines and are not intended to represent different
    types of domestic machine. Would a backward elimination procedure be suitable
    for the analysis of this study? If so, what would the full model be to take as a
    starting point? If not, how should the various possible models be compared?
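The indicator coding of the three machines described in (b) can be written out explicitly. The short sketch below uses a hypothetical helper name, `machine_indicators`, and simply reproduces the coding stated in the text.

```python
def machine_indicators(machine):
    """Return (x3, x4) for washing machine 'A', 'B' or 'C'.

    Machine A is the reference level, so both indicators are 0 for it.
    """
    if machine not in ("A", "B", "C"):
        raise ValueError("machine must be 'A', 'B' or 'C'")
    x3 = 1 if machine == "B" else 0
    x4 = 1 if machine == "C" else 0
    return x3, x4


for m in ("A", "B", "C"):
    print(m, machine_indicators(m))
```

Note that x3 and x4 together describe a single three-level factor, so they are not two unrelated factors.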
3. Choice of model
An epidemiological study is to be conducted of the effect of exposure to coal dust on lung
function. Workers in a coal-fired power station are assessed at the beginning of the study
and one year later, and the response of interest y is the reduction in forced expiratory
volume between the two assessments. Factors of interest are as follows: x1 is a measure
of exposure to coal dust (some workers from clean areas with x1 = 0 are included), x2 is
age and x3 is smoking history (average number of cigarettes per day). What
considerations would govern the decision as to whether to include x3 in the model when
assessing the significance of x1?
4. Linear models for binary data
In an experiment on the vegetative reproduction of root-stocks for plum trees, trees were
grown from cuttings taken from the roots of older trees. The response of interest y was
whether the tree was alive one year after taking the cutting, and the factors included x1,
the length of the cutting, and x2, an indicator of whether the cuttings were planted
immediately (in the autumn) or carefully stored over the winter and planted in the spring.
As y is a binary variable (taking the value 0 for dead or 1 for alive), it cannot be modelled
as a normal random variable, and standard regression models cannot be fitted.
Alternative methods, based on maximum likelihood estimation, can be applied (see
Collett, 2003). Goodness-of-fit is assessed by a measure known as deviance, and the
analysis takes a similar form to that of normally distributed data. In these methods, the
probability p that y = 1 is modelled as a linear function of the factor values. Consider the
two possible models below.
(i) $p = \beta_0 + \beta_1 x_1 + \beta_2 x_2$;
(ii) $\log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
Discuss the advantages and disadvantages of these two choices.
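As a starting point for the discussion, the two models can be evaluated numerically for the same linear combination of factor values. The sketch below is illustrative only: the coefficient values, the cutting length of 20 and the function names are assumptions chosen for the example and are not estimates from the plum root-stock experiment.

```python
import math


def linear_probability(b0, b1, b2, x1, x2):
    """Model (i): p is the linear predictor itself."""
    return b0 + b1 * x1 + b2 * x2


def logistic_probability(b0, b1, b2, x1, x2):
    """Model (ii): log(p / (1 - p)) is the linear predictor, solved for p."""
    eta = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients and factor values (assumed for illustration).
b0, b1, b2 = 0.2, 0.05, 0.1
x1, x2 = 20.0, 1  # cutting length 20, planted in the spring

p_linear = linear_probability(b0, b1, b2, x1, x2)
p_logistic = logistic_probability(b0, b1, b2, x1, x2)
print("model (i):", p_linear)
print("model (ii):", p_logistic)
```

With these assumed values the linear predictor is about 1.3, so model (i) returns a value above 1 while model (ii) returns a value strictly between 0 and 1. Whether this matters in practice depends on the range of factor values of interest.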