Adaptive modelling of conditional variance function
Juutilainen I. and Röning J.
Intelligent Systems Group, University of Oulu, PO BOX 4500, 90014 Oulu, Finland
ilmari.juutilainen@ee.oulu.fi, juha.roning@ee.oulu.fi
Summary. We study a situation where the dependence of the conditional variance on explanatory variables varies over time. The possibility and potential advantages of adaptive modelling of conditional variance are recognized. We present approaches for adaptive modelling of the conditional variance function and elaborate two procedures, moving window estimation and on-line quasi-Newton. The proposed methods were successfully tested on a real industrial data set.
Key words: adaptive methods, conditional variance function, variance modelling,
time-varying parameter
1 Introduction
In many problems, both the mean and the variance of the response variable depend
on several explanatory variables. A model for the variance is needed to draw the
right conclusions based on the predicted conditional distribution. Modelling of the
conditional variance function has been applied in many fields, including industrial
quality improvement [Gre93].
Adaptive learning (on-line learning) is commonly used to model time-varying dependence of the response on the explanatory variables, or to increase model accuracy over time as new data accumulate. Adaptive methods sequentially adjust the model parameters based on the most recent data.
Adaptive models have usually described the conditional distribution of the response as a time-varying relationship between the explanatory variables and the expected value of the response. Some models, such as GARCH and stochastic volatility models, assume a time-varying variance that does not depend on the explanatory variables. Models for time-varying dependence of the conditional variance on the explanatory variables have not been discussed previously.
Recursive kernels have been proposed for the sequential estimation of conditional variance depending on several explanatory variables [ST95]. The authors, however, assume that the variance function does not change over time. Their model does not adapt well to changes in the variance function, because old observations are never discarded from the model.
In this paper, we propose two methods for adaptive modelling of the conditional variance function: moving window estimation and on-line quasi-Newton. We also
discuss the role of mean model estimation in adaptive modelling of variance. We
used the proposed methods to predict the conditional distribution of strength of
steel plates based on a large industrial data set.
2 Methods
We denote the $i$th observation of the response variable by $y_i$ and the related vector of inputs by $x_i$. The observations $(y_i, x_i)$, $i = 1, 2, \ldots$ are observed sequentially at times $t_1, t_2, \ldots$, $t_i < t_{i+1}$. We assume that the $y_i$ are independently normally distributed with mean $\mu_i = \mu(\beta(t_i), x_i)$ and variance $\sigma_i^2 = \sigma^2(\tau(t_i), x_i)$. Both the parameter vector of the mean function, $\beta$, and the parameter vector of the variance function, $\tau$, change over time $t$ and form time-continuous processes $\{\beta(t)\}$ and $\{\tau(t)\}$.

The expectation of the squared error term equals the conditional variance: $E\varepsilon_i^2 = E(y_i - \mu_i)^2 = \sigma_i^2$. When the response variable is normally distributed, the squared error term is gamma-distributed. If we knew the correct mean model, the variance function could be correctly estimated by maximising the gamma log-likelihood
$$L = \sum_i L_i = \sum_i \left[ -\log \sigma^2(\tau, x_i) - \varepsilon_i^2 / \sigma^2(\tau, x_i) \right]$$
using the squared error term $\varepsilon_i^2 = [y_i - \mu(\beta(t_i), x_i)]^2$ as the response [CR88].
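To make the estimation step concrete, here is a minimal Python sketch (ours, not part of the original paper) of the gamma maximum-likelihood fit. It assumes the variance form $\sigma^2(\tau, x) = (x^T\tau)^2$ used later in Section 4; the helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def sigma2(tau, X):
    """Variance function sigma^2(tau, x) = (x^T tau)^2 (the form used in Sec. 4)."""
    return (X @ tau) ** 2

def neg_gamma_loglik(tau, X, eps2):
    """Negative gamma log-likelihood: sum_i [log s2_i + eps2_i / s2_i]."""
    s2 = sigma2(tau, X)
    if np.any(s2 <= 0):          # keep the optimiser inside the valid region
        return np.inf
    return np.sum(np.log(s2) + eps2 / s2)

def fit_variance(X, eps2, tau0):
    """Maximum-likelihood fit of the variance parameters tau.

    eps2 is the series of squared residuals from the current mean model.
    """
    res = minimize(neg_gamma_loglik, tau0, args=(X, eps2), method="Nelder-Mead")
    return res.x
```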
2.1 Moving Window Modelling
Moving window is a simple and widely used method for adaptive modelling. In the moving window method, the model is regularly re-estimated using only the most recent observations. The drawback of the method is that the whole model must be re-estimated at each model update. The update formulas developed for linear regression substantially reduce the computational cost [Pol03], and they appear to be approximately applicable to gamma generalised linear models via the results of [MW98]. The window width, $w$, can be defined as a time interval or as the number of observations included in the estimation data set. One usual modification is to discount the weight of earlier observations in the model fitting instead of discarding them completely.
The moving window method is easily applicable to the modelling of the variance function. At chosen time moments $t_e$, or after chosen observations $(y_e, x_e)$, the conditional variance function is estimated by maximising the weighted gamma log-likelihood
$$\hat{\tau}_e = \arg\max_\tau \sum_{i \in W} \omega_i \left[ -\log \sigma^2(\tau, x_i) - \frac{\varepsilon_i^2}{\sigma^2(\tau, x_i)} \right] \qquad (1)$$
over the set of the most recent observations: $W = \{i \mid t_e - w \le t_i \le t_e\}$ or $W = \{e - w, e - w + 1, \ldots, e\}$. One can choose unit weights $\omega_i = 1\ \forall i$ or discount the weight of older observations. The window width and the amount of discounting are set to optimise the speed of adaptivity.
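A sketch of the moving window refit under the same assumed variance form; the exponential form of the discounting weights $\omega_i$ is our illustrative choice, not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import minimize

def moving_window_fit(times, X, eps2, t_e, width, tau0, decay=0.0):
    """Refit the variance model at time t_e on W = {i : t_e - width <= t_i <= t_e}.

    decay = 0.0 gives unit weights omega_i = 1; decay > 0 discounts older points.
    """
    in_window = (times >= t_e - width) & (times <= t_e)
    Xw, e2w, tw = X[in_window], eps2[in_window], times[in_window]
    omega = np.exp(-decay * (t_e - tw))   # all ones when decay = 0

    def nll(tau):
        # Negative of the weighted gamma log-likelihood in Eq. (1)
        s2 = (Xw @ tau) ** 2
        if np.any(s2 <= 0):
            return np.inf
        return np.sum(omega * (np.log(s2) + e2w / s2))

    return minimize(nll, tau0, method="Nelder-Mead").x
```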
2.2 Stochastic Gradient
The stochastic gradient (stochastic approximation) method employs each new observation to move the parameter estimates based on the gradient of the loss function at that observation. After that, the observation is discarded, so the model is maintained without a need to store any observations. With a non-shrinking learning rate, the model can adapt to changes in the modelled phenomenon over time [MK02]. The methods discussed under the title of 'on-line learning' are often variations of the stochastic gradient.
We propose to apply the stochastic gradient method to adaptive modelling of conditional variance. We call the proposed method 'on-line quasi-Newton'. The proposed method is an adaptive modification of the non-adaptive on-line quasi-Newton algorithm [Bot98] and the recursive estimation method for generalised linear models [MW98]. The modification that yields the adaptivity is the introduction of the learning rate $\eta(i+1)$ in Eq. (2). The update step directions are controlled by the accumulated outer-product approximation of the information matrix, $I_i(kl) = \sum_{j=1}^{i} E\,\partial^2 L_j/(\partial\tau_k\,\partial\tau_l)$. After each new observation $\varepsilon_i^2$, we propose to update the parameter estimates as in a single quasi-Newton step. At the same time, we keep track of the inverse of the approximated Hessian, $K_i = I_i^{-1}$, using the well-known matrix identity $(A + BB^T)^{-1} = A^{-1} - (A^{-1}B)(I + B^T A^{-1}B)^{-1}(A^{-1}B)^T$.
We propose to use a constant learning rate $\eta$, as has been common in the modelling of time-varying dependence [MK02]. Let $\hat{\tau}_i = \hat{\tau}(t_i)$, $\hat{\sigma}^2_{i+1} = \sigma^2(\hat{\tau}_i, x_{i+1})$, and let $\delta(\tau, x_i) = (\partial/\partial\tau)\,\sigma^2(\tau, x_i)$ be the vector of partial derivatives. The resulting update formula for the parameter estimates is
$$\hat{\tau}_{i+1} = \hat{\tau}_i + \eta(i+1)K_{i+1}\left(\frac{\varepsilon^2_{i+1}}{\hat{\sigma}^2_{i+1}} - 1\right)\frac{\delta(\hat{\tau}_i, x_{i+1})}{\hat{\sigma}^2_{i+1}}. \qquad (2)$$
Note that $iK_i = O(1)$, and the learning speed thus remains stable when $\eta$ is constant.
The learning rate controls the speed of adaptivity and should be selected based
on the application. The inverse of the approximated information matrix is updated
after each observation with
$$K_{i+1} = K_i - \frac{\left[K_i\,\delta(\hat{\tau}_i, x_{i+1})/\hat{\sigma}^2_{i+1}\right]\left[K_i\,\delta(\hat{\tau}_i, x_{i+1})/\hat{\sigma}^2_{i+1}\right]^T}{1 + \left[\delta(\hat{\tau}_i, x_{i+1})/\hat{\sigma}^2_{i+1}\right]^T K_i\left[\delta(\hat{\tau}_i, x_{i+1})/\hat{\sigma}^2_{i+1}\right]}. \qquad (3)$$
We propose to initialise the algorithm by using the results of a maximum likelihood fit in a relatively large initial data set. The initial inverse approximated Hessian is obtained as
$$\left[\sum_i \left(\delta(\hat{\tau}, x_i)/\hat{\sigma}^2_i\right)\left(\delta(\hat{\tau}, x_i)/\hat{\sigma}^2_i\right)^T\right]^{-1}.$$
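Collecting Eqs. (2), (3) and the initialisation, a sketch of one possible implementation, again assuming $\sigma^2(\tau, x) = (x^T\tau)^2$, for which $\delta(\tau, x) = 2(x^T\tau)x$; the class name is ours.

```python
import numpy as np

class OnlineQuasiNewtonVariance:
    """Adaptive variance model sigma^2(tau, x) = (x^T tau)^2 updated by Eqs. (2)-(3)."""

    def __init__(self, tau0, X_init, eta):
        self.tau = np.asarray(tau0, dtype=float).copy()
        self.eta = eta
        self.n = len(X_init)
        # Initial inverse approximated Hessian: [sum_i u_i u_i^T]^{-1},
        # with u_i = delta(tau, x_i)/sigma^2_i and delta(tau, x) = 2 (x^T tau) x.
        s2 = (X_init @ self.tau) ** 2
        U = 2.0 * (X_init @ self.tau)[:, None] * X_init / s2[:, None]
        self.K = np.linalg.inv(U.T @ U)

    def update(self, x, eps2):
        """One quasi-Newton step for a new observation (x, squared residual eps2)."""
        s2 = (x @ self.tau) ** 2           # predicted variance, assumed > 0
        u = 2.0 * (x @ self.tau) * x / s2  # delta(tau_i, x_{i+1}) / sigma^2_{i+1}
        # Eq. (3): Sherman-Morrison update of K = I^{-1}
        Ku = self.K @ u
        self.K -= np.outer(Ku, Ku) / (1.0 + u @ Ku)
        self.n += 1
        # Eq. (2): constant eta scaled by n keeps the effective step size stable
        self.tau += self.eta * self.n * (self.K @ u) * (eps2 / s2 - 1.0)
        return self.tau
```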
3 Effect of Mean Model Estimation
In practice, the true mean model is not known and has to be estimated. The variance function is estimated using the squared residuals $\hat{\varepsilon}_i^2$ as the response variable. The usual practice is to iterate between mean model estimation and variance model estimation [CR88].

We first assume that the true mean model is static: $\beta(t) = \beta\ \forall t$. The accuracy of the mean model can be improved with new data by occasional re-estimation. The response variable for variance function modelling should then be formed based on the latest, most accurate mean model. Let $\hat{\beta}$ denote the current estimator and
$\hat{\varepsilon}_i = y_i - \mu(\hat{\beta}, x_i)$ denote the residual. One should, however, notice that $E\hat{\varepsilon}_i^2 = \sigma_i^2 + \mathrm{var}(\hat{\mu}_i) - 2\,\mathrm{cov}(y_i, \hat{\mu}_i) + (\mu_i - E\hat{\mu}_i)^2$. The covariance $\mathrm{cov}(y_i, \hat{\mu}_i)$ is zero if the $i$th observation is not used for mean model fitting, but is otherwise positive. The bias $(\mu_i - E\hat{\mu}_i)^2$ is difficult to approximate, and the usual practice is to assume it negligible. If the correction terms $\Delta_i = 2\,\mathrm{cov}(y_i, \hat{\mu}_i)/\sigma_i^2 - \mathrm{var}(\hat{\mu}_i)/\sigma_i^2$ can be approximated, they should be taken into account in the model fitting by using the corrected response $e_i = \hat{\varepsilon}_i^2/(1 - \Delta_i)$, satisfying $Ee_i = \sigma_i^2$. For example, in the linear regression context $y_i = x_i^T\beta + \varepsilon_i$, we have $\mathrm{cov}(y_i, \hat{\mu}_i) = \mathrm{var}(\hat{\mu}_i) = x_i^T(X^T V^{-1}X)^{-1}(x_i/\sigma_i^2)$, where $V$ is a diagonal matrix with elements $V_{(ii)} = \sigma_i^2$.
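For the linear regression case, the correction can be computed directly from the formulas above; a sketch, taking the printed covariance expression at face value (so that $\Delta_i$ reduces to $\mathrm{cov}(y_i, \hat{\mu}_i)/\sigma_i^2$):

```python
import numpy as np

def corrected_response(X, y, sigma2):
    """Corrected response e_i = eps2_i / (1 - Delta_i) for weighted linear regression.

    Uses cov(y_i, mu_i) = var(mu_i) = x_i^T (X^T V^-1 X)^-1 (x_i / sigma2_i),
    under which Delta_i = 2*cov/sigma2_i - var/sigma2_i = cov(y_i, mu_i)/sigma2_i.
    """
    Vinv = 1.0 / sigma2                               # diagonal of V^{-1}
    G = np.linalg.inv(X.T @ (Vinv[:, None] * X))      # (X^T V^{-1} X)^{-1}
    beta = G @ (X.T @ (Vinv * y))                     # generalised least squares fit
    eps2 = (y - X @ beta) ** 2
    cov = np.einsum("ij,jk,ik->i", X, G, X / sigma2[:, None])  # x_i^T G (x_i/s2_i)
    delta = cov / sigma2
    return eps2 / (1.0 - delta)
```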
When the mean model changes over time, it is much more difficult to neglect the uncertainty about the mean. We now assume that the true mean model parameters form a continuous-time Lévy process $\{\beta(t)\}$ satisfying $E[\beta(t_i) - \beta(t_a)] = 0$ and $\mathrm{cov}[\beta(t_i) - \beta(t_a)] = B|t_i - t_a|$. We use a 'moving window'-type estimator $\hat{\beta}$, which has been estimated based on the observations measured around the time $t_a$, so that $E\hat{\beta} = \beta(t_a)$. In practice, the estimator is likely to follow the true parameter with a delay. Conditioned on the time $t_a$, the residual $\hat{\varepsilon}_i$ is normally distributed with expectation $E\hat{\varepsilon}_i = 0$ and a variance that depends on $\sigma^2(\tau(t_i), x_i)$, the steepness of $\mu(\beta(t), x)$ around $x_i$, $\mathrm{cov}[\beta(t_i) - \beta(t_a)]$, $\mathrm{var}(\hat{\mu}_i)$ and $\mathrm{cov}(y_i, \hat{\mu}_i)$. We suggest that the fluctuation in the mean model can be taken into account in the estimation of conditional variance by using an additional offset variable $q_i = \mathrm{var}[\mu(\beta(t_a), x_i) - \mu(\beta(t_i), x_i)]$. The offset variable is approximated using the covariance estimator $\hat{B}$, the time difference $|t_i - t_a|$ and the form of the regression function around $x_i$. The model is fitted using the equation $E\hat{\varepsilon}_i^2 = q_i + \sigma^2(\tau, x_i)$.
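In code, the offset simply enters the gamma log-likelihood inside the variance; a minimal sketch, with the approximated offsets $q_i$ supplied by the caller:

```python
import numpy as np
from scipy.optimize import minimize

def fit_variance_with_offset(X, eps2, q, tau0):
    """Fit E[eps2_i] = q_i + sigma^2(tau, x_i) by gamma maximum likelihood."""
    def nll(tau):
        s2 = q + (X @ tau) ** 2      # offset q_i absorbs mean-model fluctuation
        if np.any(s2 <= 0):
            return np.inf
        return np.sum(np.log(s2) + eps2 / s2)
    return minimize(nll, tau0, method="Nelder-Mead").x
```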
Adaptive on-line quasi-Newton can be applied to the joint likelihood of the mean and variance parameters. Because the information matrix is block-diagonal, the mean and the variance can be treated separately. As an alternative to adaptive joint modelling of the mean and variance, we sketch a moving window method for the linear case. The mean model is regularly refitted using the moving window method. For each fit, we choose a recent time moment $t_a$ on which the predictions are based. We assumed above that $\mathrm{cov}[\beta(t_i) - \beta(t_a)] = |t_i - t_a|B$. Let $b_i = \beta(t_i) - \beta(t_a)$. Now our model becomes $y_i = x_i^T\beta(t_a) + x_i^T b_i + \varepsilon_i$ with $\mathrm{cov}(b_i, b_j) = \min(|t_i - t_a|, |t_j - t_a|)\,B\,I[\mathrm{sign}(t_i - t_a) = \mathrm{sign}(t_j - t_a)]$, where $I(\cdot)$ denotes the indicator function. As discussed in [CP76], it follows that $\mathrm{cov}(y_i, y_j) = I[\mathrm{sign}(t_i - t_a) = \mathrm{sign}(t_j - t_a)]\,\min(|t_i - t_a|, |t_j - t_a|)\,x_i^T B x_j$. The covariance matrix $B$ can be estimated by maximum likelihood or MINQUE [CP76], and $\hat{\beta}(t_a)$ by generalised least squares, using the tools available for mixed models. We construct the squared residuals $\hat{\varepsilon}_i^2 = [y_i - x_i^T\hat{\beta}(t_a)]^2$ and fit the variance model $\sigma^2(\tau(t_i), x_i)$ using the moving window method. In the variance model fitting, we use an additional offset variable $q_i = |t_i - t_a|\,x_i^T\hat{B}x_i$. We predict the distribution of a new observation $x_n$ to be Gaussian with expectation $\mu(\hat{\beta}(t_a), x_n)$ and variance $\sigma^2(\hat{\tau}, x_n) + |t_n - t_a|\,x_n^T\hat{B}x_n + x_n^T\,\mathrm{cov}[\hat{\beta}(t_a)]\,x_n$.
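The predictive variance combines three terms, as the following small helper illustrates; B_hat and cov_beta stand for the estimates $\hat{B}$ and $\mathrm{cov}[\hat{\beta}(t_a)]$ described above, and the linear-in-standard-deviation variance model is again assumed:

```python
import numpy as np

def predictive_variance(x_n, t_n, t_a, tau_hat, B_hat, cov_beta):
    """sigma^2(tau, x_n) + |t_n - t_a| x_n^T B x_n + x_n^T cov[beta(t_a)] x_n."""
    return ((x_n @ tau_hat) ** 2                    # variance model contribution
            + abs(t_n - t_a) * (x_n @ B_hat @ x_n)  # parameter drift since t_a
            + x_n @ cov_beta @ x_n)                 # estimation error of beta(t_a)
```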
4 Industrial Application
We applied the adaptive methods to predicting the conditional variance of steel strength. The data set consisted of measurements made on the production line of the Ruukki steel plate mill. The data set included about 200 000 observations, an average of 130 from each of 1580 days. The data included observations of thousands of different steel plate products. We had two response variables: tensile strength (Rm) and yield strength (ReH).
We fitted models for strength (Rm and ReH) using the whole data set and used the ensuing series of squared residuals to fit the models for conditional variance. In moving window modelling, we refitted the models at intervals of a million seconds (about 12 days). Based on the results in a smaller validation data set, we decided to use a unit-weighted ($\omega_i = 1\ \forall i$) moving window with width $w = 350$ days and on-line quasi-Newton with learning rate $\eta = 1/30000$.
We modelled conditional variance in the framework of generalised linear models. We decided to use a model that is linear in the standard deviation, $\sigma_i^2 = (x_i^T\tau(t_i))^2$. Both variances seemed to depend non-linearly on 12 explanatory variables related to the composition of the steel and the thickness and thermomechanical treatments of the plate. Our model selection procedure ended up with models representing the discovered non-linearities and interactions, with 40 and 32 parameters for ReH and Rm, respectively.
The first 450 days of the data set were used to fit the basic, non-adaptive model and to initialise the adaptive models. The models were compared for their ability to predict in the rest of the data. We used real forecasting: at each time moment, only the earlier observations were available to fit the model used in prediction.
Because variance cannot be directly observed, it is somewhat difficult to measure the goodness of models in predicting variance. Let a model predict the variances as $\hat{\sigma}_i^2 = \sigma^2(\hat{\tau}(t_{i-1}), x_i)$. We base the comparison on the likelihood of the test data set, assuming that the response variable is normally distributed. It is easy to see that the gamma likelihood of the squared residuals $\hat{\varepsilon}_i^2$ is equivalent to the full Gaussian likelihood when the mean model is kept fixed. Thus, we measure the goodness of a model in predicting the $i$th observation with the gamma deviance of the squared residual
$$d_i = 2\left[-\log(\hat{\varepsilon}_i^2/\hat{\sigma}_i^2) + (\hat{\varepsilon}_i^2 - \hat{\sigma}_i^2)/\hat{\sigma}_i^2\right].$$
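A sketch of the resulting comparison metric, averaging $d_i$ over a test set; eps2 and s2 denote the squared residuals and the one-step-ahead variance predictions:

```python
import numpy as np

def mean_gamma_deviance(eps2, s2):
    """Average of d_i = 2[-log(eps2_i/s2_i) + (eps2_i - s2_i)/s2_i] over a test set."""
    r = eps2 / s2
    return np.mean(2.0 * (-np.log(r) + r - 1.0))
```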
4.1 Results
The average prediction accuracies of the models in the test data set are presented in
Table 1 and in Fig. 1. The adaptive models performed better than the non-adaptive
basic model. On-line quasi-Newton worked better than the moving window method
and was also better than the non-adaptive fit to the whole data. The differences between the models are significant, but the non-adaptive model seems fairly adequate.
Examination of the time paths of the model parameters revealed that many of
the model parameters had changed during the examination period. Examples of the
development of the estimated parameter values are given in Fig. 2. The changes in one parameter were often compensated for by a reverse change in another, correlated parameter.
We examined the time paths of the predicted variances of some example steel plates. We found two groups of steel plate products whose variance had slightly decreased during the study period (Fig. 3). For most of the products, we did not find any indication of significant changes in variance.
4.2 Discussion
One of the main goals of industrial quality improvement is to decrease variance. Variance does not, however, decrease uniformly: changes and variation in the facilities and practices of the production line affect variance in an irregular way. Heteroscedasticity can often be explained by differences in how the variability of the production process appears in the final product. In industrial applications, a model for variance can be employed in determining an optimal working allowance [JR06]. Adjusting the working allowance to a decreased variance yields economic benefits. An adaptive variance model can be utilised to adjust the working allowance automatically and rapidly.
The purpose of the steel strength study was to assess the benefits of adaptive variance modelling in view of a possible implementation in a steel plate mill. The results of the study did not indicate an immediate need for adaptivity. In this application, however, the introduction of new processing methods and novel products creates a need for repeated model updating. Utilisation of adaptive models is a useful alternative for keeping the models up to date.
Table 1. The average test deviances of the models. Note that 'fit to whole data' does not measure real prediction performance

Model                  Rm     ReH
Stochastic gradient    2.789  2.655
Moving window          2.801  2.667
Non-adaptive model     2.824  2.689
Constant variance      3.265  2.942
Fit to whole data      2.795  2.666
Fig. 1. The smoothed differences between the average model deviances and the
average deviance of the fit to the whole data. Negative values mean that the model
predicts better than the fit
Fig. 2. The time paths of two parameter estimates
Fig. 3. The predicted deviations of two steel plates used as examples
5 Conclusion
We introduced the possibility of modelling the conditional variance function adaptively and discussed the potential advantages of the approach. We developed two adaptive methods for modelling variance and applied them successfully to a large industrial data set.
Acknowledgement
We are grateful to Ruukki for providing the data and the research opportunity.
References
[Bot98] Bottou, L.: Online learning and stochastic approximations. In: Saad, D. (ed) On-Line Learning in Neural Networks. Cambridge University Press (1998)
[CR88] Carroll, R.J., Ruppert, D.: Transformation and Weighting in Regression.
Chapman and Hall, New York (1988)
[CP76] Cooley, T.F., Prescott, E.C.: Estimation in the presence of stochastic parameter variation. Econometrica, 44, 167–184 (1976)
[Gre93] Grego, J.M.: Generalized linear models and process variation. J. Qual.
Technol., 25, 288–295 (1993)
[JR06] Juutilainen, I., Röning, J.: Planning of strength margins using joint modelling of mean and dispersion. Mater. Manuf. Processes (in press)
[MW98] McGilchrist, C.A., Matawie, K.M.: Recursive residuals in generalised linear
models. J. Stat. Plan. Infer., 70, 335–344 (1998)
[MK02] Murata, N., Kawanabe, M., Ziehe, A., Müller, K.R., Amari, S.: On-line
learning in changing environments with applications in supervised and unsupervised learning. Neural Networks, 15, 743–760 (2002)
[Pol03] Pollock, D.S.G.: Recursive estimation in econometrics. Comput. Stat. Data
An., 44, 37–75 (2003)
[ST95] Stadtmüller, U., Tsybakov, A.B.: Nonparametric recursive variance estimation. Statistics, 27, 55–63 (1995)