Improved prediction skill using regularized error covariance estimates in an ensemble Kalman filter

Jacob V. Tornfeldt Sørensen
DHI Water & Environment, Hørsholm, Denmark

Henrik Madsen
DHI Water & Environment, Hørsholm, Denmark

Henrik Madsen
Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby, Denmark

In ensemble Kalman filtering, error covariance estimates are obtained from sample statistics of an ensemble of states. These states are perturbed according to a set of underlying error assumptions, which are believed to capture the first order model errors. The approach often gives dynamically appealing covariance structures, and its updated states are in many respects consistent with the foundation of the numerical models. However, the description of the model error can never be better than the imposed error assumption. This implies that the resulting model error covariance estimate is biased, and hence, from a statistical point of view, the bias-variance trade-off can be exploited. Thus, through regularization we can introduce a small additional bias and obtain a decreased variance of our covariance estimate. If this is done successfully, the actual state estimate will in turn have an improved prediction skill. In this extended abstract, a barotropic regional ocean model of the North Sea and Baltic Sea system is used to examine the regularization ideas presented. We consider the regularization implied by assuming a slowly varying error and a distance dependence of the error covariance. The approach is designed to simultaneously give a significant speed-up of the scheme. It is acknowledged that the regularization will lead to updated states that do not exactly fulfill the equations being solved. However, if the deviation is small enough, the prediction based on the scheme may very well be improved nevertheless.
The skill of the assimilation techniques with and without regularized error covariance estimates is examined in hindcast as well as forecast. In this particular case, the efficient approximate techniques have a clear advantage. This is especially true in data sparse areas and for forecasts.

Introduction

A large part of the world's population lives close to the ocean and is affected by the coastal environment. Therefore, forecasting of key parameters in the coastal ocean has been on the agenda for decades, and in many countries warning systems are being operated for selected key parameters, e.g. Vested et al. [11], Gerritsen et al. [5] and Erichsen and Rasch [3]. For most forecast products, the forecast skill is of prime importance. Since numerical modelling is only slowly improving and has fundamental limitations, ongoing development also focuses on the on-line assimilation of available data.

The basic idea in most assimilation systems with a forecasting objective is to provide the best possible estimate of the ocean state at the time of forecast. Such an approach was implemented by Heemink [7] in a storm surge model for the Dutch coast. He used a Steady Kalman filter and showed an improved skill relative to a standard forecast model at both a three and a six hour forecast horizon. Vested et al. [11] and Gerritsen et al. [5] also investigated the forecast skill in the Southern North Sea. They similarly found that Kalman filter based initialization improves the forecast skill at short time scales. However, at longer time scales the skill deteriorates for a while before converging to that of the standard forecast model without data assimilation. Cañizares et al. [1] applied the Steady Kalman filter for assimilating tide gauge data in the North Sea and Baltic Sea system, where they showed a good filtering performance in areas of fairly dense data coverage. However, far from observations, the filtering skill was degraded.
This problem was treated in Sørensen et al. [10], and a regularization technique (distance regularization) was introduced to solve it. This paper investigates the effect on forecast skill of applying distance regularization to a steady approximation of an Ensemble Kalman Filter (Evensen [4]).

Filtering Technique

The schemes used for the assimilation of water level data in the present study can be categorized as sequential estimation techniques. The technique is basically composed of two parts. One part is a specification and a model propagation of the stochastic model state in between measurement times. The other part provides an estimate, $x_i^a$, of the state based on the distributions of the model estimate, $x_i^f$, and the measured variables, $y_i^o$, at time $t_i$. Let $H x_i^f$ be the model representation of the measurement and let $P_i^f$ and $R_i$ be the error covariances of $x_i^f$ and $y_i^o$, respectively. The standard approach is to assume no bias and use the best estimator in a minimal variance sense. This estimator can be written

$$x_i^a = x_i^f + K_i \left( y_i^o - H x_i^f \right) \qquad (1)$$

where the Kalman gain, $K_i$, is given by

$$K_i = P_i^f H^T \left( H P_i^f H^T + R_i \right)^{-1} \qquad (2)$$

The observational error also needs to be quantified. The specification of error models for the numerical model and for the observations is built on a number of assumptions. In the present study, the errors at the tide gauge stations are assumed to have a constant standard deviation and to be mutually uncorrelated.

In a dynamical model the uncertainty is continuously altered by the model dynamics, and hence the error description needs to be propagated in time. A Markov Chain Monte Carlo approach is followed here, leading to the Ensemble Kalman Filter (EnKF) (Evensen [4]). The ensemble approach is an efficient way of making the workload of the model error propagation tractable, by dramatically reducing the degrees of freedom in the description.
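As a minimal sketch of Eqs. (1) and (2) in an ensemble setting, the forecast error covariance can be replaced by the sample covariance of the ensemble members. The function names, the array layout (state variables in rows, ensemble members in columns) and the use of NumPy are assumptions made for illustration only; the sketch updates a single state vector and leaves out the perturbation of observations used in a full EnKF.

```python
import numpy as np

def ensemble_kalman_gain(X_f, H, R):
    """Kalman gain of Eq. (2), with P^f estimated from an ensemble.

    X_f : (n_state, n_members) forecast ensemble (hypothetical layout)
    H   : (n_obs, n_state) linear measurement operator
    R   : (n_obs, n_obs) measurement error covariance
    """
    n_members = X_f.shape[1]
    # Deviations of each member from the ensemble mean
    anomalies = X_f - X_f.mean(axis=1, keepdims=True)
    HA = H @ anomalies
    # Sample estimates of P^f H^T and H P^f H^T; P^f itself is never formed
    PHt = anomalies @ HA.T / (n_members - 1)
    HPHt = HA @ HA.T / (n_members - 1)
    # K = PHt (HPHt + R)^{-1}, computed with a linear solve instead of an inverse
    return np.linalg.solve((HPHt + R).T, PHt.T).T

def analysis(x_f, K, y_o, H):
    """State update of Eq. (1): x^a = x^f + K (y^o - H x^f)."""
    return x_f + K @ (y_o - H @ x_f)
```

Working with the products $P^f H^T$ and $H P^f H^T$ keeps the memory cost proportional to the number of observations rather than to the square of the state dimension.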
However, the resulting scheme requires on the order of 100 times the cost of a standard model simulation and is still too expensive for many operational systems, which are typically pushed close to the limit in terms of computational resources in order to resolve as many processes as possible. Further, the EnKF scheme may introduce spurious correlations in data sparse regions due to an inaccurate model error description and the stochastic nature of the scheme. Hence, despite the risk of introducing nondynamical modes in the system, two forms of regularisation of the gain are used (Sørensen et al. [10]). The resulting Steady approximation and distance regularisation are presented below.

Regularisation methods allow prior knowledge about the elements in $K_i$ and their interdependence to be taken into account, Hastie et al. [6]. The techniques can usually be cast in a Bayesian framework, e.g. if prior information about the model error covariance, $P_{PRIOR}$, is available for $P^f$, then the posterior estimate, $P_{POSTERIOR}$, is

$$P_{POSTERIOR} = \left( P_{PRIOR}^{-1} + \left( P^f \right)^{-1} \right)^{-1} \qquad (3)$$

Such an approach is not tractable in the high-dimensional state space under consideration. However, the approximate schemes presented below can be regarded as attempts to incorporate or exploit prior knowledge.

The Steady Kalman filter can be regarded as an ad-hoc regularisation method. Instead of calculating the Kalman gain at every measurement time, it can be assumed that the state and measurement error covariances are the same at every update. This yields a constant Kalman gain, which reduces the computational time to the same order of magnitude as a standard model execution and hence makes the scheme applicable in an operational forecast setting. The Kalman gain is calculated as a long term average of the gain from an EnKF.
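The long term averaging behind the steady gain can be sketched as follows. This is an illustration under stated assumptions rather than the operational implementation: the gains are assumed to be available as a sequence of equally shaped arrays, the names `steady_gain` and `steady_update` are hypothetical, and the toy gains are synthetic noise around a fixed matrix.

```python
import numpy as np

def steady_gain(gains):
    """Average time-varying gains K_i (each n_state x n_obs) into one constant gain."""
    stacked = np.stack([np.asarray(K, dtype=float) for K in gains])
    return stacked.mean(axis=0)

def steady_update(x_f, K_steady, y_o, H):
    """Eq. (1) with a precomputed constant gain; no covariance is propagated."""
    return x_f + K_steady @ (y_o - H @ x_f)

# Toy illustration: noisy gains scattered around a fixed (assumed) matrix
rng = np.random.default_rng(1)
K_true = np.array([[0.8], [0.3]])
noisy_gains = [K_true + 0.2 * rng.normal(size=K_true.shape) for _ in range(400)]
K_bar = steady_gain(noisy_gains)  # much less scattered than any single noisy gain
```

Once the averaged gain is stored, each update costs one matrix-vector product on top of the ordinary model step.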
Since the gain actually varies in time, this introduces a bias in the gain elements, but the time averaging that creates the steady gain smooths the gain and lowers the variance of its elements. This variance reduction lowers the prediction error, provided the time-varying bias is not too large.

The distance regularisation is an ad-hoc procedure for expressing that we do not believe any tide gauge observation should be used for updating state variables that are positioned far away, Houtekamer and Mitchell [8]. This is implemented by constructing a vector, $\rho$, with coefficients between 0 and 1 that are a Gaussian function of the geographical distance, $d_m$, to observation $m$, according to

$$\rho(d_m) = \exp\left( -\frac{d_m^2}{D^2} \right) \qquad (4)$$

The parameter $D$ specifies the spatial decorrelation scale. This regularisation can be used in both the EnKF and the Steady Kalman filter presented above by multiplying each element in the Kalman gain by $\rho(d_m)$.

Figure 1. Bathymetry and available tide gauge stations, including 10 measurement stations (M1-M10) and 7 validation stations (V1-V7).

Setup

The area under consideration in the present study is the North Sea, Baltic Sea and interconnecting waters. We restrict our attention to the barotropic hydrodynamics and hence employ the depth averaged numerical model MIKE 21, developed at DHI Water & Environment, DHI [2]. The area and bathymetry are shown in Figure 1 with the available tide gauge measurement points indicated. The gauges were divided into measurement stations (M), used in the assimilation, and validation stations (V), which were only used for performance assessment. The spatial resolution varies from 9 to 1/3 nautical miles through a two-way dynamic nesting technique. The temporal resolution is 2.5 minutes and measurements are available every 30 minutes. The measurements are linearly interpolated and assimilated every 10 minutes, i.e. every fourth model time step. The period of January 2002 was used in the study.
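The distance regularisation of Eq. (4) can be illustrated with a short sketch. It assumes the Gaussian weight has the form exp(-d_m^2 / D^2), that distances and the decorrelation scale are given in the same unit, and that the gain is stored with one column per observation; the function names are hypothetical.

```python
import numpy as np

def distance_weights(d_m, D):
    """Gaussian weights in [0, 1] as a function of the distance d_m to observation m."""
    d = np.asarray(d_m, dtype=float)
    return np.exp(-(d ** 2) / D ** 2)

def regularise_gain(K, distances, D):
    """Multiply each gain element by the weight for its distance to the observation.

    K         : (n_state, n_obs) Kalman gain
    distances : (n_state, n_obs) distance from each state point to each observation
    D         : spatial decorrelation scale, same unit as `distances`
    """
    return np.asarray(K, dtype=float) * distance_weights(distances, D)
```

With this form, a state variable at the observation point keeps its full gain, a point at distance D keeps the fraction exp(-1), and points several decorrelation scales away are effectively not updated.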
A steady Kalman gain was estimated as an average of the gain calculated in an execution of the EnKF over a three day period from 1 January to 4 January. All measurements were adjusted to have the same average as a standard model prediction in January 2002 to diminish datum problems. The experiments were designed to test the forecasting performance of three prediction schemes:

1. A standard model execution without data assimilation
2. A Steady Kalman filter
3. A Steady Kalman filter using distance regularisation

Twenty forecasts were performed at one day intervals. Each model run included one day of hindcast and a four day forecast. Hindcast wind fields were used for the forecast. In the assimilation schemes, the model error was assumed to derive solely from errors in the wind velocity and open boundary water level forcing terms. These errors were assumed to be colored with temporal correlation scales of 5.7 hours and 1.7 hours, respectively, and to have spatial correlation scales of 300 km and 95 km. All measurement errors were assumed to have a standard deviation of 0.05 m. See Sørensen et al. [9] for a more detailed description of the Kalman filter settings and their effect on hindcast performance. The spatial decorrelation scale of the distance regularisation was set to 250 km.

The performance of the schemes was assessed using the root mean square error (RMSE) of the A=20 forecasts for each forecast horizon, $t_i$, and tide gauge station, $s$,

$$\mathrm{RMSE}_{t_i,s} = \sqrt{ \frac{1}{A} \sum_{a=1}^{A} \left( y_{i,a}^o(s) - H(s)\, x_{i,a}^a \right)^2 } \qquad (5)$$

Bulk performance measures were constructed as averages over measurement and validation stations.

Results

The bulk RMSE statistics of the three experiments are shown in Figures 2 and 3 for measurement and validation stations, respectively.
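The skill measure of Eq. (5) and the bulk averages over station groups can be sketched as below. The array layout (forecast number, forecast horizon, station) and the function names are assumptions for illustration; the predicted values are taken to be the model water levels already extracted at the station positions.

```python
import numpy as np

def rmse_per_horizon(observed, predicted):
    """RMSE over the A forecasts for each forecast horizon and station.

    observed, predicted : (A, n_horizons, n_stations) water levels
    returns             : (n_horizons, n_stations)
    """
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean(err ** 2, axis=0))

def bulk_rmse(rmse, station_idx):
    """Average the per-station RMSE over a group of stations, per horizon."""
    return np.asarray(rmse)[:, station_idx].mean(axis=1)
```

Calling `bulk_rmse` once with the indices of the measurement stations and once with those of the validation stations gives one curve per station group as a function of forecast horizon.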
The overall picture is that the data assimilation clearly improves the state estimate in hindcast as compared to the standard model execution in run 1, but on average this improved skill only lasts 6-8 hours when using the classical Steady Kalman filter in run 2. After this period of improved predictive skill, a period follows with degraded water level predictions. However, when applying the distance regularisation in run 3, the forecast skill is improved for 2-3 days and no deterioration is observed at any forecast horizon for any station. The distance regularisation provides an improved global state estimate (Sørensen et al. [10]), and hence no erroneous signals are set free to propagate in the domain at the time of forecast.

A closer look at the RMSE statistics of run 3 indicates two modes of error correction by the filter in the forecast period. In the first 12 hours there is a relatively sharp decrease in prediction skill. Thereafter, the skill only slowly moves towards that of the reference forecast run 1. This shows that the long-term prediction improvement is due to a bias correction by the assimilation scheme.

Figure 2. Aggregated RMSE of the reference run 1 (thin black), the Steady run 2 (thick grey) and the Steady distance regularised run 3 (thick black) in all measurement points.

Figure 3. Aggregated RMSE of the reference run 1 (thin black), the Steady run 2 (thick grey) and the Steady distance regularised run 3 (thick black) in all validation points.

Conclusions

This paper has investigated the water level forecast skill when using Kalman filtering to initialize the state at the time of forecast. Two schemes were tested for the purpose: a steady approximation of the Ensemble Kalman Filter with and without distance regularisation. The performance of the schemes was investigated in an operational model of the North Sea, Baltic Sea and interconnecting waters.

Forecast initialisation by the Steady Kalman filter gave, on average, an improved prediction for a period of 6-8 hours. The distance regularising scheme improves the forecast skill significantly in the system under consideration, adding value to the prediction for 2-3 days, and is to be encouraged for operational forecasting purposes. It can easily be combined with any filtering scheme, but the Steady Kalman filter will often be sufficiently accurate and hence is recommended due to its low computational cost.

Acknowledgements

This research was carried out jointly at DHI Water & Environment and the Technical University of Denmark under the Industrial Ph.D. Programme (EF835). Contribution of tide gauge data from the Danish Meteorological Institute, the Royal Danish Administration of Navigation and Hydrography and the Swedish Meteorological and Hydrological Institute is acknowledged.

References

[1] Cañizares, R., Madsen, H., Jensen, H.R. and Vested, H.J., Developments in operational shelf sea modelling in Danish waters, Estuar. Coast. Shelf Sci., Vol. 53, (2001), pp 595-605.
[2] DHI, MIKE 21 coastal hydraulics and oceanography, DHI Water & Environment, (2001).
[3] Erichsen, A.C. and Rasch, P.S., Two and three-dimensional model system predicting the water quality of tomorrow, Proceedings of the seventh international conference on estuarine and coastal modeling, St. Petersburg, Florida, USA, (2002), pp 165-184.
[4] Evensen, G., Sequential data assimilation with a non-linear quasi-geostrophic model using Monte Carlo methods to forecast error statistics, J. Geophys. Res., Vol. 99(C5), (1994), pp 10,143-10,162.
[5] Gerritsen, H., de Vries, H. and Philippart, M., The Dutch continental shelf model, D.R. Lynch & A.M. Davies, eds, Quantitative Skill Assessment for Coastal Ocean Models, American Geoph. Union, (1995), pp 425-467.
[6] Hastie, T., Tibshirani, R. and Friedman, J., The elements of statistical learning: data mining, inference, and prediction, 1st edition, Springer Verlag, (2001).
[7] Heemink, A., Storm surge prediction using Kalman filtering, PhD thesis, Twente University of Technology, The Netherlands, (1986).
[8] Houtekamer, P.L. and Mitchell, H.L., Data assimilation using an ensemble Kalman filter technique, Monthly Weather Review, Vol. 126, (1998).
[9] Sørensen, J.V.T., Madsen, H. and Madsen, H., Parameter sensitivity of three Kalman schemes for the assimilation of tide gauge data in coastal and shelf sea models, Ocean Modelling, Submitted, (2003).
[10] Sørensen, J.V.T., Madsen, H. and Madsen, H., Efficient sequential techniques for the assimilation of tide gauge data in three dimensional modeling of the North Sea and Baltic Sea system, J. Geophys. Res. (Oceans), In press, (2004).
[11] Vested, H.J., Nielsen, J.W., Jensen, H.R. and Kristensen, K.B., Skill assessment of an operational hydrodynamic forecast system for the North Sea and Danish belts, D.R. Lynch & A.M. Davies, eds, Quantitative Skill Assessment for Coastal Ocean Models, American Geoph. Union, (1995), pp 373-396.