2011 International Conference on Modeling, Simulation and Control IPCSIT vol.10 (2011) © (2011) IACSIT Press, Singapore Missing Data Prediction and Forecasting for Water Quantity Data Prakhar Gupta1 and R.Srinivasan2 + 1 Department of Chemical Engineering & Technology, Institute of Technology, BHU, Varanasi, India 2 Nalco Technology Center, Pune, India Abstract. In industrial applications, especially in water treatment plants, it is necessary to obtain flow data (quantity and quality) for a system over a broad range of time. In most cases it is not possible to obtain the parameter of interest in a closed form. Generation of the data over the time period within the known range is not possible or may be extremely time-consuming. Numerous methods are available for interpolation or extrapolation to determine the unknown data or missing data within or outside a range of known data points. In this paper two developed methodologies were analyzed, one to predict the missing values within a data range and another to forecast the seasonal data outside the data range for a raw water quantity data. A proven method like Two-Directional Exponential Smoothing (TES) is applied for predicting the missing values for a raw stream flow data and a seasonal data was forecasted using Exponentially Weighted Moving Average (EWMA) by using the known data values of previous two seasons. Both the methods predicted the data within and outside the period range of water quantity data with good results. Keywords: Missing Data, Two-Directional Exponential Smoothing (TES), Exponentially Weighted Moving Average (EWMA), Forecasting. 1. Introduction Predicting the missing values within a time period and forecasting data for future periods is of common interest for many industrial applications. Replacing missing data with time series within the range of known data is crucial for more accurate design proposals and performance evaluation. Especially, in water treatment plants it is required to replace the missing values of the water quality and quantity data to gain knowledge in the system and also to manage the water resources effectively. The water quantity and quality data are defined as time series variables which are recorded at successive time intervals. To use existing operational data as an input to a process simulation model, the missing data should be replaced. To provide a more accurate design proposal and system performance, a reasonable and reliable prediction of missing data is highly needed to determine the correct variability of water treatment plant data. Most common methods available for predicting the missing values in time series are replacing all missing values for a given variable with mean, median or other location statistics [1], SAS program using a time-series model to predict missing values [2,3], Average Nearest Observation (ANO)[4], and Two-Directional Exponential Smoothing (TES) [5]. Also, forecasting and predicting the future data can be very helpful in preparing in advance for any unexpected values. The forecast of seasonal high or low output is crucial in taking care of the installed capacity systems and therefore trigger alarms as needed according to the values. Popular methods like Auto Regressive Integrated Moving Average (ARIMA) [6], Exponentially Weighted Moving Average Method (EWMA) [7], Thomas-Fiering method [8] are also used for forecasting data from a known data set. Comprehensive methods for forecasting based on the exponentially weighted moving average are available [7] for series with trend, non-seasonal, and seasonal series. ________________________________ + Corresponding author. Tel.: +91 20 39394089; Fax: +91 20 39394381 E-mail address: sramanathan@nalco.com 98 In this paper a proven method like Two-Directional Exponential Smoothing (TES) is applied for predicting the missing values with time series in a data set. A sample data was used to test the applicability of the method by intentionally deleting some of the known values. Also, the raw sample data set was extended to forecast the values using Exponentially Weighted Moving Average (EWMA). 2. Methodology & Results The methodology for predicting the unknown values within and outside a known range by applying established methods like Two-Directional Exponential Smoothing (TES) and Exponentially Weighted Moving Average (EWMA) respectively are discussed in the following sections. For application of these methods, sample raw data of stream flow over a period of 5 years was used [9]. Here, only the stream flow rate data was considered for prediction and forecasting, assuming that the water flow data was used by an industrial plant nearby for water management. A test data set was created by removing some data points for predicting the stream flow within the known range and the sample test data set for two known seasons was used for predicting outside the range. 2.1 Missing Data Prediction For predicting the missing data of stream flow, Two-Directional exponential smoothing (TES) [5] was applied. Test data of stream flow rate by removing some data points over a period of five year is shown in Figure 1. Figure 1. Raw data of stream flow rate [9] with missing values for a 5 year period TES method was developed to replace missing data. TES method depends on a suitable Exponential Smoothing (ES) method and was developed by using Holt’s linear trend algorithm method. The TES method estimates missing data points based on the autocorrelations of the time series to account for the fact that the missing values occur at non-random times. The TES method is designed to represent both forward and backward autocorrelations in the time series, can decrease the difference above caused by different directions. The first step in TES method is to generate the full data set of data using Average nearest observation (ANO) method [4]. The ANO method will replace the missing values with the average of the nearest previous and the following observation i.e., the values are estimated by a weighted average of the nearest observations with higher weight given to the closer observation. Once the data set is generated using the ANO method, the missing values are predicted using a suitable Exponential Smoothing method, Holt’s linear trend method, in the forward and reverse direction. An ES method could generate different values depending on the direction of the time series. The Holt’s Linear Trend algorithm can be represented as at = δ Yt + (1 – δ)(at-1 + bt-1) (1) bt = γ(at – at-1) + (1 – γ)bt-1 (2) 99 where Yt is the actual value at time t, at and at-1 are intercepts (smoothed levels) at time t and t-1 respectively , bt and bt-1 are the slopes (smoothed trends) at time t and t-1 respectively, δ and γ are smoothing constants that are between 0 and 1 [1,10]. The smoothing constants, δ is used to smooth the new actual and trend-adjusted previously smoothed level and γ is used to smooth or average the trend, which eliminates some of the random error reflected in the unsmoothed trend. The smoothing constants determine the weight given to most recent past observations and therefore control the rate of smoothing or averaging. Values near 1 give weightage to more recent data and near 0 distribute the weights to consider data from the more distant past data. The averaged forward and backward ES estimates were used for predicting the missing data points. The TES method is a combination time series method and represented for missing values as TESt = (ES forward, t + ES backward, t) / 2 (3) Figure 2 shows the flow diagram of TES method. This TES method is applied to the test data (Figure 1). Figure 3 shows the raw stream flow rate data (with missing values) and the predicted missing stream flow values (by TES). For estimating the ES forward and ES backward, the constants δ and γ were chosen as 0.7 and 0.9 by trial and error method. The replaced stream flow values determined by the TES method are relatively close to the original values. Stream flow data with missing values Forward ANO data with putative missing values replacement Backward ANO data with putative missing values replacement Forward ES data: ES forward, t Backward ES data: ES backward, t TESt = (ES forward, t + ES backward, t) / 2 for missing data Figure 2. Flow diagram of TES method Figure 3. Comparison between the predicted values by TES method and raw data with missing values 2.2 Data Forecasting The forecasting ratio seasonal method [7] developed for predicting the sales rate was used for predicting the water quantity stream flow data. The forecasting ratio seasonal method can be used to predict the data for next season when the previous seasons share the comparable behaviour. It follows the exponentially weighted moving average method of smoothing the random fluctuations which is extremely easy to compute 100 with minimum historical data required. The predicted or the forecasted value can be estimated from the following extrapolation equation: ESt +T = St ' (4) Pt +T − N where St’ is the smoothed seasonally adjusted rate in period t, Pt is the periodical adjustment ratio for the ‘t’th period, N is the number of the periods in one seasonal and T is the number of forecast periods. The smoothed seasonally adjusted rate in the period t, is represented as S t ' = APt St + (1 − A) St −1 ' (5) where St is the value in period t and A is a constant that ranges between 0 to 1, which determines how fast the exponential weights decline over the past consecutive periods. The current seasonal adjustment ratio is obtained by combining the current ratio of data and the data with the seasonal adjustment rate from a season ago. Pt = B St ' + (1 − B) Pt − N St (6) where the constant B, determines how fast the exponential weights decline over the past season i.e., one period drawn from each season. By solving Equation (5) and (6), the explicit analytic expression for the new seasonal ratio [7] can be given as A(1 − B) 1− A ]Pt − N St + [ ]St −1 ' 1 − AB 1 − AB 1− B B(1 − A) St −1 ' ]Pt − N + [ Pt = [ ] 1 − AB 1 − AB St St ' = [ (7) (8) The above methodology is used to predict the seasonal data for the third season by using the known values of previous two seasons. The test data published for stream flow rate was considered for seasonal variation of 32 weeks for each season (high and low peaks) out of 257 weeks [9]. The main objective is to predict the third season of 32 weeks data considering the 64 weeks data. Here the sample data was carefully chosen to be in variation with seasonal changes from the available data of 257 weeks. Figure 4 shows the raw test data (two seasons of 64 weeks) and the forecasted data for third season (32 weeks). Here the constants A and B were chosen as 0.1 and 1 by trial and error method. The forecasted data for these constant values seems to be more similar to the immediate previous season i.e., the second season. However, if the constant values are change to 0.2 and 0.1, as shown in Figure 5, the forecasted data is towards the first season. The significance of the constants in the equation determines the data correlations to the immediate or previous season’s data. Fig.4. Comparison of forecasted data for one season with the raw test data [9] of two seasons 101 . Fig 5.Comparison of forecasted data for one season with the raw test data [9] of two seasons 3. Conclusions In this paper, a Two-Directional exponential smoothing (TES) method was applied for predicting the missing values with time series and exponentially weighted moving average (EWMA) was applied for forecasting the water stream flow data with seasonal variation. Both the methods predicted the data within and outside the period range with good results. 4. Acknowledgements This work was carried out at the Nalco Technology Center, Pune, as a part of an industry-university collaborative research internship project. The authors would like to acknowledge Dr. Hari Reddy, Director, Nalco Technology Center, Pune, India and Dr. A.K. Verma, Head, Department of Chemical Engineering & Technology, Institute of Technology, BHU, Varanasi, India for providing their support to carry out this project. 5. References [1] DeLurgio, S. A.. Forecasting Principles and Applications, McGraw-Hill, New York (1998) [2] H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, M. Kolehmainen. Methods of imputation of missing values in air quality data sets. Atmos . Environ. (2004), 38, 2895-2907. [3] T. Scheider. Analysis of incomplete climate data:Estimation of mean values and covariance matrices and imputation of missing values. J. Clim.14(5) (2001), 853-871 [4] J. Huo. Application of statistical methods and process models for the design and analysis of activated sludge wastewater treatment plants. PhD dissertation (2005), The Univ. of Tennessee , Knoxville, Tenn. [5] J. Huo, C. D. Cox, W. L. Seaver, R.B. Robinson, Y. Jiang. Application of Two-Directional Time Series Models to Replace Missing Data. J. of Environmental Engineering. ASCE (April 2010) [6] G.E.P. Box, G.M. Jenkins , Time Series Analysis Forecasting and Control. Holden-Day, San Francisco, 1976 [7] C. C. Holt, Forecasting Seasonals and trends by exponentially weighted moving averages, International J. of Forecasting 20 (2004) 5-10 [8] R.T. Clarke. Mathematical Models in Hydrology, FAO of United Nations, Rome, 1984 [9] A. Kurunc, K. Yurekli, O. Cevik. Performance of two stochastic approaches for forecasting water quality and stream flow from Yeşilırmak River, Turkey. Environmental Modeling & Software 20 (2005) 1195 -1200 [10] SAS Institute Inc. SAS onlineDoc, version 8, SAS Institute, Cary, N.C. (1999) 102