Introducing interval time series: accuracy measures ∗

Javier Arroyo1 and Carlos Maté2

1 Departamento de Sistemas Informáticos, Universidad Complutense, Profesor García-Santesmases s/n, 28040 Madrid, Spain. javier.arroyo@fdi.ucm.es
2 Instituto de Investigación Tecnológica, ETSI (ICAI), Universidad Pontificia Comillas, Alberto Aguilera 25, 28015 Madrid, Spain. cmate@upcomillas.es

Summary. The confluence of time series analysis and symbolic data analysis leads to a promising area: symbolic time series. In this kind of series the variable considered is a symbolic one (e.g. a histogram or an interval variable). This paper focuses on interval time series, which are useful for describing how the range of variation of a phenomenon (e.g. the flow of a river) evolves through time. Accuracy measures for these series, based on distances for interval data, are proposed. Finally, an example illustrates how to forecast interval time series in a simple way.

Key words: interval data, error measures, forecasting, symbolic data analysis, time series analysis

1 Introduction

So far, quantitative forecasting methods have focused on single-valued time series, i.e. series where the observation at each time point is a single value. The analysis of this kind of time series is a mature field in which enormous progress has been made since the seventies [DH06]. Single-valued time series are useful for representing time-varying contexts; in some situations, however, other kinds of time series are better suited. For example, consider the daily electricity demand in a region: a classical time series with the total demand each day can describe the phenomenon, but it does not report on the intra-daily variability of the demand. A histogram-valued or boxplot-valued time series describing the distribution of the hourly demand each day would be more appropriate, as it would report on the shape of the distribution.
Similarly, the lower and upper monthly water levels of a river at a given location can be suitably represented by an interval-valued time series, as it reports on the variability of the river flow. Both are examples of symbolic-valued time series.

Symbolic-valued time series arise from the combination of time series analysis and symbolic data analysis. Symbolic data analysis states that symbolic variables (lists, intervals, frequency distributions, etc.) are better suited than single-valued variables for faithfully describing complex real-life situations. According to [BD03], there is a great need to develop methodologies dealing with symbolic variables. However, thus far, symbolic-valued time series have not been tackled. There is an approach for dealing with symbolic data in a temporal setting [GP00], but it concerns three-way data analysis rather than forecasting. There is also a field called symbolic time series analysis [DFT03], where series are sequences drawn from a finite set of symbols. Here, however, we refer to series that are sequences of numerical observations taken at regular time intervals, where the observed variable is not single-valued but symbolic, such as distribution-, interval- or histogram-valued variables.

This paper focuses only on interval time series (ITS). Notation for this new entity is proposed and several sources of ITS are discussed. The core of the paper is devoted to proposing an approach to measuring forecasting errors in an ITS context. Finally, an example illustrates a straightforward way of forecasting ITS.

∗ This work is funded by Universidad Pontificia Comillas (PRESIM project).

2 Notation for Interval Time Series

A variable $Y$ is termed an interval variable, denoted by $[Y]$, if all elements $i$ of a set $E$ take values in the domain $B = \{[\alpha, \beta],\ -\infty < \alpha \le \beta < \infty\}$. The particular value of $[Y]$ for the $i$th element is denoted by $[Y]_i = [\underline{Y}_i, \overline{Y}_i]$.
An interval is defined by its bounds (minimum and maximum). Equivalently, it can be defined by its midpoint (center) and its radius, which are

\[ c_i = \mathrm{mid}[Y]_i = \frac{\underline{Y}_i + \overline{Y}_i}{2} \quad \text{and} \quad r_i = \mathrm{rad}[Y]_i = \frac{\overline{Y}_i - \underline{Y}_i}{2}, \tag{1} \]

respectively. Thus, the value of an interval variable for the $i$th individual can also be written in midpoint-radius notation: $[Y]_i = \langle c_i, r_i \rangle$. Minimum, maximum, midpoint and radius can be considered interval attributes.

An interval time series $\{[Y]_t\}$ is a time series where the variable observed through time $t = 1, ..., n$ is an interval variable; the value of the variable at each instant $t$ is expressed as $[Y]_t = [\underline{Y}_t, \overline{Y}_t] = \langle c_t, r_t \rangle$. However, to improve the legibility of subsequent formulae, the following notation will be used: $[Y]_t = [Y_{t,L}, Y_{t,U}] = \langle Y_{t,C}, Y_{t,R} \rangle$. A forecast value is denoted by placing a hat above the variables: $[\hat{Y}]_t = [\hat{Y}_{t,L}, \hat{Y}_{t,U}] \equiv \langle \hat{Y}_{t,C}, \hat{Y}_{t,R} \rangle$.

3 Sources of Interval Time Series

ITS can describe situations that cannot be described by classical time series, such as when no precise data are available and inaccuracy or uncertainty must be taken into account (e.g. when the measurement instrument is not reliable). Another example is an ITS describing the blood pressure of a person through time. However, inherent ITS are scarce.

Sampling and aggregation are the main ways of obtaining ITS. These approaches are also applied to obtain classical time series, but we believe it is worth using ITS, as they can offer a complementary view of the phenomena. For example, a continuous time series produced by a sensor can be sampled by recording the lower and upper values in each hour. This leads to an ITS describing the hourly range of values, which reports on the variability of the original series. Something similar happens when ITS are used to aggregate a set of individual time series measuring the same variable, e.g.
a set of series representing the levels of an air pollutant at different locations in a city can be aggregated into an ITS representing the minimum and maximum levels of the pollutant in the whole city. In [ZT00], eighteen time series representing the annual output growth rate of industrialized countries are aggregated by the median. Curiously, while the article focuses on the median time series, its charts also show an ITS describing the interquartile ranges. We believe that in these cases it would be interesting to analyze the ITS as well. Obviously, in sampling and aggregation contexts, intervals can arise not only from the minimum and maximum observed values, but also from the interquartile range or from the middle 90% of the scores (in order to avoid outliers).

4 Error Measures for Interval Time Series

In classical time series, error measures are based on the difference between the observed value and its forecast. In interval algebra, the difference between a pair of intervals, $[A]$ and $[B]$, is defined as $[A] - [B] = [A_L - B_U, A_U - B_L]$. Unfortunately, this operation is not appropriate for defining error measures, because it does not faithfully represent the concept of deviation [PL03]. Therefore, we propose to define ITS error measures based on distances for intervals. Distances objectively measure the dissimilarity between an observed interval and its forecast. Moreover, distances can be easily summarized without squaring them or taking absolute values, as they are always non-negative.

Let $\{[Y]_t\}$ be the observed ITS and $\{[\hat{Y}]_t\}$ be its forecast, with $t = 1, ..., n$. A set of error measures is presented in the following subsections.

4.1 Mean Distance Error based on Hausdorff Distance

Hausdorff proposed a metric to measure distances between compact sets. Intervals are compact sets identified by ordered pairs of values (i.e. their lower and upper bounds).
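As an aside on the interval difference defined at the start of Section 4, a small numerical check makes the problem concrete. The sketch below is illustrative only: it assumes that intervals are stored as (lower, upper) tuples, and the function name `interval_difference` is hypothetical. For a perfect forecast, where the observed and forecast intervals coincide, the interval-algebra difference is not the degenerate interval [0, 0] but an interval twice as wide as the original, so it cannot play the role of a deviation.

```python
from typing import Tuple

# Assumed representation for this sketch: an interval is a (lower, upper) tuple.
Interval = Tuple[float, float]


def interval_difference(a: Interval, b: Interval) -> Interval:
    """Interval-algebra difference [A] - [B] = [A_L - B_U, A_U - B_L]."""
    return (a[0] - b[1], a[1] - b[0])


observed = (1.0, 3.0)
forecast = (1.0, 3.0)  # a perfect forecast of the observed interval

# Even though the forecast equals the observation, the result is not [0, 0].
print(interval_difference(observed, forecast))  # (-2.0, 2.0)
```

Only for degenerate intervals does the difference of an interval with itself collapse to [0, 0], which is why a distance is preferred as the building block of the error measures.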
Given two intervals $[A] = [A_L, A_U] = \langle A_C, A_R \rangle$ and $[B] = [B_L, B_U] = \langle B_C, B_R \rangle$, the Hausdorff metric for intervals is:

\[ d_H([A], [B]) = \max(|A_L - B_L|, |A_U - B_U|) = |A_C - B_C| + |A_R - B_R|. \tag{2} \]

Note that for a pair of degenerate intervals, i.e. $[A] = [x, x]$ and $[B] = [y, y]$, we have $d_H([A], [B]) = |x - y|$, which is the usual distance on the real line. The Mean Distance Error based on the Hausdorff metric is defined by:

\[ MDE_H = \frac{1}{n} \sum_{t=1}^{n} d_H([Y]_t, [\hat{Y}]_t) = \frac{1}{n} \sum_{t=1}^{n} \left[ |Y_{t,C} - \hat{Y}_{t,C}| + |Y_{t,R} - \hat{Y}_{t,R}| \right]. \tag{3} \]

4.2 Root Mean Squared Distance Error based on Hausdorff metric

There are two generalizations of the Hausdorff distance for interval data in an n-dimensional space. They consider that an item described by $n$ interval variables can be alternatively represented in an n-dimensional space by parallelotopes or by hyperspheres [PI05]. Here, we only consider the parallelotope approach, which generalizes the Hausdorff distance by means of the Minkowski metric. The distance with $\alpha \ge 1$ for two items, $X$ and $Y$, described by $n$ interval variables is defined as:

\[ d_{H_{gen}}^{\alpha}(X, Y) = \sqrt[\alpha]{\sum_{j=1}^{n} \left[ d_H([X]_j, [Y]_j) \right]^{\alpha}}. \tag{4} \]

An extension of (4) is obtained by introducing weights, $w_j > 0$ with $\sum_{j=1}^{n} w_j = 1$, to model the relative importance of the variables:

\[ d_{H_{gen}}^{\alpha}(X, Y) = \sqrt[\alpha]{\sum_{j=1}^{n} w_j \left[ d_H([X]_j, [Y]_j) \right]^{\alpha}}. \tag{5} \]

An interval time series $\{[Y]_t\}$ with $t = 1, ..., n$ can be considered as an individual described by $n$ interval variables, that is, represented in an n-dimensional space as a parallelotope of $n$ faces. Given (5) with $\alpha = 2$ and equal weights ($w_j = 1/n$) for all variables, the Root Mean Squared Distance Error based on the Hausdorff metric is defined as:

\[ RMSDE_H = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left[ (Y_{t,C} - \hat{Y}_{t,C})^2 + (Y_{t,R} - \hat{Y}_{t,R})^2 + 2|Y_{t,C} - \hat{Y}_{t,C}||Y_{t,R} - \hat{Y}_{t,R}| \right]}. \tag{6} \]

It can be seen that the first two components account for the squared differences between midpoints and between radii, while the third component accounts for the combined effect of the error in the midpoints and in the radii.

4.3 Mean Distance Error based on Ichino-Yaguchi distance

Ichino and Yaguchi proposed a generalized Minkowski metric for a multidimensional space of mixed variables (quantitative, qualitative and structural variables) [IY94]. This metric is based on the Cartesian join and meet operators, which for interval variables are defined as $[A] \oplus [B] = [A] \cup [B] = [\min(A_L, B_L), \max(A_U, B_U)]$ and $[A] \otimes [B] = [A] \cap [B]$, respectively. Let $[A]$ and $[B]$ be a pair of intervals; the Ichino-Yaguchi distance is:

\[ d_{IY}([A], [B]) = w([A] \cup [B]) - w([A] \cap [B]) + \gamma \left( 2w([A] \cap [B]) - w([A]) - w([B]) \right), \tag{7} \]

where $w([X])$ denotes the width of the interval $[X]$, i.e. $w([X]) = X_U - X_L$, and $\gamma \in [0, 0.5]$ controls the effects of the inner-side nearness and the outer-side nearness between $[A]$ and $[B]$. This measure satisfies the properties required of a distance [IY94]. The use of $\gamma = 0.5$ is suggested in [IY94]:

\[ d_{IY}^{\gamma=0.5}([A], [B]) = w([A] \cup [B]) - 0.5 \left[ w([A]) + w([B]) \right]. \tag{8} \]

We believe that, in error measurement, $\gamma = 0.5$ is a suitable choice as, in our experience, it is no worse than other $\gamma$ values and it produces a clearer equation (8) which, moreover, is equivalent to this meaningful formula:

\[ d_{IY}^{\gamma=0.5}([A], [B]) = 0.5 \left[ |A_L - B_L| + |A_U - B_U| \right]. \tag{9} \]

The Mean Distance Error based on the Ichino-Yaguchi distance is:

\[ MDE_{IY} = \frac{1}{n} \sum_{t=1}^{n} 0.5 \left[ |Y_{t,L} - \hat{Y}_{t,L}| + |Y_{t,U} - \hat{Y}_{t,U}| \right]. \tag{10} \]

4.4 Mean Distance Error based on De Carvalho distance

In [Dec96], De Carvalho proposes a normalization of Ichino-Yaguchi's distance:

\[ d_{DC}([A], [B]) = \frac{d_{IY}^{\gamma}([A], [B])}{w([A] \cup [B])}, \tag{11} \]

where $d_{IY}^{\gamma}([A], [B])$ is given in (7). The range of (11) is $[0, 1]$, and it satisfies the properties required of a distance [Dec96].
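The distances of equations (2), (7) and (11) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: intervals are assumed to be (lower, upper) tuples, all function names are hypothetical, and the width of an empty intersection is taken to be 0 (an assumption of this sketch). The De Carvalho distance divides by the width of the join, so the sketch raises a `ZeroDivisionError` when both intervals are degenerate and equal.

```python
from typing import Tuple

# Assumed representation for this sketch: an interval is a (lower, upper) tuple.
Interval = Tuple[float, float]


def width(x: Interval) -> float:
    """w([X]) = X_U - X_L."""
    return x[1] - x[0]


def join(a: Interval, b: Interval) -> Interval:
    """Cartesian join: [min(A_L, B_L), max(A_U, B_U)]."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet_width(a: Interval, b: Interval) -> float:
    """Width of the intersection; 0 when it is empty (assumption)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def d_hausdorff(a: Interval, b: Interval) -> float:
    """Hausdorff distance between intervals, equation (2)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def d_ichino_yaguchi(a: Interval, b: Interval, gamma: float = 0.5) -> float:
    """Ichino-Yaguchi distance, equation (7)."""
    w_join, w_meet = width(join(a, b)), meet_width(a, b)
    return w_join - w_meet + gamma * (2 * w_meet - width(a) - width(b))


def d_de_carvalho(a: Interval, b: Interval, gamma: float = 0.5) -> float:
    """De Carvalho normalization, equation (11); undefined if the join has zero width."""
    return d_ichino_yaguchi(a, b, gamma) / width(join(a, b))


a, b = (1.0, 3.0), (3.0, 7.0)
print(d_hausdorff(a, b))       # 4.0
print(d_ichino_yaguchi(a, b))  # 3.0
print(d_de_carvalho(a, b))     # 0.5
```

For this pair, the value 3.0 also agrees with equation (9): 0.5 (|1 - 3| + |3 - 7|) = 3.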
If $\gamma = 0$, the measure takes its maximum value whenever the intersection of the intervals is empty, without taking into account how near the intervals are. As this is not a desirable feature in error measurement, we discard this value and propose $\gamma = 0.5$, which offers more suitable features: $d_{DC}([A], [B]) = 1$ if and only if the intervals considered are degenerate and not equal, e.g. $[A] = [3, 3]$ and $[B] = [4, 4]$; $d_{DC}([A], [B]) = 0.5$ if the intervals considered are adjacent, e.g. $[A] = [1, 3]$ and $[B] = [3, 7]$, or if one interval is degenerate and is contained in the other one, e.g. $[A] = [1, 4]$ and $[B] = [2, 2]$; $d_{DC}([A], [B]) < 0.5$ if and only if $w([A] \cap [B]) > 0$; $d_{DC}([A], [B]) \le 0.25$ if and only if $\frac{w([A] \cap [B])}{w([A] \cup [B])} \ge 0.5$. The Mean Distance Error based on De Carvalho distance is defined by:

\[ MDE_{DC} = \frac{1}{n} \sum_{t=1}^{n} \frac{d_{IY}^{\gamma=0.5}([Y]_t, [\hat{Y}]_t)}{w([Y]_t \cup [\hat{Y}]_t)}. \tag{12} \]

The $MDE_{DC}$ is a scale-independent error measure with range $[0, 1]$ and, according to its definition, an $MDE_{DC} \ge 0.5$ means a poor forecast record.

4.5 Some issues on the accuracy measures proposed

A classical time series can be seen as a particular case of an ITS whose intervals are degenerate, $[Y]_t = [a_t, a_t]$, $a_t \in \mathbb{R}$. If the accuracy of a classical time series is evaluated with ITS error measures, the values of $MDE_H$ and $MDE_{IY}$ will be equivalent to the Mean Absolute Error of the classical time series, while the behavior of $RMSDE_H$ will be similar to the behavior of the Root Mean Square Error. In this case, $MDE_{DC}$ is not applicable, because De Carvalho's distance takes the value 1 if the pair of intervals are degenerate and not equal.

$RMSDE_H$ is a measure sensitive to outliers (i.e. to extreme midpoint or radius values) and it remains to be seen whether it has statistical properties as good as those of the Root Mean Square Error in classical time series. In addition, it is not as interpretable as the other measures proposed.
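The remark about degenerate intervals can be checked numerically. The sketch below implements the error measures of equations (3), (6) and (10) for ITS stored as lists of (lower, upper) tuples and compares them with the classical MAE and RMSE on a series of degenerate intervals. The representation, the function names and the toy data are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

# Assumed representation for this sketch: an interval is a (lower, upper) tuple.
Interval = Tuple[float, float]


def d_h(a: Interval, b: Interval) -> float:
    """Hausdorff distance between intervals, equation (2)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def d_iy(a: Interval, b: Interval) -> float:
    """Ichino-Yaguchi distance with gamma = 0.5, equation (9)."""
    return 0.5 * (abs(a[0] - b[0]) + abs(a[1] - b[1]))


def mde_h(obs: Sequence[Interval], fc: Sequence[Interval]) -> float:
    """Mean Distance Error based on the Hausdorff metric, equation (3)."""
    return sum(d_h(y, f) for y, f in zip(obs, fc)) / len(obs)


def rmsde_h(obs: Sequence[Interval], fc: Sequence[Interval]) -> float:
    """Root Mean Squared Distance Error, equation (6), written as sqrt(mean(d_H^2))."""
    return math.sqrt(sum(d_h(y, f) ** 2 for y, f in zip(obs, fc)) / len(obs))


def mde_iy(obs: Sequence[Interval], fc: Sequence[Interval]) -> float:
    """Mean Distance Error based on the Ichino-Yaguchi distance, equation (10)."""
    return sum(d_iy(y, f) for y, f in zip(obs, fc)) / len(obs)


# Made-up classical series, turned into degenerate intervals [a_t, a_t].
actual = [1.0, 2.0, 4.0]
predicted = [2.0, 2.0, 1.0]
obs = [(a, a) for a in actual]
fc = [(p, p) for p in predicted]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(math.isclose(mde_h(obs, fc), mae))     # True
print(math.isclose(mde_iy(obs, fc), mae))    # True
print(math.isclose(rmsde_h(obs, fc), rmse))  # True
```

On degenerate intervals every distance reduces to $|a_t - \hat{a}_t|$, which is why the three measures coincide with their classical counterparts in this sketch.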
$MDE_{DC}$ is scale-independent and highly interpretable, as it ranges from 0 to 1, with 0.5 and 0.25 being values with a clear meaning. Therefore, it is useful for comparing errors in time series that have different scales. $MDE_H$ and $MDE_{IY}$ are scale-dependent measures, but their interpretation is clear, as they account for the deviation in midpoints and radii, and in minima and maxima, respectively. Their behavior is similar to that of the Mean Absolute Error in classical time series; thus, they are less sensitive to outliers. People familiar with interval data may feel comfortable using the error measures based on the Hausdorff distance, as it is the distance applied in interval algebra and is taken as a basis in several symbolic methods dealing with intervals [PL03]. The choice of an ITS error measure should also be guided by whether the concept of accuracy rests on the midpoints and radii of the intervals, or on their lower and upper bounds.

The aspects discussed in this section should guide practitioners when choosing an ITS error measure, but the decision should also take into account other factors, such as the domain of the problem.

5 A first approach to forecasting interval time series

At the moment, there are no specific forecasting methods proposed for ITS, but forecasting methods for classical time series can be applied in the following way. First, the ITS should be expressed in terms of its minimum and maximum series, or of its midpoint and radius series. Each of these series should be analyzed independently using classical time series analysis methods in order to find the components of its pattern (trend, cycle and seasonality), including possible non-linearities. Then, it should be decided which pair of series is going to be forecast and which forecasting method will be used for each one; moreover, if appropriate, a multivariate method can be applied.
Then, the parameter values of the chosen method should be estimated by minimizing an ITS error measure on the training set. Finally, the accuracy of the calibrated method has to be verified on the test set.

Consider an ITS representing the minimum and maximum daily price of the share of a Spanish bank from 2005-05-01 until 2005-09-30 (see Fig. 1). We propose different ways of forecasting this ITS based on exponential smoothing methods. They are simple methods, but it has been shown many times that their forecasting ability is, on average, as good as that of more sophisticated ones.

Fig. 1. ITS representing the minimum and maximum daily prices of a share

We have proposed three models3. The midpoint-radius approach consists of Holt's Exponential Smoothing method for the midpoint series (α = .99; β = .02), because there is a trend in the series, and a Single Exponential Smoothing method for the radius series (α = .16), as it has no trend. The minimum-maximum approach consists of Holt's Exponential Smoothing method for the minimum series (α = .99; β = .03) and for the maximum series (α = .99; β = .14), because both series have a trend. The naive method ($[\hat{Y}]_{t+1} = [Y]_t$) is applied to determine whether the use of the other two methods is justified.

3 The parameters of the models have been estimated by a genetic algorithm. The GA has searched the parameter space for good values in order to minimize the value of the $MDE_{IY}$ on the training set (76 periods).

Table 1 shows the one-step-ahead forecast performance of the three models on the test set. The midp-rad approach outperforms the two other approaches, min-max and naive, which obtain quite similar results.
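The midpoint-radius and naive approaches described above can be sketched as follows. This is a rough illustration under several assumptions, not the procedure actually used for Table 1: the data are made up, the function names are hypothetical, the initial level and trend are set with a simple rule chosen for this sketch, and the smoothing parameters are merely copied from the text as placeholders (the genetic algorithm estimation is not reproduced).

```python
from typing import List, Sequence, Tuple

# Assumed representation for this sketch: an interval is a (lower, upper) tuple.
Interval = Tuple[float, float]


def ses_one_step(y: Sequence[float], alpha: float) -> List[float]:
    """Single Exponential Smoothing; returns one-step-ahead forecasts of y[1:]."""
    level = y[0]  # initial level = first observation (an assumption)
    forecasts = []
    for obs in y[1:]:
        forecasts.append(level)  # forecast issued before obs is seen
        level = alpha * obs + (1 - alpha) * level
    return forecasts


def holt_one_step(y: Sequence[float], alpha: float, beta: float) -> List[float]:
    """Holt's linear method; returns one-step-ahead forecasts of y[2:]."""
    level, trend = y[1], y[1] - y[0]  # simple initialisation (an assumption)
    forecasts = []
    for obs in y[2:]:
        forecasts.append(level + trend)
        new_level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def midp_rad_forecasts(its: Sequence[Interval], a_mid: float, b_mid: float,
                       a_rad: float) -> List[Interval]:
    """Forecast its[2:] by smoothing the midpoint and radius series separately."""
    mid = [(lo + hi) / 2 for lo, hi in its]
    rad = [(hi - lo) / 2 for lo, hi in its]
    fc_mid = holt_one_step(mid, a_mid, b_mid)
    fc_rad = ses_one_step(rad, a_rad)[1:]  # drop the forecast of rad[1] to align
    return [(c - r, c + r) for c, r in zip(fc_mid, fc_rad)]


def mde_iy(obs: Sequence[Interval], fc: Sequence[Interval]) -> float:
    """Mean Distance Error based on the Ichino-Yaguchi distance, equation (10)."""
    return sum(0.5 * (abs(y[0] - f[0]) + abs(y[1] - f[1]))
               for y, f in zip(obs, fc)) / len(obs)


# Made-up daily [min, max] prices, for illustration only.
toy_its = [(10.0, 10.4), (10.1, 10.6), (10.3, 10.7),
           (10.4, 10.9), (10.6, 11.0), (10.7, 11.3)]

target = toy_its[2:]
fc_midp_rad = midp_rad_forecasts(toy_its, 0.99, 0.02, 0.16)
fc_naive = toy_its[1:-1]  # naive: the forecast for t+1 is the interval observed at t

print("MDE_IY midp-rad:", mde_iy(target, fc_midp_rad))
print("MDE_IY naive:   ", mde_iy(target, fc_naive))
```

Recombining the forecasts as (midpoint - radius, midpoint + radius) always yields a valid interval here, because the smoothed radius is a convex combination of non-negative radii. A minimum-maximum variant would apply `holt_one_step` to the two bound series instead, in which case nothing prevents the forecast bounds from crossing.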
Though the conclusion cannot be extrapolated, it agrees with the intuitive notion that it is more appropriate to deal with ITS in terms of midpoint and radius than in terms of minimum and maximum, because the former models separately the behavior of the interval location (midpoint) and of the interval inner variability (radius).

Table 1. Errors of the different forecasting methods in the test set (32 observations)

Approach   MDE_H           MDSE_H         MDE_IY         MDE_DC
midp-rad   10.22 · 10^-2   2.21 · 10^-2   8.38 · 10^-2   30.97 · 10^-2
min-max    11.11 · 10^-2   2.35 · 10^-2   8.81 · 10^-2   33.6 · 10^-2
naive      11 · 10^-2      2.37 · 10^-2   8.5 · 10^-2    31.43 · 10^-2

6 Conclusions

ITS provide an interesting approach to sampling and aggregating massive temporal data, reporting on how the range of variation of the observed variables evolves through time. This article has offered a first approach to the field, with an emphasis on accuracy measurement. However, other matters await further research, for example: the development of new forecasting methods for ITS along with case studies endorsing their usefulness; the definition of concepts that allow the description of an ITS; empirical comparisons of different forecasting methods in order to draw conclusions to guide practitioners; and so on. Besides ITS, it must be borne in mind that other kinds of symbolic-valued time series, such as histogram and distribution time series, await investigation.

References

[BD03] Billard, L., Diday, E.: From the statistics of data to the statistics of knowledge: symbolic data analysis. J. of the Am. Stat. Assoc., 98, 991-999 (2003)
[DFT03] Daw, C.S., Finney, C.E.A., Tracy, E.R.: A review of symbolic analysis of experimental data. Review of Scientific Instruments, 74, 916-930 (2003)
[Dec96] De Carvalho, F.A.T.: Histogrammes et indices de proximité en analyse de données symboliques. Actes de l'école d'été sur l'analyse des données symboliques. Université de Paris IX - Dauphine, Paris (1996)
[DH06] De Gooijer, J.G., Hyndman, R.J.: 25 years of time series forecasting. International Journal of Forecasting, to appear (2006)
[GP00] Gettler-Summa, M., Pardoux, C.: Symbolic approaches for three-way data. In: Bock, H.-H., Diday, E. (eds.) Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin (2000)
[IY94] Ichino, M., Yaguchi, H.: Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. on Systems, Man and Cybernetics, 24/1, 698-708 (1994)
[PI05] Palumbo, F., Irpino, A.: Multidimensional interval-data: metrics and factorial analysis. In: Proceedings of the ASMDA 2005. ENST Bretagne, Brest (2005)
[PL03] Palumbo, F., Lauro, C.L.: A PCA for interval-valued data based on midpoints and radii. In: Yanai, H. et al. (eds.) New Developments on Psychometrics. Springer-Verlag, Tokyo (2003)
[ZT00] Zellner, A., Tobias, J.: A note on aggregation, disaggregation and forecasting performance. Journal of Forecasting, 19, 457-465 (2000)