Master Thesis in Statistics, Data Analysis and Knowledge Discovery

Change-point detection in vector time series using tree algorithms and smoothers

He Zhang

Department of Computer and Information Science
2009-08-22

Abstract

Detection of abrupt level shifts in observational data is of great interest in climate research, bioinformatics, surveillance of manufacturing processes, and many other areas. The most widely used techniques are based on models in which the mean is stepwise constant. Here, we consider detection of more or less synchronous level shifts in vector time series of data that also exhibit smooth trends. The method we propose is based on a back-fitting algorithm that alternates between estimation of smooth trends for a given set of change-points and estimation of change-points for given smooth trends. More specifically, we combine an existing non-parametric smoothing technique with a new two-step technique for change-point detection. First, we use a tree algorithm to grow a large decision tree in which the space of vector coordinates and time points is partitioned into rectangles where the response (adjusted for smooth trends) has a constant mean.
Thereafter, we reduce the complexity of the tree model by merging adjacent pairs of rectangles until further merging would cause a significant increase in the residual sum of squares. Our algorithm was tested on synthetic vector time series of data with promising results. An application to a vector time series of water quality data showed that such data can be decomposed into smooth trends, abrupt level shifts and noise. Further work should focus on stopping rules for reducing the complexity of the tree model.

Keywords: change-point detection, regression trees, nonparametric smoothing

Acknowledgements

I would like to express my deep gratitude to my supervisor Prof. Anders Grimvall, who introduced me to a very interesting and powerful research topic. He spent a lot of time helping me develop the ideas and improve this thesis in both content and language. I would like to thank Ph.D. student Sackmone Sirisack for his kind suggestions and discussions on the thesis. I also dearly thank my parents, Ailing Zheng and Decheng Zhang, and my good friend Johan Hemström for their support and encouragement.

Table of contents

1 Introduction
1.1 Background
1.2 Objective
2 Methods
2.1 Detection of change-points
2.1.1 Parameter estimation
2.1.2 Model selection
2.1.3 A generalized tree algorithm
2.2 Smoothing of vector time series of data
2.3 A general back-fitting algorithm
3 Mean shift model and datasets
3.1 Computer-generated data
3.2 Observational data
4 Test beds for the proposed algorithms
4.1 Runs involving surrogate data
4.2 Runs involving real data
5 Results
5.1 Analysis of surrogate data
5.2 Analysis of observational water quality data
6 Discussion and conclusions
7 Literature
8 Appendix

1 Introduction

1.1 Background

Change-point analysis aims to detect and estimate abrupt level shifts and other sudden changes in the statistical properties of sequences of observed data. This form of statistical analysis has a great variety of applications, such as quality control of manufactured items, early detection of epidemics, and homogenization of environmental quality data. In some cases, the change-points represent real changes in the systems under consideration. In other cases, abrupt changes in the collected data may be due to systematic measurement errors.

Control charts are considered to be the oldest statistical instruments for change-point detection. The basic ideas of such charts were outlined by Shewhart (1924), a pioneer in statistical quality control for industrial manufacturing processes. Other authors have contributed to a solid theory for retrospective analyses of changes in the mean of a probability distribution. Hawkins (1977) proposed a test for a single shift in location of a sequence of normally distributed random variables, and Srivastava & Worsley (1986) proposed a test for a single change in the mean of multivariate normal data. Shifts in the mean were also examined by Alexandersson (1986), who proposed the so-called standard normal homogeneity test (SNHT) for detecting artificial level shifts in meteorological time series when a reliable reference series is available.

Fitting multiple change-point models to observational data is a computationally demanding problem that requires algorithms in which a global optimization is separated into simpler optimization tasks (Hawkins, 2001). This is particularly true when the number of change-points is unknown (Caussinus & Mestre, 2004; Picard et al., 2005). A fast Bayesian change-point analysis was recently published by Erdman & Emerson (2008).

The cited methods are based on the assumption that the mean is stepwise constant or that this condition is fulfilled after removing a common trend component in all the investigated series. Here, we consider the problem of detecting change-points in the presence of smooth trends that may differ between vector components. Recently, it was shown how a synchronous level shift in all vector components can be estimated in the presence of trends that vary smoothly across series (Wahlin et al., 2008). In this thesis we consider change-points that are not necessarily synchronous.

1.2 Objective

The general objective of this thesis is to develop an algorithm that enables detection of an unknown number of level shifts in vector time series of data that also exhibit smooth trends. In particular, we shall investigate how change-point detection involving regression trees can be combined with a nonparametric smoothing technique (Grimvall et al., 2008) into a back-fitting algorithm. The specific objectives are as follows:

• Implement a tree algorithm for splitting the space of vector coordinates and time points into a set of rectangular subsets (regions).

• Reduce the complexity of the derived tree model by successively merging regions until further merging would lead to a significant increase in the residual sum of squares.
• Integrate the tree algorithm with an existing nonparametric smoothing technique (MULTITREND) into an algorithm for decomposing the variation of the given data into smooth trends, abrupt level shifts and noise.

• Test the tree method and the integrated algorithm on synthetic data and real data.

2 Methods

2.1 Detection of change-points

Decision trees provide simple and useful structures for decision processes and for estimation of piecewise constant functions in nonlinear regression. A widely used algorithm, which we shall refer to as the tree algorithm, was developed by Breiman et al. (1984). It assumes that the mean response is a function of p predictors or inputs and that the input domain can be split into cuboids where the mean is constant. We shall consider tree algorithms for segmentation of vector time series of data. The mean response is assumed to be piecewise constant with an arbitrary number of level shifts. Some of these level shifts may be synchronous for two or more series, whereas others only affect the mean of a single series.

Figure 1. Arbitrary partitions and CART (Hastie et al., 2001)

Figure 1 compares an arbitrary partition with CART splitting, where the latter partitions the dataset by building a tree structure, i.e. by segmenting the feature space into a set of rectangles. More specifically, the top left panel shows a general partition obtained from arbitrary (not recursively binary) splitting, while the top right panel illustrates a partition of a two-dimensional feature space by recursive binary splitting, as used in CART. The bottom left panel shows the corresponding tree structure, and a perspective plot of the prediction surface appears in the last panel (Hastie et al., 2001). We shall use a tree algorithm to partition the two-dimensional space of time points and vector coordinates.

2.1.1 Parameter estimation

Here, we describe a general tree algorithm outlined by Breiman and co-workers (1984) and Hastie and co-workers (2001). Furthermore, we explain how this algorithm can be simplified when we are estimating a function that is piecewise constant in two ordinal-scale inputs.

Consider the dataset $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, comprising N observations with p independent variables (inputs) and one response. The cited algorithm then recursively identifies splitting variables and split points so that the input space is partitioned into cuboids (or regions) $R_1, R_2, \ldots, R_M$ in which the response is modeled as a constant $c_m$:

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$

When searching for the first two regions, we let

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}$$

and determine the splitting variable j and split point s that solve

$$\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$$

This procedure is then repeated on each of the derived regions until a large tree model has been constructed. Because the sum of squares $\sum_i (y_i - f(x_i))^2$ is employed as the measure of goodness-of-fit, the best estimator of the mean $c_m$ of $y_i$ in region $R_m$ is the average observed response:

$$\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$$

In our case, we have a choice between splitting with respect to time (j = 1) or vector coordinate (j = 2) of the observed time series in each step. The number of combinations of j and s that must be investigated is thus equal to the sum of the number of time points and the number of time series in the region under consideration. Furthermore, it is sufficient to use information about the number of observations and their sum in arbitrary rectangles of the input space. This makes the tree algorithm very fast, with complexity O(N).
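As an illustration of this search, the following sketch finds the best binary split of a rectangular region using only cumulative counts and sums. It is a minimal sketch written for this presentation, not the thesis implementation: the function name best_split and the data layout (a 2-D array with one row per series and one column per time point) are our own assumptions.

```python
import numpy as np

def best_split(y):
    """Find the best binary split of a rectangular region.

    y : 2-D array (series x time) of responses (adjusted for smooth trends).
    Returns (variable, index, rss_reduction), where the split separates
    rows/columns [0:index] from [index:]. Only counts and sums are needed:
    for groups of sizes n1, n2 with sums s1, s2, the drop in RSS equals
    s1^2/n1 + s2^2/n2 - (s1 + s2)^2/(n1 + n2).
    """
    total = y.sum()
    n = y.size
    best = ('none', 0, 0.0)
    for axis, name in ((1, 'time'), (0, 'series')):
        # cumulative sums and counts of whole rows/columns
        sums = np.cumsum(y.sum(axis=1 - axis))
        counts = np.arange(1, y.shape[axis] + 1) * y.shape[1 - axis]
        for i in range(1, y.shape[axis]):      # split after row/column i-1
            s1, n1 = sums[i - 1], counts[i - 1]
            s2, n2 = total - s1, n - n1
            gain = s1**2 / n1 + s2**2 / n2 - total**2 / n
            if gain > best[2]:
                best = (name, i, gain)
    return best

# Example: one series shifts upward halfway through the record.
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 40))
data[1, 20:] += 3.0
print(best_split(data))
```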
2.1.2 Model selection

Breiman's procedure for selecting a suitable tree model involves two basic steps. First, a large tree is built to ensure that the input domain is split into regions that are pure in the sense that the response exhibits little variation. Then, regions are merged by pruning branches of the decision tree.

Generally, the target of the tree algorithm is to achieve the best partition at minimum cost. In other words, the split at each node should generate the greatest improvement in decision accuracy. This is usually measured by the node impurity (sum of squared residuals), which indicates the relative homogeneity of all the cases within one region. Furthermore, another criterion must be established to determine when to stop the splitting and pruning processes.

The tree size is determined by considering the trade-off between the derived number of regions (node size or model complexity) and the goodness-of-fit to the given data. A huge tree may over-fit the given data and increase the variance of model predictions. A small tree might not capture important features of the underlying model structure and may produce biased predictions. One approach to the model selection problem is to perform a significance test for each suggested split. However, this strategy is short-sighted, since a seemingly worthless split might lead to a very good split after other splits have been undertaken. Therefore, it is often recommended to first grow a large tree and then reduce the model complexity by pruning. Because the size of the large tree is not very crucial, one can simply set a threshold for the node size (the number of observations in a node) in each terminal region (Hastie et al., 2001).

The final pruning requires a more elaborate assessment of the trade-off between model complexity (number of terminal nodes) and goodness-of-fit (sum of squared errors). Hastie and co-workers (2001) suggest a cost-complexity criterion composed of the total cost (total sum of squared errors) and a penalty quantity indicating the complexity of the derived tree in terms of the number of terminal nodes:

$$C_\alpha = \sum_{m=1}^{M} \mathrm{RSS}_m + \alpha M, \quad \text{where } \mathrm{RSS}_m = \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$$

Adaptively choosing the penalty factor $\alpha$ is the main difficulty in using the cited procedure to prune a tree to the right size. Alternative model selection procedures may be based on information criteria such as Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) or minimum description length.

2.1.3 A generalized tree algorithm

We have previously noted that tree algorithms generate partitions of the input domain into cuboids or rectangles. To be able to fit more complex structures to a set of collected data, we consider models in which the mean is constant in regions that are formed as unions of rectangles. First, a tree algorithm is used to create a set of relatively small rectangular regions, and then adjacent regions are merged step by step until there is a significant increase in the segmentation cost. We tried to control the merging of regions using two different procedures: (i) a standard partial F-test; (ii) block cross-validation. The F-test compares the residual sums of squared errors before and after the merging. Block cross-validation splits the data set into training and test sets, fits models to the training sets, and evaluates the predictive power of the fitted models on the test sets.
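The sketch below illustrates how such a merging loop can be organized. It is a simplified rendering, not the implementation used in the thesis: each region is assumed to be summarized by its observation count, sum, and sum of squares; the adjacency relation is assumed to be supplied by the caller (and to treat a merged region as adjacent to the neighbours of both of its parts); and SciPy's F distribution provides the p-values.

```python
from itertools import combinations
from scipy.stats import f as f_dist

def merge_regions(regions, adjacent, n_obs, alpha=0.01):
    """Merge adjacent regions until all remaining pairs differ significantly.

    regions  : dict id -> (n, s, ss) with count, sum and sum of squares.
    adjacent : function (id_a, id_b) -> bool, True if regions share a border.
    n_obs    : total number of observations.
    alpha    : significance level for stopping the merging.
    """
    def rss(n, s, ss):            # within-region sum of squares around the mean
        return ss - s * s / n

    while len(regions) > 1:
        best = None
        p_models = len(regions)   # one mean parameter per region
        rss2 = sum(rss(*r) for r in regions.values())       # before merging
        for a, b in combinations(regions, 2):
            if not adjacent(a, b):
                continue
            na, sa, ssa = regions[a]
            nb, sb, ssb = regions[b]
            rss1 = rss2 - rss(na, sa, ssa) - rss(nb, sb, ssb) \
                   + rss(na + nb, sa + sb, ssa + ssb)        # after merging
            F = (rss1 - rss2) / (rss2 / (n_obs - p_models))
            p_value = f_dist.sf(F, 1, n_obs - p_models)
            if best is None or p_value > best[0]:
                best = (p_value, a, b)
        if best is None or best[0] <= alpha:
            break                 # every mergeable pair differs significantly
        _, a, b = best            # join the pair with the largest p-value
        na, sa, ssa = regions.pop(a)
        nb, sb, ssb = regions.pop(b)
        regions[a] = (na + nb, sa + sb, ssa + ssb)
    return regions
```

As in the text, the loop always joins the pair whose merging is least harmful (largest p-value) and stops when the maximum p-value falls to or below the chosen threshold.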
2.2 Smoothing of vector time series of data

When vector time series of data are analyzed, it is often appropriate to assume that the expected response is a smooth function of both time and vector coordinate. For example, this is a natural assumption when the analyzed dataset represents environmental quality at sampling sites located along a transect or along an elevation gradient. It may also be appropriate for data representing measured concentrations of chemical compounds that can be ordered linearly with respect to volatility or polarity.

So-called gradient smoothing (Grimvall et al., 2008) refers to fitting smooth response surfaces to a vector time series of data by minimizing the sum

$$S(\mu, \lambda_1, \lambda_2) = L(\mu) + \lambda_1 L_1(\mu) + \lambda_2 L_2(\mu),$$

where

$$L(\mu) = \sum_{t=1}^{n} \sum_{j=1}^{m} \Big( y_t^{(j)} - \mu_t^{(j)} - \sum_{i=1}^{p} \beta_i^{(j)} (x_{it}^{(j)} - \bar{x}_i^{(j)}) \Big)^2$$

is the residual sum of squares,

$$L_1(\mu) = \sum_{t=2}^{n-1} \sum_{j=1}^{m} \big( \mu_{t-1}^{(j)} - 2\mu_t^{(j)} + \mu_{t+1}^{(j)} \big)^2$$

represents roughness over time,

$$L_2(\mu) = \sum_{t=1}^{n} \sum_{j=2}^{m-1} \big( \mu_t^{(j-1)} - 2\mu_t^{(j)} + \mu_t^{(j+1)} \big)^2$$

represents roughness across coordinates, and $\lambda_1$ and $\lambda_2$ are roughness penalty factors. Minor modifications of this smoothing method can make it appropriate also for seasonal or circular data.

Figure 2. Gradient smoothing for data collected at different sites along a gradient

2.3 A general back-fitting algorithm

The following pseudo code, given by Wahlin et al. (2008), illustrates the back-fitting algorithm for joint estimation of smooth trends and discontinuities (level shifts):

1) Initialize $\gamma = 0$.
2) Initialize $\beta$ by using a multiple linear regression model with intercept to regress y on $x_1, \ldots, x_p$.
3) Initialize press = 0.
4) Initialize s = 1.
5) Repeat
6)   T = {1, …, s-1, s+1, …, n}
     Cycle
       $u_t^{(j)} = y_t^{(j)} - \gamma_t^{(j)} - \sum_{k=1}^{p} \beta_k^{(j)} (x_{kt}^{(j)} - \bar{x}_k^{(j)})$
       $\mu = \arg\min_\mu \Big[ \sum_{t \in T} \sum_{j=1}^{m} (u_t^{(j)} - \mu_t^{(j)})^2 + \lambda_1 L_1(W_1, \mu) + \lambda_2 L_2(W_2, \mu) \Big]$
       $u_t^{(j)} = y_t^{(j)} - \mu_t^{(j)} - \gamma_t^{(j)}, \quad t \in T, \; j = 1, \ldots, m$
       $\beta = \arg\min_\beta \Big[ \sum_{t \in T} \sum_{j=1}^{m} \big( u_t^{(j)} - \sum_{k=1}^{p} \beta_k^{(j)} (x_{kt}^{(j)} - \bar{x}_k^{(j)}) \big)^2 \Big]$
       $u_t^{(j)} = y_t^{(j)} - \mu_t^{(j)} - \sum_{k=1}^{p} \beta_k^{(j)} (x_{kt}^{(j)} - \bar{x}_k^{(j)}), \quad t \in T, \; j = 1, \ldots, m$
       $\gamma = \arg\min_\gamma \Big[ \sum_{t \in T} \sum_{j=1}^{m} (u_t^{(j)} - \gamma_t^{(j)})^2 \Big]$
       $\gamma_t^{(j)} = \gamma_t^{(j)} - \mathrm{mean}(\gamma), \quad t \in T, \; j = 1, \ldots, m$
     until the relative change in the penalized sum of squares on T is below a predefined threshold
     $\mathrm{press} = \mathrm{press} + \sum_{j=1}^{m} \big( y_s^{(j)} - \mu_s^{(j)} - \gamma_s^{(j)} - \sum_{k=1}^{p} \beta_k^{(j)} (x_{ks}^{(j)} - \bar{x}_k^{(j)}) \big)^2$
     s = s + 1
7) Until s = n.

Here, $\mu$ represents the smooth trend, $\gamma$ the abrupt level shifts, and $\beta$ the impact of covariates.
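The alternation at the heart of this pseudo code can be rendered compactly in Python, as in the sketch below. This is not the MULTITREND implementation: covariates are omitted, the smoother is a simple second-difference penalized least-squares fit applied series by series, and estimate_shifts is a placeholder for the tree-based change-point step of Section 2.1.

```python
import numpy as np

def smooth(u, lam=10.0):
    """Penalized least-squares smoother for one series: minimizes
    sum((u - mu)^2) + lam * sum(second differences of mu)^2."""
    n = len(u)
    D = np.diff(np.eye(n), n=2, axis=0)          # second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, u)

def backfit(y, estimate_shifts, lam=10.0, tol=1e-6, max_iter=50):
    """Alternate between trend estimation for given level shifts and
    level-shift estimation for given trends.

    y               : 2-D array (series x time).
    estimate_shifts : function residuals -> array of the same shape holding
                      the fitted piecewise constant level-shift component
                      (a stand-in for the tree algorithm of Section 2.1).
    """
    gamma = np.zeros_like(y, dtype=float)        # level-shift component
    prev = None
    for _ in range(max_iter):
        mu = np.apply_along_axis(smooth, 1, y - gamma, lam)   # smooth trends
        gamma = estimate_shifts(y - mu)                       # level shifts
        gamma -= gamma.mean()                    # center shifts (identifiability)
        sse = ((y - mu - gamma) ** 2).sum()
        if prev is not None and abs(prev - sse) <= tol * prev:
            break                                # relative change small enough
        prev = sse
    return mu, gamma, y - mu - gamma             # trend, shifts, noise
```

For a quick experiment, estimate_shifts can be any function returning a piecewise constant fit of the residuals; in the full algorithm it is the split-and-merge tree procedure, and covariates would add a third step to each cycle, exactly as in the pseudo code above.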
3 Mean shift model and datasets

A univariate mean shift model can be written

$$y_t = \mu + S_t + \varepsilon_t, \quad t = 1, \ldots, n, \quad \text{where } S_t = \sum_{i=1}^{t} \delta_i,$$

$\delta_i$ is the level shift at time i, and $y_t$ is the observed response at time t. $S_t$ is thus the cumulated level shift from time 1 to t. If there is no level shift at time i, $\delta_i$ is zero. The error terms $\varepsilon_t$ are assumed to be independent and normally distributed with zero mean and constant standard deviation. Multivariate mean shift models can be defined analogously, and the level shifts can be more or less synchronous.

3.1 Computer-generated data

A set of surrogate data representing mean monthly temperatures recorded during the period 1913-1999 was downloaded from www.homogenisation.org. These data form part of a larger benchmark dataset for assessing the ability of statistical methods to detect artificial level shifts in climate data with realistic marginal distributions and temporal and spatial correlations. More information about the surrogate dataset can be found at the following link: ftp://ftp.meteo.uni-bonn.de/pub/victor/costhome/monthly_benchmark/inho/

3.2 Observational data

Water quality data representing the concentrations of total phosphorus at three different depths (0.5, 10 and 20 m) at Dagskärsgrund in Lake Vänern (Sweden), 1991-2005, were downloaded from the Swedish University of Agricultural Sciences. The mean values of the observed concentrations for each combination of year and depth were then used as inputs to our change-point detection algorithm.

4 Test beds for the proposed algorithms

4.1 Runs involving surrogate data

Data with a negligible trend slope, such as our surrogate temperature data, can be analyzed using models with piecewise constant means. Accordingly, we applied our tree algorithm directly to that dataset. The input domain was subjected step by step to binary splits using the combination of splitting variable and splitting point that produced the most significant "gain", i.e. the strongest reduction in the total residual sum of squares (RSS). Furthermore, region features, such as size (the number of observations in each derived region), mean, and RSS, were updated after each split. As the number of regions increases, the total RSS may first decrease sharply and then level out as the "gain" decreases. To assess the statistical significance of the observed gain, we computed the following F-statistic at each step:

$$F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/(p_2 - p_1)}{\mathrm{RSS}_2/(n - p_2)}$$

Here, RSS₁ denotes the residual sum of squares before splitting, whereas RSS₂ is the residual sum of squares after splitting. Furthermore, n represents the number of observations, while the degrees of freedom of the two sums are n - p₁ and n - p₂, respectively. The split under consideration was accepted when the F-statistic exceeded 2.71, which corresponds to a significance level of approximately 0.1 when p₂ - p₁ = 1 and n is large. When no split was statistically significant, the growth of the decision tree was terminated.

To reduce the risk of over-fitting, pairs of adjacent regions were then merged until the same F-statistic was above a predefined level for all adjacent pairs of regions. Different significance levels were examined. However, in general, we used a lower significance level when regions were merged. The final result of the entire process is a partition of the input domain into regions that are unions of rectangles.

Cross-validation was used as an alternative method to determine how many pairs of regions should be joined. More specifically, we used K-fold cross-validation. This means that the original sample is partitioned into K subsamples and that K pairs of training and test sets are formed by letting one of the subsamples be the test set while all other data constitute the training set. The prediction error sum of squares (PRESS) was computed for all test sets, and the total PRESS value was then used to examine how far regions could be joined without reducing the predictive power of the model.
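This cross-validation scheme can be summarized as in the following sketch, which computes the total PRESS for a grid of significance levels. The blocking into whole years and the K folds follow the text; the function fit_and_predict and all other names are assumptions standing in for the split-and-merge tree algorithm.

```python
import numpy as np

def press_for_alpha(y, years, alphas, fit_and_predict, k=10, seed=0):
    """Total prediction error sum of squares for each significance level.

    y               : 2-D array (series x time), one column per time point.
    years           : 1-D array of year labels, one per column of y.
    alphas          : candidate significance levels for stopping the merging.
    fit_and_predict : function (train_mask, alpha) -> predictions for the
                      full grid, fitted on the training columns only (a
                      stand-in for the split-and-merge tree algorithm).
    k               : number of folds; k = len(unique years) gives the
                      leave-one-year-out variant.
    """
    uniq = np.unique(years)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(uniq), k)   # blocks of whole years
    press = {}
    for alpha in alphas:
        total = 0.0
        for test_years in folds:
            train_mask = ~np.isin(years, test_years)
            pred = fit_and_predict(train_mask, alpha)
            total += ((y[:, ~train_mask] - pred[:, ~train_mask]) ** 2).sum()
        press[alpha] = total
    return press   # choose the alpha with the smallest PRESS
```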
4.2 Runs involving real data

Datasets that have a smooth trend can be analyzed for change-points using a back-fitting algorithm that alternates between estimation of change-points for a given smooth trend and estimation of smooth trends for a given set of change-points. We used the previously described water quality data from Lake Vänern for our test runs. Initial estimates of one smooth and one discontinuous component were obtained by employing the decomposition method developed by Wahlin and co-workers (2008). Thereafter, the decomposition was improved step by step using our back-fitting algorithm, in which the level shifts are not necessarily synchronous. More specifically, our tree algorithm was used to estimate change-points for given trends, whereas the software MULTITREND (Grimvall et al., 2008) was used to estimate smooth trends for a given set of change-points. A noise component was extracted by subtracting the smooth trends and the level shifts from the observed data.

5 Results

5.1 Analysis of surrogate data

Figure 3. Surrogate monthly temperature data for 12 stations, 1913-1999.

Figure 4. Estimated region means after using our tree algorithm to split the input domain into rectangles.

The analyzed surrogate data are illustrated in Figure 3. Visual inspection strongly indicates that there is a major level shift around 1960 and possibly two other significant level shifts. When our tree algorithm was applied to the data shown in Figure 3, the input domain was partitioned into 126 rectangles defining a total of 131 change-points. Figure 4 shows the region mean for each cell, i.e. each combination of year and month. It can be seen that each monthly series has several level shifts. In addition, the tree algorithm has identified several outliers, which are shown as segments with a single observation.

Table 1 in the appendix presents the results of the cross-validation model selection method and compares the derived total predicted residual sum of squares (PRESS) of the ten-fold and leave-one-out cross-validation methods at different significance levels. Both cross-validation methods are implemented using one year of observations as a unit. For instance, leave-one-out cross-validation leaves out the observations in one year at a time. Analogously, ten-fold cross-validation divides the whole dataset into 10 subsamples, and each subsample is used once as a test dataset to measure the predicted residual sum of squares while the remaining nine subsamples are employed as training data. In both cases, predicted values are computed as the mean value in the associated region. A model is then selected by choosing the alpha level with the smallest PRESS in each method. In our test runs, we obtained α = 0.5 using ten-fold CV, while α = 0.01 was obtained with the leave-one-out method. A significance level of 0.5 did not entail a large reduction in the number of terminal regions (from 126 to 109), whereas the number of regions decreased to 44 when α was 0.01.

Figures 5 and 6 compare the results of the two cross-validation methods in terms of the derived PRESS when different significance levels were used. Both graphs have a minimal PRESS that was used to select the significance level in the merging step.

Figure 5. Total sum of squared prediction errors using ten-fold cross-validation to select the prediction model.
Figure 6. Total sum of squared prediction errors obtained by leaving out one year of observations at a time.

The next step of the algorithm is to remove superfluous segmentations. Figure 7 displays the region mean for each observational cell after merging adjacent regions. Compared with the previous graph of region means, there are fewer short segments and single dots after the merging procedure, and hence the data before and after a change-point become more unified within each series. In more detail, the merging procedure was based on the derived 126 rectangles; at each step, the merging of every possible adjacent pair of regions was evaluated using an F-test statistic that measured whether the difference in mean between the two regions was small enough. Furthermore, the corresponding p-value of each test statistic was calculated and used as a criterion for selecting which pair to join. The algorithm always joins the pair with the largest p-value and stops when this value is less than or equal to a given threshold. For the synthetic dataset, 0.01 was chosen as the significance level for stopping the merging of regions.

Figure 7. Region means after merging regions not significantly different at level 0.01.

The sums of squared residuals (RSS) of all the regions were calculated and recorded at each step of our CART splitting and region merging procedures. The total RSS decreases as the number of regions increases through recursive splitting, while it increases slowly as the algorithm merges regions. Figure 8 illustrates that our CART splitting algorithm produced 126 terminal regions under the condition that the F-statistic for splitting a region should be greater than 2.71 (a significance level of 0.1). In addition, this figure compares the results of merging adjacent regions using different stopping rules. The final number of regions after merging was 109 with a significance level of 0.5 and 44 with a significance level of 0.01. In more detail, the merging was stopped when the maximum p-value was less than or equal to the given significance level.

Figure 8. Total residual sums of squares observed during: (i) the binary splitting, (ii) merging of regions with significance level 0.01, and (iii) merging of regions with significance level 0.5.

Figure 9 shows the adjusted temperatures after removing the cumulated level shifts from the original cell values. It can be seen that the temperature in each series moves along its main trend with small fluctuations, and there are no obvious level shifts, although there are still some outliers. The splitting algorithm detected 131 change-points, while 92 change-points remained after merging adjacent regions. The algorithm worked well in the sense that it produced a large number of homogeneous rectangles and then reduced the model complexity by removing superfluous segments. We regarded the abrupt level shifts as false or artificial changes. Accordingly, the corrected series obtained by removing cumulated level shifts should provide a more realistic picture of the variation in temperature over time; the computation is sketched below.
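The correction itself is a small computation: the detected shifts are accumulated over time for each series and subtracted from the observations. The sketch below assumes change-points given in the format of Table 3 in the appendix (vector coordinate Y, time S, and level shift); the function name is our own.

```python
import numpy as np

def remove_level_shifts(y, change_points):
    """Subtract cumulated level shifts from a vector time series.

    y             : 2-D array (series x time) of observed cell values.
    change_points : iterable of (series, time, shift) triples with 1-based
                    indices as in Table 3 of the appendix; a shift applies
                    from its time point onwards.
    Returns the corrected series y - S, where S holds the cumulated shifts.
    """
    S = np.zeros_like(y, dtype=float)
    for series, time, shift in change_points:
        S[series - 1, time - 1:] += shift   # shift persists from time onwards
    return y - S

# Example with the first two detected change-points of Table 3:
# series 1 shifts by +2.2 at time 14 and by +1.8 at time 23.
y = np.zeros((12, 87))                      # 12 series, 87 years (1913-1999)
corrected = remove_level_shifts(y, [(1, 14, 2.2), (1, 23, 1.8)])
```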
Figure 9. Cell means after removing cumulated level shifts.

Figure 10. Predicted versus observed values, when the model selection was based on 10-fold cross-validation with 0.01 as the significance level for stopping the merging of regions.

Predictions and prediction errors obtained from the ten-fold cross-validation method are plotted in Figures 10 and 11. The first graph describes the goodness-of-fit of the prediction model. The second shows the prediction errors versus the corresponding predictions. As can be seen, the residuals have zero mean and small variance, which also indicates that the predicted values are reasonable.

Figure 11. Prediction errors versus predicted values obtained by 10-fold cross-validation with 0.01 as significance level for stopping the merging of regions.

5.2 Analysis of observational water quality data

The integration of our tree model into a back-fitting algorithm that alternates between estimation of smooth trends and change-points enabled a decomposition of a given vector time series into: (i) a smooth trend surface, (ii) a piecewise constant function representing abrupt level shifts, and (iii) irregular variation or noise. The results obtained are illustrated in Figures 12-14.

Figure 12. Corrected trend surface.

Figure 13. Cumulated level shifts standardized to mean zero.

Figure 14. Noise.

6 Discussion and conclusions

Detection of an unknown number of change-points in vector time series of data is a computationally demanding task. This is particularly true for change-point detection integrated into larger algorithms where the detection step is repeated many times. The work presented in this thesis demonstrated that the computational burden can be substantially reduced by using a tree algorithm to identify a set of candidate change-points.

Our tree algorithm, in which the space of time points and vector coordinates is partitioned into rectangles, is very fast. This is so because the binary splitting is based solely on the number of observations and their sum in rectangular subsets of the input domain. Theoretically, the binary splitting is an O(n) algorithm, i.e. the computational burden is approximately proportional to the number of observations. The dynamic programming algorithm described by Hawkins (2001) is an O(n²) algorithm that takes much longer for large datasets. The merging of rectangles into larger subsets of more complex shape is also a fast process, because it is based solely on the number of observations and their sum in the already identified regions or segments. In addition, the number of candidate change-points is typically much smaller than the number of observations.
Test runs involving surrogate temperature data confirmed the computational speed of our tree algorithm. The integration of our tree algorithm into a back-fitting algorithm that alternates between estimation of smooth trends and change-points is still in its infancy. However, our test runs involving water quality data demonstrated that such algorithms can be used to partition a vector time series into three parts: (i) a smooth trend surface, (ii) a piecewise constant function representing abrupt level shifts, and (iii) irregular variation or noise.

The stopping rules for growing and reducing our tree model require further work. We used hypothesis testing and cross-validation for the model selection, but neither of these methods performed satisfactorily. The F-tests are not theoretically correct, because we considered the maximum gain or loss over large sets of splitting and merging operations. Cross-validation has previously been used successfully to fit smooth response surfaces to observed data (Grimvall et al., 2008). However, when the model includes abrupt level shifts at unknown time points, it is not obvious how models fitted to training sets should be used to predict the response at untried points.

The preliminary character of much of the work in this thesis should not lessen the achievements that have been made. There is obviously a great demand for methods that can detect and estimate abrupt level shifts in the presence of smooth trends, and our work shows that using a tree algorithm to identify a set of candidate change-points is a key step in the development of such algorithms.

7 Literature

[1] Alexandersson H. (1986). A homogeneity test applied to precipitation data. Journal of Climatology 6:661-675.
[2] Caussinus H. and Mestre O. (2004). Detection and correction of artificial shifts in climate series. Applied Statistics 53:405-425.
[3] Erdman C. and Emerson J.W. (2008). A fast Bayesian change-point analysis for the segmentation of microarray data. Bioinformatics 24:2143-2148.
[4] Grimvall A., Wahlin K., Hussian M., and von Bromssen C. (2008). Semiparametric smoothers for trend assessment of multiple time series of environmental quality data. Submitted to Environmetrics.
[5] Hastie T., Tibshirani R., and Friedman J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer: New York.
[6] Hawkins D.M. (1977). Testing a sequence of observations for a shift in location. Journal of the American Statistical Association 68:941-943.
[7] Hawkins D.M. (2001). Fitting multiple change-point models to data. Computational Statistics & Data Analysis 37:323-341.
[8] Wahlin K., Grimvall A., and Sirisack S. (2008). Estimating artificial level shifts in the presence of smooth trends. Submitted to Environmental Monitoring and Assessment.
[9] Khaliq M.N. and Ouarda T.B.M.J. (2006). Short communication: On the critical values of the standard normal homogeneity test (SNHT). International Journal of Climatology 27:681-687.
[10] Gey S. and Lebarbier E. (2008). Using CART to detect multiple change-points in the mean for large samples. Technical report, Preprint Statistics for Systems Biology 12.
[11] Picard F., Robin S., Lavielle M., Vaisse C., and Daudin J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics 6:1-14.
[12] Srivastava M.S. and Worsley K.J. (1986). Likelihood ratio tests for a change in the multivariate normal mean. Journal of the American Statistical Association 81:199-204.
[13] Worsley K.J. (1979).
On the likelihood ratio test for a shift in location of normal populations. Journal of the American Statistical Association 74:365-367.

8 Appendix

Table 1. PRESS corresponding to different significance levels derived from ten-fold and leave-one-year-out cross-validation. The smallest PRESS for each method is marked with an asterisk; the corresponding alpha levels were selected as the significance levels used to stop the merging process.

Significance level   Ten-fold CV PRESS   Leave-one-year-out CV PRESS
0.9                  2536.2              3788.6
0.8                  2531.9              3789.8
0.7                  2531.2              3787.0
0.6                  2533.1              3785.1
0.5                  2526.7*             3786.6
0.4                  2547.1              3782.6
0.3                  2560.9              3796.1
0.2                  2626.6              3795.2
0.1                  2645.7              3772.6
0.01                 2792.8              3658.2*
0.001                2785.8              3725.1
0.0001               2902.3              3694.9
0.00001              3083.4              3738.0
0.000001             3219.7              3811.1

Table 2. Rectangle output derived from the tree algorithm after the merging procedure.

Region   No. obs   Mean    SSR
1        35        -0.2    72.3
2        13        6.4     11.8
3        47        3.0     68.5
4        17        3.9     21.8
5        70        -1.8    142.8
6        9         0.6     8.4
7        99        11.0    171.2
8        21        7.3     43.0
9        105       7.4     262.8
10       25        4.4     27.2
11       54        -0.1    142.3
12       34        -1.1    74.8
13       63        3.9     97.5
14       18        2.1     25.0
15       35        4.8     45.6
16       14        1.2     9.7
17       48        13.4    87.1
18       12        4.3     10.6
19       5         4.1     5.7
20       21        8.4     38.9
21       13        8.9     12.7
22       3         1.8     0.1
23       7         1.0     6.4
24       10        11.8    4.2
25       1         -4.8    0.0
26       36        1.5     46.2
27       1         5.9     0.0
28       4         4.3     3.9
29       20        5.0     14.6
30       17        8.0     18.2
31       1         5.2     0.0
32       1         5.6     0.0
33       14        13.3    8.6
34       1         -4.4    0.0
35       10        8.5     8.1
36       6         1.9     1.0
37       27        1.3     32.7
38       4         10.2    4.2
39       65        9.2     103.5
40       11        2.3     14.8
41       1         -2.9    0.0
42       7         2.2     2.1
43       13        5.9     13.7
44       2         13.5    3.1

Table 3 lists the change-points finally detected by our tree algorithm. In total, 92 change-points were detected by implementing the whole tree algorithm; each is described by when and where it occurred and by the size of the level shift. "Nr" denotes the index of the change-point, "Y" the vector coordinate, "S" the time at which the change-point occurred, and "Level shift" the change in mean.

Table 3. Detected change-points.

Nr   Y   S    Shift     Nr   Y   S    Shift     Nr   Y   S    Shift
1    1   14   2.2       32   4   82   2.5       63   8   46   -5.9
2    1   23   1.8       33   5   14   2.0       64   8   48   1.8
3    1   27   -8.7      34   5   34   2.5       65   8   81   2.7
4    1   28   6.3       35   5   38   -7.1      66   9   5    3.7
5    1   45   4.4       36   5   39   3.5       67   9   8    -3.7
6    1   46   -7.7      37   5   46   -3.5      68   9   12   3.7
7    1   81   3.0       38   5   62   1.1       69   9   15   2.5
8    2   14   2.2       39   5   82   2.4       70   9   18   -2.5
9    2   23   1.8       40   6   3    -2.4      71   9   45   2.4
10   2   27   -2.3      41   6   13   2.4       72   9   46   -5.9
11   2   46   -3.3      42   6   42   -3.6      73   9   82   2.8
12   2   81   3.0       43   6   43   5.9       74   10  13   1.6
13   3   6    2.4       44   6   44   -5.9      75   10  26   2.1
14   3   9    -2.4      45   7   13   -5.8      76   10  29   -3.0
15   3   10   4.1       46   7   14   5.8       77   10  46   -3.6
16   3   11   -4.1      47   7   15   2.5       78   10  71   3.0
17   3   14   4.1       48   7   16   -7.8      79   10  75   -3.1
18   3   46   -4.0      49   7   17   7.8       80   10  81   3.1
19   3   82   1.3       50   7   38   -2.5      81   11  10   1.8
20   4   6    -1.9      51   7   40   2.4       82   11  45   -1.8
21   4   10   1.7       52   7   46   -5.9      83   11  46   -2.4
22   4   18   4.5       53   7   48   1.8       84   11  55   -1.7
23   4   19   -4.5      54   7   51   -1.8      85   11  56   2.4
24   4   20   2.0       55   7   52   1.8       86   11  80   3.0
25   4   33   -2.0      56   7   81   2.7       87   12  8    -5.4
26   4   45   3.5       57   8   4    -3.7      88   12  9    7.4
27   4   46   -7.5      58   8   5    6.2       89   12  46   -4.1
28   4   59   4.0       59   8   7    -2.5      90   12  79   2.4
29   4   60   -1.6      60   8   15   2.5       91   12  82   -4.2
30   4   71   -2.4      61   8   38   -2.5      92   12  83   4.7
31   4   76   1.9       62   8   40   2.4

Table 4. Water quality dataset representing phosphorus concentrations (μg/l) in Lake Vänern.
Year   Depth 0.5 m   Depth 10 m   Depth 20 m
1991   9.3           8.17         9.3
1992   8             8.3          8.8
1993   8.8           8.3          8.8
1994   10.3          10.3         8.3
1995   9.3           8.7          8.3
1996   6             5.8          5.6
1997   6             6.4          5.5
1998   6.6           6            7.6
1999   7             7.2          6.6
2000   6.2           5.2          6
2001   6.6           6.6          8.6
2002   6.2           6            6.2
2003   5.2           4.8          5.2
2004   6             6.6          5.4
2005   6.8           7            5.8