University “Politehnica” of Bucharest
Artificial Intelligence and Multi-Agent Systems Laboratory

Sparse Kernel Machines and HPC
PhD Thesis Proposal

PhD Student: Laurentiu Bucur, AI-MAS Laboratory, Department of Computer Science, University “Politehnica” of Bucharest
Supervisor: Prof. Dr. Eng. Adina Florea, AI-MAS Laboratory, Department of Computer Science, University “Politehnica” of Bucharest

1. Introduction

The emergence of Kernel Methods [1] since the mid-1990s has brought a revolution in the field of machine learning. Classification, time series prediction and feature extraction have gained a fresh perspective over the classic approaches used until then. The superior performance of Support Vector Machines (SVMs), Support Vector Machines for Regression (SVR) and Kernel PCA over their classic counterparts, feedforward neural networks and linear Principal Component Analysis, has established Kernel Methods as the state-of-the-art approach in statistical learning theory.

The Vapnik-Chervonenkis theory of complexity has changed the way we look at the stability-plasticity dilemma. Its results make a strong case for reducing model complexity. In the realm of Kernel Machines and regularization theory, these results have translated into simple and elegant methods for reducing the capacity of a pattern function, an approach known as Structural Risk Minimization (SRM).

SRM as applied to Kernel Machines leads to sparse representations of the predictive model, also known as sparse kernel machines for regression [8]. They can be characterized as kernel machines of low complexity and high predictive capability, obtained by applying SRM methods to kernel machines built from training data and tailored to the problem of regression. Their generalization capability is guaranteed by their reduced complexity, also known as function capacity or Vapnik-Chervonenkis (VC) dimension [9].
It can be postulated that the capacity of any pattern function is allocated in regions of the problem space where patterns of high concentration are found. Intuitively, this corresponds to low entropy sets in the case of classification, or to positive kurtosis in the case of time series prediction. Finding these sets is a form of data mining, known in machine learning as local modelling. Several algorithms have been developed for training such kernel machines, all of which require an understanding of the intricate mathematics behind the theory of Reproducing Kernel Hilbert Spaces (RKHS) and of the dedicated training methods derived from convex optimization. From the author's investigation, the results attained in kernel machine learning, especially in local modelling, have remained disconnected from the advances in High Performance Computing.

The research presented in this thesis aims at simplifying the problem of training sparse kernel machines for regression and makes a strong case for using High Performance Computing and Differential Evolution for detecting high concentration patterns in the problem space. This approach is further justified by the findings in [6], where it has been established that predictive accuracy can vary widely across the problem space.

A distributed Differential Evolution (DE) algorithm for training sparse kernel machines will be implemented and tested against a set of problem-specific benchmark results found in the literature. Applications of the algorithm will involve the prediction of chaotic time series and the detection of the lead-lag effect between financial instruments. Implementations of the algorithm on HPC middleware will be tested on a Condor pool using various numbers of execute nodes, and also on the UPB Sun Grid Engine.

2. An introduction to Sparse Kernel Machines for Regression
A kernel machine is a function of the form:

$$ f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + b \qquad (1) $$

where:
- $\alpha_i, b \in \mathbb{R}$
- $x \in X$, a compact set, $X \subset \mathbb{R}^n$
- N is the number of training data points
- $\{x_i\}_{i=1..N}$ is the training set
- K is a function which satisfies the kernel property: $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some high dimensional feature function $\phi$ [10].

The value of the kernel function is equal to the inner (dot) product of the images of two vectors $x_i$ and $x_j$ in the high dimensional feature space defined by $\phi$. Typically K is the Gaussian kernel:

$$ K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \qquad (I) $$

with $\sigma$ the kernel parameter.

When using large data sets, the runtime complexity of evaluating f(x) is prohibitive, since every evaluation requires N kernel computations. The solution to reducing the complexity of f is to select a smaller number of training points such that f is comprised of a set of M reduced kernel centers, called the reduced set (RS) [9]. This reduction must maximize the reduction ratio N/M while preserving prediction accuracy. Specifically, we look for a function

$$ g(x) = \sum_{i=1}^{M} \beta_i K(z_i, x) + c \qquad (2) $$

where $\beta_i, c \in \mathbb{R}$, $x \in X$, a compact set, $X \subset \mathbb{R}^n$, $M \ll N$ and $z_i \in X$, with $\{z_i\}_{i=1..M}$ the Reduced Set (RS) of g(x), such as to minimize:

$$ \|f - g\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) - 2 \sum_{i=1}^{N}\sum_{j=1}^{M} \alpha_i \beta_j K(x_i, z_j) + \sum_{i=1}^{M}\sum_{j=1}^{M} \beta_i \beta_j K(z_i, z_j) \qquad (3) $$

The training method developed in this thesis uses a competitive greedy heuristic to select the best M candidates from the training set in $O(N^2 / k(\sigma)^3)$ time, where k is a kernel hashing function. This is followed by a parallel Differential Evolution training step for determining $\beta_i$ and (optionally) $z_i$ using High Performance Computing, due to the possibly large number of training examples N in (3).
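The definitions (1), (2) and (3) can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the Gaussian kernel is written with a width parameter named `sigma`, the bias terms are ignored in the distance, and the reduced set is formed naively from the first M training points rather than by the selection heuristic proposed in the thesis. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix with entries exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_machine(x, centers, weights, bias, sigma):
    """Evaluate sum_i weights[i] * K(centers[i], x) + bias for each row of x."""
    return gaussian_kernel(x, centers, sigma) @ weights + bias

def rkhs_distance_sq(X, alpha, Z, beta, sigma):
    """Squared RKHS distance ||f - g||^2 between two kernel expansions (biases ignored)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kxz = gaussian_kernel(X, Z, sigma)
    Kzz = gaussian_kernel(Z, Z, sigma)
    return alpha @ Kxx @ alpha - 2.0 * alpha @ Kxz @ beta + beta @ Kzz @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # N = 200 training points
alpha = rng.normal(size=200)           # weights of the full machine f
Z, beta = X[:20], alpha[:20]           # naive reduced set with M = 20 (assumption)
x_new = rng.normal(size=(5, 3))

f_vals = kernel_machine(x_new, X, alpha, 0.0, sigma=1.0)   # 200 kernel terms per input
g_vals = kernel_machine(x_new, Z, beta, 0.0, sigma=1.0)    # 20 kernel terms per input
err = rkhs_distance_sq(X, alpha, Z, beta, sigma=1.0)       # quantity minimized in (3)
```

Evaluating g costs M kernel computations per input instead of N, which is the motivation for the reduction ratio N/M; the value `err` is what a better choice of weights and centers would drive down.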
The objective function used in the literature for training both f(x) with the entire training set and g(x) with the reduced kernel set for the problem of regression is the epsilon-insensitive loss function [10]:

$$ L_\varepsilon(y, f(x)) = \max\left(0, |y - f(x)| - \varepsilon\right) \qquad (4) $$

In training f and g, the problem is determining the weight vector $w = \{\alpha_i\}_{i=1..N}$ or $w = \{\beta_i\}_{i=1..M}$ respectively, such as to minimize the regularized loss function (6) [11], in which L is a loss function, typically $L_\varepsilon$, and the two weighting constants (subscripted 1 and 2) are positive and subject to the constraint (7), and

$$ \|f\|^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) \qquad (8) $$

A similar equation applies to g, the reduced version of f.

The applications of the sparse kernel machines studied in this thesis mainly involve evaluating the performance of a trading system in order to measure the predictability of chaotic time series. This requires a slightly modified loss function, which incorporates the simulated profit or loss on a per-transaction basis as well as possible transaction costs. The intuition of maximizing a process score reverses the task in (6): we look for a function f (or its reduced version g) that maximizes the Regularized Performance Function P (10), subject to (7), where the per-transaction score function is given by (11), y is the desired output and transactionCost is a positive real constant which emulates a broker commission for each simulated transaction.

The transactionCost term is central. If the experiments show some degree of predictability, either in chaotic time series forecasting or in the detection of the lead-lag effect, the average gain of the predictions must outweigh transaction costs. Otherwise the Efficient Market Hypothesis could still be disproved, but the inefficiencies could not be exploited beyond transaction costs if P < 0 for any sample set $S = \{x_1, \dots, x_L\} \subset X$.
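The role of the transaction cost can be made concrete with a small sketch. The epsilon-insensitive loss below follows its standard definition. The score function, however, is only one plausible reading of a per-transaction score: it assumes that a trade is taken in the predicted direction, that y is the realized price change, and that the cost is charged once per trade. The name `directional_score` and this exact form are assumptions, not the definition used in the thesis.

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    """Standard epsilon-insensitive loss: max(0, |y - y_pred| - eps), elementwise."""
    return np.maximum(0.0, np.abs(np.asarray(y) - np.asarray(y_pred)) - eps)

def directional_score(y, y_pred, transaction_cost):
    """Hypothetical per-transaction score (assumption): realized change in the
    predicted direction, minus a fixed cost; zero when no trade is taken."""
    y = np.asarray(y, dtype=float)
    direction = np.sign(np.asarray(y_pred, dtype=float))
    traded = direction != 0
    return np.where(traded, direction * y - transaction_cost, 0.0)

# Absolute errors 0.05, 0.5 and 1.0 with eps = 0.1
losses = eps_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.5, 0.0], eps=0.1)

# Three simulated trades: one correct direction, two wrong
scores = directional_score([0.02, -0.01, 0.03], [0.5, 0.2, -0.1], transaction_cost=0.001)
average_score = scores.mean()
```

Under this assumed score, a predictor is only useful when `average_score` stays positive after costs, which mirrors the requirement that the average gain of the predictions outweigh transaction costs.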
The performance function (10), subject to (7) and (11), will be used instead of (6) and (4) throughout the thesis to calculate the weights of the pattern functions f (eq. (1)) and g (eq. (2)). The performance function P will measure the quality of a directional time series predictor in a simple trading scenario, as well as the statistical edge provided by the existence of stable attractors or of the lead-lag effect, beyond transaction costs, should either exist for a given experiment. The thesis will also test the applicability of the results in an online learning scenario using the approach illustrated in [10].

3. Stability, concentration and regularization

Statistical learning theory uses two basic assumptions:
1) Each input data point $x \in X$ is generated by a constant source, which generates the data according to a probability distribution D [1].
2) Training and test samples are iid (independent and identically distributed) based on X and D.

Based on the assumption that the data source is constant, pattern analysis seeks to find regularities in the data which are not a coincidence of the finite sampling of data but are characteristics of the data source. This property is the statistical robustness [1] of a classification algorithm: if we re-run the algorithm on a pattern generated by the same source, it should identify it as the same pattern. Thus, the algorithm should depend only on the source of the data, and not on the particular training sample. It follows that for a local model such as the sparse kernel machine, kernel capacity is usually allocated in areas of the problem space where patterns have high concentration: “A random variable that is concentrated is very likely to assume values close to its expectation since values become exponentially unlikely away from the mean.” [1].
This is illustrated by the McDiarmid Theorem [1]:

Theorem (McDiarmid) [1]: Let $X_1, \dots, X_n$ be independent random variables taking values in a set A, and assume that $f: A^n \to \mathbb{R}$ satisfies

$$ \sup_{x_1, \dots, x_n, \hat{x}_i \in A} \left| f(x_1, \dots, x_n) - f(x_1, \dots, \hat{x}_i, x_{i+1}, \dots, x_n) \right| \le c_i, \quad 1 \le i \le n. $$

Then for all $\epsilon > 0$:

$$ P\left\{ f(X_1, \dots, X_n) - E f(X_1, \dots, X_n) \ge \epsilon \right\} \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right) \qquad (12) $$

The theorem bounds the probability that the value of a function f of the random vector $x = (X_1, \dots, X_n)$ exceeds its expected value $E f(x)$ by at least $\epsilon$. It assumes that the variation of f has anisotropic bounds $c_i$ around any possible value in $A^n$ of the random vector x.

There are two possible situations regarding the stability of the source generating the data:
1) If D is constant, then test data is drawn from the same distribution as the training data, and the performance of the kernel machine falls within the bounds established by the McDiarmid theorem (12) [1].
2) If D changes, then (12) no longer holds, the kernel machine does not provide the expected performance ensured by the McDiarmid theorem (12) and Vapnik-Chervonenkis theory, and adaptation is required.

If the distribution D changes, then the system must weigh the training samples in such a manner as to account for the most recent changes in data distribution. This can be achieved using a weighted cost function and a sliding window technique coupled with batch learning [10]. The author has studied and experimentally established the capability of a local fuzzy adaptive model to track the dynamics of the data source in a traffic prediction scenario [12] while maintaining accuracy by allocating the capacity of the kernel machine g (eq. (2)) in high concentration areas of the problem space $A^n$.

In both cases, the function g must satisfy (10) given a maximum number of kernels M. This requirement introduces the problem of optimization in (10), also known as structural risk minimization.
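To make the bound (12) concrete, the following sketch evaluates its right-hand side for a simple case. The example function, the sample mean of n variables taking values in [0, 1], is chosen here purely for illustration and does not come from the thesis; for it, changing one argument moves the mean by at most 1/n, so every c_i equals 1/n.

```python
import math

def mcdiarmid_bound(c, eps):
    """Right-hand side of (12): exp(-2 eps^2 / sum_i c_i^2)."""
    return math.exp(-2.0 * eps**2 / sum(ci**2 for ci in c))

n = 1000
# Sample mean of n variables in [0, 1]: c_i = 1/n, so the bound is exp(-2 n eps^2)
bound = mcdiarmid_bound([1.0 / n] * n, eps=0.05)
# bound equals exp(-5), roughly 0.0067
```

The smaller the constants c_i, the faster the bound decays with the deviation, which is the sense in which a function with small c_i is called concentrated.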
For the purpose of the applications in the thesis, we use g for binary classification. We define a concentrated pattern as a ball $B(x_0, r)$ in the problem space X such that the constants $c_i$ are small for the kernel machine g constrained to the ball $B(x_0, r)$. In other words, $B(x_0, r)$ is a low entropy set:
• Let $\Omega(X) = \{ (g(x) > 0), (g(x) < 0), (g(x) = 0) \}$ be the set of all possible outcomes for any $x \in X$.
• The entropy of the ball $B(x_0, r)$ for a sparse kernel machine g is:

$$ H(B(x_0, r), g) = -\sum_{e \in \Omega(B(x_0, r))} P(e) \log_2 P(e) \qquad (I) $$

• A low entropy ball $B(x_0, r)$ satisfies the McDiarmid theorem for the function g.
• For accurate binary classification, kernel machine capacity must be allocated in the centers of low entropy balls, with entropy calculated using a local concentration measure.

4. Contributions of the thesis: Training sparse kernel machines with Differential Evolution and High Performance Computing

In [1] it is shown how the training procedure of any kernel based algorithm is modular and involves two steps:
1) The choice of a kernel function and the calculation of the kernel matrix
2) The execution of a pattern analysis algorithm which calculates the weights of the function

The thesis focuses on the sparse kernel machine model. In this framework, step 1 involves finding the reduced kernel set $\{z_i\}_{i=1..M}$ in (2) in polynomial time. The idea behind the author's implementation is to find a measure of concentration for each data sample in the training set, then follow the intuition of the McDiarmid theorem and use the top M concentrated samples to allocate the kernel machine capacity $\{z_i\}_{i=1..M}$. The search procedure should be polynomial in the number of training samples N. The first contribution proposed in the thesis is the use of a 3D kernel hashing function which solves the search problem in $O(N^2 / k(\sigma)^3)$, where k is a reduction function which depends on the parameter $\sigma$, the kernel parameter of the Gaussian kernel.
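The idea of ranking training samples by a concentration measure and keeping the top M can be sketched as follows. This is a naive brute-force stand-in that computes all pairwise distances; it is not the kernel hashing function proposed in the thesis. It also assumes that the outcomes inside each ball are taken from the signs of the training targets `y` (the thesis defines them through the sign of g), and the tie-breaking rule by ball population is a further assumption. All names are hypothetical.

```python
import numpy as np

def outcome_entropy(values):
    """Shannon entropy (in bits) of the outcomes {> 0, < 0, = 0} found in `values`."""
    _, counts = np.unique(np.sign(values), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def select_low_entropy_centers(X, y, r, M):
    """Indices of the M samples whose ball B(x_i, r) has the lowest outcome entropy.
    Ties are broken in favour of balls that contain more samples (assumption)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
    ent = np.array([outcome_entropy(y[d[i] <= r]) for i in range(len(X))])
    pop = (d <= r).sum(axis=1)
    order = np.lexsort((-pop, ent))   # primary key: entropy, secondary: population
    return order[:M]

rng = np.random.default_rng(0)
# Cluster A: every target positive. Cluster B: targets alternate in sign.
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
y = np.concatenate([np.ones(30), np.tile([1.0, -1.0], 15)])
centers = select_low_entropy_centers(X, y, r=1.0, M=5)   # all drawn from cluster A
```

In this toy data set the balls around cluster A have zero entropy while those around cluster B have entropy close to one bit, so the selected centers all come from cluster A.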
Step 2 usually involves a quadratic optimization problem. In the literature, specific methods using the derivative of the loss function and Lagrange multipliers are used to train the kernel machine, but some authors see this approach as rather complicated and suggest evolutionary approaches to finding the weights of the kernel function [13]. The main difficulty is the possibly large number of training samples N (eq. (1)). Following the evolutionary training paradigm, the thesis advances the use of an established and powerful optimization algorithm, Differential Evolution [7], in a novel parallel implementation for training the sparse kernel machine (2) for the purpose of maximizing (10), using a Condor pool and a Sun GridEngine implementation in actual experimental settings.

Differential Evolution [7] is a population-based search strategy and an evolutionary algorithm that has proven to be a valuable method for optimizing real valued multi-modal objective functions. In brief, starting with an initial population of NPOP candidate solutions and the generation number G = 0, the following steps are executed until the best candidate solution no longer improves:

1) For each vector $X_{i,G}$ in the current generation G, a trial vector $X_{v,G}$ is generated:

$$ X_{v,G} = X_{a_1,G} + K \left( X_{a_2,G} - X_{a_3,G} \right) \qquad (13) $$

with $a_1, a_2, a_3$ mutually different integers in [0, ..., NPOP-1], and K > 0 called the cross-over constant.
2) In order to introduce variation in the population, mutation is performed between the initial candidate $X_i$ and its cross-over generated candidate $X_{v,G}$, for the entire population $\{X_{v,G}\}_{v=1..NPOP}$.
3) For each v = 1..NPOP: $X_{v,G} \leftarrow X_{u,G}$, where $X_{u,G}$ is the candidate produced in step 2
4) For each v = 1..NPOP: evaluate the fitness of $X_{v,G}$
5) For each i = 1..NPOP: if the fitness of $X_i$ in the initial population is less than that of its counterpart $X_{i,G}$ in the candidate population, replace $X_i$ with $X_{i,G}$
6) $G \leftarrow G + 1$

For the purpose of the applications in the thesis:
i) A candidate solution X consists of the kernel weights in (2) and optionally the reduced kernel set positions $\{z_i\}_{i=1..M}$:
a) If only the weights are calculated, then $X = \{\beta_1, \beta_2, \dots, \beta_M\}$ and the problem dimension is M.
b) If the kernel positions are also optimized, then $X = \{\beta_1, \beta_2, \dots, \beta_M, z_1, z_2, \dots, z_M\}$ and the problem is M*(p+1) dimensional, where p is the dimension of the kernel machine inputs (the dimension of X in (1)).
ii) The fitness function is (10) for a given training data set $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$.

The most time consuming operation in the Differential Evolution loop is step 4), the evaluation of the population candidates. This operation involves the calculation of (10) over a possibly large number of training examples. The evaluation of candidate solutions can be parallelized, thus speeding up the training of the sparse kernel machine (2).

The author suggests the parallel implementation of step 4 using a job submission mechanism. Given NPOP candidate solutions, the task of evaluating the candidates $X_{v,G}$, v = 1..NPOP, is divided into groups of at most K candidates, which are evaluated in parallel by a distributed system:

Figure 1: Parallel distribution of jobs for the evaluation of candidate solutions in the DE algorithm for training the sparse kernel machine

In step 4 of the algorithm, the task of the machine running the Differential Evolution procedure, called the Central Manager, thus becomes one of allocating jobs, scheduling execution and waiting for results.
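The loop in steps 1) to 6), together with the grouped evaluation of candidates, can be sketched as below. Several choices here are assumptions made for illustration: a local thread pool stands in for the Condor or Sun Grid Engine job submission, the variation step is implemented as the usual binomial mixing of the trial vector with the current candidate (rate `cr`), the three indices are also kept different from the current index i, and the loop runs for a fixed number of generations instead of stopping when the best candidate stagnates. The toy fitness function replaces (10).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def evaluate_chunk(fitness, chunk):
    """Stand-in for one submitted job: score a group of candidates."""
    return [fitness(x) for x in chunk]

def differential_evolution(fitness, dim, npop=20, K=0.8, cr=0.9,
                           chunk_size=5, generations=200, workers=4, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(npop, dim))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        def evaluate(cands):
            # Step 4: split candidates into groups of at most chunk_size, score in parallel
            chunks = [cands[i:i + chunk_size] for i in range(0, len(cands), chunk_size)]
            results = pool.map(lambda c: evaluate_chunk(fitness, c), chunks)
            return np.array([s for scores in results for s in scores])

        fit = evaluate(pop)
        for _ in range(generations):
            trials = np.empty_like(pop)
            for i in range(npop):
                others = [j for j in range(npop) if j != i]
                a1, a2, a3 = rng.choice(others, size=3, replace=False)
                donor = pop[a1] + K * (pop[a2] - pop[a3])      # eq. (13)
                mask = rng.random(dim) < cr
                mask[rng.integers(dim)] = True                  # keep at least one donor component
                trials[i] = np.where(mask, donor, pop[i])       # steps 2 and 3
            trial_fit = evaluate(trials)
            better = trial_fit > fit                            # step 5 (maximization)
            pop[better], fit[better] = trials[better], trial_fit[better]
    best = int(np.argmax(fit))
    return pop[best], float(fit[best])

target = np.array([0.5, -0.25, 0.1])

def toy_fitness(x):
    """Toy objective (assumption): negative squared distance to a known target."""
    return -float(np.sum((x - target) ** 2))

best_x, best_fit = differential_evolution(toy_fitness, dim=3)
```

Only the `evaluate` helper would change when moving from threads to a cluster: each chunk becomes one submitted job, and the Central Manager waits for all results before performing the replacement in step 5.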
The author has chosen two High Performance Computing environments for testing the parallel Differential Evolution algorithm:
- The Condor High Throughput Computing solution from the University of Wisconsin [16]
- The Sun GridEngine [17]

At the moment of writing this report, the author has already implemented the job scheduling mechanism for a Condor pool (figure 3). The full implementation will make it possible to optimize the reduced kernel set positions using Differential Evolution and to submit jobs to the Sun Grid Engine via libssh (figure 2).

Figure 2: Job submission from the Central Manager to the SGE via LibSSH

Figure 3: Condor pool configuration for training the sparse kernel machine

5. Applications in time series prediction and lead-lag effect detection with sparse kernel machines and HPC

In the applications pursued in the thesis, the author seeks to answer three questions:
1) Are there any high concentration patterns in daily financial data (i.e. in the state space of a system defined by deterministic chaos) that provide good runtime performance for 1 step ahead directional predictions?
2) If financial data cannot be predicted on a daily basis with a consistent statistical advantage, then to what degree can the moving average be predicted, after removing various levels of noise?
3) Is the sparse kernel machine model (2) better than a bivariate Granger Causality Model [15] at detecting and exploiting the possible lead-lag effect between financial time series?

The first application involves the prediction of directional change in daily financial data. An introduction to deterministic chaos will precede the actual experiment.
It will lay out basic concepts such as trajectory, map, orbit, attractor, periodic point, and the most important concept of all, the recurrent or non-wandering set. The idea behind the experiment is to probabilistically detect the existence of low entropy non-wandering points, even though the problem itself is ill-posed in the absolute sense, because only finite data is available. Several training experiments involving the daily exchange rate of several currency pairs will be conducted in the framework of deterministic chaos, using the reconstructed state space method suggested by the Takens theorem [14] to find the embedding dimension of the system [4].

The experiments will aim at measuring the training speed improvement over a single workstation, and at measuring the performance of the trained sparse kernel machines against other approaches. The performance of the kernel machines trained with the proposed algorithm will be compared against the methods used by Pavlidis and Tasoulis ([2] and the revision in [6]), Iokibe and Murata [3], and McNames [4] in predicting the directional change in chaotic time series.

A second set of experiments involving time series data will aim at confirming whether the symmetry hint introduced by Abu-Mostafa [5] increases the kernel machine performance for the same data sets. The symmetry hint doubles the volume of training samples and therefore increases the need for High Performance Computing. It thus serves both for measuring training speed with HPC and for assessing any improvement in performance.

An adaptive online version of the DE/HPC algorithm based on the techniques in [10] will be described and tested.
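The state space reconstruction and the symmetry hint can be sketched on a series of price changes. The sketch assumes a fixed embedding dimension `dim` and delay `delay` (the thesis plans to determine the embedding dimension experimentally), takes the sign of the next change as the directional target, and reads the symmetry hint as "a mirrored input pattern should produce the mirrored direction". The function names are hypothetical.

```python
import numpy as np

def delay_embed(returns, dim, delay=1):
    """Delay-coordinate states [r(t), r(t - delay), ..., r(t - (dim - 1) * delay)]
    paired with the direction of the next change, sign(r(t + 1))."""
    r = np.asarray(returns, dtype=float)
    start = (dim - 1) * delay
    t = np.arange(start, len(r) - 1)
    states = np.stack([r[t - k * delay] for k in range(dim)], axis=1)
    targets = np.sign(r[t + 1])
    return states, targets

def add_symmetry_hint(states, targets):
    """Append the mirrored examples (-state, -target), doubling the training set."""
    return np.vstack([states, -states]), np.concatenate([targets, -targets])

returns = [0.1, -0.2, 0.3, 0.1, -0.1]
states, targets = delay_embed(returns, dim=2)
# states rows: [-0.2, 0.1], [0.3, -0.2], [0.1, 0.3]; targets: +1, +1, -1
aug_states, aug_targets = add_symmetry_hint(states, targets)
```

The augmented set has twice as many rows as the original, which is the doubling of training volume that motivates the use of HPC in this set of experiments.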
The second application is an exploration of the problem of detecting the lead-lag effect between pairs of financial time series. The lead-lag effect is the phenomenon seen in financial markets where a high volume traded security responds to incoming fundamental information for the group it belongs to faster than a low volume traded security within the group, the former thus acting as a leading indicator for the directional change of the latter. This causal effect may be spurious or continuous. The thesis will investigate the lead-lag effect using two approaches:

1) Calculating the coefficients for a bivariate Granger Causality Model [15] class pattern function $f(x, y) \approx 0$, where $x \in X$ and $y \in Y$ are time delayed variations in the price of two securities within the same group, namely the lagging and the leading price series. After the calculation of the linear regression coefficients, a trading experiment will be conducted and performance will be measured in terms of simulated returns. Positive returns should indicate the existence of a lead-lag effect, while negative or near-zero results should indicate the absence of uni-directional causality.

2) Detecting high concentration (low entropy) sets in X for predicting y using sparse kernel machines. This will be done with the training algorithm developed in the thesis, using Differential Evolution and HPC to calculate the relative strength of each training pattern, i.e. the weights of the kernel machine. After training, the kernel machine will be used to predict the directional change of the lagging security in a simulated trading experiment.

The results obtained using the two approaches will be compared and conclusions will be drawn after the investigation of several currency pairs.
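The first approach can be illustrated with a least-squares fit of a lagged linear model. The sketch assumes the common textbook form of a bivariate Granger-type regression (a constant, p lags of the lagging series and p lags of the leading series); the exact model specification used in the thesis is not reproduced here, and the synthetic data and function name are hypothetical.

```python
import numpy as np

def fit_bivariate_lag_model(lagging, leading, p):
    """Ordinary least squares fit of lagging[t] on a constant, p lags of the
    lagging series and p lags of the leading series.
    Coefficient layout: [const, lagging lag 1..p, leading lag 1..p]."""
    y = np.asarray(lagging, dtype=float)
    x = np.asarray(leading, dtype=float)
    rows = [np.concatenate(([1.0], y[t - p:t][::-1], x[t - p:t][::-1]))
            for t in range(p, len(y))]
    A = np.array(rows)
    coef, *_ = np.linalg.lstsq(A, y[p:], rcond=None)
    return coef

# Synthetic example: the lagging series follows the leading one with a one-step delay
rng = np.random.default_rng(1)
lead = rng.normal(size=500)
lag = np.zeros(500)
lag[1:] = 0.8 * lead[:-1] + 0.1 * rng.normal(size=499)

coef = fit_bivariate_lag_model(lag, lead, p=2)
lead_lag1_coef = coef[3]   # weight of leading[t - 1], close to 0.8 for this data
```

A clearly non-zero coefficient on the lags of the leading series is what such a linear model can detect; the trading experiment described above then measures whether that dependence is large enough to produce positive simulated returns.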
Results should indicate the absence or existence of lead-lag causality between currency pairs, thus confirming or providing evidence against the efficient market hypothesis (EMH). The purpose of the second application is to test whether a sparse kernel machine model (2) can outperform a bivariate Granger Causality Model (15) in actual trading performance.

If the first set of experiments in the chaotic time series application yields negative results, this should not imply that markets are efficient and that there are no persistent nonlinearities in the price motion. The second experiment aims at finding other types of nonlinearities in the price motion model of various securities.

6. Research reports

Two research reports will be developed.

The Sparse Kernel Machine Model for Time Series Prediction. Advances using Differential Evolution and High Performance Computing

The first report will outline the theory behind kernel machines and the main advances Kernel Methods [1] have brought to statistical learning theory. The report will focus on the model used in the thesis and introduced in [8]: Sparse Kernel Machines for Regression. The second part of the report will present the Differential Evolution algorithm [7] as a powerful optimization algorithm for non-differentiable objective functions, which has lately attracted considerable attention in the scientific community. The DE1 and DE2 variants of the algorithm [7] will be detailed and explained in the context of regularization theory and the training of kernel machines for regression. The parallel algorithm advanced in the thesis will be detailed, and the job submission and execution mechanisms will be illustrated for a Condor pool implementation as well as for a Sun Grid Engine SSH-enabled connection.
Performance of the training algorithm will be measured on a stand-alone workstation, on a Condor pool and on the Sun Grid Engine, using financial data sets. The generalization capability of the resulting kernel machines will be illustrated in the second report, in comparison with state-of-the-art prediction methods.

Applications in Time Series Forecasting and lead-lag effect detection with Sparse Kernel Machines and HPC

The second report will illustrate the applicability of the proposed training method to several real-life applications. The report will begin with a description of a system characterized by deterministic chaos and of the methods used in chaotic series prediction, described in [2], [3] and [4]. The first part of the second report will detail the experimental results obtained using the Differential Evolution algorithm running on HPC middleware for chaotic time series prediction, as described in section 5 of this report. The second part of the second report will cover the notion of the lead-lag effect as seen in financial data, and will present the experimental results for the experiments described in section 5.

7. Conclusions

This report has given an overview of the thesis and an introduction to the Sparse Kernel Machine model for regression. Following the principles of stability, concentration and regularization, and the modular design of kernel machine training algorithms, the author advances the use of High Performance Computing and Differential Evolution in the same evolutionary paradigm studied by Stoean and Dumitrescu [13]. Two applications will be studied experimentally. The first will involve the training of sparse kernel machines for the directional prediction of financial time series, and the second will address the detection of the lead-lag effect between pairs of financial instruments.
While the former will focus on using a parallel algorithm for scanning the state space of a chaotic system and allocating kernel capacity in high concentration areas of the state space, the latter will illustrate the capability of kernel machines to detect uni-directional causality caused by the lead-lag effect. Both applications will involve experimental comparison with dedicated methods already used in the literature ([2], [3], [4], [5], [6] and [15]).

8. Acknowledgements

This is privately funded research.

9. Bibliography

[1] J. Shawe-Taylor, N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, ISBN 978-0-521-81397-6.
[2] N.G. Pavlidis, D.K. Tasoulis, M.N. Vrahatis. Financial Forecasting Through Unsupervised Clustering and Evolutionary Trained Neural Networks. In: Proceedings of the 2003 Congress on Evolutionary Computation, 2003, vol. 4, p. 2314-2321.
[3] T. Iokibe, S. Murata, M. Koyama. Prediction of Foreign Exchange by Fuzzy Local Reconstruction Method. In: IEEE International Conference on Systems, Man and Cybernetics, 1995, vol. 5, p. 4051-4054.
[4] J. McNames. A Nearest Trajectory Strategy for Time Series Prediction. In: Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. K. U. Leuven, Belgium, 1998.
[5] Y.S. Abu-Mostafa. Introduction to Financial Forecasting. In: Journal of Applied Intelligence, vol. 6, p. 205-213, 1996.
[6] N.G. Pavlidis, V.P. Plagianakos, D.K. Tasoulis, M.N. Vrahatis. Financial Forecasting Through Unsupervised Clustering and Neural Networks. In: Journal of Operational Research, Springer Berlin, vol. 6, no. 2, p. 103-127, May 2006.
[7] A. K. Palit, D. Popovic. Computational Intelligence in Time Series Forecasting. Springer, 2005, ISBN 978-1-85233-948-7.
[8] D. Lee, K.H. Yung, J. Lee. Constructing Sparse Kernel Machines Using Attractors. In: IEEE Transactions on Neural Networks.
Volume 20, Issue 4 (April 2009).
[9] A. Moore. VC dimension for characterizing classifiers. In: Lecture notes. [Online. Available at]: http://www.autonlab.org/tutorials/vcdim08.pdf. [Last accessed: February 10, 2010].
[10] L.J. Cao, F.E.H. Tay. Support Vector Machine with Adaptive Parameters in Financial Time Series Forecasting. In: IEEE Transactions on Neural Networks, vol. 14, no. 6, November 2003.
[11] L. Zhang, G. Dai, Y. Cao, G. Zhai, Z. Liu. A Learnable Kernel Machine for Short Term Load Forecasting. In: Power Systems Conference and Exposition, 2009. PSCE '09. IEEE/PES.
[12] L. Bucur, A. Florea, S. Petrescu. An Adaptive Fuzzy Neural Network for Traffic Prediction. Accepted in: 18th Mediterranean Conference on Control and Automation 2010 (MED 2010).
[13] R. Stoean, D. Dumitrescu, C. Stoean. Nonlinear Evolutionary Support Vector Machines. Application to Classification. In: Studia Univ. Babes-Bolyai, INFORMATICA, vol. LI, no. 1, 2006.
[14] F. Takens. Detecting strange attractors in turbulence. In: D. A. Rand and L. S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, p. 366-381. Springer-Verlag, 1981.
[15] A. Hossain. The Granger-Causality Between Money Growth, Inflation, Currency Devaluation and Economic Growth in Indonesia. In: International Journal of Applied Econometrics and Quantitative Studies, vol. 2-3, 2005.
[16] Condor Project Homepage. [Online. Available at]: http://www.cs.wisc.edu/condor/. [Last accessed: February 10, 2010].
[17] The Sun GridEngine Homepage. [Online. Available at]: http://gridengine.sunsource.net/. [Last accessed: February 10, 2010].