Artificial Intelligence and Multi-Agent Systems (AI-MAS) Laboratory
University “Politehnica” of Bucharest

Sparse Kernel Machines and HPC
PhD Thesis Proposal

PhD Student: Laurentiu Bucur, AI-MAS Laboratory, Department of Computer Science, University
“Politehnica” of Bucharest
Supervisor: Prof. Dr. Eng. Adina Florea, AI-MAS Laboratory, Department of Computer Science,
University “Politehnica” of Bucharest
1. Introduction
The emergence of Kernel Methods [1] since the mid-1990s has brought a revolution in the field
of machine learning. The problems of classification, time series prediction and feature extraction
have enjoyed a fresh perspective over the classic approaches seen until the mid-’90s. The superior
performance of Support Vector Machines (SVMs), Support Vector Machines for Regression (SVR) and
Kernel PCA over their classic counterparts, feedforward neural networks and linear Principal
Component Analysis, has established Kernel Methods as the state-of-the-art approach in statistical
learning theory. Vapnik and Chervonenkis’s theory of complexity has revolutionized the way we look
at the stability-plasticity dilemma. Their results make a strong case for reducing model
complexity, and in the realm of Kernel Machines and regularization theory these results have
translated into very simple and elegant methods for reducing the capacity of a pattern function, also
known as Structural Risk Minimization (SRM).
SRM as applied to Kernel Machines leads to sparse representations of the predictive model, also
known as sparse kernel machines for regression [8]. They can be characterized simply as kernel
machines of low complexity and high predictive capability and are the result of applying SRM
methods to kernel machines built from training data, tailored for the problem of regression. Their
generalization capability is guaranteed by their reduced complexity, also known as function capacity
or Vapnik-Chervonenkis (VC) dimension [9].
It can be postulated that the capacity of any pattern function is allocated in regions of the problem
space where patterns of high concentration are found. This can be intuitively translated into the notion
of low entropy sets for the case of classification, or into the notion of positive kurtosis for the case of
time series prediction. Finding these sets is a form of data mining also known as local modelling in
machine learning. Several algorithms have been developed for training such kernel machines, all at
the cost of understanding the intricate mathematics behind Reproducing Kernel Hilbert
Space (RKHS) theory and the dedicated training methods derived from convex optimization.
From the author’s investigation, the results attained in kernel machine learning, especially in local
modelling, have developed largely in isolation from the advances in High Performance Computing.
The research presented in this thesis aims at simplifying the problem of training sparse kernel
machines for regression and makes a strong case for using High Performance Computing and
Differential Evolution for detecting high concentration patterns in the problem space. This approach
is further justified by findings in [6], where it has been established that predictive accuracy can vary
widely in the problem space.
A distributed Differential Evolution (DE) algorithm for training sparse kernel machines will be
implemented and tested against a set of problem specific benchmark results found in the literature.
Applications of the algorithm will involve the prediction of chaotic time series and the detection of
the lead-lag effect between financial instruments.
Actual implementations of the algorithm on HPC middleware will be tested on a Condor pool using
various numbers of execute nodes and also on the UPB Sun Grid Engine.
2. An introduction to Sparse Kernel Machines for Regression.
A kernel machine is a function of the form:

f(x) = Σi=1..N αi K(x, xi)    (1)

where:
x ∈ X, a compact set, X ⊂ Rⁿ
N – the number of training data points
{xi}i=1..N – the training set
αi – the kernel weights
K is a function which satisfies the kernel property: K(xi, xj) = ⟨φ(xi), φ(xj)⟩ for some high-dimensional
feature function φ [10]. The value of the kernel function is equal to the inner (dot) product of the two
vectors xi and xj in the high-dimensional feature space defined by φ.

K(xi, xj) = exp( −||xi − xj||² / (2σ²) )    (I)

(I) defines the Gaussian kernel, with σ the kernel parameter.
When using large data sets, the runtime complexity of evaluating f(x) is prohibitive. The solution to
reducing the complexity of f is to select a smaller number of training points, such that f is approximated
by a function built on a set of M reduced kernel centers, called the reduced set (RS) [9]. This reduction
must be done so as to maximize the reduction ratio N/M while preserving prediction accuracy.
Specifically, we look for a function

g(x) = Σi=1..M βi K(x, zi)    (2)

where:
x ∈ X, a compact set, X ⊂ Rⁿ, M << N and zi ∈ X,
{zi}i=1..M is the Reduced Set (RS) of g(x)

so as to minimize:

||f − g||² = Σi,j αi αj K(xi, xj) − 2 Σi,j αi βj K(xi, zj) + Σi,j βi βj K(zi, zj)    (3)
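The RKHS distance (3) can be computed entirely from kernel evaluations, without ever forming the feature vectors. A minimal sketch, assuming the Gaussian kernel (I) and illustrative function names:

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    # pairwise Gaussian kernel matrix K[i, j] = K(a_i, b_j)
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rkhs_dist_sq(X, alpha, Z, beta, sigma):
    # ||f - g||^2 expanded with the kernel trick, eq. (3)
    return (alpha @ gaussian_gram(X, X, sigma) @ alpha
            - 2.0 * alpha @ gaussian_gram(X, Z, sigma) @ beta
            + beta @ gaussian_gram(Z, Z, sigma) @ beta)
```

When g uses the same centers and weights as f, the distance is zero; any genuine reduction trades a small positive ||f − g||² for the N/M speedup.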
The training method developed in this thesis uses a competitive greedy heuristic approach to selecting
the best M candidates from the training set in O(N²/k(σ)³) time, where k is a kernel hashing function,
followed by a parallel Differential Evolution training step for determining βi and (optionally) zi using
High Performance Computing, due to the possibly large number of training examples N in (3).
The objective function used in the literature for training both f(x) with the entire training set and g(x)
with the reduced kernel set for the problem of regression is the epsilon-insensitive loss function:

Lε(y, f(x)) = max( 0, |y − f(x)| − ε )    (4)

In training f and g, the problem is determining the weight vector w = {αi}i=1..N or w = {βi}i=1..M,
respectively, so as to minimize the regularized loss function [11]:

λ1 Σi L(yi, f(xi)) + λ2 ||f||²    (6)

where L is a loss function, typically L = Lε, and λ1 and λ2 are positive constants subject to the constraint:

λ1 + λ2 = 1    (7)

and ||f||² = Σi,j αi αj K(xi, xj)

A similar equation applies to g, the reduced version of f.
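The epsilon-insensitive loss (4) is a one-liner; it charges nothing for predictions within ε of the target and grows linearly beyond that tube:

```python
def eps_insensitive_loss(y, fx, eps):
    # L_eps(y, f(x)) = max(0, |y - f(x)| - eps), eq. (4)
    return max(0.0, abs(y - fx) - eps)
```

The flat region inside the tube is what produces sparsity in the SVR solution: training points whose residual stays inside the tube receive zero weight.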
The applications of the sparse kernel machines studied in this thesis mainly involve the evaluation of
a trading system’s performance to measure the predictability of chaotic time series. This requires a
slightly modified loss function, which incorporates the simulated profit or loss on a per-transaction
basis and also possible transaction costs. The intuition of maximizing a performance score reverses the
task in (6): we seek a function f (or its reduced version g) so as to maximize the Regularized
Performance Function P:

P = λ1 Σi ρ(yi, f(xi)) − λ2 ||f||²    (10)

subject to (7), where ρ is a per-transaction score function:

ρ(y, f(x)) = y · sign(f(x)) − transactionCost    (11)

where y is the desired output and transactionCost is a positive real constant which emulates a broker
commission for each simulated transaction. The importance of the transactionCost cannot be
overemphasized. If the experiments show some degree of predictability, either in chaotic time series
forecasting or in the detection of the lead-lag effect, the average profit of the predictions must outweigh
transaction costs; otherwise the Efficient Market Hypothesis could still be disproved, but the market’s
inefficiencies could not be exploited beyond transaction costs if P < 0 for any sample set
S = {x1,…, xL} ⊂ X.
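A sketch of the per-transaction score and the aggregate (unregularized) performance follows. The exact form of the score is an assumption here: it reads the score as the realized move times the predicted direction, minus the commission, which matches the surrounding text but is not necessarily the thesis's verbatim eq. (11):

```python
def transaction_score(y, fx, transaction_cost):
    # assumed per-transaction score: realized move y times the predicted
    # direction sign(f(x)), minus a fixed broker commission
    direction = 1.0 if fx > 0 else (-1.0 if fx < 0 else 0.0)
    return y * direction - transaction_cost

def performance(ys, fxs, transaction_cost):
    # aggregate score over a sample set, ignoring the regularization term in (10)
    return sum(transaction_score(y, fx, transaction_cost)
               for y, fx in zip(ys, fxs))
```

With transaction_cost > 0, a predictor that is directionally right only on tiny moves still scores negatively, which is exactly the point made above.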
The performance function (10) subject to (7) and (11) will be used instead of (6) and (4) throughout
the thesis to calculate the weights of the pattern functions f (eq. (1)) and g (eq. (2)). The performance
function P will measure the quality of a directional time series predictor in a simple trading scenario as
well as to the statistic edge provided by the existence of stable attractors or the lead-lag effect, beyond
transaction costs, should either exist for a given experiment. The thesis will also test the applicability
of the results in an online learning scenario using the approach illustrated in [10].
3. Stability, concentration and regularization.
Statistical learning theory uses two basic assumptions:
1) Each input data point x ∈ X is generated by a constant source which generates the data according
to a probability distribution D [1].
2) Training and test samples are iid – independent and identically distributed based on X and D.
Based on the assumption that the data source is constant, pattern analysis seeks to find regularities
in the data which are not a coincidence of the finite sampling of data but are characteristics of the
data source. This property is the statistical robustness [1] characteristic of a classification algorithm,
in the sense that if we re-run the algorithm on a pattern generated by the same source, it should
identify it as the same pattern. Thus, the algorithm should only depend on the source of the data,
and not on the particular training data.
It follows that for a local model such as the sparse kernel machine, kernel capacity is usually
allocated in areas of the problem space where patterns have high concentration: “A random variable
that is concentrated is very likely to assume values close to its expectation since values become
exponentially unlikely away from the mean.” [1]. This is illustrated by the McDiarmid Theorem [1]:
Theorem (McDiarmid) [1]: Let X1,…, Xn be independent random variables taking values in a set A,
and assume that f: Aⁿ → R satisfies

sup over x1,…, xn, x̂i ∈ A of | f(x1,…, xn) − f(x1,…, xi−1, x̂i, xi+1,…, xn) | ≤ ci,  1 ≤ i ≤ n.

Then for all ε > 0:

P{ | f(X1,…, Xn) − E f(X1,…, Xn) | ≥ ε } ≤ 2 exp( −2ε² / Σi=1..n ci² )    (12)

The theorem gives a probabilistic bound on the possibility that the value of a function f of the
random variable x = (X1,…, Xn) falls away from its expected mean E f(x) by a distance ε. The theorem
assumes the variation of f to have anisotropic bounds ci around any possible value of the random
variable x in Aⁿ.
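The bound (12) can be checked numerically for the simplest case, the sample mean of bounded variables, where replacing one coordinate moves f by at most ci = 1/n. This is an illustrative sanity check, not part of the thesis's experiments:

```python
import math
import random

def mcdiarmid_bound(eps, c):
    # P(|f(X) - E f(X)| >= eps) <= 2 exp(-2 eps^2 / sum_i c_i^2), eq. (12)
    return 2.0 * math.exp(-2.0 * eps ** 2 / sum(ci ** 2 for ci in c))

# sample mean of n uniforms in [0, 1]: c_i = 1/n, so sum c_i^2 = 1/n
n, trials, eps = 200, 2000, 0.1
rng = random.Random(0)
exceed = sum(
    abs(sum(rng.random() for _ in range(n)) / n - 0.5) >= eps
    for _ in range(trials)
) / trials
# the empirical exceedance frequency stays below the McDiarmid bound
```

For n = 200 and ε = 0.1 the bound is 2·exp(−4) ≈ 0.037, while the empirical frequency is essentially zero: concentration is typically much stronger than the worst-case bound.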
There are two possible situations regarding the stability of the source generating the data:
1) If D is constant then test data is drawn from the same distribution as the training data, and
the performance of the kernel machine falls within the bounds established by the McDiarmid
theorem (12) [1].
2) If D changes then (12) no longer holds and the kernel machine does not provide the expected
performance ensured by the McDiarmid theorem (12) and Vapnik-Chervonenkis theory, and
adaptation is required. If the distribution D changes, then the system must weigh the
training samples in such a manner as to account for the most recent changes in data
distribution. This can be achieved using a weighted cost function and a sliding window
technique coupled with batch learning [10]. The author has studied and experimentally
established the capability of a local fuzzy adaptive model to track the dynamics of the data
source in a traffic prediction scenario [12] while maintaining accuracy by allocating the kernel
machine g (eq.(2)) capacity in high concentration areas of the problem space An.
In both cases, the function g must satisfy (10) given a maximum number of kernels M. This
requirement introduces the problem of optimization in (10), also known as structural risk
minimization. For the purpose of the applications in the thesis, we use g for binary classification.
We define a concentrated pattern as a ball B(x0, r) in the problem space X such that the bounds ci are
small for the kernel machine g constrained to the ball B(x0, r). In other words, B(x0, r) is a low
entropy set:
Let (X) = { (g(x)>0) , (g(x)<0) , (g(x)=0) } – the set of all possible outcomes for any x  X
The entropy of the ball B(x0,r) for a sparse kernel machine g:
H ( B( x0, r ), g )  e ( B ( x 0,r )) P(e) log 2 P(e)
 A low entropy ball B(x0,r) satisfies McDiarmid theorem for the function g
 For accurate binary classification, kernel machine capacity must be allocated in the centers of low
entropy balls and entropy calculated using a local concentration measure
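The ball entropy above can be estimated from the training samples that fall inside the ball. A minimal one-dimensional sketch (the function name and the sample-based estimate are illustrative):

```python
import math

def ball_entropy(samples, g, x0, r):
    # H(B(x0, r), g) = -sum_e P(e) log2 P(e) over the outcomes g>0, g<0, g=0,
    # with P(e) estimated from the samples falling inside the ball
    inside = [x for x in samples if abs(x - x0) <= r]   # 1-D ball for brevity
    if not inside:
        return 0.0
    counts = {}
    for x in inside:
        outcome = (g(x) > 0) - (g(x) < 0)               # +1, -1 or 0
        counts[outcome] = counts.get(outcome, 0) + 1
    n = len(inside)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A ball where g keeps one sign has entropy 0; a ball split evenly between the two classes has entropy 1 bit, the worst case for a binary outcome.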
4. Contributions of the thesis: Training sparse kernel machines with Differential Evolution and High
Performance Computing
In [1] it is shown how the training procedure of any kernel based algorithm is modular and involves
two steps:
1) The choice of a kernel function and the calculation of the kernel matrix
2) The execution of a pattern analysis algorithm which calculates the weights of the function
The thesis focuses on the sparse kernel machine model. In this framework, step 1 involves finding the
reduced kernel set {zi}i=1,M in (2) in polynomial time. The idea behind the author’s implementation is to
find a measure of concentration for each data sample in the training set, then follow the intuition of the
McDiarmid theorem and use the top M concentrated samples to allocate the kernel machine capacity
{zi}i=1,M. The search procedure should be polynomial in the number of training samples N. The first
contribution proposed in the thesis is the use of a 3D kernel hashing function which solves the search
problem in O(N²/k(σ)³), where k(σ) is a reduction function which depends on the parameter σ, the
kernel parameter in (I).
Step 2 usually involves a quadratic optimization problem. In the literature, specific methods using the
derivative of the loss function and Lagrange multipliers are used to train the kernel machine, but some
authors consider this approach rather complicated and suggest evolutionary approaches to finding the
weights of the kernel function [13]. The main difficulty is the possibly large number of
training samples N (eq. (1)).
Following the evolutionary training paradigm, the thesis advances the use of an already
established and powerful optimization algorithm known as Differential Evolution [7] in a novel
parallel implementation for training the sparse kernel machine (2) for the purpose of
maximizing (10), using a Condor pool and a Sun Grid Engine implementation in actual
experimental settings.
Differential Evolution [7] is a population-based search strategy and an evolutionary
algorithm that has recently proven to be a valuable method for optimizing real-valued
multi-modal objective functions.
In brief, starting with an initial population of NPOP candidate solutions and the generation number
G = 0, the following steps are executed until the best candidate solution no longer improves:
1) For each vector Xi,G in the current generation G, a trial vector Xv,G is generated:

Xv,G = Xa1,G + K · (Xa2,G − Xa3,G)

with a1, a2, a3 integers in [0,…, NPOP − 1], different among each other, and K > 0 called the
cross-over constant.
2) In order to introduce variation in the population, cross-over is performed between the initial
candidate Xi,G and its trial vector Xv,G, yielding the candidate Xu,G, for the entire population.
3) For each v = 1..NPOP: Xv,G ← Xu,G
4) For each v = 1..NPOP: evaluate the fitness of Xv,G
5) For each i = 1..NPOP: if the fitness of Xi in the initial population is less than that of its
counterpart Xi,G in the candidate population, replace Xi with Xi,G.
6) G ← G + 1
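The loop above can be condensed into a short serial sketch (a standard DE/rand/1/bin variant, maximizing; the parameter names and defaults are illustrative, not the thesis's settings):

```python
import random

def differential_evolution(fitness, dim, npop=20, K=0.8, CR=0.9,
                           bounds=(-5.0, 5.0), generations=100, seed=0):
    # minimal serial sketch of steps 1)-6); the thesis parallelizes step 4
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(npop)]
    fit = [fitness(x) for x in pop]
    for _ in range(generations):
        for i in range(npop):
            # step 1: trial vector X_v = X_a1 + K (X_a2 - X_a3)
            a1, a2, a3 = rng.sample([j for j in range(npop) if j != i], 3)
            trial = [pop[a1][d] + K * (pop[a2][d] - pop[a3][d])
                     for d in range(dim)]
            # step 2: cross-over between X_i and the trial vector
            jrand = rng.randrange(dim)
            u = [trial[d] if (rng.random() < CR or d == jrand) else pop[i][d]
                 for d in range(dim)]
            # steps 4-5: evaluate and greedily select the better candidate
            fu = fitness(u)
            if fu > fit[i]:
                pop[i], fit[i] = u, fu
    best = max(range(npop), key=fit.__getitem__)
    return pop[best], fit[best]
```

On a smooth 1-D objective this converges in a few dozen generations; the cost per generation is NPOP fitness evaluations, which motivates the parallel step 4 below.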
For the purpose of the applications in the thesis:
A candidate solution X consists of the kernel weights in (2) and optionally the
reduced kernel set positions {zi}i=1..M:
a) If only the weights are calculated, then X = {β1, β2,… βM} and the problem
dimension is M.
b) If the kernel positions are also optimized, then
X = {β1, β2,… βM, z1, z2,… zM} and the problem is M·(p+1) dimensional, where p in
(1) is the dimension of the kernel machine inputs.
The fitness function is (10) for a given training data set S = { (x1,y1),… (xN,yN) }.
The most time consuming operation in the Differential Evolution loop is step 4), the evaluation of the
population candidates. This operation involves the calculation of (10) over a possibly
large number of training examples. The evaluation of candidate solutions is a step which can be
parallelized, thus speeding up the training of the sparse kernel machine (2).
The author suggests a parallel implementation of step 4 using a job submission mechanism.
Given NPOP candidate solutions, the task of evaluating the candidates Xv,G, v = 1..NPOP, is
divided into groups of at most K candidates, which are evaluated in parallel by a distributed system:
Figure 1: Parallel distribution of jobs for the evaluation of candidate solutions in the DE algorithm
for training the sparse kernel machine
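The Central Manager's dispatch logic can be sketched as follows; a thread pool stands in here for the Condor/SGE execute nodes, and the chunking into jobs of at most K candidates mirrors the scheme in Figure 1 (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(candidates, k):
    # split the NPOP candidates into jobs of at most k candidates each
    return [candidates[i:i + k] for i in range(0, len(candidates), k)]

def evaluate_population(candidates, fitness, k=4, workers=4):
    # Central Manager sketch: dispatch each job to a worker and collect
    # the fitness values back in submission order
    jobs = chunk(candidates, k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_job = pool.map(lambda job: [fitness(x) for x in job], jobs)
    return [f for group in per_job for f in group]
```

Because step 5 only compares each candidate against its own predecessor, the evaluations are embarrassingly parallel and the speedup is bounded mainly by job submission latency.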
In step 4 of the algorithm, the task of the machine running the Differential Evolution procedure,
called the Central Manager, thus becomes one of allocating jobs, scheduling execution and waiting
for results.
The author has chosen two High Performance Computing environments for testing the parallel
Differential Evolution algorithm:
• The Condor High Throughput Computing solution from the University of Wisconsin [16]
• The Sun GridEngine [17]
At the moment of writing this report, the author has already implemented the job scheduling
mechanism for a Condor pool (figure 3). The full implementation will have the possibility of
optimizing the reduced kernel set positions using Differential Evolution and the possibility of
submitting jobs on the Sun Grid Engine via libssh (figure 2).
Figure 2: Job submission from the Central Manager to the SGE via LibSSH
Figure 3: Condor pool configuration for training the sparse kernel machine
5. Applications in time series prediction and lead-lag effect detection with sparse kernel machines
and HPC.
In the applications pursued in the thesis, the author seeks to answer three questions:
1) Are there any high concentration patterns found in daily financial data (i.e. in the state space
of a system governed by deterministic chaos) that provide good runtime performance for one-step-ahead
directional predictions?
2) If financial data cannot be predicted on a daily basis with a consistent statistical advantage, then
to what degree can its moving average be predicted, after removing various levels of noise?
3) Is the sparse kernel machine model (2) better at detecting and exploiting the possible lead-lag
effect between financial time series?
The first application involves the prediction of directional change in daily financial data. An
introduction to deterministic chaos will precede the actual experiment. It will lay out basic concepts
such as trajectory, map, orbit, attractor, periodic point, and the most important concept of all, the
recurrent or non-wandering set. The idea behind the experiment is to probabilistically detect the
existence of low entropy non-wandering points, even though the problem in itself is ill-posed in the
absolute sense, due to the availability of only finite data.
Several training experiments involving the daily exchange rate of several currency pairs will be
conducted in the framework of deterministic chaos, using the reconstructed state space method
suggested by the Takens theorem [14] to find the embedding dimension of the system [4]. The
experiments will aim at measuring the training speed improvement over using a single workstation
and also measure the performance of the trained sparse kernel machines versus other approaches.
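The reconstructed state space method referenced above amounts to delay-coordinate embedding: each scalar observation is replaced by a vector of lagged values. A minimal sketch (the function name and signature are illustrative):

```python
def delay_embed(series, dim, tau):
    # reconstructed state vectors x_t = (s_t, s_{t-tau}, ..., s_{t-(dim-1)tau}),
    # following Takens' delay-coordinate embedding
    start = (dim - 1) * tau
    return [[series[t - j * tau] for j in range(dim)]
            for t in range(start, len(series))]
```

The embedding dimension dim and delay tau are the quantities the experiments must estimate from the data before any prediction is attempted.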
The performance of the kernel machines trained with the proposed algorithm will be compared
against the methods used by Pavlidis and Tasoulis ([2] and the revision in [6]), by Iokibe and Murata [3],
and by McNames [4] in predicting the directional change in chaotic time series. A second set of
experiments involving time series data will aim at confirming whether the symmetry hint
introduced by Abu-Mostafa [5] increases the kernel machine performance for the
same data sets. The symmetry hint doubles the volume of training samples, thus increasing the
need for High Performance Computing; its introduction therefore serves both to measure
training speed with HPC and to assess any improvement in
performance. An adaptive online version of the DE/HPC algorithm based on the techniques in [10]
will be described and tested.
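The symmetry hint itself is a simple data augmentation: if a pattern x predicts a move y, the mirrored pattern −x should predict −y. A sketch of the doubling step (illustrative names):

```python
def add_symmetry_hint(X, y):
    # append the mirrored pair (-x, -y) for every training pair (x, y),
    # doubling the training set as per the symmetry hint
    X_aug = X + [[-v for v in x] for x in X]
    y_aug = y + [-t for t in y]
    return X_aug, y_aug
```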
The second application is an exploration of the problem of detecting the lead-lag effect between
pairs of financial time series.
The lead-lag effect is the phenomenon seen in financial markets where a heavily traded security
responds to incoming fundamental information about the group it belongs to faster than a thinly
traded security within the group, the former thus acting as a leading indicator for directional change of
the latter.
This causal effect may be spurious or continuous. The thesis will investigate the lead-lag effect using
two approaches:
1) Calculating the coefficients of a bivariate Granger Causality Model [15] class pattern
function f(x, y), where x ∈ X and y ∈ Y are time-delayed variations in the price of two
securities within the same group, namely the lagging and the leading price series. After the
calculation of the linear regression coefficients, a trading experiment will be conducted and
performance will be measured in terms of simulated returns. Positive returns should indicate
the existence of a lead-lag effect, while negative or near-zero results should indicate the
absence of uni-directional causality.
2) Detecting high concentration (low entropy) sets in X for predicting y using sparse kernel
machines. This will be done using the training algorithm developed in the thesis using
Differential Evolution and HPC in calculating the relative strength of each training pattern, i.e.
the weights of the kernel machine. After training, the kernel machine will be used in
predicting the directional change of the lagging security in a simulated trading experiment.
The results obtained using the two approaches will be compared and conclusions will be
drawn after the investigation of several currency pairs. Results should indicate the absence
or existence of the lead-lag effect causality between currency pairs, thus confirming or
providing evidence against the efficient market hypothesis (EMH).
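The bivariate lagged regression of approach 1) can be sketched as follows; the helper names, the lag structure and the least-squares fit are illustrative assumptions, not the exact specification of the model in [15]:

```python
import numpy as np

def lagged_design(x, y, p):
    # regress y_t on an intercept, p lags of y and p lags of x
    rows, targets = [], []
    for t in range(p, len(y)):
        rows.append([1.0]
                    + [y[t - j] for j in range(1, p + 1)]
                    + [x[t - j] for j in range(1, p + 1)])
        targets.append(y[t])
    return np.array(rows), np.array(targets)

def granger_coeffs(x, y, p):
    # least-squares fit; large coefficients on the x lags suggest that
    # x Granger-causes (leads) y
    A, b = lagged_design(x, y, p)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs   # [const, y-lag 1..p, x-lag 1..p]
```

On synthetic data where y is driven purely by the previous value of x, the fit recovers the x-lag coefficient and leaves the y-lag coefficient near zero, which is the signature of uni-directional causality.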
The purpose of the second application is to test whether a sparse kernel machine model (2) can
outperform a bivariate Granger Causality Model [15] in actual trading performance.
In case the first set of experiments in the chaotic time series application yields negative results, this
should not imply that markets are efficient and that there are no persistent nonlinearities in the price
motion. The second experiment aims at finding other types of nonlinearities existent in the price
motion model of various securities.
6. Research reports
Two research reports will be developed.
The Sparse Kernel Machine Model for Time Series Prediction. Advances using Differential Evolution
and High Performance Computing
The first report will outline the theory behind kernel machines and the main advances Kernel
Methods [1] have brought to statistical learning theory. The report will focus on the model used in
the thesis and introduced in [8] – Sparse Kernel Machines for Regression. The second part of the
report will illustrate the Differential Evolution Algorithm [7] as a powerful optimization algorithm for
non-differentiable objective functions, which has lately attracted a lot of attention in the scientific
community. The DE1 and DE2 variants of the algorithm [7] will be detailed and explained in the
context of regularization theory and training of kernel machines for regression. The parallel
algorithm advanced in the thesis will be detailed, and the job submission and execution
mechanisms will be illustrated for a Condor Pool implementation as well as for a Sun Grid Engine
SSH-enabled connection. Performance of the training algorithm will be measured on a stand-alone
workstation, on a Condor pool and on the Sun Grid Engine, using financial data sets. The
generalization capability of the resulting kernel machines will be illustrated in the second report,
when compared to state of the art prediction methods.
Applications in Time Series Forecasting and lead-lag effect detection with Sparse Kernel Machines
and HPC
The second report will illustrate the applicability of the proposed training method to several real-life
applications. The report will begin with a description of a system characterized by deterministic
chaos and the methods used in chaotic series prediction, described in [2], [3] and [4].
The first part of the second report will detail the experimental results obtained using the Differential
Evolution algorithm running on HPC middleware for chaotic time series prediction, as described in
section 5 of this report. The second part of the second report will cover the notion of the lead-lag
effect as seen in financial data, and will illustrate the experimental results as shown in section 5.
7. Conclusions
This report has given an overview of the thesis. It has introduced the Sparse Kernel
Machine model for regression. Following the principles of stability, concentration and regularization,
and the modular design of kernel machine training algorithms, the author advances the use of High
Performance Computing and Differential Evolution in the same evolutionary paradigm studied by
Stoean and Dumitrescu [13]. Two applications will be involved in actual experiments. The first
application will involve the training of sparse kernel machines for the directional prediction of
financial time series, and the second application will be involved with the detection of the lead-lag
effect between pairs of financial instruments. While the former will focus on using a parallel
algorithm for scanning the state space of a chaotic system and allocating kernel capacity in high
concentration areas of the state space, the latter will illustrate the capability of kernel machines to
detect uni-directional causality caused by the lead-lag effect. Both applications will involve
experimental comparison to dedicated methods already used in the literature ([2], [3], [4], [5], [6]).
8. Acknowledgements
This is privately funded research.
9. Bibliography
[1] J. Shawe-Taylor, N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press,
2004. ISBN 978-0-521-81397-6.
[2] N.G. Pavlidis, D.K. Tasoulis, M.N. Vrahatis. Financial Forecasting Through Unsupervised Clustering
and Evolutionary Trained Neural Networks. In: Proceedings of the 2003 Congress on Evolutionary
Computation, 2003, p. 2314-2321, Vol.4
[3] T. Iokibe, S. Murata, M. Koyama. Prediction of Foreign Exchange by Fuzzy Local Reconstruction
Method. In: IEEE International Conference on Systems, Man and Cybernetics, 1995, vol. 5, p. 4051-4054
[4] J. McNames. A Nearest Trajectory Strategy for Time Series Prediction. In: Proceedings of the
International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. K. U. Leuven
Belgium, 1998.
[5] Y.S. Abu-Mostafa. Introduction to Financial Forecasting. In: Journal of Applied Intelligence, vol. 6,
p. 205-213, 1996
[6] N.G. Pavlidis, V.P. Plagianakos, D.K. Tasoulis, M.N. Vrahatis. Financial Forecasting Through
Unsupervised Clustering and Neural Networks. In: Journal of Operational Research, Springer Berlin,
p.103-127, Volume 6, number 2, May 2006
[7] A.K. Palit, D. Popovic. Computational Intelligence in Time Series Forecasting. ISBN: 978-1-85233-948-7, Springer, 2005
[8] D. Lee, K.H. Jung, J. Lee. Constructing Sparse Kernel Machines Using Attractors. In: IEEE
Transactions on Neural Networks. Volume 20 , Issue 4 (April 2009)
[9] A. Moore. VC dimension for characterizing classifiers. In: Lecture notes. [Online. Available at]: [Last accessed: February 10, 2010].
[10] L.J. Cao, F.E.H Tay. Support Vector Machine with Adaptive Parameters in Financial Time Series
Forecasting. In: IEEE Transactions on Neural Networks, Vol. 14, No. 6, November 2003
[11] L. Zhang, G. Dai, Y. Cao, G. Zhai, Z. Liu. A Learnable Kernel Machine for Short Term Load
Forecasting. In: Power Systems Conference and Exposition, 2009. PSCE '09. IEEE/PES
[12] L. Bucur, A. Florea, S. Petrescu. An Adaptive Fuzzy Neural Network for Traffic Prediction.
Accepted in: 18th Mediterranean Conference on Control and Automation 2010 (MED 2010).
[13] R. Stoean, D. Dumitrescu, C. Stoean. Nonlinear Evolutionary Support Vector Machines.
Application to Classification. In: Studia Univ. Babes-Bolyai, INFORMATICA, Volume LI, number 1,
[14] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L. S. Young, editors,
Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381.
Springer-Verlag, 1981.
[15] A. Hossain. The Granger-Causality Between Money Growth, Inflation, Currency Devaluation and
Economic Growth in Indonesia. In: International Journal of Applied Econometrics and Quantitative
Studies, vol. 2-3, 2005.
[16] Condor Project Homepage. [Online. Available at]: [Last
accessed: February 10, 2010].
[17] The Sun GridEngine Homepage. [Online. Available at]: [Last
accessed: February 10, 2010].