
Input Selection based on Gamma Test and Mutual Information
Nima Reyhani, Jin Hao, Y.N. Ji, Amaury Lendasse,
Helsinki University of Technology, Katholic de
Abstract
Introduction
Input selection, or regressor selection, is one of the most important issues in learning tasks, and especially in time series modelling and prediction. For example, redundant regressors increase the size of the input and lead to slower convergence and lower accuracy, while the lack of informative elements in the regression vector leads to poor prediction accuracy, i.e. a higher misclassification rate in classification or a higher mean square error in regression.
Mathematically speaking, a finite set of regressors or inputs is usually sufficient to extract an accurate model from an infinite set of observations. In practice, however, no dataset contains an infinite number of observations, and the necessary number of regressors increases dramatically with the number of inputs and outputs (the curse of dimensionality). We should therefore select the best regressors, in the sense that they contain the information needed to capture and reconstruct the underlying system dynamics, i.e. the relationship between input and output data.
With respect to this, several approaches have been proposed for both the continuous problem (finding the best value of the lag) and the non-continuous one (selecting the most informative or most related regressors from …) []. One may thus choose among various model selection or input selection methods, and comparative studies can serve as a useful reference for practical experiments.
One can study the problem of regressor selection as a generalization error estimation problem, in which one looks for the regressors that, for a predetermined model (e.g. NN, k-NN, …), minimize the generalization error estimated by leave-one-out, Bootstrap or other resampling methods. Such procedures are very time consuming, however, and a single computation may take one or two weeks. Alternatively, there are approaches that select the best, or nearly the best, regressors based only on the dataset, which is the case we are concerned with here; their computational cost is lower than in the model-dependent case.
Model-independent approaches try to find a regression vector by optimizing a criterion over different combinations of input-output pairs extracted from the original dataset. The criterion measures the dependence between an input structure, e.g. an arbitrary combination of regressors, and the output, e.g. a value observed ahead in the series, in terms of predictability, correlation, mutual information, etc. For example, an approach based on Lipschitz quotients was proposed in []; it was demonstrated that, when the underlying function is smooth, the lack of an important regressor will most often lead to a very large quotient, whereas a superfluous regressor has only a minor impact on the Lipschitz quotient. This is reasonable and applicable, but the approach requires strong intuition to be useful in practical problems.
We can therefore formulate the regressor selection, or input selection, problem as finding q (the best, or nearly the best, input-output structure) through the following optimization problem:
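In generic form, writing $C$ for the chosen dependence criterion (e.g. the mutual information, or the negative of the Gamma statistic) and $Q$ for the set of candidate regressor combinations, both symbols introduced here only as a sketch of the formulation described above:

$$q^{*} = \arg\max_{q \in Q} \; C\big(X_q, Y\big)$$

where $X_q$ denotes the dataset restricted to the regressors in $q$ and $Y$ is the output.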
In this paper we compare the Gamma Test, as a statistical criterion for input selection, with Mutual Information from information theory, in order to find out whether the Gamma Test is useful for input selection in the time series prediction context. The paper is organized as follows: Sections 2 and 3 shortly introduce the Gamma Test and Mutual Information, Section 4 illustrates the comparison results, and Section 5 gives the conclusions.
Theory Review
Gamma Test
The Gamma Test is a technique for estimating the variance of the noise, i.e. the least mean square error (MSE) that can be achieved using a set of observations. It is a generalization of the approach proposed in [], and is basically based on the fact that the conditional expectation in (1) approaches the variance of the noise when the distance between data points tends to zero [].
The data are assumed to follow $y = f(x) + r$, where $f$ is the continuous underlying function and $r$ is the error term.
The Gamma Test has been applied to various problems in control theory [], feature selection [] and secure communication. The experiments show that the test is useful and applicable to real-world problems. A mathematical proof of the Gamma Test is given in []; it is based on a generalization of the Chebyshev inequality, see formula (1), and on a property of k-nearest-neighbour structures, i.e. the number of k-nearest points of a given point is upper bounded.
The proof of the Gamma Test states that, whenever the underlying function is smooth, the first through fourth moments of the noise distribution exist, and the noise is independent of the corresponding points, the variance of the noise can be computed using the following procedure:
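As an illustration, the standard form of this procedure regresses the mean squared distances to the k-th nearest neighbour in input space against the corresponding half mean squared output differences; the intercept of the regression estimates the noise variance. The following sketch assumes this standard form; the function name, the use of scipy, and the choice of ten neighbour orders are illustrative assumptions rather than the exact implementation used here.

# A minimal sketch of the Gamma Test procedure described above.
import numpy as np
from scipy.spatial import cKDTree

def gamma_test(X, y, p_max=10):
    """Return (Gamma, A): intercept ~ noise variance, slope ~ model complexity."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if X.ndim == 1:
        X = X[:, None]
    tree = cKDTree(X)
    # distances/indices of the 1..p_max nearest neighbours (first column is the point itself)
    dist, idx = tree.query(X, k=p_max + 1)
    dist, idx = dist[:, 1:], idx[:, 1:]
    delta = np.mean(dist ** 2, axis=0)                       # delta_N(k), k = 1..p_max
    gamma = np.mean((y[idx] - y[:, None]) ** 2, axis=0) / 2  # gamma_N(k)
    # linear regression gamma = Gamma + A * delta; the intercept Gamma
    # estimates the noise variance Var(r)
    A, Gamma = np.polyfit(delta, gamma, 1)
    return Gamma, A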
There is an algorithm for finding the k nearest neighbours of a point that runs in O(n log n) time, so the Gamma Test has the same complexity.
Mutual Information
Mutual Information (MI) is a criterion for evaluating the dependence between random variables, and has been applied to many problems such as feature selection [2], Independent Component Analysis, and Blind Source Separation [3]. Compared with other criteria, MI has several theoretical advantages stemming from its close relationship with information theory: it is zero if and only if the variables are statistically independent, and it is invariant under homeomorphisms. Consider two random variables X and Y with joint density µ(x, y); the MI between them is
$$I(X, Y) = H(X) + H(Y) - H(X, Y) \qquad (1)$$

and for M random variables,

$$I(X_1, X_2, \ldots, X_M) = \sum_{m=1}^{M} H(X_m) - H(X_1, X_2, \ldots, X_M) \qquad (2)$$

Here, H(X) is the Shannon entropy,

$$H(X) = -\int \mathrm{d}x\, \mu(x) \log \mu(x) \qquad (3)$$
This is how MI can be computed from its definition, but in practice computing the integrals is not easy. The conventional approach, proposed in [4], is to use estimators based on partitioning the supports of X and Y into bins of finite size. In this paper we use a recently introduced estimator [1][5] based on k-nearest-neighbour statistics; what is new in this method is that it can calculate the MI between variables of any dimension. The idea of this estimator is to estimate H(X) from the average distance to the k nearest neighbours, averaged over all xi, and the MI is then obtained from equations (1) and (2). The actual formulas we use are
$$I(X, Y) = \psi(k) - \frac{1}{k} - \big\langle \psi(n_x) + \psi(n_y) \big\rangle + \psi(N) \qquad (4)$$

$$I(X_1, X_2, \ldots, X_M) = \psi(k) - \frac{M-1}{k} - \Big\langle \sum_{m=1}^{M} \psi(n_m) \Big\rangle + (M-1)\,\psi(N) \qquad (5)$$

Here, $\langle \cdots \rangle = N^{-1} \sum_{i=1}^{N} E[\cdots(i)]$ denotes the average over all data points, and $\psi(x)$ is the digamma function,

$$\psi(x) = \Gamma(x)^{-1}\, \mathrm{d}\Gamma(x)/\mathrm{d}x \qquad (6)$$
$\psi(x)$ satisfies the recursion $\psi(x + 1) = \psi(x) + 1/x$ and $\psi(1) = -C$, where $C = 0.5772156\ldots$ is the Euler-Mascheroni constant. $n_x(i)$ and $n_y(i)$ are the numbers of points in the regions $\|x_i - x_j\| \le \epsilon_x(i)/2$ and $\|y_i - y_j\| \le \epsilon_y(i)/2$, where $\epsilon(i)/2$ is the distance from $z_i$ to its k-th nearest neighbour, and $\epsilon_x(i)/2$ and $\epsilon_y(i)/2$ are the projections of $\epsilon(i)/2$ onto the X and Y subspaces. N is the size of the dataset [1][5]. We use k equal to 6 in our experiments.
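For illustration, the estimator of equation (4) can be sketched as follows; the scipy-based implementation and the function name are assumptions (in practice the MILCA software referenced at the end of the paper implements these estimators):

# A minimal sketch of the k-nearest-neighbour MI estimator of equation (4).
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def mi_knn(x, y, k=6):
    """Estimate I(X;Y) from samples x of shape (N, dx) and y of shape (N, dy)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    N = len(x)
    z = np.hstack([x, y])
    # k-th neighbour distances in the joint space, maximum norm
    tree_z = cKDTree(z)
    dist, idx = tree_z.query(z, k=k + 1, p=np.inf)   # first column is the point itself
    neigh = idx[:, 1:]
    # marginal extents eps_x(i)/2, eps_y(i)/2 of the k-neighbour rectangle
    eps_x = np.max(np.abs(x[neigh] - x[:, None, :]).max(axis=2), axis=1)
    eps_y = np.max(np.abs(y[neigh] - y[:, None, :]).max(axis=2), axis=1)
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # n_x(i), n_y(i): points within the marginal extents (excluding the point itself)
    n_x = np.array([len(tree_x.query_ball_point(x[i], eps_x[i], p=np.inf)) - 1 for i in range(N)])
    n_y = np.array([len(tree_y.query_ball_point(y[i], eps_y[i], p=np.inf)) - 1 for i in range(N)])
    return digamma(k) - 1.0 / k - np.mean(digamma(n_x) + digamma(n_y)) + digamma(N)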
Now, using the grouping property

$$I(X, Y, Z) = I((X, Y), Z) + I(X, Y) \qquad (7)$$
we can obtain I((X,Y),Z), the MI between the two variables Z and (X,Y), by computing I(X,Y,Z) and I(X,Y) with equations (5) and (4). In the following, we use this I((X,Y),Z) to compute the MI between every combination of input variables and the output variable; i.e., if we have 8 candidate inputs, we need to compute 2^8 - 1 MIs. After these MIs have been computed, we select the input combination giving the maximum MI, since it is the one for which the dependence between inputs and output is largest.
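A sketch of this exhaustive search, reusing the mi_knn helper sketched above (the enumeration code is illustrative, not the authors' implementation):

# Exhaustive search over all non-empty input subsets, keeping the one with
# the largest estimated MI with the output (relies on mi_knn above).
from itertools import combinations
import numpy as np

def select_inputs_by_mi(X, y, k=6):
    """X: (N, d) candidate inputs, y: (N,) output. Returns the best index subset."""
    d = X.shape[1]
    best_subset, best_mi = None, -np.inf
    for size in range(1, d + 1):
        for subset in combinations(range(d), size):
            mi = mi_knn(X[:, subset], y, k=k)   # MI between the grouped inputs and the output
            if mi > best_mi:
                best_subset, best_mi = subset, mi
    return best_subset, best_mi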
Experimental Results
Both Mutual Information and the Gamma Test have a profound theoretical background. However, the proofs hold under the assumption of an infinite dataset, in other words they are asymptotically true. There are numerical estimators valid for finite datasets, but their accuracy depends strongly on the size of the dataset. We have therefore run experiments on small datasets in order to show the efficiency of the Gamma Test and Mutual Information for the problem of regressor selection.
Toy Example
In this first experiment we investigate the robustness of Mutual Information and the Gamma Test for selecting the correct inputs in the presence of noise. First, we generate a toy dataset with the following equation:
Y  X 1 X 2  sin X 7  X 10  a  noise
(1)
Here, X and the noise are random variables sampled 1000 times (1000 data points), and a is the noise coefficient, so by increasing the value of a we can test which approach is more robust with respect to the noise level.
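For illustration, such a dataset can be generated and a candidate input subset scored with both criteria as follows; the uniform sampling, the seed, and the reuse of the gamma_test and mi_knn sketches above are assumptions, not the authors' setup.

# Generate the toy data of equation (1) (as reconstructed above) and score
# the true input subset with both criteria (sketch only).
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 10
X = rng.uniform(-1.0, 1.0, size=(N, d))
a = 0.1                                    # noise coefficient
noise = rng.normal(size=N)
Y = X[:, 0] * X[:, 1] + np.sin(X[:, 6]) + X[:, 9] + a * noise

true_inputs = (0, 1, 6, 9)                 # X1, X2, X7, X10 (0-based indices)
print("Gamma statistic:", gamma_test(X[:, true_inputs], Y)[0])
print("MI estimate   :", mi_knn(X[:, true_inputs], Y, k=6))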
Figure () shows the result of this experiment: the x-axis is the value of a, and the y-axis shows the number of correctly selected inputs. It is apparent from the figure that, first, MI is more robust than the Gamma Test with respect to noise when selecting the right inputs; second, the Gamma Test also works well if the noise level is small enough.
Poland Electricity Data Set
This experiment is carried out with the following procedure. First, we use two thirds of the dataset as the training set and perform an exhaustive search over the regressor combination space. Next, we use the Least Squares Support Vector Machine (LS-SVM) to compare the regressor selection results. For each trial, two thirds of the whole dataset are again used as the training set and the remaining data points as the test set, and cross-validation is used for model selection.
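For illustration, the evaluation step can be sketched as follows; scikit-learn's KernelRidge is used here as a closely related stand-in for LS-SVM, and the split and grid values are illustrative assumptions, not the authors' settings.

# Evaluate a selected regressor subset with a kernel model, cross-validated
# over the RBF width and regularization (KernelRidge stands in for LS-SVM).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

def evaluate_selection(X, y, selected, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = 2 * len(y) // 3                       # two thirds for training
    train, test = idx[:n_train], idx[n_train:]
    Xs = X[:, selected]
    grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0], "gamma": [0.01, 0.1, 1.0, 10.0]}
    model = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
    model.fit(Xs[train], y[train])
    err = y[test] - model.predict(Xs[test])
    return np.mean(err ** 2), np.mean(np.abs(err))  # MSE and MAE on the test set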
Give some remarks about the result.
Concluding Remarks
As the number of data points in real-world problems is finite, and may not be large compared with the number of features, input selection is one of the important tasks in data modelling. There are many approaches to the input selection problem [], so a comparison between them is certainly useful for practical problem solving.
Here we have compared two of them, namely Mutual Information and the Gamma Test. In the literature [], it has been demonstrated that mutual information is a useful and practical approach to the input selection problem []. With this in mind, we performed experiments on a toy example and on a dataset using an LS-SVM model, see Section 4. From the experiments we can conclude that, although Gamma Test-based regressors are not as robust to the added noise as Mutual Information-based ones, the Gamma Test is also adequate for input selection problems, and the prediction results show that selections based on the Gamma Test lead to slightly more accurate predictions, in both the MSE and MAE sense, than those based on Mutual Information. Also, from figure () of the experiment, we can see that Gamma Test-based regressors produce a smoother model than Mutual Information-based ones, as far as the number of non-differentiable points is concerned.
References:
MILCA software: http://www.fzjuelich.de/nic/Forschungsgruppen/Komplexe_Systeme/software/milca-home.html
[1] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information", Phys. Rev. E, in press.
[2] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd, Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis".
[3] H. H. Yang and S. Amari, "Adaptive online learning algorithms for blind separation: Maximum entropy and minimum mutual information", Neural Comput., vol. 9, pp. 1457-1482, 1997.
[4] J. Rigau, M. Feixas, M. Sbert, A. Bardera, and I. Boada, "Medical Image Segmentation Based on Mutual Information Maximization".
[5] H. Stögbauer, A. Kraskov, S. A. Astakhov, and P. Grassberger, "Least Dependent Component Analysis Based on Mutual Information", submitted to Phys. Rev. E, http://arXiv.org/abs/physics/0405044