Input Selection based on Gamma Test and Mutual Information

Nima Reyhani, Jin Hao, Y.N. Ji, Amaury Lendasse
Helsinki University of Technology, Katholic de

Abstract

Introduction

Input selection and regressor selection are among the most important issues in learning problems, especially in time-series modelling and prediction. For example, redundant regressors increase the input dimension and lead to slower convergence and lower accuracy, while the lack of informative elements in the regression vector leads to poor predictions, i.e. to a higher misclassification rate in classification or a larger mean square error in regression.

Mathematically speaking, a finite set of regressors or inputs is usually sufficient to extract an accurate model from infinitely many observations. In practice, however, no dataset contains an infinite number of data points and, moreover, the necessary number of regressors increases dramatically with the number of inputs and outputs (the curse of dimensionality). We should therefore select the best regressors, in the sense that they contain the information necessary to capture and reconstruct the underlying system dynamics or the relationship between input and output. With respect to this, several approaches have been proposed both for the continuous problem (finding the best value of the lag) and for the discrete one (selecting the regressors that are most informative about the output) []. Since one has various options for model selection or input selection methods, comparative studies can serve as a useful reference in practical experiments.

The problem of regressor selection can be studied as a generalization-error estimation problem, in which one looks for the best regressor set with respect to a predetermined model (e.g. a neural network, k-NN, ...) by minimizing the generalization error estimated with leave-one-out, Bootstrap or other resampling methods. Such procedures are, however, very time consuming, and a single computation may take one or two weeks. There are also approaches that select the best (or nearly the best) regressors based only on the dataset, which is the case we are concerned with here, so the computational cost is lower than in the model-dependent case. Model-independent approaches look for a regression vector by optimizing a criterion over different combinations of input-output pairs extracted from the original dataset. The criterion measures the dependence between an input structure, e.g. an arbitrary combination of regressors, and the output, e.g. the next value of the observed time series, in terms of predictability, correlation, mutual information, etc. For example, an approach based on Lipschitz quotients was proposed in []; it was demonstrated that when the underlying function is smooth, the lack of an important regressor most often leads to a very large quotient, whereas a superfluous regressor has only a minor impact on the quotient (see the sketch below). This is reasonable and applicable, but the approach needs strong intuition to be useful in practical problems. The regressor (or input) selection problem can thus be formulated, schematically, as finding the best or nearly best input-output structure q through the optimization problem

q* = \arg\max_{q} C(X_q, Y),

where X_q denotes the inputs restricted to the candidate set q and C is the chosen dependence criterion (or, equivalently, as minimizing an estimated error such as the Gamma statistic).

In this paper we compare the Gamma Test, as a statistical criterion for input selection, with Mutual Information from information theory, in order to find out whether the Gamma Test is useful for input selection in the context of time-series prediction.
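Before moving on, the following is a minimal illustrative sketch of the Lipschitz-quotient idea mentioned above. It is our own simplified variant, not the exact index defined in the cited work: the function name, the use of scipy, and the choice of averaging the p largest quotients are assumptions made here purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def lipschitz_number(X, y, p=20):
    """Mean of the p largest Lipschitz quotients |y_i - y_j| / ||x_i - x_j||.

    If an important regressor is missing from X, the largest quotients blow up;
    a superfluous regressor changes the value only slightly.
    """
    d_x = pdist(np.asarray(X, float))                 # pairwise input distances
    d_y = pdist(np.asarray(y, float).reshape(-1, 1))  # pairwise output distances
    q = d_y[d_x > 0] / d_x[d_x > 0]                   # Lipschitz quotients
    return np.sort(q)[-p:].mean()
```

In a selection loop one would compute this number for each candidate regressor set and distrust the sets for which it explodes.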
The paper is organized as follows: Sections 2 and 3 briefly introduce the Gamma Test and Mutual Information, Section 4 illustrates the comparison results, and Section 5 concludes the paper.

Theory Review

Gamma Test

The Gamma Test is a technique for estimating the variance of the noise, i.e. the least mean square error (MSE) that can be achieved on a given set of observations. It is a generalization of the approach proposed in [], and it relies on the fact that, for data generated by

y = f(x) + r,

where f is a continuous underlying function and r is the noise, the conditional expectation of \frac{1}{2}(y' - y)^2 approaches the variance of the noise as the distance between the corresponding input points tends to zero []. The Gamma Test has been applied to various problems in control theory [], feature selection [] and secure communication, and the reported experiments show that it is useful and applicable to real-world problems. A mathematical proof of the Gamma Test is given in []; it is based on a generalization of the Chebyshev inequality applied to the relation above, and on a property of k-nearest-neighbour structures, namely that the number of points that can have a given point among their k nearest neighbours is bounded. The proof states that whenever the underlying function f is smooth (e.g. has bounded first and second partial derivatives), the first through fourth moments of the noise distribution exist, and the noise is independent of the corresponding input points, the variance of the noise can be estimated by the following procedure: for k = 1, \dots, p compute

\delta_N(k) = \frac{1}{N} \sum_{i=1}^{N} \| x_{N[i,k]} - x_i \|^2, \qquad \gamma_N(k) = \frac{1}{2N} \sum_{i=1}^{N} ( y_{N[i,k]} - y_i )^2,

where x_{N[i,k]} denotes the k-th nearest neighbour of x_i and y_{N[i,k]} its corresponding output, fit the regression line \gamma = \Gamma + A \delta through the points (\delta_N(k), \gamma_N(k)), and take the intercept \Gamma as the estimate of the noise variance (a code sketch of this procedure is given at the end of this section). The k nearest neighbours of all points can be found with an O(N log N) algorithm, so the Gamma Test has the same overall complexity.

Mutual Information

Mutual Information (MI) is a criterion for evaluating the dependence between random variables, and it has been applied to many problems such as feature selection [2], Independent Component Analysis and Blind Source Separation [3]. Compared with other criteria, MI has several theoretical advantages due to its close relationship with information theory: it is zero if and only if the variables are statistically independent, and it is invariant under homeomorphisms. Consider two random variables X and Y with joint density \mu(x, y). The MI between them is

I(X, Y) = H(X) + H(Y) - H(X, Y),    (1)

and for M random variables

I(X_1, X_2, \dots, X_M) = \sum_{m=1}^{M} H(X_m) - H(X_1, X_2, \dots, X_M),    (2)

where H(X) is the Shannon differential entropy,

H(X) = - \int \mu(x) \log \mu(x) \, dx.    (3)

This is how MI can be computed from its definition, but in practice the integrals are difficult to evaluate, and the conventional approach, proposed in [4], is to use estimators based on partitioning the supports of X and Y into bins of finite size. In this paper we use a recently introduced estimator [1][5] based on k-nearest-neighbour statistics; what is new in this method is that it can estimate the MI between variables of any dimension. The idea is to estimate H(X) from the average distance to the k-th nearest neighbour, averaged over all x_i, and to obtain the MI from Eqs. (1) and (2). The actual formulas we use are

I(X, Y) = \psi(k) - 1/k - \langle \psi(n_x) + \psi(n_y) \rangle + \psi(N),    (4)

I(X_1, X_2, \dots, X_M) = \psi(k) - \frac{M - 1}{k} - \left\langle \sum_{m=1}^{M} \psi(n_m) \right\rangle + (M - 1) \psi(N),    (5)

where \langle \cdots \rangle = \frac{1}{N} \sum_{i=1}^{N} E[\cdots(i)] denotes averaging over all i and over all realizations of the samples, and \psi(x) is the digamma function,

\psi(x) = \Gamma(x)^{-1} \, d\Gamma(x)/dx,    (6)

which satisfies the recursion \psi(x + 1) = \psi(x) + 1/x and \psi(1) = -C, where C = 0.5772156\dots. Here n_x(i) and n_y(i) are the numbers of points x_j and y_j lying in the regions \|x_i - x_j\| \le \epsilon_x(i)/2 and \|y_i - y_j\| \le \epsilon_y(i)/2 respectively, \epsilon(i)/2 is the distance from z_i = (x_i, y_i) to its k-th nearest neighbour, and \epsilon_x(i)/2 and \epsilon_y(i)/2 are the projections of \epsilon(i)/2 onto the X and Y subspaces [1][5]. N is the size of the dataset. We use k = 6 in our experiments (a code sketch of this estimator is also given at the end of this section). Using the grouping property

I(X, Y, Z) = I((X, Y), Z) + I(X, Y),    (7)

we can obtain I((X, Y), Z), the MI between the two variables Z and (X, Y), by computing I(X, Y, Z) and I(X, Y) with Eqs. (5) and (4). In the following we use I((X, Y), Z) to compute the MI between every combination of input variables and the output; for instance, with 8 candidate inputs we need to compute 2^8 - 1 MIs. After these MIs have been computed, we select the inputs that give the maximum MI, since the dependence between those inputs and the output is then the largest.
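To make the Gamma Test procedure above concrete, here is a minimal sketch (our own illustration, not the implementation used in the cited works; the function name, the scipy k-d tree and the default p = 10 neighbours are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def gamma_test(X, y, p=10):
    """Gamma Test estimate of the noise variance in y = f(X) + r.

    For k = 1..p, delta[k] is the mean squared distance to the k-th nearest
    neighbour in input space and gamma[k] is half the mean squared difference
    of the corresponding outputs; the intercept of the fitted line
    gamma = Gamma + A * delta estimates Var(r).
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float).ravel()
    dist, idx = cKDTree(X).query(X, k=p + 1)   # p nearest neighbours (plus the point itself)
    dist, idx = dist[:, 1:], idx[:, 1:]        # drop the point itself
    delta = np.mean(dist ** 2, axis=0)                         # delta_N(k), k = 1..p
    gamma = np.mean((y[idx] - y[:, None]) ** 2, axis=0) / 2.0  # gamma_N(k)
    A, Gamma = np.polyfit(delta, gamma, 1)     # slope A, intercept Gamma
    return Gamma, A
```

A smaller intercept Gamma (relative to the variance of y) indicates a regressor set that leaves less unexplained noise.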
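Likewise, a sketch of the k-nearest-neighbour MI estimator of Eq. (4) (again our own illustration of the published estimator, not the MILCA code itself; the function name and the use of scipy are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kraskov_mi(x, y, k=6):
    """k-NN estimate of I(X;Y) following Eq. (4); x and y are (N, d) arrays."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    N = len(x)
    z = np.hstack([x, y])
    # distance (max-norm) from each joint point z_i to its k nearest neighbours
    dist, idx = cKDTree(z).query(z, k=k + 1, p=np.inf)
    neigh = idx[:, 1:]                                    # drop the point itself
    # eps_x(i)/2, eps_y(i)/2: projections of the neighbourhood onto X and Y
    eps_x = np.abs(x[neigh] - x[:, None, :]).max(axis=(1, 2))
    eps_y = np.abs(y[neigh] - y[:, None, :]).max(axis=(1, 2))
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # n_x(i), n_y(i): points within the projected distances (the point itself excluded)
    n_x = [len(tree_x.query_ball_point(x[i], eps_x[i], p=np.inf)) - 1 for i in range(N)]
    n_y = [len(tree_y.query_ball_point(y[i], eps_y[i], p=np.inf)) - 1 for i in range(N)]
    return (digamma(k) - 1.0 / k
            - np.mean(digamma(n_x) + digamma(n_y))
            + digamma(N))
```

Eq. (5) is obtained with the same construction, using one neighbour count per subspace, and the grouping property (7) then gives I((X, Y), Z) from two calls of this kind.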
Experimental Results

Both Mutual Information and the Gamma Test have a profound background and the theories behind them are remarkable. However, the proofs hold under the assumption of an infinite dataset, i.e. they are asymptotically true. There are numerical estimators that remain valid for finite datasets, but their accuracy depends strongly on the number of available samples. We have therefore run experiments on small datasets in order to show how efficient the Gamma Test and Mutual Information are for the regressor selection problem.

Toy Example

In this first experiment we investigate how robustly Mutual Information and the Gamma Test select the correct inputs in the presence of noise. We first generate a toy dataset with the equation

Y = X_1 X_2 + \sin(X_7) + X_{10} + a \cdot \text{noise},    (8)

where X and noise are random variables sampled 1000 times and a is the noise coefficient, so by increasing a we can test which approach is more robust with respect to the noise level. Figure () shows the result of the experiment; the x-axis is the value of a and the y-axis is the number of correctly selected inputs. The figure shows, first, that MI is more robust to noise than the Gamma Test in selecting the right inputs and, second, that the Gamma Test also works well when the noise level is small enough.

Poland Electricity dataset

This experiment is carried out by the following procedure. First, two thirds of the dataset are used as the training set, and an exhaustive search over the space of regressor combinations is performed, scoring each combination with the Gamma Test and with Mutual Information (a sketch of the overall procedure is given at the end of this section). Next, a Least Squares Support Vector Machine (LS-SVM) is used to compare the regressor selection results: for each trial, two thirds of the whole dataset are again used as the training set and the remaining data points as the test set, and a cross-validation procedure is used for model selection. [Remarks about the results to be added here.]
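The selection-plus-validation procedure of this section can be sketched as follows (illustrative only: kernel ridge regression is used as a stand-in for LS-SVM, and the function names, score wrappers and hyperparameter grids are our assumptions, not the exact setup of the experiment):

```python
import numpy as np
from itertools import combinations
from sklearn.kernel_ridge import KernelRidge          # stand-in for LS-SVM
from sklearn.model_selection import GridSearchCV

def best_subset(X, y, score):
    """Exhaustively score every non-empty column subset of X (2^d - 1 of them)."""
    d = X.shape[1]
    subsets = [c for r in range(1, d + 1) for c in combinations(range(d), r)]
    return max(subsets, key=lambda c: score(X[:, list(c)], y))

# Possible scores (hypothetical wrappers around the sketches given earlier):
#   MI criterion    : lambda Xs, y: kraskov_mi(Xs, y)        # larger MI is better
#   Gamma criterion : lambda Xs, y: -gamma_test(Xs, y)[0]    # smaller Gamma is better

def validate(X_tr, y_tr, X_te, y_te, cols):
    """Fit a kernel model on the selected regressors; return test MSE and MAE."""
    cols = list(cols)
    model = GridSearchCV(KernelRidge(kernel="rbf"),
                         {"alpha": [1e-3, 1e-2, 1e-1, 1.0],
                          "gamma": [1e-2, 1e-1, 1.0, 10.0]},
                         cv=5)                         # cross-validation for model selection
    model.fit(X_tr[:, cols], y_tr)
    err = model.predict(X_te[:, cols]) - y_te
    return np.mean(err ** 2), np.mean(np.abs(err))
```

The two regressor sets returned by best_subset (one per criterion) are then compared through the MSE and MAE reported by validate on the held-out third of the data.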
Concluding Remarks

Since the number of data points in real-world problems is finite and may be small compared with the number of features, input selection is one of the important tasks in data modelling. Many approaches to the input selection problem have been proposed [], so a comparison between them is certainly useful for practical problem solving. Here we have compared two of them, namely Mutual Information and the Gamma Test. In the literature [], it has been demonstrated that mutual information is a useful and practical approach to the input selection problem []. In this respect we have carried out experiments on a toy example and on a real dataset using an LS-SVM model, see Section 4. From the experiments we conclude that, although Gamma Test-based regressor selection is not as robust to added noise as Mutual Information-based selection, it is still well suited to input selection problems; moreover, the prediction results indicate that the selections based on the Gamma Test lead to slightly more accurate predictions, in both the MSE and the MAE sense, than those based on Mutual Information. Also, figure () of the experiment shows that the Gamma Test-based regressors produce a smoother model than the Mutual Information-based ones as far as the number of non-differentiable points is concerned.

References

MILCA software: http://www.fzjuelich.de/nic/Forschungsgruppen/Komplexe_Systeme/software/milca-home.html

[1] A. Kraskov, H. Stögbauer and P. Grassberger, Estimating mutual information, Phys. Rev. E, in press.

[2] G. D. Tourassi, E. D. Frederick, M. K. Markey and C. E. Floyd, Jr., Application of the mutual information criterion for feature selection in computer-aided diagnosis.

[3] H. H. Yang and S. Amari, Adaptive online learning algorithms for blind separation: Maximum entropy and minimum mutual information, Neural Computation, vol. 9, pp. 1457-1482, 1997.

[4] J. Rigau, M. Feixas, M. Sbert, A. Bardera and I. Boada, Medical image segmentation based on mutual information maximization.

[5] H. Stögbauer, A. Kraskov, S. A. Astakhov and P. Grassberger, Least dependent component analysis based on mutual information, submitted to Phys. Rev. E, http://arXiv.org/abs/physics/0405044