Unsupervised Order-Preserving Regression Kernel for Sequence Analysis

Young-In Shin
Department of Computer Sciences
1 University Station C0500
Austin, TX 78712-0233
codeguru@cs.utexas.edu
http://www.cs.utexas.edu/users/codeguru/

Abstract

In this work, a generalized method for learning from sequences of unlabelled data points, based on unsupervised order-preserving regression, is proposed. Sequence learning is a fundamental problem that covers a wide range of research topics including, e.g., handwritten character recognition and speech and natural language processing. For this, one may compute feature vectors from sequences and learn a function in feature space, or directly match sequences using methods like dynamic time warping. The former approach is not general in that it relies on sets of application-dependent features, while in the latter, matching is often inefficient or ineffective. Our method takes the latter approach while providing very simple and robust matching. Results obtained from applying our method to a few different types of data show that the method is general, while accuracy is enhanced or comparable.

Introduction

We consider the problem of learning from sequences of unlabelled data points. Learning from sequences is a fundamental problem that covers a wide range of research topics including, e.g., hand-written character recognition (Tapia & Rojas 2003), speech and natural language processing, and object detection. When dealing with sequences of variable lengths, the difficulty lies in coming up with an effective similarity measure for sequences. For this, one may compute feature vectors from sequences and learn a function in feature space, implying the inner product as the similarity measure, or one may directly match sequences using methods like dynamic time warping (DTW).

For example, in (Tapia & Rojas 2003), the features computed from hand-written figures and characters include, after preprocessing, the coordinates of points, turning angles and their changes, the length position of each point, the center of gravity of the points, length, relative length, and accumulated angle. These features work well, as their results show. But since they are heuristically chosen and application-dependent, one may have difficulty applying this method to data from other domains, not to mention that they are computationally expensive. Furthermore, such features may not be effective enough.

Alternatively, one may directly match sequences. For example, in (Bahlmann, Haasdonk, & Burkhardt 2002), a Gaussian DTW kernel is used to compute the distance between two sequences. However, as noted by the authors, it is not a metric, since it lacks some necessary properties, e.g. positive semidefiniteness, and can therefore yield sub-optimal solutions. As past work shows, it is desirable that a learning method rely on application-dependent features as little as possible, and that the similarity measure be a metric with the properties of kernel functions. The method proposed in this work meets all these criteria: it takes the latter approach of direct sequence matching, using a kernel function that computes the similarity between sequences.

Our approach is based on unsupervised order-preserving regression, where one finds a function that approximates a sequence of unlabelled data points under the constraint that the data points, projected onto the approximating function, appear in the same order as in the given sequence. Order preservation is a necessary property in sequence matching. However, we do not perform regression over the sequences; rather, we focus on the order-preserving property of the projection indices, which are the input parameters needed for regression. Note that in unsupervised regression, the projection indices are missing and must somehow be provided.

To introduce our kernel function, we first define projection indices as follows. Each data point $x_i^m$ in a sequence $x_i$, for $m = 1, \cdots, N_i$, is associated with a projection index

$$t_i^m \equiv \begin{cases} \sum_{n=2}^{m} \left\| x_i^n - x_i^{n-1} \right\| & \text{for } m \ge 2, \\ 0 & \text{for } m = 1. \end{cases} \tag{1}$$
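In words, $t_i^m$ is the accumulated arc length along the sequence up to the $m$-th point, so later points can never receive smaller indices. As a concrete illustration, the sketch below computes the indices of (1) with NumPy; the function name and implementation details are our own choices, not code from the paper.

```python
import numpy as np

def projection_indices(x):
    """Projection indices of Eq. (1): t^1 = 0 and, for m >= 2,
    t^m = sum over n = 2..m of ||x^n - x^(n-1)||.

    x : (N, D) array with one D-dimensional data point per row.
    Returns an (N,) array that is non-decreasing by construction.
    """
    steps = np.linalg.norm(np.diff(x, axis=0), axis=1)  # ||x^n - x^(n-1)||
    return np.concatenate(([0.0], np.cumsum(steps)))
```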
Note the order-preserving property that $t_i^{m_1} \le t_i^{m_2}$ for any $m_1 < m_2$. Then, we define our kernel function as

$$\kappa(x_i, x_j) \equiv \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k_x(x_i^n, x_j^m) \cdot k_t(t_i^n, t_j^m), \tag{2}$$

where $k_x$ and $k_t$ are any valid kernel functions. $\kappa$ is itself a kernel function, since products and sums of kernel functions are again kernel functions.

Suppose now, for example, that we are given a set of unlabelled training sequences, each drawn i.i.d. from a set $\mathcal{X}$, i.e. $S = \{\, x_i = [x_i^1, \cdots, x_i^{N_i}] \in \mathcal{X},\; i = 1, \cdots, \ell \,\}$, where $x_i^k \in \mathbb{R}^D$, $N_i$ is the number of data points in $x_i$, and $\ell$ is the number of example sequences in $S$. It may be that $N_i \ne N_j$ for $i \ne j$. We may then wish to learn a function from $S$ that can determine whether an unseen test sequence $x$ is also from $\mathcal{X}$. Since we have defined our kernel, we can simply apply any kernel-based unsupervised learning algorithm, e.g. support vector novelty detection (SVND) (Schölkopf & Smola 2002), over $S$.
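To show how (2) can be evaluated directly from this definition, the sketch below pairs every point of one sequence with every point of the other, taking Gaussian RBF kernels for both $k_x$ and $k_t$ (the choice used in the experiments below) and reusing projection_indices from the earlier sketch. This is a minimal illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def rbf(a, b, sigma):
    """Gaussian RBF kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    d2 = np.sum((np.atleast_1d(a) - np.atleast_1d(b)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kappa(xi, xj, sigma):
    """Sequence kernel of Eq. (2): the sum over all point pairs (n, m)
    of k_x(x_i^n, x_j^m) * k_t(t_i^n, t_j^m).

    xi, xj : (N_i, D) and (N_j, D) arrays; lengths may differ.
    """
    ti = projection_indices(xi)  # Eq. (1), sketched above
    tj = projection_indices(xj)
    return sum(
        rbf(xi[n], xj[m], sigma) * rbf(ti[n], tj[m], sigma)
        for n in range(len(xi))
        for m in range(len(xj))
    )
```

A Gram matrix filled in with kappa over the training sequences can then be passed to any kernel-based novelty detector; for example, scikit-learn's OneClassSVM with kernel='precomputed' is one readily available stand-in for the SVND used here.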
Experiments

Our method is applied to many different types of data. Among them, we show results for hand-written numbers and for sensor data captured from a Hokuyo URG-04LX laser rangefinder.

Hand-written numbers are represented as sequences of 2D points (x, y) of pixels on the screen. Our objective is to learn, from a training set of sequences, a function that recognizes an unseen hand-written number. The training set is composed of 200 examples, i.e. 20 examples for each number from 0 to 9, with each of two writers contributing 10 of the 20 samples. Figure 1 shows some training samples for the numbers '1' and '5', with the number of data points in each shown below it.

[Figure 1: Sample Training Data for Hand-Written Numbers; eight sample traces with 19, 9, 14, 28, 14, 15, 17, and 21 data points.]

We trained an SVND for each number, and RBF was chosen for both kx and kt with σ = 50. The test data is composed of 50 samples per character, 500 samples in total. If the SVND of a test sample's own class reports it as not novel, or the SVND of a different class reports it as novel, we count the classification as correct. The classification results are shown in the table below (Figure 2). The classification error was less than 1% in most cases, while the number of support vectors was on average about 90% of the training examples; this is due to the relatively small number of training examples.

No.   # SVs   Error      No.   # SVs   Error
'1'   17      0.94 %     '6'   18      1.15 %
'2'   18      1.02 %     '7'   17      0.38 %
'3'   15      0.10 %     '8'   19      0.14 %
'4'   19      0.87 %     '9'   14      0.07 %
'5'   17      0.98 %     '0'   18      0.01 %

Figure 2: Hand-Written Number Recognition

Sensor data are represented as sequences of 2D points (θ, d), where θ is the angle at which a laser ray is shot, ranging from −120° to 120°, and d is the normalized distance to the obstacle in that direction, ranging from 0 to 1, with d = 1 corresponding to 4 m. Our objective here is to detect the blob in a scan that corresponds to the ball. A blob is a sequence of distance values only, $d_i = [d^1, \cdots, d^{N_i}]$; θ is not used here, for rotation invariance. A ball blob resembles an arc distorted by small sensor noise. In Figure 3, the red blob in the sensor plot is a ball, while the green and blue blobs are non-ball objects, e.g. a box, a table, etc.; on its right, sample training data for ball and non-ball blobs are given. Our training data has 79 ball blobs, and our test data has 204 ball and 804 non-ball blobs. We used SVND, and RBF was chosen for both kx and kt with σ = 0.1.

[Figure 3: Sample Training Data for Laser Rangefinder; panels show a polar sensor plot and sample distance profiles for ball and non-ball blobs.]

The following table (Figure 4) shows that less than 1% classification error was obtained.

# SVs   Ball Error   Non-Ball Error
54      0.49 %       0.62 %

Figure 4: Ball Blob Recognition

Conclusion

The proposed kernel showed enhanced or comparable results, while being simple to compute. The intuition is that kx scores high (low) when data points are close (far), while kt scores high (low) when their positions in the sequence are close (far). Unlike the DTW kernel, (2) does not suffer from yielding sub-optimal solutions, while relying on heuristics or application-dependent features as little as possible. Further challenges for future research include making (2) transformation-invariant, real-time learning and prediction, and finding optimal projection indices. We also wish to point out that this method could be used for signature authentication.

References

Bahlmann, C.; Haasdonk, B.; and Burkhardt, H. 2002. On-line handwriting recognition with support vector machines - a kernel approach. In Proc. of the 8th IWFHR, 49-54.

Schölkopf, B., and Smola, A. J. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Tapia, E., and Rojas, R. 2003. Recognition of on-line handwritten mathematical formulas in the E-Chalk system. In Proc. of ICDAR.