Spikernels: Predicting Arm Movements by Embedding Population Spike Rate Patterns in Inner-Product Spaces

Lavi Shpigelman (shpigi@cs.huji.ac.il), School of Computer Science and Engineering and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
Yoram Singer (singer@cs.huji.ac.il), School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
Rony Paz (ronyp@hbf.huji.ac.il) and Eilon Vaadia (eilon@hbf.huji.ac.il), Interdisciplinary Center for Neural Computation and Department of Physiology, Hadassah Medical School, The Hebrew University, Jerusalem 91904, Israel

Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this letter is the construction of biologically motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach by comparing the Spikernel to various standard kernels in the task of predicting hand movement velocities from cortical recordings. All of the kernels that we tested in our experiments outperform the standard scalar product used in linear regression, with the Spikernel consistently achieving the best performance.

1 Introduction

Neuronal activity in primary motor cortex (MI) during multijoint arm reaching movements in 2D and 3D (Georgopoulos, Schwartz, & Kettner, 1986; Georgopoulos, Kettner, & Schwartz, 1988) and drawing movements (Schwartz, 1994) has been used extensively as a test bed for gaining understanding of neural computations in the brain. Most approaches assume that information is coded by firing rates, measured on various timescales. The tuning curve approach models the average firing rate of a cortical unit as a function of some external variable, like the frequency of an auditory stimulus or the direction of a planned movement. Many studies of motor cortical areas (Georgopoulos, Kalaska, & Massey, 1983; Georgopoulos et al., 1988; Moran & Schwartz, 1999; Schwartz, 1994; Laubach, Wessberg, & Nicolelis, 2000) showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds) carries enough information to allow for accurate prediction. Such broad tuning can be found in many parts of the nervous system, suggesting that computation by distributed populations of cells is a general cortical feature. The population vector method (Georgopoulos et al., 1983, 1988) describes each cell's firing rate as the dot product between that cell's preferred direction and the direction of hand movement. The vector sum of preferred directions, weighted by the measured firing rates, is used both as a way of understanding what the cortical units encode and as a means for estimating the velocity vector.
Several recent studies (Fu, Flament, Coltz, & Ebner, 1997; Donchin, Gribova, Steinberg, Bergman, & Vaadia, 1998; Schwartz, 1994) propose that neurons can represent or process multiple parameters simultaneously, suggesting that it is the dynamic organization of the activity in neuronal populations that may represent temporal properties of behavior, such as the computation of the transformation from desired action in external coordinates to muscle activation patterns. A few studies (Vaadia et al., 1995; Laubach, Shuler, & Nicolelis, 1999; Riehle, Grün, Diesmann, & Aertsen, 1997) support the notion that neurons can associate and dissociate rapidly into functional groups in the process of performing a computational task. The concepts of simultaneous encoding of multiple parameters and dynamic representation in neuronal populations together could explain some of the conundrums in motor system physiology. These concepts also invite the use of increasingly complex models for relating neural activity to behavior.

Advances in computing power and recent developments of physiological recording methods allow recording of ever growing numbers of cortical units that can be used for real-time analysis and modeling. These developments and new understandings have recently been used to reconstruct movements on the basis of neuronal activity in real time, in an effort to facilitate the development of hybrid brain-machine interfaces that allow interaction between living brain tissue and artificial electronic or mechanical devices to produce brain-controlled movements (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Laubach et al., 2000; Nicolelis, 2001; Wessberg et al., 2000; Laubach et al., 1999; Nicolelis, Ghazanfar, Faggin, Votaw, & Oliveira, 1997; Isaacs, Weber, & Schwartz, 2000).

Most of the current attempts that focus on predicting movement from cortical activity rely on modeling techniques that employ parametric models to describe a neuron's tuning (instantaneous spike rate) as a function of current behavior. For instance, cosine-tuning estimation (population vector) is used by Taylor, Tillery, and Schwartz (2002), and linear regression was employed by Wessberg et al. (2000) and Serruya, Hatsopoulos, Paninski, Fellows, and Donoghue (2002). Brown, Frank, Tang, Quirk, and Wilson (1998) and Brockwell, Rojas, and Kass (2004) have applied filtering methods that, apart from assuming parametric models of spike rate generation, also incorporate a statistical model of the movement. An exception is the use of artificial neural nets (Wessberg et al., 2000), though that study reports obtaining better results with linear regression.

This article describes the tailoring of a kernel method for the task of predicting two-dimensional hand movements. The main difference between our method and the other approaches is that our nonparametric approach does not assume cell independence, imposes relatively few assumptions on the tuning properties of the cells and on how the neural code is read, allows representation of behavior in highly specific neural patterns, and, as we demonstrate later, results in better predictions on our test sets. However, due to the implicit manner in which neural activity is mapped to feature space, improved results are often achieved at the expense of understanding the tuning of individual cells. We attempt to partially overcome this difficulty by describing the feature space that our kernel induces and by explaining the way our kernel parameters affect the features.

The letter is organized as follows.
In section 2, we describe the problem setting that this article is concerned with. In section 3, we introduce and explain the main mathematical tool that we use: the kernel operator. In section 4, we discuss the design and implementation of a biologically motivated kernel for neural activities. The experimental method is described in section 5. We report experimental results in section 6 and give conclusions in section 7.

2 Problem Setting

Consider the case where we monitor instantaneous spike rates from $q$ cortical units during physical motor behavior of a subject (the method used for translating recorded spike times into rate measures is explained in section 5). Our goal is to learn a predictive model of some behavior parameter (such as hand velocities, as described in section 5) with the cortical activity as the input.

Formally, let $S \in \mathbb{R}^{q \times m}$ be a sequence of instantaneous firing rates from $q$ cortical units consisting of $m$ time samples. We use $S, T$ to denote sequences of firing rates and denote by $\mathrm{len}(S)$ the time length of a sequence $S$ ($\mathrm{len}(S) = m$ in the above definition). $S_i \in \mathbb{R}^q$ designates the instantaneous firing rates of all neurons at time $i$ in the sequence $S$, and $S_{i,k} \in \mathbb{R}$ the rate of neuron $k$. We also use $Ss$ to denote the concatenation of $S$ with one more sample $s \in \mathbb{R}^q$; the instantaneous firing rate of unit $k$ in sample $s$ is then $s_k \in \mathbb{R}$. We also need to employ a notation for subsequences. A consecutive subsequence of $S$ from time $t_1$ to time $t_2$ is denoted $S_{t_1:t_2}$. Finally, throughout the work, we need to examine possibly nonconsecutive subsequences. We denote by $\mathbf{i} \in \mathbb{Z}^n$ a vector of time indices into the sequence $S$ such that $1 \le i_1 < i_2 < \cdots < i_n \le \mathrm{len}(S)$. $S_{\mathbf{i}} \in \mathbb{R}^{q \times n}$ is then an ordered selection of spike rates (of all neurons) from times $\mathbf{i}$.

Let $y \in \mathbb{R}^m$ denote a parameter of the movement that we would like to predict (e.g., the movement velocity in the horizontal direction, $v_x$). Our goal is to learn an approximation $\hat{y} = f(S)$, where $f: \mathbb{R}^{q \times m} \to \mathbb{R}^m$ maps neural firing rates to the movement parameter. In general, information about movement can be found in neural activity both before and after the time of the movement itself. Our long-term plan, though, is to design a model that can be used for controlling a neural prosthesis. We therefore confine ourselves to causal predictors that use an $l$-sample-long window ($l \in \mathbb{Z}$) of neural activity ending at time step $t$, $S_{(t-l+1):t}$, to predict $y_t \in \mathbb{R}$. We would like to make $\hat{y}_t = f(S_{(t-l+1):t})$ as close as possible (in a sense that is explained in the sequel) to $y_t$.

3 Kernel Method for Regression

The major mathematical tool employed in this article is kernel operators. Kernel operators allow algorithms whose interface to the data is limited to scalar products to employ complicated premappings of the data into feature spaces by use of kernels. Formally, a kernel is an inner-product operator $K: X \times X \to \mathbb{R}$, where $X$ is some arbitrary vector space. In this work, $X$ is the space of neural activities (specifically, population spike rates). An explicit way to describe $K$ is via a mapping $\phi: X \to \mathcal{H}$ from $X$ to a feature space $\mathcal{H}$ such that $K(x, x') = \phi(x) \cdot \phi(x')$. Given a kernel operator, we can use it to perform various statistical learning tasks. One such task is support vector regression (SVR) (see, e.g., Smola & Schölkopf, 2004), which attempts to find a regression function for target values that is linear when observed in the (typically very high-dimensional) feature space mapped by the kernel.
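To make the kernel-feature-space correspondence concrete, the following minimal sketch (in Python; ours, not part of the original letter) checks numerically that the homogeneous second-degree polynomial kernel, used later as one of the baselines, equals a dot product under an explicit feature map for inputs in $\mathbb{R}^2$:

```python
import numpy as np

def poly2_kernel(x, xp):
    """Homogeneous second-degree polynomial kernel K(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

def poly2_features(x):
    """Explicit feature map for the same kernel when x is in R^2:
    phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so that K(x, x') = phi(x) . phi(x')."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly2_kernel(x, xp), np.dot(poly2_features(x), poly2_features(xp)))
```

The Spikernel plays the same role as K here, except that its feature space is infinite-dimensional, so only the kernel value itself is ever computed.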
We give here a brief description of SVR for clarity. SVR employs the $\varepsilon$-insensitive loss function (Vapnik, 1995). Formally, this loss is defined as

$$\left| y_t - f\left(S_{(t-l+1):t}\right) \right|_{\varepsilon} = \max\left\{ 0,\ \left| y_t - f\left(S_{(t-l+1):t}\right) \right| - \varepsilon \right\}.$$

Examples that fall within $\varepsilon$ of the target value ($y_t$) do not contribute to the error. Those that fall farther away contribute linearly to the loss (see Figure 1, left). The form of the regression model is

$$f\left(S_{(t-l+1):t}\right) = \mathbf{w} \cdot \phi\left(S_{(t-l+1):t}\right) + b, \qquad (3.1)$$

which is linear in the feature space $\mathcal{H}$ that is implicitly defined by the kernel. Combined with the loss function, $f$ defines a hyperslab of width $\varepsilon$ centered at the estimate (see Figure 1, right, for a single-dimensional illustration).

Figure 1: Illustration of the $\varepsilon$-insensitive loss (left) and an $\varepsilon$-insensitive area around a linear regression in a single-dimensional feature space (right). Examples that fall within distance $\varepsilon$ have zero loss (shaded area).

For each trial $i$ and time index $t$ designating a frame within the trial, we received a pair $\left(S^i_{(t-l+1):t}, y^i_t\right)$ consisting of spike rates and a corresponding target value. The optimization problem employed by SVR is

$$\arg\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i,t} \left| y^i_t - f\left(S^i_{(t-l+1):t}\right) \right|_{\varepsilon}, \qquad (3.2)$$

where $\|\cdot\|^2$ is the squared vector norm. This minimization casts a trade-off (weighed by $C \in \mathbb{R}$) between a small empirical error and a regularization term that favors low sensitivity to feature values. Let $\phi(x)$ denote the feature vector that is implicitly implemented by the kernel function $K(\cdot, x)$. Then there exists a regression model that is equivalent to the one given in equation 3.1, is a minimum of equation 3.2, and takes the form

$$f(T) = \sum_{t,i} A_{t,i}\, K\!\left(S^i_{(t-l+1):t},\, T\right) + b.$$

Here, $T$ is the observed neural activity, and $A$ is a (typically sparse) matrix of Lagrange multipliers determined by the minimization problem in equation 3.2. In summary, SVR solves a quadratic optimization problem aimed at finding a linear regressor in an induced high-dimensional feature space. This regressor is the best penalized estimate for the $\varepsilon$-insensitive loss function, where the penalty is the squared norm of the regressor's weights.

4 Spikernels

The quality of kernel-based learning is highly dependent on how the data are embedded in the feature space via the kernel operator. For this reason, several studies have been devoted to developing new kernels (Jaakkola & Haussler, 1998; Genton, 2001; Lodhi, Shawe-Taylor, Cristianini, & Watkins, 2000). In fact, a good kernel could render the work of a classification or regression algorithm trivial. With this in mind, we develop a kernel for neural spike rate activity.

Figure 2: Illustrative examples of pattern similarities. (a) Bin-by-bin comparison yields small differences. (b) Patterns with large bin-by-bin differences that can be eliminated with some time warping. (c) Patterns whose suffix (time of interest) is similar and prefix is different.

4.1 Motivation. Our goal in developing a kernel for spike trains is to map similar patterns to nearby areas of the feature space. In the description of our kernel, we attempt to capture some well-accepted notions on similarities between spike trains.
We make the following assumptions regarding similarities between spike patterns:

• A commonly made assumption is that firing patterns that have small differences in a bin-by-bin comparison are similar. This is the basis of linear regression. In Figure 2a, we show an example of two patterns that are bin-wise similar.

• A cortical population may display highly specific patterns to represent specific information such as a particular stimulus or hand movement. Though in this work we make use of spike counts within each time bin, we assume that a pattern of spike counts may be specific to an external stimulus or action and that the manner in which these patterns change as a function of the behavior may be highly nonlinear (Segev & Rall, 1998). Many models of neuronal computation are based on this assumption (see, e.g., Pouget & Snyder, 2000).

• Two patterns may be quite different from a simple bin-wise perspective, yet if they are aligned using a nonlinear time distortion or shifting, the similarity becomes apparent. An illustration of such patterns is given in Figure 2b. In comparing patterns, we would like to induce a higher score when the time shifts are small. Some biological plausibility for time-warped integration of inputs may stem from the spatial distribution of synaptic inputs on cortical cells' dendrites. In order for two signals to be integrated along a dendritic tree or at the axon hillock, their sources must be appropriately distributed in space and time, and this distribution may be sensitive to the current depolarization state of the tree (Magee, 2000).

• Patterns that are associated with identical values of an external stimulus at time $t$ may be similar at that time but rather different at nearby times, when the values of the external stimulus for these patterns are no longer similar (as illustrated in Figure 2c). We would like to give a higher similarity score to patterns that are similar around a time of interest (prediction time), even if they are rather different at other times.

4.2 Spikernel Definition. We describe our kernel construction, the Spikernel, by specifying the features that make up the feature space. Our construction of the feature space builds on the work of Haussler (1999), Watkins (1999), and Lodhi et al. (2000). We chose to directly specify the features constituting the kernel rather than describe the kernel from a functional point of view, as this provides some insight into the structure of the feature space.

We first need to introduce a few more notations. Let $S$ be a sequence of length $l = \mathrm{len}(S)$. The set of all possible $n$-long index vectors defining a subsequence of $S$ is

$$I_{n,l} = \left\{ \mathbf{i} : \mathbf{i} \in \mathbb{Z}^n,\ 1 \le i_1 < \cdots < i_n \le l \right\}.$$

Also, let $d(\alpha, \beta)$ ($\alpha, \beta \in \mathbb{R}^q$) denote a bin-wise distance over a pair of samples (population firing rates). We also overload notation and denote by $d(S_{\mathbf{i}}, U) = \sum_{k=1}^{n} d\!\left(S_{i_k}, U_k\right)$ a distance between firing rate sequences; the sequence distance is the sum of distances over the samples. One such function (the $d$ we explore here) is the squared $\ell_2$ norm. Let $\mu, \lambda \in (0, 1)$. The $U$ component of our (infinite) feature vector $\phi(S)$ is defined as

$$\phi^n_U(S) = \sum_{\mathbf{i} \in I_{n,\mathrm{len}(S)}} \mu^{d(S_{\mathbf{i}}, U)}\, \lambda^{\mathrm{len}(S) - i_1}, \qquad (4.1)$$

where $i_1$ is the first entry in the index vector $\mathbf{i}$. In words, $\phi^n_U(S)$ is a sum over all $n$-long subsequences of $S$.
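For intuition, equation 4.1 can be evaluated literally on toy inputs by enumerating subsequences. The following brute-force sketch (ours, not part of the original text; exponential in len(S) and usable only for tiny examples) computes a single coordinate $\phi^n_U(S)$ with $d$ taken as the squared $\ell_2$ distance used later:

```python
from itertools import combinations
import numpy as np

def phi_nU(S, U, mu, lam):
    """Brute-force evaluation of the feature phi^n_U(S) of equation 4.1.
    S: (len(S), q) array of population spike counts; U: (n, q) feature coordinate."""
    S, U = np.asarray(S, float), np.asarray(U, float)
    l, n = len(S), len(U)
    value = 0.0
    for i in combinations(range(l), n):             # all index vectors i1 < ... < in (0-based here)
        d = ((S[list(i)] - U) ** 2).sum()           # bin-wise squared-l2 distance d(S_i, U)
        value += mu ** d * lam ** (l - (i[0] + 1))  # lam^(len(S) - i1), with i1 counted 1-based
    return value
```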
Each subsequence Si is compared to U (the feature coordinate) and contributes to the feature value by adding a number that is proportional to the bin-wise similarity between Si and U and is inversely proportional to the amount of stretching Si needs to undergo. That is, the feature entry indexed by U measures how similar U is to the latter part of the time series S. This definition seems to fit our assumptions on neural coding for the following reasons: • A feature φnU (S) gets large values only if U is bin-wise similar to subsequences of the pattern S. • It allows for learning complex patterns: choosing small values of λ and µ (or concentrated d measures) means that each feature tends toward being either 1 or 0, depending on whether the feature coordinate U is almost identical to a suffix of S. • We allow gaps in the indexes defining subsequences, thus allowing for time warping. • Subpatterns that begin further from the required prediction time are penalized by an exponentially decaying weight; thus, more weight is given to activities close to the prediction time (the end of the sequence), which designates our time of interest. 4.3 Efficient Kernel Calculation. The definition of φ given by equation 4.1 requires the manipulation of an infinite feature space and an iteration over all subsequences that grows exponentially with the lengths of the sequences. Straightforward calculation of the feature values and performing the induced inner product is clearly impossible. Building on ideas from Lodhi et al. (2000), who describe a similar kernel (for finite sequence alphabets and no time of interest), we developed an indirect method for evaluating the kernel through a recursion that can be performed efficiently using dynamic programming. The solution we now describe is rather general, as it employs an arbitrary base feature vector ψ . We later describe our specific design choice for ψ . The definition in equation 4.1 is a special case of the following feature definition, φnU (S) = i∈In,len(S) ψ Uk Sik λlen(S)−i1 , (4.2) k where n specifies the length of the subsequences, and for the Spikernel, we substitute d S ,U ψ Uk Sik = µ ik k . Spikernels 679 The feature vector of equation 4.2 satisfies the following recursion: len(S) i=1 λlen(S)−i+1 ψ (Si ) φn−1 (S1:i−1 ) len(S) ≥ n > 0 n φ (S) = 0 len(S) < n, n > 0 (4.3) 1 n=0 The recursive form is derived by decomposing the sum over all subsequences into a sum over the last subsequence index and a sum over the rest of the indices. The fact that φn (S) is the zero vector for sequences of length less than n is due to equation 4.2. The last value of φn when n = 0 is defined to be the all-one vector. The kernel that performs the dot product between two feature vectors is then defined as Kn (S, T) = φn (S) · φn (T) = φnU (S)φnU (T)dU, Rq×n where n designates subsequences of length n as before. We now plug in the recursive feature vector definition from equation 4.3: len(S) n−1 n len(S)−i+1 K (S, T) = λ ψ Un (Si ) φU1:n−1 (S1:i−1 ) Rq×n i=1 len(T) × dU. λlen(T)−j+1 ψ Un Tj φn−1 U1:n−1 T1:j−1 (4.4) j=1 Next, we note that n−1 n−1 φn−1 (S1:i−1 ) · φn−1 (T1:j−1 ) U (S1:i−1 )φU (T1:j−1 )dU = φ Rq×n−1 = Kn−1 S1:i−1 , T1:j−1 , and also Rq ψ Uk (Si )ψ Uk (Tj )dUk = ψ (Si ) · ψ (Tj ). Defining K∗ (Si , Tj ) = ψ (Si )·ψ (Si ), we now insert the integration in equation 4.4 into the sums to get Kn (S, T) len(S) len(T) n−1 λlen(S)+len(T)−i−j+2 ψ (Si )· ψ Tj φ (S1:i−1 )· φn−1 T1:j−1 = i=1 j=1 len(S) len(T) = i=1 j=1 λlen(S)+len(T)−i−j+2 K∗ Si , Tj Kn−1 S1:i−1 , T1:j−1 . 
We have thus derived a recursive definition of the kernel, which inherits the initial conditions from equation 4.3:

$$\forall\, S, T: \quad K^0(S, T) = 1; \qquad \text{if } \min\!\left(\mathrm{len}(S), \mathrm{len}(T)\right) < i, \text{ then } K^i(S, T) = 0. \qquad (4.6)$$

The kernel function $K^*(\cdot, \cdot)$ may be any kernel operator that operates on population spike rates sampled at single time points. To obtain the Spikernel described above, we chose

$$K^*(\alpha, \beta) = \int_{\mathbb{R}^q} \mu^{d(u, \alpha)}\, \mu^{d(u, \beta)}\, du.$$

A more time-efficient description of the recursive computation of the kernel may be obtained if we notice that the sums in equation 4.5 are used, up to a constant $\lambda$, in calculating the kernel for prefixes of $S$ and $T$. Caching these sums yields

$$K^n(Ss, Tt) = \lambda K^n(S, Tt) + \lambda K^n(Ss, T) - \lambda^2 K^n(S, T) + \lambda^2 K^*(s, t)\, K^{n-1}(S, T),$$

where $s$ and $t$ are the last samples in the sequences (and the initial conditions of equation 4.6 still hold). Calculating $K^n(S, T)$ using the above recursion is computationally cheaper than equation 4.5 and comprises constructing a three-dimensional array, yielding an $O\!\left(\mathrm{len}(S)\,\mathrm{len}(T)\,n\right)$ dynamic programming procedure.

4.4 Spikernel Variants. The kernels defined by equation 4.1 consider only patterns of fixed length $n$. It makes sense to look at subsequences of various lengths. Since a linear combination of kernels is also a kernel, we can define our kernel to be

$$K(S, T) = \sum_{i=1}^{n} p_i\, K^i(S, T),$$

where $p \in \mathbb{R}^n_+$ is a vector of weights. The weighted kernel summation can be interpreted as performing the dot product between vectors of the form $\left[\sqrt{p_1}\,\phi^1(\cdot), \ldots, \sqrt{p_n}\,\phi^n(\cdot)\right]$, where $\phi^i(\cdot)$ is the feature vector that kernel $K^i(\cdot, \cdot)$ represents. In the results, we use this definition with $p_i = 1$ (a better weighting was not explored).

Different choices of $K^*(\alpha, \beta)$ allow different methods of comparing population rate values (once the time bins for comparison are chosen). We chose $d(\alpha, \beta)$ to be the squared $\ell_2$ norm: $\|\alpha - \beta\|_2^2 = \sum_{k=1}^{q} (\alpha_k - \beta_k)^2$. The kernel $K^*(\cdot, \cdot)$ then becomes

$$K^*(\alpha, \beta) = c\, \mu^{\frac{1}{2}\|\alpha - \beta\|_2^2}, \qquad \text{with } c = \left(\frac{\pi}{-2\ln\mu}\right)^{q/2};$$

however, $c$ can be (and was) set to 1 without loss of generality. Note that $K^*(\alpha, \beta)$ can be written as $K^*(\alpha, \beta) = e^{\frac{\ln(\mu)}{2}\|\alpha - \beta\|_2^2}$ (with $\ln\mu < 0$ for $\mu \in (0,1)$) and is therefore a simple gaussian. It is also worth noticing that if $\lambda \approx 0$, then $K^n(S, T) \to c\, e^{\frac{\ln(\mu)}{2}\|S - T\|_2^2}$. That is, in this limit, no time warping is allowed, and the Spikernel is reduced to the standard exponential kernel (up to a constant $c$).
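The cached recursion above translates directly into a small dynamic program. The following sketch (ours, not the authors' implementation; variable names and the NumPy realization are our own choices) computes the summed Spikernel of section 4.4 with the gaussian base kernel $K^*$ and the constant $c$ set to 1:

```python
import numpy as np

def spikernel(S, T, n_max, lam, mu, p=None):
    """Dynamic-programming evaluation of K(S, T) = sum_n p_n K^n(S, T) (sections 4.3-4.4).
    S, T : (len, q) arrays whose rows are population spike-count vectors (one per time bin).
    lam  : lambda in (0, 1), penalizing subsequences that start far from the end of the window.
    mu   : mu in (0, 1); base kernel K*(a, b) = mu**(||a - b||^2 / 2), with c set to 1."""
    S, T = np.asarray(S, float), np.asarray(T, float)
    ls, lt = len(S), len(T)
    p = np.ones(n_max) if p is None else np.asarray(p, float)
    # Base kernel between every pair of single time bins.
    d2 = ((S[:, None, :] - T[None, :, :]) ** 2).sum(-1)          # squared l2 distances, (ls, lt)
    Kstar = mu ** (0.5 * d2)
    # K^0 equals 1 for every pair of prefixes (equation 4.6).
    Kprev = np.ones((ls + 1, lt + 1))
    total = 0.0
    for n in range(1, n_max + 1):
        Kcur = np.zeros((ls + 1, lt + 1))                        # zero when a prefix is shorter than n
        for i in range(n, ls + 1):
            for j in range(n, lt + 1):
                # Cached recursion of section 4.3, indexed by prefix lengths i and j.
                Kcur[i, j] = (lam * Kcur[i - 1, j] + lam * Kcur[i, j - 1]
                              - lam ** 2 * Kcur[i - 1, j - 1]
                              + lam ** 2 * Kstar[i - 1, j - 1] * Kprev[i - 1, j - 1])
        total += p[n - 1] * Kcur[ls, lt]
        Kprev = Kcur
    return total
```

Because the tables are indexed by prefix lengths, the run time of the sketch matches the $O(\mathrm{len}(S)\,\mathrm{len}(T)\,n)$ bound stated in section 4.3.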
5 Experimental Method

5.1 Data Collection. The data used in this work were recorded from the primary motor cortex of a rhesus (Macaca mulatta) monkey (approximately 4.5 kg). The animal's care and surgical procedures accorded with The NIH Guide for the Care and Use of Laboratory Animals (rev. 1996) and with the Hebrew University guidelines supervised by the institutional committee for animal care and use. The monkey sat in a dark chamber, and eight electrodes were introduced into each hemisphere. The electrode signals were amplified, filtered, and sorted (MCP-PLUS, MSD, Alpha-Omega, Nazareth, Israel). The data used in this report were recorded on four different days during a one-month period of daily sessions. Up to 16 microelectrodes were inserted on each day. Recordings include spike times of single units (isolated by signal fit to a series of windows) and multiunits (detection by threshold crossing).

The monkey used two planar-movement manipulanda to control two cursors (× and + shapes) on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle for 1.0 to 1.5 s. Either cursor could turn green, indicating the hand to be used in the trial (× for the right arm and + for the left). Then, after an additional hold period of 1.0 to 1.5 s, one of eight targets (0.8 cm diameter) appeared at a distance of 4 cm from the origin. After another 0.7 to 1.0 s, the center circle disappeared (referred to as the Go Signal), and the monkey had to move and reach the target in less than 2 s to receive a liquid reward.

We obtained data from all channels that exhibited spiking activity (i.e., were not silent) during at least 90% of the trials. The number and types of recorded units and the number of trials used in each of the recording days are described in Table 1. At the end of each session, we examined the activity of neurons evoked by passive manipulation of the limbs and applied intracortical microstimulation (ICMS) to evoke movements. The data presented here were recorded in penetration sites where ICMS evoked shoulder and elbow movements. Identification of the recording area was also aided by MRI. More information can be found in Paz, Boraud, Natan, Bergman, and Vaadia (2003).

Table 1: Summary of the Data Sets Obtained.

Data Set   Number of      Number of Multiunit   Mean Spike Rate   Number of           Net Cumulative
           Single Units   Channels              (spikes/sec)      Successful Trials   Movement Time (sec)
1          27             16                    8.6               317                 529
2          22             9                     11.6              333                 550
3          27             10                    8.0               155                 308
4          25             13                    6.5               307                 509

The results that we present here refer to prediction of instantaneous hand velocities (in 2D) during the movement time (from Go Signal to Target Reach times) of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady and vice versa. Therefore, although we considered only movement periods of the trials, we had to predict both movement and nonmovement for each hand.

5.2 Data Preprocessing and Modeling. The recordings of spike times were made at 1 ms precision, and hand positions were recorded at 500 Hz. Our kernel method is suited for use with windows of observed neural activity as time series and with the movement values at the ends of those windows. We therefore chose to partition spike trains into bins, count the spikes in each bin, and use a running window of such spike counts as our data, with the hand velocities at the end of the window as the label. Thus, a labeled example $(S^t, v_t)$ for time $t$ consisted of the X (left-right) or Y (forward-backward) velocity of the left or right hand as the target label $v_t$ and the preceding window of spike counts from all $q$ cortical units as the input sequence $S^t$. In one such preprocessing, we chose a running window of 10 bins of size 100 ms each (a 1-second-long window), and the time interval between two such windows was 100 ms. Two such consecutive examples would then have nine time bins of overlap. For example, the number of cortical units $q$ in the first data set was 43 (27 single + 16 multiple), and the total length of all the trials used in that data set is 529 seconds. Hence, in that session, with the above preprocessing, there are 5290 consecutive examples, where each is a 43 × 10 matrix of spike counts, and there are 4 sequences of movement velocities (X/Y and right/left hand) of the same length.

Before we could train an algorithm, we had to specify the time bin size, the number of bins, and the time interval between two observation windows.
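Schematically, the preprocessing and SVR training stages just described can be summarized by the following sketch (ours; synthetic data and hypothetical sizes, with scikit-learn's SVR on a precomputed Gram matrix standing in for whatever SVM regression package was actually used). The exponential baseline kernel is used here for brevity; the Spikernel sketch above could be plugged in instead:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
q, n_bins = 43, 10                                    # 43 units, 10 bins of 100 ms (1 s windows)
spikes = rng.poisson(0.9, size=(600, q))              # synthetic spike counts, one row per 100 ms bin
vel = rng.standard_normal(600)                        # synthetic velocity target per bin (stand-in)

# Running windows S_(t-l+1):t paired with the velocity at the end of each window,
# stepping by one 100 ms bin so that consecutive windows overlap in nine bins.
X = np.stack([spikes[t - n_bins + 1 : t + 1] for t in range(n_bins - 1, len(spikes))])
y = vel[n_bins - 1 :]

def exp_kernel(S, T, gamma=0.01):
    """Exponential baseline kernel K(S, T) = exp(-gamma * ||S - T||^2) on whole windows."""
    return np.exp(-gamma * float(((S - T) ** 2).sum()))

def gram(A, B):
    return np.array([[exp_kernel(S, T) for T in B] for S in A])

split = len(X) * 4 // 5                               # one fold: 80% training, 20% held out
svr = SVR(kernel="precomputed", C=5.0, epsilon=0.1)   # C and epsilon as in Table 2's exponential row
svr.fit(gram(X[:split], X[:split]), y[:split])
y_hat = svr.predict(gram(X[split:], X[:split]))       # predictions for the held-out windows
```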
Furthermore, each kernel employs a few parameters (for the Spikernel, they are $\lambda$, $\mu$, $n$, and $p_i$, though we chose $p_i = 1$ a priori), and the SVM regression setup requires setting two more parameters, $\varepsilon$ and $C$. In order to test the algorithms, we first had to establish which parameters work best. We used the first half of data set 1 in fivefold cross-validation to choose the best preprocessing and kernel parameters, by trying a wide variety of values for each parameter and picking the combinations that produced the best predictions on the validation sets. For all kernels, the prediction quality as a function of the parameters was single peaked, and only that set of parameters was chosen per kernel for testing on the rest of the data.

Table 2: Parameter Values Tried for Each Kernel.

Kernel                       Parameters
Spikernel                    λ ∈ {0.5, 0.7, 0.9, 0.99}, µ ∈ {0.7, 0.9, 0.99, 0.999}, n ∈ {3, 5, 10}, p_i = 1, C = 0.1, ε = 0.1
Exponential                  γ ∈ {0.1, 0.01, 0.001, 0.005, 0.0001}, C ∈ {1, 5, 10}, ε = 0.1
Polynomial, second degree    C ∈ {0.001, 10^-4, 10^-5}, ε = 0.1
Polynomial, third degree     C ∈ {10^-4, 10^-5, 10^-6}, ε = 0.1
Linear                       C ∈ {100, 10, 1, 0.1, 0.01, 0.001, 10^-4}, ε = 0.1

Notes: Not all combinations were tried for the Spikernel: n was chosen to be half the number of bins, and some peripheral combinations of µ and λ were not tried either.

Having tuned the parameters, we then used the rest of data set 1 and the other three data sets to learn and predict the movement velocities. Again, we employed fivefold cross-validation on each data set to obtain accuracy results on all the examples. The fivefold cross-validation was produced by randomly splitting the trials into five groups: four of the five groups were used for training, and the rest of the data was used for evaluation. This process was repeated five times by using each fifth of the data once as a test set.

The kernels that we tested are the Spikernel, the exponential kernel $K(S, T) = e^{-\gamma \|S - T\|^2}$, the homogeneous polynomial kernel $K(S, T) = (S \cdot T)^d$, $d = 2, 3$, and the standard scalar product kernel $K(S, T) = S \cdot T$, which boils down to linear regression. In our initial work, we also tested polynomial kernels of higher order and nonhomogeneous polynomial kernels, but as these kernels produced worse results than those we describe here, we did not pursue further testing.

During parameter fitting (on the first half of data set 1), we tried all combinations of bin sizes {50 ms, 100 ms, 200 ms} and numbers of bins {5, 10, 20}, except for combinations of bin size and number of bins resulting in windows of length 2 s or more (since full information regarding the required movement was not available to the monkey 2 s before the Go Signal, and therefore neural activity at that time was probably not informative). The time interval between two consecutive examples was set to 100 ms. For each preprocessing, we then tried each kernel with the parameter values described in Table 2.

6 Results

Prediction samples of the Spikernel and linear regression are shown in Figure 3. The samples are of consecutive trials taken from data set 4 without prior inspection of their quality.

Figure 3: Sample of actual movement, Spikernel prediction, and linear regression prediction of left-hand velocity in data set 4 in a few trials: (top) $v_y$, (bottom) $v_x$ (both normalized).
Only intervals between Go Signal and Target Reach are shown (separated by vertical lines). Actual movement is shown in thick gray, the Spikernel in thick black, and linear regression in thin black. The time axis is seconds of accumulated prediction intervals from the beginning of the session.

We show the Spikernel and linear regression as prediction quality extremes. Both methods seem to underestimate the peak values and are somewhat noisy when there is no movement. The Spikernel prediction is better in terms of mean absolute error. It is also smoother than the linear regression. This may be due to the time warping quality of the features employed.

We computed the correlation coefficient ($r^2$) between the recorded and predicted velocities per fold for each kernel. The results are shown in Figure 4 as scatter plots. Each circle compares the Spikernel correlation coefficient scores to those of one of the other kernels. There are five folds for each of the four data sets and four correlation coefficients for each fold (one for each movement direction and hand), making up 80 scores (and circles in the plots) for each of the kernels compared with the Spikernel. Results above the diagonal are cases where the Spikernel outperformed. Note that though the correlation coefficient is informative and considered a standard performance indicator, the SVR algorithm minimizes the $\varepsilon$-insensitive loss. The results in terms of this loss are qualitatively the same (i.e., the kernels are ranked by this measure in the same manner), and we therefore omit them.

Figure 4: Correlation coefficient comparisons of the Spikernel versus other kernels. Each scatter plot compares the Spikernel to one of the other kernels. Each circle shows the correlation coefficient values obtained by the Spikernel and by the other kernel in one fold of one of the data sets for one of the two axes of movement. (a) Compares the Spikernel to the exponential kernel. (b) Compares with the second-degree polynomial kernel. (c) Compares with the third-degree polynomial kernel. (d) Compares with linear regression. Results above the diagonals are cases where the Spikernel outperformed.

Table 3: Chosen Preprocessing Values, Parameter Values, and Mean Correlation Coefficient Across Folds for Each Kernel in Each Data Set.

Kernel Type     Preprocessing     Kernel              Day 1        Day 2        Day 3        Day 4        Total
                                  Parameters          Lh    Rh     Lh    Rh     Lh    Rh     Lh    Rh     Mean
Spikernel       10 bins, 100 ms   µ=.99, λ=.7,     X  .71   .79    .74   .63    .77   .63    .87   .79    .72
                                  n=5, C=1         Y  .75   .70    .67   .66    .69   .66    .79   .71
Exponential     5 bins, 200 ms    γ=.01, C=5       X  .67   .76    .69   .45    .73   .57    .85   .74    .67
                                                   Y  .71   .67    .56   .61    .66   .63    .74   .65
Polynomial,     10 bins, 100 ms   C=10^-4          X  .67   .76    .70   .51    .72   .54    .85   .74    .68
second degree                                      Y  .70   .68    .61   .63    .65   .62    .76   .68
Polynomial,     10 bins, 100 ms   C=10^-5          X  .65   .72    .66   .51    .69   .52    .84   .71    .66
third degree                                       Y  .67   .68    .61   .62    .63   .61    .75   .64
Linear          10 bins, 100 ms   C=0.01           X  .54   .65    .67   .35    .62   .34    .75   .57    .57
(dot product)                                      Y  .66   .64    .50   .57    .51   .51    .70   .62

Notes: For each kernel, the upper row gives the X-velocity correlation coefficients (LhX, RhX) and the lower row the Y-velocity coefficients (LhY, RhY). Total means (across data sets, folds, hands, and movement directions) are computed for each kernel. The Spikernel outperforms the rest, and the nonlinear kernels outperform the linear regression in all data sets.
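The per-fold scores reported in Figure 4 and Table 3 are squared correlation coefficients between recorded and predicted velocity traces; a minimal sketch (ours, not the authors' evaluation code) of this measure is:

```python
import numpy as np

def fold_score(v_recorded, v_predicted):
    """Squared Pearson correlation coefficient (r^2) between recorded and predicted velocities,
    the per-fold quantity compared across kernels in Figure 4."""
    return float(np.corrcoef(v_recorded, v_predicted)[0, 1] ** 2)
```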
Table 3 shows the best bin sizes, numbers of bins, and kernel parameters that were found on the first half of data set 1 and the mean correlation coefficients obtained with those parameters on the rest of the data. Some of the learning problems were easier than others (we observed larger differences between data sets than between folds of the same data set). Contrary to our preliminary findings (Shpigelman, Singer, Paz, & Vaadia, 2003), there was no significantly easier axis to learn. The mean scores define a ranking order on the kernels. The Spikernel produced the best mean results, followed by the polynomial second-degree kernel, the exponential kernel, the polynomial third-degree kernel, and finally, the linear regression (dot product kernel). To determine how consistent the ranking of the kernels is, we computed the difference in correlation coefficient scores between each pair of kernels in each fold and determined the 95% confidence interval of these differences. The Spikernel is significantly better than the other kernels and the linear regression (the mean difference is larger than the confidence interval). The other nonlinear kernels (exponential and polynomial second and third degree) are also significantly better than linear regression, but the differences between these kernels are not significant.

For all kernels, windows of 1 s were chosen over windows of 0.5 s as the best preprocessing (based on the parameter training data). However, within the 1 s window, different bin sizes were optimal for the different kernels. Specifically, for the exponential kernel, 5 bins of size 200 ms were best, and for the rest, 10 bins of 100 ms were best. Any processing of the data (such as time binning or PCA) can only cause loss of information (Cover & Thomas, 1991) but may aid an algorithm. The question of the cells' effective time resolution is not answered here, but for the purpose of linear regression and SVR (with the kernels tested) on our MI data, it seems that binning was necessary. In many cases, a poststimulus time histogram, averaged over many trials, is used to get an approximation of firing rate that is accurate in both value and position in time (assuming that one can replicate the same experimental condition many times). In single-trial reconstruction, this approach cannot be used, and the trade-off between time precision and accurate approximation of rate must be balanced. To better understand what the right bin size is, we retested the Spikernel with bin sizes of 50 ms, 100 ms, and 200 ms and numbers of bins 5, 10, and 20, leaving out combinations resulting in windows of 2 seconds or more. The mean correlation coefficients (over all the data) are shown in Table 4.

Table 4: Mean Correlation Coefficients (Over All Data) for the Spikernel with Different Combinations of Bin Sizes and Numbers of Bins.

Number of Bins    50 ms    100 ms    200 ms
5 bins            .52      .65       .68
10 bins           .66      .72       --
20 bins           .66      --        --

Note: The best fit to the data is still 10 bins of 100 ms. Combinations resulting in windows of 2 s or more were not tested.

The 10 bins by 100 ms configuration is optimal for the whole data as well as for the parameter fitting step. Note that since the average spike count per second was less than 10, the average spike count in bins of 50 ms to 200 ms was 0.5 to 2 spikes. In this respect, the data supplied to the algorithms are on the border between a rate and a spike train (with crude resolution).
Run-time issues were not the focus of this study, and little effort was made to make the code implementation as efficient as possible. Run time for the exponential and polynomial kernels (prediction time) was close to real time (the time it took for the monkey to make the movements) on a Pentium 4, 2.4 GHz CPU. Run time for the Spikernel is approximately 100 times longer than for the exponential kernel. Thus, real-time use is currently impossible.

The main contribution of the Spikernel is the allowance of time warping. To better understand the effect of $\lambda$ on Spikernel prediction quality, we retested the Spikernel on all of the data, using 10 bins of size 100 ms ($n = 5$, $\mu = 0.99$) and $\lambda \in \{0.5, 0.7, 0.9\}$. The mean correlation coefficients were $\{0.68, 0.72, 0.66\}$, respectively. We also retested the exponential kernel with $\gamma = 0.005$ on all the data with 5 bins of size 100 ms. This would correspond to a Spikernel with its previous parameters ($\mu = 0.99$, $n = 5$) but no time warping. The correlation coefficient achieved was 0.64. Again, the relevance of time warping to neural activity cannot be directly addressed by our method. However, we can safely conclude that consideration of time-warped neural activity (in the Spikernel sense) does improve prediction. We believe that it would make sense to assume that this is relevant to cells as well, as mentioned in section 4.1, because the 3D structure of a cell's dendritic tree can align integration of differently placed inputs from different times.

7 Conclusion

In this article, we described an approach based on recent advances in kernel-based learning for predicting response variables from neural activities. On the data we collected, all the kernels we devised outperform the standard scalar product that is used in linear regression. Furthermore, the Spikernel, a biologically motivated kernel operator, consistently outperforms the other kernels. The main contribution of this kernel is the allowance of time warping when comparing two sequences of population spike counts. Our current research is focused on two directions. First, we are investigating adaptations of the Spikernel to other neural activities, such as local field potentials, and the development of kernels for spike trains. Our second goal is to devise statistical learning algorithms that use the Spikernel as part of a dynamical system that may incorporate biofeedback. We believe that such extensions are important and necessary steps toward operational neural prostheses.

Acknowledgments

We thank the anonymous reviewers for their constructive criticism. This study was partly supported by a Center of Excellence grant (8006/00) administered by the ISF, by BMBF-DIP, and by the United States-Israel Binational Science Foundation. L.S. is supported by a Horowitz fellowship.

References

Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899-1907.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411-7425.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2, 664-670.
Cover, T. M., & Thomas, J. A. (1991).
Elements of information theory. New York: Wiley.
Donchin, O., Gribova, A., Steinberg, O., Bergman, H., & Vaadia, E. (1998). Primary motor cortex is involved in bimanual coordination. Nature, 395, 274-278.
Fu, Q. G., Flament, D., Coltz, J. D., & Ebner, T. J. (1997). Relationship of cerebellar Purkinje cell simple spike discharge to movement kinematics in the monkey. Journal of Neurophysiology, 78, 478-491.
Genton, M. G. (2001). Classes of kernels for machine learning: A statistical perspective. Journal of Machine Learning Research, 2, 299-312.
Georgopoulos, A. P., Kalaska, J., & Massey, J. (1983). Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research (Supp.), 7, 327-336.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. Journal of Neuroscience, 8, 2913-2947.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416-1419.
Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz, CA: University of California.
Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8, 196-198.
Jaakkola, T. S., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Laubach, M., Wessberg, J., & Nicolelis, M. A. (2000). Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. Nature, 405(1), 141-154.
Laubach, M., Shuler, M., & Nicolelis, M. A. (1999). Independent component analyses for quantifying neuronal ensemble interactions. J. Neurosci. Methods, 94, 141-154.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. C. H. (2000). Text classification using string kernels. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Magee, J. (2000). Dendritic integration of excitatory synaptic input. Nat. Rev. Neuroscience, 1, 181-190.
Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology, 82, 2676-2692.
Nicolelis, M. A. (2001). Actions from thoughts. Nature, 409(18), 403-407.
Nicolelis, M. A., Ghazanfar, A. A., Faggin, B. M., Votaw, S., & Oliveira, L. M. (1997). Reconstructing the engram: Simultaneous, multi-site, many single neuron recordings. Neuron, 18, 529-537.
Paz, R., Boraud, T., Natan, C., Bergman, H., & Vaadia, E. (2003). Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8), 882-890.
Pouget, A., & Snyder, L. (2000). Computational approaches to sensorimotor transformations. Nature Neuroscience, 8, 1192-1198.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950-1952.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540-542.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical insights elucidate recent direct observations. TINS, 21, 453-459.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141-142.
Shpigelman, L., Singer, Y., Paz, R., & Vaadia, E. (2003). Spikernels: Embedding spiking neurons in inner product spaces. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199-222.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829-1832.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515-518.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Watkins, C. (1999). Dynamic alignment kernels. In P. Bartlett, B. Schölkopf, D. Schuurmans, & A. Smola (Eds.), Advances in large margin classifiers (pp. 39-50). Cambridge, MA: MIT Press.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(16), 361-365.