
Spikernels: Predicting Arm Movements by Embedding
Population Spike Rate Patterns in Inner-Product Spaces
Lavi Shpigelman
shpigi@cs.huji.ac.il
School of Computer Science and Engineering and Interdisciplinary Center for Neural
Computation, Hebrew University, Jerusalem 91904, Israel
Yoram Singer
singer@cs.huji.ac.il
School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
Rony Paz
ronyp@hbf.huji.ac.il
Eilon Vaadia
eilon@hbf.huji.ac.il
Interdisciplinary Center for Neural Computation and Department of Physiology,
Hadassah Medical School, The Hebrew University, Jerusalem 91904, Israel
Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The
focus of this letter is the construction of biologically motivated kernels for
cortical activities. The kernels we derive, termed Spikernels, map spike
count sequences into an abstract vector space in which we can perform
various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any
two sequences of neural population spike counts. We demonstrate the
merits of our modeling approach by comparing the Spikernel to various
standard kernels in the task of predicting hand movement velocities from
cortical recordings. All of the kernels that we tested in our experiments
outperform the standard scalar product used in linear regression, with
the Spikernel consistently achieving the best performance.
1 Introduction
Neuronal activity in primary motor cortex (MI) during multijoint arm reaching movements in 2D and 3D (Georgopoulos, Schwartz, & Kettner, 1986;
Georgopoulos, Kettner, & Schwartz, 1988) and drawing movements
(Schwartz, 1994) has been used extensively as a test bed for gaining understanding of neural computations in the brain. Most approaches assume
that information is coded by firing rates, measured on various timescales.
The tuning curve approach models the average firing rate of a cortical unit
as a function of some external variable, like the frequency of an auditory
stimulus or the direction of a planned movement. Many studies of motor cortical areas (Georgopoulos, Kalaska, & Massey, 1983; Georgopoulos
et al., 1988; Moran & Schwartz, 1999; Schwartz, 1994; Laubach, Wessberg, &
Nicolelis, 2000) showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds)
carries enough information to allow for accurate prediction. Such broad
tuning can be found in many parts of the nervous system, suggesting that
computation by distributed populations of cells is a general cortical feature.
The population vector method (Georgopoulos et al., 1983, 1988) describes
each cell’s firing rate as the dot product between that cell’s preferred direction and the direction of hand movement. The vector sum of preferred
directions, weighted by the measured firing rates, is used both as a way of
understanding what the cortical units encode and as a means for estimating
the velocity vector.
Several recent studies (Fu, Flament, Coltz, & Ebner, 1997; Donchin, Gribova, Steinberg, Bergman, & Vaadia, 1998; Schwartz, 1994) propose that
neurons can represent or process multiple parameters simultaneously, suggesting that it is the dynamic organization of the activity in neuronal populations that may represent temporal properties of behavior such as the
computation of transformation from desired action in external coordinates
to muscle activation patterns. A few studies (Vaadia et al., 1995; Laubach,
Shuler, & Nicolelis, 1999; Riehle, Grün, Diesmann, & Aertsen, 1997) support
the notion that neurons can associate and dissociate rapidly to functional
groups in the process of performing a computational task. The concepts of
simultaneous encoding of multiple parameters and dynamic representation
in neuronal populations together could explain some of the conundrums
in motor system physiology. These concepts also invite the use of increasingly complex models for relating neural activity to behavior. Advances
in computing power and recent developments of physiological recording
methods allow recording of ever growing numbers of cortical units that
can be used for real time analysis and modeling. These developments and
new understandings have recently been used to reconstruct movements on
the basis of neuronal activity in real time in an effort to facilitate the development of hybrid brain-machine interfaces that allow interaction between
living brain tissue and artificial electronic or mechanical devices to produce
brain-controlled movements (Chapin, Moxon, Markowitz, & Nicolelis, 1999;
Laubach et al., 2000; Nicolelis, 2001; Wessberg et al., 2000; Laubach et al.,
1999; Nicolelis, Ghazanfar, Faggin, Votaw, & Oliveira, 1997; Isaacs, Weber,
& Schwartz, 2000).
Most of the current attempts that focus on predicting movement from
cortical activity rely on modeling techniques that employ parametric models to describe a neuron’s tuning (instantaneous spike rate) as a function of
current behavior. For instance, cosine-tuning estimation (population vector) is used by Taylor, Tillery, and Schwartz (2002), and linear regression
was employed by Wessberg et al. (2000) and Serruya, Hatsopoulos, Paninski, Fellows, and Donoghue (2002). Brown, Frank, Tang, Quirk, and Wilson
(1998) and Brockwell, Rojas, and Kass (2004) have applied filtering methods
that, apart from assuming parametric models of spike rate generation, also
incorporate a statistical model of the movement. An exception is the use of
artificial neural nets (Wessberg et al., 2000), though that study reports obtaining
better results with linear regression. This article describes the tailoring of a
kernel method for the task of predicting two-dimensional hand movements.
The main difference between our method and the other approaches is that
our nonparametric approach does not assume cell independence, imposes
relatively few assumptions on the tuning properties of the cells and on how the
neural code is read, allows behavior to be represented by highly specific
neural patterns, and, as we demonstrate later, yields better predictions on
our test sets. However, due to the implicit manner in which neural activity is
mapped to feature space, improved results are often achieved at the expense
of understanding tuning of individual cells. We attempt to partially overcome this difficulty by describing the feature space that our kernel induces
and by explaining the way our kernel parameters affect the features.
The letter is organized as follows. In section 2, we describe the problem setting that this article is concerned with. In section 3, we introduce
and explain the main mathematical tool that we use: the kernel operator. In
section 4, we discuss the design and implementation of a biologically motivated kernel for neural activities. The experimental method is described in
section 5. We report experimental results in section 6 and give conclusions
in section 7.
2 Problem Setting
Consider the case where we monitor instantaneous spike rates from q cortical units during physical motor behavior of a subject (the method used for
translating recorded spike times into rate measures is explained in section 5).
Our goal is to learn a predictive model of some behavior parameter (such
as hand velocities, as described in section 5) with the cortical activity as the
input. Formally, let $S \in \mathbb{R}^{q \times m}$ be a sequence of instantaneous firing rates from $q$ cortical units consisting of $m$ time samples. We use $S$ and $T$ to denote sequences of firing rates and denote by $\mathrm{len}(S)$ the time length of a sequence $S$ ($\mathrm{len}(S) = m$ in the above definition). $S_i \in \mathbb{R}^q$ designates the instantaneous firing rates of all neurons at time $i$ in the sequence $S$, and $S_{i,k} \in \mathbb{R}$ the rate of neuron $k$. We also use $Ss$ to denote the concatenation of $S$ with one more sample $s \in \mathbb{R}^q$. The instantaneous firing rate of a unit $k$ in sample $s$ is then
$s_k \in \mathbb{R}$. We also need to employ a notation for subsequences. A consecutive subsequence of $S$ from time $t_1$ to time $t_2$ is denoted $S_{t_1:t_2}$. Finally, throughout the work, we need to examine possibly nonconsecutive subsequences. We denote by $\mathbf{i} \in \mathbb{Z}^n$ a vector of time indices into the sequence $S$ such that $1 \le i_1 < i_2 < \cdots < i_n \le \mathrm{len}(S)$. $S_{\mathbf{i}} \in \mathbb{R}^{q \times n}$ is then an ordered selection of spike rates (of all neurons) from times $\mathbf{i}$.

Let $y \in \mathbb{R}^m$ denote a parameter of the movement that we would like to predict (e.g., the movement velocity in the horizontal direction, $v_x$). Our goal is to learn an approximation $\hat{y}$ of the form $f: \mathbb{R}^{q \times m} \rightarrow \mathbb{R}^m$ from neural firing rates to movement parameter. In general, information about movement can be found in neural activity both before and after the time of the movement itself. Our long-term plan, though, is to design a model that can be used for controlling a neural prosthesis. We therefore confine ourselves to causal predictors that use an $l$-long ($l \in \mathbb{Z}$) window of neural activity ending at time step $t$, $S_{(t-l+1):t}$, to predict $y_t \in \mathbb{R}$. We would like to make $\hat{y}_t = f_t\left(S_{(t-l+1):t}\right)$ as close as possible (in a sense that is explained in the sequel) to $y_t$.
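As a toy illustration of this setup (the dimensions and the random data below are hypothetical and only mirror the notation above), the following sketch builds the causal examples $(S_{(t-l+1):t}, y_t)$ from a spike-rate matrix:

```python
import numpy as np

# Hypothetical sizes: q = 43 units, m = 60 time samples, window length l = 10.
q, m, l = 43, 60, 10
rng = np.random.default_rng(0)
S = rng.poisson(1.0, size=(q, m)).astype(float)  # S in R^{q x m}: rate of unit k at time i is S[k, i]
y = rng.standard_normal(m)                       # movement parameter (e.g., v_x) at each time sample

# Causal examples: the window S_{(t-l+1):t} (a q x l matrix) is paired with y_t.
examples = [(S[:, t - l + 1 : t + 1], y[t]) for t in range(l - 1, m)]
window, target = examples[0]
print(window.shape, target)  # (43, 10) and the velocity at the window's last sample
```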
3 Kernel Method for Regression
The major mathematical tool employed in this article is kernel operators.
Kernel operators allow algorithms whose interface to the data is limited to
scalar products to employ complicated premappings of the data into feature
spaces by use of kernels. Formally, a kernel is an inner-product operator $K: X \times X \rightarrow \mathbb{R}$, where $X$ is some arbitrary vector space. In this work, $X$ is the space of neural activities (specifically, population spike rates). An explicit way to describe $K$ is via a mapping $\phi: X \rightarrow H$ from $X$ to a feature space $H$ such that $K(x, x') = \phi(x) \cdot \phi(x')$. Given a kernel operator, we can use
it to perform various statistical learning tasks. One such task is support
vector regression (SVR) (see, e.g., Smola & Schölkopf, 2004) which attempts
to find a regression function for target values that is linear if observed in
the (typically very high-dimensional) feature space mapped by the kernel.
We give here a brief description of SVR for clarity.
SVR employs the ε-insensitive loss function (Vapnik, 1995). Formally, this
loss is defined as
$$\left| y_t - f\left(S_{(t-l+1):t}\right) \right|_{\varepsilon} = \max\left\{ 0,\; \left| y_t - f\left(S_{(t-l+1):t}\right) \right| - \varepsilon \right\}.$$
Examples that fall within ε of the target value (yt ) do not contribute to the
error. Those that fall further away contribute linearly to the loss (see Figure
1, left). The form of the regression model is
$$f\left(S_{(t-l+1):t}\right) = \mathbf{w} \cdot \phi\left(S_{(t-l+1):t}\right) + b, \qquad (3.1)$$
which is linear in the feature space H and is implicitly defined by the kernel.
Combined with a loss function, f defines a hyperslab of width ε centered
at the estimate (see Figure 1, right, for a single-dimensional illustration).
Figure 1: Illustration of the ε-insensitive loss (left) and an ε-insensitive area around a linear regression in a single-dimensional feature space (right). Examples that fall within distance ε have zero loss (shaded area).
For each trial $i$ and time index $t$ designating a frame within the trial, we received a pair $\left(S^i_{(t-l+1):t},\, y^i_t\right)$ consisting of spike rates and a corresponding target value. The optimization problem employed by SVR is

$$\arg\min_{\mathbf{w}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i,t} \left| y^i_t - f\left(S^i_{(t-l+1):t}\right) \right|_{\varepsilon}, \qquad (3.2)$$
where $\|\cdot\|^2$ is the squared vector norm. This minimization casts a trade-off (weighted by $C \in \mathbb{R}$) between a small empirical error and a regularization term that favors low sensitivity to feature values.
Let $\phi(x)$ denote the feature vector that is implicitly implemented by the kernel function $K(\cdot, x)$. Then there exists a regression model that is equivalent to the one given in equation 3.1, minimizes equation 3.2, and takes the form

$$f(T) = \sum_{t,i} A_{t,i}\, K\left(S^i_{(t-l+1):t}, T\right) + b.$$
Here, T is the observed neural activity, and A is a (typically sparse) matrix of
Lagrange multipliers determined by the minimization problem in equation
3.2.
In summary, SVR solves a quadratic optimization problem aimed at finding a linear regressor in an induced high-dimensional feature space. This
regressor is the best penalized estimate for the ε-insensitive loss function
where the penalty is the squared norm of the regressor’s weights.
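As an illustration only (the helper name and the parameter values below are not from the original study), support vector regression with an arbitrary precomputed kernel can be set up along these lines, for example with scikit-learn's SVR, which implements the ε-insensitive loss described above:

```python
import numpy as np
from sklearn.svm import SVR

def fit_svr_with_kernel(kernel_fn, train_windows, train_targets, C=0.1, epsilon=0.1):
    """Fit epsilon-insensitive SVR using a precomputed Gram matrix.

    kernel_fn(S, T) -> float can be any kernel on spike-count windows
    (e.g., the Spikernel of section 4); train_windows is a list of windows.
    """
    gram = np.array([[kernel_fn(a, b) for b in train_windows] for a in train_windows])
    model = SVR(kernel="precomputed", C=C, epsilon=epsilon)
    model.fit(gram, train_targets)
    return model

# To predict for new windows, build the rectangular matrix of kernel values
# against the training windows and pass it to model.predict(...).
```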
4 Spikernels
The quality of kernel-based learning is highly dependent on how the data
are embedded in the feature space via the kernel operator. For this reason,
several studies have been devoted to developing new kernels (Jaakkola &
Haussler, 1998; Genton, 2001; Lodhi, Shawe-Taylor, Cristianini, & Watkins,
2000). In fact, a good kernel could render the work of a classification or
regression algorithm trivial. With this in mind, we develop a kernel for
neural spike rate activity.
Figure 2: Illustrative examples of pattern similarities. (a) Bin-by-bin comparison yields small differences. (b) Patterns with large bin-by-bin differences that can be eliminated with some time warping. (c) Patterns whose suffix (time of interest) is similar and prefix is different.
4.1 Motivation. Our goal in developing a kernel for spike trains is to
map similar patterns to nearby areas of the feature space. In the description
of our kernel, we attempt to capture some well-accepted notions on similarities between spike trains. We make the following assumptions regarding
similarities between spike patterns:
• A commonly made assumption is that firing patterns that have small
differences in a bin-by-bin comparison are similar. This is the basis of
linear regression. In Figure 2a, we show an example of two patterns
that are bin-wise similar.
• A cortical population may display highly specific patterns to represent
specific information such as a particular stimulus or hand movement.
Though in this work we make use of spike counts within each time
bin, we assume that a pattern of spike counts may be specific to an
external stimulus or action and that the manner in which these patterns
change as a function of the behavior may be highly nonlinear (Segev
& Rall, 1998). Many models of neuronal computation are based on this
assumption (see, e.g., Pouget & Snyder, 2000).
• Two patterns may be quite different from a simple bin-wise perspective, yet if they are aligned using a nonlinear time distortion or shifting, the similarity becomes apparent. An illustration of such patterns
is given in Figure 2b. In comparing patterns, we would like to induce
a higher score when the time shifts are small. Some biological plausibility for time-warped integration of inputs may stem from the spatial
distribution of synaptic inputs on cortical cells’ dendrites. In order for
two signals to be integrated along a dendritic tree or at the axon hillock,
their sources must be appropriately distributed in space and time, and
this distribution may be sensitive to the current depolarization state
of the tree (Magee, 2000).
• Patterns that are associated with identical values of an external stimulus at time t may be similar at that time but rather different at nearby times, when the values of the external stimulus for these patterns are no longer similar (as illustrated in Figure 2c). We would like to give a higher similarity score to patterns that are similar around a time of interest (prediction time), even if they are rather different at other times.
4.2 Spikernel Definition. We describe our kernel construction, the Spikernel, by specifying the features that make up the feature space. Our construction of the feature space builds on the work of Haussler (1999), Watkins
(1999), and Lodhi et al. (2000). We chose to directly specify the features constituting the kernel rather than describe the kernel from a functional point
of view, as it provides some insight into the structure of the feature space.
We first need to introduce a few more notations. Let $S$ be a sequence of length $l = \mathrm{len}(S)$. The set of all possible $n$-long index vectors defining a subsequence of $S$ is $I_{n,l} = \{\mathbf{i}: \mathbf{i} \in \mathbb{Z}^n,\ 1 \le i_1 < \cdots < i_n \le l\}$. Also, let $d(\alpha, \beta)$, $(\alpha, \beta \in \mathbb{R}^q)$, denote a bin-wise distance over a pair of samples (population firing rates). We also overload notation and denote by $d(S_{\mathbf{i}}, U) = \sum_{k=1}^{n} d\left(S_{i_k}, U_k\right)$ a distance between sequences. The sequence distance is the sum of distances over the samples. One such function (the $d$ we explore here) is the squared $\ell_2$ norm. Let $\mu, \lambda \in (0, 1)$. The $U$ component of our (infinite) feature vector $\phi(S)$ is defined as

$$\phi^n_U(S) = \sum_{\mathbf{i} \in I_{n,\mathrm{len}(S)}} \mu^{d(S_{\mathbf{i}},U)}\, \lambda^{\mathrm{len}(S)-i_1}, \qquad (4.1)$$

where $i_1$ is the first entry in the index vector $\mathbf{i}$. In words, $\phi^n_U(S)$ is a sum
over all $n$-long subsequences of $S$. Each subsequence $S_{\mathbf{i}}$ is compared to $U$ (the feature coordinate) and contributes to the feature value by adding a number that is proportional to the bin-wise similarity between $S_{\mathbf{i}}$ and $U$ and is inversely proportional to the amount of stretching $S_{\mathbf{i}}$ needs to undergo.
That is, the feature entry indexed by U measures how similar U is to the
latter part of the time series S.
This definition seems to fit our assumptions on neural coding for the
following reasons:
• A feature $\phi^n_U(S)$ gets large values only if $U$ is bin-wise similar to subsequences of the pattern $S$.
• It allows for learning complex patterns: choosing small values of λ and
µ (or concentrated d measures) means that each feature tends toward
being either 1 or 0, depending on whether the feature coordinate U is
almost identical to a suffix of S.
• We allow gaps in the indexes defining subsequences, thus allowing for
time warping.
• Subpatterns that begin further from the required prediction time are
penalized by an exponentially decaying weight; thus, more weight is
given to activities close to the prediction time (the end of the sequence),
which designates our time of interest.
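For intuition about the subsequence structure underlying these features, the index vectors $I_{n,l}$ and the decay weights of equation 4.1 can be enumerated directly for a toy case (the values of n, l, and λ below are arbitrary):

```python
from itertools import combinations

# All n-long index vectors in I_{n,l} for n = 3, l = 5 (1-based indices, as in the text).
# Each subsequence is weighted by lambda ** (l - i_1): the earlier the subsequence
# starts, the smaller its contribution to the feature value.
n, l, lam = 3, 5, 0.7
for i in combinations(range(1, l + 1), n):
    print(i, "weight:", round(lam ** (l - i[0]), 4))
```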
4.3 Efficient Kernel Calculation. The definition of φ given by equation
4.1 requires the manipulation of an infinite feature space and an iteration
over all subsequences that grows exponentially with the lengths of the sequences. Straightforward calculation of the feature values and performing
the induced inner product is clearly impossible. Building on ideas from
Lodhi et al. (2000), who describe a similar kernel (for finite sequence alphabets and no time of interest), we developed an indirect method for evaluating the kernel through a recursion that can be performed efficiently using
dynamic programming. The solution we now describe is rather general, as
it employs an arbitrary base feature vector ψ. We later describe our specific design choice for ψ.
The definition in equation 4.1 is a special case of the following feature definition,

$$\phi^n_U(S) = \sum_{\mathbf{i} \in I_{n,\mathrm{len}(S)}} \left( \prod_k \psi_{U_k}\left(S_{i_k}\right) \right) \lambda^{\mathrm{len}(S)-i_1}, \qquad (4.2)$$

where $n$ specifies the length of the subsequences, and for the Spikernel, we substitute

$$\psi_{U_k}\left(S_{i_k}\right) = \mu^{d\left(S_{i_k}, U_k\right)}.$$
The feature vector of equation 4.2 satisfies the following recursion:

$$\phi^n(S) = \begin{cases} \displaystyle\sum_{i=1}^{\mathrm{len}(S)} \lambda^{\mathrm{len}(S)-i+1}\, \psi(S_i)\, \phi^{n-1}\left(S_{1:i-1}\right) & \mathrm{len}(S) \ge n > 0 \\ \mathbf{0} & \mathrm{len}(S) < n,\ n > 0 \\ \mathbf{1} & n = 0 \end{cases} \qquad (4.3)$$

The recursive form is derived by decomposing the sum over all subsequences into a sum over the last subsequence index and a sum over the rest of the indices. The fact that $\phi^n(S)$ is the zero vector for sequences of length less than $n$ is due to equation 4.2. The value of $\phi^n$ in the last case, $n = 0$, is defined to be the all-one vector.
The kernel that performs the dot product between two feature vectors is then defined as

$$K^n(S, T) = \phi^n(S) \cdot \phi^n(T) = \int_{\mathbb{R}^{q \times n}} \phi^n_U(S)\, \phi^n_U(T)\, dU,$$

where $n$ designates subsequences of length $n$ as before. We now plug in the recursive feature vector definition from equation 4.3:

$$K^n(S, T) = \int_{\mathbb{R}^{q \times n}} \left( \sum_{i=1}^{\mathrm{len}(S)} \lambda^{\mathrm{len}(S)-i+1}\, \psi_{U_n}(S_i)\, \phi^{n-1}_{U_{1:n-1}}\left(S_{1:i-1}\right) \right) \times \left( \sum_{j=1}^{\mathrm{len}(T)} \lambda^{\mathrm{len}(T)-j+1}\, \psi_{U_n}(T_j)\, \phi^{n-1}_{U_{1:n-1}}\left(T_{1:j-1}\right) \right) dU. \qquad (4.4)$$
Next, we note that

$$\int_{\mathbb{R}^{q \times (n-1)}} \phi^{n-1}_U\left(S_{1:i-1}\right)\, \phi^{n-1}_U\left(T_{1:j-1}\right) dU = \phi^{n-1}\left(S_{1:i-1}\right) \cdot \phi^{n-1}\left(T_{1:j-1}\right) = K^{n-1}\left(S_{1:i-1}, T_{1:j-1}\right),$$

and also

$$\int_{\mathbb{R}^{q}} \psi_{U_k}(S_i)\, \psi_{U_k}(T_j)\, dU_k = \psi(S_i) \cdot \psi(T_j).$$

Defining $K^*(S_i, T_j) = \psi(S_i) \cdot \psi(T_j)$, we now insert the integration in equation
4.4 into the sums to get

$$\begin{aligned}
K^n(S, T) &= \sum_{i=1}^{\mathrm{len}(S)} \sum_{j=1}^{\mathrm{len}(T)} \lambda^{\mathrm{len}(S)+\mathrm{len}(T)-i-j+2}\, \psi(S_i) \cdot \psi(T_j)\;\; \phi^{n-1}\left(S_{1:i-1}\right) \cdot \phi^{n-1}\left(T_{1:j-1}\right) \\
&= \sum_{i=1}^{\mathrm{len}(S)} \sum_{j=1}^{\mathrm{len}(T)} \lambda^{\mathrm{len}(S)+\mathrm{len}(T)-i-j+2}\, K^*\left(S_i, T_j\right)\, K^{n-1}\left(S_{1:i-1}, T_{1:j-1}\right).
\end{aligned} \qquad (4.5)$$
We have thus derived a recursive definition of the kernel, which inherits
the initial conditions from equation 4.3:
$$\forall S, T:\quad K^0(S, T) = 1; \qquad K^i(S, T) = 0 \ \text{ if } \ \min\bigl(\mathrm{len}(S), \mathrm{len}(T)\bigr) < i. \qquad (4.6)$$
The kernel function K∗ (·, ·) may be any kernel operator and operates on
population spike rates sampled at single time points. To obtain the Spikernel
described above, we chose
$$K^*(\alpha, \beta) = \int_{\mathbb{R}^q} \mu^{d(u,\alpha)}\, \mu^{d(u,\beta)}\, du.$$
A more time-efficient description of the recursive computation of the kernel
may be obtained if we notice that the sums in equation 4.5 are used, up to
a constant λ, in calculating the kernel for prefixes of S and T. Caching these
sums yields
$$K^n(Ss, Tt) = \lambda K^n(S, Tt) + \lambda K^n(Ss, T) - \lambda^2 K^n(S, T) + \lambda^2 K^*(s, t)\, K^{n-1}(S, T),$$
where s and t are the last samples in the sequences (and the initial conditions
of equation 4.6 still hold).
Calculating $K^n(S, T)$ using the above recursion is computationally cheaper than equation 4.5 and comprises constructing a three-dimensional array, yielding an $O\bigl(\mathrm{len}(S)\,\mathrm{len}(T)\,n\bigr)$ dynamic programming procedure.
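The sketch below is one possible implementation of this dynamic program (it is our own illustrative code, not the authors' implementation; the function name and shape conventions are ours). It fills the three-dimensional array $K[n, i, j] = K^n(S_{1:i}, T_{1:j})$ using the cached recursion above, the initial conditions of equation 4.6, and the gaussian base kernel $K^*$ of section 4.4 with $c = 1$:

```python
import numpy as np

def spikernel(S, T, n_max, lam, mu):
    """Compute [K^1(S,T), ..., K^{n_max}(S,T)] by dynamic programming.

    S, T : arrays of shape (q, len_S) and (q, len_T); column i is the
           population spike-count vector of time bin i (as in section 2).
    """
    len_s, len_t = S.shape[1], T.shape[1]
    # Base kernel K*(alpha, beta) = mu ** (0.5 * ||alpha - beta||^2) for every pair of bins
    # (the gaussian choice of section 4.4, with the constant c set to 1).
    sq_dist = np.sum((S[:, :, None] - T[:, None, :]) ** 2, axis=0)
    k_star = mu ** (0.5 * sq_dist)
    # K[n, i, j] holds K^n(S_{1:i}, T_{1:j}); equation 4.6 gives K^0 = 1 and
    # K^n = 0 whenever one of the prefixes is shorter than n.
    K = np.zeros((n_max + 1, len_s + 1, len_t + 1))
    K[0] = 1.0
    for n in range(1, n_max + 1):
        for i in range(n, len_s + 1):
            for j in range(n, len_t + 1):
                K[n, i, j] = (lam * K[n, i - 1, j]
                              + lam * K[n, i, j - 1]
                              - lam ** 2 * K[n, i - 1, j - 1]
                              + lam ** 2 * k_star[i - 1, j - 1] * K[n - 1, i - 1, j - 1])
    return K[1:, len_s, len_t]
```

The triple loop visits every (n, i, j) cell once, so the cost is $O(\mathrm{len}(S)\,\mathrm{len}(T)\,n)$, matching the complexity stated above.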
4.4 Spikernel Variants. The kernels defined by equation 4.1 consider only patterns of fixed length ($n$). It makes sense to look at subsequences of various lengths. Since a linear combination of kernels is also a kernel, we can define our kernel to be

$$K(S, T) = \sum_{i=1}^{n} p_i\, K^i(S, T),$$

where $p \in \mathbb{R}^n_+$ is a vector of weights. The weighted kernel summation can be interpreted as performing the dot product between vectors that take the form $\left[\sqrt{p_1}\,\phi^1(\cdot), \ldots, \sqrt{p_n}\,\phi^n(\cdot)\right]$, where $\phi^i(\cdot)$ is the feature vector that kernel $K^i(\cdot, \cdot)$ represents. In the results, we use this definition with $p_i = 1$ (a better weighting was not explored).
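With the function sketched in section 4.3, which already returns all $K^i$ up to a maximal length, this weighted combination is a one-line sum; the uniform weights below reflect the $p_i = 1$ choice used in the results (the helper name is ours):

```python
# K(S, T) = sum_i p_i K^i(S, T); with p_i = 1 this is simply the sum of the
# fixed-length kernels returned by spikernel() (defined in the sketch above).
def combined_spikernel(S, T, n_max, lam, mu, p=None):
    values = spikernel(S, T, n_max, lam, mu)
    weights = [1.0] * n_max if p is None else p
    return float(sum(w * v for w, v in zip(weights, values)))
```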
Different choices of $K^*(\alpha, \beta)$ allow different methods of comparing population rate values (once the time bins for comparison have been chosen). We chose $d(\alpha, \beta)$ to be the squared $\ell_2$ norm: $\|\alpha - \beta\|_2^2 = \sum_{k=1}^{q} (\alpha_k - \beta_k)^2$. The kernel $K^*(\cdot, \cdot)$ then becomes

$$K^*(\alpha, \beta) = c\,\mu^{\frac{1}{2}\|\alpha-\beta\|_2^2},$$

with $c = \left(\sqrt{\frac{\pi}{-2\ln\mu}}\right)^q$; however, $c$ can be (and was) set to 1 without loss of generality. Note that $K^*(\alpha, \beta)$ can be written as $K^*(\alpha, \beta) = e^{-\frac{-\ln(\mu)}{2}\|\alpha-\beta\|_2^2}$ and is therefore a simple gaussian. It is also worth noticing that if $\lambda \approx 0$,

$$K^n(S, T) \rightarrow c\, e^{-\frac{-\ln(\mu)}{2}\|S-T\|_2^2}.$$
That is, in this limit, no time warping is allowed, and the Spikernel is reduced
to the standard exponential kernel (up to a constant c).
5 Experimental Method
5.1 Data Collection. The data used in this work were recorded from the
primary motor cortex of a rhesus (Macaca mulatta) monkey (approximately
4.5 kg). The animal’s care and surgical procedures accorded with The NIH
Guide for the Care and Use of Laboratory Animals (rev. 1996) and with the Hebrew University guidelines supervised by the institutional committee for
animal care and use. The monkey sat in a dark chamber, and eight electrodes were introduced into each hemisphere. The electrode signals were
amplified, filtered, and sorted (MCP-PLUS, MSD, Alpha-Omega, Nazareth,
Israel). The data used in this report were recorded on four different days
during a one-month period of daily sessions. Up to 16 microelectrodes were
inserted on each day. Recordings include spike times of single units (isolated
by signal fit to a series of windows) and multiunits (detection by threshold
crossing).
The monkey used two planar-movement manipulanda to control two
cursors (× and + shapes) on the screen to perform a center-out reaching
task. Each trial began when the monkey centered both cursors on a central circle for 1.0 to 1.5 s. Either cursor could turn green, indicating the hand to be used in the trial (× for the right arm and + for the left). Then, after an additional hold period of 1.0 to 1.5 s, one of eight targets (0.8 cm diameter) appeared at a distance of 4 cm from the origin. After another 0.7 to 1.0 s, the center circle
disappeared (referred to as the Go Signal), and the monkey had to move
and reach the target in less than 2 s to receive liquid reward. We obtained
data from all channels that exhibited spiking activity (i.e., were not silent)
during at least 90% of the trials. The number and types of recorded units
and the number of trials used in each of the recording days are described in
Table 1.
At the end of each session, we examined the activity of neurons evoked
by passive manipulation of the limbs and applied intracortical microstimulation (ICMS) to evoke movements. The data presented here were recorded
in penetration sites where ICMS evoked shoulder and elbow movements.
Identification of the recording area was also aided by MRI. More information can be found in Paz, Boraud, Natan, Bergman, and Vaadia (2003).
Table 1: Summary of the Data Sets Obtained.

Data Set   Number of      Number of Multiunit   Mean Spike Rate   Number of           Net Cumulative
           Single Units   Channels              (spikes/sec)      Successful Trials   Movement Time (sec)
1          27             16                    8.6               317                 529
2          22             9                     11.6              333                 550
3          27             10                    8.0               155                 308
4          25             13                    6.5               307                 509
The results that we present here refer to prediction of instantaneous hand
velocities (in 2D) during the movement time (from Go Signal to Target
Reach times) of both hands in successful trials. Note that some of the trials
required movement of the left hand while keeping the right hand steady
and vice versa. Therefore, although we considered only movement periods
of the trials, we had to predict both movement and nonmovement for each
hand.
5.2 Data Preprocessing and Modeling. The recordings of spike times
were made at 1 ms precision, and hand positions were recorded at 500
Hz. Our kernel method is suited for use with windows of observed neural
activity as time series and with the movement values at the ends of those
windows. We therefore chose to partition spike trains into bins, count the
spikes in each bin, and use a running window of such spike counts as
our data with the hand velocities at the end of the window as the label.
Thus, a labeled example (St , vt ) for time t consisted of the X (left-right) or
Y (forward-backward) velocity for the left or right hand as the target label
vt and the preceding window of spike counts from all q cortical units as the
input sequence St . In one such preprocessing, we chose a running window of
10 bins of size 100 ms each (a 1-second-long window), and the time interval
between two such windows was 100 ms. Two such consecutive examples
would then have nine time bins of overlap. For example, the number of
cortical units q in the first data set was 43 (27 single + 16 multiunit), and the
total length of all the trials used in that data set is 529 seconds. Hence, in that
session, with the above preprocessing, there are 5290 consecutive examples,
where each is a 43 × 10 matrix of spike counts and there are 4 sequences of
movement velocities (X/Y and R/L hand) of the same length.
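A minimal sketch of this preprocessing (function and variable names are ours, and the toy trial duration is arbitrary) could look as follows:

```python
import numpy as np

bin_size, n_bins = 0.1, 10          # 100 ms bins, 10-bin (1 s) running window
trial_duration = 6.0                # hypothetical trial length in seconds
edges = np.arange(0.0, trial_duration + bin_size, bin_size)

def bin_spike_counts(spike_times_per_unit):
    """List of per-unit spike-time arrays (seconds) -> (q, m) matrix of bin counts."""
    return np.stack([np.histogram(times, bins=edges)[0] for times in spike_times_per_unit])

def windowed_examples(counts, velocity):
    """Pair each 10-bin window of counts with the hand velocity at its last bin."""
    return [(counts[:, t - n_bins + 1 : t + 1], velocity[t])
            for t in range(n_bins - 1, counts.shape[1])]
```

Consecutive examples produced this way are 100 ms apart and therefore overlap by nine bins, as described above.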
Before we could train an algorithm, we had to specify the time bin size, the
number of bins, and the time interval between two observation windows.
Furthermore, each kernel employs a few parameters (for the Spikernel, they
are λ, µ, n, and pi , though we chose pi = 1 a priori) and the SVM regression
setup requires setting two more parameters, ε and C. In order to test the
algorithms, we first had to establish which parameters work best. We used
the first half of data set 1 in fivefold cross-validation to choose the best
Table 2: Parameter Values Tried for Each Kernel.

Kernel                       Parameters
Spikernel                    λ ∈ {0.5, 0.7, 0.9, 0.99}, µ ∈ {0.7, 0.9, 0.99, 0.999}, n ∈ {3, 5, 10}, p_i = 1, C = 0.1, ε = 0.1
Exponential                  γ ∈ {0.1, 0.01, 0.001, 0.005, 0.0001}, C ∈ {1, 5, 10}, ε = 0.1
Polynomial, second degree    C ∈ {0.001, 10⁻⁴, 10⁻⁵}, ε = 0.1
Polynomial, third degree     C ∈ {10⁻⁴, 10⁻⁵, 10⁻⁶}, ε = 0.1
Linear                       C ∈ {100, 10, 1, 0.1, 0.01, 0.001, 10⁻⁴}, ε = 0.1

Notes: Not all combinations were tried for the Spikernel: n was chosen to be (number of bins)/2, and some peripheral combinations of µ and λ were not tried either.
preprocessing and kernel parameters by trying a wide variety of values for
each parameter, picking the combinations that produced the best predictions
on the validation sets. For all kernels, the prediction quality as a function
of the parameters was single peaked, and only that set of parameters was
chosen per kernel for testing of the rest of the data.
Having tuned the parameters, we then used the rest of data set 1 and the
other three data sets to learn and predict the movement velocities. Again,
we employed fivefold cross-validation on each data set to obtain accuracy
results on all the examples. The five-fold cross-validation was produced by
randomly splitting the trials into five groups: four of the five groups were
used for training, and the rest of the data was used for evaluation. This
process was repeated five times by using each fifth of the data once as a test
set.
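The trial-wise fivefold split described above could be realized along the following lines (an illustrative sketch; `trial_ids` labels each example with the trial it came from, so that a trial never spans both the training and the test set):

```python
import numpy as np

def fivefold_trial_split(trial_ids, seed=0):
    """Yield (train_indices, test_indices) pairs, holding out one fifth of the trials at a time."""
    rng = np.random.default_rng(seed)
    trials = np.array(sorted(set(trial_ids)))
    rng.shuffle(trials)
    for held_out in np.array_split(trials, 5):
        test_mask = np.isin(trial_ids, held_out)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]
```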
The kernels that we tested are the Spikernel, the exponential kernel $K(S, T) = e^{-\gamma (S-T)^2}$, the homogeneous polynomial kernel $K(S, T) = (S \cdot T)^d$ for $d = 2, 3$, and the standard scalar product kernel $K(S, T) = S \cdot T$, which boils down to a linear regression. In our initial work, we also tested polynomial
down to a linear regression. In our initial work, we also tested polynomial
kernels of higher order and nonhomogeneous polynomial kernels, but as
these kernels produced worse results than those we describe here, we did
not pursue further testing. During parameter fitting (on the first half of
data set 1), we tried all combinations of bin sizes {50 ms, 100 ms, 200 ms} and numbers of bins {5, 10, 20}, except for combinations of bin size and number of bins resulting in windows of length 2 s or more (since full information regarding the required movement was not available to the monkey 2 s before the Go Signal
and therefore neural activity at that time was probably not informative). The
time interval between two consecutive examples was set to 100 ms. For each
preprocessing, we then tried each kernel with parameter values described
in Table 2.
6 Results
Prediction samples of the Spikernel and linear regression are shown in Figure 3. The samples are of consecutive trials taken from data set 4 without
Figure 3: Sample of actual movement, Spikernel prediction, and linear regression prediction of left-hand velocity in data set 4 in a few trials: (top) vy, (bottom) vx (both normalized). Only intervals between Go Signal and Target Reach are shown (separated by vertical lines). Actual movement in thick gray, Spikernel in thick black, and linear regression in thin black. Time axis is seconds of accumulated prediction intervals from the beginning of the session.
prior inspection of their quality. We show the Spikernel and linear regression as prediction quality extremes. Both methods seem to underestimate
the peak values and are somewhat noisy when there is no movement. The
Spikernel prediction is better in terms of mean absolute error. It is also
smoother than the linear regression. This may be due to the time warping
quality of the features employed.
We computed the correlation coefficient (r2 ) between the recorded and
predicted velocities per fold for each kernel. The results are shown in Figure 4 as scatter plots. Each circle compares the Spikernel correlation coefficient scores to those of one of the other kernels. There are five folds for each
of the four data sets and four correlation coefficients for each fold (one for
each movement direction and hand), making up 80 scores (and circles in the
plots) for each of the kernels compared with the Spikernel. Results above
the diagonal are cases where the Spikernel outperformed. Note that though
the correlation coefficient is informative and considered a standard performance indicator, the SVR algorithm minimizes the ε-insensitive loss. The
results in terms of this loss are qualitatively the same (i.e., the kernels are
ranked by this measure in the same manner), and we therefore omit them.
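For reference, the per-fold score used here is the squared (Pearson) correlation between the recorded and predicted velocity traces; a minimal sketch (array names are illustrative):

```python
import numpy as np

def fold_r2(recorded, predicted):
    """Squared correlation coefficient between recorded and predicted velocities."""
    return np.corrcoef(recorded, predicted)[0, 1] ** 2
```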
Figure 4: Correlation coefficient comparisons of the Spikernel versus other kernels. Each scatter plot compares the Spikernel to one of the other kernels. Each circle shows the correlation coefficient values obtained by the Spikernel and by the other kernel in one fold of one of the data sets for one of the two axes of movement. (a) Compares the Spikernel to the exponential kernel. (b) Compares with the second-degree polynomial kernel. (c) Compares with the third-degree polynomial kernel. (d) Compares with linear regression. Results above the diagonals are cases where the Spikernel outperformed.
Table 3: Chosen Preprocessing Values, Parameter Values, and Mean Correlation Coefficient Across Folds for Each Kernel in Each Data Set.

                                                      Day 1      Day 2      Day 3      Day 4      Total
Kernel Type      Preprocessing   Kernel Parameters    LhX  RhX   LhX  RhX   LhX  RhX   LhX  RhX   Mean
                                                      LhY  RhY   LhY  RhY   LhY  RhY   LhY  RhY

Spikernel        10 bins,        µ = .99, λ = .7,     .71  .79   .77  .63   .75  .70   .69  .66   .72
                 100 ms          N = 5, C = 1         .74  .63   .87  .79   .67  .66   .79  .71

Exponential      5 bins,         γ = .01, C = 5       .67  .76   .73  .57   .71  .67   .66  .63   .67
                 200 ms                               .69  .45   .85  .74   .56  .61   .74  .65

Polynomial,      10 bins,        C = 10⁻⁴             .67  .76   .72  .54   .70  .68   .65  .62   .68
second degree    100 ms                               .70  .51   .85  .74   .61  .63   .76  .68

Polynomial,      10 bins,        C = 10⁻⁵             .65  .72   .69  .52   .67  .68   .63  .61   .66
third degree     100 ms                               .66  .51   .84  .71   .61  .62   .75  .64

Linear           10 bins,        C = 0.01             .54  .65   .62  .34   .66  .64   .51  .51   .57
(dot product)    100 ms                               .67  .35   .75  .57   .50  .57   .70  .62

Notes: Total means (across data sets, folds, hands, and movement directions) are computed for each kernel. The Spikernel outperforms the rest, and the nonlinear kernels outperform the linear regression in all data sets.
Table 3 shows the best bin sizes, number of bins, and kernel parameters
that were found on the first half of data set 1 and the mean correlation coefficients obtained with those parameters on the rest of the data. Some of the
learning problems were easier than others (we observed larger differences
between data sets than between folds of the same data set). Contrary to our
preliminary findings (Shpigelman, Singer, Paz, & Vaadia, 2003), there was
no significantly easier axis to learn. The mean scores define a ranking order on the kernels. The Spikernel produced the best mean results, followed
by the polynomial second-degree kernel, the exponential kernel, the polynomial third-degree kernel, and finally, the linear regression (dot product
kernel).
To determine how consistent the ranking of the kernels is, we computed
the difference in correlation coefficient scores between each pair of kernels in
each fold and determined the 95% confidence interval of these differences.
The Spikernel is significantly better than the other kernels and the linear
regression (the mean difference is larger than the confidence interval). The
other nonlinear kernels (exponential and polynomial second and third degree) are also significantly better than linear regression, but the differences
between these kernels are not significant.
For all kernels, windows of 1 s were chosen over windows of 0.5 s as best
preprocessing (based on the parameter training data). However, within the
1 s window, different bin sizes were optimal for the different kernels. Specifically, for the exponential kernel, 5 bins of size 200 ms were best, and for the
rest, 10 bins of 100 ms were best. Any processing of the data (such as time
binning or PCA) can only cause loss of information (Cover & Thomas, 1991)
Table 4: Mean Correlation Coefficients (Over All Data) for the Spikernel with Different Combinations of Bin Sizes and Number of Bins.

Number of Bins   50 ms   100 ms   200 ms
5 bins           .52     .65      .68
10 bins          .66     .72      —
20 bins          .66     —        —

Note: The best fit to the data is still 10 bins of 100 ms.
but may aid an algorithm. The question of what the time resolution of the cells is remains unanswered here, but for the purpose of linear regression and SVR
(with the kernels tested) on our MI data, it seems that binning was necessary.
In many cases, a poststimulus time histogram, averaged over many trials,
is used to get an approximation of firing rate that is accurate in both value
and position in time (assuming that one can replicate the same experimental
condition many times). In single-trial reconstruction, this approach cannot
be used, and the trade-off between time precision versus accurate approximation of rate must be balanced. To better understand what the right bin
size is, we retested the Spikernel with bin sizes of 50 ms, 100 ms, and 200
ms and number of bins 5, 10, and 20, leaving out combinations resulting in
windows of 2 seconds or more. The mean correlation coefficients (over all
the data) are shown in Table 4. The 10 by 100 ms configuration is optimal for
the whole data as well as for the parameter fitting step. Note that since the
average spike count per second was less than 10, the average spike count in bins of 50 ms to 200 ms was 0.5 to 2 spikes. In this respect, the data
supplied to the algorithms are on the border between a rate and a spike
train (with crude resolution).
Run-time issues were not the focus of this study, and little effort was
made to make the code implementation as efficient as possible. Run time
for the exponential and polynomial kernels (prediction time) was close to
real time (the time it took for the monkey to make the movements) on a
Pentium 4, 2.4 GHz CPU. Run time for the Spikernel is approximately 100
times longer than the exponential kernel. Thus, real time use is currently
impossible.
The main contribution of the Spikernel is the allowance of time warping. To better understand the effect of λ on Spikernel prediction quality,
we retested the Spikernel on all of the data, using 10 bins of size 100 ms
(n = 5, µ = 0.99) and λ ∈ {0.5, 0.7, 0.9}. The mean correlation coefficients
were {0.68, 0.72, 0.66}, respectively. We also retested the exponential kernel
with γ = 0.005 on all the data with 5 bins of size 100 ms. This would correspond to a Spikernel with its previous parameters (µ = 0.99, n = 5) but
no time warping. The correlation coefficient achieved was 0.64. Again, the
relevance of time warping to neural activity cannot be directly addressed by
our method. However, we can safely conclude that consideration of time-
warped neural activity (in the Spikernel sense) does improve prediction.
We believe that it would make sense to assume that this is relevant to cells
as well, as mentioned in section 4.1, because the 3D structure of a cell’s dendritic tree can align integration of differently placed inputs from different
times.
7 Conclusion
In this article, we described an approach based on recent advances in kernel-based learning for predicting response variables from neural activities. On the data we collected, all the kernels we devised outperform the standard scalar product that is used in linear regression. Furthermore, the Spikernel, a biologically motivated kernel operator, consistently outperforms the
other kernels. The main contribution of this kernel is the allowance of time
warping when comparing two sequences of population spike counts. Our
current research is focused in two directions. First, we are investigating the
adaptations of the Spikernel to other neural activities such as local field potentials and the development of kernels for spike trains. Our second goal is
to devise statistical learning algorithms that use the Spikernel as part of a
dynamical system that may incorporate biofeedback. We believe that such
extensions are important and necessary steps toward operational neural
prostheses.
Acknowledgments
We thank the anonymous reviewers for their constructive criticism. This
study was partly supported by a Center of Excellence grant (8006/00) administered by the ISF, BMBF-DIP, and by the United States–Israel Binational
Science Foundation. L.S. is supported by a Horowitz fellowship.
References
Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding
of motor cortical signals by particle filtering. Journal of Neurophysiology, 91,
1899–1907.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A
statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal
of Neuroscience, 18, 7411–7425.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Realtime control of a robot arm using simultaneously recorded neurons in the
motor cortex. Nature Neuroscience, 2, 664–670.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Donchin, O., Gribova, A., Steinberg, O., Bergman, H., & Vaadia, E. (1998). Primary motor cortex is involved in bimanual coordination. Nature, 395, 274–
278.
Fu, Q. G., Flament, D., Coltz, J. D., & Ebner, T. J. (1997). Relationship of cerebellar
Purkinje cell simple spike discharge to movement kinematics in the monkey.
Journal of Neurophysiology, 78, 478–491.
Genton, M. G. (2001). Classes of kernels for machine learning: A statistical
perspective. Journal of Machine Learning Research, 2, 299–312.
Georgopoulos, A. P., Kalaska, J., & Massey, J. (1983). Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor
cortical populations. Experimental Brain Research (Supp.), 7, 327–336.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor
cortex and free arm movements to visual targets in three-dimensional space.
Journal of Neuroscience, 8, 2913–2947.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No.
UCSC-CRL-99-10). Santa Cruz, CA: University of California.
Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control
of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8, 196–198.
Jaakkola, T. S., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural
information processing systems. Cambridge, MA: MIT Press.
Laubach, M., Wessberg, J., & Nicolelis, M. A. (2000). Cortical ensemble activity
increasingly predicts behavior outcomes during learning of a motor task.
Nature, 405, 567–571.
Laubach, M., Shuler, M., & Nicolelis, M. A. (1999). Independent component
analyses for quantifying neuronal ensemble interactions. J. Neurosci Methods,
94, 141–154.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. C. H. (2000). Text
classification using string kernels. In S. A. Solla, T. K. Leen, & K.-R. Müller
(Eds.), Advances in neural information processing systems. Cambridge, MA:
MIT Press.
Magee, J. (2000). Dendritic integration of excitatory synaptic input. Nat. Rev.
Neuroscience, 1, 181–190.
Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed
and direction during reaching. Journal of Neurophysiology, 82, 2676–2692.
Nicolelis, M. A. (2001). Actions from thoughts. Nature, 409(18), 403–407.
Nicolelis, M. A., Ghazanfar, A. A., Faggin, B. M., Votaw, S., & Oliveira, L. M.
(1997). Reconstructing the engram: Simultaneous, multi-site, many single
neuron recordings. Neuron, 18, 529–537.
Paz, R., Boraud, T., Natan, C., Bergman, H., & Vaadia, E. (2003). Preparatory
activity in motor cortex reflects learning of local visuomotor skills. Nature
Neuroscience, 6(8), 882–890.
Pouget, A., & Snyder, L. (2000). Computational approaches to sensorimotor
transformations. Nature Neuroscience, 8, 1192–1198.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differentially involved in motor cortical
function. Science, 278, 1950–1952.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265,
540–542.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical
insights elucidate recent direct observations. TINS, 21, 453–459.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue,
J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141–142.
Shpigelman, L., Singer, Y., Paz, R., & Vaadia, E. (2003). Spikernels: Embedding
spiking neurons in inner product spaces. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge,
MA: MIT Press.
Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression.
Statistics and Computing, 14, 199–222.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829–1832.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen,
A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to
behavioral events. Nature, 373, 515–518.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Watkins, C. (1999). Dynamic alignment kernels. In P. Bartlett, B. Schölkopf,
D. Schuurmans, & A. Smola (Eds.), Advances in large margin classifiers (pp. 39–
50). Cambridge, MA: MIT Press.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin,
J. K., Kim, J., Biggs, J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time
prediction of hand trajectory by ensembles of cortical neurons in primates.
Nature, 408(16), 361–365.