Preprint - Gatsby Computational Neuroscience Unit

advertisement
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
1
An Active Learning Algorithm for Control of
Epidural Electrostimulation
Thomas A. Desautels, Jaehoon Choe*, Parag Gad*, Mandheerej S. Nandra, Roland R. Roy, Hui Zhong,
Yu-Chong Tai, V. Reggie Edgerton, Joel W. Burdick
Abstract—Epidural electrostimulation has shown promise for
spinal cord injury therapy. However, finding effective stimuli
on the multi-electrode stimulating arrays employed requires
a laborious manual search of a vast space for each patient.
Widespread clinical application of these techniques would be
greatly facilitated by an autonomous, algorithmic system which
choses stimuli to simultaneously deliver effective therapy and
explore this space. We propose a method based on GP-BUCB, a
Gaussian process bandit algorithm. In n = 4 spinally transected
rats, we implant epidural electrode arrays and examine the algorithm’s performance in selecting bipolar stimuli to elicit specified
muscle responses. These responses are compared with temporally
interleaved, intra-animal stimulus selections by a human expert.
GP-BUCB successfully controlled the spinal electrostimulation
preparation in 37 testing sessions, selecting 670 stimuli. These
sessions included sustained, autonomous operations (10 session
duration). Delivered performance with respect to the specified
metric was as good as or better than that of the human expert.
Despite receiving no information as to anatomically likely locations of effective stimuli, GP-BUCB also consistently discovered
such a pattern. Further, GP-BUCB was able to extrapolate from
previous sessions’ results to make predictions about performance
in new testing sessions, while remaining sufficiently flexible to
capture temporal variability. These results provide validation for
applying automated stimulus selection methods to the problem
of spinal cord injury therapy.
Index Terms—Implants, Learning Automata, Neural Engineering, Neuromuscular Stimulation, Spinal Cord Injury.
I. I NTRODUCTION
Animal [1] and human [2], [3] studies have shown that
multi-electrode epidural spinal stimulation can provide significant recovery of motor and autonomic function in subjects
suffering from severe spinal cord injury (SCI). Fig. 1 shows a
c 2014 IEEE. Personal use of this material is permitted.
Copyright However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Submitted December 15, 2014. This work was supported by NIH
U01EB15521, R01EB007615, the Leona M. and Harry B. Helmsley Charitable Trust, and the Christopher and Dana Reeve, Broccoli, Walkabout, and
F. M. Kirby Foundations. J. Choe and P. Gad contributed equally.
T. A. Desautels was with the California Institute of Technology, Pasadena,
CA 91125, USA. He is now with the Gatsby Computational Neuroscience
Unit, University College London, London WC1N 3AR UK (email: tdesautels@gatsby.ucl.ac.uk).
J. Choe was and P. Gad, R. R. Roy, H. Zhong, and V. R. Edgerton are with
the University of California, Los Angeles, Los Angeles, CA, USA. J. Choe
is now with HRL Laboratories, LLC, Malibu, CA, USA.
M. S. Nandra was and Y.-C. Tai and J. W. Burdick are with the California
Institute of Technology, Pasadena, CA, USA.
VRE, RRR, and JWB hold shareholder interest in NeuroRecovery Technologies (NRT). VRE is also the President and Chairman of the Board.
VRE, RRR, and JWB hold certain inventorship rights on intellectual property
licensed by The Regents of the University of California to NRT and its
subsidiaries. A patent has been submitted covering concepts described here.
27-electrode stimulating array we employ in rat SCI models.
When implanted in the epidural space, the application of
appropriate spatio-temporal patterns of electrical stimulation
to these arrays can facilitate the operation of spinal circuits
caudal to the spinal injury, enabling beneficial motor function,
and the recovery of some voluntary control and autonomic
function [2], [3].
Multi-electrode stimulating arrays provide many benefits
over simple wire electrodes, such as patient-customized stimulation parameters, compensation for errors in surgical placement of the array, and potential adaptation to spinal cord
plasticity. Moreover, this adaptivity is needed in the clinic:
optimal stimulus patterns vary substantially across patients [3]
and over time for each patient due to spinal cord plasticity.
Hence, the clinical process of deploying multi-electrode epidural stimulation requires a burdensome search for the optimal
stimulation paradigm for each patient.
The space of possible stimuli which must be searched to find
high-performing electrode combinations can be vast; e.g., the
19 parameters (which describe active electrode selection, electrode polarity, voltage, stimulus frequency, and pulse width)
which can be varied on the 16-electrode array used in [2], [3]
allow for 108 different stimulus combinations. When restricted
to pulse-like stimuli (parametrized by amplitude, frequency,
phase, and pulse width), the array shown in Fig. 1 presents
a 108-dimensional space that must be searched. Even though
the range of permissible values is greatly reduced by safety
and design factors, the necessary search for the small set of
high-performing or optimal parameters can be tedious and of
limited direct therapeutic value.
This burden suggests the need for automated algorithms to
facilitate, simplify, and accelerate the search process. Other
factors motivate the need for automated search techniques: (1)
different stimulation patterns are needed to facilitate different
motor behaviors (e.g., standing vs. stepping); (2) the current
labor-intensive process of finding optimal stimulation patterns
is expensive and may not result in optimal parameter choices;
(3) no method yet exists to predictively account for plasticity
in the nervous system during locomotor recovery after severe
paralysis; and (4) a systematic method can facilitate widespread, cost-effective distribution of this therapy, as there are
currently few therapists trained to optimize epidural stimulation for recovery after an SCI.
This paper introduces a novel algorithm which can automatically optimize multi-electrode array stimulation parameters.1
1A
preliminary version of this work has been reported [4].
2
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
Fig. 1. Parylene Array Device. (A) and (B): The complete implant, including
main circuit board, parylene electrode array, headplug connector, EMG wires,
and ground wires. (C): Tip of the array, showing suture sites. (D): Detail of
a single electrode. The bright surface regions are electrical contacts and the
darker material is a layer of parylene to prevent delamination. (E): View of the
entire parylene and platinum microfabricated portion of the implant. Figure
reproduced under Open Access from [1].
The methodology is based on a novel Gaussian Process
Batch Upper Confidence Bound (GP-BUCB) active machine
learning technique developed by some of the authors [5].
GP-BUCB uses a batch processing format that meshes with
clinical practice: data from a completed training session is
processed to select a batch of stimuli to be tested in the next
training session. This approach decouples the data processing
step from the clinical process. When the data processing
step is sufficiently rapid, as in the experiments here, this
approach can also be used within a session. The basic theory
underlying GP-BUCB is developed and analyzed elsewhere
[5]. This paper focuses on animal experiments designed to
evaluate the method. These studies used a block strategy in
which an experienced neurophysiologist and the GP-BUCB
algorithm each selected and applied batches of stimuli in
an interleaving sequence. Each was blind to the competitor’s
actions and the responses elicited, allowing comparison of the
exploratory/exploitative strategies, while largely compensating
for spinal plasticity. Our experiments showed that an automated algorithm can find high-performing stimulus parameters
efficiently, competitively with a human expert. This should
lead to a better therapeutic recovery experience, and shows
that an automated algorithm can be a competitive option for
determining effective therapeutic strategies.
Relation to Prior Work. This paper lies at the intersection
of multiple fields: machine learning, neuromotor rehabilitation,
and multi-electrode array technology. Herein, we address
spinal cord injury, a medical condition with broad impact,
which has been widely studied in both human [2], [3] and
animal models [6]. Reviews of SCI research include [7]–
[9]. Although regenerative approaches [10] hold long-term
promise, they have not yet had substantial success in patients
with a complete SCI [7].
Since there is no present cure for SCI, current practice
focuses on therapy for the injured neuromuscular system.
Locomotor therapy [11] produces gains in some patients
through a variety of putative mechanisms [12], [13]. Several
therapies involve electrical stimulation. Functional electrostimulation (FES) [14] aims to create muscle activation patterns
associated with a desired movement. Since FES is an openloop method, the stimulation must be carefully designed and/or
user-controlled if complex behaviors are desired. FES also can
engender rapid fatigue [15]. Other approaches focus on spinal
cord stimulation to rehabilitate the cord’s motor control ability,
without addressing the injury itself [16]. This philosophy relies
on the spinal cord’s intrinsic circuitry caudal to the injury
site, which remains viable and adaptable [8]. Specific targets
include interneuronal networks responsible for reflexes and
the central pattern generators that coordinate muscle activity
in walking [17]. A variety of methods for delivering the
electrical stimulus have been suggested, including penetrating
microelectrodes [18].
The present paper focuses on epidural spinal electrostimulation (EES), which relies upon an electrode array implanted in
the epidural space over the dorsal aspect of the lumbosacral
spinal cord. Although originally developed for chronic pain
therapy [19], epidural electrostimulation can produce complex
motor patterns [17] in SCI subjects. EES has been successfully
applied to spinal and decerebrate cats, spinalized rats, and
humans with an SCI [9]. Properly configured EES can produce
walking motions [9]. The combination of EES with partial
body weight support exercise training can produce substantial
functional and metabolic gains in locomotion in an incomplete
quadriplegic patient [20]. More recently, some of the present
authors have used EES coupled with motor training to demonstrate substantial gains in a motor-complete patients [2], [3].
One mechanism believed to be involved with in this type of
stimulation is the activation of afferent fibers as they enter
the spinal cord through the dorsal nerve roots [21]. A detailed
attempt to examine spinal circuitry via EES-evoked potentials
in human participants is presented in [22]. Even in this light,
it is difficult to choose stimuli a priori, and so a search is
necessary.
Since the search for optimal stimulating parameters consumes valuable patient and clinician time, any automated algorithm must quickly find high-performing stimulus parameters
(explore the stimulus space) and then apply them (exploit its
knowledge) to enable useful therapy. The unavoidable tension between these two requirements is classically called the
exploration-exploitation tradeoff. Well-designed active learning algorithms strike an effective balance. The companion
paper [5] reviews the active learning literature. Here we focus
on the use of active learning for neurostimulation.
Our algorithm is not the first to “tune” the electrical stimuli
applied by a medical device. Machine learning algorithms have
tuned cochlear implants [23]. Computational models have been
proposed to adjust deep brain stimulation (DBS) parameters
[24], and automated DBS motor evaluation is in development
[25]. In comparison to our approach, the ad-hoc techniques
used in the cochlear tuning have no formal convergence
guarantees. Moreover, EES invokes a more complex response
(involving sensorimotor spinal networks) than does cochlear
stimulation. Although the need for automated DBS tuning has
been recognized [26], it is not yet a reality.
Whereas this paper focuses on neural stimulation, active
DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF EPIDURAL ELECTROSTIMULATION
learning has been applied to Brain Computer Interfaces (BCIs),
which classify recorded neural activity. Active learning has
been used to find the stimuli which produce good discrimination between volitional and resting states [27], [28], to
calibrate BCI decoders [29] and BCI systems [30]. In contrast,
this paper uses an approach with more structure (a Gaussian
process class of reward functions) over the space of actions in
order to enable very large decision sets. Further, we optimize
a reward function while simultaneously applying the therapy,
rather than in a separate, potentially non-therapeutic session.
Finally, we note that our automated stimulus optimization
approach can likely be adapted to other multi-electrode therapies, such as transcranial direct current [31], transcutaneous
spinal cord [32], or deep brain stimulation.
Paper Organization. Section II describes the procedures
and the electrode arrays used in our experiments. The theory
and structure of the GP-BUCB algorithm [5] is briefly reviewed in Section III and Section IV describes adaptations of
GP-BUCB to the clinical and experimental setting. Section
V summarizes our experimental results, with Section VI
providing a detailed analysis.
II. E XPERIMENTAL M ETHODS
This section describes the preparation of the animal subjects
(Section II-A), the micro-fabricated (Section II-B) and wire
electrode arrays (Section II-C) used to stimulate their spinal
cords, and basic testing procedures (Section II-D).
A. Injury, Implantation, and Animal Care
Our surgical and care procedures are largely derived from
[1], and are similar to those used for cats [33]. All animals
are adult female Sprague Dawley rats, approximately 300 g
in mass at time of implantation. The following procedures are
performed on each animal, usually in a single surgery:
1) Partial laminectomy at the T8-T9 vertebral level and
complete spinal cord transection at approximately the T8
spinal segment, including the dura, using microscissors;
2) Placement of gel foam at the transection site as a
coagulant and separator of the cut spinal cord ends;
3) Partial or full laminectomies of some vertebrae (T11,
T12, L3, and L4 for animals receiving parylene arrays,
T12, T13, L1, and L2 for those receiving wired arrays);
4) Implantation of an epidural stimulating array, inserted
using the T11, L4 (T12, L2) laminectomies for parylene
(wire) arrays, placing the most rostral electrodes in the
middle of the T12 vertebral level, and sutured to the
dura at the rostral end using 8-0 Ethilon sutures;
5) Implantation of 2 ground wires (each made of 5 braided
0.003 cm gold wires, A-M Systems, Sequim, WA) for
parylene arrays, or one teflon coated stainless steel wire
for the wired arrays (0.304 mm, AS 632, Cooner Wire,
Chatsworth, CA). The grounds are placed in the muscle
mass dorsal to the spinal column;
6) Implantation of a pair of multi-stranded, Teflon-coated
stainless steel EMG wires (AS 632, Cooner Wire) into
the bellies of left and right tibialis anterior (LTA & RTA)
and left and right soleus (LSol & RSol) muscles; and
3
7) Attachment of two headplug connectors (Amphenol,
Wallingford, CT), that are screwed to the skull and
secured with dental cement.
All surgeries are performed under aseptic conditions with
general anesthesia (Isoflurane) delivered via face mask. Analgesia is provided with buprenex (0.5-1.0 mg/kg, 3 times
per day subcutaneously), begun before the end of surgery
and continued for 3-5 days post-surgery. The animals were
administered Baytril sub-cutaneously at the end of surgery and
at 12-hr intervals thereafter for at least 3 days. The animals
are allowed to recover from anesthesia within an incubator and
are individually housed both pre- and postoperatively, with free
access to food and water. After recovering for one week (day 7
post-surgery, denoted P7), the experiments begin. Experiments
continue as long as the animal and array both remain viable,
or until 6 weeks post-operative (P42). Due to their spinal
cord injuries, the animals’ bladders are manually expressed
2-3 times per day and their hind limbs are manually moved
through a complete range of motion once per day to retain
joint mobility.
All procedures were carried out in accordance with the NIH
Guide for the Care and Use of Laboratory Animals and were
approved by the UCLA Animal Research Committee.
B. Parylene Arrays
Two animals were implanted with parylene stimulating arrays, which were built using MEMS fabrication and traditional
microelectronics, described in detail in [1]. The 26 mm x
3mm array is sized to the rat spinal cord. The array consists
of 27 platinum electrodes (each 0.2 mm x 0.5 mm, partially
covered by an anti-delamination waffle pattern of parylene)
and platinum wire traces deposited on a parylene C substrate
(see Fig. 1). The flexibility of the 20µm thick array conforms
to the cord’s movements, allowing delivery of roughly the
same stimulus in various body positions. The electrodes are
in three rostro-caudal columns (see Fig. 2) denoted “A,” “B,”
and “C” from the animal’s left to its right, and are numbered
by rows from 1 (rostral, L2 spinal cord level, T12 vertebral
level) to 9 (caudal, S2 spinal cord level, L2 vertebral level).
The rows are equally spaced.
The array is coupled to a biocompatible circuit board (33.2
mm x 10.3 mm) mounted dorsally on the spinal column.
This board’s circuits control the stimuli, record responses,
and communicate with external circuitry through a headplug
connector. A stimulus consists of a single pair of electrodes
(cathode and anode) chosen from 29 candidates, the 27 on the
array and 2 grounds located within the body. Considering only
bipolar configurations (i.e., those without grounds), there are
702 possible stimulus pairs. Due to the design of the driving
circuitry, certain pairs (36 in total) cannot be stimulated,
leaving 666 pairs over which the stimulus can be optimized.
C. Wire-Based Spinal Stimulating Arrays
The wire electrode arrays consist of 7 Teflon-coated, multistranded, stainless steel wires (5×30 guage, A-M Systems;
2×AS 632 wires, Cooner Wire) laid parallel. Openings cut
into the wire insulation create exposed electrodes on the
T13
L3
9
TA
A B C
9
8
S2
7
S1
6
L1
L3
8
7
6
5
L2
L6
5
L5
4
L4
3
2
T12
4
2
L3
1
L2
L1
3
T13
1
T12
MG
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
Soleus
4
L2
Fig. 2. Placement of a 9x3 array relative to the spinal cord (segmental level
in red) and vertebrae (gray). The colored bars at the top and bottom represent
the rostro-caudal spinal locations of the soleus, tibialis anterior, and medial
gastrocnemius motor pools. Adapted under Open Access from [1].
surface facing the dura. The wires are sutured to the dura
above and below the electrode, ensuring consistent stimulus
location. The other end is then connected to a headplug
connector, to which the implanted EMG wires are also routed.
The 7 stimulating electrodes are placed over the same spinal
locations as electrodes A1, A4, A9, B2, C1, C4, and C9 on
the parylene array (Fig. 2).
D. Animal Testing Procedures
During testing, the animals are suspended in a vest that
supports their body weight. The device is positioned so that
the animal is in a bipedal position, with both hindpaws in
contact with a custom surface offering good traction. In a
typical experiment, the algorithm chooses a batch of five
bipolar electrode pairs, possibly with repetition. The stimulus
associated to each selection consists of a train of 10 (or 20
in some animals) constant voltage pulses (1 Hz repetition,
monophasic, 0.5 ms pulse width) applied across the selected
electrodes.
The EMG signals corresponding to evoked potentials are
amplified (A-M Systems model 1700) and recorded by custom
LabVIEW (National Instruments, Austin, TX) software, with
a 10 kHz sampling rate. The digitized signals are processed
with custom MATLAB (The MathWorks, Inc., Natick, MA)
code to calculate the peak-to-peak amplitude of each evoked
potential within a time window defined relative to the onset of
the stimulus pulse. Along with previously recorded data, these
new observations are used to select the five actions comprising
the next batch. On a typical testing day, five batches are
selected. A run of the algorithm typically lasts several days,
and several runs may be performed on the same animal, with
the algorithm’s memory wiped between runs.
Human-selected batches (5 choices of electrode pairs and
voltage) are interleaved between tests of the algorithm’s batch
selections in an alternating manner, constituting an intraanimal baseline. The algorithm and human were blind to one
another’s actions and the animal’s resulting responses. The two
data sources were not combined for decision-making purposes,
but they were combined for meta-analysis and tuning of the
algorithm’s hyperparameters.
III. R EVIEW OF THE GP-BUCB ALGORITHM
The GP-BUCB algorithm is extensively developed elsewhere [5], [34]; here, we briefly sketch its construction. GPBUCB is a Gaussian process (GP) bandit algorithm. The
bandit setting [35] describes an agent which seeks to obtain
high reward in its interaction with an unknown reward function
f . In round t, the agent selects action xt from a decision set D,
and then receives the single observation, yt = f (xt )+t , of the
corresponding reward, f (xt ), corrupted by noise t . Since only
the reward corresponding to the action is observed, actions
must be selected in view of both exploration and exploitation.
GP bandits impose the structural assumption that the reward function is drawn from a GP, a probability distribution
over functions. This assumption is useful mathematically and
because it allows a relatively intuitive encoding of prior
(e.g., anatomical) knowledge about the problem. A GP is
characterized by a mean function µ0 (x) and a kernel function
k(x, x0 ), which describes the covariance between f (x) and
f (x0 ). With their hyperparameters, θ, the kernel and mean
functions encode assumptions about likely reward functions
f , such as smoothness with respect to actions x, or locations
of regions which are likely to yield high reward.
At time t, GP-BUCB updates the posterior distribution over
f (x) for each x ∈ D using the actions {x1 , . . . , xt−1 } and
observations y 1:t−1 = {y1 , . . . , yt−1 }: the posterior over f (x)
2
is f (x) ∼ N (µt−1 (x), σt−1
(x)), where
µt−1 (x) = µ0 (x) + ~k T (x)(K + σn2 I)−1 (y t−1 − µ~0 ) and
2
σt−1
(x) = k(x, x) − ~k T (x)(K + σn2 I)−1~k(x),
(1)
where ~k(x) is the column vector of k(x, xi ) for i = 1, . . . , t−
1, µ~0 is the corresponding vector of prior means µ0 (xi ), and
K is the Gram matrix, i.e., [K]i,j = k(xi , xj ). This posterior
assumes that f is drawn from the GP and that the noise is i.i.d.,
with t ∼ N (0, σn2 ). The noise accounts for all factors which
influence the response, but are unobserved by the algorithm.
The GP-BUCB algorithm addresses the problem of making
decisions in “batches,” where not all feedback is immediately
available; only measurements y fb[t] , where fb[t] < t − 1 are
available. Conventionally, all observations y t−1 are required to
select xt , but by using delayed measurements, the expenditure
of valuable clinical time waiting for data processing is avoided.
Mathematically, if the pending actions xfb[t]+1 , . . . , xt−1 are
known to the algorithm, the posterior variance available after
those observations arrive can be calculated from (1), since
this equation does not depend on the observation values. GPBUCB uses this pseudo-updated posterior to select xt ,
1/2
xt = argmax[µfb[t] (x) + βt
σt−1 (x)],
(2)
x∈D
where βt is a time-varying parameter (in the following, chosen
to increase during a testing session). This lets GP-BUCB
select observations in non-redundant batches (or despite a
delay). A number of performance guarantees for GP-BUCB
with particular choices of βt are obtained by [5].
IV. C LINICAL A DAPTATIONS OF GP-BUCB
Noteworthy modifications of the GP-BUCB algorithm
which are needed for clinical application are described below.
DESAUTELS
et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OFIEEE
EPIDURAL
ELECTROSTIMULATION
5
6
TRANSACTIONS
ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
TABLE
TABLE II
GP
GP PRIORS
PRIORS USED
USED TO
TO MODEL
MODEL RESPONSES
RESPONSES TO
TO SPINAL
SPINAL STIMULI
STIMULI.. T
THE
HE GP
GP PRIOR
PRIOR MEAN
MEAN IS
IS µ00(t) = c0 OR
OR c0 + tc1 ..
K ERNEL
H YPERPARAMETERS
l = [0.2740, 0.4659, 0.3297, 0.6300, 1.2018],
σ = 0.1244, σn = 0.0268
A NIMAL
RUN
P1
RUN 1 A
F UNCTION
SE
W1
RUN 1 B
RUN 1 A
SE
RUN 1 B
RUN 1 C
l = [0.5480, 0.9317, 0.6593, 1.2599, 2.4034],
σ = 0.1244, σn = 0.0268
SE
RUN 2
SE
W2
RUN 3
RUN 1 A
l = [0.1147, 46.9592, 0.1949, 5.0130, 0.9114],
σ = 0.5288, σn = 0.1708
l = [2.9453, 6.3777, 3.4291, 2.3453, 2.4034],
σ = 0.7903, σn = 0.1364
H YBRID
P2
RUN 1 B
RUN 2
RUN 1
H YBRID
l1 = [1.0000, 2.7183, 1.0000, 2.7183, 2.7183],
σ1 = 0.2865, σ2 = 0.2231, σn = 0.1353
l1 = [1.0000, 2.7183, 1.0000, 2.7183, 2.7183],
σ1 = 0.2865, σ2 = 0.2231, σn = 0.1353
C. Objective
Redundancy/Repetition
Control
A.
Function
Another deviation from the classical GP bandit setting
To optimize clinical reward, GP-BUCB must observe the
concerns the repetitions of stimuli within short periods. For
actual reward or its faithful surrogate. Many methods exist to
most experiments (see Table I), the algorithm was not allowed
quantify standing performance [36], [37], which we eventually
to request the same action more than twice per day in order
aim to optimize by EES. Many motor skill grading schemes
to ensure adequate exploration. Since 5 batches of 5 stimuli
rely on human observation [38], [39] or automated post-hoc
are typically requested daily, 13-25 distinct electrode pairs
analysis [40], [41]. Although we do not use human grading
are tested. Some of the consequences of this restriction are
due to the variability in human judgement, our method can
dicsussed in Section VI-A.
work with human-based ordinal ratings and is mathematically
agnostic to the particular choice of scalar reward function.
D.We
Kernel
Mean Functions
use and
measured
EMG activity as the reward function
and mean
functions
the algorithm’s
in The
our kernel
experiments.
We chose
the describe
peak-to-peak
amplitude
prior
spinal
cord
response.
To ensure
reaof
theassumptions
left tibialis about
anterior
(LTA,
a foot
dorsiflexor)
muscle
sonable action
selection,
their chosen
carefully
enresponse
to individual
stimulus
pulses form
(fixedmust
amplitude
of 5V,
code 7V
crucial
information.
importantly,
k(x,
x0 )
with
usedproblem
in animal
P2), in theMost
“middle
response”
(MR,
encodes
between
and x0 . For our
4.5
- 7.5 the
ms dependence
post-stimulus)
latencystimuli
period,x corresponding
to
experiments,
thewhich
5-dimensional
stimulus
description
motor
responses
likely involve
a single
synaptic vector
delay
consisted
of the
anode
and the
cathode
spatial
coordinates
on the
in
the motor
path.
Since
evoked
potentials
represent
a
array (4-dimensions)
the interneuron
time, in daysnetworks,
post-injury.
Formay
the
low-level
function of and
spinal
they
firstless
twosensitive
animals,towe
used aparameter
squared-exponential
(SE)higherkernel
be
stimulus
variations than
kSE (x
(see [43], At
Section
level
motor
the 14.2)
Hz stimulus frequency, the
i , xj ) functions.
pulse responses can be dissociated
from their predecessors
kSE (xi , xj ) = σ 2 exp[−r2 /2],
(3)
and successors. Although
this performance metric does not
q
involve a complex
r = motor
(xi −behavior,
xj )T Σ−1it(xprovides
(4)
i − xj ), a short-term,
measurable surrogate for therapeutic effectiveness, as it is an
where Σof =interneuronal
diag(l12 , . . .function.
, l52 ) andImportantly,
ln is the although
length scale
indicator
our
corresponding
to dimension
n of x.outcomes,
The SE kernel
implies
end-goal
is to improve
therapeutic
focusing
on a
that the GPsurrogate
is infinitely
mean-square
differentiable
(see [43],
short-term
avoids
the credit assignment
problem,
i.e.,
Section 4.1.1).
For the
third and
a hybrid kernel
determining
which
actions
are fourth
more animals,
or less responsible
for
kh (xtherapeutic
used:
that
This choice also allows us to use an
i , xj ) was outcome.
intra-animal control design,
thus avoiding a requirement for
kh (xi , xj ) = σ12 km (xi , xj ) + σ22 δ(i, j),
(5)
infeasibly large
cohorts of animals.
where
is a the
3rd
order Matérn
kernel, km (x
=
Showing
can successfully
control
i , xj ) this
√km that
√ algorithm
(1 + 3r)
3r),structure
which allows
considerably
rougher
activity
andexp(−
learn the
of spinal
cord’s responses
functions thana the
kernel a[43],
and δ(i,implementation.
j) is the Dirac
demonstrates
stepSEtoward
therapeutic
delta function
indicesspecified
i, j. Thisfor
formulation
tries in
to
Further,
since on
thestimulus
prior means
this function
2
infer the
noisy
f (xrespect
∼ N (0, σthe
Table
I do
not function
vary with
to stimulus
i , i) = g(x
i ) + ηi , ηilocation,
2)
instead of
of the
the function’s
underlyingspatial
function
g(x),must
which
has kernel
entirety
structure
be discovered
km . Since
khsuccessful
implies a minimum
about findicates
(x), the
online;
thus,
learning ofuncertainty
spatial structure
response
to any candidate
action,
short-time
variations
in the
that
the algorithm
may be able
to discover
previously
unknown
CONSTANT
M EAN
H YPERPARAMETERS
c0 = 0.0 M V
R EPETITIONS
A LLOWED
2
CONSTANT
c0 = 0.1 M V
c0 = 0.1 M V
1
3
CONSTANT
CONSTANT
c0 = 0.9 M V
c0 = 0.9 M V
2
2
CONSTANT
c0 = 0.9 M V
2
CONSTANT
c0 = 1.4 M V
c1 = 0.08 M V/ DAY,
c0 = −0.2492 M V
c1 = 0.08 M V/ DAY, c0 = 1.7 M V
c1 = 0.08 M V/ DAY, c0 = −0.1348 M V
c1 = 0.08 M V/ DAY,
c0 = 0.2513 M V
2
2
F UNCTION
L INEAR
L INEAR
L INEAR
L INEAR
2
2
2
reward
are subsumed
into the
term, leaving the overall
or
patient-specific
patterns
via noise
exploration.
function
shape
to be captured
by the Matérn
kernel. to
Thea
Since the
performance
of this algorithm
was compared
hybrid kernel
proved towe
be describe
more successful.
Both kernelssearch
were
human
experimenter,
the experimenter’s
implemented
using
the GPML
[44].
Table
I showsfive
all
and
exploitation
strategy.
For toolbox
a typical
day,
involving
kernel hyperparameters
usedofinstimulus
the experiments.
batches
(with five selections
parameters per batch):
conventional
practice,
hyperparameters
•Following
During batches
1-3, the
human’sthe
first
3 actions were used
chowith sen
the first
two
animals
were
estimated
from
an experimental
using prior knowledge of electrode placement
relative
data to
setthe
selected
the human
which
evenly sampled
target by
muscle’s
motorexpert,
pool, and
observations
from
th
th be expected to faithfully
the decision
set,
and
thus
could
earlier tests. The 4 and 5 actions were typically small
estimate
the spatial
lengthscales.
was performed
variations
on successes
in the The
first fitting
3 choices.
2
th
th
using
a
conjugate
gradient
method
on
the
data
• In the 4
and 5 batches, the human selected likelihood
actions to
underexplore
the GPportions
model.ofThis
procedure
was reasonably
the fitting
spinal cord
and array.
successful
for the
lengthscales,
poorly
captured
the
The human
andspatial
the algorithm
actedbut
under
different
rules,
multi-day
temporal
lengthscales.
This
poor
fit
likely
results
confounding their competitive analysis. First, the human did
fromrepeat
the strong
influence
short-timescale
(e.g.,
not
an action
withinofany
day. Further,effects
the human
noise)
in
the
value
of
the
data
likelihood,
dominating
the
received immediate feedback, i.e., could observe the responses
effects
of
fitting
the
underlying
spinal
plasticity.
Consequently,
to an action and immediately use this information. Despite
the time
lengthscale
in the that
firstthe
twohuman
animals
was hand-selected.
these
factors,
we maintain
expert’s
performance
Similarly, the hyperparameters were hand-selected for the third
is an important baseline, due to the human’s reasonable
and fourth animals’ kernel in such a way as to largely ignore
effectiveness at finding and utilizing good stimulus parameters.
in-session variation and produce model predictions capturing
the long-term changes in the spinal cord’s responses.
B. Time Variation of the Reward Function
The GP-BUCB algorithm
V. Rexplores
ESULTS and exploits the reward
function
by
selecting
actions
which
individually
are expected
Experiments in four rats prepared
as in Section
II were
to
yield
high
reward,
substantial
new
information
the
carried out, resulting in 37 testing sessions and 1200 on
actions
reward
function,
or
both.
Assuming
a
stationary
reward
func(670 actions selected by the algorithm, with the rest selected
tion,
well
chosen algorithm
parameters,
some technical
by the
competing
human). Fig.
3 showsand
a retrospective
of
conditions,
GP-BUCB
is
guaranteed
to
converge
overgraph
time
GP-BUCB performance across all experiments. Each
to
a subset
of actions
which yield
maximal
reward
[34]. In
plots
the reward
(peak-to-peak
evoked
potential
amplitudes)
practice,
the
response
function
is
temporally
non-stationary
resulting from all stimulus pulses, color-coded according to
across
sessions
due to factors
including
spinal cord
plasticity
the human
or GP-BUCB
origin
of the stimulus
selection.
dueThese
to training
and
the
natural
timecourse
of
post-injury
plots show substantial variation in the evoked potenresponse.
Other
include
array degradation
or small
tials over the
timeeffects
frame of
the experiments.
In all animals,
the
physical
array
displacements
along
the
cord.
Since
per-day range of rewards is similar for both human- andspinal
GPcord
plasticity isstimuli,
an essential
phenomenon
of thisvariations
therapeutic
BUCB-selected
indicating
that response
do
approach,
the
algorithm’s
model
of
the
spinal
response
not strongly depend upon which agent selects the actionsmust
and
capture
variation.
thus thattemporal
the interleaved
block structure provides a reasonable,
To
do
so,
a
variable
time, T , (in The
unitstemporal
of days
intra-animal baseline forrepresenting
examining GP-BUCB.
post-injury) is added to the GP model. Hence, GP-BUCB
2 Using the minimize.m function in the GPML toolbox [44].
must
regress on a function of both applied stimulus x and
6
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
time T . This approach requires the covariance function to
be a map k : (D × T ) × (D × T ) → R. Further, actions
are selected from the subset of D × T corresponding to the
present time t, effectively a time-varying decision set Dt . The
modeling problem becomes one of extrapolating from the past
to the present, both on the order of days (between sessions)
or minutes (between batches). In our setting, the posterior
uncertainty at t grows as t increases and previous observations
recede into the past. Even with optimal action selection, the
posterior uncertainty cannot decrease below a level set by the
measurement noise, rate of observation, and rate of change
in the underlying responses, and effect analogous to nonzero steady state uncertainty limits for Kalman filters [42]
and Gaussian Markov processes [43]. Thus, the algorithm will
never know the best action with vanishingly small uncertainty.
Fortunately, the spinal cord’s current state appears to vary
slowly; given observations from previous days, the algorithm’s
internal model should make reasonable predictions about the
spinal cord’s responses. Further, it is not necessary to select
the optimal action at time t, only one which provides useful
therapy. Indeed, maintaining some variety in the stimuli may
be appropriate, as training via an overly limited set of stimuli
may result in inferior therapy [41].
C. Redundancy/Repetition Control
Another deviation from the classical GP bandit setting
concerns the repetitions of stimuli within short periods. For
most experiments (see Table I), the algorithm was not allowed
to request the same action more than twice per day in order
to ensure adequate exploration. Since 5 batches of 5 stimuli
are typically requested daily, 13-25 distinct electrode pairs
are tested. Some of the consequences of this restriction are
dicsussed in Section VI-A.
D. Kernel and Mean Functions
The kernel and mean functions describe the algorithm’s
prior assumptions about spinal cord response. To ensure reasonable action selection, their chosen form must carefully encode crucial problem information. Most importantly, k(x, x0 )
encodes the dependence between stimuli x and x0 . For our
experiments, the 5-dimensional stimulus description vector
consisted of the anode and cathode spatial coordinates on the
array (4-dimensions) and the time, in days post-injury. For the
first two animals, we used a squared-exponential (SE) kernel
kSE (xi , xj ) (see [43], Section 4.2)
kSE (xi , xj ) = σ 2 exp[−r2 /2],
q
r = (xi − xj )T Σ−1 (xi − xj ),
(3)
(4)
where Σ = diag(l12 , . . . , l52 ) and ln is the length scale
corresponding to dimension n of x. The SE kernel implies
that the GP is infinitely mean-square differentiable (see [43],
Section 4.1.1). For the third and fourth animals, a hybrid kernel
kh (xi , xj ) was used:
kh (xi , xj ) = σ12 km (xi , xj ) + σ22 δ(i, j),
(5)
where√km is a 3rd
√ order Matérn kernel, km (xi , xj ) =
(1 + 3r) exp(− 3r), which allows considerably rougher
functions than the SE kernel [43], and δ(i, j) is the Dirac
delta function on stimulus indices i, j. This formulation tries to
infer the noisy function f (xi , i) = g(xi ) + ηi , ηi ∼ N (0, σ22 )
instead of the underlying function g(x), which has kernel
km . Since kh implies a minimum uncertainty about f (x), the
response to any candidate action, short-time variations in the
reward are subsumed into the noise term, leaving the overall
function shape to be captured by the Matérn kernel. The
hybrid kernel proved to be more successful. Both kernels were
implemented using the GPML toolbox [44]. Table I shows all
kernel hyperparameters used in the experiments.
Following conventional practice, the hyperparameters used
with the first two animals were estimated from an experimental
data set selected by the human expert, which evenly sampled
the decision set, and thus could be expected to faithfully
estimate the spatial lengthscales. The fitting was performed
using a conjugate gradient method2 on the data likelihood
under the GP model. This fitting procedure was reasonably
successful for the spatial lengthscales, but poorly captured the
multi-day temporal lengthscales. This poor fit likely results
from the strong influence of short-timescale effects (e.g.,
noise) in the value of the data likelihood, dominating the
effects of fitting the underlying spinal plasticity. Consequently,
the time lengthscale in the first two animals was hand-selected.
Similarly, the hyperparameters were hand-selected for the third
and fourth animals’ kernel in such a way as to largely ignore
in-session variation and produce model predictions capturing
the long-term changes in the spinal cord’s responses.
V. R ESULTS
Experiments in four rats prepared as in Section II were
carried out, resulting in 37 testing sessions and 1200 actions
(670 actions selected by the algorithm, with the rest selected
by the competing human). Fig. 3 shows a retrospective of
GP-BUCB performance across all experiments. Each graph
plots the reward (peak-to-peak evoked potential amplitudes)
resulting from all stimulus pulses, color-coded according to
the human or GP-BUCB origin of the stimulus selection.
These plots show substantial variation in the evoked potentials over the time frame of the experiments. In all animals, the
per-day range of rewards is similar for both human- and GPBUCB-selected stimuli, indicating that response variations do
not strongly depend upon which agent selects the actions and
thus that the interleaved block structure provides a reasonable,
intra-animal baseline for examining GP-BUCB. The temporal
variation in reward appears to be composed of both day-to-day
fluctuations and a long-term, upward trend. While some of this
upward trend is due to the algorithm’s growing knowledge of
effective stimulus patterns, much of it is a general, plastic
increase in spinal cord responses to the stimuli. Table II
summarizes the experiments, the number of actions initiated
by the algorithm, and the number of unique stimuli chosen.
2 Using
the minimize.m function in the GPML toolbox [44].
DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF EPIDURAL ELECTROSTIMULATION
DESAUTELS
et al.:
ACTIVE
LEARNING
ALGORITHM
FOR
CONTROL
MULTI-ELECTRODE
EPIDURAL
ELECTROSTIMULATION
DESAUTELS
et al.:
ACTIVE
LEARNING
ALGORITHM
FOR
CONTROL
OF OF
MULTI-ELECTRODE
EPIDURAL
ELECTROSTIMULATION
Reward
4 4
3 3
2 2
4
4
3
3
Reward
Peak−to−Peak Response (mV)
Peak−to−Peak Response (mV)
5 5
2
2
1
1
7
7
7
1 1
0 0
15 15
20 20
25 25
Time
(Days
Post−injury)
Time
(Days
Post−injury)
30 30
0 0
0 0
35 35
50 50
100 100
150 150
TimeTime
(Actions)
(Actions)
Reward
4 4
3 3
15
15
20
25
20 Post−injury)
25
Time (Days
Time (Days Post−injury)
2
2.5
2
1.5
1
0.5
0.5
0
4 0
4
3
2
2
1
1
30
30
35
0 0
0
0
35
6
8
10
Time
6 (Days Post−injury)
8
10
Time (Days Post−injury)
(c) Animal P1
(c) Animal P1
12
12
1.5
1
2
2
1.5
1
1.5
1
0.5
0.5
0.5
0
9 0
9
10
11
12
13
14
Time
10 (Days11Post−injury)
12
13
Time (Days Post−injury)
(d) Animal P2
(d) Animal P2
50
50
100
150
200
250
100 Time
150
250 300 300 350 350
(Actions) 200
Time (Actions)
(b) (b)
Animal
W2 W2
Animal
2.5
Reward
1
3
14
2
2.5
1.5
2
Reward
10
Peak−to−Peak Response (mV)
1.5
4
1.5
1
Reward
0 10
Peak−to−Peak Response (mV)
Peak−to−Peak Response (mV)
Peak−to−Peak Response (mV)
2
4
1
(b)(b)
Animal
W2W2
Animal
2.5
5
2
Reward
0
5
Reward
Peak−to−Peak Response (mV)
Peak−to−Peak Response (mV)
5 5
1
250 250
(a) (a)
Animal
W1 W1
Animal
(a)(a)
Animal
W1W1
Animal
6 6
2
200 200
1
0.5
0.5
0
0 0
0
20
40
60
80
20 (Actions)
40
60
80
Time
Time (Actions)
(c) Animal P1
(c) Animal P1
0
0
2
1.5
1
0.5
0
0
20
40
Time20
(Actions) 40
Time (Actions)
60
60
(d) Animal P2
(d) Animal P2
Fig. 4. Human expert’s (blue) and GP-BUCB’s (green) reward, defined as the
Fig. 3.
3. Peak-to-peak
Peak-to-peak amplitude
(i.e.,
reward
ininmV)
ofofindividual
stimulus
Fig.
amplitude
(i.e.,
reward
individual
stimulus
Fig.4.4. Human
Human
expert’s
(blue)
andevoked
GP-BUCB’s
(green)
reward,
as
Fig.
expert’s
(blue)
and
GP-BUCB’s
(green)electrical
reward, defined
defined
asthe
the
Fig. for
3. each
Peak-to-peak
amplitude
(i.e.,
rewardevoked
inmV)
mV)
of
individual
stimulus peak-to-peak
amplitude
(mV)
of MR
by an epidural
stimulus
pulses
animal.
Blue
circles:
potentials
by
the
human
expert’s
pulses
forfor
each
animal.
Blue
circles:
potentials
evoked
bybythe
human
expert’s
amplitude
(mV)
of
MR
evoked
an
epidural
stimulus
amplitude
(mV)
by
an delivered
epidural electrical
electrical
stimulus
pulses
each
animal.
Blue
circles:
potentials
evoked
the
human
expert’s of peak-to-peak
5peak-to-peak
V (animals P1,
W1, and
W2)of
orMR
7 Vevoked
(animalby
P2),
in 1 Hz pulse
choices;
Green
‘x’:
potentials
due
to
algorithm
choices.
Solid
red
lines
denote
choices;
Green
‘x’:‘x’:
potentials
due
to to
algorithm
choices.
Solid
red
(animals
P1,
W1, and
and W2)
W2)
or 77measures
V (animal
delivered
in
ofof55The
VV(animals
W1,
or
V
(animal
P2),
delivered
in11Hz
Hz
pulse
choices;
potentials
due
algorithmlines
choices.
Solid
redlines
linesdenote
denote trains.
averageP1,
reward
(solid
lines)
theP2),
rewards
generated
bypulse
wipe of
of Green
the algorithm’s
algorithm’s
memory.
Dashed
denote
hyperparameter
trains.
Thewhile
average
reward (solid
(solid
lines)
measures
the
rewards
generated
by
aa wipe
the
memory.
Dashed
lines
denote
hyperparameter
a
wipe
of
the
algorithm’s
memory.
Dashed
lines
denote
hyperparameter
trains.
The
average
reward
lines)
measures
the
rewards
generated
by
every
action,
the
maximum
reward
(dashed)
is
the
best
reward
observed
changes
without
a
memory
wipe.
The
human
experimenter
and
algorithm
every
action,
while
the
maximum
reward
(dashed)
is
the
best
reward
observed
changes
without
a memory
wipe.
The
human
experimenter
and
algorithm
changes
without
a
memory
wipe.
The
human
experimenter
and
algorithm
every
action,
while
the
maximum
reward
(dashed)
is
the
best
reward
observed
up
to
a
given
time.
GP-BUCB
typically
has
a
superior
average
reward
to
were blind to each other’s actions and the resulting rewards, and their actions
up
given time.
time. GP-BUCB
GP-BUCB
typically
has
average
reward
were
blind
to to
each
other’s
actions
and
rewards,
and
were
blind
each
other’s
actions
andthetheresulting
resulting
rewards,
andtheir
theiractions
actions theup
toto aa given
typically
has aa superior
superior
average
reward
human,
while
maintaining
a superior
or competitive
maximum
reward.
In toto
were
interleaved
in batches.
The
decreased
evoked
potential
magnitude
from
thehuman,
human,
while
maintaining
superior
or
maximum
reward.
were
interleaved
in in
batches.
The
evoked
potential
magnitude from
were
interleaved
batches.
Thedecreased
decreased
evoked
potential
the
superior
or competitive
competitive
maximum
reward.
W1 (a)while
and maintaining
W2
(b), runsaa(i.e.,
periods
between memory
wipes)
are InIn
days
9 and
10 to 15-17
in animal
W2
is a typical
feature
of the magnitude
preparation.from animals
animalsby
W1
(a)
and W2
W2
(b), runs
runs (i.e.,
(i.e., periods
days
9 and
10 10
to to
15-17
in in
animal
W2
days
9 and
15-17
animal
W2is isa atypical
typicalfeature
featureofofthe
thepreparation.
preparation. separated
red(a)
vertical
lines.
animals
W1
and
(b),
periods between
between memory
memory wipes)
wipes) are
are
separated by
by red
red vertical
vertical lines.
lines.
separated
variation
in reward
appears to be composed of both day-to-day
A.
Evaluaton
of Reward
variation
in reward
appears to be composed of both day-to-day The maximum reward, Wmax , may be examined in terms
fluctuations and a long-term, upward trend. While some of this
maximum
beperform
examined
terms
so The
strategies
like reward,
random Wsearch
may
wellin in
this
max , may
of the maximum reward value observed
so far,
fluctuations
and
a
long-term,
upward
trend.toWhile
some i.e.,
of this
Reward
is calculated
from
the
response
action,
upward
trend
is due to the
algorithm’s
growinganknowledge
of a metric,
of the maximum
reward value
observed
so far,
without delivering
effective
therapy.
Alternatively, the
upward trend
is dueoftopulses.
the algorithm’s
growing
knowledge of
continuous
sequence
Let of
yτ itdenote
the peak-to-peak
effective stimulus
patterns, much
is a general,
plastic
Wmax
(T ) = max(wt ).
average reward so
far observed
th patterns, much of it is a general, plastic
effective
stimulus
t≤Tmax(wt ).
Wmax (T ) =
response
τ cord
stimulus
pulse. The
reward
resulting
increase to
in the
spinal
responses
to the
stimuli.
Tablefrom
II
t≤TT
increase
in
spinal
cord
responses
to
the
stimuli.
Table
II
Xthoroughness of the
an
action started
at time t is calculated
summarizes
the experiments,
the numberas:
of actions initiated
The maximum reward describes1the
W
(T
)
=
,
τmax
(t)actions initiated
summarizes
the
experiments,
the
number
of
The
maximum
reward
describes
thewtthoroughness
of the
avg
by the algorithm, and the number
of uniqueX
stimuli chosen.
search over the stimulus space, sinceT a large
maximum reward
1
t=1
by the algorithm,
and the number of unique stimuli
searchthat
overhigh-performing
the stimulus space,
since
a
large
maximum
reward
wt =
yτ , chosen.
(6) implies
stimuli have been found, and
τmax (t) − τmin (t) + 1
implies
stimuli
have been
found,
τ =τmin (t)
measures
howhigh-performing
well exploration
and exploitation
tradedand
off;
thus
could that
conceivably
be exploited.
However,
thisarequantity
A. Evaluaton of Reward
thus
could
conceivably
be
exploited.
However,
this
quantity
high
average
reward
implies
that
a
search
process
spent
most
the number of applications of high-performing stimuli,
A. Evaluaton
of Reward
where
τmin (t) and
τmax (t) respectively index the first and last ignores
ignores
the like
number
of applications
ofactions,
high-performing
stimuli,
of
its search
timerandom
choosing
effective
yet
alsoin explored
so
strategies
search
may
perform
well
this
Reward
is
calculated
from
the
response
to
an
action,
i.e.,
a
pulses
for action
t.
so strategies
like random
search
may perform
wellSuperior
in
Reward
is calculated
from Let
the yresponse
to
an
action, i.e., a metric,
thoroughly
enough
to
find
high-performing
stimuli.
without
delivering
effective
therapy.
Alternatively,
thethis
continuous
sequence
of
pulses.
denote
the
peak-to-peak
τ
The maximum
reward,
Wmax Let
, may
be examined
in terms
metric,
without
delivering
effective
therapy.
Alternatively,
continuous
sequence
of
pulses.
y
denote
the
peak-to-peak
th
averagereward
performance
of one approach relative to anothertheat
so far observed
response
to the τ reward
stimulus
pulse.
Theτ reward
from average
th
of
the maximum
value
observed
so
far,resulting
average
reward
so
far
response
to
the
τ
stimulus
pulse.
The
reward
resulting
from
time T is indicated by observed
a larger average
reward value.
an action started at time t is calculated as:
T
an action started at
time
t
is
calculated
as:
1X
τmax
(t)
T
Wire
Arrays:
Results:
Fig.
3(a)
and
Wmax (T
)
=
max
(w
).
X
t
Wavg (T ) =
wt , (b) show the search
1
1X
τmax (t)
t≤T
X yτ ,
(6)
wt =
Wavg
(T ) T
= t=1and the
wt , human expert in anresponses
obtained
by
GP-BUCB
1
T t=1
(6)
wt =τmax (t) − τmin (t) + 1 τ =τmin (t) yτ ,
The maximum
reward
describes
the
thoroughness
of
the
imals
implanted
with
wire
arrays,
while
Fig. 4(a) and (b) show
τmax (t) − τmin (t) + 1
measures how well exploration and exploitation are traded off;
τ =τmin (t)
search
over
the
stimulus
space,
since
a
large
maximum
reward
the
associated
maximum
and
average
rewards.
Thetraded
algorithm
measures
wellimplies
exploration
exploitation
are
where τmin (t) and τmax (t) respectively index the first and last high
averagehow
reward
that aand
search
process spent
mostoff;
implies
that
high-performing
stimuli
have
been
found,
and
chose
a
total
of
235
and
310
actions
respectively
for
animals
wherefor
τmin
(t) and
average
implies
that actions,
a searchyet
process
spent
most
pulses
action
t. τmax (t) respectively index the first and last of high
its search
timereward
choosing
effective
also explored
thus
could
conceivably
be
exploited.
However,
this
quantity
W1
and
W2.
In
both
animals,
left,
caudal
anode
locations
were
pulses for action t.
of its search time choosing effective actions, yet also explored
ignores the number of applications of high-performing stimuli, associated with strong responses to the stimulus. Although the
8
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
IEEE
TRANSACTIONS
ONON
BIOMEDICAL
ENGINEERING,
IEEE
TRANSACTIONS
BIOMEDICAL
ENGINEERING,VOL.
VOL.X,X,NO.
NO.Y,Y,MONTH
MONTHYEAR
YEAR
TABLE II
N UMBER OF ACTIONS CHOSEN
BY
GP-BUCB
( UNIQUE ), AND S PEARMAN
TABLE
TABLE
II II
RANK
BETWEEN
EACH
REWARD
NCORRELATIONS
UMBER
OFCHOSEN
ACTIONS
CHOSEN
BY SELECTED
GP-BUCB
),’ SAND
N UMBER
OF
ACTIONS
BY
GP-BUCB
( UNIQUESTIMULUS
),( UNIQUE
AND S PEARMAN
AND
THE NUMBER
OF PULSES
USED
BY GP-BUCB
OR’ SHUMAN
EXPERT
SCORRELATIONS
PEARMAN
’ S ρ BETWEEN
EACH
SELECTED
STIMULUS
AND
RANK
BETWEEN
EACH
SELECTED
STIMULUS
’REWARD
S REWARD
TO
TEST
THAT
STIMULUS
, WITH
VALUES
. OR
PULSES
BYP -GP-BUCB
HUMAN .
ANDTHE
THENUMBER
NUMBEROF
OFSTIMULUS
PULSES
USED
BY APPLIED
GP-BUCB
OR
HUMAN
EXPERT
MTHAT
ACHINE
TO TEST
STIMULUS , WITH P - VALUES .
1
1
2
2
Row
2
Row
A NIMAL
ACTIONS
ρM achine (p)
ρHuman (p)
M ACHINE
& RUN
(U NIQUE )
A NIMAL
ACTIONS
ρM achine (p)
ρHuman (p)
P1, RUN 1
70 (43)
0.264 (0.0906)
N/A
& RUN
(U NIQUE )
W1, RUN 1
80 (23)
0.105 (0.6348)
−0.073 (0.6829)
P1, RUN 1
70 (43)
0.264 (0.0906)
N/A
RUN 2
55 (23)
0.339 (0.1132)
−0.045 (0.8091)
W1, RUN 1
80 (23)
0.105 (0.6348)
−0.073 (0.6829)
RUN 3
100 (29)
0.480 (0.0083)
0.325 (0.0500)
RUN 2
55 (23)
0.339 (0.1132)
−0.045 (0.8091)
W2, RUN 1
70 (28)
0.582 (0.0012)
−0.034 (0.8682)
RUN 3
100 (29)
0.480 (0.0083)
0.325 (0.0500)
RUN 2
240 (33)
0.824 (< 0.0001)
0.569 (0.0001)
W2, RUN 1
70 (28)
0.582 (0.0012)
−0.034 (0.8682)
P2, RUN 1
55 (40)
0.581 (0.0001)
−0.477 (0.0021)
RUN 2
240 (33)
0.824 (< 0.0001)
0.569 (0.0001)
P2, RUN 1
55 (40)
0.581 (0.0001)
−0.477 (0.0021)
1
1
2
Row
8
4
Row
8
4
9
9
4
4
9
9
Gnd
Gnd
Gnd
Gnd
sizes of the strongly responding regions of the stimulus space
thoroughly enough to find high-performing stimuli. Superior
were different in the two animals (see Fig. 5), the algorithm
A
B
C
A
B
C
average enough
performance
one approach relative
another at
Column
Column
thoroughly
to findofhigh-performing
stimuli.toSuperior
learned the response function well in both cases. Both animals
A
B
C
A
B
C
time performance
T is indicatedofbyone
a larger
average
reward
average
approach
relative
to value.
another at
(a) Animal
W1
(b) Animal
Column
Column W2
exhibited increased responsiveness over the course of the
Arrays:byResults:
3(a) reward
and (b) value.
show the search
timeexperiments.
TWire
is indicated
a largerFig.
average
(a) Animal W1
(b) Animal W2
Fig. 5. Time-averaged, intra-day-normalized response for cathode-anode
responses
obtained
by GP-BUCB
and (b)
the human
expert
in an- electrode pairs, animal W1 (a) and animal W2 (b), shown with respect to
Wire
Arrays:
Results:
Fig.Animals:
3(a) and
show Fig.
the
search
Parylene
Microarray
Results:
3(c) and Fig.Fig.
5. 5.Time-averaged,
intra-day-normalized
response
for
Time-averaged,
intra-day-normalized
response
forcathode-anode
cathode-anode
imals implanted
with
wire arrays,
while
Fig. 4(a)
and in
(b) show array
location
as viewed dorsally
(rostral at top).
Each large
box
corresponds
responses
obtainedP1)
by
GP-BUCB
andand
the4(d)
human
expert
4(c) (animal
and
Fig. 3(d)
(animal
P2) anshow electrode
pairs,
animal
W1W1
(a)(a)and
animal
W2
(b),
shown
with
respect toto
electrode
pairs,
animal
and
animal
W2
(b),
shown
with
to a spatial location of a wire array electrode. Within the boxes,respect
each small
the
associated
maximum
and
average
rewards.
The
algorithm
imals
implanted
with
wire
arrays,
while
Fig.
4(a)
and
(b)
show
location
as as
viewed
dorsally
(rostral
at attop).
Each
box
corresponds
array
location
viewed
dorsally
(rostral
top).cathode
Eachlarge
large
boxcould
corresponds
the maximum and average reward obtained with the parylene array
rectangle
denotes
the spatial
location
of one
which
be paired
to
a
spatial
location
of
a
wire
array
electrode.
Within
the
boxes,
each
small
chose
a
total
of
235
and
310
actions
respectively
for
animals
to
a
spatial
location
of
a
wire
array
electrode.
Within
the
boxes,
each
small
the associated
maximum
average
The
algorithm
with an anode located at the large box’s position. Every stimulus thus
has a
arrays, which
logged 4and
days
and 70rewards.
actions in
animal
P1, and rectangle
denotes
the
spatial
location
of
one
cathode
which
could
be
paired
rectangle
denotes
the
spatial
location
of
one
cathode
which
could
be
paired
W1
and
W2.
In
both
animals,
left,
caudal
anode
locations
were
unique
address.
Each
small
rectangle’s
color
corresponds
to
the
time-averaged
chose
a total
235
and 310
actions
animals
3 days
andof55
actions
in animal
P2.respectively
GP-BUCB for
quickly
found withwith
an anode
located
at the
large
box’s
position.
Every stimulus
has aa
an anode
located
at the
large
position,
everythus
stimulus
normalized
response
strength
(1:
red;box’s
0: blue)
of thatgiving
anode-cathode
pair. Black
strong
responses
to
theanode
stimulus.
Although
theunique
W1 associated
and
Inwith
bothhigh-performing
animals,
left, caudal
locations
address.
Each
small
rectangle’s
color
corresponds
unique
address.
Each
small
rectangle’s
color
correspondstotothe
thetime-averaged
time-averaged
and W2.
exploited
stimuli
in this
muchwere
larger
rectangles represent electrode pairs not selected. Data are shown only from
normalized
response
strength
(1: 0:
red;
0: of
blue)
that anode-cathode
pair.
response
strength
(1: red;
blue)
thatof
anode-cathode
pair. Black
sizes
the
strongly
responding
regions
of theAlthough
stimulusstimuli
spacenormalized
associated
with
strong stimuli.
responses
to the
the
days where a total of at least 15 actions were taken between the human and
spaceofof
possible
Note
thatstimulus.
high-performing
Black rectangles
represent
invalid
pairs,
e.g., A1
A1.are
Data
are shown
only
rectangles
represent
electrode
pairs
not
selected.
Data
shown
only
from
were
different
in
the
two
animals
(see
Fig.
5),
the
algorithm
algorithm. The set of highly excitable configurations for animal W1 is broader
sizesobserved
of the strongly
regions
the stimulus
in theseresponding
two animals
had ofcaudal,
left-sidespace
anode daysfrom
daysa where
of at
15 actions
were between
taken between
the human
where
total
ofa total
at W2.
least
15least
actions
were taken
the ahuman
than
that for
animal
Note
that monopolar
stimuli (using
groundand
wire)
theasin
response
function
well
inFig.
both5),
cases.
Both
animals
and algorithm.
Thehighly
set of highly excitable
configurations
for animal
W1 is
werelearned
different
the in
two
animals
(seeexperiments,
thethough
algorithm
The set
configurations
animal
is broader
locations,
seen
the
wire array
having algorithm.
were tested
by of
the humanexcitable
experimenter
in animal for
W1;
these W1
results
are shown
broader
than
that
for
animal
W2.
Note
that
monopolar
stimuli
(using
a
ground
exhibited
increased
responsiveness
over
the
course
of
the
than
that
for
animal
W2.
Note
that
monopolar
stimuli
(using
a
ground
wire)
learned
well
in bothfor
cases.
animals
suchthe
an response
anode wasfunction
not itself
sufficient
goodBoth
responsiveness.
in thewere
bottom-most
wire)
by boxes.
theexperimenter
human experimenter
animal
W1;
theseare
results
are
were
tested bytested
the human
in animalinW1;
these
results
shown
experiments.
exhibited
increased responsiveness over the course of the in the
shown
in the bottom-most
bottom-most
boxes. boxes.
Microarray
Animals: Results: Fig. 3(c) and
B.Parylene
Computational
Performance
experiments.
typically dominated by the Cholesky decompositions (which
4(c)When
(animal
P1)
and
Fig.
3(d)MATLAB
and
4(d) programming
(animal
P2)and
show
Parylene
Microarray
Animals:
Results:
Fig. 3(c)
implemented in the
lanexplores
additional
of our
experiments
on
the
require an
order n3 ramifications
calculation) and
memory
usage by
storage
maximum
and
average
reward
obtained
the one
parylene
3
(which
4(c)the
(animal
P1)EMG
and data
Fig.processing
3(d)
and needed
4(d)
(animal
P2)
show
guage
, the
to with
prepare
new typically dominated by the Cholesky decompositions
2 multi-electrode stim3
prospects
for
using
machine
learning
in
of
the
n
×
n
kernel
matrices
(order
n
memory).
As
n
grows
which
4 days
andobtained
70
actions
in the
animal
P1,
andrequire an order n calculation) and memory usage by storage
the arrays,
maximum
andlogged
average
reward
with
parylene
batch typically
required
2-3 min,
whereas
GP-BUCB’s
seleculation
VI-A
the wire
array
withnthe
number
ofanalyzes
observations,
the As
computational
3tion
days
and
55
actions
in
animal
P2.
GP-BUCB
quickly
found
×therapy.
nincreasing
kernelSection
matrices
(order
n2 memory).
n results,
grows
arrays,
which
logged
4
days
and
70
actions
in
animal
P1,
and
of new stimuli required approximately 5 s of processor of the
including
a
discussion
of
inter-animal
comparisons.
Section
load
can
be
managed
by
“forgetting”
older
observations,
using
and
exploited
high-performing
stimuli
in
this
much
larger
with
the
increasing
number
of
observations,
the
computational
time.
In
our
alternating
double
blind
experiment
format,
the
3 days and 55 actions in animal P2. GP-BUCB quickly found
VI-B
addresses
the
analogous
parylene
array
results.
Section
approximate
Cholesky
decomposition,
or
sparse
approximate
of possible
stimuli.
Note
that high-performing
stimuli
total
time
to
construct
a new
stimulus
batchmuch
was less
than load can be managed by “forgetting” older observations, using
and space
exploited
high-performing
stimuli
in this
larger
VI-C
examines
another
perspective
therapeutic
utility.
GP regression
[45].
Aggregating
the individual,
pulse-by-pulse
observed
in
these
two
animals
had
caudal,
left-side
anode
Cholesky
decomposition,
oronsparse
approximate
the
time
required
for
the
human
expert
to
perform
a
batch
space of possible stimuli. Note that high-performing stimuli of approximate
Lessons
learned
regarding
kernel
function
and
hyperparameter
an action
also saves
substantialpulse-by-pulse
effort.
locations,
as seen
in indicate
the
wire that,
array
experiments,
though
havingGPresponses
regressionto[45].
Aggregating
the individual,
tests. These
results
a sufficiently
efficient
observed
in these
two
animals
had given
caudal,
left-side
anode
selection are the focus of Section VI-D. Section VI-E examines
such
an
anode
was
not
itself
sufficient
for
good
responsiveness.
responses
to
an
action
also
saves
substantial
effort.
and
fluid
data
acquisition
and
processing
system,
the
algolocations, as seen in the wire array experiments, though having
the effectiveness of the VI.
algorithm’s
search from the perspective
D ISCUSSION
could
perform
closed-loop
active
learning
much more
suchrithm
an anode
was
not itself
sufficient for
good
responsiveness.
of information gain. Finally, effects on hindlimb muscles other
rapidly
than was the
case in our experiments. Under the most
experiments
show
that GP-BUCB
can compete
D ISCUSSION
B.
Computational
Performance
thanOur
theanimal
LTA are
theVI.
subject
of Section
VI-F .
severe condition, with n = 4432 previous evoked potential with a human expert in optimizing stimuli. This section also
When implemented
in the MATLAB programming lan- Our animal experiments show that GP-BUCB can compete
B. Computational
observations,
aPerformance
new batch of five actions was constructed in explores additional ramifications of our experiments on the
3
with aWire
human
expert
in optimizing stimuli. This section also
,
the
EMG
data
processing
needed
to prepare one
guage
Array
Animals
When
in thetoMATLAB
programming
lan-new
142.0implemented
s, with 50 s devoted
Cholesky decomposition,
and
50 A.
prospects
for using
machine learning
multi-electrode
stimexplores
additional
ramifications
of ourinexperiments
on the
batch
typically
required
2-3
min,
whereas
GP-BUCB’s
selec3
, the EMG
data processing
needed
to prepare
one neware ulation
guage
s allocated
to evaluation
the kernel
matrix.
Computations
The wire
arraySection
experiments
tested how
GP-BUCB
copes
therapy.
VI-A
analyzes
the
wire
array
results,
for using machine learning in multi-electrode stimtion
of new
stimuli required
approximately
5 s of processor
batch
typically
required
2-3
min,Cholesky
whereas decompositions
GP-BUCB’s
selectypically
dominated
by the
(which prospects
with
a time-varying
reward
The algorithm generally
including
a discussion
of function.
inter-animal
Section
ulation
therapy.
Section VI-A
analyzes the comparisons.
wire array results,
time.
In
our
alternating
double
blind
experiment
format,
the
3
order nrequired
calculation)
and memory
by storage succeeded
tion require
of newanstimuli
approximately
5 susage
of processor
in future
performance
prediction.
For example,
VI-B
addresses
the
analogous
parylene
array
results.
Section
a discussion
inter-animalshowed
comparisons.
Section
total
time
to
construct
a new
stimulus
batch format,
was
ofIntheour
n×
n kernel
matrices
(order
n2 memory).
As less
n grows
time.
alternating
double
blind
experiment
thethanincluding
in
animal
W2 (run
2)of GP-BUCB
strong queueing
VI-C
examines
another
perspective
on
therapeutic
utility.
addresses
the analogous
parylene
array
results.
time
fornumber
human
expert
to perform
a batch
with
thetorequired
increasing
of
observations,
the
totalthe
time
construct
athe
new
stimulus
batch
wascomputational
less
than ofVI-B
behavior
(it first
selected what
were
predicted
andSection
proven
Lessons
learned
regarding
kernel
function
and
hyperparameter
These
results
indicate
that,
given
a sufficiently
efficient
another before
perspective
on
therapeutic
utility.
load required
can
be managed
by
“forgetting”
older
observations,
using
to
beexamines
the are
bestthe
stimuli)
being
forced
by the
repetition
the tests.
time
for the
human
expert
to
perform
a batch
of VI-C
selection
focus
of
Section
VI-D.
Section
VI-E
examines
and
fluid
data
acquisition
and
processing
system,
the
algolearned
regarding
kernel function
and hyperparameter
Cholesky
or sparse approximate
limit
to
consider
other
stimuli.
Thissearch
behavior
results
as the
tests.approximate
These results
indicatedecomposition,
that, given a sufficiently
efficient Lessons
the effectiveness
of
theSection
algorithm’s
from
theexamines
perspective
rithm
could
perform
closed-loop
active
learning
much
more
4
selection
are
the
focus
of
VI-D.
Section
VI-E
GP
regression
[45].
Aggregating
the
individual,
pulse-by-pulse
signature
saw-tooth
shape
in
the
average
reward
plots
and fluid data acquisition and processing system, the algoof
informationofgain.
Finally, effects
on hindlimb
muscles(Fig.
other
rapidly
than
was
the
case
in
our
experiments.
Under
the
most
the
effectiveness
the
algorithm’s
search
from
the
perspective
responses
to
an
action
also
saves
substantial
effort.
4(b)).
Moreover,
GP-BUCB
autonomously
maintained
good
rithm could perform closed-loop active learning much more
than
the
LTA
are
the
subject
of
Section
VI-F
.
severe condition, with n = 4432 previous evoked potentialof performance
information gain.
Finally,
effectsgood
on hindlimb
(showing
the same
queueing muscles
behavior)other
over
rapidly
than was the case in our experiments. Under the most
VI. D
observations, a new batch
ofISCUSSION
five actions was constructed inthan
the
LTA
are
the
subject
of
Section
VI-F
.
long
periods,
e.g.,
in
animal
W2’s
two-week-long
second
run,
severe condition, with n = 4432 previous evoked potential
A. Wire
Array two
Animals
142.0
s,animal
with 50
s devoted to
Cholesky
decomposition,
and 50 which
Our
experiments
show
that
GP-BUCB
can
compete
included
multi-day
gaps
in
testing.
In
this
respect,
observations, a new batch of five actions was constructed in
s allocated
to evaluation
the kernel matrix.
areA. Wire
a human
expert intooptimizing
stimuli. Computations
This section
also
TheArray
wire Animals
array experiments tested how GP-BUCB copes
142.0with
s, with
50 s devoted
Cholesky decomposition,
and 50
4 The upward edge is produced by the superior reward resulting from the
with
a
time-varying
reward
function.
The algorithm
generally
s allocated
to on
evaluation
matrix.
Computations
are
33Running
Running
on
either aa 2.2
2.2the
GHzkernel
quad-core
i7
machine,
88 GB
first stimuli
the session
and the flatter
or downward-sloped
portion
of the
wire inarray
experiments
tested
how GP-BUCB
copes
either
GHz
quad-core
i7 processor
processor
machine, with
with
GB The
succeeded
future
prediction.
Forqueued
example,
RAM, and
and Mac
Mac OS
OS 10.6,
10.6, or
or an
an AMD
AMD Athlon
Athlon 64
average
reward in
curve
results performance
from the poorer stimuli
which are
later
RAM,
64 X2
X2 Dual
Dual Core
Core 3800+,
3800+, 33 GB
GBwith
a
time-varying
reward
function.
The
algorithm
generally
3 Running
RAM machine
machine
running
Ubuntu
12.04. i7 processor machine, with 8 GB
inintheanimal
session. W2 (run 2) GP-BUCB showed strong queueing
on either
a 2.2 GHz
quad-core
RAM
running
Ubuntu
12.04.
succeeded in future performance prediction. For example,
RAM, and Mac OS 10.6, or an AMD Athlon 64 X2 Dual Core 3800+, 3 GB
RAM machine running Ubuntu 12.04.
in animal W2 (run 2) GP-BUCB showed strong queueing
DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF EPIDURAL ELECTROSTIMULATION
these experiments validated that good forward prediction is
possible over such intervals due to the utility of GP-BUCB’s
internal model, and reached a standard of performance which
could be clinically relevant.
Algorithm Analysis. Several possible criticisms might follow from the fact that GP-BUCB can daily request a number
of actions (typically 25, at least 13 unique, in 5 batches of
5 actions) which is of the same order as 42, the size of the
decision set, D.
• Measures of success: Simply finding a high reward stimulus may be an inadequate measure of algorithm success.
• Differentiation between algorithms: Given the small size
of D and the repetition limit, other learning algorithms
would likely select similar actions to GP-BUCB.
• Generalization to unconstrained settings: Our use of a
repetition limit can cloud an assessment of how effectively GP-BUCB can track f in an unconstrained setting.
The experimental data support additional several arguments
for GP-BUCB’s success. First, and unsurprisingly5 , GPBUCB found high-performing stimuli in the first day (within
2 batches of 5 stimuli, in each case) in all wire array runs.
This indicates that GP-BUCB’s search is free from gross
deficiencies. Second, GP-BUCB demonstrates success in the
sense of avoiding ineffective stimuli. The number of unique
electrode pairs selected (see Table II) was always substantially
smaller than the size of the decision set. GP-BUCB thus
avoids exhaustive testing, but appears to effectively model
the reward for untested stimuli. Finally, since GP-BUCB was
found to be competitive with the performance of a human
expert in terms of the maximum reward so far observed,
while maintaining a better average reward, it follows that
GP-BUCB thoroughly searches for high-reward regions while
simultaneously exploiting its knowledge in a therapeutically
more effective fashion. These observations argue that GPBUCB makes a successful exploration-exploitation tradeoff.
The second criticism argues that the decision set size and
repetition limit mean that all algorithms will have to explore D
somewhat, and that further, convergence cannot be observed,
making algorithms indistinguishable. GP-BUCB’s ability to
discriminate against (likely bad) stimuli, discussed in relation
to the first criticism, is a crucial feature in this setting, and
compares favorably with simple methods such as random
search or bandits without structured payoffs. With respect to
convergence, the strong queueing behavior indicates that GPBUCB has converged to the extent allowed by the repetition
restriction, since high-reward actions are selected first (i.e., in
preference to other actions), despite the low exploratory value
of doing so under (2), until disallowed by the repetition limit.
GP-BUCB’s queueing behavior also requires a reasonably
correct posterior over the rewards which will be received in
the day’s first batch of stimulation, implying that its internal
model is adequate to the inter-day prediction task.
Fatigue offers an alternate explanation for the observed
queueing behavior. If the animal rapidly tires during a session,
5 If 6 electrode pairs are effective out of 42, e.g., all those pairs including a
particular anode, 10 stimuli drawn randomly without replacement will include
one of these 6 with probability > 82%.
9
and this fatigue appears as a decrease in evoked peak-to-peak
amplitude, the responses elicited by the first stimuli would
appear stronger than those from later stimuli. These “better”
stimuli would then be selected early in future sessions, a
feedback loop which could produce the observed queueing
effect. However, the human expert’s choices would also show a
strong within-session fatigue pattern, which they do not. Also,
the difference, ∆V , between average responses to different
applications of the same stimulus in the same session (by
human, algorithm, or both) can be measured. A large and
negative ∆V is required if queueing is explained by fatigue.
However, for actions leading to substantial responses (withinsession average ≥ 0.2 mV), ∆V is small in magnitude
(∆V = −0.0871 ± 0.4897 mV, n = 142 pairs from animal
W2), even over prolonged periods (−0.0336 ± 0.4228 mV for
34 intervals of ≥ 1 hour). Further, the correlation between
the interval and ∆V is very weak (r = 0.0090). Fatigue,
therefore, is not a convincing explanation for the observed
queueing behavior.
The third criticism is more difficult to directly refute since
no experiments without the repetition constraint were performed, and the repetition limit forces the algorithm to track
stimuli which are high-performing, but suboptimal. However,
for stationary reward functions, the performance of GPBUCB is supported by theoretical guarantees and validation
in simulation [5]. Although [5] does not provide guarantees
in the case of time-varying rewards, such guarantees have
been derived for other methods, e.g., [46], suggesting that
similar guarantees could be derived for GP-BUCB. Together,
the relatively benign time variation in the reward function in
our experiments and the stationary case guarantees suggest
that unconstrained GP-BUCB could also maintain a useful
posterior over responses to previously high-performing stimuli.
Cross-animal Comparisons. Fig. 5 compares the performance of electrode pair selections across animals W1 and
W2. The data demonstrate substantial repeatability between
animals with regard to the evoked potential strength elicited
by the same stimulus, particularly for the highest-performing
stimuli (typically those with electrode A9 as the anode).
B. Parylene Array Animals
The data from the animals implanted with parylene arrays
allow us to assesses how well GP-BUCB can search a large
decision space (666 possible electrode pairs, of which no
more than 25 can be tested in a typical day), using a structured reward function. In both animals, after finding a highperforming configuration (C8 A9 in animal P1, and C6 A9
in animal P2), the algorithm moved sharply to exploit this
knowledge, allocating repeated queries to neighboring actions.
Both discovered configurations including an anode at the left,
caudal corner of the array, similar to those found in the wire
array animals. Note that GP-BUCB overcame its spatially flat
prior on f to find this anatomical pattern. The key, highperforming stimulus was found in the 3rd batch for animal
P1, and the 6th batch for animal P2. The efficiency of the
search and the rapid exploration of high-performing stimuli
argue that the structured GP model provides a benefit over
conventional bandit algorithms for this application.
IEEE
TRANSACTIONS
ON
BIOMEDICAL
ENGINEERING,
VOL.
X, NO.
NO.
Y, MONTH
MONTH
IEEE
TRANSACTIONS
ON
BIOMEDICAL
ENGINEERING,
VOL.
X,
Y,
YEAR
IEEE
TRANSACTIONS
ON
BIOMEDICAL
ENGINEERING,
VOL.
X, NO.
Y, MONTH
YEAR
PPROPORTION
OF
STIMULI
WHICH
ELICITED
APEAK
PEAK
- TO
- PEAK
ROPORTION
OF APPLIED
APPLIED
STIMULI
WHICH
ELICITED
- TO
- PEAK
P ROPORTION
OF APPLIED
STIMULI
WHICH
ELICITED
A APEAK
- TO
- PEAK
RESPONSES
EXCEEDING
A THRESHOLD
THRESHOLD
OLD
INDICATES
MORE
RESPONSES
EXCEEDING
A
BBOLD
INDICATES
MORE
RESPONSES
EXCEEDING
A THRESHOLD
. B..OLD
INDICATES
MORE
SUPRA
-- THRESHOLD
RESPONSES
SUPRA
THRESHOLD
RESPONSES
SUPRA
- THRESHOLD
RESPONSES
. ..
A
NIMAL
A
GENT
T
HRESHOLD
M V)
A NIMAL
AGENT
T HRESHOLD Π Π
( M(V)
0.50.5 1.01.0
1.51.5
2.02.0
A LGORITHM 0.229
0.229 0.157
0.157 0.100
0.100 0.043
0.043
P1 P1
A LGORITHM
A LGORITHM 0.662
0.662 0.450
0.450 0.263
0.263 0.000
0.000
W1,W1,
RUNR1UN 1 A LGORITHM
H UMAN 0.440
0.440 0.253
0.253 0.067
0.067 0.013
0.013
H UMAN
W1, RUN 2
A LGORITHM
0.945
0.727
0.545
0.091
W1, RUN 2
A LGORITHM
0.945
0.727
0.545
0.091
H UMAN
0.720
0.520
0.280
0.040
H UMAN
0.720
0.520
0.280
0.040
W1, RUN 3
A LGORITHM
0.910
0.850
0.790
0.560
W1, RUN 3
A LGORITHM
0.910
0.850
0.790
0.560
H UMAN
0.828
0.697
0.626
0.424
H UMAN
0.828
0.697
0.626
0.424
W2, RUN 1
A LGORITHM
0.657
0.543
0.357
0.214
W2, RUN 1
A LGORITHM
H UMAN 0.657
0.564 0.543
0.418 0.357
0.273 0.214
0.127
0.564
W2, RUN 2 HAUMAN
LGORITHM
0.779 0.418
0.642 0.273
0.442 0.127
0.308
W2, RUN 2
A LGORITHM
H UMAN 0.779
0.681 0.642
0.472 0.442
0.285 0.308
0.153
HAUMAN
0.681
P2
LGORITHM
0.255 0.472
0.091 0.285
0.036 0.153
0.000
P2
A LGORITHM
H UMAN 0.255
0.085 0.091
0.000 0.036
0.000 0.000
0.000
H UMAN
0.085
0.000
0.000
0.000
Although no comparison to a human expert was made in
in animalno
P2), the algorithm
sharply
to exploit
this
Although
to followed
a moved
humanthe
expert
was
made
in
animal P1, thecomparison
P2 experiment
interleaving
humanknowledge,
allocating
repeated
queries
to
neighboring
actions.
animal
P1, thebatch
P2 experiment
humanmachine
paradigm. followed
Possibly the
dueinterleaving
to prior anatomical
Both discovered
configurations
including
anode
at the left,
machine
batch the
paradigm.
Possibly
due responding
to anprior
anatomical
knowledge,
human found
strongly
stimuli on
caudal
ofday.
the However,
array,
those
in the
wire
the firstcorner
testing
soon to
after,
thefound
algorithm
found
knowledge,
the human
found similar
strongly
responding
stimuli
on
array
animals.
Note
that
GP-BUCB
overcame
its
spatially
flat
high-performing
in both
the final day
of
the first
testing day. stimuli
However,
soon animals.
after, theInalgorithm
found
prior
on
f
to
find
this
anatomical
pattern.
The
key,
highanimal P2 testing,
GP-BUCB
found stimuli
comparable
high-performing
stimuli
in both animals.
In theoffinal
day of
performing
stimulus
was by
found
in the as
3rdseen
batch
for animal
reward
to
the
best
found
the
human,
in
Fig.
4(d).
animal P2 testing,
GP-BUCB found stimuli of comparable
P1, and the 6th batch for animal P2. The efficiency of the
reward to the best found by the human, as seen in Fig. 4(d).
search
and the rapid
exploration of high-performing stimuli
C. Therapeutic
Relevance
argue that the structured GP model provides a benefit over
The reward
in (6) makes the heuristic assumption that
C. conventional
Therapeutic
Relevance
bandit algorithms for this application.
strength of response and contribution to long-term therapeutic
TheAlthough
reward in comparison
(6) makes the
assumption
thatin
to linearly
a heuristic
humanrelated).
expert
wasalternate
made
outcomes arenoequivalent
(i.e.,
An
strength
of
response
and
contribution
to
long-term
therapeutic
animal
P1,
the
P2
experiment
followed
the
interleaving
humanreward heuristic is
outcomes
equivalent
(i.e.,Possibly
linearly due
related).
An anatomical
alternate
machinearebatch
paradigm.
to prior
knowledge,
strongly
t = I(w
t > Π), responding stimuli on
reward
heuristictheis humanRfound
the
first testing day. However,
soon
after, the algorithm found
where wt is definedRin
(6),
Πt is
aΠ),
threshold dividing stimuli
I(w
> animals.
t = in
high-performing
stimuli
both
the final
day of
into therapeutically effective and ineffective In
classes
by average
animal
P2
testing,
GP-BUCB
found
stimuli
of
comparable
where
wt is
definedand
in I(·)
(6), isΠ the
is aindicator
threshold
dividing
stimuli
twitch
strength,
function.
This
alterreward
to
the
best
found
by
the
human,
as
seen
in
Fig.
4(d).
nate
reward
function
is
reasonable
if
fine
differences
between
into therapeutically effective and ineffective classes by average
strongly
responding
stimuli
the previous
twitch
strength,
and I(·)
is the(important
indicator tofunction.
Thisreward
altermeasures)
are
irrelevant.
Table
III
presents
a
comparison
of
nate reward function is reasonable if fine differences between
human
and
machine
reward
under
this
heuristic
for
different
C. Therapeutic
strongly
respondingRelevance
stimuli (important to the previous reward
values of
For nearly all
runsIII
andpresents
values ofaΠ,comparison
proportionally
measures)
areΠ.irrelevant.
Table
of
The algorithm
reward inactions
(6) makes
theexceeded
heuristictheassumption
that
more
met
or
specification.
human
and
machine
reward
under
this
heuristic
for
different
strength
of
response effectiveness
and contribution
to
long-term
therapeutic
Thus,
if
therapeutic
is
a
function
of
the
proporvalues
of Π. For
all runs andlinearly
values of
Π, proportionally
outcomes
are nearly
equivalent
related).
An alternate
tion of stimuli
which elicit(i.e.,
a strong response,
the algorithm
is
more
algorithm
actions
met
or
exceeded
the
reward
heuristicthan
is the human expert’s chosen specification.
more effective
strategy.
Thus, if therapeutic effectiveness is a function of the proportion of stimuli which elicit
response,
the algorithm is
Rt a=strong
I(wt >
Π),
Kernels than
and Hyperparameters
moreD.effective
the human expert’s chosen strategy.
For the structured GP model to faithfully capture muscle
where wt is defined in (6), Π is a threshold dividing stimuli
responses, yet allow rapid learning, the kernel function, mean
therapeutically
effective and ineffective classes by average
D. into
Kernels
and Hyperparameters
function, and their hyperparameters must be chosen well. Poor
twitch
strength,
and
I(·)model
is thetoindicator
function.
This
alterFor
the structured
faithfully
capture
muscle
choices
can lead toGP
undesired
behavior,
as seen
in Fig.
6(a).
nate
reward
function
is
reasonable
if
fine
differences
between
However,
comparatively
goodtheperformance
obtained
in
responses,
yetthe
allow
rapid learning,
kernel function,
mean
strongly
respondingthe
stimuli
(important
to the (5),
previous
later animals
hybrid
Matérn
showsreward
that
function,
and theirwith
hyperparameters
mustkernel,
be chosen
well.
Poor
measures)
are
irrelevant.
Table
III presents
a comparison
of
appropriate
kernel
choice can
resolve
this
choices
can lead
to undesired
behavior,
as problem
seen in (Fig.
Fig. 6(b)).
6(a).
human
and
machine
reward
under
this
heuristic
for
different
Although
the hybrid kernel
was performance
successful, ourobtained
experiments
However,
the comparatively
good
in
values
ofother
Π. Forpossible
nearly all
runs and
values
of Π,
proportionally
suggest
kernels.
For
example,
the
hybrid that
kerlater animals with the hybrid Matérn kernel, (5), shows
more
algorithm
actions
met or exceeded
the specification.
nel’s noise
termchoice
could
be
replaced
a term
which
models
the
appropriate
kernel
can
resolvebythis
problem
(Fig.
6(b)).
Thus,
if
therapeutic
effectiveness
is
a
function
of the on
proporcovariance
between
measurements
of
the
same
action
the
Although
the hybrid
successful,
ourtheexperiments
tion of stimuli
whichkernel
elicit awas
strong
response,
algorithm is
suggest
other
possible
kernels.
For
example,
the strategy.
hybrid kermore effective than the human expert’s chosen
nel’s noise term could be replaced by a term which models the
covariance between measurements of the same action on the
4 4
Peak-to-peak amplitude of MR (mV)
Peak-to-peak amplitude of MR (mV)
TABLE
III
TABLE
TABLE
IIIIII
3.53.5
3
3
2.5
2.5
2
2
1.5
1.5
1
1
0.5
0.5
0
0 28
28
29
29
30
30
31
32
31
32
Time (Days
post-injury)
Time (Days post-injury)
33
33
34
34
(a) Animal W1, Run 3
(a) Animal W1, Run 3
5
5
4.5
4.5
4
4
3.5
3.5
3
32.5
Peak-to-peak amplitude of MR (mV)
Peak-to-peak amplitude of MR (mV)
10 10
10
2.5 2
21.5
1.5 1
10.5
0.5 0
20
0
20
25
25
30
Time (Days post-injury)
30
Time (Days post-injury)
35
35
(b) Animal W2, Run 2
(b) Animal W2, Run 2
Fig. 6. GP posterior over the response function f (A1,A9,t). A black square
Fig.
6. GP
GP
posteriorover
overthe
theby
response
function
fA9
(A1,
A9,
t). Points
show
Fig.
6.
response
function
f (A1,
A9,
t).green
Points
show
denotes
a posterior
response
evoked
electrode
pair A1
and
‘x’s
denote
responses
electrode
pair
A1A1
A9A9
(black
squares)
andprior
other
stimuli
responses
evoked
by
electrode
pair
(black
squares)
and
other
stimuli
responsesevoked
due toby
other
stimuli.
The
red
dashed
line
is the
mean
of the
(green
dashed
prior
mean
of
the
entire
GPand
with±1
respect
entire‘x’s).
GP The
with
respect
toline
time.
The
posterior
mean
(solid)
standard
(green
‘x’s).
Thered
red
dashed
lineis isthe
the
prior
mean
of the
entire
GP
with
respect
todeviation
time. The
posterior
and
standard
deviation
confidence
intervals
(dashed)
for
fstandard
(A1,A9,t)
are shown
in blue.
to
time.
Theconfidence
posteriormean
mean(solid)
(solid)
and±1
±1
deviation
confidence
intervals
(dashed)
A9,
t)
blue.
(a):(a):
Animal
W1,W1,
run run
(a): Animal
W1,for
runff(A1,
3,(A1,
shows
the
effects
ofinpoor
kernel
andAnimal
hyperparameter
intervals
(dashed)
for
A9,
t)are
areshown
shown
in
blue.
3,choices.
shows the
effects
ofofpoor
kernel
hyperparameter
TheThe
posterior
The
posterior
prediction
of
thehyperparameter
responses atchoices.
thechoices.
beginning
ofposterior
P34 (at
3,
shows
the
effects
poor
kerneland
and
prediction
of
atatthe
of of
P34P34
(at (at
right,
using
dataThis
right, using
data
acquired on
P31
and
earlier)
was
extrapolated
poorly.
prediction
of the
the responses
responses
thebeginning
beginning
right,
using
data
acquired
P31
and
earlier)
poorly.
This
resulted
fromfrom
thetime
resultedon
from
squared-exponential
kernel’s
inherent
smoothness,
the
acquired
on
P31the
and
earlier)was
wasextrapolated
extrapolated
poorly.
This
resulted
the
squared-exponential
kernel’s
inherent
smoothness,
the time
lengthscale
used,
lengthscale used, and
the low
assumed
noise level.
selected
onused,
day
squared-exponential
kernel’s
inherent
smoothness,
theActions
time
lengthscale
and
theusing
low assumed
level. model,
Actionswere
selected
on day P34,
using this (b):
P34,
badlynoise
inaccurate
consequently
and
the low this
assumed
noise
level. Actions
selected
on dayunproductive.
P34, using this
badly
inaccurate
model,
wereand
consequently
unproductive.
(b): The altered
The
altered
kernel,
mean,
hyperparameter
choices
employed
in animal
badly
inaccurate
were consequently
unproductive.
(b):
kernel,
mean,
and model,
hyperparameter
employed
in animal
W2,The
run altered
2,
W2, run
2, avoid
this pathology, choices
while
still
providing
useful
prediction.
kernel,
mean,
and
hyperparameter
choices
employed
in
animal
W2,
run 2,
avoid this pathology, while still providing useful prediction.
avoid this pathology, while still providing useful prediction.
D. Kernels
Hyperparameters
same
day, butand
otherwise
treats observations as independent.
same
but includes
otherwise
treats observations
as independent.
SuchFor
aday,
kernel
a
variable
for each
the structured GPhidden
model additive
to faithfully
capture
muscle
Such
a
kernel
includes
a
hidden
additive
variable
each
configuration
on
each
day.
A
linear
temporal
term
inforthe
responses, yet allow rapid learning, the kernel function,
mean
configuration
on be
each
day. provided
A linearthat
temporal
term in the
kernel may also
useful,
old observations
function, and their
hyperparameters
must be
chosen well. Poor
kernel
also as
be ituseful,
that non-responsive
old observations
can be may
forgotten,
modelsprovided
the fact that
choices can lead to undesired behavior, as seen in Fig. 6(a).
can
be forgotten,
as itremain
models
the fact that non-responsive
configurations
typically
non-responsive.
However, the comparatively good performance obtained in
The kernel, typically
mean, and
hyperparameters
were periodiconfigurations
remain
non-responsive.
later animals with the hybrid Matérn kernel, (5), shows that
cally
If appropriate,
hyperparameters
wereperiodireThere-examined.
kernel, mean,
and hyperparameters
were
appropriate kernel choice can resolve this problem (Fig. 6(b)).
initialized
after an animal’s
first run, accompanied
by a wipe
of recally
re-examined.
If appropriate,
hyperparameters
were
Although the
hybridTable
kernelI and
wasFig.
successful,
our experiments
the
algorithm’s
memory;
3
detail
these
refitting
initialized after an animal’s first run, accompanied by a wipe of
suggest
other
possible
For example,
hybrid
kerand
memory
wipe
events.kernels.
The choice
to re-fit
orthe
wipe
algothe
algorithm’s
memory;
Table
I andby
Fig.
3 detail
these
refitting
nel’s
noise
term
could
be
replaced
a
term
which
models
the
rithmmemory
memorywipe
was made
by The
human
inspection.
This
approach
and
events.
choice
to re-fit
or
wipeonalgocovariance
between
measurements
of
the
same
action
protected the algorithm from making prolonged, catastrophic the
rithm
made by human
inspection.asThis
approach
samememory
day, butwas
otherwise
observations
errors,
introducing
some bias.treats
However,
this manual independent.
oversight
protected
the
algorithm
from
making
prolonged,
catastrophic
Such a kernel
includes a hidden
additive
variable tofortheeach
constitutes
a meta-algorithm
which can
be compared
errors,
introducing
some
bias. A
However,
this manual
oversight
configuration
on
each
day.
linear
temporal
in the
oversight required to detect device failure and is term
thus not
constitutes
a
meta-algorithm
which
can
be
compared
to
kernel may
also be execution
useful, provided
that old observations
unrealistic.
Automated
of such re-examinations
(in the
oversight
required
to
detect
device
failure
and
is
thus
can sense,
be forgotten,
as hyperprior
it models fitting)
the factand
thatalso
non-responsive
some
a form of
heuristics not
unrealistic.
Automated
execution
of
such
re-examinations
configurations
typically
remain
non-responsive.
for
detection of various
failure
modes
should naturally be built (in
some
sense,
a
form
of
hyperprior
fitting)
also
heuristics
into The
futurekernel,
systems,
but areand
beyond
the scopeand
of this
work.
mean,
hyperparameters
were
periodifor
detection
of
various
failure
modes
should
naturally
be built
cally re-examined. If appropriate, hyperparameters were
reinto
future systems,
but are first
beyond
scope of this
afterGain
an animal’s
run, the
accompanied
by a work.
wipe of
E.initialized
Information
the
algorithm’s
I andbyFig.
detail these
refitting
GP-BUCB
canmemory;
also be Table
asssessed
the3 amount
of inforand
memory
wipe
events.
The
choice
to
re-fit
or
wipe
algomation
it gains about
E.
Information
Gain the spinal cord’s responses. Using the
rithm
memory
was
made
by
human
inspection.
This
approach
GPGP-BUCB
model, the information
gain with by
respect
to the entire
also befrom
asssessed
the amount
of inforprotected the can
algorithm
making prolonged,
catastrophic
mation
it
gains
about
the
spinal
cord’s
responses.
Using
the
errors, introducing some bias. However, this manual oversight
GP model, the information gain with respect to the entire
reward function f can be quantified directly. Table IV shows
DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF MULTI-ELECTRODE EPIDURAL ELECTROSTIMULATION
DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF EPIDURAL ELECTROSTIMULATION
11
11
(tens to hundreds) and substantial periods of time (weeks).
exploitation
simpleallowed
therapeutic
application.
In particular,
Further, the in
GPa model
effective
extrapolation
forward
the
algorithm
allocated
more
actions
to
high-performing
stimin time from previous sessions and observations, a requirement
uli
while
still
matching
the
human
experimenter’s
effectiveness
to avoid extensive re-exploration. GP-BUCB also found taskinrelevant
findinganatomical
the best stimuli.
Such
demonstration
that
structure
in athe
responses it implies
encountered,
M EAN Im /Ir
M EAN R EWARD
M EAN Im /Ih
DAYS
A NIMAL
automated
algorithms
have
the
potential
to
efficiently
use
despite starting with no information as to which regions ofthe
the
& RUN
(± STD .)
R ATIO (± STD .)
(± STD .)
U SED
stimuli
available
to them,
and outside
the clinic,
P1, RUN 1
0.730 ± 0.145
N/A
N/A
0
cord were
expected
to beboth
mostinside
responsive.
This of
indicates
that
tothese
simultaneously
learn
about the
cord’s responses
W1, RUN 1
0.549 ± 0.165
1.368 ± 0.519
N/A
5∗
structures, and
potentially
novelspinal
or patient-specific
ones,
RUN 2
0.791 ± 0.132
1.474 ± 0.490
N/A
4
and
deliver
effective
therapy.
Our
successful
handlingGiven
of this
can be found by online, automatic experimentation.
the
RUN 3
0.697 ± 0.105
1.297 ± 0.317
N/A
5
problem
constitutes
theamong
successful
completion
of aand
necessary
W2, RUN 1
0.811 ± 0.145
1.540 ± 0.560
0.798 ± 0.093
4
extensive
differences
individual
injuries
injured
RUN 2
0.802 ± 0.119
1.600 ± 0.749
0.895 ± 0.152
10
prerequisite
a well-founded
attempt toFinally,
solve our
the results
more
spinal cords,forthis
capability is essential.
P2, RUN 1
0.924 ± 0.214
2.520 ± 2.415
1.119 ± 0.323
3
complex
clinical
problems
ahead,
maximizing
standing
or
show that GP-BUCB can effectively trade off exploration and
stepping
performance
in
paralyzed
subjects
and
tuning
stimuli
constitutes a meta-algorithm which can be compared to the exploitation in a simple therapeutic application. In particular,
to individual patients’ needs.
reward
function
f can
quantified
shows
oversight
required
to be
detect
device directly.
failure Table
and isIVthus
not the algorithm allocated more actions to high-performing stimthe
averaged
ratio
of
the
algorithm’s
daily
gain
in
conditional
unrealistic. Automated execution of such re-examinations (in uli while still matching the human experimenter’s effectiveness
information
that of
human fitting)
and to and
that also
of aheuristics
random
ACKNOWLEDGMENT
in finding the best stimuli.
Such a demonstration implies that
some sense, to
a form
of the
hyperprior
sampling
policy.
Interestingly,
the
algorithm
gains
a
compathe potential
efficiently
use in
the
for detection of various failure modes should naturally be built automated
We thank algorithms
Yanan Sui have
and Mrinal
Rath fortotheir
assistance
rable
(but smaller)
information
to both
the work.
human executing
stimuli available
to them, both inside and outside of the clinic,
into future
systems,amount
but areofbeyond
the scope
of this
the experiments.
and random searches. The algorithm’s repeated, exploitative
to simultaneously learn about the spinal cord’s responses
actions, which produce higher mean daily reward, may account
and deliver effective therapy.
Our successful handling of this
R EFERENCES
E. Information
for
most of thisGain
difference. However, the algorithm’s high
problem constitutes the successful completion of a necessary
[1] P. Gad et al.,
of a multi-electrode
for spinal
cord
conditional
information
gain,
despite allocating
manyofactions
GP-BUCB
can also be
asssessed
by the amount
infor- prerequisite
for“Development
a well-founded
attempt toarray
solve
the more
epidural stimulation to facilitate stepping and standing after a complete
to
stimuli
are well-understood,
suggests
that the
actions
mation
it which
gains about
the spinal cord’s
responses.
Using
the complex
clinical
problems
maximizing
standing
spinal cord
injury in
adult rats,”ahead,
J. Neuroeng.
& Rehab., vol.
10, no. 1,or
allocated
to exploration
are well-chosen.
GP model,
the information
gain with respect to the entire stepping
p. 2, 2013.
performance in paralyzed subjects and tuning stimuli
[2] S. Harkema et al., “Effect of epidural stimulation of the lumbosacral
reward function f can be quantified directly. Table IV shows to
individual
patients’ needs.
spinal cord on voluntary movement, standing, and assisted stepping after
the
averaged
ratio
of
the
algorithm’s
daily
gain
in
conditional
F. Effect of Learning on Other Hindlimb Muscles
motor complete paraplegia: A case study,” The Lancet, vol. 377, no.
information
thatLTA
of the
human and
to that of a random
9781, pp. 1938–1947,A2011.
CKNOWLEDGMENT
In additionto to
responses,
the peak-to-peak
twitch
[3] C. A. Angeli et al., “Altering spinal cord excitability enables voluntary
sampling
policy.
Interestingly,
the
algorithm
gains
a
compaamplitudes were recorded in the left soleus, right TA, and right
T.movements
A. Desautels
thanks
Yananparalysis
Sui and
Mrinal Brain,
Rath for
their
after chronic
complete
in humans,”
vol. 137,
rable (but
smaller)
amount
of muscles
information
to bothstrongly
the human
no. 5, pp.in1394–1409,
soleus
muscles.
Often,
several
responded
toassistance
executing2014.
the experiments.
and random
searches.
TheAlthough
algorithm’s
repeated,
[4] T. Desautels et al., “Application of a batch active learning algorithm to
gether
to a given
stimulus.
the LTA
twitchexploitative
amplitude
spinal cord injury therapy (program no. 466.01),” in Neuroscience 2013
actions, which
higherofmean
daily reward,
may account
accounted
for aproduce
large portion
the response
amplitudes
in the
R EFERENCES
Abstracts. San Diego, CA:
Soc. for Neuroscience, November 2013.
for most
of this
However,
the algorithm’s
high [5] ——, “Parallelizing exploration–exploitation tradeoffs in Gaussian pro−3
other
muscles
(r >difference.
0.25, p 1e
in all animals
and muscles,
[1] P. Gad et al., “Development of a multi-electrode array for spinal cord
conditional
gain, P1),
despite
allocating
many
cess bandit optimization,” J. Mach. Learning Res., vol. 15, pp. 3873–
except
the information
RTA in animal
these
responses
in actions
other
epidural stimulation to facilitate stepping and standing after a complete
3923, December 2014.
to stimulialso
which
are well-understood,
that theEven
actions
spinal cord injury in adult rats,” J. Neuroeng. & Rehab., vol. 10, no. 1,
muscles
demonstrated
substantial suggests
independence.
in
[6] R. van den Brand et al., “Restoring voluntary control of locomotion
p. 2, 2013.
allocated
exploration
are combinations
well-chosen. of muscle responses
after paralyzing spinal cord injury,” Science, vol. 336, no. 6085, pp.
this
simpletoparadigm,
many
TABLE IV
TABLE
I NFORMATION GAINS (GP-BUCB,
ImIV
; RANDOM SEARCH , Ir ; HUMAN ,
Im ; .RANDOM
SEARCH
, Ir ; HUMAN ,IN
IIhNFORMATION
), COMPUTEDGAINS
USING(GP-BUCB,
HYBRID KERNEL
Ih CANNOT
BE CALCULATED
Ih ), COMPUTED
HYBRID
KERNEL . Ih
CANNOT
BE CALCULATED
ANIMAL
W1 DUEUSING
TO USE
OF MONOPOLAR
STIMULI
, FOR
WHICH kh (x, IN
x0 )
ANIMAL IS
W1
DUE TO USE
, FOR
WHICH k.h (x, x0 )
UNDEFINED
. ∗OF
: RMONOPOLAR
EWARD RATIOSTIMULI
OUTLIER
DISCARDED
IS UNDEFINED . ∗ : R EWARD RATIO OUTLIER DISCARDED .
can be elicited. The creation of an efficiently computable
reward
metric
which faithfully
F. Effect
of Learning
on Other matches
Hindlimbtherapeutic
Muscles utility in
multi-muscle
behaviors
is
a
highly
non-trivial
problem worthy
In addition to LTA responses, the peak-to-peak
twitch
of
future
investigation.
amplitudes were recorded in the left soleus, right TA, and right
soleus muscles. Often, several muscles responded strongly toVII. CAlthough
ONCLUSIONS
gether to a given stimulus.
the LTA twitch amplitude
Our
experiments
show
that
algorithmincan
accounted for a large portion of an
the automated
response amplitudes
the
−3
handle
the
difficult
task
of
maximizing
the
performance
of a
other muscles (r > 0.25, p 1e in all animals and muscles,
relatively
neural interface.
results in
provide
except thecomplicated
RTA in animal
P1), theseThese
responses
other
amuscles
strong indication
that
a
structured
GP
model
can
effectively
also demonstrated substantial independence. Even in
capture
spinal
responses
over
large sets of
of muscle
possibleresponses
stimuli
this simple
paradigm,
many
combinations
(tens
to
hundreds)
and
substantial
periods
of
time
(weeks).
can be elicited. The creation of an efficiently computable
Further,
the GPwhich
model faithfully
allowed effective
forwardin
reward metric
matchesextrapolation
therapeutic utility
in
time
from
previous
sessions
and
observations,
a
requirement
multi-muscle behaviors is a highly non-trivial problem worthy
to
re-exploration. GP-BUCB also found taskof avoid
futureextensive
investigation.
relevant anatomical structure in the responses it encountered,
despite starting with no
information
as to which regions of the
VII.
C ONCLUSIONS
cord were expected to be most responsive. This indicates that
Ourstructures,
experiments
show that novel
an automated
algorithmones,
can
these
and potentially
or patient-specific
handle
the
difficult
task
of
maximizing
the
performance
of
can be found by online, automatic experimentation. Given thea
relatively differences
complicatedamong
neural individual
interface. These
provide
extensive
injuriesresults
and injured
a
strong
indication
that
a
structured
GP
model
can
effectively
spinal cords, this capability is essential. Finally, our results
capture
responses
over large trade
sets off
of possible
stimuli
show
thatspinal
GP-BUCB
can effectively
exploration
and
[2] S. Harkema et al., “Effect of epidural stimulation of the lumbosacral
1182–1185, 2012.
spinal cord on voluntary movement, standing, and assisted stepping after
[7] S. Thuret et al., “Therapeutic interventions after spinal cord injury,” Nat.
motor complete paraplegia: A case study,” The Lancet, vol. 377, no.
Rev. Neurosci., vol. 7, no. 8, pp. 628–643, 2006.
9781, pp. 1938–1947, 2011.
[8] V. R. Edgerton et al., “Rehabilitative therapies after spinal cord injury,”
[3] C. A. Angeli et al., “Altering spinal cord excitability enables voluntary
J. of Neurotrauma, vol. 23, no. 3-4, pp. 560–570, 2006.
movements after chronic complete paralysis in humans,” Brain, vol. 137,
[9] Y. Gerasimenko et al., “Epidural stimulation: Comparison of the spinal
no. 5, pp. 1394–1409, 2014.
circuits that generate and control locomotion in rats, cats and humans,”
[4] Exper.
T. Desautels
et al.,
“Application
of a batch2008.
active learning algorithm to
Neur., vol.
209,
no. 2, pp. 417–425,
spinal
cord
injury
therapy
(program
no.
466.01),”
Neuroscience
2013
[10] A. P. Pêgo et al., “Regenerative medicine for the intreatment
of spinal
Abstracts.
San Diego,
CA:promises?”
Soc. for Neuroscience,
2013.
cord
injury: More
than just
J. of CellularNovember
and Molecular
[5] Medicine,
——, “Parallelizing
tradeoffs in Gaussian provol. 16, no.exploration–exploitation
11, pp. 2564–2582, 2012.
cess
banditand
optimization,”
Mach. Learning
Res.,with
vol. body
15, pp.weight
3873–
[11] A.
Wernig
S. Müller, J.“Laufband
locomotion
3923,
December
2014.
support improved walking in persons with severe spinal cord injuries,”
[6] Spinal
R. vanCord,
den vol.
Brand
voluntary
30,etno.al.,4, “Restoring
pp. 229–238,
1992. control of locomotion
after
paralyzing
Science,
vol. 336,
6085,vol.
pp.
[12] G.
Courtine
et al., spinal
“Spinalcord
cordinjury,”
injury: Time
to move,”
Theno.
Lancet,
1182–1185,
2012. June 2011.
377,
pp. 1896–1898,
[7] ——,
S. Thuret
et al., “Therapeutic
interventions
spinalvia
cord
injury,”proNat.
[13]
“Recovery
of supraspinal
control of after
stepping
indirect
Rev. Neurosci.,
vol. 7, no. 8,after
pp. spinal
628–643,
priospinal
relay connections
cord2006.
injury,” Nature Medicine,
[8] vol.
V. R.
et 69
al.,–“Rehabilitative
therapies after spinal cord injury,”
14,Edgerton
no. 1, pp.
74, 2008.
J. of
vol.“Functional
23, no. 3-4,electrotherapy:
pp. 560–570, 2006.
[14] W.
T. Neurotrauma,
Liberson et al.,
Stimulation of the
[9] peroneal
Y. Gerasimenko
et al., “Epidural
of the
nerve synchronized
withstimulation:
the swingComparison
phase of the
gaitspinal
of
circuits thatpatients,”
generateArchives
and control
locomotion
in rats,
cats
and humans,”
hemiplegic
of Physical
Medicine
and
Rehabilitation,
Exper.
Neur.,
vol.
209,
no.
2,
pp.
417–425,
2008.
vol. 42, pp. 101–105, 1961.
[10] A.
A. Thrasher
P. Pêgo et
medicine
fordue
the totreatment
of elecspinal
[15]
et al.,
al., “Regenerative
“Reducing muscle
fatigue
functional
cord stimulation
injury: More
thanrandom
just promises?”
Cellular and
Molecular
trical
using
modulationJ.ofofstimulation
parameters,”
Medicine,
vol. 16,vol.
no.29,
11,no.
pp.6,2564–2582,
2012.
Artificial
Organs,
pp. 453–458,
2005.
[11] E.A.J. Wernig
“Laufband
locomotion
body weight
[16]
Bradburyand
andS.S. Müller,
B. McMahon,
“Spinal
cord repairwith
strategies:
Why
support
improved
walking
in personsvol.
with
severe
spinal August
cord injuries,”
do
they work?”
Nature
Rev. Neurosci.,
7, pp.
644–653,
2006.
Spinal
Cord, vol. et
30,al.,no.
4, pp. 229–238,
1992.
[17] M.
R. Dimitrijevic
“Evidence
for a spinal
central pattern generator
[12] inG.humans.”
CourtineAnn.
et al.,
Time
to move,”vol.
The860,
Lancet,
vol.
of “Spinal
the New cord
York injury:
Academy
of Sciences,
p. 360,
377,
pp.
1896–1898,
June
2011.
1998.
12
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR
[18] A. Prochazka et al., “Neural prostheses,” J. of Physiology, vol. 533,
no. 1, pp. 99–109, 2001.
[19] C. N. Shealy et al., “Electrical inhibition of pain by stimulation of the
dorsal columns: Preliminary clinical report,” Anesthesia and Analgesia,
vol. 46, pp. 489–491, July-August 1967.
[20] R. Herman et al., “Spinal cord stimulation facilitates functional walking
in a chronic, incomplete spinal cord injured,” Spinal Cord, vol. 40, pp.
65–68, 2002.
[21] K. Minassian et al., “Human lumbar cord circuitried can be activated
by extrinsic tonic input to generate locomotor-like activity,” Human
Movement Sci., vol. 26, pp. 275–295, 2007.
[22] D. G. Sayenko et al., “Neuromodulation of evoked muscle potentials
induced by epidural spinal cord stimuation in paralyzed individuals,” J.
of Neurophysiology, vol. 111, no. 5, pp. 1088–1099, March 2014.
[23] D. Baskent et al., “Using genetic algorithms with subjective input
from human subjects: Implications for fitting hearing aids and cochlear
implants,” Ear and Hearing, vol. 28, no. 3, pp. 370–380, 2007.
[24] A. Frankemolle, “Reversing cognitive-motor impairment in parkinson’s
disease patients using computational modelling approach to deep brain
stimulation programming,” Brain, vol. 133, no. 3, pp. 746–761, 2010.
[25] T. Mera, “Kinematic optimization of deep brain stimulation across
multiple systems in parkinson’s disease,” J. Neurosci. Methods, vol. 198286, no. 2, p. 280, 2011.
[26] J. Giuffrida, “Automated optimization of deep brain stimulation settings
multiple parkinson’s disease motor symptoms,” Neurology, vol. 74, no. 9,
p. Suppl. 2:A479, 2011.
[27] J. Fruitet et al., “Bandit algorithms boost brain computer interfaces for
motor-task selection of a brain-controlled button,” in Adv. Neural Info.
Proc. Syst., 2012, pp. 458–466.
[28] ——, “Automatic motor task selection via a bandit algorithm for a braincontrolled button,” J. Neural Eng., vol. 10, no. 1, p. 016012, 2013.
[29] T. Gürel and C. Mehring, “Unsupervised adaptation of brain-machine
interface decoders,” Frontiers in neuroscience, vol. 6, 2012.
[30] C. Vidaurre et al., “Co-adaptive calibration to improve BCI efficiency,”
J. Neural Eng., vol. 8, no. 2, p. 025009, 2011.
[31] A. Brunoni et al., “Clinical research with transcranial direct current
stimulation (tdcs): Challenges and future directions,” Brain Stimulation,
vol. 5, pp. 175–95, 2012.
[32] R. M. Gorodnichev et al., “Transcutaneous electrical stimulation of the
spinal cord: A noninvasive tool for the activation of stepping pattern
generators in humans,” Human Physiology, vol. 38, no. 2, pp. 158–167,
April 2012.
[33] R. R. Roy et al., “Chronic spinal cord-injured cats: Surgical procedures
and management,” Lab. Animal Sci., vol. 42, no. 4, pp. 335– 343, 1992.
[34] T. Desautels et al., “Parallelizing exploration–exploitation tradeoffs with
Gaussian process bandit optimization,” in Proc. 29th Int. Conf. Mach.
Learning, 2012.
[35] H. Robbins, “Some aspects of the sequential design of experiments,”
Bul. Am. Math. Soc., vol. 55, 1952.
[36] T. E. Prieto et al., “Measures of postural steadiness: Differences between
healthy young and elderly adults,” IEEE Trans. Biomed. Eng., vol. 43,
no. 9, pp. 956–966, 1996.
[37] B. R. Santos et al., “Reliability of centre of pressure summary measures
of postural steadiness in healthy young adults,” Gait & posture, vol. 27,
no. 3, pp. 408–415, 2008.
[38] D. M. Basso et al., “A sensitive and reliable locomotor rating scale for
open field testing in rats,” J. of Neurotrauma, vol. 12, no. 1, pp. 1–21,
1995.
[39] M. Antri et al., “Locomotor recovery in the chronic spinal rat: Effects
of long-term treatment with a 5-HT2 agonist,” Eur. J. of Neuroscience,
vol. 16, no. 3, pp. 467–476, 2002.
[40] A. J. Fong et al., “Spinal cord-transected mice learn to step in response
to quipazine treatment and robotic training,” J. Neurosci., vol. 25, no. 50,
pp. 11 738–11 747, 2005.
[41] L. L. Cai et al., “Implications of assist-as-needed robotic step training
after a complete spinal cord injury on intrinsic strategies of motor
learning,” J. Neurosci., vol. 26, no. 41, pp. 10 564–10 568, 2006.
[42] R. E. Kalman, “A new approach to linear filtering and prediction
problems,” Trans. ASME–J. Basic Eng., vol. 82, no. D, pp. 35–45, 1960.
[43] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine
Learning. MIT Press, 2006.
[44] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine
learning (GPML) toolbox,” J. Mach. Learning Res., vol. 11, pp. 3011–
3015, 2010.
[45] J. Quiñonero-Candela and C. Rasmussen, “A unifying view of sparse
approximate gaussian process regression,” J. of Mach. Learning Res.,
vol. 6, pp. 1939–1959, 2005.
[46] A. Slivkins, “Contextual bandits with similarity information,” J. Mach.
Learning Res., vol. 15, pp. 2533–2568, July 2014.
Thomas A. Desautels received the B.S. degree in biomedical engineering
from the University of California, Davis, CA, USA, and the M.S. and Ph.D.
degrees from the California Institute of Technology, Pasadena, CA, USA.
He is now a visiting research associate at the Gatsby Computational
Neuroscience Unit of University College London, London, UK, where his
work is supported by the Whitaker International Scholarship. His research
interests include neural decoding and interfaces, learning algorithms, and
neuroprosthetics.
Jaehoon Choe received the B.A. degree in biological sciences from the University of Chicago, Chicago, IL, USA, and the Ph.D. degree in neuroscience
from the University of California, Los Angeles, CA, USA.
He is now a postdoctoral research associate in the Information Systems
Sciences Laboratory at HRL Laboratories LLC, Malibu, CA, USA. His
research interests include brain-machine interfaces, neuroprosthetics, and
neural modeling.
Parag Gad received the B.Eng. degree in biomedical engineering from the
University of Mumbai, Mumbai, India, and the M.S. and Ph.D. degrees in
Biomedical Engineering from the University of California, Los Angeles.
He is currently employed as an assistant researcher in the department of
Integrative Biology and Physiology at University of California, Los Angeles.
His research includes developing hardware, software tools, and stimulation
strategies to facilitate locomotion and bladder control after paralysis.
Manheerej S. Nandra, biography not available at the time of publication.
Roland R. Roy received the B.S. degree in physical education from the
University of New Hampshire, Durham, NH, USA, and the MS and PhD
degrees in exercise physiology from Michigan Sate University, East Lansing,
MI, USA.
He has been at the University of California, Los Angeles, since 1977 where
he is currently a Distinguished Researcher in the Department of Integrative
Biology and Physiology and the Brain Research Institute. He is the primary
or co-author of over 400 research articles and book chapters.
Dr. Roy is a member of the Society for Neuroscience, American Physiological Society, and the American College of Sports Medicine.
Hui Zhong, biography not available at the time of publication.
Yu-Chong Tai (M’96-SM’03-F’06) received his B.S. degree in electrical
engineering from the National Taiwan University, Taiwan, and the M.S. and
Ph.D. degrees in electrical engineering from the University of California,
Berkeley, Berkeley, CA, USA.
He joined the California Institute of Technology as an Assistant Professor
in electrical engineering in 1989. He is the Anna L. Rosen Professor of
Electrical Engineering and Mechanical Engineering and the Executive Officer
of Medical Engineering at the California Institute of Technology. His research
focuses on Micro-Electro-Mechanical Systems (MEMS) and implantable
micro biomedical devices.
Dr. Tai was the co-chairman of the 2002 IEEE MEMS Conference. He
received the 2015 IEEE Robert Bosch MEMS/NEMS Award. He is also a
senior member of the American Society of Mechanical Engineers (ASME).
V. Reggie Edgerton, biography not available at the time of publication.
Joel W. Burdick received his B.S. degree in mechanical engineering from
Duke University and M.S. and Ph.D. degrees in mechanical engineering from
Stanford University.
He has been with the department of Mechanical Engineering at the California Institute of Technology since May 1988, where he is the Richard and
Dorothy Hayman Professor of Mechanical Engineering and BioEngineering.
He was appointed an IEEE Robotics Society Distinguished Lecturer in 2003.
His current research interests include sensor-based robot motion planning,
multi-fingered robotic hand manipulation, and spinal cord injury rehabilitation.
Download