IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR

An Active Learning Algorithm for Control of Epidural Electrostimulation

Thomas A. Desautels, Jaehoon Choe*, Parag Gad*, Mandheerej S. Nandra, Roland R. Roy, Hui Zhong, Yu-Chong Tai, V. Reggie Edgerton, Joel W. Burdick

Abstract—Epidural electrostimulation has shown promise for spinal cord injury therapy. However, finding effective stimuli on the multi-electrode stimulating arrays employed requires a laborious manual search of a vast space for each patient. Widespread clinical application of these techniques would be greatly facilitated by an autonomous, algorithmic system which chooses stimuli to simultaneously deliver effective therapy and explore this space. We propose a method based on GP-BUCB, a Gaussian process bandit algorithm. In n = 4 spinally transected rats, we implant epidural electrode arrays and examine the algorithm’s performance in selecting bipolar stimuli to elicit specified muscle responses. These responses are compared with temporally interleaved, intra-animal stimulus selections by a human expert. GP-BUCB successfully controlled the spinal electrostimulation preparation in 37 testing sessions, selecting 670 stimuli. These sessions included sustained, autonomous operations (10-session duration). Delivered performance with respect to the specified metric was as good as or better than that of the human expert. Despite receiving no information as to anatomically likely locations of effective stimuli, GP-BUCB also consistently discovered such a pattern. Further, GP-BUCB was able to extrapolate from previous sessions’ results to make predictions about performance in new testing sessions, while remaining sufficiently flexible to capture temporal variability. These results provide validation for applying automated stimulus selection methods to the problem of spinal cord injury therapy.
Index Terms—Implants, Learning Automata, Neural Engineering, Neuromuscular Stimulation, Spinal Cord Injury.

Copyright © 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Submitted December 15, 2014. This work was supported by NIH U01EB15521, R01EB007615, the Leona M. and Harry B. Helmsley Charitable Trust, and the Christopher and Dana Reeve, Broccoli, Walkabout, and F. M. Kirby Foundations. J. Choe and P. Gad contributed equally.
T. A. Desautels was with the California Institute of Technology, Pasadena, CA 91125, USA. He is now with the Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR UK (email: tdesautels@gatsby.ucl.ac.uk).
J. Choe was, and P. Gad, R. R. Roy, H. Zhong, and V. R. Edgerton are, with the University of California, Los Angeles, Los Angeles, CA, USA. J. Choe is now with HRL Laboratories, LLC, Malibu, CA, USA.
M. S. Nandra was, and Y.-C. Tai and J. W. Burdick are, with the California Institute of Technology, Pasadena, CA, USA.
VRE, RRR, and JWB hold shareholder interest in NeuroRecovery Technologies (NRT). VRE is also the President and Chairman of the Board. VRE, RRR, and JWB hold certain inventorship rights on intellectual property licensed by The Regents of the University of California to NRT and its subsidiaries. A patent has been submitted covering concepts described here.

I. INTRODUCTION

Animal [1] and human [2], [3] studies have shown that multi-electrode epidural spinal stimulation can provide significant recovery of motor and autonomic function in subjects suffering from severe spinal cord injury (SCI). Fig. 1 shows a 27-electrode stimulating array we employ in rat SCI models.
When these arrays are implanted in the epidural space, appropriate spatio-temporal patterns of electrical stimulation applied to them can facilitate the operation of spinal circuits caudal to the spinal injury, enabling beneficial motor function and the recovery of some voluntary control and autonomic function [2], [3]. Multi-electrode stimulating arrays provide many benefits over simple wire electrodes, such as patient-customized stimulation parameters, compensation for errors in surgical placement of the array, and potential adaptation to spinal cord plasticity. Moreover, this adaptivity is needed in the clinic: optimal stimulus patterns vary substantially across patients [3] and over time for each patient due to spinal cord plasticity. Hence, the clinical process of deploying multi-electrode epidural stimulation requires a burdensome search for the optimal stimulation paradigm for each patient. The space of possible stimuli which must be searched to find high-performing electrode combinations can be vast; e.g., the 19 parameters (which describe active electrode selection, electrode polarity, voltage, stimulus frequency, and pulse width) which can be varied on the 16-electrode array used in [2], [3] allow for $10^8$ different stimulus combinations. When restricted to pulse-like stimuli (parametrized by amplitude, frequency, phase, and pulse width), the array shown in Fig. 1 presents a 108-dimensional space (27 electrodes with 4 parameters each) that must be searched. Even though the range of permissible values is greatly reduced by safety and design factors, the necessary search for the small set of high-performing or optimal parameters can be tedious and of limited direct therapeutic value. This burden suggests the need for automated algorithms to facilitate, simplify, and accelerate the search process. Other factors motivate the need for automated search techniques: (1) different stimulation patterns are needed to facilitate different motor behaviors (e.g., standing vs.
stepping); (2) the current labor-intensive process of finding optimal stimulation patterns is expensive and may not result in optimal parameter choices; (3) no method yet exists to predictively account for plasticity in the nervous system during locomotor recovery after severe paralysis; and (4) a systematic method can facilitate widespread, cost-effective distribution of this therapy, as there are currently few therapists trained to optimize epidural stimulation for recovery after an SCI. This paper introduces a novel algorithm which can automatically optimize multi-electrode array stimulation parameters.¹

¹A preliminary version of this work has been reported [4].

Fig. 1. Parylene Array Device. (A) and (B): The complete implant, including main circuit board, parylene electrode array, headplug connector, EMG wires, and ground wires. (C): Tip of the array, showing suture sites. (D): Detail of a single electrode. The bright surface regions are electrical contacts and the darker material is a layer of parylene to prevent delamination. (E): View of the entire parylene and platinum microfabricated portion of the implant. Figure reproduced under Open Access from [1].

The methodology is based on a novel Gaussian Process Batch Upper Confidence Bound (GP-BUCB) active machine learning technique developed by some of the authors [5]. GP-BUCB uses a batch processing format that meshes with clinical practice: data from a completed training session is processed to select a batch of stimuli to be tested in the next training session. This approach decouples the data processing step from the clinical process. When the data processing step is sufficiently rapid, as in the experiments here, this approach can also be used within a session. The basic theory underlying GP-BUCB is developed and analyzed elsewhere [5]. This paper focuses on animal experiments designed to evaluate the method.
These studies used a block strategy in which an experienced neurophysiologist and the GP-BUCB algorithm each selected and applied batches of stimuli in an interleaved sequence. Each was blind to the competitor’s actions and the responses elicited, allowing comparison of the exploratory/exploitative strategies, while largely compensating for spinal plasticity. Our experiments showed that an automated algorithm can find high-performing stimulus parameters efficiently, competitively with a human expert. This should lead to a better therapeutic recovery experience, and shows that an automated algorithm can be a competitive option for determining effective therapeutic strategies.

Relation to Prior Work. This paper lies at the intersection of multiple fields: machine learning, neuromotor rehabilitation, and multi-electrode array technology. Herein, we address spinal cord injury, a medical condition with broad impact, which has been widely studied in both human [2], [3] and animal models [6]. Reviews of SCI research include [7]–[9]. Although regenerative approaches [10] hold long-term promise, they have not yet had substantial success in patients with a complete SCI [7]. Since there is no present cure for SCI, current practice focuses on therapy for the injured neuromuscular system. Locomotor therapy [11] produces gains in some patients through a variety of putative mechanisms [12], [13]. Several therapies involve electrical stimulation. Functional electrostimulation (FES) [14] aims to create muscle activation patterns associated with a desired movement. Since FES is an open-loop method, the stimulation must be carefully designed and/or user-controlled if complex behaviors are desired. FES also can engender rapid fatigue [15]. Other approaches focus on spinal cord stimulation to rehabilitate the cord’s motor control ability, without addressing the injury itself [16].
This philosophy relies on the spinal cord’s intrinsic circuitry caudal to the injury site, which remains viable and adaptable [8]. Specific targets include interneuronal networks responsible for reflexes and the central pattern generators that coordinate muscle activity in walking [17]. A variety of methods for delivering the electrical stimulus have been suggested, including penetrating microelectrodes [18]. The present paper focuses on epidural spinal electrostimulation (EES), which relies upon an electrode array implanted in the epidural space over the dorsal aspect of the lumbosacral spinal cord. Although originally developed for chronic pain therapy [19], epidural electrostimulation can produce complex motor patterns [17] in SCI subjects. EES has been successfully applied to spinal and decerebrate cats, spinalized rats, and humans with an SCI [9]. Properly configured EES can produce walking motions [9]. The combination of EES with partial body weight support exercise training can produce substantial functional and metabolic gains in locomotion in an incomplete quadriplegic patient [20]. More recently, some of the present authors have used EES coupled with motor training to demonstrate substantial gains in motor-complete patients [2], [3]. One mechanism believed to be involved in this type of stimulation is the activation of afferent fibers as they enter the spinal cord through the dorsal nerve roots [21]. A detailed attempt to examine spinal circuitry via EES-evoked potentials in human participants is presented in [22]. Even in this light, it is difficult to choose stimuli a priori, and so a search is necessary. Since the search for optimal stimulating parameters consumes valuable patient and clinician time, any automated algorithm must quickly find high-performing stimulus parameters (explore the stimulus space) and then apply them (exploit its knowledge) to enable useful therapy.
The unavoidable tension between these two requirements is classically called the exploration-exploitation tradeoff. Well-designed active learning algorithms strike an effective balance. The companion paper [5] reviews the active learning literature. Here we focus on the use of active learning for neurostimulation. Our algorithm is not the first to “tune” the electrical stimuli applied by a medical device. Machine learning algorithms have tuned cochlear implants [23]. Computational models have been proposed to adjust deep brain stimulation (DBS) parameters [24], and automated DBS motor evaluation is in development [25]. In comparison to our approach, the ad-hoc techniques used in cochlear tuning have no formal convergence guarantees. Moreover, EES invokes a more complex response (involving sensorimotor spinal networks) than does cochlear stimulation. Although the need for automated DBS tuning has been recognized [26], it is not yet a reality. Whereas this paper focuses on neural stimulation, active learning has been applied to Brain Computer Interfaces (BCIs), which classify recorded neural activity. Active learning has been used to find the stimuli which produce good discrimination between volitional and resting states [27], [28], and to calibrate BCI decoders [29] and BCI systems [30]. In contrast, this paper uses an approach with more structure (a Gaussian process class of reward functions) over the space of actions in order to enable very large decision sets. Further, we optimize a reward function while simultaneously applying the therapy, rather than in a separate, potentially non-therapeutic session. Finally, we note that our automated stimulus optimization approach can likely be adapted to other multi-electrode therapies, such as transcranial direct current [31], transcutaneous spinal cord [32], or deep brain stimulation.

Paper Organization.
Section II describes the procedures and the electrode arrays used in our experiments. The theory and structure of the GP-BUCB algorithm [5] is briefly reviewed in Section III and Section IV describes adaptations of GP-BUCB to the clinical and experimental setting. Section V summarizes our experimental results, with Section VI providing a detailed analysis.

II. EXPERIMENTAL METHODS

This section describes the preparation of the animal subjects (Section II-A), the micro-fabricated (Section II-B) and wire electrode arrays (Section II-C) used to stimulate their spinal cords, and basic testing procedures (Section II-D).

A. Injury, Implantation, and Animal Care

Our surgical and care procedures are largely derived from [1], and are similar to those used for cats [33]. All animals are adult female Sprague Dawley rats, approximately 300 g in mass at time of implantation. The following procedures are performed on each animal, usually in a single surgery:
1) Partial laminectomy at the T8-T9 vertebral level and complete spinal cord transection at approximately the T8 spinal segment, including the dura, using microscissors;
2) Placement of gel foam at the transection site as a coagulant and separator of the cut spinal cord ends;
3) Partial or full laminectomies of some vertebrae (T11, T12, L3, and L4 for animals receiving parylene arrays, T12, T13, L1, and L2 for those receiving wired arrays);
4) Implantation of an epidural stimulating array, inserted using the T11, L4 (T12, L2) laminectomies for parylene (wire) arrays, placing the most rostral electrodes in the middle of the T12 vertebral level, and sutured to the dura at the rostral end using 8-0 Ethilon sutures;
5) Implantation of 2 ground wires (each made of 5 braided 0.003 cm gold wires, A-M Systems, Sequim, WA) for parylene arrays, or one Teflon-coated stainless steel wire for the wired arrays (0.304 mm, AS 632, Cooner Wire, Chatsworth, CA).
The grounds are placed in the muscle mass dorsal to the spinal column;
6) Implantation of a pair of multi-stranded, Teflon-coated stainless steel EMG wires (AS 632, Cooner Wire) into the bellies of left and right tibialis anterior (LTA & RTA) and left and right soleus (LSol & RSol) muscles; and
7) Attachment of two headplug connectors (Amphenol, Wallingford, CT), which are screwed to the skull and secured with dental cement.

All surgeries are performed under aseptic conditions with general anesthesia (Isoflurane) delivered via face mask. Analgesia is provided with buprenex (0.5-1.0 mg/kg, 3 times per day subcutaneously), begun before the end of surgery and continued for 3-5 days post-surgery. The animals are administered Baytril subcutaneously at the end of surgery and at 12-hr intervals thereafter for at least 3 days. The animals are allowed to recover from anesthesia within an incubator and are individually housed both pre- and postoperatively, with free access to food and water. After recovering for one week (day 7 post-surgery, denoted P7), the experiments begin. Experiments continue as long as the animal and array both remain viable, or until 6 weeks post-operative (P42). Due to their spinal cord injuries, the animals’ bladders are manually expressed 2-3 times per day and their hind limbs are manually moved through a complete range of motion once per day to retain joint mobility. All procedures were carried out in accordance with the NIH Guide for the Care and Use of Laboratory Animals and were approved by the UCLA Animal Research Committee.

B. Parylene Arrays

Two animals were implanted with parylene stimulating arrays, which were built using MEMS fabrication and traditional microelectronics, described in detail in [1]. The 26 mm x 3 mm array is sized to the rat spinal cord.
The array consists of 27 platinum electrodes (each 0.2 mm x 0.5 mm, partially covered by an anti-delamination waffle pattern of parylene) and platinum wire traces deposited on a parylene C substrate (see Fig. 1). The flexible, 20 µm thick array conforms to the cord’s movements, allowing delivery of roughly the same stimulus in various body positions. The electrodes are in three rostro-caudal columns (see Fig. 2) denoted “A,” “B,” and “C” from the animal’s left to its right, and are numbered by rows from 1 (rostral, L2 spinal cord level, T12 vertebral level) to 9 (caudal, S2 spinal cord level, L2 vertebral level). The rows are equally spaced. The array is coupled to a biocompatible circuit board (33.2 mm x 10.3 mm) mounted dorsally on the spinal column. This board’s circuits control the stimuli, record responses, and communicate with external circuitry through a headplug connector. A stimulus consists of a single pair of electrodes (cathode and anode) chosen from 29 candidates, the 27 on the array and 2 grounds located within the body. Considering only bipolar configurations (i.e., those without grounds), there are 27 x 26 = 702 possible ordered stimulus pairs. Due to the design of the driving circuitry, certain pairs (36 in total) cannot be stimulated, leaving 666 pairs over which the stimulus can be optimized.

Fig. 2. Placement of a 9x3 array relative to the spinal cord (segmental level in red) and vertebrae (gray). The colored bars at the top and bottom represent the rostro-caudal spinal locations of the soleus, tibialis anterior, and medial gastrocnemius motor pools. Adapted under Open Access from [1].

C. Wire-Based Spinal Stimulating Arrays

The wire electrode arrays consist of 7 Teflon-coated, multistranded, stainless steel wires (5×30 gauge, A-M Systems; 2×AS 632 wires, Cooner Wire) laid parallel. Openings cut into the wire insulation create exposed electrodes on the surface facing the dura. The wires are sutured to the dura above and below the electrode, ensuring consistent stimulus location. The other end is then connected to a headplug connector, to which the implanted EMG wires are also routed. The 7 stimulating electrodes are placed over the same spinal locations as electrodes A1, A4, A9, B2, C1, C4, and C9 on the parylene array (Fig. 2).

D. Animal Testing Procedures

During testing, the animals are suspended in a vest that supports their body weight. The device is positioned so that the animal is in a bipedal position, with both hindpaws in contact with a custom surface offering good traction. In a typical experiment, the algorithm chooses a batch of five bipolar electrode pairs, possibly with repetition. The stimulus associated with each selection consists of a train of 10 (or 20 in some animals) constant voltage pulses (1 Hz repetition, monophasic, 0.5 ms pulse width) applied across the selected electrodes. The EMG signals corresponding to evoked potentials are amplified (A-M Systems model 1700) and recorded by custom LabVIEW (National Instruments, Austin, TX) software, with a 10 kHz sampling rate. The digitized signals are processed with custom MATLAB (The MathWorks, Inc., Natick, MA) code to calculate the peak-to-peak amplitude of each evoked potential within a time window defined relative to the onset of the stimulus pulse. Along with previously recorded data, these new observations are used to select the five actions comprising the next batch. On a typical testing day, five batches are selected. A run of the algorithm typically lasts several days, and several runs may be performed on the same animal, with the algorithm’s memory wiped between runs.
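The peak-to-peak measurement described above can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code: the function name `peak_to_peak`, its arguments, and the synthetic trace are hypothetical, and the window bounds used in the example are the 4.5-7.5 ms "middle response" latencies stated in Section IV-A. Only the 10 kHz sampling rate and the idea of a window defined relative to pulse onset come from the text.

```python
def peak_to_peak(emg_mv, fs_hz, onset_s, window_s):
    """Peak-to-peak amplitude of an EMG trace inside a latency window.

    emg_mv   -- samples of one recorded channel (hypothetical units: mV)
    fs_hz    -- sampling rate in Hz
    onset_s  -- time of the stimulus pulse onset within the trace, in seconds
    window_s -- (start, end) latencies after onset, in seconds
    """
    start = round((onset_s + window_s[0]) * fs_hz)
    stop = round((onset_s + window_s[1]) * fs_hz) + 1  # include the end sample
    if start < 0 or stop > len(emg_mv):
        raise ValueError("latency window lies outside the recording")
    segment = emg_mv[start:stop]
    return max(segment) - min(segment)


# Synthetic 20 ms trace sampled at 10 kHz (illustrative values only).
FS_HZ = 10_000
trace = [0.0] * 200
trace[5] = 5.0     # stimulus artifact at 0.5 ms, outside the window
trace[50] = 0.5    # positive deflection at 5.0 ms
trace[60] = -0.25  # negative deflection at 6.0 ms

amplitude = peak_to_peak(trace, FS_HZ, onset_s=0.0, window_s=(0.0045, 0.0075))
print(amplitude)
```

Restricting the search to the latency window keeps the large stimulus artifact at the start of the trace from dominating the measurement.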
Human-selected batches (5 choices of electrode pairs and voltage) are interleaved between tests of the algorithm’s batch selections in an alternating manner, constituting an intra-animal baseline. The algorithm and human were blind to one another’s actions and the animal’s resulting responses. The two data sources were not combined for decision-making purposes, but they were combined for meta-analysis and tuning of the algorithm’s hyperparameters.

III. REVIEW OF THE GP-BUCB ALGORITHM

The GP-BUCB algorithm is extensively developed elsewhere [5], [34]; here, we briefly sketch its construction. GP-BUCB is a Gaussian process (GP) bandit algorithm. The bandit setting [35] describes an agent which seeks to obtain high reward in its interaction with an unknown reward function $f$. In round $t$, the agent selects action $x_t$ from a decision set $D$, and then receives the single observation, $y_t = f(x_t) + \epsilon_t$, of the corresponding reward, $f(x_t)$, corrupted by noise $\epsilon_t$. Since only the reward corresponding to the action is observed, actions must be selected in view of both exploration and exploitation. GP bandits impose the structural assumption that the reward function is drawn from a GP, a probability distribution over functions. This assumption is useful mathematically and because it allows a relatively intuitive encoding of prior (e.g., anatomical) knowledge about the problem. A GP is characterized by a mean function $\mu_0(x)$ and a kernel function $k(x, x')$, which describes the covariance between $f(x)$ and $f(x')$. With their hyperparameters, $\theta$, the kernel and mean functions encode assumptions about likely reward functions $f$, such as smoothness with respect to actions $x$, or locations of regions which are likely to yield high reward. At time $t$, GP-BUCB updates the posterior distribution over $f(x)$ for each $x \in D$ using the actions $\{x_1, \dots, x_{t-1}\}$ and observations $\mathbf{y}_{1:t-1} = \{y_1, \dots, y_{t-1}\}$: the posterior over $f(x)$ is $f(x) \sim N(\mu_{t-1}(x), \sigma^2_{t-1}(x))$, where

$$\mu_{t-1}(x) = \mu_0(x) + \vec{k}^T(x)\,(K + \sigma_n^2 I)^{-1}\,(\mathbf{y}_{1:t-1} - \vec{\mu}_0) \quad \text{and} \quad \sigma^2_{t-1}(x) = k(x, x) - \vec{k}^T(x)\,(K + \sigma_n^2 I)^{-1}\,\vec{k}(x), \qquad (1)$$

where $\vec{k}(x)$ is the column vector of $k(x, x_i)$ for $i = 1, \dots, t-1$, $\vec{\mu}_0$ is the corresponding vector of prior means $\mu_0(x_i)$, and $K$ is the Gram matrix, i.e., $[K]_{i,j} = k(x_i, x_j)$. This posterior assumes that $f$ is drawn from the GP and that the noise is i.i.d., with $\epsilon_t \sim N(0, \sigma_n^2)$. The noise accounts for all factors which influence the response, but are unobserved by the algorithm. The GP-BUCB algorithm addresses the problem of making decisions in “batches,” where not all feedback is immediately available; only measurements $\mathbf{y}_{1:\mathrm{fb}[t]}$, where $\mathrm{fb}[t] < t - 1$, are available. Conventionally, all observations $\mathbf{y}_{1:t-1}$ are required to select $x_t$, but by using delayed measurements, the expenditure of valuable clinical time waiting for data processing is avoided. Mathematically, if the pending actions $x_{\mathrm{fb}[t]+1}, \dots, x_{t-1}$ are known to the algorithm, the posterior variance available after those observations arrive can be calculated from (1), since the variance equation does not depend on the observation values. GP-BUCB uses this pseudo-updated posterior to select $x_t$,

$$x_t = \arg\max_{x \in D}\,\left[\mu_{\mathrm{fb}[t]}(x) + \beta_t^{1/2}\,\sigma_{t-1}(x)\right], \qquad (2)$$

where $\beta_t$ is a time-varying parameter (in the following, chosen to increase during a testing session). This lets GP-BUCB select observations in non-redundant batches (or despite a delay). A number of performance guarantees for GP-BUCB with particular choices of $\beta_t$ are obtained by [5].

IV. CLINICAL ADAPTATIONS OF GP-BUCB

Noteworthy modifications of the GP-BUCB algorithm which are needed for clinical application are described below.
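As a concrete illustration of the posterior update (1) and the selection rule (2) reviewed in Section III, the following Python sketch selects one batch on a toy one-dimensional decision set. It is a simplified sketch under stated assumptions, not the implementation used in the experiments: the function names, the constant prior mean `mu0`, the fixed `beta` (the text uses a time-varying $\beta_t$), the squared-exponential kernel, and all numerical values are hypothetical. A practical implementation would use a numerical linear algebra library instead of the small solver shown here.

```python
import math


def se_kernel(a, b, lengthscales, sigma):
    """Squared-exponential kernel with one length scale per dimension."""
    r2 = sum(((ai - bi) / l) ** 2 for ai, bi, l in zip(a, b, lengthscales))
    return sigma ** 2 * math.exp(-r2 / 2.0)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def posterior(x, X, y, kern, mu0, noise_var):
    """Posterior mean and variance at x, as in Eq. (1).

    Pass y=None to obtain only the variance, which does not depend on y.
    """
    if not X:
        return mu0, kern(x, x)
    K = [[kern(xi, xj) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    kx = [kern(x, xi) for xi in X]
    w = solve(K, kx)  # w = (K + sigma_n^2 I)^{-1} k(x); K is symmetric
    var = max(kern(x, x) - sum(wi * ki for wi, ki in zip(w, kx)), 0.0)
    mean = None if y is None else mu0 + sum(wi * (yi - mu0) for wi, yi in zip(w, y))
    return mean, var


def select_batch(D, X_obs, y_obs, batch_size, kern, mu0, noise_var, beta):
    """Choose a batch with the rule of Eq. (2), holding beta fixed."""
    pending = []
    for _ in range(batch_size):
        def ucb(x):
            # Mean uses only completed observations (feedback up to fb[t]).
            mean, _ = posterior(x, X_obs, y_obs, kern, mu0, noise_var)
            # Variance is pseudo-updated with the pending actions as well.
            _, var = posterior(x, X_obs + pending, None, kern, mu0, noise_var)
            return mean + math.sqrt(beta * var)
        pending.append(max(D, key=ucb))
    return pending


# Toy example: five candidate actions on a line, one prior observation.
kern = lambda a, b: se_kernel(a, b, [1.0], 1.0)
D = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
batch = select_batch(D, [(2.0,)], [1.0], 3, kern, mu0=0.0, noise_var=0.01, beta=4.0)
print(batch)
```

Because each pending action shrinks the pseudo-updated variance around itself, later selections in the same batch are pushed toward actions that remain uncertain, which is how the rule avoids redundant batches even though no new observations have arrived.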
Y, MONTH YEAR TABLE TABLE II GP GP PRIORS PRIORS USED USED TO TO MODEL MODEL RESPONSES RESPONSES TO TO SPINAL SPINAL STIMULI STIMULI.. T THE HE GP GP PRIOR PRIOR MEAN MEAN IS IS µ00(t) = c0 OR OR c0 + tc1 .. K ERNEL H YPERPARAMETERS l = [0.2740, 0.4659, 0.3297, 0.6300, 1.2018], σ = 0.1244, σn = 0.0268 A NIMAL RUN P1 RUN 1 A F UNCTION SE W1 RUN 1 B RUN 1 A SE RUN 1 B RUN 1 C l = [0.5480, 0.9317, 0.6593, 1.2599, 2.4034], σ = 0.1244, σn = 0.0268 SE RUN 2 SE W2 RUN 3 RUN 1 A l = [0.1147, 46.9592, 0.1949, 5.0130, 0.9114], σ = 0.5288, σn = 0.1708 l = [2.9453, 6.3777, 3.4291, 2.3453, 2.4034], σ = 0.7903, σn = 0.1364 H YBRID P2 RUN 1 B RUN 2 RUN 1 H YBRID l1 = [1.0000, 2.7183, 1.0000, 2.7183, 2.7183], σ1 = 0.2865, σ2 = 0.2231, σn = 0.1353 l1 = [1.0000, 2.7183, 1.0000, 2.7183, 2.7183], σ1 = 0.2865, σ2 = 0.2231, σn = 0.1353 C. Objective Redundancy/Repetition Control A. Function Another deviation from the classical GP bandit setting To optimize clinical reward, GP-BUCB must observe the concerns the repetitions of stimuli within short periods. For actual reward or its faithful surrogate. Many methods exist to most experiments (see Table I), the algorithm was not allowed quantify standing performance [36], [37], which we eventually to request the same action more than twice per day in order aim to optimize by EES. Many motor skill grading schemes to ensure adequate exploration. Since 5 batches of 5 stimuli rely on human observation [38], [39] or automated post-hoc are typically requested daily, 13-25 distinct electrode pairs analysis [40], [41]. Although we do not use human grading are tested. Some of the consequences of this restriction are due to the variability in human judgement, our method can dicsussed in Section VI-A. work with human-based ordinal ratings and is mathematically agnostic to the particular choice of scalar reward function. 
D.We Kernel Mean Functions use and measured EMG activity as the reward function and mean functions the algorithm’s in The our kernel experiments. We chose the describe peak-to-peak amplitude prior spinal cord response. To ensure reaof theassumptions left tibialis about anterior (LTA, a foot dorsiflexor) muscle sonable action selection, their chosen carefully enresponse to individual stimulus pulses form (fixedmust amplitude of 5V, code 7V crucial information. importantly, k(x, x0 ) with usedproblem in animal P2), in theMost “middle response” (MR, encodes between and x0 . For our 4.5 - 7.5 the ms dependence post-stimulus) latencystimuli period,x corresponding to experiments, thewhich 5-dimensional stimulus description motor responses likely involve a single synaptic vector delay consisted of the anode and the cathode spatial coordinates on the in the motor path. Since evoked potentials represent a array (4-dimensions) the interneuron time, in daysnetworks, post-injury. Formay the low-level function of and spinal they firstless twosensitive animals,towe used aparameter squared-exponential (SE)higherkernel be stimulus variations than kSE (x (see [43], At Section level motor the 14.2) Hz stimulus frequency, the i , xj ) functions. pulse responses can be dissociated from their predecessors kSE (xi , xj ) = σ 2 exp[−r2 /2], (3) and successors. Although this performance metric does not q involve a complex r = motor (xi −behavior, xj )T Σ−1it(xprovides (4) i − xj ), a short-term, measurable surrogate for therapeutic effectiveness, as it is an where Σof =interneuronal diag(l12 , . . .function. , l52 ) andImportantly, ln is the although length scale indicator our corresponding to dimension n of x.outcomes, The SE kernel implies end-goal is to improve therapeutic focusing on a that the GPsurrogate is infinitely mean-square differentiable (see [43], short-term avoids the credit assignment problem, i.e., Section 4.1.1). 
determining which actions are more or less responsible for that therapeutic outcome. This choice also allows us to use an intra-animal control design, thus avoiding a requirement for infeasibly large cohorts of animals.

Showing that this algorithm can successfully control activity and learn the structure of the spinal cord's responses demonstrates a step toward a therapeutic implementation. Further, since the prior means specified for this function in Table I do not vary with respect to stimulus location, the entirety of the function's spatial structure must be discovered online; thus, successful learning of spatial structure indicates that the algorithm may be able to discover previously unknown or patient-specific patterns via exploration.

TABLE I (mean-function columns)
MEAN FUNCTION    MEAN HYPERPARAMETERS                    REPETITIONS ALLOWED
Constant         c0 = 0.0 mV                             2
Constant         c0 = 0.1 mV                             1
Constant         c0 = 0.1 mV                             3
Constant         c0 = 0.9 mV                             2
Constant         c0 = 0.9 mV                             2
Constant         c0 = 0.9 mV                             2
Constant         c0 = 1.4 mV                             2
Linear           c1 = 0.08 mV/day, c0 = −0.2492 mV       2
Linear           c1 = 0.08 mV/day, c0 = 1.7 mV           2
Linear           c1 = 0.08 mV/day, c0 = −0.1348 mV       2
Linear           c1 = 0.08 mV/day, c0 = 0.2513 mV        2

Since the performance of this algorithm was compared to a human experimenter, we describe the experimenter's search and exploitation strategy. For a typical day, involving five batches (with five stimulus selections per batch):
• During batches 1-3, the human's first 3 actions were chosen using prior knowledge of electrode placement relative to the target muscle's motor pool, and observations from earlier tests. The 4th and 5th actions were typically small variations on successes in the first 3 choices.
• In the 4th and 5th batches, the human selected actions to explore portions of the spinal cord and array.

The human and the algorithm acted under different rules, confounding their competitive analysis. First, the human did not repeat an action within any day. Further, the human received immediate feedback, i.e., could observe the responses to an action and immediately use this information. Despite these factors, we maintain that the human expert's performance is an important baseline, due to the human's reasonable effectiveness at finding and utilizing good stimulus parameters.

B. Time Variation of the Reward Function

The GP-BUCB algorithm explores and exploits the reward function by selecting actions which individually are expected to yield high reward, substantial new information on the reward function, or both. Assuming a stationary reward function, well chosen algorithm parameters, and some technical conditions, GP-BUCB is guaranteed to converge over time to a subset of actions which yield maximal reward [34]. In practice, the response function is temporally non-stationary across sessions due to factors including spinal cord plasticity due to training and the natural timecourse of post-injury response. Other effects include array degradation or small physical array displacements along the cord. Since spinal cord plasticity is an essential phenomenon of this therapeutic approach, the algorithm's model of the spinal response must capture temporal variation.

To do so, a variable representing time, T (in units of days post-injury), is added to the GP model. Hence, GP-BUCB must regress on a function of both applied stimulus x and time T. This approach requires the covariance function to be a map k : (D × T) × (D × T) → R. Further, actions are selected from the subset of D × T corresponding to the present time t, effectively a time-varying decision set Dt. The modeling problem becomes one of extrapolating from the past to the present, on the order of both days (between sessions) and minutes (between batches).

In our setting, the posterior uncertainty at t grows as t increases and previous observations recede into the past. Even with optimal action selection, the posterior uncertainty cannot decrease below a level set by the measurement noise, rate of observation, and rate of change in the underlying responses, an effect analogous to nonzero steady state uncertainty limits for Kalman filters [42] and Gaussian Markov processes [43]. Thus, the algorithm will never know the best action with vanishingly small uncertainty. Fortunately, the spinal cord's current state appears to vary slowly; given observations from previous days, the algorithm's internal model should make reasonable predictions about the spinal cord's responses. Further, it is not necessary to select the optimal action at time t, only one which provides useful therapy.
Maintaining some variety in the stimuli may in fact be appropriate, as training via an overly limited set of stimuli may result in inferior therapy [41].

C. Redundancy/Repetition Control

Another deviation from the classical GP bandit setting concerns the repetition of stimuli within short periods. For most experiments (see Table I), the algorithm was not allowed to request the same action more than twice per day, in order to ensure adequate exploration. Since 5 batches of 5 stimuli are typically requested daily, 13-25 distinct electrode pairs are tested (25 requests with at most two repetitions each require at least ⌈25/2⌉ = 13 distinct pairs). Some of the consequences of this restriction are discussed in Section VI-A.

D. Kernel and Mean Functions

The kernel and mean functions describe the algorithm's prior assumptions about spinal cord response. To ensure reasonable action selection, their chosen form must carefully encode crucial problem information. Most importantly, $k(x, x')$ encodes the dependence between stimuli $x$ and $x'$. For our experiments, the 5-dimensional stimulus description vector consisted of the anode and cathode spatial coordinates on the array (4 dimensions) and the time, in days post-injury. For the first two animals, we used a squared-exponential (SE) kernel $k_{SE}(x_i, x_j)$ (see [43], Section 4.2)

$$k_{SE}(x_i, x_j) = \sigma^2 \exp[-r^2/2], \tag{3}$$

$$r = \sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}, \tag{4}$$

where $\Sigma = \mathrm{diag}(l_1^2, \ldots, l_5^2)$ and $l_n$ is the length scale corresponding to dimension $n$ of $x$. The SE kernel implies that the GP is infinitely mean-square differentiable (see [43], Section 4.1.1). For the third and fourth animals, a hybrid kernel $k_h(x_i, x_j)$ was used:

$$k_h(x_i, x_j) = \sigma_1^2 k_m(x_i, x_j) + \sigma_2^2 \delta(i, j), \tag{5}$$

where $k_m$ is a 3rd order Matérn kernel, $k_m(x_i, x_j) = (1 + \sqrt{3}r)\exp(-\sqrt{3}r)$, which allows considerably rougher functions than the SE kernel [43], and $\delta(i, j)$ is the Dirac delta function on stimulus indices $i, j$.
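The two kernels in (3)-(5) can be written out directly. The following is a minimal NumPy sketch for illustration only; the experiments used the GPML toolbox in MATLAB, and the function names, argument order, and example values here are assumptions rather than the authors' code. Each stimulus is taken to be a length-5 vector (anode x, anode y, cathode x, cathode y, time in days post-injury).

```python
import numpy as np


def scaled_distance(xi, xj, lengthscales):
    """Eq. (4): r = sqrt((xi - xj)^T Sigma^{-1} (xi - xj)), Sigma = diag(l_n^2)."""
    d = (np.asarray(xi, float) - np.asarray(xj, float)) / np.asarray(lengthscales, float)
    return float(np.sqrt(np.dot(d, d)))


def k_se(xi, xj, sigma, lengthscales):
    """Eq. (3): squared-exponential kernel."""
    r = scaled_distance(xi, xj, lengthscales)
    return sigma ** 2 * np.exp(-r ** 2 / 2.0)


def k_hybrid(xi, xj, i, j, sigma1, sigma2, lengthscales):
    """Eq. (5): Matern term plus a delta term on the stimulus indices i, j."""
    r = scaled_distance(xi, xj, lengthscales)
    k_m = (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
    delta = 1.0 if i == j else 0.0  # nonzero only for the same observation
    return sigma1 ** 2 * k_m + sigma2 ** 2 * delta
```

For two observations of the same stimulus at the same time, `k_hybrid` returns `sigma1**2` when the indices differ and `sigma1**2 + sigma2**2` when they coincide, which is how the delta term contributes per-observation variance.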
The hybrid formulation tries to infer the noisy function $f(x_i, i) = g(x_i) + \eta_i$, $\eta_i \sim N(0, \sigma_2^2)$, instead of the underlying function $g(x)$, which has kernel $k_m$. Since $k_h$ implies a minimum uncertainty about $f(x)$, the response to any candidate action, short-time variations in the reward are subsumed into the noise term, leaving the overall function shape to be captured by the Matérn kernel. The hybrid kernel proved to be more successful than the SE kernel. Both kernels were implemented using the GPML toolbox [44]. Table I shows all kernel hyperparameters used in the experiments.

Following conventional practice, the hyperparameters used with the first two animals were estimated from an experimental data set selected by the human expert, which evenly sampled the decision set, and thus could be expected to faithfully estimate the spatial lengthscales. The fitting was performed using a conjugate gradient method² on the data likelihood under the GP model. This fitting procedure was reasonably successful for the spatial lengthscales, but poorly captured the multi-day temporal lengthscales. This poor fit likely results from the strong influence of short-timescale effects (e.g., noise) on the value of the data likelihood, dominating the effects of fitting the underlying spinal plasticity. Consequently, the time lengthscale in the first two animals was hand-selected. Similarly, the hyperparameters were hand-selected for the third and fourth animals' kernel in such a way as to largely ignore in-session variation and produce model predictions capturing the long-term changes in the spinal cord's responses.

V. RESULTS

Experiments in four rats prepared as in Section II were carried out, resulting in 37 testing sessions and 1200 actions (670 actions selected by the algorithm, with the rest selected by the competing human). Fig. 3 shows a retrospective of GP-BUCB performance across all experiments.
Each graph plots the reward (peak-to-peak evoked potential amplitudes) resulting from all stimulus pulses, color-coded according to the human or GP-BUCB origin of the stimulus selection. These plots show substantial variation in the evoked potentials over the time frame of the experiments. In all animals, the per-day range of rewards is similar for both human- and GP-BUCB-selected stimuli, indicating that response variations do not strongly depend upon which agent selects the actions and thus that the interleaved block structure provides a reasonable, intra-animal baseline for examining GP-BUCB. The temporal variation in reward appears to be composed of both day-to-day fluctuations and a long-term, upward trend. While some of this upward trend is due to the algorithm's growing knowledge of effective stimulus patterns, much of it is a general, plastic increase in spinal cord responses to the stimuli. Table II summarizes the experiments, the number of actions initiated by the algorithm, and the number of unique stimuli chosen.

² Using the minimize.m function in the GPML toolbox [44].
[Figure 3: panels (a) Animal W1, (b) Animal W2, (c) Animal P1, (d) Animal P2; peak-to-peak response (mV) versus time (days post-injury).]

Fig. 3. Peak-to-peak amplitude (i.e., reward in mV) of individual stimulus pulses for each animal. Blue circles: potentials evoked by the human expert's choices; Green 'x': potentials due to algorithm choices. Solid red lines denote a wipe of the algorithm's memory. Dashed lines denote hyperparameter changes without a memory wipe. The human experimenter and algorithm were blind to each other's actions and the resulting rewards, and their actions were interleaved in batches. The decreased evoked potential magnitude from days 9 and 10 to 15-17 in animal W2 is a typical feature of the preparation.

[Figure 4: panels (a) Animal W1, (b) Animal W2, (c) Animal P1, (d) Animal P2; reward versus time (actions).]

Fig. 4. Human expert's (blue) and GP-BUCB's (green) reward, defined as the peak-to-peak amplitude (mV) of MR evoked by an epidural electrical stimulus of 5 V (animals P1, W1, and W2) or 7 V (animal P2), delivered in 1 Hz pulse trains. The average reward (solid lines) measures the rewards generated by every action, while the maximum reward (dashed) is the best reward observed up to a given time. GP-BUCB typically has a superior average reward to the human, while maintaining a superior or competitive maximum reward. In animals W1 (a) and W2 (b), runs (i.e., periods between memory wipes) are separated by red vertical lines.

A. Evaluation of Reward

Reward is calculated from the response to an action, i.e., a continuous sequence of pulses. Let $y_\tau$ denote the peak-to-peak response to the $\tau$th stimulus pulse. The reward resulting from an action started at time $t$ is calculated as:

$$w_t = \frac{1}{\tau_{max}(t) - \tau_{min}(t) + 1} \sum_{\tau = \tau_{min}(t)}^{\tau_{max}(t)} y_\tau, \tag{6}$$

where $\tau_{min}(t)$ and $\tau_{max}(t)$ respectively index the first and last pulses for action $t$.

The maximum reward, $W_{max}$, may be examined in terms of the maximum reward value observed so far, i.e.,

$$W_{max}(T) = \max_{t \le T}(w_t).$$

The maximum reward describes the thoroughness of the search over the stimulus space, since a large maximum reward implies that high-performing stimuli have been found, and thus could conceivably be exploited. However, this quantity ignores the number of applications of high-performing stimuli, so strategies like random search may perform well in this metric, without delivering effective therapy. Alternatively, the average reward so far observed

$$W_{avg}(T) = \frac{1}{T} \sum_{t=1}^{T} w_t,$$

measures how well exploration and exploitation are traded off; high average reward implies that a search process spent most of its search time choosing effective actions, yet also explored thoroughly enough to find high-performing stimuli. Superior average performance of one approach relative to another at time $T$ is indicated by a larger average reward value.
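The reward in (6) and the two running summaries can be computed in a few lines. This is an illustrative NumPy sketch, not the authors' MATLAB code; the function names are hypothetical, and pulses are indexed from 0 here for convenience.

```python
import numpy as np


def action_reward(y, tau_min, tau_max):
    """Eq. (6): mean peak-to-peak response over pulses tau_min..tau_max inclusive."""
    y = np.asarray(y, float)
    return float(np.mean(y[tau_min:tau_max + 1]))


def running_max(w):
    """W_max(T) for every T: best per-action reward observed so far."""
    return np.maximum.accumulate(np.asarray(w, float))


def running_avg(w):
    """W_avg(T) for every T: mean per-action reward over actions 1..T."""
    w = np.asarray(w, float)
    return np.cumsum(w) / np.arange(1, len(w) + 1)
```

A single early success raises `running_max` permanently, whereas `running_avg` stays high only if good stimuli keep being selected, which is the distinction drawn above between the two metrics.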
Wire Arrays: Results: Fig. 3(a) and (b) show the search responses obtained by GP-BUCB and the human expert in animals implanted with wire arrays, while Fig. 4(a) and (b) show the associated maximum and average rewards. The algorithm chose a total of 235 and 310 actions respectively for animals W1 and W2. In both animals, left, caudal anode locations were associated with strong responses to the stimulus. Although the sizes of the strongly responding regions of the stimulus space were different in the two animals (see Fig. 5), the algorithm learned the response function well in both cases. Both animals exhibited increased responsiveness over the course of the experiments.

TABLE II
NUMBER OF ACTIONS CHOSEN BY GP-BUCB (UNIQUE), AND SPEARMAN RANK CORRELATIONS BETWEEN EACH SELECTED STIMULUS' REWARD AND THE NUMBER OF PULSES USED BY GP-BUCB OR HUMAN EXPERT TO TEST THAT STIMULUS, WITH P-VALUES.

ANIMAL & RUN    MACHINE ACTIONS (UNIQUE)    ρ_Machine (p)       ρ_Human (p)
P1, Run 1       70 (43)                     0.264 (0.0906)      N/A
W1, Run 1       80 (23)                     0.105 (0.6348)      −0.073 (0.6829)
W1, Run 2       55 (23)                     0.339 (0.1132)      −0.045 (0.8091)
W1, Run 3       100 (29)                    0.480 (0.0083)      0.325 (0.0500)
W2, Run 1       70 (28)                     0.582 (0.0012)      −0.034 (0.8682)
W2, Run 2       240 (33)                    0.824 (< 0.0001)    0.569 (0.0001)
P2, Run 1       55 (40)                     0.581 (0.0001)      −0.477 (0.0021)

[Figure 5: panels (a) Animal W1, (b) Animal W2; array row versus column (A-C), with ground (Gnd) boxes.]

Fig. 5. Time-averaged, intra-day-normalized response for cathode-anode electrode pairs, animal W1 (a) and animal W2 (b), shown with respect to array location as viewed dorsally (rostral at top). Each large box corresponds to a spatial location of a wire array electrode. Within the boxes, each small rectangle denotes the spatial location of one cathode which could be paired with an anode located at the large box's position. Every stimulus thus has a unique address. Each small rectangle's color corresponds to the time-averaged normalized response strength (1: red; 0: blue) of that anode-cathode pair. Black rectangles represent electrode pairs not selected. Data are shown only from days where a total of at least 15 actions were taken between the human and algorithm. The set of highly excitable configurations for animal W1 is broader than that for animal W2. Note that monopolar stimuli (using a ground wire) were tested by the human experimenter in animal W1; these results are shown in the bottom-most boxes.

Parylene Microarray Animals: Results: Fig. 3(c) and 4(c) (animal P1) and Fig. 3(d) and 4(d) (animal P2) show the maximum and average reward obtained with the parylene arrays, which logged 4 days and 70 actions in animal P1, and 3 days and 55 actions in animal P2. GP-BUCB quickly found and exploited high-performing stimuli in this much larger space of possible stimuli. Note that high-performing stimuli observed in these two animals had caudal, left-side anode locations, as seen in the wire array experiments, though having such an anode was not itself sufficient for good responsiveness.
B. Computational Performance

When implemented in the MATLAB programming language³, the EMG data processing needed to prepare one new batch typically required 2-3 min, whereas GP-BUCB's selection of new stimuli required approximately 5 s of processor time. In our alternating double blind experiment format, the total time to construct a new stimulus batch was less than the time required for the human expert to perform a batch of tests. These results indicate that, given a sufficiently efficient and fluid data acquisition and processing system, the algorithm could perform closed-loop active learning much more rapidly than was the case in our experiments. Under the most severe condition, with n = 4432 previous evoked potential observations, a new batch of five actions was constructed in 142.0 s, with 50 s devoted to Cholesky decomposition, and 50 s allocated to evaluation of the kernel matrix. Computations are typically dominated by the Cholesky decompositions (which require an order $n^3$ calculation) and memory usage by storage of the $n \times n$ kernel matrices (order $n^2$ memory). As n grows with the increasing number of observations, the computational load can be managed by "forgetting" older observations, using approximate Cholesky decomposition, or sparse approximate GP regression [45]. Aggregating the individual, pulse-by-pulse responses to an action also saves substantial effort.
In our alternating double blind experiment format, the 3 order nrequired calculation) and memory by storage succeeded tion require of newanstimuli approximately 5 susage of processor in future performance prediction. For example, VI-B addresses the analogous parylene array results. Section a discussion inter-animalshowed comparisons. Section total time to construct a new stimulus batch format, was ofIntheour n× n kernel matrices (order n2 memory). As less n grows time. alternating double blind experiment thethanincluding in animal W2 (run 2)of GP-BUCB strong queueing VI-C examines another perspective on therapeutic utility. addresses the analogous parylene array results. time fornumber human expert to perform a batch with thetorequired increasing of observations, the totalthe time construct athe new stimulus batch wascomputational less than ofVI-B behavior (it first selected what were predicted andSection proven Lessons learned regarding kernel function and hyperparameter These results indicate that, given a sufficiently efficient another before perspective on therapeutic utility. load required can be managed by “forgetting” older observations, using to beexamines the are bestthe stimuli) being forced by the repetition the tests. time for the human expert to perform a batch of VI-C selection focus of Section VI-D. Section VI-E examines and fluid data acquisition and processing system, the algolearned regarding kernel function and hyperparameter Cholesky or sparse approximate limit to consider other stimuli. Thissearch behavior results as the tests.approximate These results indicatedecomposition, that, given a sufficiently efficient Lessons the effectiveness of theSection algorithm’s from theexamines perspective rithm could perform closed-loop active learning much more 4 selection are the focus of VI-D. Section VI-E GP regression [45]. 
VI. DISCUSSION

Our animal experiments show that GP-BUCB can compete with a human expert in optimizing stimuli. This section also explores additional ramifications of our experiments on the prospects for using machine learning in multi-electrode stimulation therapy. Section VI-A analyzes the wire array results, including a discussion of inter-animal comparisons. Section VI-B addresses the analogous parylene array results. Section VI-C examines another perspective on therapeutic utility. Lessons learned regarding kernel function and hyperparameter selection are the focus of Section VI-D. Section VI-E examines the effectiveness of the algorithm’s search from the perspective of information gain. Finally, effects on hindlimb muscles other than the LTA are the subject of Section VI-F.

A. Wire Array Animals

The wire array experiments tested how GP-BUCB copes with a time-varying reward function. The algorithm generally succeeded in future performance prediction. For example, in animal W2 (run 2) GP-BUCB showed strong queueing behavior (it first selected what were predicted and proven to be the best stimuli) before being forced by the repetition limit to consider other stimuli. This behavior results in the signature saw-tooth shape^4 in the average reward plots (Fig. 4(b)). Moreover, GP-BUCB autonomously maintained good performance (showing the same queueing behavior) over long periods, e.g., in animal W2’s two-week-long second run, which included two multi-day gaps in testing. In this respect, these experiments validated that good forward prediction is possible over such intervals due to the utility of GP-BUCB’s internal model, and reached a standard of performance which could be clinically relevant.

^4 The upward edge is produced by the superior reward resulting from the first stimuli in the session and the flatter or downward-sloped portion of the average reward curve results from the poorer stimuli which are queued later in the session.

Algorithm Analysis. Several possible criticisms might follow from the fact that GP-BUCB can daily request a number of actions (typically 25, at least 13 unique, in 5 batches of 5 actions) which is of the same order as 42, the size of the decision set, D.
• Measures of success: Simply finding a high reward stimulus may be an inadequate measure of algorithm success.
• Differentiation between algorithms: Given the small size of D and the repetition limit, other learning algorithms would likely select similar actions to GP-BUCB.
• Generalization to unconstrained settings: Our use of a repetition limit can cloud an assessment of how effectively GP-BUCB can track f in an unconstrained setting.

The experimental data support several additional arguments for GP-BUCB’s success. First, and unsurprisingly^5, GP-BUCB found high-performing stimuli in the first day (within 2 batches of 5 stimuli, in each case) in all wire array runs. This indicates that GP-BUCB’s search is free from gross deficiencies. Second, GP-BUCB demonstrates success in the sense of avoiding ineffective stimuli. The number of unique electrode pairs selected (see Table II) was always substantially smaller than the size of the decision set. GP-BUCB thus avoids exhaustive testing, but appears to effectively model the reward for untested stimuli. Finally, since GP-BUCB was found to be competitive with the performance of a human expert in terms of the maximum reward so far observed, while maintaining a better average reward, it follows that GP-BUCB thoroughly searches for high-reward regions while simultaneously exploiting its knowledge in a therapeutically more effective fashion. These observations argue that GP-BUCB makes a successful exploration-exploitation tradeoff.

The second criticism argues that the decision set size and repetition limit mean that all algorithms will have to explore D somewhat, and that further, convergence cannot be observed, making algorithms indistinguishable. GP-BUCB’s ability to discriminate against (likely bad) stimuli, discussed in relation to the first criticism, is a crucial feature in this setting, and compares favorably with simple methods such as random search or bandits without structured payoffs.
With respect to convergence, the strong queueing behavior indicates that GP-BUCB has converged to the extent allowed by the repetition restriction, since high-reward actions are selected first (i.e., in preference to other actions), despite the low exploratory value of doing so under (2), until disallowed by the repetition limit. GP-BUCB’s queueing behavior also requires a reasonably correct posterior over the rewards which will be received in the day’s first batch of stimulation, implying that its internal model is adequate to the inter-day prediction task.

Fatigue offers an alternate explanation for the observed queueing behavior. If the animal rapidly tires during a session, and this fatigue appears as a decrease in evoked peak-to-peak amplitude, the responses elicited by the first stimuli would appear stronger than those from later stimuli. These “better” stimuli would then be selected early in future sessions, a feedback loop which could produce the observed queueing effect. However, under this explanation the responses to the human expert’s choices would also show a strong within-session fatigue pattern, which they do not. Also, the difference, ∆V, between average responses to different applications of the same stimulus in the same session (by human, algorithm, or both) can be measured. A large and negative ∆V is required if queueing is explained by fatigue. However, for actions leading to substantial responses (within-session average ≥ 0.2 mV), ∆V is small in magnitude (∆V = −0.0871 ± 0.4897 mV, n = 142 pairs from animal W2), even over prolonged periods (−0.0336 ± 0.4228 mV for 34 intervals of ≥ 1 hour). Further, the correlation between the interval and ∆V is very weak (r = 0.0090). Fatigue, therefore, is not a convincing explanation for the observed queueing behavior.

^5 If 6 electrode pairs are effective out of 42, e.g., all those pairs including a particular anode, 10 stimuli drawn randomly without replacement will include one of these 6 with probability > 82%.
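The probability quoted in footnote 5 follows from counting the draws that miss every effective pair. The short Python check below reproduces it; the function name is illustrative, and the inputs (42 pairs, 6 effective, 10 draws) are the ones stated in the footnote.

```python
from math import comb


def prob_at_least_one_effective(total_pairs, effective_pairs, draws):
    """P(at least one effective pair) when drawing without replacement."""
    all_draws = comb(total_pairs, draws)
    draws_missing_all = comb(total_pairs - effective_pairs, draws)
    return 1 - draws_missing_all / all_draws


# 6 effective pairs out of 42; two batches of 5 stimuli give 10 draws.
p = prob_at_least_one_effective(42, 6, 10)
print(f"{p:.4f}")
```

The result is about 0.827, consistent with the "> 82%" figure, which is why finding a high-performing stimulus within the first two batches is described as unsurprising.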
The third criticism is more difficult to directly refute since no experiments without the repetition constraint were performed, and the repetition limit forces the algorithm to track stimuli which are high-performing, but suboptimal. However, for stationary reward functions, the performance of GP-BUCB is supported by theoretical guarantees and validation in simulation [5]. Although [5] does not provide guarantees in the case of time-varying rewards, such guarantees have been derived for other methods, e.g., [46], suggesting that similar guarantees could be derived for GP-BUCB. Together, the relatively benign time variation in the reward function in our experiments and the stationary case guarantees suggest that unconstrained GP-BUCB could also maintain a useful posterior over responses to previously high-performing stimuli.

Cross-animal Comparisons. Fig. 5 compares the performance of electrode pair selections across animals W1 and W2. The data demonstrate substantial repeatability between animals with regard to the evoked potential strength elicited by the same stimulus, particularly for the highest-performing stimuli (typically those with electrode A9 as the anode).

B. Parylene Array Animals

The data from the animals implanted with parylene arrays allow us to assess how well GP-BUCB can search a large decision space (666 possible electrode pairs, of which no more than 25 can be tested in a typical day), using a structured reward function. In both animals, after finding a high-performing configuration (C8 A9 in animal P1, and C6 A9 in animal P2), the algorithm moved sharply to exploit this knowledge, allocating repeated queries to neighboring actions. In both animals, the discovered configurations included an anode at the left, caudal corner of the array, similar to those found in the wire array animals. Note that GP-BUCB overcame its spatially flat prior on f to find this anatomical pattern.
The key, high-performing stimulus was found in the 3rd batch for animal P1, and the 6th batch for animal P2. The efficiency of the search and the rapid exploration of high-performing stimuli argue that the structured GP model provides a benefit over conventional bandit algorithms for this application.

Although no comparison to a human expert was made in animal P1, the P2 experiment followed the interleaving human-machine batch paradigm. Possibly due to prior anatomical knowledge, the human found strongly responding stimuli on the first testing day. However, soon after, the algorithm found high-performing stimuli in both animals. In the final day of animal P2 testing, GP-BUCB found stimuli of comparable reward to the best found by the human, as seen in Fig. 4(d).

C. Therapeutic Relevance

The reward in (6) makes the heuristic assumption that strength of response and contribution to long-term therapeutic outcomes are equivalent (i.e., linearly related). An alternate reward heuristic is

$$R_t = I(w_t > \Pi),$$

where $w_t$ is defined in (6), $\Pi$ is a threshold dividing stimuli into therapeutically effective and ineffective classes by average twitch strength, and $I(\cdot)$ is the indicator function. This alternate reward function is reasonable if fine differences between strongly responding stimuli (important to the previous reward measures) are irrelevant. Table III presents a comparison of human and machine reward under this heuristic for different values of $\Pi$. For nearly all runs and values of $\Pi$, proportionally more algorithm actions met or exceeded the specification. Thus, if therapeutic effectiveness is a function of the proportion of stimuli which elicit a strong response, the algorithm is more effective than the human expert’s chosen strategy.

TABLE III. Proportion of applied stimuli which elicited peak-to-peak responses exceeding a threshold. An asterisk indicates more supra-threshold responses.

Animal       Agent       Threshold $\Pi$ (mV)
                         0.5      1.0      1.5      2.0
P1           Algorithm   0.229    0.157    0.100    0.043
W1, Run 1    Algorithm   0.662*   0.450*   0.263*   0.000
             Human       0.440    0.253    0.067    0.013*
W1, Run 2    Algorithm   0.945*   0.727*   0.545*   0.091*
             Human       0.720    0.520    0.280    0.040
W1, Run 3    Algorithm   0.910*   0.850*   0.790*   0.560*
             Human       0.828    0.697    0.626    0.424
W2, Run 1    Algorithm   0.657*   0.543*   0.357*   0.214*
             Human       0.564    0.418    0.273    0.127
W2, Run 2    Algorithm   0.779*   0.642*   0.442*   0.308*
             Human       0.681    0.472    0.285    0.153
P2           Algorithm   0.255*   0.091*   0.036*   0.000
             Human       0.085    0.000    0.000    0.000
[Fig. 6: panels (a) Animal W1, Run 3 and (b) Animal W2, Run 2; x-axis: Time (Days post-injury); y-axis: Peak-to-peak amplitude of MR (mV).]
Fig. 6. GP posterior over the response function f(A1, A9, t). Points show responses evoked by electrode pair A1 A9 (black squares) and other stimuli (green ‘x’s). The red dashed line is the prior mean of the entire GP with respect to time. The posterior mean (solid) and ±1 standard deviation confidence intervals (dashed) for f(A1, A9, t) are shown in blue. (a): Animal W1, run 3, shows the effects of poor kernel and hyperparameter choices. The posterior prediction of the responses at the beginning of P34 (at right, using data acquired on P31 and earlier) was extrapolated poorly. This resulted from the squared-exponential kernel’s inherent smoothness, the time lengthscale used, and the low assumed noise level. Actions selected on day P34, using this badly inaccurate model, were consequently unproductive. (b): The altered kernel, mean, and hyperparameter choices employed in animal W2, run 2, avoid this pathology, while still providing useful prediction.

D. Kernels and Hyperparameters

For the structured GP model to faithfully capture muscle responses, yet allow rapid learning, the kernel function, mean function, and their hyperparameters must be chosen well. Poor choices can lead to undesired behavior, as seen in Fig. 6(a). However, the comparatively good performance obtained in later animals with the hybrid Matérn kernel, (5), shows that appropriate kernel choice can resolve this problem (Fig. 6(b)). Although the hybrid kernel was successful, our experiments suggest other possible kernels. For example, the hybrid kernel’s noise term could be replaced by a term which models the covariance between measurements of the same action on the same day, but otherwise treats observations as independent. Such a kernel includes a hidden additive variable for each configuration on each day. A linear temporal term in the kernel may also be useful, provided that old observations can be forgotten, as it models the fact that non-responsive configurations typically remain non-responsive.

The kernel, mean, and hyperparameters were periodically re-examined. If appropriate, hyperparameters were re-initialized after an animal’s first run, accompanied by a wipe of the algorithm’s memory; Table I and Fig. 3 detail these refitting and memory wipe events. The choice to re-fit or wipe algorithm memory was made by human inspection. This approach protected the algorithm from making prolonged, catastrophic errors, at the cost of introducing some bias. However, this manual oversight constitutes a meta-algorithm which can be compared to the oversight required to detect device failure and is thus not unrealistic. Automated execution of such re-examinations (in some sense, a form of hyperprior fitting) and also heuristics for detection of various failure modes should naturally be built into future systems, but are beyond the scope of this work.
The choice to re-fit or wipe algomation it gains about E. Information Gain the spinal cord’s responses. Using the rithm memory was made by human inspection. This approach GPGP-BUCB model, the information gain with by respect to the entire also befrom asssessed the amount of inforprotected the can algorithm making prolonged, catastrophic mation it gains about the spinal cord’s responses. Using the errors, introducing some bias. However, this manual oversight GP model, the information gain with respect to the entire reward function f can be quantified directly. Table IV shows DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF MULTI-ELECTRODE EPIDURAL ELECTROSTIMULATION DESAUTELS et al.: ACTIVE LEARNING ALGORITHM FOR CONTROL OF EPIDURAL ELECTROSTIMULATION 11 11 (tens to hundreds) and substantial periods of time (weeks). exploitation simpleallowed therapeutic application. In particular, Further, the in GPa model effective extrapolation forward the algorithm allocated more actions to high-performing stimin time from previous sessions and observations, a requirement uli while still matching the human experimenter’s effectiveness to avoid extensive re-exploration. GP-BUCB also found taskinrelevant findinganatomical the best stimuli. Such demonstration that structure in athe responses it implies encountered, M EAN Im /Ir M EAN R EWARD M EAN Im /Ih DAYS A NIMAL automated algorithms have the potential to efficiently use despite starting with no information as to which regions ofthe the & RUN (± STD .) R ATIO (± STD .) (± STD .) U SED stimuli available to them, and outside the clinic, P1, RUN 1 0.730 ± 0.145 N/A N/A 0 cord were expected to beboth mostinside responsive. This of indicates that tothese simultaneously learn about the cord’s responses W1, RUN 1 0.549 ± 0.165 1.368 ± 0.519 N/A 5∗ structures, and potentially novelspinal or patient-specific ones, RUN 2 0.791 ± 0.132 1.474 ± 0.490 N/A 4 and deliver effective therapy. 
Our successful handlingGiven of this can be found by online, automatic experimentation. the RUN 3 0.697 ± 0.105 1.297 ± 0.317 N/A 5 problem constitutes theamong successful completion of aand necessary W2, RUN 1 0.811 ± 0.145 1.540 ± 0.560 0.798 ± 0.093 4 extensive differences individual injuries injured RUN 2 0.802 ± 0.119 1.600 ± 0.749 0.895 ± 0.152 10 prerequisite a well-founded attempt toFinally, solve our the results more spinal cords,forthis capability is essential. P2, RUN 1 0.924 ± 0.214 2.520 ± 2.415 1.119 ± 0.323 3 complex clinical problems ahead, maximizing standing or show that GP-BUCB can effectively trade off exploration and stepping performance in paralyzed subjects and tuning stimuli constitutes a meta-algorithm which can be compared to the exploitation in a simple therapeutic application. In particular, to individual patients’ needs. reward function f can quantified shows oversight required to be detect device directly. failure Table and isIVthus not the algorithm allocated more actions to high-performing stimthe averaged ratio of the algorithm’s daily gain in conditional unrealistic. Automated execution of such re-examinations (in uli while still matching the human experimenter’s effectiveness information that of human fitting) and to and that also of aheuristics random ACKNOWLEDGMENT in finding the best stimuli. Such a demonstration implies that some sense, to a form of the hyperprior sampling policy. Interestingly, the algorithm gains a compathe potential efficiently use in the for detection of various failure modes should naturally be built automated We thank algorithms Yanan Sui have and Mrinal Rath fortotheir assistance rable (but smaller) information to both the work. human executing stimuli available to them, both inside and outside of the clinic, into future systems,amount but areofbeyond the scope of this the experiments. and random searches. 
The algorithm’s repeated, exploitative to simultaneously learn about the spinal cord’s responses actions, which produce higher mean daily reward, may account and deliver effective therapy. Our successful handling of this R EFERENCES E. Information for most of thisGain difference. However, the algorithm’s high problem constitutes the successful completion of a necessary [1] P. Gad et al., of a multi-electrode for spinal cord conditional information gain, despite allocating manyofactions GP-BUCB can also be asssessed by the amount infor- prerequisite for“Development a well-founded attempt toarray solve the more epidural stimulation to facilitate stepping and standing after a complete to stimuli are well-understood, suggests that the actions mation it which gains about the spinal cord’s responses. Using the complex clinical problems maximizing standing spinal cord injury in adult rats,”ahead, J. Neuroeng. & Rehab., vol. 10, no. 1,or allocated to exploration are well-chosen. GP model, the information gain with respect to the entire stepping p. 2, 2013. performance in paralyzed subjects and tuning stimuli [2] S. Harkema et al., “Effect of epidural stimulation of the lumbosacral reward function f can be quantified directly. Table IV shows to individual patients’ needs. spinal cord on voluntary movement, standing, and assisted stepping after the averaged ratio of the algorithm’s daily gain in conditional F. Effect of Learning on Other Hindlimb Muscles motor complete paraplegia: A case study,” The Lancet, vol. 377, no. information thatLTA of the human and to that of a random 9781, pp. 1938–1947,A2011. CKNOWLEDGMENT In additionto to responses, the peak-to-peak twitch [3] C. A. Angeli et al., “Altering spinal cord excitability enables voluntary sampling policy. Interestingly, the algorithm gains a compaamplitudes were recorded in the left soleus, right TA, and right T.movements A. 
Desautels thanks Yananparalysis Sui and Mrinal Brain, Rath for their after chronic complete in humans,” vol. 137, rable (but smaller) amount of muscles information to bothstrongly the human no. 5, pp.in1394–1409, soleus muscles. Often, several responded toassistance executing2014. the experiments. and random searches. TheAlthough algorithm’s repeated, [4] T. Desautels et al., “Application of a batch active learning algorithm to gether to a given stimulus. the LTA twitchexploitative amplitude spinal cord injury therapy (program no. 466.01),” in Neuroscience 2013 actions, which higherofmean daily reward, may account accounted for aproduce large portion the response amplitudes in the R EFERENCES Abstracts. San Diego, CA: Soc. for Neuroscience, November 2013. for most of this However, the algorithm’s high [5] ——, “Parallelizing exploration–exploitation tradeoffs in Gaussian pro−3 other muscles (r >difference. 0.25, p 1e in all animals and muscles, [1] P. Gad et al., “Development of a multi-electrode array for spinal cord conditional gain, P1), despite allocating many cess bandit optimization,” J. Mach. Learning Res., vol. 15, pp. 3873– except the information RTA in animal these responses in actions other epidural stimulation to facilitate stepping and standing after a complete 3923, December 2014. to stimulialso which are well-understood, that theEven actions spinal cord injury in adult rats,” J. Neuroeng. & Rehab., vol. 10, no. 1, muscles demonstrated substantial suggests independence. in [6] R. van den Brand et al., “Restoring voluntary control of locomotion p. 2, 2013. allocated exploration are combinations well-chosen. of muscle responses after paralyzing spinal cord injury,” Science, vol. 336, no. 6085, pp. this simpletoparadigm, many TABLE IV TABLE I NFORMATION GAINS (GP-BUCB, ImIV ; RANDOM SEARCH , Ir ; HUMAN , Im ; .RANDOM SEARCH , Ir ; HUMAN ,IN IIhNFORMATION ), COMPUTEDGAINS USING(GP-BUCB, HYBRID KERNEL Ih CANNOT BE CALCULATED Ih ), COMPUTED HYBRID KERNEL . 
Ih CANNOT BE CALCULATED ANIMAL W1 DUEUSING TO USE OF MONOPOLAR STIMULI , FOR WHICH kh (x, IN x0 ) ANIMAL IS W1 DUE TO USE , FOR WHICH k.h (x, x0 ) UNDEFINED . ∗OF : RMONOPOLAR EWARD RATIOSTIMULI OUTLIER DISCARDED IS UNDEFINED . ∗ : R EWARD RATIO OUTLIER DISCARDED . can be elicited. The creation of an efficiently computable reward metric which faithfully F. Effect of Learning on Other matches Hindlimbtherapeutic Muscles utility in multi-muscle behaviors is a highly non-trivial problem worthy In addition to LTA responses, the peak-to-peak twitch of future investigation. amplitudes were recorded in the left soleus, right TA, and right soleus muscles. Often, several muscles responded strongly toVII. CAlthough ONCLUSIONS gether to a given stimulus. the LTA twitch amplitude Our experiments show that algorithmincan accounted for a large portion of an the automated response amplitudes the −3 handle the difficult task of maximizing the performance of a other muscles (r > 0.25, p 1e in all animals and muscles, relatively neural interface. results in provide except thecomplicated RTA in animal P1), theseThese responses other amuscles strong indication that a structured GP model can effectively also demonstrated substantial independence. Even in capture spinal responses over large sets of of muscle possibleresponses stimuli this simple paradigm, many combinations (tens to hundreds) and substantial periods of time (weeks). can be elicited. The creation of an efficiently computable Further, the GPwhich model faithfully allowed effective forwardin reward metric matchesextrapolation therapeutic utility in time from previous sessions and observations, a requirement multi-muscle behaviors is a highly non-trivial problem worthy to re-exploration. GP-BUCB also found taskof avoid futureextensive investigation. relevant anatomical structure in the responses it encountered, despite starting with no information as to which regions of the VII. 
C ONCLUSIONS cord were expected to be most responsive. This indicates that Ourstructures, experiments show that novel an automated algorithmones, can these and potentially or patient-specific handle the difficult task of maximizing the performance of can be found by online, automatic experimentation. Given thea relatively differences complicatedamong neural individual interface. These provide extensive injuriesresults and injured a strong indication that a structured GP model can effectively spinal cords, this capability is essential. Finally, our results capture responses over large trade sets off of possible stimuli show thatspinal GP-BUCB can effectively exploration and [2] S. Harkema et al., “Effect of epidural stimulation of the lumbosacral 1182–1185, 2012. spinal cord on voluntary movement, standing, and assisted stepping after [7] S. Thuret et al., “Therapeutic interventions after spinal cord injury,” Nat. motor complete paraplegia: A case study,” The Lancet, vol. 377, no. Rev. Neurosci., vol. 7, no. 8, pp. 628–643, 2006. 9781, pp. 1938–1947, 2011. [8] V. R. Edgerton et al., “Rehabilitative therapies after spinal cord injury,” [3] C. A. Angeli et al., “Altering spinal cord excitability enables voluntary J. of Neurotrauma, vol. 23, no. 3-4, pp. 560–570, 2006. movements after chronic complete paralysis in humans,” Brain, vol. 137, [9] Y. Gerasimenko et al., “Epidural stimulation: Comparison of the spinal no. 5, pp. 1394–1409, 2014. circuits that generate and control locomotion in rats, cats and humans,” [4] Exper. T. Desautels et al., “Application of a batch2008. active learning algorithm to Neur., vol. 209, no. 2, pp. 417–425, spinal cord injury therapy (program no. 466.01),” Neuroscience 2013 [10] A. P. Pêgo et al., “Regenerative medicine for the intreatment of spinal Abstracts. San Diego, CA:promises?” Soc. for Neuroscience, 2013. cord injury: More than just J. of CellularNovember and Molecular [5] Medicine, ——, “Parallelizing tradeoffs in Gaussian provol. 
16, no.exploration–exploitation 11, pp. 2564–2582, 2012. cess banditand optimization,” Mach. Learning Res.,with vol. body 15, pp.weight 3873– [11] A. Wernig S. Müller, J.“Laufband locomotion 3923, December 2014. support improved walking in persons with severe spinal cord injuries,” [6] Spinal R. vanCord, den vol. Brand voluntary 30,etno.al.,4, “Restoring pp. 229–238, 1992. control of locomotion after paralyzing Science, vol. 336, 6085,vol. pp. [12] G. Courtine et al., spinal “Spinalcord cordinjury,” injury: Time to move,” Theno. Lancet, 1182–1185, 2012. June 2011. 377, pp. 1896–1898, [7] ——, S. Thuret et al., “Therapeutic interventions spinalvia cord injury,”proNat. [13] “Recovery of supraspinal control of after stepping indirect Rev. Neurosci., vol. 7, no. 8,after pp. spinal 628–643, priospinal relay connections cord2006. injury,” Nature Medicine, [8] vol. V. R. et 69 al.,–“Rehabilitative therapies after spinal cord injury,” 14,Edgerton no. 1, pp. 74, 2008. J. of vol.“Functional 23, no. 3-4,electrotherapy: pp. 560–570, 2006. [14] W. T. Neurotrauma, Liberson et al., Stimulation of the [9] peroneal Y. Gerasimenko et al., “Epidural of the nerve synchronized withstimulation: the swingComparison phase of the gaitspinal of circuits thatpatients,” generateArchives and control locomotion in rats, cats and humans,” hemiplegic of Physical Medicine and Rehabilitation, Exper. Neur., vol. 209, no. 2, pp. 417–425, 2008. vol. 42, pp. 101–105, 1961. [10] A. A. Thrasher P. Pêgo et medicine fordue the totreatment of elecspinal [15] et al., al., “Regenerative “Reducing muscle fatigue functional cord stimulation injury: More thanrandom just promises?” Cellular and Molecular trical using modulationJ.ofofstimulation parameters,” Medicine, vol. 16,vol. no.29, 11,no. pp.6,2564–2582, 2012. Artificial Organs, pp. 453–458, 2005. [11] E.A.J. Wernig “Laufband locomotion body weight [16] Bradburyand andS.S. Müller, B. 
McMahon, “Spinal cord repairwith strategies: Why support improved walking in personsvol. with severe spinal August cord injuries,” do they work?” Nature Rev. Neurosci., 7, pp. 644–653, 2006. Spinal Cord, vol. et 30,al.,no. 4, pp. 229–238, 1992. [17] M. R. Dimitrijevic “Evidence for a spinal central pattern generator [12] inG.humans.” CourtineAnn. et al., Time to move,”vol. The860, Lancet, vol. of “Spinal the New cord York injury: Academy of Sciences, p. 360, 377, pp. 1896–1898, June 2011. 1998. 12 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. X, NO. Y, MONTH YEAR [18] A. Prochazka et al., “Neural prostheses,” J. of Physiology, vol. 533, no. 1, pp. 99–109, 2001. [19] C. N. Shealy et al., “Electrical inhibition of pain by stimulation of the dorsal columns: Preliminary clinical report,” Anesthesia and Analgesia, vol. 46, pp. 489–491, July-August 1967. [20] R. Herman et al., “Spinal cord stimulation facilitates functional walking in a chronic, incomplete spinal cord injured,” Spinal Cord, vol. 40, pp. 65–68, 2002. [21] K. Minassian et al., “Human lumbar cord circuitried can be activated by extrinsic tonic input to generate locomotor-like activity,” Human Movement Sci., vol. 26, pp. 275–295, 2007. [22] D. G. Sayenko et al., “Neuromodulation of evoked muscle potentials induced by epidural spinal cord stimuation in paralyzed individuals,” J. of Neurophysiology, vol. 111, no. 5, pp. 1088–1099, March 2014. [23] D. Baskent et al., “Using genetic algorithms with subjective input from human subjects: Implications for fitting hearing aids and cochlear implants,” Ear and Hearing, vol. 28, no. 3, pp. 370–380, 2007. [24] A. Frankemolle, “Reversing cognitive-motor impairment in parkinson’s disease patients using computational modelling approach to deep brain stimulation programming,” Brain, vol. 133, no. 3, pp. 746–761, 2010. [25] T. Mera, “Kinematic optimization of deep brain stimulation across multiple systems in parkinson’s disease,” J. Neurosci. Methods, vol. 198286, no. 
2, p. 280, 2011. [26] J. Giuffrida, “Automated optimization of deep brain stimulation settings multiple parkinson’s disease motor symptoms,” Neurology, vol. 74, no. 9, p. Suppl. 2:A479, 2011. [27] J. Fruitet et al., “Bandit algorithms boost brain computer interfaces for motor-task selection of a brain-controlled button,” in Adv. Neural Info. Proc. Syst., 2012, pp. 458–466. [28] ——, “Automatic motor task selection via a bandit algorithm for a braincontrolled button,” J. Neural Eng., vol. 10, no. 1, p. 016012, 2013. [29] T. Gürel and C. Mehring, “Unsupervised adaptation of brain-machine interface decoders,” Frontiers in neuroscience, vol. 6, 2012. [30] C. Vidaurre et al., “Co-adaptive calibration to improve BCI efficiency,” J. Neural Eng., vol. 8, no. 2, p. 025009, 2011. [31] A. Brunoni et al., “Clinical research with transcranial direct current stimulation (tdcs): Challenges and future directions,” Brain Stimulation, vol. 5, pp. 175–95, 2012. [32] R. M. Gorodnichev et al., “Transcutaneous electrical stimulation of the spinal cord: A noninvasive tool for the activation of stepping pattern generators in humans,” Human Physiology, vol. 38, no. 2, pp. 158–167, April 2012. [33] R. R. Roy et al., “Chronic spinal cord-injured cats: Surgical procedures and management,” Lab. Animal Sci., vol. 42, no. 4, pp. 335– 343, 1992. [34] T. Desautels et al., “Parallelizing exploration–exploitation tradeoffs with Gaussian process bandit optimization,” in Proc. 29th Int. Conf. Mach. Learning, 2012. [35] H. Robbins, “Some aspects of the sequential design of experiments,” Bul. Am. Math. Soc., vol. 55, 1952. [36] T. E. Prieto et al., “Measures of postural steadiness: Differences between healthy young and elderly adults,” IEEE Trans. Biomed. Eng., vol. 43, no. 9, pp. 956–966, 1996. [37] B. R. Santos et al., “Reliability of centre of pressure summary measures of postural steadiness in healthy young adults,” Gait & posture, vol. 27, no. 3, pp. 408–415, 2008. [38] D. M. 
Basso et al., “A sensitive and reliable locomotor rating scale for open field testing in rats,” J. of Neurotrauma, vol. 12, no. 1, pp. 1–21, 1995. [39] M. Antri et al., “Locomotor recovery in the chronic spinal rat: Effects of long-term treatment with a 5-HT2 agonist,” Eur. J. of Neuroscience, vol. 16, no. 3, pp. 467–476, 2002. [40] A. J. Fong et al., “Spinal cord-transected mice learn to step in response to quipazine treatment and robotic training,” J. Neurosci., vol. 25, no. 50, pp. 11 738–11 747, 2005. [41] L. L. Cai et al., “Implications of assist-as-needed robotic step training after a complete spinal cord injury on intrinsic strategies of motor learning,” J. Neurosci., vol. 26, no. 41, pp. 10 564–10 568, 2006. [42] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Trans. ASME–J. Basic Eng., vol. 82, no. D, pp. 35–45, 1960. [43] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006. [44] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (GPML) toolbox,” J. Mach. Learning Res., vol. 11, pp. 3011– 3015, 2010. [45] J. Quiñonero-Candela and C. Rasmussen, “A unifying view of sparse approximate gaussian process regression,” J. of Mach. Learning Res., vol. 6, pp. 1939–1959, 2005. [46] A. Slivkins, “Contextual bandits with similarity information,” J. Mach. Learning Res., vol. 15, pp. 2533–2568, July 2014. Thomas A. Desautels received the B.S. degree in biomedical engineering from the University of California, Davis, CA, USA, and the M.S. and Ph.D. degrees from the California Institute of Technology, Pasadena, CA, USA. He is now a visiting research associate at the Gatsby Computational Neuroscience Unit of University College London, London, UK, where his work is supported by the Whitaker International Scholarship. His research interests include neural decoding and interfaces, learning algorithms, and neuroprosthetics. Jaehoon Choe received the B.A. 
degree in biological sciences from the University of Chicago, Chicago, IL, USA, and the Ph.D. degree in neuroscience from the University of California, Los Angeles, CA, USA. He is now a postdoctoral research associate in the Information Systems Sciences Laboratory at HRL Laboratories LLC, Malibu, CA, USA. His research interests include brain-machine interfaces, neuroprosthetics, and neural modeling.

Parag Gad received the B.Eng. degree in biomedical engineering from the University of Mumbai, Mumbai, India, and the M.S. and Ph.D. degrees in biomedical engineering from the University of California, Los Angeles. He is currently employed as an assistant researcher in the Department of Integrative Biology and Physiology at the University of California, Los Angeles. His research includes developing hardware, software tools, and stimulation strategies to facilitate locomotion and bladder control after paralysis.

Mandheerej S. Nandra, biography not available at the time of publication.

Roland R. Roy received the B.S. degree in physical education from the University of New Hampshire, Durham, NH, USA, and the M.S. and Ph.D. degrees in exercise physiology from Michigan State University, East Lansing, MI, USA. He has been at the University of California, Los Angeles, since 1977, where he is currently a Distinguished Researcher in the Department of Integrative Biology and Physiology and the Brain Research Institute. He is the primary or co-author of over 400 research articles and book chapters. Dr. Roy is a member of the Society for Neuroscience, the American Physiological Society, and the American College of Sports Medicine.

Hui Zhong, biography not available at the time of publication.

Yu-Chong Tai (M’96-SM’03-F’06) received his B.S. degree in electrical engineering from the National Taiwan University, Taiwan, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Berkeley, Berkeley, CA, USA.
He joined the California Institute of Technology as an Assistant Professor in electrical engineering in 1989. He is the Anna L. Rosen Professor of Electrical Engineering and Mechanical Engineering and the Executive Officer of Medical Engineering at the California Institute of Technology. His research focuses on Micro-Electro-Mechanical Systems (MEMS) and implantable micro biomedical devices. Dr. Tai was the co-chairman of the 2002 IEEE MEMS Conference. He received the 2015 IEEE Robert Bosch MEMS/NEMS Award. He is also a senior member of the American Society of Mechanical Engineers (ASME).

V. Reggie Edgerton, biography not available at the time of publication.

Joel W. Burdick received his B.S. degree in mechanical engineering from Duke University and M.S. and Ph.D. degrees in mechanical engineering from Stanford University. He has been with the Department of Mechanical Engineering at the California Institute of Technology since May 1988, where he is the Richard and Dorothy Hayman Professor of Mechanical Engineering and BioEngineering. He was appointed an IEEE Robotics Society Distinguished Lecturer in 2003. His current research interests include sensor-based robot motion planning, multi-fingered robotic hand manipulation, and spinal cord injury rehabilitation.
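As an illustrative note on the information-gain comparison in the Information Gain subsection, the following minimal sketch shows one way such a quantity could be computed for a GP with Gaussian observation noise, using the standard identity I(y_A; f) = ½ log det(I + σ⁻² K_A) and the chain rule for a day's conditional gain. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared-exponential kernel stands in for the hybrid kernel k_h, stimuli are reduced to hypothetical 1-D positions, and the noise variance, length scale, and function names are invented.

```python
import math


def sq_exp_kernel(x, y, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance (a stand-in, NOT the paper's hybrid kernel)."""
    return signal_var * math.exp(-0.5 * ((x - y) / length_scale) ** 2)


def log_det_spd(a):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def information_gain(stimuli, noise_var=0.1):
    """I(y_A; f) = 0.5 * log det(I + K_A / noise_var), in nats."""
    n = len(stimuli)
    if n == 0:
        return 0.0
    m = [
        [
            sq_exp_kernel(stimuli[i], stimuli[j]) / noise_var
            + (1.0 if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return 0.5 * log_det_spd(m)


def conditional_gain(previous, today, noise_var=0.1):
    """I(y_today; f | y_previous), obtained from the chain rule of mutual information."""
    return information_gain(previous + today, noise_var) - information_gain(
        previous, noise_var
    )


# Hypothetical stimulus positions: earlier sessions clustered near 0.1.
previous = [0.0, 0.1, 0.2]
exploit_day = [0.1, 0.1, 0.1, 0.1]  # repeats a well-understood stimulus
explore_day = [0.1, 1.5, 3.0, 4.5]  # one repeat plus three distant stimuli

gain_exploit = conditional_gain(previous, exploit_day)
gain_explore = conditional_gain(previous, explore_day)
ratio = gain_exploit / gain_explore

print("exploitative day gain (nats):", gain_exploit)
print("exploratory day gain (nats):", gain_explore)
print("ratio exploit/explore:", ratio)
```

In this toy setting, a day spent repeating a well-understood stimulus yields less conditional information than a day that also samples distant stimuli, which is consistent with the explanation offered for the algorithm's somewhat smaller gains. The ratio computed at the end only mirrors the kind of ratio reported in Table IV; it does not reproduce any of its values.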