Manhattan Rule Training for Memristive Crossbar Circuit Pattern Classifiers

Elham Zamanidoost, Farnood M. Bayat, Dmitri Strukov
Electrical and Computer Engineering Department, University of California Santa Barbara, Santa Barbara, CA, USA
{elham, farnoodmb, strukov}@ece.ucsb.edu

Irina Kataeva
Advanced Research Division, Denso Corporation, Komenoki-cho, Nisshin, Japan
irina_kataeva@denso.co.jp

Abstract—We investigated batch and stochastic Manhattan Rule algorithms for training multilayer perceptron classifiers implemented with memristive crossbar circuits. In Manhattan Rule training, the weights are updated using only the sign information of the classical backpropagation algorithm. The main advantage of the Manhattan Rule is its simplicity, which leads to a more compact hardware implementation and a shorter training time. Additionally, in the case of stochastic training, the Manhattan Rule allows all weight updates to be performed in parallel, which further speeds up the training procedure. The tradeoff for this simplicity is slightly worse classification performance. For example, simulation results showed that the classification fidelity on the Proben1 benchmark for a memristor-based implementation trained with the batch Manhattan Rule was comparable to that of the classical backpropagation algorithm, and about 2.8 percent worse than the best reported results.

Keywords—Crossbar memory, Memristor, Artificial neural network, Training algorithm, Pattern classification.

I. INTRODUCTION

Artificial neural networks are a biologically inspired computing paradigm suitable for a variety of applications. To approach the energy efficiency of biological neural networks in information processing, specialized hardware must be developed [1]–[5]. Crossbar-based hybrid circuits [6]–[10], and in particular those of the CrossNets variety [10], [11], have been identified as one of the most promising solutions, because such circuits can provide a high integration density of artificial synapses and high connectivity between artificial neurons, which are two major challenges for efficient artificial neural network hardware implementation. The disadvantage of crossbar circuits is certain restrictions on how signals can be applied to the artificial synapses, which, in turn, impose limitations on the training algorithms.

The main contribution of this paper is the development of a crossbar-circuit-compatible training approach for multilayer perceptron (MLP) networks. The performance of the proposed training approach was simulated using a standard benchmark suite for a specific memristive crossbar network, which has recently been utilized in a successful experimental demonstration of a small-scale pattern classifier [1], [5]. While the considered MLPs are small and hardly practical, their key features, e.g. the network topology and training algorithms, are similar to those of more practical networks such as deep-learning convolutional neural networks [12].

To the best of our knowledge, the presented work is novel, and Manhattan Rule training has not been investigated before in the context of crossbar circuit hardware. The majority of previous work is devoted to unsupervised spiking networks [4], [7], [8].

This work is supported by AFOSR under MURI grant FA9550-12-1-0038 and by Denso Corporation, Japan. We also acknowledge support from the Center for Scientific Computing at the CNSI and MRL: an NSF MRSEC (DMR-1121053) and NSF CNS-0960316.
The most relevant studies are, perhaps, the experimental demonstration of a single-layer perceptron [1], [5], and simulations of small-scale [6], [9] and larger-scale [13] pattern classifiers. In the next section, we provide brief background information on the considered neural network, its crossbar circuit implementation, and the memristor model used for the simulation studies in this paper.

II. BACKGROUND

A. Multi-Layer Perceptron

In its simplest form, a feedforward neural network can be represented by a directed acyclic graph (Fig. 1a) in which neurons and synapses are the nodes and edges of the graph, respectively. Each neuron applies a certain transfer function to the sum of its inputs and then passes the result forward to the next layer of neurons. A synapse multiplies its weight W with the output of the pre-synaptic neuron and passes the resulting product to the input of the post-synaptic neuron. Mathematically, the operation within a single layer of the network can be formulated as

fi = tanh(βui), with ui = Σj Wij Xj,   (1)

where ui and fi are the input and output of the i-th post-synaptic neuron, respectively, Xj is the output of the j-th pre-synaptic neuron, and Wij is the synaptic weight between the j-th pre-synaptic and the i-th post-synaptic neurons. Each neuron is assumed to have a tanh activation function with slope β. For the first layer of the network, the |X| ≤ 1 values correspond to the applied input pattern.

Fig. 1. Feedforward artificial neural network: (a) Abstracted graph representation of one layer with three input and two output neurons, (b) its crossbar circuit mapping, and (c, d) memristor-based crossbar implementation.

Feedforward neural networks, and in particular the multilayer perceptrons which are based on them, allow performing pattern classification, i.e. the mapping of input patterns to certain classes. The classification is considered successful if the output neuron corresponding to the class of the applied pattern produces the largest value. Such operation is achieved by properly setting the weights in the network, which in the most general case cannot be calculated analytically but rather are found via some optimization procedure, e.g. the backpropagation training algorithm in MLP networks [14].

Backpropagation training can be implemented in batch or stochastic mode. For stochastic (sometimes called online) training, the weights are adjusted immediately after the application of a single pattern from the training set. In the first step of this algorithm, a randomly chosen pattern n from the training set is applied to the network and the output is calculated according to (1). In the second step, the synaptic weights are adjusted according to

ΔWij(n) = αδi(n)Xj(n),   (2)

where α is the learning rate and δi is the local (backpropagated) error of the corresponding post-synaptic neuron. δi is calculated first for the output neurons, for which it is equal to the product of the derivative of the neuron output with respect to its input and the difference between the desired output f(g) and the actual output f, i.e.

δi(n) = [fi(g)(n) - fi(n)] (∂fi/∂ui),   (3)

The error is then propagated backward (i.e. from the output to the input layer) using the recurrence relation

δjpre(n) = (∂fj/∂uj) Σi δipost(n) Wij(n),   (4)

(Additional superscripts are added to distinguish between pre- and post-synaptic variables when describing the operation within a network layer.) The application of all patterns from the training set constitutes one epoch of training, with multiple epochs typically required for successful training. In the simplest version of the batch backpropagation algorithm, the synaptic weights are adjusted by

ΔWij' = Σn ΔWij(n),   (5)

only at the end of each epoch, i.e. after all training patterns have been applied to the network.

Reaching a perfect mapping during training is not guaranteed. In addition, classification performance is typically measured on a separate set of test patterns, which are not used in the training process. Classifier performance is therefore characterized by the misclassification rate (MCR), i.e. the percentage of input patterns that are classified incorrectly. The other important metric is the training time, which characterizes how quickly the training algorithm converges.
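To make the update rules concrete, the following is a minimal NumPy sketch of one stochastic backpropagation step for a single tanh layer described by (1)-(3); the function names, the learning rate value, and the slope β = 1 are illustrative choices rather than parameters taken from this paper.

```python
import numpy as np

def forward(W, X, beta=1.0):
    """One network layer, Eq. (1): u_i = sum_j W_ij X_j, f_i = tanh(beta * u_i)."""
    u = W @ X
    return np.tanh(beta * u), u

def stochastic_step(W, X, f_target, alpha=0.01, beta=1.0):
    """One stochastic backpropagation update of the output layer, Eqs. (2)-(3)."""
    f, u = forward(W, X, beta)
    dfdu = beta * (1.0 - f ** 2)        # derivative of tanh(beta*u) w.r.t. u
    delta = (f_target - f) * dfdu       # local error, Eq. (3)
    W = W + alpha * np.outer(delta, X)  # weight increment, Eq. (2)
    return W, delta
```

For a multi-layer network, the local errors of hidden layers would be obtained from the recurrence (4) before applying the same update to the corresponding weight matrix.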
B. MLP Implementation with Crossbar Circuits

The MLP structure maps naturally onto the crossbar array circuit (Fig. 1b). In particular, X and f are physically implemented with voltages |V| ≤ Vread, while the neuron's input u is implemented with a current I. Synapses are implemented with crosspoint devices whose conductance G is proportional to the synaptic weight W. Because weight values can be negative, while physical conductance is strictly positive, one solution is to represent each synaptic weight with a pair of crosspoint devices (Fig. 1c), denoted G+ and G-, i.e.

Wij ≡ Gij+ - Gij-,   (6)

In such a configuration, the neuron receives two currents, one from the crossbar line with weights G+ and another from the line with weights G-, so that negative weights are implemented via the subtraction of these two currents inside the neuron (Fig. 1d). With Gmax and Gmin being the maximum and minimum conductances of the crosspoint devices, respectively, the effective weight ranges from -(Gmax - Gmin) to Gmax - Gmin.

Assuming virtually grounded inputs of the post-synaptic neuron, the input current I is equal to the product GV. The current difference is converted to a voltage via an operational amplifier with feedback resistor R and then applied to a saturating operational amplifier that approximates the hyperbolic tangent activation function [1], [5], [11], i.e. implementing (1) at the physical level:

Vipost = Vread tanh[R(Ii+ - Ii-)],   (7a)
Ii = Ii+ - Ii- = Σj (Gij+ - Gij-)Vjpre,   (7b)

In general, a crossbar classifier can be trained ex-situ or in-situ. In the first method, the neural network is first implemented in software and the proper weights are calculated by simulating the training process. The calculated weights are then imported into the hardware, which is somewhat similar to the write operation in conventional memory. The main difference, however, is that the imported values are analog (or multi-bit), which dramatically increases the complexity and duration of the write operation. Alternatively, in the in-situ approach, training is implemented directly in the hardware: the weights are physically adjusted in hardware during training as described by (2) or (5). Both ex-situ and in-situ training have recently been demonstrated experimentally for memristive crossbar circuits [1], [5]. The advantage of the ex-situ method is that any (i.e. state-of-the-art) training algorithm that yields the best classification performance can be implemented in software without incurring much overhead in hardware. In-situ training, however, automatically adjusts to any hardware variations, which are unavoidable in analog circuits.
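As an illustration of (6) and (7), here is a small NumPy sketch of the differential-pair mapping and the resulting neuron computation. The particular way of splitting a signed weight over a conductance pair is only one possible (non-unique) choice, and the constants are the ones used later in the simulations (Gmin = 0.01 mS, Gmax = 1 mS, R = 2.27 kΩ, Vread = 0.5 V); this is an algorithmic illustration, not a circuit-accurate model.

```python
import numpy as np

G_MIN, G_MAX = 0.01e-3, 1e-3   # conductance range (S), as assumed in Sec. IV
R, V_READ = 2.27e3, 0.5        # feedback resistor (ohm) and read voltage (V)

def weights_to_conductances(W):
    """Map signed weights onto differential pairs, Eq. (6): W = G+ - G-."""
    G_pos = np.clip(G_MIN + np.maximum(W, 0.0), G_MIN, G_MAX)
    G_neg = np.clip(G_MIN + np.maximum(-W, 0.0), G_MIN, G_MAX)
    return G_pos, G_neg

def crossbar_layer(G_pos, G_neg, V_pre):
    """Eq. (7): currents summed on virtually grounded lines, subtracted,
    and passed through a saturating amplifier approximating tanh."""
    I = (G_pos - G_neg) @ V_pre        # Eq. (7b)
    return V_READ * np.tanh(R * I)     # Eq. (7a)
```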
Note that obtaining and supplying detailed information about a circuit's defects and variations to the software-implemented network for ex-situ training is hardly practical for large-scale systems. Because of this issue, the particular focus of this paper is on in-situ training.

C. Memristive Devices

In general, different types of two-terminal resistive switching devices [15] can be integrated into crossbar array circuits to implement a pattern classifier. In this paper, a particular type of crosspoint device, the Pt/TiO2-x/Pt memristor, for which an accurate device model is available [16], has been investigated. Typical switching I-V characteristics of such devices are shown in Fig. 2. The device conductance can be gradually decreased (reset) by applying positive voltages above VresetTH = 1.3 V and gradually increased (set) with negative voltages below VsetTH = -0.9 V. The rate of conductance change for both switching transitions increases with the applied voltage (Fig. 3). The conductance remains unchanged when small voltages, i.e. |V| ≤ Vread = 0.5 V for the considered devices, are applied. Therefore, we assumed that relatively large voltages (exceeding the set or reset threshold) are applied to adjust the synaptic weights during training, while smaller (read) voltages, which do not modify the synaptic conductances, are used to calculate the network output during training and/or operation of the classifier.

Fig. 2. Simulated and experimentally measured I-V switching characteristics of a Pt/TiO2-x/Pt memristor for the applied voltage sweep shown in the inset [16].

Fig. 3. Simulated switching dynamics for the reset and set transitions of the considered memristors, showing the absolute change in conductance |ΔG| as a function of the initial conductance G for several values of the write voltage (Vset = -0.9 V ... -1.7 V, Vreset = 1.3 V ... 2.6 V, incremented in 0.1 V steps). The conductances are measured at 0.5 V (i.e. at the read bias).

III. IN-SITU MANHATTAN RULE TRAINING

For in-situ training to be practical, its area and time overhead should be minimized. A straightforward implementation of the backpropagation algorithm in memristive hardware does not seem practical, because each weight must be modified by a unique amount according to (2) or (5). Such analog adjustment of the weights is possible (e.g. using a feedback write algorithm [17]) and could be reasonable for small circuits [1], but it would certainly be too slow for the desired large-scale circuits, especially taking into account that a large number of epochs is typically required for training [12].

Fortunately, there are useful variations of the backpropagation algorithm which allow a much more efficient implementation of training in the considered memristive crossbar networks. Here, we consider one such example, Manhattan Rule training [18], which is a coarse-grained variation of the backpropagation algorithm. In Manhattan Rule training, only the sign information of the weight adjustment is utilized, so that the weight updates (2) and (5) become

ΔWijM(n) = sgn[ΔWij(n)],   (8)

and

ΔWijM' = sgn[ΔWij'],   (9)

The main appeal of such a training algorithm is that all weights are updated by the same amount, which simplifies the weight update operation and creates an opportunity for an efficient implementation of in-situ training in hardware.
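A minimal sketch of the coarsened updates (8) and (9), applied on top of the backpropagation increments of Sec. II.A, is given below; the fixed step size dW is an illustrative stand-in for the weight change produced by a single set/reset pulse and is not a value taken from the device model.

```python
import numpy as np

def manhattan_update(W, delta, X, dW=1e-3):
    """Stochastic Manhattan Rule, Eq. (8): keep only the sign of Eq. (2)."""
    return W + dW * np.sign(np.outer(delta, X))

def manhattan_batch_update(W, accumulated_dW, dW=1e-3):
    """Batch Manhattan Rule, Eq. (9): sign of the summed increments of Eq. (5)."""
    return W + dW * np.sign(accumulated_dW)
```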
Fig. 4 shows one instance of such an implementation, which we consider in this paper. In particular, let us consider a small portion of the crossbar consisting of 4×2 effective weights, or equivalently 4×4 differential weights (Fig. 4c). According to (2), for stochastic training the sign of the weight update depends on the peripheral values of the local error δ (associated with the horizontal crossbar lines in Fig. 4c) and the input X (associated with the vertical lines). There are four possible combinations of the signs of δ and X, and therefore the adjustment of all weights can be performed in four steps, with each step corresponding to a particular combination of signs. For example, Fig. 4c shows the weight update for the specific case δ1 < 0, δ2 > 0 and X1 > 0, X2 < 0, X3 < 0, X4 > 0. (Note that with the considered differential weight implementation, both the positive and the negative synapse of a pair are adjusted during the weight update, with the latter always updated in the opposite direction.) Because all updates have the same magnitude, all weights sharing the same signs of δ and X in a given step can be updated simultaneously. To implement this parallel update, each crossbar line receives the specific voltage pulse sequence shown in Fig. 4a. In any particular step of such a sequence, only one specific set of memristors, located at the crossbar lines with the same signs of δ and X, receives a voltage bias large enough and of the proper polarity to modify their conductances (Fig. 4b). The remaining memristors, which are not supposed to be modified during that step, receive voltages below the corresponding switching thresholds, which is ensured by choosing

VresetTH ≤ Vreset < 2VresetTH, and VsetTH ≥ Vset > 2VsetTH.   (10)

The hardware implementation of Manhattan Rule training is therefore quite straightforward: pulse sequences s1 and s2 are applied to the vertical crossbar lines with X < 0 and X > 0, respectively, and pulse sequences s3 and s4 to the horizontal crossbar lines with δ < 0 and δ > 0.

In batch Manhattan Rule training, the weight updates are no longer correlated with the peripheral error and input values (Fig. 4d). In this case, memristors can be updated in parallel for two crossbar lines (which form a differential pair) using the scheme proposed for stochastic training. Multiple pairs of crossbar lines, however, are updated sequentially in this case (Figs. 4e and 4f).

IV. SIMULATION RESULTS

The considered training approach was evaluated on three datasets, Cancer1, Diabetes1 and Thyroid1, from the Proben1 benchmark [19]. Each dataset is implemented with a two-layer differential-weight MLP network with 4 hidden neurons. There are 9, 8 and 21 input neurons, and 2, 2, and 3 output neurons for the Cancer1, Diabetes1, and Thyroid1 datasets, respectively. The total number of patterns in the training set is 350, 384 and 3600 for Cancer1, Diabetes1 and Thyroid1, respectively. Several cases of weight updates were considered. In all simulations, conductances were initialized randomly between Gmin = 0.01 mS and Gmax = 1 mS, and clipped at Gmin and Gmax during training. Also, R = 2.27 kΩ and the target output values were V(g) = ±0.29 V, which correspond to the recommended sigmoid function from [14] for the considered Vread and range of conductances. The benchmark inputs were scaled to the [-Vread, Vread] range. All computations were performed using 32-bit floating-point arithmetic.
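Putting the pieces of Sec. III together, the sketch below shows how one stochastic Manhattan Rule update of a differential crossbar could be simulated: the four sign combinations of δ and X are visited one after another, and all conductance pairs sharing a given combination are changed together, mimicking the parallel pulse scheme of Fig. 4. The fixed conductance change dG per pulse is an illustrative stand-in for the device model of Sec. II.C; the bounds Gmin and Gmax are those used in the simulations.

```python
import numpy as np

G_MIN, G_MAX = 0.01e-3, 1e-3   # conductance bounds used in the simulations (S)

def four_step_update(G_pos, G_neg, delta, X, dG=1e-5):
    """Stochastic Manhattan Rule update of a differential crossbar.

    Each of the four steps handles one combination of sign(delta) (rows) and
    sign(X) (columns); all devices sharing that combination are updated
    together, as in the pulse scheme of Fig. 4a-c.  dG stands in for the
    conductance change caused by a single set/reset pulse.
    """
    for s_delta in (+1, -1):           # pulse pair on the horizontal lines
        for s_x in (+1, -1):           # pulse pair on the vertical lines
            mask = np.outer(np.sign(delta) == s_delta, np.sign(X) == s_x)
            step = dG * s_delta * s_x  # desired sign of the weight change
            # the positive and negative device of each pair move in opposite directions
            G_pos = np.where(mask, np.clip(G_pos + step, G_MIN, G_MAX), G_pos)
            G_neg = np.where(mask, np.clip(G_neg - step, G_MIN, G_MAX), G_neg)
    return G_pos, G_neg
```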
In the first ("ideal") case, the weights were updated according to (8) and (9) without using the device model. Table I shows the best classification performance achieved within 4000 epochs and the corresponding number of epochs required to reach it, with both values averaged over 1500 runs. For comparison, the table also shows simulation results for the conventional batch backpropagation algorithm and some of the best reported results.

TABLE I. CLASSIFICATION PERFORMANCE FOR IDEAL NETWORK

            Batch Manhattan Rule        Batch Backpropagation       Best reported
Dataset     Avg. MCR   Avg. # epochs    Avg. MCR   Avg. # epochs    MCR [19]
Cancer1     .0288      800              .0291      1995             .0114
Diabetes1   .2641      474              .2726      2120             .2500
Thyroid1    .0753      305              .0774      1716             .0200

Fig. 4. Manhattan Rule training implementation: (a) the 4-step pulse sequences applied to the crossbar lines and (b) the corresponding voltage biases across a memristor (with respect to its bottom terminal) resulting from the application of the pulse sequences. (c) A specific example of a desired weight update for stochastic training in a 4×4 memristive crossbar circuit and its corresponding implementation. (d) A specific example of a desired weight update for batch training, and (e, f) its corresponding implementations. In panel (d), red and green backgrounds correspond to negative and positive updates, respectively.

In the remaining studies, the weight updates were performed using the memristor device model described in Sec. II.C. The best classification performance results were chosen within 1500 and 300 epochs of training for the batch and stochastic algorithms, respectively. Fig. 5 shows simulation results for batch training using various pairs of Vset and Vreset voltages satisfying (10). The performance results are summarized in Table II.

Fig. 5. Misclassification rate for batch training as a function of Vset and Vreset for the Cancer1, Diabetes1 and Thyroid1 datasets. The results are averaged over 500 runs.

TABLE II. CLASSIFICATION PERFORMANCE FOR MEMRISTIVE CROSSBAR CIRCUITS (500 RUNS)

Batch Manhattan Rule
Dataset     Avg. MCR   Avg. # of epochs   Optimal Vset / Vreset   Avg. # of updates
Cancer1     .0268      580                -1.7 V / 1.3 V          2.32 × 10^3
Diabetes1   .2647      388                -1.1 V / 2.5 V          1.55 × 10^3
Thyroid1    .0765      460                -1.7 V / 1.3 V          1.84 × 10^3

Stochastic Manhattan Rule
Dataset     Avg. MCR   Avg. # of epochs   Optimal Vset / Vreset   Avg. # of updates
Cancer1     .066       35                 -1.5 V / 1.5 V          12.5 × 10^3
Diabetes1   .344       26                 -0.9 V / 1.3 V          9.98 × 10^3
Thyroid1    .0730      10                 -1.7 V / 2.1 V          3.6 × 10^4

The Manhattan Rule training was also simulated for a more realistic case, with defects and variations added to the memristive crossbar network. Fig. 6 shows simulation results with a fraction of randomly chosen memristors stuck in either the high-conductance state Gmax (stuck-on-close) or the low-conductance state Gmin (stuck-on-open). In particular, the defective memristors are assumed to be equally split between stuck-on-close and stuck-on-open, so that, e.g., a defect fraction of 0.2 corresponds to 10% stuck-on-open and 10% stuck-on-close devices.
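For reference, a small sketch of how such stuck-device defects could be injected into a simulated conductance matrix, with the defect fraction split equally between stuck-on-open (Gmin) and stuck-on-close (Gmax) devices as described above; the helper name, array shapes, and random number generator are illustrative, not part of the reported simulation code.

```python
import numpy as np

G_MIN, G_MAX = 0.01e-3, 1e-3   # conductance bounds used in the simulations (S)

def inject_stuck_defects(G, defect_fraction, rng=None):
    """Freeze a random subset of devices: half at Gmin (stuck-on-open) and
    half at Gmax (stuck-on-close).  Returns the defective conductance matrix
    and a boolean mask of healthy devices (only these change during training)."""
    rng = np.random.default_rng() if rng is None else rng
    n_defective = int(defect_fraction * G.size)
    idx = rng.choice(G.size, size=n_defective, replace=False)
    G_def = G.copy().ravel()
    G_def[idx[: n_defective // 2]] = G_MIN     # stuck-on-open half
    G_def[idx[n_defective // 2:]] = G_MAX      # stuck-on-close half
    healthy = np.ones(G.size, dtype=bool)
    healthy[idx] = False
    return G_def.reshape(G.shape), healthy.reshape(G.shape)
```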
Fig. 7 shows simulation results for memristive crossbar circuits with device-to-device variations of the switching thresholds. Such variations were simulated by assuming that Vset and Vreset of each device are normally distributed, with mean values corresponding to the optimal ones reported in Table II.

V. DISCUSSION AND SUMMARY

The simulation results summarized in Tables I and II show that the classification fidelity of batch Manhattan Rule training is comparable to that of conventional backpropagation training and somewhat worse than the best reported performance. Moreover, the classification fidelity remained the same (or even slightly improved) when the simulations were performed with realistic device models. The slight improvement in performance can be explained by more optimal training conditions, i.e. the optimal choice of the reset and set voltages, which effectively defines the learning rate for the simulated in-situ training.

Stochastic Manhattan Rule training requires fewer epochs to converge, though its classification performance was significantly worse than that of batch training. A similar outcome is quite typical for classical backpropagation training [14]. The additional coarsening of the weight update in the Manhattan Rule algorithm seems to be the reason behind the further increase in the performance gap between the two modes of training.

As Figs. 6 and 7 show, the considered network is quite robust to variations in the device switching dynamics and to stuck-on defects. The main effect of device-to-device variations is on the convergence speed. For example, the number of training epochs needed to reach the classification fidelity of the variation-free network increased by at least 10%, 40% and 32% for Cancer1, Diabetes1 and Thyroid1, respectively, while the classification performance degraded rather gracefully with added variations.

Because in stochastic training the weights are updated for each applied pattern, it is useful to estimate the training time in terms of elementary weight updates rather than epochs. Assuming that the application of the four-step pulse sequence is one elementary update, the training time for the stochastic algorithm is the product of the number of patterns in the training set and the total number of epochs. Here we neglect other computations during training, such as those described by (3) and (4), and assume that the weights can be updated in parallel in different MLP layers (even though the error is back-propagated sequentially). For batch training, the weights are updated only once per epoch; however, because of the sequential update, the training time for a particular crossbar layer is the product of the number of post-synaptic neurons and the total number of epochs. Table II provides the training time expressed in elementary updates. Clearly, batch training is the fastest when these implementation details are taken into account. It is unclear, though, whether this will hold for more practical circuits with much larger crossbar arrays.
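The update counts listed in Table II follow directly from these estimates; the short script below makes the arithmetic explicit (the epoch counts are those reported in Table II, and the factor of 4 is the number of hidden, i.e. post-synaptic, neurons of the wider crossbar layer).

```python
# Training time in elementary (four-step) weight updates, per the estimates above.
patterns = {"Cancer1": 350, "Diabetes1": 384, "Thyroid1": 3600}      # training-set sizes
stochastic_epochs = {"Cancer1": 35, "Diabetes1": 26, "Thyroid1": 10}
batch_epochs = {"Cancer1": 580, "Diabetes1": 388, "Thyroid1": 460}
hidden_neurons = 4   # post-synaptic neurons of the first (wider) crossbar layer

for d in patterns:
    print(d,
          patterns[d] * stochastic_epochs[d],   # stochastic: e.g. 350 * 35 = 12250
          hidden_neurons * batch_epochs[d])     # batch:      e.g. 4 * 580  = 2320
```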
Since in this paper we focused only on the weight update implementation, let us briefly discuss the area overhead of the other operations required during training. It should be noted first that the most computationally expensive operation in error backpropagation is the vector-by-matrix product of δ and W in (4). Such an operation can easily be performed without much additional overhead using the same crossbar array hardware, but with the direction of computation reversed. Other operations, e.g. the derivative calculations in (3) and (4), are performed at the periphery of the array, and hence their relative contribution to the total area is expected to shrink as the crossbar arrays get larger (which will happen for more practical applications). The most challenging operation in batch training is the calculation and storage of the temporary weight increments, which must be performed for each weight of the array. If the network does not have to be retrained frequently, one solution would be to implement this operation off-chip. Investigation of better solutions, which would, e.g., combine the small overhead of stochastic Manhattan Rule training with the high classification performance of batch training, is our immediate goal.

In summary, we proposed a training approach based on the Manhattan Rule algorithm for multilayer perceptron networks implemented with memristive crossbar circuits. The classification performance of the proposed approach was evaluated on the Proben1 benchmark for the batch and stochastic modes of training and compared with state-of-the-art results. We found that, of the two, batch training results in better classification performance and potentially shorter training time, though at the price of a significantly higher implementation overhead.

Fig. 6. Classification performance (MCR) for batch training with stuck-on-open and stuck-on-close devices, as a function of the defect fraction, for the Cancer1, Diabetes1 and Thyroid1 datasets. For all panels, the right vertical axis shows the increase in the percentage of cases that did not converge to an acceptable solution, namely when the MCR remained above 0.1, 0.4 and 0.3 for Cancer1, Diabetes1 and Thyroid1, respectively, within 1500 training epochs. The results are averaged over 1500 runs.

Fig. 7. Classification performance (MCR) as a function of the standard deviation of the set and reset switching threshold voltages, for the Cancer1, Diabetes1 and Thyroid1 datasets. The results are averaged over 1500 runs.

ACKNOWLEDGMENT

The authors would like to acknowledge helpful discussions with F. Alibart, O. Bichler, C. Gamrat, K. K. Likharev, G. Snider, and D. Querlioz.

REFERENCES

[1] F. Alibart, E. Zamanidoost, and D. B. Strukov, "Pattern classification by memristive crossbar circuits using ex-situ and in-situ training", Nature Communications, vol. 4, Jun. 2013.
[2] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision", in 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2011, pp. 109–116.
[3] J. Lu, S. Young, I. Arel, and J. Holleman, "A 1 TOPS/W analog deep machine-learning engine with floating-gate storage in 0.13 µm CMOS", in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 504–505.
[4] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45 pJ per spike in 45 nm", in 2011 IEEE Custom Integrated Circuits Conference (CICC), 2011, pp. 1–4.
[5] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors", available online at http://arxiv.org/abs/1412.0611.
[6] D. Chabi, D. Querlioz, W. Zhao, and J.-O. Klein, "Robust learning approach for neuro-inspired nanoscale crossbar architecture", J. Emerg. Technol. Comput. Syst., vol. 10, no. 1, pp. 5:1–5:20, Jan. 2014.
[7] Y. Kim, Y. Zhang, and P. Li, "A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning", in 2012 IEEE International SOC Conference (SOCC), 2012, pp. 328–333.
[8] D. Querlioz, O. Bichler, and C. Gamrat, "Simulation of a memristor-based spiking neural network immune to device variations", in The 2011 International Joint Conference on Neural Networks (IJCNN), 2011, pp. 1775–1781.
[9] C. Yakopcic and T. M. Taha, "Energy efficient perceptron pattern recognition using segmented memristor crossbar arrays", in The 2013 International Joint Conference on Neural Networks (IJCNN), 2013, pp. 1–8.
[10] K. K. Likharev, "Hybrid CMOS/nanoelectronic circuits: Opportunities and challenges", Journal of Nanoelectronics and Optoelectronics, vol. 3, no. 3, pp. 203–230, Dec. 2008.
[11] K. K. Likharev, "CrossNets: Neuromorphic hybrid CMOS/nanoelectronic networks", Science of Advanced Materials, vol. 3, no. 3, pp. 322–331, Jun. 2011.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", in NIPS'12, 2012, pp. 1097–1105.
[13] J. H. Lee and K. K. Likharev, "Defect-tolerant nanoelectronic pattern classifiers", Int. J. Circ. Theor. Appl., vol. 35, no. 3, pp. 239–264, May 2007.
[14] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp", in Neural Networks: Tricks of the Trade, G. Montavon, G. B. Orr, and K.-R. Müller, Eds., Springer Berlin Heidelberg, 2012, pp. 9–48.
[15] D. B. Strukov and H. Kohlstedt, "Resistive switching phenomena in thin films: Materials, devices, and applications", MRS Bulletin, vol. 37, no. 2, pp. 108–114, Feb. 2012.
[16] F. Merrikh-Bayat, B. Hoskins, and D. B. Strukov, "Phenomenological modeling of memristive devices", Appl. Phys. A, vol. 118, no. 3, pp. 779–786, Jan. 2015.
[17] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm", Nanotechnology, vol. 23, no. 7, p. 075201, Feb. 2012.
[18] W. Schiffmann, M. Joost, and R. Werner, "Optimization of the backpropagation algorithm for training multilayer perceptrons", Technical Report, University of Koblenz, 1994.
[19] L. Prechelt, "PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms", Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, 1994.