AAAI Technical Report FS-12-01 Artificial Intelligence for Gerontechnology An Automated Machine Learning Approach Applied To Robotic Stroke Rehabilitation Jasper Snoek, Babak Taati and Alex Mihailidis University of Toronto 500 University Ave. Toronto, ON, Canada Abstract 3. Apply a selection of standard discriminative machine learning algorithms and compare results. While machine learning methods have proven to be a highly valuable tool in solving numerous problems in assistive technology, state-of-the-art machine learning algorithms and corresponding results are not always accessible to assistive technology researchers due to required domain knowledge and complicated model parameters. This short paper highlights the use of recent work in machine learning to entirely automate the machine learning pipeline, from feature extraction to classification. A nonparametrically guided autoencoder is used to extract features and perform classification while Bayesian optimization is used to automatically tune the parameters of the model for best performance. Empirical analysis is performed on a real-world rehabilitation research problem. The entirely automated approach significantly outperforms previously published results using carefully tuned machine learning algorithms on the same data. This strategy is unsatisfying for a number of reasons. The performance of each machine learning algorithm, for example, is dependent on the method used for feature extraction. Consider a classification task where the structure of interest within the data lies on some nonlinear latent manifold. This structure can be captured by a nonlinear feature extraction followed by a linear classifier or conversely linear feature extraction followed by a nonlinear classifier. Although this is a simple example, it elucidates the fact that there are underlying complexities that make the comparison of various approaches challenging or less meaningful. In general, each machine learning algorithm also requires the setting of non-trivial hyperparameters. Often these parameters govern the complexity of the model or the amount of regularisation and require expert domain knowledge and time-consuming cross-validation procedures to select. Some examples of these hyperparameters include the number of hidden units in a neural network, the regularisation term in support vector machines and the number of dimensions in principal components analysis. The combinations of feature extraction methods, discriminative machine learning models and corresponding hyperparameters of each forms a vast space to explore for the best result and strategy. Researchers in assistive technology in general do not possess the advanced machine learning domain knowledge necessary to intuitively explore the vast space of machine learning models and parameterizations. However, assistive technology is a domain that requires high accuracy and relatively small improvements in performance can translate to significant real world impact. Consider, for example, the difference between 95% and 99.5% accuracy for a classifier that detects falls in an older adult’s home from sensor data. Such a discrepancy in classification accuracy can be translated to lives saved or a reduction in false positive classifications large enough to make the classifier useful rather than irritating. Such improvements can be garnered through more appropriate combinations of feature extraction, discriminative learning and better hyperparameters. This short paper highlights the findings of (Snoek, Adams, and Larochelle 2012; Snoek, Larochelle, and Adams 2012) in the context of assistive technology. The paper does not demonstrate novel methods or empirical anal- As better healthcare worldwide is improving longevity and the baby boomer generation is aging, the proportion of elderly adults within the population is rapidly growing. Healthcare systems and governments are seeking new ways to alleviate the burden on society of caring for this aging population. Artificial intelligence has been shown to be a promising solution, as many of the simpler tasks that burden caregivers can be automated. This also suggests solutions for promoting independence and aging in place, because it alleviates the need for the constant presence of a caregiver in the home. The benefits of the application of machine learning to problems in assistive technology are becoming ever more clear. However, the application of machine learning to problems in assistive technology remains challenging. In particular, it is often unclear what machine learning model or approach is most appropriate for a given task. A common paradigm is to apply multiple standard machine learning tools in a black box manner and compare the results. This proceeds according to the following steps: 1. Collect data representative of the problem of interest. 2. Extract a set of features from these data. c 2012, Association for the Advancement of Artificial Copyright Intelligence (www.aaai.org). All rights reserved. 38 yses but rather seeks to stimulate discussion and demonstrate a promising solution to a significant practical problem in the application of machine learning to assistive technology. In this paper, we explore the use of an integrated and entirely automatic approach to perform feature extraction, classification and hyperparameter selection applied to a real-world rehabilitation research problem. In particular, we explore the use of an unsupervised machine learning model that is guided to learn a representation that is more appropriate for a given discriminative task. This nonparametrically guided autoencoder (Snoek, Adams, and Larochelle 2012) uses a neural network to learn a nonlinear encoding that captures the underlying structure of the input data while being constrained to maintain an encoding for which there exists a mapping to some discriminative label information. This model integrates the feature extraction and discriminative tasks and reduces the complexity induced by exploring various combinations of feature extraction algorithms and classifiers. The nonparametrically guided autoencoder (NPGA) requires a number of hyperparameters, such as the number of hidden units of the neural network, that are nontrivial to select. However, recent advances in machine learning have developed extremely effective methods for automatically optimizing such hyperparameters. A black-box optimization strategy known as Bayesian Optimization has recently been shown (Bergstra et al. 2011; Snoek, Larochelle, and Adams 2012) to be particularly well suited to this task, consistently finding better parameters, and doing so more efficiently, than machine learning experts. In this work it is shown that such a fully automated strategy arising from the combination of the NPGA and Bayesian Optimization can outperform a carefully hand tuned machine learning approach on a real-world rehabilitation problem (Taati et al. 2012). This is significant as it suggests that assistive technology researchers can achieve state-of-the-art results on their problems without requiring expert knowledge or tedious algorithm and parameter tuning. ture but also enforces that a mapping to label information exists. The training objective of the NPGA can be formulated as finding the optimal model parameters, φ? , ψ ? , Γ? , under the combined objective of the autoencoder, Lauto and Gaussian process, LGP , parameterized by hyperparameter α ∈ [0, 1], φ? , ψ ? , Γ? = arg min (1−α)Lauto (φ, ψ) + αLGP (φ, Γ). φ,ψ,Γ This objective can be further extended to incorporate the loss of a parametric multi-class logistic regression mapping, LLR , to the labels: L(φ, ψ, Λ, Γ ; α, β) = (1−α)Lauto (φ, ψ) + α((1−β)LLR (φ, Λ) + βLGP (φ, Γ)), where an additional hyperparameter, β, trades off the contribution of the nonparametric guidance afforded by the Gaussian process with the parametric classifier. This additional loss thus can direct the model to learn a representation that is harmonious with the actual logistic regression classifier that will be used at test time. The result is a model that is trained to extract features from some data such that they are explicitly enforced to encode structure that captures the major sources of variation in the data but also are well suited to the classifier that will be used at test time. The model hyperparameters permit one to flexibly interpolate between three common models, an autoencoder, a neural network for classification and a Gaussian process latent variable model. A caveat, however, is that one must search the space of hyperparameters to find the best formulation for a given problem. Fortunately, Bayesian Optimization can be used to efficiently find the best setting of the hyperparameters in a fully automated way. Nonparametrically Guided Autoencoder The NPGA is a semiparametric latent variable model. It leverages both parametric and nonparametric approaches to create a latent representation of input data that encodes the salient structure of the data while enforcing that more subtle discriminative structure is encoded as well. The parametric component consists of an autoencoder (Cottrell, Munro, and Zipser 1987), a neural network that is architecturally designed to learn an encoding of the data that captures the salient structure while discarding noise. This is achieved by creating a neural network that is trained to reconstruct the input data at its output while constraining the complexity of an internal coding layer. Often, however, the structure relevant to some discriminative task is not captured within the most prominent sources of variation. The innovation of the NPGA is to leverage a theoretical interpretation of Gaussian processes (Rasmussen and Williams 2006) as a kind of infinite neural network in order to augment the autoencoder with a nonparametric Gaussian process mapping to some additional label information. The autoencoder then learns a latent representation of the data that captures the salient struc- Bayesian Optimization Bayesian optimization (Mockus, Tiesis, and Zilinskas 1978) is a methodology for finding the extremum of noisy blackbox functions. Given some small number of observed inputs and corresponding outputs of a function of interest, Bayesian optimization iteratively suggests the next input to explore such that the optimum of the function is reached in as few function evaluations as possible. Provided that the function of interest is continuous and follows some loose assumptions, Bayesian optimization has been shown to converge to the optimum efficiently (Bull 2011). For an overview of Bayesian optimization and example applications see (Brochu, Cora, and de Freitas 2010). Recently, Bayesian optimization has been shown to be effective for optimizing the hyperparameters of machine learning algorithms (Bergstra et al. 2011; Snoek, Larochelle, and Adams 2012). In this work the Gaussian process expected improvement formulation of (Snoek, Larochelle, and Adams 2012) is used. 39 (a) The Robot (b) Using the Robot (c) Depth Image (d) Skeletal Joints Figure 1: The rehabilitation robot setup and sample data captured by the sensor. Empirical Analysis exercises using the robotic arm and it records their posture as a temporal sequence of seven estimated upper body skeletal joint angles (see Figures 1(c), 1(d) for an example depth image and corresponding pose skeleton captured by the system). A machine learning classifier is applied to the joint angles to classify between five different classes of posture. These include proper posture and four common cases of improper posture resulting from users’ natural tendency to compensate for limited agility. (Taati et al. 2012) collected a data set consisting of seven users each performing each class of action at least once, creating a total of 35 sequences (23,782 frames). They compare the use of a multiclass support vector machine (Tsochantaridis et al. 2004) and a hidden Markov support vector machine (Altun, Tsochantaridis, and Hofmann 2003) in a leave-one-subject-out test setting to distinguish these classes and report best per-frame classification accuracy rates of 80.0% and 85.9% respectively. The approach outlined in this work is empirically validated on a real-world application in assistive technology for rehabilitation. About 15 million people suffer stroke worldwide each year, according to the World Health Organization. Up to 65% of stroke survivors have difficulty using their upper limbs in daily activities and thus require rehabilitation therapy (Dobkin 2005). The frequency at which rehabilitation patients can perform rehabilitation exercises, a significant factor determining the rate of recovery, is often limited due to a shortage of rehabilitation therapists. This motivated the development of a robotic system to automate the role of a therapist providing guidance to patients performing repetitive upper limb rehabilitation exercises by (Kan et al. 2011), (Huq et al. 2011), (Lu et al. 2012) and (Taati et al. 2012). The system allows a user to perform upper limb reaching exercises with a robotic arm (see Figures 1(a), 1(b)) while it dynamically adjusts the amount of resistance to match the user’s ability level. The system can thus alleviate the burden on therapists and allow patients to perform exercises as frequently as desired, significantly expediting rehabilitation. The system’s effectiveness is critically dependent on its ability to discriminate between correct and incorrect posture and prompt the user accordingly. Currently, the system (Taati et al. 2012) uses a Microsoft Kinect sensor to observe a patient while they perform upper limb reaching In this work an NPGA is used to encode a latent embedding of postures that provides better discrimination between different posture types. The same data is used as in (Taati et al. 2012). The input to the model is the seven skeletal joint angles, i.e., Y = R7 , and the label space is over the five classes of posture. Unfortunately, the NPGA model requires setting a number of non-trivial hyperparameters. Rather than adjust these manually or perform a grid search, validation set error is optimized over the hy- 40 Model Bull, A. D. 2011. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research (34):2879–2904. Cottrell, G. W.; Munro, P.; and Zipser, D. 1987. Learning internal representations from gray-scale images: An example of extensional programming. In Conference of the Cognitive Science Society. Dobkin, B. H. 2005. Clinical practice, rehabilitation after stroke. New England Journal of Medicine 352:1677–1684. Huq, R.; Kan, P.; Goetschalckx, R.; Hbert, D.; Hoey, J.; and Mihailidis, A. 2011. A decision-theoretic approach in the design of an adaptive upper-limb stroke rehabilitation robot. In International Conference of Rehabilitation Robotics (ICORR). Kan, P.; Huq, R.; Hoey, J.; Goestschalckx, R.; and Mihailidis, A. 2011. The development of an adaptive upper-limb stroke rehabilitation robotic system. Neuroengineering and Rehabilitation. Lu, E.; Wang, R.; Huq, R.; Gardner, D.; Karam, P.; Zabjek, K.; Hbert, D.; Boger, J.; and Mihailidis, A. 2012. Development of a robotic device for upper limb stroke rehabilitation: A user-centered design approach. Journal of Behavioral Robotics. Mockus, J.; Tiesis, V.; and Zilinskas, A. 1978. The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2:117–129. Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press. Snoek, J.; Adams, R. P.; and Larochelle, H. 2012. On nonparametric guidance for learning autoencoder representations. In International Conference on Artificial Intelligence and Statistics. Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (To Appear). Taati, B.; Wang, R.; Huq, R.; Snoek, J.; and Mihailidis, A. 2012. Vision-based posture assessment to detect and categorize compensation during robotic rehabilitation therapy. In International Conference on Biomedical Robotics and Biomechatronics. Tsochantaridis, I.; Hofmann, T.; Joachims, T.; and Altun, Y. 2004. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning. Accuracy SVMM ulticlass (Taati et al. 2012) Hidden Markov SVM (Taati et al. 2012) `2 -Regularized Logistic Regression NPGA 80.0% 85.9% 86.1% 91.7% Table 1: Experimental results on the rehabilitation data. Perframe classification accuracies are provided for different classifiers on the test set. Bayesian optimization was performed on a validation set to select hyperparameters for the `2 -regularized logistic regression and NPGA algorithms. perparameters of the model using Bayesian optimization. The Gaussian process expected improvement algorithm was used to search over α ∈ [0, 1], β ∈ [0, 1], 10 − 1000 hidden units in the autoencoder and an additional GP latent dimensionality H ∈ [1, 10]. The best validation set error observed by the algorithm, on the twelfth of thirty-seven iterations, was at α = 0.8147, β = 0.3227, H = 3 and 242 hidden units. These settings correspond to a per-frame classification error rate of 91.70%, which is significantly higher than the 85.9% reported by the best method of (Taati et al. 2012). Results obtained using various models are presented in Table 1. Interestingly, it seems clear that the best region in hyperparameter space is a combination of all three objectives, the parametric logistic regression, nonparametric guidance and unsupervised autoencoder learning. This suggests that manually finding the best combination of feature extraction and classification algorithm would be challenging. The final product of the system is a simple neural network for classification, which is directly applicable to this real-time setting. Conclusion In this paper, a methodology was presented to combine feature extraction and classification into a single model and to optimize the model hyperparameters automatically. The approach was empirically validated on a real-world rehabilitation research problem, for which state-of-the-art results were achieved. The approach is very general, and as such can be applied to potentially many problems in the domain of assistive technology. This is valuable as the need for careful exploration of machine learning models and model parameters, which often requires significant domain knowledge, is obviated. References Altun, Y.; Tsochantaridis, I.; and Hofmann, T. 2003. Hidden markov support vector machines. In International Conference on Machine Learning. Bergstra, J. S.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. Brochu, E.; Cora, V. M.; and de Freitas, N. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. pre-print. arXiv:1012.2599. 41