Formal Verification and Modeling in Human-Machine Systems: Papers from the AAAI Spring Symposium

A Look at Probabilistic Gaussian Process, Bayes Net, and Classifier Models for Prediction and Verification of Human Supervisory Performance

Nisar Ahmed
University of Colorado Boulder, Aerospace Engineering Sciences, Boulder, CO 80309
Nisar.Ahmed@colorado.edu

Ewart de Visser
Perceptronics Solutions, Inc., Falls Church, VA 22042
edevisse@gmail.com

Tyler Shaw, Raja Parasuraman
George Mason University, Department of Psychology, Fairfax, VA 22030
(tshaw4,rparasur)@gmu.edu

Amira Mohammed-Ameen
University of Central Florida, Department of Psychology, Orlando, FL 32816
amoameen@knights.ucf.edu

Mark Campbell
Cornell University, Mechanical and Aerospace Engineering, Ithaca, NY 14853
mc288@cornell.edu

Abstract

as SPIN and NuSMV. Using goals/rules encoded into logic-based specifications along with models of the system/environment, correct-by-construction controllers are automatically generated for the system using controller synthesis methods coupled with model checkers (typically theorem proving and/or exhaustive search) (Kress-Gazit, Fainekos, and Pappas 2009). Recent advances in probabilistic model checkers (e.g. PRISM (Kwiatkowska, Norman, and Parker 2002)) have led to the concept of controller verification at a particular level of probability. In (Johnson and Kress-Gazit 2012; Johnson et al. 2012), controller software for an automated taxi was developed to minimize the probability of collision. In this case, probabilistic models of other car movements in the environment were mapped into the logic, and probabilistically verified controllers were automatically generated from external specifications. This approach accounts for uncertainty in a natural and intuitive way, since all controllers/decisions are probabilistic.
Such advances have led to the key question: how do we model, verify, and utilize humans as collaborative decision makers and actors in correct-by-construction autonomous systems? This paper proposes that a new suite of probabilistic models of human performance/capabilities could be integrated into the correct-by-construction verifiable framework. The potential benefits of such integration include: (i) guarantees on probability of performance based on rigorous statistical measures of model validity; (ii) the ability to ‘match’ verifiable automation software to individual human operators (e.g. given a model for how a person performs, the automation programming can be automatically generated to compensate for expected individual weaknesses); and (iii) speeding up of the end validation (acceptance) of autonomy software, due to inclusion of human models from the start.

Many ‘non-cognitive’ probabilistic models have been proposed as alternatives to well-known detailed cognitive computational models (e.g. ACT-R (Anderson, Matessa, and Lebiere 1997), EPIC (Kieras and Meyer 1997), Soar (Lehman et al. 1996) etc.) for predicting human-in-the-loop performance in networked unmanned vehicle applications.

This paper motivates why three classes of probabilistic models - Gaussian Process, Bayes Net, and Classifier - can be very useful in verifying collaborative interactions between humans and autonomy. These models have the ability to capture individual, average, and distributions of human capabilities (including supervisory tasking and communication) and human factors metrics such as working memory. We argue that these models can provide probabilistic information in a form that will enable a formal probabilistic approach to designing and fielding collaborative human-autonomous systems. In this paper, we will discuss key advantages and limitations of these models in the context of performance prediction, training/learning, and deployment.
Our goal is to initiate a discussion of how data-driven models such as these can be incorporated into formal verification frameworks for human-machine systems.

1 Introduction

The ultimate performance of autonomy is typically closely tied to human interaction. For example, UAVs are operated by humans, and the data coming from their sensors are analyzed by humans. However, a recent Defense Science Board (DSB) study on Autonomy (Murphy and Shields 2012) made clear that the benefits of autonomous systems for the military have been slow to materialize; a key point is that humans and autonomy must be collaborative. Likewise, the DoD Autonomy Priority Steering Council (Junker 2011) selected four key technical challenge areas/gaps that must be addressed: 1) Human-Autonomy Interaction and Collaboration, 2) Scalability, 3) Machine Intelligence, and 4) Verification and Validation. Formal methods for verification and validation have been developed in recent years in an effort to evaluate requirements during controller construction (Manna and Pnueli 1995). Autonomy has borrowed/expanded on these concepts, starting with formal model checking tools such

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

male), who were trained to act as solo UAV supervisors through Aptima’s DDD 4.0® distributed client server. The software provided advisory messages from a simulated automated teammate; these messages pertained to the defense task and varied in their usefulness to the task being performed by the supervisor at any given time. Subjects could command 8 friendly UAV assets located inside a ‘Red Zone’, which neutral and enemy UAVs approached from different directions.
During each simulated run, operators were tasked with: (i) using friendly UAVs to attack and prevent enemies from entering the Red Zone; (ii) protecting their own UAVs from damage/destruction via friendly fire or enemy attack; and (iii) sending messages to warn of enemy UAV intrusions into the simulated teammate’s zone. Participants completed simulation runs in a randomized 2 x 3 factorial experimental design, with two TL and three MQ levels. The MQ levels were: (1) ‘relevant messages’, i.e. all messages from the teammate provided useful engagement information (e.g. ‘2 B-type enemy assets approaching from top of screen’); (2) ‘noisy messages’, i.e. 80% of messages were irrelevant (e.g. ‘Neutral assets in battle space’); and (3) no messages at all. The TL levels varied the number of enemies actually attempting to enter the Red Zone over each 7.5 min simulation: ‘low’ = 31 targets and ‘high’ = 47 targets. Prior to the experiments, 22 participants also completed a version of the Operation Span (OSPAN), a standard, well-validated measure of WM (Engle 2002), and all WM scores were normalized against the maximum possible score of 25. Four performance metrics were assessed: red zone performance (RZP), time to destroy enemy target (DT), enemy destroyed performance (EDP), and attack efficiency (AE),

In (Fan, Yen, and Chen 2010) and (Heger and Singh 2006), for instance, human operators are modelled dynamically via probabilistic Markov models in order to capture random transitions between abstract discrete states that influence decision-making and task performance. In (Donmez, Cummings, and Nehme 2010), discrete-event task simulations with probability distributions on operator servicing times are used to explicitly model the performance effects of changing workload and vehicle utilization in a multi-UAV supervisory task.
Both cognitive and non-cognitive dynamic probabilistic human-operator models can generate sample-based performance prediction statistics via repeated random simulations of closed-loop task execution, and as such can provide useful insight into specific scenarios that lead to good/bad operator performance. However, such dynamic probabilistic models require a high level of detail and much training data to explicitly account for the effects of various task/network-related factors (e.g. number of agents, task load). These models also do not explicitly account for individual factors, e.g. differences in working memory capacity.

We present and discuss three other classes of probabilistic models that help mitigate these issues and can thus be potentially useful for prediction and verification of human operator performance in human-machine systems, either in the sense of performing detailed analyses related to dynamical process simulations or performing gross ‘high-level’ analyses of human-machine system performance that abstract away certain details. Two of these models, Gaussian Process (GP) regression and Bayesian networks (BN), can enable direct ‘function-like’ performance metric predictions without requiring simulations or an explicit model of the operator’s decision-making processes (Ahmed et al. 2013). Any expected variability arising from differences in these and other unmodeled factors related to task dynamics is described by the estimated probabilities associated with each prediction. The remaining model class, probabilistic discriminative classification models, can be used to capture stochastic non-Markovian state-dependent switching behaviors for discrete supervisory decision making by human operators in detailed process models (Bourgault et al. 2007), (Ahmed and Campbell 2008), (Ahmed and Campbell 2011), which is currently not realizable with the probabilistic Markov or discrete event models mentioned above.
\[
\begin{aligned}
\mathrm{RZP} &= 1 - \frac{\#\ \text{enemies successfully entering Red Zone}}{\text{total number of enemies}}, \\
\mathrm{DT} &= \text{average time taken to destroy each enemy (secs)}, \\
\mathrm{EDP} &= \frac{\#\ \text{enemies destroyed}}{\text{total}\ \#\ \text{of enemies}}, \\
\mathrm{AE} &= \frac{\#\ \text{enemies destroyed}}{\text{total}\ \#\ \text{of times enemies engaged}},
\end{aligned}
\]

where RZP, EDP, and AE are ‘accuracy’ indices in the range 0 to 1 (perfect performance), and DT is a latency measure (lower means better performance). The goals of successfully protecting the Red Zone and destroying as many enemies as quickly as possible are often competing supervisory demands, since frequent switches must be made to handle enemies both near and far from the Red Zone. Coupled with the tasks of handling network messages and protecting friendly UAVs, the operator’s cognitive load can build up quickly and lead to diminished supervisory performance. Initial modeling results found that all four operator performance metrics could be reasonably well predicted on the basis of TL, MQ and WM using standard parametric linear regression models learned from the experimental data (de Visser et al. 2010), (Ahmed et al. 2013). Such models could be used by human-automation system designers to help specify system requirements or develop supervisory decision aids to support various system operating regimes on the basis of different operator cognitive traits such as WM. Importantly, such models can replace or complement detailed human-in-the-loop simulations to determine expected

2 GP and BN Statistical Performance Models

2.1 Networked UAV Supervision Application

Refs. (de Visser et al. 2010) and (Ahmed et al. 2013) studied how human operator supervisory performance metrics for a simulated networked UAV air superiority scenario could be modeled from combined knowledge of task load (TL), quality of inter-network communications (i.e.
message quality, MQ), and human operator working memory capacity (WM), which plays an important role in executive control processes underlying effective multi-tasking and decision-making capabilities (Parasuraman 2010), (Endsley 1995). Performance for a single human/multi-UAV team system was examined under various operating conditions for 30 human participants (compensated university students, 18 fe-

performance outcomes for different operators under various human-machine operating conditions. However, linear parametric models are generally inappropriate for other important tasks: (i) extended performance prediction for conditions that are not sampled in data (which is often quite limited since the full space of task conditions can only be partially explored offline); and (ii) ‘inverse reasoning’ of unknown tasking conditions and/or operator capabilities from observed performance metric values. Furthermore, although linear models provide a reasonable starting point, they are not able to capture complex nonlinear relationships between performance metrics and operating/operator conditions. These issues can be addressed by nonparametric GP regression models and probabilistic BN models, both of which generalize linear regression models and can provide more insight on relationships between performance metrics and system operating factors.

squared exponential form,

\[
k(x_i, x_j) = \theta_0 \exp\left\{ -\sum_{d=1}^{D} \frac{1}{2\theta_d} \left(x_{d,i} - x_{d,j}\right)^2 \right\},
\]

where the $x_{d,i}$ and $x_{d,j}$ terms are the $D$ individual entries of operating points $x_i$ and $x_j$ (i.e. for the application described here, $D = 3$ for TL, MQ, and WM), and all $\theta$ terms are free parameters that affect shape and scale sensitivity along each component of $x_i$ and $x_j$. Note that, for large $\theta_d$ terms, the corresponding factors of $x$ (TL, MQ, WM) become less important for predicting $y^*$. Optimal kernel parameters (e.g.
as determined by maximum likelihood) can therefore offer valuable insight into the relative importance of dependencies between TL, MQ, WM and the performance metrics.

The GP model has several useful characteristics for predictive performance modeling and verification. Firstly, due to the Gaussian probabilistic nature of the model, $m(x^*, x_{1:N})$ gives the most likely value of $y^*$ for a given $x^*$ and $\sigma^2(x^*, x_{1:N})$ gives the degree of uncertainty in this estimate. Secondly, the kernel parameterization lends three useful properties. It can be mathematically shown that the GP model predicts via a ‘nearest neighbor’ strategy: the more similar $x^*$ is to $x_i$, the more probable it becomes that $y^*$ will be similar to $y_i$. Likewise, the predicted variance $\sigma^2(x^*, x_{1:N})$ grows as $x^*$ becomes more dissimilar to the original data $x_{1:N}$, but adjusts to the variance of $y_{1:N}$ otherwise. Hence, the GP model has the appealing ability to adjust its prediction statistics on the basis of how similar $x^*$ is to the original modeling data, such that unfamiliar $x^*$ values automatically lead to more uncertain predictions. This is in contrast to a standard linear regression model, which maintains a constant level of prediction uncertainty for all possible $x$ values. It can also be shown that GP regression implicitly considers an infinite number of parametric function candidates at once during the modeling process (where the space of parametric functions depends on the covariance kernel) (Bishop 2006). This circumvents the usual need to explicitly restrict the regression process to a limited (and likely incomplete) family of candidate models to find a best fit for the data.

Figure 1 shows two example operator performance metric prediction results that compare GP and standard linear regression predictions of RZP as a function of TL, MQ and WM values taken from the experimental data.
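The GP predictive mean and variance discussed here can be illustrated with a short numerical sketch. The Python code below is a minimal example and not the implementation used in (Ahmed et al. 2013): the function names `sq_exp_kernel` and `gp_predict`, the zero prior mean, the additive observation noise term, the kernel parameter values, and the toy data are all assumptions made for illustration. It uses a squared exponential kernel with one scale parameter per input dimension and shows that the predicted variance grows as the query point moves away from the sampled operating points.

```python
import numpy as np

def sq_exp_kernel(A, B, theta0, theta):
    """Squared exponential kernel between rows of A (n x D) and B (m x D).

    theta0 scales the kernel; theta holds one scale parameter per input dimension.
    """
    diff = A[:, None, :] - B[None, :, :]              # shape (n, m, D)
    return theta0 * np.exp(-0.5 * np.sum(diff ** 2 / theta, axis=-1))

def gp_predict(x_star, X, y, theta0=1.0, theta=(0.1, 0.1, 0.1), noise=1e-2):
    """Return the GP predictive mean and variance at a single point x_star.

    Assumes a zero prior mean function and i.i.d. Gaussian observation noise
    (both are illustrative assumptions, not taken from the paper).
    """
    theta = np.asarray(theta, dtype=float)
    C = sq_exp_kernel(X, X, theta0, theta) + noise * np.eye(len(X))
    k = sq_exp_kernel(X, x_star[None, :], theta0, theta)[:, 0]
    c = theta0 + noise                                # prior variance at x_star
    mean = k @ np.linalg.solve(C, y)                  # k^T C^{-1} y
    var = c - k @ np.linalg.solve(C, k)               # c - k^T C^{-1} k
    return mean, var

# Hypothetical operating points x = (TL, MQ, WM) and RZP values (made up).
X = np.array([[0.155, 1.0, 0.2],
              [0.155, 1.0, 0.5],
              [0.235, 0.0, 0.8]])
y = np.array([0.60, 0.70, 0.50])

near = np.array([0.155, 1.0, 0.3])   # close to sampled points
far = np.array([0.155, 1.0, 5.0])    # deliberately far from all sampled points
for name, x_star in [("near", near), ("far", far)]:
    m, v = gp_predict(x_star, X, y)
    print(f"{name}: mean={m:.3f}, variance={v:.3f}")
```

For the far query point the variance returns to the prior value `theta0 + noise` and the mean returns to the (zero) prior mean, which is the ‘agnostic’ behavior the text attributes to the GP.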
Both models show good general agreement in predicting downward shifts in performance as task load increases and upward shifts in performance as working memory capacity increases. However, the uncertainty bounds from the standard deviations of both models lead to significantly different statistical performance predictions under certain task load and working memory conditions. Unlike the linear model, as $x^*$ gets further from $x_{1:N}$, the GP becomes more agnostic and hence conservative about performance predictions. GP models allow for ‘cognitively aware’ performance predictions on the basis of realistically obtainable sets of experimental data. This could be useful in many ways. First, one could define a specification on the performance of the combined human-autonomy system that is now in the form of a distribution. Second, formal, probabilistic verification

2.2 GP regression models for robust prediction

Nonparametric GP models can provide extended statistical predictions for expected values and standard deviations of performance metrics as a function of operator characteristics and operating conditions. Through the use of special covariance kernel functions, GP predictions can also automatically adapt to uncertainties arising from non-linearities and sample sparseness in the experimental modeling data. These features allow a data set to ‘speak for itself’ as predictions are made, which has made GPs standard tools for complex regression problems involving noisy, sparsely sampled processes (Bishop 2006). Formally, let $x = (\mathrm{TL}, \mathrm{MQ}, \mathrm{WM})$ denote a human-machine system operating condition and let $y$ be the value of some performance metric $Y$ (i.e. RZP, AE, etc.). If $N$ is the number of distinct experimental data points available for modeling, let $x_{1:N} = \{x_1, ..., x_N\}$ and $y_{1:N} = \{y_1, ..., y_N\}$ be, respectively, the set of all experimentally sampled operating points and the corresponding set of observed performance metric values.
For some new unsampled operating point $x^*$, the GP predicts that the corresponding metric value $Y^* = y^*$ has the conditional Gaussian probability distribution function with mean $m(x^*, x_{1:N})$ and variance $\sigma^2(x^*, x_{1:N})$,

\[
\begin{aligned}
p(Y^* = y^* \mid x^*, x_{1:N}) &= \mathcal{N}\left(y^*;\, m(x^*, x_{1:N}),\, \sigma^2(x^*, x_{1:N})\right) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma(x^*, x_{1:N})} \exp\left\{ -\frac{1}{2\sigma^2(x^*, x_{1:N})} \left(y^* - m(x^*, x_{1:N})\right)^2 \right\}, \\
m(x^*, x_{1:N}) &= f(x^*) + k^T C^{-1} y_{1:N}, \\
\sigma^2(x^*, x_{1:N}) &= c - k^T C^{-1} k.
\end{aligned}
\]

Here, $c$ is a positive scalar term; $k$ is an $N \times 1$ vector of terms describing the similarity between $x^*$ and members of $x_{1:N}$; $C$ is a covariance matrix that describes the similarity of all members of $x_{1:N}$ to each other; and $f(x)$ is a user-defined mean function that models a priori beliefs about the relationship between $x$ and $y$. Although not detailed here, $k$, $c$ and $C$ are all directly defined by a user-selected covariance kernel function $k(x_i, x_j)$ that models expected similarities between two points $x_i$ and $x_j$. A common choice for $k(x_i, x_j)$ is the

[Figure 1 plot residue: panels titled ‘TL = 0.155, MQ = 1’ and ‘TL = 0.235, MQ = 0’; vertical axis RZP]

mated via Bayesian inference, which ‘inverts’ relationships described by a joint probability distribution over the unknown variables of interest and observed performance measures. A probabilistic BN model can be used to encode this distribution explicitly as a directed acyclic graph (DAG), consisting of nodes (random variables) and edges that specify local factorized probabilistic conditional dependencies. Figure 2 shows a BN model developed in (Ahmed et al. 2013) from the experimental performance data using model estimation techniques based on discretization of TL, MQ, WM and performance metric values. The model parameters for the joint distribution are given by conditional probability tables (CPTs) describing the ‘causal’ parent-child dependencies between the nodes in the graph.
These CPTs efficiently encode the full joint distribution through the factorization

\[
p(\mathrm{TL}, \mathrm{MQ}, \mathrm{WM}, \mathrm{AE}, \mathrm{EDP}, \mathrm{DT}, \mathrm{RZP}, H) = p(\mathrm{TL})\,p(\mathrm{MQ})\,p(\mathrm{WM})\,p(H \mid \mathrm{TL}, \mathrm{MQ}, \mathrm{WM})\,p(\mathrm{EDP} \mid \mathrm{TL}) \times p(\mathrm{AE} \mid H)\,p(\mathrm{RZP} \mid H)\,p(\mathrm{DT} \mid H), \tag{1}
\]

which permits the use of efficient message-passing algorithms for Bayesian inference. The binary latent variable H ‘summarizes’ information flowing from TL, MQ and WM to the performance metrics, and thereby helps to simplify the overall joint probability model by reducing the overall number of free CPT parameters to be estimated. The summarizing effect of H is reflected by the strong dependence of these performance metric CPTs on H, which can be interpreted as a variable that predictively ‘clusters’ each human subject’s overall task competency (for given values of TL, MQ, and WM) into ‘below’ and ‘at/above average’ levels for H=1 and H=2 values, respectively. Note that the BN model for this problem is not unique: over $1 \times 10^7$ different DAGs could be used to represent the joint distribution, e.g. by introducing extra links from TL, MQ and WM to the performance variables, by breaking/reversing the order of individual conditional dependencies, etc. The structure shown in Fig. 2 was selected from a smaller subset of DAGs that reflect commonsense ‘causality constraints’ (i.e. TL, MQ, and WM do not influence each other, and are not influenced by performance metrics) and permitted use of reasonably sized CPTs that could be estimated from data. Complete details of the modeling process are given in (Ahmed et al. 2013).

Figure 3 (a) and (b) show how the unknown (discretized) TL, MQ and WM values can be inferred from observed (discretized) experimental performance metric data. Shown are the marginal Bayesian posterior probabilities for the unknown variables at different fixed values of RZP, EDP, AE, and DT. These probabilities can be used to estimate the unknown variables along with confidence assessments.
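The inverse reasoning step can be sketched directly from the factorization in Eq. (1). The Python code below is a minimal illustration that computes posterior marginals for TL, MQ, and WM by brute-force enumeration instead of message passing. Every number in it is a hypothetical placeholder and not a CPT estimated in (Ahmed et al. 2013); the names `p_H_given` and `posterior_given_metrics`, and the choice of two TL levels and three levels for all other discretized variables, are likewise assumptions for illustration.

```python
import itertools
import numpy as np

# Hypothetical priors (index 0 = Low, last index = High).
p_TL = np.array([0.5, 0.5])              # TL in {Low, High}
p_MQ = np.array([1 / 3, 1 / 3, 1 / 3])   # MQ in {Low, Med, High}
p_WM = np.array([1 / 3, 1 / 3, 1 / 3])   # WM in {Low, Med, High}

def p_H_given(tl, mq, wm):
    """Hypothetical p(H | TL, MQ, WM); index 0 = 'below average', 1 = 'at/above average'."""
    good = 0.35 - 0.25 * tl + 0.15 * mq + 0.15 * wm   # stays within [0.10, 0.95]
    return np.array([1.0 - good, good])

# Hypothetical CPTs; rows index the parent, columns index {Low, Med, High}.
p_EDP_given_TL = np.array([[0.1, 0.3, 0.6], [0.5, 0.3, 0.2]])
p_AE_given_H = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
p_RZP_given_H = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
p_DT_given_H = np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])   # low DT is good

def posterior_given_metrics(edp, ae, rzp, dt):
    """Posterior marginals of TL, MQ, WM given observed discretized metrics."""
    joint = np.zeros((2, 3, 3))
    for tl, mq, wm in itertools.product(range(2), range(3), range(3)):
        # Sum out the latent variable H for this (TL, MQ, WM) configuration.
        h_term = (p_H_given(tl, mq, wm) * p_AE_given_H[:, ae]
                  * p_RZP_given_H[:, rzp] * p_DT_given_H[:, dt]).sum()
        joint[tl, mq, wm] = (p_TL[tl] * p_MQ[mq] * p_WM[wm]
                             * p_EDP_given_TL[tl, edp] * h_term)
    joint /= joint.sum()   # normalize by the probability of the evidence
    return joint.sum(axis=(1, 2)), joint.sum(axis=(0, 2)), joint.sum(axis=(0, 1))

# Example query: EDP = High, AE = High, RZP = High, DT = Low.
post_TL, post_MQ, post_WM = posterior_given_metrics(edp=2, ae=2, rzp=2, dt=0)
print("P(TL | metrics):", np.round(post_TL, 3))
print("P(MQ | metrics):", np.round(post_MQ, 3))
print("P(WM | metrics):", np.round(post_WM, 3))
```

Enumeration is adequate here because the discretized state space is tiny; for larger networks, the message-passing algorithms mentioned in the text would be the natural replacement.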
For instance, in scenario (a), the true recorded values for the unknown variables all correspond to the maximum a posteriori (MAP) estimates derived from the marginal posterior probabilities, which show overwhelming confidence in the value of TL but considerable ambiguity for MQ and WM. These probabilities can be used for threshold-based decision making or hedging against incorrect estimates of operator/operating conditions. In scenario (b), for instance, an autonomous decision aid would be able to account for the fact that MQ is almost as likely to be in a good

[Figure 1 panels (a) and (b): horizontal axis WM; legend: GP mean, GP 1σ, GP 2σ, LR mean, LR 1σ, LR 2σ, raw data]

Figure 1: Predicted RZP mean with 1σ/2σ bounds for learned GP and linear regression (LR) models as function of TL, MQ, and WM. Predictions for new WM values at fixed TL and MQ ‘slices’ are shown by lines/shaded regions; experimental data are shown as blue markers. TL gives fraction of 200 enemies attacking Red Zone; MQ gives fraction of relevant messages.

can now be attempted with models of the combined human-autonomy system. Third, the models can be used to infer/analyze specifications (on humans, autonomy, or the combination) on-line in order to ensure that the specifications are being met, or assumptions are not violated. Fourth, data driven models have the ability to be updated as additional data is collected; this learning ability is important as the complexity of the interaction increases. Importantly, this is potentially the first step towards the concept of automatically generating correct-by-construction controllers that ‘match’ the strengths and weaknesses of the human.

Still, GP class models have challenges. Firstly, the model will only capture a small portion of the human and/or interaction.
What models of human capabilities can be more easily integrated into a verifiable framework? What if this model is incomplete? What about situations that are not easily modeled, such as emergency situations, or difficult-to-model (e.g. uncooperative) humans? It is important to understand exactly what can be done with GP models in such cases and whether existing learning techniques are sufficient to address these issues. Secondly, should specifications be defined based on a ‘typical’ human, a distribution of humans, or the worst case human? What if a system falls outside of the specified bounds, either in design or in operation? The integration into formal verification methods in a ‘guaranteed statistical’ sense is still an open question, since the assumption of second order sufficient statistics (mean, covariance) through a GP may not fully reflect true performance uncertainty in all cases. Thirdly, how can these models adapt/learn on-line? Data will be collected as new experiences occur, and thus, models (and controllers) should improve quickly on-line. Currently, computational cost is a weakness of GP class models (as well as correct-by-construction controllers) that still needs to be addressed.

2.3 BN models and inverse reasoning

In many situations, variables like TL, MQ or WM may be required by autonomous decision aids and analysis tools, but are unknown and difficult to measure.
These can be esti-

Figure 2: BN graph model and estimated conditional probability tables for hidden 2-state structural variable H and performance variables. Directed arrows denote conditional dependency of child node on parent nodes in overall joint probability distribution.

state of ‘high’ quality as it is to be in a bad state of ‘low’ quality when deciding how best it can help the human operator. BN models of human performance would therefore serve a similar purpose as the hybrid stochastic dynamic vehicle estimation models in (Johnson and Kress-Gazit 2012), where models of likely sensor errors were used to probabilistically verify correct-by-construction controllers for autonomous driving, and thus enable probabilistic guarantees on high-level behavior.

[Figure 3 panels: (a) ‘Discretized Inference Results for EDP = High, AE = High, RZP = High, DT = Low’; (b) ‘Discretized Inference Results for EDP = Med, AE = Low, RZP = Low, DT = Med’. Each panel plots the posterior probabilities P(TL|EDP,AE,RZP,DT), P(MQ|EDP,AE,RZP,DT), and P(WM|EDP,AE,RZP,DT) against the discretized TL (Low, High), MQ (Low, Med, High), and WM (Low, Med, High) values.]

Figure 3: Posterior probabilities for unknown discretized TL, MQ, and WM values following Bayesian inference with observed operator performance metrics, using BN shown in Figure 2 on real data for two different experimental scenarios (a) and (b).
Values marked with ∗ in abscissa indicate true value of unknown variable; most probable and true values coincide in both scenarios, with exception of MQ in scenario (b).

predicting the discrete action ‘category’ $D_k$ selected by the operator given a set of information ‘features’ $X_k$ (i.e. the observables from the robot, ‘Range to target’ and ‘Robot’s confidence that target is victim’, in this case). The simulated data shown in the right-hand side of Fig. 4 show how such decisions might stochastically depend on $X_k$ (temporally or contextually correlated features are not shown here for simplicity, but could be included more generally). Ref. (Bourgault et al. 2007) explored how generative density estimation methods could be used to build compact probabilistic models of state-dependent human supervisory discrete task switching and continuous waypoint commanding behaviors for multi-robot applications, using real experimental data collected from 16 human operators on the RoboFlag simulation testbed (Shah et al. 2009), (D’Andrea and Babish 2003). The generative modeling approach models the statistics of $X_k$ first as a function of $D_k$ and then applies Bayes’ rule to derive the desired probabilistic model of $D_k$ as a posterior Bayesian distribution. For the example in Fig. 4, this means that the data would first be used to estimate the class-conditional probability density functions $p(X_k \mid D_k = i)$ and a prior decision probability $p(D_k = i)$ (independent of $X_k$). The desired generative posterior distribution would then be formed via Bayes’ rule as

\[
p(D_k = i \mid X_k) = \frac{1}{c(X_k)}\, p(D_k = i)\, p(X_k \mid D_k = i), \qquad c(X_k) = \sum_j p(D_k = j)\, p(X_k \mid D_k = j).
\]

The main drawback of this approach is the inherent difficulty of the density estimation problem when $X_k$ has more than 4 or 5 dimensions, as very large numbers of experimental data points are then required to obtain reliable estimates of $p(X_k \mid D_k = i)$.
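The generative recipe above (estimate class-conditional densities and priors, then apply Bayes’ rule) can be sketched in a few lines. The Python example below is illustrative only: it assumes a single Gaussian class-conditional density per decision, which is a simplification and not necessarily the density model used in (Bourgault et al. 2007), and the simulated data, class centers, and function names (`fit_generative`, `gaussian_pdf`, `decision_posterior`) are hypothetical.

```python
import numpy as np

def fit_generative(X, D, n_classes):
    """Estimate priors p(D=i) and Gaussian parameters of p(X | D=i) from labeled data."""
    priors, means, covs = [], [], []
    for i in range(n_classes):
        Xi = X[D == i]
        priors.append(len(Xi) / len(X))
        means.append(Xi.mean(axis=0))
        # Small diagonal term keeps the covariance estimate well conditioned.
        covs.append(np.cov(Xi, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
    return np.array(priors), np.array(means), np.array(covs)

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at x."""
    d = x - mean
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def decision_posterior(x, priors, means, covs):
    """Bayes' rule: p(D=i | x) = p(D=i) p(x | D=i) / c(x)."""
    unnorm = np.array([priors[i] * gaussian_pdf(x, means[i], covs[i])
                       for i in range(len(priors))])
    return unnorm / unnorm.sum()   # the sum is the normalizer c(x)

# Simulated features X_k = [range to target (m), robot confidence (%)] (made up).
# Classes: 0 = ignore object and move on, 1 = get more data, 2 = move in to assist.
rng = np.random.default_rng(0)
centers = np.array([[8.0, 20.0], [5.0, 50.0], [2.0, 85.0]])
X_sim = np.vstack([c + rng.normal(size=(200, 2)) * np.array([1.5, 10.0])
                   for c in centers])
D_sim = np.repeat(np.arange(3), 200)

priors, means, covs = fit_generative(X_sim, D_sim, n_classes=3)
print(np.round(decision_posterior(np.array([2.0, 90.0]), priors, means, covs), 3))
```

With only two features the density estimates are easy to obtain; the data requirements noted in the text arise because the number of samples needed for reliable density estimates grows quickly with the dimension of the feature vector.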
3 Detailed Probabilistic Operator Modeling

In the GP and BN approaches outlined above, any variability arising from unmodeled factors related to detailed task dynamics is subsumed by estimated probabilities associated with each prediction. However, detailed dynamical models of human operator behavior can yield more refined and insightful performance predictions, especially if individual operator cognitive traits are expected to be widely different or if multiple operator ‘failure points’ exist. Existing detailed dynamic operator behavior models tend to be developed either from an economical data-driven ‘black box’ high-level perspective (which abstracts away the influence of mixed continuous/discrete dynamic variables on operator decision making) (Fan, Yen, and Chen 2010), (Heger and Singh 2006), (Donmez, Cummings, and Nehme 2010), or from a detailed first principles ‘white box’ computational cognitive modeling perspective (which attempts to simulate all relevant neurophysical aspects of human-in-the-loop behavior) (Anderson, Matessa, and Lebiere 1997), (Kieras and Meyer 1997), (Lehman et al. 1996). Alternatively, a ‘gray box’ probabilistic approach could be used to model high-/low-level operator behaviors as discrete/continuous random variables that are conditionally dependent on external variables that are known to influence human-machine system behavior. To motivate this idea, Figure 4 shows a hypothetical robot-assisted search and rescue scenario, in which a human operator must decide which of three possible autonomous behaviors a ground robot should execute at a given instant on the basis of available information. The modeling problem here can be cast as one of

[Figure 4 panel title: Noisy state-dependent “mode switching” boundaries for human supervisor]

References

Ahmed, N., and Campbell, M. 2008. Multimodal operator decision models. In American Control Conference 2008.
Ahmed, N., and Campbell, M. 2011.
Variational Bayesian learning of probabilistic discriminative models with latent softmax variables. IEEE Trans. on Sig. Proc. 59(7):3143–3154.
Ahmed, N.; de Visser, E.; Shaw, T.; Mohamed-Ameen, A.; Campbell, M.; and Parasuraman, R. 2013. Statistical modeling of networked human-automation performance using working memory capacity. Ergonomics in press.
Anderson, J. R.; Matessa, M.; and Lebiere, C. 1997. ACT-R: A theory of higher level cognition and its relation to visual attention. Human-Computer Interaction 12(4):439–462.
Bishop, C. 2006. Pattern Recognition and Machine Learning. New York: Springer.
Bourgault, F.; Ahmed, N.; Shah, D.; and Campbell, M. 2007. Probabilistic operator-multiple robot modeling using bayesian network representation. Proc. AIAA Conference on Guidance, Navigation, and Control, Hilton Head, SC.
D’Andrea, R., and Babish, M. 2003. The RoboFlag testbed. In American Control Conference, 2003. Proceedings of the 2003, volume 1, 656–660. IEEE.
de Visser, E.; Shaw, T.; Mohamed-Ameen, A.; and Parasuraman, R. 2010. Modeling human-automation team performance in networked systems: Individual differences in working memory count. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 54, 1087–1091. SAGE Publications.
Donmez, B.; Cummings, M.; and Nehme, C. 2010. Modeling Workload Impact in Multiple Unmanned Vehicle Supervisory Control. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 40(6).
Endsley, M. R. 1995. Toward a theory of situation awareness in dynamic systems. Human Factors: The Journal of the Human Factors and Ergonomics Society 37(1):32–64.
Engle, R. W. 2002. Working memory capacity as executive attention. Current directions in psychological science 11(1):19–23.
Fan, X.; Yen, J.; and Chen, P.-C. 2010. Learning HMM-based cognitive load models for supporting human-agent teamwork. Cognitive Systems Research 11:108–119.
Heger, F., and Singh, S. 2006.
Sliding Autonomy for Complex Coordinated Multi-Robot Tasks: Analysis & Experiments. In Robotics Science and Systems.
Johnson, B., and Kress-Gazit, H. 2012. Probabilistic guarantees for high-level robot behavior in the presence of sensor error. Autonomous Robots 33(3):309–321.
Johnson, B.; Havlak, F.; Campbell, M.; and Kress-Gazit, H. 2012. Execution and analysis of high-level tasks with dynamic obstacle anticipation. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, 330–337. IEEE.
Juloski, A.; Heemels, W.; Ferrari-Trecate, G.; Vidal, R.; Paoletti, S.; and Niessen, J. 2005. Comparison of Four Procedures for the Identification of Hybrid Systems. In L.Thiele,

[Figure 4 labels: Xk = [Range to target; Robot’s confidence that target is victim]; decision classes Dk = ignore object and move on, Dk = get more data on object, Dk = move in to assist victim; axes: Range to target (m), Robot’s confidence that target is victim (%)]

Figure 4: Conceptual example of a discrete set of human supervisory control decisions for robot-assisted search and rescue modeled by noisy probabilistic switching as function of operator-observable system state (colors represent simulated data).

Refs. (Ahmed and Campbell 2008) and (Ahmed and Campbell 2011) showed that discriminative probabilistic classifier models could instead be used to directly learn the ‘noisy boundaries’ between Dk clusters for high-dimensional Xk via fully Bayesian learning techniques that also provide estimates of uncertainty in the boundary parameter estimates (which are especially useful for learning with sparse data). Note that this approach bears resemblance to techniques for identifying discrete modal switching boundaries in nonlinear hybrid dynamical systems (Juloski et al. 2005).

4 Conclusions and Open Directions

This paper motivates why data driven, probabilistic models of human capabilities would be useful in the design and verification of collaborative human-autonomy systems.
Three classes of probabilistic models were presented - Gaussian Process, Bayes Net, and Classifier - along with examples showing their ability to capture some element of human behavior. These models have the ability to capture individual, average, and distributions of human capabilities (including supervisory tasking and communication) as well as human factors metrics such as working memory. These models have many potential uses. Distributions over specifications could be defined for the combined human-autonomy system that would enable formal checking and on-line evaluation of assumptions/specifications. Guarantees on probability of performance (or other metrics) may be realized based on rigorous statistical measures of model validity. Off-line data collection and models of human capabilities could improve eventual implementation of the collaborative human-autonomy system. And this work could be the start of a path towards the ‘holy grail’: generation of probabilistically correct-by-construction controllers that ‘match’ the human capabilities (e.g. given a model for how a person performs, the automation programming can be automatically generated to compensate for expected individual weaknesses). Challenges in these models still must be addressed, including understanding what to model, what if models are incomplete, how to integrate into the formal verification framework, and how models can learn on-line. 7 M. M., ed., Hybrid Systems: Computation and Control, volume LNCS 3414, 354–369. Springer-Verlag. Junker, B. 2011. Autonomy roadmap. Technical report, DoD Priority Steering Council. Kieras, D. E., and Meyer, D. E. 1997. An overview of the EPIC architecture for cognition and performance with application to human-computer interaction. Human-computer interaction 12(4):391–438. Kress-Gazit, H.; Fainekos, G. E.; and Pappas, G. J. 2009. Temporal-logic-based reactive mission and motion planning. Robotics, IEEE Transactions on 25(6):1370–1381. 
Kwiatkowska, M.; Norman, G.; and Parker, D. 2002. PRISM: Probabilistic symbolic model checker. In Computer Performance Evaluation: Modelling Techniques and Tools. Springer. 200–204.
Lehman, J. F.; Laird, J. E.; Rosenbloom, P.; et al. 1996. A gentle introduction to Soar, an architecture for human cognition. Invitation to Cognitive Science 4:212–249.
Manna, Z., and Pnueli, A. 1995. Temporal verification of reactive systems: safety. Springer.
Murphy, R., and Shields, J. 2012. The role of autonomy in DoD systems. Technical report, DoD Defense Sciences Study Board.
Parasuraman, R. 2010. Neurogenetics of working memory and decision making under time pressure. In Applied Human Factors and Ergonomics. Taylor and Francis.
Shah, D.; Campbell, M.; Bourgault, F.; Ahmed, N.; Galster, S.; and Knott, B. 2009. An empirical study of human-robotic teams with three levels of autonomy. In AIAA Infotech@Aerospace Conference.