Formal Verification and Modeling in Human-Machine Systems: Papers from the AAAI Spring Symposium

A Look at Probabilistic Gaussian Process, Bayes Net, and Classifier Models
for Prediction and Verification of Human Supervisory Performance

Nisar Ahmed (University of Colorado Boulder, Aerospace Engineering Sciences, Boulder, CO 80309; Nisar.Ahmed@colorado.edu)
Ewart de Visser (Perceptronics Solutions, Inc., Falls Church, VA 22042; edevisse@gmail.com)
Tyler Shaw and Raja Parasuraman (George Mason University, Department of Psychology, Fairfax, VA 22030; (tshaw4,rparasur)@gmu.edu)
Amira Mohammed-Ameen (University of Central Florida, Department of Psychology, Orlando, FL 32816; amoameen@knights.ucf.edu)
Mark Campbell (Cornell University, Mechanical and Aerospace Engineering, Ithaca, NY 14853; mc288@cornell.edu)
Abstract

This paper motivates why three classes of probabilistic models - Gaussian Process, Bayes Net, and Classifier - can be very useful in verifying collaborative interactions between humans and autonomy. These models can capture individual, average, and distributional descriptions of human capabilities (including supervisory tasking and communication) and human factors metrics such as working memory. We argue that these models can provide probabilistic information in a form that will enable a formal probabilistic approach to designing and fielding collaborative human-autonomous systems. We discuss key advantages and limitations of these models in the context of performance prediction, training/learning, and deployment. Our goal is to initiate a discussion of how data-driven models such as these can be incorporated into formal verification frameworks for human-machine systems.

1 Introduction

The ultimate performance of autonomy is typically closely tied to human interaction. For example, UAVs are operated by humans, and the data coming from their sensors are analyzed by humans. However, a recent Defense Science Board (DSB) study on autonomy (Murphy and Shields 2012) made clear that the benefits of autonomous systems for the military have been slow to materialize; a key shortcoming is that humans and autonomy must be collaborative. Likewise, the DoD Autonomy Priority Steering Council (Junker 2011) selected four key technical challenge areas/gaps that must be addressed: 1) Human-Autonomy Interaction and Collaboration, 2) Scalability, 3) Machine Intelligence, and 4) Verification and Validation.

Formal methods for verification and validation have been developed in recent years in an effort to evaluate requirements during controller construction (Manna and Pnueli 1995). Autonomy has borrowed/expanded on these concepts, starting with formal model checking tools such as SPIN and NuSMV. Using goals/rules encoded into logic-based specifications along with models of the system/environment, correct-by-construction controllers are automatically generated for the system using controller synthesis methods coupled with model checkers (typically theorem proving and/or exhaustive search) (Kress-Gazit, Fainekos, and Pappas 2009).

Recent advances in probabilistic model checkers (e.g. PRISM (Kwiatkowska, Norman, and Parker 2002)) have led to the concept of controller verification at a particular level of probability. In (Johnson and Kress-Gazit 2012; Johnson et al. 2012), controller software for an automated taxi was developed to minimize the probability of collision. In this case, probabilistic models of other car movements in the environment were mapped into the logic, and probabilistically verified controllers were automatically generated from external specifications. This approach accounts for uncertainty in a natural and intuitive way, since all controllers/decisions are probabilistic.

Such advances have led to a key question: how do we model, verify, and utilize humans as collaborative decision makers and actors in correct-by-construction autonomous systems? We propose in this paper that a new suite of probabilistic models of human performance/capabilities could be integrated into the correct-by-construction verifiable framework. The potential benefits of such integration include: (i) guarantees on probability of performance based on rigorous statistical measures of model validity; (ii) the ability to 'match' verifiable automation software to individual human operators (e.g. given a model for how a person performs, the automation programming can be automatically generated to compensate for expected individual weaknesses); and (iii) speeding up of the end validation (acceptance) of autonomy software, due to inclusion of human models from the start.

Many 'non-cognitive' probabilistic models have been proposed as alternatives to well-known detailed cognitive computational models (e.g. ACT-R (Anderson, Matessa, and Lebiere 1997), EPIC (Kieras and Meyer 1997), Soar (Lehman et al. 1996)) for predicting human-in-the-loop performance in networked unmanned vehicle applications.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In (Fan, Yen, and Chen 2010) and (Heger and Singh 2006), for instance, human operators are modelled dynamically via probabilistic Markov models in order to capture random transitions between abstract discrete states that influence decision-making and task performance. In (Donmez, Cummings, and Nehme 2010), discrete-event task simulations with probability distributions on operator servicing times are used to explicitly model the performance effects of changing workload and vehicle utilization in a multi-UAV supervisory task. Both cognitive and non-cognitive dynamic probabilistic human-operator models can generate sample-based performance prediction statistics via repeated random simulations of closed-loop task execution, and as such can provide useful insight into specific scenarios that lead to good/bad operator performance. However, such dynamic probabilistic models require a high level of detail and much training data to explicitly account for the effects of various task/network-related factors (e.g. number of agents, task load). These models also do not explicitly account for individual factors, e.g. differences in working memory capacity.

We present and discuss three other classes of probabilistic models that help mitigate these issues and can thus be potentially useful for prediction and verification of human operator performance in human-machine systems, either for performing detailed analyses related to dynamical process simulations or for performing gross 'high-level' analyses of human-machine system performance that abstract away certain details. Two of these models, Gaussian Process (GP) regression and Bayesian networks (BN), can enable direct 'function-like' performance metric predictions without requiring simulations or an explicit model of the operator's decision-making processes (Ahmed et al. 2013). Any expected variability arising from differences in these and other unmodeled factors related to task dynamics is described by the estimated probabilities associated with each prediction. The remaining model class, probabilistic discriminative classification models, can be used to capture stochastic non-Markovian state-dependent switching behaviors for discrete supervisory decision making by human operators in detailed process models (Bourgault et al. 2007), (Ahmed and Campbell 2008), (Ahmed and Campbell 2011), which is currently not realizable with the probabilistic Markov or discrete-event models mentioned above.

2 GP and BN Statistical Performance Models

2.1 Networked UAV Supervision Application

Refs. (de Visser et al. 2010) and (Ahmed et al. 2013) studied how human operator supervisory performance metrics for a simulated networked UAV air superiority scenario could be modeled from combined knowledge of task load (TL), quality of inter-network communications (i.e. message quality, MQ), and human operator working memory capacity (WM), which plays an important role in the executive control processes underlying effective multi-tasking and decision-making capabilities (Parasuraman 2010), (Endsley 1995). Performance for a single human/multi-UAV team system was examined under various operating conditions for 30 human participants (compensated university students, 18 female), who were trained to act as solo UAV supervisors using Aptima's DDD 4.0 distributed client-server software. The software provided advisory messages from a simulated automated teammate, pertaining to the defense task, which varied in degree of usefulness to the task being performed by the supervisor at any given time. Subjects could command 8 friendly UAV assets located inside a 'Red Zone', which neutral and enemy UAVs approached from different directions. During each simulated run, operators were tasked with: (i) using friendly UAVs to attack and prevent enemies from entering the Red Zone; (ii) protecting their own UAVs from damage/destruction via friendly fire or enemy attack; and (iii) sending messages to warn of enemy UAV intrusions into the simulated teammate's zone. Participants completed simulation runs in a randomized 2 x 3 factorial experimental design, with two TL and three MQ levels. The MQ levels were: (1) 'relevant messages', i.e. all messages from the teammate provided useful engagement information (e.g. '2 B-type enemy assets approaching from top of screen'); (2) 'noisy messages', i.e. 80% of messages were irrelevant (e.g. 'Neutral assets in battle space'); and (3) no messages at all. The TL levels varied the number of enemies actually attempting to enter the Red Zone over each 7.5 min simulation: 'low' = 31 targets and 'high' = 47 targets. Prior to the experiments, 22 participants also completed a version of the Operation Span (OSPAN), a standard, well-validated measure of WM (Engle 2002), and all WM scores were normalized against the maximum possible score of 25.

Four performance metrics were assessed: red zone performance (RZP), time to destroy enemy target (DT), enemy destroyed performance (EDP), and attack efficiency (AE),

RZP = 1 − (# enemies successfully entering Red Zone) / (total # of enemies),
DT = average time taken to destroy each enemy (secs),
EDP = (# enemies destroyed) / (total # of enemies),
AE = (# enemies destroyed) / (total # of times enemies engaged),

where RZP, EDP, and AE are 'accuracy' indices in the range 0 to 1 (perfect performance), and DT is a latency measure (lower means better performance). The goals of successfully protecting the Red Zone and destroying as many enemies as quickly as possible are often competing supervisory demands, since frequent switches must be made to handle enemies both near and far from the Red Zone. Coupled with the tasks of handling network messages and protecting friendly UAVs, the operator's cognitive load can build up quickly and lead to diminished supervisory performance.

Initial modeling results found that all four operator performance metrics could be reasonably well predicted on the basis of TL, MQ and WM using standard parametric linear regression models learned from the experimental data (de Visser et al. 2010), (Ahmed et al. 2013). Such models could be used by human-automation system designers to help specify system requirements or develop supervisory decision aids to support various system operating regimes on the basis of different operator cognitive traits such as WM. Importantly, such models can replace or complement detailed human-in-the-loop simulations to determine expected performance outcomes for different operators under various human-machine operating conditions.

However, linear parametric models are generally inappropriate for other important tasks: (i) extended performance prediction for conditions that are not sampled in the data (which is often quite limited, since the full space of task conditions can only be partially explored offline); and (ii) 'inverse reasoning' about unknown tasking conditions and/or operator capabilities from observed performance metric values. Furthermore, although linear models provide a reasonable starting point, they are not able to capture complex nonlinear relationships between performance metrics and operating/operator conditions. These issues can be addressed by nonparametric GP regression models and probabilistic BN models, both of which generalize linear regression models and can provide more insight on relationships between performance metrics and system operating factors.
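As a concrete illustration of how a GP generalizes linear regression for this kind of prediction, the following minimal sketch implements the squared-exponential kernel and GP posterior equations on synthetic (TL, MQ, WM) → RZP data. All numbers here are illustrative assumptions, not values from the experimental study; a zero prior mean function f(x) = 0 is assumed. The key behavior shown is that the GP's predictive variance grows as the query point moves away from the modeling data, unlike a linear model's constant uncertainty.

```python
import numpy as np

# Hypothetical operating points x = (TL, MQ, WM) and a performance metric y (e.g. RZP).
# These are synthetic stand-ins, not the experimental data from the study.
X = np.array([[0.155, 1.0, 0.3], [0.155, 1.0, 0.6], [0.235, 0.0, 0.4], [0.235, 0.5, 0.8]])
y = np.array([0.55, 0.70, 0.45, 0.75])

def sq_exp_kernel(xi, xj, theta0=0.05, theta=np.array([0.02, 1.0, 0.1])):
    """Squared-exponential kernel: theta0 * exp(-sum_d (x_di - x_dj)^2 / (2*theta_d)).

    Larger theta_d makes dimension d matter less; values here are assumed, not fit."""
    return theta0 * np.exp(-np.sum((xi - xj) ** 2 / (2.0 * theta)))

def gp_predict(xs, X, y, noise=1e-4):
    """GP posterior mean and variance at a new point xs (zero prior mean assumed)."""
    C = np.array([[sq_exp_kernel(a, b) for b in X] for a in X]) + noise * np.eye(len(X))
    k = np.array([sq_exp_kernel(xs, b) for b in X])
    c = sq_exp_kernel(xs, xs) + noise
    mean = k @ np.linalg.solve(C, y)        # m(x*) = k^T C^-1 y
    var = c - k @ np.linalg.solve(C, k)     # sigma^2(x*) = c - k^T C^-1 k
    return mean, var

# Predictive variance grows as x* moves away from the modeling data:
_, var_near = gp_predict(np.array([0.155, 1.0, 0.5]), X, y)
_, var_far = gp_predict(np.array([0.4, 0.0, 0.0]), X, y)
assert var_far > var_near
```

The same query evaluated under a fitted linear model would report the same residual variance everywhere, which is the contrast drawn in Section 2.2.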
2.2 GP regression models for robust prediction

Nonparametric GP models can provide extended statistical predictions for expected values and standard deviations of performance metrics as a function of operator characteristics and operating conditions. Through the use of special covariance kernel functions, GP predictions can also automatically adapt to uncertainties arising from non-linearities and sample sparseness in the experimental modeling data. These features allow a data set to 'speak for itself' as predictions are made, which has made GPs standard tools for complex regression problems involving noisy, sparsely sampled processes (Bishop 2006).

Formally, let x = (TL, MQ, WM) denote a human-machine system operating condition and let y be the value of some performance metric Y (i.e. RZP, AE, etc.). If N is the number of distinct experimental data points available for modeling, let x1:N = {x1, ..., xN} and y1:N = {y1, ..., yN} be, respectively, the set of all experimentally sampled operating points and the corresponding set of observed performance metric values. For some new unsampled operating point x∗, the GP predicts that the corresponding metric value Y∗ = y∗ has the conditional Gaussian probability distribution with mean m(x∗, x1:N) and variance σ²(x∗, x1:N),

p(Y∗ = y∗ | x∗, x1:N) = N(y∗; m(x∗, x1:N), σ²(x∗, x1:N))
                      = (1 / (√(2π) σ(x∗, x1:N))) exp( −(y∗ − m(x∗, x1:N))² / (2σ²(x∗, x1:N)) ),
m(x∗, x1:N) = f(x∗) + kᵀC⁻¹y1:N,
σ²(x∗, x1:N) = c − kᵀC⁻¹k.

Here, c is a positive scalar term; k is an N × 1 vector of terms describing the similarity between x∗ and the members of x1:N; C is a covariance matrix that describes the similarity of all members of x1:N to each other; and f(x) is a user-defined mean function that models a priori beliefs about the relationship between x and y. Although not detailed here, k, c and C are all directly defined by a user-selected covariance kernel function k(xi, xj) that models expected similarities between two points xi and xj. A common choice for k(xi, xj) is the squared exponential form,

k(xi, xj) = θ0 exp( − Σ_{d=1..D} (x_{d,i} − x_{d,j})² / (2θ_d) ),

where the x_{d,i} and x_{d,j} terms are the D individual entries of operating points xi and xj (i.e. for the application described here, D = 3 for TL, MQ, and WM), and all θ terms are free parameters that affect shape and scale sensitivity along each component of xi and xj. Note that, for large θ terms, the corresponding factors of x (TL, MQ, WM) become less important for predicting y∗. Optimal kernel parameters (e.g. as determined by maximum likelihood) can therefore offer valuable insight into the relative importance of dependencies between TL, MQ, WM and the performance metrics.

The GP model has several useful characteristics for predictive performance modeling and verification. Firstly, due to the Gaussian probabilistic nature of the model, m(x∗, x1:N) gives the most likely value of y∗ for a given x∗, and σ²(x∗, x1:N) gives the degree of uncertainty in this estimate. Secondly, the kernel parameterization lends three useful properties. It can be mathematically shown that the GP model predicts via a 'nearest neighbor' strategy: the more similar x∗ is to xi, the more probable it becomes that y∗ will be similar to yi. Likewise, the predicted variance σ²(x∗, x1:N) grows as x∗ becomes more dissimilar to the original data x1:N, but adjusts to the variance of y1:N otherwise. Hence, the GP model has the appealing ability to adjust its prediction statistics on the basis of how similar x∗ is to the original modeling data, such that unfamiliar x∗ values automatically lead to more uncertain predictions. This is in contrast to a standard linear regression model, which maintains a constant level of prediction uncertainty for all possible x values. It can also be shown that GP regression implicitly considers an infinite number of parametric function candidates at once during the modeling process (where the space of parametric functions depends on the covariance kernel) (Bishop 2006). This circumvents the usual need to explicitly restrict the regression process to a limited (and likely incomplete) family of candidate models to find a best fit for the data.

Figure 1 shows two example operator performance metric prediction results that compare GP and standard linear regression predictions of RZP as a function of TL, MQ and WM values taken from the experimental data. Both models show good general agreement in predicting downward shifts in performance as task load increases and upward shifts in performance as working memory capacity increases. However, the uncertainty bounds from the standard deviations of both models lead to significantly different statistical performance predictions under certain task load and working memory conditions. Unlike the linear model, as x∗ gets further from x1:N, the GP becomes more agnostic and hence conservative about performance predictions.

GP models allow for 'cognitively aware' performance predictions on the basis of realistically obtainable sets of experimental data. This could be useful in many ways. First, one could define a specification on the performance of the combined human-autonomy system that is now in the form of a distribution. Second, formal, probabilistic verification
can now be attempted with models of the combined human-autonomy system. Third, the models can be used to infer/analyze specifications (on humans, autonomy, or the combination) on-line in order to ensure that the specifications are being met, or that assumptions are not violated. Fourth, data-driven models have the ability to be updated as additional data are collected; this learning ability is important as the complexity of the interaction increases. Importantly, this is potentially the first step towards the concept of automatically generating correct-by-construction controllers that 'match' the strengths and weaknesses of the human.

Still, GP-class models have challenges. Firstly, the model will only capture a small portion of the human and/or interaction. What models of human capabilities can be more easily integrated into a verifiable framework? What if this model is incomplete? What about situations that are not easily modeled, such as emergency situations, or difficult-to-model (e.g. uncooperative) humans? It is important to understand exactly what can be done with GP models in such cases and whether existing learning techniques are sufficient to address these issues. Secondly, should specifications be defined based on a 'typical' human, a distribution of humans, or the worst-case human? What if a system falls outside of the specified bounds, either in design or in operation? The integration into formal verification methods in a 'guaranteed statistical' sense is still an open question, since the assumption of second-order sufficient statistics (mean, covariance) through a GP may not fully reflect true performance uncertainty in all cases. Thirdly, how can these models adapt/learn on-line? Data will be collected as new experiences occur, and thus models (and controllers) should improve quickly on-line. Currently, computational cost is a weakness of GP-class models (as well as correct-by-construction controllers) that still needs to be addressed.

Figure 1: Predicted RZP mean with 1σ/2σ bounds for learned GP and linear regression (LR) models as a function of TL, MQ, and WM. Predictions for new WM values at fixed TL and MQ 'slices' (panel (a): TL = 0.155, MQ = 1; panel (b): TL = 0.235, MQ = 0) are shown by lines/shaded regions; experimental data are shown as blue markers. TL gives the fraction of 200 enemies attacking the Red Zone; MQ gives the fraction of relevant messages.

2.3 BN models and inverse reasoning

In many situations, variables like TL, MQ or WM may be required by autonomous decision aids and analysis tools, but are unknown and difficult to measure. These can be estimated via Bayesian inference, which 'inverts' relationships described by a joint probability distribution over the unknown variables of interest and observed performance measures. A probabilistic BN model can be used to encode this distribution explicitly as a directed acyclic graph (DAG), consisting of nodes (random variables) and edges that specify local factorized probabilistic conditional dependencies.

Figure 2: BN graph model and estimated conditional probability tables for hidden 2-state structural variable H and performance variables. Directed arrows denote conditional dependency of child node on parent nodes in the overall joint probability distribution.

Figure 2 shows a BN model developed in (Ahmed et al. 2013) from the experimental performance data using model estimation techniques based on discretization of TL, MQ, WM and performance metric values. The model parameters for the joint distribution are given by conditional probability tables (CPTs) describing the 'causal' parent-child dependencies between the nodes in the graph. These CPTs efficiently encode the full joint distribution through the factorization

p(TL, MQ, WM, AE, EDP, DT, RZP, H) =
p(TL)p(MQ)p(WM)p(H|TL,MQ,WM)p(EDP|TL)
× p(AE|H)p(RZP|H)p(DT|H),   (1)

which permits the use of efficient message-passing algorithms for Bayesian inference. The binary latent variable H 'summarizes' information flowing from TL, MQ and WM to the performance metrics, and thereby helps to simplify the overall joint probability model by reducing the overall number of free CPT parameters to be estimated. The summarizing effect of H is reflected by the strong dependence of these performance metric CPTs on H, which can be interpreted as a variable that predictively 'clusters' each human subject's overall task competency (for given values of TL, MQ, and WM) into 'below' and 'at/above average' levels for H=1 and H=2 values, respectively. Note that the BN model for this problem is not unique: over 1 × 10^7 different DAGs could be used to represent the joint distribution, e.g. by introducing extra links from TL, MQ and WM to the performance variables, by breaking/reversing the order of individual conditional dependencies, etc. The structure shown in Fig. 2 was selected from a smaller subset of DAGs that reflect common-sense 'causality constraints' (i.e. TL, MQ, and WM do not influence each other, and are not influenced by performance metrics) and permitted use of reasonably sized CPTs that could be estimated from data. Complete details of the modeling process are given in (Ahmed et al. 2013).

Figure 3 (a) and (b) show how the unknown (discretized) TL, MQ and WM values can be inferred from observed (discretized) experimental performance metric data. Shown are the marginal Bayesian posterior probabilities for the unknown variables at different fixed values of RZP, EDP, AE, and DT. These probabilities can be used to estimate the unknown variables along with confidence assessments. For instance, in scenario (a), the true recorded values for the unknown variables all correspond to the maximum a posteriori (MAP) estimates derived from the marginal posterior probabilities, which show overwhelming confidence in the value of TL but considerable ambiguity for MQ and WM. These probabilities can be used for threshold-based decision making or hedging against incorrect estimates of operator/operating conditions. In scenario (b), for instance, an autonomous decision aid would be able to account for the fact that MQ is almost as equally likely to be in a good state of 'high' quality as it is to be in a bad state of 'low' quality when deciding how best it can help the human operator. BN models of human performance would therefore serve a similar purpose as the hybrid stochastic dynamic vehicle estimation models in (Johnson and Kress-Gazit 2012), where models of likely sensor errors were used to probabilistically verify correct-by-construction controllers for autonomous driving, and thus enable probabilistic guarantees on high-level behavior.

Figure 3: Posterior probabilities for unknown discretized TL, MQ, and WM values following Bayesian inference with observed operator performance metrics, using the BN shown in Figure 2 on real data for two different experimental scenarios: (a) EDP = High, AE = High, RZP = High, DT = Low; (b) EDP = Med, AE = Low, RZP = Low, DT = Med. Values marked with ∗ on the abscissa indicate the true value of the unknown variable; most probable and true values agree in both scenarios, with the exception of MQ in scenario (b).
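This kind of inverse reasoning can be sketched by brute-force enumeration of a BN joint distribution. The toy model below keeps only a slice of the factorization in (1) (TL, MQ, WM → H → RZP) and uses made-up CPT numbers, not the tables estimated in (Ahmed et al. 2013); it illustrates how observing a performance metric yields a posterior over an unknown condition such as TL.

```python
import itertools

# Toy discretized priors (made-up numbers for illustration only).
P_TL = {"Low": 0.5, "High": 0.5}
P_MQ = {"Low": 0.3, "Med": 0.3, "High": 0.4}
P_WM = {"Low": 0.3, "Med": 0.4, "High": 0.3}

def p_H2(tl, mq, wm):
    """Toy CPT P(H=2 | TL, MQ, WM): 'at/above average' competency is more
    likely with low task load, better messages, and higher working memory."""
    score = {"Low": 0.3, "Med": 0.5, "High": 0.7}
    base = (score[mq] + score[wm]) / 2.0
    return base * (0.9 if tl == "Low" else 0.6)

# Toy CPT P(RZP | H): competent operators protect the Red Zone more often.
P_RZP_given_H = {1: {"Low": 0.6, "High": 0.4}, 2: {"Low": 0.2, "High": 0.8}}

def posterior_TL(rzp_obs):
    """P(TL | RZP = rzp_obs) by summing the joint factorization over MQ, WM, H."""
    unnorm = {"Low": 0.0, "High": 0.0}
    for tl, mq, wm, h in itertools.product(P_TL, P_MQ, P_WM, (1, 2)):
        ph = p_H2(tl, mq, wm) if h == 2 else 1.0 - p_H2(tl, mq, wm)
        unnorm[tl] += P_TL[tl] * P_MQ[mq] * P_WM[wm] * ph * P_RZP_given_H[h][rzp_obs]
    z = sum(unnorm.values())
    return {tl: v / z for tl, v in unnorm.items()}

post = posterior_TL("High")
# Under these toy CPTs, good Red Zone performance makes low task load more probable.
assert post["Low"] > post["High"]
```

In practice, message-passing algorithms replace this exhaustive sum, but the posteriors they compute are exactly of this form.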
3 Detailed Probabilistic Operator Modeling

In the GP and BN approaches outlined above, any variability arising from unmodeled factors related to detailed task dynamics is subsumed by the estimated probabilities associated with each prediction. However, detailed dynamical models of human operator behavior can yield more refined and insightful performance predictions, especially if individual operator cognitive traits are expected to be widely different or if multiple operator 'failure points' exist. Existing detailed dynamic operator behavior models tend to be developed either from an economical data-driven 'black box' high-level perspective (which abstracts away the influence of mixed continuous/discrete dynamic variables on operator decision making) (Fan, Yen, and Chen 2010), (Heger and Singh 2006), (Donmez, Cummings, and Nehme 2010), or from a detailed first-principles 'white box' computational cognitive modeling perspective (which attempts to simulate all relevant neurophysical aspects of human-in-the-loop behavior) (Anderson, Matessa, and Lebiere 1997), (Kieras and Meyer 1997), (Lehman et al. 1996).

Alternatively, a 'gray box' probabilistic approach could be used to model high-/low-level operator behaviors as discrete/continuous random variables that are conditionally dependent on external variables that are known to influence human-machine system behavior. To motivate this idea, Figure 4 shows a hypothetical robot-assisted search and rescue scenario, in which a human operator must decide which of three possible autonomous behaviors a ground robot should execute at a given instance on the basis of available information. The modeling problem here can be cast as one of predicting the discrete action 'category' Dk selected by the operator given a set of information 'features' Xk (i.e. the observables from the robot; in this case, 'Range to target' and 'Robot's confidence that target is victim'). The simulated data shown on the right-hand side of Fig. 4 show how such decisions might stochastically depend on Xk (temporally or contextually correlated features are not shown here for simplicity, but could be included more generally).

Ref. (Bourgault et al. 2007) explored how generative density estimation methods could be used to build compact probabilistic models of state-dependent human supervisory discrete task switching and continuous waypoint commanding behaviors for multi-robot applications, using real experimental data collected from 16 human operators on the RoboFlag simulation testbed (Shah et al. 2009), (D'Andrea and Babish 2003). The generative modeling approach first models the statistics of Xk as a function of Dk and then applies Bayes' rule to derive the desired probabilistic model of Dk as a Bayesian posterior distribution. For the example in Fig. 4, this means that the data would first be used to estimate class-conditional probability density functions p(Xk|Dk = i) and a prior decision probability p(Dk = i) (independent of Xk). The desired generative posterior distribution would then be formed via Bayes' rule as

p(Dk = i|Xk) = (1/c(Xk)) p(Dk = i) p(Xk|Dk = i), where c(Xk) = Σ_j p(Dk = j) p(Xk|Dk = j).

The main drawback of this approach is the inherent difficulty of the density estimation problem when Xk has more than 4 or 5 dimensions, as very large amounts of experimental data points are then required to obtain reliable estimates of p(Xk|Dk = i).
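The generative posterior construction just described can be sketched with a simple instance. The sketch below assumes axis-aligned Gaussian class-conditional densities p(Xk|Dk = i) over the two features of the search-and-rescue example; the means, spreads, and priors are hypothetical illustration values, not estimates from the RoboFlag data.

```python
import numpy as np

# Hypothetical 2-D features Xk = [range to target (m), robot's confidence target is victim (%)]
# and the three decision categories from the search-and-rescue example.
classes = ["ignore", "get more data", "assist victim"]
priors = np.array([0.3, 0.4, 0.3])                      # p(Dk = i), made up
means = np.array([[8.0, 20.0], [6.0, 50.0], [3.0, 85.0]])  # class-conditional means, made up
stds = np.array([[2.0, 15.0], [2.0, 15.0], [2.0, 10.0]])   # class-conditional spreads, made up

def decision_posterior(x):
    """p(Dk=i|Xk) = p(Dk=i) p(Xk|Dk=i) / c(Xk), with c(Xk) = sum_j p(Dk=j) p(Xk|Dk=j)."""
    lik = np.prod(
        np.exp(-((x - means) ** 2) / (2.0 * stds ** 2)) / (np.sqrt(2.0 * np.pi) * stds),
        axis=1,
    )
    unnorm = priors * lik
    return unnorm / unnorm.sum()

post = decision_posterior(np.array([2.5, 90.0]))
# Close range plus high victim confidence makes 'assist victim' the MAP decision here.
assert classes[int(np.argmax(post))] == "assist victim"
```

The curse of dimensionality noted above enters through the class-conditional densities: each extra feature multiplies the data needed to estimate p(Xk|Dk = i) reliably, which motivates the discriminative alternative discussed next.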
References
Noisy state-dependent “mode switching”
boundaries for human supervisor
Ahmed, N., and Campbell, M. 2008. Multimodal operator
decision models. In American Control Conference 2008.
Ahmed, N., and Campbell, M. 2011. Variational Bayesian
learning of probabilsitic discriminative models with latent
softmax variables. IEEE Trans. on Sig. Proc. 59(7):3143–
3154.
Ahmed, N.; deVisser, E.; Shaw, T.; Mohamed-Ameen, A.;
Campbell, M.; and Parasuraman, R. 2013. Statistical modeling of networked human-automation performance using
working memory capacity. Ergonomics in press.
Anderson, J. R.; Matessa, M.; and Lebiere, C. 1997. ACT-R:
A theory of higher level cognition and its relation to visual
attention. Human-Computer Interaction 12(4):439–462.
Bishop, C. 2006. Pattern Recognition and Machine Learning. New York: Springer.
Bourgault, F.; Ahmed, N.; Shah, D.; and Campbell, M.
2007. Probabilistic operator-multiple robot modeling using
bayesian network representation. Proc. AIAA Conference on
Guidance, Navigation, and Control, Hilton Head, SC.
D’Andrea, R., and Babish, M. 2003. The RoboFlag testbed.
In American Control Conference, 2003. Proceedings of the
2003, volume 1, 656–660. IEEE.
de Visser, E.; Shaw, T.; Mohamed-Ameen, A.; and Parasuraman, R. 2010. Modeling human-automation team performance in networked systems: Individual differences in
working memory count. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 54,
1087–1091. SAGE Publications.
Donmez, B.; Cummings, M.; and Nehme, C. 2010. Modeling Workload Impact in Multiple Unmanned Vehicle Supervisory Control. IEEE Transactions on Systems, Man, and
Cybernetics, Part A: Systems and Humans 40(6).
Endsley, M. R. 1995. Toward a theory of situation awareness
in dynamic systems. Human Factors: The Journal of the
Human Factors and Ergonomics Society 37(1):32–64.
Engle, R. W. 2002. Working memory capacity as executive attention. Current directions in psychological science
11(1):19–23.
Fan, X.; Yen, J.; and Chen, P.-C. 2010. Learning HMMbased cognitive load models for supporting human-agent
teamwork. Cognitive Systems Research 11:108–119.
Heger, F., and Singh, S. 2006. Sliding Autonomy for Complex Coordinated Multi-Robot Tasks: Analysis & Experiments. In Robotics Science and Systems.
Johnson, B., and Kress-Gazit, H. 2012. Probabilistic guarantees for high-level robot behavior in the presence of sensor
error. Autonomous Robots 33(3):309–321.
Johnson, B.; Havlak, F.; Campbell, M.; and Kress-Gazit, H. 2012. Execution and analysis of high-level tasks with dynamic obstacle anticipation. In 2012 IEEE International Conference on Robotics and Automation (ICRA), 330–337. IEEE.
Juloski, A.; Heemels, W.; Ferrari-Trecate, G.; Vidal, R.; Paoletti, S.; and Niessen, J. 2005. Comparison of four procedures for the identification of hybrid systems. In Thiele, L., and Morari, M., eds.,
Figure 4: Conceptual example of a discrete set of human supervisory control decisions for robot-assisted search and rescue modeled by noisy probabilistic switching as a function of operator-observable system state (colors represent simulated data). Here Xk = [range to target (m); robot’s confidence that target is victim (%)], and Dk takes one of three values: ignore object and move on; get more data on object; move in to assist victim.
Refs. (Ahmed and Campbell 2008) and (Ahmed and Campbell 2011) showed that discriminative probabilistic classifier models can instead be used to directly learn the ‘noisy boundaries’ between Dk clusters for high-dimensional Xk via fully Bayesian learning techniques, which also provide uncertainty estimates for the learned boundary parameters (especially useful when training data are sparse). Note that this approach bears resemblance to techniques for identifying discrete modal switching boundaries in nonlinear hybrid dynamical systems (Juloski et al. 2005).
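To make the classifier idea concrete, the following sketch simulates noisy operator decisions as a function of a single observable state variable and learns the ‘noisy boundary’ with a simple logistic classifier. All numerical values (the simulated switching point, feature scaling, and resampling settings) are illustrative assumptions, not data from the cited studies, and bootstrap resampling is used here only as a lightweight stand-in for the fully Bayesian uncertainty estimates described above.

```python
import math
import random

def simulate(n, rng):
    """Simulated operator decisions: 'move in to assist' (y = 1) becomes
    likelier as the robot's victim-confidence rises -- a noisy switching
    boundary near 50% (an illustrative stand-in for Figure 4's Xk, Dk)."""
    data = []
    for _ in range(n):
        conf = rng.uniform(0.0, 100.0)                    # confidence (%)
        p_assist = 1.0 / (1.0 + math.exp(-(conf - 50.0) / 8.0))
        data.append((conf, 1 if rng.random() < p_assist else 0))
    return data

def fit_logistic(data, steps=800, lr=0.1):
    """Maximum-likelihood logistic regression via batch gradient ascent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for conf, y in data:
            x = (conf - 50.0) / 100.0                     # centered feature
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (y - p) * x
            gb += (y - p)
        w += lr * gw / n
        b += lr * gb / n
    return w, b

rng = random.Random(1)
data = simulate(300, rng)

# Bootstrap resampling gives a crude spread over the estimated switching
# boundary (the confidence level where P(assist) = 0.5); the cited works
# obtain such uncertainty estimates via fully Bayesian learning instead.
boundaries = []
for _ in range(15):
    boot = [data[rng.randrange(len(data))] for _ in range(len(data))]
    w, b = fit_logistic(boot)
    boundaries.append(50.0 - 100.0 * b / w)
mean = sum(boundaries) / len(boundaries)
sd = math.sqrt(sum((v - mean) ** 2 for v in boundaries) / len(boundaries))
print(f"estimated boundary: {mean:.1f}% +/- {sd:.1f}%")
```

The spread of the resampled boundary estimates plays the role of the boundary-parameter uncertainty discussed above: with sparse data the spread widens, flagging regions of Xk where the model’s predicted decision should not be trusted.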
4 Conclusions and Open Directions
This paper motivates why data-driven, probabilistic models of human capabilities would be useful in the design and verification of collaborative human-autonomy systems. Three classes of probabilistic models were presented (Gaussian Process, Bayes Net, and Classifier), along with examples showing their ability to capture elements of human behavior. These models can represent individual, average, and distributional descriptions of human capabilities (including supervisory tasking and communication), as well as human factors metrics such as working memory.
These models have many potential uses. Distributions over specifications could be defined for the combined human-autonomy system, enabling formal checking and on-line evaluation of assumptions and specifications. Guarantees on the probability of performance (or other metrics) may be realized based on rigorous statistical measures of model validity. Off-line data collection and models of human capabilities could improve the eventual implementation of a collaborative human-autonomy system. This work could also be the start of a path towards the ‘holy grail’: generation of probabilistically correct-by-construction controllers that ‘match’ human capabilities (e.g., given a model of how a person performs, the automation can be automatically generated to compensate for expected individual weaknesses). Several modeling challenges remain to be addressed, including what to model, how to handle incomplete models, how to integrate such models into the formal verification framework, and how models can be learned on-line.
Hybrid Systems: Computation and Control, volume LNCS 3414, 354–369. Springer-Verlag.
Junker, B. 2011. Autonomy roadmap. Technical report,
DoD Priority Steering Council.
Kieras, D. E., and Meyer, D. E. 1997. An overview of the EPIC architecture for cognition and performance with application to human-computer interaction. Human-Computer Interaction 12(4):391–438.
Kress-Gazit, H.; Fainekos, G. E.; and Pappas, G. J. 2009. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics 25(6):1370–1381.
Kwiatkowska, M.; Norman, G.; and Parker, D. 2002.
PRISM: Probabilistic symbolic model checker. In Computer
Performance Evaluation: Modelling Techniques and Tools.
Springer. 200–204.
Lehman, J. F.; Laird, J. E.; Rosenbloom, P.; et al. 1996. A gentle introduction to Soar, an architecture for human cognition. Invitation to Cognitive Science 4:212–249.
Manna, Z., and Pnueli, A. 1995. Temporal verification of reactive systems: Safety. Springer.
Murphy, R., and Shields, J. 2012. The role of autonomy in DoD systems. Technical report, DoD Defense Science Board.
Parasuraman, R. 2010. Neurogenetics of working memory
and decision making under time pressure. In Applied Human
Factors and Ergonomics. Taylor and Francis.
Shah, D.; Campbell, M.; Bourgault, F.; Ahmed, N.; Galster,
S.; and Knott, B. 2009. An empirical study of human-robotic
teams with three levels of autonomy. In AIAA Infotech@
Aerospace Conference.