Predicting and Explaining Individual Performance in Complex Tasks

Marsha Lovett, Lynne Reder, Christian Lebiere,
John Rehling, Baris Demiral
This project is sponsored by the Department of the Navy, Office of Naval Research
Multi-Tasking
• A single person can perform multiple tasks.
A single model should be able to capture performance
on those multiple tasks.
• A single person brings to bear the same fundamental
processing capacities to perform all those tasks.
A single model should be able to predict that person’s
performance across tasks from his/her capacities.
A way to keep the multiple-constraint advantage
offered by unified theories of cognition while
making their development tractable is to do
Individual Data Modeling. That is, to gather a
large number of empirical/experimental
observations on a single subject (or a few subjects
analysed individually) using a variety of tasks that
exercise multiple abilities (e.g., perception,
memory, problem solving), and then to use these
data to develop a detailed computational model of
the subject that is able to learn while performing
the tasks.
Gobet & Ritter, 2000
ZERO-PARAMETER PREDICTIONS!
Basic Goals of Project
• Combine best features of cognitive modeling
– Study performance in a dynamic, multi-tasking
situation (albeit less complex than real world)
– Explain not only aggregate behavior but variation
(using individual difference variables)
– Predict (not fit/postdict) complex performance
• Use cognitive architecture and fixed parameters
• Employ off-the-shelf models whenever possible
• Plug in individual difference params for each person
How to predict task performance
• Estimate each individual’s processing parameters
– Measure individuals’ performance on “standard” tasks
– Using models of these tasks, estimate participant’s
corresponding architectural parameters (e.g., working
memory capacity, perceptual/motor speed)
• Build/refine model of target task
• Select global parameters for model of target task
(e.g., from previously collected data)
• Plug into model of target task each individual’s
parameters to predict his/her target task performance
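The steps above can be sketched in code. This is a minimal sketch under stated assumptions, not the project's implementation: `target_task_model`, `GlobalParams`, and the numeric values in it are hypothetical stand-ins for a real cognitive model of the target task. The W values are the ones shown for Subjects 610, 619, 613, and 623 in this talk; the PM values are placeholders. The point it illustrates is that nothing is tuned against target-task data once the individual parameters have been estimated elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndividualParams:
    w: float   # working memory capacity, estimated from a "standard" task
    pm: float  # perceptual/motor speed multiplier, estimated separately


@dataclass(frozen=True)
class GlobalParams:
    base_time: float    # hypothetical: seconds per action when pm == 1.0
    memory_cost: float  # hypothetical: extra seconds per unit of memory load


def target_task_model(ind, glob, load):
    """Hypothetical stand-in for a target-task model: predicted seconds per action."""
    return glob.base_time * ind.pm + glob.memory_cost * load / ind.w


def predict_all(participants, glob, load):
    """Plug each person's fixed parameters into the model; nothing is re-fit here."""
    return {pid: target_task_model(p, glob, load) for pid, p in participants.items()}


if __name__ == "__main__":
    glob = GlobalParams(base_time=2.0, memory_cost=0.5)  # illustrative values only
    participants = {
        "610": IndividualParams(w=0.8, pm=1.0),
        "619": IndividualParams(w=0.9, pm=1.0),
        "613": IndividualParams(w=1.0, pm=1.0),
        "623": IndividualParams(w=1.1, pm=1.0),
    }
    for pid, seconds in predict_all(participants, glob, load=3).items():
        print(pid, round(seconds, 2))
```

Keeping the individual and global parameters in separate immutable records makes it harder to adjust either one by accident while looking at target-task results.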
Example: Memory Task Performance
• Fit task A to estimate individuals’ parameters
[Figure: four panels, one per participant (Subject 610, W = 0.8; Subject 619, W = 0.9; Subject 613, W = 1.0; Subject 623, W = 1.1), each comparing Data and Model as a function of Memory Set Size (3 to 6) on a 0.0 to 1.0 scale]
Zero-Parameter Predictions
• Plug those parameters into model of task B
[Figure: four panels, one per participant (Subject 610, W = 0.8; Subject 619, W = 0.9; Subject 613, W = 1.0; Subject 623, W = 1.1), each comparing Data and Model as a function of Memory Load (n-back, 0 to 3) on a 0.50 to 1.00 scale]
(Lovett, Daily, & Reder, 2000)
Challenges of Complex Tasks
• Modeling the target task is harder
• More than one individual difference variable
likely impacting target task
• Possibility of knowledge/strategy differences
What about knowledge differences?
• Develop tasks that reduce their relevance
• Train participants on specific procedures
• Measure skill/knowledge differences in
another task and incorporate them in model
• Use model to predict variation in relative
use of strategies by way of estimates of
individuals’ processing capacities
Individual Differences in ACT-R
• Most ACT-R models don’t account for impact of
individual differences on performance, but the
potential is there
• There are many parameters with particular
interpretations related to individual difference
variables
• Most ACT-R modelers set parameters to universal
or global values, i.e., defaults or values that fit
aggregate data
ACT-R & Individual Differences
P1, P2, P3, …
M1, M2, M3, …
W1, W2, W3, …
Overview of Talk
• Review tasks we are studying
• Illustrate methodology
• Highlight key results
– Visual search vs. memory strategies trade off in
final performance => complex task modeling
offers best constraint with fine-grained analysis
Modified Digit Span (MODS)
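[Figure: timeline of one MODS trial. Three strings appear in sequence along a TIME axis (1st string: a j 2; 2nd string: b i e 6; 3rd string: c f 8), followed by a recall prompt (_ _ _)]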
P/M Tasks
• In our earlier studies, initial training phase
of target task was used to collect data on
individuals’ perceptual/motor speed.
– e.g., Time to find object “A7” and click on it
• In later studies, separate task used to
measure perceptual and motor speed.
How to predict task performance
• Estimate each individual’s processing parameters
– Measure individuals’ performance on MODS and perceptual/motor (P/M) tasks
– Using models of these tasks, estimate participant’s
corresponding architectural parameters (e.g., working
memory capacity, perceptual/motor speed)
• Build/refine model of target task
• Select global parameters for model of target task
(e.g., from previously collected data)
• Plug into model of target task each individual’s
parameters to predict his/her target task performance
W affects Performance
• W is the ACT-R parameter for source
activation, which impacts the degree to
which activation of goal-related facts rises
above the sea of other facts’ activations
• Higher W => goal-related facts relatively
more activated => faster and more
accurately retrieved => better MODS
performance
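The chain from W to retrieval can be illustrated with the standard ACT-R equations: activation A_i = B_i + Σ_j (W/n) S_ji, a logistic probability of retrieval given a threshold and noise, and a latency that decays exponentially with activation. The sketch below assumes those textbook forms; the base level, associative strengths, threshold, noise, and latency factor are illustrative values chosen here, not parameters from the models in this talk.

```python
import math


def activation(base_level, strengths, w):
    """Base level plus source activation W, split equally over the goal
    sources and weighted by each source's associative strength."""
    n = len(strengths)
    return base_level + sum((w / n) * s for s in strengths)


def retrieval_probability(a, threshold=0.5, noise_s=0.4):
    """Logistic probability that activation exceeds the retrieval threshold."""
    return 1.0 / (1.0 + math.exp(-(a - threshold) / noise_s))


def retrieval_latency(a, latency_factor=1.0):
    """Retrieval time shrinks exponentially as activation grows."""
    return latency_factor * math.exp(-a)


if __name__ == "__main__":
    strengths = [1.5, 1.0, 0.5]  # hypothetical associative strengths S_ji
    for w in (0.8, 0.9, 1.0, 1.1):
        a = activation(0.0, strengths, w)
        print(f"W={w:.1f}  A={a:.3f}  "
              f"P={retrieval_probability(a):.3f}  T={retrieval_latency(a):.3f}")
```

With positive associative strengths, raising W raises activation, which in turn raises retrieval probability and lowers retrieval latency, matching the direction of the effect described above.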
Estimating W
• Model of MODS task is fit to individual’s
MODS performance by varying W
• Best fitting value of W is taken as estimate
[Figure: four panels, one per participant (Subject 610, W = 0.8; Subject 619, W = 0.9; Subject 613, W = 1.0; Subject 623, W = 1.1), each comparing Data and Model as a function of Memory Set Size (3 to 6) on a 0.0 to 1.0 scale]
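Fitting by varying W can be pictured as a one-dimensional search: run the task model at each candidate W and keep the value with the smallest squared error against the individual's data. The sketch below assumes a grid search and a deliberately simple toy recall model (`predicted_accuracy`, in which W is shared among the items in the memory set); both are stand-ins chosen for illustration, as are the observed accuracies, and none of them come from the actual MODS model.

```python
import math


def predicted_accuracy(w, set_size, threshold=0.25, noise_s=0.05):
    """Toy recall model (an assumption): W is divided among the items
    in the memory set, and recall is a logistic function of the share."""
    share = w / set_size
    return 1.0 / (1.0 + math.exp(-(share - threshold) / noise_s))


def fit_w(observed, candidates):
    """Return the candidate W with the smallest sum of squared errors.

    observed maps memory set size -> observed proportion correct."""
    def sse(w):
        return sum((predicted_accuracy(w, n) - acc) ** 2
                   for n, acc in observed.items())
    return min(candidates, key=sse)


if __name__ == "__main__":
    grid = [round(0.5 + 0.05 * i, 2) for i in range(21)]  # W from 0.50 to 1.50
    observed = {3: 0.80, 4: 0.50, 5: 0.25, 6: 0.15}       # hypothetical data
    print("best-fitting W:", fit_w(observed, grid))
```

A grid is used here only because it is easy to follow; any one-dimensional optimizer would serve the same purpose.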
Estimating PM
• For simplicity, we estimated a combined PM
parameter directly from each individual’s
perceptual/motor task performance.
• This PM parameter was then used to scale the
timing of the target task’s perceptual-motor
productions.
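One way to picture this scaling is shown below. It is a sketch under two assumptions that the slides do not spell out: that PM is computed as the ratio of an individual's mean perceptual/motor time to a reference time, and that it multiplies each perceptual-motor production's default duration (so a larger PM means slower actions). The production names and durations are hypothetical.

```python
from statistics import mean


def estimate_pm(individual_times, reference_time):
    """Assumed estimator: individual's mean P/M task time relative to a reference."""
    return mean(individual_times) / reference_time


def scale_production_times(default_times, pm):
    """Multiply each perceptual-motor production's default duration by PM."""
    return {name: seconds * pm for name, seconds in default_times.items()}


if __name__ == "__main__":
    # Hypothetical default durations (seconds) for perceptual-motor productions.
    defaults = {"find-aircraft": 1.1, "move-mouse": 0.8, "click": 0.2}
    pm = estimate_pm([1.2, 1.0, 1.4], reference_time=1.0)  # hypothetical trials
    for name, seconds in scale_production_times(defaults, pm).items():
        print(f"{name}: {seconds:.2f} s")
```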
Joint Distribution of W and P/M
[Figure: scatter plot of Pm (0.40 to 1.60) against W (0.40 to 1.60)]
W and P/M are tapping distinct characteristics
Specifics of our Approach
• Estimate each individual’s processing parameters
– Measure individuals’ performance on modified digit span, spatial span,
perceptual/motor speed
– Using models of these tasks, estimate participant’s W, P, M
• Build/refine model of air traffic control task (AMBR)
• Select global parameters for AMBR model
• Plug in individuals’ parameters to predict performance across
different AMBR scenarios
AMBR: Air Traffic Control Task
• Complex and dynamic task
• Spatial and verbal aspects
• Multi-tasking
• Testbed for cognitive modeling architectures
AMBR Task
AC = aircraft, ATC = air traffic controller
• As ATC, you communicate with AC and other ATC to
handle all AC in your airspace
• Six commands with different triggers:
• First ACCEPT, then WELCOME incoming AC (these two
separated by short interval)
• First TRANSFER, then order a CONTACT message from
outgoing AC (these two separated by short interval)
• Decide to OK or REJECT requests for speed increase
• When a command is not handled before AC reaches zone
boundary, this is a HOLD (error)
Issuing an AMBR Command
• Text message or radar cues a particular action
• Click on Command Button
• Click on Aircraft (in radar screen)
• Click on Air Traffic Controller (if necessary)
• Click on SEND Button
General Methods
• Empirical Methods
– Day 1: Collect MODS and P/M data and train on
AMBR plus AMBR practice
– Day 2: Review AMBR instructions, battery of AMBR
scenarios
• Modeling Methods
– Use MODS & PM data to estimate W and PM for each
subject
– Plug individual W and PM values into AMBR model
– Compare individuals’ AMBR performance with model
predictions
Experiments 1 & 2
• AMBR Scenario Design
– Experiment 1: alternating 5 easy, 5 hard
– Experiment 2: 9 scenarios of varying difficulty
• AMBR Dependent Measures
– Total time to handle each command
– Number of hold errors
Off-the-shelf ACT-R Model of AMBR
• Scan for something to do: Radar, Left,
Right, Bottom text windows
• When an action cue is noticed, determine if
it has been handled or not: scan/remember
• If the cue has not been handled, click
command, AC, [ATC], SEND
• Resume scanning
Model Captures Range of Performance
[Figure: number of hold errors (0 to 25) by scenario (1 to 9) for Subjects 1 to 4, plotted alongside the LoLo Model and HiHi Model]
Model Predictions
• Prediction of whether a subject commits an
error in a scenario, based on scenario details
and individual’s W & P/M
                                 Subject scenarios   Subject scenarios
                                 with errors         with no errors
Model scenarios with errors      205                 4
Model scenarios with no errors   21                  70
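The counts above can be summarized as agreement rates. The sketch below assumes the reading that the model and subject both showed errors in 205 scenarios, only the model in 4, only the subject in 21, and neither in 70; the function and key names are illustrative choices, not terms from the talk.

```python
def agreement_summary(both_errors, model_only, subject_only, neither):
    """Summarize a 2x2 model-vs-subject table of scenario outcomes."""
    total = both_errors + model_only + subject_only + neither
    return {
        "total": total,
        # Share of scenarios where the model and the subject agree.
        "overall_agreement": (both_errors + neither) / total,
        # Of scenarios where the subject erred, share the model predicted.
        "error_scenarios_predicted": both_errors / (both_errors + subject_only),
        # Of error-free subject scenarios, share the model also left error-free.
        "clean_scenarios_predicted": neither / (neither + model_only),
    }


if __name__ == "__main__":
    summary = agreement_summary(both_errors=205, model_only=4,
                                subject_only=21, neither=70)
    for name, value in summary.items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```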
Ind’l Diffs’ Impact on Hold Errors
• Hold errors only weakly
dependent on W, more
strongly on P/M and
scenario difficulty
[Figure: number of hold errors (0 to 50) as a function of parameter value (0.7 to 1.4), plotted separately for Pm and W]
Scenario Difficulty
[Figure: scenario difficulty, measured as # aircraft * aircraft speed (0 to 250), by scenario for Experiment 1 and Experiment 2]
Mean Errors by Scenario
[Figure: mean number of hold errors (0 to 18) by scenario for Experiment 1 and Experiment 2]
Be Careful What (DM) you Model
• Error data too coarse to constrain model
• Even total RT/command data insufficient
• Model predicts that scanning strategy plays a
large role in performance.
• This is consistent with reports from participants,
who may be using any combination of visual search
and memory retrieval
Observable Behaviors
Subject
T 0.0 Cue: Accept T6?
T 3.6 ACCEPT button
T 5.9 AC “T6”
T 6.7 ATC “EAST”
T 7.7 SEND button
Model
T 0.0 Cue: Accept T6?
T 3.7 ACCEPT button
T 5.7 AC “T6”
T 7.0 ATC “EAST”
T 8.2 SEND button
Stochastic variation on the single-action level is part of subject
and model behavior
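The two traces above can be compared step by step by converting the cumulative timestamps into per-action intervals. The sketch below uses the timestamps shown on this slide; the helper name `click_intervals` and the rounding to one decimal place are choices made here for illustration.

```python
def click_intervals(timestamps):
    """Convert cumulative event times (seconds since the cue) into the
    time spent on each successive action, rounded to 0.1 s."""
    intervals = []
    previous = 0.0
    for t in timestamps:
        intervals.append(round(t - previous, 1))
        previous = t
    return intervals


STEPS = ["ACCEPT button", "AC T6", "ATC EAST", "SEND button"]
SUBJECT_TIMES = [3.6, 5.9, 6.7, 7.7]  # from the subject trace above
MODEL_TIMES = [3.7, 5.7, 7.0, 8.2]    # from the model trace above

if __name__ == "__main__":
    rows = zip(STEPS, click_intervals(SUBJECT_TIMES), click_intervals(MODEL_TIMES))
    for step, subject_s, model_s in rows:
        print(f"{step:14s} subject {subject_s:.1f} s   model {model_s:.1f} s")
```

Looking at intervals rather than cumulative times keeps an early discrepancy from being carried into every later step of the comparison.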
The Details Are Inside
Model I/O
T 0.0 Cue: Accept T6?
T 3.7 ACCEPT button
T 5.7 AC “T6”
T 7.0 ATC “EAST”
T 8.2 SEND button
Model Trace
T 1.5 Notice cue
T 2.5 Subgoal task
T 3.7 Mouse click
T 3.8 Start AC search
T 4.9 Find AC
T 5.7 Mouse click
T 7.0 Mouse click
T 8.2 Mouse click
Conclusion thus far…
• Visual search vs. memory strategies trade off in
final performance => even when modeling a
complex task, coarse dependent measures
(accuracy, total RT) hide important details
• Previous AMBR model fit group data well
• Only by seeking extra constraint of modeling
individual participants were important gaps in
model fidelity revealed
Modifications for Experiment 3
• Use more fine-grained measures: Action RT & Clicks
• Modify the ATC task to increase memory demand
– More interesting for our purposes
– More realistic
– Lengthen scenario length so same planes are in play
– Hide AC names until click, then only after delay
• Use model to bracket appropriate difficulty level
Raw Characteristics of Data
Experiment 3
• Action RT 12.1 sec, Holds 3.3 / subject
• Action RT correlates with W (r = -0.314)
and Pm (r = 0.485)
• Holds correlate with W (r = -0.444) and
Pm (r = 0.508)
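For readers who want to see how such r values are obtained, the sketch below computes a Pearson correlation from paired per-participant values. The example inputs are made-up numbers used only to exercise the function; they are not data from Experiment 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    w_estimates = [0.8, 0.9, 1.0, 1.1]  # hypothetical W estimates
    hold_errors = [6, 4, 3, 1]          # hypothetical hold counts
    print(f"r = {pearson_r(w_estimates, hold_errors):.3f}")
```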
Model Modifications
• Search not only gives the answer sought
(a specific AC’s location) but also provides
an additional rehearsal of that information
• In slack times, possible strategy of studying
radar screen to rehearse AC names (called
“exploratory clicks”)
Model Predicts Hold Errors
• Predicts errors per subject, r = 0.81
• Hold errors depend more on W (compared
to previous version of task) but still mostly
dependent on PM and scenario difficulty
• Move to modeling more fine-grained
aspects of data…
Model Predicts Number of Clicks
[Figure: mean number of AC clicks (0 to 3) by command type (Accept, Welcome, Transfer, Contact, Speed), Subjects vs. Model, followed by further panels plotting # AC clicks by command type]
W, P/M affect RT click by click
• Set W-P/M parameters in model corresponding
to participants (e.g., hi-hi & lo-lo)
• Run model to produce RT predictions click by
click (for 2 commands: Accept and Contact)
[Figure: two panels, "Hi-Hi Model & Subject" and "Lo-Lo Model & Subject", each plotting cumulative RT for data and model by click type (Comm, AC, ATC, Send)]
W, P/M affect RT click by click
• Set W-P/M parameters in model corresponding
to participants
• Run model to produce RT predictions click by
click (for 2 commands: Accept and Contact)
[Figure: scatter plot of Model RTs (0 to 14000) against Subject RTs (0 to 20000)]
Conclusion thus far
• Modeling more fine-grained measures required
task and model modifications, but this produced
individual participant predictions that were very
promising.
• Clicking on correct AC the first time ranges from
69% to 96%
– Akin to remember vs. scan strategies
– Higher number -> more (accurate) remembering
– This detailed aspect of performance relates to W
Theoretical Interlude: Spatial vs. Verbal WM
• Our working assumption (parsimoniously) posits a
single source activation parameter, W
• W modulates the degree to which goal-relevant
facts are activated above the sea of unrelated facts
• …regardless of spatial/verbal representation
• This perspective still allows for spatial/verbal
distinctions in performance but explains them as a
function of differences in spatial/verbal skills etc.
Opportunity to Test in Current Work
• AMBR task has spatial and verbal aspects
• Included verbal and spatial working memory tasks
in battery, starting with Experiment 3
• Which span task produces W estimates that best
predict individuals’ AMBR performance?
• Spatial Span task from Miyake and Shah (1996):
[Figure: example Spatial Span stimuli labeled “normal”, “reversed”, “normal”]
Opportunity to Test in Current Work
• Result
– Experiments 3 & 4: Spatial Span-based W predicts
AMBR performance better than MODS-based W
• Possible explanations:
– Spatial format more relevant for this task?
– Spatial Span shows more variability -> more sensitive?
– Spatial Span variability taps other sources of variation?
– Are there separate W’s for verbal and spatial WM?
Spatial Span taps speed as well…
• Another study, spawned by this issue, shows
relationship between individuals’ mental rotation
speed and Spatial Span
• Pattern of correlations with PM:
– MODS: r = .25; Spatial Span: r = .65
• Pattern of correlations with AMBR components:
                          MODS    SS     PM
Mem+Mouse SpeedReq-AC     -.62   -.55   -.39
Mouse Welcome-AC          -.20   -.61   -.53
Mouse Welcome-Tot         -.16   -.56   -.70
Theoretical Interlude Conclusion
• Studying verbal vs. spatial memory
resources in context of AMBR task moves
theoretical debate to more realistic arena
– This complements work with laboratory tasks
and allows greater potential for generalization
of results
Strategic Variation Emerges
• Experiment 4 also revealed several sources of
strategic variation, explored further in Experiment 5
• Waiting for AC name: ranges from 42% to 100%
– May reflect lack of confidence in memory, utility of
checking one’s memory
– Somewhat negatively correlated with W
• Initiating “welcome” and “contact” commands in
anticipation of text cue (ranges from 0% to 100%)
• Making exploratory clicks on ACs during slack time
(ranges from never to > 5 per scenario)
Experiment 5 Details
• Scenarios designed to have low (6 ACs) vs.
high memory load (total 12 ACs)
• Speed requests most common command
– Most interesting for model predictions
– Least susceptible to snowball effects
• Dependent measures include RTs for
individual clicks and strategy use as a
function of scenario difficulty and command
Modeling Specific AMBR Components
[Figure: accuracy of first AC click (0 to 1.2) by command (SPEED REQUEST, CONTACT, ACCEPT, WELCOME, plus "WELCOME Antic" in the hard panel), in two panels: Hard Scenarios and Easy Scenarios]
Modeling Specific AMBR Components
[Figure: RT to correct AC click by command (SPEED REQUEST, CONTACT, ACCEPT, WELCOME), in two panels: Hard Scenarios (axis -5000 to 25000) and Easy Scenarios (axis -4000 to 8000)]
Model Predictions Match Data
• Main effects of scenario difficulty amplified
for low W individuals
• Main effects of command type (more/less
memory-demanding) amplified for low W
• Wait-for-AC-name strategy varied as a
function of command type
• Exploratory clicks strategy varied as a
function of scenario difficulty
Summary of Conclusions
• Complex tasks are not a modeling panacea! Only
by seeking extra constraint of modeling individual
participants were important gaps in model’s
fidelity revealed.
• Studying verbal vs. spatial memory resources in
context of AMBR task moves theoretical debate to
more realistic arena.
• Variability in performance -- from different use of
strategies and/or from differences in processing
capacities -- is there for the looking. Studying
performance on average offers incomplete
understanding.
Features of Our Approach
• Our approach aims to jointly provide
– Predictions that are accurate and detailed
– At the individual participant level
– Generated in real time (or faster)
– Based on an interpretable model with variation
in meaningful individual difference parameters
– That generalize to variants of the target task
Joint Distribution of W and P/M
[Figure: scatter plot of Estimated PM Value (-2 to 2.5) against Estimated W Value (0.8 to 1.6)]
W and P/M are tapping distinct characteristics