EXTENDING SEMANTIC AND EPISODIC MEMORY
TO SUPPORT ROBUST DECISION MAKING
(FA2386-10-1-4127)
PI: John E. Laird (University of Michigan)
Graduate Students:
Nate Derbinsky, Mitchell Bloch, Mazin Assanie
AFOSR Program Review:
Mathematical and Computational Cognition Program
Computational and Machine Intelligence Program
Robust Decision Making in Human-System Interface Program
(Jan 28 – Feb 1, 2013, Washington, DC)
EXTENDING SEMANTIC AND EPISODIC MEMORY (JOHN LAIRD)
Objective:
Develop algorithms that support effective,
general, and scalable long-term memory:
1. Effective: retrieves useful knowledge
2. General: effective across a variety of tasks
3. Scalable: supports large amounts of
knowledge and long agent lifetimes:
manageable growth in memory and
computational requirements
DoD Benefit:
Develop science and technology to support:
• Intelligent knowledge-rich autonomous
systems that have long-term existence,
such as autonomous vehicles (ONR,
DARPA: ACTUV).
• Large-scale, long-term cognitive models
(AFRL)
Technical Approach:
1. Analyze multiple tasks and domains to
determine exploitable regularities.
2. Develop algorithms that exploit those
regularities.
3. Embed within a general cognitive
architecture.
4. Perform formal analyses and empirical
evaluations across multiple domains.
Budget (Actual / Planned, $K):
– FY11: $99 / $158 (Annual Progress Report submitted: Y)
– FY12: $195 / $165 (Annual Progress Report submitted: Y)
– FY13: $205 / $176 (Annual Progress Report submitted: N)
Project End Date: June 29, 2013
LIST OF PROJECT GOALS
1. Episodic memory (experiential & contextualized)
– Expand functionality
– Improve efficiency of storage (memory) and retrieval (time)
2. Semantic memory (context independent)
– Enhance retrieval
– Automatic generalization
3. Cognitive capabilities that leverage episodic and semantic memory
functionality
– Reusing prior experience, noticing familiar situations, …
4. Evaluate on real world domains
5. Extended Goal
– Competence-preserving selective retention across multiple memories
PROGRESS TOWARDS GOALS
1. Episodic memory
– Expand functionality (recognition)
• [AAAI 2012b]
– Improve efficiency of storage (memory) and retrieval (time)
• Exploits temporal contiguity, structural regularity, high cue structural selectivity, high
temporal selectivity, low cue feature co-occurrence
• For many different cues and many different tasks, no significant slowdown with
experience: runs for days of real time (tens of millions of episodes), faster than real time.
• [ICCBR 2009; BRIMS 2011; AAMAS 2012]
2. Semantic memory
– Enhance retrieval
• Evaluated multiple bias functions: conclude that base-level (exponential) activation works best (see the sketch below)
• Developed an efficient approximate algorithm that maintains high (>90%) retrieval validity
– 30-100x as fast as prior retrieval algorithms (without base-level activation), on a 3x larger data set
– sublinear slowdown as memory size increases
• Exploits small node outdegree and high selectivity (but not low co-occurrence of cue features)
• [ICCM 2010; AISB 2011; AAAI 2011]
• Current research: how to use context – collaboration with Braden Phillips, University of
Adelaide on special purpose hardware to support spreading activation in semantic memory
– Automatic generalization
• Current research: Leverage data maintained for episodic memory
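For reference, the sketch below shows the standard (unapproximated) base-level activation bias in Python. This is the recency/frequency formula the retrieval-bias bullet refers to, not the paper's efficient approximation; the function name and decay default are illustrative assumptions.

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """Exact base-level activation: B = ln( sum_j (now - t_j)^(-decay) ).

    access_times: past times at which the memory element was accessed
    (all strictly earlier than `now`). Recently and frequently accessed
    elements score highest; each access's contribution decays with age.
    """
    return math.log(sum((now - t) ** -decay for t in access_times))
```

Computing this exactly requires the full access history of every element, which is why an efficient approximation matters at scale.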
PROGRESS TOWARDS GOALS
3. Cognitive capabilities that leverage episodic and semantic memory functionality
– Episodic memory
• Seven distinct capabilities: recognition, prospective memory, virtual sensing, action modeling, …
• [BRIMS 2011; ACS 2011b; AAAI 2012a]
– Semantic memory
• Support reconstruction of forgotten working memory
• [ACS 2011; ICCM 2012a]
4. Evaluate on real world domains
– Episodic memory
• Multiple domains including mobile robotics, games, planning problems, linguistics
• [BRIMS 2011; AAAI 2012a]
– Semantic memory
• Word sense disambiguation, mobile robotics
• [ICCM 2010; BRIMS 2011; AAAI 2011]
5. Competence-preserving retention/forgetting
– Working memory
• Automatic management of working memory to improve the scalability of episodic memory, utilizing semantic memory
• [ACS 2011; ICCM 2012b; Cog Sys 2013]
– Procedural memory
• Automatic management of procedural memory using the same algorithms as in working-memory management
• [ICCM 2012b; Cog Sys 2013]
NEW GOALS
• Dynamic determination of value-functions for reinforcement learning to
support robust decision making.
– [ACS 2012; AAAI submitted]
OVERVIEW
• Goal:
– Online learning and decision making in novel domains
with very large state spaces.
– No a priori knowledge of which features are most
important
• Approach:
– Reinforcement learning with adaptive value function
determination using hierarchical tile coding
– Only online, incremental methods need apply!
• Hypothesis:
– Will lead to more robust decision making and learning
over small changes to environment and task
REINFORCEMENT LEARNING
FOR ACTION SELECTION
• Choose action based on the expected (Q) value stored in a value
function
– Value function maps from situation-action to expected value.
• Value function updated based on reward received and expected future
reward (Q Learning: off policy)
[Diagram: value function (si, aj) → qij; from states S1 and S2 (perception and internal structures), candidate actions a1–a5 are evaluated, and reward feeds back to update the selected (state, action) pair, e.g. (s2, a4)]
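As a concrete reference for the update just described, here is a minimal tabular off-policy Q-learning sketch in Python. Names and constants are illustrative assumptions, not the project's implementation:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # value function: (state, action) -> expected value q

def choose_action(state, actions, epsilon=0.1):
    """Pick the action with the highest stored Q value, exploring occasionally."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Off-policy update: move Q(s, a) toward the reward received plus the
    best expected future reward from the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```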
VALUE-FUNCTION FOR LARGE STATE
SPACES
• (si, aj) → qij
• si = (f1, f2, f3, f4, f5, f6, … fn)
• Usually only a subset of the features is relevant
• Including irrelevant features slows learning
• Omitting relevant features yields suboptimal asymptotic performance
• How do we get the best of both?
• First step: hierarchical tile coding (Sutton & Barto, 1998); see the sketch below
• Initial results for propositional representations in Puddle World and Mountain Car
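A minimal sketch of hierarchical tile coding for a two-feature continuous state, assuming features normalized to [0,1) and a hierarchy of power-of-two grids like the 1-64 hierarchies evaluated below; names are hypothetical:

```python
def active_tiles(x, y, max_resolution=64):
    """Map a point in [0,1) x [0,1) to one active tile per tiling in a
    hierarchy of grids: 1x1, 2x2, 4x4, ..., up to max_resolution.

    Each tile is identified by (resolution, column, row); the Q value of a
    (state, action) pair is then assembled from the weights of these tiles.
    """
    tiles = []
    resolution = 1
    while resolution <= max_resolution:
        tiles.append((resolution, int(x * resolution), int(y * resolution)))
        resolution *= 2
    return tiles

# Example: any point activates one tile at each of the 7 levels (1 ... 64).
assert len(active_tiles(0.5, 0.5)) == 7
```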
PUDDLE WORLD
PUDDLE WORLD
[Figure: 2x2, 4x4, and 8x8 tilings of the Puddle World state space]
Q-value for (si, aj) = Σt q(si,t, aj): the sum over all tilings t (as opposed to the average)
More abstract tilings (2x2) get more updates, which form the baseline for their subtilings
The update is distributed across all tiles that contribute to the Q-value (see the sketch below)
• Explored a variety of credit distributions: 1/sqrt(updates), even, 1/updates, …
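A sketch of the summed Q value and a distributed update under the 1/sqrt(updates) credit distribution (one of several explored). It reuses active_tiles from the previous sketch and is illustrative, not the exact implementation:

```python
import math
from collections import defaultdict

weights = defaultdict(float)      # (tile, action) -> weight contributing to Q
update_counts = defaultdict(int)  # (tile, action) -> updates received so far

def q_value(tiles, action):
    """Q(s, a) is the SUM of the active tiles' weights, not their average."""
    return sum(weights[(tile, action)] for tile in tiles)

def distributed_update(tiles, action, target, alpha=0.1):
    """Spread one TD update across every tile contributing to Q(s, a).

    Credit 1/sqrt(1 + updates) favors less-trained tiles: coarse tiles are
    touched by many states early on and form the baseline that their finer
    subtilings then refine."""
    error = target - q_value(tiles, action)
    credits = [1.0 / math.sqrt(1 + update_counts[(tile, action)])
               for tile in tiles]
    total = sum(credits)
    for tile, credit in zip(tiles, credits):
        weights[(tile, action)] += alpha * error * credit / total
        update_counts[(tile, action)] += 1
```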
Puddle World: Single-Level Tilings
[Chart: cumulative reward/episodes (0 to -100,000) vs. actions (thousands, 0–200) for 4x4, 8x8, 16x16, 32x32, and 64x64 tilings]
Puddle World: Single-Level Tilings, Expanded
[Chart: cumulative reward/episodes (0 to -7,000) vs. actions (thousands, 0–50) for 4x4, 8x8, and 16x16 tilings]
Puddle World: Includes Static Hierarchical Tiling 1-64
[Chart: cumulative reward/episodes (0 to -7,000) vs. actions (thousands, 0–50) for 4x4, 8x8, 16x16, and 1-64 static tilings]
MOUNTAIN CAR
Mountain Car: Static Tilings
[Chart: cumulative reward/episodes (0 to -7,000) vs. actions (thousands, 0–1,000) for 16x16, 32x32, 64x64, 128x128, and 256x256 tilings]
Mountain Car: Static Tilings, Expanded
[Chart: cumulative reward/episodes (0 to -7,000) vs. actions (thousands, 0–100) for 16x16, 32x32, 64x64, 128x128, and 256x256 tilings]
Mountain Car: Includes Static Hierarchical Tiling
[Chart: cumulative reward/episodes (0 to -7,000) vs. actions (thousands, 0–100) for 16x16, 32x32, 64x64, 128x128, 256x256, and 1-256 static tilings]
WHY DOES HIERARCHICAL TILING WORK?
• Abstract Q values serve as a starting point for learning more specific Q values, so the specific values require less learning
• Exploits a locality assumption:
– There is continuity in the mapping from feature space to Q values at multiple levels of refinement
FOR LARGE STATE SPACES, HOW TO AVOID HUGE MEMORY COSTS?
• Hypothesis: a non-uniform tiling is sufficient
• How can this be done incrementally and online?
• Split a tile if its mean Cumulative Absolute Bellman Error (CABE) is half a standard deviation above the mean across tiles
– CABE is accumulated in proportion to the credit assignment and the learning rate
– The mean and standard deviation of CABE are tracked 100% incrementally at low computational cost (see the sketch below)
• An incremental and online algorithm
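A simplified sketch of the split test, maintaining running sums over all tiles' CABE values so the population mean and standard deviation cost O(1) to update; the class and threshold mirror the criterion above, but all bookkeeping details are illustrative assumptions:

```python
import math

class CabePopulation:
    """Running sums over per-tile CABE, so mean and standard deviation are
    available incrementally at low computational cost."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add_tile(self, cabe=0.0):
        """A new tile (e.g., created by a split) enters the population."""
        self.count += 1
        self.total += cabe
        self.total_sq += cabe * cabe

    def update_tile(self, old_cabe, new_cabe):
        """A tile accumulated more absolute Bellman error."""
        self.total += new_cabe - old_cabe
        self.total_sq += new_cabe * new_cabe - old_cabe * old_cabe

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

    @property
    def stddev(self):
        if self.count < 2:
            return 0.0
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))

def should_split(tile_mean_cabe, population):
    """Split a tile whose mean CABE is half a standard deviation above the
    mean across tiles."""
    return tile_mean_cabe > population.mean + 0.5 * population.stddev
```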
PUDDLE WORLD
[Figure: incremental refinement of the Puddle World tiling, from 1x1 through 2x2 and 4x4 to 8x8 subtilings where needed]
ANALYSIS AND EXPECTED RESULTS
• Might lose performance, because it takes time to "grow" the tiling
• Might gain performance, because updates are not wasted on useless details
• Expect many fewer "active" Q values
Puddle World: Static Hierarchical Tiling, Reward and Memory Usage
[Chart: cumulative reward/episodes (left axis) and memory in Q values (right axis, 0–60,000) vs. actions (thousands, 0–20) for the 1-64 static hierarchy]
Puddle World: Static and Dynamic Hierarchical Tiling, Reward and Memory Usage
[Chart: cumulative reward/episodes (left axis) and memory in Q values (right axis, 0–60,000) vs. actions (thousands, 0–20) for 1-64 static and 1-64 dynamic hierarchies]
Puddle World: Static and Dynamic Hierarchical Tiling, Reward and Memory Usage
[Chart: cumulative reward/episodes and memory in Q values vs. actions (thousands, 0–20) for 1-64 static and 1-64 dynamic hierarchies; annotation: dynamic memory is 9% of static at 10,000 actions]
[Figure: tile decomposition for move(north), 10K actions]
[Figure: tile decomposition for move(south), 10K actions]
Mountain Car: Even Credit Assignment
[Chart: cumulative reward/episodes (left axis) and memory in Q values (right axis, 0–400,000) vs. actions (thousands, 0–100) for 1x256 static and 1x256 dynamic hierarchies]
Mountain Car: Inverse Log Credit Assignment
[Chart: cumulative reward/episodes (left axis) and memory in Q values (right axis, 0–400,000) vs. actions (thousands, 0–100) for 1-256 static and 1-256 dynamic hierarchies]
Mountain Car: Inverse Log Credit Assignment
[Chart: cumulative reward/episodes and memory in Q values vs. actions (thousands, 0–100) for 1-256 static and 1-256 dynamic hierarchies; annotation: dynamic memory is 6% of static at 50,000 actions]
[Figure: tile decomposition for move(right), 50K actions]
[Figure: tile decomposition for move(right), 1,000K actions]
[Figure: tile decomposition for move(left), 50K actions]
[Figure: tile decomposition for move(left), 1,000K actions]
[Figure: tile decomposition for move(idle), 50K actions]
[Figure: tile decomposition for move(idle), 1,000K actions]
RELATED WORK
• (McCallum, 1996) Reinforcement Learning with Selective Perception and Hidden State
– Not strict hierarchies, but similar motivation for relational representations
• Two levels with independent updating and no adaptive splitting:
– (Taylor & Stone, 2005) Behavior Transfer for Value-Function-Based Reinforcement Learning
– (Zheng, Luo & Lv, 2006) Control Double Inverted Pendulum by Reinforcement Learning with Double CMAC Network
– (Grzes, 2010) Improving Exploration in Reinforcement Learning through Domain Knowledge and Parameter Analysis
• Maintain data on the fringe of the hierarchy:
– (Munos & Moore, 1999) Variable Resolution Discretization in Optimal Control: splits periodically; maintains a fringe of the hierarchy and splits the top f% of cells to minimize the standard deviation of influence and variance; no time-based online performance data; requires an action model
– (Whiteson, Taylor, & Stone, 2007) Adaptive Tile Coding for Value Function Approximation: splits when the Bellman error for a tile has not decreased in N steps; maintains the fringe of the hierarchy and splits the tile that maximally reduces Bellman error or maximally improves the policy
Whiteson Policy and Value Results Compared to Our Results
[Chart: cumulative reward/episodes (0 to -3,000) vs. actions (thousands, 0–100) for Whiteson value-based splitting, Whiteson policy-based splitting, our static hierarchy, and our dynamic hierarchy]
ROBUSTNESS
"A characteristic describing a model's, test's, or system's ability to effectively perform while its variables or assumptions are altered."
Puddle World
1. Change the position of the goal
2. Change the size of the puddles
3. Increase the stochasticity of actions
Mountain Car
1. Change the force of actions
Hypotheses
– Hierarchical tiling should be robust to small changes
– Incremental tiling should have similar performance
CHANGES IN GOAL POSITION
[Figure: Puddle World with alternative goal positions 0–4]
PW: Different Goals with Static Hierarchy
[Chart: cumulative reward/episodes (0 to -5,000) vs. actions (thousands, 0–50) for goals Static 1–4]
[Charts, two panels: PW: Different Goals with Static Hierarchy (Static 1–4, repeated from the previous slide) and PW: Different Goals with Dynamic Hierarchy (Dynamic 1–4). Cumulative reward/episodes (0 to -5,000) vs. actions (thousands, 0–50)]
[Charts, four panels: Different Goals with Static Hierarchy (Static 1–4); Transfer from Original to New Goal with Static Hierarchy (Static 1–4); Different Goals with Dynamic Hierarchy (Dynamic 1–4); Transfer from Original to New Goal with Dynamic Hierarchy (Dynamic 1–4). Cumulative reward/episodes (0 to -5,000) vs. actions (thousands, 0–50)]
FUTURE WORK
1. Short term
– Develop criteria for stopping refinement
– More research on robustness
– Better understanding of different credit assignment policies
2. Research on choosing which dimensions should be expanded
3. Expand to relational representations
– No longer a strict hierarchy
– Must decide which relations/features should be included
• What meta-data to maintain?
• Can we use additional background knowledge?
4. Embed within a cognitive architecture (Soar)
– Already have a prototype implementation for continuous features
LIST OF PUBLICATIONS ATTRIBUTED TO THE GRANT
1. [AAAI submitted] Bloch, M. & Laird, J. E. (submitted) Incremental Hierarchical Tile Coding in Reinforcement Learning, AAAI.
2. [Cog Sys 2013] Derbinsky, N., & Laird, J. E. (2013) Effective and efficient forgetting of learned knowledge in Soar's working and procedural memories. Cognitive Systems Research.
3. [ACS 2012] Laird, J. E., Derbinsky, N., and Tinkerhess, M. (2012) Online Determination of Value-Function Structure and Action-value Estimates for Reinforcement Learning in a Cognitive Architecture, Advances in Cognitive Systems, Volume 2, Palo Alto, California.
4. [AAMAS 2012] Derbinsky, N., Li, J., Laird, J. E. (2012) Evaluating Algorithmic Scaling in a General Episodic Memory (Extended Abstract). Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Valencia, Spain.
5. [ICCM 2012a] Derbinsky, N., Laird, J. E. (2012) Efficient Decay via Base-Level Activation. Proceedings of the 11th International Conference on Cognitive Modeling (ICCM). Berlin, Germany. Best Poster.
6. [ICCM 2012b] Derbinsky, N., Laird, J. E. (2012) Competence-Preserving Retention of Learned Knowledge in Soar's Working and Procedural Memories. Proceedings of the 11th International Conference on Cognitive Modeling (ICCM). Berlin, Germany.
7. [AAAI 2012a] Derbinsky, N., Li, J., Laird, J. E. (2012) A Multi-Domain Evaluation of Scaling in a General Episodic Memory. Proceedings of the 26th AAAI Conference on Artificial Intelligence. Toronto, Canada.
8. [AAAI 2012b] Li, J., Derbinsky, N., Laird, J. E. (2012) Functional Interactions Between Encoding and Recognition of Semantic Knowledge. Proceedings of the 26th AAAI Conference on Artificial Intelligence. Toronto, Canada. [AFOSR & ONR]
9. [BRIMS 2011] Laird, J. E., Derbinsky, N., Voigt, J. (2011) Performance Evaluation of Declarative Memory Systems in Soar. Proceedings of the 20th Behavior Representation in Modeling & Simulation Conference (BRIMS), 33-40. Sundance, UT.
10. [AISB 2011] Derbinsky, N., Laird, J. E. (2011) A Preliminary Functional Analysis of Memory in the Word Sense Disambiguation Task. Proceedings of the 2nd Symposium on Human Memory for Artificial Agents, AISB, 25-29. York, England.
11. [AAAI 2011] Derbinsky, N., and Laird, J. E. (2011) A Functional Analysis of Historical Memory Retrieval Bias in the Word Sense Disambiguation Task. Proceedings of the 25th National Conference on Artificial Intelligence (AAAI), 663-668. San Francisco, CA.
12. [ACS 2011] Derbinsky, N., Laird, J. E. (2011) Effective and Efficient Management of Soar's Working Memory via Base-Level Activation. Papers from the 2011 AAAI Fall Symposium Series: Advances in Cognitive Systems (ACS), 82-89. Arlington, VA.
13. [ICCM 2010] Derbinsky, N., Laird, J. E., Smith, B. (2010) Towards Efficiently Supporting Large Symbolic Declarative Memories. Proceedings of the 10th International Conference on Cognitive Modeling (ICCM), 49-54. Philadelphia, PA.