Jeff Kephart
IBM Research
© 2003 IBM Corporation
Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
Autonomic Computing Research Challenges
Conclusions
Research Challenges in Autonomic Computing | EPSRC e-Science Meeting, March 26, 2004 © 2003 IBM Corporation
My role in autonomic computing
My group does research on agents and multi-agent systems
– Architecture, Communication, Negotiation, Machine learning
AC Research strategy; joint program manager
University relations; faculty awards, equipment grants
Co-chair, International Conference on Autonomic Computing 2004
What I hope to achieve here
Explore overlaps between research interests of e-Science and AC communities
Explore potential collaborations
Dozens of systems and applications
Hundreds of components
Thousands of tuning parameters
[Figure: system diagram; surviving labels include DNS, Server, Web, Application]
Individual system elements increasingly difficult to maintain and operate
100s of config, tuning parameters for commercial databases, servers, storage
Heterogeneous systems are becoming increasingly connected
Integration becoming ever more difficult
Architects can't intricately plan component interactions
Increasingly dynamic; more frequently with unanticipated components
This places greater burden on system administrators, but
they are already overtaxed
they are already a major source of cost (6:1 for storage) and error
We need self-managing computing systems
Behavior specified by sys admins via high-level policies
System and its components figure out how to carry out policies
Self-Configure
The Human-Intensive Present: Corporate data centers are multivendor, multi-platform. Installing, configuring, integrating systems is time-consuming, error-prone.
The Autonomic Future: Automated configuration of components, systems according to high-level policies; rest of system adjusts seamlessly.
Self-Heal
The Human-Intensive Present: Problem determination in large, complex systems can take a team of programmers weeks.
The Autonomic Future: Automated detection, diagnosis, and repair of localized software/hardware problems.
Self-Optimize
The Human-Intensive Present: Web servers, databases have hundreds of nonlinear tuning parameters; many new ones with each release. Adjusted manually.
The Autonomic Future: Components and systems will continually strive to improve their own performance and efficiency.
Self-Protect
The Human-Intensive Present: Manual vulnerability analysis. Manual detection and recovery from attacks, cascading failures.
The Autonomic Future: Automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.
Business case: Increased resiliency, responsiveness, efficiency, ROI
Reduced down-time, risk, time-to-value, cost
Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
AI Research Challenges
Conclusions
The Autonomic Element
AEs are the basic atoms of autonomic systems
An AE contains
Exactly one autonomic manager
Zero or more managed element(s)
AE is responsible for
Managing own behavior in accordance with policies
Interacting with other autonomic elements to provide or consume computational services
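[Figure: An Autonomic Element. An Autonomic Manager (Monitor, Analyze, Plan, Execute, around shared Knowledge) sits on top of a Managed Element; each exposes sensor (S) and effector (E) interfaces. Callouts: Service-oriented architecture; Software agents; an Autonomic Element may be a software app, workload mgr, sentinel, arbiter, or OGSA infrastructure element.]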
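The monitor, analyze, plan, execute cycle of an autonomic manager can be illustrated with a minimal sketch. Everything named here (`ManagedElement`, `AutonomicManager`, the load ceiling used as a policy, the assumption that load spreads evenly over workers) is hypothetical and chosen only for illustration; it is not the AC architecture's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedElement:
    """Hypothetical managed resource exposing a sensor (S) and an effector (E)."""
    load: float = 0.9   # fraction of capacity in use
    workers: int = 2

    def sense(self) -> dict:
        return {"load": self.load, "workers": self.workers}

    def effect(self, action: dict) -> None:
        added = action.get("add_workers", 0)
        if added:
            # Assumption: the same total work spreads evenly over all workers.
            self.load = self.load * self.workers / (self.workers + added)
            self.workers += added


@dataclass
class AutonomicManager:
    """Exactly one manager per element; the policy is a simple load ceiling."""
    element: ManagedElement
    max_load: float = 0.7
    knowledge: list = field(default_factory=list)

    def monitor(self) -> dict:
        reading = self.element.sense()
        self.knowledge.append(reading)   # keep history as shared knowledge
        return reading

    def analyze(self, reading: dict) -> bool:
        return reading["load"] > self.max_load

    def plan(self, reading: dict) -> dict:
        return {"add_workers": 1}

    def execute(self, action: dict) -> None:
        self.element.effect(action)

    def step(self) -> bool:
        """Run one control cycle; return True if an action was taken."""
        reading = self.monitor()
        if self.analyze(reading):
            self.execute(self.plan(reading))
            return True
        return False


manager = AutonomicManager(ManagedElement())
while manager.step():
    pass
print(manager.element.sense())
```

The loop stops as soon as the sensed load satisfies the policy, so the manager acts only when the policy is violated.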
Element interactions
System self-* properties, behavior arise from interactions among autonomic managers
Interactions are
Dynamic, ephemeral
Formed by (negotiated) agreement
Flexible in pattern; determined by policies
Based on OGSI -> WSRF, …? and specific AC extensions
A multi-agent system!
Example scenario: Autonomic Data Center
[Figure: Autonomic Data Center. Labels: System Manager, Resource Arbiter, Policy Repository, Registry, resource-level utility, service-level utility, and two Application Environments, each with an Application Manager, Database, Router, Server, Storage, and incoming Demand.]
100-150 researchers working on various aspects of Autonomic Computing
Some projects predate AC initiative; now trying to realign them with AC architecture
Technologies for specific autonomic elements
Database, storage, network, server, client…
Generic element technologies for autonomic elements
Autonomic Manager Toolset integrates many element-level technologies
– Modeling, analysis, forecasting, optimization, planning, feedback control, etc.
Uses Open Grid Services Architecture standards for inter-element communication
Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later
Generic system-level technologies
Dependency management, problem determination and remediation, workload management, provisioning, …
System scenarios and prototypes
Small- to medium-scale autonomic systems
Demonstrate self-* arising from AC architecture + technology
Identify gaps, necessary modifications
Autonomic Manager ToolSet
W. Arnold et al., Watson
Facilitates autonomic mgr construction
In accordance w/ AC architecture
Catcher for generic AM technologies
OGSI (Globus 3.0 beta) -> WSRF
Policy tools
Monitoring standards and technologies
AI tools for knowledge representation, reasoning, planning
Math libraries for modeling, optimization
Feedback control
AMTS V1.0 available as part of Emerging Technologies Toolkit v1.1 on IBM alphaWorks (www.alphaworks.ibm.com)
Considering open source
Should we think about OMII?
[Figure: An Autonomic Element (Autonomic Manager with Monitor, Analyze, Plan, Execute, and Knowledge; Managed Element; sensor and effector interfaces)]
Dependency Mgt & Self-Healing
G. Kar, Watson and H. Lee & S. Ma, Watson
Determine functional dependencies among elements
Mine design docs, system config metadata, log files
Actively probe running system
Use dependency information for system configuration, healing
Problem management lifecycle
Monitor->Detect->Localize->Repair->Learn
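The "Localize" step can be sketched with a small dependency matrix. The matrix values, the four-component topology, and the single-fault assumption below are illustrative only; they are not taken from the slide's matrix or from the project's actual algorithms.

```python
# Rows are probes, columns are components; 1 means the probe depends on the component.
# These values are illustrative and are NOT the matrix shown on the slide.
COMPONENTS = ["WS", "AS", "DBS", "R"]
DEPENDS = {
    "pingR": [0, 0, 0, 1],
    "pWS":   [1, 0, 0, 1],
    "pAS":   [1, 1, 0, 1],
    "pDBS":  [1, 1, 1, 1],
}


def localize(failed):
    """Return components whose single failure explains exactly the failed probes."""
    failed = set(failed)
    suspects = []
    for col, name in enumerate(COMPONENTS):
        # Probes that would fail if only this component were down.
        would_fail = {probe for probe, row in DEPENDS.items() if row[col]}
        if would_fail == failed:
            suspects.append(name)
    return suspects


print(localize({"pAS", "pDBS"}))
```

Under the single-fault assumption, each column of the matrix acts as a failure signature: a set of failed probes is matched against the columns to name the suspect component.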
[Figure: Dependency matrix of probes (pWS, pAS, pDBS, pingR, pingWS, pingAS, pingDBS) against components (WS, AS, DBS, R, HWS, HAS, HDBS), shown beside a diagram of a Probe, Web Server, App Server, DB Server, Router, and Analysis & Control.]
Enterprise Workload Management
D. Dillenberger
[Figure: workload path; surviving labels are Internet/Extranet and Data and Transaction]
Large, distributed, heterogeneous system
Achieves end-to-end performance via adaptive algorithms
Administrator defines policy
– Desired response times for various classes of users, apps
eWLM managers on each resource cooperate to adaptively tune parameters
– OS, network, storage, virtual server knobs
– JVM heap size, # garbage collection threads
– Workload balancing, routing parameters
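Adaptive tuning toward a response-time goal can be sketched as a simple feedback loop. This is a toy integral controller, not eWLM's algorithm: the plant model `response_time`, the single `threads` knob, and the gain value are all assumptions made for illustration.

```python
def response_time(threads: float, demand: float = 120.0) -> float:
    """Toy plant model (assumption): latency in ms falls as threads are added."""
    return demand / threads


def tune(goal_ms: float, threads: float = 2.0, gain: float = 0.05, steps: int = 200) -> float:
    """Integral control: nudge the knob in proportion to the goal violation."""
    for _ in range(steps):
        error = response_time(threads) - goal_ms   # positive means too slow
        threads = max(1.0, threads + gain * error)
    return threads


setting = tune(goal_ms=20.0)
print(round(setting, 3), round(response_time(setting), 3))
```

The administrator states only the goal; the loop finds the knob setting. A gain that is too large for the plant would make the loop oscillate, which is one reason real controllers need a model of the managed resource.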
Human Interaction with Autonomic Systems
P. Maglio, Almaden
Basic questions
What do middleware administrators do?
How can we better support the problems and practices they have?
Learn answers to these questions via ethnographic studies
Use insights to develop new ways to interact with complex computing systems
… but we thought that was the return port!
We had it wrong. Our assumption of how it worked was incorrect.
We start with looking at the proxy server log files, then the web server log files, then the application server admin log files then the application log files.
Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
Scenarios
Autonomic Computing Research Challenges
Systems and Software
– Architecture, software engineering & tools, testing/validation
– Prototyping a large-scale self-* system
Human-Computer Interaction
– Policies, Interfaces
Artificial Intelligence
– Learning, Negotiation, Self-healing, Emergent Behavior
Conclusions
Challenge: Architecture
Define set of fundamental architectural principles from which self-* emerges
AE: How to coordinate multiple threads of activity?
AEs live in complex environments
Multiple task instances and types
– concurrent, asynchronous
Multiple interacting expert modules
AE: How to detect/resolve conflicts arising from
Internal decisions by independent expert modules
External directives (possibly asynchronous)
Internal policies vs. external directives
System-level: Enable more flexible, service-oriented patterns of interaction
As opposed to traditional top-down, hierarchical systems management
Multi-agent architecture
– Communication
– Representing and reasoning about needs, capabilities, dependencies
Challenge: Policy
Policy: “Set of guidelines or directives provided to autonomic element to influence its behavior”
Human interface
Authoring and understanding policies
Avoiding or ameliorating specification errors
Developing a universal representation and grammar
Many different application domains, disciplines
Many different flavors of policy
Covers service agreements too?
Algorithms that operate upon policies (and agreements?)
Automated derivation of actions (e.g. planning, optimization)
Automated derivation of lower-level policies from high-level policies
E.g. “Maximize profit from this set of service contracts”
Conflict resolution
Both design time and run time
Need to establish protocols, interfaces, algorithms
Action rule
If (S) then do $a_2$
Results implicitly in desired state $s_2$
Goal
Achieve a most desired state $s_2$
Compute $a_2$ most likely to result in $s_2$
Assumes that most desired state can be determined a priori
Utility function
Achieve state $s$ with maximal net value $V(s) - C(a_{S \to s})$
Benefit and burden of being explicit about value
States have intrinsic value; value of policy is a derived quantity
[Figure: From Current State S, actions $a_1$, $a_2$, $a_3$ lead to Possible States $s_1$, $s_2$, $s_3$.]
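The three policy flavors can be contrasted in a short sketch. The state and action names follow the slide, but the tables `OUTCOME`, `P_SUCCESS`, `VALUE`, and `COST` and all their numbers are invented for illustration.

```python
# Hypothetical model: each action leads toward one state with some chance of success.
OUTCOME = {"a1": "s1", "a2": "s2", "a3": "s3"}
P_SUCCESS = {"a1": 0.9, "a2": 0.8, "a3": 0.5}
VALUE = {"s1": 4.0, "s2": 10.0, "s3": 12.0}   # V(s), assumed values
COST = {"a1": 1.0, "a2": 3.0, "a3": 9.0}      # C(a), assumed costs


def action_rule(state: str) -> str:
    """Action policy: 'if (S) then do a2'; the desired state stays implicit."""
    return "a2" if state == "S" else "noop"


def goal_policy(desired: str) -> str:
    """Goal policy: compute the action most likely to reach the desired state."""
    candidates = [a for a, s in OUTCOME.items() if s == desired]
    return max(candidates, key=P_SUCCESS.get)


def utility_policy() -> str:
    """Utility policy: pick the action maximizing net value V(s) - C(a)."""
    return max(OUTCOME, key=lambda a: VALUE[OUTCOME[a]] - COST[a])


print(action_rule("S"), goal_policy("s2"), utility_policy())
```

With these numbers all three policies select `a2`, but only the utility version would change its answer automatically if the values or costs changed; the other two would need a human to rewrite the rule or the goal.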
[Figure: hierarchy of specification levels, rising toward higher-level specifications. Labels in extracted order: Machine code; [More levels of code hierarchy]; Workflows; Programming; Rules; Adapters, Translators; Actions; Generative Planning; Decision-theoretic Planning; Element Goals; Optimization; Element utility functions; Modeling, Optimization; System utility functions.]
Research Challenges in Autonomic Computing | CMU, September 4, 2003
Challenge: Interfaces
Specify goals and objectives to AC systems, and visualize their potential effect
Techniques must be
– Sufficiently expressive of preferences regarding cost vs. performance, security, risk and reliability
– Sufficiently structured and/or naturally suited to human psychology and cognition to keep specification errors to an absolute minimum
– Robust to specification errors
Challenge: Learning
Establish theoretical foundation for understanding and performing learning and optimization in multi-agent systems.
AE needs to learn a model of itself and environment quickly; environment is noisy, and dynamic in both state and structure
On-line, so exploration of the space can be costly and/or harmful
May be several hundreds of tunable parameters!
– Maybe only a few dozen are relevant, but which ones?
– Some of them can only be changed upon reboot – is it worthwhile?
Multi-agent system: several interacting learners
What are good learning algorithms for cooperative, competitive systems?
– What are conditions for stability?
– What is sensitivity to perturbations?
Opportunities for layered learning
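The cost of on-line exploration can be illustrated with a cautious tuning sketch: change one parameter at a time by a small step and keep the change only if a measurement improves. The function names, the parameter names `cache` and `unused`, and the toy performance measure are assumptions; real systems would face noise and far more parameters.

```python
def tune_online(measure, params: dict, step: float = 0.1, rounds: int = 20) -> dict:
    """Coordinate hill climbing: try one small change at a time, keep it only if it helps."""
    best = dict(params)
    best_score = measure(best)
    for _ in range(rounds):
        for name in list(params):
            for delta in (step, -step):
                trial = dict(best)
                trial[name] = best[name] + delta
                score = measure(trial)
                if score > best_score:
                    best, best_score = trial, score
                    break   # accept the improvement, move to the next parameter
    return best


def toy_performance(p: dict) -> float:
    """Stand-in for a live measurement; only 'cache' matters, 'unused' is irrelevant."""
    return -(p["cache"] - 1.0) ** 2


result = tune_online(toy_performance, {"cache": 0.0, "unused": 5.0})
print(result)
```

The irrelevant parameter is never moved because changing it never improves the score, yet every trial on it still costs a live measurement. That wasted effort is the practical side of the "which ones?" question.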
Challenge: Negotiation
Methods for expressing or computing preferences
Negotiation protocols
Negotiation algorithms
Explore conditions under which to apply
– Bilateral
– Multi-lateral (mediated, or not)
– Supply-chain
Study how system behavior depends on mixture of negotiation algorithms in AE population
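A bilateral negotiation can be sketched as two parties conceding toward their private limits until their offers cross. The fixed-fraction concession strategy, the split-the-difference settlement, and all parameter names are assumptions for illustration, not a protocol from the AC program.

```python
def negotiate(buyer_limit: float, seller_limit: float,
              buyer_open: float, seller_open: float,
              concession: float = 0.2, max_rounds: int = 50):
    """Both sides concede a fixed fraction of the gap to their limit each round."""
    bid, ask = buyer_open, seller_open
    for round_no in range(1, max_rounds + 1):
        if bid >= ask:
            # Offers have crossed: settle by splitting the difference.
            return round_no, (bid + ask) / 2
        bid += concession * (buyer_limit - bid)
        ask -= concession * (ask - seller_limit)
    return None   # no agreement within the round budget


print(negotiate(buyer_limit=100, seller_limit=80, buyer_open=50, seller_open=150))
print(negotiate(buyer_limit=60, seller_limit=80, buyer_open=50, seller_open=150))
```

When the buyer's limit is below the seller's limit the offers never cross and the function returns `None`. Swapping in a different concession rule for one party is a small-scale version of studying how outcomes depend on the mixture of negotiation algorithms.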
Challenge: Emergent Behavior
How do self-*, stability, etc. depend on
– Behaviors and goals of the autonomic elements
– Pattern and type of interactions among AEs
– External influences and demands on system
Invert relationship to attain desired global behavior
– How?
– Are there fundamental limits?
Hierarchical
Distributed
Background and Motivation
Autonomic Computing Research at IBM
Architecture
Scenarios
Overview of Research Program
Research Challenges
Conclusions
Autonomic Computing is a grand challenge, requiring advances in several fields of science and technology
Policy, planning, learning, knowledge representation, multi-agent systems, negotiation, emergent behavior
Human-system interfaces
Integrating these technologies to support self-management in complex, realistic environments is a research challenge in itself
What are the best architectures and design patterns? Role of (multi-)agent systems?
Building system prototypes is key to developing and validating AC technology and architecture
The e-Science community is facing many of the same challenges
Which ones are you most interested in tackling?
How might we collaborate?
AMTS in OMII?
Conferences (come to ICAC ’04 in NYC May 17-18)
Encouragement from EPSRC?
Seek effective collaborations with IBM Researchers
A Vision of Autonomic Computing
IEEE Computer, January 2003
IBM Systems Journal special issue on Autonomic Computing
http://www.research.ibm.com/journal/sj42-1.html
Web site
www.research.ibm.com/autonomic
www.autonomic-conference.org
May 17-18, New York City
Submission deadline: January 12, 2003