
IBM Research

Research Challenges in Autonomic Computing

Jeff Kephart
IBM Research
kephart@us.ibm.com
www.research.ibm.com/autonomic

© 2003 IBM Corporation


Outline

• Background and Motivation
• Autonomic Computing Research at IBM
  – Architecture
  – Overview of Research Program
• Autonomic Computing Research Challenges
• Conclusions

Research Challenges in Autonomic Computing | EPSRC e-Science Meeting, March 26, 2004 © 2003 IBM Corporation


Background and Motivation (Kephart)

• My role in autonomic computing
  – My group does research on agents and multi-agent systems: architecture, communication, negotiation, machine learning
  – AC research strategy; joint program manager
  – University relations: faculty awards, equipment grants
  – Co-chair, International Conference on Autonomic Computing 2004
• What I hope to achieve here
  – Explore overlaps between the research interests of the e-Science and AC communities
  – Explore potential collaborations


Complex heterogeneous infrastructures are a reality!

• Dozens of systems and applications
• Hundreds of components
• Thousands of tuning parameters

[Figure: diagram of a heterogeneous infrastructure; visible labels: DNS, Server, Web, Application.]


Autonomic Computing: Motivation

• Individual system elements are increasingly difficult to maintain and operate
  – Hundreds of configuration and tuning parameters for commercial databases, servers, storage
• Heterogeneous systems are becoming increasingly connected
  – Integration is becoming ever more difficult
• Architects can't intricately plan component interactions
  – Systems are increasingly dynamic, more often incorporating unanticipated components
• This places a greater burden on system administrators, but
  – they are already overtaxed
  – they are already a major source of cost (6:1 for storage) and error
• We need self-managing computing systems
  – Behavior specified by system administrators via high-level policies
  – System and its components figure out how to carry out policies


Facets of Self-Management

Self-configure
  – The human-intensive present: Corporate data centers are multivendor and multi-platform. Installing, configuring, and integrating systems is time-consuming and error-prone.
  – The autonomic future: Automated configuration of components and systems according to high-level policies; the rest of the system adjusts seamlessly.
Self-optimize
  – The human-intensive present: Web servers and databases have hundreds of nonlinear tuning parameters, with many new ones in each release, all adjusted manually.
  – The autonomic future: Components and systems will continually strive to improve their own performance and efficiency.
Self-heal
  – The human-intensive present: Problem determination in large, complex systems can take a team of programmers weeks.
  – The autonomic future: Automated detection, diagnosis, and repair of localized software/hardware problems.
Self-protect
  – The human-intensive present: Manual vulnerability analysis; manual detection of and recovery from attacks and cascading failures.
  – The autonomic future: Automated defense against malicious attacks or cascading failures; early warning used to anticipate and prevent system-wide failures.

Business case: increased resiliency, responsiveness, efficiency, ROI; reduced down-time, risk, time-to-value, cost


Evolving towards Autonomic Computing Systems


Outline

• Background and Motivation
• Autonomic Computing Research at IBM
  – Architecture
  – Overview of Research Program
• AI Research Challenges
• Conclusions


Autonomic Computing Architecture
The Autonomic Element

• AEs are the basic atoms of autonomic systems
• An AE contains
  – Exactly one autonomic manager
  – Zero or more managed elements
• An AE is responsible for
  – Managing its own behavior in accordance with policies
  – Interacting with other autonomic elements to provide or consume computational services

[Figure: An Autonomic Element. An Autonomic Manager (Monitor, Analyze, Plan, Execute, sharing Knowledge) connects through sensor (S) and effector (E) interfaces to a Managed Element.]

Examples of autonomic elements: software application, workload manager, sentinel, arbiter, OGSA infrastructure elements
Perspectives: service-oriented architecture; software agents
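The Monitor, Analyze, Plan, Execute loop over shared Knowledge in the figure can be sketched in a few lines. This is a toy under stated assumptions, not the Autonomic Manager ToolSet: the class names, the single `threads` knob, the inverse response-time model, and the "add two threads" plan are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManagedElement:
    """Hypothetical managed element with one sensor and one effector."""
    threads: int = 10
    response_time_ms: float = 300.0

    def sense(self) -> Dict[str, float]:
        # Sensor (S): expose current state to the autonomic manager.
        return {"threads": self.threads, "response_time_ms": self.response_time_ms}

    def effect(self, threads: int) -> None:
        # Effector (E): apply a new setting. Assumed toy model: response
        # time is inversely proportional to the number of worker threads.
        self.threads = threads
        self.response_time_ms = 3000.0 / threads


@dataclass
class AutonomicManager:
    element: ManagedElement
    max_response_time_ms: float  # high-level policy supplied by an administrator
    knowledge: List[Dict[str, float]] = field(default_factory=list)

    def monitor(self) -> Dict[str, float]:
        data = self.element.sense()
        self.knowledge.append(data)  # Knowledge: history of observations
        return data

    def analyze(self, data: Dict[str, float]) -> bool:
        # Is the policy currently violated?
        return data["response_time_ms"] > self.max_response_time_ms

    def plan(self, data: Dict[str, float]) -> int:
        # Naive plan (hypothetical): add two threads per violation.
        return int(data["threads"]) + 2

    def execute(self, threads: int) -> None:
        self.element.effect(threads)

    def step(self) -> None:
        data = self.monitor()
        if self.analyze(data):
            self.execute(self.plan(data))


elem = ManagedElement()
mgr = AutonomicManager(elem, max_response_time_ms=200.0)
for _ in range(10):
    mgr.step()
print(elem.threads, elem.response_time_ms)  # 16 187.5
```

The loop acts only while the policy is violated (300, then 250, then about 214 ms) and goes quiet once the response time drops to 187.5 ms; the Knowledge list still records every observation.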


Autonomic Computing Architecture
Element interactions

• System self-* properties and behavior arise from interactions among autonomic managers
• Interactions are
  – Dynamic, ephemeral
  – Formed by (negotiated) agreement
  – Flexible in pattern; determined by policies
• Based on OGSI -> WSRF, …? and specific AC extensions

A multi-agent system!


Example scenario: Autonomic Data Center

[Figure: Autonomic Data Center. Components: System Manager, Resource Arbiter, Policy Repository, Registry, and two Application Managers, each managing an Application Environment (Database, Router, Server, Storage) that receives external Demand. Labels: service-level utility, resource-level utility.]
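One plausible reading of the figure is that each Application Manager reports a resource-level utility (how much it values a given amount of resource) and the Resource Arbiter allocates resources to maximize the total. A minimal sketch under those assumptions: servers are identical and interchangeable, utilities are lookup tables, and the arbiter searches exhaustively. The function `arbitrate` and the `app_A`/`app_B` utility values are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def arbitrate(utilities: Dict[str, List[float]], total: int) -> Tuple[float, Dict[str, int]]:
    """Allocate up to `total` identical servers among application managers.

    utilities[app][n] is the resource-level utility the app reports for n
    servers, n = 0..total. Exhaustive search: fine for a toy, but
    exponential in the number of applications.
    """
    apps = list(utilities)
    best_value, best_alloc = float("-inf"), {}
    for counts in product(range(total + 1), repeat=len(apps)):
        if sum(counts) > total:
            continue  # infeasible: more servers than the data center has
        value = sum(utilities[app][n] for app, n in zip(apps, counts))
        if value > best_value:
            best_value, best_alloc = value, dict(zip(apps, counts))
    return best_value, best_alloc


value, alloc = arbitrate(
    {"app_A": [0, 10, 18, 22, 24], "app_B": [0, 6, 12, 17, 20]}, total=4
)
print(value, alloc)  # 30 {'app_A': 2, 'app_B': 2}
```

Because both utility curves flatten out, splitting the servers evenly (18 + 12) beats giving either application most of them; a real arbiter would likely use dynamic programming or a solver rather than enumeration.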


Overview of IBM’s Autonomic Computing Research Program

100-150 researchers working on various aspects of Autonomic Computing
• Some projects predate the AC initiative; now trying to realign them with the AC architecture
• Technologies for specific autonomic elements
  – Database, storage, network, server, client, …
• Generic element technologies for autonomic elements
  – Autonomic Manager Toolset integrates many element-level technologies
    – Modeling, analysis, forecasting, optimization, planning, feedback control, etc.
  – Uses Open Grid Services Architecture standards for inter-element communication
  – Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later
• Generic system-level technologies
  – Dependency management, problem determination and remediation, workload management, provisioning, …
• System scenarios and prototypes
  – Small- to medium-scale autonomic systems
  – Demonstrate self-* arising from AC architecture + technology
  – Identify gaps, necessary modifications


Overview of IBM’s Autonomic Computing Research Program

Over 150 researchers working on various aspects of Autonomic Computing
• Some projects predate the AC initiative; now trying to realign them with the AC architecture
• Technologies for specific autonomic elements
  – Database, storage, server, client, …
• Generic element technologies for autonomic elements
  – Autonomic Manager Toolset integrates many element-level technologies
    – Modeling, analysis, forecasting, optimization, planning, feedback control, etc.
  – Uses Open Grid Services Architecture standards for inter-element communication
  – Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later
• Generic system-level technologies
  – Dependency management, problem determination and remediation, workload management, provisioning, …
• System scenarios and prototypes
  – Small- to medium-scale autonomic systems
  – Demonstrate self-* arising from AC architecture + technology
  – Identify gaps, necessary modifications


Autonomic Manager ToolSet
W. Arnold et al., Watson

• Facilitates autonomic manager construction
  – In accordance with the AC architecture
• Catcher for generic AM technologies
  – OGSI (Globus 3.0 beta) -> WSRF
  – Policy tools
  – Monitoring standards and technologies
  – AI tools for knowledge representation, reasoning, planning
  – Math libraries for modeling, optimization
  – Feedback control
• AMTS V1.0 available as part of Emerging Technologies Toolkit v1.1 on IBM alphaWorks (www.alphaworks.ibm.com)
  – Considering open source
  – Should we think about OMII?

[Figure: Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) with sensor (S) and effector (E) interfaces, connected to a Managed Element.]


Dependency Mgt & Self-Healing
G. Kar, Watson, and H. Lee & S. Ma, Watson

• Determine functional dependencies among elements
  – Mine design docs, system configuration metadata, log files
  – Actively probe the running system
• Use dependency information for system configuration, healing
• Problem management lifecycle
  – Monitor -> Detect -> Localize -> Repair -> Learn

[Figure: Probe and Analysis & Control components attached to a system of Router, Web Server, App Server, and DB Server, with a Dependency Matrix relating probes (pingR, pingWS, pingAS, pingDBS, pWS, pAS, pDBS) to components (WS, AS, DBS, R, HWS, HAS, HDBS). The 0/1 matrix entries are not recoverable from the extracted text.]
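A probe-to-component dependency matrix supports fault localization: a failed probe implicates the components it depends on, and a successful one exonerates them. A minimal sketch, assuming a single fault and deterministic probes. The matrix below is hypothetical (the slide's entries did not survive), and reading the H-prefixed columns as the hosts of WS, AS, and DBS is an assumption.

```python
from typing import Dict, Set

# Hypothetical dependency matrix (NOT the one on the slide): each probe maps
# to the set of components it exercises. R = router; WS/AS/DBS = web,
# application, and database servers; HWS/HAS/HDBS are assumed to be their hosts.
DEPENDS: Dict[str, Set[str]] = {
    "pingR": {"R"},
    "pingWS": {"R", "HWS"},
    "pingAS": {"R", "HAS"},
    "pingDBS": {"R", "HDBS"},
    "pWS": {"R", "HWS", "WS"},
    "pAS": {"R", "HWS", "WS", "HAS", "AS"},
    "pDBS": {"R", "HWS", "WS", "HAS", "AS", "HDBS", "DBS"},
}


def localize(results: Dict[str, bool]) -> Set[str]:
    """Return components consistent with a single-fault hypothesis.

    results[probe] is True if the probe succeeded.
    """
    failed = [p for p, ok in results.items() if not ok]
    if not failed:
        return set()
    # The faulty component must be exercised by every failed probe...
    suspects = set.intersection(*(DEPENDS[p] for p in failed))
    # ...and by no probe that succeeded.
    for p, ok in results.items():
        if ok:
            suspects -= DEPENDS[p]
    return suspects


# Application server process down while its host still answers pings:
results = {p: True for p in DEPENDS}
results["pAS"] = False
results["pDBS"] = False
print(localize(results))  # {'AS'}
```

This covers only the Localize step of the lifecycle; with multiple simultaneous faults the intersection can come back empty, which is one reason the slide pairs probing with mined dependency data.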


Enterprise Workload Management
D. Dillenberger

[Figure labels: Internet/Extranet; Data and Transaction]

Large, distributed, heterogeneous system
• Achieves end-to-end performance via adaptive algorithms
• Administrator defines policy
  – Desired response times for various classes of users, apps
• eWLM managers on each resource cooperate to adaptively tune parameters
  – OS, network, storage, virtual server knobs
  – JVM heap size, # garbage collection threads
  – Workload balancing, routing parameters
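Adaptively tuning a knob until a response-time goal is met can be illustrated with a single discrete integral controller. This is a sketch under stated assumptions, not eWLM's algorithm: one service class, one knob (its fractional share of a resource), and a hypothetical plant in which response time is inversely proportional to that share. `tune_share`, `response_time_ms`, and all numbers are illustrative.

```python
def response_time_ms(share: float, work_ms: float = 100.0) -> float:
    """Hypothetical plant: a service class's response time shrinks as its
    share of a resource grows."""
    return work_ms / share


def tune_share(goal_ms: float, share: float, gain: float = 0.1, steps: int = 100) -> float:
    """Discrete integral (incremental) controller for one service class."""
    for _ in range(steps):
        error = (response_time_ms(share) - goal_ms) / goal_ms  # > 0 means too slow
        # Accumulate the normalized error into the knob, clamped to a valid range.
        share = min(1.0, max(0.05, share + gain * error))
    return share


share = tune_share(goal_ms=200.0, share=0.25)
print(round(share, 3), round(response_time_ms(share), 1))  # 0.5 200.0
```

Starting from a 25% share (400 ms), the controller raises the share step by step until the 200 ms goal is met at a 50% share. Too large a gain would make it overshoot and oscillate, which is where cooperation among managers on different resources gets hard.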


Human Interaction with Autonomic Systems
P. Maglio, Almaden

• Basic questions
  – What do middleware administrators do?
  – How can we better support the problems and practices they have?
• Learn answers to these questions via ethnographic studies
• Use insights to develop new ways to interact with complex computing systems

Example quotes:
  “… but we thought that was the return port!”
  “We had it wrong. Our assumption of how it worked was incorrect.”
  “We start with looking at the proxy server log files, then the web server log files, then the application server admin log files, then the application log files.”


Outline

• Background and Motivation
• Autonomic Computing Research at IBM
  – Architecture
  – Overview of Research Program
  – Scenarios
• Autonomic Computing Research Challenges
  – Systems and Software
    – Architecture, software engineering & tools, testing/validation
    – Prototyping a large-scale self-* system
  – Human-Computer Interaction
    – Policies, Interfaces
  – Artificial Intelligence
    – Learning, Negotiation, Self-healing, Emergent Behavior
• Conclusions


Challenge: Architecture

Define a set of fundamental architectural principles from which self-* emerges
• AE: How to coordinate multiple threads of activity?
  – AEs live in complex environments
    – Multiple task instances and types: concurrent, asynchronous
    – Multiple interacting expert modules
• AE: How to detect/resolve conflicts arising from
  – Internal decisions by independent expert modules
  – External directives (possibly asynchronous)
  – Internal policies vs. external directives
• System level: Enable more flexible, service-oriented patterns of interaction
  – As opposed to traditional top-down, hierarchical systems management
  – Multi-agent architecture
    – Communication
    – Representing and reasoning about needs, capabilities, dependencies

[Figure: An Autonomic Element: an Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) with sensor (S) and effector (E) interfaces, above a Managed Element.]
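One of the conflict questions above (independent modules or external directives asking for different settings of the same knob) can be made concrete with a small detector. A minimal sketch, assuming each directive carries a numeric priority and the highest priority wins; real conflict resolution might instead negotiate, merge, or escalate. `Directive`, `resolve`, and the module and knob names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Directive:
    source: str    # internal expert module or external manager (hypothetical names)
    knob: str      # parameter the directive wants to set
    value: float
    priority: int  # assumed resolution rule: higher priority wins


def resolve(directives: List[Directive]) -> Tuple[Dict[str, float], List[str]]:
    """Detect directives that set the same knob to different values and keep,
    for each knob, the highest-priority directive (earliest wins on ties)."""
    chosen: Dict[str, Directive] = {}
    conflicts: List[str] = []
    for d in directives:
        current = chosen.get(d.knob)
        if current is None:
            chosen[d.knob] = d
            continue
        if current.value != d.value:
            conflicts.append(f"{d.knob}: {current.source}={current.value} vs {d.source}={d.value}")
        if d.priority > current.priority:
            chosen[d.knob] = d
    return {knob: d.value for knob, d in chosen.items()}, conflicts


settings, conflicts = resolve([
    Directive("optimizer_module", "heap_mb", 1024, priority=1),
    Directive("healing_module", "heap_mb", 512, priority=2),
    Directive("external_manager", "threads", 16, priority=3),
])
print(settings)   # {'heap_mb': 512, 'threads': 16}
print(conflicts)  # ['heap_mb: optimizer_module=1024 vs healing_module=512']
```

A static priority order is the simplest design choice; it resolves every conflict but says nothing about asynchronous directives arriving mid-decision, which the slide also lists as open.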


Challenge: Policy

Policy: “Set of guidelines or directives provided to an autonomic element to influence its behavior”
• Human interface
  – Authoring and understanding policies
  – Avoiding or ameliorating specification errors
• Developing a universal representation and grammar
  – Many different application domains, disciplines
  – Many different flavors of policy
  – Covers service agreements too?
• Algorithms that operate upon policies (and agreements?)
  – Automated derivation of actions (e.g. planning, optimization)
  – Automated derivation of lower-level policies from high-level policies
    – E.g. “Maximize profit from this set of service contracts”
  – Conflict resolution
    – Both design time and run time
• Need to establish protocols, interfaces, algorithms

[Figure: An Autonomic Element: an Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) with sensor (S) and effector (E) interfaces, above a Managed Element.]


Three flavors of policy (policy = “decision-making guide”)

• Action rule
  – If (S) then do $a_2$
  – Results implicitly in desired state $s_2$
• Goal
  – Achieve a most desired state $s_2$
  – Compute the action $a_2$ most likely to result in $s_2$
  – Assumes that the most desired state can be determined a priori
• Utility function
  – Achieve the state $s$ with maximal net value $V(s) - C(a_{S \to s})$
  – Benefit and burden of being explicit about value
  – States have intrinsic value; the value of a policy is a derived quantity

[Figure: from the current state S, actions $a_1$, $a_2$, $a_3$ lead to possible states $s_1$, $s_2$, $s_3$.]

[Figure: hierarchy of specifications, from machine code at the bottom to higher-level specifications at the top. Labels: machine code; [more levels of code hierarchy]; workflows; programming; rules; adapters, translators; actions; generative planning; decision-theoretic planning; element goals; optimization; element utility functions; modeling, optimization; system utility functions.]

Research Challenges in Autonomic Computing | CMU, September 4, 2003
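The three flavors can be contrasted on the slide's own toy example (current state S, actions a_1 to a_3, states s_1 to s_3). A minimal sketch, assuming deterministic transitions (the slide's goal policy speaks of the action "most likely" to reach s_2) and made-up values V and costs C; all function names are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical deterministic world model: the state each action leads to
# from the current state "S".
TRANSITIONS: Dict[str, str] = {"a1": "s1", "a2": "s2", "a3": "s3"}


def action_rule(current: str) -> str:
    # Action rule: "if (S) then do a2"; the desired state s2 stays implicit.
    return "a2" if current == "S" else "no-op"


def goal_policy(goal: str) -> str:
    # Goal: the most desired state is fixed a priori; find an action reaching it.
    return next(a for a, s in TRANSITIONS.items() if s == goal)


def utility_policy(value: Callable[[str], float], cost: Callable[[str], float]) -> str:
    # Utility function: pick the action maximizing net value V(s) - C(a).
    return max(TRANSITIONS, key=lambda a: value(TRANSITIONS[a]) - cost(a))


V = {"s1": 5.0, "s2": 9.0, "s3": 10.0}
C = {"a1": 1.0, "a2": 2.0, "a3": 4.0}
choice_before = utility_policy(lambda s: V[s], lambda a: C[a])
print(action_rule("S"), goal_policy("s2"), choice_before)  # a2 a2 a2 (net: a1=4, a2=7, a3=6)

C["a3"] = 2.0  # reaching s3 gets cheaper
choice_after = utility_policy(lambda s: V[s], lambda a: C[a])
print(choice_after)  # a3 (net value 8 now beats 7)
```

The rule and the goal keep returning a2 no matter what changes, whereas the utility policy re-derives the preferred state from values and costs. This is the sense in which the value of a policy is a derived quantity.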


Challenge: Human-System Interface

Develop new languages, metaphors, and translation technologies that enable humans to monitor, visualize, and control AC systems
• Specify goals and objectives to AC systems, and visualize their potential effect
• Techniques must be
  – Sufficiently expressive of preferences regarding cost vs. performance, security, risk, and reliability
  – Sufficiently structured and/or naturally suited to human psychology and cognition to keep specification errors to an absolute minimum
  – Robust to specification errors


Challenge: Learning

Establish a theoretical foundation for understanding and performing learning and optimization in multi-agent systems.
• Single element level
  – An AE needs to learn a model of itself and its environment quickly; the environment is noisy and dynamic in both state and structure
  – Learning is on-line, so exploring the space can be costly and/or harmful
  – There may be several hundred tunable parameters!
    – Perhaps only a few dozen are relevant, but which ones?
    – Some can only be changed upon reboot: is changing them worthwhile?
• System level
  – Multi-agent system: several interacting learners
  – What are good learning algorithms for cooperative and competitive systems?
    – What are the conditions for stability?
    – What is the sensitivity to perturbations?
  – Opportunities for layered learning
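The single-element question "only a few dozen parameters are relevant, but which ones?" is often attacked first by screening. As one illustration (a standard one-factor-at-a-time screen, not a method from the talk), the sketch below perturbs each parameter once and keeps those that move performance noticeably. The performance model, parameter names, and thresholds are hypothetical. On a live system each evaluation is an on-line experiment, which is exactly the cost the slide warns about.

```python
from typing import Callable, Dict, List


def screen_parameters(perf: Callable[[Dict[str, float]], float],
                      baseline: Dict[str, float],
                      delta: float = 0.1,
                      threshold: float = 0.01) -> List[str]:
    """One-factor-at-a-time screening: raise each parameter by `delta`
    (relative) and keep those whose effect on measured performance exceeds
    `threshold` (relative). Each call to `perf` stands for one on-line
    experiment, so this costs len(baseline) + 1 measurements."""
    base = perf(baseline)
    relevant = []
    for name, value in baseline.items():
        trial = dict(baseline)
        trial[name] = value * (1 + delta)
        if abs(perf(trial) - base) / abs(base) > threshold:
            relevant.append(name)
    return relevant


def toy_throughput(p: Dict[str, float]) -> float:
    # Hypothetical performance model: two parameters matter, one barely does.
    return 50 * p["buffer_pool_gb"] + 20 * p["worker_threads"] + 0.001 * p["log_level"]


relevant = screen_parameters(
    toy_throughput, {"buffer_pool_gb": 4.0, "worker_threads": 8.0, "log_level": 1.0}
)
print(relevant)  # ['buffer_pool_gb', 'worker_threads']
```

One-at-a-time screening misses interactions between parameters; it is a cheap first pass before a more careful (and more expensive) model is learned over the surviving parameters.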


Challenge: Negotiation

Develop and analyze
• Methods for expressing or computing preferences
• Negotiation protocols
• Negotiation algorithms

Establish a theoretical foundation for negotiation
• Explore conditions under which to apply
  – Bilateral
  – Multi-lateral (mediated, or not)
  – Supply-chain
• Study how system behavior depends on the mixture of negotiation algorithms in the AE population
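For the bilateral case, here is a minimal sketch of one protocol/algorithm pair: alternating offers with a linear time-dependent concession tactic, plus a rule that accepts any offer at least as good as one's own next counter-offer. This is an illustrative choice, not one proposed in the talk. The prices, deadline, and function names are hypothetical.

```python
from typing import Optional, Tuple


def concession(start: float, reserve: float, t: int, deadline: int) -> float:
    """Time-dependent tactic: move linearly from the opening offer to the
    reservation value as the deadline approaches."""
    return start + (reserve - start) * min(t / deadline, 1.0)


def negotiate(deadline: int = 10) -> Optional[Tuple[int, float]]:
    """Bilateral alternating-offers negotiation over a service price.

    Hypothetical parties: a consumer AE (opens at 30, will pay up to 80) and
    a provider AE (opens at 100, accepts no less than 60). Each side accepts
    an offer that is at least as good as the counter-offer it would make next.
    """
    for t in range(deadline + 1):
        if t % 2 == 0:  # consumer proposes
            offer = concession(30.0, 80.0, t, deadline)
            if offer >= concession(100.0, 60.0, t + 1, deadline):
                return t, offer
        else:  # provider proposes
            offer = concession(100.0, 60.0, t, deadline)
            if offer <= concession(30.0, 80.0, t + 1, deadline):
                return t, offer
    return None  # deadline passed without agreement


result = negotiate()
print(result)  # agreement in round 8 at a price of about 70
```

Swapping in a different tactic for one side (for example, conceding only near the deadline) changes who captures the surplus. This is a small instance of the slide's question about how outcomes depend on the mix of negotiation algorithms in the population.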


Challenge: Control and Harness Emergent Behavior

Understand, control, and exploit emergent behavior in autonomic systems
• How do self-*, stability, etc. depend on
  – Behaviors and goals of the autonomic elements
  – Pattern and type of interactions among AEs
  – External influences and demands on the system
• Invert the relationship to attain desired global behavior
  – How?
  – Are there fundamental limits?
• Develop a theory of interacting feedback loops
  – Hierarchical
  – Distributed
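A toy illustration of why interacting feedback loops need their own theory: n agents each correct the gap between total load and a shared capacity, each unaware of the others. Each agent subtracts k·e from its own load, so the total error e is multiplied by (1 − n·k) per step, and the system is stable only when 0 < n·k < 2. With k = 1.2 a single loop converges, but two identical loops together diverge. The model and parameter values are hypothetical.

```python
from typing import List


def simulate(gain: float, n_agents: int, capacity: float = 100.0, steps: int = 30) -> List[float]:
    """Each agent independently corrects the gap between total load and
    capacity, unaware that the other agents make the same correction (toy model)."""
    loads = [0.0] * n_agents
    totals = []
    for _ in range(steps):
        error = sum(loads) - capacity
        loads = [x - gain * error for x in loads]  # simultaneous, uncoordinated updates
        totals.append(sum(loads))
    return totals


alone = simulate(gain=1.2, n_agents=1)
together = simulate(gain=1.2, n_agents=2)
print(round(alone[-1], 3))      # 100.0: one loop settles
print(abs(together[-1]) > 1e4)  # True: the same loops, combined, oscillate and diverge
```

Each loop is stable in isolation, yet their composition is not: stability is a property of the interaction pattern, not of the individual elements.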


Outline

• Background and Motivation
• Autonomic Computing Research at IBM
  – Architecture
  – Scenarios
  – Overview of Research Program
• Research Challenges
• Conclusions


Conclusions

Autonomic Computing is a grand challenge, requiring advances in several fields of science and technology
• Policy, planning, learning, knowledge representation, multi-agent systems, negotiation, emergent behavior
• Human-system interfaces

Integrating these technologies to support self-management in complex, realistic environments is a research challenge in itself
• What are the best architectures and design patterns? What is the role of (multi-)agent systems?
• Building system prototypes is key to developing and validating AC technology and architecture

The e-Science community is facing many of the same challenges
• Which ones are you most interested in tackling?
• How might we collaborate?
  – AMTS in OMII?
  – Conferences (come to ICAC ’04 in NYC, May 17-18)
  – Encouragement from EPSRC?
  – Seek effective collaborations with IBM Researchers


Additional Information

• “The Vision of Autonomic Computing”, IEEE Computer, January 2003
• IBM Systems Journal special issue on Autonomic Computing
  – http://www.research.ibm.com/journal/sj42-1.html
• Web site
  – www.research.ibm.com/autonomic
• International Conference on Autonomic Computing
  – www.autonomic-conference.org
  – May 17-18, New York City
  – Submission deadline: January 12, 2003


Backup Slides
