Resiliency and self-healing
Visa Holopainen, visa@netlab.tkk.fi
Reinforcement Learning for Autonomic
Network Repair,
M. Littman, N. Ravi, E. Fenson, R. Howard, 2004

Reinforcement learning
– Used to solve Markov decision problems (MDPs)
  – States, actions, rewards, transitions, transition probabilities
– An agent explores an environment in which it perceives its current state and takes actions to reach new states
– A reward is associated with every state
– Reinforcement learning tries to find a policy that maximizes the cumulative reward for a task
(Simplified) Reinforcement Learning example

Which direction should the agent move?
[Figure: the agent and the goal states]
Reinforcement Learning example (cont)

Agent makes random moves until a goal state is reached
[Figure: the agent's moves ending in a goal state]
Reinforcement Learning example (cont)

Now a policy is associated with the state from which the goal state was reached
[Figure: the goal states and the state that received a policy]
Reinforcement Learning example (cont)

Now if at some point a state S (that has a policy associated with it) is reached from state S', a policy is assigned to S' as well
[Figure: states S' and S and the goal states]
Reinforcement Learning example (cont)

After some number of iterations the optimal policies have been formed
[Figure: the goal states and the optimal policy for every other state]
Reinforcement Learning example (cont)

The corresponding state rewards
[Figure: the goal states, with the remaining states labelled with rewards of -1, -2, or -3]
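
The reward picture above can be approximated with a short sketch. This is not the procedure from the slides (which lets the agent make random moves); it swaps in plain value iteration because that is deterministic and easy to check. The 4x4 grid, the two goal positions, and the cost of -1 per move are all assumptions made for illustration.

```python
# Value iteration on a hypothetical 4x4 grid world.
# Assumptions (not from the slides): grid size, goal positions, -1 reward per move.

ROWS, COLS = 4, 4
GOALS = {(0, 0), (3, 3)}  # hypothetical goal states
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, move):
    """Return the state reached by a move; moving off the grid leaves the state unchanged."""
    row, col = state[0] + MOVES[move][0], state[1] + MOVES[move][1]
    return (row, col) if 0 <= row < ROWS and 0 <= col < COLS else state


def value_iteration():
    """Sweep over all states until no value changes; goal states keep the value 0."""
    values = {(r, c): 0 for r in range(ROWS) for c in range(COLS)}
    changed = True
    while changed:
        changed = False
        for state in values:
            if state in GOALS:
                continue
            # Each move costs -1 plus the value of the state it leads to.
            best = max(-1 + values[step(state, move)] for move in MOVES)
            if best != values[state]:
                values[state] = best
                changed = True
    return values


def greedy_policy(values):
    """For every non-goal state, pick the move that leads to the best-valued neighbour."""
    return {
        state: max(MOVES, key=lambda move: values[step(state, move)])
        for state in values
        if state not in GOALS
    }
```

With these assumptions each state's value is the negated number of moves to the nearest goal, which matches the -1, -2, -3 pattern in the picture, and `greedy_policy(value_iteration())` gives a direction for every non-goal state.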
Implemented concept

– Reinforcement learning is used to restore network connectivity after a failure
– Starting state: no connectivity. Goal state: connectivity
– Actions: PingGateway, PingIP, DNSLookup, UseCachedIP, FixIP, RenewLease
– Learned policy in the picture
– Prototype implemented
– Nice concept but not very useful…
Approaches to Building Self Healing
Systems using Dependency Analysis,
J. Gao, G. Kar, P. Kermani, 2004

Problems
– Is there a way to automatically determine the root cause(s) of degraded performance in, e.g., an Internet shopping site?
– Provided that the root cause(s) can be determined, is there a way to fix the problem automatically?
Architecture

Distributed System
– A typical multi-tier e-Business system (web access, database)
The Monitoring System
– Includes monitoring agents that monitor 1) the response time of the system from the user's perspective and 2) the application components (servlets, EJBs, …)
The Dependency Matrix
– Which transactions depend on which system components
Self-healing Engine
– Launched when a performance problem is noticed by the monitoring system
Problem description

– Based on previous work a dependency matrix can be formed
– The matrix shows which customer transactions depend on which system resources
– Using this matrix the system resource that causes a performance problem can be tracked down
– The initial goal was to minimize the number of transactions needed to find the root cause of a problem
– This problem is found to be NP-hard -> a heuristic solution is presented
Solution
– No solution is guaranteed to be found if two or more matrix columns are identical
– Assume that 1) all matrix columns are different and 2) there is only one broken system component
  – Now the solution can be found by the following algorithm
The set of all resources is denoted S. The set of all transactions is denoted T.
1) Run all transactions one by one.
2) If a transaction succeeds, remove all resources that this transaction depends on from S.
3) Finally only one resource is left in S. This is the broken resource.
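
The three steps above can be sketched in a few lines. The dependency matrix is represented here as a dictionary from transaction to the set of resources it uses. The transaction and resource names, and the `run_transaction` callback, are hypothetical and only serve the illustration.

```python
def locate_broken_resource(dependencies, run_transaction):
    """Return the resources still suspected after running every transaction.

    dependencies: dict mapping a transaction name to the set of resources it depends on.
    run_transaction: callable that returns True if the given transaction succeeds.
    """
    suspects = set().union(*dependencies.values())  # the set S of all resources
    for transaction, resources in dependencies.items():  # the set T of all transactions
        if run_transaction(transaction):
            # A successful transaction clears every resource it depends on.
            suspects -= resources
    return suspects


# Hypothetical dependency matrix: three transactions over three resources.
EXAMPLE_DEPENDENCIES = {
    "browse": {"web", "db"},
    "login": {"web", "auth"},
    "checkout": {"db", "auth"},
}


def simulate_with_broken(broken):
    """Build a run_transaction stand-in: a transaction fails if it uses the broken resource."""
    return lambda transaction: broken not in EXAMPLE_DEPENDENCIES[transaction]
```

For example, `locate_broken_resource(EXAMPLE_DEPENDENCIES, simulate_with_broken("db"))` leaves only `"db"` in the suspect set. The sketch returns a set rather than a single name, because the set shrinks to one resource only when every healthy resource is used by at least one succeeding transaction.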

Solution (cont)

– If the fixed set of customer transactions cannot locate the root cause of a performance problem, synthetic transactions need to be created and executed
– Many practical difficulties exist in doing so
– No testing
Ensembles of Models for Automated
Diagnosis of System Performance
Problems, S. Zhang, I. Cohen, M. Goldszmidt, J. Symons, A. Fox, 2005


Ensemble = collection
SLA contains Service Level Objectives (SLO)
– SLO example: “Server downtime < X sec in a day”
Problem: Which system metrics correlate with SLO violations?
– Example system metrics: CPU metrics, memory, I/O, network activity coming in and out of servers, swap space usage, paging, etc.
Tree Augmented Naïve Bayes (TAN) models
– Determine which low-level metrics most likely contributed to an SLO violation
– A mapping function is learned by the algorithm
TAN model example

– “Given SLO state (SLO violation) S, what is the most predictive set of system-level metrics for S”
– Combinations of metrics are more predictive of SLO violations than individual metrics
– A small number of metrics (3-8) is usually sufficient to predict an SLO violation
Multiple TAN models

– TAN models that are built using data collected under some conditions don't work well on data collected under different conditions -> need to maintain multiple TAN models
– The model that best suits the current conditions is chosen by using the Brier score
  – The Brier score is similar to Mean Squared Error (MSE) and offers a fine-grained evaluation of a model
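
As a rough illustration of the selection step, the sketch below computes the Brier score (the mean squared difference between a predicted violation probability and the observed 0/1 outcome) and picks the model with the lowest score. The function names, the model names, and the idea of scoring over a list of recent predictions are assumptions for the example, not details taken from the paper.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted_probs, outcomes)) / len(outcomes)


def pick_best_model(model_predictions, outcomes):
    """Return the name of the model with the lowest Brier score.

    model_predictions: dict mapping a (hypothetical) model name to its predicted
    probabilities of an SLO violation, one per observation.
    outcomes: observed SLO states, 1 for a violation and 0 otherwise.
    """
    return min(
        model_predictions,
        key=lambda name: brier_score(model_predictions[name], outcomes),
    )
```

A lower score is better: a model that always predicts 0.5 scores 0.25, while a model that is always confidently right scores 0.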
Results

– Ensembles of models outperform a single model
– Also do slightly better than a workload-specific approach
  – Indicates that some workload conditions are too complex for a single model
[Figure legend: BA = Balanced Accuracy, FA = False Alerts, Det = Detections]
TAN summary

– An ensemble of models performs better than a single model
– The approach allows for rapid adaptation to changing conditions
– No domain-specific knowledge is required
– Different workloads seem to be characterized by different metric-attribution “signatures” (future work)
Towards Autonomic Web Services:
Achieving Self-Healing Using Web
Services, S. Gurguis, A. Zeid, 2005





– CBE-log is a representation format into which the log files of all different applications can be converted
– Diagnosis Engine selects a set of repair actions
– The Symptoms Database is an XML file containing symptoms and recovery actions
– Rule Engine decides which repair actions should be taken based on the Policy Database
– No prototype implemented
– A typical record in the Symptoms Database is presented in the picture
– Possible application: legacy systems
Reflection, Self-Awareness and Self-Healing in OpenORB, G. Blair, G. Coulson, et al. 2002

OMG (Object Management Group)
– An open membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise applications
OMG CORBA (Common Object Request Broker Architecture)
– Open, vendor-independent architecture and infrastructure that computer applications use to work together over networks
– Supports communication between different types of operating systems, programming languages and networks
– Interfaces defined in OMG IDL (Interface Definition Language)
– Mappings exist between IDL and C, C++, Java, COBOL, Smalltalk, Ada, Lisp, Python, and IDLscript
OpenORB
– Provides a Java implementation of the OMG CORBA 2.4.2 specification
[Figure: example of OMG IDL <-> C mappings]
OpenORB self-healing

– Meta-interface supports access to the underlying platform
– OpenORB supports the ability to discover meta-information about the current system, both in terms of its structure and ongoing behaviour
– System properties can also be adapted by using the appropriate meta-interfaces
– A management component can be introduced (dynamically) into the various meta-space models
??
Measuring the Effectiveness of Self-Healing Autonomic Systems, A. Brown, C. Redlin, 2005

SPEC (Standard Performance Evaluation Corporation)
– Non-profit corporation that maintains a standardized set of relevant benchmarks applicable to the newest generation of high-performance computers
SPEC jAppServer2004
– Benchmark for measuring the performance of J2EE application servers
– An end-to-end application which exercises all major J2EE technologies
Based on jAppServer2004 a benchmarking system was created that is capable of quantifying the autonomic self-healing capability of a large-scale J2EE software solution
The system is used in various production environments
The Architecture

30 different types of disturbances representing common failure modes can be injected into the SUT (system under test)
– Component shutdowns, data loss, resource exhaustion, load surges, operator errors, ...
Two metrics are used to evaluate the SUT's self-healing capacity
1) How effectively the SUT heals itself
   – Basically measured by counting how many requests jAppServer2004 gets right under a disturbance compared to normal working conditions
2) How autonomic the healing response is
   – A 90-question survey is used
The Survey

The 90-question survey assigns points to the SUT based on the level of automation present in its response to each disturbance (based on IBM's autonomic computing maturity model)
– 0 points for a basic manual response, 1 point for a managed response, 2 for predictive, 4 for adaptive, and 8 for autonomic
“...Our baseline run on SUT #1 resulted in an average healing
effectiveness score of 0.79 and an autonomic maturity score of
0.15 (both out of 1.0), indicating a relatively low level of
autonomic self-healing capability. In comparison, SUT #2
attained an effectiveness score of 0.83 and a maturity score of
0.22. Comparing the two results indicates that SUT #2’s system
management technology provided a small—but measurable—
improvement in autonomic capability...”
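
To make the two scores more concrete, here is a minimal sketch of how values on a 0 to 1 scale could be derived. The point values come from the survey slide, but the normalization is an assumption: effectiveness is taken as the ratio of correct requests under a disturbance to correct requests under normal conditions, and maturity as the awarded points divided by the maximum obtainable points. The function names and level keys are hypothetical, and the paper's exact formulas may differ.

```python
# Points per maturity level, as listed on the survey slide.
MATURITY_POINTS = {
    "basic": 0,
    "managed": 1,
    "predictive": 2,
    "adaptive": 4,
    "autonomic": 8,
}


def healing_effectiveness(correct_under_disturbance, correct_baseline):
    """Assumed definition: share of baseline-correct requests still handled correctly."""
    if correct_baseline <= 0:
        raise ValueError("baseline must be positive")
    return correct_under_disturbance / correct_baseline


def maturity_score(levels):
    """Assumed normalization: awarded points divided by the maximum possible points.

    levels: one maturity level name per survey answer.
    """
    if not levels:
        raise ValueError("at least one answer is required")
    max_points = max(MATURITY_POINTS.values())
    awarded = sum(MATURITY_POINTS[level] for level in levels)
    return awarded / (max_points * len(levels))
```

Under these assumptions, four answers rated basic, managed, autonomic, and basic give a maturity score of 9/32, about 0.28. The doubling of points between levels means a single autonomic response outweighs several managed ones.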
Personal Autonomic Computing Self-Healing Tool, R. Sterritt, S. Chung, 2004



A self-healing tool consisting of a pulse monitor and a health monitor
Used in a PC environment
Pulse Monitoring application (PBM) is a UDP-based peer-to-peer application which
1) Checks whether hosts are providing a ‘heartbeat’ or not
2) Indicates the health level of the system (state of processes)
3) Reboots a neighbor if no heartbeat is heard from it (security?)

Health Monitoring runs on a host and restarts a process on the same host if it’s not responding

Combines three old concepts: watchdog processes, hello mechanism, and remote control
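
The heartbeat check at the core of the pulse monitor can be sketched as follows. This is not the tool's implementation (which, per the slides, is a Java and C application using UDP); it only shows the timeout logic. The class name, method names, host names, and the timeout value are hypothetical, and the network transport and the reboot action are left out.

```python
import time


class PulseMonitor:
    """Tracks the last heartbeat per host and reports hosts that have gone silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def record_heartbeat(self, host):
        """Call whenever a heartbeat message from the given host arrives."""
        self.last_seen[host] = self.clock()

    def silent_hosts(self):
        """Return the hosts whose last heartbeat is older than the timeout, sorted by name."""
        now = self.clock()
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A caller would feed `record_heartbeat` from whatever receives the pulse messages and poll `silent_hosts` periodically. What to do with a silent neighbor (the slides mention a reboot) is a separate policy decision.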
The Architecture

– Pulse Monitor (Java) communicates with the platform-specific Health Monitor (C) through JNI
– Main monitor monitors the Pulse Monitor and the Health Monitor
Testing

– A proof-of-concept prototype system was built on the MS Windows platform
– Future topics: more autonomic functionality and more supported platforms
– Maybe useful when human administration is not possible (sensor networks?)
Conclusions
1) Reinforcement Learning for Autonomic Network Repair
   – Learn autonomically the best sequence of actions to repair a network outage
   – Prototype implemented and tested (useful?)
2) Approaches to Building Self Healing Systems using Dependency Analysis
   – Determine the root cause of degraded performance and try to fix it
   – No testing, use 3. instead?
3) Ensembles of Models for Automated Diagnosis of System Performance Problems
   – Suitable (tested) system for (Hewlett Packard) server systems
   – Pinpoints causes of SLO violation
4) Towards Autonomic Web Services: Achieving Self-Healing Using Web Services
   – Autonomic web server healing system
   – No testing
Conclusions (cont)
1) Reflection, Self-Awareness and Self-Healing in OpenORB
   – ?
2) Measuring the Effectiveness of Self-Healing Autonomic Systems
   – Suitable system for J2EE server systems
   – Provides users with a quantitative way to measure the self-healing capability of their IT systems
   – Implemented and in use
3) Personal Autonomic Computing Self-Healing Tool
   – Enables a group of PCs to monitor the health of each other
   – Applications?
   – Prototype implemented

Overall much discussion about server self-healing