Resiliency and self-healing
Visa Holopainen
Reinforcement Learning for Autonomic
Network Repair,
M. Littman, N. Ravi, E. Fenson, R. Howard, 2004
Reinforcement learning
Used to solve Markov decision problems (MDPs)
States, actions, rewards, transitions, transition probabilities
Agent explores an environment in which it perceives its
current state and takes actions to reach new states
A reward is associated with every state
Reinforcement learning tries to find a policy for maximizing
cumulative reward for a task
(Simplified) Reinforcement Learning example
Which direction should
the agent move?
Reinforcement Learning example (cont)
Agent makes random moves
until a Goal state is reached
(Figure legend: + marks the agent)
Reinforcement Learning example (cont)
Now a policy is associated
with the state from which the
goal state was reached
Reinforcement Learning example (cont)
Now if at some point state S
(that has a policy associated
with it) is reached from state
S’, a policy is assigned to S’
Reinforcement Learning example (cont)
After some amount of
iterations the optimal
policies have been formed
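The walkthrough above can be sketched in code. The example below uses tabular Q-learning, one standard reinforcement learning algorithm; it is not necessarily the method behind the slides or the cited paper. The 4x4 grid, the reward of 1 for reaching the goal, and all names and parameter values (SIZE, train, greedy_path, alpha, gamma, epsilon) are assumptions made for illustration.

```python
import random

# Hypothetical 4x4 grid world: the agent starts top-left, the goal is bottom-right.
SIZE = 4
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Move one cell; walking into a wall leaves the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    next_state = (row, col)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    q = {(s, a): 0.0 for s in cells for a in ACTIONS}
    for _ in range(episodes):
        state = START
        while state != GOAL:
            best = max(q[(state, a)] for a in ACTIONS)
            greedy = [a for a in ACTIONS if q[(state, a)] == best]
            # Explore with probability epsilon; ties are broken at random,
            # so the first episodes are effectively random walks.
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = rng.choice(greedy)
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


def greedy_path(q, max_moves=SIZE * SIZE):
    """Follow the learned policy (best action per state) from START."""
    path = [START]
    while path[-1] != GOAL and len(path) <= max_moves:
        state = path[-1]
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        path.append(step(state, action)[0])
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

As in the slides, value first appears at the state next to the goal and then spreads backwards to the states from which that state is reached.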
Reinforcement Learning example (cont)
The corresponding state
Implemented concept
Reinforcement learning is used to
restore network connectivity after a
Starting state: no connectivity, Goal
state: connectivity
Actions: PingGateway, PingIP,
DNSLookup, UseCachedIP, FixIP,
RenewLease
Learned policy in the picture
Prototype implemented
Nice concept but not very useful…
Approaches to Building Self Healing
Systems using Dependency Analysis,
J. Gao, G. Kar, P. Kermani, 2004
Is there a way to automatically determine the root cause(s)
of degraded performance of, e.g., an Internet shopping site?
Provided that the root cause(s) can be determined, are
there ways to automatically fix the problem?
Distributed System
The Monitoring System
Includes monitoring agents that
monitor 1) the response time of the
system from user’s perspective and
2) the application components
(servlets, EJBs,…)
The Dependency Matrix
A typical multi-tier e-Business
system (web access, database)
Which transactions depend on which
system components
Self-healing Engine
Launched when a performance
problem is noticed by monitoring
Problem description
Based on previous work a
dependency matrix can be formed
The matrix informs which customer
transactions depend on which
system resources
Using this matrix the system
resource that causes a performance
problem can be tracked
The initial goal was to minimize the
number of transactions needed to find the root
cause of a problem
This problem is found to be NP-hard
-> a heuristic solution is presented
No solution can be guaranteed to be found if two or
more matrix columns are identical
Assume that 1) all matrix columns are different and 2)
there is only one broken system component
Now the solution can be found by the following algorithm
The set of all resources is denoted S. The set of all transactions is
denoted T
1) Run all transactions one by one
2) If a transaction succeeds then remove all resources that this transaction
depends on from S.
3) Finally only one resource is left in S. This is the broken resource.
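The three steps above can be sketched as follows. The dependency matrix is represented here as a dict from transaction to the set of resources it depends on; the function name, the run_transaction callback, and the example matrix are hypothetical and only illustrate the elimination idea.

```python
def locate_broken_resource(dependencies, run_transaction):
    """Return the resources that remain suspect after running all transactions.

    dependencies: dict mapping each transaction to the set of resources it uses.
    run_transaction: callable returning True if the transaction succeeds.
    """
    # S: start with every resource that appears in the matrix.
    suspects = set().union(*dependencies.values())
    for transaction, resources in dependencies.items():
        # A successful transaction clears every resource it depends on.
        if run_transaction(transaction):
            suspects -= resources
    return suspects


if __name__ == "__main__":
    # Hypothetical matrix: three transactions over three resources.
    deps = {
        "browse": {"web", "app"},
        "search": {"web", "db"},
        "report": {"app", "db"},
    }
    broken = "db"
    # Simulate execution: a transaction fails if it touches the broken resource.
    print(locate_broken_resource(deps, lambda t: broken not in deps[t]))
```

The function returns a set rather than a single name: if more than one suspect remains, the available transactions do not separate those resources, which is the situation where additional (synthetic) transactions would be needed.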
Solution (cont)
If the fixed set of customer transactions cannot locate the
root cause of performance problem, synthetic
transactions need to be created and executed
Many practical difficulties exist in doing so
No testing
Ensembles of Models for Automated
Diagnosis of System Performance
Problems, S. Zhang, I. Cohen, M. Goldszmidt, J. Symons, A. Fox, 2005
Ensemble = collection
SLA contains Service Level Objectives (SLO)
Problem: Which system metrics correlate with SLO violations?
SLO example: “Server downtime < X sec in a day”
Example system metrics: CPU metrics, Memory, I/O,
Network activity coming in and out of servers, Swapspace
usage, Paging, etc…
Tree Augmented Naïve Bayes (TAN) models
Determine which low-level metrics most likely contributed to
an SLO violation
A mapping function is learned by the algorithm
TAN model example
“Given SLO state (SLO violation) S, what is the most
predictive set of system-level metrics for S”
Combinations of metrics are more predictive of SLO
violations than individual metrics
Small numbers of metrics (3-8) are usually sufficient to
predict an SLO violation
Multiple TAN models
TAN models that are built using data collected under
some conditions don't work well on data collected under
different conditions -> need to maintain multiple TAN models
The model that best suits the current conditions is chosen
by using Brier score
Brier score is similar to Mean Squared Error (MSE) and
offers a fine grained evaluation of a model
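A minimal sketch of Brier-score-based model selection, assuming each model outputs a probability of SLO violation per sample and that the score is the mean squared difference between that probability and the observed outcome (1 = violation, 0 = no violation). The function names, model names, and numbers are made up for illustration; how the evaluation window is chosen is not covered here.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not outcomes or len(predicted_probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(outcomes)


def select_model(model_predictions, outcomes):
    """Pick the model whose predictions have the lowest (best) Brier score."""
    return min(
        model_predictions,
        key=lambda name: brier_score(model_predictions[name], outcomes),
    )


if __name__ == "__main__":
    observed = [1, 0, 1, 1]  # hypothetical recent SLO states
    predictions = {
        "model_a": [0.9, 0.2, 0.8, 0.7],
        "model_b": [0.5, 0.5, 0.5, 0.5],
    }
    print(select_model(predictions, observed))
```

Unlike plain accuracy, the score rewards confident correct predictions and penalizes confident wrong ones, which is what makes it a finer-grained criterion.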
Ensembles of models outperform single models
Also do slightly better than workload-specific models
Indicates that some workload conditions are too complex
for a single model
BA = Balanced Accuracy
FA = False Alerts
Det = Detections
TAN summary
Ensembles of models perform better than single models
The approach allows for rapid adaptation to
changing conditions
No domain specific knowledge is required
Different workloads seem to be characterized
by different metric-attribution “signatures”
(future work)
Towards Autonomic Web Services:
Achieving Self-Healing Using Web
Services, S. Gurguis, A. Zeid, 2005
CBE-log is a representation
format into which log files of all
different applications can be converted
Diagnosis Engine selects a set
of repair actions
The Symptoms Database is an
XML-file containing symptoms
and recovery actions
Rule Engine decides which
repair actions should be taken
based on the Policy Database
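The division of labour between the Diagnosis Engine and the Rule Engine could look roughly like the sketch below. The XML layout, the symptom and action names, and the representation of the Policy Database as a set of permitted actions are all assumptions made for this example; the paper's actual record format is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical Symptoms Database content (layout invented for this sketch).
SYMPTOMS_XML = """
<symptoms>
  <symptom id="db_connection_lost">
    <action>restart_connection_pool</action>
    <action>reboot_database_host</action>
  </symptom>
  <symptom id="disk_full">
    <action>rotate_logs</action>
  </symptom>
</symptoms>
"""


def diagnose(symptom_id, symptoms_root):
    """Diagnosis step: list the candidate repair actions for a symptom."""
    for symptom in symptoms_root.findall("symptom"):
        if symptom.get("id") == symptom_id:
            return [action.text for action in symptom.findall("action")]
    return []


def apply_policy(candidate_actions, allowed_actions):
    """Rule step: keep only the actions that the policy permits."""
    return [a for a in candidate_actions if a in allowed_actions]


if __name__ == "__main__":
    root = ET.fromstring(SYMPTOMS_XML)
    candidates = diagnose("db_connection_lost", root)
    print(apply_policy(candidates, {"restart_connection_pool", "rotate_logs"}))
```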
No prototype implemented
A typical record in the Symptom Database presented in the
Possible application: legacy systems
Reflection, Self-Awareness and Self-Healing in OpenORB, G. Blair, G. Coulson, et al., 2002
OMG (Object Management Group)
An open membership, not-for-profit consortium that produces and
maintains computer industry specifications for interoperable enterprise
applications
OMG CORBA (Common Object Request Broker Architecture)
Open, vendor-independent architecture and infrastructure that computer
applications use to work together over networks
Supports communication between different types of operating systems,
programming languages and networks
Interfaces defined in OMG IDL (Interface Definition Language)
Mappings exist between IDL and C, C++, Java, COBOL, Smalltalk, Ada,
Lisp, Python, and IDLscript
OpenORB
Provides a Java implementation of the OMG CORBA 2.4.2 specification
Example, OMG IDL <-> C mappings
OpenORB self-healing
Meta-interface supports access to the underlying platform
Open ORB supports the ability to discover meta-information
about the current system, both in terms of its structure and
ongoing behaviour
System properties can also be adapted by using the
appropriate meta-interfaces
Management component can be introduced (dynamically) into
the various meta-space models
Measuring the Effectiveness of Self-Healing Autonomic Systems, A. Brown, C. Redlin,
SPEC (Standard Performance Evaluation Corporation)
Non-profit corporation that maintains a standardized set of relevant
benchmarks applicable to the newest generation of high-performance
computers
SPEC jAppServer2004
Benchmark for measuring the performance of J2EE application servers
An end-to-end application which exercises all major J2EE technologies
Based on jAppServer2004 a benchmarking system was created
that is capable of quantifying the autonomic self-healing capability
of a large-scale J2EE software solution
The system is used in various production environments
The Architecture
30 different types of disturbances
representing common failure
modes can be injected into the
system under test (SUT)
Component shutdowns, data loss,
resource exhaustion, load surges,
operator errors, ...
Two metrics are used to evaluate
SUT’s self-healing capacity
How effectively the SUT heals itself
Basically measured by counting how
many requests the jAppServer2004 gets
right during a disturbance,
compared to normal working conditions
How autonomic the healing response is
A 90-question survey is used
The Survey
The 90-question survey assigns points to the SUT based on the
level of automation present in its response to each disturbance
(based on IBM's autonomic computing maturity model)
0 points for a basic manual response, 1 point for a managed
response, 2 for predictive, 4 for adaptive, and 8 for autonomic
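The two metrics can be illustrated with a small sketch. The point values are the ones listed above. Two things are assumptions of this example, not taken from the paper: that effectiveness is the simple ratio of correct requests under disturbance to correct requests under normal conditions, and that the maturity score is normalized to the 0..1 range by dividing by the maximum attainable points.

```python
# Points per maturity level, as listed in the survey description.
MATURITY_POINTS = {
    "basic": 0,
    "managed": 1,
    "predictive": 2,
    "adaptive": 4,
    "autonomic": 8,
}


def healing_effectiveness(correct_under_disturbance, correct_under_normal):
    """Share of the normal correct-request count that survives a disturbance."""
    if correct_under_normal <= 0:
        raise ValueError("baseline request count must be positive")
    return correct_under_disturbance / correct_under_normal


def maturity_score(levels):
    """Total points divided by the maximum possible (assumed normalization)."""
    if not levels:
        raise ValueError("at least one survey answer is required")
    total = sum(MATURITY_POINTS[level] for level in levels)
    return total / (MATURITY_POINTS["autonomic"] * len(levels))


if __name__ == "__main__":
    print(healing_effectiveness(790, 1000))
    print(maturity_score(["basic", "managed", "autonomic", "basic"]))
```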
“...Our baseline run on SUT #1 resulted in an average healing
effectiveness score of 0.79 and an autonomic maturity score of
0.15 (both out of 1.0), indicating a relatively low level of
autonomic self-healing capability. In comparison, SUT #2
attained an effectiveness score of 0.83 and a maturity score of
0.22. Comparing the two results indicates that SUT #2’s system
management technology provided a small—but measurable—
improvement in autonomic capability...”
Personal Autonomic Computing Self-Healing Tool, R. Sterritt, S. Chung, 2004
A self-healing tool consisting of a pulse monitor and a health monitor
Used in a PC environment
Pulse Monitoring application (PBM) is a UDP-based peer-to-peer application which
Checks whether hosts are providing a ‘heartbeat’ or not and
Indicates the health level of the system (state of processes)
Reboots a neighbor if no heartbeat is heard from it (security?)
Health Monitoring runs on a host and restarts a process on
the same host if it’s not responding
Combines three old concepts: watchdog processes, hello-mechanism, and remote control
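The heartbeat check at the core of the pulse monitor can be sketched without any networking. The class below only tracks when each neighbor was last heard from and reports the ones that have gone silent; the UDP transport and the reboot action are left out, and the class name, method names, and timeout are hypothetical.

```python
import time


class PulseTracker:
    """Records heartbeats and reports hosts that have been silent too long."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def heartbeat(self, host):
        """Call whenever a pulse arrives from a host."""
        self.last_seen[host] = self.clock()

    def silent_hosts(self):
        """Hosts whose last pulse is older than the timeout."""
        now = self.clock()
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    tracker = PulseTracker(timeout_seconds=5.0)
    tracker.heartbeat("host-a")
    print(tracker.silent_hosts())  # nothing is overdue right after a pulse
```

A real monitor would call heartbeat from its receive loop and decide what to do with each silent host, for example restarting it as described above.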
The Architecture
Pulse Monitor (Java)
communicates with platform-specific Health Monitor (C)
through JNI
Main monitor monitors Pulse
monitor and Health monitor
A proof-of-concept prototype system was built on MS Windows
Future topics: more autonomic functionality & supported
May be useful when human administration is not possible (sensor
