Proactivity =
Observation + Analysis + Knowledge extraction + Action planning?

András Pataricza
Budapest University of Technology and Economics
Department of Measurement and Information Systems
Contributors
• Prof. G. Horváth (BME)
• I. Kocsis (BME)
• Z. Micskei (BME)
• K. Gáti (BME)
• Zs. Kocsis (IBM)
• I. Szombath (BME)
• and many others
There will be nothing new in this lecture
I learned the basics when I was very young.
But old professors are happy to have a new audience.
How can traditional signal processing help proactivity?
Proactive stance: builds on foreknowledge (intelligence) and creativity
• to anticipate the situation as an opportunity, regardless of how threatening or bad it looks;
• to influence the system constructively instead of merely reacting.
Reactivity vs. proactivity
 Reactive control
o "acting in response to a situation rather than creating or controlling it"
 Proactive control
o "controlling a situation rather than just responding to it after it has happened"
Test environment
 Test configuration
o Virtual desktop infrastructure
• ~ a few tens of VMs per host
• ~ a few tens of hosts per cluster
o vSphere monitoring and supervisory control
o Performance monitoring
 Objective:
o VM-level SLA control
• Capacity planning
• Proactive migration
o "CPU ready" metric:
• VM is ready to run, but lacks the resources to start
• detects a possible problem at the VM or host level
• serves as a failure indicator as well
Actions to prevent performance issues
 Add limits to neighbouring VMs
 Live-migrate the VM to another (underutilized) host
Measured data (at 20 s sampling rate)
 VM: ~20 resource parameters (CPU, memory, network)
 Host: ~50 resource parameters (kernel / VM core)
 Cluster: ~70 (derived by aggregation)
 History: VM-to-host deployment
Aggregation over the population
 Statistical cluster behavior versus QoS over the VM population
Mean of the goal VM metric (VM_CPU_READY)
 VM application:
• ready to run
• lacks resources
-> performance bottleneck
-> availability problem
 VMware recommended thresholds:
• 5%: watch
• 10%: action is typically needed
The two traps
 Visual processing: you believe your eyes
 Automated processing: you believe your computer
Mean of the goal VM metric
 Visual inspection:
o lots of bad values -> this is a bad system
 Statistics:
o mean: 0.007 -> a good system
o but only 2/3 of the samples are error-free -> a bad system
o after eliminating the failure-free cases below the threshold: mean = 0.023 -> a good system
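The trap can be reproduced in a few lines. The toy distribution below is invented for illustration (it is not the measured dataset); only the 5% threshold comes from the slides.

```python
# Illustrative sketch of the "mean" trap: the overall mean of a
# CPU-ready metric looks healthy even though many samples violate
# the threshold. The sample values are invented toy data.
threshold = 0.05                                     # "watch" threshold
samples = [0.0] * 660 + [0.02] * 240 + [0.06] * 100  # toy distribution

mean_all = sum(samples) / len(samples)               # looks "good"
error_free = sum(1 for s in samples if s == 0.0) / len(samples)
nonzero = [s for s in samples if s > 0.0]
mean_nonzero = sum(nonzero) / len(nonzero)           # conditional view

print(f"overall mean: {mean_all:.4f}")               # below threshold
print(f"error-free fraction: {error_free:.2f}")      # only 2/3
print(f"mean of non-zero samples: {mean_nonzero:.4f}")
```

The same dataset supports both the "good system" and the "bad system" verdict, depending on which statistic is reported.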
Host shared and used memory over time
 Noisy…
 High-frequency components dominate
 But the two signals correlate (93%!)
 YOU DON'T SEE IT
… and a host of more mundane observations
 Computing power use = CPU use × CPU clock rate (constant)
 Should be purely proportional
 Correlation coefficient: 0.99998477434137
 Well visible, but numerically suppressed
 Origin???
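Such near-perfect proportionality is easy to check numerically. The sketch below uses synthetic data; the 2.4 GHz clock rate and the noise level are assumptions, not the monitored traces.

```python
# Sketch: quantify the near-proportionality of computing-power use
# and CPU use with a Pearson correlation coefficient.
# Synthetic data; clock rate and noise magnitude are invented.
import numpy as np

rng = np.random.default_rng(0)
cpu_use = rng.uniform(0.0, 1.0, 1000)      # utilization fraction
clk_hz = 2.4e9                             # assumed constant clock
noise = rng.normal(0.0, 1e6, 1000)         # small measurement noise
power = cpu_use * clk_hz + noise           # ~ purely proportional

r = np.corrcoef(cpu_use, power)[0, 1]
print(f"correlation coefficient: {r:.6f}")  # very close to 1
```

A visual plot of `power` alone would be dominated by the huge scale factor, which is why such relationships are "well visible, but numerically suppressed" without an explicit correlation check.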
Most important factor: host CPU usage mean
 Host CPU usage vs. the ratio of "bad" (vCPU-ready) VMs
The battle plan
Impact of temporal resolution
 Nyquist–Shannon sampling theorem:
o bandwidth = sampling frequency / 2
 Sampling period = 20 s
-> sampling frequency = 0.05 Hz
-> bandwidth = 0.025 Hz
 Additionally:
o sampling clock jitter (SW sampling)
o clock skew (distributed system)
o Precision Time Protocol (PTP, IEEE 1588-2008)
 No fine-grained prediction is possible
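The bandwidth limit above follows directly from the sampling period:

```python
# Nyquist bandwidth for the 20-second sampling period on the slide.
sampling_period_s = 20.0
fs_hz = 1.0 / sampling_period_s          # sampling frequency: 0.05 Hz
bandwidth_hz = fs_hz / 2.0               # Nyquist bandwidth: 0.025 Hz
shortest_period_s = 1.0 / bandwidth_hz   # fastest resolvable cycle: 40 s
print(bandwidth_hz, shortest_period_s)
```

Any phenomenon faster than a 40-second cycle is invisible (or aliased) at this sampling rate, which is why fine-grained prediction is ruled out.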
Proactivity
 Proactivity needs:
1. Situation recognition based on historical experience
• What is to be expected?
2. Identification of the principal factors
• Single factor / multiple factors
• Operation domains leading to failures
• Boundaries
3. Predictor design
• High failure coverage
• Temporal lookahead sufficient for reaction
4. Design of the reaction
 Situations to be covered
1. Single VM: application demand > resources allocated
2. VM–host: overcommitment, overload due to other VMs
3. VM–host–cluster
Data preparation
Data cleaning
Data reduction
Data reduction
 Huge initial set of samples
 Reduction
o Object sampling: representative measurement objects
o Parameter selection/reduction:
• Aggregation
• Relevance
• Redundancy
o Temporal
• Sampling
• Relevance
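One common way to implement the redundancy item above is correlation-based pruning; the sketch below is a generic illustration, not the lecture's exact method, and the function name is invented.

```python
# Sketch of correlation-based redundancy reduction: greedily drop
# every parameter whose |correlation| with an already-kept parameter
# exceeds a cutoff. Generic illustration, not the lecture's method.
import numpy as np

def drop_redundant(data, names, cutoff=0.95):
    """data: samples x parameters matrix; returns the kept column names."""
    corr = np.abs(np.corrcoef(data, rowvar=False))
    kept = []                                  # indices of retained columns
    for j in range(data.shape[1]):
        if all(corr[j, i] < cutoff for i in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Toy check: "b" is (almost) a scaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.01, size=200)
c = rng.normal(size=200)
kept = drop_redundant(np.column_stack([a, b, c]), ["a", "b", "c"])
```

The greedy order matters: the first column of a correlated pair survives, so in practice one would rank columns by relevance before pruning.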
Object sampling
 In pursuit of discovering fine-grained behavior and the reasons for outliers
Subsample: ratio > 0 + random subsampling
 For presentation purposes only
o reduction of the sample size to 400
-> manageability
 Real-life analysis:
o keep enough data to maintain a proper correlation with the operation
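The "ratio > 0 plus random subsampling" step can be sketched as follows; the 400-sample target comes from the slide, while the function name and interface are assumptions.

```python
# Sketch: keep every sample with a positive ratio (the interesting
# cases), then top up to the target size with a uniform random draw
# from the remaining zero-ratio samples.
import random

def subsample(ratios, target=400, seed=42):
    rng = random.Random(seed)
    nonzero = [r for r in ratios if r > 0]
    zero = [r for r in ratios if r <= 0]
    fill = min(max(0, target - len(nonzero)), len(zero))
    return nonzero + rng.sample(zero, fill)

# Toy data: 50 problematic samples among 1050 measurements.
data = [0.0] * 1000 + [0.1] * 50
subset = subsample(data)
```

All problematic samples survive the reduction; only the uninteresting bulk is thinned.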
Demo: visual data discovery with parallel coordinates
 Visual multifactor analysis
 Visual analytics for an arbitrary number of factors
 You can do much, much more
 Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer, 2009.
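A minimal parallel-coordinates sketch using pandas/matplotlib; the column names and values are invented stand-ins for the monitoring metrics, and a real analysis would load the measured data instead.

```python
# Sketch: each DataFrame row becomes one polyline across the parallel
# axes; the class column ("status") only controls the line color.
# All data below is invented for illustration.
import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "cpu_ready": [0.12, 0.02, 0.30, 0.01],
    "cpu_use":   [0.90, 0.40, 0.95, 0.35],
    "mem_use":   [0.70, 0.50, 0.80, 0.45],
    "status":    ["bad", "good", "bad", "good"],
})
ax = parallel_coordinates(df, class_column="status")
plt.savefig("parallel_coordinates.png")
```

With each metric on its own vertical axis, correlated factors show up as bundles of near-parallel line segments, regardless of how many dimensions are plotted.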
Redundancy reduction
Correlation analysis
Clustering
Data mining
Approximation
Optimization
Prediction at the cluster level
What ratio of the VMs will become problematic?
Pinpointed interval for one VM
 Situation of interest
 Training time > prediction time
One-minute prediction based on all data sources

One-minute prediction and classification

Predicted \ Real    Alarm           Normal
Alarm               77 (67.54%)     56 (0.3%)
Normal              37 (32.46%)     18269 (99.7%)
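The percentages in the confusion matrix follow from the raw counts; the short sketch below rederives them (the counts are taken from the table, the variable names are generic classification terminology).

```python
# Derive the detection coverage and false-alarm rate from the
# confusion-matrix counts on the slide.
tp, fn = 77, 37        # real alarms: predicted as alarm / missed
fp, tn = 56, 18269     # normal samples: false alarms / correct

coverage = tp / (tp + fn)             # 77/114  ≈ 67.5% of alarms caught
false_alarm_rate = fp / (fp + tn)     # 56/18325 ≈ 0.3% of normals flagged
print(f"{coverage:.2%} coverage, {false_alarm_rate:.2%} false alarms")
```

Note the asymmetry: a seemingly excellent 99.7% accuracy on the normal class coexists with a third of the real alarms being missed.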
One-minute prediction with selected variables

Classification error (simplest predictor)

Factors                  All      Proper feature set   Wrong feature set   All
Prediction time          1 min    1 min                1 min               5 min
Uncovered failure rate   73%      32%                  97%                 87%
False alarm rate         0.2%     0.3%                 0.04%               0.1%

 False alarm rate is low (dominant pattern)
 Feature set selection is critical to detection
 More is less (PROPER selection is needed – cf. PFARM 2010)
 Case separation for different situations
 Long-term prediction is hard (automated reactions)
Case study – Connectivity testing in large networks
 In dynamic infrastructures the active internode topology has to be discovered as well…

Large networks
 not known explicitly
 too complex for conventional algorithms
 Social network graph
o Yahoo! Instant Messenger friend connectivity graph*
o 1.8M nodes, ~4M edges
 Serves as a model of large infrastructures
 Typical power-law network
o 75% of the friendships are related to 35% of the users

*Yahoo! Research Alliance Webscope program, dataset ydata-yim-friends-graph-v1_0,
http://research.yahoo.com/Academic_Relations
Typical model: random graphs
 Yahoo! Instant Messenger dataset – adjacency matrix
o random node order vs. ordered by degree
 Preferential attachment graph: 𝑛 = 50, 𝑛 = 200, 𝑛 = 800
 Limit: 𝑛 → ∞ : graphon
Approximating edge density by subgraph sampling
 Graph with 800 nodes and 320,000 edges
 Subgraph sampling method:
o take a random induced subgraph 𝑮[𝑺] of k random nodes (𝑺 ⊆ 𝑽(𝑮), |𝑺| = 𝒌)
o repeat n times
 Example: sample size k = 35, repeated n = 20 times
o 2% relative error while examining only 4% of the graph
 (Figure: relative error versus sample size k and number of samples n; white region: error < 5%; inset: a random k = 4 sample)
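The estimator can be sketched in pure Python; an Erdős–Rényi graph of known density stands in for the real dataset, and the function name is an assumption.

```python
# Sketch of the subgraph-sampling estimator: average the edge density
# of n random induced k-node subgraphs. Pure stdlib; the G(n, p) test
# graph is an illustration, not the Yahoo! dataset.
import random
from itertools import combinations

def estimate_density(edges, nodes, k=35, n=20, seed=1):
    """Average edge density over n random induced k-node subgraphs."""
    rng = random.Random(seed)
    pairs = k * (k - 1) // 2            # possible edges in a k-node subgraph
    total = 0.0
    for _ in range(n):
        s = set(rng.sample(nodes, k))
        inside = sum(1 for u, v in edges if u in s and v in s)
        total += inside / pairs
    return total / n

# Toy check against a G(n, p) graph of known density p = 0.25.
g_rng = random.Random(0)
nodes = list(range(400))
edges = [(u, v) for u, v in combinations(nodes, 2) if g_rng.random() < 0.25]
est = estimate_density(edges, nodes, k=35, n=20)
```

Each trial only inspects k·(k−1)/2 node pairs, which is why a few percent of the graph suffices for a low-error estimate.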
Neighborhood sampling: fault-tolerant services
 Number of 3- and 4-cycles ≈ possible redundancy
o a high count means the node has many substitute nodes (e.g. load balancer)
 Method:
o take random root nodes
o explore the neighborhood to a given depth (m)
 Distributions approximated from the samples are very close!
 (Figure: a root node, its fault-tolerant domain, and redundancy trends)
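The 3-cycle part of the neighborhood probe can be sketched as follows; the adjacency-dict representation and the function name are assumptions for illustration.

```python
# Count 3-cycles through a root node by checking edges among its
# neighbors; only the root's depth-1 neighborhood is touched, as in
# the sampling scheme above. Graph: dict of adjacency sets.
def triangles_at(adj, root):
    nbrs = list(adj[root])
    count = 0
    for i in range(len(nbrs)):
        for j in range(i + 1, len(nbrs)):
            if nbrs[j] in adj[nbrs[i]]:   # neighbors are themselves linked
                count += 1
    return count

# Toy graph: a 4-clique; every node lies on exactly 3 triangles.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
```

Running this at randomly chosen roots and histogramming the counts approximates the redundancy distribution without ever traversing the full graph.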
Summary: proactivity needs
 Observations
o all relevant cases (stress test)
 Analysis
o check of input data
o visual analysis
o UNDERSTANDING
o automated methods for calculation
 Knowledge extraction
o clustering (situation recognition)
o predictor (generalization)
 Action planning
o situation-defining principal factors are indicative

Thank you for your attention