Proactivity = Observation + Analysis + Knowledge Extraction + Action Planning?

András Pataricza
Budapest University of Technology and Economics
Department of Measurement and Information Systems

Contributors
o Prof. G. Horváth (BME)
o I. Kocsis (BME)
o Z. Micskei (BME)
o K. Gáti (BME)
o Zs. Kocsis (IBM)
o I. Szombath (BME)
o and many others

There will be nothing new in this lecture
o I learned the basics when I was very young...
o ...but old professors are happy to have a new audience.

What can traditional signal processing contribute to proactivity?

Proactive stance
o Builds on foreknowledge (intelligence) and creativity
  • to anticipate the situation as an opportunity,
  • regardless of how threatening or how bad it looks;
  • to influence the system constructively instead of merely reacting.

Reactivity vs. proactivity
o Reactive control: „acting in response to a situation rather than creating or controlling it”
o Proactive control: „controlling a situation rather than just responding to it after it has happened”

Test environment
o Test configuration: virtual desktop infrastructure
o Objective: VM-level SLA control
  • ~ a few tens of VMs per host
  • capacity planning: ~ a few tens of hosts per cluster
  • proactive migration
o vSphere monitoring and supervisory control
  • „CPU-ready” metric: the VM is ready to run, but lacks the resources to start

Performance monitoring
o Detecting a possible problem on the VM or host level
o Serves as a failure indicator as well

Actions to prevent a performance issue
o Add limits to neighbouring VMs
o Live-migrate the VM to another (underutilized) host

Measured data (at a 20 s sampling period)
o VM: ~ 20 resource parameters (CPU, memory, network)
o Host: ~ 50 resource parameters (kernel / VM core)
o Cluster: ~ 70 (derived by aggregation)
o History: VM-to-host deployment

Aggregation over the population
o Statistical cluster behavior versus QoS over the VM population

Mean of the goal VM metric (VM_CPU_READY)
o The VM application is ready to run
o Resource lack -> performance bottleneck -> availability problem
o VMware-recommended thresholds:
  • 5%: watching
  • 10%: typically action is needed

The two traps
o Visual processing: you believe your eyes
o Automated processing: you believe your computer

Mean of the goal metric
o Visual inspection: lots of bad values -> this is a bad system
o Statistics: mean = 0.007 -> "a good system"
o But only 2/3 of the samples are error-free -> a bad system
o Even after eliminating the failure-free cases, the mean (0.023) stays below the threshold -> nominally still "a good system" (see the first sketch below)

Host shared and used memory over time
o Noisy; high-frequency components dominate
o Yet those components correlate (93%!), and you don't see it (see the second sketch below)
o ...and a host of further, more mundane observations

Computing power use = CPU use × CPU clock rate (constant)
o Should be purely proportional
o Correlation coefficient: 0.99998477434137
o The deviation is well visible in a plot, yet numerically suppressed; what is its origin?

Most important factor: the mean host CPU usage
[Plot: host CPU usage vs. the ratio of VMs with "bad" vCPU-ready]
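To make the statistics trap concrete, here is a minimal sketch on synthetic data (the distribution and the exact values are illustrative assumptions, not the measured traces of the case study): a mean comfortably below the 5% threshold can coexist with frequent threshold violations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic CPU-ready trace: two thirds of the samples are error-free,
# one third shows moderate contention (values are fractions, 0.05 = 5%).
good = np.zeros(12000)                        # error-free samples
bad = rng.uniform(0.01, 0.06, size=6000)      # contended samples
cpu_ready = rng.permutation(np.concatenate([good, bad]))

threshold = 0.05                              # VMware "watch" threshold

print(f"mean            = {cpu_ready.mean():.4f}")         # looks 'good'
print(f"error-free rate = {(cpu_ready == 0).mean():.2f}")  # only ~2/3
print(f"violation rate  = {(cpu_ready > threshold).mean():.3f}")

# The trap persists even after dropping the failure-free samples:
nonzero = cpu_ready[cpu_ready > 0]
print(f"mean of nonzero = {nonzero.mean():.4f}")           # still < 0.05
```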
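The hidden correlation of the shared/used memory traces can be reproduced on synthetic series as well (again an assumption-laden illustration, not the original data): two series with unrelated slow trends share a dominant high-frequency component; the raw correlation is misleading, while first differencing exposes the common component.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
t = np.arange(n)

# A common high-frequency component drives both series...
common_noise = rng.normal(0, 10, size=n)

# ...on top of two unrelated slow trends (what the eye latches onto).
used_mem = 4000 + 0.02 * t + common_noise + rng.normal(0, 2, size=n)
shared_mem = 800 - 0.01 * t + common_noise + rng.normal(0, 2, size=n)

# The raw correlation is dominated by the unrelated trends:
print(f"corr(raw)         = {np.corrcoef(used_mem, shared_mem)[0, 1]:.2f}")

# First differences strip the trends and expose the shared component:
d_used, d_shared = np.diff(used_mem), np.diff(shared_mem)
print(f"corr(differenced) = {np.corrcoef(d_used, d_shared)[0, 1]:.2f}")
```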
The battle plan

Impact of temporal resolution
o Nyquist–Shannon sampling theorem: the observable bandwidth is at most half the sampling frequency
o Sampling period = 20 s -> sampling frequency = 0.05 Hz -> bandwidth = 0.025 Hz
o Additionally:
  • sampling clock jitter (SW sampling)
  • clock skew (distributed system)
  • Precision Time Protocol (PTP, IEEE 1588-2008)
o -> No fine-granular prediction is possible

Proactivity needs:
1. Situation recognition based on historical experience
  • What is to be expected?
2. Identification of the principal factors
  • Single factor / multiple factors
  • Operation domains leading to failures
  • Boundaries
3. Predictor design
  • High failure coverage
  • Temporal lookahead sufficient for reaction
4. Design of the reaction

Situations to be covered
1. Single VM: application demand > resources allocated
2. VM–host: overcommissioning, overload due to other VMs
3. VM–host–cluster

Data preparation
o Data cleaning
o Data reduction

Data reduction
o Huge initial set of samples
o Reduction:
  • Object sampling: representative measurement objects
  • Parameter selection/reduction: aggregation, relevance, redundancy
  • Temporal: sampling, relevance

Object sampling
o In pursuit of discovering fine-grained behavior and the reasons for outliers
o Subsample: cases with ratio > 0, plus random subsampling
o For presentation purposes only: reduction of the sample size to 400 -> manageability
o Real-life analysis: keep enough data to maintain a proper correlation with the operation

Demo: visual data discovery with parallel (||) coordinates
o Visual multifactor analysis: visual analytics for an arbitrary number of factors
o You can do much, much more
o Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer, 2009.

Redundancy reduction
o Correlation analysis
o Clustering
o Data mining
o Approximation
o Optimization

Prediction at the cluster level
o What ratio of the VMs will become problematic?

Pinpointed interval for one VM
o Situation of interest
o Training time > prediction time

One-minute prediction based on all data sources

One-minute prediction and classification:

Predicted \ Real   Alarm          Normal
Alarm              77 (67.54%)    56 (0.3%)
Normal             37 (32.46%)    18269 (99.7%)

One-minute prediction with selected variables

Classification error (simplest predictor):

Factors                  All      Proper feature set   Wrong feature set   All
Prediction time          1 min    1 min                1 min               5 min
Uncovered failure rate   73%      32%                  97%                 87%
False alarm rate         0.2%     0.3%                 0.04%               0.1%

o The false alarm rate is low (dominant pattern)
o Feature-set selection is critical to detection: more is less (a PROPER selection is needed; cf. PFARM 2010)
o Case separation is needed for the different situations
o Long-term prediction is hard (automated reactions)

Case study: connectivity testing in large networks
o In dynamic infrastructures, the active internode topology has to be discovered as well
o Large networks: not known explicitly; too complex for conventional algorithms
o Social network graph: the Yahoo! Instant Messenger friend-connectivity graph*
  • 1.8M nodes, ~4M edges
  • serves as a model of large infrastructures
  • a typical power-law network: 75% of the friendships are related to 35% of the users
o * ydata-yim-friends-graph-v1_0, Yahoo! Research Alliance Webscope program, http://research.yahoo.com/Academic_Relations

Typical model: random graphs
[Adjacency matrix of the Yahoo! Instant Messenger dataset: nodes in random order vs. ordered by degree]

Preferential attachment graph
[Snapshots for n = 50, n = 200, n = 800; in the limit n -> ∞, the object is a graphon]

Approximating edge density by subgraph sampling
o Graph with 800 nodes and 320,000 edges
o Subgraph sampling method:
  • random induced subgraph G[S]
  • take k random nodes (S ⊆ V(G), |S| = k)
  • repeat n times
o Sample size k = 35, repeated n = 20 times: 2% error with only 4% of the graph examined (a sketch of this loop follows the case study below)
[Heat map: relative error vs. sample size (k) and number of samples (n); the white region marks error < 5%. Inset: a random k = 4 sample.]

Neighborhood sampling: fault-tolerant services
o The number of 3- and 4-cycles indicates possible redundancy
o Root node: a high-degree node has many substitute nodes (e.g., a load balancer)
o Redundancy? Neighborhood sampling over fault-tolerant domains:
  • take random nodes
  • explore their neighborhood to a given depth (m)
o The distributions approximated from the samples are very close to the real ones!
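A minimal sketch of the subgraph-sampling loop, assuming Python with networkx; a dense random graph stands in for the 800-node example on the slides, since the original dataset is not bundled here. The k = 35 and n = 20 parameters come from the slides.

```python
import random
import networkx as nx

random.seed(1)

# Stand-in for the slide's dense 800-node example graph;
# the sampling method itself is graph-agnostic.
G = nx.gnp_random_graph(800, 0.5, seed=1)

def edge_density(g):
    """Edges present divided by edges possible."""
    v = g.number_of_nodes()
    return 2.0 * g.number_of_edges() / (v * (v - 1))

def estimate_density(g, k=35, n=20):
    """Mean edge density over n random induced subgraphs G[S], |S| = k."""
    nodes = list(g.nodes)
    total = 0.0
    for _ in range(n):
        s = random.sample(nodes, k)      # S subset of V(G), |S| = k
        total += edge_density(g.subgraph(s))
    return total / n

true_d = edge_density(G)
est_d = estimate_density(G)              # examines only ~4% of the nodes
print(f"true density:      {true_d:.4f}")
print(f"estimated density: {est_d:.4f}")
print(f"relative error:    {abs(est_d - true_d) / true_d:.1%}")
```

On a dense graph such as this, the relative error of the estimate is typically a few percent, matching the order of magnitude reported on the slides.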
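Neighborhood sampling can be sketched the same way (again networkx, with a preferential-attachment graph as a stand-in for a large power-law infrastructure; for brevity only 3-cycles, i.e. triangles, are counted, although the slides also consider 4-cycles):

```python
import random
import networkx as nx

random.seed(2)

# Power-law network standing in for a large infrastructure.
G = nx.barabasi_albert_graph(n=5000, m=3, seed=2)

def neighborhood_cycle_profile(g, samples=50, m=2):
    """Sample random root nodes, explore each neighborhood to depth m,
    and count the triangles there as a redundancy indicator."""
    counts = []
    for root in random.sample(list(g.nodes), samples):
        ego = nx.ego_graph(g, root, radius=m)    # neighborhood of depth m
        counts.append(sum(nx.triangles(ego).values()) // 3)
    return counts

counts = neighborhood_cycle_profile(G)
print(f"sampled neighborhoods:           {len(counts)}")
print(f"mean triangles per neighborhood: {sum(counts) / len(counts):.1f}")
print(f"max triangles (likely a hub):    {max(counts)}")
```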
Summary: proactivity needs
o Observations
  • all relevant cases (stress test)
o Analysis
  • check of the input data
  • visual analysis: UNDERSTANDING
  • automated methods for calculation
o Knowledge extraction
  • clustering (situation recognition)
  • predictor (generalization)
o Action planning
  • the situation-defining principal factors are indicative

Thank you for your attention