Intel Collaborative Research Institute Computational Intelligence

Self-Learning, Adaptive Computer Systems
Yoav Etsion, Technion CS & EE
Dan Tsafrir, Technion CS
Shie Mannor, Technion EE
Assaf Schuster, Technion CS

Adaptive Computer Systems
• Complexity of computer systems keeps growing
• We are moving towards heterogeneous hardware
• Workloads are getting more diverse
• Process variability affects performance/power of different parts of the system
• Human programmers and administrators cannot handle the complexity
• The goal: Adapt to workload and hardware variability

Predicting System Behavior
• When a human observes the workload, she can typically identify cause and effect
• Workload carries inherent semantics
• The problem is extracting them automatically…
• Key issues with machine learning:
  • Huge datasets (performance counters; exec. traces)
  • Need extremely fast response time (in most cases)
  • Rigid space constraints for ML algorithms

Memory + Machine Learning: Current state-of-the-art
• Architectures are tuned for structured data
• Managed using simple heuristics
  • Spatial and temporal locality
  • Frequency and recency (ARC)
  • Block and stride prefetchers
• Real data is not well structured
  • Programmer must transform data
  • Unrealistic for program-agnostic management (swapping, prefetching)

Memory + Machine Learning: Multiple learning opportunities
• Identify patterns using machine learning
• Bring data to the right place at the right time
• Memory hierarchy forms a pyramid
  • Caches / DRAM, PCM / SSD, HDD
• Different levels require different learning strategies
  • Top: smaller, faster, costlier [prefetching to caches]
  • Bottom: bigger, slower, cheaper [fetching from disk]
• Need both hardware and
software support

Research track: Predicting Latent Faults in Data Centers
Moshe Gabel, Assaf Schuster

Latent Fault Detection
• Failures and misconfiguration happen in large datacenters
  • Do they cause performance anomalies?
• Sound statistical framework to detect latent faults
  • Practical: non-intrusive, unsupervised, no domain knowledge
  • Adaptive: no parameter tuning, robust to system/workload changes

Latent Fault Detection
• Applied to a real-world production service of 4.5K machines
• Over 20% of machine/SW failures were preceded by latent faults
  • Slow response time; network errors; disk access times
• Predicts failures 14 days in advance, with 70% precision and 2% FPR
• Latent Fault Detection in Large Scale Services, DSN 2012

Research track: Task Differentials: Dynamic, inter-thread predictions using memory access footsteps
Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser

Motivation
We are in the age of parallel computing.
Programming paradigms shift towards task level parallelism.
[Figure: tasks in parallel sections, separated by synchronization]
Tasks are supported by libraries such as TBB and OpenMP:

  ...
  GridLauncher<InitDensitiesAndForcesMTWorker> &id = *new (tbb::task::allocate_root()) GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
  tbb::task::spawn_root_and_wait(id);
  GridLauncher<ComputeDensitiesMTWorker> &cd = *new (tbb::task::allocate_root()) GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
  tbb::task::spawn_root_and_wait(cd);
  ...
Taken from: PARSEC.fluidanimate TBB implementation

Implicit forms of task level parallelism include GPU kernels and parallel loops.
Task behavior tends to be highly regular, which makes it a target for learning and adaptation.

How do things currently work?
• Programmer codes a parallel loop
• SW maps multiple tasks to one thread
• HW sees a sequence of instructions
• HW prefetchers try to identify patterns between consecutive memory accesses
• No notion of program semantics, i.e. that execution consists of a sequence of tasks, not instructions
[Figure: task instances labeled A to E]

Task Address Set
Given the memory trace of task instance A, the task address set $T_A = \{a_1, a_2, \ldots, a_n\}$ is the set of unique addresses, ordered by first access time:

Trace:
  START TASK INSTANCE(A)
  R 0x7f27bd6df8
  R 0x61e630
  R 0x6949cc
  R 0x7f77b02010
  R 0x6949cc
  R 0x61e6d0
  R 0x61e6e0
  W 0x7f77b02010
  STOP TASK INSTANCE(A)

TA:
  0x7f27bd6df8
  0x61e630
  0x6949cc
  0x7f77b02010
  0x61e6d0
  0x61e6e0

Address Differentials
Motivation: The address sets of individual task instances are usually meaningless on their own
(addresses in hexadecimal, differences in decimal)

  TA                        TB                        TC
  7F27BD6DF8   + 0        = 7F27BD6DF8   + 0        = 7F27BD6DF8
  61E630       + 8000480  = DBFA10       + 8000480  = 1560DF0
  6949CC       + 54080    = 6A1D0C       + 54080    = 6AF04C
  7F77B02010   + 8770090  = 7F7835F23A   + 8770090  = 7F78BBC464
  61E6D0       + 456      = 61E898       + 456      = 61EA60
  61E6E0       - 1808     = 61DFD0       - 1808     = 61D8C0

Differences tend to be compact and regular, and can thus represent state transitions.

Address Differentials
Given instances A and B, the differential vector is defined as follows:

$$\Delta_{AB} = \{\delta_i \mid \delta_i = b_i - a_i\} \quad \text{for each } i$$

  TA    ΔAB   TB
  a1    δ1    b1
  a2    δ2    b2

Example:
  TA:   10000  60000  8000000  7F00000  FE000
  ΔAB:  32, 96, 8, 64, 96
  TB:   10020  60060  8000008  7F00040  FE060

Differentials Behavior: Mathematical
intuition

Differential use is beneficial in cases of high redundancy.
Application distribution functions can provide the intuition on vector repetitions.
• Non-uniform CDFs imply highly regular patterns.
• Uniform CDFs imply noisy patterns (differential behavior cannot be exploited).
[Figure: two example CDFs, labeled "Non uniform" and "Uniform"]

Differentials Behavior: Mathematical intuition
Given N vectors, a straightforward dictionary will be of size:

$$R = \log_2(N)$$

Entropy H is a theoretical lower bound on representation, based on the distribution:

$$H = -\sum_{k=1}^{N} p_k \log_2 p_k$$

Example: assuming 1000 vector instances with 4 possible values, R = 2.

  Differential value        #instances   p
  (20,8000,720,100050)      700          0.7
  (16,8040,-96,50)          150          0.15
  (0,0,14420,100)           50           0.05
  (0,0,720,100050)          100          0.1

$$H = -(0.7 \log_2 0.7 + 0.15 \log_2 0.15 + 0.05 \log_2 0.05 + 0.1 \log_2 0.1) \approx 1.32$$

Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:

  Benchmark                Suite     Implementation   Differential representation   Differential entropy   DECR (%)
  FFT.128M                 BOTS      OpenMP           19.4                          14.4                   25.5
  NQUEENS.N=12             BOTS      OpenMP           11.8                          8.4                    28.7
  SORT.8M                  BOTS      OpenMP           16.4                          16.3                   0.1
  SGEFA.500x500            LINPACK   OpenMP           14.1                          0.9                    93.6
  FLUIDANIMATE.SIMSMALL    PARSEC    TBB              16.4                          8.0                    51.3
  SWAPTIONS.SIMSMALL       PARSEC    TBB              17.9                          13.1                   26.6
  STREAMCLUSTER.SIMSMALL   PARSEC    TBB              19.6                          8.9                    54.4

Possible differential application: cache line prefetching
First attempt: Prefix based predictor. Given a differential prefix, predict the suffix.
Example: A and B finished running (ΔAB is stored). Now C is running…
(entries marked "?" are predictions)

  TA            ΔAB        TB            ΔBC         TC
  7F27BD6DF8    0          7F27BD6DF8    0           7F27BD6DF8
  61E630        8000480    DBFA10        8000480     1560DF0
  6949CC        54080      6A1D0C        54080?      6AF04C?
  7F77B02010    8770090    7F7835F23A    8770090?    7F78BBC464?
  61E6D0        456        61E898        456?        61EA60?
  61E6E0        -1808      61DFD0        -1808?      61D8C0?

Possible differential application: cache line prefetching
Second attempt: PHT predictor. Based on the last X differentials, predict the next differential.
Example:

  ΔAB   ΔBC   ΔCD   ΔDE   ΔEF   ΔFG   ΔGH   ΔHI   ΔIJ
  32    32    10    32    32    10    32    32    10?
  96    96    16    96    96    16    96    96    16?
  8     8     0     8     8     0     8     8     0?
  64    64    16    64    64    16    64    64    16?
  96    96    32    96    96    32    96    96    32?

Possible differential application: cache line prefetching
Prefix policy: the differential DB is a prefix tree; prediction is performed once the differential prefix is unique.
PHT policy: the differential DB holds the history table; prediction is performed upon task start, based on the history pattern.
[Figure: block diagram. Components: Execution CPUs, Caching Hierarchy, Differential logic, Differential DB. Signals: Start task / Stop task, New Memory Request, Past Task Addresses, Current Task Addresses, New Differential, Pre-fetch Addresses]

Possible differential application: cache line prefetching
Predictors are compared with 2 models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential)
[Figure: two bar charts of Misses Per 1K Instructions per benchmark; series: Base, Prefix, PHT, Ideal]

Cache Miss Elimination (%)

  Benchmark       Prefix   PHT    Ideal
  NQUEENS.N=12    19.4     11.4   62.1
  SWAPTIONS       18.3     0.1    49.2
  FLUIDANIMATE    14.9     26.0   46.0
  SGEFA.500       0.0      97.6   99.9
  STREAMCLUSTER   21.7     36.5   82.3
  FFT.128M        45.0     -1.0   87.9
  SORT.8M         3.3      0.0    0.1

Future work
Hybrid policies: which policy to use when? (PHT is better for complete vector repetitions, prefix is better for partial vector repetitions, i.e.
suffixes)
Regular expression based policy (for pattern matching, beyond the “ideal” model)
Predict other functional features using differentials (e.g. branch prediction, PTE prefetching, etc.)

Conclusions (so far…)
• When we look at the data, patterns emerge…
• Quite a large headroom for optimizing computer systems
• Existing predictions are based on heuristics
  • A machine that does not respond within 1s is considered dead
  • Memory prefetchers look for blocked and strided accesses
• Goal: Use ML, not heuristics, to uncover behavioral semantics
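
Appendix sketch: task differentials in code
The task address set, differential vector, entropy, and PHT-style prediction from the Task Differentials slides can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names are hypothetical, and a trace is assumed to be a list of (operation, address) pairs. Both address sets are assumed to have equal length. The DECR formula (R - H) / R is an assumption that happens to match the table values. `predict_next` is only one simple reading of the PHT policy: look up the last `order` differentials among previously seen patterns.

```python
from collections import Counter
from math import log2


def task_address_set(trace):
    """Unique addresses of one task instance, ordered by first access."""
    seen = set()
    ordered = []
    for _op, addr in trace:
        if addr not in seen:
            seen.add(addr)
            ordered.append(addr)
    return ordered


def differential(t_a, t_b):
    """Differential vector: delta_i = b_i - a_i for each position i."""
    if len(t_a) != len(t_b):
        raise ValueError("address sets must have equal length (assumption)")
    return tuple(b - a for a, b in zip(t_a, t_b))


def entropy_bits(vectors):
    """Shannon entropy (in bits) of the observed differential vectors."""
    counts = Counter(vectors)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def decr_percent(r_bits, h_bits):
    """Assumed DECR definition: relative saving of entropy over dictionary size."""
    return 100.0 * (r_bits - h_bits) / r_bits


def predict_next(history, order=2):
    """Return what followed the latest earlier occurrence of the last
    `order` differentials, or None if that pattern was never seen."""
    if len(history) < order + 1:
        return None
    table = {}
    for i in range(len(history) - order):
        table[tuple(history[i:i + order])] = history[i + order]
    return table.get(tuple(history[-order:]))


# Data taken from the slides: the trace of task instance A ...
trace_a = [
    ("R", 0x7F27BD6DF8), ("R", 0x61E630), ("R", 0x6949CC),
    ("R", 0x7F77B02010), ("R", 0x6949CC), ("R", 0x61E6D0),
    ("R", 0x61E6E0), ("W", 0x7F77B02010),
]
t_a = task_address_set(trace_a)  # six unique addresses

# ... the differential example (addresses are hexadecimal) ...
delta_ab = differential(
    [0x10000, 0x60000, 0x8000000, 0x7F00000, 0xFE000],
    [0x10020, 0x60060, 0x8000008, 0x7F00040, 0xFE060],
)  # (32, 96, 8, 64, 96)

# ... and the 1000-instance entropy example.
vectors = (
    [(20, 8000, 720, 100050)] * 700 + [(16, 8040, -96, 50)] * 150
    + [(0, 0, 14420, 100)] * 50 + [(0, 0, 720, 100050)] * 100
)
h = entropy_bits(vectors)  # about 1.32 bits, versus R = 2 bits
```

With the repeating pattern from the PHT slide (two identical differentials followed by a different one), `predict_next` with `order=2` returns the third differential of the pattern once the last two observed differentials are identical.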