Self-Learning Predictive Computer Systems - ICRI-CI

Intel Collaborative Research Institute
Computational Intelligence
Self-Learning, Adaptive
Computer Systems
Yoav Etsion, Technion CS & EE
Dan Tsafrir, Technion CS
Shie Mannor, Technion EE
Assaf Schuster, Technion CS
Adaptive Computer Systems
• Complexity of computer systems keeps growing
• We are moving towards heterogeneous hardware
• Workloads are getting more diverse
• Process variability affects performance/power of different parts of the system
• Human programmers and administrators cannot handle the complexity
• The goal: adapt to workload and hardware variability
Predicting System Behavior
• When a human observes the workload, she can
typically identify cause and effect
• Workload carries inherent semantics
• The problem is extracting them automatically…
• Key issues with machine learning:
• Huge datasets (performance counters; exec. traces)
• Need extremely fast response time (in most cases)
• Rigid space constraints for ML algorithms
Memory + Machine Learning
Current state-of-the-art
• Architectures are tuned for structured data
• Managed using simple heuristics
• Spatial and temporal locality
• Frequency and recency (ARC)
• Block and stride prefetchers
• Real data is not well structured
• Programmer must transform data
• Unrealistic for program-agnostic management (swapping, prefetching)
Memory + Machine Learning
Multiple learning opportunities
• Identify patterns using machine learning
• Bring data to the right place at the right time
• Memory hierarchy forms a pyramid
• Caches / DRAM, PCM / SSD, HDD
• Different levels require different learning strategies
• Top: smaller, faster, costlier [prefetching to caches]
• Bottom: bigger, slower, cheaper [fetching from disk]
• Need both hardware and software support
Research track:
Predicting Latent Faults in Data Centers
Moshe Gabel, Assaf Schuster
Latent Fault Detection
• Failures and misconfiguration happen in large datacenters
• Cause performance anomalies?
• Sound statistical framework to detect latent faults
• Practical:
Non-intrusive, unsupervised, no domain knowledge
• Adaptive:
No parameter tuning, robust to system/workload changes
Latent Fault Detection
• Applied to real-world production service of 4.5K machines
• Over 20% of machine/SW failures are preceded by latent faults
• Slow response time; network errors; disk access times
• Predict failures 14 days in advance, 70% precision, 2% FPR
• Latent Fault Detection in Large Scale Services, DSN 2012
Research track:
Task Differentials:
Dynamic, inter-thread predictions
using memory access footsteps
Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser
Motivation
• We are in the age of parallel computing.
[Figure: execution timeline of tasks in parallel sections separated by synchronization points]
• Programming paradigms are shifting towards task-level parallelism
• Tasks are supported by libraries such as TBB and OpenMP:
...
GridLauncher<InitDensitiesAndForcesMTWorker> &id = *new (tbb::task::allocate_root()) GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(id);
GridLauncher<ComputeDensitiesMTWorker> &cd = *new (tbb::task::allocate_root()) GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(cd);
...
Taken from: PARSEC.fluidanimate TBB implementation
• Implicit forms of task-level parallelism include GPU kernels and parallel loops
• Task behavior tends to be highly regular, which makes it a target for learning and adaptation
How do things currently work?
• Programmer codes a parallel loop
• SW maps multiple tasks to one thread
• HW sees a sequence of instructions
• HW prefetchers try to identify patterns between consecutive memory accesses
• HW has no notion of program semantics, i.e. that execution consists of a sequence of tasks, not instructions
[Figure: sequence of tasks labeled A to E]
Task Address Set
• Given the memory trace of task instance A, the task address set $T_A$ is the set of unique addresses $T_A = \{a_1, a_2, \ldots, a_n\}$, ordered by access time:
Trace:
START TASK INSTANCE(A)
R 0x7f27bd6df8
R 0x61e630
R 0x6949cc
R 0x7f77b02010
R 0x6949cc
R 0x61e6d0
R 0x61e6e0
W 0x7f77b02010
STOP TASK INSTANCE(A)
TA:
0x7f27bd6df8
0x61e630
0x6949cc
0x7f77b02010
0x61e6d0
0x61e6e0
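The construction of the task address set can be sketched as below. This is only an illustration under assumptions: the `Access` record and the function name `task_address_set` are hypothetical, and "ordered by access time" is read here as "ordered by first access".

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical trace record: 'R' or 'W' plus the accessed address.
struct Access {
    char kind;
    std::uint64_t addr;
};

// Returns the unique addresses touched by one task instance,
// ordered by the time of their first access (assumed reading of the slide).
std::vector<std::uint64_t> task_address_set(const std::vector<Access>& trace) {
    std::vector<std::uint64_t> ordered;
    std::unordered_set<std::uint64_t> seen;
    for (const Access& a : trace) {
        assert(a.kind == 'R' || a.kind == 'W');  // only reads and writes are expected
        if (seen.insert(a.addr).second) {        // true only for a first occurrence
            ordered.push_back(a.addr);
        }
    }
    return ordered;
}
```

Fed with the eight-access trace above, the function keeps six addresses: the repeated read of 0x6949cc and the write to 0x7f77b02010 add nothing new.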
Address Differentials
• Motivation: the address set of a single task instance is usually meaningless on its own
(addresses are hexadecimal, differences are decimal)
TA:                    TB:                    TC:
7F27BD6DF8 + 0       = 7F27BD6DF8 + 0       = 7F27BD6DF8
61E630     + 8000480 = DBFA10     + 8000480 = 1560DF0
6949CC     + 54080   = 6A1D0C     + 54080   = 6AF04C
7F77B02010 + 8770090 = 7F7835F23A + 8770090 = 7F78BBC464
61E6D0     + 456     = 61E898     + 456     = 61EA60
61E6E0     - 1808    = 61DFD0     - 1808    = 61D8C0
• Differences tend to be compact and regular, so they can represent state transitions
Address Differentials
• Given instances A and B, the differential vector is defined as follows:
$\Delta_{AB} = \{\delta_i \mid \delta_i = b_i - a_i \text{ for each } i\}$
TA    ΔAB   TB
a1    δ1    b1
a2    δ2    b2
• Example:
TA:        ΔAB:   TB:
10000      32     10020
60000      96     60060
8000000    8      8000008
7F00000    64     7F00040
FE000      96     FE060
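The element-wise definition can be written as a short function. This is a minimal sketch, assuming both address sets have the same length and that element i of one instance pairs with element i of the other; the function name `differential` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes delta_i = b_i - a_i for two task address sets of equal length.
// Deltas are signed because an address may move downwards between instances.
std::vector<std::int64_t> differential(const std::vector<std::uint64_t>& ta,
                                       const std::vector<std::uint64_t>& tb) {
    assert(ta.size() == tb.size());  // assumption: sets are aligned element by element
    std::vector<std::int64_t> delta(ta.size());
    for (std::size_t i = 0; i < ta.size(); ++i) {
        // Unsigned subtraction wraps, and the cast recovers the signed difference.
        delta[i] = static_cast<std::int64_t>(tb[i] - ta[i]);
    }
    return delta;
}
```

With the hexadecimal addresses of the example, the result is the decimal vector (32, 96, 8, 64, 96).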
Differentials Behavior: Mathematical intuition
• Differential use is beneficial in cases of high redundancy.
• Application distribution functions can provide intuition about vector repetitions.
• Non-uniform CDFs imply highly regular patterns.
• Uniform CDFs imply noisy patterns (differential behavior cannot be exploited).
[Figure: example CDFs, panels "Non uniform" and "Uniform"]
Differentials Behavior: Mathematical intuition
• Given N distinct vectors, a straightforward dictionary needs a representation of size $R = \log_2 N$
• Entropy H is a theoretical lower bound on the representation, based on the distribution: $H = -\sum_{k=1}^{N} p(k) \log_2 p(k)$
• Example, assuming 1000 vector instances with 4 possible values: R = 2.
Differential value       #instances   p
(20,8000,720,100050)     700          0.7
(16,8040,-96,50)         150          0.15
(0,0,14420,100)          50           0.05
(0,0,720,100050)         100          0.1
$H = -(0.7 \log_2 0.7 + 0.15 \log_2 0.15 + 0.05 \log_2 0.05 + 0.1 \log_2 0.1) \approx 1.32$
• Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:
Benchmark                Suite     Implementation   Differential representation   Differential entropy   DECR (%)
FFT.128M                 BOTS      OpenMP           19.4                          14.4                   25.5
NQUEENS.N=12             BOTS      OpenMP           11.8                          8.4                    28.7
SORT.8M                  BOTS      OpenMP           16.4                          16.3                   0.1
SGEFA.500x500            LINPACK   OpenMP           14.1                          0.9                    93.6
FLUIDANIMATE.SIMSMALL    PARSEC    TBB              16.4                          8.0                    51.3
SWAPTIONS.SIMSMALL       PARSEC    TBB              17.9                          13.1                   26.6
STREAMCLUSTER.SIMSMALL   PARSEC    TBB              19.6                          8.9                    54.4
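The entropy of the four-value example and the DECR column can be reproduced with the sketch below. The slides do not spell out the DECR formula; the form `1 - H/R` used here is an assumption, chosen because it approximately matches the table (for SGEFA, 1 - 0.9/14.1 is about 93.6%). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shannon entropy, in bits, of a histogram of differential-vector counts.
double entropy_bits(const std::vector<double>& counts) {
    double total = 0.0;
    for (double c : counts) total += c;
    assert(total > 0.0);
    double h = 0.0;
    for (double c : counts) {
        if (c > 0.0) {
            const double p = c / total;  // empirical probability p(k)
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Assumed definition (not stated in the slides): DECR = 1 - H/R, in percent.
double decr_percent(double representation_bits, double entropy) {
    assert(representation_bits > 0.0);
    return 100.0 * (1.0 - entropy / representation_bits);
}
```

For the counts 700, 150, 50 and 100, `entropy_bits` gives roughly 1.319 bits against R = 2, so under the assumed formula about a third of the dictionary representation would be redundant.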
Possible differential application: cache line prefetching
• First attempt: prefix-based predictor. Given a differential prefix, predict the suffix
• Example: A and B finished running (ΔAB is stored)
• Now C is running… (entries marked "?" are predictions)
TA:          ΔAB:       TB:          ΔBC:        TC:
7F27BD6DF8   0,         7F27BD6DF8   0,          7F27BD6DF8
61E630       8000480,   DBFA10       8000480,    1560DF0
6949CC       54080,     6A1D0C       54080?      6AF04C?
7F77B02010   8770090,   7F7835F23A   8770090?    7F78BBC464?
61E6D0       456,       61E898       456?        61EA60?
61E6E0       -1808      61DFD0       -1808?      61D8C0?
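The prefix idea can be sketched as follows. This is a simplified illustration, not the presented design: a linear scan over stored differentials stands in for a real lookup structure, the prediction fires only when the observed prefix matches a single stored vector, and the names `predict_suffix` and `prefetch_addresses` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::int64_t> Differential;

// Returns true and fills `suffix` when exactly one distinct stored
// differential starts with the observed `prefix`.
bool predict_suffix(const std::vector<Differential>& db,
                    const Differential& prefix, Differential& suffix) {
    const Differential* match = nullptr;
    for (const Differential& d : db) {
        if (d.size() < prefix.size()) continue;
        if (!std::equal(prefix.begin(), prefix.end(), d.begin())) continue;
        if (match != nullptr && *match != d) return false;  // prefix still ambiguous
        match = &d;
    }
    if (match == nullptr) return false;  // prefix never seen before
    suffix.assign(match->begin() + static_cast<std::ptrdiff_t>(prefix.size()),
                  match->end());
    return true;
}

// Applies the predicted deltas to the remaining addresses of the previous
// task instance, yielding candidate prefetch addresses for the running one.
std::vector<std::uint64_t> prefetch_addresses(const std::vector<std::uint64_t>& prev,
                                              std::size_t prefix_len,
                                              const Differential& suffix) {
    assert(prefix_len <= prev.size());
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; i < suffix.size() && prefix_len + i < prev.size(); ++i) {
        out.push_back(prev[prefix_len + i] + static_cast<std::uint64_t>(suffix[i]));
    }
    return out;
}
```

With ΔAB stored and the first two deltas of ΔBC observed (0 and 8000480), the sketch predicts the remaining four deltas and, applied to TB, yields the four addresses marked "?" in the TC column.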
Possible differential application: cache line prefetching
• Second attempt: PHT predictor. Based on the last X differentials, predict the next differential.
• Example (entries marked "?" are predictions):
ΔAB   ΔBC   ΔCD   ΔDE   ΔEF   ΔFG   ΔGH   ΔHI   ΔIJ
32    32    10    32    32    10    32    32    10?
96    96    16    96    96    16    96    96    16?
8     8     0     8     8     0     8     8     0?
64    64    16    64    64    16    64    64    16?
96    96    32    96    96    32    96    96    32?
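A history-based predictor of this kind can be sketched as a table keyed by the last X differentials. Several things here are assumptions rather than the presented design: reading PHT as a pattern history table, keeping the table in a `std::map`, storing only the most recent successor per history, and the class name `PhtPredictor`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

typedef std::vector<std::int64_t> Differential;
typedef std::vector<Differential> History;

class PhtPredictor {
public:
    explicit PhtPredictor(std::size_t history_len) : history_len_(history_len) {
        assert(history_len > 0);  // X must be at least 1
    }

    // Records a completed differential: the previous X differentials map to it.
    void update(const Differential& d) {
        if (recent_.size() == history_len_) {
            table_[recent_] = d;             // keep only the latest successor
            recent_.erase(recent_.begin());  // slide the history window
        }
        recent_.push_back(d);
    }

    // Predicts the next differential if the current history was seen before.
    bool predict(Differential& next) const {
        if (recent_.size() < history_len_) return false;
        std::map<History, Differential>::const_iterator it = table_.find(recent_);
        if (it == table_.end()) return false;
        next = it->second;
        return true;
    }

private:
    std::size_t history_len_;
    History recent_;
    std::map<History, Differential> table_;
};
```

With X = 2 and the eight differentials ΔAB through ΔHI from the example fed through `update`, the history ends with two copies of (32, 96, 8, 64, 96), which was previously followed by (10, 16, 0, 16, 32), so `predict` returns that vector for ΔIJ.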
Possible differential application: cache line prefetching
• Prefix policy: the differential DB is a prefix tree; prediction is performed once the differential prefix is unique.
• PHT policy: the differential DB holds the history table; prediction is performed upon task start, based on the history pattern.
[Figure: prefetcher block diagram. Components: execution CPUs, caching hierarchy, differential logic, differential DB. Signals: start task/stop task, new memory request, past task addresses, current task addresses, new differential, pre-fetch addresses]
Possible differential application: cache line prefetching
• Predictors are compared with 2 models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential)
[Figure: two bar charts of misses per 1K instructions per benchmark, with bars for Base, Prefix, PHT and Ideal]
Cache Miss Elimination (%)
Benchmark       Prefix   PHT    Ideal
NQUEENS.N=12    19.4     11.4   62.1
SWAPTIONS       18.3     0.1    49.2
FLUIDANIMATE    14.9     26.0   46.0
SGEFA.500       0.0      97.6   99.9
STREAMCLUSTER   21.7     36.5   82.3
FFT.128M        45.0     -1.0   87.9
SORT.8M         3.3      0.0    0.1
Future work
• Hybrid policies: which policy to use when? (PHT is better for complete vector repetitions, prefix is better for partial vector repetitions, i.e. suffixes)
• Regular-expression-based policy (for pattern matching, beyond the “ideal” model)
• Predict other functional features using differentials (e.g. branch prediction, PTE prefetching, etc.)
Conclusions (so far…)
• When we look at the data, patterns emerge…
• Quite a large headroom for optimizing computer systems
• Existing predictions are based on heuristics
• A machine that does not respond within 1s is considered dead
• Memory prefetchers look for blocked and strided accesses
• Goal:
Use ML, not heuristics, to uncover behavioral semantics