I/O System Performance Debugging Using Model-driven Anomaly Characterization

Kai Shen, Ming Zhong, Chuanpeng Li
Dept. of Computer Science, Univ. of Rochester
Motivation

- Implementations of complex systems (e.g., operating systems) contain performance "problems"
  - over-simplification, mishandling of special cases, …
  - these problems degrade system performance and make system behavior unpredictable
- Such problems are hard to identify and understand in complex systems
  - many system features and configuration settings
  - dynamic workload behaviors
  - problems manifest only under special conditions
- Goal
  - comprehensively identify performance problems over wide ranges of system configurations and workload conditions
Bird's Eye View of Our Approach

- Construct models to predict system performance
  - "simple": modeling system components following their high-level design algorithms
  - "comprehensive": considering wide ranges of system configuration and workload conditions
- Model-driven anomaly characterization
  - discover performance anomalies (discrepancies between model prediction and measured actual performance)
  - characterize them and attribute them to possible causes
- What can you do with the anomaly characterizations?
  - making the system perform better and more predictably through debugging
  - identifying problematic settings for avoidance
Operating System Support for Disk I/O-Bound Online Servers

- Disk I/O-bound online servers
  - server processing accesses large disk-resident data
  - examples:
    - Web servers serving large Web data
    - index searching
    - database-driven server systems
  - complex workload characteristics affect performance
- Operating system support
  - I/O prefetching
  - disk I/O scheduling (elevator, anticipatory, …)
  - file system layout and meta-data management
  - memory caching
A "Simple" Yet "Comprehensive" Throughput Model

[Figure: layered throughput model. Workload characteristics and OS configuration enter a chain of component models (memory caching model, I/O prefetching model, I/O scheduling model, storage device model); each layer passes a transformed workload (workload', workload'', workload''') and a throughput estimate (throughput', throughput'', throughput''') to the next, and storage properties feed the storage device model.]

- Decompose a complex system into weakly coupled sub-components (layers)
- Each layer transforms the workload and alters the I/O throughput
- Consider wide ranges of workloads and server concurrency
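To make the layered composition concrete, here is a minimal Python sketch; the layer functions, their signatures, and all numeric parameters are illustrative assumptions rather than the paper's actual models. Each layer consumes a workload description and emits a transformed workload plus a throughput estimate, and the layers are chained from memory caching down to the storage device.

def predict_throughput(workload, config, layers):
    # Chain the per-layer models: each transforms the workload seen by the
    # next layer and yields a throughput estimate; the final estimate is the
    # predicted system I/O throughput.
    throughput = None
    for layer in layers:
        workload, throughput = layer(workload, config)
    return throughput

def caching_layer(workload, config):
    # Assumed toy behavior: the page cache absorbs a fraction of the reads.
    hit_ratio = workload.get("cache_hit_ratio", 0.0) if config.get("caching") else 0.0
    remaining = dict(workload, bytes_read=workload["bytes_read"] * (1.0 - hit_ratio))
    return remaining, None   # caching alone does not determine the final throughput

def storage_layer(workload, config):
    # Assumed toy disk: fixed seek cost per request plus streaming transfer time.
    seek_ms, transfer_mb_per_s = 8.0, 60.0
    request_mb = workload.get("request_kb", 64) / 1024.0
    ms_per_request = seek_ms + request_mb / transfer_mb_per_s * 1000.0
    return workload, request_mb / (ms_per_request / 1000.0)   # MB/sec

# Example: a two-layer chain; the full model would also include I/O
# prefetching and I/O scheduling layers between these two.
print(predict_throughput({"bytes_read": 1 << 30, "request_kb": 256, "cache_hit_ratio": 0.2},
                         {"caching": True},
                         [caching_layer, storage_layer]))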
Model-Driven Anomaly Characterization

- An OS implementation may deviate from model prediction
  - over-simplification, mishandling of special cases, …
  - a "performance bug" may only manifest under specific system configurations or workload conditions

[Figure: workflow. Sample workload & configuration settings drive both real system measurement and performance model prediction; their comparison feeds statistical clustering and characterization, which produces representative anomalous settings and performance bug profiles (correlated system component & workload conditions).]
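A hedged Python sketch of the comparison step in this workflow; measure_throughput and predict_throughput are assumed hooks that run one sampled setting on the real system and through the model respectively, and the 15% cutoff is an arbitrary illustrative threshold.

def relative_error(measured, predicted):
    # Error metric from the results slides: 1 - measured / predicted.
    return 1.0 - measured / predicted

def find_anomalies(settings, measure_throughput, predict_throughput, threshold=0.15):
    # Split sampled settings into anomalous ones (large model/measurement
    # discrepancy) and normal ones; the anomalous set then goes on to
    # statistical clustering and characterization.
    anomalous, normal = [], []
    for setting in settings:
        err = relative_error(measure_throughput(setting), predict_throughput(setting))
        (anomalous if err > threshold else normal).append(setting)
    return anomalous, normal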
Parameter Sampling

- We choose a set of system configurations and workload properties to check for performance anomalies
- Sample parameters are chosen from a parameter space

[Figure: sample points scattered in a 3-D parameter space with axes "system configuration x", "system configuration y", and "workload property z"]

- If we choose samples randomly and independently, the chance of missing a bug decreases exponentially as the sample number increases
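A short worked step behind this claim, assuming a hypothetical bug that manifests over a fraction p of the parameter space: with n independent uniform samples, P(all samples miss the bug) = (1 - p)^n, which decays exponentially in n. For example, p = 0.05 and n = 400 gives (0.95)^400 ≈ 1.2 × 10^-9.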
Sampling Parameter Space

- Workload properties
  - server concurrency
  - I/O access pattern
    [Figure: example access streams]
  - application inter-I/O think time
- OS configurations
  - prefetching: enabled (with a prefetching depth) / disabled
  - I/O scheduling: elevator or anticipatory
  - memory caching: enabled / disabled
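A minimal sampling sketch in Python over these dimensions; the specific candidate values are illustrative assumptions, not the exact grid used in the evaluation.

import random

def sample_setting(rng):
    # Draw one workload-property / OS-configuration combination independently
    # and uniformly from the candidate values.
    return {
        "concurrency":      rng.choice([1, 2, 4, 8, 16, 32, 64, 128, 256]),
        "stream_length_kb": rng.choice([16, 64, 256, 1024, 4096]),
        "think_time_ms":    rng.choice([0, 1, 5, 10]),
        "prefetch_kb":      rng.choice([0, 32, 128, 256]),     # 0 means prefetching disabled
        "scheduler":        rng.choice(["elevator", "anticipatory"]),
        "caching":          rng.choice([True, False]),
    }

rng = random.Random(0)
settings = [sample_setting(rng) for _ in range(400)]   # 400 samples, as in the evaluation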
Anomaly Clustering

[Figure: anomalous settings plotted in the parameter space (axes: system configuration x, system configuration y, workload property z)]

- Anomalous settings may be due to multiple causes (bugs)
  - hard to make observations from all anomalous settings at once
  - desirable to cluster anomalous settings into groups likely attributable to individual causes
- Existing clustering algorithms (EM, K-means) do not handle cross-intersected clusters
- We perform hyper-rectangle clustering
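The clustering algorithm is only named on this slide, so the following Python sketch is a simplified stand-in rather than the actual method: starting from an anomalous seed setting, it greedily grows an axis-aligned hyper-rectangle and accepts an expansion only while the settings covered remain mostly anomalous. Parameters are assumed to be numerically encoded.

def inside(rect, setting):
    # rect maps each parameter dimension to an inclusive (low, high) interval.
    return all(lo <= setting[d] <= hi for d, (lo, hi) in rect.items())

def grow_rectangle(seed, anomalous, normal, purity=0.9):
    # Start from a degenerate rectangle containing only the seed and try to
    # stretch each dimension toward other anomalous settings.
    rect = {d: (v, v) for d, v in seed.items()}
    grew = True
    while grew:
        grew = False
        for d in rect:
            for other in anomalous:
                lo, hi = rect[d]
                candidate = dict(rect)
                candidate[d] = (min(lo, other[d]), max(hi, other[d]))
                if candidate[d] == rect[d]:
                    continue
                covered_anom = sum(inside(candidate, s) for s in anomalous)
                covered_norm = sum(inside(candidate, s) for s in normal)
                if covered_anom / (covered_anom + covered_norm) >= purity:
                    rect, grew = candidate, True
    return rect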
Anomaly Characterization

- Anomaly characterization
  - hard to derive useful debugging information from a raw group of anomalous settings
  - succinct characterizations are desirable
- Characterization is easy after hyper-rectangle clustering
  - simply project the hyper-rectangle onto all dimensions
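Given a hyper-rectangle in the representation assumed in the previous sketch, the projection step is a one-liner; the output format is just an illustrative choice.

def characterize(rect):
    # Project the hyper-rectangle onto every parameter dimension, yielding a
    # succinct per-parameter range such as "concurrency: 128 .. 256".
    return {d: f"{lo} .. {hi}" for d, (lo, hi) in rect.items()}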
Experimental Setup

- A micro-benchmark that can be configured to exhibit any desired workload pattern
- Linux 2.6.10 kernel
- Procedure
  - parameter sampling (400 samples)
  - anomaly clustering and characterization
  - for one possible bug: human debugging (assisted by a kernel tracing tool)
Result – Top 50 Model/Measurement Errors out of 400 Samples

- Error defined as: 1 - (measured throughput / model-predicted throughput)

[Figure: model/measurement error (0% to 100%) over sample parameter settings ranked by error, with curves for the original Linux 2.6.10 and for cumulative bug fixes #1, #1-#2, #1-#3, and #1-#4]
Result – Anomaly #1

- Workload property
  - concurrency: 128 and above
  - stream length: 256KB and above
- System configuration
  - prefetching: enabled
- The cause
  - when the disk queue is "congested", prefetching is cancelled
  - however, prefetching sometimes includes synchronously requested data, which is then resubmitted as single-page "makeup" I/O
- Solutions
  - do not cancel prefetching that includes synchronously requested data
  - or block reads when the disk queue is "congested"
Result – Anomaly #2, #3, #4

- Anomaly #2
  - concerning the anticipatory I/O scheduler
  - uses the average seek distance of past requests to estimate seek time
- Anomaly #3
  - concerning the elevator I/O scheduler
  - always searches from block address 0 for the next request after a "reset"
- Anomaly #4
  - concerning the anticipatory I/O scheduler
  - a large I/O operation is often split into small disk requests; the anticipation timer is started after the first disk request returns
Result – Overall Predictability

[Figure: two panels, "Original Linux 2.6.10" and "After four bug fixes", plotting model-predicted vs. measured I/O throughput (0 to 35 MB/sec) over ranked sample parameter settings]
Support for Real Applications

- Index searching from the Ask Jeeves search engine
  - search workload following a 2002 Ask Jeeves trace
  - anticipatory I/O scheduler
- Apache Web server
  - media clips workload following the IBM 1998 World Cup trace
  - elevator I/O scheduler

[Figure: I/O throughput (MB/sec) vs. server concurrency (1 to 256) for each application; the anticipatory-scheduler workload is shown for the original Linux 2.6.10 and for bug fixes #1, #1+#2, and #1+#2+#4, while the elevator-scheduler workload is shown for the original and for bug fixes #1 and #1+#3]
Related Work

- I/O system performance modeling
  - Storage devices [Ruemmler & Wilkes 1994] [Kotz et al. 1994] [Worthington et al. 1994] [Shriver et al. 1998] [Uysal et al. 2001]
  - OS I/O subsystem [Cao et al. 1995] [Shenoy & Vin 1998] [Shriver et al. 1999]
- Performance debugging
  - Fine-grain system instrumentation & simulation [Goldberg & Hennessy 1993] [Rosenblum et al. 1997]
  - Analyzing online traces [Chen et al. 2002] [Aguilera et al. 2003]
- Correctness (non-performance) debugging
  - Code analysis [Engler et al. 2001] [Li et al. 2004]
  - Configuration debugging [Nagaraja et al. 2004] [Wang et al. 2004]
Summary

- Model-driven anomaly characterization
  - a systematic approach to assist performance debugging of complex systems over wide ranges of runtime conditions
  - for disk I/O-bound online servers, we discovered several performance bugs in the Linux 2.6.10 kernel
- Linux 2.6.10 kernel patch for bug fix #1 is available
  - http://www.cs.rochester.edu/~cli/Publication/patch1.htm