Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, May 14-19, 2000
Data Mining on an OLTP System (Nearly) for Free
Erik Riedel¹, Christos Faloutsos, Gregory R. Ganger, David F. Nagle
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
{riedel,christos,ganger,nagle}@cs.cmu.edu
¹ Now with Hewlett-Packard Laboratories, Palo Alto, California; riedel@hpl.hp.com
Abstract
This paper proposes a scheme for scheduling disk requests that
takes advantage of the ability of high-level functions to operate
directly at individual disk drives. We show that such a scheme
makes it possible to support a Data Mining workload on an OLTP
system almost for free: there is only a small impact on the throughput and response time of the existing workload. Specifically, we
show that an OLTP system has the disk resources to consistently
provide one third of its sequential bandwidth to a background Data
Mining task with close to zero impact on OLTP throughput and
response time at high transaction loads. At low transaction loads,
we show much lower impact than observed in previous work. This
means that a production OLTP system can be used for Data Mining
tasks without the expense of a second dedicated system. Our
scheme takes advantage of close interaction with the on-disk
scheduler by reading blocks for the Data Mining workload as the
disk head “passes over” them while satisfying demand blocks from
the OLTP request stream. We show that this scheme provides a
consistent level of throughput for the background workload even at
very high foreground loads. Such a scheme is of most benefit in
combination with an Active Disk environment that allows the
background Data Mining application to also take advantage of the
processing power and memory available directly on the disk
drives.
1 Introduction

Query processing in a database system requires several resources, including 1) memory, 2) processor cycles, 3) interconnect bandwidth, and 4) disk bandwidth. Performing additional tasks, such as
data mining, on a transaction processing system without impacting
the existing workload would require there to be “idle” resources in
each of these four categories. A system that uses Active Disks
[Riedel98] contains additional memory and compute resources at
the disk drives that are not utilized by the transaction processing
workload. Using Active Disks to perform highly-selective scan
and aggregation operations directly at the drives keeps the interconnect requirements low. This leaves the disk arm and media rate
as the critical resources. This paper proposes a scheduling algorithm at the disks that allows a background sequential workload to
be satisfied essentially for free while servicing random foreground
requests. We first describe a simple priority-based scheduling scheme that allows the background workload to proceed with only a small impact on the foreground work, and then extend this scheme to read additional blocks completely "for free" during the otherwise idle rotational delay time. We also show that
these benefits are consistent at high foreground transaction loads
and as data is striped over a larger number of disks.
2 Background and Motivation
The use of data mining to elicit patterns from large databases is
becoming increasingly popular over a wide range of application
domains and datasets [Fayyad98, Chaudhuri97, Widom95]. One of
the major obstacles to starting a data mining project within an
organization is the high initial cost of purchasing the necessary
hardware. This means that someone must “take a chance” on the
up-front investment simply on the suspicion that there may be
interesting “nuggets” to be mined from the organization’s existing
databases.
The most common strategy for data mining on a set of transaction
data is to purchase a second database system, duplicate the transaction records from the OLTP system in the decision support system
each evening, and perform mining tasks only on the second system, i.e. to use a “data warehouse” separate from the production
system. This strategy not only requires the expense of a second
system, but also incurs the management cost of maintaining two complete copies of the data.

system                           # of CPUs   memory (GB)   # of disks   storage (GB)   live data (GB)   cost ($)
NCR WorldMark 4400 (TPC-C)               4             4          203          1,822            1,400      839,284
NCR TeraData 5120 (TPC-D 300)          104            26          624          2,690              300   12,269,156

Table 1: Comparison of an OLTP and a DSS system from the same vendor. Data from www.tpc.org, May and June 1998.

Table 1 compares a transaction system and
a decision support system from the same manufacturer. The decision support system contains a larger amount of compute power and higher aggregate I/O bandwidth, even for a significantly
smaller amount of live data. In this paper, we argue that the ability
to operate close to the disk makes it possible for a significant
amount of data mining to be performed using the transaction processing system, without requiring a second system at all. This provides an effective way for an organization to “bootstrap” its
mining activities.
Previous work has shown that highly parallel and selective operations such as aggregation, selection, or selective joins can be profitably offloaded to Active Disks or similar systems [Riedel98,
Acharya98, Keeton98]. Many data mining operations including
nearest neighbor search, association rules [Agrawal96], ratio and
singular value decomposition [Korn98], and clustering [Zhang97,
Guha98] eventually translate into a few large sequential scans of
the entire data. If these selective, parallel scans can be performed
directly at the individual disks, then the limiting factor will be the
bandwidth available for reading data from the disk media. This
paper offers one example of a scheduling optimization that can be
performed only with application-level knowledge available
directly at the disk drives.
Active Disks provide an architecture to take advantage of the processing power and memory resources available in future generation disk drives to perform application-level functions. Next
generation drive control chips have processing rates of 150 and
200 MHz and use standard RISC cores, with the promise of up to
500 MIPS processors in two years [Cirrus98, Siemens98]. This
makes it possible to perform computation directly on commodity
disk drives, offloading server systems and network resources by
computing at the edges of the system. The core advantages of this
architecture are 1) the parallelism in large systems, 2) the reduction in interconnect bandwidth requirements by filtering and
aggregating data directly at the storage devices, before it is placed
onto the interconnect and 3) closer integration with on-disk scheduling and optimization. Figure 1 illustrates the architecture of such
a system.
3 Proposed System
The performance benefits of Active Disks are most dramatic with
the highly-selective parallel scans that form a core part of many
data mining applications. The scheduling system we propose
assumes that a mining application can be specified abstractly as:
(1) foreach block(B) in relation(X)
(2)     filter(B) -> B'
(3)     combine(B') -> result(Y)

(assumption: the ordering of blocks does not affect the result of the computation)

where steps (1) and (2) can be performed directly at the disk drives in parallel, and step (3) combines the results from all the disks at the host once the individual computations complete.
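To make this model concrete, here is a minimal sketch in Python (the function names and the example predicate are ours for illustration; the paper only requires that filter be cheap and selective and that combine tolerate any block ordering):

def scan(blocks, filter_fn, combine_fn, result):
    # steps (1) and (2): filter each block, in any order, at the drive
    # step (3): combine the much smaller filtered results at the host
    for block in blocks:
        result = combine_fn(result, filter_fn(block))
    return result

# Example: count records greater than 4. Addition is commutative, so
# the ordering of blocks does not affect the result.
blocks = [[3, 7, 1], [9, 2], [5, 8]]
print(scan(blocks,
           filter_fn=lambda b: sum(1 for r in b if r > 4),
           combine_fn=lambda acc, x: acc + x,
           result=0))  # prints 4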
Figure 1: Diagram of a traditional server and an Active Disk architecture. By moving processing to the disks, the amount of data transferred on the network is reduced, and the computation can take advantage of the parallelism provided by the disks and benefit from closer integration with on-disk scheduling. In the Active Disk system, selective on-disk processing reduces the network bandwidth required upstream and offloads the server CPU; disk bandwidth becomes the critical resource. This allows the system to continue to support the same transaction workload with additional mining functions operating at the disks.
Figure 2: Illustration of 'free' block scheduling, contrasting the action in today's disk drives with the modified action under 'free' block scheduling. In the original operation, a request to read or write a block causes the disk to seek from its current location (A) to the destination cylinder (B); it then waits for the requested block to rotate underneath the head. In the modified system, the disk has a set of potential blocks that it can read "at its convenience". When planning a seek from A to B, the disk considers how long the rotational delay at the destination will be and, if there is sufficient time, plans a shorter seek to C, reads a block from the list of background requests, and then continues the seek to B. This additional read is completely 'free' because the time waiting for the rotation to complete at cylinder B is completely wasted in the original operation.
Applications that fit this model (low computation cost for the filter function and high selectivity, i.e. large data reduction, from B to B') will be limited by the raw bandwidth available for sequential reads from the disk media. In a dedicated mining system, this bandwidth would be the full sequential bandwidth of the individual disks. However, even in a system running a transaction processing workload, a significant amount of the necessary bandwidth is available in the idle time between transaction requests and within their seek and rotational latencies.
The key insight is that while positioning the disk mechanism for a foreground transaction processing (OLTP) workload, disk blocks passing under the disk head can be read "for free". If the blocks are useful to a background application, they can be read without any impact on the OLTP response time by completely hiding the read within the request's rotational delay. In other words, while the disk is moving to the requested block, it opportunistically reads blocks that it passes over and provides them to the data mining application. If this application is operating directly at the disk drive, then the block can be immediately processed, without ever having to be transferred to the host. As long as the data mining application (or any other background application) can issue a large number of requests at once and does not depend on the order of processing of the requested background blocks, the background application can read a significant portion of its data without any cost to the OLTP workload. The disk can ensure that only blocks of a particular application-specific size (e.g. database pages) are provided and that all the blocks requested are read exactly once, but the order of blocks will be determined by the pattern of the OLTP requests.

Figure 2 shows the basic intuition of the proposed scheme. The drive maintains two request queues: 1) a queue of demand foreground requests that are satisfied as soon as possible; and 2) a list of background blocks that are satisfied when convenient. Whenever the disk plans a seek to satisfy a request from the foreground queue, it checks whether any of the blocks in the background queue are "in the path" from the current location of the disk head to the desired foreground request. This is accomplished by comparing the delay that will be incurred by a direct seek and rotational latency at the destination to the time required to seek to an alternate location, read some number of blocks, and then perform a second seek to the desired cylinder. If this "detour" is shorter than the rotational delay, then some number of background blocks can be read without increasing the response time of the foreground request. If multiple blocks satisfy this criterion, the location that satisfies the largest number of background blocks is chosen. Note that in the simplest case, the drive will simply continue to read blocks at the current location, or seek to the destination and read some number of blocks there before the desired block rotates under the head.
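The detour decision itself can be sketched as follows. This is a simplified illustration, not the drive firmware: the linear seek model and the per-block read time are invented stand-ins, and a real implementation needs the exact seek curves, settle times, and logical-to-physical mappings discussed in Section 5.

def seek_time(a, b):
    # toy model: 1 ms settle plus 1 ms per 100 cylinders traveled
    return 1.0 + abs(a - b) / 100.0

def plan_detour(head, target, rot_delay_ms, background_locs,
                read_ms_per_block=0.5):
    # Pick the detour cylinder that satisfies the most background
    # blocks while still arriving at `target` no later than a direct
    # seek plus its rotational delay would.
    budget = seek_time(head, target) + rot_delay_ms
    best_loc, best_count = None, 0
    for loc, nblocks in background_locs.items():
        spare = budget - seek_time(head, loc) - seek_time(loc, target)
        can_read = min(nblocks, max(0, int(spare // read_ms_per_block)))
        if can_read > best_count:
            best_loc, best_count = loc, can_read
    return best_loc, best_count  # (None, 0) means seek directly

# background blocks queued at cylinders 120 and 400; head at 100,
# foreground request at cylinder 500 with a 4 ms rotational delay
print(plan_detour(100, 500, 4.0, {120: 8, 400: 2}))  # -> (120, 6)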
4 Experiments
All of our experiments were conducted using a detailed disk simulator [Ganger98], synthetic traces based on simple workload characteristics, and traces taken from a server running a TPC-C
transaction workload. The simulation models a closed system with
a think time of 30 milliseconds which approximates that seen in
our traces. We vary the multiprogramming level (MPL) of the
OLTP workload to illustrate increasing foreground load on the system. Multiprogramming level is specified in terms of disk
requests, so a multiprogramming level of 10 means that there are
ten disk requests active in the system at any given point (either
queued at one of the disks or waiting in think time).
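As a rough illustration of this closed-system model (a toy sketch, not the DiskSim configuration: the disk service-time distribution here is invented), each of the MPL requests alternates between a 30 ms think time and service at a single disk:

import heapq, random

def run(mpl, sim_ms=60_000.0, think_ms=30.0):
    random.seed(0)
    # toy service time in ms; the real experiments use DiskSim [Ganger98]
    service = lambda: random.uniform(4.0, 16.0)
    # (time the request next arrives at the disk, request id)
    arrivals = [(random.uniform(0.0, think_ms), i) for i in range(mpl)]
    heapq.heapify(arrivals)
    disk_free_at, completed = 0.0, 0
    while arrivals:
        t, i = heapq.heappop(arrivals)
        if t >= sim_ms:
            break
        start = max(t, disk_free_at)   # wait in the disk queue
        disk_free_at = start + service()
        completed += 1
        # after completing, the request thinks, then returns
        heapq.heappush(arrivals, (disk_free_at + think_ms, i))
    return completed / (sim_ms / 1000.0)  # requests per second

for mpl in (1, 10, 50):
    print(mpl, round(run(mpl), 1))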
In the synthetic workloads, the OLTP requests are evenly spaced across the entire surface of the disk, with a read-to-write ratio of 2:1 and a request size that is a multiple of 4 kilobytes chosen from an exponential distribution with a mean of 8 kilobytes. The background data mining (Mining) requests are large sequential reads with a minimum block size of 8 kilobytes. In the experiments, Mining is assumed to occur across the entire database, so the background workload reads the entire surface of the disk. Reading the entire disk is a pessimistic assumption, and further optimizations are possible if only a portion of the disk contains data (see Section 4.5).

All simulations run for one hour of simulated time and complete between 50,000 and 250,000 foreground disk requests and up to 900,000 background requests, depending on the load.

There are several different approaches for integrating a background sequential workload with the foreground OLTP requests. The simplest performs background requests only during disk idle times (i.e. when the queue of foreground requests is completely empty). The second uses the "free blocks" technique described above to read extra background blocks during the rotational delay of an OLTP request, but does nothing during disk idle times. Finally, a scheme that integrates both approaches allows the drive to service background requests whenever they do not interfere with the OLTP workload. This section presents results for each of these three approaches, followed by results that show the effect is consistent as data is striped over larger numbers of disks. Finally, we present results for the traced workload, which correspond well with those seen for the synthetic workload.
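Schematically, the three approaches differ only in when the drive may touch the background queue; the following sketch (names are ours) makes the policy distinction explicit:

def blocks_on_path(fg_request, bg_queue):
    # stand-in for the rotational-delay detour planning sketched in
    # Section 3 (see plan_detour); returns background blocks readable
    # for free while positioning for fg_request
    return []

def next_action(fg_queue, bg_queue, policy):
    if fg_queue:
        fg = fg_queue.pop(0)
        free = []
        if policy in ("free_only", "combined"):
            free = blocks_on_path(fg, bg_queue)
        return ("service_foreground", fg, free)
    if bg_queue and policy in ("background_only", "combined"):
        # disk is otherwise idle: service background at low priority
        return ("service_background", bg_queue.pop(0), [])
    return ("idle", None, [])

print(next_action(["fg1"], ["bg1"], "combined"))  # foreground first
print(next_action([], ["bg1"], "free_only"))      # idles: no free blocks without OLTP seeks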
4.1 Background Blocks Only, Single Disk
Figure 3 shows the performance of the OLTP and Mining workloads running concurrently as the OLTP load increases. Mining
requests are handled at low priority and are serviced only when the
foreground queue is empty. The first chart shows that increasing
the OLTP load increases throughput until the disk saturates and
queues begin to build. This effect is also clear in the response time
chart below, where times grow quickly at higher loads. The second chart shows that the throughput of the Mining workload is about 2 MB/s at low load but decreases rapidly as the OLTP load increases, which forces out the low-priority background requests. The
third chart shows the impact of Mining requests on OLTP response
time. At low load, when requests are already fast, the OLTP
response time increases by 25 to 30%. This increase occurs
because new OLTP requests arrive while a Mining request is being
serviced. As the load increases, OLTP request queueing grows,
reducing the chance that an OLTP request would wait behind a
Mining request in service and eliminating the increase in OLTP
response time as the Mining work is forced out.
Figure 3: Throughput comparison for a single disk using Background Blocks Only. The first chart shows the throughput of the OLTP workload both with and without the Mining workload; the addition of the Mining workload has a small impact on OLTP throughput that decreases as the OLTP load increases and the Mining workload "backs off". This trend is visible in the Mining throughput chart, which shows the Mining throughput trailing off to zero as the OLTP load increases. Finally, the response time chart shows the impact of the Mining workload on the response time of the OLTP; this impact is as high as 30% at low load, and decreases to zero as the load increases.
4.2 ‘Free’ Blocks Only, Single Disk
Figure 4 shows the effect of reading 'free' blocks while the drive performs seeks for OLTP requests. Low OLTP loads produce low Mining throughput because little opportunity exists to exploit 'free' blocks on OLTP requests. As the foreground load increases, the opportunity to read 'free' blocks improves, increasing Mining throughput to about 1.7 MB/s. This is a similar level of throughput to that seen with the Background Blocks Only approach, but it occurs under high OLTP load, where the first approach could sustain significant Mining throughput only under light load, dropping rapidly to zero for loads above an MPL of 10. Since Mining does not make requests during completely idle time in the 'Free' Blocks Only approach, OLTP response time does not increase at all. The only shortcoming of the 'Free' Blocks Only approach is the low Mining throughput under light OLTP load.

Figure 4: Performance of the Free Blocks Only approach. When reading exclusively 'free' blocks, the Mining throughput is limited by the rate of the OLTP workload: if there are no OLTP requests being serviced, there are also no 'free' blocks to pick up. One advantage of using only the 'free' blocks is that the OLTP response time is completely unaffected, even at low loads. The true benefit of the 'free' blocks comes as the OLTP load increases. Where the Background Blocks Only approach rapidly goes to zero at high loads, the Free Blocks Only approach reaches a steady 1.7 MB/s of throughput that is sustained even at very high OLTP loads.
4.3 Combination of Background and 'Free' Blocks, Single Disk
Figure 5 shows the effect of combining these two approaches. On
each seek caused by an OLTP request, the disk reads a number of
‘free’ blocks as described in Figure 2. This models the behavior of
a query that scans a large portion of the disk, but does not care in
which order the blocks are processed. Full table scans in the
TPC-D queries, aggregations, or the association rule discovery
application [Riedel98] could all make use of this functionality.
Figure 5 shows that Mining throughput increases to between 1.4
and 2.0 MB/s at low load. At high loads, when the Background
Blocks Only approach drops to zero, the combined system continues to provide a consistent throughput at about 2.0 MB/s without
any impact on OLTP throughput or response time. The full sequential bandwidth of the modeled disk (if there were no foreground requests) is only 5.3 MB/s to read the entire disk¹, so this represents more than 1/3 of the raw bandwidth of the drive completely "in the background" of the OLTP load.

¹ Note that reading the entire disk is pessimistic, since reading the inner tracks of modern disk drives is significantly slower than reading the outer tracks. If we only read the beginning of the disk (which is how "maximum bandwidth" numbers are determined in manufacturer spec sheets), the bandwidth would be as high as 6.6 MB/s, but our scheme would also perform proportionally better.

Figure 5: Performance of combining the Background Blocks and Free Blocks approaches. This shows the best portions of both performance curves. The Mining throughput is consistently about 1.5 to 1.7 MB/s, which represents almost 1/3 of the maximum sequential bandwidth of the disk being modeled. At low OLTP loads, it has the behavior of the Background Blocks Only approach, with a similar impact on OLTP response time, and at high loads it maintains throughput by the use of 'free' blocks. Also note that at even lower multiprogramming levels, performance would be even better, and that an MPL of 10 requests outstanding at a single disk is already a relatively high absolute load in today's systems.

4.4 Combination of Background and 'Free' Blocks, Multiple Disks

Systems optimized for bandwidth rather than operations per second will usually have more disks than are strictly required to store the database (as illustrated by the decision support system of Table 1). This same design choice can be made in a combined OLTP/Mining system.
Figure 6 shows that Mining throughput using our scheme increases
linearly as the workloads are striped across multiple disks. Using
two disks to store the same database (i.e. increasing the number of
disks used to store the data in order to get higher Mining throughput, while maintaining the same OLTP load and total amount of
“live” data) provides a Mining throughput above 50% of the maximum drive bandwidth across all load factors, and Mining throughput reaches more than 80% of maximum with three disks.
We can see that the performance of the multiple disk systems is a
straightforward “shift” of the single disk results, where the Mining
throughput with n disks at a particular multiprogramming level is
simply n times the performance of a single disk at 1/n that MPL.
The two disk system at 20 MPL performs twice as fast as the single disk at 10 MPL, and similarly with 3 disks at 30 MPL. This
predictable scaling in Mining throughput as disks are added bodes
well for database administrators and capacity planners designing
these hybrid systems. Additional experiments indicate that these
benefits are also resilient in the face of load imbalances ("hot spots") among disks in the foreground workload.

Figure 6: Throughput of 'free' blocks as additional disks are used for the same OLTP workload. If we stripe the same amount of data over a larger number of disks while maintaining a constant OLTP load, the total Mining throughput increases as expected, "shifting" up and to the right in proportion to the number of disks used.
4.5 ‘Free’ Blocks, Details
Figure 7 shows the performance of the 'free' block system at a single, medium foreground load (an MPL of 10, as shown in the previous charts). The rate of handling background requests drops steadily as the fraction of unread background blocks decreases and more and more of the unread blocks are at the "edges" of the disk (i.e. the areas not often accessed by the OLTP workload and the areas that are expensive to seek to). This means that if data can be kept near the "front" or "middle" of the disk, overall 'free' block performance would improve. Extending our scheduling scheme to "realize" when only a small portion of the background work remains and to issue some of these background requests at normal priority (with the corresponding impact on foreground response time) should also improve overall throughput. The challenge is to find an appropriate trade-off of impact on the foreground against improved background performance.
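One simple way to realize this, sketched below with an assumed threshold (the paper leaves the exact trade-off open), is to promote the remaining background blocks to the demand queue once the observed 'free' bandwidth falls too low:

def maybe_promote(bg_queue, fg_queue, free_rate_kbs, floor_kbs=200.0):
    # floor_kbs is an assumed cutoff: below it, the straggler blocks
    # are issued at normal priority, trading some foreground response
    # time for finishing the background scan sooner
    if bg_queue and free_rate_kbs < floor_kbs:
        fg_queue.extend(bg_queue)
        bg_queue.clear()

demand, background = [], ["block-%d" % i for i in range(4)]
maybe_promote(background, demand, free_rate_kbs=120.0)
print(demand, background)  # stragglers promoted; background drained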
Figure 7: Details of 'free' block throughput at a particular foreground load (an MPL of 10). The first plot shows the time needed to read the entire disk in the background (percent complete versus time); the second plot shows the instantaneous bandwidth of the background workload over time. The bandwidth is significantly higher at the beginning, when there are more background blocks to choose from; as the number of blocks still needed falls, fewer of them are "within reach" of the 'free' algorithm and the throughput decreases. The dashed line shows the average bandwidth of the entire operation.
Finally, note that even with the basic scheme as described here, it is possible to read the entire 2 GB disk for 'free' in about 1,700 seconds (under 28 minutes), allowing a disk to perform over 50 "scans per day" [Gray97] of its entire contents completely unnoticed (86,400 seconds in a day divided by roughly 1,700 seconds per scan).
4.6 Workload Validation
Figure 8 shows the results of a series of traces taken from a real system running TPC-C with varying loads. The traced system is a 300 MHz Pentium II with 128 MB of memory running Windows NT and Microsoft SQL Server on a one gigabyte TPC-C test database striped across two Quantum Viking disks. When we add a background sequential workload to this system, we see results similar to those of the synthetic workloads. At low loads, several MB/s of Mining throughput are possible, with a 25% impact on the OLTP response time. At higher OLTP loads, the Mining workload is forced out, and the impact on response time is reduced unless the 'free' block approach is used. The Mining throughput is a bit lower than for the synthetic workload shown in Figure 6, most likely because the OLTP workload is not evenly spread across the disk while the Mining workload still tries to read the entire disk.

The disk being simulated, and the disk used in the traced system, is a 2.2 GB Quantum Viking 7,200 RPM disk with a (rated) average seek time of 8 ms. We have validated the simulator against the drive itself and found that read requests come within 5% for most of the requests and that writes are consistently under-predicted by an average of 20%. Extraction of disk parameters is a notoriously complex job [Worthington95], so a 5% difference is a quite reasonable result. The under-prediction for writes could be the result of several factors, and we are looking in more detail at the disk parameters to determine the cause of the mismatch. It is possible that this is due to a more aggressive write buffering scheme being modeled in the simulator than actually exists at the drive. This discrepancy should have only a minor impact on the results presented here, since the focus is on seeks and reads, and an under-prediction of service time is pessimistic to our results. The demerit figure [Ruemmler94] for the simulation is 37% for all requests.
Figure 8: Performance for the traced OLTP workload in a two disk system. The numbers are more variable than for the synthetic workload, but the basic benefit of the 'free' block approach is clear: it provides a significant boost over the Background Blocks Only approach. Note that since we do not control the multiprogramming level of the traced workload, the x axes in these charts show the average OLTP response time, which makes the MPL a hidden parameter.
5 Discussion
Previous work [Riedel98] has shown that Active Disks (individual disk drives that provide application-level programmability) can
provide the compute power, memory, and reduction in interconnect bandwidth to make data mining queries efficient on a system
designed for a less demanding workload. This paper illustrates that
there is also sufficient disk bandwidth in such a system to make a
combined transaction processing and data mining workload possible. We show that a significant amount of data mining work can be
accomplished with only a small impact on the existing transaction
processing performance. This means that if the “dumb” disks in a
traditional system are replaced with Active Disks, there will be
sufficient resources in compute power, memory, interconnect
bandwidth, and disk bandwidth to support both workloads. It is no
longer necessary to buy an expensive second system with which to
perform decision support and basic data mining queries.
Alternatively, one could design a backup system that would be
able to read the entire contents of a 2 GB disk in 30 minutes with
minimal impact on a running OLTP workload. It would no longer
be necessary to run backups in the middle of the night, stop the
system in order to back it up, or endure reduced performance during backups.
The results in Section 4.5 indicate that our current scheme is pessimistic because it requires the background workload to read every
last block on the disk, even at much lower bandwidth. There are a
number of optimizations in data placement and the choice of
which background blocks to “go after” to be explored, but our simple scheme shows that significant gains are possible.
With the advent of Storage Area Networks (SANs), storage
devices are being shared among multiple hosts performing different workloads [HP98, IBM99, Seagate98, Veritas99]. As the
amount and variety of sharing increases, the only central location
to optimize scheduling across multiple workloads will be directly
on the devices themselves.
The prevailing trends in disk drive technology are also promising
for this approach. While rotational speeds are increasing and seek
times decreasing, by far the most dramatic change is the steady
increase in media density. This means that the number of opportunities and the duration of "idle" times between requests are being
slowly reduced, but that the rapid increase in density should more
than compensate by allowing us to read a much greater amount of
data for a given amount of disk area we “pass over”.
6 Related Work

Previous studies of combined OLTP and decision support workloads on the same system indicate that the disk is the critical resource [Paulin97]. Paulin observes that both CPU and memory utilization are much higher for the Mining workload than for the OLTP workload, which is also clear from the design of the decision support system shown in Table 1 in our introduction. In his experiments, all system resources are shared among the OLTP and decision support workloads, with an impact of 36%, 70%, and 118% on OLTP response time when running decision support queries against a heavy, medium, and light transaction workload, respectively. The author concludes that the primary performance issue in a mixed workload is the handling of I/O demands on the data disks, and suggests that a priority scheme is required in the database system as a whole to balance the two types of workloads.

Brown, Carey and DeWitt [Brown92, Brown93] discuss the allocation of memory as the critical resource in a mixed workload environment. They introduce a system with multiple workload classes, each with varying response time goals that are specified to the memory allocator. They show that a modified memory manager is able to successfully meet these goals in the steady state using 'hints' in a modified LRU scheme. The modified allocator works by monitoring the response time of each class and adjusting the relative amount of memory allocated to a class that is operating below or above its goals. The scheduling scheme we propose here for disk resources also takes advantage of multiple workload classes with different structures and performance goals. In order to properly support a mixed workload, a database system must manage all system resources and coordinate performance among them.

Existing work on disk scheduling algorithms [Denning67, Worthington94] shows that dramatic performance gains are possible by dynamically reordering requests in a disk queue. One result of this previous work indicates that many scheduling algorithms perform equally well at the host [Worthington94]. The scheme that we propose here takes advantage of additional flexibility in the workload (the fact that requests for the background workload can be handled at low priority and out of order) to expand the scope of reordering possible in the disk queue. Our scheme also requires detailed knowledge of the performance characteristics of the disk (including exact seek times and overhead costs such as settle time) as well as detailed logical-to-physical mapping information to determine which blocks can be picked up for free. This means that this scheme would be difficult, if not impossible, to implement at the host without close feedback on the current state of the disk mechanism, which makes it a compelling use of additional "smarts" directly at the disk.

7 Conclusions
This paper presents a scheduling scheme that takes advantage of
the properties of large, scan-intensive workloads such as data mining to extract additional performance from a system that already
seems completely busy. We used a detailed disk simulator and both
synthetic and traced workloads to show that there is sufficient disk
bandwidth to support a background data mining workload on a
system designed for transaction processing. We propose to take
advantage of ‘free’ blocks that can be read during the seeks
required by the OLTP workload. Our results indicate that we can
get one third of the maximum sequential bandwidth of a disk for
the background workload without any effect on the OLTP response
times. This level of performance is possible even at high transaction loads. At low transaction loads, it is possible to achieve an
even higher level of background throughput if we accept a small impact on OLTP performance (a 25 to 30% increase in transaction response time).
The use of such a scheme in combination with Active Disks that
also provide parallel computational power directly at the disks
makes it possible to perform a significant amount of data mining
without having to purchase a second, dedicated system or maintain
two copies of the data.
8 Acknowledgements
The authors wish to thank Jiri Schindler for help extracting disk
parameters, Khalil Amiri for many valuable discussions, and all
the other members of the Parallel Data Lab for their invaluable
support. We thank Jim Gray, Charles Levine, Jamie Reding and the
rest of the SQL Server Performance Engineering group at
Microsoft for allowing us to use their TPC-C Benchmark Kit. We
thank Pat Conroy of MTI, as well as the anonymous SIGMOD
reviewers for comments on earlier drafts of this paper.

This research was sponsored by DARPA/ITO through ARPA Order D306, and issued by Indian Head Division, NSWC under contract N00174-96-0002. Partial funding was provided by the National Science Foundation under grants IRI-9625428, DMS-9873442, IIS-9817496, and IIS-9910606. Additional funding was provided by donations from NEC and Intel. We are indebted to generous contributions from the member companies of the Parallel Data Consortium. At the time of this writing, these companies include Hewlett-Packard Laboratories, LSI Logic, Data General, Compaq, Intel, 3Com, Quantum, IBM, Seagate Technology, Hitachi, Siemens, Novell, Wind River Systems, and Storage Technology Corporation. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any supporting organization or the U.S. Government.
9 References
[Acharya98] Acharya, A., Uysal, M. and Saltz, J. "Active Disks" ASPLOS, October 1998.
[Agrawal96] Agrawal, R. and Schafer, J. "Parallel Mining of Association Rules" IEEE Transactions on Knowledge and Data Engineering 8 (6), December 1996.
[Brown92] Brown, K., Carey, M., DeWitt, D., Mehta, M. and Naughton, J. "Resource Allocation and Scheduling for Mixed Database Workloads" Technical Report, University of Wisconsin, 1992.
[Brown93] Brown, K., Carey, M. and Livny, M. "Managing Memory to Meet Multiclass Workload Response Time Goals" VLDB, August 1993.
[Chaudhuri97] Chaudhuri, S. and Dayal, U. "An Overview of Data Warehousing and OLAP Technology" SIGMOD Record 26 (1), March 1997.
[Cirrus98] Cirrus Logic, Inc. "New Open-Processor Platform Enables Cost-Effective, System-on-a-Chip Solutions for Hard Disk Drives" www.cirrus.com/3ci, June 1998.
[Denning67] Denning, P.J. "Effects of Scheduling on File Memory Operations" AFIPS Spring Joint Computer Conference, April 1967.
[Fayyad98] Fayyad, U. "Taming the Giants and the Monsters: Mining Large Databases for Nuggets of Knowledge" Database Programming and Design, March 1998.
[Ganger98] Ganger, G.R., Worthington, B.L. and Patt, Y.N. "The DiskSim Simulation Environment Version 1.0 Reference Manual" Technical Report, University of Michigan, February 1998.
[Gray97] Gray, J. "What Happens When Processing, Storage, and Bandwidth are Free and Infinite?" IOPADS Keynote, November 1997.
[Guha98] Guha, S., Rastogi, R. and Shim, K. "CURE: An Efficient Clustering Algorithm for Large Databases" SIGMOD, June 1998.
[HP98] Hewlett-Packard Company "HP to Deliver Enterprise-Class Storage Area Network Management Solution" News Release, October 1998.
[IBM99] IBM Corporation and International Data Group "Survey says Storage Area Networks may unclog future roadblocks to e-Business" News Release, December 1999.
[Keeton98] Keeton, K., Patterson, D.A. and Hellerstein, J.M. "A Case for Intelligent Disks (IDISKs)" SIGMOD Record 27 (3), August 1998.
[Korn98] Korn, F., Labrinidis, A., Kotidis, Y. and Faloutsos, C. "Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining" VLDB, August 1998.
[Paulin97] Paulin, J. "Performance Evaluation of Concurrent OLTP and DSS Workloads in a Single Database System" Master's Thesis, Carleton University, November 1997.
[Riedel98] Riedel, E., Gibson, G. and Faloutsos, C. "Active Storage For Large-Scale Data Mining and Multimedia" VLDB, August 1998.
[Ruemmler94] Ruemmler, C. and Wilkes, J. "An Introduction to Disk Drive Modeling" IEEE Computer 27 (3), March 1994.
[Seagate98] Seagate Technology, Inc. "Storage Networking: The Evolution of Information Management" White Paper, November 1998.
[Siemens98] Siemens Microelectronics, Inc. "Siemens Announces Availability of TriCore-1 For New Embedded System Designs" News Release, March 1998.
[Veritas99] Veritas Software Corporation "Veritas Software and Other Industry Leaders Demonstrate SAN Solutions" News Release, May 1999.
[Widom95] Widom, J. "Research Problems in Data Warehousing" CIKM, November 1995.
[Worthington94] Worthington, B.L., Ganger, G.R. and Patt, Y.N. "Scheduling Algorithms for Modern Disk Drives" SIGMETRICS, May 1994.
[Worthington95] Worthington, B.L., Ganger, G.R., Patt, Y.N. and Wilkes, J. "On-Line Extraction of SCSI Disk Drive Parameters" SIGMETRICS, May 1995.
[Zhang97] Zhang, T., Ramakrishnan, R. and Livny, M. "BIRCH: A New Data Clustering Algorithm and Its Applications" Data Mining and Knowledge Discovery 1 (2), 1997.