Striping techniques

The Basics of Striping
The process of taking a pool of data and evenly dividing it
up across a set of drives is called "striping." To illustrate,
let's look at a data warehouse that stores information
pertaining to foreign automobile sales for the United States.
Assume that our platform has four CPUs. Also, for
simplicity's sake, assume our warehouse only has
information on Mercedes, Porsche, BMW and Volvo, and
that it has roughly an equivalent amount of information on
each type of car, all of which is stored in a single table called
"Car_Sales." If we were to put all the information on a
single disk drive (assuming for a minute that it would fit),
then only one scan process would be able to read the table at
a time. The optimal solution is to spread the data (that is,
stripe the data) across at least four disks as shown in Figure
1.
Now, when we query Car_Sales, we can use four scan
processes, and each process will read one-fourth of the total
table, completing a query in approximately one-fourth of the
time it would have taken to scan the table without
parallelism. We can continue extending this principle even
further by adding more CPUs, striping the data over more
disk drives and using more scan processes.
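The one-fourth figure above can be reproduced with a few lines of arithmetic. This is only an idealized sketch: it assumes the data is spread evenly, that there is no coordination overhead, and that the usable parallelism is the smaller of the number of disks and the number of scan processes. The function name and the 120-second baseline are made up for illustration.

```python
def ideal_scan_time(single_disk_seconds, disks, scan_processes):
    """Idealized elapsed time for a full table scan.

    Assumes equal data on every disk and no overhead (both simplifications).
    """
    # Assumption: each disk serves one scan process at a time, so the
    # effective degree of parallelism is capped by the smaller count.
    degree = min(disks, scan_processes)
    return single_disk_seconds / degree


# Hypothetical baseline: the unstriped scan of Car_Sales takes 120 seconds.
for n in (1, 2, 4, 8):
    print(n, "disks/processes:", ideal_scan_time(120.0, disks=n, scan_processes=n))
```

Real systems will fall short of these numbers, but the sketch shows why adding disks without adding scan processes (or the reverse) does not shorten the scan under these assumptions.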
Striping can be performed by the hardware, the operating
system or the database. Each approach has its benefits
and drawbacks.
Hardware Striping
This method of disk striping involves purchasing specialized
intelligent disk array technology which includes additional
hardware that automatically handles striping the data across
the multiple disks in the disk array. To the rest of the
system, this disk array usually looks like a single (albeit very
fast) disk drive that has the ability to simultaneously handle
multiple I/O requests. The striping is usually done in a
round-robin fashion, which means that chunks of data
(usually 32K to 64K each) are distributed to the disk drives
much as a card dealer deals out a deck of cards.
The benefit of using this technique is that data is spread
evenly over many physical devices, balancing the I/O load
across all the disks. This minimizes the risk of disk
"hot spots," which occur when data is requested
from some drives much more frequently than others.
Another advantage is that these intelligent disk arrays also
automatically handle various RAID levels.
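The round-robin distribution described above can be written as a small address calculation. The following is a minimal sketch, not the logic of any particular disk array: the 64K chunk size is taken from the upper end of the range mentioned in the text, while the four-disk array width and the function name are assumptions.

```python
CHUNK_SIZE = 64 * 1024  # 64K chunks (upper end of the range in the text)
NUM_DISKS = 4           # assumed number of drives in the array


def locate(byte_offset, chunk_size=CHUNK_SIZE, num_disks=NUM_DISKS):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    chunk = byte_offset // chunk_size    # which chunk the byte falls in
    disk = chunk % num_disks             # chunks are dealt out like cards
    stripe_row = chunk // num_disks      # how many full rounds came before
    return disk, stripe_row * chunk_size + byte_offset % chunk_size


# Consecutive chunks land on consecutive disks, then wrap around.
print([locate(i * CHUNK_SIZE)[0] for i in range(8)])  # prints [0, 1, 2, 3, 0, 1, 2, 3]
```

Because consecutive chunks always go to different drives, no single drive ends up holding a long contiguous run of the table, which is what spreads the load evenly.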
However, the hardware striping solution is usually the most
expensive method of achieving disk striping. Also, the
resulting parallelism is not necessarily what you would hope
for. In general, to maximize I/O throughput, you always
want to have the disk head move smoothly across the disk in
one continuous motion, streaming the data back to the
system as it goes. Unfortunately, because the database is
unaware of how the data is actually striped (remember, with
hardware striping the striping is intended to be transparent
to the system), the I/O requests issued by a single scan
thread will almost always reference data that is on multiple
drives. This problem affects each scan thread, so all the disks
in the array are constantly satisfying requests from multiple
threads. The disk heads will have to constantly seek back
and forth, significantly lowering I/O performance.
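To see why every scan thread ends up touching every drive, consider a sketch in which four scan threads each read one contiguous quarter of a round-robin striped table. The table size, chunk size, thread count and function name are all hypothetical values chosen for illustration.

```python
CHUNK_SIZE = 64 * 1024          # assumed chunk size
NUM_DISKS = 4                   # assumed array width
NUM_THREADS = 4                 # assumed number of scan threads
TABLE_BYTES = 64 * CHUNK_SIZE   # hypothetical table size (64 chunks)


def disks_touched(start, end):
    """Return the set of disks holding any byte in the logical range [start, end)."""
    first_chunk = start // CHUNK_SIZE
    last_chunk = (end - 1) // CHUNK_SIZE
    return {chunk % NUM_DISKS for chunk in range(first_chunk, last_chunk + 1)}


# Each thread scans one contiguous quarter of the table's logical address space.
share = TABLE_BYTES // NUM_THREADS
for thread in range(NUM_THREADS):
    touched = disks_touched(thread * share, (thread + 1) * share)
    print("thread", thread, "reads from disks", sorted(touched))
```

Under these assumptions each thread's range covers 16 chunks, so every thread reads from all four disks, and every disk is serving all four threads at once.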
Operating System Striping
Operating system striping introduces the concept of a
"logical volume group," which (similar to hardware
striping) appears as a single disk device to the database.
However, it actually consists of pieces from multiple physical
disk drives logically grouped together to give the appearance
of one physical device. The data is distributed across the
pieces of the various disks in a round-robin fashion. As with
hardware striping, this approach removes hot spots, and
since there is no special hardware required, this solution is
cheaper. However, more CPU resources are needed to
manage the logical volume group. Also, it suffers from the
same head-seeking problem I discussed earlier, since a scan
thread can only issue a request to the logical volume group,
not to a specific disk within that group (see Figure 2).
Database Striping
Of the three striping techniques available, database striping
is the easiest to employ and offers the best performance
when fewer concurrent users than in a typical OLTP
application are running parallel queries. Using
this technique, a database table is divided into a number of
sections (called "fragments" or "extents"), and each section
is assigned to a specific drive. The database then has the
ability to assign a single scan process to a single
fragment/extent and, as a result, be assured that the head
seek movement will be minimized (because the scans are
sequential and the disk head does not have to be
repositioned elsewhere on the disk platter to service another
request).
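The one-scan-process-per-fragment idea can be sketched as follows. This is an illustration only, not how any particular database implements it: the fragments are in-memory lists standing in for files on separate drives, the drive names and row values are invented, and threads stand in for scan processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: Car_Sales split into four fragments, one per drive.
# Each row is (make, units_sold); the values are made up.
FRAGMENTS = {
    "drive1": [("Mercedes", 3), ("BMW", 2)],
    "drive2": [("Porsche", 1), ("BMW", 4)],
    "drive3": [("Volvo", 6), ("Mercedes", 1)],
    "drive4": [("BMW", 5), ("Porsche", 2)],
}


def scan_fragment(rows, make):
    """Sequentially scan one fragment and total the units sold for `make`."""
    return sum(units for row_make, units in rows if row_make == make)


def parallel_total(fragments, make):
    """Run one scan worker per fragment and combine the partial results."""
    # One worker per fragment, so each worker reads from exactly one drive.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = pool.map(lambda rows: scan_fragment(rows, make),
                            fragments.values())
        return sum(partials)


print(parallel_total(FRAGMENTS, "BMW"))  # prints 11
```

The point of the design is the one-to-one pairing: because the database knows which drive holds which fragment, no two scan workers compete for the same disk head.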
The downside of database striping is that each
fragment/extent is a separate operating system file.
Therefore, there will be many more data files to manage
compared with operating system striping or hardware
striping, where the four separate sections would be treated
and managed as a single file. It's simply a tradeoff between
performance and ease of maintenance.
Conclusion
A scalable application must be thought of as a "performance
chain." All components in the chain must be scalable for the
entire application to be scalable. If any component is not
scalable, then a bottleneck exists in your performance chain,
and your application as a whole will have limited scalability.
As a major component of any application, the I/O subsystem
must also be scalable. Striping is one of the most effective
techniques for removing disk hot spots, and it's required if
you want to be able to take advantage of I/O parallelism and
have a highly scalable I/O subsystem.