Optimizing Data Warehouse
Loads via Parallel Pro-C and Parallel/DirectMode SQL
Bert Scalzo, PhD
INTRODUCTION
STAR SCHEMA DESIGN
MODELING STARS
SIZE OF STARS
THE LOADING CHALLENGE
HARDWARE WON’T COMPENSATE
PROGRAM DESIGN PARAMOUNT
PARALLEL DESIGN OPTIONS
STEP #1: FORM STREAMS
STEP #2: PROCESS STREAMS
RESULTS
Introduction
This paper is meant to accompany the PowerPoint presentation by the same name. The
information and ideas contained in this paper are founded on 3+ years of work as the
lead DBA for the 7-11 Corporation’s data warehouse for reporting on 6000+ convenience
stores’ Point of Sale (POS) and Electronic Data Interchange (EDI) information.
Built by Electronic Data Systems (EDS), this multi-terabyte data warehouse was
constructed on a Hewlett Packard V-2200 with 16 400 MHz PA-RISC CPUs and 8
gigabytes of RAM. The operating system was the 64-bit version of Hewlett Packard’s
UNIX, HP-UX 11.0. We used an EMC Symmetrix 3700-47 with 4 gigabytes of cache
and 3 terabytes of usable mirrored disk space.
We started with Oracle 7.3.2 and progressed to Oracle 8.1.6. Over those 3+ years, we
learned numerous hardware, operating system and database tuning lessons. At first, our
reports took over 12 hours to complete. Today, those same reports run in less than 7
minutes on average. Most of those run-time gains were made possible by Oracle 8.X
optimizer improvements.
Note: this paper’s section headings match the PowerPoint slides’ page headers.
Star Schema Design
Data warehouse projects are rampant. Every company claims to be building one.
In reality, most companies are building Operational Data Stores (ODS). An ODS is really
just a traditional database design, with a few extra columns – most often timestamps. The
ODS was originally intended to provide a historical collection of OLTP data from legacy
systems, which could then be used as a single source for loading into a data warehouse.
Unfortunately, most people seem content with calling this initial stepping stone the data
warehouse. But in reality it’s still a predominantly OLTP design, with OLTP tuning issues
merely scaled up to accommodate the higher volumes of data.
However, some companies are following Ralph Kimball’s Dimensional Modeling
technique, known as “Star Schema” design, to construct true data warehouses. This
design methodology goes contrary to conventional OLTP design theory. The goal is just
the reverse of the norm. With this technique, we strive for fewer, larger tables – with a
very low degree of normalization. In fact, the whole design concept seems initially quite
bizarre to the seasoned DBA. Old rules and tricks no longer apply. Welcome to the
strange and wonderful world of tuning data warehouses. You’re not in Kansas anymore,
Dorothy. Leave your baggage behind as we look at techniques for building such
radically different database designs.
Modeling Stars
Modeling “Star Schemas” is easy. In fact, you can use any data modeling software – even
if that software does not have any data warehousing specific extensions. You can simply
view the fact tables (no pun intended) as base tables with numerous lookup tables,
known as dimensions. While some data modeling tools tout support for different graphic
representations of dimensions and facts, it doesn’t amount to much – other than show.
Some tools do support the modeling of aggregations and have hierarchy browsers, but
with Oracle 8i’s new Materialized Views and Dimensions, it’s less important that they be
directly supported in the data model. In fact with Materialized Views, you may just want
to model the detailed fact tables. So a typical data model might only have four to ten
detailed fact tables and four to ten dimensions – or a total of only 20 or so entities. That’s
small potatoes when compared to OLTP models, which can often have hundreds or even
thousands of entities.
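To make that shape concrete, here is a minimal dimensional-model sketch – one fact table whose foreign keys point at small dimension lookup tables. All table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical star: tiny dimensions, one huge fact (names are illustrative)
CREATE TABLE period_dim  (period_id  NUMBER PRIMARY KEY, day_date   DATE);
CREATE TABLE store_dim   (store_id   NUMBER PRIMARY KEY, store_name VARCHAR2(30));
CREATE TABLE product_dim (product_id NUMBER PRIMARY KEY, upc        VARCHAR2(14));

CREATE TABLE pos_sales_fact (
  period_id  NUMBER REFERENCES period_dim,   -- foreign keys to dimensions
  store_id   NUMBER REFERENCES store_dim,
  product_id NUMBER REFERENCES product_dim,
  quantity   NUMBER,                         -- the measures being analyzed
  dollar_amt NUMBER
);
```

A full model would simply repeat this pattern for each of the four to ten detailed fact tables.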
Size of Stars
One big difference between traditional and “Star Schema” designs is the relative size
difference between the table types. With OLTP and ODS systems, you can have tables
with a wide range of row counts. For example, an OLTP system may have tables with a
few thousand to a few million rows. And an ODS might have tables with a few thousand
to tens or even hundreds of millions of rows. But a “Star Schema” design has only two
kinds of tables: small dimensions and huge facts. The typical dimension might have from
a few thousand to a few hundred thousand rows, but facts are truly gargantuan – typically
a few hundred million to a few billion rows! In data warehousing, size does matter.
The Loading Challenge
There are two primary challenges to data warehousing: loading massive amounts of data
quickly and running genuinely ad-hoc queries within the end-users’ lifetimes. Of these
two challenges, data loading is more problematic. This may seem odd at first. The end-user queries are truly ad-hoc on huge tables, so there are no hard and fast rules on what
they might do. In fact, successful data mining implies that the more they learn the more
they’ll do. On the other hand, the data loading rules are well defined and any given data
load run is tiny in proportion to the warehouse. So why is data loading harder?
Simply put – it’s Oracle’s optimizer. The optimizer does a fantastic job of making queries
efficient and therefore quick. In fact, the 8i optimizer was specifically designed to handle
ad-hoc queries on star schema designs. If the DBA merely follows a simple bitmap index
design (i.e. individual bitmap indexes on fact table foreign key columns and fully bitmap
indexed dimension tables), collects statistics with histograms and sets one or two init.ora
parameters – the Oracle optimizer will make mincemeat of ad-hoc queries on billion-row or
larger tables. It’s that easy. Moreover, the DBA can then use partitions, parallel query
execution for SELECT commands, materialized views and other advanced Oracle
features for even better results. Moreover, Oracle generally scales well for queries as you
improve your hardware (e.g. more or faster processors, more memory, faster disks, etc).
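That recipe can be sketched in a few SQL statements – all object names below are hypothetical, though STAR_TRANSFORMATION_ENABLED is the actual Oracle 8i parameter:

```sql
-- Individual bitmap indexes on each fact table foreign key column
CREATE BITMAP INDEX pos_fact_store_bx  ON pos_sales_fact (store_id);
CREATE BITMAP INDEX pos_fact_period_bx ON pos_sales_fact (period_id);

-- Statistics with histograms on the indexed columns
ANALYZE TABLE pos_sales_fact COMPUTE STATISTICS
  FOR TABLE FOR ALL INDEXED COLUMNS;

-- The key init.ora (or session-level) parameter
ALTER SESSION SET star_transformation_enabled = TRUE;
```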
The data loading aspect does not have such an inherent advantage. Sub-standard data load
programs simply take too long to run. Oracle does not compensate here for bad design as
it does with queries. Here, garbage in truly yields garbage out.
Hardware Won’t Compensate
The bitter and honest truth is that better hardware generally cannot improve load times.
Yet many managers, programmers and even some DBA’s mistakenly believe that faster
hardware can improve data load times. In my case, the team voted to double our CPUs,
memory, EMC cache and disks (RAID 5 to RAID 1). The end result: we spent a million
dollars and significant downtime to upgrade the hardware in order to have our 4 hour 30
minute load program run in 4 hours 15 minutes. You can guess what hit the fan.
The goal should be to verify that resource saturation has genuinely occurred before you
upgrade or improve that resource’s hardware. Think of a grocery store with many register
lines open, but with everyone in the express line – opening more register lines will not
improve the queues’ throughput. In this grocery example, closing one register line and
having that person instead direct customers to the open lines will accomplish both
load balancing and higher queue throughput. It’s no different with your hardware.
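Verifying saturation can be a one-liner – a minimal sketch, assuming a Linux box (on HP-UX you would reach for sar or glance instead, since /proc is Linux-specific):

```shell
# Compare the 1-minute load average against the CPU count; a sustained
# load far below the CPU count means more CPUs will not speed things up.
cpus=`grep -c ^processor /proc/cpuinfo`
load=`cut -d' ' -f1 /proc/loadavg`
echo "CPUs=$cpus, 1-minute load average=$load"
```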
Program Design Paramount
Most programmers (and they hate to admit it) have never written applications to leverage
SMP or MPP architectures. So load programs are often run serially, with uni-processor
oriented designs. Thus a 16 CPU box with gigabytes of RAM and an EMC RAID storage
array is underutilized on all counts by such programs. It’s like the movie
“Star Trek II: The Wrath of Khan”, where Spock tells Captain Kirk they can beat Khan
because he’s not accustomed to thinking in three-dimensional tactics. Most application
developers have not and will not develop parallel architecture load programs without
some advice and assistance from the DBA.
The goal is simple: minimize inter-process waits and maximize total concurrent resource
usage. In our grocery store example’s terms: minimize individual register line wait time
by maximizing concurrent register line usage. That’s the fancy way to say the obvious:
spread the customers evenly over the open lines. Now think of a data load program that
opens a huge data file and processes each record until end of file – how does that spread
the load over multiple CPUs or push your EMC’s I/O capabilities? Answer – it doesn’t.
Parallel Design Options
There are generally 3 ways to accomplish parallel architecture data loads.
First, you could use SQL*Loader run in parallel with the direct option. If you had, say, ten
million records to load, 10 CPUs and a fast I/O device – you could run 20 SQL*Loader
programs, each responsible for loading 500,000 records. Of course you would have
to monitor resource usage and adjust the parallel execution count up or down based upon
whether you’re more CPU or I/O constrained. There is only one major drawback to this
approach: you cannot do data lookups and data scrubbing without writing pre-insert
and/or pre-update triggers. The chief problem is update triggers that trigger (no pun
intended) “ORA-04091: table XXXX is mutating, trigger/function may not see it”.
Are your developers up to this challenge? Mine weren’t.
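The mechanics of this first approach can be sketched in shell. The sqlldr command line below appears only in a comment (its connect string and control file are assumptions), with a harmless stand-in worker so the sketch stays runnable:

```shell
# Split one input file into $degree chunks, then run one direct-path
# loader per chunk, all in parallel.
degree=4
printf '%s\n' r1 r2 r3 r4 r5 r6 r7 r8 > big.dat    # toy input data
lines=`grep -c '' big.dat`
chunk=`expr \( $lines + $degree - 1 \) / $degree`  # ceiling division
split -l $chunk big.dat part_
for f in part_??
do
  # The real worker would be something like (names assumed):
  #   sqlldr userid=dw/pwd control=load.ctl data=$f direct=true parallel=true &
  ( grep -c '' $f > $f.cnt ) &                     # stand-in worker
done
wait                                               # block until all loaders finish
cat part_*.cnt                                     # records handled per chunk
```

As the paper notes, you would then watch CPU and I/O usage and adjust $degree up or down.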
Second, you could use multi-threaded Pro-C programs – where each context (i.e. thread)
loads an equal portion of the input data. I highly recommend the non-mutex architecture
since many developers claim to have never even heard the term mutex. Yet it’s taught in
most basic operating systems classes – it’s merely an inter-process locking mechanism.
So once again we run into complexity issues as the primary drawback. Multi-threaded
applications are difficult to program and even harder to debug. Once again, are your
developers up to this challenge?
Third, you could use a classic “Divide and Conquer” approach. Take that simple Pro-C
program that opens a big data file and reads/processes each record till end of file. It’s a
simple shell scripting exercise to logically (not physically) split that file into N parts and
run N parallel copies of that simple program each responsible for 1/N of the data. Unlike
the multi-threaded approach, no special Pro-C coding skills are required and the program
is dirt simple to debug. This approach leverages many of UNIX’s key strengths. However
once again, are your developers up to this challenge? If not, have them write the Pro-C
programs and you provide the simple scripts to execute them in parallel.
Step #1: Form Streams
First you’ll need the script to logically divide the input data. In my case, we had 6000
files instead of one big file. The following script does just that – it creates 16 streams of
input files by building 16 directory listing files that each contain 1/16th of the 6000 file
names. Note that this is a purely logical operation. We did nothing more than create some
very tiny files that specify how to subdivide the workload. There is absolutely no need to
move, copy or concatenate the original data files in order to provide any kind of physical
separation or ordering.
degree=16
file_name=ras.dltx.postrn
file_count=`ls ${file_name}.* 2>/dev/null | wc -l`
if [ $file_count -gt 0 ]
then
   rm -f file_list*
   ls ${file_name}.* > file_list
   # ceiling division, so the last stream may get fewer files
   split_count=`expr \( $file_count + $degree - 1 \) / $degree`
   split -l $split_count file_list file_list_
   …
   … Step 2’s code …
   …
fi
Step #2: Process Streams
Second, you need to kick off the parallel program executions using your logically divided
input data. In my case with 6000 files, this script merely creates 16 concurrently running
background programs that each read the next file name to process from their list and send
that file’s data to the Pro-C program. It’s really not that tricky. UNIX normally creates a
process for each command line. The “(commands)&” notation merely tells UNIX to run
those commands together as a single subshell process and places that process in the
background. So we kick off 16 and then wait for them all to finish. That’s it.
for file in file_list_*
do
   ( cat $file | while read line
     do
        if [ -s "$line" ]
        then
           cat "$line" | pro_c_program
        fi
     done ) &
done
wait
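The “(commands)&” plus wait pattern is worth seeing in isolation – a tiny runnable demo:

```shell
# Two subshells run concurrently in the background; wait blocks until
# both have finished, so total elapsed time is about 1 second, not 2.
( sleep 1; echo first  > s1.out ) &
( sleep 1; echo second > s2.out ) &
wait
cat s1.out s2.out
```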
Note, you can easily write Windows BAT scripts to do the same thing (which
I have done), so it’s not just a UNIX solution. Merely use Windows’ CALL and START
commands, as appropriate. Or more simply, obtain a competent UNIX scripting engine
for Windows, such as the MKS Toolkit, Cygwin or UWIN.
Results
Remember my 4 hour 30 minute load program that we reduced to 4 hours 15 minutes by
spending a million dollars on upgrades and system downtime for those upgrades? The
new parallel architecture load program ran in 25 minutes – that’s a 1080% run-time
improvement. We went from 7% CPU utilization to 95%. We went from less than 5%
load on our EMC RAID disk array to 80% (we were actually not I/O bound yet). We
went from less than 5% memory usage to nearly 80% (still no swapping or paging).
Our customer was ecstatic. We got to go to a Rangers baseball game as a team, have a
team pizza party, and got an extra ½ day of personal leave – plus every team member
received a plaque of appreciation. Not to mention that I also got a fantastic raise and a
promotion out of it.
Total cost for these improvements:
4 hours DBA time at $150/hour = $600
20 hours developer time at $100/hour = $2000
Total cost = $2600
That translates into 385 times less cost than the million dollars spent for new hardware.
Remember, application design is much, much cheaper than new hardware – in fact in this
case, ridiculously so.