Optimizing Data Warehouse
Loads via Parallel Pro-C and Parallel/DirectMode SQL
Bert Scalzo, PhD
INTRODUCTION
STAR SCHEMA DESIGN
MODELING STARS
SIZE OF STARS
THE LOADING CHALLENGE
HARDWARE WON’T COMPENSATE
PROGRAM DESIGN PARAMOUNT
PARALLEL DESIGN OPTIONS
STEP #1: FORM STREAMS
STEP #2: PROCESS STREAMS
RESULTS
Introduction
This paper is meant to accompany the PowerPoint presentation by the same name. The
information and ideas contained in this paper are founded on 3+ years of work as the
lead DBA for the 7-11 Corporation’s data warehouse for reporting on 6000+ convenience
stores’ Point of Sale (POS) and Electronic Data Interchange (EDI) information.
Built by Electronic Data Systems (EDS), this multi-terabyte data warehouse was
constructed on a Hewlett Packard V-2200 with 16 400 MHz PA-RISC CPUs and 8
gigabytes of RAM. The operating system was the 64-bit version of Hewlett Packard’s
UNIX, HP-UX 11.0. We used an EMC Symmetrix 3700-47 with 4 gigabytes of cache
and 3 terabytes of usable mirrored disk space.
We started with Oracle 7.3.2 and progressed to Oracle 8.1.6. Over those 3+ years, we
learned numerous hardware, operating system and database tuning lessons. At first, our
reports took over 12 hours to complete. Today, those same reports run in less than 7
minutes on average. Most of those run-time gains were made possible by Oracle 8.X
optimizer improvements.
Note: this paper’s section headings match the PowerPoint slides’ page headers.
Star Schema Design
Data warehouse projects are rampant. Every company claims to be building one.
In reality, most companies are building Operational Data Stores (ODS). An ODS is really
just a traditional database design, with a few extra columns – most often timestamps. The
ODS was originally intended to provide a historical collection of OLTP data from legacy
systems, which could then be used as a single source for loading into a data warehouse.
Unfortunately, most people seem content with calling this initial stepping stone the data
warehouse. But in reality it’s still a predominantly OLTP design, with OLTP tuning issues
merely scaled up to accommodate the higher volumes of data.
However, some companies are following Ralph Kimball’s Dimensional Modeling
technique, known as “Star Schema” design, to construct true data warehouses. This
design methodology goes contrary to conventional OLTP design theory. The goal is just
the reverse of the norm. With this technique, we strive for fewer, larger tables – with a
very low degree of normalization. In fact, the whole design concept seems initially quite
bizarre to the seasoned DBA. Old rules and tricks no longer apply. Welcome to the
strange and wonderful world of tuning data warehouses. You’re not in Kansas anymore,
Dorothy. Leave your baggage behind as we look at techniques for building such
radically different database designs.
Modeling Stars
Modeling “Star Schemas” is easy. In fact, you can use any data modeling software – even
if that software does not have any data warehousing specific extensions. You can simply
view the fact tables (no pun intended) as base tables with numerous lookup tables,
known as dimensions. While some data modeling tools tout support for different graphic
representations of dimensions and facts, it doesn’t amount to much – other than show.
Some tools do support the modeling of aggregations and have hierarchy browsers, but
with Oracle 8i’s new Materialized Views and Dimensions, it’s less important that they be
directly supported in the data model. In fact with Materialized Views, you may just want
to model the detailed fact tables. So a typical data model might only have four to ten
detailed fact tables and four to ten dimensions – or a total of only 20 or so entities. That’s
small potatoes when compared to OLTP models, which can often have hundreds or even
thousands of entities.
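To make that shape concrete, here is a minimal dimensional-model sketch – one fact table whose foreign keys point at small dimension lookup tables. All table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical star: tiny dimensions, one huge fact (names are illustrative)
CREATE TABLE period_dim  (period_id  NUMBER PRIMARY KEY, day_date   DATE);
CREATE TABLE store_dim   (store_id   NUMBER PRIMARY KEY, store_name VARCHAR2(30));
CREATE TABLE product_dim (product_id NUMBER PRIMARY KEY, upc        VARCHAR2(14));

CREATE TABLE pos_sales_fact (
  period_id  NUMBER REFERENCES period_dim,   -- foreign keys to dimensions
  store_id   NUMBER REFERENCES store_dim,
  product_id NUMBER REFERENCES product_dim,
  quantity   NUMBER,                         -- the measures being analyzed
  dollar_amt NUMBER
);
```

A full model would simply repeat this pattern for each of the four to ten detailed fact tables.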
Size of Stars
One big difference between traditional and “Star Schema” designs is the relative size
difference between the table types. With OLTP and ODS systems, you can have tables
with a wide range of row counts. For example, an OLTP system may have tables with a
few thousand to a few million rows. And an ODS might have tables with a few thousand
to tens or even hundreds of millions of rows. But a “Star Schema” design has only two
kinds of tables: small dimensions and huge facts. The typical dimension might have from
a few thousand to a few hundred thousand rows, but facts are truly gargantuan – typically
a few hundred million to a few billion rows! In data warehousing, size does matter.
The Loading Challenge
There are two primary challenges to data warehousing: loading massive amounts of data
quickly and running genuinely ad-hoc queries within the end-users’ lifetimes. Of these
two challenges, data loading is more problematic. This may seem odd at first. The end-user queries are truly ad-hoc on huge tables, so there are no hard and fast rules on what
they might do. In fact, successful data mining implies that the more they learn the more
they’ll do. On the other hand, the data loading rules are well defined and any given data
load run is tiny in proportion to the warehouse. So why is data loading harder?
Simply put – it’s Oracle’s optimizer. The optimizer does a fantastic job of making queries
efficient and therefore quick. In fact, the 8i optimizer was specifically designed to handle
ad-hoc queries on star schema designs. If the DBA merely follows a simple bitmap index
design (i.e. individual bitmap indexes on fact table foreign key columns and fully bitmap
indexed dimension tables), collects statistics with histograms and sets one or two init.ora
parameters – the Oracle optimizer will make mincemeat of ad-hoc queries on billion-row or
larger tables. It’s that easy. Moreover, the DBA can then use partitions, parallel query
execution for SELECT commands, materialized views and other advanced Oracle
features for even better results. Moreover, Oracle generally scales well for queries as you
improve your hardware (e.g. more or faster processors, more memory, faster disks, etc).
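That recipe can be sketched in a few SQL statements – all object names below are hypothetical, though STAR_TRANSFORMATION_ENABLED is the actual Oracle 8i parameter:

```sql
-- Individual bitmap indexes on each fact table foreign key column
CREATE BITMAP INDEX pos_fact_store_bx  ON pos_sales_fact (store_id);
CREATE BITMAP INDEX pos_fact_period_bx ON pos_sales_fact (period_id);

-- Statistics with histograms on the indexed columns
ANALYZE TABLE pos_sales_fact COMPUTE STATISTICS
  FOR TABLE FOR ALL INDEXED COLUMNS;

-- The key init.ora (or session-level) parameter
ALTER SESSION SET star_transformation_enabled = TRUE;
```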
The data loading aspect does not have such an inherent advantage. Sub-standard data load
programs simply take too long to run. Oracle does not compensate here for bad design as
it does with queries. Here, garbage in truly yields garbage out.
Hardware Won’t Compensate
The bitter and honest truth is that better hardware generally cannot improve load times.
Yet many managers, programmers and even some DBA’s mistakenly believe that faster
hardware can improve data load times. In my case, the team voted to double our CPUs,
memory, EMC cache and disks (RAID 5 to RAID 1). The end result: we spent a million
dollars and significant downtime to upgrade the hardware in order to have our 4 hour 30
minute load program run in 4 hours 15 minutes. You can guess what hit the fan.
The goal should be to verify that resource saturation has genuinely occurred before you
upgrade or improve that resource’s hardware. Think of a grocery store with many register
lines open, but with everyone in the express line – opening more register lines will not
improve the queues’ throughput. In this grocery example, closing one register line and
having that person instead direct customers to the open lines will accomplish both
load balancing and higher queue throughput. It’s no different with your hardware.
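Verifying saturation can be a one-liner – a minimal sketch, assuming a Linux box (on HP-UX you would reach for sar or glance instead, since /proc is Linux-specific):

```shell
# Compare the 1-minute load average against the CPU count; a sustained
# load far below the CPU count means more CPUs will not speed things up.
cpus=`grep -c ^processor /proc/cpuinfo`
load=`cut -d' ' -f1 /proc/loadavg`
echo "CPUs=$cpus, 1-minute load average=$load"
```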
Program Design Paramount
Most programmers (and they hate to admit it) have never written applications to leverage
SMP or MPP architectures. So load programs are often run serially, with uni-processor
oriented designs. Thus a 16 CPU box with gigabytes of RAM and an EMC RAID storage
array is underutilized on all counts by such programs. It’s like the movie
“Star Trek II: The Wrath of Khan”, where Spock tells Captain Kirk they can beat Khan
because he’s not accustomed to thinking in three-dimensional tactics. Most application
developers have not and will not develop parallel architecture load programs without
some advice and assistance from the DBA.
The goal is simple: minimize inter-process waits and maximize total concurrent resource
usage. In our grocery store example’s terms: minimize individual register line wait time
by maximizing concurrent register line usage. That’s the fancy way to say the obvious:
spread the customers evenly over the open lines. Now think of a data load program that
opens a huge data file and processes each record until end of file – how does that spread
the load over multiple CPUs or push your EMC’s I/O capabilities? Answer – it doesn’t.
Parallel Design Options
There are generally 3 ways to accomplish parallel architecture data loads.
First, you could use SQL*Loader run in parallel with the direct option. If you had, say, ten
million records to load, 10 CPUs and a fast I/O device – you could run 20 SQL*Loader
programs, each responsible for loading 500,000 records. Of course you would have
to monitor resource usage and adjust the parallel execution count up or down based upon
whether you’re more CPU or I/O constrained. There is only one major drawback to this
approach: you cannot do data lookups and data scrubbing without writing pre-insert
and/or pre-update triggers. The chief problem is update triggers that trigger (no pun
intended) “ORA-04091: table XXXX is mutating, trigger/function may not see it”.
Are your developers up to this challenge? Mine weren’t.
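The mechanics of this first approach can be sketched in shell. The sqlldr command line below appears only in a comment (its connect string and control file are assumptions), with a harmless stand-in worker so the sketch stays runnable:

```shell
# Split one input file into $degree chunks, then run one direct-path
# loader per chunk, all in parallel.
degree=4
printf '%s\n' r1 r2 r3 r4 r5 r6 r7 r8 > big.dat    # toy input data
lines=`grep -c '' big.dat`
chunk=`expr \( $lines + $degree - 1 \) / $degree`  # ceiling division
split -l $chunk big.dat part_
for f in part_??
do
  # The real worker would be something like (names assumed):
  #   sqlldr userid=dw/pwd control=load.ctl data=$f direct=true parallel=true &
  ( grep -c '' $f > $f.cnt ) &                     # stand-in worker
done
wait                                               # block until all loaders finish
cat part_*.cnt                                     # records handled per chunk
```

As the paper notes, you would then watch CPU and I/O usage and adjust $degree up or down.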
Second, you could use multi-threaded Pro-C programs – where each context (i.e. thread)
loads an equal portion of the input data. I highly recommend the non-mutex architecture
since many developers claim to have never even heard the term mutex. Yet it’s taught in
most basic operating systems classes – it’s merely an inter-process locking mechanism.
So once again we run into complexity issues as the primary drawback. Multi-threaded
applications are difficult to program and even harder to debug. Once again, are your
developers up to this challenge?
Third, you could use a classic “Divide and Conquer” approach. Take that simple Pro-C
program that opens a big data file and reads/processes each record till end of file. It’s a
simple shell scripting exercise to logically (not physically) split that file into N parts and
run N parallel copies of that simple program each responsible for 1/N of the data. Unlike
the multi-threaded approach, no special Pro-C coding skills are required and the program
is dirt simple to debug. This approach leverages many of UNIX’s key strengths. However
once again, are your developers up to this challenge? If not, have them write the Pro-C
programs and you provide the simple scripts to execute them in parallel.
Step #1: Form Streams
First you’ll need the script to logically divide the input data. In my case, we had 6000
files instead of one big file. The following script does just that – it creates 16 streams of
input files by building 16 directory listing files that each contain 1/16th of the 6000 file
names. Note that this is a purely logical operation. We did nothing more than create some
very tiny files that specify how to subdivide the workload. There is absolutely no need to
move, copy or concatenate the original data files in order to provide any kind of physical
separation or ordering.
degree=16
file_name=ras.dltx.postrn
file_count=`ls ${file_name}.* 2>/dev/null | wc -l`
if [ $file_count -gt 0 ]
then
   rm -f file_list*
   ls ${file_name}.* > file_list
   # ceiling division, so the last stream may get fewer files
   split_count=`expr \( $file_count + $degree - 1 \) / $degree`
   split -l $split_count file_list file_list_
   …
   … Step 2’s code …
   …
fi
Step #2: Process Streams
Second, you need to kick off the parallel program executions using your logically divided
input data. In my case with 6000 files, this script merely creates 16 concurrently running
background programs that each read the next file name to process from their list and send
that file’s data to the Pro-C program. It’s really not that tricky. UNIX normally creates a
process for each command line. The “(commands)&” notation merely tells UNIX to run
those commands together as a single subshell process and places that process in the
background. So we kick off 16 and then wait for them all to finish. That’s it.
for file in file_list_*
do
   ( cat $file | while read line
     do
        if [ -s "$line" ]
        then
           cat "$line" | pro_c_program
        fi
     done ) &
done
wait
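The “(commands)&” plus wait pattern is worth seeing in isolation – a tiny runnable demo:

```shell
# Two subshells run concurrently in the background; wait blocks until
# both have finished, so total elapsed time is about 1 second, not 2.
( sleep 1; echo first  > s1.out ) &
( sleep 1; echo second > s2.out ) &
wait
cat s1.out s2.out
```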
Note, you can easily write Windows BAT scripts to do the same thing (which
I have done), so it’s not just a UNIX solution. Merely use Windows’ CALL and START
commands, as appropriate. Or more simply, obtain a competent UNIX scripting engine
for Windows, such as the MKS Toolkit, Cygwin or UWIN.
Results
Remember my 4 hour 30 minute load program that we reduced to 4 hours 15 minutes by
spending a million dollars on upgrades and system downtime for those upgrades? The
new parallel architecture load program ran in 25 minutes – that’s a 1080% run-time
improvement. We went from 7% CPU utilization to 95%. We went from less than 5%
load on our EMC RAID disk array to 80% (we were actually not I/O bound yet). We
went from less than 5% memory usage to nearly 80% (still no swapping or paging).
Our customer was ecstatic. We got to go to a Rangers baseball game as a team, have a
team pizza party, and got an extra ½ day of personal leave – plus every team member
received a plaque of appreciation. Not to mention that I also got a fantastic raise and a
promotion out of it.
Total cost for these improvements:
4 hours DBA time at $150/hour = $600
20 hours developer time at $100/hour = $2000
Total cost = $2600
That translates into 385 times less cost than the million dollars spent for new hardware.
Remember, application design is much, much cheaper than new hardware – in fact in this
case, ridiculously so.