Big Data – Extract-Transform-Load (ETL)

Extract-Transform-Load (ETL) Technologies – Part 1
Monday, December 31, 2012
By: Dale T. Anderson
Principal Consultant
DB Best Technologies, LLC
*~:~* Happy New Year *~:~*
My last blog (Column Oriented Database Technologies) discussed the differences between Row and
Column oriented databases and some key players in this space. Concepts and technologies on Big Data
have been discussed in previous blogs (Big Data & NoSQL Technologies & NoSQL .vs. Row .vs.
Column). From these blogs one should surmise that deciding upon the best database technology (or
DBMS vendor) really depends on schema complexities, how you intend to retrieve your data, and how
to get it there in the first place. We’re going to dive into this next, but before we do it is imperative that
we briefly examine the differences between OLTP and OLAP database designs. Then let’s leave OLTP
details for a future blog as I expect most readers already know plenty about transactional database
systems. Instead we’ll focus here on OLAP details and how we process Big Data using ETL technologies
for data warehouse applications.
Generally the differences between OLTP and OLAP database applications center upon how frequently
data must be stored and retrieved, the integrity of that data, how much of it there is, and how fast it grows.
OLTP database schemas are optimized for processing transactions by an increasing number of users
while OLAP database schemas are optimized for aggregations against an increasing amount of data and
exponential query permutations. Design considerations involve normalization, indexing, datatypes, user
load, storage requirements, performance, and scalability. We will need to defer the many interesting
details on these considerations for a future blog.
On-Line Transactional Processing (OLTP) applications, like Customer Relationship Management (CRM),
Enterprise Resource Planning (ERP), Corporate Financials (AR/AP/GL), e-Commerce, or other enterprise systems use traditional SQL
queries that are either embedded within application code (not a great practice in my humble opinion),
or within stored-procedures (a much better practice) in order to store and retrieve data. Where data
integrity and performance are critical, an OLTP database is the most appropriate. OLTP schema designs
are generally highly normalized, comprehensive structures that are optimized for fast transactional data
processing. Typically they represent the source data that feed a data warehouse.
On-Line Analytical Processing (OLAP) applications however focus on Data Warehouse and Business
Intelligence systems where the volume of data can grow quite large. Performance and scalability
become the driving factors and data integrity is generally inherited from the source system reducing its
importance. Typically OLAP schema designs follow well known modeling practices like Ralph Kimball’s
Star Schema or Dan Linstedt’s Data Vault (read my blog Data Vault - What is it?). Perhaps the best way
to think of an OLAP system is to think of the wizard behind the curtain who has all the answers, for everyone,
all the time. Boy, if only it were true! Instead, OLAP schema designs are generally refactored data
structures based upon source OLTP systems that are optimized for fast query processing.
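To make the contrast concrete, here is a minimal sketch (using SQLite for portability; all table and column names are illustrative, not taken from any particular system) of a normalized OLTP table alongside the kind of star-schema fact and dimension tables it might be refactored into:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP side: a normalized orders table, optimized for fast transactions.
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        order_date  TEXT    NOT NULL,
        amount      REAL    NOT NULL
    )""")

# OLAP side: a star schema -- a central fact table surrounded by
# dimension tables, optimized for aggregation queries.
cur.execute("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20121231
        year INTEGER, month INTEGER, day INTEGER
    )""")
cur.execute("""
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        customer_key INTEGER,
        product_key  INTEGER,
        amount       REAL
    )""")
conn.commit()
```

The point is not the DDL itself but the shape: the OLTP table serves one transaction at a time, while the fact table is built to be summed and sliced across its dimensions.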
And so we arrive at the crux of this blog. How do we get data from the source OLTP system into the
target OLAP data warehouse in a practical, efficient way? And what do we need to do to that data to
conform and resolve the clearly different schema designs? So along comes ETL, or Extract-Transform-Load; fundamentally understandable, but potentially very hard to do once you dive into all the
complexities involved; like peeling an onion or melting the wicked witch! Let’s examine what is really
involved.
Extracting data from a source dataset, transforming that data in potentially many particular ways, and
loading the resulting data into a target dataset is the essence of ETL. Understanding that is simple. Yet
consider that the source data may originate from several different files, tables, views, or databases;
furthermore these are potentially varied in structure, location, and host systems (ie: Oracle, MS SQL
Server, MySQL, etc.). Also consider that transformations can include a myriad of different requirements
from normalization to de-normalization of source data, lookups from other datasets, merging, sorting,
truncating, datatype conversion, inner joins, outer joins, matched and/or unmatched records, etc.
These requirements are anywhere from simple to daunting. Consider the data load; perhaps there is
one target, perhaps many; data might reside in a data warehouse, and/or data marts; maybe it’s a Star
Schema, maybe not! Finally the entire ETL data process may be a full data set, or an incremental one.
OMG! ETL data processing permutations are endless and one thing is certain, getting it right is critical!
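As a toy illustration of the essence described above, here is a hedged sketch in Python with SQLite of the three steps: extract from a source, transform (cleanse and conform), and load into a target. The schemas and the transformation rules are hypothetical, chosen only to make each step visible:

```python
import sqlite3

# A pretend source OLTP system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "alice", "us"), (2, "bob", "uk"), (3, None, "us")])

# A pretend target warehouse.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, country TEXT)")

# Extract: pull the rows from the source system.
rows = src.execute("SELECT id, name, country FROM customers").fetchall()

# Transform: drop rows missing a name, normalize case on names and codes.
clean = [(i, n.title(), c.upper()) for i, n, c in rows if n is not None]

# Load: write the conformed rows into the target table.
tgt.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", clean)
tgt.commit()

print(tgt.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])  # → 2
```

A real pipeline adds lookups, joins, error capture, and incremental-load logic around exactly this skeleton, which is where the complexity piles up.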
What should we do then to deal with these comprehensive and complex data processes? We use tools of course!
Some may build their ETL data processing with SQL scripts, or
maybe embedded SQL in another scripting language like PHP
or Python. Others may actually craft programs using high
level languages like C#, Java, or Visual Basic. These solutions
are fine, of course, but they present a cumulative burden of development, maintenance, and cost that
can easily exceed the alternative of using tools designed specifically for ETL data processing. Not that
there is no burden if you use ETL tools, but a greatly reduced burden, in my humble opinion. There are
several ETL tool vendors out there, all providing the various functionalities needed. I found this survey
result, which is about what I would expect.
Keep in mind however that there are several other key features that should be considered when looking
at ETL tools, including:


Automation
  - Task Scheduler & Triggers
  - Environment Control (ie: DB Connections, Global Variables)
  - Restart/Recover/Abort
  - Load Balancing
Administration
  - Admin Console
  - Project, User, Task Management
  - Integrated Source Control
  - Distributed Processing
  - Task Execution Analysis
Monitoring & Logging
  - Real-Time execution logging
  - File and Database error capture
Using ETL tools with the appropriate features and usability is very important, yet there is one more
aspect I submit must be considered before crafting any ETL process. Data Warehouse Schema design!
All the objects involved within a Data Warehouse are really driven by the overarching architecture. This
aspect is so easily overlooked, or under-valued. Many just slap something together and hope for the
best. Maybe this contributes to the almost 80% failure rates of first DW/BI project efforts. Let’s take a
step back then and look at this from 35,000 feet for a moment.
Ok, we all know that an OLAP Data Warehouse commonly supports analytics and reporting, often
referred as Business Intelligence. We typically understand what our data sources are and that we will
use ETL data processes to move data through a system that in turn must be a sustainable, pliable,
expanding DW/BI solution. Yet what is the architecture of that system? Do you notice we started from
a comfortable position of factual information of what is known and quickly culminated in a place of
genuine uncertainty about what will be constructed? The good news is that we also do know what the
results should be.
Consider carefully then the goals and requirements of the business and the metrics involved, and then
decide what the end data stores should be. From there you can fill in the gaps; ask yourself these many
questions:


- What are the business metrics and how should they be defined?
- What permutations are involved? (ie: reporting periods, frequency, filters, etc.)
- Should the target data stores be column, or row based?
- Should Star Schemas or Data Vaults be used? and when?
- What, if any, Data Marts would be employed?
- Do they inter-relate or are they stand-alone?
- How much source data exists already?
- How big will the target data grow? and how fast?
- What are the expected complexities of the transformations involved?
- How about hardware and software requirements?
- Optimizations? Scalability? Functionality? ~~ you get the picture …
I surmise that for anyone who stops to think about the architecture involved, many more questions will
easily come to mind. The more questions you can consider and answer regarding any DW/BI system
architecture the better; and sooner is better than later. One principle I always try to employ in any
design I craft, is that change is inevitable so DW/BI system architectures must be pliable and extensible.
Achieve this in your solution and you’re at least half way there. Just remember to click your heels
together three times first, close your eyes, and repeat: there’s no place like home!
One final good practice common to many DW/BI system architectures aims at staging data first, before
it goes into the data warehouse. Often called an ODS, or Operational Data Store, having a dedicated
place to prepare, or stage, data before it moves into the data warehouse can be a tremendous
advantage to any DW/BI system architecture. It does introduce an additional ETL step, but usually quite
worth it. A decent ODS will contain a reduced dataset from the source with few to no transformations
thus eliminating data you know you don’t need. Then the next ETL step does the real work in
transformation and eventually loading into the data warehouse.
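The staging idea can be sketched as two ETL hops (table names here are hypothetical): a light first hop that lands only the needed source columns into the ODS with few to no transformations, and a second hop that does the real work into the warehouse:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Source system table (wide, with columns the warehouse never uses).
    CREATE TABLE src_orders (id INTEGER, amount REAL, notes TEXT, legacy_flag TEXT);
    INSERT INTO src_orders VALUES (1, 10.0, 'x', 'Y'), (2, -5.0, 'y', 'N');

    -- Hop 1: stage a reduced dataset in the ODS, eliminating columns
    -- we know we don't need, with no transformations.
    CREATE TABLE ods_orders AS SELECT id, amount FROM src_orders;

    -- Hop 2: the real work -- filter, conform, and load the warehouse.
    CREATE TABLE dw_orders AS
        SELECT id, amount FROM ods_orders WHERE amount > 0;
""")
print(db.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0])  # → 1
```

The extra hop costs one more ETL step, but it gives you a place to inspect, cleanse, and restart from staged data without going back to the source system.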
ETL data processing provides an essential methodology for moving data from point A to point B.
Crafting these processes can be straight-forward to highly complex. I often wonder how we all got by
with SQL scripting before ETL tools came along; today when faced with a data migration and/or data
warehouse population task, I break out these tools and get to work. Best
advice: Follow the yellow brick road…
As this topic is quite extensive, my next blog, “Extract-Transform-Load
(ETL) Technologies – Part 2”, will focus on ETL vendors so we can examine
who the key players are and what they offer.
As always, don’t be afraid to comment, question, or debate… I learn new
things every day!