ETL IN DW2.0
By W H Inmon
ETL (extract/transform/load) is the process of taking operational data and turning
the operational data into DW2.0 data. ETL takes the mapping generated from the
source data system of record and uses the mapping as the specification for the
transformation process. Fig etl.1 shows the ETL process in the context of DW2.0.
Fig etl.1
ETL is an integral part of the data warehouse 2.0 environment
ETL processing can be done manually or with an automation tool. When it is done
manually, programs are written and maintained by programmers. When it is done with
an automation tool, data mappings are fed into the tool, and the tool creates the
code that performs the mapping.
Fig etl.2
Transformation code can be created either manually or automatically
As a general rule, manual mappings are done only when there are very few
programs to be written. If more than a handful of programs must be written, it is
better to use an automation tool to build the transformations between source and
target.
However they are created, programs are written for the purpose of moving and
transforming source data to target data. One place those ETL programs can be
executed is in the host machine environment, as seen in Fig etl.3.
Fig etl.3
ETL is executed in the host
machine
The host machine environment is the environment that holds and executes
operational processing. One advantage of this approach is that data from all parts of
the operational environment is available if needed for ETL processing. On occasion
reference data is needed and other sources of data are needed for ETL processing.
Since the ETL processing is done in the operational environment, that data is
naturally available.
© INMON DATA SYSTEMS, 2006, ALL RIGHTS RESERVED
The disadvantages of ETL processing taking place in the operational environment
are that 1) the machine cycles are expensive, 2) the operational environment may
not have windows of time in which to do ETL processing, and 3) the ETL processing is
at odds with the way the operational environment is tuned. In addition, there may be
little room for growth should the time needed for ETL processing increase.
A second place to execute the ETL processes is in the machine that houses the data
warehouse. Fig etl.4 shows this alternative.
Fig etl.4
ETL is executed in the machine
that houses the data warehouse
Fig etl.4 shows that the ETL processes have been placed in the processor that houses
the data warehouse. Raw data is passed to the ETL processes. Once processed, the data
is passed on to the data warehouse. There are both advantages and disadvantages to
this approach. The advantages are that 1) machine cycles are both more available and
less expensive for ETL processing, 2) it is very easy to get the processed data into
the data warehouse, and 3) should more machine resources be desired, they are
probably easy to come by. In addition, there is no online window to schedule around.
Some of the disadvantages are that 1) auxiliary data that resides in the operational
environment is not available if it is needed, and 2) lots of raw data must be passed
for ETL processing.
A third alternative is to pass raw data from the host environment to a separate
machine, process the ETL there, and then pass the results to the data warehouse on
yet a separate machine. Fig etl.5 shows this alternative.
Fig etl.5
ETL is executed in a
separate machine
Fig etl.5 shows that ETL processing is done on a processor altogether different from
the host processor or the processor that holds the data warehouse. There are both
advantages and disadvantages to this approach.
The advantage of this approach is that the processor that executes the ETL code is
entirely dedicated to the ETL process. This means that the cost of machine cycles
can be kept to a minimum and that there is no conflict with any operating window
in the host environment. The disadvantages are that 1) lots of raw data has to be
passed, 2) the auxiliary data found in the host is not available, and 3) passing data
to the data warehouse processor may require extra resources.
There is yet another option, and that option is the passing of raw data for ETL
processing to multiple processors. This is done when there is a very large stream of
raw data that must be processed quickly. Fig etl.6 shows this option.
Fig etl.6
ETL is executed in
separate machines
In Fig etl.6 the raw data stream can be processed very quickly because it can be
processed in parallel. If more throughput is needed, more processors are added. By
processing in parallel, the ETL processors can keep up with the stream of input
coming out of the host processor or processors.
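The parallel option can be sketched as follows. This is only an illustration, assuming each record can be transformed independently of every other record. The names `transform_record`, `partition`, and `run_parallel` are hypothetical, and threads inside one process stand in for the separate machines described above.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_record(record):
    # Hypothetical per-record transformation: normalize a name field.
    return {**record, "name": record["name"].strip().upper()}


def partition(records, n):
    # Deal the records round-robin into n independent partitions.
    return [records[i::n] for i in range(n)]


def transform_partition(part):
    # Each worker sees only its own partition.
    return [transform_record(r) for r in part]


def run_parallel(records, workers=3):
    parts = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform_partition, parts))
    # Flatten the per-partition outputs into one load-ready list.
    return [row for part in results for row in part]


raw = [{"id": i, "name": f" customer {i} "} for i in range(7)]
loaded = run_parallel(raw, workers=3)
```

Raising `workers` corresponds to adding processors when more throughput is needed. Note that the output order follows the partitions, not the original input order.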
On occasion when doing ETL processing, a staging area is needed. Fig etl.7 shows a
staging area.
Fig etl.7
On occasion a staging area
is useful
A staging area is a place, either a processor or part of a processor, where data can
be kept on a temporary basis. Data is moved from the host processor into the staging
area. Once there, the data can wait for other data to join it, or it can be separated
for the purpose of parallel processing.
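As a rough sketch of data waiting in a staging area for other data to join it, the hypothetical `StagingArea` class below holds records by key and releases them only when every required source has delivered its part. The source names and the join key are assumptions made for the example.

```python
from collections import defaultdict


class StagingArea:
    """Temporary holding area keyed by a hypothetical join key."""

    def __init__(self, required_sources):
        self.required = set(required_sources)
        self.held = defaultdict(dict)  # key -> {source name: record}

    def stage(self, source, key, record):
        # Hold the record; release the group once every source has arrived.
        self.held[key][source] = record
        if set(self.held[key]) == self.required:
            return self.held.pop(key)
        return None


staging = StagingArea(required_sources=["orders", "customers"])
first = staging.stage("orders", 42, {"order_id": 42, "amount": 10.0})
second = staging.stage("customers", 42, {"cust_id": 42, "name": "Ann"})
```

The first call returns `None` because the order is still waiting. The second call returns both records together and removes them from the staging area.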
One of the limitations of parallel ETL processors is that there can be little or no
interaction between the different processors as the data is being passed through. Fig
etl.8 shows this issue.
Fig etl.8
With parallel ETL, there can be little or no interaction between ETL processors
Fig etl.8 shows that with parallel ETL there can be, at most, limited interaction
between the ETL processors as they are executing their code. Each ETL processor acts
independently of the other processors. Usually this does not place a large constraint
on the DW2.0 architect, but it must be kept in mind.
DIFFERENT TYPES OF ETL
There are different types of ETL. One form of ETL is the case where requirements are
fed to the ETL programs parametrically. Fig etl.9 shows this form of ETL.
Fig etl.9
One form of ETL is the type where requirements are fed into a program and treated
like parameters (the figure shows five sets of requirements feeding one program)
Fig etl.9 shows that different parameters are fed to a central program. The program
is then executed in real time. There is no executable module that is created.
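A minimal sketch of the parametric form, assuming the requirements arrive as a list of small dictionaries. The operation names `move`, `upper`, and `scale` and the function `run_parametric_etl` are invented for the illustration. The central program interprets the requirements on every call, and no separate module is produced.

```python
def run_parametric_etl(record, requirements):
    # Interpret the requirements on every call; no module is generated.
    target = {}
    for req in requirements:
        value = record[req["source"]]
        if req["op"] == "move":
            target[req["target"]] = value
        elif req["op"] == "upper":
            target[req["target"]] = value.upper()
        elif req["op"] == "scale":
            target[req["target"]] = value * req["factor"]
        else:
            raise ValueError(f"unknown op: {req['op']}")
    return target


requirements = [
    {"op": "move", "source": "id", "target": "customer_id"},
    {"op": "upper", "source": "name", "target": "customer_name"},
    {"op": "scale", "source": "inches", "target": "cm", "factor": 2.54},
]
row = run_parametric_etl({"id": 7, "name": "ann", "inches": 10}, requirements)
```

Changing the behavior means changing the requirements list, not the program.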
The other type of ETL is the type where executable code is produced. Fig etl.10
shows this form of ETL.
Fig etl.10
Another form of ETL is the type where requirements are fed into a control program
where executable code is created (the figure shows five sets of requirements feeding
a control program that produces executable code)
In this form of ETL, requirements are fed to the ETL tool as parameters, and
executable modules are then created. The executable modules are efficient to execute
and are highly flexible.
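The code-producing form can be sketched in the same spirit. The hypothetical `generate_etl_source` function below emits the source text of a transform function from the requirements, and `build_executable` compiles it once. Python's built-in `exec` stands in for whatever compiler a real ETL tool would invoke.

```python
def generate_etl_source(requirements):
    # Emit the source text of a transform function from the requirements.
    lines = ["def transform(record):", "    target = {}"]
    for req in requirements:
        src = f"record[{req['source']!r}]"
        if req["op"] == "move":
            expr = src
        elif req["op"] == "upper":
            expr = f"{src}.upper()"
        else:
            raise ValueError(f"unknown op: {req['op']}")
        lines.append(f"    target[{req['target']!r}] = {expr}")
    lines.append("    return target")
    return "\n".join(lines)


def build_executable(requirements):
    # Compile once; later calls run the generated code directly.
    namespace = {}
    exec(generate_etl_source(requirements), namespace)
    return namespace["transform"]


transform = build_executable([
    {"op": "move", "source": "id", "target": "customer_id"},
    {"op": "upper", "source": "name", "target": "customer_name"},
])
out = transform({"id": 7, "name": "ann"})
```

The generated function contains no dispatch on operation names, which is one reason generated modules tend to execute efficiently.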
ETL LIMITATIONS
One of the limitations of ETL processing is that data passes through the ETL process
a record at a time. Fig etl.11 shows this limitation.
Fig etl.11
Data passes through the ETL
process a record at a time
The fact that data passes through the ETL process a record at a time means that
only processing that is limited to looking at a single record can be done. (In truth, it
is theoretically possible that multiple records can be handled at the same time while
passing through ETL. But when multiple records are handled, processing becomes
very complex. So, in order to keep processing simple, one record at a time is
processed.)
Fig etl.12 shows that editing and cleansing are done one record at a time while
passing through ETL.
Fig etl.12
Data can be accessed and cleansed a record at a time (figure label: edit, cleanse)
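Record-at-a-time processing might look like the sketch below, where a generator passes each record through a hypothetical `cleanse` function that can see only the fields of the record in hand. The field names and the age range are assumptions chosen for the example.

```python
def cleanse(record):
    # Edit and cleanse using only the fields of this one record.
    cleaned = dict(record)
    cleaned["name"] = cleaned["name"].strip().title()
    if cleaned["age"] is not None and not (0 <= cleaned["age"] <= 130):
        cleaned["age"] = None  # an out-of-range value is blanked
    return cleaned


def etl_stream(records):
    # Records flow through one at a time; nothing is buffered.
    for record in records:
        yield cleanse(record)


source = [{"name": "  ann lee ", "age": 34}, {"name": "BOB", "age": 999}]
cleaned_rows = list(etl_stream(source))
```

A check such as "is this customer a duplicate of another customer" cannot be written inside `cleanse`, because it would need to look at more than one record.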
ELT
There is a mutant form of ETL processing called ELT (extract/load/transform). Fig
etl.13 illustrates ELT processing.
Fig etl.13
Another form of ETL is
processing that is called
ELT
Fig etl.13 shows that ELT processing consists of extracting data, loading data into the
data warehouse, and then transforming it. There are both advantages and
disadvantages to ELT processing. The advantage of ELT processing is that multiple
records can be accessed at once, something normally not possible with ETL
processing. Fig etl.14 shows this advantage.
Fig etl.14
The advantage of ELT is that
multiple groups of records can
be edited at once
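The multi-record advantage can be illustrated with a small sketch in which an in-memory SQLite database stands in for the data warehouse. The table names `raw_customer` and `customer` and the sample rows are invented. Raw data is loaded first, and the transformation then compares several records at once to keep only the latest row per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse

# Extract and load: raw rows go into the warehouse untransformed.
conn.execute("CREATE TABLE raw_customer (cust_id INTEGER, name TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO raw_customer VALUES (?, ?, ?)",
    [(1, "ann", "20060101"), (1, "Ann Lee", "20060301"), (2, "bob", "20060201")],
)

# Transform: a set-based statement that looks at several records at once.
conn.execute(
    """
    CREATE TABLE customer AS
    SELECT r.cust_id, r.name, r.updated
    FROM raw_customer AS r
    WHERE r.updated = (
        SELECT MAX(updated) FROM raw_customer WHERE cust_id = r.cust_id
    )
    """
)
rows = conn.execute("SELECT cust_id, name FROM customer ORDER BY cust_id").fetchall()
```

Until the second statement runs, the untransformed `raw_customer` rows sit in the warehouse, and the transformation itself consumes warehouse machine cycles.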
But there are considerable disadvantages to ELT processing. The first disadvantage is
that processing may occur on the data in the data warehouse before it is
transformed and integrated. The second disadvantage is that ELT processing
automatically requires the most expensive machine cycles because it is executed on
the data warehouse processor itself. The third disadvantage is that once the data is
integrated, records in the data warehouse may have to be added and deleted.
METADATA AND ETL
One of the opportunities of doing ETL in an automated manner is the chance to
capture metadata. Fig etl.15 shows that metadata can be captured as ETL processing
is being done.
Fig etl.15
As data is being processed, there is the opportunity to gather metadata (figure
label: Md)
Indeed, as ETL processing is done, considerable amounts of metadata are exposed.
At the very least, metadata representing the source, metadata representing the
target, and the mapping between the two are required. Fig etl.16 shows this
requirement for metadata.
Fig etl.16
Typical of the metadata that is captured are source, target, and transformation
information
Since metadata is exposed during the transformation process anyway, it is convenient
to store the metadata so that there is a clear record of processing.
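One way to picture metadata capture is the sketch below, in which each applied mapping writes a source, target, and transformation entry to a log as a side effect of processing. The `MappingMetadata` and `MetadataLog` classes and the shape of the mapping tuples are assumptions for the example, not the format of any particular ETL tool.

```python
from dataclasses import dataclass, field


@dataclass
class MappingMetadata:
    source: str
    target: str
    transformation: str


@dataclass
class MetadataLog:
    entries: list = field(default_factory=list)

    def record(self, source, target, transformation):
        self.entries.append(MappingMetadata(source, target, transformation))


def apply_mapping(record, mapping, log):
    # mapping: list of (source field, target field, description, function)
    out = {}
    for src, tgt, description, fn in mapping:
        out[tgt] = fn(record[src])
        log.record(src, tgt, description)  # metadata captured as a by-product
    return out


log = MetadataLog()
mapping = [
    ("name", "customer_name", "uppercase", str.upper),
    ("id", "customer_id", "move", lambda v: v),
]
result = apply_mapping({"name": "ann", "id": 7}, mapping, log)
```

After the run, `log.entries` holds one source, target, and transformation entry per mapped field.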
TRANSFORMATIONS IN THE ETL ENVIRONMENT
There are many different kinds of transformations that can occur during ETL
processing. Fig etl.17 shows that the simplest kind of transformation is the mere
movement of data from source field to target field.
Fig etl.17
The simplest kind of transformation: one field is simply moved (source Field A to
target Field A)
Another simple transformation is the reformatting of data as data is moved, as seen
in Fig etl.18.
Fig etl.18
A simple transformation: reformatting a field (the figure pairs the formats yyyymmdd
and mmddyyyy)
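A reformatting step of this kind could be sketched as below. The direction, from yyyymmdd in the source to mmddyyyy in the target, is an assumption, and `reformat_date` is a hypothetical helper. Parsing the value before rewriting it means impossible dates are rejected instead of being passed along.

```python
from datetime import datetime


def reformat_date(value, source_fmt="%Y%m%d", target_fmt="%m%d%Y"):
    # Parsing first rejects impossible dates such as 20060231.
    return datetime.strptime(value, source_fmt).strftime(target_fmt)


converted = reformat_date("20061231")
```

Swapping the two format arguments gives the opposite direction.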
In some cases simple logic is required as part of transformation. Fig etl.19 shows
that “male” and “female” are transformed to “m” and “f”.
Fig etl.19
A simple conversion (source: gender as male, female; target: gender as m, f)
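The gender conversion amounts to a lookup, sketched here with a hypothetical `encode_gender` function. Trimming whitespace, ignoring case, and raising an error on an unknown value are choices made for the example, not something the text prescribes.

```python
GENDER_CODES = {"male": "m", "female": "f"}


def encode_gender(value):
    # Normalize spacing and case before the lookup.
    key = value.strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unrecognized gender value: {value!r}")
    return GENDER_CODES[key]


codes = [encode_gender(v) for v in ["male", "Female ", "MALE"]]
```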
Another type of transformation is the conversion of the unit of measurement. Fig
etl.20 shows that inches are converted into centimeters.
Fig etl.20
A simple conversion of the unit of measurement (source: distance in inches; target:
distance in centimeters)
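A minimal sketch of the unit conversion, using the exact definition of 2.54 centimeters per inch. The function name `inches_to_cm` is hypothetical, and `Decimal` is used so that the result is not subject to binary floating-point rounding.

```python
from decimal import Decimal

CM_PER_INCH = Decimal("2.54")  # exact by international definition


def inches_to_cm(inches):
    # Go through str() so float inputs such as 0.5 convert cleanly.
    return Decimal(str(inches)) * CM_PER_INCH


ten_inches = inches_to_cm(10)
half_inch = inches_to_cm(0.5)
```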
A simple example of a calculated conversion is the conversion of money from one
currency to another, as seen in Fig etl.21.
Fig etl.21
Another simple conversion of currencies (the figure pairs money in US $ with money
in Canadian $)
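A currency conversion differs from a unit conversion in that the rate is an input that changes over time. The sketch below assumes a US dollar source and a Canadian dollar target. The rate shown is purely illustrative and is not a real quotation, and `convert_currency` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def convert_currency(amount, rate):
    # Multiply by the rate, then round to whole cents.
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


HYPOTHETICAL_USD_TO_CAD = Decimal("1.1342")  # illustrative only, not a real quote
cad = convert_currency(Decimal("100.00"), HYPOTHETICAL_USD_TO_CAD)
```

In practice the rate would come from reference data, which is one reason the availability of auxiliary data matters when choosing where ETL runs.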
In yet other cases a more sophisticated calculation is required, as seen in the
addition of different fields of data in Fig etl.22.
Fig etl.22
A simple calculation (source: Field A + Field B + Field C; target: Field D)
A basic conversion that can appear in ETL transformation is in the basic
representation of data, as seen in Fig etl.23.
Fig etl.23
A conversion of basic data representation (the figure pairs ebcdic with ascii)
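A conversion of data representation can be sketched with Python's built-in codecs. The sketch assumes an EBCDIC source and an ASCII target, and it assumes code page 037 (`cp037`), which is only one of several EBCDIC code pages. A real source system may use a different one.

```python
def ebcdic_to_ascii(raw, codepage="cp037"):
    # Decode the EBCDIC bytes to text, then re-encode the text as ASCII.
    return raw.decode(codepage).encode("ascii")


ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # "HELLO" in cp037
ascii_bytes = ebcdic_to_ascii(ebcdic_bytes)
```

This handles character data only. Packed decimal and binary fields in a mainframe record need their own handling and must not be passed through a character codec.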
A more sophisticated conversion is from one dbms to another, as seen in Fig etl.24.
Fig etl.24
A conversion of dbms types (source: IMS; target: Oracle)
In yet other cases, the ETL logic must determine the best source of data when there
are multiple sources. Fig etl.25 shows such a conversion using logic.
Fig etl.25
Logic to determine which source of data is best (source: Field A and Field B; target:
Field C)
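What counts as the "best" source is a business decision. The sketch below assumes one simple rule, chosen only for illustration: prefer Field A when it is populated, and otherwise fall back to Field B. The function name `choose_best` is hypothetical.

```python
def choose_best(field_a, field_b):
    # Hypothetical rule: prefer source A when populated, else fall back to B.
    for candidate in (field_a, field_b):
        if candidate is not None and str(candidate).strip() != "":
            return candidate
    return None


field_c_values = [
    choose_best("Ann Lee", "A. Lee"),
    choose_best("   ", "A. Lee"),
    choose_best(None, None),
]
```

Other rules are just as plausible, such as taking the most recently updated value or the value from the system designated as the system of record.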
The real miracle of ETL is that, although no one of these transformations is
terribly difficult, they must all be done at once by the ETL tool. Fig etl.26 shows
that all transformations have to be done at once, in one pass of the source data.
Fig etl.26
The really challenging aspect of ETL is that all aspects of conversion must be done
at once (the figure repeats the source and target items of Figs etl.17 through etl.25
side by side)
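Doing everything in one pass can be pictured as a single function that applies every conversion while the record is in hand. The field names below are invented, and the conversion directions follow the same assumptions a reader would have to make from the figures.

```python
from datetime import datetime

GENDER = {"male": "m", "female": "f"}


def transform_in_one_pass(record):
    # Every conversion is applied while this record is in hand.
    return {
        "field_a": record["field_a"],  # simple move
        "date": datetime.strptime(record["date"], "%Y%m%d").strftime("%m%d%Y"),
        "gender": GENDER[record["gender"]],
        "distance_cm": record["distance_in"] * 2.54,
        "field_d": record["a"] + record["b"] + record["c"],
    }


def run(source_records):
    # One read of the source; each record is visited exactly once.
    return [transform_in_one_pass(r) for r in source_records]


source = [{"field_a": "x", "date": "20060115", "gender": "female",
           "distance_in": 100, "a": 1, "b": 2, "c": 3}]
targets = run(source)
```

Each line inside the function is easy on its own. The difficulty in practice comes from keeping dozens or hundreds of such rules correct together across many source files.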
READING OLDER TECHNOLOGIES
Another differentiating factor among ETL tools is the ability to read and process
older technologies. Some ETL tools can read and process older technologies such as
IMS and IDMS in a native format. Fig etl.27 shows this capability.
Fig etl.27
Some ETL tools can read older dbms in a native format
Fig etl.27 shows that some ETL tools have the capability to read older technologies in
a native mode. This is a very good capability because there is much information
stored in the structure of older technologies.
Other ETL tools do not have the capability to read older technologies in a native
mode. In this case the ETL tool forces the data to be brought out in a “flat file”
mode. This approach is shown by Fig etl.28.
Fig etl.28
Other ETL tools require the data to be “flattened” before it can be read
Fig etl.28 shows that this kind of ETL tool requires that data in older technologies
first be brought out into a flat file format. Once in a flat file format, the data
can be read and processed by the ETL tool. But in reading the native data and
flattening the file, much structural information is lost.
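The loss of structure can be seen in a small sketch. A nested Python dictionary stands in for a hierarchical record with a parent and its children. This is not the IMS or IDMS interface, and the `flatten` function and field names are invented for the illustration.

```python
def flatten(customer):
    # Emit one flat row per child, repeating the parent fields on each row.
    return [
        {"cust_id": customer["cust_id"], "name": customer["name"],
         "order_id": order["order_id"], "amount": order["amount"]}
        for order in customer["orders"]
    ]


with_orders = {"cust_id": 1, "name": "Ann", "orders": [
    {"order_id": 10, "amount": 5.0}, {"order_id": 11, "amount": 7.5}]}
without_orders = {"cust_id": 2, "name": "Bob", "orders": []}

flat_rows = flatten(with_orders) + flatten(without_orders)
```

In the flat rows the parent and child relationship is no longer explicit, the parent fields are repeated, and the customer with no orders has disappeared entirely.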