ETL IN DW2.0
By W H Inmon

ETL (extract/transform/load) is the process of taking operational data and turning it into DW2.0 data. ETL takes the mapping generated from the source data system of record and uses that mapping as the specification for the transformation process. Fig etl.1 shows the ETL process in the context of DW2.0.

Fig etl.1 ETL is an integral part of the data warehouse 2.0 environment

ETL processing can be done manually or with an automation tool. When done manually, programs are written and maintained by programmers. When done with an automation tool, data mappings are fed into the tool and code is created that performs the mapping.

Fig etl.2 Transformation code can be created either manually or automatically

As a general rule, manual mappings are done only when there are very few programs to be written. If more than a few programs are needed, it is better to use an automation tool to build the transformations between source and target.

However they are created, programs are written for the purpose of moving and transforming source data to target data. One place those ETL programs can be executed is in the host machine environment, as seen in Fig etl.3.

Fig etl.3 ETL is executed in the host machine

The host machine environment is the environment that holds and executes operational processing. One advantage of this approach is that data from all parts of the operational environment is available if needed for ETL processing. On occasion, reference data and other sources of data are needed for ETL processing. Since the ETL processing is done in the operational environment, that data is naturally available.
© INMON DATA SYSTEMS, 2006, ALL RIGHTS RESERVED

The disadvantages of ETL processing taking place in the operational environment are that 1) the machine cycles are expensive, 2) the operational environment may not have windows of time to do ETL processing, and 3) the ETL processing is at odds with the way the operational environment is tuned. In addition, there may be little growth potential for the ETL processing should the processing window need to grow.

A second place to execute the ETL processes is in the machine that houses the data warehouse. Fig etl.4 shows this alternative.

Fig etl.4 ETL is executed in the machine that houses the data warehouse

Fig etl.4 shows that ETL processes have been placed in the processor that houses the data warehouse. Raw data is passed to the ETL processes. Once processed, the data is passed on to the data warehouse. There are both advantages and disadvantages to this approach.

The advantages are that 1) machine cycles are both more available and less expensive for ETL processing, 2) it is very easy to get the data, once processed, to the data warehouse, and 3) should there be a desire for more machine resources, they are probably easy to come by. In addition, there is no contention with an online window that has to be scheduled around.

Some of the disadvantages are that 1) auxiliary data that resides in the operational environment is not available if needed, and 2) lots of raw data must be passed for ETL processing.

A third alternative is to pass raw data from the host environment to a separate machine, process the ETL there, and then pass the results to the data warehouse on yet another machine. Fig etl.5 shows this alternative.

Fig etl.5 ETL is executed in a separate machine

Fig etl.5 shows that ETL processing is done on a processor altogether different from the host processor or the processor that holds the data warehouse. There are both advantages and disadvantages to this approach.
The advantage of this approach is that the processor that executes the ETL code is entirely dedicated to the ETL process. This means that the cost of machine cycles can be kept to a minimum and that there is no conflict with any operating window in the host environment.

The disadvantages are that 1) lots of raw data has to be passed, 2) the auxiliary data found in the host is not available, and 3) passing data to the data warehouse processor may require extra resources.

There is yet another option, and that option is the passing of raw data for ETL processing to multiple processors. This is done when there is a very large stream of raw data that must be processed quickly. Fig etl.6 shows this option.

Fig etl.6 ETL is executed in separate machines

In Fig etl.6 the raw data stream can be processed very quickly because it can be processed in a parallel manner. If more throughput is needed, more processors are added. By processing in parallel, the stream of input coming out of the host processor or processors can be accommodated.

On occasion when doing ETL processing, a staging area is needed. Fig etl.7 shows a staging area.

Fig etl.7 On occasion a staging area is useful

A staging area is a place (a processor or part of a processor) where data can be kept on a temporary basis. Data is moved from the host processor into the staging area. Once there, the data can wait for other data to join it, or can be separated for the purpose of parallel processing.

One of the limitations of parallel ETL processors is that there can be little or no interaction between the different processors as the data is being passed through. Fig etl.8 shows this issue.

Fig etl.8 With parallel ETL, there can be no interaction between ETL processors

Fig etl.8 shows that with parallel ETL there can be, at most, limited interaction between the ETL processors as they are executing their code. Each ETL processor acts independently of the other processors. Usually this does not place a large constraint on the DW2.0 architect, but it must be kept in mind.

DIFFERENT TYPES OF ETL

There are different types of ETL. One form of ETL is the case where requirements are fed to the ETL programs parametrically. Fig etl.9 shows this form of ETL.

Fig etl.9 One form of ETL is the type where requirements are fed into a program where the requirements are treated like parameters

Fig etl.9 shows that different parameters are fed to a central program. The program is then executed in real time. No executable module is created.

The other type of ETL is the type where executable code is produced. Fig etl.10 shows this form of ETL.

Fig etl.10 Another form of ETL is the type where requirements are fed into a control program where executable code is created

In this form of ETL, parameters are fed to the ETL tool and then executable modules are created. The executable modules are efficient to execute and are highly flexible.

ETL LIMITATIONS

One of the limitations of ETL processing is that data passes through the ETL process a record at a time. Fig etl.11 shows this limitation.

Fig etl.11 Data passes through the ETL process a record at a time

The fact that data passes through the ETL process a record at a time means that only processing that is limited to looking at a single record can be done. (In truth, it is theoretically possible that multiple records can be handled at the same time while passing through ETL.
But when multiple records are handled, processing becomes very complex. So, in order to keep processing simple, one record at a time is processed.) Fig etl.12 shows that editing and cleansing are done one record at a time while passing through ETL.

Fig etl.12 Data can be accessed and cleansed a record at a time (figure label: Edit, cleanse)

ELT

There is a mutant form of ETL processing called ELT (extract/load/transform). Fig etl.13 suggests ELT processing.

Fig etl.13 Another form of ETL is processing that is called ELT

Fig etl.13 shows that ELT processing consists of extracting data, loading data into the data warehouse, and then transforming it. There are both advantages and disadvantages to ELT processing.

The advantage of ELT processing is that multiple records can be accessed at once, something normally not possible with ETL processing. Fig etl.14 shows this advantage.

Fig etl.14 The advantage of ELT is that multiple groups of records can be edited at once

But there are considerable disadvantages to ELT processing. The first disadvantage is that processing may occur on the data in the data warehouse before it is transformed and integrated. The second disadvantage is that ELT processing automatically requires the most expensive machine cycles because it is executed on the data warehouse processor itself. The third disadvantage is that once the data is integrated, records in the data warehouse may have to be added and deleted.

METADATA AND ETL

One of the opportunities of doing ETL in an automated manner is the chance to capture metadata. Fig etl.15 shows that metadata can be captured as ETL processing is being done.

Fig etl.15 As data is being processed, there is the opportunity to gather metadata

Indeed, as ETL processing is done, considerable amounts of metadata are exposed.
At the very least, metadata representing the source, metadata representing the target, and the mapping between the two are required. Fig etl.16 shows this requirement for metadata.

Fig etl.16 Typical of the metadata that is captured are source, target, and transformation information

Since metadata is exposed during the transformation process, it is convenient to store the metadata so that there is a clear record of processing.

TRANSFORMATIONS IN THE ETL ENVIRONMENT

There are many different kinds of transformations that can occur during ETL processing. Fig etl.17 shows that the simplest kind of transformation is the mere movement of data from source field to target field.

Fig etl.17 The simplest kind of transformation: one field is simply moved (figure labels: Source, Target; Field A, Field A)

Another simple transformation is the reformatting of data as data is moved, as seen in Fig etl.18.

Fig etl.18 A simple transformation: reformatting a field (figure labels: Source, Target; yyyymmdd, mmddyyyy)

In some cases simple logic is required as part of transformation. Fig etl.19 shows that "male" and "female" are transformed to "m" and "f".

Fig etl.19 A simple conversion (figure labels: Source, Gender - male, female; Target, Gender - m, f)

Another type of transformation is the conversion of the unit of measurement. Fig etl.20 shows that inches are converted into centimeters.

Fig etl.20 A simple conversion of the unit of measurement (figure labels: Distance in inches, Distance in centimeters)

A simple example of a calculated conversion is the movement of data from one currency to another, as seen in Fig etl.21.

Fig etl.21 Another simple conversion, of currencies (figure labels: Source, Target; Money in US $, Money - Canadian $)

In yet other cases a more sophisticated calculation is required, as seen in the addition of different fields of data in Fig etl.22.
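
The field-level transformations described above (a straight move, a date reformat, a code conversion, a unit conversion, a currency conversion, and a sum of fields) can be sketched as small Python functions applied to one record at a time. This is only an illustrative sketch, not the output of any ETL tool: the field names, the sample exchange rate, and the yyyymmdd to mmddyyyy direction (taken from the label order in Fig etl.18) are assumptions made for the example.

```python
from datetime import datetime
from decimal import Decimal

CM_PER_INCH = Decimal("2.54")                 # exact by definition of the inch
GENDER_CODES = {"male": "m", "female": "f"}   # lookup for the simple code conversion


def reformat_date(yyyymmdd: str) -> str:
    """Reformat a yyyymmdd string as mmddyyyy (assumed direction)."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%m%d%Y")


def encode_gender(value: str) -> str:
    """Convert 'male'/'female' to 'm'/'f'; raises KeyError on any other value."""
    return GENDER_CODES[value.strip().lower()]


def inches_to_cm(inches: Decimal) -> Decimal:
    """Convert a distance in inches to centimeters."""
    return inches * CM_PER_INCH


def convert_currency(amount: Decimal, rate: Decimal) -> Decimal:
    """Apply an exchange rate (supplied by the caller) and round to cents."""
    return (amount * rate).quantize(Decimal("0.01"))


def transform_record(source: dict, rate: Decimal) -> dict:
    """Apply every transformation to a single source record.

    All field names here are hypothetical and exist only for illustration.
    """
    return {
        "customer": source["customer"],                        # simple move
        "order_date": reformat_date(source["order_date"]),     # reformat
        "gender": encode_gender(source["gender"]),             # code conversion
        "length_cm": inches_to_cm(source["length_in"]),        # unit conversion
        "amount_converted": convert_currency(source["amount"], rate),
        "field_d": source["field_a"] + source["field_b"] + source["field_c"],
    }
```

Each function looks only at the record it is given, which matches the record-at-a-time processing described under ETL LIMITATIONS.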

Fig etl.22 A simple calculation (figure labels: Source, Target; Field A + field B + field C; Field D)

A basic conversion that can appear in ETL transformation is in the basic representation of data, as seen in Fig etl.23.

Fig etl.23 A conversion of basic data representation (figure labels: Source, Target; ebcdic, ascii)

A more sophisticated conversion is from one dbms to another, as seen in Fig etl.24.

Fig etl.24 A conversion of dbms types (figure labels: Source, IMS; Target, Oracle)

In yet other cases, logic to determine the best source of data when there are multiple sources is part of ETL logic. Fig etl.25 shows such a conversion using logic.

Fig etl.25 Logic to determine which source of data is best (figure labels: Source, Field A, Field B; Target, Field C)

The real miracle of ETL is that although no single one of these transformations is terribly difficult, they must all be done at once by the ETL tool. Fig etl.26 shows that all transformations have to be done at once, in one pass of the source data.

Fig etl.26 The really challenging aspect of ETL is that all aspects of conversion must be done at once (the figure combines the transformations of Figs etl.17 through etl.25)

READING OLDER TECHNOLOGIES

Another differentiating factor between different ETL tools is the ability to read and process older technologies. Some ETL tools have the capability to read and process older technologies such as IMS and IDMS in a native format. Fig etl.27 shows this capability.

Fig etl.27 Some ETL can read older dbms in a native format

Fig etl.27 shows that some ETL tools have the capability to read older technologies in a native mode. This is a very good capability because there is much information stored in the structure of older technologies.

Other ETL tools do not have the capability to read older technologies in a native mode. In this case the ETL tool forces the data to be brought out in a "flat file" mode. This approach is shown by Fig etl.28.

Fig etl.28 Other ETL require the data to be "flattened" before it can be read

Fig etl.28 shows that this kind of ETL tool requires that older technologies be read and that the data in those technologies be brought out into a flat file format. Once in a flat file format, the data can then be read and processed by the ETL tool. But in reading the native data and flattening the file, much structural information is lost.
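
The loss of structure during flattening can be illustrated with a small Python sketch. It assumes a hierarchical record modeled as nested dictionaries with hypothetical "fields" and "children" keys; this is a stand-in for illustration only and is not the actual segment format of IMS or IDMS.

```python
from typing import Iterator, Optional


def flatten(segment: dict, inherited: Optional[dict] = None) -> Iterator[dict]:
    """Yield one flat row per leaf segment, copying ancestor fields into each row.

    The nested-dict layout ('fields', 'children') is a hypothetical model of a
    hierarchical record, not a real dbms format.
    """
    row = dict(inherited or {})
    row.update(segment["fields"])
    children = segment.get("children", [])
    if not children:
        yield row          # leaf segment: emit the accumulated flat row
        return
    for child in children:
        yield from flatten(child, row)


# A hypothetical customer with two orders; the first order has two line items.
customer = {
    "fields": {"customer": "Jones"},
    "children": [
        {
            "fields": {"order": 1},
            "children": [
                {"fields": {"item": "bolt"}},
                {"fields": {"item": "nut"}},
            ],
        },
        {"fields": {"order": 2}, "children": []},
    ],
}

flat_rows = list(flatten(customer))
```

In the flat rows the customer and order values are simply repeated on every row. The explicit parent and child relationships of the hierarchy survive only implicitly, through those repeated values, which is one way to picture the structural information that is lost.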