Java/XML ETL Engine
By
Bob Timlin
Outline
• Data Extraction, Transformation, and
Loading (ETL).
• Java & XML
• Meta-Data
• Mapping Data from Source to Target
Outline (continued)
• Proposed XML Usage
• XML for Meta-Data
• Challenges/Issues
• Sample XML Data File
• Sample XML Meta-Data File
Extract/Transform/Load (ETL):
Getting data from the source system(s) into the data warehouse is easily
80% of the effort of the entire data-warehouse project. This is because of
the complexity of the source systems, the cleansing or transformation
process, and all of the prep work needed to turn detailed operational data
into summary data-warehouse data. The more source systems you have, the
harder this process becomes, and the difficulty increases exponentially.
Cleaning/transforming the data is probably the most complicated part of
this process.
Transformations can be done on either the source system or the target system.
Java & XML
Currently, ETL processes are mostly written in COBOL and C with
embedded SQL. There are many GUI tools out there to streamline
this process. These tools mostly generate proprietary code that is
then executed by a scheduling program.
All of the big vendors in this field are pushing XML as the language
for storing transformation meta-data, and all of the big players, sans
Microsoft, are backing Java as the language for implementing
transformations. For some reason, Microsoft doesn’t seem to like Java.
The major vendors include IBM, Oracle, and Microsoft.
Meta-Data
Data about data. In a data warehouse, meta-data describes the
structures of both the source and destination data and how to
extract, transform, and load that data. It may also hold network
configuration information such as IP addresses and ports.
The Meta Data Coalition (http://www.mdcinfo.com/) recently
merged with the Object Management Group (OMG,
http://www.omg.org). They are backed by many heavy hitters,
including Oracle, IBM, and Microsoft. The industry seems to be
moving towards using XML for storing meta-data, which makes
the meta-data standardized and portable.
Mapping Data from Source to Target:
Target:
Name: The name of the logical table in the data-warehouse.
Source: The table name in the XML data file.
Driver: The JDBC driver class name.
Url: The connection URL of the data warehouse.
Username: The username used to connect to the data warehouse.
Password: The password used to connect to the data warehouse.
Mapping (continued)
Column:
Name: The name of the logical column in the data warehouse.
Type: The data type of the logical column in the data warehouse.
Key: Whether this column is a primary key; if so, the engine will use it
in the WHERE clause.
Source: The name of the column in the XML data file.
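As an illustration of how the Key attribute could feed a WHERE clause, here is a minimal sketch. The slide does not say which SQL statement the engine generates, so the UPDATE statement, the ColumnMapping class, and the UpdateSqlBuilder class are all assumptions made for this example rather than parts of the actual engine.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one column mapping: name, type, key flag, source. */
final class ColumnMapping {
    final String name;    // logical column name in the data warehouse
    final String type;    // logical data type, e.g. "number" or "varchar"
    final boolean key;    // true if the column is part of the primary key
    final String source;  // column name in the XML data file

    ColumnMapping(String name, String type, boolean key, String source) {
        this.name = name;
        this.type = type;
        this.key = key;
        this.source = source;
    }
}

/** Hypothetical helper that turns column mappings into a parameterized UPDATE. */
final class UpdateSqlBuilder {
    /** Non-key columns go into SET, key columns into WHERE. */
    static String buildUpdate(String table, List<ColumnMapping> columns) {
        List<String> sets = new ArrayList<>();
        List<String> keys = new ArrayList<>();
        for (ColumnMapping c : columns) {
            (c.key ? keys : sets).add(c.name + " = ?");
        }
        if (sets.isEmpty() || keys.isEmpty()) {
            throw new IllegalArgumentException(
                    "need at least one key column and one non-key column");
        }
        return "UPDATE " + table
                + " SET " + String.join(", ", sets)
                + " WHERE " + String.join(" AND ", keys);
    }
}
```

The "?" placeholders are meant to be bound through a JDBC PreparedStatement, which keeps data values out of the SQL text.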
Proposed XML Usage
• For meta-data about the ETL processing.
This will contain all information about
mapping source to target, including
transformation rules.
• As a data file to store data from databases.
XML for Meta-Data
The specification is designed to be flexible enough to support
many protocols; for our project, however, we will implement only two:
1. XML Data File
2. JDBC
The protocol is part of the url attribute of the target or source
node. Every transformation has a source and a target.
<source url="xml://localhost/tmp/test.xml">
…
<target url="jdbc:oracle:thin:@localhost:1521:timlin"
driver="oracle.jdbc.driver.OracleDriver"
username="scott" password="tiger" name="srctest">
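Since the protocol is carried in the url attribute, an engine has to inspect that attribute to decide how to read or write the data. The sketch below shows one way this could be done. The ProtocolResolver class and both method names are hypothetical, and treating the part after the host in an xml:// url as a local file path is an assumption, not something the specification states.

```java
import java.util.Locale;

/** Hypothetical helper that inspects the url attribute of a source or target node. */
final class ProtocolResolver {
    /** Returns the text before the first ':' in the url, lower-cased, e.g. "xml" or "jdbc". */
    static String protocolOf(String url) {
        int colon = url.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("no protocol in url: " + url);
        }
        return url.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    /**
     * For an xml:// url, drops the "xml://host" prefix and returns the rest.
     * Assumption: the remainder is the path of the data file.
     */
    static String xmlFilePath(String url) {
        String prefix = "xml://";
        if (!url.startsWith(prefix)) {
            throw new IllegalArgumentException("not an xml url: " + url);
        }
        int slash = url.indexOf('/', prefix.length());
        if (slash < 0) {
            throw new IllegalArgumentException("no file path in url: " + url);
        }
        return url.substring(slash);
    }
}
```

A jdbc url would be handed to the JDBC driver unchanged, so only the xml case needs extra parsing here.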
The basic construct of an XML meta-data file is shown below.
Square brackets mark optional parts.
<translation>
  <source url="…", etc.>
    <column name="…">
      [<rule language="…">
      </rule>]
    </column>
    [<column name="…">[<rule></rule>]</column>]
  </source>
  <target url="…", etc.>
    <column name="…", etc.>[<rule></rule>]</column>
    [<column name="…", etc.>[<rule></rule>]</column>]
  </target>
</translation>
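To make the construct above concrete, here is a minimal sketch of reading such a file with the standard Java DOM parser and listing the columns found under the source and target nodes. The MetaDataReader class and its method are hypothetical, and the sketch assumes the file is well-formed XML, which the template itself is not because of its bracket notation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical reader for a translation meta-data document. */
final class MetaDataReader {
    /** Returns "source.<name>" or "target.<name>" for every column element found. */
    static List<String> columnNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            for (String side : new String[] {"source", "target"}) {
                NodeList sides = doc.getElementsByTagName(side);
                for (int i = 0; i < sides.getLength(); i++) {
                    // Collect every column nested anywhere under this source/target node.
                    NodeList cols = ((Element) sides.item(i)).getElementsByTagName("column");
                    for (int j = 0; j < cols.getLength(); j++) {
                        Element col = (Element) cols.item(j);
                        names.add(side + "." + col.getAttribute("name"));
                    }
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse meta-data", e);
        }
    }
}
```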
Challenges/Issues
• Mapping multiple sources to multiple
targets.
• Transformations can involve very complex
coding, especially eliminating duplicates,
merging, and purging data. These
transformations usually involve “fuzzy”
logic.
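To give a feel for the kind of "fuzzy" logic that duplicate elimination can need, the sketch below compares two names by Levenshtein edit distance. This is one common technique chosen for illustration only; the slides do not say which matching method the engine uses, and the FuzzyMatch class, its method names, and the distance threshold are assumptions.

```java
import java.util.Locale;

/** Hypothetical fuzzy comparison based on Levenshtein edit distance. */
final class FuzzyMatch {
    /** Number of single-character inserts, deletes, or substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Treats two values as likely duplicates if, ignoring case and outer spaces, they are close. */
    static boolean likelyDuplicate(String a, String b, int maxDistance) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        return editDistance(x, y) <= maxDistance;
    }
}
```

A suitable threshold depends heavily on the data; a value that catches typos in long names may merge distinct short names.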
<target url="jdbc:oracle:thin:@64.130.33.125:1521:timlin"
driver="oracle.jdbc.driver.OracleDriver"
username="scott" password="tiger" name="srctest">
<!-- As the target, connect to the database using JDBC and
insert the data from the source XML file, applying the rules
that follow. -->
<table name="patients">
<column name = "lname" source="fullname">
<rule language="java">
source.replace("'", "")
</rule>
<rule language="sql">
INITCAP(SUBSTR(source, 1, INSTR(source, ',') -1))
</rule>
</column>
<column name="fname" source="fullname">
<rule language="java">
source.replace("'", "")
</rule>
<rule language="sql">
INITCAP(SUBSTR(source, INSTR(source, ',') +1))
</rule>
</column>
<column name="dob" source="dob">
<rule language="sql">
TO_DATE(source, 'DD/MM/YYYY')
</rule>
</column>
</table>
</target>
</translation>
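To show what the lname and fname rules above compute, here is a plain-Java sketch of the same steps: strip apostrophes (the Java rule), then split on the comma and capitalize (the SQL rule). The NameRules class is hypothetical, and initCap is only a rough stand-in for Oracle's INITCAP written for this example; in the meta-data above, the SQL rules would be evaluated by the database itself.

```java
/** Hypothetical Java rendering of the lname/fname rules from the meta-data example. */
final class NameRules {
    /** The Java rule: source.replace("'", "") */
    static String stripApostrophes(String source) {
        return source.replace("'", "");
    }

    /** Rough stand-in for INITCAP: upper-case the first letter of each word, lower-case the rest. */
    static String initCap(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char ch : s.toCharArray()) {
            if (Character.isLetterOrDigit(ch)) {
                out.append(startOfWord ? Character.toUpperCase(ch) : Character.toLowerCase(ch));
                startOfWord = false;
            } else {
                out.append(ch);
                startOfWord = true; // any non-alphanumeric character ends a word
            }
        }
        return out.toString();
    }

    /** Mirrors INITCAP(SUBSTR(source, 1, INSTR(source, ',') - 1)). */
    static String lastName(String fullname) {
        String cleaned = stripApostrophes(fullname);
        return initCap(cleaned.substring(0, commaIndex(cleaned)));
    }

    /** Mirrors INITCAP(SUBSTR(source, INSTR(source, ',') + 1)). */
    static String firstName(String fullname) {
        String cleaned = stripApostrophes(fullname);
        return initCap(cleaned.substring(commaIndex(cleaned) + 1));
    }

    // Assumption: a name without a comma is rejected rather than guessed at.
    private static int commaIndex(String s) {
        int comma = s.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected \"last,first\": " + s);
        }
        return comma;
    }
}
```

Like the SQL rule, firstName keeps any whitespace that follows the comma, so a real rule set would probably add a trim step.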
<translation>
<!-- From database to XML -->
<source url="jdbc:oracle:thin:@64.130.33.125:1521:timlin"
driver="oracle.jdbc.driver.OracleDriver"
username="scott" password="tiger" name="targetTest">
<table name="patients">
<column name = "lname" source="fullname">
<rule language="java">
source.replace("'", "")
</rule>
<rule language="sql">
INITCAP(SUBSTR(source, 1, INSTR(source, ',') -1))
</rule>
</column>
<column name="fname" source="fullname">
<rule language="java">
source.replace("'", "")
</rule>
<rule language="sql">
INITCAP(SUBSTR(source, INSTR(source, ',') +1))
</rule>
</column>
<column name="dob" source="dob">
<rule language="sql">
TO_DATE(source, 'DD/MM/YYYY')
</rule>
</column>
</table>
</source>
<target url="xml://localhost/tmp/test.xml">
<table name="patients">
<column name="fullname"></column>
<column name="street"></column>
<column name="city"></column>
<column name="state"></column>
<column name="zip"></column>
<column name="dob"></column>
<column name="balance"></column>
</table>
</target>
</translation>
Sample XML Data File
<Record TableName="table1">
<Column1>data for column 1</Column1>
<Column2>data for column 2</Column2>
</Record>
<Record TableName="table1">
<Column1>data for column 1</Column1>
<Column2>data for column 2</Column2>
</Record>
<Record TableName="table2">
<Column1>data for column 1</Column1>
<Column2>data for column 2</Column2>
</Record>
<column name="month_admitted" type="number"
source="Month_Admitted"></column>
<column name="year_admitted" type="number"
source="Year_Admitted"></column>
<column name="source_of_admission" type="number"
source="Source_of_admission"></column>
<column name="disposition" type="number"
source="Disposition"></column>
<column name="charges" type="number" source="Charges"></column>
<column name="drg" type="number"
source="Diagnosis_Related_Group"></column>
<column name="rec_link_no" type="varchar"
source="Record_Linkage_Number" key="yes"></column>
</target>
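The type attribute on each column suggests that the engine converts the text read from the XML data file into a typed value before loading it. The sketch below shows one way that conversion could look for the two types that appear in these slides, "number" and "varchar". The ColumnTypes class and the choice of BigDecimal for numbers are assumptions made for this example.

```java
import java.math.BigDecimal;
import java.util.Locale;

/** Hypothetical converter from XML text to a Java value, driven by the column's type attribute. */
final class ColumnTypes {
    /** Converts raw text according to the logical type; null input stays null. */
    static Object convert(String type, String raw) {
        if (raw == null) {
            return null;
        }
        switch (type.toLowerCase(Locale.ROOT)) {
            case "number":
                // BigDecimal avoids rounding surprises; throws NumberFormatException on bad input.
                return new BigDecimal(raw.trim());
            case "varchar":
                return raw;
            default:
                throw new IllegalArgumentException("unsupported column type: " + type);
        }
    }
}
```

The converted value could then be bound to an insert or update statement with PreparedStatement.setObject.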
<Record TableName="patient">
<ID>1</ID>
<Facility>10735</Facility>
<Age>67</Age>
<Sex>2</Sex>
<Ethnicity>2</Ethnicity>
<Race>2</Race>
<ZIP>946</ZIP>
<Length_Of_Stay>18</Length_Of_Stay>
<Month_Admitted>12</Month_Admitted>
<Year_Admitted>1995</Year_Admitted>
<Source_of_admission>512</Source_of_admission>
<Disposition>11</Disposition>
<Charges>36948</Charges>
<Diagnosis_Related_Group>202</Diagnosis_Related_Group>
<Record_Linkage_Number>FRFSFEM1E</Record_Linkage_Number>
</Record>
<Record TableName="drg">
<Diagnosis_Related_Group>1</Diagnosis_Related_Group>
<Major_Diagnostic>1</Major_Diagnostic>
<Category>S</Category>
<Description><![CDATA[CRANIOTOMY, AGE >17 EXCEPT FOR
TRAUMA]]></Description>
</Record>
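As a sketch of how records like the ones above could be read, the code below parses Record elements with the standard Java DOM parser and turns each one into a map from element name to text. The DataFileReader class, the Records wrapper element, and the "@TableName" map key are hypothetical. The wrapper is added because the records as shown have no single root element, which an XML parser requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical reader for the Record-based XML data file. */
final class DataFileReader {
    /** Returns one ordered map per Record; the table name is stored under "@TableName". */
    static List<Map<String, String>> readRecords(String fragment) {
        try {
            // Assumption: the input is a bare sequence of Record elements with no XML declaration.
            String wrapped = "<Records>" + fragment + "</Records>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            NodeList records = doc.getElementsByTagName("Record");
            List<Map<String, String>> rows = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Element rec = (Element) records.item(i);
                Map<String, String> row = new LinkedHashMap<>();
                row.put("@TableName", rec.getAttribute("TableName"));
                for (Node n = rec.getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (n.getNodeType() == Node.ELEMENT_NODE) {
                        // getTextContent also returns the text held in CDATA sections.
                        row.put(n.getNodeName(), n.getTextContent());
                    }
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse data file", e);
        }
    }
}
```

Loading a whole file into a DOM tree is fine for small samples; a large extract would more likely be read with a streaming parser.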
Sample XML Meta-Data
<target name="admits" source="patient"
driver="org.gjt.mm.mysql.Driver" url="jdbc:mysql://localhost:3306/test"
username="test" password="">
<column name="id" type="number" key="yes" source="ID"></column>
<column name="facility" type="number" key="yes"
source="Facility"></column>
<column name="age" type="number" source="Age"></column>
<column name="gender" type="number" source="Sex"></column>
<column name="ethnicity" type="number" source="Ethnicity"></column>
<column name="race" type="number" source="Race"></column>
<column name="length_of_stay" type="number"
source="Length_Of_Stay"></column>
<column name="day_admitted" type="number" source="day_admitted" >
</column>
Thank You