Big Data – Extract-Transform-Load (ETL) 002

Extract-Transform-Load (ETL) Technologies – Part 2
Sunday, June 09, 2013
By: Dale T. Anderson
Principal Consultant
DB Best Technologies, LLC
My last blog (Extract-Transform-Load (ETL) Technologies – Part 1) discussed
the purpose, function, and some of the inherent benefits of ETL technologies
for moving data from source applications into target reporting and analytic
environments. Hopefully that discussion gave some general understanding of ETL:
what it is, how it can be used, and why we would want to use it. Now let's dive
a bit deeper and discover who some of the key vendors are in the ETL tools
marketplace.
ETL tools have been around for some time, have evolved, matured, and now
present us with productive environments for Big Data, Data Warehouse, Business Intelligence, and
analytics processing. The ETL data processing cycle itself can be simple or highly complex. Most vendor
offerings address these diverse requirements well enough, although some do better than others. The
challenges with ETL tools come from the other important considerations, like:
• Functionality
• Scalability
• Performance
• Programmability
• Parallel Processing
• Data Virtualization
• Rerunability
• Recoverability
• Extensibility
• Handling Keys (Primary/Foreign)
• Attribute Mapping
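To make a couple of these considerations concrete, here is a minimal sketch (my own illustrative Python, not any vendor's generated code; the table and field names are invented) of a rerunnable extract-transform-load step. "Rerunability" here means the load is idempotent: running the job twice leaves the target in the same state, which also helps recoverability after a mid-run failure.

```python
import sqlite3

def extract(rows):
    # In a real job this would read from a source system; here it is a stub.
    return rows

def transform(rows):
    # Simple attribute mapping: pick fields and normalize the name casing.
    return [(r["id"], r["name"].strip().title()) for r in rows]

def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)"
    )
    # INSERT OR REPLACE keyed on the primary key makes the load idempotent,
    # so the whole job can be rerun safely after a failure.
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", rows
    )
    conn.commit()

def run_job(conn, source_rows):
    load(conn, transform(extract(source_rows)))

conn = sqlite3.connect(":memory:")
source = [{"id": 1, "name": "  alice  "}, {"id": 2, "name": "BOB"}]
run_job(conn, source)
run_job(conn, source)  # rerunning does not duplicate or corrupt rows
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # → 2
```

Every tool below gives you this kind of behavior through configuration rather than hand-written code; the point of the sketch is only to show what the checkboxes on the list actually mean.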
While the first three on this list are essential, personally I think that programmability could easily be at
the top of the list. Obviously my developer bias is showing! d;) Since all of the vendors below offer a
graphical user interface for creating ETL code, they pass my first test for programmability. It is very nice
to finally paint the code with pictures rather than typing out syntax. Having a GUI design tool for ETL
programming beats PHP, Perl, or SQL scripting any day of the week and twice on Sunday.
There are, however, some important vendor differences under the hood that one must consider. Some
ETL tools generate actual compiled code (like Java) while others generate intermediate code that must be
processed further. This intermediate code is proprietary and requires interpretation by the ETL engine,
impacting execution performance to some degree. While this can be acceptable, I prefer to work
with tools that generate code any skilled programmer can read and understand.
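The distinction can be illustrated with a toy example (my own sketch in Python; no vendor actually works this way, and the mapping spec is invented). One approach interprets a mapping spec at runtime, the way a proprietary ETL engine does; the other generates ordinary, readable source code from the same spec:

```python
spec = {"full_name": "name", "years": "age"}  # target_field: source_field

def run_interpreted(spec, record):
    # The "engine" walks the spec for every record: flexible, but opaque,
    # and the mapping is re-dispatched on each row.
    return {target: record[source] for target, source in spec.items()}

def generate_code(spec):
    # Emit a plain Python function that any programmer can read and debug.
    lines = ["def mapped(record):", "    return {"]
    for target, source in spec.items():
        lines.append(f"        {target!r}: record[{source!r}],")
    lines.append("    }")
    return "\n".join(lines)

source_code = generate_code(spec)
namespace = {}
exec(source_code, namespace)  # "compile" the generated code once
record = {"name": "Ada", "age": 36}
assert run_interpreted(spec, record) == namespace["mapped"](record)
print(source_code)
```

Both produce the same result; the difference is that the generated function is something you can open, read, step through, and version-control, which is exactly the property I value in a code-generating tool.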
• Informatica (www.informatica.com)
A comprehensive integration platform that
promotes code standardization, unifies
collaboration between business and IT roles,
and provides capabilities that handle the high volume and wide variety of
today’s business data. The Informatica Platform has eight distinct technologies
designed to be a true industrial strength ETL solution. These include:
o Messaging
o Complex Event Processing
o B2B Data Exchange
o Cloud Data Integration
o Enterprise Data Integration
o Application Lifecycle Management
o Data Quality
o Master Data Management
Informatica PowerCenter is the ETL tool that empowers an IT organization to
implement highly scalable, high-performance data processing maps using a
graphical interface that generates proprietary intermediate code. This mapping
code, when coupled with a defined data workflow plan, can then be scheduled
for a variety of execution strategies.
Widely accepted as an industry front runner, Informatica boasts high
productivity and low cost. In truth, this may be just the opposite. Perhaps high
productivity at a cost is more accurate; in terms of experienced developers,
administrators, and license fees. Companies that choose to use Informatica
usually have very large IT teams and budgets.
The Informatica Cloud Integration version, however, does offer easy-to-use ETL
programmability without the need for an extensive team or for hosting an IT
infrastructure. While very cost effective, its features are simply a subset of
PowerCenter's and may not provide all the required functionality. Yet, while
basic, it may be a very suitable alternative for many projects at a much lower cost.
• Microsoft SSIS (http://msdn.microsoft.com/en-us/library/ms141026.aspx)
SQL Server Integration Services, aka: SSIS is
an enterprise-level data integration and
transformation solution from Microsoft.
Widely used by customers to solve a variety of complex business problems, this
ETL tool does it all. As part of the SQL
Server evolution, SSIS has its roots in
the BCP (Bulk Copy Program) and DTS
(Data Transformation Services) which
have been part of MS SQL Server for
many years. SSIS works in direct
correlation with SSAS (SQL Server
Analytics Service) and SSRS (SQL
Server Reporting Service) but can be
used for standalone ETL processing.
Business Intelligence Development Studio (now SQL Server Data Tools) provides
the graphical interface for the creation of ETL 'packages' against MS SQL
Server or other data sources. These
packages are executed manually or can be scheduled. Fundamentally, under
the hood, SSIS is creating T-SQL queries coupled with the process logic defined
in the package. With the many robust components that are dropped into the
package the developer can easily craft sophisticated ETL processes.
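As a rough sketch of what a simple SSIS-style data flow reduces to, consider staging the source rows and then running one set-based SQL transform into the target. This is my own illustration, not SSIS output: SQLite stands in for SQL Server here, the real tool emits T-SQL, and the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_orders (order_id INTEGER, amount REAL, region TEXT);
    CREATE TABLE fact_orders (region TEXT PRIMARY KEY, total REAL);
""")

# The 'extract' step: bulk-load raw rows into a staging table.
conn.executemany(
    "INSERT INTO stage_orders VALUES (?, ?, ?)",
    [(1, 10.0, "east"), (2, 5.0, "west"), (3, 2.5, "east")],
)

# The 'transform/load' step: a set-based aggregate query, much like the
# SQL a package component would generate and run on the server.
conn.execute("""
    INSERT INTO fact_orders (region, total)
    SELECT region, SUM(amount) FROM stage_orders GROUP BY region
""")
conn.commit()

print(conn.execute(
    "SELECT region, total FROM fact_orders ORDER BY region"
).fetchall())  # → [('east', 12.5), ('west', 5.0)]
```

The value of SSIS is that the staging, the transform logic, and the error paths are drawn as components and managed as one deployable package rather than as loose scripts like this one.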
Microsoft’s overall Business Intelligence stack places SSIS on the front line for
data access and processing. As discussed in my previous blog moving data from
a source to a target can quickly become highly complex. Understanding the goal
of that data is critical to how the ETL process should be constructed. Mastering
this stack is something else altogether; perhaps another blog?
• Talend (www.talend.com)
Talend has recently surged as an open source
ETL tool that offers a wide variety of data
integration components and a streamlined
graphical user interface. Offering freely downloadable flavors of Talend
Open Studio for Big Data, Data Integration, and ESB, these fully functional
versions allow for the design, development, and deployment of real data
processing capabilities. The subscription-based Talend Integration Suite
presents the enhanced version, providing the additional enterprise functionality
and features you might expect for more comprehensive projects. These include:
o Collaboration (with integrated SVN)
o Automated Deployment (scheduler)
o Load Balancing (multiple job processing servers)
o Monitoring (administration console)
o Data Quality Profiling
o Master Data Management
o Big Data connectors (e.g., Hadoop, Cassandra, MongoDB)
o Context Variable Management (Global/Local)
As an Open Source technology, Talend boasts that its latest release is the
first and only from an ETL vendor to support 100% MapReduce functionality.
Additional support for Hadoop, Cassandra, MongoDB, Hive, and even Pig is
available. Expect more to come!
Talend's architecture demonstrates a practical approach to the complexities
involved in deployment and execution of the data integration process. Highly
scalable, this architecture is not only easy to understand but, through the
Talend Administration Center, straightforward to configure and maintain.
As an Eclipse-based editor, Talend Integration Studio provides a heterogeneous
application and data integration viewpoint for crafting ETL processes, unlike
other tools, which separate these two tightly interwoven functionalities.
Talend's code generation options for either Perl or Java eliminate proprietary
intermediate code and streamline job execution. Talend also boasts more than
400 connectors, providing virtually unlimited connectivity to enterprise systems,
Big Data, databases, software packages, mainframes, files, web services, and the
like. The tMap component delivers highly complex transformation capabilities,
allowing for multiple data source integration, lookups, error handling, and a
wide variety of data processing options. I have been using Talend for over three
years now and admittedly find it the most productive ETL tool, at a more
competitive price.
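For readers who have not seen it, the essence of a tMap-style component can be hand-rolled in a few lines: join a main flow against a lookup, map attributes, and route unmatched rows to a reject flow. This is illustrative Python of my own, not Talend-generated code, and the field names are invented.

```python
lookup = {"US": "United States", "FR": "France"}  # country code -> name

def tmap_like(rows, lookup):
    output, rejects = [], []
    for row in rows:
        country = lookup.get(row["country_code"])
        if country is None:
            # Route the failed row, with an error reason, to the reject flow.
            rejects.append({**row, "error": "unmatched lookup key"})
            continue
        # Attribute mapping: rename/derive the target fields.
        output.append({"id": row["id"], "country": country})
    return output, rejects

rows = [
    {"id": 1, "country_code": "US"},
    {"id": 2, "country_code": "XX"},  # no match -> reject flow
]
good, bad = tmap_like(rows, lookup)
print(len(good), len(bad))  # → 1 1
```

In the Studio, the same lookup, mapping, and reject wiring is drawn in the tMap editor rather than coded, and it scales to multiple inputs and far richer expressions.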
Actual product offerings from Talend are packaged in segments designed to
focus on project needs. These include:
o Big Data
o Data Integration
o Business Process Management
o Data Quality
o Enterprise Service Bus
o Master Data Management
These open source products are available as free, unsupported downloads
(http://www.talend.com/products/talend-open-studio) and offer considerable
features and functionality for developing working ETL processes.
With Talend 'On Demand', Talend is one of the first ETL vendors to offer an
Open Source data integration solution in the cloud (the Informatica Cloud
offering is not Open Source). This SaaS model is free for up to two users and
presents a great starting point for smaller projects or prototyping.
• Pentaho (www.pentaho.com)
Coupling data integration with embedded
analytics, Pentaho sets itself apart from
most of its competitors. This platform offers
business users an interactive and visualized approach that can be deployed in
the cloud or on-premise. Focused on the business analytics aspects of data
integration, Pentaho can manage very large volumes and a wide variety of data.
Visualization, data mining, reporting, and interactive dashboards simplify the
development process, providing a rapid deployment platform for a broad set
of users.
Pentaho Data Integration (PDI) can extract data from complex and diverse data
sources offering high performance ETL capabilities, including:
o Rich graphical visual design environment
o Broad connectivity to any type or source of data
o Enterprise scalability and performance
o High performance multi-threaded processing
o Data Profiling and Quality
o Open and Standards-based Architecture
o Big Data integration for Hadoop, NoSQL, and Analytic databases
o Integrated reporting, data visualization/analysis, and predictive analytics
o Salesforce and SugarCRM plug-ins
ETL developers will appreciate the PDI designer which offers an intuitive, drag
and drop GUI, a robust library of pre-built components, data transformation
mappings, content versioning and locking, and an integrated debugger. With
centralized monitoring, scheduling, and security, PDI delivers a reliable
solution for data migration and synchronization, application consolidation,
business intelligence, and data warehousing.
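The multi-threaded processing called out above follows a row-streaming model: each step in a transformation runs concurrently and passes rows downstream through a buffer. Here is a tiny sketch of that idea in plain Python (my own illustration of the general pattern, not Kettle code; the step names are invented):

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream between steps

def step_extract(out_q):
    # First step: produce rows and push them downstream as they are ready.
    for n in range(5):
        out_q.put(n)
    out_q.put(SENTINEL)

def step_transform(in_q, out_q):
    # Middle step: runs in its own thread, consuming rows while the
    # extract step is still producing them.
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)
    out_q.put(SENTINEL)

def run_pipeline():
    # Small, bounded buffers between steps provide backpressure.
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    threads = [
        threading.Thread(target=step_extract, args=(q1,)),
        threading.Thread(target=step_transform, args=(q1, q2)),
    ]
    for t in threads:
        t.start()
    results = []
    while (row := q2.get()) is not SENTINEL:
        results.append(row)
    for t in threads:
        t.join()
    return results

print(run_pipeline())  # → [0, 10, 20, 30, 40]
```

Because the steps overlap in time instead of running one after another, long pipelines keep all stages busy, which is where the "high performance" claim comes from.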
The PDI Community Edition (PDI CE), also known as 'Kettle', is available as a free
download (http://kettle.pentaho.com). This self-supported open source
software provides many of the key features and functionality found in the
Enterprise Edition (PDI EE). Along with Kettle you can find 'Mondrian', the
Pentaho Analysis Services Community Edition, which, as an OLAP server, claims
to enable business users to analyze large amounts of data in real time. I am
unable to dispute or confirm this claim; however, in my experience, achieving
even near-real-time is difficult, making it (IMHO) somewhat dubious.
To wrap up, it is fair to point out that there are several other ETL contenders in today's marketplace. For
practical reasons this blog does not dive into them further, but let's at least give some of them a mention.
They include:
• Astera (www.astera.com)
A relative newcomer to the crowded ETL
marketplace, Astera's Centerprise Data
Integrator is an easy-to-use, easy-to-manage,
enterprise-level integration tool for everyone. As a 100% managed
.NET code product, Astera provides a reliable, scalable, and secure platform for
flexible, reusable, high-performance processing; at a cost. However, it is
perhaps the only ETL tool that supports EDI transactions.
• Clover (www.cloveretl.com)
Clover’s slick graphical designer helps visualize data flows while
offering a wide variety of flexible components for complex data
integration requirements. As an open-source software product,
their Community Edition gets you started right away, yet its
proprietary intermediate process layer may not provide the
most robust development environment.
• Adeptia (www.adeptia.com)
Offering a wizard-driven, “configure, not
code” ETL approach, Adeptia eliminates
complexity thus reducing data integration
costs while improving deployment and ROI
for business users. Adeptia's low-cost commercial product provides many
features; however, it is not open-source software, and thus only trial-period
evaluations are available.
So there you have it. More than you might ever want
to know about ETL and Data Integration vendors. Yet, I
think ETL data processing and all its complexities cannot
be ignored. My hope is that this information, while
incomplete at best, and perhaps slightly inaccurate
in its technical reflection at worst, still edifies you,
the gentle reader, in a constructive way.
I look forward to any comment, question, or debate…
ETL ON ~ Dudes!