SQOOP HCatalog Integration
Venkat Ranganathan
Sqoop Meetup
10/28/13
Agenda
• HCatalog Overview
• Sqoop HCatalog Integration Goals
• Features
• Demo
• Benefits
HCatalog Overview
• Table and storage management service for Hadoop
  – Enables Pig/MR and Hive to more easily share data on the grid
• Uses the Hive metastore
• Abstracts the location and format of the data
• Supports reading and writing files in any format for which a Hive SerDe is available
• Now part of Hive
Sqoop HCatalog Integration Goals
• Support HCatalog features consistent with Sqoop usage
  – Support both imports into and exports from HCatalog tables
  – Enable Sqoop to read and write data in various formats
  – Automatic table schema mapping
  – Data fidelity
  – Support for static and dynamic partition keys
Support for imports and exports
• Allows an HCatalog table to be either the source or the destination of a Sqoop job.
• For an HCatalog import, --target-dir and --warehouse-dir are replaced by the HCatalog table name.
• Similarly, for an export the export directory is replaced by the HCatalog table name (see the sketch below).
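A minimal sketch of what this looks like on the command line, assuming a hypothetical JDBC source (db.example.com/corp), source table EMPLOYEES, and HCatalog table emps; on import, --hcatalog-table takes the place of --target-dir/--warehouse-dir, and on export it takes the place of --export-dir:

  # Import a database table into an HCatalog table
  sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --username hr -P \
    --hcatalog-database default --hcatalog-table emps

  # Export the same HCatalog table back out to a database table
  sqoop export --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES_COPY --username hr -P \
    --hcatalog-database default --hcatalog-table emps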
File format support
• HCatalog integration in Sqoop enables Sqoop to
  – Import/export files in any format for which a Hive SerDe exists
  – Text files, SequenceFiles, RCFiles, ORCFile, …
  – This makes Sqoop agnostic of the file format used, which can change over time based on new innovations/needs.
Automatic table schema mapping
• Sqoop allows a Hive table to be created based on the enterprise data store's schema.
• This is enabled for HCatalog table imports as well.
• Automatic mapping, with optional user overrides.
• Ability to provide storage options for the newly created table (see the sketch below).
• All HCatalog primitive types are supported.
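A sketch of an import that creates the HCatalog table as part of the job; the table and column names are hypothetical, the storage stanza is passed through to the generated table definition, and --map-column-hive shows an optional user override of the automatic type mapping:

  # Create the HCatalog table on the fly, stored as RCFile,
  # and override the automatic type mapping for one column.
  sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --username hr -P \
    --hcatalog-table emps --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as rcfile" \
    --map-column-hive SALARY=string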
Data fidelity
• With text-based imports (as with the Sqoop hive-import option), the text values have to be massaged so that delimiters are not misinterpreted.
• Sqoop provides two options to handle this:
  --hive-delims-replacement
  --hive-drop-import-delims
• This is error prone, and the data is modified as it is stored in Hive.
Data fidelity
• With HCatalog table imports into file formats like RCFile, ORCFile, etc., there is no need to strip these delimiters from column values.
• Data is preserved without any massaging.
• If the target HCatalog table's file format is text, the two options can still be used as before (see the sketch below):
  --hive-delims-replacement
  --hive-drop-import-delims
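As an illustration only (the connect string and table names are hypothetical), an import into a text-backed HCatalog table might still strip Hive delimiters from string columns:

  # Text-backed HCatalog table: delimiter handling is still needed.
  sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --username hr -P \
    --hcatalog-table emps_text \
    --hive-drop-import-delims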
Support for static and dynamic partitioning
• HCatalog table partition keys can be dynamic or static.
• Static partition keys have their values provided as part of the DML (known at query compile time).
• Dynamic partition keys have their values provided at execution time.
  – Based on the value of a column being imported
Support for static and dynamic partitioning
• Both types of tables are supported during import.
• Multiple partition keys per table are supported.
• Only one static partition key can be specified (a Sqoop restriction); see the sketch below.
• Only tables with a single partition key can be created automatically.
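A minimal sketch, assuming a hypothetical HCatalog table emps_part partitioned by country (static) and state (dynamic); with the 1.4.4 options the static partition key and its value are given with --hive-partition-key/--hive-partition-value, while the remaining partition columns are filled from the imported data:

  # Static partition key "country" with a fixed value; other partition
  # columns of the HCatalog table are filled dynamically from the
  # imported column values.
  sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --username hr -P \
    --hcatalog-table emps_part \
    --hive-partition-key country \
    --hive-partition-value USA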
Benefits
• Future-proof your Sqoop jobs by making them agnostic of the file formats used
• Remove additional steps before taking data to the target table format
• Preserve data contents
Availability & Documentation
• Part of the Sqoop 1.4.4 release
• A chapter devoted to HCatalog integration in the User Guide
• URL: https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_sqoop_hcatalog_integration
DEMO
Questions?
© Hortonworks Inc. 2013