
Loom User Guide
Michael Lang
September 2014
Contents
Overview
Features and Capabilities
Basic Use Cases
    Source Cataloging and Profiling with Activescan
    Data Preparation with Weaver and Hive
    Batch Data Processing of Log Data with Activescan
Advanced Use Cases
    Data Governance
    Data Lineage
    ETL Management
Appendix A: Loom Concepts and Model
    Sources
    Datasets
    Transforms and Jobs
    Activescan
Overview
Loom provides core capabilities needed for enterprises to successfully deploy a Hadoop data lake. Loom
makes the first phase of the analytic workflow more efficient, enabling analysts to quickly find,
understand, and prepare data in a Hadoop data lake, which allows them to spend more time developing
analytics. Ultimately, this means more business insights are developed faster, which is the key
driver of ROI for the data lake.
The primary purpose of Loom is to serve as a “workbench” for an analyst working in the Hadoop data
lake, helping them to:
• Find Data – Search/browse for data in the data lake through Loom’s Workbench
• Explore Data – View data previews and navigate between related datasets
• Understand Data – In addition to data previews, Loom provides the user with valuable metadata that gives extra context to the data and helps a user understand its true meaning, including statistics, business metadata, and lineage
• Prepare Data – Execute transformations to convert data from its original form into the form required for analysis, including cleaning tables and merging tables
Loom includes an automation framework, called Activescan, to assist with many of the underlying data
management tasks necessary to support the capabilities above, including cataloging and profiling data.
Features and Capabilities
Loom’s features are categorized into three high-level buckets:
• Metadata management – Enables users to create and manage information about the data in a Hadoop cluster. Enables users to view samples of data.
• Automation (Activescan) – Automates the creation and management of metadata as much as possible. Automates some aspects of data processing.
• Data Preparation (Weaver, Hive) – Enables users to execute transformations in Hadoop needed to prepare data for analysis. Weaver provides a simple, interactive interface for defining transformations.
Metadata is a key underpinning of much of the value delivered by Loom. It is important in helping users
find and understand available data. Aspects of metadata that Loom supports include:
• Technical metadata – information about the physical characteristics of some data, including the location, structure, format, and schema. Technical metadata also includes statistics about the data (e.g. row count, # of nulls) and lineage for the data.
• Business metadata – user-generated metadata that describes non-technical aspects of the data, including names, descriptions, tags, and custom properties. Custom properties can be configured by users. Business glossaries allow business terminology to be formally managed independent of any particular set of data. Business glossary terms can be mapped to technical metadata (e.g. table names).
The Loom Workbench provides a simple interface to browse metadata, explore data, and navigate
relationships between entities (e.g. lineage relationships). The Workbench also has a search interface
that supports keyword search over most metadata fields.
Loom’s Activescan framework automates many common data and metadata management tasks,
including:
• Source Cataloging – Loom scans HDFS and Hive and catalogs all new data in supported formats
• Source Profiling – Loom automatically detects characteristics of a source, including format, structure, and schema
• Lineage Detection – Loom detects the lineage of sources that are (1) imported/exported with Sqoop or TDCH, (2) transformed with Hive, or (3) transformed with Loom
• Data Profiling – Loom generates descriptive statistics about tables and columns
• Data Processing – Loom can automatically apply structure to new files and transform them through Weaver. Alternately, Loom can “call out” to external code to process new files.
Finally, Loom enables users to execute the transformations required to prepare their data for analysis
using:
• Weaver – Loom’s interface for “data wrangling”, Weaver enables users to modify and combine columns to produce new columns (or overwrite existing ones). Users can combine these operations with filters to transform only subsets of the rows in a table (or delete rows altogether). Users can also make schema-level modifications (add/delete/rename columns, convert datatypes).
• HiveQL – users can also execute Hive queries from Loom, providing all the transformation capabilities of Hive (e.g. joins, unions, aggregations, UDFs); a brief sketch follows this list
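As a flavor of what the HiveQL path provides, here is a minimal sketch of a union plus aggregation of the kind a user might run from Loom. The table and column names (web_events, mobile_events, user_id) are hypothetical, not part of Loom.

    -- Hypothetical tables; combine two event tables and count events per user
    CREATE TABLE events_per_user AS
    SELECT user_id, COUNT(*) AS event_count
    FROM (
      SELECT user_id FROM web_events
      UNION ALL
      SELECT user_id FROM mobile_events
    ) all_events
    GROUP BY user_id;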
Basic Use Cases
This section summarizes some basic use cases that can be simply developed and demonstrated with
Loom:
• Source Cataloging and Profiling with Activescan
• Data Preparation with Weaver and Hive
• Batch Data Processing of Log Data with Activescan
Each of these use cases can be viewed as stand-alone, but they also build on each other. Activescan
catalogs data in the cluster. Users can execute transformations against data that has been cataloged,
in order to prepare it for analysis. Users can automate the execution of their transformations for data
that will be updated on a regular basis.
In addition to these three high-level use cases, there is a section below which discusses some advanced
use cases:
• Data Governance
• Data Lineage
• ETL Management
These use cases require more advanced configuration both of Loom and Hadoop.
Source Cataloging and Profiling with Activescan
This use case provides a demonstration of how Activescan can catalog and profile data in HDFS and Hive.
Multiple files are loaded into HDFS. Activescan scans HDFS, discovers the files, profiles the files (for
structural metadata), and registers them in Loom. Users can then use data previews and generated
metadata to explore the available data. Users can also “activate” sources to augment the metadata
generated by Activescan with their own descriptive metadata. The next use case shows how
cataloged data can then be prepared for analysis.
Scenario:
Many files are being loaded into a Hadoop cluster, which is serving as a “data lake” – a large scale
platform for data storage, refinement, and analysis. The specific content of the files or the source of the
files is not important for this use case (see ‘Advanced Use Cases’ below for more discussion about data
lineage). Cluster administrators want to provide a simplified way for users to explore data in the cluster,
giving them “self-service” access to the data lake. Activescan is used to automatically catalog and profile
the data. Users can then browse and explore data through the Loom Workbench.
Loom Flow:
0. Set up data – load files into HDFS in the following formats: log, CSV, Avro, Parquet. Create a
Hive database with at least one Table (a minimal sketch of the Hive portion follows this list).
1. Configure Activescan – turn the source scanner on, configure the target directory, and start the
server (restart it if it is already running)
2. Browse Sources – log into the Loom Workbench and browse the available sources
3. OPTIONAL: Activate Sources and Add Metadata – users can activate sources cataloged by
Activescan and add their own metadata, including basic descriptive information, custom
properties (configurable by an administrator), and business glossaries.
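For the Hive portion of step 0, any simple database and table will do; the sketch below is one minimal possibility. The names (demo, events) and the schema are hypothetical.

    -- Hypothetical database and table to give Activescan a Hive Source to catalog
    CREATE DATABASE demo;
    CREATE TABLE demo.events (
      event_time STRING,
      event_type STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';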
Extensions to this use case:
• See the “data preparation” and “batch data processing” use cases below as examples
Limitations to this use case:
• Activescan will only discover files in supported formats – log, CSV, Avro, Parquet
• Activescan default configurations may need to be changed for files to be cataloged and profiled correctly. See the ‘docs’ folder in the Loom distribution for more details on how to configure Activescan.
Data Preparation with Weaver and Hive
This use case provides a demonstration focused on the data preparation features of Loom. Once data is
registered with Loom, either by Activescan, through the API, or through the Workbench (manually), a
user can apply structure to the data, and then execute transformations against the data to prepare it for
analysis. This use case builds upon the capabilities described above. After a user can successfully
explore and describe data in the cluster, the next natural step is to execute their own transformations to
clean, combine, aggregate, and otherwise prepare their data for their own particular purpose. Loom
provides two tools for this purpose: Weaver and Hive. Weaver lets a user quickly and iteratively
“wrangle” an individual table. Hive lets a user execute SQL-style transforms – joins, unions,
aggregations, UDFs, etc.
Scenario:
A pair of files have been loaded into HDFS. For the purposes of this use case, it does not matter how the
files came to be in HDFS (discussed further below). One file contains general data about books, such as
ISBN, title, author, and publisher (we refer to this file as the ‘books’ data). The second file contains user-submitted ratings for the same books (we refer to this file as the ‘ratings’ data). An analyst wants to
determine whether the words used in the title of a book have predictive power for its rating. That is,
given the title of a book, can you successfully predict what rating it will get? Obviously, as a publisher or
author, this would be valuable information, as you would seek to optimize your title to generate higher
ratings. In order to do this analysis, the data must be prepared. The books file must be cleaned, and
then merged with the ratings data. The data can then be analyzed from a variety of tools, including, but
not limited to: R, Python, Hive, Pig, MapReduce, Tableau, and Excel.
Loom flow:
1. Each file is registered as a Source – the data can be previewed and basic technical metadata can
be seen in Loom. Users can augment this with their own descriptive metadata, if desired.
2. A Dataset is created for each Source – this puts a schema onto the data and makes it available
for transformation and profiling
3. The books data is cleaned with Weaver – the title column is the main focus. We need to remove
any special characters, normalize possessive nouns (i.e. remove all “ ‘s “), convert everything to
lowercase, remove anything between a pair of parentheses (this often references the publisher
or series name), and trim any extra whitespace. We also need to delete any non-English books,
as these would complicate our analysis without adding value.
4. The books data is further prepared with HiveQL and merged with the ratings data – using Hive,
we aggregate all the ratings into a single average rating for each book, join the books and ratings
tables, extract the individual words from each title (using a Hive built-in function), split the
joined table into one table of ‘good’ books and one table of ‘bad’ books (based on the average
rating), and finally produce, for each group, a count of how many times each individual word
appears in book titles. (A HiveQL sketch of steps 3 and 4 follows this list.)
5. OPTIONAL: Export the final tables from Hive to a relational database using Sqoop or TDCH. This
demonstrates that Loom can automatically track the lineage of this operation, for the case
where the analyst plans to use a relational platform for the final analysis.
6. OPTIONAL: Import the data from Hive into R, for the case where the analyst plans to use R for
the final analysis.
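The sketch below expresses the gist of steps 3 and 4 in plain HiveQL. Step 3 would normally be performed interactively in Weaver; it is rendered here as equivalent Hive expressions purely for illustration. All table and column names (books, ratings, isbn, title, rating) and the 4.0 rating threshold are hypothetical, and since Loom executes single-statement queries, each statement would run as its own transform.

    -- Step 3 equivalent: clean the title column (Weaver would do this interactively)
    CREATE TABLE books_clean AS
    SELECT
      isbn,
      trim(regexp_replace(regexp_replace(regexp_replace(lower(title),
             "\\(.*?\\)", ""),      -- drop anything between parentheses
             "'s", ""),             -- normalize possessive nouns
             "[^a-z0-9 ]", ""))     -- strip remaining special characters
        AS title
    FROM books;

    -- Step 4: aggregate all ratings into one average rating per book
    CREATE TABLE avg_ratings AS
    SELECT isbn, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY isbn;

    -- Step 4 (continued): join books to their average ratings
    CREATE TABLE rated_books AS
    SELECT b.isbn, b.title, r.avg_rating
    FROM books_clean b JOIN avg_ratings r ON b.isbn = r.isbn;

    -- Step 4 (continued): count title words for 'good' books;
    -- the 'bad' table is the same query with avg_rating < 4.0
    CREATE TABLE good_title_words AS
    SELECT word, COUNT(*) AS occurrences
    FROM rated_books
    LATERAL VIEW explode(split(title, ' ')) w AS word
    WHERE avg_rating >= 4.0
    GROUP BY word;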
Extensions to this use case:
• Business metadata – For each entity involved in the workflow outlined above, add descriptions, tags, folders, and custom properties (either using default custom properties or by configuring new custom properties)
  o Business Glossary – Create a business glossary which defines terminology relevant to the books domain and create mappings into some or all of the tables involved in the workflow
• Statistics – For each table in a Dataset, generate statistics (either manually trigger the statistics scan through the Workbench or configure Activescan to do it automatically)
• Lineage – This is generated automatically along the way for all transforms executed through Loom and for any Sqoop/TDCH job run. For other lineage (e.g. the original source of the books data outside of Hadoop), the API can be used to create the necessary entities. See ‘Advanced Use Cases’ below for a discussion of data lineage.
Limitations within this use case:
• Loom can only read data from HDFS in the following formats – CSV, log (regex), Avro, Parquet
• Loom cannot execute transformations other than those described above (HiveQL, Weaver)
• Loom cannot automatically track lineage for transformations executed outside of Loom, except for Sqoop jobs, TDCH jobs, or Hive queries (new in Loom 2.3).
• A “sandbox” cluster (i.e. a single-node virtual machine) may struggle to execute multiple jobs at the same time (transformations or statistics). Queues can be used to address this.
Batch Data Processing of Log Data with Activescan
This use case provides a demonstration of how Loom can be configured to automatically move new files
to Hive and apply a Weaver script. This is a combination of Loom’s data preparation features (namely,
Weaver) and Loom’s Activescan framework. This is a natural extension of the ‘data preparation’ use
case described above. Once a Weaver transform is defined to clean a specific table, a user will want to
apply the same transform to new files that are loaded into the cluster.
Scenario:
A new directory is created in HDFS and a single file loaded, with the intention of loading new files on a
scheduled basis (e.g. once a day). Each file represents a set of records in the same logical table. A user
has defined a Weaver script for cleaning the data, using the original file. Activescan is configured to
automatically execute this Weaver script for every new file that is loaded.
Loom flow:
1. HDFS directory and original file are discovered by Activescan and registered as a Source and
Table, respectively
2. A user creates a Dataset from the Source
3. A user defines a Weaver script to clean the Table
4. A user defines a ‘directives’ file for the Dataset which references the Weaver script
5. New files are loaded into the directory – Activescan automatically detects each file, registers it
as a new Table in the existing Source, then creates a corresponding Table in the existing Dataset,
and then executes the Weaver script. Lineage is tracked for all operations.
Extensions to this use case:
• Everything described above in the ‘data preparation’ use case
• Activescan can trigger “external” processes for new files, as an alternative to Weaver transforms. An “external” process can be any executable code (e.g. a bash script or Oozie script)
Limitations within this use case:
• Activescan cannot currently trigger HiveQL transformations with Loom
• The default format for log files is the Apache Combined Log format. Other formats can be supported by editing the Source profile in Activescan. See the ‘docs’ folder in the Loom distribution for more details on how to configure Source profiles for Activescan.
Advanced Use Cases
The use cases above describe the most straightforward use of Loom to provide a Workbench for
analysts working in Hadoop to explore and transform data, augmented by Activescan’s ability to
automate tasks. Loom can also provide the broader capabilities described in the sections below.
Data Governance
There are many aspects to data governance, but metadata underpins all of them. Loom provides a
catalog of data in a Hadoop cluster, with rich metadata. Loom does not allow users to define and
enforce specific policies about data, but it does provide the ability to define custom properties that
can be used to track information important to data governance. Users can also create business
glossaries, which are important for aligning semantics and organizing data in a consistent way.
Data Lineage
Loom provides the core capability to track metadata about data lineage. Loom automatically tracks
lineage for any transformations executed through Loom and can also automatically track operations
outside of Loom (Sqoop, TDCH, Hive). Other types of operations (e.g. MapReduce, Pig) may be
tracked automatically in the future; in the meantime, Loom provides an API which can be used to
record their lineage. If data lineage is fundamentally important, then some choices will need to be
made: users will need to be limited to certain types of operations through certain tools, and those
tools will need to be instrumented to record lineage with Loom. It is possible to ensure that
everything is tracked automatically in Loom, but this does not come for free unless operations are
limited entirely to Loom, Sqoop, TDCH, and Hive.
ETL Management
Loom has a basic capability to execute “data pipelines” when new files are discovered by Activescan.
This is very limited when compared to full-blown ETL tools, but does provide a basic level of
automated data processing. Loom also has the ability to “call out” to external tools, which can
provide more powerful transformation and workflow capabilities. The Loom API allows users to
track metadata about anything going on in Hadoop. For users that are managing ETL in Hadoop
with a workflow engine that does not include its own metadata repository (e.g. Oozie), Loom can
fill this gap; a user needs to proactively integrate API calls to Loom into their data processing
workflows.
Appendix A: Loom Concepts and Model
There are four core entities in Loom: Sources, Datasets, Transforms (Processes) and Jobs.
Sources represent sets of data that are outside Loom's control. That is, their lifecycles are managed
external to Loom -- creation, modification, and deletion. Although Loom does not control or manage
these external sources, it allows you to register them, and so manage their metadata, and for some
sources, access their data. Loom also discovers potential sources by scanning your Hadoop file system;
potential sources can be reviewed and activated as sources.
Datasets represent sets of data that are controlled by Loom. They are used for data pre-processing and
analysis. Datasets can be created from sources initially; thereafter datasets are output from transforms.
Sources and Datasets are both 'containers' of sets of related data (e.g., tables). In that way, they are
both analogous to databases, and similar to each other. They are distinguished in Loom due to the
different semantics and activities associated with each, based on their ownership and how much Loom
knows about (and controls) each.
Processes define data processing to be performed on datasets. Loom supports the execution of HiveQL
and Weaver transforms. Processes can be executed by providing the input and output data contexts. A
data context is a (container ID, table name) pair, where a container can be either a source or a dataset.
Hive queries can have multiple inputs, or no inputs; there is always one output context. Weaver
transforms always have one input and one output context.
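For example, running the (hypothetical) join below as a Loom Hive transform involves two input contexts and one output context; the container IDs and table names are illustrative only.

    -- Input contexts:  (dataset1, books_clean) and (dataset1, avg_ratings)
    -- Output context:  (dataset1, rated_books)
    CREATE TABLE rated_books AS
    SELECT b.isbn, b.title, r.avg_rating
    FROM books_clean b JOIN avg_ratings r ON b.isbn = r.isbn;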
Lineage is computed from executed (or ‘used’) processes, called ‘process uses’ because not all
processes have to be formally executed in order to relate an input to an output. (The API can be
used to define ‘generic’ processes and process uses that relate sources and datasets to each other.)
When a transform is executed, a job entity is created in Loom to track and record the
execution progress and statistics.
Sources
Sources are registered through the Workbench, through the API, or by Activescan:
• Activescan continuously scans HDFS and registers any new files/directories
• Sources can be:
  o Created through the Workbench: HDFS files or directories, Hive databases, or relational databases
  o Created by Activescan: HDFS files or directories, Hive databases, RDBMSs (when referenced in Sqoop or TDCH jobs)
  o Created through the API: anything you want
HDFS Sources
• A Source is mapped to a Directory
• Files in the Directory become Tables in the Source
• Supported formats: delimited text (e.g. CSV), log file (regex), Avro, Parquet
• Pluggable through a Java interface
Hive Sources
• A Source is mapped to a Hive Database
• Tables in the Database become Tables in the Source
RDBMS
• Descriptive only (i.e. Loom cannot connect to an RDBMS to read metadata or data); mainly for lineage tracking through Activescan and the API
Datasets
• Datasets are generated from Sources
• Datasets are always stored in Hive
  o Sometimes the Source data is copied into Hive, resulting in a duplicate of the original data stored in Hive’s default format (a specific delimited text format)
  o Other times the Source data is registered as an external table in Hive, which references the original data (a sketch of such a table follows this list)
• Datasets can be transformed through the Loom Workbench or API using Weaver or HiveQL.
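As an illustration of the external-table case, generic Hive DDL like the sketch below registers files in place rather than copying them. This is not Loom’s generated DDL; the schema and path are hypothetical.

    -- Hypothetical schema and location; the data stays where it is in HDFS
    CREATE EXTERNAL TABLE ratings_ext (
      isbn STRING,
      rating INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/ratings';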
Dataset metadata includes:
• Everything covered by Sources
• Statistics – generated by Activescan upon creation of a table or triggered manually through the Workbench
Transforms and Jobs
• Datasets can be transformed through Loom using Weaver or Hive
• Data import/export processes (Sqoop/TDCH) will be registered as Transforms when discovered by Activescan
• HiveQL queries detected by Activescan will be registered as Transforms
• Other types of transforms can be registered through the API
Weaver
• For working with individual tables
• Interactive editing based on samples – choose an operation, apply the operation, see the results
• Batch processing over full datasets
HiveQL
• Any single-statement HiveQL query – primarily geared towards using SELECT (converted to a CREATE TABLE AS under the covers; see the sketch after this list)
• Lineage is tracked automatically for Loom-initiated transforms
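A sketch of that conversion, with hypothetical table names: the first statement is what a user submits, and the second is roughly the form that executes.

    -- Submitted through Loom:
    SELECT author, COUNT(*) AS title_count FROM books GROUP BY author;

    -- Roughly the executed form (output table name supplied at execution):
    CREATE TABLE books_by_author AS
    SELECT author, COUNT(*) AS title_count FROM books GROUP BY author;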
Jobs
• Jobs are created to track the execution of a Transform
• Transforms are saved and can be reused or edited later
Activescan
Activescan supports the following types of “scans”:
• HDFS/Hive scan – for cataloging HDFS and Hive Sources
• Job scan – for detecting Sqoop and TDCH jobs, cataloging associated Sources, with lineage
• Statistics scan – for calculating statistics for individual tables
HDFS scan
• Scans HDFS at scheduled intervals looking for new files/directories
  o Identifies the type of each file using configured “scanners” based on extensions, names, structure of data, etc.
• Scans Hive at scheduled intervals looking for new databases/tables
• Registers discovered data as Sources (or new Tables within existing Sources)
• Executes processes based on “directives” files
Directives
• A .activescan file, which references some executable file (e.g. an Oozie script) or a Weaver transform, can be configured for a Source (Directory) or Dataset. Whenever Activescan finds a new table in the Source/Dataset, it will trigger the executable referenced in the .activescan file
Job scan
• Scans the MapReduce history server at scheduled intervals for new jobs
• Identifies any jobs executed by Sqoop or TDCH
• Registers the input and output as Sources, and the import/export as a process and process use (creating lineage)
Statistics scan
• Triggered whenever a new Table is added to a Dataset
• Can also be manually triggered through the Workbench