Loom User Guide
Michael Lang
September 2014

Contents
Overview
Features and Capabilities
Basic Use Cases
    Source Cataloging and Profiling with Activescan
    Data Preparation with Weaver and Hive
    Batch Data Processing of Log Data with Activescan
Advanced Use Cases
    Data Governance
    Data Lineage
    ETL Management
Appendix A: Loom Concepts and Model
    Sources
    Datasets
    Transforms and Jobs
    Activescan

Overview

Loom provides core capabilities needed for enterprises to successfully deploy a Hadoop data lake. Loom makes the first phase of the analytic workflow more efficient, enabling analysts to quickly find, understand, and prepare data in the data lake so they can spend more time developing analytics. The result is that more business insights are developed faster, which is the ultimate driver of ROI for the data lake.
The primary purpose of Loom is to serve as a "workbench" for an analyst working in the Hadoop data lake, helping them to:
- Find Data – Search and browse for data in the data lake through Loom's Workbench
- Explore Data – View data previews and navigate between related datasets
- Understand Data – In addition to data previews, Loom provides valuable metadata that gives extra context to the data and helps a user understand its true meaning, including statistics, business metadata, and lineage
- Prepare Data – Execute transformations to convert data from its original form into the form required for analysis, including cleaning tables and merging tables

Loom includes an automation framework, called Activescan, to assist with many of the underlying data management tasks necessary to support the capabilities above, including cataloging and profiling data.

Features and Capabilities

Loom's features fall into three high-level buckets:
- Metadata management – Enables users to create and manage information about the data in a Hadoop cluster, and to view samples of the data.
- Automation (Activescan) – Automates the creation and management of metadata as much as possible, along with some aspects of data processing.
- Data Preparation (Weaver, Hive) – Enables users to execute the transformations in Hadoop needed to prepare data for analysis. Weaver provides a simple, interactive interface for defining transformations.

Metadata is a key underpinning of much of the value delivered by Loom. It is important in helping users find and understand available data. Aspects of metadata that Loom supports include:
- Technical metadata – information about the physical characteristics of the data, including its location, structure, format, and schema. Technical metadata also includes statistics about the data (e.g. row count, number of nulls) and lineage for the data.
- Business metadata – user-generated metadata that describes non-technical aspects of the data, including names, descriptions, tags, and custom properties. Custom properties can be configured by users.
- Business glossaries – allow business terminology to be formally managed independent of any particular set of data. Business glossary terms can be mapped to technical metadata (e.g. table names).

The Loom Workbench provides a simple interface to browse metadata, explore data, and navigate relationships between entities (e.g. lineage relationships). The Workbench also has a search interface that supports keyword search over most metadata fields.

Loom's Activescan framework automates many common data and metadata management tasks, including:
- Source Cataloging – Loom scans HDFS and Hive and catalogs all new data in supported formats
- Source Profiling – Loom automatically detects characteristics of a source, including format, structure, and schema
- Lineage Detection – Loom detects the lineage of sources that are (1) imported/exported with Sqoop or TDCH, (2) transformed with Hive, or (3) transformed with Loom
- Data Profiling – Loom generates descriptive statistics about tables and columns
- Data Processing – Loom can automatically apply structure to new files and transform them through Weaver. Alternately, Loom can "call out" to external code to process new files.

Finally, Loom enables users to execute the transformations required to prepare their data for analysis using:
- Weaver – Loom's interface for "data wrangling". Weaver enables users to modify and combine columns to produce new columns (or overwrite existing ones). Users can combine these operations with filters to transform only subsets of the rows in a table (or delete rows altogether). Users can also make schema-level modifications (add/delete/rename columns, convert datatypes).
- HiveQL – users can also execute Hive queries from Loom, providing all the transformation capabilities of Hive (e.g. joins, unions, aggregations, UDFs)

Basic Use Cases

This section summarizes some basic use cases that can be simply developed and demonstrated with Loom:
- Source Cataloging and Profiling with Activescan
- Data Preparation with Weaver and Hive
- Batch Data Processing of Log Data with Activescan

Each of these use cases can be viewed as stand-alone, but they also build on each other. Activescan catalogs data in the cluster. Users can execute transformations against data that has been cataloged, in order to prepare it for analysis. Users can then automate the execution of their transformations for data that will be updated on a regular basis.

In addition to these three high-level use cases, a section below discusses some advanced use cases:
- Data Governance
- Data Lineage
- ETL Management

These use cases require more advanced configuration of both Loom and Hadoop.

Source Cataloging and Profiling with Activescan

This use case demonstrates how Activescan can catalog and profile data in HDFS and Hive. Multiple files are loaded into HDFS. Activescan scans HDFS, discovers the files, profiles them (for structural metadata), and registers them in Loom. Users can then use data previews and generated metadata to explore the available data. Users can also "activate" sources to augment the metadata generated by Activescan with their own descriptive metadata. The next use cases show how cataloged data can then be prepared and processed.

Scenario: Many files are being loaded into a Hadoop cluster, which is serving as a "data lake" – a large-scale platform for data storage, refinement, and analysis. The specific content of the files or the source of the files is not important for this use case (see 'Advanced Use Cases' below for more discussion about data lineage). Cluster administrators want to provide a simplified way for users to explore data in the cluster, giving them "self-service" access to the data lake. Activescan is used to automatically catalog and profile the data. Users can then browse and explore data through the Loom Workbench.

Loom Flow:
0. Set up data – Load files into HDFS in the following formats: log, CSV, Avro, Parquet. Create a Hive database with at least one Table. (A sketch of this setup step appears at the end of this use case.)
1. Configure Activescan – turn the source scanner on, configure the target directory, and start the server (restart it if it is already running)
2. Browse Sources – log into the Loom Workbench and browse the available sources
3. OPTIONAL: Activate Sources and Add Metadata – users can activate sources cataloged by Activescan and add their own metadata, including basic descriptive information, custom properties (configurable by an administrator), and business glossaries.

Extensions to this use case:
- See the "data preparation" and "batch data processing" use cases below as examples

Limitations to this use case:
- Activescan will only discover files in supported formats – log, CSV, Avro, Parquet
- Activescan default configurations may need to be changed for files to be cataloged and profiled correctly. See the 'docs' folder in the Loom distribution for more details on how to configure Activescan.
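As a concrete illustration of step 0, the commands below load a few sample files into HDFS and create a Hive database for Activescan to discover. This is a minimal sketch: the directory /data/landing, the file names, and the database and table names are hypothetical placeholders, and the HDFS directory should match the target directory configured for Activescan.

    # Hypothetical setup for this use case; paths, file names, and the database
    # and table names are placeholders, not values defined by Loom.
    hdfs dfs -mkdir -p /data/landing
    hdfs dfs -put books.csv ratings.csv access.log /data/landing/

    # Create a Hive database with at least one table for the Hive scanner to find.
    hive -e "
    CREATE DATABASE IF NOT EXISTS demo;
    CREATE TABLE IF NOT EXISTS demo.sample_books (
      isbn STRING, title STRING, author STRING, publisher STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    "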
Data Preparation with Weaver and Hive

This use case provides a demonstration focused on the data preparation features of Loom. Once data is registered with Loom – by Activescan, through the API, or manually through the Workbench – a user can apply structure to the data and then execute transformations against it to prepare it for analysis.

This use case builds upon the capabilities described above. After a user can successfully explore and describe data in the cluster, the next natural step is to execute their own transformations to clean, combine, aggregate, and otherwise prepare the data for their own particular purpose. Loom provides two tools for this purpose: Weaver and Hive. Weaver lets a user quickly and iteratively "wrangle" an individual table. Hive lets a user execute SQL-style transforms – joins, unions, aggregations, UDFs, etc.

Scenario: A pair of files have been loaded into HDFS. For the purposes of this use case, it does not matter how the files came to be in HDFS (discussed further below). One file contains general data about books, such as ISBN, title, author, and publisher (we refer to this file as the 'books' data). The second file contains user-submitted ratings for the same books (we refer to this file as the 'ratings' data). An analyst wants to determine whether the words used in the title of a book have predictive power for its rating. That is, given the title of a book, can you successfully predict what rating it will get? As a publisher or author, this would be valuable information, as you would seek to optimize your title to generate higher ratings. In order to do this analysis, the data must be prepared: the books file must be cleaned and then merged with the ratings data. The data can then be analyzed with a variety of tools, including, but not limited to: R, Python, Hive, Pig, MapReduce, Tableau, and Excel.

Loom flow:
1. Each file is registered as a Source – the data can be previewed and basic technical metadata can be seen in Loom. Users can augment this with their own descriptive metadata, if desired.
2. A Dataset is created for each Source – this puts a schema onto the data and makes it available for transformation and profiling
3. The books data is cleaned with Weaver – the title column is the main focus. We need to remove any special characters, normalize possessive nouns (i.e. remove all " 's "), convert everything to lowercase, remove anything in between a pair of parentheses (this often references the publisher or series name), and remove any extra whitespace. We also need to delete any non-English books, as these would complicate our analysis without adding value.
4. The books data is further prepared with HiveQL and merged with the ratings data – Using Hive, we aggregate all the ratings into a single average rating for each book, join the books and ratings tables, extract the individual words from each title (using a Hive built-in function), split the joined table into one table of 'good' books and one table of 'bad' books (based on the average rating), and finally produce a count, for 'good' books and for 'bad' books, of how many times each individual word appears in book titles. (A sketch of this kind of HiveQL appears after this list.)
5. OPTIONAL: Export the final tables from Hive to a relational database using Sqoop or TDCH. This demonstrates that Loom can automatically track the lineage of this operation, for the case where the analyst plans to use a relational platform for the final analysis.
6. OPTIONAL: Import the data from Hive into R, for the case where the analyst plans to use R for the final analysis.
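To make step 4 concrete, the HiveQL below sketches the merge and word-count logic under assumed names: a cleaned books_clean table, a ratings table with a numeric rating column, and a 3.5 cutoff separating 'good' from 'bad' books. None of these names or values come from Loom, and the intermediate 'good'/'bad' split is folded into the word-count queries for brevity.

    # Illustrative HiveQL for step 4; table names, column names, and the 3.5
    # rating cutoff are assumptions for this sketch, not values defined by Loom.
    hive -e "
    -- Average rating per book, joined to the cleaned books table
    CREATE TABLE books_rated AS
    SELECT b.isbn, b.title, AVG(r.rating) AS avg_rating
    FROM books_clean b JOIN ratings r ON b.isbn = r.isbn
    GROUP BY b.isbn, b.title;

    -- How often each title word appears among 'good' books (avg_rating >= 3.5)
    CREATE TABLE good_title_words AS
    SELECT word, COUNT(*) AS cnt
    FROM books_rated LATERAL VIEW explode(split(title, ' ')) t AS word
    WHERE avg_rating >= 3.5
    GROUP BY word;

    -- How often each title word appears among 'bad' books (avg_rating < 3.5)
    CREATE TABLE bad_title_words AS
    SELECT word, COUNT(*) AS cnt
    FROM books_rated LATERAL VIEW explode(split(title, ' ')) t AS word
    WHERE avg_rating < 3.5
    GROUP BY word;
    "

Executed through Loom (or detected by Activescan), queries like these would also have their lineage captured automatically.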
Extensions to this use case:
- Business metadata – For each entity involved in the workflow outlined above, add descriptions, tags, folders, and custom properties (either using default custom properties or by configuring new custom properties)
    o Business Glossary – Create a business glossary which defines terminology relevant to the books domain and create mappings into some or all of the tables involved in the workflow
- Statistics – For each table in a Dataset, generate statistics (either manually trigger the statistics scan through the Workbench or configure Activescan to do it automatically)
- Lineage – This is generated automatically along the way for all transforms executed through Loom and for any Sqoop/TDCH job run. For other lineage (e.g. the original source of the books data outside of Hadoop), the API can be used to create the necessary entities. See 'Advanced Use Cases' below for a discussion of data lineage.

Limitations within this use case:
- Loom can only read data from HDFS in the following formats – CSV, log (regex), Avro, Parquet
- Loom cannot execute transformations other than those described above (HiveQL, Weaver)
- Loom cannot automatically track lineage for transformations executed outside of Loom, except for Sqoop jobs, TDCH jobs, or Hive queries (new in Loom 2.3).
- A "sandbox" cluster (i.e. a single-node virtual machine) may struggle to execute multiple jobs at the same time (transformations or statistics). Queues can be used to address this.

Batch Data Processing of Log Data with Activescan

This use case demonstrates how Loom can be configured to automatically move new files to Hive and apply a Weaver script. It combines Loom's data preparation features (namely, Weaver) with Loom's Activescan framework, and is a natural extension of the 'data preparation' use case described above. Once a Weaver transform is defined to clean a specific table, a user will want to apply the same transform to new files that are loaded into the cluster.

Scenario: A new directory is created in HDFS and a single file is loaded, with the intention of loading new files on a scheduled basis (e.g. once a day). Each file represents a set of records in the same logical table. A user has defined a Weaver script for cleaning the data, using the original file. Activescan is configured to automatically execute this Weaver script for every new file that is loaded.

Loom flow:
1. The HDFS directory and original file are discovered by Activescan and registered as a Source and Table, respectively
2. A user creates a Dataset from the Source
3. A user defines a Weaver script to clean the Table
4. A user defines a 'directives' file for the Dataset which references the Weaver script
5. New files are loaded into the directory (see the sketch at the end of this use case) – Activescan automatically detects each file, registers it as a new Table in the existing Source, creates a corresponding Table in the existing Dataset, and then executes the Weaver script. Lineage is tracked for all operations.

Extensions to this use case:
- Everything described above in the 'data preparation' use case
- Activescan can trigger "external" processes for new files, as an alternative to Weaver transforms. An "external" process can be any executable code (e.g. a bash script or Oozie script)

Limitations within this use case:
- Activescan cannot trigger HiveQL transformations with Loom at this time
- The default format for log files is the Apache Combined Log format. Other formats can be supported by editing the Source profile in Activescan. See the 'docs' folder in the Loom distribution for more details on how to configure Source profiles for Activescan.
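As a small illustration of step 5, the commands below stage one new day's log file into the watched directory; scheduling them with cron would reproduce the daily loading pattern described in the scenario. The local and HDFS paths and the file naming scheme are placeholders and are not defined by Loom; the HDFS directory must be the one registered as the Source.

    # Hypothetical daily load into the HDFS directory that Activescan watches;
    # paths and the file naming scheme are placeholders for this sketch.
    TODAY=$(date +%Y-%m-%d)
    hdfs dfs -put "/var/log/webapp/access-${TODAY}.log" "/data/weblogs/access-${TODAY}.log"

On its next scheduled scan, Activescan registers the new file as a Table in the existing Source, adds the corresponding Table to the Dataset, and runs the Weaver script referenced in the directives file.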
Advanced Use Cases

The use cases above describe the most straightforward use of Loom: providing a Workbench for analysts working in Hadoop to explore and transform data, augmented by Activescan's ability to automate tasks. Loom can also support the broader data management capabilities described below.

Data Governance

There are many aspects to data governance, but metadata underpins all of them. Loom provides a catalog of data in a Hadoop cluster, with rich metadata. Loom does not allow users to define and enforce specific policies about data. Loom does provide the ability to define custom properties that can be used to track specific information that is important to data governance. Users can also create business glossaries, which are important for aligning semantics and organizing data in a consistent way.

Data Lineage

Loom provides the core capability to track metadata about data lineage. Loom automatically tracks lineage for any transformations executed through Loom and can also automatically track certain operations outside of Loom (Sqoop, TDCH, Hive). Other types of operations (e.g. MapReduce, Pig, etc.) may be tracked in the future; in the meantime, Loom provides an API which can be used to record their lineage. If data lineage is fundamentally important, some choices will need to be made: users will need to be limited to certain types of operations through certain tools, and these will need to be instrumented to record lineage with Loom. It is possible to ensure that everything is tracked automatically in Loom, but it does not come for free unless operations are limited entirely to Loom, Sqoop, TDCH, and Hive.

ETL Management

Loom has a basic capability to execute "data pipelines" when new files are discovered by Activescan. This is very limited when compared to full-blown ETL tools, but it does provide a basic level of automated data processing. Loom also has the ability to "call out" to external tools, which can provide more powerful transformation and workflow capabilities. The Loom API allows users to track metadata about anything going on in Hadoop. For users that are managing ETL in Hadoop with a workflow engine that does not include its own metadata repository (e.g. Oozie), Loom can fill this gap. A user needs to proactively integrate API calls to Loom into their data processing workflows.

Appendix A: Loom Concepts and Model

There are four core entities in Loom: Sources, Datasets, Transforms (Processes), and Jobs.

Sources represent sets of data that are outside Loom's control. That is, their lifecycles – creation, modification, and deletion – are managed external to Loom. Although Loom does not control or manage these external sources, it allows you to register them, and so manage their metadata and, for some sources, access their data. Loom also discovers potential sources by scanning your Hadoop file system; potential sources can be reviewed and activated as sources.

Datasets represent sets of data that are controlled by Loom. They are used for data pre-processing and analysis. Datasets can be created from sources initially; thereafter, datasets are output from transforms.

Sources and Datasets are both 'containers' of sets of related data (e.g., tables). In that way, they are both analogous to databases, and similar to each other. They are distinguished in Loom due to the different semantics and activities associated with each, based on their ownership and how much Loom knows about (and controls) each.
Processes define data processing to be performed on datasets. Loom supports the execution of HiveQL and Weaver transforms. Processes can be executed by providing the input and output data contexts. A data context is a (container ID, table name) pair, where a container can be either a source or a dataset. Hive queries can have multiple inputs, or no inputs; there is always one output context. Weaver transforms always have one input and one output context. Lineage is computed from executed (or 'used') processes, called 'process uses', because not all processes have to be formally executed in order to relate an input to an output. (The API can be used to define 'generic' processes and process uses relating sources and datasets to each other.) When a transform is executed, a job entity is created in Loom to track and record the execution progress and statistics.

Sources
- Sources are registered through the Workbench, through the API, or by Activescan
- Activescan continuously scans HDFS and registers any new files/directories
- Sources can be
    o Created through the Workbench: HDFS files or directories, Hive databases, or relational databases
    o Created by Activescan: HDFS files or directories, Hive databases, RDBMSs (when referenced in Sqoop or TDCH jobs)
    o Created through the API: anything you want

HDFS Sources
- A Source is mapped to a Directory
- Files in the Directory become Tables in the Source
- Supported formats: Delimited text (e.g. CSV), log file (regex), Avro
- Pluggable through a Java interface

Hive Sources
- A Source is mapped to a Hive Database
- Tables in the Database become Tables in the Source

RDBMS
- Descriptive only (i.e. Loom cannot connect to an RDBMS to read metadata or data); mainly for lineage tracking through Activescan and the API

Datasets
- Datasets are generated from Sources
- Datasets are always stored in Hive (see the sketch below for the two storage options)
    o Sometimes the Source data is copied into Hive, resulting in a duplicate of the original data stored in Hive's default format (a specific delimited text format)
    o Other times the Source data is registered as an external table in Hive, which references the original data
- Datasets can be transformed through the Loom Workbench or API using Weaver or HiveQL.
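The two storage options above correspond to standard Hive mechanisms. The snippet below is a generic Hive illustration of the difference, not output produced by Loom; the table names, columns, and HDFS location are placeholders.

    # Generic Hive illustration of the two storage options; table names, columns,
    # and the HDFS location are placeholders and are not created by Loom itself.
    hive -e "
    -- Option 1: register an external table that references the original data in place
    CREATE EXTERNAL TABLE books_ext (isbn STRING, title STRING, author STRING, publisher STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/landing/books';

    -- Option 2: copy the data into a managed table stored in the default Hive format
    CREATE TABLE books_copy AS SELECT * FROM books_ext;
    "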
Dataset metadata includes:
- Everything covered by Sources
- Statistics – generated by Activescan upon creation of a table, or triggered manually through the Workbench

Transforms and Jobs
- Datasets can be transformed through Loom using Weaver or Hive
- Data import/export processes (Sqoop/TDCH) will be registered as Transforms when discovered by Activescan
- HiveQL queries detected by Activescan will be registered as Transforms
- Other types of transforms can be registered through the API

Weaver
- For working with individual tables
- Interactive editing based on samples – choose an operation, apply the operation, see the results
- Batch processing over full datasets

HiveQL
- Any single-statement HiveQL query – primarily geared towards using SELECT (converted to a CREATE TABLE AS under the covers)
- Lineage is tracked automatically for Loom-initiated transforms

Jobs
- Jobs are created to track the execution of a Transform
- Transforms are saved and can be reused or edited later

Activescan

Activescan supports the following types of "scans":
- HDFS/Hive scan – for cataloging HDFS and Hive Sources
- Job scan – for detecting Sqoop and TDCH jobs, cataloging associated Sources, with lineage
- Statistics scan – for calculating statistics for individual tables

HDFS/Hive scan
- Scans HDFS at scheduled intervals looking for new files/directories
    o Identifies the type of file using configured "scanners" based on extensions, names, structure of data, etc.
- Scans Hive at scheduled intervals looking for new databases/tables
- Registers discovered data as Sources (or new Tables within existing Sources)
- Executes processes based on "directives" files

Directives
- A .activescan file, which references some executable file (e.g. an Oozie script) or a Weaver transform, can be configured for a Source (Directory) or Dataset.
- Whenever Activescan finds a new table in the Source/Dataset, it will trigger the executable referenced in the .activescan file

Job scan
- Scans the MapReduce history server at scheduled intervals for new jobs
- Identifies any jobs executed by Sqoop or TDCH (a sketch of such a job appears below)
- Registers the input and output as Sources and the import/export as a process and process use (creates lineage)

Statistics scan
- Triggered whenever a new Table is added to a Dataset
- Can also be triggered manually through the Workbench
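For reference, the following is the kind of Sqoop export job the Job scan would detect in the history server and register with lineage (it also matches the optional export step in the data preparation use case). The JDBC URL, credentials, table name, and directories are placeholders for this sketch, not values supplied by Loom.

    # Hypothetical Sqoop export of a Hive table to a relational database; the
    # connection details, table name, and paths are placeholders.
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/analytics \
      --username analyst \
      --password-file /user/analyst/.db-password \
      --table good_title_words \
      --export-dir /apps/hive/warehouse/good_title_words \
      --input-fields-terminated-by '\001'

When the Job scan runs, the Hive table and the relational table are registered as Sources and the export is recorded as a process use, creating the lineage link between them.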