Big Data Ingestion Architecture Design
TAMU Big Data Ingestion Architecture Considerations

Document History
Author: Martin Donohue
Date of this revision: 21 November 2014
Revision Number / Revision Date / Date of next revision / Summary of Changes / Changes marked

Approvals
This document requires the following approvals.
Name / Organization

Distribution
This document has been distributed to:
Name / Organization

Table of Contents
Document History
1 Intended Audience
2 Overview
3 Scope of this document
4 Organization of this document
5 Data Ingestion Architecture Considerations
6 Load Scenarios
   Data at Rest
   Data in Motion
   Data from a Data Warehouse
7 Data Types
   Structured
   Unstructured
8 Conclusion

Overall Architecture Diagram
Figure 1 - Application Architecture
See Appendix A5 for a sample deployment pattern of services across nodes.
1 Intended Audience
This document is intended for:
- Client personnel involved in the BigInsights project
- Individual IBM solution teams
- Individual non-IBM solution teams
The audience should have a high-level understanding of the Apache Hadoop (Hadoop) architecture as well as the IBM InfoSphere BigInsights 3.0 (BigInsights) platform before reading this document, as it refers to concepts such as management and data nodes that are specific to a Hadoop and BigInsights implementation.

2 Overview
Data ingestion is one of the critical Hadoop workflows: massive amounts of data must be moved from various sources into Hadoop for analysis. Apache Flume™ and Apache Sqoop™ are two of the data ingestion tools commonly used with Hadoop. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving data from many different sources to a centralized datastore such as the Hadoop Distributed File System (HDFS). Sqoop is a tool designed to transfer data between Hadoop and relational databases. It can import data from a relational database management system (RDBMS) into HDFS, HBase, and Hive, and then export the data back after it has been transformed using Hadoop MapReduce. Traditional ETL tools (DataStage, Informatica, and others) can also be used to move data into Hadoop targets (HDFS/GPFS, HBase, and so on).

3 Scope of this document
The scope of this document is limited to the core software technology components of the BigInsights 3.0 core solution infrastructure, as well as supporting components such as ETL tools. This document addresses the design considerations for the solution design.

4 Organization of this document
The document starts with a high-level background of the data ingestion strategy. Various factors that influence the infrastructure design, such as streaming data, batch file feeds, and data types, are described in individual sections.
This leads to recommendations describing requirements for the ingestion flows that the client plans to stand up. Efforts have been made to provide relevant technology details while at the same time masking unwanted technical detail.

5 Data Ingestion Architecture Considerations
The diagram below encapsulates the potential data flows for a Big Data architecture. The entire set of flows and components represented will likely not be needed for most Big Data implementations, as that would represent a complete migration to the Big Data platform from traditional client-server environments.

[Figure: Big Data reference architecture, showing analytic appliances (PureData for Analytics, PureData for Operational Analytics, DB2 BLU); information movement, matching and transformation (InfoSphere Data Click, Information Server, MDM, G2); the enterprise warehouse; landing, exploration and archive (InfoSphere BigInsights); and security, governance and business continuity (Guardium, Optim, Symphony).]

Several methods exist for ingesting data into the cluster. Some of these methods require the network design to be adjusted.

Data movement options
1) BigInsights Console: The console can be used to route small amounts of data from data sources to data nodes without requiring the data network to be routed. This preserves an isolated data network design but restricts tooling and source options, and will impact speeds and feeds.
2) Data Click: An ETL option, available in BigInsights, that can be placed on edge nodes. Data Click is DataStage with an improved business user interface, a modified feature set, and reduced connector support. It is an ETL router whose ability to move data is bounded by available CPU and network resources; moving large amounts of data using traditional ETL requires additional servers.
3) DataStage: IBM's ETL product (which can be placed on one or more edge nodes) is Information Server (DataStage).
It connects to many sources of data and to multiple targets, such as data nodes in an HDFS Hadoop cluster. Like Data Click, it is an ETL router whose ability to move data is bounded by available CPU and network resources; moving large amounts of data using traditional ETL requires additional servers.
4) JAQL: Generates MapReduce code that executes on the data nodes. It uses JDBC to access databases, which requires the data network to be routed.
5) PDO (Push-Down Optimization): Generates MapReduce code that runs on the data nodes. It uses JDBC to access databases, which requires the data network to be routed.
6) Sqoop: Essentially a generator of custom MapReduce code for importing and exporting data between an RDBMS and HDFS. Sqoop runs in the Hadoop cluster on data nodes and must be able to connect to the RDBMS in question over the network via a JDBC interface, so the data network would need to be routed.
7) GPFS: An HDFS replacement that provides a POSIX cluster file system that can be mounted remotely. Edge nodes, or routing the data network, can be used to copy data into the BigInsights file system.
8) Flume
9) Custom MapReduce code
10) Streams: Streams is a software platform that enables the development and execution of applications that process information in data streams. InfoSphere Streams enables continuous and fast analysis of massive volumes of moving data to help improve the speed of business insight and decision making.
11) Informatica: Serving as the foundation for all data integration projects, the Informatica Platform lets IT organizations initiate the ETL process from virtually any business system, in any format. As part of the Informatica Platform, Informatica PowerCenter delivers robust yet easy-to-use enterprise ETL capabilities that simplify development and deployment. PowerCenter Express, a free version of PowerCenter that can be downloaded in 10 minutes, is available to support the ETL needs of smaller departmental data marts and data warehouses.
Because both of these products, and all of the products on the Informatica Platform, are powered by Vibe™, the ETL work done on one project can be reused on another.
12) SAS DI: Data integration is the process of consolidating data from a variety of sources in order to produce a unified view of the data. SAS supports data integration in the following ways:
- Connectivity and metadata. A shared metadata environment provides consistent data definitions across all data sources. SAS software enables you to connect to, acquire, store, and write data back to a variety of data stores, streams, applications, and systems on a variety of platforms and in many different environments. For example, you can manage information in Enterprise Resource Planning (ERP) systems, relational database management systems (RDBMS), flat files, legacy systems, message queues, and XML.
- Data cleansing and enrichment. Integrated SAS Data Quality software enables you to profile, cleanse, augment, and monitor data to create consistent, reliable information. SAS Data Integration Studio provides a number of transformations and functions that can improve the quality of your data.
- Extraction, transformation, and loading. SAS Data Integration Studio enables you to extract, transform, and load data from across the enterprise to create consistent, accurate information. It provides a point-and-click interface that enables designers to build process flows, quickly identify inputs and outputs, and create business rules in metadata, all of which enable the rapid generation of data warehouses, data marts, and data streams.
- Migration and synchronization. SAS Data Integration Studio enables you to migrate, synchronize, and replicate data among different operational systems and data sources. Data transformations are available for altering, reformatting, and consolidating information.
Real-time data quality integration allows data to be cleansed as it is being moved, replicated, or synchronized, and you can easily build a library of reusable business rules.
- Data federation. SAS Data Integration Studio enables you to query and use data across multiple systems without the physical movement of source data. It provides virtual access to database structures, ERP applications, legacy files, text, XML, message queues, and a host of other sources. It enables you to join data across these virtual data sources for real-time access and analysis. The semantic business metadata layer shields business staff from underlying data complexity.
- Master data management. SAS Data Integration Studio enables you to create a unified view of enterprise data from multiple sources. Semantic data descriptions of input and output data sources uniquely identify each instance of a business element (such as customer, product, and account) and standardize the master data model to provide a single source of truth. Transformations and embedded data quality processes ensure that master data is correct.

6 Load Scenarios
This document covers three load scenarios: Data at Rest, Data in Motion, and Data from a Data Warehouse.

Data at Rest
Data at rest is data that already exists as a file in some directory. It is at rest in the sense that no additional updates are planned on this data, so it can be transferred as is.

Flume Overview
A Flume agent is a Java virtual machine (JVM) process that hosts the components through which events flow. Each agent contains, at a minimum, a source, a channel, and a sink. An agent can also run multiple sets of channels and sinks through a flow multiplexer that either replicates or selectively routes an event. Agents can cascade to form a multi-hop, tiered collection topology until the final datastore is reached. A Flume event is a unit of data flow that contains a payload and an optional set of string attributes.
An event is transmitted from its point of origination, normally called a client, to the source of an agent. When the source receives the event, it sends it to a channel, which is a transient store for events within the agent. The associated sink can then remove the event from the channel and deliver it to the next agent or to the event's final destination. Figure 2 depicts a representative Flume topology.

[Figure 2 - Representative Flume topology]

Flume is built on the concept of flows. The various sources of the data sent through Flume may have different batching or reliability setups. Often, logs are continually being appended to, and what we want are the new records as they are added. Any number of Flume agents can be chained together. You begin by starting a Flume agent where the data originates. This data then flows through a series of nodes that are chained together. Every Flume agent works with both a source and a sink. (A single agent can in fact work with multiple sources and multiple sinks.) Sources and sinks are wired together via a channel: each node receives data through a source, stores it in a channel, and sends it on via a sink. An agent running on one node can pass the data to an agent on a different node; the agent on the second node also works with a source and a sink. A number of different types of sources and sinks are supplied with the product. For data to be passed from one agent to another, we can have an Avro sink on the first communicating with an Avro source on the second. Avro is a remote procedure call and serialization framework that is a separate Apache project.
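As an illustrative sketch of the source/channel/sink wiring described above (the agent name, log path, and namenode host are hypothetical, not taken from the client environment), a single-agent flow that tails a log file into HDFS could be configured and started like this:

```shell
# Minimal single-agent Flume flow: tail a log file into HDFS.
# All names (agent1, file paths, namenode host) are illustrative.
cat > /etc/flume/conf/agent1.conf <<'EOF'
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Source: follow an application log as new records are appended
agent1.sources.logsrc.type     = exec
agent1.sources.logsrc.command  = tail -F /var/log/app/app.log
agent1.sources.logsrc.channels = memch

# Channel: transient in-memory store for events within the agent
agent1.channels.memch.type     = memory
agent1.channels.memch.capacity = 10000

# Sink: deliver events from the channel to HDFS
agent1.sinks.hdfssink.type          = hdfs
agent1.sinks.hdfssink.channel       = memch
agent1.sinks.hdfssink.hdfs.path     = hdfs://namenode.example.com/flume/applogs
agent1.sinks.hdfssink.hdfs.fileType = DataStream
EOF

# Start the agent
flume-ng agent --name agent1 --conf /etc/flume/conf \
  --conf-file /etc/flume/conf/agent1.conf
```

To extend this into the multi-hop topology discussed above, the HDFS sink would be replaced with an Avro sink pointing at an Avro source on the next agent in the chain.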
It uses JSON for defining data types and protocols, and it serializes data in a compact binary format. Flume supports more than just a multi-tiered topology. One example is a consolidation topology, in which a single Avro source receives data from multiple Avro sinks. Both replicating and multiplexing topologies are also supported. In a replicating topology, log events from the source are passed to all channels connected to that source. In a multiplexing topology, data in the header area of a log event can be queried and used to distribute the event to one or more channels.

It is possible to land files directly into the HDFS file system. For some data sets, such as log files, there is the option to move files into a central file store or staging area (an NFS data store can be used for this). From there, the files can be moved into HDFS using Flume, the distributed copy utility (DistCp), or manually using the Web Console copy application. Additionally, HttpFS can be leveraged to move files into the cluster from any client workstation, as long as the user can be authenticated to the BigInsights cluster. Sqoop can also be used to pull data into BigInsights from other systems that support a JDBC/ODBC protocol or a special Sqoop connector. GPFS is POSIX compliant; for environments leveraging the GPFS file system, Informatica can be leveraged for data ingestion (Informatica is in the process of certifying against GPFS). Additionally, GPFS clients can be used to easily FTP files into the cluster.

Data in Motion
Data in motion is data that is continuously being updated: new data might be added regularly to these data sources, data might be appended to a file, or discrete logs might be getting merged into one log.

Streams
InfoSphere Streams offers a complete stream computing solution to enable real-time analytic processing of data in motion.
Specifically, InfoSphere Streams provides an advanced computing platform that allows user-developed applications to quickly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. The solution can handle very high data throughput rates, up to millions of events or messages per second.

Figure 5 - Streams Processing Language (SPL)

SPL (Streams Processing Language) allows users to define the input(s), specify the analytic operators and the intermediate streams, and then define the output. Programs can be visually represented in a flow graph, which is created automatically in Streams Studio as developers write the program. The high-level language is compiled for deployment to the runtime. The flow graph illustrates many capabilities, such as analyzing multiple input streams; for example, New York Stock Exchange data can be shown alongside Securities and Exchange Commission (SEC) EDGAR data (filings such as quarterly and annual reports from businesses in the US). The intermediate data can be fused, or merged, before being sent out of the Streams runtime.

Data from a Data Warehouse
When moving data from a data warehouse, or any RDBMS for that matter, we could export the data and then use Hadoop commands to import it. If you are working with a Netezza system, you can use the Jaql Netezza module to both read from and write to Netezza tables. Data can also be moved using the Big SQL LOAD facility.

Sqoop Overview
Apache Sqoop is a command-line tool designed to transfer data between Hadoop and relational databases. Sqoop can import data from an RDBMS such as MySQL or Oracle Database into HDFS and then export the data back after it has been transformed using MapReduce. Sqoop can also import data into HBase and Hive.
Sqoop connects to an RDBMS through a JDBC connector and relies on the RDBMS to describe the schema of the data to be imported. Both import and export utilize MapReduce, which provides parallel operation as well as fault tolerance. During an import, Sqoop reads the table, row by row, into HDFS. Because the import is performed in parallel, the output in HDFS is multiple files — delimited text files, binary Avro files, or SequenceFiles — containing serialized record data.

Sqoop's command-line interface supports incremental loads of a single database table or a free-form SQL query, as well as jobs that can be rerun whenever needed to import the updates made to a database since the last import. Sqoop can also be used to populate tables in Hive or HBase. It provides a set of high-performance open source connectors that can be customized for specific external connections, and offers specific connector modules designed for different product types. Sqoop graduated from incubator status in March 2012 and is now a top-level Apache project.

Sqoop uses JDBC to access relational systems. To use it with BigInsights, you must copy the JDBC driver JAR for the relational database to be accessed into the $SQOOP_HOME/lib directory so that the driver can be used by the Sqoop software. Sqoop first accesses the database to determine the schema of the data involved in a transfer, and then generates a MapReduce application to import or export the data. When you use Sqoop to import data into Hadoop, Sqoop generates a Java class that encapsulates one row of the imported table. You have access to the actual source code of the generated Java class, which allows you to quickly develop other MapReduce applications that use the records that Sqoop stored in HDFS.
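As a sketch of this setup (the driver JAR, connection string, table, and directory names below are hypothetical, not taken from the client environment), making the driver visible to Sqoop and obtaining the generated row class might look like:

```shell
# Make the MySQL JDBC driver visible to Sqoop (JAR name is illustrative).
cp mysql-connector-java-5.1.31.jar $SQOOP_HOME/lib/

# Ask Sqoop to generate, without importing, the Java class that
# encapsulates one row of the ORDERS table, so it can be reused in
# other MapReduce applications. Connection details are hypothetical;
# -P prompts interactively for the password.
sqoop codegen \
  --connect jdbc:mysql://dbhost.example.com:3306/sales \
  --username etl_user -P \
  --table ORDERS \
  --outdir ./generated-src
```

The same generated class is produced implicitly by every sqoop import; the explicit codegen form is shown here only to make the generated source easy to locate.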
The connection information is the same whether you are doing an import or an export: you specify a JDBC connection string, the username, and the password, together with either the import or the export keyword — just one, not both — depending on the action you want to perform.

The Sqoop import command is used to extract data from a relational table and load it into Hadoop; each row in HDFS comes from a row in the corresponding table. The resulting data in HDFS can be stored as text files or binary files, or imported directly into HBase or Hive. By default, all columns of all rows are imported; however, there are arguments that allow you to specify particular columns or a WHERE clause to limit the rows. You can even specify your own query to access the relational data. If you want to specify the location of the imported data, use the --target-dir argument; otherwise, the target directory name will be the same as the table name.

To split the data across multiple mappers, Sqoop by default uses the primary key of the table: it determines the minimum and maximum values of the key and then assumes an even distribution of values between them. You can use the --split-by argument to have the distribution work off a different column. If the table does not have an index column, or has a multi-column key, then you must specify the split-by column explicitly. To import only a subset of columns, use the --columns argument with a comma-separated list of column names. To limit the rows, use the --where argument, or supply your own query that returns the rows to be imported; the latter allows for greater flexibility, for example letting you get data by joining tables. By default, the imported data is in delimited text format (--as-textfile). Optional parameters allow you to import in binary format (--as-sequencefile) or as an Avro data file (--as-avrodatafile).
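Pulling these import options together, a hypothetical invocation (connection string, table, columns, and target directory are illustrative, not from the client environment) might look like:

```shell
# Import recent orders into HDFS, splitting the work across 4 mappers.
# All identifiers (host, database, table, columns, paths) are illustrative.
sqoop import \
  --connect jdbc:db2://dwhost.example.com:50000/SALESDW \
  --username etl_user -P \
  --table ORDERS \
  --columns "ORDER_ID,CUSTOMER_ID,ORDER_TOTAL" \
  --where "ORDER_DATE >= '2014-01-01'" \
  --split-by ORDER_ID \
  --num-mappers 4 \
  --target-dir /user/biadmin/orders \
  --as-textfile
```

The result is a set of delimited text files under /user/biadmin/orders, one per mapper, each holding a slice of the ORDER_ID range.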
Also, you can override the default in order to have the data compressed.

The Sqoop export command reads data in Hadoop and places it into relational tables (you export from HDFS into a database). The target table must already exist, and you can specify your own parsing specifications. By default, Sqoop inserts rows into the relational table; this is primarily intended for loading data into a new table, and if any error occurs during an insert, the export process fails. However, there are other export modes. The second mode, update mode, causes Sqoop to generate UPDATE statements. To do updates, you must specify the --update-key argument, which tells Sqoop which table column (or comma-separated list of columns) to use in the WHERE clause of the UPDATE statement. If an update does not modify a row, it is not considered an error; the condition simply goes undetected. Some database systems allow --update-mode allowinsert to be specified — these are databases that have an UPSERT capability (an UPSERT does an update if the row exists, and otherwise inserts a new row). The third mode is call mode; in this mode, Sqoop calls a stored procedure and passes it each record. The --export-dir parameter defines the location of the files in HDFS that are to be exported to put records into the database.

A couple of additional pieces of information: by default, Sqoop assumes that it is working with comma-separated fields and that each record is terminated by a newline. Both the import and export commands let you override this behavior. Remember that when data is imported into Hadoop, you are given access to the source of the Java class that was generated. If your data was not in the default format, and the data that you are exporting is in that same format, you can reuse parts of that generated code to read the data.
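Combining the export options above, a hypothetical export back to the warehouse (connection string, table, and directory are illustrative, not from the client environment) might look like:

```shell
# Export transformed summary records from HDFS back to a warehouse table,
# updating rows that already exist and inserting new ones (UPSERT-capable
# databases only). All identifiers are illustrative.
sqoop export \
  --connect jdbc:db2://dwhost.example.com:50000/SALESDW \
  --username etl_user -P \
  --table ORDER_SUMMARY \
  --export-dir /user/biadmin/order_summary \
  --update-key ORDER_ID \
  --update-mode allowinsert \
  --input-fields-terminated-by ','
```

Note that the target table ORDER_SUMMARY must already exist in the warehouse before the export is run.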
When Sqoop is inserting rows into a table, it generates multi-row INSERT statements, each handling up to 100 rows. A mapper then commits after every 100 statements, which means that up to 10,000 rows are inserted before each commit. Also, each export mapper in the generated MapReduce program commits in its own separate transactions.

Data Click
Data Click provides self-service data integration, so that any business or technical user can integrate data from one table, multiple tables, or even whole schemas into the Hadoop Distributed File System (HDFS) in IBM InfoSphere BigInsights. You create activities to integrate data by using the InfoSphere Data Click browser-based interface, and you can create multiple activities that each specify a different source-to-target configuration.

When you create an activity, you define the source for the data. You can choose the data that you require from a wide variety of data sources, including IBM PureData™ (DB2® and Netezza®), Oracle, SQL Server, and Teradata, and you can limit the source to the data that you require, whether that is a single table or multiple databases. When you run an activity, InfoSphere Data Click automatically creates Hive tables in the target directory: a Hive table is created for each table that you select, and you can specify the location of the target directory where the data will be stored. The data types of the columns in each Hive table are assigned based on metadata about the data types of the columns in the source. You can then use IBM Big SQL in InfoSphere BigInsights to read and analyze the data in the tables. You also set the policies for the activity, including the amount of data that can be integrated when the activity runs. The policy choices that you make are automated without any coding.

7 Data Types
There are two types of data: structured and unstructured.
Traditional RDBMS environments support structured data. The newer sources of data (Facebook, Twitter, and so on) have an unstructured format that requires extensive programming to support.

Structured
Structured data can be moved into the PureData for Analytics server (Netezza) for advanced analytics. It simplifies and optimizes the performance of data services for analytic applications, enabling very complex algorithms to run faster than would currently be economically possible in the BigInsights environment. Structured data can also be loaded into HBase or Hive, allowing consumers to access the data through the Big SQL layer of BigInsights.

Data Click is an ETL option here as well: available in BigInsights 2.1, it can be placed on edge nodes. Data Click is DataStage with an improved business user interface, a modified feature set, and reduced connector support. It is an ETL router whose ability to move data is bounded by available CPU and network resources.

Unstructured
Unstructured data can take many forms; examples include email, texts, and social media postings. The data can initially be landed in the BigInsights (HDFS or GPFS) discovery zone, then analyzed and categorized using text analytics (SystemT). Once the necessary fields have been extracted, the data can be moved into Hive/HBase or left in HDFS.

BigInsights Text Analytics Components (SystemT)
BigInsights comes with templates and add-ons for the open source Eclipse IDE. The Eclipse tools include wizards and built-in syntax checking, as well as code generators and quick testing abilities. The SystemT optimizer compares multiple execution plans to derive the most efficient path.
The SystemT runtime then sequentially processes documents and outputs a stream of annotated documents (see the figure below). Text analytics modules can also be deployed as functions in BigSheets, where business analysts can leverage sophisticated text analytics capabilities without having to learn the technical details.

SystemT operates entirely in memory. This allows for quick execution, but imposes limitations on file sizes and quantities. AQL operates over a simple relational data model with three data types: Span, Tuple, and View. A Span is identified by its "begin" and "end" positions. A Tuple is a fixed-size list of spans. A View is a set of Tuples, which can be thought of as rows in a table. Regular expressions are used to extract characters and character patterns. When it comes to extracting words, you should use dictionaries as much as possible: a dictionary is simply a text file with a list of words, or a list of words in an array included in the AQL code. Dictionaries allow the SystemT engine to operate most effectively. The simpler your regular expressions are, and the more you use dictionaries, the faster SystemT will be able to operate. This also helps minimize the amount of memory consumed, which can become critical when testing a complex extractor on a local VM installed on your PC.

8 Conclusion
The recommended tool for Big Data ingestion depends greatly on the data type (structured or unstructured) being ingested, as well as on the ingestion scenario (data at rest, data in motion, and so on). The decision for data at rest hinges on existing skill sets. The data-in-motion selection should begin with Streams, which provides the most flexible solution when the supporting products (BPM, ILOG, and so on) are considered. One dimension of the selection process is price, and the standard Hadoop tools (Sqoop, Flume, Pig, Jaql, and so on) can be used in most circumstances.
The complexity of the data ingestion, including security requirements, should be considered in that circumstance, as the custom coding necessary may exceed the cost of a commercial product solution.

Appendix A1 Reference Material
1. IBM Information Center (http://pic.dhe.ibm.com/infocenter/bigins/v2r1/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.welcome.doc%2Fdoc%2Fwelcome.html)
2. Apache Flume (http://flume.apache.org/)
3. IBM System x Reference Architecture (http://www.redbooks.ibm.com/abstracts/redp5009.html)
4. Apache Sqoop (http://sqoop.apache.org/)
5. IBM InfoSphere Streams (http://pic.dhe.ibm.com/infocenter/streams/v2r0/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.streams.whats-new.doc%2Fdoc%2Fibminfospherestreams-whats-new-spl.html)
6. IBM InfoSphere Data Click (http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.iis.dataclick.doc/topics/dclkintrocont.html)
7. IBM AQL (http://pic.dhe.ibm.com/infocenter/bigins/v1r2/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.doc%2Fdoc%2Fbiginsights_aqlref_con_aqloverview.html)