Big Data Ingestion Architecture Design
TAMU Big Data Ingestion Architecture Considerations
Document History
Author: Martin Donohue
Date of this revision: 21 November 2014
Date of next revision: -----

Revision History
Revision Number | Revision Date | Summary of Changes | Changes Marked
Approvals
This document requires the following approvals.
Name | Organization

Distribution
This document has been distributed to:
Name | Organization
Table of Contents
Document History
1 Intended Audience
2 Overview
3 Scope of this document
4 Organization of this document
5 Data Ingestion Architecture Considerations
6 Load Scenarios
    Data at Rest
    Data in Motion
    Data from a Data Warehouse
7 Data Types
    Structured
    Unstructured
8 Conclusion
Overall Architecture Diagram
Figure 1 - Application Architecture
See Appendix A5 for a sample deployment pattern of services across nodes.
1 Intended Audience
This document is meant for:
 Client personnel who are involved in the BigInsights project
 Individual IBM solution teams
 Individual non-IBM solution teams
The audience should have a high-level understanding of the Apache Hadoop (Hadoop)
architecture as well as the IBM InfoSphere BigInsights 3.0 (BigInsights) platform before reading
this document, as it refers to concepts such as management and data nodes that are specific to
a Hadoop and BigInsights implementation.
2 Overview
Data ingestion is one of the critical Hadoop workflows. Massive amounts of data must be
moved from various sources into Hadoop for analysis. Apache Flume™ and Apache Sqoop™
are two of the data ingestion tools that are commonly used for Hadoop. Flume is a distributed,
reliable and available system for efficiently collecting, aggregating and moving data from many
different sources to a centralized datastore such as Hadoop Distributed File System (HDFS).
Sqoop is a tool designed to transfer data between Hadoop and relational databases. It can
import data from a relational database management system (RDBMS) into HDFS, HBase and
Hive, and then export the data back after it has been transformed using Hadoop MapReduce.
Traditional ETL tools (DataStage, Informatica, etc.) can also be used to move data into Hadoop
(HDFS/GPFS, HBase, etc.).
3 Scope of this document
The scope of this document is limited to the core software technology components of the
BigInsights 3.0 solution infrastructure, as well as supporting components such as ETL tools. It
addresses the considerations that inform the solution design.
4 Organization of this document
The document starts with a high-level background of the data ingestion strategy. Various factors
that influence the infrastructure design, such as streaming data, batch file feeds, and data types,
are described in individual sections. This leads to a recommendation describing requirements for
the ingestion flows that the client plans to stand up. Efforts have been made to provide relevant
technology details while omitting unnecessary technical detail.
5 Data Ingestion Architecture Considerations
The diagram below encapsulates the potential data flows for a Big Data architecture. Most Big
Data implementations will not need all of the flows and components represented, as using all of
them would amount to a complete migration from traditional client-server environments to the
Big Data platform.
[Figure: Big Data reference architecture showing analytic appliances (PureData for Analytics, PureData for Operational Analytics, DB2 BLU); information movement, matching & transformation (InfoSphere Data Click, Information Server, MDM, G2); an enterprise warehouse; a landing, exploration & archive zone (InfoSphere BigInsights); and security, governance and business continuity (Guardium, Optim, Symphony).]
Several methods exist for ingesting data into the cluster. Some of these methods require the
network design to be adjusted.
Data movement options
1) The BigInsights console can be used to route small amounts of data from data sources to data
nodes without requiring the data network to be routed. This preserves an isolated data network
design but restricts tooling and source options and will limit speeds and feeds.
2) Data Click: an ETL option, available in BigInsights, that can be placed on edge nodes.
Data Click is DataStage with an improved business-user interface, a modified feature set, and
reduced connector support. It is an ETL router whose ability to move data depends on available
CPU and network resources; moving large amounts of data using traditional ETL requires
dedicated servers.
3) DataStage: IBM's ETL product (which can be placed on one or more edge nodes) is Information
Server (DataStage). It connects to many sources of data and to multiple targets, such as data nodes
in an HDFS Hadoop cluster. Like Data Click, it is an ETL router whose ability to move data depends
on available CPU and network resources, and moving large amounts of data using traditional ETL
requires dedicated servers.
4) JAQL generates MapReduce code that executes on data nodes. It uses JDBC to access
databases, requiring the data network to be routed.
5) PDO (Push-Down Optimization) generates MapReduce code that runs on the data nodes. It
also uses JDBC to access databases, requiring the data network to be routed.
6) Sqoop is essentially a generator of custom MapReduce code for importing and exporting
data between an RDBMS and HDFS. Sqoop runs in the Hadoop cluster on data nodes and must be
able to connect to the RDBMS in question over the network via a JDBC interface, so the data
network would need to be routed.
7) GPFS is an HDFS replacement that provides a POSIX cluster file system that can be mounted
remotely. Edge nodes, or routing the data network, can be used to copy data to the BigInsights
file system.
8) Flume (described in detail under Load Scenarios below)
9) Custom MapReduce code
10) Streams
Streams is a software platform that enables the development and execution of applications that
process information in data streams. InfoSphere Streams enables continuous and fast analysis of
massive volumes of moving data to help improve the speed of business insight and decision
making.
11) Informatica
Serving as the foundation for all data integration projects, the Informatica Platform lets IT
organizations initiate the ETL process from virtually any business system, in any format. As part
of the Informatica Platform, Informatica PowerCenter delivers robust yet easy-to-use enterprise
ETL capabilities that simplify development and deployment. PowerCenter Express, a free version
of PowerCenter that can be downloaded in 10 minutes, is available to support the ETL needs of
smaller departmental data marts and data warehouses. Because these products, and all the
products on the Informatica Platform, are powered by Vibe™, the ETL work done on one project
can be reused on another.
12) SAS DI
Data integration is the process of consolidating data from a variety of sources in order to produce
a unified view of the data. SAS supports data integration in the following ways:
 Connectivity and metadata. A shared metadata environment provides consistent data
definitions across all data sources. SAS software enables you to connect to, acquire, store, and
write data back to a variety of data stores, streams, applications, and systems on a variety of
platforms and in many different environments. For example, you can manage information in
Enterprise Resource Planning (ERP) systems, relational database management systems
(RDBMS), flat files, legacy systems, message queues, and XML.
 Data cleansing and enrichment. Integrated SAS Data Quality software enables you to
profile, cleanse, augment, and monitor data to create consistent, reliable information. SAS Data
Integration Studio provides a number of transformations and functions that can improve the
quality of your data.
 Extraction, transformation, and loading. SAS Data Integration Studio enables you to
extract, transform, and load data from across the enterprise to create consistent, accurate
information. It provides a point-and-click interface that enables designers to build process flows,
quickly identify inputs and outputs, and create business rules in metadata, all of which enable
the rapid generation of data warehouses, data marts, and data streams.
 Migration and synchronization. SAS Data Integration Studio enables you to migrate,
synchronize, and replicate data among different operational systems and data sources. Data
transformations are available for altering, reformatting, and consolidating information. Real-time
data quality integration allows data to be cleansed as it is being moved, replicated, or
synchronized, and you can easily build a library of reusable business rules.
 Data federation. SAS Data Integration Studio enables you to query and use data across
multiple systems without the physical movement of source data. It provides virtual access to
database structures, ERP applications, legacy files, text, XML, message queues, and a host of
other sources. It enables you to join data across these virtual data sources for real-time access
and analysis. The semantic business metadata layer shields business staff from underlying data
complexity.
 Master data management. SAS Data Integration Studio enables you to create a unified
view of enterprise data from multiple sources. Semantic data descriptions of input and output
data sources uniquely identify each instance of a business element (such as customer, product,
and account) and standardize the master data model to provide a single source of truth.
Transformations and embedded data quality processes ensure that master data is correct.
6 Load Scenarios
This document covers three different load scenarios: Data at Rest, Data in Motion, and Data from
a Data Warehouse.
Data at Rest
Data at rest is data that already exists in a file in some directory. It is at rest, meaning that no
additional updates are planned for this data, and it can be transferred as is.
Flume Overview
A Flume agent is a Java virtual machine (JVM) process that hosts the components through
which events flow. Each agent contains, at a minimum, a source, a channel, and a sink. An
agent can also run multiple sets of channels and sinks through a flow multiplexer that either
replicates or selectively routes an event. Agents can cascade to form a multi-hop tiered
collection topology until the final datastore is reached.
A Flume event is a unit of data flow that contains a payload and an optional set of string
attributes. An event is transmitted from its point of origination, normally called a client, to the
source of an agent. When the source receives the event, it sends it to a channel that is a
transient store for events within the agent. The associated sink can then remove the event from
the channel and deliver it to the next agent or the event’s final destination. Figure 2 depicts a
representative Flume topology.
[Figure 2: A representative Flume topology - external, social, structured operational, unstructured, sensor, geospatial, time series, and streaming data sources feeding Hadoop data nodes for exploration & discovery, BI & performance management, predictive analytics & modeling, and actionable insights.]
Flume is built on the concept of flows.
The various sources of the data sent through Flume may have different batching or reliability
settings. Often, logs are continually being appended to, and what we want are the new records
as they are added.
Any number of Flume agents can be chained together. You begin by starting a Flume agent
where the data originates. This data then flows through a series of nodes that are chained
together.
Every Flume agent works with both a source and a sink. (Actually, a single agent can work with
multiple sources and multiple sinks.) Sources and sinks are wired together via a channel:
each agent receives data through a source, stages it in a channel, and sends it on via a sink.
An agent running on one node can pass the data to an agent on a different node. The agent on
the second node also works with a source and a sink. There are a number of different types of
sources and sinks supplied with the product. For data to be passed from one agent to
another, we can have an Avro sink on the first communicating with an Avro source on the
second.
Avro is a remote procedure call and serialization framework that is a separate Apache project. It
uses JSON for defining data types and protocols and it serializes data in a compact binary
format.
Flume supports more than just a multi-tiered topology. In a consolidation topology, a single
Avro source receives data from multiple Avro sinks.
Both a replication and a multiplexing topology are also supported. In a replicating topology, log
events from the source are passed to all channels connected to that source. In a multiplexing
topology, data in the header area of a log event can be queried and used to distribute the event to
one or more channels.
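The source, channel, and sink wiring described in this section is expressed in an agent's
properties file. Below is a minimal sketch of a single-agent configuration; the agent, directory,
and host names are illustrative placeholders, not taken from this document. It watches a
spooling directory and lands events in HDFS:

    # agent1.properties - hypothetical flow: spooldir source -> memory channel -> HDFS sink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: pick up completed files from a local spooling directory
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/spool/flume/incoming
    agent1.sources.src1.channels = ch1

    # Channel: transient in-memory store for events within the agent
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: deliver events to HDFS, bucketed by date (local time drives the escapes)
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/data/landing/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

The agent is then started with the flume-ng launcher, where the --name value selects which
property prefix in the file the agent reads:

    flume-ng agent --conf ./conf --conf-file agent1.properties --name agent1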
It is possible to land files directly into the HDFS file system. For some data sets, such as log files,
there is the option to move files into a central file store or staging area (an NFS data store can be
used for this). From there the files can be moved into HDFS using Flume, the Distributed Copy
(DistCp) utility, or manually using the Web Console copy application. Additionally, HttpFS can be
leveraged to move files into the cluster from any client workstation, as long as the user can be
authenticated to the BigInsights cluster. Sqoop can also be used to pull data into BigInsights from
other systems that support a JDBC/ODBC protocol or a dedicated Sqoop connector.
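As a sketch of these file-based options (the paths, host names, and biadmin user are illustrative
assumptions), staged files could be landed with standard Hadoop commands, the Distributed
Copy utility, or the HttpFS REST interface:

    # Push staged log files from the NFS staging area into HDFS
    hadoop fs -mkdir -p /data/landing/logs
    hadoop fs -put /mnt/nfs/staging/logs/*.log /data/landing/logs/

    # Copy a directory tree between clusters with the Distributed Copy utility
    hadoop distcp hdfs://sourcenn:8020/data/logs hdfs://targetnn:8020/data/landing/logs

    # Create a file through HttpFS from any client workstation (HttpFS listens on port 14000 by default)
    curl -i -X PUT -T app.log -H "Content-Type: application/octet-stream" \
      "http://edgenode:14000/webhdfs/v1/data/landing/logs/app.log?op=CREATE&user.name=biadmin&data=true"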
GPFS is POSIX compliant. For environments leveraging the GPFS file system, Informatica can be
leveraged for data ingestion (Informatica is in the process of certifying against GPFS). Additionally,
GPFS clients can be used to easily FTP files into the cluster.
Data in Motion
Data in Motion is data that is continuously being updated:
 New data might be added regularly to these data sources,
 Data might be appended to a file, or
 Discrete or different logs might be getting merged into one log.
Streams
InfoSphere Streams offers a complete stream computing solution to enable real-time analytic
processing of data in motion. It provides an advanced computing platform that allows
user-developed applications to quickly ingest, analyze, and correlate information as it arrives
from thousands of real-time sources. The solution can handle very high data throughput rates, up
to millions of events or messages per second.
Figure 5 - Streams Processing Language (SPL)
SPL (Streams Processing Language) allows users to define the input(s), specify analytic
operators and the intermediate streams, and then define the output. Programs can be visually
represented in a flow graph, which is created automatically in Streams Studio as developers write
the program. The high-level language is compiled into C++ code for deployment to the runtime. A
flow graph can illustrate many capabilities, such as analyzing multiple input streams; for example,
New York Stock Exchange data can be shown alongside Securities and Exchange Commission
(SEC) EDGAR data, which includes items such as quarterly and annual reports from businesses in
the US. The intermediate data can be fused, or merged, before being sent out of the Streams
runtime.
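To make that structure concrete, here is a minimal SPL sketch; the file names, stream schema,
and filter predicate are illustrative placeholders rather than the stock exchange example above.
It defines an input, one intermediate stream produced by an operator, and an output:

    composite Main {
      graph
        // Input: read raw lines from a file
        stream<rstring line> Lines = FileSource() {
          param
            file   : "input.txt";
            format : line;
        }

        // Intermediate stream: a standard-toolkit operator drops empty lines
        stream<rstring line> NonEmpty = Filter(Lines) {
          param filter : length(line) > 0;
        }

        // Output: write the filtered stream to a file
        () as Sink = FileSink(NonEmpty) {
          param file : "output.txt";
        }
    }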
Data from a Data Warehouse
When moving data from a data warehouse, or any RDBMS for that matter, we could export the
data and then use Hadoop commands to import it. If you are working with a Netezza system, then
you can use the Jaql Netezza module to both read from and write to Netezza tables. Data can also
be moved using Big SQL Load.
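As a sketch of the export-then-load approach (the database, table, and path names are
hypothetical), a DB2 warehouse table could be exported to a delimited file with the DB2 command
line processor and then copied into HDFS:

    # Export the warehouse table to a delimited file
    db2 connect to SALESDW
    db2 "EXPORT TO /staging/daily_sales.del OF DEL SELECT * FROM MART.DAILY_SALES"

    # Land the exported file in HDFS for downstream processing
    hadoop fs -put /staging/daily_sales.del /data/warehouse/daily_sales/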
Sqoop Overview
Apache Sqoop is a CLI tool designed to transfer data between Hadoop and relational
databases. Sqoop can import data from an RDBMS such as MySQL or Oracle Database into
HDFS and then export the data back after data has been transformed using MapReduce. Sqoop
can also import data into HBase and Hive.
Sqoop connects to an RDBMS through its JDBC connector and relies on the RDBMS to describe
the database schema for data to be imported. Both import and export utilize MapReduce, which
provides parallel operation as well as fault tolerance. During import, Sqoop reads the table, row
by row, into HDFS. Because import is performed in parallel, the output in HDFS is multiple files
(delimited text files, binary Avro files, or SequenceFiles) containing serialized record data.
Sqoop has a command-line interface for transferring data. It supports incremental loads of a
single database table, or a free-form SQL query, as well as scripts that can be run whenever
needed to import updates made to a database since the last import. Sqoop can also be used to
populate tables in Hive or HBase.
Sqoop provides a set of high-performance open-source connectors that can be customized for
your specific external connections. Sqoop offers specific connector modules that are designed for
different product types. Sqoop successfully graduated from incubator status in March of 2012 and
is now a top-level Apache project.
Sqoop is designed to transfer data between relational database systems and Hadoop. It uses
JDBC to access the relational systems. To use it with BigInsights, you must copy the JDBC driver
JAR for the relational database to be accessed into the $SQOOP_HOME/lib directory so that the
driver can be used by the Sqoop software.
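For example (the driver JAR version, host, and database names are illustrative), installing a
MySQL driver and verifying connectivity might look like this:

    # Make the JDBC driver visible to Sqoop
    cp mysql-connector-java-5.1.31-bin.jar $SQOOP_HOME/lib/

    # Confirm that Sqoop can reach the database; -P prompts for the password
    sqoop list-tables --connect jdbc:mysql://dbhost:3306/sales --username etl -P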
Sqoop accesses the database so that it can understand the schema of the data involved in a
transfer. It then generates a MapReduce application to import or export the data from or to the
database. When you use Sqoop to import data into Hadoop, Sqoop generates a Java class that
encapsulates one row of the imported table. You have access to the actual source code for the
generated Java class, which can allow you to quickly develop other MapReduce applications that
use the records that Sqoop stored in HDFS.
The connection information is the same whether you are doing an import or an export: you
specify a JDBC connection string, the username, and the password.
In the examples that follow, you use either the keyword import or the keyword export (just one,
not both at the same time), depending on the action you want to perform.
The Sqoop import command is used to extract data from a relational table and load it into
Hadoop. Each row in HDFS comes from a row in the corresponding table. The resulting data in
HDFS can be stored as text files or binary files, or imported directly into HBase or Hive.
By default, all columns of all rows are imported; however, there are arguments that allow you to
specify particular columns or a WHERE clause to limit the rows. You can even specify your own
query to access the relational data.
If you want to specify the location of the imported data, use the --target-dir argument. Otherwise
the target directory name will be the same as the table name.
To split the data across multiple mappers, Sqoop by default uses the primary key of the table. It
determines the minimum and maximum values for the key and then assumes an even distribution
of values. You can use the --split-by argument to have the distribution work with a different
column. If the table does not have an index column, or has a multi-column key, then you must
specify the --split-by parameter.
To import data from only a subset of columns, use the --columns argument, where the column
names are comma separated. To limit the rows, use the --where argument, or supply your own
query that returns the rows to be imported. This allows for greater flexibility, for example allowing
you to get data by joining tables.
By default the imported data is in delimited text format (--as-textfile). Optional parameters allow
you to import in binary format (--as-sequencefile) or as an Avro data file (--as-avrodatafile). Also,
you can override the default in order to have the data compressed.
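Putting these arguments together, a representative import might look like the following sketch;
the connection string, credentials, and table and column names are hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl -P \
      --table ORDERS \
      --columns "ORDER_ID,CUSTOMER_ID,ORDER_DATE,TOTAL" \
      --where "ORDER_DATE >= '2014-01-01'" \
      --split-by ORDER_ID \
      --target-dir /data/sales/orders \
      --as-avrodatafile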
The Sqoop export command reads data in Hadoop and places it into relational tables (you
export from HDFS into a database). The target table must already exist and you can specify your
own parsing specifications.
By default, Sqoop inserts rows into a relational table. This is primarily intended for loading data
into a new table. If there are any errors when doing the insert, the export process will fail.
However, there are other export modes.
The update mode — the second mode — causes Sqoop to generate update statements. To do
updates, you must specify the --update-key argument. Here you tell Sqoop which table
column (or comma-separated columns) to use in the WHERE clause of the update statement.
If the update does not modify a row, it is not considered to be an error. The condition just goes
undetected.
Some database systems allow for --update-mode allowinsert to be specified — these are
databases that have an UPSERT command (UPSERT does an update if the row exists, but
otherwise inserts a new row).
The third mode is call mode. In this mode, Sqoop calls a stored procedure and passes it each
record.
The --export-dir parameter defines the location of the files in HDFS that are to be exported from
HDFS to put records into the database.
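A representative export, again with hypothetical connection details, that updates existing
summary rows and inserts new ones on databases that support an UPSERT:

    sqoop export \
      --connect jdbc:db2://dwhost:50000/SALESDW \
      --username etl -P \
      --table ORDERS_SUMMARY \
      --export-dir /data/out/orders_summary \
      --update-key ORDER_ID \
      --update-mode allowinsert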
A couple of additional points are worth noting.
By default, Sqoop assumes that it is working with comma-separated fields and that each record is
terminated by a newline. Both the import and export commands have the facility to allow you to
override this behavior.
Remember that when data is imported into Hadoop, you are given access to the Java source for
the Java class that was generated. If your data was not in the default format and the data that
you are exporting is in that same format, then you can use parts of that same code to read the
data.
When Sqoop is inserting rows into a table, it generates a multi-row insert. Each multi-row insert
handles up to 100 rows. The mapper then does a commit after 100 statements are executed. This
means that 10,000 rows are inserted before being committed.
Also, each export mapper in the generated MapReduce program commits in its own separate
transaction.
Data Click
Data Click provides self-service data integration so that any business or technical user can
integrate data from one table, multiple tables, or even whole schemas into the Hadoop Distributed
File System (HDFS) in IBM InfoSphere BigInsights.
You can create activities to integrate data by using the InfoSphere Data Click browser-based
interface. You can create multiple activities that each specify different source-to-target
configurations.
When you create an activity, you define the source for the data. You can choose data that you
require from a wide variety of data sources, including IBM PureData™ (DB2® and Netezza®),
Oracle, SQL Server, and Teradata. You can limit the source to the data that you require, whether
it is a single table or multiple databases.
When you run an activity, InfoSphere Data Click automatically creates Hive tables in the target
directory. A Hive table is created for each table that you select, and you can specify the location
of the target directory where the data will be stored. The data types of the columns in the Hive
table are assigned based on metadata information about the data types of the columns in the
source. You can then use IBM Big SQL in InfoSphere BigInsights to read and analyze the data in
the tables.
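Once an activity has run, the resulting Hive table can be queried directly. A minimal Big SQL
sketch, assuming a hypothetical ORDERS table created by an activity:

    SELECT CUSTOMER_ID, SUM(TOTAL) AS TOTAL_SPEND
    FROM ORDERS
    GROUP BY CUSTOMER_ID
    FETCH FIRST 10 ROWS ONLY;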
You also set the policies for the activity, including the amount of data that can be integrated when
the activity runs. The policy choices that you make are applied automatically, without any coding.
7 Data Types
There are two types of data: structured and unstructured. Traditional RDBMS environments
support structured data. Newer sources of data (Facebook, Twitter, etc.) have an unstructured
format that requires extensive programming to support.
Structured
Structured data can be moved into the PureData for Analytics appliance (Netezza) for advanced
analytics. It simplifies and optimizes the performance of data services for analytic applications,
enabling very complex algorithms to run faster than would currently be economically possible in
the BigInsights environment.
Structured data can also be loaded into HBase or Hive, allowing consumers to access the data
through the Big SQL layer of BigInsights.
Unstructured
Unstructured data can take many forms; some examples include email, text messages, and social
media postings. The data can initially be landed in the BigInsights (HDFS or GPFS) discovery
zone. The data can then be analyzed and categorized using text analytics (SystemT). Once the
necessary fields have been extracted, the data can be moved into Hive/HBase or left in HDFS.
BigInsights Text Analytics Components (SystemT)
BigInsights comes with templates and add-ons for the open source Eclipse IDE. The Eclipse
tools include wizards and built-in syntax checking as well as code generators and quick testing
abilities. The SystemT optimizer compares multiple execution plans to derive the most efficient
path. The SystemT runtime then sequentially processes and outputs a stream of annotated
documents. Text analytics modules can also be deployed as functions in BigSheets, where
business analysts can leverage sophisticated text analytics capabilities without having to learn
the technical details.
SystemT operates entirely in memory. This allows for quick execution but imposes limitations on
file sizes and quantities. AQL operates over a simple relational data model with three data types:
Span, Tuple, and View. A Span is identified by its "begin" and "end" positions. A Tuple is a
fixed-size list of spans. A View is a set of Tuples, which can be thought of as rows in a table.
Regular expressions are used to extract characters and character patterns. When it comes to
extracting words, you should use dictionaries as much as possible. A dictionary is simply a text
file with a list of words, or a list of words in an array included in the AQL code. Dictionaries allow
the SystemT engine to operate most efficiently: the simpler your regular expressions are, and the
more you use dictionaries, the faster SystemT will be able to operate. This also helps minimize
the amount of memory consumed, which sometimes becomes critical when testing a complex
extractor on a local VM installed on your PC.
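As a minimal AQL sketch of the dictionary-first advice above (the dictionary entries and view
names are illustrative):

    -- Inline dictionary: preferred over a regular expression for known words
    create dictionary ProductDict as
      ('biginsights', 'netezza', 'datastage');

    -- A view of Tuples whose Span column marks each dictionary match
    create view ProductMention as
      extract dictionary 'ProductDict'
        on D.text as product
      from Document D;

    output view ProductMention;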
8 Conclusion
The recommended tool for Big Data ingestion depends greatly on the data type (structured or
unstructured) being ingested, as well as the ingestion scenario (data at rest, data in motion, etc.).
The decision concerning data at rest hinges on the existing skillset. The data-in-motion selection
should begin with Streams, which provides the most flexible solution when the supporting
products (BPM, ILOG, etc.) are considered. One dimension of the selection process is price, and
the standard Hadoop tools (Sqoop, Flume, Pig, Jaql, etc.) can be used in most circumstances. In
that case, the complexity of the data ingestion, including security requirements, should be
weighed, as the custom coding necessary may cost more than a commercial product solution.
Appendix A1
Reference Material
1. IBM Information Center (http://pic.dhe.ibm.com/infocenter/bigins/v2r1/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.welcome.doc%2Fdoc%2Fwelcome.html)
2. Apache Flume (http://flume.apache.org/)
3. IBM System x Reference Architecture (http://www.redbooks.ibm.com/abstracts/redp5009.html)
4. Apache Sqoop (http://sqoop.apache.org/)
5. IBM InfoSphere Streams (http://pic.dhe.ibm.com/infocenter/streams/v2r0/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.streams.whats-new.doc%2Fdoc%2Fibminfospherestreams-whats-newspl.html)
6. IBM InfoSphere Data Click (http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.iis.dataclick.doc/topics/dclkintrocont.html)
7. IBM AQL (http://pic.dhe.ibm.com/infocenter/bigins/v1r2/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.doc%2Fdoc%2Fbiginsights_aqlref_con_aqloverview.html)
[Appendix figure residue; recoverable content: a Streams Processing Language view (SPL source is compiled through an optimizer into an optimized plan of processing operators) and a dual-site disaster recovery pattern: copy the source data (raw files) to the secondary site; generate derived data by running the same jobs on both sites; replicate Hadoop data (HBase tables, Hive tables, CSV files), concatenating many small files into one big file to minimize NameNode and disk I/O operations when the raw files are replicated.]