How can Big Data/Hadoop co-exist with your Enterprise Data Warehouse?
White paper on Big Data and EDW architecture and integration strategy
The Brite Group Inc.
10/18/2013
Abstract
Organizations increasingly rely on data warehouses to manage structured and operational
data. These data warehouses provide business analysts with the ability to analyze key data and
trends, and they also enable senior leadership to make intelligent decisions about their business.
However, the advent of social media and internet-based data sources, commonly referred to as big
data, not only challenges the role of the traditional data warehouse in analyzing data from these
diverse sources, but also exposes limitations of the software and hardware platforms used to
build traditional data warehouse solutions.
Getting full business intelligence (BI) value out of big data requires some adjustments to best
practices and tools for enterprise data warehouses (EDWs). For example, deciding which of the
big data sources (such as web sites, social media, robotics, mobile devices, and sensors) are
useful for BI/EDW purposes pushes BI professionals to think in new terms. Big data from these
sources ranges from structured to semi-structured to unstructured, and most EDWs are not
designed to store and manage this full range of data. Likewise, some of the new big data sources
feed data relentlessly in real time, whereas the average EDW is not designed for such real-time
feeds. Furthermore, most of the business value coming from big data is derived from advanced
analytics based on the combination of both traditional enterprise data and new data sources.
Accordingly, companies will need to modify their best practices for EDW data integration, data
quality, and data modeling in order to take advantage of big data, specifically by making some
changes to their existing infrastructure, tools and processes to integrate big data into their current
environment.
The relationship between the data warehouse and big data is evolving into a hybrid
structure. In this hybrid model, the highly structured, optimized operational data remains in the
tightly controlled data warehouse, while the data that is highly distributed and subject to change
in real time is controlled by a Hadoop-based infrastructure.
Increasingly, organizations understand that they have a business requirement to combine
traditional data warehouses and their historical business data sources with less structured big
data sources. A hybrid model, supporting traditional and big data sources, can thus help
accomplish these business goals.
1. Role of an Enterprise Data Warehouse in an organization
Traditionally, data processing for analytic purposes follows a fairly static workflow (shown in
Figure 1). Through the regular course of business, enterprises first create modest amounts of
structured data with stable data models via enterprise applications such as storefront and
order-processing systems, CRM, ERP and billing systems. Data integration tools are then used to
extract, transform and load the data from enterprise applications and transactional databases to a
staging area where data standardization and normalization occur, so that the data is modeled into
neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data
warehouse. This routine typically occurs on a scheduled basis – usually daily or weekly, sometimes
more frequently.
Figure 1. Role of a traditional EDW in an organization
From there, data warehouse and business intelligence engineers create and schedule regular
reports to run against data stored in the warehouse, which are distributed to business
management. They also create dashboards and other limited visualization tools for executives
and senior management.
Business analysts, meanwhile, use data analytics tools to run advanced analytics against the
warehouse, or more often against sample data migrated to a local data mart due to size
limitations. Non-expert business users perform basic data visualization and limited analytics
against the data warehouse via front-end business intelligence products. Data volumes in
traditional data warehouses rarely exceed multiple terabytes, as large volumes of data strain
warehouse resources and degrade performance.
The job of cleansing and integrating source data, often referred to as Extract, Transform and Load
(ETL), is typically done using data integration tools. Using these tools, developers drag and drop pre-defined transforms onto a workspace to construct graphical workflow ETL jobs that clean and
integrate structured transaction and master data (customers, products, assets, suppliers etc.)
before loading this data into data warehouses. Over the years these tools have improved their
support for more data sources, change data capture, data cleansing and performance. However,
historically, these tools have only supported structured data primarily associated with data from
internal systems.
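To make the staging workflow above concrete, here is a minimal hand-coded sketch of an extract-transform-load step in Python, using SQLite files as stand-ins for the transactional source and the warehouse staging area; the orders table and its columns are hypothetical.

import sqlite3

# Stand-ins for a transactional source system and the EDW staging schema.
source = sqlite3.connect("orders_oltp.db")       # assumed to contain an 'orders' table
warehouse = sqlite3.connect("edw_staging.db")

warehouse.execute(
    "CREATE TABLE IF NOT EXISTS stg_orders "
    "(order_id INTEGER, customer_id INTEGER, order_date TEXT, amount_usd REAL)"
)

# Extract: pull one day's transactions from the operational system.
rows = source.execute(
    "SELECT order_id, customer_id, order_date, amount FROM orders WHERE order_date = ?",
    ("2013-10-17",),
).fetchall()

# Transform: standardize values before the rows reach the warehouse.
cleaned = [
    (order_id, customer_id, order_date.strip(), round(float(amount), 2))
    for order_id, customer_id, order_date, amount in rows
]

# Load: append the cleansed batch to the staging table on each scheduled run.
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", cleaned)
warehouse.commit()

Commercial data integration tools express the same extract, transform and load steps as graphical workflows, with pre-built connectors and cleansing transforms in place of hand-written SQL and Python.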
2. Challenges in using a traditional EDW for data analysis
In the last few years, new internal and external data sources have emerged that businesses now
want to analyze to produce insights beyond what they already know. In particular, social
network interaction data, profile data and social relationships in a social graph are getting
attention from marketers interested in everything from sentiment to likes, dislikes, followers and
influencers. Indeed, each Facebook update, tweet, blog post and comment creates multiple new
data points, which may be structured, semi-structured or unstructured. Billions of online
purchases, stock trades and other transactions happen every day, including countless automated
transactions. Each creates a number of data points collected by retailers, banks, credit card
companies, credit agencies and others. Electronic devices of all sorts – servers and other IT
hardware, smart energy meters, temperature sensors – create semi-structured log data that
records every action.
Within the enterprise, web logs are growing rapidly as customers switch to online channels as
their preferred way of interacting and transacting business. This in turn generates much
larger volumes of structured, semi-structured and unstructured data. All of this new activity
is giving rise to an emerging area of technology associated with analytics: big data. In the big
data environment, extreme transaction processing, larger data volumes, more variety in the data
types being analyzed, and data generation occurring at increasingly rapid rates are all issues
that need to be supported if companies are to derive value from these new data sources.
The cost of processing potentially terabytes of data also needs to be kept to a minimum to stop
total cost of ownership of ETL processing from spiralling out of control.
BI and analytics platforms have been tremendously effective at using structured data, such as
that found in CRM applications. But in the big data era, the BI challenge has changed
dramatically, in terms of both goals and execution, as described below.
• Volume: The volume of data has grown by orders of magnitude. A recent article from the
International Data Corporation (IDC) stated that the size of the entire digital universe in 2005
was 130 billion gigabytes and that it is expected to reach 40 trillion gigabytes by 2020 (see
http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf). Today's
enterprise environments routinely contain petabytes of data.
• Variety: The variety of data has evolved from the traditional structured datasets of ERP and
CRM systems to data gathered from user interactions on the Web (clickstreams, search
queries, and social media) and mobile user activity, including location-based information.
• Velocity: The velocity of data accumulation and change is accelerating, driven by an
expanding universe of wired and wireless devices, including Web-enabled sensors and
Web-based applications.
• Veracity: Veracity refers to the fact that data must be verifiable, in terms of both accuracy
and context, before being used to predict business value. For example, an innovative business
may want to analyze massive amounts of data in real time to quickly assess the value of a
customer and the potential to provide additional offers to that customer. It is therefore
necessary to identify the right amount and types of data that can be analyzed in real time to
impact business outcomes.
This revolutionary shift places significant new demands on data storage and analytical software,
and poses new challenges for BI and database professionals. It also creates powerful
opportunities for discovering and implementing new strategies that generate competitive
advantage. Realizing these opportunities requires two things: the technological capacity to gather
and store big data, and new tools for turning that data into insights and, ultimately, value.
3. Advent of Real Time EDW solutions
Unlike traditional data warehouses, massively parallel analytic databases are capable of quickly
ingesting large amounts of mainly structured data with minimal data modeling required and can
scale out to accommodate multiple terabytes and sometimes petabytes of data. Most importantly
for end-users, massively parallel analytic databases support near real-time results to complex
SQL queries. Fundamental characteristics of a massively parallel analytic database include:
• Massively parallel processing (MPP) capabilities: Massively parallel processing, or MPP,
allows for ingesting, processing and querying of data on multiple machines simultaneously.
The result is significantly faster performance than traditional data warehouses that run on a
single, large box and are constrained by a single choke point for data ingest.
• Shared-nothing architectures: A shared-nothing architecture ensures there is no single
point of failure in some analytic database environments. In these cases, each node operates
independently of the others, so if one machine fails, the others keep running. This is
particularly important in MPP environments where, with hundreds of machines processing
data in parallel, the occasional failure of one or more machines is inevitable.
Prominent vendors that offer MPP and shared-nothing architectures are Teradata, Netezza
and EMC Greenplum. Teradata is well established and arguably the best of the data warehouse
appliances; performance degradation is minimal even with an exponential rise in the number
of users and the workload. However, it is costly for small organizations, and some developers
find it difficult to understand its system-level functionality. Netezza seems to be a good fit for
mid-sized organizations or departments in terms of cost; it has reasonably good
performance, and it is easy to use and to load data into. However, there seems to be slight
performance degradation as the number of users and concurrent queries increases, and there
also seems to be a lack of a mature toolset for operating on the database. EMC Greenplum,
on the other hand, is one of the new generation of databases that integrate with
some of the revolutionary big data approaches. It promises many things and appears to
try very hard to deliver all of the new features.
• Columnar architectures: Rather than storing and processing data in rows, as is typical with
most relational databases, most massively parallel analytic databases employ columnar
architectures. In columnar environments, only the columns that contain the data necessary to
answer a given query are processed (rather than entire rows of data), resulting in
split-second query results. This also means data does not need to be structured into neat
tables as with traditional relational databases. A brief columnar-storage illustration follows
this list.
• Advanced data compression capabilities: Advanced data compression capabilities allow
analytic databases to ingest and store larger volumes of data than otherwise possible and to
do so with significantly fewer hardware resources than traditional databases. A database with
10-to-1 compression capabilities, for example, can compress 10 terabytes of data down to 1
terabyte. Data compression, and a related technique called data encoding, are critical to
scaling to massive volumes of data efficiently.
A prominent vendor that offers columnar and advanced data compression capabilities is
Vertica. Vertica seems to be the cheapest among the MPP appliances described above.
It is a highly compressed database that automatically chooses the data compression
algorithm, and it automates some system maintenance activities such as purging
logically deleted data. However, there seems to be a lack of GUI tools in Vertica, and it has
very limited workload management options. Moreover, Vertica depends heavily on
projections to deliver performance for different scenarios, and it seems not very suitable for
general-purpose users with a large number of concurrent sessions querying the database.
Other established systems offering columnar MPP capabilities are InfiniDB from Calpont
(which seems to be purpose-built for analytical workloads, transactional support and bulk load
operations), ParAccel from Actian (which seems to offer fast, in-memory performance and a
sophisticated big data analytics platform) and Aster from Teradata (which seems to be the first
to embed SQL and big data analytic processing to allow deeper insights on multi-structured
data sources with high performance and scalability).
• Commodity hardware: Most massively parallel analytic databases run on off-the-shelf
commodity hardware from Dell, IBM and others, so they can scale out in a cost-effective
manner.
• In-memory data processing: Some massively parallel analytic databases use dynamic RAM
and/or flash for some real-time data processing. Some are fully in-memory, while others use
a hybrid approach that blends less expensive but lower-performing disk-based storage for
stale data with DRAM or flash for frequently accessed data.
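The columnar-storage and compression ideas above can be illustrated, independently of any particular MPP vendor, with the Parquet columnar file format; the sketch below assumes the pyarrow library is installed, and the table and column names are hypothetical.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# A small illustrative table; Parquet stores and compresses each column separately.
table = pa.table({
    "customer_id": list(range(1000)),
    "region": ["EMEA", "AMER"] * 500,
    "revenue": [float(i) for i in range(1000)],
})

# Write with a compression codec enabled, as a columnar analytic database would.
pq.write_table(table, "sales.parquet", compression="snappy")

# A columnar read touches only the columns a query needs, rather than whole rows.
subset = pq.read_table("sales.parquet", columns=["revenue"])
print(pc.sum(subset["revenue"]))

Reading only the revenue column mirrors how a columnar engine answers an aggregate query without scanning entire rows, and the per-column layout is what makes high compression ratios practical.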
Massively parallel analytic databases still face some challenges, however. Most notably, they are
not designed to ingest, process and analyze semi-structured and unstructured data that are largely
responsible for the explosion of data volumes in the big data era.
4. Advent of Big Data/NoSQL/Hadoop and Advanced Analytics
The advent of the Web, mobile devices and social networking technologies has caused a
fundamental change to the nature of data. Unlike the traditional corporate data that is centralized,
highly structured and easily manageable, this so-called Big Data is highly distributed, loosely
structured, and increasingly large in volume.
There are a number of approaches to processing and analyzing big data, although most have some
common characteristics, such as: taking advantage of commodity hardware to enable scale-out,
parallel processing techniques; employing non-relational data storage capabilities to process
unstructured and semi-structured data; and applying advanced analytics and data visualization
technology to big data to convey insights to end-users. In addition to the MPP real-time EDW
solutions, two big data approaches that may transform the areas of business analytics and data
management are NoSQL and Apache Hadoop. They are described in some detail below.
4.1. NoSQL
A new style of database called NoSQL (Not Only SQL) has recently emerged to process large
volumes of multi-structured data. NoSQL databases primarily aim at serving up discrete data
stored among large volumes of multi-structured data to end-user and automated big data
applications. This capability is mostly lacking from relational database technology, which simply
cannot maintain needed application performance levels at big data scale.
Some examples of NoSQL databases include:
• HBase
• Cassandra
• Redis
• Voldemort
• MongoDB
• Neo4j
• CouchDB
The downside of most NoSQL databases is that they trade compliance with the ACID (Atomicity,
Consistency, Isolation, Durability) properties typically provided by relational databases for
performance and scalability. Many also lack mature data management and monitoring tools.
Both these shortcomings are in the process of being overcome by both the open source NoSQL
communities and a handful of vendors (such as DataStax, Sqrrl, 10gen, Aerospike and
Couchbase) that are attempting to commercialize the various NoSQL databases.
Table 2 shows a vendor-specific comparison of the NoSQL products currently offered on the
market. The table lists four different types of NoSQL databases. For each database type,
examples of database vendors are provided, as well as information on the best fit for a type, its
data storage, data access and integration methods.
Database type: Key Value Store
Database vendors: Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak
Best fit for: Content caching, logging
Data storage method: A hash table with unique keys using pointers to items of data
Data access method: Get, put, and delete operations based on a primary key
Integration methods: REST (XML and JSON) and HTTP API interfaces

Database type: Column Family Store
Database vendors: Cassandra, HBase
Best fit for: Distributed file systems
Data storage method: Keys that point to multiple columns, arranged by column family
Data access method: Table operations without joins (joins must be handled by the application)
Integration methods: Java, REST and Thrift API interfaces, full Hadoop integration

Database type: Document Store
Database vendors: CouchDB, MongoDB
Best fit for: Web applications
Data storage method: Similar to key-value stores, but using semi-structured JSON documents containing nested values associated with each key
Data access method: Inverted index on any attribute, allowing for fast full-text search using rich, document-based queries
Integration methods: REST (XML and JSON) and HTTP API interfaces

Database type: Graph Database
Database vendors: Neo4J, InfoGrid, Infinite Graph
Best fit for: Social networking, recommendations
Data storage method: Flexible graph model that can scale across multiple machines
Data access method: Property graph, using node, relationship and property methods
Integration methods: REST (XML and JSON) and HTTP API interfaces

Table 2. A vendor-specific comparison of NoSQL products
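As a brief illustration of the key-value access pattern summarized above (put, get and delete operations keyed on a primary key), the following minimal sketch uses the redis-py client; it assumes a Redis server is reachable on localhost, and the session key and payload shown are hypothetical.

import redis

# Connect to a locally running Redis server (assumed available for this sketch).
r = redis.Redis(host="localhost", port=6379)

# Put: store a value under a unique key, here a hypothetical web-session record.
r.set("session:42", '{"user": "alice", "cart_items": 3}')

# Get: retrieve the value by its primary key (returned as bytes).
print(r.get("session:42"))

# Delete: remove the key once the session expires.
r.delete("session:42")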
4.2. Apache Hadoop
Apache Hadoop (http://hadoop.apache.org/) has been the driving force behind the growth of the
big data industry. Hadoop brings the ability to cheaply process very large amounts of data,
regardless of its structure. Existing EDWs and relational databases excel at processing structured
data and can store massive amounts of data. However, this requirement for structure restricts the
kinds of data that can be processed, and it also imposes an inertia that makes data warehouses
unsuited for agile exploration of massive heterogeneous data. The amount of effort required to
warehouse data often means that valuable data sources in organizations are never mined. This is
where Hadoop can make a big difference. Also, as the Hadoop project matures, it continually
acquires components that enhance its usability and functionality. The name Hadoop has therefore
come to represent an entire ecosystem, as described below.
4.2.1. Lower Hadoop levels: HDFS and MapReduce
At the core of Hadoop is a framework called MapReduce, which is what drives most of today’s
big data processing. The important innovation of MapReduce, originally created at Google, is the
ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the
computation solves the issue of data too large to fit onto a single machine. When this technique
is combined with commodity Linux servers, the result is a cost-effective alternative to massive
computing arrays.
For the MapReduce computation to take place, each server must have access to the data. This is
the role of HDFS, the Hadoop Distributed File System. Servers in a Hadoop cluster can fail
without aborting the computation process, because HDFS replicates data with redundancy across
the cluster. On completion of a calculation, a node writes its results back into HDFS.
There are no restrictions on the data that HDFS stores. Data may be unstructured and
schemaless. By contrast, relational databases require that data be structured and schemas be
defined before storing the data. With HDFS, making sense of the data is the responsibility of the
developer’s code.
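To make the MapReduce pattern concrete, below is a minimal word-count job written for Hadoop Streaming, which allows the map and reduce steps to be ordinary scripts that read from standard input and write to standard output; the script name, HDFS paths and launch command shown in the comments are illustrative assumptions rather than part of any particular distribution.

#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count: run as 'wordcount.py map' or 'wordcount.py reduce'."""
import sys


def mapper():
    # Map step: emit a (word, 1) pair for every word in the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Reduce step: Hadoop delivers keys in sorted order, so counts for a word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{current_count}")
            current_count = 0
        current_word = word
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# An illustrative launch (the streaming jar location varies by distribution):
# hadoop jar hadoop-streaming-*.jar -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /data/raw_text -output /data/word_counts

HDFS holds both the input splits read by the mappers and the final reducer output, which is why no explicit file handling appears in the script.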
4.2.2. Programming and DW levels: Pig and Hive
Programming Hadoop at the MapReduce level requires working with the Java APIs, and
manually loading data files into HDFS. However, working directly with Java APIs can be
tedious and error prone, and it also restricts usage of Hadoop to Java programmers. Hadoop thus
offers two solutions for making Hadoop programming easier.
• Pig is a programming language that simplifies the common tasks of working with Hadoop:
loading data, expressing transformations on the data, and storing the final results. Pig's
built-in operations can make sense of semi-structured data, such as log files, and the language
is extensible using Java to add support for custom data types and transformations.
• Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in
HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig,
Hive's core capabilities are extensible.
Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing
tasks, with predominantly static structure and the need for frequent analysis. Hive’s closeness to
SQL makes it an ideal point of integration between Hadoop and other business intelligence tools.
Pig, on the other hand, gives the developer more agility for the exploration of large datasets,
allowing the development of succinct scripts for transforming data flows for incorporation into
larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct use of Hadoop’s Java APIs. As
such, Pig’s intended audience remains primarily the software developer.
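As a minimal sketch of the Hive usage just described, the snippet below submits a HiveQL aggregation from Python by shelling out to the Hive command-line client with its -e option; the web_logs table and its columns are hypothetical, and the hive client is assumed to be installed on the node where this runs.

import subprocess

# Hypothetical HiveQL query over a table that Hive has superimposed on files in HDFS.
query = """
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2013-10-18'
GROUP BY page
ORDER BY hits DESC
LIMIT 10
"""

# 'hive -e' executes a single quoted HiveQL statement and writes the result to stdout.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
print(result.stdout)

The same SQL-like statement could equally be issued from a BI tool over JDBC or ODBC, which is the integration point discussed in the next subsection.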
4.2.3. Data access levels: HBase, Sqoop and Flume
At its heart, Hadoop is a batch-oriented system. Data are loaded into HDFS, processed, and then
retrieved. This is somewhat of a computing throwback, as interactive and random access to
data is often required. HBase, a column-oriented NoSQL database that runs on top of HDFS, is
used to address this issue. Modeled after Google's BigTable, the goal of HBase is to
host billions of rows of data for rapid access. MapReduce can then use HBase as both a source
and a destination for its computations, while Hive and Pig can be used in combination with
HBase. Some HBase use cases include logging, counting and storing time-series data.
Improved interoperability between Hadoop and the rest of the data world is provided by Sqoop
and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop, either
directly into HDFS or into Hive. Flume is designed to import streaming flows of log data directly
into HDFS. Hive’s SQL friendliness means that it can be used as a point of integration with the
variety of database tools capable of making connections via JDBC or ODBC database drivers.
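As a minimal sketch of the Sqoop import pattern mentioned above, the following Python wrapper builds and runs a sqoop import command; the JDBC connection string, credentials, table name and HDFS target directory are all hypothetical, and Sqoop is assumed to be installed on the cluster.

import subprocess

# Pull a relational table into HDFS; adding '--hive-import' would additionally
# register the imported data as a Hive table. Password handling is omitted here.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--username", "etl_user",
    "--table", "orders",                        # table to copy into Hadoop
    "--target-dir", "/staging/orders",          # HDFS landing directory
    "--num-mappers", "4",                       # degree of parallelism for the import
]
subprocess.run(sqoop_import, check=True)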
4.3. Advanced analytics with Hadoop
The technology around Hadoop is advancing rapidly, becoming both more powerful and easier to
implement and manage. An ecosystem of vendors (start-ups like Cloudera and Hortonworks as
well as established companies like IBM and Microsoft) is working to offer
commercial, enterprise-ready Hadoop distributions, tools and services to make deploying and
managing the technology a practical reality for the traditional enterprise. However, this does not
imply that big data technologies will replace traditional enterprise data warehouses; rather, the
increasing trend is for the two to exist in parallel.
Indeed, the traditional data warehouse still plays a vital role in the business. Financial analysis
and other applications associated with the EDW are as important as ever, and the EDW itself will
be a source of some of the data used in big data projects and will likely receive data from the
results of advanced analysis projects. Accordingly, Hadoop and big data can be seen as an
expansion rather than a replacement for EDW and RDBMS data engines.
5. How can Big Data/NoSQL/Hadoop co-exist with EDW?
The first step toward successful Hadoop/EDW integration is to determine where Hadoop fits in
the data warehouse architecture. As Hadoop is a family of products, each with multiple
capabilities, there are multiple areas in data warehouse architectures where Hadoop products can
contribute. Hadoop seems most compelling as a data platform for capturing and storing big data
within an extended DW environment, in addition to processing that data for analytic purposes on
other platforms. This approach allows firms to protect their investment in their existing EDW
infrastructure and also extend it to accommodate the big data environment.
The most prominent roles for Hadoop in EDW architectures are as follows.
• Data staging. A considerable amount of data is processed in an EDW's staging area to
prepare source data for specific uses (reporting, analytics) and for loading into specific
databases (EDWs, data marts). Much of this processing is done by homegrown ETL tools.
Hadoop allows organizations to deploy an extremely scalable and economical ETL
environment. For example, one of the most popular ETL use cases is offloading heavy
transformations, the "T" in ETL, from the data warehouse into Hadoop. The rationale is that,
for years, organizations have struggled to scale traditional ETL architectures. Specifically,
many data integration platforms pushed the transformations down to the data warehouse,
which is why today data integration in EDW architectures drives up to 80% of database
capacity and resources, resulting in unsustainable spending and ongoing maintenance efforts,
as well as poor user query performance. By shifting the "T" to Hadoop, organizations can
dramatically reduce costs and free up database capacity and resources for faster user query
performance.
• Data archiving. Traditionally, enterprises had three options when it came to archiving data:
leave it within a relational database, move it to an offline storage library, or purge it.
Hadoop's scalability and low cost enable organizations to keep all data forever in a readily
accessible online environment.
• Schema flexibility. Relational DBMSs (used in data warehouse implementations) are well
equipped for storing highly structured data, from ERP, CRM and other operational databases,
to stable semi-structured data (XML, JSON). As a complement, Hadoop can quickly and
easily ingest any data format, including evolving schemas (as in A/B and multivariate tests on
a website) and no schema (audio, video, images).
• Processing flexibility. Hadoop's NoSQL approach is a more natural framework for
manipulating non-traditional data types and enabling procedural processing, valuable in use
cases such as time-series analysis and gap recognition. Hadoop also supports a variety of
programming languages, thus providing more capabilities than SQL alone.
One way to augment and enhance the EDW in an organization with a Hadoop/big data cluster is
as follows:
• Continue to store summary structured data from OLTP and back-office systems in the
EDW.
• Store unstructured data that does not fit nicely into tables in Hadoop/NoSQL. This means
all the communication with customers from phone logs, customer feedback, GPS locations,
photos, tweets, emails, text messages, etc. can be stored in Hadoop.
• Correlate data in the EDW with the data in the Hadoop cluster to get better insight about
customers, products, equipment, etc. Organizations can then run ad-hoc analytics, clustering
and targeting models against this correlated data in Hadoop, which is otherwise
computationally very intensive; a minimal sketch of such a correlation query follows.
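The sketch below illustrates such a correlation query, issued from Python through the PyHive client against HiveServer2; the host name, the customers table (assumed to have been brought over from the EDW, for example via Sqoop) and the clickstream table are all hypothetical.

from pyhive import hive  # assumes PyHive is installed and HiveServer2 is reachable

conn = hive.Connection(host="hadoop-edge-node", port=10000, username="analyst")
cursor = conn.cursor()

# Join warehouse-sourced customer attributes with raw clickstream data held in Hadoop.
cursor.execute("""
    SELECT c.customer_segment,
           COUNT(DISTINCT k.session_id) AS sessions,
           SUM(k.page_views)            AS page_views
    FROM customers c
    JOIN clickstream k ON c.customer_id = k.customer_id
    GROUP BY c.customer_segment
""")

for segment, sessions, page_views in cursor.fetchall():
    print(segment, sessions, page_views)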
Figure 2 shows an enterprise data architecture, reflecting one perspective on data sources, data
sharing, and the diversity of data discovery platforms in a combined EDW/Hadoop environment.
Figure 2. Enterprise data architecture using a combined EDW/big data environment
Two example use cases showcasing such an integrated EDW/Hadoop environment are as follows.
1. A major brokerage firm uses Hadoop to preprocess raw click streams generated by customers
using its website. Processing these click streams provides valuable insights into customer
preferences, which are passed to a data warehouse. The data warehouse then couples these
customer preferences with marketing campaigns and recommendation engines to offer
investment suggestions and analysis to consumers.
2. An eCommerce service uses Hadoop for machine learning to detect fraudulent supplier
websites. The fraudulent sites exhibit patterns that Hadoop uses to produce a predictive
model. The model is copied into the data warehouse where it is used to find sales activity that
matches the pattern. Once found, that supplier is investigated and potentially discontinued.
To summarize, complex Hadoop jobs can use the data warehouse as a data source,
simultaneously leveraging the massively parallel capabilities of the two systems. Inevitably,
visionary companies will take this step to achieve competitive advantages.
6. Use cases and best practices
The following use cases show some best practices for how Hadoop can coexist with an EDW.
1. Use case from Dell big-data solutions - Find out how to use Hadoop technology and an
enterprise data warehouse in performing Web page analytics:
http://www.dell.com/learn/ba/en/bapad1/videos~en/documents~hadoop-edw-web-pageanalytics.aspx?c=ba&l=en&s=biz&cs=bapad1&delphi:gr=true
2. Use cases from Cloudera - Ten Common Hadoopable Problems (Real-World Hadoop Use
Cases):
http://blog.cloudera.com/wpcontent/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
3. The Evolution of the Enterprise Data Warehouse, starring Hadoop:
http://data-informed.com/the-evolution-of-the-enterprise-data-warehouse-starring-hadoop/
7. Final recommendations on Hadoop/EDW Integration
Hadoop provides a new, complementary approach to traditional data warehousing that helps
deliver on some of the most difficult challenges of enterprise data warehouses. By making it
easier to gather and analyze data, it may help move the spotlight away from the technology
towards some important limitations on today’s business intelligence efforts, such as information
culture and the limited ability of many people to actually use information to make the right
decisions.
Enterprises across all industries should evaluate current and potential big data use cases and
engage the big data community to understand the latest technological developments. Some
recommendations on how to integrate Hadoop/EDW include:
• Work with the community, like-minded organizations and vendors to identify areas where
big data can provide business value;
• Consider the level of big data skills within your organization to determine whether you are
in a position to begin experimenting with big data approaches, such as Hadoop;
• Engage both IT and the business to develop a plan to integrate big data tools, technology and
approaches into your existing EDW infrastructure – specifically, begin to cultivate a
data-driven culture among employees at all levels and encourage data experimentation;
• Embrace an open, rather than proprietary, approach, to give customers the flexibility needed
to experiment with new big data technologies and tools; and, most importantly,
• Listen and respond to customer feedback as big data deployments mature and grow.
Web Resources used for this white paper
1. Phillip Russom. “Where Hadoop fits in your Data Warehouse architecture”, TDWI, 2013.
http://tdwi.org/research/2013/07/tdwi-checklist-report-where-hadoop-fits-in-your-datawarehouse-architecture.aspx
2. Jeff Kelly. “Big Data: Hadoop, Business Analytics and Beyond”, Wikibon, 2013.
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
3. Edd Dumbill. “What is Apache Hadoop?”, Strata, 2012.
http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html
4. Amr Awadallah and Dan Graham. “Hadoop and the Data Warehouse: When to use which”,
Cloudera and Teradata, 2012.
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Hadoop_and_the_Data_Warehouse_Whitepaper.pdf