The Brite Group Inc.

How can Big Data/Hadoop co-exist with your Enterprise Data Warehouse?
White paper on Big Data and EDW architecture and integration strategy

The Brite Group Inc.
10/18/2013

Abstract

Organizations continue to rely on data warehouses to manage structured and operational data. These data warehouses give business analysts the ability to analyze key data and trends, and they enable senior leadership to make intelligent decisions about the business. However, the advent of social media and internet-based data sources, commonly referred to as big data, not only challenges the role of the traditional data warehouse in analyzing data from these diverse sources, but also exposes limitations in the software and hardware platforms used to build traditional data warehouse solutions.

Getting full business intelligence (BI) value out of big data requires some adjustments to best practices and tools for enterprise data warehouses (EDWs). For example, deciding which of the big data sources (such as web sites, social media, robotics, mobile devices, and sensors) are useful for BI/EDW purposes pushes BI professionals to think in new terms. Big data from these sources ranges from structured to semi-structured to unstructured, and most EDWs are not designed to store and manage this full range of data. Likewise, some of the new big data sources feed data relentlessly in real time, whereas the average EDW is not designed for such real-time feeds. Furthermore, most of the business value coming from big data is derived from advanced analytics that combine traditional enterprise data with the new data sources. Accordingly, companies will need to modify their best practices for EDW data integration, data quality, and data modeling in order to take advantage of big data, specifically by making some changes to their existing infrastructure, tools and processes to integrate big data into their current environment.

The data warehouse and big data environments are converging into a hybrid architecture. In this hybrid model, the highly structured, optimized operational data remains in the tightly controlled data warehouse, while the data that is highly distributed and subject to change in real time is managed by a Hadoop-based infrastructure. Increasingly, organizations understand that they have a business requirement to combine traditional data warehouses and their historical business data sources with less structured big data sources. A hybrid model, supporting both traditional and big data sources, can thus help accomplish these business goals.

1. Role of an Enterprise Data Warehouse in an organization

Traditionally, data processing for analytic purposes follows a fairly static workflow (shown in Figure 1). Through the regular course of business, enterprises first create modest amounts of structured data with stable data models via enterprise applications such as storefronts, order processing, CRM, ERP and billing systems. Data integration tools are then used to extract, transform and load the data from enterprise applications and transactional databases into a staging area, where data standardization and normalization occur so that the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine occurs on a scheduled basis, usually daily or weekly and sometimes more frequently.

Figure 1. Role of a traditional EDW in an organization
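To make the scheduled extract-transform-load routine described above concrete, the following is a minimal Java sketch of a nightly batch job, assuming JDBC-accessible source and warehouse databases. The connection URLs, credentials, table and column names are illustrative placeholders, not any particular product's schema.

import java.sql.*;

/** Minimal nightly ETL sketch: extract yesterday's orders from an
 *  operational system, apply a simple transformation, and load the
 *  result into a warehouse staging table. URLs, credentials and
 *  table names are placeholders for the example. */
public class NightlyOrderEtl {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                 "jdbc:postgresql://oltp-host/orders", "etl_user", "secret");
             Connection dwh = DriverManager.getConnection(
                 "jdbc:postgresql://edw-host/warehouse", "etl_user", "secret")) {

            // Extract: pull the previous day's orders from the source system.
            PreparedStatement extract = src.prepareStatement(
                "SELECT order_id, customer_id, amount, currency, order_ts " +
                "FROM orders WHERE order_ts >= CURRENT_DATE - INTERVAL '1 day'");

            // Load target: a staging table in the warehouse.
            PreparedStatement load = dwh.prepareStatement(
                "INSERT INTO stg_orders (order_id, customer_id, amount_usd, order_date) " +
                "VALUES (?, ?, ?, ?)");

            try (ResultSet rs = extract.executeQuery()) {
                while (rs.next()) {
                    // Transform: normalize the currency and truncate the timestamp to a date.
                    double amountUsd = toUsd(rs.getDouble("amount"), rs.getString("currency"));
                    load.setLong(1, rs.getLong("order_id"));
                    load.setLong(2, rs.getLong("customer_id"));
                    load.setDouble(3, amountUsd);
                    load.setDate(4, new Date(rs.getTimestamp("order_ts").getTime()));
                    load.addBatch();
                }
            }
            load.executeBatch();   // Load: write the cleansed rows in one batch.
        }
    }

    // Toy currency normalization; a real job would use reference data.
    private static double toUsd(double amount, String currency) {
        return "EUR".equals(currency) ? amount * 1.35 : amount;
    }
}

In practice this logic usually lives inside a graphical data integration tool rather than hand-written code, but the extract, transform and load steps are the same.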
From there, data warehouse and business intelligence engineers create and schedule regular reports to run against the data stored in the warehouse, which are distributed to business management. They also create dashboards and other limited visualization tools for executives and senior management. Business analysts, meanwhile, use data analytics tools to run advanced analytics against the warehouse, or more often against sample data migrated to a local data mart due to size limitations. Non-expert business users perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence products. Data volumes in traditional data warehouses rarely exceeded multiple terabytes, as large volumes of data strain warehouse resources and degrade performance.

The job of cleaning and integrating source data, often referred to as Extract, Transform and Load (ETL), is usually done with data integration tools. Using these tools, developers drag and drop predefined transforms onto a workspace to construct graphical ETL workflows that clean and integrate structured transaction and master data (customers, products, assets, suppliers, etc.) before loading this data into data warehouses. Over the years these tools have improved their support for more data sources, change data capture, data cleansing and performance. Historically, however, these tools have supported only structured data, primarily from internal systems.

2. Challenges in using traditional EDW data analysis

In the last few years, new internal and external data sources have emerged that businesses now want to analyze to produce insights beyond what they already know. In particular, social network interaction data, profile data and social relationships in a social graph are getting attention from marketers interested in everything from sentiment to likes, dislikes, followers and influencers. Indeed, each Facebook update, tweet, blog post and comment creates multiple new data points, which can be structured, semi-structured or unstructured. Billions of online purchases, stock trades and other transactions happen every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit card companies, credit agencies and others. Electronic devices of all sorts, including servers and other IT hardware, smart energy meters and temperature sensors, create semi-structured log data that records every action. Within the enterprise, web logs are growing rapidly as customers switch to online channels as their preferred way of interacting and transacting business. This in turn generates much larger volumes of structured, semi-structured and unstructured data.

All of this new activity is giving rise to a new and emerging area of technology associated with analytics: big data. In the big data environment, extreme transaction processing, larger data volumes, more variety in the data types being analyzed, and data generation occurring at increasingly rapid rates are all issues that need to be supported if companies are to derive value from these new data sources. The cost of processing potentially terabytes of data also needs to be kept to a minimum to stop the total cost of ownership of ETL processing from spiraling out of control.
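To illustrate the kind of semi-structured source described above, the following toy Java snippet parses a single web-server access-log line (the Apache "combined" format is assumed) into the handful of fields a clickstream analysis might keep. It is this kind of custom, code-level parsing that row-and-column ETL tooling was not originally designed for.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy parser for one line of a web-server access log (Apache
 *  "combined" format assumed). Illustrates the custom parsing that
 *  semi-structured sources require before they fit EDW rows. */
public class AccessLogLine {
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String line = "192.0.2.10 - alice [18/Oct/2013:10:15:32 -0400] "
            + "\"GET /products/42 HTTP/1.1\" 200 5120 "
            + "\"http://example.com/home\" \"Mozilla/5.0\"";

        Matcher m = COMBINED.matcher(line);
        if (m.matches()) {
            // A few of the fields a clickstream analysis might keep.
            System.out.println("client   = " + m.group(1));
            System.out.println("user     = " + m.group(2));
            System.out.println("time     = " + m.group(3));
            System.out.println("resource = " + m.group(5));
            System.out.println("status   = " + m.group(6));
            System.out.println("referrer = " + m.group(8));
        } else {
            // Real logs contain malformed lines; big data pipelines must tolerate them.
            System.out.println("unparseable line, routed to an error file");
        }
    }
}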
BI and analytics platforms have been tremendously effective at using structured data, such as that found in CRM applications. But in the big data era, the BI challenge has changed dramatically, in terms of both goals and execution, as described below.

Volume: The volume of data has grown by orders of magnitude. According to the International Data Corporation (IDC), the entire digital universe held about 130 billion gigabytes in 2005 and is projected to reach 40 trillion gigabytes by 2020 (IDC, "The Digital Universe in 2020", http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf). Today's enterprise environments routinely contain petabytes of data.

Variety: The variety of data has evolved from the traditional structured datasets of ERP and CRM systems to data gathered from user interactions on the Web (clickstreams, search queries, and social media) and mobile user activity, including location-based information.

Velocity: The velocity of data accumulation and change is accelerating, driven by an expanding universe of wired and wireless devices, including Web-enabled sensors and Web-based applications.

Veracity: Veracity refers to the fact that data must be verifiable, in terms of both accuracy and context, before it is used to predict business value. For example, an innovative business may want to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to provide additional offers to that customer. It is therefore necessary to identify the right amount and types of data that can be analyzed in real time to influence business outcomes.

This revolutionary shift places significant new demands on data storage and analytical software, and poses new challenges for BI and database professionals. It also creates powerful opportunities for discovering and implementing new strategies that generate competitive advantage. Realizing these opportunities requires two things: the technological capacity to gather and store big data, and new tools for turning that data into insights and, ultimately, value.

3. Advent of Real Time EDW solutions

Unlike traditional data warehouses, massively parallel analytic databases are capable of quickly ingesting large amounts of mainly structured data with minimal data modeling required, and they can scale out to accommodate multiple terabytes and sometimes petabytes of data. Most importantly for end users, massively parallel analytic databases support near real-time results to complex SQL queries. Fundamental characteristics of a massively parallel analytic database include the following.

Massively parallel processing (MPP) capabilities: Massively parallel processing, or MPP, allows data to be ingested, processed and queried on multiple machines simultaneously. The result is significantly faster performance than traditional data warehouses that run on a single, large box and are constrained by a single choke point for data ingest.

Shared-nothing architectures: A shared-nothing architecture ensures there is no single point of failure in an analytic database environment. Each node operates independently of the others, so if one machine fails, the others keep running. This is particularly important in MPP environments where, with hundreds of machines processing data in parallel, the occasional failure of one or more machines is inevitable.
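The following toy Java sketch illustrates the scatter-gather principle behind MPP: each simulated node scans only its own shard of the data and computes a partial aggregate, and a coordinator combines the partial results. It illustrates the concept only, not any vendor's engine; the shard sizes and values are invented.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

/** Toy scatter-gather aggregation: each "node" owns a shard of the data
 *  and computes a partial SUM; a coordinator combines the partials. */
public class ScatterGatherSum {
    public static void main(String[] args) throws Exception {
        int nodes = 4;
        long rowsPerNode = 2_500_000;

        ExecutorService cluster = Executors.newFixedThreadPool(nodes);
        List<Future<Long>> partials = new ArrayList<>();

        // Scatter: every node scans only its own shard, in parallel.
        for (int n = 0; n < nodes; n++) {
            final long seed = n;
            partials.add(cluster.submit(() -> {
                long sum = 0;
                for (long row = 0; row < rowsPerNode; row++) {
                    sum += (seed * rowsPerNode + row) % 100;  // stand-in for a column value
                }
                return sum;
            }));
        }

        // Gather: the coordinator merges the partial aggregates.
        long total = 0;
        for (Future<Long> partial : partials) {
            total += partial.get();
        }
        cluster.shutdown();
        System.out.println("SUM over " + (nodes * rowsPerNode) + " rows = " + total);
    }
}

In a real MPP database the "nodes" are separate machines that each own a slice of every table, which is also why the shared-nothing design matters: losing one worker costs only that worker's shard, not the whole computation.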
Prominent vendors that offer MPP and shared-nothing architectures are Teradata, Netezza and EMC Greenplum. Teradata is well established and arguably the leader in data warehouse appliances; performance degradation is minimal even as the number of users and the workload grow sharply. However, it is costly for small organizations, and some developers find its system-level functionality difficult to understand. Netezza appears to be a good fit for mid-sized organizations or departments in terms of cost; it delivers reasonably good performance and is easy to use and to load data into. However, there seems to be slight performance degradation as the number of users and concurrent queries increases, and it also appears to lack a mature toolset for operating on the database. EMC Greenplum, on the other hand, is one of the newer-generation databases that integrates with some of the emerging big data approaches; it promises a great deal and appears to be working hard to deliver all of these new features.

Columnar architectures: Rather than storing and processing data in rows, as is typical of most relational databases, most massively parallel analytic databases employ columnar architectures. In columnar environments, only the columns that contain the data needed to answer a given query are processed (rather than entire rows of data), resulting in split-second query results. This also means data does not need to be structured into neat tables as with traditional relational databases. (A small illustration of the columnar idea follows this list.)

Advanced data compression capabilities: Advanced data compression capabilities allow analytic databases to ingest and store larger volumes of data than would otherwise be possible, and to do so with significantly fewer hardware resources than traditional databases. A database with 10-to-1 compression capabilities, for example, can compress 10 terabytes of data down to 1 terabyte. Data compression, and a related technique called data encoding, are critical to scaling to massive volumes of data efficiently.

A prominent vendor that offers columnar and advanced data compression capabilities is Vertica. Vertica appears to be the cheapest of the MPP appliances described above. It compresses data heavily by automatically choosing the compression algorithm, and it automates some system maintenance activities such as purging logically deleted data. However, Vertica seems to lack GUI tools and has very limited workload management options. Moreover, Vertica depends heavily on projections to deliver performance for different scenarios, and it appears less suitable for general-purpose use with a large number of concurrent sessions querying the database.

Other established systems offering columnar MPP capabilities are InfiniDB from Calpont (which appears to be purpose-built for analytical workloads, transactional support and bulk load operations), ParAccel from Actian (which appears to offer fast, in-memory performance and a sophisticated big data analytics platform) and Aster from Teradata (which appears to be the first to embed SQL and big data analytic processing, allowing deeper insights on multi-structured data sources with high performance and scalability).

Commodity hardware: Most massively parallel analytic databases run on off-the-shelf commodity hardware from Dell, IBM and others, so they can scale out in a cost-effective manner.

In-memory data processing: Some massively parallel analytic databases use dynamic RAM and/or flash for some real-time data processing. Some are fully in-memory, while others use a hybrid approach that blends less expensive but lower-performing disk-based storage for stale data with DRAM or flash for frequently accessed data.
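As promised above, here is a small, self-contained Java illustration of the columnar idea: summing a single column out of twenty touches only that column's contiguous values in a column-oriented layout, while a row-oriented layout visits every record. The table shape and values are invented for the illustration.

/** Toy contrast between a row-oriented and a column-oriented layout.
 *  Summing one column out of many touches far less data when that
 *  column's values are stored contiguously. Illustration only. */
public class RowVsColumnScan {
    public static void main(String[] args) {
        int rows = 200_000;
        int columns = 20;

        // Row store: each record holds all 20 column values side by side.
        double[][] rowStore = new double[rows][columns];
        // Column store: each column's values are stored contiguously.
        double[][] columnStore = new double[columns][rows];

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columns; c++) {
                double v = (r * 31 + c) % 97;
                rowStore[r][c] = v;
                columnStore[c][r] = v;
            }
        }

        // Query: SELECT SUM(col_5) FROM t
        // Row layout: every record is visited although 19 of its 20 values are irrelevant.
        double sumRow = 0;
        for (int r = 0; r < rows; r++) {
            sumRow += rowStore[r][5];
        }

        // Column layout: only the single array holding column 5 is scanned.
        double sumCol = 0;
        double[] col5 = columnStore[5];
        for (int r = 0; r < rows; r++) {
            sumCol += col5[r];
        }

        System.out.println("row-store sum    = " + sumRow);
        System.out.println("column-store sum = " + sumCol);
    }
}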
Massively parallel analytic databases still face some challenges, however. Most notably, they are not designed to ingest, process and analyze the semi-structured and unstructured data that are largely responsible for the explosion of data volumes in the big data era.

4. Advent of Big Data/NoSQL/Hadoop and Advanced Analytics

The advent of the Web, mobile devices and social networking technologies has caused a fundamental change in the nature of data. Unlike traditional corporate data, which is centralized, highly structured and easily manageable, this so-called big data is highly distributed, loosely structured and increasingly large in volume. There are a number of approaches to processing and analyzing big data, although most share some common characteristics: taking advantage of commodity hardware to enable scale-out, parallel processing techniques; employing non-relational data storage capabilities to process unstructured and semi-structured data; and applying advanced analytics and data visualization technology to convey insights to end users.

In addition to the MPP real-time EDW solutions, two big data approaches that may transform the areas of business analytics and data management are NoSQL and Apache Hadoop. They are described in some detail below.

4.1. NoSQL

A new style of database called NoSQL (Not Only SQL) has emerged to process large volumes of multi-structured data. NoSQL databases primarily aim at serving up discrete data, stored among large volumes of multi-structured data, to end-user and automated big data applications. This capability is mostly lacking from relational database technology, which simply cannot maintain the needed application performance levels at big data scale. Examples of NoSQL databases include HBase, Cassandra, Redis, Voldemort, MongoDB, Neo4j and CouchDB.

The downside of most NoSQL databases is that they trade the ACID (Atomicity, Consistency, Isolation, Durability) compliance typical of relational databases for performance and scalability. Many also lack mature data management and monitoring tools. Both of these shortcomings are being addressed by the open source NoSQL communities and by a handful of vendors (such as DataStax, Sqrrl, 10gen, Aerospike and Couchbase) that are attempting to commercialize the various NoSQL databases.
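As a minimal illustration of the key-value style of NoSQL access described above, the sketch below stores and retrieves one discrete piece of customer data in Redis through the Jedis Java client. It assumes a Redis server on localhost:6379 and a recent Jedis version; the key name and value are illustrative.

import redis.clients.jedis.Jedis;

/** Minimal key-value store interaction, using Redis through the Jedis
 *  client as an example. Assumes a Redis server on localhost:6379;
 *  key names and values are illustrative. */
public class KeyValueExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Put: store a discrete piece of profile data under a unique key.
            jedis.set("customer:42:last_login", "2013-10-18T10:15:32Z");

            // Get: serve it back to an application by primary key.
            String lastLogin = jedis.get("customer:42:last_login");
            System.out.println("last login = " + lastLogin);

            // Delete: the third basic operation of a key-value store.
            jedis.del("customer:42:last_login");
        }
    }
}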
Table 2 shows a vendor-specific comparison of the NoSQL products currently offered on the market. The table lists four different types of NoSQL databases. For each database type, examples of database vendors are provided, as well as information on the best fit for that type and its data storage, data access and integration methods.

Key Value Store
- Database vendors: Redis, Voldemort, Oracle BDB, Amazon SimpleDB, Riak
- Best fit for: Content caching, logging
- Data storage method: A hash table with unique keys using pointers to items of data
- Data access methods: Get, put and delete operations based on a primary key
- Integration methods: REST (XML and JSON) and HTTP API interfaces

Column Family Store
- Database vendors: Cassandra, HBase
- Best fit for: Distributed file systems
- Data storage method: Keys that point to multiple columns, arranged by column family
- Data access methods: Table operations without joins (joins must be handled by the application)
- Integration methods: Java, REST and Thrift API interfaces, full Hadoop integration

Document Store
- Database vendors: CouchDB, MongoDB
- Best fit for: Web applications
- Data storage method: Similar to key-value stores, but using semi-structured JSON documents containing nested values associated with each key
- Data access methods: Inverted index on any attribute, allowing fast full-text search using rich, document-based queries
- Integration methods: REST (XML and JSON) and HTTP API interfaces

Graph Database
- Database vendors: Neo4j, InfoGrid, Infinite Graph
- Best fit for: Social networking, recommendations
- Data storage method: A flexible graph model that can scale across multiple machines
- Data access methods: Property graph traversal, using node, relationship and property methods
- Integration methods: REST (XML and JSON) and HTTP API interfaces

Table 2. A vendor-specific comparison of NoSQL products

4.2. Apache Hadoop

Apache Hadoop (http://hadoop.apache.org/) has been the driving force behind the growth of the big data industry. Hadoop brings the ability to cheaply process very large amounts of data, regardless of its structure. Existing EDWs and relational databases excel at processing structured data and can store massive amounts of data. However, this requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited to agile exploration of massive heterogeneous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference. Also, as the Hadoop project matures, it continues to acquire components that enhance its usability and functionality, so the name Hadoop has come to represent an entire ecosystem, as described below.

4.2.1. Lower Hadoop levels: HDFS and MapReduce

At the core of Hadoop is a framework called MapReduce, which drives most of today's big data processing. The key innovation of MapReduce, originally created at Google, is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the problem of data too large to fit onto a single machine, and when this technique is combined with commodity Linux servers the result is a cost-effective alternative to massive computing arrays.

For the MapReduce computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System. HDFS replicates data redundantly across the cluster, so servers in a Hadoop cluster can fail without aborting the computation; on completion of a calculation, a node writes its results back into HDFS. There are no restrictions on the data that HDFS stores: data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before the data is stored. With HDFS, making sense of the data is the responsibility of the developer's code.
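The canonical illustration of the MapReduce model described above is the word-count job below, written against Hadoop's Java MapReduce API. The map step runs in parallel on the nodes holding the input blocks in HDFS and emits (word, 1) pairs; the reduce step sums the counts for each word. Input and output HDFS directories are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** The canonical MapReduce example: count word occurrences across the
 *  files in an HDFS input directory. */
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in this line of input.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the partial counts produced by all mappers for this word.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job is typically launched with the hadoop jar command against input and output directories in HDFS.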
4.2.2. Programming and DW levels: Pig and Hive

Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS. Working directly with the Java APIs can be tedious and error prone, and it restricts the use of Hadoop to Java programmers. Hadoop therefore offers two higher-level tools that make Hadoop programming easier.

Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive's core capabilities are extensible.

Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive's closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools. Pig, on the other hand, gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to drastically cut the amount of code needed compared to direct use of Hadoop's Java APIs. As such, Pig's intended audience remains primarily the software developer.

4.2.3. Data access levels: HBase, Sqoop and Flume

At its heart, Hadoop is a batch-oriented system: data are loaded into HDFS, processed, and then retrieved. This is somewhat of a computing throwback, since interactive and random access to data is often required. HBase, a column-oriented NoSQL database that runs on top of HDFS, addresses this need. Modeled after Google's BigTable, the goal of HBase is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase. Typical HBase use cases include logging, counting and storing time-series data.

Improved interoperability between Hadoop and the rest of the data world is provided by Sqoop and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop, either directly into HDFS or into Hive. Flume is designed to import streaming flows of log data directly into HDFS. In addition, Hive's SQL friendliness means that it can serve as a point of integration with the wide variety of database tools capable of making connections via JDBC or ODBC database drivers.
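As a minimal sketch of the random, interactive access HBase adds on top of HDFS, the following Java snippet writes one customer-interaction row and reads it back by key. It assumes the HBase 1.x client API, a cluster reachable through hbase-site.xml on the classpath, and an existing table named 'interactions' with a column family 'd'; the row key and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Minimal random write/read against HBase, which provides the
 *  interactive access that batch-oriented HDFS lacks. Assumes an
 *  existing table 'interactions' with column family 'd'. */
public class HBaseInteractionStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("interactions"))) {

            // Write: one customer interaction, keyed by customer id and timestamp.
            Put put = new Put(Bytes.toBytes("cust42#2013-10-18T10:15:32Z"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("channel"), Bytes.toBytes("web"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("page"), Bytes.toBytes("/products/42"));
            table.put(put);

            // Read: random access to the same row by key.
            Result result = table.get(new Get(Bytes.toBytes("cust42#2013-10-18T10:15:32Z")));
            String page = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("page")));
            System.out.println("page visited = " + page);
        }
    }
}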
4.3. Advanced analytics with Hadoop

The technology around Hadoop is advancing rapidly, becoming both more powerful and easier to implement and manage. An ecosystem of vendors, from start-ups like Cloudera and Hortonworks to well-established companies like IBM and Microsoft, is working to offer commercial, enterprise-ready Hadoop distributions, tools and services that make deploying and managing the technology a practical reality for the traditional enterprise. However, this does not imply that big data technologies will replace traditional enterprise data warehouses; rather, the increasing trend is for them to exist in parallel.

Indeed, the traditional data warehouse still plays a vital role in the business. Financial analysis and other applications associated with the EDW are as important as ever, and the EDW itself will be a source of some of the data used in big data projects and will likely receive data from the results of advanced analysis projects. Accordingly, Hadoop and big data can be seen as an expansion of, rather than a replacement for, EDW and RDBMS data engines.

5. How can Big Data/NoSQL/Hadoop co-exist with EDW?

The first step toward successful Hadoop/EDW integration is to determine where Hadoop fits in the data warehouse architecture. As Hadoop is a family of products, each with multiple capabilities, there are multiple areas in data warehouse architectures where Hadoop products can contribute. Hadoop seems most compelling as a data platform for capturing and storing big data within an extended DW environment, in addition to processing that data for analytic purposes on other platforms. This approach allows firms to protect their investment in their existing EDW infrastructure while extending it to accommodate the big data environment. The most prominent roles for Hadoop in EDW architectures are as follows.

Data staging. A considerable amount of data is processed in an EDW's staging area to prepare source data for specific uses (reporting, analytics) and for loading into specific databases (EDWs, data marts). Much of this processing is done by homegrown ETL tools. Hadoop allows organizations to deploy an extremely scalable and economical ETL environment. For example, one of the most popular ETL use cases is offloading heavy transformations, the "T" in ETL, from the data warehouse into Hadoop (see the sketch after this list). The rationale is that, for years, organizations have struggled to scale traditional ETL architectures. Specifically, many data integration platforms pushed the transformations down to the data warehouse, which is why data integration in EDW architectures today drives up to 80% of database capacity and resources, resulting in unsustainable spending, ongoing maintenance effort and poor user query performance. By shifting the "T" to Hadoop, organizations can dramatically reduce costs and free up database capacity and resources for faster user query performance.

Data archiving. Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to an offline storage library, or purge it. Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment.

Schema flexibility. Relational DBMSs (used in data warehouse implementations) are well equipped to store highly structured data, from ERP, CRM and other operational databases, as well as stable semi-structured data (XML, JSON). As a complement, Hadoop can quickly and easily ingest any data format, including data with an evolving schema (as in A/B and multivariate tests on a website) and data with no schema (audio, video, images).

Processing flexibility. Hadoop's NoSQL approach is a more natural framework for manipulating non-traditional data types and enabling procedural processing, which is valuable in use cases such as time-series analysis and gap recognition. Hadoop also supports a variety of programming languages, thus providing more capabilities than SQL alone.
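The sketch below, referenced from the Data staging item above, shows one way the "T" of ETL might be offloaded to Hadoop: a heavy aggregation over raw click events is expressed in HiveQL and submitted through HiveServer2's JDBC interface, so the work runs on the Hadoop cluster rather than in the warehouse. The endpoint, credentials and the raw_clickstream and daily_page_views table and column names are assumptions for the example; the small summary result could then be exported to the EDW, for instance with Sqoop.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Minimal sketch of offloading the "T" of ETL to Hadoop through
 *  HiveServer2's JDBC interface. Endpoint, credentials and table
 *  names are placeholders for the example. */
public class HiveTransformOffload {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // The heavy transformation runs on the Hadoop cluster, not in the warehouse:
            // aggregate billions of raw click events into a small daily summary.
            stmt.execute("CREATE TABLE IF NOT EXISTS daily_page_views "
                + "(view_date STRING, page STRING, views BIGINT)");

            stmt.execute("INSERT OVERWRITE TABLE daily_page_views "
                + "SELECT to_date(event_ts), page, COUNT(*) "
                + "FROM raw_clickstream "
                + "WHERE to_date(event_ts) = '2013-10-17' "
                + "GROUP BY to_date(event_ts), page");
        }
    }
}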
One way to augment and enhance the EDW in an organization with a Hadoop/big data cluster is as follows:

Continue to store summary structured data from OLTP and back-office systems in the EDW.

Store in Hadoop/NoSQL the unstructured data that does not fit nicely into tables. This means all the communication with customers, such as phone logs, customer feedback, GPS locations, photos, tweets, emails and text messages, can be stored in Hadoop.

Correlate the data in the EDW with the data in the Hadoop cluster to gain better insight about customers, products, equipment and so on. Organizations can then run ad-hoc analytics as well as clustering and targeting models against this correlated data in Hadoop, which is otherwise computationally very intensive.

Figure 2 shows an enterprise data architecture, reflecting one perspective on data sources, data sharing, and the diversity of data discovery platforms in a combined EDW/Hadoop environment.

Figure 2. Enterprise data architecture using a combined EDW/Big data environment

Two example use cases showcasing such an integrated EDW/Hadoop environment are as follows.

1. A major brokerage firm uses Hadoop to preprocess raw click streams generated by customers using its website. Processing these click streams provides valuable insight into customer preferences, which are passed to a data warehouse. The data warehouse then couples these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to consumers.

2. An eCommerce service uses Hadoop for machine learning to detect fraudulent supplier websites. The fraudulent sites exhibit patterns that Hadoop uses to produce a predictive model. The model is copied into the data warehouse, where it is used to find sales activity that matches the pattern. Once found, that supplier is investigated and potentially discontinued.

To summarize, complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of both systems. Inevitably, visionary companies will take this step to achieve competitive advantage.

6. Use cases and best practices

The following use cases show some best practices for how Hadoop can coexist with an EDW.

1. Use case from Dell big-data solutions - find out how to use Hadoop technology and an enterprise data warehouse in performing web page analytics: http://www.dell.com/learn/ba/en/bapad1/videos~en/documents~hadoop-edw-web-pageanalytics.aspx?c=ba&l=en&s=biz&cs=bapad1&delphi:gr=true

2. Use cases from Cloudera - Ten Common Hadoopable Problems (Real-World Hadoop Use Cases): http://blog.cloudera.com/wpcontent/uploads/2011/03/ten_common_hadoopable_problems_final.pdf

3. The Evolution of the Enterprise Data Warehouse, starring Hadoop: http://data-informed.com/the-evolution-of-the-enterprise-data-warehouse-starring-hadoop/

7. Final recommendations on Hadoop/EDW Integration

Hadoop provides a new, complementary approach to traditional data warehousing that helps address some of the most difficult challenges of enterprise data warehouses. By making it easier to gather and analyze data, it may also help move the spotlight away from technology and toward some important limitations of today's business intelligence efforts, such as information culture and the limited ability of many people to actually use information to make the right decisions.
Enterprises across all industries should evaluate current and potential big data use cases and engage the big data community to understand the latest technological developments. Some recommendations on how to integrate Hadoop with the EDW include:

Work with the community, like-minded organizations and vendors to identify areas where big data can provide business value;

Consider the level of big data skills within your organization to determine whether you are in a position to begin experimenting with big data approaches, such as Hadoop;

Engage both IT and the business to develop a plan for integrating big data tools, technology and approaches into your existing EDW infrastructure - specifically, begin to cultivate a data-driven culture among employees at all levels and encourage data experimentation;

Embrace an open, rather than proprietary, approach, to give customers the flexibility needed to experiment with new big data technologies and tools; and, most importantly,

Listen and respond to customer feedback as big data deployments mature and grow.

Web Resources used for this white paper

1. Philip Russom. "Where Hadoop Fits in Your Data Warehouse Architecture", TDWI, 2013. http://tdwi.org/research/2013/07/tdwi-checklist-report-where-hadoop-fits-in-your-datawarehouse-architecture.aspx

2. Jeff Kelly. "Big Data: Hadoop, Business Analytics and Beyond", Wikibon, 2013. http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond

3. Edd Dumbill. "What is Apache Hadoop?", Strata, 2012. http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html

4. Amr Awadallah and Dan Graham. "Hadoop and the Data Warehouse: When to Use Which", Cloudera and Teradata, 2012. http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Hadoop_and_the_Data_Warehouse_Whitepaper.pdf