Talend Big Data Partners Playbook
Value Proposition, Qualification, Objection Handling and more…
2013-04 | v1.0

Table Of Contents
1. Big Data Overview
2. Talend Big Data Value Proposition
3. Talend Big Data Products
4. How to Detect/Create/Qualify Opportunities
5. Pricing
6. Market Overview
7. Competitive Intelligence
8. Customer Case Studies
9. Partners
10. Glossary/Background

Introduction

Welcome to the Talend Big Data Partner Playbook!

How to use
This document is meant as a reference guide for Talend partners and is confidential. It falls under the Talend non-disclosure agreement signed as part of the standard partner agreement and must not be distributed. It is meant for Talend partners only.

Important note: In several sections of the document you will notice clickable icons. Clicking on them gives you access to more detailed information on the corresponding section. The goal is to keep the main document concise while offering additional information if needed. If clicking an icon produces an error message about a "Word converter", please use the link provided to correct the issue.

1. Big Data Overview

What is big data?
Big data represents a significant paradigm shift in enterprise technology. This advance allows organizations to differentiate themselves by processing data they never thought possible, increasing the speed with which they analyze and act on immense amounts of data. Big data encompasses large, complex and diverse sets of structured and unstructured data that are difficult to process using traditional data management practices and tools. For example, there is an increasing desire to collect call detail records, web logs, data from sensor networks, financial transactions, social media and Internet text, and then analyze them alongside existing data sources. Over 80 percent of the world's data is unstructured, and it is growing at 15 times the rate of structured data. Big data technology is still young and is rooted in open source communities.
Hadoop, the most widely used big data technology, only reached version 1.0 in January 2012.

What are the challenges in implementing big data projects? How can Talend help?
While Hadoop and other big data technologies are standards compliant, they still require a specialized skillset to master and software tools to manage and deploy. Companies need to integrate big data and non-relational (NoSQL) data sources, which requires big data integration tools like Talend. The primary implementation challenges include:

1. Lack of development knowledge and skills
Big data technologies are new and the underlying concepts, such as MapReduce, are complex. In this nascent market, there are limited tools available to aid development and implementation of these projects. You are required to find resources that understand these complexities in order to be successful, but only a handful are available. Compounding this challenge, the technology is not easy to learn.

2. Lack of big data project management
Big data projects at this point are just that, projects. It is early in the adoption process and most organizations are trying to sort out potential value and create an exploratory project or special project team. Typically these projects go unmanaged. As with any corporate data management discipline, however, they will eventually need to comply with established corporate standards and accepted project management norms for organization, deployment and sharing of project artifacts.

3. Poor big data quality can lead to big problems
Depending on the goal of a big data project, poor data quality can have a big impact on effectiveness. It can be argued that inconsistent or invalid data could have exponential impact on analysis in the big data world. As analysis on big data grows, so too will the need for validation, standardization, enrichment and resolution of data. Even identification of linkages can be considered a data quality issue that needs to be resolved for big data.

Talend Big Data is a powerful and versatile solution that simplifies integrating big data technologies and data sources without writing and maintaining complex code.

2. Talend Big Data Value Proposition

Talend presents an intuitive development and project management environment to aid in the deployment of a big data program. It extends typical data management functions into big data with a wide range of functions across integration and data quality. It not only simplifies development but also increases the effectiveness of a big data program.

2.1 Talend's Big Data Strategy
There are four key strategic pillars to the Talend big data strategy.

1. Big Data Integration
Landing big data (large volumes of log files, data from operational systems, social media, sensors or other sources) in Hadoop via HDFS, HBase, Sqoop or Hive is an operational data integration problem. Talend is the link between traditional resources, such as databases, applications and file servers, and these big data technologies. It provides an intuitive set of graphical components and a workspace that allow interaction with a big data source or target without the need to learn and write complicated code (the sketch below shows the kind of hand-written code this replaces). A big data connection is configured graphically, and the underlying code is automatically generated and can then be deployed as a service, executable or stand-alone job.
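To make "without the need to learn and write complicated code" concrete, here is a minimal, hedged sketch of the kind of hand-written HDFS loading code that a Talend graphical component replaces. It uses the standard Hadoop FileSystem client API; the NameNode address and file paths are illustrative placeholders rather than values from this playbook, and error handling is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: land a local web log file in HDFS by hand.
public class LoadWebLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (hypothetical host and port).
        // On older Hadoop 1.x clusters the property is "fs.default.name".
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into a raw landing directory on the cluster.
        fs.copyFromLocalFile(new Path("/var/log/web/access.log"),
                             new Path("/data/raw/weblogs/access.log"));

        fs.close();
    }
}
```

In Talend, the equivalent step is a drag-and-drop HDFS component configured with the same connection details, so the developer never writes or maintains code like this directly.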
The full set of Talend data integration components (application, database, service and even a master data hub) is available, so that data movement can be orchestrated from almost any source into almost any target. Finally, Talend provides graphical components that enable easy configuration for NoSQL technologies such as MongoDB, Cassandra, Neo4J, Hive and HBase, providing random, real-time read/write, column-oriented access to big data.

2. Big Data Manipulation
There is a range of tools that enable a developer to take advantage of big data parallelization to perform transformations on massive amounts of data. Languages such as Apache Pig and Hive provide scripting to compare, filter, evaluate and group data within an HDFS cluster. Talend abstracts these functions into a set of components that allow such scripts to be defined in a graphical environment, as part of a data flow, so they can be developed quickly and without any knowledge of the underlying language.

3. Big Data Quality
Talend presents data quality functions that take advantage of the massively parallel processing (MPP) environment of Hadoop, and because we rely only on generating native Hadoop code, users can immediately apply data quality across their cluster. It provides explicit functions that use this massively parallel environment to identify duplicate records across these huge data stores in moments, not days. It also extends into profiling big data and other important quality issues, as the Talend data quality functions can be employed for big data tasks. This is a natural extension of Talend's proven, integrated data quality and data integration solution.

4. Big Data Project Management and Governance
While most of the early big data projects are free of explicit project management structure, this will surely change as they become part of the bigger system. With that change, companies will need to wrap standards and procedures around these projects just as they have with data management in the past. Talend provides a complete set of functions for project management. With Talend, the ability to schedule, monitor and deploy any big data job is included, as well as a common repository so that developers can collaborate on and share project metadata and artifacts.

2.2 Elevator Pitch

Talend Big Data is a powerful and versatile open source solution for big data integration that delivers integration at any scale, both technically and economically, enabling profound change throughout businesses by allowing them to unlock the power of their data. Talend provides an easy-to-use graphical development environment that allows interaction with big data sources and targets without the need to write complicated code or learn complex MapReduce techniques. Talend's big data components have been tested and certified to work with leading Hadoop distributions, including Amazon EMR (Elastic MapReduce), Cloudera, Google, Greenplum/Pivotal, Hortonworks and MapR. Talend also provides out-of-the-box support for a range of big data platforms from the leading appliance vendors, including Greenplum, Netezza, Teradata and Vertica.
Talend simplifies the development of big data and facilitates the organization and orchestration required by these projects so that you can focus on the key question: "What use should we make of data, big and small, and how am I going to be the leader in using data to help my business?" Talend provides distinct value to the technical teams that are tasked with big data implementation.

2.3 Key Benefits

Talend Big Data benefits and capabilities include:

Flexible. Deliver solutions for all your needs.
- Provides comprehensive big data support: native support for Hadoop HDFS, HBase, Hive, Pig, Sqoop and BigQuery. Certified for all major Hadoop-based distributions – Amazon EMR, Cloudera, Hortonworks, MapR, Greenplum. Comprehensive support for NoSQL.
- Talend provides the necessary big data functions and extends them with over 450 components that allow integration with nearly any application, warehouse or database. Additionally, you can deploy big data jobs as a service, a self-contained executable or a scheduled task.
- Provides easy-to-use, graphical code-generating tools that simplify big data integration without writing or maintaining complex big data code.
- Reduces time-to-market with drag-and-drop creation and configuration, prebuilt packages and documented examples based on real-world experience. Talend Big Data is the only solution to natively generate MapReduce, Pig and HiveQL code.
- Deploy into production with confidence – rely on enterprise support and services from Talend.
- The world's first ELT mapping tool for Hive – move data from Hive to Hive (this is visionary).

Scalable. Minimize disruption as your business volume increases.
- Rapidly deploy big data jobs on Hadoop.
- Faster performance through Talend big data code generation (i.e. MapReduce, Pig, Hive) that is optimized for these MPP (massively parallel processing) environments, because the data job is 100% Hadoop native.
- Big data quality jobs can be run in parallel in Hadoop.
- Requires no installation (zero install on the Hadoop cluster), adds no performance overhead and is easy to manage.

Open. Improve your productivity as Talend Big Data provides:
- A large collaborative community. Between June 2012 and January 2013 there were over 45,000 downloads of Talend Big Data.
- Software created through open standards and development processes, eliminating vendor lock-in. Talend Big Data is powered by the most widely used open source projects in the Apache community. We certify with Hortonworks, Cloudera and MapR, but we can work with any Hadoop platform and with the applications on those platforms, because we don't rely on any special or proprietary code.
- Ready to start today. Talend Open Studio for Big Data is free to download and use for as long as you want. No budget battles or endless delays – just accessible, reliable open source integration, starting today.
- The ability to administer and manage even the most complex teams and projects whose members have different roles and responsibilities.
- A proven lower TCO based on Talend customer successes.
- The Talend development studio increases developer productivity with a graphical environment that allows developers to drag, drop and configure components to implement big data projects in minutes, not days or weeks. Talend also provides a shared repository so developers can check in/check out metadata and big data artifacts and reuse expertise across the project.
3. Talend Big Data Products

Talend provides three big data products:

1. Talend Open Studio for Big Data combines big data technologies into a unified open source environment, simplifying the loading, extraction, transformation and processing of large and diverse data sets. There are a few differences between this product and Talend Open Studio for Data Integration. First, TOS4BD includes big data components for Hadoop (HDFS), HBase, HCatalog, Hive, Sqoop and Pig. Second, it does not include functions for context, metadata manager, business modeler and documentation.

2. Talend Enterprise Big Data is the subscription license version of Open Studio with Gold support and many advanced features, including versioning and a shared repository. A customer will upgrade to this product for all the same reasons one would upgrade from any open source product to Talend commercial products. Currently, most big data projects do not employ much project management; Talend provides these functions in our commercial offering.

3. Talend Platform for Big Data extends the Talend Enterprise Big Data product with all of our data quality features, for example big data profiling and big data matching, and support for the non-big-data features of data integration and data quality. It also provides advanced scalability options and Platinum support.

A complete feature/benefit comparison matrix between Talend's big data products is at: http://www.talend.com/products/big-data/matrix

Many of your opportunities will be upselling TOS users to Talend Enterprise Big Data or Talend Platform for Big Data. In addition to sharing the detailed matrix to highlight differences, the upselling guide shows the salient points.

4. How to Detect/Create/Qualify Opportunities

There are three distinct audiences in the Talend Big Data sales opportunity: the CTO, the big data developer and the data scientist. The majority of people exploring big data are exactly that, explorers. They can see the value but may not be able to see how to get to where they need to be. They are all looking for the "right" use case for their organization. While they all share these common concerns, each has some individual characteristics worth noting.

CTO
Titles: CXO, Line of Business CIO, CTO, Director/VP of IT, Director/VP of Application Architecture, Director of Systems Architecture.
The CTO is looking for resources and a use case for big data. They can see the value and are looking to implement quickly to set differentiation and establish competitive advantage in their marketplace. With Talend the learning curve for big data is shortened, so they can use existing resources to get started on big data projects today. The use cases are still building, but the early winners are analytics and storage. For the CTO we need to ask "why big data?"

Development
Titles: Data Stewards working on quality assurance, security and metadata management projects; Developers using Java, Hive, SQL, Pig, MapReduce, Python or other Apache big data tools.
Big data technologies are new and fairly complex. There are not many resources familiar with them, and many developers are trying to learn them quickly. With Talend, big data technologies are simplified or abstracted into intuitive, graphical components that generate the complex code. This eliminates the need to learn the complexities and shortens the learning curve.
Data Scientist
Titles/skills: Data Scientist / Architect – statistics, SQL and Hive programming, scripting language skills, data mining and analysis, business analysis and industry expertise.
The Data Scientist has become the critical link between big data and business value, as they are tasked with analyzing a business problem and deriving a solution that leverages the available data. They are nowhere without the data. Talend simplifies big data technologies so that they can focus on the task at hand, the analysis. Talend is the critical link that supplies data to a BI tool without the complexities of coding the interface.

Big data is an emerging space. Use cases taken from industry sources illustrate the types of situations in which big data is being used today.

4.1 Qualification Questions

To qualify an opportunity, you will need to collect information at all stages of the sales cycle. The list below could be applied to any opportunity, but it is especially valid for big data opportunities, which are usually bigger and more strategic:
- What is the business issue your customer is trying to solve?
- Do they have a compelling event (i.e. a reason for choosing a date for the application to be in production)?
- What is your customer's budget?
- Who is the key decision maker? Who are the influencers?
- What is the customer's buying process?
- What are the customer's decision-making criteria?
- What data integration vendor solution do they use today for ETL?

In addition to those generic questions, you could use the following matrix to identify information more specific to big data opportunities:

BIG DATA NEED
- Does the prospect have a big data business case identified?
- Do they have a problem that can be resolved by big data but are not aware of a solution?

BIG DATA MATURITY
- Does the prospect understand the issues of big data – volume, variety, velocity, complexity?
- Do you or your customer have a strong data architect or data scientist leading the project?
- Are they planning to use Hadoop, NoSQL, or a big data appliance?
- Is there a partner with significant big data expertise involved? Perhaps Cloudera or Greenplum?

VOLUME
- What type and how much data do they have, or will they have in the future (e.g. terabytes of unstructured data)?
- Is the data heavily corporate, social media, HTML or text?

QUALITY
- Do they have a data quality practice today? Have they considered the impact of their big data project on data quality?
- How do you apply data quality to masses of data? Is trending more important than absolutely perfect underlying data?
- Are they looking to standardize their data? Do they augment data with third-party sources?
- Do they have manual processes that could benefit from automation?

GOVERNANCE
- Do they have a data governance team?

REAL-TIME
- Big data is many things, but providing real-time access to data can be tricky [1]. Has the customer considered whether they require real-time data access and how they will provide real-time access to Hadoop?

[1] http://www.odbms.org/blog/2012/09/hadoop-and-nosql-interview-with-j-chris-anderson/

5. Pricing

Please contact your Talend Partner Manager for more details.

6. Market Overview

6.1 What is the market – definition, size and segmentation?

Big data allows organizations to process data they never thought possible and increase the speed with which they analyze and act on immense amounts of data in order to establish differentiation.
With big data, processes are improved and critical decisions can be made with more information. It is changing entire markets and enabling solutions to challenges that had not even been thought of in the past. According to Gartner, in 2013 big data is forecast to drive $34 billion of IT spending and will total $232 billion of IT spending through 2016. Big data currently has the most significant impact in social network analysis and content analytics, with 45 percent of new spending each year. In traditional IT supplier markets, application infrastructure and middleware is most affected (10 percent of new spending each year is influenced by big data in some way) when compared with storage software, database management systems, data integration/quality, business intelligence or supply chain management (SCM). [2]

It is important to note that the current research estimates revenues across some existing companies. This is the nature of a nascent space: the estimates are a bit unreliable, but they are still substantial. The next-generation data warehouse companies are included, and it can be argued that they are not true "big data" companies… yet. Further insight into the current numbers can be seen in the market definition below; as you will see, it is a mix of multiple segments. As defined by Wikibon, the big data market includes the following technologies, tools and services:
- Hadoop distributions, software, subprojects and related hardware;
- Next-generation data warehouses and related hardware;
- Big data analytic platforms and applications;
- Business intelligence, data mining and data visualization platforms and applications as applied to big data;
- Data integration platforms and tools as applied to big data;
- Big data support, training and professional services.

[2] http://www.gartner.com/newsroom/id/2200815

7. Competitive Intelligence

7.1 Key Differentiators and Important Questions to Ask Your Customer

Talend Big Data has several key differentiators, as outlined in this playbook, but here are those that are most significant:

1. Cluster Scale vs. Non-native Engines
Talend holds a unique and technically superior position in that our software generates 100% native Hadoop data jobs. This means that our data jobs can scale with the cluster, without limit. Processing massive data sets is the cornerstone of big data; having a technology that meets this need comes down to either:
- Hand-coding, which means higher developer fees and maintenance costs, or
- Using Talend – re-use existing skills and reduce maintainability issues, as Talend tools abstract the underlying complexity.

Our competitors require that their engine, which was written for a different environment, be installed, configured and managed on an ongoing basis, simply to run a data job on Hadoop. This is not MapReduce, and it does not scale, either technically or economically.
- How many engines do you need to install?
- What pricing model is applied?
- What if you need to scale out the cluster as your data grows over the next 5 years?

With Talend we do not require the customer to install any special Talend software. A major proof point is the fact that the Hortonworks Data Platform embeds and promotes Talend Open Studio for Big Data as its preferred solution for big data integration. Our code generation also delivers the following immediate benefits to our customers:
- Setup fees are zero.
- Talend does not require any pre-installed Talend software on either Hadoop or NoSQL platforms.
- Upgrade and maintenance costs are largely non-existent.
- Our big data jobs are 100% compatible and ready to run, scale and process just about any size of data, all within the cluster.

2. Predictable Costs vs. Runtime-Based Pricing
Unlike competitors, Talend does not charge per runtime or per node for our big data solutions. We also include all connectors, to both enterprise data sources and of course the big data platforms, within the same fee. We do not apply an additional charge based on where the software is hosted – you can run on or off the big data platform for the same fee.
- How many nodes do you expect in your big data cluster? How much will it grow every year? Does your budget include a perpetual increase of the license cost (as data usually increases every year)?
- Have you considered the costs of each connector for your products? Talend's fee is inclusive of all connectors.
- Do you expect to run some jobs inside the cluster and some outside? Have you considered the additional server charges for this setup?
- Do you intend to use data quality features, and how do other vendors charge for this additional functionality?

3. Unified Platform vs. Patchwork of Incomplete Solutions
Talend provides an enterprise integration solution with a complete set of the functions required for successful implementation. There is no extra charge for various connectors or for integration and cleansing functions.
- Have you considered all the technologies you will need for this project? Do you have resources with expertise in all the required functions? How many will you need to learn? Can you share these resources across the products?
- How well do the pieces fit together? Are they all upgraded at the same time?
- What is the Total Cost of Ownership (TCO) for your integration project? Have you considered all of the licenses you will need for Big Data, DI, DQ, MDM, ESB and BPM? Are they provided as one license, by the same vendor, on the same platform?
- Do the other vendors provide a common repository to share project metadata and artifacts?

4. Open Source and Extensible vs. Black-Box Proprietary Solution
Talend Big Data is extensible and open source. This is not black-box proprietary software: you can open it up, investigate it and extend it as necessary. No other vendor provides this.
- Can you customize the other big data solutions you are considering? If there is a function that you need but do not have, can you extend the Studio to include it without waiting for the vendor to make an update?
- Does the other solution have a community of developers who create extensions to the solution and share them with each other? Our partners are also providing connectors for their NoSQL solutions.
- How can you determine whether an issue you have is a bug or an implementation problem? If it is a bug, do you have access to track the fix and install a patch as soon as it is available? Can you investigate the software and possibly fix it yourself?
- Do the other vendors have a vital and active community that provides another level of free support?

5. Big Data Management vs. Simple Hadoop Connectivity
Talend Big Data goes beyond data integration by delivering a real killer app for big data: big data profiling. As identified by our customers, big data presents a great opportunity but also a major challenge. All financial companies are required to meet data compliance and governance obligations.
For example, in the US, banks are required to assert that all customers have well-formatted social security numbers and that no duplicate entries exist. With big data we can upload as much data as we have into Hadoop, but how do we audit the quality of the data when regulators come knocking? That's where Talend big data quality (part of the Talend Platform for Big Data) comes in. Today we offer:
- Profiling of Hive, and the ability to run analyses remotely on Hive databases, leveraging the processing power of the cluster.
- Matching in Hadoop – matching is one of the most computationally intensive functions of data quality, so running it on Hadoop is not only desirable, it is simply mandatory, because (1) the data is already loaded, with no option to download and process it elsewhere, and (2) the cluster's processing power is required to tackle such a large dataset, therefore you must run on the cluster.

8. Customer Case Studies

Customer case studies are made available through Talend.com.

9. Partners

9.1 Summary

Talend has a broad ecosystem of big data partners, which significantly benefits our customers as our products are built to run in big data environments. We also have many established partners using our technology who can assist with implementation and services if needed. The current (April 2013) list of partners includes the following:

Hadoop Distribution Partners – Talend supports all the common Hadoop distributions, across: Amazon EMR, Apache, Cloudera, EMC/Greenplum, Hortonworks, MapR.

Big Data Appliance/Cloud Partners: Greenplum/Pivotal, Netezza, Vertica, Teradata, Google Platform.

NoSQL Partners: DataStax/Cassandra*, 10Gen/MongoDB*, Couchbase*, Redis, Riak, HBase*, Membase, Neo4J*.

* indicates that Talend supported components are available.

10. Glossary/Background

Big data is defined by countless new terms and technologies. Below is a small set of terms that are used within this document.

Cassandra – Apache Cassandra is an open source distributed database. It is a key/value store that provides its own query language and can be tuned to make optimal use of huge commodity server farms. It was originally developed by Facebook to power inbox search.

Hadoop – Hadoop was born because existing approaches were inadequate to process huge amounts of data; it was built to address the challenge of indexing the entire World Wide Web every day. Google published a paradigm called MapReduce in 2004, Hadoop was started as an open source implementation of MapReduce in 2005, and Yahoo! became a major early contributor. While it may have started as a MapReduce implementation, it has extended well beyond this and has transformed into a massive operating system for distributed parallel processing of huge amounts of data. MapReduce was the first way to use this operating system, but it has been joined by many other techniques, such as the Apache Hive and Pig open source projects, which make Hadoop easier to use for particular purposes. Much like any other operating system, Hadoop has the basic constructs needed to perform computing: it has a file system, a language to write programs, a way of managing the distribution of those programs over a distributed cluster, and a way of accepting the results of those programs, ultimately combining them back into one result set.
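To make the map/reduce division of work concrete, the following is a minimal, hedged sketch of the classic word-count job written against the Hadoop MapReduce Java API. It is illustrative only: the input and output HDFS paths are hypothetical placeholders, and error handling is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the "map" step splits each input line into words and emits
// (word, 1) pairs; the "reduce" step sums the counts for each word across the cluster.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get(); // consolidate the per-map counts
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // newer Hadoop releases prefer Job.getInstance(conf, ...)
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative HDFS paths; substitute real cluster paths.
        FileInputFormat.addInputPath(job, new Path("/data/raw/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this trivial job involves a fair amount of boilerplate and cluster-specific configuration; this is the kind of native MapReduce code that Talend generates automatically from its graphical components.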
HBase – HBase is an open source database that runs on top of the Hadoop filesystem (HDFS). It is a columnar database, modeled on Google's Bigtable, that provides fault-tolerant storage and quick access to large quantities of sparse data. It is used by Facebook to serve their messaging system.

HCatalog – Largely developed by Hortonworks and now part of Apache Hadoop, HCatalog addresses the need for metadata describing the structure of the underlying data stored in Hadoop. This makes the development and maintenance of big data applications more efficient.

Hive – Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc query and analysis of large datasets. It provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. It was originally developed by Facebook but is used in production by many companies.

MapReduce – MapReduce is a software framework introduced by Google in 2004. It allows a programmer to express a transformation of data that can be executed on a cluster that may include thousands of computers operating in parallel. At its core, it uses a series of "maps" to divide a problem across multiple parallel servers and then uses a "reduce" to consolidate the responses from each map and produce an answer to the original problem.

NoSQL – (Not only SQL) refers to a large class of data storage mechanisms that differ significantly from the well-known, traditional relational data stores (RDBMS). These technologies implement their own query languages and are typically built on advanced programming structures for key/value relationships, defined objects, tabular methods or tuples. The term is often used to describe the wide range of data stores classified as big data.

Pig – The Apache Pig project is a high-level data-flow programming language and execution framework for creating MapReduce programs used with Hadoop. The abstract language for this platform is called Pig Latin; it abstracts the programming into a notation that makes MapReduce programming similar to SQL for RDBMS systems. Pig Latin is extended using UDFs (User Defined Functions), which the user can write in Java and then call directly from the language; Talend takes advantage of UDFs extensively (a minimal example follows below).
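As an illustration of the UDF extension point mentioned in the Pig entry, here is a minimal, hedged sketch of a Pig UDF written in Java against the org.apache.pig.EvalFunc API, followed (as a comment) by the Pig Latin lines that would register and call it. The jar name, file paths and schema are hypothetical placeholders.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple Pig UDF: upper-cases a single string field.
// Pig calls exec() once for each input tuple where the function is used.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

/*
 Hypothetical Pig Latin usage (jar, paths and schema are placeholders):

   REGISTER my-udfs.jar;
   logs  = LOAD '/data/raw/weblogs' USING PigStorage('\t') AS (user:chararray, url:chararray);
   upper = FOREACH logs GENERATE UpperCase(user), url;
   STORE upper INTO '/data/out/normalized';
*/
```

In a Talend job, this kind of logic is configured in a graphical component rather than hand-written, but the Pig code Talend generates relies on the same UDF mechanism.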