Data Integration for Big Data
Pierre Skowronski, Prague, 23 April 2013

IT is struggling with the cost of Big Data
• Growing data volume is quickly consuming capacity
• Need to onboard, store, and process new types of data
• High expense and lack of big data skills

Prove the Value with Big Data – Deliver Value Along the Way
• Cost: lower big data project costs (helps self-fund big data projects)
• Risk: minimize the risk of new technologies (design once, deploy anywhere)
• Delivery: innovate faster with big data (onboard, discover, operationalize)

INTRODUCING THE INFORMATICA POWERCENTER BIG DATA EDITION

PowerCenter Big Data Edition: Lower Costs
• Optimize processing with low-cost commodity hardware
• Increase productivity up to 5x
Diagram: data sources (transactions/OLTP/OLAP, documents and emails, social media and web logs, machine/device/scientific data) processed on a traditional grid and loaded into the EDW, ODS and MDM.

Hadoop complements the existing infrastructure on low-cost commodity hardware.

5x better productivity for similar performance
Two finance-domain projects (clusters of 3 and 10 nodes) compared Informatica mappings with expert hand-coding:
• Cleanse, transform, sort, group: 40% faster than PIG
• Extract, process, load: 50% faster than PIG
• Extract, process, load: 20% slower than PIG
In the worst case only 20% slower than hand-coding, mostly equal or faster; development took 1 week with Informatica versus 5–6 weeks of hand-coding.

PowerCenter Big Data Edition: Minimize Risk
• Quickly staff projects with trained data integration experts
• Design once and deploy anywhere: traditional grid, or pushdown to an RDBMS or DW appliance
• Deploy on-premise or in the cloud

Graphical Processing Logic: Test on Native, Deploy on Hadoop
Example mapping (from the slide, sketched in code below): separate completed records from partial records, select the incomplete partial records, aggregate them, and sort the completed and partial-completed records by calling number.

Run it simply on Hadoop
• Choose the execution environment
• Press Run
• View the generated Hive query
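The deck does not show the generated code, but a minimal Python sketch of the record flow described on the mapping slide might look like the following. Field names such as calling_number and complete are assumptions; in PowerCenter the logic is modeled graphically and, when deployed on Hadoop, translated into Hive/MapReduce jobs rather than written by hand.

```python
# Minimal sketch of the mapping shown on the slide, assuming call-detail
# records with a completion flag and a calling number (illustrative only).
from collections import defaultdict

records = [
    {"calling_number": "420601111222", "duration": 120, "complete": True},
    {"calling_number": "420601111222", "duration": 30,  "complete": False},
    {"calling_number": "420602333444", "duration": 45,  "complete": False},
    {"calling_number": "420602333444", "duration": 15,  "complete": False},
    {"calling_number": "420603555666", "duration": 300, "complete": True},
]

# Router step: separate completed records from partial records.
completed = [r for r in records if r["complete"]]
partial   = [r for r in records if not r["complete"]]

# Aggregator step: roll up the partial records per calling number.
totals = defaultdict(int)
for r in partial:
    totals[r["calling_number"]] += r["duration"]
aggregated_partials = [
    {"calling_number": num, "duration": dur, "complete": False}
    for num, dur in totals.items()
]

# Sorter/union step: merge completed and aggregated partial records,
# ordered by calling number.
result = sorted(completed + aggregated_partials,
                key=lambda r: r["calling_number"])

for row in result:
    print(row)
```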
Minimize Risk with Informatica Partners and a Certified Developer Community
• Technology partners and global systems integrators: 9,000+ trained developers bringing expertise and best practices ("Achieving Operational Efficiency With Informatica")
• Informatica developers: 45,000+ developers in the Informatica TechNet, and 3x more developers than any other vendor*
Chart: number of developer resumes per vendor – Ab Initio, Business Objects, IBM, Informatica.
* Source: U.S. resume search on dice.com, December 2008

WHAT ARE CUSTOMERS DOING WITH INFORMATICA AND BIG DATA?

Large Global Financial Institution: Lower Costs of Big Data Projects
Saved $20M plus $2–3M ongoing through archiving and optimization.
The Challenge: a data warehouse exploding with over 200 TB of data and generating up to 5 million queries a day, impacting query performance.
The Solution: Diagram – ERP, CRM, custom interaction data and user activity feeding the EDW and business reports, with inactive data moved to an archive.
The Result:
• Saved 100 TB of space over the past 2½ years
• Reduced a re-architecture project from 6 months to 2 weeks
• Improved performance by 25%
• Return on investment in less than 6 months

Large Global Financial Institution: Lower Costs of Big Data Projects
The Challenge: increasing demand for faster data-driven decision making and analytics as data volumes and processing loads rapidly increase.
The Solution: Diagram – multiple RDBMS sources and web logs integrated through a traditional grid into datamarts and a data warehouse, with a planned phase 2.
The Result:
• Near real-time datamarts
• Cost-effectively scale performance
• Lower hardware costs
• Increased agility by standardizing on one data integration platform

Large Government Agency: Flexible Architecture to Support Rapidly Changing Business Needs
The Challenge: data volumes growing 3–5 times over the next 2–3 years.
The Solution: Diagram – mainframe, RDBMS and unstructured data sources integrated through a traditional grid and data virtualization into the EDW, a phase 2 DW, and business reports.
The Result:
• Manage data integration and load of 10+ billion records from multiple disparate data sources
• Flexible data integration architecture to support changing business requirements in a heterogeneous data management environment

Why PowerCenter Big Data Edition
• Repeatability: predictable, repeatable deployments and methodology
• Reuse of existing assets: apply existing integration logic to load data to and from Hadoop, and reuse existing data quality rules to validate Hadoop data (see the sketch below)
• Reuse of existing skills: enable ETL developers to leverage the power of Hadoop
• Governance: enforce and validate data security, data quality and regulatory policies
• Manageability
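As an illustration of the data quality point above, a hypothetical, minimal Python sketch of a reusable validation rule: the field names and the rule itself are assumptions, not an Informatica API, but the idea is that the same rule can be applied to records whether they are staged in a relational database or stored on Hadoop.

```python
# Hypothetical reusable data quality rule (illustrative field names only).
import re

PHONE_PATTERN = re.compile(r"^\d{9,15}$")

def validate_record(record):
    """Return a list of rule violations for one customer record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if not PHONE_PATTERN.match(record.get("phone", "")):
        errors.append("invalid phone number")
    return errors

# The same rule can run against rows read from an RDBMS or from HDFS files.
rows = [
    {"customer_id": "C001", "phone": "420601111222"},
    {"customer_id": "",     "phone": "not-a-number"},
]

for row in rows:
    problems = validate_record(row)
    status = "OK" if not problems else "; ".join(problems)
    print(row["customer_id"] or "<missing>", "->", status)
```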