Informatica Corporation Confidential – Do Not Distribute

Data Integration for Big Data
Pierre Skowronski
Prague, 23.04.2013
IT is struggling with the cost of Big Data
• Growing data volume is quickly consuming capacity
• Need to onboard, store, & process new types of data
• High expense and lack of big data skills
Prove the Value with Big Data
Deliver Value Along the Way
Cost: Lower Big Data Project Costs (helps self-fund big data projects)
Risk: Minimize Risk of New Technologies (design once, deploy anywhere)
Delivery: Innovate Faster With Big Data (onboard, discover, operationalize)
INTRODUCING THE INFORMATICA POWERCENTER BIG DATA EDITION
PowerCenter Big Data Edition
Lower Costs
Optimize processing with low-cost commodity hardware
Increase productivity up to 5X
[Diagram labels: Transactions, OLTP, OLAP; Documents and Emails; Social Media, Web Logs; Machine Device, Scientific; Traditional Grid; EDW; ODS; MDM]
Hadoop Complements Existing Infrastructure on low-cost commodity hardware
5x better productivity for similar performance
Project domain: Finance, Finance
Cluster size: 3, 10
Processing, compared to expert hand-coding:
• Cleanse, transform, sort, group: 40% faster than Pig
• Extract, process, load: 50% faster than Pig
• Extract, process, load: 20% slower than Pig
In the worst case, only 20% slower than hand-coding
Mostly equal or faster
Informatica: 1 week vs. hand-coding: 5-6 weeks
PowerCenter Big Data Edition
Minimize Risk
Quickly staff projects with trained data integration experts
Design once and deploy anywhere
Deploy On-Premise or in the Cloud
Traditional Grid
Pushdown to RDBMS or DW Appliance
Graphical Processing Logic
Test on Native, Deploy on Hadoop
[Mapping diagram, step labels:]
• Partial records only
• Separate partial records from completed records
• Completed records only
• Separate incomplete and complete partial records
• Select incomplete partial records
• Aggregate all completed and partial-completed records
• Sort records by calling number
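
The step labels on this slide (separate partial from completed records, select the incomplete partial ones, aggregate the rest, sort by calling number) can be sketched in plain Python. This is only an illustration of the logic, not the PowerCenter mapping itself: the slide does not show the record layout, so the field names `calling_number`, `kind`, `call_closed` and `duration`, the function name `process_call_records`, and the choice of summing durations as the aggregation are all assumptions.

```python
from collections import defaultdict


def process_call_records(records):
    """Split, aggregate, and sort call records.

    Hypothetical record layout (not taken from the slide):
      calling_number: str
      kind:           "completed" or "partial"
      call_closed:    bool, True once a partial call has ended
      duration:       int, seconds
    """
    # Separate partial records from completed records.
    completed = [r for r in records if r["kind"] == "completed"]
    partial = [r for r in records if r["kind"] == "partial"]

    # Separate incomplete and complete partial records.
    partial_completed = [r for r in partial if r["call_closed"]]
    incomplete_partial = [r for r in partial if not r["call_closed"]]

    # Aggregate all completed and partial-completed records.
    # Assumption: the aggregation is a sum of durations per calling number.
    totals = defaultdict(int)
    for r in completed + partial_completed:
        totals[r["calling_number"]] += r["duration"]

    # Sort the aggregated rows by calling number.
    aggregated = sorted(totals.items())
    return aggregated, incomplete_partial


# Made-up sample data, for illustration only.
sample = [
    {"calling_number": "555-0102", "kind": "completed", "call_closed": True, "duration": 60},
    {"calling_number": "555-0101", "kind": "partial", "call_closed": True, "duration": 30},
    {"calling_number": "555-0101", "kind": "partial", "call_closed": True, "duration": 45},
    {"calling_number": "555-0103", "kind": "partial", "call_closed": False, "duration": 10},
]
aggregated, held_back = process_call_records(sample)
```

Here `aggregated` holds one (calling number, total duration) pair per caller, and `held_back` holds the incomplete partial records that were selected out before aggregation.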
Run it simply on Hadoop
Choose execution environment
Press Run
View Hive query
Minimize Risk with Informatica Partners and Certified Developer Community
Achieving Operational Efficiency With Informatica
Technology
Expertise & best practices
Best practices & reusability
Global Systems Integrators
9,000+ trained developers
Informatica Developers
• 45,000+ developers in Informatica TechNet
• 3x more developers than any other vendor*
[Bar chart: People (axis 0 to 1200) by vendor: Ab Initio, Business Objects, IBM, Informatica]
* Source: U.S. resume search on dice.com, December 2008
WHAT ARE CUSTOMERS DOING WITH INFORMATICA AND BIG DATA?
Large Global Financial Institution
Lower Costs of Big Data Projects
Saved $20M + $2-3M On-going by Archiving & Optimization
The Challenge: Data warehouse exploding with over 200TB of data, with up to 5 million queries a day impacting query performance
The Solution
[Diagram labels: ERP, CRM, Custom, Interaction Data, User activity, EDW, Business Reports, Archived Data]
The Result
• Saved 100TBs of space over past 2 ½ years
• Reduced rearchitecture project from 6 months to 2 weeks
• Improved performance by 25%
• Return on investment in less than 6 months
Large Global Financial Institution
Lower Costs of Big Data Projects
The Challenge: Increasing demand for faster data-driven decision making and analytics as data volumes and processing loads rapidly increase
The Solution
[Diagram labels: RDBMS, Near Real-Time, Datamarts, Traditional Grid, Data Warehouse, Web Logs, Phase 2]
The Result
• Cost-effectively scale performance
• Lower hardware costs
• Increased agility by standardizing on one data integration platform
Large Government Agency
Flexible Architecture to Support Rapidly Changing Business Needs
The Challenge: Data volumes growing 3-5 times over the next 2-3 years
The Solution
[Diagram labels: Mainframe, RDBMS, Unstructured Data, Traditional Grid, EDW, DW, Data Virtualization, Business Reports, Phase 2]
The Result
• Manage data integration and load of 10+ billion records from multiple disparate data sources
• Flexible data integration architecture to support changing business requirements in a heterogeneous data management environment
Why PowerCenter Big Data Edition
• Repeatability
  • Predictable, repeatable deployments and methodology
• Reuse of existing assets
  • Apply existing integration logic to load data to/from Hadoop
  • Reuse existing data quality rules to validate Hadoop data
• Reuse of existing skills
  • Enable ETL developers to leverage the power of Hadoop
• Governance
  • Enforce and validate data security, data quality and regulatory policies
  • Manageability