Uploaded by Leela Naga Durga Sri Ram Gidda

dataworlddistilled-bigdata-dwh-bi-v07-10-2019-190715153521

The Data World Distilled:
Understanding how the data world works in the Big Data era
Bill Hayduk
Founder, CEO
QuerySurge ™
a software division of RTTS
About
FACTS
RTTS is the parent company of QuerySurge and began as a consulting firm centered on QA & testing.
RTTS Founded: 1996
Location: New York, NY (Headquarters)
Customer profile: Fortune 1000
Software Offering: QuerySurge (2012)
QuerySurge Partners:
• 11 industry-leading Technology Partners
• 14 global System Integrators
• 22 regional consulting firms (Sales & Consulting Partners)
DWH, BI, Big Data Marketplaces
Data Warehouse Marketplace
“the worldwide data warehouse management software market is forecast
to generate nearly $17 billion in revenue by 2019” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to
$22.8 billion by the end of 2020” - Gartner
Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
Big Data Marketplace
“By the end of 2020, companies will spend > USD $72 billion on Big Data
hardware, software, & professional services” - IDC
Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, HortonWorks, Cloudera, Teradata,
SAP, MongoDB, MapR, DataStax, Snowflake
Fast Facts about Data
• By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services (the current market size is USD $46 billion)
• > 75% of companies are investing or planning to invest in Big Data in the next 2 years
• Professional services represent 43% of the Big Data market (services = USD $31 billion of $72 billion)
The Data World Distilled
• Big Data
• Data Warehouse
• ETL / Data Integration
• BI & Analytics
• Data Quality
• Data Testing
• Data Governance
What is Big Data?
Big Data: defined as too much volume, velocity and variability to work on normal
database architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
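The unit ladder above can be checked with a few lines of Python. This is a minimal sketch that assumes the decimal (factor of 1,000) units used on the slide; the `convert` helper and the `UNITS` list are hypothetical names introduced only for illustration.

```python
# Decimal storage units, as used on the slide (1 PB = 1,000 TB, and so on).
UNITS = ["MB", "GB", "TB", "PB"]


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between decimal storage units; each step is a factor of 1,000."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps


# The slide's "5 petabytes or more" threshold expressed in smaller units.
threshold_tb = convert(5, "PB", "TB")
threshold_gb = convert(5, "PB", "GB")
```

Under these assumptions, 5 petabytes works out to 5,000 terabytes, or 5 million gigabytes.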
“The market for big data is $70 billion and growing by
15% a year.”
- EMC COO Pat Gelsinger
The Big Data Impact
Handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress
Others
Facebook handles 40 billion photos from its user base.
Twitter processes 85 million tweets per day.
Google processes 1 Terabyte per hour.
eBay processes 80 Terabytes per day.
What is Hadoop?
Hadoop is an open source project that develops software for scalable,
distributed computing.
• is a framework that allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
• easily deals with the complexities of high volume, velocity and variability of data
• scales up from single servers to 1,000’s of machines, each offering local
computation and storage.
• detects and handles failures at the application layer
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Top Vendors
“By the end of 2020, companies will spend more than USD $72 billion on
Big Data hardware, software, & professional services” - IDC
Basic Hadoop Architecture
MapReduce – processing part that manages
the programming jobs. (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System) –
stores data on the machines. (a.k.a. Data
Node)
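To make the MapReduce half of this picture concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using a word count as the job. It does not use the Hadoop API; the function names are hypothetical, and in a real cluster the map and reduce tasks would run on many machines next to the data held in HDFS.

```python
from collections import defaultdict
from typing import Iterable


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in order on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big deal"])
```

The point of the split is that each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, which is what lets a framework spread the work across machines.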
[Diagram: one machine running MapReduce (Task Tracker) on top of HDFS (Data Node)]
Basic Hadoop Architecture
(continued)
Cluster
Add more machines for scaling – from 1 to 100 to 1,000
Job Tracker: accepts jobs, assigns tasks, identifies failed machines
Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
[Diagram: a cluster of 12 machines, each running a Task Tracker and a Data Node]
Apache Hive
Apache Hive - a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language
called HiveQL that interacts with the HDFS files
HiveQL statements:
• create
• insert
• update
• delete
• select
[Diagram: HiveQL statements passing through MapReduce (Task Tracker) to HDFS (Data Node)]
What is NoSQL?
A term used to describe high-performance, non-relational databases that provide a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations used in relational databases
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document. Documents can contain
many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections. Graph stores
include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute
name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value
stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store
columns of data together, instead of rows.
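The four database types can be pictured with plain Python structures. This is only an illustrative sketch: no real database client is used, all keys and values are made up, and the wide-column layout follows the slide's simplified description (columns stored together) rather than any specific product's storage format.

```python
# Document database: each key is paired with a nested document.
document_store = {
    "customer:42": {
        "name": "Ada",
        "emails": ["ada@example.com"],
        "address": {"city": "New York", "state": "NY"},
    }
}

# Key-value store: a key and its value, nothing more.
key_value_store = {"session:9f3": "customer:42"}

# Wide-column store (simplified): values are grouped by column, so a query
# over one column does not have to read the other columns.
wide_column_store = {
    "amount": {"row1": 10.0, "row2": 25.5, "row3": 4.5},
    "region": {"row1": "east", "row2": "west", "row3": "east"},
}

# Graph store: nodes plus the connections between them.
graph_store = {"ada": {"follows": ["bill"]}, "bill": {"follows": []}}

# Aggregating one column only touches that column's values.
total_amount = sum(wide_column_store["amount"].values())

# Following a key-value lookup into a nested document.
city = document_store[key_value_store["session:9f3"]]["address"]["city"]
```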
NoSQL versus Hadoop
When to use NoSQL?
• Online real-time processing
• Data set is smaller
• Measured in milliseconds
When to use Hadoop?
• Offline big data processing
• Offline analytics
• Measured in minutes & hours
Source: classpattern.com
NoSQL Example: Use Cases
• ETL from MongoDB
• Data Warehouse
• Batch Aggregation
• ETL to MongoDB
Source: MongoDB, Inc.
What is a Data Warehouse?
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis and security purposes
• contains either raw data or formatted data
• combines data from multiple sources:
  • sales
  • salaries
  • operational data
  • human resource data
  • inventory data
  • web logs
  • social networks
  • Internet text and docs
  • other
[Diagram: source databases (Legacy DB, CRM/ERP DB, Finance DB)]
Data Warehouse - the marketplace
“The worldwide data warehouse management software market is
forecast to generate nearly $17 billion in revenue by 2019”
- Forrester
Data Warehouse size
Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: >20 TB
- Analyst firm Gartner
Leaders in on-premises Data Warehouse Data Management Systems
- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
Data Warehouse - the marketplace
Alternate Delivery Models
Leading Cloud DWHs
Leading Appliance DWHs
An appliance is software and servers optimized together.
[Photo: Oracle founder Larry Ellison with an Exadata appliance]
Data Warehouse - Business Case
Why build a Data Warehouse?
• Data stored in operational systems (OLTP) is not easily accessible
• OLTP systems are not designed for end-user analysis
• The data in OLTP is constantly changing
• OLTP systems may be deficient in historical data
• Diverse forms of data are stored in different platforms and/or dissimilar formats
Data Warehouse - Business Case
The Data Warehouse Business Solution
• Collects data from different sources (other databases, files, web services, etc.)
• Integrates data into logical business areas
• Provides direct access to data with powerful reporting tools (BI)
Data Warehouse - about the data
The Data Warehouse data
• Subject-oriented
• Integrated
• Non-volatile
• Time-variant
Data Integration & the ETL process
ETL = Extract, Transform, Load
Why ETL?
Need to load the data warehouse regularly (daily/weekly) so that it can serve its
purpose of facilitating business analysis.
Extract – pull data from one or more OLTP systems so it can be copied
into the warehouse.
Transform – remove inconsistencies, assemble the data into a common
format, add missing fields, summarize detailed data and
derive new fields to store calculated data.
Load – map the data, then transform and/or load it into the DWH.
The ETL function is either performed by home-grown software that someone wrote or
through commercial software
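The three steps can be sketched as a small home-grown ETL job. This is a minimal illustration, assuming two in-memory SQLite databases as stand-ins for an OLTP source and a warehouse target; the table names, column names, and transformation rules are hypothetical.

```python
import sqlite3


def extract(source: sqlite3.Connection) -> list[tuple]:
    """Extract: read the raw rows from the source system."""
    return source.execute("SELECT id, name, amount_cents FROM orders").fetchall()


def transform(rows: list[tuple]) -> list[tuple]:
    """Transform: put names in a common format and derive a dollar amount."""
    return [(row_id, name.strip().title(), cents / 100) for row_id, name, cents in rows]


def load(target: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load: write the transformed rows into the warehouse table."""
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    target.commit()


# Stand-in OLTP source with inconsistently formatted names.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "  alice smith", 1250), (2, "BOB JONES ", 399)],
)

# Stand-in warehouse target.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE fact_orders (id INTEGER, customer_name TEXT, amount_dollars REAL)"
)

load(target, transform(extract(source)))
loaded = target.execute("SELECT * FROM fact_orders ORDER BY id").fetchall()
```

Commercial ETL tools cover the same three steps, with scheduling, monitoring, and connectors on top.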
The ETL process
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH]
Continuous Integration/ETL solutions - the Marketplace
Leaders in ETL Solutions
(ab initio)
Business Intelligence (BI)
Business Intelligence – What is it?
• Software applications used in spotting, digging out, and
analyzing business data
• BI provides simple access to data that can be used in day-to-day
operations, and integrates data into logical business areas
• BI provides historical, current and predictive views of
business operations
• BI is made up of several related activities, including data
mining, online analytical processing, querying and reporting.
Business Intelligence software is like reporting engines on steroids
BI & Analytics - the marketplace
“The business intelligence (BI) and analytics software market is forecast to
grow to $22.8 billion by the end of 2020”
“The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to
consolidate the market, owning 59 percent of the market share.”
- Analyst firm Gartner
Leaders in BI
- Analyst firm Forrester Research’s ‘Forrester Wave’
Business Intelligence (BI) - Who uses it?
Wal-Mart uses vast amounts of data and category analysis to
dominate the industry.
Amazon and Yahoo follow a "test and learn" approach to
business changes.
Hardee’s, Wendy’s, and T.G.I. Friday’s use BI
to make strategic decisions.
Business Intelligence (BI) & Data Marts
Data Mart
A database that has the same characteristics as a data warehouse, but is
usually smaller and is focused on the data for one division or one
workgroup within an enterprise.
Data marts typically hold aggregated data and some granular data. A data mart is a
subset of the DWH and makes Business Intelligence reporting more efficient. BI tools
sit on top of the data marts.
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DW → ETL Process → Data Mart]
Business Intelligence (BI) & Analytics
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart]
Data Quality Issues
80% of organizations… will underestimate the costs related
to the data acquisition tasks by an average of 50 percent.
46% of companies cite Data Quality as a barrier for
adopting Business Intelligence products.
Data Quality Best Practices boost revenue by 66%.
The average organization loses $14.2 million annually
through poor Data Quality.
Data Quality
Primary Characteristics of Data Quality tools
courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
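Three of these characteristics (profiling, parsing and standardization, matching) can be illustrated on a handful of records. This is a rough sketch with made-up data and hypothetical function names; commercial data quality tools do far more than this.

```python
from collections import Counter

# Made-up customer records with inconsistent formatting and a missing phone.
records = [
    {"name": " Acme Corp ", "phone": "(212) 555-0100"},
    {"name": "ACME CORP", "phone": "212.555.0100"},
    {"name": "Widget LLC", "phone": None},
]


def profile(rows: list[dict], field: str) -> dict:
    """Profiling: count missing values and distinct non-missing values."""
    values = [row[field] for row in rows]
    present = [value for value in values if value is not None]
    return {"missing": len(values) - len(present), "distinct": len(set(present))}


def standardize(row: dict) -> dict:
    """Parsing and standardization: normalize whitespace, casing, phone digits."""
    phone = row["phone"]
    digits = "".join(ch for ch in phone if ch.isdigit()) if phone else None
    return {"name": " ".join(row["name"].split()).upper(), "phone": digits}


def find_matches(rows: list[dict]) -> list[str]:
    """Matching: report standardized names that occur more than once."""
    counts = Counter(row["name"] for row in rows)
    return sorted(name for name, count in counts.items() if count > 1)


clean = [standardize(row) for row in records]
duplicates = find_matches(clean)
phone_profile_before = profile(records, "phone")
phone_profile_after = profile(clean, "phone")
```

Matching only finds the duplicate after standardization, which is why the two steps usually go together.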
Data Quality - the marketplace
“The market for data quality software tools reached $1.61 billion in 2017 (the
most recent year for which Gartner has data), an increase of 11.6% over 2016.
Gartner’s interactions with clients also indicate that demand remains high.”
- Analyst firm Gartner
Leaders in Data Quality
- Analyst firm Gartner’s Magic Quadrant
Data Quality vs. Data Testing
Primary Characteristics of Data Quality tools
courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
Data Verification & Validation?
Primary Characteristics of Data Testing tools
Courtesy of the book "Testing the Data Warehouse Practicum"
▪ Data Completeness
▪ Data Transformation
▪ Regression Testing
▪ Reporting
Data Verification & Validation?
Where Data Testing fits in your data strategy
The Executive Office and Critical Data
CxOs are using Business Intelligence & Analytics to make critical business decisions
– with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through poor Data Quality.”
- Gartner
[Diagram: Data Architecture with typical data issue areas marked; labels include Mainframe, ETL, and Business Intelligence & Analytics]
Key Roles in Building & Testing a Data Store
Data Analyst: Creates data requirements (source-to-target map or mapping doc)
Data Architect: Models and builds data store (Big Data lake, Data Warehouse, etc.)
ETL Developer: Transforms and loads data from sources to target data stores
Data Tester: Validates the data, based on mappings, as it moves and transforms from sources to targets
Data Requirements = Mapping Document
a.k.a. Source-to-Target Map
It’s the critical element required
to efficiently plan the target Data
Stores. It also defines the Extract,
Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ data flow mapping and
✓ data movement requirements.
Mapping Doc specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements.
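One way to picture a mapping document is as structured data that both the ETL developer and the data tester can read. The sketch below is an assumption about how such a document could be encoded, not a standard format; the `Mapping` class, column names, and rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Mapping:
    source_column: str               # source input definition
    target_column: str               # target/output detail
    rule: Callable[[object], object]  # business / data transformation rule
    required: bool = True            # True = absolute, False = optional requirement


MAPPING_DOC = [
    Mapping("cust_nm", "customer_name", lambda value: value.strip().title()),
    Mapping("amt_cents", "amount_dollars", lambda value: value / 100),
    Mapping("notes", "comments", lambda value: value, required=False),
]


def apply_mapping(source_row: dict) -> dict:
    """Build one target row from one source row by following the mapping doc."""
    target_row = {}
    for mapping in MAPPING_DOC:
        value = source_row.get(mapping.source_column)
        if value is None:
            if mapping.required:
                raise ValueError(f"missing required source column: {mapping.source_column}")
            target_row[mapping.target_column] = None
            continue
        target_row[mapping.target_column] = mapping.rule(value)
    return target_row


row = apply_mapping({"cust_nm": " ada lovelace", "amt_cents": 500})
```

Keeping the rules in one place means the same document can drive both the data movement logic and the tests that check it.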
Most Common Data Validation Method
Sampling
• Review Business Rules (i.e. mapping document, data flow mappings)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Export results to 2 Excel files
• Compare a Sampling of results by eye (‘Stare & Compare’)
Issue with Stare & Compare:
Impossible to visually compare billions of data sets
Result: usually less than 1% of data is compared
Example - Current QuerySurge customer
• one test = 100 million rows X 200 columns = 20 billion data sets
• there is no practical way to manually verify (eyeball) this data set
• the client has more than 15,000 total tests
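The alternative to sampling is to let a program compare every row of the source and target query results. The sketch below shows the idea with two in-memory SQLite databases as stand-ins; it is not a description of how QuerySurge is implemented, and the tables, queries, and helper names are hypothetical.

```python
import sqlite3


def fetch_by_key(conn: sqlite3.Connection, query: str) -> dict:
    """Run a test query and index the result rows by their first column."""
    return {row[0]: row[1:] for row in conn.execute(query)}


def compare(source_rows: dict, target_rows: dict) -> dict:
    """Compare every row (not a sample) and list keys that are missing or differ."""
    missing = sorted(source_rows.keys() - target_rows.keys())
    mismatched = sorted(
        key
        for key in source_rows.keys() & target_rows.keys()
        if source_rows[key] != target_rows[key]
    )
    return {"missing_in_target": missing, "mismatched": mismatched}


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 399), (3, 700)])

# The target has one wrong amount (id 2) and one dropped row (id 3).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_orders (id INTEGER, amount_dollars REAL)")
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(1, 12.5), (2, 4.0)])

# The source query applies the mapping rule so both sides are comparable.
result = compare(
    fetch_by_key(source, "SELECT id, amount_cents / 100.0 FROM orders"),
    fetch_by_key(target, "SELECT id, amount_dollars FROM fact_orders"),
)

# Size of the single test quoted above: rows times columns.
values_in_one_test = 100_000_000 * 200
```

A comparison like this scales with compute rather than with human attention, which is the gap that ‘Stare & Compare’ leaves open.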
Data Store Roles, Tasks, & Timelines
Roles and their tasks, in timeline order:
Data Analyst: Determine Requirements → Create & maintain Mapping Document
Data Architect: Review Mapping Document → Model and build target Data Stores → Maintain Target Data Stores
ETL Developer: Review Mapping Document → Build data movement logic → Extract & load data or extract, transform, & load data (iterate)
Data Tester: Review Mapping Document → Create 2 SQL tests for each mapping with SQL editor → Execute tests → Dump results of tests to 2 Excel files → Compare Excel files by eye (iterate)
Huge Risk
About QuerySurge
What is QuerySurge?
QuerySurge is the leading testing solution for
automated validation & testing of Big Data
Use Cases
How QuerySurge Works
QuerySurge connects to any 2 points at one time.
Source Data
• Data Stores: Databases, Data Warehouses, Data Marts
• Big Data stores: Hadoop, NoSQL
• Flat Files: Fixed Width, Delimited, Excel, JSON
Target Data
• Data Warehouses
• Business Intelligence Reports
Other labels on the diagram: XML, Web Services; query languages HQL and SQL
Comparison of every data set: Source Data vs. Target Data
Results – pass/fail
Data Intelligence Reports, Data Health Dashboard, automated email reports
Big Data Process - Developer & Tester
ETL Developer: Codes data movement based on Mapping Requirements
Data Tester: Tests data movement based on Mapping Requirements
[Diagram: data moves via ETL from Source Data across the Big Data lake, Data Warehouse, and Data Mart to BI & Analytics, where the BI Analyst extracts data for reports. Testing Points #1, #2, and #3 sit along the data movement; at Testing Point #4 the Tester tests BI Reports.]
QuerySurge Supports 50+ Data Stores
QuerySurge supports the following data stores…
• Amazon Redshift, Elastic Map Reduce, DynamoDB
• Apache Hadoop/Hive, Spark
• Cassandra
• Cloudera
• Couchbase
• Exasol
• Flat Files (delimited, fixed-width)
• Hortonworks
• IBM (Db2, Netezza, Informix, Big Insights, Cloudant, MDM, Cognos)
• JSON files
• Mainframe
• MAPR
• Micro Focus Vertica
• Microsoft (SQL Server DWH, HDInsight, PDW, SSAS, Excel, Access, SharePoint)
• MongoDB
• Oracle (Oracle DB, MySQL, Exadata, NoSQL, Hadoop)
• Pivotal GreenPlum
• PostgreSQL
• Salesforce
• SAP (HANA, IQ, ASE, SQL Anywhere, Altiscale Data Cloud)
• Snowflake
• Tableau
• Teradata, Aster
• Workday
• XML
…and any other data store
4 main components of successful data governance
1) Data stewardship
Identifying and assigning roles and responsibilities:
- who creates the data,
- who has overall responsibility for the data,
- who uses the data, who routes it,
- who oversees its use.
2) Data classification
Identifying and categorizing data types into groups.
3) Data quality
The process of measuring the reliability of current data sets so that they provide
information that can be used to make organizational decisions.
4) Data management
The process where all of the organization's data governance efforts come together. The
company actively manages its data governance efforts, which involves creating the
architectures and business processes required to properly maintain the organization’s
data through its full lifecycle.
Data Maturity Model - Process
• Patterned after the Capability Maturity Model Integration (CMMI) from the Software Engineering Institute (SEI) at Carnegie Mellon University
• Devised by IBM, along with 55 other companies

• Data Governance is second nature
• ROI for data-related projects is tracked
• Business value of data mgmt is recognized
• Cost of data mgmt is easier to manage
• Costs are reduced as processes become automated
• Further defined value of data for more data elements
• Data Governance methodology is introduced during the planning stages of new projects
• Enterprise data models are documented & published
• Data-related policies become clearer & reflect the organization’s data principles
• Data integration opportunities are better leveraged
• Risk assessment for data integrity & quality becomes part of the organization’s project methodology
• More data-related controls are documented
• Metadata becomes an important part of documenting critical data elements
• Few stable processes exist
• “Just do it” mentality
source: IBM Data Governance Council Maturity Model
Data Governance - the marketplace
“Rapidly increasing growth in data volumes, rising regulatory & compliance
mandates, and enhancing strategic risk management & decision-making are
expected to drive the growth of the data governance market.”
“The data governance market size is expected to grow from $1.31 Billion in 2018
to $3.53 Billion by 2023, at a CAGR of 22.0%.”
- MarketsAndMarkets.com
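The quoted growth rate can be sanity-checked with the standard compound annual growth rate formula. This is a small sketch using only the figures in the quote above; the `cagr` helper is a hypothetical name, and the five-year span (2018 to 2023) is the assumption behind the calculation.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the quote: $1.31 billion in 2018 to $3.53 billion by 2023.
rate = cagr(1.31, 3.53, 2023 - 2018)
```

The result lands close to the quoted 22.0%, so the start value, end value, and rate are consistent with each other.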
Leaders in Data Governance
- The Forrester Wave
The Data World Distilled
the Data World by Top Vendors
Source types: Flat files, Excel, JSON, XML, Web services, Databases
ETL Vendors: Ab Initio, IBM, Informatica, Microsoft, Oracle, SAP, SAS, Talend
Data Warehouse Vendors: Amazon, IBM, Microsoft, Micro Focus, Oracle, SAP, Snowflake, Teradata
Hadoop Vendors: Amazon, Cloudera, Hortonworks, IBM, MAPR, Microsoft
NoSQL Vendors: Amazon, Apache, Cassandra, Couchbase, MongoDB, Oracle
BI Vendors: IBM, Microsoft, MicroStrategy, Qlik, Tableau, SAP, Oracle
Data Quality: Informatica, IBM, Oracle, SAP, SAS, Talend
Data Testing: QuerySurge, Informatica, Tricentis, Data Gaps, IceDQ, Bitwise
Data Governance: Collibra, DATUM, GDE, IBM, Informatica, SAP
[Diagram: Source Data, ETL, Big Data lake, Data Warehouse, Data Mart, ETL, BI & Analytics]
The Data World Distilled:
Understanding how the data world works in the Big Data era
Any questions?
Bill Hayduk
Founder, CEO
QuerySurge ™