The Data World Distilled: Understanding how the data world works in the Big Data era
Bill Hayduk, Founder, CEO
QuerySurge™, a software division of RTTS

About RTTS
RTTS is the parent company of QuerySurge and began as a consulting firm centered on QA & testing.
FACTS
• Founded: 1996
• Location: New York, NY (Headquarters)
• Customer profile: Fortune 1000
• Software offering: QuerySurge (2012)
QuerySurge Partners:
• 11 industry-leading Technology Partners
• 14 global System Integrators
• 22 regional consulting firms (Sales & Consulting Partners)

DWH, BI, Big Data Marketplaces

Data Warehouse Marketplace
“The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon

Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner
Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders

Big Data Marketplace
“By the end of 2020, companies will spend > USD $72 billion on hardware, software, & professional services” - IDC on Big Data
Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, Hortonworks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake.

Fast Facts about Data
• By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services (the current market size is USD $46 billion)
• > 75% of companies are investing or planning to invest in Big Data in the next 2 years
• Professional services represent 43% of the Big Data market (services = USD $31 billion of $72 billion)

The Data World Distilled
• Big Data
• Data Governance
• Data Warehouse
• Data Testing
• ETL/Data Integration
• Data Quality
• BI & Analytics

What is Big Data?
Big Data: defined as too much volume, velocity and variability to work on normal database architectures.
Size: defined as 5 petabytes or more
• 1 petabyte = 1,000 terabytes
• 1,000 terabytes = 1,000,000 gigabytes
• 1,000,000 gigabytes = 1,000,000,000 megabytes
“The market for big data is $70 billion and growing by 15% a year.” - EMC COO Pat Gelsinger

The Big Data Impact
• Handles more than 1 million customer transactions every hour.
• Data is imported into databases that contain > 2.5 petabytes of data.
• That is the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Others:
• Facebook handles 40 billion photos from its user base.
• Twitter processes 85 million tweets per day.
• Google processes 1 terabyte per hour.
• eBay processes 80 terabytes per day.

What is Hadoop?
Hadoop is an open source project that develops software for scalable, distributed computing.
• is a framework for the distributed processing of large data sets across clusters of computers using simple programming models.
• easily deals with the complexities of high volume, velocity and variety of data
• scales from single servers to 1,000s of machines, each offering local computation and storage.
• detects and handles failures at the application layer

Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware

Top Vendors
“By the end of 2020, companies will spend more than USD $72 billion on Big Data hardware, software, & professional services” - IDC

Basic Hadoop Architecture
• MapReduce: the processing part that manages the programming jobs (a.k.a. Task Tracker)
• HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node)
Diagram: each machine runs MapReduce (Task Tracker) on top of HDFS (Data Node).

Basic Hadoop Architecture (continued)
• Cluster: add more machines for scaling, from 1 to 100 to 1,000
• Job Tracker: accepts jobs, assigns tasks, identifies failed machines
• Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
Diagram: a cluster of machines, each with a Task Tracker and a Data Node.

Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:
• create
• insert
• update
• delete
• select
Diagram: HiveQL statements run through MapReduce (Task Tracker) against HDFS (Data Node).

What is NoSQL?
A term used to describe high-performance, non-relational databases that provide a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

NoSQL Database Types
• Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
• Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4j and Giraph.
• Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
• Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

NoSQL versus Hadoop
When to use NoSQL?
• Online real-time processing
• Data set is smaller
• Measured in milliseconds
When to use Hadoop?
• Offline big data processing
• Offline analytics
• Measured in minutes & hours
Source: classpattern.com

NoSQL Example: Use Cases
• ETL from MongoDB
• Data Warehouse
• Batch Aggregation
• ETL to MongoDB
Source: MongoDB, Inc.

What is a Data Warehouse?
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis and security purposes.
• contains either raw data or formatted data
• combines data from multiple sources:
  - sales
  - salaries
  - operational data
  - human resource data
  - inventory data
  - web logs
  - social networks
  - Internet text and docs
  - other
Diagram: source databases (Legacy DB, CRM/ERP DB, Finance DB)

Data Warehouse - the marketplace
“The worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2019” - Forrester
Data Warehouse size
• Small data warehouses: < 5 TB
• Midsize data warehouses: 5 TB - 20 TB
• Large data warehouses: > 20 TB
- Analyst firm Gartner
Leaders in on-premises Data Warehouse Data Management Systems
- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’

Data Warehouse - the marketplace
Alternate Delivery Models
• Leading Cloud DWHs
• Leading Appliance DWHs
An appliance is software and servers optimized together.
Photo: Oracle founder Larry Ellison with an Exadata appliance

Data Warehouse - Business Case
Why build a Data Warehouse?
• Data stored in operational systems (OLTP) is not easily accessible
• OLTP systems are not designed for end-user analysis
• The data in OLTP is constantly changing
• OLTP systems may be deficient in historical data
• Diverse forms of data are stored in different platforms and/or dissimilar formats

Data Warehouse - Business Case
The Data Warehouse Business Solution
• Collects data from different sources (other databases, files, web services, etc.)
• Integrates data into logical business areas
• Provides direct access to data with powerful reporting tools (BI)

Data Warehouse - about the data
The Data Warehouse data is:
• Subject-oriented
• Integrated
• Non-volatile
• Time-variant

Data Integration & the ETL process
ETL = Extract, Transform, Load
Why ETL?
The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
• Extract: pull data from one or more OLTP systems and copy it into the warehouse
• Transform: remove inconsistencies, assemble to a common format, add missing fields, summarize detailed data and derive new fields to store calculated data
• Load: map the data, transform and/or load it into the DWH
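
The three ETL steps can be sketched in a few lines of Python. This is a minimal, home-grown illustration under stated assumptions, not how any commercial ETL tool works: the `orders` source table, the `fact_orders` warehouse table, the sample rows, and the cleanup rules are all hypothetical, and two in-memory SQLite databases stand in for the OLTP system and the DWH.

```python
import sqlite3


def extract(oltp):
    """Extract: read raw order rows from the (hypothetical) OLTP source."""
    return oltp.execute(
        "SELECT id, customer, country, qty, unit_price FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: fix inconsistencies, fill missing fields, derive new fields."""
    cleaned = []
    for order_id, customer, country, qty, unit_price in rows:
        # Common format for country codes; a missing value becomes 'UNKNOWN'.
        country = (country or "UNKNOWN").strip().upper()
        # Derived field that stores a calculated value.
        total = round(qty * unit_price, 2)
        cleaned.append((order_id, customer.strip(), country, total))
    return cleaned


def load(dwh, rows):
    """Load: write the transformed rows into the (hypothetical) warehouse table."""
    dwh.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, country TEXT, total REAL)"
    )
    dwh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    dwh.commit()


def run_etl():
    """Build a tiny source system, run extract -> transform -> load, return the DWH rows."""
    oltp = sqlite3.connect(":memory:")
    oltp.execute(
        "CREATE TABLE orders "
        "(id INTEGER, customer TEXT, country TEXT, qty INTEGER, unit_price REAL)"
    )
    oltp.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [
            (1, " Acme ", "us", 3, 9.99),    # stray spaces, lower-case country
            (2, "Globex", None, 1, 100.0),   # missing country
            (3, "Initech", " De ", 2, 12.5),
        ],
    )
    dwh = sqlite3.connect(":memory:")
    load(dwh, transform(extract(oltp)))
    return dwh.execute(
        "SELECT order_id, customer, country, total FROM fact_orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Keeping each step in its own function mirrors the Extract / Transform / Load split, so every stage can be tested on its own. Summarizing detailed data, which the Transform step also covers, is left out of the sketch for brevity.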
The ETL function is performed either by home-grown software or by commercial software.

The ETL process
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process (Extract, Transform, Load) -> Target DWH

Data Integration/ETL solutions - the marketplace
Leaders in ETL Solutions (Ab Initio)

Business Intelligence (BI)
Business Intelligence - What is it?
• Software applications used in spotting, digging out, and analyzing business data
• BI provides simple access to data that can be used in day-to-day operations, and integrates data into logical business areas
• BI provides historical, current and predictive views of business operations
• BI is made up of several related activities, including data mining, online analytical processing, querying and reporting.
Business Intelligence software is like reporting engines on steroids.

BI & Analytics - the marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020”
“The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to consolidate the market, owning 59 percent of the market share.”
- Analyst firm Gartner
Leaders in BI
- Analyst firm Forrester Research’s ‘Forrester Wave’

Business Intelligence (BI) - Who uses it?
• Wal-Mart uses vast amounts of data and category analysis to dominate the industry.
• Amazon and Yahoo follow a "test and learn" approach to business changes.
• Hardee’s, Wendy’s, and T.G.I. Friday’s use BI to make strategic decisions.

Business Intelligence (BI) & Data Marts
Data Mart
• A database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise.
• Data marts typically hold aggregated data and some granular data.
• A data mart is a subset of the DWH and makes Business Intelligence reporting more efficient.
• BI tools sit on top of the data marts.
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DWH -> ETL Process -> Data Mart

Business Intelligence (BI) & Analytics
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DWH -> ETL Process -> Data Mart

Data Quality Issues
• 80% of organizations will underestimate the costs related to the data acquisition tasks by an average of 50 percent.
• 46% of companies cite Data Quality as a barrier for adopting Business Intelligence products.
• Data Quality Best Practices boost revenue by 66%.
• The average organization loses $14.2 million annually through poor Data Quality.

Data Quality
Primary Characteristics of Data Quality tools, courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment

Data Quality - the marketplace
“The market for data quality software tools reached $1.61 billion in 2017 (the most recent year for which Gartner has data), an increase of 11.6% over 2016.
Gartner’s interactions with clients also indicate that demand remains high.” - Analyst firm Gartner
Leaders in Data Quality
- Analyst firm Gartner’s Magic Quadrant

Data Quality vs. Data Testing
Primary Characteristics of Data Quality tools, courtesy of Gartner’s “Magic Quadrant for Data Quality Tools”
o Profiling
o Parsing and standardization
o Generalized Cleansing
o Matching
o Monitoring
o Enrichment
o Subject-area-specific support
o Metadata management
o Configuration environment
Data Verification & Validation?
Primary Characteristics of Data Testing tools, courtesy of the book "Testing the Data Warehouse Practicum"
▪ Data Completeness
▪ Data Transformation
▪ Regression Testing
▪ Reporting
Data Verification & Validation?

Where Data Testing fits in your data strategy

The Executive Office and Critical Data
CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through poor Data Quality.” - Gartner
Diagram labels: Business Intelligence & Analytics; Data Architecture; typical data issue areas; ETL; Mainframe

Key Roles in Building & Testing a Data Store
• Data Analyst: creates data requirements (source-to-target map or mapping doc)
• Data Architect: models and builds the data store (Big Data lake, Data Warehouse, etc.)
• ETL Developer: transforms and loads data from sources to target data stores
• Data Tester: validates the data, based on mappings, as it moves and transforms from sources to targets

Data Requirements = Mapping Document
a.k.a. Source-to-Target Map
It’s the critical element required to efficiently plan the target Data Stores.
It also defines the Extract, Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ data flow mapping
✓ data movement requirements
Mapping Doc specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements

Most Common Data Validation Method: Sampling
• Review Business Rules (i.e. mapping document, data flow mappings)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Export results to 2 Excel files
• Compare a sampling of results by eye (‘Stare & Compare’)
Issue with Stare & Compare: it is impossible to visually compare billions of data sets.
Result: usually less than 1% of data is compared.
Example - current QuerySurge customer
• one test = 100 million rows x 200 columns = 20 billion data sets
• there is no practical way to manually verify (eyeball) this data set
• the client has more than 15,000 total tests

Data Store Roles, Tasks, & Timelines
Data Analyst
• Determine requirements
• Create & maintain Mapping Document
Data Architect
• Review Mapping Document
• Model and build target Data Stores
• Maintain target Data Stores
ETL Developer
• Review Mapping Document
• Build data movement logic
• Extract & load data or extract, transform, & load data
Data Tester
• Review Mapping Document
• Create 2 SQL tests for each mapping with SQL editor
• Execute tests
• Dump results of tests to 2 Excel files
• Compare Excel files by eye
Diagram labels: Timeline; iterate; Huge Risk

About QuerySurge
What is QuerySurge?
QuerySurge is the leading testing solution for automated validation & testing of Big Data
Use Cases

How QuerySurge Works
Data endpoints shown in the diagram:
• Data Stores: Databases, Data Warehouses, Data Marts
• Big Data stores: Hadoop, NoSQL
• Flat Files: Fixed Width, Delimited, Excel, JSON
• XML
• Web Services
• Data Warehouses
• Business Intelligence Reports
QuerySurge connects to any 2 points at one time (queries in SQL or HQL).
Flow: Source Data and Target Data -> comparison of every data set -> results (pass/fail) -> Data Intelligence Reports, Data Health Dashboard, automated email reports

Big Data Process - Developer & Tester
• ETL Developer: codes data movement based on Mapping Requirements
• Data Tester: tests data movement based on Mapping Requirements
• BI Analyst extracts data for reports
• Tester tests BI Reports
Diagram: Source Data -> ETL -> Data Warehouse / Big Data lake -> ETL -> Data Mart -> BI & Analytics, with Testing Points #1 through #4 along the flow

QuerySurge Supports 50+ Data Stores
QuerySurge supports the following data stores:
• Amazon (Redshift, Elastic MapReduce, DynamoDB)
• Apache Hadoop/Hive, Spark
• Cassandra
• Cloudera
• Couchbase
• Exasol
• Flat Files (delimited, fixed-width)
• Hortonworks
• IBM (Db2, Netezza, Informix, BigInsights, Cloudant, MDM, Cognos)
• JSON files
• Mainframe
• MapR
• Micro Focus Vertica
• Microsoft (SQL Server DWH, HDInsight, PDW, SSAS, Excel, Access, SharePoint)
• MongoDB
• Oracle (Oracle DB, MySQL, Exadata, NoSQL, Hadoop)
• Pivotal Greenplum
• PostgreSQL
• Salesforce
• SAP (HANA, IQ, ASE, SQL Anywhere, Altiscale Data Cloud)
• Snowflake
• Tableau
• Teradata, Aster
• Workday
• XML
…and any other data store

4 main components of successful data governance
1) Data stewardship
Identifying and assigning roles and responsibilities.
- who is creating its data,
- who has overall responsibility for the data,
- who uses the data, who routes it,
- who oversees its use.
2) Data classification
Identify and categorize data types into groups.
3) Data quality
The process of measuring the reliability of current data sets to provide information that can be used to make organizational decisions.
4) Data management
The process where all the organization's data governance efforts come together. The company actively manages its data governance efforts, which involves creating the architectures and business processes required to properly maintain the organization’s data through its full lifecycle.

Data Maturity Model - Process
• Patterned after the Capability Maturity Model Integration (CMMI) from the Software Engineering Institute (SEI) at Carnegie Mellon University
• Devised by IBM, along with 55 other companies
Characteristics listed in the model:
• Data Governance is second nature
• ROI for data-related projects is tracked
• Business value of data mgmt is recognized
• Cost of data mgmt is easier to manage
• Costs are reduced as processes become automated
• Further defined value of data for more data elements
• Data Governance methodology is introduced during the planning stages of new projects
• Enterprise data models are documented & published
• Data-related policies become more clear & reflect the organization’s data principles.
• Data integration opportunities are better leveraged.
• Risk assessment for data integrity & quality becomes part of the organization’s project methodology.
• More data-related controls are documented
• Metadata becomes an important part of documenting critical data elements.
• Few stable processes exist
• “Just do it” mentality
Source: IBM Data Governance Council Maturity Model

Data Governance - the marketplace
“Rapidly increasing growth in data volumes, rising regulatory & compliance mandates, and enhancing strategic risk management & decision-making are expected to drive the growth of the data governance market.”
“The data governance market size is expected to grow from $1.31 Billion in 2018 to $3.53 Billion by 2023, at a CAGR of 22.0%.”
- MarketsAndMarkets.com
Leaders in Data Governance
- The Forrester Wave

The Data World Distilled: the Data World by Top Vendors
Diagram: Source Data -> ETL -> Data Warehouse / Big Data lake -> ETL -> Data Mart -> BI & Analytics
Source types
• Flat files
• Excel
• JSON
• XML
• Web services
• Databases
ETL Vendors
• Ab Initio
• IBM
• Informatica
• Microsoft
• Oracle
• SAP
• SAS
• Talend
Data Warehouse Vendors
• Amazon
• IBM
• Microsoft
• Micro Focus
• Oracle
• SAP
• Snowflake
• Teradata
Hadoop Vendors
• Amazon
• Cloudera
• Hortonworks
• IBM
• MapR
• Microsoft
BI Vendors
• IBM
• Microsoft
• MicroStrategy
• Qlik
• Tableau
• SAP
• Oracle
NoSQL Vendors
• Amazon
• Apache
• Cassandra
• Couchbase
• MongoDB
• Oracle
Data Quality
• Informatica
• IBM
• Oracle
• SAP
• SAS
• Talend
Data Testing
• QuerySurge
• Informatica
• Tricentis
• Data Gaps
• IceDQ
• Bitwise
Data Governance
• Collibra
• DATUM
• GDE
• IBM
• Informatica
• SAP

Any questions?
Bill Hayduk, Founder, CEO
QuerySurge™
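
As a closing illustration of the data testing slides: instead of exporting two result sets to Excel and comparing a sample by eye ("Stare & Compare"), every source row can be compared against its target row programmatically. The sketch below is a hedged, minimal example of that idea only. It is not QuerySurge's implementation, and the function name, the key-column convention, and the sample rows are all hypothetical.

```python
def compare_result_sets(source_rows, target_rows, key_index=0):
    """Compare every source row with the target row that shares its key.

    Rows are tuples; key_index is the position of the (assumed unique) key
    column. Returns the keys that are missing from the target, unexpected
    in the target, or present in both with different values.
    """
    source = {row[key_index]: row for row in source_rows}
    target = {row[key_index]: row for row in target_rows}

    missing = sorted(k for k in source if k not in target)
    unexpected = sorted(k for k in target if k not in source)
    mismatched = sorted(
        k for k in source if k in target and source[k] != target[k]
    )
    return {
        "missing_in_target": missing,
        "unexpected_in_target": unexpected,
        "mismatched": mismatched,
        "passed": not (missing or unexpected or mismatched),
    }


if __name__ == "__main__":
    # Hypothetical query results: one from the source, one from the target.
    source = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]
    target = [(1, "a", 10), (2, "b", 99), (4, "d", 40)]
    print(compare_result_sets(source, target))
```

Because the comparison is keyed by dictionary lookups, it covers all rows in one pass rather than a sample, which is the point the "less than 1% of data is compared" slide makes about manual checking.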