What is Parallel Data Warehouse (PDW) and where does it fit? Mike Lampa Director – Business Analytic Solutions Agenda • What is Big Data and Where does PDW Fit? • PDW Architecture on Dell hardware • MPP and Shared Nothing concepts • Data distribution and re-distribution • EDW Reference Architecture • Data Modeling Guidelines • ETL Guidelines • Resource Skilling Considerations Feel free to ask questions as we move along – goal is to make this presentation as interactive as possible! 2 Confidential Global Marketing What is Big Data Global Marketing The Data Explosion • 988 Exabytes of information in 2010. • Everyday 2.5 quintillions of data is created. • Volume of data across enterprise doubling every 3 years • 80% of enterprise data is unstructured 4 DELL CONFIDENTIAL Storm Rising in Data Analytics Major technology developments are driving three key mega-trends driving new opportunities for the industry and our customers Scalable Database Architectures Real-time BI and Analytics Self-Service NoSQL/NewSQL “Big Data” Unstructured Structured In-Memory Hadoop Visualization Socianalytics 5 Confidential Columnar/MPP Pig/Hive Cloud BI Financial Customer Advisory Council February 1-2 | New York City Big Data complements Analytics (DW & BI) Data Sources Sensors Devices Crawlers Bots Processing Infrastructure Shared Infrastructure Knowledge Capture Business Value Models and Production Analytic Applications Apps Un-Structured Exploratory Analytics New IP Creation BI Tools ERP DW CRM Structured Well-defined processing Data-enriched tools PDW Data Mgmt Svcs 6 DELL CONFIDENTIAL IT Professional App Developer Domain Specialist BI User PDW Appliance: Architecture & MPP Concepts Global Marketing 8 Confidential Global Marketing What is the PDW Appliance? Microsoft Parallel Data Warehouse • Microsoft software running on Dell hardware • High-end data warehouse, scales to 100’s of terabytes • Massively Parallel Processing (MPP) for high performance • Architected with redundancy throughout the system • Based on proven MS SQL Server 2008 R2 platform • Low cost of ownership with industry standard Dell hardware • Sold as an appliance with software preloaded • Extensive consulting and application services available • Microsoft and Dell representatives work together to serve the customer 9 Confidential Global Marketing Microsoft PDW Architecture Scales for Resilience and High Performance, with a Low Cost of Entry Control Rack Data Racks (up to 4) PowerEdge R610 Database Servers Scale by data rack(s) MD3620f adding Storage Nodes Control Nodes (R710) Active / Passive Dual Fiber Channel Client Drivers Dual Infiniband Management Servers (R610) Data Center Monitoring Landing Zone (R510) ETL Load Interface Backup Node (R710 and MD3600f w/MD1200’s) Corporate Backup Solution Spare Database Server Confidential 10 Corporate Network Private Network Global Marketing Rack Configuration for Dell MD Appliance Control Rack 11 Confidential Data Rack 1 Data Rack 2 Data Rack 3 Data Rack 4 Global Marketing PDW Core Concepts • Distributed Relational Database – 10 DBMS servers per Data Rack – Data distributed across multiple DBMS instances • Massively parallel processing – Multiple concurrent resources resolve SQL set operations against Distributed data › Compute Node architecture supports 10 parallel instances of DBMS per Data Rack. › Each DBMS instance works in parallel on its own “distribution” of a single user query. • Shared nothing computing – Resource and data independence are maintained within each DBMS instance › Each Compute Node reserves its shared resources (CPU, Memory, Disk) for only its distribution of system data – Managed by MPP server (Control node) – Converting schema & metadata from shared nothing to a common logical view • Configured for high redundancy 12 Confidential Global Marketing Data Distribution • Distributed: A table structure with evenly distributed records across multiple shared nothing databases. –Distribution Key: A single column in a Distributed table that is used for hash distribution of records across multiple shared nothing databases. • Replicated: A table structure that exists as a full copy on each shared nothing database. • Ultra Shared Nothing: Design database schema with a mix of replicated and distributed tables to minimize data movement between nodes. – Dimensions are replicated – Facts are distributed – Redistribute rows at run time when distribution incompatibility is encountered in SQL set operation. 13 Confidential Global Marketing Ultra Shared Nothing Example 14 Confidential TD PD C D M D TD PD C D M D TD PD C D M D TD PD C D M D Global Marketing SF SF SF SF Redistribution • Redistribution: The movement of data between shared nothing database instances to answer distribution incompatible SQL queries within a PDW Appliance. – Shuffle: A redistribution technique that leverages Inifiniband™ network to create temporary distribution compatible data sets. – At least one table in the query plan uses a Distribution Key in its join criteria. – Any table that is not joined on it’s Distribution Key is targeted for Shuffle first. Leftmost table is chosen if multiple tables meet this criteria. – Replication: A redistribution technique that is used to create a temporary full copy of a data set. 15 Confidential Global Marketing SMP vs MPP SMP MPP • Shared Resources • Dedicated resources • Limited scaling • Scales to PB • Applicable < 20TB • Applicable > 20 TB • HA must be architected in • Built in HA and redundancy • High Concurrency for complex workloads 16 Confidential Global Marketing EDW Architecture Global Marketing EDW Logical Architecture 18 Confidential Global Marketing EDW Information Layers Data Rack Landing Zone 19 Confidential Global Marketing Data Flow Integration Layer Replication Source Data In ETL Stage LRF Base & Package Layer Data Store • • PDW Load Scripts Load Scheduling PDWPRD01 Base & Package 100 TB Package Copy for Presentation Data Presentation Layer Data Store PDWPRD03 Presentation 60 TB Dell network Infiniband Consumption Layer Data out 20 Confidential SAS, BO, MSAS, BI Tools Global Marketing Enabling Packages - Hub and Spoke • Physical Data Marts (Packages) May make sense from consumption perspective. • Primary Considerations: – Business Function – Type of BI Workload • Secondary Considerations: – User Size, Data Volumes & Performance – Security & Sensitivity 21 Confidential Global Marketing Data Modeling Guidelines Global Marketing PDW Table “Geometries” • Replicated: A table structure that exists as a full copy within each PDW Data Node • Distributed: A table structure that is hashed and distributed as evenly as possible across all PDW Data Nodes on the appliance Global Marketing PDW Table Geometry Example Compute Nodes PDW Storage Nodes DD Source System SD Date Dim Item Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Prod Prod Prod Dim ID Category Sub Cat Desc DD SD SF 1 SF 2 ID PD ID PD Sales Fact DD Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold SD DD Store Dim Store Store Store Store Dim ID Name Mgr Size Promo Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End SD DD SD 24 Confidential Global Marketing SF 3 SF 4 SF 5 ID PD ID PD ID PD Distribute or Replicate? • Use distributed tables when: –The table is large – generally > 5GB –For fact/detail tables –Full table scans do not provide acceptable performance • Use replicated tables when: –The table is small – generally < 5GB –For dimension/lookup tables –Multiple foreign keys exist and foreign key joins are common 25 Confidential Global Marketing Partitioning Distributed Tables • Distributed tables are already segmented by hashed distributions • Will further partition rows within a distribution, based on a partition function (eg: Time_Dim Quarter or Year) • Allows for operations efficiency when adding, loading, dropping, and switching partitions • Good for fast loading of an unused partition and then switching it in after loading • Partition for manageability – Typically on a date key (or integer surrogate) – Typically same as clustered index key – SWITCH partitions OUT for fast delete of history or IN to modify or add a specific historical slice 26 Confidential Global Marketing Colocation Tables must be designed for performance from the beginning. Performance optimization is not just a DBA thing after development is complete! • Colocation: Within a PDW appliance, two individual records with identical keys will always belong to the same Distribution. – Single GREATEST performance consideration – Beneficial for Join and Aggregation performance (eg Parent & Child join) – Distribution Compatibility • Choosing the right Distribution Key – Identify Commonly used join keys and/or aggregations – Choose a single column that limits skew to < 40% (High Domain Cardinality & low Distribution SKEW) – Distribution Key should be the first column declared for Distributed Table DDL – Consider Surrogate Keys when “business key” is compound 27 Confidential Global Marketing Colocation Example • Ensure that all compatible Distribution Keys are identical data types. – [customer.customer_id integer] = [customer_hist.customer_id integer] – [customer.customer_id integer] <> [customer_hist.customer_id char(10)] Distribution Key • Colocation and Distribution Key Example PK COLUMNS TABLE: 28 Confidential MFG_SO DELL_MFG_ORD_CNFG DELL_MFG_ORD_STAT SO_ID SO_ID SO_ID BU_ID BU_ID BU_ID ORD_NUM ORD_NUM ORD_NUM ORD_TIE_NUM MFG_STAT_CD MFG_WIP_STAT_DTM Global Marketing Handling Large Dimensions Large Dimensions (>5GB uncompressed) • Distribute/Normalize • Distribute: – If possible, distribute dimension on same key as fact surrogate key. – If distribution compatibility not possible, “shuffle” dimension data on the fly (at Query time) • Normalize: – Normalize large dimension into smaller tables and replicate the core dimension (more manageable replication size) – Look at usage pattern, if only a few columns from dimension are used in most of the major queries, separate high use columns from low use columns into separate Dim_tables. Both, have the same surrogate key. › Core dimension is replicated which is joined locally to fact table › Create a view to combine data from core and outrigger to insulate complexity from users (CAUTION: check the performance) 29 Confidential Global Marketing Multi Level Partitioning • Partitioning – Partitioning is a method of distributing a table’s rows among a number of subtables (partitions). – Partitioning is applied within each Distribution. • Multi-Level Partition Support – Any combination of up to four total Range, Hash, Or List partition schemes. – Each new partition level generates new partitions at a multiple of the previous level. – Partition Values: [Range 4 x List 6] will generate 24 partition files per Distribution. 30 Confidential Global Marketing Benefits of Partitioning • Reduce Table Scans – This is the most common use case for partitioning. – Relies on query restrictions aligned with: Range, Hash, or List qualifiers. – Practice: Partition on commonly restricted fields (query based). • Minimize Memory Utilization – Join Operations › Reduces memory requirements per join. › Reduces disk spill if session or operation limits are reached. – Aggregation › Reduces memory requirements to build result set. › Reduces disk spill if session or operation limits are reached. – Practice: Hash Partition on join key. 31 Confidential Global Marketing Multi Level Partitioning Example DDL of Multi-Level Partitioned Table CREATE TABLE member ( memberID BIGINT NOT NULL, memberType SMALLINT NOT NULL, lastName VARCHAR(50) NOT NULL, activeStatus CHAR(1) NOT NULL, salesTotal FLOAT, lastLogin DATE NOT NULL) WITH distribute_on (memberID), text compressed, IIpartition=((range on lastLogin partition p01 values < '2007_01_01', partition p02 values < '2007_04_01', partition p03 values < '2007_07_01', partition p04 values < '2007_10_01', partition p05 values < '2008_01_01', partition p06 values >= '2008_01_01') SUBpartition (hash on memberID 5 partitions) SUBpartition (list on activeStatus PARTITION p101 VALUES ('n'), PARTITION p102 VALUES ('a'), PARTITION p103 VALUES (default))); 32 Confidential Global Marketing ETL Guidelines Global Marketing PDW Data Loading Design Goals • Load data efficiently and non-obtrusively, respecting concurrent queries and loads • Reduce table fragmentation as much as possible • Provide system recovery capabilities in the event of data load failure that have minimal impact on concurrent queries • Provide multiple load/ETL options for PDW customers Global Marketing Data Movement Service (DMS) with PDW • Runs on the following nodes as a Windows service: › Control › Compute › Landing Zone • Used to quickly move data in parallel between nodes by using Infiniband network in PDW • Uses ADO.NET › Uses SqlClient namespace to select data from SQL Server › Uses SqlBulkCopy to insert data into Compute nodes • Two protocols/networks used by DMS: – Data transfer network to move data between nodes – Message network to send command and status messages to nodes from Manager • DMS closely interacts with the primary PDW Engine Service • DMS is used for both loading and querying data 35 Confidential Global Marketing ETL Loading Options in PDW • • • • 36 DWLoader Utility SQL Server Integration Services (SSIS) CREATE TABLE AS SELECT (CTAS) Standard SQL DML statements: INSERT/SELECT Confidential Global Marketing PDW Distributed Table Load – Step 1 Control Rack Data Rack Control Node Compute Nodes (2) Load Manager creates staging tables DMSEngine Ser er PDW SQL Server DMS Manager Load Manager DMS (3) DMS reads load data and buffers records to send to Compute Nodes round-robin Infiniband (1) DWLoader invoked/ SSIS Storage Nodes (4) Each row is converted for bulk insert and hashed based on the distribution column DMS Converter Sender Receiver Writer (5) Hashed row is sent to appropriate node receiver for loading Landing Zone Load Client Load File/SSIS 37 Confidential SSIS API DMS DMS Distributor Converter Sender Receiver Writer (6) Row is bulk inserted into staging table Global Marketing PDW Distributed Table Load – Step 2 STEP 1: DWloader creates topology equivalent staging table and moves data from LZ file into staging tables using DMS Staging DB Destination DB 2nd step process DWloader uses SQL commands to move from staging to destination tables NOTE: distributions of a table are written in parallel when the multi-transactions option is set to true. 38 Confidential Global Marketing Data Loading – DWloader • Command-line utility invoked on the Landing Zone • Integrated with DMS – Streamlines I/O and minimizes data-loading times through powerful parallel loading functionality against a single text file – Optimize data load speeds while maintaining a performance balance so as not to seriously degrade concurrently running queries • Characteristics of Dwloader – Accommodate initial data loads of large files over 300 GB – Achieve data load speeds of up to 2 TB per hour – Accommodate multiple and concurrent incremental loads – Has settings for canceling and showing status of loads – Input file must reside on the Landing Zone – Max. concurrency 10, queues up subsequent load 39 Confidential Global Marketing Data Loading – SSIS • The SQL Server PDW Destination is an SSIS component that lets you load data into SQL Server PDW by using an SSIS .dtsx package. • In the package workflow for SQL Server PDW, you can load and merge data from multiple sources and load data to multiple destinations. • The loads occur in parallel, both within a package and among multiple packages running concurrently • SQL Server 2008 R2 SSIS includes: – SQL Server Parallel Data Warehouse Connection Manager – SQL Server Parallel Data Warehouse Destination • Similar to dwloader, SSIS leverages DMS for parallel load operations. • SSIS can run either on the Landing Zone or on a server outside the PDW appliance. 40 Confidential Global Marketing SSIS and PDW Data Types When using SSIS to load data from a data source to a SQL Server PDW database: –Data is first mapped from the source data to SSIS data types. –This allows data from multiple data sources to map to a common set of data types. –Then the data is mapped from SSIS to SQL Server PDW data types. 41 Confidential Global Marketing Leading Practices – Data Loading • Minimize page breaks (fragmentation) by designing “partitionfriendly” loads. • If necessary, drop non-clustered indexes before loading and re-index after all loads are complete. • There is no benefit to sorting data before hitting the Landing Zone. 42 Confidential Global Marketing Leading Practices – Staging Databases • Historic PDW load jobs tend to be the largest. The staging database may be reduced in size for subsequent incremental loads. • When creating the staging database, use the following guidelines: – Replicated table size should be the estimated size per Compute Node of all the replicated tables that will load concurrently. – Distributed table size should be the estimated size per appliance of all the distributed tables that will load concurrently. – Log size is typically similar to the replicated table size. 43 Confidential Global Marketing Leading Practices - SSIS • For good PDW loading throughput, it is important to keep a steady stream with minimal starts and stops. • PDW connections and queries are very costly to initiate. Use fewer to do more. • Data type conversion in the PDW destination adapter is very expensive. Be sure the input types match the destination types, especially for strings and decimals. • Consider performing data transformations after loading into the staging database (ELT instead of ETL). 44 Confidential Global Marketing ETL Guidelines • Grouping: Determine the largest set of data that is distribution compatible within the query. This will break queries in multiple compatible steps. 45 Confidential Global Marketing ETL Guidelines • Joining two distribution incompatible tables – Scenario where changing the structure of either table is not possible – Usually encountered while populating fact tables from underlying BASE tables – Create a temporary table with required columns from driver table distributed on a key which makes distribution compatibility possible – More controlled “Shuffle” – Temp table or Join output can be reused by multiple queries – ETL BEST PRACTICE: BREAK YOUR WORKLOAD IN MULTIPLE, MANAGEABLE DISTRIBUTION COMPATIBLE SET OF QUERIES 46 Confidential Global Marketing Resource Skilling Considerations Global Marketing Skills Consideration • Platform Skills - Moving from SQL to PDW – Retain much of your SQL skills (SQL Server Data Architecture & DBA, SSIS, etc) – Heterogeneous platform as you scale from GB to PB! • Design Skills: – MPP Data Architecture differs from SMP › Think in Terms of Distribution Keys vs Primary_Key and Foreign_Key › Think in Terms of Distribution Compatibility vs Indexes for Performance › Surrogate Keys lend themselves to Distribution Keys – MPP ETL Architecture differs from SMP › More use of Load Ready Files with Upsert Logic vs Dynamic Lookup › Staging environment strategies simulate CDC Key Lookups › Surrogate Keys add complexity to CDC Lookups and Key Generation 48 Confidential Global Marketing Summary • PDW is an MPP appliance from Microsoft on Dell Hardware • Keep in mind MPP and Shared Nothing concepts while designing your EDW on PDW. • Traditional SMP concepts are neither sufficient nor applicable. • Break your workload in manageable distribution compatible chunks. • PDW supports both Normalized and Star schemas. • Consider grouping data in logical information layers. • Use combination of Dwloader & SSIS depending on unit-ofwork • Retain your core technology platform skills, augment your DW design skills 49 Confidential Global Marketing Q&A Global Marketing