TechTalk
Best Scalability through Massively Parallel Processing with "Parallel Data Warehouse" (PDW)
Meinrad Weiss
09.05.2012

BASEL, BERN, LAUSANNE, ZÜRICH, DÜSSELDORF, FRANKFURT A.M., FREIBURG I.BR., HAMBURG, MÜNCHEN, STUTTGART, WIEN

2012 © Trivadis | Best Scalability through Massively Parallel Processing with "Parallel Data Warehouse" (PDW) | 27.04.2012

AGENDA
1. Overview Microsoft Data Warehousing Solutions
2. Parallel Data Warehouse (PDW): What's that? Or MPP vs. SMP
3. Hardware Architecture: Control Rack and Data Rack
4. Tools (Management Dashboard, Nexus Query Tool, DWSQL)
5. Distribution and Replication of Data
6. Table Constraints and Data Type Limitations
7. Comparison: Load Speed with SMP versus PDW
8. Basic Shared Nothing / Shuffle Moves
9. Concrete Offerings

Microsoft Data Warehousing Solutions
Four offerings, from left to right on the slide:

Delivery                                                Scaling                  Capacity
Software only                                           Scale-up DW              10s of TB
Reference Architectures (software and hardware)         Scale-up DW              2-80 TB
Software only                                           Scale-up DW              10s of TB
DW appliance (fully integrated software and hardware)   Scale-out DW with MPP    10s-100s of TB

Descriptions shown on the slide:
- Scalable and reliable platform for data warehousing on any hardware (shown for both software-only offerings)
- Reference Architectures offering best price performance for data warehousing
- Appliance for high-end data warehousing requiring highest scalability, performance, or complexity
- Ideal for data marts or small to mid-sized EDWs
- Ideal for data marts or small to mid-sized DWs with scan-centric workloads
- Ideal for large data marts or mid-sized EDWs
- Offers flexibility in hardware and architecture

Data Warehouse: Products Positioning
[Figure: product positioning chart. Labels: Scale, Complexity, HA by default, SW-HW integration, Appliance Simplicity. Products shown: PDW with Distributed Data Architecture, SQL Server 2008 R2 Data Center, PDW, SQL Server 2008 R2 Fast Track, SQL Server 2008 R2 Enterprise]

Data Warehouse: Products Positioning (continued)
[Figure: same chart, with the additional label "100% SQL Server 2008 R2 Compatibility". Products shown: PDW with Distributed Data Architecture, SQL Server 2008 R2 Data Center, PDW, SQL Server 2008 R2 with Fast Track Reference Architecture, SQL Server 2008 R2 Enterprise]

MPP vs. SMP
SMP - Symmetric Multiprocessing
- Multiple CPUs are used to complete individual processes simultaneously
- All CPUs share the same memory, disks, and network controllers
- All SQL Server implementations up until now have been SMP
MPP - Massively Parallel Processing
- Uses many separate CPUs running in parallel to execute a single program
- Each CPU has its own memory and disks
- High-speed communications between nodes
- Applications must be segmented

MPP
Two hardware vendors: HP and Dell
- Microsoft + Dell: Parallel Data Warehouse Appliance
- Microsoft + HP: Enterprise Data Warehouse Appliance

Hardware Architecture
[Figure: appliance layout. The Control Rack holds the Control Node, Management Node, Landing Zone and Backup Node; the Data Rack(s) hold the SQL Server compute nodes]

Control Node
- Client connections always go through the control node
- Windows Failover Cluster for availability
- Contains no persistent user data
- Processes SQL requests
- Prepares the execution plan
- Orchestrates distributed execution
- Local SQL Server processes the final query plan and aggregates results

Management Node
- Provides support and patching for the appliance
- Holds the image for re-deployment of a compute node
- Holds Active Directory

Landing Zone
[Figure: data flow from source files to the Landing Zone, then through the data loader (DWLoader or SQL Server Integration Services) to the compute nodes]
- Provides high-capacity storage for data files from ETL processes
- Is available as a sandbox for other applications and scripts that run on the internal network
- Provides SQL Server Integration Services

Backup Node
- Provides an integrated backup solution
- Integrates with 3rd party backup products
- Orderable in different sizes

Data Rack
- Servers: 5/10 active + 1 passive per rack
- InfiniBand, FC and Ethernet switching
- Expansion: grow from 1/2 to 4 data racks, storage options, test/dev system
- Consists of COMPUTE NODES and STORAGE NODES
- Shared Nothing
- Spare node provides failover in case of node failure

Compute Node and Storage Node
- Each MPP node is a highly tuned symmetric multi-processing (SMP) node with standard interfaces
- More or less multiple Fast Track servers
- Provides dedicated hardware, database, and storage
- Runs SQL Server 2008
- Local drives are configured as RAID 1

Connectivity and Tools
- Nexus Query Chameleon
- DWSQL

Web-Based Management Dashboard
[Screenshot: web-based management dashboard]

System Center (SCOM)
[Screenshot: System Center Operations Manager]

Distribution and Replication of Data
Example star schema:
- Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
- Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
- Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
- Mktg Campaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
- Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
Larger (> 10 B) fact table is hash distributed across all compute nodes.
[Figure: four compute nodes, each holding one slice of Sales Facts (SF-1, SF-2, SF-3, SF-4)]

Distribution and Replication of Data (continued)
Smaller (< 5 GB) dimension tables are replicated on every compute node.
[Figure: each of the four compute nodes holds a copy of every dimension table (TD, SD, PD, MD) next to its own Sales Facts slice (SF-1 to SF-4)]
Result: fact-dimension joins can be performed locally.

Creating a Database
    CREATE DATABASE PDW
    WITH (AUTOGROW = ON,
          REPLICATED_SIZE = 1024 GB,     -- per node
          DISTRIBUTED_SIZE = 16384 GB,   -- whole system
          LOG_SIZE = 1024 GB);

Distribution on a PDW
    CREATE TABLE myTable (column Defs)
    WITH (DISTRIBUTION = HASH (id));

On every PDW node (Node 1, Node 2, ... Node 10) this creates 8 tables:
    Create Table <myTable GUID>_a
    Create Table <myTable GUID>_b
    …
    Create Table <myTable GUID>_h
Final result: 80 individual tables across a 10 node (1 data rack) appliance.

Create Replicated Table
    CREATE TABLE DimProduct(
        ProductId   BIGINT NOT NULL,
        Description VARCHAR(50),
        CategoryId  INT NOT NULL,
        ListPrice   DECIMAL(12,2))
    WITH (DISTRIBUTION = REPLICATE);

Creates tables on each of the individual compute nodes and assigns them to the REPLICATED file group.
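
The hash-distributed layout described above (8 tables per node, 80 on a 10-node data rack) can be illustrated with a small Python sketch. It is not PDW's implementation: the talk does not describe the hash function, so CRC32 is used as a stand-in, and `distribution_for` is a hypothetical helper whose suffixes simply mirror the `_a` … `_h` table names on the slide.

```python
import zlib
from collections import Counter

NODES = 10              # compute nodes in one data rack (from the slide)
SUFFIXES = "abcdefgh"   # 8 distribution tables per node (_a .. _h)


def distribution_for(key: int) -> tuple[int, str]:
    """Map a distribution-key value to (node number, table suffix).

    CRC32 is only a stand-in (assumption); the real PDW hash
    function is not described in the talk.
    """
    h = zlib.crc32(str(key).encode("utf-8"))
    slot = h % (NODES * len(SUFFIXES))   # one of 80 distributions
    node = slot // len(SUFFIXES) + 1     # node numbers 1..10
    return node, SUFFIXES[slot % len(SUFFIXES)]


# Spread 100,000 synthetic ids and count the rows per distribution.
counts = Counter(distribution_for(i) for i in range(100_000))
print(len(counts), "distributions used")
print("rows per distribution: min", min(counts.values()),
      "max", max(counts.values()))
```

The same key always lands in the same distribution, which is what lets two tables hashed on compatible keys be joined without moving data.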
Create Replicated Table (continued)
- Data compression is automatically turned on
- CREATE TABLE statement syntax varies slightly from its syntax in standard Transact-SQL

Data Type Limitations
- Most scalar data types supported by SQL Server 2008 are supported by PDW
- Main exceptions:
  - Text (and related BLOB data types)
  - XML
  - SQL Variant
  - Timestamp
  - System and CLR UDTs
- IDENTITY/DEFAULT constraints are not supported
- Character data types are case sensitive
- PDW uses collation Latin1_General_BIN2

PDW / SQL Server Data Types
bigint, binary, bit, char/nchar, date, time, datetime, datetime2, datetimeoffset, decimal, float, geography/geometry, hierarchyid, image, int, money, numeric, real, smalldatetime, smallint, smallmoney, sql_variant, sysname, text/ntext, timestamp, tinyint, uniqueidentifier, varbinary, varchar/nvarchar, xml

Performance Tests: Data Load on SMP System
Load a single 75 GB flat file with 600 million rows on SQL Server SMP:
- Bulk copy: 90,000 rows/sec
- Load time: 1 hour 48 min
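
As a quick cross-check of the SMP figures, the following Python sketch derives the implied load rates from the stated file size, row count and elapsed time. It assumes 1 GB = 1024 MB; the variable names are illustrative only.

```python
# Figures from the slide: 75 GB flat file, 600 million rows, 1 h 48 min.
FILE_MB = 75 * 1024              # assumption: 1 GB = 1024 MB
ROWS = 600_000_000
elapsed_s = 1 * 3600 + 48 * 60   # 1 hour 48 min in seconds

rows_per_sec = ROWS / elapsed_s
mb_per_sec = FILE_MB / elapsed_s

print(f"{rows_per_sec:,.0f} rows/sec")   # 92,593 rows/sec
print(f"{mb_per_sec:.1f} MB/sec")        # 11.9 MB/sec
```

The result of roughly 92,600 rows/sec is in line with the rounded 90,000 rows/sec on the slide and corresponds to just under 12 MB/sec.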

Data Load on PDW
Loading the same single flat file into PDW (75 GB / 600 million rows):
    dwloader.exe -i D:\TPCH\lineItem.tbl -M Fastappend -E -m -d tpch_100gb -E -c -b 10000 -rt value -rv 100 -R LineItem.tbl.rejects -e ascii -t "|" -r \r\n -U sa -P {password} -T tpch_100gb.dbo.lineitem_Load

Load option    Load time        MB/sec
Reload         09 min 35 sec    133
Append         09 min 42 sec    131
FastAppend     02 min 23 sec    534

FastAppend is 45 times faster than the SMP load.

Data Load on PDW (continued)
Running the same dwloader command without the -U and -P options, the single file is read at 600 MB/sec.

Copy table within PDW
Table with 600 million rows (LineItem):
    SELECT * INTO lineitem_copy FROM tpch_100gb.dbo.lineitem
14 times faster: 36 min 07 sec (SMP) versus 2 min 12 sec on PDW.

Hub and Spoke
[Figure: hub-and-spoke topology. Hub: Central EDW Hub with Landing Zone. Spokes: Departmental Reporting, Regional Reporting, Regional Reporting with Business Decision Appliance, High-Performance Reporting, Mobile Applications. Sources: Third-Party RDBMS, Third-Party Data Integration, ETL Tools]

Remote table copy
Create a heap table on the SMP destination server (*):
    CREATE REMOTE TABLE tpch_Henk.dbo.LineItem_test
    AT ('Data Source = NYCPDW-LZ01,1433; User ID = sa; Password = x;')
    AS SELECT * FROM tpch_100gb.dbo.lineitem_load
Check the status of the copy operation:
    SELECT * FROM sys.dm_pdw_dms_workers
    WHERE type = 'PARALLEL_COPY_READER'
      AND destination_info = '[skypdw_Henk].[dbo].[LineItem_test]'
(*) Requires an InfiniBand HCA card in the remote SQL Server SMP.

Result: 600 million rows, remote table copy in 21:25 minutes.

Basic Shared Nothing Join (Replicated/Distributed)
    SELECT ss_key, Cost
    FROM item_dim a
    JOIN store_sales b ON a.color = b.color
    WHERE a.color = 'Yellow'

Item Dim (replicated table, identical on Node 1 and Node 2):
Color     Cost
Red       10
Green     15
Blue      25
Yellow    5

Store Sales (distributed table):
Node 1                       Node 2
ss_key   Color    Qty        ss_key   Color    Qty
1        Red      5          2        Red      3
3        Blue     10         4        Blue     11
5        Yellow   12         6        Yellow   17
7        Green    7          8        Green    1

Result set Node 1: 5,5
Result set Node 2: 6,5
Final result set: 5,5 : 6,5

- Join type: Shared Nothing
- Distribution: Compatible
  - Replication satisfies compatibility for inner joins
  - Store Sales distribution key not used
- Streaming results
  - Results streamed to client
  - No aggregation (processing) on Control Node required
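
The replicated/distributed join above can be replayed in a few lines of Python. This is only a sketch of the idea, not of PDW internals: the table contents are taken from the slide, while `local_join` and the dictionary layout are illustrative names.

```python
# Replicated table: every node holds a full copy of Item Dim (Color -> Cost).
item_dim = {"Red": 10, "Green": 15, "Blue": 25, "Yellow": 5}

# Distributed table: each node holds only its own Store Sales rows,
# stored as (ss_key, Color, Qty). Values are taken from the slide.
store_sales = {
    1: [(1, "Red", 5), (3, "Blue", 10), (5, "Yellow", 12), (7, "Green", 7)],
    2: [(2, "Red", 3), (4, "Blue", 11), (6, "Yellow", 17), (8, "Green", 1)],
}


def local_join(rows, dim, color):
    """Join one node's fact rows to its local copy of the dimension."""
    return [(ss_key, dim[c]) for ss_key, c, _qty in rows
            if c == color and c in dim]


# Each node works independently; the control node only concatenates results.
per_node = {node: local_join(rows, item_dim, "Yellow")
            for node, rows in store_sales.items()}
final = [row for node in sorted(per_node) for row in per_node[node]]

print(per_node)   # {1: [(5, 5)], 2: [(6, 5)]}
print(final)      # [(5, 5), (6, 5)]
```

No row leaves its node before the final concatenation, which is why the slide labels this join "Shared Nothing".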

Basic Shared Nothing Join (Distributed/Distributed)
    SELECT a.color, b.Qty
    FROM web_sales a
    JOIN store_sales b ON ws_key = ss_key
    WHERE a.color = 'Red'

Node 1:
Web Sales (distributed)        Store Sales (distributed)
ws_key   Color    Qty          ss_key   Color    Qty
1        Red      15           1        Red      5
3        Blue     20           3        Blue     10
5        Yellow   22           5        Yellow   12
7        Green    17           7        Green    7
Result set: Red,5

Node 2:
Web Sales (distributed)        Store Sales (distributed)
ws_key   Color    Qty          ss_key   Color    Qty
2        Red      13           2        Red      3
4        Blue     21           4        Blue     11
6        Yellow   27           6        Yellow   17
8        Green    11           8        Green    1
Result set: Red,3

Final result set: Red,5 : Red,3

- Join type: Shared Nothing
- Distribution: Compatible
  - Join includes compatible distribution keys with compatible data types
- Streaming results
  - Results streamed to client
  - No aggregation (processing) on Control Node required

Redistribution Join: Shuffle
    SELECT vs_key, a.ord, b.qty
    FROM vendor_sales a
    JOIN store_sales b ON a.vs_key = b.VID
    WHERE a.color = 'Red'

Vendor Sales (distributed table):
Node 1                       Node 2
vs_key   Color    Ord        vs_key   Color    Ord
11       Red      15         2        Red      13
32       Blue     20         4        Blue     21
54       Yellow   22         6        Yellow   27
78       Green    17         8        Green    11

[Figure: Store Sales (distributed table; columns ss_key, VID, Qty) shown on Node 1 and Node 2 before and after the shuffle move]

- Join type: Redistribution
- Distribution: Incompatible
  - Tables are not co-located on their respective distribution keys
  - Distribution used from left table (vendor_sales) only
- Shuffle-move operation
  - Data from right table (Store_Sales) is rebuilt: DK = VID
  - Query is now distribution compatible
- Streaming results
  - Results streamed to client
  - No aggregation (processing) on Control Node required

Result set Node 1: 11,15,5
Result set Node 2: 2,13,3
Final result set: 11,15,5 : 2,13,3
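
The shuffle move can be sketched in Python as follows. This is a toy model under stated assumptions, not PDW's Data Movement Service: `node_for` uses a simple modulo as the distribution function, and the rows are illustrative values chosen so that the final result matches the one on the slide (11,15,5 and 2,13,3).

```python
from collections import defaultdict

NODES = 2


def node_for(key: int) -> int:
    """Toy distribution function (an assumption, not PDW's hash)."""
    return key % NODES


def distribute(rows, key_index):
    """Place each row on the node chosen by its distribution-key column."""
    placed = defaultdict(list)
    for row in rows:
        placed[node_for(row[key_index])].append(row)
    return placed


# Illustrative rows (hypothetical data).
# vendor_sales: (vs_key, color, ord), distributed on vs_key.
vendor_sales = distribute(
    [(11, "Red", 15), (33, "Blue", 20), (2, "Red", 13), (4, "Blue", 21)], 0)
# store_sales: (ss_key, vid, qty), distributed on ss_key, so rows are
# NOT co-located with the vendor rows they join to.
store_sales = distribute(
    [(1, 2, 3), (3, 4, 11), (2, 11, 5), (4, 33, 10)], 0)

# Shuffle move: rebuild store_sales with VID as the distribution key.
all_rows = [row for rows in store_sales.values() for row in rows]
shuffled = distribute(all_rows, 1)

# After the shuffle every node can join locally (VID is unique here).
final = []
for node in range(NODES):
    qty_by_vid = {vid: qty for _ss_key, vid, qty in shuffled[node]}
    for vs_key, color, ord_ in vendor_sales[node]:
        if color == "Red" and vs_key in qty_by_vid:
            final.append((vs_key, ord_, qty_by_vid[vs_key]))

print(final)   # [(2, 13, 3), (11, 15, 5)]
```

Only the right-hand table is moved; the left table keeps its distribution, matching the slide's note that the distribution is taken from vendor_sales only.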

Appliance Update AU3
- Performance: up to 10x improvement (Data Movement Services)
  - New cost-based query optimizer
  - New Data Movement Service
- 1/2 rack appliances from HP and Dell
- System Center 2012 integration (SCOM pack)
- And YES … support for stored procedures (subset)
- Collations: full support for international data
- Native SQL Server drivers

Enterprise Data Warehouse
Mark Wunderli
Infra2Apps Technology Consultant
Hewlett-Packard (Switzerland) GmbH
mark.wunderli@hp.com

HP & Microsoft Data Warehousing Continuum: Reference Architectures and Appliances
- Scales from 1 TB to hundreds of TBs
- Balanced solutions ideal for data marts through EDW with scan-centric workloads
- Packaged and custom support
- From SMB to Enterprise
- Built on HP ProLiant G7

Offering                        Server                              Cores      Storage
Business Data Warehouse         ProLiant DL370 (2P)                 up to 12   Internal HDD (6 TB)
Basic RA                        ProLiant DL38x G7 (2P)              up to 24   P2000 G3 (up to 20 TB)
Mainstream RAs                  ProLiant DL58x G7 (4P)              up to 48   P2000 G3 (up to 60 TB)
Premium RA                      ProLiant DL980 G7 (8P)              up to 80   P2000 G3 (up to 95+ TB)
HP Enterprise Data Warehouse    11 x ProLiant DL360 (2P) per rack              10 x P2000 G3 per rack (56-150 TB), up to 4 data racks (600 TB)

Techtalk Trivadis | 9 May 2012

Enterprise Data Warehouse Components
[Figure: rack layout. Data Racks with 10 + 1 compute nodes per rack; Control Nodes, Management Nodes, Compute Nodes and Landing Zone; networks: InfiniBand, Fibre Channel (backup), Ethernet]

High-Level PDW Architecture
[Figure: Control Rack with Control Node (active/passive), Management Node, Landing Zone(s) and Backup Node; Data Rack with database server nodes (Compute Nodes), Storage Nodes (MSA) and an active/passive Spare Node; client drivers and the ETL load interface connect to the Control Rack; the corporate backup solution attaches via Fibre Channel; nodes are interconnected via InfiniBand]

Enterprise Data Warehouse Appliance ½ Rack Configurations
Half-rack EDW configuration: entry-level option for MPP technology
Data Rack configuration:
- Lower capacity requirements, same components (4+1 Compute / 4 Storage Nodes)
- HDD capacities: 300 GB (LFF/SFF), 600 GB SFF, and 1 TB LFF
- Control Rack unchanged
Upgrade to full Data Rack:
- At introduction, at most 1 half Data Rack EDW is orderable
- Upgrade to a full Data Rack is possible

For Backup

Total Solution Support: HP and Microsoft Converged Systems
Microsoft and HP work together to provide a seamless support experience. Customers choose the service level from Microsoft and from HP to meet their business needs.

Microsoft Premier Support (*)
- 24x7 reactive support with on-site response
- Proactive services
- Technical account management
or upgrade through the add-on Premier Mission Critical: all features of the underlying Premier plan above, plus
- Faster reactive support response time with on-site solution engineering support
- Prioritized access to Microsoft product groups
- Solution supportability review and architectural guidance for maximum performance

HP Support Plus 24
- Reactive 24x7 hardware and software support for HP appliance components with a 4 hr onsite hardware response
or Proactive 24 Service
- Integrated hardware and software support including proactive and reactive services to improve stability and availability across your IT environment
or Critical Service
- Comprehensive support solution designed to help minimize the business impact of downtime for mission critical applications

(*) A Premier support plan (Standard level or above) is a prerequisite for PDW customers.

Thank you!
Hewlett-Packard (Switzerland) GmbH: mark.wunderli@hp.com
Trivadis AG: meinrad.weiss@trivadis.com