Introduction to the Parallel Data Warehouse (PDW)

TechTalk
Best scalability through massively parallel processing with "Parallel Data Warehouse" (PDW)
Meinrad Weiss
09.05.2012
BASEL, BERN, LAUSANNE, ZÜRICH, DÜSSELDORF, FRANKFURT A.M., FREIBURG I.BR., HAMBURG, MÜNCHEN, STUTTGART, WIEN
2012 © Trivadis
27.04.2012
AGENDA
1. Overview Microsoft Data Warehousing Solutions
2. Parallel Data Warehouse (PDW) – What’s that? Or MPP vs. SMP
3. Hardware Architecture – Control Rack and Data Rack
4. Tools (Management Dashboard, Nexus Query Tool, DWSQL)
5. Distribution and Replication of Data
6. Table Constraints and Data Type Limitations
7. Comparison: Load Speed with SMP versus PDW
8. Basic Shared Nothing / Shuffle Moves
9. Concrete Offerings
Microsoft Data Warehousing Solutions
Reference Architectures offering best price performance for data warehousing
Scalable and reliable platform for data warehousing on any hardware
Appliance for high-end data warehousing requiring highest scalability, performance, or complexity
Ideal for data marts or small to mid-sized DWs with scan-centric workloads
Ideal for large data marts or mid-sized EDWs
Offers flexibility in hardware and architecture
Software only
Reference Architectures (software and hardware)
Software only
DW appliance (fully integrated software and hardware)
Scale-up DW
Scale-up DW
Scale-up DW
Scale-out DW with MPP
10s of TB
2 – 80 TB
10s of TB
Scalable and reliable platform for data warehousing on any hardware
Ideal for data marts or small to mid-sized EDWs
10s - 100s of TB
Data Warehouse – Products Positioning
[Figure: product positioning chart. Axis labels: Scale, Complexity, HA by default, SW-HW integration; Simplicity. Products shown: SQL Server 2008 R2 Enterprise; SQL Server 2008 R2 with Fast Track Reference Architecture; SQL Server 2008 R2 Data Center; PDW (appliance) with Distributed Data Architecture. Annotation: 100% SQL Server 2008 R2 Compatibility]
MPP vs. SMP
SMP - Symmetric Multiprocessing
- Multiple CPUs used to complete individual processes simultaneously
- All CPUs share the same memory, disks, and network controllers
- All SQL Server implementations up until now have been SMP
MPP - Massively Parallel Processing
- Uses many separate CPUs running in parallel to execute a single program
- Each CPU has its own memory and disks
- High-speed communications between nodes
- Applications must be segmented
Two hardware vendors: HP and Dell
Microsoft+Dell: Parallel Data Warehouse Appliance
Microsoft+HP: Enterprise Data Warehouse Appliance
[Figure: appliance layout. Control Rack with Control Node, Management Node, Landing Zone, and Backup Node; Data Rack(s) with SQL Server nodes]
Control Node
- Client connections always go through the control node
- Windows Failover Cluster for availability
- Contains no persistent user data
- Processes SQL requests
- Prepares execution plan
- Orchestrates distributed execution
- Local SQL Server processes final query plan and aggregates results
Management Node
- Provides support and patching for the appliance
- Holds image for re-deployment of compute node
- Holds Active Directory
Landing Zone
- Provides high-capacity storage for data files from ETL processes
- Is available as a sandbox for other applications and scripts that run on the internal network
- Provides SQL Server Integration Services
[Figure: source files arrive on the Landing Zone and are loaded into the compute nodes by the data loader (DWLoader or SQL Server Integration Services)]
Backup Node
- Provides integrated backup solution
- Integrates with 3rd party backup products
- Orderable in different sizes
Data Rack
- Servers: 5/10 active + 1 passive per rack
- InfiniBand, FC and Ethernet switching
- Expansion: grow from 1/2 to 4 data racks, storage options, test/dev system
- Consists of compute nodes and storage nodes
- Shared nothing
- Spare node provides failover in case of node failure
Compute Node and Storage Node
- Each MPP node is a highly tuned symmetric multi-processing (SMP) node with standard interfaces
- More or less multiple FastTrack servers
- Provides dedicated hardware, database, and storage
- Runs SQL Server 2008
- Local drives are configured as RAID 1
Connectivity and Tools
Nexus Query Chameleon
DWSQL
Web-Based Management Dashboard
System Center (SCOM)
Distribution and Replication of Data
Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
Mktg Campaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
Larger (> 10 B) Fact Table is Hash Distributed Across All Compute Nodes
[Figure: the Sales Facts table split into slices SF-1 to SF-4 spread across the compute nodes]
Distribution and Replication of Data
Smaller (< 5 GB) Dimension Tables are Replicated on Every Compute Node
[Figure: each compute node holds one slice of Sales Facts (SF-1 to SF-4) plus a full copy of the Time (TD), Store (SD), Product (PD), and Mktg Campaign (MD) dimensions]
Result: Fact-Dimension Joins can be performed locally
Creating a Database
CREATE DATABASE PDW
WITH
(AUTOGROW = ON,
REPLICATED_SIZE = 1024 GB, -- (per Node)
DISTRIBUTED_SIZE = 16384 GB, -- (whole System)
LOG_SIZE = 1024 GB);
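The sizing options in the statement above mix a per-node value (REPLICATED_SIZE) with an appliance-wide value (DISTRIBUTED_SIZE). A rough sketch of the resulting arithmetic, assuming the 10-node data rack with 8 distributions per node shown on the distribution slide; the helper name and the even split across nodes are illustrative assumptions, not documented PDW behavior:

```python
def pdw_space_plan(replicated_gb, distributed_gb, nodes=10, dists_per_node=8):
    """Break the CREATE DATABASE sizes down per node and per distribution.

    Assumes REPLICATED_SIZE is allocated on every node and DISTRIBUTED_SIZE
    is split evenly across all nodes (hypothetical helper for illustration).
    """
    return {
        "replicated_total_gb": replicated_gb * nodes,
        "distributed_per_node_gb": distributed_gb / nodes,
        "distributed_per_distribution_gb": distributed_gb / (nodes * dists_per_node),
    }


plan = pdw_space_plan(replicated_gb=1024, distributed_gb=16384)
for name, value in plan.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the replicated space adds up across nodes, while the distributed space is divided among them.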
Distribution on a PDW
CREATE TABLE myTable
(column Defs)
WITH (DISTRIBUTION = HASH (id));
PDW Node 1
Create Table <myTable GUID>_a
Create Table <myTable GUID>_b
…
Create Table <myTable GUID>_h
8 Tables per Node
PDW Node 2
Create Table <myTable GUID>_a
Create Table <myTable GUID>_b
…
Create Table <myTable GUID>_h
PDW Node …
PDW Node 10
Create Table <myTable GUID>_a
Create Table <myTable GUID>_b
…
Create Table <myTable GUID>_h
Final Result: 80 individual tables across a 10 node (1 data rack) appliance
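How rows of a hash-distributed table end up in the 80 physical tables can be sketched in a few lines. This is a minimal illustration, not PDW's implementation: CRC32 stands in for the internal hash function (which the slides do not describe), and `locate` is a hypothetical helper.

```python
import zlib
from collections import Counter

NODES = 10           # compute nodes in one data rack (from the slide)
DISTS_PER_NODE = 8   # tables _a .. _h on each node (from the slide)


def locate(key):
    """Map a distribution-key value to (node number, table suffix).

    CRC32 is an assumption used for illustration only.
    """
    bucket = zlib.crc32(str(key).encode("utf-8")) % (NODES * DISTS_PER_NODE)
    node = bucket // DISTS_PER_NODE + 1
    suffix = "abcdefgh"[bucket % DISTS_PER_NODE]
    return node, suffix


# Distribute 100,000 synthetic id values and count rows per physical table.
placement = Counter(locate(row_id) for row_id in range(100_000))
print(len(placement), "physical tables in use")
print("rows per table: min", min(placement.values()), "max", max(placement.values()))
```

The same key always maps to the same table, which is what lets two tables distributed on compatible keys be joined without moving data.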
Create Replicated Table
CREATE TABLE DimProduct(
    ProductId   BIGINT        NOT NULL,
    Description VARCHAR(50),
    CategoryId  INT           NOT NULL,
    ListPrice   DECIMAL(12,2))
WITH (DISTRIBUTION = REPLICATE);
- Creates tables on each of the individual compute nodes and assigns them to the REPLICATED file group
- Data compression is automatically turned on
- CREATE TABLE statement syntax varies slightly from its syntax in standard Transact-SQL
Data Type Limitations
- Most scalar data types supported by SQL Server 2008 are supported by PDW
- Main exceptions
  - Text (and related BLOB data types)
  - XML
  - SQL Variant
  - Timestamp
  - System and CLR UDTs
- IDENTITY/DEFAULT constraints not supported
- Character data types are case sensitive
- PDW uses collation: Latin1_General_BIN2
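When planning a migration, the exception list above can be turned into a quick check of existing table definitions. The sketch below is an illustration only: the set of type names is one reading of the slide (treating ntext and image as the "related BLOB data types" and geography, geometry, and hierarchyid as system CLR types), and `unsupported_columns` is a hypothetical helper.

```python
# Type names derived from the "main exceptions" list; this set is an
# assumption and does not cover user-defined CLR types.
UNSUPPORTED = {
    "text", "ntext", "image", "xml", "sql_variant", "timestamp",
    "geography", "geometry", "hierarchyid",
}


def unsupported_columns(columns):
    """Return (column, type) pairs whose base type is on the exception list."""
    flagged = []
    for name, sql_type in columns:
        base = sql_type.split("(")[0].strip().lower()  # drop length/precision
        if base in UNSUPPORTED:
            flagged.append((name, sql_type))
    return flagged


table = [
    ("ProductId", "BIGINT"),
    ("Description", "VARCHAR(50)"),
    ("Specs", "XML"),
    ("Notes", "TEXT"),
]
print(unsupported_columns(table))
```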
PDW / SQL Server Data Types
bigint
binary
bit
char/nchar
date, time
datetime
datetime2
datetimeoffset
decimal
float
geography/geometry
hierarchyid
image
int
money
numeric
real
smalldatetime
smallint
smallmoney
sql_variant
sysname
text/ntext
timestamp
tinyint
uniqueidentifier
varbinary
varchar/nvarchar
xml
Performance Tests: Data Load on SMP System
Load a single 75 GB flat file with 600 million rows on SQL SMP
90,000 Bulk Copy Rows/sec
1 hour 48 min.
Data Load on PDW
Loading the same single flat file into PDW (75 GB / 600 million rows)
dwloader.exe
-i D:\TPCH\lineItem.tbl
-M Fastappend -E -m
-d tpch_100gb
-E -c -b 10000 -rt value -rv 100
-R LineItem.tbl.rejects
-e ascii -t "|" -r \r\n
-U sa -P {password}
-T tpch_100gb.dbo.lineitem_Load
Option       Load time       MB/sec
Reload       09 min 35 sec   133
Append       09 min 42 sec   131
FastAppend   02 min 23 sec   534
FastAppend is 45 times faster than the SMP load
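The throughput and speed-up figures can be cross-checked from the load times. This is a small sketch that assumes the flat file is exactly 75 GB with 1 GB = 1024 MB; the exact file size is not stated on the slides, so the computed MB/sec values only approximate the measured ones.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


FILE_MB = 75 * 1024                      # assumed file size in MB
smp = seconds(hours=1, minutes=48)       # SMP bulk load time
pdw = {
    "Reload": seconds(minutes=9, secs=35),
    "Append": seconds(minutes=9, secs=42),
    "FastAppend": seconds(minutes=2, secs=23),
}

for mode, t in pdw.items():
    print(f"{mode}: {FILE_MB / t:.0f} MB/sec, {smp / t:.1f}x faster than SMP")
```

The ratio for FastAppend (6480 s versus 143 s) rounds to the 45x quoted on the slide.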
dwloader.exe
-i D:\TPCH\lineItem.tbl
-M Fastappend -E -m -d tpch_100gb
-E -c -b 10000 -rt value -rv 100
-R LineItem.tbl.rejects
-e ascii -t "|" -r \r\n
-T tpch_100gb.dbo.lineitem_Load
Single file read: 600 MB/sec
Copy table within PDW
Table with 600 million rows (LineItem)
SELECT *
INTO lineitem_copy
FROM tpch_100gb.dbo.lineitem
36 min 07 sec (SMP) versus 2 min 12 sec on PDW: 14 times faster
Hub and Spoke
[Figure: hub-and-spoke architecture. A Central EDW Hub with Landing Zone, connected to ETL tools, third-party RDBMS, and third-party data integration, and to spokes for departmental reporting, regional reporting, regional reporting with Business Decision Appliance, high-performance reporting, and mobile applications]
Remote table copy
Create a Heap table on SMP destination server:
CREATE REMOTE TABLE tpch_Henk.dbo.LineItem_test
AT ('Data Source = NYCPDW-LZ01,1433;
User ID = sa; Password = x;')
AS
SELECT *
FROM tpch_100gb.dbo.lineitem_load
*)
Check Status of copy operation
SELECT *
FROM sys.dm_pdw_dms_workers
WHERE type = 'PARALLEL_COPY_READER'
AND destination_info = '[tpch_Henk].[dbo].[LineItem_test]'
*) Requires Infiniband HCA card in remote SQL Server SMP
Result: 600 million rows - Remote table copy
21:25 minutes
Basic Shared Nothing Join (Replicated/Distributed)
SELECT ss_key, Cost
FROM item_dim a
JOIN store_sales b ON a.color = b.color
WHERE a.color = 'Yellow'

Item Dim (replicated table, same content on Node 1 and Node 2)
Color    Cost
Red      10
Green    15
Blue     25
Yellow   5

Store Sales (distributed table), Node 1
ss_key   Color    Qty
1        Red      5
3        Blue     10
5        Yellow   12
7        Green    7
Result Set: 5,5

Store Sales (distributed table), Node 2
ss_key   Color    Qty
2        Red      3
4        Blue     11
6        Yellow   17
8        Green    1
Result Set: 6,5

Join Type: Shared Nothing
Distribution: Compatible
- Replication satisfies compatibility for inner joins
- Store Sales distribution key not used
Streaming Results
- Results streamed to client
- No aggregation (processing) on Control Node required
Final Result Set: 5,5 : 6,5
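The replicated/distributed join above can be simulated in a few lines, using the row values from the slide. This is a simplified sketch of the idea (each node joins its own fact rows against its local dimension copy, and the results are simply concatenated); the function names are illustrative and nothing here reflects PDW internals.

```python
# Replicated dimension: every node holds the same color -> cost mapping.
ITEM_DIM = {"Red": 10, "Green": 15, "Blue": 25, "Yellow": 5}

# Distributed fact table: node -> rows of (ss_key, color, qty).
STORE_SALES = {
    1: [(1, "Red", 5), (3, "Blue", 10), (5, "Yellow", 12), (7, "Green", 7)],
    2: [(2, "Red", 3), (4, "Blue", 11), (6, "Yellow", 17), (8, "Green", 1)],
}


def local_join(rows, item_dim, color):
    """Join one node's fact rows against its local copy of the dimension."""
    return [
        (ss_key, item_dim[c])
        for ss_key, c, _qty in rows
        if c == color and c in item_dim
    ]


def run_query(color):
    """Run the join on every node and stream the partial results together."""
    final = []
    for node in sorted(STORE_SALES):
        # No data movement: each node only touches its own rows.
        final.extend(local_join(STORE_SALES[node], ITEM_DIM, color))
    return final


print(run_query("Yellow"))  # [(5, 5), (6, 5)]
```

Because the dimension is available on every node, no step in the loop needs rows from another node, which is why the control node has nothing to aggregate.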
Basic Shared Nothing Join (Distributed/Distributed)
SELECT a.color, b.Qty
FROM web_sales a
JOIN store_sales b ON ws_key = ss_key
WHERE a.color = 'Red'

Web Sales (distributed table), Node 1
ws_key   Color    Qty
1        Red      15
3        Blue     20
5        Yellow   22
7        Green    17

Store Sales (distributed table), Node 1
ss_key   Color    Qty
1        Red      5
3        Blue     10
5        Yellow   12
7        Green    7
Result Set: Red,5

Web Sales (distributed table), Node 2
ws_key   Color    Qty
2        Red      13
4        Blue     21
6        Yellow   27
8        Green    11

Store Sales (distributed table), Node 2
ss_key   Color    Qty
2        Red      3
4        Blue     11
6        Yellow   17
8        Green    1
Result Set: Red,3

Join Type: Shared Nothing
Distribution: Compatible
- Join includes compatible distribution keys with compatible data types
Streaming Results
- Results streamed to client
- No aggregation (processing) on Control Node required
Final Result Set: Red,5 : Red,3
Redistribution Join: Shuffle
SELECT vs_key, a.ord, b.qty
FROM vendor_sales a
JOIN store_sales b
ON a.vs_key = b.VID
WHERE a.color = 'Red'

[Figure: Vendor Sales (distributed table; columns vs_key, Color, Ord) and Store Sales (distributed table; columns ss_key, VID, Qty) on Node 1 and Node 2, with the Store Sales rows shown before and after the shuffle move]

Join Type: Redistribution
Distribution: Incompatible
- Tables are not co-located on their respective distribution keys
- Distribution used from left table (vendor_sales) only
Shuffle-Move Operation
- Data from right table (Store_Sales) is rebuilt: DK = VID
- Query is now distribution compatible
Streaming Results
- Results streamed to client
- No aggregation (processing) on Control Node required
Result Sets: 11,15,5 and 2,13,3
Final Result Set: 11,15,5 : 2,13,3
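The shuffle move can be sketched as "redistribute the right table on the join column, then join locally". The example below is a toy model under stated assumptions: `home_node` is a made-up distribution function (odd keys on node 1, even keys on node 2), and the Store Sales rows are hypothetical values chosen so that the final result matches the slide.

```python
from collections import defaultdict

NODES = 2


def home_node(key):
    """Hypothetical distribution function: odd keys -> node 1, even -> node 2."""
    return key % NODES or NODES


def distribute(rows, key_index):
    """Place each row on the node that owns its distribution-key value."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[home_node(row[key_index])].append(row)
    return nodes


# vendor_sales rows: (vs_key, color, ord), distributed on vs_key.
vendor_sales = distribute(
    [(11, "Red", 15), (2, "Red", 13), (32, "Blue", 20), (54, "Yellow", 22)], 0
)
# store_sales rows: (ss_key, VID, qty), distributed on ss_key (hypothetical data).
store_sales = distribute([(2, 11, 5), (1, 2, 3), (3, 32, 10), (4, 54, 12)], 0)

# Shuffle move: rebuild store_sales so that its distribution key becomes VID.
shuffled = distribute([row for rows in store_sales.values() for row in rows], 1)

# After the shuffle, every node can join its own rows without further movement.
result = []
for node in range(1, NODES + 1):
    for vs_key, color, ord_ in vendor_sales[node]:
        if color != "Red":
            continue
        for _ss_key, vid, qty in shuffled[node]:
            if vid == vs_key:
                result.append((vs_key, ord_, qty))

print(result)  # [(11, 15, 5), (2, 13, 3)]
```

Before the shuffle, the matching Store Sales rows sit on the opposite node from their Vendor Sales partners; only the redistribution on VID makes the join local.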
Appliance Update AU3
- Performance: up to 10x improvement
- Data Movement Services
- New cost based Query Optimizer
- New Data Movement Service
- 1/2 rack appliances from HP and Dell
- System Center 2012 Integration (SCOM pack)
- And YES … Support for Stored Procedures (subset)
- Collations: Full support for international data
- Native SQL Server drivers
Enterprise Data Warehouse
Mark Wunderli
Infra2Apps
Technology Consultant
Hewlett-Packard (Switzerland) GmbH
mark.wunderli@hp.com
HP & Microsoft Data Warehousing Continuum: Reference Architectures and Appliances
Scales from 1 TB to hundreds of TBs
- Balanced solutions ideal for data marts - EDW with scan-centric workloads
- Packaged and custom support
- From SMB to Enterprise
- Built on HP ProLiant G7
Reference Architectures
- Basic RA: ProLiant DL38x G7 (2P), up to 24 cores, P2000 G3 (up to 20 TB)
- Mainstream RAs: ProLiant DL58x G7 (4P), up to 48 cores, P2000 G3 (up to 60 TB)
- Premium RA: ProLiant DL980 G7 (8P), up to 80 cores, P2000 G3 (up to 95+ TB)
Appliances
- Business Data Warehouse: ProLiant DL370 (2P), up to 12 cores, internal HDD (6 TB)
- HP Enterprise Data Warehouse: per rack 11 x ProLiant DL360 (2P) and 10 x P2000 G3 (56 – 150 TB), up to 4 data racks (600 TB)
Techtalk Trivadis | 9 May 2012
Enterprise Data Warehouse Components
[Figure: Control Nodes, Management Nodes, Landing Zone, Backup, and Data Racks (10 + 1 compute nodes per rack), interconnected via Infiniband, Fibre Channel, and Ethernet]
High-Level PDW Architecture
[Figure: Control Rack with Control Node (active/passive), Management Node (active/passive), Landing Zone(s), and Backup Node; Data Rack with Database Server Nodes (compute nodes and a spare node) and Storage Nodes (MSA). External connections: client drivers, ETL load interface, corporate backup solution. Interconnects: Infiniband and Fibre Channel]
Enterprise Data Warehouse Appliance: ½ Rack Configurations
Half-rack EDW configuration
Entry-level option for MPP technology
Data Rack Configuration
– Lower capacity requirements, same components (4+1 Compute / 4 Storage Nodes)
– HDD capacities: 300 GB (LFF/SFF), 600 GB SFF, and 1 TB LFF
– Control Rack unchanged
Upgrade to Full Data Rack:
– At intro, max 1 half data rack EDW orderable
– Upgrade to full data rack possible
For
Backup
Total Solution Support
HP and Microsoft Converged Systems
Microsoft and HP work together to provide a seamless support experience.
Customers choose the service level from Microsoft and from HP to meet their business needs.
Microsoft
Premier Support *
− 24x7 Reactive Support with on-site response
− Proactive Services
− Technical Account Management
or upgrade through add-on
Premier Mission Critical
All features of underlying Premier Plan above, plus
− Faster reactive support response time with on-site solution engineering support
− Prioritized access to Microsoft product groups
− Solution supportability review and architectural guidance for maximum performance
HP
Support Plus 24
− Reactive 24x7 hardware and software support for HP appliance components with a 4 hr onsite hardware response
or
Proactive 24 Service
− Integrated hardware and software support including proactive and reactive services to improve stability and availability across your IT environment
or
Critical Service
− Comprehensive support solution designed to help minimize the business impact of downtime for mission critical applications
* Premier support plan (Standard level or above) is a prerequisite for PDW customers
Merci! Thank you! Grazie mille! VIELEN DANK!
Hewlett-Packard (Switzerland) GmbH
mark.wunderli@hp.com
Trivadis AG
meinrad.weiss@trivadis.com