
Parallel Database Systems 101

Jim Gray & Gordon Bell

Microsoft Corporation, presented at VLDB 95, Zurich, Switzerland, Sept 1995

• Detailed notes available from Gray@Microsoft.com

– this presentation is 120 of the 174 slides (time limit)

– Notes in PowerPoint7 and Word7


Outline

• Why Parallelism:

– technology push

– application pull

• Benchmark Buyer’s Guide

– metrics

– simple tests

• Parallel Database Techniques

– partitioned data

– partitioned and pipelined execution

– parallel relational operators

• Parallel Database Systems

– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick


Kinds Of Information Processing

A 2x2 grid of media: point-to-point vs. broadcast, immediate vs. time-shifted.

– Immediate, point-to-point: conversation, money
– Immediate, broadcast: lecture, concert
– Time-shifted, point-to-point: mail
– Time-shifted, broadcast: book, newspaper

The immediate row is the Network; the time-shifted row is the Database.

It's ALL going electronic.
Immediate traffic is being stored for analysis (so it is ALL database).
Analysis & automatic processing are being added.


Why Put Everything in Cyberspace?

Point-to-Point or Broadcast
– Low rent: min $/byte
– Shrinks time: now or later
– Shrinks space: here or there
– Automate processing: knowbots

[Figure: the Network and the Database let you locate, process, analyze, and summarize information.]

Databases: Information At Your Fingertips™, Information Network™, Knowledge Navigator™

• All information will be in an online database (somewhere)

• You might record everything you:
– read: 10 MB/day, 400 GB/lifetime (two tapes)
– hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
– see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)

• Data storage, organization, and analysis are a challenge.

• That is what databases are about

• DBs do a good job on “records”

• Now working on text, spatial, image, and sound.


Databases Store ALL Data Types

• The Old World:
– Millions of objects
– 100-byte objects
– e.g. People(Name, Address): David, NY; Mike, Berk; Won, Austin

• The New World:
– Billions of objects
– Big objects (1 MB)
– Objects have behavior (methods)
– e.g. People(Name, Address, Papers, Picture, Voice)

• Enables the paperless office, the Library of Congress online, and all information online (entertainment, publishing, business); in short: Information Network, Knowledge Navigator, Information At Your Fingertips.

Magnetic Storage Cheaper than Paper

• File Cabinet: cabinet (4-drawer) 250$, paper (24,000 sheets) 250$, space (2x3 ft @ 10$/ft²) 180$; total 700$, i.e. 3¢/sheet
• Disk: disk (8 GB) 2,000$; holds 4 M ASCII pages, i.e. 0.05¢/sheet (60x cheaper than paper)
• Image: 200 K pages, i.e. 1¢/sheet (3x cheaper than paper)
• Store everything on disk


Moore’s Law

• XXX doubles every 18 months: 60% increase per year
– Microprocessor speeds
– Chip density
– Magnetic disk density
– Communications bandwidth (WAN bandwidth approaching LANs)

[Figure: 1-chip memory size, 1970-2000: 1 Kb, 4 Kb, 16 Kb, ... 256 Mb parts; module sizes from 8 KB up to 1 GB (2 MB to 32 MB typical)]

• Exponential growth: the past does not matter
– 10x here, 10x there, soon you're talking REAL change.

• PC costs decline faster than any other platform
– Volume & learning curves
– PCs will be the building bricks of all future systems
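As a sanity check on the slide's arithmetic, a couple of lines of Python (illustrative, not from the talk):

    # Doubling every 18 months = ~60% growth per year = ~100x per decade.
    annual = 2 ** (12 / 18)      # growth factor per year: ~1.59
    decade = annual ** 10        # growth factor per decade
    print(f"{annual:.2f}x per year, {decade:.0f}x per decade")  # 1.59x, 102x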


In The Limit: The Pico Processor

[Figure: a hypothetical 1 mm³ "pico processor": 1 M SPECmarks, 1 TFLOP, with a storage pyramid of 10 ps RAM (1 MB), 10 ns RAM (100 MB), 10 µs RAM (10 GB), 10 ms disc (1 TB), 10 s tape archive (100 TB)]

• 10^6 clocks to bulk RAM: the memory event-horizon is on chip.
• VM reincarnated; multi-program cache; on-chip SMP.
• Terror Bytes!

What's a Terabyte? (250 K$ of disk @ 0.25$/MB)

1 Terabyte holds:
1,000,000,000 business letters    150 miles of bookshelf
100,000,000 book pages            15 miles of bookshelf
50,000,000 FAX images             7 miles of bookshelf
10,000,000 TV pictures (MPEG)     10 days of video
4,000 LandSat images

The Library of Congress (in ASCII) is 25 TB.

1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1995: 250 K$ of magnetic disc (70 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)

Terror Byte!!

Summary (of storage)

• Capacity and cost are improving fast (100x per decade)

• Accesses are getting larger (MOX, GOX, SCANS)

• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this?

• Bandwidth:

– Use partitioned parallel access (disk & tape farms)

• Latency

– Pipeline data up storage hierarchy (next section)


Interesting Storage Ratios

• Disk is back to 100x cheaper than RAM

• Nearline tape is only 10x cheaper than disk

– and the gap is closing!

[Figure: price ratios, 1960-2000: RAM $/MB vs. disk $/MB swings between 10:1 and 100:1 (disk & DRAM look good); disk $/MB vs. nearline tape narrows from 30:1 toward 1:1 (why bother with tape?)]


Performance = Storage Accesses, not Instructions Executed

• In the “old days” we counted instructions and IO’s

• Now we count memory references

• Processors wait most of the time

[Figure: where the clock ticks go, for AlphaSort components on a nominal 70-MIPS cpu: memory wait dominates, split among B-cache data misses, I-cache misses, and D-cache misses. "Real" apps have worse I-cache and OS misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.]

Storage Latency: How Far Away is the Data?

[Figure: the storage hierarchy scaled to human distances:
10^9 clocks: tape / optical robot = Andromeda (2,000 years)
10^6 clocks: disk = Pluto (2 years)
10^2 clocks: memory = Sacramento (1.5 hr)
10 clocks: on-board cache = this campus (10 min)
2 clocks: on-chip cache = this room (1 min)
1 clock: registers = my head]

Network Speeds

• Network speeds grow 60% / year

• WAN speeds limited by politics

• if voice is X$/minute, how much is video?

• Switched 100Mb Ethernet

• 1,000x more bandwidth

• ATM is a scaleable net:

• 1 Gb/s to desktop & wall plug

• commodity: same for LAN, WAN

• 1Tb/s fibers in laboratory

[Figure: Comm speedups, 1960-2000: processor speed (instructions/s) and LAN & WAN bandwidth (bits/s) both climb from 1e3 toward 1e9 at ~60%/year]

Network Trends & Challenge

• Bandwidth UP 10^4, price DOWN
• Speed-of-light unchanged
• Software got worse
• Standard fast nets: ATM, PCI, Myrinet, Tnet
• HOPE:
– commodity net
– good software
• Then clusters become a SNAP!
– commodity: 10k$/slice

[Figure: bandwidth 1965-2000, from 1 Kb/s to 1 Gb/s: LAN, WAN, POTS, and PC-bus curves]

The Seven Price Tiers

10$: wrist watch computers
100$: pocket / palm computers
1,000$: portable computers
10,000$: personal computers (desktop)
100,000$: departmental computers (closet)
1,000,000$: site computers (glass house)
10,000,000$: regional computers (glass castle)

SuperServer: costs more than 100,000$.
"Mainframe": costs more than 1 M$; must be an array of processors, disks, tapes, comm ports.


The New Computer Industry

• Horizontal integration is new structure

• Each layer picks best from lower layer.

• Desktop (C/S) market

• 1991: 50%

• 1995: 75%

Function          Example
Operation         AT&T
Integration       EDS
Applications      SAP
Middleware        Oracle
Baseware          Microsoft
Systems           Compaq
Silicon & Oxide   Intel & Seagate

Software Economics: Bill’s Law

• Bill Joy's law (Sun): don't write software for less than 100,000 platforms. @10 M$ engineering expense, 1,000$ price.
• Bill Gates' law: don't write software for less than 1,000,000 platforms. @10 M$ engineering expense, 100$ price.

• Examples:

• UNIX vs NT: 3,500$ vs 500$

• UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$

• No Spreadsheet or Presentation pack on UNIX/VMS/...

• Commoditization of base Software & Hardware


Thesis: Many Little will Win over Few Big

[Figure: the same trend twice: computers shrink from the 1 M$ mainframe to the 100 K$ mini to the 10 K$ micro and nano; disks shrink from 14" to 9" to 5.25" to 3.5" to 2.5" to 1.8"]

Year 2000 4B Machine

• The year 2000 commodity PC (3 K$):
– a Billion instructions/sec (1 BIPS processor)
– a Billion bytes RAM
– a Billion bits/s net
– 10 B bytes disk
– a Billion-pixel display (3000 x 3000 x 24-bit pixels)

4 B PC’s: The Bricks of Cyberspace

• Cost 3,000 $

• Come with

• OS (NT, POSIX,..)

• DBMS

• High speed Net

• System management

• GUI / OOUI

• Tools

• Compatible with everyone else

• CyberBricks


Implications of Hardware Trends

Large disc farms will be inexpensive (100$/GB).
Large RAM databases will be inexpensive (1,000$/GB).
Processors will be inexpensive (1 k SPECint).

So: the building block will be a processor with large RAM and lots of disc (e.g. one CPU, 5 GB RAM, 50 GB disc).

Implication of Hardware Trends: Clusters

[Figure: a cluster of building blocks, each a CPU + 5 GB RAM + 50 GB disc]

Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.

Future SuperServer

4T Machine: an array of 1,000 4B machines (processors, disks, tapes, comm lines) for a few megabucks:
– 100 nodes = 1 TIPS
– 1,000 discs = 10 Terrorbytes
– 100 tape transports = 1,000 tapes = 1 PetaByte

Challenge: manageability, programmability, security, availability, scaleability, affordability; as easy as a single system.

Great Debate: Shared What?

– Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup. (Sequent, SGI, Sun)
– Shared Disk: in between. (VMScluster, Sysplex)
– Shared Nothing (network): hard to program, easy to build, easy to scaleup. (Tandem, Teradata, SP2)

The winner will be a synthesis of these ideas.
Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality is still important), but gives shared memory the message cost.

Scaleables: Uneconomic So Far

• A Slice is a processor, memory, and a few disks.

• Slice Price of Scaleables so far is 5x to 10x markup

– Teradata: 70k$ for an Intel 486 + 32MB + 4 disks.

– Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disk

– Intel: 75k$ for an I860 +32MB + 2 disk

– TMC: 75k$ for a SPARC 3 + 32MB + 2 disk.

– IBM/SP2: 100k$ for a R6000 + 64MB + 8 disk

• Compaq Slice Price is less than 10k$

• What is the problem?

– Proprietary interconnect

– Proprietary packaging

– Proprietary software (vendorIX)


Summary

• Storage trends force pipeline & partition parallelism

– Lots of bytes & bandwidth per dollar

– Lots of latency

• Processor trends force pipeline & partition

– Lots of MIPS per dollar

– Lots of processors

• Putting it together (Scaleable Networks and Platforms)

– Build clusters of commodity processors & storage

– Commodity interconnect is key (S of PMS)

» Traditional interconnects give 100k$/slice.

– Commodity Cluster Operating System is key

– Fault isolation and tolerance is key

– Automatic Parallel Programming is key


The Hardware is in Place... and Then a Miracle Occurs?

SNAP: Scaleable Network And Platforms.
A commodity distributed OS, built on commodity platforms and a commodity network interconnect, enables parallel applications.


Why Parallel Access To Data?

At 10 MB/s, it takes 1.2 days to scan 1 Terabyte.
With a 1,000x parallel scan, it is a 1.5-minute SCAN.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
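The arithmetic behind those two numbers, as a small Python check (disk rate and sizes as quoted above):

    TB = 1e12                        # 1 terabyte, in bytes
    disk = 10e6                      # one disk scans at 10 MB/s
    serial = TB / disk               # 100,000 seconds
    print(serial / 86400, "days")    # ~1.2 days on one disk
    print(serial / 1000, "seconds")  # ~100 s with 1,000-way parallelism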


DataFlow Programming

Prefetch & Postwrite Hide Latency

• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100 MB/s)
• Solution: pipeline from the source (tape, disc, RAM, ...) to the cpu cache, and pipeline the results to the destination.


Why are Relational Operators So Successful for Parallelism?

The relational data model: uniform operators on uniform data streams, closed under composition.
Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data.
Sequential data in and out: pure dataflow.
Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.

AUTOMATIC PARALLELISM


Database Systems “Hide” Parallelism

• Automate system management via tools

• data placement

• data organization (indexing)

• periodic tasks (dump / recover / reorganize)

• Automatic fault tolerance

• duplex & failover

• transactions

• Automatic parallelism

• among transactions (locking)

• within a transaction (parallel execution)


Automatic Parallel OR DB

Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, :Rockies)
  and snow_cover(image) > .7;

Landsat(date, loc, image): temporal, spatial, and image predicates.

Assign one process per processor/disk:
– find the images with the right date & location
– analyze each image; if 70% snow, return it

[Figure: the landsat table partitioned by date (1/2/72 ... 4/8/95) and location (33N 120W, 34N 120W, ...); the date, location, and image tests run in parallel and the answer images merge back]

Outline

• Why Parallelism:

– technology push

– application pull

• Benchmark Buyer’s Guide

– metrics

– simple tests

• Parallel Database Techniques

– partitioned data

– partitioned and pipelined execution

– parallel relational operators

• Parallel Database Systems

– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick


Parallelism: Speedup & Scaleup

Speedup: same job, more hardware, less time (the same 100 GB job on a bigger system).
Scaleup: bigger job, more hardware, same time (100 GB grows to 1 TB as the system grows).
Transaction scaleup: more clients and servers, same response time (1 k clients on a 100 GB server grows to 10 k clients on a 1 TB server).
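Stated as formulas (a minimal sketch; the function names are mine, not the talk's):

    def speedup(t_before, t_after):
        # Same job, N times the hardware; linear speedup means this returns N.
        return t_before / t_after

    def scaleup(t_small_job, t_big_job):
        # N times the job on N times the hardware; linear scaleup returns 1.0.
        return t_small_job / t_big_job

    print(speedup(3600.0, 360.0))   # 10x
    print(scaleup(3600.0, 3700.0))  # 0.97: near-linear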


The New Law of Computing

Grosch's law: 2x $ is 4x performance (1 MIPS for 1$, then 1,000 MIPS for 32$: 0.03$/MIPS).
The parallel law: 2x $ is 2x performance (1 MIPS for 1$, 1,000 MIPS for 1,000$).
Needs linear speedup and linear scaleup. Not always possible.

Parallelism: Performance is the Goal

Goal is to get 'good' performance.

Law 1: parallel system should be faster than serial system

Law 2: parallel system should give near-linear scaleup or near-linear speedup or both.


The New Performance Metrics

• Transaction Processing Performance Council:

– TPC-A: simple transaction

– TPC-B: server only, about 3x lighter than TPC-A

– Both obsoleted by TPC-C

(no new results after 6/7/95)

• TPC-C (revision 3) Transactions Per Minute tpm-C

– Mix of 5 transactions: query, update, minibatch

– Terminal price eliminated

– about 5x heavier than TPC-A (so 3.5 ktpsA ≈ 20 ktpmC)

• TPC-D approved in March 1995 - Transactions Per Hour

– Scaleable database (30 GB, 100GB, 300GB,... )

– 17 complex SQL queries (no rewrites, no hints without permission)

– 2 load/purge queries

– No official results yet, many “customer” results.


TPC-C Results 12/94

Courtesy of Charles Levine of Tandem (of course)

[Figure: TPC-C price/performance, $/tpmC vs. throughput (0-22,000 tpmC): AS400, RS6000, Sun, DG, and HP 9000 (E55, H70, T500 with 8-32 cpus) systems cluster at the low end; a 112-cpu Tandem Himalaya server reaches ~20,000 tpmC]


Success Stories

• Online Transaction Processing

• many little jobs

• SQL systems support 3700 tps-A

(24 cpu, 240 disk)

• SQL systems support 21,000 tpm-C

(112 cpu, 670 disks)

• Batch (decision support and Utility)

• few big jobs, parallelism inside

• Scan data at 100 MB/s

• Linear Scaleup to 500 processors


The Perils of Parallelism

[Figure: three speedup curves vs. processors & discs: the good curve is linear; one bad curve shows no parallelism benefit; another rises and then falls]

3 factors limit linearity:
– Startup: creating processes, opening files, optimization
– Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...)
– Skew: if tasks get very small, variance > service time


Benchmark Buyer's Guide

Things to ask (for any system):
– The whole story: when does it stop scaling?
– Throughput numbers, not ratios.
– Standard benchmarks allow comparison to others and comparison to sequential.

Ratios and non-standard benchmarks are red flags.

[Figure: 'The Benchmark Report': throughput vs. processors & discs]


Performance 101: Scan Rate

Disk is 3 MB/s to 10 MB/s.
A record is 100 B to 200 B (TPC-D 110...160, Wisconsin 204).
So you should be able to read 10 k records/s to 100 k records/s.

Simple test: time this on a 1 M record table (table on one disk, parallelism turned off):

SELECT count(*) FROM T WHERE x < :infinity;

Typical problems:
– the disk or controller is an antique
– no read-ahead in the operating system or DB
– small page reads (2 KB)
– data not clustered on disk
– big cpu overhead in record movement

Parallelism is not the cure for these problems.


Parallel Scan Rate

Simplest parallel test, a scaleup of the previous one: 4 disks, 4 controllers, 4 processors, 4 times as many records partitioned 4 ways, same query.

It should have the same elapsed time. Some systems do.

[Figure: four partitioned scans feed per-partition counts, merged by a final aggregate]

Parallel Update Rate

Test: UPDATE T SET x = x + :one;

– Test for a million-row T on 1 disk.
– Test for a four-million-row T on 4 disks.
– Look for bottlenecks (e.g. the log).

After each call, execute ROLLBACK WORK:
– See if UNDO runs at the DO speed.
– See if UNDO is parallel (scales up).


The records/s/$ Metric

• parallel database systems scan data

• An interesting metric (100 byte record):

– Record Scan Rate / System Cost

• Typical scan rates: 1k records/s to 30k records/s

• Each Scaleable system has a “slice price” guess:

– Gateway: 15k$ (P5 + ATM + 2 disks + NT + SQL Server or Informix or Oracle)
– Teradata: 75k$

– Sequent: 75k$ (P5+2 disks+Dynix+Informix)

– Tandem: 100k$

– IBM SP2: 130k$ (RS6000+2 disks, AIX, DB2)

• You can compute slice price for systems later in presentation

• BAD: 0.1 records/s/$ (there is one of these)

• GOOD: 0.33 records/s/$ (there is one of these)

• Super! 1.00 records/s/$ (there is one of these)

• We should aim at 10 records/s/$ with P6.
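Worked out for one hypothetical slice (the numbers here are illustrative, not measurements):

    scan_rate   = 10_000             # records/s scanned by one slice
    slice_price = 15_000             # $ per slice (the Gateway guess above)
    print(scan_rate / slice_price)   # 0.67 records/s/$: between GOOD and Super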


Embarrassing Questions to Ask Your PDB Vendor

How are constraints checked?
– ask about unique secondary indices
– ask about deferred constraints
– ask about referential integrity

How does parallelism interact with triggers? Stored procedures? OO extensions?

How can I change my 10 TB database design in an hour?
– add an index
– add a constraint
– reorganize / repartition

These are hard problems.


Outline

• Why Parallelism:

– technology push

– application pull

• Benchmark Buyer’s Guide

– metrics

– simple tests

• Parallel Database Techniques

– partitioned data

– partitioned and pipelined execution

– parallel relational operators

• Parallel Database Systems

– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick


Automatic Data Partitioning

Split a SQL table across a subset of the nodes & disks, as sketched below. Partition within the set by:
– Range (A...E | F...J | K...N | O...S | T...Z): good for equijoins, range queries, group-by
– Hash: good for equijoins
– Round robin: good to spread load

Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.
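A sketch of the three partitioning functions (illustrative Python; the range boundaries are assumed):

    def range_part(key, bounds=("E", "J", "N", "S")):
        # A...E | F...J | K...N | O...S | T...Z
        for i, b in enumerate(bounds):
            if key[:1].upper() <= b:
                return i
        return len(bounds)

    def hash_part(key, n):           # good for equijoins
        return hash(key) % n

    def round_robin_part(rowno, n):  # good for spreading load
        return rowno % n

    print(range_part("Gray"))        # 1, i.e. the F...J partition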


Index Partitioning

Hash indices partition by hash (buckets 0...9, 10...19, 20...29, 30...39, 40...).

B-tree indices partition as a forest of trees, one tree per range (A..C, D..F, G..M, N..R, S..Z).

The primary index clusters the data.


Secondary Index Partitioning

In shared nothing, secondary indices are problematic.

Partition the index by the base table's key ranges (each node holds an A..Z index over its own partition):
– Insert: completely local (but what about uniqueness?)
– Lookup: examines ALL the trees
– A unique index involves a lookup on every insert

Partition the index by secondary key ranges (A..C, D..F, G..M, N..R, S..Z across the whole base table); this is the Teradata solution:
– Insert: two nodes (base and index)
– Lookup: two nodes (index -> base)
– Uniqueness is easy


Kinds of Parallel Execution

Pipeline: one sequential program feeds its output to the next.
Partition: N clones of a sequential program, each on its own data; outputs split N ways, inputs merge M ways.

Data Rivers

Split + merge streams: N producers and M consumers connected by N x M data streams (the "river").

Producers add records to the river; consumers consume records from the river. Each program stays purely sequential.

The river does flow control and buffering, and does the partition and merge of data records (sketched below).

River = Split/Merge in Gamma = the Exchange operator in Volcano.
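A toy river in Python: producers split records into M streams by hash; each consumer reads one purely sequential stream. Real rivers also do flow control and buffering, which this sketch omits:

    from queue import Queue

    M = 3
    inlets = [Queue() for _ in range(M)]     # one inlet per consumer

    def produce(records):                    # any of the N producers
        for rec in records:
            inlets[hash(rec) % M].put(rec)   # split: route by hash

    def consume(i):                          # consumer i's merged stream
        out = []
        while not inlets[i].empty():
            out.append(inlets[i].get())
        return out

    produce(["apple", "pear", "plum", "fig"])
    print([consume(i) for i in range(M)])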


Partitioned Execution

Spreads computation and IO among the processors.

[Figure: a table partitioned A...E, F...J, K...N, O...S, T...Z, with a Count operator per partition feeding a final Count]

Partitioned data gives NATURAL parallelism.


N x M way Parallelism

[Figure: five Sort/Join operators over partitions A...E, F...J, K...N, O...S, T...Z feed three Merge operators]

N inputs, M outputs, no bottlenecks: partitioned data with partitioned and pipelined data flows.


Picking Data Ranges

Disk partitioning:
– For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
– For hash partitioning, cool hot disks by mapping some buckets to others.

River partitioning:
– Use hashing and assume uniformity.
– If range partitioning, sample the data and use a histogram to level the bulk.

Teradata, Tandem, and Oracle use these tricks.


Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all its input.
Examples: sort, aggregates, hash-join (which reads all of one operand).

The database-load template has three blocked phases:
– Tape/file -> Scan -> Sort runs -> Merge runs -> Table insert -> SQL table
– Sort runs -> Merge runs -> Index insert (Index 1)
– Sort runs -> Merge runs -> Index insert (Index 2, Index 3)

Blocking operators kill pipeline parallelism and make partition parallelism all the more important.


Simple Aggregates (sort or hash?)

Simple aggregates (count, min, max, ...) can use indices: more compact, and the index sometimes already holds the aggregate info.

GROUP BY aggregates (sketched below):
– scan in category order if possible (use indices);
– else, if the categories fit in RAM, use a RAM category hash table;
– else make a temp file of <category, item>, sort it by category, and do the math in the merge step.
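Both strategies in miniature (illustrative Python; sorted() stands in for the external sort of the temp file):

    def hash_group_count(rows):              # categories fit in RAM
        counts = {}
        for category, _item in rows:
            counts[category] = counts.get(category, 0) + 1
        return counts

    def sort_group_count(rows):              # categories do not fit
        out, last, n = [], None, 0
        for category, _item in sorted(rows): # do the math in the merge step:
            if category != last and last is not None:
                out.append((last, n))        # a group is complete when the
                n = 0                        # category changes
            last, n = category, n + 1
        if last is not None:
            out.append((last, n))
        return out

    rows = [("a", 1), ("b", 2), ("a", 3)]
    print(hash_group_count(rows), sort_group_count(rows))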


Sort

Used for loading and reorganization (sort makes them sequential), building B-trees, reports, and non-equijoins.
Rarely used for aggregates or equi-joins (if hash is available).

[Figure: input data -> sort -> runs -> merge -> sorted data]

Parallel Sort

M-input, N-output sort design:
– the river is range or hash partitioned
– sub-sorts generate runs; the runs are merged
– the disk and merge steps are not needed if the sort fits in memory

Scales nearly linearly: the log factor grows slowly (log of 10^12 is only 2x log of 10^6, so a million times more data sorts only ~2x slower per record).

Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, with no data filtering at the source.

[Figure: scans (or other sources) feed a range- or hash-partitioned river into parallel sub-sorts and merges]

SIGMOD Sort Award

Datamation Sort: 1 M records (100 B recs):
1,000 seconds in 1986
60 seconds in 1990
7 seconds in 1994
3.5 seconds in 1995 (SGI Challenge): micros finally beat the mainframe! Finally, a UNIX system that does IO!

SIGMOD MinuteSort:
1.1 GB, Nyberg, Alpha 3-cpu, 1994
1.6 GB, Nyberg, SGI Challenge (12 cpu), 1995
No SIGMOD PennySort record yet.

[Figure 1: sort records/second vs. time, 1985-1995: from ~1.0e+02 (M68000, Tandem) through Sequent, Intel HyperCube, Cray YMP, IBM 3090, and a hardware sorter, up to ~1.0e+06 (Alpha, SGI)]
[Figure 2: sort time on an SGI Challenge: 1.6 GB (16 M 100-byte records), 12 cpus, 2.2 GB RAM, 96 disks, vs. 1-10 threads (sprocs) devoted to sorting; phases: lists merged, lists sorted, read done, pin, write done]

Nested Loops Join

If the inner table is indexed on the join cols (B-tree or hash), then sequentially scan the outer (from the start key), and for each outer record probe the inner table for matching recs.

Works best if the inner is in RAM (=> small inner).

[Figure: the outer table is scanned; the inner table is probed per outer record]
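The idea in a few lines of Python (the dict stands in for the inner table's B-tree or hash index):

    def nested_loop_join(outer, inner, key):
        index = {}
        for rec in inner:                    # build (or reuse) the inner index
            index.setdefault(rec[key], []).append(rec)
        for o in outer:                      # sequential scan of the outer
            for i in index.get(o[key], []):  # probe for matching recs
                yield (o, i)

    print(list(nested_loop_join([{"k": 1}], [{"k": 1, "v": "x"}], "k")))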


Merge Join (and sort-merge join)

If the tables are sorted on the join cols (B-tree or hash), then sequentially scan each from the start key:
– left < right: advance left
– left = right: match
– left > right: advance right

A nice sequential scan of the data (disk speed). The N x M duplicate-key case may cause a backwards rescan; the extreme is the Cartesian product.

Sort-merge join sorts before doing the merge.
Partitions well: partition the smaller to the larger's partitioning.

Works for all joins (outer, non-equijoins, Cartesian, exclusion, ...).
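The advance-left / advance-right loop, sketched in Python over two inputs already sorted on the join column (the inner rescan handles the N x M duplicate-key case):

    def merge_join(left, right, key):
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i][key] < right[j][key]:
                i += 1                       # advance left
            elif left[i][key] > right[j][key]:
                j += 1                       # advance right
            else:                            # match: emit the N x M group
                k, j0 = left[i][key], j
                while i < len(left) and left[i][key] == k:
                    j = j0                   # backwards rescan of the group
                    while j < len(right) and right[j][key] == k:
                        yield (left[i], right[j])
                        j += 1
                    i += 1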


Hash Join

Hash the smaller table into N buckets (hope N=1).
If N=1: read the larger table and hash-probe the smaller.
Else: hash the outer to disk, then do a bucket-by-bucket hash join.

Purely sequential data behavior. Always beats sort-merge and nested loops unless the data is clustered.
Good for equi, outer, and exclusion joins. Hash reduces skew.
Lots of papers; products are just appearing (what went wrong?).

[Figure: the left and right tables hashed into matching buckets]
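A sketch of both cases in Python: the one-bucket build/probe, and the bucket-by-bucket (Grace-style) variant for when the build side will not fit in memory:

    def hash_join(small, large, key):        # N = 1: build on the smaller side
        table = {}
        for s in small:
            table.setdefault(s[key], []).append(s)
        for l in large:                      # probe: purely sequential reads
            for s in table.get(l[key], []):
                yield (s, l)

    def bucket_hash_join(small, large, key, n):
        sb = [[] for _ in range(n)]          # hash both inputs to "disk"
        lb = [[] for _ in range(n)]          # buckets by hash of the key ...
        for s in small:
            sb[hash(s[key]) % n].append(s)
        for l in large:
            lb[hash(l[key]) % n].append(l)
        for b in range(n):                   # ... then join bucket by bucket
            yield from hash_join(sb[b], lb[b], key)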

Parallel Hash Join

ICL implemented hash join with bitmaps in the CAFS machine (1976)!
Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully).
We ignored them! (why?) But now everybody's doing it (or promises to do it).

Hashing minimizes skew, requires little thinking for redistribution

Hashing uses massive main memory


Observations

It is easy to build a fast parallel execution environment (no one has done it, but it is just programming).
It is hard to write a robust and world-class query optimizer: there are many tricks, and one quickly hits the complexity barrier.

Common approach:
– pick the best sequential plan
– pick the degree of parallelism based on bottleneck analysis
– bind operators to processes


What’s Wrong With That?

Why isn't the best serial plan the best parallel plan? A counter-example:

A table is partitioned A..M / N..Z across two nodes, each with a local secondary index. The range query

SELECT * FROM telephone_book WHERE name < "NoGood";

selects all of node 1 and 1% of node 2. Node 1 should scan its partition; node 2 should use its secondary index.

Sybase Navigator & DB2 PE should get this right.
We need theorems here (practitioners do not have them).

What Systems Work This Way

Shared nothing:
– Teradata: 400 nodes
– Tandem: 110 nodes
– IBM / SP2 / DB2: 128 nodes
– Informix / SP2: 48 nodes
– ATT & Sybase: 8x14 nodes

Shared disk:
– Oracle: 170 nodes
– Rdb: 24 nodes

Shared memory:
– Informix: 9 nodes
– RedBrick: ? nodes

Outline

• Why Parallelism:

– technology push

– application pull

• Benchmark Buyer’s Guide

– metrics

– simple tests

• Parallel Database Techniques

– partitioned data

– partitioned and pipelined execution

– parallel relational operators

• Parallel Database Systems
– Teradata
– Tandem
– Oracle
– Informix
– Sybase
– DB2
– RedBrick

System Survey Ground Rules

Premise: The world does not need yet another PDB survey

It would be nice to have a survey of “real” systems

Visited each parallel DB vendor I could (time limited)

Asked not to be given confidential info.

Asked for public manuals and benchmarks

Asked that my notes be reviewed

I say only nice things (I am a PDB booster)


Acknowledgments

Teradata: Todd Walter and Carrie Ballinger
Tandem: Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
Oracle: Gary Hallmark, Bill Widdington
Informix: Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
Navigator: Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman, Bob McDonald, Jan Graveson, Ron Chung Hu, Stuart Thompto
DB2: Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh
Redbrick: Phil Fernandez, Donovan Schneider


Teradata

• Shipped 1984; now an ATT GIS brand name.
• A parallel DB server for decision support: SQL in, tables out.
• Supports heterogeneous data (converts to the client's format); applications run on the clients (UNIX, VMS, PC, Mac, IBM mainframe).
• Data is hash partitioned among the AMPs, with a fallback (mirror) hash.
• Biggest installation: 476 nodes, 2.4 TB.
• Ported to a UNIX base.

[Figure: client application processors and parsing engines (PEPs) front an array of AMPs]

Parsing Engines (PEs):
– interface to IBM channels or Ethernet or ...
– accept SQL; return records and status
– support SQL-89, moving to SQL-92
– parse, plan & authorize SQL (cost-based optimizer)
– issue requests to the AMPs; merge AMP results back to the requester
– some global load control based on client priority (adaptive and GREAT!)

Access Module Processors (AMPs):
– almost all work is done in the AMPs
– each is a shared-nothing SQL engine: scans, inserts, joins, log, lock, ...
– manages up to 4 disks (as one logical volume)
– easy to design, manage, and grow (just add disks)

Data Layout: Hash Partitioning

All data is declustered across all nodes (sketched below):
– Each table has a hash key (may be compound).
– The key maps to one of 4,000 buckets; buckets map to AMPs.
– Non-unique secondary indices are partitioned by the table's criterion.
– The fallback bucket maps to a second AMP in the cluster.
– A typical cluster is 6 nodes (2 would be mirroring). The cluster limits the failure scope: 2 failures cause a data outage only if both are in the same cluster.
– Within a node, each hash maps to a cylinder and then to a "page"; a page is a heap with a sorted directory.
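A sketch of the key -> bucket -> AMP mapping in Python. The bucket count is from the slide; the AMP count, the cluster size, and the fallback arithmetic are my assumptions:

    N_BUCKETS, N_AMPS, CLUSTER = 4000, 12, 6

    def bucket(key):
        return hash(key) % N_BUCKETS         # key -> one of 4,000 buckets

    def primary_amp(key):
        return bucket(key) % N_AMPS          # bucket -> AMP (assumed map)

    def fallback_amp(key):
        # a second AMP in the same 6-node cluster, never the primary
        amp = primary_amp(key)
        base = amp - amp % CLUSTER           # first AMP of this cluster
        offset = 1 + bucket(key) % (CLUSTER - 1)
        return base + (amp - base + offset) % CLUSTER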


Teradata Optimization & Execution

A sophisticated query optimizer (many tricks), with great emphasis on joins & aggregates: nested, merge, product, and bitmap joins (no hash join).
Automatic load balancing from hashing & load control.
Excellent utilities for data loading and reorganization:
– moved a >1 TB database from old to new hardware in 6 days, in the background, while the old system kept running
– old hardware, a 3.8 B-row table (1 TB), >300 AMPs: a typical scan, sort, or join averages 30 minutes

Query Execution Protocol

– The PE requests work; the AMPs respond OK (or push back).
– The AMPs work (if all OK); each AMP declares itself finished.
– When all have finished, the PE does 2PC and starts the pull.

Simple scan:
– The PE broadcasts the scan to each AMP.
– Each AMP scans and produces an answer spool file.
– The PE pulls the spool files from the AMPs via the Ynet.
– If the scan were ordered, a sort "catcher" would be forked at each AMP, pipelined to the scans; the Ynet and PE would do a merge of merges from the AMPs.

Aggregates, Updates

Aggregate of a scan:
– the scans produce local sub-aggregates
– sub-aggregates are hashed to the Ynet
– each AMP "catches" its sub-aggregate hash buckets and consolidates them
– the PE pulls the aggregates from the AMPs via the Ynet
Note: a fully scaleable design.

An insert / update / delete at an AMP node generates insert / update / delete messages to the unique secondary indices and to the fallback bucket of the base table; messages are saved in spool if a node is down.

Query Execution: Joins

Great emphasis on joins, including the small-table large-table optimization (cheapest triple, then cheapest in triple).

– If equi-partitioned, join locally.
– If not: may replicate the small table to the large one's partitions (the Ynet shines); may repartition one if the other is already partitioned on the join key; may repartition both (in parallel).

The join algorithm within a node is product, nested, sort-merge, or hash bitmaps of secondary indices, intersected.

Utilities

Bulk Data Load, Fast Data Load, Multi-Load:
– blast 32 KB of data to an AMP; double buffer
– multiple sessions by multiple clients can drive 200x parallel
– the AMP unpacks and puts "upserts" onto the Ynet; one record can generate multiple upserts (transaction -> inventory, store-sales, ...)
– a catcher on the Ynet grabs the relevant upserts to a temp file, sorts, and then batches the inserts (survives restarts)
– online and restartable; customers cite this as a Teradata strength

Fast Export is similar to bulk data load.

Utilities II

Backup / restore: rarely needed because of fallback. The cluster is the unit of recovery. Backup is online; restore is offline.

Reorganize: rarely needed; adding a disk is just a restart.

Add a node: rehash all the buckets that go to that node (the Ynet has the old and the new bucket maps). Fully parallel and fault tolerant; takes minutes.

Port To UNIX

The new design (3700 series) is described in VLDB 93.
Ported to UNIX platforms (3600 AP, PE, AMP); moved Teradata to a software Ynet on SMPs.
Based on bullet-proof UNIX with a TOS layer on top: message system, communications stacks, raw disk & virtual processors, virtual partitions (buckets go to virtual partitions); removes many TOS limits.

The result is 10x to 60x faster than an AMP:
– compiled expression evaluation (gives 50x speedup on scans)
– large main memory helps

[Figure: the software stack: SQL applications / parsing engine (parallelism) / Teradata SQL (AMP logic) / UNIX PDE: TOS adapter / UNIX 5.4 (SMP, RAS, virtual Ynet) / hardware]

Customer Benchmarks

Standard benchmarks: only the old Boral/DeWitt Wisconsin numbers; nothing public.

Moving a >1 TB database from old to new hardware in 6 days, in the background while the old system ran, implies an unload-load rate > 2 MB/s sustained (a background task, speed limited by host speed/space).

Old hardware, a 3.8 B-row table, >300 AMPs; a typical scan, sort, or join averages 30 minutes. Rates (record size not cited):

                    krec/s/AMP    total
scan                9             2.7 M rec/s !!!!!!
clustered join      2             600 k rec/s
insert-select       .39           120 k rec/s
hash index build    3.3           100 k rec/s

UNIX/SMP Port of Teradata

Times to process a Teradata test DB on an 8-Pentium 3650. These numbers are 10 to 150x better than a single AMP (compiled expression handling, more memory):

op                        rows        seconds   k rec/s   MB/s
scan                      50,000,000  737       67.8      11.0
copy                      5,000,000   1,136     4.4       0.7
aggregate                 50,000,000  788       63.5      10.3
join 50x2M (clustered)    52,000,000  768       67.7      11.0
join 5x5 (unclustered)    10,000,000  237       42.2      6.8
join 50Mx.1K              50,000,100  1,916     26.1      4.2

Teradata Good Things

Scaleable to large (multi-terabyte) databases

Available TODAY!

It is VERY real: in production in many large sites

Robust and complete set of utilities

Automatic management.

Integrates with the IBM mainframe OLTP world

Heterogeneous data support makes it a good data warehouse.


Tandem

A message-based OS (Guardian) gives (1) location transparency and (2) fault isolation (failover to other nodes). Expand software: up to 255 systems on a WAN.

A classic shared-nothing system (like Teradata, except applications run inside the DB machine): up to 224 processors; 1-16 MIPS R4400 cpus per node; dual-port controllers; dual 30 MB/s LANs (8 x 1 MB/s links).

1974-1985: Encompass: fault-tolerant distributed OLTP
1986: NonStop SQL: the first distributed and high-performance SQL (200 tps)
1989: Parallel NonStop SQL: parallel query optimizer/executor
1994: parallel and online SQL (utilities, DDL, recovery, ...)
1995: moving to ServerNet: a shared-disk model

Tandem Data Layout

Each table or index is range partitioned across a set of disks (anywhere in the network): file = {partitions}, partition = {blocks}.

– The index is a B-tree per partition; the clustering index is a B+ tree.
– Table fragments are files (extent based); extents may be added.
– Descriptors for all local files live in the local catalog (node autonomy).
– Tables can be distributed in the network (LAN or WAN).
– Duplexed disks and disk processes for failover.

Tandem Software (Process Structure)

[Figure: an application (SQL in C/COBOL/...) calls the SQL engine, which does joins, sorts, global aggregates, triggers, index maintenance, views, and security; disk-server pairs with a shared buffer pool handle selects, update/delete, record/set insert, aggregates, assertions, locking, and logging against duplexed disk pairs or arrays; GUI utilities and helper processes on the side]

The hardware & OS move data at 4 MB/s with >1 instruction/byte.

OLTP Features

– Insert / update / delete maintains the indices in parallel with the base table: with 5 indices, 5x faster response time.
– Record and key-value range locking; SQL-92 isolation levels.
– An undo scanner per log double-buffers undo to each server.
– 21 k tpmC (WOW!!) with a 110-node server (800 GB db).
– Can mix OLTP and batch: priority scheduling avoids the priority-inversion problem, and buffer management prevents sequential buffer pollution.

Tandem Query Plan & Execution

– Simple selects & aggregates are done in the disk servers.
– Parallelism chosen: scans follow the table fragmentation; hash width follows the number of processors or the outer table's fragments.
– Sorts: redistribution, then sort in the executors (N-M).
– Joins are done in the executors (nested, sort-merge, hash).
– Redistribution is always a hash (minimizes skew).
– Pipelines are as deep as possible (use lots of processes).
– Multiple logs & parallel UNDO avoid bottlenecks.
– Can mix OLTP and batch: priority scheduling avoids priority inversion; buffer management prevents sequential buffer pollution.

[Figure: application -> SQL subsystem -> executors -> disk servers]

Parallel Operators

Initially Tandem just inserted rivers between the sequential operators. The parallel query optimizer created executors at all clustering nodes, or at all nodes, repartitioned to via hash. That gave parallel select, insert, update, delete, join, sort, aggregates, ...; correlated subqueries are blocking. Got linear speedup/scaleup on Wisconsin.

Marketing never noticed; the product slept from 1989-1993. Developers then added: hash join, aggregates in the disk process, SQL-92 features, parallel utilities, online everything; converted to MIPSco; fixed bugs.

Join Strategies

– Nested loop and sort-merge; both can work off index-only access.
– Replicate the small table to all partitions (when one input is small): the small-table Cartesian-product large-table optimization.
– Now a hybrid-hash join: uses many small buckets, tuned to memory demand and to sequential disk performance. No bitmaps, because (1) the hash is parallel and (2) equijoins usually do not benefit.
– When both inputs are large and unclustered (a rare case): N+M scanners, 16 catchers, sort-merge or hybrid hash.

Administration (Parallel & Online everything)

All utilities are online (claimed to reduce outages by 40%):
– add table, column, ...
– add index: builds the index from a stale copy, uses the log for catchup, and in the final minute gets a lock and completes the index
– reorg a B-tree while it is accessed
– add / split / merge / reorg a partition
– backup; recover a page, partition, or file
– add or alter logs, disks, processors, ...

You need this: terabyte operations take a long time!

Parallel utilities: load (M to N); index build (M scanners, N inserters, in the background); recovery.

Benchmarks

No official DSS benchmark reports; unofficial results on 1 to 16 R4400-class processors (Himalayas), 64 MB each, 3 disks and 3 controllers each:

                      Sequential          16x Parallel         speedup
Load Wisc             1.6 kr/s  321 KB/s  28 kr/s   5.4 MB/s   16
Parallel index build  1.5 kr/s  15 KB/s   24 kr/s   240 KB/s   16
SCAN                  28 kr/s   5.8 MB/s  470 kr/s  94 MB/s    16 !!!!!!!
Aggregate (1 col)     25 kr/s   4.9 MB/s  400 kr/s  58 MB/s    16
Aggregate (6 col)     18 kr/s   3.6 MB/s  300 kr/s  60 MB/s    16
2-way hash join       13 kr/s   2.6 MB/s  214 kr/s  42 MB/s    16
3-way hash join       ? kr/s    ? MB/s    ? kr/s    ? MB/s     ?

The 1x and 16x rates are the best I've seen anywhere.

Tandem Good Things

21 K TPM-C ( WOW !)

It is available TODAY!

Online everything

Fault tolerant, distributed, high availability

Mix OLTP and batch

Great Hash Join Algorithm

Probably the best peak performance available


Oracle

Parallel Server (V7): multiple threads in a server, multiple servers in a cluster; client/server, OLTP & clusters (TP lite).
Parallel Query (V7.1): parallel SELECT (and sub-selects). Beta in 1993, shipped 6/94.
Parallel Recovery (V7.1): at restart, one log scanner, multiple redoers.
More parallelism (create table): V7.2, 6/95.

A shared-disk implementation, ported to most platforms. Parallel SELECT only (no parallel INSERT, UPDATE, DELETE, or DDL), except for sub-selects inside those verbs.

Oracle Data Layout

Homogeneous: one table (or index) per segment; extents are picked from a tablespace (= a file set; files may be raw disk).
Segments are B-trees or heaps. Extents may be added; the data -> disk map is automatic.
No range / hash / round-robin partitioning; ROWID can be used as scan partitioning on base tables.

Guiding principle: if it's not organized, it can't get disorganized, and doesn't need to be reorganized.

Oracle Parallel Query Product Concept

Convert a serial SELECT plan into a parallel plan:
– If the plan has a table scan (or a HINT), consider a parallel plan.
– A table has a default degree of parallelism (explicitly set), overridden by system limits and hints; use the max degree of all participating tables.
– Intermediate results are hash partitioned.
– Nested-loop join and merge join.
– User hints can (must?) specify join order, join strategy, index, degree of parallelism, ...

[Figure: the client talks to a query coordinator, which drives a multiprocess & multithreaded DB: instances x degree]

Query Planning

The query coordinator starts with the Oracle cost-based plan. If the plan requests a table scan (or a HINT), it considers a parallel plan. A table has a default degree of parallelism (explicitly set), overridden by system limits and hints; the max degree of all participating tables is used.

Shared disk makes temp-space allocation easy. The planner picks the degree of parallelism and the river partitioning. They are proud of their OR optimization.

Query Execution

The coordinator does extra work to merge the outputs of several sorts (sub-sorts are pushed to the servers) and to aggregate the outputs of several aggregates (aggregates are pushed to the servers).

Parallel function invocation is potentially a big win:

SELECT COUNT( f(a,b,c,...) ) FROM T;

invokes the function f on each element of T, 100x parallel.

Join Strategies

Oracle has (1) nested-loop join and (2) merge join.
Replicating the inner to the outer's partitions is automatic with shared disk (it looks like partitioning the outer).
Has the small-table large-table (Cartesian product join) optimization.
User hints can specify join order, join strategy, index, degree of parallelism, ...

Transactions & Recovery

– Transactions and transaction save points (linear nesting).
– Read-only snapshots for decision support.
– SQL-92 isolation levels (ACID via snapshot isolation).
– The database has multiple rollback (UNDO) segments, but a transaction has one commit/REDO log, which may be a bottleneck.
– Parallel recovery at restart: one log scanner; DEGREE redo streams, typically one per disk; INSTANCE redo streams, typically two-deep per disk.

Administration

Not much special:
– limit the degree of parallelism at a server; set the default parallelism of a table; a query can only lower these limits
– no special tools, meters, monitors, ...: just ordinary Parallel Server

Oracle utilities:
– the user can write a parallel load / unload utility
– index build and constraints are separate steps; not incremental, online, or restartable
– Update Statistics (Analyze) is not parallel
– index build is N-1 parallel: N scanner/sorters, 1 inserter
– parallel recovery at restart: one log scanner; DEGREE redo streams (one per disk); INSTANCE redo streams (two-deep per disk)

Benchmarks

Sequent, 20x 50 MHz 486, 0.5 GB RAM, 20 disks:

                     Sequential           20x Parallel         speedup
Load 5M Wisc         0.5 kr/s  113 KB/s   8.8 kr/s  1.8 MB/s   16
Parallel index load  2.2 kr/s  18 KB/s    29 kr/s   235 KB/s   13
SCAN                 1.7 kr/s  364 KB/s   26 kr/s   5.3 MB/s   15
Agg MJ               3.3 kr/s  660 KB/s   45 kr/s   9.3 MB/s   14
Agg NJ               1.4 kr/s  290 KB/s   26 kr/s   5.4 MB/s   19

The same benchmark on a 16x SP1 (a shared-nothing machine) got similar results. A 168x nCube (16 MB/node, 4 lock nodes, 64 disk nodes) got good scaleup. Oracle has published details on all these benchmarks.

Sept 1994 news: a 20-Pentium, 40-disk system SCANs at 44 MB/s at 55% cpu.

Oracle Good Things

– Available now! Parallel everywhere (on everybody's box).
– A HIGH FUNCTION SQL: no restrictions (triggers, indices, ...).
– Very easy to use (almost no knobs or options).
– Parallel invocation of stored procedures.
– Near-linear scaleup and speedup of SELECTs.

Informix

DSA (Dynamic Scaleable Architecture) describes the redesign to a thread-based server system:
– V6 (1993): DSA re-architecture (threads, OLTP focus)
– V7 (1994): PDQ, Parallel Data Query (SMP)
– V8 (1995): XMP, cluster parallelism (shared disk / shared nothing)

Parallelism is a MAJOR focus now that SQL-92 is under control; the other major focus is TOOLS (ODBC, DRDA, NewEra 4GL).

Informix is a UNIX SQL system: AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, Sun (SunOS, Solaris). Today: shared-nothing parallelism on IBM SP2, ATT 3650, ICL (beta).

Informix Data Layout

A table or index maps to a homogeneous set of DBspaces, each containing "chunks" (extents); chunks may be added.
Partition by: range, round robin, expression, or hash (V8).
Access via B+ tree, B* tree, and hash (V8).
Informix built an extent-based file system on raw disks or files: high-speed sequential IO, clustering, async IO, ...

Informix Execution

Completely parallel DML, some parallel DDL:
– parallel SELECT, UPDATE, DELETE; an executor per partition in all cases
– parallel sort, joins (nested, merge, hash), aggregates, union

Whenever an operator has input and a free output buffer, it can work to fill the output buffer: natural flow control.

Blocking operators (sort, hash join, aggregates, correlated subqueries) spool to a buffer (if small), else spool to disk.
A shared buffer pool minimizes data copies.

[Figure: a client talks to virtual processors and helpers sharing a buffer pool; scans feed joins, which feed aggregates and unions]

Parallel Plans

A query plan is parallelized by:
– a scanner per table partition (does select, project)
– sub-aggregates per partition (hash or sort)
– if a clustered join (nested loop or merge): an operator per outer or per partition
– if a hash join: parallel-scan the smaller input first and build a bitmap and hash buckets; then scan the larger and join to the smaller if it fits in memory, else filter via the bitmap, build the larger's buckets, and join bucket by bucket

That is a hybrid hash join with bitmaps and bucket tuning.

Parallel Operators

Parallel SELECT, UPDATE, DELETE

Executor per partition in all cases.

Parallel sort, joins, aggregates, union

Only correlated subqueries are blocking

Completely parallel DML, some parallel DDL


Transactions & Recovery

SQL 2 isolation levels allow DSS to run in background

Transaction save points

Separate logical and physical logs.

Bulk updates could bottleneck on single log.

Recovery unit is data partition (DBspace)

Parallel recovery: thread per DBspace

If DB fragment unavailable, DSS readers can skip it


Informix Administration

Can assign a % of processors, memory, and IO to DSS (parallel query). The sum of all parallel queries lives within this quota; each query can specify the % of the total that it wishes (0 means sequential execution).

Utilities:
– parallel data load (SMP only)
– parallel index build (N - M)
– parallel recovery
– online backup / restore

Benchmarks

Sequent system: 9 Pentium processors, 1 GB main memory; base tables on 16 disks (FWD SCSI), indices on 10 discs, temp space on 10 disks.

Load 300M Wisc: 3 kr/s (600 KB/s); parallel index load: 48 kr/s (1 MB/s)

                  Sequential           Parallel             speedup
SCAN              17 kr/s  3.5 MB/s    147 kr/s  30 MB/s    8.3
Aggregate         11 kr/s  2.3 MB/s    113 kr/s  23 MB/s    10.1
2-way hash join   18 kr/s  3.2 MB/s    242 kr/s  31 MB/s    9.7
3-way hash join   25 kr/s  3.5 MB/s    239 kr/s  33 MB/s    9.5

Informix Shared Nothing Benchmark

IBM SP2, a TPC-D-like database, 48 SP2 processors. A customer benchmark, not an audited benchmark.

Load: 60 GB in 40 minutes; 250 GB in 140 min: about 100 GB/hr! (2 GB/node/hr)
Scan & aggregate (query #6): 60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s; 260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s
Power test (17 complex queries and 2 load/purge ops): 60 GB in 5 hrs; 260 GB in 18 hrs
Multiuser test: 1 user, 12 queries: 10 hrs; 4 users, 3 queries: 10 hrs

Informix Good Things

A full function SQL

Available today on Sequent

Beautiful manuals

Linear speedup and scaleup

Best published performance on UNIX systems

Probably best price performance.

(but things are changing fast!)

Some mechanisms to mix OLTP and batch.


Sybase Navigator Product concept

A two-layer software architecture:
(1) The Navigator drives an array of shared-nothing SQL engines.
(2) The array of SQL engines, each unaware of the others (similar to Tandem disk processes); the SQL engine is COTS.

Goal: linear scaleup and speedup, plus good OLTP support. Emphasizes the WHOLE LIFECYCLE:
– Configurator: tools to design a parallel system
– Administrator: tools to manage a parallel system (install/upgrade, start/stop, backup/restore, monitor/tune)
– Optimizer: execute requests in parallel

[Figure: the Configurator and Administrator front an array of SQL engines]

Configurator

A fully graphical design tool (heavy into circles and arrows). Given an ER model, a dataflow model of the application, workload characteristics, response-time requirements, and the hardware components, it recommends a hardware configuration, table definitions (SQL), and table partitioning.

Administrator

Sybase made HUGE investments in this area: truly industry-leading graphical tools make MPP configuration "doable".

A GUI interface to manage:
– startup / shutdown of the cluster
– backup / restore / manage logs
– configure (install, add nodes, configure and tune servers)
– manage / consolidate system event logs
– system stored procedures (global operations, e.g. aggregating statistics from the local to the global catalog)
– monitor SQL Server events

Data Layout

Pure shared nothing. The Navigator partitions data among the SQL servers:
– map a table to a subset of the servers
– range partition or hash partition
– secondary indices are partitioned with the base table
– no unique secondary indices
– only shorthand views, no protection views

A schema server stores the global data definition for all nodes. Each partition server has the schema for its partition, the data for its partition, and the log for its partition.

Sybase SQL Server Backgrounder

– Recently became SQL-89 compliant (cursors, nulls, etc.); stored procedures, multi-threaded, internationalized.
– B-tree centric (the clustering index is a B+ tree); uses nested-loop and sort-merge join (the sort is an index build).
– Page locking, 2 KB disk IO, ... other little-endian design decisions.
– Respectable TPC-C results (AIX RS/6000).
– UNIX raw disks or files are the base (also on OS/2, NetWare, ...).
– Table-to-disk mapping:

CREATE DATABASE name ON {device...} LOG ON {device...}
SP_ADDSEGMENT segment, device
CREATE TABLE name(cols) [ON segment]

– Microsoft has a copy of the code, deeply ported to NT.

Navigator Extension Mechanisms

Navigator extended Sybase TDS by

Adding stored procedures to do things

Extending the syntax (e.g. see data placement syntax below)

Sybase TDS and OpenServer design are great for this

All front ends are based on OpenServer and threads.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

153

Process Structure - Pure Shared Nothing

DBA Server does everything: SQL compilation

System management

Catalog management

SQL server restart (in 2nd node)

DBA fallback detects deadlock, does DBA takeover on failure

Control server at each node manages SQL servers there

(security, request caching, 2PC, final merge/aggregate, parallel stored procedures (SIMD), ...)

Split server manages re-partitioning of data

SQL Server is the unit of query parallelism (one per CPU per node)

[Diagram: process structure: GUI Navigator Manager, Clients, Control server (1/node), Split server, SQL schema server; DBA server = system manager, monitor & SQL optimizer; schema server = catalogs database in a SQL server]

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

154

Simple Request Processing

Client connects to Navigator (a Control Server) using standard Sybase TDS protocol.

SQL request flows to the DBA server, which compiles it, sends stored procedures (plans) to all control servers, and sends plans to all relevant SQL servers

Control server executes plan.

It passes the plan to the SQL server, which returns results.

Plan cached on second call; DBA server not invoked (see the sketch below).

Good for OLTP

[Diagram: Client → Control server (1/node) → DBA server, Split server, SQL schema server]
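
A minimal sketch of that fast path (hypothetical interfaces, not product code): the first call for a statement is compiled by the DBA server; repeat calls reuse the cached plan, so the DBA server is bypassed.

plan_cache = {}  # statement text -> compiled plan

def execute(statement, dba_compile, sql_servers):
    plan = plan_cache.get(statement)
    if plan is None:                   # first call: DBA server compiles
        plan = dba_compile(statement)
        plan_cache[statement] = plan
    # later calls skip compilation entirely: good for OLTP
    return [server(plan) for server in sql_servers]

# toy stand-ins for the DBA server and one SQL server
results = execute("SELECT 1", lambda s: ("plan", s), [lambda p: p])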

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

155

Parallel Request Processing

If a query involves multiple nodes, the command is sent to each one

(diagram shows secondary index lookup)

Query sent to SQL servers that may have relevant data.

If data needs to be redistributed or aggregated, split servers issue queries and inserts

(that is their only role); split servers have no persistent storage (see the sketch below).

[Diagram: parallel request: Client, Control servers (one per node), DBA server, Split servers, SQL servers, schema server]
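
A sketch of the split server's role under the stated assumptions (toy hash, in-memory lists standing in for SQL servers): it streams rows out of the source servers, rehashes them on the target key, and inserts them at the destinations, holding nothing itself.

def split(source_servers, dest_servers, key_of):
    for source in source_servers:        # issue queries against sources
        for row in source():             # stream rows out
            i = hash(key_of(row)) % len(dest_servers)  # toy, unstable hash
            dest_servers[i].append(row)  # issue inserts at the destination

# repartition two rows on their first column across two destinations
dests = [[], []]
split([lambda: [("a", 1), ("b", 2)]], dests, key_of=lambda r: r[0])
print(dests)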

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

156

Data Manipulation

SQL server is unit of parallelism

"Parallelized EVERYTHING in the T-SQL language"

Includes SIMD execution of T-SQL procedures, plus N-M data move operations.

Two-level optimization:

DBA Server has optimizer

(BIG investment, all new code,

NOT the infamous Sybase optimizer)

Each SQL server has Sybase optimizer

If extreme skew, different servers have different plans

DBA optimizer shares code with SQL server

(so they do not play chess with one another).

Very proud of their optimizer.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

157

Query Execution

Classic Selinger cost-based optimizer.

SELECT, UPDATE, DELETE N-to-M parallel

Bulk and async INSERT interface.

N-M Parallel sort

Aggregate (hash/sort); select and join can do index-only access if the data is there.

Eliminates correlated subqueries (converts them to joins)

(Ganski & Wong, SIGMOD '87, extended)

Join: nested-loop, sort-merge, index only

Sybase often dynamically builds an index to support nested loop (a fake sort-merge)

Typically left-deep sequence of binary joins.
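
The left-deep search can be shown in a few lines. This is a generic Selinger-style dynamic program with made-up costs, not the Navigator optimizer: it keeps the cheapest plan for each subset of tables and extends plans one inner table at a time.

from itertools import combinations

def best_left_deep(tables, scan_cost, join_cost):
    best = {frozenset([t]): ([t], scan_cost(t)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            plans = []
            for inner in subset:              # try each table as the inner
                order, cost = best[subset - {inner}]
                plans.append((order + [inner], cost + join_cost(order, inner)))
            best[subset] = min(plans, key=lambda p: p[1])
    return best[frozenset(tables)]

# toy costs: every scan is 10; a join gets cheaper the later it runs
order, cost = best_left_deep(["T1", "T2", "T3"], lambda t: 10,
                             lambda outer, inner: 100 // (len(outer) + 1))
print(order, cost)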

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

158

Join and Partition Strategy

Partition strategies

If already partitioned on the join key, then no splitting.

Else move the matching subset of T1 to T2's partitions,

or replicate T1 to all T2 partitions,

or repartition both T1 and T2 to the width of the home nodes or the target

(a decision sketch follows at the end of this slide).

No hash join, but all (re) partitioning is range or hash based.

Not aggressive parallelism/pipelining: 2 ops at a time.

Pipeline to disk via split server (not local to disk and then split).

Split servers fake subtables for SQL engines.

Top-level aggregates are merged by the control server; others are done by the split servers.
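
A sketch of the strategy choice described above (hypothetical thresholds and field names, not the product's cost model):

SMALL = 10_000  # hypothetical cutoff for replicating a table

def join_strategy(t1, t2):
    co_located = (t1["part_key"] == t1["join_key"] and
                  t2["part_key"] == t2["join_key"] and
                  t1["nodes"] == t2["nodes"])
    if co_located:
        return "already partitioned on the join key: no splitting"
    if t1["rows"] < SMALL:
        return "replicate T1 to all T2 partitions"
    if t2["part_key"] == t2["join_key"]:
        return "move the matching subset of T1 to T2's partitions"
    return "repartition both T1 and T2 (range or hash) to the target width"

print(join_strategy(
    {"part_key": "a", "join_key": "a", "nodes": 8, "rows": 10**6},
    {"part_key": "a", "join_key": "a", "nodes": 8, "rows": 10**7}))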

159

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

Utilities

Bulk data load (N-M) async calls

GUI manages

Backup all SQL servers in parallel

Reorg via CREATE TABLE <new>,

INSERT INTO <new> SELECT * FROM <old>

Utilities are mostly offline (as per Sybase)

Nice EXPLAIN utility

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

160

Futures

Hash join within split servers

Shared memory optimizations

Full support for unique secondary indices

Full trigger support (cross-server triggers)

Full security and view support.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

161

Benchmarks

Preliminary: 8x8 AT&T 3600 with Ynet.

node: 8 x (50 MHz 486, 256 KB local cache), 512 MB main memory,

2 x 10-disk arrays @ 2 GB, 4 MB/s per disk.

6 x Sybase servers

Scaleup & speedup tests of 1, 4, and 8 nodes.

Numbers (except loading) reported as ratios of elapsed times

Scaleup & speedup tests show a >7x speedup of 8-way over 1-way

Tests cover insert, select, update, delete, join, aggregate, load

Reference Account: Chase Manhattan Bank

14x8 P5 AT&T 3600 cluster (112 processors)

56 SQL servers, 10GB each = 560 GB

100x faster than DB2/MVS (minutes vs days)

162

Navigator Good Things

Concern for the whole lifecycle: design, install, manage, operate, use

Good optimization techniques

Fully parallel, including stored procedures!

Scaleup and Speedup are near linear.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

163

Sybase IQ

Sybase bought Expressway

Expressway evolved from Model 204 bitmap technology: index duplicate values with a bitmap, then compress the bitmap (see the sketch below).

Can give 10x or 100x speedup.

Can save space and IO bandwidth

Currently, two products (Sybase and IQ) not integrated
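
The bitmap idea in miniature (an illustrative sketch, not IQ's actual format): one bitmap per distinct column value marks the rows holding it, and run-length encoding shrinks the bitmap.

def bitmap_index(column):
    index = {}
    for rownum, value in enumerate(column):
        index.setdefault(value, set()).add(rownum)
    return index  # value -> set of row numbers (the bitmap)

def rle(bits):
    # run-length encode a 0/1 list as [(bit, run_length), ...]
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

col = ["CA", "CA", "NY", "CA", "NY", "NY", "NY"]
idx = bitmap_index(col)
bits = [1 if i in idx["NY"] else 0 for i in range(len(col))]
print(rle(bits))  # [(0, 2), (1, 1), (0, 1), (1, 3)]

Columns with few distinct values compress extremely well, which is where the 10x-100x speedups and the IO savings come from.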

164

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

DB2

DB2/VM = SQL/DS: System R gone public

DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)

Parallel and async IO into one process (on mainframe)

Parallel execution in next release (late next year?)

MVS PQS now withdrawn?

DB2/AS400: Home grown

DB2/2 PE: OS/2 Database Manager grown large. First moved to AIX

Being extended with parallelism

Parallelism based on SP/2 -- shared nothing done right.

Benchmarks today - Beta everywhere

DB2++: separate code path has OO extensions, good TPC-C

Ported to HP/UX, Solaris, NT in beta

165

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

DB2/2 Data Layout

• DATABASE: a collection of nodes (up to 128 SP2 nodes so far)

• NODEGROUP: a collection of logical nodes (with a 4K-entry hash map)

• LOGICAL NODE: A DB2 instance (segments, log, locks...)

• PHYSICAL NODE: A box.

Logical Node: segments of 4K pages

– Segments allocated in units (64K default)

– Tables stripe across all segments

• Table created in NodeGroup:

– Hash (partition key) across all members of group (see the sketch below)

• Cluster has a single system image
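
A sketch of that layout under the stated assumptions (4K-entry map per nodegroup; node names hypothetical): a row's partition key hashes to a map entry, and the entry names the logical node that stores the row. Rebalancing means editing map entries, not rehashing every row.

MAP_SIZE = 4096  # the 4K hash map

def make_hash_map(logical_nodes):
    # spread the 4096 entries round-robin over the group's logical nodes
    return [logical_nodes[i % len(logical_nodes)] for i in range(MAP_SIZE)]

def node_for(partition_key, hash_map):
    return hash_map[hash(partition_key) % MAP_SIZE]  # toy, unstable hash

group1 = make_hash_map(["node0", "node1", "node2", "node3"])
print(node_for("cust-00042", group1))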

[Diagram: nodegroups Group 1 and Group 2 mapped onto the cluster's nodes]

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

166

DB2/2 Query Execution

• Each node maintains pool of AIX server processes

• Query optimizer does query decomposition to node plans (like R* distributed query decomposition)

• Parallel optimization is one-phase (not two-phase like Wei Hong’s work)

• Sends sub-plans to nodes to be executed by servers

• Node binds plan to server process

• Intermediate results hashed

• Proud that Optimizer does not need hints.

• “Standard” join strategies (except no hash join).

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

167

DB2/2 Utilities

• 4 loaders:

– import

– raw-insert (fabricates raw blocks, no checks)

– insert

– bulk insert

• Reorganize hash map, add / drop nodes, add devices

– Table unavailable during these operations

• Online & Incremental backup

• Fault tolerance via HACMP

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

168

DB2/2 Performance:

Good performance, great scaling

Wisconsin scaleups:

big = 4.8 M rec = 1 GB; small = 1.2 M rec = 256 MB

scan rate ~12 kr/s/node; raw load: 2.5 kr/s/node

see notes for more data

[Chart: speedup vs. nodes (0 to 16) for DB2/2 PE on SP2; series: Load, Scan, Agg, SMJ, NLJ, SMJ2, Index1, Index2, MJ]

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

169

DB2/2 Good Things

Scaleable to 128 nodes (or more)

From IBM

Good performance

Complete SQL (update, insert,...)

Will converge with DB2/3 (OO and TPC-C stuff)

Will be available off AIX someday

– (AIX is slow and SP2 is very expensive)

170

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

RedBrick

• Read-only (LOAD then SELECT only) Database system

– Load is incremental and sophisticated

• Precompute indices to make small-large joins run fast

– Indices use compression techniques.

– Only join via indices

• Many aggregate functions to make DSS reports easy

• Parallelism:

– Pipeline IO

– Typically a thread per processor (works on index partition)

– Piggyback many queries on one scan (see the sketch after this list)

– Parallel utilities (index in parallel, etc)

– SP2 implementation uses shared disk model.
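
A sketch of the piggybacking bullet (illustrative only): one pass over the table feeds every waiting query's predicate and accumulator, instead of one scan per query.

def shared_scan(table, queries):
    # queries: list of (predicate, accumulate, state) triples
    for row in table:                        # a single pass over the data
        for predicate, accumulate, state in queries:
            if predicate(row):
                accumulate(state, row)
    return [state for _, _, state in queries]

table = [{"store": s, "sales": v} for s, v in [(1, 10), (2, 5), (1, 7)]]
q1 = (lambda r: r["store"] == 1, lambda st, r: st.append(r["sales"]), [])
q2 = (lambda r: r["sales"] > 6,  lambda st, r: st.append(r["store"]), [])
print(shared_scan(table, [q1, q2]))          # [[10, 7], [1, 1]]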

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

171

Summary

There is a LOT of activity

(many products coming to market)

Query optimization is near the complexity barrier

Needs a new approach?

All have good speedup & scaleup if they can find a plan

Managing huge processor / disk / tape arrays is hard.

I am working on commoditizing these ideas:

low $/record/sec (scale up PC technology)

low admin $/node (automate, automate, automate, ...)

Continuous availability (online & fault tolerant)

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

172
