Jim Gray & Gordon Bell
Microsoft Corporation, presented at VLDB 95, Zurich, Switzerland, Sept 1995
• Detailed notes available from Gray@Microsoft.com
– this presentation covers 120 of the 174 slides (time limit)
– notes available in PowerPoint 7 and Word 7
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
• Kinds of information: conversation, money, lecture, concert, mail, book, newspaper, network, database
• It’s ALL going electronic
• Immediately, it is being stored for analysis (so it ALL ends up in a database)
• Analysis & automatic processing are being added
• Point-to-Point OR Broadcast
• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
[Diagram: Network plus Data Base, where knowbots Locate, Process, Analyze, Summarize]
Information At Your Fingertips™
Information Network™
Knowledge Navigator™
• All information will be in an online database (somewhere)
• You might record everything you:
– read: 10 MB/day, 400 GB/lifetime (two tapes)
– hear: 400 MB/day, 16 TB/lifetime (a tape per decade)
– see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
• Data storage, organization, and analysis is a challenge.
• That is what databases are about
• DBs do a good job on “records”
• Now working on text, spatial, image, and sound.
• The Old World:
– Millions of objects
– 100-byte objects
– People table: Name, Address (David, NY; Mike, Berk; Won, Austin)
• The New World:
– Billions of objects
– Big objects (1 MB)
– Objects have behavior (methods)
– People table: Name, Address, Papers, Picture, Voice
• Paperless office
• Library of Congress online
• All information online: entertainment, publishing, business
• Information Network, Knowledge Navigator, Information at your fingertips
• File Cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, space (2x3 ft @ 10$/ft²) 180$; total 700$, or 3¢/sheet
• Disk (8 GB): 2,000$
– ASCII: 4 M pages, 0.05¢/sheet
– Image: 200 K pages, 1¢/sheet
• Conclusion: store everything on disk
• XXX doubles every 18 months: 60% increase per year
– Microprocessor speeds
– chip density
– Magnetic disk density
– Communications bandwidth (WAN bandwidth approaching LANs)
[Chart: 1-chip memory size (2 MB to 32 MB modules); chip capacity grows 1K bits (1970) to 256M bits (2000), module sizes 8 KB to 1 GB]
• Exponential Growth:
– The past does not matter
– 10x here, 10x there, soon you're talking REAL change.
• PC costs decline faster than any other platform
– Volume & learning curves
– PCs will be the building bricks of all future systems
Pico Processor (1 M SPECmarks, 1 TFLOP) memory hierarchy:
– 10 pico-second RAM: 1 MB
– 10 nano-second RAM: 100 MB
– 10 microsecond RAM: 10 GB
– 10 millisecond disc: 1 TB
– 10 second tape archive: 100 TB
• 10^6 clocks to bulk RAM: event-horizon on chip
• VM reincarnated
• Multi-program cache, on-chip SMP
1 Terabyte (@ .25$/MB = 250 K$):
– 1,000,000,000 business letters (150 miles of bookshelf)
– 100,000,000 book pages (15 miles of bookshelf)
– 50,000,000 FAX images (7 miles of bookshelf)
– 10,000,000 TV pictures (MPEG) (10 days of video)
– 4,000 LandSat images
• Library of Congress (in ASCII) is 25 TB
• 1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
• 1995: 250 K$ of magnetic disc (70 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)
• Capacity and cost are improving fast (100x per decade)
• Accesses are getting larger (MOX, GOX, SCANS)
• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this?
• Bandwidth:
– Use partitioned parallel access (disk & tape farms)
• Latency:
– Pipeline data up the storage hierarchy (next section)
• Disk is back to 100x cheaper than RAM
• Nearline tape is only 10x cheaper than disk
– and the gap is closing!
• ??? Why bother with tape?
[Chart, 1960-2000: RAM $/MB vs Disk $/MB (100:1, 30:1, 10:1 contours) and Disk $/MB vs Nearline Tape; Disk & DRAM look good]
• In the “old days” we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time
• A “70 MIPS” processor: “real” apps have worse I-cache and OS misses, so it runs at 60 MIPS if well tuned, 20 MIPS if not
[Chart: where the cycles go; B-cache data misses, I-cache misses, D-cache misses]
Access cost in clock ticks (and, scaled to human time, how far away the data is):
– 10^9: Tape/Optical robot, 2,000 years (Andromeda)
– 10^6: Disk, 2 years (Pluto)
– 100: Memory, 1.5 hr (Sacramento)
– 10: On-board cache, 10 min (this campus)
– 2: On-chip cache, 1 min (this room)
– 1: Registers (my head)
• Network speeds grow 60% / year
• WAN speeds limited by politics: if voice is X$/minute, how much is video?
• Switched 100 Mb Ethernet: 1,000x more bandwidth
• ATM is a scaleable net:
– 1 Gb/s to desktop & wall plug
– commodity: same for LAN, WAN
• 1 Tb/s fibers in the laboratory
[Chart, 1960-2000: Comm Speedups; processors (i/s) vs LANs & WANs (b/s), 1e3 to 1e9]
• Bandwidth UP 10^4, price DOWN
• Speed-of-light unchanged
• Software got worse
• Standard fast nets:
» ATM
» PCI
» Myrinet
» Tnet
• HOPE:
– Commodity net
– Good software
• Then clusters become a SNAP!
– commodity: 10 k$/slice
[Chart, 1965-2000: bandwidth from 1 Kb/s to 1 Gb/s for POTS, WAN, LAN, CAN, PC bus]
• 10$: wrist watch computers
• 100$: pocket/palm computers
• 1,000$: portable computers
• 10,000$: personal computers (desktop)
• 100,000$: departmental computers (closet)
• 1,000,000$: site computers (glass house)
• 10,000,000$: regional computers (glass castle)
• SuperServer: costs more than 100,000$
• “Mainframe”: costs more than 1 M$
– must be an array of processors, disks, tapes, comm ports
• Horizontal integration is the new structure
• Each layer picks the best from the lower layer.
• Desktop (C/S) market: 1991: 50%; 1995: 75%
Function and example:
– Operation: AT&T
– Integration: EDS
– Applications: SAP
– Middleware: Oracle
– Baseware: Microsoft
– Systems: Compaq
– Silicon & Oxide: Intel & Seagate
• Bill Joy’s law (Sun):
Don’t write software for less than 100,000 platforms.
@10 M$ engineering expense, 1,000$ price
• Bill Gates’ law:
Don’t write software for less than 1,000,000 platforms.
@10 M$ engineering expense, 100$ price
• Examples:
– UNIX vs NT: 3,500$ vs 500$
– UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$
– No spreadsheet or presentation pack on UNIX/VMS/...
• Commoditization of base software & hardware
[Chart: system price (Mainframe 1 M$, Mini 100 K$, Micro/Nano 10 K$) tracks disk form factor shrinking: 14", 9", 5.25", 3.5", 2.5", 1.8"]
• The Year 2000 commodity PC (3 K$)
• Billion Instructions/Sec
• Billion Bytes RAM
• Billion Bits/s Net
• 10 B Bytes Disk
• Billion Pixel display (3000 x 3000 x 24-bit pixels)
[Figure: 1 Bips processor, .1 B byte RAM, 10 B byte disk]
• Cost 3,000 $
• Come with
• OS (NT, POSIX,..)
• DBMS
• High speed Net
• System management
• GUI / OOUI
• Tools
• Compatible with everyone else
• CyberBricks
• Large disc farms will be inexpensive (100$/GB)
• Large RAM databases will be inexpensive (1,000$/GB)
• Processors will be inexpensive (1k SPECint)
• So the building block will be a processor with large RAM and lots of disc:
– CPU: 1k SPECint
– 5 GB RAM
– 50 GB disc
• Future servers are CLUSTERS of such processors and discs
• Distributed database techniques make clusters work
• Array of 1,000 4B machines: processors, disks, tapes, comm lines
– 100 nodes = 1 Tips
– 1,000 discs = 10 Terrorbytes
– 100 tape transports = 1,000 tapes = 1 PetaByte
• A few megabucks
• Challenge: make it as easy as a single system
– Manageability
– Programmability
– Security
– Availability
– Scaleability
– Affordability
• Shared Memory (SMP): easy to program; difficult to build; difficult to scaleup (Sequent, SGI, Sun)
• Shared Disk (VMScluster, Sysplex)
• Shared Nothing (network): hard to program; easy to build; easy to scaleup (Tandem, Teradata, SP2)
• Winner will be a synthesis of these ideas
• Distributed shared memory (DASH, Encore) blurs the distinction between network and bus (locality still important), but gives shared memory the message cost.
• A Slice is a processor, memory, and a few disks.
• Slice prices of scaleables so far carry a 5x to 10x markup:
– Teradata: 70 k$ for an Intel 486 + 32 MB + 4 disks
– Tandem: 100 k$ for a MipsCo R4000 + 64 MB + 4 disks
– Intel: 75 k$ for an i860 + 32 MB + 2 disks
– TMC: 75 k$ for a SPARC 3 + 32 MB + 2 disks
– IBM/SP2: 100 k$ for an RS6000 + 64 MB + 8 disks
• A Compaq slice price is less than 10 k$
• What is the problem?
– Proprietary interconnect
– Proprietary packaging
– Proprietary software (vendorIX)
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
• Putting it together: Scaleable Networks and Platforms
– Build clusters of commodity processors & storage
– Commodity interconnect is key (the S of PMS)
» Traditional interconnects give 100 k$/slice.
– Commodity cluster operating system is key
– Fault isolation and tolerance is key
– Automatic parallel programming is key
SNAP: Scaleable Network And Platforms
built on
– commodity platforms
– commodity network interconnect
[Figure: a 1 Terabyte store accessed at 10 MB/s]
• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100 MB/s)
• Solution:
– pipeline from source (tape, disc, RAM, ...) to CPU cache
– pipeline results to destination
• Relational data model: uniform operators on uniform data streams, closed under composition
• Each operator consumes 1 or 2 input streams
• Each stream is a uniform collection of data
• Sequential data in and out: pure dataflow
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
• Automate system management via tools
• data placement
• data organization (indexing)
• periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
• duplex & failover
• transactions
• Automatic parallelism
• among transactions (locking)
• within a transaction (parallel execution)
SELECT image
FROM landsat
WHERE date BETWEEN 1970 AND 1990
  AND overlaps(location, :Rockies)
  AND snow_cover(image) > .7;
Landsat table: date (temporal), loc (spatial), image (image data)
Assign one process per processor/disk:
– find images with the right date & location
– analyze each image; if 70% snow, return it
• Why Parallelism :
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
• Speedup: same job, more hardware, less time (100 GB on a small server vs 100 GB on a bigger one)
• Scaleup: bigger job, more hardware, same time (100 GB grows to 1 TB)
• Transaction scaleup: more clients/servers, same response time (1 k clients on a 100 GB server grows to 10 k clients on a 1 TB server)
• 1,000 MIPS for 1,000$ vs 1,000 MIPS for 32$ (.03$/MIPS) when built from 1 MIPS, 1$ components
• Needs linear speedup and linear scaleup
• Not always possible
• Transaction Processing Performance Council:
– TPC-A: simple transaction
– TPC-B: server only, about 3x lighter than TPC-A
– Both obsoleted by TPC-C (no new results after 6/7/95)
• TPC-C (revision 3): Transactions Per Minute (tpmC)
– Mix of 5 transactions: query, update, minibatch
– Terminal price eliminated
– about 5x heavier than TPC-A (so 3.5 ktpsA ≈ 20 ktpmC)
• TPC-D approved in March 1995: Transactions Per Hour
– Scaleable database (30 GB, 100 GB, 300 GB, ...)
– 17 complex SQL queries (no rewrites, no hints without permission)
– 2 load/purge queries
– No official results yet, many “customer” results.
Courtesy of Charles Levine of Tandem (of course)
[Chart: price/performance of recent tpmC results, 0-3000 $/tpmC vs 0-22,000 tpmC: AS400 (16 cpus), RS6000, SUN, DG, HP 9000 E55/H70/T500 (8-32 cpus), and Tandem Himalaya servers (64 and 112 cpus) at the high end]
• Online Transaction Processing
– many little jobs
– SQL systems support 3,700 tpsA (24 cpus, 240 disks)
– SQL systems support 21,000 tpmC (112 cpus, 670 disks)
• Batch (decision support and utility)
– few big jobs, parallelism inside
– scan data at 100 MB/s
– linear scaleup to 500 processors
• The good speedup curve is linear; bad speedup curves show no parallelism benefit or fall off at scale
• 3-factors that limit speedup:
– Startup: creating processes, opening files, optimization
– Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...)
– Skew: if tasks get very small, variance > service time
[Charts: throughput vs processors & discs for the good and bad cases]
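The three factors above can be folded into a toy cost model. This is an illustrative sketch, not a formula from the talk; the parameter names (`startup`, `interference`, `skew`) and the linear interference term are assumptions chosen to show why speedup saturates.

```python
def speedup(n, work=1_000.0, startup=1.0, interference=0.01, skew=1.0):
    """Toy model: elapsed time on n processors vs on one.

    work          -- total sequential work
    startup       -- fixed cost of creating processes, opening files
    interference  -- waiting on shared resources grows with n
    skew          -- slowest partition's work relative to the average
    """
    tn = startup + (work / n) * skew * (1.0 + interference * (n - 1))
    return work / tn

# Speedup saturates well below n as interference and startup bite
for n in (1, 10, 100, 1000):
    print(n, round(speedup(n), 1))
```

Raising `skew` (one partition gets more than its share) or `startup` flattens the curve further, mirroring the bad curves in the slide.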
Things to ask of any benchmark report:
• When does it stop scaling?
• Throughput numbers, not ratios.
• Standard benchmarks allow:
– comparison to others
– comparison to sequential
• Ratios and non-standard benchmarks are red flags.
• Disk is 3 MB/s to 10 MB/s
• Records are 100 B to 200 B (TPC-D 110-160, Wisconsin 204)
• So a scan should read 10 krec/s to 100 krec/s
• Simple test: time this on a 1 M record table (table on one disk, parallelism turned off):
SELECT count(*) FROM T WHERE x < :infinity;
• Typical problems:
– disk or controller is an antique
– no read-ahead in operating system or DB
– small page reads (2 KB)
– data not clustered on disk
– big cpu overhead in record movement
• Simplest parallel test, a scaleup of the previous test:
– 4 disks, 4 controllers, 4 processors
– 4 times as many records, partitioned 4 ways
– same query
• Should have the same elapsed time. Some systems do.
[Diagram: parallel Scan feeding Agg (Count) on each partition]
• Test: UPDATE T SET x = x + :one;
– for a million-row T on 1 disk
– for a four-million-row T on 4 disks
• Look for bottlenecks (especially the log).
• After each call, execute ROLLBACK WORK:
– see if UNDO runs at the DO speed
– see if UNDO is parallel (scales up)
• Parallel database systems scan data
• An interesting metric (100-byte records): record scan rate / system cost
• Typical scan rates: 1 k records/s to 30 k records/s
• Each scaleable system has a “slice price”, a guess:
– Gateway: 15 k$ (P5 + ATM + 2 disks + NT + SQL Server or Informix or Oracle)
– Teradata: 75 k$
– Sequent: 75 k$ (P5 + 2 disks + Dynix + Informix)
– Tandem: 100 k$
– IBM SP2: 130 k$ (RS6000 + 2 disks, AIX, DB2)
• You can compute slice prices for the systems later in this presentation
• BAD: 0.1 records/s/$ (there is one of these)
• GOOD: 0.33 records/s/$ (there is one of these)
• Super! 1.00 records/s/$ (there is one of these)
• We should aim at 10 records/s/$ with the P6.
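The metric is just scan rate divided by slice price. A minimal sketch; the example rates and prices are hypothetical, chosen only to land near the BAD/GOOD thresholds above:

```python
def records_per_sec_per_dollar(scan_rate_per_slice, slice_price):
    """scan_rate_per_slice in records/s per slice; slice_price in $."""
    return scan_rate_per_slice / slice_price

# Hypothetical slices at the "typical" 1k-30k records/s scan rates
print(records_per_sec_per_dollar(30_000, 100_000))   # 0.3 records/s/$
print(records_per_sec_per_dollar(15_000, 15_000))    # 1.0 records/s/$
```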
• How are constraints checked?
– ask about unique secondary indices
– ask about deferred constraints
– ask about referential integrity
• How does parallelism interact with triggers, stored procedures, OO extensions?
• How can I change my 10 TB database design in an hour?
– add index
– add constraint
– reorganize / repartition
• These are hard problems.
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Split a SQL table across a subset of the nodes & disks. Partition within the set by:
• Range (A...E | F...J | K...N | O...S | T...Z): good for equijoins, range queries, group-by
• Hash: good for equijoins
• Round Robin: good to spread load
Shared disk and memory are less sensitive to partitioning; shared nothing benefits from “good” partitioning.
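The three schemes can be sketched as routing functions. A toy Python sketch assuming 5 nodes matching the A...E / F...J / K...N / O...S / T...Z ranges above; the md5 choice is illustrative, not any vendor's scheme:

```python
import hashlib

NODES = 5  # A...E, F...J, K...N, O...S, T...Z

def range_partition(key: str) -> int:
    # Range: good for equijoins, range queries, group-by
    bounds = ["E", "J", "N", "S"]          # upper bound of each range
    for i, b in enumerate(bounds):
        if key[:1].upper() <= b:
            return i
    return len(bounds)                     # last range: T...Z

def hash_partition(key: str) -> int:
    # Hash: good for equijoins; spreads keys regardless of order
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NODES

def round_robin_partition(counter: int) -> int:
    # Round robin: good to spread load; no key locality at all
    return counter % NODES
```

Note the trade-off visible even in the sketch: only `range_partition` keeps adjacent keys together, which is why it alone supports range queries.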
• Hash indices partition by hash (buckets 0...9, 10...19, 20...29, 30...39, 40...)
• B-tree indices partition as a forest of trees, one tree per range (A..C | D..F | G..M | N..R | S..Z)
• The primary index clusters the data
In shared nothing, secondary indices are problematic.
• Partition the index by base-table key ranges:
– Insert: completely local (but what about uniqueness?)
– Lookup: examines ALL trees (each node holds an A..Z index of its own partition)
– A unique index involves a lookup on every insert
• Partition the index by secondary key ranges (A..C | D..F | G..M | N..R | S..):
– Insert: two nodes (base and index)
– Lookup: two nodes (index -> base)
– Uniqueness is easy
– This is the Teradata solution
[Diagram: any sequential program’s outputs split N ways and its inputs merge M ways, connecting ordinary sequential programs into a parallel dataflow]
N producers, M consumers, N x M data streams:
• Producers add records to the river; consumers consume records from the river
• Purely sequential programming
• The river does flow control and buffering, and does partition and merge of data records
• River = Split/Merge in Gamma = Exchange operator in Volcano
Partitioned execution spreads computation and IO among processors:
[Diagram: a table partitioned A...E | F...J | K...N | O...S | T...Z, with a Count operator on each partition feeding a final Count]
[Diagram: partitioned and pipelined execution; on each partition (A...E | F...J | K...N | O...S | T...Z) a Sort feeds a Join, whose outputs Merge]
Disk partitioning:
• For range partitioning, sample the load on disks; cool hot disks by making their range smaller
• For hash partitioning, cool hot disks by mapping some buckets to others
River partitioning:
• Use hashing and assume uniform
• If range partitioning, sample the data and use a histogram to level the bulk
Teradata, Tandem, and Oracle use these tricks.
An operator is blocking if it does not produce any output until it has consumed all its input.
Examples: Sort, Aggregates, Hash-Join (reads all of one operand).
The database load template has three blocked phases:
• Scan -> Sort Runs -> Merge Runs -> Table Insert (tape/file into SQL table)
• Sort Runs -> Merge Runs -> Index Insert, repeated per index (Index 1, 2, 3)
Blocking operators kill pipeline parallelism, and make partition parallelism all the more important.
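The distinction is easy to see with generators: a filter can stream record by record, a sort cannot. An illustrative sketch, not any product's executor:

```python
def select_gt(stream, threshold):
    # Non-blocking: yields each output record as soon as its input arrives,
    # so a downstream operator can run concurrently (pipeline parallelism)
    for rec in stream:
        if rec > threshold:
            yield rec

def sort_op(stream):
    # Blocking: must consume ALL input before producing any output,
    # so it severs the pipeline between its producer and its consumer
    return iter(sorted(stream))

source = iter([5, 1, 9, 3, 7])
pipeline = sort_op(select_gt(source, 2))   # filter pipelines; sort blocks
print(list(pipeline))                      # [3, 5, 7, 9]
```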
• Simple aggregates (count, min, max, ...) can use indices
– more compact; the index sometimes has the aggregate info
• GROUP BY aggregates:
– scan in category order if possible (use indices)
– else, if the categories fit in RAM, use a RAM category hash table
– else make a temp file of <category, item>, sort by category, and do the math in the merge step
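The RAM category hash table case fits in a few lines; `group_by_sum` is a hypothetical name, and a real system would fall back to the sort-based plan when the category table outgrows memory:

```python
from collections import defaultdict

def group_by_sum(stream):
    # "If the categories fit in RAM, use a RAM category hash table"
    totals = defaultdict(int)
    for category, item in stream:
        totals[category] += item
    return dict(totals)

sales = [("east", 10), ("west", 5), ("east", 7)]
print(group_by_sum(sales))   # {'east': 17, 'west': 5}
```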
86
Used for loading and reorganization
(sort makes them sequential) build B-trees reports non-equijoins
Rarely used for aggregates or equi-joins (if hash available
Sort
Input
Data
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Runs
Merge
Sorted
Data
88
Parallel sort design, M inputs to N outputs:
• A scan or other source feeds a range- or hash-partitioned river
• Sub-sorts generate runs; disk and merge are not needed if the sort fits in memory
• Merge runs
• Scales nearly linearly: the log factor grows slowly (going from 10^6 to 10^12 records only makes the per-record cost about 2x slower)
• Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, no data filtering at the source
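A single-node sketch of the run-generation + merge design, using Python's `heapq.merge` for the merge phase; the run size and data are illustrative, and the partitioned river between nodes is omitted:

```python
import heapq

def sort_runs(stream, run_size):
    # Sub-sorts generate runs that fit in memory
    run = []
    for rec in stream:
        run.append(rec)
        if len(run) == run_size:
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)

def external_sort(stream, run_size=3):
    # Merge runs (a single run needs no merge: the in-memory case)
    return list(heapq.merge(*sort_runs(stream, run_size)))

print(external_sort([9, 1, 7, 3, 8, 2, 5]))   # [1, 2, 3, 5, 7, 8, 9]
```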
Sorting 1 M records (100 B recs):
• 1,000 seconds: 1986
• 60 seconds: 1990
• 7 seconds: 1994 (1.1 GB, Nyberg, Alpha 3 cpu)
• 3.5 seconds: 1995 (SGI Challenge, 12 cpu)
• micros finally beat the mainframe! finally, a UNIX system that does IO
• no SIGMOD PennySort record
[Chart: sort records/second vs time, 1985-1995: M68000, Tandem, hardware sorter, Sequent, Intel HyperCube, Cray YMP, IBM 3090, Alpha, SGI]
[Chart: sort time on an SGI Challenge, 1.6 GB (16 M 100-byte records), 2.2 GB, 96 disks, vs threads (sprocs) devoted to sorting]
• If the inner table is indexed on the join cols (B-tree or hash), sequentially scan the outer (from the start key)
• For each outer record, probe the inner table for matching recs
• Works best if the inner is in RAM (=> small inner table)
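A toy sketch of the index variant of nested-loop join, with a plain dict standing in for the inner table's B-tree or hash index:

```python
def index_nested_loop_join(outer, inner_index):
    """Scan the outer table; probe the inner's index on the join column.
    Works best when the inner index (here a dict) fits in RAM."""
    for key, outer_rest in outer:
        for inner_rest in inner_index.get(key, []):
            yield (key, outer_rest, inner_rest)

inner_index = {1: ["a"], 2: ["b", "c"]}           # join col -> matching recs
outer = [(1, "x"), (2, "y"), (3, "z")]
print(list(index_nested_loop_join(outer, inner_index)))
# [(1, 'x', 'a'), (2, 'y', 'b'), (2, 'y', 'c')]
```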
• If the tables are sorted on the join cols (B-tree or hash), sequentially scan each (from the start key):
– left < right: advance left
– left > right: advance right
– left = right: match
• Nice sequential scan of the data (disk speed); the M x N duplicate-key case may cause a backwards rescan
• The N x M case generalizes toward the Cartesian product
• Sort-merge join sorts before doing the merge
• Partitions well: partition the smaller to match the larger’s partitioning
• Works for all joins (outer, non-equijoins, Cartesian, exclusion, ...)
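The advance-left / advance-right / match logic can be sketched as follows; only the equijoin case is shown, and re-scanning the right-hand duplicate group stands in for the backwards rescan:

```python
def merge_join(left, right):
    """Merge-join two lists sorted on the join column (equijoin case)."""
    li, ri = 0, 0
    out = []
    while li < len(left) and ri < len(right):
        lkey, rkey = left[li][0], right[ri][0]
        if lkey < rkey:
            li += 1                      # advance left
        elif lkey > rkey:
            ri += 1                      # advance right
        else:                            # match: emit the whole right group
            rj = ri
            while rj < len(right) and right[rj][0] == lkey:
                out.append((lkey, left[li][1], right[rj][1]))
                rj += 1
            li += 1                      # next left rec rescans the group
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z")]
print(merge_join(left, right))
# [(2, 'b', 'x'), (2, 'b', 'y'), (2, 'c', 'x'), (2, 'c', 'y')]
```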
• Hash the smaller table into N buckets (hope N=1)
• If N=1, read the larger table and hash-probe the smaller
• Else, hash the outer to disk, then do a bucket-by-bucket hash join
• Purely sequential data behavior
• Always beats sort-merge and nested join unless the data is clustered
• Good for equi, outer, and exclusion joins
• Hash reduces skew
• Lots of papers; products just appearing (what went wrong?)
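A sketch of the N=1 case (build on the smaller table, probe with the larger); the bucket count is illustrative, and a real Grace-style join would spill buckets to disk when N > 1:

```python
from collections import defaultdict

def hash_join(small, large, n_buckets=4):
    """Hash-join sketch: build hash buckets on the smaller table
    (all in memory here), then probe with the larger table."""
    buckets = defaultdict(list)
    for key, val in small:                       # build phase
        buckets[hash(key) % n_buckets].append((key, val))
    out = []
    for key, val in large:                       # probe phase
        for bkey, bval in buckets[hash(key) % n_buckets]:
            if bkey == key:                      # equi-join match
                out.append((key, bval, val))
    return out

small = [(1, "a"), (2, "b")]
large = [(2, "x"), (3, "y"), (1, "z")]
print(hash_join(small, large))   # [(2, 'b', 'x'), (1, 'a', 'z')]
```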
• ICL implemented hash join with bitmaps in the CAFS machine (1976)!
• Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980’s (it partitions beautifully)
• We ignored them! (why?) But now everybody's doing it (or promises to do it).
• Hashing minimizes skew and requires little thinking for redistribution
• Hashing uses massive main memory
• It is easy to build a fast parallel execution environment (no one has done it, but it is just programming)
• It is hard to write a robust and world-class query optimizer
– there are many tricks
– one quickly hits the complexity barrier
• Common approach:
– pick the best sequential plan
– pick the degree of parallelism based on bottleneck analysis
– bind operators to processes
Why isn’t the best serial plan the best parallel plan? A counter example:
• Table partitioned A..M | N..Z, with a local secondary index at each of two nodes
• The range query SELECT * FROM telephone_book WHERE name < “NoGood”; selects all of node 1 and 1% of node 2
• Node 1 should do a table scan of its partition; node 2 should use its secondary index
• Sybase Navigator & DB2 PE should get this right
• We need theorems here (practitioners do not have them)
• Shared Nothing:
– Teradata: 400 nodes
– Tandem: 110 nodes
– IBM / SP2 / DB2: 128 nodes
– Informix/SP2: 48 nodes
– ATT & Sybase: 8x14 nodes
• Shared Disk:
– Oracle: 170 nodes
– Rdb: 24 nodes
• Shared Memory:
– Informix: 9 nodes
– RedBrick: ? nodes
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata
– Tandem
– Oracle
– Informix
– Sybase
– DB2
– RedBrick
• Premise: the world does not need yet another PDB survey, but it would be nice to have a survey of “real” systems
• Visited each parallel DB vendor I could (time limited)
– asked not to be given confidential info
– asked for public manuals and benchmarks
– asked that my notes be reviewed
• I say only nice things (I am a PDB booster)
• Teradata: Todd Walter, Carrie Ballinger
• Tandem: Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
• Oracle: Gary Hallmark, Bill Widdington
• Informix: Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
• Navigator: Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman, Bob McDonald, Jan Graveson, Ron Chung Hu, Stuart Thompto
• DB2: Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh
• RedBrick: Phil Fernandez, Donovan Schneider
• First shipped 1984; now an ATT GIS brand name
• Parallel DB server for decision support: SQL in, tables out
• Supports heterogeneous data (converts to client format: UNIX, VMS, PC, Mac, IBM)
• Data hash partitioned among AMPs, with fallback (mirror) hash
• Applications run on clients; application processors (PEPs) front the AMPs
• Biggest installation: 476 nodes, 2.4 TB
• Ported to a UNIX base
Parsing Engine (PE):
• Interfaces to IBM channel or Ethernet or ... (UNIX, VMS, PC, Mac clients)
• Accepts SQL, returns records and status
• Supports SQL 89, moving to SQL 92
• Parses, plans & authorizes SQL (cost-based optimizer)
• Issues requests to AMPs; merges AMP results back to the requester
• Some global load control based on client priority (adaptive and GREAT!)
AMP:
• Almost all work is done in the AMPs: a shared-nothing SQL engine (scans, inserts, joins, log, lock, ...)
• Manages up to 4 disks (as one logical volume)
• Easy to design, manage, grow (just add disks)
• All data is declustered across all nodes
• Each table has a hash key (may be compound)
• The key maps to one of 4,000 buckets; buckets map to AMPs
• Non-unique secondary indices are partitioned by the base-table criterion
• The fallback copy of a bucket maps to a second AMP in the cluster
– a typical cluster is 6 nodes (a cluster of 2 is plain mirroring)
– the cluster limits failure scope: 2 failures only cause a data outage if both are in the same cluster
• Within a node, each record hashes to a cylinder, then to a “page”; a page is a heap with a sorted directory
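The bucket scheme described above can be sketched as a pair of mapping functions. The constants and the fallback-rotation rule are hypothetical illustrations of the idea (key -> one of 4,000 buckets -> primary AMP, with the fallback copy on a different AMP in the same cluster), not Teradata's actual maps:

```python
N_BUCKETS = 4000          # a key maps to one of 4,000 buckets
AMPS = 12                 # hypothetical AMP count
CLUSTER_SIZE = 6          # "typical cluster is 6 nodes"

def bucket(key) -> int:
    return hash(key) % N_BUCKETS

def primary_amp(b: int) -> int:
    return b % AMPS

def fallback_amp(b: int) -> int:
    # Put the fallback copy on a different AMP in the same cluster, so
    # two failures lose data only if both land in the same cluster.
    amp = primary_amp(b)
    cluster, pos = divmod(amp, CLUSTER_SIZE)
    offset = 1 + (b // AMPS) % (CLUSTER_SIZE - 1)   # 1..5, never 0
    return cluster * CLUSTER_SIZE + (pos + offset) % CLUSTER_SIZE
```

Growing the system only changes the bucket-to-AMP map, not the key-to-bucket hash, which is why adding a node can be done by rehashing buckets rather than rows.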
• Sophisticated query optimizer (many tricks), with great emphasis on joins & aggregates
– nested, merge, product, and bitmap joins (no hash join)
• Automatic load balancing from hashing & load control
• Excellent utilities for data loading and reorganization
– moved a >1 TB database from an old to a new system in 6 days, in the background, while the old system was running
– on old hardware, with a 3.8 B row table (1 TB) and >300 AMPs, a typical scan, sort, or join averages 30 minutes
Protocol:
• PE requests work; each AMP responds OK (or pushes back)
• AMPs work (if all OK); each AMP declares itself finished
• When all are finished, the PE does 2PC and starts the pull
Simple scan:
• The PE broadcasts the scan to each AMP
• Each AMP scans, producing an answer spool file
• The PE pulls the spool files from the AMPs via the Ynet
• If the scan is ordered, a sort “catcher” is forked at each AMP, pipelined to the scans; the Ynet and PE then do a merge of the merges from the AMPs
Aggregate of scan:
• Scans produce local sub-aggregates
• The sub-aggregates are hashed to the Ynet; each AMP “catches” its sub-aggregate hash buckets and consolidates them
• The PE pulls the aggregates from the AMPs via the Ynet
• Note: a fully scaleable design
Insert / update / delete at an AMP node generates insert / update / delete messages to the unique secondary indices and to the fallback bucket of the base table; messages are saved in spool if a node is down.
Great emphasis on joins, including the small-table large-table optimization (cheapest triple, then cheapest in the triple):
• If equi-partitioned, join locally
• If not equi-partitioned:
– may replicate the small table to the large one’s partitions (the Ynet shines)
– may repartition one table if the other is already partitioned on the join key
– may repartition both (in parallel)
• Join algorithms within a node: product, nested, sort-merge, and intersected hash bitmaps of secondary indices
Bulk Data Load, Fast Data Load, Multi-load:
• Blast 32 KB batches of data to an AMP (double buffered)
• Multiple sessions from multiple clients can drive 200-way parallelism
• The AMP unpacks and puts each “upsert” onto the Ynet
– one record can generate multiple upserts (transaction -> inventory, store-sales, ...)
• A catcher on the Ynet grabs the relevant “upserts” to a temp file, sorts, and then batches the inserts (survives restarts)
• Online and restartable; customers cite this as a Teradata strength
• Fast Export is similar to Bulk Data Load
Backup / Restore: rarely needed because of fallback.
• The cluster is the unit of recovery; backup is online, restore is offline
Reorganize: rarely needed; adding a disk is just a restart.
Add node: rehash all buckets that go to that node (the Ynet has both the old and the new bucket map); fully parallel and fault tolerant, takes minutes.
New design (3700 series) described in VLDB 93
Ported to UNIX platforms (3600 AP, PE, AMP)
Moved Teradata to Software Ynet on SMPs
Based on Bullet-Proof UNIX with TOS layer atop.
message system communications stacks raw disk & virtual processors virtual partitions (buckets go to virtual partitions) removes many TOS limits
Result is 10x to 60x faster than an AMP
Compiled expression evaluation
(gives 50x speedup on scans)
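Why compiling beats interpreting, in miniature. This is an illustrative sketch, not NCR's code: the expression is turned into code once, instead of the expression tree being re-walked for every row scanned.

```python
# Interpreted path: re-examine the expression per row.
def interpret(col, op, const, row):
    value = row[col]
    if op == ">":
        return value > const
    if op == "=":
        return value == const
    raise ValueError(op)

# Compiled path: build source once, eval once; per-row cost is one compare.
def compile_expr(col, op, const):
    src = f"lambda row: row[{col!r}] {op} {const!r}"
    return eval(src)
```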
Large main memory helps
(software stack: SQL applications → parsing engine (parallelism) → Teradata SQL (AMP logic) → UNIX PDE (TOS adapter) → UNIX 5.4 (SMP, RAS, virtual Ynet) → hardware)
Standard Benchmarks
Only old Boral/DeWitt Wisconsin numbers.
Nothing public.
Moving a >1TB database from an old system to a new one in 6 days, in the background while the old system was running.
So: unload-load rate > 2MB/s sustained,
as a background task (speed limited by host speed/space).
Old hardware, 3.8B-row table, >300 AMPs; typical scan, sort, join averages 30 minutes.
Rates (record size not cited):
                      krec/s/AMP   krec/s total
  scan                    9           2,600
  clustered join           .39          120
  insert-select           3.3           100
  hash index build       (not cited)
Times to process a Teradata test DB on an 8-Pentium 3650.
These numbers are 10 to 150x better than a single AMP (compiled expression handling, more memory).
  op                       rows         seconds   krec/s   MB/s
  scan                     50,000,000     737      67.8    11.0
  copy                      5,000,000    1136       4.4     0.7
  aggregate                50,000,000     788      63.5    10.3
  join 50x2M (clustered)   52,000,000     768      67.7    11.0
  join 5x5 (unclustered)   10,000,000     237      42.2     6.8
  join 50Mx.1K             50,000,100    1916      26.1     4.2
Scaleable to large (multi-terabyte) databases
Available TODAY!
It is VERY real: in production in many large sites
Robust and complete set of utilities
Automatic management.
Integrates with the IBM mainframe OLTP world
Heterogeneous data support: good for data warehouses
Message-based OS ( Guardian ):
(1) location transparency
(2) fault isolation (failover to other nodes).
Expand software: up to 255 systems on a WAN.
Classic shared-nothing system (like Teradata, except applications run inside the DB machine).
(figure: a 4-node system; up to 224 processors; 1-16 MIPS R4400 CPUs per node; dual-port controllers; dual 30MB/s LANs; 8 x 1MB/s channels)
1974-1985: Encompass : Fault-tolerant Distributed OLTP
1986: NonStopSQL :
First distributed and high-performance SQL (200 tps)
1989: Parallel NonStopSQL : Parallel query optimizer/executor
1994: Parallel and Online SQL (utilities, DDL, recovery, ....)
1995: Moving to ServerNet: shared disk model
Each table or index range partitioned to a set of disks
(anywhere in network)
(figure: file = {partitions}; partition = blocks)
Index is a B-tree per partition; clustering index is a B+-tree.
Table fragments are files (extent based).
Descriptors for all local files live in local catalog
(node autonomy)
Extents may be added
Tables can be distributed in network (lan or wan)
Duplexed disks and disk processes for failover
(figure: application process (C / COBOL / ... with embedded SQL) calls the SQL engine: joins, sorts, global aggregates, triggers, index maintenance, views, security.
Disk server pair with buffer pool: selects; update, delete; record/set insert; aggregates; assertions; locking; logging. Duplexed disk pair or array holds the data partition.
GUI, utilities, and helper processes alongside.)
Hardware & OS move data at 4MB/s with >1 instruction/byte
Insert / Update / Delete index in parallel with base table
If 5 indices, 5x faster response time.
Record and key-value range locking, SQL92 isolation levels
Undo scanner per log: double-buffers undo to each server
21k tpmC (WOW!!) with a 110-node server (800GB DB)
Can mix OLTP and batch.
Priority serving to avoid priority inversion problem
Buffer management prevents sequential buffer pollution
Simple selects & aggregates done in disk servers
Parallelism chosen: scan: table fragmentation; hash: # processors or outer-table fragments
Sorts: redistribution, sort in executors (N-M)
Joins done in executors (nest, sort-merge, hash).
Redistribution is always a hash (minimize skew)
Pipeline as deep as possible (use lots of processes)
(figure: application → SQL subsystem → executors → disk servers)
Multiple logs & parallel UNDO avoid bottlenecks
Can mix OLTP and batch.
Priority serving to avoid priority inversion problem
Buffer management prevents sequential buffer pollution
Initially just inserted rivers between sequential operators
Parallel query optimizer
Created executors at all clustering nodes or at all nodes, repartitioned via hash to them
Gave parallel select, insert, update, delete join, sort, aggregates,...
correlated subqueries are blocking
Got linear speedup/scaleup on Wisconsin.
Marketing never noticed, product slept from 1989-1993
Developers added: hash join; aggregates in the disk process; SQL92 features; parallel utilities; online everything; converted to MIPSco; fixed bugs.
Nested loop
Sort merge
Both can work off index-only access
Replicate small to all partitions (when one small)
Small-table large-table optimization (Cartesian product)
Now a hybrid-hash join: uses many small buckets, tuned to memory demand and to sequential disk performance; no bitmaps, because (1) hash is parallel and (2) equijoins usually do not benefit.
When both large, and unclustered (rare case)
N+M scanners, 16 catchers: sortmerge or hybrid hash
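A bucket-based equijoin in the spirit of the hybrid hash join above can be sketched as follows. The bucket count is illustrative, and a real implementation spills buckets to disk and tunes them to memory; this in-memory sketch shows only the partition-then-join-per-bucket shape.

```python
# Illustrative hash join: split both inputs into small buckets by the same
# hash of the join key, then join each bucket pair independently.

def hash_join(build, probe, key, n_buckets=8):
    # Build phase: hash the (smaller) build input into per-bucket hash tables.
    buckets = [dict() for _ in range(n_buckets)]
    for row in build:
        buckets[hash(row[key]) % n_buckets].setdefault(row[key], []).append(row)
    # Probe phase: each probe row looks only at its own bucket.
    out = []
    for row in probe:
        table = buckets[hash(row[key]) % n_buckets]
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out
```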
All utilities are online (claimed to reduce outages by 40%):
Add table, column,...
Add index: builds index from a stale copy, uses the log for catchup; in the final minute, gets a lock and completes the index.
Reorg B-tree while it is accessed
Add / split/ merge/ reorg partition
Backup
Recover page, partition, file.
Add, alter logs, disks, processors, ...
You need this: Terabyte operations take a long time!
Parallel utilities: load (M to N); index build (M scanners, N inserters, in background); recovery.
No official DSS benchmark reports
Unofficial results
1 to 16 R4400 class processors, 64MB each (Himalayas)
3 disks, 3 ctlrs each
                        Sequential            16x Parallel         speedup
                        rec/s     rate        rec/s     rate
  Load Wisc             1.6 kr/s  321 KB/s    28 kr/s   5.4 MB/s   16
  Parallel index build  1.5 kr/s  15 KB/s     24 kr/s   240 KB/s   16
  Scan                  28 kr/s   5.8 MB/s    470 kr/s  94 MB/s    16 !!!!!!!
  Aggregate (1 col)     25 kr/s   4.9 MB/s    400 kr/s  58 MB/s    16
  Aggregate (6 col)     18 kr/s   3.6 MB/s    300 kr/s  60 MB/s    16
  2-way hash join       13 kr/s   2.6 MB/s    214 kr/s  42 MB/s    16
  3-way hash join       ? kr/s    ? MB/s      ? kr/s    ? MB/s     ?
1x and 16x rates are best I’ve seen anywhere.
21k tpmC (WOW!)
It is available TODAY!
Online everything
Fault tolerant, distributed, high availability
Mix OLTP and batch
Great Hash Join Algorithm
Probably the best peak performance available
Parallel Server (V7): Multiple threads in a server
Multiple servers in a cluster
Client/server, OLTP & clusters (TP lite)
Parallel Query (V7.1) Parallel SELECT (and sub-selects)
Parallel Recovery : (V7.1) @ restart, one log scanner, multiple redoers
Beta in 1993, Ship 6/94.
More Parallel (create table) : V7.2, 6/95
Shared disk implementation ported to most platforms
Parallel SELECT
(no parallel INSERT, UPDATE, DELETE, DDL) except for sub-selects inside these verbs.
Homogeneous: one table (or index) per segment; extents picked from a TableSpace.
(figure: tablespace = file set; segment = extents of blocks)
Files may be raw disk
Segments are B-trees or heaps.
Extents may be added data -> disk map is automatic
No range / hash / round-robin partitioning
ROWID can be used as scan partitioning on base tables.
Guiding principle:
If it’s not organized, it can’t get disorganized, and doesn’t need to be reorganized.
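The ROWID-based scan partitioning above can be sketched as carving a table's block range into contiguous chunks, one per scan process. A minimal sketch under assumed names; the even-chunking policy is illustrative, not Oracle's.

```python
# Illustrative ROWID-range splitter: divide n_blocks into `degree`
# contiguous half-open ranges, as even as possible, one per scanner.

def rowid_ranges(n_blocks, degree):
    size, extra = divmod(n_blocks, degree)
    ranges, start = [], 0
    for i in range(degree):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, end))
        start = end
    return ranges
```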
Convert serial SELECT plan to parallel plan
If Table scan or HINT then consider parallel plan
Table has default degree of parallelism (explicitly set)
Overridden by system limits and hints.
Use max degree of all participating tables.
Intermediate results are hash partitioned
Nested Loop Join and Merge Join
User hints can
(must?) specify join order, join strategy, index, degree of parallelism,...
(figure: client → query coordinator; parallelism = DB instances x degree of parallelism, via multiple processes and threads)
Query Coordinator starts with Oracle Cost-Based plan
If plan requests Table scan or HINT then consider parallel plan
Table has default degree of parallelism (explicitly set)
Overridden by system limits and hints.
Use max degree of all participating tables.
Shared disk makes temp space allocation easy
Planner picks degree of parallelism and river partitioning.
Proud of their OR optimization.
Coordinator does extra work to: merge the outputs of several sorts (subsorts pushed to servers); aggregate the outputs of several aggregates (aggregates pushed to servers).
Parallel function invocation is potentially a big win.
SELECT COUNT ( f(a,b,c,...)) FROM T;
Invokes function f on each element of T, 100x parallel.
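The partitioned-count pattern behind that query can be sketched with a thread pool standing in for the parallel servers. The pool, the function, and the non-null counting convention are illustrative, not Oracle's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative COUNT(f(...)): each worker applies f to one partition and
# counts non-null results; the coordinator sums the partial counts.

def parallel_count(partitions, f):
    count_part = lambda part: sum(1 for row in part if f(row) is not None)
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(count_part, partitions))
```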
Oracle has (1) nested loop join and (2) merge join.
Replicating the inner to the outer partitions is automatic in shared disk
(looks like partitioned outer).
Has small-table large-table optimization
(Cartesian product join)
User hints can specify join order, join strategy, index degree of parallelism,...
Transactions and transaction save points (linear nest).
ReadOnly snapshots for decision support.
SQL92 isolation levels (ACID = Snapshot isolation)
Database has multiple rollback segments (UNDO log);
transaction has one commit/REDO log, so it may be a bottleneck.
Parallel recovery at restart:
One log scanner,
DEGREE REDO streams, typically one per disk
INSTANCE REDO streams, typically two-deep per disk
Not much special:
Limit degree of parallelism at a server
Set default parallelism of a table
Query can only lower these limits
No special tools, meters, monitors,...
Just ordinary Parallel Server
User can write parallel load / unload utility
Index build, Constraints, are separate steps
Not incremental or online or restartable.
Update Statistics (Analyze) is not parallel
Index build is a N-1 parallel: N scanner/sorter, 1 inserter.
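The N-scanner / 1-inserter shape can be sketched as N local sorts merged by one consumer. An illustrative sketch, not Oracle's code; `heapq.merge` plays the single inserter consuming N sorted runs.

```python
import heapq

# Illustrative N-1 index build: each scanner sorts (key, row_id) pairs from
# its slice; the one inserter merges the sorted runs into the final index.

def build_index(slices, key):
    runs = [sorted((row[key], row_id) for row_id, row in part)
            for part in slices]                 # the N scanner/sorters
    return list(heapq.merge(*runs))            # the 1 inserter
```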
Parallel recovery at restart:
One log scanner,
DEGREE REDO streams, typically one per disk
INSTANCE REDO streams, typically two-deep per disk
Sequent: 20 x 50MHz 486, .5GB RAM, 20 disks.
                        Sequential            20x Parallel         speedup
                        rec/s     rate        rec/s     rate
  Load 5M Wisc          .5 kr/s   113 KB/s    8.8 kr/s  1.8 MB/s   16
  Parallel index load   2.2 kr/s  18 KB/s     29 kr/s   235 KB/s   13
  Scan                  1.7 kr/s  364 KB/s    26 kr/s   5.3 MB/s   15
  Agg MJ                3.3 kr/s  660 KB/s    45 kr/s   9.3 MB/s   14
  Agg NJ                1.4 kr/s  290 KB/s    26 kr/s   5.4 MB/s   19
Same benchmark on a 16x SP1 (a shared-nothing machine) got similar results.
A 168x nCUBE (16MB/node; 4 lock nodes, 64 disk nodes) got good scaleup.
Oracle has published details on all these benchmarks.
Sept 1994 news:
20 Pentium, 40 disk system, SCAN at 44 MB/s 55% cpu
Available now!
Parallel Everywhere (on everybody’s box)
A HIGH FUNCTION SQL
No restrictions (triggers, indices,...)
Very easy to use (almost no knobs or options)
Parallel invocation of stored procedures
Near-linear scaleup and speedup of SELECTs.
DSA (Dynamic Scaleable Architecture) describes redesign to thread-based, server-based system.
V6 - 1993 - : DSA -- rearchitecture (threads, OLTP focus)
V7 - 1994 - : PDQ -- Parallel Data Query (SMP)
V8 - 1995 - : XMP -- Cluster parallelism (shared disk/nothing).
Parallelism is a MAJOR focus now that SQL92 is under control
Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL).
Informix is a UNIX SQL system:
AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris)
Today: shared-nothing parallelism on IBM SP2, ATT 3650, ICL (beta)
Table or index maps to a homogeneous set of DBspaces, each containing “chunks” (extents).
(figure: DBspace = files of blocks)
Partition by: range, round-robin, expression, hash (V8)
Chunks may be added
Access via B+Tree, B* tree, and hash (V8)
Built an extent-based file system on raw disks or files
High speed sequential, clustering, async IO,...
Completely parallel DML, some parallel DDL
Parallel SELECT, UPDATE, DELETE
Executor per partition in all cases.
Parallel sort, joins (nest, merge, hash), aggregates, union
(figure: client → virtual processes and helpers sharing a buffer pool; a tree of joins over scans)
Whenever an operator has input and a free output buffer, it can work to fill the output buffer.
Natural flow control.
Blocking operators (sort, hash join, aggregates, correlated subqueries) spool to a buffer (if small), else spool to disk.
Shared buffer pool minimizes data copies.
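The flow-control rule maps naturally onto generators: each operator produces a row only when its consumer has room to take one. A minimal, illustrative sketch of the demand-driven pipeline, not Informix's implementation.

```python
# Illustrative operator pipeline: scan -> select -> aggregate, where each
# stage yields one row (one "output buffer") only on consumer demand.

def scan(rows):
    for row in rows:
        yield row

def select(source, pred):
    for row in source:
        if pred(row):
            yield row

def aggregate_count(source):
    # Pulling from the pipeline drives all upstream operators lazily.
    return sum(1 for _ in source)
```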
Query plan is parallelized by: scanner per table partition (does select, project); sub-aggregates per partition (hash or sort).
If clustered join (nested loop or merge), then operator per outer or per partition.
If hash join: parallel scan the smaller first, build bitmap and hash buckets; then scan the larger and: join to the smaller if it fits in memory, else filter via bitmap and build larger buckets, then join bucket by bucket.
Hybrid hash join with bitmaps and bucket tuning.
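The bitmap step can be sketched as a Bloom-style filter built during the build phase: larger-input rows whose bit is clear cannot join and are dropped before they are spilled into buckets. Bitmap size and hashing here are illustrative.

```python
# Illustrative join bitmap: set a bit per hashed build-side key; probe-side
# rows with a clear bit are filtered early (false positives are possible,
# false negatives are not).

BITMAP_BITS = 1024  # assumption; real sizes are tuned to memory

def build_bitmap(keys):
    bitmap = 0
    for k in keys:
        bitmap |= 1 << (hash(k) % BITMAP_BITS)
    return bitmap

def survives(bitmap, key):
    return bool(bitmap >> (hash(key) % BITMAP_BITS) & 1)
```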
Parallel SELECT, UPDATE, DELETE
Executor per partition in all cases.
Parallel sort, joins, aggregates, union
Only correlated subqueries are blocking
Completely parallel DML, some parallel DDL
SQL 2 isolation levels allow DSS to run in background
Transaction save points
Separate logical and physical logs.
Bulk updates could bottleneck on single log.
Recovery unit is data partition (DBspace)
Parallel recovery: thread per DBspace
If DB fragment unavailable, DSS readers can skip it
Can assign % of processors, memory, IO to DSS
(parallel query)
Sum of all parallel queries live within this quota
Each query can specify the % of the total that it wishes.
(0 means sequential execution)
Parallel Data load (SMP only)
Parallel Index Build (N - M)
Parallel recovery
Online backup / restore
Sequent system: 9 Pentium processors, 1GB main memory; base tables on 16 disks (FWD SCSI), indices on 10 disks, temp space on 10 disks.
                        Sequential            Parallel             speedup
                        rec/s     rate        rec/s     rate
  Load 300M Wisc        3 kr/s    600 KB/s    -         -
  Parallel index load   -         -           48 kr/s   1 MB/s
  Scan                  17 kr/s   3.5 MB/s    147 kr/s  30 MB/s    8.3
  Aggregate             11 kr/s   2.3 MB/s    113 kr/s  23 MB/s    10.1
  2-way hash join       18 kr/s   3.2 MB/s    242 kr/s  31 MB/s    9.7
  3-way hash join       25 kr/s   3.5 MB/s    239 kr/s  33 MB/s    9.5
IBM SP2: TPC-D-like database, 48 SP2 processors.
Customer benchmark,
not an audited benchmark.
Load:
60 GB in 40 minutes,
250 GB in 140 min
(about 2GB/node/hr)
Scan & Aggregate (#6)
60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s
260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s
Power Test (17 complex queries and 2 load/purge ops)
60 GB in 5 hrs
260 GB in 18 hrs
Multiuser Test:
1 user, 12 queries: 10 hrs, 4 users, 3 queries: 10 hrs
A full function SQL
Available today on Sequent
Beautiful manuals
Linear speedup and scaleup
Best published performance on UNIX systems
Probably best price performance.
(but things are changing fast!)
Some mechanisms to mix OLTP and batch.
Two layer software architecture:
(1) Navigator drives array of shared-nothing SQL engines.
(2) Array of SQL engines , each unaware of others.
similar to Tandem disk processes
SQL engine is COTS.
Goal: linear scaleup and speedup, plus good OLTP support
Emphasize WHOLE LIFECYCLE
(figure: Configurator and Administrator tools atop an array of SQL engines)
Configurator: tools to design a parallel system
Administrator: tools to manage a parallel system
(install/upgrade, start/stop, backup/restore, monitor/tune)
Optimizer: execute requests in parallel.
Fully graphical design tool (heavy into circles and arrows).
Given: ER model and dataflow model of the application, workload characteristics, response-time requirements, hardware components.
Recommends: hardware configuration, table definitions (SQL), table partitioning.
Made HUGE investments in this area.
Truly industry leading graphical tools make MPP configuration “doable”.
GUI interface to manage: startup / shutdown of cluster; backup / restore / manage logs; configure (install, add nodes, configure and tune servers)
Manage / consolidate system event logs
System stored procedures (global operations)
(e.g. aggregate statistics from local to global cat)
Monitor SQL Server events
Pure shared nothing
Navigator partitions data among SQL servers
• map to a subset of the servers
• range partition or hash partition.
Secondary indices are partitioned with base table
No Unique secondary indices
Only shorthand views, no protection views
Schema server stores global data definition for all nodes.
Each partition server has: schema for its partition, data for its partition, log for its partition.
Recently became SQL89 compliant (cursors, nulls, etc)
Stored procedures, multi-threaded, internationalized,
B*-tree centric (clustering index is B+tree)
Use nested loops, sort-merge join (sort is index build).
Page locking, 2K disk IO, ... other little-endian design decisions.
Respectable TPC-C results (AIX RS/6000).
UNIX raw disks or files are base (also on OS/2, NetWare,...).
table->disk mapping
CREATE DATABASE name ON {device...} LOG ON {device...}
SP_ADDSEGMENT segment, device
CREATE TABLE name(cols) [ ON segment]
Microsoft has a copy of the code, deep ported to NT
Navigator extended Sybase TDS by
Adding stored procedures to do things
Extending the syntax (e.g. see data placement syntax below)
Sybase TDS and OpenServer design are great for this
All front ends are based on OpenServer and threads.
DBA Server does everything: SQL compilation
System management
Catalog management
SQL server restart (in 2nd node)
DBA fallback detects deadlock does DBA takeover on fail
Control server at each node manages the SQL servers there
(security, request caching, 2PC, final merge/aggregate,
parallel stored procedures (SIMD)).
Split server manages re-partitioning of data
SQL Server is unit of query parallelism, (one per cpu per node)
(figure: clients → control server (1/node) → DBA server, split servers, SQL servers; DBA server = system manager, monitor & SQL optimizer; schema server = catalogs database in a SQL server; GUI Navigator Manager)
Client connects to Navigator (a Control Server) using standard Sybase TDS protocol.
SQL request flows to the DBA server, which compiles it and sends stored procedures (plans) to all control servers and plans to all relevant SQL servers.
Control server executes plan.
Pass to SQL server, returns results.
Plan is cached on the second call; the DBA server is not invoked.
Good for OLTP.
If query involves multiple nodes, then command sent to each one
(diagram shows secondary index lookup)
Query sent to SQL servers that may have relevant data.
If data needs to be redistributed or aggregated, split servers issue queries and inserts
(that is their only role) split servers have no persistent storage.
(figure: client → control servers at several nodes → split servers → SQL servers, with the DBA and schema servers aside)
SQL server is unit of parallelism
"Parallelized EVERYTHING in the T-SQL language"
Includes SIMD execution of T-SQL procedures, plus N-M data move operations.
Two-level optimization:
DBA Server has optimizer
(BIG investment, all new code,
NOT the infamous Sybase optimizer)
Each SQL server has Sybase optimizer
If extreme skew, different servers have different plans
DBA optimizer shares code with SQL server
(so they do not play chess with one another).
Very proud of their optimizer.
Classic Selinger cost-based optimizer.
SELECT, UPDATE, DELETE N-to-M parallel
Bulk and async INSERT interface.
N-M Parallel sort
Aggregate (hash/sort); select and join can do index-only access if the data is there.
Eliminate correlated subqueries (convert to join)
(Ganski & Wong, SIGMOD 87, extended).
Join: nested-loop, sort-merge, index only
Sybase often dynamically builds index to support nested loop (fake sort-merge)
Typically left-deep sequence of binary joins.
Partition strategies
If already partitioned on join, then no splitting
Else: move a subset of T1 to T2 partitions,
or replicate T1 to all T2 partitions, or repartition both T1 and T2 to the width of the home nodes or target.
No hash join, but all (re) partitioning is range or hash based.
Not aggressive parallelism/pipelining: 2 op at a time.
Pipeline to disk via split server (not local to disk and then split).
Split servers fake subtables for SQL engines.
Top level aggregates merged by control, others done by split.
Bulk data load (N-M) async calls
GUI manages
Backup all SQL servers in parallel
Reorg via CREATE TABLE <new> ,
INSERT INTO <new> SELECT * FROM <old>
Utilities are mostly offline (as per Sybase)
Nice EXPLAIN utility
Hash join within split servers
Shared memory optimizations
Full support for unique secondary indices
Full trigger support (cross-server triggers)
Full security and view support.
Preliminary: 8x8 3600 - Ynet.
node: 8 x (50MHz 486, 256K local cache), 512MB main memory,
2 x 10-disk arrays @ 2GB, 4 MB/s per disk.
6 x Sybase servers
Scaleup & speedup tests of 1, 4, and 8 nodes.
Numbers (except loading) reported as ratios of elapsed times
S&S tests show a >7x speedup of 8-way over 1-way
Tests cover insert, select, update, delete, join, aggregate, load
Reference Account: Chase Manhattan Bank
14x8 P5 ATT 3600 cluster: (112 processors)
56 SQL servers, 10GB each = 560 GB
100x faster than DB2/MVS (minutes vs days)
Concern for lifecycle design, install, manage, operate, use
Good optimization techniques
Fully parallel, including stored procedures!
Scaleup and Speedup are near linear.
Sybase bought Expressway
Expressway evolved from Model 204 bitmap technology: index duplicates with a bitmap; compress the bitmap.
Can give 10x or 100x speedup.
Can save space and IO bandwidth
Currently, two products (Sybase and IQ) not integrated
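The bitmap-index idea above can be sketched with one bit vector per distinct value: AND/OR of vectors answers multi-predicate queries without touching the base table. The integer-bitmap representation is illustrative; IQ also compresses the vectors.

```python
# Illustrative bitmap index: for each distinct column value, a bit vector
# with bit i set iff row i holds that value. Predicates combine with &/|.

def bitmap_index(values):
    index = {}
    for row_id, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row_id)
    return index

def rows_in(bitmap):
    # Decode a bit vector back into a sorted list of row ids.
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]
```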
DB2/VM: = SQL/DS: System R gone public
DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)
Parallel and async IO into one process (on mainframe)
Parallel execution in next release (late next year?)
MVS PQS now withdrawn?
DB2/AS400: Home grown
DB2/2 PE: OS/2 DM grown large; first moved to AIX.
Being extended with parallelism,
based on the SP2 -- shared nothing done right.
Benchmarks today - Beta everywhere
DB2++: separate code path has OO extensions, good TPC-C
Ported to HP/UX, Solaris, NT in beta
• DATABASE: a collection of nodes (up to 128 SP2s so far)
• NODEGROUP: a collection of logical nodes (a 4K-entry hash map)
• LOGICAL NODE: A DB2 instance (segments, log, locks...)
• PHYSICAL NODE: A box.
Segments
Logical Node: Segments of 4 k pages
– Segments allocated in units (64K default)
– Tables stripe across all segments
• Table created in NodeGroup:
– Hash (partition key) across all members of group
• Cluster has single system Image
(figure: nodes assigned to NodeGroup 1 and NodeGroup 2)
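The 4K hash-map placement can be sketched as below. The round-robin map construction is illustrative; the point is that a row's partition key hashes into a fixed-size map whose entries name logical nodes of the NodeGroup.

```python
# Illustrative DB2 PE-style placement: a fixed 4K-entry map sends each
# hash bucket to one logical node of the table's NodeGroup.

MAP_ENTRIES = 4096

def make_hash_map(nodegroup):
    # Assumed policy: deal buckets round-robin across the group's nodes.
    return [nodegroup[i % len(nodegroup)] for i in range(MAP_ENTRIES)]

def node_for(hash_map, partition_key):
    return hash_map[hash(partition_key) % MAP_ENTRIES]
```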
• Each node maintains pool of AIX server processes
• Query optimizer does query decomposition to node plans (like R* distributed query decomposition)
• Parallel optimization is one-phase (not two-phase, as in Wei Hong’s work)
• Sends sub-plans to nodes to be executed by servers
• Node binds plan to server process
• Intermediate results hashed
• Proud that Optimizer does not need hints.
• “Standard” join strategies (except no hash join).
• 4 loaders:
– import
– raw-insert (fabricates raw blocks, no checks)
– insert
– bulk insert
• Reorganize hash map, add / drop nodes, add devices
– Table unavailable during these operations
• Online & Incremental backup
• Fault tolerance via HACMP
Wisconsin scaleups: big = 4.8M rec = 1GB; small = 1.2M rec = 256MB; scan rate ~12 kr/s/node; raw load: 2.5 kr/s/node; see notes for more data.
(chart: speedup vs nodes (0-16), DB2/2 PE on SP2; series: Load, Scan, Agg, SMJ, NLJ, SMJ2, Index1, Index2, MJ)
– (AIX is slow and the SP2 is very expensive)
• Read-only (LOAD then SELECT only) Database system
– Load is incremental and sophisticated
• Precompute indices to make small-large joins run fast
– Indices use compression techniques.
– Only join via indices
• Many aggregate functions to make DSS reports easy
• Parallelism:
– Pipeline IO
– Typically a thread per processor (works on index partition)
– Piggyback many queries on one scan
– Parallel utilities (index in parallel, etc)
– SP2 implementation uses shared disk model.
There is a LOT of activity
(many products coming to market)
Query optimization is near the complexity barrier
Needs a new approach?
All have good speedup & scaleup if they can find a plan
Managing huge processor / disk / tape arrays is hard.
I am working on commoditizing these ideas: low $/record/sec (scale up PC technology); low admin $/node (automate, automate, automate, ...)
Continuous availability (online & fault tolerant)