Parallel Database Systems A SNAP Application

advertisement
Parallel Database Systems
A SNAP Application
Gordon Bell
Jim Gray
450 Old Oak Court
Los Altos, CA 94022
GBell@Microsoft.com
310 Filbert, SF CA 94133
Gray@Microsoft.com
Platform
Network
1
Bell & Gray 4/15 / 95
Outline
• Cyberspace Pep Talk:
• Databases are the dirt of Cyberspace
• Billions of clients mean millions of servers
• Parallel Imperative:
•
•
•
•
•
Hardware trend: Many little devices
Consequence: Servers are arrays of commodity components
PC’s are the bricks of Cyberspace
Must automate parallel {design / operation / use}
Software parallelism via dataflow & Data Partitioning
• Parallel database techniques
•
•
•
•
Parallel execution of many little jobs (OLTP)
Data Partitioning
Pipeline Execution
Automation techniques)
• Summary
Bell & Gray 4/15 / 95
2
Kinds Of Information Processing
Point-to-Point
Immediate
Time
Shifted
Broadcast
conversation
money
lecture
concert
mail
book
newspaper
Net
work
Data
Base
Its ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis & Automatic Processing are being added
3
Bell & Gray 4/15 / 95
Low rent
min $/byte
Shrinks time
now or later
Shrinks space
here or there
Automate processing
knowbots
Bell & Gray 4/15 / 95
Immediate OR Time Delayed
Why Put Everything in Cyberspace?
Point-to-Point
OR
Broadcast
Network
Locate
Process
Analyze
Summarize
Data
Base
4
Database Store ALL Data Types
• The Old World:
– Millions of objects
– 100-byte objects
• The New World:
• Billions of objects
• Big objects (1MB)
• Objects have behavior
(methods)
People
Name Address
David
NY
Mike
Berk
Won
Austin
People
Name Address Papers Picture Voice
David NY
Mike
Berk
Won Austin
Paperless office
Library of congress online
All information online
entertainment
publishing
business
Information Network,
Knowledge Navigator,
Information at your fingertips
6
Bell & Gray 4/15 / 95
Magnetic Storage Cheaper than Paper
• File Cabinet:
cabinet (4 drawer)
paper (24,000 sheets)
space (2x3 @ 10$/ft2)
total
250$
250$
180$
700$
3 ¢/sheet
• Disk:
disk (8 GB =)
4,000$
ASCII: 4 m pages
0.1 ¢/sheet (30x cheaper)
• Image:
200 k pages
2 ¢/sheet
• Store everything on disk
(similar to paper)
7
Bell & Gray 4/15 / 95
Cyberspace Demographics
• Computer History:
1950 National Computer
1960 Corporate Computer
1970 Site Computer
1980 Departmental Computer
1990 Personal Computer
2000 ?
• most computers are small
100M
NEXT: 1 Billion X for some X (phone?)
1M
1K
SUPER MAIN MINI WS PC
FRAME
• most of the money is in
clients and wiring
1990: 50% desktop
1995: 75% desktop
Bell & Gray 4/15 / 95
100B$
10B$
1B$
8
Billions of Clients
• Every device will be “intelligent”
• Doors, rooms, cars, ...
• Computing will be ubiquitous
9
Bell & Gray 4/15 / 95
Billions of Clients Need
Millions of Servers
All clients are networked to servers
Clients
may be nomadic or on-demand
mobile
clients
Fast clients want faster servers Servers
fixed
clients
server
Servers provide
data,
control,
coordination
communication
super
server
Super Servers
Large Databases
High Traffic shared data
10
Bell & Gray 4/15 / 95
Outline
• Cyberspace Pep Talk:
• Databases are the dirt of Cyberspace
• Billions of clients mean millions of servers
• Parallel Imperative:
• Hardware trend: Many little devices
• Consequence: Server arrays of commodity parts
• PC’s are the bricks of Cyberspace
• Must automate parallel {design / operation / use}
• Software parallelism via dataflow & Data Partitioning
• Parallel database techniques
•
•
•
•
Parallel execution of many little jobs (OLTP)
Data Partitioning
Pipeline Execution
Automation techniques)
• Summary
Bell & Gray 4/15 / 95
17
Moore’s Law Restated
Many Little Won over Few Big
Hardware trends: Few generic parts:
1 M$
100 K$
10 K$
Micro
Mainframe
Mini
9"
5.25"
3.5"
CPU
RAM
Disk & Tape arrays
ATM for LAN/WAN
?? for CAN
?? for OS
Nano
2.5" 1.8"
These parts will be inexpensive (commodity components)
Systems will be arrays of these parts
Software challenge: how to program arrays
18
Bell & Gray 4/15 / 95
Future SuperServer
1,000 discs =
10 Terrorbytes
High Speed Network ( 10 Gb/s)
Array of
processors,
disks,
tapes
comm lines
100 Tape Transports
= 1,000 tapes
= 1 PetaByte
100 Nodes
1 Tips
Challenge:
How to program it
Must use parallelism
Pipeline
hide latency
Partition
bandwidth
scaleup
19
Bell & Gray 4/15 / 95
The Hardware is in Place and
Then A Miracle Occurs
?
SNAP
Scaleable Network And Platforms
Commodity Distributed OS
built on
Commodity Platforms
Commodity Network Interconnect
21
Bell & Gray 4/15 / 95
Why Parallel Access To Data?
At 10 MB/s
1.2 days to scan
1,000 x parallel
1.3 minute SCAN.
1 Terabyte
1 Terabyte
10 MB/s
Bell & Gray 4/15 / 95
Parallelism:
divide a big problem
into many smaller ones
to be solved in parallel.
22
DataFlow Programming
Prefetch & Postwrite Hide Latency
• Can't wait for the data to arrive
• Need a memory that gets the data in advance ( 100MB/S)
• Solution:
• Pipeline from source (tape, disc, ram...) to cpu cache
• Pipeline results to destination
23
Bell & Gray 4/15 / 95
The New Law of Computing
Grosch's Law:
1 MIPS
1$
2x $ is 4x performance
1,000 MIPS
32 $
.03$/MIPS
2x $ is
2x performance
Parallel Law:
Needs
Linear Speedup and Linear Scaleup
Not always possible
1,000 MIPS
1,000 $ 1 MIPS
1$
24
Bell & Gray 4/15 / 95
Parallelism: Performance is the Goal
Goal is to get 'good' performance.
Law 1: parallel system should be
faster than serial system
Law 2: parallel system should give
near-linear scaleup or
near-linear speedup or
both.
Parallelism is faster, not cheaper:
trades money for time.
Bell & Gray 4/15 / 95
25
The Perils of Parallelism
Processors & Discs
Skew
Interference
Linearity
Three Perils
Startup
A Bad Speedup Curve
No Parallelism Benefit
Processors & Discs
Startup:
Creating processes
Opening files
Optimization
Interference: Device (cpu, disc, bus)
logical (lock, hotspot, server, log,...)
Skew:
If tasks get very small, variance > service time
27
Bell & Gray 4/15 / 95
Kinds of Parallel Execution
Pipeline
Partition
outputs split N ways
inputs merge M ways
Any
Sequential
Program
Sequential
Sequential
Any
Sequential
Sequential
Program
Any
Sequential
Program
Any
Sequential
Sequential
Program
29
Bell & Gray 4/15 / 95
Data Rivers
Split + Merge Streams
N X M Data Streams
M Consumers
N producers
River
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
does partition and merge of data records
River = Exchange operator in Volcano.
Bell & Gray 4/15 / 95
30
Partitioned Data and Execution
Spreads computation and IO among processors
Count
Count
Count
Count
Count
Count
A Table
A...E
F...J
K...N
O...S
T...Z
Partitioned data gives
NATURAL execution parallelism
31
Bell & Gray 4/15 / 95
Partitioned + Merge + Pipeline
Execution
Merge
Sort
Sort
Sort
Sort
Sort
Join
Join
Join
Join
Join
A...E
F...J
K...N
O...S
T...Z
Pure dataflow programming
Gives linear speedup & scaleup
But, top node may be bottleneck
So....
32
Bell & Gray 4/15 / 95
N xM way Parallelism
Merge
Merge
Merge
Sort
Sort
Sort
Sort
Sort
Join
Join
Join
Join
Join
A...E
F...J
K...N
O...S
T...Z
N inputs, M outputs, no bottlenecks.
Bell & Gray 4/15 / 95
33
Why are Relational Operators
Successful for Parallelism?
Relational data model
uniform operators
on uniform data stream
Closed under composition
Each operator consumes 1 or 2 input streams
Each stream is a uniform collection of data
Sequential data in and out: Pure dataflow
partitioning some operators (e.g. aggregates, non-equi-join, sort,..)
requires innovation
AUTOMATIC PARALLELISM
Bell & Gray 4/15 / 95
34
SQL
a NonProcedural Programming Language
• SQL: functional programming language
describes answer set.
• Optimizer picks best execution plan
• Picks data flow web (pipeline),
• degree of parallelism (partitioning)
• other execution parameters (process placement, memory,...)
Execution
Planning
Monitor
Schema
GUI
Bell & Gray 4/15 / 95
Optimizer
Plan
Executors
Rivers
35
Database Systems “Hide” Parallelism
• Automate system management via tools
• data placement
• data organization (indexing)
• periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
• duplex & failover
• transactions
• Automatic parallelism
• among transactions (locking)
• within a transaction (parallel execution)
36
Bell & Gray 4/15 / 95
Success Stories
• Online Transaction Processing
• many little jobs
• SQL systems support 3700 tps-A
(24 cpu, 240 disk)
• SQL systems support 21,000 tpm-C
hardware
(110 cpu, 800 disk)
• Batch (decision support and Utility)
• few big jobs, parallelism inside
• Scan data at 100 MB/s
• Linear Scaleup to 50 processors
hardware
37
Bell & Gray 4/15 / 95
Kinds of Partitioned Data
Split a SQL table to subset of nodes & disks
Partition within set:
Range
A...E F...J
K...N O...S T...Z
Hash
A...E F...J
K...N O...S T...Z
Round Robin
A...E F...J
K...N O...S T...Z
Good for equijoins, Good for equijoins Good to spread load
range queries
group-by
Shared disk and memory less sensitive to partitioning,
Shared nothing benefits from "good" partitioning
38
Bell & Gray 4/15 / 95
Picking Data Ranges
Disk Partitioning
For range partitioning, sample load on disks.
Cool hot disks by making range smaller
For hash partitioning,
Cool hot disks by mapping some buckets to others
River Partitioning
Use hashing and assume uniform
If range partitioning, sample data and use
histogram to level the bulk
Teradata, Tandem, Oracle use these tricks
41
Bell & Gray 4/15 / 95
Parallel Data Scan
Select image
from landsat
where date between 1970 and 1990
and overlaps(location, :Rockies)
and snow_cover(image) >.7;
Landsat
date loc image
1/2/72
.
.
.
.
.
..
.
.
4/8/95
33N
120W
.
.
.
.
.
.
.
34N
120W
Temporal
Spatial
Image
Assign one process per processor/disk:
find images with right data & location
analyze image, if 70% snow, return it
Answer
image
date, location,
& image tests
42
Bell & Gray 4/15 / 95
Parallel Aggregates
For aggregate function, need a decomposition strategy:
count(S) = count(s(i)), ditto for sum()
avg(S) = ( sum(s(i))) /  count(s(i))
and so on...
For groups,
sub-aggregate groups close to the source
drop sub-aggregates into a hash river.
Count
Count
Count
Count
Count
Count
A Table
Bell & Gray 4/15 / 95
A...E
F...J
K...N
O...S
T...Z
44
Parallel Sort
M input N output
Sort design
River is range or hash partitioned
Merge
runs
Disk and merge
not needed if
sort fits in
memory
Sub-sorts
generate
runs
Scales linearly because
6
log(10 )
12
=
log(10 )
6
12
Range or Hash Partition River
=>
2x slower
Scan
or
other source
Sort is benchmark from hell for shared nothing machines
net traffic = disk bandwidth, no data filtering at the source
46
Bell & Gray 4/15 / 95
Blocking Operators =Short Piplelines
An operator is blocking,
if it does not produce any output,
until it has consumed all its input
T ape
File
SQL Table
Process
Scan
Examples:
Sort,
Aggregates,
Hash-Join (reads all of one operand)
Sort Runs
Database Load
Template has
three blocked
phases
M erge Runs T able Insert
SQL Table
Sort Runs
M erge Runs Index Insert
Sort Runs
M erge Runs Index Insert
Sort Runs
M erge Runs Index Insert
Index 1
Index 2
Index 3
Blocking operators kill pipeline parallelism
Make partition parallelism all the more important.
47
Bell & Gray 4/15 / 95
Hash Join
Hash smaller table into N buckets (hope N=1)
If N=1 read larger table, hash to smaller
Else, hash outer to disk then
bucket-by-bucket hash join.
Purely sequential data behavior
Right Table
Hash
Buckets
Left
Table
Always beats sort-merge and nested
unless data is clustered.
Good for equi, outer, exclusion join
Lots of papers,
products just appearing (what went wrong?)
Hash reduces skew
50
Bell & Gray 4/15 / 95
Observation: Execution “easy”
Automation “hard”
It is “easy” to build a fast parallel execution environment
(no one has done it, but it is just programming)
It is hard to write a robust and world-class query optimizer.
There are many tricks
One quickly hits the complexity barrier
Common approach:
Pick best sequential plan
Pick degree of parallelism based on bottleneck analysis
Bind operators to process
Place processes at nodes
Place scratch files near processes
Use memory as a constraint
51
Bell & Gray 4/15 / 95
Systems That Work This Way
Shared Nothing
CLIENTS
Teradata:
400 nodes
Tandem:
110 nodes
IBM / SP2 / DB2: 48 nodes
ATT & Sybase 112 nodes
Informix/SP2
48 nodes
CLIENTS
Shared Disk
Oracle
Rdb
170 nodes
24 nodes
Shared Memory
Informix
RedBrick
CLIENTS
9 nodes
? nodes
Processors
M emory
52
Bell & Gray 4/15 / 95
Research Problems
(partition: random or organized)
• Automatic parallel programming
(process placement)
1,000 discs =
10 Terrorbytes
High Speed Network ( 10 Gb/s)
• Automatic data placement
100 Tape Transports
= 1,000 tapes
= 1 PetaByte
100 Nodes
1 Tips
• Parallel concepts, algorithms & tools
• Parallel Query Optimization
• Execution Techniques
load balance,
checkpoint/restart,
pacing,
53
Bell & Gray 4/15 / 95
Summary
• Cyberspace is Growing
• Databases are the dirt of cybersspace
PCs are the bricks, Networks are the morter.
Many little devices: Performance via Arrays of {cpu, disk ,tape}
• Then a miracle occurs: a scaleable distributed OS and net
• SNAP: Scaleable Networks and Platforms
• Then parallel database systems give software parallelism
• OLTP: lots of little jobs run in parallel
• Batch TP: data flow & data partitioning
• Automate processor & storage array administration
• Automate processor & storage array programming
• 2000 platforms as easy as 1 platform.
54
Bell & Gray 4/15 / 95
Download