Parallel Database Systems
Analyzing LOTS of Data
Jim Gray
Microsoft Research
1
Main Message
• Technology trends give
  – many processors and storage units
  – inexpensively
• To analyze large quantities of data
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (trades time for money)
  – relational systems provide many parallel algorithms
3
Moore's Law
• "XXX" doubles every 18 months (a 60% increase per year):
  – microprocessor speeds
  – chip density
  – magnetic disk density
  – communications bandwidth (WAN bandwidth approaching LANs)
[Figure: memory chip capacity vs. year, 1970-2000, growing from 1 Kb to 256 Mb per chip; one-chip memory size today is 2 MB to 32 MB.]
5
Magnetic Storage Cheaper than Paper
• File Cabinet:  cabinet (4 drawer)        250$
                 paper (24,000 sheets)     250$
                 space (2x3 @ 10$/ft2)     180$
                 total                     700$   about 3 ¢/sheet
• Disk:          disk (4 GB)               500$
                 ASCII: 2 million pages    0.025 ¢/sheet (100x cheaper)
• Image:         200 k pages               0.25 ¢/sheet  (10x cheaper)
• Store everything on disk
6
Today's Storage Hierarchy:
Speed & Capacity vs Cost Tradeoffs
[Figure: two log-log charts of the storage hierarchy (cache, main memory, secondary disc, online tape, nearline tape, offline tape).
 Left, Price vs Speed: $/MB falls from about 10^2 to 10^-2 as access time grows from 10^-9 to 10^3 seconds.
 Right, Size vs Speed: capacity grows from about 10^4 bytes (cache) to 10^15 bytes (offline tape) over the same access-time range.]
9
Storage Latency:
How Far Away is the Data?
Level                 Clock Ticks   As Far Away As   Travel Time
Registers             1             My Head
On Chip Cache         2             This Room        1 min
On Board Cache        10            This Campus      10 min
Memory                100           Sacramento       1.5 hr
Disk                  10^6          Pluto            2 Years
Tape / Optical Robot  10^9          Andromeda        2,000 Years
10
Thesis
Many Little will Win over Few Big
[Figure: mainframes (1 M$, 14" discs) and minis (100 K$, 9" discs) give way to micros and nanos (10 K$ and below, with 5.25", 3.5", 2.5", and 1.8" discs).]
11
Implications of Hardware Trends
• Large disc farms will be inexpensive (10 k$/TB)
• Large RAM databases will be inexpensive (1 k$/GB)
• Processors will be inexpensive
So the building block will be a CyberBrick™:
a 1k SPECint CPU with a large RAM (2 GB),
lots of disc (500 GB), and lots of network bandwidth.
12
Implication of Hardware Trends:
Clusters
Future servers are CLUSTERS of processors and discs
(many nodes, each a CPU with 5 GB RAM and 50 GB disc).
Distributed database techniques make clusters work.
13
Summary
• Tech trends => pipeline & partition parallelism
  – lots of bytes & bandwidth per dollar
  – lots of latency
  – lots of MIPS per dollar
  – lots of processors
• Putting it together: Scaleable Networks and Platforms
  – build clusters of commodity processors & storage
  – commodity interconnect is key (the S of PMS)
    • traditional interconnects give 100 k$/slice
  – commodity cluster operating system is key
  – fault isolation and tolerance are key
  – automatic parallel programming is key
15
Parallelism: Performance is the Goal
Goal is to get 'good' performance.
Law 1: a parallel system should be faster than the serial system.
Law 2: a parallel system should give near-linear scaleup,
       near-linear speedup, or both.
18
Kinds of Parallel Execution
• Pipeline parallelism: chain sequential programs so that each one's
  output stream feeds the next; the stages run concurrently.
• Partition parallelism: run many copies of the same sequential program,
  each on its own partition of the data; inputs merge M ways and
  outputs split N ways.
(A minimal sketch of both forms follows this slide.)
19
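The two forms can be illustrated with ordinary sequential functions. A minimal Python sketch, assuming invented stage names (scan, keep_even, total) that stand in for "any sequential program"; a real engine would run the stages and partitions concurrently.

# Minimal sketch of pipeline vs. partition parallelism (illustrative only).
def scan(data):                 # stage 1: produce a stream of records
    for record in data:
        yield record

def keep_even(stream):          # stage 2: consume one stream, produce another
    for record in stream:
        if record % 2 == 0:
            yield record

def total(stream):              # stage 3: blocking aggregate at the end
    return sum(stream)

data = list(range(100))

# Pipeline parallelism: the stages are chained; records flow between them.
pipelined = total(keep_even(scan(data)))

# Partition parallelism: split the input N ways, run the SAME sequential
# program on each partition, then merge (here: add) the partial results.
N = 4
partitions = [data[i::N] for i in range(N)]
partitioned = sum(total(keep_even(scan(p))) for p in partitions)

assert pipelined == partitioned == sum(x for x in data if x % 2 == 0)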
The Perils of Parallelism
Speedup = OldTime / NewTime
[Figure: speedup vs. processors & discs. The good speedup curve is linear;
 bad curves flatten or turn down; no parallelism gives no benefit.]
Three factors bend the curve (modeled in the sketch below):
• Startup: creating processes, opening files, optimization
• Interference: physical devices (cpu, disc, bus) and
  logical resources (lock, hotspot, server, log, ...)
• Skew: if tasks get very small, variance exceeds service time
20
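A rough way to see how startup, interference, and skew bend the speedup curve is to model them explicitly. A toy sketch whose constants (work, startup, interference, skew) are invented for illustration, not measurements of any real system.

# Toy model of speedup vs. degree of parallelism (all constants are invented).
def speedup(n, work=1000.0, startup=5.0, interference=0.2, skew=1.1):
    """OldTime / NewTime for n-way parallelism.

    startup      -- fixed cost of creating processes / opening files
    interference -- per-processor cost of shared devices and locks
    skew         -- the slowest partition is 'skew' times the average
    """
    old_time = work
    new_time = startup + (work / n) * skew + interference * n
    return old_time / new_time

for n in (1, 2, 4, 8, 16, 64, 256):
    print(f"{n:4d} processors -> speedup {speedup(n):6.1f}")
# With these constants the curve rises, flattens, and eventually turns down:
# startup and skew dominate when tasks get small, interference when n gets large.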
Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan 1 Terabyte
(10^12 bytes / 10^7 bytes/s = 10^5 seconds, about 28 hours).
With 1,000-way parallelism, the same scan takes about 1.5 minutes.
Parallelism: divide a big problem into many smaller ones
to be solved in parallel.
21
Data Flow Programming
Prefetch & Postwrite Hide Latency
Can't wait for the data to arrive (2,000 years!), so
memory fetches the data in advance (at about 100 MB/s).
Solution:
  Pipeline from storage (tape, disc, ...) to cpu cache
  Pipeline results to destination
(A double-buffering sketch follows this slide.)
22
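One common way to hide storage latency is double buffering: fetch block i+1 while processing block i. A minimal sketch using a thread pool; read_block and process are invented placeholders, not part of any real storage API.

# Double-buffering sketch: prefetch the next block while processing the current one.
from concurrent.futures import ThreadPoolExecutor

def read_block(i):
    # Stand-in for a slow storage read (tape, disc, network ...).
    return [i] * 1000

def process(block):
    return sum(block)

def scan(num_blocks):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_block, 0)              # prefetch block 0
        for i in range(num_blocks):
            block = future.result()                      # wait for the prefetch
            if i + 1 < num_blocks:
                future = pool.submit(read_block, i + 1)  # prefetch the next block
            total += process(block)                      # overlap compute with IO
    return total

print(scan(10))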
Parallelism: Speedup & Scaleup
• Speedup: same job (100 GB), more hardware, less time.
• Scaleup: bigger job (100 GB -> 1 TB), more hardware, same time.
• Transaction scaleup: more clients and servers
  (1 k clients on a 100 GB server -> 10 k clients on a 1 TB server),
  same response time.
23
Benchmark Buyer's Guide
Things to ask about any benchmark report (the whole story):
• When does it stop scaling
  (the full throughput vs. processors & discs curve)?
• Throughput numbers, not ratios.
• Standard benchmarks allow
  – comparison to others
  – comparison to sequential execution
Ratios & non-standard benchmarks are red flags.
24
Why are Relational Operators
So Successful for Parallelism?
The relational data model gives uniform operators
on uniform data streams, closed under composition.
Each operator consumes 1 or 2 input streams;
each stream is a uniform collection of data.
Sequential data in and out: pure dataflow.
Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...)
requires innovation, but the payoff is AUTOMATIC PARALLELISM.
25
Database Systems “Hide” Parallelism
• Automate system management via tools
  – data placement
  – data organization (indexing)
  – periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  – duplex & failover
  – transactions
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)
26
Automatic Parallel OR DB
  Select image
  from   landsat
  where  date between 1970 and 1990
    and  overlaps(location, :Rockies)
    and  snow_cover(image) > .7;
[Landsat table: date, location, image columns; rows span 1/2/72 to 4/8/95 at locations such as 33N 120W and 34N 120W.]
Assign one process per processor/disk:
• find images with the right date & location (temporal and spatial tests)
• analyze each image; if it is more than 70% snow, return it (image test)
27
Automatic Data Partitioning
Split a SQL table across a subset of the nodes & disks.
Partition within the set by (sketched in code after this slide):
• Range (A...E | F...J | K...N | O...S | T...Z):
  good for equi-joins, range queries, and group-by
• Hash: good for equi-joins
• Round Robin: good to spread load
Shared-disk and shared-memory systems are less sensitive to partitioning;
shared-nothing systems benefit from "good" partitioning.
28
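A minimal sketch of the three placement schemes, assuming rows keyed by a name string; the cut points, node count, and function names are invented for illustration.

# Three ways to assign a row to one of N nodes (all constants are illustrative).
import zlib

N = 5
RANGE_CUTS = ["F", "K", "O", "T"]          # A...E | F...J | K...N | O...S | T...Z

def range_partition(key):
    """Good for equi-joins, range queries, and group-by."""
    for node, cut in enumerate(RANGE_CUTS):
        if key.upper() < cut:
            return node
    return len(RANGE_CUTS)

def hash_partition(key):
    """Good for equi-joins; no cut points to maintain."""
    return zlib.crc32(key.encode()) % N

_next_node = 0
def round_robin_partition(_key):
    """Good for spreading load; no help for joins or range queries."""
    global _next_node
    node, _next_node = _next_node, (_next_node + 1) % N
    return node

for name in ["Adams", "Gray", "Stonebraker", "Zhou"]:
    print(name, range_partition(name), hash_partition(name), round_robin_partition(name))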
Index Partitioning
• Hash indices partition by hash
  (buckets 0...9 | 10..19 | 20..29 | 30..39 | 40..•).
• B-tree indices partition as a forest of trees,
  one tree per range (A..C | D..F | G..M | N..R | S..Z).
• The primary index clusters the data.
29
Secondary Index Partitioning
In shared nothing, secondary indices are problematic. Two choices:
• Partition the secondary index by base-table key ranges:
  – Insert: completely local (but what about uniqueness?)
  – Lookup: must examine ALL index trees
  – a unique index requires a lookup on every insert
• Partition the secondary index by secondary-key ranges (the Teradata solution):
  – Insert: touches two nodes (base and index)
  – Lookup: touches two nodes (index -> base)
  – uniqueness is easy
[Figure: left, an A..Z index fragment on every base partition; right, index partitions A..C | D..F | G..M | N..R | S..• pointing into the base table.]
(A minimal sketch of the second scheme follows this slide.)
30
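A minimal dictionary sketch of the second scheme: the secondary index is partitioned by secondary-key ranges, so a lookup touches one index node and then one base node. All node layouts, keys, and data below are invented for illustration.

# Sketch: secondary index partitioned by secondary-key ranges.
# A lookup touches exactly two nodes: one index partition, then one base partition.

# Base table partitioned by primary key (employee id) range.
base_nodes = {
    "ids 0-499":   {17: ("Gray", "Research"), 402: ("Adams", "Sales")},
    "ids 500-999": {733: ("Zhou", "Research")},
}

# Secondary index on name, partitioned by NAME range, not by id range.
# Each entry maps name -> (base node, primary key).
index_nodes = {
    "names A-M": {"Adams": ("ids 0-499", 402), "Gray": ("ids 0-499", 17)},
    "names N-Z": {"Zhou":  ("ids 500-999", 733)},
}

def lookup_by_name(name):
    index_node = "names A-M" if name.upper() <= "M" else "names N-Z"
    base_node, emp_id = index_nodes[index_node][name]    # hop 1: index node
    return base_nodes[base_node][emp_id]                 # hop 2: base node

print(lookup_by_name("Zhou"))   # ('Zhou', 'Research'); uniqueness is easy because
                                # each name lives on exactly one index partition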
Data Rivers: Split + Merge Streams
N producers and M consumers are connected by N x M data streams: the "river".
Producers add records to the river; consumers take records from the river.
Each producer and consumer is purely sequential code.
The river does flow control and buffering,
and does the partitioning and merging of data records.
River = Split/Merge in Gamma = the Exchange operator in Volcano.
(A toy river appears after this slide.)
31
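A toy in-memory river: N producers emit records, the river hash-partitions them into M consumer streams, and each consumer reads its stream sequentially. Real rivers also do flow control and buffering; the record layout and counts here are invented.

# Toy data river: N producers -> hash split -> M consumer streams.
from collections import defaultdict

N_PRODUCERS, M_CONSUMERS = 3, 2

def producer(pid):
    # Each producer is purely sequential code emitting (key, value) records.
    return [(f"key{pid}-{i}", i) for i in range(4)]

# The river partitions records by hashing the key to one of M consumer streams.
river = defaultdict(list)
for pid in range(N_PRODUCERS):
    for key, value in producer(pid):
        river[hash(key) % M_CONSUMERS].append((key, value))

def consumer(stream):
    # Each consumer is purely sequential code reading one merged stream.
    return sum(value for _key, value in stream)

results = [consumer(river[c]) for c in range(M_CONSUMERS)]
print(results, sum(results))   # the total is the same no matter how the river split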
Partitioned Execution
Spreads computation and IO among processors:
run one Count operator on each partition
(A...E | F...J | K...N | O...S | T...Z) of the table.
Partitioned data gives NATURAL parallelism.
32
N x M way Parallelism
[Figure: five Sort operators, one per data partition (A...E | F...J | K...N | O...S | T...Z),
 feed a river that repartitions into Merge and Join operators.]
N inputs, M outputs, no bottlenecks:
partitioned and pipelined data flows.
33
Picking Data Ranges
• Disk partitioning:
  – for range partitioning, sample the load on the disks and
    cool hot disks by making their ranges smaller;
  – for hash partitioning, cool hot disks
    by remapping some buckets to other disks.
• River partitioning:
  – use hashing and assume a uniform distribution;
  – for range partitioning, sample the data and use a
    histogram to level the bulk (see the sketch below).
Teradata, Tandem, and Oracle use these tricks.
34
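For range partitioning, one way to level the load is to sample the key values and cut at sample quantiles (an equi-depth histogram). A minimal sketch; the skewed sample and node count are invented.

# Pick range boundaries from a sample so each node gets about the same bulk.
import random
from collections import Counter

random.seed(0)
# Invented, skewed sample of keys: most of the data is small values.
sample = sorted(random.expovariate(1.0) for _ in range(10_000))

N_NODES = 4
# Cut at the 1/N, 2/N, ... quantiles of the sample (equi-depth, not equi-width).
cuts = [sample[len(sample) * i // N_NODES] for i in range(1, N_NODES)]

def node_for(key):
    for node, cut in enumerate(cuts):
        if key < cut:
            return node
    return N_NODES - 1

counts = Counter(node_for(k) for k in sample)
print("range boundaries:", [round(c, 2) for c in cuts])
print("keys per node:   ", [counts[n] for n in range(N_NODES)])
# Equal-width ranges over the same skewed data would leave one node hot;
# equal-depth ranges from the sample level the load.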
Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output
until it has consumed all of its input.
Examples: Sort, Aggregates, Hash-Join (reads all of one operand).
[Figure: a database-load template with three blocked phases:
 Scan (tape, file, or SQL table) feeds Sort Runs; Merge Runs feed the
 Table Insert and the Index Inserts for Index 1, 2, and 3.]
Blocking operators kill pipeline parallelism
and make partition parallelism all the more important.
(A runs-then-merge sketch follows this slide.)
35
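The runs-then-merge pattern in the load template can be sketched as a classic external sort: the run-generation phase is blocking, since it must consume all its input before the merge can start. The run size and data below are invented.

# External-sort sketch of the "sort runs, then merge runs" phases.
import heapq
import random

random.seed(1)
records = [random.randint(0, 10_000) for _ in range(1_000)]

RUN_SIZE = 100   # pretend only 100 records fit in memory at once

# Phase 1 (blocking): consume ALL the input, producing sorted runs.
runs = [sorted(records[i:i + RUN_SIZE]) for i in range(0, len(records), RUN_SIZE)]

# Phase 2 (blocking): merge the runs; nothing downstream (table insert,
# index inserts) can start until merged records begin to appear.
merged = list(heapq.merge(*runs))

assert merged == sorted(records)
print(len(runs), "runs merged into", len(merged), "sorted records")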
Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices:
they are more compact, and sometimes the index already carries the aggregate info.
GROUP BY aggregates:
  scan in category order if possible (use indices);
  else, if the categories fit in RAM,
    use a RAM hash table keyed by category (see the sketch below);
  else, make a temp file of <category, item> pairs,
    sort it by category, and do the math in the merge step.
36
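A minimal sketch of the in-RAM case: a hash table keyed by category accumulates the running aggregates. The input rows are invented.

# GROUP BY via a RAM hash table keyed by category.
from collections import defaultdict

rows = [("fruit", 3), ("veg", 5), ("fruit", 2), ("dairy", 4), ("veg", 1)]

totals = defaultdict(lambda: [0, 0])        # category -> [count, sum]
for category, amount in rows:
    totals[category][0] += 1
    totals[category][1] += amount

for category, (count, total) in sorted(totals.items()):
    print(category, "count =", count, "sum =", total)
# If the categories did not fit in RAM, we would instead write <category, item>
# pairs to a temp file, sort by category, and do this arithmetic in the merge step.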
Parallel Aggregates
For each aggregate function, we need a decomposition strategy:
  count(S) = Σ count(s(i)), and ditto for sum()
  avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  and so on ...
For groups:
  sub-aggregate the groups close to the source,
  then drop the sub-aggregates into a hash river.
[Figure: one Count operator per partition (A...E | F...J | K...N | O...S | T...Z) of the table.]
(A two-phase sketch follows this slide.)
37
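A two-phase sketch of the decomposition: each partition computes local (count, sum) pairs per group close to the source, then the partials are combined as if they had been dropped into a hash river keyed by group. The partitions and values are invented.

# Two-phase parallel aggregate: local sub-aggregates, then a global combine.
from collections import defaultdict

# Invented partitions of a (group, value) table, one list per node.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7), ("c", 8)],
]

def local_aggregate(rows):
    """Phase 1, close to the source: per-group (count, sum)."""
    partial = defaultdict(lambda: [0, 0])
    for group, value in rows:
        partial[group][0] += 1
        partial[group][1] += value
    return partial

def combine(partials):
    """Phase 2: combine partials (count(S) = sum of count(s(i)), etc.)."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (count, amount) in partial.items():
            total[group][0] += count
            total[group][1] += amount
    return {g: (c, s, s / c) for g, (c, s) in total.items()}   # count, sum, avg

print(combine(local_aggregate(p) for p in partitions))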
Parallel Sort
M input streams, N output streams.
• Scan (or another source) feeds each input into a range- or hash-partitioned river.
• Sub-sorts at each output generate runs; the runs are then merged.
  (The disk and merge step are not needed if the sort fits in memory.)
Scales "linearly" because log(10^12) / log(10^6) = 12/6 = 2:
a million-times-bigger sort is only about 2x slower per record.
Sort is the benchmark from hell for shared-nothing machines:
net traffic equals disk bandwidth, and there is no data filtering at the source.
(A partitioned-sort sketch follows this slide.)
38
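A sketch of the partitioned sort: range-partition the records through the river, sort each partition locally, then concatenate the partitions in range order; a merge is only needed inside a node if its partition does not fit in memory. The records and cut points are invented.

# Parallel sort sketch: range-partition, sort locally, concatenate in order.
import random

random.seed(2)
records = [random.randint(0, 999) for _ in range(10_000)]

CUTS = [250, 500, 750]          # the river's range-partition boundaries (invented)

def node_for(key):
    for node, cut in enumerate(CUTS):
        if key < cut:
            return node
    return len(CUTS)

# The "river": route every record to its range partition.
partitions = [[] for _ in range(len(CUTS) + 1)]
for r in records:
    partitions[node_for(r)].append(r)

# Each node sorts its own partition independently (sub-sorts / runs + merge).
for p in partitions:
    p.sort()

# Because the partitions are disjoint ranges, concatenation gives a global sort.
globally_sorted = [r for p in partitions for r in p]
assert globally_sorted == sorted(records)
print("records per node:", [len(p) for p in partitions])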
Hash Join: Combining Two Tables
• Hash the smaller (left) table into N buckets (hope N = 1).
• If N = 1: read the larger (right) table and probe the in-memory hash of the smaller one.
• Else: hash the outer table to disk as well, then do a bucket-by-bucket hash join.
Purely sequential data behavior.
Always beats sort-merge and nested-loops join unless the data is already clustered.
Good for equi-, outer-, and exclusion joins, and hashing reduces skew.
Lots of papers; products just appearing (what went wrong?).
(A single-bucket sketch follows this slide.)
39
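A single-bucket (N = 1) sketch: build an in-memory hash table on the smaller table, then stream the larger table past it. The tables, keys, and payloads are invented.

# In-memory hash join (the N = 1 case): build on the small table, probe with the big one.
from collections import defaultdict

# Invented tables of (join key, payload) rows.
small = [("k1", "left-a"), ("k2", "left-b"), ("k2", "left-c")]
large = [("k2", "right-x"), ("k3", "right-y"), ("k1", "right-z")]

# Build phase: hash the smaller table into buckets keyed by the join key.
buckets = defaultdict(list)
for key, payload in small:
    buckets[key].append(payload)

# Probe phase: stream the larger table sequentially, probing the hash table.
result = [(key, left, right)
          for key, right in large
          for left in buckets.get(key, [])]

print(result)
# If the small table did not fit in memory, both inputs would first be hashed
# to disk into N buckets and each bucket pair joined the same way.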
Parallel Hash Join
ICL implemented hash join with bitmaps in the CAFS machine (1976)!
Kitsuregawa pointed out the parallelism benefits of hash join
in the early 1980's (it partitions beautifully).
We ignored them! (Why?)
But now everybody's doing it (or promises to do it).
Hashing minimizes skew and requires little thinking for redistribution,
but it uses massive main memory.
40
What Systems Work This Way
• Shared Nothing:
  Teradata          400 nodes (80x12 nodes)
  Tandem            110 nodes
  IBM / SP2 / DB2   128 nodes
  Informix / SP2    100 nodes
  ATT & Sybase      8x14 nodes
• Shared Disk:
  Oracle            170 nodes
  Rdb               24 nodes
• Shared Memory:
  Informix          9 nodes
  RedBrick          ? nodes
42
Outline
• Why Parallelism:
  – technology push
  – application pull
• Parallel Database Techniques
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators
43
Main Message
• Technology trends give
  – many processors and storage units
  – inexpensively
• To analyze large quantities of data
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (trades time for money)
  – relational systems provide many parallel algorithms
44