Parallel Database Systems: Analyzing LOTS of Data
Jim Gray, Microsoft Research

Main Message
• Technology trends give
  – many processors and storage units
  – inexpensively
• To analyze large quantities of data
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (it trades time for money)
  – relational systems offer many parallel algorithms.

Moore's Law
• Everything below doubles every 18 months, a 60% increase per year:
  – microprocessor speeds
  – chip density
  – magnetic disk density
  – communications bandwidth (WAN bandwidth is approaching LANs)
• [Chart: one-chip memory size grows from 1 Kbit in 1970 to 256 Mbit around 2000; today a single chip holds 2 MB to 32 MB.]

Magnetic Storage Cheaper than Paper
• File cabinet:
  – cabinet (4 drawer)           $250
  – paper (24,000 sheets)        $250
  – space (2 x 3 ft @ $10/ft2)   $180
  – total                        $700, about 3 ¢/sheet
• Disk: 4 GB for about $500
  – ASCII: 2 million pages, 0.025 ¢/sheet (100x cheaper)
  – Image: 200 thousand pages, 0.25 ¢/sheet (10x cheaper)
• Conclusion: store everything on disk.

Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
• [Charts: price ($/MB) vs access time and size (bytes) vs access time for cache, main memory, secondary storage (disc), online tape, nearline tape, and offline tape. Access times span 10^-9 to 10^3 seconds; sizes span roughly 10^4 to 10^15 bytes; prices span roughly 10^-4 to 10^2 $/MB.]

Storage Latency: How Far Away is the Data?
  Clock ticks   Level                  It is as if the data were at...
  10^9          Tape / optical robot   Andromeda (2,000 years away)
  10^6          Disk                   Pluto (2 years)
  100           Memory                 Sacramento (1.5 hr)
  10            On-board cache         This campus (10 min)
  2             On-chip cache          This room
  1             Registers              My head (1 min)

Thesis: Many Little will Win over Few Big
• [Chart: mainframe (~$1M), mini (~$100K), micro (~$10K), trending toward "nano"; disk form factors shrink 14" -> 9" -> 5.25" -> 3.5" -> 2.5" -> 1.8".]

Implications of Hardware Trends
• Large disc farms will be inexpensive (~$10K/TB)
• Large RAM databases will be inexpensive (~$1K/GB)
• Processors will be inexpensive
• So the building block will be a processor with large RAM and lots of disc, a "CyberBrick™":
  – 1K SPECint CPU
  – 2 GB RAM
  – 500 GB disc
  – lots of network bandwidth

Implication of Hardware Trends: Clusters
• [Figure: a cluster of nodes, each with a CPU, 5 GB RAM, and 50 GB of disc.]
• Future servers are CLUSTERS of processors and discs.
• Distributed database techniques make clusters work.

Summary
• Tech trends => pipeline & partition parallelism
  – lots of bytes & bandwidth per dollar
  – lots of latency
  – lots of MIPS per dollar
  – lots of processors
• Putting it together: scaleable networks and platforms
  – build clusters of commodity processors & storage
  – commodity interconnect is key (the S of PMS); traditional interconnects cost ~$100K per slice
  – a commodity cluster operating system is key
  – fault isolation and tolerance are key
  – automatic parallel programming is key

Parallelism: Performance is the Goal
• The goal is to get 'good' performance.
• Law 1: a parallel system should be faster than the serial system.
• Law 2: a parallel system should give near-linear scaleup, near-linear speedup, or both.

Kinds of Parallel Execution
• Pipeline: one sequential program feeds its output directly to the next sequential program.
• Partition: many copies of the same sequential program each work on one partition of the data; inputs merge M ways and outputs split N ways.

The Perils of Parallelism
• Speedup = OldTime / NewTime. The good speedup curve is linear in the number of processors & discs.
• A bad speedup curve: no parallelism benefit at all.
• A bad speedup curve: speedup limited by three factors.
  – Startup: creating processes, opening files, optimization.
  – Interference: device contention (cpu, disc, bus) and logical contention (lock, hotspot, server, log, ...).
  – Skew: if tasks get very small, the variance exceeds the service time, so the slowest task dominates.
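To make the three factors concrete, here is a small back-of-the-envelope model in Python. It is not from the talk: the formulas and the startup, interference, and skew parameters are illustrative assumptions, chosen only to show how each factor bends the Speedup = OldTime / NewTime curve away from linear.

```python
# Toy model (not from the talk) of how startup, interference, and skew
# bend the ideal Speedup = OldTime / NewTime curve away from linear.
# All parameter values below are purely illustrative assumptions.

def speedup(n_procs, work=1000.0, startup=1.0, interference=0.02, skew=0.05):
    """Estimate speedup on n_procs processors & discs for a fixed job."""
    old_time = work                                     # serial elapsed time
    per_proc = work / n_procs                           # perfectly partitioned share
    slowest = per_proc * (1.0 + skew * (n_procs - 1))   # skew: the slowest task dominates
    waiting = interference * work * (n_procs - 1) / n_procs  # contention for shared devices/locks
    new_time = startup + slowest + waiting              # startup: process creation, file opens, optimization
    return old_time / new_time

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} processors: speedup = {speedup(n):5.1f}")
```

Running it shows the characteristic shape of the bad curve: speedup climbs quickly at first, then flattens as the interference and skew terms grow with the node count.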
Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 terabyte.
• With 1,000-way parallelism the same terabyte scans in 1.5 minutes.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.

Data Flow Programming: Prefetch & Postwrite Hide Latency
• We can't wait for the data to arrive (2,000 years!).
• We need memory that gets the data in advance (~100 MB/s).
• Solution: pipeline from storage (tape, disc, ...) to the cpu cache, and pipeline results to the destination.

Parallelism: Speedup & Scaleup
• Speedup: same job (100 GB), more hardware, less time.
• Scaleup: bigger job (100 GB -> 1 TB), more hardware, same time.
• Transaction scaleup: more clients and servers (1K clients on 100 GB -> 10K clients on 1 TB), same response time.

Benchmark Buyer's Guide
• Things to ask for:
  – the whole story (for any system): when does it stop scaling?
  – throughput numbers, not ratios
  – standard benchmarks, which allow comparison to others and to sequential systems
• Ratios & non-standard benchmarks are red flags.
• [Figure: the benchmark report plots throughput against processors & discs.]

Why are Relational Operators So Successful for Parallelism?
• The relational data model gives uniform operators on uniform data streams, closed under composition.
• Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow.
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.
• The payoff: AUTOMATIC PARALLELISM.

Database Systems "Hide" Parallelism
• Automate system management via tools
  – data placement
  – data organization (indexing)
  – periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  – duplex & failover
  – transactions
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)

Automatic Parallel OR (Object-Relational) DB
    select image
    from   landsat
    where  date between 1970 and 1990
      and  overlaps(location, :Rockies)
      and  snow_cover(image) > .7;
• The Landsat table holds (date, loc, image) rows, e.g. from 1/2/72 to 4/8/95 at locations around 33N-34N, 120W.
• Assign one process per processor/disk:
  – find the images with the right date and location (temporal and spatial tests)
  – analyze each image; if it is 70% snow, return it (image test).

Automatic Data Partitioning
• Split a SQL table across a subset of nodes & disks. Partition within the set by:
  – Range: good for equi-joins, range queries, group-by
  – Hash: good for equi-joins
  – Round robin: good to spread load
• Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.

Index Partitioning
• Hash indices partition by hash (e.g. buckets 0...9, 10...19, 20...29, 30...39, 40...).
• B-tree indices partition as a forest of trees, one tree per range (e.g. A..C, D..F, G..M, N..R, S..Z).
• The primary index clusters the data.

Secondary Index Partitioning
• In shared-nothing systems, secondary indices are problematic.
• Partition by base-table key ranges:
  – Insert: completely local (but what about uniqueness?)
  – Lookup: examines ALL the index trees
  – A unique index requires a lookup on every insert.
• Partition by secondary key ranges (the Teradata solution):
  – Insert: two nodes (base and index)
  – Lookup: two nodes (index -> base)
  – Uniqueness is easy.

Data Rivers: Split + Merge Streams
• A river is N x M data streams between N producers and M consumers.
• Producers add records to the river; consumers take records from it: purely sequential programming on both sides.
• The river does flow control and buffering, and does the partition and merge of data records.
• River = Split/Merge in Gamma = the Exchange operator in Volcano.
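To make the river idea concrete, here is a toy Python sketch, an assumption of mine rather than the Gamma or Volcano code: bounded queues play the role of the river, producers hash-split records into M queues, and consumers merge-read from them, so the producer and consumer bodies stay purely sequential.

```python
# Minimal sketch (not the Gamma/Volcano implementation) of a data "river":
# N producers split records by hash into M consumer queues; the queues do the
# buffering and flow control, so producer and consumer code stay purely sequential.
import queue
import threading

M = 3                                                    # number of consumers
rivers = [queue.Queue(maxsize=100) for _ in range(M)]    # bounded queues = flow control
DONE = object()                                          # end-of-stream marker

def producer(records):
    """Purely sequential: just emit records; the river routes them."""
    for rec in records:
        rivers[hash(rec[0]) % M].put(rec)                # split: hash-partition on the key

def consumer(i, out):
    """Purely sequential: consume whatever the river delivers (merge)."""
    while True:
        rec = rivers[i].get()
        if rec is DONE:
            break
        out.append(rec)

if __name__ == "__main__":
    outs = [[] for _ in range(M)]
    consumers = [threading.Thread(target=consumer, args=(i, outs[i])) for i in range(M)]
    for t in consumers: t.start()
    data = [(k, k * k) for k in range(20)]               # toy (key, value) records
    producers = [threading.Thread(target=producer, args=(data[j::2],)) for j in range(2)]
    for t in producers: t.start()
    for t in producers: t.join()
    for q in rivers: q.put(DONE)                         # one end-of-stream marker per queue
    for t in consumers: t.join()
    print([len(o) for o in outs])                        # records received by each consumer
```

The bounded queues provide the flow control the slide mentions: a producer that runs ahead simply blocks until its consumer catches up.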
Partitioned Execution
• Spreads computation and IO among processors.
• Example: a Count operator runs on each partition (A...E, F...J, K...N, O...S, T...Z) of a table.
• Partitioned data gives NATURAL parallelism.

N x M Way Parallelism
• Example: a join on each partition (A...E, F...J, K...N, O...S, T...Z) feeds a sort, and the sorts feed a smaller set of merges: N inputs, M outputs, no bottlenecks.
• Partitioned data plus partitioned and pipelined data flows.

Picking Data Ranges
• Disk partitioning:
  – For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
  – For hash partitioning, cool hot disks by mapping some of their buckets to other disks.
• River partitioning:
  – Use hashing and assume a uniform distribution.
  – If range partitioning, sample the data and use a histogram to level the bulk.
• Teradata, Tandem, and Oracle use these tricks.

Blocking Operators = Short Pipelines
• An operator is blocking if it produces no output until it has consumed all its input.
• Examples: sort, aggregates, hash-join (which reads all of one operand).
• The database-load template has three blocked phases: scan the tape or file and generate sort runs; merge the runs and insert the rows into the SQL table; merge runs again to insert entries into each index (index 1, index 2, index 3).
• Blocking operators kill pipeline parallelism, which makes partition parallelism all the more important.

Simple Aggregates (sort or hash?)
• Simple aggregates (count, min, max, ...) can use indices:
  – more compact
  – the index sometimes already holds the aggregate information.
• GROUP BY aggregates:
  – scan in category order if possible (use indices)
  – else, if the categories fit in RAM, use an in-RAM category hash table
  – else, make a temporary file of <category, item> pairs, sort by category, and do the math in the merge step.

Parallel Aggregates
• Each aggregate function needs a decomposition strategy:
  – count(S) = Σ count(s(i)), and likewise for sum()
  – avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  – and so on...
• For groups, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.

Parallel Sort
• M inputs, N outputs. A scan (or other source) feeds a river that is range or hash partitioned; sub-sorts generate runs, which are then merged.
• The disk pass and merge are not needed if the sort fits in memory.
• Scales "linearly": log(10^12) / log(10^6) = 2, so sorting 10^12 records instead of 10^6 costs only about 2x more comparisons per record.
• Sort is the benchmark from hell for shared-nothing machines: net traffic equals disk bandwidth, and there is no data filtering at the source.

Hash Join: Combining Two Tables
• Hash the smaller table into N buckets (hope N = 1).
• If N = 1, read the larger table and hash-probe into the smaller one.
• Else, hash both tables to disk by bucket, then do a bucket-by-bucket hash join.
• Purely sequential data behavior.
• Always beats sort-merge and nested-loops join unless the data is already clustered.
• Good for equi-joins, outer joins, and exclusion joins.
• Lots of papers; products are only just appearing (what went wrong?).
• Hashing reduces skew.

Parallel Hash Join
• ICL implemented hash join with bitmaps in the CAFS machine (1976)!
• Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully).
• We ignored them! (why?) But now everybody is doing it (or promises to).
• Hashing minimizes skew and requires little thinking about redistribution.
• Hashing uses massive main memory.
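As a minimal sketch of the idea (an assumed toy example, not any product's implementation): both tables are hash-partitioned on the join key across N workers, and each worker then joins its own pair of partitions with an ordinary in-memory hash table.

```python
# Minimal sketch (assumed example, not a product implementation) of a parallel
# equi-join: hash-partition both tables on the join key across N workers, then
# each worker does an ordinary in-memory hash join on its own partition.
from collections import defaultdict
from multiprocessing import Pool

N = 4   # number of join workers

def partition(table, key_index):
    """Split a table into N buckets by hash of the join key."""
    parts = [[] for _ in range(N)]
    for row in table:
        parts[hash(row[key_index]) % N].append(row)
    return parts

def local_hash_join(args):
    """Build a hash table on the (smaller) left partition, probe it with the right."""
    left, right = args
    index = defaultdict(list)
    for row in left:
        index[row[0]].append(row)
    return [l + r for r in right for l in index.get(r[0], [])]

if __name__ == "__main__":
    left  = [(k, f"L{k}") for k in range(10)]           # (key, payload)
    right = [(k % 10, f"R{k}") for k in range(30)]
    lparts, rparts = partition(left, 0), partition(right, 0)
    with Pool(N) as pool:                               # one process per partition
        results = pool.map(local_hash_join, zip(lparts, rparts))
    joined = [row for part in results for row in part]
    print(len(joined), "joined rows")                   # 30: every right row matches one left row
```

Because every row with a given key lands in the same partition, the per-worker joins need no further coordination; the hash redistribution also spreads the load evenly, which is the skew-minimizing property the slide notes.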
What Systems Work This Way
• Shared nothing (clients connect to processors that each own their memory and discs):
  – Teradata: 400 nodes (80x12)
  – Tandem: 110 nodes
  – IBM SP2 / DB2: 128 nodes
  – Informix / SP2: 100 nodes
  – ATT & Sybase: 8x14 nodes
• Shared disk (clients connect to processors that share the discs):
  – Oracle: 170 nodes
  – Rdb: 24 nodes
• Shared memory (clients connect to processors that share memory and discs):
  – Informix: 9 nodes
  – RedBrick: ? nodes

Outline
• Why parallelism
  – technology push
  – application pull
• Parallel database techniques
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators

Main Message
• Technology trends give
  – many processors and storage units
  – inexpensively
• To analyze large quantities of data
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (it trades time for money)
  – relational systems offer many parallel algorithms.