Parallel Database Systems: A SNAP Application

Gordon Bell                                Jim Gray
450 Old Oak Court, Los Altos, CA 94022     310 Filbert, SF, CA 94133
GBell@Microsoft.com                        Gray@Microsoft.com

Outline
• Cyberspace pep talk:
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
• Parallel imperative:
  • Hardware trend: many little devices
  • Consequence: servers are arrays of commodity components
  • PCs are the bricks of Cyberspace
  • Must automate parallel {design / operation / use}
  • Software parallelism via dataflow & data partitioning
• Parallel database techniques:
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
• Summary

Kinds of Information Processing
                  Point-to-Point           Broadcast
  Immediate       conversation, money      lecture, concert     ->  Network
  Time-shifted    mail                     book, newspaper      ->  Database
• It's ALL going electronic
• Immediate traffic is being stored for analysis (so it ALL ends up in databases)
• Analysis & automatic processing are being added

Why Put Everything in Cyberspace?
• Low rent: minimum $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
[Diagram: immediate or time-delayed, point-to-point or broadcast traffic flows through the network into the database, where it is located, processed, analyzed, and summarized]

Databases Store ALL Data Types
• The old world:
  • Millions of objects
  • 100-byte objects
  (e.g. People(Name, Address): David NY; Mike Berk; Won Austin)
• The new world:
  • Billions of objects
  • Big objects (1 MB)
  • Objects have behavior (methods)
  (e.g. People(Name, Address, Papers, Picture, Voice): the same rows, now with multimedia fields)
• The paperless office, the Library of Congress online, all information online:
  entertainment, publishing, business
• "Information Network", "Knowledge Navigator", "Information At Your Fingertips"

Magnetic Storage Is Cheaper than Paper
• File cabinet:
    cabinet (4 drawer)            250$
    paper (24,000 sheets)         250$
    space (2 x 3 ft @ 10$/ft²)    180$
    total                         700$    ->  3 ¢/sheet
• Disk:
    disk (8 GB)                 4,000$
    ASCII: 4 million pages              ->  0.1 ¢/sheet (30x cheaper than paper)
    image: 200 K pages                  ->  2 ¢/sheet
• Store everything on disk (image storage costs about the same as paper)
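As a quick back-of-the-envelope check of the ¢/sheet figures above, here is a minimal Python sketch that uses only the slide's own numbers (the variable names are ours, not from the deck):

    # Reproduce the cost-per-sheet arithmetic from the slide (1995 prices).
    cabinet, paper, space = 250.0, 250.0, 180.0      # $ for a 4-drawer file cabinet setup
    sheets_per_cabinet = 24_000
    paper_cents = 100 * (cabinet + paper + space) / sheets_per_cabinet   # ~2.9 cents

    disk_price = 4_000.0                             # $ for an 8 GB disk
    ascii_cents = 100 * disk_price / 4_000_000       # 4 million ASCII pages -> 0.1 cents
    image_cents = 100 * disk_price / 200_000         # 200 K image pages     -> 2 cents

    print(f"paper {paper_cents:.1f}, ASCII {ascii_cents:.1f}, image {image_cents:.1f} cents/sheet")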
Cyberspace Demographics
• Computer history:
    1950  national computer
    1960  corporate computer
    1970  site computer
    1980  departmental computer
    1990  personal computer
    2000  ?
• Most computers are small
  [Chart: installed base by class, roughly 1 K supercomputers/mainframes, 1 M minis/workstations, 100 M PCs]
  NEXT: 1 billion X, for some X (the phone?)
• Most of the money is in clients and wiring
  [Chart: 1990: 50% of spending on the desktop; 1995: 75%]

Billions of Clients
• Every device will be "intelligent"
• Doors, rooms, cars, ...
• Computing will be ubiquitous

Billions of Clients Need Millions of Servers
• All clients are networked to servers; clients may be nomadic or on-demand
• Fast clients want faster servers
• Servers provide data, control, coordination, communication
• Super servers: large databases, high traffic, shared data
[Diagram: mobile and fixed clients -> servers -> super server]

Outline
• Cyberspace pep talk:
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
• Parallel imperative:
  • Hardware trend: many little devices
  • Consequence: server arrays of commodity parts
  • PCs are the bricks of Cyberspace
  • Must automate parallel {design / operation / use}
  • Software parallelism via dataflow & data partitioning
• Parallel database techniques:
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
• Summary

Moore's Law Restated: Many Little Won over Few Big
• Hardware trend: a few generic parts
  • CPU, RAM, disk & tape arrays
  • ATM for the LAN/WAN, ?? for the CAN, ?? for the OS
  [Chart: price per box shrinking from mainframe ~1 M$ to mini ~100 K$ to micro ~10 K$ to nano; disk form factors 9" -> 5.25" -> 3.5" -> 2.5" -> 1.8"]
• These parts will be inexpensive (commodity components)
• Systems will be arrays of these parts
• Software challenge: how to program the arrays

Future SuperServer
• An array of processors, disks, tapes, and comm lines:
    100 nodes            = 1 TIPS
    1,000 discs          = 10 Terrorbytes
    100 tape transports  = 1,000 tapes = 1 PetaByte
    high-speed network (10 Gb/s)
• Challenge: how to program it
• Must use parallelism:
    pipeline   -> hide latency
    partition  -> bandwidth, scaleup

The Hardware Is in Place ... and Then a Miracle Occurs?
• SNAP: Scaleable Network And Platforms
  • a commodity distributed OS
  • built on commodity platforms
  • and a commodity network interconnect

Why Parallel Access to Data?
• At 10 MB/s it takes 1.2 days to scan a 1 Terabyte table
• With 1,000-way parallelism: a 100-second (about 1.7 minute) scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel

DataFlow Programming: Prefetch & Postwrite Hide Latency
• Can't wait for the data to arrive
• Need a memory system that gets the data in advance (~100 MB/s)
• Solution:
  • pipeline data from the source (tape, disc, RAM, ...) to the CPU cache
  • pipeline results to the destination

The New Law of Computing
• Grosch's Law: 2x the money buys 4x the performance
    (1 MIPS for 1$; 1,000 MIPS for 32$, i.e. 0.03$/MIPS)
• Parallel Law: 2x the money buys 2x the performance
    (1 MIPS for 1$; 1,000 MIPS for 1,000$)
• Requires linear speedup and linear scaleup; not always possible

Parallelism: Performance Is the Goal
• The goal is to get 'good' performance
• Law 1: the parallel system should be faster than the serial system
• Law 2: the parallel system should give near-linear scaleup, near-linear speedup, or both
• Parallelism is faster, not cheaper: it trades money for time

The Perils of Parallelism
[Chart: speedup vs. processors & discs, showing linearity, a curve with no parallelism benefit, and a bad speedup curve shaped by the three perils]
• Startup: creating processes, opening files, optimization
• Interference: device contention (CPU, disc, bus) and logical contention (locks, hotspots, server, log, ...)
• Skew: if tasks get very small, variance exceeds the service time
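The bad speedup curve can be made concrete with a toy cost model (an illustrative sketch; every constant below is an assumption, not a measurement from the deck): pay a startup cost per worker, wait for the most skewed partition, and inflate the remaining work a little for each extra worker's interference.

    # Toy model of the three perils: startup, interference, and skew.
    def speedup(n, work=1000.0, startup=0.5, interference=0.01, skew=0.2):
        serial = work                                          # one worker, no overheads
        slowest = (work / n) * (1 + skew) if n > 1 else work   # wait for the worst partition
        parallel = startup * n + slowest * (1 + interference * (n - 1))
        return serial / parallel

    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(f"{n:4d} workers -> speedup {speedup(n):6.1f}")

With these made-up constants the curve rises to roughly 16x at 64 workers and then declines, which is exactly the shape the slide warns about.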
Kinds of Parallel Execution
• Pipeline: any sequential program feeds its output stream to the next sequential program
• Partition: inputs merge M ways, outputs split N ways; each partition runs an unmodified sequential program
[Diagram: two "any sequential program" boxes connected in a pipeline; several copies of "any sequential program" running on partitions, with split and merge]

Data Rivers: Split + Merge Streams
• N producers, M consumers, N x M data streams
• Producers add records to the river; consumers take records from the river
• Purely sequential programming: the river does flow control and buffering,
  and does the partitioning and merging of the data records
• River = the Exchange operator in Volcano

Partitioned Data and Execution
• Spreads computation and IO among the processors
[Diagram: a table partitioned A...E, F...J, K...N, O...S, T...Z, with a Count operator on each partition feeding a combining Count]
• Partitioned data gives NATURAL execution parallelism

Partitioned + Merge + Pipeline Execution
[Diagram: a Join piped into a Sort on each partition A...E ... T...Z, all feeding a single Merge]
• Pure dataflow programming
• Gives linear speedup & scaleup
• But the top node may be a bottleneck, so ...

N x M Way Parallelism
[Diagram: the same per-partition Join -> Sort plans, now feeding M Merge operators]
• N inputs, M outputs, no bottlenecks

Why Are Relational Operators So Successful for Parallelism?
• The relational data model: uniform operators on uniform data streams, closed under composition
• Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data
• Sequential data in and out: pure dataflow
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
• The payoff: AUTOMATIC PARALLELISM

SQL: a Non-Procedural Programming Language
• SQL is a functional programming language: it describes the answer set
• The optimizer picks the best execution plan:
  • the dataflow web (pipelines)
  • the degree of parallelism (partitioning)
  • other execution parameters (process placement, memory, ...)
[Diagram: GUI and schema feed the optimizer; the plan goes to the executors, which are connected by rivers; execution is planned and monitored]

Database Systems "Hide" Parallelism
• Automate system management via tools
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  • duplex & failover
  • transactions
• Automatic parallelism
  • among transactions (locking)
  • within a transaction (parallel execution)

Success Stories
• Online transaction processing: many little jobs
  • SQL systems support 3,700 tps-A (on 24 CPUs, 240 disks)
  • SQL systems support 21,000 tpm-C (on 110 CPUs, 800 disks)
• Batch (decision support and utility): a few big jobs, parallelism inside each
  • Scan data at 100 MB/s
  • Linear scaleup to 50 processors

Kinds of Partitioned Data
• Split a SQL table across a subset of the nodes & disks, then partition within that set:
  • Range (A...E | F...J | K...N | O...S | T...Z): good for equijoins, range queries, group-by
  • Hash: good for equijoins
  • Round robin: good for spreading load
• Shared-disk and shared-memory systems are less sensitive to partitioning
• Shared-nothing systems benefit from "good" partitioning
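As a concrete illustration of the three schemes, here is a minimal Python sketch (the five-way A-Z split, the key shape, and the function names are assumptions made for illustration, not any product's API):

    # Illustrative 5-way partitioning functions on a character key.
    RANGE_HI = ["E", "J", "N", "S", "Z"]      # partition i holds keys starting <= RANGE_HI[i]

    def range_partition(key: str) -> int:
        """Good for equijoins, range queries, and group-by on the key."""
        first = key[:1].upper()
        return next(i for i, hi in enumerate(RANGE_HI) if first <= hi)

    def hash_partition(key: str, n: int = 5) -> int:
        """Good for equijoins; spreads skewed key ranges evenly."""
        return hash(key) % n                  # Python string hashing is randomized per process

    counter = 0
    def round_robin_partition(n: int = 5) -> int:
        """Ignores the key; good for spreading load evenly."""
        global counter
        counter = (counter + 1) % n
        return counter

    for name in ("Adams", "Gray", "Bell", "Stonebraker"):
        print(name, range_partition(name), hash_partition(name), round_robin_partition())

A shared-nothing system would route each record to the node chosen by one of these functions; the choice determines which later operators (joins, group-bys, range scans) can run without moving data.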
Picking Data Ranges
• Disk partitioning:
  • For range partitioning: sample the load on the disks; cool hot disks by making their ranges smaller
  • For hash partitioning: cool hot disks by mapping some of their buckets to other disks
• River partitioning:
  • Use hashing and assume a uniform distribution
  • If range partitioning: sample the data and use a histogram to level the bulk
• Teradata, Tandem, and Oracle use these tricks

Parallel Data Scan

    select image
    from   landsat
    where  date between 1970 and 1990
      and  overlaps(location, :Rockies)
      and  snow_cover(image) > .7;

• The landsat table (date, location, image) has temporal, spatial, and image attributes
  (rows from 1/2/72 through 4/8/95, locations such as 33N 120W ... 34N 120W)
• Assign one process per processor/disk:
  • find the images with the right date & location
  • analyze each image; if it is 70% snow-covered, return it
• The answer: the date, location & image of each qualifying row

Parallel Aggregates
• Each aggregate function needs a decomposition strategy:
  • count(S) = Σ count(s(i)); ditto for sum()
  • avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  • and so on ...
• For groups:
  • sub-aggregate the groups close to the source
  • drop the sub-aggregates into a hash river
[Diagram: a Count on each partition A...E ... T...Z of the table, feeding a combining Count]

Parallel Sort
• M inputs, N outputs
• Design:
  • the river is range- or hash-partitioned, fed by the scan (or other source)
  • sub-sorts generate runs; merge the runs
  • the disk and merge passes are not needed if the sort fits in memory
• Scales nearly linearly: log(10^12) = 2 x log(10^6), so each record of a terabyte sort
  costs only about 2x the comparisons of a megabyte sort (=> only 2x slower per record)
• Sort is the benchmark from hell for shared-nothing machines:
  network traffic = disk bandwidth, and there is no data filtering at the source

Blocking Operators = Short Pipelines
• An operator is blocking if it produces no output until it has consumed all of its input
• Examples: sort, aggregates, hash-join (which reads all of one operand)
[Diagram: the database-load template has three blocked phases: scan from tape/file/SQL table,
 sort runs, merge runs, table insert; then sort runs / merge runs / index insert for each of index 1, 2, 3]
• Blocking operators kill pipeline parallelism
• They make partition parallelism all the more important

Hash Join
• Hash the smaller table into N buckets (hope N = 1)
• If N = 1: read the larger table and hash-probe it against the smaller one
• Else: hash the outer table to disk, then do a bucket-by-bucket hash join
• Purely sequential data behavior
• Always beats sort-merge and nested-loops join unless the data is already clustered
• Good for equi-, outer-, and exclusion joins
• Hashing reduces skew
• Lots of papers; products are just appearing (what went wrong?)
[Diagram: the left table hashed into buckets, probed by the right table]
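A minimal in-memory sketch of the build/probe idea, i.e. the N = 1 case where the smaller table fits in memory (the hash_join helper, record shapes, and sample rows are made up for illustration, not any product's code):

    # Simplified hash join: build a hash table on the smaller input, probe with the larger.
    from collections import defaultdict

    def hash_join(build_rows, probe_rows, build_key, probe_key):
        buckets = defaultdict(list)
        for b in build_rows:                     # build phase: hash the smaller table
            buckets[build_key(b)].append(b)
        for p in probe_rows:                     # probe phase: stream the larger table past it
            for b in buckets.get(probe_key(p), []):
                yield (b, p)

    people = [("Gray", "SF"), ("Bell", "Los Altos")]
    orders = [("Gray", "disk array"), ("Gray", "tape robot"), ("Won", "ATM switch")]
    for pair in hash_join(people, orders, lambda r: r[0], lambda r: r[0]):
        print(pair)

When the build side does not fit (N > 1), both inputs would first be hash-partitioned to disk into N matching bucket pairs and each pair joined this way, which is the bucket-by-bucket case the slide describes.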
Observation: Execution Is "Easy", Automation Is "Hard"
• It is "easy" to build a fast parallel execution environment
  (no one has done it, but it is just programming)
• It is hard to write a robust, world-class query optimizer:
  • there are many tricks
  • one quickly hits the complexity barrier
• Common approach:
  • pick the best sequential plan
  • pick the degree of parallelism based on a bottleneck analysis
  • bind operators to processes
  • place processes at nodes
  • place scratch files near the processes
  • use memory as a constraint

Systems That Work This Way
• Shared nothing:
    Teradata          400 nodes
    Tandem            110 nodes
    IBM SP2 / DB2      48 nodes
    AT&T & Sybase     112 nodes
    Informix / SP2     48 nodes
• Shared disk:
    Oracle            170 nodes
    Rdb                24 nodes
• Shared memory:
    Informix            9 nodes
    RedBrick            ? nodes
[Diagrams: clients attached to shared-nothing, shared-disk, and shared-memory arrangements of processors and memory]

Research Problems
• Automatic data placement (partitioning: random or organized)
• Automatic parallel programming (process placement)
• Parallel concepts, algorithms & tools
• Parallel query optimization
• Execution techniques: load balancing, checkpoint/restart, pacing
[Figure: the future SuperServer again: 100 nodes = 1 TIPS, 1,000 discs = 10 Terrorbytes,
 100 tape transports = 1,000 tapes = 1 PetaByte, 10 Gb/s network]

Summary
• Cyberspace is growing
  • Databases are the dirt of Cyberspace, PCs are the bricks, networks are the mortar
  • Many little devices: performance via arrays of {cpu, disk, tape}
• Then a miracle occurs: a scaleable distributed OS and network
  • SNAP: Scaleable Networks and Platforms
• Then parallel database systems supply the software parallelism:
  • OLTP: lots of little jobs run in parallel
  • Batch TP: dataflow & data partitioning
  • Automate processor & storage array administration
  • Automate processor & storage array programming
  • 2000 platforms as easy as 1 platform