Parallel and Distributed Data Processing
COMPSCI 445

When one system isn’t enough
Too much data to query
  • queries take too long with one processor
  • more queries are required than can be handled with one processor
Too much data to ingest
  • need to handle lots of inserts/updates
Latency is too high
  • round trip to the other side of the earth is at least 130 ms
  • big geographical distance => high latency

When one system isn’t enough
Scale up (vertical)
  • put the DB on a bigger and more powerful system
  • hardware limits, and expensive
Scale out (horizontal)
  • add more, smaller systems and connect them together
  • make sure these small systems can communicate with each other

Parallel vs. Distributed
For the purpose of this lecture we’ll make the following distinction:
Parallel DBs
  • tightly clustered processors connected with high-speed links
  • still centralized, just a centralized cluster
  • complicated to add/remove ‘nodes’
Distributed DBs
  • processors spread out over a larger area over slower links
  • easier to add/remove machines

Parallel Architectures
Shared Disk
  • processors are independent but share storage
  • storage bandwidth is a potential bottleneck
  • not too common anymore
Shared Memory
  • processors are independent but share memory (and disk storage)
  • modern multi-core systems are basically this
Shared Nothing
  • completely independent systems connected only by network
  • most common parallel DB architecture

Shared-Memory Architecture
[figure]

Server Architecture
Note: a practical shared-memory architecture
Multiple CPUs
  • each CPU has multiple cores
  • can run multiple processes and threads in parallel
  • they share the last-level cache
Non-Uniform Memory Access
  • each CPU has a local memory -> faster
  • a CPU can also access the local memory of other CPUs -> slower
Shared I/O controllers

Pros and Cons
Pros (fast, avoids a CPU bottleneck)
  • sharing data in memory is very fast
  • favor access to local memory whenever possible so that the inter-CPU bus is not a bottleneck
Cons (not easy to scale)
  • limited scalability: the maximum number of CPUs per server and cores per CPU is bounded
  • memory coordination between CPUs gets more expensive with more CPUs

Shared-Nothing Architecture
Note: scales much more easily
Independent servers connected by network
No shared memory across servers
  • sharing data requires communication over the network
Can scale to thousands of processors
Remote Direct Memory Access (RDMA)
  • provides a very low-latency shared-memory abstraction on shared-nothing systems

Partitioning
Note: because nothing is shared, each node holds a slice of the whole data
For shared-nothing, inter-node communication is costly and needs to be minimized
Spread data across nodes by partitioning tables
  • horizontal partitioning - split tables by rows
  • vertical partitioning - split tables by columns
Partitioning methods
  • Round-robin: rows are distributed equally, assigned in cyclic order
  • Range partitioning: rows are distributed based on the range a particular value falls in
  • Hash partitioning: like range partitioning, but on the hash of a value instead of a range of values
Goal: split the data such that most of the work can be done locally

Partitioning
Round-robin partitioning
  • spread rows from a table by iterating through nodes round-robin
  • can’t change the number of nodes without repartitioning everything
Range partitioning
  • divide tables by value (e.g. 0 < col1 < 10 is partition0, …)
  • can suffer from skew
Hash partitioning
  • hash values into buckets
Both range and hash partitioning let you specify partitions independently of node count

Partitioning
Horizontal Partitioning
  • SQL DBs are usually row oriented, so this is natural
  • inefficient for queries that project away many columns
  • takes up more storage
Vertical Partitioning
  • compact (compression works better on homogeneous columns)
  • efficient for queries that project away other columns
  • bad if rows have to be reconstructed across multiple nodes

How to Parallelize Operators
Consider a shared-nothing cluster of servers
Suppose a table is horizontally partitioned across multiple servers
  • each server keeps a partition of rows
How to run operators in parallel on these tables?
  • start with a simple query

Parallelize simple SELECT
  SELECT S.sid FROM Sailors S WHERE S.age > 40

  node 1: p0, p4, p8, p12
  node 2: p1, p5, p9, p13
  node 3: p2, p6, p10, p14
  node 4: p3, p7, p11, p15

Assume the table is partitioned 16 ways (each node holds 4 partitions)
Each node runs the select locally on its local partitions and ships the results to be concatenated
  • no ordering or joining => results can be streamed
Who is responsible for concatenating results?
  • Heavyweight/smart client
  • Cooperative server(s)
  • Middleware

Heavyweight/Smart Client
Note: the client does all the work, but needs information about the nodes, ... from the system
Client issues the query to each node
Client combines + processes results
Client has to know how many nodes there are and where they are
  • also needs to know which partitions they host
Clients are complicated
HBase works this way
  • client library manages query distribution
[diagram: the client sends SELECT S.sid FROM Sailors S WHERE S.age > 40 to nodes 1-4, partitioned as above]

Cooperative Server
Note: a coordinating node (whichever node the client connects to) helps the client do the job
Client connects to whichever node
Connected node is responsible for managing distribution of the query and combining results
Client can be the same as for a normal DB
Higher load on the connected node
  • without load balancing, hot spots are likely
ElasticSearch works like this
[diagram: the client sends the query to one of nodes 1-4, which becomes a possible hotspot]

Middleware
Note: a load balancer handles things for the client; since it sits outside the database, it avoids the hotspots and overload on database nodes seen in the cooperative-server approach
Special node(s) are responsible for coordinating queries
Client connects to the special node
  • handles load balancing: can avoid hotspotting
  • handles node management: failure handling/failover can happen here
Vitess works this way
[diagram: the client sends the query to a load balancer/node manager in front of nodes 1-4]

Parallel Complex Operators
What if a query has JOINs? Or sorts (e.g. ORDER BY or GROUP BY)?
The nodes will have to share data over the network
  • efficiency is now about minimizing the sum of disk I/O and network I/O
Let’s start with sorting
We can do merge sort in parallel
  • each node sorts locally and then we merge the sorted partitions
We can exploit range partitioning to avoid merges

Range-Partitioning Sort
Note: first distribute the data into the nodes by range
Redistribute the relation using range partitioning
Each node Ni sorts its partition of the relation locally
Final merge operation is trivial: range partitioning ensures that, if i < j, all key values in node Ni are less than all key values in Nj
  • because the values are already partitioned by range, we don’t need any merging
  • we can just stream the results in partition order

Parallel External Sort-Merge
Each node Ni locally sorts the data
The sorted runs on each node are then merged in parallel:
  • the sorted partitions at each node Ni are range-partitioned across the processors N1, ..., Nm
  • each node Ni merges the streams as received
  • the sorted runs on nodes N1, ..., Nm are concatenated to get the final result
  • avoids having a ‘tree’ of merges
Can suffer from execution skew
  • All nodes send to node 1, then all nodes send data to node 2, ...
  • Only one receiver gets data and computes
  • Can be modified so each node sends data to all other nodes in parallel (block at a time)

Parallel Sort
[figure]

Parallel JOINs
Both tables are partitioned across all nodes
  • let’s assume that the partitioning is unrelated to the JOIN criteria
For equi-joins and natural joins (equality-based)
  • partition the two relations across the nodes on their join attributes
    • each node ‘hosts’ a partition of each relation (so each node contains records of both relations)
    • can be hash or range-based partitioning
    • both relations must be partitioned using the same hash function or range vector
  • compute the join locally at each node
    • using any join algorithm
Note: the JOIN happens locally at each node

Partitioned Parallel Join
[figure]

Exchange Operator
Note: a special operator for merging data, routing it to nodes, ... in parallel/distributed execution
The exchange operator expresses the partition + merge data movement
All other operators can be sequential

Exchange Operator
Partitioning of data can be done by
  • hash partitioning
  • range partitioning
  • replicating data to all servers (broadcasting)
  • sending all data to a single server
Destination nodes can receive data from multiple source nodes.
Incoming data can be merged by:
  • random merge
  • ordered merge

Example Query
Tables: r(A,C), s(B,D)
Query:
  SELECT r.C, s.D, count(*)
  FROM r, s
  WHERE r.A = s.B
  GROUP BY r.C, s.D

Parallel Plans
Query:
  SELECT r.C, s.D, count(*)
  FROM r, s
  WHERE r.A = s.B
  GROUP BY r.C, s.D
[figure]

Parallel Plan
Note: explanation of the diagram above
r and s are partitioned on r.C and s.D, respectively
E1 and E2: exchange operators
  • repartition r (resp. s) using join attribute r.A (resp. s.B)
Loc. Part.: partitioning step of hash join
  • each server locally partitions the tuples it receives so that the partitions of the outer table fit in memory
HJ: hash join, local at each server
E3: exchange operator
  • repartition by aggregation attributes (r.C, s.D)
HA: local hash-based aggregation
E4: collect all results to one server

Alternative Parallel Plan
[figure: original plan and plan with partial aggregation]
HA1: each server computes partial aggregates for all values of (r.C, s.D) on all local tuples
  • i.e. the aggregate is computed locally for all values of (r.C, s.D)
HA2: aggregates all partial aggregates for a specific value of (r.C, s.D)
  • after the exchange operator, the data is aggregated once more, but per specific value
Sends less data, but breaks the pipeline
  • each server now must materialize all its partial aggregates

Pipelining Over Network
Avoids writing intermediate values to disk
Buffer between consumer and producer
  • producer pushes tuples onto the buffer if it is not full
Can batch tuples to reduce the number of messages

Distributed DBs
Parallel DB stuff has all been about performance
  • processors closely connected w/ a fast network
  • no discussion of failure
  • no discussion of dynamic scaling
Distributed DBs extend the discussion to include
  • failure
  • databases separated by slow links (e.g. one node in Tokyo, one in New York)
  • scale changes

Why distribute your data?
Availability
  • if your data is on more than one node, then it’s still available even if a node has failed
Latency
  • if the New York office had to send all queries to the Tokyo datacenter, then every query would take 130 ms minimum
Capacity
  • individual DBs are too small to contain it
Cost Efficiency
  • scale up when load is up, scale down when load drops

Distributed considerations
The network is slower than in the parallel DB cases
  • communication is expensive
The network is unreliable
  • node independence is important
Nodes may fail
  • but data must not be lost!

Distributed problems
Naming things
  • how do we guarantee unique names?
Replicating data (back up)
  • how do we make sure that data is safe, even if nodes fail?
Distributed queries and transactions
  • how do you do concurrency control between multiple machines?
Adding/Removing nodes
  • how do we change the configuration of the system over time?

Naming things
Lots of things require unique names
  • DB objects (tables, indices, etc.)
  • rows (rids)
For high-frequency changes (e.g. rids) we can’t require full-system consensus
Some solutions:
  • pre-assigned globally unique prefix
    • each name is a tuple of (prefix, local name)
  • randomly generated names (e.g. UUIDs)
    • not completely guaranteed to be unique, but highly unlikely to collide

Replicating Data
Data can’t just be on a single node
  • that node could fail, and the data would be lost
Replication pushes data out to other nodes
Synchronous vs. Asynchronous
  • synchronous - updates are not considered durable until *enough* copies are durable (an update is done only when it has been sent to enough replicas)
  • asynchronous - updates are sent more efficiently, but replicas may see inconsistent values (updates are sent asynchronously, so values can be inconsistent)
Primary/Secondary vs. Peer-to-peer (or multi-master)
  • primary/secondary - the primary is the only node where updates may occur (the only one that can write)
  • peer-to-peer - multiple nodes may permit updates

Synchronous Replication
Note: sacrifices speed for data consistency
Synchronous replication ensures that all txns see the same value regardless of which replica they read.
  • voting - txns must write to a majority of replicas before being durable, and read from a majority before ‘knowing’ which version is the latest
  • read-any/write-all - txns can read any copy, but on update must write to all copies before considering the update durable
Pros:
  • txns have the same semantics as on a single node
Cons:
  • writes are much slower
  • writes aren’t fault tolerant (with write-all, node failures halt txns)
  • locks and commits require messages and waiting

Asynchronous Replication
Note: changes are written to a buffer and pushed out after a while => if multiple changes are made to the same record, we may only need to push the last change
Asynchronous replication is much faster
  • txns write and commit locally
  • changes are pushed out when efficient (‘pushed out’ here means sent to other nodes)
Capture + Apply
Capture gathers changes at the primary and sends them to the secondary sites
  • Log-shipping: the tail of the WAL is sent (i.e. the most recent entries)
  • Snapshotting: copies of the page(s) are sent periodically (a periodic update, not necessarily the last entries)
Apply runs on the secondary sites and applies the changes to their local table copies
  • REDO log entries for log-shipping
  • overwrite pages for snapshotting

Primary/Secondary vs. Peer-to-Peer
Primary/Secondary
  • only one node may perform updates (the primary)
  • all concurrency control can stay on the primary
  • primary becomes a bottleneck
Peer-to-peer
  • any node may update
  • update conflicts may occur
    • e.g. one node runs UPDATE SET salary = 40000 and another runs UPDATE SET salary = 35000
  • much rarer in practice

Distributed Transactions
A transaction may read and modify tuples stored at many nodes
  • locks have to be granted at many nodes
  • deadlock detection has to work across nodes
  • recovery has to work across multiple nodes
Distributed locking
  • Centralized - one node handles all locking (bottleneck, central point of failure)
  • Primary Copy - each object has a ‘primary’ copy node that handles locking for that object (latency)
  • Distributed - locks can be granted at any node (read locks are local, writes lock all copies, i.e. write locks must go to all replicas)
*of course, you could use a non-locking concurrency control mechanism

Distributed Recovery
Transaction atomicity and durability must be guaranteed
  • multiple nodes must coordinate
  • need a commit protocol
Two Phase Commit (2PC)
Note: ensures atomicity + recoverability across multiple nodes
  • a coordinator node starts w/ a prepare message
  • each node decides abort/commit
    • writes abort/prepare to log, sends no/yes back to coord
  • if all responses are yes, coordinator sends force commit to all nodes
    • any no causes an abort
  • node writes commit to log
  • sends ack to coord
  • when all acks received, coord finishes commit

Adding and Removing Nodes
Removing nodes
  • deal with failures (clients shouldn’t connect to failed nodes)
  • replace hardware/upgrade systems
  • scale down to save money
Adding nodes
  • deal with failures - replace hard-failed nodes with new ones
  • re-introduce upgraded nodes to system
  • scale up to handle load
  • rebalancing partitions (re-sizing partitions)

Adding a Fresh node
Replacing a failed node or scaling up the system
Need to populate it with partitions
  • copying can take time (esp. if it’s on the other side of the world)
  • need to also capture ongoing changes
  • remember that all this I/O can itself cause failures
Manager starts partition copying AND also sets the node up as a follower of the partition
  • gets updates but isn’t part of the commit protocol
When the copy is done, the node applies the saved log
Note: copying has 3 phases to ensure the node catches up

Adding a Fresh node
All this I/O can take a while
  • puts load on the sender and receiver
  • the longer it takes, the more likely it is to be interrupted
In practice, rather than a few large partitions, systems have many more smaller partitions
  • take less time per partition
    • less opportunity for failure
    • less of a log to replay
  • can be done in parallel
    • multiple nodes can send their partitions to the new node
For scalable distributed systems, time to recovery is an important metric

Vitess - Shared-nothing MySQL
YouTube ran with a single MySQL instance
  • until traffic was so high it kept crashing
So, a read replica was created to take off some of the read load
  • until read traffic was crashing the replica
So, multiple read replicas were created
  • but then write traffic started crashing the primary
So, the tables were manually partitioned (Vitess calls them ‘shards’)
  • application logic needed to understand the sharding scheme and handle aggregation (they were ‘heavyweight clients’)

Vitess - Shared-nothing MySQL
Complexity was pushed back into the data layer
Clients would make normal MySQL queries to a middleware proxy, the ‘VTGate’
  • implemented in Go
  • acts as a load balancer: has access to information about the shards and controls shard activity

Vitess - Shared-nothing MySQL
Beneath the VTGates is a collection of MySQL instances
  • each one is vanilla, stock MySQL
Each DB is hosted by multiple MySQL instances (replication at statement level by VTGate)
Tables may be sharded
  • shard instances are VTTablets
  • hosted on multiple MySQL instances
  • partition/merge handled by VTGate
  • sharding specified manually (w/ DDL)

Vitess - Shared-nothing MySQL
Multiple nodes host any VTTablet
  • load-balancing
  • failure-handling
A failed MySQL node can be compensated for by the VTGate routing requests to the other replicas
The control plane handles starting new MySQL instances
  • a new instance can have VTTablets provisioned to it incrementally

Vitess - Shared-nothing MySQL
Started in 2010
Open-source implementation (vitess.io, github.com/vitessio/vitess)
Gradually increasing subset of MySQL syntax that VTGate supports
  • limited JOIN support (but getting better all the time)

DBs in the Cloud
The cloud is where many new applications are hosted
It’s structured differently from how things were when DBMSs were first created.

A typical DB setup ca. 1980
[diagram: terminals connected to the server, which runs both the app and the DB]
App and DB run on centralized hardware (big iron)
Clients are dumb terminals connected to the server (usually via phone lines)
Static, unchanging. Adding a new terminal takes time

A typical DB setup ca. 1995
[diagram: clients connected to an app server and a DB server]
App and DB still run on centralized hardware
Clients are PCs connected via ethernet
  • clients run app-specific software
Static, adding clients is infrequent and scaling can be managed over months/years

A typical cloud DB setup ca. 2015
[diagram: clients reach a cloud load balancer over the internet; behind it are web servers, application services (Service X, DB, …) and distributed storage]
App is served by a collection of web servers
  • scale up/down rapidly (minutes, seconds)
Clients are web browsers (PCs, phones, etc.)
  • load can vary widely
DB is not on curated, expert-managed hardware

DBs in the Cloud
The ‘cloud’ is what you get with things like AWS
Lots of hardware resources
  • scattered around the globe
  • clustered into ‘zones’ inside individual data centers
Everything is done with VMs/containers
  • storage may be local or remote
  • lots of automatic management of stuff via things such as Kubernetes
An interconnected collection of services
  • not a monolithic, hand-made application stack running on every node

Cloud Architectures: Computing
Explicitly managed instances
  • e.g. AWS EC2
  • VM images curated by customer
  • VM space ‘rented’ to run image
Orchestrated environments to run containers
  • e.g. Kubernetes
  • smaller single-purpose containers
  • described in configurations, run as needed
Serverless functions (FaaS, functions as a service)
  • e.g. AWS Lambda
  • just upload code and configure ‘triggers’
  • system allocates compute on the fly to run your function

Cloud Architectures: Storage Services
Object Stores
  • e.g. AWS S3
  • key/value for large objects (e.g. files)
  • internally replicated and managed, pay by the byte
Distributed Disk storage (looks like a disk but is not actually a disk)
  • e.g. EBS
  • replicated disk images you can attach to VMs/containers
Key/Value stores
  • e.g. Dynamo, BigTable
  • distributed table storage
  • can only query by key or key range
  • key structure is main focus of design

Cloud Architectures: Everything else
Load balancing
Log management
Messaging
Monitoring
…
X-as-a-Service

Storage Disaggregation
Physically separate compute and storage servers
  • storage services offer a limited read/write interface
  • they run on resource-limited storage servers
Advantages for cloud provider
  • provision storage and computation independently
Advantages for users
  • storage services are cheap
  • ideally, no performance overhead, since network bandwidth can be (but not always is) similar to local I/O bandwidth
This storage ‘works differently’ from local disks and things pretending to be local disks
  • DB designs may be sub-optimal

Example: AWS Storage Services
[figure]

AWS Aurora - cloud-friendly DBMS
DB design often assumes ‘local’ disks
  • even disk cabinets were emulating local disks
DBs often assume that the disks are ‘dumb’
  • intricately managed replication
  • sync policies may not work with cloud storage
Aurora:
  • primary/secondary style replication
  • but secondaries can be scaled up/down quickly
  • storage is all cloud aware and internally replicated
  • replication from primary is sent direct to storage

AWS Aurora - cloud-friendly DBMS
Logs are shipped to volumes directly and applied
Storage engine of DBMS is replaced with cloud storage-aware system
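
Worked sketch: partitioning methods and the parallel SELECT
The three partitioning methods from the Partitioning slides, and the simple parallel SELECT on Sailors, can be sketched in a few lines of Python. This is only an illustration under assumptions: nodes are modelled as in-memory lists, the function names and the sample Sailors rows are made up for the example, and CRC32 stands in for whatever hash function a real system would use.

```python
import zlib
from bisect import bisect_right


def round_robin_partition(rows, num_nodes):
    """Assign row i to node i mod num_nodes (cyclic order)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Sorted boundaries form the range vector: node i gets the rows whose
    key lies in [boundaries[i-1], boundaries[i])."""
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def hash_partition(rows, key, num_nodes):
    """Hash the key value into one of num_nodes buckets (CRC32 is an
    assumption made for this sketch)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[bucket].append(row)
    return nodes


def parallel_select(nodes, predicate, project):
    """Each node filters its local rows; the results are concatenated.
    No ordering or joining is needed, so results could be streamed."""
    return [project(r) for node in nodes for r in node if predicate(r)]


# Hypothetical sample data: 12 sailors with ages 20, 23, ..., 53.
sailors = [{"sid": i, "age": 20 + 3 * i} for i in range(12)]

rr = round_robin_partition(sailors, 4)
rg = range_partition(sailors, "age", [30, 40, 50])
hs = hash_partition(sailors, "sid", 4)

# SELECT S.sid FROM Sailors S WHERE S.age > 40
sids = parallel_select(rr, lambda r: r["age"] > 40, lambda r: r["sid"])
print(sorted(sids))
print([len(p) for p in rg])  # uneven sizes illustrate range skew
```

Round-robin gives perfectly even partitions, while the range partitions come out uneven because the boundaries do not match the data distribution, which is the skew problem mentioned on the slide.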
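
Worked sketch: range-partitioning sort
The Range-Partitioning Sort slide can be illustrated as follows. This is a minimal single-process sketch, assuming nodes are plain Python lists and the range vector is given up front; `range_partition_sort` is a name chosen for the example.

```python
from bisect import bisect_right


def range_partition_sort(node_data, boundaries):
    """Redistribute values by range, sort each partition locally, then
    concatenate the partitions in order (no merge step is needed)."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    # Redistribution: every node sends each value to the node owning its range.
    for local_values in node_data:
        for v in local_values:
            parts[bisect_right(boundaries, v)].append(v)
    # Local sort at each node Ni.
    sorted_parts = [sorted(p) for p in parts]
    # If i < j, every key on Ni is smaller than every key on Nj, so the
    # final "merge" is just streaming the partitions in order.
    return [v for p in sorted_parts for v in p]


# Hypothetical starting distribution of values over three nodes.
nodes = [[42, 7, 19], [3, 88, 25], [61, 14, 50]]
result = range_partition_sort(nodes, [20, 50])
print(result)
```

The choice of boundaries only affects how evenly the work is spread, not the correctness of the result.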
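
Worked sketch: parallel plan with partial aggregation
The plan from the Parallel Plan and Alternative Parallel Plan slides (E1/E2 repartition on the join attribute, local hash join, HA1 partial aggregation, E3 repartition on the grouping attributes, HA2 final aggregation, E4 collect) can be simulated in one process. This is a sketch under assumptions: servers are lists, Python's built-in `hash` stands in for the partitioning hash function, and the tiny r and s tables are invented for the example.

```python
from collections import Counter


def exchange(partitions, key_fn, num_nodes):
    """Exchange operator: route every tuple to node hash(key) mod num_nodes."""
    out = [[] for _ in range(num_nodes)]
    for part in partitions:
        for t in part:
            out[hash(key_fn(t)) % num_nodes].append(t)
    return out


def local_hash_join(r_part, s_part):
    """Build a hash table on r.A, probe it with s.B, emit (r.C, s.D)."""
    table = {}
    for a, c in r_part:
        table.setdefault(a, []).append(c)
    return [(c, d) for b, d in s_part for c in table.get(b, [])]


def parallel_plan(r_parts, s_parts, n):
    r_by_a = exchange(r_parts, lambda t: t[0], n)           # E1
    s_by_b = exchange(s_parts, lambda t: t[0], n)           # E2
    joined = [local_hash_join(r_by_a[i], s_by_b[i]) for i in range(n)]  # HJ
    partial = [list(Counter(j).items()) for j in joined]    # HA1
    regrouped = exchange(partial, lambda kv: kv[0], n)      # E3
    final = Counter()
    for part in regrouped:
        local = Counter()                                   # HA2 on one server
        for group, count in part:
            local[group] += count
        final.update(local)                                 # E4: collect
    return dict(final)


# Hypothetical data: r(A, C) and s(B, D), each initially split over 2 servers.
r = [(1, "x"), (2, "x"), (2, "y"), (3, "y")]
s = [(1, "p"), (2, "p"), (2, "q"), (4, "q")]
counts = parallel_plan([r[:2], r[2:]], [s[:2], s[2:]], 3)
print(counts)
```

Note that HA1 has to finish counting before anything is sent through E3, which is the "breaks the pipeline" cost mentioned on the slide.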
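
Worked sketch: two-phase commit
The message flow on the Distributed Recovery slide can be sketched as a toy simulation. This is not a real implementation: there is no network, no timeout and no crash recovery, logs are in-memory lists, and the class and function names (`Participant`, `two_phase_commit`) and the `can_commit` flag are assumptions made for the example.

```python
class Participant:
    """A node taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical stand-in for local checks
        self.log = []

    def prepare(self):
        # Phase 1: decide, write prepare/abort to the log, vote yes/no.
        if self.can_commit:
            self.log.append("prepare")
            return True
        self.log.append("abort")
        return False

    def finish(self, decision):
        # Phase 2: log the coordinator's decision (unless already aborted), ack.
        if self.log[-1] != "abort":
            self.log.append(decision)
        return "ack"


def two_phase_commit(participants):
    coord_log = []
    votes = [p.prepare() for p in participants]        # send prepare, collect votes
    decision = "commit" if all(votes) else "abort"     # any "no" causes an abort
    coord_log.append(decision)                         # coordinator logs decision
    acks = [p.finish(decision) for p in participants]  # send decision, collect acks
    if len(acks) == len(participants):
        coord_log.append("end")                        # all acks received
    return decision, coord_log


ok_nodes = [Participant("n1"), Participant("n2"), Participant("n3")]
ok_decision, ok_coord_log = two_phase_commit(ok_nodes)

bad_nodes = [Participant("n1"), Participant("n2", can_commit=False)]
bad_decision, bad_coord_log = two_phase_commit(bad_nodes)
print(ok_decision, bad_decision)
```

In the second run a single "no" vote is enough to abort the transaction on every node, including the one that had voted yes.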