唐宇 2013-11-28

Outline
  - Perspective on the CAP theorem
  - Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary
  - Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
  - Summary

CAP Theorem
  - Consistency: there exists a total order of all operations such that each operation looks as if it were completed at a single instant
  - Availability: each request eventually receives a response
  - Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system

Proof of the CAP theorem

Theoretical Context
  - The trade-off between C and A is one instance of the trade-off between safety and liveness in unreliable systems
  - Safety: something bad will not happen
      - Consistency requirements are typically safety properties
  - Liveness: something good eventually happens
      - Availability is an instance of liveness
  - The relationship between safety and liveness properties has been a long-standing challenge in distributed computing
  - The problem of consensus

The Consensus Problem
  - Scenario
      - Several processes each hold an initial value
      - All processes must agree on one of these initial values
  - Three requirements
      - Agreement: every process must output the same value
      - Validity: every value output must have been provided as the input for some process
      - Termination: every process must eventually output a value
  - Agreement and validity are safety properties, while termination is a liveness property
  - Known result: in the case of consensus, safety and liveness are impossible if the system is even potentially slightly faulty

Practical Implications
  - The trade-off between C and A is unavoidable, so how should a distributed system be designed?
  - Best-effort availability: Chubby
  - Best-effort consistency: web caches
  - Balancing consistency and availability: the TACT (Tunable Availability and Consistency Trade-offs) toolkit
  - Segmenting consistency and availability: payment vs. shopping cart

Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

Motivation
  - Geo-replication
      - Internet users are globally distributed
      - Applications replicate data across datacenters
      - This reduces network latencies to users
  - Tension: cross-site consistency versus latency
      - The problems are magnified with WAN latency
  - Observation: strong consistency is not always required; it depends on the application
  - Goal: RedBlue consistency, mixing strong consistency (for application semantics) and eventual consistency (for fast responses) in the same system

Red and Blue
  - Blue operations: the order of execution can vary from site to site
  - Red operations: must be executed in the same order at all sites
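To make the Red/Blue distinction concrete, the following is a minimal sketch, assuming a toy bank balance replicated at two sites. The helper names (`deposit`, `accrue_interest`, `apply_in_order`) and the integer percent arithmetic are illustrative assumptions, not part of the original slides. It shows that operations which commute can safely be applied in different orders at different sites, while operations which do not commute leave the sites in different states.

```python
# Illustrative sketch only: a toy balance replicated at two sites.
# All names here are hypothetical and chosen for this example.

def deposit(amount):
    """Return an operation that adds a fixed amount to the balance."""
    return lambda balance: balance + amount

def accrue_interest(percent):
    """Return an operation that adds percent% of the current balance."""
    return lambda balance: balance + balance * percent // 100

def apply_in_order(balance, operations):
    """Apply the operations one after another, as a single site would."""
    for operation in operations:
        balance = operation(balance)
    return balance

START = 100

# Two deposits commute: both sites converge even with different orders.
blue_site_a = apply_in_order(START, [deposit(20), deposit(30)])
blue_site_b = apply_in_order(START, [deposit(30), deposit(20)])

# A deposit and an interest accrual do not commute: the sites diverge.
mixed_site_a = apply_in_order(START, [deposit(20), accrue_interest(5)])
mixed_site_b = apply_in_order(START, [accrue_interest(5), deposit(20)])

print(blue_site_a, blue_site_b)    # same value at both sites
print(mixed_site_a, mixed_site_b)  # different values
```

In this sketch the two deposits could be labeled Blue, whereas the deposit/interest pair would need a single agreed order unless the operations are reshaped so that they commute.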
RedBlue Consistency
  - RedBlue order
      - Red operations must be totally ordered
      - The order of Blue operations can vary from site to site
  - Causal serialization: site i has a causal serialization Oi of the RedBlue order O if
      a) Oi is a linear extension of the RedBlue order O
      b) for any two operations u, v, if site(v) = i and u < v in Oi, then u < v in O
  - RedBlue consistency
      - Each site applies operations according to its causal serialization of the RedBlue order

State convergence
  - Blue operations keep the system available and fast
  - Deposit and AccrueInterest: the order of execution affects the result

State convergence
  - Cause: the two operations do not commute
  - Solution: propagate the result of executing the operation

Generator & Shadow operations
  - Many operations do not commute directly, for example Deposit and AccrueInterest
  - Each operation is decomposed into a generator operation and shadow operations
  - Generator operations
      - Only executed at the primary site, against a system state
      - Produce no side effects
      - Determine the state transitions that would occur
      - Produce shadow operations
  - Shadow operations
      - Apply the state transitions at all sites, including the primary site
      - Must produce the same effects as the original operation, given the original state for the generator operation

An example

Converged but Invalid
  - Depends on the timeliness of communication
  - Red is needed

Red or Blue?

Correct Result

Design & Implementation
  [Architecture diagram: local site, remote site, client]
  - The client sends a request to the proxy
  - Proxy: produces shadow operation(s)
  - Coordinator: admits or rejects the operation according to RedBlue consistency
  - Data writer: writes admitted shadow operation(s) to the storage engine (a single MySQL node)
  - Coordinator: propagates the admitted operation(s) to the remote site
  - Note: when a shadow operation is rejected, the proxy server re-executes the generator operation and restarts the process

Optimistic concurrency control
  - Timestamp: a logical clock of the form <<b0; b1; … ; bk-1>, r>
      - bi is the local count of shadow operations initially executed by site i
      - r is the global count of Red shadow operations
  - A global total order of Red operations is guaranteed by a token passing scheme
      - Sites take turns holding a single global token; only the site holding the token may increment the global count r; each site holds the token for 1 s
      - When a Blue operation completes, bi is incremented
      - When a Red operation completes, both bi and r are incremented

Experimental Setup
  - Amazon EC2, using extra large virtual machine instances located in five sites:
      - US East (UE)
      - US West (UW)
      - Ireland (IE)
      - Brazil (BR)
      - Singapore (SG)
  - User-observed latency
      - Each user issues a single outstanding request at a time

Case Studies
  - TPC-W models an online bookstore
  - RUBiS emulates an online auction website modeled after eBay
  - Quoddy is an open source Facebook-like social networking site

Single workload

Proportion of Blue and Red

  Apps     Workload       Originally                    With shadow ops
                          Read-only (%)   Update (%)    Blue (%)   Red (%)
  TPC-W    Shopping mix   85              15            99.2       0.8
  TPC-W    Browsing mix   96              4             99.5       0.5
  TPC-W    Ordering mix   63              37            93.6       6.4
  RUBiS    Bidding mix    85              15            97.4       2.6
  Quoddy   Mix            85              15            100        0

Mixed workload

Summary
  - RedBlue consistency combines strong and eventual consistency in a single system
  - The decomposition into generator/shadow operations expands the space of possible Blue operations
  - A simple rule for labeling is provably state convergent and invariant preserving

Discussion
  - Considerable manual effort
      - Decomposing operations into generator and shadow operations
      - Labeling shadow operations as Blue or Red
  - Limitations of the system design
      - No fault-tolerance mechanism
      - Token round-robin mechanism
  - Open question
      - How are causal relationships between different sites obtained in real time?
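The generator/shadow decomposition from this section can be sketched as follows. This is a minimal illustration under stated assumptions, not Gemini's implementation: the function names, the integer percent arithmetic, and the `withdraw` operation with a non-negative balance invariant are all hypothetical choices made for the example. The generator runs once at the primary site without side effects and returns a shadow operation that carries the computed state change; the shadow operation is then applied at every site.

```python
# Hypothetical sketch of generator/shadow operations for a bank balance.
# Names and arithmetic are assumptions made for illustration.

def add_shadow(delta):
    """Shadow operation: add a fixed delta. Additions commute."""
    return lambda balance: balance + delta

def deposit_generator(balance, amount):
    """Generator for deposit: the state change is simply +amount."""
    return add_shadow(amount)

def accrue_interest_generator(balance, percent):
    """Generator for interest: compute the delta once, against the
    primary site's state, and ship it as a commutative addition."""
    return add_shadow(balance * percent // 100)

def withdraw_generator(balance, amount):
    """Generator for withdraw: checks the invariant locally only."""
    if balance >= amount:
        return add_shadow(-amount)
    return add_shadow(0)  # rejected locally: shadow is a no-op

def apply_in_order(balance, shadows):
    for shadow in shadows:
        balance = shadow(balance)
    return balance

START = 100

# Site A generates a deposit, site B generates interest, both against 100.
deposit_shadow = deposit_generator(START, 20)
interest_shadow = accrue_interest_generator(START, 5)

# Each site applies its own shadow first, then the remote one.
final_a = apply_in_order(START, [deposit_shadow, interest_shadow])
final_b = apply_in_order(START, [interest_shadow, deposit_shadow])

# Two concurrent withdrawals each pass the local check, yet together
# they break the invariant: converged but invalid.
w1 = withdraw_generator(START, 80)
w2 = withdraw_generator(START, 80)
overdrawn = apply_in_order(START, [w1, w2])

print(final_a, final_b, overdrawn)
```

In this sketch the deposit and interest shadows commute and never threaten the invariant, so they are candidates for Blue. The withdraw shadow also commutes, but concurrent instances can drive the balance negative, which is why such an operation would need to be Red.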
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Wide-Area Storage
  - Desired properties: ALPS
      - Availability: all operations issued to the data store complete successfully
      - Low latency: client operations complete “quickly”
      - Partition tolerance: the data store continues to operate under network partitions
      - Scalability: the data store scales out linearly

Consistency with ALPS
  - Having chosen A + P, C (linearizability) is no longer attainable
  - Sequential consistency conflicts with AP
  - Causal+ can be achieved together with AP

Causal + Convergent Conflict Handling
  - Causal
  - Conflicts in causal [figure labels: V=3, V=4]
  - Causal + conflict handling [figure labels: V=4, V=4, V=4]

Previous Causal+ Systems
  - Bayou 1994, TACT 2000, PRACTI 2006
  - Limited scalability
      - All data should fit on the same machine (Bayou)
      - The set of keys over which causal+ consistency is provided is still limited to what a single machine can handle (PRACTI)

Causal in COPS
  - Causality is determined from
      - Versions
      - Dependencies
  - Versions: different values of the same key
      - Used to reason about different values of a key
      - Each replica returns non-decreasing versions of a key
  - Dependencies: relations between operations
      - yj depends on xi if and only if put(xi) → put(yj)
      - A version is written only after all of its dependencies have been written

The COPS Family
  - Standard COPS
  - COPS-GT (get transactions)
      - Provides a superset of COPS’ functionality by also introducing support for get transactions
  - COPS-CD (conflict detection)
      - COPS with conflict detection

Get operation

Put operation

COPS architecture
  - Returns a consistent view of multiple key-value pairs in a single call
  - Gets and puts are linearizable across the nodes in the cluster
  - Operations between clusters occur asynchronously

Causal dependency
  - Dependency checks use only the nearest dependencies
  - Only the get_trans operation uses all dependencies

Get_trans in COPS-GT
  - Goals
      - Guarantee a consistent view
      - Return the latest versions where possible
  - Steps
      - Fetch all the data once and update the client's versions
      - Check whether the versions are consistent
      - If they are not consistent, read again
      - Update the client's information
      - Return the result

Garbage Collection Subsystem
  - Version garbage collection (COPS-GT only)
      - Applies when old versions have not been read within a time threshold
  - Dependency garbage collection (COPS-GT only)
      - Applies some time after the data has been replicated to all datacenters
  - Client metadata garbage collection (COPS + COPS-GT)
      - Applies once new data has been replicated to all datacenters (the get operation returns a flag bit)

Conflict Detection
  - The default COPS system avoids conflict detection by using a last-writer-wins strategy
      - The “last writer” is determined by comparing version numbers
  - COPS with conflict detection (COPS-CD)
      - The put operation takes an additional prev parameter, the version currently visible to the writer
      - Detecting a conflict: prev ≠ curr if and only if new and curr conflict
      - COPS-CD has an application-specified convergent conflict handler that is invoked when a conflict is detected

Microbenchmarks
  - put_after(x) denotes a put with x dependencies

Experiment config

Dynamic Workloads (1)
  - Variances

Dynamic Workloads (2)

Scalability
  - LOG mimics systems based on log serialization and exchange, which can only provide causal+ consistency with single-node replicas

Summary
  - Two systems
      - Gemini: RedBlue consistency (linearizability + eventual)
      - COPS: Causal+
  - Common characteristics
      - Correct execution of operations
      - High performance (relatively)
      - Good scalability
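The COPS rule that a version is written only after all of its dependencies, together with the default last-writer-wins strategy, can be sketched as below. This is a minimal single-process illustration under assumptions: the `Replica` class, its method names, and the representation of dependencies as (key, version) pairs are hypothetical and do not reproduce the actual COPS API.

```python
# Hypothetical sketch of dependency checking at a receiving replica.
# Class and method names are assumptions made for illustration.

class Replica:
    def __init__(self):
        self.store = {}    # key -> (version, value) currently visible
        self.pending = []  # replicated writes waiting on dependencies

    def _satisfied(self, deps):
        """A dependency (key, version) holds once the local store has
        that key at the given version or a newer one."""
        return all(
            key in self.store and self.store[key][0] >= version
            for key, version in deps
        )

    def put_after(self, key, version, value, deps):
        """Receive a replicated write and apply it only when all of its
        dependencies are already visible locally."""
        self.pending.append((key, version, value, deps))
        self._drain()

    def _drain(self):
        """Apply every buffered write whose dependencies are now met."""
        progress = True
        while progress:
            progress = False
            for write in list(self.pending):
                key, version, value, deps = write
                if self._satisfied(deps):
                    self.pending.remove(write)
                    # Last-writer-wins: keep the higher version number.
                    if key not in self.store or version > self.store[key][0]:
                        self.store[key] = (version, value)
                    progress = True

    def get(self, key):
        return self.store.get(key)


replica = Replica()
# The reply arrives before the post it depends on, so it stays hidden.
replica.put_after("y", 2, "reply", [("x", 1)])
hidden = replica.get("y")
# Once the dependency arrives, both writes become visible.
replica.put_after("x", 1, "post", [])
print(hidden, replica.get("x"), replica.get("y"))
```

The sketch buffers writes in a simple list and rescans it after every arrival, which keeps the logic easy to follow; a real system would index pending writes by the dependencies they are waiting for.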