Sequence CRDT: A Scalable Sequence Encoding for Massive Collaborative Editing
Brice Nédelec, Pascal Molli & Achour Mostefaoui
GDD – LINA – University of Nantes
Workshop on Highly-Scalable Distributed Systems
Wednesday 14 January 2015, Paris, France

Distributed Collaborative Editors
• Distributed collaborative editors let people work together while distributed across space, time and organizations.
• Google Docs, Etherpad, Google Wave…
• 190M users on Google Drive (which includes Google Docs)

Google Docs is great, but...
• Single point of failure:
  – If the provider is down -> no collaboration
• Privacy, economic intelligence:
  – What if Google searches for "ANR" on 15 October? ;)
• Mass editing:
  – Google limits the number of simultaneous editors (50); beyond 50, users are read-only

Is it possible to build a fully decentralized editor that supports 1M simultaneous users?
• Why? Because it is hard ;)
  – “We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard.” Kennedy, 1962
• Because it can also be useful: mass collaboration -> MOOCs, webinars, events. Google Wave has already been used like that…

Distributed Collaborative Editors Principles (OT or CRDT)
• Based on optimistic replication algorithms
  – Operations are generated locally
    • No lock, no communication with other sites
  – Broadcast to other sites
    • Every operation is eventually delivered
  – Re-executed when received
• The system is correct if it ensures causality, convergence and “intention preservation” (OT definition), i.e. it preserves partial orders in the sequence

Principles of Sequence CRDT
• Encode the order of the sequence in the ID of its elements (remember BASIC line numbers? ;)
    10 LET B=A
    15 FOR I=1 TO 27
    20 LET A=A*A
    21 NEXT I
• Argh, I forgot LET B=B^2 before NEXT I, and there is no way to use line 20.5!
• Insert alpha between p and q:
  – Create an ID (path) for alpha
  – Create a disambiguator for alpha (so that path + disambiguator is unique)
• The space and time complexity of a Sequence CRDT is mainly decided here!
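
The allocation step above (pick a path for alpha between p and q, then add a disambiguator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' strategy: the fixed arity `BASE`, the virtual bounds `BEGIN`/`END`, and the names `alloc_path` and `make_id` are all hypothetical, and the sketch assumes the two neighbours have distinct paths.

```python
import random

BASE = 32                 # assumed arity of every tree level (illustrative)
BEGIN = (0,)              # virtual path before the first element
END = (BASE - 1,)         # virtual path after the last element


def alloc_path(p, q, rng=random):
    """Return a path strictly between paths p and q (tuples of digits, p < q)."""
    assert p < q, "neighbours must have distinct, ordered paths"
    path = []
    q_bounds = True       # False once the chosen prefix is already smaller than q
    depth = 0
    while True:
        lo = p[depth] if depth < len(p) else 0
        hi = q[depth] if q_bounds and depth < len(q) else BASE
        if hi - lo > 1:
            # Free digits exist at this depth: pick one at random.
            path.append(rng.randint(lo + 1, hi - 1))
            return tuple(path)
        # No room at this depth: follow p's digit and look one level deeper.
        path.append(lo)
        if lo < hi:
            q_bounds = False
        depth += 1


def make_id(left, right, site, counter, rng=random):
    """Build (path, site, counter); (site, counter) is the disambiguator."""
    return (alloc_path(left[0], right[0], rng), site, counter)


# Usage: insert alpha between the two virtual bounds of an empty document.
first, last = (BEGIN, 0, 0), (END, 0, 0)
alpha = make_id(first, last, site=1, counter=1)
assert first < alpha < last
```

Identifiers compare as plain tuples, so sorting them yields the document order. When no digit is free between p and q, the path grows by one level, which is why the choice made in this step drives the size of the identifiers.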

Scientific problem
• Design an ID allocation strategy for sequence elements that is independent of the insertion order
• There are many ways to type “QWERTY”: how can we compute the smallest IDs for each character, whatever the insertion order?

Problem: Order of Insertions
• Typed: Q;W;E;R;T;Y
• Typed: Y;T;R;E;W;Q

LSEQ
• Combines an exponential tree & random allocation
• Complexity O((log n)^2) -> avoids rebalancing IDs…

Experiments
• We built the CRATE editor [1]
  – LSEQ for ID allocation
  – Gossip for broadcast
  – Anti-entropy for missed deliveries
  – Interval version vectors for causal reception [2]

[1] https://github.com/Chat-Wane/CRATE.git
[2] M. Mukund, G. Shenoy R., S. Suresh, Optimized or-sets without ordering constraints, in: M. Chatterjee, J.-n. Cao, K. Kothapalli, S. Rajsbaum (Eds.), Distributed Computing and Networking, Vol. 8314 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2014, pp. 227–241. doi:10.1007/978-3-642-45249-9_15.

1st Setup
• Objective: validate the space complexity analysis of LSEQ.
  – When the editing behaviour is monotonic, LSEQ has a polylogarithmic upper bound on space complexity with respect to the number of insert operations.
  – When the editing behaviour is random, LSEQ has a logarithmic space complexity.
• Setup:
  – A single machine with 2 peers
  – Peers together produce 166 chars/s to create a document of 500,000 characters
  – Monotonic behaviour

Evaluation [figure]

2nd Setup
• Objective: show that CRATE scales in terms of the number of peers.
  – In other words, the size of the network does not impact the upper bound on the space complexity of messages.
• Setup:
  – On Grid'5000, the number of peers grows from 2 to 450
  – 166 chars/s uniformly distributed among the peers

3rd Setup
• Objective: show that concurrency does not negatively impact the size of identifiers. Hence, scenarios without concurrency show the upper bound on the size of identifiers.
• Setup:
  – A single machine emulates 10 peers using the CRATE application.
  – 10,000 characters at 3 insertions/s, uniformly distributed among the peers
  – 5 runs with approximately the following latencies: 0.02 ms, 100 ms, 500 ms, 1 s, and 10 s

Conclusions
• LSEQ computes IDs for sequence CRDTs with an upper bound of O((log n)^2)
• The number of peers and concurrency do not negatively impact the performance of CRATE
• One million users is reachable…

Nédelec, B., Molli, P., Mostefaoui, A., & Desmontils, E. (2013, September). LSEQ: an adaptive structure for sequences in distributed collaborative editing. In Proceedings of the 2013 ACM Symposium on Document Engineering (pp. 37-46). ACM.
Nédelec, B., Molli, P., Mostefaoui, A., & Desmontils, E. (2013). Concurrency Effects Over Variable-size Identifiers in Distributed Collaborative Editing. In Proceedings of the International Workshop on Document Changes: Modeling, Detection, Storage and Visualization, Florence, Italy, September 10, 2013 (Vol. 1008, pp. 0-7).

Perspectives
• Deploy a 1M-user editor on a network of browsers
  – 1M users
  – Editing 1M characters…
• And measure its performance
• In progress, nearly ready…