UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 739 Distributed Systems
Andrea C. Arpaci-Dusseau

Introduction to CS739: Distributed Systems

What are distributed systems?
What are the benefits and challenges?
How will CS739 be structured?
• Readings, Writeups, Presentations
• Projects

Goals of Course

Learn about challenges and existing techniques for building distributed systems and services
• Read and discuss influential papers from SOSP, OSDI, NSDI
Gain some experience programming in a distributed environment
• Warm-up project
• Final project

What is a Distributed System?

Leslie Lamport says: “You know you have one when the crash of a computer you never heard of stops you from doing any work”
More technical definition: “Collection of independent computers that appears to its users as a single coherent system”
How are parallel, distributed, networked systems different?
• All contain nodes (processing, memory, disk) connected with network
[Figure: spectrum from more unified to less unified: parallel, distributed, networked]
Consider distributed services as well…

Benefits of Distributed Systems

Great price/performance
• Leverage commodity components (nodes and networks)
• Use many, many of them
Incremental scalability
• Can add x% new nodes (or disks or memory) to improve performance x%
Improved availability
• Continue operating when some nodes stop working
Improved reliability
• Deliver correct results when some nodes misbehave, corrupt data
Allow geographically-distributed individuals to share data or cooperate

Distributed System Challenges

Lack of global state information
• Different nodes have different views of system
  – What are the contents of file A?
  – How many jobs are running on node X?
  – Which nodes are currently part of the system?
• See delays, different ordering of messages, lost messages, network partitions
• Tension with goal of “single coherent system”
Handling slow, failed and misbehaving nodes
• How do you avoid slow nodes?
• How do you get back data or work from a failed node?
• When nodes disagree, how do you know who is wrong?
• Tension with goal of “available and reliable”
When is it okay to have some centralized components?
• Simplifies state management, but single point-of-failure and performance bottleneck

Content of 739

Distributed system courses can be very different…
Theoretical: distributed algorithms (e.g., to allow nodes to come to consensus or agreement)
• 4 lectures
Practical: distributed programming (e.g., using RPC, Java RMI, CORBA, DCOM, MPI, PVM)
• Warm-up project
Research systems: new ideas for making distributed systems better
• Focus of course
• Implemented systems with new conceptual ideas
• Recent papers in top systems conferences (SOSP, OSDI, NSDI)

Learning by Reading

Intense reading list; assume sophisticated reader (736)
• Usually cover 1 fascinating paper per class
• No exams
Three types of classes
1) Formal lecture: Only for 4 theory topics
2) Discussions: Most papers
  – I ask questions, expect everyone to enthusiastically participate; fairly casual
  – Task 1: Read paper 2-3 times before class
  – Task 2: Email write-up to me BEFORE class
  – Task 3: Take turns being scribe (about 2 times in semester)
    • Write up notes from discussion in LaTeX
    • Post to web page within 72 hours

Learning by Reading (cont)

Types of classes (cont)
3) Group-led lectures: 4 topics
  – Small group gives overview of about 3-4 related papers
  – Topics:
    • Distributed system analysis
    • Process migration
    • Programming environments
    • Specialized distributed services
  – Advantages
    • Good practice for giving presentations
    • Learn about topic in slightly more depth
  – Tasks
    • Group:
      » Finalize related papers (1 week before)
      » Present to me (2 days before)
      » Use slides
    • Everyone else: Skim papers
  – Handout: State preferences by next week

Course Topics: Reading List

Distributed Operating Systems (Survey, Amoeba vs Sprite)
Network File Systems (NFS, Coda, LBFS)
Theory: Time, Ordering, and Distributed Snapshots (2 Lamport papers)
Analysis of Distributed Systems (1 + Group Presentation)
Programming Environments (DSM, MapReduce, Group)
Process Migration (1 + Group)
Specialized Distributed Services (Porcupine + Group)
SPRING BREAK
Theory: Consensus (Byzantine failures and fail-stop processors)
Cluster-based File Systems (Petal+Frangipani and GoogleFS)
Communication Primitives (RPC vs U-Net)
P2P Systems (Measurement, CFS, Amazon, Pangaea, LOCKSS)
Miscellaneous: Trust, Recovery, Mistakes, Speculation, Sensor Networks

Learning by Doing

Warm-up Project
• Goal: Become familiar with existing distributed programming environments
• Examples: Hadoop (open-source MapReduce), MPI, PVM
• Task 0: Get environment running
• Task 1: Implement simple application (e.g., sorting)
• Task 2: Report sufficient numbers to indicate you did something
Final Project
• Goal 1: Experience with “research process” in general
  – Work on open-ended project, unknown result
  – New idea where you don’t know if it will work
• Goal 2: Learn about specific topic in depth
• Topic from my list or your own choice; work with project partner
• Deliverables: 20 minute talk, short research paper

Agenda for Next Class

See website: www.cs.wisc.edu/~cs739-1
Read:
• Survey: Distributed Operating Systems
  Andrew S. Tanenbaum and Robbert Van Renesse
  ACM Computing Surveys, Volume 17, Issue 4 (December 1985), pp 419-470
• Long paper: Focus on Sections 1 and 2
Answer question:
• What were the goals of distributed systems at this time? Which design issue (i.e., communication primitives, naming and protection, resource management, fault tolerance, services) seems most challenging (or interesting)? Why?
• Email answer to me with Subject cs739: Survey
Think about group presentation papers
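
To make Task 1 of the warm-up project (implement a simple application such as sorting) a little more concrete, here is a minimal single-process sketch of a range-partitioned sort, the general shape a sort often takes in a MapReduce-style environment. It is an illustration under stated assumptions, not the assignment's required design: it does not use Hadoop, MPI, or PVM, the "nodes" are simulated as in-memory lists, and the names `partition_by_range`, `distributed_sort`, `num_nodes`, and `seed` are hypothetical.

```python
import bisect
import random


def partition_by_range(values, splitters):
    """Route each value to the bucket (simulated node) that owns its key range.

    `splitters` must be sorted; bucket i receives values v for which exactly
    i splitters are <= v, so every value in bucket i is <= every value in
    bucket i + 1.
    """
    buckets = [[] for _ in range(len(splitters) + 1)]
    for v in values:
        buckets[bisect.bisect_right(splitters, v)].append(v)
    return buckets


def distributed_sort(values, num_nodes, seed=0):
    """Sort `values` by range-partitioning them across `num_nodes` simulated nodes.

    Assumption: a real deployment would ship each bucket to a separate machine;
    here every "node" is just a Python list sorted in the same process.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if not values:
        return []

    # Choose splitters from a small random sample so buckets are roughly balanced.
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(len(values), 10 * num_nodes)))
    splitters = [sample[i * len(sample) // num_nodes] for i in range(1, num_nodes)]

    buckets = partition_by_range(values, splitters)

    # Each simulated node sorts its own bucket independently.
    sorted_buckets = [sorted(bucket) for bucket in buckets]

    # Buckets cover increasing key ranges, so concatenation is globally sorted.
    result = []
    for bucket in sorted_buckets:
        result.extend(bucket)
    return result


if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(50)]
    print(distributed_sort(data, num_nodes=4) == sorted(data))
```

In a real environment, the per-bucket sizes and per-node sort times would be natural candidates for the numbers Task 2 asks you to report, since they show how evenly the work was spread.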