Introduction to CS739: Distributed Systems

UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 739
Distributed Systems
Andrea C. Arpaci-Dusseau
Introduction to CS739: Distributed Systems
What are distributed systems?
What are the benefits and challenges?
How will CS739 be structured?
Readings, Writeups, Presentations
Projects
Goals of Course
Learn about challenges and existing techniques for building
distributed systems and services
• Read and discuss influential papers from SOSP, OSDI, NSDI
Gain some experience programming in a distributed environment
• Warm-up project
• Final project
What is a Distributed System?
Leslie Lamport says:
“You know you have one when the crash of a computer you never heard of
stops you from doing any work”
More technical definition:
“Collection of independent computers that appears to its users as a single
coherent system”
How are parallel, distributed, networked systems different?
• All contain nodes (processing, memory, disk) connected with network
• Spectrum from more unified to less unified: parallel, distributed, networked
• Consider distributed services as well…
Benefits of Distributed Systems
Great price/performance
• Leverage commodity components (nodes and networks)
• Use many, many of them
Incremental scalability
• Can add x% new nodes (or disks or memory) to improve
performance x%
Improved availability
• Continue operating when some nodes stop working
Improved reliability
• Deliver correct results when some nodes misbehave, corrupt data
Allow geographically-distributed individuals to share data or
cooperate
Distributed System Challenges
Lack of global state information
• Different nodes have different views of the system
– What are the contents of file A?
– How many jobs are running on node X?
– Which nodes are currently part of the system?
• See delays, different ordering of messages, lost messages,
network partitions
• Tension with goal of “single coherent system”
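To make the "different view" problem concrete, here is a minimal sketch of two replicas that receive the same two writes to file A in different orders. The names `apply_writes`, `replica_x`, and `replica_y` are hypothetical, and the last-write-wins rule is an assumption made only for illustration; it is not a technique prescribed by the course.

```python
# Hypothetical illustration: the same writes, delivered in different orders,
# leave two replicas with different answers to "what are the contents of file A?"

def apply_writes(writes):
    """Apply (key, value) writes in arrival order; the last write to a key wins."""
    state = {}
    for key, value in writes:
        state[key] = value
    return state

# Two clients each write file A.
write_1 = ("A", "hello")
write_2 = ("A", "world")

# The network delivers the messages to each replica in a different order.
replica_x = apply_writes([write_1, write_2])
replica_y = apply_writes([write_2, write_1])

print(replica_x["A"])          # world
print(replica_y["A"])          # hello
print(replica_x == replica_y)  # False
```

Neither replica is "wrong" locally; each simply saw a different ordering. Agreeing on one order is what the time and ordering readings address.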
Handling slow, failed and misbehaving nodes
• How do you avoid slow nodes?
• How do you get back data or work from failed node?
• When nodes disagree, how do you know who is wrong?
• Tension with goal of “available and reliable”
When is it okay to have some centralized components?
• Simplifies state management, but single point-of-failure and
performance bottleneck
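One simple way to approach the question of who is wrong when nodes disagree is majority voting. The sketch below is illustrative only: the function names and the reply format are hypothetical, and it assumes that fewer than half of the nodes misbehave and that they fail independently.

```python
from collections import Counter

def majority_value(replies):
    """Return the value reported by a strict majority of nodes, or None if there is none."""
    if not replies:
        return None
    value, count = Counter(replies.values()).most_common(1)[0]
    return value if count > len(replies) / 2 else None

def suspects(replies):
    """List the nodes whose reply differs from the majority value (empty if no majority)."""
    winner = majority_value(replies)
    if winner is None:
        return []
    return sorted(node for node, value in replies.items() if value != winner)

# Hypothetical replies to "what version of file A do you hold?"
replies = {"node1": "v2", "node2": "v2", "node3": "v1"}

print(majority_value(replies))  # v2
print(suspects(replies))        # ['node3']
```

When no strict majority exists (for example, two nodes that disagree), the sketch returns `None` rather than guessing, which shows the tension with availability: the system may have to stop answering in order to stay correct.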
Content of 739
Distributed system courses can be very different…
Theoretical: distributed algorithms (e.g., to allow nodes to
come to consensus or agreement)
• 4 lectures
Practical: distributed programming (e.g., using RPC, Java
RMI, CORBA, DCOM, MPI, PVM)
• Warm-up project
Research systems: new ideas for making distributed systems
better
• Focus of course
• Implemented systems with new conceptual ideas
• Recent papers in top systems conferences (SOSP, OSDI, NSDI)
Learning by Reading
Intense reading list; assume sophisticated reader (736)
• Usually cover 1 fascinating paper per class
• No exams
Three types of classes
1) Formal lecture: Only for 4 theory topics
2) Discussions: Most papers
– I ask questions, expect everyone to enthusiastically participate; fairly casual
– Task 1: Read paper 2-3 times before class
– Task 2: Email write-up to me BEFORE class
– Task 3: Take turns being scribe (about 2 times in semester)
• Write up notes from discussion in LaTeX
• Post to web page within 72 hours
Learning by Reading (cont)
Types of classes (cont)
3) Group-led lectures: 4 topics
– Small group gives overview of about 3-4 related papers
– Topics:
• Distributed system analysis
• Process migration
• Programming environments
• Specialized distributed services
– Advantages
• Good practice for giving presentations
• Learn about topic in slightly more depth
– Tasks
• Group:
» Finalize related papers (1 week before)
» Present to me (2 days before)
» Use slides
• Everyone else: Skim papers
– Handout: State preferences by next week
Course Topics: Reading List
Distributed Operating Systems (Survey, Amoeba vs Sprite)
Network File Systems (NFS, Coda, LBFS)
Theory: Time, Ordering, and Distributed Snapshots (2 Lamport papers)
Analysis of Distributed Systems (1 + Group Presentation)
Programming Environments (DSM, MapReduce, Group)
Process Migration (1 + Group)
Specialized Distributed Services (Porcupine + Group)
SPRING BREAK
Theory: Consensus (Byzantine failures and fail-stop processors)
Cluster-based File Systems (Petal+Frangipani and GoogleFS)
Communication Primitives (RPC vs U-Net)
P2P Systems (Measurement, CFS, Amazon, Pangaea, LOCKSS)
Miscellaneous: Trust, Recovery, Mistakes, Speculation, Sensor Networks
Learning by Doing
Warm-up Project
• Goal: Become familiar with existing distributed programming
environments
• Examples: Hadoop (open-source MapReduce), MPI, PVM
• Task 0: Get environment running
• Task 1: Implement simple application (e.g., sorting)
• Task 2: Report sufficient numbers to indicate you did something
Final Project
• Goal 1: Experience with “research process” in general
– Work on open-ended project, unknown result
– New idea where you don’t know if it will work
• Goal 2: Learn about specific topic in depth
• Topic from my list or your own choice; work with project partner
• Deliverables: 20 minute talk, short research paper
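For the warm-up project's sorting task, the following is a minimal single-process sketch of the range-partitioned sort that MapReduce-style environments are often used for. It does not use the Hadoop, MPI, or PVM APIs; the function names and the hand-picked `splitters` are assumptions for illustration. In a real run, each bucket would live on a different worker.

```python
import bisect

def partition(records, splitters):
    """Map phase: route each record to the bucket that covers its key range."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        buckets[bisect.bisect_right(splitters, record)].append(record)
    return buckets

def distributed_sort(records, splitters):
    """Sort each bucket independently, then concatenate the buckets in range order."""
    buckets = partition(records, splitters)
    sorted_buckets = [sorted(bucket) for bucket in buckets]  # one bucket per worker
    return [record for bucket in sorted_buckets for record in bucket]

data = [42, 7, 19, 88, 3, 61, 25]
result = distributed_sort(data, splitters=[20, 60])
print(result)  # [3, 7, 19, 25, 42, 61, 88]
```

Because every key in one bucket is smaller than every key in the next, no merge step is needed after the per-bucket sorts. Choosing splitters that balance the buckets is the part that real systems have to work for, and it is a natural thing to measure for Task 2.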
Agenda for Next Class
See website:
www.cs.wisc.edu/~cs739-1
Read:
• Survey: Distributed Operating Systems
Andrew S. Tanenbaum and Robbert Van Renesse
ACM Computing Surveys, Volume 17, Issue 4 (December 1985), pp 419-470
• Long paper: Focus on Sections 1 and 2
Answer question:
• What were the goals of distributed systems at this time? Which
design issue (i.e., communication primitives, naming and
protection, resource management, fault tolerance, services) seems
most challenging (or interesting)? Why?
• Email answer to me with Subject cs739: Survey
Think about group presentation papers