The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica

Background

• Clusters of commodity servers have become a major computing platform in industry and academia

• Driven by data volumes outpacing the processing capabilities of single machines

• Democratized by cloud computing

Background

• Some have declared that “the datacenter is the new computer”

• Claim: this new computer increasingly needs an operating system

• Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host

Why Datacenters Need an OS

• Growing number of applications

– Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MR Online

– Storage systems: GFS, BigTable, Dynamo, SCADS

– Web apps and supporting services

• Growing number of users

– 200+ for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries

What Operating Systems Provide

Resource sharing across applications & users

Data sharing between programs

Programming abstractions (e.g. threads, IPC)

Debugging facilities (e.g. ptrace, gdb)

Result: OSes enable a highly interoperable software ecosystem that we now take for granted

An Analogy

• Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack

• In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing:

– Intermix a variety of apps & programming models

– Write new parallel programs that talk to these

– Get a unified interface for managing the cluster

– Debug and trace across all these components

Today’s Datacenter OS

• Hadoop MapReduce as common execution and resource sharing platform

• Hadoop InputFormat API for data sharing

• Abstractions for productivity programmers, but not for system builders

• Very challenging to debug across all the layers

Tomorrow’s Datacenter OS

• Resource sharing:

– Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction)

– Optimization for a variety of metrics (e.g. energy)

– Integration with network scheduling mechanisms (e.g. Seawall [NSDI ’11], NOX, Orchestra)
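
A minimal sketch of what fine-grained sharing could look like, assuming a two-level scheme in which a cluster-wide layer offers free CPUs to frameworks and each framework decides how much to take. The names `Offer`, `Framework`, and `offer_round` are hypothetical and are not the Mesos API.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int


@dataclass
class Framework:
    """Hypothetical framework (e.g. a MapReduce or MPI scheduler)."""
    name: str
    pending: int            # tasks still waiting to run
    cpus_per_task: int = 1
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Launch as many pending tasks as fit; return the CPUs used."""
        n = min(self.pending, offer.cpus // self.cpus_per_task)
        self.pending -= n
        self.launched.extend([offer.node] * n)
        return n * self.cpus_per_task


def offer_round(free, frameworks):
    """Offer each node's free CPUs to the frameworks in turn."""
    for node in sorted(free):
        for fw in frameworks:
            if free[node] == 0:
                break
            free[node] -= fw.consider(Offer(node, free[node]))
    return free
```

Offering frameworks in a fixed order is a simplification; a real allocator would also have to pick an ordering policy, which is where fairness or metrics such as energy would come in.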

Tomorrow’s Datacenter OS

• Data sharing:

– Standard interfaces for cluster file systems, key-value stores, etc.

– In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory

– Streaming data abstractions (analogous to pipes)

– Lineage instead of replication for reliability (RDDs)
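
The lineage idea can be illustrated with a small sketch: a derived dataset remembers its parent and the function that produced it, so a lost partition is recomputed instead of being restored from a replica. The `Dataset` class below is a hypothetical toy, not Spark's RDD interface, and it assumes the source data itself is stored durably.

```python
class Dataset:
    """A partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent    # dataset this one was derived from
        self.fn = fn            # per-element transformation applied to parent
        # Materialized partitions, keyed by partition index.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, fn):
        """Describe a derived dataset; nothing is computed yet."""
        return Dataset(parent=self, fn=fn)

    def partition(self, i):
        """Return partition i, recomputing it from lineage if missing."""
        if i not in self.cache:
            if self.parent is None:
                raise KeyError(f"source partition {i} is lost")
            self.cache[i] = [self.fn(x) for x in self.parent.partition(i)]
        return self.cache[i]

    def lose(self, i):
        """Simulate losing the in-memory copy of partition i."""
        self.cache.pop(i, None)
```

Only the lost partition is recomputed, and only from the matching partition of its parent, which is what makes this cheaper than keeping extra copies of every derived dataset.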

Tomorrow’s Datacenter OS

• Programming abstractions:

– Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM)

– Efficient implementations of communication primitives (e.g. shuffle, broadcast)

– New distributed programming models
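
As an illustration of a communication primitive that many programming models could share, here is a minimal single-process sketch of a shuffle using hash partitioning. The function names are hypothetical, and a real implementation would move data over the network rather than between in-memory lists.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    """Pick a reducer for a key with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs from all mappers by destination reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition_of(key, num_reducers)][key].append(value)
    return [dict(b) for b in buckets]
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, so different workers would otherwise disagree on where a key belongs.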

Tomorrow’s Datacenter OS

• Debugging facilities:

– Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper)

– Replay debugging that takes advantage of limited languages / computational models

– Unified monitoring infrastructure and APIs
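
Cross-stack tracing depends on every layer passing a small piece of trace context along with each request. The sketch below shows that idea in the spirit of X-Trace and Dapper; it is not either system's API, and the layer names, `record`, and the in-memory `events` list (a stand-in for a cluster-wide collector) are assumptions.

```python
import itertools

_ids = itertools.count(1)
events = []   # stand-in for a cluster-wide trace collector


def record(trace_id, parent, layer, msg):
    """Log one event and return its span id for use by child calls."""
    span = next(_ids)
    events.append({"trace": trace_id, "span": span, "parent": parent,
                   "layer": layer, "msg": msg})
    return span


def storage_read(ctx, block):
    trace_id, parent = ctx
    record(trace_id, parent, "storage", f"read {block}")


def run_task(ctx, block):
    trace_id, parent = ctx
    span = record(trace_id, parent, "executor", "task start")
    # Pass the context down so the storage layer joins the same trace.
    storage_read((trace_id, span), block)


def submit_job(trace_id, blocks):
    root = record(trace_id, None, "scheduler", "job submitted")
    for block in blocks:
        run_task((trace_id, root), block)
```

Because each event records its parent span, the collector can rebuild the causal tree for a job across scheduler, executor, and storage layers without anyone logging into individual nodes.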

Putting it All Together

• A successful datacenter OS might let users:

– Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging)

– Share data efficiently between independently developed programming models and applications

– Understand cluster behavior without having to log into individual nodes

– Dynamically share the cluster with other users

Conclusion

• Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability

• An OS is already emerging in an ad-hoc way

• Researchers can help by taking a long-term approach towards these problems

How Researchers can Help

• Focus on paradigms, not performance

– Industry is tackling performance but lacks the luxury of taking a long-term view of abstractions

• Explore clean-slate approaches

– Likelier to have impact here than in a “real” OS because datacenter software changes quickly!

• Bring cluster computing to non-experts

– Much harder and more rewarding than serving big users
