The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica

Background

• Clusters of commodity servers have become a major computing platform in industry and academia

• Driven by data volumes outpacing the processing capabilities of single machines

• Democratized by cloud computing

Background

• Some have declared that “the datacenter is the new computer”

• Claim: this new computer increasingly needs an operating system

• Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host

Why Datacenters Need an OS

• Growing number of applications

– Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MR Online

– Storage systems: GFS, BigTable, Dynamo, SCADS

– Web apps and supporting services

• Growing number of users

– 200+ for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries

What Operating Systems Provide

• Resource sharing across applications & users

• Data sharing between programs

• Programming abstractions (e.g. threads, IPC)

• Debugging facilities (e.g. ptrace, gdb)

Result: OSes enable a highly interoperable software ecosystem that we now take for granted

An Analogy

• Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack

• In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing:

– Intermix a variety of apps & programming models

– Write new parallel programs that talk to these

– Get a unified interface for managing the cluster

– Debug and trace across all these components

Today’s Datacenter OS

• Hadoop MapReduce as common execution and resource sharing platform

• Hadoop InputFormat API for data sharing

• Abstractions for productivity programmers, but not for system builders

• Very challenging to debug across all the layers

Tomorrow’s Datacenter OS

• Resource sharing:

– Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction)

– Optimization for a variety of metrics (e.g. energy)

– Integration with network scheduling mechanisms (e.g. Seawall [NSDI ’11], NOX, Orchestra)
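To make "fine-grained sharing" concrete, here is a minimal sketch of an offer-based allocator: a central allocator offers each node's free CPUs to the applications (frameworks) in turn, and each one takes only what it can use. It is loosely inspired by the resource-offer idea associated with Mesos, but it is not Mesos's API; the names `Offer`, `Framework`, `consider`, and `allocate` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """A hypothetical offer of free CPUs on one node."""
    node: str
    cpus: int


@dataclass
class Framework:
    """A hypothetical application that runs fixed-size tasks."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Launch as many pending tasks as fit; return the CPUs consumed."""
        fit = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.launched.extend([offer.node] * fit)
        self.pending_tasks -= fit
        return fit * self.cpus_per_task


def allocate(free_cpus, frameworks):
    """Offer each node's free CPUs to every framework in order.

    Whatever one framework leaves unused is offered to the next one,
    so a node can be shared by several applications at once.
    """
    for node in sorted(free_cpus):
        for fw in frameworks:
            if free_cpus[node] == 0:
                break
            free_cpus[node] -= fw.consider(Offer(node, free_cpus[node]))
    return free_cpus
```

With two 4-CPU nodes, a framework with three 1-CPU tasks and another with two 2-CPU tasks would end up sharing the cluster at task granularity rather than each receiving a static partition. A real allocator would also need fairness policies, revocation, and failure handling, which this sketch leaves out.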


• Data sharing:

– Standard interfaces for cluster file systems, key-value stores, etc.

– In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory

– Streaming data abstractions (analogous to pipes)

– Lineage instead of replication for reliability (RDDs)
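The "lineage instead of replication" idea can be sketched in a few lines: rather than keeping extra copies of derived data, a dataset remembers which parent it came from and which function produced it, and recomputes a lost partition on demand. This is a toy illustration, not Spark's RDD API; the `Dataset` class and its methods are hypothetical, and the base dataset is assumed to live in reliable storage.

```python
class Dataset:
    """A toy partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent    # dataset this one was derived from
        self.fn = fn            # per-element function applied to the parent
        # In-memory partitions; assumed durable only for base datasets.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}
        self.recomputed = 0     # how many partitions were (re)built

    def map(self, fn):
        """Record the transformation lazily; nothing is computed yet."""
        return Dataset(parent=self, fn=fn)

    def partition(self, i):
        """Return partition i, rebuilding it from lineage if it is missing."""
        if i not in self.cache:
            if self.parent is None:
                raise KeyError(f"base partition {i} is missing")
            self.cache[i] = [self.fn(x) for x in self.parent.partition(i)]
            self.recomputed += 1
        return self.cache[i]

    def lose(self, i):
        """Simulate losing partition i, e.g. because a node failed."""
        self.cache.pop(i, None)
```

Calling `lose(1)` on a derived dataset and then `partition(1)` again rebuilds only that partition from its parent. The trade-off is recomputation time on failure in exchange for not storing or shipping replicas of intermediate data.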


• Programming abstractions:

– Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM)

– Efficient implementations of communication primitives (e.g. shuffle, broadcast)

– New distributed programming models
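As an illustration of what a shared communication primitive might look like, the sketch below shows the logic of a hash-partitioned shuffle in a single process: every value for a given key is routed to the same reducer bucket. The `shuffle` function and its signature are assumptions made for this example; an efficient implementation would move the data over the network, which is exactly the part a datacenter OS could provide once for all programming models.

```python
import zlib
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs from all mappers into per-reducer buckets.

    Keys are strings. A stable hash (CRC32) picks the reducer, so the same
    key always lands in the same bucket regardless of which mapper emitted it.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
            buckets[reducer][key].append(value)
    return [dict(bucket) for bucket in buckets]
```

For example, given the outputs of two mappers, `shuffle([[("a", 1), ("b", 2)], [("a", 3)]], 2)` places both values for `"a"` in a single bucket.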


• Debugging facilities:

– Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper)

– Replay debugging that takes advantage of limited languages / computational models

– Unified monitoring infrastructure and APIs

Putting it All Together

• A successful datacenter OS might let users:

– Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging)

– Share data efficiently between independently developed programming models and applications

– Understand cluster behavior without having to log into individual nodes

– Dynamically share the cluster with other users

Conclusion

• Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability

• An OS is already emerging in an ad-hoc way

• Researchers can help by taking a long-term approach to these problems

How Researchers Can Help

• Focus on paradigms, not performance

– Industry is tackling performance but lacks the luxury of taking a long-term view of abstractions

• Explore clean-slate approaches

– Likelier to have impact here than in a “real” OS because datacenter software changes quickly!

• Bring cluster computing to non-experts

– Much harder and more rewarding than serving big users
