YARN

Hadoop YARN in the Cloud
Junping Du
Staff Engineer, VMware
China Hadoop Summit, 2013
Agenda
• Hadoop YARN – Hub for Big Data Applications
• YARN and Cloud Computing
• HVE (Hadoop Virtualization Extensions) work on YARN
Hadoop MapReduce v1 (Classic)
• JobTracker
– Manages cluster resources and job scheduling
• TaskTracker
– Per-node agent
– Manages tasks
MapReduce v1 Limitations
• Scalability
– A single JobTracker handles both cluster resource management and job scheduling
• SPOF (Single Point Of Failure)
– A JobTracker failure causes all queued and running jobs to fail
– Restart is very tricky due to complex state
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
• Lacks wire-compatible protocols
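The low utilization caused by hard map/reduce slots can be illustrated with a small calculation. This is a simplified sketch, not Hadoop code: the function names and the 4 + 4 slot layout are assumptions made for illustration only.

```python
def running_with_slots(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks use only map slots, reduce tasks only reduce slots."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def running_with_shared_pool(total_slots, map_tasks, reduce_tasks):
    """Simplified shared pool: any pending task may use any free unit of capacity."""
    return min(total_slots, map_tasks + reduce_tasks)


# Hypothetical node during a map-heavy phase: 8 map tasks pending, no reduces yet.
hard_partition = running_with_slots(4, 4, map_tasks=8, reduce_tasks=0)
shared_pool = running_with_shared_pool(8, map_tasks=8, reduce_tasks=0)

print(hard_partition, shared_pool)  # 4 8
```

With hard slots, half of the node sits idle while map tasks wait; a shared pool keeps all eight units busy.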
YARN Architecture
• Splits up the two major functions of the JobTracker
– Resource Manager (RM) - Cluster resource management
– Application Master (AM) - Task scheduling and monitoring
• NodeManager (NM) - A new per-node slave
– Launches the applications’ containers
– Monitors their resource usage (CPU, memory) and reports it to the Resource Manager
• YARN maintains compatibility with existing MapReduce applications and supports other applications
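The division of labor between RM, AM and NM can be sketched as a toy model. This is not the YARN API: every class and method name below is hypothetical, and the first-fit placement is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Per-node agent: launches containers and reports usage."""
    name: str
    total_mb: int
    used_mb: int = 0

    def launch_container(self, mb):
        self.used_mb += mb

    def report(self):
        return {"node": self.name, "used_mb": self.used_mb, "total_mb": self.total_mb}


class ResourceManager:
    """Cluster-wide resource management only; knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, mb):
        # First-fit placement, chosen for brevity (an assumption of this sketch).
        for node in self.nodes:
            if node.total_mb - node.used_mb >= mb:
                node.launch_container(mb)
                return node.name
        return None  # no capacity left


class ApplicationMaster:
    """Per-application scheduling: decides what to request and tracks placement."""

    def __init__(self, rm, task_mb, num_tasks):
        self.rm = rm
        self.task_mb = task_mb
        self.num_tasks = num_tasks

    def run(self):
        return [self.rm.allocate(self.task_mb) for _ in range(self.num_tasks)]


nodes = [NodeManager("n1", 2048), NodeManager("n2", 1024)]
rm = ResourceManager(nodes)
placements = ApplicationMaster(rm, task_mb=1024, num_tasks=4).run()
print(placements)  # ['n1', 'n1', 'n2', None]
```

The fourth request returns None because the toy cluster is full; in this model the RM never needs to know what kind of application is asking.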
YARN – Hub for Big Data Applications
[Diagram: Impala, HBase, OpenMPI, Distributed Shell, Spark, MapReduce, Tez and Storm all run on YARN, which sits on top of HDFS.]
• App-specific AM
• HOYA (HBase On YArn)
– Long running services (YARN-896)
• LLAMA (Low Latency Application MAster)
– Gang Scheduler (YARN-624)
YARN and Cloud
• Two different perspectives:
– YARN-centric perspective
• YARN is the key platform for apps
• YARN is independent of infrastructure; running on top of the Cloud shows YARN’s generality
– Cloud-centric perspective
• YARN is an umbrella for one kind of application
• Supporting YARN shows the Cloud’s generality
YARN and Cloud: YARN-centric Perspective
[Diagram: Big Data Apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …) run on YARN, which runs on Infrastructure: either Cloud Infrastructure (VMware, OpenStack, …) or bare-metal machines.]
YARN and Cloud: Cloud-centric Perspective
[Diagram: Cloud Infrastructure (VMware, OpenStack, etc.) hosts legacy apps, non-YARN Big Data apps, and YARN; YARN in turn hosts YARN apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …).]
YARN vs. Cloud
• Similarity
– Both aim to share resources across applications
– Both provide global resource management
• YARN vs. Cloud
– YARN manages resources at the OS layer; the Cloud manages resources at the hypervisor layer (not directly comparable, but the hypervisor is more powerful than the OS)
– Apps managed by YARN need a specific AppMaster; apps managed by the Cloud run exactly as they would on physical machines (advantage: Cloud)
– YARN tracks application-specific metrics and progress; the Cloud only tracks the underlying resources (advantage: YARN)
YARN + Cloud
• Why YARN + Cloud?
– Leverage virtualization for strong isolation, fine-grained resource sharing and other benefits
– Uniform infrastructure to simplify enterprise IT
• What does it look like?
– Run YARN NMs inside VMs managed by the Cloud infrastructure
– Build a communication channel between the YARN RM and the Cloud Resource Manager for coordination
• How do we do it?
– The first item above is easy and works smoothly
– The second can be achieved in two ways
• YARN becomes aware of, and can manipulate, Cloud resource changes
• YARN provides a generic resource notification mechanism that the Cloud Manager can use when resources change
Elastic YARN Node in the Cloud
• VM’s resource boundary can be elastic
– CPU is easy: time slicing (with constraints)
– Memory is harder: page sharing and memory ballooning
– In case of contention, enforce limits and proportional sharing
– “Stealing” resources behind apps’ backs could cause bad performance (paging)
– App-aware resource management could address these issues
• Hadoop YARN Resource Model
– Dynamic: nodes can be added or removed
– But static per node
• In this case, shall we enable resource elasticity on VMs?
– If yes, performance is low when resource contention happens.
– If no, utilization is as low as on physical boxes, because free resources cannot be leveraged by other busy VMs
• We need a better answer.
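Both sides of this dilemma can be put into numbers. The sketch below is illustrative only: the function names and the 8 GB VM sizes are assumptions, not measurements or Hadoop code.

```python
def unbacked_mb(allocated_mb, actual_vm_mb):
    """Memory promised to containers that the shrunken VM can no longer back."""
    return max(0, allocated_mb - actual_vm_mb)


def stranded_mb(vm_sizes_mb, vm_used_mb):
    """Free memory locked inside fixed-size VMs, unusable by busier neighbours."""
    return sum(size - used for size, used in zip(vm_sizes_mb, vm_used_mb))


# Elasticity on: the hypervisor balloons an 8 GB VM down to 6 GB, but YARN
# still believes the node has 8 GB and has allocated all of it.
elastic_shortfall = unbacked_mb(allocated_mb=8192, actual_vm_mb=6144)

# Elasticity off: two fixed 8 GB VMs, one fully busy and one mostly idle.
static_waste = stranded_mb([8192, 8192], [8192, 2048])

print(elastic_shortfall, static_waste)  # 2048 6144
```

In the first case 2 GB of container memory has no backing and is likely to page; in the second, 6 GB stays idle even though the neighbouring VM is saturated.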
HVE provides the answer!
• Hadoop Virtualization Extensions
– A project to enhance Hadoop running on virtualization
• Goal: Make Hadoop Cloud-Ready
– Provide virtualization awareness to Hadoop, e.g. virtual topology and virtual resources
– Deliver generic utilities that can be leveraged by virtualized platforms
• Independent of virtualization platform and cloud infrastructure
• 100% contributed to the Apache Hadoop community
HVE
• Philosophy
– Make infrastructure-related components abstract
– Deliver different implementations that can be configured as appropriate
• E.g. BlockPlacementPolicy
[Diagram: BlockPlacementPolicy (abstract) with two implementations: BlockPlacementPolicy Default and BlockPlacementPolicy For Virtualization.]
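The philosophy above (an abstract component with configurable implementations) can be sketched in a few lines. This is a hypothetical illustration, not the HDFS BlockPlacementPolicy API: the class names, the `choose_targets` method, the `POLICIES` registry and the (node, physical host) candidate format are all assumptions.

```python
from abc import ABC, abstractmethod


class PlacementPolicy(ABC):
    """Abstract, infrastructure-neutral component (hypothetical, not the HDFS API)."""

    @abstractmethod
    def choose_targets(self, candidates, replicas):
        """candidates: list of (node, physical_host) pairs; returns chosen node names."""


class DefaultPolicy(PlacementPolicy):
    def choose_targets(self, candidates, replicas):
        # Treats every node as an independent machine.
        return [node for node, _host in candidates[:replicas]]


class VirtualizationPolicy(PlacementPolicy):
    def choose_targets(self, candidates, replicas):
        # Skips VMs that share a physical host with an already chosen VM.
        chosen, used_hosts = [], set()
        for node, host in candidates:
            if len(chosen) == replicas:
                break
            if host not in used_hosts:
                chosen.append(node)
                used_hosts.add(host)
        return chosen


# The implementation is picked by a configured name (registry is hypothetical).
POLICIES = {"default": DefaultPolicy, "virtualization": VirtualizationPolicy}


def create_policy(configured_name):
    return POLICIES[configured_name]()


candidates = [("vm1", "hostA"), ("vm2", "hostA"), ("vm3", "hostB")]
default_choice = create_policy("default").choose_targets(candidates, 2)
virtual_choice = create_policy("virtualization").choose_targets(candidates, 2)
print(default_choice, virtual_choice)  # ['vm1', 'vm2'] ['vm1', 'vm3']
```

The default policy puts both replicas on VMs of the same physical host, so one host failure loses both; the virtualization-aware variant spreads them across hosts without the caller changing any code.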
Elastic YARN Node in the Cloud
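[Diagram: a virtualization host runs a virtual YARN node (a VM with a NodeManager, a Datanode and containers, backed by a VMDK) next to other workloads. When the VM's resources grow or shrink, possibly by tens of GB of memory, should resources be added to or removed from the containers?]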
Implementation – YARN-291 (umbrella)
• YARN-311
– Core scheduler changes
• YARN-312
– AdminProtocol changes
• YARN-313
– CLI
• REST API, JMX, etc.
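[Diagram: Resource Manager internals: AdminService, RMContext with RMNode, Resource Tracker Service, and the Scheduler with SchedulerNode and Cluster Resource, updated through UpdateNodeResource(). The Admin CLI (yarn rmadmin -updateNodeResource <NodeId> <Resource>) and the Cloud Resource Manager sit outside the Resource Manager; the Node Manager connects to the Resource Tracker Service via heartbeat.]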
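The effect of a per-node resource update on scheduling can be sketched with a toy scheduler. This is not the YARN implementation: `ToyScheduler` and its methods are hypothetical, and leaving already-running containers untouched when a node shrinks is an assumption of this sketch.

```python
class ToyScheduler:
    """Tracks per-node capacity and usage in MB (hypothetical model)."""

    def __init__(self):
        self.capacity_mb = {}
        self.used_mb = {}

    def add_node(self, node_id, capacity_mb):
        self.capacity_mb[node_id] = capacity_mb
        self.used_mb[node_id] = 0

    def update_node_resource(self, node_id, new_capacity_mb):
        # Changes capacity at runtime; running containers are not touched here.
        if node_id not in self.capacity_mb:
            raise KeyError(f"unknown node: {node_id}")
        self.capacity_mb[node_id] = new_capacity_mb

    def available_mb(self, node_id):
        return max(0, self.capacity_mb[node_id] - self.used_mb[node_id])

    def cluster_capacity_mb(self):
        return sum(self.capacity_mb.values())

    def allocate(self, node_id, mb):
        if self.available_mb(node_id) < mb:
            return False
        self.used_mb[node_id] += mb
        return True


scheduler = ToyScheduler()
scheduler.add_node("node1", 8192)
first = scheduler.allocate("node1", 6144)       # fits
scheduler.update_node_resource("node1", 4096)   # VM shrank
second = scheduler.allocate("node1", 1024)      # refused, node is over capacity
scheduler.update_node_resource("node1", 12288)  # VM grew
third = scheduler.allocate("node1", 4096)       # fits again
print(first, second, third, scheduler.cluster_capacity_mb())  # True False True 12288
```

After the shrink the scheduler simply stops placing new work on the node until usage drops or capacity grows, which is the behavior an external cloud manager would rely on when it resizes a VM.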
Reference
• YARN MapReduce 2.0
– https://issues.apache.org/jira/browse/MAPREDUCE-279
• HVE topology extension
– https://issues.apache.org/jira/browse/HADOOP-8468
• HVE topology extension for YARN
– https://issues.apache.org/jira/browse/YARN-18
• HVE elastic resource configuration
– https://issues.apache.org/jira/browse/YARN-291
• Gang Scheduling
– https://issues.apache.org/jira/browse/YARN-624
• Long-lived services in YARN
– https://issues.apache.org/jira/browse/YARN-896
Thanks!
Junping Du
jdu@vmware.com