Hadoop YARN in the Cloud Junping Du Staff Engineer, VMware China Hadoop Summit, 2013 Agenda • Hadoop YARN – Hub for Big Data Applications • YARN and Cloud Computing • HVE (Hadoop Virtualization Extension) work on YARN Hadoop MapReduce v1 (Classic) • JobTracker – Manage cluster resources and job scheduling • TaskTracker – Per node agent – Manage tasks MapReduce v1 Limitations • Scalability – Manage cluster resources and job scheduling • SPOF (Single Point Of Failure) • JobTracker failure cause all queued and running job failure – Restart is very tricky due to complex state • Hard partition of resources into map and reduce slots – Low resource utilization • Lacks support for alternate paradigms • Lack of wire-compatible protocols YARN Architecture • Splits up the two major functions of JobTracker – Resource Manager (RM) - Cluster resource management – Application Master (AM) - Task scheduling and monitoring • NodeManager (NM) - A new per-node slave – launching the applications’ containers – monitoring their resource usage (cpu, memory) and reporting to the Resource Manager. • YARN maintains compatibility with existing MapReduce application and support other applications YARN – Hub for Big Data Applications Impala HBase OpenMPI Distributed Shell Spark MapReduce Tez Storm YARN HDFS • App-specific AM • HOYA (Hbase On YArn) – Long running services (YARN-896) • LLAMA (Low Latency Application MAster) – Gang Scheduler (YARN-624) YARN and Cloud • Two different prospective: – YARN-centric prospective • YARN is the key platform to apps • YARN is independent of infrastructure, running on top of Cloud shows YARN’s generality – Cloud-centric prospective • YARN is an umbrella kind of applications • Supporting YARN shows Cloud’s generality YARN and Cloud: YARN-centric Prospective Big Data Apps HBase Open MPI Distributed Shell Spark … Impala MapReduce Tez Storm YARN Infrastructure Cloud Infrastructure Bare-metal machines … VMware Open Stack … YARN and Cloud: Cloud-centric Prospective Legacy Apps Non-YARN Big Data Apps … YARN Apps HBase Open MPI D.S Spark Impala MapReduce Tez Storm … YARN Cloud Infrastructure (VMware, Open Stack, etc.) YARN vs. Cloud • Similarity – Target to share resources across applications – Provide Global Resource Management • YARN vs. Cloud – YARN managing resource in OS layer vs. Cloud managing resources in Hypervisor (Not comparable, but Hypervisor is more powerful than OS ) – Apps managed by YARN need specific AppMaster, Apps managed by Cloud is exactly the same as running on physical machines (Cloud ) – YARN tracking application-specific metrics/progress, Cloud only track underlayer resources (YARN ) YARN + Cloud • Why YARN + Cloud? – Leverage virtualization in strong isolation, fine-grained resource sharing and other benefits – Uniform infrastructure to simplify IT in enterprise • What it looks like? – Running YARN NM inside of VMs managed by Cloud Infrastructure – Build communication channel between YARN RM and Cloud Resource Manager for coordination • How we do? – First thing above is very easy and smoothly – Second things to achieve in two ways • YARN can aware/manipulate Cloud resource change • YARN provide a generic resource notification mechanism so Cloud Manager can use when resource changing Elastic YARN Node in the Cloud • VM’s resource boundary can be elastic – – – – CPU is easy – time slicing (with constraints) Memory is harder – page sharing and memory ballooning In case of contention, enforce limits and proportional sharing “Stealing” resources behind apps could cause bad performance (paging) – App aware resource management could address these issues • Hadoop YARN Resource Model – Dynamic with adding/removing nodes – But static for per node • In this case, shall we enable resource elasticity on VM? – If yes, low performance when resource contention happens. – If no, low utilization as physical boxes because free resources cannot be leveraged by other busy VMs • We need better answer . HVE provide the answer! • Hadoop Virtualization Extensions – A project to enhance Hadoop running on virtualization • Goal: Make Hadoop Cloud-Ready – Provide Virtualization-awareness to Hadoop, i.e. virtual topology, virtual resources, etc. – Deliver generic utility that can be leveraged by virtualized platform • Independent of virtualization platform and cloud infrastructure • 100% contribution to Apache Hadoop Community HVE • Philosophy – make infrastructure related components abstract – deliver different implementations that can be configured properly • E.g. BlockPlacementPolicy (Abstract) BlockPlacementPolicy BlockPlacementPolicy Default BlockPlacementPolicy For Virtualization Elastic YARN Node in the Cloud Container Add/Remove Resources? Container Other Workload Virtual YARN Node NodeManager Datanode Virtualization Host Grow/Shrink resource of a VM VMDK Grow/Shrink by tens of GB in memory? Implementation – YARN-291 (umbrella) • YARN-312 • YARN-311 – Core scheduler changes – AdminProtocol changes • REST API, JMX, etc. • YARN-313 • CLI Resource Manager Scheduler UpdateNodeResource() AdminService Admin CLI Cluster Resource yarn rmadmin -updateNodeResource <NodeId> <Resource> SchedulerNode RMContext RMNode Resource Tracker Service Heartbeat Node Manager Cloud Resource Manager Reference • YARN MapReduce 2.0 – https://issues.apache.org/jira/browse/MAPREDUCE279 • HVE topology extension – https://issues.apache.org/jira/browse/HADOOP-8468 • HVE topology extension for YARN – https://issues.apache.org/jira/browse/YARN-18 • HVE elastic resource configuration – https://issues.apache.org/jira/browse/YARN-291 • Gang Scheduling – https://issues.apache.org/jira/browse/YARN-624 • Long-lived services in YARN – https://issues.apache.org/jira/browse/YARN-896 Thanks! Junping Du jdu@vmware.com