System Software Considerations for Cloud Computing on Big Data
Michael Kozuch, Intel Labs Pittsburgh
March 17, 2011

Outline
1. Background: Open Cirrus
2. Cluster software stack
3. Big Data
4. Power
5. Recent news

Open Cirrus* Cloud Computing Testbed (http://opencirrus.org)
A collaboration between industry and academia, sharing
• hardware infrastructure
• software infrastructure
• research
• applications and data sets
Sites include CMU*, GaTech*, UIUC*, KIT*, ISPRAS*, ETRI*, CESGA*, MIMOS*, IDA*, China Telecom*, and China Mobile*.
Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)
14 sites currently, with a target of around 20 over the next two years

Open Cirrus* Objectives
– Foster systems research around cloud computing
– Enable federation of heterogeneous datacenters
– Provide vendor-neutral, open-source stacks and APIs for the cloud
– Expose the research community to enterprise-level requirements
– Capture realistic traces of cloud workloads
Each site
– Runs its own research and technical teams
– Contributes individual technologies
– Operates some of the global services
Independently-managed sites… providing a cooperative research testbed

Intel BigData Cluster
[Topology diagram: blade racks of 40 nodes, 1U and 2U racks of 12-15 nodes, a 3U rack of 5 storage nodes, and a mobile rack of 8 1U nodes, each behind a 48 Gb/s top-of-rack switch (24 Gb/s for the mobile rack); racks are interconnected by 1 Gb/s links, and the cluster reaches the Internet over a 45 Mb/s T3. Key: rXrY = row X, rack Y; rXrYcZ = row X, rack Y, chassis Z.]
Node configurations include:
• 1 single-core Xeon [Irwindale], 6 GB DRAM, 366 GB disk
• 2 dual-core Xeon 5160 [Woodcrest], 4 GB DRAM, 2x 75 GB disks
• 2 quad-core Xeon E5345 [Clovertown], 8 GB DRAM, 2x 150 GB disks
• 2 quad-core Xeon E5420 [Harpertown], 8 GB DRAM, 2x 1 TB disks
• 2 quad-core Xeon E5440 [Harpertown], 8 GB DRAM, 6x 1 TB disks
• 2 quad-core Xeon E5440 [Harpertown], 16 GB DRAM, 2x 1 TB disks
• 2 quad-core Xeon E5520 [Nehalem-EP], 16 GB DRAM, 6x 1 TB disks
• 2 six-core Xeon X5650 [Westmere-EP], 48 GB DRAM, 6x 0.5 TB disks
• storage nodes with 12x 1 TB disks each

Rack(s)             Nodes   Cores   DRAM (GB)   Spindles   Storage (TB)
r2r1c1-4               40     140         240         80             12
r2r2c1-4               40     320         320         80             12
r1r1                   15     120         120         30             30
r1r2                   27     264         696        102             66
r1r3, r1r4, r2r3       45     360         360        270            270
r3r2, r3r3             30     240         480        180            180
mobile                  8      64         128         16             16
storage (r1r5)          5       -           -         60             60
TOTAL                 210    1508        2344        818            646

Cloud Software Stack – Key Learnings
• Enable use of application frameworks (Hadoop, Maui-Torque)
• Enable general IaaS use
• Provide a Big Data storage service
• Enable allocation of physical resources
The stack layers application frameworks over an IaaS service and a storage service, which in turn run on a physical resource allocator that manages the nodes.

Why physical allocation?
1. Virtualization overhead
2. Access to physical resources
3. Security issues

Zoni Functionality
Provides each project with a mini-datacenter (a toy sketch of such an interface follows):
• Allocation – assignment of physical resources to users
• Isolation – allows multiple mini-clusters (experiments) to co-exist without interference
• Provisioning – booting of a specified OS
• Management – out-of-band (OOB) power management
• Debugging – OOB console access
[Diagram: two isolated domains, each with its own PXE/DNS/DHCP server, server pool, and gateway.]
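To make the allocator layer concrete, here is a minimal sketch of a Zoni-like physical allocation service. It is an illustration only: the class, method, and node names are invented and do not reflect Zoni's actual interface; it simply mirrors the four roles listed above (allocation, isolation, provisioning, and out-of-band management).

```python
# Illustrative sketch only: a hypothetical client-side model of a Zoni-like
# physical resource allocator. All names are invented for illustration and
# do not reflect Zoni's actual API.
from dataclasses import dataclass, field


@dataclass
class MiniDatacenter:
    """A project's slice of the cluster: bare-metal nodes plus isolation metadata."""
    project: str
    nodes: list = field(default_factory=list)   # hostnames assigned to the project
    vlan: int = 0                                # isolation domain (e.g., a VLAN tag)


class PhysicalAllocator:
    """Toy in-memory allocator covering allocation, isolation, provisioning,
    and out-of-band management."""

    def __init__(self, free_nodes):
        self.free = list(free_nodes)
        self.next_vlan = 100

    def allocate(self, project, count, os_image):
        """Assign `count` free nodes to `project`, isolate them, and provision an OS."""
        if count > len(self.free):
            raise RuntimeError("not enough free nodes")
        dc = MiniDatacenter(project,
                            [self.free.pop() for _ in range(count)],
                            vlan=self.next_vlan)
        self.next_vlan += 1
        for node in dc.nodes:
            self.provision(node, os_image)       # e.g., network boot of the chosen OS
        return dc

    def provision(self, node, os_image):
        print(f"booting {node} with image {os_image}")

    def power_cycle(self, node):
        """Out-of-band power management for debugging a wedged node."""
        print(f"power-cycling {node} out of band")

    def release(self, dc):
        """Return a mini-datacenter's nodes to the free pool."""
        self.free.extend(dc.nodes)
        dc.nodes.clear()


if __name__ == "__main__":
    allocator = PhysicalAllocator([f"r1r1n{i}" for i in range(15)])
    experiment = allocator.allocate("hadoop-study", count=4, os_image="ubuntu-10.04")
    print(experiment)
    allocator.release(experiment)
```

In a real deployment, provisioning would drive PXE/DHCP and the power and console operations would go through out-of-band controllers rather than print statements.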
Intel BigData Cluster Dashboard
[Screenshot of the cluster's web dashboard.]

Big Data Example Applications
• Scientific study (e.g., earthquake study) – Big Data: ground model; algorithms: earthquake simulation, thermal conduction, …; compute style: HPC
• Internet library search – Big Data: historic web snapshots; algorithms: data mining; compute style: MapReduce
• Virtual world analysis – Big Data: virtual world database; algorithms: data mining; compute style: TBD
• Language translation – Big Data: text corpuses, audio archives, …; algorithms: speech recognition, machine translation, text-to-speech, …; compute style: MapReduce & HPC
• Video search – Big Data: video data; algorithms: object/gesture identification, face recognition, …; compute style: MapReduce
"There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC, and CBS had been airing content continuously 24/7/365 since 1948." – Gartner

Big Data
• Interesting applications are data hungry
• The data grows over time
• The data is immobile: moving 100 TB at 1 Gbps takes roughly 10 days
• Compute comes to the data
• The value of a cluster is its data
• Big Data clusters are the new libraries

Example Motivating Application: Online Processing of Archival Video
• Research project: develop a context recognition system that is 90% accurate over 90% of your day
  • Leverage a combination of low- and high-rate sensing for perception
  • Federate many sensors for improved perception
• Big Data: terabytes of archived video from many egocentric cameras
• Example query 1: "Where did I leave my briefcase?"
  • Sequential search through all video streams [Parallel Camera]
• Example query 2: "Now that I've found my briefcase, track it"
  • Cross-cutting search among related video streams [Parallel Time]

Big Data System Requirements
• Provide high-performance execution over Big Data repositories
  • Many spindles, many CPUs
  • Parallel processing
• Enable multiple services to access a repository concurrently
• Enable low-latency scaling of services
• Enable each service to leverage its own software stack
  • IaaS, with file-system protections where needed
• Enable slow resource scaling for growth
• Enable rapid resource scaling for power/demand
  • Scaling-aware storage

Storing the Data – Choices
Model 1: Separate compute/storage (dedicated compute servers and storage servers)
• Compute and storage can scale independently
• Many opportunities for reliability
Model 2: Co-located compute/storage (combined compute/storage servers)
• No compute resources are under-utilized
• Potential for higher throughput

Cluster Model
[Diagram: a cluster switch (bandwidth BWswitch) connects the external network to R racks; each rack has a top-of-rack switch and N server nodes, and each node has p cores, d disks, NIC bandwidth BWnode, and per-disk bandwidth BWdisk.]
The cluster switch quickly becomes the bottleneck. Local computation is crucial.
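To see why the cluster switch becomes the bottleneck, the back-of-envelope model below compares aggregate read throughput under random versus location-aware block placement. The default parameters echo the 20 racks of 20 two-disk servers and BWswitch = 10 Gbps quoted with the throughput chart on the next slide, but the per-disk and NIC bandwidths are assumptions, and the model is far too crude to reproduce the chart's exact bars; it only illustrates the min-of-bottlenecks reasoning.

```python
# Back-of-envelope cluster I/O model: a sketch under simplifying assumptions,
# not the analysis behind the throughput chart on the next slide.

def aggregate_read_throughput_gbps(
    racks=20,            # R racks
    nodes_per_rack=20,   # N servers per rack
    disks_per_node=2,    # d disks per server
    bw_disk=1.0,         # Gb/s per disk (assumed; use a larger value for SSDs)
    bw_node=1.0,         # Gb/s NIC per server (assumed; 1G or 10G)
    bw_switch=10.0,      # Gb/s uplink from each rack to the cluster switch
    location_aware=True,
):
    nodes = racks * nodes_per_rack
    disk_bw = nodes * disks_per_node * bw_disk   # total disk bandwidth
    nic_bw = nodes * bw_node                     # total NIC bandwidth

    if location_aware:
        # Tasks read blocks stored on their own node: only disks and NICs limit throughput.
        return min(disk_bw, nic_bw)

    # Random placement: with R racks, roughly (R-1)/R of all reads cross the
    # cluster switch, whose capacity is one uplink per rack.
    cross_rack_fraction = (racks - 1) / racks
    switch_bw = racks * bw_switch
    return min(disk_bw, nic_bw, switch_bw / cross_rack_fraction)


if __name__ == "__main__":
    rand = aggregate_read_throughput_gbps(location_aware=False)
    local = aggregate_read_throughput_gbps(location_aware=True)
    print(f"random placement:         ~{rand:.0f} Gb/s")
    print(f"location-aware placement: ~{local:.0f} Gb/s ({local / rand:.1f}x)")
```

The key observation is that with random placement the cross-rack term dominates, while with location-aware placement throughput scales with the number of local disks and NICs.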
I/O Throughput Analysis
[Chart: aggregate data throughput (Gb/s) for random versus location-aware placement across four configurations (Disk-1G, SSD-1G, Disk-10G, SSD-10G). Location-aware placement improves throughput by factors of 3.5x, 3.6x, 9.2x, and 11x across the configurations. Setup: 20 racks of 20 two-disk servers; BWswitch = 10 Gbps.]

Data Location Information
Issues:
• Many different file system possibilities (HDFS, PVFS, Lustre, etc.)
• Many different application framework possibilities
• Consumers could be virtualized
Solution:
• A standard, cluster-wide Data Location Service
• A Resource Telemetry Service to evaluate scheduling choices
• Enables virtualized location information and file-system agnosticism

Exposing Location Information
[Diagram: (a) in the non-virtualized case, a location-aware (LA) application and its LA runtime run over the DFS and OS; (b) in the virtualized case, they run in guest OSes in virtual machines over a VMM, with the DFS below. In both cases, the cluster-wide Data Location Service and Resource Telemetry Service expose location information to the LA runtime.]

Power
• (System) efficiency
• Demand scaling / power proportionality
Reference: "A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems," Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, and Albert Zomaya

Power Proportionality and Big Data
[Chart: number of blocks stored on node i (i = 1..100) for the Hadoop filesystem with 10K blocks. Possible power savings range from ~0% to ~66% depending on the data layout, versus an optimal ~95%.]

Rabbit Filesystem
A reliable, power-proportional filesystem for Big Data workloads
• Simple strategy: maintain a "primary replica"

Recent News
"Intel Labs to Invest $100 Million in U.S. University Research"
• Over five years
• Intel Science and Technology Centers (ISTCs): 3+2 year sponsored research
• Half a dozen or more by 2012
• Each can have a small number of Intel research staff on site
• A new ISTC focusing on cloud computing is possible

Tentative Research Agenda – Framing Potential Questions

Potential Research Questions
Software stack
• Is physical allocation an interesting paradigm for the public cloud?
• What are the right interfaces between the layers?
• Can multi-variable optimization work across layers?
Big Data
• Can a hybrid cloud-HPC file system provide the best of both worlds?
• How should the file system deal with heterogeneity?
• What are the right file-system sharing models for the cloud?
• Can physical resources be taken from the FS and given back?

Potential Research Questions (continued)
Power
• Can storage service power be reduced without reducing availability?
• How should a power-proportional FS maintain a good data layout?
Federation
• Which applications can cope with limited bandwidth between sites?
• What are the optimal ways to join data across clusters?
• How necessary is federation?
Overall: how should compute, storage, and power be managed to optimize for performance, energy, and fault tolerance?

Backup

Scaling – Power Proportionality
Demand scaling presents a performance/power trade-off
• Our servers: 250 W loaded, 150 W idle, 10 W off, 200 s setup (see the sketch at the end of this deck)
Research is underway for scaling cloud applications (requests arriving at a cloud-based app at rate λ):
• Control theory
• Load prediction
• Autoscaling
Scaling beyond a single tier is less well understood
Note: the proportionality issue is orthogonal to the FAWN design

Scaling – Power Proportionality (continued)
Project 1: Multi-tier power management
• E.g., a multi-tier service such as Facebook, with request rate λ
Project 2: Multi-variable optimization across the stack
• IaaS (e.g., Tashi)
• Distributed file system (e.g., Rabbit)
• Resource allocator (e.g., Zoni)
• Physical resources
Project 3: Collective optimization
• Open Cirrus may have a key role
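As a concrete illustration of the demand-scaling trade-off on the backup slide, the sketch below uses the quoted server figures (250 W loaded, 150 W idle, 10 W off, 200 s setup) to estimate when powering a server off beats leaving it idle, and what a simple headroom policy costs. The assumption that a server draws roughly its loaded power while setting up, and the helper names, are ours; this is not a description of any production autoscaler.

```python
# Idle-vs-off trade-off for demand scaling, using the server figures from the
# backup slide (250 W loaded, 150 W idle, 10 W off, 200 s setup). The assumption
# that a server draws roughly its loaded power while booting is ours, not the slide's.

P_LOADED = 250.0     # W while serving requests
P_IDLE   = 150.0     # W powered on but idle
P_OFF    = 10.0      # W powered off
T_SETUP  = 200.0     # s to bring a powered-off server back online
P_SETUP  = P_LOADED  # assumed power draw during setup


def breakeven_idle_seconds():
    """Shortest idle gap for which off-then-restart uses less energy than staying idle.

    Staying idle for T seconds costs P_IDLE * T.
    Powering off costs P_OFF * T plus a one-time setup penalty P_SETUP * T_SETUP.
    """
    return P_SETUP * T_SETUP / (P_IDLE - P_OFF)


def cluster_power_watts(active, spares, spares_off=True):
    """Power draw for `active` loaded servers plus `spares` kept as headroom."""
    spare_power = P_OFF if spares_off else P_IDLE
    return active * P_LOADED + spares * spare_power


if __name__ == "__main__":
    print(f"break-even idle gap: ~{breakeven_idle_seconds():.0f} s")
    # Headroom policy comparison for 100 active servers and 20 spares:
    print(f"spares off : {cluster_power_watts(100, 20, spares_off=True):.0f} W "
          f"(but a 200 s delay to absorb a demand spike)")
    print(f"spares idle: {cluster_power_watts(100, 20, spares_off=False):.0f} W "
          f"(spike absorbed immediately)")
```

With these numbers the break-even idle gap is roughly six minutes, which is why load prediction matters: turning servers off only pays when demand is expected to stay low for longer than that.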