System Software for Big Data Computing
Cho-Li Wang, The University of Hong Kong

HKU High-Performance Computing Lab.
• Total # of cores: 3004 CPU + 5376 GPU cores
• RAM size: 8.34 TB
• Disk storage: 130 TB
• Peak computing power: 27.05 TFlops
• GPU cluster (Nvidia M2050, "Tianhe-1a"): 7.62 TFlops
[Chart: computing power of the CS Gideon-II & CC MDRP clusters, growing from 2.6 TFlops (2007.7) to 3.1 TFlops (2009), 20 TFlops (2010), and 31.45 TFlops (2011.1), a 12x increase in 3.5 years.]

Big Data: The "3Vs" Model
• High Volume (amount of data)
• High Velocity (speed of data in and out)
• High Variety (range of data types and sources)
About 2.5 x 10^18 bytes of data are created every day. In 2010 the world's data reached 800,000 petabytes, enough to fill a stack of DVDs reaching from the earth to the moon and back; by 2020, that pile of DVDs would stretch half way to Mars.

Our Research
• Heterogeneous manycore computing (CPUs + GPUs)
• Big Data computing on future manycore chips
• Multi-granularity computation migration

(1) Heterogeneous Manycore Computing (CPUs + GPUs)
JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture, targeting heterogeneous manycore architectures that couple CPUs with a GPU.

New GPU & Coprocessors
• Intel Sandy Bridge (2011Q1, 32 nm): 12 HD Graphics 3000 EUs (8 threads/EU), 850-1350 MHz, 95 W TDP, L3 8 MB + system DDR3 memory, 21 GB/s; OpenCL.
• Intel Ivy Bridge (2012Q2, 22 nm): 16 HD Graphics 4000 EUs (8 threads/EU), 650-1150 MHz, 77 W TDP, L3 8 MB + system DDR3 memory, 25.6 GB/s; OpenCL.
• Intel Xeon Phi (2012H2, 22 nm): 60 x86 cores (each with a 512-bit vector unit), 600-1100 MHz, 300 W TDP, 8 GB GDDR5, 320 GB/s; OpenMP, OpenCL, OpenACC. Less sensitive to branch-divergent workloads.
• AMD Brazos 2.0 (2012Q2, 40 nm): 80 Evergreen shader cores, 488-680 MHz, 18 W TDP, L2 1 MB + system DDR3 memory, 21 GB/s; OpenCL, C++ AMP; APU.
• AMD Trinity (2012Q2, 32 nm): 128-384 Northern Islands cores, 723-800 MHz, 17-100 W TDP, L2 4 MB + system DDR3 memory, 25 GB/s; OpenCL, C++ AMP; APU.
• Nvidia Fermi (2010Q1, 40 nm): 512 CUDA cores (16 SMs), 1300 MHz, 238 W TDP, L1 48 KB / L2 768 KB, 6 GB, 148 GB/s; CUDA, OpenCL, OpenACC.
• Nvidia Kepler GK110 (2012Q4, 28 nm): 2880 CUDA cores, 836/876 MHz, 300 W TDP, 6 GB GDDR5, 288.5 GB/s; CUDA, OpenCL, OpenACC; 3x Perf/Watt, Dynamic Parallelism, Hyper-Q.
Note: for the integrated parts, the bandwidth quoted is the system DDR3 memory bandwidth.

#1 in Top500 (11/2012): Titan @ Oak Ridge National Lab.
• 18,688 AMD Opteron 6274 16-core CPUs (32 GB DDR3)
• 18,688 Nvidia Tesla K20X GPUs
• Total RAM size: over 710 TB
• Total storage: 10 PB
• Peak performance: 27 Petaflop/s
  o GPU : CPU = 1.311 TF/s : 0.141 TF/s = 9.3 : 1
• Linpack: 17.59 Petaflop/s
• Power consumption: 8.2 MW
Each Titan compute board carries 4 AMD Opteron CPUs and 4 NVIDIA Tesla K20X GPUs; each Tesla K20X (Kepler GK110) has 2688 CUDA cores.

Design Challenge: GPUs Can't Handle Dynamic Loops
GPUs are SIMD/vector machines, so inter-iteration data-dependency issues (RAW, WAW) are a problem.
• Static loop: for(i=0; i<N; i++) { C[i] = A[i] + B[i]; }
• Dynamic loop: for(i=0; i<N; i++) { A[ w[i] ] = 3 * A[ r[i] ]; }
Non-deterministic data dependencies inhibit exploitation of the inherent parallelism; only DO-ALL loops or embarrassingly parallel workloads get admitted to GPUs. Yet dynamic loops are common in scientific and engineering applications. (Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies.")

GPU-TLS: Thread-Level Speculation on GPU
• Incremental parallelization
  o Sliding-window style execution
• Deferred update
  o Speculative updates are stored in the write buffer of each thread until commit time
• Efficient dependency-checking schemes
Three phases of execution:
• Phase I: Speculative execution
• Phase II: Dependency checking, distinguishing intra-thread RAW, inter-thread RAW that remains valid under the GPU's lock-step execution, and true inter-thread RAW
• Phase III: Commit
(The GPU executes the 32 threads of a warp in lock-step.)
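To make the deferred-update idea concrete, below is a minimal CPU-side Java sketch of thread-level speculation for the dynamic loop A[ w[i] ] = 3 * A[ r[i] ] shown above. It is only an illustration, not the GPU-TLS implementation: the class and variable names (TlsSketch, buf, CHUNK) are invented, the dependency check is a naive O(N^2) scan rather than one of the efficient schemes mentioned above, and it runs on CPU threads instead of GPU warps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TlsSketch {
    static final int N = 4096, THREADS = 8, CHUNK = N / THREADS;

    public static void main(String[] args) throws Exception {
        Random rnd = new Random(42);
        int[] A = new int[N], r = new int[N], w = new int[N];
        for (int i = 0; i < N; i++) {
            A[i] = rnd.nextInt(100);
            r[i] = rnd.nextInt(N);   // subscripts known only at run time
            w[i] = rnd.nextInt(N);
        }
        int[] seq = A.clone();       // reference: plain sequential execution
        for (int i = 0; i < N; i++) seq[w[i]] = 3 * seq[r[i]];

        // Phase I: speculative execution. Each thread runs one contiguous chunk of
        // iterations and buffers its writes instead of updating A (deferred update).
        List<Map<Integer, Integer>> buf = new ArrayList<>();
        for (int t = 0; t < THREADS; t++) buf.add(new HashMap<>());
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<?>> tasks = new ArrayList<>();
        for (int t = 0; t < THREADS; t++) {
            final int tid = t;
            tasks.add(pool.submit(() -> {
                for (int i = tid * CHUNK; i < (tid + 1) * CHUNK; i++) {
                    // intra-thread RAW is honoured by consulting the local write buffer first
                    int v = buf.get(tid).getOrDefault(r[i], A[r[i]]);
                    buf.get(tid).put(w[i], 3 * v);
                }
            }));
        }
        for (Future<?> f : tasks) f.get();
        pool.shutdown();

        // Phase II: dependency checking. A true inter-thread RAW exists when iteration i
        // read a location that an earlier iteration j (j < i) in a *different* chunk wrote,
        // because j's value was still sitting in a write buffer when i read A directly.
        boolean violated = false;
        outer:
        for (int i = 0; i < N; i++)
            for (int j = 0; j < i; j++)
                if (j / CHUNK != i / CHUNK && w[j] == r[i]) { violated = true; break outer; }

        if (violated) {
            // mis-speculation: discard the buffers and re-execute sequentially
            for (int i = 0; i < N; i++) A[w[i]] = 3 * A[r[i]];
        } else {
            // Phase III: commit buffered writes in chunk (= iteration) order, so later
            // chunks correctly overwrite earlier ones and WAW order is preserved
            for (int t = 0; t < THREADS; t++)
                for (Map.Entry<Integer, Integer> e : buf.get(t).entrySet())
                    A[e.getKey()] = e.getValue();
        }
        System.out.println("violation detected: " + violated
                + ", matches sequential result: " + Arrays.equals(A, seq));
    }
}
```

With sparse dependences the commit path wins and the loop runs in parallel; with dense or random subscripts the check fires and execution falls back to sequential, which is the behaviour JAPONICA's profile-guided scheduler (next) tries to predict before dispatching work.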
JAPONICA: Profile-Guided Work Dispatching
• Dynamic profiling measures inter-iteration dependences:
  o Read-After-Write (RAW)
  o Write-After-Read (WAR)
  o Write-After-Write (WAW)
• The scheduler dispatches work according to dependency density:
  o High: multi-core CPU (8 high-speed x86 cores)
  o Medium: parallel execution on many-core coprocessors (64 x86 cores)
  o Low/None: highly/massively parallel execution on the GPU (2880 cores)

JAPONICA: System Architecture
• Sequential Java code with user annotation goes through JavaR code translation and static dependence analysis, which builds a Program Dependence Graph (PDG) per loop:
  o No dependence: the DO-ALL parallelizer generates CUDA kernels and CPU multi-threads (CPU multi-threads + GPU many-threads)
  o Uncertain: the profiler (on GPU) runs dependency-density analysis on one loop, with intra-warp and inter-warp dependency checks, and feeds the profiling results to the scheduler
  o RAW: the speculator generates CUDA kernels with GPU-TLS plus a CPU single-thread version
  o WAW/WAR: privatization generates CUDA kernels with privatization plus a CPU single-thread version
• Task scheduler: CPU-GPU co-scheduling with task sharing and task stealing. Tasks are assigned to the CPU and GPU according to their dependency density (DD): high DD runs on a single CPU core, low DD on CPU + GPU-TLS, and zero DD on CPU multi-threads + GPU. The CPU queue holds low-, high-, and zero-DD tasks, the GPU queue holds low- and zero-DD tasks, and the two sides communicate.

(2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES (鳄鱼 "Crocodile" @ HKU, 01/2013-12/2015)
• Target: "general purpose" manycore chips with a tile-based architecture, in which cores are connected through a 2D network-on-chip; the project aims at future 1000-core tiled processors.
[Diagram: a tiled chip partitioned into zones (ZONE 1-4), with memory controllers, DRAM, GbE and PCI-E interfaces at the chip boundary.]
• Dynamic zoning
  o Multi-tenant cloud architecture: the partition varies over time, mimicking a "data center on a chip"
  o Performance isolation
  o On-demand scaling
  o Power efficiency (high flops/watt)

Design Challenge: the "Off-chip Memory Wall" Problem
• DRAM performance (latency) has improved only slowly over the past 40 years.
[Figures: (a) the gap between DRAM density and speed; (b) DRAM latency has not improved.]
• Memory density has doubled nearly every two years, while performance has improved slowly: a memory access still costs 100+ core clock cycles.

Lock Contention in Multicore Systems
• Physical memory allocation performance, sorted by function: as more cores are added, more processing time is spent contending for locks (see the sketch below).
[Figure: Exim on Linux collapses as cores are added; kernel CPU time in milliseconds/message.]

Challenges and Potential Solutions
• Cache-aware design
  o Data locality / the working set is getting critical
  o Compiler or runtime techniques to improve data reuse
• Stop multitasking
  o Context switching breaks data locality
  o Move from time sharing to space sharing

马其顿方阵众核操作系统 ("Macedonian Phalanx" many-core OS): a next-generation operating system for 1000-core processors.
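The lock-contention collapse above can be reproduced in miniature in plain Java. The micro-benchmark below is my own illustration (names and iteration counts are arbitrary, not from the slides): it times the same number of increments against a single shared AtomicLong, where every core contends for one memory word, and against a striped LongAdder, which gives each thread its own cell. The striped version embodies the same "partition instead of share" instinct that motivates space sharing on a many-core chip.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class ContentionSketch {
    static final int THREADS = Runtime.getRuntime().availableProcessors();
    static final long OPS_PER_THREAD = 5_000_000L;

    // run THREADS workers, each performing OPS_PER_THREAD increments, and return elapsed ms
    static long run(Runnable increment) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        long start = System.nanoTime();
        for (int t = 0; t < THREADS; t++)
            pool.submit(() -> { for (long i = 0; i < OPS_PER_THREAD; i++) increment.run(); });
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong shared = new AtomicLong();   // every update targets the same memory word
        LongAdder striped = new LongAdder();    // updates are spread over per-thread cells

        System.out.println("shared AtomicLong: " + run(shared::incrementAndGet)
                + " ms, total = " + shared.get());
        System.out.println("striped LongAdder: " + run(striped::increment)
                + " ms, total = " + striped.sum());
    }
}
```

On a machine with many cores the gap between the two lines grows with the core count, mirroring the way kernel time per message grows in the Exim experiment.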
Thanks! For more information, see C. L. Wang's webpage: http://www.cs.hku.hk/~clwang/ and http://i.cs.hku.hk/~clwang/recruit2012.htm

Multi-granularity Computation Migration
[Figure: the spectrum of migration granularity, from coarse to fine: WAVNet Desktop Cloud, G-JavaMPI, JESSICA2, SOD; the second axis is system scale (size of state), from small to large.]

WAVNet: Live VM Migration over WAN
• A P2P cloud with live VM migration over WAN: a "virtualized LAN" over the Internet
• High penetration via NAT hole punching: establishes direct host-to-host connections, free from proxies and able to traverse most NATs
Reference: Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, "WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud," 2011 International Conference on Parallel Processing (ICPP2011).

WAVNet: Experiments at Pacific Rim Areas
• IHEP, Beijing (北京高能物理所)
• AIST, Japan (日本产业技术综合研究所)
• SDSC, San Diego
• SIAT, Shenzhen (深圳先进院)
• Academia Sinica, Taiwan (中央研究院)
• Providence University, Taiwan (静宜大学)
• HKU (香港大学)

JESSICA2: Distributed Java Virtual Machine
• JESSICA: Java Enabled Single System Image Computing Architecture
• A multithreaded Java program runs across a cluster of JESSICA2 JVMs (one master, many workers), with transparent thread migration between nodes in JIT-compiler mode using portable Java frames.

History and Roadmap of the JESSICA Project
• JESSICA V1.0 (1996-1999)
  o Execution mode: interpreter mode
  o JVM kernel modification (Kaffe JVM)
  o Global heap: built on top of TreadMarks (lazy release consistency + homeless)
• JESSICA V2.0 (2000-2006)
  o Execution mode: JIT-compiler mode
  o JVM kernel modification
  o Lazy release consistency + migrating-home protocol
• JESSICA V3.0 (2008-2010)
  o Built above the JVM (via JVMTI)
  o Supports a Large Object Space
• JESSICA V4 (2010-)
  o Japonica: automatic loop parallelization and speculative execution on GPUs and multicore CPUs
  o TrC-DC: a software transactional memory system on clusters with distributed clocks (not discussed)
J1 and J2 received a total of 1107 source-code downloads.
Past members: King Tin Lam, Kinson Chan, Chenggang Zhang, Ricky Ma.

Stack-on-Demand (SOD)
[Diagram: a stack frame (program counter, local variables) is shipped from a mobile node to a cloud node and rebuilt there against the cloud node's method area and heap; objects are (pre-)fetched on demand.]

Elastic Execution Model via SOD
• (a) "Remote method call"
• (b) Mimic thread migration
• (c) "Task roaming": like a mobile agent roaming over the network or a workflow
With such flexible and composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage), making it a Big Data solution: lightweight, portable, adaptable.

eXCloud: Integrated Solution for Multi-granularity Migration
[Diagram: migration at several granularities. Stack-on-demand ships stack segments and a partial heap from a small-footprint JVM on a mobile client (e.g. iOS) over the Internet; thread migration (JESSICA2) moves threads between multi-threaded Java processes running in Xen VMs; live VM migration is triggered through a load balancer on the Xen-aware host OS when a host becomes overloaded, and the cloud service provider duplicates VM instances for scaling; desktop PCs can join as well.]
Reference: Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011). (Best Paper Award)
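To give a flavour of what shipping a stack frame means, here is a toy, self-contained Java sketch. Everything in it (SodFrame, run, ship) is invented for illustration; the real SOD/eXCloud runtime captures genuine JVM stack frames (program counter, local variables) transparently and fetches heap objects on demand, whereas this sketch serializes an explicit state object and resumes a loop on the "other side".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SodSketch {

    /** Captured execution state of one method: a "program counter" plus local variables. */
    static class SodFrame implements Serializable {
        int pc;           // index of the next loop iteration to execute
        long partialSum;  // local variable carried across the migration
        int n;            // loop bound
    }

    /** Execute the method from the state in the frame, suspending at the migration point. */
    static SodFrame run(SodFrame f, int migrateAt) {
        for (; f.pc < f.n; f.pc++) {
            if (f.pc == migrateAt) return f;   // suspend here: the frame is ready to be shipped
            f.partialSum += f.pc;
        }
        return f;
    }

    /** Simulate shipping the frame over the network with a serialize/deserialize round trip. */
    static SodFrame ship(SodFrame f) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(f);
        out.flush();
        return (SodFrame) new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        SodFrame f = new SodFrame();
        f.n = 1_000;

        f = run(f, 400);   // "mobile node": execute until iteration 400
        f = ship(f);       // ship only the small frame, not the whole heap or VM image
        f = run(f, -1);    // "cloud node": rebuild the frame and run to completion

        long expected = (long) f.n * (f.n - 1) / 2;
        System.out.println("sum = " + f.partialSum + " (expected " + expected + ")");
    }
}
```

Shipping only the topmost frame keeps the migrated state small, which is what makes SOD light enough for mobile clients; the cost is that deeper frames and heap objects must be fetched lazily when execution needs them.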