HANDBOOK ON GREEN INFORMATION AND COMMUNICATION SYSTEMS

Chapter 9: Green Computing Platforms for Biomedical Systems
Vinay Vijendra Kumar Lakshmi, Ashish Panday, Arindam Mukherjee, and Bharat S Joshi
University of North Carolina at Charlotte
© University of North Carolina at Charlotte

Overview
- Green Computing in Biomedical Field
- Survey of Green Computing Platforms
- Analysis of Popular Biomedical Applications
- Design Framework for Biomedical Embedded Processors
- Survey of Simulation Tools for Design Space Exploration
- Development and Characterization of Benchmark Suite
- Design Space Exploration and Optimization Techniques of Embedded Micro-architectures
- Conclusion
- Future Research Areas

Green Computing in Biomedical Field
Computing in biomedical systems can be classified into three categories:
- Implantable device level
- Portable/embedded platform level
- Server level

Characteristics of Biomedical Systems
- Power consumption
- Renewable energy resource: energy harvesting
- Heat dissipation
- Minimizing area
- Cost
- Performance

Survey of Green Computing Platforms

Implantable Devices
- Implantable devices monitor the physiological parameters of the human body.
- Examples: pacemakers, cardioverter-defibrillators, cochlear implants.
- Most implantable devices are inactive most of the time and activate in response to a stimulus from the body.
Configuration of a brain implant or brain-machine interface (BMI)

Embedded Platforms
- Physiological monitoring systems
- Recognition systems
Wearable ultra-low power biomedical signal processor, CoolBio™.

Power Management in Intel ATOM
ATOM includes a power management control block, a power management block, a clock synthesizer, and a few programmable registers. Together they reduce noise, achieve low quiescent current, and switch voltage and frequency dynamically in real time between multiple performance modes, varying the core operating voltage and processor speed to save power and improve performance.
Figure: Power management in Intel ATOM

Servers
The Oracle WebLogic Server 11g software was used to demonstrate the performance of the Avitek Medical Records sample application. A configuration using SPARC T3-1B and SPARC Enterprise M5000 servers from Oracle scaled well across configurations: maximum throughput roughly doubled from 8 to 16 cores, and the 16-core SPARC T3-1B delivered about twice the maximum TPS of the previous-generation SPARC blade.

| Server | Processor | Memory | Maximum TPS |
|---|---|---|---|
| SPARC T3-1B | 1 x SPARC T3, 1.65 GHz, 16 cores | 128 GB | 28,156 |
| SPARC T3-1B | 1 x SPARC T3, 1.65 GHz, 8 cores | 128 GB | 14,030 |
| Sun Blade T6320 | 1 x UltraSPARC T2, 1.4 GHz, 8 cores | 64 GB | 13,386 |

Cell Processor

Analysis of Biomedical Applications
Flowchart for choosing the algorithm-architecture combination best suited for an application. Boxes, in numbered order (items phrased as questions are Yes/No decisions):
Start
1. Research standard definitions and processes involved in solving the desired kernel
2. Optimal time/space complexities?
3. Identify different solving techniques to optimize kernel
4. Choose parallel algorithm
5. Choose architecture
6. Is architecture-algorithm pair optimal?
7. Is algorithm acceptable?
8. Is architecture acceptable?
9. Change algorithm
10. Change architecture
Stop

Pairwise Correlation

$$ r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{S_X S_Y} $$

X: {x1, x2, x3, ..., xn}
Y: {y1, y2, y3, ..., yn}
r: coefficient of correlation
cov(X,Y): covariance of X and Y
S_X: standard deviation of X
S_Y: standard deviation of Y
µ_X: expectation of X
µ_Y: expectation of Y

Another way to express the Pearson product-moment correlation coefficient (PPMCC):

$$ r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} $$

For m channels of n samples each:

$$ r_{(i,j)} = \frac{n \sum_{k=1}^{n} x_{(i,k)} x_{(j,k)} - \sum_{k=1}^{n} x_{(i,k)} \sum_{k=1}^{n} x_{(j,k)}}{\sqrt{n \sum_{k=1}^{n} x_{(i,k)}^2 - \left(\sum_{k=1}^{n} x_{(i,k)}\right)^2} \, \sqrt{n \sum_{k=1}^{n} x_{(j,k)}^2 - \left(\sum_{k=1}^{n} x_{(j,k)}\right)^2}} $$

i, j: ith and jth channel, where 1≤i,j≤m
x(i,k), x(j,k): kth sample from the ith and jth channel, where 1≤i,j≤m, i≠j and 1≤k≤n
r(i,j): correlation coefficient between the ith and jth channel, where 1≤i,j≤m

Choosing initial algorithm and architecture
Initially, the pairwise correlation (PWC) kernel is written as serial code for the Intel Xeon dual-core processor. Profiling it with VTune gives the following statistics.

Table 1: Performance of serial code on Intel Xeon Dual Core processor

| Code | CPI | L1I_MISS % | L1D_MISS % | L2_MISS % |
|---|---|---|---|---|
| Serial | 0.84 | 22.98 | 91.54 | 60.77 |

The code is then parallelized with OpenMP and analyzed again. CPI and the L1D and L2 miss rates improve, while the L1I miss rate rises, as shown below.

Table 4.3: Performance of OpenMP code on Intel Xeon Dual Core processor

| Code | CPI | L1I_MISS % | L1D_MISS % | L2_MISS % |
|---|---|---|---|---|
| Parallel (OMP) | 0.67 | 27.84 | 89.23 | 25.67 |

An implementation on Cell using the Ring Algorithm gives a speed-up of approximately 56x compared with the serial version on Intel Xeon.

Design Framework for Biomedical Embedded Processors
Design flow for biomedical embedded processors. Boxes, in numbered order (items phrased as questions are Yes/No decisions):
Start
1. Development of application-specific benchmark suite
2. Performance analysis of benchmarks on simulator tools
3. Explore the design space. Arrive at suitable embedded architecture
4. Select simulator tool
5. Run the optimizer to arrive at better performance and power values
6. Are latency and throughput requirements met?
7. Is the architecture optimum?
Stop

Survey of Simulation tools for Design Space Exploration

| Features | MV5 | M5 | CASPER | SESC |
|---|---|---|---|---|
| Full-system simulation | ✘ | ✔ | ✘ | ✘ |
| System-call emulation | ✔ | ✔ | ✘ | ✔ |
| I/O disk | ✔ | ✔ | ✘ | ✔ |
| ISA | Alpha | Various | SPARC | MIPS |
| Emulated thread API | ✔ | ✔ | ✔ | ✔ |
| Category | Event driven | Cycle driven | Trace driven | Event driven |
| In-order (IO) core | ✔ | ✔ | ✔ | ✔ |
| Multithreaded core | ✔ | ✔ | ✘ | ✘ |
| Out-of-order (OOO) core | ✔ | ✔ | ✔ | ✔ |
| SIMD core | ✔ | ✔ | ✘ | ✔ |

Development and Characterization of Benchmark Suite
A good multicore benchmark will identify bottlenecks in the multicore system design, including memory and I/O bottlenecks, computational bottlenecks, and real-time bottlenecks*. In addition, a good multicore benchmark will identify synchronization problems where code and data blocks are split, distributed to various compute engines for processing, and then the results are reassembled.
*S. Gal-On, M. Levy, S. Leibson, "How to Survive the Quest for a Useful Multicore Benchmark", ECN Magazine, Dec 2009

Performance analysis of the benchmark
Analysis of PWC on various simulator tools

CASPER
Figure: CPI per core on CASPER (CPI versus D$ size in bytes)
Figure: Average power per core on CASPER (average power in uW versus D$ size in bytes)

MV5 Simulation
Analysis of the parallel version of the code (per-CPU results) on MV5 with various configurations

| Parameter | fractal_smp | fractal_smp | Config_hetero | Config_hetero |
|---|---|---|---|---|
| Frequency | 1 GHz | 1 GHz | 1 GHz | 1 GHz |
| Number of SIMD CPUs | 4 | 4 | 2 | 4 |
| Number of OOO CPUs | 0 | 0 | 2 | 4 |
| No. of HW+SW threads | 64+2 | 64+2 | 32+2 | 32+2 |
| Benchmark used | FILTER | PPPC | FILTER + PPPC | FILTER + PPPC |
| Host memory usage | 1.217 MB | 1.207 MB | 2.234 MB | 2.255 MB |
| Simulation time (seconds) | 0.019065 | 0.001364 | 0.070888 | 1050.42 |

| Configuration | Total energy of CPU (mJ) | Total leakage energy of CPU (mJ) | Clock active energy (uJ) | Total cache energy (mJ) | D$ miss rate | I$ miss rate | Floating ALU active energy (mJ) | Integer ALU active energy (mJ) |
|---|---|---|---|---|---|---|---|---|
| fractal_smp on FILTER | 26.358100 | 1.713209 | 0.000956 | 2.186035 | 0.257 | 0.162 | 1.0877987 | 1.785665 |
| fractal_smp on PPPC | 0.010644 | 0.002118 | 0.000188 | 0.003291 | 2.195 | 0.078 | 0.0004001 | 0.000844 |
| Config_hetero, FILTER + PPPC | 29.918695 | 1.543702 | 0.000292 | 12.097728 | 2.876 | 0.018 | 2.241064 | 4.005570 |
| Config_hetero, FILTER + PPPC | 32.2689733 | 4.1982526 | 0.000182 | 0.0081944 | 1.747 | 0.000001 | 1.0877992 | 1.7854455 |

Design Space Exploration and Optimization Techniques of Embedded Micro-architectures
Approaches and optimization algorithms used for design space exploration of multicore processor architectures:
- Artificial Neural Networks (ANN)
- Fast Genetic Algorithms (used in CASPER)
- Genetically Programmed Response Surfaces (GPRS, used on MV5)

Conclusion
- The chapter presents methodologies for characterizing biomedical applications for ultra-low-energy, low-heat embedded implantable devices, as well as for low-power, high-performance embedded computing platforms.
- For the PWC benchmark, the computational complexity is $O(mn^2)$; the OpenMP version achieved a CPI of 0.67 and an L2 cache miss rate of 25.67% on the Intel Xeon dual-core processor.
- The chapter outlines the procedure for design space exploration of processor micro-architectures using existing simulation tools and optimizers.
- A heterogeneous configuration with two IO and two OOO cores consumes less energy per CPU (29.918 mJ) than a homogeneous configuration in MV5's Alpha architecture simulation.

Future Research Areas
- Development of better and different instruction set architectures (ISAs)
- Corresponding cross-compilers to generate optimized executables for the simulators
- Upgrading existing simulation platforms to support full-system mode with real-time kernel libraries, to account for the latency and throughput of real-life applications
- Development of advanced real-time operating systems and scheduling algorithms to schedule the various applications on different heterogeneous cores and meet hard real-time constraints

Thanks for your attention!
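
Supplementary sketch: pairwise correlation (PWC)
The multi-channel correlation coefficient r(i,j) defined on the Pairwise Correlation slide can be sketched in Python as follows. This is a minimal serial illustration under stated assumptions, not the authors' Xeon, OpenMP, or Cell implementation: the function name `pairwise_correlation`, the list-of-lists data layout, the NaN result for constant channels, and the diagonal value of 1.0 are all choices made here for illustration.

```python
import math
from typing import List, Sequence


def pairwise_correlation(channels: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the m x m matrix of Pearson coefficients r(i, j).

    channels[i][k] is the k-th sample of channel i (hypothetical layout);
    every channel must hold the same number of samples n.
    """
    m = len(channels)
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same number of samples")

    # Per-channel sums are shared by every pair, so compute them once.
    sums = [sum(ch) for ch in channels]
    sq_sums = [sum(x * x for x in ch) for ch in channels]
    # n * sum(x^2) - (sum x)^2 for each channel: the terms under the roots.
    spread = [n * sq_sums[i] - sums[i] ** 2 for i in range(m)]

    # The slide defines r(i, j) for i != j; the diagonal is set to 1.0 here
    # purely by convention (an assumption of this sketch).
    r = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            cross = sum(a * b for a, b in zip(channels[i], channels[j]))
            numerator = n * cross - sums[i] * sums[j]
            if spread[i] <= 0.0 or spread[j] <= 0.0:
                # A constant channel has zero variance, so r is undefined.
                value = float("nan")
            else:
                value = numerator / (math.sqrt(spread[i]) * math.sqrt(spread[j]))
            r[i][j] = r[j][i] = value  # the coefficient is symmetric
    return r


if __name__ == "__main__":
    demo = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    for row in pairwise_correlation(demo):
        print(row)
```

Each (i, j) pair in the nested loop is computed independently of the others, which is the property a parallel version (for example one thread per group of pairs) would presumably exploit.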