HANDBOOK ON GREEN INFORMATION AND
COMMUNICATION SYSTEMS
Chapter 9:
Green Computing Platforms for
Biomedical Systems
Vinay Vijendra Kumar Lakshmi, Ashish Panday, Arindam
Mukherjee, and Bharat S Joshi
University of North Carolina at Charlotte
© University of North Carolina at Charlotte
Overview
• Green Computing in Biomedical Field
• Survey of Green Computing Platforms
• Analysis of popular Biomedical Applications
• Design Framework for Biomedical Embedded Processors
• Survey of Simulation tools for Design Space Exploration
• Development and Characterization of Benchmark Suite
• Design Space Exploration and Optimization Techniques of Embedded Microarchitectures
• Conclusion
• Future Research Areas
Green Computing in Biomedical Field
Computing in biomedical systems can be classified into three categories:
Implantable device level
Portable/Embedded platform level
Server level
Characteristics of Biomedical Systems
Power consumption
Renewable energy resource – energy harvesting
Heat dissipation
Minimizing area
Cost
Performance
Survey of Green Computing Platforms
Implantable Devices
• Monitor the physiological parameters of the human body.
• Examples: pacemakers, cardioverter-defibrillators, cochlear implants.
• Most implantable devices are inactive most of the time and activate in response to a stimulus from the body.
Configuration of a brain implant or brain-machine interface (BMI)
Embedded Platforms
• Physiological monitoring systems
• Recognition systems
Figure: Wearable ultra-low-power biomedical signal processor, CoolBio™.
Power Management in Intel ATOM
ATOM includes a power management control block, a power management block, a clock synthesizer, and a few programmable registers. Together they reduce noise, achieve low quiescent current, and switch voltage and frequency dynamically in real time between multiple performance modes, varying the core operating voltage and processor speed to save power and improve performance.
Figure: Power management in Intel ATOM
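The benefit of switching voltage and frequency together can be sketched with the standard CMOS dynamic-power approximation P ≈ C·V²·f. The effective capacitance and the two operating points below are made-up illustrative values, not Intel ATOM specifications, and the function names are hypothetical.

```python
def dynamic_power(c_eff_farads, voltage, freq_hz):
    """Approximate CMOS switching power in watts: P = C * V^2 * f."""
    return c_eff_farads * voltage ** 2 * freq_hz


# Hypothetical operating points (assumed values, not ATOM datasheet figures).
C_EFF = 1.0e-9  # effective switched capacitance in farads (assumed)
MODES = {
    "high_performance": (1.1, 1.6e9),  # (volts, hertz)
    "low_power": (0.8, 0.8e9),
}


def power_by_mode(modes=MODES, c_eff=C_EFF):
    """Return the estimated dynamic power for each named performance mode."""
    return {name: dynamic_power(c_eff, v, f) for name, (v, f) in modes.items()}


if __name__ == "__main__":
    for name, watts in power_by_mode().items():
        print(f"{name}: {watts:.3f} W")
```

Halving the frequency alone would halve the dynamic power in this model; lowering the voltage at the same time adds a quadratic saving, which is why the two are scaled together.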
Servers
Oracle WebLogic Server 11g was used to measure the performance of the Avitek Medical Records sample application on SPARC T3-1B and SPARC Enterprise M5000 servers from Oracle. Throughput scaled almost linearly from 8 to 16 cores on the SPARC T3-1B (14,030 to 28,156 TPS), and the 16-core result is roughly double that of the previous-generation Sun Blade T6320 (13,386 TPS).
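| Server          | Processor                           | Memory | Maximum TPS |
|-----------------|-------------------------------------|--------|-------------|
| SPARC T3-1B     | 1 x SPARC T3, 1.65 GHz, 16 cores    | 128 GB | 28,156      |
| SPARC T3-1B     | 1 x SPARC T3, 1.65 GHz, 8 cores     | 128 GB | 14,030      |
| Sun Blade T6320 | 1 x UltraSPARC T2, 1.4 GHz, 8 cores | 64 GB  | 13,386      |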
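The scaling claim can be checked directly from the Maximum TPS figures above. The sketch below is only arithmetic on those three numbers; the helper names are hypothetical.

```python
# Maximum TPS values copied from the server table above.
RESULTS = {
    "SPARC T3-1B (16 cores)": 28156,
    "SPARC T3-1B (8 cores)": 14030,
    "Sun Blade T6320 (8 cores)": 13386,
}


def speedup(new_tps, base_tps):
    """Throughput ratio of one configuration over another."""
    return new_tps / base_tps


def scaling_efficiency(tps_big, cores_big, tps_small, cores_small):
    """Observed speedup divided by the ideal (core-count) speedup."""
    return speedup(tps_big, tps_small) / (cores_big / cores_small)


if __name__ == "__main__":
    t3_16 = RESULTS["SPARC T3-1B (16 cores)"]
    t3_8 = RESULTS["SPARC T3-1B (8 cores)"]
    t2_8 = RESULTS["Sun Blade T6320 (8 cores)"]
    print(f"8 -> 16 core speedup: {speedup(t3_16, t3_8):.3f}")
    print(f"scaling efficiency:   {scaling_efficiency(t3_16, 16, t3_8, 8):.3f}")
    print(f"T3-1B (16) vs T6320:  {speedup(t3_16, t2_8):.3f}")
```

An efficiency close to 1.0 means the second set of eight cores contributes about as much throughput as the first.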
Cell Processor
Analysis of Biomedical Applications
Figure: Flowchart for choosing the algorithm-architecture combination best suited for an application. Boxes as numbered in the figure (each decision branches on Yes/No):
Start
1. Research standard definitions and processes involved in solving the desired kernel
2. Optimal time/space complexities? (decision)
3. Identify different solving techniques to optimize kernel
4. Choose parallel algorithm
5. Choose architecture
6. Is architecture-algorithm pair optimal? (decision)
7. Is algorithm acceptable? (decision)
8. Is architecture acceptable? (decision)
9. Change algorithm
10. Change architecture
Stop
Pairwise Correlation
$$ r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{S_X S_Y} $$
X : {x1, x2, x3, ..., xn}
Y : {y1, y2, y3, ..., yn}
r : coefficient of correlation
cov(X,Y) : covariance of X and Y
SX : standard deviation of X
SY : standard deviation of Y
µX : expectation of X
µY : expectation of Y
Another way to interpret PPMCC
$$ r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} $$

For a pair of channels i and j:

$$ r_{(i,j)} = \frac{n \sum_{k=1}^{n} x_{(i,k)} x_{(j,k)} - \sum_{k=1}^{n} x_{(i,k)} \sum_{k=1}^{n} x_{(j,k)}}{\sqrt{n \sum_{k=1}^{n} x_{(i,k)}^2 - \left(\sum_{k=1}^{n} x_{(i,k)}\right)^2}\,\sqrt{n \sum_{k=1}^{n} x_{(j,k)}^2 - \left(\sum_{k=1}^{n} x_{(j,k)}\right)^2}} $$
i, j : ith and jth channels, where 1≤i,j≤m
x(i,k), x(j,k) : kth sample from the ith and jth channels, where 1≤i,j≤m, i≠j and 1≤k≤n
r(i,j) : correlation coefficient between the ith and jth channels, where 1≤i,j≤m
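The channel-pair formula above translates almost line for line into code. The following is a minimal serial sketch in pure Python, not the authors' implementation; the function names and the dictionary return format are illustrative choices.

```python
from math import sqrt


def correlation(x, y):
    """Pearson r for two equal-length sample lists, using the sum form."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length channels with at least 2 samples")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    if denominator == 0:
        raise ValueError("a constant channel has no defined correlation")
    return numerator / denominator


def pairwise_correlation(channels):
    """Return {(i, j): r(i, j)} for all channel pairs with i < j (0-based)."""
    m = len(channels)
    return {
        (i, j): correlation(channels[i], channels[j])
        for i in range(m)
        for j in range(i + 1, m)
    }


if __name__ == "__main__":
    demo = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    for pair, r in pairwise_correlation(demo).items():
        print(pair, round(r, 6))
```

Because r(i,j) = r(j,i), only the m(m-1)/2 pairs with i < j are computed, and each pair is independent of the others, which is what makes the kernel a natural candidate for parallelization.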
Choosing initial algorithm and architecture
Initially, the pairwise correlation (PWC) kernel is written in serial fashion for the Intel Xeon dual-core processor. Profiling it with VTune gives the following statistics:
Table 1: Performance of serial code on Intel Xeon dual-core processor
| Code   | CPI  | L1I_MISS % | L1D_MISS % | L2_MISS % |
|--------|------|------------|------------|-----------|
| Serial | 0.84 | 22.98      | 91.54      | 60.77     |
The code is then parallelized with OpenMP and profiled again, giving the improved values shown below:
Table 2: Performance of OpenMP code on Intel Xeon dual-core processor
| Code           | CPI  | L1I_MISS % | L1D_MISS % | L2_MISS % |
|----------------|------|------------|------------|-----------|
| Parallel (OMP) | 0.67 | 27.84      | 89.23      | 25.67     |
An implementation on the Cell processor using the ring algorithm gives a speed-up of approximately 56x over the serial version on the Intel Xeon.
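To make the serial versus OpenMP comparison concrete, the sketch below computes the relative change of each metric from the two VTune tables above. It is plain arithmetic on the reported values; the dictionary keys and helper names are illustrative.

```python
# Values copied from the serial and OpenMP VTune tables above.
SERIAL = {"CPI": 0.84, "L1I_MISS%": 22.98, "L1D_MISS%": 91.54, "L2_MISS%": 60.77}
OPENMP = {"CPI": 0.67, "L1I_MISS%": 27.84, "L1D_MISS%": 89.23, "L2_MISS%": 25.67}


def relative_change(before, after):
    """Signed percentage change from before to after (negative = reduction)."""
    return 100.0 * (after - before) / before


def compare(before=SERIAL, after=OPENMP):
    """Relative change of every metric present in the 'before' profile."""
    return {metric: relative_change(before[metric], after[metric]) for metric in before}


if __name__ == "__main__":
    for metric, change in compare().items():
        print(f"{metric}: {change:+.1f}%")
```

In these numbers the largest movement is in the L2 miss percentage, which falls by more than half, while the L1 instruction miss percentage rises.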
Design Framework for Biomedical Embedded Processors
Figure: Design flow for Bio-medical Embedded Processors. Boxes as numbered in the figure (each decision branches on Yes/No):
Start
1. Development of Application Specific Benchmark Suite
2. Performance Analysis of benchmarks on Simulator tools
3. Explore the design space. Arrive at suitable embedded architecture
4. Select Simulator Tool
5. Run the optimizer to arrive at better performance and power values
6. Are latency and throughput requirements met? (decision)
7. Is the architecture optimum? (decision)
Stop
Survey of Simulation tools for Design Space Exploration
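| Features               | MV5          | M5           | CASPER       | SESC         |
|------------------------|--------------|--------------|--------------|--------------|
| Full-System Simulation | ✘            | ✔            | ✘            | ✘            |
| System-call Emulation  | ✔            | ✔            | ✘            | ✔            |
| I/O Disk               | ✔            | ✔            | ✘            | ✔            |
| ISA                    | Alpha        | Various      | SPARC        | MIPS         |
| Emulated thread API    | ✔            | ✔            | ✔            | ✔            |
| Category               | Event Driven | Cycle Driven | Trace Driven | Event Driven |
| IO Core                | ✔            | ✔            | ✔            | ✔            |
| Multithreaded core     | ✔            | ✔            | ✘            | ✘            |
| OOO Core               | ✔            | ✔            | ✔            | ✔            |
| SIMD Core              | ✔            | ✔            | ✘            | ✔            |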
Development and Characterization of Benchmark Suite
A good multicore benchmark will identify bottlenecks in
the multicore system design including memory and
I/O bottlenecks, computational bottlenecks, and
real-time bottlenecks*. In addition, a good multicore
benchmark will identify synchronization problems
where code and data blocks are split, distributed to
various compute engines for processing, and then the
results are reassembled.
*S. Gal-On, M. Levy, S. Leibson, "How to Survive the Quest for a Useful Multicore Benchmark", ECN Magazine, Dec 2009
Performance analysis of the benchmark
Analysis of PWC on various Simulator tools
CASPER
Figure: CPI per core on CASPER (y-axis: CPI, 5.7 to 6.5; x-axis: D$ size in bytes, 0 to 20000)
Figure: Average power per core on CASPER (y-axis: average power in uW, 150 to 170; x-axis: D$ size in bytes, 0 to 20000)
MV5 Simulation
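Analysis of the parallel version of the code (per-CPU results) on MV5 with various configurations

| Configuration | Frequency | Number of SIMD CPUs | Number of OOO CPUs | No. of HW+SW threads | Benchmark used | Host memory usage | Simulation time (seconds) |
|---------------|-----------|---------------------|--------------------|----------------------|----------------|-------------------|---------------------------|
| fractal_smp   | 1 GHz     | 4                   | 0                  | 64+2                 | FILTER         | 1.217 MB          | 0.019065                  |
| fractal_smp   | 1 GHz     | 4                   | 0                  | 64+2                 | PPPC           | 1.207 MB          | 0.001364                  |
| Config_hetero | 1 GHz     | 2                   | 2                  | 32+2                 | FILTER + PPPC  | 2.234 MB          | 0.070888                  |
| Config_hetero | 1 GHz     | 4                   | 4                  | 32+2                 | FILTER + PPPC  | 2.255 MB          | 1050.42                   |

| Configuration               | Total energy of CPU (mJ) | Total leakage energy of CPU (mJ) | Clock active energy (uJ) | Total cache energy (mJ) | D$ miss rate | I$ miss rate | Floating ALU active energy (mJ) | Integer ALU active energy (mJ) |
|-----------------------------|--------------------------|----------------------------------|--------------------------|-------------------------|--------------|--------------|---------------------------------|--------------------------------|
| fractal_smp on FILTER       | 26.358100                | 1.713209                         | 0.000956                 | 2.186035                | 0.257        | 0.162        | 1.0877987                       | 1.785665                       |
| fractal_smp on PPPC         | 0.010644                 | 0.002118                         | 0.000188                 | 0.003291                | 2.195        | 0.078        | 0.0004001                       | 0.000844                       |
| Config_hetero FILTER + PPPC | 29.918695                | 1.543702                         | 0.000292                 | 12.097728               | 2.876        | 0.018        | 2.241064                        | 4.005570                       |
| Config_hetero FILTER + PPPC | 32.2689733               | 4.1982526                        | 0.000182                 | 0.0081944               | 1.747        | 0.000001     | 1.0877992                       | 1.7854455                      |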
Design Space Exploration and Optimization Techniques of Embedded Microarchitectures
Approaches and optimization algorithms used for design space exploration of multicore processor architectures:
Artificial Neural Networks (ANN)
Fast Genetic Algorithms (used in CASPER)
Genetically Programmed Response Surfaces (GPRS, used on MV5)
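As a rough illustration of how a genetic algorithm walks a design space, the sketch below evolves a population of (D$ size, core count) design points. It is a generic toy, not the optimizer used in CASPER or MV5: the parameter ranges, the cost function, and all names are made up for illustration, and a real flow would replace `cost` with a simulator run.

```python
import random

# Hypothetical design space; these parameters and ranges are illustrative only.
DCACHE_KB = [1, 2, 4, 8, 16, 32]
CORE_COUNTS = [1, 2, 4, 8]


def cost(design):
    """Made-up energy-delay style cost; a real flow would call a simulator."""
    dcache_kb, cores = design
    delay = 100.0 / cores + 40.0 / dcache_kb   # more cores/cache -> faster
    energy = 5.0 * cores + 1.5 * dcache_kb     # more cores/cache -> more energy
    return delay * energy


def random_design(rng):
    return (rng.choice(DCACHE_KB), rng.choice(CORE_COUNTS))


def crossover(a, b, rng):
    # Take each parameter from one parent or the other.
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(design, rng, rate=0.2):
    # Re-draw each parameter independently with probability `rate`.
    dcache_kb, cores = design
    if rng.random() < rate:
        dcache_kb = rng.choice(DCACHE_KB)
    if rng.random() < rate:
        cores = rng.choice(CORE_COUNTS)
    return (dcache_kb, cores)


def genetic_search(generations=20, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_design(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: population_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=cost)


def exhaustive_best():
    """Brute-force reference, feasible only because the toy space is tiny."""
    return min(((d, c) for d in DCACHE_KB for c in CORE_COUNTS), key=cost)


if __name__ == "__main__":
    best = genetic_search()
    print("GA result:", best, "cost:", round(cost(best), 2))
    print("exhaustive:", exhaustive_best(), "cost:", round(cost(exhaustive_best()), 2))
```

Keeping the fitter half of each generation means the best cost never gets worse from one generation to the next, but the search is not guaranteed to reach the global optimum; the exhaustive reference is included only as a sanity check for this small space.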
Conclusion
• Presented methodologies for characterizing biomedical applications for ultra-low-energy, low-heat embedded implantable devices, as well as for low-power but high-performance embedded computing platforms. The PWC benchmark has a computational complexity of O(mn²) and, in its OpenMP version, achieved a CPI of 0.67 and an L2 cache miss percentage of 25.67 on the Intel Xeon dual-core processor.
• Outlined the procedure for design space exploration of processor microarchitectures using existing simulation tools and optimizers. A heterogeneous configuration with two IO and two OOO cores consumes less energy per CPU (29.918 mJ) than a homogeneous configuration in MV5's Alpha architecture simulation.
Future Research Areas
• Development of better, different instruction set architectures (ISAs)
• Corresponding cross-compilers to generate optimized executables for the simulators
• Upgrading existing simulation platforms to support full-system mode with real-time kernel libraries, to account for the latency and throughput of real-life applications
• Development of advanced real-time operating systems and scheduling algorithms to schedule the various applications on different heterogeneous cores to meet hard real-time constraints
Thanks for your
attention!