DAC2012-Tutorial - Computer Science and Engineering

System-Level Exploration of Power,
Temperature, Performance, and Area
for Multicore Architectures
Houman Homayoun1, Manish Arora1, Luis Angel D. Bathen2
1Department of Computer Science and Engineering, University of California, San Diego
2Department of Computer Science, University of California, Irvine
1
Outline

- Why power, temperature and reliability? (H. Homayoun)
- Tools for architects (H. Homayoun)
- Thermal simulation using HotSpot (H. Homayoun)
- Power/performance modeling for NVMs using NVSim (H. Homayoun)
- Cycle-level NoC simulation using DARSIM (H. Homayoun)
- Using bottleneck analysis and McPAT for efficient CPU design space exploration (M. Arora)
- SimpleScalar: a computer system design and analysis infrastructure (L. Bathen)
- PHiLOSoftware: Software-Controlled Memory Virtualization (L. Bathen)
  - SimpleScalar + CACTI + NVSim + SystemC
2
Power

Power consumption: a first-order design constraint
- Embedded systems: battery lifetime
- High-performance systems: heat dissipation
- Large-scale systems: billing cost
Consequences: short-lived batteries, huge heatsinks, high electric bills, and unreliable microprocessors
3
Power Importance – Data Center Example

- 2006: data centers in the US consumed 61 billion kWh, an electricity cost of about $9 billion1
- 2012: doubles to $18 billion1
- Impact of a 10% reduction of processor and RAM power consumption: a 6.5% reduction of total power consumption
  - $18B x 6.5% ≈ $1.1 billion in savings
- [Pie chart of data center power: Processor + RAM 65%, Storage 21%, Network 9%, Others 5%]
1Environmental Protection Agency (EPA), "Report to Congress on Server and Data Center Energy Efficiency", August 2, 2007
4
Temperature Trend – Temperature Crisis

- Energy = heat; heat dissipation is costly
- Increasing power dissipation in computer systems means ever-increasing cooling costs
- [Chart: max power density (W/cm2) vs. year, 1975-2015, from the 8008/8080/8085/8086 era through the 286, 386, 486, Pentium, P6, POWER2-POWER6, Itanium 2, AMD K8, Athlon II, Core 2 Duo, and Nehalem, with power densities climbing past the hot-plate level toward nuclear-reactor and rocket-nozzle territory]
5
Reliability Trend – Reliability Crisis

- Decreasing lifetime reliability
  - High power densities lead to high temperatures
  - Every 10°C temperature increase doubles the failure rate
- Technology scaling
  - Manufacturing defects, process variation
- [Chart: probability of failure, from 1 down to 1e-8, vs. relative cell size, from 1.8 down to 0.8]
Source: N. Kim, et al., "Analyzing the Impact of Joint Optimization of Cell Size", TVLSI 2011
6
Efficiency Crisis

- Moore's law allows the transistor budget to double every 18 months
- The power and thermal budgets have not changed significantly
- Result: an efficiency problem in new generations of microprocessors
- [Figure: Intel Teraflops (many-core) research chip, an 8x10 array of cores with cores powered down]
7
What Can We Do About It?

- To achieve "sustainable computing" and fight the "power problem", we need to rethink from a "green computing" perspective:
- Understand the levels of design abstraction: technology level, circuit level, and architecture level.
- Understand where power and temperature are dissipated, where reliability issues are exposed, and where performance bottlenecks exist.
- Think about ways to tackle these issues at all levels.
8
Tools for Architects

- Performance
  - CPU: SimpleScalar, GEMS5, SMTSIM
  - GPU: GPGPU-SIM
  - Network: DARSIM, NoC-SIM
- Power
  - CPU: Wattch
  - SRAM/CAM and eDRAM caches: CACTI
  - Non-volatile memory: NVSim
- Reliability: VARIUS
- Temperature: HotSpot
- Power, timing, and area for CPU + cache + main memory: McPAT
9
Thermal Simulation Using HOTSPOT
Slides Courtesy of
A Quick Thermal Tutorial by Kevin Skadron and Mircea Stan
10
Thermal Modeling1

- A fine-grained, dynamic model for temperature that architects can use
  - Accounts for adjacency and the package
  - Requires no detailed designs
  - Provides a detailed temperature distribution
  - Fast
- HotSpot: a compact model based on thermal Rs and Cs
  - Parameterized for various architectures, power models, floorplans, and thermal packages
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
11
HotSpot

- The time evolution of temperature is driven by unit activities and power dissipations averaged over N cycles (N = 10K shown to provide enough accuracy)
  - Power dissipations can come from any power simulator and act as "current sources" in the RC circuit (the 'P' vector in the equations)
  - Simulation overhead in Wattch/SimpleScalar: <1%
- Requires models of
  - Floorplan: important for adjacency
  - Package: important for spreading and time constants
- The R and C matrices are derived from the above
12
Example System1

[Cross-section of an IC package: die, interface material, heat spreader, heat sink, pins, and PCB]
[1] A Quick Thermal Tutorial, Kevin Skadron and Mircea Stan
13
HotSpot Implementation

- Primarily a circuit solver
- Steady-state solution
  - Mainly matrix inversion, done in two steps:
    - decomposition of the matrix into lower and upper triangular matrices
    - successive backward substitution of solved variables
  - Implements the pseudocode from CLR
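The two steps above (LU decomposition, then substitution) can be sketched as follows. This is an illustrative re-implementation, not HotSpot's actual source: it solves G*T = P for the steady-state temperature rise, where G is an assumed thermal conductance matrix and P a made-up per-block power vector.

```python
# Solve G*T = P the way the slide describes: LU-decompose, then substitute.
def lu_solve(A, b):
    n = len(A)
    # Doolittle LU decomposition (no pivoting; thermal conductance
    # matrices are diagonally dominant, so this is safe here)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        L[i][i] = 1.0
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    # Forward substitution: L*y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    # Backward substitution: U*x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

# Two thermally coupled blocks: diagonals are self + coupling
# conductances (W/K); off-diagonals are the negative coupling conductance.
G = [[1.5, -0.5],
     [-0.5, 1.0]]
P = [10.0, 5.0]          # per-block power in W
T_rise = lu_solve(G, P)  # temperature rise above ambient, in K
```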
- Transient solution
  - Inputs: current temperature and power
  - Output: temperature for the next interval
  - Computed using a fourth-order Runge-Kutta (RK4) method
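The transient step can be illustrated with a single thermal RC node integrated by classic RK4. This is a toy sketch with invented parameter values, not HotSpot code; HotSpot applies the same idea to the full R/C network.

```python
# Integrate dT/dt = (P - G*(T - T_amb)) / C with one RK4 step per interval.
def rk4_step(f, t, y, h):
    """One fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Illustrative single-block parameters (made up for the example):
P, G, C, T_amb = 20.0, 0.5, 30.0, 318.0   # W, W/K, J/K, K

def dTdt(t, T):
    return (P - G * (T - T_amb)) / C

T = T_amb
for _ in range(1000):               # 1000 intervals of 0.5 s
    T = rk4_step(dTdt, 0.0, T, 0.5)
# After ~8 thermal time constants, T approaches T_amb + P/G = 358 K
```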
14
Validation

- Validated and calibrated using MICRED test chips
  - Within 7% for both steady-state and transient step response
  - Interface material (chip/spreader) matters
- Also validated against an FPGA
  - 9x9 array of power dissipators and sensors
  - Compared to HotSpot configured with the same grid and package
  - Can instantiate a temperature sensor based on a ring oscillator and counter
- Also validated against IBM ANSYS FEM simulations
15
HotSpot Interface

- Inputs
  - a power trace file
  - a floorplan file
  - a config file (package information)
- Outputs
  - the corresponding transient temperatures
  - steady-state temperatures
  - a thermal map (Perl script)
16
Config File
# thermal model parameters
# chip specs
# chip thickness in meters
-t_chip             0.00015
# silicon thermal conductivity in W/(m-K)
-k_chip             100.0
# silicon specific heat in J/(m^3-K)
-p_chip             1.75e6
# temperature threshold for DTM (kelvin)
-thermal_threshold  354.95
# heat sink specs
# convection capacitance in J/K
-c_convec           140.4
# convection resistance in K/W
-r_convec           0.1
# heatsink side in meters
-s_sink             0.06
# heatsink thickness in meters
-t_sink             0.0069
# heatsink thermal conductivity in W/(m-K)
-k_sink             400.0
# heatsink specific heat in J/(m^3-K)
-p_sink             3.55e6
# heat spreader specs
# spreader side in meters
-s_spreader         0.03
# spreader thickness in meters
-t_spreader         0.001
# heat spreader thermal conductivity in W/(m-K)
-k_spreader         400.0
# heat spreader specific heat in J/(m^3-K)
-p_spreader         3.55e6
# interface material specs
# interface material thickness in meters
-t_interface        2.0e-05
# interface material thermal conductivity in W/(m-K)
-k_interface        4.0
# interface material specific heat in J/(m^3-K)
-p_interface        4.0e6
17
FLP File
# Floorplan close to the Alpha EV6 processor
# Line format: <unit-name>\t<width>\t<height>\t<left-x>\t<bottom-y>
# all dimensions are in meters
# comment lines begin with a '#'
# comments and empty lines are ignored
L2_left	0.004900	0.006200	0.000000	0.009800
L2	0.016000	0.009800	0.000000	0.000000
L2_right	0.004900	0.006200	0.011100	0.009800
Icache	0.003100	0.002600	0.004900	0.009800
Dcache	0.003100	0.002600	0.008000	0.009800
Bpred_0	0.001033	0.000700	0.004900	0.012400
Bpred_1	0.001033	0.000700	0.005933	0.012400
Bpred_2	0.001033	0.000700	0.006967	0.012400
DTB_0	0.001033	0.000700	0.008000	0.012400
DTB_1	0.001033	0.000700	0.009033	0.012400
DTB_2	0.001033	0.000700	0.010067	0.012400
FPAdd_0	0.001100	0.000900	0.004900	0.013100
FPAdd_1	0.001100	0.000900	0.006000	0.013100
FPReg_0	0.000550	0.000380	0.004900	0.014000
FPReg_1	0.000550	0.000380	0.005450	0.014000
FPMul_0	0.001100	0.000950	0.004900	0.014380
FPMul_1	0.001100	0.000950	0.006000	0.014380
FPMap_0	0.001100	0.000670	0.004900	0.015330
FPMap_1	0.001100	0.000670	0.006000	0.015330
IntMap	0.000900	0.001350	0.007100	0.014650
IntQ	0.001300	0.001350	0.008000	0.014650
IntReg_0	0.000900	0.000670	0.009300	0.015330
IntReg_1	0.000900	0.000670	0.010200	0.015330
IntExec	0.001800	0.002230	0.009300	0.013100
FPQ	0.000900	0.001550	0.007100	0.013100
LdStQ	0.001300	0.000950	0.008000	0.013700
ITB_0	0.000650	0.000600	0.008000	0.013100
ITB_1	0.000650	0.000600	0.008650	0.013100
[Accompanying figure: the EV6-like floorplan drawn to scale, with the L2 bank across the bottom, L2_left and L2_right flanking the core, and the caches and core units in between]
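A .flp file in the format above is easy to parse and sanity-check. The following sketch (not part of HotSpot) reads the documented five-field lines and sums the unit areas, here over a two-unit toy floorplan:

```python
# Parse a HotSpot-style .flp file: <unit-name> <width> <height> <left-x> <bottom-y>
def parse_flp(text):
    units = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # comments and empty lines are ignored
        name, w, h, x, y = line.split()
        units[name] = (float(w), float(h), float(x), float(y))
    return units

flp = """
# two-unit toy floorplan (dimensions in meters)
L2      0.016000  0.009800  0.000000  0.000000
Icache  0.003100  0.002600  0.004900  0.009800
"""
units = parse_flp(flp)
total_mm2 = sum(w * h for w, h, _, _ in units.values()) * 1e6
```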
18
HotSpot Modes of Running

- Block level
  - fast, less accurate
  - Example: hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -o gcc.ttrace -steady_file gcc.steady
- Grid level
  - slow, more accurate
  - Example: hotspot -c hotspot.config -f ev6.flp -p gcc.ptrace -steady_file gcc.steady -model_type grid
  - Change the grid size to trade off speed against accuracy; the default grid size is 64x64
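Both commands consume a power trace (.ptrace). A minimal one can be generated programmatically; the layout sketched below (a header row naming every floorplan unit, then one row of per-unit power values in watts per sampling interval) is the commonly used format, but confirm it against your HotSpot release before relying on it.

```python
# Write a toy HotSpot power trace with two sampling intervals.
units = ["L2", "Icache", "Dcache"]   # must match the .flp unit names
intervals = [
    [4.2, 0.9, 1.1],   # per-unit powers (W) for interval 0
    [4.0, 1.0, 1.3],   # per-unit powers (W) for interval 1
]

with open("toy.ptrace", "w") as f:
    f.write("\t".join(units) + "\n")
    for row in intervals:
        f.write("\t".join("%.4f" % p for p in row) + "\n")
```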
19
3D Modeling with HotSpot

- HotSpot's grid model is capable of modeling stacked 3D chips
- HotSpot needs a layer configuration file (LCF) for 3D simulation
  - The LCF specifies the set of vertical layers to be modeled, including their physical properties (thickness, conductivity, etc.)
20
Example: LCF file for two layers
#<Layer Number>
#<Lateral heat flow Y/N?>
#<Power Dissipation Y/N?>
#<Specific heat capacity in J/(m^3K)>
#<Resistivity in (m-K)/W>
#<Thickness in m>
#<floorplan file>
# Layer 0: Silicon
0
Y
Y
1.75e6
0.01
0.00015
ev6.flp
# Layer 1: thermal interface material (TIM)
1
Y
N
4e6
0.25
2.0e-05
ev6.flp
For instance, the above sample file shows an LCF corresponding to
the default HotSpot configuration with two layers: one layer of silicon
and one layer of Thermal Interface Material (TIM)
Command line example:
hotspot -c hotspot.config -f <some_random_file> -p example.ptrace
-o example.ttrace -model_type grid -grid_layer_file example.lcf
21
Example: Modeling memory peripheral temperature
22
Power/Performance Modeling for NonVolatile Memories Using NVSIM
23
Why Yet Another Circuit-Level Estimation Tool for Cache Memories?

- Emerging non-volatile memory devices show a large variation in performance, energy, and density.
- Some of them are performance-optimized; some of them are area-optimized...
- For system-level research, it is NOT correct to pick random device parameters from multiple sources.
24
NVSim1

- NVSim is designed to be a general circuit-level performance, power, and area model
- Memory technologies supported: NAND flash, PCM, MRAM (STT-RAM), memristor, SRAM, DRAM, and eDRAM
[1] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design Implications of Memristor-Based RRAM Cross-Point Structures", DATE 2011
25
NVSim Model

- Developed on the basis of CACTI
  - CACTI models SRAM and DRAM caches
  - CACTI does NOT support eNVM
- NVSim modifies CACTI at the subarray level and the bank level
- [Diagram of a CACTI-modeled memory subarray: row decoders, wordline drivers, precharge & equalization, a 2D array of memory cells, bitline mux, sense amplifiers, sense-amplifier mux, and output/write drivers, surrounded by peripheral circuitry]
26
Tricks (Subarray Level)

- Why is the circuit design space so large? Many design tricks:
  - Array: MOS-accessed or cross-point
  - Transistor type: high-performance, low-power, or low-standby
  - Interconnect type: wire pitch, repeater design
  - Sense amp: current-sensing or voltage-sensing
  - Driver: area-optimized or latency-optimized
27
Configuring NVSim

- NVSim provides a variety of functionalities by supporting two categories of configuration input files: <.cfg> files and <.cell> files.
  - <.cfg> files specify the non-volatile memory module parameters and tune the design-exploration knobs. The details of how to configure <.cfg> files are on the cfg-files page.
  - <.cell> files specify the non-volatile memory cell properties; the information they hold usually comes from the device level. NVSim provides default <.cell> files for PCRAM, STT-RAM, and RRAM, and allows advanced users to tailor their own cell properties by adding new <.cell> files. The details of how to configure <.cell> files are on the cell-files page.
28
NVSIM Interface
-DesignTarget: cache
//-DesignTarget: RAM
//-DesignTarget: CAM
-CacheAccessMode: Normal
//-CacheAccessMode: Fast
//-CacheAccessMode: Sequential
//-OptimizationTarget: ReadLatency
//-OptimizationTarget: WriteLatency
//-OptimizationTarget: ReadDynamicEnergy
//-OptimizationTarget: WriteDynamicEnergy
//-OptimizationTarget: ReadEDP
//-OptimizationTarget: WriteEDP
//-OptimizationTarget: LeakagePower
-OptimizationTarget: Area
//-OptimizationTarget: Exploration
//-ProcessNode: 200
//-ProcessNode: 120
//-ProcessNode: 90
//-ProcessNode: 65
-ProcessNode: 45
//-ProcessNode: 32
-Capacity (KB): 128
//-Capacity (MB): 1
-WordWidth (bit): 512
-Associativity (for cache only): 8
-DeviceRoadmap: HP
//-DeviceRoadmap: LSTP
//-DeviceRoadmap: LOP
-Routing: H-tree
//-Routing: non-H-tree
-MemoryCellInputFile: SRAM.cell
//-MemoryCellInputFile: Memristor_3.cell
//-MemoryCellInputFile: PCRAM_JSSC_2007.cell
//-MemoryCellInputFile: PCRAM_JSSC_2008.cell
//-MemoryCellInputFile: PCRAM_IEDM_2004.cell
//-MemoryCellInputFile: MRAM_ISSCC_2007.cell
//-MemoryCellInputFile: MRAM_ISSCC_2010_14_2.cell
//-MemoryCellInputFile: MRAM_Qualcomm_IEDM.cell
//-MemoryCellInputFile: SLCNAND.cell
-Temperature (K): 380
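When sweeping a design space, it is convenient to emit .cfg files like the one above programmatically. This sketch uses only option names shown on this slide; the sweep range and file names are invented for the example.

```python
# Emit NVSim .cfg files for a cache-capacity sweep.
CFG_TEMPLATE = """\
-DesignTarget: cache
-CacheAccessMode: Normal
-OptimizationTarget: Area
-ProcessNode: 45
-Capacity (KB): {capacity_kb}
-WordWidth (bit): 512
-Associativity (for cache only): 8
-DeviceRoadmap: HP
-Routing: H-tree
-MemoryCellInputFile: SRAM.cell
-Temperature (K): 380
"""

for capacity_kb in (128, 256, 512):
    fname = "sweep_%dKB.cfg" % capacity_kb
    with open(fname, "w") as f:
        f.write(CFG_TEMPLATE.format(capacity_kb=capacity_kb))
```

Each generated file can then be handed to NVSim in a batch script.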
29
Example 1: PCRAM1

- 32-nm 16MB 8-way L3 caches (with different PCRAM design optimizations)
[1] X. Dong et al., "PCRAMsim: System-Level Performance, Energy, and Area Modeling for Phase-Change RAM", ICCAD 2009
30
Cycle-Level NoC Simulation Using DARSIM
31
DARSIM1

- A parallel, highly configurable, cycle-level network-on-chip simulator based on an ingress-queued wormhole router architecture
- Most hardware parameters are configurable, including geometry, bandwidth, and crossbar dimensions
- Packets arriving flit-by-flit on ingress ports are buffered in ingress virtual-channel (VC) buffers until they have been assigned a next-hop node and VC; they then compete for the crossbar and, after crossing, depart from the egress ports
- [Figure: basic datapath of an NoC router modeled by DARSIM]
[1] M. Lis, K. S. Shim, M. H. Cho, P. Ren, O. Khan, and S. Devadas, "Darsim: A Parallel Cycle-Level NoC Simulator", ISPASS 2011.
32
DARSIM Simulation Parameters

--cycles arg          simulate for arg cycles (0 = until drained)
--packets arg         simulate until arg packets arrive (0 = until drained)
--stats-start arg     start statistics after cycle arg (default: 0)
--no-stats            do not report statistics
--no-fast-forward     do not fast-forward when system is drained
--memory-traces arg   read memory traces from file arg
--log-file arg        write a log to file arg
--random-seed arg     set random seed (default: use system entropy)
--version             show program version and exit
-h [ --help ]         show this help message and exit
33
Sample Config File
[geometry]
height = 8
width = 8
[routing]
node = weighted
queue = set
one queue per flow = false
one flow per queue = false
[node]
queue size = 8
[bandwidth]
cpu = 16/1
net = 16
north = 1/1
east = 1/1
south = 1/1
west = 1/1
[queues]
cpu = 0 1
net = 8 9
north = 16 18
east = 28 30
south = 20 22
west = 24 26
[core]
default = injector
34
Network Configuration
-x (arg) : network width (8)
-y (arg) : network height (8)
-v (arg) : number of virtual channels per set (1)
-q (arg) : capacity of each virtual channel in flits(4)
-c (arg) : core type (memtraceCore)
-m (arg) : memory type (privateSharedMSI)
-n (arg) : number of VC sets
-o (arg) : output filename (output.cfg)
35
Sample Flow Event
tick 12094
flow 0x001b0000 size 13
tick 12140
flow 0x00001f00 size 5
tick 12141
flow 0x001f0000 size 5
tick 12212
flow 0x00002100 size 5
tick 12212
flow 0x00210000 size 13
The first two lines indicate that at cycle 12094, a packet consisting of 13 flits is injected at Node 27 (0x1b), with Node 0 (0x00) as its destination.
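Reading the trace this way implies a bit layout for the flow IDs: the source node appears to sit in bits [23:16] and the destination in bits [15:8] (so flow 0x001b0000 is Node 27 to Node 0). The decoder below is a sketch of that inferred layout; verify the bit positions against your DARSIM build before trusting it.

```python
# Decode a DARSIM flow ID into (source node, destination node),
# assuming src in bits [23:16] and dst in bits [15:8].
def decode_flow(flow_id):
    src = (flow_id >> 16) & 0xFF
    dst = (flow_id >> 8) & 0xFF
    return src, dst

assert decode_flow(0x001b0000) == (27, 0)   # the example from the trace above
assert decode_flow(0x00001f00) == (0, 31)   # tick 12140: Node 0 -> Node 31
```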
36
Statistics
flit counts:
flow 00010000: offered 58, sent 58, received 58 (0 in flight)
flow 00020000: offered 44, sent 44, received 44 (0 in flight)
flow 00030000: offered 34, sent 34, received 34 (0 in flight)
flow 00040700: offered 32, sent 32, received 32 (0 in flight)
flow 00050700: offered 34, sent 34, received 34 (0 in flight)
.....
all flows counts: offered 109724, sent 109724, received 109724 (0 in flight)
in-network sent flit latencies (mean +/- s.d., [min..max] in # cycles):
flow 00010000: 4.06897 +/- 0.364931, range [4..6]
flow 00020000: 5.20455 +/- 0.756173, range [5..9]
flow 00030000: 6.38235 +/- 0.874969, range [6..10]
flow 00040700: 6.1875 +/- 0.526634, range [6..8]
flow 00050700: 5.11765 +/- 0.32219, range [5..6]
.....
all flows in-network flit latency: 9.95079 +/- 20.5398
37
Example: Effect of Routing and VCs1

The effect of routing and VC configuration on network transit latency in a relatively congested network on the WATER benchmark: while O1TURN and ROMM clearly outperform XY, the margin is not particularly impressive.
[1] M. Lis et al., "Scalable, Accurate Multicore Simulation in the 1000-Core Era", ISPASS 2011
38
Design Flow
39
Trace-Driven NUCA Non-Volatile Cache Simulation in 3D
40
Trace-Driven NUCA Non-Volatile Cache Simulation in 3D

- A cycle-accurate simulator such as SimpleScalar or SMTSIM feeds a cache trace to the NVSim-based simulator
- DARSIM performs the network simulation
- HotSpot 3D (Virginia) performs the 3D thermal simulation
- The new cache latency is fed back for its performance impact
- Temperature is fed back for accurate leakage-power modeling
41
Questions?
42
Using Bottlenecks Analysis and
McPAT for Efficient CPU Design
Space Exploration
Manish Arora1
Computer Science and Engineering
University of California, San Diego
1Credit also goes to my co-authors: Feng Wang (Qualcomm), Bob Rychlik (Qualcomm), and Dean Tullsen (UC San Diego)
Tackling Design Complexity

- Increasingly complex design decisions
  - Accurate simulation is slow
  - Simulation of all design points is not feasible
  - Multicore exacerbates the problem
  - Commonly followed techniques are inadequate
- Sensitivity analysis
  - Vary a single parameter while keeping the other parameters fixed
  - E.g., evaluate L2 performance by varying its size and keeping everything else constant
  - Dependent on the choice of the fixed point of reference (e.g., L2 performance is correlated with L1 size)
June 4 2012
44
Accelerating Design Space Exploration

- Speeding up individual simulations
  - Benchmark subsetting (SMARTS [1], SimPoint [2], and MinneSPEC [3])
  - Analytical models instead of cycle-accurate simulation (Karkhanis et al. [4])
  - Regression models to derive performance models (Lee et al. [5])
- Design space pruning
  - Hill climbing (Spacewalker [6])
  - Tabu search (Axelsson et al. [7])
  - Genetic search (Palesi et al. [8])
  - Plackett and Burman (P&B) based design (Yi et al. [9] and Arora et al. [10])
45
Plackett and Burman (P&B) Based Design

- Advantages
  - Exploration over ranges of parameter values
  - Linear or near-linear number of experiments
  - Non-iterative technique (exploits cluster parallelism)
- Workings
  - Provide a high value (+1) and a low value (-1) for each component
    - CPU frequency: 2GHz (+1) / 1GHz (-1)
    - L2 cache: 1MB (+1) / 256KB (-1), and so on...
  - Run a P&B-specified set of experiments
  - Evaluate the "Impact" of each component
    - E.g., CPU frequency has a 30% influence when CPU frequency is changed from 1GHz to 2GHz AND the L2 is changed from 256KB to 1MB AND ...
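The mechanics can be illustrated with the standard 8-run Plackett-Burman design for up to 7 two-level factors. This is an illustrative sketch, not the tutorial's tooling: the design rows are cyclic shifts of the usual PB8 generator plus an all-low run, the responses are synthetic, and "Impact" is estimated as the absolute column-weighted mean of the responses.

```python
GEN = [+1, +1, +1, -1, +1, -1, -1]          # standard PB8 generator row

def pb8_design():
    """8 runs x 7 factors: cyclic shifts of GEN plus a final all-(-1) run."""
    rows = [[GEN[(j - i) % 7] for j in range(7)] for i in range(7)]
    rows.append([-1] * 7)
    return rows

def impacts(design, responses):
    """Impact of factor j = |sum_i design[i][j] * responses[i]| / runs."""
    n = len(design)
    return [abs(sum(design[i][j] * responses[i] for i in range(n))) / n
            for j in range(len(design[0]))]

design = pb8_design()
# Synthetic responses: factor 0 (say, CPU frequency) dominates with
# weight 30, factor 2 (say, L2 size) contributes 10, the rest do nothing.
responses = [30 * row[0] + 10 * row[2] for row in design]
imp = impacts(design, responses)
# Because PB columns are orthogonal, imp recovers [30, 0, 10, 0, 0, 0, 0].
```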
46
System Under Design - 1


Sub-system consisting of 11 components
Up to 10 choices per component
47
System Under Design - 2

12 Mobile CPU centric benchmarks
48
Using P&B for Cost-Optimized Designs

- Recapping P&B
  - P&B yields a unit-less "Impact" (the influence of changing a component)
  - Provides "Impact" trends by changing upper bounds
- Constrained systems
  - Most systems are cost-constrained (area, power, or energy)
  - Need to look at cost together with performance
    - L2 cache size has a higher impact than L2 associativity
    - L2 associativity might still provide the best "bang for the buck"
- Use cost-normalized marginal impact: impact gained / cost incurred
- Use McPAT [11, 12] to evaluate baseline and marginal costs
52
McPAT: High-Level Features

- Integrated modeling framework
  - Power (peak, dynamic, short-circuit, and leakage)
  - Area
  - Critical-path timing
- Hardware-validated (~20% error)
- Configurable system components
  - Cores, NoC, clock tree, PLL, caches, memory controllers, etc.
  - Technology nodes from 90nm to 22nm
  - Device types: bulk CMOS, SOI, and double-gate
- Flexible XML interface
  - Standalone and performance-simulator integration
53
McPAT 1.0 Overview (from [11])

[Flow diagram: unspecified parameters are filled in and structures are optimized to satisfy timing; the resulting models and configurations are then used to evaluate the output numbers]
54
Framework Components

- Hierarchical power, area, and timing model
  - Models structures at a low level but allows high-level configuration
  - Models core details rigorously and allows connecting multiple cores
- Optimizer for circuit-level implementations
  - Determines unspecified parameters in the internal chip representation
  - The user specifies cache size and number of banks, but the optimizer determines cache-bank wordline and bitline lengths
  - The user can choose to specify everything themselves
- Internal chip representation
  - Driven by user inputs and those generated by the optimizer
55
Hierarchical Modeling (from [11])
56
Power, Area and Timing Modeling

- Power modeling
  - Dynamic power from load-capacitance modeling, supply voltage, clock frequency, and activity factors
  - Short-circuit power using published models
  - Leakage using published data and existing models such as MASTAR
- Timing modeling
  - Estimates the critical path
  - Uses RC delays to estimate timing, similar to CACTI
- Area modeling
  - Similar to CACTI for gates and regular structures
  - Empirical modeling techniques for non-regular structures
57
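The dynamic-power bullet above boils down to the classic switching-power formula P_dyn = alpha * C * V^2 * f. A toy calculation (illustrative numbers only, not McPAT output) looks like:

```python
# Back-of-the-envelope dynamic power: P = alpha * C * V^2 * f
def dynamic_power(alpha, c_load, vdd, freq):
    """alpha: activity factor, c_load: switched capacitance (F),
    vdd: supply voltage (V), freq: clock frequency (Hz). Returns watts."""
    return alpha * c_load * vdd ** 2 * freq

# A structure switching 1 nF at 0.9 V and 2 GHz, active 20% of cycles:
p = dynamic_power(0.2, 1e-9, 0.9, 2e9)   # 0.2 * 1e-9 * 0.81 * 2e9 = 0.324 W
```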
Multicore Architectural Modeling

- Core
  - Configurable models of fetch, execute, load-store, OOO structures, etc.
  - Reservation-station-style and physical-register-file architectures
  - In-order, OOO, and multithreaded architectures
- NoC
  - Signal link and router models
- Caches
  - Shared and private cache hierarchies
- Memory controller
  - Front end, transaction processing, and PHY models
- Clocking
  - PLL and clock-tree models
58
Circuit and Technology Level Modeling

- Wires
  - Hierarchical repeated wires for local and global wires
  - Short wires using pi-RC models
  - Latches automatically inserted and modeled to satisfy clock rates
- Devices
  - Uses ITRS 2007 roadmap data
  - 90nm, 65nm, 45nm, 32nm, and 22nm nodes supported
  - Planar bulk (down to 36nm), SOI (down to 25nm), and double-gate (22nm) modeled
- Support for power-saving modes
  - McPAT 1.0 supports multiple sleep states
  - Coming this summer (v0.8 is current)
59
McPAT Operation

- Requires input from the user and the simulator
  - Target clock rate, architectural and technology parameters
  - Optimization function (timing or ED^2)
  - Unit activity factors
- McPAT optimizes structures to satisfy timing
  - Configurations not satisfying timing are discarded
  - Optimization functions are applied to all timing-satisfying configurations
  - Numbers are calculated using the remaining configurations + activity factors
60
Downloading and Installing

- Current version 0.8 available from the HP Labs website
  - http://www.hpl.hp.com/research/mcpat/
- Download and build the tool ("make" works)
  - Works on Unix-compatible systems
- Command-line operation (standalone XML input)
  - Print levels provide verbose results
- Alternatively, can be built together with a simulator
61
Running McPAT

- Standalone mode (with an XML input file)
  - Architectural and technology details are specified within the XML
  - Find the correspondence between McPAT stats and simulator stats
  - Run the performance simulation and pass the counters to the XML

<component id="system.core0.icache" name="icache">
  <param name="icache_config" value="131072,32,8,1,8,3,32,0"/>
  <stat name="read_accesses" value="2000"/>
  <stat name="read_misses" value="116"/>
  <stat name="conflicts" value="9"/>
</component>

- Integrated with multiple simulators (M5, SMTSIM, Multi2Sim, etc.)
- Documentation gives tips on building together with a simulator
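Filling those counters into the XML can be automated when driving McPAT from a simulator. The sketch below reproduces the icache component shown on this slide; the element and attribute names follow that snippet, but your McPAT template may contain more fields.

```python
import xml.etree.ElementTree as ET

def icache_component(counters):
    """Build the per-component XML fragment from simulator counters."""
    comp = ET.Element("component", id="system.core0.icache", name="icache")
    ET.SubElement(comp, "param", name="icache_config",
                  value="131072,32,8,1,8,3,32,0")
    for stat in ("read_accesses", "read_misses", "conflicts"):
        ET.SubElement(comp, "stat", name=stat, value=str(counters[stat]))
    return ET.tostring(comp, encoding="unicode")

xml_str = icache_component({"read_accesses": 2000,
                            "read_misses": 116,
                            "conflicts": 9})
```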
62
XML Specification (Top Level)
63
XML Specification (Core)
64
XML Specification (Memory Controller)
65
Results (Top Level)
66
Results (Core)
67
Results (Memory Controller)
68
Cost-Normalized Marginal Impact

- Created an XML specification for our processor system
  - Modeled a mobile processor and obtained activity factors from a custom cycle-accurate simulator
  - Obtained baseline power and area
  - Obtained marginal costs
- Obtained the cost-normalized marginal impact
69
Results: Cost Normalized Impact
70
Obtaining Cost-Optimized Designs

- Make design decisions utilizing the marginal impact and marginal cost information
71
Results: Cost-Optimized Designs

- Set budgets to 70% to 40% of the highest-end system
- The selection algorithm minimizes impact loss while reducing cost as much as possible
  - Performance within 16% of peak @ 40% of the area
  - Performance within 19% of peak @ nearly half the power
72
To Summarize

- Looked at the problem of efficient design space exploration
- Used the Plackett and Burman method to yield "Impact", a measure of how much of a bottleneck each component is
- Understood the basic workings of McPAT
- Understood the use of McPAT to obtain area and power costs for various system configurations
- Used cost numbers to obtain cost-normalized impact
- Used cost-normalized impact values to obtain efficient design choices
73
References

[1] Wunderlich et al., "SMARTS: Accelerating microarchitectural simulation via rigorous statistical sampling", ISCA 2003.
[2] Sherwood et al., "Automatically characterizing large scale program behavior", ASPLOS 2002.
[3] KleinOsowski et al., "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research", CAL 2002.
[4] Karkhanis et al., "A first-order superscalar processor model", ISCA 2004.
[5] Lee et al., "Accurate and efficient regression modeling for microarchitectural performance and power prediction", ASPLOS 2006.
[6] Systems et al., "Spacewalker: Automated design space exploration", HP Labs 2001.
[7] Axelsson et al., "Architecture synthesis and partitioning of real-time systems", CODES 1997.
[8] Palesi et al., "Multi-objective design space exploration using genetic algorithms", CODES 2002.
[9] Yi et al., "A statistically rigorous approach for improving simulation methodology", HPCA 2003.
[10] Arora et al., "Efficient system design using the SAAB methodology", SAMOS 2012.
[11] S. Li et al., "McPAT: An integrated power, area, and timing framework", MICRO 2009.
[12] S. Li et al., McPAT 1.0 technical report, HP Labs 2009.
74
Questions?

SimpleScalar Simulator (ISS) and PHiLOSoftware Framework (SystemC)
Luis Angel D. Bathen (Danny)3
Slides courtesy of: Kyoungwoo Lee1, Aviral Shrivastava2, and Nikil Dutt3
1Dept. of Computer Science, Yonsei University
2Dept. of Computer Science & Engineering, Arizona State University
3Dept. of Computer Science, University of California, Irvine
Contents

- SimpleScalar Overview
  - Demo 1: a simple simulation (w/ version 3.1 of SimpleScalar)
- PHiLOSoftware Simulator
  - SimpleScalar + CACTI + SystemC TLM
  - Demo 2: Bus Protocol Selection
  - Demo 3: Software-Controlled Memory Virtualization
77
Contents

- SimpleScalar Overview
  - Demo 1: a simple simulation (w/ version 3.1 of SimpleScalar)
- PHiLOSoftware Simulator
  - SimpleScalar + CACTI + SystemC TLM
  - Demo 2: Bus Protocol Selection
  - Demo 3: Software-Controlled Memory Virtualization
78
Overview

- What is an architectural simulator?
  - A tool that reproduces the behavior of a computing device
- Why use a simulator?
  - Leverages a faster, more flexible software development cycle
  - Permits more design space exploration
  - Facilitates validation before H/W becomes available
  - Level of abstraction is tailored to the design task
  - Possible to increase/improve system instrumentation
  - Usually less expensive than building a real system
79
A Taxonomy of Simulation Tools
Shaded tools are included in SimpleScalar Tool Set
80
Functional vs. Performance

- Functional simulators implement the architecture
  - Perform real execution
  - Implement what programmers see
- Performance simulators implement the microarchitecture
  - Model system resources/internals
  - Concerned with time
  - Do not implement what programmers see
81
Trace- vs. Execution-Driven

- Trace-driven
  - The simulator reads a 'trace' of the instructions captured during a previous execution
  - Easy to implement; no functional components necessary
- Execution-driven
  - The simulator runs the program (trace-on-the-fly)
  - Difficult to implement
  - Advantages
    - Faster than tracing
    - No need to store traces
    - Register and memory values usually are not in the trace
    - Supports mis-speculation cost modeling
82
SimpleScalar Tool Set Overview

- Computer architecture research test bed
  - Compilers, assembler, linker, libraries, and simulators
  - Targeted to the virtual SimpleScalar architecture
  - Hosted on most any Unix-like machine
83
SimpleScalar Suite
84
Strengths of SimpleScalar

- Highly flexible
  - Functional simulator + performance simulator
- Portable
  - Host: the virtual target runs on most Unix-like systems
  - Target: the simulators can support multiple ISAs
- Extensible
  - Source is included for the compiler, libraries, and simulators
  - Easy to write simulators
- Performance
  - Runs codes approaching 'real' sizes
85
Contents

- SimpleScalar Overview
  - Demo 1: a simple simulation (w/ version 3.1 of SimpleScalar)
- PHiLOSoftware Simulator
  - SimpleScalar + CACTI + SystemC TLM
  - Demo 2: Bus Protocol Selection
  - Demo 3: Software-Controlled Memory Virtualization
86
./sim-safe helloworld

Create a new file, hello.c, with the following code:

#include <stdio.h>

int main()
{
    printf("Hello World!\n");
    return 0;
}

Then compile it using the following command:
$ $IDIR/bin/sslittle-na-sstrix-gcc -o hello hello.c
That should generate a file hello, which we then run on the simulator:
$ $IDIR/simplesim-3.0/sim-safe hello
In the output, you should be able to find the following:
sim: ** starting functional simulation **
Hello World!
87
./sim-safe test-math
TESTS-PISA – test-math

- A simple set of test executables in ./tests-pisa/bin.little/:
  - anagram, test-fmath, test-lswlr, test-printf, test-llong, test-math
- Run test-math:
$ $IDIR/simplesim-3.0/sim-safe tests-pisa/bin.little/test-math
88
./sim-safe test-math
In the output (1)
sim: ** starting functional simulation **
pow(12.0, 2.0) == 144.000000
pow(10.0, 3.0) == 1000.000000
pow(10.0, -3.0) == 0.001000
str: 123.456
x: 123.000000
str: 123.456
x: 123.456000
str: 123.456
x: 123.456000
123.456 123.456000 123 1000
sinh(2.0) = 3.62686
sinh(3.0) = 10.01787
h=3.60555
atan2(3,2) = 0.98279
pow(3.60555,4.0) = 169
169 / exp(0.98279 * 5) = 1.24102
3.93117 + 5*log(3.60555) = 10.34355
cos(10.34355) = -0.6068, sin(10.34355) = -0.79486
x 0.5x
x0.5 x
x 0.5x
-1e-17 != -1e-17 Worked!
89
./sim-safe test-math
In the output (2)
sim: ** simulation statistics **
sim_num_insn              213703 # total number of instructions executed
sim_num_refs               56899 # total number of loads and stores executed
sim_elapsed_time               1 # total simulation time in seconds
sim_inst_rate        213703.0000 # simulation speed (in insts/sec)
ld_text_base          0x00400000 # program text (code) segment base
ld_text_size               91744 # program text (code) size in bytes
ld_data_base          0x10000000 # program initialized data segment base
ld_data_size               13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base         0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size              16384 # program initial stack size
ld_prog_entry         0x00400140 # program entry point (initial PC)
ld_environ_base       0x7fff8000 # program environment base address
ld_target_big_endian           0 # target executable endian-ness, non-zero if big endian
mem.page_count                33 # total number of pages allocated
mem.page_mem                132k # total size of memory pages allocated
mem.ptab_misses               34 # total first level page table misses
mem.ptab_accesses        1546771 # total page table accesses
mem.ptab_miss_rate        0.0000 # first level page table miss rate
90
./sim-cache test-math
Cache Simulator
 Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache tests-pisa/bin.little/test-math
./sim-cache test-math
In the output:
sim: ** simulation statistics **
sim_num_insn          213703 # total number of instructions executed
sim_num_refs           56899 # total number of loads and stores executed
sim_elapsed_time           1 # total simulation time in seconds
sim_inst_rate    213703.0000 # simulation speed (in insts/sec)
il1.accesses          213703 # total number of accesses
il1.hits              189940 # total number of hits
il1.misses             23763 # total number of misses
il1.replacements       23507 # total number of replacements
il1.writebacks             0 # total number of writebacks
il1.invalidations          0 # total number of invalidations
il1.miss_rate         0.1112 # miss rate (i.e., misses/ref)
il1.repl_rate         0.1100 # replacement rate (i.e., repls/ref)
il1.wb_rate           0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses           57480 # total number of accesses
dl1.hits               56675 # total number of hits
dl1.misses               805 # total number of misses
dl1.replacements         549 # total number of replacements
dl1.writebacks           416 # total number of writebacks
dl1.invalidations          0 # total number of invalidations
dl1.miss_rate         0.0140 # miss rate (i.e., misses/ref)
dl1.repl_rate         0.0096 # replacement rate (i.e., repls/ref)
dl1.wb_rate           0.0072 # writeback rate (i.e., wrbks/ref)
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
Cache Configuration
 Cache configuration format:
   <name>:<nsets>:<bsize>:<assoc>:<repl>
   <name>  - name of the cache being defined
   <nsets> - number of sets in the cache
   <bsize> - block size of the cache (bytes)
   <assoc> - associativity of the cache
   <repl>  - block replacement strategy: 'l' - LRU, 'f' - FIFO, 'r' - random
   Example: -cache:dl1 dl1:4096:32:1:l
 Run test-math with sim-cache:
$ $IDIR/simplesim-3.0/sim-cache -cache:dl1 dl1:32:32:32:f tests-pisa/bin.little/test-math
./sim-cache -cache:dl1 dl1:32:32:32:f test-math
In the output:
sim: ** simulation statistics **
sim_num_insn          213703 # total number of instructions executed
sim_num_refs           56899 # total number of loads and stores executed
sim_elapsed_time           1 # total simulation time in seconds
sim_inst_rate    213703.0000 # simulation speed (in insts/sec)
il1.accesses          213703 # total number of accesses
il1.hits              189940 # total number of hits
il1.misses             23763 # total number of misses
il1.replacements       23507 # total number of replacements
il1.writebacks             0 # total number of writebacks
il1.invalidations          0 # total number of invalidations
il1.miss_rate         0.1112 # miss rate (i.e., misses/ref)
il1.repl_rate         0.1100 # replacement rate (i.e., repls/ref)
il1.wb_rate           0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses           57480 # total number of accesses
dl1.hits               56938 # total number of hits
dl1.misses               542 # total number of misses
dl1.replacements           0 # total number of replacements
dl1.writebacks             0 # total number of writebacks
dl1.invalidations          0 # total number of invalidations
dl1.miss_rate         0.0094 # miss rate (i.e., misses/ref)
dl1.repl_rate         0.0000 # replacement rate (i.e., repls/ref)
dl1.wb_rate           0.0000 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate          0.0000 # invalidation rate (i.e., invs/ref)
…
Different Cache Configurations
 Configuration difference (diff of the two stats files):
< -cache:dl1 dl1:32:32:32:f  # l1 data cache config, i.e., {<config>|none}
> -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}
 dl1 output differences:
< dl1.hits             56938 # total number of hits
< dl1.misses             542 # total number of misses
< dl1.replacements         0 # total number of replacements
< dl1.writebacks           0 # total number of writebacks
> dl1.hits             56675 # total number of hits
> dl1.misses             805 # total number of misses
> dl1.replacements       549 # total number of replacements
> dl1.writebacks         416 # total number of writebacks
./sim-outorder test-math
Performance Simulation
 sim-outorder
   Detailed performance simulation
   Out-of-order issue
 Run test-math with sim-outorder:
$ $IDIR/simplesim-3.0/sim-outorder tests-pisa/bin.little/test-math
./sim-outorder test-math
In the output:
sim: ** simulation statistics **
sim_num_insn            213703 # total number of instructions committed
sim_num_refs             56899 # total number of loads and stores committed
sim_num_loads            34105 # total number of loads committed
sim_num_stores      22794.0000 # total number of stores committed
sim_num_branches         38594 # total number of branches committed
sim_elapsed_time             1 # total simulation time in seconds
sim_inst_rate      213703.0000 # simulation speed (in insts/sec)
sim_total_insn          233029 # total number of instructions executed
sim_total_refs           61927 # total number of loads and stores executed
sim_total_loads          37545 # total number of loads executed
sim_total_stores    24382.0000 # total number of stores executed
sim_total_branches       42770 # total number of branches executed
sim_cycle               224302 # total simulation time in cycles
sim_IPC                 0.9527 # instructions per cycle
sim_CPI                 1.0496 # cycles per instruction
sim_exec_BW             1.0389 # total instructions (mis-spec + committed) per cycle
sim_IPB                 5.5372 # instructions per branch
IFQ_count               352201 # cumulative IFQ occupancy
IFQ_fcount               74028 # cumulative IFQ full count
ifq_occupancy           1.5702 # avg IFQ occupancy (insn's)
ifq_rate                1.0389 # avg IFQ dispatch rate (insn/cycle)
ifq_latency             1.5114 # avg IFQ occupant latency (cycle's)
ifq_full                0.3300 # fraction of time (cycle's) IFQ was full
RUU_count              1440457 # cumulative RUU occupancy
RUU_fcount               45203 # cumulative RUU full count
ruu_occupancy           6.4220 # avg RUU occupancy (insn's)
Contents
 SimpleScalar Overview (w/ 3.1 version of SimpleScalar)
   Demo 1: a simple simulation
 PHiLOSoftware Simulator (SimpleScalar + CACTI + SystemC TLM)
   Demo 2: Bus Protocol Selection
   Demo 3: Software-Controlled Memory Virtualization
A Taxonomy of Simulation Tools
ARM Cortex-M3 Address Map (figure)
Software-Controlled Memories
 Two types of memory subsystems
   Hardware-controlled caches
   Software-controlled SRAMs: ScratchPad Memories (SPMs), Tightly Coupled
    Memories (ARM), Local Stores (Cell), streaming memories, software caches
 SPMs are preferred over caches due to:
   Smaller area footprint
   Lower power consumption
   More predictable behavior
   Explicit control of their data [Leverich et al., ISCA '07]
 SPMs lack hardware support
   Compiler and programmer need to explicitly manage them
   Accessed through physical addresses
   Irregularity is difficult to capture at compile time
 New memory subsystems are deploying distributed on-chip
  software-controlled memories. What about sharing their address space?
3/21/2016
Goal: Abstracting Physical Characteristics from the Device
 Challenges:
   Software-controlled memories are explicitly accessed
     Programmers assume full access to their physical space
     Sharing issues and security risks (open environments)
   Different physical characteristics
     Due to voltage scaling: increased latencies, higher error rates
     Due to the interconnection network: different access latencies
      (local vs. remote)
     Due to technology: process variations, SRAM vs. NVM characteristics
   These challenges are mutually dependent
 Virtualization as a viable solution
   Traditional virtualization makes no difference between address spaces
   Can we exploit virtualization to minimize programmer burden while
    opportunistically exploiting the variation in physical address spaces?
PHiLOSoftware's Heterogeneous Virtual Address Space
(figure: a CMP with locally and remotely allocated on-chip SPM/NVM space,
a memory controller and interconnection network, and low-power and
high-power DRAM DIMMs)
 Drivers: multi-processing and multi-tasking over shared and distributed
  memory; low power/energy (leakage power, voltage scaling, hybrid memories,
  process variations); security (open environments, shared memory,
  encryption); reliability (process variations, soft errors; memory is the
  most vulnerable, given its large on-chip area)
 Propose the idea of virtual address spaces with different characteristics:
  (1) voltage-scaled, locally allocated SPM: low power / mid latency
  (2) voltage-scaled SPM: fault-tolerant
  (3) nominal-voltage SPM: high power / low latency
  (4)-(5) nominal-voltage, remotely allocated space: higher power / latency
  (6)-(7) NVM: high(er) write power / latency, at nominal or scaled voltage
  (8) low-power DRAM, (9) high-power DRAM
PHiLOSoftware Framework
(figure: framework layers, from applications down to the platform)
 Application layer (static): software annotations such as
  @LowPower(arr,a,64), @Reliable(arr,b,64), @Secure(arr,c,128);
  the compiler (static analysis) derives allocation policies
 Run-time layer (dynamic): run-time system (OS/hypervisor) enforces the
  policies; user profile managers for app, services, private, and third-party
  OS domains; applications arrive from an application market or via
  over-the-air updates
 Platform (variable configuration): platform type (CMP, NoC, MPSoC);
  memory technology (SRAM, NVMs, DRAM); CPUs with SPMs, DMA, NVM, HDD,
  voltage-scaled and emerging memories, low- and medium-power DRAM
PHiLOSoftware Simulator
(figure: simulator organization)
 SimpleScalar generates memory traces for the benchmarks (SHA, MOTION,
  AES, JPEG, BLOWFISH, GSM, H263, ADPCM)
 Application policies come from annotations such as @LowPower(arr,a,64),
  @Reliable(arr,b,64), @Secure(arr,c,128), each carrying a priority, HW ID,
  and physical start/end address
 The traces drive a SystemC TLM model of either a simulated RTOS
  environment (applications A1-A8 on an RTOS) or a simulated virtualized
  environment (GuestOS1-GuestOS4 on a hypervisor)
 The TLM model is parameterized by the arbitration policy, module
  connectivity, power/performance models (CACTI/NVSim), fault & variability
  models, and a platform DB (platform type, memory technology, nominal vs.
  scaled voltage, low/medium-power DRAM)

Sample architecture configuration file for a 2-core CMP with SPMs:

// create cpus
gen_0 = new generator("gen_0", MIN_PRIO+1, NORMAL, 2, 0x00);
gen_1 = new generator("gen_1", MIN_PRIO+2, NORMAL, 2, 0x01);
spmvisor_0 = new spmvisor("spmvisor_0", SPMVISORBASE,
                          SPMVISORBASE + SPMVISORSPACESIZE - 1,
                          1, 1, MIN_PRIO+0, NORMAL, 2, 0x04);
// mem
spm_0 = new spm("spm_0", SPMBASE + 0*SPM_SIZE, SPMBASE + 1*SPM_SIZE-1, 1, 1);
spm_1 = new spm("spm_1", SPMBASE + 1*SPM_SIZE, SPMBASE + 2*SPM_SIZE-1, 1, 1);
// bus
bus_0 = new channel_amba2("bus_amba2_0", false);
bus_arbiter_0 = new arbiter("arbiter_amba2_0", STATIC, false);
bus_1 = new channel_amba2("bus_amba2_1", false);
bus_arbiter_1 = new arbiter("arbiter_amba2_1", STATIC, false);
// connect modules to the bus
// bus_0->arbiter_port(*bus_arbiter_0);
gen_0->off_bus_port(*bus_0);
gen_0->on_bus_port(*bus_1);
gen_1->off_bus_port(*bus_0);
gen_1->on_bus_port(*bus_1);
spmvisor_0->bus_port_req(*bus_0);
spmvisor_0->bus_port_spm(*bus_1);
bus_0->slave_port(*spmvisor_0);
bus_0->slave_port(*ram);
bus_1->slave_port(*spm_0);
bus_1->slave_port(*spm_1);

Trace-execution loop of a simulated CPU:

while(...) {
    if(!p.is_finished() && p.has_more())
    {
        inst = p.get_next_inst();
#ifdef VERBOSE
        printf("%g %s : EXECUTE_INST <%s, 0x%lx, 0x%lx>\n",
               sc_simulation_time(), name(), inst.op.c_str(), inst.pc,
               inst.address);
#endif
#ifdef LOAD_INSTRUCTIONS
        entry_p = p.v_lut.lookup(inst.pc);
        if(entry_p != NULL)
        {
            if(entry_p->in_spm)
            {
                spm_ax_inc(p.id);
#ifdef VSPM_ENABLED
                read(&off_bus_port, entry_p->pa, packet);
#else
                read(&on_bus_port, entry_p->pa, packet);
#endif
            }
        }
#endif
#ifdef RISC_CYCLES_EN
        wait(RISC_CYCLES(S_ISA_TO_INT_ISA(inst.op)) + inst.cycles, SC_NS);
#else
        wait(1, SC_NS);
#endif
        ...
    }
}
Contents
 SimpleScalar Overview (w/ 3.1 version of SimpleScalar)
   Demo 1: a simple simulation
 PHiLOSoftware Simulator (SimpleScalar + CACTI + SystemC TLM)
   Demo 2: Bus Protocol Selection
   Demo 3: Software-Controlled Memory Virtualization
PHiLOSoftware DEMO1: Bus Protocol Selection

SimpleScalar trace (each line: instruction, PC; memory accesses add
Access Type / Address / Number of bytes):
lw    r16,0(r29)      0x400140   _YY_R_ 0x7fff8000 4
lui   r28,0x1001      0x400148
addiu r28,r28,-26160  0x400150
addiu r17,r29,4       0x400158
addiu r3,r17,4        0x400160
sll   r2,r16,2        0x400168
addu  r3,r3,r2        0x400170
addu  r18,r0,r3       0x400178
sw    r18,-32412(r28) 0x400180   _YY_W_ 0x10001b34 4
addiu r29,r29,-24     0x400188
addu  r4,r0,r16       0x400190
addu  r5,r0,r17       0x400198
addu  r6,r0,r18       0x4001a0
jal   0x403800        0x4001a8
addiu r29,r29,-24     0x403800

SystemC TLM model: 4 SimpleScalar CPUs, each with an 8KB I$ and an 8KB D$
(modeled with CACTI), connected over an AMBA AHB TLM bus.
DEMO1: Running
 Usage: testbench.x <arb protocol> <cycles to simulate>
   ./cmp_2cpu_cache.x STATIC 1000000
   Arbitration protocol can vary over STATIC, RANDOM, ROUNDROBIN, TDMA, TDMA_RR
   Cycles from 10,000 up; -1 for a full application run
 Sample run output:
501858 ram : read 0x1046370347 at address 0xc14
501940 ram : write 0x364210008 at address 0xffdfe20
501946 simple_cpu_0 : EXECUTE_INST <lw, 0x4005c8, 0x2147450764>
501951 simple_cpu_0 : EXECUTE_INST <addiu, 0x0, 0x11032>
501951 simple_cpu_0 : EXECUTE_INST <sw, 0x4005d8, 0x2147450764>
502022 ram : read 0x628966950 at address 0xc18
502114 simple_cpu_0 : EXECUTE_INST <sw, 0x4005e8, 0x268445784>
DEMO1: Output Analysis
 Effect of the arbitration protocol:
   STATIC is blocking (e.g., CPU 0 has highest priority)
     May lead to starvation of other CPUs
     Run 10000 cycles with the STATIC arb protocol to see an example
   Each arbitration protocol behaves differently
     Can be observed in the number of arbitration wait cycles at each
      cache interface
 ./cmp_2cpu_cache.x STATIC 1000000
   cache_0: Arbitration wait cycles (Reads) = 0
   cache_0: Arbitration wait cycles (Writes) = 0
   cache_1: Arbitration wait cycles (Reads) = 4581
   cache_1: Arbitration wait cycles (Writes) = 391
 ./cmp_2cpu_cache.x TDMA 1000000
   cache_0: Arbitration wait cycles (Reads) = 103313
   cache_0: Arbitration wait cycles (Writes) = 128494
   cache_1: Arbitration wait cycles (Reads) = 148688
   cache_1: Arbitration wait cycles (Writes) = 28329
DEMO1: Output Analysis
 TDMA
   Fair
   Increases total execution time because of time-slots
     FULL SIM: 8.44379e+06
 STATIC
   Might cause starvation
   Processes with high priority finish faster (e.g., CPU0 has highest priority)
     FULL SIM: 5.93757e+06
   Reversed priorities (CPU1 > CPU0): 6.0784e+06
     CPU0 has the longest task to execute and suffers performance
      degradation because CPU1 has higher priority
DEMO1: 8 Core CMP
 ./cmp_8cpu_cache.x STATIC 1000000
   1 million cycles and STATIC arbitration protocol
ram: |cache_0| Reads: 2240  Writes: 4936
ram: |cache_1| Reads: 3072  Writes: 1518
ram: |cache_2| Reads: 475   Writes: 19
-> Transaction starvation
 ./cmp_8cpu_cache.x RANDOM 1000000
   Same as above, RANDOM arbitration
ram: |cache_0| Reads: 1150  Writes: 416
ram: |cache_1| Reads: 1445  Writes: 198
ram: |cache_2| Reads: 1088  Writes: 492
ram: |cache_3| Reads: 768   Writes: 679
ram: |cache_4| Reads: 576   Writes: 880
ram: |cache_5| Reads: 576   Writes: 901
ram: |cache_6| Reads: 768   Writes: 784
ram: |cache_7| Reads: 1088  Writes: 470
-> Greater fairness
DEMO1: 8 Core CMP – Protocol Comparison

Protocol    Reads  Writes  Total  Avg. Transactions  Avg. Arb. Wait Cycles  $ Read STDev  $ Write STDev
STATIC      5787   6473    12260  766.25             9802.5                 1145.329531   1636.738941
RANDOM      7459   4820    12279  767.4375           10051.125              288.6667358   232.6671442
ROUNDROBIN  7076   4460    11536  721                68994.75               238.1438641   238.1438641
TDMA        7076   4460    11536  721                68994.75               238.1438641   238.1438641

- Almost the same number of accesses for all protocols
- Higher STDev means higher variation in the number of transactions per cache
  - E.g., STATIC vs. RANDOM, RR, TDMA
Contents
 SimpleScalar Overview (w/ 3.1 version of SimpleScalar)
   Demo 1: a simple simulation
 PHiLOSoftware Simulator (SimpleScalar + CACTI + SystemC TLM)
   Demo 2: Bus Protocol Selection
   Demo 3: Software-Controlled Memory Virtualization
Shared SPMs in Heterogeneous Multi-tasking Environments (Open Environment)
 Problem: fixed allocation policies
   Assume known single/multiple apps
   Enforce pre-defined policies
     No SPM space available? Map all data off-chip
     What about data and task criticality/priority?
 Need dynamic and selective enforcement of allocation policies
   Reduced power consumption / better performance
(figure: App1-App4 at high/medium/low task priority on CPU0/CPU1 with
SPM0/SPM1, RTOS, AMBA AHB, MM, DMA, and MM-SPM transfers; a newly
downloaded App5 wants to launch but finds no SPM space)
 Fixed policies do not work well in open environments where the number of
  applications running concurrently is unknown!
Virtual ScratchPad Memories (vSPMs)
 Without vSPMs, applications A1 and A2 compete for the same physical SPM
 With vSPMs, applications see their own dedicated SPM(s): A1 uses vSPM1,
  A2 uses vSPM2
 Block-based, priority-driven allocation policies (1K blocks)
 Physical address space: SPM space (0 to 4K-1), 4K Protected Evict Memory
  (PEM) space (4K to 8K-1), and MM space (up to nGB-1); low-priority vSPM
  blocks can be evicted from SPM space into PEM space
Priority-based Dynamic Memory Allocation
 Create vSPM(s) prior to running applications
 May have data- and application-based priorities
 Selective eviction (data priority)
   No need to evict data from SPMs on every context switch
   Low-priority data is mapped to PEM space
 Support priority-based selective allocation and eviction
  (data and application driven)
 ALL data is protected through vSPMs (data protected through SPMVisor)
 Minimize the overheads of generating trusted environments!
(figure: a new trusted application launches; App1-App4 at high/medium/low
task priority run on CPU0/CPU1 with SPM0, SPM1, and PEM; high-priority vSPM
blocks S1-S6 stay in SPM space while normal-priority blocks P1-P3 are
selectively mapped to PEM space)
DEMO3: Setup
(figure: simulated virtualized environment: GuestOS1-GuestOS4 running apps
A1-A8 on a hypervisor; 4 CPUs with 4 SPMs, SPMVisor with crypto, S-DMA, MM)
 Total of 4 active cores
 Total of 32KB on-chip space
   8KB SPM space (per core)
 Simulated 1 vSPM per app
   Total of 8 * 8KB = 64KB of on-chip virtualized space
 Simulated virtualized environment:
   Generates input for the SystemC model
   Annotated traces with context-switch information per application

name  time_slot  cx_cost  mem_sz(MB)  n_app  {program name, ...}
os_a  10000      4000     128         2      adpcm/mem.trace aes/mem.trace
os_b  10000      4000     128         2      blowfish/mem.trace gsm/mem.trace
os_c  10000      4000     128         2      h263/mem.trace jpeg/mem.trace
os_d  10000      4000     128         2      motion/mem.trace sha/mem.trace
hyp   20000      6000     128

name  entries  policy (supported: fully assoc (FIFO) = 1;
               not yet supported: 2-way set assoc = 2, 4-way set assoc = 4)
dtlb  12       1
DEMO3: Hypervisor - CX vs. vSPMs
 Hypervisor CX: on a context switch, evict data from SPMs
   Load data for the new tasks onto the SPMs
   Protects the integrity of SPM data
 Hypervisor w/ vSPMs: no need to evict data from SPMs
   Each application has a dedicated virtual space
   At run time, load the SPM allocation tables

Object Name      T_Start  T_End   Lifetime  Addr_Start  Addr_End    # of Accesses
Buffer_0x7f<7>:  1        87      86        2147450879  2147454974  7
Buffer_0x7f<8>:  16       233608  233592    2147446783  2147450878  84825
Buffer_0x10<0>:  1295     232119  230824    268435456   268439551   3822
Buffer_0x10<1>:  89       233598  233509    268439552   268443647   370
Buffer_0x10<2>:  9        233169  233160    268443648   268447743   25088
Buffer_0x10<3>:  226971   233166  6195      268447744   268451839   1058
Buffer_0x10<4>:  228507   230038  1531      268451840   268455935   1024
Buffer_0x10<5>:  230043   231574  1531      268455936   268460031   1024
Buffer_0x10<6>:  231816   233043  1227      268460032   268464127   17
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
 Comparison of CX and vSPMs
   E1:
     spmvisor_e1.x: 2 applications, 1 OS, hypervisor, 1 core, 1 SPM, 2 vSPMs
     rtos_e1.x: 2 applications, 1 OS, hypervisor, 1 core, 1 SPM

                   Traditional (CX)  SPMVisor     Improvements
Execution Time     2.89E+07          1.46E+07     49% lower execution time
Total Energy (nJ)  677486.6228       52877.27006  92% energy savings
DEMO3: Hypervisor - CX vs. vSPMs (Cont.)
 Comparison of CX and vSPMs
   E4:
     spmvisor_e4.x: 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs, 8 vSPMs
     rtos_e4.x: 8 applications, 4 OSes, hypervisor, 4 cores, 4 SPMs

                   Traditional (CX)  SPMVisor     Improvements
Execution Time     6.25E+07          2.18E+07     65% lower execution time
Total Energy (nJ)  2475745.172       465330.5984  81% energy savings

- The number of data evictions/loads due to context switching hurts both
  performance and energy
Questions?
Contact Information
Houman Homayoun
Email: hhomayou@eng.ucsd.edu
http://cseweb.ucsd.edu/~hhomayoun/
Manish Arora
Email: marora@eng.ucsd.edu
http://cseweb.ucsd.edu/~marora/
Luis Angel Bathen
Email: lbathen@uci.edu
www.ics.uci.edu/~lbathen/
Architecture Tools Publicly Available for Download
 HotSpot: http://lava.cs.virginia.edu/HotSpot/
 DARSIM (Hornet): http://csg.csail.mit.edu/hornet/
 GPGPU-SIM: http://www.gpgpu-sim.org/
 SimpleScalar: http://www.simplescalar.com/
 GEM5 (M5): http://www.m5sim.org/Main_Page
 CACTI: http://www.hpl.hp.com/research/cacti/
 McPAT: http://www.hpl.hp.com/research/mcpat/
 NVSIM: http://www.rioshering.com/nvsimwiki/index.php?title=Main_Page
 VARIUS: http://iacoma.cs.uiuc.edu/varius/