Based on the paper by Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot, Fellow, IEEE
IEEE Computer Architecture Letters, Volume 10, Issue 1









Introduction
Motivation and Background
Previous Work
Methodology
Prefetcher Performance
Energy Efficiency
Energy Consumption Analysis
Energy Efficiency Model
Conclusion

Data prefetching is the process of fetching data that a program will need in advance, before the instruction that requires it executes.

It hides apparent memory latency.

Data prefetching has been a successful
technique in modern high-performance
computing platforms.

It was found, however, that prefetching
significantly increases power consumption.

Embedded mobile systems typically have
constraints on space, cost, and power.

This means they cannot afford
power-hungry techniques.

Hence, prefetching was long considered
unsuitable for embedded systems.

Embedded mobile systems are now
powered by capable dual-core
processors such as the
Nvidia Tegra 2.

Smartphone applications
include web browsing,
multimedia, gaming, and
Webtop control, all of
which demand very high
performance from the
computing system.

To meet this requirement, methods such as
prefetching, which were earlier shunned, can
now be reconsidered.

With better, more power-efficient technology,
the energy consumption behavior of prefetching
may also have changed.

For this reason, we study and model
the energy efficiency of different
types of prefetchers.

Over the years, the main bottleneck preventing
systems from getting faster has been the
slowness of memory, not processor speed.

Prefetching data can be implemented in hardware by
observing fetching patterns, such as prefetching the
most recently used data first.

Sequential prefetching takes advantage of spatial
locality in the memory.

Tagged prefetching associates a tag bit with every
memory block and prefetches based on that value.

Stride-based prefetching detects stride
patterns in the address stream, such as fetches
from different iterations of the same loop.
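A minimal sketch of this stride-detection idea, assuming a small table indexed by the load instruction's PC (the structure and names below are ours, not the paper's):

```python
# Illustrative stride prefetcher sketch: per load PC, remember the last
# address and last stride; when the same stride repeats, predict the next
# address. This is a toy model, not the paper's hardware design.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Return a predicted prefetch address, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Two consecutive identical strides -> confident prediction.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, None)
        return prefetch

# Loads from successive iterations of a loop, stride 8:
pf = StridePrefetcher()
hints = [pf.access(pc=0x400, addr=a) for a in [100, 108, 116, 124]]
# The first two accesses only train the table; afterwards each access
# predicts the next iteration's address.
```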

Stream prefetchers try to capture sequential
nearby misses and prefetch an entire block at a
time.

Correlated prefetchers issue prefetches based
on the previously recorded correlations between
addresses of cache misses.
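As an illustrative sketch of the correlation idea (class and variable names are ours, not from the paper), a toy correlation prefetcher records which miss address historically followed which, and replays that pair:

```python
# Toy correlation (Markov-style) prefetcher sketch: remember the miss
# address that followed each miss, and prefetch it when the first address
# misses again. Names and structure are ours, for illustration only.

class CorrelationPrefetcher:
    def __init__(self):
        self.correlations = {}  # miss addr -> addr of the miss that followed it
        self.last_miss = None

    def on_miss(self, addr):
        """Record the (previous miss -> this miss) pair; return a prediction."""
        if self.last_miss is not None:
            self.correlations[self.last_miss] = addr
        self.last_miss = addr
        return self.correlations.get(addr)

pf = CorrelationPrefetcher()
# First pass over a pointer-chasing miss stream: nothing to predict yet.
for addr in [0xA0, 0xB4, 0xC8]:
    pf.on_miss(addr)
# On a repeat of the stream, the recorded correlation predicts the next miss.
pred = pf.on_miss(0xA0)  # previously followed by 0xB4
```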

There have been some studies focused on improving
energy efficiency in hardware prefetching. PARE is
one such technique; it constructs a power-aware hardware prefetching engine.

PARE categorizes memory accesses into different
groups. It uses an indexed hardware history table that is
continuously updated; memory fetches are
categorized, and prefetching decisions are
based on the information in this table.



Modern embedded mobile
systems execute a wide
variety of workloads.
The first set includes two
XML data processing
benchmarks taken from
Xerces-C++: an event-based
parsing model, which is data
centric (SAX), and a tree-based
parsing model, which is
document centric (DOM).

Table 1 Benchmark Set
Xerces-C++     SAX, DOM
MediaBench II  JPEG2000 Encode, JPEG2000 Decode, H.264 Encode, H.264 Decode
PARSEC         Fluidanimate, Freqmine

The second set is taken from MediaBench II, which
provides application-level benchmarks representing
multimedia and entertainment workloads, based on
the ISO JPEG-2000 standard.

It also includes the H.264 video compression standard.

The third set is taken from the PARSEC (Princeton
Application Repository for Shared-Memory
Computers) benchmark suite for multithreaded processors,
which is used in many gaming applications.



Cache hierarchy indicates
the cache levels that the
prefetcher covers.
Prefetching degree
shows whether the
prefetcher's degree
is static or dynamically adjusted.
Trigger L1 and Trigger L2
show what triggers a
prefetch at the L1 and L2
levels, respectively.

Table 2 Summary of Prefetchers

      cache hierarchy   prefetching degree   trigger L1   trigger L2
P1    L1 & L2           Dynamic              miss         access
P2    L1                Static               miss         N/A
P3    L1 & L2           Dynamic              miss         miss
P4    L2                Static               N/A          miss
P5    L1 & L2           Dynamic              miss         miss
P6    L2                Static               N/A          access



To study the performance
of the selected prefetchers,
we use CMP$IM, a Pin-based
multi-core cache simulator,
to model high-performance
embedded systems.
Simulation parameters are
shown in Table 3, which
resembles modern
smartphone and e-book
systems.

Table 3 Simulation Parameters

Frequency           1 GHz
Issue Width         4
Instruction Window  128 entries
L1 Data Cache       32 KB, 8-way, 1 cycle
L1 Inst. Cache      32 KB, 8-way, 1 cycle
L2 Unified Cache    512 KB, 16-way, 20 cycles
Memory              256 MB, 200-cycle latency

To study the impact of prefetching on the energy
consumption of the memory subsystem, we use
CACTI to model the energy parameters of
different technology implementations.

In a simulator, a hardware prefetcher can be
defined by a set of hardware tables; its
output is in the form of tables of data, hence
its energy consumption can be modeled.

Prefetching techniques are effective at
improving performance, by more than 5% on
average. In detail, the effectiveness of a
prefetcher depends on both the prefetching
technique itself and the nature of the application.

P3 delivers the best average performance
because it is the most aggressive prefetcher.

The JPEG2000 decoding and encoding programs
receive up to 22% performance improvement
due to their streaming nature.

Fig 1 Performance Improvement

We study the energy efficiency of both 90 nm
and 32 nm technologies. The results are
summarized in Figures 2 and 3 respectively.

The baseline for comparison is energy
consumption without any prefetcher, thus a
positive number shows that with the prefetcher
the system dissipates more energy.

For instance, 0.1 means that with the prefetcher,
the system dissipates 10% more energy
compared to baseline.

In 90 nm technology, most prefetchers
significantly increase overall energy
consumption, which confirms the findings of
previous studies.

Thus, in 90 nm technology, only very
conservative prefetchers can be energy
efficient.

Fig 2 90 nm

Fig 3 32 nm

In 32 nm technology, P4 is still the most energy efficient
prefetcher, reducing overall energy by almost 4% on
average; when running JPEG 2000 Decode, it achieves
close to 10% energy saving.

P2 and P3 are still the most energy-inefficient prefetchers
due to their aggressiveness. However, in the worst case
they only consume 25% extra energy, a four-fold
reduction compared to the 90 nm implementations.

Thus, most prefetchers can provide performance
gains with less than 5% energy overhead, and P1 and P4
even result in 2% to 5% energy reductions.

In Equation 1, the total energy consumption
consists of two contributors: static energy
(Estatic) and dynamic energy (Edynamic).

Nm is the number of read/write memory
accesses.

Edynamic is the product of the number of read/write
accesses (Nm) and the energy dissipated on the bus
and memory subsystem per access (E'm).

Estatic is the product of the overall execution time (t) and
the system's static power consumption (Pstatic).

When prefetchers accelerate execution, the reduced
execution time lowers the static energy
consumption.

However, prefetchers generate a significant number of
extra memory subsystem accesses, leading to pure
dynamic energy overheads.

Equation 1:
E = Estatic + Edynamic = (Pstatic × t) + (Nm × E'm)
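As a numeric illustration of Equation 1 (all parameter values below are made up for the example, not the paper's measurements):

```python
# Sketch of Equation 1: E = (Pstatic * t) + (Nm * E'm).
# Parameter values are hypothetical, chosen only to illustrate the trade-off.

def total_energy(p_static_w, t_s, n_accesses, e_per_access_j):
    """Total energy = static energy + dynamic memory energy (Equation 1)."""
    e_static = p_static_w * t_s               # Pstatic x t
    e_dynamic = n_accesses * e_per_access_j   # Nm x E'm
    return e_static + e_dynamic

# Baseline vs. a prefetching run: prefetching shortens execution time
# (less static energy) but adds extra memory accesses (more dynamic energy).
e_base = total_energy(p_static_w=0.5, t_s=1.00,
                      n_accesses=1_000_000, e_per_access_j=2e-7)
e_pref = total_energy(p_static_w=0.5, t_s=0.90,
                      n_accesses=1_200_000, e_per_access_j=2e-7)
```

With these hypothetical numbers the saved static energy slightly outweighs the extra dynamic energy, so the prefetching run comes out marginally ahead.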
Table 4 Energy Category

Dynamic memory    dynamic activities of the memory subsystem
Static memory     memory subsystem static power consumption
Dynamic prefetch  dynamic activities of the prefetcher
Static prefetch   prefetcher hardware static power consumption

Fig 4

In 90 nm technology, dynamic energy
contributes up to 66% of the total energy
consumption: 14% from the prefetcher and 52%
from the memory subsystem. Static energy
accounts for only 34% of the total energy
consumption.

Hence, although the prefetchers are able to
reduce execution time, there is little room
for total energy saving, leading to energy
inefficiency for most prefetchers in 90 nm
implementations.

In 32 nm technology, static energy contributes
over 66% of the total energy consumption: 65%
from the memory subsystem, and 1% from the
prefetcher hardware.

Dynamic energy is far smaller than static energy.

In 32 nm technology, prefetchers become energy-efficient in many different cases.

We propose an analytical model to evaluate energy efficiency. A prefetcher is energy-efficient when:
Equation 2: Eno-pref > Epref

To simplify the model, we assume there is only one level in the
memory subsystem. Compared to Eno-pref, Epref has two extra
contributors: the static and dynamic energy consumption
of the prefetcher hardware.

Equation 3:
Pm-static × t1 + Nm1 × E'm > Pm-static × t2 + Nm2 × E'm + Pp-static × t2 + Np × E'p

Equation 4:
(t1 − t2)/t1 > [(Nm2 − Nm1) × E'm + Np × E'p + Pp-static × t2] / (Pm-static × t1)
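The step from Equation 3 to Equation 4 is pure algebra: collect the memory static-energy terms on the left, then divide both sides by Pm-static × t1. A sketch of the rearrangement in the paper's symbols:

```latex
P_{m\text{-}static}\,(t_1 - t_2) > (N_{m2} - N_{m1})\,E'_m + N_p\,E'_p + P_{p\text{-}static}\,t_2

\frac{t_1 - t_2}{t_1} > \frac{(N_{m2} - N_{m1})\,E'_m + N_p\,E'_p + P_{p\text{-}static}\,t_2}{P_{m\text{-}static}\,t_1}
```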

The left-hand side shows the performance gain
as a result of prefetching.

The numerator of the right-hand side contains three
terms: the energy overhead incurred by the extra
memory accesses, and the dynamic and static
energy consumption of the prefetcher hardware.

The denominator of the right-hand side represents the
static energy of the original design without
prefetching.

As summarized in Equation 5, for a prefetcher
to be energy efficient, the performance
gain (G) it brings must be greater than the ratio
of the energy overhead (Eoverhead) it incurs to
the original static energy (Eno-pref-static).
Equation 5:
G > Eoverhead / Eno-pref-static
Equation 6:
EEI = G − Eoverhead / Eno-pref-static


We define the metric Energy
Efficiency Indicator (EEI) in
Equation 6. A positive EEI
indicates the prefetcher is
energy-efficient, and vice
versa.
We have validated the
analytical results against the
empirical results shown in
the table below, indicating the
simplicity and
effectiveness of our
analytical model.
       P1     P2     P3     P4    P5     P6
90 nm  -0.1   -0.5   -0.69  0.03  -0.27  -0.31
32 nm   0.03  -0.05  -0.07  0.05   0.00  -0.14
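The EEI check of Equations 5 and 6 can be sketched numerically (the gain and overhead values below are hypothetical, not drawn from the table above):

```python
# Sketch of Equations 5 and 6: a prefetcher is energy-efficient when its
# performance gain G exceeds the ratio of its energy overhead to the
# baseline static energy. All inputs here are hypothetical.

def eei(gain, e_overhead, e_no_pref_static):
    """Energy Efficiency Indicator (Equation 6); positive => energy-efficient."""
    return gain - e_overhead / e_no_pref_static

# Hypothetical aggressive prefetcher: 8% speedup but a large overhead.
ei_aggressive = eei(gain=0.08, e_overhead=0.06, e_no_pref_static=0.50)
# Hypothetical conservative prefetcher: 5% speedup with a small overhead.
ei_conservative = eei(gain=0.05, e_overhead=0.01, e_no_pref_static=0.50)
# ei_aggressive is negative (inefficient); ei_conservative is positive.
```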

With the new trend toward highly capable
embedded mobile applications, it now seems
conducive to implement high-performance
techniques such as PREFETCHING.

Prefetchers no longer seem to put a burden on energy
consumption and should thus be implemented.

A simple analytical model has been
demonstrated to estimate the effects of
prefetching and to calculate its energy efficiency effectively.

System designers can estimate the energy
efficiency of their hardware prefetcher
designs and make changes accordingly.