Outline
•Introduction
•Different Scratch Pad Memories
•Cache and Scratch Pad for embedded
applications
Memories in Embedded Systems
Each memory has its own advantages
CPU
Internal ROM
Internal SRAM
External DRAM
For better performance memory accesses have to be fast
Efficient Utilization of Scratch-Pad
Memory in Embedded Processor
Applications
What is Scratchpad Memory?
• Fast on-chip SRAM
• Abbreviated as SPM
• 2 types of SPM:
  Static: SPM locations do not change at runtime
  Dynamic: SPM locations change at runtime
Objective
• Find a technique for efficiently exploiting on-chip SPM by partitioning the application's
scalar and array variables between off-chip DRAM
and on-chip SPM.
• Minimize the total execution time of the
application.
SPM and Cache
• Similarities
  Connected to the same address and data buses.
  Access latency of 1 processor cycle (for the cache, on a hit).
• Difference
  SPM guarantees single-cycle access time, while an
  access to cache is subject to a miss.
Block Diagram of Embedded Processor Application
Division of Data Address Space between SRAM
and DRAM
Example: Histogram Evaluation Code
• Builds a histogram of 256 brightness levels for the pixels of
an N × N image:
char BrightnessLevel[512][512];
int Hist[256]; /* Elements initialized to 0 */
…
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        /* For each pixel (i, j) in image */
        level = BrightnessLevel[i][j];
        Hist[level] = Hist[level] + 1;
    }
}
Problem Description
• If the code is executed on a processor
configured with a 1 KB data cache,
performance will be degraded by conflict
misses in the cache between elements of the
2 arrays Hist and BrightnessLevel.
• Solution: selectively map to SPM those
variables that cause the maximum number of
conflicts in the data cache.
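To make the conflict concrete, the sketch below assumes a direct-mapped cache with 16-byte lines. Neither the mapping policy nor the line size is given on the slide, and all names and constants here are hypothetical. Under those assumptions, two addresses that lie a multiple of 1 KB apart (for example an element of Hist and an element of BrightnessLevel) compete for the same cache line.

```c
#include <stdint.h>

#define CACHE_BYTES 1024u  /* 1 KB data cache, as in the example */
#define LINE_BYTES  16u    /* assumed line size (not given in the slides) */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

/* Line index of an address in a direct-mapped cache (assumed organization). */
unsigned cache_line_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
}

/* Two addresses conflict when they map to the same line index
   but belong to different memory blocks. */
int addresses_conflict(uintptr_t x, uintptr_t y) {
    return cache_line_index(x) == cache_line_index(y)
        && (x / LINE_BYTES) != (y / LINE_BYTES);
}
```

Moving one of the two arrays into SPM takes it out of the cached address range, so this kind of collision can no longer occur between them.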
Partitioning Strategy
• Features affecting partitioning
Scalar variables and constants
Size of arrays
Life-times of array variables
Access frequency of array variables
Conflicts in loops
• Partitioning Algorithm
Features affecting partitioning
• Scalar variables and constants
  All scalar variables and scalar constants are
  mapped onto SPM.
• Size of Arrays
  Arrays that are larger than the SPM are mapped
  onto off-chip memory.
Features affecting partitioning
• Lifetime of an Array Variable
  Definition: the period between its definition and its
  last use.
  Variables with disjoint lifetimes can be stored in
  the same processor register.
  Likewise, arrays with disjoint lifetimes can share the same
  memory space.
Features affecting partitioning
• Intersecting Lifetimes ILT(u)
  Definition: the number of array variables whose lifetimes
  have a non-null intersection with that of u.
  Indicates the number of other arrays u could
  possibly interact with in the cache.
  So map the arrays with the highest ILT values into
  SPM, thereby eliminating a large number of
  potential conflicts.
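The ILT definition can be sketched in a few lines of C. The `Lifetime` record and the use of integer program points for the interval ends are assumptions made for illustration; the slides do not specify a representation.

```c
#include <stddef.h>

/* Hypothetical lifetime record: [start, end] are program-point indices. */
typedef struct { int start; int end; } Lifetime;

/* Two closed intervals intersect when neither ends before the other starts. */
static int lifetimes_intersect(Lifetime a, Lifetime b) {
    return a.start <= b.end && b.start <= a.end;
}

/* ILT(u): number of other arrays whose lifetime overlaps that of array u. */
int ilt(const Lifetime *arr, size_t n, size_t u) {
    int count = 0;
    for (size_t v = 0; v < n; v++)
        if (v != u && lifetimes_intersect(arr[u], arr[v]))
            count++;
    return count;
}
```

For three arrays with lifetimes [0,10], [5,15] and [12,20], the middle array overlaps both of the others, so it has the highest ILT and would be the first candidate for SPM.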
Features affecting partitioning
• Access frequency of Array Variables
  Variable Access Count VAC(u)
    Definition: the number of accesses to elements
    of u during its lifetime.
  Interference Access Count IAC(u)
    Definition: the number of accesses to other
    arrays during the lifetime of u.
  Interference Factor IF(u) = VAC(u) * IAC(u)
Features affecting partitioning
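• Conflicts in Loops
for i = 0 to N-1
    access a[i]
    access b[i]
    access c[2i]
    access c[2i + 1]
end for
Loop Conflict Graph (LCG)
  Edge weight: $e(u, v) = \sum_{i=1}^{p} k_i$
  where p is the number of loops accessing u and v, and
  $k_i$ is the total number of accesses to u and v in loop i
  Total number of accesses to a and c combined: (1 + 2) * N = 3N
  => e(a, c) = 3N; e(b, c) = 3N; e(a, b) = 0
  (a and b are accessed with the same index pattern, so they are treated as non-conflicting)
[Figure: LCG with nodes a, b and c; edges a-c and b-c, each with weight 3N]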
Features affecting partitioning
• Loop Conflict Factor
  Definition: the sum of the weights of the edges incident to node
  u.
  $LCF(u) = \sum_{v \in LCG - \{u\}} e(u, v)$
  The higher the LCF, the more conflicts are likely for an
  array, and the more desirable it is to map the array to the SPM.
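As a worked instance of the LCF definition, the sketch below fills in the edge weights of the three-array example loop (e(a,c) = e(b,c) = 3N, e(a,b) = 0) and sums the incident edges per node. The matrix representation and function names are assumptions chosen for illustration.

```c
#include <stddef.h>

#define NUM_ARRAYS 3  /* a, b, c from the example loop */
enum { ARR_A, ARR_B, ARR_C };

/* Fill the symmetric edge-weight matrix for the example loop with trip count n. */
void build_example_lcg(long e[NUM_ARRAYS][NUM_ARRAYS], long n) {
    for (size_t u = 0; u < NUM_ARRAYS; u++)
        for (size_t v = 0; v < NUM_ARRAYS; v++)
            e[u][v] = 0;
    e[ARR_A][ARR_C] = e[ARR_C][ARR_A] = 3 * n;  /* e(a,c) = 3N */
    e[ARR_B][ARR_C] = e[ARR_C][ARR_B] = 3 * n;  /* e(b,c) = 3N */
    /* e(a,b) stays 0, as on the slide */
}

/* LCF(u): sum of the weights of all LCG edges incident to node u. */
long lcf(long e[NUM_ARRAYS][NUM_ARRAYS], size_t u) {
    long sum = 0;
    for (size_t v = 0; v < NUM_ARRAYS; v++)
        if (v != u)
            sum += e[u][v];
    return sum;
}
```

With these weights, LCF(a) = LCF(b) = 3N and LCF(c) = 6N, so c is the array most worth placing in SPM.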
Partitioning Algorithm
• Algorithm for determining the mapping
decision of each (scalar and array) program
variable to SPM or DRAM/cache.
• First assigns scalar constants and variables to
SPM.
• Arrays that are larger than SPM are mapped
onto DRAM.
Partitioning Algorithm
• For the remaining n arrays, generates lifetime
intervals and computes LCF and IF values.
• Sorts the 2n interval points thus generated and
traverses them in increasing order.
• For each array u encountered, if there is sufficient
SPM space for u and for all arrays whose lifetimes
intersect the lifetime interval of u and whose LCF and IF
values are more critical, maps u to SPM; otherwise maps it to
DRAM/cache.
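The array-placement test described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: "more critical" is taken to mean a higher LCF with IF as tie-breaker, the `ArrayInfo` record is hypothetical, and because this simplified test does not depend on earlier decisions, the sort of the 2n interval points is omitted.

```c
#include <stddef.h>

/* Hypothetical per-array record. */
typedef struct {
    int  start, end;  /* lifetime interval (program points) */
    long size;        /* size in bytes */
    long lcf;         /* loop conflict factor */
    long ifv;         /* interference factor, VAC * IAC */
    int  in_spm;      /* output: 1 = SPM, 0 = DRAM/cache */
} ArrayInfo;

static int lifetimes_overlap(const ArrayInfo *x, const ArrayInfo *y) {
    return x->start <= y->end && y->start <= x->end;
}

/* Assumed ranking: higher LCF first, IF breaks ties. */
static int more_critical(const ArrayInfo *x, const ArrayInfo *y) {
    if (x->lcf != y->lcf)
        return x->lcf > y->lcf;
    return x->ifv > y->ifv;
}

/* Map u to SPM only if u plus every overlapping, more critical array fits. */
void partition_arrays(ArrayInfo *arr, size_t n, long spm_bytes) {
    for (size_t u = 0; u < n; u++) {
        long needed = arr[u].size;
        for (size_t v = 0; v < n; v++)
            if (v != u && lifetimes_overlap(&arr[u], &arr[v])
                       && more_critical(&arr[v], &arr[u]))
                needed += arr[v].size;
        arr[u].in_spm = (needed <= spm_bytes);
    }
}
```

An array larger than the SPM fails the test automatically, and arrays with disjoint lifetimes are never charged for each other's space, which reflects the memory sharing described earlier.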
Performance Details for Beamformer Example
Typical Applications
• Dequantde-quantization routine in MPEG decoder
application
• IDCTInverse Discrete Cosine Transform
• SORSuccessive Over Relaxation Algorithm
• MatrixMultMatrix multiplication
• FFTFast Fourier Transform
• DHRCDifferential Heat Release Computation
Algorithm
Performance Comparison of Configurations A, B, C and D
Conclusion
• Average improvement of 31.4% over A (only
SRAM)
• Average improvement of 30.0% over B (only
cache)
• Average improvement of 33.1% over C
(random partitioning)
Compiler Decided Dynamic
Memory Allocation for Scratch Pad
Based Embedded Systems
Cache is one of the options for on-chip
memory
CPU
Internal ROM
Cache
External DRAM
Why Not All Embedded Systems Have
Cache Memory
The reasons could be
• Increased on-chip area
• Increased energy
• Increased cost
• Hit latency and non-deterministic cache access
A method for allocating program data to
non-cached SRAM
• Dynamic, i.e. allocation changes at runtime
• Compiler-decided transfers
• Zero overhead per memory instruction, unlike
software or hardware caching
• Has no software caching tags
• Requires no run-time checks
• Highly predictable memory access times
Static Approach
int a[100];
int b[100];
…
while(i<100)
…..a……
while(i<100)
……b…...
Allocator placement (fixed for the whole run):
Internal SRAM: int a[100]
External DRAM: int b[100]
Dynamic Approach
int a[100];
int b[100];
…
while(i<100)
…..a……
while(i<100)
……b…...
Allocator placement during the first loop:
Internal SRAM: int a[100]
External DRAM: int b[100]
Allocator placement during the second loop:
Internal SRAM: int b[100]
External DRAM: int a[100]
It is similar to caching, but under compiler control
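The swap between the two loops can be simulated on a host machine. In this minimal sketch the `sram` buffer stands in for the scratch pad and `dram_a` / `dram_b` for the arrays' home locations in DRAM; the names, the `memcpy`-based transfers and the loop bodies are assumptions for illustration, since the slides elide what the loops actually compute.

```c
#include <string.h>

#define N 100

static int sram[N];               /* stands in for the on-chip scratch pad */
static int dram_a[N], dram_b[N];  /* home copies of a and b in off-chip DRAM */

void run_two_loops(void) {
    /* Allocation-change point: bring a into SRAM before the first loop. */
    memcpy(sram, dram_a, sizeof sram);
    for (int i = 0; i < N; i++)
        sram[i] += 1;                   /* first loop works on a */

    /* Allocation-change point: a was modified, so copy it out, then bring b in. */
    memcpy(dram_a, sram, sizeof sram);
    memcpy(sram, dram_b, sizeof sram);
    for (int i = 0; i < N; i++)
        sram[i] *= 2;                   /* second loop works on b */

    /* b was modified too, so write it back before SRAM is reused. */
    memcpy(dram_b, sram, sizeof sram);
}
```

Every access inside the loops hits the fast buffer, and the only DRAM traffic is the block copies that the compiler inserted at known points.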
Compiler-Decided Dynamic Approach
int a[100];
int b[100];
…
// a is in SRAM
while(i<100)
……a…….
// Copy a out to DRAM
// Copy b in to SRAM
while(i<100)
……..b…..…
• Need to minimize transfer costs for greater benefit
• Accounts for changing program requirements at run time
• Compiler manages and decides the
transfers between SRAM and DRAM
• Decides on dynamic behavior statically
Approach
The method is to
• Use profiling to estimate reuse
• Copy variables into SRAM when reused
• Cost model ensures that benefit exceeds cost
• Transfers data between the on-chip and off-chip
memory under compiler supervision
• Compiler-known data allocation at each point in
the code
Advantages
• Benefits with no software translation overhead
• Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software
caching
• No more data transfers than caching
Overview of Strategy
Divide the complete program into different
regions
For (starting point of each region)
{
    Remove some variables from SRAM
    Copy some variables into SRAM from DRAM
}
Some Important Questions
What are regions?
What to bring into SRAM?
What to evict from SRAM?
The problem has an exponential number of
solutions (NP-complete)
Regions
• It is the code between successive program points
• Coincide with changes in program behavior
• New regions start at:
• Start of each procedure
• Before start of each loop
• Before conditional statements containing loops,
procedures
What to Bring in to SRAM ?
• Bring in variables that are re-used in region,
provided cost of transfer is recovered.
• These transfers will reduce the memory access
time
• Cost model accounts for:
• Profile estimated re-use
• Benefit from reuse
• Detailed Cost of transfer
• Bring in cost
• Eviction cost
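One way to picture the cost model is the comparison below. The slides do not give the actual formulas, so this is only a sketch under simple assumptions: each reuse saves the difference between DRAM and SRAM latency, bringing a variable in costs one DRAM access per word, and eviction costs the same again only when the data must be written back. The `Candidate` record and all field names are hypothetical.

```c
/* Hypothetical description of a variable considered for SRAM in one region. */
typedef struct {
    long reuse_count;      /* profile-estimated accesses in the region */
    long size_words;       /* words to transfer */
    int  dram_latency;     /* cycles per DRAM access (assumed) */
    int  sram_latency;     /* cycles per SRAM access (assumed) */
    int  needs_writeback;  /* 1 if eviction must copy the data back */
} Candidate;

/* Cycles saved by serving the reuses from SRAM instead of DRAM. */
long transfer_benefit(const Candidate *c) {
    return c->reuse_count * (long)(c->dram_latency - c->sram_latency);
}

/* Bring-in cost plus eviction cost, in cycles (assumed linear in size). */
long transfer_cost(const Candidate *c) {
    long bring_in = c->size_words * c->dram_latency;
    long evict = c->needs_writeback ? c->size_words * c->dram_latency : 0;
    return bring_in + evict;
}

/* Bring the variable in only when the benefit exceeds the cost. */
int worth_bringing_in(const Candidate *c) {
    return transfer_benefit(c) > transfer_cost(c);
}
```

With a 10-cycle DRAM and 1-cycle SRAM, a 100-word array reused 1000 times saves 9000 cycles against a 2000-cycle transfer cost, while the same array reused only 100 times is not worth moving.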
What to Remove from SRAM?
• The data variables whose next use is furthest in the
future
• Needs a concept of time order among the different
code regions
• This time can be obtained by assigning
timestamps to each of the nodes
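The furthest-in-the-future eviction rule could be sketched as follows. The arrays `next_use` (timestamp of each variable's next access) and `resident` (whether it currently sits in SRAM) are hypothetical inputs; in the method described here they would come from the compiler's timestamped program points.

```c
#include <stddef.h>

/* Returns the index of the resident variable whose next use has the
   largest timestamp, or -1 if no variable is resident. */
int pick_victim(const int *next_use, const int *resident, size_t n) {
    int victim = -1;
    for (size_t i = 0; i < n; i++) {
        if (!resident[i])
            continue;
        if (victim < 0 || next_use[i] > next_use[victim])
            victim = (int)i;
    }
    return victim;
}
```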
The Data-Program Relationship
Graph
• The DPRG is a new data structure that
helps in identifying regions and
marking time stamps
• It is essentially a program's call graph
appended with additional nodes for
• Loop nodes
• Variable nodes
Data-Program Relationship Graph
[Figure: DPRG with nodes main, Proc_A, Proc_B, Proc_C, two loop nodes and variable nodes a and b, numbered 1 to 7]
• Defines regions
• Depth-first search order reveals execution time order
• "Allocation-change points" at region changes
Time Stamps
• The method associates a time stamp with every
program point
• The time stamps form a total order among
themselves
• The program points are reached at
run time in time stamp order.
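Timestamp assignment by depth-first traversal can be illustrated on a small graph. This is a minimal sketch: the `DprgNode` layout and fixed child limit are assumptions, and the graph is assumed to be a tree, so nodes reachable along several paths (which receive multiple timestamps in the method described here) are not handled.

```c
#define MAX_CHILDREN 4

/* Hypothetical node of a tree-shaped DPRG. */
typedef struct {
    int num_children;
    int child[MAX_CHILDREN];  /* indices of child nodes, in program order */
    int timestamp;            /* 0 until visited */
} DprgNode;

/* Pre-order depth-first walk: a node is stamped before its children. */
static void visit(DprgNode *g, int node, int *clock) {
    g[node].timestamp = ++(*clock);
    for (int i = 0; i < g[node].num_children; i++)
        visit(g, g[node].child[i], clock);
}

/* Stamps every node reachable from root with 1, 2, 3, ... */
void assign_timestamps(DprgNode *g, int root) {
    int clock = 0;
    visit(g, root, &clock);
}
```

For a graph where node 0 (main) calls node 1 (a procedure containing loop node 2) and then node 3 (another procedure), the nodes receive timestamps 1, 2, 3 and 4 in that order.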
Optimizations
• There is no need to write back unmodified or
dead SRAM variables into DRAM
• Optimize data transfer code using DMA when
it is available
• Data transfer code can be placed in special
memory block copy procedures
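The first optimization amounts to a simple predicate evaluated at each eviction. In this sketch the `SramVar` flags are hypothetical; in a compiler they would come from modification and liveness analysis.

```c
/* Hypothetical per-variable facts known to the compiler at an eviction point. */
typedef struct {
    int modified;    /* 1 if written since it was copied into SRAM */
    int live_after;  /* 1 if the value may still be read later */
} SramVar;

/* A copy back to DRAM is needed only for data that is both modified and live. */
int needs_writeback(const SramVar *v) {
    return v->modified && v->live_after;
}
```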
Multiple Allocations due to Multiple
Paths
•Contents of SRAM could be different on
different incoming paths to a node in DPRG
• Problem can happen in
• Loops
• Conditional execution
• Multiple calls to same procedure
Conditional join nodes
• Favor the most frequent path
• Consensus allocation is chosen assuming the
incoming allocation from the most probable
predecessor
Procedure join nodes
• Few program points have multiple timestamps
• The nodes with multiple timestamps are called join
nodes as they join multiple paths from main()
• A strategy is used that adopts different allocation
strategies for different paths but with same code
Offsets in SRAM
• SRAM can get fragmented when variables are
swapped out
• Intelligent offset mechanism required
• In this method
  • Place memory variables with similar lifetimes
  together, so that larger fragments are freed when they are evicted
  together
Experimental Setup
• Architecture: Motorola MCORE
• Memory architecture: 2 levels of memory
• SRAM size: estimated as 25% of the total data
requirement
• DRAM latency: 10 cycles
• Compiler: GCC
Results
Conclusion
The designer has to choose the right
mix of Scratch pad and Cache for
performance advantages.
References
• Sumesh U, Rajeev B.
Compiler Decided Dynamic Memory Allocation for Scratch Pad Based
Embedded Systems.
• Alexandru N, Preeti P, N Dutt.
Efficient Use of Scratch Pads in Embedded Applications.
• Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov.
Balancing Scratch Pad and Cache in Embedded Systems for Power and
Speed Performance.
Questions
Thank you