ECE 751 Embedded Computing Systems
Lecture 17: Embedded Multiprocessor Memory
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
High Performance Embedded Computing
© 2007 Elsevier
Topics
- Parallel memory systems
- Models for memory
- Heterogeneous memory systems
- Consistent parallel memory systems
- ARM MPCore multiprocessor
Parallel memory systems
- n memory banks can be accessed independently.
- Peak access rate is given by n parallel accesses.
- If $\lambda$ is the probability of a non-sequential access, the probability of a run of $k$ sequential accesses is
  $p(k) = \lambda (1 - \lambda)^{k-1}$
- The mean length of a run of sequential accesses is
  $L_b = \sum_{k=1}^{\infty} k \, p(k) = \frac{1}{\lambda}$
[Figure: four parallel memory banks (Bank 0 through Bank 3) sharing address and data lines]
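The closed form follows from the geometric-series identity $\sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2$ for $|x| < 1$, with $x = 1 - \lambda$:

    L_b = \sum_{k=1}^{\infty} k \,\lambda (1-\lambda)^{k-1}
        = \lambda \cdot \frac{1}{\bigl(1-(1-\lambda)\bigr)^{2}}
        = \frac{1}{\lambda}

As a numeric check with an assumed $\lambda = 0.25$ (one access in four is non-sequential), runs of sequential accesses average $L_b = 4$ references.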
Memory system design
- Design parallel memory systems using the memory component models introduced earlier (Ch. 2).
- Parameters:
  - Area: size of the component.
  - Performance: access time of the component; may differ for reads vs. writes, page mode, etc.
  - Energy per access: may also differ by access type.
- Delay is a nonlinear function of memory size.
  - Bit line delays can dominate access time.
- Delay is a nonlinear function of the number of ports.
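As a concrete (and deliberately simplified) illustration of these parameters, a sketch in C; the field names and the square-root delay model are assumptions for illustration, not the Chapter 2 models themselves:

    #include <math.h>

    /* Illustrative memory component model: area, per-access timing,
     * and energy, matching the parameter list above. */
    struct mem_component {
        double area_mm2;     /* area: grows with capacity                */
        double t_read_ns;    /* read access time                         */
        double t_write_ns;   /* write access time (may differ from read) */
        double e_access_pj;  /* energy per access                        */
    };

    /* Assumed placeholder delay model: bit-line delay grows with the
     * square root of capacity, and each extra port lengthens the bit
     * lines and word lines, so delay is nonlinear in both. */
    double access_delay_ns(double capacity_bits, int ports)
    {
        return 0.1 * sqrt(capacity_bits) * (1.0 + 0.3 * (ports - 1));
    }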
Heterogeneous memory systems
- Heterogeneous memory improves real-time performance:
  - Accesses to the same bank interfere, even if they are not to the same location.
  - Segregating real-time locations improves predictability and reduces access time variance.
- Heterogeneous memory improves power:
  - Smaller blocks with fewer ports consume less energy.
- What are the disadvantages of heterogeneous memory systems?
Memory system design methodology [Dut98]
[Dut98] © 1998 IEEE
Motion Estimation Architecture
[Dut98] © 1998 IEEE
Memory Partitioning and Delay
[Dut98] © 1998 IEEE
Critical Sections and Locks [Akg02]
- Critical section
  - Code section where shared data is accessed.
  - A lock helps guarantee the consistency of shared data (e.g., global variables).
- Lock delay
  - Time between release and acquisition of a lock.
- Lock latency
  - Time to acquire a lock when there is no contention.
- Approach
  - Provide an SoC lock cache.
SoC Lock Cache Mechanism
- Locks protecting shared code sections are stored in a dedicated lock cache.
- Locks appear in the processor's address space.
  - Access using load/store instructions.
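A minimal software-side sketch of this mechanism, assuming a hypothetical base address and a grant-on-read convention (the actual SoCLC register map is not given on the slides):

    #include <stdint.h>

    /* Hypothetical memory-mapped location of the SoC lock cache. */
    #define SOCLC_BASE    0xFFFF0000u
    #define SOCLC_LOCK(n) (*(volatile uint32_t *)(SOCLC_BASE + 4u * (n)))

    /* Assumed convention: reading a lock word returns 0 if the lock
     * was free and has just been granted to this CPU (the lock cache
     * serializes simultaneous requests); nonzero means the lock is
     * held, so keep spinning. */
    static void lock_acquire(unsigned n)
    {
        while (SOCLC_LOCK(n) != 0)
            ;  /* spin; no special instructions or atomic primitives */
    }

    /* Releasing is an ordinary store of 0 to the lock word. */
    static void lock_release(unsigned n)
    {
        SOCLC_LOCK(n) = 0;
    }

    /* Usage: a plain load/store critical section. */
    void increment_shared(volatile int *shared)
    {
        lock_acquire(3);   /* lock latency: acquisition time, uncontended */
        (*shared)++;       /* critical section: shared data access        */
        lock_release(3);
    }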
SoC Lock Cache Features
- Simple hardware mechanism: SoCLC.
- No modifications/extensions to the processor core or to the caches.
- No special instructions or atomic primitives.
- Can be integrated as an intellectual property (IP) block into the SoC.
- Hardware interrupt-triggered notification.
SW Only vs. HW/SW Locks
SW Only vs. HW/SW Locks (cont.)
SoC Lock Cache Hardware
Short vs. Long Critical Sections
- A short critical section has a relatively short time between lock acquisition and release.
  - For example, less than 1,000 cycles.
  - Don't switch to another task while waiting for the lock.
  - Locks are associated with PEs.
- A long critical section has a relatively long time between lock acquisition and release.
  - For example, more than 1,000 cycles.
  - Locks are associated with tasks on PEs.
  - More hardware is required to track tasks.
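A hedged sketch of the two waiting policies in C; `try_acquire` and `task_yield` stand in for assumed lock-cache and RTOS primitives that the slides do not specify:

    #include <stdbool.h>

    bool try_acquire(unsigned lock_id);  /* assumed lock-cache primitive */
    void task_yield(void);               /* assumed RTOS primitive       */

    /* Short critical section: expected hold time is small (the slide's
     * example threshold is 1,000 cycles), so busy-wait; the lock cache
     * only needs to know which PE is waiting. */
    void wait_short_cs(unsigned lock_id)
    {
        while (!try_acquire(lock_id))
            ;  /* spin; don't switch tasks */
    }

    /* Long critical section: the hold time is large, so yield the PE to
     * another task while waiting; the lock cache must now track waiting
     * tasks, not just PEs, which costs extra hardware. */
    void wait_long_cs(unsigned lock_id)
    {
        while (!try_acquire(lock_id))
            task_yield();  /* run another task until the lock frees */
    }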
SoC Lock Cache Interrupts
SoC Lock Cache Results
- Area is less than 0.1% of the full SoC design.
Coherent parallel memory systems
- Caches need to be coherent.
- Cache snooping is a common approach.
  - When data is accessed from memory, look for it in the other caches.
Application-Aware Snoop Filtering [Zho08]
- In embedded systems, the designer may know which memory is shared between tasks.
- Identify the shared memory regions for each task.
  - Provide this information to the operating system and cache snoop controller for runtime use.
- Snooping is enabled only for accesses that refer to known shared regions.
  - Reduces the power consumed by snooping.
- Focus is on write-back caches with a write-invalidate protocol.
Snoop Filtering Architecture
- The snoop filter determines if the D-cache should actually be snooped.
Shared Memory Identification
- With no virtual memory:
  - Utilize the Shared Address Segments (SAS) mechanism.
  - The programmer identifies shared structures.
  - The compiler controls the placement of data.
    - Aligns data on a 2^m address boundary.
    - Identifies each segment using a SegID (the MSBs of the address).
  - What if arrays are not of size 2^m?
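A small sketch of the SAS placement rule in C11, assuming m = 12 (a 4 KB segment); the names are illustrative:

    #include <stdint.h>

    #define SEG_SHIFT 12  /* assumed m = 12: segments on 4 KB boundaries */

    /* The compiler places the shared structure on a 2^m boundary, so
     * the segment is identified purely by the address MSBs. */
    _Alignas(1 << SEG_SHIFT) static int shared_frame[1024];  /* 4 KB */

    /* SegID = most significant bits of the address. */
    static inline uint32_t seg_id_of(const void *p)
    {
        return (uint32_t)((uintptr_t)p >> SEG_SHIFT);
    }

One answer to the slide's question: an array that is not a power-of-two size must typically be padded up to the next 2^m boundary, trading wasted memory for a simple MSB compare.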
SAS Snoop Filtering Hardware
- For each shared segment to be supported:
  - SegDim indicates the size of the segment (a bit mask).
  - SegID indicates the start of the segment.
  - Compare the SegID with the address MSBs.
- What if you have more shared segments than hardware for identifying them?
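The per-entry check the filter hardware performs can be modeled in one line of C; the struct and field names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* One hardware filter entry: SegDim is a bit mask selecting the
     * address MSBs that identify the segment (it encodes the segment
     * size), and SegID holds the segment's starting address in those
     * bits. */
    struct sas_entry {
        uint32_t seg_dim;  /* mask of MSBs to compare      */
        uint32_t seg_id;   /* expected value of those MSBs */
    };

    /* Snoop only if the snooped address lies inside the segment. */
    static inline bool in_segment(uint32_t addr, const struct sas_entry *e)
    {
        return (addr & e->seg_dim) == e->seg_id;
    }

If there are more shared segments than filter entries, correctness requires a conservative fallback, e.g., disabling filtering and snooping every access, rather than silently skipping snoops for unmatched shared segments.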
Snoop Filtering Results
- Snoop activity is reported for direct-mapped and 4-way caches with write-invalidate and write-update mechanisms.
- Snoop activity is reduced by 51% to 98%.
Virtual memory and snoop filtering
- Recent embedded processors provide virtual memory support through MMUs.
  - Translate a virtual address (VPN + offset) to a physical address (PPN + offset).
  - Provides transparent memory allocation, isolation, and protection for tasks.
  - Requires a page table (PT) and translation lookaside buffer (TLB) to translate the VPN to a PPN.
- The programmer and compiler no longer know physical addresses.
  - A different technique is needed for snoop filtering.
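For concreteness, the VPN/offset arithmetic with an assumed 4 KB page size; `tlb_lookup` is a stand-in for the TLB (or page-table walk on a miss):

    #include <stdint.h>

    #define PAGE_SHIFT 12u  /* assumed 4 KB pages */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

    uint32_t tlb_lookup(uint32_t vpn);  /* assumed translation helper */

    /* Split the virtual address, look up the PPN for the VPN, then
     * reattach the unchanged page offset. */
    uint32_t translate(uint32_t va)
    {
        uint32_t vpn    = va >> PAGE_SHIFT;
        uint32_t offset = va & PAGE_MASK;
        uint32_t ppn    = tlb_lookup(vpn);
        return (ppn << PAGE_SHIFT) | offset;
    }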
Shared Memory Identification
- With virtual memory:
  - Utilize the Shared Page Set (SPS) mechanism.
  - The programmer identifies shared structures.
    - Provides each array's starting address and size.
    - Identifies which threads use which structures.
  - The operating system assigns a RegID.
    - Stores this information in the page table (PT) and translation lookaside buffer (TLB).
SPS Snoop Filtering Hardware
- The PT and TLB are augmented with the RegID for each page.
- The shared-region information for each task is loaded by the operating system.
  - Implemented using a bit mask register with one bit for each shared region.
  - For example, 01010100 indicates a task that uses shared regions 2, 4, and 6.
- On a cache miss, the RegID is transmitted along the data bus.
- Filtering hardware at each node checks whether the current task has shared data in the RegID region.
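The per-node filter check reduces to a single bit test; a sketch using the slide's example mask:

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit i of the task's mask register is set if the task uses shared
     * region i.  Snoop only if the incoming RegID's bit is set. */
    static inline bool task_uses_region(uint8_t task_mask, unsigned reg_id)
    {
        return (task_mask >> reg_id) & 1u;
    }

    /* Slide example: 01010100 (0x54) -> regions 2, 4, and 6.
     * task_uses_region(0x54, 4) is true; task_uses_region(0x54, 3) is
     * false, so that snoop is filtered out. */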
SPS Snoop Filtering Hardware
Snoop Filtering Energy Results
- Snoop energies are reported for direct-mapped and 4-way caches with write-invalidate (WI) and write-update (WU) mechanisms.
- WI requires much less energy than WU.
- Snoop energy is reduced by 47% to 93%.
- SPS is only used with WI.
ARM11 MPCore
ARM11 MPCore Features
- Up to 4 CPUs implementing the ARM v6 architecture.
- Snoop Control Unit for cache coherency.
- Distributed Interrupt Controller.
- Private timer and private watchdog for each CPU.
- High-speed AMBA (Advanced Microcontroller Bus Architecture) AXI L2 memory interfaces.
- Flexible configuration during synthesis.
ARM11 MPCore Pipeline Stages
Stage 1:  1st Fetch (Fe1)
Stage 2:  2nd Fetch (Fe2)
Stage 3:  Instruction Decode (De)
Stage 4:  Register Read and Issue (Iss)
Stages 5-7, one of three parallel execution paths:
  ALU path:        Shifter (Sh) -> ALU Operation (ALU) -> Saturation (Sat)
  Multiply path:   1st Multiply Acc. (MAC1) -> 2nd Multiply Acc. (MAC2) -> 3rd Multiply Acc. (MAC3)
  Load/store path: Address Generation (ADD) -> Data Cache 1 (DC1) -> Data Cache 2 (DC2)
Stage 8:  Write Back from Mul/ALU (WBex) or Write Back from LSU (WBls)
ARM11 MPCore Caches
- Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM).
- The data cache is physically indexed, physically tagged, write-back, and write-allocate only.
- The instruction cache is virtually indexed, physically tagged.
- 32-bit interface to the instruction cache and 64-bit interface to the data cache.
- Hardware support for data cache coherency.
- The instruction and data caches can be independently configured during synthesis to sizes between 16 KB and 64 KB.
ARM11 MPCore Caches
- Both caches are 4-way set-associative.
- The cache line replacement policy is round-robin.
- The cache line length is eight 32-bit words.
- Both data cache read misses and write misses are non-blocking.
  - Up to three outstanding data cache read misses and up to four outstanding data cache write misses are supported.
- Support is provided for streaming of sequential data with LDM operations and for sequential instruction fetches.
- On a cache miss, critical-word-first filling of the cache is performed.
Coherency protocol – MESI
- MESI is a write-invalidate protocol.
  - Writing to a shared location invalidates corresponding lines in the other L1 caches.
- Cache lines can be in one of four states:
  - Modified: the cache line is present only in the current cache, and it is dirty; it has been modified from the value in main memory.
  - Exclusive: the cache line is present only in the current cache, and it is clean; it matches the main memory value.
  - Shared: the cache line is present in more than one CPU's cache and is clean; it matches the main memory value.
  - Invalid: the coherent cache line is not present in the cache.
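A condensed sketch of the invalidate-on-remote-write behavior in C (bus signaling and the full transition table are abstracted away):

    /* MESI states for one cache line. */
    enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    /* Reaction of a local line when another CPU writes the same line
     * under a write-invalidate protocol. */
    enum mesi on_remote_write(enum mesi state)
    {
        switch (state) {
        case MODIFIED:
            /* Dirty data: supply/write back the line, then invalidate. */
            return INVALID;
        case EXCLUSIVE:
        case SHARED:
            /* Clean copy: simply invalidate. */
            return INVALID;
        case INVALID:
        default:
            return INVALID;  /* nothing to do */
        }
    }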
L1 Data Memory
L1 Instruction Memory
Level 2 Memory - AXI
- MPCore Level 2 interface
  - The ARM11 MPCore processor Level 2 interface consists, by default, of two 64-bit wide AXI bus masters.
- Supported AXI transfers
  - Coherent and non-coherent write-back write-allocate.
  - Coherent non-cacheable.
- AXI transaction IDs
  - Arbitration for transaction ordering on the AXI masters is round-robin among the requesting MP11 CPUs.