Multicore Architecture Basics

0907532 Special Topics in Computer Engineering
Basic concept of parallelism
• The idea is simple: improve performance by
performing two or more operations at the same
time.
• Has been an important computer design strategy
since the beginning.
Parallelism in This Course (multicore machines)
• Attain parallelism by using several processing
elements (cores) on the same chip or on different
chips sharing main memory.
• Parallel computing is necessary for continued
performance gains, given that “clock speeds are not
going to increase dramatically.”
Clock Rate Projections
(Figure: Clock Rate (GHz) vs. year, 2001–2013, comparing the 2005 International Technology Roadmap for Semiconductors (ITRS) projection against Intel single-core clock rates.)
Change in the ITRS Roadmap in Two Years
(Figure: Clock Rate (GHz) vs. year, 2001–2013, comparing the 2005 and 2007 ITRS roadmap projections against Intel single-core and multicore clock rates.)
Shared Address Space Architectures
• Any core can directly reference any memory
location
• Communication between cores occurs implicitly as
a result of loads and stores
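As a concrete illustration (not from the slides), the sketch below shows two threads on a shared-address-space machine communicating purely through loads and stores to shared locations. The thread roles, variable names, and the use of C11 atomics for safe publication are assumptions made for the example.

/* Minimal sketch: implicit communication through shared memory.
 * One thread stores a value and sets a flag; the other spins on the
 * flag and then loads the value. Names and the use of C11 atomics
 * are illustrative assumptions, not part of the original slides. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_value;          /* plain shared data */
static atomic_int ready = 0;      /* publication flag  */

static void *producer(void *arg) {
    shared_value = 42;                                        /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* publish */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                      /* spin until published */
    printf("consumer saw %d\n", shared_value);                 /* ordinary load */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The point is that no explicit message-passing call is needed: the consumer observes the producer’s stores simply by reading the same addresses.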
Memory hierarchy and cache memories:
1. Review concepts assuming “Single Core”
2. Introduce problems and solutions when used in
“Multicore Machines”
Single core memory hierarchy and cache
memories
• Programs tend to exhibit temporal and spatial
locality:
• Temporal locality: Once programs access data
items or instructions, they tend to access them again
in the near future.
• Spatial locality: Once programs access data items
or instructions, they tend to access nearby data
items or instructions in the near future.
• Because of the locality property of programs,
memory is organized in a hierarchy.
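To make spatial locality concrete, here is a small sketch (not from the slides). C stores 2-D arrays in row-major order, so a row-by-row traversal touches consecutive addresses and consecutive cache lines, while a column-by-column traversal strides across memory; the array size and function names are illustrative assumptions.

/* Sketch: the effect of spatial locality on a 2-D array traversal.
 * sum_row_major() walks consecutive addresses (good spatial locality),
 * while sum_col_major() strides across rows (poor spatial locality).
 * Sizes and names are illustrative assumptions. */
#include <stddef.h>

#define N 1024

long sum_row_major(const int a[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];      /* consecutive addresses: cache-friendly */
    return sum;
}

long sum_col_major(const int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];      /* N*sizeof(int) stride: frequent misses */
    return sum;
}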
Memory hierarchy
Key observations (from the hierarchy diagram; connecting line thickness depicts bandwidth in bytes/second):
• Access to the L1 cache is on the order of 1 cycle.
• Access to the L2 cache is on the order of 1–10 cycles.
• Access to main memory is on the order of 100s of cycles.
• Access to magnetic disk is on the order of 1000s of cycles.
Processor and Memory are Far Apart
(Figure: a processor and memory connected by an interconnect.)

Reading from Memory
• The processor puts an address on the interconnect, waits (“zzz…”), and eventually the value comes back.

Writing to Memory
• The processor sends an address and a value, waits (“zzz…”), and eventually receives an acknowledgement (“ack”).

Cache: Reading from Memory
• With a cache next to the processor, the address is presented to the cache before going to memory.

Cache Hit
• The cache holds the requested address (“Yes!”), so the value is returned immediately and no memory access is needed.

Cache Miss
• The cache does not hold the requested address (“No…”), so the request travels to main memory and the returned data is installed in the cache.

(Figure sequence from The Art of Multiprocessor Programming.)
Memory and cache performance metrics
• Cache Hit and Miss: When the data is found in the
cache, we have a cache hit; otherwise it is a miss.
• Hit Ratio (HR) = fraction of memory references that
hit
– Depends on the locality of the application
– Measure of the effectiveness of the caching mechanism
• Miss Ratio (MR) = fraction of memory references
that miss
• HR = 1 − MR
Average memory system access time
If all the data fits in main memory (i.e. ignoring disk
access):
Average access time = HR × cache access time + MR × main memory access time
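As a worked example (the numbers are assumptions for illustration, not from the slides): with HR = 0.95, a 2-cycle cache, and a 200-cycle main memory, the average access time is 0.95 × 2 + 0.05 × 200 = 1.9 + 10 = 11.9 cycles, so even a 5% miss ratio dominates the average.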
Cache line
• When there is a cache miss, a fixed-size block of
consecutive data elements, called a line, is copied from
main memory to the cache.
• Typical cache line sizes are 4–128 bytes.
• Main memory can be seen as a sequence of lines,
some of which have a copy in the cache.
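The sketch below (not from the slides) shows how an address is typically split into a line offset, index, and tag for a simple direct-mapped cache; the 64-byte line size, number of lines, and names are illustrative assumptions, not a description of any specific processor.

/* Sketch of a direct-mapped cache lookup with 64-byte lines.
 * Line size, number of lines, and field names are illustrative
 * assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64          /* bytes per cache line    */
#define NUM_LINES 512         /* lines in this toy cache */

struct cache_line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_SIZE]; /* copy of one memory line */
};

static struct cache_line cache[NUM_LINES];

/* Returns true on a hit; on a miss the caller would fetch the whole
 * line from memory and install it in the cache before retrying. */
bool cache_lookup(uint64_t addr) {
    uint64_t offset = addr % LINE_SIZE;            /* byte within the line */
    uint64_t index  = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag    = addr / (LINE_SIZE * (uint64_t)NUM_LINES);
    (void)offset;
    return cache[index].valid && cache[index].tag == tag;
}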
MEMORY HIERARCHY AND BANDWIDTH ON
MULTICORE
• Each core has its own private L1 cache to
provide fast access, e.g. 1–2 cycles.
• L2 caches may be shared across multiple cores.
• In the event of a cache miss at both L1 and L2, the
memory controller must forward the load/store
request to the off-chip main memory.
Intel® Core™ Microarchitecture – Memory Sub-system
High-Level Multicore Architectural View
(Figure: Intel Core 2 Duo and Intel Core 2 Quad processors with 64-byte cache lines. The dual-core chip has one L2 cache shared by both cores; the quad-core chip has two L2 caches, each shared by a pair of cores, so it has both shared and separated caches on the path to memory.)
A = Architectural State
E = Execution Engine & Interrupt
C = 2nd Level Cache
B = Bus Interface (connects to main memory & I/O)
Cache line ping-ponging or tennis effect
• One processor writes to a cache line and then
another processor writes to the same cache line but
a different data element.
• The cache line is in a separate-socket / separate-L2-cache
environment.
• Each core takes a HITM (HIT Modified) on
the cache line, causing it to be shipped across the FSB
(Front Side Bus) to memory.
• This increases FSB traffic, and even under good
conditions costs about half the cost of a memory
access.
Intel® Core™ Microarchitecture – Memory Sub-system
With a separated cache
(Figure: CPU1 and CPU2 with separate L2 caches. A shared cache line must be shipped between the caches over the Front Side Bus (FSB), at roughly half the cost of an access to memory.)
Intel® Core™ Microarchitecture – Memory Sub-system
Advantages of Shared Cache – using Advanced Smart Cache® Technology
(Figure: CPU1 and CPU2 sharing one L2 cache. Because L2 is shared, there is no need to ship the cache line between caches over the FSB.)
False Sharing
• Performance issue in programs where cores write to different memory
addresses BUT within the same cache line.
• Known as ping-ponging: the cache line is shipped back and forth between the cores.
(Figure: over time, Core 0 writes X[0] = 0, then X[0] = 1, then X[0] = 2, while Core 1 writes X[1] = 0, then X[1] = 1. X[0] and X[1] sit in the same cache line, so every write forces the line to move between the cores.)
• False sharing is not an issue with a shared cache; it is an issue with
separated caches.
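A minimal C sketch of the situation above (the array name, thread count, and iteration count are assumptions for illustration): two threads update adjacent elements of the same array, so both counters fall in one 64-byte cache line and that line ping-pongs between the cores even though no element is shared.

/* Sketch of false sharing: two threads update *different* array
 * elements that land in the *same* cache line. Names, sizes, and
 * iteration counts are illustrative assumptions. */
#include <pthread.h>
#include <stdint.h>

#define ITERS 100000000L

long counters[2];   /* adjacent elements: both fall in the same 64-byte cache line */

static void *worker(void *arg) {
    intptr_t id = (intptr_t)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id]++;   /* each write invalidates the other core's copy of the line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}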
Avoiding False Sharing
Change either
• the Algorithm:
– adjust the implementation of the algorithm (e.g. the loop
stride) so that each thread accesses data in a different
cache line
or
• the Data Structure:
– add some “padding” to a data structure or array (just
enough padding, generally less than the cache line size) so
that threads access data from different cache lines, as in the
sketch below.
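A minimal sketch of the padding approach (the struct and constant names are assumptions): each per-thread counter is padded out to its own 64-byte cache line, so the two threads from the previous example no longer share a line.

/* Sketch of avoiding false sharing by padding: each per-thread
 * counter occupies its own 64-byte cache line. Names and the
 * 64-byte line size are illustrative assumptions. */
#define LINE_SIZE 64

struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];   /* push the next element onto a new line */
};

struct padded_counter counters[2];        /* counters[0] and counters[1] now live
                                             in different cache lines */

With C11, the same effect can be obtained by declaring the element type with _Alignas(64), which makes the compiler pad and align each element to a cache-line boundary.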