Supercomputing in Plain English: Multicore Madness
Henry Neeman, Director
OU Supercomputing Center for Education & Research
University of Oklahoma
Wednesday October 17 2007
The March of Progress
Multicore/Many-core Basics
Software Strategies for Multicore/Many-core
A Concrete Example: Weather Forecasting
10 racks @ 1000 lbs per rack
270 Pentium4 Xeon CPUs,
2.0 GHz, 512 KB L2 cache
270 GB RAM, 400 MHz FSB
8 TB disk
Myrinet2000 Interconnect
100 Mbps Ethernet Interconnect
OS: Red Hat Linux
Peak speed: 1.08 TFLOP/s
(1.08 trillion calculations per second)
One of the first Pentium4 clusters!
boomer.oscer.ou.edu
9 years from room to chip!
http://news.com.com/2300-1006_3-6119652.html
In 1965, Gordon Moore was an engineer at Fairchild
Semiconductor.
He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.
It turns out that computer speed is roughly proportional to the number of transistors per unit area.
Moore wrote a paper about this concept, which became known as
“Moore’s Law.”
[Chart: Fastest Supercomputer in the World (www.top500.org) – peak speed vs. year, 1992–2008, on a log scale, with "Fastest" and "Moore" trend lines.]
The storage hierarchy [5], from fast, expensive and few to slow, cheap and plentiful [6]:
Registers
Cache memory
Main memory (RAM)
Hard disk
Removable media (e.g., DVD)
Internet
The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
[Diagram: CPU at 351 GB/sec [7]; RAM at 10.66 GB/sec [9] (3%) – the bottleneck.]
Cache is nearly the same speed as the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!
[Diagram: CPU at 351 GB/sec [7]; cache at 253 GB/sec [8] (72%); RAM at 10.66 GB/sec [9] (3%).]
Register reuse : Do a lot of work on the same data before working on new data.
Cache reuse : The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache.
Data locality : Try to access data that are near each other in memory before data that are far (see the Fortran sketch after this list).
I/O efficiency : Do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.
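As a minimal illustration of the data locality idea (a hypothetical sketch, not from the talk): Fortran stores arrays in column-major order, so putting the row index in the innermost loop walks through consecutive memory addresses, while reversing the loop order strides through memory by nr elements at a time.

! Hypothetical sketch: loop order vs. data locality for a column-major
! Fortran array. The r loop is innermost, so successive iterations
! touch consecutive memory locations (good cache behavior).
SUBROUTINE sum_elements_local (a, nr, nc, total)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc
  REAL,DIMENSION(nr,nc),INTENT(IN) :: a
  REAL,INTENT(OUT) :: total
  INTEGER :: r, c
  total = 0.0
  DO c = 1, nc        ! outer loop over columns
    DO r = 1, nr      ! inner loop walks down a column: unit stride
      total = total + a(r,c)
    END DO
  END DO
END SUBROUTINE sum_elements_local
! Swapping the two DO loops (r outermost, c innermost) computes the
! same total, but each inner iteration jumps nr elements in memory,
! so it makes much poorer use of cache.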
OSCER’s big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz Front Side Bus.
The theoretical peak CPU speed is 6.4 GFLOP/s (double precision) per CPU, and in practice we’ve gotten as high as 94% of that.
So, in theory each CPU could consume 143 GB/sec.
The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
So, any code that does less than about 45 calculations per byte transferred between RAM and cache (143 GB/sec ÷ 3.2 GB/sec ≈ 45) has its speed limited by RAM bandwidth.
Matrix-Matrix Multiply
Let A, B and C be matrices of sizes nr × nc, nr × nk and nk × nc, respectively:

$$
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{pmatrix},
\quad
B = \begin{pmatrix}
b_{1,1} & \cdots & b_{1,nk} \\
\vdots  & \ddots & \vdots  \\
b_{nr,1} & \cdots & b_{nr,nk}
\end{pmatrix},
\quad
C = \begin{pmatrix}
c_{1,1} & \cdots & c_{1,nc} \\
\vdots  & \ddots & \vdots  \\
c_{nk,1} & \cdots & c_{nk,nc}
\end{pmatrix}
$$

The definition of A = B • C is

$$
a_{r,c} \;=\; \sum_{k=1}^{nk} b_{r,k}\, c_{k,c}
\;=\; b_{r,1}\, c_{1,c} + b_{r,2}\, c_{2,c} + b_{r,3}\, c_{3,c} + \cdots + b_{r,nk}\, c_{nk,c}
$$

for r ∈ {1, …, nr}, c ∈ {1, …, nc}.
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER :: r, c, q
  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
[Chart: Matrix-Matrix Multiply – performance of the naive version (series labeled "Init") vs. total problem size in bytes (nr*nc + nr*nq + nq*nc), up to about 60,000,000 bytes; higher is better.]
Tile : A small rectangular subdomain of a problem domain.
Sometimes called a block or a chunk .
Tiling : Breaking the domain into tiles.
Tiling strategy : Operate on each tile to completion, then move to the next tile.
Tile size can be set at runtime, according to what’s best for the machine that you’re running on.
SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
 &                                       rtilesize, ctilesize, qtilesize)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
  INTEGER :: rstart, rend, cstart, cend, qstart, qend
  DO cstart = 1, nc, ctilesize
    cend = cstart + ctilesize - 1
    IF (cend > nc) cend = nc
    DO rstart = 1, nr, rtilesize
      rend = rstart + rtilesize - 1
      IF (rend > nr) rend = nr
      DO qstart = 1, nq, qtilesize
        qend = qstart + qtilesize - 1
        IF (qend > nq) qend = nq
        CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
 &                                   rstart, rend, cstart, cend, qstart, qend)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_by_tiling
SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &                                  rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
  INTEGER :: r, c, q
  DO c = cstart, cend
    DO r = rstart, rend
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
[Charts: Matrix-Matrix Multiply via Tiling – performance vs. tile size in bytes, shown on linear and log-log scales, for problem sizes 512x256, 512x512, 1024x512, 1024x1024 and 2048x1024; higher is better.]
It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
If you don’t need tiling – because of the hardware, the compiler or the problem size – then you can turn it off by simply setting the tile size equal to the problem size.
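As a usage sketch (a hypothetical driver, not from the talk; it assumes it is compiled and linked together with the matrix_matrix_mult_by_tiling and matrix_matrix_mult_tile routines above, and the size and tile values are made up):

PROGRAM tiling_driver
  IMPLICIT NONE
  INTEGER,PARAMETER :: n = 512            ! hypothetical problem size
  REAL,DIMENSION(n,n) :: a, b, c
  INTEGER :: tilesize
  CALL RANDOM_NUMBER(b)
  CALL RANDOM_NUMBER(c)
  ! Tile size chosen at run time; hard-coded here, but it could just
  ! as easily come from the command line or an input file.
  tilesize = 64
  CALL matrix_matrix_mult_by_tiling(a, b, c, n, n, n, &
 &                                  tilesize, tilesize, tilesize)
  ! Setting the tile sizes equal to the problem dimensions turns
  ! tiling off: the whole problem becomes a single tile.
  CALL matrix_matrix_mult_by_tiling(a, b, c, n, n, n, n, n, n)
END PROGRAM tiling_driver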
Cache optimization works best when the number of calculations per byte is large.
For example, with matrix-matrix multiply on an n × n matrix, there are O(n³) calculations (on the order of n³), but only O(n²) bytes of data.
So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
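A rough back-of-the-envelope version of that claim (a sketch, assuming 4-byte REALs as in the code above, counting one multiply plus one add per inner-loop iteration, and assuming each byte only has to come from RAM once):

$$
\frac{\text{calculations}}{\text{bytes}} \approx \frac{2n^3}{3 \times 4n^2} = \frac{n}{6},
$$

so for n = 1000 that is roughly 167 calculations per byte, far above the ~45-per-byte threshold estimated earlier for topdawg.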
In the olden days (i.e., the first half of 2005), each CPU chip had one “brain” in it.
More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.
Jargon : Each CPU chip plugs into a socket , so these days, to avoid confusion, people refer to sockets and cores , rather than CPUs or processors.
Each core is just like a full-blown CPU, except that it shares its socket with one or more other cores – and therefore shares its bandwidth to RAM.
[Diagrams: chips with 2 cores, 4 cores and 8 cores per socket.]
Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET .
As the number of cores per socket increases, the contention for RAM bandwidth increases too.
At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem .
So, applications that are cache optimized will get big speedups .
But, applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
RAM bandwidth improves much more slowly than CPU speed does.
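A rough sketch of the contention, assuming the per-socket RAM bandwidth quoted earlier [9] is shared evenly among the cores in that socket:

$$
\frac{10.66\ \text{GB/s}}{2\ \text{cores}} \approx 5.3\ \text{GB/s per core},
\qquad
\frac{10.66\ \text{GB/s}}{16\ \text{cores}} \approx 0.67\ \text{GB/s per core}.
$$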
Each node has access to a certain number of network ports, at a fixed number of network ports per NODE .
As the number of cores per node increases, the contention for network ports increases too.
At 2 cores per node, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
So, applications that do minimal communication will get big speedups .
But, applications whose performance is limited by the number of MPI messages are going to speed up very very little – and may even crash the node.
Most multicore chip families have relatively small cache per core (e.g., 2 MB) – and this problem seems likely to remain.
Small TLBs make the problem worse: 512 KB per core rather than 2 MB.
So, to get good cache reuse, you need to partition your algorithm so that each subproblem needs no more than 512 KB (see the sketch below).
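As a rough sketch of what that limit implies for the tiled matrix-matrix multiply above (an assumption-laden estimate: 4-byte REALs, square t × t tiles, and all three tiles resident at once):

$$
3 \times 4\ \text{bytes} \times t^2 \;\le\; 512\ \text{KB}
\quad\Longrightarrow\quad
t \;\lesssim\; \sqrt{\tfrac{524{,}288}{12}} \;\approx\; 209,
$$

so tiles of roughly 200 × 200 or smaller would fit within 512 KB under these assumptions.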
On Intel Core Duo (“Yonah”):
Cache size is 2 MB per core.
Page size is 4 KB.
A core’s data TLB size is 128 page table entries.
Therefore, the D-TLB only covers 512 KB of cache (128 entries × 4 KB per page).
The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations (at 4 FLOPs per cycle)!
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
We need much bigger caches!
TLB must be big enough to cover the entire cache.
It’d be nice to have RAM speed increase as fast as core counts increase, but let’s not kid ourselves.
http://www.oscer.ou.edu/education.php
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design , Patterson & Hennessy, © 2005]
Multithreading on A Chip
Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
Multithreading – increase the utilization of resources on a chip by allowing multiple processes ( threads ) to share the functional units of a single processor
Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread
The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly)
The memory can be shared through virtual memory mechanisms
Hardware must support efficient thread context switching
Types of Multithreading on a Chip
Fine-grain – switch threads on every instruction issue
Round-robin thread interleaving (skipping stalled threads)
Processor must be able to switch threads on every clock cycle
Advantage – can hide throughput losses that come from both short and long stalls
Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads
Coarse-grain – switches threads only on costly stalls
(e.g., L2 cache misses)
Advantages – thread switching doesn’t have to be essentially free, and coarse-grain MT is much less likely to slow down the execution of an individual thread
Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
Pipeline must be flushed and refilled on thread switches
Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP)
With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
Need separate rename tables (ROBs) for each thread
Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
Intel’s Pentium 4 SMT is called hyperthreading
Supports just two threads (doubles the architecture state)
Typically, each core of newer multicore processors is hyperthreaded
Threading on a 4-way SS Processor Example
[Diagram: issue-slot usage over time on a 4-way superscalar processor under coarse-grain MT, fine-grain MT and SMT, for four threads A–D.]
William Stallings
Computer Organization and Architecture
8th Edition
Chapter 18
Multicore Computers
Hardware Performance Issues
Microprocessors have seen an exponential increase in performance
Improved organization
Increased clock frequency
Increase in Parallelism
Pipelining
Superscalar
Simultaneous multithreading (SMT)
Diminishing returns
More complexity requires more logic
Increasing chip area for coordinating and signal transfer logic
Harder to design, make and debug
Alternative Chip Organizations
Intel Hardware Trends
Increased Complexity
Power requirements grow exponentially with chip density and clock frequency
Can use more chip area for cache instead
Cache memory is denser and has power requirements roughly an order of magnitude lower than logic
By 2015:
100 billion transistors on a 300 mm² die
Cache of 100 MB
1 billion transistors for logic
Pollack’s rule:
Performance is roughly proportional to the square root of the increase in complexity
Doubling complexity gives only about 40% more performance (see the worked relation below)
Multicore has potential for near-linear improvement
Unlikely that one core can use all cache effectively
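A quick worked form of Pollack’s rule (taking performance as proportional to the square root of complexity):

$$
\frac{\text{performance}(2C)}{\text{performance}(C)} = \sqrt{\frac{2C}{C}} = \sqrt{2} \approx 1.41,
$$

i.e., doubling the complexity of a single core buys only about 40% more performance.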
Power and Memory Considerations
Chip Utilization of Transistors
Software Performance Issues
Performance benefits dependent on effective exploitation of parallel resources
Even small amounts of serial code impact performance
10% inherently serial code on an 8-processor system gives only 4.7 times the performance (see the Amdahl’s Law check below)
Communication, distribution of work and cache coherence overheads
Some applications effectively exploit multicore processors
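That 4.7× figure is just Amdahl’s Law; a quick worked check with parallel fraction p = 0.9 and N = 8 processors:

$$
S = \frac{1}{(1-p) + p/N} = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7.
$$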
Effective Applications for Multicore Processors
Database
Servers handling independent transactions
Multi-threaded native applications
Lotus Domino, Siebel CRM
Multi-process applications
Oracle, SAP, PeopleSoft
Java applications
Java VM is multi-thread with scheduling and memory management
Sun’s Java Application Server, BEA’s Weblogic, IBM
Websphere, Tomcat
Multi-instance applications
One application running multiple times
E.g., Valve game software
Multicore Organization
Number of core processors on chip
Number of levels of cache on chip
Amount of shared cache
The next slide shows examples of each organization:
(a) ARM11 MPCore (deleted)
(b) AMD Opteron (deleted)
(c) Intel Core Duo
(d) Intel Core i7
Multicore Organization Alternatives
Advantages of shared L2 Cache
Constructive interference reduces overall miss rate
Data shared by multiple cores not replicated at cache level
With proper frame replacement algorithms, the amount of shared cache allocated to each core is dynamic
Threads with less locality can have more cache
Easy inter-process communication through shared memory
Cache coherency confined to L1
Dedicated L2 cache gives each core more rapid access
Good for threads with strong locality
Shared L3 cache may also improve performance
Individual Core Architecture
Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multi-threading (SMT)
Scales up number of threads supported
4 SMT cores, each supporting 4 threads, appears as 16 cores
Intel x86 Multicore Organization - Core Duo (1)
2006
Two x86 superscalar, shared L2 cache
Dedicated L1 cache per core
32KB instruction and 32KB data
Thermal control unit per core
Manages chip heat dissipation
Maximize performance within constraints
Improved ergonomics
Advanced Programmable Interrupt Controller (APIC)
Inter-processor interrupts between cores
Routes interrupts to appropriate core
Includes timer so OS can interrupt core
Intel x86 Multicore Organization - Core Duo (2)
Power Management Logic
Monitors thermal conditions and CPU activity
Adjusts voltage and power consumption
Can switch individual logic subsystems
2MB shared L2 cache
Dynamic allocation
MESI support for L1 caches
Extended to support multiple Core Duo in SMP
L2 data shared between local cores or external
Bus interface
Intel Core Duo Block Diagram
Intel x86 Multicore Organization - Core i7
November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative pre-fetch for caches
On-chip DDR3 memory controller
Three 8-byte channels (192 bits), giving 32 GB/s
No front side bus
QuickPath Interconnection
Cache coherent point-to-point link
High speed communications between processor chips
6.4 GT (transfers) per second, 16 bits per transfer
Dedicated bi-directional pairs
Total bandwidth 25.6 GB/s
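A quick consistency check of those numbers (16 bits = 2 bytes per transfer, and the bidirectional pair doubles the aggregate):

$$
6.4\ \text{GT/s} \times 2\ \text{bytes} = 12.8\ \text{GB/s per direction},
\qquad
2 \times 12.8\ \text{GB/s} = 25.6\ \text{GB/s total}.
$$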
Intel Core i7 Block Diagram
Performance Effect of Multiple Cores
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design , Patterson & Hennessy, © 2005]
Multicore Xbox360 – “Xenon” processor
To provide game developers with a balanced and powerful platform
Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
165M transistors total
3.2 GHz, near-POWER ISA
2-issue, 21-stage pipeline, with 128 128-bit registers
Weak branch prediction – supported by software hinting
In-order instruction execution
Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
An ATI-designed 500 MHz GPU with 512 MB of DDR3 DRAM
337M transistors, 10MB framebuffer
48 pixel shader cores, each with 4 ALUs
Xenon Diagram
[Xenon block diagram: Core 0, Core 1 and Core 2, each with L1D and L1I caches, sharing a 1MB UL2; 512MB DRAM; GPU with 10MB EDRAM and 3D core; video out; BIU/IO interface connecting DVD, HDD port, front USBs (2), wireless, MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, systems control, and an analog chip with video out.]
The PS3 “Cell” Processor Architecture
Composed of a Non-SMP Architecture
234M transistors @ 4 GHz
1 Power Processing Element, 8 “Synergistic” (SIMD) PEs
512 KB L2 cache – a massively high bandwidth (200 GB/s) bus connects it to everything else
The PPE is strangely similar to one of the Xenon cores
Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
The real differences lie in the SPEs (21M transistors each)
An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256 KB “scratchpad” (14M transistors), direct mapped for low latency
4 vector units per SPE, 1 of everything else – 7M trans.
The PS3 “Cell” Processor Architecture
How to make use of the SPEs
What about the Software?
Makes use of special IBM “Hypervisor”
Like an OS for OSs
Runs both a real time OS (for sound) and non-real time (for things like AI)
Software must be specially coded to run well
The single PPE will be quickly bogged down
Must make use of SPEs wherever possible
This isn’t easy, by any standard
What about Microsoft?
Development suite identifies which 6 threads you’re expected to run
Four of them are DirectX based, and handled by the OS
Only need to write two threads, functionally
http://ps3forums.com/showthread.php?t=22858
Next Lecture and Reminders
Reminders
Final is Wednesday, May 2 from 1-2:50 PM in ITT 328