CSCE 513
Computer Architecture
Lecture 16
Instruction Level Parallelism:
Hyper-threading and limits
Topics
 Hardware threading
 Limits on ILP
 Readings

November 30, 2015
Overview
Last Time
 pthreads
 Readings for GPU programming
  Stanford (iTunes): http://code.google.com/p/stanford-cs193g-sp2010/
  UIUC ECE 498 AL: Applied Parallel Programming
   http://courses.engr.illinois.edu/ece498/al/
   Book (online): David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
   http://courses.engr.illinois.edu/ece498/al/Syllabus.html
New
 Back to Chapter 3
 Topics revisited: multiple issue; Tomasulo's data hazards on the address field
 Hyperthreading
 Limits on ILP
CSAPP – Bryant & O'Hallaron
T1 ("Niagara")
Target: Commercial server applications
 High thread-level parallelism (TLP)
  Large numbers of parallel client requests
 Low instruction-level parallelism (ILP)
  High cache miss rates
  Many unpredictable branches
  Frequent load-load dependencies
Power, cooling, and space are major concerns for data centers
 Metric: Performance/Watt/sq. ft.
Approach: Multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2
T1 Architecture
Also ships with 6 or 4 cores
T1 pipeline
Single-issue, in-order, 6-deep pipeline: F, S, D, E, M, W
3 clock-cycle delays for loads & branches
Shared units:
 L1, L2
 TLB
 X (execution) units
 pipe registers
Hazards:
 Data
 Structural
T1 Fine-Grained Multithreading
Each core supports four threads and has its own level-one caches (16 KB for instructions and 8 KB for data)
Switches to a new thread on each clock cycle
Idle threads are bypassed in the scheduling (see the sketch below)
 A thread is idle when it is waiting because of a pipeline delay or cache miss
The processor is idle only when all 4 threads are idle or stalled
Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
A single set of floating-point functional units is shared by all 8 cores
 Floating-point performance was not a focus for T1
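To make the scheduling policy concrete, here is a minimal sketch (not the actual T1 hardware; the thread count, the stall model, and names such as select_thread are illustrative assumptions) of a per-cycle round-robin selector that skips threads stalled on a load/branch delay or cache miss:

    /* Minimal sketch of fine-grained multithreaded thread selection.
     * All names and the stall model are illustrative, not the T1 RTL. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    typedef struct {
        bool ready;        /* false while waiting on a load/branch delay or cache miss */
        int  stall_cycles; /* remaining cycles until the thread becomes ready again    */
    } thread_state;

    /* Pick the next ready thread after 'last', wrapping around; -1 if none is ready. */
    static int select_thread(const thread_state t[NUM_THREADS], int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int cand = (last + i) % NUM_THREADS;
            if (t[cand].ready)
                return cand;
        }
        return -1; /* all threads stalled: the core issues nothing this cycle */
    }

    int main(void) {
        thread_state t[NUM_THREADS] = {
            {true, 0}, {false, 3}, {true, 0}, {false, 1}  /* example initial state */
        };
        int last = NUM_THREADS - 1;

        for (int cycle = 0; cycle < 8; cycle++) {
            int chosen = select_thread(t, last);
            if (chosen >= 0) {
                printf("cycle %d: issue from thread %d\n", cycle, chosen);
                last = chosen;
            } else {
                printf("cycle %d: all threads stalled, core idle\n", cycle);
            }
            /* Advance the stall timers of the waiting threads. */
            for (int i = 0; i < NUM_THREADS; i++) {
                if (!t[i].ready && --t[i].stall_cycles <= 0)
                    t[i].ready = true;
            }
        }
        return 0;
    }

With the example initial state, the core issues from threads 0 and 2 while threads 1 and 3 wait out their stalls; this is how the 3-cycle load and branch delays get hidden whenever other threads have work.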
Memory, Clock, Power
16 KB 4-way set-associative I$ per core
8 KB 4-way set-associative D$ per core
3 MB 12-way set-associative shared L2 $
 4 x 750 KB independent banks
 Crossbar switch connects the cores to the banks
 2-cycle throughput, 8-cycle latency
 Direct link to DRAM & JBus
 Manages cache coherence for the 8 cores
  CAM-based directory
 Write-through: allocate on loads, no-allocate on stores
Coherency is enforced among the L1 caches by a directory associated with each L2 cache block
 Used to track which L1 caches have copies of an L2 block
 By associating each L2 with a particular memory bank and enforcing the subset property, T1 can place the directory at the L2 rather than at the memory, which reduces the directory overhead
 Because the L1 data cache is write-through, only invalidation messages are required; the data can always be retrieved from the L2 cache (a rough sketch of this flow follows)
1.2 GHz at 72 W typical, 79 W peak power consumption
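A rough sketch of the invalidation flow described above, assuming a per-L2-block directory that records which L1 data caches hold a copy (the structure and function names, such as l2_block and handle_store, are illustrative, not the actual T1 design):

    /* Illustrative model of a per-L2-block directory for write-through L1s.
     * An 8-bit mask records which of the 8 cores' L1 D-caches hold a copy. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CORES 8

    typedef struct {
        uint8_t l1_copies;  /* bit i set => core i's L1 may hold this block */
        /* data, tag, and state bits would live here in a real cache */
    } l2_block;

    /* A store from 'core' writes through to L2; every other L1 with a copy
     * needs only an invalidation, since the up-to-date data is now in L2.
     * Stores do not allocate, so the writer's own copy (if any) is simply
     * updated in place and its directory bit is left unchanged. */
    static void handle_store(l2_block *blk, int core) {
        for (int i = 0; i < NUM_CORES; i++) {
            if (i != core && (blk->l1_copies & (1u << i))) {
                printf("invalidate L1 copy in core %d\n", i);
                blk->l1_copies &= (uint8_t)~(1u << i);
            }
        }
    }

    /* A load miss from 'core' fills its L1 and records the new sharer. */
    static void handle_load_miss(l2_block *blk, int core) {
        blk->l1_copies |= (uint8_t)(1u << core);
    }

    int main(void) {
        l2_block blk = { .l1_copies = 0 };
        handle_load_miss(&blk, 0);
        handle_load_miss(&blk, 3);
        handle_store(&blk, 0);      /* invalidates core 3's copy only */
        return 0;
    }

Because the L1 is write-through and does not allocate on stores, the directory only ever has to send invalidations; the current data is always available from the L2 bank.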
Miss Rates: L2 Cache Size, Block Size
[Figure: T1 L2 miss rate (0%–2.5%) on TPC-C and SPECJBB for L2 caches of 1.5 MB, 3 MB, and 6 MB with 32 B and 64 B blocks]
Miss Latency: L2 Cache Size, Block Size
[Figure: T1 L2 miss latency (0–200) on TPC-C and SPECJBB for L2 caches of 1.5 MB, 3 MB, and 6 MB with 32 B and 64 B blocks]
CPI Breakdown of Performance

Benchmark     Per-thread CPI   Per-core CPI   Effective CPI for 8 cores   Effective IPC for 8 cores
TPC-C         7.20             1.80           0.23                        4.4
SPECJBB       5.60             1.40           0.18                        5.7
SPECWeb99     6.60             1.65           0.21                        4.8
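The columns are related by simple arithmetic: with four threads per core, the per-core CPI is the per-thread CPI divided by 4, the chip-wide effective CPI is the per-core CPI divided by the 8 cores, and the effective IPC is its reciprocal. For TPC-C:

\[
\mathrm{CPI}_{\text{core}} = \frac{7.20}{4} = 1.80, \qquad
\mathrm{CPI}_{\text{chip}} = \frac{1.80}{8} = 0.225 \approx 0.23, \qquad
\mathrm{IPC}_{\text{chip}} = \frac{1}{0.225} \approx 4.4
\]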
Not Ready Breakdown
[Figure: fraction of cycles a thread is not ready (0%–100%), broken down into L1 I miss, L1 D miss, L2 miss, pipeline delay, and other, for TPC-C, SPECJBB, and SPECWeb99]
Within the "other" category:
 TPC-C: store buffer full is the largest contributor
 SPEC-JBB: atomic instructions are the largest contributor
 SPECWeb99: both factors contribute
Performance: Benchmarks + Sun Marketing

Benchmark \ Architecture: Sun Fire T2000 vs. IBM p5-550 (two dual-core Power5 chips) vs. Dell PowerEdge

SPECjbb2005 (Java server software, business operations/sec):
 63,378 vs. 61,789 vs. 24,208 (SC1425 with dual single-core Xeon)
SPECweb2005 (Web server performance):
 14,001 vs. 7,881 vs. 4,850 (2850 with two dual-core Xeon processors)
NotesBench (Lotus Notes performance):
 16,061 vs. 14,740 (no Dell result given)
HP marketing view of T1 Niagara
1. Sun’s radical UltraSPARC T1 chip is made up of individual
cores that have much slower single thread performance when
compared to the higher performing cores of the Intel Xeon,
Itanium, AMD Opteron or even classic UltraSPARC
processors.
2. The Sun Fire T2000 has poor floating-point performance, by
Sun’s own admission.
3. The Sun Fire T2000 does not support commercial Linux or
Windows® and requires a lock-in to Sun and Solaris.
4. The UltraSPARC T1, aka CoolThreads, is new and unproven,
having just been introduced in December 2005.
5. In January 2006, a well-known financial analyst downgraded
Sun on concerns over the UltraSPARC T1’s limitation to only
the Solaris operating system, unique requirements, and
longer adoption cycle, among other things. [10]
Where is the compelling value to warrant taking such a risk?
Microprocessor Comparison

Processor                           SUN T1         Opteron      Pentium D       IBM Power5
Cores                               8              2            2               2
Instruction issues / clock / core   1              3            3               4
Peak instr. issues / chip           8              6            6               8
Multithreading                      Fine-grained   No           SMT             SMT
L1 I/D in KB per core               16/8           64/64        12K uops/16     64/32
L2 per core / shared                3 MB shared    1 MB/core    1 MB/core       1.9 MB shared
Clock rate (GHz)                    1.2            2.4          3.2             1.9
Transistor count (M)                300            233          230             276
Die size (mm2)                      379            199          206             389
Power (W)                           79             110          130             125
Performance Relative to Pentium D
[Figure: performance of the Power5, Opteron, and Sun T1, normalized to the Pentium D (scale 0 to ~6.5), on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05, and a TPC-C-like benchmark]
Performance/mm², Performance/Watt
[Figure: efficiency of the Power5, Opteron, and Sun T1, normalized to the Pentium D (scale 0 to ~5.5), on SPECIntRate, SPECFPRate, SPECJBB05, and TPC-C, each measured per Watt and per mm²]
Niagara 2
Improves performance by increasing the threads supported per chip from 32 to 64
 8 cores * 8 threads per core
Floating-point unit for each core, not just one per chip
Hardware support for the encryption standards AES and 3DES, and for elliptic-curve cryptography
Niagara 2 will add a number of 8x PCI Express interfaces directly into the chip, in addition to integrated 10 Gigabit Ethernet XAUI interfaces and Gigabit Ethernet ports.
Integrated memory controllers will shift support from DDR2 to FB-DIMMs and double the maximum amount of system memory.

Kevin Krewell, "Sun's Niagara Begins CMT Flood – The Sun UltraSPARC T1 Processor Released", Microprocessor Report, January 3, 2006
Amdahl's Law Paper
Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
How long is the paper?
How much of it is Amdahl's Law?
What other comments does it make about parallelism besides Amdahl's Law?
Parallel Programmer Productivity
Lorin Hochstein et al., "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers." International Conference for High Performance Computing, Networking and Storage (SC'05), Nov. 2005
What did they study?
What is the argument that novice parallel programmers are a good target for high-performance computing?
How can one account for variability in talent between programmers?
What programmers were studied?
What programming styles were investigated?
How big was the multiprocessor?
How was quality measured?
How was cost measured?
GUSTAFSON's Law
Amdahl's Law: Speedup = (s + p) / (s + p/N) = 1 / (s + p/N), normalizing so that s + p = 1
 N = number of processors
 s = sequential (serial) fraction of time
 p = parallel fraction of time
For N = 1024 processors, even a small serial fraction caps the speedup (see the sketch below)
http://mprc.pku.edu.cn/courses/architecture/autumn2005/reevaluating-Amdahls-law.pdf
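Gustafson's scaled-speedup formulation from the linked paper keeps the run time fixed and grows the problem, giving scaled speedup = s + pN, with s and p measured on the parallel system. A minimal sketch comparing the two formulas at N = 1024 (the serial fractions chosen here mirror the exercise two slides ahead):

    /* Small sketch contrasting Amdahl's fixed-size speedup with Gustafson's
     * scaled speedup for N = 1024 processors. */
    #include <stdio.h>

    int main(void) {
        const double N = 1024.0;
        const double serial[] = { 0.0, 0.01, 0.10 };

        for (int i = 0; i < 3; i++) {
            double s = serial[i];
            double p = 1.0 - s;
            double amdahl    = 1.0 / (s + p / N);  /* fixed problem size    */
            double gustafson = s + p * N;          /* problem scaled with N */
            printf("s = %4.0f%%  Amdahl speedup = %8.1f  Gustafson scaled speedup = %8.1f\n",
                   s * 100.0, amdahl, gustafson);
        }
        return 0;
    }

At s = 1%, Amdahl's fixed-size speedup is already capped near 91, while the scaled speedup stays close to 1014.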
Scale the problem
Matrix Multiplication revisited

for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
        for (k = 0; k < n; ++k) {
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
    }
}

Note: n³ multiplications, n³ additions, 4n³ memory references?
How can we improve the code?
 Stride through A? B? C?
 Do the references to A[i][k] and C[i][j] work together or against each other on the miss rate?
 Blocking (see the sketch below)
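One standard improvement is blocking (tiling), so that a tile of C stays cache-resident while tiles of A and B stream through it. A minimal sketch, assuming C99 variable-length array parameters and an illustrative tile size BLOCK that would be tuned to the cache:

    /* Cache-blocked (tiled) version of the same triple loop. */
    #define BLOCK 32

    void matmul_blocked(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int kk = 0; kk < n; kk += BLOCK)
                    /* Multiply one BLOCK x BLOCK tile of A by one tile of B,
                     * accumulating into a tile of C that stays in cache. */
                    for (int i = ii; i < ii + BLOCK && i < n; ++i)
                        for (int j = jj; j < jj + BLOCK && j < n; ++j) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + BLOCK && k < n; ++k)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }

Only the iteration order over tiles is new; the arithmetic inside is unchanged, so the operation count stays at n³ multiplies and adds while the miss rate drops.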
GUSTAFSON's Law model scaling
Example
Suppose a model is dominated by a matrix multiply: a given n x n matrix is multiplied a large constant number (k) of times.
 kn³ multiplies and adds (ignore memory references)
If a model of size n = 1024 can be executed in 10 minutes on one processor, then using Gustafson's Law, how big can the model be and still execute in 10 minutes on 1024 processors
 with 0% serial code?
 with 1% serial code?
 with 10% serial code?
(one way to set this up is sketched below)
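One way to set this up (an interpretive assumption: the serial fraction s is measured on the 1024-processor run, as in Gustafson's formulation) is that in the same 10 minutes the machine can complete s + (1 - s)N times the single-processor work; since the work grows as n³, the feasible problem size is

\[
n' = n\,\bigl(s + (1-s)N\bigr)^{1/3}, \qquad n = 1024,\ N = 1024
\]
\[
s = 0{:}\ n' \approx 1024 \cdot 1024^{1/3} \approx 10{,}321 \qquad
s = 0.01{:}\ n' \approx 1024 \cdot 1013.8^{1/3} \approx 10{,}287 \qquad
s = 0.10{:}\ n' \approx 1024 \cdot 921.7^{1/3} \approx 9{,}966
\]

With no serial code the model grows by a factor of about 10 in each dimension, and even 10% serial code costs only a few percent of that growth, which is Gustafson's point about scaled problems.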
Figure 3.26 ILP available in a perfect processor for six of the SPEC92 benchmarks. The first three
programs are integer programs, and the last three are floating-point programs. The floating-point programs
are loop intensive and have large amounts of loop-level parallelism.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.30 The relative change in the miss rates and miss latencies when executing with one thread per
core versus four threads per core on the TPC-C benchmark. The latencies are the actual time to return the
requested data after a miss. In the four-thread case, the execution of other threads could potentially hide much of
this latency.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.31 Breakdown of the status on an average thread. “Executing” indicates the thread issues an
instruction in that cycle. “Ready but not chosen” means it could issue but another thread has been chosen, and
“not ready” indicates that the thread is awaiting the completion of an event (a pipeline delay or cache miss, for
example).
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.32 The breakdown of causes for a thread being not ready. The contribution to the "other" category varies. In
TPC-C, store buffer full is the largest contributor; in SPEC-JBB, atomic instructions are the largest contributor; and in
SPECWeb99, both factors contribute.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.35 The speedup from using multithreading on one core on an i7 processor averages 1.28 for the Java
benchmarks and 1.31 for the PARSEC benchmarks (using an unweighted harmonic mean, which implies a
workload where the total time spent executing each benchmark in the single-threaded base set was the same).
The energy efficiency averages 0.99 and 1.07, respectively (using the harmonic mean). Recall that anything above 1.0 for
energy efficiency indicates that the feature reduces execution time by more than it increases average power. Two of the
Java benchmarks experience little speedup and have significant negative energy efficiency because of this. Turbo Boost is
off in all cases. These data were collected and analyzed by Esmaeilzadeh et al. [2011] using the Oracle (Sun) HotSpot
build 16.3-b01 Java 1.6.0 Virtual Machine and the gcc v4.4.1 native compiler.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline
depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six
independent functional units can each begin execution of a ready micro-op in the same cycle.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Fermi (2010)
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM Bandwidth
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, Spring 2010 University of Illinois, Urbana-Champaign