Többmagos processzorok

advertisement
5. Elágazás kezelés
3.1. Bevezetés:
Figure 1.: Basic scheme of instruction fetching in sequential processors
Figure 2.: Basic scheme of instruction fetching using branch prediction
Figure 3.: Straightforward processing of an unconditional branch on a short pipeline
Figure 4.: Straightforward processing of a conditional branch on a short pipeline
Figure 5.: Straightforward processing of a conditional branch on a long pipeline
Figure 6.: Pipeline stages in Intel’s and AMD’s processors
Figure 7.: Performance (P) and dissipation vs. number of pipeline stages
3.2. Áttekintés és Lokális:
Figure 8.: Overview of basic prediction mechanisms for predicting the branch direction
Figure 9.: Local prediction
Figure 10.: Global prediction (Prediction depends on the actual path)
Figure 11.: Hardware prediction
Figure 12.: Principle of the local, dynamic prediction with 1-bit branch history
Figure 13.: Alternatives to implement 2-level local branch prediction
Figure 14.: The principle of Pentium Pro’s 128x4 way set associative BHT
Figure 15.: The actual layout of Pentium Pro’s 128x4 way set associative BHT
3.3. Globális:
Figure 16.: Straightforward global prediction
Figure 17.: Principle of the Gshare prediction
Figure 18.: Principle of the Gselect prediction
Figure 19.: Dynamic branch prediction of Athlon (K7) using the Gselect (Gshare)
mechanism
3.4. Global choice:
Figure 20.: Dynamic branch prediction in the Athlon-64 (K8) using the Gselect (Gshare)
mechanism
Figure 21.: Principle of the combined prediction using local and global prediction at the
same time (used in the Alpha 21264, or the POWER 4)
Figure 21.: Principle of the combined prediction using local and global prediction at the
same time
(used in the Alpha 21264, or the POWER 4)
Figure 22.: Implementation alternatives of choice prediction
Figure 23.: The principle of the combined branch prediction of the POWER 4
3.5. Advanced features:
Figure 24.: Advanced branch prediction features of the Pentium family
Figure 25.: Advanced branch prediction features of AMD’s K6/K7/K8 family
Figure 26.: Advanced branch prediction features of IBM’s POWER family
3.6. Áttekintő összefoglalás:
Figure 27.: Overview of the implemented schemes for predicting the branch direction
3.7. Többszörös eljárások:
Figure 28.: Assumed Simplified branch prediction scheme of the K7, without showing the
global prediction (A: address bit, C: Conditional branch, W: Way)
Figure 29.: Assumed simplified branch prediction scheme of the K8, without showing the
global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start
address)
Figure 30.: The physical implementation of branch prediction in Intel’s Northwood and
Prescott processors
Figure 31.: Instruction Cache: More then instructions alone
Figure 32.: Chapter 4, Opteron’s Instruction Cache and Decoding
3.8. BTA:
Figure 33.: Alternatives to generate the BTA
Figure 34.: The principle of branch prediction using a BHT and a BTAC (C: counter)
Figure 35.: Alternative ways to access BHTs/BTACs
+
Sequential
I F A R
IIFA
Branch
I$
BTA
IB
Further processing
Figure 1.: Basic scheme of instruction fetching in sequential processors
+
I F A R
IIFA
N
Y
I$
Branch
fetch
BTA
Branch
prediction
IB
Further processing
Figure 2.: Basic scheme of instruction fetching using branch prediction
Unconditional
branh instruction
F
D
W
E
Branch
BT A
2 bubbles
Fetch BT I
Figure 3.: Straightforward processing of an unconditional branch on a short pipeline
Conditional
branh instruct ion
F
D
E
W
Branch
Condition ?
BT A
Fetch BT I
3 bubbles
Figure 4.: Straightforward processing of a conditional branch on a short pipeline
F1
F2
F3
D1
E1
E2
Condition ?
BT A
Fetch BT I
Large number of bubbles
Figure 5.: Straightforward processing of a conditional branch on a long pipeline
No of pipeline stages
40
30
20
Pentium Pro
(~12)
Pentium
K6
*
(6)
(5)
*
*
10
1990
1995
Pentium 4
(~20)
*
Athlon
(6)
P4 Prescott
(~30)
*
Conroe
(14)
*
Athlon-64
(12)
*
*
2000
2005
Figure 6.: Pipeline stages in Intel’s and AMD’s processors
Year
P/D
P
D
10
20
30
Nr. of pipeline stages
40
Figure 7.: Performance (P) and dissipation vs. number of pipeline stages
Prediction of the branch direction
Local
Global
(2-level)
1-level
2-level
Static
Compiler hints
Straightforward global
Gshare
Dynamic
HW-prediction
1-bit
2-bit
Figure 8.: Overview of basic prediction mechanims for predicting the branch direction
Gselect
?
Figure 9.: Local prediction
1
0
0
Path 2:
.
.
0
0
Path 1:
0
0
.
0
0
?
Figure 10.: Global prediction
(Prediction depends on the actual path)
.
1
0
0
D < 0 Taken
?
D>0
Not taken
Figure 11.: Hardware prediction
BHT
IFA:
x
}
x: 0: sequential cont.
1: branch
Figure 12.: Principle of the local, dynamic prediction with 1-bit branch history
2-level local branch prediction
With a shared global history
table for all patterns
With individual history
tables for different patterns
(Alpha 21264)
(Pentium Pro)
IFA:
IFA:
Local BHT
(e.g. 1K*10bit)
Local BHT
(e.g. 1K*3bit)
Local BHT
(e.g. 128*4bit)
Local BHT
e.g. 4-ways each
Figure 13.: Alternatives to implement 2-level local branch prediction
76
0
BTA
(linear)
Tag
BHT
Index
127
Way 2
Way 3
Way 0
Way 1
0 1 0 0
15
0
xx
xx:
00/01 not taken
10/11 taken
Tags
History
4-bit
Tags
History
4-bit
Tags
History
4-bit
0
Counters
Figure 14.: The principle of Pentium Pro’s 128x4 way set associative BHT
Tags
History
4-bit
127
0
Tag
Tag
Tag
Tag
H
C
H
C
H
C
H
Figure 15.: The actual layout of Pentium Pro’s 128x4 way set associative BHT
C
Global history
(shift register)
0
1
1
0
0
1
1
BHT
x
Branch history
Figure 16.: Straightforward global prediction
Global history
IFA
...
0
1
1
0
0
1
1
1
0
0
1
1
0
0
}
XOR
BHT
x
Figure 17.: Principle of the Gshare prediction
Branch history
Global history
0
1
1
0
0
1
1
BHT
Branch history
x
... 1
IFA:
0 1 1 0
Figure 18.: Principle of the Gselect prediction
0
BHT
(4K * 2 bit)
8
Global history
31
IFA:
7 4 3
0 11 0
0
Subsequent
fetch blocks
16
xx
} xx:
00/01: Not taken
10/11: Taken
256
Figure 19.: Dynamic branch prediction of Athlon (K7) using the Gselect (Gshare) mechanism
8
BTH
Global history
47
IFA:
8 7
4 3
(4*4K*2bit)
0
16
xx
256
?
256
Branch 2
256
Branch 1
256
Branch 0
Figure 20.: Dynamic branch prediction in the Athlon-64 (K8) using the Gselect (Gshare) mechanism
Global history
IFA:
Local
BHT
Global
BHT
IFA:
Best choice
BHT
x
Local prediction
Global prediction
Resulting prediction
Figure 21.: Principle of the combined prediction using local and global prediction at the same time
(used in the Alpha 21264, or the POWER 4)
Global history
IFA:
Local
BHT
IFA:
Global
BHT
Best choice
BHT
x
Local prediction
Resulting prediction
Global prediction
Global
prediction
Local
prediction
Actual
prediction
Figure 21.: Principle of the combined prediction using local and global prediction at the same time
(used in the Alpha 21264, or the POWER 4)
Choice prediction
1. prediction
Alpha 21264
2. level local history prediction with a
shared counter table for all patterns
(1K * 10 bits/1K * 3 bits)
1-level local history prediction
POWER 4
(16K * 1-bit)
PA-8500/8700
2. prediction
Choice
Simple 2-level global prediction
Global history referenced choice table
(12-bit global history/4K * 2 bits)
(12-bit global history/4K * 2-bits)
2-level Gshare global prediction
(11-bit global history is hashed with
the IFA, 16K * 1-bit counter table)
Compiler hints
Figure 22.: Implementation alternatives of choice prediction
Accessed in the same way as the
global counter table
(16K * 1-bit)
2-bits
11-bit global history
...
18
5
0
1
1
0
0
1
1
1
0
XOR
0 1
1
0
0
}
IFA
IFA:
BHT
14
14
14
16K*1bit
16K*1bit
16K*1bit
Local History
Selector Table
Global History
Update
Local
prediction
Select the better
Global
prediction
Figure 23.: The principle of the combined branch prediction of the POWER 4
1-bit per group
Alpha 21264 Branch Hybrid
Prediction
gshare
P1
P2
EECC551 - Shaaban
#41 lec # 5 Winter 2005 1-9-2006
Pentium Pro
PII/PIII
P4 W/N
P4 P
RAS
Y
Y
Y
Y
Branch hints
-
-
Y
Y
Loop detector
-
-
-
-
Indir. branch pred.
-
-
-
Y
Figure 24.: Advanced branch prediction features of the Pentium family
K6
K7
K8
RAS
Y
Y
Y
Branch hints
-
-
-
Loop detector
-
-
-
Indir. branch pred.
-
-
-
Figure 25.: Advanced branch prediction features of AMD’s K6/K7/K8 family
POWER 3
POWER 4
Return Stack
Link Stack*
2x Return Stack*
Branch hints
-
Y
Y
Loop detector
-
Count Stack*
Indir. branch pred.
-
RAS
-
POWER 5
Count Stack*
-
* Supported by compiler hints
Figure 26.: Advanced branch prediction features of IBM’s POWER family
Prediction of the branch direction
Basic prediction mechanism
Local
2-level
1-level
S tatic
Dynamic
Compiler
hints
1-bit
S hared
counters
2-bit
Individual
counters
S imple
global
Auxiliary combined mechanism
(Advanced branch prediction features)
Global
Combined
2-level
Choice
prediction
Gshare
Preemptive
use of
compiler
hints
Dedicated prediction
RAS
Pentium
Pentium Pro
(512*2)
Pentium Pro
Pentium Pro
P4 Will/Northw.
(4K*2)
P4 Will/Northw.
P4 Will/Northw. P4 Will/Northw. P4 Will/Northw.
P4 Prescott
(4K*2)
P4 Prescott
P4 Prescott
P4 Prescott
P4 Prescott
K6
K6
(8K*2)
K6
(16-entries)
K7
K7
K7
(12-entries)
K8
K8
(16K*2)
K8
(12-entries)
PPC 604
PPC 604
(512*2)
PPC 620
PPC 620
(2K*2)
POWER 3
POWER 3
(2K*2)
PPC 620
(POWER 4)
(11-bit/16K*1)
POWER 4
PPC 620
POWER 4
POWER 4
Alpha 21164
(12-entries)
(Alpha 21264)
(1K*10/1K*3)
Alpha 21264
(Alpha 21264)
(12its/4K*2)
Alpha 21264
(32-entries)
Alpha 21264
PA-8000
(256*3)
PA-8000
PA-8500/8700
UltraS PARC-III
POWER 4
Alpha 21164
(2K*2)
PA-8000
P4 Prescott
PPC 604
(POWER 4)
(16K*1)
Alpha 21164
PA-8500/8700
(UltraS PARC-III)
Indirect
branch
3-bit
Pentium Pro
POWER 4
For cycles
Gselect
Pentium
(256*2)
Pentium
Backup
use of
static
prediction
UltraS PARC-III
(12-bits/16K*2)
UltraS PARC-III UltraS PARC-III
(8-entries)
Figure 27.: Overview of the implemented schemes for predicting the branch direction
POWER 4
31
15 14
IFA:
Tag
31
4 30
2-way set associative I$
14 13
43 0
BTAC
IFA:
Tag
Index
BTA 0
BTA 1
Index
1K x
2 addr.
I
F
A
R
IFA [13:4]
1K*16B
fetch blocks
Way 0
Way 1
IFA [14:4]
IFA [14:4]
[31:15]
[31:15]
16 b
16 b
S elector Table
BTA
(Exec.)
(shared for the
upper and lower
parts of the I$)
1K*16B
fetch blocks
BTA1
BTA0
Fetch unit
(during predecoding)
Tags
15
16B+P
IFA [3:0]
16 B Fetch
block
16B+P
0
16 B Selector
block
Tags
IFA [3:1]
A 15
0
IFA14 W: 31
BTA
0 C:
BTA
x x
32-bit
Decode and issue instructions
beginning with the given address
Sequential
(no branch)
12
entries
RET
BTA 1
BTA 0
Take or not according
to the global prediction
(cond. branch)
Take the branch
(uncond. branch)
RAT
RET address
Figure 28.: Assumed Simplified branch prediction scheme of the K7, without showing the global prediction
(A: address bit, C: Conditional branch, W: Way)
+16
31
15 14
IFA:
Tag
31
4 30
2-way set associative I$
14 13
43 0
IFA:
Tag
Index SA
Index
BTAC
?
BTA 2
BTA 1
BTA 0
I
F
A
R
512 x
4 addr.
1K*16B
fetch blocks
IFA [12:4]
Way 1
Way 0
S elector Table
IFA [14:4]
IFA [14:4]
[31:15]
[31:15]
BTA
calculator
?
BTA2
BTA1
BTA0
1K*16B
fetch blocks
Tags
15
16B+P
16 b
SA
Predecoding
SA [3:0]
16 B Fetch
block
16 b
0
16 B Selector
block
16B+P
Tags
IFA [3:1]
15
0
x x
Old IFA 13 14 W 14
New IFA
0 RC
BTA
11-bit
Decode and issue instructions
beginning with the given address
Sequential
(no branch)
BTA2/RET
BTA1/RET
BTA0/RET
12
entries
RAT
Take or not according
to the global prediction
(cond. branch)
Take the branch
(uncond. branch)
RET address
Figure 29.: Assumed simplified branch prediction scheme of the K8, without showing the global prediction
(C: Conditional branch, R: Return, W: Way 0/1, SA: Start address)
+ 16
Figure 30.: The physical implementation of branch prediction in Intel’s Northwood and Prescott
processors
Figure 31.: Instruction Cache: More then instructions alone
Figure 32.: Chapter 4, Opteron’s Instruction Cache and Decoding
BTA
Calculated on the fly
Examples
Ultra SPARC III
K6
Power 4, 5
Accessed from BTAC
From the I$
PPro/PII/PIII/P4
21264
K7/K8
Power 3
Figure 33.: Alternatives to generate the BTA
+
IFA:
I$
IFA:
BHT
BTAC
IFA:
Tag
I
F
A
R
Update BTAC
(create/delete
BTAC entry)
IB
C
Further processing
Update BHT with branch result
Tags
BTA
Update BTAC with BTA if BHT initiates it.
Figure 34.: The principle of branch prediction using a BHT and a BTAC
(C: counter)
if BTAC misses
IIFA
BTA if mispred.
if BTAC hits
Accessing BHTs/BTACs
Cache-like access
(direct / set associative)
Indexed access
IFA:
Associative access
IFA:
Index
BHT
C
(Counters)
For large tables most branches will
map to a unique entry.
For smaller tables multiple branches
may map to the same entry, resulting
in interferences and thus in degrated
prediction accuracy.
IFA:
Tags
Index
IFA
Tags
C
Tags
C
IFA
Reduces interferences but increases cost.
Avoids interference but stronly increases cost.
Examples:
16K entry local BHT (Power4)
16K entry global BHT (Power4)
16K entry selector table (Power4)
C
(E.g. two-way set associative)
128*4 way BHT/BTAC (Pentium Pro)
1K*4 way BHT/BTAC (Pentium II, III, 4)
128*2 way BTAC (Power3)
Figure 35.: Alternative ways to access BHTs/BTACs
64K entry BTAC (PPC 604)
Irodalomjegyzék
Többszálas processzorok
1. J.M. Borkenhagen at all: “A multithreaded PowerPC processor for commercial servers”, IBM. J. Res.
Develop. Vol.44, No. 6., November 2005, pp. 885-898.
2. MAJC
3. C. McNary, R. Bhatia: Montecito: “A dual-core, dual-thread Itanium processor”, IEEE Micro, MarchApril 2005,pp. 10-20.
4. P. Kongetira at all: “Niagara: a 32-way multithreaded sparc processor”, IEEE Micro, March-April 2005,
pp. 21-29.
5. Sun’s UltraSparc T1 – the Next Generation Server CPUs, December 29, 2005, 8 pages.
6. R. Hetherington: “The UltraSPARC T1 Processor – Power Efficient Throughput Computing”, Sun
Microsystems, December 2005, pp. 3-11.
7. R. Cross: “Hyper-Threading Technology”, Intel Technology Journal, Volume 06, Issue 01, 2002, pp. 4-16.
8. 21464
9. B. Sinharcy at all: “Power5 system microarchitecture”, IBM J. Res. & Dev. Vol. 49. No. 4/5,
July/September 2005, pp. 505-521.
10. “Fundamentals of Multithreading”, SystemLogic.net, June 2001, 19 pages.
Többmagos processzorok
1. “Intel Core Microarchitecture Design Goals”, Inside Intel Core Microarchitecture White Paper, pp. 3-12.
2. “The UltraSPARC IV Processor Architecture”, pp. 1-8.
3. J.M. Tendler at all: “Power4 system microarchitecture”, IBM J. Res. & Dev. Vol. 46, No. 1, January 2002,
pp. 5-25.
4. “Lostcircuits HP PA-8800 RISC Processor”, Review by MS, October 19, 2001. 12 pages.
5. “UltraSPARC IV Processor Architecture Overview”, Technical Whitepaper, Sun Microsystems, February
2004, pp. 1-8.
Download