add r4, r6, r8

advertisement
C
S
D
A
2/e
Chapter 5 Overview




The principles of pipelining
A pipelined design of SRC
Pipeline hazards
Instruction-level parallelism (ILP)



Superscalar processors
Very Long Instruction Word (VLIW) machines
Microprogramming


Control store and micro-branching
Horizontal and vertical microprogramming
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Bölüm 5 Genel Bakış




Pipeline mimarisinin esasları
SRC nin pipeline tasarımı
Pipeline riskleri
Instruction-level parallelism (ILP)



Superscalar işlemciler
Very Long Instruction Word (VLIW) makineleri
Microprogramming


Control store ve micro-branching
Horizontal(Yatay) ve vertical(Dikey) microprogramming
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.1 Executing Machine Instructions vs. Manufacturing
Small Parts
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
The Pipeline Stages

5 pipeline stages are shown






1. Fetch instruction
2. Fetch operands
3. ALU operation
4. Memory access
5. Register write
5 instructions are executing





shr
sub
add
st
ld
r3,
r2,
r4,
r4,
r2,
r3, 2
r5, r1
r3, r2
addr1
addr2
;storing result in r3
;idle, no mem. access needed
;adding in ALU
;accessing r4 and addr1
;instruction being fetched
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Pipeline Aşamaları

5 pipeline aşaması






1. Fetch instruction
2. Fetch operands
3. ALU işlemleri
4. Bellek erişimi
5. Register yazma
5 komut işleniyor





shr
sub
add
st
ld
r3,
r2,
r4,
r4,
r2,
r3, 2
r5, r1
r3, r2
addr1
addr2
; sonuç r3 e depolanır
; idle, bellek ulaşımına gerek yok
; ALU da toplama
; r4 ve addr1 e ulaşılması
; komutun getirilmesi
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Notes on Pipelining Instruction Processing





Pipeline stages are shown top to bottom in order traversed by
one instruction
Instructions listed in order they are fetched
Order of insts. in pipeline is reverse of listed
If each stage takes one clock:
- every instruction takes 5 clocks to complete
- some instruction completes every clock tick
Two performance issues: instruction latency, and instruction
bandwidth
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Pipeline Komut İşleme





Pipeline stages are shown top to bottom in order traversed by
one instruction
Komutlar fetch edildiği sırada listelenir.
Pipeline da komutlarun sırası listenin tersinedir.
Eğer her aşama bir clock tutarsa:
- her komut 5 clock da tamamlanır
- her clock da komut bitimi
İki performans konusu: komut gecikme süresi, ve komut bant
genşiliği
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Dependence Among Instructions


Execution of some instructions can depend on the completion of
others in the pipeline
One solution is to “stall” the pipeline



early stages stop while later ones complete processing
Dependences involving registers can be detected and data
“forwarded” to instruction needing it, without waiting for register
write
Dependence involving memory is harder and is sometimes
addressed by restricting the way the instruction set is used


“Branch delay slot” is example of such a restriction
“Load delay” is another example
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Komutlar Arasındaki Bağımlılık


Pipeline da bazı komutların işlenmesi, diğerlerinin bitimine
bağlıdır.
Bir çözüm “stall” (bekletme) dir



İlk aşamalar, sonrakiler işlemlerini bitirirken, beklerler.
Register ları içeren bağlılıklar, register yazması beklenmeden,
tespit edilebiilir ve veri kendine ihitiyaç olunan komuta forward
edilir.
Bellek içeren bağlılıklar daha zordur ve kullanılacak komut
kümesi yolunda kıstlamalar oluşturabilir.


“Branch delay slot” bu tip bir kıstlamaya örnek olabilie.
“Load delay” bir diğer örnektir
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Branch and Load Delay Examples
Branch Delay
brz r2, r3
add r6, r7, r8
st r6, addr1
This inst. always executed
Only done if r3  0
Load Delay
ld
add
shr
sub

r2, addr
r5, r1, r2
r1,r1,4
r6, r8, r2
This inst. gets “old”
value of r2
This inst. gets r2 value
loaded from addr
Working of instructions not changed, but way they work together is
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Branch ve Load Gecikme Örnekleri
Branch Gecikmesi
brz r2, r3
add r6, r7, r8
st r6, addr1
Bu komut herzaman işlenir
Sadece r3, 0 olmazsa (r3  0)
Load Gecikmesi
ld
add
shr
sub

r2, addr
r5, r1, r2
r1,r1,4
r6, r8, r2
Bu komut r2 nin eski değerini alır
Bu komut addr den r2 ye yüklenen
değeri alır
Komutların çalışması değişmez, fakat birlikte çalışma yolu
değişebilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Characteristics of Pipelined Processor Design

Main memory must operate in one cycle



Instruction and data memory must appear separate





Harvard architecture has separate instruction & data memories
Again, this is usually done with separate caches
Few buses are used


This can be accomplished by expensive memory, but
It is usually done with cache, to be discussed in Chap. 7
Most connections are point to point
Some few-way multiplexers are used
Data is latched (stored in temporary registers) at each pipeline
stage—called “pipeline registers.”
ALU operations take only 1 clock (esp. shift)
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Pipeline İşlemci Tasarımının Özellikleri

Ana bellek tek cycle da işlenmeli



Komut ve veri belleği ayrı görülmelidir





Harvard mimarisi ayrı komut ve veri belleklerine sahiptir.
Bu genelde ayrı cache lerde yapılır.
Az miktar da veri yolu kullanılır


Pahalı bellek kullanılması gerekir, fakat
Bu işlem cache ile genelde yapılır , to be discussed in Chap. 7
Pek çok bağlantı point to point dir.
Bazı few-way multiplexers(çoklayıcı) kullanılır.
Veri, her pipeline aşamasında tutulur (geçici registera depolanır)
—called “pipeline registers.”
ALU işlemleri sadece 1 clock alır. (esp. shift)
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Adapting Instructions to Pipelined Execution



All instructions must fit into a common pipeline stage structure
We use a 5 stage pipeline for the SRC
1) Instruction fetch
2) Decode and operand access
3) ALU operations
4) Data memory access
5) Register write
We must fit load/store, ALU, and branch instructions into this
pattern
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Komutların Pipeline Olarak İşlenmeye Adapte Edilmesi



Bütün komutlar genel bir pipeline aşama yapısına uymak
zorundadır.
Biz SRC için 5 aşamalı pipeline mimarisi kullancağız
1) Instruction fetch
2) Decode ve operand ulaşımı
3) ALU işlemleri
4) Veri Belleğine Ulaşım
5) Register yazma
Biz load/store, ALU ve branch komutlarını bu yapıya uyğun hale
getireceğiz.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.2 ALU Instructions fit into 5
D
Stages
A
2/e
• Second ALU operand comes
either from a register or
instruction register c2 field
• Op code must be available
in stage 3 to tell ALU what to
do
• Result register, ra, is written
in stage 5
• No memory operation
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
Fig 5.2 ALU Komutu 5 aşamada
D
A
2/e
• İkinci ALU işlemi register dan
veya c2 den gelebilir.
• Op code 3. aşamada ALU ya
ne yapılacağını söylemesi
için hazır olmalıdır.
• Somuç registeri ra ya 5.
aşamada yazılır.
• Bellek işlemi yoktur.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.4 Load and
D Store Instructions
A
2/e



ALU computes
effective addresses
Stage 4 does read
or write
Result reg. written
only on load
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
SFig 5.4 Load ve Store
D
Komutları
A
2/e



ALU computes
effective addresses
Aşama 4 read veya
write yapar
Sonuç reg. Sadece
load da yazılır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e



Fig 5.6 SRC Pipeline
Registers and RTN
Specification
The pipeline
registers pass info.
from stage to
stage
RTN specifies
output reg. values
in terms of input
reg. values for
stage
Discuss RTN at
each stage on
blackboard
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.6 SRC Pipeline
Registers and RTN
Specification



pipeline registerlar
aşamadan aşamaya
bilgi geçirirler.
RTN aşamlar için
output reg.
değerlerini, input
reg değerleri
açısından belirtir.
Discuss RTN at
each stage on
blackboard
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Global State of the Pipelined SRC





PC, the general registers, instruction memory, and data memory
is the global machine state
PC is accessed in stage 1 (& stage 2 on branch)
Instruction memory is accessed in stage 1
General registers are read in stage 2 and written in stage 5
Data memory is only accessed in stage 4
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Pipeline SRC de Global State





PC, the general registers, komut belleği, and veri belleği global
makine durumudur.
PC ye aşama 1 de ulaşılır. (& aşama 2 on branch)
Komut belleğine ye aşama 1 de ulaşılır.
Genel registers aşama 2 de okunur ve aşama 5 de yazılır.
Veri belleğine sadece aşama 4 de ulaşılır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Restrictions on Access to Global State by
Pipeline


We see why separate instruction and data memories (or
caches) are needed
When a load or store accesses data memory in stage 4,
stage 1 is accessing an instruction


Two operands may be needed from registers in stage 2
while another instruction is writing a result register in stage
5


Thus two memory accesses occur simultaneously
Thus as far as the registers are concerned, 2 reads and a
write happen simultaneously
Increment of PC in stage 1 must be overridden by a
successful branch in stage 2
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Pipeline: Global State e ulaşımda ki Kısıtlamalar
D
A
2/e


Neden ayrı komut ve veri bellekleri kullandığımızı gördük
Bir load veya store komutu stage 4 de veri belleğine
ulaşırken, aşama 1 de bir komut a ulaşılır.


Aşama 2 de register ların 2 tane operand ihtiyacı varken,
bir diğer komut aşama 5 de sonuç register ına veri yazar.


Böylece iki bellek ulaşımı eş zamanlı meydana gelir.
Böylece 2 read ve 1 write işlemi eş zaamnlı olarak gerçekleşir.
Aşama 2 deki başarılı bir branch işlemi, aşama 1 de PC nin
artırılmasını zorunlu kılar.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e


Fig 5.7 Pipeline
Data Path &
Control Signals
Most
control
signals
shown and
given
values
Multiplexer
control is
stressed in
this figure
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e


Fig 5.7 Pipeline
Data Path &
Control Signals
Pek çok kontrol
sinyali ve
değerleri
Çoklayıcı
kontrolü bu
figürde ön plana
çıkartılmıştır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
Example of Propagation of Instructions Through
S
Pipe
D
A
2/e
100:
104:
108:
112:
512:



add r4, r6, r8;
ld
r7, 128(r5);
brl
r9, r11, 001;
str
r12, 32;
......
sub ...
R[4]  R[6] + R[8];
R[7]  M[R[5]+128];
PC  R[11]: R[9]  PC;
M[PC+32]  R[12];
next instruction
It is assumed that R[11] contains 512 when the brl instruction is
executed
R[6] = 4 and R[8] = 5 are the add operands
R[5] =16 for the ld and R[12] = 23 for the str
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Pipe İşleminde komutların yayılmasına örnekler
D
A
2/e
100:
104:
108:
112:
512:



add r4, r6, r8;
ld
r7, 128(r5);
brl
r9, r11, 001;
str
r12, 32;
......
sub ...
R[4]  R[6] + R[8];
R[7]  M[R[5]+128];
PC  R[11]: R[9]  PC;
M[PC+32]  R[12];
Sonraki komut
brl komutu işlendiği zaman, R[11] in 512 içermesi beklenir
R[6] = 4 ve R[8] = 5 add operandlarıdır.
R[5] =16 ld için ve R[12] = 23 str için
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.8 Cycle 1
D add Enters Pipe
A
2/e

Program
counter is
incremented to
104
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.8 Cycle 1
D add Enters Pipe
A
2/e

PC 104 e
artıtılır.
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
SFig 5.9 Cycle 2
D ld Enters Pipe
A
2/e

add
operands
are fetched
in stage 2
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
SFig 5.9 Cycle 2
D ld Enters Pipe
A
2/e

Aşama 2 de
add operandları
getirildi.
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.10 Cycle 3
brl Enters Pipe
D
A
2/e

add
performs its
arithmetic in
stage 3
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.10 Cycle 3
brl Enters Pipe
D
A
2/e

Aşama 3 de
add aritmetik
işlemini
yerine getirir
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.11 Cycle 4
str enters pipe
D
A
2/e


add is idle in stage 4
Success of brl changes
program counter to 512
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.11 Cycle 4
str enters pipe
D
A
2/e


add aşama 4 deki gibi
aynıdır
Brl PC yi 512 ye
değiştirir.
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.12 Cycle 5
sub Enters Pipe
D
A
2/e


add completes in
stage 5
sub is fetched from
loc. 512 after
successful brl
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Fig 5.12 Cycle 5
sub Enters Pipe
D
A
2/e


add aşama 5 de
tamamlanır
Brl den sonra, sub
512 location ından
alıp getirilir.
512: sub ...
......
112: str r12, #32
108: brl r9, r11, 001
104: ld
r7, r5, #128
100: add r4, r6, r8
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Functions of the SRC Pipeline Stages

Stage 1: fetches instruction


Stage 2: decodes inst. and gets operands




PC incremented or replaced by successful branch in stage 2
Load or store gets operands for address computation
Store gets register value to be stored as 3rd operand
ALU operation gets 2 registers or register and constant
Stage 3: performs ALU operation


Calculates effective address or does arithmetic/logic
May pass through link PC or value to be stored in mem.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
SRC Pipeline Aşamalrının Fonksiyonları

Aşama 1: komutun alıp getirilmesi (fetch)


Aşama 2: komutun decode edilmesi ve operandların alınması




PC arttırılır veya aşama 2 de başarılı bir branch (dallanma) ile
yenilenir.
Load veya store, adres hesaplaması için operandalrı alır
Store 3. operand olarak depolancak register değerini alır
ALU işlemi 2 register veya 1 register ve 1 sabit alır.
Aşama 3: ALU işleminin gerçekleştirilmesi


Effective adres hesaplanır veya arithmetic/logic işlemler yapılır
PC veya bellekde depolanmış degere geçiş olabilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Functions of the SRC Pipeline Stages
(continued)

Stage 4: accesses data memory




Passes Z4 to Z5 unchanged for non-memory instructions
Load fills Z5 from memory
Store uses address from Z4 and data from MD4(no longer needed)
Stage 5: writes result register


Z5 contains value to be written, which can be ALU result, effective
address, PC link value, or fetched data
ra field always specifies result register in SRC
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
SRC Pipeline Aşamalrının Fonksiyonları

Aşama 4: veri belleğine ulaşılması



Bellek kullanılmayan komutlarda Z4 ve Z5 değişmeden geçilir.
Store, Z4 den adres ve MD4 den veriyi kullanır
Aşama 5: sonuç registerin yazılması


Z5 yazılacak değeri tutar, bu değer ALU result, effective address,
PC link value, veya fetched data olabilir.
SRC de ra alanı genelde sonuç register olarak belirtilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Dependence Between Instructions in Pipe:
Hazards
Instructions that occupy the pipeline together are being
executed in parallel

This leads to the problem of instruction dependence, well
known in parallel processing

The basic problem is that an instruction depends on the
result of a previously issued instruction that is not yet
complete

Two categories of hazards
Data hazards: incorrect use of old and new data
Branch hazards: fetch of wrong instruction on a change in PC

Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Komutlar Arasındaki Bağlılıklar in Pipe: Hazards
D
A
2/e
Pipeline da komutlar paralel olarak işletilir.

Paralel işlemede, bu durum komutların birbirine bağlılık
problemine sebep olur.

Temel problem, bir komutun çalışmasının, çalışmasını
bitirmemiş başka bir komutun işlenmesi sonucu ortaya
çıkacak olan sonuca bağlı olmasından ileri gelir.

Hataları iki kategoride inceleriz
Veri hazards: eski ve yeni verinin yanlış kullanımı
Dallanma (Branch) hazards: PC de ki değişim sonucunda
yanlış komutun fetch edilmesi

Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
General Classification of Data Hazards
(Not Specific to SRC)



A read after write hazard (RAW) arises from a flow dependence,
where an instruction uses data produced by a previous one
A write after read hazard (WAR) comes from an antidependence, where an instruction writes a new value over one
that is still needed by a previous instruction
A write after write hazard (WAW) comes from an output
dependence, where two parallel instructions write the same
register and must do it in the order in which they were issued
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Veri Hazard larının Genel Sınıflandırılması



A read after write hazard (RAW) arises from a flow dependence,
bir komutun bir önceki komut tarafından oluşturulan veriyi kullanması
gereken durumda oluşur.
A write after read hazard (WAR) comes from an anti-dependence,
bir komutun yeni bir değeri bir yere yazarken, oradan hala bir önceki
komutun değer alması gerekiyorsa oluşur.
A write after write hazard (WAW) comes from an output dependence,
iki paralel komutun aynı registera yazma durumları varsa, bu
işlemleri işleme sırasına göre yapmalrı gerekir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Detecting Hazards and Dependence Distance



To detect hazards, pairs of instructions must be considered
Data is normally available after being written to reg.
Can be made available for forwarding as early as the stage
where it is produced



Stage 3 output for ALU results, stage 4 for mem. fetch
Operands normally needed in stage 2
Can be received from forwarding as late as the stage in which
they are used

Stage 3 for ALU operands and address modifiers, stage 4 for
stored register, stage 2 for branch target
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Hataların Tespit Edilmesi ve Bağımlılık Mesafesi



Hataların ayıklanması için komut çifti bilinmelidir
Veri normalde reg. e yazıldıktan sonra uygun olur.
“Forwarding” işlemi aşamalarda gerektiği anda en erken
biçimde yapılmalıdır



aşama 3 ALU sonucu için output, aşama 4 bellek fetch için
Operandlar normalde aşama 2 için ihtiyaç olurlar
Can be received from forwarding as late as the stage in which
they are used

Aşama 3 ALU operandları ve adres modifierları için, aşama 4
depolama register için, aşama 2 branch target için
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Data Hazards in SRC


Since all data memory access occurs in stage 4, memory writes
and reads are sequential and give rise to no hazards
Since all registers are written in the last stage, WAW and WAR
hazards do not occur



Two writes always occur in the order issued, and a write always
follows a previously issued read
SRC hazards on register data are limited to RAW hazards
coming from flow dependence
Values are written into registers at the end of stage 5 but may
be needed by a following instruction at the beginning of stage 2
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
SRC de Veri Hataları


Bütün veri bellek ulaşımı aşama 4 de olduğu için, bellek yazma
ve okuma ardışık olur ve hata oluşma riski azalır.
Bütün register lara son aşamada yazma işlemi olduğundan;
WAW ve WAR hataları oluşmaz



İki yazma genelde işleme sırasında göre olur, ve bir write genelde
bir önceki işlenen read işlemini takip eder.
Register verisi üzerindeki SRC hataları RAW hatası ile sınırlıdır.
Aşama 5 in sonunda register lara yazılan değerler bir sonraki
komutda aşama 2 nin başında ihtiyaç duyulabilirler.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
Possible Solutions to the Register Data Hazard
S
Problem
D
A
2/e

Detection:




Correction:



The machine manual could list rules specifying that a dependent
instruction cannot be issued less than a given number of steps
after the one on which it depends
This is usually too restrictive
Since the operation and operands are known at each stage,
dependence on a following stage can be detected
The dependent instruction can be “stalled” and those ahead of it
in the pipeline allowed to complete
Result can be “forwarded” to a following inst. in a previous stage
without waiting to be written into its register
Preferred SRC design will use detection, forwarding and
stalling only when unavoidable
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Register Data Hazard Problemleri için
Muhtemel Çözümler

Tespit Edilmesi:




Doğrulanması:



The machine manual could list rules specifying that a dependent
instruction cannot be issued less than a given number of steps
after the one on which it depends
This is usually too restrictive
İşlemler ve operandlar her aşamada bilindiği için, bir sonraki
aşamada ki bağımlıklık tespit edilebilir.
Bağımlı komut bekletilmelidir (stall) ve those ahead of it in the
pipeline allowed to complete
Sonuç, bir önceki aşamda registerlara yazma işlemi
beklenmeden bir sonraki komuta iletilmelidir.(forwarding)
Tercih edilen SRC tasarımı detection, forwarding ve stalling
kullacaktır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
RAW, WAW, and WAR Hazards


RAW hazards are due to causality: one cannot use a value before it
has been produced.
WAW and WAR hazards can only occur when instructions are
executed in parallel or out of order.



Not possible in SRC.
Are only due to the fact that registers have the same name.
Can be fixed by renaming one of the registers or by delaying the updating of
a register until the appropriate value has been produced.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
RAW, WAW, ve WAR Hazards


RAW hazards nedenseldir: biri bir değeri kullanmamalıdır, o değer
oluşturulmadan önce
WAW ve WAR hazards sadece komutlar paralel ve ya sıra dışında
çalıştırıldıklarında oluşur.



SRC de mümkün değildir.
Sadece register ların aynı isimlerde olması durumuda
Bir register ın yeniden adlandırılmasıyla veya register ın güncellenmesini
uyğun değerin üretilmesine kadar erteleterek düzeltilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Tbl 5.1 Instruction Pair Hazard Interaction
Write to Reg. File
Result Normally/Earliest available
Read from
Reg. File
Value
Normally/
Latest
needed
Class
Class N/L N/E
alu
2/3
load
2/3
ladr
2/3
store 2/3
branch 2/2
alu
6/4
4/1
4/1
4/1
4/1
4/2
load
6/5
4/2
4/2
4/2
4/2
4/3
ladr
6/4
4/1
4/1
4/1
4/1
4/2
brl
6/2
4/1
4/1
4/1
4/1
4/1
Instruction separation to eliminate
hazard, Normal/Forwarded
Latest needed stage 3 for store is based on address modifier register. The
stored value is not needed until stage 4

Store also needs an operand from ra. See Text Tbl 5.

Instruction separation is used rather than bubbles because of the applicability
to multi-issue, multi-pipelined machines.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall

C
S
D
A
2/e
Tbl 5.1 Instruction Pair Hazard Interaction
Write to Reg. File
Result Normally/Earliest available
Read from
Reg. File
Value
Normally/
Latest
needed
Class
Class N/L N/E
alu
2/3
load
2/3
ladr
2/3
store 2/3
branch 2/2
alu
6/4
4/1
4/1
4/1
4/1
4/2
load
6/5
4/2
4/2
4/2
4/2
4/3
ladr
6/4
4/1
4/1
4/1
4/1
4/2
brl
6/2
4/1
4/1
4/1
4/1
4/1
Instruction separation to eliminate
hazard, Normal/Forwarded
Store için en son ihtiyaç duyulan aşama 3 adres modifier register a bağlıdır.
Depolanan değer aşama 4 e kadar ihtiyaç olunmaz.

Store ra dan bir operand a ihtiyaç duyar. See Text Tbl 5.

Komut ayrıştırma baloncukların yerine kullanılır, multi-issue, multi-pipeline
makinelerinin uygulanabilirlikleri sebebiyle.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall

C
S
D
A
2/e
Delays Unavoidable by Forwarding

In the column headed by load, we see the value loaded cannot
be available to the next instruction, even with forwarding


Can restrict compiler not to put a dependent instruction in the next
position after a load (next 2 positions if the dependent instruction is
a branch)
Target register cannot be forwarded to branch from the
immediately preceding instruction


Code is restricted so that branch target must not be changed by
instruction preceding branch (previous 2 instructions if loaded from
mem.)
Do not confuse this with the branch delay slot, which is a
dependence of instruction fetch on branch, not a dependence of
branch on something else
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Forwarding ile önlenemeyen Gecikmeler

Load sütununda, forwarding olmasına ragmen, yüklenen
değerin bir sonraki komut için hazır olmayacağını görüyoruz.


restrict compiler load dan sonra bağımlı bir komut koymazlar (eğer
bağımlı komut branch ise sonraki 2 pozisyon için)
Hedef reg. , önceki komutdan branch e forward edilmeyebilir.


Kode kısıtlanmıştır, böylece dallanma hedefi, önceki branch e göre
değişmemelidir. (eğer bellekden yüklendiyse önceki 2 komut)
Bunu, dallanma gecikmesiyle karıştımaynız. which is a dependence
of instruction fetch on branch, not a dependence of branch on
something else
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Stalling the Pipeline on Hazard Detection




Assuming hazard detection, the pipeline can be stalled by
inhibiting earlier stage operation and allowing later stages to
proceed
A simple way to inhibit a stage is a pause signal that turns off
the clock to that stage so none of its output registers are
changed
If stages 1 & 2, say, are paused, then something must be
delivered to stage 3 so the rest of the pipeline can be cleared
Insertion of nop into the pipeline is an obvious choice
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Hata Tespitinde Pipeline ı Bekletme




Hata tespitini düşünün, pipeline ilk aşamaların işlemini azlatmak
ve sonraki aşamları ilerletmek için bekletilsin.
Bir aşamayı engellemenin basit yolu durma sinyali dir. Bu sinyal
a aşama için clock u durdurur ve o aşamanın output reg. lerinin
değişmesini önler.
Eğer aşama 1 ve 2 durdurulursa, pipeline ın geri kalan kısmı
temizlenir.
Pipeline a nop göndermek belirli bir tercih olabilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.14 Stall Due to a Dependence Between
Two alu Instructions
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Restrictions Left If Forwarding Done Wherever
Possible
1) Branch delay slot
• The instruction after a branch is always executed, whether the
branch succeeds or not.
2) Load delay slot
• A register loaded from memory cannot be used as an operand
in the next instruction.
• A register loaded from memory cannot be used as a branch
target for the next two instructions.
3) Branch target
• Result register of alu or ladr instruction cannot be used as
branch target by the next instruction.
Computer Systems Design and Architecture Second Edition
br r4
add . . .
•••
ld r4, 4(r5)
nop
neg r6, r4
ld r0, 1000
nop
nop
br r0
not r0, r1
nop
br r0
© 2004 Prentice Hall
C
S
D
A
2/e
Restrictions Left If Forwarding Done Wherever
Possible
1) Branch delay slot
• Branch den sonraki komut herzaman işlenir, branch başarılı
olsun ya da olmasın.
2) Load delay slot
• Bellekden yüklenen register bir sonraki register in operand ı
olarak kullanılmayabilir.
• Bellekden yüklenen bir reg. bir sonraki komut için dallanma
hedefi olmaz.
3) Branch target
• Alu ve ladr komutları sonuç reg. bir sonraki komut için dallanma
hedefi olmaz.
Computer Systems Design and Architecture Second Edition
br r4
add . . .
•••
ld r4, 4(r5)
nop
neg r6, r4
ld r0, 1000
nop
nop
br r0
not r0, r1
nop
br r0
© 2004 Prentice Hall
C
S
D
A
2/e
Instruction Level Parallelism



A pipeline that is full of useful instructions completes at most
one every clock cycle
Sometimes called the Flynn limit
If there are multiple function units and multiple instructions
have been fetched, then it is possible to start several at
once
Two approaches are: superscalar


Dynamically issue as many prefetched instructions to idle
function units as possible
and Very Long Instruction Word (VLIW)


Statically compile long instruction words with many operations
in a word, each for a different function unit
Word size may be 128 or 256 or more bits.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Komut Düzeyi Paralelliği



Komutlaral dolu bir pipeline her clok cycle da en fazla bir
komut bitirir.
Sometimes called the Flynn limit
Eğer multiple fonksiyon birimleri ve komutları fetch edildiyse,
bir kerede birden fazlasına başlanması mümkün olur.
İki yaklaşım vardır: superscalar


Dynamically issue as many prefetched instructions to idle
function units as possible
ve Very Long Instruction Word (VLIW)


Statically compile long instruction words with many operations
in a word, each for a different function unit
Word size may be 128 or 256 or more bits.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
Character of the Function Units in Multiple Issue
S
Machines
D
A
2/e

There may be different types of function units






Floating point
Integer
Branch
There can be more than one of the same type
Each function unit is itself pipelined
Branches become more of a problem



There are fewer clock cycles between branches
Branch units try to predict branch direction
Instructions at branch target may be prefetched, and even
executed speculatively, in hopes the branch goes that way
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
Character of the Function Units in Multiple Issue
S
Machines
D
A
2/e

Farklı fonksiyon brimleri vardır






Floating point
Integer
Branch
Birden fazla aynı tip olabilir
Her fonksiyon birimi kendinden pipeline edilmiştir
Branch ler daha çok problem olurlar



Branchler arasında daha az clock cycle ları vardır
Branch birimleri dallanma yönünü tahmin etmeye çalışırlar
Dallanma hedefindeki komutlar prefetch edilebilr, ve kurgusal
olarak işlenebilirler
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Figure 5.16: Structure of the Dual-Pipeline SRC
D
A
2/e
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Figure 5.19: Dual-Issue SRC Pipelines and
Forwarding Paths
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Microprogramming: Basic Idea
• Recall control sequence for 1-bus SRC
Step
T0.
T1.
T2.
T3.
T4.
T5.
Concrete RTN
MA  PC: C  PC+4;
MD  M[MA]: PC  C;
IR  MD;
A  R[rb];
C  A + R[rc];
R[ra]  C;


Control Sequence
PCout, MAin, Inc4, Cin, Read
Cout, PCin, Wait
MDout, IRin
Grb, Rout, Ain
Grc, Rout, ADD, Cin
Cout, Gra, Rin, End
Control unit job is to generate the sequence of control signals
How about building a computer to do this?
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Microprogramming: Basic Idea
• Kontrol dizisini 1-bus SRC için yeniden çağıralım
Step
T0.
T1.
T2.
T3.
T4.
T5.
Concrete RTN
MA  PC: C  PC+4;
MD  M[MA]: PC  C;
IR  MD;
A  R[rb];
C  A + R[rc];
R[ra]  C;


Control Sequence
PCout, MAin, Inc4, Cin, Read
Cout, PCin, Wait
MDout, IRin
Grb, Rout, Ain
Grc, Rout, ADD, Cin
Cout, Gra, Rin, End
Kontrol biriminin görevi kontrol sinyalleri dizisinin oluşturulmasıdır
Bunu yapacak bir bilğisayar nasıl yapılır?
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
The Microcode Engine





A computer to generate control signals is much simpler than an
ordinary computer
At the simplest, it just reads the control signals in order from a
read only memory
The memory is called the control store
A control store word, or microinstruction, contains a bit pattern
telling which control signals are true in a specific step
The major issue is determining the order in which
microinstructions are read
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
The Microcode Engine





Kontrol sinyalleri olşturan bir bilgisayar, normal bir bilgisayara
göre daha basittir.
Basit olarak, sadece kontrol sinyallerini bellekten bir read ile
okur.
Bellek, control store (kontrol deposu) olarak adlandırılır.
control store word, veya microinstruction, belirli bir basamak için
kontrol sinyallerinin dogruluğunu söyleyen bit pattern leri içeriler
Ana işlem microinstruction ların okunma sırasına kara
verilmesidir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.22 Block Diagram of a Microcoded
Control Unit


Microinstruction has
branch control, branch
address, and control
signal fields
Micro-program counter
can be set from several
sources to do the
required sequencing
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.22 Block Diagram of a Microcoded
Control Unit


Microinstruction; branch
control, branch address,
ve control signal
alanlarına sahiptir.
Micro-program counter
beklenen dizilemeyi
yapamak için pekçok
kaynaktan set edilebilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Parts of the Microprogrammed Control Unit


Since the control signals are just read from memory, the main
function is sequencing
This is reflected in the several ways the PC can be loaded





Output of incrementer—PC+1
PLA output—start address for a macroinstruction
Branch address from instruction
External source—say for exception or reset
Micro conditional branches can depend on condition codes, data
path state, external signals, etc.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Microprogrammed Control Biriminin Parçaları


Kontrol sinyalleri sadece bellekten okunduğu için, ana fonksiyon
bunların sıralanmasıdır
This is reflected in the several ways the PC can be loaded





Output of incrementer—PC+1
PLA output—macroinstruction için başlangıç adresi
instruction için branch adresi
External source— exception ve reset için
Micro durumlu branch ler durum kodlarına, veri yolu durumuna ,
harici sinyalere...v.b. şeylere bağlıdır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Contents of a Microinstruction



Main component is list of 1/0 control signal values
There is a branch address in the control store
There are branch control bits to determine when to use the
branch address and when to use PC+1
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Microinstruction İçeriği



Ana component 1/0 kontrol sinyal değerleridir.
control store da bir tane branch adresi vardır.
PC+1 ve branch adreslerinin ne zaman kullanılacağına karar
vermeye yarayan branch kontrol bit leri vardır.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Figure 5.23: Layout of the Control Store
0
Code for instruction fetch

a1
Code for add
Microaddress

a2
Code for br
a3
Code for shr

Common inst.
fetch
sequence
Separate
sequences for
each (macro)
instruction
Wide words
2n-1
m bits wide
k branch
control bits
Computer Systems Design and Architecture Second Edition
c control
signals
n branch
addr. bits
© 2004 Prentice Hall
C
S
D
A
2/e
Figure 5.23: Control Store un Tasarımı
0
Code for instruction fetch

a1
Code for add
Microaddress
a2
Code for br
a3
Code for shr


Genel komut
fetch dizisi
Her komut için
ayrık diziler
Wide words
2n-1
m bits wide
k branch
control bits
Computer Systems Design and Architecture Second Edition
c control
signals
n branch
addr. bits
© 2004 Prentice Hall
C
S Horizontal Versus Vertical Microcode Schemes
D
A
2/e





In horizontal microcode, each control signal is represented by a
bit in the instruction
In vertical microcode, a set of true control signals is represented
by a shorter code
The name horizontal implies fewer control store words of more
bits per word
Vertical code only allows RTs in a step for which there is a
vertical instruction code
Thus vertical code may take more control store words of fewer
bits
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S Horizontal Versus Vertical Microcode Schemes
D
A
2/e





horizontal microcode da, instruction daki her kontrol sinyali bir
bit ile ifade edilir
vertical microcode da, doğru kontrol sinyali kümesi daha kısa bir
kod ile ifade edilir
horizontal ismi daha az kontrol deposu word lerinin word başına
daha fazla bit ile ifade edilmesi anlamını taşır.
Vertical code sadece vertical instruction kodunda bir
basamaktaki RT lere izin verir.
Böylece, vertical code daha az bit ile daha fazla kontrol deposu
word ü alabilir.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Fig 5.25 A Somewhat Vertical Encoding

Scheme would save (16+7) - (4+3) = 16 bits/word in the case
illustrated
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Saving Control Store Bits With Horizontal
Microcode

Some control signals cannot possibly be true at the same time






One and only one ALU function can be selected
Only one register out gate can be true with a single bus
Memory read and write cannot be true at the same step
A set of m such signals can be encoded using log2m bits
(log2(m+1) to allow for no signal true)
The raw control signals can then be generated by a k to 2k
decoder, where 2k ≥ m (or 2k ≥ m+1)
This is a compromise between horizontal and vertical
encoding
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
C
S
D
A
2/e
Horizontal Microcode ile Kontrol Deposu Bit
lerinin Korunması

Bazı kontrol sinyalleri muhtemel olarak aynı zamanda doğru
olmayabilir.






One and only one ALU function can be selected
Sadece bir out gate register doğru olabilir single bus ile.
Bellek read ve write aynı basamakta doğru olmayabilirler.
m sinyal kümesi log2m kullanılarak encode yapılabilir.
(log2(m+1) to allow for no signal true)
raw control sinyalleri k to 2k decoder ile oluşturulabilir,
2k ≥ m olmak şartıyla (veya 2k ≥ m+1)
Bu vertical ve horizontal encode lama arasındaki uyumdur.
Computer Systems Design and Architecture Second Edition
© 2004 Prentice Hall
Download