Chapter 1 Summary

2013-14
Advanced Microprocessor by Vincy Joseph
TE CMPNA/B
Chapter One
Introduction to Pipelined Processors
Pipelining:
• It is a technique of decomposing a sequential process into sub-operations, each of which is executed in a dedicated segment that operates concurrently with all the other segments.
• It improves processor performance by overlapping the execution of multiple instructions.
Example of pipelining in a computer
• Consider that the process of executing an instruction involves four major steps:
1. Instruction Fetch (IF): from main memory
2. Instruction Decode (ID): identifies the operation to be performed
3. Operand Fetch (OF): if needed for execution
4. Execution (EX): of the decoded arithmetic/logic operation
• In a non-pipelined computer, these four steps must be completed before the next instruction can be issued.
• In a pipelined computer, successive stages execute in an overlapped fashion.
• Theoretically, a k-stage linear pipeline can be up to k times faster.
• But this ideal speedup cannot be achieved due to factors like:
– Data dependency
– Branches and interrupts
Principles of Linear Pipelining
• In pipelining, we divide a task into a set of subtasks.
• The precedence relation of a set of subtasks {T1, T2, …, Tk} for a given task T implies that a subtask Tj cannot start until some earlier subtask Ti finishes.
• The interdependencies of all subtasks form the precedence graph.
• With a linear precedence relation, subtask Tj cannot start until all earlier subtasks {Ti} (i < j) finish.
• A linear pipeline can process subtasks with a linear precedence graph.
Basic Linear Pipeline
• L: latches, the interface between different stages of the pipeline
• S1, S2, etc.: pipeline stages
• The pipeline consists of a cascade of processing stages.
• Stages: pure combinational circuits performing arithmetic or logic operations over the data flowing through the pipe.
• Stages are separated by high-speed interface latches.
• Latches: fast registers holding intermediate results between stages.
• Information flow is under the control of a common clock applied to all latches.
• The flow of data in a four-stage linear pipeline evaluating a function on five inputs is shown below:
– the vertical axis represents the four stages, and
– the horizontal axis represents time, in units of the pipeline clock period.
Clock Period (τ) for the pipeline
• Let τi be the time delay of the stage circuitry Si and tl be the time delay of a latch.
• Then the clock period of a linear pipeline is defined by
τ = max{τi : 1 ≤ i ≤ k} + tl = τm + tl
• The reciprocal of the clock period is called the clock frequency (f = 1/τ) of a pipeline processor.
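The clock-period rule above can be sketched in a few lines; the stage and latch delays below are made-up illustrative values, not figures from the text.

```python
# Clock period of a linear pipeline: the slowest stage delay plus the latch delay.

def clock_period(stage_delays, latch_delay):
    """tau = max(tau_i) + t_l"""
    return max(stage_delays) + latch_delay

tau = clock_period([10, 15, 12, 8], 2)  # hypothetical delays in ns -> 17 ns
f = 1 / tau                             # clock frequency is the reciprocal
```

Note that one slow stage sets the period for the whole pipe, which is why bottleneck stages are subdivided or duplicated later in the chapter.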
Performance of a linear pipeline
• Consider a linear pipeline with k stages.
• Let T be the clock period, and let the pipeline be initially empty.
• Starting at any time, we feed n inputs and wait till the results come out of the pipeline.
• The first input takes k clock periods, and the remaining (n - 1) results emerge one after another in successive clock periods.
• Thus the computation time of the pipeline is
Tp = kT + (n - 1)T = [k + (n - 1)]T
• For example, if the linear pipeline has four stages (k = 4) and five inputs (n = 5),
Tp = [k + (n - 1)]T = [4 + 4]T = 8T
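A minimal sketch of the Tp formula, reproducing the worked example:

```python
# T_p = [k + (n - 1)] * T: k periods for the first input, then one per input.

def pipeline_time(k, n, T=1):
    """Total time to push n inputs through a k-stage linear pipeline."""
    return (k + (n - 1)) * T

# Four stages and five inputs give 8 clock periods, as in the example above.
```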
Performance Parameters
• The main performance parameters of a pipeline are:
1. Speedup
2. Throughput
3. Efficiency
Speedup
Speedup is defined as
Speedup = (Time taken for a given computation by a non-pipelined functional unit) / (Time taken for the same computation by a pipelined version)
Assume a function of k stages of equal complexity, each taking the same time T.
A non-pipelined unit takes kT time per input, so n inputs take nkT.
Then Speedup = nkT / [k + (n - 1)]T = nk / (k + n - 1)
The maximum value of speedup is
lim (n→∞) Speedup = k
Efficiency
It is an indicator of how efficiently the resources of the pipeline are used.
If a stage is available during a clock period, its availability becomes the unit of resource.
Efficiency can be defined as
Efficiency = (Number of stage-time units actually used during the computation) / (Total number of stage-time units available during that computation)
Number of stage-time units actually used = nk (there are n inputs and each input uses k stages).
Total number of stage-time units available = k[k + (n - 1)]; it is the product of the number of stages (k) and the number of clock periods taken for the computation, k + (n - 1).
Thus efficiency is expressed as:
Efficiency = nk / k[k + (n - 1)] = n / (k + n - 1)
The maximum value of efficiency is 1.
Throughput
It is the average number of results computed per unit time.
For n inputs, a k-stage pipeline takes [k + (n - 1)]T time units.
Then,
Throughput = n / [k + (n - 1)]T = nf / (k + n - 1)
where f is the clock frequency.
The maximum value of throughput is
lim (n→∞) Throughput = f
Note that Throughput = Efficiency × Frequency.
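The three parameters can be sketched directly from their formulas; this also makes the identity Throughput = Efficiency × Frequency easy to check numerically.

```python
# Performance parameters of a k-stage linear pipeline processing n inputs.

def speedup(k, n):
    """nk / (k + n - 1); tends to k as n grows."""
    return n * k / (k + n - 1)

def efficiency(k, n):
    """n / (k + n - 1); tends to 1 as n grows."""
    return n / (k + n - 1)

def throughput(k, n, f=1.0):
    """nf / (k + n - 1) results per unit time, f = clock frequency."""
    return n * f / (k + n - 1)
```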
Example : Floating Point Adder Unit
• This pipeline is linearly constructed with 4 functional stages.
• The inputs to this pipeline are two normalized floating-point numbers of the form
A = a × 10^p
B = b × 10^q
where a and b are two fractions and p and q are their exponents.
• Our purpose is to compute the sum
C = A + B = c × 10^r = d × 10^s
where r = max(p, q) and 0.1 ≤ d < 1
• For example:
A = 0.9504 × 10^3
B = 0.8200 × 10^2
a = 0.9504, b = 0.8200
p = 3 and q = 2
• Operations performed in the four pipeline stages are:
1. Compare p and q, choose the larger exponent r = max(p, q), and compute t = |p - q|.
Example:
r = max(p, q) = 3
t = |p - q| = |3 - 2| = 1
2. Shift right the fraction associated with the smaller exponent by t units to equalize the two exponents before fraction addition.
Example:
fraction with the smaller exponent: b = 0.8200
shifting b right by 1 unit gives 0.0820
3. Perform fixed-point addition of the two fractions to produce the intermediate sum fraction c.
Example:
a = 0.9504, b = 0.0820
c = a + b = 0.9504 + 0.0820 = 1.0324
4. Count the number of leading zeros u in fraction c and shift c left by u units to produce the normalized fraction sum d = c × 10^u, with a nonzero leading digit. Update the exponent by s = r - u to produce the output exponent.
Example:
c = 1.0324, u = -1 (i.e. a right shift)
d = 0.10324, s = r - u = 3 - (-1) = 4
C = 0.10324 × 10^4
• The above 4 steps can all be implemented with combinational logic circuits, and the 4 stages are:
1. Comparator / Subtractor
2. Shifter
3. Fixed-Point Adder
4. Normalizer (leading-zero counter and shifter)
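The four stages can be sketched behaviorally in software; this is not the hardware design, just the per-stage transformation applied to the decimal example above.

```python
# Behavioral sketch of the 4-stage floating-point adder on decimal operands.

def fp_add(a, p, b, q):
    # Stage 1: comparator/subtractor -- r = max(p, q), t = |p - q|
    r, t = max(p, q), abs(p - q)
    # Stage 2: shifter -- align the fraction with the smaller exponent
    if p < q:
        a /= 10 ** t
    else:
        b /= 10 ** t
    # Stage 3: fixed-point adder
    c = a + b
    # Stage 4: normalizer -- bring the fraction into [0.1, 1)
    s = r
    while c >= 1:          # overflow: shift right, bump the exponent
        c, s = c / 10, s + 1
    while 0 < c < 0.1:     # leading zeros: shift left, drop the exponent
        c, s = c * 10, s - 1
    return c, s

d, s = fp_add(0.9504, 3, 0.8200, 2)   # the worked example: 0.10324 x 10^4
```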
Classification of Pipeline Processors
• There are various schemes for classifying pipeline processors.
• Two important schemes are:
1. Handler's Classification
2. Li and Ramamurthy's Classification
Handler’s Classification
• Based on the level of processing, pipelined processors can be classified into:
1. Arithmetic pipelining
2. Instruction pipelining
3. Processor pipelining
Arithmetic Pipelining
• The arithmetic logic units of a computer can be segmented for pipelined operations in various data formats.
• Example: Star-100
– It has two pipelines where arithmetic operations are performed.
– First: floating-point adder and multiplier.
– Second: multifunctional, handling all scalar instructions, with a floating-point adder, multiplier and divider.
– Both pipelines are 64-bit and can be split into four 32-bit pipelines at the cost of precision.
Instruction Pipelining
• The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode and operand fetch of subsequent instructions.
• It is also called instruction look-ahead.
• The organization of the 8086 into a separate BIU and EU allows the fetch and execute cycles to overlap.
Processor Pipelining
• This refers to the processing of the same data stream by a cascade of processors, each of which performs a specific task.
• The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor.
• The second processor then passes the refined results to the third, and so on.
Li and Ramamurthy's Classification
• According to pipeline configurations and control strategies, Li and Ramamurthy classify pipelines under three schemes:
– Unifunctional vs. multifunctional pipelines
– Static vs. dynamic pipelines
– Scalar vs. vector pipelines
Unifunction v/s Multi-function Pipeline
Unifunctional Pipelines
• A pipeline unit with a fixed and dedicated function is called unifunctional.
• Example: Cray-1 (supercomputer, 1976)
• It has 12 unifunctional pipelines, described in four groups:
– Address Functional Units:
• Address Add Unit
• Address Multiply Unit
– Scalar Functional Units
• Scalar Add Unit
• Scalar Shift Unit
• Scalar Logical Unit
• Population/Leading Zero Count Unit
– Vector Functional Units
• Vector Add Unit
• Vector Shift Unit
• Vector Logical Unit
– Floating Point Functional Units
• Floating Point Add Unit
• Floating Point Multiply Unit
• Reciprocal Approximation Unit
Multifunctional Pipelines
• A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.
• Example: TI-ASC (supercomputer, 1973)
• It has four multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic or logic operations at different times.
• Its central processor is comprised of nine units:
– one instruction processing unit (IPU),
– four memory buffer units, and
– four arithmetic units.
• Thus it provides four parallel execution pipelines below the IPU.
• Any mixture of scalar and vector instructions can be executed simultaneously in the four pipes.
Static Vs Dynamic Pipeline
Static Pipeline
• It may assume only one functional configuration at a time.
• It can be either unifunctional or multifunctional.
• Static pipelines are preferred when instructions of the same type are to be executed continuously.
• A unifunctional pipe must be static.
Dynamic Pipeline
• It permits several functional configurations to exist simultaneously.
• A dynamic pipeline must be multifunctional.
• The dynamic configuration requires more elaborate control and sequencing mechanisms than static pipelining.
Scalar Vs Vector Pipeline
Scalar Pipeline
• It processes a sequence of scalar operands, typically under the control of a DO loop.
• Instructions in a small DO loop are often prefetched into the instruction buffer.
• The required scalar operands are moved into a data cache to continuously supply the pipeline with operands.
• Example: IBM System/360 Model 91
• In this computer, buffering plays a major role.
• Instruction fetch buffering:
– provides the capacity to hold program loops of meaningful size;
– upon encountering a loop which fits, the buffer locks onto the loop, and subsequent branching requires less time.
• Operand fetch buffering:
– provides a queue into which storage can dump operands and from which execution units can fetch operands;
– this improves operand fetching for storage-to-register and storage-to-storage instruction types.
Vector Pipelines
• They are specially designed to handle vector instructions over vector operands.
• Computers having vector instructions are called vector processors.
• The design of a vector pipeline is expanded from that of a scalar pipeline.
• The handling of vector operands in vector pipelines is under firmware and hardware control.
• Example: Cray-1
Linear pipeline (Static & Unifunctional)
• In a linear pipeline, data flows from one stage to the next, every stage is used exactly once per computation, and the pipeline serves one functional evaluation.
Nonlinear Pipeline
• In the floating-point adder, stages (2) and (4) both need a shift register.
• We can share the same shift register, and then there will be only 3 stages.
• Then we need a feedback path from the third stage to the second stage.
• Further, the same pipeline can then also be used to perform fixed-point addition.
• A pipeline with feed-forward and/or feedback connections is called nonlinear.
Example: 3-stage nonlinear pipeline
• It has 3 stages Sa, Sb and Sc, separated by latches.
• Multiplexers (the crossed circles) take more than one input and pass one of the inputs to their output.
• The output of each stage is tapped and used for feedback and feed-forward connections.
• The above pipeline can perform a variety of functions.
• Each functional evaluation can be represented by a particular sequence of usage of the stages. Some examples are:
– Sa, Sb, Sc
– Sa, Sb, Sc, Sb, Sc, Sa
– Sa, Sc, Sb, Sa, Sb, Sc
• Each functional evaluation can be represented using a diagram called a Reservation Table (RT).
• It is the space-time diagram of a pipeline corresponding to one functional evaluation:
– X axis: time units
– Y axis: stages
• For the sequence Sa, Sb, Sc, Sb, Sc, Sa, called function A, we have:
Time:  0   1   2   3   4   5
Sa:    A   .   .   .   .   A
Sb:    .   A   .   A   .   .
Sc:    .   .   A   .   A   .
• For the sequence Sa, Sc, Sb, Sa, Sb, Sc, called function B, we have:

Time:  0   1   2   3   4   5
Sa:    B   .   .   B   .   .
Sb:    .   .   B   .   B   .
Sc:    .   B   .   .   .   B
• After starting a function, the stages need to be reserved in the corresponding time units.
• Each function supported by a multifunction pipeline is represented by a different RT.
• The time taken for a functional evaluation, in units of the clock period, is the compute time (for both A and B it is 6).
• Markings in the same row mean the stage is used more than once.
• Markings in the same column mean more than one stage is in use at the same time.
• The hardware of a multifunction pipeline should be reconfigurable.
• A multifunction pipeline can be static or dynamic.
• Static:
– Initially configured for one functional evaluation.
– For another function, the pipeline needs to be drained and reconfigured.
– You cannot have two inputs of different functions at the same time.
• Dynamic:
– Can carry out different functional evaluations at the same time.
– It is difficult to control, as we need to be sure that there is no conflict in the usage of the stages.
Principle of Designing Pipeline Processors
Instruction Prefetch and Branch Handling
• The instructions in computer programs can be classified into 4 types:
– Arithmetic/load operations (60%): these require one or two operand fetches, and the execution of different operations requires a different number of pipeline cycles.
– Store-type instructions (15%): these require a memory access to store the data.
– Branch-type instructions (5%): these correspond to an unconditional jump.
– Conditional branch type (yes: 12%, no: 8%): the yes path requires the calculation of the new address, while the no path proceeds to the next sequential instruction.
• Arithmetic/load and store instructions do not alter the execution order of the program.
• Branch instructions and interrupts cause some damaging effects on the performance of pipelined computers.
Interrupts
• When instruction I is being executed, the occurrence of an interrupt postpones instruction I+1 until the ISR has been serviced.
• There are two types of interrupt:
– Precise: caused by illegal operation codes; can be detected at the decoding stage.
– Imprecise: caused by faults from the storage, address and execution functions.
• Precise: since decoding is the first stage, instruction I prohibits I+1 from entering the pipeline, and all preceding instructions are executed before the ISR.
• Imprecise: no new instructions are allowed in, and all incomplete instructions, whether they precede or follow I, are executed before the ISR.
Interrupt Handling Example: Cray-1
• The interrupt system is built around an exchange package.
• When an interrupt occurs, the Cray-1 saves 8 scalar registers, 8 address registers, the program counter and the monitor flags.
• These are packed into 16 words and swapped with a block whose address is specified by a hardware exchange address register.
• Since the exchange package does not hold all state information, the software interrupt handler has to store the remaining state.
In general, the higher the percentage of branch-type instructions in a program, the slower the program will run on a pipeline processor.
Estimation of the effect of branching on an n-segment instruction pipeline
• Consider an instruction cycle of n pipeline clock periods.
• Let
– p be the probability of a conditional branch (20%), and
– q be the probability that a branch is successful (12% out of the 20%, i.e. q = 12/20 = 0.6).
• Suppose there are m instructions.
• Then the number of instructions causing successful branches is m·p·q (m × 0.2 × 0.6).
• A delay of (n - 1)/n of an instruction cycle is required for each successful branch to flush the pipeline.
• Thus, the total time required for m instructions is
(1/n)[n + (m - 1)] + m·p·q·(n - 1)/n = [n + (m - 1) + m·p·q·(n - 1)]/n instruction cycles
• As m becomes large, the average number of instructions per instruction cycle is
lim (m→∞) m·n / [n + (m - 1) + m·p·q·(n - 1)] = n / [1 + p·q·(n - 1)]
• When p = 0, this measure reduces to n, which is the ideal.
• In reality, it is always less than n.
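The branching estimate above is easy to evaluate numerically; a sketch, with the limit behavior checked against the closed form:

```python
# Average number of instructions completed per instruction cycle for
# m instructions on an n-segment pipeline, with branch probabilities p, q.

def avg_instr_per_cycle(n, p, q, m):
    """m*n / [n + (m - 1) + m*p*q*(n - 1)]"""
    return m * n / (n + (m - 1) + m * p * q * (n - 1))

# As m grows this tends to n / (1 + p*q*(n - 1)); with p = 0 it tends to n.
```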
Solution: Multiple Prefetch Buffers
• Buffers can be used to match the instruction fetch rate to the pipeline consumption rate:
1. Sequential buffers: for in-sequence pipelining.
2. Target buffers: hold instructions from a branch target (for out-of-sequence pipelining).
• A conditional branch causes both the sequential and the target buffer to fill; based on the condition, one is selected and the other is discarded.
Data Buffering and Busing Structures
Speeding up of pipeline segments
• The processing speeds of pipeline segments are usually unequal.
• Consider the example given below:
• If T1 = T3 = T and T2 = 3T, then S2 becomes the bottleneck and we need to remove it.
• How? One method is to subdivide the bottleneck; two possible subdivisions are shown in the figures.
• If the bottleneck is not subdivisible, we can duplicate S2 in parallel.
• Control and synchronization are more complex for parallel segments.
Data Buffering
• Instruction and data buffering provides a continuous flow to the pipeline units.
• Example: TI-ASC
• This system uses a memory buffer unit (MBU) which:
– supplies the arithmetic units with a continuous stream of operands, and
– stores results back into memory.
• The MBU has three double buffers X, Y and Z (one octet per buffer): X and Y for input and Z for output.
• This supports pipeline processing at a high rate and alleviates the bandwidth mismatch between memory and the arithmetic pipelines.
• In the TI-ASC, once an instruction dependency is recognized, an update capability is provided by transferring the contents of the Z buffer to the X or Y buffer.
Internal Forwarding and Register Tagging
• Internal forwarding: replacing unnecessary memory accesses by register-to-register transfers.
• Register tagging: the use of tagged registers for exploiting concurrent activities among multiple ALUs.
• Memory access is slower than register-to-register operations.
• Performance can be enhanced by eliminating unnecessary memory accesses.
• This concept can be explored in 3 directions:
1. Store–load forwarding
2. Load–load forwarding
3. Store–store forwarding
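The first direction, store-load forwarding, can be illustrated with a toy peephole pass: a LOAD that reads the address written by the immediately preceding STORE becomes a register-to-register MOVE. The instruction encoding here is invented for the sketch.

```python
# Toy store-load forwarding: eliminate a memory read that immediately
# follows a store to the same address.

def forward(instrs):
    """instrs: list of (op, destination, source) triples."""
    out = []
    for op, dst, src in instrs:
        prev = out[-1] if out else None
        if op == "LOAD" and prev and prev[0] == "STORE" and prev[1] == src:
            # The value is still in the register the STORE read from.
            out.append(("MOVE", dst, prev[2]))
        else:
            out.append((op, dst, src))
    return out

code = [("STORE", "M", "R1"), ("LOAD", "R2", "M")]
```

Load-load and store-store forwarding follow the same pattern: detect two consecutive accesses to the same location and keep only the one that matters.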
Register Tagging
Example : IBM Model 91 : Floating Point Execution Unit
• The floating point execution unit consists of :
– Data registers
– Transfer paths
– Floating Point Adder Unit
– Multiply-Divide Unit
– Reservation stations
– Common Data Bus
• There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2.
• Each station has source and sink registers along with their tag and control fields.
• The stations hold operands for the next execution.
• 3 store data buffers (SDBs) and 4 floating-point registers (FLRs) are tagged.
• Busy bits in the FLRs indicate the dependence of instructions in subsequent execution.
• The Common Data Bus (CDB) transfers operands.
• There are 11 units that can supply information to the CDB: 6 FLBs, 3 adder stations and 2 multiply/divide stations.
• The tags for these units are:

Unit   Tag     Unit   Tag
FLB1   0001    ADD1   1010
FLB2   0010    ADD2   1011
FLB3   0011    ADD3   1100
FLB4   0100    M1     1000
FLB5   0101    M2     1001
FLB6   0110
• Internal forwarding can be achieved with the tagging scheme on the CDB.
• Example: let F refer to an FLR, let FLBi stand for the i-th FLB, and let their contents be (F) and (FLBi). Consider the instruction sequence
ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) × (FLB2)
• During the addition:
– the busy bit of F is set to 1,
– the contents of F and FLB1 are sent to adder station A1, and
– the tag of F is set to 1010 (the tag of the adder).
• Meanwhile, the decode of MPY reveals that F is busy, so:
– the source tag of M1 is set to 1010 (the tag of the adder),
– F changes its tag to 1000 (the tag of the multiplier), and
– the contents of FLB2 are sent to M1.
• When the addition is done, the CDB finds (by tag match) that the result should be sent to M1.
• The multiplication is done when both operands are available.
Hazard Detection and Resolution
• Hazards are caused by resource-usage conflicts among various instructions.
• They are triggered by inter-instruction dependencies.
Terminologies:
• Resource objects: the set of working registers, memory locations and special flags.
• Data objects: the contents of resource objects.
• Each instruction can be considered a mapping from a set of data objects to a set of data objects.
• Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.
• Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I.
• An instruction reads from its domain and writes into its range.
• Consider the execution of instructions I and J, where J appears immediately after I.
• There are 3 types of data-dependent hazards:
1. RAW (Read After Write)
The necessary condition for this hazard is
R(I) ∩ D(J) ≠ ∅
Example:
I1: LOAD r1, a
I2: ADD r2, r1
I2 cannot be executed correctly until r1 is loaded.
Thus I2 is RAW-dependent on I1.
2. WAW (Write After Write)
The necessary condition is
R(I) ∩ R(J) ≠ ∅
Example:
I1: MUL r1, r2
I2: ADD r1, r4
Here I1 and I2 write to the same destination, hence they are said to be WAW-dependent.
3. WAR (Write After Read)
The necessary condition is
D(I) ∩ R(J) ≠ ∅
Example:
I1: MUL r1, r2
I2: ADD r2, r3
Here I2 has r2 as its destination while I1 uses it as a source, hence they are WAR-dependent.
• Hazards can be detected in the fetch stage by comparing the domain and range of the incoming instruction with those of the instructions in the pipe.
• Once a hazard is detected, there are two methods:
1. Generate a warning signal to prevent the hazard.
2. Allow the incoming instruction through the pipe and distribute the detection to all pipeline stages.
Job Sequencing and Collision Prevention
• Consider the reservation table given below, with an initiation A made at t = 1:

Time:  1   2   3   4   5   6
Sa:    A   .   .   .   .   A
Sb:    .   A   .   A   .   .
Sc:    .   .   A   .   A   .

• Consider a next initiation made at t = 2:

Time:  1   2   3   4   5   6   7   8
Sa:    A1  A2  .   .   .   A1  A2  .
Sb:    .   A1  A2  A1  A2  .   .   .
Sc:    .   .   A1  A2  A1  A2  .   .

• The second initiation fits easily into the reservation table.
• Now consider the case when the first initiation is made at t = 1 and the second at t = 3:

Time:  1   2   3   4     5     6   7   8
Sa:    A1  .   A2  .     .     A1  .   A2
Sb:    .   A1  .   A1A2  .     A2  .   .
Sc:    .   .   A1  .     A1A2  .   A2  .

• Here the markings A1 and A2 fall in the same stage-time unit; this is called a collision, and it must be avoided.
Terminologies
• Latency: the time difference between two initiations, in units of the clock period.
• Forbidden latency: a latency that results in a collision.
• Forbidden latency set: the set of all forbidden latencies.
• Considering initiations A1, A2, … made in every successive cycle:

Time:  1   2   3   4     5     6     7     8     9   10  11
Sa:    A1  A2  A3  A4    A5    A6A1  A2    A3    A4  A5  A6
Sb:    .   A1  A2  A1A3  A2A4  A3A5  A4A6  A5    A6  .   .
Sc:    .   .   A1  A2    A1A3  A2A4  A3A5  A4A6  A5  A6  .

• The forbidden latencies are 2 and 5.
Shortcut Method of Finding Latencies
• Take the distances between marks within each row of the RT and form their union:
Forbidden latency set F = {5} ∪ {2} ∪ {2} = {2, 5}
• Latency sequence: a sequence of latencies between successive initiations.
• For an RT, the number of valid initiation sequences and latencies is infinite.
• Latency cycle: among the infinitely many possible latency sequences, the periodic ones are significant, e.g. {1, 3, 3, 1, 3, 3, …}. The subsequence that repeats itself is called the latency cycle, e.g. (1, 3, 3).
• Period of the cycle: the sum of the latencies in a latency cycle (1 + 3 + 3 = 7).
• Average latency (AL): the average taken over the latency cycle (AL = 7/3 ≈ 2.33).
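The shortcut method is exactly a union of within-row mark distances, which can be sketched directly; the RT encoding (stage name to list of used time units) is this sketch's own convention:

```python
# Forbidden latencies of a reservation table: the union, over all stages,
# of the distances between marks in the same row.

def forbidden_latencies(rt):
    """rt maps a stage name to the list of time units in which it is used."""
    forbidden = set()
    for times in rt.values():
        forbidden |= {abs(a - b) for a in times for b in times if a != b}
    return forbidden

# The 3-stage RT used in this section: Sa at t=1,6; Sb at t=2,4; Sc at t=3,5.
rt = {"Sa": [1, 6], "Sb": [2, 4], "Sc": [3, 5]}
```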
• To design a pipeline, we need a control strategy that maximizes the throughput (the number of results per unit time).
• Maximizing throughput means minimizing the average latency.
• A latency sequence that is aperiodic in nature is impractical to implement.
• Thus the design problem is to arrive at a latency cycle having the minimal average latency (MAL).
State Diagram
• The initial collision vector (ICV) is a binary vector formed from F such that
C = (Cn … C2 C1), where Ci = 1 if i ∈ F and Ci = 0 otherwise.
• Thus in our example,
F = {2, 5}
C = (1 0 0 1 0)
• The procedure is as follows:
1. Start with the ICV.
2. For each unprocessed state, and for each bit i in its collision vector CV that is 0, do the following:
a. Shift CV right by i bits.
b. Drop the i rightmost bits.
c. Append zeros to the left.
d. Logically OR the result with the ICV.
e. If step (d) results in a new state, form a new node for this state and join it to the node of CV by an arc with the marking i.
3. Continue this shifting process until no more new states can be generated.
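The procedure above can be sketched with collision vectors encoded as integers (bit i-1 set means latency i is forbidden); the function name and representation are this sketch's own choices.

```python
# Build the collision-vector state diagram from the forbidden latency set.

def build_state_diagram(forbidden, n):
    """n is the vector length; states map latency -> successor state."""
    icv = sum(1 << (i - 1) for i in forbidden)
    states, work = {icv: {}}, [icv]
    while work:
        cv = work.pop()
        for i in range(1, n + 1):
            if cv & (1 << (i - 1)):
                continue                 # latency i collides from this state
            nxt = (cv >> i) | icv        # shift right by i, then OR with ICV
            states[cv][i] = nxt
            if nxt not in states:
                states[nxt] = {}
                work.append(nxt)
    return icv, states

icv, states = build_state_diagram({2, 5}, 5)   # ICV = 10010
```

For F = {2, 5} this yields three states, 10010, 11011 and 10011, whose arcs contain the simple cycles discussed next.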
• The state with all zeros has a self-loop which corresponds to an empty pipeline; it is always possible to wait an indefinite number of cycles, giving latency cycles of the form (7), (8), (9), (10), etc.
• Simple cycle: a latency cycle in which each state is encountered only once.
• Complex cycle: consists of more than one simple cycle in it.
• It is enough to look for simple cycles.
• Greedy cycle: a simple cycle in which each latency contained in the cycle is the minimal latency (outgoing arc) from a state in the cycle.
• A good task-initiation sequence should include the greedy cycle.
• Here the simple cycles are (3), (5), (1, 3, 3), (4, 3) and (4), and the greedy cycle is (1, 3, 3).
• In the above example, the cycle that offers the MAL is (1, 3, 3) (MAL = (1 + 3 + 3)/3 ≈ 2.33).
• The task initiation sequence with the MAL (initiations A1, A2, A5, A8 at t = 1, 2, 5, 8) is given below:

Time:  1   2   3   4   5   6   7   8   9   10  11  12  13
Sa:    A1  A2  .   .   A5  A1  A2  A8  .   A5  .   .   A8
Sb:    .   A1  A2  A1  A2  A5  .   A5  A8  .   A8  .   .
Sc:    .   .   A1  A2  A1  A2  A5  .   A5  A8  .   A8  .
Superscalar Processors
• Scalar processors execute one instruction per cycle; in a superscalar processor, multiple instruction pipelines are used.
• Purpose: to exploit more instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel.
• In the fundamental structure (m = 3), the instruction decoding and execution resources are increased m-fold.
• Example: a dual-pipeline superscalar processor.
– It can issue two instructions per cycle.
– There are two pipelines with four processing stages: fetch, decode, execute and store.
– The two instruction streams come from a single I-cache.
– Assume each stage requires one cycle, except the execution stage.
– The four functional units of the execution stage are:
Functional Unit   Number of stages
Adder             2
Multiplier        3
Logic             1
Load              1
• Functional units are shared on a dynamic basis.
• A look-ahead window is used for out-of-order instruction issue.
Superscalar Performance
• The time required by the scalar base machine is
T(1,1) = k + N - 1
• The ideal execution time required by an m-issue superscalar machine is
T(m,1) = k + (N - m)/m
where k is the time required to execute the first m instructions (through the k pipeline stages) and (N - m)/m is the time required to execute the remaining (N - m) instructions, m per cycle.
• The ideal speedup of the superscalar machine is
S(m,1) = T(1,1)/T(m,1) = m(N + k - 1) / (N + m(k - 1))
• As N → ∞, the speedup S(m,1) → m.
Superpipeline Processors
• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.
• The time to execute N instructions on a superpipelined machine of degree n with k stages is
T(1,n) = k + (N - 1)/n
• The speedup is
S(1,n) = T(1,1)/T(1,n) = n(k + N - 1) / (nk + N - 1)
• As N → ∞, S(1,n) → n.
Superpipelined Superscalar Processors
• This machine executes m instructions every cycle with a pipeline cycle of 1/n of the base cycle.
• The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m, n) is
T(m,n) = k + (N - m)/(mn)
• The speedup over the base machine is
S(m,n) = T(1,1)/T(m,n) = mn(k + N - 1) / (mnk + N - m)
• As N → ∞, S(m,n) → mn.
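All three timing models are special cases of the degree-(m, n) formula, with the scalar base machine as degree (1, 1); a sketch:

```python
# T(m, n) = k + (N - m) / (m * n); T(1, 1) reduces to k + N - 1.

def t_exec(k, N, m=1, n=1):
    """Execution time for N instructions on a degree-(m, n) machine."""
    return k + (N - m) / (m * n)

def speedup(k, N, m, n):
    """S(m, n) = T(1, 1) / T(m, n); tends to m * n as N grows."""
    return t_exec(k, N) / t_exec(k, N, m, n)
```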
Systolic Architecture
• Conventional architectures operate with load and store operations on memory.
• This requires many memory references, which slows down the system, as shown below.
• In systolic processing, the data to be processed flows through various operation stages and is finally put in memory, as shown below.
• The basic architecture is made up of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. 2-D meshes, hexagonal arrays, etc.
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory is circulated among the PEs which require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
Example: Multiplication of two n × n matrices
• Every element of the input is picked up n times from memory, as it contributes to n elements of the output.
• To reduce this memory traffic, a systolic architecture ensures that each element is fetched only once.
• Consider an example where n = 3.
• Conventional method, O(n³):
For I = 1 to N
For J = 1 to N
For K = 1 to N
C[I,J] = C[I,J] + A[I,K] * B[K,J];
• A systolic array can run this in O(n) time, but to do so it needs n × n processing units; in our example, n² = 9.
• For systolic processing, the input data needs to be modified (skewed) as follows:
and finally stagger the data sets for input.
At every tick of the global system clock, data is passed into each processor from two different directions, then multiplied and accumulated in a register.
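The staggering can be sketched in software: with the skewed feeds, the operand pair (A[i][k], B[k][j]) reaches PE (i, j) at tick t = i + j + k, so the whole product completes after 3n - 2 ticks. This is a simulation of the timing, not a hardware description.

```python
# Software sketch of a staggered n x n systolic matrix multiply.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]      # one accumulator per PE (i, j)
    for t in range(3 * n - 2):           # ticks until the wavefront drains
        for i in range(n):
            for j in range(n):
                k = t - i - j            # operand pair arriving at PE (i, j)
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Each A and B element is read once per PE that needs it as it flows through, rather than being re-fetched from memory n times.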
Example: SAMBA, a Systolic Accelerator for Molecular Biological Applications.
This systolic array contains 128 processors spread over 32 full-custom VLSI chips. One chip houses 4 processors, and one processor computes 10 million matrix cells per second.
Very Long Instruction Word (VLIW) Architecture
VLIW Machine
• It consists of many functional units connected to a large central register file.
• Each functional unit has two read ports and one write port.
• The register file must have enough bandwidth to sustain the operand usage rate of the functional units.
VLIW characteristics
• A VLIW word contains multiple primitive instructions that can be executed in parallel.
• The compiler packs a number of primitive, independent instructions into a very long instruction word.
• The compiler must guarantee that the primitive instructions grouped together are independent, so that they can be executed in parallel.
• Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h;
(Figure: a single VLIW instruction word whose four fields, F=a+b, c=e/g, d=x&y and w=z*h, are dispatched to four processing units (PUs) in parallel.)
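The compiler's packing step can be illustrated with a toy greedy pass: each primitive instruction, modeled as a destination plus a set of sources, joins the current word only if it is independent (no RAW, WAW or WAR relation) of everything already in it. The encoding and the `pack_vliw` helper are this sketch's own inventions, not a real VLIW compiler.

```python
# Toy greedy VLIW packing of (dest, {sources}) primitive instructions.

def pack_vliw(instrs, width=4):
    words, word = [], []
    for dst, srcs in instrs:
        independent = all(
            dst != d and dst not in s and d not in srcs
            for d, s in word
        )
        if independent and len(word) < width:
            word.append((dst, srcs))
        else:
            words.append(word)          # close the word, start a new one
            word = [(dst, srcs)]
    if word:
        words.append(word)
    return words

# The four independent operations from the example fit in one word:
prog = [("F", {"a", "b"}), ("c", {"e", "g"}),
        ("d", {"x", "y"}), ("w", {"z", "h"})]
```

A dependent pair, by contrast, is forced into two successive words, which is exactly the wasted-slot problem noted under the drawbacks below.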
VLIW Principles
1. The compiler analyzes dependence of all instructions among sequential code and
extracts as much parallelism as possible.
2. Based on analysis, the compiler re-codes the sequential code in VLIW instruction
words.(One VLIW instruction word contains maximum 8 primitive instructions)
3. Finally, the VLIW hardware:
– Fetches the VLIWs from cache,
– Decodes them,
– Dispatches the independent primitive instructions to the corresponding functional units, and
– Executes them.
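The packing step in principle 2 can be sketched in a few lines. The following is a toy model (the function `pack_vliw` and its tuple encoding are invented for this illustration, and the independence test is deliberately conservative: real compilers allow two instructions in a word to read the same register).

```python
def pack_vliw(instrs, width=8):
    """Greedy sketch of VLIW packing.  Each primitive instruction is
    (dest, (src, src, ...)), given in program order.  Instructions
    share a word only if they touch disjoint registers; anything
    conflicting is deferred to a later word, preserving dependences."""
    words, pending = [], list(instrs)
    while pending:
        word, used, blocked, deferred = [], set(), set(), []
        for dest, srcs in pending:
            regs = {dest, *srcs}
            if len(word) < width and not regs & (used | blocked):
                word.append((dest, srcs))
                used |= regs
            else:
                deferred.append((dest, srcs))
                blocked |= regs   # later uses of these regs must wait
        words.append(word)
        pending = deferred
    return words

# the example word from the text, plus one dependent instruction
prog = [("F", ("a", "b")), ("c", ("e", "g")),
        ("d", ("x", "y")), ("w", ("z", "h")),
        ("t", ("F", "c"))]        # t reads F and c -> goes to the next word
words = pack_vliw(prog)
```

The four independent operations land in one word; the dependent one is pushed to the next, which is exactly the guarantee the compiler must provide.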
Advantages of VLIW architecture
• Reduced hardware complexity: parallelism among the primitive instructions is resolved by the compiler, not by the hardware
• A higher clock rate is possible because of the reduced complexity
Drawbacks of VLIW Architecture
• The compiler has to be aware of technology-dependent parameters such as latencies and repetition rates. This prevents the same compiler from being used across a family of VLIW processors.
• Memory space and bandwidth are wasted when some execution units (EUs) are unused
• Performance depends on how well the compiler packs the VLIW words.
Data Flow Computers
• They are based on the concept of data-driven computation.
• Conventional computers, by contrast, are under program flow control.
Features of Control Flow Model
• Data is passed between instructions via shared memory
• Flow of control is implicitly sequential
• Program counters are used to sequence the execution of instructions
Features of Data Flow Model
• Intermediate or final results are passed directly as data tokens between instructions
• There is no concept of shared data storage
• Program sequencing is constrained only by data dependence among instructions
Data Flow Graph
A data flow graph is a directed graph whose nodes correspond to operators and whose arcs are pointers forwarding data tokens.
The graph expresses the sequencing constraints among instructions.
• In a DFC, the machine-level program is represented by DFGs.
• The firing rules of instructions are based on data availability.
• An operator of the schema is enabled when tokens are present on all of its input arcs.
• The enabled operator may fire at any time by
– removing the tokens on its input arcs,
– computing a value from the operands associated with the input tokens, and
– associating that value with the result token placed on its output arc.
• The result may be sent to more than one destination by means of a link.
Consider the following simple computation
input : a,b
y = (a+b)/x
x = (a*(a+b))+b
output: y,x
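This computation makes the data-driven idea concrete: y textually precedes x, yet x must be produced first. The sketch below (an illustrative interpreter, not a real machine model; the node names t1 and t2 are invented intermediate links) fires each operator as soon as tokens for all of its operands are available, so only data dependence, never textual order, constrains the sequencing.

```python
import operator

def run_dataflow(nodes, inputs):
    """Minimal dataflow interpreter: `nodes` maps a result name to
    (function, operand names).  Every operator whose operand tokens
    are all present is enabled and fires; the result token is then
    available to its consumers."""
    tokens = dict(inputs)                       # initial data tokens
    pending = dict(nodes)
    while pending:
        enabled = [n for n, (_, ops) in pending.items()
                   if all(o in tokens for o in ops)]
        if not enabled:
            raise RuntimeError("no operator enabled (cyclic dependence)")
        for n in enabled:                       # fire every enabled operator
            f, ops = pending.pop(n)
            tokens[n] = f(*(tokens[o] for o in ops))
    return tokens

# the computation above, one node per operator:
# t1 = a+b, t2 = a*t1, x = t2+b, y = t1/x
graph = {
    "t1": (operator.add,     ("a", "b")),
    "t2": (operator.mul,     ("a", "t1")),
    "x":  (operator.add,     ("t2", "b")),
    "y":  (operator.truediv, ("t1", "x")),
}
result = run_dataflow(graph, {"a": 2, "b": 3})
```

With a = 2 and b = 3 the operators fire in the order t1, t2, x, y, even though y is written first in the source.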
• The representation of conditionals and iterations requires additional types of links and actors.
They are as follows:
• Data link: data values pass through data links
• Control link: control tokens, each conveying a value of either TRUE or FALSE, are transmitted through control links
Actors
• Operator: removes the tokens on its input arcs, computes a value from the operands, and places the result token on its output arc.
• Decider: receives values from its input arcs, applies its associated predicate, and produces either a TRUE or a FALSE control token on its output arc.
• Boolean Operator: combines control tokens produced by deciders, allowing a decision to be built up from simpler decisions.
• Control tokens direct the flow of data tokens by means of T-gates, F-gates and merge actors.
• T-gate: passes the data token on its input arc to its output arc only when it receives a control token with the TRUE value on its control input.
• F-gate: passes the data token on its input arc to its output arc only when it receives a control token with the FALSE value on its control input.
• Merge Actor: has a true input, a false input and a control input. It passes to its output arc the data token from the input arc corresponding to the value of the control token received.
Example: Draw the data flow diagram corresponding to the following computation (xⁿ):
input x,n
y=1, i=n
while i > 0 do
begin y=y*x; i = i-1 end
z=y
output z
• The tokens carrying the FALSE value on the input arcs of the merge operators allow them to initiate.
• The decider emits a token carrying the TRUE value each time the execution of the loop body is required.
• When the firing of the decider yields a FALSE, the value of y is routed to the output link z.
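The token flow of this loop can be rendered sequentially as an ordinary function (an illustrative sketch, not machine code; comments mark which dataflow actor each step corresponds to):

```python
def power_dataflow(x, n):
    """Sequential rendering of the x**n dataflow loop above."""
    y, i = 1, n                  # initial tokens enter the merge actors'
                                 # FALSE inputs, letting them initiate
    while True:
        ctrl = i > 0             # decider: emits a TRUE or FALSE control token
        if not ctrl:
            return y             # F-gate routes y to the output link z
        y, i = y * x, i - 1      # loop body; T-gates recirculate the new
                                 # y and i tokens for the next iteration
```

Each pass around the `while` loop corresponds to one round trip of the y and i tokens through the merge actors, T-gates and loop-body operators.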
Data Flow Machine Architectures
Depending on the way data tokens are handled, DFCs can be divided into
– the Static Model
– the Dynamic Model
Static Data Flow Computer
• In this machine, data tokens are assumed to move along the arcs of the dataflow program graph to the operator nodes.
• A nodal operation is executed when all of its operands are present on its input arcs.
• Only one token is allowed to exist on any arc at any given time.
• This architecture is static because
– tokens are not labeled, and
– control tokens must be used for timing.
SDFC Example
• A data flow schema to be executed is stored in the memory of the processor.
• The memory is organized into instruction cells; each cell corresponds to one operator of the dataflow program.
• Each instruction cell is composed of 3 registers:
– the first holds the instruction
– the second and the third hold the operands
• The instruction specifies the operation to be performed and the address(es) of the register(s) to which the result of the operation is to be directed.
• When a cell contains an instruction and the necessary operands, it is enabled and transmits its content as an operation packet.
• The arbitration network directs each operation packet to an appropriate processing unit by decoding the instruction portion of the packet.
• The processing unit performs the desired function.
• The result of an operation leaves the processing unit as one or more data packets, each consisting of the computed value and the address of the register in memory to which the value is to be delivered.
• The distribution network accepts the data packets and uses the address to store the value of the operation.
• Many instruction cells may be enabled simultaneously; it is the task of the arbitration network to deliver operation packets to processing units efficiently and to queue them when necessary.
• Each arbitration unit passes one packet at a time, resolving conflicts.
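The cell / packet cycle above can be sketched as a small simulation. This is an illustrative model only (the class and function names are invented here, and a single loop stands in for the arbitration and distribution networks): a cell fires once both operand registers are filled, emits data packets, and the packets are routed back into other cells' operand registers.

```python
import operator

class InstructionCell:
    """Static-dataflow instruction cell: register 1 holds the operation
    and the destination addresses, registers 2 and 3 hold the operands.
    The cell is enabled once both operand registers are filled."""
    def __init__(self, op, dests):
        self.op, self.dests = op, dests   # dests: (cell name, slot) pairs
        self.operands = [None, None]
    def enabled(self):
        return None not in self.operands
    def fire(self):
        """Emit data packets (address, value); clearing the operand
        registers enforces at most one token per arc at a time."""
        a, b = self.operands
        self.operands = [None, None]
        return [(d, self.op(a, b)) for d in self.dests]

def run(cells):
    """Toy stand-in for the arbitration/distribution networks: fire
    every enabled cell and route its packets until nothing is enabled.
    Packets addressed outside the cell memory are collected as outputs."""
    outputs = {}
    fired = True
    while fired:
        fired = False
        for cell in cells.values():
            if cell.enabled():
                for (target, slot), value in cell.fire():
                    if target in cells:
                        cells[target].operands[slot] = value
                    else:
                        outputs[target] = value
                fired = True
    return outputs

# compute (3 + 4) * (3 - 4) with three instruction cells
cells = {
    "add": InstructionCell(operator.add, [("mul", 0)]),
    "sub": InstructionCell(operator.sub, [("mul", 1)]),
    "mul": InstructionCell(operator.mul, [("z", 0)]),
}
cells["add"].operands = [3, 4]
cells["sub"].operands = [3, 4]
print(run(cells))   # prints {'z': -7}
```

Note that `add` and `sub` are both enabled at the start and fire independently, mirroring how many instruction cells may be enabled simultaneously.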
Dynamic Dataflow Architecture
• It uses tagged tokens, so that more than one token can exist on an arc at a time.
• Tagging is achieved by attaching to each token a label which identifies the context of that token.
• Maximum parallelism can be exploited in this model.
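The tag-matching rule can be illustrated with a minimal matching store (a sketch with invented names, loosely in the spirit of tagged-token machines): a token is (tag, port, value), and an instance of a two-input node fires only when tokens carrying the SAME tag are present on all of its ports, so tokens from different loop iterations can share an arc without interfering.

```python
from collections import defaultdict

class MatchingStore:
    """Sketch of tagged-token matching for a node of fixed arity."""
    def __init__(self, arity):
        self.arity = arity
        self.waiting = defaultdict(dict)     # tag -> {port: value}
    def add(self, tag, port, value):
        """Store a token; return (tag, operands) when a full operand
        set for that tag is assembled, else None (still waiting)."""
        slot = self.waiting[tag]
        slot[port] = value
        if len(slot) == self.arity:          # all ports filled for this tag
            del self.waiting[tag]
            return tag, [slot[p] for p in sorted(slot)]
        return None

m = MatchingStore(2)
m.add(0, 0, "a")            # iteration 0, left operand: waits
m.add(1, 0, "x")            # iteration 1 token on the SAME arc: also waits
fire1 = m.add(1, 1, "y")    # completes iteration 1 first
fire0 = m.add(0, 1, "b")    # iteration 0 completes later, unharmed
```

Iteration 1 fires before iteration 0 even though both occupy the same arcs, which is exactly the extra parallelism a static (one-token-per-arc) machine cannot exploit.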