Unit 5
Multiprocessor
Characteristics of Multiprocessor
• A multiprocessor is an interconnection of two or more CPUs with memory and input-output equipment.
• The term "processor" in multiprocessor can mean either a CPU or an input-output processor (IOP).
• However, a system with a single CPU and one or more IOPs is usually not included in the definition of a multiprocessor system unless the IOP has computational facilities comparable to a CPU.
• A multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
• Multiprocessing improves the reliability of the system, so that a failure or error in one part has a limited effect on the rest of the system. If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor.
• A multiprocessor system provides improved system performance by performing the computation in parallel in one of two ways:
• Multiple independent jobs can be made to operate in parallel.
• A single job can be partitioned into multiple parallel tasks.
Multiple independent jobs can be made to operate in parallel
• An overall function can be partitioned into a number of tasks that each processor can handle individually.
• System tasks may be allocated to special-purpose processors whose design is optimized to perform certain types of processing efficiently.
• For example: a computer system where one processor performs the computations for an industrial process control while others monitor and control various parameters, such as temperature and flow rate.
• Another example is a computer where one processor performs high-speed floating-point mathematical computations and another takes care of routine data-processing tasks.
A single job can be partitioned into multiple parallel tasks
• Multiprocessing can improve performance by decomposing a program into parallel executable tasks.
• This can be achieved in one of two ways.
• The user can explicitly declare that certain tasks of the program be executed in parallel. This must be done prior to loading the program, by specifying the parallel executable segments.
• The other, more efficient, way is to provide a compiler with multiprocessor software that can automatically detect parallelism in a user's program. The compiler checks for data dependency in the program: if one part of a program depends on data generated in another part, the part yielding the needed data must be executed first. However, two parts of a program that do not use data generated by each other can run concurrently (see the sketch below).
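
As an illustration of tasks with no data dependency, here is a minimal Python sketch (not from the original text): parts A and B use no data produced by each other, so they may run concurrently, while part C depends on both results and must run last.

    from concurrent.futures import ThreadPoolExecutor

    def part_a():
        # Independent computation: uses no data produced by part_b.
        return sum(i * i for i in range(1000))

    def part_b():
        # Independent computation: uses no data produced by part_a.
        return sum(i * 3 for i in range(1000))

    with ThreadPoolExecutor() as pool:
        fa = pool.submit(part_a)       # A and B have no data dependency,
        fb = pool.submit(part_b)       # so they may run in parallel
        c = fa.result() + fb.result()  # C depends on both, so it runs last

    print(c)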
Classification of multiprocessors based on the way memory is organised
• Tightly coupled (or shared memory) multiprocessor
• Loosely coupled (or distributed memory) multiprocessor
Tightly coupled multiprocessor: A multiprocessor with common shared memory. Each processor can also have its own local memory. Most commercial tightly coupled multiprocessors provide a cache memory with each CPU. In addition, there is a global common memory that all CPUs can access.
Loosely coupled multiprocessor: Each processor in a loosely coupled system has its own private local memory. The processors are tied together by a switching scheme designed to route information from one processor to another through a message-passing scheme. The processors relay programs and data to other processors in packets. A packet consists of an address, the data content, and some error-detection code. The packets are addressed to a specific processor or taken by the first available processor, depending on the communication system used.
Loosely coupled systems are most efficient when the interaction between tasks is minimal, whereas tightly coupled systems can tolerate a higher degree of interaction between tasks.
Figure: Shared memory and distributed memory multiprocessor
Interconnection structures
The interconnection between the components of a multiprocessor system can have different physical configurations, depending on the number of transfer paths that are available between the processors and memory in a shared memory system, or among the processing elements in a loosely coupled system. These physical configurations are called interconnection structures. The following structures are available for establishing an interconnection network:
• Time-shared common bus
• Multiport memory
• Crossbar switch
• Multistage switching network
• Hypercube system
Time-shared common bus
• A common-bus multiprocessor system consists of a number of processors connected through a common path to a memory unit.
• As the figure shows, only one processor can communicate with the memory or with another processor at any given time.
• Transfer operations are conducted by the processor that is in control of the bus at the time.
• Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can the processor address the destination unit to initiate the transfer.
• A command is issued to inform the destination unit what operation is to be performed.
• The receiving unit recognizes its address on the bus and responds to the control signals from the sender, after which the transfer is initiated.
• The system may exhibit transfer conflicts, since one common bus is shared by all processors.
• These conflicts must be resolved by incorporating a bus controller that establishes priorities among the requesting units.
• A single common-bus system is restricted to one transfer at a time.
• This means that when one processor is communicating with the memory, all other processors are either busy with internal operations or must be idle, waiting for the bus.
• As a consequence, the total transfer rate within the system is limited by the speed of the single path.
Figure: Time-shared bus, dual-bus structure
Multiport memory
• It employs separate buses between each memory module and each CPU.
• The structure with four CPUs and four memory modules (MM) is shown in the figure.
• Each processor bus is connected to each memory module.
• A processor bus consists of the address, data, and control lines required to communicate with memory.
• Each memory module is said to have four ports, and each port accommodates one of the buses.
• The module must have internal control logic to determine which port will have access to memory at any given time.
• Memory-access conflicts are resolved by assigning fixed priorities to each memory port.
• The priority for memory access associated with each processor may be established by the physical position that its bus occupies in each module.
• Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority (see the sketch below).
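
A minimal sketch (Python, illustrative only) of the fixed-priority rule a module's internal control logic applies: among the ports requesting access in a given cycle, the lowest-numbered (highest-priority) port is granted.

    def grant(requests):
        """requests: list of booleans, index 0 = CPU 1's port (highest priority).
        Returns the index of the port granted access, or None."""
        for port, requesting in enumerate(requests):
            if requesting:
                return port
        return None

    # CPUs 2 and 4 request the module simultaneously; CPU 2 (port 1) wins.
    print(grant([False, True, False, True]))  # -> 1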
• The advantage of the multiport memory organization is the high transfer rate that can be achieved because of the multiple paths between processors and memory.
• The disadvantage is that it requires expensive memory control logic and a large number of cables and connectors. As a consequence, this interconnection structure is usually appropriate only for systems with a small number of processors.
Crossbar switch
• It consists of a number of crosspoints placed at the intersections between processor buses and memory-module paths.
• The figure shows a crossbar-switch interconnection between four CPUs and four memory modules.
• The small square at each crosspoint is a switch that determines the path from a processor to a memory module.
• Each switch point has control logic to set up the transfer path between a processor and memory.
• The switch examines the address placed on the bus to determine whether its particular module is being addressed. It also resolves multiple requests for access to the same memory module on a predetermined priority basis.
Figure: Block diagram of crossbar switch
• The figure shows the functional design of a crossbar switch connected to one memory module.
• The circuit consists of multiplexers that select the data, address, and control lines from one CPU for communication with the memory module.
• Priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory.
• The multiplexers are controlled by a binary code that is generated by a priority encoder within the arbitration logic.
• A crossbar switch supports simultaneous transfers from all memory modules, because there is a separate path associated with each module.
• However, the hardware required to implement the switch can become quite large and complex (a behavioural sketch follows).
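
The sketch below (Python, an illustration rather than the textbook's circuit) mimics one cycle of a 4x4 crossbar: each CPU requests one module, each module's arbitration logic grants its highest-priority requester, and transfers to different modules proceed simultaneously.

    def crossbar_cycle(requests):
        """requests[cpu] = memory module requested by that CPU (or None).
        Returns {module: cpu} granted this cycle; a lower CPU index
        means higher priority."""
        grants = {}
        for cpu, module in enumerate(requests):
            if module is not None and module not in grants:
                grants[module] = cpu  # first (highest-priority) requester wins
        return grants

    # CPUs 0 and 2 contend for module 1; CPU 0 wins. CPUs 1 and 3 get their
    # own modules at the same time: three simultaneous transfers.
    print(crossbar_cycle([1, 3, 1, 0]))  # -> {1: 0, 3: 1, 0: 3}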
Multistage switching network
Figure: Operation of a 2x2 interchange switch
Interchange switch
• The basic component of a multistage network is a two-input, two-output interchange switch.
• As shown in the figure, the 2x2 switch has two inputs, labelled A and B, and two outputs, labelled 0 and 1.
• Control signals associated with the switch establish the interconnection between the input and output terminals.
• The switch has the capability to arbitrate between conflicting requests.
• If inputs A and B both request the same output terminal, only one of them will be connected; the other will be blocked.
Figure: Binary tree with 2x2 switches
• The figure shows a binary-tree multistage switching network using the 2x2 switch as a building block.
• Two processors, P1 and P2, are connected through switches to eight memory modules marked in binary from 000 through 111.
• The path from a source to a destination is determined from the binary bits of the destination number.
• The first bit of the destination number determines the switch output in the first level, the second bit specifies the output of the switch in the second level, and the third bit specifies the output of the switch in the third level.
• For example, to connect P1 to memory module 101, it is necessary to form a path from P1 to output 1 in the first-level switch, output 0 in the second-level switch, and output 1 in the third-level switch (see the sketch after this list).
• Either P1 or P2 can be connected to any one of the eight memories.
• Certain request patterns, however, cannot be satisfied simultaneously.
• For example, if P1 is connected to one of the destinations 000 through 011, P2 can be connected to only one of the destinations 100 through 111.
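
A minimal sketch (Python, illustrative) of this destination-tag routing rule: each bit of the destination address selects the output of the switch at the corresponding level.

    def route(destination, levels=3):
        """Return the switch output (0 or 1) taken at each level of the
        binary tree for a given destination module number."""
        bits = format(destination, "0{}b".format(levels))
        return [int(b) for b in bits]

    # To reach module 101, take output 1, then 0, then 1.
    print(route(0b101))  # -> [1, 0, 1]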
Hypercube Interconnection
• A hypercube, or binary n-cube, multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube.
• Each processor forms a node of the cube.
• Although it is customary to refer to each node as having a processor, in effect it contains not only a CPU but also local memory and an I/O interface.
• Each processor has direct communication paths to n other neighbour processors. These paths correspond to the edges of the cube.
• There are 2^n distinct n-bit binary addresses that can be assigned to the processors.
• Each processor's address differs from that of each of its n neighbours in exactly one bit position.
• The figure shows the hypercube structure for n = 1, 2, and 3.
• A one-cube has n = 1 and 2^1 = 2 nodes. It contains two processors interconnected by a single path.
• A two-cube structure has n = 2 and four nodes interconnected as a square.
• A three-cube structure has n = 3 and eight nodes interconnected as a cube.
• An n-cube structure has 2^n nodes, with a processor residing in each node.
• Each node is assigned a binary address in such a way that the addresses of two neighbours differ in exactly one bit position.
• Routing messages through an n-cube structure may take from one to n links from a source node to a destination node.
• For example, in a three-cube structure, node 000 can communicate directly with node 001, but it must go through at least two links to communicate with node 011 (see the sketch below).
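
A common routing rule for hypercubes (a sketch; the rule is not spelled out in the text) is to XOR the source and destination addresses: each 1 bit in the result names a dimension the message must cross, so the number of 1 bits is the number of links.

    def hypercube_route(src, dst, n=3):
        """Yield the successive node addresses on one shortest path
        from src to dst in an n-cube, flipping one differing bit per hop."""
        path = [src]
        diff = src ^ dst                 # 1 bits mark the dimensions to cross
        for bit in range(n):
            if diff & (1 << bit):
                path.append(path[-1] ^ (1 << bit))
        return path

    # 000 -> 001 -> 011: two links, as stated above.
    print(hypercube_route(0b000, 0b011))  # -> [0, 1, 3]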
Inter-processor Arbitration
• A computer system contains a number of buses at various levels to facilitate the transfer of information between components.
• The CPU contains a number of internal buses for transferring information between processor registers and the ALU.
• A memory bus consists of lines for transferring data, address, and read/write information.
• An I/O bus is used to transfer information to and from input and output devices.
• A bus that connects major components such as CPUs, IOPs, and memory is called a system bus.
• The processors in a shared-memory multiprocessor system request access to common memory or other common resources through the system bus. If no other processor is currently utilizing the bus, the requesting processor may be granted access immediately. However, the requesting processor must wait if another processor is currently utilizing the system bus.
• Furthermore, other processors may request the system bus at the same time.
• Arbitration must then be performed to resolve this multiple contention for the shared resources.
Inter-processor Arbitration
• Priority-resolving techniques:
• Hardware: serial and parallel arbitration
• Software: dynamic arbitration algorithms
Serial arbitration procedure
• The serial arbitration technique is obtained from a daisy-chain connection.
• The processors connected to the system bus are assigned priorities according to their position along the priority control line.
• The device closest to the priority line is assigned the highest priority.
• When multiple devices concurrently request the use of the bus, the device with the highest priority is granted access to it.
Figure: Serial arbitration (daisy-chain) procedure
Serial arbitration procedure
• Figure shows the daisy-chain connections of four arbiters. It is assumed that each
processor has its own arbiter logic with priority-in (PI) and priority-out (PO) lines.
• The PO of each arbiter is connected to PI of the next-lower-priority arbiter.
• PI of the highest-priority unit is maintained at a logic 1 value.
• Highest priority unit in the system will always receive access to the system bus
when it requests it.
• The scheme got the term from the structure of the grant line, which chains through
each device from the higher to lowest priority. The highest priority device will pass
the grant line to the lower priority device only if it does not want it. Then the priority is
forwarded to the next in the sequence. All devices use the same line for bus
requests.
• If a busy bus line returns to its idle state, the most high-priority arbiter enables the
busy line, and its corresponding processor can then run the required bus transfer.
• Thus, the processor whose arbiter has PI=1 and PO=0 is the one that is given control
of the system bus.
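
A minimal sketch (Python, illustrative) of the PI/PO propagation: each arbiter passes the priority signal on (PO = PI) unless its processor is requesting the bus, in which case it keeps it (PO = 0); the grantee is the arbiter with PI = 1 and PO = 0.

    def daisy_chain(requests):
        """requests: booleans ordered from highest to lowest priority.
        Returns the index of the arbiter granted the bus, or None."""
        pi = 1  # PI of the highest-priority arbiter is tied to logic 1
        for i, req in enumerate(requests):
            po = 0 if req else pi   # a requesting arbiter blocks the chain
            if pi == 1 and po == 0:
                return i            # PI = 1 and PO = 0: this unit gets the bus
            pi = po
        return None

    # Arbiters 1 and 3 request at once; arbiter 1 (higher priority) wins.
    print(daisy_chain([False, True, False, True]))  # -> 1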
Parallel arbitration logic
• The parallel arbitration technique uses an external priority encoder and decoder.
• Each bus arbiter in the parallel scheme has a bus-request output line and a bus-acknowledge input line.
• Each arbiter enables its request line when its processor is requesting access to the system bus.
• A processor takes control of the bus when its acknowledge input line is enabled.
• The bus-busy line provides an orderly transfer of control, as in the daisy-chaining case.
Figure: Parallel arbitration logic
• The figure shows parallel arbitration for a multiprocessor system consisting of four processors.
• Each processor has its own bus arbiter. The request lines from the four arbiters are applied as inputs to a 4x2 priority encoder. The output of the priority encoder is a 2-bit code that represents the highest-priority unit among those requesting the bus.
• The 2-bit code from the encoder output drives a 2x4 decoder, which enables the proper acknowledge line to grant bus access to the highest-priority unit.
Priority encoder
A priority encoder is a circuit that implements the priority function. The logic of the priority encoder is such that if two or more inputs arrive at the same time, the input having the highest priority takes precedence. The truth table of a four-input priority encoder is shown in the figure. The x's in the table designate don't-care conditions. Input I0 has the highest priority, so regardless of the values of the other inputs, when this input is 1 the output is xy = 00. I1 has the next priority level: the output is 01 if I1 = 1, provided that I0 = 0, regardless of the values of the other two lower-priority inputs. The output for I2 is generated only if the higher-priority inputs are 0, and so on down the priority levels.
The interrupt status IST is set only when one or more inputs are equal to 1. If all inputs are 0, IST is cleared to 0 and the other outputs of the encoder are not used, so they are marked with don't-care conditions. This is because the vector address is not transferred to the CPU when IST = 0.
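
The truth table just described is the standard one; it is reconstructed below from the description (x = don't-care), together with a small Python sketch of the same logic.

    I0 I1 I2 I3 | x y IST
    1  x  x  x  | 0 0  1
    0  1  x  x  | 0 1  1
    0  0  1  x  | 1 0  1
    0  0  0  1  | 1 1  1
    0  0  0  0  | x x  0

    def priority_encoder(i0, i1, i2, i3):
        """Four-input priority encoder: returns (x, y, IST).
        I0 has the highest priority; x, y are don't-cares when IST = 0."""
        for code, inp in enumerate((i0, i1, i2, i3)):
            if inp:
                return (code >> 1, code & 1, 1)  # xy = two-bit code of the input
        return (0, 0, 0)  # no input active: IST = 0, other outputs unused

    print(priority_encoder(0, 1, 0, 1))  # -> (0, 1, 1): I1 outranks I3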
Dynamic arbitration algorithms
• The priorities of the units can be dynamically changed while the system is in operation.
Time slice
A fixed-length time slice is given sequentially to each processor, in round-robin fashion.
Polling
The bus controller sequences through the unit addresses to identify the requesting unit. When the processor that requires access recognizes its address, it activates the bus-busy line and then accesses the bus. After a number of bus cycles, polling continues by choosing a different processor.
LRU
The least recently used algorithm gives the highest priority to the requesting device that has not used the bus for the longest interval.
FIFO
In the first-come, first-served scheme, requests are served in the order received. The bus controller maintains a queue data structure for this purpose.
Rotating daisy chain
The PO output of the last device is connected to the PI of the first one, closing the chain into a loop. The highest priority is given to the unit that is nearest (along the chain) to the unit that has most recently accessed the bus; that unit acts as the bus controller.
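
A sketch (Python, illustrative) of the LRU policy from the list above: grant the requester whose last bus use is oldest, then record the grant.

    def lru_grant(requests, last_used):
        """requests: set of requesting unit ids.
        last_used: dict unit -> time of its last bus use (smaller = older).
        Returns the unit granted the bus."""
        return min(requests, key=lambda u: last_used[u])

    last_used = {0: 5, 1: 2, 2: 9, 3: 7}
    winner = lru_grant({0, 2, 3}, last_used)
    print(winner)           # -> 0: among requesters, unit 0 is least recent
    last_used[winner] = 10  # update its timestamp after the grant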
Inter-processor communication
Shared memory multiprocessor:
• In the master-slave mode, one processor, the master, always executes the operating system functions.
• In the separate operating system organization, each processor can execute the operating-system routines it needs. This organization is more suitable for loosely coupled systems.
• In the distributed operating system organization, the operating-system routines are distributed among the available processors, but each particular operating-system function is assigned to only one processor at a time. This is also referred to as a floating operating system.
Inter-processor communication:
Loosely coupled multiprocessor:
• There is no shared memory for passing information.
• The communication between processors is by means of message passing
through I/O channels.
• The communication is initiated by one processor calling a procedure that
resides in the memory of the processor with which it wishes to
communicate.
• The communication efficiency of the inter-processor network depends on
the communication routing protocol, processor speed, data link speed, and
the topology of the network.
Inter-processor Synchronization
• Synchronization is an essential part of inter-processor communication. It refers to the special case where the data used to communicate between processors is control information.
• It is needed to enforce the correct sequence of processes and to ensure mutually exclusive access to shared writable data.
Mutual exclusion with semaphore (a method to provide inter-processor synchronization)
• A properly functioning multiprocessor system must provide a mechanism that will guarantee orderly access to shared memory and other shared resources.
• This is necessary to protect data from being changed simultaneously by two or more processors.
• This mechanism has been termed mutual exclusion.
• Mutual exclusion must be provided in a multiprocessor system to enable one processor to exclude, or lock out, access to a shared resource by other processors when it is in a critical section.
• A critical section is a program sequence that, once begun, must complete execution before another processor accesses the same shared resource.
• A binary variable called a semaphore is often used to indicate whether a processor is executing a critical section.
• A semaphore is a software-controlled flag that is stored in a memory location that all processors can access.
• When the semaphore is equal to 1, it means that a processor is executing a critical program section, so the shared memory is not available to other processors.
• When the semaphore is 0, the shared memory is available to any requesting processor.
• Processors that share the same memory segment agree, by convention, not to use the segment unless the semaphore is equal to 0, indicating that memory is available.
• They also agree to set the semaphore to 1 when they are executing a critical section and to clear it to 0 when they are finished (see the sketch below).
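
A minimal sketch in Python, with threading.Lock standing in for the shared flag (testing and setting the semaphore must be indivisible, which the lock guarantees): four workers update shared data only inside the critical section.

    import threading

    semaphore = threading.Lock()  # stands in for the shared flag; acquire()
                                  # plays the role of an indivisible test-and-set
    counter = 0

    def worker():
        global counter
        for _ in range(100_000):
            semaphore.acquire()   # semaphore := 1 (wait while it is already 1)
            counter += 1          # critical section: shared writable data
            semaphore.release()   # semaphore := 0

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # -> 400000: mutual exclusion kept the updates consistent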
Parallel processing
• Parallel processing is a term used to denote a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
• Instead of processing each instruction sequentially, as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
• The system may have two or more ALUs and be able to execute two or more instructions at the same time, or it may have two or more processors operating concurrently.
• The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time.
• Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously.
• Parallel processing is established by distributing the data among the multiple functional units.
• For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.
Figure: Processor with multiple functional units
Applications and types of parallel processing
Applications
• Numeric weather forecasting
• Computational aerodynamics
• Finite-element analysis
• Remote-sensing applications
• Genetic engineering
• Computer-assisted tomography
• Weapons research and defence
Types
• Pipeline processing
• Vector processing
• Array processors
Classification of parallel processing by M. J. Flynn
• M. J. Flynn's classification of parallel processing considers the organisation of a computer system by the number of instructions and data items that are manipulated simultaneously.
• Instruction stream: the sequence of instructions read from memory.
• Data stream: the operations performed on the data in the processor.
• Flynn's classification divides computers into four major groups:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)
SISD
• It represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
• Instructions are executed sequentially, and the system may or may not have internal parallel processing capabilities.
• Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing.
SIMD
• It represents an organization that includes many processing units under the supervision of a common control unit.
• All processors receive the same instruction from the control unit but operate on different items of data.
• The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.
* The MISD structure is of theoretical interest only, since no practical system has been constructed using this organization.
MIMD
• It refers to a computer system capable of processing several
programs at the same time. Most multiprocessor and
multicomputer systems can be classified in this category.
Pipelining
• Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
• A pipeline can be visualized as a collection of processing segments through which binary information flows.
• Each segment performs partial processing dictated by the way the task is partitioned.
• The result obtained from the computation in each segment is transferred to the next segment in the pipeline.
• The final result is obtained after the data have passed through all segments.
• The overlapping of computation is made possible by associating a register with each segment, so that each segment can operate on distinct data simultaneously.
• The simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit.
• The register holds the data and the combinational circuit performs the suboperation in the particular segment.
• The output of the combinational circuit in a given segment is applied to the input register of the next segment.
• A clock is applied to all registers after enough time has elapsed to perform all segment activity.
• In this way the information flows through the pipeline one step at a time.
Example of pipelining
• Suppose that we want to perform the combined multiply and add operations with a stream of numbers:

      Ai * Bi + Ci    for i = 1, 2, 3, ..., 7

• Each suboperation can be implemented in a segment within a pipeline.
• Each segment has one or two registers and a combinational circuit.
• R1 through R5 are registers that receive new data with every clock pulse; the multiplier and adder are combinational circuits.
• The suboperations performed in each segment of the pipeline are as follows:

      R1 <- Ai,  R2 <- Bi          (input Ai and Bi)
      R3 <- R1 * R2,  R4 <- Ci     (multiply and input Ci)
      R5 <- R3 + R4                (add Ci to the product)
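
A sketch (Python, illustrative) that steps this three-segment pipeline clock by clock. Registers are updated back to front, so each segment sees the previous cycle's values; after three pulses the first result appears, then one new result per pulse.

    def pipeline(a, b, c):
        """Simulate the 3-segment multiply-add pipeline R1..R5.
        Registers start empty (None); one clock pulse per loop iteration."""
        r1 = r2 = r3 = r4 = r5 = None
        results = []
        for i in range(len(a) + 3):                       # extra pulses drain the pipe
            r5 = r3 + r4 if r3 is not None else None      # segment 3: add
            r3, r4 = (r1 * r2 if r1 is not None else None,
                      c[i - 1] if 0 < i <= len(c) else None)  # segment 2: multiply
            r1, r2 = (a[i], b[i]) if i < len(a) else (None, None)  # segment 1: input
            if r5 is not None:
                results.append(r5)
        return results

    # 1*4+7, 2*5+8, 3*6+9:
    print(pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # -> [11, 18, 27]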
Figure: Contents of registers in the pipeline example
Pipelining
• It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
• From then on, each clock pulse produces a new output and moves the data one step down the pipeline.
• This continues as long as new input data flow into the system.
• When no more input data are available, the clock must continue until the last output emerges from the pipeline.
Figure: Four-segment pipeline
Figure: Space-time diagram for a pipeline
Space-time diagram: a diagram that shows segment utilization as a function of time. The horizontal axis displays the time in clock cycles and the vertical axis gives the segment number.
The diagram shows six tasks, T1 through T6, executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1 while segment 1 is busy with task T2. Continuing in this manner, the first task T1 is completed after the fourth clock cycle. From then on, the pipe completes a task every clock cycle. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
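
A quick check of this claim (standard pipeline arithmetic, not spelled out on the slide): with k segments and n tasks, a pipeline needs k + (n - 1) clock cycles in total, versus n*k cycles without pipelining, for a speedup of S = n*k / (k + n - 1). In the diagram above, k = 4 and n = 6, so the six tasks finish in 4 + 5 = 9 cycles instead of 24.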
Vector processing
• There is a class of computational problems that are beyond the capabilities of a conventional computer.
• These problems are characterised by the fact that they require a vast number of computations that would take a conventional computer days or even weeks to complete.
• In many science and engineering applications, the problems can be formulated in terms of vectors and matrices that lend themselves to vector processing.
• Applications of vector processing:
• Long-range weather forecasting
• Petroleum explorations
• Seismic data analysis
• Medical diagnosis
• Aerodynamics and space flight simulations
• Artificial intelligence and expert systems
• Mapping the human genome
• Image processing
Vector operations
• A vector is an ordered set of data items arranged as a one-dimensional array (here, of floating-point numbers).
• A vector V of length n is represented as a row vector by V = [V1 V2 V3 ... Vn]. It may be represented as a column vector if the data items are listed in a column.
• A conventional sequential computer is capable of processing operands only one at a time.
• Consequently, operations on vectors must be broken down into single computations with subscripted variables.
• The element Vi of vector V is written as V(I), where the index I refers to a memory address or register where the number is stored.
Vector operations: example with a conventional scalar processor
• To examine the difference between a conventional scalar processor and a vector processor, consider the following Fortran DO loop, which adds two arrays of 100 elements:

      DO 20 I = 1, 100
   20 C(I) = A(I) + B(I)

• This program is implemented in machine language by the following sequence of operations:

      Initialize I = 0
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20
      Continue
Vector operations and instruction format
• A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows the operations to be specified with a single vector instruction of the form

      C(1:100) = A(1:100) + B(1:100)

Figure: Instruction format for a vector processor
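
A rough software analogue (Python with NumPy, illustrative only): the explicit loop plays the scalar processor's role, while the single whole-array expression plays the role of the vector instruction.

    import numpy as np

    a = np.arange(100.0)
    b = np.arange(100.0, 200.0)

    # Scalar style: one element per "instruction", loop overhead on every step.
    c_scalar = np.empty(100)
    for i in range(100):
        c_scalar[i] = a[i] + b[i]

    # Vector style: one expression specifies the whole operation, like
    # C(1:100) = A(1:100) + B(1:100).
    c_vector = a + b

    print(np.array_equal(c_scalar, c_vector))  # -> True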
Matrix multiplication
• Matrix multiplication is one of the most computationally intensive operations performed in computers with vector processors.
• The multiplication of two n x n matrices consists of n^2 inner products, or n^3 multiply-add operations.
• Consider the multiplication of two 3x3 matrices A and B.
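
As a worked count: each element of the product C = A*B is one inner product, e.g. c11 = a11*b11 + a12*b21 + a13*b31. A 3x3 product therefore requires 3^2 = 9 such inner products of three multiply-add operations each, 27 multiply-adds in all.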
Pipeline for calculating an inner product
• In general, an inner product consists of the sum of k product terms of the form

      C = A1B1 + A2B2 + A3B3 + ... + AkBk

• The values of A and B are either in memory or in processor registers.
• The floating-point multiplier pipeline and the floating-point adder pipeline are assumed to have four segments each.
• All segment registers in the multiplier and adder are initialized to 0, so the output of the adder is 0 for the first eight cycles, until both pipes are full.
• The Ai and Bi pairs are brought in and multiplied at a rate of one pair per cycle.
• After the first four cycles, the products begin to be added to the output of the adder.
• During the next four cycles, 0 is added to the products entering the adder pipeline.
• At the end of the eighth cycle, the first four products, A1B1 through A4B4, are in the four adder segments, and the next four products, A5B5 through A8B8, are in the multiplier segments.
• At the beginning of the ninth cycle, the addition A1B1 + A5B5 starts in the adder pipeline; the tenth cycle starts the addition A2B2 + A6B6, and so on.
• This pattern breaks the summation down into four sections, as follows:

      C = A1B1 + A5B5 + A9B9   + A13B13 + ...
        + A2B2 + A6B6 + A10B10 + A14B14 + ...
        + A3B3 + A7B7 + A11B11 + A15B15 + ...
        + A4B4 + A8B8 + A12B12 + A16B16 + ...
Figure: Pipeline for vector processing
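
A sketch (Python, illustrative) of the four-section interleaving above: element i lands in partial sum i mod 4, and the four partial sums are combined at the end, mirroring what the four-segment adder accumulates.

    def pipelined_inner_product(a, b):
        """Accumulate products into four interleaved partial sums, the way a
        4-segment adder pipeline would, then combine them."""
        partial = [0.0] * 4
        for i, (x, y) in enumerate(zip(a, b)):
            partial[i % 4] += x * y      # section i mod 4 gets this product
        return sum(partial)

    a = [1.0, 2.0, 3.0, 4.0, 5.0]
    b = [2.0, 2.0, 2.0, 2.0, 2.0]
    print(pipelined_inner_product(a, b))  # -> 30.0, same as the direct sum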
Memory interleaving
Figure: Multiple-module memory organisation
Memory module
• The memory can be partitioned into a number of modules connected to common memory address and data buses.
• A memory module is a memory array together with its own address and data registers.
• Each memory array has its own address register (AR) and data register (DR).
• The address registers receive information from the common address bus, and the data registers communicate with a bidirectional data bus.
• The two least significant bits of the address can be used to distinguish between the four modules.
• The modular system permits one module to initiate a memory access while other modules are in the process of reading or writing a word, and each module can honour a memory request independently of the state of the other modules.
• The advantage of a modular memory is that it allows the use of a technique called interleaving.
• In an interleaved memory, different sets of addresses are assigned to different memory modules.
• For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other.
• When the number of modules is a power of 2, the least significant bits of the address select a memory module and the remaining bits designate the specific location to be accessed within the selected module (see the sketch below).
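
A sketch (Python, illustrative) of that address split for a system with a power-of-2 number of modules: the low bits pick the module and the remaining bits pick the word within it.

    def interleave(address, modules=4):
        """Split an address for a power-of-2 interleaved memory.
        Returns (module, offset): the low bits select the module."""
        m = modules.bit_length() - 1        # number of low bits, e.g. 4 -> 2
        return address & (modules - 1), address >> m

    # Consecutive addresses 0..7 rotate through the four modules, so a
    # sequential access stream keeps all modules busy.
    for addr in range(8):
        print(addr, interleave(addr))   # 0 -> (0, 0), 1 -> (1, 0), ... 4 -> (0, 1)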
Array processor
• An array processor is a processor that performs computations on large arrays of data.
Array processors are of two types:
• Attached array processor
• SIMD array processor
Attached array processor
• An attached array processor is designed as a peripheral for a
conventional host computer, and its purpose is to enhance the
performance of the computer by providing vector processing for
complex scientific applications.
• It achieves high performance by means of parallel processing with
multiple functional units.
• It includes an arithmetic unit containing one or more pipelined
floating-point adders and multipliers.
• The array processor can be programmed by the user to
accommodate a variety of complex arithmetic problems.
Figure: Attached array processor
• The figure shows the interconnection of an attached array processor to a host computer.
• The host computer is a general-purpose commercial computer, and the attached processor is a back-end machine driven by the host computer.
• The array processor is connected through an input-output controller to the computer, and the computer treats it like an external interface.
• The data for the attached processor are transferred from main memory to a local memory through a high-speed bus.
• The general-purpose computer without the attached processor serves users that need conventional data processing.
• The system with the attached processor satisfies the need for complex arithmetic applications.
• The objective of the attached array processor is to provide vector-manipulation capabilities to a conventional computer at a fraction of the cost of supercomputers.
SIMD array processor
• A SIMD array processor is a computer with multiple processing
units operating in parallel.
• The processing units are synchronized to perform the same
operation under the control of a common control unit, thus
providing a single instruction stream, multiple data stream (SIMD)
organization.
Figure: SIMD array processor
• It contains a set of identical processing elements (PEs), each having a local memory (M).
• Each PE includes an ALU, a floating-point arithmetic unit, and working registers.
• The master control unit controls the operations in the processor elements.
• The main memory is used for storage of the program.
• The function of the master control unit is to decode the instructions and determine how the instructions are to be executed.
• Scalar and program-control instructions are directly executed within the master control unit.
• Vector instructions are broadcast to all PEs simultaneously.
• Each PE uses operands stored in its local memory.
• Vector operands are distributed to the local memories prior to the parallel execution of the instruction.
Example of vector addition using a SIMD processor
• C = A + B
• The master control unit first stores the i-th components ai and bi of A and B in local memory Mi for i = 1, 2, ..., n.
• It then broadcasts the floating-point add instruction ci = ai + bi to all PEs, causing the addition to take place simultaneously.
• The components ci are stored in fixed locations in each local memory.
• This produces the desired vector sum in one add cycle (see the sketch below).
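
A sketch (Python, illustrative) of that broadcast: each PE's local memory holds one (ai, bi) pair, and one broadcast "add" makes every PE produce its ci in the same step.

    # Each tuple is one PE's local memory: (ai, bi).
    local_memories = [(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]

    # Broadcast the add instruction: every PE computes ci = ai + bi.
    # (A list comprehension stands in for the simultaneous execution.)
    c = [ai + bi for (ai, bi) in local_memories]

    print(c)  # -> [5.0, 7.0, 9.0], the whole vector sum in one "add cycle"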
SIMD array processor
• Masking schemes are used to control the status of each PE during the execution of vector instructions.
• Each PE has a flag that is set when the PE is active and reset when the PE is inactive.
• This ensures that only those PEs that need to participate are active during the execution of the instruction.
• For example, suppose that the array processor contains a set of 64 PEs. If a vector of fewer than 64 data items is to be processed, the control unit selects the proper number of PEs to be active. Vectors longer than 64 must be divided into 64-word portions by the control unit.
• SIMD processors are highly specialised computers. They are suited primarily to numerical problems that can be expressed in vector or matrix form.
• However, they are not very efficient in other types of computation or in dealing with conventional data-processing algorithms.