Unit 5: Multiprocessors

Characteristics of a multiprocessor
• A multiprocessor is an interconnection of two or more CPUs with memory and input-output equipment.
• The term "processor" in multiprocessor can mean either a CPU or an input-output processor (IOP).
• However, a system with a single CPU and one or more IOPs is usually not included in the definition of a multiprocessor system unless the IOP has computational facilities comparable to a CPU.
• A multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
• Multiprocessing improves the reliability of the system, so that a failure or error in one part has a limited effect on the rest of the system. If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor.
• A multiprocessor system provides improved performance by performing computation in parallel in one of two ways:
  • Multiple independent jobs can be made to operate in parallel.
  • A single job can be partitioned into multiple parallel tasks.

Multiple independent jobs operating in parallel
• An overall function can be partitioned into a number of tasks that each processor can handle individually.
• System tasks may be allocated to special-purpose processors whose design is optimized to perform certain types of processing efficiently.
• For example: a computer system where one processor performs the computation for an industrial process-control application while others monitor and control parameters such as temperature and flow rate.
• Another example is a computer where one processor performs high-speed floating-point mathematical computation while another takes care of routine data-processing tasks.

A single job partitioned into multiple parallel tasks
• Multiprocessing can improve performance by decomposing a program into parallel executable tasks.
• This can be achieved in one of two ways:
  • The user can explicitly declare that certain tasks of the program be executed in parallel. This must be done prior to loading the program, by specifying the parallel executable segments.
  • The other, more efficient way is to provide a compiler with multiprocessor software that can automatically detect parallelism in a user's program. The compiler checks for data dependency in the program: if one part of a program depends on data generated in another part, the part yielding the needed data must be executed first. However, two parts of a program that do not use data generated by each other can run concurrently.

Classification of multiprocessors by memory organization
• Tightly coupled (shared memory) multiprocessor
• Loosely coupled (distributed memory) multiprocessor

Tightly coupled multiprocessor: a multiprocessor with common shared memory. Each processor can also have its own local memory. Most commercial tightly coupled multiprocessors provide a cache memory with each CPU; in addition, there is a global common memory that all CPUs can access.

Loosely coupled multiprocessor: each processor has its own private local memory. The processors are tied together by a switching scheme designed to route information from one processor to another through a message-passing scheme. The processors relay programs and data to other processors in packets. A packet consists of an address, the data content, and some error-detecting code. The packets are addressed to a specific processor or taken by the first available processor, depending on the communication system used. A minimal sketch of such a packet appears below.
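The following Python sketch is a hypothetical illustration of such a packet, not something taken from the text; the field names and the choice of CRC-32 as the error-detecting code are assumptions.

```python
# Illustrative sketch only: a message-passing "packet" carrying a
# destination address, the data content, and an error-detecting code.
from dataclasses import dataclass, field
import zlib

@dataclass
class Packet:
    dest: int            # address of the destination processor
    payload: bytes       # program or data being relayed
    crc: int = field(init=False)

    def __post_init__(self):
        # error-detecting code computed over the payload (CRC-32 assumed)
        self.crc = zlib.crc32(self.payload)

    def is_intact(self) -> bool:
        return zlib.crc32(self.payload) == self.crc

pkt = Packet(dest=0b101, payload=b"task data")
assert pkt.is_intact()
```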
• Loosely coupled systems are most efficient when the interaction between tasks is minimal, whereas tightly coupled systems can tolerate a higher degree of interaction between tasks.

Figure: shared-memory and distributed-memory multiprocessor organizations.

Interconnection structures
• The interconnection between the components of a multiprocessor system can have different physical configurations, depending on the number of transfer paths available between the processors and memory in a shared-memory system, or among the processing elements in a loosely coupled system.
• These physical configurations are called interconnection structures. The structures available for establishing an interconnection network include:
  • Time-shared common bus
  • Multiport memory
  • Crossbar switch
  • Multistage switching network
  • Hypercube system

Time-shared common bus
• A common-bus multiprocessor system consists of a number of processors connected through a common path to a memory unit.
• As the figure shows, only one processor can communicate with memory or with another processor at any given time.
• Transfer operations are conducted by the processor that is in control of the bus at the time.
• Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can it address the destination unit to initiate the transfer.
• A command is issued to inform the destination unit what operation is to be performed.
• The receiving unit recognizes its address on the bus and responds to the control signals from the sender, after which the transfer is initiated.
• The system may exhibit transfer conflicts, since one common bus is shared by all processors. These conflicts must be resolved by incorporating a bus controller that establishes priorities among the requesting units.
• A single common-bus system is restricted to one transfer at a time. This means that when one processor is communicating with memory, all other processors are either busy with internal operations or must sit idle waiting for the bus.
• As a consequence, the total overall transfer rate within the system is limited by the speed of the single path.

Figure: time-shared bus with a dual-bus structure.

Multiport memory
• A multiport memory system employs separate buses between each memory module and each CPU.
• The structure with 4 CPUs and 4 memory modules (MMs) is shown in the figure.
• Each processor bus is connected to each memory module. A processor bus consists of the address, data, and control lines required to communicate with memory.
• Each memory module is said to have four ports, and each port accommodates one of the buses.
• The module must have internal control logic to determine which port will have access to memory at any given time.
• Memory access conflicts are resolved by assigning fixed priorities to each memory port.
• The priority for memory access associated with each processor may be established by the physical port position that its bus occupies in each module.
• Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority.
• The advantage of the multiport memory organization is the high transfer rate that can be achieved because of the multiple paths between processors and memory.
• The disadvantage is that it requires expensive memory control logic and a large number of cables and connectors. As a consequence, this interconnection structure is usually appropriate only for systems with a small number of processors. A sketch of the fixed-priority port logic follows.
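This is a minimal sketch, assuming the fixed physical-position priority described above (port 0, i.e. CPU 1, highest); it is not taken from any particular machine.

```python
# Illustrative sketch: fixed-priority resolution for one 4-port memory
# module. The module grants access to the lowest-numbered requesting port.
def grant_port(requests):
    """requests: list of 4 booleans; requests[i] is True if CPU i+1
    is requesting this module. Returns the granted port, or None."""
    for port, requesting in enumerate(requests):
        if requesting:
            return port      # lowest index = physically highest priority
    return None

# CPUs 2 and 4 request simultaneously; CPU 2 (port 1) wins.
print(grant_port([False, True, False, True]))   # -> 1
```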
Crossbar switch
• A crossbar switch consists of a number of crosspoints placed at the intersections between processor buses and memory module paths.
• The figure shows a crossbar switch interconnection between four CPUs and four memory modules.
• The small square at each crosspoint is a switch that determines the path from a processor to a memory module.
• Each switch point has control logic to set up the transfer path between a processor and memory.
• The switch examines the address placed on the bus to determine whether its particular module is being addressed. It also resolves multiple requests for access to the same memory module on a predetermined priority basis.

Figure: block diagram of a crossbar switch.

• The figure shows the functional design of a crossbar switch connected to one memory module.
• The circuit consists of multiplexers that select the data, address, and control lines from one CPU for communication with the memory module.
• Priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory module.
• The multiplexers are controlled by a binary code that is generated by a priority encoder within the arbitration logic.
• A crossbar switch supports simultaneous transfers involving all memory modules, because there is a separate path associated with each module.
• However, the hardware required to implement the switch can become quite large and complex.

Multistage switching network
• The basic component of a multistage network is a two-input, two-output interchange switch.
• As shown in the figure, the 2x2 switch has two inputs, labelled A and B, and two outputs, labelled 0 and 1.
• Control signals associated with the switch establish the interconnection between the input and output terminals.
• The switch has the capability to arbitrate between conflicting requests: if inputs A and B both request the same output terminal, only one of them is connected; the other is blocked.

Figure: binary tree built from 2x2 interchange switches.

• The figure shows a binary tree multistage switching network using the 2x2 switch as a building block.
• Two processors, P1 and P2, are connected through switches to 8 memory modules numbered in binary from 000 through 111.
• The path from source to destination is determined by the binary bits of the destination number: the first bit determines the switch output at the first level, the second bit specifies the output of the switch at the second level, and the third bit specifies the output of the switch at the third level.
• For example, to connect P1 to memory module 101, it is necessary to form a path from P1 to output 1 in the first-level switch, output 0 in the second-level switch, and output 1 in the third-level switch (as in the sketch below).
• Either P1 or P2 can be connected to any one of the 8 memories.
• Certain request patterns, however, cannot be satisfied simultaneously. For example, if P1 is connected to one of the destinations 000 through 011, P2 can be connected only to one of the destinations 100 through 111.
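A minimal sketch of the bit-by-bit routing rule, assuming the three-level binary tree of the figure:

```python
# Illustrative sketch: each bit of the destination address selects the
# output (0 or 1) of the switch at the corresponding level of the tree.
def route(dest: int, levels: int = 3):
    """Return the switch output taken at each level to reach `dest`."""
    bits = format(dest, f"0{levels}b")    # e.g. 5 -> "101"
    return [int(b) for b in bits]         # one output choice per level

print(route(0b101))   # [1, 0, 1]: output 1, then 0, then 1, as in the text
```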
Hypercube interconnection
• A hypercube (binary n-cube) multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube.
• Each processor forms a node of the cube. Although it is customary to refer to each node as having a processor, in effect it contains not only a CPU but also local memory and an I/O interface.
• Each processor has direct communication paths to n neighbouring processors. These paths correspond to the edges of the cube.
• There are 2^n distinct n-bit binary addresses that can be assigned to the processors.
• Each processor's address differs from that of each of its n neighbours in exactly one bit position.

Figure: hypercube structures for n = 1, 2, 3.

• The figure shows the hypercube structure for n = 1, 2, and 3.
• A one-cube has n = 1 and 2^n = 2: it contains two processors interconnected by a single path.
• A two-cube structure has n = 2 and four nodes interconnected as a square.
• A three-cube structure has 8 nodes interconnected as a cube.
• An n-cube structure has 2^n nodes with a processor residing in each node.
• Each node is assigned a binary address in such a way that the addresses of two neighbours differ in exactly one bit position.
• Routing messages through an n-cube structure may take from one to n links from a source node to a destination node.
• For example, in the three-cube structure, node 000 can communicate directly with node 001, but it must traverse at least two links to communicate with node 011 (see the sketch below).
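A minimal sketch of the link count, using the standard observation that the minimum number of links equals the number of differing address bits:

```python
# Illustrative sketch: in a hypercube, the minimum number of links between
# two nodes is the Hamming distance of (source XOR destination).
def hops(src: int, dst: int) -> int:
    return bin(src ^ dst).count("1")

print(hops(0b000, 0b001))   # 1 link: direct neighbours
print(hops(0b000, 0b011))   # 2 links, as stated in the text
```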
Inter-processor arbitration
• A computer system contains a number of buses at various levels to facilitate the transfer of information between components.
• The CPU contains a number of internal buses for transferring information between processor registers and the ALU.
• A memory bus consists of lines for transferring data, address, and read/write information.
• An I/O bus is used to transfer information to and from input and output devices.
• A bus that connects major components such as CPUs, IOPs, and memory is called a system bus.
• A processor in a shared-memory multiprocessor system requests access to common memory or other common resources through the system bus. If no other processor is currently using the bus, the requesting processor may be granted access immediately; otherwise it must wait until the bus is free.
• Furthermore, other processors may request the system bus at the same time. Arbitration must then be performed to resolve this multiple contention for the shared resource.

Priority-resolving techniques
• Hardware: serial and parallel arbitration
• Software: dynamic algorithms

Serial arbitration procedure
• The serial arbitration technique is obtained from a daisy-chain connection.
• The processors connected to the system bus are assigned priorities according to their position along the priority control line.
• The device closest to the priority line is assigned the highest priority.
• When multiple devices concurrently request the use of the bus, the device with the highest priority is granted access to it.

Figure: serial (daisy-chain) arbitration with four arbiters.

• The figure shows the daisy-chain connection of four arbiters. It is assumed that each processor has its own arbiter logic, with priority-in (PI) and priority-out (PO) lines.
• The PO of each arbiter is connected to the PI of the next-lower-priority arbiter.
• The PI of the highest-priority unit is maintained at logic 1, so the highest-priority unit in the system always receives access to the system bus when it requests it.
• The scheme takes its name from the structure of the grant line, which chains through the arbiters from highest to lowest priority. A higher-priority device passes the grant along to the next device in the chain only if it does not itself want the bus. All devices use the same line for bus requests.
• When the bus-busy line returns to its idle state, the highest-priority requesting arbiter enables the busy line, and its corresponding processor can then carry out the required bus transfer.
• Thus, the processor whose arbiter has PI = 1 and PO = 0 is the one given control of the system bus.

Parallel arbitration logic
• The parallel arbitration technique uses an external priority encoder and decoder.
• Each bus arbiter in the parallel scheme has a bus-request output line and a bus-acknowledge input line.
• Each arbiter enables its request line when its processor is requesting access to the system bus.
• A processor takes control of the bus when its acknowledge input line is enabled.
• The bus-busy line provides an orderly transfer of control, as in the daisy-chain case.

Figure: parallel arbitration logic.

• The figure shows parallel arbitration for a multiprocessor system consisting of four processors, each with its own bus arbiter.
• The request lines from the four arbiters are applied to a 4x2 priority encoder. The output of the priority encoder is a 2-bit code representing the highest-priority unit among those requesting the bus.
• The 2-bit code from the encoder output drives a 2x4 decoder, which enables the proper acknowledge line to grant bus access to the highest-priority unit.

Priority encoder
A priority encoder is a circuit that implements the priority function: if two or more inputs arrive at the same time, the input having the highest priority takes precedence. The truth table of a four-input priority encoder is shown in the figure; the x's in the table designate don't-care conditions. Input I0 has the highest priority, so regardless of the values of the other inputs, when I0 = 1 the encoder generates the output xy = 00. I1 has the next priority level: the output is 01 if I1 = 1, provided that I0 = 0, regardless of the two lower-priority inputs. The output for I2 is generated only if the higher-priority inputs are 0, and so on down the priority levels. The interrupt status bit IST is set only when one or more inputs are equal to 1. If all inputs are 0, IST is cleared to 0 and the other outputs of the encoder are not used, so they are marked with don't-care conditions; the vector address is not transferred to the CPU when IST = 0. A sketch of this behaviour appears below.
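A minimal behavioural sketch of the four-input priority encoder just described (representing the 2-bit xy output as an integer index is a simplification):

```python
# Illustrative sketch: I0 has the highest priority; the output encodes the
# highest-priority active input, and IST indicates whether any is active.
def priority_encode(inputs):
    """inputs: (I0, I1, I2, I3) as 0/1. Returns (xy, IST)."""
    for i, level in enumerate(inputs):
        if level:
            return i, 1      # xy = index of highest-priority active input
    return None, 0           # IST = 0: the xy outputs are don't-cares

print(priority_encode((0, 1, 1, 0)))   # (1, 1): I1 takes precedence over I2
print(priority_encode((0, 0, 0, 0)))   # (None, 0): no request pending
```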
Dynamic arbitration algorithms
• In a dynamic scheme, the priorities of the units can change while the system is in operation.
• Time slice: a fixed-length time slice is given sequentially to each processor, in round-robin fashion.
• Polling: the bus controller advances a unit address to identify the requesting unit. When the processor that requires access recognizes its address, it activates the bus-busy line and then accesses the bus. After a number of bus cycles, polling continues by choosing a different processor.
• LRU: the least recently used algorithm gives the highest priority to the requesting device that has not used the bus for the longest interval.
• FIFO: in the first-come, first-served scheme, requests are served in the order received; the bus controller maintains a queue for this purpose.
• Rotating daisy chain: the PO output of the last device is connected to the PI of the first one. The highest priority goes to the unit that is nearest, in chain order, to the unit that most recently accessed the bus; that unit acts as the bus controller.

Inter-processor communication
Shared-memory multiprocessor:
• In a master-slave organization, one processor, the master, always executes the operating system functions.
• In the separate operating system organization, each processor can execute the operating system routines it needs. This organization is more suitable for loosely coupled systems.
• In the distributed operating system organization, the operating system routines are distributed among the available processors, but each particular operating system function is assigned to only one processor at a time. This is also referred to as a floating operating system.

Loosely coupled multiprocessor:
• There is no shared memory for passing information; communication between processors is by means of message passing through I/O channels.
• Communication is initiated by one processor calling a procedure that resides in the memory of the processor with which it wishes to communicate.
• The communication efficiency of the inter-processor network depends on the communication routing protocol, processor speed, data-link speed, and the topology of the network.

Inter-processor synchronization
• Synchronization is an essential part of inter-processor communication; it refers to the case where the data used to communicate between processors is control information.
• It is needed to enforce the correct sequence of processes and to ensure mutually exclusive access to shared writable data.

Mutual exclusion with a semaphore (a method of providing inter-process synchronization)
• A properly functioning multiprocessor system must provide a mechanism that guarantees orderly access to shared memory and other shared resources. This is necessary to protect data from being changed simultaneously by two or more processors.
• This mechanism is termed mutual exclusion: it enables one processor to exclude, or lock out, access to a shared resource by other processors while it is in a critical section.
• A critical section is a program sequence that, once begun, must complete execution before another processor accesses the same shared resource.
• A binary variable called a semaphore is often used to indicate whether a processor is executing a critical section.
• A semaphore is a software-controlled flag stored in a memory location that all processors can access.
• When the semaphore is 1, a processor is executing a critical program and the shared memory is not available to the other processors; when it is 0, the shared memory is available to any requesting processor.
• Processors that share the same memory segment agree, by convention, not to use the segment unless the semaphore equals 0, indicating that the memory is available.
• They also agree to set the semaphore to 1 while executing a critical section and to clear it to 0 when they are finished, as sketched below.
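A minimal sketch of this convention, assuming a test-and-set style loop. On a real multiprocessor the test-and-set would be an indivisible read-modify-write bus cycle; here a `threading.Lock` stands in for that hardware primitive, and `semaphore` is the shared flag.

```python
import threading

semaphore = 0                 # 0 = shared segment free, 1 = in use
_tas = threading.Lock()       # stands in for the indivisible test-and-set

def test_and_set():
    """Atomically read the old semaphore value and set it to 1."""
    global semaphore
    with _tas:
        old, semaphore = semaphore, 1
    return old

def critical_section(work):
    global semaphore
    while test_and_set() == 1:   # busy-wait until the semaphore was 0
        pass
    try:
        work()                   # access the shared writable data
    finally:
        semaphore = 0            # leaving: clear the semaphore
```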
Parallel processing
• Parallel processing is a term used to denote a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
• Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
• The system may have two or more ALUs and be able to execute two or more instructions at the same time, or it may have two or more processors operating concurrently.
• The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time.
• Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously.
• Parallel processing is established by distributing the data among the multiple functional units.
• For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.

Figure: processor with multiple functional units.

Applications of parallel processing
• Numeric weather forecasting
• Computational aerodynamics
• Finite-element analysis
• Remote-sensing applications
• Genetic engineering
• Computer-assisted tomography
• Weapons research and defence

Types of parallel processing
• Pipeline processing
• Vector processing
• Array processors

Flynn's classification of parallel processing
• M. J. Flynn's classification considers the organization of a computer system in terms of the number of instruction streams and data streams that are manipulated simultaneously.
• Instruction stream: the sequence of instructions read from memory.
• Data stream: the operations performed on the data in the processor.
• Flynn's classification divides computers into four major groups:
  1. Single instruction stream, single data stream (SISD)
  2. Single instruction stream, multiple data stream (SIMD)
  3. Multiple instruction stream, single data stream (MISD)
  4. Multiple instruction stream, multiple data stream (MIMD)

SISD
• SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
• Instructions are executed sequentially, and the system may or may not have internal parallel processing capabilities.
• Parallel processing in this case is achieved by means of multiple functional units or by pipeline processing.

SIMD
• SIMD represents an organization that includes many processing units under the supervision of a common control unit.
• All processors receive the same instruction from the control unit but operate on different items of data.
• The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.
• A brief sketch of the SIMD idea appears after the MIMD entry below.

MISD
• The MISD structure is only of theoretical interest, since no practical system has been constructed using this organization.

MIMD
• MIMD refers to a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems can be classified in this category.
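A minimal sketch of the SIMD idea, with NumPy's vectorized operations standing in for many processing units applying one instruction to many data items (an analogy, not a description of any machine in the text):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# SISD style: one instruction applied to one data item at a time
c_sisd = np.empty_like(a)
for i in range(len(a)):
    c_sisd[i] = a[i] + b[i]

# SIMD style: a single operation applied across all elements at once
c_simd = a + b

assert (c_sisd == c_simd).all()
```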
Pipelining
• Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.
• A pipeline can be visualized as a collection of processing segments through which binary information flows.
• Each segment performs partial processing, dictated by the way the task is partitioned, and the result obtained in each segment is transferred to the next segment in the pipeline.
• The final result is obtained after the data have passed through all segments.
• The overlapping of computation is made possible by associating a register with each segment, so that each segment can operate on distinct data simultaneously.
• The simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit: the register holds the data, and the combinational circuit performs the suboperation of that segment.
• The output of the combinational circuit in a given segment is applied to the input register of the next segment.
• A clock is applied to all registers after enough time has elapsed to perform all segment activity. In this way the information flows through the pipeline one step at a time.

Example of pipelining
• Suppose that we want to perform the combined multiply and add operations with a stream of numbers:
  Ai * Bi + Ci  for i = 1, 2, 3, ..., 7
• Each suboperation can be implemented in a segment of the pipeline.
• Each segment has one or two registers and a combinational circuit. R1 through R5 are registers that receive new data with every clock pulse; the multiplier and adder are combinational circuits.
• The suboperations performed in each segment of the pipeline are as follows:
  R1 ← Ai, R2 ← Bi          (input Ai and Bi)
  R3 ← R1 * R2, R4 ← Ci     (multiply and input Ci)
  R5 ← R3 + R4              (add Ci to the product)

Figure: pipeline processing, and contents of the registers in the pipeline example.

• It takes three clock pulses to fill up the pipe and retrieve the first output from R5.
• From there on, each clock pulse produces a new output and moves the data one step down the pipeline. This continues as long as new input data flow into the system.
• When no more input data are available, the clock must continue until the last output emerges from the pipeline.
• A simulation of this three-segment pipeline is sketched at the end of this subsection.

Figure: four-segment pipeline and its space-time diagram.

Space-time diagram
The space-time diagram shows segment utilization as a function of time: the horizontal axis displays time in clock cycles, and the vertical axis gives the segment number. The diagram shows six tasks, T1 through T6, executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1 while segment 1 is busy with task T2. Continuing in this manner, the first task T1 is completed after the fourth clock cycle. From then on, the pipe completes a task every clock cycle. No matter how many segments there are in the system, once the pipeline is full it takes only one clock period to obtain each new output.
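A minimal cycle-by-cycle simulation of the three-segment multiply-add pipeline above; the sample operand values are made up for illustration.

```python
# On each clock, every register loads from the previous stage (stages are
# updated in reverse order so each reads last clock's values). Three
# operations are in flight at once; the first result appears after 3 clocks.
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
results = []
for clock in range(len(A) + 3):                 # extra clocks drain the pipe
    # Segment 3: R5 <- R3 + R4
    R5 = R3 + R4 if R3 is not None else None
    # Segment 2: R3 <- R1 * R2, R4 <- Ci (Ci delayed to align with product)
    R3 = R1 * R2 if R1 is not None else None
    R4 = C[clock - 1] if 1 <= clock <= len(C) else None
    # Segment 1: R1 <- Ai, R2 <- Bi
    R1 = A[clock] if clock < len(A) else None
    R2 = B[clock] if clock < len(B) else None
    if R5 is not None:
        results.append(R5)

print(results)   # [8, 13, 16, 17, 16, 13, 8] = Ai*Bi + Ci for each i
```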
Vector processing
• There is a class of computational problems that are beyond the capabilities of a conventional computer. These problems are characterised by the fact that they require a vast number of computations that would take a conventional computer days or even weeks to complete.
• In many science and engineering applications, the problems can be formulated in terms of vectors and matrices that lend themselves to vector processing.
• Applications of vector processing:
  • Long-range weather forecasting
  • Petroleum exploration
  • Seismic data analysis
  • Medical diagnosis
  • Aerodynamics and space-flight simulations
  • Artificial intelligence and expert systems
  • Mapping the human genome
  • Image processing

Vector operations
• A vector is an ordered set, a one-dimensional array, of floating-point numbers.
• A vector V of length n is represented as a row vector by V = [V1 V2 V3 ... Vn]; it may be represented as a column vector if the data items are listed in a column.
• A conventional sequential computer is capable of processing operands only one at a time. Consequently, operations on vectors must be broken down into single computations with subscripted variables.
• The element Vi of vector V is written as V(I), where the index I refers to a memory address or register where the number is stored.

Vector operations: example with a conventional scalar processor
• To examine the difference between a conventional scalar processor and a vector processor, consider a Fortran DO loop that adds two 100-element arrays:
  DO 20 I = 1, 100
  20 C(I) = A(I) + B(I)
• This program is implemented in machine language by a sequence of operations that reads A(I) and B(I), stores the sum in C(I), increments I, and branches back to repeat; the loop-control instructions must be fetched and executed on every one of the 100 iterations.

Vector operations and instruction format
• A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows the operation to be specified with a single vector instruction of the form
  C(1:100) = A(1:100) + B(1:100)

Figure: instruction format for a vector processor.

Matrix multiplication
• Matrix multiplication is one of the most computationally intensive operations performed in computers with vector processors.
• The multiplication of two n x n matrices consists of n^2 inner products, or n^3 multiply-add operations.
• Consider the multiplication of two 3 x 3 matrices A and B: each element of the product is the inner product of a row of A with a column of B, for example c11 = a11*b11 + a12*b21 + a13*b31.

Pipeline for calculating an inner product
• In general, an inner product consists of the sum of k product terms of the form
  C = A1B1 + A2B2 + A3B3 + ... + AkBk
• The values of A and B are either in memory or in processor registers.
• The floating-point multiplier pipeline and the floating-point adder pipeline are assumed to have 4 segments each.
• All segment registers in the multiplier and adder are initialized to 0, so the output of the adder is 0 for the first 8 cycles, until both pipes are full.
• Ai and Bi pairs are brought in and multiplied at a rate of one pair per cycle. After the first four cycles, the products begin to be added to the output of the adder; during those next four cycles, 0 is added to the products entering the adder pipeline.
• At the end of the eighth cycle, the first four products, A1B1 through A4B4, are in the four adder segments, and the next four products, A5B5 through A8B8, are in the multiplier segments.
• At the beginning of the ninth cycle, the addition A1B1 + A5B5 starts in the adder pipeline; the tenth cycle starts the addition A2B2 + A6B6, and so on.
• This pattern breaks the summation down into four sections:
  C = A1B1 + A5B5 + A9B9  + A13B13 + ...
    + A2B2 + A6B6 + A10B10 + A14B14 + ...
    + A3B3 + A7B7 + A11B11 + A15B15 + ...
    + A4B4 + A8B8 + A12B12 + A16B16 + ...

Figure: pipeline for vector processing.

Memory interleaving

Figure: multiple-module memory organization.

• Memory can be partitioned into a number of modules connected to common memory address and data buses. A memory module is a memory array together with its own address and data registers.
• Each memory array has its own address register AR and data register DR. The address registers receive information from the common address bus, and the data registers communicate with a bidirectional data bus.
• The two least significant bits of the address can be used to distinguish between four modules.
• The modular system permits one module to initiate a memory access while other modules are in the process of reading or writing a word, and each module can honour a memory request independently of the state of the other modules.
• The advantage of modular memory is that it allows the use of a technique called interleaving.
• In an interleaved memory, different sets of addresses are assigned to different memory modules. For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other.
• When the number of modules is a power of 2, the least significant bits of the address select a memory module, and the remaining bits designate the specific location to be accessed within the selected module, as sketched below.
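A minimal sketch of that address split, assuming four modules:

```python
# Illustrative sketch: address decomposition for an interleaved memory
# with a power-of-two number of modules. The low address bits pick the
# module; the remaining bits are the word address within that module.
def interleave(addr: int, modules: int = 4):
    """Return (module, offset) for `addr`; `modules` must be a power of 2."""
    module = addr & (modules - 1)                  # least significant bits
    offset = addr >> (modules.bit_length() - 1)    # remaining bits
    return module, offset

# Consecutive addresses fall in consecutive modules, so sequential
# accesses can overlap across the four modules.
for a in range(8):
    print(a, interleave(a))
```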
Array processors
• An array processor is a processor that performs computations on large arrays of data. Array processors are of two types:
  • Attached array processor
  • SIMD array processor

Attached array processor
• An attached array processor is designed as a peripheral for a conventional host computer, and its purpose is to enhance the performance of the computer by providing vector processing for complex scientific applications.
• It achieves high performance by means of parallel processing with multiple functional units.
• It includes an arithmetic unit containing one or more pipelined floating-point adders and multipliers.
• The array processor can be programmed by the user to accommodate a variety of complex arithmetic problems.

Figure: interconnection of an attached array processor to a host computer.

• The figure shows the interconnection of an attached array processor to a host computer.
• The host computer is a general-purpose commercial computer, and the attached processor is a back-end machine driven by the host.
• The array processor is connected through an input-output controller to the computer, and the computer treats it like an external interface.
• The data for the attached processor are transferred from main memory to a local memory through a high-speed bus.
• The general-purpose computer without the attached processor serves users that need conventional data processing, while the system with the attached processor satisfies the need for complex arithmetic applications.
• The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.

SIMD array processor
• A SIMD array processor is a computer with multiple processing units operating in parallel.
• The processing units are synchronized to perform the same operation under the control of a common control unit, thus providing a single instruction stream, multiple data stream (SIMD) organization.

Figure: SIMD array processor organization.

• It contains a set of identical processing elements (PEs), each having a local memory (M).
• Each PE includes an ALU, a floating-point arithmetic unit, and working registers.
• The master control unit controls the operations in the processing elements, and the main memory is used for storage of the program.
• The function of the master control unit is to decode the instructions and determine how each instruction is to be executed.
• Scalar and program-control instructions are executed directly within the master control unit.
• Vector instructions are broadcast to all PEs simultaneously. Each PE uses operands stored in its local memory; vector operands are distributed to the local memories prior to the parallel execution of the instruction.
Example of vector addition using a SIMD processor: C = A + B
• The master control unit first stores the ith components ai and bi of A and B in local memory Mi, for i = 1, 2, 3, ..., n.
• It then broadcasts the floating-point add instruction ci = ai + bi to all PEs, causing the additions to take place simultaneously.
• The components ci are stored in fixed locations in each local memory.
• This produces the desired vector sum in one add cycle.

Masking in a SIMD array processor
• Masking schemes are used to control the status of each PE during the execution of a vector instruction.
• Each PE has a flag that is set when the PE is active and reset when the PE is inactive. This ensures that only those PEs that need to participate are active during the execution of the instruction.
• For example, suppose that the array processor contains a set of 64 PEs. If a vector of fewer than 64 data items is to be processed, the control unit selects the proper number of PEs to be active; vectors longer than 64 must be divided into 64-word portions by the control unit. A sketch of this masked addition appears after this list.
• SIMD processors are highly specialised computers. They are suited primarily to numerical problems that can be expressed in vector or matrix form; they are not very efficient for other types of computation or for conventional data-processing algorithms.
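A minimal sketch of the masked vector add, with a NumPy array standing in for the PE array (8 PEs instead of 64, for brevity; the operand values are made up):

```python
import numpy as np

PES = 8                                   # small stand-in for 64 PEs
a = np.arange(PES, dtype=float)           # ai distributed to local memories
b = np.ones(PES)                          # bi distributed to local memories
c = np.zeros(PES)                         # result locations

n = 5                                     # vector length < number of PEs
mask = np.arange(PES) < n                 # flag: only the first n PEs active

# One broadcast add cycle: active PEs compute ci = ai + bi;
# inactive PEs keep their old contents.
c[mask] = a[mask] + b[mask]
print(c)   # [1. 2. 3. 4. 5. 0. 0. 0.]
```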