
1. Introduction to Parallel Computing
• Parallel computing is a subject of great interest in the computing community.
• The ever-growing size of databases and the increasing complexity of computations are putting great stress on single-processor computers.
• The entire computer science community is therefore looking for computing environments in which current computational capacity can be enhanced:
– By improving the performance of a single computer (uniprocessor system)
– By parallel processing
– The most obvious solution is the introduction of multiple processors working in tandem to solve a given problem.
– The architecture used to solve problems this way is called advanced/parallel computer architecture, the algorithms are known as parallel algorithms, and the programming of these computers is known as parallel programming.
• Parallel computing is the simultaneous execution of
the same task, split into subtasks, on multiple
processors in order to obtain results faster.
• Why do we require parallel computing?
• What are the levels of parallel processing?
• How does the flow of data occur in parallel processing?
• What is the role of parallel processing in fields like science and engineering, database queries and artificial intelligence?
Objectives
• Outline the historical facts of parallel computing.
• Explain the basic concepts of program, process, thread, concurrent execution, parallel execution and granularity.
• Explain the need for parallel computing.
• Describe the levels of parallel processing.
• Describe parallel computer classification schemes.
• Describe various applications of parallel computing.
Why Parallel Processing?
• Computation requirements are ever increasing:
– simulations, scientific prediction (earthquake),
distributed databases, weather forecasting (will it
rain tomorrow?), search engines, e-commerce,
Internet service applications, Data Center
applications, Finance (investment risk analysis), Oil
Exploration, Mining, etc.
Why Parallel Processing?
• Hardware improvements such as pipelining and superscalar execution are not scaling well and require sophisticated compiler technology to extract performance from them.
• Techniques such as vector processing work well only for certain kinds of problems.
Why Parallel Processing?
• Significant developments in networking technology are paving the way for network-based, cost-effective parallel computing.
• Parallel processing technology is mature and is being exploited commercially.
Constraints of conventional architecture: the von Neumann machine (sequential computer)
Parallelism in Uniprocessor Systems
• Parallel processing mechanisms to achieve parallelism in a uniprocessor system are:
– Multiple functional units
– Parallelism and pipelining within the CPU
– Overlapped CPU and I/O operations
– Use of a hierarchical memory system
– Multiprogramming and time sharing
Comparison between Sequential and Parallel Computers
• Sequential Computers
– Are uniprocessor systems (1 CPU)
– Can execute 1 instruction at a time
– Speed is limited
– It is quite expensive to make a single CPU faster
– Areas where they are used: colleges, labs
– Ex: Pentium PC
• Parallel Computers
– Are multiprocessor systems (many CPUs)
– Can execute several instructions at a time
– No such limitation on speed
– Less expensive, since a larger number of moderately fast processors can be used to achieve better performance
– Ex: CRAY-1, CRAY X-MP (USA) and PARAM (India)
History of Parallel Computing
• Experiments with and implementations of parallelism were started in the 1950s by IBM.
• A serious approach towards designing parallel computers started with the development of ILLIAC IV in 1964.
• The concept of pipelining was introduced in the CDC 7600 computer in 1969.
History of Parallel Computing
• In 1976, the Cray-1 was developed by Seymour Cray. The Cray-1 was a pioneering effort in the development of vector registers.
• The next generation of Cray, called the Cray X-MP, was developed in the years 1982-84. It coupled multiple processors and used a shared memory.
• In the 1980s Japan also started manufacturing high-performance supercomputers. Companies like NEC, Fujitsu and Hitachi were the main manufacturers.
Parallel computers
* Parallel computers are those systems which emphasize parallel processing.
* Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.
Fig: Division of parallel computers into pipeline computers, array processors and multiprocessor systems
 Pipelined computers perform overlapped computations to exploit temporal parallelism: successive instructions are executed in an overlapped fashion, as shown in the following figure.
 In non-pipelined computers, the execution of the first instruction must be completed before the next instruction can be issued.
Pipeline Computers
• These computers perform overlapped computations.
• The instruction cycle of a digital computer involves 4 major steps:
1. IF (Instruction Fetch)
2. ID (Instruction Decode)
3. OF (Operand Fetch)
4. EX (Execute)
Fig: Pipelined processor
Fig: Functional structure of a pipeline computer
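As a rough sketch (not part of the original slides, and assuming each of the four stages takes exactly one clock cycle), the gain from overlapping can be quantified: a k-stage pipeline finishes n instructions in about k + (n - 1) cycles instead of n * k.

# Minimal sketch comparing non-pipelined and pipelined execution
# of the 4-stage instruction cycle (IF, ID, OF, EX).
STAGES = 4                 # IF, ID, OF, EX
n_instructions = 10

non_pipelined_cycles = n_instructions * STAGES       # each instruction runs all 4 stages alone
pipelined_cycles = STAGES + (n_instructions - 1)     # stages overlap once the pipeline is full

print("Non-pipelined:", non_pipelined_cycles, "cycles")   # 40
print("Pipelined    :", pipelined_cycles, "cycles")       # 13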
Array Processor
• An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), which can operate in parallel.
• An appropriate data routing mechanism must be established among the PEs.
Multiprocessor system
This system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, databases, etc.).
PROBLEM SOLVING IN PARALLEL:
Temporal Parallelism:
• Ex: submission of electricity bills:
– Suppose there are 10,000 residents in a locality and they are supposed to submit their electricity bills in one office.
– The steps to submit a bill are as follows:
– Go to the appropriate counter to take the form to submit the bill.
– Submit the filled form along with cash.
– Get the receipt of the submitted bill.
Serial vs. Parallel
[Figure: a single queue of residents at one counter (serial) versus residents spread across multiple counters (parallel)]
Sequential execution
• Giving the application form = 5 seconds
• Accepting the filled application form, counting the cash and returning change, if required = 5 minutes, i.e., 5 × 60 = 300 seconds
• Giving the receipt = 5 seconds
• Total time taken in processing one bill = 5 + 300 + 5 = 310 seconds
Now suppose we have 3 persons sitting at three different counters, with:
i) one person giving the bill submission form,
ii) one person accepting the cash and returning change, if necessary, and
iii) one person giving the receipt.
As the three persons work at the same time, this is called temporal parallelism.
• Here, a task is broken into many subtasks, and those subtasks are executed simultaneously in the time domain. A rough comparison of the two schemes is sketched below.
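The following sketch (illustrative only; it treats each counter as a pipeline stage with the timings given above) approximates the total time with and without temporal parallelism. Note that the slowest stage limits the gain when the stages are unbalanced.

# Bill-submission example: three counters as a 3-stage pipeline.
stage_times = [5, 300, 5]          # give form, accept cash, give receipt (seconds)
bills = 10000

sequential_time = sum(stage_times) * bills                      # one person does everything
# Approximation: the slowest stage dictates steady-state throughput.
pipelined_time = sum(stage_times) + (bills - 1) * max(stage_times)

print(sequential_time)   # 3,100,000 s
print(pipelined_time)    # 3,000,010 s -> roughly one bill every 300 s in steady state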
Data Parallelism
• In data parallelism, the complete set of data is divided into multiple blocks and operations on the blocks are applied in parallel.
• Data parallelism is faster as compared to temporal parallelism.
• Here, no synchronization is required between counters (or processors).
• It is more tolerant of faults.
• The working of one person does not affect the others.
• Inter-processor communication is less. (A small sketch of this scheme follows the disadvantages below.)
Disadvantages: Data parallelism
• The task to be performed by each processor is pre-decided, i.e., the assignment of load is static.
• It should be possible to break the input task into mutually exclusive tasks.
• More space would be required for the counters. This requires multiple hardware units, which may be costly.
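A minimal data-parallel sketch (hypothetical, not from the slides): the complete set of bills is split into blocks and each worker processes its own block independently, mirroring the counters above.

# Data parallelism with a pool of worker processes.
from multiprocessing import Pool

def process_bill(bill_amount):
    # stand-in for "accept form, count cash, issue receipt"
    return bill_amount * 1.0

if __name__ == "__main__":
    bills = list(range(10000))            # the complete data set
    with Pool(processes=4) as pool:       # 4 "counters" working on disjoint blocks
        receipts = pool.map(process_bill, bills, chunksize=2500)
    print(len(receipts))                  # 10000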
PERFORMANCE EVALUATION
• The performance attributes are:
• Cycle time (T): the unit of time for all the operations of a computer system. It is the inverse of the clock rate (1/f). The cycle time is typically expressed in nanoseconds.
• Cycles Per Instruction (CPI): different instructions take different numbers of cycles for execution. CPI is the average number of cycles per instruction.
• Instruction count (Ic): the number of instructions in a program is called the instruction count. If we assume that all instructions take the same number of cycles, then the total execution time of a program
= number of instructions in the program × number of cycles required by one instruction × time of one cycle.
• Execution time T = Ic × CPI × cycle time (in seconds).
• Practically, the clock frequency of the system is specified in MHz.
• The processor speed is measured in terms of million instructions per second (MIPS).
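A worked example with illustrative numbers (not from the slides) of T = Ic * CPI * cycle time and the resulting MIPS rating:

f   = 500e6        # clock frequency: 500 MHz (assumed)
Ic  = 10_000_000   # instruction count (assumed)
CPI = 2            # average cycles per instruction (assumed)

cycle_time = 1 / f                 # 2 ns
T = Ic * CPI * cycle_time          # total execution time
mips = Ic / (T * 1e6)              # million instructions per second

print("T =", T, "s")               # T = 0.04 s
print("MIPS =", mips)              # 250.0, equivalently f / (CPI * 1e6)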
SOME ELEMENTARY
CONCEPTS
• Program
• Process
• Thread
• Concurrent and Parallel Execution
• Granularity
• Potential of Parallelism
Process
• Each process has a life cycle, which consists of creation, execution and termination phases. A process may create several new processes, which in turn may also create new processes.
Process creation requires four
actions
• Setting up the process description
• Allocating an address space
• Loading the program into the allocated
address space
• Passing the process description to the
process scheduler
– The process scheduling involves three
concepts: process state, state transition and
scheduling policy.
Thread
• A thread is a sequential flow of control within a process.
• A process can contain one or more threads.
• Threads have their own program counter and register values, but they share the memory space and other resources of the process.
• A thread is basically a lightweight process.
• Advantages:
– It takes less time to create and terminate a thread than to create and terminate a process.
– It takes less time to switch between two threads within the same process.
– Threads have less communication overhead. (A small sketch follows.)
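A minimal sketch (illustrative only) of several threads sharing the memory space of one process while each follows its own flow of control:

import threading

results = []                       # shared by all threads of this process
lock = threading.Lock()

def worker(thread_id):
    with lock:                     # protect the shared structure
        results.append("thread %d done" % thread_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait for all threads to terminate

print(results)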
• The study of concurrent and parallel execution is important for the following reasons:
– i) Some problems are most naturally solved by using a set of co-operating processes.
– ii) To reduce the execution time.
• Concurrent execution is the temporal behavior of
the N-client 1-server model .
• Parallel execution is associated with the N-client
N-server model. It allows the servicing of more
than one client at the same time as the number of
servers is more than one.
• Granularity refers to the amount of
computation done in parallel relative to the
size of the whole program.
• In parallel computing, granularity is a
qualitative measure of the ratio of
computation to communication.
Potential of Parallelism
• Some problems may be easily parallelized.
• On the other hand, there are some inherently sequential problems (e.g., computation of the Fibonacci sequence) whose parallelization is nearly impossible.
• If processes don't share an address space and we can eliminate data dependency among instructions, we can achieve a higher level of parallelism.
Speed-up
• Speed-up is a measure that indicates the extent to which a sequential program can be accelerated by parallelization; it is commonly defined as the ratio of the sequential execution time to the parallel execution time. A tiny numeric sketch follows.
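The slides do not give a formula, so the following sketch uses the common definition S = T_sequential / T_parallel with illustrative numbers only:

def speedup(t_sequential, t_parallel):
    return t_sequential / t_parallel

print(speedup(310, 300))   # bill example above: ~1.03 (unbalanced stages limit the gain)
print(speedup(400, 100))   # an ideally parallelized task on 4 processors: 4.0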
Processing Elements
Architecture
Two Eras of Computing
[Figure: timeline from 1940 to 2030 showing the sequential era followed by the parallel era; in each era, architectures appear first, then system software/compilers, then applications, then problem-solving environments (P.S.Es), each passing through R&D, commercialization and commodity phases]
Human Architecture! Growth Performance
[Figure: analogy with human growth: vertical growth dominates up to roughly age 20-25, after which growth is horizontal]
Computational Power Improvement
[Figure: computational power improvement (C.P.I.) versus number of processors, comparing a uniprocessor with a multiprocessor]
Characteristics of Parallel computer
• Parallel computers can be characterized
based on
– the data and instruction streams forming
various types of computer organizations.
– the computer structure, e.g. multiple processors
having separate memory or one shared global
memory.
– the number of instructions in a program segment, called the grain size.
TYPES OF CLASSIFICATION
1) Classification based on the instruction and
data streams
2) Classification based on the structure of
computers
3) Classification based on how the memory
is accessed
4) Classification based on grain size
Classification of Parallel Computers
• Flynn's classification, based on instruction and data streams
• The structural classification, based on different computer organizations
• Handler's classification, based on three distinct levels of the computer:
– Processor control unit (PCU), Arithmetic logic unit (ALU), Bit-level circuit (BLC)
• The classification based on grain size, which describes the sub-tasks or instructions of a program that can be executed in parallel.
FLYNN’S CLASSIFICATION
• Proposed by Michael Flynn in 1972.
• Introduced the concept of instruction and data streams for categorizing computers.
• This classification is based on instruction and data streams and on the working of the instruction cycle.
Instruction Cycle
• The instruction cycle consists of a
sequence of steps needed for the execution
of an instruction in a program
1. The control unit fetches instructions one at a time.
2. The fetched instruction is then decoded by the decoder.
3. The processor executes the decoded instruction.
4. The result of execution is temporarily stored in the Memory Buffer Register (MBR).
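As a purely illustrative sketch (the three-instruction "program", accumulator and MBR variable below are invented, not a real ISA), the fetch-decode-execute cycle can be mimicked with a small loop:

program = [("LOAD", 5), ("ADD", 3), ("STORE", 0)]   # hypothetical instruction stream
memory  = [0] * 16
acc, pc, mbr = 0, 0, 0

while pc < len(program):
    opcode, operand = program[pc]        # 1-2. fetch the instruction and decode its fields
    pc += 1
    if opcode == "LOAD":                 # 3. execute the decoded instruction
        acc = operand
    elif opcode == "ADD":
        acc += operand
    elif opcode == "STORE":
        memory[operand] = acc
    mbr = acc                            # 4. result buffered (stands in for the MBR)

print(acc, memory[0])                    # 8 8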
Instruction Stream and Data Stream
• The flow of instructions is called the instruction stream.
• The flow of operands between the processor and memory is bidirectional; this flow of operands is called the data stream.
Flynn’s Classification
• Based on the multiplicity of instruction streams and data streams observed by the CPU during program execution:
1) Single Instruction and Single Data stream (SISD)
2) Single Instruction and Multiple Data stream (SIMD)
3) Multiple Instruction and Single Data stream (MISD)
4) Multiple Instruction and Multiple Data stream (MIMD)
SISD: A Conventional Computer
[Figure: a single processor taking one instruction stream and one data input stream, producing one data output stream]
Speed is limited by the rate at which the computer can transfer information internally.
Ex: PCs, workstations
Single Instruction and Single Data stream
(SISD)
– Sequential execution of instructions is performed by one CPU containing a single processing element (PE).
– Therefore, SISD machines are conventional serial computers that process only one stream of instructions and one stream of data. Ex: Cray-1, CDC 6600, CDC 7600
The MISD Architecture
[Figure: instruction streams A, B and C feed processors A, B and C, which all operate on a single data input stream and a single data output stream]
More of an intellectual exercise than a practical configuration. Few have been built, and none are commercially available.
Multiple Instruction and Single Data
stream (MISD)
• Multiple processing elements are organized under the control of multiple control units.
• Each control unit handles one instruction stream, which is processed through its corresponding processing element.
• Each processing element processes only a single data stream at a time.
• Ex: C.mmp, built by Carnegie-Mellon University.
• All processing elements interact with the common shared memory for the organization of the single data stream.
Advantages of MISD
– MISD is suited to specialized applications such as fault-tolerant real-time computing:
• Several processors execute on the same data to produce redundant results.
• These redundant results are compared; they should be identical, otherwise the faulty unit is replaced.
• Thus MISD machines can be applied to fault-tolerant real-time computers.
SIMD Architecture
[Figure: a single instruction stream drives processors A, B and C, each operating on its own data input stream and producing its own data output stream, computing Ci <= Ai * Bi]
Ex: CRAY machine vector processing, Intel MMX (multimedia support)
Single Instruction and multiple Data
stream (SIMD)
• Multiple processing elements work under the control of a single control unit.
• There is one instruction stream and multiple data streams.
• All the processing elements of this organization receive the same instruction broadcast from the CU.
• Main memory can also be divided into modules for generating multiple data streams.
• Every processor must be allowed to complete its instruction before the next instruction is taken for execution.
• The execution of instructions is therefore synchronous. (A small sketch follows.)
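A sketch of SIMD-style data parallelism using NumPy: one operation (elementwise multiply, Ci = Ai * Bi, as in the diagram above) is applied to many data elements at once. NumPy is used here only as an illustration; on most CPUs such operations are backed by vector/SIMD instructions.

import numpy as np

A = np.array([1, 2, 3, 4])
B = np.array([10, 20, 30, 40])
C = A * B                      # one instruction, multiple data elements
print(C)                       # [ 10  40  90 160]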
SIMD Processors
• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP and CM-2, belong to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and the IBM Cell processor.
• SIMD relies on the regular structure of computations (such as those in image processing).
• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming architectures allow for an "activity mask", which determines whether a processor should participate in a computation or not.
MIMD Architecture
[Figure: instruction streams A, B and C drive processors A, B and C, each with its own data input stream and data output stream]
Unlike SISD and MISD, an MIMD computer works asynchronously.
• Shared memory (tightly coupled) MIMD
• Distributed memory (loosely coupled) MIMD
Shared Memory MIMD machine
[Figure: processors A, B and C, each connected by a memory bus to a global memory system]
 Communication: the source PE writes data to the global memory and the destination PE retrieves it.
 Easy to build; conventional OSes for SISD machines can be easily ported.
 Limitation: reliability and expandability. A memory component or any processor failure affects the whole system.
 Increasing the number of processors leads to memory contention.
Ex.: Silicon Graphics supercomputers....
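As an illustration only (not from the slides), the following sketch mimics the shared-memory style of communication: several processes update one shared counter, standing in for PEs writing to the global memory.

from multiprocessing import Process, Value, Lock

def worker(shared_counter, lock):
    for _ in range(1000):
        with lock:                       # serialize access to the shared location
            shared_counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)              # lives in memory shared by all processes
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)                 # 3000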
Distributed Memory MIMD
[Figure: processors A, B and C, each with its own local memory system (A, B, C), connected to one another by IPC channels]
 Communication: IPC (Inter-Process Communication) via a network.
 The network can be configured as a tree, mesh, cube, etc.
 Unlike shared-memory MIMD:
– High speed
– Easily/readily expandable
– Highly reliable (any CPU failure does not affect the whole system)
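A minimal message-passing sketch (illustrative, not from the slides): the processes share no memory and exchange data only through an IPC channel, here a Queue.

from multiprocessing import Process, Queue

def sender(channel):
    channel.put({"from": "A", "data": [1, 2, 3]})   # write a message to the channel

def receiver(channel):
    msg = channel.get()                             # block until a message arrives
    print("B received:", msg)

if __name__ == "__main__":
    q = Queue()                                     # the IPC channel
    pa = Process(target=sender, args=(q,))
    pb = Process(target=receiver, args=(q,))
    pa.start(); pb.start()
    pa.join(); pb.join()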
MIMD Processors
• In contrast to SIMD processors, MIMD processors
can execute different programs on different
processors.
• A variant of this, called single program multiple data
streams (SPMD) executes the same program on
different processors.
• It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and
underlying architectural support.
• Examples of such platforms include current
generation Sun Ultra Servers, SGI Origin Servers,
multiprocessor PCs, workstation clusters.
Multiple Instruction and Multiple Data stream (MIMD)
• Multiple processing elements and multiple control units are organized as in MISD.
• For handling multiple instruction streams, multiple control units are provided, and for handling multiple data streams, multiple processing elements are organized.
• The processors work on their own data with their own instructions.
• Tasks executed by different processors can start or finish at different times.
• In the real sense, the MIMD organization is said to be a parallel computer.
• All multiprocessor systems fall under this classification.
• Examples: C.mmp, Cray-2, Cray X-MP, IBM 370/168 MP, Univac 1100/80, IBM 3081/3084.
• MIMD organization is the most popular for a parallel computer.
• In the real sense, parallel computers execute instructions in MIMD mode.
SIMD-MIMD Comparison
• SIMD computers require less hardware than MIMD
computers (single control unit).
• However, since SIMD processors are specially
designed, they tend to be expensive and have long
design cycles.
• Not all applications are naturally suited to SIMD
processors.
• In contrast, platforms supporting the SPMD
paradigm can be built from inexpensive off-the-shelf
components with relatively little effort in a short
amount of time.
HANDLER’S CLASSIFICATION
• In 1977, Handler proposed an elaborate notation
for expressing the pipelining and parallelism of
computers.
• Handler's classification addresses the computer at
three distinct levels:
– Processor control unit (PCU): corresponds to the CPU
– Arithmetic logic unit (ALU): corresponds to a processing element
– Bit-level circuit (BLC): corresponds to the logic circuit
Way to describe a computer
• Computer = (p * p', a * a', b * b')
• Where p = number of PCUs
p'= number of PCUs that can be pipelined
a = number of ALUs controlled by each PCU
a'= number of ALUs that can be pipelined
b = number of bits in ALU or processing element
(PE) word
b'= number of pipeline segments on all ALUs or in
a single PE
Relationship between various elements of
the computer
• The '*' operator is used to indicate that the units are
pipelined or macro-pipelined with a stream of data running
through all the units.
• The '+' operator is used to indicate that the units are not
pipelined but work on independent streams of data.
• The 'v' operator is used to indicate that the computer
hardware can work in one of several modes.
• The '~' symbol is used to indicate a range of values for any
one of the parameters.
Ex:
• The CDC 6600 has a single main processor supported
by 10 I/O processors. One control unit coordinates one
ALU with a 60-bit word length. The ALU has 10
functional units which can be formed into a pipeline. The
10 peripheral I/O processors may work in parallel with
each other and with the CPU. Each I/O processor
contains one 12-bit ALU.
CDC 6600 I/O = (10, 1, 12)
• The description for the main processor is:
CDC 6600 main = (1, 1 * 10, 60)
• The main processor and the I/O processors can be regarded as forming a macro-pipeline, so the '*' operator is used to combine the two structures:
CDC 6600 = (I/O processors) * (central processor) = (10, 1, 12) * (1, 1 * 10, 60)
STRUCTURAL CLASSIFICATION
• A parallel computer (MIMD) can be characterized as a set of multiple processors and shared memory or memory modules communicating via an interconnection network.
• When multiprocessors communicate through global shared memory modules, the organization is called a shared memory computer or tightly coupled system.
• Shared memory multiprocessors have the
following characteristics:
– Every processor communicates through a
shared global memory
– For high speed real time processing, these
systems are preferable as their throughput is
high as compared to loosely coupled systems.
• In the tightly coupled system organization, multiple processors share a global main memory, which may have many modules.
• The processors also have access to I/O devices. The inter-communication between processors, memory and other devices is implemented through various interconnection networks.
Types of Interconnection Networks
• Processor-Memory Interconnection Network (PMIN)
– This is a switch that connects various processors to different memory modules.
• Input-Output-Processor Interconnection Network (IOPIN)
– This interconnection network is used for communication between processors and I/O channels.
• Interrupt Signal Interconnection Network (ISIN)
– When a processor wants to send an interrupt to another processor, the interrupt first goes to the ISIN, through which it is passed to the destination processor. In this way, synchronization between processors is implemented by the ISIN.
[Figure: a tightly coupled multiprocessor in which processors, memory modules and I/O channels are connected through the ISIN, PMIN and IOPIN networks]
– To reduce this delay, every processor may use a cache memory for its frequently made references.
Uniform Memory Access Model
(UMA)
• In this model, main memory is uniformly shared by all processors in the multiprocessor system, and each processor has equal access time to the shared memory.
• This model is used for time-sharing applications in a multi-user environment.
Uniform Memory Access (UMA):
• Most commonly represented today by
Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache
Coherent UMA. Cache coherent means if one
processor updates a location in shared
memory, all the other processors know about
the update. Cache coherency is
accomplished at the hardware level.
Non-Uniform Memory Access
Model (NUMA)
• In shared-memory multiprocessor systems, local memories can be connected to every processor. The collection of all local memories forms the global memory being shared.
• The global memory is thus distributed across all the processors.
• In this case, access to its own local memory is uniform for the corresponding processor, but if a reference is to the local memory of some other remote processor, then the access is not uniform.
• It depends on the location of the memory. Thus, all memory words are not accessed uniformly.
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access the memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across the link is slower
• If cache coherency is maintained, then it may also be called CC-NUMA (Cache Coherent NUMA)
Advantages:
• The global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
• The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase the traffic associated with cache/memory management.
• The programmer is responsible for synchronization constructs that ensure "correct" access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
Cache-Only Memory Access
Model (COMA)
• Shared-memory multiprocessor systems may use cache memories with every processor to reduce the execution time of an instruction. In the COMA model, the local memory attached to each processor is used only as a cache; there is no conventional main memory, and data migrates to the caches of the processors that use it.
Loosely coupled system
• When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer or loosely coupled system.
• Each processor in a loosely coupled system has a large local memory (LM), which is not shared by any other processor.
• Such systems have multiple processors, each with its own local memory and a set of I/O devices.
• This set of processor, memory and I/O devices makes up a complete computer system.
• These systems are therefore also called multi-computer systems.
• These computer systems are connected together via a message-passing interconnection network, through which processes communicate by passing messages to one another.
• Such a system is also called a distributed multi-computer system.
CLASSIFICATION BASED ON
GRAIN SIZE
• This classification is based on recognizing
the parallelism in a program to be executed
on a multiprocessor system.
• The idea is to identify the sub-tasks or
instructions in a program that can be
executed in parallel .
Factors affecting decision of
parallelism
• Number and types of processors available,
i.e. architectural features of host computer
• Memory organization
• Dependency of data, control and resources
Parallelism conditions
Data Dependency
• It refers to the situation in which two or more instructions share the same data.
• The instructions in a program can be arranged based on their data-dependency relationships, which describe how two instructions or segments are data dependent on each other.
Types of data dependencies
i) Flow Dependence : If instruction I2 follows I1 and output of
I1 becomes input of I2, then I2 is said to be flow dependent
on I1.
ii) Antidependence : When instruction I2 follows I1 such that
output of I2 overlaps with the input of I1 on the same data.
iii) Output dependence : When output of the two instructions
I1 and I2 overlap on the same data, the instructions are said
to be output dependent.
iv) I/O dependence : When read and write operations by two
instructions are invoked on the same file, it is a situation of
I/O dependence.
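The four dependency types can be illustrated with tiny hypothetical statements (the variable and file names below are invented for this example):

b, c, d, e = 1, 2, 3, 4

# i) Flow dependence: I2 reads what I1 wrote.
a = b + c        # I1
d = a * 2        # I2 (flow dependent on I1 through 'a')

# ii) Antidependence: I2 writes what I1 reads.
e = a + 1        # I1 reads 'a'
a = 7            # I2 writes 'a' (must not run before I1 has read it)

# iii) Output dependence: I1 and I2 write the same location.
x = b + c        # I1 writes 'x'
x = d - e        # I2 also writes 'x' (the final value depends on the order)

# iv) I/O dependence: two instructions read/write the same file.
with open("data.txt", "w") as f:   # I1 writes the file
    f.write("42")
with open("data.txt") as f:        # I2 reads the same file
    print(f.read())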
Control Dependence
• Instructions or segments in a program may contain control structures.
• Dependency among statements can exist in control structures as well, but the order of execution in control structures is not known before run time.
• Therefore, dependency among the instructions in control structures must be analyzed carefully.
Resource Dependence
• The parallelism between the instructions
may also be affected due to the shared
resources.
• If two instructions are using the same
shared resource then it is a resource
dependency condition
Bernstein Conditions for
Detection of Parallelism
• For instructions or blocks of instructions to execute in parallel, the instructions should be independent of each other.
• They should not be data dependent, control dependent or resource dependent on one another.
Bernstein conditions are based on
the following two sets of variables
i) The read set or input set Ri, which consists of the memory locations read by statement/instruction Ii.
ii) The write set or output set Wi, which consists of the memory locations written into by instruction Ii.
Two instructions I1 and I2 can execute in parallel (I1 || I2) if R1 ∩ W2 = ∅, W1 ∩ R2 = ∅ and W1 ∩ W2 = ∅, as sketched below.
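As an illustration (the statements and their read/write sets are written out by hand for this example), Bernstein's conditions can be checked with simple set operations:

def can_run_in_parallel(R1, W1, R2, W2):
    # I1 || I2 requires: R1 ∩ W2 = ∅, W1 ∩ R2 = ∅, W1 ∩ W2 = ∅
    return not (R1 & W2) and not (W1 & R2) and not (W1 & W2)

# I1: a = b + c        I2: d = e * f
print(can_run_in_parallel({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))   # True

# I1: a = b + c        I2: b = a * 2   (flow and anti dependence)
print(can_run_in_parallel({"b", "c"}, {"a"}, {"a"}, {"b"}))        # False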
Parallelism based on Grain size
• Grain size: Grain size or Granularity is a
measure which determines how much
computation is involved in a process.
• Grain size is determined by counting the
number of instructions in a program
segment.
1) Fine Grain: This type contains
approximately less than 20 instructions.
2) Medium Grain: This type contains
approximately less than 500 instructions.
3) Coarse Grain: This type contains
approximately greater than or equal to one
thousand instructions.
LEVELS OF PARALLEL
PROCESSING
• Instruction level
• Loop level
• Procedure level
• Program level
Instruction level
• This is the lowest level, and the degree of parallelism is highest at this level.
• Fine grain size is used at the instruction level.
• The fine grain size may vary according to the type of the program. For example, for scientific applications, the instruction-level grain size may be higher.
• As a higher degree of parallelism can be achieved at this level, the overhead for the programmer is also greater.
Loop Level
• This is another level of parallelism where iterative
loop instructions can be parallelized.
• Fine grain size is used at this level also.
• Simple loops in a program are easy to parallelize
whereas the recursive loops are difficult.
• This type of parallelism can be achieved through
the compilers .
Procedure or Sub Program Level
• This level consists of procedures, subroutines
or subprograms.
• Medium grain size is used at this level
containing some thousands of instructions in
a procedure.
• Multiprogramming is implemented at this
level.
Program Level
• It is the last level consisting of
independent programs for parallelism.
• Coarse grain size is used at this level
containing tens of thousands of
instructions.
• Time sharing is achieved at this level of
parallelism
Operating Systems for
High Performance
Computing
Operating Systems for PP
• MPP systems having thousands of processors require an OS radically different from current ones.
• Every CPU needs an OS:
– to manage its resources
– to hide its details
• Traditional OSes are heavy, complex and not suitable for MPP.
Operating System Models
• A framework that unifies the features, services and tasks performed.
• Three approaches to building an OS:
– Monolithic OS
– Layered OS
– Microkernel-based OS
• A client-server (microkernel) OS is suitable for MPP systems.
• Simplicity, flexibility and high performance are crucial for the OS.
Monolithic Operating System
[Figure: application programs in user mode calling system services provided by a single monolithic kernel running on the hardware]
 Better application performance
 Difficult to extend
Ex: MS-DOS
Layered OS
[Figure: application programs in user mode; system services, memory & I/O device management and the process scheduler layered in kernel mode above the hardware]
 Easier to enhance
 Each layer of code accesses the lower-level interface
 Lower application performance
Ex: UNIX
Traditional OS
[Figure: application programs in user mode on top of a monolithic OS in kernel mode, which only the OS designer can extend]
New trend in OS design
[Figure: application programs and OS servers in user mode, with only a small microkernel in kernel mode above the hardware]
Microkernel/Client Server OS
(for MPP Systems)
[Figure: a client application (with a thread library) and file, network and display servers run in user mode; they communicate through send/reply messages via the microkernel, which runs in kernel mode on the hardware]
A tiny OS kernel provides the basic primitives (process, memory, IPC).
Traditional services become subsystems.
OS = Microkernel + User Subsystems
Few Popular Microkernel Systems
– MACH (CMU)
– PARAS (C-DAC)
– Chorus
– QNX
– (Windows)
ADVANTAGES OF PARALLEL
COMPUTATION
• Reasons for using parallel computing:
– save time and solve larger problems
– with the increase in number of processors
working in parallel, computation time is bound
to reduce .
– Cost savings
– Overcoming memory constraints
– Limits to serial computing
APPLICATIONS OF PARALLEL
PROCESSING
• Weather forecasting
• Predicting results of chemical and nuclear reactions
• DNA structures of various species
• Design of mechanical devices
• Design of electronic circuits
• Design of complex manufacturing processes
• Accessing of large databases
• Design of oil exploration systems
• Design of web search engines, web based business
services
• Design of computer-aided diagnosis in medicine
• Development of MIS for national and multi-national
corporations
• Development of advanced graphics and virtual reality
software, particularly for the entertainment industry,
including networked video and multi-media
technologies
• Collaborative work (virtual) environments .
Scientific Applications/Image
processing
• Global atmospheric circulation,
• Blood flow circulation in the heart,
• The evolution of galaxies,
• Atomic particle movement,
• Optimization of mechanical components
Engineering Applications
• Simulations of artificial ecosystems,
• Airflow circulation over aircraft components
Database Query/Answering
Systems
• To speed up database queries we can use
Teradata computer, which employs
parallelism in processing complex queries.
AI Applications
• Search through the rules of a production system,
• Using fine-grain parallelism to search the
semantic networks
• Implementation of Genetic Algorithms,
• Neural Network processors,
• Preprocessing inputs from complex environments,
such as visual stimuli.
Mathematical Simulation and
Modeling Applications
• Parsec, a C-based simulation language for sequential and parallel execution of discrete-event simulation models.
• OMNeT++, a discrete-event simulation software development environment written in C++.
• DESMO-J, a discrete-event simulation framework in Java.
INDIA’S PARALLEL
COMPUTERS
• In India, the development and design of parallel computers started in the early 80's.
• The Centre for Development of Advanced Computing (C-DAC), set up in 1988, designed the high-speed parallel machines described below.
India’s Parallel Computer
• Salient features of the PARAM series:
• PARAM 8000, C-DAC, 1991: 256-processor parallel computer.
• PARAM 8600, C-DAC, 1994: PARAM 8000 enhanced with the Intel i860 vector microprocessor; improved software for numerical applications.
• PARAM 9000/SS, C-DAC, 1996: used SuperSPARC II processors.
• MARK Series:
– Flosolver Mark I, NAL, 1986
– Flosolver Mark II, NAL, 1988
– Flosolver Mark III, NAL, 1991
Summary/Conclusions
– In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• The problem is run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs
Uses for Parallel Computing:
• Science and Engineering: Historically, parallel
computing has been considered to be "the high end of
computing", and has been used to model difficult
problems in many areas of science and engineering:
Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter,
high pressure, fusion, photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
• Industrial and Commercial: Today, commercial
applications provide an equal or greater driving force in
the development of faster computers. These applications
require the processing of large amounts of data in
sophisticated ways. For example: Databases, data mining
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Financial and economic modelling
• Management of national and multi-national corporations
• Advanced graphics and virtual reality, particularly in the
entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments
• Why Use Parallel Computing?
– Save time and/or money: In theory, throwing more
resources at a task will shorten its time to completion,
with potential cost savings. Parallel computers can be
built from cheap, commodity components.
– Solve larger problems: Many problems are so large
and/or complex that it is impractical or impossible to
solve them on a single computer, especially given
limited computer memory. For example:
– Web search engines/databases processing millions of
transactions per second
– Provide concurrency: A single compute resource
can only do one thing at a time. Multiple computing
resources can be doing many things simultaneously.
For example, the Access Grid provides a global
collaboration network where people from around the
world can meet and conduct work "virtually".
• Use of non-local resources: Using compute resources on
a wide area network, or even the Internet when local
compute resources are scarce. For example:
– SETI@home over 1.3 million users, 3.2 million computers in
nearly every country in the world.
• Limits to serial computing: Both physical and practical
reasons pose significant constraints to simply building ever
faster serial computers:
– Transmission speeds - the speed of a serial computer is directly
dependent upon how fast data can move through hardware.
– Economic limitations - it is increasingly expensive to make a
single processor faster. Using a larger number of moderately
fast commodity processors to achieve the same (or better)
performance is less expensive.
– Current computer architectures are increasingly relying upon
hardware level parallelism to improve performance:
• Multiple execution units
• Pipelined instructions
• Multi-core
Terminologies related to PC
• Supercomputing / High Performance Computing (HPC) Using
the world's fastest and largest computers to solve large problems.
• Node A standalone "computer in a box". Usually comprised of
multiple CPUs/processors/cores. Nodes are networked together
to comprise a supercomputer.
• CPU / Socket / Processor / Core a CPU (Central Processing
Unit) was a singular execution component for a computer. Then,
multiple CPUs were incorporated into a node. Then, individual
CPUs were subdivided into multiple "cores", each being a unique
execution unit. CPUs with multiple cores are sometimes called
"sockets" - vendor dependent. The result is a node with multiple
CPUs, each containing multiple cores.
• Task A logically discrete section of computational work. A task
is typically a program or program-like set of instructions that is
executed by a processor. A parallel program consists of multiple
tasks running on multiple processors.
• Pipelining Breaking a task into steps performed by different
processor units, with inputs streaming through, much like an
assembly line; a type of parallel computing.
• Shared Memory From a strictly hardware point of view,
describes a computer architecture where all processors have
direct (usually bus based) access to common physical memory.
In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory and can directly
address and access the same logical memory locations
regardless of where the physical memory actually exists.
• Symmetric Multi-Processor (SMP) Hardware architecture
where multiple processors share a single address space
and access to all resources;
• Distributed Memory In hardware, refers to network based
memory access for physical memory that is not common. As
a programming model, tasks can only logically "see" local
machine memory and must use communications to access
memory on other machines where other tasks are executing.
• Communications Parallel tasks typically need to exchange data. There are several ways of communication, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.
• Synchronization The coordination of parallel tasks in real
time, very often associated with communications.
• Granularity In parallel computing, granularity is a qualitative
measure of the ratio of computation to communication.
• Coarse: relatively large amounts of computational work are
done between communication events
• Fine: relatively small amounts of computational work are
done between communication events
• Parallel Overhead The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
– Task start-up time
– Synchronizations
– Data communications
– Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
– Task termination time
• Massively Parallel Refers to the hardware that comprises a given parallel system having many processors. The meaning of "many" keeps increasing, but currently the largest parallel computers can be comprised of processors numbering in the hundreds of thousands.
• Embarrassingly Parallel Solving many similar, but independent, tasks simultaneously; little to no need for coordination between the tasks.
• Scalability Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include: hardware (particularly memory-CPU bandwidths and network communications) and the application algorithm.
The Future:
• During the past 20+ years, the trends
indicated by ever faster networks,
distributed systems, and multi-processor
computer architectures (even at the desktop
level) clearly show that parallelism is the
future of computing.