CS 213: Parallel Processing Architectures

Laxmi Narayan Bhuyan
http://www.cs.ucr.edu/~bhuyan
PARALLEL PROCESSING ARCHITECTURES
CS213 SYLLABUS
Winter 2008
INSTRUCTOR: L.N. Bhuyan
(http://www.engr.ucr.edu/~bhuyan/)
PHONE: (951) 827-2347
E-mail: bhuyan@cs.ucr.edu
LECTURE TIME: TR 12:40pm-2pm
PLACE: HMNSS 1502
OFFICE HOURS: W 2:00-4:00 or by appointment
References:
• John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers.
• Research papers to be made available in class
COURSE OUTLINE:
• Introduction to Parallel Processing: Flynn’s classification, SIMD and MIMD operations, shared memory vs. message passing multiprocessors, distributed shared memory
• Shared Memory Multiprocessors: SMP and CC-NUMA architectures, cache coherence protocols, consistency protocols, data pre-fetching, CC-NUMA memory management, SGI 4700 multiprocessor, chip multiprocessors, network processors (IXP and Cavium)
• Interconnection Networks: static and dynamic networks, switching techniques, Internet techniques
• Message Passing Architectures: message passing paradigms, Grid architecture, workstation clusters, user-level software
• Multiprocessor Scheduling: scheduling and mapping, Internet web servers, P2P, content-aware load balancing
PREREQUISITE: CS 203A
GRADING:
Project I – 20 points
Project II – 30 points
Test 1 – 20 points
Test 2 - 30 points
Possible Projects
• Experiments with SGI Altix 4700 Supercomputer – Algorithm design and FPGA offloading
• I/O Scheduling on SGI
• Chip Multiprocessor (CMP) – Design, analysis and simulation
• P2P – Using PlanetLab
Note: 2 students/group – Expect submission of a paper to a conference
Useful Web Addresses
• http://www.sgi.com/products/servers/altix/4000/
and http://www.sgi.com/products/rasc/
• Wisconsin Computer Architecture Page – Simulators
http://www.cs.wisc.edu/~arch/www/tools.html
• SimpleScalar – www.simplescalar.com – Look for multiprocessor extensions
• NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
Working in a cluster environment
• Beowulf Cluster – www.beowulf.org
• MPI – www-unix.mcs.anl.gov/mpi
Application Benchmarks
• http://www-flash.stanford.edu/apps/SPLASH/
Parallel Computers
• Definition: “A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.”
Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
– How large a collection?
– How powerful are processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are HW and SW primitives for programmer?
– Does it translate into performance?
Parallel Processors “Myth”
• The dream of computer architects since 1950s:
replicate processors to add performance vs. design
a faster processor
• Led to innovative organization tied to particular
programming models since
“uniprocessors can’t keep going”
– e.g., uniprocessors must eventually stop getting faster due to the
speed-of-light limit – Has it happened?
– Killer Micros! Parallelism moved to instruction level.
Microprocessor performance doubles every 1.5 years!
– In 1990s companies went out of business: Thinking
Machines, Kendall Square, ...
What Level of Parallelism?
• Bit-level parallelism: 1970 to ~1985
– 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction level parallelism (ILP):
~1985 through today
– Pipelining
– Superscalar
– VLIW
– Out-of-Order execution
– Limits to benefits of ILP?
• Process Level or Thread level parallelism;
mainstream for general purpose computing?
– Servers are parallel
– High-end desktop dual-processor PC soon??
(or just sell the socket?)
Why Multiprocessors?
1. Microprocessors as the fastest CPUs
• Collecting several much easier than redesigning 1
2. Complexity of current microprocessors
• Do we have enough ideas to sustain 2X/1.5yr?
• Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel
software (scientific apps, databases, OS)
4. Emergence of embedded and server markets
driving microprocessors in addition to desktops
• Embedded functional parallelism
• Network processors exploiting packet-level parallelism
• SMP Servers and cluster of workstations for multiple
users – Less demand for parallel computing
Amdahl’s Law and Parallel
Computers
• Amdahl’s Law (f: original fraction sequential)
Speedup = 1 / [f + (1-f)/n] = n / [1 + (n-1)f],
where n = No. of processors
• A portion f is sequential => limits parallel speedup
– Speedup <= 1/f
• Ex. What fraction sequential to get 80X speedup from 100
processors? Assume either 1 processor or 100 fully used
80 = 1 / [f + (1-f)/100] => f = 0.0025
Only 0.25% sequential! => Must be a highly
parallel program
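As a quick check of the formula, here is a minimal C sketch (my example, not from the slides) that evaluates the Amdahl speedup for the numbers used in the in-class example:

/* Minimal sketch: Amdahl's Law speedup = 1 / (f + (1-f)/n). */
#include <stdio.h>

static double amdahl_speedup(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void) {
    /* The example above: f = 0.0025, n = 100 gives roughly 80X. */
    printf("speedup = %.1f\n", amdahl_speedup(0.0025, 100));
    return 0;
}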
Popular Flynn Categories
• SISD (Single Instruction Single Data)
– Uniprocessors
• MISD (Multiple Instruction Single Data)
– ???; multiple processors on a single data stream
• SIMD (Single Instruction Multiple Data)
– Examples: Illiac-IV, CM-2
• Simple programming model
• Low overhead
• Flexibility
• All custom integrated circuits
– (Phrase reused by Intel marketing for media instructions ~ vector)
• MIMD (Multiple Instruction Multiple Data)
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
• Flexible
• Use off-the-shelf micros
• MIMD is the current winner: major design emphasis is on
<= 128-processor MIMD machines
Classification of Parallel Processors
• SIMD – EX: Illiac IV and Maspar
• MIMD - True Multiprocessors
1. Message Passing Multiprocessor – Interprocessor communication
through explicit “send” and “receive” operations.
EX: IBM SP2, Cray XD1, and Clusters
2. Shared Memory Multiprocessor – All processors share the
same address space. Interprocessor communication through load/store
operations to a shared memory.
EX: SMP Servers, SGI Origin, HP
V-Class, Cray T3E
Their advantages and disadvantages?
More Message passing Computers
• Cluster: Computers connected over a high-bandwidth local area
network (Ethernet or Myrinet), used as a parallel computer
• Network of Workstations (NOW): Homogeneous cluster –
computers of the same type
• Grid: Computers connected over wide area
network
Another Classification for MIMD
Computers
• Centralized Memory: Shared memory located at centralized
location – may consist of several interleaved modules – same distance
from any processor – Symmetric Multiprocessor (SMP) – Uniform
Memory Access (UMA)
• Distributed Memory: Memory is distributed to each
processor – improves scalability
(a) Message passing architectures – No processor can directly
access another processor’s memory
(b) Hardware Distributed Shared Memory (DSM)
Multiprocessor – Memory is distributed, but the address space is
shared – Non-Uniform Memory Access (NUMA)
(c) Software DSM – A software layer built on top of a message passing
multiprocessor to give the programmer a shared-memory view.
Data Parallel Model
• Operations can be performed in parallel on each
element of a large regular data structure, such as
an array
• 1 Control Processor (CP) broadcasts to many
PEs. The CP reads an instruction from the
control memory, decodes the instruction, and
broadcasts control signals to all PEs.
• Condition flag per PE so that a PE can skip the broadcast operation
• Data distributed in each memory
• Early 1980s VLSI => SIMD rebirth:
32 1-bit PEs + memory on a chip was the PE
• Data parallel programming languages lay out data to processors
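To illustrate the broadcast-plus-condition-flag idea above, here is a small C sketch (an assumption of mine, not course code) of one SIMD step: every PE applies the same broadcast operation to its local element unless its flag is clear.

/* Illustrative sketch of one SIMD step: the control processor broadcasts a
 * single "add" instruction; each PE applies it to its own element only when
 * its condition flag is set. NUM_PE and the array layout are assumptions. */
#define NUM_PE 32

void simd_add_step(int a[NUM_PE], const int b[NUM_PE], const int flag[NUM_PE]) {
    for (int i = 0; i < NUM_PE; i++) {   /* conceptually, all PEs run in lockstep */
        if (flag[i])                     /* flag == 0: this PE skips the instruction */
            a[i] = a[i] + b[i];
    }
}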
Data Parallel Model
• Vector processors have similar ISAs,
but no data placement restriction
• SIMD led to Data Parallel Programming
languages
• Advancing VLSI led to single chip FPUs and
whole fast µProcs (SIMD less attractive)
• SIMD programming model led to
Single Program Multiple Data (SPMD) model
– All processors execute identical program
• Data parallel programming languages still useful,
do communication all at once:
“Bulk Synchronous” phases in which all
communicate after a global barrier
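A rough sketch of the bulk-synchronous SPMD pattern described above, assuming MPI (which the course uses later): every process runs the same program, computes on its local data, then all processes hit a global barrier and communicate at once. The specific computation is an arbitrary placeholder.

/* Bulk-synchronous SPMD sketch in C with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0, global = 0.0;
    for (int phase = 0; phase < 3; phase++) {
        local += rank + 1;                        /* compute phase on local data */
        MPI_Barrier(MPI_COMM_WORLD);              /* global barrier              */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);   /* communicate all at once     */
    }
    if (rank == 0)
        printf("global sum after 3 phases: %f\n", global);

    MPI_Finalize();
    return 0;
}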
SIMD Programming – High-Performance Fortran (HPF)
• Single Program Multiple Data (SPMD)
• FORALL Construct similar to Fork:
FORALL (I=1:N)
  A(I) = B(I) + C(I)
END FORALL
• Data Mapping in HPF
1. To reduce interprocessor communication
2. Load balancing among processors
http://www.npac.syr.edu/hpfa/
http://www.crpc.rice.edu/HPFF/
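For comparison, here is roughly what a BLOCK-mapped FORALL becomes under the SPMD model, sketched in C with MPI; the array size and the plain C arrays are illustrative assumptions, not part of HPF.

/* SPMD view of a BLOCK-mapped FORALL: each process owns a contiguous
 * block of the arrays and updates only its own indices. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv) {
    static double A[N], B[N], C[N];
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* BLOCK mapping: indices [lo, hi) live on this process. */
    int chunk = (N + nprocs - 1) / nprocs;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    for (int i = lo; i < hi; i++)   /* the local part of FORALL (I=1:N) */
        A[i] = B[i] + C[i];

    MPI_Finalize();
    return 0;
}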
Major MIMD Styles
1. Centralized shared memory ("Uniform
Memory Access" time or "Shared Memory
Processor")
2. Decentralized memory (memory module
with CPU)
• Advantages: Scalability, get more memory
bandwidth, lower local memory latency
• Drawback: Longer remote communication
latency, Software model more complex
• Two types: Shared Memory and Message
passing
Symmetric Multiprocessor (SMP)
• Memory: centralized with uniform access
time (UMA) and bus interconnect
• Examples: Sun Enterprise 5000 , SGI
Challenge, Intel SystemPro
Decentralized Memory versions
1. Shared Memory with "Non Uniform
Memory Access" time (NUMA)
2. Message passing "multicomputer" with
separate address space per processor
– Can invoke software with Remote Procedure
Call (RPC)
– Often via library, such as MPI: Message
Passing Interface
– Also called "synchronous communication"
since communication causes synchronization
between 2 processes
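A minimal MPI sketch of this style of communication, assuming two processes: the blocking send/receive pair both moves the data and synchronizes sender and receiver.

/* Blocking point-to-point message passing between two MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}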
Distributed Directory MPs
Communication Models
• Shared Memory
– Processors communicate with shared address space
– Easy on small-scale machines
– Advantages:
• Model of choice for uniprocessors, small-scale MPs
• Ease of programming
• Lower latency
• Easier to use hardware-controlled caching
• Message passing
– Processors have private memories,
communicate via messages
– Advantages:
• Less hardware, easier to design
• Good scalability
• Focuses attention on costly non-local operations
• Virtual Shared Memory (VSM)
Shared Address/Memory
Multiprocessor Model
• Communicate via Load and Store
– Oldest and most popular model
• Based on timesharing: processes on
multiple processors vs. sharing single
processor
• process: a virtual address space
and ~ 1 thread of control
– Multiple processes can overlap (share), but
ALL threads share a process address space
• Writes to shared address space by one
thread are visible to reads of other threads
– Usual model: share code, private stack, some
shared heap, some private heap
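A minimal pthreads sketch (my example, not from the slides) of this model: two threads of one process communicate through ordinary loads and stores to the shared address space, with the join acting as the synchronization point.

/* Shared address space communication via plain load/store. */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;          /* lives in the shared address space */

static void *writer(void *arg) {
    (void)arg;
    shared_counter = 100;               /* communicate via a plain store     */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);              /* join orders the store before the load */
    printf("main thread reads %d\n", shared_counter);  /* communicate via a load */
    return 0;
}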