Week 8 PowerPoint Slides

Multiprocessors and Multi-computers
• Multi-computers
– Distributed address space accessible by local processors
– Requires message passing
– Programming tends to be more difficult
• Multiprocessors
– Single address space accessible by all processors
– Simultaneous access to shared variables can produce inconsistent results
– Generally programming is more convenient
– Doesn’t scale to more than about sixteen processors
Shared Memory Hardware
[Diagrams: processors connected to memory modules via a bus configuration and via a crossbar switch configuration]
Cache Coherence
Significantly impacts performance
• Cache Coherence Protocol
– Write-Update: All caches are immediately updated with the altered data
– Write-Invalidate: Altered data is invalidated in all caches; updates take place only if subsequently referenced
• False Sharing: Cache updates take place because multiple processes access the same cache block but not the same locations (see the sketch below)
[Diagram: variables x and y occupy the same cache block in memory; Processor 1 references x and Processor 2 references y, yet each processor's cache holds the whole block]
Note: Significant because each processor has a local cache
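A minimal sketch of false sharing using the pthreads API, assuming a cache-line size of roughly 64 bytes; the struct, counter values, and thread names are illustrative, not part of the slides.

#include <pthread.h>
#include <stdio.h>

/* Two counters that land in the same cache block: every update by one
   thread invalidates the other processor's cached copy of the block even
   though the threads never touch the same location.  Inserting ~64 bytes
   of padding between a and b (an assumed line size) removes the effect. */
struct { long a; long b; } shared_block;

void *bump_a(void *arg) {                  /* Processor 1 touches only a */
    for (long i = 0; i < 100000000; i++) shared_block.a++;
    return NULL;
}

void *bump_b(void *arg) {                  /* Processor 2 touches only b */
    for (long i = 0; i < 100000000; i++) shared_block.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_block.a, shared_block.b);
    return 0;
}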
Shared Memory Access
• Critical Section
– A section of code that needs to be protected from simultaneous access
• Mutual Exclusion
– The mechanism used to enforce a critical section
– Locks
– Semaphores
– Monitors
– Condition Variables
[Diagram: Process 1 assigns =1 and Process 2 assigns =2 to a shared variable x]
Sequential Consistency
Formally defined by Lamport (1979):
• A multiprocessor result is sequentially consistent if:
– The operations of each individual processor occur in the sequence specified by its program
– The overall output matches some sequential ordering of the operations of all the processors
• Summary: Arbitrary interleaving of instructions does not affect the output generated
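A small illustration of the definition, assuming pthreads and that the two routines run on different processors; the variable names and the printed result are illustrative.

#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;     /* shared variables            */
int r1, r2;           /* what each "processor" read  */

void *p1(void *arg) { x = 1; r1 = y; return NULL; }   /* program order: write x, read y */
void *p2(void *arg) { y = 1; r2 = x; return NULL; }   /* program order: write y, read x */

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Sequential consistency allows (r1,r2) = (0,1), (1,0) or (1,1):
       every interleaving that preserves each thread's program order
       yields one of these.  (0,0) would match no interleaving at all. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}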
Deadlock
Processes permanently blocked waiting for needed resources
[Diagram: circular wait, with processes P1 … Pn each holding one resource R1 … Rn while requesting the next]
• Necessary Conditions
– Circular Wait
– Limited Resource
– Non-preemptive
– Hold and Wait
Deadly Embrace
[Diagram: two-process deadlock, with P1 holding R1 while requesting R2, and P2 holding R2 while requesting R1]
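A minimal sketch of the two-process deadly embrace, rendered here with two pthreads threads and two mutexes; the names and the sleep used to force the interleaving are illustrative.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;   /* resource R1 */
pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;   /* resource R2 */

void *proc1(void *arg) {
    pthread_mutex_lock(&r1);     /* hold R1 ...                        */
    sleep(1);                    /* ... long enough for P2 to grab R2  */
    pthread_mutex_lock(&r2);     /* ... then wait forever for R2       */
    pthread_mutex_unlock(&r2);
    pthread_mutex_unlock(&r1);
    return NULL;
}

void *proc2(void *arg) {
    pthread_mutex_lock(&r2);     /* hold R2 ...                        */
    sleep(1);
    pthread_mutex_lock(&r1);     /* ... then wait forever for R1       */
    pthread_mutex_unlock(&r1);
    pthread_mutex_unlock(&r2);
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, proc1, NULL);
    pthread_create(&p2, NULL, proc2, NULL);
    pthread_join(p1, NULL);      /* never returns: circular wait, hold-and-wait, */
    pthread_join(p2, NULL);      /* non-preemptive locks on a limited resource   */
    return 0;
}

Acquiring the two locks in the same order in both threads removes the circular wait and, with it, the deadlock.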
Locks
Locks are the simplest mutual exclusion mechanism
Normally, these are provided by operating system calls
• Single bit variable: 1=locked, 0=unlocked
“Enter door and lock the door at entry”
• Spin locks (busy wait locks)
– while (lock==1) spin(); // Normally involves hardware support
lock = 1;
// Critical section
lock = 0;
• Advantages: Simple and easy to understand
• Disadvantages
– Poor use of the CPU if process does not block while waiting
– It’s easy to skip the lock=0 statement
• Examples: Pthreads and OpenMP provide OS abstractions
Note: The while and lock setting must be atomic
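One way to make the test and the set atomic is a hardware test-and-set instruction; a minimal sketch using C11 atomics follows (the function and flag names are illustrative).

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear = unlocked */

void lock(void) {
    /* Atomically set the flag and return its previous value;
       spin (busy wait) while another thread already holds it. */
    while (atomic_flag_test_and_set(&lock_flag))
        ;                                   /* spin */
}

void unlock(void) {
    atomic_flag_clear(&lock_flag);          /* the "lock = 0" step */
}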
Semaphores
• Limits concurrent access
• An integer variable, s, controls
the mechanism
• Operations
– P operation (passeren, Dutch for "to pass"):
s--;
while (s<0) wait();
// Critical section code
– V operation (vrijgeven, Dutch for "to release"):
s++;
if (s<=0)
unblock a waiting process;
– Typical use: p(s); /* Critical section */ v(s);
• Notes
– Set s=1 initially for s to be a binary semaphore, which acts like a lock
– Set s=k>1 initially if k simultaneous entries are possible
– Set s=k<=0 for consumer processes waiting to consume data produced
• Disadvantage: It's easy to skip the V operation
• Example: the UNIX OS
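A minimal sketch of the P/V pattern with POSIX semaphores, where sem_wait plays the role of P and sem_post the role of V; the counter, thread count, and initial value are illustrative.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t s;                   /* the semaphore                         */
int shared_count = 0;      /* variable protected by the semaphore   */

void *worker(void *arg) {
    sem_wait(&s);          /* P operation: decrement, block if needed */
    shared_count++;        /* critical section                        */
    sem_post(&s);          /* V operation: increment, wake a waiter   */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    sem_init(&s, 0, 1);    /* initial value 1: binary semaphore, acts like a lock */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("count = %d\n", shared_count);
    sem_destroy(&s);
    return 0;
}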
Monitors
• A Class mechanism that limits access to a shared resource
public class doIt
{ public doIt() { /* constructor logic */ }
  public synchronized void critMethod() throws InterruptedException
  { wait();      // wait until another thread signals
    notify();
  }
}
• Advantage: Most natural mutual exclusive mechanism
• Disadvantage: Requires a language that supports the construct
• Examples: Java, ADA, Modula II
Condition Variables
Mechanism to guarantee a global condition before critical section entry
• Advantages:
– Reduces the overhead of checking whether a global variable has reached some value
– Avoids having to frequently "poll" the global variable
• Disadvantage: It's easy to skip the unlock operations
• Example: Pthreads
• Notes:
– wait() unlocks and relocks the mutex automatically
– Threads must already be waiting for a signal when it is thrown
Example
• Thread 1
lock(mutex)
while (c != VALUE)
wait(condVar, mutex)
// Critical section
unlock(mutex);
• Thread 2
if (c == VALUE)
signal(condVar)
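A minimal pthreads rendering of the example above; mutex, condVar, c, and VALUE are the slide's placeholders, fleshed out here only for illustration.

#include <pthread.h>
#include <stdio.h>

#define VALUE 10

pthread_mutex_t mutex   = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  condVar = PTHREAD_COND_INITIALIZER;
int c = 0;                                   /* the shared (global) variable */

void *thread1(void *arg) {                   /* waits until c reaches VALUE */
    pthread_mutex_lock(&mutex);
    while (c != VALUE)
        pthread_cond_wait(&condVar, &mutex); /* unlocks mutex while blocked, relocks on wakeup */
    /* Critical section: c == VALUE is guaranteed here */
    printf("c reached %d\n", c);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void *thread2(void *arg) {                   /* sets c and signals the waiter */
    pthread_mutex_lock(&mutex);
    c = VALUE;
    pthread_cond_signal(&condVar);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}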
Shared Memory Programming Alternatives
• Heavyweight processes
• Modified syntax of an existing language (HP Fortran)
• Programming language designed for parallel processing (ADA)
• Compiler extensions to specify parallel execution (OpenMP)
• Thread programming standard: Java Threads and pthreads
Threads
Definition: Path of execution through a process
• Heavyweight processes (UNIX fork, wait, waitpid, shmat, shmdt)
– Disadvantage: time and memory expensive
– Advantage: A blocked process doesn’t block the other processes
• Lightweight threads (pthreads library)
– Each thread needs only its own stack space and instruction counter
– "Thread safe" programming required to guarantee consistent results
• Pthreads
– Threads can be spawned and started by other threads
– They can run independently (detached from their parent thread) or require joins for termination
– Formation of thread pools is possible
– Threads communicate through signals
– Processing order is indeterminate
Forks and Joins
General thread flow of control (illustrated with UNIX heavyweight processes)
pid_t pid = fork();
if (pid == 0)
{ /* Do spawned (child) code */ }
else
{ /* Do spawning (parent) code */ }
if (pid == 0) exit(0); else wait(NULL);
Note: Detached processes run independently from their parent without joins
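The same fork-join structure with lightweight pthreads calls; a minimal sketch, with the worker function name chosen only for illustration.

#include <pthread.h>

void *spawned(void *arg) {
    /* Do spawned thread code */
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, spawned, NULL);   /* "fork" the thread   */
    /* Do spawning thread code */
    pthread_join(tid, NULL);                     /* "join": wait for it */
    /* pthread_detach(tid) in place of the join would let it run detached */
    return 0;
}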
Processes and Threads
[Diagram: a single-thread process versus a dual-thread process; code, heap, resources, and listeners are shared, while each thread has its own stack and instruction pointer (IP)]
Notes:
• Threads can be three orders of magnitude faster than processes
• Thread safe library routines can be used by multiple concurrent threads
• Synchronization uses shared variables
Example Program (summing numbers)
Heavyweight UNIX processes (Section 8.7.1)
Pseudo code
Create semaphores
Allocate shared memory and attach shared memory
Load array with numbers
Fork child processes
IF Parent THEN sum parent section
ELSE sum child section
P(semaphore) Add to global sum V(semaphore)
IF (child) terminate ELSE join
Print results
Release semaphores, detach and release shared memory
Note: The Java and pthread version require about half the code
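A condensed sketch of the pseudocode, using POSIX shared memory (anonymous mmap) and a process-shared POSIX semaphore rather than the System V shmat/semaphore calls of Section 8.7.1; the array size and the two-way parent/child split are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define N 1000

int main(void) {
    /* Allocate shared memory for the array, the global sum, and the semaphore */
    int  *a   = mmap(NULL, N * sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    long *sum = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_t *sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(sem, 1, 1);                    /* pshared = 1: shared across processes */

    for (int i = 0; i < N; i++) a[i] = i;   /* load array with numbers */
    *sum = 0;

    pid_t pid = fork();                     /* fork the child process     */
    int lo = (pid == 0) ? N / 2 : 0;        /* child sums the upper half, */
    int hi = (pid == 0) ? N : N / 2;        /* parent sums the lower half */

    long part = 0;
    for (int i = lo; i < hi; i++) part += a[i];

    sem_wait(sem);                          /* P(semaphore)      */
    *sum += part;                           /* add to global sum */
    sem_post(sem);                          /* V(semaphore)      */

    if (pid == 0) exit(0);                  /* child terminates  */
    wait(NULL);                             /* parent joins      */
    printf("sum = %ld\n", *sum);            /* print results     */
    sem_destroy(sem);
    return 0;
}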
Modify Existing Language Syntax
Example Constructs
• Declaration of a shared memory variable
shared int x;
• Specify statements to execute concurrently
par { s1(); s2(); s3(); … sn(); }
• Iterations assigned to different processors
forall (i=0; i<n; i++) { //code }
• Examples: High Performance Fortran and C
Compiler Optimizations
• The following works because the statements are independent
forall (i = 0; i < P; i++) a[i] = 0;
• Bernstein's conditions
– Outputs from one processor cannot be inputs to another
– Outputs from the processors cannot overlap
• Example: a = x + y; b = x + z; are okay to execute
simultaneously
Java Threads
• Instantiate and run a thread
ThreadClass t = new ThreadClass();
t.start();
• Thread class
class ThreadClass extends Thread
{ public ThreadClass() { /* constructor logic */ }
  public void run()
  { while (true)
    { // yield or sleep periodically
      // thread code executed here
    } } }
Pthreads
IEEE POSIX 1003.1c 1995: UNIX-based C standardized API
Advantages
• Industry standardized interface which replaces vendor proprietary APIs
• Thread creation, synchronization, and context switching are implemented in user space without kernel intervention, which is inherently more efficient than kernel-based thread operations
• User-level implementation provides the flexibility to choose a scheduler that best suits the application, independent of the kernel scheduler
Drawbacks
• Poor locality limits performance when accessing shared data across processors
• The Pthreads scheduler hasn't proven suited to managing large numbers of threads
• Shared memory multithreaded programs typically follow the SPMD model
• Most parallel programs are still coarse-grain in design
Performance Comparisons
Pthreads versus Kernel Threads
Real: wall clock time (actual elapsed time)
User: time spent in user mode
Sys: time spent in the kernel within the process
Compiler Extensions (OpenMP)
• Extensions for C/C++, Fortran, and Java (JOMP)
• Consists of: Compiler directives, library routines and
environment variables
• Recognized industry standard developed in the late 1990s
• Designed for shared memory programming
• Uses the fork-join model, but with threads
• Parallel sections of code are executed by "teams of threads"
• General Syntax
– C: #pragma omp <directive>
– JOMP: //omp <directive>
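A minimal OpenMP sketch using the C directive syntax above; the array, its size, and the reduction clause are illustrative (compile with the compiler's OpenMP flag, e.g. -fopenmp).

#include <stdio.h>

#define N 1000

int main(void) {
    int a[N];
    long sum = 0;
    for (int i = 0; i < N; i++) a[i] = i;

    /* The directive forks a team of threads; the iterations are divided
       among them and the partial sums are combined by the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Implicit join (barrier) at the end of the parallel region */
    printf("sum = %ld\n", sum);
    return 0;
}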