
inst.eecs.berkeley.edu/~cs61c
UC Berkeley CS61C : Machine Structures
Lecture 39 – Software Parallel Computing
2006-12-04
Thanks to Prof. Demmel
for his CS267 slides & Andy Carle for 1st CS61C draft
www.cs.berkeley.edu/~demmel/cs267_Spr05/
Lecturer SOE Dan Garcia
www.cs.berkeley.edu/~ddgarcia
Five-straight big games! 
This one was supposed to be easy!
Stanford showed up, and so did the winds as Cal
struggled to beat SU, a 1-10 team and the weakest
opponent all year. Next game is 2006-12-28 in the
Holiday Bowl in San Diego vs 9-3 Texas A&M.
calbears.cstv.com/sports/m-footbl/recaps/120206aaa.html
Outline
• Large, important problems require powerful computers
• Why powerful computers must be distributed or parallel computers
• What are these?
• Principles of parallel computing performance
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
• Do theory or paper design
• Perform experiments or build system
• Limitations:
  • Too difficult -- build large wind tunnels
  • Too expensive -- build a throw-away passenger jet
  • Too slow -- wait for climate or galactic evolution
  • Too dangerous -- weapons, drug design, climate models
• Computational science paradigm:
• Use high performance computer (HPC) systems to
simulate the phenomenon
• Based on physical laws & efficient numerical methods
Example Applications
• Science
  • Global climate modeling
  • Biology: genomics; protein folding; drug design; malaria simulations
  • Astrophysical modeling
  • Computational Chemistry, Material Sciences and Nanosciences
  • SETI@Home: Search for Extra-Terrestrial Intelligence
• Engineering
  • Semiconductor design
  • Earthquake and structural modeling
  • Fluid dynamics (airplane design)
  • Combustion (engine design)
  • Crash simulation
  • Computational Game Theory (e.g., Chess Databases)
• Business
• Rendering computer-generated imagery (CGI), à la Pixar and ILM
• Financial and economic modeling
• Transaction processing, web services and search engines
• Defense
• Nuclear weapons -- test by simulations
• Cryptography
Performance Requirements
• Terminology
• FLOP – FLoating point OPeration
• Flops/second – standard metric for expressing the computing power of a system
• Example: Global Climate Modeling
  • Divide the world into a grid (e.g. 10 km spacing)
  • Solve fluid dynamics equations to determine what the air has done at that point every minute
  ⇒ Requires about 100 Flops per grid point per minute
• This is an extremely simplified view of how the atmosphere works; to be maximally effective you need to simulate many additional systems on a much finer grid
www.epm.ornl.gov/chammp/chammp.html
(Figure: High Resolution Climate Modeling on NERSC-3 – P. Duffy et al., LLNL)
Performance Requirements (2)
• Computational Requirements
  • To keep up with real time (i.e. simulate one minute per wall-clock minute) ⇒ 8 Gflops/sec
  • Weather Prediction (7 days in 24 hours) ⇒ 56 Gflops/sec
  • Climate Prediction (50 years in 30 days) ⇒ 4.8 Tflops/sec
  • Climate Prediction Experimentation (50 yrs in 12 hrs) ⇒ 288 Tflops/sec
• Perspective
  • Pentium 4 1.4GHz, 1GB RAM, 4x100MHz FSB ⇒ ~0.3 Gflops/sec, effective
  • Climate Prediction would take ~1233 years
Reference: http://www.tc.cornell.edu/~lifka/Papers/SC2001.pdf
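The rates above follow from scaling the 8 Gflops/sec real-time baseline by the ratio of simulated time to wall-clock time. A minimal C check (a sketch; the only assumption beyond the slide's numbers is the year-length convention used for rounding):

/* Sanity check: scale the 8 Gflop/s real-time rate by
   (simulated time) / (wall-clock time) for each scenario. */
#include <stdio.h>

int main(void) {
    double realtime    = 8.0;                         /* Gflop/s, from the slide */
    double weather     = 7.0;                         /* 7 days in 1 day          */
    double climate     = 50.0 * 365.0 / 30.0;         /* 50 years in 30 days      */
    double climate_exp = 50.0 * 365.0 * 24.0 / 12.0;  /* 50 years in 12 hours     */

    printf("Weather prediction:      %.0f Gflop/s\n", weather * realtime);
    printf("Climate prediction:      %.1f Tflop/s\n", climate * realtime / 1000.0);
    printf("Climate experimentation: %.0f Tflop/s\n", climate_exp * realtime / 1000.0);
    return 0;
}

Its output is 56 Gflop/s, ~4.9 Tflop/s, and ~292 Tflop/s, matching the slide's figures to within rounding of the year length.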
What Can We Do?
• Wait for our machines to get faster?
• Moore’s law tells us things are getting better; why not stall for the moment?
• Heat issues ⇒ Moore on last legs!
• Many believe so … thus push for multi-core (Fri)!
• Prohibitive costs: Rock’s law
• Cost of building a semiconductor chip fabrication plant capable of producing chips in line w/Moore’s law doubles every four years
• In 2003, it cost $3 BILLION
Upcoming Calendar
Week #15 (this week):
• Mon – Parallel Computing in Software (this lecture)
• Wed – Parallel Computing in Hardware (Scott)
• Thu Lab – I/O Networking & 61C Feedback Survey
• Fri – LAST CLASS: Summary, Review, & HKN Evals

Week #16:
• Sun 2pm – Review, 10 Evans
• FINAL EXAM: Thu 12-14 @ 12:30pm-3:30pm, 234 Hearst Gym

Final exam: same rules as the Midterm, except you get 2 double-sided handwritten review sheets (1 from your midterm, 1 new one) + green sheet. [Don’t bring backpacks]
Alternatively, many CPUs at once!
• Distributed computing
• Many computers (either homogeneous or heterogeneous) that get “work units” from a “central dispatcher” and return them when they’re done to get more
• “{ Grid, Cluster } Computing”
• Parallel Computing
• Multiple processors “all in one box”
executing identical code on different
parts of problem to arrive at a unified
(meaningful) solution
• “Supercomputing”
Distributed Computing
• Let’s “tie together” many disparate
machines into one compute cluster
• These could all be the same (easier) or
very different machines (harder)
• Common themes
• “Dispatcher” hands jobs & collects results
• “Workers” (get, process, return) until done
• Examples
• SETI@Home, BOINC, Render farms
• Google clusters running MapReduce
Google’s MapReduce
• Remember CS61A?
(reduce + (map square '(1 2 3)))
→ (reduce + '(1 4 9))
→ 14
• We told you “the beauty of pure functional
programming is that it’s easily parallelizable”
• Do you see how you could parallelize this?
• What if the reduce function argument were associative, would that help? (see the sketch after this slide)
• Imagine 10,000 machines ready to help you compute
anything you could cast as a MapReduce problem!
• This is the abstraction Google is famous for authoring
(but their reduce is not exactly the same as CS61A’s reduce)
• It hides LOTS of difficulty of writing parallel code!
• The system takes care of load balancing, dead machines, etc
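To make the associativity point concrete, here is a minimal C sketch (not Google's API; the chunk count and the square/sum operations are illustrative assumptions) that maps a function over independent chunks of an array and then combines the per-chunk partial reductions, which is exactly the structure that lets chunks run on different machines:

/* Sketch: because + is associative, (reduce + (map square lst)) can be
   computed as independent per-chunk reductions that are then combined.
   Each chunk could run on a different processor; here they run in a loop. */
#include <stdio.h>

#define N 12
#define CHUNKS 4

static int square(int x) { return x * x; }

int main(void) {
    int data[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int partial[CHUNKS];

    /* "map" + local "reduce" on each chunk (the parallelizable part) */
    for (int c = 0; c < CHUNKS; c++) {
        int sum = 0;
        for (int i = c * (N / CHUNKS); i < (c + 1) * (N / CHUNKS); i++)
            sum += square(data[i]);
        partial[c] = sum;
    }

    /* combine the partial results (cheap; relies on associativity of +) */
    int total = 0;
    for (int c = 0; c < CHUNKS; c++)
        total += partial[c];

    printf("sum of squares = %d\n", total);   /* 650 */
    return 0;
}

Because + is associative, the final combine step does not care how the data was split, so the per-chunk loops could be farmed out to 10,000 machines without changing the answer.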
MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) → list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce(out_key, list(intermediate_value)) → list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
code.google.com/edu/parallel/mapreduce-tutorial.html
Example: Word Occurrences in Files
map(String input_key, String input_value):
  // input_key : document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key         : a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
• This is just an example
• Real code at Google is more complicated and in C++
• Real code for Hadoop (open-source MapReduce) is in Java
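For something you can actually compile, here is a small single-process C simulation of the pseudocode above (hypothetical helper names, not the Google or Hadoop API): map emits (word, 1) pairs into an intermediate list, and reduce sums the counts for each distinct key, using the same words as the execution example on the next slide.

/* Sequential simulation of the word-count MapReduce pseudocode above.
   Not the Google/Hadoop API; a single-process sketch for illustration. */
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 256
#define MAX_WORD  32

struct pair { char word[MAX_WORD]; int count; };
static struct pair inter[MAX_PAIRS];   /* intermediate (word, 1) pairs */
static int n_inter = 0;

/* map: emit (w, 1) for every word w in the document contents */
static void map(const char *input_value) {
    char buf[256];
    strncpy(buf, input_value, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w != NULL; w = strtok(NULL, " ")) {
        strncpy(inter[n_inter].word, w, MAX_WORD - 1);
        inter[n_inter].word[MAX_WORD - 1] = '\0';
        inter[n_inter].count = 1;
        n_inter++;
    }
}

/* reduce: sum all intermediate counts emitted for one key */
static int reduce(const char *output_key) {
    int result = 0;
    for (int i = 0; i < n_inter; i++)
        if (strcmp(inter[i].word, output_key) == 0)
            result += inter[i].count;
    return result;
}

int main(void) {
    const char *docs[] = {"ah ah er", "ah", "if or", "or uh", "or", "ah if"};
    int ndocs = sizeof docs / sizeof docs[0];

    for (int d = 0; d < ndocs; d++)        /* map phase over all documents */
        map(docs[d]);

    for (int i = 0; i < n_inter; i++) {    /* reduce phase: once per distinct key */
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (strcmp(inter[j].word, inter[i].word) == 0) { seen = 1; break; }
        if (!seen)
            printf("%s: %d\n", inter[i].word, reduce(inter[i].word));
    }
    return 0;
}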
Example: MapReduce Execution
• Input: 7 small files containing, in total, the words "ah ah er", "ah", "if or", "or uh", "or", "ah if"
• map emits one (word, "1") pair per word:
  ah:1 ah:1 er:1 ah:1 if:1 or:1 or:1 uh:1 or:1 ah:1 if:1
• The runtime groups intermediate values by key:
  ah:1,1,1,1   er:1   if:1,1   or:1,1,1   uh:1
• reduce sums each group:
  (ah) 4   (er) 1   (if) 2   (or) 3   (uh) 1
How Do I Experiment w/MapReduce?
1. Work for Google. ;-)
2. Use the Open Source versions
• Hadoop: map/reduce & distributed file system: lucene.apache.org/hadoop/
• Nutch: crawler, parsers, index: lucene.apache.org/nutch/
• Lucene Java: text search engine library: lucene.apache.org/java/docs/
3. Wait until 2007Fa
• (You heard it here 1st!) Google will be giving us:
  • Single rack: 40 dual-proc Xeon, 4GB RAM, 600GB disk
  • Also available as a VMware image
• We’re developing bindings for Scheme for 61A
• It will be available to students & researchers
Recent History of Parallel Computing
• Parallel Computing as a field exploded in popularity in the mid-1990s
• This resulted in an “arms race” between universities, research labs,
and governments to have the fastest supercomputer in the world
• LINPACK (solving a dense system of linear equations) is the benchmark
Source: top500.org
Current Champions (Nov 2006)
BlueGene/L – eServer Blue Gene Solution
IBM DOE / NNSA / LLNL
Livermore, CA, United States
65,536 dual-processors, 280.6 Tflops/s
0.7 GHz PowerPC 440, IBM
Red Storm – Sandia / Cray Red Storm
NNSA / Sandia National Laboratories
Albuquerque, NM, United States
26,544 Processors, 101.4 Tflops/s
2.4 GHz dual-core Opteron, Cray Inc.
BGW - eServer Blue Gene Solution
IBM TJ Watson Research Center
Yorktown Heights, NY, United States
40,960 Processors, 91.3 Tflops/s
0.7 GHz PowerPC 440, IBM
Source: top500.org
Synchronization (1)
• How do processors communicate with
each other?
• How do processors know when to
communicate with each other?
• How do processors know which other
processor has the information they
need?
• When you are done computing, which
processor, or processors, have the
answer?
Synchronization (2)
• Some of the logistical complexity of
these operations is reduced by
standard communication frameworks
• Message Passing Interface (MPI) (a minimal sketch follows after this slide)
• Sorting out the issue of who holds
what data can be made easier with the
use of explicitly parallel languages
• Unified Parallel C (UPC)
• Titanium (Parallel Java Variant)
• Even with these tools, much of the
skill and challenge of parallel
programming is in resolving these
problems
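For a taste of what an MPI program looks like, here is a minimal sketch in C (assuming an MPI installation; compile with mpicc and run with mpirun). Each process sums part of 1..100 and MPI_Reduce combines the partial sums on rank 0, which also answers the earlier question of which processor holds the result:

/* Minimal MPI sketch: each process sums part of 1..100, then
   MPI_Reduce combines the partial sums on rank 0. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processors?   */

    /* Each rank handles every size-th number starting at rank+1. */
    long local = 0, total = 0;
    for (long i = rank + 1; i <= 100; i += size)
        local += i;

    /* Combine all partial sums onto rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum 1..100 = %ld\n", total);   /* 5050 */

    MPI_Finalize();
    return 0;
}

Run it, for example, as: mpirun -np 4 ./a.out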
Parallel Processor Layout Options
• Shared memory: processors P1 … Pn, each with its own cache ($), share a single memory over a common bus
• Distributed memory: nodes P0 … Pn, each with its own local memory and network interface (NI), communicate over an interconnect
Parallel Locality
• We now have to expand our view of the memory hierarchy to include
remote machines
• Remote memory behaves like a very fast network
• Bandwidth vs. Latency becomes important
• Extended memory hierarchy and what moves between levels:
  • Registers ↔ Cache: instruction operands
  • Cache ↔ Memory: blocks
  • Memory ↔ Remote Memory: blocks
  • Remote Memory ↔ Local and Remote Disk: large data blocks
Amdahl’s Law
• Applications can almost never be completely
parallelized
• Let s be the fraction of work done sequentially,
so (1-s) is fraction parallelizable,
and P = number of processors
Speedup(P) = Time(1) / Time(P)
           ≤ 1 / (s + (1-s)/P), and as P → ∞,
           ≤ 1/s
• Even if the parallel portion of your application
speeds up perfectly, your performance may be
limited by the sequential portion
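For a concrete feel for the 1/s bound, here is a tiny C sketch that evaluates the formula above (the 5% sequential fraction is an assumed example):

/* Amdahl's law: speedup bound for s = 0.05 (5% sequential work). */
#include <stdio.h>

int main(void) {
    double s = 0.05;                       /* assumed sequential fraction */
    int procs[] = {1, 10, 100, 1000, 100000};

    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        double speedup = 1.0 / (s + (1.0 - s) / p);
        printf("P = %6d   speedup <= %.1f\n", p, speedup);
    }
    printf("limit as P -> infinity: %.1f\n", 1.0 / s);   /* 20.0 */
    return 0;
}

Even with 100,000 processors the speedup never exceeds 20x; the 5% sequential portion dominates.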
Parallel Overhead
• Given enough parallel work, these are the
biggest barriers to getting desired speedup
• Parallelism overheads include:
• cost of starting a thread or process
• cost of communicating shared data
• cost of synchronizing
• extra (redundant) computation
• Each of these can be in the range of ms
(many Mflops) on some systems
• Tradeoff: Algorithm needs sufficiently large
units of work to run fast in parallel (i.e. large
granularity), but not so large that there is
not enough parallel work
Load Imbalance
• Load imbalance is the time that some
processors in the system are idle due to
• insufficient parallelism (during that phase)
• unequal size tasks
• Examples of the latter
• adapting to “interesting parts of a domain”
• tree-structured computations
• fundamentally unstructured problems
• Algorithms need to carefully balance load
Peer Instruction of Assumptions
1. Writing & managing SETI@Home is relatively straightforward; just hand out & gather data
2. Most parallel programs, when run on N (N big) identical supercomputer processors, will yield close to an N x performance increase
3. The majority of the world’s computing power lives in supercomputer centers

Answer choices (T/F for statements 1, 2, 3):
1: FFF   2: FFT   3: FTF   4: FTT   5: TFF   6: TFT   7: TTF   8: TTT
Peer Instruction Answer
1. The heterogeneity of the machines, handling machines that fail or falsify data. FALSE
2. The combination of Amdahl’s law, overhead, and load balancing takes its toll. FALSE
3. Have you considered how many PCs + game devices exist? Not even close. FALSE
Summary
• Parallel & Distributed Computing is a
multi-billion dollar industry driven by
interesting and useful scientific
computing & business applications
• It is extremely unlikely that sequential
computing will ever again catch up
with the processing power of parallel
or distributed systems
• Programming parallel systems can be
extremely challenging (distributed
computing less so, thanks to APIs),
but is built upon many of the concepts
you’ve learned this semester in 61C