Introduction to Parallel Computing

CPE 779 Parallel Computing
http://www1.ju.edu.jo/ecourse/abusufah/cpe779_Spr10/index.html
Lecture 1: Introduction
Walid Abu-Sufah
University of Jordan
Acknowledgment: Collaboration
This course is being offered in collaboration with
• The IMPACT research group at the University of Illinois
http://impact.crhc.illinois.edu/
• The Universal Parallel Computing Research Center
(UPCRC) at the University of Illinois
http://www.upcrc.illinois.edu/
• The Computation-based Science and Technology
Research Center (CSTRC) of the Cyprus Institute
http://cstrc.cyi.ac.cy/
Acknowledgment: Slides
Some of the slides used in this course are based on
slides by
• Kathy Yelick, University of California at Berkeley
http://www.cs.berkeley.edu/~yelick/cs194f07
• Jim Demmel, University of California at Berkeley & Horst
Simon, Lawrence Berkeley National Lab (LBNL)
http://www.cs.berkeley.edu/~demmel/cs267_Spr10/
• Wen-mei Hwu and Sanjay Patel of the University of
Illinois, and David Kirk, Nvidia Corporation
http://courses.ece.illinois.edu/ece498/al/
Course Motivation
In the last few years:
• Conventional sequential processors cannot
get faster
- Previously, clock speed doubled every 18 months
• All computers will be parallel
• Therefore, all programs will have to become
parallel programs
- Especially programs that need to run faster.
Course Motivation (continued)
There will be a huge change in the entire
computing industry
• Previously, the industry depended on selling
new computers that ran users' existing
programs faster, without the users having to
reprogram them.
• Multi-/many-core chips have started a
revolution in the software industry
Course Motivation (continued)
Large research activities to address this
issue are underway
• Computer companies: Intel, Microsoft,
Nvidia, IBM, etc.
- Parallel programming is a concern for the
entire computing industry.
• Universities
- Berkeley's ParLab (2008: $20 million grant)
- The Universal Parallel Computing Research
Center of the University of Illinois (2008: $20
million grant)
Course Goals
• The purpose of this course is to teach students
the necessary skills for developing applications
that can take advantage of on-chip parallelism.
• Part 1 (~4 weeks): focus on the techniques that
are most appropriate for multicore architectures
and the use of parallelism to improve program
performance. Topics include
- performance analysis and tuning
- data techniques
- shared data structures
- load balancing and task parallelism
- synchronization
Course Goals (continued - I)
• Part 2 (~ 12 weeks): Provide students with
knowledge and hands-on experience in
developing applications software for massively
parallel processors (100s or 1000s of cores)
- Use NVIDIA GPUs and the CUDA programming
language.
- To effectively program these processors, students
will acquire in-depth knowledge about
- Data parallel programming principles
- Parallelism models
- Communication models
- Resource limitations of these processors.
Outline of rest of lecture
• Why powerful computers must use parallel processors
- Including your laptops and handhelds
• Examples of Computational Science and Engineering
(CSE) problems which require powerful computers
- Commercial problems too
• Why writing (fast) parallel programs is hard
- But things are improving
• Principles of parallel computing performance
• Structure of the course
What is Parallel Computing?
• Parallel computing: using multiple processors in
parallel to solve problems (execute applications) more
quickly than with a single processor
• Examples of parallel machines:
• A cluster computer that contains multiple PCs combined
together with a high speed network
• A shared memory multiprocessor (SMP*), built by connecting
multiple processors to a single memory system
• A Chip Multi-Processor (CMP) contains multiple processors
(called cores) on a single chip
• Concurrent execution comes from the desire for
performance
• * Technically, SMP stands for “Symmetric Multi-Processor”
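To make this concrete, here is a minimal sketch (not taken from the slides) of a shared memory parallel program in C with OpenMP: one loop is split across the processors (cores) of an SMP or CMP, which is exactly the kind of parallelism this course targets.

/* Minimal sketch: summing an array in parallel with OpenMP.           */
/* Each thread (typically one per core) sums a chunk of the array;     */
/* the reduction clause combines the partial sums into one result.     */
#include <omp.h>
#include <stdio.h>

#define N 10000000

static double a[N];

int main(void) {
    double sum = 0.0;

    for (long i = 0; i < N; i++)
        a[i] = 1.0;                      /* dummy data                  */

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}

Compiled with OpenMP enabled (e.g., gcc -fopenmp), the loop is divided among the available cores; on 4 cores it can run up to roughly 4 times faster than the sequential version, subject to memory bandwidth.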
Units of Measure
• High Performance Computing (HPC) units are:
• Flop: floating point operation
• Flops/s: floating point operations per second
• Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
Mega:  Mflop/s = 10^6 flop/sec;   Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
Giga:  Gflop/s = 10^9 flop/sec;   Gbyte = 2^30 ~ 10^9 bytes
Tera:  Tflop/s = 10^12 flop/sec;  Tbyte = 2^40 ~ 10^12 bytes
Peta:  Pflop/s = 10^15 flop/sec;  Pbyte = 2^50 ~ 10^15 bytes
Exa:   Eflop/s = 10^18 flop/sec;  Ebyte = 2^60 ~ 10^18 bytes
Zetta: Zflop/s = 10^21 flop/sec;  Zbyte = 2^70 ~ 10^21 bytes
Yotta: Yflop/s = 10^24 flop/sec;  Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 2.3 Pflop/s
• Up-to-date list at www.top500.org
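As a small illustration of these units (the kernel and constants below are invented for the example), the following C fragment times a loop and reports its Mflop/s rate:

/* Timing a simple kernel and reporting Mflop/s (10^6 flop/s).          */
#include <stdio.h>
#include <time.h>

int main(void) {
    const long n = 100000000L;            /* 10^8 iterations             */
    double s = 0.0, x = 1.000000001;

    clock_t t0 = clock();
    for (long i = 0; i < n; i++)
        s = s * x + 1.0;                  /* 1 multiply + 1 add = 2 flops */
    clock_t t1 = clock();

    double secs   = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double mflops = (2.0 * n / secs) / 1.0e6;
    printf("%.2f s, %.1f Mflop/s (result %g)\n", secs, mflops, s);
    return 0;
}

For scale, a 2.3 Pflop/s machine sustains about 2.3 * 10^9 times the rate of a 1 Mflop/s run of this loop.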
High Performance Computing, HPC
• Parallel computers have been used for decades
• Mostly used in computational science, engineering,
business, and defense
• Problems too large to solve on one processor; use 100s or 1000s
• Examples of challenging computations in science
• Global climate modeling
• Biology: genomics; protein folding; drug design
• Astrophysical modeling
• Computational Chemistry
• Computational Material Sciences and Nanosciences
High Performance Computing, HPC(continued)
• Examples of challenging computations in engineering
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
• Crash simulation
• Examples of challenging computations in business
• Financial and economic modeling
• Transaction processing, web services and search engines
• Examples of challenging computations in defense
• Nuclear weapons -- test by simulations
• Cryptography
Economic Impact of HPC
• Airlines:
• Logistics optimization systems on parallel computers.
• Savings: approx. $100 million per airline per year.
• Automotive design:
• Major automotive companies use large systems (500+ CPUs)
for:
• CAD-CAM, crash testing, structural integrity and aerodynamics.
• One company has a 500+ CPU parallel system.
• Savings: approx. $1 billion per company per year.
• Semiconductor industry:
• Semiconductor firms use large systems (500+ CPUs) for
• device electronics simulation and logic validation
• Savings: approx. $1 billion per company per year.
• Securities industry (note: old data …)
• Savings: approx. $15 billion per year for U.S. home
mortgages.
Why powerful computers are parallel
What is New in Parallel Computing Now?
• In the 80s/90s many companies “bet” on parallel computing and
failed
• Computers got faster too quickly for there to be a large market
• What is new now?
• The entire computing industry has bet on parallelism
There is a desperate need for parallel
programmers
• Let’s see why…
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years:
called "Moore's Law"
Microprocessors have
become smaller, denser,
and more powerful.
Gordon Moore (co-founder of
Intel) predicted in 1965 that the
transistor density of
semiconductor chips would
double roughly every 18
months.
Slide source: Jack Dongarra
Microprocessor Transistors and Clock Rate
[Figure: growth in transistors per chip (i4004 through i8080, i8086, i80286, i80386, R2000/R3000, Pentium, R10000) and increase in clock rate (MHz), 1970 to 2005]
In 2002: Why bother with parallel programming? Just wait a year or two…
Limit #1: Power density
We can soon put more transistors on a chip than we can afford to turn on.
-- Patterson '07
Scaling clock speed (business as usual) will not work.
Sun’s
Surface
Power Density (W/cm2)
10000
Rocket
Nozzle
1000
Nuclear
Reactor
100
8086
Hot Plate
10 4004
8008 8085
386
286
8080
1
1970
1980
P6
Pentium®
486
1990
Year
Source: Patrick
Gelsinger, Intel
2000
CPE 779 Parallel Computing - Spring 2010
2010
19
Limit #2: Hidden Parallelism Tapped Out
Application performance was increasing by 52% per year, as measured
by the SpecInt benchmarks (from Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 4th edition, 2006)
• ½ due to transistor density
• ½ due to architecture changes, e.g., Instruction Level Parallelism (ILP)
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
Limit #2: Hidden Parallelism Tapped Out
• Superscalar (SS) designs were the state of the art;
many forms of parallelism not visible to programmer
• multiple instruction issue
• dynamic scheduling: hardware discovers parallelism
between instructions
• speculative execution: look past predicted branches
• non-blocking caches: multiple outstanding memory ops
• Unfortunately, these sources have been used up
Revolution is Happening Now
• Chip density is
continuing to increase,
~2x every 2 years
• Clock speed is not
• Number of processor
cores may double
instead
• There is little or no
hidden parallelism
(ILP) to be found
• Parallelism must be
exposed to and
managed by software
Source: Intel, Microsoft (Sutter) and
Stanford (Olukotun, Hammond)
Parallelism in 2010?
• All major processor vendors are producing multicore chips
• Every machine will soon be a parallel machine
• To keep doubling performance, parallelism must double
• Which commercial applications can use this parallelism?
• Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
• New software model needed
• Try to hide complexity from most programmers – eventually
• In the meantime, need to understand it
• Computer industry betting on this big change, but does not
have all the answers
• Berkeley ParLab and University of Illinois UPCRC established to work
on this
Moore’s Law reinterpreted
• Number of cores per chip will double every
two years
• Clock speed will not increase (possibly
decrease)
• Need to deal with systems with millions of
concurrent threads
• Need to deal with inter-chip parallelism as
well as intra-chip parallelism
Outline
• Why powerful computers must be parallel processors
- Including your laptop
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming
harder than sequential programming.
Finding Enough Parallelism: Amdahl’s Law
T1 = execution time using 1 processor (serial execution time)
Tp = execution time using P processors
S = serial fraction of the computation (i.e., the fraction that
can only be executed on 1 processor)
C = fraction of the computation that can be executed by P
processors in parallel
Then S + C = 1 and
Tp = S*T1 + (C*T1)/P = (S + C/P)*T1
Speedup = Ψ(P) = T1/Tp = 1/(S + C/P) <= 1/S
• Maximum speedup (i.e., when P = ∞) is Ψmax = 1/S; example:
S = 0.05 gives Ψmax = 20
• Currently the fastest machine has ~224,000 processors
• Even if the parallel part speeds up perfectly, performance is
limited by the sequential part
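A small C sketch (the serial fraction and processor counts are just example values) that evaluates this formula shows how quickly the serial fraction dominates:

/* Amdahl's Law: speedup(P) = 1 / (S + (1-S)/P), bounded above by 1/S.  */
#include <stdio.h>

static double amdahl_speedup(double S, double P) {
    double C = 1.0 - S;                   /* parallelizable fraction    */
    return 1.0 / (S + C / P);
}

int main(void) {
    const double S = 0.05;                /* 5% strictly serial         */
    const int procs[] = {1, 10, 100, 1000, 224000};
    for (int i = 0; i < 5; i++)
        printf("P = %6d  speedup = %5.2f\n", procs[i],
               amdahl_speedup(S, procs[i]));
    printf("limit as P -> infinity: %.1f\n", 1.0 / S);   /* 1/S = 20    */
    return 0;
}

Even with 224,000 processors the speedup stays below 20; only shrinking S helps.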
Speedup Barriers: (a) Overhead of Parallelism
• Given enough parallel work, overhead is a big barrier to
getting desired speedup
• Parallelism overheads include:
• cost of starting a thread or process
• cost of communicating shared data
• cost of synchronizing
• extra (redundant) computation
• Each of these can be in the range of milliseconds on
some systems (=millions of flops)
• Tradeoff: the algorithm needs sufficiently large units of work
to run fast in parallel (i.e., large granularity), but not so
large that there is not enough parallel work
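A hypothetical OpenMP sketch of this granularity tradeoff (array size and kernel invented for illustration):

/* Granularity vs. overhead: the fixed cost of a parallel region        */
/* (thread wake-up, implicit barrier) should enclose enough work.       */
#define N 2000

/* Too fine-grained: the parallel region is created and torn down for   */
/* every row, so the overhead is paid N times.                          */
void scale_fine(double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
    }
}

/* Coarser-grained: one parallel region over the outer loop; the        */
/* overhead is paid once and each thread handles whole rows.            */
void scale_coarse(double a[N][N], double b[N][N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}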
Speedup Barriers: (b) Working on Non Local Data
[Figure: conventional storage hierarchy: each processor has its own cache, L2 cache, and L3 cache, backed by its memory, with potential interconnects between the memories]
• Large memories are slow, fast memories are small
• Parallel processors, collectively, have large, fast cache
• slow accesses to “remote” data are called “communication”
• Algorithm should do most work on local data
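A simple C illustration (assumed example, on the cache side of the hierarchy): both loops do identical work, but only one walks memory in the order it is stored.

/* Good locality: unit-stride, row-major traversal stays in cache.      */
#define N 2048
static double a[N][N];

double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor locality: stride-N traversal keeps fetching "far away" lines.   */
double sum_col_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}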
Speedup Barriers: (c) Load Imbalance
• Load imbalance occurs when some processors in the
system are idle due to
• insufficient parallelism (during that phase)
• unequal size tasks
• Algorithm needs to balance load
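One common fix, sketched with OpenMP and a hypothetical do_task whose cost varies from task to task:

/* Load balancing unequal tasks: schedule(dynamic) hands out iterations */
/* on demand, so no thread sits idle after finishing its own chunk.     */
#include <omp.h>

#define NTASKS 1000

extern double do_task(int i);             /* hypothetical, variable-cost task */

double run_all_tasks(void) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < NTASKS; i++)
        total += do_task(i);
    return total;
}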
Course Mechanics
• Web page:
http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr10/index.html
• Grading:
- Five programming assignments
- Final projects (proposals due Wednesday April 21)
- Could be parallelizing an application
- Developing an application using CUDA/OpenCL
- Performance-model-driven tuning of a parallel application on multicore
and/or GPU
- Midterm, Wednesday April 7, 5:30-6:45
- Final, Thursday, May 27, 5:30-7:30
Rough List of Topics (For Details see Syllabus)
• Basics of computer architecture, memory hierarchies, performance
• Parallel Programming Models and Machines
- Shared Memory and Multithreading
- Data parallelism, GPUs
• Parallel languages and libraries
- OpenMP
- CUDA
• General techniques
- Load balancing, performance modeling and tools
• Applications
Reading Materials: Textbooks
Required
• David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel
Processors: A Hands-on Approach, Morgan Kaufmann (February 5,
2010).
(Most chapters are available online in draft form; visit
http://courses.ece.illinois.edu/ece498/al/Syllabus.html )
• Calvin Lin and Larry Snyder, "Principles of Parallel Programming",
Addison-Wesley, 2009
Reading Materials: References
• Ian Foster, Designing and Building Parallel Programs, Addison-Wesley (available online @ http://www.mcs.anl.gov/~itf/dbpp/ )
• Randima Fernando, GPU Gems: Programming Techniques, Tips
and Tricks for Real-Time Graphics, Publisher: Addison-Wesley
Professional (2004), (available online on NVIDIA Developer Site;
see:
http://http.developer.nvidia.com/GPUGems/gpugems_part01.html )
• Grama, A., Gupta, A., Karypis, G., and Kumar, V. "Introduction to
Parallel Computing", Second Edition, Addison Wesley, 2003.
Reading Materials: Tutorials
• See course website for tutorials and other online resources