Keynote PowerPoint

If Parallelism Is The New Normal,
How Do We Prepare Our Students
(And Ourselves)?
Joel Adams
Department of Computer Science
Calvin College
An Anecdote about CCSC:MW
This story has nothing to do with parallel computing, but it may be of interest…
Did you know that if it were not for CCSC:MW, CS Education Week would likely not exist?
How CCSC:MW → CS Ed Week
At CCSC:MW in 2008:
• The ACM-CSTA's Chris Stephenson gave the keynote, describing the decline of CS in high schools
  – No Child Left Behind was killing HS CS!
  – I'm pretty apolitical, but ...
How CCSC:MW → CS Ed Week
• I decided to visit my Congressman, Rep. Vernon Ehlers, ranking member of the House Committee on Science & Technology (a Physics PhD and former Calvin prof).
• He was surprised to hear of the problems (esp. enrollment declines) CS was facing.
How CCSC:MW → CS Ed Week
• Rep. Ehlers contacted the ACM, specifically Cameron Wilson.
• They worked together on CS Education Week, which the House passed 405-0 in 2009.
• CCSC:MW catalyzed CS Education Week!
What’s Happening Now?
There is a bill currently in Congress:
– H.R. 2536: The CS Education Act of 2013
– It seeks to strengthen K-12 CS education and make CS a core subject.
– It currently has 116 co-sponsors (62 R, 54 D); it is supported by ACM, NCWIT, Google, MS, ...
– It has been referred to the Subcommittee on Early Childhood, Elementary, and Secondary Ed., chaired by Rep. Todd Rokita (R, IN).
Most Representatives Are Unaware
What Can You Do?
There is strength in numbers:
• Contact your Congressional representative and ask them to co-sponsor H.R. 2536.
  – If you are in Rep. Rokita's district… (!)
  – More co-sponsors improve its chances.
• Tweet to Rep. Rokita (@ToddRokita)
  – Tell him you support H.R. 2536 – the CS Education Act of 2013 – and want it to pass.
And Now, Back To Today’s Topic
Overview
• The past
  – How our computing foundation has shifted
• The present
  – Today's hardware & software landscapes
• The future?
  – Preparing ourselves & our students
[Chart: chip temperature over time, actual vs. projected; unchecked projections approach hot-plate and then sun-surface temperatures by ~2020]
The Heat Problem…
• … was not caused by Moore's Law
• It was caused by manufacturers doubling the clock speeds every 18-24 months
• This was the "era of the free lunch" for software developers:
  – If your software was sluggish, faster hardware would fix your problem within two years!
Solving the Heat Problem…
• In 2005, manufacturers stopped doubling the clock speeds because of the heat, power consumption, electron bleeding, …
• This ended the "era of the free lunch"
  – Software will no longer speed up on its own.
[Chart: CPU clock speed (frequency) trend over time]
But Moore’s Law Continued
• Every 2 years, manufacturers could still double the transistors in a given area:
  – 2006: Dual-core CPUs
  – 2008: Quad-core CPUs
  – 2010: 8-core CPUs
  – 2012: 16-core CPUs
  – …
• Each of these cores has the full functionality of a traditional CPU.
12 Years of Moore’s Law
2001: ohm.calvin.edu: 18 nodes, each with:
– One 1-GHz Athlon CPU
– 1 GB RAM / node
– Gigabit Ethernet, USB, HDMI, …
– Ubuntu Linux
– ~$60,000 (funded by NSF)

2013: Adapteva Parallella:
– A dual-core 1-GHz ARM Cortex-A9
– 16-core Epiphany coprocessor
– 1 GB RAM
– Gigabit Ethernet, USB, HDMI, …
– Ubuntu Linux
– ~$99 (but free via university program!)
Multiprocessors are Inexpensive
2014: Nvidia Jetson TK1:
– Quad-core ARM A15
– Kepler GPU w/ 192 CUDA cores
– 2 GB RAM
– Gigabit Ethernet, HDMI, USB, …
– Ubuntu Linux
– ~$200
Multiprocessors are Everywhere
Some Implications
• Traditional sequential programs will not run faster on today's hardware.
  – They may well run slower, because the manufacturers are decreasing clock speeds.
• The only software that will run faster is parallel software designed to scale with the number of cores.
Categorizing Parallel Hardware
[Diagram: a taxonomy of parallel hardware]
Parallel Systems
– Shared Memory: Multicore
– Distributed Memory: Older Clusters
– Heterogeneous Systems: Newer Clusters, Modern Supercomputers; with Accelerators (GPUs, Coprocessors)
Hardware: A Diverse Landscape
• Shared-memory systems
  [Diagram: Core1, Core2, Core3, and Core4 all sharing a single Memory]
• Distributed-memory systems
  [Diagram: CPU1/Mem1, CPU2/Mem2, CPU3/Mem3, … CPUN/MemN connected by a Network]
• Heterogeneous systems
CS Curriculum 2013
Because of this hardware revolution, the advent of cloud computing, and so on, CS2013 has added a new knowledge area:
Parallel and Distributed Computing (PDC)
What is PDC?
It goes beyond traditional concurrency:
– Parallel emphasizes:
  o Throughput / performance (and timing)
  o Scalability (performance improves with # of cores)
  o New topics like speedup, Amdahl's Law, …
– Distributed emphasizes:
  o Multiprocessing (no shared memory): MPI, MapReduce/Hadoop, BOINC, …
  o Cloud computing
  o Mobile apps accessing scalable web services
Software: Communication Options
In shared-memory systems, programs may:
• Communicate via the shared memory (a sketch follows below)
  – Languages: Java, C++11, …
  – Libraries: POSIX threads, OpenMP
• Communicate via message passing
  – Message-passing languages: Erlang, Scala, …
  – Libraries: the Message Passing Interface (MPI)
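
To make the shared-memory option concrete, here is a minimal C++11 sketch (mine, not the talk's code) in which two threads communicate through a shared counter; the mutex prevents a race condition on the updates:

#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;        // shared memory that both threads access
    std::mutex counterLock;  // guards the shared counter

    auto work = [&]() {      // each thread increments the shared counter
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> guard(counterLock);
            ++counter;
        }
    };

    std::thread t0(work), t1(work);
    t0.join();  t1.join();
    std::cout << counter << std::endl;   // always 2000000, thanks to the lock
    return 0;
}

Build with g++ -std=c++11 -pthread; removing the lock_guard turns this into a classic race-condition demo.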
CS Curriculum 2013 (CS2013)
• The CS2013 core includes 15 hours of parallel & distributed computing (PDC) topics:
  + 5 hours in core Tier 1
  + 10 hours in core Tier 2
  + related topics in Systems Fundamentals (SF)
• How/where do we cover these topics in the CS curriculum?
Model 1: Create a New Course
Add a new course to the CS curriculum that covers the core PDC topics:
+ If someone else has to teach this new course, dealing with PDC is their problem, not mine!
– The CS curriculum is already full!
– What do we drop to make room?
Model 2: Across the Curriculum
Sprinkle 15+ hours (3 weeks) of PDC across our core CS courses, not counting SF:
+ Students see the relationship of PDC to data structures, algorithms, prog. lang., …
+ Easier to make room for 1 week in 1 course than to jettison an entire course.
+ Spreads the effort across multiple faculty
– All those faculty have to be "on board"
Calvin CS Curriculum
Year 1 – Fall: Intro to Computing; Calculus I. Spring: Data Structures; Calculus II.
Year 2 – Fall: Algorithms & DS; Intro. Comp. Arch.; Discrete Math I. Spring: Programming Lang.; Discrete Math II.
Year 3 – Fall: Software Engr.; Adv. Elective. Spring: OS & Networking; Adv. Elective; Statistics.
Year 4 – Fall: Adv. Elective (HPC); Sr. Practicum I. Spring: Adv. Elective; Sr. Practicum II; Perspectives on Comp.
Why Introduce Parallelism in CS2?
• For students to be facile with parallelism, they need to see it early and often.
• Performance (Big-Oh) is a topic that's first addressed in CS2.
• Data structures let us store large data sets
  – Slow sequential processing of these sets provides a natural motivation for parallelism.
Parallel Topics in CS2
• Lecture topics:
  – Single threading vs. multithreading
  – The single-program-multiple-data (SPMD), fork-join, parallel loop, and reduction patterns
  – Speedup, asymptotic performance analysis
  – Parallel algorithms: searching, sorting
  – Race conditions: non-thread-safe structures
• Lab exercise: Compare sequential vs. parallel matrix operations using OpenMP
Lab Exercise: Matrix Operations
Given a Matrix class, the students:
• Measure the time to perform sequential addition and transpose methods
• For each of three different approaches:
  – Use the approach to parallelize those methods
  – Record execution times in a spreadsheet
  – Create a chart showing time vs. # of threads
Students directly experience the speedup… (a sketch follows below)
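
For example, one of the approaches is OpenMP's parallel loop. A minimal sketch (assuming a toy Matrix type and illustrative names, not Calvin's actual lab code):

#include <omp.h>
#include <iostream>
#include <vector>

typedef std::vector< std::vector<double> > Matrix;  // toy stand-in for the lab's Matrix class

// Parallelized addition: OpenMP divides the rows among the threads.
Matrix add(const Matrix& m1, const Matrix& m2) {
    Matrix m3( m1.size(), std::vector<double>( m1[0].size() ) );
    #pragma omp parallel for
    for (int r = 0; r < (int)m1.size(); ++r) {
        for (int c = 0; c < (int)m1[0].size(); ++c) {
            m3[r][c] = m1[r][c] + m2[r][c];
        }
    }
    return m3;
}

int main() {
    Matrix m1(2000, std::vector<double>(2000, 1.0));
    Matrix m2(2000, std::vector<double>(2000, 2.0));
    double start = omp_get_wtime();          // time the parallel operation
    Matrix m3 = add(m1, m2);
    std::cout << omp_get_wtime() - start << " secs" << std::endl;
    return 0;
}

Students build with g++ -fopenmp, rerun with OMP_NUM_THREADS set to 1, 2, 4, …, and chart the resulting times.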
Addition: m3 = m1 + m2
– Single-threaded: ~36 steps
– Multi-threaded (4 threads): ~9 steps
[Diagram: four threads each add one quarter of the matrix elements]

Transpose: m2 = m1.transpose()
– Single-threaded: ~24 steps
– Multi-threaded (4 threads): ~6 steps
[Diagram: four threads each transpose one quarter of the matrix elements]
[Chart: Matrix Addition vs. Transpose on 4 (8 HT) cores – execution time (0 to 0.35) vs. number of threads (1, 2, 4, 6, 8, 10)]
Programming Project
• Parallelize other Matrix operations
– Multiplication
– Assignment
– Constructors
– Equality
• Some operations (file I/O) are inherently sequential, providing a useful lesson…
Alternative Exercise/Project
• Parallelize image-processing operations (a sketch follows below):
  – Color-to-grayscale
  – Invert (negative)
  – Blur, Sharpen
  – Sepia-tinting
• Many students will find photo-processing to be more engaging than matrix ops.
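
To illustrate, a minimal OpenMP sketch of the color-to-grayscale operation (mine; the pixel layout and function name are assumptions, not the actual course code):

#include <omp.h>

// Convert interleaved 8-bit RGB pixels to grayscale, in parallel.
void toGrayscale(unsigned char* pixels, int numPixels) {
    #pragma omp parallel for              // each thread processes a chunk of pixels
    for (int i = 0; i < numPixels; ++i) {
        unsigned char r = pixels[3*i],
                      g = pixels[3*i+1],
                      b = pixels[3*i+2];
        // combine the channels using the standard luminance weights
        unsigned char gray = (unsigned char)(0.299*r + 0.587*g + 0.114*b);
        pixels[3*i] = pixels[3*i+1] = pixels[3*i+2] = gray;
    }
}

Because each pixel is independent, this is 'embarrassingly parallel', just like the matrix operations.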
Assessment
All students complete end-of-course evaluations with open-ended feedback:
• They really like the week on parallelism
  – Covering material that is not in the textbook makes CS2 seem fresh and cutting-edge
  – Students really like learning how they can use all their cores instead of just one
  – Having students experience speedup is key (and even better if they can see it)
More Implications
• Software developers who cannot build parallel apps will be unable to leverage the full power of today's hardware.
  – At a competitive disadvantage?
• Designing / writing parallel apps is very different from designing / writing sequential apps.
  – Pros think in terms of parallel design patterns
Parallel Design Patterns
• … are industry-standard strategies that parallel professionals have found useful over 30+ years of practice.
• … often have direct support built into popular platforms like MPI and OpenMP.
• … are likely to remain useful, regardless of future PDC developments.
• … provide a framework for PDC concepts.
Algorithm Strategy Patterns
Example 1: Most parallel programs use one of just three parallel algorithm strategy patterns:
• Data decomposition: divide up the data and process it in parallel.
• Task decomposition: divide the algorithm into functional tasks that we perform in parallel (to the extent possible).
• Pipeline: divide the algorithm into linear stages, through which we "pump" the data.
Of these, only data decomposition scales well…
Data Decomposition (1 thread)
[Diagram: Thread 0 processes the entire data set]

Data Decomposition (2 threads)
[Diagram: Thread 0 and Thread 1 each process half of the data set]

Data Decomposition (4 threads)
[Diagram: Threads 0-3 each process one quarter of the data set]
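
A minimal C++11 sketch (mine, not the talk's) of this chunking, in which each thread derives its own index range from its ID — the same arithmetic OpenMP performs behind the scenes:

#include <thread>
#include <vector>

// Each thread processes its own contiguous chunk of the data.
void processChunk(std::vector<double>& data, int id, int numThreads) {
    size_t chunkSize = data.size() / numThreads;
    size_t start = id * chunkSize;
    size_t stop  = (id == numThreads-1) ? data.size()        // last thread also
                                        : start + chunkSize; //  takes any leftovers
    for (size_t i = start; i < stop; ++i) {
        data[i] *= 2.0;                   // some per-element work
    }
}

int main() {
    const int NUM_THREADS = 4;
    std::vector<double> data(1000000, 1.0);
    std::vector<std::thread> threads;
    for (int id = 0; id < NUM_THREADS; ++id) {
        threads.push_back( std::thread(processChunk, std::ref(data), id, NUM_THREADS) );
    }
    for (auto& t : threads) { t.join(); } // wait for every chunk to finish
    return 0;
}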
Task Decomposition
Independent functions in a sequential computation can be "parallelized":

int main() {
    int x = f();
    int y = g();
    int z = h();
    int w = x + y + z;
}

[Diagram: main() runs on Thread 0, while f(), g(), and h() run concurrently on Threads 1, 2, and 3]
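
One way to realize this pattern (a sketch of mine, not the talk's code) is C++11's std::async, which can run each independent call on its own thread:

#include <future>
#include <iostream>

int f() { return 1; }   // stand-ins for three
int g() { return 2; }   //  independent,
int h() { return 3; }   //   time-consuming tasks

int main() {
    // launch f, g, and h concurrently, each on its own thread
    std::future<int> x = std::async(std::launch::async, f);
    std::future<int> y = std::async(std::launch::async, g);
    std::future<int> z = std::async(std::launch::async, h);
    int w = x.get() + y.get() + z.get();   // join: wait for all three results
    std::cout << w << std::endl;
    return 0;
}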
Pipeline
Programs with non-independent functions…

int main() {
    ...
    while (fin >> a) {
        b = f(a);
        c = g(b);
        d = h(c);
        fout << d;
    }
    ...
}

… can still be pipelined:

TimeStep:          0    1    2    3    4    5    6
Thread 0: main()   a0   a1   a2   a3   a4   a5   a6
Thread 1: f(a)          b0   b1   b2   b3   b4   b5
Thread 2: g(b)               c0   c1   c2   c3   c4
Thread 3: h(c)                    d0   d1   d2   d3
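
A minimal C++11 sketch of such a pipeline (my illustration, not the talk's code): four stages on four threads, connected by small thread-safe queues, with -1 as an end-of-stream sentinel:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A tiny thread-safe queue connecting two adjacent pipeline stages.
class Pipe {
public:
    void put(int v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(v);
        cv_.notify_one();
    }
    int get() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this]{ return !q_.empty(); });
        int v = q_.front();  q_.pop();
        return v;
    }
private:
    std::queue<int>         q_;
    std::mutex              m_;
    std::condition_variable cv_;
};

int f(int a) { return a + 1; }   // stand-ins for the
int g(int b) { return b * 2; }   //  real pipeline
int h(int c) { return c - 3; }   //   stages

int main() {
    Pipe ab, bc, cd;   // the pipes between the four stages
    std::thread t0([&]{ for (int a = 0; a < 8; ++a) ab.put(a);  ab.put(-1); });  // 'fin >> a'
    std::thread t1([&]{ for (int a; (a = ab.get()) != -1; ) bc.put(f(a));  bc.put(-1); });
    std::thread t2([&]{ for (int b; (b = bc.get()) != -1; ) cd.put(g(b));  cd.put(-1); });
    std::thread t3([&]{ for (int c; (c = cd.get()) != -1; ) std::cout << h(c) << "\n"; });  // 'fout << d'
    t0.join();  t1.join();  t2.join();  t3.join();
    return 0;
}

Once the pipeline fills, all four stages work simultaneously on different items, just as in the timeline above.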
Scalability
• If a program gets faster as more threads/cores are used, its performance scales.
• For the three algorithm strategy patterns:

  Algorithm Strategy Pattern   Scalability Limited By
  Task Decomposition           Number of functions/tasks
  Pipeline                     Number of pipeline stages
  Data Decomposition           Amount of data to be processed

– Only data decomposition scales well.
The Reduction Pattern
Programs often need to combine the local results of N parallel tasks:
• When N is large, O(N) time is too slow
• The reduction pattern does it in O(lg(N)) time:

To sum these 8 numbers:   6    8    9    1    5    7    2    4
Step 1:                     14        10        12        6
Step 2:                          24                  18
Step 3:                                    42
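
OpenMP supports this pattern directly through its reduction clause; a minimal sketch (mine, summing the eight numbers above):

#include <omp.h>
#include <stdio.h>

int main() {
    int nums[8] = {6, 8, 9, 1, 5, 7, 2, 4};
    int sum = 0;
    // each thread sums a share of the array; OpenMP then combines
    //  the per-thread partial sums using the reduction pattern
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 8; ++i) {
        sum += nums[i];
    }
    printf("sum = %d\n", sum);   // 42
    return 0;
}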
A Parallel Pattern Taxonomy
Faculty Development Resources
• The National Computational Science Institute (NCSI) offers workshops each summer:
  – www.computationalscience.org/workshops/
• The XSEDE Education Program offers workshops, bootcamps, and facilities:
  – www.xsede.org/curriculum-and-educator-programs
• The LittleFe Project offers "buildouts" at which participants can build (and take home) a free portable Beowulf cluster:
  – littlefe.net
LittleFe
LittleFe (v4): 6 nodes
– Dual-core Atom CPU
– Nvidia ION2 w/ 16 CUDA cores
– 2 GB RAM
– Gigabit Ethernet, USB, …
– Custom Linux distro (BCCD)
– Pelican case
– ~$2500 (but free at "buildouts"!)
Faculty Development Resources
• CSinParallel is an NSF-funded project to help CS educators integrate PDC topics.
  – 1-3 hour hands-on PDC "modules" in:
    o Different level courses
    o Different languages
    o Different parallel design patterns (patternlets)
  – Workshops (today, here; summer 2015 in Chicago)
  – A community of supportive people to help work through problems and issues
  – csinparallel.org
Patternlets Demo
Summary
• Every CS major should learn about PDC
  – CS2013 adds PDC to the CS core curriculum
  – CS2 is a natural place to introduce parallelism, using 'embarrassingly parallel' problems
  – Address synchronization in later courses
• Parallel design patterns provide a stable intellectual framework for PDC.
• There are a variety of resources available to help us all make the transition.
“The pessimist complains about the wind;
the optimist expects it to change;
the realist adjusts the sails.”
- William Arthur Ward
• Thank you!
• Time for questions…
Links to Resources
• CSinParallel: csinparallel.org
• LittleFe: littlefe.net
• XSEDE: www.xsede.org
• NCSI: www.computationalscience.org
• CS Education Act of 2013:
– www.computinginthecore.org/csea
– Rep. Todd Rokita (@ToddRokita)