Introduction, Syllabus and Prelims


CSCI-6964: High Performance Parallel & Distributed Computing


AE 216, Mon/Thurs 2-3:20 p.m.

Introduction, Syllabus & Prelims

Prof. Chris Carothers

Computer Science Department

Lally 306 [Office Hrs: Wed, 11 a.m. – 1 p.m.]

Course Prereqs…

• Some programming experience in Fortran, C, C++…

– Java is great but not for HPC…

– You’ll have a choice to do your assignments in C, C++ or Fortran…subject to the language support of the programming paradigm..

• Assume you’ve never touched a parallel or distributed computer..

– If you have MPI experience, it will help you, but it is not necessary…

• If you love to write software…

– Both practice and theory are presented but there is a strong focus on getting your programs to work…

Course Textbook

Introduction to Parallel Computing, by Grama, Gupta, Karypis and Kumar

• Make sure you have the 2nd edition!

• Available either online thru the Addison-Wesley publisher or RPI Campus bookstore.

Course Topics

• Prelims & Motivation

– Memory Hierarchy

– CPU Organization

• Parallel Architectures (Ch. 2, papers)

– Message Passing/SMP

– Communications Networks

• Basic Communications Operations (Ch 3)

• MPI Programming (Ch 6)

• Principles of Parallel Algorithm Design (Ch 4)

• Thread Programming (Ch 7)

– Pthreads

– OpenMP

Course Topics (cont.)

• Analytical Modeling of Parallel Programs (Ch 5)

– LogP Model (paper)

• Parallel Algorithms (Mix of Ch 8 – 11)

– Matrix Algorithms

– Sorting Algorithms

– Graph Algorithms

– Search Algorithms

• MapReduce Programming Paradigm (papers)

• Applications (Guest Lectures)

– Computational Fluid Dynamics

– Mesh Adaptivity

– Parallel Discrete-Event Simulation

Course Grading Criteria

• You must read ahead (lecture, textbook and papers)

– FOR EACH CLASS…You will write a 1 page paper that summarizes what you read and states any questions you might have for my benefit…..

– In the case of guest lectures, report is due next class.

– What’s it worth…

• 1 grade point per class up to 25 points total

• There are 27 lectures, so you can pick 3 to miss…

• 4 programming assignments worth 10 pts each

– MPI, Pthreads, OpenMP, MapReduce

• Parallel Computing Research Project worth 35 pts

• Yes, that’s right no mid-term or final exam…

– May sound good, but when it’s 4 a.m. and your parallel program doesn’t work and you’ve spent the past 30 hours debugging it, an exam doesn’t sound so bad …

– For a course like this, you’ll need to manage your time a little each day and not get behind on the assignments or projects!

To Make A Fast Parallel Computer You Need a Faster Serial Computer…well sorta…

• Review of…

– Instructions…

– Instruction processing..

• Put it together…why the heck do we care about or need a parallel computer?

– i.e., they are really cool pieces of technology, but can they really do anything useful besides compute Pi to a few billion more digits…

Processor Instruction Sets

• In general, a computer needs a few different kinds of instructions:

– mathematical and logical operations

– data movement (access memory)

– jumping to new places in memory

• if the right conditions hold.

– I/O (sometimes treated as data movement)

• All these instructions involve using registers to store data as close as possible to the CPU

– E.g. $t0, $s0 in MIPS or %eax, %ebx in x86

a = (b+c) - (d+e);    (a in $s0, b in $s1, c in $s2, d in $s3, e in $s4)

add $t0, $s1, $s2    # $t0 = b+c
add $t1, $s3, $s4    # $t1 = d+e
sub $s0, $t0, $t1    # a = $t0 - $t1

lw destreg, const(addrreg)    “Load Word”

destreg: name of register to put value in

const: a number

addrreg: name of register to get base address from

address = (contents of addrreg) + const

Array Example:



lw $t0, 8($s2)       # $t0 = c[8]
add $s0, $s1, $t0    # $s0 = $s1 + $t0

(yeah, this is not quite right …  )

sw srcreg, const(addrreg)    “Store Word”

srcreg: name of register to get value from

const: a number

addrreg: name of register to get base address from

address = (contents of addrreg) + const

sw $s0, 4($s3)

If $s3 has the value 100, this will copy the word in register $s0 to memory location 104.

Memory[104] <- $s0
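The address arithmetic behind lw and sw can be sketched in a few lines. This is only an illustration: it assumes a byte-addressed memory modeled as a Python dict and 4-byte words, and the helper names (effective_address, store_word, load_word) are made up for this sketch, not part of any MIPS toolchain.

```python
WORD_SIZE = 4  # bytes per word (assumption for this sketch)

def effective_address(base, const):
    """address = (contents of addrreg) + const"""
    return base + const

def store_word(memory, regs, srcreg, const, addrreg):
    """Model of: sw srcreg, const(addrreg)"""
    memory[effective_address(regs[addrreg], const)] = regs[srcreg]

def load_word(memory, regs, destreg, const, addrreg):
    """Model of: lw destreg, const(addrreg)"""
    regs[destreg] = memory[effective_address(regs[addrreg], const)]

# Register contents are hypothetical, except $s3 = 100 from the slide
regs = {"$s0": 42, "$s3": 100, "$t0": 0}
memory = {}

store_word(memory, regs, "$s0", 4, "$s3")  # sw $s0, 4($s3)
load_word(memory, regs, "$t0", 4, "$s3")   # lw $t0, 4($s3)

# Byte offset of element 8 in an array of 4-byte words
offset_of_c8 = 8 * WORD_SIZE
```

With $s3 = 100 the store lands at address 104, as on the slide. Because const is a byte offset, element 8 of a word array sits 32 bytes past the base, which is one way to read the "not quite right" remark on the array example.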

Instruction formats

Field:  op      rs      rt      rd      shamt   funct
Width:  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits   (32 bits total)

This format is used for many MIPS instructions that involve calculations on values already in registers.

E.g. add $t0, $s0, $s1
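As a rough sketch of how the six fields pack into one 32-bit word, the example instruction can be encoded by hand. The register numbers and the funct code for add follow the standard MIPS conventions; the function name encode_r_format is just an illustrative choice.

```python
# Standard MIPS register numbers for the registers used here
REG = {"$zero": 0, "$t0": 8, "$t1": 9, "$s0": 16, "$s1": 17}

ADD_FUNCT = 0x20  # funct code for add; the op field is 0 for this format

def encode_r_format(op, rs, rt, rd, shamt, funct):
    """Pack the six fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s0, $s1  ->  rd = $t0, rs = $s0, rt = $s1
word = encode_r_format(0, REG["$s0"], REG["$s1"], REG["$t0"], 0, ADD_FUNCT)
print(f"{word:#010x}")  # prints 0x02114020
```

Note that the destination register comes first in the assembly syntax but sits in the rd field, after rs and rt, in the encoded word.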

How are instructions processed?

• In the simple case…

– Fetch instruction from memory

– Decode it (read the op code, and use registers based on what instruction the op code says)

– Execute the instruction

– Write back any results to register or memory

• Complex case…

– Pipeline – overlap instruction processing…

– Superscalar – multi-instruction issue per clock cycle..
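The simple case can be sketched as a toy interpreter loop. This is a deliberately minimal model, not real hardware: instructions are Python tuples rather than 32-bit words, only add and sub are supported, and the initial register values are made up. The program is the a = (b+c) - (d+e) sequence from the earlier slide.

```python
# Program from the earlier slide: a = (b + c) - (d + e)
program = [
    ("add", "$t0", "$s1", "$s2"),  # $t0 = b + c
    ("add", "$t1", "$s3", "$s4"),  # $t1 = d + e
    ("sub", "$s0", "$t0", "$t1"),  # a = $t0 - $t1
]

OPS = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}

def run(program, regs):
    """Process one instruction at a time: fetch, decode, execute, write back."""
    pc = 0
    while pc < len(program):
        instr = program[pc]                   # fetch
        op, rd, rs, rt = instr                # decode
        result = OPS[op](regs[rs], regs[rt])  # execute
        regs[rd] = result                     # write back
        pc += 1
    return regs

# Hypothetical starting values: b=5, c=7, d=2, e=3
regs = run(program, {"$s0": 0, "$s1": 5, "$s2": 7, "$s3": 2, "$s4": 3,
                     "$t0": 0, "$t1": 0})
```

Each instruction finishes all four steps before the next one starts, which is exactly the serialization that pipelining and superscalar issue try to remove.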

Simple (relative term) CPU Multicycle Datapath & Control

Simple (yeah right!) Instruction Processing FSM!

Pipeline Processing w/ Laundry

• While the first load is drying, put the second load in the washing machine.

• When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.

• Admittedly unrealistic scenario for CS students, as most only own 1 load of clothes…
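The payoff of overlapping loads can be estimated with a small timing model. The numbers below are assumptions chosen to fit a 6 PM to 2 AM timeline (4 loads, 4 equal stages of 30 minutes each); the slides do not state them, and the function names are illustrative.

```python
def sequential_time(num_loads, num_stages, stage_time=1.0):
    """Each load finishes every stage before the next load starts."""
    return num_loads * num_stages * stage_time

def pipelined_time(num_loads, num_stages, stage_time=1.0):
    """A new load enters as soon as the first stage is free."""
    if num_loads == 0:
        return 0.0
    return (num_stages + num_loads - 1) * stage_time

# Assumed scenario: 4 loads, 4 stages, 0.5 hours per stage
seq_hours = sequential_time(4, 4, 0.5)   # 8.0 hours
pipe_hours = pipelined_time(4, 4, 0.5)   # 3.5 hours
speedup = seq_hours / pipe_hours
```

Under these assumptions the speedup is about 2.3x, short of the ideal 4x, because the pipeline takes time to fill and drain. As the number of loads grows, the speedup approaches the number of stages.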

[Figures: laundry task order plotted against time of day, 6 PM to 2 AM]

Pipelined DP w/ signals

Pipelined Instruction.. But wait, we’ve got dependencies!

HPDC Spring 2008 - Intro, Syllabus & Prelims 25

Pipeline w/ Forwarding Values

Where Forwarding Fails…must stall

How Stalls Are Inserted

What about those crazy branches?

Problem: if the branch is taken, PC goes to addr 72, but we don’t know until after 3 other instructions are processed

Dynamic Branch Prediction

• From the phrase “There is no such thing as a typical program”, this implies that programs will branch in different ways and so there is no “one size fits all” branch algorithm.

Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.

Implementation: branch prediction buffer or branch history table.

– Index based on lower part of branch address

– Single bit indicates if branch at address was last taken or not. (1 or 0)

– But single bit predictors tend to lack sufficient history…

Solution: 2-bit Branch Predictor

Must be wrong twice before changing prediction

Learns if the branch is more biased towards “taken” or “not taken”
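The difference between the two schemes shows up on a loop branch. The sketch below models the 2-bit predictor as a saturating counter, which is one common realization and may differ in detail from the state machine on the slide. The branch pattern (taken 9 times, then not taken, repeated 10 times) and the initial predictor states are assumptions.

```python
def one_bit_mispredictions(outcomes, last_taken=False):
    """Predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses

def two_bit_mispredictions(outcomes, counter=0):
    """2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

# Assumed pattern: a loop branch taken 9 times, then not taken, run 10 times
outcomes = ([True] * 9 + [False]) * 10

one_bit = one_bit_mispredictions(outcomes)   # 20 misses
two_bit = two_bit_mispredictions(outcomes)   # 12 misses
```

The 1-bit predictor misses twice per pass (on loop exit and again on re-entry), while the 2-bit counter, once warmed up, misses only on the exit.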

Even more performance…

• Ultimately we want greater and greater Instruction Level Parallelism (ILP)

• How?

• Multiple instruction issue.

– Results in CPIs less than one.

– Here, instructions are grouped into “issue slots”.

– So, we usually talk about IPC (instructions per cycle)

– Static: uses the compiler to assist with grouping instructions and hazard resolution. Compiler MUST remove ALL hazards.

– Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards

Example Static 2-issue



• 32 bits from instr.

• Two read, 1 write ports on reg file

Ex. 2-Issue Code Schedule

Loop: lw   $t0, 0($s1)         # $t0 = array element
      addu $t0, $t0, $s2       # add scalar in $s2
      sw   $t0, 0($s1)         # store result
      addi $s1, $s1, -4        # dec pointer
      bne  $s1, $zero, Loop    # branch if $s1 != 0

      ALU/Branch Inst.         Data Xfer Inst.      Cycle
Loop:                          lw $t0, 0($s1)       1
      addi $s1, $s1, -4                             2
      addu $t0, $t0, $s2                            3
      bne $s1, $zero, Loop     sw $t0, 4($s1)       4

It takes 4 clock cycles for 5 instructions, or an IPC of 1.25

More Performance: Loop Unrolling

• Technique where multiple copies of the loop body are made.

• Make more ILP available by removing dependencies.

• How? Compiler introduces additional registers via “register renaming”.

• This removes “name” or “anti” dependence

– where an instruction order is purely a consequence of the reuse of a register and not a real data dependence.

– No data values flow between one pair and the next pair

– Let’s assume we unroll a block of 4 iterations of the loop..
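The transformation itself can be illustrated at the source level. This is only a sketch of the idea: Python will not expose any ILP, the function names are hypothetical, and the unrolled version assumes the array length is a multiple of 4. The temporaries t0..t3 play the role of the renamed registers.

```python
def add_scalar_rolled(array, scalar):
    """One element per iteration, like the original 5-instruction loop."""
    result = list(array)
    i = len(result) - 1
    while i >= 0:
        result[i] = result[i] + scalar
        i -= 1
    return result

def add_scalar_unrolled4(array, scalar):
    """Four elements per iteration; assumes len(array) is a multiple of 4."""
    result = list(array)
    i = len(result) - 4
    while i >= 0:
        # Four independent temporaries: no value flows from one to the next
        t0 = result[i + 3] + scalar
        t1 = result[i + 2] + scalar
        t2 = result[i + 1] + scalar
        t3 = result[i] + scalar
        result[i + 3], result[i + 2], result[i + 1], result[i] = t0, t1, t2, t3
        i -= 4
    return result

data = list(range(8))
rolled = add_scalar_rolled(data, 10)
unrolled = add_scalar_unrolled4(data, 10)
```

Both versions produce the same array. The unrolled one pays the pointer update and branch once per four elements, and its four additions are independent, so hardware with multiple issue slots can overlap them.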

Loop Unrolling Schedule



      ALU/Branch Instructions   Data Xfer Instructions   Cycle
Loop: addi $s1, $s1, -16        lw $t0, 0($s1)           1
                                lw $t1, 12($s1)          2
      addu $t0, $t0, $s2        lw $t2, 8($s1)           3
      addu $t1, $t1, $s2        lw $t3, 4($s1)           4
      addu $t2, $t2, $s2        sw $t0, 16($s1)          5
      addu $t3, $t3, $s2        sw $t1, 12($s1)          6
                                sw $t2, 8($s1)           7
      bne $s1, $zero, Loop      sw $t3, 4($s1)           8

Now, it takes 8 clock cycles for 14 instructions, or an IPC of 1.75
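The IPC figures for the two schedules are just instructions divided by cycles. The short calculation below uses the cycle and instruction counts from the slides; the cycles-per-element comparison and the function names are added here for illustration.

```python
def ipc(instructions, cycles):
    """Instructions completed per clock cycle."""
    return instructions / cycles

def cycles_per_element(cycles, elements_per_pass):
    """Clock cycles spent on each array element."""
    return cycles / elements_per_pass

baseline_ipc = ipc(5, 4)     # original loop: 5 instructions in 4 cycles
unrolled_ipc = ipc(14, 8)    # unrolled by 4: 14 instructions in 8 cycles

baseline_cpe = cycles_per_element(4, 1)   # 4 cycles per element
unrolled_cpe = cycles_per_element(8, 4)   # 2 cycles per element
```

IPC rises from 1.25 to 1.75 against a ceiling of 2 for a 2-issue machine, and the work per element halves because the unrolled loop also executes fewer overhead instructions.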










Dynamic Scheduled Pipeline

Intel P4 Dynamic Pipeline – Looks like a cluster .. Just much much smaller…

Summary of Pipeline Technology

We’ve exhausted this!!

IPC just won’t go much higher…


More Speed til it Hertz!

• So, if no more ILP is available, why not increase the clock frequency?

– E.g. why don’t we have 100 GHz processors today?

– With current CMOS technology, power needs increase polynomially (or worse) with a linear increase in clock speed.

– Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
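One common first-order model for why power grows faster than clock speed is the CMOS dynamic power relation P = C * V^2 * f. The sketch below additionally assumes that supply voltage has to rise roughly in proportion to frequency, which gives cubic growth. Both the model and that scaling assumption are simplifications brought in for illustration; they are not taken from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS dynamic power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

def relative_power(freq_ratio, voltage_tracks_frequency=True):
    """Power relative to a baseline where C = V = f = 1."""
    voltage_ratio = freq_ratio if voltage_tracks_frequency else 1.0
    return dynamic_power(1.0, voltage_ratio, freq_ratio)

doubled_clock = relative_power(2.0)                                  # 8.0
doubled_clock_fixed_v = relative_power(2.0, voltage_tracks_frequency=False)  # 2.0
```

Under these assumptions, doubling the clock costs about 8x the power, which is why a 100 watt budget rules out simply turning up the frequency.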


CPU Power Consumption…

Typically, 100 watts is the magic limit..

Where do we go from here?

(actually, we’ve arrived @ “here”!)

• Current Industry Trend: Multi-core CPUs

– Typically lower clock rate (i.e., < 3 GHz)

– 2, 4 and now 8 cores in single “socket” package

– Smaller VLSI design processes (e.g. < 45 nm) can reduce power & heat..

• Potential for large, lucrative contracts in turning old dusty sequential codes to multi-core capable

– Salesman: here’s your new $200 CPU, & oh, BTW, you’ll need this million $ consulting contract to port your code to take advantage of those extra cores!

• Best business model since the mainframe!

– More cores require greater and greater exploitation of available parallelism in an application which gets harder and harder as you scale to more processors..

• Due to cost, we’ll force in-house development of talent pool..

– You could be that talent pool…

Examples: Multicore CPUs

• Brief listing of the recently released new 45 nm processors: Based on Intel site

(Processor Model - Cache - Clock Speed - Front Side Bus)

• Desktop Dual Core:

– E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz

– E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz

– E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz

• Laptop Dual Core:

– T9500 - 6 MB L2 - 2.60 GHz - 800 MHz

– T9300 - 6 MB L2 - 2.50 GHz - 800 MHz

– T8300 - 3 MB L2 - 2.40 GHz - 800 MHz

– T8100 - 3 MB L2 - 2.10 GHz - 800 MHz

• Desktop Quad Core:

– Q9550 - 12MB L2 - 2.83 GHz - 1333 MHz

– Q9450 - 12MB L2 - 2.66 GHz - 1333 MHz

– Q9300 - 6MB L2 - 2.50 GHz - 1333 MHz

• Desktop Extreme Series:

– QX9650 - 12 MB L2 - 3 GHz - 1333 MHz

• Note: Intel's new 45nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35W thermal envelope.

These are becoming the building block of today’s SCs

Getting large amounts of speed requires lots of processors…

Nov. 2007 TOP 12 Supercomputers








DOE/LLNL, US: Blue Gene/L: 212,992 processors : 478 Tflops!

FZJ, Germany: Blue Gene/L: 64K processors

NMCAC, US: SGI Altix: 14336 processors

CRL, India: HP Xeon Cluster: 14240 processors

Sweden Gov.: HP Xeon Cluster: 13768 processors

RedStorm Sandia, US: Cray/Opteron: 26569 processors



Oak Ridge Nat. Lab., US: Cray XT4: 23016 processors

IBM TJ Watson, NY: Blue Gene/L: 40960 processors (20 racks)


NERSC/LBNL, US: Cray XT4: 19320 processors


Stony Brook/BNL, NY: Blue Gene/L 36864 processors (18 racks)


DOE/LLNL, US: IBM pSeries cluster: 12208 processors


RPI, NY: Blue Gene/L: 32768 processors (16 racks)

If all NY State TOP 500 Blue Genes were interconnected, we’d have an SC resource ranked well above #2 in the world!


Soon-To-Be Fastest


• Ranger @ Texas Advanced Computing Center (TACC)

– Sun is the lead designer/integrator

– Peak Performance: 504 TFlops

• This is the Linpack performance

– 62,976 processor cores

• 3936 nodes, each with 4 quad-core AMD Phenom processors

• 8 Gflops per core is peak performance..

– 123 TBytes of RAM

– 1.73 Pbytes of disk

– 7 stage InfiniBand interconnect

• 2.1 usec latency
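The core count and performance figure quoted for Ranger can be cross-checked from the per-node numbers. This is plain arithmetic on the values in the bullets above; the variable names are illustrative.

```python
nodes = 3936
sockets_per_node = 4     # 4 processors per node
cores_per_socket = 4     # quad-core
gflops_per_core = 8      # quoted per-core peak

cores = nodes * sockets_per_node * cores_per_socket
peak_tflops = cores * gflops_per_core / 1000

print(cores)        # 62976
print(peak_tflops)  # 503.808
```

The product reproduces both the 62,976 cores and the roughly 504 TFlops figure, so that number corresponds to core count times per-core peak.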

What are SC’s used for??

• Can you say “fever for the flavor”..

• Yes, Pringles used an SC to model airflow of chips as they entered “The Can”..

• Improved overall yield of “good” chips in “The Can” and fewer chips on the floor…


Patient Specific Vascular Surgical Planning

– Virtual flow facility for patient specific surgical planning

– High quality patient specific flow simulations needed quickly

– Image patient, create model, adaptive flow simulation

– Simulation on massively parallel computers

– Cost only $600 on 32K Blue Gene/L vs. $50K for a repeat open heart surgery…

• Current uni-core speed has peaked

– No more ILP to exploit

– Can’t make CPU cores any faster w/ current CMOS technology

– Must go massively parallel in order to increase IPC (# instructions per clock cycle).

• The only way for a large application to go really fast is to use lots and lots of processors..

– Today’s systems have 10’s of thousands of processors

– By 2010 systems will emerge w/ > 1 million processors! (e.g. Blue Waters @ UIUC)

Reading/Paper Summary Assignments!

• Next lecture 2 summary assignments are due..

– 1 for this lecture

• Covers notes plus Chapter 1 thru 2.1

– 1 for next lecture

• Covers slides plus Chapter 6.1 thru 6.5

• Note, next lecture is not until Thursday, Jan. 24th

– Jan 17th lecture cancelled due to CS External Review event..

– Jan 21st lecture cancelled due to MLK day.

