The Role of Performance

CpE 442

Introduction to Computer Architecture


Instructor: H. H. Ammar


CpE442 Lec2.1

Overview of Today’s Lecture:


° Review from Last Lecture

° Definition and Measures of Performance

° Benchmarks

° Summarizing Performance and Performance Pitfalls



Review: What is "Computer Architecture"

° Co-ordination of levels of abstraction

Application

Compiler, Operating System

Instruction Set Architecture

Instr. Set Proc., I/O system

Digital Design

Circuit Design

° Under a set of rapidly changing Forces



Review: Levels of Representation

High Level Language Program

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

Compiler

Assembly Language Program

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

Assembler

Machine Language Program

    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

Machine Interpretation

Control Signal Specification



Review: Levels of Organization

Computer: SPARCstation 20

Processor (SPARC): Control, Datapath

Memory

Devices: Input, Output



Computer Architecture Simulation Tools

1. The HASE Architecture Simulation Environment

2. The New Compiler Technology simulation (shown in class)

3. MIPS Assembly Language Simulators

a. SPIM, a MIPS32 simulator: http://pages.cs.wisc.edu/~larus/spim.html

b. MARS (MIPS Assembler and Runtime Simulator): http://courses.missouristate.edu/kenvollmar/mars/


Review: Summary from Last Lecture

° All computers consist of five components

• Processor: (1) datapath and (2) control

• (3) Memory

• (4) Input devices and (5) Output devices

° Not all “memory” is created equal

• Cache: fast (expensive) memory is placed closer to the processor

• Main memory: less expensive memory, so we can have more of it

° Input and output (I/O) devices have the messiest organization

• Wide range of speed: graphics vs. keyboard

• Wide range of requirements: speed, standard, cost ... etc.

• Least amount of research (so far)



Metrics of performance

Levels (top to bottom): Application; Programming Language; Compiler; ISA; Datapath; Control; Function Units; Transistors, Wires, Pins

Metrics used at these levels (top to bottom):

• Response time, answers per month

• Operations per second

• (Millions of) instructions per second: MIPS

• (Millions of) floating-point (F.P.) operations per second: MFLOP/s

• Megabytes per second

• Cycles per second (clock rate)


Relating Processor Metrics: the execution time of a given program on a given CPU architecture

° CPU execution time = CPU clock cycles/pgm X clock cycle time

° or CPU execution time = CPU clock cycles/pgm ÷ clock rate

° Define CPI = the avg. clock cycles per instruction. CPI tells us something about the Instruction Set Architecture, the implementation of that architecture, and the program being measured

° CPU clock cycles/pgm = Instructions/pgm X CPI

° or CPI = CPU clock cycles/pgm ÷ Instructions/pgm
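The relations above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers (2 million instructions, CPI of 1.5, 1 GHz clock) are assumptions chosen for the sketch, not values from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU execution time = (Instructions/pgm x CPI) / clock rate."""
    clock_cycles = instruction_count * cpi  # CPU clock cycles/pgm
    return clock_cycles / clock_rate_hz


def average_cpi(clock_cycles, instruction_count):
    """CPI = CPU clock cycles/pgm / Instructions/pgm."""
    return clock_cycles / instruction_count


if __name__ == "__main__":
    # Hypothetical program: 2 million instructions, CPI 1.5, 1 GHz clock.
    t = cpu_time_seconds(2_000_000, 1.5, 1e9)
    print(f"CPU time: {t * 1e3:.3f} ms")
    print(f"CPI recovered from cycles: {average_cpi(3_000_000, 2_000_000)}")
```

Dividing by the clock rate and multiplying by the clock cycle time are the same operation, since the cycle time is the reciprocal of the rate.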



Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

An X marks each factor that a level affects:

                    instr. count    CPI    clock rate
Program                  X          (x)
Compiler                 X          (x)
Instr. Set Arch.         X           X
Organization                         X         X
Technology                                     X


Figures from a simulator for the following code segment, comparing two compilers:

    for (i=0;i<3;i++) {
        int_a(i)++;
        int_b(i)++;
        flt_d(i) = flt_d(i) + flt_c(i);
    }



Organizational Trade-offs

Levels (top to bottom): Application; Programming Language; Compiler; ISA; Datapath; Control; Function Units; Transistors, Wires, Pins

Quantities traded off across these levels: Instruction Mix, CPI, Cycle Time

Single-Cycle Processor Design: CPI = 1, large cycle time (slow clock)

Multi-Cycle Processor Design: CPI > 1, smaller cycle time (faster clock)


CPI: “Average cycles per instruction”

CPI = (CPU Time * Clock Rate) / Instruction Count

= Clock Cycles / Instruction Count

The performance equation can be written as follows using instruction classes, with the instruction count $I_i$ and $CPI_i$ for each class $i$:

$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} CPI_i \times I_i$$

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i, \qquad F_i = \frac{I_i}{\text{Instruction Count}} \quad \text{("instruction frequency")}$$

See example next slide

Invest Resources where time is Spent!



Example

Base Machine (Reg / Reg), Typical Mix

Op        Freq(Fi)   CPI(i)   Fi x CPI(i)   % Time
ALU       50%        1        .5            33%
Load      20%        2        .4            27%
Store     10%        2        .2            13%
Branch    20%        2        .4            27%
Total                         1.5

The CPI = 1.5 cycles per instruction
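As a check on the base machine numbers, here is a minimal Python sketch of the weighted-CPI calculation. The frequencies and per-class CPIs are those of the base machine example (ALU 50% at 1 cycle; Load 20%, Store 10%, and Branch 20% at 2 cycles each); the names `BASE_MIX`, `weighted_cpi`, and `time_share` are hypothetical helpers introduced for illustration.

```python
# Instruction class -> (frequency F_i, cycles per instruction CPI_i)
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 2),
    "Store": (0.1, 2),
    "Branch": (0.2, 2),
}


def weighted_cpi(mix):
    """Average CPI = sum over classes of F_i * CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = weighted_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}


if __name__ == "__main__":
    print(f"CPI = {weighted_cpi(BASE_MIX):.2f}")
    for op, share in time_share(BASE_MIX).items():
        print(f"{op:<7}{share:.0%}")
```

The time shares show where the cycles go: although ALU instructions are half of the mix, they account for only about a third of the execution time.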



Assume a program of 1 million instructions. Compare the performance of

Base Machine (B) with the above CPI and a 1 GHz clock, and

Enhanced Machine (E) with a 1.333 GHz clock and a one-cycle increase for load/store and branch instructions

Enhanced Machine (Reg / Reg)

Op        Freq   CPI(i)   Freq x CPI(i)   % Time
ALU       50%    1        .5              25%
Load      20%    3        .6              30%
Store     10%    3        .3              15%
Branch    20%    3        .6              30%
Total                     2.0



Comparing the performance of two machines

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

= Perf. of E / Perf. of B = exec. time of B / exec. time of E

= (1.5 X 1 ns) / (2.0 X 0.75 ns) = 1.5 / 1.5 = 1

(The instruction count is the same on both machines and cancels; 0.75 ns is the cycle time of the 1.333 GHz clock.)

The performance of B is the same as that of E: no gain in performance
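The comparison can be reproduced with a short Python sketch. It uses the figures from the example (1 million instructions; CPI 1.5 at 1 GHz for B; CPI 2.0 at 1.333 GHz for E); the function names are hypothetical. Note that with the clock taken as exactly 1.333 GHz the ratio comes out marginally below 1, while the slide rounds the cycle time to 0.75 ns and gets exactly 1.

```python
def exec_time_seconds(instruction_count, cpi, clock_rate_hz):
    """Execution time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def speedup(time_without, time_with):
    """Speedup(E) = ExTime without E / ExTime with E."""
    return time_without / time_with


if __name__ == "__main__":
    instructions = 1_000_000
    time_b = exec_time_seconds(instructions, 1.5, 1.0e9)    # base machine B
    time_e = exec_time_seconds(instructions, 2.0, 1.333e9)  # enhanced machine E
    print(f"B: {time_b * 1e3:.4f} ms, E: {time_e * 1e3:.4f} ms")
    print(f"Speedup(E) = {speedup(time_b, time_e):.4f}")
```

The faster clock is cancelled out by the higher CPI, which is the point of the example: clock rate alone does not determine performance.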



Rate Metrics:

MIPS (Million Instructions Per Second), and

MFLOPS (Million Floating-Point Operations Per Second)

MIPS = Instruction Count / (CPU Time * 10^6)

= Clock Rate / (CPI * 10^6)

• cannot compare machines with different instruction sets

• varies between programs with different instruction mixes (dynamic frequency of instructions)

• can be uncorrelated with performance

MFLOP/S = FP Operations / (Time * 10^6)

• machine dependent

• often not where time is spent


Example showing why MIPS can fail

Compare performance with Compilers 1 and 2 for a given program on a given machine

Instruction counts (in billions) and CPI for instruction classes A, B, C:

                                 A     B     C
Compiler 1 instruction count     5     1     1
Compiler 2 instruction count    10     1     1
CPI for each class               1     2     3

Clock cycles using compiler 1 = 5x1 + 1x2 + 1x3 = 10 billion

Clock cycles using compiler 2 = 10x1 + 1x2 + 1x3 = 15 billion

Assuming a 1 GHz clock:

CPU Time 1 = 10 billion cycles / 1 GHz = 10 secs

CPU Time 2 = 15 billion cycles / 1 GHz = 15 secs

Yet the MIPS rating is

MIPS 1 = instr. count / (CPU time in sec x 10^6) = (5+1+1)/10 x 1000 = 700

MIPS 2 = (10+1+1)/15 x 1000 = 800

giving the impression that compiler 2 is better because it executes instructions at a higher rate, even though its code takes 5 seconds longer to run
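The example can be recomputed with a minimal Python sketch. The instruction counts, per-class CPIs, and the 1 GHz clock come from the example; the dictionary layout and function names are assumptions made for illustration.

```python
CPI = {"A": 1, "B": 2, "C": 3}   # cycles per instruction, by class
CLOCK_HZ = 1_000_000_000         # 1 GHz clock, as assumed in the example

# Instruction counts per class for the code produced by each compiler.
COUNTS = {
    "compiler1": {"A": 5_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000},
    "compiler2": {"A": 10_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000},
}


def clock_cycles(counts):
    """Total cycles = sum over classes of count x CPI."""
    return sum(counts[c] * CPI[c] for c in counts)


def cpu_time(counts):
    """CPU time in seconds."""
    return clock_cycles(counts) / CLOCK_HZ


def mips(counts):
    """MIPS = instruction count / (CPU time x 10^6)."""
    return sum(counts.values()) / (cpu_time(counts) * 1e6)


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        print(f"{name}: {cpu_time(counts):.0f} s, {mips(counts):.0f} MIPS")
```

Compiler 2 gets the higher MIPS rating only because it adds cheap class A instructions; ranking by CPU time gives the opposite, and correct, order.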



Why Do Benchmarks?

° How we evaluate differences

• Different systems

• Changes to a single system

° Provide a target

• Benchmarks should represent large class of important programs

• Improving benchmark performance should help many programs

° For better or worse, benchmarks shape a field

° Good ones accelerate progress

• good target for development

° Bad benchmarks hurt progress

• help real programs v. sell machines/papers?

• Inventions that help real programs don’t help the benchmark



Programs to Evaluate Processor Performance

° (Toy) Benchmarks

• 10-100 lines

• e.g., sieve, puzzle, quicksort

° Synthetic Benchmarks

• attempt to match average frequencies of real workloads

• e.g., Whetstone, Dhrystone

° Kernels

• Time critical excerpts

° Real programs

• e.g., gcc, spice



Successful Benchmark: SPEC

http://www.spec.org/benchmarks.html

http://mrob.com/pub/comp/benchmarks/spec.html#CPU_06

° EE Times + 5 companies (Sun, MIPS, HP, Apollo, DEC) band together to form the System Performance Evaluation Cooperative (SPEC)

° Create a standard list of programs, inputs, and reporting rules: some real programs, including OS calls and some I/O



SPEC second round, SPEC95

• 8 integer benchmarks in C and 10 floating-point benchmarks in Fortran


Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1-F) + F/S) X ExTime(without E)

Speedup(with E) = ExTime(without E) ÷ (((1-F) + F/S) X ExTime(without E)) = 1 / ((1-F) + F/S)

<= 1/(1-F); the speedup is bounded by this factor
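A small Python sketch of Amdahl's Law as stated above. The function names and the sample values (F = 0.8, S = 10) are assumptions chosen for illustration, not figures from the lecture.

```python
def amdahl_speedup(fraction, factor):
    """Speedup = 1 / ((1 - F) + F / S) for enhanced fraction F and factor S."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction F must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor S must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def amdahl_limit(fraction):
    """Upper bound 1 / (1 - F), approached as S grows without limit."""
    if fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # Hypothetical enhancement: 80% of the task runs 10 times faster.
    print(f"Speedup: {amdahl_speedup(0.8, 10):.3f}")
    print(f"Bound:   {amdahl_limit(0.8):.3f}")
```

Even with an arbitrarily large S, the 20% of the task left unimproved in this sample keeps the overall speedup below 5.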



Performance Evaluation Summary

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

° Time is the measure of computer performance!

° Good products are created when we have:

• Good benchmarks

• Good ways to summarize performance

° Without good benchmarks and summaries, the choice is between improving the product for real programs and improving it to get more sales => sales almost always wins

° Remember Amdahl’s Law: speedup is limited by the unimproved part of the program

° HW 1, Submit via ecampus

