CES 524 Computer Architecture
Fall 2006
W 6 to 9 PM
Salazar Hall 2008
B. Ravikumar (Ravi)
Office: Darwin Hall, 116 I
Office hours: TBA
Textbook
Computer Architecture: A Quantitative Approach,
Hennessy and Patterson, 3rd edition
Other references:
• Hennessy and Patterson, Computer Organization and Design: The
Hardware/Software Interface. (undergraduate text by the same
authors)
• Jordan and Alaghband, Fundamentals of Parallel Processing.
• Hwang, Advanced Computer Architecture. (excellent source on
parallel and pipeline computing)
• I. Koren, Computer Arithmetic Algorithms.
• Survey articles, on-line lecture notes from several sources.
• http://www.cs.utexas.edu/users/dburger/teaching/cs382mf03/homework/papers.html
• http://www.cs.wisc.edu/arch/www/
Class notes and lecture format
• PowerPoint slides, sometimes PDF, PostScript, etc.
• Tablet PC will be used to
• add comments
• draw sketches
• write code
• make corrections etc.
• lecture notes prepared with generous help provided
through the web sites of many instructors:
• Patterson, Berkeley
• Susan Eggers, University of Washington
• Mark Hill, Wisconsin
• Breecher, Clark U
• DeHon, Caltech, etc.
Background expected
• Logic Design
• Combinational logic, sequential logic
• logic design and minimization of circuits
• finite state machines
• Discrete components: multiplexers, memory units, ALU
• Basic Machine structure
• processor (Data path, control), memory, I/O
• Number representations, computer arithmetic
• Assembly language programming
But if you have not studied some of these topics or don’t
remember, don’t worry! These topics will be reviewed.
Coursework and grading
• Homework sets (about 5 sets): 25 points
• Most problems will be assigned from the text. The text has many problems
marked * with solutions at the back. These will be helpful in solving the HW
problems.
• Additional problems at a similar level
• Some implementation exercises (mostly C programming)
• Policy on collaboration
• Mid-semester tests: 25 points
• One will be in-class, open book/notes, 75 minutes long
• The other can be take-home or in-class. (discussed later)
• Dates will be announced (at least) one week in advance
• All topics discussed up to the previous lecture
• Project: 15 points
• Semester-long work. Each student chooses one problem to work on.
• Design, implementation, testing, etc.
• Report and a presentation
• Final exam: 35 points (take-home?)
Project Examples
• hardware design
e.g., circuit design for a dedicated application such as image
processing
• hardware testing
• cache-efficient algorithm design and implementation
• wireless sensor networks
• embedded system design (for those who are doing CES 520)
• Tablet PC hardware study
Overview of the course
• Review of Computer Organization
• Digital logic design
• Computer Arithmetic
• machine/assembly language
• Data and control path design
• Cost performance trade-offs
• Instruction Set Design Principles
• classification of instruction set architectures
• addressing modes
• encoding instructions
• trade-offs/examples/historical overview
• Instruction level parallelism
• overview of pipelining
• superscalar execution
• branch prediction
• dynamic scheduling
• Cache Memory
• cache performance modeling
• main memory technology
• virtual memory
• cache-efficient algorithm design
• external memory data structures, algorithms, applications
• Shared-memory multiprocessor system
• symmetric multiprocessors
• distributed shared-memory machines
• Storage systems
• RAID architecture
• I/O performance measures
• Advanced Topics
• Computer Arithmetic
• alternatives to floating-point
• Design to minimize size, delay, etc.
• circuit complexity trade-offs (area, time, energy etc.)
• hardware testing
• Fault-models, fault testing
• model checking
• external memory computation problems
• external memory algorithms
• data structures, applications
• Cache-efficient algorithms
• Nonstandard models
• quantum computing
• bio-computing and neural computing
• Sensor networks
Overview of the course
• Review of Computer Organization
• Digital logic design
[Figure: (a) programmable OR gates driven by a decoder over inputs w, x, y, z;
(b) the logic equivalent of part (a); (c) a programmable read-only memory (PROM).]
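Since a PROM realizes a combinational function as a full decoder feeding programmable OR gates, it behaves like a lookup table indexed by the input bits. Below is a minimal C sketch of that idea; the 3-input majority function programmed here is just an illustrative choice, not something from the slides.

#include <stdio.h>

/* A PROM stores one output word per input combination.
   Here: 3 inputs (w, x, y) -> 8 "memory" locations, 1-bit output.
   The programmed contents implement majority(w, x, y). */
static const unsigned char prom[8] = {
    /* wxy: 000 001 010 011 100 101 110 111 */
           0,  0,  0,  1,  0,  1,  1,  1
};

static int prom_read(int w, int x, int y)
{
    int address = (w << 2) | (x << 1) | y;  /* the decoder: inputs form an address */
    return prom[address];                   /* the OR plane: the stored bit is the output */
}

int main(void)
{
    for (int a = 0; a < 8; a++)
        printf("w=%d x=%d y=%d -> %d\n",
               (a >> 2) & 1, (a >> 1) & 1, a & 1,
               prom_read((a >> 2) & 1, (a >> 1) & 1, a & 1));
    return 0;
}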
Overview of the course
• Review of Computer Organization
• Digital logic design
[Figure: (a) general programmable combinational logic with an AND array (AND plane)
feeding an OR array (OR plane); (b) PAL: programmable AND array, fixed OR array;
(c) PLA: programmable AND and OR arrays.]
Overview of the course
• Review of Computer Organization
• Digital logic design – sequential circuits
Next-state table (inputs: Dime, Quarter, Reset; S 00 is the initial state, S 35 is the final state):

Current state   Dime    Quarter   Reset
S 00            S 10    S 25      S 00
S 10            S 20    S 35      S 00
S 20            S 30    S 35      S 00
S 25            S 35    S 35      S 00
S 30            S 35    S 35      S 00
S 35            S 35    S 35      S 00

[State diagram: Start enters S 00; Dime and Quarter edges advance through S 10, S 20, S 25, S 30 to S 35; Reset returns to S 00 from every state.]
Example of sequential circuit design
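A rough C sketch of this controller, transcribing the next-state table above; the enum names and the input encoding are illustrative choices, not part of the slide.

#include <stdio.h>

/* States encode cents deposited so far, capped at 35. */
enum state { S00, S10, S20, S25, S30, S35 };
enum input { DIME, QUARTER, RESET };

/* next[current][input], transcribed from the state table above */
static const enum state next[6][3] = {
    /*        DIME  QUARTER  RESET */
    /* S00 */ {S10,  S25,     S00},
    /* S10 */ {S20,  S35,     S00},
    /* S20 */ {S30,  S35,     S00},
    /* S25 */ {S35,  S35,     S00},
    /* S30 */ {S35,  S35,     S00},
    /* S35 */ {S35,  S35,     S00},
};

int main(void)
{
    enum state s = S00;                       /* S00 is the initial state */
    enum input coins[] = { DIME, DIME, DIME, QUARTER };
    const char *names[] = {"S00","S10","S20","S25","S30","S35"};

    for (unsigned i = 0; i < sizeof coins / sizeof coins[0]; i++) {
        s = next[s][coins[i]];
        printf("after input %u: %s%s\n", i, names[s],
               s == S35 ? "  (final state reached)" : "");
    }
    return 0;
}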
Overview of the course
• Review of Computer Organization
• Computer Arithmetic
Overview of the course
• Review of Computer Organization
• machine/assembly language
MIPS instruction set
Overview of the course
• Review of Computer Organization
• Data and control path design
Overview of the course
• Review of Computer Organization
• Cost-performance trade-offs
A processor spends 30% of its time on flp addition, 25% on flp mult,
and 10% on flp division. Evaluate the following enhancements, each
costing the same to implement:
a. Redesign the flp adder to make it twice as fast.
b. Redesign the flp multiplier to make it three times as fast.
c. Redesign the flp divider to make it 10 times as fast.
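One way to attack this exercise is Amdahl's law (covered formally later in these notes): overall speedup = 1 / ((1 - f) + f/s) when a fraction f of the time is sped up by a factor s. A small C sketch that evaluates the three candidate enhancements:

#include <stdio.h>

/* Amdahl's law: a fraction f of the execution time is sped up by a factor s. */
static double speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    printf("a. flp add  (30%% of time,  2x): %.4f\n", speedup(0.30,  2.0));
    printf("b. flp mult (25%% of time,  3x): %.4f\n", speedup(0.25,  3.0));
    printf("c. flp div  (10%% of time, 10x): %.4f\n", speedup(0.10, 10.0));
    return 0;
}

Under these numbers, the multiplier redesign wins: roughly 1.20 overall, versus about 1.18 for the adder and 1.10 for the divider.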
Overview of the course
• Instruction Set Design Principles
• classification of instruction set architectures
Overview of the course
• addressing modes
Overview of the course
• Instruction level parallelism
• overview of pipelining
Overview of the course
• overview of pipelining
Hazards
A hazard occurs when the same resource is required by two successive
instructions executing in overlapping cycles.
• read after write
x=y+z
w=x+t
The second instruction’s “read x” part of the cycle has to
wait for the first instruction’s “write x” part of the cycle.
• branch prediction
x = y + z;
if (x > 0)
y = y + 1;
else z = z +1;
The instruction after x = y + z can only be determined after x has
been computed.
x = 1;
while (x < 100) do
{
sum = sum + x;
x = x + 1;
}
The instruction after x = x + 1 will almost always be sum = sum + x
(99 out of 100 times), so if we know that the instruction is in a loop,
we predict that the loop is executed again and we will be correct most of the
time.
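This intuition is what hardware branch predictors exploit. As an illustration (a common textbook scheme, not necessarily the one covered later in the course), here is a 2-bit saturating-counter predictor simulated in C on the loop branch from the example above:

#include <stdio.h>

int main(void)
{
    /* 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken. */
    int counter = 2;        /* start in the weakly "taken" state */
    int correct = 0, total = 0;

    /* The loop branch "while (x < 100)" is taken 99 times, then not taken once. */
    for (int x = 1; x <= 100; x++) {
        int taken     = (x < 100);        /* actual outcome of the branch      */
        int predicted = (counter >= 2);   /* prediction from the counter state */

        correct += (predicted == taken);
        total++;

        /* Train the predictor on the actual outcome. */
        if (taken  && counter < 3) counter++;
        if (!taken && counter > 0) counter--;
    }
    printf("predicted %d of %d branches correctly\n", correct, total);
    return 0;
}

It mispredicts only the final exit of the loop, so it is correct 99 times out of 100 here.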
• branch prediction
• dynamic scheduling
• Cache Memory
• cache performance modeling
• main memory technology
• virtual memory
• cache-efficient algorithm design
• external memory data structures, algorithms, applications
• Shared-memory multiprocessor system
• symmetric multiprocessors
• distributed shared-memory machines
• Storage systems
• RAID architecture
• I/O performance measures
Hitting the Memory Wall
[Graph: relative performance of processor vs. memory, 1980 to 2010; processor
performance climbs toward 10^6 while memory improves far more slowly. Memory
density and capacity have grown along with CPU power and complexity, but memory
speed has not kept pace.]
Focus areas
• instruction set architecture, review of computer organization
• instruction-level parallelism (chapters 3 and 4)
• computer arithmetic (Appendix)
• cache-performance analysis and modeling (chapter 5)
• multiprocessor architecture (chapter 7)
• interconnection networks (chapter 8)
• cache-efficient algorithm design
• code optimization and advanced compilation techniques
• FPGA and reconfigurable computing
History of Computer Architecture
The abacus is probably more than 3000 years old.
The abacus is prepared for use by placing it flat on a table or one's lap and pushing all the
beads on both the upper and lower decks away from the beam. The beads are manipulated
with either the index finger or the thumb of one hand.
The abacus is still in use today by shopkeepers in Asia. The use of the abacus is still taught
in Asian schools, and some schools in the West. Visually impaired children are taught to
use the abacus where their sighted counterparts would be taught to use paper and pencil to
perform calculations. One particular use for the abacus is teaching children simple
mathematics and especially multiplication; the abacus is an excellent substitute for rote
memorization of multiplication tables, a particularly detestable task for young children.
History of Computer Architecture
• Mechanical devices for controlling complex operations have been in
existence since at least the 1500s. (The first ones were rotating pegged cylinders
in music boxes.)
The medieval development of camshafts has proven to be of immense technological significance. It
allowed the budding medieval industry to transform the rotating movement of waterwheels and
windmills into the movements that could be used for the hammering of ore, the sawing of wood and
the manufacturing of paper.
• Pascal developed a mechanical calculator to help in tax work.
• Pascal's calculator contains eight dials that connect to a drum, with a
linkage that causes a dial to rotate one notch when a carry is produced
from a dial in a lower position.
• Some of Pascal's adding machines, which he started to build in 1642,
still exist today.
Pascal began work on his calculator in 1642, when he was only 19 years old.
He had been assisting his father, who worked as a tax commissioner, and
sought to produce a device which could reduce some of his workload. By 1652
Pascal had produced fifty prototypes and sold just over a dozen machines, but
the cost and complexity of the Pascaline – combined with the fact that it
could only add and subtract, and the latter with difficulty – was a barrier to
further sales, and production ceased in that year. By that time Pascal had
moved on to other pursuits, initially the study of atmospheric pressure, and
later philosophy.
The Pascaline was a decimal machine. This proved to be a liability, however, as the contemporary French
currency system was not decimal. It was instead similar to the Imperial pounds ("livres"), shillings ("sols")
and pence ("deniers") in use in Britain until the 1970s, and necessitated that the user perform further
calculations if the Pascaline was to be used for its intended purposes, as a currency calculator.
In 1799 France changed to a metric system, by which time Pascal's basic design had inspired other
craftsmen, although with a similar lack of commercial success. Child prodigy Gottfried Wilhelm von Leibniz
produced a competing design, the Stepped Reckoner, in 1672 which could perform addition, subtraction,
multiplication and division, but calculating machines did not become commercially viable until the early
19th century, when Charles Xavier Thomas de Colmar's Arithmometer, itself based on Von Leibniz's design,
was commercially successful.
• In the 1800s, Babbage built a computational device called the difference
engine.
• This machine had features seen in modern computers: means for reading input
data, storing data, performing calculations, producing output and automatically
controlling the operations of the machine.
Difference engines were forgotten and then rediscovered in 1822 by Charles
Babbage, who proposed it in a paper to the Royal Astronomical Society
entitled "Note on the application of machinery to the computation of very
big mathematical tables." This machine used the decimal number system
and was powered by cranking a handle. The British government initially
financed the project, but withdrew funding when Babbage repeatedly asked
for more money whilst making no apparent progress on building the
machine. Babbage went on to design his much more general analytical
engine but later produced an improved difference engine design (his
"Difference Engine No. 2") between 1847 and 1849. Inspired by Babbage's
difference engine plans, Per Georg Scheutz built several difference engines
from 1855 onwards; one was sold to the British government in 1859. Martin
Wiberg improved Scheutz's construction but used his device only for
producing and publishing printed logarithmic tables.
The principle of a difference engine is Newton's method of differences. It
may be illustrated with a small example. Consider the quadratic
polynomial p(x) = 2x² − 3x + 2.
To calculate p(0.5) we use the values from the lowest diagonal. We start
with the rightmost column value of 0.04. Then we continue the second
column by subtracting 0.04 from 0.16 to get 0.12. Next we continue the
first column by taking its previous value, 1.12, and subtracting the 0.12
from the second column. Thus p(0.5) is 1.12 − 0.12 = 1.0. In order to
compute p(0.6), we iterate the same algorithm on the p(0.5) values: take
0.04 from the third column, subtract that from the second column's value
0.12 to get 0.08, then subtract that from the first column's value 1.0 to
get 0.92, which is p(0.6).
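The same tabulation is easy to write in C. This sketch assumes the step size h = 0.1 used in the worked example; the point is that once the differences are set up, each new value of the quadratic needs only two additions and no multiplication, which is exactly what the difference engine mechanized.

#include <stdio.h>

/* Tabulate p(x) = 2x^2 - 3x + 2 by Newton's method of differences.
   With step h = 0.1: first difference d1(x) = p(x+h) - p(x),
   second difference d2 = 2*2*h^2 = 0.04 (constant for a quadratic). */
int main(void)
{
    double h  = 0.1;
    double p  = 2.0;               /* p(0)                      */
    double d1 = 2*h*h - 3*h;       /* p(h) - p(0) = -0.28       */
    double d2 = 2 * 2 * h * h;     /* constant second difference */

    for (int i = 0; i <= 6; i++) {
        printf("p(%.1f) = %.2f\n", i * h, p);
        p  += d1;                  /* next value: add the first difference   */
        d1 += d2;                  /* next first difference: add the second  */
    }
    return 0;
}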
Babbage’s analytical engine
• Babbage also designed a more sophisticated
machine known as analytical engine which had a
mechanism for branching and a means for
programming using punched cards.
Portion of the mill of the Analytical
Engine with printing mechanism, under
construction at the time of Babbage’s
death.
The designs for the Analytical Engine include almost all the
essential logical features of a modern electronic digital computer.
The engine was programmable using punched cards. It had a ‘store’
where numbers and intermediate results could be held and a
separate ‘mill’ where the arithmetic processing was performed. The
separation of the ‘store’ (memory) and ‘mill’ (central processor) is a
fundamental feature of the internal organisation of modern
computers.
The Analytical Engine could have ‘looped’ (repeated the same
sequence of operations a predetermined number of times) and was
capable of conditional branching (IF… THEN… statements), i.e.,
automatically taking alternative courses of action depending on the
result of a calculation.
Had it been built it would have needed to be operated by a steam
engine of some kind. Babbage made little attempt to raise funds to
build the Analytical Engine. Instead he continued to work on
simpler and cheaper methods of manufacturing parts and built a
small trial model which was under construction at the time of his
death.
• The Analytical Engine was never built because the technology of the time
could not meet the required standards.
• Ada Lovelace (daughter of the poet Byron) worked with Babbage
to write the earliest computer programs to solve problems on the
analytical engine.
• A version of Babbage's difference engine was actually built by the
Science Museum in London in 1991 and can still be viewed today.
• It took over a century until the start of World War II, before the
next major development in computing machinery took place.
• In England, German U-boat submarines were causing heavy
damage to Allied shipping.
• The U-boats received communications using a secret code
which was implemented by a cipher machine known as ENIGMA.
• The process of encryption used by Enigma had been known for a
long time, but decoding was a much harder task.
• Alan Turing, a British Mathematician, and others in England, built
an electromechanical machine to decode the message sent by
ENIGMA.
• The Colossus, a later machine built at Bletchley Park to break the Lorenz
cipher, was another successful code-breaking machine that came out of this
effort.
• Colossus had all the features of an electronic computer:
• Vacuum tubes stored the contents of a paper tape fed into the
machine.
• Computations took place among the vacuum tubes, and programming
was performed with plug boards.
• Around the same time as Turing’s efforts, Eckert and Mauchly set out
to create a machine to compute tables of ballistic trajectories for the U.
S. Army.
• The result of their effort was the Electronic Numerical Integrator and
Computer (ENIAC) at Moore School of Engineering at the University of
Pennsylvania.
• ENIAC consisted of 18,000 vacuum tubes, which made up the
computing section of the machine. Programming and data entry
were performed by setting switches.
• There was no concept of a stored program, and there was no central
memory unit.
• But these were not serious limitations, since ENIAC was intended
to do special-purpose calculations.
• ENIAC was not ready until the war was over, but it was
used successfully for nine years after the war (1946-1955).
• After ENIAC was completed, von Neumann joined Eckert and
Mauchly. Together, they worked on a model for a stored-program
computer called EDVAC.
• Eckert and Mauchly and von Neumann and Goldstine split up after
disputes over credit and differences of opinion.
• But the concept of the stored-program computer came out of this
collaboration. The term von Neumann architecture is used to denote a
stored-program computer.
• Wilkes at Cambridge University built a stored-program computer
(EDSAC) that was completed in 1949.
• Atanasoff at Iowa State built a small-scale electronic computer.
His work came to light as part of a lawsuit.
• Another early machine that deserves credit is Konrad Zuse’s machine
in Germany.
• Historical papers (e.g. by Knuth) have been written on the earliest
computer programs ever written on these machines:
• sorting
• generate perfect squares, primes
Comparison of early computers
ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• ENIAC used ten-position ring counters to store digits; each digit used 36
tubes, 10 of which were the dual triodes making up the flip-flops of the ring
counter. Arithmetic was performed by "counting" pulses with the ring counters
and generating carry pulses if the counter "wrapped around", the idea being to
emulate in electronics the operation of the digit wheels of a mechanical adding
machine.
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second
ENIAC – ALU, Reliability
• The basic clock cycle was 200 microseconds, or 5,000 cycles per second for operations on
the 10-digit numbers. In one of these cycles, ENIAC could write a number to a register, read
a number from a register, or add/subtract two numbers. A multiplication of a 10-digit
number by a d-digit number (for d up to 10) took d+4 cycles, so a 10- by 10-digit
multiplication took 14 cycles, or 2800 microseconds—a rate of 357 per second. If one of the
numbers had fewer than 10 digits, the operation was faster. Division and square roots took
13(d+1) cycles, where d is the number of digits in the result (quotient or square root). So a
division or square root took up to 143 cycles, or 28,600 microseconds—a rate of 35 per
second. If the result had fewer than ten digits, it was obtained faster.
• By the simple (if expensive) expedient of never turning the machine off, the engineers
reduced ENIAC's tube failures to the acceptable rate of one tube every two days. According to
a 1989 interview with Eckert the continuously failing tubes story was therefore mostly a
myth: "We had a tube fail about every two days and we could locate the problem within 15
minutes."
• In 1954, the longest continuous period of operation without a failure was 116 hours (close to
five days). This failure rate was remarkably low, and stands as a tribute to the precise
engineering of ENIAC.
von Neumann machine
• Stored-program concept
• Main memory storing programs and data
• ALU operating on binary data
• Control unit interpreting instructions from memory and executing them
• Input and output equipment operated by the control unit
• Princeton Institute for Advanced Studies (IAS) machine
• Completed 1952
Commercial Computers
• 1947 - Eckert-Mauchly Computer Corporation
• BINAC
• UNIVAC I (Universal Automatic Computer)
• Sold for $1 million
• A total of 48 machines were built
• US Bureau of Census 1950 calculations
• Became part of Sperry-Rand Corporation
• Late 1950s - UNIVAC II
• Faster
• More memory
IBM and DEC
• Punched-card processing equipment
• Office automation, electric type-writers etc.
• 1953 - the 701
• IBM’s first stored program computer
• Scientific calculations
• 1955 - the 702
• Business applications
• Led to the 700/7000 series
• In 1964, IBM invested $5B to build the IBM 360 series
of machines (mainframes).
• DEC developed PDP series (minicomputers)
Transistors
• Replaced vacuum tubes
• Smaller
• Cheaper
• Less heat dissipation
• Solid-state device
• Made from silicon (sand)
• Invented in 1947 at Bell Labs
• William Shockley, Bardeen and Brattain
Transistor-Based Computers
• Second-generation machines
• NCR & RCA produced small transistor machines
• IBM 7000
• DEC - 1957
• Produced the PDP-1
Microelectronics
• Literally - “small electronics”
• A computer is made up of gates, memory cells and
interconnections
• These can be manufactured on a semiconductor
• e.g. silicon wafer
Generations of Computers
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small-scale integration - 1965 on
• Up to 100 devices on a chip
• Medium-scale integration - to 1971
• 100-3,000 devices on a chip
• Large-scale integration - 1971-1977
• 3,000-100,000 devices on a chip
• Very-large-scale integration - 1978 to date
• 100,000-100,000,000 devices on a chip
• Ultra-large-scale integration
• Over 100,000,000 devices on a chip
Introduction
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together: The Concept of Memory Hierarchy
What’s Computer Architecture?
The attributes of a [computing] system as seen by the programmer, i.e.,
the conceptual structure and functional behavior, as distinct from the
organization of the data flows and controls, the logic design, and the
physical implementation.
Amdahl, Blaauw, and Brooks, 1964
What’s Computer Architecture?
• 1950s to 1960s: in a Computer Architecture course, computer
arithmetic was the main focus.
• 1970s to mid-1980s: instruction set design, especially ISAs
appropriate for compilers. (covered in Chapter 2)
• 1990s and later: design of the CPU, memory system, I/O system,
multiprocessors, instruction-level parallelism, etc.
The Task of a Computer Designer
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together: The Concept of Memory Hierarchy
[Figure: the design cycle. Evaluate existing systems for bottlenecks (using
benchmarks), simulate new designs and organizations (using workloads), and
implement the next-generation system, all constrained by implementation
complexity and technology trends.]
Technology and Computer Usage Trends
1.1 Introduction
1.2 The Task of a Computer
Designer
1.3 Technology and Computer
Usage Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting
Performance
1.6 Quantitative Principles of
Computer Design
1.7 Putting It All Together: The
Concept of Memory
Hierarchy
When building a Cathedral numerous
very practical considerations need
to be taken into account:
• available materials
• worker skills
• willingness of the client to pay the
price.
Similarly, Computer Architecture is
about working within constraints:
• What will the market buy?
• Cost/Performance
• Tradeoffs in materials and
processes
Trends
Gordon Moore (Founder of Intel) observed in 1965 that the number of transistors
that could be crammed on a chip doubles every year.
This has CONTINUED to be true since then. (Moore’s
law)
[Graph: transistors per chip, 1970-2005, on a log scale from about 10^3 to
10^8; points include the 4004, 8086, 80286, 386, 486, Power PC 601, Pentium,
Power PC G3, Pentium Pro, Pentium II and Pentium 3.]
Trends
Processor performance, as measured by the SPEC benchmarks, has also risen
dramatically.
[Graph: SPEC performance from 1987 to 2000, rising from a few hundred to about
5000; machines shown include the Sun-4/260, MIPS M/2000, IBM RS/6000,
DEC AXP/500, DEC Alpha 4/266, DEC Alpha 5/500, Alpha 6/833 and
DEC Alpha 21264/600.]
Trends
Memory Capacity (and Cost) have changed dramatically in the last 20 years.
[Graph: DRAM capacity in bits per chip, 1970-2000, growing from about 1 Kbit
toward 1 Gbit.]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
Trends
CPU speed has increased dramatically, but memory and disk speeds have
increased only a little. This has led to dramatic changes in
architecture, operating systems, and programming practices.

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years
Growth in Microprocessor Performance
• Growth in microprocessor performance since the mid-1980s
has been substantially higher than it was before.
• Figure 1.1 shows this trend. Processor performance has
increased by a factor of about 1600 in 16 years.
• There are two graphs in this figure. The one showing a growth
rate of about 1.58 per year is the real data; the other, showing
a rate of 1.35 per year, is an extrapolation from the period prior to
the mid-1980s.
• Prior to the mid-1980s, performance growth was largely technology
driven.
• The higher growth since then is attributable to architectural and
organizational ideas.
Performance Trends
Current taxonomy of computer systems
The early computers were just mainframe systems.
The current classification of computer systems is broadly:
• Desktop computing
• Low-end PCs
• High-end workstations
• Servers
• High throughput
• Greater reliability (with back-up)
• Embedded computers
• Hand-held devices (video-games, PDA, cell-phones)
• Sensor networks
• appliances
Summary of the three computing classes
Cost of IC
• A wafer is manufactured (typically circular) and the dies are cut from it.
(See Figure 1.8.)
• Dies are tested, packaged and shipped. (All dies are identical.)

Cost of IC = (Cost of die + Cost of testing + Cost of packaging and final test)
             / Final test yield

Cost of die = Cost of wafer / (Dies per wafer * Die yield)

Dies per wafer = π * (Wafer diameter / 2)² / Die area
                 − π * Wafer diameter / (2 * Die area)^0.5

Can you explain the correction factor?
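A short C sketch of these formulas; the wafer cost, diameter, die area and die yield below are hypothetical placeholder numbers, not values from the text, and die yield is taken as a given input rather than modeled.

#include <stdio.h>
#include <math.h>

static const double PI = 3.14159265358979;

/* Dies per wafer = pi*(d/2)^2 / A  -  pi*d / sqrt(2*A),
   where d = wafer diameter and A = die area.
   The second term is the correction for partial dies along the edge. */
static double dies_per_wafer(double diameter_cm, double die_area_cm2)
{
    return PI * (diameter_cm / 2) * (diameter_cm / 2) / die_area_cm2
         - PI * diameter_cm / sqrt(2 * die_area_cm2);
}

int main(void)
{
    /* Hypothetical numbers, for illustration only. */
    double wafer_cost = 5000.0;   /* dollars                */
    double diameter   = 30.0;     /* cm                     */
    double die_area   = 1.0;      /* cm^2                   */
    double die_yield  = 0.5;      /* fraction of good dies  */

    double dpw      = dies_per_wafer(diameter, die_area);
    double die_cost = wafer_cost / (dpw * die_yield);

    printf("dies per wafer: %.0f\n", dpw);
    printf("cost per good die: $%.2f\n", die_cost);
    return 0;
}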
Distribution of cost in a system
Measuring And Reporting Performance
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage
Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting
Performance
1.6 Quantitative Principles of Computer
Design
1.7 Putting It All Together: The Concept
of Memory Hierarchy
This section talks about:
1. Metrics: how do we describe the performance of a computer
numerically?
2. What tools do we use to find those metrics?
Metrics

Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
• Time to run the task (Exec Time)
• Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
• Throughput, bandwidth
Metrics - Comparisons
"X is n times faster than Y" means

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n

Speed of Concorde vs. Boeing 747
Throughput of Boeing 747 vs. Concorde
Metrics - Comparisons
Pat has developed a new product, "rabbit" about which she wishes to determine
performance. There is special interest in comparing the new product, rabbit to
the old product, turtle, since the product was rewritten for performance reasons.
(Pat had used Performance Engineering techniques and thus knew that rabbit
was "about twice as fast" as turtle.) The measurements showed:
Performance Comparisons

Product   Transactions/second   Seconds/transaction   Seconds to process transaction
Turtle    30                    0.0333                3
Rabbit    60                    0.0166                1
Which of the following statements reflect the performance comparison of rabbit and turtle?
o Rabbit is 100% faster than turtle.
o Rabbit is twice as fast as turtle.
o Rabbit takes 1/2 as long as turtle.
o Rabbit takes 1/3 as long as turtle.
o Rabbit takes 100% less time than
turtle.
o Rabbit takes 200% less time than
turtle.
o Turtle is 50% as fast as rabbit.
o Turtle is 50% slower than rabbit.
o Turtle takes 200% longer than rabbit.
o Turtle takes 300% longer than rabbit.
Metrics - Throughput
Each level of the system stack has its own natural throughput metric:
• Application: answers per month, operations per second
• Programming language, compiler: (millions of) instructions per second: MIPS
• ISA: (millions of) floating-point operations per second: MFLOP/s
• Datapath, control: megabytes per second
• Function units, transistors, wires, pins: cycles per second (clock rate)
Methods For Predicting Performance
• Benchmarks
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
• ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
• Trade-offs
Execution time
Execution time can be defined in different ways:
• wall-clock time, response time or elapsed time: latency to
complete the task. (includes disk accesses, memory accesses,
input/output activities etc.), including all possible overheads.
• CPU time: does not include the I/O, and other overheads.
• user CPU time: execution of application tasks
• system CPU time: execution of system tasks
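These distinctions map directly onto how a program is timed. Below is a sketch using standard POSIX calls (gettimeofday for wall-clock time, getrusage for user and system CPU time); the loop being timed is an arbitrary placeholder workload.

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct timeval wall0, wall1;
    struct rusage  use;
    volatile double x = 0.0;

    gettimeofday(&wall0, NULL);
    for (long i = 0; i < 50 * 1000 * 1000; i++)   /* placeholder workload */
        x += i * 1e-9;
    gettimeofday(&wall1, NULL);

    getrusage(RUSAGE_SELF, &use);
    printf("wall-clock (elapsed) time: %.3f s\n",
           tv_to_sec(wall1) - tv_to_sec(wall0));
    printf("user CPU time:             %.3f s\n", tv_to_sec(use.ru_utime));
    printf("system CPU time:           %.3f s\n", tv_to_sec(use.ru_stime));
    return 0;
}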
Throughput vs. efficiency
Benchmarks
SPEC: System Performance Evaluation Cooperative
• First Round 1989
• 10 programs yielding a single number ("SPECmarks")
• Second Round 1992
• SPECInt92 (6 integer programs) and SPECfp92 (14 floating-point programs)
• Compiler flags unlimited. March 93 of DEC 4000 Model 610:
  spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
  wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
  nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
• New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10
floating point)
• "Benchmarks useful for 3 years"
• Single flag setting for all programs: SPECint_base95, SPECfp_base95
Benchmarks
CINT2000 (Integer Component of SPEC CPU2000):

Program        Language   What It Is
164.gzip       C          Compression
175.vpr        C          FPGA Circuit Placement and Routing
176.gcc        C          C Programming Language Compiler
181.mcf        C          Combinatorial Optimization
186.crafty     C          Game Playing: Chess
197.parser     C          Word Processing
252.eon        C++        Computer Visualization
253.perlbmk    C          PERL Programming Language
254.gap        C          Group Theory, Interpreter
255.vortex     C          Object-oriented Database
256.bzip2      C          Compression
300.twolf      C          Place and Route Simulator

http://www.spec.org/osg/cpu2000/CINT2000/
Benchmarks
CFP2000 (Floating Point Component of SPEC CPU2000):

Program        Language     What It Is
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution
Benchmarks
Sample Results For SPECint2000
http://www.spec.org/osg/cpu2000/results/res2000q3/cpu2000-20000718-00168.asc
Intel OR840 (1 GHz Pentium III processor)

                   Base       Base       Base    Peak       Peak       Peak
Benchmark          Ref Time   Run Time   Ratio   Ref Time   Run Time   Ratio
164.gzip           1400       277        505*    1400       270        518*
175.vpr            1400       419        334*    1400       417        336*
176.gcc            1100       275        399*    1100       272        405*
181.mcf            1800       621        290*    1800       619        291*
186.crafty         1000       191        522*    1000       191        523*
197.parser         1800       500        360*    1800       499        361*
252.eon            1300       267        486*    1300       267        486*
253.perlbmk        1800       302        596*    1800       302        596*
254.gap            1100       249        442*    1100       248        443*
255.vortex         1900       268        710*    1900       264        719*
256.bzip2          1500       389        386*    1500       375        400*
300.twolf          3000       784        382*    3000       776        387*
SPECint_base2000                         438
SPECint2000                                                            442
Benchmarks
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products created when have:
• Good benchmarks
• Good ways to summarize performance
• Since sales are in part a function of performance relative to the
competition, vendors invest in improving the product as reported by
the performance summary.
• If the benchmarks/summary are inadequate, then vendors must choose
between improving the product for real programs and improving the product
to get more sales;
sales almost always win!
• Execution time is the measure of computer performance!
Benchmarks
How to Summarize Performance
Management would like to have one number.
Technical people want more:
1. They want to have evidence of reproducibility: there should be
enough information so that you or someone else can repeat the
experiment.
2. There should be consistency when doing the measurements
multiple times.

How would you report these results?

                    Computer A   Computer B   Computer C
Program P1 (secs)   1            10           20
Program P2 (secs)   1000         100          20
Total Time (secs)   1001         110          40
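One illustrative way to summarize the table above in C: total execution time, and the geometric mean of per-program execution-time ratios relative to a reference machine (the style of normalized summary SPEC uses). Computer A is picked as the reference only for illustration.

#include <stdio.h>
#include <math.h>

#define NPROG 2
#define NMACH 3

int main(void)
{
    const char  *mach[NMACH] = {"A", "B", "C"};
    /* secs[program][machine], from the table above */
    const double secs[NPROG][NMACH] = {
        {    1.0,  10.0, 20.0 },   /* Program P1 */
        { 1000.0, 100.0, 20.0 },   /* Program P2 */
    };

    for (int m = 0; m < NMACH; m++) {
        double total = 0.0, log_ratio = 0.0;
        for (int p = 0; p < NPROG; p++) {
            total     += secs[p][m];
            /* ratio relative to machine A as the reference */
            log_ratio += log(secs[p][m] / secs[p][0]);
        }
        printf("Computer %s: total = %6.0f s, geometric mean ratio vs. A = %.2f\n",
               mach[m], total, exp(log_ratio / NPROG));
    }
    return 0;
}

Note how the choice of summary matters: by total time C is clearly fastest, yet by the geometric mean of ratios B comes out no better than A even though its total time is much lower.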
Quantitative Principles of Computer Design
1.1 Introduction
1.2 The Task of a Computer
Designer
1.3 Technology and Computer Usage
Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting
Performance
1.6 Quantitative Principles of
Computer Design
1.7 Putting It All Together: The
Concept of Memory Hierarchy
Make the common case fast.
Amdahl’s Law:
Relates total speedup of a system
to the speedup of some portion of
that system.
Quantitative Design
Amdahl's Law
Speedup due to enhancement E:

    Speedup(E) = Execution_Time_Without_Enhancement / Execution_Time_With_Enhancement
               = Performance_With_Enhancement / Performance_Without_Enhancement

Suppose that enhancement E accelerates a fraction F of the task by a
factor S, and the remainder of the task is unaffected.
Quantitative Design - Amdahl's law

    ExTime_new = ExTime_old x [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
Quantitative Design - Amdahl's Law
• Floating-point instructions improved to run 2X faster, but only 10%
of actual instructions are FP:

    ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

    Speedup_overall = 1 / 0.95 = 1.053
Quantitative Design
Cycles Per Instruction

    CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

    CPU_Time = Cycle_Time * Σ (i = 1 to n) CPI_i * I_i

    CPI = Σ (i = 1 to n) CPI_i * F_i,   where F_i = I_i / Instruction_Count
    (the "instruction frequency"; I_i is the number of instructions of type i)

Invest resources where time is spent!
Quantitative Design
Suppose we have a machine where we can count the frequency with
which instructions are executed. We also know how many cycles it takes
for each instruction type.

Base Machine (Reg / Reg)
Op        Freq    Cycles   CPI(i)   (% Time)
ALU       50%     1        .5       (33%)
Load      20%     2        .4       (27%)
Store     10%     2        .2       (13%)
Branch    20%     2        .4       (27%)
Total CPI                  1.5

How do we get CPI(i)?  How do we get % time?
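As a cross-check on the table, a short C sketch that recomputes each class's CPI contribution, its share of execution time, and the total CPI from the frequencies and cycle counts above:

#include <stdio.h>

struct op { const char *name; double freq; double cycles; };

int main(void)
{
    /* Instruction mix from the Base Machine (Reg/Reg) table */
    struct op mix[] = {
        { "ALU",    0.50, 1 },
        { "Load",   0.20, 2 },
        { "Store",  0.10, 2 },
        { "Branch", 0.20, 2 },
    };
    int n = sizeof mix / sizeof mix[0];

    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += mix[i].freq * mix[i].cycles;      /* CPI = sum of F_i * CPI_i */

    for (int i = 0; i < n; i++)
        printf("%-6s  CPI(i) = %.1f   %% time = %.0f%%\n",
               mix[i].name,
               mix[i].freq * mix[i].cycles,
               100.0 * mix[i].freq * mix[i].cycles / cpi);

    printf("Total CPI = %.1f\n", cpi);
    return 0;
}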
Quantitative Design
Locality of Reference
Programs access a relatively small portion of the address
space at any instant of time.
There are two different types of locality:
Temporal Locality (locality in time): If an item is referenced,
it will tend to be referenced again soon (loops, reuse, etc.)
Spatial Locality (locality in space/location): If an item is
referenced, items whose addresses are close by tend to be
referenced soon (straight line code, array access, etc.)
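Spatial locality is easy to see in code: the two loops below compute the same sum over a matrix, but the first walks memory in C's row-major layout and reuses each fetched cache line, while the second strides across rows and typically runs noticeably slower on a real machine. The matrix size is an arbitrary illustrative choice.

#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void)
{
    double sum = 0.0;

    /* Good spatial locality: the innermost index walks consecutive addresses. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: each access jumps N*sizeof(double) bytes ahead. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);   /* keep the compiler from discarding the loops */
    return 0;
}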
The Concept of Memory Hierarchy
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage
Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting
Performance
1.6 Quantitative Principles of Computer
Design
1.7 Putting It All Together: The Concept
of Memory Hierarchy
Fast memory is expensive.
Slow memory is cheap.
The goal is to get the best
performance per dollar at a
particular price point.
Memory Hierarchy

                     Registers         Level 1 cache   Level 2 cache   Memory          Disk
Typical size         4 - 64            <16 Kbytes      <2 Mbytes       <16 Gigabytes   >5 Gigabytes
Access time          1 nsec            3 nsec          15 nsec         150 nsec        5,000,000 nsec
Bandwidth (MB/sec)   10,000 - 50,000   2000 - 5000     500 - 1000      500 - 1000      100
Managed by           Compiler          Hardware        Hardware        OS              OS/User
Memory Hierarchy
• Hit: data appears in some block in the upper level (example: Block X)
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block
Y)
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)
Memory Hierarchy
Registers → Level 1 cache → Level 2 cache → Memory → Disk

What is the cost of executing a program if:
• Stores are free (there's a write pipe)
• Loads are 20% of all instructions
• 80% of loads hit (are found) in the Level 1 cache
• 97% of loads hit in the Level 2 cache
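A sketch of how that calculation might be set up in C, interpreting the 97% as the fraction of L1 misses that hit in L2; the per-level latencies are hypothetical placeholders, since the slide does not give them.

#include <stdio.h>

int main(void)
{
    /* Assumptions stated on the slide */
    double load_frac = 0.20;   /* loads are 20% of all instructions          */
    double l1_hit    = 0.80;   /* 80% of loads hit in L1                     */
    double l2_hit    = 0.97;   /* 97% of the remaining loads hit in L2       */

    /* Hypothetical latencies in cycles (not from the slide) */
    double l1_time = 1, l2_time = 10, mem_time = 100;

    /* Average cycles spent on memory per load */
    double per_load = l1_hit * l1_time
                    + (1 - l1_hit) * (l2_hit * l2_time
                                      + (1 - l2_hit) * mem_time);

    printf("average memory cycles per load:        %.2f\n", per_load);
    printf("average memory cycles per instruction: %.2f\n", load_frac * per_load);
    return 0;
}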
Wrap Up
1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together: The Concept of Memory Hierarchy