ECE369: Fundamentals of Computer Architecture

advertisement
ECE369A: Fundamentals of Computer Architecture
ECE 369A
MWF 10:00 AM - 10:50 AM in CHVZ 405
Instructor
Teaching Assistant
Name: Ali Akoglu
Nirmal Kumbhare
Office: ECE 356-B
ECE232
Phone: (520) 626-5149
Email: akoglu@ece.arizona.edu
nirmalk@email.arizona.edu
Office Mondays 1:00-2:00PM,
Hours: Fridays 2:00-4:00pm
or
by appointment
TBD
ECE369
1
General Policies, Tips
•
•
•
•
•
•
•
D. A. Patterson and J.L. Hennessy, Computer Organization and
Design: The Hardware/Software Interface, Fifth Edition, Morgan
Kaufmann Publishers.
Prerequisite: ECE274 & Programming in C & Programming in
Veriolg/VHDL
Course involves class (3 credits) and laboratory (1 credit) activities.
Class activities will involve 3 to 4 assignments, quizzes, reading
assignments, 5 examinations, and a semester long project.
Laboratory activities will accompany the topics covered in the lectures
with hands on programming based assignments. A coordinated
sequence of lab assignments will help students design, program,
debug, and evaluate specific components of a datapath using the
Xilinx ISE and a FPGA board.
Laboratory activities (programming and debugging) and assignments
(15) will help students build the foundation for the semester long
datapath-design based project.
Final project will require students implement a synthesizable datapath
based on a given instruction set architecture and evaluate its
performance on a specific FPGA platform.
ECE369
2
General Policies, Tips
• NO LATE ASSIGNMENTS
• Make-ups may be arranged prior to the scheduled activity.
• Inquiries about graded material => within 3 days of receiving a
grade.
• You are encouraged to discuss the assignment specifications
with your instructor, your teaching assistant, and your fellow
students. However, anything you submit for grading must be
unique and should NOT be a duplicate of another source.
• Read before the class
• Participate and ask questions
• Manage your time
• Start working on assignments early
ECE369
3
Grading
Distribution of Components
Component
Percentage
Class Participation
10
Laboratory
10
Exam-I
8
Exam-II
10
Exam-III
10
Exam-IV
12
Exam-V
15
Project
25
Grades Scale
Percentage
Grade
90-100%
A
80-89%
B
70-79%
C
60-69%
D
Below 60%
E
Class Participation: Piazza participation (3%), Quizzes (7%) (in-class)
ECE369
4
Labs and Project
• Laboratory:
– Labs1-17: 2000pts
• Project
– Labs18-28: 2500pts
– Competition: !!!!
ECE369
5
Announcements
•
ECE232 LAB (Starts Aug 31)
– TA
•
Assignments & Project
– Pairs only !!!
– Who is my partner? (Aug 31)
•
Lecture notes on the web
– http://ece.arizona.edu/~ece369/
ECE369
6
Objective
• Understand “how computer works”
• Appreciate the design tradeoffs
ECE369
7
Chapter 1
ECE369
8
Introduction
•
This course is all about how computers work
•
But what do we mean by a computer?
– Different types: desktop, servers, embedded devices
– Different uses: automobiles, graphics, finance, genomics…
– Different manufacturers: Intel, Apple, IBM, Microsoft, Sun…
– Different underlying technologies and different costs!
ECE369
9
First Microprocessor Intel 4004, 1971
• 4-bit accumulator
architecture
• 8mm pMOS
• 2,300 transistors
• 3 x 4 mm2
• 750kHz clock
• 8-16 cycles/inst.
ECE369
10
Hardware
• Team from IBM building PC
prototypes in 1979
• Motorola 68000 chosen
initially, but 68000 was late
• 8088 is 8-bit bus version of
8086 => allows cheaper
system
• Estimated sales of 250,000
• 100,000,000s sold
[ Personal Computing Ad, 11/81]
ECE369
11
11
DYSEAC, first mobile computer!
• Carried in two tractor trailers, 12 tons + 8 tons
• Built for US Army Signal Corps
ECE369
12
End of Uniprocessors
Intel cancelled high
performance uniprocessor,
joined IBM and Sun for
multiple processors
ECE369
13
Processor Technology Trends
• Shrinking of transistor sizes: 250nm (1997) 
130nm (2002)  65nm (2007)  32nm (2010)
28nm(2011, AMD GPU, Xilinx FPGA)
22nm(2011, Intel Ivy Bridge, die shrink of the
Sandy Bridge architecture)  14nm (2014, Intel Core M,
Core I7)
• Transistor density increases by 35% per year and die size
increases by 10-20% per year… more cores!
ECE369
14
Power Consumption Trends
• Dyn power a activity x capacitance x voltage2 x frequency
• Capacitance per transistor and voltage are decreasing,
but number of transistors is increasing at a faster rate;
hence clock frequency must be kept steady
• Leakage power is also rising
• Power consumption is already between 100-150W in
high-performance processors today
 3.3GHz Intel core i7: 130 watts
ECE369
15
Recent Microprocessor Trends
Transistors: 1.43x / year
Cores: 1.2 - 1.4x
Performance: 1.15x
Frequency: 1.05x
Power: 1.04x
2004
2010
Source: Micron University Symp.
ECE369
16
What you will learn
•
How programs are translated into the machine language
– And how the hardware executes them
•
Hardware and Software Interface
•
Both Hardware and Software affect performance:
– Algorithm determines number of source-level statements
– Language/Compiler/Architecture determine machine instructions
– Processor/Memory determine how fast instructions are executed
•
What is parallel processing
ECE369
17
Below Your Program
•
•
Application software
– Written in high-level language
System software
– Compiler: translates HLL code to machine code
– Operating System: service code
• Handling input/output
• Managing memory and storage
• Scheduling tasks & sharing resources
•
Hardware
– Processor, memory, I/O controllers
ECE369
18
Inside the Processor (CPU)
•
•
•
Datapath: performs operations on data
Control: sequences datapath, memory, ...
Cache memory
– Small fast SRAM memory for immediate access to data
ECE369
19
Abstraction
•
Delving into the depths
reveals more information
•
An abstraction omits unneeded
detail,
helps us cope with complexity
ECE369
20
Instruction Set Architecture
•
A very important abstraction
– interface between hardware and low-level software
– standardizes instructions, machine language bit patterns, etc.
– advantage: different implementations of the same architecture
– disadvantage: sometimes prevents using new innovations
•
Modern instruction set architectures:
– IA-32, PowerPC, MIPS, SPARC, ARM, and others
ECE369
21
Sneak Peak (Lecture 18)
ECE369
22
Performance
•
•
•
•
Measure, Report, and Summarize
Make intelligent choices
See through the marketing hype
Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?
(e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
ECE369
23
Defining Performance
•
Which airplane has the best performance?
Boeing 777
Boeing 777
Boeing 747
Boeing 747
BAC/Sud
Concorde
BAC/Sud
Concorde
Douglas
DC-8-50
Douglas DC8-50
0
100
200
300
400
0
500
Boeing 777
Boeing 777
Boeing 747
Boeing 747
BAC/Sud
Concorde
BAC/Sud
Concorde
Douglas
DC-8-50
Douglas DC8-50
500
1000
4000
6000
8000 10000
Cruising Range (miles)
Passenger Capacity
0
2000
0
1500
Cruising Speed (mph)
100000 200000 300000 400000
Passengers x mph
ECE369
24
Computer Performance: TIME, TIME, TIME
•
Response Time (latency)
— How long does it take for my job to run?
— How long does it take to execute a job?
— How long must I wait for the database query?
•
Throughput
— How many jobs can the machine run at once?
— What is the average execution rate?
— How much work is getting done?
•
If we upgrade a machine with a new processor what do we increase?
•
If we add a new machine to the lab what do we increase?
ECE369
25
Execution Time
•
•
•
Elapsed Time
– counts everything (disk and memory accesses, I/O , etc.)
– a useful number, but often not good for comparison purposes
CPU time
– doesn't count I/O or time spent running other programs
– can be broken up into system time, and user time
Our focus: user CPU time
– time spent executing the lines of code that are "in" our program
ECE369
26
Book's Definition of Performance
•
For some program running on machine X,
PerformanceX = 1 / Execution timeX
•
"X is n times faster than Y"
PerformanceX / PerformanceY = n
•
Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
ECE369
27
Clock Cycles
•
Instead of reporting execution time in seconds, we often use cycles
seconds
cycles
seconds


program program
cycle
Clock period
Clock (cycles)
Data transfer
and computation
Update state
•
clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
ECE369
28
How to Improve Performance
seconds
cycles
seconds


program program
cycle
So, to improve performance (everything else being equal) you can
either (increase or decrease?)
________ the # of required cycles for a program, or
________ the clock cycle time or, said another way,
________ the clock rate.
ECE369
29
How many cycles are required for a program?
...
6th
5th
4th
3rd instruction
2nd instruction
Could assume that number of cycles equals number of instructions
1st instruction
•
time
This assumption is incorrect,
different instructions take different amounts of time on different machines.
Why? hint: remember that these are machine instructions, not lines of C code
ECE369
30
Different numbers of cycles for different instructions
time
•
Multiplication takes more time than addition
•
Floating point operations take longer than integer ones
•
Accessing memory takes more time than accessing registers
•
Important point: changing the cycle time often changes the number of
cycles required for various instructions (more later)
ECE369
31
Example
•
Our favorite program runs in 10 seconds on computer A, which has a
4 GHz. clock. We are trying to help a computer designer build a new
machine B, that will run this program in 6 seconds. The designer can use
new (or perhaps more expensive) technology to substantially increase the
clock rate, but has informed us that this increase will affect the rest of the
CPU design, causing machine B to require 1.2 times as many clock cycles as
machine A for the same program. What clock rate should we tell the
designer to target?"
•
Don't Panic, can easily work this out from basic principles
ECE369
32
Example
•
Our favorite program runs in 10
seconds on computer A, which has
a 4 GHz. clock. We are trying to
help a computer designer build a
new machine B, that will run this
program in 6 seconds. The
designer can use new (or perhaps
more expensive) technology to
substantially increase the clock
rate, but has informed us that this
increase will affect the rest of the
CPU design, causing machine B to
require 1.2 times as many clock
cycles as machine A for the same
program. What clock rate should
we tell the designer to target?"
seconds
cycles
seconds


program program
cycle
ECE369
33
Now we understand cycles
•
A given program will require
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
•
We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
– MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
ECE369
34
Back to the Same Formula, CPI (Cycles/Instruction)
seconds
cycles
seconds


program program
cycle
ECE369
35
CPI Example
•
Suppose we have two
implementations of the same
instruction set
architecture (ISA).
seconds
cycles
seconds


program program
cycle
For some program,
Machine A has a clock cycle time
of 250 ps and a CPI of 2.0
Machine B has a clock cycle time
of 500 ps and a CPI of 1.2
What machine is faster for this
program, and by how much?
ECE369
36
Let’s Complicate Things A Little bit…
A compiler designer is trying to
decide between two code
sequences for a particular
machine. Based on the hardware
implementation, there are three
different classes of instructions:
Class A, Class B, and Class C,
and they require one, two, and
three cycles (respectively).
Which sequence will be faster? How much?
seconds
cycles
seconds


program program
cycle
What is the CPI for each sequence?
The first code sequence has 5
instructions: 2 of A, 1 of B, and 2 of C
The second sequence has 6 instructions:
4 of A, 1 of B, and 1 of C.
ECE369
37
Component Analysis
ECE369
38
What is MIPS?
•
•
Instruction execution rate => higher is better
Issues:
– Can not compare processors with different instruction sets
– Varies between programs on the same processor
– Can vary inversely with the performance… ?
ECE369
39
MIPS Example
ECE369
40
Amdahl's Law
•
The performance enhancement of an improvement is limited by how
much the improved feature is used. In other words: Don’t expect an
enhancement proportional to how much you enhanced something.
•
Example:
"Suppose a program runs in 100 seconds on a machine, with
multiply operations responsible for 80 seconds of this time. How
much do we have to improve the speed of multiplication if we want
the program to run 4 times faster?"
How about making it 5 times faster?
ECE369
41
Amdahl’s Law
1. Speed up = 4
2. Old execution time = 100
3. New execution time = 100/4 = 25
4. If 80 seconds is used by the affected part =>
5. Unaffected part = 100-80 = 20 sec
6. Execution time new = Execution time unaffected +
Execution time affected / Improvement
7. 25= 20 + 80/Improvement
8. Improvement = 16
ECE369
42
Example: Speed up using parallel processors
Suppose an application is “almost all” parallel: 90%.
What is the speedup using 10, 100, and 1000 processors?
new time = old time * 10% + ( old time * 90% ) / 10
Speed up (P=10) = old time / new time
Speedup (P=10) = 5.3
Speedup (P = 100 ) = 9.1
Speedup ( P = 1000 ) = 9.9
ECE369
43
Amdahl’s Law Overview
ECE369
44
Example
• Suppose we are considering an enhancement that
runs 10times faster than the original machine but is
only usable 40% of the time. What is the overall
speedup gained by incorporating the enhancement?
Speedup 
1
0.4
 0.6
10
ECE369
 1.56
45
Example
•
Implementations of floating point square root vary significantly in
performance. Suppose FP square root (FPSQR) is responsible for
20% of the execution time of a critical benchmark on am machine,
One proposal is to add FPSQR hardware that will speed up this
operation by a factor of 10. The other alternative is just to try to make
all FP instructions run faster; FP instructions are responsible for a
total of 50% of the execution time. The design team believes that they
can make all FP instructions run two times faster with the same effort
as required for the fast square root. Compare those two design
alternatives.
Speedup FPSQR 
SpeedupFP 
1
0.2
 (1  0.2)
10
1
0.5
 (1  0.5)
2
ECE369
 1.22
 1.33
46
Benchmarks
•
•
•
Performance best determined by running a real application
– Use programs typical of expected workload
– Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused
SPEC (System Performance Evaluation Cooperative)
– companies have agreed on a set of real program and inputs
– valuable indicator of performance (and compiler technology)
– can still be abused
ECE369
47
Benchmark Games
•
An embarrassed Intel Corp. acknowledged Friday that a bug in a
software program known as a compiler had led the company to
overstate the speed of its microprocessor chips on an industry
benchmark by 10 percent. However, industry analysts said the
coding error…was a sad commentary on a common industry
practice of “cheating” on standardized performance tests…The error
was pointed out to Intel two days ago by a competitor, Motorola
…came in a test known as SPECint92…Intel acknowledged that it
had “optimized” its compiler to improve its test scores. The
company had also said that it did not like the practice but felt to
compelled to make the optimizations because its competitors were
doing the same thing…At the heart of Intel’s problem is the practice
of “tuning” compiler programs to recognize certain computing
problems in the test and then substituting special handwritten
pieces of code…
Saturday, January 6, 1996 New York Times
ECE369
48
SPEC CPU2000
ECE369
49
Problems with Benchmarking
•
Hard to evaluate real benchmarks:
– Machine not built yet, simulators too slow
– Benchmarks not ported
– Compilers not ready
Benchmark performance is composition of hardware and software
(program, input, compiler, OS) performance, which must all be
specified
ECE369
50
IR optimization example
C source code
compile
unoptimized IR
ECE369
51
Constant folding
cfold
ECE369
52
Constant propagation
constprop
ECE369
53
Dead code elimination
dce
ECE369
54
ECE369
Xe
on
3G
Hz
5
12
K
B(
o
)
d)
pt
im
iz e
B(
-x
W
-O
3xW
xB
(-O
0)
-O
3-
B
12
K
3G
Hz
5
MB
,
.6G
Hz
/1
MB
,
12
K
3G
Hz
5
.6G
Hz
/1
Xe
on
B1
B1
Xe
on
Compiler and Performance
Application Set
500
450
400
350
300
250
200
150
100
50
0
55
Scary Stuff
Op
ALU
Load
Store
Branch
Frequency
43%
21%
12%
24%
Cycle Count
1
1
2
2
Let’s say we were able to reduce the cycle count for
“Store” operations to 1 with a cost of slowing our
clock by15%. Is this new design feasible?
 n

  CPI i  IC i 
n
IC i


i 1



CPI original 
  CPI i  
Instruction _ Count i 1
 Instruction _ Count 
ECE369
56
Example(Contd.)
Old CPI = 0.43 + 0.21 + 0.12x2 + 0.24x2 = 1.38
New CPI = 0.43 + 0.21 + 0.12 + 0.24x2 = 1.24
Speed up = old time/new time
= (ICx oldCPI x T)/(IC x newCPI x 1.15T)
=0.97
so, don't make this change.
ECE369
57
Performance Measurement Overview
CPUtime  CPUclock _ cycles _ for _ the _ pogram  Clock _ Cycle _ Time
CPUtime 
CPI 
CPUclock _ cycles _ for _ the _ pogram
Clock _ Rate
CPUclock _ cycles _ for _ the _ pogram
IC
CPUtime  IC  CPI  Clock _ Cycle _ Time
CPUtime 
CPUtime 
IC  CPI
Clock _ Rate
Seconds Instructions ClcokCycles
Seconds



Pr ogram
Pr ogram
Instruction ClockCycle
ECE369
58
Performance Measurement Overview
n
CPU clock _ cycles _ for _ the _ program   CPIi  ICi
i 1
 n

CPUtime    CPI i  IC i   Clock _ Cycle _ Time
 i 1

 n

  CPI i  IC i 
n
IC i


i 1



overall _ CPI 
  CPI i  
Instruction _ Count i 1
 Instruction _ Count 
ECE369
59
Exercise
Suppose we have made the following measurements:
• Frequency of FP operations = 25%
• Average CPI of FP operations = 4.0
• Average CPI of other instructions = 1.33
• Frequency of FPSQR = 2%
• CPI of FPSQR = 20
Assume that the two design alternatives are:
a) to reduce the CPI of FPSQR to 2 or
b) to reduce the average CPI of all FP operations to 2.
Compare these alternatives.
ECE369
60
Solution
 n

  CPI i  IC i 
n
IC i


i 1



CPI original 
  CPI i  
Instruction _ Count i 1
 Instruction _ Count 
 4  25%  1.33  75%  2.0
CPI Saved _ on _ FPSQR  2%  (CPI oldFPSQR  CPI newFPSQR )  2%  (20  2)  0.36
CPI overall _ for _ new _ FPSQR  CPI original  CPI Saved _ on _ FPSQR  2  0.36  1.64
CPI overall _ for _ new _ FP  75%  1.33  25%  2.0  1.5
SpeedupFP 
CPUTimeoriginal
CPUTimenew

IC  ClockCycle  CPI original
IC  ClockCycle  CPI new
ECE369

CPI original
CPI new

2.00
 1.33
1.5
61
Multiprocessors
•
•
Multicore microprocessors
– More than one processor per chip
Requires explicitly parallel programming
– Compare with instruction level parallelism
• Hardware executes multiple instructions at once
• Hidden from the programmer
– Hard to do
• Programming for performance
• Load balancing
• Optimizing communication and synchronization
ECE369
62
Hardware/Software Tradeoffs
•
Which operations are directly supported in “hardware” and which are
synthesized in software?
•
How much hardware should you employ to speed up certain functions?
Advantages
Disadvantages
Hardware
Speed,
Consistency
Cost
No flexibility
Software
Flexibility,
easier/faster design
lower cost of errors
Slowness
ECE369
63
Remember
•
Performance is specific to a particular program
– Total execution time is a consistent summary of performance
•
For a given architecture performance increases come from:
–
–
–
–
•
increases in clock rate (without adverse CPI affects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or instruction count
Algorithm/Language choices that affect instruction count
Pitfall: expecting improvement in one aspect of a machine’s
performance to affect the total performance
ECE369
64
Download