Introduction to Architecture & Code Sequences
HM
Introduction to Computer Architecture
(10/7/2010)
Synopsis
 Historical Perspective
 Evolution of µP Performance
 Processor Performance Growth
 Key Messages about Computer Architecture
 Code Sequences for Different Architectures
 Dependences
 Score Board
 Bibliography
Historical Perspective
 Before 1940
- 1643 Pascal’s Arithmetic Machine
- About 1660 Leibniz Four Function Calculator
- 1710 -1750 Punched Cards by Bouchon, Falcon, Jacquard
- 1810 Babbage Difference Engine, unfinished
- 1835 Babbage Analytical Engine, also unfinished; 1st programmer Lady Ada Lovelace
- ca. 1890 Hollerith Tabulating Machine to help with the census in the USA
 Decade of 1940s:
- 1939 – 1942 John Atanasoff built programmable, electronic computer
at Iowa State University
- 1936-1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical
computers based on relays; colleague advised “tubes”
- 1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas,
built at University of Pennsylvania: Electronic Numeric Integrator and
Computer, 30 ton monster
 Decade of the 1950s:
- Univac Uniprocessor based on ENIAC, commercially viable
- Commercial systems sold by Remington Rand
- Mark III computer
 Decade of the 1960s:
- IBM’s 360 family co-developed with GE, Siemens, et al.
- Transistor replaces vacuum tube
- Burroughs stack machines, compete with GPR architectures
- All still von Neumann architectures, even stack architectures
- 1969 ARPANET
- Cache and VMM developed
 Decade of the 1970s:
- High-point of main-frames, birth of microprocessor
- High-end mainframes, e.g. CDC 6000s, IBM 360/67, and 370 series
- Caches, VMM common on mainframes
- Intel 4004, Intel 8080, single-chip microprocessors
- Programmable controllers
- Mini-computers, PDP 11, HP 3000
- Expensive memories still magnetic-core based
- Height of Digital Equipment Corp. (DEC)
- Birth of personal computers, which DEC misses
 Decade of the 1980s:
- Decrease of mini-computer use
- 32-bit computing even on minis
- Multitude of Supercomputer manufacturers
- Architecture advances: fast caches, larger caches
- Compiler complexity: trace-scheduling, VLIW
- Workstations common: Sun, Apollo, HP, and DEC trying to catch up
 Decade of the 1990s:
- Architecture advances: superscalar-pipelined, speculative execution,
out-of-order execution
- Powerful desktops
- End of mini-computer and of many super-computer manufacturers
- Microprocessor as powerful as early supercomputers
- Cheaper memory technology
- Consolidation of computer companies into a small # of large ones
- Numerous supercomputer corporations close
 Decade of the 2000s:
- Architecture advances: Multi-core CPUs
- Multi-threaded cores
- 64-bit computing and addressing
- Heterogeneous computer grids
Evolution of µP Performance

                          1970s         1980s        1990s          2000+
Transistor Count          10k - 100k    100k - 1M    1M - 100M      1B
Clock Frequency           0.2 - 2 MHz   2 - 20 MHz   0.02 - 1 GHz   10 GHz
Instructions/cycle (ipc)  <= 0.1        0.1 - 0.9    0.9 - 2.0      >= 10
MFLOPs                    < 0.2         0.2 - 20     20 - 2,000     100,000
Processor Performance Growth
 Moore’s Law, see [1]
 Observation made in 1965 by Gordon Moore --co-founder of Intel-- that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future.
 In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore’s Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore’s Law to hold for another two decades.
 Others coin a more general law, stating that “the circuit density increases predictably over time.”
 So far (2010), Moore’s Law has held true since ~1968.
 Some Intel fellows believe an end to Moore’s Law will be reached around 2018, due to
1. physical limitations in manufacturing transistors from semiconductor material
2. limitations of adequately cooling small masses in confined areas
3. accessing the number of pins on such small surfaces
 This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:
- cars would travel at 2,400,000 mph and get 600,000 MPG
- air travel from LA to NYC would be at 36,000 Mach, i.e. take 0.5 seconds
Key Messages about Computer Architecture
1: Memory is Always Slow, Way Too Slow!
The inner core of the processor, the CPU or the µP, is getting faster at a
steady rate. Access to memory is also getting faster over time, but at a
slower rate. This rate differential has existed for quite some time, with the
strange effect that fast processors have to rely on slow memories. It is not
uncommon that on an MP server the processor has to wait > 100 cycles before a
memory access completes. On a Multi-Processor the bus protocol is more complex,
due to snooping, backing-off, arbitration, etc., and thus the number of cycles
to complete an access can grow quite high.
Discarding conventional memory altogether, relying only on cache-like
memories, is NOT yet an option, due to the price differential between cache
and regular DRAM. Another way of seeing this: using solely reasonably-priced
cache memories (say, at 10 times the cost of regular memory) is not
possible, since the resulting physical address space would be too small.
Corollary 1: Almost all intellectual efforts in high-performance computer
architecture focus on reducing the performance disparity of fast processors
vs. slow memories. All else seems easy compared to this fundamental
problem!
[Figure: processor vs. DRAM performance, 1980-2002, log scale 1 to 1000. CPU performance (“Moore’s Law”) grows at ~60% per year, DRAM performance at ~7% per year, so the gap widens over time. Source: David Patterson, UC Berkeley]
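The gap can be made visible with a small measurement. The following is a minimal sketch (not part of these notes) that times dependent loads chasing a pointer cycle, once through a small, cache-resident array and once through a large, DRAM-resident one. The array sizes and step count are arbitrary choices and absolute numbers vary per machine, but the large ratio echoes the ">100 cycles per miss" claim above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STEPS 20000000L

/* Walk a single random cycle of n pointers (Sattolo's shuffle below
 * guarantees one cycle); every step is a dependent load, so the average
 * time per step approximates the load latency for that footprint. */
static double ns_per_load(size_t n)
{
    size_t *next = malloc(n * sizeof *next);
    size_t i, j, t, p;
    volatile size_t sink;
    clock_t t0, t1;

    for (i = 0; i < n; i++) next[i] = i;
    for (i = n - 1; i > 0; i--) {          /* Sattolo: j < i gives one cycle */
        j = (size_t)rand() % i;
        t = next[i]; next[i] = next[j]; next[j] = t;
    }
    p = 0;
    t0 = clock();
    for (long s = 0; s < STEPS; s++) p = next[p];
    t1 = clock();
    sink = p; (void)sink;                  /* keep the load chain alive */
    free(next);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / STEPS;
}

int main(void)
{
    printf("cache-resident array (8 KB):  %6.2f ns per load\n", ns_per_load((size_t)1 << 10));
    printf("DRAM-resident array (128 MB): %6.2f ns per load\n", ns_per_load((size_t)1 << 24));
    return 0;
}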
2: Clustering of Events - What Happened Just Now, Will Soon Happen
Again!
A strange thing happens during program execution: Seemingly unrelated
events tend to cluster. For example, memory accesses tend to concentrate a
majority of their referenced addresses onto a small domain of the total
address space. Even if all of memory is accessed, during some periods of
time this phenomenon of clustering seems immutable. While one memory
access seems independent of another, they both happen to fall onto the
same page (or working set of pages, or cache line). We call this
phenomenon Data Locality! We will see later that architects exploit locality to
speed up Virtual Memory Management.
Similarly, hash functions tend to concentrate a disproportionately large
number of keys onto a small number of hash values, i.e. table entries. Here
the incoming search key (say, source program identifier “i”) is mapped into
an index, but the next, completely unrelated key, happens to map onto the
same index. In an extreme case, this may render a hash lookup slower than
a sequential search.
This clustering happens in all disparate modules of the processor
architecture. For example, when a data cache is used to speed-up memory
accesses by having a copy of frequently used data in a faster memory unit, it
happens that a small cache suffices. This is again due to Data Locality
(spatial and temporal). Data that have been accessed recently will again be
accessed in the near future, or at least data that live close by will be
accessed in the near future. Thus they happen to reside in the same cache
line. Architects do exploit this to speed up execution, while keeping the
incremental cost for HW contained. Here clustering is exploited as a valuable
opportunity.
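A minimal sketch (not in the handout) of spatial locality at work: the same 4096 x 4096 element sum is computed twice, once walking consecutive addresses (row-major, mostly cache hits) and once striding a full row apart per access (column-major, mostly cache misses). The matrix size is an arbitrary choice; on typical hardware the second loop is several times slower even though it does identical arithmetic.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    double s1 = 0.0, s2 = 0.0;
    clock_t t0, t1, t2;
    size_t i, j;

    for (i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    t0 = clock();
    for (i = 0; i < N; i++)              /* row-major: consecutive addresses */
        for (j = 0; j < N; j++)
            s1 += a[i * N + j];
    t1 = clock();

    for (j = 0; j < N; j++)              /* column-major: stride of N doubles */
        for (i = 0; i < N; i++)
            s2 += a[i * N + j];
    t2 = clock();

    printf("row-major sum:    %.0f in %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major sum: %.0f in %.3f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}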
Corollary 2: If this clustering of events (AKA locality) did not happen, the
whole architectural ideas of caches (rendering slow memories fast), of
branch predictors, and of VMM (making small memories appear large) would
not work at all. It is due to great locality that major performance bottlenecks
and resource limitations can be overcome.
3: Heat is Bad – Design CPU for Low-Voltage, Low Current, Low Heat
Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus
generally “is good”. Other performance parameters, such as memory access
speed, peripheral access, etc., do not scale with the clock speed. Still,
increasing the clock to a higher rate is desirable. But this comes at the cost
of higher current and thus more heat generated in the identical physical
space, the geometry of the silicon processor or chipset. However, a Silicon
part acts like a resistor that conducts better as it gets warmer (negative
temperature coefficient, or NTC). Since the power supply is a
constant-current source, a lower resistance causes a lower voltage, shown as
VDroop in the figure below (© Anandtech, see [2]).
This in turn means the voltage has to be increased artificially, to sustain the
clock rate, creating more heat, ultimately leading to self-destruction of the
part. Great efforts are being made to increase the clock speed, requiring
more voltage, while at the same time reducing heat generation. Current
technologies include sleep-states of the Silicon part (processor as well as
chip-set), and Turbo boost mode, to contain heat generation while boosting
clock speed just at the right time.
Corollary 3: It is good that, to date (2010), Silicon manufacturing technologies
allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would
become larger, more expensive, and above all: hotter.
Code Sequences for Different Architectures
Goals, Core Ideas
 Analyze various levels of complexity; same source, various target systems
 Interaction between high-level language, compiler, and target architecture
 Sample language rules: operator precedence, associativity, commutativity
 Derive measures of architectural quality: is one architecture (ISA) better than another?
Example 1: Object Code Sequence without Optimization
 Strict left-to-right translation, no optimization at all!
 Consider non-commutative subtraction and division operators
 No common subexpression elimination (CSE), no register reuse
 Conventional operator precedence
 For Single-Accumulator SAA, Three-Address GPR, and Stack Architectures
 Sample source snippet: d ← ( a + 3 ) * b - ( a + 3 ) / c
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2
 1   ld   a               add  r1, a, #3         push a
 2   add  #3              mult r2, r1, b         pushlit #3
 3   mult b               add  r3, a, #3         add
 4   st   temp1           div  r4, r3, c         push b
 5   ld   a               sub  d, r2, r4         mult
 6   add  #3                                     push a
 7   div  c                                      pushlit #3
 8   st   temp2                                  add
 9   ld   temp1                                  push c
10   sub  temp2                                  div
11   st   d                                      sub
12                                               pop d
Observations, Example 1
 Three-address code looks shortest, w.r.t. number of instructions
 Maybe optical illusion, must also consider number of bits for instructions,
and consider: How many registers are available?
 Must consider number of I-fetches, operand fetches
 Must consider total number of stores
 Numerous memory accesses on SAA due to temporary values held in
memory
 Most memory accesses on the SAA, since everything requires a memory
access, even multiple memory accesses for a single arithmetic
computation!!
 Architect considers designing a “reverse subtract” operation for the SAA, to save
some stores and loads
 Three-Address architecture is immune to the commutativity constraint, since
operands may be placed in registers in either order
 No need for reverse-operation opcodes in the Three-Address architecture
 Decide in the Three-Address architecture how to encode operand types
 Numerous stack instructions, since each operand fetch is a separate
instruction (a small evaluator sketch follows this list)
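To make the stack-machine column concrete, here is a minimal sketch (not part of these notes) of a toy evaluator that executes exactly the twelve stack instructions of Example 1; the enum names and the tiny instruction encoding are invented purely for illustration.

#include <stdio.h>

enum op { PUSH, PUSHLIT, ADD, MULT, DIV, SUB, POP };

struct insn { enum op op; int arg; };   /* arg: variable index or literal */

int main(void)
{
    /* variables: 0=a, 1=b, 2=c, 3=d; sample values for a, b, c */
    int var[4] = { 5, 4, 2, 0 };

    /* d <- (a + 3) * b - (a + 3) / c, translated blindly left to right */
    struct insn code[] = {
        {PUSH, 0}, {PUSHLIT, 3}, {ADD, 0}, {PUSH, 1}, {MULT, 0},
        {PUSH, 0}, {PUSHLIT, 3}, {ADD, 0}, {PUSH, 2}, {DIV, 0},
        {SUB, 0}, {POP, 3}
    };

    int stack[16], sp = 0;              /* sp points to the first free slot */
    for (size_t i = 0; i < sizeof code / sizeof code[0]; i++) {
        struct insn in = code[i];
        switch (in.op) {
        case PUSH:    stack[sp++] = var[in.arg];  break;
        case PUSHLIT: stack[sp++] = in.arg;       break;
        case POP:     var[in.arg] = stack[--sp];  break;
        /* binary ops consume the two topmost words and push the result;
         * operand order matters for the non-commutative SUB and DIV */
        case ADD:  sp--; stack[sp-1] = stack[sp-1] + stack[sp]; break;
        case MULT: sp--; stack[sp-1] = stack[sp-1] * stack[sp]; break;
        case DIV:  sp--; stack[sp-1] = stack[sp-1] / stack[sp]; break;
        case SUB:  sp--; stack[sp-1] = stack[sp-1] - stack[sp]; break;
        }
    }
    printf("d = %d\n", var[3]);         /* (5+3)*4 - (5+3)/2 = 32 - 4 = 28 */
    return 0;
}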
Example 2: Using CSE + Register Re-Use Optimization
 Eliminate the common subexpression
 Compiler handles left-to-right order for non-commutative operators on the SAA
 Best possible code for: d ← ( a + 3 ) * b - ( a + 3 ) / c
No   Single-Accumulator   Three-Address GPR      Stack Machine
                          dest ← op1 op op2
 1   ld   a               add  r1, a, #3         push a
 2   add  #3              mult r2, r1, b         pushlit #3
 3   st   temp1           div  r1, r1, c         add
 4   div  c               sub  d, r2, r1         dup
 5   st   temp2                                  push b
 6   ld   temp1                                  mult
 7   mult b                                      xch
 8   sub  temp2                                  push c
 9   st   d                                      div
10                                               sub
11                                               pop d
Observations, Example 2
 Single Accumulator Architecture (SAA) optimized still needs temporary
storage, uses temp1 for common subexpression; has no other register!!
 SAA could use negate instruction or reverse subtract
 Register-use optimized for Three-Address architecture
 Common sub-expression optimized on Stack Machine by duplicating,
exchanging, etc.
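For readers who know C better than compiler internals, the effect of CSE can be shown at the source level. The sketch below (not from the handout) is what the optimization conceptually does before code emission: ( a + 3 ) is computed once and kept in a temporary, just as the optimized sequences above keep it in temp1, in r1, or in the duplicated stack word.

/* without CSE: the common subexpression (a + 3) is evaluated twice */
double naive(double a, double b, double c)
{
    return (a + 3) * b - (a + 3) / c;
}

/* with CSE: the compiler (or programmer) computes it once */
double with_cse(double a, double b, double c)
{
    double t = a + 3;                   /* common subexpression, held in a temp */
    return t * b - t / c;
}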
Example 3: Interaction of Operator Precedence & Architecture
 Analyze similar source expressions, but with reversed operator precedence
 One operator sequence associates right-to-left, due to precedence: Expression 1
 Compiler uses commutativity
 The other, Expression 2, associates left-to-right, due to explicit parentheses
 Use simple-minded code model: no cache, no optimization
 Will there be advantages/disadvantages inherent in any architecture?
 Expression 1 is: e ← a + b * c ^ d
No   Single-Accumulator   Three-Address GPR      Stack Machine (implied operands)
                          dest ← op1 op op2
 1   ld   c               expo r1, c, d          push a
 2   expo d               mult r1, b, r1         push b
 3   mult b               add  e, a, r1          push c
 4   add  a                                      push d
 5   st   e                                      expo
 6                                               mult
 7                                               add
 8                                               pop e
 Expression 2 is: f ← ( ( g + h ) * i ) ^ j
 Here the operators associate left-to-right
No   Single-Accumulator   Three-Address GPR      Stack Machine (implied operands)
                          dest ← op1 op op2
 1   ld   g               add  r1, g, h          push g
 2   add  h               mult r1, i, r1         push h
 3   mult i               expo f, r1, j          add
 4   expo j                                      push i
 5   st   f                                      mult
 6                                               push j
 7                                               expo
 8                                               pop f
Observations, Interaction of Precedence and Architecture
 Software eliminates constraints imposed by precedence: looking ahead
 Execution times identical, unless blurred by secondary effect, see cache
example below
 Conclusion: all architectures handle arithmetic and logic operations well,
except when register starvation causes superfluous spill code
Timing Analysis: For Stack Machine with 2-Word Cache
 Stack Machine with no registers is inherently slow: Memory Accesses!!!
 Implement a few top-of-stack elements via HW shadow registers → a cache
 Measure equivalent code sequences with/without consideration for the cache
 Top-of-stack register tos points to the last valid word of the physical stack
 Two shadow registers may hold 0, 1, or 2 true top-of-stack words
 Top-of-stack cache counter, tcc, specifies the number of shadow registers
actually in use, needed to know the “real” top of stack
 Thus tos plus tcc jointly specify what and where the true top of stack is
[Figure: memory-resident stack with pointer tos marking the last valid word (words beyond it are free), two top-of-stack shadow registers in hardware, and counter tcc = 0, 1, or 2 giving the number of shadow registers currently in use.]
 Timings for push, pushlit, add etc., and pop operations depend on tcc
 Operations in shadow registers are fastest, arbitrarily here 1 cycle, which
includes the register access and the operation itself
 Each memory access costs 2 added cycles; in reality >> 10 cycles
 For stack changes use some defined policy, e.g. keep shadow registers
50% full, or keep always full, or keep always empty
 Table below refines timings for stack with shadow registers
Operation    Cycles   tcc before   tcc after   tos change   comment
add          1        2            tcc = 1     no change
add          1+2      1            tcc = 1     tos--        underflow?
add          1+2+2    0            tcc = 1     tos -= 2     underflow?
push x       2        0, 1         tcc++       no change
push x       2+2      2            tcc = 2     tos++        overflow?
pushlit #3   1        0, 1         tcc++       no change
pushlit #3   1+2      2            tcc = 2     tos++        overflow?
pop y        2        1, 2         tcc--       no change
pop y        2+2      0            tcc = 0     tos--        underflow?
 “Add” is representative for any arithmetic operation
 Code emission for: a + b * c ^ ( d + e * f ^ g )
 Let + and * be commutative, due to programming language rule
 Architecture here has 2 shadow registers, and the compiler knows this
 Note: no sub and no div operation, to avoid the operand-order question 
 #   Blind left to right   Cycles     Smart cache use   Cycles
 1   push a                2          push f            2
 2   push b                2          push g            2
 3   push c                4          expo              1
 4   push d                4          push e            2
 5   push e                4          mult              1
 6   push f                4          push d            2
 7   push g                4          add               1
 8   expo                  1          push c            2
 9   mult                  3          expo              1
10   add                   3          push b            2
11   expo                  3          mult              1
12   mult                  3          push a            2
13   add                   3          add               1
     Total                 40         Total             20
Observations, Stack Machine with 2-Word Cache
 Blind code emission costs 40 cycles; not taking advantage of the tcc
knowledge costs performance (a small cycle-count sketch follows this list)
 Smart code emission with shadow-register consideration costs 20 cycles
 Execution of the code without the cache would be:
7 pushes = 7 * 4 cycles = 28 cycles
6 memory operations = 6 * 6 cycles = 36 cycles
Total = 64 cycles
 True penalty for memory access is much worse in practice, on the order of
tens of cycles, and it will get worse over time
 Caveat: a tremendous speed-up is generally an indicator that you are
dealing with a system exhibiting severe flaws
 Such is the case here: the return on investment for the 2-register hardware
investment is enormous
 Stack Machine can be fast, if purity of top-of-stack access is sacrificed for
performance
 Note that indexing, looping, indirection, call/return are not addressed
here
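The two totals can be reproduced mechanically. Below is a minimal sketch (not part of the handout) that applies the timing model of the table above -- 1 cycle per operation executed in the shadow registers, plus 2 cycles for every memory access, with 2 shadow registers -- to both emission orders and prints 40 and 20 cycles. The instruction spelling and the helper name cost() are invented for illustration.

#include <stdio.h>
#include <string.h>

/* cost of one instruction under the shadow-register model,
 * updating the shadow-register count tcc (0, 1, or 2) */
static int cost(const char *insn, int *tcc)
{
    int c;
    if (strncmp(insn, "push", 4) == 0) {    /* push x: one memory read */
        c = 2;
        if (*tcc == 2) c += 2;              /* shadow regs full: spill one word */
        else (*tcc)++;
    } else {                                /* binary ALU operation */
        int from_mem = 2 - *tcc;            /* operands not in shadow registers */
        c = 1 + 2 * from_mem;
        *tcc = 1;                           /* result sits in a shadow register */
    }
    return c;
}

int main(void)
{
    const char *blind[] = { "push a", "push b", "push c", "push d",
                            "push e", "push f", "push g",
                            "expo", "mult", "add", "expo", "mult", "add" };
    const char *smart[] = { "push f", "push g", "expo", "push e", "mult",
                            "push d", "add", "push c", "expo",
                            "push b", "mult", "push a", "add" };
    int tcc, total, i;

    for (tcc = 0, total = 0, i = 0; i < 13; i++) total += cost(blind[i], &tcc);
    printf("blind left-to-right: %d cycles\n", total);   /* 40 */

    for (tcc = 0, total = 0, i = 0; i < 13; i++) total += cost(smart[i], &tcc);
    printf("smart cache use:     %d cycles\n", total);   /* 20 */
    return 0;
}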
Dependences
Inter-instruction dependencies arise. They are known in the technical jargon
as dependences: data dependence, anti-dependence, etc. One instruction
computes a result, the other needs that result. Or, one instruction uses data
which, after the use, may be recomputed.
 True Dependence, AKA Data Dependence
r3 ← r1 op r2
r5 ← r3 op r4
(Read After Write, RAW)
 Anti-Dependence
r3 ← r1 op r2
r1 ← r5 op r4
(Write After Read, WAR)
 Output Dependence
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
(Write After Write, WAW, read in between)
 Control Dependence
if ( condition1 ) {
    r3 = r1 op r2;
} else {
    r5 = r3 op r4;
} // end if
write( r3 );
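A compiler or a hardware score board detects the register-to-register cases by comparing register names between instruction pairs. The following minimal sketch (not from the handout) classifies RAW, WAR, and WAW pairs for the three-instruction output-dependence example above; the struct layout is an invented illustration.

#include <stdio.h>

struct insn { int dest, src1, src2; };     /* register numbers */

int main(void)
{
    /* r3 <- r1 op r2 ; r5 <- r3 op r4 ; r3 <- r6 op r7 */
    struct insn code[] = { {3, 1, 2}, {5, 3, 4}, {3, 6, 7} };
    int n = sizeof code / sizeof code[0];

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            struct insn a = code[i], b = code[j];
            if (b.src1 == a.dest || b.src2 == a.dest)       /* later read of earlier write */
                printf("RAW (true) dependence:   insn %d -> insn %d on r%d\n",
                       i + 1, j + 1, a.dest);
            if (b.dest == a.src1 || b.dest == a.src2)       /* later write of earlier read */
                printf("WAR (anti) dependence:   insn %d -> insn %d on r%d\n",
                       i + 1, j + 1, b.dest);
            if (b.dest == a.dest)                           /* later write of earlier write */
                printf("WAW (output) dependence: insn %d -> insn %d on r%d\n",
                       i + 1, j + 1, b.dest);
        }
    }
    return 0;
}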
Register Renaming
Only some dependences constitute real dependences, AKA data dependences.
The others are artifacts of insufficient resources, generally register
resources. If more registers were available, then replacing the
conflicting names with new (renamed) registers could make the conflict
disappear. Anti- and Output-Dependences are false dependences. Assume
all registers are live afterwards.
Compilers need to be aware of such dependences during code emission, to
ensure data correctness.
Similarly, the HW score board needs to be aware of such dependences (register
to register), to ensure accurate timing of register usage, i.e. to know at which
moment a register is actually usable.
Original uses and dependences:
L1: r1 ← r2 op r3        L1, L2  true-Dep with r1
L2: r4 ← r1 op r5        L1, L3  output-Dep with r1
L3: r1 ← r3 op r6        L1, L4  anti-Dep with r3
L4: r3 ← r1 op r7        L2, L3  anti-Dep with r1
                         L3, L4  true-Dep with r1
                         L3, L4  anti-Dep with r3
Registers renamed (first r1 → r10, first r3 → r30; r1, r3, ... are live
afterwards), new dependences:
L1: r10 ← r2  op r30     L1, L2  true-Dep with r10
L2: r4  ← r10 op r5      L3, L4  true-Dep with r1
L3: r1  ← r30 op r6
L4: r3  ← r1  op r7
The two remaining true-dependence chains are independent: the code runs in
half the time with renamed registers!
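Hardware renamers usually implement this with a map from architectural to physical registers: every write gets a fresh physical register and every read goes through the map. The sketch below (not from the handout) applies that standard scheme to the four-instruction example; the physical names it picks (p8, p9, ...) differ from the hand-renamed r10/r30 version above, but the result is the same two independent true-dependence chains.

#include <stdio.h>

#define ARCH_REGS 8

struct insn { int dest, src1, src2; };

int main(void)
{
    /* L1: r1<-r2,r3  L2: r4<-r1,r5  L3: r1<-r3,r6  L4: r3<-r1,r7 */
    struct insn code[] = { {1,2,3}, {4,1,5}, {1,3,6}, {3,1,7} };
    int n = sizeof code / sizeof code[0];

    int map[ARCH_REGS];                 /* architectural -> physical */
    int next_phys = ARCH_REGS;          /* fresh physical names start above r7 */
    for (int r = 0; r < ARCH_REGS; r++) map[r] = r;   /* incoming values */

    for (int i = 0; i < n; i++) {
        int s1 = map[code[i].src1];     /* read sources through the map */
        int s2 = map[code[i].src2];
        int d  = next_phys++;           /* fresh register for every write */
        map[code[i].dest] = d;
        printf("L%d: p%-2d <- p%-2d op p%-2d   (was r%d <- r%d op r%d)\n",
               i + 1, d, s1, s2,
               code[i].dest, code[i].src1, code[i].src2);
    }
    return 0;
}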
Score Board
 Purpose of the score board --an array of programmable bits sb[]-- is to
manage HW resources, specifically registers
 Single-bit array, one bit associated with a specific register, associated
by index = name: sb[i] belongs to reg ri
   Only if sb[i] = 0 does register ri hold valid data, i.e. the register is NOT
   in the process of being written with new data; instead the current data are valid
   If bit i is set, i.e. if sb[i] = 1, then register ri has stale data
 In-order execution:
   rd ← rs op rt
   if sb[rs] or sb[rt] is set → RAW dependence, hence stall
   if sb[rd] is set → WAW dependence, hence stall
   else dispatch the instruction and set sb[rd]
 To allow out-of-order (ooo) execution, upon computing the value of rd:
   update rd, and clear sb[rd]
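A minimal sketch (not part of these notes) of the dispatch test described above: sb[] is checked for the two sources (RAW) and the destination (WAW), the instruction stalls until the pending bits clear, and sb[rd] is set on dispatch and cleared on completion. The 3-cycle latency and the cycle bookkeeping are invented for illustration.

#include <stdio.h>
#include <stdbool.h>

#define NREGS   8
#define LATENCY 3                       /* assumed execution latency */

struct insn { int rd, rs, rt; };

int main(void)
{
    /* r3<-r1,r2 ; r5<-r3,r4 (RAW on r3) ; r3<-r6,r7 (WAW on r3) */
    struct insn code[] = { {3,1,2}, {5,3,4}, {3,6,7} };
    int n = sizeof code / sizeof code[0];

    bool sb[NREGS]    = { false };      /* sb[i] = 1: ri is being written */
    int  ready[NREGS] = { 0 };          /* cycle at which ri becomes valid */
    int  cycle = 0;

    for (int i = 0; i < n; i++) {
        struct insn in = code[i];
        /* stall while a source (RAW) or the destination (WAW) is pending */
        while (sb[in.rs] || sb[in.rt] || sb[in.rd]) {
            cycle++;
            for (int r = 0; r < NREGS; r++)     /* completions clear sb[] */
                if (sb[r] && ready[r] <= cycle) sb[r] = false;
        }
        printf("cycle %d: dispatch r%d <- r%d op r%d\n",
               cycle, in.rd, in.rs, in.rt);
        sb[in.rd]    = true;            /* mark destination as pending */
        ready[in.rd] = cycle + LATENCY;
        cycle++;                        /* one dispatch per cycle */
    }
    return 0;
}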
Bibliography
1. http://en.wikipedia.org/wiki/Moore's_law
2. http://www.tomshardware.com/de/foren/240300-6-intel-cpus-mythosstunde-wahrheit
3. Gibbons, P. B., and Steven Muchnick [1986]. “Efficient Instruction
Scheduling for a Pipelined Architecture”, ACM SIGPLAN Notices, Proceedings
of the ’86 Symposium on Compiler Construction, Volume 21, Number 7, July
1986, pp. 11-16.