Trends in High-Performance Microprocessor Design

by Artur Klauser
Artur Klauser
TU Graz, Telematik, Dipl.-Ing. 1994
Univ. of Colorado at Boulder, Computer Science, Ph.D. 1999
Intel Microprocessor Research Lab (MRL), 1997
DEC Western Research Lab (WRL), 1998
Compaq Alpha Advanced Development (VSSAD), since 1999
Email: Artur.Klauser@computer.org
Abstract
General-purpose microprocessors are the devices that have fueled
the personal computer and internet revolution we have experienced
over the past couple of decades. A processor is at the heart of every
computer system in use today, from tiny autonomous embedded
control systems to large scale, powerful, networked supercomputers.
The Compaq Alpha microprocessor line fits into the highperformance end of this spectrum, powering high-end workstations,
servers, and supercomputers.
Over the years, processor architecture and design always had to
respond to and move in lock-step with technological advances in
processor implementation and integrated circuit technology, as well
as programming paradigms and instruction sets. In this paper I
review some of the trends that have driven the Compaq Alpha
processor architecture in the past decade and give an outlook on
current and future trends that will have an impact on the architecture
of future Alpha processors.
Introduction
A processor is at the heart of every computer system that we build
today. Around this processor, you find several other components
that make up a computer: memory for instruction and data storage, and input-output devices to communicate with the rest of the world,
like disk controllers, graphics cards, keyboard interfaces, network
adapters, etc. The purpose of the processor is to execute machine
instructions. Thus, the logical operation of a processor is defined
by the instruction set architecture (ISA) that it executes. Multiple
different processors can implement the same ISA. What
differentiates such processors is their processor architecture, which
is the way that each processor is organized internally in order to
achieve the goal of implementing its ISA. By changing the processor
architecture, a processor designer can influence the performance
characteristics and efficiency with which instructions are executed.
Processor architecture also has to respond to implementation
constraints imposed on it by the target circuit technology of the
chip, in order to achieve a set performance goal.
In the rest of this section I will give a short crash course in advanced
computer architecture - an overview of the state of the art in
processor architecture for general-purpose high-performance
microprocessors.
Early Architectures
In early computer architectures, processor operation was very simple and strictly sequential. In the first step, for each instruction the
program counter (PC) would be used to send the next instruction
address to memory. Potentially several clock cycles later, the
instruction is returned from memory. Then the instruction would be
decoded. Decoding produces a list of source and destination
operands that the instruction operates on, and a specific operation
that is to be performed. In the next step, source operands would be
accessed and delivered to the arithmetic-logic unit (ALU). The ALU
eventually performs the operation that was specified in the
instruction and delivers a result. The result is then written back to
the destination that was decoded. Finally, the PC would be updated
to advance to the next instruction that is to be executed, after which
the whole process starts from the beginning for the next instruction.
It is easy to see that in this type of design, many operations of the
processor are unnecessarily serialized and large portions of the
processor sit idle for a majority of time. For example, the ALU is only
busy during the period where the operation is performed on the
source operands, but sits idle during the rest of the time it takes to
execute an instruction. It is not uncommon for an instruction to
consume on the order of 10 clock cycles to execute. Processor
architects often quote processor performance in instructions
executed per clock cycle (IPC). Therefore, this simple processor
architecture would achieve a performance of 0.1 IPC [4].
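As a rough illustration of this sequential operating model, the following sketch (with purely assumed per-step cycle counts rather than figures from any real machine) shows how roughly 10 cycles per instruction translate into roughly 0.1 IPC.

```python
# Minimal sketch of the strictly sequential execution model described above.
# The per-step cycle counts are illustrative assumptions; they merely show
# how ~10 cycles per instruction translates into ~0.1 IPC.

STEP_CYCLES = {
    "fetch": 3,        # send PC to memory, wait for the instruction
    "decode": 1,       # determine operands and operation
    "operand_read": 1, # deliver source operands to the ALU
    "execute": 2,      # ALU performs the operation
    "writeback": 2,    # store result to its destination
    "pc_update": 1,    # advance to the next instruction
}

def sequential_ipc(num_instructions: int) -> float:
    cycles_per_instruction = sum(STEP_CYCLES.values())
    total_cycles = num_instructions * cycles_per_instruction
    return num_instructions / total_cycles

if __name__ == "__main__":
    print(f"IPC = {sequential_ipc(1000):.2f}")  # -> 0.10
```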
A driving force for this design was the scarcity of resources. The number
of transistors that was available on a processor chip was low (tens
of thousands). Much emphasis of the design had to be placed on
limiting the number of transistors to implement each function of the
processor. Efficiency could only be addressed when functionality
was satisfied, which left little freedom.
Pipelining
To achieve higher performance, the various operations involved in
executing a single instruction can be separated into different stages
of a pipeline and performed in parallel for multiple instructions.
Since the pipeline can only advance at the rate of its slowest stage,
it is advantageous for all instructions to have approximately the
same amount of work to do in each pipeline stage. The architectural
shift to a pipelined design goes hand in hand with a shift in
predominant ISAs of the time from complex instruction sets (CISC)
to simpler, reduced instruction sets (RISC). In RISC instruction sets,
each instruction only performs a simple operation that can be
executed in a short pipeline stage. All instructions have a similar
amount of work to perform, supporting a shift to pipelined
architectures.
A typical pipeline of early pipelined designs would have the following stages: a fetch stage to get the instruction to be executed, a decode stage to determine operands and operation, a register fetch stage to access the source operands, an execute stage to perform the operation, and finally a writeback stage where the result is stored into its destination. Several other restrictions support short, well-balanced pipelines, such as restricting memory access to a few specific memory instructions and allowing the sources and destinations of other instructions to come only from the register file [12].
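The throughput benefit of overlapping stages can be illustrated with a small sketch; the five stage names follow the description above, and the cycle accounting is a simplification that ignores all hazards.

```python
# Sketch of the 5-stage pipeline described above. Each instruction advances
# one stage per cycle; in the ideal case a new instruction starts every cycle,
# so the steady-state throughput approaches 1.0 IPC. Purely illustrative.

STAGES = ["FETCH", "DECODE", "REGFETCH", "EXECUTE", "WRITEBACK"]

def pipeline_timing(num_instructions: int):
    """Return (completion cycle of each instruction, achieved IPC)."""
    # Instruction i enters fetch in cycle i+1 and writes back in cycle i+5.
    completion = [i + len(STAGES) for i in range(num_instructions)]
    total_cycles = completion[-1]
    return completion, num_instructions / total_cycles

if __name__ == "__main__":
    done, ipc = pipeline_timing(100)
    print(f"100 instructions finish in {done[-1]} cycles, IPC = {ipc:.2f}")
    # -> 104 cycles, IPC = 0.96 (approaches 1.0 as the run gets longer)
```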
In the best case, a pipelined architecture can start to process a new
instruction every cycle, for a throughput of 1.0 IPC. With the advent
of pipelining comes a problem for control-flow, however. Control-flow instructions are those instructions that can affect the PC of the
next instruction; they change the path the processor takes in the
program. It is easy to see that this is fairly trivial in non-pipelined
architectures, since the update of the PC is the last operation that is
performed when an instruction is executed. It can take into
consideration the outcome of the computation of the instruction. If
the instruction branches to a different location depending on e.g.
whether the result of the ALU was zero or non-zero, the PC update
logic has this information readily available in a non-pipelined
processor. Now consider a pipelined processor. The instruction
following the branch needs to be fetched, i.e. its address must be
known, when the branch instruction is in the second pipeline stage
(decode stage). However, at this point the outcome of the branch is
not yet determined. Not until two stages later, in the execute stage,
will the processor know whether the branch should be taken or not.
This leaves the processor architect with the problem that it is not
clear which instruction to fetch for the duration of three cycles after
a branch instruction was fetched. Not fetching anything, however,
largely decreases the performance of the processor. In integer
dominated code, approximately every 5th instruction is a branch. If
the processor would hold off on instruction fetch for 3 cycles on
every 5th instruction, however, its throughput would be reduced to
5/8 = 0.625 IPC, a reduction of 37% from its ideal.
Branch Prediction
Different architectures have solved the control-flow problem in different ways. One way is to introduce branch delay slots into the
ISA. These are instructions after the branch that are always executed,
independent of the branch outcome. With respect to their control-flow, these instructions look as if they came before the branch. Since
the processor now has some time between fetching the branch and
changing the PC, the branch can advance to the execution stage
and no or fewer bubble cycles have to be inserted into the pipeline.
Another approach is to use branch prediction or guess which way
the branch will go and speculatively proceed down this predicted
path.
Since many branches turn out to be highly predictable, branch
prediction is very successful in preventing pipeline bubbles after
branch instructions. Dynamic branch predictors are generally based
on the observation that the history of previous branch behavior is
a good indicator for future branch behavior. If a branch was
frequently taken in the past, it has a high likelihood of being taken
in the future as well. However, some branches will be mispredicted.
In that case, the misprediction will be detected when the branch
instruction is in the execute stage. The correct (computed) outcome
of the branch is compared against the predicted outcome and
misprediction recovery is initiated if the two mismatch. On
misprediction recovery, all instructions in the pipeline after the
branch are killed and the fetch stage is redirected to fetch instructions
from the correct target address. In the case of a misprediction, the
pipeline does incur a 3 cycle misfetch penalty during which
instructions from the wrong path after the mispredicted branch were
fetched.
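As an illustration of such history-based prediction, the following sketch implements a generic table of 2-bit saturating counters indexed by the branch PC; this is a textbook scheme used here only to make the idea concrete, not the predictor of any particular Alpha implementation.

```python
# Illustration of a simple history-based dynamic branch predictor: a table of
# 2-bit saturating counters indexed by the low bits of the branch PC.
# Generic textbook scheme, not the predictor of any specific processor.

class TwoBitPredictor:
    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [2] * entries        # start at "weakly taken"

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries      # drop instruction alignment bits

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2   # 2, 3 -> predict taken

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

if __name__ == "__main__":
    bp = TwoBitPredictor()
    # A loop-closing branch: taken 9 times, then falls through once.
    outcomes = [True] * 9 + [False]
    correct = 0
    for taken in outcomes:
        correct += (bp.predict(pc=0x1000) == taken)
        bp.update(pc=0x1000, taken=taken)
    print(f"{correct}/{len(outcomes)} predictions correct")  # -> 9/10
```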
With a 90% accurate branch predictor, only one in 10 predicted
branches is mispredicted and the misfetch bubble will be incurred.
To continue our example from above, with 1 in 5 instructions being
a branch and a 3 cycle bubble after mispredicted branches, we have
an average performance of 50/53 = 0.943 IPC. Note that the
performance is computed in terms of a commit IPC, i.e. instructions
committed per clock cycle, since only those instructions contribute
to the computation, whereas killed instructions are overhead.
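The quoted throughput figures follow from a simple penalty model; the small calculation below (a sketch, assuming a 1-wide pipeline, a 3-cycle misfetch bubble, and the branch statistics from the text) reproduces both the 0.625 IPC and the 0.943 IPC numbers.

```python
# Back-of-the-envelope model for the commit IPC numbers quoted above: a scalar
# pipeline that loses `penalty_cycles` fetch cycles for every branch that is
# not resolved (or is mispredicted) in time.

def commit_ipc(branch_fraction: float, mispredict_rate: float,
               penalty_cycles: int) -> float:
    """Useful instructions committed per cycle for a 1-wide pipeline."""
    bubbles_per_instr = branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / (1.0 + bubbles_per_instr)

if __name__ == "__main__":
    # No prediction: every 5th instruction is a branch, every branch stalls 3 cycles.
    print(f"{commit_ipc(1/5, 1.0, 3):.3f}")   # -> 0.625
    # 90% accurate predictor: only 1 in 10 branches pays the 3-cycle bubble.
    print(f"{commit_ipc(1/5, 0.1, 3):.3f}")   # -> 0.943
```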
With pipelined architectures and branch prediction, the best possible
performance is 1.0 IPC or one useful instruction completed every
clock cycle. To break this barrier, a new idea is needed.
Superscalar Execution
Pipelined architectures as described in the previous section are
also called scalar architectures - they operate on 1 instruction per
cycle in every pipeline stage. To increase performance further, superscalar architectures are used. In an n-way superscalar architecture,
each pipeline stage operates on n instructions at the same time.
Since such a processor fetches n instructions each cycle and can
commit (up to) n instructions each cycle, the peak performance of
the processor is n IPC. However, as we saw with the pipelined
scalar design, mispredicted branches will decrease the average
performance of the processor somewhat, so more accurate branch
predictors are necessary to achieve good average performance.
Besides managing control-flow obstacles with branch prediction,
structural and data-flow hazards get to be a problem with wide
superscalar processor designs. As described so far, instructions in
a pipeline stay in order with respect to each other, i.e. an instruction
can not overtake its predecessors in the pipeline. Thus, this design
style is called an in-order architecture. In-order architectures can
experience pipeline stalls due to long execution latencies of
instructions. For example, if a load instruction is held up in the
pipeline because it has missed in the cache and has to get its data
from memory, none of the following instructions can proceed either,
even though many of those instructions might not be dependent
on the load. Accessing data from memory can take from many tens
to a couple hundred of cycles, during which time all instructions in
the pipeline are stalled. This is clearly an area where we could do
better for the sake of performance.
Out-of-Order Execution and Register Renaming
To avoid stalling independent instructions unnecessarily after other
long latency instructions, out-of-order execution comes to rescue.
In out-of-order execution architectures, sometimes also referred to
as dynamically scheduled, instructions can overtake each other in
the pipeline, i.e. they do not necessarily execute in program order.
Instructions are still fetched in program order but are eventually
delivered into a pool of instructions in the processor execution
core. From this pool, instructions are taken in data-flow order, only
respecting the availability of their source operands to determine if
they are ready for execution. After execution, instructions are reordered back into program order and retire in this order.
With out-of-order execution comes the concept of register renaming. Register renaming is used to give each result that is produced by an instruction a unique new location to avoid confusion between multiple results for the same register that might be live at the same time. Since out-of-order execution processors can change the execution order of instructions, two instructions I1 and I2 which produce results that have to be written into the same register R1 can be executed in the wrong order, i.e. they incur a write-after-write conflict. To maintain the illusion of sequential execution order to all instructions in the program, however, each instruction I1 and I2 writes its result into a new location P1 and P2 respectively, such that the results of both instructions are available at the same time. All source operands R1 of instructions that depend on either the result of I1 or I2 are also renamed to use the correct copy of the result P1 or P2 respectively. Out-of-order execution processors have a logical register set (Rx) that is the one referred to in the ISA. These logical registers are renamed to a much larger set of physical registers (Px) which are actually present in the implementation of the register file to support multiple versions of each logical register. Renaming is performed in a pipeline stage after instruction decoding and before instructions enter the out-of-order instruction pool. Renaming takes care of all write-after-write (WAW) and write-after-read (WAR) conflicts introduced by out-of-order execution [14].
Instructions write their results to the architectural machine state, corresponding to logical registers, only when they commit. This prevents the architectural state from becoming corrupted by instructions that have executed but will be killed later due to branch mispredictions or other reasons.
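A heavily simplified sketch of this rename step is shown below; the map-table and free-list structure is generic and illustrative (the register names, the physical register count, and the omission of free-list reclamation and misprediction recovery are assumptions for brevity), but it shows how a WAW conflict on R1 disappears once each result gets a fresh physical register.

```python
# Sketch of the rename step described above: architectural (logical) register
# names are mapped to fresh physical registers so that WAW and WAR conflicts
# disappear. Simplified illustration only; real renamers also reclaim free
# registers and recover the map on mispredictions.

class Renamer:
    def __init__(self, num_physical: int = 80):   # size is illustrative
        self.free = list(range(num_physical))      # free physical registers
        self.map = {}                              # logical name -> physical reg

    def rename(self, srcs, dst):
        """Return (renamed source registers, newly allocated destination)."""
        phys_srcs = [self.map[s] for s in srcs]    # read current mappings
        phys_dst = self.free.pop(0)                # fresh location for the result
        self.map[dst] = phys_dst                   # later readers see the new copy
        return phys_srcs, phys_dst

if __name__ == "__main__":
    r = Renamer()
    for reg in ["R1", "R2", "R3"]:                 # pre-existing mappings
        r.map[reg] = r.free.pop(0)
    # I1: R1 <- R2 + R3   and   I2: R1 <- R3 + R3   (WAW conflict on R1)
    print("I1:", r.rename(["R2", "R3"], "R1"))
    print("I2:", r.rename(["R3", "R3"], "R1"))
    # I1 and I2 now write different physical registers, so they may complete
    # in either order without corrupting each other's result.
```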
With out-of-order execution, architectures can better leverage the instruction level parallelism (ILP) that is inherent in the program that is executed. Long latency instructions, like floating point divides, no longer block progress for following instructions, and even very long latency operations like loading data from the second level cache or even memory can be sufficiently overlapped with the execution of following operations. However, since the number of instructions in flight, i.e. the number of instructions fetched but not yet committed, can be around a hundred in such an architecture, very accurate branch prediction is crucially important to achieve good performance.
Pipelined superscalar out-of-order architectures are the current state of the art in microprocessor design. Examples of such processors include the Alpha 21264 (EV6) [16], the Pentium 4 (Willamette) [9], the MIPS R10000 [20], and the HP PA-8000 [13].
Previous Trends - A History of Alpha Processors
Birth of the Alpha ISA
In the late 1980s, design teams that had been working on the DEC VAX line of processors had started to investigate possible successor architectures for their product line. As in many other computer companies at the time, it was apparent that just continuing with the implementation of the complex VAX instruction set would not provide the long-term performance improvements that were sought. The complex VAX instruction set had acquired a lot of baggage over the years that made implementations complicated and took a heavy toll on the clock cycle time that was achievable in its implementations. At this point the decision was made to go with a new, reduced, streamlined instruction set which would allow faster implementation and was to fuel a new processor line for the next couple of decades.
The new Alpha architecture and instruction set took into consideration studies on RISC instruction sets of the time, implementation complexities, and the experience from the VAX processor line [4] and earlier in-house RISC designs of SAFE and PRISM, as well as the designs of the Titan and MultiTitan [15] processors undertaken at DEC's Western Research Lab (WRL).
The Alpha ISA was formulated as the industry's first 64-bit instruction set, with 64-bit addressing, 32 64-bit integer registers and 32 64-bit floating point registers. It is a pure RISC instruction set with fixed-length 32-bit instructions. It has dedicated memory load/store instructions with simple addressing modes. All computational instructions operate on sources and destinations in the register file. Integer instructions support data types up to 64 bits; floating point instructions support 32-bit and 64-bit IEEE as well as VAX floating point formats. A couple of instructions were added to support easy recompilation of VAX applications onto the new Alpha ISA.
Tab.1: Alpha Implementation Overview.
The Alpha ISA avoids a few pitfalls that competing RISC instruction sets had fallen into, which allowed artifacts of the first implementation to sneak into the ISA definition. In particular, the Alpha ISA does not provide for branch delay slots - an implementation can use branch prediction to speed up control-flow processing. The ISA does not contain any special complex operations - implementations count on multiple-issue superscalar operation to process many instructions at a time. The ISA only defines a weak memory consistency model with explicit barrier instructions to be used if strict ordering is necessary, allowing more freedom in the memory system implementation. The ISA only guarantees imprecise arithmetic exceptions with explicit trap barrier instructions to be used when ordering has to be guaranteed, which
also allows more implementation freedom. Additionally, the Alpha
ISA adheres to typical RISC traits such as the lack of global state
and mode bits. The new Alpha ISA had been formulated to keep its
implementation simple yet give it a large degree of flexibility in
implementation choices. Table 1 gives an overview of the variety of
architecture and technology implementation parameters that the
Alpha ISA has been and is being implemented on.
Early Design Prototype - EV3
To prove the viability of the new architecture and to allow early
system debugging and software development, the first in-house
design was targeted to the current circuit technology of the time,
CMOS-3. Since the Alpha architecture was developed as an eventual
replacement for the VAX architecture, the first design was termed
EV3, EV for Extended VAX and 3 coming from its implementation
technology CMOS-3. This design was a simple 2-way superscalar
in-order processor, however, due to time and area constraints it had
only a simple memory system and did not implement a floating
point unit yet. Floating point instructions were trapped as
unimplemented opcodes and emulated in software. This initial design
allowed architects and implementers to gain some experience with
the new ISA and provided a convenient vehicle for efficient software
development.
1st Generation - EV4 (Alpha 21064)
After the successful initial design study of EV3, proving the feasibility of the Alpha ISA and gaining some experience in its implementation, the first for-sale architectural generation, the EV4 processor, was designed, built, and introduced to the public in 1992. This first generation was a 2-way superscalar in-order processor. With its 200 MHz operation in a 0.75 µm CMOS process technology, it was by far the fastest processor on the market. A lot of effort in the architecture design and implementation had gone into achieving the goal of very high operating frequency for this process technology. This can mainly be attributed to the fact that both the instruction set and architecture were kept simple, and the chip was implemented in a full custom design with great attention to speed paths. The 1.68 million transistors of this first generation implementation include an on-chip 8 kB first level instruction cache and 8 kB first level data cache for an on-chip cache bandwidth of 1.6 GB/s. The processor achieves an off-chip cache bandwidth of 600 MB/s and a memory bandwidth of 150 MB/s.
With so much attention paid to the clock speed of the processor, its architecture had to be kept simple so it would not get in the way of speed. The 2-way in-order design was the compromise that resulted.
Fig.1: Alpha 21264 (EV6) processor architecture.
Nevertheless, it was clear that architectural advances could be leveraged for future generations together with technological advances to push ahead in the high-performance processor arena.
2nd Generation - EV5 (Alpha 21164)
The next Alpha architecture [6] was targeted at a 0.5 µm CMOS implementation technology that would allow around 9 million transistors to be integrated onto the die. With that many transistors available there was plenty of space for large on-chip caches as well as a more complicated processor core. The EV5 processor was designed as a 4-way superscalar in-order core with separate 8 kB first level instruction and data caches and a unified on-chip 96 kB second level cache. EV5 supports two concurrent read accesses to its first level data cache via a duplicated cache, providing a bandwidth of 4.8 GB/s to the first level cache. The majority of the increased transistor budget went into the caches, but the core was also widened from 2-way to 4-way superscalar execution. In order to keep the target clock frequency of 300 MHz at its introduction in 1995, the decision was made to stay with in-order execution for this generation. The processor includes control for a third level off-chip cache of up to 64 MB and can support a cache bandwidth of 1.2 GB/s and a memory bandwidth of 400 MB/s in order to keep its execution engine fed with data.
Again, as in the last generation, much implementation attention was paid to processor speed, and the EV5 was again the fastest processor on the market when it was introduced [7].
Internal architectural studies with wider superscalar in-order processors have shown that in-order execution would find insufficient ILP in most programs to be pushed much beyond a 4-way superscalar design. Additional studies have revealed that a VLIW organization, where more scheduling complexity is pushed off from the architecture into the compiler, did not result in significant enough simplification of the processor architecture to warrant a radical break in instruction set architecture. Furthermore, many of the VLIW compiler and instruction scheduling enhancements could be used even for a normal RISC architecture. At this point, the decision was made to take the next generation processor out-of-order to leverage the additional ILP found in many programs.
3rd Generation - EV6 (Alpha 21264)
The EV6 microprocessor [16, 17] was finally introduced in 1998 on a 0.35 µm CMOS process technology and with an operating frequency of 600 MHz. It is designed as a 4-way superscalar out-of-order architecture (see figure 1). Now, although it can fetch only 4
instructions per cycle, its execution core is capable of executing 6 instructions per cycle (4 INT and 2 FP) and it can commit 8 instructions per cycle. This widening of the pipeline towards the end helps to catch up with instructions that delay the out-of-order core of the processor due to incurring long operation latencies, like memory accesses. Its approximately 15 million transistors encompass a separate 64 kB first level instruction cache and a 64 kB first level data cache for an on-chip cache bandwidth of 9.6 GB/s, an off-chip cache bandwidth of 6.4 GB/s and a memory bandwidth of 3.2 GB/s. Like the previous generation, EV6 supports two concurrent accesses to the first level data cache, this time by a double-pumped cache running at twice the processor frequency. EV6 also supports up to 16 outstanding off-chip memory references in its out-of-order design. The large bandwidth
improvement in the memory system proved to be pivotal to
successfully feed the execution core, in particular for demanding
server-type applications like large database workloads, but also in
the high-performance technical computing field.
Besides the advances in the memory system and taking the
processor core architecture to an out-of-order design, some initial
attempts were made at integrating more functionality onto the chip.
The 21264 integrates the controller for an off-chip second level
cache using separate data busses for accessing the off-chip cache
and memory systems in order to maximize memory throughput.
Higher integration reduces the component count on the processor
board and speeds up the cache access, since fewer slow chip
crossings are necessary to access the data. We will see that this is
a trend to continue in coming generations.
Current Trends - System on a Chip
Previous Alpha processor implementations have focused mainly
on improving processor core features for improved performance,
like wider issue widths and out-of-order execution. With the third
generation processor EV6, an increasing amount of attention has
been given to the memory system, greatly improving its performance
and thus lifting processor performance for memory intensive
application areas like commercial and high-performance technical
workloads. However, circuit technology has also improved in the
meantime. Besides increased device speeds that have been a result
of smaller feature sizes, current generation process technologies
also support a much larger number of transistors on the same size
die. This leads to the potential for integrating more and more
computer components on the processor die that used to be located
in separate chips on the motherboard, or in other words we are at
the point where it becomes possible to integrate whole systems on
a chip (SoC).
4th Generation - EV7 (Alpha 21364)
The fourth generation Alpha processor [5], being implemented in a
0.18 µm CMOS process technology, allows for 152 million transistors
on the die. With that many transistors available it starts to become
feasible to use a SoC design for a high-performance microprocessor.
Fig.2: Alpha 21364 (EV7) processor architecture.
Previously, SoC designs were mostly a domain of the embedded
processor market, which could be realized with a much smaller
transistor count.
The EV7 processor architecture leverages the EV6 processor core
architecture and integrates a number of additional features on the
periphery of the processor core. Following are the main features of
this generation (see figure 2):
EV6 Processor Core: In order to leverage the work done on the
EV6 out-of-order execution processor design, EV7 reuses the
EV6 processor core with some minor enhancements, like
supporting up to 48 outstanding memory requests, at an initial
frequency of 1.2 GHz.
Integrated Second Level Cache: 138 of the 152 million transistors
will be spent on RAM, integrating a 1.75 MB, 7-way set
associative L2 cache onto the processor die. The cache is ECC
protected and provides a bandwidth of 19.2 GB/s to the processor
core and a 10 ns load-to-use latency.
Integrated Memory Controller: While previous processor
generations were relying on an external support logic chip set
for controlling the computer's memory, EV7 integrates two 800
MHz Direct RAMbus controllers onto the die. Each controller
connects to 4 RDRAM channels. Like the L2 cache, memory is
ECC protected. Integration of the memory controllers onto the
processor die will boost memory bandwidth to 12.8 GB/s, with a 30 ns access latency for open memory pages. These
enhancements in the memory system are major contributors to
performance gains for today's memory intensive commercial
workloads.
Integrated Network Interface: In order to make it easy to build large multiprocessor systems, EV7 contains an on-chip interface and router for a direct processor-to-processor interconnect. The interface consists of 4 32-bit ports with 6.4 GB/s bandwidth each and a 15 ns processor-to-processor latency. The interconnect forms a 2-D torus network that supports deadlock-free adaptive routing in the integrated interface. The network interface autonomously handles all protocol transactions necessary to build a cache-coherent non-uniform memory access (ccNUMA) shared memory multiprocessor (SMP) system. An additional 5th port on the processor provides a similar interface to support an I/O channel with 6.4 GB/s bandwidth. Each network router provides buffer space for over 300 packets.
Fig.3: Fraction of execution time spent in various system
components for TPC-C workload on EV5, EV6, and EV7.
To better understand what has driven the architectural decisions
for EV6 and EV7, it helps to look at the time that various system
components contribute to the execution of a commercial benchmark
like TPC-C (see figure 3). On an EV5 architecture, approximately 1/3
of the time is spent in the memory system, 1/3 of the time is spent in
the cache, and 1/3 of the time is spent for instruction execution in
the core. With the introduction of the EV6 architecture and its out-of-order execution engine, times spent in those three components
have scaled roughly equally. However, scaling the same processor
core to a higher clock frequency does not provide much additional
benefit since it mainly attacks the core execution time, but does
little to help reducing the time spent in the memory system and
caches. In order to attack this problem, it was necessary to allow for
tighter integration of the caches and memory system with the
processor core. With this higher integration, memory, cache, and
core execution times are more evenly distributed again.
Larger server systems are generally built as multiprocessor systems.
For example, the EV6 powered Wildfire [8] server line supports up
to 32 CPUs in a ccNUMA SMP architecture connected by a 2-level hierarchy of switches. To build such a system, however, a number of support chips are necessary for the network interfaces, routers, and coherency protocol engines. With more chip crossings on its way, latency for data packets becomes a problem [2]. On Wildfire, non-pipelined remote memory access latency is on the order of a thousand nanoseconds, much slower than local memory access,
which only takes a few hundred nanoseconds. To make large SMP
systems more efficient, tighter integration of the memory system
and networking interface to other processors (i.e. remote memory)
is highly desirable. EV7 has the networking interface, router, and
protocol engines pulled onto the processor chip in order to allow
lower latency remote accesses and a larger number of CPUs,
reaching into the hundreds of processors.
Future Trends - Latency, Complexity, Clustering, and
Parallelism
Perceived and expected trends over the next few years are starting
to influence architectural decisions that are being made for future
processor generations. In this section, I outline some of these trends
and their influence on processor architectures.
5th Generation - EV8 (Alpha 21464)
The fifth generation Alpha microprocessor implementation is well
underway in its architecture design and implementation. Analysis
of typical Alpha workloads has shown that many programs have
sufficient ILP to support scaling an out-of-order execution core to
an issue width greater than 4. However, performance improvements
from wider issue execution cores start to level off. In order to find
sufficient parallelism to warrant a wider issue core, parallelism has
to be sought beyond just ILP.
Looking at server workloads it quickly becomes apparent that there
are several higher levels of parallelism above the instruction level
parallelism of a single program. Typically, a server runs many different programs or processes at the same time. In many machines,
this multi-process parallelism is harnessed with multiprocessor
systems, each processor running one process at a time. However,
many programs are also multi-threaded or can be converted to multithreaded programs relatively easily. Multiple threads in a program
all run in the same address space and cooperate more closely with
each other than multiple programs typically do. Thus, there exists
potential to utilize this thread level parallelism (TLP) together with
the ILP of each thread within a single processor, and utilize process
level parallelism across processors in a multiprocessor system. Note,
however, that the distinction between thread and process-level
parallelism is a subtle one, since processes can also share address
space in most operating systems and thus almost behave like threads
in many respects.
Simultaneous multithreading (SMT) [18] is an architecture design
that can utilize thread level parallelism (TLP). An SMT architecture
is an extension of the superscalar out-of-order execution architecture.
In a normal superscalar out-of-order execution architecture, all
instructions that are fetched and executed at a time come from the
same thread. Since wide issue processors do not tend to find
sufficient ILP at all times, a large fraction of available issue slots can
not be occupied by instructions from one thread and this potential
execution bandwidth stays unused. In an SMT architecture, the
processor fetches instructions from multiple threads at the same
time. Instructions from different threads go through decoding and
register renaming separately. They are then dumped into the common
pool of instructions that need to be executed. Like in a conventional
out-of-order architecture, instructions from the shared pool are
picked depending on data-flow order and are executed irrespective
of their correlation to any specific thread. Since instructions from
different threads are independent of each other, more instructions
in the common issue pool are data-ready at any given time, thus
allowing for better use of the existing functional units and issue
bandwidth of the superscalar execution core. Finally, after execution,
instructions retire again in program order within their respective
thread. It is easy to see that this architectural organization for SMT
largely leverages the design that is already existent in a conventional
superscalar out-of-order execution processor.
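The core of the SMT issue idea can be sketched as follows; the thread count, issue width, and readiness flags are illustrative assumptions and not EV8 parameters, but the example shows how ready instructions from a second thread can fill issue slots that a single thread leaves empty.

```python
# Rough sketch of the SMT issue idea described above: instructions from
# several threads sit in one shared pool after renaming, and each cycle the
# issue logic picks up to `width` data-ready instructions regardless of which
# thread they belong to. All parameters here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Instr:
    thread: int
    ready: bool          # all source operands available?

def smt_issue(pool, width):
    """Pick up to `width` ready instructions from the shared pool."""
    picked = [i for i in pool if i.ready][:width]
    for i in picked:
        pool.remove(i)
    return picked

if __name__ == "__main__":
    # Thread 0 alone has only 2 ready instructions this cycle, but together
    # with thread 1 all 4 issue slots can be filled.
    pool = [Instr(0, True), Instr(0, False), Instr(0, True),
            Instr(1, True), Instr(1, True), Instr(1, False)]
    issued = smt_issue(pool, width=4)
    print([(i.thread, i.ready) for i in issued])  # 2 from thread 0, 2 from thread 1
```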
EV8, the fifth generation Alpha architecture, is being designed as a
4-way SMT, 8-way super-scalar out-of-order execution processor.
It supports 4 concurrent contexts in its 4 thread processing units
(TPUs). Note that in this context the TPUs are not required to run in
the same address space. TPUs can be running multiple processes
or threads or any mixture thereof. EV8 can sustain an aggregate
execution bandwidth of 8 instructions per cycle. With this design, a
single EV8 processor can harness ILP and TLP, whereas multiple
EV8s in a multiprocessor configuration can be used to leverage
multi-program parallelism. Like the previous generation, EV8 with
its total of 250 million transistors is designed as a system on a chip
with a large second level cache, memory controller, and direct
processor-to-processor network interface and router on the die for
glue-less ccNUMA SMP systems with up to 512 processors. It is
designed to support higher cache and memory bandwidth than the
previous generation to push performance ahead for memory intensive workloads. The initial implementation targets a 0.125 µm CMOS
technology for a clock frequency of around 2.0 GHz.
Influence of Multi-threading and Design Complexity
Wide issue superscalar out-of-order execution processors like the
EV8 are very complex architectures that require a large design team
and a long design time. They reward the architect with superior
performance in a wide area of single-threaded as well as multithreaded applications. However, alternatives arise in some
application areas that do not need this level of generality. For
example, database applications are highly parallel at a thread or
process level, but provide little ILP within a thread. In addition, this
workload is very memory intensive and spends a very large fraction
of its time in memory stalls. For these reasons, wide issue superscalar
architectures have little advantage over simpler architectures.
Important performance contributors are a good memory system and
the ability to support a large number of concurrent memory accesses,
which can be achieved for example with multiprocessing.
Design complexity can be largely reduced if alternative architectures
are used for such special application areas. An alternative that fits
well with the requirements of database workloads is the chip multiprocessor (CMP), as implemented by IBM in their Power4
architecture [3], and in the Piranha research prototype [1] of
Compaq's Western Research Lab. Piranha integrates 8 processors
on a single die. Each processor is a simple single-issue in-order
implementation of the Alpha ISA with blocking private first level
data and instruction caches. All processors on the die share a large
on-chip second level cache, the memory system controller, and the
integrated network interface that connects the chip to other chips,
similar to the networks of the EV7 and EV8 architectures. Since each
processor is relatively simple, design time and complexity for one
such processor are largely reduced. The architecture relies on the
fact that each processor is reused many times on the die to achieve
large aggregate execution bandwidth, without much additional
design complexity. The complexity of designing the second level
cache, memory system controller, and network interface, however,
still has to be considered as well. This CMP achieves its performance
by having 8 simple processor cores together with a tightly integrated
memory system. It does not support high ILP, but does support
thread and process-level parallelism very well, matching its targeted
application domain.
Here we have seen how application domain restrictions that leverage
certain types of parallelism better than others can drive the
architecture design into a specific direction. Besides influences of
the application areas that high-performance microprocessors are
used for, there is also a strong influence of the process technology
that is used for their implementation. Processor architects have to
take into consideration the effects of process technology on the
implementation of a processor architecture, in order to achieve the
desired result in terms of performance, cost, power dissipation, etc.
The following section describes how process technology changes
the way processor architectures are designed.
Technological Influences on Architecture
Over the last decade, process technology has made steady, fast
paced advances which have largely helped to attain the performance
improvements we have grown used to. In the Alpha processor line,
process technology was at 0.75 µm for EV4 in 1992 and will be
at 0.125 µm for EV8. Integration has shot up from 1.68 million
transistors in EV4 to 250 million in EV8. Clock frequency has grown
from 200 MHz in EV4 to about 2 GHz for EV8. However, all the
performance enhancing effects of process technology also bring
problems along.
Current Sourcing and Power Dissipation
EV4 had a power dissipation of 30 W at a supply voltage of 3.3 V.
EV7 has a power dissipation of 125 W. Assuming EV8 will have the
same power dissipation at a supply voltage of only 1.2 V, its current
sourcing demands will increase significantly. If you do the math,
EV4 draws an average current of 9 A, whereas EV8 is going to draw
a current of > 100 A. This current has to get onto the chip while
incurring minimal IR drop on the supply voltage, especially since
VDD has also gotten smaller.
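The current figures follow directly from I = P / VDD with the stated power and supply-voltage numbers, as the following small calculation (using the approximate values quoted above) shows.

```python
# The supply-current numbers quoted above follow from I = P / V_DD, using the
# approximate power and supply-voltage figures stated in the text.

def supply_current(power_w: float, vdd_v: float) -> float:
    return power_w / vdd_v

if __name__ == "__main__":
    print(f"EV4: {supply_current(30, 3.3):.0f} A")    # ~9 A
    print(f"EV8: {supply_current(125, 1.2):.0f} A")   # ~104 A, i.e. > 100 A
```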
Dynamic current requirements are getting worse as well. EV4 has a
total effective switching capacitance of 12.5 nF at 200 MHz, EV6
has an effective switching capacitance of 34 nF at 600 MHz. EV8
will be running at close to 2 GHz on a yet larger die with a more
complicated architecture, prone to increase its switching capacitance
even further. To help with the di/dt problem, EV6 has a total on-chip
decoupling capacitance of 320 nF, taking up approximately 15-20%
of the die area, plus an additional 1 µF wire-bond attached chip
capacitor with 160 VDD/VSS bond wires to reduce inductive
coupling.
Thermal issues also have to be considered. Sufficient cooling for
an EV4 die could be sustained in a package with 1.1 K/W thermal
resistance. For an EV6 and EV7 die, a package with a thermal
resistance of 0.3 K/W has to be used to guarantee sufficient heat
exchange from the die to the package surface and prevent damaging
the die. However, packages with lower thermal resistance are a lot
more expensive. Currently, 125-150 W dies are the absolute maximum
power dissipation that can be sustained with passively cooled
packaging technology. EV7 with its 125 W at 1.2 GHz of operation is
currently at the limit of passive air-cooled packaging technology.
Besides making sure the heat can be brought off the die reliably,
high performance processors also face the problem of thermal
variation across the die, i.e. avoiding hot spots. A chip can tolerate
approximately 20 K temperature differential across its surface. In
order to reduce mechanical stress on the die due to thermal
expansion, the power grid on EV6 is actually laid out as two meshes,
one for VDD and one for VSS, taking up two complete metal layers.
Covering the whole metal layer with a solid surface was considered
to produce too much mechanical stress on the chip, since the thermal expansion coefficients of the metal, silicon, and insulation layers are not exactly the same [11].
So what does this have to do with processor architecture? Well, in
order to keep power dissipation in bounds, architectural and
implementation tricks can be used. Predominantly what is employed
here is conditional clocking and shutting down unused logic
blocks on the chip for short periods, i.e. a few clock cycles, before
they start to get used again. This reduces both the power dissipated
by the clock distribution network in the conditionally clocked area
as well as the power of the temporarily halted circuit, since it does
not experience any transitions. On EV4 through EV7, clock
distribution is responsible for approximately 30-40% of the total
chip power dissipation. On EV6, the clock grid is partitioned into
global and local clock distribution with a total load of 3 nF and 6 nF
respectively. Global clock distribution is unconditional, whereas
local clocks can be either unconditional or conditional [10]. EV7,
since it is reusing the EV6 core design, has kept the EV6 clocking
scheme for the core and has added another level of clocking for the
other components, split into three additional clock domains. These
clock domains are synchronized to the core clock domain with separate delay-locked loops (DLLs) so that the global clock skew remains below 60 ps across its almost 4 cm² die [19].
In past and current designs, dynamic power, i.e. the power
associated with switching activity, is the predominant source of
power dissipation. However, both VDD and Vt (threshold voltage)
have been constantly reduced over the past decade. This results in
a reduction of VDD-Vt or the voltage that is responsible for turning
off a device. Since devices are not turned off "as hard" anymore,
leakage or off-current is increasing, giving rise to higher static power
dissipation in newer process technologies.
Wire Scaling
Process geometries have steadily been shrinking over the past
decade, from 0.75 µm in the implementation of EV4, to now 0.18 µm
for EV7. This allows the integration of more transistors onto a die.
However, processes do not shrink homogeneously. Whereas
properties of transistors behave favorably when shrunk, wires start
to become a problem.
The important property for a wire in this respect is its RC delay. For
sub-micron shrinking rules, the RC delay of a wire usually increases
slightly between process generations. For constant drive strength,
this means that wires become slower than they used to be in the
previous generation. Since transistors become faster at the same
time, wires become much slower relative to transistor speeds and
their impact gets more and more noticeable with every process
generation. If wires are "local" to logic blocks, the effect of bad wire
scaling is typically mitigated by the fact that the wire also reduces
in absolute length. However, going to newer designs also generally
means more complexity. Although architects try to keep complexity
localized, some communication will still be required across long
distances on the chip. These distances might have the same absolute length as in previous process generations, but those "global"
wires are now slower than in previous process generations. This
results in more and more paths in the processor becoming wire
dominated, i.e. the majority of the time of a signal can be attributed
to its transmission delay incurred on a wire, rather than propagation
delay through logic gates. However, the more wire dominated a
path is, the worse it scales also for future process shrinks, since the
wire gets relatively slower with respect to transistor speeds.
Switching from aluminum to copper (Cu) interconnects helps wires
by reducing the wire's R. Using low-k dielectrics helps a wire's C.
However, these process changes are one-time solutions, i.e. the
one process shrink where the switch to Cu occurs is easier, but
following shrinks exhibit the same problem again. Alpha processor
implementations have switched to Cu interconnect with the
introduction of EV7. This has eased the reuse of the EV6 core in
EV7's 0.18 µm process, a core that was originally designed two
process generations earlier in 0.35 µm design rules.
Clustering
An architectural tool to cope with the challenges that smaller process
geometries impose on a design is the increased use of clustering.
Clustering serves to reduce local design complexity. In a clustered
design, a tradeoff is made between local (intra-cluster) complexity
and latency vs. global (inter-cluster) latency. Building clustered
designs means to break up logical components of a design into
smaller pieces - clusters. Each cluster is sufficiently small that wire
delay problems within the cluster are not yet dominating the design.
This allows for small intra-cluster latencies and supports a high
clock frequency in each cluster. On the other hand, small clusters
mean that logic complexity within a cluster is limited and decisions
can only be made based on local availability of corresponding input
data, i.e. each cluster is not as "smart" as a monolithic design could be. Whenever information has to cross cluster boundaries, additional transmission latency will be incurred and needs to be accounted
for.
Fig.4: Pipeline organizations: pipelining, super-pipelining, and clustering.
Clustered design can be understood as an extension of the idea of
pipelining as depicted in Fig. 4. With the introduction of pipelining,
the operation of a processor was broken up into separate logical
functions, each of which could operate independently from the
others. Each pipeline stage consumes some inputs from earlier
stages and produces some outputs for later stages. The pipeline
stages are ordered in a temporal sequence. Local complexity in each
pipeline stage is limited in order to reach a high overall clock speed.
Some designs like the Pentium 4 have broken each pipeline stage into even smaller sequential pieces, sometimes referred to as super-pipelining. However, the whole pipeline still consists of a temporally
sequential succession of (now smaller) stages, on the order of 20
stages for a Pentium 4.
Complementary to pipelining, clustering breaks up the operation
within a pipeline stage into multiple concurrent operations. Information of one pipeline stage fans out into multiple clusters operating
in parallel. All clusters then feed their results back to a combined
successor pipeline stage, as shown in Fig. 4.
An example where clustering was used in a previous design is the EV6 execution units and associated schedulers. In aggregate, the EV6
processor can issue 6 instructions each cycle out of a pool of 35
instruction queue entries. The picker loop has to complete in a
single cycle. It has to determine which of the 35 instructions are
data ready, pick 6 of them in some priority scheme, remove the
picked instructions from the instruction queue, and inform the
remaining instructions when the results of the picked instructions
will become ready. This proved to be an unmanageable task at the
target clock speed. The solution employed in EV6 is clustering. The
scheduler and execution units are broken up into 3 clusters. One
cluster operates on all floating point instructions and each of the
other two clusters operates on half of the integer instructions, as can
be seen in the block diagram of EV6 in Fig. 1. Each integer picker is
now responsible for picking 2 data ready instructions out of a shared
pool of 20 instruction queue entries. The floating point picker has
to pick 2 data ready instructions out of its pool of 15 instruction
queue entries. This allows the picker loop to run within a short
cycle time. The tradeoff that was bought into, however, is that results
that have to cross cluster boundaries incur an additional cycle of
delay. For example, assume instruction I1 issues in the left cluster in
cycle t and it has a nominal latency of 1 cycle. An instruction I2 in
the left cluster that is data dependent on I1 can issue at cycle t + 1.
However, if instruction I2 is on the right cluster, it can not issue
until cycle t + 2, incurring an additional 1 cycle cross-cluster
communication penalty. Careful slotting of instructions into clusters
is necessary in this case to avoid paying the cross-cluster penalty
too frequently.
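The effect of the cross-cluster penalty on the issue schedule can be written down as a tiny model; the 1-cycle producer latency and 1-cycle crossing penalty follow the example in the text, and everything else is an illustrative simplification.

```python
# Toy model of the cross-cluster penalty described above: a consumer in the
# same integer cluster as its producer can issue one cycle after it, while a
# consumer in the other cluster must wait one extra cycle for the result to
# cross the cluster boundary.

def earliest_issue(producer_issue_cycle: int, producer_latency: int,
                   same_cluster: bool, cross_penalty: int = 1) -> int:
    cycle = producer_issue_cycle + producer_latency
    if not same_cluster:
        cycle += cross_penalty          # result must cross the cluster boundary
    return cycle

if __name__ == "__main__":
    t = 10  # cycle in which I1 issues in the left cluster (1-cycle latency)
    print("I2 in left cluster: ", earliest_issue(t, 1, same_cluster=True))   # t + 1
    print("I2 in right cluster:", earliest_issue(t, 1, same_cluster=False))  # t + 2
```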
With wire delays becoming more dominant in future process
technologies, clustering will need to be used to an increasing degree
in order to keep local complexity down and clock speed up.
Speculation
Current microprocessor designs use speculation to varying degrees
to cope with the fact that in some situations information is not
available at the time it is needed. One example is branch prediction
as used in all Alpha implementations. Since the outcome of a branch
is not yet known when the next instruction has to be fetched, a
branch predictor is used to guess this outcome and the processor
speculatively proceeds down the predicted direction.
Another example of speculation is the memory system of EV6. The
L1 data cache of EV6 has a 2 cycle access latency, resulting in a 3
cycle load-to-use latency of load instructions. If I1 is an integer
load instruction issuing at cycle t, a dependent instruction I2 can
issue as soon as cycle t + 3 if I1 hits in the data cache. However, EV6
does not know whether I1 is a hit or miss in the data cache until
cycle t + 5. Therefore, we are in the same dilemma again. We need
information in cycle t + 3 that will not become available until cycle
t + 5. Speculation is the solution used in EV6 [17]. The processor
assumes that I1 will hit in the cache and issues I2 speculatively. If I1
hits, speculation was correct and maximum performance was
achieved by the schedule. However, if I1 misses, I2 and all
instructions that have been issued since have to be squashed and
reissued later. This guarantees architectural correctness, but costs
performance because issue slots were wasted. The goal is to keep
this waste to a minimum. In order to do this, EV6 uses a load hit/
miss predictor. The hit/miss predictor is trained with the outcome of
the load hit/miss speculation of previous encounters of a load
instruction. If the hit/miss predictor has encountered many
misspeculations in the past, it does not allow instruction I2 to issue
until the outcome of the load instruction I1 is determined for sure in
cycle t + 5. However, if the speculation was correct frequently in the
past, I2 can issue as soon as possible, i.e. in cycle t + 3.
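A sketch of this kind of hit/miss speculation control is given below; the table size, the use of a per-load saturating counter, and the thresholds are illustrative assumptions rather than the exact EV6 mechanism (which is described in [17]), but they capture the trade-off between issuing the dependent instruction at t + 3 and waiting until t + 5.

```python
# Sketch of the load hit/miss speculation described above: a saturating
# counter per load decides whether dependent instructions may issue
# speculatively at the hit latency (t + 3) or must wait until the hit/miss
# outcome is known (t + 5). All sizes and thresholds are illustrative.

class HitMissPredictor:
    def __init__(self, entries: int = 1024):
        self.counters = [3] * entries          # start out predicting "hit"

    def _index(self, load_pc: int) -> int:
        return (load_pc >> 2) % len(self.counters)

    def predict_hit(self, load_pc: int) -> bool:
        return self.counters[self._index(load_pc)] >= 2

    def update(self, load_pc: int, was_hit: bool) -> None:
        i = self._index(load_pc)
        self.counters[i] = (min(3, self.counters[i] + 1) if was_hit
                            else max(0, self.counters[i] - 1))

def dependent_issue_cycle(pred: HitMissPredictor, load_pc: int, t: int) -> int:
    # Speculate on a hit (issue at t + 3) only if the predictor allows it,
    # otherwise wait for the outcome at t + 5.
    return t + 3 if pred.predict_hit(load_pc) else t + 5

if __name__ == "__main__":
    p = HitMissPredictor()
    print(dependent_issue_cycle(p, 0x2000, t=0))   # 3: speculate on a hit
    for _ in range(3):
        p.update(0x2000, was_hit=False)            # repeated misses observed
    print(dependent_issue_cycle(p, 0x2000, t=0))   # 5: no longer speculates
```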
The two cases above are simple examples where speculation has to
be used because information is produced later than needed. With
wire delays becoming more dominant and the increasing use of
clustering, another form of information deficiency is on the rise.
When multiple clusters operate in parallel with one or several cycles
of communication latency between them, it occurs more frequently
that one cluster has information that another cluster needs. Even
though this information is available in a timely fashion, its producer
and consumer are not in the same "time domain". The information is
on time but in the wrong place. This effect will be seen increasingly
with future technologies. On an EV6 die (0.35 µm) a signal could still
cross the entire chip in one clock cycle. In a 0.1 µm technology, a
signal would require 5-6 clock cycles to travel from one functional
block to another one across the die, assuming the historical trend
of constant die sizes across process generations continues.
Increased use of speculation is a solution to this delayed information
problem. The information receiver has to make a guess as to what
information it is going to receive in order to make rapid progress
and meet performance goals. In case of misspeculation, appropriate
misspeculation recovery actions have to be performed on all affected
components.
Process technology improvements provide the fuel to run the high-performance processor engine in the future. Besides the ever
increasing transistor count and device speed, future implementation
technologies also pose significant challenges to the processor
architect to get the highest benefit and best possible performance
from process technology advances.
Conclusions
In this article, I have presented an overview of the past
developments, current state of the art, and near future trends and
problems in high-performance general-purpose microprocessor
architecture and design. The paper has followed the Alpha RISC
processor designs from the birth of the Alpha ISA and architecture
in the late 1980s, through the past decade, to the current generation
and has given an outlook into the design plans for the Alpha
architectures of the near future. Coarse trends and problem areas
for current and future processor implementations have been
reviewed in the areas of program characteristics and behavior as
well as process technology, and some implications on processor
architecture and design were pointed out.
Acknowledgements
I would like to extend my thanks to my colleagues at the Compaq Alpha Advanced Development Group (VSSAD) and the EV8 design team for providing answers to my pestering questions about
ancient history (late 1980s, early 1990s), memories of which
seem to be fading away rapidly.
References
[1] Luiz Barroso and Kourosh Gharachorloo et al. Piranha: A
Scalable Architecture Based on Single-Chip Multiprocessing.
In 27th Intl. Symp. on Computer Architecture, pages 282-293, Vancouver, Canada, June 2000.
[2] Luiz Barroso, Kourosh Gharachorloo, Andreas Nowatzyk, and
Ben Verghese. Impact of Chip-Level Integration on Performance of OLTP Workloads. In 6th Intl. Symp. on High Performance Computer Architecture, Toulouse, France, January
2000. IEEE.
[3] Keith Diefendorff. Power4 Focuses on Memory Bandwidth.
Microprocessor Report, 13(10):11-17, October 1999.
[4] Joel Emer and Douglas Clark. Retrospective: Characterization
of Processor Performance in the VAX-11/780. In 25 years of
the Intl. Symp. on Computer Architecture (selected papers),
pages 37-38, Barcelona, Spain, June 1998.
[5] Anil Jain et al. A 1.2 GHz Alpha Microprocessor with 44.8 GB/
s of Chip Pin Bandwidth. In Intl. Solid State Circuits Conf.,
San Francisco, CA, February 2001. IEEE.
[6] John H. Edmondson et al. Internal Organization of the Alpha
21164, a 300 MHz 64-bit Quad-issue CMOS RISC
Microprocessor. Digital Technical Journal, 7(1):119-135,
1995.
[7] William Bowhill et al. Circuit Implementation of a 300 MHz 64-bit Second-generation CMOS Alpha CPU. Digital Technical
Journal, 7(1):100-118, 1995.
[8] Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and
Stephen Van Doren. Architecture and Design of the
AlphaServer GS320. In 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems,
pages 13-24, Boston, MA, November 2000. ACM.
[9] Peter Glaskowsky. Pentium 4 (Partially) Reviewed.
Microprocessor Report, 14(8):10-13, August 2000.
[10] Michael Gowan, Larry Biro, and Daniel Jackson. Power
Considerations in the Design of the Alpha 21264
Microprocessor. In Design Automation Conference, pages
726-731, San Francisco, CA, 1998. ACM.
[11] Paul Gronowski, William Bowhill, Ronald Preston, Michael
Gowan, and Randy Allman. High-Performance
Microprocessor Design. IEEE Journal of Solid-State Circuits,
33(5):676-686, May 1998.
[12] John Hennessy and David Patterson. Computer Architecture:
A Quantitative Approach. Morgan Kaufmann, 1990. ISBN
1-55860-069-8.
[13] Doug Hunt. Advanced Performance Features of the 64-bit PA-8000. In COMPCON, pages 123-128, San Francisco, CA,
March 1995. IEEE.
[14] William Johnson. Superscalar Microprocessor Design.
Prentice-Hall, 1991. ISBN 0-13-875634-1.
[15] Norman P. Jouppi. Architectural and Organizational Tradeoffs
in the Design of the MultiTitan CPU. Research Report 89/9,
Digital WRL, July 1989.
[16] R. E. Kessler, E. J. McLellan, and D. A. Webb. The Alpha 21264
Microprocessor Architecture. In ICCD'98, Intl. Conf. on
Computer Design: VLSI in Computers and Processors,
Austin, TX, October 1998. IEEE.
[17] Richard Kessler. The Alpha 21264 Microprocessor. IEEE
Micro, 19(2):24-36, March 1999.
[18] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy,
Rebecca L. Stamm, and Dean M. Tullsen. Converting ThreadLevel Parallelism to Instruction-Level Parallelism via
Simultaneous Multithreading. ACM Transactions on Comp.
Sys., 15(3):322-354, August 1997.
[19] T. Xanthopoulos, D. Bailey, A. Gangwar, M. Gowan, A. Jain,
and B. Prewitt. The Design and Analysis of the Clock Distribution Network of a 1.2 GHz Alpha Microprocessor. In Intl.
Solid State Circuits Conf., San Francisco, CA, February 2001.
IEEE.
[20] Kenneth C. Yeager. The MIPS R10000 Superscalar
Microprocessor. IEEE Micro, 16(2):28-41, February 1996.