
A Parallel Environment for Simulating Quantum Computation
by
Geva Patz
B.S. Computer Science
University of South Africa (1998)
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2003
© Massachusetts Institute of Technology 2003. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Program in Media Arts and Sciences,
School of Architecture and Planning
May 21, 2003
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Stephen A. Benton
Professor of Media Arts and Sciences
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Andrew B. Lippman
Chairman
Department Committee on Graduate Students
A Parallel Environment for Simulating Quantum Computation
by
Geva Patz
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
on May 21, 2003, in partial fulfillment of the
requirements for the degree of
Master of Science in Media Arts and Sciences
Abstract
This thesis describes the design and implementation of an environment to allow quantum
computation to be simulated on classical computers. Although it is believed that quantum
computers cannot in general be efficiently simulated classically, it is nevertheless possible
to simulate small but interesting systems, on the order of a few tens of quantum bits. Since
the state of the art of physical implementations is fewer than 10 qubits, simulation remains a
useful tool for understanding the behavior of quantum algorithms.
To create a suitable environment for simulation, we constructed a 32-node cluster of
workstation-class computers linked with a high speed (gigabit Ethernet) network. We
then wrote an initial simulation environment based on parallel linear algebra libraries with
a Matlab front end. These libraries operated on large matrices representing the problem
being simulated.
The parallel Matlab environment demonstrated a degree of parallel speedup as we
added processors, but overall execution times were high, since the amount of data scaled
exponentially with the size of the problem. This increased both the number of operations
that had to be performed to compute the simulation, and the volume of data that had to
be communicated between the nodes as they were computing. The scaling also affected
memory utilization, limiting us to a maximum problem size of 14 qubits.
In an attempt to increase simulation efficiency, we revisited the design of the simulation
environment. Many quantum algorithms have a structure that can be described using the
tensor product operator from linear algebra. We believed that a new simulation environment based on this tensor product structure would be substantially more efficient than one
based on large matrices. We designed a new simulation environment that exploited this
tensor product structure. Benchmarks that we performed on the new simulation environment confirmed that it was substantially more efficient, allowing us to perform simulations
of the quantum Fourier transform and the discrete approximation to the solution of 3-SAT
by adiabatic evolution up to 25 qubits in a reasonable time.
Thesis Supervisor: Stephen A. Benton
Title: Professor of Media Arts and Sciences
A Parallel Environment for Simulating Quantum Computation
by
Geva Patz
The following people served as readers for this thesis:
Thesis Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Isaac L. Chuang
Associate Professor
MIT Media Laboratory
Thesis Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Edward Farhi
Professor of Physics
MIT Center for Theoretical Physics
Acknowledgments
Many thanks to the members of my thesis committee:
To Steve Benton, my advisor, who stepped in at a moment of need and guided me
through the completion of this thesis. His wise guidance and kind support were invaluable.
Thanks to my reader, Isaac Chuang, without whom this thesis would not have happened. He introduced me to the world of quantum computing, pointed me at the problem
that this thesis addressed, and suggested ways to approach the solution. He also enabled
me to have access to the computing resources required to make this work possible.
Thanks also to my other reader, Eddie Farhi, who introduced me to the idea of adiabatic
quantum computing, and whose office I always looked forward to visiting.
I’d have to switch to a much bigger font to thank Linda Peterson adequately. Her office
is a haven for desperate, panic-stricken, confused or otherwise needy students, and she is
a wellspring of helpful advice (um, I mean options).
To my wife, Alex: thank you so much for the support and encouragement you’ve given
me throughout my time at MIT, and for putting up with me in my sleep-deprived, not-altogether-cheerful thesis-writing mode.
Contents

1 Introduction
2 Background
  2.1 Why is quantum computing interesting?
  2.2 Basic concepts of quantum computation
    2.2.1 Quantum bits
    2.2.2 Quantum gates
  2.3 Quantum algorithms
    2.3.1 The quantum Fourier transform
    2.3.2 Quantum computation by adiabatic evolution
  2.4 The tensor product
3 Parallel simulation of quantum computation
  3.1 Simulating quantum computing
    3.1.1 Previous work in simulating quantum computation
  3.2 Parallel processing and cluster computing
    3.2.1 Cluster computing
  3.3 The tensor product as a means to optimize computation
    3.3.1 Parallelizing by the tensor product
    3.3.2 Efficient computation of tensor product structured multiplications
4 The simulation environment
  4.1 Hardware
  4.2 Initial software implementation
    4.2.1 Overview of libraries used
    4.2.2 Prior work in parallelizing Matlab
    4.2.3 Design of the parallel Matlab environment
  4.3 The tensor product based simulation environment
    4.3.1 Algorithm (circuit) specification language
    4.3.2 Compilation
    4.3.3 Distribution
    4.3.4 Execution
5 Evaluation
  5.1 Methodology
  5.2 The fundamentals
    5.2.1 Single node execution times
    5.2.2 Data transfer timing
    5.2.3 Startup overhead
  5.3 Gates and gate combinations
  5.4 The quantum Fourier transform
    5.4.1 The quantum Fourier transform on the initial environment
    5.4.2 Replicating the discrete Fourier transform
    5.4.3 Circuit-based implementation
    5.4.4 Comparing efficient and inefficient circuit specifications
  5.5 3-SAT by adiabatic evolution
6 Conclusions
A Code listings
  A.1 Circuit specification language parser
  A.2 Quantum Fourier transform circuit generator
  A.3 3-SAT problem generator for adiabatic evolution
List of Figures

2-1 Quantum NOT gate
2-2 The CNOT gate
2-3 Three CNOTs form a swap gate
2-4 Using a Hadamard gate to generate entangled Bell states
2-5 General circuit representation of the quantum Fourier transform
3-1 Basic schematic form of a quantum circuit
3-2 A representative quantum circuit
4-1 The cluster nodes, seen from below
4-2 The two dimensional block cyclic distribution
4-3 Layering of libraries in the first generation simulation environment
4-4 State diagram for the parallel Matlab server master node
4-5 State diagram for the parallel Matlab server slave nodes
4-6 Circuit to demonstrate different specification orderings
4-7 Circuit for the compilation example
4-8 State of the internal data structure
4-9 System-level overview of the parallel execution of a problem
4-10 Algorithm specification to illustrate computation sequence
4-11 An example computation sequence, illustrating communication patterns
4-12 State diagram for the new simulator master node
4-13 State diagram for the new simulator slave nodes
5-1 Single-node execution times
5-2 Single-node execution times with identity matrix
5-3 Vector transfer times
5-4 Startup overhead
5-5 Block of CNOTs circuit
5-6 Block of CNOTs, no permutation-related communication (cf. Figure 5-8)
5-7 Alternating CNOTs circuit
5-8 Alternating CNOTs
5-9 Parallel Matlab based simulator performance
5-10 Traditional Fourier transform execution times
5-11 Quantum Fourier transform circuit execution times
5-12 Quantum Fourier transform inefficient circuit execution times
5-13 Execution times for 3-SAT by adiabatic evolution, N steps
List of Tables

4.1 Abbreviated grammar for the tensor product specification language
4.2 Record types for the internal compiler data structure
5.1 Number of runs for simulation data
5.2 Fourier transform execution times for larger problem size
Listings

4.1 A sample parallel Matlab script
4.2 A sample algorithm specification
4.3 Inefficient specification of the circuit in Figure 4-6
4.4 Efficient specification of the circuit in Figure 4-6
4.5 Input file for the compilation example
A.1 Header file definitions
A.2 Lexical analyzer definition for flex
A.3 Parser definition for bison
A.4 Efficient quantum Fourier transform circuit generation script
A.5 Example efficient circuit output for 4 qubits
A.6 Unoptimized quantum Fourier transform circuit generation script
A.7 Example unoptimized circuit output for 4 qubits
A.8 Simulation code generator for 3-SAT by adiabatic evolution
A.9 Example adiabatic evolution code for 3 qubits
A.10 Random instance generator for 3-SAT by adiabatic evolution
Chapter 1
Introduction
The idea of simulating quantum computation on classical computers seems at first not to
make logical sense. Quantum computing is interesting primarily because it appears to be
able to solve problems that are intractable for classical computers. If this is the case, then
quantum computers cannot be efficiently simulated on classical ones.
Our goal, however, is much more modest. We do not seek to efficiently simulate quantum computation in general, for arbitrary problem size. Rather, we want to simulate the
largest problems that we can, until physical implementations of quantum computers have
overtaken our abilities to simulate them.
The largest physically realized quantum computation to date operated on seven quantum bits (qubits) [VSB+ 01]. Given that problem size doubles with every additional qubit,
simulating even low tens of qubits would allow us to investigate problems many orders of
magnitude larger than the size of quantum computers that we are currently able to build.
The exponential scaling of demands on memory and processor resources with increasing
problem size will always overwhelm us at some point, but with some thought we may be
able to postpone that point far enough to allow us to simulate some interesting problems.
Simulation also has a much more rapid configuration turn-around time than physical
experiments. We all hope that in the future, quantum computers will be as trivially reconfigurable as the desktop classical computers of today, but at the moment every successful
physical quantum computation has been a complex, carefully planned experiment with an
elaborate experimental setup tailored to solving one specific problem (often one specific
instance of a problem). For now, simulation offers much greater flexibility for reconfiguration, and is an essential tool for planning any experimental realization.
More generally, the lessons we learn in simulating quantum computation on classical
computers may yield insights that will be useful in other fields that deal with similarly
large problems. An obvious application would be simulating other quantum systems, but
similar techniques are also useful in such fields as image and signal processing.
To achieve even the relatively modest goal of simulating problems on the order of 20
qubits, we require substantial computing resources, and an intelligent approach to using
those resources. One path of attack is to combine the resources (memory and CPU) of
multiple processing units. High end parallel processing computers are, however, expensive, rare and often difficult to program. Ideally, we would like to harness the power of
readily available, inexpensive, easily configurable workstation-class computers to perform
our computations. This suggests exploiting the technique of ’cluster computing’, in which
multiple off-the-shelf workstations are combined into a parallel computing resource.
Regardless of the amount of simulation hardware we have available, it will be useful
to find efficient ways of representing the problem we are simulating, in order to reduce the
resource consumption of our simulations. The resources we are typically most interested in
are memory and CPU, but in a clustered computing environment there is another resource
that becomes significant, too: communication.
In this thesis, I will describe a simulation environment that we have built to explore the
simulation of quantum computation. We began by building and configuring a clustered
computing environment. We then implemented a simulator on it based on a library of
parallel linear algebra routines. This approach was chosen because linear algebra is at the
core of the mathematical representation of quantum computing.
Although our initial simulation environment validated the feasibility of simulating
quantum computation on a cluster of classical workstations, it also uncovered a number of
limitations, both of cluster computing in general, and of the specific simulation approach
we had chosen. We therefore developed a new simulation environment, designed to more
efficiently represent and simulate problems, and to reduce the reliance on inter-node communications, which had proved to be a substantial bottleneck in cluster computing. Specifically, we based our new simulation around the tensor product, a mathematical structure
that neatly parallels the structure of many quantum algorithms, and that provides a basis
for a more compact and efficient representation of these problems.
Chapter 2 starts by outlining some of the basic concepts of quantum computing and
the associated mathematics that will be necessary to understand the rest of this thesis.
Chapter 3 continues by reviewing concepts relevant to the parallel simulation of quantum computation. It discusses how quantum computation may be simulated, and introduces cluster computing, which forms the basis of the architecture of our simulation environment. It also describes how the tensor product, introduced in the previous chapter, has
been used as a structure for parallelization and efficient execution of simulations.
With the background out of the way, Chapter 4 describes our simulation environment
in detail, beginning with the cluster hardware, then moving on to a description of the initial
(parallel matrix based) simulation environment. It describes the limitations of the initial
simulation environment that motivated the design of the new simulation environment,
and then describes the new design.
Chapter 5 discusses our evaluation of the new simulation environment, describing the
benchmarks we developed and the results of running these benchmarks on the simulator.
Finally, Chapter 6 summarizes our conclusions, and suggests some directions in which
the simulation environment could be taken in the future.
Chapter 2
Background
This chapter introduces some key background concepts that will be relevant to the rest of
the thesis. This is by no means intended to be an exhaustive or rigorous survey of the subject of quantum computation, but instead is meant to give the reader enough background
to follow the concepts and notation used elsewhere. For a more comprehensive review of
quantum computation and the underlying principles of quantum mechanics, the reader is
referred to [NC00] or [Pre98].
After a brief motivation in Section 2.1 of why the subject of quantum computing is
interesting to study, Section 2.2 introduces some of the elementary concepts and principles
of quantum computation, along with the mathematical structures used to represent them.
In section 2.3 we review a few representative quantum algorithms. Section 2.4 introduces
the tensor product, which will be the mathematical key to the design of our simulation
environment.
2.1 Why is quantum computing interesting?
The theory of quantum computation is rich and interesting in its own right, but it is of
particular interest because it is believed that quantum computers may be able to perform
certain types of computation that are fundamentally too hard for classical computers to
perform in a reasonable time.
A classical computer is simply a computer in which the physical representation and
manipulation of information follows the laws of classical physics. This definition encompasses every practical ’computer’ in use today, from the microcontroller in a washing machine to the fastest supercomputers. Strictly, since quantum mechanics underpins all of
physics, classical computers are simply a special case of quantum computers, but since
their design does not directly exploit the principles of quantum mechanics, it is helpful to
distinguish them from quantum computers that do so.
More formally, classical computers are types of Turing machines, named for Alan Turing,
who in a seminal paper in 1936 [Tur36] developed the first formal, abstract model that
defined what it means for a machine, abstract or physical, to compute an algorithm. The
abstract mathematical computing ’machine’ that Turing introduced was called a ’logical
computing machine’ in his paper, but we now refer to it as a Turing machine.
Although in principle any problem expressible as an algorithm can be solved on a
Turing machine, in practice, certain kinds of problem may not be solvable on classical
computers with reasonable computational resources (with ’resources’ usually defined as
storage space and computing time). The study of the resource requirements of algorithms
is known as complexity theory. Complexity theory divides problems into a number of complexity classes based on the resources required to solve them.
One of the most important of these classes is P (for Polynomial time), which is defined as the set of problems[1] that can be solved on a deterministic machine (loosely, a conventional Turing machine) in polynomial time, in other words where the amount of time (equivalently, the number of steps) taken to solve the problem can be related to the size of the problem by a polynomial in the problem size. Less formally, P is essentially the class of problems that can be efficiently computed on classical computers.

[1] Strictly, the complexity classes are defined in terms of decision problems, i.e. problems that require a YES or NO answer. Since algorithms can be restated as equivalent decision problems, we ignore this formal nicety here.

Another class, NP (for Nondeterministic Polynomial time), is defined as those problems where the solution can be verified in polynomial time. Clearly P ⊆ NP, since the
solution to any problem in P can be verified by executing the problem in polynomial time.
It is believed that P ≠ NP, and many problems in NP have been posed for which no known
solution algorithm exists in P, but this has not yet been proved, and whether or not P = NP
remains one of the great unanswered questions in computer science.
A further complexity class is PSPACE, being those problems solvable with unlimited time, but a polynomial amount of storage space (memory). Again, it is clear that NP ⊆ PSPACE, and it is suspected, but unproven, that NP ≠ PSPACE. Thus it remains unknown even whether or not P = PSPACE.
It is known that there are classes of problems that are outside PSPACE, hence outside P. For instance, we know that PSPACE ⊂ EXPSPACE, where EXPSPACE is the set
of problems solvable with unlimited time and with an amount of memory that increases
exponentially with problem size.
How does all this discussion of complexity classes relate to quantum computers? A
new complexity class, BQP, has been defined to encompass all problems that can be efficiently solved on a quantum computer. BQP is defined as those problems that can be
solved in polynomial time by a quantum computer with a bounded probability that the
solution is incorrect (most definitions give this probability bound as p ≤ 0.25, but the
choice of bound is arbitrary).
It has been shown that BQP ⊆ PSPACE, but the relation to P and NP is unproven. Tantalizingly, however, there are problems that have been shown to be in BQP, but that are strongly believed to be outside P. Proving that this is so, i.e. proving that there are problems that can be solved efficiently on quantum computers but not on classical computers, would imply that P ≠ PSPACE, but even in the absence of a proof, there are strong hints that suggest this is so. Herein lies the promise of quantum computing.
In particular, the current interest in quantum computing was largely stimulated by a
paper by Peter Shor [Sho97] that gave algorithms for calculating discrete logarithms and
for finding the prime factors of an integer in polynomial time on a quantum computer. The
integer factoring algorithm generated particular interest because of its potential application to cryptanalysis (the well-known RSA public key cryptosystem, for instance, depends
on the difficulty of integer factoring for its security).
Shor presented an algorithm that finds the prime factors of an integer N in O((log N)^3) time. There is no equivalent classical algorithm known that can perform factoring in time O((log N)^k) for any k. The most efficient classical algorithms currently known, the Number Field Sieve and the Multiple Polynomial Quadratic Sieve, have superpolynomial run times ($O(e^{(\ln N)^{1/3}(\ln\ln N)^{2/3}})$ and $O(e^{\sqrt{\ln N \ln\ln N}})$ respectively) [Bre00].
Algorithms such as Shor’s strongly suggest that BQP ≠ P, and it is this that drives much of the interest in quantum computing.
2.2 Basic concepts of quantum computation

2.2.1 Quantum bits
The elementary unit of quantum data is the qubit (for “quantum bit”), by analogy with
the bit in classical computing. Although we will deal with qubits almost exclusively as
mathematical abstractions, it is important to bear in mind that, just as classical bits have
a physical representation, so qubits correspond to physical states within a quantum computer, subject to the laws of physics and in particular those of quantum mechanics.
Qubits, like bits, have states, such as |0⟩ and |1⟩. Unlike classical bits, qubits are not restricted to these states, but can take on any state of the form

$$|\psi\rangle = a|0\rangle + b|1\rangle, \qquad (2.1)$$

where a and b are complex numbers. Mathematically, the states |0⟩ and |1⟩ are orthonormal basis vectors for a two-dimensional complex vector space. The state of a qubit is a unit vector in this space. The general qubit state in (2.1) is said to be in a superposition of states |0⟩ and |1⟩.
The restriction to unit vectors arises from the interpretation of a and b: a crucial principle of quantum mechanics is that we cannot precisely determine (“measure”) the state of a quantum system (in this case, a qubit). Measuring a qubit in a state |ψ⟩ as above will yield the measurement 0 with probability |a|² and 1 with probability |b|². Since these probabilities must sum to 1, |a|² + |b|² = 1.
This vector representation of qubits is a very useful mathematical abstraction, and we will make extensive use of it in our simulation environment. We will often use vector notation to represent the state of a qubit, as in

$$|\psi\rangle = \begin{pmatrix} a \\ b \end{pmatrix}. \qquad (2.2)$$
The extension of these concepts to multiple qubits is straightforward. Two qubits have four computational basis states, |00⟩, |01⟩, |10⟩ and |11⟩, corresponding to the four possible states of a pair of classical bits. These states are sometimes written as integers in the form |0⟩, |1⟩, |2⟩ and |3⟩ respectively.

The state vector describing the state of a pair of qubits is simply

$$|\psi\rangle = a|00\rangle + b|01\rangle + c|10\rangle + d|11\rangle. \qquad (2.3)$$
More generally, a system of n qubits has computational basis states of the form |x₁x₂…xₙ⟩, where each x_i ∈ {0, 1}. There are therefore 2^n such basis states, and the state vector for such a system has 2^n entries (probability amplitudes). This exponential growth in information as the number of qubits increases hints at the potential computational power of quantum computing.
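To make the representation concrete, the following short numpy sketch (our illustration, not code from the thesis environment) stores a two-qubit state as a vector of 2^n complex amplitudes and simulates a single measurement:

```python
import numpy as np

# State of n qubits: a complex vector of 2**n amplitudes (equation 2.3 for n = 2).
n = 2
psi = np.array([0.5, 0.5j, -0.5, 0.5], dtype=complex)  # a|00> + b|01> + c|10> + d|11>

# A valid state is a unit vector: |a|^2 + |b|^2 + |c|^2 + |d|^2 = 1.
assert np.isclose(np.linalg.norm(psi), 1.0)

# Measurement probabilities for the basis states |00>, |01>, |10>, |11>.
probs = np.abs(psi) ** 2
print(probs)                                # [0.25 0.25 0.25 0.25]

# Simulate one measurement: draw a basis state with those probabilities.
outcome = np.random.choice(2 ** n, p=probs)
print(format(outcome, "02b"))               # e.g. '10'
```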
2.2.2 Quantum gates
Computation with qubits requires manipulating their states. These manipulations are again physical, and their exact nature depends on the particular physical implementation of a given quantum computer. Here too, though, it is helpful to use a mathematical abstraction to describe these manipulations independent of any specific physical implementation.
An abstraction which is helpful in describing a wide range of quantum algorithms is the circuit model, in which algorithms are described as a collection of quantum gates, which operate on qubits by analogy with the logic gates of classical computing (AND, OR, NOT, etc.).
To illustrate, consider the quantum NOT gate. Just as the classical NOT swaps bit values, taking 0 → 1 and 1 → 0, the quantum NOT gate takes |0⟩ → |1⟩ and |1⟩ → |0⟩. More generally, however, it takes any state |ψ⟩ = a|0⟩ + b|1⟩ to the state |ψ′⟩ = b|0⟩ + a|1⟩. Graphically, this is usually represented as in Figure 2-1 (X is a standard shorthand for the NOT gate, and ⊕ represents the classical XOR operation, or equivalently binary addition modulo 2).

Figure 2-1: Quantum NOT gate
Mathematically, the NOT gate can be represented as a 2 × 2 matrix:

$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad (2.4)$$

All quantum gates on n qubits can be represented similarly as 2^n × 2^n unitary matrices. A matrix U is unitary when UU† = I (U† is the adjoint of U, defined as (U*)ᵀ, where U* is the complex conjugate matrix of U). The unitarity property is necessary to ensure that the ’output’ of a quantum gate remains a unit vector.
An example of a two qubit gate is the controlled-NOT, or CNOT gate. This has the form

$$U_{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \qquad (2.5)$$
The CNOT gate is graphically represented as in Figure 2-2: the control qubit |C⟩ passes through unchanged, while the target qubit |T⟩ becomes |C ⊕ T⟩.

Figure 2-2: The CNOT gate: the representation on the right is a common shorthand
Three alternating CNOTs in succession have the effect of exchanging the values of two qubits, taking |a⟩|b⟩ to |b⟩|a⟩, as in Figure 2-3. This combination can itself be represented as a two-qubit operator, the swap gate.

Figure 2-3: Three CNOTs form a swap gate: the common representation of the swap gate is on the right
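The swap identity is easy to verify numerically, since each gate is just a small unitary matrix. A minimal numpy sketch (ours, for illustration):

```python
import numpy as np

# CNOT with qubit 0 as control and qubit 1 as target (equation 2.5).
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]])

# CNOT with the roles reversed: qubit 1 controls, qubit 0 is the target.
CNOT_10 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]])

# Three alternating CNOTs...
three_cnots = CNOT_01 @ CNOT_10 @ CNOT_01

# ...equal the swap gate, which exchanges the amplitudes of |01> and |10>.
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]])
assert np.array_equal(three_cnots, SWAP)
```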
Another useful gate is the Hadamard gate, represented in circuit diagrams by a boxed H. It has the form

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. \qquad (2.6)$$

The Hadamard gate takes |0⟩ to a superposition of states |0⟩ and |1⟩:

$$|0\rangle \to \frac{|0\rangle + |1\rangle}{\sqrt{2}} \qquad (2.7)$$
It is this ability to create and manipulate superpositions of states that gives quantum computers their inherent parallelism. To illustrate, suppose we have a function f that can be implemented with a unitary operator U_f that transforms two input qubits |x⟩|y⟩ as follows (where ⊕ signifies single bit binary addition):

$$U_f : |x\rangle|y\rangle \to |x\rangle|y \oplus f(x)\rangle \qquad (2.8)$$
Now suppose that we apply U_f to the input state with |x⟩ in the superposition shown in (2.7) and |y⟩ = |0⟩. Then

$$U_f : \frac{|0\rangle + |1\rangle}{\sqrt{2}}\,|0\rangle \to \frac{|0\rangle|f(0)\rangle + |1\rangle|f(1)\rangle}{\sqrt{2}}. \qquad (2.9)$$
This output state contains information about f(0) and f(1), so in a loose sense we have performed an evaluation of f on both 0 and 1. This notion of quantum parallelism is one of the keys to the potential power of quantum computation. However, note that we cannot extract both of the values f(0) and f(1) directly from this output state. If we attempt to measure it, we will destroy the state, and we will get one of the two measurement outcomes (0, f(0)) or (1, f(1)) with equal probability p = 0.5.
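Equations (2.8) and (2.9) can be reproduced in a few lines of numpy. The following sketch is our own illustration (with basis index 2x + y for |x⟩|y⟩, and an example choice of f); it builds U_f as a permutation matrix and applies it to the superposition input:

```python
import numpy as np

def make_Uf(f):
    """Build U_f |x>|y> -> |x>|y XOR f(x)> as a 4x4 permutation matrix.
    Basis ordering: index = 2*x + y."""
    U = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            U[2 * x + (y ^ f(x)), 2 * x + y] = 1
    return U

f = lambda x: 1 - x                  # an example one-bit function: f(0)=1, f(1)=0
Uf = make_Uf(f)

# Input of (2.9): |x> in the superposition (|0>+|1>)/sqrt(2), and |y> = |0>.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
zero = np.array([1.0, 0.0])
state_in = np.kron(plus, zero)

state_out = Uf @ state_in
# Amplitude 1/sqrt(2) now sits on |0>|f(0)> = |01> and |1>|f(1)> = |10>.
print(np.round(state_out, 3))        # [0. 0.707 0.707 0.]
```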
To unlock the information potential of quantum systems, we need another concept, that
of entanglement. A full discussion of entanglement is beyond the scope of this overview.
However, let us consider the circuit in Figure 2-4, which demonstrates another important
use for the Hadamard gate, in preparing a class of entangled states known as Bell states or
EPR pairs.
Figure 2-4: Using a Hadamard gate to generate entangled Bell states (a Hadamard applied to |x⟩, followed by a CNOT with |x⟩ as control and |y⟩ as target)
If the input to this circuit is |00⟩, then the output is

$$|\psi\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}} \equiv |\beta_{00}\rangle. \qquad (2.10)$$
At first glance, this might look like just another superposition, but this is not the case. If we were to apply a Hadamard transform to two qubits to create a superposition, the output would be

$$|\phi\rangle = \frac{|00\rangle + |01\rangle + |10\rangle + |11\rangle}{2}. \qquad (2.11)$$

Unlike (2.11), the Bell state (2.10) cannot be written as a product of two single-qubit states; the measurement outcomes of its two qubits are perfectly correlated.

2.3 Quantum algorithms
As in classical computing, our interest in quantum computing is to be able to execute useful algorithms. The quantum bits and quantum gates introduced above provide a helpful
abstraction for specifying these algorithms. Just as classical logic gates can be combined
into circuits, so quantum gates can be combined into quantum circuits. Many useful algorithms can be conveniently expressed in this form, and indeed we often use the terms
’algorithm simulation’ and ’circuit simulation’ interchangeably.
We have already seen simple circuits that perform useful functions such as swapping
qubit values (Figure 2-3) and generating Bell states (Figure 2-4). For a more substantial
example, we will discuss an algorithm to calculate an important transform known as the
quantum Fourier transform in section 2.3.1.
The circuit model is not the only way of thinking of quantum algorithms. In section
2.3.2, we consider an alternative technique, that of quantum adiabatic evolution, and apply it
to 3-SAT, a classic hard problem from traditional computer science.
The algorithms presented in this section were chosen to give an illustrative flavor of
some applications of quantum computing. They will also form the basis of some of the
performance benchmarks for the tensor product based simulation environment discussed
in Chapter 5.
2.3.1 The quantum Fourier transform
The quantum Fourier transform is analogous to the classical discrete Fourier transform, familiar from signal processing applications, which takes an input vector (x₀, x₁, …, x_{N−1}) of complex numbers, and maps it to an output vector (y₀, y₁, …, y_{N−1}) as follows:

$$y_k = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} x_j e^{2\pi i jk/N} \qquad (2.12)$$
The quantum Fourier transform has an analogous definition. Given an orthonormal basis |0⟩, |1⟩, …, |N − 1⟩, the quantum Fourier transform is a linear operator acting as follows on the basis states:

$$|j\rangle \to \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} e^{2\pi i jk/N} |k\rangle \qquad (2.13)$$
Although the above representation makes the relation between the discrete and the quantum Fourier transforms clear, an alternative equivalent representation, known as the product representation, provides a more useful structure for generating circuits to compute the quantum Fourier transform (with N = 2^n):

$$|j\rangle \to \frac{1}{2^{n/2}} \bigotimes_{l=1}^{n} \left( |0\rangle + e^{2\pi i j 2^{-l}} |1\rangle \right) \qquad (2.14)$$

$$= \frac{1}{2^{n/2}} \prod_{k=1}^{n} \left( |0\rangle + \exp\left( 2\pi i \sum_{l=1}^{k} \frac{j_{n-k+l}}{2^{l}} \right) |1\rangle \right), \qquad (2.15)$$

where j_i is the ith bit in the binary representation of j, and the lth factor in (2.14) gives the state of |k_l⟩, the lth qubit in |k⟩. This representation corresponds to the quantum circuit in Figure 2-5. Absent from the circuit is the final bit reversal operation, which reverses the order of the output qubits analogously to the bit reversal of the discrete Fourier transform. Each of the R_k in the diagram is a rotation, defined by

$$R_k = \begin{pmatrix} 1 & 0 \\ 0 & e^{2\pi i/2^{k}} \end{pmatrix}. \qquad (2.16)$$

Figure 2-5: General circuit representation of the quantum Fourier transform (Hadamard and controlled-R_k gates applied to the qubits |j₁⟩ … |jₙ⟩)
The quantum Fourier transform, in turn, is an important component of many significant larger quantum algorithms, such as Shor’s integer factoring algorithm.
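Numerically, the operator defined in (2.13) is simply a unitary scaling of the discrete Fourier transform matrix, which makes small instances easy to sanity-check. A brief numpy sketch (our illustration, not the thesis simulator):

```python
import numpy as np

def qft_matrix(n):
    """The 2**n x 2**n quantum Fourier transform operator of equation (2.13)."""
    N = 2 ** n
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

F = qft_matrix(3)

# The QFT is unitary: F F^dagger = I.
assert np.allclose(F @ F.conj().T, np.eye(8))

# Applying it to a basis state |j> spreads the amplitude uniformly, as in (2.13).
psi = np.zeros(8, dtype=complex)
psi[5] = 1.0                          # |j> = |5>
out = F @ psi
assert np.allclose(np.abs(out), 1 / np.sqrt(8))
```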
2.3.2 Quantum computation by adiabatic evolution
Although the circuit model is a convenient abstraction for representing quantum algorithms, it is not the only way of mapping problems onto quantum systems. One alternative framework [FGG+01] is based on exploiting the adiabatic theorem.

To understand the adiabatic theorem, we must introduce another fundamental concept of quantum mechanics, the Hamiltonian. The time evolution of a quantum system can be described by the Schrödinger equation:

$$i\hbar \frac{d|\psi(t)\rangle}{dt} = H(t)|\psi(t)\rangle \qquad (2.17)$$

Here, |ψ(t)⟩ is the state vector of the system at time t, ℏ is the reduced Planck constant (generally, units are chosen such that ℏ = 1), and H(t) is a Hermitian operator called the Hamiltonian of the system.
Briefly stated, the adiabatic theorem concerns the following situation: consider a quantum system whose evolution is governed by the Hamiltonian H(t). Take H(t) = H̃(t/T), where H̃(s) is a one-parameter family of Hamiltonians with 0 ≤ s ≤ 1. Let the instantaneous eigenstates and eigenvalues of H̃(s) be given by H̃(s)|l; s⟩ = E_l(s)|l; s⟩, with E_i(s) ≤ E_j(s) for i < j. The adiabatic theorem states that if the gap between the lowest two eigenvalues, E₁(s) − E₀(s), is strictly greater than zero for all 0 ≤ s ≤ 1, then

$$\lim_{T \to \infty} |\langle l = 0; s = 1 | \psi(T) \rangle| = 1. \qquad (2.18)$$

In other words, if the gap is positive as above, then if T is big enough (i.e. if the Hamiltonian changes slowly enough), |ψ(t)⟩ remains close to the instantaneous ground state |ψ_g(t)⟩ = |l = 0; s = t/T⟩ of the system.
This gives a hint as to how adiabatic evolution might be used for quantum computation: we specify our algorithm in the form of a time-dependent Hamiltonian H(t), chosen in such a way that the initial ground state of H(0) is both known and easy to construct.

For each instance of the problem, we can then construct a problem Hamiltonian H_P. Although H_P is not difficult to construct, its ground state, which encodes the solution to the corresponding instance of the problem, is difficult to compute directly. This is where we use adiabatic evolution. We set H(T) = H_P, so the ground state |ψ_g(T)⟩ of H(T) encodes the solution. T is the running time of our algorithm.

For 0 ≤ t ≤ T, H(t) smoothly interpolates between the initial Hamiltonian H(0) and the final Hamiltonian H(T) = H_P. If T is large enough, then H(t) will vary slowly. By the adiabatic theorem, then, the final state of this evolution |ψ(T)⟩ will be close to the solution state |ψ_g(T)⟩.
To illustrate, consider the example of the 3-SAT problem, which has been shown to be NP-complete [Coo71]. A problem is NP-complete if it is in the complexity class NP, and if it has the property that any other problem in NP is reducible to it by a polynomial time algorithm (by ’problem Φ is reducible to problem Φ′’ we mean that any instance of Φ can be converted in polynomial time into an instance of Φ′ with the same truth value).

An n-bit instance of 3-SAT is a Boolean formula consisting of a conjunction of clauses C₁ ∧ C₂ ∧ … ∧ C_M, where each clause C involves at most three of the n bits. The problem requires finding a satisfying assignment, that is, a set of values for the n bits that makes all of the clauses simultaneously true.
An instance of 3-SAT can be expressed in a manner suitable for the application of adiabatic evolution by constructing a Hamiltonian to represent it (a ’problem Hamiltonian’) as follows: for each clause C with associated bits z_{i_C}, z_{j_C} and z_{k_C}, define an energy function h_C(z_{i_C}, z_{j_C}, z_{k_C}) such that h_C = 0 if (z_{i_C}, z_{j_C}, z_{k_C}) satisfies clause C, and h_C = 1 otherwise. Each bit z_i is associated with a corresponding qubit |z_i⟩. Each clause C is associated with an operator

$$H_{P,C}\,|z_1\rangle|z_2\rangle \ldots |z_n\rangle = h_C(z_{i_C}, z_{j_C}, z_{k_C})\,|z_1\rangle|z_2\rangle \ldots |z_n\rangle. \qquad (2.19)$$

The problem Hamiltonian H_P is then the sum over the clauses of the H_{P,C}:

$$H_P = \sum_C H_{P,C} \qquad (2.20)$$
Given a problem Hamiltonian as above, one can solve the instance of 3-SAT by finding its ground state. To do this, it is necessary to start with a Hamiltonian H_B with a known ground state (the ’initial Hamiltonian’) and to use adiabatic evolution to go from this known ground state to the ground state of H_P. H_B is constructed as follows: define the one-bit Hamiltonian H_B^{(i)} acting on bit i thus:

$$H_B^{(i)} = \frac{1}{2}\left(1 - \sigma_x^{(i)}\right), \qquad (2.21)$$

where

$$\sigma_x^{(i)} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (2.22)$$

For each clause C, define

$$H_{B,C} = H_B^{(i_C)} + H_B^{(j_C)} + H_B^{(k_C)}, \qquad (2.23)$$
then

$$H_B = \sum_C H_{B,C}. \qquad (2.24)$$

Adiabatic evolution proceeds by taking

$$H(t) = \left(1 - \frac{t}{T}\right) H_B + \frac{t}{T} H_P, \qquad (2.25)$$

so

$$\tilde{H}(s) = (1 - s) H_B + s H_P. \qquad (2.26)$$

Start the system at t = 0 in the known ground state of H(0) (i.e. in the ground state of H_B). By the adiabatic theorem, if T is big enough and the minimum gap, g_min, between the two lowest energy eigenstates is not zero, |ψ(T)⟩ will be close to the ground state of H_P, which represents the solution to the instance of 3-SAT.
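To make equations (2.19)-(2.26) concrete, the following numpy sketch builds H_B and H_P for a single three-bit clause and forms the interpolating Hamiltonian. This is our own toy construction for illustration; the thesis's actual generator appears in Listing A.8:

```python
import numpy as np
from functools import reduce

n = 3
sigma_x = np.array([[0, 1], [1, 0]])

def op_on_qubit(op, i):
    """Embed a one-qubit operator on qubit i of an n-qubit system: I x ... x op x ... x I."""
    factors = [op if j == i else np.eye(2) for j in range(n)]
    return reduce(np.kron, factors)

# Initial Hamiltonian, equations (2.21)-(2.24), for one clause on bits (0, 1, 2).
HB = sum(0.5 * (np.eye(2 ** n) - op_on_qubit(sigma_x, i)) for i in (0, 1, 2))

# Problem Hamiltonian, equations (2.19)-(2.20), for the clause C = (z0 or z1 or not z2).
# h_C is 1 exactly on the assignments that violate C, so H_P is diagonal.
def h_C(z0, z1, z2):
    return 0 if (z0 or z1 or not z2) else 1

diag = [h_C((b >> 2) & 1, (b >> 1) & 1, b & 1) for b in range(2 ** n)]
HP = np.diag(np.array(diag, dtype=float))

# Ground states of H_P have energy 0: they are exactly the satisfying assignments.
print([format(b, "03b") for b in range(2 ** n) if diag[b] == 0])

# Interpolating Hamiltonian (2.26) at the midpoint s = 0.5; the gap between the
# two lowest eigenvalues must stay positive for the adiabatic theorem to apply.
s = 0.5
H = (1 - s) * HB + s * HP
print(np.linalg.eigvalsh(H)[:2])
```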
2.4 The tensor product
The tensor product is also known as the Kronecker product or the direct product of matrices. It is an operation on two matrices, denoted A ⊗ B. If A and B are m × n and p × q matrices respectively, then A ⊗ B is an mp × nq matrix defined as follows:

$$A \otimes B = \begin{pmatrix}
a_{1,1}B & a_{1,2}B & \cdots & a_{1,n}B \\
a_{2,1}B & a_{2,2}B & \cdots & a_{2,n}B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1}B & a_{m,2}B & \cdots & a_{m,n}B
\end{pmatrix}. \qquad (2.27)$$
The tensor product has a number of useful properties, which will be helpful later when we attempt to compute tensor products. It is associative:

$$A \otimes (B \otimes C) = (A \otimes B) \otimes C \qquad (2.28)$$

It satisfies the mixed-product property with respect to ordinary matrix multiplication:

$$(A \otimes B)(C \otimes D) = AC \otimes BD, \qquad (2.29)$$

provided that the dimensions of A, B, C and D are such that AC and BD are defined.

Inverses and transposes of tensor products have the following useful properties:

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \qquad (2.30)$$

$$(A \otimes B)^{T} = A^{T} \otimes B^{T} \qquad (2.31)$$
The above already suggests that we may be able to reduce the amount of computation performed on a large matrix if it can be expressed as the tensor product of smaller matrices. To see this, take for example the matrix A = M₁ ⊗ M₂ ⊗ … ⊗ Mₙ, and consider the relative amount of computational effort in computing A⁻¹ versus computing M_i⁻¹ for each of the n smaller matrices: by (2.30), inverting the n small factors suffices, and is vastly cheaper than inverting the full matrix directly.
Finally, there are two more properties of the tensor product that will be useful to us in calculating the trace and eigenvalues of matrices represented in tensor product form. In the case of the trace, we have that

$$\mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B). \qquad (2.32)$$

In the case of eigenvalues, if A and B have eigenvalues λ_i and μ_j respectively, with corresponding eigenvectors x_i and y_j, then

$$(A \otimes B)(x_i \otimes y_j) = \lambda_i \mu_j (x_i \otimes y_j). \qquad (2.33)$$
In other words, every eigenvalue of A ⊗ B is a product of the eigenvalues of A and B.
How is this useful to us in simulating quantum computing? It turns out that quantum circuits often have natural decompositions in terms of the tensor product. Consider for example the simple circuit for the Hadamard transform on four qubits: a single column of four H gates, one applied to each qubit in parallel. This has the tensor product representation H ⊗ H ⊗ H ⊗ H, where H is the one-qubit (2 × 2) Hadamard gate matrix. In general, any parallel sequence of gates can be represented as the tensor product of the operator matrices corresponding to them.
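In numpy, the tensor product is available as np.kron, so the decomposition above, together with the properties (2.29) and (2.32), can be checked directly. A short sketch (ours, not the thesis code):

```python
import numpy as np
from functools import reduce

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

# Hadamard transform on four qubits: H (x) H (x) H (x) H, a 16 x 16 unitary.
H4 = reduce(np.kron, [H] * 4)
assert H4.shape == (16, 16)
assert np.allclose(H4 @ H4.T, np.eye(16))

# Mixed-product property (2.29): (A (x) B)(C (x) D) = AC (x) BD.
rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((2, 2)) for _ in range(4))
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# Trace property (2.32): tr(A (x) B) = tr(A) tr(B).
assert np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))
```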
Chapter 3

Parallel simulation of quantum computation
This chapter will review the concepts that motivate the design of our simulation environment. Section 3.1 gives an overview of the type of simulations that we wish to perform,
and gives a sense of the complexity of implementing these simulations on classical computers.
One way of tackling this complexity is by using the combined power of multiple processors in parallel to perform the simulation. There are many approaches to parallel computing, and we have chosen to use an architecture known as cluster computing. Section
3.2 defines cluster computing, motivates our choice of this architecture, and describes the
challenges and limitations particular to it.
In order to exploit the potential advantages of parallel hardware, we require a means of
parallelizing the computations we will perform. In the previous chapter, we introduced the
tensor product and saw how we could use it as a structure for many quantum algorithms.
Now, in section 3.3, we will explain how the tensor product structure has been used to
guide parallelization. We will also consider how tensor product based transforms can be
efficiently applied at each of the parallel steps.
3.1 Simulating quantum computing
The phrase “simulating quantum computing” can have a number of meanings.[1] For the
purposes of this thesis, when we say that we intend to simulate quantum computation, we
mean that we will use classical computers to simulate the operation of certain quantum
algorithms, typically expressed as quantum circuits. We do not simulate the behavior of
any particular physical implementation of quantum computation, concerning ourselves
rather with the algorithmic/circuit abstraction.
In order to simulate quantum circuits, we must be able to represent each circuit, its input and its output classically. At their most basic, quantum circuits can be thought of as transforms that operate on an n-qubit input state |ψ⟩ to produce an output state |ψ′⟩, as in Figure 3-1.

Figure 3-1: Basic schematic form of a quantum circuit
As we have seen in Chapter 2, the input |ψ⟩ and the output |ψ′⟩ can be represented as state vectors of dimension 2^n, say x and y respectively. The entries in each of these vectors are complex numbers. The circuit itself can be represented by a 2^n × 2^n transform matrix U, also of complex numbers. Simulating the operation of the circuit is then simply a matter of performing the computation

$$y = Ux. \qquad (3.1)$$

The simplicity of equation 3.1 is deceptive, however. For a start, the sizes of x, y and U grow exponentially with problem size. If we store U and x and perform a full matrix-vector multiplication, we will require at least 2^(n+3) + 2^(2n+3) bytes of storage for single precision values. This translates to over 32 GB for a 16-qubit problem. We will also require O(2^(2n)) complex multiplication operations.

[1] Our use of the term ’simulation’ to refer to the simulation of quantum computations on classical computers should not be confused with quantum simulation, which typically refers to the simulation of a quantum system on a quantum computer.
Furthermore, the entries of U are usually not directly specified, but must be computed from the constituent gates of the circuit. As an example, consider the circuit in Figure 3-2. This corresponds to the multiplication

$$y = (I_2 \otimes U_f)(U_d \otimes U_e)(U_a \otimes U_b \otimes U_c)\,x, \qquad (3.2)$$

where I₂ is the 2 × 2 identity matrix.

Figure 3-2: A representative quantum circuit
The amount of computational work required to perform the above calculation naïvely turns out to be greater still than would have appeared at first, when we treated circuits as a single transform. Here, we require at least three matrix-vector multiplications, in addition to the work required to compute the three sets of tensor products.
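The following numpy sketch makes the naïve procedure of (3.2) explicit. The gate names match Figure 3-2, but the concrete gate assignments are placeholders of our own choosing (the figure leaves the gates abstract):

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# Placeholder gate assignments for the circuit of Figure 3-2.
Ua, Ub, Uc = H, X, H      # one-qubit gates on qubits 0, 1, 2
Ud, Ue = CNOT, H          # Ud spans qubits 0-1; Ue acts on qubit 2
Uf = CNOT                 # Uf spans qubits 1-2

x = np.zeros(8); x[0] = 1.0          # input state |000>

# Equation (3.2): build each layer as a full 8x8 matrix, then multiply through.
layer1 = np.kron(np.kron(Ua, Ub), Uc)
layer2 = np.kron(Ud, Ue)
layer3 = np.kron(I2, Uf)
y = layer3 @ (layer2 @ (layer1 @ x))
print(np.round(y, 3))
```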
3.1.1 Previous work in simulating quantum computation
There are relatively few simulation environments for quantum computation in the literature. Most consist of languages or environments for describing small quantum circuits
or algorithms and for simulating them on single-processor workstations. A good representative of this class is the QCL language [Öme00a], which provides basic constructs for
specifying quantum algorithms, but which has no native support for parallelism.
QCL was designed primarily as a programming language, not as a simulation environment. The ultimate intent of QCL is to provide a means to specify algorithms that
would be executed using quantum computing resources (or a mix of quantum and classical computing resources). The author does, however, provide a simulation environment,
the QC library [Öme00b] to allow QCL programs to be executed in the absence of quantum
resources.
The QC library stores quantum states as state vectors, using a compressed representation in which only non-zero amplitudes are stored. This gains memory efficiency in the case in which many amplitudes are zero, at the cost of a computational performance penalty when operating on this more complex representation in memory. In the general case, where many or all of the probability amplitudes are nonzero, essentially the full state vector must be stored.
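The general idea of such a compressed representation can be sketched as a map from basis-state index to nonzero amplitude. This is our own illustration of the technique, not the QC library's actual data structure:

```python
import numpy as np

# Sparse n-qubit state: {basis index: amplitude}, zero amplitudes omitted.
state = {0b000: 1 / np.sqrt(2), 0b111: 1 / np.sqrt(2)}   # a 3-qubit GHZ state

def apply_not(state, qubit, n):
    """Apply a NOT (X) gate to one qubit by re-keying the amplitude map."""
    mask = 1 << (n - 1 - qubit)
    return {index ^ mask: amp for index, amp in state.items()}

# Only the two stored entries are touched, instead of all 2**n amplitudes;
# but on a dense state every amplitude is hit, and the map grows to full size.
print(apply_not(state, 0, 3))        # {4: 0.707..., 3: 0.707...}
```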
The lack of support for parallelism in the QC library places significant limits on the
size of the problem that can be simulated, because of both memory and CPU cycle limitations. There is, of course, nothing in principle preventing a parallel back end from being
developed to execute QCL code.
Perhaps the most complete published parallel simulation environment is the one developed at ISI/USC by Obenland and Despain [OD98]. Their interest was particularly in
simulating a physical implementation of quantum computation using laser pulses directed
at ions in an ion trap.
The ISI team had access to high end supercomputers, specifically a Cray T3E and an
IBM SP2 multiprocessor, and they took advantage of this to execute their simulations in
parallel. They noted a significant speedup on larger problems, close to the theoretically
predicted parallel speedup.
Obenland and Despain’s work was a clear indication that parallelism could be fruitfully exploited to achieve larger, faster simulations of quantum computing. However, they
had access to high-end, purpose-built supercomputing environments. We wanted to know
to what extent results like this could be achieved on more widely available parallel environments, in particular on a cluster of off-the-shelf workstations.
The ISI simulation timings revealed that communications overhead ultimately became
the dominant time factor. Because of the highly efficient internal interconnect in the high
end supercomputers that were used, communications overhead was not a significant factor for small numbers of processors. However, when 25% of the available processors were
used, communications overhead increased to 40-60% of total execution time for many
problems. With half the processors in use, it increased to take up 60-90% of the execution time.
These findings, on a tightly-coupled multiprocessor architecture with a high speed internal interconnect, suggest that message passing based parallelism would be even harder
on clusters, where the interprocessor interconnect is substantially slower, and this was
indeed our experience.
3.2 Parallel processing and cluster computing
The regular structure of, and large number of operations involved in, many linear algebra
problems make them attractive as candidates for parallel processing. The term ’parallel
processing’ refers to any architecture in which one or more processors operate on multiple
data elements simultaneously.
Early computers designed to perform fast linear algebra operations typically made use
of vector processors. As the name suggests, vector processors perform operations (e.g.
addition or multiplication) on vectors of multiple data items rather than on single memory
addresses or registers. Vector processors execute instructions sequentially, but achieve
parallelism at the data level.
Early supercomputers typically contained one large vector processor operating at high speed. For example, the earliest and best-known true vector processor supercomputer, the Cray-1, operated on eight ’vector registers’, each of which could hold 64 eight-byte words. Vector processing has evolved into the modern concept of single instruction, multiple data (SIMD) parallelism. This is almost ubiquitous in modern processor designs, such as the Motorola PowerPC Velocity Engine, or the Intel Pentium MMX extensions.
Another, often complementary, approach to parallel computing is to increase the number of processors in the system and to parallelize the execution of algorithms across multiple processors. A number of models of parallel processing have been attempted. One popular early approach was massively parallel processing (MPP). MPP systems are so named
because they contain a large number of processors — hundreds or sometimes thousands
of them. Each processor has its own local memory, and the processors are linked using a
high-speed internal interconnect mechanism.
3.2.1 Cluster computing
In recent years, with the advent of higher bandwidth network interconnects, it has become
feasible to build parallel computing systems out of multiple independent workstations,
rather than a single machine with multiple processors. This technique is known as cluster computing. The term ’cluster’ is somewhat loosely applied to groups of networked
conventional workstations that cooperate computationally. The workstations may be heterogeneous in terms of such factors as their processing capacity (number, type and speed
of CPUs), their memory size and configuration and even the operating systems that run
on them.
We make a distinction here between clusters and so-called networks of workstations
(NOWs). Although the terms are sometimes used interchangeably in the literature, the
term ’cluster’ is generally applied to a network that, while it may consist of workstation-class computer hardware, is essentially dedicated to the task of parallel computation. A
NOW, by contrast, may comprise machines that are also used for other purposes, often
desktop machines that perform networked computation only when otherwise idle.
Clustering makes parallel processing more accessible than traditional ’single box’ parallel processing. Cluster components are cheaper, and are easily replaced or upgraded.
Clusters can be expanded, partitioned or otherwise reconfigured with relative ease. A
wider range of development tools and operating system support is available for commodity workstation hardware, making development on a cluster environment more accessible
to a general user base than development on traditional multiprocessor machines.
Clusters are becoming increasingly accepted in the high performance computing community, with 93 clusters appearing in the most recent (at the time of writing) TOP500 list
[MSDS02] of the highest performance computers in the world (ranked according to their
performance on the LINPACK benchmark). It is worth noting, however, that many of
the clusters on the list use unconventional high-speed interconnects or other customized
hardware enhancements that differentiate them from off-the-shelf computing hardware.
Indeed, there are only 14 ’self made’ clusters on the TOP500 list at this time.
There are several tradeoffs in using a cluster instead of a traditional multiprocessor
machine. Individual cluster nodes are often significantly less reliable than individual processing elements in a multiprocessor machine, an issue that becomes increasingly significant as cluster sizes increase. Furthermore, although many message passing libraries, such
as MPI [For93] and PVM [GBD+94], have been ported to cluster environments, clustering
support is often less mature than support for traditional multiprocessor environments.
The most significant disadvantage of cluster computing, though, is the decreased interprocessor communications performance relative to other multiprocessor environments.
Local memory bandwidth on typical Intel-processor-based machines is on the order of 1 gigabyte/s, with latencies around 50-90 clock cycles. High speed memory crossbars used in
some modern supercomputer designs offer even higher bandwidths.
By comparison, even fast interconnects such as Gigabit Ethernet and Myrinet offer raw
bandwidths on the order of low hundreds of megabytes per second. These raw bandwidths are further degraded by protocol overhead and inherent transport inefficiencies.
Even with low protocol overhead, network-induced latencies are at least on the order of
thousands of clock cycles [CWM01].
Probably the most widespread clustering architecture, not least of all because its definition is broad enough to encompass a wide range of different clustered environments, is the
Beowulf clustering environment [BSS+ 95], named for the initial implementation of such an
environment at NASA’s Goddard Space Flight Center. Originally applied to clusters based
on the free Linux operating system, the term ’Beowulf’ has now come to be applied to any
cluster that approximately meets the following criteria:
• The cluster hardware is essentially standard workstation hardware, possibly augmented by a more exotic fast network interconnect.
• The operating systems in use on the cluster are free Unix variants (originally Linux
was the operating system of choice for Beowulf clusters, now alternatives such as
FreeBSD are increasingly common).
• The cluster is dedicated to parallel computational purposes, and is typically accessed
through a single point of entry (the head node).
• Some combination of software to facilitate parallel computing is installed on the cluster. Such software typically includes one or more message passing libraries, and
cluster-wide administrative tools. It may also include operating system enhancements such as kernel extensions that implement a shared process numbering space.
Our parallel simulation environment is implemented on a 32-node Beowulf cluster.
More details of the cluster configuration can be found in section 4.1.
3.3 The tensor product as a means to optimize computation
There are two ways in which we use the tensor product to optimize the simulation of
quantum circuits. First, we use the tensor product structure to determine an intelligent
parallelization of the circuit. Then, we use the tensor product decomposition to minimize
the amount of matrix computation we perform.
3.3.1 Parallelizing by the tensor product
The digital signal processing community has used the tensor product as a means to structure parallel implementations of signal processing transforms for some years [GCT92].
Much work has been done in expressing common transforms such as the Fast Fourier
Transform in tensor product form, and using this representation to parallelize the application of the transform to an input vector [Pit97].
In our simulations of quantum computing, we use very similar techniques to those applied to large signal processing transforms. We apply operators (transform matrices) with
a tensor product structure to state vectors (input vectors). It seems reasonable, therefore,
that the same techniques that have been useful in signal processing would be useful to us.
To understand how the tensor product structure can be used to determine a corresponding parallelization, let us consider an idealized m × m transform A of the form
\[ A = I_n \otimes B , \tag{3.3} \]

where In is the n × n identity matrix, and B is thus of size m/n × m/n.
Now suppose we wish to calculate the matrix-vector product y = Ax, where x is a
vector of size m. This corresponds to the application of a set of operators to a state vector.
To illustrate, take m = 8 and n = 4. A then has the following form:

\begin{align}
A &= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \otimes \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix} \tag{3.4} \\[4pt]
&= \begin{pmatrix}
B_{1,1} & B_{1,2} & 0 & 0 & 0 & 0 & 0 & 0 \\
B_{2,1} & B_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & B_{1,1} & B_{1,2} & 0 & 0 & 0 & 0 \\
0 & 0 & B_{2,1} & B_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & B_{1,1} & B_{1,2} & 0 & 0 \\
0 & 0 & 0 & 0 & B_{2,1} & B_{2,2} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & B_{1,1} & B_{1,2} \\
0 & 0 & 0 & 0 & 0 & 0 & B_{2,1} & B_{2,2}
\end{pmatrix} \tag{3.5}
\end{align}
It is clear from the above that we can calculate the product y = Ax by partitioning x
into four equal partitions of two elements, and then performing four smaller calculations
of the following form (1 ≤ i ≤ 4):

\[ \begin{pmatrix} y_{2i-1} \\ y_{2i} \end{pmatrix} = B \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix} \tag{3.6} \]
Notice that the result of each of the calculations is independent of the other three results. This implies that the calculation of y = Ax can effectively be parallelized across four
processors.
More generally, any calculation of the form y = Ax where A has the form A = In ⊗
B can be parallelized over n processors, each performing a multiplication by B of some
partition of x.
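To make the parallel structure concrete, the following C sketch (our illustration, not code from the simulation environment; real-valued and sequential for brevity) applies A = In ⊗ B to a vector of length m. Each iteration of the outer loop is independent of the others, so in a message passing environment each of up to n processors could compute one partition without communicating:

    #include <stddef.h>

    /* Illustrative sketch: y = (I_n (x) B) x, with B of size (m/n) x (m/n),
     * stored row-major. The outer iterations are mutually independent; in
     * the parallel setting each would run on its own processor. */
    static void apply_In_tensor_B(const double *B, size_t m, size_t n,
                                  const double *x, double *y)
    {
        size_t b = m / n;                      /* dimension of B            */
        for (size_t j = 0; j < n; j++) {       /* one partition per process */
            const double *xj = x + j * b;
            double *yj = y + j * b;
            for (size_t r = 0; r < b; r++) {   /* yj = B * xj               */
                double acc = 0.0;
                for (size_t c = 0; c < b; c++)
                    acc += B[r * b + c] * xj[c];
                yj[r] = acc;
            }
        }
    }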
What if we have fewer than n processors? Suppose there are p processors available,
where p < n. We note that
\begin{align}
I_n \otimes B &= (I_p \otimes I_{n/p}) \otimes B \tag{3.7} \\
&= I_p \otimes (I_{n/p} \otimes B) , \tag{3.8}
\end{align}
and partition the computation into p parallel subcomputations, each involving a multiplication of a partition of x by (In/p ⊗ B).
What about the case when the tensor product representation does not take the convenient form A = In ⊗ B? Here we must make use of a permutation matrix to rearrange the
tensor product into this form. A permutation matrix P is a square matrix whose elements
are each either 0 or 1, and where
\[ P^T P = I = P P^T \tag{3.9} \]
A stride permutation matrix Pn,s (where s is a factor of n, i.e. n = ms) is a permutation
matrix which when applied to a vector x of length n rearranges it as follows:
\[ P_{n,s}\,x = \left[\, x_1,\ x_{1+s},\ x_{1+2s},\ \ldots,\ x_{1+(m-1)s},\ x_2,\ x_{2+s},\ x_{2+2s},\ \ldots,\ x_s,\ x_{2s},\ x_{3s},\ \ldots,\ x_{ms} \,\right] \tag{3.10} \]
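To make the definition concrete, the following small C sketch (ours; it assumes s divides n and uses 0-based indexing) applies P_{n,s} to a vector:

    #include <stddef.h>

    /* Illustrative sketch: y = P_{n,s} x as defined in equation (3.10),
     * with n = m*s. Group i (0 <= i < s) gathers the m elements
     * x[i], x[i+s], x[i+2s], ... into consecutive positions of y. */
    static void stride_permute(const double *x, double *y, size_t n, size_t s)
    {
        size_t m = n / s;
        size_t k = 0;
        for (size_t i = 0; i < s; i++)
            for (size_t j = 0; j < m; j++)
                y[k++] = x[i + j * s];
    }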
Now suppose that A and B are matrices of sizes m × n and p × q respectively. We can
relate A ⊗ B to B ⊗ A with stride permutations as follows:
\[ A \otimes B = P_{mp,m}\,(B \otimes A)\,P_{nq,q} \tag{3.11} \]
This is known as the commutative property of the tensor product, and it allows us to
rewrite tensor products of the form A = B ⊗ In in the form
\[ A = P_{m,\,m/n}\,(I_n \otimes B)\,P_{m,n} , \tag{3.12} \]
where m is the dimension of the square matrix A.
To illustrate, consider the following circuit fragment, where U and V are 3-qubit and
2-qubit operators respectively:
[Circuit fragment: the 3-qubit register |j1,2,3⟩ enters U, the 2-qubit register |j4,5⟩ passes through untouched, and the 2-qubit register |j6,7⟩ enters V.]
The tensor product representation of this circuit applied to the state vector x corresponding to the input state |x⟩ is

\begin{align}
[U \otimes I_{2^2} \otimes V]x &= [(U \otimes I_{2^2}) \otimes V]x \tag{3.13} \\
&= [(P_{2^5,2^3}(I_{2^2} \otimes U)P_{2^5,2^2}) \otimes V]x \tag{3.14} \\
&= [(P_{2^5,2^3} \otimes I_{2^2})(I_{2^2} \otimes U \otimes V)(P_{2^5,2^2} \otimes I_{2^2})]x \tag{3.15} \\
&= [p\,(I_4 \otimes (U \otimes V))\,p']x , \tag{3.16}
\end{align}

where p = P_{2^5,2^3} ⊗ I_{2^2} and p′ = P_{2^5,2^2} ⊗ I_{2^2} are permutation matrices.
Thus, the circuit above can be parallelized by the technique just described using up
to four processors.
The permutations p and p′ can be implemented by rearranging the elements of x. In a multiprocessor architecture, such rearrangements can often be achieved by alternate addressing of the underlying data. In a clustered environment, however, the rearrangements require communication of the elements to be rearranged between the nodes holding the relevant data. Since we are communicating only state vectors, and not full matrices, the volume of this communication is comparatively low relative to problem size. Nonetheless, avoiding unnecessary communications will be important in minimizing execution times.
3.3.2 Efficient computation of tensor product structured multiplications
We have discussed above how a tensor product representation of a quantum circuit can be
parallelized by reducing each parallel component to the form Im ⊗ B, perhaps with some
permutation of the data before and after. We have not yet considered how best to perform
each of the multiplications Bxi (where xi is the ith partition of the vector x).
Even from the very simple example above, where B = U ⊗ V , it is clear that B may well
have its own tensor product structure. The multiplication of such tensor products can be
performed efficiently using the following method, described in [BD96b]:
Suppose we wish to compute the product
\[ Bx = (M_1 \otimes M_2)x , \tag{3.17} \]
where M1 and M2 have dimensions m × n and p × q respectively. We first reshape x into a
q × n matrix thus:
\[ X_{i,j} \equiv x_{(j-1)q+i} \tag{3.18} \]
It has been shown ([Dyk87]) that the product (M1 ⊗ M2 )x can then be calculated by
finding
\[ Y = \left( M_1 (M_2 X)^T \right)^T \tag{3.19} \]
and then converting Y back into a vector using the reverse of the process by which X was
constructed in (3.18).
We have reduced the multiplication of x by an mp × nq matrix to a series of two multiplications by smaller matrices.
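The following C sketch (our illustration, real-valued and unoptimized; the production code uses complex values and optimized BLAS routines) carries out the two-factor computation end to end, using the column-major reshape of equation (3.18):

    #include <stdlib.h>

    /* Illustrative sketch: y = (M1 (x) M2) x via two small matrix products.
     * M1 is m x n (row-major), M2 is p x q (row-major), x has length n*q
     * and is viewed as the q x n column-major matrix X of equation (3.18).
     * The result y = vec(M2 * X * M1^T) has length m*p. */
    static void kron2_matvec(const double *M1, size_t m, size_t n,
                             const double *M2, size_t p, size_t q,
                             const double *x, double *y)
    {
        /* T = M2 * X, a p x n intermediate, stored column-major */
        double *T = malloc(p * n * sizeof *T);
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < p; i++) {
                double acc = 0.0;
                for (size_t k = 0; k < q; k++)
                    acc += M2[i * q + k] * x[j * q + k];
                T[j * p + i] = acc;
            }
        /* y = vec(T * M1^T), a p x m result stacked column by column */
        for (size_t j = 0; j < m; j++)
            for (size_t i = 0; i < p; i++) {
                double acc = 0.0;
                for (size_t k = 0; k < n; k++)
                    acc += T[k * p + i] * M1[j * n + k];
                y[j * p + i] = acc;
            }
        free(T);
    }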
In general, we have
\begin{align}
Bx &= (M_1 \otimes M_2 \otimes \cdots \otimes M_K)x \tag{3.20} \\
&= \left( M_1 \left( M_2 \cdots \left( M_K X \right)^T \cdots \right)^T \right)^T , \tag{3.21}
\end{align}
where X is derived from x as above.
To see how this can significantly reduce the amount of computation required, consider the case where each of the Mi (1 ≤ i ≤ K) is a square matrix of dimension n × n. Then the number of multiplications required to compute the conventional matrix-vector multiplication Bx, with B as in equation (3.20) above, is of order O(n^{2K}), since B is an n^K × n^K matrix.

The reformulation in equation (3.21) allows us to perform K matrix-matrix multiplications of the form Mi Xi, where Xi is the intermediate result formed by the sequence of multiplications by Mj, j = i + 1 . . . K (XK ≡ X). In each case Mi is an n × n matrix, and Xi is an n × n^{K−1} matrix.

Each matrix multiplication thus requires O(n^{K+1}) multiplications, and the total computation requires O(K n^{K+1}) multiplications. The computation additionally requires K matrix transpositions of n × n^{K−1} matrices. The amount of work required to perform these transpositions is, however, not great relative to the work for the matrix-matrix multiplications.
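To put rough numbers on the difference: for K = 20 single-qubit operators (n = 2), the conventional product requires on the order of (2^{20})^2 ≈ 10^{12} multiplications, while the factored form of equation (3.21) requires on the order of 20 × 2^{21} ≈ 4 × 10^7, an improvement of more than four orders of magnitude.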
In principle, the matrix-matrix multiplications and the matrix transpositions could be
implemented as parallel routines to further take advantage of multiple processors available in a parallel environment. However, in practice, in a clustered environment, the communications overhead of the requisite parallel linear algebra routines (discussed in more
detail in Chapter 5) limits the usefulness of this additional parallelization.
Chapter 4
The simulation environment
This chapter describes our parallel environment for simulating quantum computation on
a cluster of classical workstations, beginning with a description of the cluster hardware in
section 4.1.
Our first attempt at a simulation environment, discussed in section 4.2, was based on
parallel matrix operations. These parallel operations were primarily drawn from existing
optimized libraries, described in section 4.2.1. We used Matlab as the front end for this
simulation environment, drawing on existing work in interfacing Matlab to parallel linear
algebra libraries described in section 4.2.2. Our implementation, described in section 4.2.3,
followed the architecture of these prior implementations, but was tailored to perform the
functions we required for our simulation, and to operate on complex matrices.
It became apparent to us that the matrix-based implementation of our initial simulation
environment was suboptimal with respect to resource requirement scaling, both in terms
of memory usage and computation time. Seeking to improve simulation efficiency, we
developed a new simulation environment based on the tensor product structure of quantum circuits. This allowed us to apply prior work in the parallelization of tensor product
computation, and in efficient implementation of tensor product multiplications (described
above in sections 3.3.1 and 3.3.2 respectively) to our simulations.
Section 4.3 describes the new, tensor product based, simulation environment. We developed a simple circuit specification language (section 4.3.1) as an input mechanism, along
with a compiler (section 4.3.2) that translates this input into a sequence of steps to be executed by the new parallel simulation code. Section 4.3.3 describes how this compiled
representation is distributed to the nodes, and section 4.3.4 describes the actual execution.
4.1 Hardware
The hardware on which the simulation environment runs consists of a networked cluster
of off-the-shelf computers, pictured in Figure 4-1. It consists of 33 machines with a total of
68 processors, as follows:
• One head node, with 4 Gb RAM and four Intel Pentium III Xeon processors with a
900 MHz system clock speed
• Eight older cluster nodes, each with 768 Mb RAM and two Pentium III processors
with a 1 GHz system clock speed
• 24 newer cluster nodes, each with 1 Gb RAM and two Pentium III processors with a
1.2 GHz system clock speed
The nodes are interconnected using 1000BaseT gigabit Ethernet, through a switch with
a claimed backplane switching throughput of 38 Gbit/sec, enough in theory to allow for
simultaneous communication between all the nodes. Initially the nodes ran on switched
100BaseT fast Ethernet, but it soon became clear that communications overhead was a significant performance bottleneck, so the nodes were upgraded to a faster network transport.
Gigabit Ethernet was chosen because of its low cost and ease of configuration relative to
many other high speed interconnect mechanisms such as Myrinet or Fiber Channel.
Although the gigabit Ethernet network is a substantial performance improvement on
the old 100 Mbit/sec network, several factors conspire to reduce the effective maximum
throughput of the network. The PCI network adapters used in all the nodes are low-end,
32-bit wide cards. Since the nodes all have a standard ’workstation’ architecture, the PCI bus is a single bus shared by all I/O devices.

[Figure 4-1: The cluster nodes, seen from below. The eight older, lower-capacity machines are visible at the top right.]

On the software side, we are using a stock
Linux kernel, with the inefficiencies inherent in a TCP stack that must necessarily be all
things to all users.
We tested our networking setup with the well-known netperf network benchmarking tool. On an otherwise unloaded network, peak node-to-node TCP data throughput
was 482 Mbit/sec.
Reliability is a significant concern in any large network of off-the-shelf workstations.
99.5% uptime, for instance, may be entirely tolerable on a single workstation, but a cluster
with 32 nodes independently experiencing 99.5% uptime will have an unacceptably low
overall uptime of 85%.
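(Assuming nodes fail independently, the probability that all 32 nodes are up simultaneously is 0.995^{32} ≈ 0.852.)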
Two areas in which particular work was required to achieve acceptable reliability were
cooling and hard drive reliability. In order to prevent frequent overheating, we found it
necessary to physically enclose the ceiling mounted rack in which the cluster nodes were
mounted, so that forced air could be channeled downwards from vents in the ceiling to the
front of the machines.
We also found that the hard drives on some of the nodes would frequently perform erratically. Whether this was due to exposure to electromagnetic radiation in the laboratory,
mechanical unreliability, or other causes was not conclusively established, but we did find
that running the machines disklessly improved uptime significantly.
Diskless operation also simplified configuration: all machines mount their root file
systems from a shared NFS server connected directly to the gigabit switch. There are no
swap volumes; for our application, swapping would in any case impose an unacceptable
performance penalty.
The nodes are used to implement a Beowulf cluster: each runs an essentially standard
version of the Linux operating system (2.4 kernel version) and has available both the MPI
and PVM message passing libraries.
We initially ran the MOSIX [BGW93] kernel extensions, which provide automatic load
balancing for Beowulf clusters. We found, however, that for our particular application it
was both possible and desirable to determine the distribution of processes to nodes ourselves. The overhead involved in MOSIX automatic process switching was too great, and
the message passing libraries did not support seamless process migration.
4.2 Initial software implementation
Our first generation simulation environment used an elementary linear algebraic representation of quantum states and operations. States were stored as complex valued state
vectors, and operators were implemented as full matrices. This representation allowed us
to use existing libraries of fast parallel linear algebra routines to perform the computations
at the core of the simulation.
We used Matlab as the front end for the simulation environment. We drew on prior
work on interfacing Matlab with parallel linear algebra libraries to develop a client-server
interface, in which a Matlab client exchanged data with, and sent requests to, a parallel
linear algebra server.
4.2.1 Overview of libraries used
The linear algebra routines we used were drawn from the PBLAS [CDO+ 95] and ScaLAPACK [BCC+ 97] libraries. These are highly optimized general purpose FORTRAN based
parallel linear algebra libraries, first designed to be used on multiprocessor computers
such as the Intel Paragon or various Cray supercomputers. They have, however, more
recently been ported to run in a clustered environment.
To achieve parallelism, the routines in the linear algebra libraries distribute the data being operated on across all the processors involved in the computation. To do this, they
make use of the two dimensional block cyclic distribution, illustrated in Figure 4-2.
[Figure 4-2: The two dimensional block cyclic distribution. Illustrated here is a representative 8 × 8 matrix M distributed among four processors with a block size of 2 × 2; numbers in boxes indicate the processor to which each block is assigned.]
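The mapping itself is inexpensive to compute. The following C sketch (our illustration of the standard mapping, not code from the libraries themselves) returns the rank of the process owning global element (i, j), given a Pr × Pc process grid and br × bc blocks:

    /* Illustrative sketch: owner of global element (i, j) under a two
     * dimensional block cyclic distribution; 0-based indices, processes
     * ranked row-major across the Pr x Pc grid. */
    static int block_cyclic_owner(int i, int j, int br, int bc, int Pr, int Pc)
    {
        int prow = (i / br) % Pr;   /* process row owning block row i/br    */
        int pcol = (j / bc) % Pc;   /* process column owning block col j/bc */
        return prow * Pc + pcol;
    }

For the 8 × 8 example of Figure 4-2 (2 × 2 blocks on a 2 × 2 grid), this assigns element (0, 0) to processor 0 and element (0, 2) to processor 1, matching the figure.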
The developers of parallel linear algebra libraries chose the two dimensional block
cyclic distribution to provide a good compromise between efficient allocation of resources
across all the CPUs, and efficient local matrix computation on each CPU [DvdGW94]. Each
local CPU uses the single-processor BLAS (Basic Linear Algebra Subroutines) library to
perform matrix operations on its local portions of the distributed matrices.
To communicate amongst the processors, the libraries use a standardized communication mechanism, BLACS (the Basic Linear Algebra Communications System). This is a
message passing interface for linear algebra that in turn relies on an underlying low-level
message passing library such as MPI or PVM to perform the actual inter-process communications. We used both message passing libraries interchangeably to see whether there
were any significant performance differences between them, and found none.
We chose to use Matlab as a front end for the first generation simulation environment,
because it provides a fairly user-friendly environment for the manipulation of results, and
because we had a number of legacy single-processor simulation routines written in Matlab
that we wished to port to a parallel environment.
4.2.2 Prior work in parallelizing Matlab
Matlab itself does not support parallel operations [Mol95]. There have been a number of
third party attempts to provide parallel functionality for Matlab. These can be roughly
divided into three classes: those that add a message passing layer to Matlab; those that
facilitate the parallelization of ’embarrassingly parallel’ Matlab routines (e.g. certain easily
parallelizable loop constructs); and those that provide an interface between Matlab and an
external set of parallel numerical libraries.
The message passing extensions were not ideally suited to our application, because
they required the Matlab user to devote a great deal of time and attention to the details
of parallel implementation. Nor did the nature of our quantum code lend itself to an
embarrassingly parallel approach.
The most potentially useful Matlab extensions for our application, therefore, were
those that interfaced to external parallel linear algebra libraries. This idea was first developed in MATPAR [Spr98], which introduced the concept of a client-server extension
to Matlab, with Matlab scripts making requests to an independent parallel linear algebra
server. This architecture was extended by MATLAB*P [CE02], which provided a wider
range of library functions.
At the time that we were developing our parallel Matlab based simulation environment, Matpar was not available for Intel x86 based clusters, and MATLAB*P did not support complex valued matrices, an essential requirement for our application.¹ We therefore
had to develop our own parallel server to use with Matlab. The approach that we followed
was based directly on the key architecture of Matpar and MATLAB*P, but focused on support for the specific functions that we required for our simulation, operating on complex
valued matrices.
4.2.3 Design of the parallel Matlab environment
Our first simulation library consisted of multiple layers on both the client and the server
side, illustrated in Figure 4-3. On the client side, user code calls a set of library routines,
implemented as Matlab .m scripts. These, in turn, call a single Matlab extension (MEX),
written in C, that forms the ’bridge’ between the client and the cluster. The MEX software sends a series of commands to the cluster to instruct it to perform the appropriate
operations. It also transfers data between Matlab and the cluster.
[Figure 4-3: Layering of libraries in the first generation simulation environment. Client (workstation): user Matlab code → library functions (Matlab .m scripts) → Matlab extension (MEX, C code) → network transport (TCP/IP). Server (cluster): simulation library → PBLAS and ScaLAPACK → BLACS → message passing library (MPI or PVM) → network transport (TCP/IP).]
¹ A more recent (August 2002) version of MATLAB*P introduced support for complex valued matrices, but this postdates our development work.
The Matlab extension uses a lightweight TCP/IP protocol of our own devising to communicate with the cluster. On the cluster, a single node is arbitrarily chosen as the ’master’
node. The master accepts TCP/IP connections from Matlab. It receives commands and
data from the Matlab extension and relays these to the other nodes on the cluster. State
diagrams for the master node and the slave nodes are shown in figures 4-4 and 4-5 respectively (incidental error processing is omitted).
[Figure 4-4: State diagram for the parallel Matlab server master node.]
To homogenize the communications mechanism across the cluster, the inter-node command mechanism is not implemented as a direct TCP/IP protocol, but rather through the
BLACS. Commands are represented as short one-dimensional arrays. The first entry in the
array is an opcode for the command, and the remaining entries are parameters (typically
either references to existing matrices to be operated on, or dimensions of new matrices to
be created).
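For illustration, a command of this form might look as follows. This is a hypothetical encoding: the actual opcode values and operand layout of our protocol are not reproduced here.

    /* Hypothetical example of the one-dimensional command array format:
     * entry 0 is an opcode, the remaining entries are parameters (matrix
     * references or dimensions). The opcode values are invented. */
    enum { OP_CREATE = 1, OP_MULTIPLY = 2, OP_FREE = 3 };

    int command[4] = {
        OP_MULTIPLY,  /* opcode: matrix-matrix multiply                 */
        3,            /* reference (serial number) of the left operand  */
        7,            /* reference of the right operand                 */
        9             /* reference under which to store the result      */
    };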
[Figure 4-5: State diagram for the parallel Matlab server slave nodes.]

The typical processing sequence is as follows: initial data is either loaded into the cluster from Matlab or, where possible, generated directly on the cluster. A series of operations
is performed on the cluster, and the final result is returned to Matlab. Until the final transfer at the end, the matrices exist only on the cluster, and Matlab operates on them through
a reference that incorporates a serial number generated by the master node when each
matrix is created.
Listing 4.1 shows an illustrative example of a Matlab script that calls the parallel simulation library (parallel library calls are highlighted in bold). This example applies an
n-qubit quantum Fourier transform to an input vector v. To assist in understanding the
listing, it may be helpful to consult the description of the quantum Fourier transform in
section 2.3.1, and in particular the circuit diagram in Figure 2-5.
Notice how the computation begins by setting up the initial data, either by
transferring it from Matlab onto the cluster (as in line 6, which uses the pput function to
transfer the input vector to the cluster) or by generating it directly on the cluster (as in line
7, which uses the peye function to generate an identity matrix).
Recall that operators are stored as full matrices. To save time and memory space on
the Matlab workstation, operators are transferred in their compact matrix representation
using the pputop function, as in line 5. They are then converted, or promoted, to full n-qubit (2^n × 2^n) matrices by the ppron function.
 1  % pqft.m Calculate Quantum Fourier Transform
 2  % v = input vector, n = number of qubits

 4  function r = pqft(v, n)
 5  hadamard = pputop([1 1;1 -1] / sqrt(2));
 6  vec = pput(v);
 7  u = peye(2^n);
 8  for k = (n-1:-1:0)
 9      v = ppron(hadamard,2^n,k,k);
10      u = v * u;
11      for j = (k-1:-1:0)
12          % crk = controlled rotation by k
13          t = pputop(crk(exp(2*i*pi/2^(k-j+1)),1));
14          op = ppron(t, 2^n, j, k);
15          u = op * u;
16          pkill(op);
17      end
18  end
19  f = u * vec;
20  pkill(u); pkill(vec);
21  r = pget(f);
22  pkill(f);

Listing 4.1: A sample parallel Matlab script: This example calculates the quantum Fourier transform. Functions in bold are parallel functions, described in the text.

Since operator promotion will be an important element in our consideration of the memory demands of the parallel Matlab environment, it is worth considering the operation of ppron in a little more detail. The ppron function operates on either a one-qubit
(2 × 2) operator matrix, as in line 9, or a two-qubit (4 × 4) operator matrix, as in line 14.
It takes four parameters, a reference to the operator matrix O, the dimension N of the full
matrix to create, and the (zero-indexed) qubit numbers j and k on which the operator is to
operate (for a single qubit operator j = k). The function returns a reference M to the full
matrix that is created.
For a single-qubit operator, the effect of ppron is to create a matrix M of the following form (where n = log2 N, and I is the 2 × 2 identity matrix):

\[ M = \underbrace{I \otimes \cdots \otimes I}_{k\ \text{terms}} \otimes\; O \otimes \underbrace{I \otimes \cdots \otimes I}_{n-k-1\ \text{terms}} \tag{4.1} \]
The effect for two-qubit operators is similar.
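A dense promotion of this kind can be computed entrywise: M(r, c) equals the corresponding entry of O for the bit on which O acts, provided r and c agree on every other bit, and is zero otherwise. The following C sketch (ours, not the ppron implementation itself) builds the full matrix of equation (4.1) for a single-qubit operator:

    #include <complex.h>
    #include <stddef.h>

    /* Illustrative sketch of equation (4.1): M = I (x) ... (x) I (x) O (x)
     * I (x) ... (x) I, with k leading identities, for a single-qubit
     * operator O, built entrywise into a dense row-major N x N matrix,
     * N = 2^n. The k leading identities correspond to the k most
     * significant index bits. */
    static void promote_1q(const double complex O[2][2], int n, int k,
                           double complex *M)
    {
        size_t N = (size_t)1 << n;
        int bit = n - 1 - k;                /* LSB-indexed bit O acts on */
        size_t mask = ~((size_t)1 << bit);  /* all bits except O's       */
        for (size_t r = 0; r < N; r++)
            for (size_t c = 0; c < N; c++)
                M[r * N + c] = ((r & mask) == (c & mask))
                    ? O[(r >> bit) & 1][(c >> bit) & 1]
                    : 0.0;
    }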
Notice that the matrix-matrix multiplications at lines 10 and 15 and the vector-matrix
multiplication at line 19 are parallel operations whose operands are references to matrices
(or vectors) stored on the cluster. The conventional ’*’ operator has been overloaded to
call the appropriate parallel function when its operands are parallel matrix references.
Once the computation is complete, the pget function at line 21 transfers the result
vector from the cluster back into Matlab. Only this vector, of 2n complex numbers, needs
to be transferred; all of the intermediate calculations are performed entirely on the cluster.
Matlab does not provide a reliable means of determining when a variable goes out of
scope. It is therefore necessary to free matrices that are no longer in use manually, in order
to optimize memory utilization on the cluster. This is the purpose of the pkill function
at lines 16, 20 and 22.
4.3 The tensor product based simulation environment
The initial simulation environment demonstrated that it was possible to simulate quantum algorithms on the cluster. However, it also highlighted several significant limitations,
leading us to develop the new (tensor product based) simulation environment to address
these limitations.
One of the obvious drawbacks of the initial simulation environment was the approach
of storing everything as a full (problem-sized) matrix. This led to severe memory scaling
limitations. For an N-qubit problem, each matrix requires 2^N × 2^N complex values, or 2^{2N+3} bytes (since each complex value requires eight bytes at single precision). Almost
all computations involve matrix multiplication, which requires three matrices: two source
matrices to be multiplied, and a third matrix to store the result.
It follows that the cluster is practically limited to problems of a maximum of 14 qubits
in size. Three 15-qubit matrices would require 24 Gb of RAM (assuming single precision
complex values, each of which requires eight bytes). Although this is less than the total
amount of RAM available on the cluster nodes (30 Gb), the nature of the block cyclic distribution requires that we use equal amounts of memory on all nodes. If we use all 32
nodes, we are therefore constrained by the memory capacity of the eight 768 Mb nodes. If
we exclude these nodes, we are reduced to the 24 nodes with 1 Gb each. Since we require some RAM
above the matrix storage RAM for other application data and for the operating system
on each machine, it follows that, for most practical applications (i.e. those involving any
matrix-matrix multiplication), 15 qubit problems are too large to fit in memory.
The large, exponentially scaling matrices in the simulation environment not only tax
the memory capacity of the cluster, they also require a significant amount of computation and, even more deadly to performance, communication. Since communication is the
Achilles heel of any Beowulf cluster, the vast volumes of data that need to be communicated can cause performance to suffer dramatically.
So, having answered the question “can we simulate?” in the affirmative, the question became “can we do better?” We hypothesized that a storage and computation
mechanism that took advantage of the tensor product structure inherent to many quantum algorithms would be more efficient, and so we set out to develop a second generation
simulation environment based on the tensor product structure.
Our design for a tensor product based simulation environment required several components: a means of specifying algorithms (circuits); a mechanism for translating these
specifications into a form suitable for parallel execution; a distribution system to parallelize the computation across multiple nodes; and code to perform the actual computations
on each node. Each of these components is discussed in order below.
4.3.1 Algorithm (circuit) specification language
We wanted an algorithm specification mechanism that would be both human- and machine-friendly. Algorithms should be straightforward for humans to enter and to understand in
their source form. They should also be easy to generate by automated means (e.g. by a
script that automatically generates problems of a given size from a template), and easy to
parse.
To this end we developed a simple algorithm specification language. An abbreviated
grammar for the language appears in Table 4.1, and an illustrative sample algorithm encoded in this language is in listing 4.2, which is a literal encoding of the circuit to perform
a three-qubit quantum Fourier transform.
Tokens:
  identifier   ≡  [A-Za-z][A-Za-z -]*
  real         ≡  [-]?[0-9]+[.][0-9]+
  integer      ≡  [0-9]+
  operator     ≡  Any previously defined identifier or ’I’
  filename     ≡  Valid Unix filename, with optional path
  (Literals)   ≡  Identified in quotes below

Rules:
  input        →  line | input line
  line         →  statement ’;’
  statement    →  definition | operation | permutation | size | inc
  definition   →  ’define’ identifier ’(’ integer ’)’ matrix
  matrix       →  ’[’ numlist ’]’
  numlist      →  complex | numlist complex
  complex      →  realnum | ’(’ ’,’ realnum ’)’ | ’(’ realnum ’,’ realnum ’)’
  realnum      →  real | ’%(’ real ’)’ | ’-’ ’%(’ real ’)’
  operation    →  operator ’(’ intlist ’)’
  permutation  →  ’perm’ ’(’ intlist ’)’
  intlist      →  integer | intlist ’,’ integer
  size         →  ’size’ integer
  inc          →  ’include’ filename

Table 4.1: Abbreviated grammar for the tensor product specification language
The overall syntax is conventional: comments start with a ’#’ and statements are terminated with a semicolon. Algorithms typically start with a declaration of the size of the
circuit in qubits (line 3), and declarations of the operators (lines 5–16). Only one size declaration is allowed, and it must appear before the first line in which an operator is used.
Operators may be defined anywhere prior to their use, and may not be redefined.
An operator definition specifies the name of the operator, the number of qubits on
which it operates, and the values of the operator matrix. The values are complex numbers,
which may be specified with their real part only (e.g. 1.0); with their imaginary part only
(e.g. (,1.0) for i); or with both real and imaginary parts (e.g. (%(0.5),%(0.5)) for 1/√2 + (1/√2)i). The notation %(x) represents √x.

 1  # 3 bit QFT - hand coded based on circuit on p220 of Nielsen & Chuang

 3  size 3;

 5  define c_s(2) [ 1.0  0.0  0.0  0.0        # controlled phase
 6                  0.0  1.0  0.0  0.0
 7                  0.0  0.0  1.0  0.0
 8                  0.0  0.0  0.0  (,1.0) ];

10  define c_t(2) [ 1.0  0.0  0.0  0.0        # controlled pi/8
11                  0.0  1.0  0.0  0.0
12                  0.0  0.0  1.0  0.0
13                  0.0  0.0  0.0  (%(0.5),%(0.5)) ];

15  define h(1)   [ %(0.5)  %(0.5)            # hadamard
16                  %(0.5) -%(0.5) ];

18  h(0);
19  c_s(1,0);
20  c_s(2,1);
21  h(1);
22  c_t(2,0);
23  h(2);
24  perm(2,1,0);

Listing 4.2: A sample algorithm specification: Calculating a three-bit quantum Fourier transform.
The circuit itself is described as a sequence of operators, as in lines 18–23. Each operator
is followed by the qubits on which it operates, specified in the order in which they are input
to the operator. Bit numbers are zero indexed. Notice that there is no explicit specification
of the tensor product structure, as this structure is inferred by the compiler.
It is, however, useful when specifying an algorithm to consider the ordering of the
operators to ensure that an optimal structure is inferred. Consider the circuit in Figure 4-6.
This can be represented as in Listing 4.3, but from this the compiler will infer the sequence
\[ U_a \otimes I \otimes I;\quad U_b \otimes U_c \otimes I;\quad U_d \otimes U_e \otimes I;\quad U_f \otimes I \otimes I . \tag{4.2} \]
[Figure 4-6: Circuit to demonstrate different specification orderings: qubit 0 passes through Ua then Ub, qubit 1 through Uc then Ud, and qubit 2 through Ue then Uf. Listings 4.3 and 4.4 are both implementations of this circuit, but listing 4.4 is implemented more efficiently by the compiler.]
U_a(0);
U_b(0);
U_c(1);
U_d(1);
U_e(2);
U_f(2);

Listing 4.3: Inefficient specification of the circuit in Figure 4-6

U_a(0);
U_c(1);
U_e(2);
U_b(0);
U_d(1);
U_f(2);

Listing 4.4: Efficient specification of the circuit in Figure 4-6
If we interleave the operators as in listing 4.4, the compiler is able to infer the more
efficient sequence
\[ U_a \otimes U_c \otimes U_e;\quad U_b \otimes U_d \otimes U_f . \tag{4.3} \]
Finally, the permutation statement perm reorders the qubits in the order specified. The
statement is rarely used, as most permutations are generated internally when the
algorithm is compiled. Returning to our original example, we have used a permutation at
line 24 to simulate the final bit reversal operation.
Note that permutations are not operators in the strict sense. Suppose we have an n-qubit problem, and we define the operator swap(2) as the two-qubit SWAP gate. Then the statements swap(0,1) and perm(1,0,2,3,. . .,n−1) will have the same effect of exchanging the values of bits 0 and 1, but they will be implemented differently by the compiler.
The permutation will be implemented as a data exchange, and may possibly be optimized
away entirely by the compiler, while the swap operation will be implemented as a gate
like any other. Since there is no difference in practice between the effect of these two statements, the permutation might generally be preferred as it is slightly more efficient.
There is one other statement not illustrated in this example, the include statement.
This is analogous to the #include preprocessor directive in C, inserting the contents of
another source file as if the contents of the included file appeared directly at the location of the
include statement. This allows libraries of standard gates to be defined in one place, and
included in multiple algorithms without tedious redefinition.
4.3.2 Compilation
We have alluded to the operation of the compiler above. Now let us consider its operation
in more detail.
The compiler accepts as its input circuit specifications of the form described in section 4.3.1. It generates an internal tensor product based representation of the circuit which
directly forms the basis of the distribution of the problem to multiple processors for execution.
The compiler goes through several steps, which will be illustrated with reference to
the code fragment in Listing 4.5, implementing the circuit in Figure 4-7. As the compiler
proceeds, it builds and then refines an internal structure that represents the circuit. This
structure will ultimately be used to guide the distribution and execution of the simulation.
1  size(5);
2  A(2);
3  D(3,4);
4  B(2,1);
5  E(3,0);
6  perm(0,1,3,2,4);
7  F(3,4);
8  C(2);
Listing 4.5: Input file for the compilation example: Definitions of the operators are omitted.
The compiler starts by parsing the input file and building an initial version of the algorithm structure. The parser is written using a combination of the bison and flex tools (the
[Figure 4-7: Circuit for the compilation example: five qubits |x0⟩ . . . |x4⟩ pass through the operators of Listing 4.5 (A, D, B, E, a swap of qubits 2 and 3, F, and C), producing outputs |y0⟩ . . . |y4⟩.]
enhanced GNU versions of lex and yacc, respectively). The key parser code is presented
in Appendix A, in listings A.1, A.2, and A.3.
The internal structure that the compiler builds up consists of a linked list of records.
The possible record types are described in Table 4.2. To further clarify the difference between the B and P records described in the table, suppose we have three qubits, and they
undergo two permutations: an initial permutation that exchanges the values in bit positions 1 and 2, followed by a permutation that exchanges the values in bit positions 0 and 2.
The two permutations would be represented with B records as B(0,2,1) and B(1,2,0)
respectively, and with P records as P(0,2,1) and P(2,1,0).
Type         Written       Description
Operator     O(op, bits)   Specifies an operator (op), and indicates the qubits (bits), in order, on which it operates.
Bit order    B(bits)       Specifies a permutation of the qubits by defining the new order (bits) of the bits relative to the initial bit ordering.
Permutation  P(bits)       Specifies a permutation of the qubits by defining the new order (bits) of the bits relative to the previous permutation (or the initial ordering, if there is no previous permutation).
Terminator   T             Delimits the end of a set of operator terms forming a single tensor product.

Table 4.2: Record types for the internal compiler data structure
The compiler scans the list of operations in top-to-bottom order, inferring the tensor
product structure of the circuit according to the following rules:
1. Operations on independent bits are combined into a single tensor product. An example is lines 2 and 3, which are combined to form the product A2 ⊗ D3,4 .
2. Re-use of a bit starts a new tensor product, as at line 4, where bit 2, previously operated on at line 2, is operated on again.
3. Each operator must operate on consecutive, increasingly numbered bits. Thus, for
example, B(2,1) at line 4 must be renumbered to B(1,2), and E(0,3) at line 5
must be renumbered to E(0,1) (or some equivalent consecutive bit numbering).
Permutations are inserted to effect these renumberings.
4. Tensor products are reordered to the form I ⊗ . . . ⊗ I ⊗ op ⊗ . . . ⊗ op. Permutations
are inserted to effect this re-ordering. The re-ordering is necessary so that we can
later factor out the identity matrix to determine the parallelization of the code, as
described in section 3.3.1.
The compiler starts by applying rules 2 and 3. It renumbers bits within each operator
as necessary, generating B record style permutations, as in Figure 4-8(a). It then reorders
operations within each tensor product according to rule 4 (Figure 4-8(b)).
Having inserted permutations in the previous steps, the compiler then optimizes the
permutations in the record list to minimize the total number of permutations (Figure 4-8(c)). Although not illustrated in this particular example, the compiler optimizes across
tensor product terms where possible.
The B records are a convenient form for the compiler to work with, but ultimately the
permutations must be implemented by data transfers. P records specify the nature of these
transfers in a more direct manner. In the final step, therefore, the compiler rewrites all the
B records as P records, and inserts a final P record if necessary to restore the ordering of the
output bits to the order that the user expects (typically 0, 1, 2, . . . , n − 1, except where a final
permutation has been specified in the input code).
[Figure 4-8: State of the internal data structure at each stage of compilation: (a) after rearranging bits within operators; (b) after shifting operators to high bits; (c) after optimizing permutations; (d) final form. In the final form the record list reads: O(A,2); O(D,3,4); T; P(3,2,4,1,0); O(B,1,2); O(E,3,4); T; P(2,3,4,1,0); O(F,2,3); O(C,4); T; P(4,3,0,1,2).]
4.3.3 Distribution
With compilation complete, we now in effect have a complete ’map’ of the computation,
and are ready to distribute it across multiple processors for parallel execution. To do this
we use the method outlined in section 3.3.1.
After compilation, the computation has the form
\[ P_0 (I_{n_1} \otimes U_{1,1} \otimes \cdots \otimes U_{1,k_1})\, P_1 (I_{n_2} \otimes U_{2,1} \otimes \cdots \otimes U_{2,k_2}) \cdots P_{N-1} (I_{n_N} \otimes U_{N,1} \otimes \cdots \otimes U_{N,k_N})\, P_N , \tag{4.4} \]
where each of the Pi (0 ≤ i ≤ N ) is a permutation (possibly the identity permutation), the
Ini are ni × ni identity matrices, and the Ui,j are operators. Note that the Pi are permutations of values in the state vector, derived from the qubit permutations generated during
compilation.
The system level steps involved in executing the simulation are outlined in Figure 4-9. This outline emphasizes the communication steps required to distribute the problem
and to rearrange the intermediate result vectors during the computation. The execution
process itself is considered in more detail in section 4.3.4.
To better understand the communication patterns involved, note that the state vector x is multiplied by each of the tensor products sequentially, and recall that for each of the tensor products Ti, it is the identity matrix I_{n_i} of expression 4.4 that determines the parallelism of the corresponding computation. Let the total number of processors available to perform the simulation be p. If ni ≤ p, then each of ni processors computes

\[ (U_{i,1} \otimes U_{i,2} \otimes \cdots \otimes U_{i,k_i})\, x_j , \tag{4.5} \]

where 1 ≤ j ≤ ni, and xj is a partition of x. If ni ≥ p, then each of p processors computes

\[ (I_{n_i/p} \otimes U_{i,1} \otimes U_{i,2} \otimes \cdots \otimes U_{i,k_i})\, x_j , \tag{4.6} \]
where 1 ≤ j ≤ p.

1:  Launch parallel processes
2:  Send algorithm to all parallel processes
3:  Distribute initial input vector x among the processes
4:  for i = 1 to N do
5:      Perform computation x = Ti x, where Ti is the ith tensor product
6:      if the number of processors required for Ti+1 ≠ the number of processors required for Ti then
7:          Redistribute x to the processors involved in Ti+1, taking Pi into account as necessary
8:      else if i < N and Pi is not the identity permutation then
9:          Redistribute the portions of x necessary to implement Pi
10:     end if
11: end for
12: Retrieve the result vector, taking PN into account as necessary

Figure 4-9: System-level overview of the parallel execution of a problem, emphasizing the communication (distribution and redistribution of data) involved
Thus, if ni+1 ≠ ni and either ni < p or ni+1 < p, a different number of processors will
be used to compute Ti x, and x will need to be repartitioned. In our clustered environment,
this translates into communication of the contents of x between the processors involved in
computing Ti x and those involved in computing Ti+1 x.
More communication must occur if permutation Pi is not the identity permutation. In
this case, communication need only occur between those processors holding values that
must be permuted.
The simulation begins by launching the simulation code on all the processors involved
in the simulation. Although the number of active processors may vary during execution,
the simulation processes are present on all processors throughout the course of the simulation, even if idle. This seemed the most reasonable approach given the added complexity
and startup overhead of dynamically launching and killing processes, and given that there
is control of idle processes at the operating system level in any case.
The process running on the machine on which the simulation was launched is designated the master process. The role of the master is substantially diminished compared to
the architecture of the parallel Matlab simulation environment. Its primary function is to
collect the final result state vector (step 12 in Figure 4-9).
The algorithm specification (step 2) and the initial input vector (step 3) are passed to the
running processes as files on a shared NFS file system. The specification file is generated
by the compiler, and includes the compiled algorithm specification and operator matrices
for all the operations used. The input vector file, giving a list of values to populate the
initial state vector, is generated by the user. A utility program allows the user to generate
a state vector by providing a set of qubit values.
Since each processor ’knows’ its position in the processor group, and since all processors have the identical algorithm specification, each can perform its sequence of tensor product computations independently until it needs to communicate with one or more ’partner’ processors.
If it reaches a communication point, it blocks until the partner is ready to communicate.
Consider an illustrative example. Suppose we have four processors (P0 . . . P3), and an algorithm specification as in Figure 4-10. The sequence of computations performed by each processor is illustrated in Figure 4-11. Each box represents a computation, and communications are indicated with connecting lines labeled with the values being communicated. The notation x_j^k denotes the partition (x_j, x_{j+1}, . . . , x_k) of x.
O(U1,1); O(U2,2); O(U3,3); T        (T1 = I2 ⊗ U1 ⊗ U2 ⊗ U3)
O(U4,2); O(U5,3); T                 (T2 = I4 ⊗ U4 ⊗ U5)
P(0,1,3,2)
O(U6,2); O(U7,3); T                 (T3 = I4 ⊗ U6 ⊗ U7)
P(0,1,3,2)

Figure 4-10: Algorithm specification to illustrate computation sequence
[Figure 4-11: An example computation sequence, illustrating communication patterns. At point 1, P0 computes T1 on x_1^8 while P2 computes T1 on x_9^16. At point 2, P0 sends x_5^8 to P1 and P2 sends x_13^16 to P3; each of the four processors then computes T2 on its quarter of x (point 3). At point 4, P1 and P2 exchange x_5^8 and x_9^12 to implement the permutation, while P0 and P3 continue directly with T3. After T3 (point 5), P1, P2 and P3 transfer their partitions to the master node P0 (points 6–7).]

The diagram illustrates all the cases in which communication is necessary. The first is at point 2, since T1 is parallelized over two processors and T2 is parallelized over four processors. The next case is at point 4, where a permutation is called for. Processors P0
and P3 are not involved in the data exchange for this permutation, so they continue to
execute T3 at point 4. Finally at point 6, all the data is transferred to the master node (P0 )
for reassembly and output. Note that the final permutation in Figure 4-10 is not explicitly
implemented as a separate step. Instead, P0 uses this final permutation to guide the order
of reassembly. This saves an additional communication between P1 and P2 .
Notice that although P0 and P3 are able to continue computation at point 4 without
blocking, the computation ultimately has to wait for P1 and P2 to complete T3 . We do,
however, gain some efficiency by requiring synchronization points only on those processors that have work to be synchronized. Thus, in this example, only two nodes need to execute a synchronization handshake, which is simpler and more efficient than a four-node
synchronization.
Since the data being transferred consists entirely of partitions of vectors, it made sense
to use the BLACS to implement inter-node communications. BLACS provides routines for
exchanging vector data, along with the necessary synchronization primitives.
4.3.4 Execution

Finally, let us consider the execution process at the individual node level. State diagrams
for the master node and the slave nodes are shown in figures 4-12 and 4-13 respectively.
Most of the entries in the diagrams relate to the inter-node communications discussed in
section 4.3.3. In this section we will concentrate on the processing of the tensor products
themselves (i.e. the central entry in the state diagrams).
[Figure 4-12: State diagram for the new simulator master node.]
To calculate the products Ti x_j^k efficiently, we use the procedure discussed in section 3.3.2. The code is based on the TENPACK [BD96a] Fortran library for tensor product
manipulation, which implements the algorithms in [BD96b] for a wide range of matrix
types (full, banded, positive definite, symmetric and triangular). Our code implements
the matrix-vector multiplication algorithm from [BD96b] in C, optimized for the general
matrices and with provision for the efficient computation of initial identity matrix terms.
The key function is tp_mult(), which has the following declaration:

void tp_mult(FLOAT *tp_array, int *sizes, int n, int id_size,
             FLOAT *x, FLOAT *work);
[Figure 4-13: State diagram for the new simulator slave nodes.]

tp_array is a one dimensional array of complex numbers (each complex number is
implemented as a consecutive pair of floats), consisting of the values of the operator
matrices, ordered by rows then by columns, in the order in which the operators appear in
the tensor product. The dimension of each of the operators in order is specified in array
sizes, and n gives the total number of operators, excluding any leading identity matrix.
Clearly, it would be both space-inefficient to store a leading identity matrix, and time-inefficient to multiply by it unnecessarily. To address this, the parameter id_size specifies
the size of the leading identity matrix if there is one (or zero if there is none), so that it can
be dealt with more efficiently.
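As a concrete illustration, a call applying I_4 ⊗ H ⊗ H to a 4-qubit state vector might look as follows. This is a hypothetical usage sketch: we assume FLOAT is float and that complex values are interleaved real/imaginary pairs, as described above; the precise conventions of the actual code may differ in detail.

    typedef float FLOAT;                         /* assumed definition       */
    void tp_mult(FLOAT *tp_array, int *sizes, int n, int id_size,
                 FLOAT *x, FLOAT *work);

    #define R2 0.70710678f                       /* 1/sqrt(2) */

    void apply_I4_H_H(void)
    {
        /* H packed by rows, each entry a (real, imag) pair */
        FLOAT hadamard[8] = { R2, 0.0f,  R2, 0.0f,
                              R2, 0.0f, -R2, 0.0f };
        FLOAT tp_array[16];                      /* H followed by H          */
        int   sizes[2] = { 2, 2 };               /* each operator is 2 x 2   */
        FLOAT x[32], work[32];                   /* 16 complex amplitudes    */
        for (int i = 0; i < 8; i++) {
            tp_array[i]     = hadamard[i];
            tp_array[8 + i] = hadamard[i];
        }
        /* ... populate x with the initial state ... */
        tp_mult(tp_array, sizes, 2, 4, x, work); /* two operators, leading I_4 */
    }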
The tp_mult routine in turn calls the level 3 BLAS routine CGEMM to perform the required multiplications. We use the ATLAS implementation of the BLAS library [WPD00],
which automatically tunes itself upon compilation to the exact architectural parameters
(CPU, cache size, etc.) of the machine on which it is compiled. Our earlier experience with
using various BLAS libraries as the basis for the parallel linear algebra routines in the Matlab implementation showed that ATLAS yielded the highest performance of the readily
available BLAS implementations.
The tp_mult routine does not declare any significantly sized local variables: the main
data structures are the ones passed in the arguments above. It is fairly clear, therefore, that
the main factor governing memory usage on each node is the size of x. The operators are
typically small: it is unusual for an operator to operate on more than a few qubits, so the
storage required for each operator is on the order of a few tens or hundreds of bytes.
The size of x (and hence work) depends on the number of qubits in each state vector
partition. If there are N qubits divided over p processors, then the total storage requirement per processor for x and work combined is:
\[ 2\ \text{arrays} \times 8\ \text{bytes/complex value} \times \frac{2^N}{p}\ \text{values/array} \;=\; \frac{2^{N+4}}{p}\ \text{bytes} \tag{4.7} \]
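For example, a node holding 2^25 amplitudes (2^N / p = 2^25) needs 2 × 8 × 2^25 = 2^29 bytes = 512 Mb for x and work together, which is consistent with the per-node figures given below.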
Thus, on the current cluster nodes, which have between 768 Mb and 1 Gb of RAM
each, there is sufficient memory (allowing for operating system headroom) to support 25
qubits per node, or 24 qubits per processor if each of the two processors in a node is used
independently. Compared to the parallel Matlab implementation, which could support a
theoretical maximum of 14 qubits across combined memory space of the entire cluster, this
represents a substantial improvement.
The performance of the new system also compares very favorably with that of the old.
In Chapter 5, we will examine these performance issues in depth.
Chapter 5
Evaluation
We performed a number of benchmarks to characterize the performance of the new simulation environment. Section 5.1 outlines our methodology, describing the source of our
execution time figures. We start our analysis in section 5.2 by looking at the basic elements
contributing to simulation performance: computation time, data transfer time and startup
overhead. In section 5.3 we look at simulations of structured sets of gates designed to evaluate execution times with both minimal and extensive communication. Finally, we look at
simulations of two disparate types of algorithm: the quantum Fourier transform in section
5.4, and adiabatic quantum computation of 3-SAT in section 5.5.
5.1 Methodology
Except where otherwise noted, the data below was collected by measuring the execution
time of the simulation environment. The compilation time is not included, because our
tests revealed it to be insignificant (on the order of at most a few hundred milliseconds).
Timing was performed with the GNU time command (version 1.7), which returns the
number of CPU seconds used by the process being timed in user mode, the number of
CPU seconds spent in kernel-mode system calls, and the total elapsed time. The values
reported below are elapsed times, as this is the most relevant time to answer the question
’how long does simulation X take?’
The time command returns two significant figures for the number of elapsed seconds.
In practice, limitations of the system clock and of the accuracy of the timing mechanism
used lead to an effective resolution of a few hundred milliseconds. This is more than
acceptable for our purposes, since most of the simulations we are timing have execution
times from tens to thousands of seconds.
Where the simulations ran on multiple processors, timing data was always taken for
the master node, i.e. the node on which the simulation was started. This node by design
started first and finished last, and thus its execution time was exactly the time for the entire simulation.
The execution time graphs below omit data for small problem sizes (generally where
the number of qubits N < 12). The computation time for these smaller sizes is negligible
relative to the process startup time, so the graph is essentially flat for these smaller sizes.
Because of the typically exponential scaling of most problems with the number of qubits,
the time (y) axis is usually shown with a logarithmic scale. The (x) axis is generally linear
in the number of qubits (thus effectively logarithmic in the number of values in the state
vector).
For the cluster-based simulations (sections 5.3, 5.4 and 5.5), we ran each simulation, for a given problem size and processor count, repeatedly. Each plotted data point is the median of n execution times, where n was varied according to execution time as shown in Table 5.1 in order to put a reasonable ceiling on the total execution time for each set of simulations.
Execution time range (t, sec)    Number of runs (n)
t < 100                          100
100 ≤ t < 1000                   25
1000 ≤ t < 2500                  10
t ≥ 2500                         5

Table 5.1: Number of runs for simulation data
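Extracting the plotted value from the raw run times is straightforward; the following is a minimal sketch (the post-processing code is not part of the simulator and is shown here only for concreteness) of the median computation:

/* Sketch: median of n execution times. Sorts the input array in place. */
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

double median(double *t, int n)
{
    qsort(t, n, sizeof(double), cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}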
5.2 The fundamentals
Before we examine the performance of the simulation environment as a whole, it is instructive to gather statistics on some basic performance characteristics. The first of these is
execution time for a single node. In contrast to the other benchmarks in this section, these
statistics were gathered by inserting code into the simulation software itself to record the
system time before and after execution. This was done to eliminate the effects of any
startup overhead. Startup overhead was then measured separately (see Figure 5-4).
5.2.1 Single node execution times
First, we measured the effect of gate size on execution time. For each problem of size N ,
we ran a sequence of simulations, first with N single qubit (Hadamard) gates, then with
N/2 two qubit (CNOT) gates, and finally with N/3 three qubit (Toffoli) gates. The results
are summarized in Figure 5-1.
An interesting phenomenon apparent in this graph is that single qubit operations take less time than two- or three-qubit operations, even though the same number of multiplications is being performed. This is presumably because the smaller (32 byte) one-qubit operator matrix can be accessed more efficiently (possibly as register-resident data).
Recall from section 3.3 that each tensor product term requires a multiplication step and
a transposition step. We recorded the execution time of each step separately and found
that, for problem sizes greater than 15 qubits, transposition took up approximately 30% of
the execution time, and multiplication took up approximately 70%.
To develop a further feel for how the number of gates in each tensor product term affects the execution time, we ran a series of tests using tensor products of the form

    I_2^{⊗(N−k)} ⊗ H^{⊗k} ,   with k ∈ {1, 2, 4, 8, 16} ,        (5.1)

where N as usual is the number of qubits, H^{⊗k} is the k-qubit Hadamard transform, and I_2^{⊗(N−k)} is a 2^{N−k} × 2^{N−k} identity matrix. The results are summarized in Figure 5-2.
[Figure 5-1: Single-node execution times: combinations of gates were executed on an isolated node with no communications. Execution time (log scale) vs. problem size, 12–24 qubits, for 1-, 2- and 3-qubit operators.]
5.2.2 Data transfer timing
Computation time is not the only factor affecting total execution time. Another significant contribution to overall execution time is communication amongst the nodes. Unlike
the parallel Matlab implementation, no communication occurs during individual computational steps; the only communication is vector transfer for permutations and bit size
realignment in between tensor products.
To evaluate the efficiency of the vector transfer mechanism, we instrumented the code
to measure the amount of time taken to exchange vectors of various sizes between two
nodes. An 'exchange' in this context is a complete transmission of a vector followed by receipt of a vector of the same size. The results are summarized in Figure 5-3.
[Figure 5-2: Single-node execution times with identity matrix: the total number of (Hadamard) gates is indicated (1, 2, 4, 8 or 16); the remaining bits were left empty. Execution time (log scale) vs. problem size, 12–24 qubits.]
As is apparent from the curve superimposed on the data, transfer throughput is slightly
under 10% of the theoretical maximum bidirectional throughput of the Gigabit Ethernet
transport medium. However, as we saw in section 4.1, the effective maximum throughput
of the operating system/bus/NIC combination is less than half the theoretical maximum
of the transport mechanism, so the vector transfer times are in fact closer to 20% of the
raw maximum throughput attainable. It is likely that this is due primarily to protocol and
processing overhead.
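To make the measurement concrete, the following is a minimal sketch (not the instrumented code actually used; the buffer contents, the qubits value and all names are illustrative) of how such a two-node exchange can be timed with MPI:

/* Sketch: time the exchange of a 2^qubits-value complex vector between
 * two MPI ranks, as in the Figure 5-3 benchmark. Assumes two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, peer, qubits = 20, len;
    float *sendbuf, *recvbuf;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;             /* partner rank (assumes exactly two) */
    len = 1 << (qubits + 1);     /* 2^qubits complex = 2^(qubits+1) floats */
    sendbuf = calloc(len, sizeof(float));
    recvbuf = calloc(len, sizeof(float));

    MPI_Barrier(MPI_COMM_WORLD); /* start both ranks together */
    t0 = MPI_Wtime();
    MPI_Sendrecv(sendbuf, len, MPI_FLOAT, peer, 0,
                 recvbuf, len, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, &status);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("exchange of a %d-qubit vector took %f sec\n", qubits, t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}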
While the transfer mechanism could be more efficient, a comparison of Figures 5-3 and
5-1 reveals that communication times are encouragingly low relative to computation time.
Looking at Figure 5-2, we see that this depends somewhat on the number of operations. In
a computation with a relatively low number of operations per step, and a relatively high
proportion of steps followed by communications, communication time can approach, and
[Figure 5-3: Vector transfer times: vectors of the indicated sizes were exchanged between a pair of nodes. Time to exchange (log scale) vs. data exchange size, 12–24 qubits, with the superimposed fit 10.7 × bits/10^9 + 0.35.]
in extreme cases even exceed computation time.
5.2.3 Startup overhead
Another factor contributing to overall execution time, more noticeable at lower bit number,
is the startup time for the system. It takes a certain amount of time to launch the underlying
message passing system (MPI or PVM), register all the nodes, and bring them all to the
point where they are ready to start computation. This overhead is illustrated in Figure 5-4.
Although startup overhead is not particularly significant at higher bit number, it is
the predominant time cost at lower bit number, as will be apparent from a number of the
graphs later on.
[Figure 5-4: Startup overhead: time to launch the simulation environment on the number of processors indicated (5–30).]
5.3 Gates and gate combinations
Having examined the key determinants of performance, we are now ready to look at actual
computations. First, we run a simulation intended to be computationally intensive but not
communications intensive. To achieve this, we use a block of CNOT gates, as illustrated in
the circuit in Figure 5-5.
Since each column of CNOTs operates on the same set of qubits, there are no permutation- or bit-distribution-changing communications between computational steps. The
only communication is the final data collection at the end. A third of the qubits are left
free, partly to provide parallelism, and partly to make this set of circuits more directly
comparable to the ones in the next test.
The execution times for this 'block of gates' problem are summarized in Figure 5-6.
The basic structure of this graph will recur in most of the other graphs in this section.
[Figure 5-5: Block of CNOTs circuit.]
At lower bit number, startup overhead is the most significant time cost, dominating processing and communication times. This initially favors fewer processors. However, as bit
number increases, startup time remains constant while other factors increase. In particular,
computation times increase more rapidly with fewer processors.
This pattern is particularly clear in this example because there is no communications overhead. The lower set of points represents the 'ideal' speedup for 32 processors, i.e. the single-processor time divided by 32. Notice that the measured speedup converges quite close to this theoretical ideal.
Now let us compare this to a similar example where communication overhead plays a much bigger role. In this next case, we alternate the CNOTs as shown in the circuit in Figure 5-7. Here, we have the same number of CNOTs as in the previous test, but this time they operate on alternate qubits at each step, forcing a permutation communication after each step.
The effects of the added communication overhead are shown in Figure 5-8. First, overall execution times increase in all cases. The difference is insignificant at lower bit number, but grows to a factor of between 2 (for one processor) and 8 (for 32 processors). Second, the rates of increase in execution times differ less than in the previous case. The graphs for
[Figure 5-6: Block of gates, no permutation-related communication: 16 columns of CNOT gates operating on the same bits, exercising execution time without communication (cf. Figure 5-8). Execution time (log scale) vs. problem size, 12–24 qubits, for 1–32 processors, plus the ideal 32× speedup curve.]
different processor numbers still cross over, but they do not pull away from each other as
quickly. The single processor curve in particular tracks much closer to the two-processor
case, because its permutations involve memory transfers, not network communications.
Although the single processor has an inherent execution time disadvantage, the reduction
in communication costs compensates somewhat for this.
Overall, communication costs play a much more significant role, as we can see by comparing the convergence of the 32-processor case toward the ideal speedup with the previous example. At 25 bits, for example, using 32 processors gives a speedup factor of approximately 5 over the single-processor case, against a comparable speedup factor of approximately 20 for the previous, low-communication, example.
Notice that although the effects of communication overhead are significant, it is much
less of a factor than it was for the parallel Matlab implementation. This is for three main
[Figure 5-7: Alternating CNOTs circuit (if N is not a multiple of 3, the last one or two qubits differ appropriately).]
reasons. First, much less data is being transferred: vectors of size O(2^N) instead of matrices of size O(2^{2N}). Second, the communication comes after each computational step, instead of being interleaved throughout the computation. This allows the computation itself to proceed much faster, as it is not constantly blocking while waiting for I/O to complete. Finally, the effect of network- and protocol-induced latency is much smaller when data is transferred in bulk instead of being split into many temporally separate small requests.
The slightly greater unevenness of each curve is not a result of the greater communication overhead. Rather, it is due to the structure of the problem: the number of gates
relative to problem size differs slightly depending on whether N is a multiple of three, one
less than a multiple of three, or two less than a multiple of three.
5.4 The quantum Fourier transform
Having characterized the basic performance factors of the tensor product based simulation
environment, we were ready to put it through its paces on some real quantum algorithms.
To test the simulation environment with realistic quantum circuits, we implemented the
quantum Fourier transform described in section 2.3.1.
[Figure 5-8: Alternating CNOTs: 16 columns of alternating CNOTs exercising execution and communication times. Execution time (log scale) vs. problem size, 12–24 qubits, for 1–32 processors, plus the ideal 32× speedup curve.]
5.4.1 The quantum Fourier transform on the initial environment
Before we look at the quantum Fourier transform on the new simulation environment,
it is worthwhile to examine its performance on the old parallel Matlab based simulation
environment as a baseline.
We ran the quantum Fourier transform code from listing 4.1 on grids of 1, 2, 4, 8, 16 and
32 processors, and measured the execution times using the Matlab internal timer functions
(tic and toc). These times are shown in Figure 5-9.
There is no data point for 13 bits on a single processor, because the problem is too large
to fit into the memory of a single node. In the 32-processor case, two processors on each
of 16 nodes were used. The total available RAM in this case was thus the same as for the 16-processor case (16 GB).
Although there is clearly a speedup for higher bit numbers as the number of processors
[Figure 5-9: Parallel Matlab based simulator performance: execution times (log scale) for the quantum Fourier transform of varying size (7–13 qubits) on the indicated number of processors (1–32).]
is increased, overall run times are high. This is due in part to the rapid scaling of the
number of operations that need to be performed, as discussed in section 3.3.2. This is also
due to the high communications overhead of transferring large amounts of data during
matrix multiplication operations.
5.4.2 Replicating the discrete Fourier transform
Let us now turn back to the new simulation environment, and examine the performance
of various implementations of the quantum Fourier transform.
Before implementing the circuit-based quantum Fourier transform, we implemented
a well-known alternative tensor product based representation of the equivalent discrete
Fourier transform. We did this in order to verify that our simulator performed as expected
on a known problem with a known formula describing its run time as a function of problem size.
Specifically, we used the Cooley-Tukey [CT65] factorization of the discrete Fourier transform, better known as the fast Fourier transform, which implements the 2^N-value transform F_{2^N} (equivalent to the N-qubit quantum Fourier transform) as

    F_{2^N} = [ ∏_{j=N}^{1} ( I_{2^{N−j}} ⊗ B_{2^{j−1}} ) ] R_{2^N} ,        (5.2)

where R_{2^N} is the bit reversal permutation, and B_{2^{j−1}} is the butterfly matrix defined by

    B_m = (F_2 ⊗ I_m) D_m .        (5.3)

F_2 is the 2 × 2 discrete Fourier transform matrix, and D_m is a diagonal matrix of weights of the form

    D_m(j, j) = exp( −2πi (j mod m) ⌊j/m⌋ / (2m) )   for j = 0 … (2m − 1) .        (5.4)
To implement this on our simulator, we added one small extension to allow us to calculate D_m through in-place vector multiplication. This consisted of a language extension, Diag[m], specifying the application of D_m at a given point in the calculation, and a small code segment in the execution code to perform the necessary vector multiplications.
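A minimal sketch of such a code segment (hypothetical names; the simulator's actual implementation is not reproduced here) applies the weights of equation 5.4 to a length-2m block of the state vector in place:

/* Sketch: in-place multiplication by the diagonal matrix D_m of
 * equation 5.4, D_m(j,j) = exp(-2*pi*i (j mod m) floor(j/m) / (2m)). */
#include <complex.h>
#include <math.h>

void apply_diag(int m, float complex *x /* length 2*m block */)
{
    int j;
    for (j = 0; j < 2 * m; j++) {
        double theta = -2.0 * M_PI * (double)((j % m) * (j / m)) / (2.0 * m);
        x[j] *= cexpf(I * theta);
    }
}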
The execution times for problem sizes n = 2^N, 12 ≤ N ≤ 25, are shown in Figure 5-10.
The primary aim of this test was to validate the operation of the simulation library by
comparing its execution time to the theoretically predicted execution time. The FFT implementation described above has an execution time of O(n log n), where n is the problem
size.
For the single processor case, the data in Figure 5-10 clearly shows this behavior, as highlighted by the superimposed graph of the function y = 5.6×10^{−7} n log n + 1.9. Multiple-processor times are higher because of the added influence of communication time.
As usual, for small problem size, the overall time is dominated by startup overhead.
As problem size increases, the effect of startup overhead becomes smaller, so the execution
[Figure 5-10: Traditional Fourier transform execution times. Execution time (log scale) vs. problem size (log n, 12–24) for 1–32 processors, with the fit 5.55×10^{−7} (n log n) + 1.9.]
times for 2-32 processors appear to converge. This is because the overall amount of time
required for communication is relatively higher than the time required for computation.
As problem size becomes larger still, the relative effect of computation time increases,
and the effects of parallelism start to give an advantage to higher numbers of processors.
This is somewhat difficult to see from the graph, but becomes more apparent by looking at
the values in Table 5.2, which show execution times for larger problem sizes. Note that in
all cases, however, the single-processor times are better than any of the multiple processor
times.
5.4.3 Circuit-based implementation
Although the FFT implementation provides a useful confirmation that our simulation environment performs as expected, we are in a sense 'cheating', in that we are exploiting specific properties of the problem to achieve a particularly efficient simulation. A more
realistic way of evaluating the simulator's performance for quantum circuits in general is to implement a full circuit representation of the quantum Fourier transform.

Size    2 procs    4 procs    8 procs    16 procs    32 procs
21      65         62         *63*       *67*        *75*
22      136        128        125        *129*       *137*
23      275        262        259        257         *264*
24      576        548        533        531         *541*
25      1171       1109       1086       *1101*      1081

Table 5.2: Fourier transform execution times for larger problem sizes: times are in seconds. Values that are in the 'wrong order' (i.e. where more processors result in higher execution times than the next-smaller processor count) are marked with asterisks.
To do this, we wrote a script to generate quantum Fourier transform circuits of the form
in Figure 2-5 for a given bit number. A full listing of the script, along with sample output
for a four-qubit quantum Fourier transform, can be found in Appendix A, in listings A.4
and A.5 respectively.
We generated circuits for a range of bit numbers up to 25 bits, and ran these circuits
through the simulation environment, measuring the execution times as usual. The results
are summarized in Figure 5-11.
Here, computation time is higher relative to communication time than in the previous
case, so the multiple processor configurations show a clear advantage sooner (about as
soon as the higher startup overhead becomes insignificant). The behavior is similar to the
mixed execution/communication test in Figure 5-8.
In section 3.3.2, we saw that the number of multiplication operations required to compute the product of a vector and a matrix in tensor product form is O(K n^{K+1}), where K is the number of tensor product terms and n is the dimension of the tensor product terms. We can approximate the execution time for the tensor product based quantum Fourier transform by setting n = 4, since the bulk of the operations are two-qubit controlled rotations. To approximate K, we note that for a fully populated column of gates, there will be N/2 gates for N qubits, so we set K = N/2. This gives an approximate execution time for each step, T_step, of
    T_step(N) = O( (N/2) · 4^{(N/2)+1} ) .        (5.5)

The total number of steps N_steps is given by

    N_steps(N) = 2N − 1 ,        (5.6)

so the overall execution time T_exec can be approximated by

    T_exec(N) = T_step(N) · N_steps(N)        (5.7)
              ≃ O( N^2 2^N ) .        (5.8)

[Figure 5-11: Quantum Fourier transform circuit execution times. Execution time (log scale) vs. problem size, 12–24 qubits, for 1–32 processors, with the fits 2.35×10^{−7} N^2 2^N and 5.87×10^{−8} N^2 2^N.]
The higher dotted line in Figure 5-11 shows that actual single processor execution times
are indeed close to the prediction (the predicted curve grows slightly faster, which is to be
expected since we have slightly overestimated n and K). The values on the lower dotted
line are exactly one quarter of the values on the higher dotted line. These are a good fit to
the eight-processor data, illustrating that for eight processors we are achieving about half
the theoretical maximum speedup.
The execution times for the circuit based quantum Fourier transform are, as expected,
higher than the optimized FFT case above. However, it is instructive to compare the lower
end of the execution time graph for the tensor product based quantum Fourier transform
to the higher end of the execution time graph in Figure 5-9 for the old parallel Matlab
implementation to get a sense of the overall improvement.
5.4.4 Comparing efficient and inefficient circuit specifications
We performed a further experiment to see how significantly execution time was affected
by the efficiency of the circuit representation. Recall from section 4.3.1 that the ordering of
gates in a circuit can affect the efficiency with which the compiler is able to implement that
circuit (see Figure 4-6 and listings 4.3 and 4.4). We were curious to quantify the variation
in execution time that would result.
Our original quantum Fourier transform circuit generation script generated more efficient circuit specifications, with the operations interleaved as much as possible to minimize the number of consecutive tensor product multiplication steps (as in listing 4.4). We
rewrote the script to deliberately generate less efficient specifications, with each qubit’s
gates specified sequentially (as in listing 4.3). Bear in mind that this did not change the
circuit in any way, merely the compiler’s optimized interpretation of it. The script can be
found in listing A.6 in Appendix A, with sample output in listing A.7.
The execution time results for the inefficient script are summarized in Figure 5-12.
To understand these results, it is useful to consider how many tensor product steps are
involved in each circuit representation, and the effect that this has on the two important
parameters of execution time and communication time.
96
CHAPTER 5. EVALUATION
10000
1 processor
2 processors
4 processors
8 processors
16 processors
32 processors
2.33*10-7 N2 2N
Execution time [sec]
1000
100
10
1
12
14
16
18
20
Problem size [qubits]
22
24
Figure 5-12: Quantum Fourier transform inefficient circuit execution times
The number of consecutive tensor product steps N_steps(N) in an N-qubit quantum Fourier transform implemented efficiently is given by equation 5.6, while the number of steps N_ineff(N) in an inefficiently implemented N-qubit quantum Fourier transform is

    N_ineff(N) = (N − 1)(N − 2)/2 + 2 .        (5.9)

At N = 25, for example, the inefficient representation compiles to 278 consecutive steps against the 49 of the efficient one.
Since the number of operations overall remains constant for each N , the total amount
of computation is similar in both cases. In fact, the average number of operations per step
for the inefficient implementation is lower, which results in an improvement (albeit very
small) in single-processor execution times for larger problem sizes. This can be seen from
the dotted line in Figure 5-12, where the values are about 1% lower than those in Figure 5-11.
Since more steps require more permutations, and hence greater overall communication,
the inefficient implementation fared worse in multiple processor computations than the
comparable efficient implementation.
5.5 3-SAT by adiabatic evolution
To test the simulator with a somewhat different type of algorithm, we developed a simulation of the application of adiabatic evolution to the 3-SAT problem described in section 2.3.2.
To recast the problem into a form suitable for testing our simulation environment, we
use the simulation mechanism described by Hogg in [Hog03]. This executes a discrete
approximation of the adiabatic evolution algorithm for a deterministic number of steps
and then evaluates the probability of solution. We chose this method of simulation because it particularly lends itself to implementation in our tensor product based simulation
environment, making it a good test algorithm.
Hogg's simulation technique relies on the discrete approximation described in [vDMV01]. The evolution from the beginning Hamiltonian H(0) to the final Hamiltonian H(T) is approximated by a sequence of r Hamiltonians, each applied for time T/r. Each of the r Hamiltonians is implemented by a unitary transform

    U_j = H^{⊗N} F_{0,j} H^{⊗N} F_{f,j}   for 1 ≤ j ≤ r ,        (5.10)

where

    F_{0,j} |x⟩ = e^{−i (T/r)(1 − j/r) h(x)} |x⟩ ,        (5.11)
    F_{f,j} |x⟩ = e^{−i (T/r)(j/r) f(x)} |x⟩ ,        (5.12)

and the functions h(x) and f(x) are based on the definitions of the initial and final Hamiltonians respectively (h(x) is the xth diagonal term of H(0) expressed in the computational basis, and similarly for f(x) and H(T)).
The implementation of the two H^{⊗N} steps is a standard 'column of gates'. However, notice that all the qubits are involved in the Hadamard transforms, so we cannot make use of the usual identity-matrix-factoring parallelization. F_{0,j} and F_{f,j} are best implemented as in-place multiplications so, as in Section 5.4.2, we added a small section of code to the simulator, and corresponding statements (F0[j] and Ff[j]) to the specification language.
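A sketch of what these in-place multiplications might look like is given below; it is a hypothetical stand-in (function and variable names are ours, not the simulator's) following equations 5.11 and 5.12:

/* Sketch: in-place application of F0[j] and Ff[j] to the state vector.
 * h[] and f[] hold the diagonal values h(x) and f(x) of the initial and
 * final Hamiltonians; dim = 2^N. */
#include <complex.h>

void apply_F0(long dim, float complex *state, const double *h,
              double T, int j, int r)
{
    double s = (T / r) * (1.0 - (double)j / r);
    long x;
    for (x = 0; x < dim; x++)
        state[x] *= cexpf(-I * s * h[x]);  /* e^{-i (T/r)(1 - j/r) h(x)} */
}

void apply_Ff(long dim, float complex *state, const double *f,
              double T, int j, int r)
{
    double s = (T / r) * ((double)j / r);
    long x;
    for (x = 0; x < dim; x++)
        state[x] *= cexpf(-I * s * f[x]);  /* e^{-i (T/r)(j/r) f(x)} */
}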
Our script to generate simulation code for 3-SAT by adiabatic evolution according to this method is given in listing A.8 in Appendix A. Example output for a three-qubit simulation is given in listing A.9.
For each instance of 3-SAT, in addition to the simulation code, we need to provide the simulator with a description of the specific instance so that it has values for the Hamiltonians F_{0,j} and F_{f,j}. Code to generate random instances of 3-SAT for input into the simulator is shown in listing A.10.
In the interest of keeping execution times manageable for large bit numbers, we ran
the simulation with r = N , where N is the size of the problem in qubits. We used random
instances of 3-SAT, since in this formulation the execution time of the simulation is constant
regardless of the values of the initial and problem Hamiltonians. The number of random
instances for each problem size was as in table 5.1. Execution times are summarized in
Figure 5-13.
To develop an estimate of the expected execution time, we can use a similar approach to the one we used for the quantum Fourier transform. Here, in an N-bit simulation, each step comprises two series of N Hadamard transforms and two multiplications by diagonal Hamiltonians. Each series of Hadamards should take O(K n^{K+1}) multiplications, with K = N and n = 2 (since all the operations are 2 × 2 Hadamards). Each application of a Hamiltonian involves a simple in-place multiplication of a vector by a diagonal matrix, so there are O(N) multiplications for each of these steps. The total number of multiplications for each step, T_step(N), is thus given by:

    T_step(N) = O( 2 (N 2^{N+1} + N) ) .        (5.13)
Since we run each simulation for N steps, the total number of multiplications T_exec(N) is simply:
    T_exec(N) = O( 2N (N 2^{N+1} + N) )        (5.14)
              = O( 2N^2 (2^{N+1} + 1) ) .        (5.15)

[Figure 5-13: Execution times for 3-SAT by adiabatic evolution, N steps. Execution time (log scale) vs. problem size, 12–24 qubits, for 1–32 processors, with the fit 2.3×10^{−7} N^2 (2^{N+1} + 1).]
The dotted line in Figure 5-13 illustrates that actual single processor execution times follow
the prediction closely.
Notice that adding more processors does not significantly improve execution times.
This is because the Hadamard transforms operate on all qubits and thus cannot be parallelized in the usual way. So, for p processors, we have a theoretical best case speedup
of
    T_exec(N, p) = O( 2N^2 (2^{N+1} + 1/p) ) .        (5.16)
Although there is a 1/p speedup for the multiplication routine, there is no speedup for the more costly 2^{N+1} Hadamard step. Formula (5.16) also does not take communications overhead into account.
For the multiple processor cases, we implemented a parallel version of the F0 and Ff
multiplication routine, which distributed the vector amongst the multiple processors for
the multiplication steps. At lower bit number, startup and communication overhead dominated as usual. For higher bit numbers, there was an improvement in execution time
with an increased number of processors, but this was very slight (at 25 bits, execution time
dropped from 9736 seconds on a single processor to 9518 seconds on 32 processors — an
improvement of about 2%).
In real simulations of adiabatic quantum computing, it is common to study multiple
random instances of the problem (in this case 3-SAT) to gather meaningful statistics of the
performance of the algorithm (in this case, the probability of solution for a given number
of steps). It would therefore be a much more efficient use of the parallelism of the cluster
to check multiple instances simultaneously on single nodes, rather than single instances
on multiple nodes.
Chapter 6
Conclusions
We set out to develop an environment for simulating quantum computation on a cluster
of multiple classical computers. Our implementation demonstrates that this is feasible for
moderately sized problems if care is taken to manage the effects of exponential scaling of
resource requirements with problem size.
Our initial simulation approach used a matrix representation for the quantum circuits
being simulated, and applied parallel linear algebra routines to perform the computation.
With the initial simulation environment, we were able to simulate slightly larger systems than the state of the art for physical quantum computers (13 qubits against 7), and we did see a limited speedup as additional processors were added, albeit with diminishing returns as we increased to 16 or 32 processors.
We learned a number of lessons from our initial implementation. One key observation was that the full matrix representation is an inefficient mechanism for representing quantum circuits, as CPU, memory and communication resource requirements increase as O(2^{2N}) with problem size. This significantly limited the size of the problems that we could simulate.
Another important observation was that communications overhead is a significant limiting factor in cluster computing. The parallel linear algebra libraries that we used were optimized for single-machine multiprocessor architectures, and over the much slower interconnect of our cluster network, they performed very poorly when performing operations such as matrix multiplication that require extensive communication between processors.
It was clear to us that a re-think was in order. We hypothesized that the tensor product
structure inherent to many quantum computing problems would lead to a more efficient
implementation, much as it had been helpful to the digital signal processing community
in implementing large matrix transforms. Our final simulation environment was therefore
based around a tensor product model.
Our evaluation of the tensor product based simulation environment has demonstrated that it is indeed a substantially more efficient means of performing a number of simulations. Resource requirements now increase as O(2^N): still exponential, but with a slower growth rate than before. We were able to extend simulation sizes to 25 qubits without difficulty, and overall execution times decreased dramatically.
Communication overhead was still a factor, but much less so than for the initial simulation environment, and for most problems we were able to utilize the parallelism of the
cluster effectively to reduce overall processing time. For more communication-intensive
problems, this reduction in time was somewhat less than the maximum possible theoretical speedup, but was still significant enough to be useful.
Not all problems lent themselves to tensor product based parallelization. The discrete
approximation to adiabatic quantum computation, for instance, benefited from the efficient tensor product based computational core of our simulator, but saw little benefit from
adding additional processors.
In conclusion, we have demonstrated that it is feasible to simulate useful problems
in quantum computation on classical computers. Our experience suggests that for many
problems, using the tensor product structure provides an efficient means of representing
and executing the problem. In many cases, the tensor product also provides a natural
parallelization mechanism.
Another conclusion of our study is that clustering is not a panacea for high performance computing. Clusters are not supercomputers: even fast network transports make
slow inter-processor interconnects, and more generally workstation hardware is not optimized for large-scale multiprocessing. However, if care is taken to find efficient means
of representing problems that minimize communications, cluster parallelism can provide
meaningful performance gains.
Appendix A
Code listings
This appendix contains longer listings for several important pieces of code referenced in
the text. Each listing has a cross-reference to the section of the text in which it is discussed.
A.1 Circuit specification language parser
The following three listings contain the main body of the parser for the circuit/algorithm
specification language discussed in section 4.3.1. Listing A.1 contains basic definitions of
data structures referenced in the other two listings. Listing A.2 contains the flex code for
the parser, and listing A.3 contains the bison code (the bulk of the processing occurs in this
latter listing).
The simplified grammar presented in Table 4.1 may be helpful in understanding the structure of the language defined in these listings.
/* qparse.h: Header file for quantum circuit language parser
 * Geva Patz, MIT Media Lab, 2003
 */
#ifndef _QPARSE_H
#define _QPARSE_H

int yyline;

#define FLOAT float
#define MAXOP 5      /* maximum operator size in bits */
#define MAX_BITS 32  /* maximum number of problem bits */

/* Complex number type */
typedef struct {
    FLOAT re;
    FLOAT im;
} complex;

/* Item in linked list of values for matrix definitions */
typedef struct _ml {
    complex val;
    struct _ml *next;
} matlist;

/* Item in linked list of bit numbers for operators */
typedef struct _bl {
    int val;
    struct _bl *next;
} bitlist;

/* Symbol table entry record */
typedef struct _sl {
    char *name;
    int bits;
    complex *matrix;
    struct _sl *next;
} symlist;

/* Operator stack entry record */
typedef struct _op {
    enum {tOP,      /* Operator (O record)           */
          tPERM,    /* Permutation (P or B record)   */
          tDELIM,   /* Product delimiter (D record)  */
          tSPECIAL  /* Special function              */
    } type;
    symlist *op;    /* Operator information for type = tOP    */
    int prodsize;   /* Product size for type = tDELIM         */
    int *bits;      /* Bits operated on for type = tOP or tPERM */
    int function;   /* Function index for tSPECIAL            */
    int parm;       /* Parameter for tSPECIAL                 */
    struct _op *next;
} oplist;

void yyerror(char *);

#endif

Listing A.1: Header file definitions
%x include
%{
/* qparse.l: flex file for quantum circuit specification language
 * Geva Patz, MIT Media Lab, 2003
 */
#include <ctype.h>
#include <string.h>
#include <math.h>
#include "qparse.h"
#include "qparse.tab.h"

extern YYSTYPE yylval;
extern symlist *symbol_table;

#define DEBUG_L 0
#define MAX_INCLUDES 10
#define tprintf if (DEBUG_L) printf

YY_BUFFER_STATE inc_stack[MAX_INCLUDES];
int inc_stack_ptr = 0;

/*
 * Keyword table
 */
struct kt {
    char *kt_name;
    int kt_val;
} key_words[] = {
    { "define", DEFINE },
    { "perm",   PERM },
    { "size",   SIZE },
    { "F0",     FZERO },
    { "Ff",     FF },
    { "Diag",   DIAG },
    { 0, 0 }
};

/* Function prototypes */
int kw_lookup(char *);
symlist *st_lookup(char *);
float makefloat(char *);
%}

WORD    [A-Za-z_][-A-Za-z0-9_]*
DQSTR   \"[^\"]*\"
DIGIT   [0-9]
SIGN    [+-]
FLOAT   ({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")

%%
    /** Handle the "include" statement **/
"include"           BEGIN(include);
<include>[ \t\n]+   { /* Swallow whitespace in includes */ }
<include>[^ \t\n]+  {
        char estr[1000];
        if (inc_stack_ptr >= MAX_INCLUDES) {
            yyerror("Too many nested includes");
            exit(1);
        }
        inc_stack[inc_stack_ptr++] = YY_CURRENT_BUFFER;
        yyin = fopen(yytext, "r");
        if (!yyin) {
            sprintf(estr, "Unable to open include file '%s'", yytext);
            yyerror(estr);
            exit(1);
        }
        yy_switch_to_buffer(yy_create_buffer(yyin, YY_BUF_SIZE));
        BEGIN(INITIAL);
    }

    /** Integers **/
{DIGIT}+    {
        yylval.val = atol(yytext);
        return INT;
    }

    /** Various forms of complex numbers **/
{SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")    {
        /* Real part only */
        yylval.cplx.re = makefloat(yytext);
        yylval.cplx.im = 0.0;
        return NUM;
    }

{SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")":"{SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")    {
        /* Real and imaginary parts */
        int i;
        for (i = 0; yytext[i] != ':'; i++)
            ;
        yylval.cplx.re = makefloat(yytext);
        yylval.cplx.im = makefloat(yytext + i + 1);
        return NUM;
    }

":"{SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")    {
        /* Imaginary part only */
        yylval.cplx.re = 0.0;
        yylval.cplx.im = makefloat(yytext + 1);
        return NUM;
    }

    /** Handle keywords, identifiers and punctuation **/
{WORD}    {
        int i;
        symlist *ptr;
        /* Check for keywords */
        if ((i = kw_lookup(yytext)) != -1) {
            yylval.string = strdup(yytext);
            return i;
        }
        /* Check for identifiers */
        if ((ptr = st_lookup(yytext)) != NULL) {
            yylval.st_rec = ptr;
            return OP;
        }
        yylval.string = strdup(yytext);
        tprintf("id(%s) ", yytext);
        return ID;
    }

\n          { yyline++; tprintf("\n... "); }
#.*         { /* Ignored (comment) */ }
[ \t\f]*    { /* Ignored (white space) */ }
","         { return COMMA; }
"\."        { return DOT; }
";"         { return SEMICOLON; }
"\{"        { return OBRACE; }
"\}"        { return CBRACE; }
"("         { return OBKT; }
")"         { return CBKT; }
"\["        { return OSBKT; }
"\]"        { return CSBKT; }
.           { return yytext[0]; }

<<EOF>>     {
        if (--inc_stack_ptr < 0)
            yyterminate();
        else {
            yy_delete_buffer(YY_CURRENT_BUFFER);
            yy_switch_to_buffer(inc_stack[inc_stack_ptr]);
        }
    }
%%

/*
 * st_lookup
 *   Look up a string in the symbol table. Returns a pointer if the
 *   string is found, NULL otherwise.
 */
symlist *st_lookup(char *target) {
    symlist *ptr;
    for (ptr = symbol_table; ptr != NULL && strcmp(ptr->name, target);
         ptr = ptr->next)
        /* empty loop */;
    return ptr;
}

/*
 * kw_lookup
 *   Look up a string in the keyword table. Returns -1 if the string
 *   is not a keyword, otherwise it returns the keyword number.
 */
int kw_lookup(char *word) {
    struct kt *kp;
    for (kp = key_words; kp->kt_name != 0; kp++)
        if (!strcmp(word, kp->kt_name))
            return kp->kt_val;
    return -1;
}

/* makefloat
 *   Convert a string to a float, handling the %(N) square root form
 *   as necessary.
 */
float makefloat(char *str) {
    float sign, f;
    int pos;
    if (str[0] == '-') {
        sign = -1.0;
        pos = 1;
    } else {
        sign = 1.0;
        pos = 0;
    }
    if (str[pos] == '%' && str[pos+1] == '(')
        f = sqrt(atof(str + pos + 2));
    else
        f = atof(str + pos);
    return (sign * f);
}

Listing A.2: Lexical analyzer definition for flex
%{
/* qparse.y: bison file for quantum circuit specification language
 * Geva Patz, MIT Media Lab, 2003
 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "qparse.h"
#include "transform.h"

symlist *symbol_table;
oplist *op_stack;
int bitlimit, maxbits, errflag;

void addsym(char *, int, complex *);
void pushop(symlist *, int *);
void pushspecial(int, int);
%}

%union {
    long val;
    complex cplx;
    matlist *matrix;
    bitlist *bits;
    symlist *st_rec;
    char *string;
}

%token CBKT ")"
%token CBRACE "}"
%token COMMA ","
%token CSBKT "]"
%token DEFINE "define"
%token DOT "."
%token <string> ID
%token <val> INT
%token <cplx> NUM
%token OBKT "("
%token OBRACE "{"
%token <st_rec> OP
%token OSBKT "["
%token PERM "perm"
%token SEMICOLON ";"
%token SIZE "size"
%token DIAG "Diag"
%token FZERO "F0"
%token FF "Ff"

%type <matrix> numlist
%type <matrix> matrix
%type <bits> intlist

%%

input: line {} | input line {} ;

line: stmt ";" {};

stmt: fn_defn {} | operator {} | permute {} | size {} | special {};

fn_defn:
    "define" OP "(" INT ")" matrix
    {
        /* If OP is matched instead of ID, this is a redefinition */
        char estr[1024];
        snprintf(estr, 1023, "Redefinition of operator '%s'", $2->name);
        yyerror(estr);
    }
    | "define" ID "(" INT ")" matrix
    {
        complex *vallist;
        int i, matsize;
        matlist *lptr;

        if ($4 > MAXOP) {
            yyerror("Operator size too large");
        }
        matsize = (1<<$4)*(1<<$4);
        vallist = (complex *)malloc(sizeof(complex)*matsize);
        for (i = matsize-1, lptr = $6;
             i >= 0 && lptr != NULL;
             i--, lptr = lptr->next) {
            vallist[i] = lptr->val;
        }
        if (i >= 0)
            yyerror("Matrix too small for number of bits specified");
        if (lptr != NULL)
            yyerror("Matrix too large for number of bits specified");
        addsym($2, $4, vallist);
    };

matrix: "[" numlist "]" { $$ = $2; };

numlist: NUM
    {
        matlist *m;
        m = (matlist *)malloc(sizeof(matlist));
        m->val = $1;
        m->next = NULL;
        $$ = m;
    }
    | numlist NUM
    {
        matlist *m;
        m = (matlist *)malloc(sizeof(matlist));
        m->val = $2;
        m->next = $1;
        $$ = m;
    }
    ;

operator: OP "(" intlist ")"
    {
        int *destbits, *temp;
        int i, j;
        int used[MAX_BITS];
        bitlist *ptr;

        destbits = NULL;
        for (i = 0; i <= maxbits; i++)
            used[i] = 0;
        for (i = 0, ptr = $3; ptr != NULL; ptr = ptr->next, i++) {
            temp = (int *)malloc((i+1)*sizeof(int));
            for (j = 0; j < i; j++)
                temp[j+1] = destbits[j];
            temp[0] = ptr->val;
            if (destbits != NULL)
                free(destbits);
            destbits = temp;
            if (used[ptr->val])
                yyerror("Duplicate bit in operator");
            used[ptr->val] = -1;
        }
        if (i < $1->bits)
            yyerror("Too few bits specified for operator");
        else if (i > $1->bits)
            yyerror("Too many bits specified for operator");
        else
            pushop($1, destbits);
    };

permute: "perm" "(" intlist ")"
    {
        int *destbits;
        int used[MAX_BITS];
        int i;
        bitlist *ptr;

        destbits = (int *)malloc((maxbits+1)*sizeof(int));
        for (i = 0; i <= maxbits; i++)
            used[i] = 0;
        for (i = 0, ptr = $3; ptr != NULL; ptr = ptr->next, i++) {
            destbits[maxbits-i] = ptr->val;
            if (used[ptr->val])
                yyerror("Duplicate bit in permutation");
            used[ptr->val] = -1;
        }
        --i;
        if (i < maxbits)
            yyerror("Too few bits specified for permutation");
        else if (i > maxbits)
            yyerror("Too many bits specified for permutation");
        else
            pushop(NULL, destbits);
    };

intlist: INT
    {
        bitlist *m;
        m = (bitlist *)malloc(sizeof(bitlist));
        if (maxbits == -1) {
            yyerror("Problem size not declared");
            /* Fudge to prevent lots of duplicate messages */
            maxbits = bitlimit;
        }
        else if ((m->val = $1) > maxbits)
            yyerror("Bit value too large");
        m->next = NULL;
        $$ = m;
    }
    | intlist "," INT
    {
        bitlist *m;
        m = (bitlist *)malloc(sizeof(bitlist));
        if (maxbits == -1) {
            yyerror("Problem size not declared");
            /* Fudge to prevent lots of duplicate messages */
            maxbits = bitlimit;
        }
        else if ((m->val = $3) > maxbits)
            yyerror("Bit value too large");
        m->next = $1;
        $$ = m;
    }
    ;

size: "size" INT
    {
        if (maxbits != -1)
            yyerror("Duplicate size declaration");
        else if ($2 > bitlimit)
            yyerror("Problem size exceeds maximum allowable");
        else if ($2 == 0)
            yyerror("Problem size must be non-zero");
        else
            maxbits = $2 - 1;
    }
    ;

special: "Diag" "[" INT "]" { pushspecial(0,$3); } |
    "F0" { pushspecial(1,0); } |
    "Ff" { pushspecial(2,0); } ;

%%

/* addsym: add a symbol (operator name) to the symbol table */
void addsym(char *name, int bits, complex *values) {
    symlist *ptr, *lptr;
    ptr = (symlist *)malloc(sizeof(symlist));
    ptr->name = strdup(name);
    ptr->bits = bits;
    ptr->matrix = values;
    if (symbol_table == NULL)
        symbol_table = ptr;
    else {
        for (lptr = symbol_table; lptr->next != NULL; lptr = lptr->next)
            /* empty loop */ ;
        lptr->next = ptr;
    }
}

/* pushop: add an operator (or a permutation, if op==NULL) to the stack */
void pushop(symlist *op, int *bitlist) {
    oplist *ptr, *lptr;
    ptr = (oplist *)malloc(sizeof(oplist));
    ptr->op = op;
    ptr->bits = bitlist;
    ptr->type = (op == NULL) ? tPERM : tOP;
    ptr->next = NULL;
    if (op_stack == NULL)
        op_stack = ptr;
    else {
        for (lptr = op_stack; lptr->next != NULL; lptr = lptr->next)
            /* empty loop */ ;
        lptr->next = ptr;
    }
}

/* pushspecial: add a special function record to the stack */
void pushspecial(int func, int param) {
    oplist *ptr, *lptr;
    ptr = (oplist *)malloc(sizeof(oplist));
    ptr->type = tSPECIAL;
    ptr->function = func;
    ptr->parm = param;
    ptr->next = NULL;
    if (op_stack == NULL)
        op_stack = ptr;
    else {
        for (lptr = op_stack; lptr->next != NULL; lptr = lptr->next)
            /* empty loop */ ;
        lptr->next = ptr;
    }
}

/* yyerror: Bison error handling routine */
void yyerror(char *s) {
    fprintf(stderr, "%s at line %d\n", s, yyline+1);
    errflag = -1;
}

/*** main entry point ***/
int main(int argc, char **argv) {
    FILE *out;

    /* initialization */
    symbol_table = NULL;
    op_stack = NULL;
    bitlimit = MAX_BITS - 1;
    maxbits = -1;
    errflag = 0;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <output_file>\n", argv[0]);
        exit(-1);
    }
    out = fopen(argv[1], "w");
    if (out == NULL) {
        perror("File open error");
        exit(-1);
    }

    yyparse();

    if (errflag) {
        fprintf(stderr, "*** Errors occurred: aborting\n");
        exit(-1);
    }
    if (op_stack == NULL) {
        fprintf(stderr, "*** Empty problem: aborting\n");
        exit(-1);
    }

#ifdef DEBUG
    dumpstack("Pre-transformation", op_stack, NULL, NULL, maxbits+1);
#endif
    transformstack(op_stack, maxbits+1);
#ifdef DEBUG
    dumpstack("Post-transformation", op_stack, NULL, NULL, maxbits+1);
#endif
    return writestack(out, op_stack, maxbits+1);
}

Listing A.3: Parser definition for bison
A.2 Quantum Fourier transform circuit generator
The listings in this section present scripts for generating quantum Fourier transform circuits for simulation, as discussed in section 5.4, along with representative output of the
scripts. It may also be helpful to refer to the general quantum Fourier transform circuit
schema in Figure 2-5.
Listing A.4 is the main quantum Fourier transform generation script, discussed in section 5.4.3. This produces circuits that are structured to assist the compiler in inferring an
efficient tensor product representation. An example of such a circuit (for four qubits) is
given in listing A.5. Notice that the operations on the various qubits are interleaved as
much as possible.
#!/usr/bin/perl
# genqft.pl Geva Patz, MIT Media Lab, 2003
# Generate circuit description for N-bit quantum Fourier transform
# (optimized for compilation)

use POSIX;

($bits) = @ARGV;
die "Usage: $0 <bits> (2 <= bits <= 25)\n"
    unless $bits>1 && $bits<26;

print "# autogenerated by $0 $bits\n\n";

# Define operations
$pi4 = POSIX::acos(0)*4; # 2*pi
print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
for $j (2..$bits) {
    # Controlled rotation by $j
    print "define r_$j(2) [1.0 0.0 0.0 0.0\n";
    print "               0.0 1.0 0.0 0.0\n";
    print "               0.0 0.0 1.0 0.0\n";
    print "               0.0 0.0 0.0 ";
    # exp(2*pi*i / 2^j)
    printf "%f:%f];\n", cos($pi4/(1<<$j)), sin($pi4/(1<<$j));
}

print "size $bits;\n\n";

$step[0]=1;
for $j (1..$bits-1) {
    $step[$j] = 0;
}

# We reverse the bits at the beginning, and adjust indices below
# as needed. This puts the computation in the 'leading identity
# matrix' form (could let the compiler do this)
print "perm(";
for ($i=$bits-1;$i>=0;$i--) {
    print "$i", $i?',':')';
}
print "\n";

for ($col=1;$step[$bits-1]<2;$col++) {
    print "\n# column $col\n";
    for $j (0..$bits-1) {
        if ($step[$j]==1) {
            print "h(", $bits-$j-1, ");\n";
        } elsif ($step[$j] && $step[$j]<=($bits-$j)) {
            print "r_", $step[$j], "(", $bits-($j+$step[$j]), ", ";
            print $bits-$j-1, ");\n";
        }
    }
    $step[0]++;
    for $j (1..$bits-1) {
        $step[$j]++ if $step[$j-1]>2;
    }
}

Listing A.4: Efficient quantum Fourier transform circuit generation script
# autogenerated by ./genqft-prt.pl 4

define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];
define r_2(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.000000:1.000000];
define r_3(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.707107:0.707107];
define r_4(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.923880:0.382683];
size 4;

perm(3,2,1,0)

# column 1
h(3);

# column 2
r_2(2, 3);

# column 3
r_3(1, 3);
h(2);

# column 4
r_4(0, 3);
r_2(1, 2);

# column 5
r_3(0, 2);
h(1);

# column 6
r_2(0, 1);

# column 7
h(0);

Listing A.5: Example efficient circuit output for 4 qubits
For comparison, listing A.6 generates equivalent quantum Fourier transform circuits,
but with a simple structure that makes no attempt at optimization for the compiler. Each
qubit’s operations are generated sequentially, as illustrated in the four-qubit example output in listing A.7. The effects of this less efficient representation on simulation time are
discussed in section 5.4.4.
#!/usr/bin/perl
# genbadqft.pl Geva Patz, MIT Media Lab, 2003
# Generate circuit description for N-bit quantum Fourier transform
# (without regard for compiler-friendly optimization)

use POSIX;

($bits) = @ARGV;
die "Usage: $0 <bits> (2 <= bits <= 25)\n"
    unless $bits>1 && $bits<26;

print "# autogenerated by $0 $bits\n\n";

# Define operations
$pi4 = POSIX::acos(0)*4; # 2*pi
print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
for $j (2..$bits) {
    # Controlled rotation by $j
    print "define r_$j(2) [1.0 0.0 0.0 0.0\n";
    print "               0.0 1.0 0.0 0.0\n";
    print "               0.0 0.0 1.0 0.0\n";
    print "               0.0 0.0 0.0 ";
    # exp(2*pi*i / 2^j)
    printf "%f:%f];\n", cos($pi4/(1<<$j)), sin($pi4/(1<<$j));
}

print "size $bits;\n";

for $i (0..$bits-1) {
    print "\n# Bit $i\n";
    print "h($i);\n";
    for $j (2..$bits-$i) {
        print "r_$j(", $i+$j-1, ",", $i, ");\n";
    }
}

# Final bit reversal
print "\nperm(";
for ($i=$bits-1;$i>=0;$i--) {
    print "$i", $i?',':')';
}
print "\n";

Listing A.6: Unoptimized quantum Fourier transform circuit generation script
Compare the following output to the efficient output in listing A.5.
# autogenerated by ./genbadqft-prt.pl 4

define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];
define r_2(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.000000:1.000000];
define r_3(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.707107:0.707107];
define r_4(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.923880:0.382683];
size 4;

# Bit 0
h(0);
r_2(1,0);
r_3(2,0);
r_4(3,0);

# Bit 1
h(1);
r_2(2,1);
r_3(3,1);

# Bit 2
h(2);
r_2(3,2);

# Bit 3
h(3);

perm(3,2,1,0)

Listing A.7: Example unoptimized circuit output for 4 qubits
A.3 3-SAT problem generator for adiabatic evolution
The code in this section is used in simulating the solution of 3-SAT by adiabatic evolution,
as described in section 5.5. Listing A.8 generates the basic specification for an N -qubit
simulation of this kind. An example of the output code, for three qubits, is given in listing
A.9.
In order to execute a simulation of an instance of 3-SAT, we need an additional set of
input data specifying the Hamiltonian corresponding to the particular instance we wish to
simulate. Listing A.10 is a C program to generate random instances of 3-SAT of a given bit
size.
#!/usr/bin/perl
# genadi.pl Geva Patz, MIT Media Lab, 2003
# Generate circuit description for N-qubit adiabatic
# evolution simulation (use companion 'genham' Hamiltonian
# generator to generate specific problem Hamiltonian)

($bits) = @ARGV;
die "Usage: $0 <bits> (2 <= bits <= 25)\n"
    unless $bits>1 && $bits<26;

print "# autogenerated by $0 $bits\n\n";
print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
print "\nsize $bits;\n";

# Run for N steps; change second index of i to run for alternate
# number of steps
for $i (1..$bits) {
    print "\n# Step $i\n";
    for $j (1..$bits) {
        print "h(", $j-1, ");\n";
    }
    print "F0[$i];\n";
    for $j (1..$bits) {
        print "h(", $j-1, ");\n";
    }
    print "Ff[$i];\n";
}

Listing A.8: Simulation code generator for 3-SAT by adiabatic evolution
# autogenerated by ./genadi-prt.pl 3

define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];

size 3;

# Step 1
h(0);
h(1);
h(2);
F0[1];
h(0);
h(1);
h(2);
Ff[1];

# Step 2
h(0);
h(1);
h(2);
F0[2];
h(0);
h(1);
h(2);
Ff[2];

# Step 3
h(0);
h(1);
h(2);
F0[3];
h(0);
h(1);
h(2);
Ff[3];

Listing A.9: Example adiabatic evolution code for 3 qubits
/* genham.c: Generate (diagonal) Hamiltonian representing a random
 *   N-bit instance of 3-SAT
 *   Geva Patz, MIT Media Lab, 2002 (updated 2003)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <limits.h>

#define MAXCLAUSE (1<<22)
#define bailifnull(x,y) if ((y)==NULL) \
    {fprintf(stderr,"Can't allocate %s",(x));exit(-1);}

int binval(unsigned int *a) {
    int b;
    b = (((a[3]&0x04)>>2) * (1<<(a[2]-1)) +
         ((a[3]&0x02)>>1) * (1<<(a[1]-1)) +
         (a[3]&0x01)      * (1<<(a[0]-1)));
    return b;
}

int main(int argc, char *argv[]) {
    unsigned char *valid;
    unsigned int *cnt;
    unsigned int *clauses;
    char bits;
    unsigned int i, j, k, flag, numvalid;
    unsigned int alltrue, xor[3], or[3];
    unsigned int clause;
    struct timeval tv;
    struct timezone tz;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <bits>\n", argv[0]);
        exit(-1);
    }
    bits = atoi(argv[1]);
    if (bits < 3 || bits > 25) {
        fprintf(stderr, "Number of bits must be >= 3 and <= 25\n");
        exit(-1);
    }
    if (argc > 2)
        printf("%d\n", bits);

    gettimeofday(&tv, &tz);
    srand((int)(tv.tv_usec & INT_MAX));

    valid = (unsigned char *)malloc((1<<bits)*sizeof(unsigned char));
    bailifnull("valid", valid);
    cnt = (unsigned int *)malloc((1<<bits)*sizeof(unsigned int));
    bailifnull("cnt", cnt);
    clauses = (unsigned int *)malloc(4*MAXCLAUSE*sizeof(unsigned int));
    bailifnull("clauses", clauses);

    do {
        alltrue = (1<<bits)-1;
        clause = 0;
        numvalid = 1<<bits;
        for (i = 0; i < 1<<bits; i++) {
            valid[i] = 1;
            cnt[i] = 0;
        }
        while (numvalid > 1 && clause < MAXCLAUSE*4) {
            clauses[clause] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            do {
                clauses[clause+1] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            } while (clauses[clause+1] == clauses[clause]);
            do {
                clauses[clause+2] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            } while (clauses[clause+2] == clauses[clause] ||
                     clauses[clause+2] == clauses[clause+1]);
            clauses[clause+3] = (int)(8.0*rand()/(RAND_MAX+1.0));

            for (j = 0; j < 3; j++) {
                xor[j] = (clauses[clause+3] & 1<<j) ? 1<<(clauses[clause+j]-1) : 0;
                or[j] = alltrue ^ (1<<(clauses[clause+j]-1));
            }
            numvalid = 0;
            for (j = 0; j <= alltrue; j++) {
                flag = 0;
                for (k = 0; k < 3; k++) {
                    if (((j^xor[k])|or[k]) == alltrue)
                        flag = -1;
                }
                if (!flag) {
                    valid[j] = 0;
                    cnt[j]++;
                }
                if (valid[j])
                    numvalid++;
            }
            clause += 4;
        }
    } while (!numvalid);

    /* Output instance data */
    for (i = 0; i < clause; i += 4)
        printf("%d %d %d %d\n", clauses[i], clauses[i+1], clauses[i+2],
               clauses[i+3]);
    /* End of instance delimiter */
    printf("0 0 0 -1\n");
    /* Output problem Hamiltonian */
    for (i = 0; i < 1<<bits; i++)
        printf("%d\n", cnt[i]);

    exit(0);
}

Listing A.10: Random instance generator for 3-SAT by adiabatic evolution
Bibliography

[BCC+97] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[BD96a] P. E. Buis and W. R. Dyksen. Algorithm 753: TENPACK: A LAPACK-based library for the computer manipulation of tensor products. ACM Transactions on Mathematical Software, 22(1):24–29, March 1996.

[BD96b] P. E. Buis and W. R. Dyksen. Efficient vector and parallel manipulation of tensor products. ACM Transactions on Mathematical Software, 22(1):18–23, March 1996.

[BGW93] A. Barak, S. Guday, and R. G. Wheeler. The MOSIX Distributed Operating System: Load Balancing for Unix. Number 672 in Lecture Notes in Computer Science. Springer-Verlag, 1993.

[Bre00] R. P. Brent. Recent progress and prospects for integer factorisation algorithms. In Proceedings of COCOON 2000, volume 1858 of Lecture Notes in Computer Science, pages 3–22. Springer-Verlag, 2000.

[BSS+95] D. J. Becker, T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawak, and C. V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings, International Conference on Parallel Processing, 1995.
[CDO+95] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley. A proposal for a set of parallel basic linear algebra subprograms. Technical Report CS-95-292, University of Tennessee, July 1995.

[CE02] R. Choy and A. Edelman. MATLAB*P 2.0: Interactive supercomputing made practical. Master's thesis, Massachusetts Institute of Technology, 2002.

[Coo71] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third ACM Symposium on the Theory of Computing, pages 151–158, 1971.

[CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, April 1965.

[CWM01] H. Chen, P. Wyckoff, and K. Moor. Cost/performance evaluation of gigabit Ethernet and Myrinet as cluster interconnect. In Proceedings of OPNETWORK 2001, September 2001.

[DvdGW94] J. Dongarra, R. van de Geijn, and D. Walker. Scalability issues in the design of a library for dense linear algebra. Journal of Parallel and Distributed Computing, 22:523–537, 1994.

[Dyk87] W. R. Dyksen. Tensor product generalized ADI methods for elliptic problems. SIAM Journal on Numerical Analysis, 24(1):59–76, February 1987.

[FGG+01] E. Farhi, J. Goldstone, S. Gutmann, J. Lapan, A. Lundgren, and D. Preda. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science, 292:472–475, April 2001.

[For93] MPI Forum. MPI: A message passing interface. In Supercomputing '93 Proceedings, pages 878–883, November 1993.
[GBD+94] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[GCT92] J. Granata, M. Conner, and R. Tolimieri. The tensor product: A mathematical programming language for FFTs and other fast DSP operations. IEEE SP Magazine, pages 40–48, January 1992.

[Hog03] T. Hogg. Adiabatic quantum computing for random satisfiability problems. Physical Review A, 67:022314, 2003.

[Mol95] C. Moler. Why there isn't a parallel Matlab. Matlab News and Notes, Spring 1995.

[MSDS02] H. Meuer, E. Strohmaier, J. Dongarra, and H. D. Simon. Top 500 supercomputing sites. http://www.top500.org/, November 2002.

[NC00] M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

[OD98] K. Obenland and A. M. Despain. A parallel quantum computer simulator. In High Performance Computing '98, 1998.

[Öme00a] B. Ömer. Quantum programming in QCL. Master's thesis, Technical University of Vienna, January 2000.

[Öme00b] B. Ömer. Simulation of quantum computers (unpublished). Technical report, Technical University of Vienna, 2000. http://tph.tuwien.ac.at/~oemer/doc/qcsim.pdf.

[Pit97] N. P. Pitsianis. The Kronecker Product in Approximation and Fast Transform Generation. PhD thesis, Cornell University, 1997.
[Pre98] J. Preskill. Physics 229: Advanced mathematical methods of physics - quantum computation and information. Technical report, California Institute of Technology, http://www.theory.caltech.edu/people/preskill/ph229/, 1998.

[Sho97] P. W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing, 26(5):1484–1509, 1997.

[Spr98] P. L. Springer. Matpar: Parallel extensions for Matlab. In Proceedings of the International Conference on Parallel and Distributed Techniques and Applications, volume 3, pages 1191–1195, July 1998.

[Tur36] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42:230–265, 1936.

[vDMV01] W. van Dam, M. Mosca, and U. Vazirani. How powerful is adiabatic quantum computation? In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (FOCS 2001), pages 279–287, 2001.

[WPD00] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Technical report, Department of Computer Science, University of Tennessee, Knoxville, March 2000.