A Parallel Environment for Simulating Quantum Computation

by

Geva Patz

B.S. Computer Science, University of South Africa (1998)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2003

© Massachusetts Institute of Technology 2003. All rights reserved.

Author: Program in Media Arts and Sciences, School of Architecture and Planning, May 21, 2003

Certified by: Stephen A. Benton, Professor of Media Arts and Sciences, Thesis Supervisor

Accepted by: Andrew B. Lippman, Chairman, Department Committee on Graduate Students

A Parallel Environment for Simulating Quantum Computation

by Geva Patz

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning on May 21, 2003, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences

Abstract

This thesis describes the design and implementation of an environment that allows quantum computation to be simulated on classical computers. Although it is believed that quantum computers cannot in general be efficiently simulated classically, it is nevertheless possible to simulate small but interesting systems, on the order of a few tens of quantum bits. Since the state of the art of physical implementations is less than 10 bits, simulation remains a useful tool for understanding the behavior of quantum algorithms.
To create a suitable environment for simulation, we constructed a 32-node cluster of workstation-class computers linked by a high-speed (gigabit Ethernet) network. We then wrote an initial simulation environment based on parallel linear algebra libraries with a Matlab front end. These libraries operated on large matrices representing the problem being simulated. The parallel Matlab environment demonstrated a degree of parallel speedup as we added processors, but overall execution times were high, since the amount of data scaled exponentially with the size of the problem. This increased both the number of operations that had to be performed to compute the simulation, and the volume of data that had to be communicated between the nodes as they were computing. The scaling also affected memory utilization, limiting us to a maximum problem size of 14 qubits.

In an attempt to increase simulation efficiency, we revisited the design of the simulation environment. Many quantum algorithms have a structure that can be described using the tensor product operator from linear algebra. We believed that a new simulation environment based on this tensor product structure would be substantially more efficient than one based on large matrices. We designed a new simulation environment that exploited this tensor product structure. Benchmarks that we performed on the new simulation environment confirmed that it was substantially more efficient, allowing us to perform simulations of the quantum Fourier transform and the discrete approximation to the solution of 3-SAT by adiabatic evolution up to 25 qubits in a reasonable time.

Thesis Supervisor: Stephen A. Benton
Title: Professor of Media Arts and Sciences

A Parallel Environment for Simulating Quantum Computation

by Geva Patz

The following people served as readers for this thesis:

Thesis Reader:
Isaac L. Chuang, Associate Professor, MIT Media Laboratory

Thesis Reader: Edward Farhi, Professor of Physics, MIT Center for Theoretical Physics

Acknowledgments

Many thanks to the members of my thesis committee: To Steve Benton, my advisor, who stepped in at a moment of need and guided me through the completion of this thesis. His wise guidance and kind support were invaluable.

Thanks to my reader, Isaac Chuang, without whom this thesis would not have happened. He introduced me to the world of quantum computing, pointed me at the problem that this thesis addressed, and suggested ways to approach the solution. He also enabled me to have access to the computing resources required to make this work possible.

Thanks also to my other reader, Eddie Farhi, who introduced me to the idea of adiabatic quantum computing, and whose office I always looked forward to visiting.

I'd have to switch to a much bigger font to thank Linda Peterson adequately. Her office is a haven for desperate, panic-stricken, confused or otherwise needy students, and she is a wellspring of helpful advice (um, I mean options).

To my wife, Alex: thank you so much for the support and encouragement you've given me throughout my time at MIT, and for putting up with me in my sleep-deprived, not-altogether-cheerful thesis-writing mode.

Contents

1 Introduction 19

2 Background 23
2.1 Why is quantum computing interesting? 23
2.2 Basic concepts of quantum computation 26
2.2.1 Quantum bits 26
2.2.2 Quantum gates 27
2.3 Quantum algorithms 31
2.3.1 The quantum Fourier transform 32
2.3.2 Quantum computation by adiabatic evolution 33
2.4 The tensor product 36

3 Parallel simulation of quantum computation 39
3.1 Simulating quantum computing 40
3.1.1 Previous work in simulating quantum computation 41
3.2 Parallel processing and cluster computing 43
3.2.1 Cluster computing 44
3.3 The tensor product as a means to optimize computation 46
3.3.1 Parallelizing by the tensor product 46
3.3.2 Efficient computation of tensor product structured multiplications 50

4 The simulation environment 53
4.1 Hardware 54
4.2 Initial software implementation 56
4.2.1 Overview of libraries used 57
4.2.2 Prior work in parallelizing Matlab 58
4.2.3 Design of the parallel Matlab environment 59
4.3 The tensor product based simulation environment 63
4.3.1 Algorithm (circuit) specification language 64
4.3.2 Compilation 68
4.3.3 Distribution 72
4.3.4 Execution 76

5 Evaluation 79
5.1 Methodology 79
5.2 The fundamentals 81
5.2.1 Single node execution times 81
5.2.2 Data transfer timing 82
5.2.3 Startup overhead 84
5.3 Gates and gate combinations 85
5.4 The quantum Fourier transform 88
5.4.1 The quantum Fourier transform on the initial environment 89
5.4.2 Replicating the discrete Fourier transform 90
5.4.3 Circuit-based implementation 92
5.4.4 Comparing efficient and inefficient circuit specifications 95
5.5 3-SAT by adiabatic evolution 97

6 Conclusions 101

A Code listings 105
A.1 Circuit specification language parser 105
A.2 Quantum Fourier transform circuit generator 120
A.3 3-SAT problem generator for adiabatic evolution 125

List of Figures

2-1 Quantum NOT gate 28
2-2 The CNOT gate 29
2-3 Three CNOTs form a swap gate 29
2-4 Using a Hadamard gate to generate entangled Bell states 30
2-5 General circuit representation of the quantum Fourier transform 33
3-1 Basic schematic form of a quantum circuit 40
3-2 A representative quantum circuit 41
4-1 The cluster nodes, seen from below 55
4-2 The two dimensional block cyclic distribution 57
4-3 Layering of libraries in the first generation simulation environment 59
4-4 State diagram for the parallel Matlab server master node 60
4-5 State diagram for the parallel Matlab server slave nodes 61
4-6 Circuit to demonstrate different specification orderings 67
4-7 Circuit for the compilation example 69
4-8 State of the internal data structure 71
4-9 System-level overview of the parallel execution of a problem 73
4-10 Algorithm specification to illustrate computation sequence 74
4-11 An example computation sequence, illustrating communication patterns 75
4-12 State diagram for the new simulator master node 76
4-13 State diagram for the new simulator slave nodes 77
5-1 Single-node execution times 82
5-2 Single-node execution times with identity matrix 83
5-3 Vector transfer times 84
5-4 Startup overhead 85
5-5 Block of CNOTs circuit 86
5-6 Block of CNOTs, no permutation-related communication (cf. Figure 5-8) 87
5-7 Alternating CNOTs circuit 88
5-8 Alternating CNOTs 89
5-9 Parallel Matlab based simulator performance 90
5-10 Traditional Fourier transform execution times 92
5-11 Quantum Fourier transform circuit execution times 94
5-12 Quantum Fourier transform inefficient circuit execution times 96
5-13 Execution times for 3-SAT by adiabatic evolution, N steps 99

List of Tables

4.1 Abbreviated grammar for the tensor product specification language 65
4.2 Record types for the internal compiler data structure 69
5.1 Number of runs for simulation data 80
5.2 Fourier transform execution times for larger problem size 93

Listings

4.1 A sample parallel Matlab script 62
4.2 A sample algorithm specification 66
4.3 Inefficient specification of the circuit in Figure 4-6 67
4.4 Efficient specification of the circuit in Figure 4-6 67
4.5 Input file for the compilation example 68
A.1 Header file definitions 105
A.2 Lexical analyzer definition for flex 107
A.3 Parser definition for bison 112
A.4 Efficient quantum Fourier transform circuit generation script 120
A.5 Example efficient circuit output for 4 qubits 121
A.6 Unoptimized quantum Fourier transform circuit generation script 122
A.7 Example unoptimized circuit output for 4 qubits 124
A.8 Simulation code generator for 3-SAT by adiabatic evolution 125
A.9 Example adiabatic evolution code for 3 qubits 126
A.10 Random instance generator for 3-SAT by adiabatic evolution 127

Chapter 1

Introduction

The idea of simulating quantum computation on classical computers seems at first not to make logical sense.
Quantum computing is interesting primarily because it appears to be able to solve problems that are intractable for classical computers. If this is the case, then quantum computers cannot be efficiently simulated on classical ones. Our goal, however, is much more modest. We do not seek to efficiently simulate quantum computation in general, for arbitrary problem sizes. Rather, we want to simulate the largest problems that we can, until physical implementations of quantum computers have overtaken our ability to simulate them.

The largest physically realized quantum computation to date operated on seven quantum bits (qubits) [VSB+01]. Given that the size of the simulated state space doubles with every additional qubit, simulating even the low tens of qubits would allow us to investigate problems many orders of magnitude larger than the quantum computers we are currently able to build. The exponential scaling of memory and processor demands with increasing problem size will always overwhelm us at some point, but with some thought we may be able to postpone that point far enough to allow us to simulate some interesting problems.

Simulation also has a much more rapid configuration turn-around time than physical experiments. We all hope that in the future, quantum computers will be as trivially reconfigurable as the desktop classical computers of today, but at the moment every successful physical quantum computation has been a complex, carefully planned experiment with an elaborate experimental setup tailored to solving one specific problem (often one specific instance of a problem). For now, simulation offers much greater flexibility for reconfiguration, and is an essential tool for planning any experimental realization.

More generally, the lessons we learn in simulating quantum computation on classical computers may yield insights that will be useful in other fields that deal with similarly large problems.
An obvious application would be simulating other quantum systems, but similar techniques are also useful in fields such as image and signal processing.

To achieve even the relatively modest goal of simulating problems on the order of 20 qubits, we require substantial computing resources, and an intelligent approach to using those resources. One line of attack is to combine the resources (memory and CPU) of multiple processing units. High-end parallel computers are, however, expensive, rare, and often difficult to program. Ideally, we would like to harness the power of readily available, inexpensive, easily configurable workstation-class computers to perform our computations. This suggests the technique of 'cluster computing', in which multiple off-the-shelf workstations are combined into a single parallel computing resource.

Regardless of the amount of simulation hardware we have available, it is useful to find efficient ways of representing the problem we are simulating, in order to reduce the resource consumption of our simulations. The resources we are typically most interested in are memory and CPU time, but in a clustered computing environment another resource becomes significant too: communication.

In this thesis, I will describe a simulation environment that we have built to explore the simulation of quantum computation. We began by building and configuring a clustered computing environment. We then implemented a simulator on it based on a library of parallel linear algebra routines. This approach was chosen because linear algebra is at the core of the mathematical representation of quantum computing.

Although our initial simulation environment validated the feasibility of simulating quantum computation on a cluster of classical workstations, it also uncovered a number of limitations, both of cluster computing in general and of the specific simulation approach we had chosen.
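To preview why an efficient representation matters so much, consider applying a single-qubit gate to an n-qubit state. Done naively, this means building a 2^n × 2^n matrix; done with the structure of the problem in mind, it reduces to many small 2 × 2 multiplications. The following is a minimal illustrative sketch in NumPy (not the thesis's actual Matlab/C implementation; the function name and indexing convention are ours), showing the gate applied without ever forming the full operator:

```python
import numpy as np

def apply_single_qubit_gate(state, gate, k, n):
    """Apply a 2x2 `gate` to qubit k (0-indexed, most significant first)
    of an n-qubit state vector, without forming the 2^n x 2^n operator."""
    # View the length-2^n vector as a tensor with one binary axis per qubit.
    psi = state.reshape([2] * n)
    # Contract the gate's column index with qubit k's axis...
    psi = np.tensordot(gate, psi, axes=([1], [k]))
    # ...then move the resulting axis back into position k.
    psi = np.moveaxis(psi, 0, k)
    return psi.reshape(-1)

# Example: NOT on the middle qubit of |000> yields |010> (index 2).
X = np.array([[0, 1], [1, 0]], dtype=complex)
state = np.zeros(8, dtype=complex)
state[0] = 1.0                                   # |000>
out = apply_single_qubit_gate(state, X, 1, 3)
```

The same result could be obtained by multiplying by the explicit operator `np.kron(np.kron(I, X), I)`, but that matrix has 4^n entries, whereas the reshaped contraction touches only the 2^n amplitudes. This is the flavor of saving that tensor-product-structured evaluation provides.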
We therefore developed a new simulation environment, designed to represent and simulate problems more efficiently, and to reduce reliance on the inter-node communication that had proved to be a substantial bottleneck in cluster computing. Specifically, we based our new simulator on the tensor product, a mathematical structure that neatly parallels the structure of many quantum algorithms, and that provides a basis for a more compact and efficient representation of these problems.

Chapter 2 starts by outlining the basic concepts of quantum computing and the associated mathematics that will be necessary to understand the rest of this thesis. Chapter 3 continues by reviewing concepts relevant to the parallel simulation of quantum computation. It discusses how quantum computation may be simulated, and introduces cluster computing, which forms the basis of the architecture of our simulation environment. It also describes how the tensor product, introduced in the previous chapter, has been used as a structure for parallelization and efficient execution of simulations.

With the background out of the way, Chapter 4 describes our simulation environment in detail, beginning with the cluster hardware, then moving on to a description of the initial (parallel matrix based) simulation environment. It describes the limitations of the initial environment that motivated the design of the new simulation environment, and then describes the new design. Chapter 5 discusses our evaluation of the new simulation environment, describing the benchmarks we developed and the results of running them on the simulator. Finally, Chapter 6 summarizes our conclusions, and suggests some directions in which the simulation environment could be taken in the future.

Chapter 2

Background

This chapter introduces some key background concepts that will be relevant to the rest of the thesis.
This is by no means intended to be an exhaustive or rigorous survey of the subject of quantum computation; instead, it is meant to give the reader enough background to follow the concepts and notation used elsewhere. For a more comprehensive review of quantum computation and the underlying principles of quantum mechanics, the reader is referred to [NC00] or [Pre98].

After a brief motivation in Section 2.1 of why quantum computing is interesting to study, Section 2.2 introduces some of the elementary concepts and principles of quantum computation, along with the mathematical structures used to represent them. In Section 2.3 we review a few representative quantum algorithms. Section 2.4 introduces the tensor product, which will be the mathematical key to the design of our simulation environment.

2.1 Why is quantum computing interesting?

The theory of quantum computation is rich and interesting in its own right, but it is of particular interest because it is believed that quantum computers may be able to perform certain types of computation that are fundamentally too hard for classical computers to perform in a reasonable time.

A classical computer is simply a computer in which the physical representation and manipulation of information follows the laws of classical physics. This definition encompasses every practical 'computer' in use today, from the microcontroller in a washing machine to the fastest supercomputers. Strictly, since quantum mechanics underpins all of physics, classical computers are simply a special case of quantum computers; but since their design does not directly exploit the principles of quantum mechanics, it is helpful to distinguish them from quantum computers that do.
More formally, classical computers are types of Turing machines, named for Alan Turing, who in a seminal 1936 paper [Tur36] developed the first formal, abstract model that defined what it means for a machine, abstract or physical, to compute an algorithm. The abstract mathematical computing 'machine' that Turing introduced was called a 'logical computing machine' in his paper, but we now refer to it as a Turing machine.

Although in principle any problem expressible as an algorithm can be solved on a Turing machine, in practice certain kinds of problems may not be solvable on classical computers with reasonable computational resources (with 'resources' usually defined as storage space and computing time). The study of the resource requirements of algorithms is known as complexity theory. Complexity theory divides problems into a number of complexity classes based on the resources required to solve them.

One of the most important of these classes is P (for Polynomial time), defined as the set of problems¹ that can be solved on a deterministic machine (loosely, a conventional Turing machine) in polynomial time; in other words, where the amount of time (equivalently, the number of steps) taken to solve the problem is related to the size of the problem by a polynomial in the problem size. Less formally, P is essentially the class of problems that can be efficiently computed on classical computers.

Another class, NP (for Nondeterministic Polynomial time), is defined as those problems where a solution can be verified in polynomial time. Clearly P ⊆ NP, since the solution to any problem in P can be verified by executing the problem in polynomial time.

¹ Strictly, the complexity classes are defined in terms of decision problems, i.e. problems that require a YES or NO answer. Since algorithms can be restated as equivalent decision problems, we ignore this formal nicety here.
It is believed that P ≠ NP, and many problems in NP are known for which no polynomial-time solution algorithm exists, but this has not yet been proved, and whether or not P = NP remains one of the great unanswered questions in computer science. A further complexity class is PSPACE: those problems solvable with unlimited time but a polynomial amount of storage space (memory). Again, it is clear that NP ⊆ PSPACE, and it is suspected, but unproven, that NP ≠ PSPACE. Thus it remains unknown even whether or not P = PSPACE. It is known that there are classes of problems outside PSPACE, hence outside P. For instance, we know that PSPACE ⊂ EXPSPACE, where EXPSPACE is the set of problems solvable with unlimited time and an amount of memory that increases exponentially with problem size.

How does all this discussion of complexity classes relate to quantum computers? A new complexity class, BQP, has been defined to encompass the problems that can be efficiently solved on a quantum computer: those that can be solved in polynomial time by a quantum computer with a bounded probability that the solution is incorrect (most definitions give this probability bound as p ≤ 0.25, but the choice of bound is arbitrary). It has been shown that BQP ⊆ PSPACE, but the relation to P and NP is unproven. Tantalizingly, however, there are problems that have been shown to be in BQP but that are strongly believed to be outside P. Proving that this is so, i.e. proving that there are problems solvable on quantum computers that cannot be solved efficiently on classical computers, would imply that P ≠ PSPACE; but even in the absence of a proof, there are strong hints that suggest it is true. Herein lies the promise of quantum computing.

In particular, the current interest in quantum computing was largely stimulated by a paper by Peter Shor [Sho97] that gave algorithms for calculating discrete logarithms and
for finding the prime factors of an integer in polynomial time on a quantum computer. The integer factoring algorithm generated particular interest because of its potential application to cryptanalysis (the well-known RSA public key cryptosystem, for instance, depends on the difficulty of integer factoring for its security). Shor presented an algorithm that finds the prime factors of an integer N in O((log N)³) time. No classical algorithm is known that can perform factoring in time O((log N)^k) for any k. The most efficient classical algorithms currently known, the Number Field Sieve and the Multiple Polynomial Quadratic Sieve, have super-polynomial run times (O(e^{c (\ln N)^{1/3} (\ln \ln N)^{2/3}}) for a constant c, and O(e^{\sqrt{\ln N \ln \ln N}}), respectively) [Bre00]. Algorithms such as Shor's strongly suggest that BQP ≠ P, and it is this that drives much of the interest in quantum computing.

2.2 Basic concepts of quantum computation

2.2.1 Quantum bits

The elementary unit of quantum data is the qubit (for "quantum bit"), by analogy with the bit in classical computing. Although we will deal with qubits almost exclusively as mathematical abstractions, it is important to bear in mind that, just as classical bits have a physical representation, so qubits correspond to physical states within a quantum computer, subject to the laws of physics and in particular those of quantum mechanics.

Qubits, like bits, have states, such as |0⟩ and |1⟩. Unlike classical bits, qubits are not restricted to these states, but can take on any state of the form

|ψ⟩ = a|0⟩ + b|1⟩ ,  (2.1)

where a and b are complex numbers. Mathematically, the states |0⟩ and |1⟩ are orthonormal basis vectors for a two-dimensional complex vector space. The state of a qubit is a unit vector in this space. The general qubit in (2.1) is said to be in a superposition of the states |0⟩ and |1⟩.
The restriction to unit vectors arises from the interpretation of a and b. A crucial principle of quantum mechanics is that we cannot precisely determine ("measure") the state of a quantum system (in this case, a qubit). Measuring a qubit in the state |ψ⟩ above will yield the measurement 0 with probability |a|² and 1 with probability |b|². Since these probabilities must sum to 1, |a|² + |b|² = 1.

This vector representation of qubits is a very useful mathematical abstraction, and we will make extensive use of it in our simulation environment. We will often use column vector notation to represent the state of a qubit, as in

|ψ⟩ = \begin{pmatrix} a \\ b \end{pmatrix} .  (2.2)

The extension of these concepts to multiple qubits is straightforward. Two qubits have four computational basis states, |00⟩, |01⟩, |10⟩ and |11⟩, corresponding to the four possible states of a pair of classical bits. These states are sometimes written as integers, in the form |0⟩, |1⟩, |2⟩ and |3⟩ respectively. The state vector describing the state of a pair of qubits is simply

|ψ⟩ = a|00⟩ + b|01⟩ + c|10⟩ + d|11⟩ .  (2.3)

More generally, a system of n qubits has computational basis states of the form |x₁x₂...xₙ⟩, where each xᵢ ∈ {0, 1}. There are therefore 2ⁿ such basis states, and the state vector for such a system has 2ⁿ entries (probability amplitudes). This exponential growth of information as the number of qubits increases hints at the potential computational power of quantum computing.

2.2.2 Quantum gates

Computation with qubits requires manipulating their states. These manipulations are again physical, and their exact nature depends on the particular physical implementation of a given quantum computer. Here too, though, it is helpful to use a mathematical abstraction to describe these manipulations independent of any specific physical implementation.
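The state-vector formalism of Section 2.2.1 translates directly into code, which is exactly what a classical simulator stores and manipulates. A minimal sketch, assuming NumPy (illustrative only, not the thesis's simulation environment):

```python
import numpy as np

# Computational basis states as vectors, as in eq. (2.2).
ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

# An arbitrary superposition a|0> + b|1> with |a|^2 + |b|^2 = 1.
a, b = 3 / 5, 4j / 5
psi = a * ket0 + b * ket1
assert np.isclose(np.vdot(psi, psi).real, 1.0)   # unit vector

# Measurement probabilities are squared amplitude magnitudes.
probs = np.abs(psi) ** 2                         # [0.36, 0.64]

# Multi-qubit basis states are tensor (Kronecker) products:
# |01> = |0> (x) |1> has a 4-entry state vector; n qubits need 2^n
# amplitudes, which is the exponential scaling discussed above.
ket01 = np.kron(ket0, ket1)                      # [0, 1, 0, 0]
```

The last line is the source of the memory pressure described in the abstract: at 25 qubits the state vector alone holds 2²⁵ complex amplitudes.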
An abstraction that is helpful in describing a wide range of quantum algorithms is the circuit model, in which algorithms are described as a collection of quantum gates that operate on qubits, by analogy with the logic gates of classical computing (AND, OR, NOT, etc.).

To illustrate, consider the quantum NOT gate. Just as the classical NOT swaps bit values, taking 0 → 1 and 1 → 0, the quantum NOT gate takes |0⟩ → |1⟩ and |1⟩ → |0⟩. More generally, however, it takes any state |ψ⟩ = a|0⟩ + b|1⟩ to the state |ψ′⟩ = b|0⟩ + a|1⟩. Graphically, this is usually represented as in Figure 2-1 (X is a standard shorthand for the NOT gate, and ⊕ represents the classical XOR operation, or equivalently binary addition modulo 2).

Figure 2-1: Quantum NOT gate

Mathematically, the NOT gate can be represented as a 2 × 2 matrix:

X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}  (2.4)

All quantum gates on n qubits can be represented similarly as 2ⁿ × 2ⁿ unitary matrices. A matrix U is unitary when UU† = I (U† is the adjoint of U, defined as (U*)ᵀ, where U* is the complex conjugate matrix of U). The unitarity property is necessary to ensure that the 'output' of a quantum gate remains a unit vector.

An example of a two-qubit gate is the controlled-NOT, or CNOT gate. It has the form

U_{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} .  (2.5)

The CNOT gate is graphically represented as in Figure 2-2.

Figure 2-2: The CNOT gate: the representation on the right is a common shorthand.

Three alternating CNOTs in succession have the effect of exchanging the values of two qubits, as in Figure 2-3. This combination can itself be represented as a two-qubit operator, the swap gate.

Figure 2-3: Three CNOTs form a swap gate: the common representation of the swap gate is on the right.

Another useful gate is the Hadamard gate, denoted H. It has the form
H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} .  (2.6)

The Hadamard gate takes |0⟩ to an equal superposition of |0⟩ and |1⟩:

|0⟩ → (|0⟩ + |1⟩)/√2  (2.7)

It is this ability to create and manipulate superpositions of states that gives quantum computers their inherent parallelism. To illustrate, suppose we have a function f that can be implemented with a unitary operator U_f that transforms two input qubits |x⟩|y⟩ as follows (where ⊕ signifies single-bit binary addition):

U_f : |x⟩|y⟩ → |x⟩|y ⊕ f(x)⟩  (2.8)

Now suppose that we apply U_f to the input state with |x⟩ in the superposition shown in (2.7) and |y⟩ = |0⟩. Then

U_f : ((|0⟩ + |1⟩)/√2) |0⟩ → (|0⟩|f(0)⟩ + |1⟩|f(1)⟩)/√2 .  (2.9)

This output state contains information about both f(0) and f(1), so in a loose sense we have performed an evaluation of f on both 0 and 1. This notion of quantum parallelism is one of the keys to the potential power of quantum computation. However, note that we cannot extract both the values f(0) and f(1) directly from this output state. If we attempt to measure it, we will destroy the state and obtain one of the two measurement outcomes (0, f(0)) or (1, f(1)), each with probability p = 0.5.

To unlock the information potential of quantum systems, we need another concept, that of entanglement. A full discussion of entanglement is beyond the scope of this overview. However, let us consider the circuit in Figure 2-4, which demonstrates another important use for the Hadamard gate: preparing a class of entangled states known as Bell states or EPR pairs.

Figure 2-4: Using a Hadamard gate to generate entangled Bell states

If the input to this circuit is |00⟩, then the output is

|ψ⟩ = (|00⟩ + |11⟩)/√2 ≡ |β₀₀⟩ .  (2.10)

At first glance, this might look like just another superposition, but that is not the case. If we were to apply a Hadamard transform to each of two qubits to create a superposition, the output would be

|φ⟩ = (|00⟩ + |01⟩ + |10⟩ + |11⟩)/2 .  (2.11)
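The gate matrices above can be exercised numerically. A hedged NumPy sketch (again illustrative, not the thesis's implementation): it checks that X, H and CNOT are unitary, and reproduces the Bell-state circuit of Figure 2-4 by applying H to the first qubit and then a CNOT:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)                 # eq. (2.4)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # eq. (2.6)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)                # eq. (2.5)

def is_unitary(U):
    """U U^dagger = I, so the gate maps unit vectors to unit vectors."""
    return np.allclose(U @ U.conj().T, np.eye(U.shape[0]))

assert all(is_unitary(U) for U in (X, H, CNOT))

# Bell-state circuit (Figure 2-4): H on the first qubit, then CNOT.
ket00 = np.zeros(4, dtype=complex)
ket00[0] = 1.0
bell = CNOT @ np.kron(H, I2) @ ket00
# bell is (|00> + |11>)/sqrt(2): amplitudes [1/sqrt(2), 0, 0, 1/sqrt(2)]
```

Note how the single-qubit H is lifted to the two-qubit space as H ⊗ I before it can be composed with CNOT; this Kronecker-product lifting is exactly the structure the later chapters exploit.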
2 (2.11) Quantum algorithms As in classical computing, our interest in quantum computing is to be able to execute useful algorithms. The quantum bits and quantum gates introduced above provide a helpful abstraction for specifying these algorithms. Just as classical logic gates can be combined into circuits, so quantum gates can be combined into quantum circuits. Many useful algorithms can be conveniently expressed in this form, and indeed we often use the terms ’algorithm simulation’ and ’circuit simulation’ interchangeably. We have already seen simple circuits that perform useful functions such as swapping qubit values (Figure 2-3) and generating Bell states (Figure 2-4). For a more substantial example, we will discuss an algorithm to calculate an important transform known as the quantum Fourier transform in section 2.3.1. The circuit model is not the only way of thinking of quantum algorithms. In section 2.3.2, we consider an alternative technique, that of quantum adiabatic evolution, and apply it to 3-SAT, a classic hard problem from traditional computer science. The algorithms presented in this section were chosen to give an illustrative flavor of some applications of quantum computing. They will also form the basis of some of the performance benchmarks for the tensor product based simulation environment discussed in Chapter 5. 32 CHAPTER 2. BACKGROUND 2.3.1 The quantum Fourier transform The quantum Fourier transform is analogous to the classical discrete Fourier transform, familiar from signal processing applications, which takes an input vector (x0 , x1 , . . . , xN −1 ) of complex numbers, and maps it to an output vector (y0 , y1 , . . . , yN −1 ) as follows: −1 1 NX xj e2πijk/N yk = √ N k=0 (2.12) The quantum Fourier transform has an analogous definition. Given an orthonormal basis |0i, |1i, . . . 
, |N − 1i, the quantum Fourier transform is a linear operator acting as follows on the basis states: −1 1 NX √ |ji → e2πijk/N |ki N k=0 (2.13) Although the above representation makes the relation between the discrete and the quantum Fourier transforms clear, an alternative equivalent representation, known as the product representation, provides a more useful structure for generating circuits to compute the quantum Fourier transform (with N = 2n ): |ji → = n 1 O 2n/2 1 2n/2 l=1 n Y k=1 −l |0i + 22πij2 |kl i |0i + exp 2πi k X l=1 (2.14) j n−k+l 2l ! ! |1i , (2.15) where ji is the ith bit in the binary representation of j, and |kl i is the lth qubit in |ki. This representation corresponds to the quantum circuit in figure 2-5. Absent from the circuit is the final bit reversal operation, which reverses the order of the output qubits analogously to the bit reversal of the discrete Fourier transform. Each of the Ri in the diagram is a rotation, defined by 33 2.3. QUANTUM ALGORITHMS 1 Ri = |j1 i H R2 ... 2πi/2k 0 e (2.16) . Rn−1 Rn • |j2 i 0 H R2 . . . Rn−2 Rn−1 .. . .. . • |jn−1 i • • |jn i H R2 • • H Figure 2-5: General circuit representation of the quantum Fourier transform The quantum Fourier transform, in turn, is an important component of many significant larger quantum algorithms, such as Shor’s integer factoring algorithm. 2.3.2 Quantum computation by adiabatic evolution Although the circuit model is a convenient abstraction for representing quantum algorithms, it is not the only way of mapping problems onto quantum systems. One alternative framework [FGG+ 01] is based on exploiting the adiabatic theorem. To understand the adiabatic theorem, we must introduce another fundamental concept of quantum mechanics, the Hamiltonian. 
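As an aside, before turning to the Hamiltonian formalism: the gate matrices of section 2.2 and the Fourier transform definition (2.13) are small enough to check directly. The sketch below is our own illustration (Python with NumPy, an assumption on our part; the thesis environments themselves use Matlab and C). It builds the Bell state of (2.10) from the circuit of Figure 2-4 and compares the QFT matrix against a classical FFT.

```python
import numpy as np

# One-qubit gates (eqs. 2.4, 2.6) and the two-qubit CNOT (eq. 2.5)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Unitarity check: U U-dagger = I
assert np.allclose(H @ H.conj().T, np.eye(2))

# Bell-state circuit of Figure 2-4: CNOT (H tensor I) |00>
ket00 = np.zeros(4, dtype=complex)
ket00[0] = 1
bell = CNOT @ np.kron(H, np.eye(2)) @ ket00
# Expect (|00> + |11>)/sqrt(2), i.e. eq. (2.10)
assert np.allclose(bell, np.array([1, 0, 0, 1]) / np.sqrt(2))

# QFT matrix from eq. (2.13): F[k, j] = exp(2*pi*i*j*k/N)/sqrt(N)
n = 3
N = 2 ** n
j, k = np.meshgrid(np.arange(N), np.arange(N))
F = np.exp(2j * np.pi * j * k / N) / np.sqrt(N)
assert np.allclose(F @ F.conj().T, np.eye(N))  # the QFT is unitary

x = np.random.rand(N) + 1j * np.random.rand(N)
# With the sign convention of (2.13), the QFT coincides with
# sqrt(N) times NumPy's inverse FFT.
assert np.allclose(F @ x, np.sqrt(N) * np.fft.ifft(x))
```

Note the sign convention: (2.13) uses $e^{+2\pi i jk/N}$, so the QFT matches the (scaled) inverse of NumPy's forward FFT, which uses the negative exponent.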
The time evolution of a quantum system can be described by the Schrödinger equation:

    i\hbar \frac{d|\psi(t)\rangle}{dt} = H(t)|\psi(t)\rangle    (2.17)

Here, $|\psi(t)\rangle$ is the state vector of the system at time $t$, $\hbar$ is Planck's constant (generally, units are chosen such that $\hbar = 1$), and $H(t)$ is a Hermitian operator called the Hamiltonian of the system.

Briefly stated, the adiabatic theorem states the following: consider a quantum system whose evolution is governed by the Hamiltonian $H(t)$. Take $H(t) = \tilde{H}(t/T)$, where $\tilde{H}(s)$ is a one-parameter family of Hamiltonians with $0 \le s \le 1$. Let the instantaneous eigenstates and eigenvalues of $\tilde{H}(s)$ be given by $\tilde{H}(s)|l; s\rangle = E_l(s)|l; s\rangle$, with $E_i(s) \le E_j(s)$ for $i < j$. The adiabatic theorem states that if the gap between the lowest two eigenvalues, $E_1(s) - E_0(s)$, is strictly greater than zero for all $0 \le s \le 1$, then

    \lim_{T \to \infty} |\langle l = 0; s = 1|\psi(T)\rangle| = 1 .    (2.18)

In other words, if the gap is positive as above, then if $T$ is big enough (i.e. if $H(t)$ varies slowly enough), $|\psi(t)\rangle$ remains close to the ground state $|\psi_g(t)\rangle = |l = 0; s = t/T\rangle$ of the system. This gives a hint as to how adiabatic evolution might be used for quantum computation, if we can specify our algorithm in the form of a series of Hamiltonians $H(t)$, chosen in such a way that the initial ground state of $H(0)$ is both known and easy to construct. For each instance of the problem, we can then construct a problem Hamiltonian $H_P$. Although $H_P$ is not difficult to construct, its ground state, which encodes the solution to the corresponding instance of the problem, is difficult to compute directly. This is where we use adiabatic evolution. We set $H(T) = H_P$, so the ground state $|\psi_g(T)\rangle$ of $H(T)$ encodes the solution. $T$ is the running time of our algorithm. For $0 \le t \le T$, $H(t)$ smoothly interpolates between the initial Hamiltonian $H(0)$ and the final Hamiltonian $H(T) = H_P$. If $T$ is large enough, then $H(t)$ will vary slowly.
By the adiabatic theorem, then, the final state of this evolution, $|\psi(T)\rangle$, will be close to the solution state $|\psi_g(T)\rangle$.

To illustrate, consider the example of the 3-SAT problem, which has been shown to be NP-complete [Coo71]. A problem is NP-complete if it is in the complexity class NP, and if it has the property that any other problem in NP is reducible to it by a polynomial time algorithm (by 'problem $\Phi$ is reducible to problem $\Phi'$' we mean that any instance of $\Phi$ can be converted in polynomial time into an instance of $\Phi'$ with the same truth value). An $n$-bit instance of 3-SAT is a Boolean formula consisting of a conjunction of clauses $C_1 \wedge C_2 \wedge \ldots \wedge C_M$, where each clause $C$ involves at most three of the $n$ bits. The problem requires finding a satisfying assignment, that is, a set of values for each of the $n$ bits that makes all of the clauses simultaneously true.

An instance of 3-SAT can be expressed in a manner suitable for the application of adiabatic evolution by constructing a Hamiltonian to represent it (a 'problem Hamiltonian') as follows: for each clause $C$ with associated bits $z_{iC}$, $z_{jC}$ and $z_{kC}$, define an energy function $h_C(z_{iC}, z_{jC}, z_{kC})$ such that $h_C = 0$ if $(z_{iC}, z_{jC}, z_{kC})$ satisfies clause $C$, and $h_C = 1$ otherwise. Each bit $z_i$ is associated with a corresponding qubit $|z_i\rangle$. Each clause $C$ is associated with an operator

    H_{P,C}\,|z_1\rangle|z_2\rangle \ldots |z_n\rangle = h_C(z_{iC}, z_{jC}, z_{kC})\,|z_1\rangle|z_2\rangle \ldots |z_n\rangle .    (2.19)

The problem Hamiltonian $H_P$ is then the sum over the clauses of the $H_{P,C}$:

    H_P = \sum_C H_{P,C}    (2.20)

Given a problem Hamiltonian as above, one can solve the instance of 3-SAT by finding its ground state. To do this, it is necessary to start with a Hamiltonian $H_B$ with a known ground state (the 'initial Hamiltonian') and to use adiabatic evolution to go from this known ground state to the ground state of $H_P$.
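The problem Hamiltonian of (2.19)–(2.20) is diagonal in the computational basis, which makes it easy to construct for small instances. The following sketch is our own illustration (in Python rather than the thesis's Matlab; the toy instance and its clause encoding are hypothetical choices of ours):

```python
import itertools
import numpy as np

# Toy 3-SAT instance on n = 3 bits, chosen for illustration:
# (z0 or z1 or z2) and (not z0 or not z1 or z2)
# Each literal is a pair (bit index, negated?).
clauses = [((0, False), (1, False), (2, False)),
           ((0, True), (1, True), (2, False))]
n = 3

def h_C(assignment, clause):
    """Clause energy of eq. (2.19): 0 if satisfied, 1 otherwise."""
    satisfied = any(bool(assignment[i]) != neg for i, neg in clause)
    return 0 if satisfied else 1

# H_P is diagonal: its (z, z) entry is the number of clauses that
# the assignment z violates (eq. 2.20).
assignments = list(itertools.product([0, 1], repeat=n))
diag = [sum(h_C(z, c) for c in clauses) for z in assignments]
H_P = np.diag(np.array(diag, dtype=float))

# Zero-energy ground states are exactly the satisfying assignments.
ground = [z for z, e in zip(assignments, diag) if e == 0]
print(ground)
```

For this instance, every assignment except $(0,0,0)$ (which violates the first clause) and $(1,1,0)$ (which violates the second) has zero energy, so the ground space directly encodes the satisfying assignments.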
$H_B$ is constructed as follows: define the one-bit Hamiltonian $H_B^{(i)}$ acting on bit $i$ thus:

    H_B^{(i)} = \frac{1}{2}\left(1 - \sigma_x^{(i)}\right) ,    (2.21)

where

    \sigma_x^{(i)} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} .    (2.22)

For each clause $C$, define

    H_{B,C} = H_B^{(i_C)} + H_B^{(j_C)} + H_B^{(k_C)} ,    (2.23)

then

    H_B = \sum_C H_{B,C} .    (2.24)

Adiabatic evolution proceeds by taking

    H(t) = \left(1 - \frac{t}{T}\right) H_B + \frac{t}{T} H_P ,    (2.25)

so

    \tilde{H}(s) = (1 - s)H_B + sH_P .    (2.26)

Start the system at $t = 0$ in the known ground state of $H(0)$ (i.e. in the ground state of $H_B$). By the adiabatic theorem, if $T$ is big enough and the minimum gap, $g_{min}$, between the two lowest energy eigenstates is not zero, $|\psi(T)\rangle$ will be close to the ground state of $H_P$, which represents the solution to the instance of 3-SAT.

2.4 The tensor product

The tensor product is also known as the Kronecker product or the direct product of matrices. It is an operation on two matrices, denoted $A \otimes B$. If $A$ and $B$ are $m \times n$ and $p \times q$ matrices respectively, then $A \otimes B$ is an $mp \times nq$ matrix defined as follows:

    A \otimes B = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,n}B \\ a_{2,1}B & a_{2,2}B & \cdots & a_{2,n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1}B & a_{m,2}B & \cdots & a_{m,n}B \end{pmatrix} .    (2.27)

The tensor product has a number of useful properties, which will be helpful later when we attempt to compute tensor products. It is associative:

    A \otimes (B \otimes C) = (A \otimes B) \otimes C    (2.28)

It is distributive over normal matrix multiplication:

    (A \otimes B)(C \otimes D) = AC \otimes BD ,    (2.29)

(provided that the dimensions of $A$, $B$, $C$ and $D$ are such that $AC$ and $BD$ are defined). Inverses and transposes of tensor products have the following useful properties:

    (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}    (2.30)

    (A \otimes B)^T = A^T \otimes B^T    (2.31)

The above already suggests that we may be able to reduce the amount of computation performed on a large matrix if it can be expressed as the tensor product of smaller matrices. To see this, take for example the matrix $A = M_1 \otimes M_2 \otimes \ldots \otimes M_n$, and consider the relative amount of computational effort in computing $A^{-1}$ directly versus computing $M_i^{-1}$ for the $n$ smaller matrices.
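Properties (2.29)–(2.31) can be checked numerically with NumPy's `kron`; the short sketch below is our own illustration, not part of the thesis software:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 2))
B = rng.random((3, 3))
C = rng.random((2, 2))
D = rng.random((3, 3))

# (2.29): the mixed-product property
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# (2.30): invert the two small factors instead of the 6x6 product
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))

# (2.31): transpose distributes over the tensor product
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
```

Property (2.30) illustrates the saving alluded to in the text: inverting the $2 \times 2$ and $3 \times 3$ factors is far cheaper than inverting the full $6 \times 6$ product, and the gap widens rapidly as the factors multiply.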
Finally, there are two more properties of the tensor product that will be useful to us in calculating the trace and eigenvalues of matrices represented in tensor product form. In the case of the trace, we have that

    \mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B) .    (2.32)

In the case of eigenvalues, if $A$ and $B$ have eigenvalues $\lambda_i$ and $\mu_j$ respectively, with corresponding eigenvectors $x_i$ and $y_j$, then

    (A \otimes B)(x_i \otimes y_j) = \lambda_i \mu_j (x_i \otimes y_j) .    (2.33)

In other words, every eigenvalue of $A \otimes B$ is a product of an eigenvalue of $A$ and an eigenvalue of $B$.

How is this useful to us in simulating quantum computing? It turns out that quantum circuits often have natural decompositions in terms of the tensor product. Consider for example the simple circuit for the Hadamard transform on four qubits, which applies an $H$ gate to each of the four qubits in parallel. This has the tensor product representation $H \otimes H \otimes H \otimes H$, where $H$ is the one-qubit ($2 \times 2$) Hadamard gate matrix. In general, any parallel sequence of gates can be represented as the tensor product of the operator matrices corresponding to them.

Chapter 3

Parallel simulation of quantum computation

This chapter will review the concepts that motivate the design of our simulation environment. Section 3.1 gives an overview of the type of simulations that we wish to perform, and gives a sense of the complexity of implementing these simulations on classical computers. One way of tackling this complexity is by using the combined power of multiple processors in parallel to perform the simulation. There are many approaches to parallel computing, and we have chosen to use an architecture known as cluster computing. Section 3.2 defines cluster computing, motivates our choice of this architecture, and describes the challenges and limitations particular to it. In order to exploit the potential advantages of parallel hardware, we require a means of parallelizing the computations we will perform.
In the previous chapter, we introduced the tensor product and saw how we could use it as a structure for many quantum algorithms. Now, in section 3.3, we will explain how the tensor product structure has been used to guide parallelization. We will also consider how tensor product based transforms can be efficiently applied at each of the parallel steps.

3.1 Simulating quantum computing

The phrase "simulating quantum computing" can have a number of meanings¹. For the purposes of this thesis, when we say that we intend to simulate quantum computation, we mean that we will use classical computers to simulate the operation of certain quantum algorithms, typically expressed as quantum circuits. We do not simulate the behavior of any particular physical implementation of quantum computation, concerning ourselves rather with the algorithmic/circuit abstraction.

In order to simulate quantum circuits, we must be able to represent each circuit, its input and its output classically. At their most basic, quantum circuits can be thought of as transforms that operate on an $n$-qubit input state $|\psi\rangle$ to produce an output state $|\psi'\rangle$, as in Figure 3-1.

Figure 3-1: Basic schematic form of a quantum circuit

As we have seen in Chapter 2, the input $|\psi\rangle$ and the output $|\psi'\rangle$ can be represented as state vectors of dimension $2^n$, say $x$ and $y$ respectively. The entries in each of these vectors are complex numbers. The circuit itself can be represented by a $2^n \times 2^n$ transform matrix $U$, also of complex numbers. Simulating the operation of the circuit is then simply a matter of performing the computation

    y = Ux .    (3.1)

The simplicity of equation (3.1) is deceptive, however. For a start, the sizes of $x$, $y$ and $U$ grow exponentially with problem size.
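The exponential growth is easy to quantify. As a sketch of our own (using 8 bytes per single-precision complex entry, so a dense $U$ plus the state vector need $2^{2n+3} + 2^{n+3}$ bytes):

```python
# Storage for a naive dense simulation of an n-qubit circuit:
# the state vector has 2^n complex entries and the transform matrix
# has 2^n * 2^n of them, at 8 bytes per single-precision complex value.
def naive_storage_bytes(n: int) -> int:
    return 2 ** (n + 3) + 2 ** (2 * n + 3)

for n in (10, 16, 25):
    gib = naive_storage_bytes(n) / 2 ** 30
    print(f"{n:2d} qubits: {gib:,.1f} GiB")
```

Already at 16 qubits the dense matrix alone occupies 32 GiB, and each additional qubit quadruples it, which is why the matrix-based environment described later hit a hard ceiling well before the tensor-product one.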
If we store $U$ and $x$ and perform a full matrix-vector multiplication, we will require at least $2^{n+3} + 2^{2n+3}$ bytes of storage for single precision values. This translates to over 32 GB for a 16-qubit problem. We will also require $O(2^{2n})$ complex multiplication operations.

¹Our use of the term 'simulation' to refer to the simulation of quantum computations on classical computers should not be confused with quantum simulation, which typically refers to the simulation of a quantum system on a quantum computer.

Furthermore, the entries of $U$ are usually not directly specified, but must be computed from the constituent gates of the circuit. As an example, consider the circuit in Figure 3-2. This corresponds to the multiplication

    y = (I_2 \otimes U_f)(U_d \otimes U_e)(U_a \otimes U_b \otimes U_c)\,x ,    (3.2)

where $I_2$ is the $2 \times 2$ identity matrix.

Figure 3-2: A representative quantum circuit

The amount of computational work required to perform the above calculation naïvely turns out to be greater still than would have appeared at first when we treated circuits as a single transform. Here, we require at least three matrix-vector multiplications, in addition to the work required to compute the three sets of tensor products.

3.1.1 Previous work in simulating quantum computation

There are relatively few simulation environments for quantum computation in the literature. Most consist of languages or environments for describing small quantum circuits or algorithms and for simulating them on single-processor workstations. A good representative of this class is the QCL language [Öme00a], which provides basic constructs for specifying quantum algorithms, but which has no native support for parallelism. QCL was designed primarily as a programming language, not as a simulation environment.
The ultimate intent of QCL is to provide a means to specify algorithms that would be executed using quantum computing resources (or a mix of quantum and classical computing resources). The author does, however, provide a simulation environment, the QC library [Öme00b], to allow QCL programs to be executed in the absence of quantum resources. The QC library stores quantum states as state vectors, using a compressed representation in which only non-zero amplitudes are stored. This trades memory efficiency in the case in which many amplitudes are zero for a computational performance penalty when operating on this more complex representation in memory. In the general case, where many or all of the probability amplitudes are nonzero, essentially the full state vector must be stored. The lack of support for parallelism in the QC library places significant limits on the size of the problem that can be simulated, because of both memory and CPU cycle limitations. There is, of course, nothing in principle preventing a parallel back end from being developed to execute QCL code.

Perhaps the most complete published parallel simulation environment is the one developed at ISI/USC by Obenland and Despain [OD98]. Their interest was particularly in simulating a physical implementation of quantum computation using laser pulses directed at ions in an ion trap. The ISI team had access to high-end supercomputers, specifically a Cray T3E and an IBM SP2 multiprocessor, and they took advantage of this to execute their simulations in parallel. They noted a significant speedup on larger problems, close to the theoretically predicted parallel speedup. Obenland and Despain's work was a clear indication that parallelism could be fruitfully exploited to achieve larger, faster simulations of quantum computing. However, they had access to high-end, purpose-built supercomputing environments.
We wanted to know to what extent results like this could be achieved on more widely available parallel environments, in particular on a cluster of off-the-shelf workstations. The ISI simulation timings revealed that communications overhead ultimately became the dominant time factor. Because of the highly efficient internal interconnect in the high-end supercomputers that were used, communications overhead was not a significant factor for small numbers of processors. However, when 25% of the available processors were used, communications overhead increased to 40-60% of total execution time for many problems. With half the processors in use, it increased to take up 60-90% of the execution time. These findings, on a tightly-coupled multiprocessor architecture with a high speed internal interconnect, suggest that message passing based parallelism would be even harder on clusters, where the interprocessor interconnect is substantially slower, and this was indeed our experience.

3.2 Parallel processing and cluster computing

The regular structure of, and large number of operations involved in, many linear algebra problems make them attractive candidates for parallel processing. The term 'parallel processing' refers to any architecture in which one or more processors operate on multiple data elements simultaneously. Early computers designed to perform fast linear algebra operations typically made use of vector processors. As the name suggests, vector processors perform operations (e.g. addition or multiplication) on vectors of multiple data items rather than on single memory addresses or registers. Vector processors execute instructions sequentially, but achieve parallelism at the data level. Early supercomputers typically contained one large vector processor operating at high speed.
For example, the earliest and most well known true vector processor supercomputer, the Cray-1, operated on eight 'vector registers', each of which could hold 64 eight-byte words. Vector processing has evolved into the modern concept of single instruction, multiple data (SIMD) parallelism. This is almost ubiquitous in modern processor designs, such as the Motorola PowerPC Velocity Engine, or the Intel Pentium MMX extensions.

Another, often complementary, approach to parallel computing is to increase the number of processors in the system and to parallelize the execution of algorithms across multiple processors. A number of models of parallel processing have been attempted. One popular early approach was massively parallel processing (MPP). MPP systems are so named because they contain a large number of processors — hundreds or sometimes thousands of them. Each processor has its own local memory, and the processors are linked using a high-speed internal interconnect mechanism.

3.2.1 Cluster computing

In recent years, with the advent of higher bandwidth network interconnects, it has become feasible to build parallel computing systems out of multiple independent workstations, rather than a single machine with multiple processors. This technique is known as cluster computing. The term 'cluster' is somewhat loosely applied to groups of networked conventional workstations that cooperate computationally. The workstations may be heterogeneous in terms of such factors as their processing capacity (number, type and speed of CPUs), their memory size and configuration, and even the operating systems that run on them. We make a distinction here between clusters and so-called networks of workstations (NOWs).
Although the terms are sometimes used interchangeably in the literature, the term 'cluster' is generally applied to a network that, while it may consist of workstation-class computer hardware, is essentially dedicated to the task of parallel computation. A NOW, by contrast, may comprise machines that are also used for other purposes, often desktop machines that perform networked computation only when otherwise idle.

Clustering makes parallel processing more accessible than traditional 'single box' parallel processing. Cluster components are cheaper, and are easily replaced or upgraded. Clusters can be expanded, partitioned or otherwise reconfigured with relative ease. A wider range of development tools and operating system support is available for commodity workstation hardware, making development on a cluster environment more accessible to a general user base than development on traditional multiprocessor machines.

Clusters are becoming increasingly accepted in the high performance computing community, with 93 clusters appearing in the most recent (at the time of writing) TOP500 list [MSDS02] of the highest performance computers in the world (ranked according to their performance on the LINPACK benchmark). It is worth noting, however, that many of the clusters on the list use unconventional high-speed interconnects or other customized hardware enhancements that differentiate them from off-the-shelf computing hardware. Indeed, there are only 14 'self-made' clusters on the TOP500 list at this time.

There are several tradeoffs in using a cluster instead of a traditional multiprocessor machine. Individual cluster nodes are often significantly less reliable than individual processing elements in a multiprocessor machine, an issue that becomes increasingly significant as cluster sizes increase.
Furthermore, although many message passing libraries, such as MPI [For93] and PVM [GBD+94], have been ported to cluster environments, clustering support is often less mature than support for traditional multiprocessor environments. The most significant disadvantage of cluster computing, though, is the decreased interprocessor communications performance relative to other multiprocessor environments. Local memory accesses on typical Intel processor based machines run on the order of 1 gigabyte/s, with latencies around 50-90 clock cycles. High speed memory crossbars used in some modern supercomputer designs offer even higher bandwidths. By comparison, even fast interconnects such as Gigabit Ethernet and Myrinet offer raw bandwidths on the order of low hundreds of megabytes per second. These raw bandwidths are further degraded by protocol overhead and inherent transport inefficiencies. Even with low protocol overhead, network-induced latencies are at least on the order of thousands of clock cycles [CWM01].

Probably the most widespread clustering architecture, not least of all because its definition is broad enough to encompass a wide range of different clustered environments, is the Beowulf clustering environment [BSS+95], named for the initial implementation of such an environment at NASA's Goddard Space Flight Center. Originally applied to clusters based on the free Linux operating system, the term 'Beowulf' has now come to be applied to any cluster that approximately meets the following criteria:

• The cluster hardware is essentially standard workstation hardware, possibly augmented by a more exotic fast network interconnect.

• The operating systems in use on the cluster are free Unix variants (originally Linux was the operating system of choice for Beowulf clusters; now alternatives such as FreeBSD are increasingly common).
• The cluster is dedicated to parallel computational purposes, and is typically accessed through a single point of entry (the head node).

• Some combination of software to facilitate parallel computing is installed on the cluster. Such software typically includes one or more message passing libraries, and cluster-wide administrative tools. It may also include operating system enhancements such as kernel extensions that implement a shared process numbering space.

Our parallel simulation environment is implemented on a 32-node Beowulf cluster. More details of the cluster configuration can be found in section 4.1.

3.3 The tensor product as a means to optimize computation

There are two ways in which we use the tensor product to optimize the simulation of quantum circuits. First, we use the tensor product structure to determine an intelligent parallelization of the circuit. Then, we use the tensor product decomposition to minimize the amount of matrix computation we perform.

3.3.1 Parallelizing by the tensor product

The digital signal processing community has used the tensor product as a means to structure parallel implementations of signal processing transforms for some years [GCT92]. Much work has been done in expressing common transforms such as the Fast Fourier Transform in tensor product form, and using this representation to parallelize the application of the transform to an input vector [Pit97]. In our simulations of quantum computing, we use very similar techniques to those applied to large signal processing transforms. We apply operators (transform matrices) with a tensor product structure to state vectors (input vectors). It seems reasonable, therefore, that the same techniques that have been useful in signal processing would be useful to us.
To understand how the tensor product structure can be used to determine a corresponding parallelization, let us consider an idealized $m \times m$ transform $A$ of the form

    A = I_n \otimes B ,    (3.3)

where $I_n$ is the $n \times n$ identity matrix, and $B$ is thus of size $\frac{m}{n} \times \frac{m}{n}$. Now suppose we wish to calculate the matrix-vector product $y = Ax$, where $x$ is a vector of size $m$. This corresponds to the application of a set of operators to a state vector. To illustrate, take $m = 8$ and $n = 4$. $A$ then has the following form:

    A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \otimes \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}    (3.4)

    = \begin{pmatrix}
    B_{1,1} & B_{1,2} & 0 & 0 & 0 & 0 & 0 & 0 \\
    B_{2,1} & B_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & B_{1,1} & B_{1,2} & 0 & 0 & 0 & 0 \\
    0 & 0 & B_{2,1} & B_{2,2} & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & B_{1,1} & B_{1,2} & 0 & 0 \\
    0 & 0 & 0 & 0 & B_{2,1} & B_{2,2} & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & B_{1,1} & B_{1,2} \\
    0 & 0 & 0 & 0 & 0 & 0 & B_{2,1} & B_{2,2}
    \end{pmatrix}    (3.5)

It is clear from the above that we can calculate the product $y = Ax$ by partitioning $x$ into four equal partitions of two elements, and then performing four smaller calculations of the following form ($1 \le i \le 4$):

    \begin{pmatrix} y_{2i-1} \\ y_{2i} \end{pmatrix} = B \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix}    (3.6)

Notice that the result of each of the calculations is independent of the other three results. This implies that the calculation of $y = Ax$ can effectively be parallelized across four processors. More generally, any calculation of the form $y = Ax$ where $A$ has the form $A = I_n \otimes B$ can be parallelized over $n$ processors, each performing a multiplication by $B$ of some partition of $x$.

What if we have fewer than $n$ processors? Suppose there are $p$ processors available, where $p < n$. We note that

    I_n \otimes B = (I_p \otimes I_{n/p}) \otimes B    (3.7)

    = I_p \otimes (I_{n/p} \otimes B) ,    (3.8)

and partition the computation into $p$ parallel subcomputations, each involving a multiplication of a partition of $x$ by $(I_{n/p} \otimes B)$.

What about the case when the tensor product representation does not take the convenient form $A = I_n \otimes B$? Here we must make use of a permutation matrix to rearrange the tensor product into this form.
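The block-diagonal partitioning of (3.6) is easy to check in serial form. The following is our own NumPy sketch; in the actual environment each partition would of course be handled by a separate cluster node rather than a loop iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 4, 2                        # A = I_n tensor B, with B of size b x b
B = rng.random((b, b))
A = np.kron(np.eye(n), B)          # the full m x m transform, m = n*b
x = rng.random(n * b)

# Each "processor" multiplies its own partition of x by B,
# independently of the others (eq. 3.6).
y_parallel = np.concatenate([B @ x[i * b:(i + 1) * b] for i in range(n)])

assert np.allclose(y_parallel, A @ x)
```

Because no partition's result depends on any other, the loop body maps directly onto independent processes with no interprocessor communication during the multiply.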
A permutation matrix $P$ is a square matrix whose elements are each either 0 or 1, and where

    P^T P = I = P P^T    (3.9)

A stride permutation matrix $P_{n,s}$ (where $s$ is a factor of $n$, i.e. $n = ms$) is a permutation matrix which when applied to a vector $x$ of length $n$ rearranges it as follows:

    P_{n,s}\,x = \left[ x_1, x_{1+s}, x_{1+2s}, \ldots, x_{1+(m-1)s},\; x_2, x_{2+s}, x_{2+2s}, \ldots,\; \ldots,\; x_s, x_{2s}, x_{3s}, \ldots, x_{ms} \right]    (3.10)

Now suppose that $A$ and $B$ are matrices of sizes $m \times n$ and $p \times q$ respectively. We can relate $A \otimes B$ to $B \otimes A$ with stride permutations as follows:

    A \otimes B = P_{mp,m}(B \otimes A)P_{nq,q}    (3.11)

This is known as the commutative property of the tensor product, and it allows us to rewrite tensor products of the form $A = B \otimes I_n$ in the form

    A = P_{m,m/n}(I_n \otimes B)P_{m,n} ,    (3.12)

where $m$ is the dimension of the square matrix $A$. To illustrate, consider the following circuit fragment, where $U$ and $V$ are 3-qubit and 2-qubit operators applied to the qubit groups $|j_{1,2,3}\rangle$ and $|j_{6,7}\rangle$ respectively, with the middle qubits $|j_{4,5}\rangle$ passing through unchanged. The tensor product representation of this circuit applied to the state vector $x$ corresponding to the input state is

    [U \otimes I_{2^2} \otimes V]x = [(U \otimes I_{2^2}) \otimes V]x    (3.13)

    = [(P_{2^5,2^3}(I_{2^2} \otimes U)P_{2^5,2^2}) \otimes V]x    (3.14)

    = [(P_{2^5,2^3} \otimes I_{2^2})(I_{2^2} \otimes U \otimes V)(P_{2^5,2^2} \otimes I_{2^2})]x    (3.15)

    = [p\,(I_4 \otimes (U \otimes V))\,p']x ,    (3.16)

where $p = P_{2^5,2^3} \otimes I_{2^2}$ and $p' = P_{2^5,2^2} \otimes I_{2^2}$ are permutation matrices. Thus, the circuit above can be parallelized by the technique described above using up to four processors. The permutations $p$ and $p'$ can be implemented by rearranging the elements of $x$. In a multiprocessor architecture, such rearrangements can often be achieved by alternate addressing of the underlying data. In a clustered environment, however, the rearrangements require communication of the elements to be rearranged between the nodes holding the relevant data.
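The commutation identity (3.11), as applied to a vector, can be verified numerically. The sketch below is ours and assumes the stride convention of (3.10), under which $P_{n,s}$ amounts to a reshape-and-transpose of the vector:

```python
import numpy as np

def stride_perm(x, s):
    """Stride-s permutation P_{n,s} of eq. (3.10): gather the elements
    whose (1-based) index is congruent mod s, preserving order within
    each group."""
    n = len(x)
    return x.reshape(n // s, s).T.ravel()

rng = np.random.default_rng(2)
m, n, p, q = 2, 3, 4, 5
A = rng.random((m, n))
B = rng.random((p, q))
x = rng.random(n * q)

# Commutation property (3.11): A tensor B = P_{mp,m} (B tensor A) P_{nq,q}
lhs = np.kron(A, B) @ x
rhs = stride_perm(np.kron(B, A) @ stride_perm(x, q), m)
assert np.allclose(lhs, rhs)
```

In a cluster setting this reshape-and-transpose is exactly the data rearrangement that must be realized as internode communication, which is why the permutations, cheap as matrices, dominate the communication budget.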
Since we are communicating only state vectors, and not full matrices, the volume of this communication is comparatively low relative to problem size. Nonetheless, avoiding unnecessary communications will be important in minimizing execution times.

3.3.2 Efficient computation of tensor product structured multiplications

We have discussed above how a tensor product representation of a quantum circuit can be parallelized by reducing each parallel component to the form $I_m \otimes B$, perhaps with some permutation of the data before and after. We have not yet considered how best to perform each of the multiplications $Bx_i$ (where $x_i$ is the $i$th partition of the vector $x$). Even from the very simple example above, where $B = U \otimes V$, it is clear that $B$ may well have its own tensor product structure. The multiplication of such tensor products can be performed efficiently using the following method, described in [BD96b]:

Suppose we wish to compute the product

    Bx = (M_1 \otimes M_2)x ,    (3.17)

where $M_1$ and $M_2$ have dimensions $m \times n$ and $p \times q$ respectively. We first reshape $x$ into a $q \times n$ matrix thus:

    X_{i,j} \equiv x_{(j-1)q+i}    (3.18)

It has been shown ([Dyk87]) that the product $(M_1 \otimes M_2)x$ can then be calculated by finding

    Y = \left( M_1 (M_2 X)^T \right)^T    (3.19)

and then converting $Y$ back into a vector using the reverse of the process by which $X$ was constructed in (3.18). We have reduced the multiplication of $x$ by an $mp \times nq$ matrix to a series of two multiplications by smaller matrices. In general, we have

    Bx = (M_1 \otimes M_2 \otimes \ldots \otimes M_K)x    (3.20)

    = \left( M_1 \left( M_2 \ldots \left( M_K X \right)^T \ldots \right)^T \right)^T ,    (3.21)

where $X$ is derived from $x$ as above. To see how this can significantly reduce the amount of computation required, consider the case where each of the $M_i$ ($1 \le i \le K$) is a square matrix of dimension $n \times n$. Then the number of multiplications required to compute the conventional matrix-vector multiplication $Bx$, with $B$ as in equation (3.20) above, is of order $O(n^{2K})$, since $B$ is an $n^K \times n^K$ matrix.
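The two-factor identity (3.19) is easy to verify numerically; the sketch below is our own, and notes that the reshape in (3.18) is column-major (Fortran-order):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 2, 3, 4, 5
M1 = rng.random((m, n))
M2 = rng.random((p, q))
x = rng.random(n * q)

# Eq. (3.18): X[i, j] = x[(j-1)q + i], i.e. a column-major reshape
X = x.reshape(n, q).T            # q x n; column j holds x[(j-1)q : jq]

# Eq. (3.19): Y = (M1 (M2 X)^T)^T, then flatten Y back column-major
Y = (M1 @ (M2 @ X).T).T          # p x m
y = Y.T.ravel()                  # reverse of the reshape in (3.18)

assert np.allclose(y, np.kron(M1, M2) @ x)
```

The two small matrix-matrix products here replace one multiplication by the full $mp \times nq$ Kronecker matrix, which never needs to be formed at all.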
The reformulation in equation (3.21) allows us to perform K matrix-matrix multiplications of the form M_i X_i, where X_i is the intermediate result formed by the sequence of multiplications by M_j, j = i+1 ... K (X_K ≡ X). In each case M_i is an n × n matrix, and X_i is an n × n^{K-1} matrix. Each matrix multiplication thus requires O(n^{K+1}) multiplications, and the total computation requires O(K n^{K+1}) multiplications. The computation additionally requires K matrix transpositions of n × n^{K-1} matrices. The amount of work required to perform these transpositions is, however, small relative to the work for the matrix-matrix multiplications. In principle, the matrix-matrix multiplications and the matrix transpositions could be implemented as parallel routines to further take advantage of multiple processors available in a parallel environment. However, in practice, in a clustered environment, the communications overhead of the requisite parallel linear algebra routines (discussed in more detail in Chapter 5) limits the usefulness of this additional parallelization.

Chapter 4

The simulation environment

This chapter describes our parallel environment for simulating quantum computation on a cluster of classical workstations, beginning with a description of the cluster hardware in section 4.1. Our first attempt at a simulation environment, discussed in section 4.2, was based on parallel matrix operations. These parallel operations were primarily drawn from existing optimized libraries, described in section 4.2.1. We used Matlab as the front end for this simulation environment, drawing on existing work in interfacing Matlab to parallel linear algebra libraries, described in section 4.2.2. Our implementation, described in section 4.2.3, followed the architecture of these prior implementations, but was tailored to perform the functions we required for our simulation, and to operate on complex matrices.
It became apparent to us that the matrix-based implementation of our initial simulation environment was suboptimal with respect to resource requirement scaling, both in terms of memory usage and computation time. Seeking to improve simulation efficiency, we developed a new simulation environment based on the tensor product structure of quantum circuits. This allowed us to apply prior work in the parallelization of tensor product computation, and in efficient implementation of tensor product multiplications (described above in sections 3.3.1 and 3.3.2 respectively), to our simulations.

Section 4.3 describes the new, tensor product based, simulation environment. We developed a simple circuit specification language (section 4.3.1) as an input mechanism, along with a compiler (section 4.3.2) that translates this input into a sequence of steps to be executed by the new parallel simulation code. Section 4.3.3 describes how this compiled representation is distributed to the nodes, and section 4.3.4 describes the actual execution.

4.1 Hardware

The hardware on which the simulation environment runs consists of a networked cluster of off-the-shelf computers, pictured in Figure 4-1. It consists of 33 machines with a total of 68 processors, as follows:

• One head node, with 4 GB RAM and four Intel Pentium III Xeon processors with a 900 MHz system clock speed
• Eight older cluster nodes, each with 768 MB RAM and two Pentium III processors with a 1 GHz system clock speed
• 24 newer cluster nodes, each with 1 GB RAM and two Pentium III processors with a 1.2 GHz system clock speed

The nodes are interconnected using 1000BaseT gigabit Ethernet, through a switch with a claimed backplane switching throughput of 38 Gbit/sec, enough in theory to allow for simultaneous communication between all the nodes.
Initially the nodes ran on switched 100BaseT fast Ethernet, but it soon became clear that communications overhead was a significant performance bottleneck, so the nodes were upgraded to a faster network transport. Gigabit Ethernet was chosen because of its low cost and ease of configuration relative to many other high speed interconnect mechanisms such as Myrinet or Fibre Channel.

Although the gigabit Ethernet network is a substantial performance improvement on the old 100 Mbit/sec network, several factors conspire to reduce the effective maximum throughput of the network. The PCI network adapters used in all the nodes are low-end, 32-bit wide cards. Since the nodes all have a standard 'workstation' architecture, the PCI bus is a single bus shared by all I/O devices. On the software side, we are using a stock Linux kernel, with the inefficiencies inherent in a TCP stack that must necessarily be all things to all users. We tested our networking setup with the well-known netperf network benchmarking tool. On an otherwise unloaded network, peak node-to-node TCP data throughput was 482 Mbit/sec.

Figure 4-1: The cluster nodes, seen from below. The eight older, lower-capacity machines are visible at the top right.

Reliability is a significant concern in any large network of off-the-shelf workstations. 99.5% uptime, for instance, may be entirely tolerable on a single workstation, but a cluster of 32 nodes independently experiencing 99.5% uptime will have an unacceptably low overall uptime of 85%. Two areas in which particular work was required to achieve acceptable reliability were cooling and hard drive reliability. In order to prevent frequent overheating, we found it necessary to physically enclose the ceiling-mounted rack in which the cluster nodes were installed, so that forced air could be channeled downwards from vents in the ceiling to the front of the machines.
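The cluster-wide uptime figure quoted above follows directly from the per-node figure, assuming nodes fail independently:

```python
# Probability that all 32 nodes are simultaneously up, assuming each node
# has 99.5% uptime and node failures are independent.
node_uptime = 0.995
nodes = 32
cluster_uptime = node_uptime ** nodes
print(f"{cluster_uptime:.1%}")  # prints 85.2%
```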
We also found that the hard drives on some of the nodes would frequently perform erratically. Whether this was due to exposure to electromagnetic radiation in the laboratory, mechanical unreliability, or other causes was not conclusively established, but we did find that running the machines disklessly improved uptime significantly. Diskless operation also simplified configuration: all machines mount their root file systems from a shared NFS server connected directly to the gigabit switch. There are no swap volumes; for our application, swapping would in any case impose an unacceptable performance penalty.

The nodes are used to implement a Beowulf cluster: each runs an essentially standard version of the Linux operating system (kernel version 2.4) and has available both the MPI and PVM message passing libraries. We initially ran the MOSIX [BGW93] kernel extensions, which provide automatic load balancing for Beowulf clusters. We found, however, that for our particular application it was both possible and desirable to determine the distribution of processes to nodes ourselves. The overhead involved in MOSIX automatic process switching was too great, and the message passing libraries did not support seamless process migration.

4.2 Initial software implementation

Our first generation simulation environment used an elementary linear algebraic representation of quantum states and operations. States were stored as complex valued state vectors, and operators were implemented as full matrices. This representation allowed us to use existing libraries of fast parallel linear algebra routines to perform the computations at the core of the simulation. We used Matlab as the front end for the simulation environment. We drew on prior work on interfacing Matlab with parallel linear algebra libraries to develop a client-server interface, in which a Matlab client exchanged data with, and sent requests to, a parallel linear algebra server.
4.2.1 Overview of libraries used

The linear algebra routines we used were drawn from the PBLAS [CDO+95] and ScaLAPACK [BCC+97] libraries. These are highly optimized, general purpose, FORTRAN-based parallel linear algebra libraries, first designed to be used on multiprocessor computers such as the Intel Paragon or various Cray supercomputers. They have, however, more recently been ported to run in a clustered environment.

To achieve parallelism, the routines in the linear algebra libraries distribute the data being operated on across all the processors involved in the computation. To do this, they make use of the two dimensional block cyclic distribution, illustrated in Figure 4-2.

Figure 4-2: The two dimensional block cyclic distribution: Illustrated here is a representative 8 × 8 matrix M distributed among four processors with a block size of 2 × 2. Numbers in boxes indicate the processor to which each block is assigned.

The developers of parallel linear algebra libraries chose the two dimensional block cyclic distribution to provide a good compromise between efficient allocation of resources across all the CPUs, and efficient local matrix computation on each CPU [DvdGW94]. Each local CPU uses the single-processor BLAS (Basic Linear Algebra Subroutines) library to perform matrix operations on its local portions of the distributed matrices. To communicate amongst the processors, the libraries use a standardized communication mechanism, BLACS (the Basic Linear Algebra Communications System).
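The block-to-processor mapping of Figure 4-2 can be sketched as follows, assuming square blocks and row-major assignment of ranks over the process grid (the grid shape and rank ordering here are our illustrative assumptions, not taken from the thesis):

```python
def block_cyclic_owner(i, j, block=2, pgrid=(2, 2)):
    """Return the rank of the process owning matrix element (i, j),
    0-indexed, under a 2D block-cyclic distribution with square blocks.
    Ranks are assigned row-major over the process grid."""
    prow = (i // block) % pgrid[0]   # process row, cycling over block rows
    pcol = (j // block) % pgrid[1]   # process column, cycling over block columns
    return prow * pgrid[1] + pcol

# Ownership map for an 8x8 matrix with 2x2 blocks on a 2x2 process grid:
# the top-left four blocks go to ranks 0, 1, 2, 3 and the pattern repeats.
owners = [[block_cyclic_owner(i, j) for j in range(8)] for i in range(8)]
```

Cycling blocks over the grid in this way is what gives each processor a roughly equal share of every region of the matrix, which is the load-balancing compromise described above.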
The BLACS is a message passing interface for linear algebra that in turn relies on an underlying low-level message passing library such as MPI or PVM to perform the actual inter-process communications. We used both message passing libraries interchangeably to see whether there were any significant performance differences between them, and found none.

We chose to use Matlab as a front end for the first generation simulation environment because it provides a fairly user-friendly environment for the manipulation of results, and because we had a number of legacy single-processor simulation routines written in Matlab that we wished to port to a parallel environment.

4.2.2 Prior work in parallelizing Matlab

Matlab itself does not support parallel operations [Mol95]. There have been a number of third-party attempts to provide parallel functionality for Matlab. These can be roughly divided into three classes: those that add a message passing layer to Matlab; those that facilitate the parallelization of 'embarrassingly parallel' Matlab routines (e.g. certain easily parallelizable loop constructs); and those that provide an interface between Matlab and an external set of parallel numerical libraries.

The message passing extensions were not ideally suited to our application, because they required the Matlab user to devote a great deal of time and attention to the details of parallel implementation. Nor did the nature of our quantum code lend itself to an embarrassingly parallel approach. The most potentially useful Matlab extensions for our application, therefore, were those that interfaced to external parallel linear algebra libraries. This idea was first developed in MATPAR [Spr98], which introduced the concept of a client-server extension to Matlab, with Matlab scripts making requests to an independent parallel linear algebra server. This architecture was extended by MATLAB*P [CE02], which provided a wider range of library functions.
At the time that we were developing our parallel Matlab based simulation environment, Matpar was not available for Intel x86 based clusters, and MATLAB*P did not support complex valued matrices, an essential requirement for our application.¹ We therefore had to develop our own parallel server to use with Matlab. The approach that we followed was based directly on the client-server architecture of Matpar and MATLAB*P, but focused on support for the specific functions that we required for our simulation, operating on complex valued matrices.

4.2.3 Design of the parallel Matlab environment

Our first simulation library consisted of multiple layers on both the client and the server side, illustrated in Figure 4-3. On the client side, user code calls a set of library routines, implemented as Matlab .m scripts. These, in turn, call a single Matlab extension (MEX), written in C, that forms the 'bridge' between the client and the cluster. The MEX software sends a series of commands to the cluster to instruct it to perform the appropriate operations. It also transfers data between Matlab and the cluster.

Figure 4-3: Layering of libraries in the first generation simulation environment. On the client (workstation): user Matlab code, library functions (Matlab .m scripts), the Matlab extension (MEX, C code), and the network transport (TCP/IP). On the server (cluster): the simulation library, PBLAS and ScaLAPACK, the BLACS, the message passing library (MPI or PVM), and the network transport (TCP/IP).

¹A more recent (August 2002) version of MATLAB*P introduced support for complex valued matrices, but this postdates our development work.

The Matlab extension uses a lightweight TCP/IP protocol of our own devising to communicate with the cluster. On the cluster, a single node is arbitrarily chosen as the 'master' node. The master accepts TCP/IP connections from Matlab. It receives commands and data from the Matlab extension and relays these to the other nodes on the cluster.
State diagrams for the master node and the slave nodes are shown in Figures 4-4 and 4-5 respectively (incidental error processing is omitted).

Figure 4-4: State diagram for the parallel Matlab server master node

To homogenize the communications mechanism across the cluster, the inter-node command mechanism is not implemented as a direct TCP/IP protocol, but rather through the BLACS. Commands are represented as short one-dimensional arrays. The first entry in the array is an opcode for the command, and the remaining entries are parameters (typically either references to existing matrices to be operated on, or dimensions of new matrices to be created).

Figure 4-5: State diagram for the parallel Matlab server slave nodes

The typical processing sequence is as follows: initial data is either loaded into the cluster from Matlab or, where possible, generated directly on the cluster.
A series of operations is performed on the cluster, and the final result is returned to Matlab. Until the final transfer at the end, the matrices exist only on the cluster, and Matlab operates on them through a reference that incorporates a serial number generated by the master node when each matrix is created.

Listing 4.1 shows an illustrative example of a Matlab script that calls the parallel simulation library (parallel library calls are highlighted in bold). This example applies an n-qubit quantum Fourier transform to an input vector v. To assist in understanding the listing, it may be helpful to consult the description of the quantum Fourier transform in section 2.3.1, and in particular the circuit diagram in Figure 2-5.

     1  % pqft.m Calculate Quantum Fourier Transform
     2  % v = input vector, n = number of qubits
     4  function r = pqft(v, n)
     5  hadamard = pputop([1 1;1 -1] / sqrt(2));
     6  vec = pput(v);
     7  u = peye(2^n);
     8  for k = (n-1:-1:0)
     9    v = ppron(hadamard,2^n,k,k);
    10    u = v * u;
    11    for j = (k-1:-1:0)
    12      % crk = controlled rotation by k
    13      t = pputop(crk(exp(2*i*pi/2^(k-j+1)),1));
    14      op = ppron(t, 2^n, j, k);
    15      u = op * u;
    16      pkill(op);
    17    end
    18  end
    19  f = u * vec;
    20  pkill(u); pkill(vec);
    21  r = pget(f);
    22  pkill(f);

Listing 4.1: A sample parallel Matlab script: This example calculates the quantum Fourier transform. Functions in bold are parallel functions, described in the text.

Notice how the computation begins by setting up the initial data, either by transferring it from Matlab onto the cluster (as in line 6, which uses the pput function to transfer the input vector to the cluster) or by generating it directly on the cluster (as in line 7, which uses the peye function to generate an identity matrix). Recall that operators are stored as full matrices. To save time and memory space on the Matlab workstation, operators are transferred in their compact matrix representation using the pputop function, as in line 5. They are then converted, or promoted, to full n-qubit (2^n × 2^n) matrices by the ppron function.

Since operator promotion will be an important element in our consideration of the memory demands of the parallel Matlab environment, it is worth considering the operation of ppron in a little more detail. The ppron function operates on either a one-qubit (2 × 2) operator matrix, as in line 9, or a two-qubit (4 × 4) operator matrix, as in line 14. It takes four parameters: a reference to the operator matrix O, the dimension N of the full matrix to create, and the (zero-indexed) qubit numbers j and k on which the operator is to operate (for a single-qubit operator, j = k). The function returns a reference M to the full matrix that is created. For a single-qubit operator, the effect of ppron is to create a matrix M of the following form (where n = log_2 N, and I is the 2 × 2 identity matrix):

M = I ⊗ ... ⊗ I ⊗ O ⊗ I ⊗ ... ⊗ I ,   (4.1)

with k identity terms to the left of O and n − k − 1 identity terms to the right. The effect for two-qubit operators is similar.

Notice that the matrix-matrix multiplications at lines 10 and 15 and the matrix-vector multiplication at line 19 are parallel operations whose operands are references to matrices (or vectors) stored on the cluster. The conventional '*' operator has been overloaded to call the appropriate parallel function when its operands are parallel matrix references. Once the computation is complete, the pget function at line 21 transfers the result vector from the cluster back into Matlab.
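As an illustration of what promotion computes, the following NumPy sketch builds the full matrix of equation (4.1) with ordinary Kronecker products. This shows the mathematics only, not the parallel implementation of ppron, and the left-to-right factor ordering is an assumption following (4.1):

```python
import numpy as np

def promote(O, n, k):
    """Promote a single-qubit operator O to an n-qubit (2^n x 2^n) matrix
    acting on qubit k, per Eq. 4.1: k identity factors on the left of O
    and n-k-1 on the right. An illustration of the math behind ppron,
    not the parallel implementation itself."""
    M = np.eye(1)
    for _ in range(k):
        M = np.kron(M, np.eye(2))   # identity factors to the left
    M = np.kron(M, O)               # the operator itself
    for _ in range(n - k - 1):
        M = np.kron(M, np.eye(2))   # identity factors to the right
    return M

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
M = promote(H, 3, 1)   # Hadamard on qubit 1 of a 3-qubit register: 8x8 matrix
```

The exponential cost of promotion is visible here: even a 2 × 2 operator becomes a dense 2^n × 2^n matrix, which is the memory issue taken up in the next section.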
Only this result vector, of 2^n complex numbers, needs to be transferred; all of the intermediate calculations are performed entirely on the cluster.

Matlab does not provide a reliable means of determining when a variable goes out of scope. It is therefore necessary to free matrices that are no longer in use manually, in order to optimize memory utilization on the cluster. This is the purpose of the pkill function at lines 16, 20 and 22.

4.3 The tensor product based simulation environment

The initial simulation environment demonstrated that it was possible to simulate quantum algorithms on the cluster. However, it also highlighted several significant limitations, leading us to develop the new (tensor product based) simulation environment to address them.

One of the obvious drawbacks of the initial simulation environment was the approach of storing everything as a full (problem-sized) matrix. This led to severe memory scaling limitations. For an N-qubit problem, each matrix requires 2^N × 2^N complex values, or 2^{2N+3} bytes (since each complex value requires eight bytes at single precision). Almost all computations involve matrix multiplication, which requires three matrices: two source matrices to be multiplied, and a third matrix to store the result. It follows that the cluster is practically limited to problems of a maximum of 14 qubits in size. Three 15-qubit matrices would require 24 GB of RAM. Although this is less than the total amount of RAM available on the cluster nodes (30 GB), the nature of the block cyclic distribution requires that we use equal amounts of memory on all nodes. If we use all 32 nodes, we are therefore constrained by the memory capacity of the eight 768 MB nodes. If we exclude these nodes, we are reduced to 24 nodes of 1 GB each.
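The memory arithmetic above can be reproduced directly:

```python
def matrix_bytes(n_qubits):
    """Bytes for one full n-qubit operator matrix at single-precision
    complex (8 bytes per element): (2^n)^2 * 8 = 2^(2n+3)."""
    return 2 ** (2 * n_qubits + 3)

GB = 2 ** 30
# A matrix-matrix multiply needs two operands plus a result:
three_matrices_15q = 3 * matrix_bytes(15)
assert three_matrices_15q == 24 * GB   # the 24 GB figure quoted in the text
```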
Since we require some RAM beyond the matrix storage for other application data and for the operating system on each machine, it follows that, for most practical applications (i.e. those involving any matrix-matrix multiplication), 15-qubit problems are too large to fit in memory.

The large, exponentially scaling matrices in the simulation environment not only tax the memory capacity of the cluster; they also require a significant amount of computation and, even more deadly to performance, communication. Since communication is the Achilles' heel of any Beowulf cluster, the vast volumes of data that need to be communicated can cause performance to suffer dramatically.

So, having answered the question "can we simulate?" in the affirmative, the question became "can we do better?". We hypothesized that a storage and computation mechanism that took advantage of the tensor product structure inherent in many quantum algorithms would be more efficient, and so we set out to develop a second generation simulation environment based on the tensor product structure.

Our design for a tensor product based simulation environment required several components: a means of specifying algorithms (circuits); a mechanism for translating these specifications into a form suitable for parallel execution; a distribution system to parallelize the computation across multiple nodes; and code to perform the actual computations on each node. Each of these components is discussed in order below.

4.3.1 Algorithm (circuit) specification language

We wanted an algorithm specification mechanism that would be both human- and machine-friendly. Algorithms should be straightforward for humans to enter and to understand in their source form. They should also be easy to generate by automated means (e.g. by a script that automatically generates problems of a given size from a template), and easy to parse. To this end we developed a simple algorithm specification language. An abbreviated
grammar for the language appears in Table 4.1, and an illustrative sample algorithm encoded in this language appears in Listing 4.2, which is a literal encoding of the circuit to perform a three-qubit quantum Fourier transform.

Table 4.1: Abbreviated grammar for the tensor product specification language

Tokens:
  identifier   ≡  [A-Za-z][A-Za-z_-]*
  real         ≡  [-]?[0-9]+[.][0-9]+
  integer      ≡  [0-9]+
  operator     ≡  any previously defined identifier, or 'I'
  filename     ≡  valid Unix filename, with optional path
  (literals)   ≡  identified in quotes below

Rules:
  input_line   →  line | input_line line
  line         →  statement ';'
  statement    →  definition | operation | permutation | size | inc
  definition   →  'define' identifier '(' integer ')' matrix
  matrix       →  '[' numlist ']'
  numlist      →  complex | numlist complex
  complex      →  realnum | '(' ',' realnum ')' | '(' realnum ',' realnum ')'
  realnum      →  real | '%(' real ')' | '-' '%(' real ')'
  operation    →  operator '(' intlist ')'
  permutation  →  'perm' '(' intlist ')'
  intlist      →  integer | intlist ',' integer
  size         →  'size' integer
  inc          →  'include' filename

The overall syntax is conventional: comments start with a '#' and statements are terminated with a semicolon. Algorithms typically start with a declaration of the size of the circuit in qubits (line 3), and declarations of the operators (lines 5–16). Only one size declaration is allowed, and it must appear before the first line in which an operator is used. Operators may be defined anywhere prior to their use, and may not be redefined. An operator definition specifies the name of the operator, the number of qubits on which it operates, and the values of the operator matrix. The values are complex numbers, which may be specified with their real part only (e.g. 1.0); with their imaginary part only
(e.g. (,1.0) for i); or with both real and imaginary parts (e.g. (%(0.5),%(0.5)) for (1/√2) + (1/√2)i). The notation %(x) represents √x.

     1  # 3 bit QFT - hand coded based on circuit on p220 of Nielsen & Chuang
     3  size 3;
     5  define c_s(2) [ 1.0 0.0 0.0 0.0        # controlled phase
     6                  0.0 1.0 0.0 0.0
     7                  0.0 0.0 1.0 0.0
     8                  0.0 0.0 0.0 (,1.0) ];
    10  define c_t(2) [ 1.0 0.0 0.0 0.0        # controlled pi/8
    11                  0.0 1.0 0.0 0.0
    12                  0.0 0.0 1.0 0.0
    13                  0.0 0.0 0.0 (%(0.5),%(0.5)) ];
    15  define h(1) [ %(0.5)  %(0.5)
    16                %(0.5) -%(0.5) ];        # hadamard
    18  h(0);
    19  c_s(1,0);
    20  c_s(2,1);
    21  h(1);
    22  c_t(2,0);
    23  h(2);
    24  perm(2,1,0);

Listing 4.2: A sample algorithm specification: Calculating a three-bit quantum Fourier transform.

The circuit itself is described as a sequence of operators, as in lines 18–23. Each operator is followed by the qubits on which it operates, specified in the order in which they are input to the operator. Bit numbers are zero-indexed. Notice that there is no explicit specification of the tensor product structure, as this structure is inferred by the compiler. It is, however, useful when specifying an algorithm to consider the ordering of the operators to ensure that an optimal structure is inferred. Consider the circuit in Figure 4-6. This can be represented as in Listing 4.3, but from this the compiler will infer the sequence

U_a ⊗ I ⊗ I;  U_b ⊗ U_c ⊗ I;  I ⊗ U_d ⊗ U_e;  I ⊗ I ⊗ U_f .   (4.2)

Figure 4-6: Circuit to demonstrate different specification orderings: Listings 4.3 and 4.4 are both implementations of this circuit, but listing 4.4 is implemented more efficiently by the compiler. (Qubit 0 passes through U_a then U_b, qubit 1 through U_c then U_d, and qubit 2 through U_e then U_f.)
    1  U_a(0);
    2  U_b(0);
    3  U_c(1);
    4  U_d(1);
    5  U_e(2);
    6  U_f(2);

Listing 4.3: Inefficient specification of the circuit in Figure 4-6

    1  U_a(0);
    2  U_c(1);
    3  U_e(2);
    4  U_b(0);
    5  U_d(1);
    6  U_f(2);

Listing 4.4: Efficient specification of the circuit in Figure 4-6

If we interleave the operators as in Listing 4.4, the compiler is able to infer the more efficient sequence

U_a ⊗ U_c ⊗ U_e;  U_b ⊗ U_d ⊗ U_f .   (4.3)

Finally, the permutation statement perm reorders the qubits into the order specified. The statement is rarely used, as most permutations are generated internally when the algorithm is compiled. Returning to our original example, we have used a permutation at line 24 to simulate the final bit reversal operation.

Note that permutations are not operators in the strict sense. Suppose we have an n-qubit problem, and we define the operator swap(2) as the two-qubit SWAP gate. Then the statements swap(0,1) and perm(1,0,2,3,...,n-1) will have the same effect of exchanging the values of bits 0 and 1, but they will be implemented differently by the compiler. The permutation will be implemented as a data exchange, and may possibly be optimized away entirely by the compiler, while the swap operation will be implemented as a gate like any other. Since there is no difference in practice between the effects of the two statements, the permutation should generally be preferred, as it is slightly more efficient.

There is one other statement not illustrated in this example, the include statement. This is analogous to the #include preprocessor directive in C, inserting the contents of another source file as if they appeared directly at the location of the include statement. This allows libraries of standard gates to be defined in one place, and included in multiple algorithms without tedious redefinition.

4.3.2 Compilation

We have alluded to the operation of the compiler above.
Now let us consider its operation in more detail. The compiler accepts as its input circuit specifications of the form described in section 4.3.1. It generates an internal tensor product based representation of the circuit, which directly forms the basis of the distribution of the problem to multiple processors for execution.

The compiler goes through several steps, which will be illustrated with reference to the code fragment in Listing 4.5, implementing the circuit in Figure 4-7. As the compiler proceeds, it builds and refines an internal structure that represents the circuit. This structure will ultimately be used to guide the distribution and execution of the simulation.

    1  size(5);
    2  A(2);
    3  D(3,4);
    4  B(2,1);
    5  E(3,0);
    6  perm(0,1,3,2,4);
    7  F(3,4);
    8  C(2);

Listing 4.5: Input file for the compilation example: Definitions of the operators are omitted.

Figure 4-7: Circuit for the compilation example

The compiler starts by parsing the input file and building an initial version of the algorithm structure. The parser is written using a combination of the bison and flex tools (the enhanced GNU versions of yacc and lex, respectively). The key parser code is presented in Appendix A, in listings A.1, A.2, and A.3.

The internal structure that the compiler builds up consists of a linked list of records. The possible record types are described in Table 4.2. To further clarify the difference between the B and P records described in the table, suppose we have three qubits, and they undergo two permutations: an initial permutation that exchanges the values in bit positions 1 and 2, followed by a permutation that exchanges the values in bit positions 0 and 2. The two permutations would be represented with B records as B(0,2,1) and B(1,2,0) respectively, and with P records as P(0,2,1) and P(2,1,0).
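The relationship between B and P records in this example can be expressed as a small conversion routine (a sketch of the bookkeeping, not the thesis compiler's code): the P record for each step lists, for each bit in the current ordering, that bit's position in the previous ordering.

```python
def b_to_p(b_records, n):
    """Convert absolute B records (bit orderings relative to the initial
    ordering) into relative P records: for consecutive orderings prev and
    cur, the P record p satisfies cur[k] = prev[p[k]] for each position k."""
    prev = list(range(n))          # initial ordering 0, 1, ..., n-1
    p_records = []
    for cur in b_records:
        pos = {bit: k for k, bit in enumerate(prev)}   # bit -> position in prev
        p_records.append(tuple(pos[bit] for bit in cur))
        prev = list(cur)
    return p_records

# The three-qubit example from the text: swap positions 1 and 2, then
# swap positions 0 and 2 of the resulting ordering.
assert b_to_p([(0, 2, 1), (1, 2, 0)], 3) == [(0, 2, 1), (2, 1, 0)]
```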
Table 4.2: Record types for the internal compiler data structure

  Operator, written O(op, bits): specifies an operator (op), and indicates the qubits (bits), in order, on which it operates.
  Bit order, written B(bits): specifies a permutation of the qubits by defining the new order (bits) of the bits relative to the initial bit ordering.
  Permutation, written P(bits): specifies a permutation of the qubits by defining the new order (bits) of the bits relative to the previous permutation (or the initial ordering, if there is no previous permutation).
  Terminator, written T: delimits the end of a set of operator terms forming a single tensor product.

The compiler scans the list of operations in top-to-bottom order, inferring the tensor product structure of the circuit according to the following rules:

1. Operations on independent bits are combined into a single tensor product. An example is lines 2 and 3, which are combined to form the product A_2 ⊗ D_{3,4}.
2. Re-use of a bit starts a new tensor product, as at line 4, where bit 2, previously operated on at line 2, is operated on again.
3. Each operator must operate on consecutive, increasingly numbered bits. Thus, for example, B(2,1) at line 4 must be renumbered to B(1,2), and E(3,0) at line 5 must be renumbered to E(0,1) (or some equivalent consecutive bit numbering). Permutations are inserted to effect these renumberings.
4. Tensor products are reordered to the form I ⊗ ... ⊗ I ⊗ op ⊗ ... ⊗ op. Permutations are inserted to effect this re-ordering. The re-ordering is necessary so that we can later factor out the identity matrix to determine the parallelization of the code, as described in section 3.3.1.

The compiler starts by applying rules 2 and 3. It renumbers bits within each operator as necessary, generating B-record-style permutations, as in Figure 4-8(a). It then reorders operations within each tensor product according to rule 4 (Figure 4-8(b)).
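Rules 1 and 2 above amount to a greedy grouping pass over the gate sequence. The following sketch (our illustration, not the thesis compiler, which operates on the record list) reproduces why Listings 4.3 and 4.4 compile to four and two tensor products respectively:

```python
def group_layers(ops):
    """Greedily group a sequence of (name, bits) gate applications into
    tensor-product layers: a gate joins the current layer unless one of
    its bits has already been used there (rule 2: bit reuse starts a new
    tensor product)."""
    layers, current, used = [], [], set()
    for name, bits in ops:
        if used & set(bits):          # bit reuse: close the current product
            layers.append(current)
            current, used = [], set()
        current.append(name)
        used |= set(bits)
    if current:
        layers.append(current)
    return layers

# The two orderings of the Figure 4-6 circuit:
inefficient = [("U_a", (0,)), ("U_b", (0,)), ("U_c", (1,)),
               ("U_d", (1,)), ("U_e", (2,)), ("U_f", (2,))]
efficient = [("U_a", (0,)), ("U_c", (1,)), ("U_e", (2,)),
             ("U_b", (0,)), ("U_d", (1,)), ("U_f", (2,))]
```

Here `group_layers(inefficient)` yields four layers and `group_layers(efficient)` yields two, matching the sequences (4.2) and (4.3).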
Having inserted permutations in the previous steps, the compiler then optimizes the permutations in the record list to minimize the total number of permutations (Figure 4-8(c)). Although not illustrated in this particular example, the compiler optimizes across tensor product terms where possible. The B records are a convenient form for the compiler to work with, but ultimately the permutations must be implemented by data transfers. P records specify the nature of these transfers in a more direct manner. In the final step, therefore, the compiler rewrites all the B records as P records, and inserts a final P record if necessary to restore the ordering of the output bits to the order that the user expects (typically 0, 1, 2, . . . , n, except where a final permutation has been specified in the input code).

[Figure 4-8: State of the internal data structure — (a) after rearranging bits within operators; (b) after shifting operators to high bits; (c) after optimizing permutations; (d) final form. The final form (d) is the record list:

O(A,2) O(D,3,4) T
P(3,2,4,1,0)
O(B,1,2) O(E,3,4) T
P(2,3,4,1,0)
O(F,2,3) O(C,4) T
P(4,3,0,1,2)]

4.3.3 Distribution

With compilation complete, we now in effect have a complete 'map' of the computation, and are ready to distribute it across multiple processors for parallel execution. To do this we use the method outlined in section 3.3.1.
After compilation, the computation has the form

P_0 (I_{n_1} ⊗ U_{1,1} ⊗ . . . ⊗ U_{1,k_1}) P_1 (I_{n_2} ⊗ U_{2,1} ⊗ . . . ⊗ U_{2,k_2}) . . . P_{N−1} (I_{n_N} ⊗ U_{N,1} ⊗ . . . ⊗ U_{N,k_N}) P_N ,   (4.4)

where each of the P_i (0 ≤ i ≤ N) is a permutation (possibly the identity permutation), the I_{n_i} are n_i × n_i identity matrices, and the U_{i,j} are operators. Note that the P_i are permutations of values in the state vector, derived from the qubit permutations generated during compilation. The system level steps involved in executing the simulation are outlined in Figure 4-9. This outline emphasizes the communication steps required to distribute the problem and to rearrange the intermediate result vectors during the computation. The execution process itself is considered in more detail in section 4.3.4. To better understand the communication patterns involved, note that the state vector x is multiplied by each of the tensor products sequentially, and recall that for each of the tensor products T_i, it is the identity matrix I_{n_i} of expression 4.4 that determines the parallelism of the corresponding computation. Let the total number of processors available to perform the simulation be p. If n_i ≤ p, then each of n_i processors computes

(U_{1,1} ⊗ U_{1,2} ⊗ . . . ⊗ U_{1,k_1}) x_j ,   (4.5)

where 1 ≤ j ≤ n_i, and x_j is a partition of x. If n_i ≥ p, then each of p processors computes

(I_{n_i/p} ⊗ U_{1,1} ⊗ U_{1,2} ⊗ . . . ⊗ U_{1,k_1}) x_j ,   (4.6)

1:  Launch parallel processes
2:  Send algorithm to all parallel processes
3:  Distribute initial input vector x among the processes
4:  for i = 1 to N do
5:    Perform computation x = T_i x, where T_i is the ith tensor product.
6:    if the number of processors required for T_{i+1} ≠ the number of processors required for T_i then
7:      Redistribute x to the processors involved in T_{i+1}, taking P_i into account as necessary
8:    else if i < N and P_i is not the identity permutation then
9:      Redistribute the portions of x necessary to implement P_i
10:   end if
11: end for
12: Retrieve the result vector, taking P_N into account as necessary

Figure 4-9: System-level overview of the parallel execution of a problem, emphasizing the communication (distribution and redistribution of data) involved

where 1 ≤ j ≤ p. Thus, if n_{i+1} ≠ n_i and either n_i < p or n_{i+1} < p, a different number of processors will be used to compute T_{i+1} x than T_i x, and x will need to be repartitioned. In our clustered environment, this translates into communication of the contents of x between the processors involved in computing T_i x and those involved in computing T_{i+1} x. More communication must occur if permutation P_i is not the identity permutation. In this case, communication need only occur between those processors holding values that must be permuted. The simulation begins by launching the simulation code on all the processors involved in the simulation. Although the number of active processors may vary during execution, the simulation processes are present on all processors throughout the course of the simulation, even if idle. This seemed the most reasonable approach given the added complexity and startup overhead of dynamically launching and killing processes, and given that there is control of idle processes at the operating system level in any case. The process running on the machine on which the simulation was launched is designated the master process. The role of the master is substantially diminished compared to the architecture of the parallel Matlab simulation environment. Its primary function is to collect the final result state vector (step 12 in Figure 4-9).
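The rule that decides how each tensor term is spread over processors (expressions 4.5 and 4.6) can be summarized in a few lines. This is a sketch with a hypothetical helper name, assuming n_i and p are powers of two as in the cluster configuration:

```python
def allocation(n_i, p):
    """Parallelization of tensor term i per expressions (4.5)/(4.6).

    n_i: dimension of the term's leading identity I_{n_i};
    p:   number of processors available.
    Returns (processors used, dimension of the residual identity
    factor kept on each processor).
    """
    if n_i <= p:
        return n_i, 1          # (4.5): n_i processors, no identity left
    return p, n_i // p         # (4.6): p processors each keep I_{n_i/p}
```

A change in the first element of this pair between consecutive terms is exactly what triggers the repartitioning of x at step 7 of Figure 4-9.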
The algorithm specification (step 2) and the initial input vector (step 3) are passed to the running processes as files on a shared NFS file system. The specification file is generated by the compiler, and includes the compiled algorithm specification and operator matrices for all the operations used. The input vector file, giving a list of values to populate the initial state vector, is generated by the user. A utility program allows the user to generate a state vector by providing a set of qubit values. Since each processor 'knows' its position in the processor group, and since all processors have the identical algorithm specification, each can perform its sequence of computations independently until it needs to communicate with one or more 'partner' processors. If it reaches a communication point, it blocks until the partner is ready to communicate.

Consider an illustrative example. Suppose we have four processors (P0 . . . P3), and an algorithm specification as in Figure 4-10. The sequence of computations performed by each processor is illustrated in Figure 4-11. Each box represents a computation, and communications are indicated with lines of the form •—•, labeled with the values being communicated. The notation x_j^k represents the partition (x_j, x_{j+1}, . . . , x_k) of x.

O(U1,1) O(U2,2) O(U3,3) T    } T1 = I2 ⊗ U1 ⊗ U2 ⊗ U3
O(U4,2) O(U5,3) T            } T2 = I4 ⊗ U4 ⊗ U5
P(0,1,3,2)
O(U6,2) O(U7,3) T            } T3 = I4 ⊗ U6 ⊗ U7
P(0,1,3,2)

Figure 4-10: Algorithm specification to illustrate computation sequence

The diagram illustrates all the cases in which communication is necessary. The first is at point 2, since T1 is parallelized over two processors and T2 is parallelized over four processors. The next case is at point 4, where a permutation is called for. Processors P0
and P3 are not involved in the data exchange for this permutation, so they continue to execute T3 at point 4. Finally at point 6, all the data is transferred to the master node (P0) for reassembly and output.

[Figure 4-11: An example computation sequence, illustrating communication patterns]

Note that the final permutation in Figure 4-10 is not explicitly implemented as a separate step. Instead, P0 uses this final permutation to guide the order of reassembly. This saves an additional communication between P1 and P2. Notice that although P0 and P3 are able to continue computation at point 4 without blocking, the computation ultimately has to wait for P1 and P2 to complete T3. We do, however, gain some efficiency by requiring synchronization points only on those processors that have work to be synchronized. Thus, in this example, only two nodes need to execute a synchronization handshake, which is simpler and more efficient than a four-node synchronization. Since the data being transferred consists entirely of partitions of vectors, it made sense to use the BLACS to implement inter-node communications. BLACS provides routines for exchanging vector data, along with the necessary synchronization primitives.

4.3.4 Execution

Finally, let us consider the execution process at the individual node level. State diagrams for the master node and the slave nodes are shown in Figures 4-12 and 4-13 respectively. Most of the entries in the diagrams relate to the inter-node communications discussed in section 4.3.3. In this section we will concentrate on the processing of the tensor products themselves (i.e.
the central entry in the state diagrams).

[Figure 4-12: State diagram for the new simulator master node]

To calculate the products T_i x_j efficiently, we use the procedure discussed in section 3.3.2. The code is based on the TENPACK [BD96a] Fortran library for tensor product manipulation, which implements the algorithms in [BD96b] for a wide range of matrix types (full, banded, positive definite, symmetric and triangular). Our code implements the matrix-vector multiplication algorithm from [BD96b] in C, optimized for general matrices and with provision for the efficient computation of initial identity matrix terms. The key function is tp_mult(), which has the following declaration:

void tp_mult(FLOAT *tp_array, int *sizes, int n, int id_size,
             FLOAT *x, FLOAT *work);

tp_array is a one-dimensional array of complex numbers (each complex number is
[Figure 4-13: State diagram for the new simulator slave nodes]

implemented as a consecutive pair of floats), consisting of the values of the operator matrices, ordered by rows then by columns, in the order in which the operators appear in the tensor product. The dimension of each of the operators in order is specified in the array sizes, and n gives the total number of operators, excluding any leading identity matrix. Clearly, it would be both space-inefficient to store a leading identity matrix and time-inefficient to multiply by it unnecessarily. To address this, the parameter id_size specifies the size of the leading identity matrix if there is one (or zero if there is none), so that it can be dealt with more efficiently. The tp_mult routine in turn calls the level 3 BLAS routine CGEMM to perform the required multiplications. We use the ATLAS implementation of the BLAS library [WPD00], which automatically tunes itself upon compilation to the exact architectural parameters (CPU, cache size, etc.) of the machine on which it is compiled. Our earlier experience with using various BLAS libraries as the basis for the parallel linear algebra routines in the Matlab implementation showed that ATLAS yielded the highest performance of the readily available BLAS implementations.
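The access pattern behind tp_mult can be illustrated with a pure-Python sketch of the same factor-at-a-time idea (ours, for illustration only: the real routine is C over CGEMM and stores complex values as float pairs). It multiplies a vector by (I ⊗ U1 ⊗ . . . ⊗ Uk) without ever forming the full matrix, and, like id_size in the C code, never multiplies by the leading identity:

```python
def apply_factor(x, u, left, right):
    """Apply (I_left ⊗ u ⊗ I_right) to vector x; u is an m x m
    matrix given as nested lists of complex numbers."""
    m = len(u)
    y = [0j] * len(x)
    for a in range(left):
        for c in range(right):
            base = a * m * right + c
            for r in range(m):
                s = 0j
                for k in range(m):
                    s += u[r][k] * x[base + k * right]
                y[base + r * right] = s
    return y

def tp_mult(factors, id_size, x):
    """Multiply x by (I_{id_size} ⊗ factors[0] ⊗ ... ⊗ factors[-1])
    one factor at a time; the leading identity costs nothing, since
    it only ever appears as the `left` block count."""
    total = id_size
    for u in factors:
        total *= len(u)
    assert total == len(x), "vector length must match tensor dimensions"
    left, right = id_size, len(x) // id_size
    for u in factors:
        right //= len(u)
        x = apply_factor(x, u, left, right)
        left *= len(u)
    return x
```

For example, with H the 2 × 2 Hadamard matrix, tp_mult([H, H], 1, [1, 0, 0, 0]) applies H ⊗ H to |00⟩ and yields the uniform vector (0.5, 0.5, 0.5, 0.5).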
The tp_mult routine does not declare any significantly sized local variables: the main data structures are the ones passed in the arguments above. It is fairly clear, therefore, that the main factor governing memory usage on each node is the size of x. The operators are typically small: it is unusual for an operator to operate on more than a few qubits, so the storage required for each operator is on the order of a few tens or hundreds of bytes. The size of x (and hence work) depends on the number of qubits in each state vector partition. If there are N qubits divided over p processors, then the total storage requirement per processor for x and work combined is:

2 arrays × 8 bytes/complex value × 2^N/p values/array = 2^(N+4)/p bytes   (4.7)

Thus, on the current cluster nodes, which have between 768 MB and 1 GB of RAM each, there is sufficient memory (allowing for operating system headroom) to support 25 qubits per node, or 24 qubits per processor if each of the two processors in a node is used independently. Compared to the parallel Matlab implementation, which could support a theoretical maximum of 14 qubits across the combined memory space of the entire cluster, this represents a substantial improvement. The performance of the new system also compares very favorably with that of the old. In Chapter 5, we will examine these performance issues in depth.

Chapter 5

Evaluation

We performed a number of benchmarks to characterize the performance of the new simulation environment. Section 5.1 outlines our methodology, describing the source of our execution time figures. We start our analysis in section 5.2 by looking at the basic elements contributing to simulation performance: computation time, data transfer time and startup overhead. In section 5.3 we look at simulations of structured sets of gates designed to evaluate execution times with both minimal and extensive communication.
Finally, we look at simulations of two disparate types of algorithm: the quantum Fourier transform in section 5.4, and adiabatic quantum computation of 3-SAT in section 5.5.

5.1 Methodology

Except where otherwise noted, the data below was collected by measuring the execution time of the simulation environment. The compilation time is not included, because our tests revealed it to be insignificant (on the order of at most a few hundred milliseconds). Timing was performed with the GNU time command (version 1.7), which returns the number of CPU seconds used by the process being timed in user mode, the number of CPU seconds spent in kernel-mode system calls, and the total elapsed time. The values reported below are elapsed times, as this is the most relevant time to answer the question 'how long does simulation X take?' The time command returns two significant figures for the number of elapsed seconds. In practice, limitations of the system clock and of the accuracy of the timing mechanism used lead to an effective resolution of a few hundred milliseconds. This is more than acceptable for our purposes, since most of the simulations we are timing have execution times from tens to thousands of seconds. Where the simulations ran on multiple processors, timing data was always taken for the master node, i.e. the node on which the simulation was started. This node by design started first and finished last, and thus its execution time was exactly the time for the entire simulation. The execution time graphs below omit data for small problem sizes (generally where the number of qubits N < 12). The computation time for these smaller sizes is negligible relative to the process startup time, so the graph is essentially flat for these smaller sizes. Because of the typically exponential scaling of most problems with the number of qubits, the time (y) axis is usually shown with a logarithmic scale.
The x axis is generally linear in the number of qubits (thus effectively logarithmic in the number of values in the state vector). For the cluster-based simulations (sections 5.3, 5.4 and 5.5), we ran each simulation of a given size and processor count repeatedly. The data point plotted is the median of n execution times, where n was varied according to execution time as shown in Table 5.1, in order to put a reasonable ceiling on the total execution time for each set of simulations.

Execution time range (t sec)   Number of runs (n)
t < 100                        100
100 ≤ t < 1000                 25
1000 ≤ t < 2500                10
t ≥ 2500                       5

Table 5.1: Number of runs for simulation data

5.2 The fundamentals

Before we examine the performance of the simulation environment as a whole, it is instructive to gather statistics on some basic performance characteristics. The first of these is execution time for a single node. In contrast to the other benchmarks in this section, these statistics were gathered by inserting code into the simulation software itself to record the system time before and after execution. This was done to eliminate the effects of any startup overhead. Startup overhead was then measured separately (see Figure 5-4).

5.2.1 Single node execution times

First, we measured the effect of gate size on execution time. For each problem of size N, we ran a sequence of simulations, first with N single qubit (Hadamard) gates, then with N/2 two qubit (CNOT) gates, and finally with N/3 three qubit (Toffoli) gates. The results are summarized in Figure 5-1. An interesting phenomenon apparent in this graph is that single qubit operations take less time than two- or three-qubit operations, even though the same number of multiplications is being performed. This is presumably due to the smaller (32 byte) matrix size of the one qubit operator matrix enabling it to be accessed more efficiently (possibly as register-resident data).
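The multiplication-count claim can be checked with simple arithmetic. Under the factor-at-a-time scheme of section 3.3.2, applying one 2^g × 2^g operator factor to a 2^N vector costs 2^g complex multiplications per vector element, so a full column of N/g gates costs 2^N · 2^g · (N/g) multiplications; for g = 1 and g = 2 these counts coincide exactly. (A rough model of ours, ignoring the transposition step; the helper name is hypothetical.)

```python
def column_mults(n_qubits, g):
    """Complex multiplications to apply one full column of g-qubit
    gates to a 2^n state vector, one 2^g-dimensional factor at a
    time: 2^n * 2^g per gate, with n // g gates in the column."""
    gates = n_qubits // g
    return (2 ** n_qubits) * (2 ** g) * gates

# One- and two-qubit columns need the same number of multiplications,
# yet Figure 5-1 shows the one-qubit column running faster.
assert column_mults(24, 1) == column_mults(24, 2)
```

This is why the runtime gap in Figure 5-1 points at memory access effects rather than arithmetic volume.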
Recall from section 3.3 that each tensor product term requires a multiplication step and a transposition step. We recorded the execution time of each step separately and found that, for problem sizes greater than 15 qubits, transposition took up approximately 30% of the execution time, and multiplication approximately 70%. To develop a further feel for how the number of gates in each tensor product term affects the execution time, we ran a series of tests using tensor products of the form

I_2^⊗(N−k) ⊗ H^⊗k  with  k ∈ {1, 2, 4, 8, 16} ,   (5.1)

where N as usual is the number of qubits, H^⊗k is the k-qubit Hadamard transform, and I_2^⊗(N−k) is a 2^(N−k) × 2^(N−k) identity matrix. The results are summarized in Figure 5-2.

[Figure 5-1: Single-node execution times: combinations of 1-, 2- and 3-qubit gates were executed on an isolated node with no communications]

5.2.2 Data transfer timing

Computation time is not the only factor affecting total execution time. Another significant contribution to overall execution time is communication amongst the nodes. Unlike the parallel Matlab implementation, no communication occurs during individual computational steps; the only communication is vector transfer for permutations and bit size realignment in between tensor products. To evaluate the efficiency of the vector transfer mechanism, we instrumented the code to measure the amount of time taken to exchange vectors of various sizes between two nodes. An 'exchange' in this context is a complete transmission of a vector followed by receipt of a vector of the same size. The results are summarized in Figure 5-3.
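The fit superimposed on Figure 5-3, t ≈ 10.7 × 10⁻⁹ · bits + 0.35 seconds, directly implies the throughput figure discussed next: the asymptotic rate is 10⁹/10.7 ≈ 93 Mbit/s against a 1 Gbit/s link. A quick check (our helper name; this assumes the fit's x variable is the exchange size in bits):

```python
def fitted_throughput_fraction(slope_s_per_bit=10.7e-9, link_bps=1e9):
    """Asymptotic throughput implied by the Figure 5-3 fit
    t = 10.7e-9 * bits + 0.35, as a fraction of the link rate."""
    return (1.0 / slope_s_per_bit) / link_bps
```

The result, about 0.093, matches the "slightly under 10% of the theoretical maximum" observation.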
[Figure 5-2: Single-node execution times with identity matrix: total number of (Hadamard) gates is indicated; remaining bits were left empty]

As is apparent from the curve superimposed on the data, transfer throughput is slightly under 10% of the theoretical maximum bidirectional throughput of the Gigabit Ethernet transport medium. However, as we saw in section 4.1, the effective maximum throughput of the operating system/bus/NIC combination is less than half the theoretical maximum of the transport mechanism, so the vector transfer times are in fact closer to 20% of the raw maximum throughput attainable. It is likely that this is due primarily to protocol and processing overhead. While the transfer mechanism could be more efficient, a comparison of Figures 5-3 and 5-1 reveals that communication times are encouragingly low relative to computation time. Looking at Figure 5-2, we see that this depends somewhat on the number of operations. In a computation with a relatively low number of operations per step, and a relatively high proportion of steps followed by communications, communication time can approach, and in extreme cases even exceed, computation time.

[Figure 5-3: Vector transfer times: vectors of the indicated sizes were exchanged between a pair of nodes. The superimposed fit is t = 10.7 × 10⁻⁹ · bits + 0.35 seconds.]

5.2.3 Startup overhead

Another factor contributing to overall execution time, more noticeable at lower bit numbers, is the startup time for the system. It takes a certain amount of time to launch the underlying message passing system (MPI or PVM), register all the nodes, and bring them all to the point where they are ready to start computation. This overhead is illustrated in Figure 5-4.
Although startup overhead is not particularly significant at higher bit numbers, it is the predominant time cost at lower bit numbers, as will be apparent from a number of the graphs later on.

[Figure 5-4: Startup overhead: time to launch the simulation environment on the number of processors indicated]

5.3 Gates and gate combinations

Having examined the key determinants of performance, we are now ready to look at actual computations. First, we run a simulation intended to be computationally intensive but not communications intensive. To achieve this, we use a block of CNOT gates, as illustrated in the circuit in Figure 5-5. Since each column of CNOTs operates on the same set of qubits, there are no permutation or bit distribution changing communications between each computational step. The only communication is the final data collection at the end. A third of the qubits are left free, partly to provide parallelism, and partly to make this set of circuits more directly comparable to the ones in the next test. The execution times for this 'block of gates' problem are summarized in Figure 5-6. The basic structure of this graph will recur in most of the other graphs in this section.

[Figure 5-5: Block of CNOTs circuit]

At lower bit numbers, startup overhead is the most significant time cost, dominating processing and communication times. This initially favors fewer processors. However, as bit number increases, startup time remains constant while other factors increase. In particular, computation times increase more rapidly with fewer processors.
This pattern is particularly clear in the above example, because there is no communications overhead. The lower set of points represents the 'ideal' speedup for 32 processors, being the single processor time divided by 32. Notice that the speedup converges to being quite close to the theoretical ideal. Now let us compare this to a similar example where communication overhead plays a much bigger role. In this next case, we alternate the CNOTs as shown in the circuit in Figure 5-7. Here, we have the same number of CNOTs as in the previous test, but this time they operate on alternate qubits at each step, forcing a permutation communication after each step. The effects of the added communication overhead are shown in Figure 5-8. First, overall execution times increase in all cases. The difference is insignificant at lower bit numbers, but increases to between a factor of 2 (for one processor) and 8 (for 32 processors). Second, the rates of increase in execution times differ less than in the previous case.

[Figure 5-6: Block of gates, no permutation-related communication: 16 columns of CNOT gates operating on the same bits, exercising execution time without communication (cf. Figure 5-8)]

The graphs for different processor numbers still cross over, but they do not pull away from each other as quickly. The single processor curve in particular tracks much closer to the two-processor case, because its permutations involve memory transfers, not network communications. Although the single processor has an inherent execution time disadvantage, the reduction in communication costs compensates somewhat for this.
Overall, communication costs play a much more significant role, as we can see by comparing the convergence of the 32 processor case to the ideal speedup with the previous example. At 25 bits, for example, using 32 processors gives a speedup factor of approximately 5 over the single processor case, against a comparable speedup factor of approximately 20 for the previous, low communication, example. Notice that although the effects of communication overhead are significant, it is much less of a factor than it was for the parallel Matlab implementation. This is for three main reasons: first, much less data is being transferred: vectors of size O(2^N) instead of matrices of size O(2^(2N)). Second, the communication comes after each computational step, instead of being interleaved throughout the computation. This allows the computation itself to proceed much faster, as it is not constantly blocking while waiting for I/O to complete. Finally, the effect of network and protocol induced latency is much smaller when data is transferred in bulk instead of being split into many temporally separate small requests.

[Figure 5-7: Alternating CNOTs circuit (if N is not a multiple of 3, the last one or two qubits differ appropriately)]

The slightly greater unevenness of each curve is not a result of the greater communication overhead. Rather, it is due to the structure of the problem: the number of gates relative to problem size differs slightly depending on whether N is a multiple of three, one less than a multiple of three, or two less than a multiple of three.

5.4 The quantum Fourier transform

Having characterized the basic performance factors of the tensor product based simulation environment, we were ready to put it through its paces on some real quantum algorithms.
To test the simulation environment with realistic quantum circuits, we implemented the quantum Fourier transform described in section 2.3.1.

[Figure 5-8: Alternating CNOTs: 16 columns of alternating CNOTs exercising execution and communication times]

5.4.1 The quantum Fourier transform on the initial environment

Before we look at the quantum Fourier transform on the new simulation environment, it is worthwhile to examine its performance on the old parallel Matlab based simulation environment as a baseline. We ran the quantum Fourier transform code from Listing 4.1 on grids of 1, 2, 4, 8, 16 and 32 processors, and measured the execution times using the Matlab internal timer functions (tic and toc). These times are shown in Figure 5-9. There is no data point for 13 bits on a single processor, because the problem is too large to fit into the memory of a single node. In the 32-processor case, two processors on each of 16 nodes were used. The total available RAM in this case was thus the same as for the 16-processor case (16 GB).

[Figure 5-9: Parallel Matlab based simulator performance: execution times for the quantum Fourier transform of varying size on the indicated number of processors]

Although there is clearly a speedup for higher bit numbers as the number of processors is increased, overall run times are high. This is due in part to the rapid scaling of the number of operations that need to be performed, as discussed in section 3.3.2.
This is also due to the high communications overhead of transferring large amounts of data during matrix multiplication operations.

5.4.2 Replicating the discrete Fourier transform

Let us now turn back to the new simulation environment, and examine the performance of various implementations of the quantum Fourier transform. Before implementing the circuit-based quantum Fourier transform, we implemented a well-known alternative tensor product based representation of the equivalent discrete Fourier transform. We did this in order to verify that our simulator performed as expected on a known problem with a known formula describing its run time as a function of problem size. Specifically, we used the Cooley-Tukey [CT65] factorization of the discrete Fourier transform, better known as the fast Fourier transform, which implements the 2^N-value transform F_{2^N} (equivalent to the N qubit quantum Fourier transform) as

F_{2^N} = [ ∏_{j=N}^{1} (I_{2^(N−j)} ⊗ B_{2^(j−1)}) ] R_{2^N} ,   (5.2)

where R_{2^N} is the bit reversal permutation, and B_{2^(j−1)} is the butterfly matrix defined by

B_m = (F_2 ⊗ I_m) D_m .   (5.3)

F_2 is the 2 × 2 discrete Fourier transform matrix, and D_m is a diagonal matrix of weights of the form

D_m(j, j) = exp( −2πi (j mod m) ⌊j/m⌋ / (2m) )  for j = 0 . . . (2m − 1) .   (5.4)

To implement this on our simulator, we added one small extension to allow us to calculate D_m through in-place vector multiplication. This consisted of a language extension Diag[m], to specify the application of D_m at a given point in the calculation, and a small code segment in the execution code to perform the necessary vector multiplications. The execution times for problem size n = 2^N, 12 ≤ N ≤ 25 are shown in Figure 5-10. The primary aim of this test was to validate the operation of the simulation library by comparing its execution time to the theoretically predicted execution time. The FFT implementation described above has an execution time of O(n log n), where n is the problem size.
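The factorization (5.2)-(5.4) can be exercised directly in a few dozen lines of Python. This is an independent sketch for illustration, not the simulator's Diag[m]/C implementation; the transform computed is the unnormalized DFT:

```python
import cmath

def bit_reverse(x, n):
    """Apply the bit-reversal permutation R_{2^n} to vector x."""
    out = [0j] * len(x)
    for i in range(len(x)):
        out[int(format(i, '0%db' % n)[::-1], 2)] = x[i]
    return out

def butterfly_stage(x, j):
    """Apply (I ⊗ B_m) with m = 2^(j-1), where B_m = (F_2 ⊗ I_m) D_m
    and D_m(i,i) = exp(-2πi (i mod m)·⌊i/m⌋ / (2m))  (eqs. 5.3, 5.4)."""
    m = 2 ** (j - 1)
    out = list(x)
    for base in range(0, len(x), 2 * m):
        for i in range(2 * m):       # the diagonal weight matrix D_m
            w = cmath.exp(-2j * cmath.pi * (i % m) * (i // m) / (2 * m))
            out[base + i] = w * x[base + i]
        for i in range(m):           # the F_2 ⊗ I_m butterfly
            a, b = out[base + i], out[base + m + i]
            out[base + i], out[base + m + i] = a + b, a - b
    return out

def fft_tensor(x):
    """Unnormalized 2^n-point DFT via the tensor-product
    factorization of eq. (5.2): bit reversal first, then one
    (I ⊗ B) stage per qubit, smallest butterflies first."""
    n = len(x).bit_length() - 1
    x = bit_reverse(x, n)
    for j in range(1, n + 1):
        x = butterfly_stage(x, j)
    return x
```

For example, fft_tensor([1, 2, 3, 4]) returns [10, −2+2j, −2, −2−2j], matching the 4-point DFT of that input.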
For the single processor case, the data in Figure 5-10 clearly shows this behavior, as highlighted by the superimposed graph of the function y = 5.6 × 10⁻⁷ n log n + 1.9. Multiple-processor times are higher because of the added influence of communication time. As usual, for small problem sizes, the overall time is dominated by startup overhead. As problem size increases, the effect of startup overhead becomes smaller, so the execution times for 2-32 processors appear to converge. This is because the overall amount of time required for communication is relatively higher than the time required for computation. As problem size becomes larger still, the relative effect of computation time increases, and the effects of parallelism start to give an advantage to higher numbers of processors. This is somewhat difficult to see from the graph, but becomes more apparent by looking at the values in Table 5.2, which shows execution times for larger problem sizes. Note that in all cases, however, the single-processor times are better than any of the multiple processor times.

[Figure 5-10: Traditional Fourier transform execution times]

5.4.3 Circuit-based implementation

Although the FFT implementation provides a useful confirmation that our simulation environment performs as expected, we are in a sense 'cheating', in that we are exploiting specific properties of the problem to achieve a particularly efficient simulation. A more

Size   2 procs   4 procs   8 procs   16 procs   32 procs
21     65        62        63        67         75
22     136       128       125       129        137
23     275       262       259       257        264
24     576       548       533       531        541
25     1171      1109      1086      1101       1081

Table 5.2: Fourier transform execution times for larger problem sizes: times are in seconds. Values that are in the 'wrong order' (i.e.
where more processors result in higher execution times) are emphasized in italics. realistic way of evaluating the simulator’s performance for quantum circuits in general is to implement a full circuit representation of the quantum Fourier transform. To do this, we wrote a script to generate quantum Fourier transform circuits of the form in Figure 2-5 for a given bit number. A full listing of the script, along with sample output for a four-qubit quantum Fourier transform, can be found in Appendix A, in listings A.4 and A.5 respectively. We generated circuits for a range of bit numbers up to 25 bits, and ran these circuits through the simulation environment, measuring the execution times as usual. The results are summarized in Figure 5-11. Here, computation time is higher relative to communication time than in the previous case, so the multiple processor configurations show a clear advantage sooner (about as soon as the higher startup overhead becomes insignificant). The behavior is similar to the mixed execution/communication test in Figure 5-8. In section 3.3.2, we saw that the number of multiplication operations required to compute the product of a vector and a matrix in tensor product form is O(KnK+1 ), where K is the number of tensor product terms and n is the dimension of the tensor product terms. We can approximate the execution time for the tensor product based quantum Fourier transform by setting n = 4, since the bulk of the operations are two-qubit controlled rotations. To approximate K, we note that for a fully populated column of gates, there will be N/2 gates for N qubits, so we set K = N/2. This gives an approximate execution time for each step Tstep of 94 CHAPTER 5. 
    T_{step}(N) = O\!\left( \frac{N}{2} \, 4^{N/2 + 1} \right).    (5.5)

The total number of steps N_{steps} is given by

    N_{steps}(N) = 2N - 1,    (5.6)

so the overall execution time T_{exec} can be approximated by

    T_{exec}(N) = T_{step}(N) \, N_{steps}(N)    (5.7)
                \simeq O\!\left( N^2 2^N \right).    (5.8)

[Figure 5-11: Quantum Fourier transform circuit execution times. Execution time [sec] against problem size [qubits] for 1-32 processors, with the predicted curves 2.35 × 10^{-7} N^2 2^N and 5.87 × 10^{-8} N^2 2^N superimposed.]

The higher dotted line in Figure 5-11 shows that actual single processor execution times are indeed close to the prediction (the predicted curve grows slightly faster, which is to be expected since we have slightly overestimated n and K). The values on the lower dotted line are exactly one quarter of the values on the higher dotted line. These are a good fit to the eight-processor data, illustrating that for eight processors we are achieving about half the theoretical maximum speedup. The execution times for the circuit based quantum Fourier transform are, as expected, higher than the optimized FFT case above. However, it is instructive to compare the lower end of the execution time graph for the tensor product based quantum Fourier transform to the higher end of the execution time graph in Figure 5-9 for the old parallel Matlab implementation to get a sense of the overall improvement.

5.4.4 Comparing efficient and inefficient circuit specifications

We performed a further experiment to see how significantly execution time was affected by the efficiency of the circuit representation. Recall from section 4.3.1 that the ordering of gates in a circuit can affect the efficiency with which the compiler is able to implement that circuit (see Figure 4-6 and listings 4.3 and 4.4). We were curious to quantify the variation in execution time that would result.
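Both the efficient and the inefficient representations are ultimately executed as sequences of tensor product steps of the kind costed above. The following sketch (our own illustration in NumPy; the simulator's core is C) shows how a K-term tensor product is applied to a vector one factor at a time, which is where the O(K n^{K+1}) operation count comes from.

```python
import numpy as np

def tensor_apply(factors, x):
    # Apply (A_1 ⊗ A_2 ⊗ ... ⊗ A_K) to vector x one factor at a time.
    # Each factor costs n * n^K multiplications, so K factors cost
    # O(K n^(K+1)) in total, versus O(n^(2K)) for the explicit matrix.
    K = len(factors)
    n = factors[0].shape[0]
    t = x.reshape([n] * K)
    for k, A in enumerate(factors):
        t = np.tensordot(A, t, axes=([1], [k]))  # contract factor k's axis
        t = np.moveaxis(t, 0, k)                 # restore the axis ordering
    return t.reshape(-1)

# Matches the explicit Kronecker product on a small example
rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 4)) for _ in range(3)]
x = rng.standard_normal(4 ** 3)
full = np.kron(np.kron(factors[0], factors[1]), factors[2])
assert np.allclose(tensor_apply(factors, x), full @ x)
```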
Our original quantum Fourier transform circuit generation script generated more efficient circuit specifications, with the operations interleaved as much as possible to minimize the number of consecutive tensor product multiplication steps (as in listing 4.4). We rewrote the script to deliberately generate less efficient specifications, with each qubit's gates specified sequentially (as in listing 4.3). Bear in mind that this did not change the circuit in any way, merely the compiler's optimized interpretation of it. The script can be found in listing A.6 in Appendix A, with sample output in listing A.7. The execution time results for the inefficient script are summarized in Figure 5-12. To understand these results, it is useful to consider how many tensor product steps are involved in each circuit representation, and the effect that this has on the two important parameters of execution time and communication time.

[Figure 5-12: Quantum Fourier transform inefficient circuit execution times. Execution time [sec] against problem size [qubits] for 1-32 processors, with the predicted curve 2.33 × 10^{-7} N^2 2^N superimposed.]

The number of consecutive tensor product steps N_{steps}(N) in an N qubit quantum Fourier transform implemented efficiently is given by equation 5.6, while the number of steps N_{ineff}(N) in an inefficiently implemented N qubit quantum Fourier transform is

    N_{ineff}(N) = \frac{(N-1)(N-2)}{2} + 2.    (5.9)

Since the number of operations overall remains constant for each N, the total amount of computation is similar in both cases. In fact, the average number of operations per step for the inefficient implementation is lower, which results in an improvement (albeit very small) in single-processor execution times for larger problem sizes. This can be seen from the dotted line in Figure 5-12, where the values are about 1% lower than those in Figure 5-11.
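As a concrete comparison, the two step counts from equations 5.6 and 5.9 can be tabulated directly (a small sketch of ours): the efficient interleaved form grows linearly in N while the sequential form grows quadratically.

```python
def steps_efficient(N):
    # Consecutive tensor product steps, interleaved circuit (eq. 5.6)
    return 2 * N - 1

def steps_inefficient(N):
    # Consecutive tensor product steps, sequential per-qubit circuit (eq. 5.9)
    return (N - 1) * (N - 2) // 2 + 2

# The four-qubit circuit in listing A.5 has 7 columns, matching eq. 5.6
assert steps_efficient(4) == 7
# At 25 qubits the inefficient form needs 278 steps against 49
assert steps_efficient(25) == 49 and steps_inefficient(25) == 278
```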
Since more steps require more permutations, and hence greater overall communication, the inefficient implementation fared worse in multiple processor computations than the comparable efficient implementation.

5.5 3-SAT by adiabatic evolution

To test the simulator with a somewhat different type of algorithm, we developed a simulation of the application of adiabatic evolution to the 3-SAT problem described in section 2.3.2. To recast the problem into a form suitable for testing our simulation environment, we use the simulation mechanism described by Hogg in [Hog03]. This executes a discrete approximation of the adiabatic evolution algorithm for a deterministic number of steps and then evaluates the probability of solution. We chose this method of simulation because it particularly lends itself to implementation in our tensor product based simulation environment, making it a good test algorithm. Hogg's simulation technique relies on the discrete approximation described in [vDMV01]. The evolution from the beginning Hamiltonian H(0) to the final Hamiltonian H(T) is approximated by a sequence of r Hamiltonians, each applied for time T/r. Each of the r Hamiltonians is implemented by a unitary transform

    U_j = H^{\otimes n} F_{0,j} H^{\otimes n} F_{f,j}  for 1 ≤ j ≤ r,    (5.10)

where

    F_{0,j} |x\rangle = e^{-i \frac{T}{r} (1 - \frac{j}{r}) h(x)} |x\rangle,    (5.11)
    F_{f,j} |x\rangle = e^{-i \frac{T}{r} \frac{j}{r} f(x)} |x\rangle,    (5.12)

and the functions h(x) and f(x) are based on the definitions of the initial and final Hamiltonian respectively (h(x) is the xth diagonal term in H(0) expressed in its computational basis, and similarly for f(x) and H(T)). The implementation of the two H^{\otimes N} steps is a standard 'column of gates'. However, notice that all the qubits are involved in the Hadamard transforms, so we cannot make use of the usual identity matrix factoring parallelization.
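A minimal sketch of one step of this discrete evolution, written in NumPy rather than the simulator's specification language (our own illustration; the arrays h_diag and f_diag stand for the diagonals h(x) and f(x), filled here with hypothetical values):

```python
import numpy as np

def hadamard_all(N):
    # H^{⊗N}: the column of Hadamard gates applied to all N qubits
    H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    HN = np.array([[1.0]])
    for _ in range(N):
        HN = np.kron(HN, H1)
    return HN

def adiabatic_step(psi, j, r, T, h_diag, f_diag):
    # One step U_j = H^{⊗N} F_{0,j} H^{⊗N} F_{f,j} of eq. 5.10, applied
    # right to left; F_{0,j} and F_{f,j} are in-place multiplications
    # by diagonal phase factors (eqs. 5.11 and 5.12).
    N = int(np.log2(len(psi)))
    HN = hadamard_all(N)
    psi = np.exp(-1j * (T / r) * (j / r) * f_diag) * psi        # F_{f,j}
    psi = HN @ psi
    psi = np.exp(-1j * (T / r) * (1 - j / r) * h_diag) * psi    # F_{0,j}
    return HN @ psi

# Each U_j is unitary: the state norm is preserved
rng = np.random.default_rng(1)
psi = rng.standard_normal(8) + 1j * rng.standard_normal(8)
psi /= np.linalg.norm(psi)
h_diag = rng.integers(0, 4, 8)  # hypothetical diagonal of H(0)
f_diag = rng.integers(0, 4, 8)  # hypothetical diagonal of H(T)
out = adiabatic_step(psi, j=1, r=3, T=10.0, h_diag=h_diag, f_diag=f_diag)
assert np.isclose(np.linalg.norm(out), 1.0)
```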
F_{0,j} and F_{f,j} are best implemented as in-place multiplications so, as in Section 5.4.2, we added a small section of code to the simulator, and corresponding statements (F0[j] and Ff[j]) to the specification language. Our script to generate simulation code to simulate 3-SAT by adiabatic evolution by this method is given in listing A.8 in Appendix A. Example output for a three qubit simulation is given in listing A.9. For each instance of 3-SAT, in addition to the simulation code, we need to provide the simulator with a description of the specific instance so that it has values for the Hamiltonians F_{0,j} and F_{f,j}. Code to generate random instances of 3-SAT for input into the simulator is shown in listing A.10. In the interest of keeping execution times manageable for large bit numbers, we ran the simulation with r = N, where N is the size of the problem in qubits. We used random instances of 3-SAT, since in this formulation the execution time of the simulation is constant regardless of the values of the initial and problem Hamiltonians. The number of random instances for each problem size was as in Table 5.1. Execution times are summarized in Figure 5-13. To develop an estimate for expected execution time, we can use a similar approach to the one that we used for the quantum Fourier transform. Here, in an N bit simulation, for each step there are two series of N Hadamard transforms and two multiplications by diagonal Hamiltonians. Each series of Hadamards should take O(K n^{K+1}) multiplications, with K = N and n = 2 (since all the operations are 2 × 2 Hadamards). Each of the applications of a Hamiltonian involves a simple in-place multiplication of a vector by a diagonal matrix, so there are O(N) multiplications for each of these steps. The total number of multiplications for each step, T_{step}(N), is thus given by

    T_{step}(N) = O\!\left( 2 N 2^{N+1} + N \right).    (5.13)

Since we run each simulation for N steps, the total number of multiplications T_{exec}(N) is simply

    T_{exec}(N) = O\!\left( N \left( 2 N 2^{N+1} + N \right) \right)    (5.14)
                = O\!\left( 2 N^2 \left( 2^{N+1} + 1 \right) \right).    (5.15)

[Figure 5-13: Execution times for 3-SAT by adiabatic evolution, N steps. Execution time [sec] against problem size [qubits] for 1-32 processors, with the predicted curve 2.3 × 10^{-7} N^2 (2^{N+1} + 1) superimposed.]

The dotted line in Figure 5-13 illustrates that actual single processor execution times follow the prediction closely. Notice that adding more processors does not significantly improve execution times. This is because the Hadamard transforms operate on all qubits and thus cannot be parallelized in the usual way. So, for p processors, we have a theoretical best case speedup of

    T_{exec}(N, p) = O\!\left( 2 N^2 \left( 2^{N+1} + \frac{1}{p} \right) \right).    (5.16)

Although there is a 1/p speedup for the multiplication routine, there is no speedup for the more costly 2^{N+1} Hadamard step. Formula (5.16) does not take communications overhead into account. For the multiple processor cases, we implemented a parallel version of the F0 and Ff multiplication routine, which distributed the vector amongst the multiple processors for the multiplication steps. At lower bit numbers, startup and communication overhead dominated as usual. For higher bit numbers, there was an improvement in execution time with an increased number of processors, but this was very slight (at 25 bits, execution time dropped from 9736 seconds on a single processor to 9518 seconds on 32 processors, an improvement of about 2%). In real simulations of adiabatic quantum computing, it is common to study multiple random instances of the problem (in this case 3-SAT) to gather meaningful statistics of the performance of the algorithm (in this case, the probability of solution for a given number of steps).
It would therefore be a much more efficient use of the parallelism of the cluster to check multiple instances simultaneously on single nodes, rather than single instances on multiple nodes.

Chapter 6 Conclusions

We set out to develop an environment for simulating quantum computation on a cluster of multiple classical computers. Our implementation demonstrates that this is feasible for moderately sized problems if care is taken to manage the effects of exponential scaling of resource requirements with problem size. Our initial simulation approach used a matrix representation for the quantum circuits being simulated, and applied parallel linear algebra routines to perform the computation. With the initial simulation environment, we were able to simulate slightly larger systems than the state of the art for physical quantum computers (13 qubits against 7), and we did see a limited speedup as additional processors were added, albeit with diminishing returns as we increased to 16 or 32 processors. We learned a number of lessons from our initial implementation. One key observation was that the full matrix representation is an inefficient mechanism for representing quantum circuits, as CPU, memory and communication resource requirements increase as O(2^{2N}) with problem size. This significantly limited the size of the problems that we could simulate. Another important observation was that communications overhead is a significant limiting factor in cluster computing. The parallel linear algebra libraries that we used were optimized for single-machine multiprocessor architectures, and over the much slower interconnect of our cluster network, they performed very poorly when performing operations such as matrix multiplication that require extensive communication between processors. It was clear to us that a re-think was in order.
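The O(2^{2N}) scaling can be made concrete with a little arithmetic (ours, assuming one double-precision complex amplitude of 16 bytes per entry): the storage requirements of the full-matrix and state-vector representations diverge very quickly.

```python
def full_matrix_bytes(N):
    # Full-matrix representation: a 2^N x 2^N complex matrix, O(2^(2N))
    return (1 << (2 * N)) * 16

def state_vector_bytes(N):
    # Tensor product approach: the 2^N state vector dominates, O(2^N);
    # the gate factors themselves stay 2x2 or 4x4
    return (1 << N) * 16

# At the initial environment's 14-qubit ceiling, the full matrix is
# already 4 GiB, while the state vector fits in 256 KiB
assert full_matrix_bytes(14) == 4 * 2**30
assert state_vector_bytes(14) == 256 * 2**10
```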
We hypothesized that the tensor product structure inherent to many quantum computing problems would lead to a more efficient implementation, much as it had been helpful to the digital signal processing community in implementing large matrix transforms. Our final simulation environment was therefore based around a tensor product model. Our evaluation of the tensor product based simulation environment has demonstrated that it is indeed a substantially more efficient means for performing a number of simulations. Resource requirements now increased as O(2^N): still exponential, but with a slower growth rate than before. We were able to extend simulation sizes to 25 qubits without difficulty, and overall execution times decreased dramatically. Communication overhead was still a factor, but much less so than for the initial simulation environment, and for most problems we were able to utilize the parallelism of the cluster effectively to reduce overall processing time. For more communication-intensive problems, this reduction in time was somewhat less than the maximum possible theoretical speedup, but was still significant enough to be useful. Not all problems lent themselves to tensor product based parallelization. The discrete approximation to adiabatic quantum computation, for instance, benefited from the efficient tensor product based computational core of our simulator, but saw little benefit from adding additional processors. In conclusion, we have demonstrated that it is feasible to simulate useful problems in quantum computation on classical computers. Our experience suggests that for many problems, using the tensor product structure provides an efficient means of representing and executing the problem. In many cases, the tensor product also provides a natural parallelization mechanism. Another conclusion of our study is that clustering is not a panacea for high performance computing.
Clusters are not supercomputers: even fast network transports make slow inter-processor interconnects, and more generally workstation hardware is not optimized for large-scale multiprocessing. However, if care is taken to find efficient means of representing problems that minimize communications, cluster parallelism can provide meaningful performance gains.

Appendix A Code listings

This appendix contains longer listings for several important pieces of code referenced in the text. Each listing has a cross-reference to the section of the text in which it is discussed.

A.1 Circuit specification language parser

The following three listings contain the main body of the parser for the circuit/algorithm specification language discussed in section 4.3.1. Listing A.1 contains basic definitions of data structures referenced in the other two listings. Listing A.2 contains the flex code for the parser, and listing A.3 contains the bison code (the bulk of the processing occurs in this latter listing). The simplified grammar presented in Table 4.1 on page 65 may be helpful in understanding the structure of the language defined in these listings.

    /* qparse.h: Header file for quantum circuit language parser
     * Geva Patz, MIT Media Lab, 2003 */

    #ifndef _QPARSE_H
    #define _QPARSE_H

    int yyline;
CODE LISTINGS # define FLOAT f l o a t #define MAXOP 5 /* maximum operator size in bits */ # define MAX_BITS 32 /* maximum number of problem bits */ /* Complex number type */ typedef s t r u c t { FLOAT re; FLOAT im; } complex; /* Item in linked list of values for matrix definitions */ typedef s t r u c t _ml { complex val; s t r u c t _ml * next; } matlist; /* Item in linked list of bit numbers for operators */ typedef s t r u c t _bl { i n t val; s t r u c t _bl * next; } bitlist; /* Symbol table entry record */ typedef s t r u c t _sl { char *name; int bits; complex *matrix; s t r u c t _sl *next; } symlist; /* Operator stack entry record */ typedef s t r u c t _op { enum {tOP, /* Operator (O record) */ tPERM, /* Permutation (P or B record) */ tDELIM, /* Product delimiter (D record) */ tSPECIAL /* Special function */ } type; symlist *op; /* Operator information for type = tOP */ i n t prodsize; /* Product size for type = tDELIM */ i n t *bits; /* Bits operated on for type = tOP or tPERM */ i n t function; /* Function index for tSPECIAL */ i n t parm; /* Parameter for tSPECIAL */ s t r u c t _op *next; A.1. CIRCUIT SPECIFICATION LANGUAGE PARSER } oplist; void yyerror(char *); # endif Listing A.1: Header file definitions %x include %{ /* qparse.l: flex file for quantum circuit specification language * Geva Patz, MIT Media Lab, 2003 */ # include # include # include # include # include <ctype.h> <string.h> <math.h> "qparse.h" "qparse.tab.h" e x t e r n YYSTYPE yylval; e x t e r n symlist * symbol_table; # define DEBUG_L 0 # define MAX_INCLUDES 10 # define tprintf i f (DEBUG_L) printf YY_BUFFER_STATE inc_stack[MAX_INCLUDES]; i n t inc_stack_ptr = 0; /* * Keyword table */ s t r u c t kt { char *kt_name; i n t kt_val; } key_words[] = { { "define", DEFINE }, { "perm", PERM }, { "size", SIZE }, { "F0", FZERO}, { "Ff", FF}, 107 108 APPENDIX A. 
CODE LISTINGS { "Diag", DIAG}, { 0, 0 } }; /* Function prototypes */ i n t kw_lookup(char *); symlist *st_lookup(char *); f l o a t makefloat(char *); %} WORD DQSTR DIGIT SIGN FLOAT [A-Za-z_][-A-Za-z0-9_]* \"[ˆ\"]*\" [0-9] [+-] ({DIGIT}+"."{DIGIT}+ | "%("{DIGIT}+"."{DIGIT}+")") %% /** Handle the "include" statement **/ "include" <include>[ \t\n] BEGIN(include); /* Swallow whitespace in includes */ <include>[ˆ \t\n]+ { char estr[1000]; if( inc_stack_ptr > MAX_INCLUDES) { yyerror("Too many nested includes"); exit(1); } inc_stack[inc_stack_ptr++] = YY_CURRENT_BUFFER; yyin = fopen(yytext, "r"); if(!yyin) { sprintf(estr, "Unable to open include file ’%s’" , yytext); yyerror(estr); exit(1); } yy_switch_to_buffer(yy_create_buffer(yyin,YY_BUF_SIZE )); BEGIN(INITIAL); A.1. CIRCUIT SPECIFICATION LANGUAGE PARSER 109 } /** Integers **/ {DIGIT}+ { yylval.val = atol(yytext); return INT; } /** Various forms of complex numbers **/ {SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")") { yylval.cplx.re = makefloat(yytext); yylval.cplx.im = 0.0; return NUM; } /* Real part only */ {SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")")":"{SIGN}?({DIGIT }+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")") { int i; for(i=0;yytext[i]!=’:’;i++) ; yylval.cplx.re = makefloat(yytext); yylval.cplx.im = makefloat(yytext+i+1); return NUM; } /* Real and imaginary parts */ ":"{SIGN}?({DIGIT}+"."{DIGIT}+|"%("{DIGIT}+"."{DIGIT}+")") { yylval.cplx.re = 0.0; yylval.cplx.im = makefloat(yytext+1); return NUM; } /* Imaginary part only */ /** Handle keywords, identifiers and punctuation **/ {WORD} { int i; symlist *ptr; /* Check for keywords */ if((i = kw_lookup(yytext)) != -1) { yylval.string = strdup(yytext); 110 APPENDIX A. CODE LISTINGS return i; } /* Check for identifiers */ if((ptr = st_lookup(yytext)) != NULL) { yylval.st_rec = ptr; return OP; } yylval.string = strdup(yytext); tprintf("id(%s) ", yytext); return ID; } \n { yyline++; tprintf("\n... "); #.* [ \t\f]* "," "\." ";" "\{" "\}" "(" ")" "\[" "\]" . 
} { { { { { { { { { { { { <<EOF>> { /* Ignored (comment) */; /* Ignored (white space) */; return COMMA; return DOT; return SEMICOLON; return OBRACE; return CBRACE; return OBKT; return CBKT; return OSBKT; return CSBKT; return yytext[0]; } } } } } } } } } } } } if(--inc_stack_ptr <0) yyterminate(); else { yy_delete_buffer(YY_CURRENT_BUFFER); yy_switch_to_buffer(inc_stack[inc_stack_ptr]); } } %% /* * st_lookup 111 A.1. CIRCUIT SPECIFICATION LANGUAGE PARSER * * */ Look up a string in the symbol table. string is found, NULL otherwise. Returns a pointer if the symlist * st_lookup(char *target) { symlist *ptr; for(ptr = symbol_table; ptr!=NULL && strcmp(ptr->name,target); ptr = ptr->next) /* empty loop */; return ptr; } /* * kw_lookup * Look up a string in the keyword table. Returns a -1 if the * string is not a keyword otherwise it returns the keyword number */ int kw_lookup(char *word) { struct kt *kp; for (kp = key_words; kp->kt_name != 0; kp++) if (!strcmp(word, kp->kt_name)) return kp->kt_val; return -1; } /* makefloat * Convert a string to a float, handling the %(N) square root form * as necessary */ float makefloat(char *str) { float sign,f; int pos; if(str[0]==’-’) { sign = -1.0; pos = 1; } else { sign = 1.0; pos = 0; 112 APPENDIX A. CODE LISTINGS } if(str[pos]==’%’ && str[pos+1]==’(’) f=sqrt(atof(str+pos+2)); else f=atof(str+pos); return(sign*f); } Listing A.2: Lexical analyzer definition for flex %{ /* qparse.y: bison file for quantum circuit specification language * Geva Patz, MIT Media Lab, 2003 */ # include # include # include # include # include <stdio.h> <string.h> <stdlib.h> "qparse.h" "transform.h" symlist * symbol_table; oplist * op_stack; i n t bitlimit, maxbits, errflag; void addsym(char *, i n t , complex *); void pushop(symlist *, i n t *); void pushspecial( i n t , i n t ); %} %union { long val; complex cplx; matlist *matrix; bitlist *bits; symlist *st_rec; char *string; } %token CBKT ")" %token CBRACE "}" A.1. 
CIRCUIT SPECIFICATION LANGUAGE PARSER %token COMMA "," %token CSBKT "]" %token DEFINE "define" %token DOT "." %token <string> ID %token <val> INT %token <cplx> NUM %token OBKT "(" %token OBRACE "{" %token <st_rec> OP %token OSBKT "[" %token PERM "perm" %token SEMICOLON ";" %token SIZE "size" %token DIAG "Diag" %token FZERO "F0" %token FF "Ff" %type <matrix> numlist %type <matrix> matrix %type <bits> intlist %% input: line {} | input line {} ; line: stmt ";" {}; stmt: fn_defn {} | operator {} | permute {} | size {} | special {}; fn_defn: "define" OP "(" INT ")" matrix { /* If OP is matched instead of ID, this is a redefinition */ char estr[1024]; snprintf(estr, 1023, "Redefinition of operator ’%s’", $2->name); yyerror(estr); } | "define" ID "(" INT ")" matrix { complex *vallist; i n t i, matsize; 113 114 APPENDIX A. CODE LISTINGS matlist *lptr; i f ($4 > MAXOP) { yyerror("Operator size too large"); } matsize = (1<<$4)*(1<<$4); vallist = (complex *)malloc( s i z e o f (complex)*matsize); f o r (i=matsize-1, lptr=$6; i>=0 && lptr!=NULL; i--, lptr=lptr->next) { vallist[i] = lptr->val; } i f (i>=0) yyerror("Matrix too small for number of bits specified "); i f (lptr!=NULL) yyerror("Matrix too large for number of bits specified "); addsym($2, $4, vallist); }; matrix: "[" numlist "]" {$$ = $2}; numlist: NUM { matlist *m; m = (matlist *)malloc( s i z e o f (matlist)); m->val = $1; m->next = NULL; $$ = m; } | numlist NUM { matlist *m; m = (matlist *)malloc( s i z e o f (matlist)); m->val = $2; m->next = $1; $$ = m; } ; operator: OP "(" intlist ")" A.1. 
CIRCUIT SPECIFICATION LANGUAGE PARSER 115 { i n t *destbits, *temp; i n t i, j; i n t used[MAX_BITS]; bitlist *ptr; destbits = NULL; f o r (i=0; i<=maxbits; i++) used[i] = 0; f o r (i=0,ptr=$3;ptr!=NULL;ptr=ptr->next,i++) { temp = ( i n t *)malloc((i+1)* s i z e o f ( i n t )); f o r (j=0;j<i;j++) temp[j+1] = destbits[j]; temp[0] = ptr->val; i f (destbits!=NULL) free(destbits); destbits=temp; i f (used[ptr->val]) yyerror("Duplicate bit in operator"); used[ptr->val] = -1; } i f (i<$1->bits) yyerror("Too few bits specified for operator"); e l s e i f (i>$1->bits) yyerror("Too many bits specified for operator"); else pushop($1, destbits); }; permute: "perm" "(" intlist ")" { i n t *destbits; i n t used[MAX_BITS]; i n t i; bitlist *ptr; /*yyerror("patz hasn’t implemented this properly");*/ destbits = ( i n t *)malloc((maxbits+1)* s i z e o f ( i n t )); f o r (i=0; i<=maxbits; i++) used[i] = 0; f o r (i=0,ptr=$3;ptr!=NULL;ptr=ptr->next,i++) { destbits[maxbits-i] = ptr->val; i f (used[ptr->val]) 116 APPENDIX A. CODE LISTINGS yyerror("Duplicate bit in permutation"); used[ptr->val] = -1; } --i; i f (i<maxbits) yyerror("Too few bits specified for permutation"); e l s e i f (i>maxbits) yyerror("Too many bits specified for permutation"); else pushop(NULL, destbits); }; intlist: INT { bitlist *m; m = (bitlist *)malloc( s i z e o f (bitlist)); i f (maxbits==-1) { yyerror("Problem size not declared"); /* Fudge to prevent lots of duplicate messages */ maxbits = bitlimit; } e l s e i f ((m->val = $1) > maxbits) yyerror("Bit value too large"); m->next = NULL; $$ = m; } | intlist "," INT { bitlist *m; m = (bitlist *)malloc( s i z e o f (bitlist)); i f (maxbits==-1) { yyerror("Problem size not declared"); /* Fudge to prevent lots of duplicate messages */ maxbits = bitlimit; } e l s e i f ((m->val = $3) > maxbits) yyerror("Bit value too large"); m->next = $1; $$ = m; } ; size: "size" INT A.1. 
CIRCUIT SPECIFICATION LANGUAGE PARSER 117 { i f (maxbits!=-1) yyerror("Duplicate size declaration"); e l s e i f ($2>bitlimit) yyerror("Problem size exceeds maximum allowable"); e l s e i f ($2==0) yyerror("Problem size must be non-zero"); else maxbits = $2 - 1; } ; special: "Diag" "[" INT "]" {pushspecial(0,$3);} | "F0" {pushspecial(1,0);} | "Ff" {pushspecial(2,0);} | %% /* addsym: add a symbol (operator name) to the symbol table */ void addsym(char *name, i n t bits, complex *values) { symlist *ptr, *lptr; ptr = (symlist *)malloc( s i z e o f (symlist)); ptr->name = strdup(name); ptr->bits = bits; ptr->matrix = values; i f (symbol_table == NULL) symbol_table = ptr; else { f o r (lptr = symbol_table; lptr->next!=NULL; lptr=lptr->next) /* empty loop */ ; lptr->next = ptr; } } /* pushop: add an operator (or a permutation, if op==NULL) to the stack */ void pushop(symlist *op, i n t *bitlist) { oplist *ptr, *lptr; ptr = (oplist *)malloc( s i z e o f (oplist)); ptr->op = op; 118 APPENDIX A. CODE LISTINGS ptr->bits = bitlist; ptr->type = (op == NULL)?tPERM:tOP; ptr->next = NULL; i f (op_stack == NULL) op_stack = ptr; else { f o r (lptr = op_stack; lptr->next!=NULL; lptr=lptr->next) /*empty loop */ ; lptr->next=ptr; } } /* pushspecial: add a special function record to the stack */ void pushspecial( i n t func, i n t param) { oplist *ptr, *lptr; ptr = (oplist *)malloc( s i z e o f (oplist)); ptr->type = tSPECIAL; ptr->function = func; ptr->parm = param; ptr->next = NULL; i f (op_stack == NULL) op_stack = ptr; else { f o r (lptr = op_stack; lptr->next!=NULL; lptr=lptr->next) /*empty loop */ ; lptr->next=ptr; } } /* yyerror: Bison error handling routine */ void yyerror(char *s) { fprintf(stderr, "%s at line %d\n", s, yyline+1); errflag = -1; } /*** main entry point ***/ i n t main( i n t argc, char **argv) { FILE *out; A.1. 
        /* initialization */
        symbol_table = NULL;
        op_stack = NULL;
        bitlimit = MAX_BITS - 1;
        maxbits = -1;
        errflag = 0;

        if(argc!=2) {
            fprintf(stderr, "Usage: %s <output_file>\n", argv[0]);
            exit(-1);
        }
        out = fopen(argv[1],"w");
        if(out == NULL) {
            perror("File open error");
            exit(-1);
        }

        yyparse();

        if(errflag) {
            fprintf(stderr, "*** Errors occurred: aborting\n");
            exit(-1);
        }
        if(op_stack == NULL) {
            fprintf(stderr, "*** Empty problem: aborting\n");
            exit(-1);
        }

    #ifdef DEBUG
        dumpstack("Pre-transformation", op_stack, NULL, NULL, maxbits+1);
    #endif

        transformstack(op_stack, maxbits+1);

    #ifdef DEBUG
        dumpstack("Post-transformation", op_stack, NULL, NULL, maxbits+1);
    #endif

        return writestack(out, op_stack, maxbits+1);
    }

Listing A.3: Parser definition for bison

A.2 Quantum Fourier transform circuit generator

The listings in this section present scripts for generating quantum Fourier transform circuits for simulation, as discussed in section 5.4, along with representative output of the scripts. It may also be helpful to refer to the general quantum Fourier transform circuit schema in Figure 2-5 on page 33. Listing A.4 is the main quantum Fourier transform generation script, discussed in section 5.4.3. This produces circuits that are structured to assist the compiler in inferring an efficient tensor product representation. An example of such a circuit (for four qubits) is given in listing A.5. Notice that the operations on the various qubits are interleaved as much as possible.
    #!/usr/bin/perl
    # genqft.pl Geva Patz, MIT Media Lab, 2003
    # Generate circuit description for N-bit quantum Fourier transform
    # (optimized for compilation)

    use POSIX;

    ($bits) = @ARGV;
    die "Usage: $0 <bits> (2 <= bits <= 25)\n" unless $bits>1 && $bits<26;

    print "# autogenerated by $0 $bits\n\n";

    # Define operations
    $pi4 = POSIX::acos(0)*4; # 2*pi
    print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
    for $j (2..$bits) {
        # Controlled rotation by $j
        print "define r_$j(2) [1.0 0.0 0.0 0.0\n";
        print "               0.0 1.0 0.0 0.0\n";
        print "               0.0 0.0 1.0 0.0\n";
        print "               0.0 0.0 0.0 ";
        # exp(2*pi*i / 2^j)
        printf "%f:%f];\n", cos($pi4/(1<<$j)), sin($pi4/(1<<$j));
    }

    print "size $bits;\n\n";

    $step[0]=1;
    for $j (1..$bits-1) { $step[$j] = 0; }

    # We reverse the bits at the beginning, and adjust indices below
    # as needed. This puts the computation in the 'leading identity
    # matrix' form (could let the compiler do this)
    print "perm(";
    for ($i=$bits-1;$i>=0;$i--) { print "$i", $i?',':')'; }
    print "\n";

    for ($col=1;$step[$bits-1]<2;$col++) {
        print "\n# column $col\n";
        for $j (0..$bits-1) {
            if ($step[$j]==1) {
                print "h(", $bits-$j-1, ");\n";
            } elsif ($step[$j] && $step[$j]<=($bits-$j)) {
                print "r_", $step[$j], "(", $bits-($j+$step[$j]), ", ";
                print $bits-$j-1, ");\n";
            }
        }
        $step[0]++;
        for $j (1..$bits-1) { $step[$j]++ if $step[$j-1]>2; }
    }

Listing A.4: Efficient quantum Fourier transform circuit generation script

    # autogenerated by ./genqft-prt.pl 4

    define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];
    define r_2(2) [1.0 0.0 0.0 0.0
                   0.0 1.0 0.0 0.0
                   0.0 0.0 1.0 0.0
                   0.0 0.0 0.0 0.000000:1.000000];
    define r_3(2) [1.0 0.0 0.0 0.0
                   0.0 1.0 0.0 0.0
                   0.0 0.0 1.0 0.0
                   0.0 0.0 0.0 0.707107:0.707107];
    define r_4(2) [1.0 0.0 0.0 0.0
                   0.0 1.0 0.0 0.0
                   0.0 0.0 1.0 0.0
                   0.0 0.0 0.0 0.923880:0.382683];
    size 4;

    perm(3,2,1,0)

    # column 1
    h(3);

    # column 2
    r_2(2, 3);

    # column 3
    r_3(1, 3);
    h(2);

    # column 4
    r_4(0, 3);
    r_2(1, 2);

    # column 5
    r_3(0, 2);
    h(1);

    # column 6
    r_2(0, 1);

    # column 7
    h(0);

Listing A.5: Example efficient circuit output for 4 qubits

For comparison, listing A.6 generates equivalent quantum Fourier transform circuits, but with a simple structure that makes no attempt at optimization for the compiler. Each qubit's operations are generated sequentially, as illustrated in the four-qubit example output in listing A.7. The effects of this less efficient representation on simulation time are discussed in section 5.4.4.

    #!/usr/bin/perl
    # genbadqft.pl Geva Patz, MIT Media Lab, 2003
    # Generate circuit description for N-bit quantum Fourier transform
    # (without regard for compiler-friendly optimization)

    use POSIX;

    ($bits) = @ARGV;
    die "Usage: $0 <bits> (2 <= bits <= 25)\n" unless $bits>1 && $bits<26;

    print "# autogenerated by $0 $bits\n\n";

    # Define operations
    $pi4 = POSIX::acos(0)*4; # 2*pi
    print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
    for $j (2..$bits) {
        # Controlled rotation by $j
        print "define r_$j(2) [1.0 0.0 0.0 0.0\n";
        print "               0.0 1.0 0.0 0.0\n";
        print "               0.0 0.0 1.0 0.0\n";
        print "               0.0 0.0 0.0 ";
        # exp(2*pi*i / 2^j)
        printf "%f:%f];\n", cos($pi4/(1<<$j)), sin($pi4/(1<<$j));
    }

    print "size $bits;\n";

    for $i (0..$bits-1) {
        print "\n# Bit $i\n";
        print "h($i);\n";
        for $j (2..$bits-$i) {
            print "r_$j(", $i+$j-1, ",", $i, ");\n";
        }
    }

    # Final bit reversal
    print "\nperm(";
    for ($i=$bits-1;$i>=0;$i--) { print "$i", $i?',':')'; }
    print "\n";

Listing A.6: Unoptimized quantum Fourier transform circuit generation script

Compare the following output to the efficient output in listing A.5.
# autogenerated by ./genbadqft-prt.pl 4
define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];
define r_2(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.000000:1.000000];
define r_3(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.707107:0.707107];
define r_4(2) [1.0 0.0 0.0 0.0
               0.0 1.0 0.0 0.0
               0.0 0.0 1.0 0.0
               0.0 0.0 0.0 0.923880:0.382683];
size 4;

# Bit 0
h(0);
r_2(1,0);
r_3(2,0);
r_4(3,0);

# Bit 1
h(1);
r_2(2,1);
r_3(3,1);

# Bit 2
h(2);
r_2(3,2);

# Bit 3
h(3);
perm(3,2,1,0)

Listing A.7: Example unoptimized circuit output for 4 qubits

A.3 3-SAT problem generator for adiabatic evolution

The code in this section is used in simulating the solution of 3-SAT by adiabatic evolution, as described in section 5.5. Listing A.8 generates the basic specification for an N-qubit simulation of this kind. An example of the output code, for three qubits, is given in listing A.9. In order to execute a simulation of an instance of 3-SAT, we need an additional set of input data specifying the Hamiltonian corresponding to the particular instance we wish to simulate. Listing A.10 is a C program to generate random instances of 3-SAT of a given bit size.

#!/usr/bin/perl
# genadi.pl  Geva Patz, MIT Media Lab, 2003
# Generate circuit description for N-qubit adiabatic evolution
# simulation (use companion 'genham' Hamiltonian generator to
# generate specific problem Hamiltonian)

($bits) = @ARGV;
die "Usage: $0 <bits> (2 <= bits <= 25)\n" unless $bits>1 && $bits<26;

print "# autogenerated by $0 $bits\n\n";
print "define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];\n";
print "\nsize $bits;\n";

# Run for N steps; change second index of i to run for alternate
# number of steps
for $i (1..$bits) {
    print "\n# Step $i\n";
    for $j (1..$bits) {
        print "h(", $j-1, ");\n";
    }
    print "F0[$i];\n";
    for $j (1..$bits) {
        print "h(", $j-1, ");\n";
    }
    print "Ff[$i];\n";
}

Listing A.8: Simulation code generator for 3-SAT by adiabatic evolution

# autogenerated by ./genadi-prt.pl 3
define h(1) [%(0.5) %(0.5) %(0.5) -%(0.5)];

size 3;

# Step 1
h(0);
h(1);
h(2);
F0[1];
h(0);
h(1);
h(2);
Ff[1];

# Step 2
h(0);
h(1);
h(2);
F0[2];
h(0);
h(1);
h(2);
Ff[2];

# Step 3
h(0);
h(1);
h(2);
F0[3];
h(0);
h(1);
h(2);
Ff[3];

Listing A.9: Example adiabatic evolution code for 3 qubits

/* genham.c: Generate (diagonal) Hamiltonian representing a random
 * N-bit instance of 3-SAT
 * Geva Patz, MIT Media Lab, 2002 (updated 2003)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <limits.h>

#define MAXCLAUSE (1<<22)
#define bailifnull(x,y) if ((y)==NULL) \
    {fprintf(stderr,"Can't allocate %s",(x)); exit(-1);}

int binval(unsigned int *a)
{
    int b;
    b = (((a[3]&0x04)>>2) * (1<<(a[2]-1)) +
         ((a[3]&0x02)>>1) * (1<<(a[1]-1)) +
         (a[3]&0x01)      * (1<<(a[0]-1)));
    return b;
}

int main(int argc, char *argv[])
{
    unsigned char *valid;
    unsigned int *cnt;
    unsigned int *clauses;
    char bits;
    unsigned int i, j, k, flag, numvalid;
    unsigned int alltrue, xor[3], or[3];
    unsigned int clause;
    struct timeval tv;
    struct timezone tz;
    if (argc!=2) {
        fprintf(stderr, "Usage: %s <bits>\n", argv[0]);
        exit(-1);
    }
    bits = atoi(argv[1]);
    if (bits<3 || bits>25) {
        fprintf(stderr, "Number of bits must be >= 3 and <= 25\n");
        exit(-1);
    }
    if (argc>2) printf("%d\n", bits);

    gettimeofday(&tv, &tz);
    srand((int)(tv.tv_usec & INT_MAX));

    valid = (unsigned char *)malloc((1<<bits)*sizeof(unsigned char));
    bailifnull("valid", valid);
    cnt = (unsigned int *)malloc((1<<bits)*sizeof(unsigned int));
    bailifnull("cnt", cnt);
    clauses = (unsigned int *)malloc(4*MAXCLAUSE*sizeof(unsigned int));
    bailifnull("clauses", clauses);

    do {
        alltrue = (1<<bits)-1;
        clause = 0;
        numvalid = 1<<bits;
        for (i=0; i < 1<<bits; i++) {
            valid[i] = 1;
            cnt[i] = 0;
        }
        while (numvalid>1 && clause<MAXCLAUSE*4) {
            clauses[clause] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            do {
                clauses[clause+1] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            } while (clauses[clause+1] == clauses[clause]);
            do {
                clauses[clause+2] = 1+(int)((bits*1.0)*rand()/(RAND_MAX+1.0));
            } while (clauses[clause+2] == clauses[clause] ||
                     clauses[clause+2] == clauses[clause+1]);
            clauses[clause+3] = (int)(8.0*rand()/(RAND_MAX+1.0));

            for (j=0; j<3; j++) {
                xor[j] = (clauses[clause+3] & 1<<j) ? 1<<(clauses[clause+j]-1) : 0;
                or[j]  = alltrue ^ (1<<(clauses[clause+j]-1));
            }
            numvalid = 0;
            for (j=0; j<=alltrue; j++) {
                flag = 0;
                for (k=0; k<3; k++) {
                    if (((j^xor[k]) | or[k]) == alltrue) flag = -1;
                }
                if (!flag) {
                    valid[j] = 0;
                    cnt[j]++;
                }
                if (valid[j]) numvalid++;
            }
            clause += 4;
        }
    } while (!numvalid);

    /* Output instance data */
    for (i=0; i<clause; i+=4)
        printf("%d %d %d %d\n", clauses[i], clauses[i+1], clauses[i+2],
               clauses[i+3]);
    /* End of instance delimiter */
    printf("0 0 0 -1\n");
    /* Output problem Hamiltonian */
    for (i=0; i<1<<bits; i++)
        printf("%d\n", cnt[i]);
    exit(0);
}

Listing A.10: Random instance generator for 3-SAT by adiabatic evolution