>> Krysta Svore: Today, the group has Dmitri Maslov visiting. So we're excited to have Dmitri here. Dmitri received his degree in mathematics from Moscow State University in Russia in 2000 and received his Master's and Ph.D. degrees in computer science from the University of New Brunswick, Canada, in 2002 and 2003. Currently, Dmitri is working as a program director in the Division of Computing and Communication Foundations, Directorate for Computer and Information Science and Engineering at NSF, and is also an adjunct faculty member at the University of Waterloo in the Department of Physics and Astronomy and also the Institute for Quantum Computing, which is there. His research interests include quantum circuits and architectures, quantum algorithms, quantum information processing, logic synthesis, reversible logic and circuit complexity. Today, we'll hear about synthesis for reversible circuits with applications to quantum circuits. So we're excited to have Dmitri. Let's welcome Dmitri.

>> Dmitri Maslov: Thank you for the introduction. So I right away found an error. It's on the title slide; I forgot to change it. So hopefully, for the rest of the talk, there will be no other corrections. Okay.

So what I would like to talk about today is different kinds of optimal searches for the implementations of different kinds of functions. So we'll talk about optimal synthesis of reversible circuits, optimal synthesis of multiple-qubit unitaries — the truly quantum circuits, in other words — and optimal synthesis of single-qubit unitaries. The optimal synthesis of single-qubit unitaries is actually highly relevant to the work done by Alex and Krysta. Our result we developed independently of Alex and Krysta, but it seems as if we developed the same thing looking at it from two different angles. I'll talk about it, and we'll see what the differences are. The good thing is we just heard Alex's talk, so it's still fresh in the mind. So hopefully, the differences will be evident.

So I wanted to show a few gates: the NOT gate, the CNOT gate, the Toffoli gate, and the Toffoli with more controls. So, for example, this Toffoli has two controls. It computes the exclusive-OR of the target with the Boolean product of variables X and Y. The Toffoli-4 has three controls, so here it will compute the exclusive-OR of this variable with the Boolean product of the three variables above.

So when we talk about circuits, the difference between reversible and quantum circuits and the classical circuits is that we do not allow fan-outs and we do not allow feedback. Because the way it works in the quantum and reversible case is that the circuits aren't physical; they describe the time evolution. So having a fan-out would correspond to duplicating the particle, and due to energy and mass conservation, we probably cannot do it. And feedback would be travel back in time, which is also a bad idea, probably. I mean, there are closed timelike curves where you can probably travel back in time, but then the complexity implications are such that probably it is still impossible, even if the history of traveling back in time were consistent. Anyway, that is not the point of the talk.

So the quantum gates: the Hadamard gate, we already saw it earlier today, is a 2 by 2 matrix. So I'd like to illustrate how the Hadamard gate works on this two-dimensional picture, which you can also think of as a three-dimensional picture. This is the ket zero state and, perpendicular to it, the ket one state.
So what the H gate, the Hadamard, does is reflect the state about this axis. Algebraically, if we apply the H gate to ket zero, which is the vector (1, 0), then the result of the computation is this state. And, indeed, on the picture we can see that if we take this vector and reflect it with respect to this line, then we get this vector, and this is exactly what H of zero is. So it's a reflection of zero. For completeness, let's verify it for ket one. So H applied to ket one algebraically is going to be this expression. And in terms of the picture, we take this vector, we reflect it with respect to this line, and this is the result. And we can see that the result actually corresponds to what we computed algebraically: the coordinates one over square root of two, negative one over square root of two. So the H gate is fairly nice and easy to visualize.

The other gates that I'm going to use in this talk include the T gate. Unlike Alex, I prefer to actually write it down as diag(1, ω), where ω is the eighth root of unity. This is how I will denote it. I will actually use the eighth root of unity throughout the talk, so I'd rather introduce it right away. We already discussed the reasons to use the T gate: we need it because of fault tolerance and because we can actually physically construct it, unlike other gates where it is not necessarily as obvious how to construct them; as such, they need to be implemented as circuits over what we know how to construct.

So the phase gate — you can think of it as T squared. When you look at the matrix, this matrix squared is the square of its entries, so the only thing that changes is ω, the eighth root of unity, which changes to the fourth root of unity and becomes i. So the phase gate is one, zero, zero, i. In practice, we do not implement the phase gate as a sequence of two Ts. That would be correct, but it would be way too expensive, because if one were to estimate the relative cost of the T gate with respect to the phase gate in some sort of physical cost, then if the T gate were associated with, say, a hundred, the phase gate would cost about four. So the direct implementation is much, much cheaper.

The Zed gate, again, we can think of it as P squared, or T to the power of four. And when we raise the elements of this matrix to the power of four, this stays one, zero, zero, and this becomes negative one. So again, we don't implement Zed as a circuit with two Ps, because there is a direct implementation of Zed, and I believe its physical cost, compared to T, is about one on the scale that I introduced. So the T gate costs a hundred, the phase gate costs about four, and Zed costs about one. So it's much, much cheaper.

So why we consider those gates — here are a few reasons. Using the library with CNOT gates, we can compute linear reversible functions. Those functions are quite important because, for example, when we talk about Clifford circuits, linear reversible functions, due to Aaronson and Gottesman, are the most difficult part of the Clifford circuits. In fact, the simulability of Clifford circuits has as its bottleneck the simulation of linear reversible circuits. So the efficiency of linear reversible circuits determines the efficiency of Clifford circuits. Now, the Clifford circuits. We don't technically need the Zed gate here, but I included it anyway; Zed is just P squared, so we could avoid it completely.
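[For reference, a compact summary of the single-qubit gates just introduced, written with the eighth root of unity as in the talk. These are the standard matrix definitions; the relative cost figures quoted above are the speaker's rough estimates and are not part of the math.]

```latex
\omega = e^{i\pi/4}, \qquad
H = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
T = \begin{pmatrix} 1 & 0 \\ 0 & \omega \end{pmatrix}, \qquad
P = T^2 = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}, \qquad
Z = P^2 = T^4 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},
```
```latex
H\,|0\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle + |1\rangle), \qquad
H\,|1\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle - |1\rangle).
```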
This library is overspecified. The Clifford circuits are important because they're required in error correction, and in quantum computing, we will likely need to correct the errors — more likely so than in classical computing. Well, in classical computing, we started by studying the error-correcting codes, and then it turned out that for computational purposes, we don't need those. It looks like in quantum, we actually will need those. So Clifford circuits are very important.

So T plus CNOT. This is an interesting class. This set is actually discrete. It contains a kind of linear reversible circuits modified by Ts. This is an interesting set. So Toffoli: you can also add CNOT to the Toffoli, but in a sense, if you have a single qubit residing in state one, then the CNOT is a Toffoli using this single qubit. So the Toffoli gate gives us any reversible function. When we have the library with H and T gates, Hadamard and T, then we can implement any single-qubit gate. And by that I mean approximate, because there are infinitely many unitaries, so any discrete set of gates is not enough to implement any unitary from an infinite set directly. But we can efficiently approximate those. And finally, Clifford plus T gives all unitaries via approximation.

Okay. So the first problem I'm going to discuss — contrary to the advertisement in the abstract, it made more sense to me to invert the order of the things I'm going to discuss and start with the reversible circuits, because it is simpler. The problem that we formulated is that we want to synthesize optimal circuits, and optimality means minimal resources, such as gate count or depth. What else can we do? We can do weighted gate count. We can do weighted depth, restricted architectures. I mean, there are lots of different metrics that one could use.

So why do we want to have optimal implementations? Well, one of the reasons: it could be a useful library for those physicists trying to set up a certain experiment while having limited control over their quantum mechanical system. So, for example, if your apparatus allows you to apply ten gates, and you look up the unitary or the transformation in the library and it turns out that the optimal circuit has 12, then no, you cannot implement it. You have to gain more control over the quantum mechanical system before you can do something that complex.

So perhaps a bigger reason is the peephole optimization methods that are currently an important part of any reasonably sophisticated compiler. Once you have a library of optimal implementations and you want to optimize a quantum circuit, you just cut out small subcircuits, compare those subcircuits to the optimal circuits stored in the library, and see if it is beneficial to do a substitution. A trivial idea, but very practical.

And finally, the mathematical curiosity: computing the value of the Shannon complexity function. In other words, what is the number of gates required to implement a function of a certain size? So in 2002, Shende, Markov, and maybe somebody else — I don't remember all the authors — showed that reversible functions with three inputs require no more than eight gates. So we improved on this by looking at the four-bit reversible functions. We show that the number of gates required is 15. I do not believe that L of 5 can be computed, in my lifetime, anyways, even if the scaling continues, even if Moore's law continues to hold. The problem is too complex.
>>: [inaudible].

>> Dmitri Maslov: I'm sorry?

>>: With the progression, we can see --

>> Dmitri Maslov: Yeah, but can you prove that? You can probably guess, but can you prove it?

So a reversible four-bit function is a permutation of the four-bit integers, and it may be computed using those reversible gates I introduced: the NOT, CNOT, Toffoli and Toffoli-4. The reason I need a Toffoli-4: for completeness, I need an odd permutation, and the other gates are even permutations. So I just need something that is odd. We want to synthesize optimal circuits. And what was done previously: in 2003 — on the previous slide I said 2002; 2002 was their conference paper, here I'm referring to the journal paper — they showed optimal reversible circuits for three-bit reversible functions, and there are two to the power of three, factorial, of them. So the factorial is applied to two to the power of three; that is eight factorial, which is 40,320. Then in year 2006, again Markov's group at the University of Michigan tried to synthesize four-bit reversible circuits, and they were able to synthesize 26 million of them. The number of four-bit reversible functions is two to the power of four, factorial, which is 16 factorial. It's a 13-digit number, and 26 million is an eight-digit number. So there are quite a few more digits to go.

So we introduced two searches. First let me explain how to synthesize a given function, and then I'll explain how to synthesize all functions. In a sense, if you can synthesize all functions, you can synthesize any one function, because you just run the search until you've found the one that you want. But the search is very long, so it's not practical. And conversely, in principle, if you can synthesize any one function, you can synthesize all of them by waiting long enough. But then again, that also turns out to be non-practical. So we have developed two algorithms.

Now I'll explain how to synthesize any given function. We use breadth-first search to find all optimal circuits with at most L-over-two gates. We choose L-over-two to be a high enough number to guarantee that our algorithm works; in our particular case, L-over-two is nine, and L-over-two has to be at least half the number of gates needed to implement any reversible function of four inputs. Once we've found all circuits with at most L-over-two gates, an optimal circuit for any function can be found by searching this big database of the halves of all optimal circuits. In our experiment, it turned out that the number of circuits that we store is this number: this is the number of circuits with up to nine gates.

Of course, if you just have this database and you search it twice for the two halves, the complexity becomes too high. It becomes S to the power of two, and the S is this long number I gave on the previous slide. So it is not efficient; we need something more efficient. And this is what it says: if we just search for the two halves, this is how many Boolean operations we need, and we can do ten to the power of nine in a second. So this number becomes too large to be practical. So what we do instead is hash the table of circuits. We use the Thomas Wang hash function, for those familiar with hash functions for permutations. If not, it's just a hash function. I don't know why it's efficient, but it is. So with S log S, we have approximately this many Boolean operations. I mean, of course, I'm kind of cheating here; it's not exactly this many operations. But it's just to show you that this number looks much shorter than that number. So the difference -- yeah.
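[To make the meet-in-the-middle search concrete, here is a hedged Python sketch over a toy gate set: a plain dict stands in for the Thomas Wang hash table, the gate library is deliberately incomplete, and all names are illustrative rather than taken from the actual software.]

```python
from itertools import product

N = 16  # a four-bit reversible function is a permutation of {0, ..., 15}
IDENTITY = tuple(range(N))

def not_gate(t):
    """NOT on bit t."""
    return tuple(i ^ (1 << t) for i in range(N))

def cnot(c, t):
    """CNOT: flip bit t when bit c is set."""
    return tuple(i ^ (1 << t) if i & (1 << c) else i for i in range(N))

def toffoli(c1, c2, t):
    """Toffoli: flip bit t when bits c1 and c2 are both set."""
    return tuple(i ^ (1 << t) if (i & (1 << c1)) and (i & (1 << c2)) else i
                 for i in range(N))

def compose(f, g):
    """(f . g)(x) = f(g(x)): the circuit applies g first, then f."""
    return tuple(f[g[x]] for x in range(N))

def inverse(f):
    inv = [0] * N
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)

# Toy gate library; the real search uses all Toffoli placements and Toffoli-4.
GATES = ([not_gate(t) for t in range(4)]
         + [cnot(c, t) for c, t in product(range(4), repeat=2) if c != t]
         + [toffoli(0, 1, 2), toffoli(1, 2, 3)])

def build_halves(max_gates):
    """BFS over circuits, keeping one shortest circuit per reachable permutation."""
    table = {IDENTITY: []}
    frontier = [IDENTITY]
    for _ in range(max_gates):
        next_frontier = []
        for perm in frontier:
            for gi, g in enumerate(GATES):
                new = compose(g, perm)
                if new not in table:
                    table[new] = table[perm] + [gi]
                    next_frontier.append(new)
        frontier = next_frontier
    return table

def synthesize(f, table):
    """Meet in the middle: look for f = A . B with both halves in the table.
    Scanning the table for B costs |table| steps; recovering A = f . B^(-1)
    is a single hash lookup, which is the point of hashing the circuits."""
    best = None
    for b_perm, b_circ in table.items():
        a_circ = table.get(compose(f, inverse(b_perm)))
        if a_circ is not None:
            cand = b_circ + a_circ  # B applied first, then A
            if best is None or len(cand) < len(best):
                best = cand
    return best  # None if max_gates is too small to cover half of f

# Example: a two-gate function is found immediately with a toy bound of two
# gates per half (the real database stores circuits with up to nine gates).
table = build_halves(2)
print(synthesize(compose(GATES[4], GATES[0]), table))
```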
>>: What is the use of hashing?

>> Dmitri Maslov: So, the use of hashing. You need to find two halves, right? To find one half, you need a number of operations equal to the size of the database. But to find the second half, we use a hash table. We calculate the hash key and we query the database for whether it has any values with this hash. And that is only polylog in the size of the database.

>>: So on the other half, which gates does the hash help you to look at?

>>: [inaudible].

>> Dmitri Maslov: Yeah, so --

>>: Oh, okay.

>> Dmitri Maslov: Oh, you got it? Yeah. All right.

So finally, we actually do one more reduction, which, in theory, may be not as impressive as it is in practice. We say that if we have an optimal circuit, then it is still optimal with respect to input/output relabeling, of which there are four factorial: there are four variables, we can permute them in any order, and four factorial is 24. And with respect to function inverse: if we have an optimal circuit, then the inverse of this circuit is also an optimal circuit. So we can pack almost 24 times two, being 48, circuits into one circuit, because they're all, in a sense, identical.

So here is a recap. We started with a solution where basically we used breadth-first search that requires order of N space and order of N time complexity, where N is the number of functions — it's 16 factorial. We first reduce it to order of roughly square root of N; this is a little bit of cheating, this is a lower bound and by no means an upper bound. We can calculate an upper bound, but we believe that the upper bound is actually much lower than the order of N. And what we do is synthesize only the halves of all optimal circuits. So then the run time reduces to order square root of N, but we have to search for two halves, and the complexity of that search is the square of the size of the database. Then we introduced a hash table and reduced the time complexity to square root of N times polylog of square root of N. It's actually a log versus polylog — not a big difference, as long as we have the square root in front and a small constant, because it has to be practical. And we reduce, furthermore, the space complexity by a factor of 48. It turns out that a random reversible function of four variables has no symmetries, more often than not, so the actual, practical factor is 47.95, I think, which pretty much says that only in one in a hundred cases are there any symmetries whatsoever.

So this is what the distribution of reversible functions requiring a certain number of gates looks like. The reduced functions are what we actually store in the database; this column is the actual functions. So what this table says is that there are 4.8 million different reversible functions requiring five gates in an optimal implementation. We then experimented with ten million random functions. On the input, we have ten million random functions fed one by one, and on the output, the software produces an optimal implementation. In principle, there could be more than one optimal circuit, but as soon as we find one, we say, well, there you go, it is optimal.

So what is surprising about this search is the time it takes to find an optimal implementation, being 0.07 seconds per circuit. It is incredibly fast. I almost could not believe how fast it is.
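[Returning to the factor-48 reduction mentioned in the recap: a minimal sketch of packing the symmetric variants of a four-bit function into one canonical representative. The relabeling convention is an assumption of this sketch; as the talk notes later, the real indexable construction is more subtle.]

```python
from itertools import permutations

def inverse(f):
    inv = [0] * 16
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)

def relabel(f, pi):
    """Conjugate a 4-bit permutation f by a wire relabeling pi
    (pi maps old wire index -> new wire index). Relabeling inputs and
    outputs consistently maps optimal circuits to optimal circuits."""
    def apply_pi(x):
        y = 0
        for old in range(4):
            if x & (1 << old):
                y |= 1 << pi[old]
        return y
    g = [0] * 16
    for x in range(16):
        g[apply_pi(x)] = apply_pi(f[x])
    return tuple(g)

def canonical(f):
    """Lexicographically smallest of the up-to-48 symmetric variants:
    24 wire relabelings, times {the function, its inverse}. Storing only
    canonical representatives is what packs almost 48 circuits into one."""
    f_inv = inverse(f)
    return min(relabel(h, pi)
               for pi in permutations(range(4))
               for h in (f, f_inv))
```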
So if you think of, you know, having the answer being all optimal circuits stored on a hard drive without any reductions, then this hard drive would have to be at least a hundred terabytes. If it were a 5,400 RPM hard drive, then the expected time to extract a single bit of data from such a hard drive would be 0.01 seconds — and a circuit is many bits. So ours is faster: it's 0.07 seconds per whole circuit. So it is surprisingly fast.

Okay. So now to synthesizing all functions. Well, we could not come up with anything very smart, so what we did is a breadth-first search. What we did is compose an indexable set of four-bit permutations while taking the input/output relabeling and the inverse symmetries into account as much as possible. In practice, what we were able to do is reduce 16 factorial, the number of all functions, by a factor of approximately 12 to 13. Not the full 48, because with both symmetries included, relabeling and inverse, we could not come up with an easy-to-compute index. And we wanted to have an index into a bit vector so that we could update the bit according to the function value, and we could calculate the function from the index very quickly. That was the requirement. Then what we did is run the depth-first search over circuits, setting those bits in the bit vector for which the function is implementable with an optimal circuit of a certain size. The bit vector that we used is 209 gigabytes in size. We used 128-gigabyte machines, so in order to fit that vector into RAM, we needed to break it into three parts. So we actually ended up running the calculation on nine nodes of a cluster, [indiscernible] cluster, and the total run time to calculate all optimal circuits is about one million seconds, which amounts to about 13 days.

So here is the complete distribution of the number of gates required in optimal implementations of the four-bit reversible functions. Note that there are 144 functions of size 15. And interestingly, if you remove the symmetries, there will be only five, meaning one has no symmetries, and four more have two symmetries.

Okay. So now I'll talk about multiqubit circuits. What we do in this project is try to synthesize optimal implementations for some small useful transformations. For example, the Toffoli gate. I mean, we definitely need the Toffoli gate, because it seems to be used everywhere you look. It is used in Shor's integer factoring algorithm for the reversible computation part. It is used in fault tolerance for the syndrome detection. It's used everywhere. There is a well-known implementation of the Toffoli gate using the library of CNOT, Hadamard, phase and T — and the reason we use this library of gates is due to the restrictions imposed by error correction — but we wanted to check if this implementation is optimal, and see if we can find optimal implementations for some other small functions.

So again, similarly to the previously described project, we compose a library of depth-optimal circuits. This time, we decided to look at depth, but it could just as well be any other cost function. Whichever cost function you want to use, we can use in the software, as long as it doesn't have too many values. So we search the library of halves to find an optimal implementation and output the circuit.
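[Since depth is the cost function here, a small illustrative Python sketch of one common convention for computing overall depth and T-depth from a gate list. This is generic code, not the group's software; the gate names and the greedy layering are assumptions of the sketch.]

```python
def depths(circuit, n_qubits):
    """Overall depth and T-depth of a circuit given as (gate_name, qubits)
    pairs. Gates are packed greedily: each gate starts on the earliest
    layer where all its qubits are free; T-depth counts the layers that
    contain a T or T-dagger. One common convention, not the only one."""
    busy_until = [0] * n_qubits   # next free layer for each qubit
    t_layers = set()
    depth = 0
    for name, qubits in circuit:
        layer = max(busy_until[q] for q in qubits) + 1
        for q in qubits:
            busy_until[q] = layer
        depth = max(depth, layer)
        if name in ("T", "Tdag"):
            t_layers.add(layer)
    return depth, len(t_layers)

# Example: the two leading T gates act on distinct qubits and share a layer.
circ = [("T", [0]), ("T", [1]), ("CNOT", [0, 1]), ("Tdag", [1])]
print(depths(circ, 2))  # (3, 2): overall depth 3, T-depth 2
```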
So the interesting part here is the results. For example, speaking of the Toffoli gate, we were able to find an implementation with overall depth 8. I believe the one in the Nielsen and Chuang book has an overall depth of about 13 or 15, somewhere along these lines. Of course, here the key depth is the T depth, as T is likely the more complex gate. It is not necessarily the case, because, for example, if you consider a Hamiltonian used in a liquid-state system, the T gate is the least expensive of all of these gates in a non-fault-tolerant, direct implementation. But if you want to do something fault tolerantly, scalably, then T has to be implemented differently; it's a logical gate. So the T depth here is four, which sort of improves on what Nielsen and Chuang had, because they had a T depth of, I believe, seven, but via playing with the circuit a little bit it could be reduced to six. So that's what we reduce from.

>>: [inaudible].

>> Dmitri Maslov: What's that? Excuse me?

>>: Actually, the T depth is not five there.

>> Dmitri Maslov: Yeah, because you can execute this and this in parallel, by executing first stage, second, third, fourth, and then this is fifth and so on.

>>: Okay.

>> Dmitri Maslov: So it's four, but you have to kind of rewrite it a little bit. So then we found this circuit with T depth three. It's not written here very well, but this is one layer of Ts, this is the second layer of Ts, and this is the third layer of Ts, which gives the T depth three.

So what we have also done: we compared our synthesis algorithm to the one we were able to find online, and pretty much the only one available is Dawson's implementation of the Solovay-Kitaev algorithm. The great thing about it — I mean, it leaves lots of room for improvement, but the great thing about it is that it actually works for multiqubit circuits, which is great. So this is the scaling of Dawson's implementation compared to our implementation. As you can see, we can go further in depth than Dawson's. But in a way, it's not the best comparison, because if we use Dawson's code with four levels of recursion to generate the controlled H gate, we come up with an approximation that is 0.34 distance away from the controlled Hadamard, which takes about two minutes, and the implementation has about a thousand gates, whereas we synthesized an exact circuit. Our exact circuit has seven gates, and it took 0.5 seconds to synthesize. But, you know, we're not claiming we're, like, a hundred times better, because we're not comparing apples to apples here. We're comparing apples to oranges. But we only had oranges to compare to, and we wanted to make a comparison of sorts.

Another interesting thing we tried: here is an optimal implementation of the controlled P. Note that the T depth is two. However, if you consider the controlled P as an operation on three qubits, then this circuit is an optimal circuit implementing the controlled P, and note that the T depth here is actually one.

So we were surprised by this phenomenon. Then we were able to prove the following theorem. Basically, what the theorem says is that if we have a circuit with CNOT plus T gates, we can parallelize it to any T depth you want, subject to how much ancilla you can give me. If you can give me sufficiently many, depth one I can do. If you don't give me as much as I want, maybe depth two. If you give me only a few, then the depth will be maybe closer to the number of T gates. So this is formally the statement of the lemma, of the theorem. This number can actually be improved, but I'm not going to talk about that much.
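[A toy sketch of the observation behind this parallelization, under stated assumptions: in a {CNOT, T} circuit every wire carries a GF(2)-linear combination of the inputs, so each T gate contributes a phase on some linear function; if each distinct function is computed into its own ancilla by CNOTs, all the T gates fit into a single layer. The code only extracts the phase terms; names and encoding are illustrative, not the paper's construction.]

```python
def t_phase_terms(circuit, n_qubits):
    """For a {CNOT, T} circuit, track the GF(2)-linear function (a bit
    mask over the inputs) carried by each wire, and record on which
    linear function every T gate acts. Ancilla count aside, all recorded
    T gates can then be applied in one layer: CNOT-compute each distinct
    function into an ancilla, apply the Ts in parallel, uncompute."""
    wires = [1 << q for q in range(n_qubits)]  # wire q initially holds x_q
    terms = {}                                 # linear function -> T count mod 8
    for name, qubits in circuit:
        if name == "CNOT":
            c, t = qubits
            wires[t] ^= wires[c]               # x_t <- x_t xor x_c
        elif name == "T":
            f = wires[qubits[0]]
            terms[f] = (terms.get(f, 0) + 1) % 8   # T^8 = identity
    return terms, wires

# Example: the two T gates act on x0 and on (x0 xor x1); with one ancilla
# each, T-depth 1 suffices regardless of the original ordering.
circ = [("T", [0]), ("CNOT", [0, 1]), ("T", [1]), ("CNOT", [0, 1])]
print(t_phase_terms(circ, 2))  # ({1: 1, 3: 1}, [1, 2])
```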
Still, it's interesting that any CNOT plus T circuit can be optimized to T depth one. I have to also say that part of this research — the theorem and some of the follow-up things that we're currently doing — was also inspired by discussions with Peter Selinger, who showed us a T-depth-one, optimal Toffoli implementation. Well, it's obviously optimal, because there cannot be anything better than T depth one. But it uses four ancilla.

Another thing we tried: we took the Clifford gates and we took the T gate, and we looked at their controlled versions, and it turned out that all the controlled versions can be implemented. In fact, we implemented them using our algorithm; we implemented them optimally. So we then realized that if a unitary U can be implemented exactly over the Clifford plus T library, then so can the controlled version of U, the controlled U. And if the vector of costs for the unitary U is — well, the first coordinate is the number of Hadamards used, the second is the number of phase gates used, the third is the number of CNOT gates used, and the fourth is the number of T gates used — then to calculate the number of Hadamard, phase, CNOT and T gates for the controlled U, what you need to do is multiply this column vector by this matrix.

>>: Do you have some [indiscernible] as to how you get to this here?

>> Dmitri Maslov: So what we needed to get this theorem is that we took the controlled H and implemented it optimally. And actually, if you look at this matrix, you can sort of see what happens. So the controlled H, when implemented optimally, requires two Hadamard gates, zero phase gates, two CNOT gates, and four Ts. Okay, so probably I read it incorrectly. Let me think for a second. We multiply this matrix by this column vector. So --

>>: Is that columns?

>> Dmitri Maslov: Yeah, they're columns, yeah. So I had to read the column instead, not the row. So you need two Hadamard gates for a controlled Hadamard, you need two phase gates, you need one CNOT and two Ts, and that gives you a controlled Hadamard. And controlled P is described by this vector. It's funny how controlled P does not require P; it requires only Ts and CNOTs. But in a way, T is the square root of P, so it makes total sense, because there is a direct relation between the square root of a gate and its controlled version. It's in Nielsen and Chuang, page 182. One of my favorite circuits; I remember it.

>>: So how do you get to the point where you prove that lambda U can be implemented exactly?

>> Dmitri Maslov: Since U can be implemented exactly, we take the circuit for U and we control every single gate. If we control every single gate, it is equivalent to controlling the whole transformation. And for each controlled gate, we found its optimal implementation, and we substitute the controlled gates with their implementations. Yeah, uh-huh.

Okay. So now to the single-qubit unitaries, finally. Here is the original motivation for our research: what we actually tried to accomplish, before we wrote what we wrote, was to study whether a unitary U can be implemented exactly in the Clifford plus T basis. We obviously want an exact implementation, and the reason for that is there are enough errors in quantum algorithms already. Firstly, there are algorithmic errors: you know, Shor's algorithm does not guarantee you the answer with probability one. It's not a [indiscernible] algorithm that always guarantees that when you multiply two integers, you actually do get the result.
But it's a randomized algorithm, in a sense. Not randomized — it's quantum. But you get the answer with some probability. We have errors due to decoherence, and that's the reason why we have the error correction and fault tolerance that increases the circuit sizes, when you go from logical circuits to physical circuits, by a factor of what must be close to a thousand, if not more than that. In other words, it is expensive to have those decoherence errors. And there are systematic errors in the controlling apparatus. And again, they always need to be fought with.

So what we were able to prove is the following theorem. In the single-qubit case, the set of all unitaries implementable in the Clifford plus T basis is equivalent to the set of unitaries over this ring. This ring is the integer extension by i and 1 over square root of 2. We furthermore conjecture that in the N-qubit case, the set of all unitaries on N qubits implementable by circuits in the Clifford plus T basis is equivalent to the set of all unitaries over this ring, as long as there is an ancillary qubit available that resides in state zero. I will not be able to prove the conjecture. This is what we tried to do; we couldn't do it. We made some partial progress, but not nearly enough. What I can do is illustrate that the requirement to have an ancilla qubit is essential. And then I will outline the proof of the theorem.

So first, let me show that the requirement to have an ancilla in the formulation of the conjecture is important. In particular, I would like to illustrate it with the example of the controlled T gate and the determinant argument. If you look at the determinant of the controlled T viewed as a matrix on two qubits, a four by four matrix, the determinant equals ω. And to remind you, ω is the eighth root of unity. However, the determinants of the gates that we can work with — Hadamard, phase, CNOT and T — belong to the set plus minus one, plus minus i. Since the determinant is multiplicative, by multiplying these numbers we can never get ω. So we cannot have the controlled T on two qubits. You could say: well, what if we want to build the controlled T up to a global phase? Well, a little bit more hand waving — maybe not necessarily hand waving, but a formal argument — actually shows that no, you cannot do it up to a global phase either.

>>: [inaudible].

>> Dmitri Maslov: Yeah, yeah. No, no, you cannot. It's basically the same determinant argument, but you have to spend a bit more time.

Nevertheless, the controlled T may be implemented using three qubits by this circuit. This is unnecessarily complex. Essentially, what this circuit does is apply Kitaev's trick. If I had a pen and paper, I'd show a much simpler circuit. We want to implement a controlled T. So what we do is a controlled swap — this qubit resides in state zero — we do a controlled swap, then we apply T here, and we apply the controlled swap again. So, Kitaev's trick from '95, my other most favorite circuit, I guess. So here it is just optimized a little bit. In fact, it actually is optimized, even though it doesn't look like such.

>>: So what kind of complexity goes into [indiscernible]?

>> Dmitri Maslov: Yes.

>>: It's all symmetric except for the one extra T.

>> Dmitri Maslov: Exactly.

>>: There's one extra T all the way down at the bottom left; there's one extra T dagger that would have been a T on the right, which would have given you the same thing back.
And so that was your T that you have on the other one for the controlled swap.

>> Dmitri Maslov: Uh-huh.

>>: Everything else, if you look at it --

>> Dmitri Maslov: Yeah, it is symmetric. So yeah, this is kind of the inverse of this.

>>: Exactly, except for one gate.

>> Dmitri Maslov: Or a complex conjugate, if you want. Because the complex conjugate of CNOT is just the CNOT, the complex conjugate of T is the inverse of T, and of P the inverse of P, and the complex conjugate of H is H.

>>: Yeah.

>> Dmitri Maslov: So it is symmetric, just like this circuit. The parent circuit is also symmetric with respect to T.

So now to the proof of the theorem. To remind you, what we were trying to prove is that the set of all unitaries over the ring of the integer extension by i and 1 over square root of 2 is equivalent to the circuits computable by H and T gates — because Clifford becomes pretty much H and P, and P can be simulated with two Ts, so H and T suffice.

So firstly, from a linear algebra book, we know — actually, I'm cheating here a little bit; in linear algebra, this is e to the i phi — that any unitary can be written in this form, where k runs from zero to seven. And if you look at those unitaries of this form for different values of k, then they are all equivalent up to multiplication by powers of T. So in a sense, when you have a two by two unitary, what you have instead is just the column vector (x, y), because everything else you can restore if you can get to the vector (x, y). This observation helps us move from the synthesis of unitaries to the synthesis of states. If we can synthesize states efficiently — if we can synthesize any state over the ring — then we can synthesize any unitary, because the unitaries can be easily restored.

To show that we can synthesize the states, we use the notion of the smallest denominator exponent. The smallest denominator exponent, here defined formally, is an analog of an irreducible fraction for these ring elements. When we look at the ring that is the integer extension by i and one over square root of two, we know that we have some fractions where the denominator is a power of the square root of two. So the SDE is just the power of the square root of two defined such that what we're looking at is an irreducible fraction.

First, a lemma, very simple to prove: if you look at a two by two unitary matrix, then the SDEs of all elements of this matrix are equal. The reason for that is, if this entry has denominator square root of two to the power n and this one has square root of two to the power of, say, n plus one, then, since the column vector has to be normalized, the normalization doesn't work out. It's like proving that the square root of two is an irrational number. Pretty much the same argument. Very easy.
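[A hedged sketch of exact arithmetic in this ring, along the lines of the four-integer representation discussed in the closing Q&A: elements of Z[ω] are stored as integer coefficients of 1, ω, ω², ω³ with ω⁴ = −1, and the SDE is computed by repeatedly dividing by √2 = ω − ω³. Class and function names are illustrative, not the actual package.]

```python
class Zw:
    """a + b*w + c*w^2 + d*w^3 with integer a, b, c, d and w = exp(i*pi/4),
    so w^4 = -1. A Zw numerator plus a power-of-sqrt(2) denominator
    represents an element of the ring in the theorem."""
    def __init__(self, a, b, c, d):
        self.v = (a, b, c, d)

    def __add__(self, other):
        return Zw(*(x + y for x, y in zip(self.v, other.v)))

    def __sub__(self, other):
        return Zw(*(x - y for x, y in zip(self.v, other.v)))

    def __mul__(self, other):
        out = [0, 0, 0, 0]
        for i, x in enumerate(self.v):
            for j, y in enumerate(other.v):
                k = i + j
                out[k % 4] += x * y if k < 4 else -x * y  # w^4 = -1
        return Zw(*out)

def div_sqrt2(z):
    """Return z / sqrt(2) if it stays in Z[w], else None.
    Uses sqrt(2) = w - w^3, so z / sqrt(2) = z * (w - w^3) / 2."""
    a, b, c, d = z.v
    n = (b - d, a + c, b + d, c - a)  # coefficients of z * (w - w^3)
    if any(x % 2 for x in n):
        return None
    return Zw(*(x // 2 for x in n))

def sde(z, k):
    """Smallest denominator exponent of z / sqrt(2)^k: keep cancelling
    factors of sqrt(2) while the numerator allows it."""
    while k > 0:
        reduced = div_sqrt2(z)
        if reduced is None:
            break
        z, k = reduced, k - 1
    return k

# Example: 2w / sqrt(2)^3 reduces to w / sqrt(2), so its SDE is 1.
print(sde(Zw(0, 2, 0, 0), 3))  # 1
```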
Okay. So the next thing we do is consider the result of the multiplication of a column vector by H multiplied by T to the power of k — and now we work with vectors only, because we know that they're equivalent to matrices. So by multiplying these two matrices by a column vector, we get this result.

So now I'm going to cheat a little bit, in the sense that I'm going to show two technical results that I'm not going to prove, but that I claim are proved in the paper. I just don't want to prove them. The proof is not very difficult, it's just technical; you have to study the properties of SDE for a long time before you can prove this inequality and this equality.

So let me describe what the inequalities are and what the meaning behind them is. We want to show that by applying H T to the power of k to a state, we change the denominator by no more than one, either way: we can increase it by one, we may not change it, or we may decrease it by one. That is what this inequality states. It states that the SDE of the square of this value — this value can be found here — minus the SDE of that value, the first coordinate of the vector we're looking at, is squeezed between one and negative one. So it can increase by one, decrease by one, or not change.

>>: Is it the case that the denominator will not change sometimes?

>> Dmitri Maslov: Yes, yeah. And the second very interesting thing that we need to prove, and actually do prove, is that for any s in the set negative one, zero, one, we can find a k such that the difference of those SDEs — this expression is exactly the same expression as that one — equals exactly s. What this means is that by applying H T to the power of k, we can reduce the power of the denominator by one, we can leave it unchanged, or we can increase it by one.

At this point, you're probably thinking: well, there we go, we have an algorithm for the synthesis. We always find a k such that the power of the denominator is reduced by one, and then, when the power of the denominator is small enough, we can use breadth-first search over all those unitaries with small denominator size. And this is exactly what we do.

So here is the algorithm. On the input, we have a column vector. Technically, on the input, we have a matrix, a two by two matrix, but we right away drop the second column of the matrix and only look at the first column. We do restore the second column at the very end — but I already said too many things about a very trivial thing. The output is the circuit that prepares the state (x, y) from ket zero. The way the algorithm works is: while the SDE of the first component of the column vector we have is greater than or equal to four, we find k such that application of H T to the power of k reduces the denominator of x, and we substitute H T to the power of k multiplied by the column vector (x, y) into (x, y). So the denominator of the first component of this vector is smaller — smaller exactly by one — than the denominator of that one.

So in a sense, for the SDE of the square of the first component of the vector greater than or equal to four, the set of Clifford plus T circuits on the single qubit is sort of flat, and at every level, you can predict what is happening: you can increase the denominator, you can decrease it, or it may not change. The behavior is very predictable. And finally, when we cannot apply this step anymore, because the SDE is less than four, we brute force: we basically brute force the implementations of all unitaries such that the SDE is less than four. There are probably close to 20,000 of those — a few, in any case. So they can be easily brute forced and stored.
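[A schematic Python rendering of this loop, reusing Zw, div_sqrt2 and sde from the sketch above. The base-case lookup table is omitted, the threshold is phrased via the common denominator exponent rather than sde(|x|²), and the gate bookkeeping — recording the inverse factor T^(8−k) then H for each reduction by H·T^k — is one plausible convention, not the paper's code.]

```python
def omega_pow(k):
    """w^k as a Zw element, using w^4 = -1."""
    out = [0, 0, 0, 0]
    out[k % 4] = 1 if (k % 8) < 4 else -1
    return Zw(*out)

def apply_htk(k, x, y, e):
    """Apply H * T^k to the state (x, y) / sqrt(2)^e with x, y in Z[w]:
    H T^k (x, y) = ((x + w^k y) / sqrt2, (x - w^k y) / sqrt2),
    so the denominator exponent grows by one before any cancellation."""
    wy = omega_pow(k) * y
    return x + wy, x - wy, e + 1

def reduce_common(x, y, e):
    """Cancel common sqrt(2) factors from both components (by the lemma,
    the entries of a unitary column share the same SDE)."""
    while e > 0:
        x2, y2 = div_sqrt2(x), div_sqrt2(y)
        if x2 is None or y2 is None:
            break
        x, y, e = x2, y2, e - 1
    return x, y, e

def synthesize_state(x, y, e):
    """While the denominator exponent is large, find k in 0..7 such that
    H*T^k strictly reduces it (the theorem guarantees such a k exists).
    Each reduction by H*T^k contributes the inverse factor T^(8-k) * H to
    the preparation circuit; factors found later act closer to |0>, hence
    the prepend. Gates are returned in application order."""
    gates = []
    while e >= 4:  # the paper phrases this threshold via sde(|x|^2)
        for k in range(8):
            x2, y2, e2 = reduce_common(*apply_htk(k, x, y, e))
            if e2 < e:
                break
        else:
            break  # no reducing k: already in the base-case region
        x, y, e = x2, y2, e2
        gates = ["H"] + ["T"] * ((8 - k) % 8) + gates
    return gates, (x, y, e)  # remaining small case: look up in the table
```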
So let me prove H optimality. In other words, let me prove that the number of Hadamard gates is actually optimal in our circuits. To increase or decrease the SDE by one, we need precisely one Hadamard gate; that is proved. And the set of all unitaries with SDE of x squared equal to eight equals the set of H-optimal circuits with seven Hadamard gates. We ran a computer search to actually verify that this statement is correct; this statement is verified by a computer search. As such, we conclude that this is our n-equals-one base case, sort of, of the mathematical induction, and this is the induction step. So we use the inductive proof to show that the number of Hadamard gates required to implement a particular unitary equals the SDE of x squared, where x is the top entry, U11, plus one. That's the number of Hadamard gates. And our algorithm matches exactly: SDE of x squared plus one.

We furthermore looked at possible T optimality. We cannot prove T optimality, but what we can do is, using breadth-first search, synthesize all optimal circuits with up to 13 gates, then resynthesize them using our algorithm — and it turned out that we got all the T counts correctly. So we do believe that our T counts are also optimal. We cannot say anything about the P count. The number of phase gates we use may be suboptimal. We just don't know; we don't have a feeling; we didn't try to study that. But H optimality, we can prove, is there. T optimality seems to be the case.

>>: So now, do you define T count as, essentially, the T count of T to the power of K being either zero or one?

>> Dmitri Maslov: The T count of T to the power of K equals one if K is odd and zero if K is even, because T to the third is PT — phase times T, yeah. So we do this reduction, yeah. For the proof, it's easier to operate with powers of T than to operate with P times T, or Zed times T, or Zed times P times T, because it's much cleaner.

Another thing — I didn't have the time to make the slide — but the complexity of our algorithm is linear in the number of gates in the circuit. What this means is that this algorithm, among all synthesis algorithms, is asymptotically optimal. I mean, just the mere time it takes to write down a circuit with N gates is at least order of N, and order of N is the complexity of our synthesis algorithm. So it is efficient.

>>: So the depth of the circuit is not an input parameter?

>> Dmitri Maslov: No, our --

>>: You can say it's --

>> Dmitri Maslov: Our input is the unitary. So once we have the unitary, we can synthesize. In fact, what we do in experiments — and I believe this is one of the next slides, yeah, so let me get to it. It's probably the one after that.
So here is a similar comparison to Dawson-Nielsen. It's a software-to-software comparison; in a way, again, we're comparing apples to oranges. This is the behavior of Dawson's implementation of the Solovay-Kitaev: this is zero iterations, one iteration, two iterations, three, four, five, six, seven. You can see that the error did not improve here. The scales are logarithmic — and sorry about the small font size, but it was generated by automatic software, and I did not increase the font size yet. So this is the result of using the double data format in C++: it does not have enough precision to go beyond errors on the order of ten to the minus 8 or ten to the minus 9. It just flattens out there; it doesn't work. We, however, used multiprecision arithmetic, so we were able to go far beyond. Actually, I'll show you on the next slide that much more could be shown on this picture. The smallest errors that we have circuits for are on the order of ten to the minus 50, and ten to the minus 50 is likely unpractical — unpractical in the sense of: why would you need something that is this precise?

>>: So the [indiscernible] being approximated are just the R Zeds, the small rotations?

>> Dmitri Maslov: Yeah, we approximate those rotations, those four rotations. For other things, we tried; the graphs look very similar, so we just show the few that, you know, look nice, and everything else looks about the same. I mean, it's a logarithmic scale. So even if something is different by 30 to 40 percent, which seems to be the case for a random unitary, then this point is going to be maybe, yeah, here. So they're about the same. So this actually very nicely defines the range of points: that's where they're all going to land, for a random unitary. But again, our database was larger — you can see that for zero iterations we're better — and our implementation was more efficient. So that's partially why this slope is much better than this slope.

Also, another experiment we did: we took the circuits synthesized by Dawson's implementation of the Solovay-Kitaev, computed the unitary, and resynthesized the unitary using our algorithm. The reduction in the number of gates was only on the order of about 50 percent. So the approximations are not efficient, but their circuits are actually fairly good — well, 50 percent away from what we believe should be fairly close to optimal, in whatever metric of optimality you choose. But this graph shows that finding a good approximation to a unitary is a much more difficult task than, once you have a good approximation, finding a circuit for it. Finding a circuit is easy: it can be done in time linear in the number of gates in the optimal circuit. I mean, it cannot be better than that. But finding an efficient approximation is very hard.

Okay, so here is an example of the experimental results. Here it shows the error down to ten to the minus 15. If I click here, we actually have more circuits. So here are the implementations of R Zed of two pi over two to the power of N, for N between four and 30. Each of them approximates down to ten to the power of minus 50, approximately. So if you want to have a circuit that approximates to a very small error, then you have to pay a heavy price in, for example, the number of Ts: 1.7 million. We also have circuit files here, so if you click here, it's a fairly small circuit. The number of gates in this circuit is only 73, counting all gates; 28 T gates. So this is the circuit. Okay, it's not displayed very well, so let me switch the browser. Okay. So here is the same page, and this is the circuit. You can actually view this circuit using the software for viewing circuits that we also have on the group web page. It's called QCViewer. You can download it and play with it if you want to. It's not the best for displaying circuits with millions of gates on a single qubit, but if you have circuits with tens of gates on a few qubits, then it's much nicer. In fact, all of the circuits that I showed today were rendered by this QCViewer software.

Okay. So let me go back to the presentation, what's left of it. Acknowledgments: I'd like to acknowledge the help from my students — Matt Amy, who is a Master's student, and Vadym Kliuchnikov, who is a Ph.D. student — and Oleg Golubitsky, one of the co-authors.
He's an ACM programming contest champion, and he's a really, extremely good coder, which explains why the numbers in the reversible logic synthesis are so low: he helped tremendously with coding it. And he's great. Then professor Mike Mosca, who is handling the quantum circuits research group in my absence at the University of Waterloo. And the NSF independent research and development program that allows me to do research despite being an administrator. Okay. Thank you.

>>: So I'd like to know a little more about the high-precision arithmetic package you used to get those big denominators.

>> Dmitri Maslov: Yeah, sure. I can just send you the link to where you can download it from, if that's what you want.

>>: No, I just wondered what its attributes are. Is it something you implemented yourself?

>> Dmitri Maslov: No, no. We downloaded it from online; it's just available. It's a C/C++ library for dealing with large numbers, and it's just that. It's one of those libraries. You don't have to use that library, because there are so many different ones.

>>: There's nothing --

>> Dmitri Maslov: I even wrote one myself. It's so not good that it's not available anywhere.

>>: [indiscernible].

>> Dmitri Maslov: I can't recall where we downloaded it from. Maybe it's yours.

>>: No, no, this is internal.

>> Dmitri Maslov: Oh, it's internal. Oh, then it may still be yours. I don't remember which one it is.

>>: Okay, yeah. There's a thing called GMP from the open source world that's fairly popular.

>> Dmitri Maslov: I think that's the one we used. Yeah, I think that's the one we used, yeah.

>>: Now we can go back and -- yeah. Do you have any thoughts on how to prove T optimality?

>> Dmitri Maslov: No. We didn't try to prove it. We didn't want to step on someone else's turf.

>>: It seems like when we compared circuits, we have a few more H -- our [indiscernible] has a little more chaff, I think. But the Ts are the same. So it seems that there might be maybe some modifications that it cannot perform.

>> Dmitri Maslov: Yeah, there may be. You see, we concentrated on the Hadamard, because that's how we looked at the matrices. We looked at the denominator, and the denominator is defined by Hadamards. And between two Hadamards, you can only squeeze one T — no more than one T. So yeah. Whatever that gives.

>>: It's amazing how much algebraic structure there is in all this.

>> Dmitri Maslov: The single-qubit case, in my opinion, is super trivial. But the two-qubit case — we tried to prove the theorem that we had in the two-qubit case. It's much more difficult. For example, because you need an ancilla that you don't need in the single-qubit case. For what it's worth, you have entanglement, which may come into play, and it could only complicate things.

>>: Yeah, so you're looking at some sort of projection or some sort of morph of the algebra of the three-qubit case.

>> Dmitri Maslov: Yeah, I think it's much more difficult. The single-qubit case is trivial; the two-qubit case is very difficult, in my opinion. I mean, we couldn't do much. We were, however — if you would like to know — able to prove that the unitary synthesis in the N-qubit case can be reduced to state synthesis. We could do that step, but that's as far as we went. And the proof is far from trivial.

>>: But unitary synthesis and state synthesis are really equivalent.

>> Dmitri Maslov: Yeah, in a way they are equivalent, but state synthesis is simpler.
Because all you need is any matrix with the column that you want to have, but when you have N matrix, you need the matrix, that matrix. So in a way, we reduce the problem of size 4 to the power of N to the problem of size 2 to the power of N, but that wasn't the most difficult part. We don't believe that that part is difficult. We believe that reducing it to solving the problem with 2 to the power of N variables, that is difficult. >>: So when you experimented with the matrices [indiscernible], right? >> Dmitri Maslov: No. What we -- yeah, we did both. I mean, we used Solovay-Kitaev that we implemented ourselves. So when you have a unitary that can -- that is not a unitary over the ring that we have, we know, and if it's a two by two unitary, we know that provably, we cannot implement this unitary exactly. We need to approximate it. For approximation, we use Solovay-Kitaev. It's just that the code that we use is the code that we wrote for Solovay-Kitaev, but not the Dawson's code, because when we looked at it, we found that we could write something better so we wrote something better. Which we believe is better. And then, if we're given a unitary, we first look at it and say, is it the unitary of our ring. If it is, synthesize it using the algorithm. If it's not, use our Solovay-Kitaev to come up with the approximation unitary. It comes up with a circuit. But we don't care about the circuit. We only care about the unitary. And then we use the unitary to synthesize the circuit. So in a way, yeah, so T approx is the time it takes to approximate a unitary. So because this unitary is not a unitary in the ring. This unitary lies outside the ring. So T approx is the time spent by the -- our implementation of Solovay-Kitaev, and T decomposition is the time spent to actually synthesize the circuit using our algorithm. So the software consists of two parts. >>: I'm curious about this. When you have a unitary over your ring and the SDE is rather large, then the things, enumerators, the integers, would be also quite large. >> Dmitri Maslov: Yes. 25 >>: Did you have to use -- did you ever have to use a controlled precision integer? >> Dmitri Maslov: We use our own implementation that's symbolic. It's fully symbolic, because you see, for example, here we have 15,000 Hadamard gates. 15,000 Hadamard gates implies square root of two to the power 15,292 plus minus one, approximately. So we just, we needed our own arithmetic for that. So we used our own arithmetic. >>: So you essentially have an algebraic package that deals with Z extended by those roots? >> Dmitri Maslov: >>: Yeah, exactly. All right, yeah. Or it's the Gaussian integer extended by a square root of two. >> Dmitri Maslov: >>: Exactly, yeah. Represented by a through E vector, right? >> Dmitri Maslov: >>: Yeah, yeah. Actually, we represent by four. You could do four, yeah. >> Dmitri Maslov: Yeah, we use four, uh-huh. >>: So it seems like if you have a software that does C extended by one over square root of four, then dealing with an infinite precision integer arithmetic, if it was integer arithmetic would be a subset of what your package does. >> Dmitri Maslov: No, because it's very specialized. multiplication, yeah. >>: It doesn't do any of the Oh, I was wrong. >> Dmitri Maslov: All right. It's suited only for our needs. limited capability. Very limited. >> Krysta Svore: Any other questions? Let's thank Dmitri. It has only a