>> Krysta Svore: Today, the group has Dmitri Maslov visiting. So we're
excited to have Dmitri here. Dmitri received his degree in mathematics from
Moscow State University in Russia in 2000 and received his Master's and Ph.D.
in computer science from the University of New Brunswick, Canada, in 2002 and
2003.
Currently, Dmitri is working as the program director in the Division of
Computing and Communication Foundations, Directorate For Computer and
Information Science and Engineering at NSF, and is also an adjunct faculty
member at the University of Waterloo, in the Department of Physics and
Astronomy and the Institute for Quantum Computing there.
His research interests include quantum circuits and architectures, quantum
algorithms, quantum information processing, logic synthesis, reversible logic and
circuit complexity.
Today, we'll hear about synthesis for reversible circuits with applications to
quantum circuits. So we're excited to have Dmitri. Let's welcome Dmitri.
>> Dmitri Maslov: Thank you for the introduction. So I right away found an
error. It's on the title. I forgot to change it. So hopefully, for the rest
of the talk, there will be no other corrections.
Okay. So what I would like to talk about today is about different kinds of
optimal searches for the implementations of different kinds of functions. So
we'll talk about optimal synthesis of reversible circuits, optimal synthesis of
multiple qubit unitaries, the truly quantum circuits in other words, and
optimal synthesis of single qubit unitaries. So the optimal synthesis of
single qubit unitaries is actually highly relevant to the work done by Alex
and Krysta. We developed our result independently of Alex and Krysta, but it
seems as if we developed the same thing, looking at it from two different
angles.
But I'll talk about it, and we'll see what the differences are. The good thing
is we just heard Alex's talk so it's still fresh in the mind. So hopefully,
the differences will be evident.
So I wanted to show a few gates: the NOT gate, the CNOT gate, the Toffoli
gate, and the Toffoli with more controls. So, for example, this Toffoli has
two controls. It computes the exclusive OR of this variable with the Boolean
product of variables X and Y. The Toffoli-4 has three controls, so here it
will compute the exclusive OR of this variable with the Boolean product of the
three variables above.
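As an illustration -- a small sketch of mine, not something shown in the
talk -- these reversible gates can be written as maps on classical bit tuples,
and each is its own inverse, which is exactly what makes it reversible:

    from itertools import product

    def NOT(x):
        # x -> x XOR 1
        return x ^ 1

    def CNOT(x, y):
        # (x, y) -> (x, x XOR y): the target flips iff the control is 1
        return x, x ^ y

    def toffoli(x, y, z):
        # (x, y, z) -> (x, y, z XOR (x AND y))
        return x, y, z ^ (x & y)

    def toffoli4(w, x, y, z):
        # (w, x, y, z) -> (w, x, y, z XOR (w AND x AND y))
        return w, x, y, z ^ (w & x & y)

    # Each gate squares to the identity, so no information is lost:
    assert all(toffoli(*toffoli(*b)) == b for b in product((0, 1), repeat=3))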
So when we talk about circuits, the difference between reversible and quantum
circuits and the classical circuits is we do not allow fan-outs and we do not
allow feedback. Because the way it works in quantum is -- and reversible case,
the circuits aren't physical, but they describe the time evolution.
So having a fan-out would correspond to duplicating the particle, and due to
energy mass conservation, we probably cannot do it. And the feedback would be
travel back in time, which is also a bad idea, probably. I mean, there are
closed timelike curves where you can probably travel back in time, but then
the complexity implications are such that it is probably still impossible,
even if the history of traveling back in time were consistent. Anyway, that
is not the
point of the talk.
So the quantum gates: the Hadamard gate, which we already saw earlier today,
is a 2 by 2 matrix. I'd like to illustrate how the Hadamard gate works on
this two-dimensional picture, which you can also think of as a
three-dimensional picture. This is the ket zero state, and perpendicular to
it is the ket one state. What the H gate, the Hadamard, does is reflect the
state about this axis. So algebraically, if we apply the H gate to ket zero,
which is the vector one zero, then the result of the computation is this
state. And, indeed, on the picture we can see that if we take this vector and
reflect it with respect to this line, then we have this vector, and this is
exactly what H of zero is. So it's a reflection of zero.
For completeness, let's verify it for ket one. So H applied to ket one
algebraically is going to be this expression. And in terms of the picture, we
take this vector, we reflect it with respect to this line, and this is the
result. And we can see that the result actually corresponds to what we
computed algebraically: the coordinates are one over square root of two,
negative one over square root of two. So the H gate is fairly nice and easy
to see.
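A quick numerical check of this computation -- a sketch assuming numpy, not
part of the talk:

    import numpy as np

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    ket0 = np.array([1.0, 0.0])
    ket1 = np.array([0.0, 1.0])

    print(H @ ket0)   # [ 0.7071  0.7071] = (1/sqrt(2),  1/sqrt(2))
    print(H @ ket1)   # [ 0.7071 -0.7071] = (1/sqrt(2), -1/sqrt(2))

    # A reflection applied twice is the identity:
    assert np.allclose(H @ H, np.eye(2))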
So the other gates that I'm going to use in this talk include the T gate.
Unlike Alex, I prefer to actually write it down as one and E to the I pi over
four -- so this is the eighth root of unity. The W here denotes the eighth
root of unity; this is how I will write it. I will actually use the eighth
root of unity throughout the talk, so I'd rather introduce it right away.
We already discussed the reasons to use the T gate. We need it because of
fault tolerance and because we can actually physically construct it, unlike
other gates for which it is not necessarily as obvious how to construct them;
as such, they need to be implemented as circuits over what we know how to
construct.
So the phase gate, you can think of it as T squared. When you look at the
matrix, this matrix squared is the square of its entries, so the only thing
that changes is W, the eighth root of unity: it changes to the fourth root of
unity, becoming I. So the phase gate is one, zero, zero, I.
In practice, we do not implement the phase gate as a sequence of two Ts,
because that would be -- I mean, that would be correct, but it would be way
too expensive. Because if one were to estimate the relative cost of the T
gate with respect to the phase gate, then, in a good sense, if the T gate were
associated with some sort of physical cost, call it a hundred, then the phase
gate would cost about four. So it is much, much cheaper.
So the Zed gate, again, we can think of it as P squared or T to the power of
four. When we raise the elements of this matrix to the power of four, this is
one, zero, zero, and this becomes negative one. So again, we don't implement
Zed as a circuit with two Ps, because there is a direct implementation of Zed,
and I believe its physical cost, compared to T, is about one on the scale that
I introduced.
So the T gate costs a hundred, this costs about four, and this costs about
one. So it's much, much cheaper.
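These relations between T, P and Zed are easy to confirm numerically; a small
sketch (mine, assuming numpy):

    import numpy as np

    w = np.exp(1j * np.pi / 4)   # the eighth root of unity
    T = np.diag([1, w])
    P = np.diag([1, 1j])         # phase gate: w^2 = i
    Z = np.diag([1, -1])         # Zed gate:   w^4 = -1

    assert np.allclose(T @ T, P)                          # P = T^2
    assert np.allclose(np.linalg.matrix_power(T, 4), Z)   # Z = T^4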
So why do we consider those gates? Here are a few reasons. Using the library
with CNOT gates, we can compute linear reversible functions. Those functions
are quite important because, for example, when we talk about Clifford
circuits, linear reversible functions, due to Aaronson and Gottesman, are the
most difficult part of the Clifford circuits. In fact, the simulability of
Clifford circuits has as its bottleneck the simulation of linear reversible
circuits. So the efficiency of linear reversible circuits determines the
efficiency of Clifford circuits.
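The correspondence between CNOT circuits and linear reversible functions can
be made concrete: each CNOT adds one row to another in an invertible 0/1
matrix over GF(2). A small illustrative sketch (my example, not the
speaker's):

    import numpy as np

    def cnot_circuit_to_matrix(gates, n):
        """gates: list of (control, target) pairs acting on n wires."""
        m = np.eye(n, dtype=int)
        for c, t in gates:
            m[t] = (m[t] + m[c]) % 2   # wire t becomes x_t XOR x_c
        return m

    # Three alternating CNOTs implement a swap of two wires:
    print(cnot_circuit_to_matrix([(0, 1), (1, 0), (0, 1)], 2))  # [[0 1], [1 0]]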
The Clifford circuits, now. We don't technically need the Zed gate here, but
I included it, and that is the last one; Zed is just P squared, so we could
avoid it completely. This library is over-specified. The Clifford circuits
are important because they're required in error correction, and in quantum
computing, we will likely need to correct the errors -- more likely so than in
classical computing.
Well, in classical computing, we started by studying error correcting codes,
and then it turned out that for computational purposes, we don't need those.
It looks like in quantum, we actually will need those. So Clifford circuits
are very important.
So T plus CNOT. This is an interesting class. This set is actually discrete.
It contains a kind of linear reversible circuits modified by Ts. This is an
interesting set.
So Toffoli -- you can also add CNOT to the Toffoli, but in a sense, if you
have a single qubit residing in state one, then the CNOT is a Toffoli using
this single qubit. So the Toffoli gate gives us any reversible function.
When we have the library with H and T gates, Hadamard and T, then we can
implement any single qubit gate. And by that I mean approximate, because
there are infinitely many unitaries, so any discrete set of gates is not
enough to implement exactly every unitary from an infinite set. But we can
efficiently approximate those.
And finally, Clifford plus T gives all unitaries via the approximation.
Okay. So the first problem I'm going to discuss -- contrary to the
advertisement in the abstract, it made more sense to me to invert the order of
the things I'm going to discuss and start with the reversible circuits,
because it is simpler. The problem that we formulated is that we want to
synthesize optimal circuits, and optimality means minimal resources, such as
gate count or depth. What else can we do? We can do weighted gate count,
weighted depth, restricted architectures. I mean, there are lots of different
metrics that one could use.
So why do we want to have optimal implementations? Well, one of the reasons,
it could be a useful library for those physicists trying to establish a certain
experiment and having a limited control over their quantum mechanical system.
So, for example, if your quantum mechanical -- if your apparatus allows you to
apply ten gates, and you look up the unitary or the transformation in the
library and it turns out that the optimal circuit has 12, then no, you cannot
implement it. You have to gain more control over quantum mechanical system
before you can do something that complex.
So perhaps a bigger reason is the peephole optimization methods that are
currently an important part of any reasonably sophisticated compiler. So once
you have a library of optimal implementations and you want to optimize a
quantum circuit, you just cut small subcircuits, compare those subcircuits to
the optimal circuits stored in the library and see if it is beneficial to do a
substitution. A trivial idea, but very practical.
And finally, there is the mathematical curiosity: computing the value of the
Shannon complexity function. In other words, what is the number of gates
required to implement a function of a certain size. So in 2002, Shende,
Markov, and maybe somebody else -- I don't remember all the authors -- showed
that reversible functions with three inputs require no more than eight gates.
We improved this by looking at the four-bit reversible functions. We show
that the number of gates required is 15.
I do not believe that L of 5 can be computed, in my lifetime anyway, even if
the scaling continues, even if Moore's law continues to hold. The problem is
too complex.
>>: [inaudible].

>> Dmitri Maslov: I'm sorry?

>>: With the progression, we can see --
>> Dmitri Maslov: Yeah, but can you prove that? You can probably guess, but
can you prove it? So a reversible four-bit function is a permutation of the
four-bit integers, and it may be computed using those reversible gates I
introduced: the NOT, CNOT, Toffoli and Toffoli-4. The reason I need the
Toffoli-4: for completeness, I need an odd permutation, and the other gates
are even permutations. So I just need something that is odd.
We want to synthesize optimal circuits. And what was done previously, in
2003 -- on a previous slide I said 2002; 2002 was their conference paper, and
I'm referring to a journal paper -- is they showed optimal reversible circuits
for three-bit reversible functions, and there are two to the power of three,
factorial, of them. So the factorial is applied to two to the power of three:
it is eight factorial, which is 40,320.
Then in 2006, Markov's group at the University of Michigan tried to synthesize
four-bit reversible circuits, and they were able to synthesize 26 million of
them. The number of four-bit reversible functions is two to the power of
four, factorial, which is 16 factorial -- a 14-digit number. This number here
is an eight-digit number, so there are quite a few more digits to go.
So we introduced two searches. First let me explain how to synthesize a given
function, and then we'll explain how to synthesize all functions. In a sense,
if you can synthesize all functions, you can synthesize any one function,
because you just run the search until you've found the one that you want.
But the search is very long, so it's not practical. And conversely, in
principle, if you can synthesize any one function, you can synthesize all by
waiting long enough. But then again, that also turns out to be non-practical.
So we have developed algorithms. So now I'll explain how to synthesize any
given function.
So we use breadth first search to find all optimal circuits with at most L
divided by two gates. We choose L divided by two to be a high enough number
to guarantee that our algorithm works; in our particular case, L divided by
two is nine, and L has to be no less than the number of gates needed to
implement any reversible function of four inputs.
Once we've found all circuits with at most L divided by two gates, the optimal
circuit for any function can be found from this big database of half of all
optimal circuits by searching it. In our experiment, it turned out that the
number of circuits that we store is this number; this is the number of
circuits with up to nine gates.
Of course, if you just have this database and you search it twice for two
halves, the complexity becomes too high. It becomes S to the power of two,
and the S is this long number I gave on the previous slide. So it is not
efficient. We need something more efficient. And this is what it says: if we
just search for two halves, this is how many Boolean operations we need, and
we can do ten to the power of nine in a second. So this number becomes too
large to be practical.
So what we do instead is we hash the table of circuits. We use the Thomas
Wang hash function, for those familiar with hash functions for permutations;
if not, it's just a hash function. I don't know why it's efficient, but it
is. So with S log S, we have approximately this many Boolean operations. I
mean, of course, I'm kind of cheating here -- it's not exactly this many
operations -- but just to show you that this number looks much shorter than
that number. So the difference -- yeah.
>>: What is the use of hashing?

>> Dmitri Maslov: So the use of hashing. You need to find two halves, right?
To find one half, you need the number of operations equal to the size of the
database. But to find the second half, we use a hash table: we calculate the
hash key and we query the database whether it has any values with this hash.
And that is only poly log in the size of the database.
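Here is a toy meet-in-the-middle sketch of the idea -- an illustration under
my own simplifications, not the authors' code. It builds the table of
everything reachable with at most L/2 gates, keyed by the function itself, so
dictionary lookup plays the role of the hash; the real search does this for
16! permutations.

    from itertools import product

    def toffoli(v):           # z ^= x AND y
        x, y, z = v
        return (x, y, z ^ (x & y))

    def cnot01(v):            # y ^= x
        x, y, z = v
        return (x, x ^ y, z)

    def compose(perm, gate):  # append the gate to the circuit computing perm
        return tuple(gate(v) for v in perm)

    identity = tuple(product((0, 1), repeat=3))
    gates = [toffoli, cnot01]

    table = {identity: 0}     # function -> optimal gate count
    frontier = [identity]
    for depth in range(1, 4):                 # L/2 = 3 in this toy example
        nxt = []
        for perm in frontier:
            for g in gates:
                q = compose(perm, g)
                if q not in table:            # the hash-table lookup
                    table[q] = depth
                    nxt.append(q)
        frontier = nxt

    # To synthesize f: for each half g in the table, test whether the other
    # half needed to complete f is also in the table -- one hash query per
    # candidate instead of a second full scan of the database.
    print(len(table), "functions reachable with at most 3 gates")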
>>: So on the other half, which gates does the hash help you to look at?

>>: [inaudible].

>> Dmitri Maslov: Yeah, so --

>>: Oh, okay.

>> Dmitri Maslov: Oh, you got it?

>>: Yeah.
>> Dmitri Maslov: All right. So finally, we actually do one more reduction,
which, in theory, may be not as impressive as it is in practice. We say that
if we have an optimal circuit, then it is still optimal with respect to
input/output relabeling -- of which there are four factorial: four variables,
we can permute them in any order, and four factorial is 24 -- and with respect
to function inverse. So if we have an optimal circuit, then the inverse of
this circuit is also an optimal circuit. So we can pack almost 24 times two,
being 48, circuits into one circuit, because they're all, in a sense,
identical.
So here is a recap. We started with a baseline solution: we said that we use
breadth first search, which requires order of N space and order of N time
complexity, where N is 16 factorial, the number of functions.
So we first reduce it to order of roughly -- this is a little bit of cheating.
This is a lower bound. And by no means an upper bound. We can calculate an
upper bound, but we believe that the upper bound is actually much lower than
the order of N. And what we do is we synthesize only half of all optimal
circuits. So then the run time reduces to order square root of N, but we have
to search for two halves and the complexity of the search is the square of the
size of the database.
Then we introduced a hash table and reduced the time complexity to square root
of N, poly log square root of N. This is actually a poly log, but, I mean,
log/poly log, not a big difference as long as we have square root in front and
the small constant, because it has to be practical. And we reduce,
furthermore, the space complexity by a factor of 48.
So it turns out that a random reversible function of four variables more often
than not has no symmetries. So the actual, practical factor is 47.95, I
think, which pretty much says that only in one case in a hundred are there any
symmetries whatsoever.
So this is what the distribution of reversible functions requiring a certain
number of gates looks like. The reduced functions are what we actually store
in the database; this is the actual functions. So what this table says is
that there are 4.8 million different reversible functions requiring five gates
in an optimal implementation.
So we then experimented with ten million random functions. On the input, we
have ten million random functions, fed one by one, and on the output, the
software produces an optimal implementation. In principle, there could be
more than one optimal circuit, but as soon as we find one, we say, well, there
you go. It is optimal.
So what is surprising about this search is the time it takes to find an
optimal implementation: 0.07 seconds per circuit. It is incredibly fast. I
almost could not believe how fast it is. If you think of having the answer --
all optimal circuits -- stored on a hard drive without any reductions, then
this hard drive would have to be at least a hundred terabytes. And if it were
a 5,400 RPM hard drive, then the expected time to extract a single bit of data
from such a hard drive would be 0.01 seconds.
So ours is faster: it's 0.07 seconds. So it is surprisingly fast. Okay. Now
to synthesizing all functions. We could not come up with anything very smart,
so what we did is a breadth first search. We composed an indexable set of
four-bit permutations while taking input/output relabeling and the inverse
symmetries into account as much as possible.
In practice, what we were able to do is reduce 16 factorial, the number of all
functions, by a factor of approximately 12 to 13 -- not the full 48, because
with both symmetries included, relabeling and inverse, we could not come up
with an easy-to-compute index. And we wanted to have an index over a bit
vector, so that we could update the bit according to the function and
calculate the function from the index very quickly. So that was the
requirement.
Then what we did is we ran the search on the bit vector, setting those bits
where the function is implementable with an optimal circuit of a certain size.
The bit vector that we used is 209 gigabytes in size. We used 128 gigabyte
machines, so in order to fit that vector into RAM, we needed to break it into
three parts.
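One way to realize such an indexable bit vector -- a sketch of mine; the
authors' actual indexing additionally folds in the symmetries -- is to rank
and unrank permutations with the Lehmer code, so every n-element permutation
owns one bit:

    from math import factorial

    def rank(perm):
        """Lehmer-code index of a permutation of 0..n-1."""
        n, r, items = len(perm), 0, list(range(len(perm)))
        for i, p in enumerate(perm):
            j = items.index(p)
            r += j * factorial(n - 1 - i)
            items.pop(j)
        return r

    def unrank(r, n):
        items, perm = list(range(n)), []
        for i in range(n):
            j, r = divmod(r, factorial(n - 1 - i))
            perm.append(items.pop(j))
        return tuple(perm)

    assert unrank(rank((2, 0, 3, 1)), 4) == (2, 0, 3, 1)
    # For 4-bit reversible functions, n = 16 and the raw vector has 16! bits,
    # about 2.6 terabytes; the factor-12-to-13 symmetry reduction is what
    # brings it down to the 209 GB quoted in the talk.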
So we actually ended up running the calculation on nine nodes of a
[indiscernible] cluster, and the total run time to calculate all optimal
circuits is about one million seconds, which amounts to about 13 days. So
here is the complete distribution of the number of gates required in optimal
implementations of the four-bit reversible functions.
So note that there are 144 functions of size 15. And interestingly, if you
remove the symmetries, there will be only five, meaning one has no symmetries
and four more have two symmetries.
Okay. So now I'll talk about multiqubit circuits. What we do in this project
is we try to synthesize optimal implementations for some small useful
transformations -- so, for example, the Toffoli gate. I mean, we certainly
need the Toffoli gate, because it seems to be used everywhere you look. It is
used in Shor's integer factoring algorithm for the reversible computation
part. It is used in fault tolerance for the syndrome detection, and it is
used -- well, it's used everywhere.
There is a known implementation of the Toffoli gate using the library of CNOT,
Hadamard, phase and T -- and the reason we use this library of gates is due to
the restrictions imposed by error correction -- but we wanted to check whether
this implementation is right, find an optimal implementation, and see if we
can do the same for some other small functions.
So again, similarly to the previously described project, we compose a library
of depth-optimal circuits. This time, we decided to look at depth, but it
could just as well be any other cost function. Whichever cost function you
want to use, we can use in the software, as long as it doesn't take too many
values.
So we search the library of halves to find an optimal implementation and
output the circuit. The interesting part here is the results. For example,
speaking of the Toffoli gate, we were able to find an implementation with
overall depth 8. I believe the one in the Nielsen and Chuang book has an
overall depth of about 10, 13, 15 -- somewhere along those lines.
Of course, here the key is the T depth, T likely being the more complex gate.
That is not necessarily the case because, for example, if you consider a
Hamiltonian used in a liquid state system, the T gate is the least expensive
of all of these gates in a non-fault tolerant, direct implementation.
But if you want to do something fault tolerantly, scalably, then T has to be
implemented differently; it's a logical gate. So the T depth here is four,
which sort of improves on what Nielsen and Chuang had, because their T depth,
I believe, was seven, but by playing with the circuit a little bit it could be
reduced to six. So we could reduce it further.
>>: [inaudible].

>> Dmitri Maslov: What's that? Excuse me?

>>: Actually, the T depth is not five there.
>> Dmitri Maslov: Yeah, because you can execute this and this in parallel, by
executing first this stage, then the second, third, fourth, and then this is
the fifth, and so on.

>>: Okay.

>> Dmitri Maslov: So it's four, but you have to kind of rewrite it a little
bit. So then we found this circuit with T depth three. It's not drawn here
very well, but this is one layer of Ts, this is the second layer of Ts, and
this is the third layer of Ts, which gives the T depth three.
We have also compared our synthesis algorithm to the one we were able to find
online, and pretty much the only one available is Dawson's implementation of
the Solovay-Kitaev. The great thing about it -- I mean, it leaves lots of
room for improvement, but the great thing about it is that it actually works
for multiqubit circuits, which is great.

So this is the scaling of Dawson's implementation compared to our
implementation. As you can see, we can go further in depth than Dawson.
But in a way, it's not the best implementation. Because if we use Dawson's
code with four levels of recursion to generate controlled H-gate, we come up
with an approximation that is 0.34 distance away from the controlled Hadamard,
which takes about two minutes and the implementation has about a thousand
gates, whereas we synthesized an exact circuit. Our exact circuit has seven
gates and it took 0.5 seconds to synthesize the circuit.
But, you know, we're not claiming we're, like, a hundred times better, because
we're not comparing apples to apples here. We're comparing apples to oranges.
But we only had oranges to compare to, and we wanted to make a comparison of
sorts.
Another interesting thing we tried: here is an optimal implementation of the
controlled P. Note that the T depth is two. However, if you consider the
controlled P as an operation on three qubits, then this circuit is an optimal
circuit implementing the controlled P, and the T depth here is actually one.
So we were surprised by this phenomenon. Then we were able to prove the
following theorem. Basically, what the theorem says is that if we have a
circuit with CNOT plus T gates, we can parallelize it to any depth you want,
subject to how many ancillae you can give me. If you can give me sufficiently
many, I can do depth one. If you don't give me as many as I want, maybe depth
two. If you give me only a few, then the depth will be maybe closer to the
number of T gates. So this is formally the statement of the lemma -- of the
theorem. This number can actually be improved, but I'm not going to talk
about that much.
Still, it's interesting that any CNOT plus T circuit can be parallelized to T
depth one.
I have to also say that part of this research -- the theorem and some of the
follow-up things that we're currently doing -- was also inspired by
discussions with Peter Selinger, who showed us a T depth one, optimal Toffoli
implementation. Well, it's obviously optimal, because there cannot be
anything better than T depth one. But it uses four ancillae.
Another thing we tried: we took the Clifford gates and the T gate, and we
looked at their controlled versions, and it turned out that all the controlled
versions can be implemented. In fact, we implemented them using our
algorithm; we implemented them optimally. So we then realized that if a
unitary U can be implemented exactly over the Clifford plus T library, then so
can the controlled version of U, the controlled U.
And if the vector of costs for the unitary U is -- well, the first coordinate
is the number of Hadamards used, the second is the number of phase gates used,
the third is the number of CNOT gates used and the fourth is the number of T
gates used -- then to calculate the numbers of Hadamard, phase, CNOT and T
gates for the controlled U, what you need to do is multiply this column vector
by this matrix.
>>: Do you have some [indiscernible] as to how you get to this here?

>> Dmitri Maslov: So what we needed to get this theorem is we took the
controlled H and implemented it optimally. And actually, if you look at this
matrix, you can sort of see what happens. So the controlled H, when
implemented optimally, requires two Hadamard gates, zero phase gates, two CNOT
gates, and four Ts. Okay, so probably I read it incorrectly; let me think for
a second. We multiply this matrix by this column vector. So --
>>: Is that columns?
>> Dmitri Maslov: Yeah, they're columns, yeah, so I had to read the column
instead, not the row. So you need two Hadamard gates for the controlled
Hadamard, you need two phase gates, you need one CNOT and two Ts, and that
gives you a controlled Hadamard. And the controlled P is described by this
vector. It's funny how the controlled P does not require P; it requires only
Ts and CNOTs. But in a way, T is the square root of P, so it makes total
sense, because there is a direct relation between a square root of a gate and
its controlled version. It's in Nielsen and Chuang, page 182. One of my
favorite circuits; I remember it.
>>: So how do you get to the point where you prove that lambda U can be
implemented exactly?

>> Dmitri Maslov: Since U can be implemented exactly, we take the circuit for
U and we control every single gate. If we control every single gate, it is
equivalent to controlling the whole transformation. And for each controlled
gate, we found its optimal implementation, and we substitute the controlled
gates with their implementations. Yeah, uh-huh.
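This construction is easy to check numerically. A sketch -- the example
circuit U = H followed by T is my choice, not one from the talk:

    import numpy as np

    w = np.exp(1j * np.pi / 4)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    T = np.diag([1, w])

    def controlled(U):
        """Two-qubit controlled-U with the control on the first qubit."""
        return np.block([[np.eye(2), np.zeros((2, 2))],
                         [np.zeros((2, 2)), U]])

    U = H @ T                               # circuit: T first, then H
    lhs = controlled(H) @ controlled(T)     # control every gate
    assert np.allclose(lhs, controlled(U))  # equals controlled-U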
Okay. So now to the single qubit unitaries, finally. Here is the original
motivation for our research, what we actually tried to accomplish before we
wrote what we wrote: we wanted to study whether a unitary U can be implemented
exactly in the Clifford plus T basis.
So we obviously want an exact implementation, and the reason for that is that
there are enough errors in quantum algorithms already. Firstly, there are
algorithmic errors: Shor's algorithm, for example, does not guarantee you the
answer with probability one. It's not [indiscernible] algorithm that always
guarantees that you multiply two integers and you actually do get the result.
It's a randomized algorithm, in a sense -- not randomized, it's quantum -- but
you get the answer with some probability. We have errors due to decoherence,
and that's the reason why we have the error correction and fault tolerance
that increase the circuit sizes, when you go from logical circuits to physical
circuits, by a factor of what must be close to a thousand, if not more than
that. In other words, it is expensive to have those decoherence errors.
And there are systematic errors in the controlling apparatus; again, they
always need to be fought with.
So what we were able to prove is the following theorem: in the single qubit
case, the set of all unitaries implementable in the Clifford plus T basis is
equivalent to the set of unitaries over this ring. The ring is an integer
extension by I and 1 over square root of 2. We furthermore conjecture that in
the N qubit case, the set of all unitaries on N qubits implementable by
circuits in the Clifford plus T basis is equivalent to the set of all
unitaries over this ring, as long as there is an ancillary qubit available
that resides in state zero.
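In symbols -- my rendering of the statements, with the ring written
explicitly -- the theorem and the conjecture read roughly:

    \text{(theorem, one qubit)}\quad
    U \text{ is exactly Clifford+}T\text{-implementable}
    \;\iff\; U_{jk} \in \mathbb{Z}\!\left[\tfrac{1}{\sqrt{2}},\, i\right]

    \text{(conjecture, } n \text{ qubits, one ancilla in } |0\rangle)\quad
    U \in \mathrm{U}(2^{n}) \text{ implementable}
    \;\iff\; U_{jk} \in \mathbb{Z}\!\left[\tfrac{1}{\sqrt{2}},\, i\right]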
I will not be able to prove the conjecture; this is what we tried to do, and
we couldn't. We made some partial progress, but not nearly enough. What I
can do is illustrate that the requirement to have an ancilla qubit is
essential, and then I will outline the proof of the theorem.
So first, let me show that the requirement to have an ancilla in the
formulation of the conjecture is important. In particular, I would like to
illustrate it with the example of the controlled T gate and a determinant
argument. If you look at the determinant of the controlled T viewed as a
matrix on two qubits, a four by four matrix, the determinant equals W. And to
remind you, W is the eighth root of unity. However, the determinants of the
gates that we can work with -- Hadamard, phase, CNOT and T -- belong to the
set plus minus I, plus minus one. Since the determinant is multiplicative,
multiplying these numbers, we can never get W. So we cannot have the
controlled T on two qubits. You could say, well, what if we want to build the
controlled T up to a global phase? A little bit more hand waving -- maybe not
necessarily hand waving, but a formal argument -- actually shows that no, you
cannot do it up to a global phase either.
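The determinant argument can be checked directly; a numpy sketch of mine,
where the single-qubit gates are embedded on two qubits, which is where the
set plus minus one, plus minus I comes from:

    import numpy as np

    w = np.exp(1j * np.pi / 4)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    P = np.diag([1, 1j])
    T = np.diag([1, w])
    CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                     [0, 0, 0, 1], [0, 0, 1, 0]])

    CT = np.diag([1, 1, 1, w])   # controlled-T on two qubits
    print(np.linalg.det(CT))     # = w, the eighth root of unity

    # A one-qubit gate g acts on two qubits as kron(g, I), so its
    # determinant gets squared; all results land in {+1, -1, +i, -i}:
    for g in (H, P, T):
        print(np.round(np.linalg.det(np.kron(g, np.eye(2))), 6))  # 1, -1, i
    print(np.linalg.det(CNOT))                                    # -1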
>>: [inaudible].

>> Dmitri Maslov: Yeah, yeah. No, no, you cannot. It's basically the same
determinant argument, but you have to spend a bit more time.
Nevertheless, the controlled T may be implemented using three qubits by this
circuit. This is unnecessarily complex; essentially, what this circuit does
is apply Kitaev's trick. If I had a pen and paper, I'd show a much simpler
circuit. So we want to implement a controlled T. What we do is a controlled
swap -- this qubit resides in state zero -- then we apply T here, and we apply
the controlled swap again. So, Kitaev's trick from '95, my other most
favorite circuit, I guess.
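A numerical check of this trick -- my sketch, with wire order control, target,
ancilla. The point is that T fixes ket zero, so with the ancilla in state
zero, the T only ever touches the swapped-in target amplitude:

    import numpy as np

    w = np.exp(1j * np.pi / 4)
    I2 = np.eye(2)
    T = np.diag([1, w])

    def cswap():    # Fredkin gate on (control, target, ancilla), 8x8
        m = np.eye(8)
        m[[5, 6]] = m[[6, 5]]   # swaps |1,0,1> <-> |1,1,0>
        return m

    circ = cswap() @ np.kron(np.kron(I2, I2), T) @ cswap()

    # Restrict to ancilla = |0> (basis indices 0, 2, 4, 6): the result is
    # exactly controlled-T, and the ancilla comes back in |0>.
    idx = [0, 2, 4, 6]
    CT = np.diag([1, 1, 1, w])
    assert np.allclose(circ[np.ix_(idx, idx)], CT)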
So here it is, just optimized a little bit. In fact, it is actually optimized
even though it doesn't look like such.

>>: So what kind of complexity goes into [indiscernible]?

>> Dmitri Maslov: Yes.

>>: It's all symmetric except for the one extra T.

>> Dmitri Maslov: Exactly.
>>: There's one extra T: all the way down at the bottom left, there's one
extra T dagger that would have been a T on the right, which would have given
you the same thing back. And so that was your T that you have on the other
one for the controlled swap.
>> Dmitri Maslov: Uh-huh.

>>: Everything else, if you look at it --

>> Dmitri Maslov: Yeah, it is symmetric. So yeah, this is kind of the
inverse of this.

>>: Exactly, except for one gate.

>> Dmitri Maslov: Or a complex conjugate, if you want. Because the complex
conjugate of CNOT is just the CNOT, the complex conjugate of T is T conjugate,
that of P is P conjugate, and the complex conjugate of H is H.
>>: Yeah.

>> Dmitri Maslov: So it is symmetric, just like this circuit. The parent is
also symmetric with respect to T.
So now to the proof of the theorem. To remind you, what we were trying to
prove is that the set of all unitaries over the ring of the integer extension
by I and 1 over square root of 2 is equivalent to the set of circuits
computable with H and T gates -- because Clifford becomes pretty much H and P,
and P can be simulated with two Ts, so H and T suffice.
So firstly, from a linear algebra book, we know that -- I'm cheating here a
little bit; in linear algebra, this is E to the I phi -- any unitary can be
written in this form, where for our purposes K runs from zero to seven. If
you look at the unitaries of this form for different values of K, they all
differ by multiplication by powers of T. So in a sense, when you have a two
by two unitary, what you have instead is just the column vector XY, because
everything else you can restore if you know the vector XY.
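My reconstruction of the form on the slide: a single qubit unitary is
determined by its first column together with a phase, and for Clifford plus T
the phase is a power of the eighth root of unity:

    U \;=\; \begin{pmatrix} x & -y^{*}\omega^{k} \\ y & x^{*}\omega^{k} \end{pmatrix},
    \qquad |x|^{2} + |y|^{2} = 1,\quad \omega = e^{i\pi/4},\quad k \in \{0, \dots, 7\}.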
So this observation helps us to move from the synthesis of unitaries to the
synthesis of states. If we can synthesize states efficiently, if we can
synthesize any state over the ring, then we can synthesize any unitary, because
unitaries can be easily restored.
So to show that we can synthesize the states, we use the notion of the
smallest denominator exponent. The smallest denominator exponent -- here it
is defined formally -- is an analog of an irreducible fraction for
[indiscernible] numbers. When we look at the ring that is an integer
extension by I and one over square root of two, we know that we have some
fractions where the denominator is a power of the square root of two.
So the SDE is just the power of the square root of two defined such that what
we're looking at is an irreducible fraction. The first lemma is very simple
to prove: if you look at a two by two matrix, then the SDEs of all elements of
this matrix are equal. The reason for that is that if this number has
denominator square root of two to the power N, and this one has square root of
two to the power of, say, N plus one, then this vector has to be normalized
and the normalization doesn't work out. It's like proving that the square
root of two is a rational number -- pretty much the same argument. Very easy.
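Formally, in my notation, for an element q of the ring:

    \operatorname{sde}(q) \;=\; \min\left\{\, k \ge 0 \;:\; (\sqrt{2})^{k}\, q \in \mathbb{Z}[\omega] \,\right\},
    \qquad \omega = e^{i\pi/4},

the smallest power of the square root of two that clears the denominator --
the analog of putting a fraction in lowest terms.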
Okay. So the next thing we do is we consider the result of the multiplication
of a vector -- and now we work with vectors only, because we know that they're
equivalent to matrices -- by H multiplied by T to the power of K. So by
multiplying one, two matrices by a column vector, we get this result.
So now I'm going to cheat a little bit, in the sense that I'm going to show
two technical results that I'm not going to prove, but I'm going to claim that
it is proved in the paper. I just don't want to prove it. The proof is not
very difficult; it's just technical. You have to study the properties of SDE
for a long time before you can prove this inequality and this equality.
So let me describe what the inequalities are and what the meaning behind them
is. We want to show that by applying H T to the power of K to a state, we
change the denominator by no more than one, either way: we can increase it by
one, we may not change it, or we may decrease it by one.

That is what this inequality states. It states that the SDE of the square of
this value -- and this value can be found here -- minus the SDE of this value,
which is this value, so the first coordinate of the vector we're looking at,
is squeezed between one and negative one. So it can increase by one, decrease
by one, or not change.
>>: Is it the case that the denominator will not change sometimes?

>> Dmitri Maslov: Yes, yeah. And a second very interesting thing that we
need to prove, and that we actually do prove, is that for any S in the set
negative one, zero, one, we can find a K such that the difference of those
SDEs -- this expression is exactly the same expression as that expression --
equals exactly S.
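Taken together, my paraphrase of the two technical results: writing (x', y')
for the state obtained by applying H T^k to (x, y),

    -1 \;\le\; \operatorname{sde}\!\left(|x'|^{2}\right) - \operatorname{sde}\!\left(|x|^{2}\right) \;\le\; 1,

and for every s in {-1, 0, 1} there exists a k in {0, ..., 7} achieving
exactly that difference.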
What this means is that by applying H T to the power of K, we can reduce the
power of the denominator by one, we can leave it unchanged, or we can increase
it by one. At this point, you're probably thinking, well, there we go, we
have an algorithm for the synthesis: we always find a K such that the
denominator is decreased by one, and then, when the power of the denominator
is small enough, we can use breadth first search over all those unitaries with
small denominator size. And this is exactly what we do.
So here is the algorithm. On the input, we have a column vector. Think about
it: technically, on the input, we have a matrix, a two by two matrix, but we
right away drop the second column of the matrix and we only look at the first
column. We do restore the second column at the very end, but it's -- I
already said too many things about a very trivial thing.
So the output is the circuit that prepares the state XY from ket zero. The
way the algorithm works is: while the SDE of the first component of the column
vector we have is greater than or equal to four, we find K such that the
application of HT to the power of K reduces the denominator of X, and we
substitute HT to the power of K multiplied by the column vector XY into XY.

So the denominator of the first component of this vector is smaller -- smaller
exactly by one -- than the denominator of this one. In a sense, for the SDE
of the square of the first component of the vector greater than or equal to
four, the set of Clifford plus T circuits on the single qubit is sort of flat,
and at every level, you can predict what is happening: you can increase the
denominator, you can decrease it, or it may not change.
So it's very predictable; the behavior is very predictable. And finally, we
brute force, when we cannot apply this step any more because the SDE is less
than four: we basically brute force the implementations of all unitaries such
that the SDE is less than four. There are probably close to 20,000 of those,
give or take a few. So they can be easily brute forced and stored.
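Here is a self-contained sketch of the reduction loop -- my implementation
under stated assumptions, not the authors' code. Ring elements
a + b*w + c*w^2 + d*w^3 are stored as integer quadruples with w^4 = -1, and a
state carries a common denominator exponent k, so its entries are quadruples
divided by sqrt(2)^k. The brute-forced base case for SDE below four is
omitted; the loop just reports when it gets there.

    # Exact arithmetic in Z[w], w = exp(i*pi/4), w^4 = -1.

    def mul(u, v):
        a, b, c, d = u
        e, f, g, h = v
        return (a*e - b*h - c*g - d*f,
                a*f + b*e - c*h - d*g,
                a*g + b*f + c*e - d*h,
                a*h + b*g + c*f + d*e)

    def conj(u):                  # complex conjugation sends w to -w^3
        a, b, c, d = u
        return (a, -d, -c, -b)

    def div_sqrt2(u):             # u / sqrt(2) in Z[w], or None if impossible
        a, b, c, d = u
        if (a - c) % 2 or (b - d) % 2:
            return None
        return ((b - d) // 2, (a + c) // 2, (b + d) // 2, (c - a) // 2)

    def sde(u, k):                # smallest denominator exponent of u/sqrt(2)^k
        while k > 0:
            v = div_sqrt2(u)
            if v is None:
                break
            u, k = v, k - 1
        return k

    def add(u, v): return tuple(p + q for p, q in zip(u, v))
    def sub(u, v): return tuple(p - q for p, q in zip(u, v))

    def apply_T(state, j=1):      # y -> w^j * y
        x, y, k = state
        for _ in range(j % 8):
            a, b, c, d = y
            y = (-d, a, b, c)
        return (x, y, k)

    def apply_H(state):           # (x, y) -> ((x+y)/sqrt2, (x-y)/sqrt2)
        x, y, k = state
        return (add(x, y), sub(x, y), k + 1)

    def sde_x2(state):            # sde(|x|^2), the quantity the loop tracks
        x, _, k = state
        return sde(mul(x, conj(x)), 2 * k)

    # Build a test state by applying a known H/T circuit to |0>:
    state = ((1, 0, 0, 0), (0, 0, 0, 0), 0)
    for g in "HTHTTTHTH":
        state = apply_H(state) if g == "H" else apply_T(state)

    # The loop from the talk: while sde(|x|^2) >= 4, some H T^j strictly
    # decreases it; the inverses of the recorded gates, in reverse order,
    # form the synthesized state-preparation circuit.
    layers = []
    while sde_x2(state) >= 4:
        for j in range(8):
            cand = apply_H(apply_T(state, j))
            if sde_x2(cand) < sde_x2(state):
                layers.append(j)
                state = cand
                break
        else:
            raise AssertionError("theory promises a reducing j exists")
    print("reduced to sde(|x|^2) =", sde_x2(state), "after", len(layers), "H layers")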
So let me prove H optimality; in other words, let me prove that the number of
Hadamard gates is actually optimal in our circuits. To increase or decrease
the SDE by one, we need precisely one Hadamard gate -- that is proved. So the
set of all unitaries with SDE of X squared equal to eight equals the set of H
optimal circuits with seven Hadamard gates. We ran a computer search to
verify that this statement is correct, so this statement is verified by a
computer search.
As such, we conclude that this is our N equals one base case, so to speak, of
the mathematical induction, and this is the induction step. So we use the
inductive proof to show that the number of Hadamard gates required to
implement a particular unitary equals the SDE of X squared -- where X is the
top entry, U 11 -- minus one. That's the number of Hadamard gates. And
apparently, our algorithm matches exactly this count.
We furthermore looked at possible T optimality. We cannot prove T optimality,
but what we can do, using breadth first search, is synthesize all optimal
circuits with up to 13 gates and then resynthesize them using our algorithm;
it turned out that we got all the T counts right. So we do believe that our T
counts are also optimal.
We cannot say anything about the P count. So the phase gates, the number of
phase gates we use may be suboptimal. We just don't know. We don't have a
feeling. We didn't try to study that. But H optimality, we can prove, is
there. The T optimality seems to be the case.
>>: So now, do you define T count as essentially the T count of T to the
power of K being either zero or one?
>> Dmitri Maslov: The T count of T to the power of K equals one if K is odd
and zero if K is even, because T to the third is PT -- phase times T, yeah.
So yeah, we do this reduction. It's just that, for the proof, it's easier to
operate with powers of T than with P times T, or Zed times T, or Zed times P
times T, because it's much cleaner.
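This parity rule is just the identities T^2 = P and T^3 = PT in matrix form; a
quick numpy check (my sketch):

    import numpy as np

    w = np.exp(1j * np.pi / 4)
    T, P = np.diag([1, w]), np.diag([1, 1j])

    assert np.allclose(np.linalg.matrix_power(T, 2), P)      # T^2 = P:  T count 0
    assert np.allclose(np.linalg.matrix_power(T, 3), P @ T)  # T^3 = PT: T count 1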
Another thing -- I didn't have the time to make a slide for it -- is that the
complexity of our algorithm is linear in the number of gates in the circuit.
What this means is that this algorithm, among all synthesis algorithms, is
asymptotically optimal. I mean, the mere time it takes to write down a
circuit with N gates is at least order of N, and order of N is the complexity
of our synthesis algorithm. So it is efficient.
>>: So the depth of the circuit is not an input parameter?

>> Dmitri Maslov: No, our --

>>: You can say it's --

>> Dmitri Maslov: Our input is the unitary. So once we have the unitary, we
can synthesize. In fact, what we do in experiments -- and I believe this is
one of the next slides, yeah, so let me get to it; it's probably one after
that.
So here is a similar comparison to Dawson's implementation -- a software to
software comparison. In a way, again, we're comparing apples to oranges. So
this is the behavior of Dawson's implementation of the Solovay-Kitaev: this is
zero iterations, one iteration, two iterations, three, four, five, six, seven.
So you can see that the error did not improve here. The scales are
logarithmic -- and sorry about the small font size; it was generated by
automatic software, and I did not increase the font size yet. This is the
result of using the double data format in C++: it does not have enough
precision to go beyond errors on the order of ten to the minus 8 or ten to the
minus 9. It just flattens out there. It doesn't work.
We, however, used multiprecision arithmetic, so we were able to go far beyond.
Actually, I'll show you on the next slide that much more could be shown on
this picture. The smallest errors that we have circuits for are on the order
of ten to the minus 50, and ten to the minus 50 is likely impractical --
impractical in the sense that why would you need something that is this
precise.
>>: So the [indiscernible] being approximated are just the RZs, the small
rotations?
>> Dmitri Maslov: Yeah, we approximate those rotations, those four rotations.
We tried other things, too; the graphs look very similar, so we just show the
few that, you know, look nice, and everything else looks about the same. I
mean, it's a logarithmic scale. So even if something is different by 30 to 40
percent, which seems to be the case for a random unitary, then this point is
going to be maybe, yeah, here. So they're about the same. This actually very
nicely defines the range of points; that's where they're going to land, all of
them, for a random unitary. But again, our database was larger -- you can see
that for zero iterations, we're better -- and our implementation was more
efficient. So that's partially why this slope is much better than that slope.
Also, another experiment we did is we took the circuits synthesized by
Dawson's implementation of the Solovay-Kitaev, computed the unitary, and
resynthesized the unitary using our algorithm. The reduction in the number of
gates was only on the order of about 50 percent.
So the approximations are not efficient, but their circuits are actually
fairly good -- well, 50 percent away from what we believe should be fairly
close to optimal, in whatever metric of optimality you choose. But this graph
shows that finding a good approximation to a unitary is a much more difficult
task than, given a good approximation, finding a circuit for it. Finding a
circuit is easy; it can be done in time linear in the number of gates in the
optimal circuit -- I mean, it cannot be better than that. But finding an
efficient approximation is very hard.
Okay, so here is an example of the experimental results. Here it shows the
error down to ten to the minus 15. If I click here, we actually have more
circuits. So here are the implementations of the R Zed of two pi divided by
two to the power of N, for N between four and 30. Each of them approximates
down to ten to the power of minus 50, approximately.
So if you want to have a circuit that approximates to a very small error, then
you have to pay a heavy price in, for example, the number of Ts: 1.7 million.
So actually, we also have circuit files here, so if you click here -- it's a
fairly small circuit. The number of gates in this circuit is only 73,
counting all gates; 28 T gates. So this is the circuit. Okay, not displayed
very well, so let me switch the browser. Okay, so here is the same page, and
this is the circuit. You can actually view this circuit using the software
for viewing circuits that we also have on the group web page. It's called
QCViewer. You can download it and play with it if you want. It's not the
best for displaying circuits with millions of gates on a single qubit, but if
you have circuits with tens of gates on a few qubits, then it's much nicer.
In fact, all of the circuits that I showed today were generated automatically
by this QCViewer software.
Okay. So let me go back to the presentation, what's left of it.
Acknowledgments: I'd like to acknowledge the help from my students, Matt Amy,
who is a Master's student, and Vadym Kliuchnikov, who is a Ph.D. student; Oleg
Golubitsky, one of the co-authors -- he's an ACM programming contest champion
and a really, extremely good coder, which explains why the numbers in the
reversible logic synthesis are so low, because he helped tremendously with the
coding, and he's great; then Professor Mike Mosca, who is heading the quantum
circuits research group in my absence at the University of Waterloo; and the
NSF independent research and development program that allows me to do research
despite being an administrator. Okay. Thank you.
>>: So I'd like to know a little more about the high precision arithmetic
package you used to get those big denominators.
>> Dmitri Maslov: Yeah, sure. I can just send you the link to where you can
download it from, if that's what you want.
>>: No, I just wondered what its attributes are. Is it something you
implemented yourself?

>> Dmitri Maslov: No, no. We downloaded it from online; it's just available.
It's a C/C++ library for dealing with large numbers, and it's just that. It's
one of those libraries. You don't have to use that particular library,
because there are so many different ones.
>>: There's nothing --
>> Dmitri Maslov: I even wrote one myself. It's so not good that it's not
available anywhere. I can't recall where we downloaded it from. Maybe it's
yours.

>>: [indiscernible]. No, no, this is internal.

>> Dmitri Maslov: Oh, it's internal. Okay, yeah.

>>: Oh, it may still be yours.

>> Dmitri Maslov: I don't remember which one it is.

>>: There's a thing called GMP from the open source that's fairly popular.

>> Dmitri Maslov: I think that's the one we used. Yeah, I think that's the
one we used, yeah.
>>: Now we can go back and -- yeah. Do you have any thoughts on how to prove
T optimality?

>> Dmitri Maslov: No. We didn't try to prove it. We didn't want to step on
someone else's turf.
>>: It seems like when we compared circuits, we have a few more H -- our
[indiscernible] has a little more chaff, I think. But the Ts are the same.
So it seems that there might be maybe some modifications that it cannot
perform.
>> Dmitri Maslov: Yeah, there may be one, yeah. You see, we concentrated on
the Hadamard, because that's how we looked at the matrices. We looked at the
denominator, and the denominator is defined by Hadamard. And between two
Hadamards, you can only squeeze one T -- no more than one T. So yeah,
whatever that gives.

>>: It's amazing how much algebraic structure there is in all this.
>> Dmitri Maslov: The single qubit case, in my opinion, is super trivial.
But the two qubit case -- we tried to prove the theorem that we had in the two
qubit case, and it's much more difficult. For example, because you need an
ancilla that you don't need in the single qubit case. And, for what it's
worth, you have entanglement, which may come into play, and it can only
complicate things.
>>: Yeah, so you're looking at some sort of projection or some sort of morph
of the algebra of the three qubit case.
>> Dmitri Maslov: Yeah, I think it's much more difficult. The single qubit
case is trivial; the two qubit case is very difficult, in my opinion. I mean,
we couldn't do much. We were, however -- if you would like to know -- able to
prove that unitary synthesis in the N qubit case can be reduced to state
synthesis. We could do that step, but that's as far as we went.
And the proof is far from trivial.
>>: But unitary synthesis and state synthesis are really equivalent.

>> Dmitri Maslov: Yeah, in a way they are equivalent, but state synthesis is
simpler, because all you need is any matrix with the column that you want to
have, whereas when you have a matrix, you need the matrix, that exact matrix.
So in a way, we reduce the problem of size 4 to the power of N to the problem
of size 2 to the power of N, but that wasn't the most difficult part. We
don't believe that part is difficult. We believe that solving the problem
with 2 to the power of N variables, that is difficult.
>>: So when you experimented with the matrices [indiscernible], right?
>> Dmitri Maslov: No. What we -- yeah, we did both. I mean, we used a
Solovay-Kitaev that we implemented ourselves. So when you have a unitary that
is not a unitary over the ring that we have -- and if it's a two by two
unitary, we know that, provably, we cannot implement this unitary exactly --
we need to approximate it.
For approximation, we use Solovay-Kitaev. It's just that the code we use is
the code that we wrote for Solovay-Kitaev, not Dawson's code, because when we
looked at it, we found that we could write something better, so we wrote
something better -- which we believe is better.
And then, if we're given a unitary, we first look at it and ask: is it a
unitary over our ring? If it is, we synthesize it using the algorithm. If
it's not, we use our Solovay-Kitaev to come up with an approximating unitary.
It comes up with a circuit, but we don't care about the circuit; we only care
about the unitary. And then we use the unitary to synthesize the circuit.
So in a way, yeah, T approx is the time it takes to approximate a unitary,
because this unitary is not a unitary in the ring; it lies outside the ring.
So T approx is the time spent by our implementation of Solovay-Kitaev, and T
decomposition is the time spent to actually synthesize the circuit using our
algorithm.
So the software consists of two parts.
>>: I'm curious about this. When you have a unitary over your ring and the
SDE is rather large, then the numerators, the integers, would also be quite
large.
>> Dmitri Maslov: Yes.
>>: Did you have to use -- did you ever have to use arbitrary precision
integers?

>> Dmitri Maslov: We use our own implementation that's symbolic. It's fully
symbolic because, you see, for example, here we have 15,000 Hadamard gates.
15,000 Hadamard gates implies square root of two to the power 15,292, plus or
minus one, approximately. So we just needed our own arithmetic for that, and
we used our own arithmetic.
>>: So you essentially have an algebraic package that deals with Z extended
by those roots?

>> Dmitri Maslov: Yeah, exactly.

>>: All right, yeah. Or it's the Gaussian integers extended by a square root
of two.

>> Dmitri Maslov: Exactly, yeah.

>>: Represented by a three-element vector, right?

>> Dmitri Maslov: Yeah, yeah. Actually, we represent by four.

>>: You could do four, yeah.

>> Dmitri Maslov: Yeah, we use four, uh-huh.
>>: So it seems like if you have software that does Z extended by one over
square root of two, then infinite precision integer arithmetic would be a
subset of what your package does.

>> Dmitri Maslov: No, because it's very specialized. It doesn't do any of
the multiplication, yeah.

>>: Oh, I was wrong.

>> Dmitri Maslov: All right. It's suited only for our needs. It has only a
limited capability. Very limited.

>> Krysta Svore: Any other questions? Let's thank Dmitri.