>> Laurent Visconti: Well, thank you for coming. I'd like to introduce Marc Millstone. He is visiting us from Courant Institute and Lawrence Berkeley National Lab.

>> Marc Millstone: Thank you very much. First, thank you all for coming. And I want to say this is joint work with my advisor at Courant, Michael Overton, and Juan Meza and Chao Yang at Lawrence Berkeley National Lab.

So, first, to explain the problem. This is a very different style of problem: we're basically working on developing an algorithm for one objective function. We have one problem in the world, and we're trying to exploit properties of this function. I need to explain to you why this problem is important.

So, 1830, Auguste Comte: "Every attempt to employ mathematical methods in the study of chemical questions must be considered profoundly irrational and contrary to the spirit of chemistry." He then goes on to say, "This would result in widespread degeneration of that science." In 1830, math didn't have anything to do with chemistry.

Now, let's look at about 10 years ago. There are plenty more quotes like this, but now computational methods mainly supplement experimentally attained information, and they're soon expected to increasingly supersede it.

We want to take a molecule and start computing solutions on the computer first. It's hard to build large molecules; it's hard to experiment with many different types of large molecules. If you use computers to give you a rough idea of the solution first, you can then really, really understand what's going on in the lab later on.

So in this vein, science and computation are very closely linked to one another.

Scientists want to understand bigger and bigger problems, and computer scientists, in the past, have been trying to write algorithms that can scale more and more. I think right now the scientists are winning: they want to look at larger problems than we can actually compute solutions for.

For example, cadmium selenide, a critical component in making fuel cells and solar cells: here we have a thousand to 5,000 electrons. We may want to model entire solar cells, on the order of 50,000 electrons. And we can look at integrated circuits; these are on the order of a million electrons. We want to understand big problems and predict their properties on the computer.

So in some ways, mathematically this problem's solved. It's solved by the Schrodinger equation. Here's the many-body problem. I'll come over here, so I can actually see. This is the many-body Schrodinger equation. I'll put it up so you understand what's going on. H is the Hamiltonian, and psi here is the many-body wave function. We're not going to need this very much, but it gives you a flavor of the problems we're dealing with.

The Hamiltonian is a sum of some number of Laplacians (delta is the Laplacian), plus a potential due to the nuclei at positions R hat J, plus a potential for the electron-electron interaction. There are three terms here. If we were to solve this, this is an eigenvalue problem. We could solve it, and it would contain all the information of the system; any property of the system is given by psi. The norm of psi squared gives a probability density.

What do we mean here? Well, psi here gives: what's the probability of finding electron one in a small region near R1, and electron two in a small region near R2, and so on. We're always talking in terms of probabilities with these problems. Lambda here, the eigenvalue, represents the energy. So we're not really done yet, though. We wish we could be done; this theoretically explains everything. However, if we were to discretize this on a 32 by 32 by 32 grid, then just for five electrons the Hamiltonian would have a dimension of about 3.5 times 10 to the 22nd. Clearly not doable.
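Just to make that dimension concrete, here's a quick back-of-the-envelope check (my own illustration, not part of the talk): the discretized many-body wave function needs one value per combination of electron positions, so its dimension is the grid size raised to the number of electrons.

```python
# Size of the discretized many-body wavefunction: 5 electrons, each
# coordinate living on a 32 x 32 x 32 grid.
grid_points = 32 ** 3              # 32,768 grid points per electron
n_electrons = 5
dimension = grid_points ** n_electrons
print(f"{dimension:.2e}")          # on the order of 10**22 -- clearly not doable
```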

So how are we going to start solving the problem? And again, we want to predict the behavior of molecules and many different properties computationally.

So the Kohn-Sham equations are one such approach to these problems. There's another approach called Hartree-Fock -- a different idea, a different type of approximation. The Kohn-Sham equations are what we care about.

So let's give some history. 1964: Hohenberg and Kohn give a purely theoretical mathematical result. They said that at ground state, meaning at the minimum energy level, the total energy is a functional not of this many-body wave function but of what's called the charge density. Observe that in their original proof the charge density is defined in terms of the many-body wave function, so it doesn't really help us; it's purely theoretical. But the difference now is that the wave function says: what's the probability of finding electron one near R1, electron two near R2, etc. The charge density says: what's the probability of finding any electron in a small space near R? We give up knowing which electron it is. All we care about is that there's an electron nearby -- what's that probability? So we're sort of treating it all as a mass; we're not really caring which one it is anymore. And from this they show that if you know the charge density at ground state, you can derive all the other properties of the molecule. You can derive the forces due to movement; you can derive the magnetic forces. All these different properties you can derive from the charge density at ground state.

I should add that this only applies at the minimum energy.

Away from the minimum energy, this charge density doesn't match up at all. So now we have a theoretical basis for what we're doing. A year later, Kohn and Sham -- and Kohn won the Nobel Prize for this in the 1990s -- proposed a practical formulation. And that uses N_E single-particle orthogonal wave functions psi_i. Okay.

And these do not interact. The problem with the many-body problem is that the electrons interact with one another; these do not. And what they determined is that the electron-electron interactions are modeled by what's called the exchange-correlation energy. This is a term they prove exists in theory, but moreover they go one step further: they give an expression which matches experiment. So all the magic happens in this exchange-correlation energy term. It models the electron-electron interactions. Another important thing to note here: I'm going to call these wave functions, and you're going to think they correspond to something truly physically meaningful. In fact, they don't. These are entirely artificial constructs. All that matters is the charge density here; the charge densities will match up at the minimum energy level.

The wave functions have no physical meaning. Material scientists and chemists have given them some sort of meaningful terms to think about, but they're totally artificially constructed wave functions. They don't correspond to a single electron or anything like that.

So let's go over these equations. So now we're assuming the psi_i are orthogonal. The charge density is given by the sum from 1 to N_E of the squares of the psi_i. And what we also know is that if we take the one-norm of this -- the integral -- it always equals the number of electrons.
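As a small sketch of those definitions (my own toy code, with a random orthonormal X standing in for the discretized orbitals): the charge density is the row-wise sum of squares, and its "integral" recovers the electron count.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_e = 100, 5

# Random orthonormal columns stand in for the discretized orbitals psi_i.
X, _ = np.linalg.qr(rng.standard_normal((n_grid, n_e)))

# Charge density: rho(r_j) = sum_i psi_i(r_j)**2, i.e. row-wise sums of squares.
rho = np.sum(X**2, axis=1)

# With unit grid weights, the "integral" of rho is the number of electrons.
print(round(rho.sum(), 6))   # 5.0
```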

Now, this is where the magic happens. This is called the exchange-correlation energy. It's going to be an integral of this charge density, which we'll see often -- rho -- times the magical term. There are many approximations here, but the key thing is that we can compute this analytically. It depends on the charge density, not the wave functions themselves, and we can compute it efficiently. But moreover, this value really does match experiment closely.

And so, finally -- I'm going to show this equation multiple times and explain it multiple different times -- this gives us the Kohn-Sham energy. We have the three terms here. The first is what's called the kinetic energy: it's the wave functions' inner product with the Laplacian. We have the energy due to the ions, so the nuclei.

So V ion is a function of location only. It's a fixed variable.

It's determined by the molecule we currently care about. And then we have what's called the Hartree energy, which tells how the charge density at one point interacts with the charge density at another point in space. It depends only on the charge density, not the wave functions themselves. Then finally we have the exchange-correlation energy.

There are four terms here. We can generally group these two together, because they're local interactions. The Laplacian by definition is a very local operator: nearby grid points only deal with nearby grid points. The ion potential only deals with electrons which are nearby. And these are more the global properties, where areas in space interact with areas much further away from them.

And so now, how do we pose this problem? We want to minimize the Kohn-Sham energy subject to orthogonality constraints. And this is really the problem. Now how do we solve it?

So one way is to write down the KKT conditions: write down the Lagrangian and take the derivatives. And now H_KS is the Kohn-Sham Hamiltonian; I'll write it down depending on the context I need. It's the Laplacian, plus a term that's the potential for the ionic energy, plus a convolution which represents that Hartree energy I pointed out, and the derivative of the exchange correlation.

So often we just lump them all together: you have the Laplacian plus a potential.

Observe that this is only a function of rho, and rho is a function of psi: rho equals the summation of psi_i times psi_i. So we can solve this as a non -- sorry, as a nonlinear eigenvector problem. And in fact, that's how we'll talk about this later; that's how most people solve it.

We work directly with the KKT conditions. I'll go through this again. This is a nonlinear eigenvector problem, which is very different from standard linear eigenvalue problems. I'll come back to these many times; interrupt me if you have questions. Now, a generalization which I'm going to work with: in the previous slide I said everything's orthogonal. What if we want to remove this orthogonality constraint?

Well, we deal with it with an overlap matrix. We build a matrix S such that the j,k-th element is the inner product of psi_j and psi_k. The charge density gets multiplied through by an S inverse term, so rho changes a little bit.
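Here's a minimal sketch of the overlap matrix and the modified density (my own, with random vectors). A nice sanity check: even without orthogonality, diag(X S^{-1} X^T) still sums to the number of electrons, since trace(S^{-1} X^T X) = trace(I).

```python
import numpy as np

rng = np.random.default_rng(1)
n_grid, n_e = 50, 3
X = rng.standard_normal((n_grid, n_e))      # non-orthogonal "orbitals"

S = X.T @ X                                  # overlap matrix: S[j, k] = <psi_j, psi_k>
# Modified charge density: rho = diag(X S^{-1} X^T).
rho = np.einsum('ij,jk,ik->i', X, np.linalg.inv(S), X)

print(round(rho.sum(), 6))   # 3.0 -- still integrates to the electron count
```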

And now the energy changes a little bit as the rhos change here. We multiply an extra S inverse in front of the kinetic energy. This is much more complicated than the previous equations. Why would we do it? Because it now has a very special property -- and you'll see this more when I discretize, and I'll explain it again. Given two sets of vectors that span the same space, the energy is constant.

That means it's not the actual basis vectors themselves we care about. We only care about the span of the basis vectors. And this is the key property here. This is why we're going to do this. This allows us to exploit properties of this function to allow better scaling algorithms. So now we want to discretize and optimize.

We have to discretize the problem and run an optimization algorithm. Let's talk about how we're going to do that. Slow down a little bit. First step: we have to choose a discretization. What's standardly used, at least in the materials science community and the chemistry community, is called plane waves.

Basically every psi at -- you know, at the grid points -- is equal to a summation of exponentials.

This is very physically realistic. It's naturally orthogonal. If you want a finer grid, you just take more elements in this sum. You can get the best approximation you want because of the Nyquist frequency: you know exactly how many terms to take for a given resolution. So this is one way of doing it, and an advantage also is that you can increase resolution.

This convolution basically becomes a matrix multiplication, a DFT, because this is in some ways just a Fourier transform. So this convolution becomes very easy. But the disadvantage here is that these psi's are inherently nonlocal: these exponentials span the entire space you care about. And remember, I've already mentioned the word locality; I'll define it later. What we really want is a representation such that nearby grid points only depend on nearby grid points.

Because the Laplacian here will be dense. Another approach is the finite difference approach. Here, this is just 1-D of course, but a given element of the derivative gets the standard finite difference approximation. So the advantage here is that the Laplacian is well known -- what it looks like, out to multiple orders. It's highly localized and everything's very sparse.
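For concreteness, here's a toy second-order 1-D finite-difference Laplacian (my own sketch; the talk uses stencils up to eighth order, which just adds a few more bands but keeps the same sparsity):

```python
import numpy as np

def laplacian_1d(n, h=1.0):
    """Second-order finite-difference -d^2/dx^2 on n grid points, spacing h."""
    return (2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

L = laplacian_1d(6)
# Locality: an interior row touches only itself and its immediate neighbors.
print(np.count_nonzero(L[2]))        # 3

# Interior rows applied to x**2 reproduce -d^2/dx^2 (x**2) = -2 exactly.
x = np.arange(6.0)
print((L @ x**2)[1:-1])              # [-2. -2. -2. -2.]
```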

The disadvantage here is that this convolution is going to require a solution to a Poisson problem. So we'll have to use an iterative algorithm to solve this convolution. But that's fine; we know how to do that.

We're going to focus on the finite difference approach. So now let's talk about the discretized problem. Here's the problem; this defines the finite-dimensional version. X here is a long, skinny matrix: N is the number of grid points, N_E is the number of electrons we care about. And again, I'm saying electrons, but I don't really mean electrons; I mean these imaginary things that were invented for the Kohn-Sham problem.

So the first term is the energy due to the kinetic and ion terms. Well, it's an (X star X) inverse times X star times (Laplacian plus ionic potential) times X term. The Laplacian here is highly sparse; we generally take up to about an eighth-order Laplacian approximation, so it's not just the [-1, 2, -1] stencil -- it's a little more than that, but it's still sparse. V ion is governed by the molecule in question, and it's actually just a diagonal matrix.

So these are the local components. The Hartree energy becomes rho transpose times L pseudo-inverse times rho. This is where we simplify: what we actually mean is that we're solving the Poisson problem here to get this term. And finally, the exchange-correlation energy is rho transpose times this thing we know exists, that they give us a good value for. So this is our energy. It's very big; it's very computationally heavy. We're talking grid points on the order of hundreds of thousands, and we want to scale this to hundreds of thousands of electrons. So we have to exploit sparsity. And I want to point out again: these are the local terms -- the kinetic and ionic, as I group them together -- and these are the nonlocal terms. And the rho here -- this isn't actually how you compute it, but you can represent it easily as just the diagonal of this matrix here.

So are there any questions? Hopefully the equations are starting to make sense. This is the only equation that we care about.

So I'm going to give -- sorry, one more thing. I said before that in the nondiscretized version, the chemists' version, everything depends only on the span of the basis, not the basis itself. You can do the calculation here. You see that for any invertible matrix G -- invertible, not necessarily orthonormal -- E of X times G equals E of X. Let's talk about it for a second. The rho term is easy; everything cancels out. You can trust me on it. You get G transpose X transpose X G, inverted, you flip them, and they cancel out here and there. And in this term, the middle stuff cancels out and you get the trace of G inverse times something, something, something, times X times G. And the trace is invariant under similarity transformations. So rho is the same -- just verify it by computation; everything cancels -- and the trace is the same because of the similarity transform. Now we have some equations, and we understand this very unique property of this function.
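The invariance is easy to check numerically. Here's a toy version of that calculation (entirely my own; A stands in for the discretized Laplacian-plus-potential, and I use a simple density-only stand-in for the Hartree-like term), showing E(XG) = E(X) for a random invertible G:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 30, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # stands in for Laplacian + V_ion

def energy(X):
    Sinv = np.linalg.inv(X.T @ X)                     # inverse overlap matrix
    rho = np.einsum('ij,jk,ik->i', X, Sinv, X)        # diag(X S^{-1} X^T)
    # trace term + a toy density-only term (stand-in for the Hartree energy)
    return np.trace(Sinv @ (X.T @ A @ X)) + 0.5 * (rho @ rho)

X = rng.standard_normal((n, k))
G = rng.standard_normal((k, k))                       # invertible with probability 1
print(np.isclose(energy(X), energy(X @ G)))           # True
```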

And just to understand this a little better, let's look at a simplified version. My code -- the first part of my code, the examples -- will be based on this problem right here. We have a discretized Laplacian. Again, the signs don't matter for my model problem; I don't need to match experiment.

So let's get rid of the negative sign. And we have a term for the potential here, a diagonal matrix. And we have the Hartree energy. I've gotten rid of the exchange correlation for now. This is what I call the model problem. And I was going to ask: does this first term look familiar to anybody? Not the second term -- does the first term look familiar? Well, linear algebra 101: the Rayleigh quotient, X transpose A X over X transpose X. If we minimize this, the minimum is the smallest eigenvalue of A, and X is the eigenvector. We can generalize this; I think it's the Ky Fan theorem. If X has K columns, the minimum value of this is the sum of the first K eigenvalues, and it gives you a minimizer that spans the space of the first K eigenvectors. So now we're optimizing an eigenvalue problem -- this is a nonlinear eigenvalue problem, as I said.
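Here's a numeric check of that statement (my own sketch; `energy` below is the generalized Rayleigh quotient trace): the minimum value is the sum of the K smallest eigenvalues, and any basis of that eigenspace -- orthogonal or not -- achieves it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # symmetric test matrix

w, V = np.linalg.eigh(A)                              # eigenvalues, ascending order

def energy(X):
    """Generalized Rayleigh quotient: trace((X^T X)^{-1} X^T A X)."""
    return np.trace(np.linalg.solve(X.T @ X, X.T @ A @ X))

# Any (even non-orthogonal) basis of the bottom-k eigenspace attains the
# minimum: the sum of the k smallest eigenvalues.
X = V[:, :k] @ rng.standard_normal((k, k))            # random mixing of that span
print(np.isclose(energy(X), w[:k].sum()))             # True
```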

So this is very similar; this is just the generalized Rayleigh quotient. Now we have to compute a solution. This is where we start getting to the interesting stuff. The current state-of-the-art way -- I think it accounted for something like 75 percent of all compute time last year -- is a variation of this algorithm, called the self-consistent field iteration. It deals with the KKT conditions directly and tries to solve the nonlinear eigenvector problem. Another approach, which is actually more recent, is trying to minimize the energy directly.

Either the constrained version with the orthogonality constraints, or the unconstrained version with the extra S inverse in front.

So let's talk about SCF first, because that's what people use now. It's actually very simple: it's a fixed-point iteration. We evaluate the Hamiltonian at a given rho, so this is a fixed matrix. We solve a linear eigenvalue problem, get the energy, compute the new rho. And because we're solving eigenvector problems, the iterates are automatically orthogonal; you inherit the orthogonality by definition. We solve the eigenvector problem, compute a new rho, compute the new potential. Now we have a new Hamiltonian, so we solve it again. We just do this over and over: we solve a sequence of linear eigenvalue problems. And this is mostly the state of the art. So what's the scaling here?

It's O of N cubed to compute the eigenvector problem. It's orthogonal, so it's cubic work in the number of electrons.
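The SCF loop is easy to sketch. This is my own toy 1-D version (not the speaker's code): a tridiagonal Laplacian, a fixed ionic potential, and a deliberately weak density-dependent term so that the fixed-point iteration converges with simple mixing.

```python
import numpy as np

n, n_e = 40, 2

Lap = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian stencil
V_ion = np.diag(np.linspace(-1.0, 1.0, n))             # fixed ionic potential
rho = np.full(n, n_e / n)                              # initial guess: uniform density

for it in range(200):
    H = Lap + V_ion + np.diag(0.05 * rho)   # Hamiltonian frozen at the current rho
    w, V = np.linalg.eigh(H)                # solve a *linear* eigenvalue problem
    X = V[:, :n_e]                          # lowest n_e orbitals (orthonormal for free)
    rho_new = np.sum(X**2, axis=1)          # new density
    if np.linalg.norm(rho_new - rho) < 1e-10:
        break                               # self-consistency reached
    rho = 0.7*rho + 0.3*rho_new             # simple "charge mixing"

print(it, np.linalg.norm(rho_new - rho) < 1e-10)
```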

The convergence is slow. It's not monotonic. In fact, it doesn't even have to converge at all: you can give two-by-two matrices for which it diverges, oscillating between two limit points such that neither limit point is a solution to the problem.

And, again, we're solving the KKT conditions, not minimizing the energy directly, so the energy can fluctuate crazily up and down. At least as an optimization person, I don't like that. I like a sequential decrease in energy.

And there are a lot of heuristics that make this go faster, called charge mixing. The idea is we just average the previous iterates, and it makes it go faster.

This is the state of the art. Note it's inherently orthogonal. In linear algebra, if things are orthogonal, you're inherently stuck with O of N cubed scaling; you have to do a form of Gram-Schmidt or something.

Let's talk about CG: nonlinear conjugate gradients. I used Fletcher-Reeves not because it's my favorite algorithm but because it's what the field uses, and for comparison purposes all the other algorithms use Fletcher-Reeves CG -- although it's known to converge slowly in some instances compared to [indiscernible] and other variations.

So, again, what do we need here? We have to compute a gradient and a search direction, and we have to do a line search. Remember, CG requires a very specific line search: it needs to meet the strong Wolfe conditions with a very small constant -- it has to be less than a half. And then this is a very standard algorithm. You'll see it multiple times, because we're going to modify this algorithm; we're going to add to it.
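Here's the vanilla Fletcher-Reeves skeleton (my own sketch on a convex quadratic, where an exact line search is available in closed form; a Kohn-Sham code would use a Wolfe line search instead):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
A = rng.standard_normal((n, n)); A = A @ A.T + n*np.eye(n)   # SPD test matrix
b = rng.standard_normal(n)

# Minimize f(x) = 0.5 x^T A x - b^T x with Fletcher-Reeves CG.
x = np.zeros(n)
g = A @ x - b                                 # gradient
p = -g                                        # first direction: steepest descent
for _ in range(2*n):
    alpha = -(g @ p) / (p @ A @ p)            # exact line search on a quadratic
    x = x + alpha*p
    g_new = A @ x - b
    if np.linalg.norm(g_new) < 1e-12:
        break
    beta = (g_new @ g_new) / (g @ g)          # Fletcher-Reeves coefficient
    p = -g_new + beta*p                       # new search direction
    g = g_new

print(np.linalg.norm(A @ x - b) < 1e-10)      # True: solved A x = b
```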

So I think the first paper to use CG on the Kohn-Sham problem was in the '90s. It was recent that people even considered it possible to minimize the energy directly.

So I've said these words a lot: locality, locality, sparsity. What does that mean? We want to design things that scale. I've already said the finite-difference Laplacian is sparse, and the potentials are sparse.

Nature is actually pretty nice to us. The easiest way to say it: nearby things only interact with nearby things. You can actually prove theoretically that for an insulator, as two regions get further and further apart from each other, the interaction decays exponentially.

So you don't have to be very far away to have a basically negligible interaction between the two. For a conductor it's a little worse -- it decays algebraically -- but it still decays.

So how can we take advantage of this fact? The hope is that by having locality and exploiting locality, we can have a sparse matrix X. By evaluating a sparse matrix X, we can get some scalability here.

Let me show you two equal solutions. You can trust me that the objective function values are equal. Okay. We have an orthogonal solution. This is for five electrons in the model problem, with a grid of just 100 points. Very small. This solution is orthogonal, and this one is localized.

This one, if you notice -- every column of X, this is a column of the X matrix -- only has support on 30 out of 100 elements.

So for evaluating an iterate -- they have the same energy -- we would much rather evaluate the energy at an X where every column has finite support.

Sorry, things are going crazy here; I apologize. Okay, so let's look at this again. Here we evaluate this energy for the model problem. Really, they're structurally the same except for one extra term. But we want to evaluate X such that every column of X has a sparse structure, and it's exploitable.

We know exactly where the sparsity is.

Basically, we want to evaluate columns of X that look like this and not like this.

And we want to optimize in such a way that we always maintain the sparsity; we always want to evaluate X in this format. So how do we do that? First of all, we can't do it with orthogonal wave functions. The only really sparse orthogonal wave functions are multiples of the identity.

I mean, to be orthogonal, each one may be nonzero in a small area, zero everywhere else, and they just don't overlap. So to have this locality, this sparseness, we have to give up orthogonality. It's definitely necessary.

So now the question: we have a sparse iterate, and we want to optimize, take steps. We always want to maintain sparsity and always evaluate our energy at a sparse iterate. But meanwhile, we said we have this extra fact: this isn't a general random problem. It's a very specific energy function.

So the idea is that, at every step, we want to find a sparse basis, with a nice form, that's very close to the dense basis.

So let's just talk about how we can do this. As I said already, it's an iterative algorithm. You have a sparse iterate, you take a step. There's no reason to think that after that step the next iterate is going to be sparse, even if you do steepest descent, because even if X is sparse, the gradient of E at X is going to be dense.

So here's a naive approach: take a step, then just truncate. Just to give some notation: anywhere you see a tilde, that means it's a sparse matrix, and it's exploitably sparse -- I know exactly where the non-zero elements are. And the sub T here means: take that sparsity pattern I gave you and truncate, setting everything outside the region to zero. So the idea is: truncate. Bad idea. People did it for a long time not knowing it was a bad idea. But they observed the following: you get stuck at local minima. They weren't the real answer -- local minima that aren't minima of the real problem.

That's actually pretty easy to see. Let's look at this picture here. This is actually a simple two-dimensional quadratic, X transpose A X, where A is a diagonal matrix with one and negative one on the diagonal.

Clearly the minimum is negative infinity: take the component corresponding to the negative one and send it off to infinity.

So the minimum here is negative infinity. But if you were to truncate the second component of the vector, you're left with x1 squared, and you get a spurious local minimum at zero which doesn't correspond to the real problem at all.
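That two-dimensional example is easy to reproduce (my own two-liner):

```python
import numpy as np

A = np.diag([1.0, -1.0])
f = lambda x: x @ A @ x

# Untruncated: sending the second component off to infinity drives f down
# without bound, so the true infimum is -infinity.
print(f(np.array([0.0, 100.0])))   # -10000.0

# Truncated to the first component only: f(x1, 0) = x1**2, which has a
# spurious local minimum of 0 at the origin -- unrelated to the real problem.
print(f(np.array([0.0, 0.0])))     # 0.0
```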

So it was observed for a long time that truncation doesn't work -- at least you don't get the answer you care about. Then Eon Gail [phonetic] last year came up with the idea: let's minimize the truncation error. I'll explain what this means.

They call this localization. So remember, E of X G equals E of X for any invertible G; the energy is the same. Let's compute a G such that the mass off the support is minimized. We're going to rotate all the columns of the matrix.

Truncate, and find the G that minimizes the truncation error. All you need is that G is invertible. To be honest, this is an easy problem to solve; I didn't write down how to do it. You compute G column by column, and if you assume each column of G sums to one, this is just a least-squares problem. It's pretty straightforward to do.
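Here's a small sketch of that column-by-column least-squares localization (my own code; the sum-to-one normalization on each column of G is the convention just mentioned, and the supports are arbitrary windows):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 3
X = rng.standard_normal((n, k))
supports = [np.arange(20*j, 20*j + 20) for j in range(k)]   # one window per column

G = np.zeros((k, k))
for j, sup in enumerate(supports):
    off = np.setdiff1d(np.arange(n), sup)
    # Minimize ||X[off] g||^2 subject to sum(g) = 1: g is proportional to C^{-1} 1.
    C = X[off].T @ X[off] + 1e-12*np.eye(k)   # normal equations (tiny regularizer)
    g = np.linalg.solve(C, np.ones(k))
    G[:, j] = g / g.sum()

Y = X @ G   # rotated columns, now concentrated on their supports

def off_support_mass(Z):
    return sum(np.linalg.norm(Z[np.setdiff1d(np.arange(n), s), j])
               for j, s in enumerate(supports))

# Localization reduces the truncation error (G = I is always feasible).
print(off_support_mass(Y) < off_support_mass(X))   # True
```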

So it's a simple least-squares problem. The idea is this picture: we're at a dense [indiscernible] X_K. The energy is the same on this entire coset of vectors, because it's basis-independent. So let's find the one which is closest to the sparse truncated space, truncate, and hopefully the energies stay about the same. I say hopefully -- they're going to change a little bit. They're going to have to.

Okay. So this is what they came up with, and they integrated this idea into Fletcher-Reeves and called it CGL. Let's see how it's different. This part is exactly the same. Now, the initial iterate may or may not be sparse; you might localize the first one, depending on what they want to do. They first do a line search and take the dense step -- this X_K plus is a dense matrix. They localize, then they truncate. And they also have to modify the search direction down here, because you've rotated the whole space; you have to rotate the previous search direction as well, and truncate it. And what they said is: this is the problem they want to solve. They want to make sure we no longer get stuck at local minima. Truncation gets stuck at minima that aren't the real answer; let's avoid that.

So this is my code, not theirs; I implemented all their ideas. And we ran this on -- sorry, a small problem: 100 grid points, five electrons, run from 50 completely random starting points. I ran the solutions to completion, to a given tolerance. The circles are the localized CGL algorithm, and the X's are just truncation. So look: this one actually avoided local minima 100 percent of the time, whereas truncation did get stuck. So why aren't we done? This is my starting-off point. Let's look at this: X tilde is sparse, but what's P here?

Where do we actually evaluate the energy? We evaluate the energy in the line search, and the search direction is dense, because the search direction equals the gradient, which is the residual, plus something which is sparse. So yes, every iterate X is sparse, right? But when we evaluate, we're still evaluating the dense problem. So yes, we do avoid local minima, but we're still not getting any of our scalability out. That's the first issue I have.

The second is that the energy isn't monotonic anymore. When we finish right here, we've now localized and truncated X, so the energy right here need not be the same as the energy here. It need not be monotonic.

And again, I'm an optimization person; I want it to decrease monotonically. That's the whole point, I think, for this problem. Still, this is a good idea: we now avoid local minima. So now here's my plan. I'm going to develop an algorithm that, first, maintains the property of avoiding local minima. That is important: when we finish running, we want to make sure we have an answer we can trust.

And, B, we want to evaluate the energy only at sparse iterates in the line search.

So we have to make sure that our search direction is sparse. So now I hope you understand the game we're going to play. We have this really nice function with some interesting properties: it depends only on the span of the basis.

And now we have to develop an algorithm, excuse me, that has these properties.

So basically, here's my algorithm; I'll show you how it works now. It took a long time to come up with, but it works. Here's what it comes down to. Let's assume our initial iterate is sparse. Okay, very simple picture. Create a dense search direction P_K -- with whatever algorithm you want, actually. I integrate this with CG for my testing purposes, but you can generate it any way you want. Then we take a full Cauchy step, a full step in that direction, and localize this matrix. Localize the Cauchy step and make that sparse.

So we now have a dense step up here, and we localize and truncate -- truncate through the G which minimizes the truncation error -- and that defines our search direction in this sparse subspace: the difference between this sparse matrix and that sparse matrix.

Now, I have a lot of things to justify. First: do I maintain descent? A priori -- actually, I'll tell you theoretically I don't. But I'll also tell you that in experiments it's maintained most of the time, and that's good enough. Okay? So let's talk about how we integrate this. I want to reiterate that P_K can come from any optimization algorithm. With the Ian Gall method it was not very clear how to integrate BFGS with the localization framework, because you have to muck with the previous directions; it was not completely clear a priori how to integrate it. I want to be able to say: give me your algorithm, give me your search direction, and let's make it sparse.

So here's the algorithm. Again, it's very crucial that the initial iterate is sparse -- maybe you do one step of the localization that I just mentioned. You have an initial search direction. You compute the G which minimizes the localization error, and now you just change the search direction: P tilde K equals the difference between the two. Then we do a line search. Notice the differences here: A, P tilde K is hopefully a descent direction, so the energy will automatically be monotonic if it maintains descent. And B, the energy at X tilde plus P tilde is always evaluated at a sparse iterate -- always, always sparse.
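Here's how one step of that construction might look in code (entirely my own toy version: a trace energy stands in for the Kohn-Sham energy, steepest descent generates the dense direction, and the masks are arbitrary windows). It checks the two structural claims: the resulting direction lives on the sparse supports, and localizing before truncating never increases the truncation error, since G = I is feasible.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # toy local Hamiltonian

# Column supports: where each sparse orbital is allowed to be nonzero.
masks = np.zeros((n, k), dtype=bool)
for j in range(k):
    masks[20*j:20*j + 25, j] = True                  # overlapping windows

def truncate(Y):
    return np.where(masks, Y, 0.0)

def localize(Y):
    # Per column: find g with sum(g) = 1 minimizing the off-support mass of Y g.
    G = np.zeros((k, k))
    for j in range(k):
        off = ~masks[:, j]
        C = Y[off].T @ Y[off] + 1e-12*np.eye(k)
        g = np.linalg.solve(C, np.ones(k))
        G[:, j] = g / g.sum()
    return Y @ G

def grad(X):
    # Gradient of E(X) = trace((X^T X)^{-1} X^T A X).
    Sinv = np.linalg.inv(X.T @ X)
    M = X.T @ A @ X
    return 2*(A @ X @ Sinv - X @ Sinv @ M @ Sinv)

X = truncate(rng.standard_normal((n, k)))            # sparse initial iterate
P = -grad(X)                                         # dense search direction
Y = X + P                                            # full step (dense)
P_sparse = truncate(localize(Y)) - X                 # sparse search direction

print(np.all(P_sparse[~masks] == 0.0))               # True: direction is sparse
print(np.linalg.norm(localize(Y) - truncate(localize(Y)))
      <= np.linalg.norm(Y - truncate(Y)) + 1e-9)     # True: smaller truncation error
```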

And nothing else changes down here; this is just vanilla conjugate gradients, with sort of three lines added in the middle of the algorithm. The first question you ask is: does it work?

Well, this is a much harder problem. I said there are local and nonlocal terms; I made the nonlocal terms much larger, so they really affect the problem more.

And we already know -- this is again all my code -- that on the easy problems you can solve it. Actually, this problem is even harder than real problems. So the circles are mine and the X's are Ian Gall's. Look, we do a pretty good job of avoiding local minima, and the arrows here point to places where we not only did as well as Ian Gall's CGL, we did better: these are starting points where we did not get stuck in a local minimum and they did.

Okay. So, again, this is a much harder problem; we'll talk about a real, somewhat larger problem later. So not only do we do as well, we do better.

And these are all from random starting points -- I really just typed in a random matrix and localized both. But this is even more important. I ran our algorithm for a long time; I let it go to a very tight convergence to see what would happen. The blue is ours. This is the log of the error -- the energy at every iterate minus the actual true solution -- against the number of iterations. The blue is ours: we maintain a monotonic decrease of the energy; we are always getting better. And theirs jumps around a lot; it's more prone to other problems.

So, again, localization. Now we do better, and we maintain monotonicity. We're moving quickly, but let's talk about a real problem.

First, I want to say this is not the type of problem my algorithm is meant for. This is actually a very simple problem; the algorithm is meant for problems with chains of these things. So right now, this is methane. There's one carbon, four hydrogens, and you already know the solution a priori: it's going to be centered around the atoms.

You can imagine the interactions happening among the atoms. Now imagine chains of these things together, hundreds of them. Then the interactions don't just happen near the atoms; they also happen away from the atoms, at the bonds where atoms meet. So something like methane is actually one of the worst cases for this type of method.

So this has eight electrons and four variables, and it's very small. For this problem I discretized with 16 grid points in each direction, giving 4,096 by 4,096, because I'm doing it on my laptop in Matlab while in the process of writing real code that scales. I know the solution from another tool: about 120 Rydbergs. So it's a known solution.

Okay. And so, again, I want to point out this is really the worst case for this type of algorithm because the interactions are very localized already as it is. So now let's talk about how we can solve it.

So, firstly, chemists have a much better way of choosing localization regions.

They come from the molecules themselves; chemists really understand how this works. I chose mine arbitrarily, because I'm not a chemist, and there's a lot of black magic in how to choose these localization regions, which is very important.

So I just chose 1,500 non-zeros per column and placed them with no relationship to the atoms. You would generally center them on the atoms or something like that. I chose them, honestly, almost arbitrarily.

I also started from a random point. If I claim to avoid local minima, then giving it a very good starting point would be cheating. But that is what is normally done: you have a pretty good initial guess for where to start.

Now, this is [indiscernible]. So first let's talk about convergence.

I converged to an energy which is not as good as the true solution of 120 Rydbergs; it's off by about five percent, seven over 120. But let's talk about the iterates here. Again, I'm not doing orthogonalization; the only work here is computing the gradients, which I'll talk about, and evaluating the energy. And everything's sparse.

So in 20 iterates I'm pretty close to the true answer already. By 40 iterates I'm there. And I'm only doing sparse work. It's monotonic.

I didn't show this, but starting at iterate 20, I actually start skipping the localization step. I have a little heuristic where I monitor whether the part of the iterate outside the supports is already sufficiently small; if so, I just truncate it.

What happens is that the G matrix becomes unstable. If you try to localize something that's already zero outside the localization region, then when you compute G, the solve blows up.

So localization (I didn't show this because it's hard to get the data out) is really important only in the first maybe 20 iterations. After that I skip the localization step; the iterate was already sufficiently localized.

Now let's look at the charge density. Let me explain this graph. This is the charge density, rho: the diagonal of X times the inverse of the overlap matrix times X star. And I picked out where the magic's happening, where the stuff's really happening. The top graph is the charge density I compute, and the upside-down graph is the charge density computed by the SCF solution.

So the question is (and I mirrored them so you can see): do I match the support regions? How close is it to the true answer?
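The invariance being used here, that the charge density depends only on the span of X, is easy to check numerically. A small NumPy sketch; the sizes and the einsum formulation are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))                # 40 grid points, 5 "orbitals"
G = rng.standard_normal((5, 5)) + 4 * np.eye(5) # some invertible G

def rho(X):
    # diag( X (X*X)^{-1} X* ): diagonal of the projector onto span(X)
    S = X.T @ X                                 # the small overlap matrix
    return np.einsum('ij,ij->i', X @ np.linalg.inv(S), X)

# same span means same charge density, even though X @ G looks nothing like X
assert np.allclose(rho(X), rho(X @ G))
```

This is why comparing charge densities, rather than the X matrices themselves, is the right test: X is only determined up to an invertible G.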

And actually it's pretty good. Where there is a lot of charge, a high probability of finding electrons, I match up quite well. So I actually do nail the major components of the charge density, which is all I asked for. Again, I wasn't trying to solve the problem exactly here; I'm just trying to see whether this method really works to begin with.

So I missed the energy by a little bit, but the gain is that I'm scaling theoretically much better (in practice, we'll talk about it), and I'm getting the charge density pretty well. I nailed the major peak exactly: the actual values are exactly what they should be, to five or six digits. And for the smaller important ones, I at least got the support right. So it's a great first step.

And again, I started from a random starting point and arbitrary localization regions. But now we get into the meat of the matter: all these implementation issues I haven't talked about yet. The algorithm is very simple to understand: take a step, localize that step, and truncate to the sparse region.

Okay. Scaling and conditioning. E(XG) = E(X): this is like the one equation that runs through everything in my work. What happens if I choose G to be a diagonal matrix of scalars?

What does this mean? Every column of X could have a dramatically different scale, and it doesn't affect the value of the objective at all.

Some columns of X might be on the order of tenths; some columns might be on the order of millions. There's nothing I can do about it; that's just how the problem is. So one could ask: can we just normalize the columns, changing E(X) to E of X with every column divided by its norm, so everything is norm one? We could, and I did this, but now think about how that affects the gradient. It adds a whole new chain rule to the gradient, and it's a very, very expensive chain rule to compute.

It's a big matrix; you have to go through everything column by column, computing a lot of norms. Really messy, really slow. So yes, we can normalize the columns and get rid of the scaling issue, but it's a trade-off against efficiency.

So far I've noticed it's not that big a deal. Yes, some columns are differently scaled, but in the grand scheme of things it's not that big a deal. It does become a very important issue, though, especially as you scale to larger and larger problems, when you're doing a line search: a line search on a matrix, X plus alpha times a matrix P.

If one column of the search direction P is really big and another is very small, you may need an arbitrarily small step length to satisfy the Wolfe conditions. So sometimes that one very big column can affect the step length algorithm quite severely.

And we really don't want to evaluate the energy that many times, because it's just expensive. Very expensive. So right now I'm not doing anything about it; I'm just letting things become unscaled. In the future, though, we might every now and then (not always, that's too expensive) do the extra chain rule, renormalize the columns of X, and resolve the problem.

So now this is actually the most important part. I'll spend a few minutes here.

I've said that localization does not preserve descent. It doesn't; I can give examples where it breaks. So how does it work again? We solve this problem, then take the solution of this term right here and subtract away the sparse iterate; that difference is the search direction we take.

So the columns of G sum to one. I say we solve this; look, G has columns summing to one. We solve for G column by column, and each is a least-squares problem.

So to maintain descent, we must have the following: the inner product of the gradient with the search direction has to be negative. If these were vectors, it would be gradient of E transpose times the vector P; this trace is the equivalent in matrix terms.

So let's start simplifying this. We can do some math. We have a curious property here: the gradient is actually orthogonal to the previous iterate X. It just turns out to be the case; it's a simple computation. So we really need to maintain that this trace, involving the gradient of E, X tilde plus P K, and G, is less than zero. Now, let's look at this.

Now, let's also talk about the constraints. We could solve this problem: minimize over G such that G is invertible and this trace is less than zero. If I could do this, descent would be guaranteed at every step. So let's actually look at it. The Frobenius norm term is quadratic in G; there's a G G transpose in there. For G invertible, we'll just make the columns sum to one, which is linear in the columns of G.

That constraint is linear in the columns of G, and the trace term here is actually linear in G. So this is a quadratic program. It's a quadratic program over matrices, but it's still a quadratic program, and we understand those.

So I wrote some code. Clearly, we can't ever hope to solve this exactly; the localization step alone already takes order k squared work. So I thought, what happens if I try to solve it approximately? It turns out that if you want localization to work, you have to solve this problem exactly.

I had hoped that truncated Newton ideas would apply: when you're far away, just get me close to the solution, point me in the right direction, and that's enough. It turns out that fails miserably. To make this type of approach work, you have to solve the localization problem exactly.

So I'm not doing it that way. Right now, here's how it works. I compute the localized search direction and check whether it is a descent direction; I'll give you a graph of how often it is in a minute.

If it's not descent, I just take a truncated search direction: I don't localize, I just truncate the CG search direction. And I have a little check in case even that fails to be a descent direction, but that check has never fired. It so happens with this problem that once you get close enough to the solution, which is where this stuff starts failing, the truncated direction at least gives you somewhere to descend, even if it's not as good a search direction and you don't take as good a step. We could solve the quadratic program and have a perfectly valid descent algorithm whose search directions always descend, but it's too expensive; these problems are so large that I can't take the time to do it.
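The fallback logic just described might look like this in NumPy. The final masked steepest-descent branch is my assumption; the talk only says there is "a little check" that has never fired:

```python
import numpy as np

def descent(grad, P):
    # matrix analogue of g'p < 0: trace(grad^T P) < 0
    return np.trace(grad.T @ P) < 0

def safeguarded_direction(grad, P_loc, P_cg, mask):
    if descent(grad, P_loc):
        return P_loc                         # localized direction is fine
    P_trunc = np.where(mask, P_cg, 0.0)      # fallback: just truncate CG dir
    if descent(grad, P_trunc):
        return P_trunc
    return np.where(mask, -grad, 0.0)        # last resort: masked steepest descent

# usage: force the localized direction to be ascent, so the fallback fires
rng = np.random.default_rng(0)
grad = rng.standard_normal((20, 3))
mask = rng.random((20, 3)) < 0.5
d = safeguarded_direction(grad, grad, -grad, mask)
assert descent(grad, d)
```

Note that both fallbacks keep the direction zero outside the supports, so sparsity of the iterates is preserved either way.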

And now this is really the main issue. This is where the linear algebra comes in, and this is what I'm really working on right now: the gradients.

The gradients are nasty. They're matrices: X is a matrix, so the gradient is a matrix. Here's the full problem again. I gave you a couple of examples; these aren't the exact gradients, I got rid of constants and things that were confusing, but this is the main matrix product of the gradient. So here's the gradient. First look at S.

S is a very small matrix, a small square matrix, number of electrons by number of electrons. But applying its inverse is pervasive. We have L plus V, X, S inverse, X star; these things are all over. The real bottleneck is the Hartree energy, the second term right here: X, S inverse, X star, diag. So we're applying that matrix inverse over and over again, and this is a major bottleneck.
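One standard way to avoid an explicit inverse in products like X S inverse X star is to factor the small matrix S once and solve, rather than forming inv(S). A hedged NumPy illustration with toy sizes (not the speaker's code):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 12))   # tall-skinny: grid points x electrons
S = X.T @ X                          # the small SPD overlap matrix

B = rng.standard_normal((12, 12))
Y_inv = np.linalg.inv(S) @ B         # explicit inverse: avoid in real code
Y_solve = np.linalg.solve(S, B)      # factor and solve: cheaper, more stable
assert np.allclose(Y_inv, Y_solve)
```

The same factor-and-solve idea applies each time the gradient needs S inverse applied to something, which, as noted above, happens several times per gradient evaluation.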

In fact, from some profiling, my code was running at 120 seconds per energy evaluation, and we evaluate things like this Hartree term about four times, because the exchange-correlation part of the gradient has three terms that look similar to this.

So with profiling, without doing anything smart (I'm not taking advantage of sparsity yet; it's a sparse matrix, but I'm not exploiting that because it's Matlab and I don't care about wasting time on that, though I do want it to run efficiently), I've taken it from 120 seconds per function evaluation down to about 20. It's still very slow.

These functions here are the bottleneck. I think there's something we can do here.

This is the overlap matrix. This is S, that is, X star X. Just look at it. I don't know which problem this is from, but there are ten electrons, pretty equally spaced. These are the norms of the elements of S plotted as a matrix.

So entry two, two is the norm of the second diagonal element of S; this is S sub i j plotted against i and j. Look at how dominant the diagonal is. In fact, the S matrix in this case has only about three non-zeros per column.

Let's look at S inverse. S inverse is actually also diagonally dominant. The elements off the diagonal are rather small, around 10 to the negative third.

There's a lot of recent work on how to invert such matrices iteratively and quickly.

There are things we can exploit here, because, as I said, locality: for insulators, nearby interactions decay exponentially, and X shows exponential decay away from the diagonal.

Now, this is what I hope to exploit to solve this problem, to compute these gradients efficiently: the matrix is very big at the diagonal and near the diagonal, but then gets very small away from it. And there's a little artifact at the end, of course.

So maybe we can write linear algebra that solves this iteratively. We can either try the naive way, computing the inverse on these components near the diagonal and pretending everything else is zero, or find a proper iterative approach.
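For a diagonally dominant S like the one shown, one simple iterative idea (an assumption on my part, not the speaker's method) is a truncated Neumann series around the diagonal:

```python
import numpy as np

def neumann_inverse(S, terms=25):
    # write S = D + E with D the diagonal and E the small off-diagonal part;
    # then S^{-1} = sum_m (-D^{-1} E)^m D^{-1}, truncated after `terms` terms
    d = np.diag(S)
    Dinv = np.diag(1.0 / d)
    E = S - np.diag(d)
    M = -Dinv @ E
    term = Dinv
    approx = Dinv.copy()
    for _ in range(terms):
        term = M @ term
        approx = approx + term
    return approx

# a strongly diagonally dominant toy S, like the overlap matrix in the talk
rng = np.random.default_rng(3)
k = 8
A = 0.05 * rng.standard_normal((k, k))
S = 2.0 * np.eye(k) + (A + A.T) / 2
assert np.allclose(neumann_inverse(S), np.linalg.inv(S), atol=1e-8)
```

Each term can also be truncated to a banded sparsity pattern, which is where the exponential decay away from the diagonal would pay off; that refinement is omitted here.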

And there are methods for doing this; truncated Newton-type ideas for the solves, for instance. It's not clear what will work best. But I think it is possible to surmount this bottleneck.

And finally, where do we want to go next? I integrated this with BFGS, because that's a totally different type of search direction. This is, again, a small problem: CG versus BFGS. What you see here is the energy minus the solution on a log plot, and this number here is the number of bad search directions, the number of times my method lost descent and I had to do something different. CG never lost descent; I can't prove it's always true, but conjugate gradients, plugging away, always preserved descent in the search direction. BFGS lost it seven out of 90 times.

To be honest, I'm okay with that, because it converged so quickly that it's okay.

And again, this is from a random starting point. For real problems, I've run it a couple of times; I don't know the ground truth yet. This model problem is a little harder than the real problems I've seen, because I forced the nonlocal exchange-energy terms to be nontrivial in the model problem.

In the real problems they're actually much smaller, so you can solve those faster. On the methane problem I got the solution in 20 iterations. So the model problem tends to take a lot more work to converge. But there's hope: I blindly integrated this with BFGS, and I'm working on a limited-memory variant, because these problems are way too large to form the full Hessian approximation. So it gives hope that, right here, we clearly had a problem, we did something that was not descent, and it still worked pretty well.

And this is really what we're looking for in this community. We don't have to prove that it always maintains descent, because they're already clearly okay with SCF, which is clearly not monotonic. I mean, you could run ten iterations and have the wrong solution, the 11th iteration would be correct, and the 13th would be wrong again. They're okay with this.

But we give them a lot more justification for why the answer can be trusted when you're done. And so, yes?

>>: Do you use the same line search?

>> Marc Millstone: I use my line search. With BFGS, though, I can use a weaker line search, and I do; that's one reason I want to implement this with BFGS. Line searches are nasty. For BFGS, the convergence theory only needs a weak Wolfe line search, and that's much nicer: you don't have to do interpolation, it just goes. I do use the weak Wolfe search. It converges a little slower than the strong Wolfe search in my experiments, but it's fewer energy evaluations.
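A weak Wolfe line search of the kind mentioned can be implemented as a simple bisection. This sketch follows a standard scheme and is not the speaker's code; the constants c1 and c2 are conventional choices:

```python
import numpy as np

def weak_wolfe(f, g, x, p, c1=1e-4, c2=0.9, max_iter=60):
    # bisection line search for the weak Wolfe conditions:
    # Armijo sufficient decrease plus a one-sided curvature condition
    a, b, t = 0.0, np.inf, 1.0
    f0 = f(x)
    slope0 = float(g(x) @ p)
    assert slope0 < 0, "p must be a descent direction"
    for _ in range(max_iter):
        if f(x + t * p) > f0 + c1 * t * slope0:
            b = t                        # Armijo fails: step too long
        elif float(g(x + t * p) @ p) < c2 * slope0:
            a = t                        # curvature fails: step too short
        else:
            return t                     # both weak Wolfe conditions hold
        t = 0.5 * (a + b) if np.isfinite(b) else 2.0 * a
    return t

# usage on a simple quadratic energy
f = lambda x: 0.5 * float(x @ x)
g = lambda x: x
x = np.array([2.0, -1.0])
t = weak_wolfe(f, g, x, -g(x))
```

Unlike a strong Wolfe search, nothing here needs interpolation or a bracketing refinement of the derivative magnitude, which is what makes it attractive when every energy evaluation is expensive.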

It all depends on how you want to count; I'm counting the number of energy evaluations. So, what's next? All of this right now is based on the Matlab package by Chelikowsky, Troullier and Saad. They basically did the exchange-correlation energy work, getting it from a human-readable format to a computer-readable format. I borrowed that and built my stuff on top of it.

It's open-source Matlab code using real-space finite differences; they're solving the nonlinear eigenvalue problem with finite differences. But honestly, it's in Matlab, so it's limited in problem size and mesh length, and I don't want to spend more time making it go faster, because you can't run a big enough problem to make the gains even noticeable.

So that's where I just started about two weeks ago: I'm writing my own [indiscernible] C++ implementation to run on DOE computers. And really the gain is here. What we really want to do is exploit the sparsity pattern, always evaluating the energy at a sparse iterate, if we can get around this gradient issue. Right now the gradient requires a dense inverse; we could do it iteratively.

And the theoretical goal, the big buzzword everyone uses, is linear scaling: scaling linearly in the number of electrons. Being a little less ambitious, hopefully we'll get linear scaling in the evaluation of the energy and the gradient.

But for that localization step you don't mind doing a bit of work, because it's only quadratic in the number of electrons; you don't multiply by the worst number, which is the number of grid points. So the aim is making sure we always evaluate a sparse energy efficiently.

That gives a practical linear-scaling algorithm, as I call it. And I think that's it. Any questions?

>>: For the initial sparse matrix, was it randomly generated?

>>: How I got my initial points?

>>: Initial matrix.

>> Marc Millstone: I actually didn't want to cheat at all. I thought using any physical knowledge to choose the initial guess would be cheating, because of course I'll converge to the right answer if you give me a good starting guess. I called rand; I took a random matrix. My thought is that if a random matrix converges to the right solution over and over again, then a good matrix will converge even faster.

All I do is call Rand and localize it to make sure it's sparse.
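The initialization just described, call rand and then localize, is literally a couple of lines. A NumPy sketch with made-up sizes and an arbitrary support pattern, mirroring the talk's "no relationship to the atoms" choice:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 8                                  # grid points x electrons (toy)
mask = rng.random((n, k)) < 0.15               # arbitrary localization supports
X0 = np.where(mask, rng.random((n, k)), 0.0)   # "call rand", then localize
assert np.all(X0[~mask] == 0.0)                # the initial iterate is sparse
```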

>>: Okay.

>> Marc Millstone: So, again, you have a really good initial guess for all these problems, and often when you see code it's sometimes cheating, because "we converge in seven iterations" means seven iterations from a point that already looks about the same as the answer. So I'm doing random initial guesses, and I think that's a better test of the algorithm.

>>: Okay. The second question relates to the data. At the beginning you also showed two solutions, one orthogonal and one localized.

>> Marc Millstone: Right.

>>: That means you might have multiple optimal solutions; is that right?

>> Marc Millstone: Exactly. It's not even multiple; it's even worse. Given an optimal solution X, X times G for any invertible G is also a solution. There's an uncountable number of optimal solutions.

>>: So when you apply all of this and truncate, is the convergence theory preserved? Can you prove it mathematically?

>> Marc Millstone: You get the exact same convergence proofs if you assume this maintains descent, which is a big assumption; it doesn't always. Under that assumption you get the same proofs as for conjugate gradients, and you can get the gradients small.

>>: There's not --

>> Marc Millstone: There's not a global result. I want to point out that you can't have global results on these nonlinear problems. It's very open in the sense that there are so many local minima, all equally valid; we hope the one we find matches experiment. I can't make any global proofs, nothing. I do want to say that although I deal with the X matrix, the X matrix is physically meaningless. All that matters is rho, the charge density. I can choose any X, as long as its span is the same as a different X's. So in my mind there's really no difference between the orthogonal solution and the sparse solution.

Because the rho is the same.

If the rhos are the same, then I can derive all the properties I want: all the forces, all the magnetic properties, everything that I need. That's what this is exploiting.

People are scared of nonorthogonality. But if you want sparsity, if you want the ability to scale, you have to embrace nonorthogonality and use it over and over again.

And what you lose is the simple gradient; the gradients are much more complicated. And I don't know why, but it took a long time before people wanted to do it this way.

Questions?

>>: Are you working with any group that has problems that it can't solve?

>> Marc Millstone: So, problems they can't solve; or rather, problems they can't solve fast enough. There are a lot of problems people want to solve. Groups at Berkeley Lab want to solve bigger and bigger problems, and there's nothing worse than going onto a supercomputer, running SCF for a week, and not knowing if you can fully trust the answer. At least here I can say: here's the answer I got, here's the gradient, and is it sufficiently small? So there's a notion of convergence, and you can give some trust to the answer. And in fact, if you take the SCF solution and look at the gradient at that point, it's often not as small as you would think. The energy might be consistent to five or six decimal points, but the gradient may be a tenth or a hundredth, and that's still not enough. The real idea is that you can do experiments on the computer first. Say you have 100 different problems you want to solve for the same atoms. Maybe you narrow it down to 10 to test in the lab, because building things is complicated; it takes time, energy, money, and hardware. So here are the best 10 to try. That's how I view these ideas: a way to limit the things you have to do in the lab. So right now we're really working on some problems and on getting this to scale on massive computers. Thank you very much.

[applause]
