
>> Rustan Leino: Good afternoon, everyone. I'm Rustan Leino, and we're going
to give you an overview today of work that Jean-Raymond Abrial and I have done
in the last five weeks on thinking about a hypervisor.
So the start of this work came from the observation that when we think about code
development today, the typical way it is done is that you write some piece of code,
then you write test cases and hope that the test cases will pass, and you debug the
code, and then you find it doesn't really work and you change it, and then, of course,
you leave many bugs in the code that ships, and you go back and fix them, and the
whole process is very, very expensive.
So what we'd like to try to do is to use a different technique for developing code.
And the different technique is a top-down technique, so starting from
requirements, starting from specifications and deriving step by step in a gradual
way the eventual code.
So there are techniques for doing this, and we're using one that Jean-Raymond
was co-inventor of, the Event-B Method. And so to have an application for this to
put these techniques into practice, Wolfram Schulte [phonetic] had suggested
that maybe we should try a hypervisor.
So there was a recent effort to verify the Microsoft Hyper-V. The Hyper-V is a
big body of code, let's say 100,000 lines, and a three-year joint project with
Saarland University set out to specify the hypervisor.
They used the VCC verifier, and that specification and verification was done
after the code was written, which is always much more difficult. So what we're
trying to do here is do all the
specification and so forth up front so that the code comes last, and that's the
top-down approach that we're taking.
So what we want to do is to produce a hypervisor. A hypervisor is a virtualizing
piece of software that takes one hardware chip with -- so one physical chip that it
wants to virtually give out to a number of operating systems, and so each
operating system can then execute on this chip as if they have full control over
the chip.
And we're assuming that there are -- there are several cores in the machine,
which is realistic these days, and so what we're trying to do is distribute that
computation in such a way that when each operating system is running, that is,
each guest operating system, as they're called, would not be aware that the
others are running. That is, you cannot detect from the guest's standpoint
whether you're running under the hypervisor or directly on the hardware
yourself. That's the idea. So that's, for example, what the Microsoft
Hyper-V is doing.
So that means that there are a number of things that you have to virtualize for
these guest operating systems, and the main ones are memory and interrupts.
There are also timers and maybe other details, but we're going to focus on
memory and interrupts.
All right. So the way that -- the general strategy for doing this kind of work with a
top-down development is that we start with a bunch of requirements. So we're
going to show you a flavor of the requirements that we write.
And so the requirements are stated informally but as precisely as possible, and
then you develop some strategy for doing the refinement, which is to say, well,
we're going to first introduce the following features and then we're going to
introduce more and more detail into the whole thing. So that's the refinement
strategy.
And at each step in carrying out that strategy there is a formal model. So you
develop a formal model that describes those pieces of the system that you're
trying to develop at that point.
So here, for example, there are choices of which things to introduce first, and as
you will see, it comes very naturally to first describe, to start with, with individual
machines that are running, because if you can describe individual machines
running in isolation and then refine everything into a system where things share
the hardware, then you have from the very beginning proved that you have
machines that operate independently of each other.
This is also what was done -- or a similar technique was used -- in the seL4
verification project, where they started with some high-level specifications or
simulations and tried to refine things, but we're doing it using a different method
here, and then we'll conclude.
So with that, I'm going to turn it over to Jean-Raymond, who will take us through
the technical development.
>> Jean-Raymond Abrial: Thank you, Rustan.
Good afternoon, everybody. I'm Jean Abrial, and I'm going to continue.
So, first of all, we're going to discuss the requirement document. We do not rush
immediately into the formal modeling. And the requirement document is made of
two texts, the explanatory text and the reference text. For the moment in the
requirements we have some sort of short explanatory text, but we put a lot of effort
into developing the reference text, which is made of short statements that are
labeled and numbered.
So the first task we have to do for this project is precisely to define the various
labels [inaudible]. And here they are. So as Rustan said, we are first going to
develop the systems independently, so single-system memory handling, SM, and
single-system interrupt handling, SI, and then we enter into the notion of multiple
systems under the control of the hypervisor, so we have requirement labels MS and
HV for the general requirements for the hypervisor, and then the handling of the
memory, which corresponds to this, and the handling of the interrupts, which
corresponds to this, and in the middle the problem of scheduling of the various
guests by the hypervisor.
Okay. So let's go through these requirements, because again and again we
insist on the absolute necessity to write down very carefully those requirements.
So we spent, Rustan and I, a lot of time writing them and also having interviews
and mails and telephone exchanges with some experts, who gave us some clues
in order to understand the problem, because this is their normal job.
So here I'm going to go a little fast through them. You can just read them. So the
single-system memory handling. We have an operating system that uses
memory, devices, and some timers, and the amount of memory accessible by an
OS is determined at boot time, and we call these addresses IPAs, intermediate
physical addresses.
And, by the way, if the OS tries to access memory through a nonexistent IPA,
then it crashes. That's unfortunate. So much for the memory -- there are, of
course, far more things in the handling of the memory for the OS, but from the
point of view of the hypervisor, we're not interested in that because this is the
private business of each OS.
So for the interrupts we consider only uni-processor systems for the moment. So
an OS can be interrupted by some devices that are connected to the unique CPU
running the OS. And now some properties of the interrupts: each interrupt has got
a priority, which is a natural number, and then we have a notion of state for an
interrupt. It can be inactive, pending, or active. And the state of an interrupt
is managed by the hardware.
So now I'm going to describe these three things a little more carefully and explain
them with some short statements like this. An inactive interrupt is one that has not
happened since it was last treated. A pending interrupt is one that has happened
but has not been treated yet because its priority is not high enough. An active
interrupt is one that has happened and is either being treated now by the CPU or
preempted by an interrupt of higher priority, because we can have nested interrupts.
And now, when a pending interrupt arrives, if its priority is high enough, the
hardware sends a signal to the concerned core of the OS, the OS acknowledges
this signal, and the hardware makes the interrupt active. But, of course, for a
pending interrupt to interrupt an interrupt that is already being treated by the OS,
it has to have a higher priority. So it goes on top of it and preempts it.
And so, as a consequence, among the active interrupts, the one that is executed
is the one with the highest priority.
And here it's a bit redundant, what I've said before. A running interrupt can be
preempted by a pending interrupt with a higher priority.
At the end of the interrupt handling the OS sends an end of interrupt signal to the
hardware which makes the interrupt inactive. So this is the story of an interrupt.
An interrupt can be masked or unmasked by the OS.
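As a rough illustration of the lifecycle just described, here is a small Python sketch of a single interrupt's state machine; the class and method names are illustrative only, not taken from the Event-B model.

```python
# Illustrative sketch of the interrupt lifecycle: inactive -> pending -> active -> inactive.
INACTIVE, PENDING, ACTIVE = "inactive", "pending", "active"

class Interrupt:
    def __init__(self, priority):
        self.priority = priority      # a natural number
        self.state = INACTIVE
        self.masked = False           # the OS can mask or unmask it

    def happen(self):
        # the interrupt occurs: it becomes pending until it can be treated
        if self.state == INACTIVE and not self.masked:
            self.state = PENDING

    def acknowledge(self, running_priority):
        # the hardware makes it active only if it beats the currently running priority
        if self.state == PENDING and self.priority > running_priority:
            self.state = ACTIVE       # it preempts the running interrupt, nesting on top

    def end_of_interrupt(self):
        # the OS sends EOI; the hardware makes the interrupt inactive again
        if self.state == ACTIVE:
            self.state = INACTIVE
```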
So much for the individual interrupts. Now let's go to the hypervisor
requirements.
So the role of the hypervisor is to simulate, on a single machine with several
cores, the behavior of independent guest OSes. So you can have a look at these
[inaudible] here. We have several OSes that are independent, and we want to
simulate this with the hypervisor. So the hypervisor, as Rustan said, has got as
many interrupts as we have here, but you see the numbering is not the same. So
these are the physical interrupts, and for each individual guest these become the
virtual interrupts.
And here we have a hardware device called the interrupt controller, and, of
course, the hypervisor machine has got several CPUs that are connected to the
interrupt controller, and they share a [inaudible] memory. So what we have to do
is to put all these guys here into this list here and put all the individual memories
here inside a unique memory, so that with the hypervisor we can completely
simulate the behavior of several OSes.
So we have a number of elements that are determined at boot time, and the
number of guests is fixed. We cannot add another guest after the boot. And, of
course, the main thing, as Rustan said, is that a guest is never aware that it is
being executed on the hypervisor.
The guests are really independent. This is a very strong requirement. A guest
cannot pollute another guest and does not even know that there are other
guests.
And now, of course, one of the responsibilities of the hypervisor is to schedule
the guests. The idea here is that we might have more guests here than there are
CPUs here, so not all the guests are running; some of the guests might be
sleeping. So it is the role of the hypervisor, from time to time, maybe at some
time slice, to eject one of the guests from a CPU and put another guest in its
place if it has been sleeping for a certain time. So you can imagine that there is a
lot of switching to be done when doing this scheduling.
And, of course, we require here that the hypervisor is fair: from time to time it has
to give control to every guest, of course.
The hypervisor handling of the memory. So we studied two things, as Rustan
said. We studied the memory first and then the interrupts. The interesting thing
is that for the memory we go from a virtual memory to some physical memory,
and for the interrupts we just go the other way around.
So all the memories of the guests are embedded into a single memory, that of
the hypervisor, and for making this easy the machine here, the hardware, has got
a device called the SLAT, the second-level address translation, which keeps
track, for each guest, of a one-to-one mapping between each intermediate
physical address of the guest and the corresponding physical address in the
memory of the hypervisor.
And this connection is one-to-one, which is very important, because we do not
want the memories to overlap. So we have this table called the SLAT, and the
whole thing is relatively simple. When a guest wants, for example, to write in its
memory, it does not know it, but the hardware, behind the curtain, translates the
write through the SLAT, which is a one-to-one correspondence established at
boot time. So the SLAT translates this IPA, this intermediate address, [inaudible]
into a real physical address. And it is a device that is now in several machines,
which is extremely nice.
So the size and content of the SLATs is determined at boot time because at boot
time we fix completely the way -- the portion of the memory of the hypervisor for
each guest. So this is determined once and for all at boot time, and, again, this
is an injective or a one-to-one correspondence so that the memories do not
overlap.
The SLAT itself for each guest is addressed by means of a base address. So
there is a SLAT for each guest, and for each guest there is a base, which is the
beginning address of the SLAT table. And I will describe more about the SLAT
here.
So here, this is a description of what I've just stated: the connection through this
map of an IPA to a PA. And, of course, if this process fails, then it is because the
IPA that has been provided by the guest is a bad one, so we have a second-level
page fault, and the hypervisor reports it to the guest and probably kills the guest
because it has tried to access a faulty address. Okay. And this is described here.
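As a rough sketch of that requirement, assuming a per-guest dictionary for the SLAT (the names `SecondLevelPageFault` and `hypervisor_write` are ours, not the model's), the translation step looks something like this:

```python
class SecondLevelPageFault(Exception):
    """Raised when a guest presents an IPA that its SLAT does not map."""

def hypervisor_write(slat, host_mem, guest, ipa, value):
    mapping = slat[guest]              # one-to-one IPA -> PA map, fixed at boot time
    if ipa not in mapping:
        # second-level page fault: reported to the guest, which would probably be killed
        raise SecondLevelPageFault(f"guest {guest}: bad IPA {ipa:#x}")
    host_mem[mapping[ipa]] = value     # the write lands in this guest's own portion

# usage sketch: two guests with disjoint physical ranges
slat = {"g1": {0x1000: 0x9000}, "g2": {0x1000: 0xA000}}
host_mem = {}
hypervisor_write(slat, host_mem, "g1", 0x1000, 42)
```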
Now, the SLAT could be quite big. Therefore, the SLAT is in fact structured like
this here. Rather than being a unique table, which would be far too big, it is in
fact organized as a tree structure of pages.
So an IPA [inaudible] has got 10 bits on top, 10 bits here in the middle, and 12
bits on the lower part. We have the base address of a SLAT for a certain guest,
and we have then the top word page here, and then we have a second level and
then a third level. And so this thing is extremely simple. These pages here are 2
to the 10 in size, [inaudible] page. So the 10 bits here are taken.
They address a certain element in this top page, and then this entry here goes
on to a second-level page, which is also a word page, and then we take these 10
bits here, we address this thing, and this entry here now goes to the byte page.
Now this is a byte page, so 2 to the 12, and these 12 bits here address this.
And the memory is here. So the memory of each guest is in this portion, but you
see we have these levels here -- one, two, three -- in order to access it.
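A minimal sketch of that three-level walk, assuming a 32-bit IPA split 10/10/12 and pages represented as dictionaries; the bit-level extraction is our own illustration, since the model itself stays above the bits:

```python
def split_ipa(ipa):
    # 32-bit IPA: top 10 bits, middle 10 bits, low 12 bits
    return (ipa >> 22) & 0x3FF, (ipa >> 12) & 0x3FF, ipa & 0xFFF

def slat_walk(top_page, ipa):
    """Walk the three-level SLAT tree of one guest and return the PA."""
    i1, i2, i3 = split_ipa(ipa)
    level2_page = top_page[i1]         # top word page -> second-level word page
    byte_page_base = level2_page[i2]   # second-level word page -> byte page base address
    return byte_page_base + i3         # base of the byte page + 12-bit offset = PA

# usage sketch: one guest whose IPA 0x00401005 falls in a byte page based at 0x9000
top = {1: {1: 0x9000}}
assert slat_walk(top, 0x00401005) == 0x9005
```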
Of course, the problem is that -- and so this is all explained in those
requirements. So you see we pushed the requirements down to this low level.
And, of course, this is very time consuming, because you can imagine that in
order to go from an IPA to a PA -- the PA is here, the IPA is here -- we have to
walk through two levels. So it's two extra memory accesses. It's far, far too much.
So there is something called the SLAT -- the TLB -- what's the name of --
>>: [inaudible]
>> Jean-Raymond Abrial: -- oh, translation lookaside buffer, and this is [inaudible]
memory, a small [inaudible] memory of size, you know, 16 or 32, which maps an
IPA directly to a PA. So when you walk through this, if the address was not in the
TLB, then you have to do the full walk, and then you push this pair into the TLB,
probably evicting some pair from the TLB, and next time -- because if we access
a memory location, we can expect that we will pretty soon access the same
memory again -- the TLB is doing a short circuit between the IPA and the PA. So
this is the usual thing. There is lots of hardware doing this. And this is what is
explained here.
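Here is a tiny sketch of that short circuit sitting in front of the walk above; the FIFO eviction policy and the page-granularity caching are our assumptions for illustration only, and `slat_walk` is the function from the previous sketch:

```python
from collections import OrderedDict

class TLB:
    """Small IPA-page -> PA-page cache in front of the SLAT walk (illustrative)."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()

    def translate(self, top_page, ipa):
        page = ipa & ~0xFFF
        if page in self.entries:                 # hit: short circuit, no walk
            return self.entries[page] + (ipa & 0xFFF)
        pa = slat_walk(top_page, ipa)            # miss: do the two-level walk painfully
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict some pair (FIFO here)
        self.entries[page] = pa & ~0xFFF         # remember the page-level mapping
        return pa
```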
So, so much for the memory.
>>: [inaudible]
>> Jean-Raymond Abrial: You flush the TLB --
>> Rustan Leino: When you schedule a new guest operating system, you have
to flush it.
>>: You have an individual TLB for each machine?
>> Jean-Raymond Abrial: For each CPU here, yeah.
>>: For each -- for each core?
>> Jean-Raymond Abrial: For each core. For each core.
>> Rustan Leino: Which means that the guests and the hypervisor running on
that core will have to share -- will have to share the TLB.
>> Jean-Raymond Abrial: Okay.
>>: I guess I just don't understand why you're modeling it at this level. I
mean, if the idea was to come at this top-down level, like even the three-level
mapping, I mean -- did you need that at the beginning or could you just have
modeled it -- I mean, the point was to model this thing that shares these -- these
OS's share the hardware.
>> Jean-Raymond Abrial: This is exactly what we are doing. You will see
exactly when I show you the demo. At the top level, at the abstract level, the
SLAT is not at all like this. The SLAT is just -- and then we will find --
>>: [inaudible]
>> Jean-Raymond Abrial: -- and this is the ultimate refinement, so we go down.
Okay. But this is not the first level, of course.
But I put it -- no, but you raise an important point here. The problem is that the
requirements are -- there is some sort of hierarchy in the requirements, and there
are requirements that are very general and then requirements that are more
particular. And precisely, this is the role of the refinement strategy to put this
hierarchy and to say, oh, we are going to take this and then this and then this
and then this in the way we can go down. Okay?
So let's go to the interrupts now. The interrupts were a little more difficult at first
because we had to ask the experts, and they gave us [inaudible] some
information. So the interrupt system of the hypervisor is virtualizing these interrupts.
So you remember the situation here. We have the individual interrupts for each
CPU, and we have all the interrupt levels here for the hypervisor. So, first of all,
there is a one-to-one correspondence -- an injective connection -- established
between each physical interrupt and a pair made of a guest and an interrupt.
For example, the 1, 2, 3 physical interrupts are connected to the 1, 2, 3 of this
guest, and then 4, 5 are mapped to 1, 2 of the second guest, and 6, 7, 8 are
connected to 1, 2, 3 of the third guest. And, again, this one-to-one connection is
established once and for all at boot time. And then each physical interrupt is
connected to a single virtual interrupt corresponding to a guest.
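As a toy illustration of that boot-time connection, using the 1-2-3 / 4-5 / 6-7-8 example just mentioned, a simple dictionary will do; injectivity here just means that no two physical interrupts map to the same (guest, virtual interrupt) pair:

```python
# physical interrupt number -> (guest, virtual interrupt number), fixed at boot time
phys_to_virt = {
    1: ("guest1", 1), 2: ("guest1", 2), 3: ("guest1", 3),
    4: ("guest2", 1), 5: ("guest2", 2),
    6: ("guest3", 1), 7: ("guest3", 2), 8: ("guest3", 3),
}
# injectivity check: distinct physical interrupts get distinct (guest, virq) pairs
assert len(set(phys_to_virt.values())) == len(phys_to_virt)
```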
So now we have the problem that -- you'll remember that here the interrupt was
physically arriving and here the interrupt was virtually arriving, but what physically
arrives is the physical interrupt. So the physical interrupt arrives here, and
immediately the hypervisor traps that interrupt and takes control. What happens
is that the hardware puts that interrupt as pending, and then the interrupt service
routine of the hypervisor takes control and acknowledges the interrupt, and then
the interrupt is made active by the hardware. So now the physical interrupt is
active, but the virtual interrupt is still inactive. It doesn't exist yet.
So then the hypervisor looks at whether the guest corresponding to this interrupt
is sleeping or not. If it is sleeping, the physical interrupt remains active until, later
on, the hypervisor schedules that guest. But if the guest is not sleeping -- for
example, that guest here is sitting on this thing here -- then what the hypervisor
does is simulate the arrival of the virtual interrupt by sending the corresponding
interrupt to a device called the virtual interrupt controller. And the OS is cheated.
It believes that this comes from outside. Actually, it doesn't come from outside; it
comes from the hypervisor, which plays the role of the outside.
And then things continue exactly as I said. The virtual interrupt arrives, and now
the guest does its own business on this interrupt until the moment when the
guest finishes treating the interrupt, so it now sends an end of interrupt, which is
a virtual end of interrupt. It is trapped by the hypervisor and the hardware, which
makes this virtual interrupt inactive, and as a side effect it also makes the
physical interrupt inactive. So you see the story of a physical interrupt arriving.
So that's more or less what is described here.
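Putting those steps together, here is a hedged Python sketch of the flow for one physical interrupt; `phys_to_virt` is the boot-time map from the previous snippet, and the scheduler and virtual-interrupt-controller interfaces are stand-ins we invented for illustration:

```python
def on_physical_interrupt(irq, phys_to_virt, running_guests,
                          pending_for_sleeping, inject_virq):
    """Hypervisor-side handling of one physical interrupt (illustrative only)."""
    guest, virq = phys_to_virt[irq]          # boot-time injective mapping
    # the hardware made irq pending, the hypervisor's ISR acknowledged it: it is now active
    if guest in running_guests:
        # guest is on a core: simulate the arrival of the virtual interrupt
        inject_virq(guest, virq)             # via the virtual interrupt controller
    else:
        # guest is sleeping: the physical interrupt stays active until the guest is scheduled
        pending_for_sleeping.setdefault(guest, []).append(virq)

def on_virtual_end_of_interrupt(guest, virq, phys_to_virt, deactivate_physical):
    # the guest's EOI is trapped: the virtual interrupt and, as a side effect,
    # the corresponding physical interrupt both become inactive
    for irq, target in phys_to_virt.items():
        if target == (guest, virq):
            deactivate_physical(irq)
```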
So we have flown rapidly over the requirements, but again and again, when these
requirements are finished, we have something which is clear. We send them
around, we discuss with the experts, we discuss between us, and we think that
now we understand.
And now we are ready to do our refinement strategy. So the refinement strategy
is the following: handling of the memory and, later on, handling of the interrupts,
because these things are pretty [inaudible].
And in each case, as Rustan said, we are going to have several physical OSes
sitting next to each other, which is formalized like this. They are completely
independent. They do their little business. And then we refine this by saying, no,
this is not true, this was an abstraction -- because this is the abstraction we want.
We want these OSes to believe that they are completely independent, and then in
the refinement we are virtualizing the OSes by means of the hypervisor. And this
will also take, as I've said, several levels, in particular for the SLAT, which is not
introduced as such at the beginning. So that's what we're doing.
Okay. So let me now demo with the Rodin Platform. So this is the Rodin
Platform, so I'm going first to describe the memory. So it's a project called
Hypervisor GRA5. You see that there were lots of tries before, and eventually
we got to this level.
So, first of all, a project with Event-B is made of a number of contexts and a
number of machines. The contexts are the places where the constants are
defined. So let me show you the first context.
It defines the address space of each guest. We are looking at each guest
independently now, and we have three sets: guest, value, and N, an abstract set
of addresses. Value is the set of values to be stored in the memory, and guest is
the set of guests. And there are two axioms and a constant, guest address. So
guest address is just a bunch of addresses corresponding to each guest, so for
each guest it is a set of addresses.
And this is not [inaudible]. The guests have some memory. Okay. So now we
can go to hypervisor m0. So this is the first level. So I'll explain a bit
what has been done here and also what I will do in the next one.
So the idea essentially is the following: if we take a program trace of an
independent guest, from time to time it writes to the memory: write a1, v1; write
a2, v2; write a3, v3. And in between, the guest does whatever it wants on its
local program. But from time to time it writes to the memory. It also reads from
the memory, but we have only considered writing because it's more difficult.
Reading is really easy. We have not done it.
And here you see the corresponding program as executed with the hypervisor.
So the hypervisor traps each write operation and changes it into a hypervisor
write: hypervisor write, hypervisor write. And what we want to be sure of is that
this program here and this program here are exactly the same, the only
difference being that this is now written by the hypervisor into the portion of the
memory that is devoted to this guest.
So this is essentially what we're doing here, and this is done by defining just a
single variable, which is the guest memory, and the guest memory is a function
from guest and address to value.
And it is initially empty, and then we have a single event called write with three
parameters: a guest, an address, and a value. So g is a guest, a is an address of
this guest, v is a value, and at the pair (g, a) we write the value v into the
memory. So this is just very, very simple.
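A rough rendering of that abstract level in Python, just to fix ideas -- the dictionary encoding is ours, not Event-B's, and the assert plays the role of the event's guard:

```python
# Abstract level m0, rendered in plain Python for intuition only.
guest_address = {"g1": {0, 1, 2, 3}, "g2": {0, 1}}   # boot-time constant per guest
guest_mem = {}                                       # partial function GUEST x ADDRESS -> VALUE

def write(g, a, v):
    assert a in guest_address[g]     # guard: a is an address of guest g
    guest_mem[(g, a)] = v            # action: guest_mem(g, a) := v
```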
Okay. So now the next step is, slightly, step by step, to introduce the notion of
the hypervisor.
We have now a second context, and that context defines the SLAT, which
connects, for each guest, a guest address -- an IPA -- to a host address, a PA.
And both IPAs and PAs are addresses, so that corresponds to your question.
You see that the SLAT here is just an injective function from pairs of guest and
address to addresses. So I do not define this tree structure at all at the beginning.
And it is a constant, because, as you'll remember, it has been entirely defined at
boot time. So everything that is defined at boot time is, from the point of view of
these models, a constant.
And so now we go into machine m1. So machine m1 now sees this context c1,
and the context c1 was itself extending context c0, and now we refine the guest
memory. The guest memory has disappeared from the variables here, and we
have only the host memory. So we start doing things at this level now.
And we have here a gluing invariant that glues the guest memory to the host
memory, and it is very simple. It is just the function composition of the SLAT and
the host memory. And because the SLAT is injective, the portions of the host
memory for a certain guest are disjoint from those of the other guests.
And so the transformation here is very simple. I can show both of them. I can
show here the abstraction. So the abstraction was guest memory of (g, a), and
now we have host memory of SLAT(g, a). And because we have the gluing
invariant -- guest memory is the SLAT composed with [inaudible] that you see
here -- the proof will not be very difficult.
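In symbols, the gluing invariant relating the abstract and concrete memories is roughly the following (our notation, not the exact Rodin text):

```latex
\[
\mathit{guest\_mem} \;=\; \mathit{slat} ; \mathit{host\_mem}
\qquad\text{that is}\qquad
\forall g, a \;\cdot\; \mathit{guest\_mem}(g, a) \;=\; \mathit{host\_mem}(\mathit{slat}(g, a))
\]
```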
Okay. So this is it for m1. So now we go to m2. So m2 does the following. In the
second refinement, what we do is introduce the various cores. We have shared
the memory, and now we're going to introduce these people.
So we introduce the various cores, and, again, we have an abstract connection
between them -- a partial injective connection, partial because we might have
more guests than cores in the hypervisor -- and the scheduling, precisely, is
going to assign a guest to a processor, to a physical processor. And, as I say
here, we also define the scheduler.
This is done through another context here, and that other context is defining --
I'm not sure. Maybe not. Maybe not. Let me have a look. So here -- yeah, m2.
Okay. So this one.
Yeah, so it uses c3, and c3 is defining -- oh, c3 is defining the set of cores, of
course. And we have here the set of cores -- it's an abstract set -- defining all the
cores that are used in the hypervisor.
And now we have m2. So m2 defines a new variable called core to guest. It's a
partial injection, again, between cores and guests -- in fact between cores and
the CPUs of the guests, because we suppose that the guests are uni-processor.
And we have again here a little modification of write, which becomes fatter and
fatter, and now this is host mem of SLAT of (core to guest of c, a). The guest has
disappeared from the parameters of the write operation; it is replaced by c, and
the connection is that the g of the previous abstraction is core to guest of c.
And, of course, this is done only if the guest is not sleeping, and if the guest is
not sleeping, then c is in the domain of core to guest.
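A loose Python rendering of that refined write at level m2, reusing the dictionary encodings of the earlier sketches; `core_to_guest` stands for the partial injection, and the assert again plays the role of the guard:

```python
# Level m2 sketch: the guest parameter is replaced by the core it is running on.
core_to_guest = {"cpu0": "g1", "cpu1": "g2"}     # partial injection: core -> guest
slat = {("g1", 0): 0x9000, ("g2", 0): 0xA000}    # boot-time constant, injective
host_mem = {}

def write_m2(c, a, v):
    assert c in core_to_guest                    # guard: the guest on core c is not sleeping
    g = core_to_guest[c]                         # the g of the previous abstraction
    host_mem[slat[(g, a)]] = v
```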
And now we go to m3. So m3 does the following. In this [inaudible] refinement
we are now going to disconnect the guest from the SLAT. Before, we had a
direct connection between the guest and the SLAT, and now we're going to
introduce this base address here. So we slightly enter into this, and we now
connect each guest with the base of its SLAT.
And we also introduce a base register in each CPU. It's a hardware register in
each CPU, and when the scheduling occurs, what happens is that the base
register -- the physical base register of this CPU -- is assigned the SLAT base
address of the guest that is now going to be executed on that CPU. And that
gives us a little more complication for these things.
Host mem -- still host mem, of course -- of SLAT of SLAT register of c. And now
we introduce for the first time the scheduling, and precisely in the scheduling,
this is what I told you: the SLAT register becomes the guest SLAT address of g.
So guest SLAT address of g is the address that is here. It is a table that is,
again, defined at boot time for each guest.
And this register -- this is a physical register -- is assigned at scheduling time.
So now we have connected each guest to the address of its SLAT, and it is put
here in this register.
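Roughly, the scheduling event at this level does something like the sketch below; `guest_slat_addr` models the boot-time table of SLAT base addresses and `slat_register` the per-core hardware register, and both names are ours:

```python
guest_slat_addr = {"g1": 0x1000, "g2": 0x2000}   # boot time: SLAT base address per guest
slat_register = {}                               # per-core hardware base register

def schedule(c, g):
    # put guest g on core c: the core's base register now points at g's SLAT
    core_to_guest[c] = g
    slat_register[c] = guest_slat_addr[g]
    # (a real scheduler would also flush the core's TLB here, as discussed earlier)
```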
Okay. So now we are ready to go here, and what we're going to do now is to
implement the SLAT. For now it is just a big table that sits here with this base
address at the beginning, but that's not the way it really is, because now we have
these trees. And what we have to prove -- and it is very important, because we
said, and it's absolutely fundamental, that the SLAT is injective, a connection
between IPAs and PAs -- is that the entire tree structure is in fact implementing
an injective function. So it's still one-to-one.
So here we enter into some things that are a little more -- and I have to introduce,
of course, this sort of thing. And we want to do it still at an abstract level. We do
not want to introduce the bits, but we just want to introduce this sort of thing.
So this is defined, I think, in this one -- no, this is defined in the next one. So this
is defined here. This is the structure of the SLAT as a three-level tree. And we
are now going to decompose our addresses. So we have a notion of IPA and a
notion of PA, and these are all page addresses, and now we define this, but
rather than defining it as 32 bits, we define some projection functions, IPA1,
IPA2, IPA3, and we call them the 10 bits, the 10 bits, and the 12 bits. Okay. We
do not go into the real bits. There is no point in doing this.
And, of course, we have to do some more axiomatization, and the first axiom,
which is quite important, is that these three projections -- IPA1, IPA2, IPA3 --
define an IPA unambiguously. So if you have two IPAs and they have exactly the
same projections, they are the same.
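Written out, that axiom is essentially the following (our transcription of the idea, not the exact Rodin text):

```latex
\[
\forall i_1, i_2 \;\cdot\;
\big(\mathit{IPA1}(i_1) = \mathit{IPA1}(i_2) \;\wedge\;
     \mathit{IPA2}(i_1) = \mathit{IPA2}(i_2) \;\wedge\;
     \mathit{IPA3}(i_1) = \mathit{IPA3}(i_2)\big)
\;\Rightarrow\; i_1 = i_2
\]
```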
And now I define the bases -- I believe these are all the bases for all the guests
-- and now I define page address L2, which corresponds to this, and page
address L3, which corresponds to this. And the bases, the page addresses L2,
and the page addresses L3 form a partition of the page addresses. So the fact
that they are a partition means that the union of the three is exactly the set of
page addresses, but their pairwise intersections are empty.
So you see, what I'm constructing here little by little is precisely the notions that
will give the basic theorem: that this implementation of the SLAT is indeed an
injective function.
And now I define the word content for a page address, for the 10 bits, and the
byte content for the 12 bits, so this corresponds to this byte and to this word.
And then there are -- I'm not going to go through them too carefully -- a number
of technical theorems here, but the final theorem, which is here, the fundamental
theorem, which has been proved, is that the SLAT is injective: if SLAT(b, i1)
equals SLAT(b, i2), so if the correspondences are the same, then i1 is equal to
i2. So I've proved, by being very careful here with those injections, that this
entire thing here corresponds to a one-to-one correspondence between this and
this.
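In other words, the fundamental theorem says roughly this, with b the base of one guest's SLAT tree (again our notation):

```latex
\[
\forall b, i_1, i_2 \;\cdot\;
\mathit{SLAT}(b, i_1) = \mathit{SLAT}(b, i_2) \;\Rightarrow\; i_1 = i_2
\]
```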
And so this is done in the context, and now we go into this machine, m4. So, by
the way, we do not go immediately into this tree structure. We first introduce a
TLB, the famous TLB -- remember the [inaudible] memory. So now I'm going to
say that we sometimes have this short circuit between an IPA and this.
So this is described here, and so in fact the write event is divided into write 1
and write 2. In write 1 the address is not in the domain of the TLB. Therefore, we
have to do the work painfully, but at the end of it, when we find the translation,
we update the TLB, evicting a certain member of the TLB.
And write 2 is also a refinement of write, and write 2 has got this guard here, that
a is in the domain of the TLB. Therefore, we just use the TLB directly. And the
scheduling is not different.
And now we go into the last hypervisor machine, and the last one defines -- we
are now going to split the event. This is very classical in Event-B refinement. We
have an event which is doing things atomically, in no time, and now we are
cutting it into pieces. And the event write 1, which was the one working through
this tree structure, is now going to be divided into write 10, write 11, and write 12,
because this is not done atomically anymore. In the abstraction it was done
atomically, but now we have three steps, and this is defined.
And in between the steps, the things are stored into, supposedly, some hardware
[inaudible]. And this is what is done here. So we have write 10, which writes
things into some register. It goes from here to here by taking the upper part of
the memory -- sorry, of the address, of the intermediate address -- and then write
11 and write 12; and write 2, which was using the TLB, is not modified.
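A schematic of that split into three non-atomic steps, reusing `split_ipa`, `slat_register`, and `host_mem` from the earlier sketches; the `top_pages` table and the register names holding the intermediate results are invented for illustration:

```python
top_pages = {0x1000: {1: {1: 0x9000}}}   # SLAT base address -> top-level page (toy data)
regs = {}                                # pretend hardware registers, per core

def write_10(c, ipa):
    i1, _, _ = split_ipa(ipa)
    regs[(c, "l2")] = top_pages[slat_register[c]][i1]   # step 1: fetch second-level page

def write_11(c, ipa):
    _, i2, _ = split_ipa(ipa)
    regs[(c, "byte")] = regs[(c, "l2")][i2]             # step 2: fetch byte page base

def write_12(c, ipa, v):
    _, _, i3 = split_ipa(ipa)
    host_mem[regs[(c, "byte")] + i3] = v                # step 3: finally perform the write
```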
And now we go on to the last one, which is -- oh, no. Sorry. This was the last
one. This was the last one.
Okay. So this is the end of it. And all the proofs have been done. Here are the
statistics -- oops -- that we have for this. And so you see it's not small. We had
176 proofs for proving all that. All the proof obligations for the maintenance of
the invariants and all the theorems have been generated by a tool called the
proof obligation generator, and they were all automatic -- 167 automatic --
except nine of them, where I had to do an interactive proof. An interactive proof
is just giving some clues to the automatic prover so that it can finish discharging.
I have not tried to optimize things, so it might be better, but I've done the things
as they came.
So this is finished for the -- for the memory. We have a certain status. It's not
defined in one, but you see -- and what is important to see here in those events
is that some events correspond to the hardware and some events correspond to
what the hypervisor does, and actually it is not completely clear cut what is done
by the future software of the hypervisor and what is done by the hardware of the
hypervisor.
By the way, a machine with the SLAT makes things far more convenient for the
virtualization.
So now we go into the interrupts --
>> Rustan Leino: Should we skip to the conclusions or do you want to --
>> Jean-Raymond Abrial: Yeah, we should go to the conclusions directly. If you
are interested, these slides can be given to you.
Here just let me give you the final numbers for the hypervisor -- for the interrupts.
We have 205 [phonetic] total proofs, and here we have a few more proofs done
manually -- 35 of them are done manually -- because in the last one alone there
are 22 of them.
So I do not have enough time to do it, but we have written the requirements
documents, so they are available, and those slides are available.
Okay. So let's conclude now.
>> Rustan Leino: All right. So in conclusion, what you've seen is the flavor of
the way that top-down development is done using a tool like Event-B. And one
common theme is that details are introduced gradually, so at each level you
need to know only so many things. So, for example, with the bits, you don't need
to know exactly how things are stored in bits. You can work at a more abstract
level, and that's very convenient.
If you start from the code, you have all of the concrete details to look at right
away, and then you'd like to try to abstract from there. And that seems to be
much harder.
So in what you've seen we also have the interrupt part which looks similar in
flavor to the development of the memory manager. But there are many more
things that are missing.
So the model still is quite abstract, so we need to continue going down into more
concrete levels so that we can eventually get to code that could actually be
executed.
You saw that in the refinement you get more and more events. So as you get
more and more events, at some point the final thing is going to have some
events being executed by hardware and some of them executed by software. And
in the process, there have been many, many things that we have puzzled over
where we have to sort of make up our own rules for -- I mean, what is hardware,
what's software, and then we've consulted with experts and then we say, ah-hah,
well, maybe we really should move this. This is really a hardware kind of thing.
And, oh, there's the such-and-such piece of hardware like the TLB or whatever
that we should consider in between.
But the idea in the end is that when you look at each one of the events, you can
see that this event is one of the software instructions that you can execute in the
machine or this is a task, something that is performed by the hardware.
And when we get to that point, then the idea is that one can actually do code
generation for those software pieces. And, of course, many of those events at
the end are going to be whatever the host OS is doing, and we don't -- for the
most part we're independent of -- well, we want to be completely independent of
what the host OS -- sorry, what the guest OSes are doing.
In the end, also, when we get to that point, we also have to go back to the
requirements documents and check off each requirement to make sure that we
are really living up to what that document says. You can think of the
requirements document as a contract between the developers and the users of a
system, and we're trying to get those right.
And there are some other pieces that you would want to have in a realistic
hypervisor, like timers, for example, that are missing from our model.
So, from the standpoint of trying to develop this as a Gedanken experiment -- I
mean, these refinements have been done before, but we're developing this as
something where we can see all the pieces come together, and the tooling is
there. We have expertise in our group in developing tools and automatic
reasoning systems, so maybe there are some things that can be done there as
well for making things in a process like this more automatic. So with that we'll
open it up for questions. Thank you very much.
[applause].
>>: What are the specifics of the instructions that [inaudible] come into picture?
>> Rustan Leino: Right. So what would happen is that the -- there's some event
that models the operations of the -- of each guest OS. So what we saw in the
memory model here is only one instruction. That is, we're now looking at it as if
each guest OS is just doing write instructions. But there would also be skip
instructions; that is, things that the guest OS would do that are of no concern to
us. That is, they can do whatever they want.
And what we need to show at some point is that if we then have real instructions
of a piece of hardware, of a processor, we need to show that those instructions
are correct refinements of skip [phonetic]. In other words, they don't interfere
with the operation of the hypervisor.
>>: Right. But if a CPU has [inaudible] where does that come into --
>> Rustan Leino: So imagine a --
>>: [inaudible]
>> Rustan Leino: So take an operation --
>>: [inaudible] so in this model they just know that there's a SLAT? They don't
know who does it?
>> Rustan Leino: So think of it this way. In that skip event that needs to be
refined by instructions, that would be executed by each guest. If the guest
attempted an operation that can only be done by the hypervisor -- that is, you
have to put the processor in hypervisor mode to be able to execute it -- then
what we would need to do is simulate that. That is, when we're doing that
refinement, we need to model the hardware execution of that instruction by
giving a trap, doing something that prevents the guest from mucking with, for
example, the SLAT, or mucking directly with the physical addresses. A guest
would only be allowed to muck with the intermediate physical addresses.
>>: [inaudible]
>> Rustan Leino: Wait. The N that you saw in the --
>> Jean-Raymond Abrial: That is used for the addresses.
>> Rustan Leino: Right. Uh-huh.
>>: [inaudible].
>>: [inaudible]
>> Jean-Raymond Abrial: Yeah, that would be some refinement.
>>: So if you used natural numbers and refined from that [inaudible] or
something, you had a domain of natural numbers --
>> Rustan Leino: So the way to --
>>: It wouldn't correspond to the [inaudible].
>> Rustan Leino: So if you have an abstract set that lives in the context, then
what you're doing is you're doing a development, a refinement, that is parametric
in what those are.
>>: [inaudible].
>> Jean-Raymond Abrial: No, it would not be the natural number, it would be a
finite portion. Yeah, yeah. Sure, sure, sure.
>>: Do you have any feedback from the hypervisor project or the [inaudible]
verification of it that -- like the kinds of things that you were doing here, are these
where errors were found in the hypervisor or would such a development have
prevented --
>> Rustan Leino: We don't have such -- right. You can check for that. Maybe
[inaudible] knows more. But, I mean, the feedback that -- I mean, we have
consulted with two of the people who were involved in that verification, that is, in
verifying Hyper-V, to get a sense of what sort of models we should have, what
kind of hardware we can expect, and things like that. But, yeah, I don't know.
You'd have to check with them to see what sort of errors, whatever, they
discussed.
But one thing that I should have said is that the reason the hypervisor seems
like such a good thing to do an experiment like this with is that for many pieces of
software -- dancing clowns on web pages, for example -- we don't really care
whether they're really correct or not. But here, the hypervisor is something where
you really care that the machines are independent. And, furthermore, if you try to
debug it with standard techniques, you're operating at such a low level that it's
very difficult to understand things at that level if something goes wrong.
So, therefore, trying to use the technique that is designed to start from
abstractions and get down to correct things seems worthwhile here. And you
notice that after these five weeks now, we don't have any code. I mean, it's not
there yet. Whereas, if you were to do it the traditional way, you probably would
have written code certainly by now.
>>: [inaudible]
>> Rustan Leino: Jason, did you also have a question?
>>: Yeah. So I was wondering where the virtual [inaudible] translation sits. Is
that before or after?
>> Rustan Leino: You mean from the -- so you're thinking each guest operating
system probably has a bunch of processes running, and when --
>>: [inaudible] so that sits before --
>> Rustan Leino: Right. So that sits before -- all of the addresses that we -- we
being the hypervisor -- would get is already a request for a particular intermediate
physical address. So what each guest operating system presumably would do is
it would have virtual addresses that it gives to each of its processes, and it maps
them in fact using page tables and TLBs, very similar to what we have here, but
maps that into an intermediate physical address.
>>: Yes. But my other question is that's usually done directly by the hardware
[inaudible] so you're able to trap it after the hardware --
>> Rustan Leino: Right. All of that happens in the guest mode. And then that
whole process in the end comes up with one IPA, one intermediate physical
address, and that's what we get. So we don't actually care -- a guest OS could
decide not to use that virtualization mechanism or come up with its IPAs in any
kind of way. But, of course, the most likely thing is that it would do it in just that
way.
>>: So as a second --
>>: [inaudible].
>>: So second level. And then there's -- I can't imagine that that would be fast at
all unless there was some specialized hardware also doing that process.
>> Rustan Leino: You're right. I mean, we would assume hardware like that.
And actually we've gone through and thought a lot about how that works as well.
But for our modeling it plays no role. Well, actually, that's not right. In the end it
will play a role, because when we have to take something like a skip action,
something that is done by each guest, we need to be able to refine it into what
the hardware does for those sorts of things.
But as far as the -- I mean, all of these things, the translation from IPAs to PAs,
we don't care how the guest comes up with its IPAs.
>> Jean-Raymond Abrial: [inaudible] there are two TLBs, the TLB for the guest
and the TLB for the hypervisor. So the TLB for the guest, we don't care; this is
the business of the guest. And the TLB of the guest maps a virtual address of
the guest to an IPA, and then the guest wants to write directly to this IPA, and
this is the part that is trapped by the hypervisor.
>>: This is developed specifically for hardware that has the ability to trap the
second level?
>> Rustan Leino: We're assuming that there's such a SLAT.
>>: [inaudible]
>> Rustan Leino: You could refine --
>>: You have to believe that you've got the chance to get in there --
>>: [inaudible].
>>: That's what I'm saying is you have to know that you can trap it after.
Otherwise you would have to simulate errors [inaudible] --
>>: [inaudible]
>> Rustan Leino: Right. And also -- by the way, we didn't look at the details in
this talk about the interrupts, but the interrupts involve very similar things,
because of the virtualization that's going on: the hypervisor gets the physical
interrupts, and it will then do things to mark them as being active, that is, that
they are being handled, and then forwards them on to the virtualization hardware,
and that goes to each guest. And at that point, what the guest does -- I mean,
whether the guest will take that interrupt or what it will do in its interrupt
processing routine, whatever -- we don't care.
We know a few things. For example, there are a bunch of priorities that guests
can set. And when you set those priorities, then you can get a stack of
interrupts. We're modeling that. And then the guests are supposed to peel off
these things -- I mean, pop them in the opposite order from the way that they
were pushed. So we're guessing that the hardware would trap any violations of
such a thing. But, again, most of that is just independent of what the hypervisor
does. The hypervisor just tries to simulate the physical pieces of hardware in a
virtualized way to each guest.
By the way, I should have said something here about initialization. The one thing
that we have not looked at is how you initialize all of these tables and the
interrupt maps and so forth. And there, our understanding is that there's
something in the BIOS or something that would tell the hypervisor how to set
things up, but we've not looked at that. That needs to be done. That's a missing
piece.
>> Jean-Raymond Abrial: One thing that is very important is the separation of
future software events and hardware events. The software events will give rise to
code, but the hardware events are important, too, because we have to check
that the physical hardware corresponds to the model we've made of it. And so
we have to dig into the documentation, or into the real physical hardware, to see
whether our events match, because our software will be correct with regard to
those physical events corresponding to the hardware. Now, if the hardware is
doing something different, then we have some problem, of course. So those
events are also important.
And I think, going even further, it could also be used by the hardware people.
They could formalize the future hardware and then implement it in the circuit.
>> Rustan Leino: All right. Anything else? All right. Thanks very much.
>> Jean-Raymond Abrial: Thank you.
[applause]