31885
>> SUMAN NATH: Hello, everybody. It is my pleasure to introduce Felix Lin
from Rice University. He works at you [indiscernible]. His research is in
the area of mobile and wearable systems. In particular, he tries to develop
low-level software to exploit the emerging features of these devices. For
his work, he recently got the Best Paper Award at ASPLOS. Today he'll talk
about that recent work and a few other projects that he did in his Ph.D.
With that, Felix.
>> FELIX XIAOZHU LIN: Thank you so much. Hi. So have you guys ever
wondered if we can span one mobile system over highly different processors
for both performance and power efficiency? In my work, I redesign system
software to make this happen. Now let's start from why we should do this.
Power efficiency rules. It has become the major concern for many, many
systems, from sensors all the way to big servers. For example, in many cases,
if you have better power efficiency, you also get better performance
because we can switch more transistors at the same time. For mobile people,
power efficiency means usability because the device itself will become cooler
and it will last longer.
For server people, higher power efficiency means dollars, millions of dollars
saved from [indiscernible]. So that's why everybody tries to improve power
efficiency. In this race, mobile is one of the pioneers. The reason is
clear: mobile devices usually have a very tight thermal budget, since we
cannot put a fan on the back of our smartphone, and also a very tight
battery budget because of the small phone [indiscernible].
The (inaudible) thing is that the two budgets do not scale up over time. There
is no Moore's law for them. So let's see one example. From 2008 to 2013,
the performance of a [indiscernible] smartphone scaled by 13 times, but the
battery capacity only scaled by two times and the form factor has almost
not changed. This trend is even worse for wearable devices, where the
scaling goes the other way. For example, Google Glass: compared to a 2008
smartphone, performance scaled by 4x, the battery is half, and the form
factor is even smaller. If we don't design wearables the
[indiscernible] way, things can be very ugly. So let me tell you one true
story.
So in our lab, we did this experiment: one video chat with Google Hangouts
on Google Glass. After 20 minutes, the surface temperature of Google Glass
was 53 degrees centigrade, which is above the human tissue damage threshold.
So that's not quite wearable.
Now this is the pressure for high power efficiency for mobile devices from
the hardware side. How about the software side? One defining characteristic
of mobile is a wide range of [indiscernible], from the background always-on
intelligence all the way to the intensive multimedia workloads. And in the
coming years, we're going to see this range grow.
For example, for natural interface and perception workloads, across the whole
spectrum, this is a 1000x difference in performance demand. Note that all
these workloads are hosted by one mobile device, and many of the workloads
actually are in the same application. Ideally, the mobile device should
provide power consumption proportionate to [indiscernible].
So, for example, for the most intensive workloads, we want to keep the power
consumption of the mobile device below ten watts, so the device does not
burn our hand.
Now, for the background always-on tasks, we want to keep power consumption
below 10 milliwatts, so they can still always run in the background and
serve us.
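As a quick check, the two power targets quoted above line up with the 1000x range mentioned earlier in the talk. The symbols below are labels introduced here for illustration, not names from the talk.

```latex
\[
\frac{P_{\text{intensive}}}{P_{\text{background}}}
  = \frac{10\ \text{W}}{10\ \text{mW}}
  = \frac{10\ \text{W}}{0.01\ \text{W}}
  = 10^{3}
\]
```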
>>: How do you get those numbers?
>> FELIX XIAOZHU LIN: From [indiscernible]. So there is a white paper, a
picture [indiscernible] paper, for 21st century [indiscernible] research, and
they have these numbers as long-term goals.
>>: [Indiscernible]?
>> FELIX XIAOZHU LIN: That's even more ambitious. Definitely we should have
targeted that. Yeah.
So the thing is, how can one mobile device provide a 1000x power
difference? So that's the problem.
Now, today, a recognized approach to achieve a 1000x power difference is
hardware heterogeneity. The idea is quite intuitive. Instead of having one
unified processor for all kinds of workloads, we should have different
hardware for different workloads. So ideally, we can map different workloads
to different hardware, meet the performance demand, and also have the
best power efficiency.
However, in today's smartphones, because we have to achieve a 1000x
power difference, the asymmetry or the heterogeneity among those
processors has to be very high. This usually gives us no [indiscernible]
among those processors. Now, let me explain why.
Let's look at the difference between power and performance here.
>>: We've had this for a while, right, with [indiscernible]? So even the
earliest iPods of a CPU [indiscernible]. So it's just a set of
single-purpose basic cores that did one thing and you didn't need cache
coherence because you just sent this [indiscernible] data and let it process,
right?
>> FELIX XIAOZHU LIN: Yeah.
>>: So can you talk a little bit about that world versus the world you're
talking about? [Indiscernible] difference there as well.
>> FELIX XIAOZHU LIN: Right. So I think there are two dimensions. One
dimension is the asymmetry, right? You have very different general purpose
processors. The other dimension, the one you're talking about, is
specialization. Basically you have CPUs, GPUs, and ASICs. I think we can
explore both dimensions. General purpose is easier for application
programmers to program. Okay?
So let's look at the difference between power and performance. Let's assume
we have one mobile device and there is only one core in this device. If we
want to [indiscernible] power consumption for some light workloads, we can do
DVFS. So this is a classic technique basically to reduce the [indiscernible]
rate and to save some power.
Now if we want to further reduce power consumption for some even lighter
workloads, we can incorporate another [indiscernible] core in the same
coherence domain. This is the case for [indiscernible]. This will give us
another 10x power saving. But if we want to [indiscernible] power consumption
for some light background always-on tasks, we have to have another, even
weaker core in a separate coherence domain. So you may have questions: why do
we have to have separate coherence domains? There are many reasons.
Now, here are the top three reasons. Number one, [indiscernible] is
complicated to design. I talked to many architects. It's complicated to
design. By removing the global heterogeneous coherence, it's easier for
hardware designers to incorporate multiple asymmetric cores in one mobile
system.
>>: Are you suggesting no coherence but still having a shared
[indiscernible] space for separate [indiscernible]?
>> FELIX XIAOZHU LIN: So at the lowest level, so the [indiscernible] waste
[indiscernible]. So there could be shared memory, or there could be
non-shared memory. From the software perspective, there could be a shared
[indiscernible] space. There could be [indiscernible] space. I will talk
about that later.
So number one, it's easier to incorporate asymmetric cores [indiscernible]
designers. Number two, by removing global heterogeneous coherence, so cores
are losing [indiscernible], it's easier for hardware designers to design
finer-grained power domains. So those power domains could be dependent on
[indiscernible] time to save power more aggressively.
Number three, because we do not have the global hardware-coherent
interconnect, the interconnect and the cache controllers themselves could be
even lower power. So we can afford them to be always on, to serve the
background tasks.
Now, with separate coherence domains, we can have another 20x power saving
for this background intelligence.
>>: [Indiscernible]?
>> FELIX XIAOZHU LIN: So I think we, by doing it in software, we can put the
complexity in software. [Indiscernible] software to define the policy.
Yeah.
>>: But is it more power?
>> FELIX XIAOZHU LIN: There's some debate. So this is the debate from like
20 years ago. I believe you can do it more efficiently in hardware
these days, but the thing is, number one, hardware is not there yet. Number
two, we can have more flexible policy by doing it in software.
>>: I guess I'm just [indiscernible] motivation for separate coherence
domains.
>> FELIX XIAOZHU LIN: Right, right. Right.
>>: And now you're going to put coherence back on top of that. Do we just
throw away our power efficiency?
>> FELIX XIAOZHU LIN: No. The thing is, by implementing cache coherence
in software, we can decide, for example, that this state is not coherent
and this state is coherent. Basically like software-defined memory
coherence. Yes?
>>: So it sounds like you just said you want to push more complexity up to
the programmer and they have to think about how data is cache coherent across
these coherence domains?
>> FELIX XIAOZHU LIN: So it's our job to deal with the complexity. By us, I
mean system software designers, not [indiscernible] programmers.
>>: Okay. I expect you're going to get to this, but I'm curious to see how
this is going to manifest to the programmer, because I suspect you can't hide
this with layers of system software.
>> FELIX XIAOZHU LIN: I agree, yeah, yeah, yeah. We should expose something
to the programmers. Right? I will talk about that. Okay.
So because of the motivation, right, hopefully I got that across:
[indiscernible], we can achieve a 1000x power difference in one mobile
device. Now the design of multiple coherence domains actually is everywhere.
It's not some research idea. It is found in today's mobile devices,
from wearables to smartphones to tablets.
For example, if we look into the iPhone 5S, [indiscernible] of iPhone 5S,
we not only have the high-power, high-performance processor for
demanding tasks like browsing and video streaming, we also have the low-power
processor, M7, for background tasks. They do not have [indiscernible]
coherence, which means they are in separate coherence domains.
Now, with this architecture, let's define our architectural model more
concretely. We assume there's one mobile device with highly
asymmetric cores, hosted in several coherence domains.
There is [indiscernible] coherence within each domain but not across domains.
Cores in different domains can talk to each other by message
passing. [Indiscernible] message passing at the lowest level. And they can
interrupt each other, actually. With this architecture, the holy grail
for software is to span across multiple coherence domains for power
efficiency and performance.
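The architectural model just described can be sketched as a toy simulation. This is only an illustration under assumptions: the `Domain` class, its `send` and `drain` methods, and the `unread_count` key are hypothetical names, and Python dictionaries stand in for each domain's coherent memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One coherence domain: memory is coherent inside it, not across."""
    name: str
    memory: dict = field(default_factory=dict)
    inbox: deque = field(default_factory=deque)
    interrupts: int = 0

    def send(self, other, key, value):
        # Explicit software message; hardware does not propagate this write.
        other.inbox.append((self.name, key, value))
        other.interrupts += 1  # models the cross-domain interrupt

    def drain(self):
        # Apply every pending message to this domain's local memory.
        while self.inbox:
            _sender, key, value = self.inbox.popleft()
            self.memory[key] = value


strong = Domain("strong")
weak = Domain("weak")

# A write in the strong domain is not visible in the weak domain...
strong.memory["unread_count"] = 3
visible_before = "unread_count" in weak.memory

# ...until software sends it across and the receiver applies it.
strong.send(weak, "unread_count", 3)
weak.drain()
visible_after = weak.memory.get("unread_count")
```

The point of the sketch is only that nothing crosses a domain boundary unless software moves it there.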
Let me use one example. E-mails and conversation. So in my e-mail
application, the background synchronization
[indiscernible] tasks [indiscernible] from time to time should be able to run
on the low-power cores in the low-power domain. We call it the weak
domain. Now the flashy shiny new interfaces, like e-mail composition, should
be able to run on the high-power cores so we can have the best
performance and also the best user experience.
Now these two parts run in one application and they share a lot of
state within that application. For example, e-mail contents, [indiscernible],
and even the in-memory database cache.
To realize this scenario, the biggest problem is that there is no
coherent memory program state supported by hardware. To see why this is a
problem, let's look at the good old days to see how software can easily
[indiscernible] cores in one coherent machine.
Let's say you have four cores in one coherent machine. This is the case
for laptops, for desktops, for most of the [indiscernible] servers.
Each core has its own cache. The caches are separate but kept
coherent by the hardware. Let's say we have an OS that easily [indiscernible]
spans across all cores. The OS can be Linux, Windows, or any legacy OS.
Let's say we have a [indiscernible] thread on core zero. This
thread makes a [indiscernible] to create a socket. As part of [indiscernible]
the attached state will be cached [indiscernible] here. Later, let's
assume this thread has migrated to core two. It can use the socket S that
was created previously, transparently, because the hardware will
automatically propagate the cache update from core zero to core two. All of
this is done by hardware. Neither the user thread nor the OS has to take care
of this. This is transparent to them. Everything is simple.
Now in our [indiscernible] picture, things are different. We have several
coherence domains. The hardware will automatically propagate the cache update
within one domain but not across domains. So inter-domain communication
requires explicit software operations, which are way slower than hardware
coherence, usually by orders of magnitude.
So to this end, my work is to enable the software stack, from application to
OS, to span across multiple coherence domains. To achieve this goal, I have
three objectives. Number one, we have to simplify application programming,
because there are millions of mobile applications out there and they are used
to a unified memory view and one single OS view. Number two, we
have to simplify OS engineering, because a mobile OS these days is a huge
[indiscernible] base. Android has [indiscernible] million lines of code. We
simply cannot do everything from scratch.
Number three, we cannot sacrifice performance. Nobody wants to use a slow
system and [indiscernible] performance. We want to maximize power
efficiency.
Now, to achieve the three design objectives, I will redesign system software.
In redesigning, I will follow three design principles. Number one,
because we have asymmetric hardware right now, we should build asymmetric
system software. Yes?
>>: So when you're dealing with a disparity cores and address
[indiscernible], you could think about a single OS which would have spanning
at four [indiscernible] image with [indiscernible] application, creating
[indiscernible] application for the different services [indiscernible]. How
did you decide to go with a single [indiscernible] single application and a
seamless view versus one that explicitly breaks apart these different domains
and asks the programmer to [indiscernible] building services for each part?
>> FELIX XIAOZHU LIN: Well, I think, for our application programming
model, it has a [indiscernible] which still requires the application
programmer to specify that this piece of code goes to the strong domain and
this piece of code goes to the weak domain. That's the only thing they have
to do: basically implement different threads for different domains. But we do
provide the unified memory address space and one single OS image. So that's
our approach.
Okay. So going back to the principles we follow. Number one, because
we have asymmetric hardware, we build asymmetric system
software.
Number two, we want software-defined memory coherence, which basically means
software defines the memory coherence policy: which state is coherent and
shared, and which state is replicated and non-shared.
And number three, we will refactor the existing system software rather than
building everything from scratch. Now --
>>: I have a question before you go ahead. I'm new to this. So can you
tell [indiscernible] you said about the M7[indiscernible] and the M5 being
[indiscernible], right? So they [indiscernible]. What kind of coherence do
they implement? Do they implement it in software or is it hardware or --
>> FELIX XIAOZHU LIN: Right. So I'm going to get to that later
but I'll give you a preview. Basically they try to dodge this problem.
Apple just does not allow you as an application programmer to write any
application for the M7. They will provide you some predefined libraries.
>>: Why do they do that?
>> FELIX XIAOZHU LIN: Because they don't have a good programming model.
Yeah.
So my work has two [indiscernible] pieces. Number one, to enable user
programs to span across coherence domains. And number two, to span a single
OS image. Let's start with spanning a single user program.
As we mentioned before, the biggest challenge in running an
application on this architecture is that the program state is not coherent.
Right. As I mentioned, today's state-of-the-art programming framework tries
to dodge this problem by only allowing application programmers [indiscernible]
application for the strong processor, for the main processor here.
>>: Don't GPUs let you have effectively two coherence domains to kind of
incoherent memory spaces [indiscernible] whatever you want for both of them?
>> FELIX XIAOZHU LIN: Right. So there is some research on that, actually.
At the lowest abstraction, they are in separate coherence
domains, but there's some research building some [indiscernible] distributed
shared memory to bridge the two memory spaces. Yeah. That's related. That's
related. Okay.
So in the state of the art, you only write the application for the strong
domain, for the most powerful processor, and vendors like Apple and Microsoft
will provide libraries, only available for the lowest power core. At run time
the two pieces of code talk to each other by message passing, so this
approach works. But it's very inflexible and actually incurs a lot of
development overhead.
>>: It's not an actual library executing. They provide a library interface
that does message [indiscernible] must be a service.
>> FELIX XIAOZHU LIN: Yeah. The OS [indiscernible]. Yeah.
So can we do better? We really want to enable one application entirely from
the application developer to span across multiple coherence domains, but
here, in order to enable this, we have to face the challenge. How do we
relieve programmers from dealing with noncoherent program state?
So our solution is called Reflex. Reflex is a distributed shared memory, and
it's one layer below the application. So the idea is kind of intuitive. We
will implement software cache coherence in this layer.
Now you may wonder what the novelty is, right? Distributed shared memory is
an old idea from 20 years ago, when people were talking about running one
application across multiple machines in one cluster. But those classic designs
do not serve our purpose well, because one of our major design goals is
energy efficiency. So we have to adopt this asymmetric design. We
have to keep the strong domain, the powerful processor, asleep as much as
possible.
Now let me give you a minimal example to illustrate how this asymmetric
design works. Let's say we have two coherence domains. One is strong and hosts
the powerful processor; one is weak and hosts the low-power processor. Both of
them share one memory object, which is an array here. According to our
design, any shared memory object is always hosted by the weak domain, which
is its home.
Now, if the strong domain wants to access the memory object, it has to send a
request to the weak domain. The weak domain will respond and the
memory object will be copied, moved from its home to the requester, from the
weak guy to the strong guy. So this is natural.
On the other hand, if the weak domain tries to access the memory object here,
it cannot send any request. All it can do is passively wait until the
strong domain finishes accessing the memory object and writes it back, so
the weak domain can go ahead and access the memory object.
So this is a minimal example, but hopefully you can get the asymmetric flavor
here.
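The minimal example above can be mimicked with a small sketch. It is a toy model under assumptions, not the Reflex implementation: the `SharedObject` class and its method names are hypothetical, and a plain attribute stands in for the request and reply messages.

```python
class SharedObject:
    """A shared object whose home is always the weak domain."""

    def __init__(self, data):
        self.data = list(data)
        self.holder = "weak"          # the object starts at its home
        self.requests_from_strong = 0
        self.requests_from_weak = 0   # stays 0 by design

    def strong_acquire(self):
        # The strong domain asks the home (weak) domain for the object.
        if self.holder != "strong":
            self.requests_from_strong += 1
            self.holder = "strong"
        return self.data

    def strong_release(self):
        # The strong domain is done and hands the object back to its home.
        self.holder = "weak"

    def weak_try_access(self):
        # The weak domain never sends a request: it gets the object only
        # if it is at home, otherwise it has to wait (None here).
        if self.holder != "weak":
            return None
        return self.data


array = SharedObject([0, 0, 0])

buf = array.strong_acquire()
buf[0] = 7
blocked = array.weak_try_access()      # strong still holds it: weak waits

array.strong_release()
after_release = array.weak_try_access()
```

The asymmetry shows up in the counters: only the strong domain ever issues a request, so the weak domain never forces the strong one to wake up.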
Now, the memory model we build is release consistency, which is fairly
standard. We implement two synchronization primitives, acquire and
release. Propagating memory updates between domains is asymmetric again.
The strong domain will eagerly propagate any memory updates to the weak
domain, so the strong domain itself can go
back to sleep as soon as possible, just to save some energy.
The weak domain, on the other hand, will do this lazily, so it will only
propagate the memory updates to the other guys once it's asked. This
tries to avoid waking other domains from sleep because of coherence
communication. Now, again, this is asymmetric design. This is for power
efficiency [indiscernible]?
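The eager and lazy propagation policies can be contrasted in a short sketch. This is a simplification under assumptions: `CoherenceDomain`, `pending`, and the keys are hypothetical names, locking is not modelled, and a direct dictionary update stands in for the inter-domain messages.

```python
class CoherenceDomain:
    def __init__(self, name, eager):
        self.name = name
        self.eager = eager   # True: push updates at release time
        self.memory = {}
        self.pending = {}    # local updates not yet sent to the peer
        self.peer = None

    def write(self, key, value):
        self.memory[key] = value
        self.pending[key] = value

    def _flush(self):
        # Send all held-back updates to the peer domain.
        self.peer.memory.update(self.pending)
        self.pending.clear()

    def release(self):
        # Eager domains propagate now; lazy domains keep holding updates.
        if self.eager:
            self._flush()

    def acquire(self):
        # Ask the peer for whatever it is still holding back.
        self.peer._flush()


strong = CoherenceDomain("strong", eager=True)
weak = CoherenceDomain("weak", eager=False)
strong.peer, weak.peer = weak, strong

# Strong domain: updates reach the weak domain at release, so it can sleep.
strong.write("draft", "hello")
strong.release()

# Weak domain: updates stay local until the strong domain asks for them.
weak.write("inbox", 5)
weak.release()
seen_before_acquire = "inbox" in strong.memory
strong.acquire()
```

In this sketch the weak domain's release never touches the strong domain, which is the property the talk motivates with avoiding wake-ups.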
>>: One assumption that you have is that you always have an ordering
among your domains. So what happens if you have domains that are not clearly
ordered, say, a GPU and a DSP programmable processor where it's not clear
which one is the strong and the weak for a general purpose program? Do you
have a --
>> FELIX XIAOZHU LIN: Right. So that's, again, another dimension of
the heterogeneity. That's specialization. I don't have a good answer
for that, but I think here we try to address asymmetric general purpose
processors. So we do have a clear order there. But I think
there is one [indiscernible] on that. They build [indiscernible] shared
memory between GPU and CPU and they try to use this same asymmetric design
for that, but in that case, they make the GPU always the host of shared
memory objects.
So the bottom line is I think this sort of asymmetric principle could
apply to that case as well, but definitely [indiscernible] think how to
design that system. Okay.
So that's about the overall design of Reflex. But we have to implement this.
We have to build this. Today, as I mentioned, many, many mobile
devices have multiple coherence domains, multiple asymmetric processors,
in one mobile device. But back in 2011, we did have these flashy shiny devices
but [indiscernible]. So we had to build some dangerous prototype ourselves.
So what we did is we took a Nokia N900 smartphone from the market,
which was a nice, shiny smartphone. Like many other Nokia
smartphones, it has this nice camera. But we had to remove the camera, rip
it out, in order to access its peripheral [indiscernible]. So we built a
custom board with two asymmetric low-power processors
[indiscernible] and we hooked the board up to the modified N900 smartphone.
And boom, we have one of the earliest prototypes in the world with asymmetric
processors in multiple coherence domains.
If we look back today, the two low-power processors we used
in those early days are actually used in products these years: the iPhone 5S
and the Moto X. They are using the processors we used before. So something
interesting.
So based on this hardware prototype, we implemented Reflex and we were able
to span one application across three different processors and gain one order
of magnitude higher power efficiency for applications. Yes?
>>: I'm assuming that you're not doing anything to keep those two separate
things coherent with one another so there are three coherent domains
[indiscernible] --
>> FELIX XIAOZHU LIN: Yes, yes.
>>: Okay. And those are kind of both not multi core processors.
>> FELIX XIAOZHU LIN: Right. Single core. All three are single core. So
[indiscernible] we didn't even have multi-core for smartphones, so it is
single core.
>>: I see. Okay. So -- all right.
>> FELIX XIAOZHU LIN: Okay. Okay. So this is about Reflex. To just briefly
recap: we build the distributed shared memory in software to provide a
unified memory [indiscernible] space to applications, so applications can
easily span across multiple coherence domains.
>>: Go back to the [indiscernible].
>> FELIX XIAOZHU LIN: Yeah.
>>: So where's the cache for each of these little processors?
>> FELIX XIAOZHU LIN: So, actually, I don't -- [indiscernible] we probably
have the [indiscernible] L1 cache, [indiscernible] non-coherent.
Separate cache. Yeah.
>>: So do you manage that cache or are you used to that?
>> FELIX XIAOZHU LIN: We can -- they have separate memory, right? So
separate physical memory. So that cache actually does not matter that much.
You have to [indiscernible] built in.
>>: Cache is coherent with respect to the CPU, though. It's managing its
own cache.
>> FELIX XIAOZHU LIN: Yeah, sure. That's hardware cache, yes. Okay.
So that's the first --
>>: They're single cores within a domain, what is the [indiscernible] of
coherence?
>> FELIX XIAOZHU LIN: There is no coherence, right. But it's still a
coherence domain, just because we are dealing with multiple cores.
>>: [Indiscernible].
>> FELIX XIAOZHU LIN: Yeah. Yeah?
>>: So each coherence domain doesn't implement any kind of coherence. You
just are implementing --
>> FELIX XIAOZHU LIN: We will see multiple cores in a
coherence domain. This is kind of an early system,
right? We'll deal with a more complicated system, you will see. Yes,
Andrew?
>>: So [indiscernible] systems, do you know if the [indiscernible]
different, are they using [indiscernible] they still have different --
>> FELIX XIAOZHU LIN: I don't have that -- yeah, yeah. [Indiscernible].
Okay. So that's about spanning the user program. That's the first piece.
Let's talk about the second piece: how to span one single OS image across
multiple coherence domains. This may be more controversial. You may have
multiple questions. Number one, why do we have to span a single OS image?
Can we pin the OS, make the OS only
available on one coherence domain?
In this case, let's say we make the OS only available on the strong domain.
To see what the problem is, let's use the e-mail example again.
A little bit more background: the e-mail
synchronization task actually is a battery killer. This is a research
result from Microsoft Research: every day, while our smartphone is in standby
mode, the e-mail synchronization task consumes about 14 percent of battery
life, so this is a lot of power.
[Indiscernible] now that we have asymmetric processors, wouldn't it be nice if
we could [indiscernible] the e-mail synchronization task on the low-power
cores here, right? So the strong core can go to sleep, remain in deep sleep
mode, so we can save energy.
However, when the OS is only available on the strong domain, this does not
work, because all network activities of the e-mail synchronization task have
to go through the OS, which is on the strong core -- yes?
>>: [Indiscernible] say the same thing about [indiscernible] ten years ago.
People would say that, you know, [indiscernible] OS and guess what
[indiscernible] do that. You go [indiscernible] today and you see
[indiscernible] cross over [indiscernible] and actually [indiscernible]. Why
not do something like that for the ten most power-consuming tasks on the
phone, e-mail seeming to be one of which?
>> FELIX XIAOZHU LIN: Right. So I think the question is do we need OS
support for the background task or not? My personal belief is we definitely
need more and more --
>>: [Indiscernible] the OS itself. Is that two different questions?
>> FELIX XIAOZHU LIN: Yes, yes, yes. So e-mail synchronization is one
example, right? You definitely need -- for example could a socket
[indiscernible] and a user [indiscernible] and IOS, I will give the
[indiscernible] later, and another example is they have this perception
application, perception task, which could be run in the background from time
to time. So in that case, you really want to have one OS. So I think there
are more and more compelling cases to have OS support on low-power domains,
low-power cores. Okay?
So basically, here is the idea. We cannot make the OS only
available on the strong domain because in that case, the background task will
[indiscernible] the strong domain from time to time because it has to use the
OS services.
Now, can we make the OS only available on the weak domain? That does not
work either because the weak domain is low power, low performance. That will
make the OS the performance [indiscernible] of the entire system, which
does not work well.
Now, another alternative design is: can we have separate OSes? One OS for
each domain. For example, OS 1 for the weak domain, OS 2 for the strong
domain. This does not work either because the two OSes have to collaborate
to manage many resources in the system. Therefore, we have to modify many
things in the two OSes, from memory management all the way to [indiscernible]
device drivers. This requires a lot of OS engineering overhead.
Application programming is complicated as well. Like I mentioned before,
if you have an application running on OS 2 and it creates a socket, you cannot
reuse that socket on OS 1 because they are separate OS images. So this
gives a lot of headaches to programmers.
Now we're clear: what we really want is one single OS image across multiple
coherence domains. But how do we do that? How do we design such an OS?
So we first have to decide the structure of the OS. One
central concern in deciding [indiscernible] is how software state is shared
among OS components. Now, from 1978, this [indiscernible] paper basically
says there are two fundamental OS structures, which basically are
shared-everything and shared-nothing. Shared-everything is like Windows and
Linux. The entire OS is fully coherent, as supported by hardware coherence
most of the time. Shared-nothing is where you have multiple OS components that
have their own separate [indiscernible] spaces and they talk to each other by
message passing.
Now, which of the two OS structures fits our purpose well?
Neither. To motivate our organization, let's see one key observation
we have. If we look into today's mobile OSes, mobile kernels, for example,
Linux, we can roughly categorize all OS components or OS services into a
few sets. One set is core services. Basically they are like the small
infrastructure of the whole OS. Performance critical. They [indiscernible]
by a small number of people. Virtual memory, page allocation, scheduling,
et cetera.
On top of core services, there is a large set of
extended services, like device drivers, network, file systems. They are
large in number, and they are developed by a lot of people, Microsoft
[indiscernible] Samsung. Actually, there is an ecosystem behind them.
Compared to core services, they are less performance critical.
Now, with this observation, we propose our OS model. We call it the
shared-most OS model. Basically, the core services, the small, performance
critical set, are replicated as shared-nothing spaces, one instance for each
coherence domain.
On top of core services, the large number of extended services can be
transparently shared across coherence domains, so their source code can be
reused.
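The split between replicated core services and shared extended services can be pictured with a small sketch. Everything here is an assumption for illustration: `PageAllocator`, `NetworkStack`, `Kernel`, the page ranges, and the starting descriptor number are hypothetical, and the mechanism that keeps the shared service coherent across domains is not modelled at all.

```python
class PageAllocator:
    """Stand-in for a core service: one private instance per domain."""

    def __init__(self, first_page, num_pages):
        self.free = list(range(first_page, first_page + num_pages))

    def alloc(self):
        return self.free.pop(0)


class NetworkStack:
    """Stand-in for an extended service: one instance shared by all domains."""

    def __init__(self):
        self.sockets = {}
        self.next_fd = 3

    def socket(self, owner):
        fd = self.next_fd
        self.next_fd += 1
        self.sockets[fd] = owner
        return fd


class Kernel:
    """Per-domain view of the single OS image."""

    def __init__(self, domain, core, net):
        self.domain = domain
        self.core = core   # replicated: private to this domain
        self.net = net     # shared: the same object in every domain


shared_net = NetworkStack()
strong = Kernel("strong", PageAllocator(0, 1024), shared_net)
weak = Kernel("weak", PageAllocator(1024, 64), shared_net)

# A socket created through the strong domain is usable from the weak one.
fd = strong.net.socket(owner="email")
socket_visible_on_weak = fd in weak.net.sockets

# Core services do not share state: each domain allocates from its own pool.
strong_page = strong.core.alloc()
weak_page = weak.core.alloc()
```

This contrasts with the separate-OS design discussed earlier, where a socket created on one OS image cannot be reused on the other.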
If we try to plot our OS model onto a spectrum of OS structures, you can see
we're somewhere in the middle, like here. This is because we're trying to
find a balance point between the ease of engineering and programming versus
performance and power efficiency.
Now, with the OS model, let's build an OS. But before that, let me
introduce one important piece of hardware background.
So we have been talking about multiple coherence domains [indiscernible] one
system, but in real world, there are two major ways to integrate the one
mobile system. So number one is coherence domains on separate chips, like
people mentioned previously.
So in this case, each chip, each coherence domain have their separate memory
and IO. And as multiple chips are [indiscernible] through some relatively
low-speed high [indiscernible], like [indiscernible] SPI. That's a case for,
for example, our Reflex prototype, we have three different chips and also for
iPhone 5S, two separate chips. So that's a first type of integration.
The second type is multiple coherence domains on one single chip. For
example, in this case, there are two coherence domains on the same chip,
most likely the [indiscernible] physical memory and IO. I will give a couple
of examples for that later.
To add one more twist to this big picture, sometimes these two types of
integration can coexist in one mobile system. For example, this chip can
have multiple coherence domains inside it.
Now, let's look at the two types of integration again. Coherence domains on
separate chips and on the same chip. Now with this background, let's try to
implement our own OS.
Because Reflex was targeted at the multiple-chip case, it was natural for us
to try to build an OS across multiple chips. Like this. But back in that
time, we found this problem to be really difficult, because the physical
memories are separated and the interconnect is relatively slow. So we did
have a lot of concerns back in that time.
For example, how to ensure performance of the OS, because the OS is such a
performance-critical piece. And number two, should we build everything from
scratch? What kind of components can we reuse? And number three, do we have
enough architectural support to build such an OS?
Back in that time, we did not have good answers for these design questions.
So let's take one little step back. We'll try to build the OS for a slightly
different picture, which is the second type of integration, where cores are
slightly [indiscernible] in one chip.
So the advantage is that we do have shared physical memory and the
interconnect is faster because the interconnect is on chip. So these are
good for OS designers like us. So we will build the OS for this picture
first.
So the hardware platform we chose is TI OMAP4. This is a popular mobile
system-on-chip that has been used in many popular mobile devices, from
wearables to smartphones to tablets. On a TI OMAP4 chip, there are two
coherence domains. Each coherence domain hosts a different type of core. So
for example, we have the A9 core in one coherence domain, which is for
demanding tasks, and we have the M3 core for the weak domain, which is for
background tasks. They do not have global hardware cache coherence, and they
can share the same set of IO and physical memory. Different cores in
different domains talk to each other by hardware message passing, so this is
hardware support. They can pass around 32-bit messages for very short
messages. And therefore, they can interrupt each other.
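To make the 32-bit hardware message concrete, here is a minimal Python sketch of packing a request into one 32-bit word. The talk only states the message width; the split into an 8-bit opcode and a 24-bit payload, and the names `pack_message` and `unpack_message`, are assumptions made for illustration.

```python
MSG_BITS = 32
OPCODE_BITS = 8                        # assumption: top 8 bits name the request
PAYLOAD_BITS = MSG_BITS - OPCODE_BITS  # assumption: low 24 bits carry an argument
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_message(opcode, payload):
    """Pack an opcode and a payload into a single 32-bit mailbox word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 8 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 24 bits")
    return (opcode << PAYLOAD_BITS) | payload


def unpack_message(word):
    """Split a 32-bit mailbox word back into (opcode, payload)."""
    return word >> PAYLOAD_BITS, word & PAYLOAD_MASK
```

With so little room per message, anything larger than the payload field would presumably be placed in shared physical memory, with the message carrying only a short reference to it.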
Now, by applying our shared-most OS model to this hardware picture, we build
our experimental OS. We call it K2, which is a mobile OS spanning
heterogeneous coherence domains.
So inside of K2, on top of two coherence domains, we have two kernels, one
kernel for each coherence domain. Inside each kernel, like we said before,
core services are replicated, one instance for each coherence domain.
On top of core services, there is a layer of OS-level shared memory which
presents the illusion to all extended services as if they are running on a
fully coherent machine. So we can reuse the source code from those extended
services.
So let's talk about the software distributed shared memory here. It actually
is very close to what we used for Reflex. It basically tries to provide this
illusion to higher-level software, but the implementation here is different.
Because we have hardware MMU and the [indiscernible] hardware support, the
latency, the performance of distributed shared memory actually is way
better. For each access miss, we have about 50 microseconds. I will talk
about how the 50 microseconds impacts the overall performance.
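As a rough illustration of how a per-miss cost adds up, the following Python sketch simulates one shared page under a single-owner policy. Only the 50 microsecond figure comes from the talk; the single-owner policy and the names `SharedPage` and `total_cost_us` are assumptions, not a description of K2's actual protocol.

```python
MISS_COST_US = 50  # per-miss cost quoted in the talk, in microseconds


class SharedPage:
    """One page that only one coherence domain may access at a time."""

    def __init__(self, owner):
        self.owner = owner
        self.misses = 0

    def access(self, domain):
        """Return the simulated cost, in microseconds, of one access."""
        if domain == self.owner:
            return 0              # local access: no inter-domain traffic
        self.misses += 1          # access miss: ownership moves to the caller
        self.owner = domain
        return MISS_COST_US


def total_cost_us(page, trace):
    """Sum the simulated cost of a sequence of accesses by domain name."""
    return sum(page.access(domain) for domain in trace)


page = SharedPage(owner="strong")
cost = total_cost_us(page, ["strong", "weak", "weak", "strong"])
```

In this trace the page changes hands twice, so `cost` is 100 microseconds. Under this simplified model the overhead depends on how often the two domains alternate on the same page, not on the raw number of accesses.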
Now, core services. Like we have mentioned before, there are multiple core
services, and each core service is replicated as separate instances. Now,
how to coordinate these instances is case by case.
For example, let's say the page allocator. This is one important core
service. Multiple independent instances of the page allocator will
collaborate with each other to manage the same set of physical memory. K2
coordinates those instances by inserting balloon drivers into both kernels,
like this.
So you can imagine balloon drivers just like [indiscernible] device
drivers, [indiscernible] by K2. Their job is actually very simple. A balloon
driver can inflate: basically, K2 will try to get a large page block from
the local page allocator. It can also deflate: basically, K2 will put the
large page block back to the page allocator.
Now, with multiple balloon drivers, K2 can move large page blocks around
between different page allocators. So it can control how much memory is
available to each of the kernels, to each instance of the page allocator.
This design has two benefits. Number one, the balloon here is almost
transparent to the page allocator. So we do not have to modify the page
allocator. [Indiscernible] page allocator source code too much, so we can
still reuse most of the Linux page allocator source.
And then number two, because we are moving around large page blocks, we
actually can reduce the inter-domain communication for good performance. So
that's [indiscernible] page allocator.
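The inflate and deflate steps described above can be sketched in a few lines of Python. This is a toy model, not K2 code: the classes `PageAllocator` and `BalloonDriver`, the helper `move_block`, and the starting state (each balloon initially holds the blocks that belong to the other kernel) are all assumptions for illustration.

```python
class PageAllocator:
    """Stand-in for one kernel's unmodified local page allocator."""

    def __init__(self, free_blocks):
        self.free_blocks = list(free_blocks)

    def alloc_block(self):
        return self.free_blocks.pop()

    def free_block(self, block):
        self.free_blocks.append(block)


class BalloonDriver:
    """Holds the large page blocks the local kernel must not use."""

    def __init__(self, allocator, held):
        self.allocator = allocator
        self.held = set(held)

    def inflate(self):
        """Take one large block away from the local allocator."""
        block = self.allocator.alloc_block()
        self.held.add(block)
        return block

    def deflate(self, block):
        """Hand one held block to the local allocator."""
        self.held.remove(block)
        self.allocator.free_block(block)


def move_block(src, dst):
    """Move one large block from src's kernel to dst's kernel."""
    block = src.inflate()   # the source kernel gives the block up
    dst.deflate(block)      # the destination kernel may now use it
    return block


strong_alloc = PageAllocator([0, 1, 2])
weak_alloc = PageAllocator([3])
strong_balloon = BalloonDriver(strong_alloc, held=[3])
weak_balloon = BalloonDriver(weak_alloc, held=[0, 1, 2])
moved = move_block(strong_balloon, weak_balloon)
```

Note that the allocators are only touched through their ordinary allocate and free calls, which is one way to read the claim that the balloon is almost transparent to the page allocator.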
Let's look back at the structure of K2 to recap. Core services are
replicated as shared-nothing [indiscernible], and K2 will manually,
explicitly [indiscernible] them. There is a software distributed shared
memory that presents one illusion to the extended services so their source
code can be shared.
>>: Quick question. The other core services were generally easier or harder
to implement [indiscernible]?
>> FELIX XIAOZHU LIN: They are, for example, the scheduler and also the
interrupt management. So each core service is case by case. Actually it's --
I will say it's tangent. But [indiscernible] there is a small set of core
services. So that's fine.
>>: How many?
>> FELIX XIAOZHU LIN: Like right now it's 5 or 6, something like that.
Yeah. How much time do I have? 15 minutes? Okay.
So I will skip this hacking part. So [indiscernible] little hacking fun we
have.
So imagine, just to briefly summarize, we have two kernels in different
[indiscernible] sets. So they have different ISAs, two binaries. But the
thing is the two kernels will share function pointers at run time, so the
problem is if one kernel tries to follow a function pointer pointing to the
wrong instruction set, it will crash. And it's sort of an unrecoverable
crash.
So how do we do that? How do we handle the function pointer with a little
bit of compiler support and run time support? So this is hacking. Hacking
fun we have during the kernel implementation.
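The talk skips the actual mechanism, so the following Python sketch only shows one generic way to avoid following a pointer into the wrong instruction set: shared state stores an ISA-neutral handle, and each kernel resolves it through its own symbol table. This is not presented as K2's method. The table contents, the addresses, and the names `to_shared_handle` and `resolve` are all made up for illustration.

```python
# Each kernel binary places the same source-level function at a different
# address. All names and addresses here are hypothetical.
SYMBOL_TABLES = {
    "strong_isa": {"timer_cb": 0x80001000, "dma_done": 0x80002000},
    "weak_isa": {"timer_cb": 0x00100400, "dma_done": 0x00100800},
}


def to_shared_handle(isa, address):
    """Translate a local code address into an ISA-neutral handle."""
    for name, entry in SYMBOL_TABLES[isa].items():
        if entry == address:
            return name
    raise KeyError("address is not a known function entry point")


def resolve(isa, handle):
    """Translate a shared handle into the caller's own code address."""
    return SYMBOL_TABLES[isa][handle]


# The strong kernel publishes a callback; the weak kernel resolves it locally.
handle = to_shared_handle("strong_isa", 0x80001000)
weak_entry = resolve("weak_isa", handle)
```

Here `weak_entry` is the weak kernel's own address for the same function, so neither kernel ever jumps to code compiled for the other ISA.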
So I will quickly skip that.
Let's look at evaluation.
So how well does K2 work? We try to answer these questions in the
evaluation. Number one, how much energy efficiency can we get? So we compare
with Linux. Because K2, our OS, can span multiple asymmetric processors, we
can actually exploit the low-power processor for high energy efficiency. Not
surprisingly, we have much higher energy efficiency, like up to 10x. This is
good. Amazing, but as we planned.
>>: Measured by what? What's the [indiscernible] benchmark --
>> FELIX XIAOZHU LIN: We have three critical light OS benchmarks. So
[indiscernible] are OS workloads. And we actually measure power consumption
by hooking wires to the power rails on a dev board.
>>: Is there anything specific about those benchmarks?
>> FELIX XIAOZHU LIN: So [indiscernible] OS benchmarks. OS used by many,
many applications. For example, the DMA is used by almost every device
driver. [Indiscernible] how many used [indiscernible]?
>>: So where did you get [indiscernible] because I would assume that in this
case the [indiscernible] so you should have a huge [indiscernible]. And if
you do [indiscernible] the only way to control maybe a [indiscernible],
right, high-end processor?
>> FELIX XIAOZHU LIN: So the display is off. Actually, in this case we're
talking about background tasks like the e-mail synchronization in the
background and also [indiscernible] in the background, so actually the
display is off.
>>: So if I were to do the baseline of the iPhone 5S that had the
[indiscernible] processor, how would [indiscernible] compare [indiscernible]?
>> FELIX XIAOZHU LIN: Right. So iPhone 5S cannot run the OS services on that
low-power processor at all. So you have to use the strong processor, the
high-power one, for any OS services you have. So you have to
[indiscernible] the strong processor from time to time.
>>: So I'm trying to understand. [Indiscernible] --
>> FELIX XIAOZHU LIN: So the [indiscernible], actually this is energy
efficiency. So the Y axis is energy efficiency. So basically, we use a
user-space driver to stress the OS service, the same OS service.
>>: On what core is it running? [Indiscernible]?
>> FELIX XIAOZHU LIN: Oh, okay, so Linux. OS services are always running on
the strong core.
>>: Okay. So [indiscernible] that's essentially just a [indiscernible]
computation exercise, you're just showing the main core [indiscernible].
>> FELIX XIAOZHU LIN: Right, right.
>>: Your baseline is basically everything runs on the [indiscernible].
>> FELIX XIAOZHU LIN: Yes.
>>: You're not using the small one.
>> FELIX XIAOZHU LIN: Right, right. Because the [indiscernible] the one on
the top, K2, is to enable the services running on the small one, right. So
we can exploit it. We can open up this small processor to a lot of OS
services.
>>: And so those three workloads, just to clarify, those three workloads are
all things that can run on the small [indiscernible] K2.
>> FELIX XIAOZHU LIN: Yeah, yeah. Of course.
>>: Do you have any examples of [indiscernible]?
>> FELIX XIAOZHU LIN: So there are. For example, some hardware feature, by
design, is private to the big core. For example, interrupts. If the hardware
interrupt only goes to the big core, you cannot run it on a small core. But
that's a hardware limitation. Yes, Eric?
>>: So then [indiscernible], I just want to see if I understand. There's
not an actual -- there's no actual radio involved --
>> FELIX XIAOZHU LIN: No, no.
>>: If you did have a radio involved in that experiment, at that point then
the actual throughput would start to play a factor and you might actually
have to keep the radio on longer if you're using the less powerful core.
>> FELIX XIAOZHU LIN: Yeah, yeah.
>>: Might offset a lot of those --
>> FELIX XIAOZHU LIN: Yes, yes. So I think there are two effects if you
have a real radio. The idle time will be longer between two packets. That
will penalize the big core because the big core's idle time is more power
hungry. But I agree with you definitely. There is a large amount of radio
power consumption.
>>: [Indiscernible] timing, right? How is the timing or DMA [indiscernible]
affected by the [indiscernible]?
>> FELIX XIAOZHU LIN: I think the -- so we actually tried different sizes of
the work. For example, we tried different sizes of DMA, like 4 KB, 128 KB,
and one megabyte. When the work is more IO intensive, our K2 will do better
because the idle time between IO operations is actually very power hungry
for the big core, while the small core in idle draws much lower power.
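The idle-time argument can be worked through with simple arithmetic: energy is active power times active time plus idle power times idle time. The Python sketch below uses made-up numbers purely to show the shape of the calculation. None of the power or timing values come from the talk or from measurements, and `energy_uj` is a hypothetical helper.

```python
def energy_uj(active_mw, active_ms, idle_mw, idle_ms):
    """Energy in microjoules: milliwatts times milliseconds gives microjoules."""
    return active_mw * active_ms + idle_mw * idle_ms


# Hypothetical one-second window of an IO-bound workload. These numbers are
# illustrative assumptions, not measured values.
big_core = energy_uj(active_mw=1000, active_ms=100, idle_mw=100, idle_ms=900)
small_core = energy_uj(active_mw=100, active_ms=400, idle_mw=5, idle_ms=600)
```

With these assumed values the big core spends 190,000 microjoules, 90,000 of them while idle, against 43,000 microjoules for the small core, even though the small core takes four times longer to do the same work. The more of the window that is spent waiting on IO, the more the idle term dominates.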
>>: [Indiscernible].
>> FELIX XIAOZHU LIN: Right. So actually, yeah. That's one of the
[indiscernible]. The other thing is for the OS workload, if we look at the
energy per instruction, this kind of energy efficiency metric, in that case
the small core can be more power efficient than the big core because the OS
workloads are kind of irregular. You can hardly benefit from the other
[indiscernible] features like deep pipeline, branch prediction,
[indiscernible], which are featured by the strong core. Yeah?
>>: So [indiscernible] previous question, which is just take the other
extreme, hand code everything on the small core as a library or a small OS
and use those to support these benchmarks. Do you have a sense of where you
are in that spectrum? Is there still a lot of space to improve? Does the
flexibility cost you anything?
>> FELIX XIAOZHU LIN: Yes. I think in terms of energy efficiency, there is
difference because basically you -- if you can handcraft the OS, it's a sin,
right? But the thing is the development overhead is much higher. You have to
craft two separate OS images, and applications are difficult to write
because the [indiscernible] developer has to deal with two OSes.
>>: So you believe you're as efficient as hand crafted --
>> FELIX XIAOZHU LIN: Yes.
>>: -- [indiscernible] OS.
>> FELIX XIAOZHU LIN: Yes. Yes. So remember, we have three goals. Number
two is to simplify OS engineering. So this is [indiscernible].
Okay. So this is about K2. But let's try to jump out, look at the big
picture again. K2 actually spans one single OS image across different
coherence domains. One chip, one chip.
We had this goal before. We tried to build an OS on multiple chips. Back in
that time, we felt this was difficult because of the performance concern,
the OS engineering concern, and also the architectural feature concern. We
felt that was difficult, [indiscernible] the insights and knowledge from K2
we feel this is actually viable. It's doable.
So for the performance concern, we should still use the shared-most OS
model, but in this case, because cores are more loosely coupled, most state
should be replicated instead of shared coherently. For OS engineering, to
reduce the engineering overhead, we should still do a refactoring, basically
reuse OS components from the [indiscernible] code base. And for
architectural features, we really need hardware MMUs for the different chips
so we can build one unified [indiscernible] space easily.
And we really need efficient hardware messages across different chips so we
can do inter-chip communication more efficiently. So that's something we're
trying to do right now. This is ongoing work.
Now, I will [indiscernible] work. So to recap the things that have been
done, the target architecture is highly asymmetric processors in one mobile
system. And there is no global hardware cache coherence among them. So in
order to span the OS -- the entire software stack across those processors,
we have redesigned system software to achieve two important views. One is
the coherent memory view, so an application can span across coherence
domains. The other one is the single OS view, so an application can still
see one single OS.
So we have followed these principles, which we believe will be useful for
future system design. Number one, we build asymmetric system software.
Number two, we let software decide the policy of memory coherence. Number
three, we refactor the existing system software code base, rather than
rebuilding everything from scratch.
So that's the work that has been done. Let me quickly talk about the future
directions just to be on time.
So what I have done is system software for noncoherent asymmetric general
purpose processors. Looking to the future, I try to address systems
challenges from three directions.
Number one is higher heterogeneity. Number two, higher parallelism. Those
two are for mobile devices, for wearable mobile devices. And number three is
to look beyond mobile devices into larger systems.
Talk about higher heterogeneity.
So these days, we not only have general purpose processors in one system,
asymmetry in one system. We also have, for example, the computation
geography, the DSP multimedia core, sensor core in one mobile system.
Now, how do we build system software, how to build system support for them?
So this is still an open question, because some of the cores may be able to
run OS code, some of them maybe not. Some of them may be able to run the
program code, some of them maybe not.
But I think there are four aspects we should address as system software
designers. Number one is abstraction. We should provide good abstraction for
[indiscernible] to write applications for all the heterogeneous hardware.
Multiplexing: we should enable multiple programs to fairly share the same
set of heterogeneous hardware. Protection: if we want to run some user code
on one of the heterogeneous processors, we do not want it to mess up other
state in the system. And the configuration: the dependency among the types
of cores, sort of [indiscernible]. By dependencies, I mean functionality
dependency, power dependency, [indiscernible] dependency, [indiscernible]
dependency. How do we provide system support to simplify the process, and
how do we let other [indiscernible] verify this configuration? So that's a
big, big question.
Now, higher parallelism. So we not only have more types of cores. We also
have a larger number of cores. In a few years, we are going to have tens of
CPU cores and a few hundreds of GPU cores all in one mobile system. How do
we use the extra number of cores for higher interactivity? This is an old
question, actually, from 20 years ago.
In 2010, people asked how many cores do we need for one interactive system?
They say you need two cores. Now, since then ten years have passed, people
ask this question again because -- no problem. Yeah. Let's see.
>>: Were you running K2 on that last slide?
>> FELIX XIAOZHU LIN: I wish I could. Yeah. Okay. Let's play -- try to
restart. Okay.
So basically, ten years have passed since 2000. There are many, many new,
like, feature-rich [indiscernible] applications. Some people ask this
question: how many cores do we need for an interactive system?
Let's say you need one more core. But today, the extra number of cores is an
asset. We have to use it or we will waste [indiscernible]. So there is an
open question. How do we -- can we have [indiscernible] use of many cores
for interactivity?
For example, can we use extra cores to support natural user interface and
augmented reality? This sort of a new interface. And can we parallelize
interactive workloads, for example, HTML5? So I'm looking forward to working
with program manager people and augment people on this direction.
And an even crazier idea is, can we use extra cores to speculate on user
inputs, to compute all the possible outputs and have them ready before the
system sees the actual user inputs? So this can probably give us the one
millisecond [indiscernible]. So that's an even crazier idea I will try to
explore.
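The speculation idea can be sketched as precomputing one result per candidate input and then answering from that table when the real input arrives. The Python below is only a toy: worker threads stand in for extra cores, `render` is a hypothetical placeholder for expensive UI work, and the names `speculate` and `respond` are assumptions, not part of any system described in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def render(state, user_input):
    """Hypothetical stand-in for the expensive work of producing one output."""
    return f"{state}+{user_input}"


def speculate(state, possible_inputs, workers=4):
    """Precompute the output for every candidate input on spare workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {i: pool.submit(render, state, i) for i in possible_inputs}
        return {i: future.result() for i, future in futures.items()}


def respond(precomputed, state, actual_input):
    """Serve a precomputed output if the guess covered it, else compute now."""
    if actual_input in precomputed:
        return precomputed[actual_input]
    return render(state, actual_input)


table = speculate("home", ["tap", "swipe"])
hit = respond(table, "home", "tap")      # served from the speculated table
miss = respond(table, "home", "pinch")   # falls back to normal computation
```

The fallback path matters: a wrong or incomplete guess only costs the wasted speculative work, while the response is still correct.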
Now, look beyond mobile systems. Try to wrap up. To look beyond mobile
systems, heterogeneity is everywhere in [indiscernible], for example, GPU,
CPU. This is probably the most successful heterogeneous system in the last
ten years, and even if you look at the edge of the cloud, there is something
called the [indiscernible] station, which is highly heterogeneous as well.
Same as [indiscernible] centers.
For example, the TI [indiscernible] is heterogeneous [indiscernible] for a
similar market. Inside one Keystone SoC, there are a bunch of general
purpose ARM processors and a bunch of DSPs so that we can run the general
purpose workloads and the communication workloads, 3G, 4G, side by side.
In a real system, in a real server, there are a lot of Keystone SoCs
interconnected. For example, that's the case for the HP Moonshot server. In
a cloud base station, [indiscernible] center, there are hundreds of servers
like that. How do we program such a highly heterogeneous system? And how to
debug such a highly heterogeneous system? So these are open questions.
Now if you look at the very tip of the cloud, this is basically the base
station, the cellular tower our smartphone directly talks to. They are
heterogeneous as well. This [indiscernible] they have built, the clay has
built actually, [indiscernible]. So the idea is basically the spectrum is
really expensive. It's a premium in today's wireless communication. In
order to improve the utilization of the spectrum, we can use multiple
antennas and use a larger number of heterogeneous computing resources.
So in this direction, I'm looking forward to working with [indiscernible]
people to [indiscernible] the system software challenges here.
Now finally, let me try to position myself in the computer stack. So I'm
working on the low level system software, which is on the boundary between
the hardware and the software. So I try to codesign the system software by
working closely with architecture and hardware people. And I'm trying to
look into the hardware [indiscernible] to look for this cool [indiscernible]
support for system software.
I will codesign the system software with folks from other areas: compilers,
programming languages, networking, and visual systems. And with the
codesigned system software, hopefully we can provide good, strong support
for all these high-level applications, for example, algorithm, computer
[indiscernible], motion [indiscernible]. And we can provide a strong
guarantee for high-level concerns like security and privacy.
So to recap the whole talk, what I've done is to redesign system software
for noncoherent asymmetric processors so one software stack can span the
[indiscernible] with multiple coherence domains.
Looking to the future, I'm ready to address systems challenges from three
directions: Higher heterogeneity, higher parallelism, and the larger
systems.
Thank you.
[applause]
>> FELIX XIAOZHU LIN: Yes?
>>: So I had a question about application for [indiscernible]. Seems like
in mobile you want it to be easy to make programs for these kinds of places
whether they're heterogeneous or not. And I [indiscernible] because that's
[indiscernible] interface that people are comfortable with. I think that one
of the downsides of shared memory is that the cost of communication between
different parts of the system is implicit. When you have hardware support
for coherence, I think people are willing to tolerate that implicit cost
[indiscernible] unpredictable performance because [indiscernible] support and
the costs aren't that high. Now you're kind of proposing that instead of
hardware coherence, we use software coherence to do the same kind of thing,
and it seems like the performance predictability is going to -- that
performance is going to be much less predictable because now the costs will
be higher and communication actually happens through that implicit mechanism.
So I guess the question here is did you look at the tradeoffs in using what
you have described versus using something that makes the performance
difference explicit if you have operations that are going to be doing
communication or not doing communication because the performance cost is
higher when you use the software --
>> FELIX XIAOZHU LIN: Exactly. That's a good point. So right now, the user
programming model we are proposing is user [indiscernible] different threads
for different coherence domains, because the fundamental reason is that
migration is hard. Migration is hard from one [indiscernible] to domain.
This also makes the performance boundary explicit, so [indiscernible] this
thread is running on a high-power domain, this thread is running on a
low-power domain. And they know that the sharing between the two threads
actually goes across domains.
But whether this is ideal, I don't know. But the thing is I think it
definitely makes sense to provide some sort of hint, like bidirectional
hints, from the system to the application and also from the application to
the system. Yeah.
But one good news we have is the contention between these two parts is not
that high in mobile in the background [indiscernible] task [indiscernible].
Because they don't communicate, like, all the time. So that sort of relieves
the problem.
>>: Just a follow-up question. Given the [indiscernible] you guys addressed
that problem is how easy is it for an application programmer to identify what
should go on the weak and what should go on the strong? I think if you just
looked at a program without understanding what it does, it might be difficult
to --
>> FELIX XIAOZHU LIN: Right. So basically, right now, next application, the
boundary is implicit. But I believe it's kind of doable for a
[indiscernible] programmer to specify the performance need, but not the
power need. But [indiscernible] specify the performance need, I think we can
do that kind of partition. My philosophy is programmers only have to reason
about the performance, not about the power consumption. But that's just me.
Yeah.
>>: Okay.
>> SUMAN NATH: Any other questions? Okay. Thank you.
>> FELIX XIAOZHU LIN: Thank you. [applause]