>> SUMAN NATH: Hello, everybody. It is my pleasure to introduce Felix Lin from Rice University. He works at [indiscernible]. His research is in the area of mobile and wearable systems. In particular, he develops low-level software to exploit the emerging features of these devices. For this work, he recently got the Best Paper Award at ASPLOS. Today he'll talk about that recent work and a few other projects from his Ph.D. With that, Felix.
>> FELIX XIAOZHU LIN: Thank you so much. Hi. So have you guys ever wondered if we can span one mobile system over highly different processors, for both performance and power efficiency? In my work, I redesign system software to make this happen. Now let's start from why we should do this. Power efficiency rules. It has become the major concern for many, many systems, from sensors all the way to big servers. For example, in many cases, if you have better power efficiency, you also get better performance, because we can switch more transistors at the same time. For mobile people, power efficiency means usability, because the device itself will be cooler and will last longer. For server people, higher power efficiency means dollars, millions of dollars saved from [indiscernible]. So that's why everybody tries to improve power efficiency. In this race, mobile is one of the pioneers. The reason is pretty clear. Mobile devices usually have a very tight thermal budget. We cannot put a fan on the back of our smartphone. They also have a very tight battery budget because of small phone [indiscernible]. The [inaudible] thing is the two budgets do not scale up over time. There is no Moore's law for them. So let's see one example. From 2008 to 2013, the performance of a [indiscernible] smartphone has scaled by 13 times. But the battery capacity only scaled by two times, and the form factor has almost not changed. So this trend is even worse.
The scaling goes the other way for wearable devices. For example, Google Glass. Compared to a 2008 smartphone, the performance scaled by 4x, the battery is half, and the form factor is even smaller. If we don't design wearables the [indiscernible] way, things can be very ugly. So let me tell you one true story. In our lab, we did this experiment with one video chat, Google Hangouts, on Google Glass. After 20 minutes, the surface temperature of Google Glass was 53 degrees Celsius, which is above the human tissue damage threshold. So that's something you don't -- it's not quite wearable. Now, this is the pressure for high power efficiency on mobile devices from the hardware side. How about the software side? One defining characteristic of mobile is a wide range of [indiscernible], from the background always-on intelligence all the way to the intensive multimedia workloads. And in the coming years, we're going to see this range grow. For example, for natural interface and perception workloads, across the whole spectrum, this is a 1000x difference in performance demand. Note that all these workloads are hosted by one mobile device, and many of the workloads actually are in the same application. Ideally, the mobile device should provide power consumption proportional to [indiscernible]. So, for example, for the most intensive workloads, we want to keep the power consumption of the mobile device below ten watts, so the device does not burn our hand. Now, for the background always-on tasks, we want to keep power consumption below 10 milliwatts, so they can always run in the background and serve us.
>>: How do you get those numbers?
>> FELIX XIAOZHU LIN: From [indiscernible]. So there is a white paper, a [indiscernible] paper, for 21st Century [indiscernible] research, so they have these numbers as long-term goals.
>>: [Indiscernible]?
>> FELIX XIAOZHU LIN: Yeah. That's even more ambitious. Definitely we should have targeted that. So the thing is, how can one mobile device provide a 1000x power difference? That's the problem. Now, today, a recognized approach to achieve a 1000x power difference is hardware heterogeneity. The idea is quite intuitive. Instead of having one unified processor for all kinds of workloads, we should have different hardware for different workloads. So ideally, we can map different workloads to different hardware. We could meet the performance demand and also have the best power efficiency. However, in today's smartphones, because we have to achieve a 1000x power difference, the asymmetry or the heterogeneity among those processors has to be very high. This usually gives us no [indiscernible] among those processors. Now, let me explain why. Let's look at the difference between power and performance here.
>>: We've had this for a while, right, with [indiscernible]? So even the earliest iPods had a CPU [indiscernible]. So it's just a set of single-purpose basic cores that did one thing and you didn't need cache coherence because you just sent this [indiscernible] data and let it process, right?
>> FELIX XIAOZHU LIN: Yeah.
>>: So can you talk a little bit about that world versus the world you're talking about? [Indiscernible] difference there as well.
>> FELIX XIAOZHU LIN: Right. So I think there are two dimensions. One dimension is the asymmetry, right? You have very different general-purpose processors. The other dimension, the one you're talking about, is specialization. Basically you have CPUs, GPUs, and ASICs. I think we can explore both dimensions. General purpose is easier for application programmers to program. Okay? So let's look at the difference between power and performance. Let's assume we have one mobile device and there is only one core in this device. If we want to [indiscernible] power consumption for some light workloads, we can do DVFS.
So this is a classic technique, basically to reduce the [indiscernible] rate and save some power. Now if we want to further reduce power consumption for some even lighter workloads, we can incorporate another [indiscernible] core in the same coherence domain. This is the case for [indiscernible]. This will give us another 10x power saving. But if we want to [indiscernible] power consumption for some light background always-on tasks, we have to have another, even weaker core in a separate coherence domain. So you may ask, why do we have to have separate coherence domains? There are many reasons. Here are the top three. Number one, [indiscernible] is complicated to design. I talked to many architects. It's complicated to design. By removing the global heterogeneous coherence, it's easier for hardware designers to incorporate multiple asymmetric cores in one mobile system.
>>: Are you suggesting no coherence but still having a shared [indiscernible] space for separate [indiscernible]?
>> FELIX XIAOZHU LIN: So at the lowest level, so the [indiscernible] waste [indiscernible]. So there could be shared memory, could be non-shared memory. So from the software perspective, there could be a shared [indiscernible] space. There could be [indiscernible] space. I will talk about that later. So number one, it's easier to incorporate asymmetric cores [indiscernible] designers. Number two, by removing global heterogeneous coherence, so cores are loosely [indiscernible], it's easier for hardware designers to design finer-grained power domains. So those power domains could be dependent on [indiscernible] time to save power more aggressively. Number three, because we do not have the global hardware coherent interconnect, the interconnect itself and the cache controller could be even lower power. So we can afford to keep them always on, to serve the background tasks. Now, with separate coherence domains, we can have another 20x power saving.
For this background intelligence.
>>: [Indiscernible]?
>> FELIX XIAOZHU LIN: So I think, by doing it in software, we can put the complexity in software. [Indiscernible] software to define the policy. Yeah.
>>: But is it more power?
>> FELIX XIAOZHU LIN: There's some debate. So this is a debate from like 20 years ago. But in -- I believe you can do it more efficiently in hardware these days, but the thing is, number one, the hardware is not there yet. Number two, we can have more flexible policy by doing it in software.
>>: I guess I'm just [indiscernible] motivation for separate coherence domains.
>> FELIX XIAOZHU LIN: Right, right. Right.
>>: And now you're going to put coherence back on top of that. Do we just throw away our power efficiency?
>> FELIX XIAOZHU LIN: No. The thing is, by having -- basically by implementing the cache coherence in software, we can decide, for example, that this state is not coherent and this state is coherent. Basically like software-defined memory coherence. Yes?
>>: So it sounds like you just said you want to push more complexity up to the programmer and they have to think about how data is cache coherent across these coherence domains?
>> FELIX XIAOZHU LIN: So it's our job to deal with the complexity. By us, I mean system software designers, not [indiscernible] programmers.
>>: Okay. I expect you're going to get to this, but I'm curious to see how this is going to manifest to the programmer, because I suspect you can't hide this with layers of system software.
>> FELIX XIAOZHU LIN: I agree, yeah, yeah, yeah. We should expose something to the programmers. Right? I will talk about that. Okay. So that's the motivation, right, so hopefully I got across that [indiscernible], we can achieve a 1000x power difference in one mobile device. Now, the design of multiple coherence domains actually is not -- is everywhere. It's not some research idea. It is known in today's mobile devices, from wearables to smartphones to tablets.
For example, if we look into the iPhone 5S, [indiscernible] of iPhone 5S, we not only have the high-power, high-performance A7 processor for demanding tasks like browsing and video streaming, we also have the low-power processor, M7, for background tasks. They do not have [indiscernible] coherence, which means they are in separate coherence domains. Now, with this architecture, let's define our architectural model more concretely. We assume there's one mobile device, and there are highly asymmetric cores in this mobile device, hosted in several coherence domains. There is [indiscernible] coherence within each domain but not across domains. Cores in different domains can talk to each other by message passing. [Indiscernible] message passing at the lowest level. And they can interrupt each other, actually. So with this architecture, the holy grail for software is to span across multiple coherence domains for power efficiency and performance. Let me use one example. E-mails and conversation. So in my e-mail application, the background synchronization [indiscernible] tasks [indiscernible] from time to time should be able to run on the low-power cores in the low-power domain. We call it the weak domain. Now, the flashy, shiny new interfaces, like e-mail composition, should be able to run on the high-power cores so we can have the best power performance -- the best performance and also the best user experience. Now, these two parts run in one application, and they share a lot of state within that application. For example, e-mail contents, [indiscernible], and even the in-memory database cache. So to realize this scenario, the biggest problem is that there is no coherent memory or program state supported by hardware. To see why this is a problem, let's look at the good old days to see how software can easily [indiscernible] cores in one coherent machine. Let's say you have four cores. One coherent machine.
This is the case for laptops, for desktops, for most of the [indiscernible] servers. Each core has its own cache. The caches are separate but kept coherent by the hardware. Let's say we have an OS that can easily [indiscernible] span across all cores. The OS can be Linux, Windows, or any legacy OS. Let's say we have a [indiscernible] thread on core zero. This thread makes a [indiscernible] to create a socket. As part of [indiscernible], the attached state will be cached here. Later, let's assume this thread has migrated to core two. It can use the socket S that was created previously, transparently, because the hardware will automatically propagate the cache update from core zero to core two. So all of this is done by hardware. Neither the user thread nor the OS has to take care of this. This is transparent to them. Everything is simple. Now in our [indiscernible] picture, things are different. We have several coherence domains. The hardware will automatically propagate the cache within one domain but not across domains. So inter-domain communication requires explicit software operations, which are way slower than hardware coherence, usually by orders of magnitude. So to this end, my work is to enable the software stack, from application to OS, to span across multiple coherence domains. To achieve this goal, I have three objectives. Number one, we have to simplify application programming, because there are millions of mobile applications out there and they are used to a unified memory view and a single OS view. Number two, we have to simplify OS engineering, because a mobile OS these days is a huge [indiscernible] base. Android has millions of lines of code. We simply cannot do everything from scratch. Number three, we cannot sacrifice performance. Nobody wants to use a slow system and [indiscernible] performance. We want to maximize power efficiency. Now, to achieve the three design objectives, I will redesign system software.
In redesigning, I will follow three design principles. Number one, because we have asymmetric hardware right now, we should do asymmetric system software. Yes?
>>: So when you're dealing with disparate cores and address [indiscernible], you could think about a single OS which would have spanning at four [indiscernible] image with [indiscernible] application, creating [indiscernible] application for the different services [indiscernible]. How did you decide to go with a single [indiscernible] single application and a seamless view versus one that explicitly breaks apart these different domains and asks the programmer to [indiscernible] building services for each part?
>> FELIX XIAOZHU LIN: Well, I think the -- for the application programming model, it has a [indiscernible] which still requires the application programmer to specify that this piece of code goes to the strong domain and this piece of code goes to the weak domain. So that's the only thing they have to do. Basically, implement different threads for different domains. But we do provide the unified memory address space and one single OS image, though. So that's our approach. Okay. So going back to the principles we follow. Number one, because we have asymmetric hardware, we build asymmetric system software. Number two, we want software-defined memory coherence, which basically means software defines the policy of memory coherence: which state is coherent and shared, and which state is replicated and non-shared. And number three, we will refactor the existing system software rather than building everything from scratch. Now --
>>: I have a question before you go ahead. I'm new to this. So can you tell [indiscernible] you said about the M7 [indiscernible] and the M5 being [indiscernible], right? So they [indiscernible]. What kind of coherence do they implement? Do they implement it in software or is it hardware or --
>> FELIX XIAOZHU LIN: Right. So right now, I'm going to get to that later but I'll give you some forecast.
Basically they try to dodge this problem. Apple just does not allow you, as an application programmer, to write any application for the M7. They will provide you some predefined libraries.
>>: Why do they do that?
>> FELIX XIAOZHU LIN: Yeah. Because they don't have a good programming model. So my work has two [indiscernible] pieces. Number one, to enable user programs to span across coherence domains. And number two, to span a single OS image. Let's start from spanning a single user program. As we mentioned before, the biggest challenge in running an application on this architecture is that the program state is not coherent. Right. As I mentioned, today's state-of-the-art programming framework tries to dodge this problem by only allowing application programmers [indiscernible] application for the strong processor, for the main processor here.
>>: Don't GPUs let you have effectively two coherence domains, two kind of incoherent memory spaces [indiscernible] whatever you want for both of them?
>> FELIX XIAOZHU LIN: Right. So there is some research on that, actually, but the hardware -- at the lowest abstraction, they are in separate coherence domains, but there's some research building some [indiscernible] distributed shared memory to bridge these two memory spaces. Yeah. That's related. That's related. Okay. So in the state of the art, you only write the application for the strong domain, for the most powerful processor, and vendors like Apple and Microsoft will provide libraries only available for the lowest-power core. At run time the two pieces of code talk to each other by message passing, so this approach works. But there are -- it's very inflexible and actually incurs a lot of development overhead.
>>: It's not an actual library executing. They provide a library interface that does message [indiscernible] must be a service.
>> FELIX XIAOZHU LIN: Yeah. The OS [indiscernible]. Yeah. So can we do better?
We really want to enable one application, entirely from the application developer, to span across multiple coherence domains. But in order to enable this, we have to face the challenge: how do we relieve programmers from dealing with non-coherent program state? Our solution is called Reflex. Reflex is a distributed shared memory; it's one layer below the application. So the idea is kind of intuitive. We implement software cache coherence in our DSM layer. Now you may wonder what's the novelty, right? Distributed shared memory is an old idea from 20 years ago, when people were talking about running one application across multiple machines in one cluster. But those classic designs do not serve our purpose well, because one of our major design goals is energy efficiency. So we have to adopt an asymmetric design. We have to keep the strong domain, the powerful processor, asleep as much as possible. Now let me give you a minimal example to illustrate how this asymmetric design works. Let's say we have two coherence domains. One is strong, hosting the powerful processor; one is weak, hosting the low-power processor. And both of them share one memory object, which is an array here. According to our design, any shared memory object is always hosted by the weak domain, which is its home. Now, if the strong domain wants to access the memory object, it has to send a request to the weak domain. The weak domain will respond and the memory object will be copied, moved from its home to the requester, from the weak guy to the strong guy. So this is natural. On the other hand, if the weak domain tries to access the memory object here, it cannot send any request. All it can do is passively wait until the strong domain finishes accessing the memory object and writes it back, so the weak domain can go ahead and access the memory object. So this is a minimal example, but hopefully you can get the asymmetric flavor here.
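The asymmetric ownership rule in this minimal example can be sketched in a few lines of Python. This is only an illustration of the protocol as described in the talk, not Reflex's actual code; the class name `AsymmetricDSM`, its method names, and the message log are all hypothetical.

```python
class AsymmetricDSM:
    """Toy model: one shared object whose home is always the weak domain."""

    def __init__(self, data):
        self.data = list(data)
        self.holder = "weak"      # the weak domain is the object's home
        self.messages = []        # (sender, kind) log of inter-domain messages

    def strong_acquire(self):
        # The strong domain actively requests the object from its home.
        if self.holder != "strong":
            self.messages.append(("strong", "request"))
            self.messages.append(("weak", "object"))
            self.holder = "strong"
        return self.data

    def strong_release(self):
        # The strong domain hands the object back once it is done.
        if self.holder == "strong":
            self.messages.append(("strong", "writeback"))
            self.holder = "weak"

    def weak_try_access(self):
        # The weak domain never sends a request: it either has the
        # object at home or must wait (modeled here as returning None).
        return self.data if self.holder == "weak" else None


dsm = AsymmetricDSM([1, 2, 3])
arr = dsm.strong_acquire()
arr[0] = 9
waiting = dsm.weak_try_access()   # None: the weak domain must wait
dsm.strong_release()
visible = dsm.weak_try_access()   # the update is visible at home again
```

The point of the sketch is that every request originates from the strong domain, so the weak domain never initiates traffic that could wake the strong one.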
Now, the memory model we built is release consistency, which is fairly standard. We implement two synchronization primitives, acquire and release. Propagating memory updates between domains is asymmetric again. The strong domain will eagerly propagate any memory updates to the weak domain, so the strong domain itself can go back to sleep as soon as possible, just to save some energy. The weak domain, on the other hand, will do this lazily, so it will only propagate the memory updates to other guys once it's asked. This tries to avoid waking other domains from sleep because of coherence communication. Now, again, this is an asymmetric design. This is for power efficiency [indiscernible].
>>: One assumption that you have is that you always have an ordering among your domains. So what happens if you have domains that are not clearly ordered, say, a GPU and a DSP programmable processor, where it's not clear which one is the strong and the weak for a general purpose program? Do you have a --
>> FELIX XIAOZHU LIN: Right. So that's, again, another dimension of the heterogeneity. That's specialization. I don't have a good answer for that, but I think here we try to address asymmetric general-purpose processors. So we do have a clear order there. But I think there is one [indiscernible] on that. They build [indiscernible] shared memory between GPU and CPU and they try to use this same asymmetric design for that, but in that case, they make the GPU always the host of shared memory objects. So the bottom line is, I think this sort of asymmetric principle could apply to that case as well, but definitely [indiscernible] think how to design that system. Okay. So that's about the overall design of Reflex. But we have to implement this. We have to build this.
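The eager versus lazy propagation policy described above can be sketched as follows. This is a toy model under stated assumptions, not the Reflex implementation: the `Domain` class, the `eager` flag, and the `woken_by_peer` counter are hypothetical names used only to show why the strong domain is never woken by coherence traffic.

```python
class Domain:
    """Toy coherence domain with release-consistency style acquire/release."""

    def __init__(self, name, eager):
        self.name = name
        self.eager = eager          # True: push updates on release (strong)
        self.memory = {}
        self.pending = {}           # local updates not yet sent to the peer
        self.peer = None
        self.asleep = False
        self.woken_by_peer = 0      # wake-ups caused only by coherence traffic

    def write(self, key, value):
        self.memory[key] = value
        self.pending[key] = value

    def receive(self, updates):
        if self.asleep:             # an incoming update forces a wake-up
            self.asleep = False
            self.woken_by_peer += 1
        self.memory.update(updates)

    def release(self):
        # Eager domain pushes right away; lazy domain holds updates back.
        if self.eager and self.pending:
            self.peer.receive(dict(self.pending))
            self.pending.clear()

    def acquire(self):
        # The acquirer is awake anyway, so it asks the peer for whatever
        # the peer has been holding back (empty for an eager peer that
        # has already released).
        self.asleep = False
        if self.peer.pending:
            self.receive(dict(self.peer.pending))
            self.peer.pending.clear()

    def sleep(self):
        self.asleep = True


strong, weak = Domain("strong", eager=True), Domain("weak", eager=False)
strong.peer, weak.peer = weak, strong

strong.sleep()
weak.write("inbox", 1)
weak.release()                # lazy: nothing is sent, strong stays asleep
strong.acquire()              # strong wakes for its own work and pulls
strong.write("draft", 2)
strong.release()              # eager: pushed to the always-on weak domain
strong.sleep()
```

In this model the only domain that ever receives unsolicited updates is the weak one, which is assumed to be always on.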
So today, as I mentioned, many, many mobile devices have multiple coherence domains, multiple asymmetric processors in one mobile device. But back in 2011, we do have this flashy shiny devices but [indiscernible]. So we had to build some dangerous prototype ourselves. What we did is we took a Nokia N900 smartphone from the market, which was a nice, shiny smartphone. Like many other Nokia smartphones, it has this nice camera. But we had to remove the camera, rip it out, in order to access its peripheral [indiscernible]. So we built a custom board with two asymmetric low-power processors [indiscernible] and we hooked the board up to the modified N900 smartphone. And boom, we had one of the earliest prototypes in the world with asymmetric processors in multiple coherence domains. If we look back today, the two low-power processors we used in those early days are actually used by people these days, in the iPhone 5S and the Moto X. They are using the processors we used before. So that's something interesting. Based on the hardware prototype, we implemented Reflex, and we were able to span one application across three different processors and gain one order of magnitude higher power efficiency for applications. Yes?
>>: I'm assuming that you're not doing anything to keep those two separate things coherent with one another so there are three coherence domains [indiscernible] --
>> FELIX XIAOZHU LIN: Yes, yes.
>>: Okay. And those are kind of both not multi-core processors.
>> FELIX XIAOZHU LIN: Right. Single core. All three are single core. So [indiscernible] we didn't even have multicore for smartphones, so it is single core.
>>: I see. Okay. So -- all right.
>> FELIX XIAOZHU LIN: Okay. Okay. So this is about Reflex, to just briefly recap. We build distributed shared memory in software to provide a unified memory [indiscernible] space to applications.
So applications can easily span across multiple coherence domains.
>>: Go back to the [indiscernible].
>> FELIX XIAOZHU LIN: Yeah.
>>: So where's the cache for each of these little processors?
>> FELIX XIAOZHU LIN: So, actually, I don't -- [indiscernible] we probably have the [indiscernible] L1 cache, [indiscernible] non-coherent. Separate cache. Yeah.
>>: So do you manage that cache or are you used to that?
>> FELIX XIAOZHU LIN: We can -- they have separate memory, right? So separate physical memory. So that cache actually does not matter that much. You have to [indiscernible] built in.
>>: Cache is coherent with respect to the CPU, though. It's managing its own cache.
>> FELIX XIAOZHU LIN: Yeah, sure. That's hardware cache, yes. Okay. So that's the first --
>>: They're single cores within a domain, what is the [indiscernible] of coherence?
>> FELIX XIAOZHU LIN: There is no coherence, right. But it's still a coherence domain, just because we are dealing with multiple cores.
>>: [Indiscernible].
>> FELIX XIAOZHU LIN: Yeah. Yeah?
>>: So each coherence domain doesn't implement any kind of coherence. You just are implementing --
>> FELIX XIAOZHU LIN: We will see multiple cores in one coherence domain later, because this is kind of an early system, right? So we'll deal with a more complicated system, you will see. Yes, Andrew?
>>: So [indiscernible] systems, do you know if the [indiscernible] different, are they using [indiscernible] they still have different --
>> FELIX XIAOZHU LIN: I don't have that -- yeah, yeah. [Indiscernible]. Okay. So that's about spanning the user program. That's the first piece. Let's talk about the second piece: how to span one single OS image across multiple coherence domains. So this may be more controversial. You may have multiple questions. Number one, why do we have to span a single OS image? Can we pin the OS, make the OS available only on one coherence domain?
In this case, let's say we make the OS only available on the strong domain. To see what the problem is, let's use the e-mail example again. A little bit more background: the e-mail synchronization task actually is a battery killer. This is a research result from Microsoft Research: every day, when our smartphone is in standby mode, the e-mail synchronization task consumes about 14 percent of battery life in standby mode, so this is a lot of power. [Indiscernible] we have asymmetric processors, wouldn't it be nice if we could [indiscernible] the e-mail synchronization task on the low-power cores here, right? So the strong core can go to sleep, remain in deep sleep mode, so we can save energy. However, when the OS is only available on the strong domain, this does not work, because all network activities of the e-mail synchronization task have to go through the OS, which is on the strong core -- yes?
>>: [Indiscernible] say the same thing about [indiscernible] ten years ago. People would say that, you know, [indiscernible] OS and guess what [indiscernible] do that. You go [indiscernible] today and you see [indiscernible] cross over [indiscernible] and actually [indiscernible]. Why not do something like that for the ten most power-consuming tasks on the phone, e-mail seeming to be one of which?
>> FELIX XIAOZHU LIN: Right. So I think the question is, do we need OS support for the background task or not? My personal belief is we definitely need more and more --
>>: [Indiscernible] the OS itself. Is that two different questions?
>> FELIX XIAOZHU LIN: Yes, yes, yes. So e-mail synchronization is one example, right? You definitely need -- for example, a socket [indiscernible] and a user [indiscernible] and I/O; I will give the [indiscernible] later. And another example is this perception application, a perception task which could run in the background from time to time. So in that case, you really want to have one OS.
So I think there are more and more compelling cases to have OS support on low-power domains, low-power cores. Okay? So basically, the idea -- here it is. We cannot make the OS only available on the strong domain, because in that case the background task will [indiscernible] the strong domain from time to time, because it has to use the OS services. Now, can we make the OS only available on the weak domain? That does not work either, because the weak domain is low power, low performance. That will make the OS the performance [indiscernible] of the entire system, which is not -- which does not work well. Now, another alternative design: can we have separate OSes? One OS for each domain. For example, OS 1 for the weak domain, OS 2 for the strong domain. This does not work either, because the two OSes have to collaborate to manage many resources in the system. Therefore, we have to modify many things in the two OSes, from memory management all the way to [indiscernible] device drivers. This requires a lot of OS engineering overhead. And application programming is complicated as well. Like I mentioned before, if you have an application running on OS 2 and it creates a socket, you cannot reuse that socket on OS 1, because they are separate OS images. So this gives a lot of headaches to programmers. Now we're clear: what we really want is one single OS image across multiple coherence domains. But how do we do that? How do we design such an OS? We first have to decide the structure of the OS. One central concern in deciding [indiscernible] is how software state is shared among OS components. Now, from 1978, this [indiscernible] paper basically says there are two fundamental OS structures, in other words, shared-everything and shared-nothing. Shared-everything is like Windows and Linux. The entire OS is fully coherent, as supported by hardware coherence most of the time.
Shared-nothing is where you have multiple OS components that have their own separate [indiscernible] spaces and talk to each other by message passing. Now, which of the two OS structures fits our purpose well? Neither. So to motivate our organization, let's see one key observation we have. If we look into today's mobile OSes, mobile kernels, for example, Linux, we can roughly categorize all OS components or services into a few sets. One set is core services. Basically they are like the small infrastructure of the whole OS. Performance critical. They [indiscernible] by a small number of people. Virtual memory, page allocation, scheduling, et cetera. On top of core services, there is a large set of extended services, like device drivers, network, file systems. They are large in number, and they are developed by a lot of people, Microsoft [indiscernible] Samsung. Actually, there is an ecosystem behind them. Compared to core services, they are less performance critical. Now, with this observation, we propose our OS model. We call it the shared-most OS model. Basically, the core services, a small set, performance critical, are replicated, shared-nothing, with one instance for each coherence domain. On top of core services, the large number of extended services can be transparently shared across coherence domains, so their source code can be reused. If we try to plot our OS model on a spectrum of OS structures, you can see we're somewhere in the middle, like here. This is because we're trying to find a balance point between the ease of engineering and programming versus the performance and power efficiency. Now, with this OS model, let's build an OS. But before that, let me introduce one important hardware background. So we have been talking about multiple coherence domains [indiscernible] one system, but in the real world, there are two major ways to integrate them in one mobile system.
So number one is coherence domains on separate chips, like people mentioned previously. In this case, each chip, each coherence domain, has its own separate memory and IO. And the multiple chips are [indiscernible] through some relatively low-speed high [indiscernible], like [indiscernible] SPI. That's the case for, for example, our Reflex prototype, where we have three different chips, and also for the iPhone 5S, with two separate chips. So that's the first type of integration. The second type is multiple coherence domains on one single chip. For example, in this case, there are two coherence domains on the same chip, most likely the [indiscernible] physical memory and IO. I will give a couple of examples of that later. To add one more twist to this big picture, sometimes these two types of integration can coexist in one mobile system. For example, this chip can have multiple coherence domains inside it. Now, let's look at the two types of integration again: coherence domains on separate chips and on the same chip. With this background, let's try to implement our own OS. Because Reflex was targeted at the multiple-chip case, it was natural for us to try to build an OS across multiple chips. Like this. But back at that time, we found this problem to be really difficult, because the physical memories are separate and the interconnect is relatively slow. So we had a lot of concerns back then. For example, how to ensure the performance of the OS, because the OS is such a performance-critical piece. Number two, should we build everything from scratch? What kind of components can we reuse? And number three, do we have enough architectural support to build such an OS? Back at that time, we did not have good answers for these design questions. So let's take one little step back. We'll try to build the OS for a slightly different picture, which is the second type of integration, where cores slightly [indiscernible] in one chip.
So the advantage is that we do have shared physical memory, and the interconnect is faster because the interconnect is on chip. So these are good for OS designers like us. So we will build the OS for this picture first. The hardware platform we chose is TI OMAP4. This is a popular mobile system-on-chip that has been used in many popular mobile devices, from wearables to smartphones to tablets. On a TI OMAP4 chip, there are two coherence domains. Each coherence domain hosts a different type of core. So for example, we have the A9 core in one coherence domain, which is for demanding tasks, and we have the M3 core in the weak domain, which is for background tasks. They do not have global hardware cache coherence, and they can share the same set of IO and physical memory. Different cores in different domains talk to each other by hardware message passing, so this is hardware support. They can pass around 32-bit messages, so very short messages. And therefore, they can interrupt each other. Now, by applying our shared-most OS model to this hardware picture, we built our experimental OS. We call it K2, which is a mobile OS spanning heterogeneous coherence domains. So inside of K2, on top of two coherence domains, we have two kernels, one kernel for each coherence domain. Inside each kernel, like we said before, core services are replicated, one instance for each coherence domain. On top of core services, there is a layer of OS-level distributed shared memory, which presents the illusion to all extended services as if they are running on a fully coherent machine. So we can reuse the source code of those extended services. So let's talk about the software distributed shared memory here. It actually is very close to what we used for Reflex.
Basically we try to provide this illusion to higher-level software, but the implementation here is different. Because we have hardware MMU and the [indiscernible] hardware support, the latency, the performance of distributed shared memory actually is way better. For each access miss, we have about 50 microseconds. I will talk about how the 50 microseconds impacts the overall performance. Now, core services. Like we have mentioned before, there are multiple core services, and each core service is replicated as separate instances. Now, how to coordinate these instances is case by case. For example, let's say the page allocator. This is one important core service. Multiple independent instances of the page allocator will collaborate with each other to manage the same set of physical memory. K2 coordinates those instances by inserting balloon drivers into the multiple kernels, like this. So you can imagine balloon drivers just like [indiscernible] device drivers, [indiscernible] by K2. Their job is actually very simple. A balloon driver can inflate: basically K2 will try to get a large page block from the local page allocator. It can also deflate: basically K2 will put the large page block back into the page allocator. Now, with multiple balloon drivers, K2 can move large page blocks around between different page allocators. So it can control how much memory is available to each of the kernels, to each instance of the page allocator. This design has two benefits. Number one, the balloon here is almost transparent to the page allocator, so we do not have to modify the page allocator. [Indiscernible] page allocator source code too much, so we still can reuse most of the Linux page allocator source. And number two, because we are moving around large page blocks, we actually can reduce the inter-domain communication for good performance. So that's [indiscernible] page allocator.
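The balloon mechanism described above can be sketched in a few lines. This is only an illustrative model under stated assumptions, not K2's code: the names PageAllocator, BalloonDriver, move_block, and BLOCK_PAGES are hypothetical, blocks are represented by plain integers, and the 1024-page block size is an arbitrary choice for the example.

```python
from dataclasses import dataclass, field

BLOCK_PAGES = 1024  # hypothetical size of one "large page block", in pages


@dataclass
class PageAllocator:
    """Stand-in for one kernel's local page allocator (one per coherence domain)."""
    name: str
    free_blocks: list = field(default_factory=list)  # ids of blocks it may hand out

    def alloc_block(self):
        if not self.free_blocks:
            raise MemoryError(f"{self.name}: no free large block")
        return self.free_blocks.pop()

    def free_block(self, block):
        self.free_blocks.append(block)

    def pages_available(self):
        return len(self.free_blocks) * BLOCK_PAGES


class BalloonDriver:
    """Hypothetical balloon: it only uses the allocator's normal alloc/free calls,
    so the allocator itself needs no changes."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.held = []  # blocks currently inside the balloon

    def inflate(self):
        # Take a large block away from the local allocator.
        block = self.allocator.alloc_block()
        self.held.append(block)
        return block

    def deflate(self, block):
        # Give a block held by this balloon to the local allocator.
        self.held.remove(block)
        self.allocator.free_block(block)


def move_block(src, dst):
    """Move one large block from src's kernel to dst's kernel."""
    block = src.inflate()    # source kernel gives the block up
    src.held.remove(block)   # hand the block across domains (one short message)
    dst.held.append(block)
    dst.deflate(block)       # destination kernel can now allocate from it
    return block


strong = PageAllocator("strong-domain", [0, 1, 2, 3])
weak = PageAllocator("weak-domain", [4])
strong_balloon = BalloonDriver(strong)
weak_balloon = BalloonDriver(weak)

moved = move_block(strong_balloon, weak_balloon)
```

Because the unit of transfer is a whole block rather than a single page, one cross-domain handoff covers many later page allocations, which is the second benefit mentioned in the talk.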
Let's look back at the structure of K2 to recap. Core services are replicated as shared-nothing [indiscernible], and K2 will manually, explicitly [indiscernible] them. There is a software distributed shared memory, which will present one illusion to extended services so their source code can be shared. >>: Quick question. The other core services, were they generally easier or harder to implement [indiscernible]? >> FELIX XIAOZHU LIN: They are, for example, the scheduler, and also the interrupt management. So each core service is case by case. Actually it's -- I will say it's tangent. But [indiscernible] there is a small set of core services. So that's fine. >>: How many? >> FELIX XIAOZHU LIN: Like right now it's 5 or 6, something like that. Yeah. How much time do I have? 15 minutes? Okay. So I will skip this hacking part. So [indiscernible] little hacking fun we have. So imagine, just to briefly summarize, we have two kernels in different [indiscernible] sets. So they have different ISAs, two binaries. But the thing is, the two kernels will share function pointers at run time, so the problem is if one kernel tries to follow a function pointer pointing to the wrong instruction set, it will crash. And it's sort of an unrecoverable crash. So how do we do that? How do we handle the function pointers with a little bit of compiler support and run-time support? So this is hacking. Hacking fun we had during the kernel implementation. So I will quickly skip that. Let's look at evaluation. So how well does K2 work? We try to answer these questions in the evaluation. Number one, how much energy efficiency can we get? So we compare with Linux. Because K2, our OS, can span multiple asymmetric processors, we can actually exploit the low-power processor for high energy efficiency. Not surprisingly, we have much higher energy efficiency, like up to 10x. This is good. Amazing, but as we planned. >>: Measured by what?
What's the [indiscernible] benchmark -- >> FELIX XIAOZHU LIN: We have three critical, like, OS benchmarks. So [indiscernible] are OS workloads. And we actually measure power consumption by hooking wires to the power rails on a dev board. >>: Is there anything specific about those benchmarks? >> FELIX XIAOZHU LIN: So [indiscernible] OS benchmarks. OS services used by many, many applications. For example, DMA is used by almost every device driver. [Indiscernible] how many used [indiscernible]? >>: So where did you get [indiscernible] because I would assume that in this case the [indiscernible] so you should have a huge [indiscernible]. And if you do [indiscernible] the only way to control maybe a [indiscernible], right, high-end processor? >> FELIX XIAOZHU LIN: So the display is off, actually. In this case we're talking about background tasks, like the e-mail synchronization in the background and also [indiscernible] in the background. Actually the display is off. >>: So if I were to do the baseline of the iPhone 5S that had the [indiscernible] processor, how would [indiscernible] compare [indiscernible]? >> FELIX XIAOZHU LIN: Right. So iPhone 5S cannot run the OS service on that low-power processor at all. So you have to use the strong processor, the high-power one, for any OS services you have. So you have the [indiscernible] the strong processor from time to time. >>: So I'm trying to understand. [Indiscernible] -- >> FELIX XIAOZHU LIN: So the [indiscernible], actually this is energy efficiency. So the Y axis is energy efficiency. So basically, we use a user space driver to stress the OS service, the same OS service. >>: On what core is it running? [Indiscernible]? >> FELIX XIAOZHU LIN: Oh, okay, so Linux. OS services are always running on the strong core. >>: Okay. So [indiscernible] that's essentially just a [indiscernible] computation exercise, you're just showing the main core [indiscernible].
>> FELIX XIAOZHU LIN: Right, right. >>: Your baseline is basically everything runs on the [indiscernible]. >> FELIX XIAOZHU LIN: Yes. >>: You're not using the small one. >> FELIX XIAOZHU LIN: Right, right. Because the [indiscernible] the one on the top, K2 is to enable the services running on the small one, right. So we can exploit. We can open this small processor to a lot of OS services. >>: And so those three workloads, just to clarify, those three workloads are all things that can run on the small [indiscernible] K2. >> FELIX XIAOZHU LIN: Yeah, yeah. Of course. >>: Do you have any examples of [indiscernible]? >> FELIX XIAOZHU LIN: So there are. For example, some hardware feature, by design, is private to the big core. For example, interrupts. If the hardware interrupt only goes to the big core, you cannot run it on a small core. But that's a hardware limitation. Yes, Eric? >>: So then [indiscernible], I just want to see if I understand. There's not an actual -- there's no actual radio involved -- >> FELIX XIAOZHU LIN: No, no. >>: If you did have a radio involved in that experiment, at that point then the actual throughput would start to play a factor and you might actually have to keep the radio on longer if you're using the less powerful core. >> FELIX XIAOZHU LIN: Yeah, yeah. >>: Might offset a lot of those -- >> FELIX XIAOZHU LIN: Yes, yes. So I think there are two effects if you have a real radio. The idle time will be longer between two packets. That will penalize the big core, because the big core's idle time is more power hungry. But I agree with you, definitely. There is a large amount of radio power consumption. >>: [Indiscernible] timing, right? How is the timing or DMA [indiscernible] affected by the [indiscernible]? >> FELIX XIAOZHU LIN: I think the -- so we actually tried different sizes of the work. For example, we tried different sizes of DMA, like 4K, 128K, and one megabyte.
When it's more IO intensive, K2 will do better, because the idle time between IO operations is actually very power hungry for the big core, and the small core's idle power is much lower. >>: [Indiscernible]. >> FELIX XIAOZHU LIN: Right. So actually, yeah. That's one of the [indiscernible]. The other thing is, for the OS workload, if we look at the energy per instruction, this kind of energy efficiency metric, in that case the small core can be more power efficient than the big core, because the OS workloads are kind of irregular. You can hardly benefit from the other [indiscernible] features like deep pipelines, branch prediction, [indiscernible] no matter which is featured by the strong core. Yeah? >>: So [indiscernible] previous question, which is just take the other extreme value, hand code everything on the small core as a library or the small OS and use those to support these benchmarks. Do you have a sense of where you are in that spectrum? Is there still a lot of space to improve? Does the flexibility cost you anything? >> FELIX XIAOZHU LIN: Yes. I think in terms of energy efficiency, there is a difference, because basically if you can handcraft the OS, it's a sin, right? But the thing is, the development overhead is much higher. You have to craft two separate OS images, and applications are difficult to write because [indiscernible] developer has to deal with two OSes. >>: So you believe you're as efficient as hand crafted -- >> FELIX XIAOZHU LIN: Yes. >>: -- [indiscernible] OS. >> FELIX XIAOZHU LIN: Yes. Yes. So remember, we have three goals. Number two is to simplify OS engineering. So this is [indiscernible]. Okay. So this is about K2. But let's try to jump out and look at the big picture again. K2 actually spans one single OS image across different coherence domains. One chip, one chip. We had this goal before.
We tried to build an OS on multiple chips. Back at that time, we felt this was difficult because of the performance concern, the OS engineering concern, and also the architectural feature concern. We felt that was difficult. [Indiscernible] insights and knowledge from K2, we feel this is actually viable. It's doable. For the performance concern, we should still use the shared-most OS model, but in this case, because the cores are more loosely coupled, most state should be replicated instead of shared coherently. For OS engineering, to reduce the engineering overhead, we should still do refactoring, basically reuse OS components from the [indiscernible] code base. And for architectural features, we really need hardware MMUs for the different chips, so we can build one unified [indiscernible] space easily. And we really need efficient hardware messages across different chips, so we can do inter-chip communication more efficiently. So that's something we're trying to do right now. This is ongoing work. Now, I will [indiscernible] work. So to recap the things that have been done, the target architecture is highly asymmetric processors in one mobile system, and there is no global hardware cache coherence among them. So in order to span the OS, the entire software stack, across those processors, we have redesigned system software to achieve two important views. One is the coherent memory view, so an application can span across coherence domains. The other one is the single OS view, so an application can still see one single OS. So we have followed these principles, which we believe will be useful for future system design. Number one, we build asymmetric system software. Number two, we let software decide the policy in memory coherence. Number three, we refactor the existing system software code base, rather than rebuilding everything from scratch. So that's the work that has been done. Let me quickly talk about the future directions, just to be on time.
So what I have done is system software for noncoherent, asymmetric, general-purpose processors. Looking to the future, I try to address systems challenges from three directions. Number one is higher heterogeneity. Number two, higher parallelism. Those two are for mobile devices, for wearable mobile devices. And number three is to look beyond mobile devices into larger systems. Let's talk about higher heterogeneity. So these days, we not only have general-purpose processors, asymmetric ones, in one system. We also have, for example, the computational GPU, the DSP, the multimedia core, the sensor core in one mobile system. Now, how do we build system software, how do we build system support for them? So this is still an open question, because some of the cores may be able to run OS code, some of them maybe not. Some of them may be able to run the program code, some of them maybe not. But I think there are four aspects that we, as system software designers, should address. Number one is abstraction. We should provide good abstraction for opportunity for [indiscernible] to provide application for all heterogeneous hardware. Multiplexing: we should enable multiple programs to fairly share the same set of heterogeneous hardware. Protection: if we want to run some user code on one of the heterogeneous processors, we do not want it to mess up other state in the system. And configuration: the dependencies among the types of cores, sort of [indiscernible]. By dependencies, I mean functionality dependency. Power dependency. [Indiscernible] dependency. [Indiscernible] dependency. How do we provide system support to simplify the process, and how do we let other [indiscernible] verify this configuration? So that's a big, big question. Now, higher parallelism. So we not only have more types of cores. We also have a larger number of cores. In a few years, we are going to have tens of CPU cores and a few hundred GPU cores, all in one mobile system.
How do we use the extra number of cores for higher interactivity? This is an old question, actually, from 20 years ago. In 2000, people asked how many cores do we need for one interactive system? They said you need two cores. Now, since then ten years have passed, and people ask this question again because -- no problem. Yeah. Let's see. >>: Were you running K2 on that last slide? >> FELIX XIAOZHU LIN: I wish I could. Yeah. Okay. Let's play -- try to restart. Okay. So basically, ten years have passed since 2000. There are many, many new, like, feature-rich [indiscernible] applications. Some people ask this question: how many cores do we need for an interactive system? They say you need one more core. But today, the extra number of cores is an asset. We have to use it or we will waste [indiscernible]. So there is an open question. Can we have [indiscernible] use of many cores for interactivity? For example, can we use extra cores to support natural user interfaces and augmented reality, this sort of new interface? And can we parallelize interactive workloads, for example, HTML5? So I'm looking forward to working with programming language people and augment people on this direction. And an even crazier idea is, can we use extra cores to speculate on user inputs, to compute all the possible outputs and have them ready before the system sees the actual user inputs? So this can probably give us the one millisecond [indiscernible]. So that's an even crazier idea I will try to explore. Now, look beyond mobile systems. I'll try to wrap up. To look beyond mobile systems, heterogeneity is everywhere in [indiscernible], for example, GPU and CPU. This is probably the most successful heterogeneous system in the last ten years. And if we look at the edge of the cloud, there is something called the [indiscernible] station, which is highly heterogeneous as well. Same as [indiscernible] centers. For example, the TI [indiscernible] is heterogeneous [indiscernible] for a similar market.
Inside one Keystone SoC, there are a bunch of general-purpose ARM processors and a bunch of DSPs, so that we can run the general-purpose workloads and the communication workloads, 3G, 4G, side by side. In a real system, in a real server, there are a lot of Keystone SoCs interconnected. For example, that's the case for the HP Moonshot server. In a cloud base station, [indiscernible] center, there are hundreds of servers like that. How do we program such a highly heterogeneous system? And how do we debug such a highly heterogeneous system? So these are open questions. Now if you look at the very tip of the cloud, this is basically the base station, the cellular tower our smartphone directly talks to. They are heterogeneous as well. This [indiscernible] they have build, the clay has build actually, [indiscernible]. So the idea is basically that spectrum is really expensive. It's at a premium in today's world of communication. In order to improve the utilization of the spectrum, we can use multiple antennas and use a large number of heterogeneous computing resources. So in this direction, I'm looking forward to working with [indiscernible] people to [indiscernible] the system software challenges here. Now finally, let me try to position myself in the computer stack. So I'm working on low-level system software, which is on the boundary between the hardware and the software. So I try to codesign the system software by working closely with architecture and hardware people. And I'm trying to look into the hardware [indiscernible] to look for this cool [indiscernible] support for system software.
I will codesign the system software with folks from other areas, the compiler, programming language, networking, and visual assistance, and with the codesigned system software, hopefully we can provide good, strong support for all these high-level applications, for example, algorithm, computer [indiscernible], motion [indiscernible]. And we can provide a strong guarantee for high-level concerns like security and privacy. So to recap the whole talk, what I've done is to redesign system software for noncoherent asymmetric processors, so one software stack can span the [indiscernible] with multiple coherence domains. Looking to the future, I'm ready to address systems challenges from three directions: higher heterogeneity, higher parallelism, and larger systems. Thank you. [applause] >> FELIX XIAOZHU LIN: Yes? >>: So I had a question about application for [indiscernible]. Seems like in mobile you want it to be easy to make programs for these kinds of places, whether they're heterogeneous or not. And I [indiscernible] because that's [indiscernible] interface that people are comfortable with. I think that one of the downsides of shared memory is that the cost of communication between different parts of the system is implicit. When you have hardware support for coherence, I think people are willing to tolerate that implicit cost [indiscernible] unpredictable performance because [indiscernible] support and the costs aren't that high. Now you're kind of proposing that instead of hardware coherence, we use software coherence to do the same kind of thing, and it seems like the performance predictability is going to -- like, performance is going to be much less predictable, because now the costs will be higher and communication actually happens through that implicit mechanism.
So I guess the question here is, did you look at the tradeoffs in using what you have described versus using something that makes the performance difference explicit, if you have operations that are going to be doing communication or not doing communication, because the performance cost is higher when you use the software -- >> FELIX XIAOZHU LIN: Exactly. That's a good point. So right now, the user program model we are proposing is user [indiscernible] different threads for different coherence domains, because the fundamental reason is migration is hard. Migration is hard from one [indiscernible] to domain. This also makes the performance boundary explicit, so [indiscernible] this thread is running on a high-power domain, this thread is running on a low-power domain. And they know that the sharing between the two threads actually goes across domains. But whether this is ideal, I don't know. But the thing is, I think it definitely makes sense to provide some sort of hint, like bidirectional hints, from the system to the application and also from the application to the system. Yeah. But one piece of good news we have is that contention between these two parts is not that high in mobile, in the background [indiscernible] task [indiscernible]. Because they don't communicate, like, all the time. So that sort of relieves the problem. >>: Just a follow-up question. Given the [indiscernible] you guys addressed, that problem is how easy is it for an application programmer to identify what should go on the weak and what should go on the strong? I think if you just looked at a program without understanding what it does, it might be difficult to -- >> FELIX XIAOZHU LIN: Right. So basically, right now, in the application, the boundary is implicit. But I believe it's kind of doable for a [indiscernible] programmer to specify the performance need. But not the power need. But [indiscernible] specify the performance need, I think we can do that kind of partition.
They only have to -- my philosophy is programmers only have to reason about the performance, not about the power consumption. But that's just me. Yeah. >>: Okay. >> SUMAN NATH: Any other questions? >> FELIX XIAOZHU LIN: Thank you. [applause] Okay. Thank you.