>> Jim Larus: Our first speaker is Gad Sheaffer from Intel.

>> Gad Sheaffer: Hi. Am I on the speaker? Okay. So I'd like to talk about an activity at Intel we're calling design for user experience. I'm an architect at Intel's Haifa design center in Israel. I'm working on a future [inaudible] that will come out later; I can't completely talk about it. [laughter]. But we'll talk about the five to 10 year horizon for applications for that processor. So this exercise, this design for user experience, is about looking for processor enhancements -- because that's what we were looking for -- that allow for meaningfully enhancing the user experience. For a change, we're doing a top-down search: we're really starting with usage models and going down to applications and the [inaudible]. What we really want to do is optimize our upcoming design for future applications rather than past ones, right? What currently happens is that we typically design a processor for five years out with applications that were traced 12 years back and comprise the bulk of our workloads. So we want to change that. However, we do not plan to actually develop these applications. We are looking to seek collaboration with people who have these applications at hand and with whom we could tune the hardware and software in concert. So we are still looking under some street lights; we're just trying to have the street lights be different from the ones we used to look under. We are looking for generic or focused performance improvements and enhancements. I mean, there are many other qualities of PC platforms that could be enhanced; we're looking at performance, focusing mainly on client platforms five to ten years out, and mostly the processor, but we are also amenable to chipset and other enhancements. Streets we are trying not to look under: in the past we've been targeting our SSE ISA and such by looking mainly at more and more video encoding and transcoding between different formats at ever-growing resolutions. It gets a bit boring after a few generations. And we have basically been looking at the applications that appeal to the average Intel engineer -- features that address the needs of US households with large incomes, like the terabytes of movies in the basement that you transcode and send to the 50-inch screens in each room. That's not a very common occurrence in many regions outside the US. And perhaps some of the applications that we're looking at are things that the user is not even aware of as applications, things that happen in the background. They are not necessarily shrink-wrapped. And we are looking at many usage models that do not have the user actually sitting behind the keyboard and the screen. So we tried this -- I think this is the first time we have done this. I think Microsoft are doing similar things, but this is a first for us. We tried to actually focus on the daily lives of real people from diverse geographies, age groups and economic situations. We had Intel's ethnography teams identify people based on market data and sociological trends, and the ethnography people actually met and spent some time with these people.
So these are not archetypes; these are real people that have names and faces. I haven't personally met them, but I have seen their pictures, I know a lot about their lives, and other people -- the ethnographic people -- actually spent a few days with them at their homes and workplaces. So we think we know what they might want and what they would value. Some of these people represent markets and usages not commonly thought of as computer users, but they could represent growth opportunities. I can give an example: one of them is a [inaudible] from Morocco who has his own business. He has no computers, but he has a pretty good cell phone which he's using to run his business -- just as an example. So we started with ethnographic data and market research that allowed us to pick the personas. Based on these personas we tried to develop scenarios -- basically we did a day in the life of these people five years out -- and we tried to translate those into system capabilities. So this is what the exercise looks like. We have a group of five to seven people sitting around the table, per persona. One of those people actually knows the person being discussed and has spent some time with them. They prepared in advance a set of small images of the hours of the day and the various phases of people's lives -- entertainment, work, family interaction and so forth. And then during the day, we basically try to come up with ideas about where technology could positively intersect with their life in a beneficial fashion. Okay? So we write these ideas on post-it notes and stick them up at the relevant time of the day and the activity. Some of it is patently bogus, but some of it is real, right? So I can give an example. Anyway, let's talk a little bit about a day in the life of MK. He's a Korean airline executive. He's married, has three children, a dog, no life. [laughter]. He gets up early, works while being driven to work -- he can't really afford the time to drive himself, so he actually works while being driven. Swims in the corporate swimming pool. Works. Is driven to business lunches, back to work. Works while being driven home. Sleeps. That's how his week looks. So here is MK. You see only two kids here because the third one is in the US; they think he will get a better education there. And this is how the board looks, right? So where is MK swimming? Anyway, you can see work takes up most of his activities during the day. So we basically sketched out the various activities of his day: being driven to work, at work, back home, business meetings -- and in these business meetings he's typically meeting people he has never met before. One of the usage models we came up with is what we're calling a charminator, basically something that would be like an icebreaker: the system would tell him some things about the person he's meeting so he could start the conversation. For example, the only free time during the day when he was not actively working was when he was swimming, so we tried to come up with things we could do for him while he was swimming, right? So: a swim coach and monitor.
So a video camera would track his movements, but we had a hard time figuring out how to actually provide him with feedback while he was swimming. I mean, it's a bit tricky.

>>: [inaudible].

>> Gad Sheaffer: But the water changes, so we need to do some image modifications so that it looks rectified even while --

>>: [inaudible].

>> Gad Sheaffer: Yeah. So -- yeah. So anyway. Oh, we could put some earphones on him and read him some e-mail while he was swimming. [laughter]. Anyway, we tried to augment this methodology with various market trend reports from various sources. The problem with that is that if you ask users today what they care about, mostly they care about [inaudible], right? It's really hard to get people to associate value with things they have not experienced or are not even aware could be done, all right? And especially for analysts -- this is market data from various sources, from [inaudible] or something like that -- it's really hard to get analysts to think six or seven years out. Really. Plus, what they tell you is not always what you wanted to hear, right? So that's a problem, too. So anyway, how do we select applications based on these usage models? Obviously, processor performance and features are key to enabling these applications, to making them happen. Better that they not be dependent on unknown algorithms -- that's why some of the natural language processing usage models are a bit suspect in my view. They need to happen in the time window we're discussing, so that means they need to exist in some form today, in some prototype form -- maybe the performance today is not satisfactory; maybe with 10x extra performance they go from, say, one frame per second to 20 frames per second, which is acceptable, things like that. But they need to be available; you need to be able to demo them today, even if not in real time. And ideally these applications would be ones that require a big step function in performance, or some feature, to make them happen, right -- not just the five to 10 to 15 percent a year compound growth rate that we have today -- so we need to do something proactive to make these applications happen. And this is nice to have, but still: starting this year we have integrated capabilities for graphics and video processing, focused performance capabilities, and it just so happens that many of these applications are about video and 3D, so it would be nice if we could leverage these focused performance features. So the top usage models that we have identified, the new usage models, are mostly about computer vision. And I think a lot of people are in agreement with that; I've seen a lot of computer vision work being done in this workshop. Telepresence is something that resonated very well with a lot of users, both for business purposes and for personal reasons. It's a very fragmented world we live in, so it needs to be affordable, realistic and immersive teleconferencing, both from home and office. Semantic search: natural language processing, basically semantic understanding and inference, personalized and in context. And physics: photorealistic modeling. I'll talk a little bit more about each of those. So, computer vision. The usages we imagined were searching using a template or a face, and inferring where a picture was taken.
Recognition, identification, and tracking of people -- I'm not sure about this one. Understanding gesture, gaze, eye movement, head movement in conjunction with other modalities. Interactive games. Authentication based on face recognition. E-fitting. Picture enhancement, et cetera, et cetera. For telepresence we are looking at affordable, realistic, immersive teleconferencing. We're mainly interested in people actually communicating, not in enabling people to be in a different place interacting with, say, inanimate objects. It's mostly about people meeting people and getting an experience that is similar to a face-to-face meeting, one that has the same trust-building capabilities as a face-to-face meeting. Currently, since Intel is a very geographically diverse company, we have telepresence rooms that are extremely expensive but pretty good. We're using them, and they are very beneficial. But we're looking at providing these half-a-million-dollar-and-above-per-room capabilities basically on a notebook or desktop computer. Close to realistic: you should feel as if the other person is in the same environment. Immersive, which means that both video and audio are spatial -- which means if I'm in the meeting and I'm looking over there at that person, then everybody else is aware that I'm looking over there, right, and he knows that we are establishing eye contact. Right? That's one of the key components of making this feel as close as possible to a face-to-face meeting: establishing correct eye contact. Low latency is an obvious thing. Having this surround feeling: we've been thinking about a virtual meeting space that could be configured -- you could set it up as a lecture hall, or as a round table, or as a living room, right? And in those you'd be able to decide where you sit and next to whom you sit. You might be able to whisper in somebody's ear and establish a side channel while everybody else is aware that you're doing it, right? True eye-to-eye contact is an important capability. Today when you're doing telepresence, the camera is over here and I'm looking at a screen, so the other person gets the feeling that I'm staring at his navel. That's a bit disconcerting. Another problem today with telepresence: systems are not interoperable, right? If you have a Cisco system you can talk with other Cisco systems. If you have a [inaudible] system, that's all you can do. So, open standards -- if we can establish some standards here, that would be good. Bandwidth and quality of service: many of the systems we have today require dedicated, high-bandwidth, low-latency lines. And we are trying to trade bandwidth for compute if possible, which means maybe instead of passing multiple high-bandwidth video streams, we'll be able to pass a texture that will be overlaid on avatars on the other end -- that's a way to compress significantly. And multiparty, with multiple participants. Semantic search -- that's something I think a lot of people here can sympathize with. All of us are inundated with more information than we can digest. So ideally I would like to ask a question and get just a single answer that is based on my past history, my interests, what's relevant for me in this context and at this time. Multi-language automatic translation.
Remember, we discussed people from diverse geographies. Not all of them have interactions with people from other countries, but many of them actually do. And there's this aspect of augmented reality: basically, if you're in a meeting or in a teleconference, you get complementary material that is needed for the task at hand. Personalized: all of the above is a function of who I am and what I'm doing, why I'm asking and where I am, et cetera, et cetera. Physics modeling is also promising -- this is really not a usage model, it's a component of a number of usage models. Photorealistic human body, face, hair, and other material objects. Being able to model viscosity, inertia, elasticity, and to interact with physical objects as if they were real objects. Virtual shopping and trying on clothing in virtual space -- I think this is something we're starting to see today, but it's not very realistic. Virtual worlds. And possibly haptic output. So I'd like to give an example of how we operate. We've described usage models, which we're trying to break into detailed scenarios, system capabilities, algorithms, workloads and kernels, primitives. For example, multimedia search has these components -- face recognition, object instance and class detection, and ranking -- which translate in turn to classification, feature extraction, inference, regression and so forth. And what we basically want to have in the end are actual kernels, preferably highly optimized, that we can actually optimize against. So in summary, a processor design is optimized to a specific metric. You excel at what you measure, right? And we want this measure to be applications that are forward looking, not things that were traced 20 years before the processor actually gets to the market. It would be nice if these were applications that people actually cared about. We are looking for components of these applications in relatively performance-optimized versions. Fortunately, from what I've seen here, those versions are typically [inaudible] versions. And we'd like to be able to quantify not only these components but also the value of improvements to each of them in the overall grand scheme of things -- how they would contribute to the overall usage model. And for that we are actively seeking collaboration with software vendors, trying to do this hardware-software co-design to jumpstart this brave new world of applications that all of us are dreaming of. Okay. Thank you. [applause].

>> Tony Keaveny: Okay. So I'm delighted to have a chance to talk to you about what we're doing at Berkeley on the health applications project as part of the big ParLab project. And I chose the title carefully here: personalized medicine has a very fixed meaning out there. Typically it's associated with genetic analysis, so you do some kind of genomic analysis on a sample of your tissue, and that's hopefully going to identify your future risk of developing some disease or, ideally, your future chance of responding to a certain type of drug treatment. So everybody's very excited about this. A lot of money's been pumped into it. But there's certainly a lot of hype. And the way I see the field of personalized medicine going is that there will be that genetic component, but imaging is going to be a huge component, because imaging is where you're at now. Think of the genetic analysis as some kind of vector of where you may go, right?
But most diseases we're worried about are so complex that the genetic factors alone really have a small influence on your overall risk of developing the disease. An image of where you are now is like a summation of all the things that have happened in the past; you get a snapshot of the current state. So I think personalized medicine and medical imaging is going to be big in the future, and the key to it is going to be assessing the information that's in the medical image -- and that's where all the computation comes in. So that's what I want to talk about. My background is I'm a mechanical engineer. I do all my work in biomechanics. And I use a lot of computation, but I don't do any coding, and you'll see that as I progress. Jim Demmel [phonetic] is here to handle those questions. So I see the world through my biomechanics lens. And the vision is to take medical images -- could be any type; CT scans and MRI scans would be the more interesting ones because they are 3D -- and couple that with computation at many levels: we have image processing, we have number crunching, but we bring into it, from my perspective, biomechanics. The biomechanics is really what connects me with the clinical function. For example, if I'm interested in osteoporosis, I need to know something about how bones break, and I want to extract that information from a medical image. So in my world, I bring in clinical function through the biomechanics, you try to couple all of that together, and the outcome can be improved diagnosis: what's your risk of having, say, a hip fracture; what's your risk of having a cardiovascular event. Surgical planning: I'm going to do spine surgery on you -- what implant should I use for you, and what is the best place to put that implant in your body? And generally patient management: okay, you're on a drug treatment; how are you responding to it? We want to get feedback as soon as possible. So all of these things will come out of imaging combined with our computational biomechanics. Let me give you just a little bit of background on the biomechanics perspective. These are some common diseases that have very large biomechanical components in their etiology. Cardiovascular disease is obviously related to blood flow. Arthritis: your joints wear away, and there is a clear effect of body weight and level of activity on developing arthritis. Osteoporosis is all about bones breaking. Chronic back pain: we don't even really know what the source of chronic back pain is, but it's clearly associated with mechanical loads placed on the back. And repetitive injuries -- I haven't put a cost on that because it's so big. Repetitive injuries are the number one source of worker compensation claims in the United States. It used to be low back pain; now it's repetitive injury, typically from typing. Now, think about this: you've got these extremely subtle repetitive movements of your fingers causing the cells to freak out and start secreting all these enzymes, which start destroying the nerves and all the soft tissue, and you've got carpal tunnel. A very, very subtle thing is going on here, and it's all biomechanics coupled with the biology. So these are all areas where, if we can combine medical imaging and biomechanics, I think we can make big advances. And it will really change the future of medicine. So here is an example, a success story, from what we were doing previous to this ParLab project in the bone area.
What you just saw there is a little model of a small piece of trabecular bone, a very high resolution image. That was about a five millimeter piece of bone, and it was being compressed. It has about 5 million degrees of freedom. We make those models from micro-CT scans, and we can also apply the same micro-CT technology to cadaver bones to analyze a whole vertebra. So this is a cross-section of a human vertebra, and you're looking, in red, at the regions of high stress as you compress the vertebra. You can see there are some very discrete patterns of load transfer from the top to the bottom of the vertebra. That model has 500 million degrees of freedom, right? So, very, very large models. We run these with non-linear material properties, and we use big supercomputers to do that. For the problem on the right-hand side we won the Gordon Bell Prize for scalability of the software that was used. We used software called Athena, and I'll tell you a little bit more about that later. So that's the state of the art in numerical computation in the world of structural mechanics. And we wanted to take that and bring it to the clinic, right? We want to use this on granny, who might have osteoporosis -- and there are many of you who will have problems with your bones in the future. It is a certainty, right? Men and women. So pay attention. In 2003, we published a paper where we used a much coarser version of what you just saw. These are continuum elements, and in this model each element is about one millimeter in size. What you just saw, that detailed cube model, was 4 millimeters cubed and had a million elements in it, okay? So we've homogenized a lot of the problem and come up with these coarser models. In 2003 this made the cover of one of the research journals, and they predicted that we would be able to predict vertebral strength in patients. There was a lot of excitement, and based on that excitement we started a company. The company is called ON Diagnostics, and its purpose is to take this kind of virtual stress test from the lab to clinical practice and orthopedics; bone strength is the first application. So here we are in 2008. We also made the cover of a very clinical journal, Arthritis & Rheumatism, so all the medical people are reading this. Engineers never read that journal. And it just really tickled me that we've got a finite element model on a clinical journal. That's just great. I was just so happy with that. But the significance here is that it's the same type of model, right, but we'd gone from the lab into actual patients. And here's a study -- we do this on the spine, we do it on the hip. Here are results from the study in the corner there. It's a fracture surveillance study of 6,000 men over age 65. It's called the MrOS study, the osteoporosis in men study. About seven years ago all those men had CT scans taken of their bones, and we just follow them over time to see who is going to fracture. At the time of this analysis we had CT scans of only half the cohort, so 3,000 to 3,500 men had CT scans and were tracked for a few years. At the time we got this data, 40 men with CT scans had fractured. We have their baseline CT scans, so we analyzed those 40 scans of the guys who fractured, and 210 others randomly taken from the sample.
We're blinded, so we don't know who fractured and who didn't, and we made our prediction of the strength of their femur using models just like this. Then this is plotted against the bone mineral density of those men. That's a clinical test -- it's called a DXA scan, the standard clinical test to assess bone density. What you can see in that graph is that the light blue points are the men who fractured. If you go by the current clinical standard, they express the bone mineral density as what they call a T-score. It's the number of standard deviations your bone density is below a 30-year-old's, right? So it's just a linear transformation of the data here. The people on the left would be considered to have osteoporosis, and they would be put on a drug treatment in clinical practice. All the others would be considered either low bone mass or normal, and we're not going to do anything with them, right? So you do capture a bunch of these folks who fractured, but a small number. On the strength side, we found that everybody who had a bone strength below 3,000 newtons broke their hip. And many of those who broke their hip were in what we considered low bone mass. Now, in reality, of the one and a half million people who have osteoporotic fractures in the US every year, the majority do not have osteoporosis, right? If you have osteoporosis, you are at high risk of fracture, but most people who fracture don't have osteoporosis, because there are just so many people in the non-osteoporosis category. And what you see here is that of the people we predicted would fracture, over half did not have osteoporosis. So this is very promising. There's a lot of excitement over that. We're going for FDA approval later this year, and if everything goes well, this will be available next year if you ever want it. And then --

>>: [inaudible] put these people on the same treatment that you give to the normal osteoporotic individuals so that they will not fracture?

>> Tony Keaveny: Okay. So you --

>>: [inaudible]. [laughter].

>> Tony Keaveny: So the idea is to identify people so you can put them on a drug treatment that will stop the loss of bone, even build new bone, and therefore strengthen their bones and reduce the risk of fracture. The current drug treatments we have reduce the risk of fracture by 50 percent. The problem with osteoporosis, as I'll show you in the next slide, is that it's an undertreated and underdiagnosed disease. And there are many diseases like that. This is a little table. This is the estimate of the percent of fractures that are avoided each year by current medical treatment. If you think you're at risk for osteoporosis, you'll get one of these bone scans. If the T-score is less than minus 2.5, they'll put you on an osteoporosis treatment. Half of the people on osteoporosis treatment who would have fractured will not fracture. All right? So if you just do the simple calculations, it turns out about 15 percent of the population at risk are currently tested for osteoporosis. The test that's used now, that BMD test, is about 50 percent sensitive. And the treatment works in about half the people. So if you work it out, you avoid about 3.7 percent of the fractures.
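(To make the arithmetic explicit -- this is just the back-of-the-envelope calculation the speaker is describing, using the figures quoted in the talk:

\[
\text{fractures avoided} \;\approx\; \underbrace{0.15}_{\text{fraction tested}} \times \underbrace{0.50}_{\text{test sensitivity}} \times \underbrace{0.50}_{\text{treatment efficacy}} \;\approx\; 3.7\%.
\]

The same formula gives the numbers mentioned just below: raising either the treatment efficacy or the test sensitivity from 50 to 75 percent yields \(0.15 \times 0.75 \times 0.50 \approx 5.6\%\), the "five or six percent" figure.)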
So if there are 300,000 hip fractures every year in this country, then with all the treatment going into this, only 3.7 percent of those 300,000 hip fractures are not happening. All right? So it's a very small number. Okay. So imagine if you had a new drug treatment that instead of 50 percent fracture efficacy had 75 percent fracture efficacy. That drug would be worth about, I don't know, 30, 40 billion dollars, right? Because it would just dominate the market. And it would be just revolutionary, right? It would change this percentage up to five or six percent. The same change can be obtained if you simply had a better diagnostic test: if your sensitivity went from 50 percent to 75 percent, you'd have the same thing. And we're already there, right? Our test is already here. So we're equivalent to a multi-billion-dollar drug. But the way money flows in the medical system, it just doesn't work like that. Now, if you can test more people -- for a disease that's undertested, that has a dramatic effect that really dominates everything. And if you can test more people and combine that with a better drug and a better diagnostic, that's about as good as you're going to do. So the point I wanted to make here is that better testing and diagnostics is, at the end of the day, from a clinical perspective, better than making new medical devices and drugs. It can have such a huge effect. It just didn't play out so well in the market because we don't have a big drug company behind this promoting it. But of course these are all things that can change in the future.

>>: Other than money, is there a downside to someone taking this drug treatment if they don't need it?

>> Tony Keaveny: Yeah. So that's a good question. The drug treatments right now are fairly benign, but there are some long-term concerns now. People who have been on these drug treatments for 10 years are beginning to see some problems with them. So the osteoporosis drug development field is still very active. People are going for drugs that have really no negative side effects. Most of the drugs we have now stop bone loss; they don't actually make your bones any stronger, they just stop the bone loss. And there's a new class of drugs out that actually builds new bone, but they come with some potential risks. So it's a very active field, trying to get drugs that are anabolic -- that make new bone -- and that don't have the downsides. The anabolic drug we have right now is made by Lilly; it's called Forteo. It's one of the most expensive compounds in the world -- they charge $1,200 a month for that drug. So that's a success story. We want to do the same thing for some other problems, and the problem we've addressed is stroke. So let me explain the stroke problem. One of the problems with osteoporosis in terms of implementation is that you say you've got a better diagnostic test and you go to get it reimbursed by insurance companies. They don't care, right? Insurance companies have the following experience: a third of their insurees switch insurers every three years. All right? So I go to you and I say, hey, I've got a test and I'm going to identify people who are going to break their hip five or 10 years from now. Will you pay for this test? They just look at you. No. The person will probably be with somebody else in five or 10 years, so they're not going to pay for it.
Now, they'd never tell you that straight to your face, but this is what goes on. So for this ParLab project we wanted a medical application that would be compelling in terms of the need to implement it and to pay for it -- one where you could charge, not a lot of money, but a premium for the test, because it would save so much money. All right? So we chose this very specifically with clinical implementation in mind. And the problem is strokes. If you have a stroke and you go to the hospital and get diagnosed, then if you've had that stroke more than three hours previously, you will not be treated. The reason you won't be treated -- and we saw a little bit of this the other day -- is that the risk of complications from the treatment is too high. There are two main treatments: there's a little kind of clog-cleaning plumber's approach, and there's a blood thinner. With either of those, the risk of you bleeding to death after the treatment is considered too high if you've had the stroke more than three hours previously. So you go to the hospital -- a lot of strokes happen when you're asleep; you wake up in the morning and you're numb on one side of your body -- and you won't be treated. So the part of the brain that is not getting blood continues to die, and you end up with severe, severe long-term disabilities. Okay. These strokes happen primarily in the vasculature in the brain called the Circle of Willis. It's this little band of blood vessels right in the middle of your brain. Everything that goes off to the microvasculature around the brain comes off this, and the front and back of it head down to the rest of your body. This Circle of Willis is where most of the blockages -- a stroke is a blockage of a blood vessel -- occur. And the downstream problems actually occur out in the microvasculature. So what we're interested in is studying the blood flow in patients who have had a stroke, and we want to identify those patients who would have the lowest risk of a complication from one of these treatments and therefore would be safe to treat after the three-hour window, right? Okay. So that's the plan. And of course the effect of that could be huge: if you could just identify patients who are safe to treat, and treat them, that's massive savings for everybody involved and a dramatic impact on the quality of life for that person. So this Circle of Willis is interesting. This is a nice anatomy drawing -- this company Adam does all these anatomy drawings and of course makes you believe that this is the way it is in your brain. The problem is, in your body this vasculature is a little different for everybody. Sometimes this is missing, sometimes one of these things is gone, sometimes these are missing. Only half of the population has what's in that picture; the other half has a variation on it, okay? So let's say you get a stroke here. The question is, gee, how is that going to change the blood flow everywhere? And if we give you a blood thinner to reduce the viscosity of the blood, are you going to increase the stress so high that -- sure, all the blood from here is going to be sent over here, and this is where you're going to have that complication, right? This is what we're looking at.
Or, gee, the stress is going to be so high here that this blood vessel is going to pop, right, if you reduce the viscosity. So this is what we're trying to look at. And of course you can have these strokes anyplace in that structure. So our solution is to take a patient's medical image -- a CT scan is what we're looking at primarily; you could use MR scans -- and do a patient-specific blood flow analysis, with the stroke in place. What we do then is simulate what the blood flow would look like if you take a blood thinner. Then we assess the stresses in the Circle of Willis and downstream, and we ultimately want to risk-stratify the patient. So here is how we're going to try to implement that. Like the osteoporosis story, we're going to do proof of concept on very high resolution models. We want to make sure that the physics and the science we have here actually work. And we want to validate that by predicting the people who are at highest risk for complications: if you don't treat them at all, there are going to be patients whose blood flow characteristics are so messed up by the stroke that they're going to have problems without doing anything. So we want to identify those. That will be a first level of validation. It's observational, so nobody's being treated with anything. In parallel with this, we're doing all of the numerics to develop these coarsened models, just as you saw with the osteoporosis work. We won't be able to go into all the detail we have, and our goal is to do this in 10 minutes, so that while the treating radiologists and neurologists are assessing the patient, they're getting this information after taking the CT scan to risk-stratify the patient. And ultimately we want to identify those at the least risk. That will be validated in an interventional clinical study, and obviously, since this is being used to diagnose patients, you have to get FDA approval. Those last two things I don't think we're going to get to in the time span of this project, but we want to set ourselves up so they're all set and ready to go. If we make fast progress, we will be able to get into those studies. And here's the state of the art in what people are doing in this related area. This is from 2010, a set of 1D models, and it was just generic: this guy made up a perfect Circle of Willis just from those anatomy drawings and treated all the blood vessels as a 1D problem. A fairly simple model, but it's got the solid-fluid interaction. This is a tricky problem, because we have to do the fluid mechanics and the solid mechanics, and that's a very, very tough numerical problem. And then he played all sorts of games, putting a blockage here and seeing how it affects the blood flow. So that's a nice study that allows you to parameterize and ask what-if questions, but it's not patient specific. And this project is all about making this work patient-specific, fast. So here's the other extreme. Here is a model that doesn't have any of the Circle of Willis in it; it just has a little segment of a blood vessel, but that segment is modeled in a lot of detail. So it's a non-Newtonian fluid flow, which is pure pain, and it's got a two-layer (it's hard to see the second layer) anisotropic, non-linearly elastic representation of the blood vessel. And that's pretty complicated behavior.
So these guys -- this is 2009, so this is fresh off the press -- used a high-end workstation and did just three pulses of flow, and it took them 10 hours on the workstation, right? So this is just to give you a sense of what we're up against here. What we want to do is take this, put it into that previous full-anatomy representation like the 1D one, do the whole shebang, and then scale that down to something we can do on multicore. So the strategy is: the detailed solution on the supercomputer; run parameter studies; make sure we understand the physics and what we can throw away and what we can keep; apply it to clinical cases at high resolution, just to show that in the best-case scenario, yes, all of this makes sense and it works. Meanwhile we're porting all of this to multicore. We're going to simplify the models; how we do that depends on the biomechanics, the image processing, and any numerical constraints we have. And then ultimately go with that to some definitive clinical validation. So our team is: myself -- I'm the kind of applications person. Jim and Kathy are our experts on the numerical computation side. Panos Papadopoulos -- he was here yesterday -- is a [inaudible] guy, so he's big into stress analysis from the mechanical engineering perspective. Very importantly, we've developed a collaboration with some folks at Lawrence Berkeley Labs, and Phil Colella is a big-time computational fluids guy. These other folks are full-time staff on his team who are developing this big parallel code for fluid flow analysis; it's called Chombo. I'll show you more about that. And then our other faculty: the most important person there is Max Wintermark. He is a neuroradiologist from UCSF, so he deals with stroke people all the time. David Saloner is a researcher at UCSF in medical imaging. Stan and Mohammad are biomechanics guys at Berkeley. And Mark Adams wrote Athena, which is our solid mechanics parallel code; I'll tell you a bit more about that. And last but not least, of course, the people who make it all happen at the end of the day are the students. We have a number of students here, a mix of mechanical engineering and computer science students. So here is a glimpse of Athena. This is a preexisting code we have for doing the solid part of this, a [inaudible] element code. This one won the Gordon Bell Prize for scale-up. So we're starting with that, and we need to adapt it to multicore. In the whole front end of it there are two layers of parallelization, and that obviously is going to change. And then all the message passing is going to change when we get to multicore. The other thing, of course, is the approach that we're taking with the ParLab: these motifs. We're not simply going to take this and hack a solution for the stroke project; we want to develop a more general approach using the motifs, so that what we gain from this applies in other areas, too. And here is Chombo, the fluids code that's been developed at Lawrence Berkeley Labs. It's also set up for parallelization, and the nice feature of it is that it has automatic adaptive mesh refinement in regions of interest: high stress and stress gradients. So both of these codes have to be adapted in a number of ways for the problem. I'll mention a couple of things we're currently working on right now.
In terms of the motifs: these motifs are continually evolving, but we certainly have at least three of them clearly applicable to what we're doing, since we have this unstructured problem and sparse matrices. So we've been doing the project for about a year at this point -- a little bit over a year, or a little bit under a year, right? I don't know when it officially started. Anyway, what have we done so far? I think we've figured out the strategy. There were a number of different ways to couple the solid and the fluid, and we've decided to take Athena and Chombo and actually work with those, as opposed to starting from the ground up and trying to put everything together in one code. So it's a solids problem and a fluids problem that talk to each other. That's the approach we've taken. We've mentioned the team. In terms of the code, we've got both of these codes now working on the same machine, and the students have been trained in how to do that. That was a bit of an effort, but it's safely done at this point. We're really hitting the biomechanics in detail. And we've done some simulations -- we'll show you this simulation here. We've also done some of these 1D simulations, just like that literature study that I showed you from 2007. And we have identified a patient; we have a CT scan, so we're getting ready to make a patient model from that. Here is a little simulation we've done, a 2D simulation, just running Chombo through its paces. It's a blood vessel. This is the uneven surface of the blood vessel with a blockage in the middle, to simulate a stroke that doesn't quite fill the inside of the blood vessel. And then you're looking at the blood flow. This is, let's see, this is the velocity, that's the pressure, and that's the vorticity over time. And this should be all nice and continuous -- that's my Macintosh; I don't know what's up with it that it won't play this a little more smoothly. But this is just to show you that we're beginning to set up these problems and see where the particular numerical challenges are that we have to deal with. So, complicated patterns. But what we're mostly interested in is downstream changes in the larger vasculature and the very local changes around the stroke. Part of the reason for that is you can't get the kind of fidelity from a medical image that justifies extremely detailed numerical fidelity in those regions -- you're limited somewhat by your input data. Okay. So where are we going? In the next year we have some specific objectives that are fairly well defined. We want to reproduce what the folks in the literature have done and expand on it. So take this 1D approach -- we can have the vasculature from the patient with that, but take a 1D approach to modeling the actual vasculature -- and do a number of parameter studies with that. And we've got the fluid-solid interaction in those models. We want to run a patient-specific analysis, a full 3D analysis, on the CT scan, but we'll just do that with Chombo, with the fluids. That will give us the information we need to start thinking about mesh sizes and time steps. The codes themselves have to be altered. With Athena we have to add in non-linear elasticity; we don't have that -- we have plasticity, but we don't have non-linear elasticity. And we want to bring in tetrahedral elements. And in Chombo we want to put in this moving embedded boundary.
Chombo also has to deal with the curved sides of the blood vessels, and we're currently working on that as well. And then we need to couple these two codes together. Next we're going to continue refining the code and bring in image processing. The main thing is we're actually going to do some clinical cases -- that's the idea next year, at high resolution. And then as we go towards the end of the project, optimize all of this, get onto multicore, and then actually do clinical cases with multicore. And these are the outcomes: we want 10 minutes -- we want to show we can do it in 10 minutes -- and we want this to work in a more general framework. I think that's all I had to say.

>> Jim Larus: Questions?

>>: Yeah. There's a guy up the hill from you at LBL named Lenny Oliker [phonetic] who you may know. He's done some interesting work with adaptive irregular meshes. The motivation was essentially the kind of [inaudible] numbers you have here, where adhesion to the wall is very important. In fact, your problem is much worse because the geometry is time-variant eventually. So that work was done at [inaudible], but he will remember it really well. It's interesting because you can essentially build the mesh to conform to the wall and adjust its resolution based on the magnitude of the vorticity, for example, or something like that, in an adaptive way.

>> Tony Keaveny: Right. Right.

>>: It's wonderfully challenging to program, and in fact it may be quite useful to use transactional memory when you're updating the mesh -- some sort of transactional scheme. But it's a very cool way to address this kind of multiphysics model.

>> Tony Keaveny: Right. Right.

>>: Lenny is certainly in all these conversations as well. He doesn't happen to be on Phil's team, but he's certainly part of that conversation.

>>: Yeah, yeah. I know. Yeah. He's just in the neighborhood of Phil, not --

>>: Yes.

>> Jim Larus: Any other questions?

>>: How long in time do you need to be simulating? And like, what's your time [inaudible]?

>> Tony Keaveny: Yeah. I would imagine the simulation just goes for a few seconds, and the time steps are milli- or microseconds -- hundred-thousandths of a second, I would imagine. I'm thinking.

>>: So have you spent any [inaudible] happen within a couple of seconds of the [inaudible]?

>>: Right. Right. Right. Because I think you'll be looking primarily at a steady-state solution, right? You're just saying, look, the vessel's been there -- the clot's been there for a long time. So at this point in time, what's happening?

>>: There are actually two very different time scales; the solid [inaudible] looks a lot friendlier than the fluid, so there might be as many as a hundred fluid time steps in between solid steps.

>>: Right.

>> Jim Larus: Let's thank our speaker. [applause].

>> Brad Werth: Hi, everyone. My name is Brad Werth. I work at Intel. I work with video game developers to optimize their games for the Intel platform. Obviously Intel does not create games. Our relationships with video game developers run a whole continuum, from sort of a consultation level down to we're going to look at the code, bring it in house, do some work, et cetera.
So the presentation you're about to see is sort of a state of the industry in the way that Intel thinks about it: where parallelism is for video games, what the current key challenges are that developers are trying to navigate, and how Intel is advising them to navigate those waters, wherever they are on that continuum of working with us -- whether we're just giving them ideas or actually doing the work with them. So as you may have heard or may have intuited, video games are not embarrassingly parallel. They are complicated to parallelize. An interesting fact is that there are a lot of individual systems and elements and efforts inside a video game where, independently, it's well understood how to parallelize them -- we'll look at some of them here -- but the interaction between them is an integration problem, like many of the other speakers have been talking about. Maybe the difference in the video game case, though, is that this is not an inter-process coordination problem; this is all within one process. And there's an expectation that a game is going to be running essentially unmolested on the hardware -- you're not going to be competing with, you know, somebody encoding a movie in the background or something like that. So as long as internally all the coordination can be made efficient, there's an expectation that that's good enough.

>>: [inaudible] or are you talking about the server as well?

>> Brad Werth: I'm talking only about clients. And of course, partitioning strategies for hardware resources can fail for a number of different reasons. One of them is the third bullet here: moment to moment, the workload of a game changes dramatically. You can't ascertain up front that, okay, this is exactly the right partitioning of resources I want for the duration of the experience. You need to be very flexible, and the approach that we advocate is a flexible one. Similarly, most commercial games are targeting multiple platforms. Even if they're just targeting the PC, the PC is, of course, this polymorphic platform that has many, many variations, so it's almost impossible to specifically tune a game for the range of hardware it's going to be played on. Even if it's a game on the Xbox 360, the PlayStation 3 and the Wii, those are all very different. So one caveat to all of this: many of you have been giving presentations where you've taken your research approach and employed it successfully in a problem domain, and you've got a sort of here's-where-we-are, here's-the-next-steps story. I don't have that level of detail for you, because we haven't had the happy confluence of a game developer that's doing the sort of broad effort of parallelism that we want them to do combined with a developer who's willing to share all of their code and knowledge with us. Finding those two together is a bit challenging; we have one or the other. So bear with me. The summary of what I'm going to propose is that task-based parallelism can be applied to most of the threading paradigms that game developers typically use, and if it's done successfully and well, it can remove all of these interoperation problems. So we're going to look at that. This is a very high-level diagram of some things that might be going on in a typical game architecture. Put yourself in the mind of a game developer trying to figure out how to parallelize this for all of the platforms you're going to target. So: you've got a particle system.
We know that particle systems are essentially the most embarrassingly parallel element within a game. You've got a big array; you can break it up into pieces. How many pieces? How many threads should you dedicate to processing those pieces? That partitioning issue is non-trivial if you're targeting multiple pieces of hardware. Likewise, maybe you've got some asynchronous jobs that you're going to call. You know, of course, that you could define some job threads and throw work onto those threads, but again, how many threads should you dedicate? Is it possible that the particles could be processed on the same threads, et cetera? You're using some physics middleware. You know that the physics middleware is threaded, but you don't know how. You don't know how many threads it's going to create. You should probably anticipate that it will oversubscribe your system; that's to be avoided. You've got these operations in games that have traditionally been handled by dedicated threads, like sound mixing, level loading, some texture manipulation in the background. What does it mean if you have a certain set of that type of work in conjunction with all of this frame-to-frame work? How can they all work together? And then you've got some kernels deep in your game that you know can be described in sort of a directed acyclic graph structure, which is a pretty well understood method of parallelization -- but again, how do you get it working with all these other approaches? So our premise to the developers is that if you can define all this work as tasks, mapping the natural patterns that are used in games to a task implementation, then you get a lot of nice benefits from that. You can use a single thread pool, which will avoid oversubscription. And almost by default you'll have a game that scales to different topologies. So we're going to look at what those patterns look like and how they can be decomposed into tasks, using the Threading Building Blocks (TBB) APIs as an example. A lot of game developers write their own task scheduler, or they use other available task schedulers. That's fine; the techniques in general are applicable to those as well, although a lot of the pros and cons I'm going to be talking about are specific to the work-stealing schemes present in Cilk-style schedulers such as Threading Building Blocks. I'm going to be showing code on the screen. No need to, you know, scramble to take down notes; it's all posted here, and you'll see that URL again. Okay. So the easiest thing you can do to parallelize work in a game -- really, in any application -- is to find those big loops and break them down. TBB makes this very simple. It has a construct called parallel_for. It looks like this. You take your humble for loop, where you're iterating over an array and doing something to every element, and you embed that for loop inside of a context -- in this case, a class -- and you use the function operator to supply a range to iterate. So instead of iterating over the entire range every time, you're going to iterate over a parameterized range. And again, this is the TBB idiom -- this is a specific way of doing it, but the general idea, I think, is translatable. And then of course you're going to actually invoke the action of the loop in a different way. In TBB it looks like this.
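(The slides aren't reproduced in this transcript, but a minimal sketch of the idiom being described -- the classic TBB parallel_for with a body class -- might look like the following. The particle structure and function names are illustrative, not from the talk.)

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

struct Particle { float x, y, vx, vy; };

// The "context" class: the original loop body moved behind the function
// operator, iterating over a parameterized sub-range instead of the whole array.
class UpdateParticlesBody {
    std::vector<Particle>& particles;
    float dt;
public:
    UpdateParticlesBody(std::vector<Particle>& p, float step) : particles(p), dt(step) {}
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            particles[i].x += particles[i].vx * dt;
            particles[i].y += particles[i].vy * dt;
        }
    }
};

// The invocation: supply the total range plus the body object. TBB splits
// the range into sub-ranges and runs them as tasks, hopefully on separate threads.
void UpdateParticles(std::vector<Particle>& particles, float dt) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, particles.size()),
                      UpdateParticlesBody(particles, dt));
}
```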
You call parallel_for, you specify the total range and the context object that you defined earlier, and internally TBB is going to break the calls to that context object into separate tasks, which hopefully will be mapped to separate threads, and it's going to supply subranges of that total range. It looks -- I'm sorry, I hit some terrible button. No, it's not a Mac. I think I hit the panic button when -- oh, I didn't mean to show that slide, and I just unpanicked. So when you do this -- this is the sort of before picture of where we're at. This is a subsection of a screenshot of Intel Thread Profiler. We're seeing a long stretch of computation at the top that I happen to know, because of the demo application it came from, is our unparallelized particle system. After this call to parallel_for it gets broken up into these chunks and mapped, in this case, to seven other threads on this 8 core -- well, 4 core hyperthreaded -- system. So that's the primary benefit. You want to be doing this as much as possible within your game. But of course many of the approaches to threading that are used in games aren't as simple as "I need to go through this loop." There are these other, more general approaches, and for those you need a lower-level way to decompose them into tasks. Thankfully, TBB has a low-level API. The way the TBB API works is it defines multiple trees of work that can be processed -- you could think of it as -- well, we know what trees are. It's not quite as expressive as a DAG, but it's processed in the same fundamental way. This list I have here is five examples of typical ways that games want to use parallelism, and I'll describe each of them in detail. We're going to look at how all of them can be broken down into tasks using the TBB low-level API as an example. So the first concept is a callback. This is very basic. It says: I've got some work, I'm going to give you a function pointer to that work, and I want it to be run when you can -- soon, just not on me. You know, give it to somebody else. It looks something like this. The concession we make here is that we're not actually given any handle to the result to wait on. It's the responsibility of the callback itself to throw a flag or set some signal that's going to be checked in our main loop, indicating that the operation has completed. I've got both code and tree diagrams on here for the visual and the literal people, so maybe you'll find one that speaks to you. This is how you can implement this in the TBB API. It starts by assuming an existing root somewhere out there in the world that we're going to attach these other tasks to. The task is created, and when it's executed it's going to call that callback function. And then you see the little cloud at the bottom there, which indicates that in a perfect world, calling that callback will produce more work that can be put into the tree. Because if all you have is a series of callbacks, this represents a certain amount of overhead to sort of corral it into the TBB idiom, and you're not going to be gaining very much. You need to have those opportunities for loop parallelization or other forms of data parallelization. Finally, the task itself is spawned.
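(Again, as a hedged sketch rather than the talk's actual code: the structure just described might look like this with the classic low-level tbb::task API -- a permanent root whose reference count is kept artificially high, with callback tasks attached as additional children. The function and variable names are illustrative.)

```cpp
#include <tbb/task.h>   // classic TBB 2.x low-level task API

typedef void (*GameCallback)(void* userData);

// Task that runs the callback. No handle is returned to the caller, so the
// callback itself must set whatever "done" flag the game's main loop polls.
class CallbackTask : public tbb::task {
    GameCallback fn;
    void* userData;
public:
    CallbackTask(GameCallback f, void* d) : fn(f), userData(d) {}
    tbb::task* execute() { fn(userData); return NULL; }
};

// A long-lived root the callbacks hang off of. Its reference count starts at
// 1 and is bumped for every child, so it never actually completes; it exists
// so the whole pile of tasks can be cleaned up in aggregate at shutdown
// (or periodically per frame).
static tbb::empty_task* g_callbackRoot = NULL;

void InitCallbacks() {
    g_callbackRoot = new(tbb::task::allocate_root()) tbb::empty_task;
    g_callbackRoot->set_ref_count(1);
}

void IssueCallback(GameCallback fn, void* userData) {
    // allocate_additional_child_of atomically increments the root's ref count.
    tbb::task& t = *new(g_callbackRoot->allocate_additional_child_of(*g_callbackRoot))
        CallbackTask(fn, userData);
    tbb::task::spawn(t);
}

void ShutdownCallbacks() {
    g_callbackRoot->wait_for_all();          // drain any remaining children
    g_callbackRoot->destroy(*g_callbackRoot); // then reclaim the root itself
}
```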
The root can be thought of here as a continuation that will be spawned when all of its children are finished, but the tree is structured such that that can never happen. Every time a new task is added, the child count of the root is incremented, so it never actually completes. It's there so that you can in aggregate clean up the whole stack of tasks, either at shutdown or if you need to do it periodically per frame, et cetera. So these callbacks are simple. They're powerful, but they have some limits. The calling code is never waiting. Stuff's run on demand. But since nothing is waiting, that means you never know when the callback is finished. The callback has to declare that the work is completed. And of course there's no waiting, so what if you're running on a one-core system -- will your callback ever be run? No. You need to do some special-case stuff to make sure that happens in the one-core case. So although this paradigm gets used, typically after developers have more experience with parallelism they tend to abandon it and move on to a related pattern, which is promises. Promises are essentially an evolution of callbacks that resolve some of those problems I just talked about. Like callbacks, they have function pointers that are going to be executed on another thread, and execution begins right away. But unlike callbacks, you're given a handle to the result itself. This is more or less modelled off of a similar pattern in the Java 1.5 specification called futures. Maybe you all have experience with patterns and have, you know, a proper name you can apply to this. I just make stuff up. So the tree and code for that look something like this. Instead of relying upon a preexisting root to the tree, we're going to create one for every promise. And we're going to encapsulate that root inside another object. That's really sort of an idiosyncrasy of the way these work trees are processed. We embed it inside another object because a task in the work tree essentially has its memory reclaimed as soon as its execution is finished. Since we want to have a valid pointer to the result, we have to define that ourselves and not rely upon a raw task, because a raw TBB task is not guaranteed to live beyond execution. In fact, here it's almost guaranteed not to live. So in this case we embed the root inside the promise object. That will be our return object. Once again, the task is made a child of this root and spawned. And what we see is that this call to wait is doing basically a double check on the root itself. You can see that the first if block inside the function is checking to see if the root has already been nulled. This is the signal we use to indicate that the work is complete. If it hasn't been nulled, it then takes a lock and checks again. This is -- I'm sorry. There's a question? >>: [inaudible]. >> Brad Werth: I'm sorry. I just can't hear what you're saying. >>: So if you're going to [inaudible] this activity versus [inaudible] still tries to hide the runtime, so I'm [inaudible] for all these patterns [inaudible] you actually [inaudible] where you actually [inaudible]. >> Brad Werth: Okay. So the question was: Cilk does this but it hides the runtime; could these TBB examples be applied successfully to Cilk, and what would that look like -- is that the basic question? I don't have a whole lot of experience programming to Cilk directly.
The actual Cilk scheduler isn't generally brought in whole cloth into a game. Usually people are copying the semantics of the Cilk scheduler with their own APIs. So I just don't know. My presumption is that any Cilk-style scheduler -- my definition of a Cilk-style scheduler is one that uses work stealing and has independent queues per thread -- any Cilk-style scheduler is not necessarily a work-tree-oriented scheduler. TBB is. I'm not sure what Cilk was. Was Cilk doing work trees, was it doing graphs? Neither? >>: [inaudible] continuation [inaudible]. But my real question is like if you [inaudible]. >> Brad Werth: Your question is if you just have the two primitives in Cilk, which are -- >>: Spawn and -- >> Brad Werth: Spawn and sync -- can you do all of this, or do you need all this extra structure of the work tree? My guess is that you probably can do it just with spawn and sync. TBB is a freely available library. We put this out there to developers saying, hey, you don't have to try to adapt Cilk into your game, you don't have to try and write your own scheduler, you can just use the solution we've got. Here's how you might do it. So it's a marriage of convenience in this case. Okay. So getting back to this, what you're seeing is that the wait on the task is really a wait on the root, and there's a lock in there to allow multiple threads to be able to wait on the same result. If your architecture's designed such that that isn't necessary, then this lock also is not necessary. If the same thread that dispatches the promise is always the only one that ever waits on the result, you don't need this extra step. But you can see that the root is waiting on this task completing, and when it's done everything is cleaned up, the pointer set and all. So promises are pretty great. The wait is only blocking if the result is not available at the time that it's requested. If the wait does block -- presuming that there's more of that cloud of work, that larger tree -- the thread waiting on the result will be able to jump in there and start doing some of that computation, which will accelerate the completion of the result. Implementing this on top of TBB is not a lot of code, and you could take it further, as is done in the Java 1.5 specification, to allow cancellable jobs or partial progress updates. So this is a pattern that's finding some traction with game developers. Okay. Synchronized calls. A variation on callbacks. The premise of a synchronized call is that you want every thread in your invisible, untouchable thread pool to do something right now, before doing anything else. It's sort of like a barrier or a fence. This is useful for the reasons I put up there on the slide, but essentially you're initializing thread-specific data or cleaning up thread-specific data. This is absolutely trivial to do if your task scheduler allows you direct access to threads. But not all of them do. Here's a way you can do it. Again, code and a tree. We're creating a root, and we are going to create N children of that root, where N is equal to the number of threads in the thread pool. What those children do we'll see on the next slide, but in the general scheme you can see that they're going to call a callback and then they're going to test and wait on, in this case, an atomic variable. We'll see what that looks like. We spawn the root. All of the tasks go out there. They get mapped.
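A sketch of that structure under the same classic task API -- one task per pool thread, each pinned by a busy-wait on a shared countdown. The nThreads parameter, the Callback type, and all names here are my own, not the slide's:

    #include "tbb/task.h"
    #include "tbb/atomic.h"

    typedef void (*Callback)(void*);

    class SyncTask : public tbb::task {
        Callback fn;
        void* arg;
        tbb::atomic<int>* remaining;
    public:
        SyncTask(Callback f, void* a, tbb::atomic<int>* r)
            : fn(f), arg(a), remaining(r) {}
        tbb::task* execute() {
            fn(arg);                    // per-thread work, e.g. init TLS
            --*remaining;               // atomic decrement
            while (*remaining > 0) {}   // busy-wait pins this thread, so it
            return NULL;                // can never take a second SyncTask
        }
    };

    void synchronized_call(Callback f, void* a, int nThreads) {
        tbb::atomic<int> remaining;
        remaining = nThreads;
        tbb::empty_task& root =
            *new(tbb::task::allocate_root()) tbb::empty_task;
        root.set_ref_count(nThreads + 1);       // +1 for wait_for_all
        for (int i = 0; i < nThreads; ++i)
            root.spawn(*new(root.allocate_child()) SyncTask(f, a, &remaining));
        root.wait_for_all();                    // steal until all are done
        tbb::task::destroy(root);
    }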
Through work stealing they get locked into exactly one task per thread in the thread pool, because once a thread is executing one of these tasks, it can't do anything else. It's given an atomic variable to fetch and decrement, and it's waiting for that variable to be reduced to zero. Imagine, let's say, one thread in the thread pool gets all of these tasks on its queue. It executes the first one and it can't finish. Meanwhile, the other threads in the thread pool are not being given other work to do, so they're eventually going to need to steal work. They'll eventually steal from this thread, pulling one of the uncompleted tasks off of its queue. That basically ensures that every thread gets exactly one until they're all executed, and then the busy loop finally falls out and you exit. >>: [inaudible] generating a lot more work. Is there a -- are you doomed to wait for a long time? >> Brad Werth: The question was: what if the background task is generating a lot of work -- are you doomed to wait on it? >>: [inaudible] something that you want that's, you know, kind of low priority and -- but it has a lot of work, are you just going to wait until that [inaudible]. >> Brad Werth: The low-priority example is actually the next pattern I get to. This case is not the same as that. This is saying, you know, I need to do something either at the beginning of the initialization of the program to coordinate with some middleware that needs some thread-specific data, or maybe it's something you have to do every frame, although that can be quite inefficient for the reasons I just described. But low-priority operations are a different thing. So this is useful. It solves the problem, but it is not an efficient operation. If you're doing a synchronized call in the middle of a bunch of other work, you've got to wait for all of the threads in the thread pool to flush and complete all that other work until all of them sort of synchronize up on this one operation, and then they all can move on. You don't want to do this in the middle of a frame. That said, if you did it as the first bit of work at the beginning of every frame, if you needed to, there's really no performance penalty for that. Okay. Long, low-priority operations. That was my interpretation of the question you were just asking. These get used in games a fair amount, for several different reasons: loading, sound, textures, AI pathfinding, et cetera. The traditional method for this, even long before we had real parallel hardware, was to say, okay, this computation is going to go on a while, I'm going to get results out of it every so often, I just need you to leave it running at some priority, so I'm going to make a dedicated thread that does this. That works. But almost by definition it introduces oversubscription into the system. There was a way this was done before threading paradigms were broadly available, and that is as time slices. A time-sliced algorithm of course is one where you define it such that you can run for a period of time, get your partial result and return out, and then you can call it again next frame, run for a period of time, get your partial result, et cetera, and keep doing that. If the same operations that today are being given dedicated threads can be rethought of as time-sliced algorithms, then you can do a task-based approach to this. Which might look something like this.
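One way the respawn step could look, sketched in the same API; do_one_time_slice and every other name here are hypothetical, standing in for one bounded slice of the long operation:

    #include "tbb/task.h"
    #include "tbb/atomic.h"

    void do_one_time_slice();           // hypothetical: runs one bounded slice

    tbb::atomic<int> gRequestPending;   // nonzero while a slice is in flight

    class LowPriorityTask : public tbb::task {
    public:
        tbb::task* execute() {
            do_one_time_slice();        // run one slice, then stop
            gRequestPending = 0;        // re-arm: allow the next request
            return NULL;
        }
    };

    // Called from normal work; 'lowRoot' is the long-lived secondary root
    // (the "dumping ground" tree) that stealing threads rarely visit.
    void maybe_kick_low_priority(tbb::task& lowRoot) {
        // compare_and_swap returns the old value, so only one caller wins.
        if (gRequestPending.compare_and_swap(1, 0) == 0) {
            tbb::task::spawn(*new(
                tbb::task::allocate_additional_child_of(lowRoot))
                LowPriorityTask());
        }
    }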
Imagine, if you will, that these two work trees that you see on the bottom represent, on the left hand side, an existing expanding tree of primary work that the application is doing. The one on the right represents a special sort of dumping ground that we use to put only low-priority tasks on, and because of the implementation details of the scheduler we happen to know that threads that are stealing work are unlikely to go to that tree to pull down tasks. What we can do is, as we are processing our main tree of work, we can check and see if it's time to kick off another low-priority operation. Again, this is done through an atomic variable, which will eventually be reset once the low-priority operation completes. This essentially ensures that there's only one request at any given time in the system. Then we can add that low-priority task back to this preexisting root, spawn it, and get on with what we were doing before. If we don't have any other work to do -- if what is calling here on this diagram, that parent, is actually a leaf -- you're more or less guaranteed to proceed directly to the low-priority task. Because if your work queue has dried up and you just put a task attached somewhere else, you're going to go do that one next. So this works -- I'm going to use finger quotes, air quotes, around this -- this "works," but it's not fantastic. It doesn't give the level of control that most developers want. There are some reasons why this is hard to set up. I sort of have two lines of excuses and then two lines of explanations. My excuses are that a task-based scheduler can't just naively -- well, first of all, it always runs a task to completion. So you can't say I'm going to run this for a while and when I get tired of it go do something else. That's not what a task is. Secondly, low-priority tasks can't be rescheduled immediately following themselves. You can't just say I'm done, let's do it again, let's do it again, because that essentially ties up one thread in the thread pool always doing the same thing. As far as the priority issue, a lot of developers say: I want to have real control over all the different relative priorities of the work on my task queue, and this one I'm just going to call low, and my expectation is that it will be run only when, you know, there isn't other activity available. The problem is the work stealing paradigm used by the Cilk-style schedulers. Imagine a scenario where you've got one thread that has maybe a normal-priority task and then some low-priority work. It finishes that normal-priority task and it starts in on the low-priority work. Elsewhere you've got several other threads in the thread pool that have lots of high-priority work, but they've only done some portion of it, plus some of the existing low-priority tasks. Should that first thread start looking before its queue is empty, looking at all these other threads to say, do you have high-priority work, do you have high-priority work? That sort of loses the benefit of the work stealing scheduler, because now you're essentially synchronizing at the end of every task completion, saying okay, I need to go out and check with all of my other threads, do you have high-priority work, instead of just running to completion and then stealing work. So there are some challenges with this approach. We think it's worth trying. Directed acyclic graphs. You may be wondering how a work tree can successfully represent a directed acyclic graph. It's not too hard.
This one doesn't have code shown on it, but it's pretty easy to understand. Essentially imagine the root and continuations representing nodes in the original graph. You can see that the nodes in the graph that have no predecessors can be spawned immediately, and as they complete, they can notify nodes further on in the graph, which can then spawn themselves -- which is kind of fun to say, spawn yourself. It functionally performs as a directed acyclic graph with some modest amount of overhead -- hopefully not too much, as long as all those "more work" sections are substantial. Directed acyclic graphs: again, it's a technique that gets the job done. One difference between your conception of a directed acyclic graph and this one is that these trees -- or these graphs -- are actually destroyed by waiting on them. They're not semi-permanent constructions that you reuse from frame to frame. If you want to do that, you could follow a similar approach to the one we used in the promises pattern, and you could embed the individual nodes of the graph inside other objects that are persistent. And of course your task scheduler might be graph-based to begin with, which would make this trivial. And here's the URL for the code again, if you want to see some code. So, happy ending. We had our particle system. We're more confident with that now. We know, okay, if we've got a big loop like that, we can break that into tasks using automatic methods like parallel_for. It's all going to go into the task-scheduler-controlled thread pool. That sounds good to me. Likewise, if we have arbitrary asynchronous work, we can dispatch it into the same task scheduler and therefore the same thread pool using callbacks or promises. Our physics middleware can dispatch work to us, as long as our threads have been properly prepared -- and I know this sounds like a contrived example, but the reason why I bring it up is that this is exactly what Havok physics does. It attempts to solve this problem by allowing you to indicate that your threads are Havok-aware threads, and then it can start sending work directly into your system. And that can be done with synchronized calls. Long, low-priority operations can be time sliced, broken up into pieces, and put into the job system. Sound may be a poor example here, because one of the reasons why sound is often on a dedicated thread is so that it can be kept at very high priority. So if I had been thinking a little bit more, I would have changed that sound thread to level loading or something, just to give you a sense of what a truly low-priority operation would be. And of course random kernels within the game, such as a directed acyclic graph approach for doing bone animation -- those tasks can be shunted off into the same task scheduler and therefore the same thread pool. So this is the message we try and leave with developers when we talk to them. Task-based parallelism is going to scale better on the different architectures that game developers are targeting. You can break loops into tasks for the maximum benefit, but then you can also use tasks to implement all these other sort of one-off approaches that are typically used in games. And that is my final slide. >>: How would you use the system to synchronize access to different data structures? [inaudible]. >> Brad Werth: So the question was, how does this deal with the issue of synchronizing access to data? The answer is, it doesn't.
This sort of presumes that there's a parallel approach already available, or already defined by the developer, where they've introduced the locks for the data copying and various other things that are necessary. But their challenge now is integrating different regions of code that use these techniques so that they use the same thread resources effectively. We've done some other talks on that sort of thing. We do it actually at GDC every year as a day-long tutorial. That's just a different subject area. And it's obviously very tricky. But this presentation is pointing out that even once you solve that problem, there's a second-order problem that needs addressing. >>: Any thoughts on the [inaudible]. >> Brad Werth: The question is about hyperobjects and Cilk. Again, you're presuming that I have knowledge of Cilk that I don't have. I'm not sure what the hyperobjects do. Let's talk afterwards and you can educate me. >>: Did you experience, like, effects of cache locality issues? If you do a lot of different things in the same task scheduler, are you running a risk that, you know, data has to [inaudible]. >> Brad Werth: The question is, does this exacerbate cache locality problems by doing lots of different work in the same threads. My answer to that would be: it may, but no worse than having oversubscribed threads that are going to be swapped in and out at the OS level, which are then going to have the same cache flushing behaviors. It's no worse, as best we can tell. >>: So a lot of this is [inaudible] synchronization aspect. And did you see any problems with making sure that the access to data is [inaudible] you know with dependencies between tasks, how you [inaudible] like the game logic may depend on results from a collision detect which may depend on results [inaudible]. >> Brad Werth: Yes. The question is again about data dependencies and how the tasks can help enforce that, maybe. The reality is that that is a problem that has to be solved before getting to this point. The best ideas we've had on this are typified in a program we put together called Smoke, which was a very scalable game-like architecture for, at the time, 8-core hardware, which we feel can adapt arbitrarily. The approach taken there, to summarize it very briefly, is that instead of saying there is one object that has to be duplicated across all these subsystems -- physics, AI, you know, in this case procedural fire and various other things -- in order to run concurrently, we said let's take the components of that object and give ownership of each section to the subsystem that has write control, as in, I can write the data to that subset. So your physics system owns the position and orientation, your AI system owns the goals, and so on. And so there is no single object anymore, there are only these components, and the individual subsystems are communicating with each other via a publish-and-subscribe method that happens once at the beginning of every frame, to get read-only access to updates of the things that they care about for the duration of the frame. It seems to work okay. But again, it's very data-heavy, and that is a big deal on consoles, a big no-no. So console developers often need to do other shortcuts to solve specific problems, these one-off problems, as opposed to a comprehensive approach like I just described. >> Jim Larus: Thank you very much. >> Brad Werth: Thank you. [applause]. >> David Stern: My name is David Stern.
I'm from Microsoft Research in Cambridge, and I'm going to talk about a bunch of different applications of parallelism in machine learning. For the first third of the talk, I'm going to talk about a fun application, which is computer Go, and hopefully tie that in with the rest of the talk, which is on parallel message passing; message passing is a way of doing efficient inference in graphical models. So this is a screen shot from a video game we're actually working on at the moment. It's an Xbox Go game. What Go is, is an ancient Chinese game about 4,000 years old. It's currently played by about 16 million people across the world, mostly in Asia. It's a standard two-player game, black and white, and they use a board which is a 19 by 19 grid. They take turns to place stones on the vertices of this grid. Once a stone is placed it is not moved, but it can be captured, and if you have a chain of stones on the board which are connected together, then you capture them all together. And the aim of the game is to make territory on the board. So here white has the territory on the left of the board and black has the territory on the right. And the reason that Go is interesting to researchers is because it's difficult to produce a strong Go player. So, you know, Chess was to all intents and purposes defeated by computers back in '97 when Kasparov was beaten, but still the best Go programs can't beat strong amateur Go players. And the reason for this is twofold. In order to do the brute force search which is so successful for Chess, you need to do a look-ahead and then evaluate the resulting positions. The branching factor, the number of legal moves in Go, is much higher than Chess, so that makes the look-ahead less efficient. But more importantly, the evaluation of positions in Go is much more complex, because all of the pieces that each player has are the same; you can't just do things like use the inherent point value of the pieces to estimate the value of each player's position. So back in about 1995, someone proposed a solution to the problem of the difficulty of evaluating Go positions. And that was this: if you have a given position, you can stochastically estimate its value by actually simulating a random game forward from that position until the very end of the game. And at the very end of the game you can score it -- you know who owns which part of the board, so you know who won, and you can use that information to evaluate your initial position. So that was just sort of a thing of interest to researchers. But the big breakthrough that was made a couple of years ago in computer Go, about 2006, was that you could actually bootstrap the policy you use to play these stochastic games towards stronger and stronger games based on what you've seen in previous games, and actually produce much stronger play. So this diagram kind of summarizes the situation after you've played three of these random games, which we call rollouts. Each circle in this diagram is a board position and each arrow is a move, taking the game from one board position to the next. And these squares just indicate the outcome of that rollout. So here the rollout on the left was a loss, and the other two were wins. And as you play more random games, you start to see positions high up in this tree more than once. And you play more random games and you can see that you build up a tree.
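For reference, the standard selection rule behind the algorithm he names in a moment (Kocsis and Szepesvári's UCT) picks, at each node, the child move

\[
a^{*} \;=\; \arg\max_{a}\;\Bigl(\frac{w_a}{n_a} \;+\; c\,\sqrt{\frac{\ln N}{n_a}}\,\Bigr),
\]

where \(w_a\) is the number of winning rollouts through child \(a\), \(n_a\) is the number of rollouts through it, \(N\) is the number through the parent, and \(c\) is an exploration constant. This is the textbook formulation; the exact variant used in the system described here isn't specified in the talk.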
And the idea is to store this tree in memory, or at least part of it in memory, and based on the statistics of outcomes that you've seen from the games that have been played that passed through these positions, you can decide how valuable these positions are and bootstrap the policy towards stronger games. So for example, the node that's highlighted here has been seen three times -- it's been seen in these three rollouts -- and two of them were wins. So we know something about how valuable that position is. And as we play more rollouts we bootstrap the policy towards playing stronger games. >>: [inaudible] down that graph? >> David Stern: That's correct. Yes. >>: Random moves or are you just using a weaker player? >> David Stern: So it starts -- so you can do it with -- you can start out with completely random moves. We actually use some pattern matching to make them slightly non-random. And then as you play more, you're becoming less random, because once you learn which positions are good then you tend to focus more on playing moves leading to stronger positions. >>: To generate some random moves for the Monte Carlo part, for the parts that you haven't seen before? >> David Stern: Correct. >>: So as I said, are you doing that completely randomly, or couldn't you just plug in a weaker Go [inaudible]. >> David Stern: This is a whole subject of research, what makes a good Monte Carlo rollout. It turns out that it's not necessarily true that stronger play corresponds to a good rollout. What you want is something that gives you a calibrated evaluation, which might not necessarily be the same thing. >>: [inaudible] to the same board position? >> David Stern: Yes. That will happen occasionally. That's called a transposition. Because this algorithm which is used to decide which moves to play tends to explore a very small, focused part of the tree, it's not actually a very important issue for this type of search. So the algorithm which is actually used to decide which move to make at each point is called UCT, which stands for upper confidence bounds applied to trees. And it has convergence guarantees. So if you have an infinitely fast computer you will play perfect Go. Obviously you don't, but the much more important thing is you get a smooth improvement in play the more computer time you give the algorithm. And you can stop at any point when you run out of computer time and go with your current estimate. >>: [inaudible]. >> David Stern: This is simulations per decision you're wanting to make. >>: [inaudible]. >> David Stern: So this is -- I'm in a [inaudible] position, I'm trying to decide what move to make; this is the number of simulations I make. >>: [inaudible]. >> David Stern: So it's potentially a huge amount of work here. >>: [inaudible]. >> David Stern: In a game, on a small-size board, maybe sort of 60 or something like that; on the full-size boards about a hundred -- 230 on average, something like that. So let's just show the win rate against some fixed opponent as you increase computational effort. Okay. This is something which obviously lends itself to some parallelism. You can simply run these rollouts in parallel. And in some cases this is just pretty much for free. You have two completely independent computations.
The only issue is that when you're updating what you're storing about these positions, you have a conflict in the root node here, because you just want to make sure you don't write at the same time, so you have to use some sort of locking mechanism to prevent you writing something invalid. More interesting is when the rollouts that you're doing in parallel tend to explore the same part of the tree. Then you might have more conflicts. But more importantly, you're actually making the decision about what move to make possibly based on stale information, because if you're doing these two rollouts at the same time, you're obviously being less information-efficient than if you were to do them sequentially, because you can't use the result of the previous one to inform what's the best move to make in the second one. So there's something kind of interesting going on here, which is that we have an algorithm where you can change the order in which you perform the computations and it doesn't break -- it becomes less efficient, but it doesn't break. And it turns out that the speedup advantage of being able to parallelize it outweighs this inefficient use of information. And I think this is going to be a sort of general theme of this talk. And just to prove that it does help, this just shows the number of processors against win rate against a fixed opponent, for a fixed one second of thinking time to make a move. Okay. So that's the section on -- >>: [inaudible]. >> David Stern: That one was on a PC. But on the Xbox you just have three cores. Okay. So now I'm going to talk about message passing on factor graphs, which is the general framework for doing inference in large-scale probabilistic models. So a factor graph is a bipartite graph which represents the factorization structure of a function. It has two types of node. You have square nodes, which represent factors in the function, and you have circle nodes, which represent variables in the function. And the edges of the graph show on which variables each factor depends. And the type of question we want to answer with factor graphs of probabilistic models is: what are the marginals of the function -- that is, the value with some of the variables summed out? So just to give an example, take a simple case where we have some data which is given by Y, and we have some parameter S, and we have some model which we believe explains our data, P of Y given S, and we have some prior distribution on the value of the parameter S. You can represent that by this factor graph, and the product of these factors corresponds to Bayes' law, which we need to compute in order to determine the posterior distribution of S. Now the reason that factor graphs are useful is they allow us to express more complex models yet still perform inference efficiently. So imagine that in order to explain our data we now believe that it's easier to break down our model into a set of parts and introduce some intermediate variables. For example, we might think there are some variables which we call T, which depend on the S variables, our parameters. There's some other variable called D which depends on our T variables, according to the statistical relationship P of D given T1 and T2. And finally our data depends on the D variable through P of Y given D.
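Written out -- this is one plausible reading of the example on the slide, with each T assumed to depend on one S -- the factor graph encodes the factorization

\[
p(Y, D, T_1, T_2, S_1, S_2) \;=\; p(S_1)\,p(S_2)\,p(T_1 \mid S_1)\,p(T_2 \mid S_2)\,p(D \mid T_1, T_2)\,p(Y \mid D),
\]

and the quantity of interest is the marginal obtained by summing out the latent variables:

\[
p(S_1 \mid Y) \;\propto\; \sum_{S_2}\sum_{T_1}\sum_{T_2}\sum_{D}\; p(Y, D, T_1, T_2, S_1, S_2).
\]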
But we're not actually interested in the value of any of these intermediate latent variables, the D and the T variables. All we care about is the posterior distribution of our S variables. So in order to get the distribution we want, we have to sum out the values of the variables we're not interested in. So that's the computation we have to perform. And message passing is simply used to refer to the fact that this computation can be broken down into local components in this graph. So the arrows in the graph on the right correspond to parts of the equation on the left. And at the end of the day, once we've performed message passing, we calculate the marginal distributions of the parameters we're interested in by multiplying all of the incoming messages into that variable. So I mean, there's no time to give a proper tutorial on this, but I think the key message is that factor graphs reveal computation structure based on statistical dependencies, and that the messages are results of parts of this computation which can be localized. And we're going to talk about how this means they can be parallelized. And one thing that we make a lot of use of in this work is Infer.NET, which is a library that's been built at the lab in Cambridge for performing these message computations. So now I'm going to talk a bit about ways of parallelizing message passing in general. Let's imagine we have another situation. We've got a set of variables S1 to S6. We have independent prior distributions on these variables, represented by those factors. And we have some data, which are the Y variables. And each of these data points depends on some of these parameters. So for example, Y1 depends on S1 and S3 according to P of Y1 given S1 and S3. Okay. So an obvious thing we could do would just be to split the data into two parallel streams and process the messages corresponding to those data streams separately, and sometimes that will work very well. And other times we may get conflicts, and in many cases we can just deal with this using standard locking types of techniques. But sometimes we have models which are very dense, and there are some variables on which all of the data will depend, and parallelization will be very inefficient. So the way that we deal with that is to actually duplicate the shared variables -- the latent variables of the model -- a number of times, depending on how many parallel data streams we want to process, and divide the data up between those parallel streams. And we may be able to do this in such a way that we can reduce conflicts a lot. So now we can completely independently process message passing in this part of the graph sequentially, and in this part of the graph sequentially, and then at the end do a reduce operation to multiply all of the messages from these copies of the model together to get the final result. So there are sort of two overall methods. There's a locking method, where you just run the message updates in parallel and you use locks to make sure you avoid conflicts. Or you can use this cloning method, where you duplicate the model in order to avoid conflicts. And this gives us better scalability, and it's easier to work across machine boundaries, but it may lead to slower convergence and uses more memory. So the best example application of this cloning technique -- yes? >>: [inaudible]. >> David Stern: Yes. So actually this is sort of a slightly old slide.
So often we want to just pass through the data a single time. If we pass through the data a single time and just combine all the results at the end, then you will get a worse approximation by splitting it up in this way, because you're making less efficient use of information. But actually in the example I'm just about to give, you iterate the whole thing -- so you do message passing here, message passing there, combine the results, then farm out those results and iterate -- and then actually you avoid any additional approximation. You just have slightly slower convergence, which is offset by the fact that you're running them in parallel. So this is an application in biology. The model is a very simple factor analysis model. You have quite high-dimensional data, 50,000-dimensional data, which is represented by this blue vector, and you assume that it's generated by multiplying some relatively small, say 10- or 20-dimensional, vector of factor activations with a mixing matrix. And this model, in order to run in Infer.NET, would require about 22 gigabytes of memory and several days to run. So I mean, it could be done, but it's not ideal if you want to do lots of experimentation. So the solution was to parallelize this using the cloning method: you split the data into a number of separate chunks, you duplicate the model for each of those chunks of the data, and then you can do your message passing independently on those chunks, reduce -- multiply all of these messages to get the master copy of the marginals of the W matrix -- then distribute out the results of that, and you can iterate until convergence. Run on an 8-node, 64-core cluster, the model took two hours using about three gigs per machine. And the way this was implemented was using managed code, using F# and MPI.NET, which is the managed [inaudible] for Microsoft MPI running on Windows HPC. And these two aren't so interesting, but I think the key point here is that there's only one MPI command which is needed to actually parallelize this algorithm, and that's this all-reduce command, which takes a combining operation -- here it's doing an elementwise multiplication of arrays -- and then, from each processor's point of view, it's multiplying in that processor's contribution, which is the messages from its chunk, and then farming out the result to all of the processors. And the next example is a recommender system called Matchbox, which we've developed in our group. The idea here is we want to be able to do large-scale personalized recommendations to users of a Web service -- different types of things: items, services, Web pages -- and we want to be able to do this based on feedback information that users have given about items, so they might like some items and not like other items. And maybe we get their feedback implicitly from the things they click on on Web pages, that type of thing. So the model for this is actually fairly similar to the factor analysis model. Here we assume that for each user we have a set of features -- so for example, they're male and they're British -- and each of those features is associated with some latent weights, which here I call U11 and U21. And by adding those weights together we generate something which we call a trait for that user. And we can repeat this structure a number of times to generate a set of traits for the user. So we have a set of linear models for the user which gives us the vector of traits for that user. And we could do the same thing for items.
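Going back for a moment to the all-reduce in the factor-analysis example: the talk used MPI.NET from F#, but the single collective it relies on is easy to picture in the plain C MPI API. A sketch, with the elementwise product supplied as a user-defined reduction (all names here are mine; the real message arrays would carry Gaussian message parameters rather than raw doubles):

    #include <mpi.h>

    // User-defined reduction: elementwise product of two message arrays.
    void multiply_op(void* in, void* inout, int* len, MPI_Datatype*) {
        double* a = static_cast<double*>(in);
        double* b = static_cast<double*>(inout);
        for (int i = 0; i < *len; ++i) b[i] *= a[i];
    }

    // Each rank holds the messages computed from its chunk of the data;
    // after the call, every rank holds the product over all chunks.
    void combine_messages(double* local, double* global, int n) {
        MPI_Op op;
        MPI_Op_create(&multiply_op, /*commute=*/1, &op);
        MPI_Allreduce(local, global, n, MPI_DOUBLE, op, MPI_COMM_WORLD);
        MPI_Op_free(&op);
    }

For a plain elementwise product the built-in MPI_PROD would also do; the user-defined op is shown only to mirror the combining operation the talk says is handed to the all-reduce.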
So for an item -- a camera, say -- we have a set of linear models which combine the trait contributions from each of those features to generate a vector of traits. So we have a vector of traits for users and a vector of traits for items. And the model is that the rating, or the value of this item for this user, is given by the inner product between those vectors. And if we have prior distributions -- here we use Gaussian prior distributions -- on the values of all of these latent variables, the U variables and the V variables, then we can make predictions about what items the user will like using message passing, and if we have data we can update our beliefs in the light of that data, again using message passing. So we ran this on a couple of standard data sets, and it achieves some pretty good performance. I think the most important thing for this talk is that again we can use the same method to parallelize it. We get it to be about four times faster on 8 cores, which -- >>: [inaudible] traits of the person and the traits of the item, it's not a function of similarity between other people who have rated similar things. >> David Stern: So it learns a mapping of the users and items into a latent trait space based on what it's seen in the past. And that mapping is given by this set of linear models for each user. >>: A small number of parameters? So it's not a lot of parameters? >> David Stern: The number of parameters will be the number of features times the dimensionality of the trait vector. So you might potentially have many millions of features; one feature which is actually included will be the identity of the user, so that will be a vector with a one in one place and zeros everyplace else. >>: [inaudible] everybody simultaneously. >> David Stern: Yes. And I think the main advantage of actually parallelizing the training of this model for us is that it gives us increased agility for experimentation. It's very useful to be able to run lots of experiments with this type of model -- experiment with different features and then try something else -- and being able to run it in two hours rather than eight hours makes a big difference there. The final example is the one which is potentially the most interesting for the business, and that's AdPredictor. AdPredictor is a method of predicting the probability of click on ads in paid search. So this shows a results page for Live Search, and these are the ads. And the advertisers have bid on this particular keyword which has been searched for -- Seattle -- in order to try and participate on this page. And when we place the ads on the page, we place them according to the expected revenue if a user were to click on that ad. And to get that expected revenue, you have to multiply the bid, which is this, by the probability of click. So you need a way of estimating the probability that a user will click on this ad. So we display according to expected revenue. And we also charge using these estimated probabilities of click: we charge the advertiser such that they would just maintain their order in this display ranking. It's called a second-price auction. So there are advantages to improving our estimates of the probability of click. We can increase user satisfaction by targeting better, charge advertisers more fairly, and increase revenue by actually showing ads which are going to get clicked on more. And we have to do it at a rate of around 3,000 impressions per second at the moment.
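In symbols (my notation; the talk states the rule only qualitatively): with \(b_i\) the bid and \(p_i\) the estimated probability of click for the ad in position \(i\), ads are ranked by expected revenue \(b_i\,p_i\), and the standard generalized second-price charge -- the smallest per-click amount that would keep the ad's position -- is

\[
\text{cost-per-click}_i \;=\; \frac{b_{i+1}\,p_{i+1}}{p_i}.
\]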
So basically the faster we can do this, the more data we can use and the better predictions we can make. The model is actually very straightforward. It's just a linear model based on a set of features. So for example, you might have the IP of the client, or a match type, which is a feature which takes into account how well the query matches the keywords. And the score is a linear combination of the values that the active features contribute to the probability of click. So the model is a linear model on a set of weights, which I call W1 and W2, for the active features. You have Gaussian priors on the values of these weights, and once you've summed up the values of the weights which are present for this user on this Web page, you then push the sum through a probit function, which converts it from being a number on the real line to being in [0, 1], so it's a probability. And we can use message passing to update our parameters in the light of the observations of clicks or no clicks. And it gives calibrated predictions, which is important. So if you look at a bucket of impressions for which we predict a particular probability of click, the actual fraction of those impressions which were clicked on corresponds to the predicted probability of click. And again, we use this cloning method of parallelization. And here, this is where the accuracy question comes in, because AdPredictor needs to do it in a single pass through the data so it can be trained online. So as you increase the number of processors, you slightly reduce the accuracy, but that will be outweighed by the extra amount of training data you can use, and with 8 cores you can get about a 4-times speedup again. So that's all of the applications. I presented some work on parallel Monte Carlo Go. Monte Carlo Go gives stronger play the more computer time is available, so parallelism can be easily exploited there. And I talked about a number of applications of parallel message passing on factor graphs, where messages are subcomputations of inference which can be distributed. I talked about two methods for actually parallelizing: locking and cloning. And applications in biology and online advertising. And I think the summary message of the talk is that these methods for doing inference on probabilistic models -- both the Monte Carlo methods and the message passing -- are robust to changing the order of computations. So if you have multiple cores, the optimal schedule for computation might actually be different from the optimal schedule if you have just a single core, and everything still works. So that's it. Does anyone have any questions? >> Jim Larus: Let's thank the speaker. [applause]